The Iterated Prisoners’ Dilemma
20 Years On
ADVANCES IN NATURAL COMPUTATION
Series Editor: Xin Yao (University of Birmingham, UK)
Assoc. Editors: Hans-Paul Schwefel (University of Dortmund, Germany), Byoung-Tak Zhang (Seoul National University, South Korea), Martyn Amos (University of Liverpool, UK)
Published
Vol. 1: Applications of Multi-Objective Evolutionary Algorithms
Eds: Carlos A. Coello Coello (CINVESTAV-IPN, Mexico) and Gary B. Lamont (Air Force Institute of Technology, USA)
Vol. 2: Recent Advances in Simulated Evolution and Learning
Eds: Kay Chen Tan (National University of Singapore, Singapore), Meng Hiot Lim (Nanyang Technological University, Singapore), Xin Yao (University of Birmingham, UK) and Lipo Wang (Nanyang Technological University, Singapore)
Vol. 3: Recent Advances in Artificial Life
Eds: H. A. Abbass (University of New South Wales, Australia), T. Bossomaier (Charles Sturt University, Australia) and J. Wiles (The University of Queensland, Australia)
Vol. 4: The Iterated Prisoners' Dilemma
Eds: Graham Kendall (The University of Nottingham, UK) and Xin Yao (The University of Birmingham, UK)
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI
World Scientific
Advances in Natural Computation — Vol. 4
Graham Kendall
Xin Yao
Siang Yew Chong
The Iterated Prisoners’ Dilemma
20 Years On
The University of Nottingham, UK
The University of Birmingham, UK
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN-13: 978-981-270-697-3
ISBN-10: 981-270-697-6
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
Copyright © 2007 by World Scientific Publishing Co. Pte. Ltd.
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
Printed in Singapore.
Advances in Natural Computation — Vol. 4
THE ITERATED PRISONERS' DILEMMA
20 Years On
Contents

List of Contributors

Chapter 1  The Iterated Prisoner's Dilemma: 20 Years On
           Siang Yew Chong, Jan Humble, Graham Kendall, Jiawei Li and Xin Yao

Chapter 2  Iterated Prisoner's Dilemma and Evolutionary Game Theory
           Siang Yew Chong, Jan Humble, Graham Kendall, Jiawei Li and Xin Yao

Chapter 3  Learning IPD Strategies Through Co-evolution
           Siang Yew Chong, Jan Humble, Graham Kendall, Jiawei Li and Xin Yao

Chapter 4  How to Design a Strategy to Win an IPD Tournament
           Jiawei Li

Chapter 5  An Immune Adaptive Agent for the Iterated Prisoner's Dilemma
           Oscar Alonso and Fernando Nino

Chapter 6  Exponential Smoothed Tit-for-Tat
           Michael Filzmoser

Chapter 7  Opponent Modelling, Evolution, and The Iterated Prisoner's Dilemma
           Philip Hingston, Dan Dyer, Luigi Barone, Tim French and Graham Kendall

Chapter 8  On Some Winning Strategies for the Iterated Prisoner's Dilemma
           Wolfgang Slany and Wolfgang Kienreich

Chapter 9  Error-Correcting Codes for Team Coordination within a Noisy Iterated Prisoner's Dilemma Tournament
           Alex Rogers, Rajdeep K. Dash, Sarvapali D. Ramchurn, Perukrishnen Vytelingum and Nicholas R. Jennings

Chapter 10 Is it Accidental or Intentional? A Symbolic Approach to the Noisy Iterated Prisoner's Dilemma
           Tsz-Chiu Au and Dana Nau
List of Contributors
Oscar Alonso,
Computer Systems and Industrial Engineering Department,
National University of Colombia, Bogota
Colombia
Email: [email protected]
Tsz-Chiu Au,
Department of Computer Science and Institute for Systems Research,
University of Maryland,
College Park, MD 20742
USA
Email: [email protected]
Luigi Barone,
Department of Computer Science and Software Engineering,
The University of Western Australia,
35 Stirling Highway,
Crawley, WA, 6009
Australia
Email: [email protected]
Siang Yew Chong,
School of Computer Science,
University of Birmingham,
Birmingham, B15 2TT
UK
Email: [email protected]
Rajdeep K. Dash,
Electronics and Computer Science,
University of Southampton,
Southampton, SO17 1BJ
UK
Email: [email protected]
Dan Dyer,
Department of Computer Science and Software Engineering,
The University of Western Australia,
35 Stirling Highway,
Crawley, WA, 6009
Australia
Email: [email protected]
Michael Filzmoser,
School of Business Administration,
Economics, and Statistics,
University of Vienna,
Vienna, A-1210
Austria
Email: [email protected]
Tim French,
Department of Computer Science and Software Engineering,
The University of Western Australia,
35 Stirling Highway,
Crawley, WA, 6009
Australia
Email: [email protected]
Philip Hingston,
School of Computer and Information Science,
Edith Cowan University - Mt Lawley Campus,
2 Bradford Street,
Mt Lawley, WA 6050
Australia
Email: [email protected]
Jan Humble,
School of Computer Science and Information Technology,
University of Nottingham,
Nottingham, NG8 1BB
UK
Email: [email protected]
Nicholas R. Jennings,
Electronics and Computer Science,
University of Southampton,
Southampton, SO17 1BJ,
UK
Email: [email protected]
Graham Kendall,
School of Computer Science and Information Technology,
University of Nottingham,
Nottingham, NG8 1BB
UK
Email: [email protected]
Wolfgang Kienreich,
Know-Center, Inffeldgasse 21a/II,
8010 Graz
Austria
Email: [email protected]
Jiawei Li,
Robot Institute,
Harbin Institute of Technology,
Heilongjiang, 150001,
P. R. China
Email: lijiawei [email protected]
Dana Nau,
Department of Computer Science and Institute for Systems Research,
University of Maryland,
College Park, MD 20742
USA
Email: [email protected]
Fernando Nino,
Computer Systems and Industrial Engineering Department,
National University of Colombia, Bogota
Colombia
Email: [email protected]
Sarvapali D. Ramchurn,
Electronics and Computer Science,
University of Southampton,
Southampton, SO17 1BJ
UK
Email: [email protected]
Alex Rogers,
Electronics and Computer Science,
University of Southampton,
Southampton, SO17 1BJ
UK
Email: [email protected]
Wolfgang Slany,
Institut für Softwaretechnologie,
Inffeldgasse 16b/II,
TU Graz, A-8010 Graz
Austria
Email: [email protected]
Perukrishnen Vytelingum,
Electronics and Computer Science,
University of Southampton,
Southampton, SO17 1BJ
UK
Email: [email protected]
Xin Yao,
School of Computer Science,
University of Birmingham,
Birmingham, B15 2TT
UK
Email: [email protected]
Chapter 1
The Iterated Prisoner’s Dilemma: 20 Years On
Siang Yew Chong1, Jan Humble2, Graham Kendall2, Jiawei Li2,3,
Xin Yao1
University of Birmingham1, University of Nottingham2, Harbin Institute
of Technology3
1.1. Introduction
In 1984, Robert Axelrod reported the results of two iterated prisoner’s
dilemma (IPD) competitions [Axelrod (1984)]. The book was to be a
catalyst for much of the research in this area since that time. It is unlikely
that you could write a scientific paper about the IPD without citing Axelrod's
1984 book. The book is all the more remarkable in that it is just as accessible
to a general audience as it is an important source of inspiration
for the scientific community.
In 2001, whilst attending the Congress on Evolutionary Computation
(CEC) conference, we were discussing some of the presentations we had
seen, which reported some of the latest work on the iterated prisoner's
dilemma. We were paying tribute to the fact that Axelrod’s book had stood
the test of time when somebody made a casual comment suggesting that we
should re-run the competition in 2004, to celebrate the 20th anniversary.
And, so, this book was born.
Of course, since the conversation in Hawaii and the publication of this
book, there have been a lot of people doing a lot of work; not least
Robert Axelrod, who was good enough to give up his time to present a
plenary talk at the CEC conference in 2004. In that talk he presented his
latest work, investigating evolution in a grid-based world.
We owe a debt of thanks to the UK's EPSRC (Engineering and Physical
Sciences Research Council), the largest of the UK research councils.
When we returned from Hawaii, we
submitted a proposal, which requested a small amount of funds (£23,718)
in order to re-run, and extend, the competitions that Axelrod had run
20 years earlier (the EPSRC grant reference numbers are GR/S63465/01
and GR/S63472/01). The funds we received from EPSRC allowed us to run two
competitions, one in 2004 and one in 2005. The entrants to the competitions
were invited to submit a chapter for consideration in this book. These
chapters underwent a peer-review process (see later in this chapter for an
acknowledgement of the reviewers) and those chapters that were successful
form the latter part of this book.
As editors, we feel fortunate to have several winning, second-place and
third-place entries reported in this book. This affords the reader the opportu-
nity to learn, first hand from the authors, what made these strategies so
successful and, perhaps, use some of the ideas and innovations in their own
strategies for future competitions.
1.2. Iterated Prisoner’s Dilemma
Almost every chapter in this book has its own description of the iterated
prisoner's dilemma. As each chapter can be read in isolation, and for
completeness, we present our own interpretation of the IPD here, along with a
short review of some of the important work in the area.
The prisoner's dilemma (PD) and iterated prisoner's dilemma (IPD)
have been a rich source of research material since the 1950s. However, the
publication of Axelrod's book [Axelrod (1984)] in the 1980s was largely
responsible for bringing this research to the attention of other areas outside
of game theory, including evolutionary computing, evolutionary biology,
networked computer systems and promoting cooperation between opposing
countries [Goldstein (1991); Fogel (1993); Axelrod and D'Ambrosio (1995)].
Despite the large literature base that now exists (see, for example, [Poundstone
(1992); Boyd and Lorberbaum (1987); Maynard Smith (1982); Davis
(1997); Axelrod (1997)]), this is an ongoing area of research, with Darwen
and Yao [Darwen and Yao (1995, 2001); Yao and Darwen (1999)] carrying
out some recent work. Their 2001 work [Darwen and Yao (2001)] extends
the prisoner's dilemma by offering more choices, other than simply "cooperate"
or "defect," and by providing indirect interactions (reputation).
When you play the prisoner’s dilemma you have to decide whether to
cooperate with an opponent, or defect. Both you and your opponent make a
aThe EPSRC grant reference numbers are GR/S63465/01 and GR/S63472/01.
choice and then your decisions are revealed. You receive a payoff according
to the following matrix (where, in each cell, the top value is the payoff to
the column player and the bottom value is the payoff to the row player).
                 Cooperate        Defect
  Cooperate      R = 3            T = 5
                 R = 3            S = 0
  Defect         S = 0            P = 1
                 T = 5            P = 1
• R is the Reward for mutual cooperation. Therefore, if both players cooperate
then both receive a reward of 3 points.
• If one player defects and the other cooperates then the defector receives
the Temptation payoff (5 in this case) and the other player (the
cooperator) receives the Sucker payoff (zero in this case).
• If both players defect then they both receive the Punishment for mutual
defection payoff (1 in this case).
The question arises: what should you do in such a game?
• Suppose you think the other player will cooperate. If you cooperate
then you will receive a payoff of 3 for mutual cooperation. If you defect
then you will receive the temptation payoff of 5.
Therefore, if you think the other player will cooperate then you should
defect, to give you a payoff of 5.
• But what if you think the other player will defect? If you cooperate,
then you get the Sucker payoff of zero. If you defect then you would both
receive the Punishment for Mutual Defection of 1 point. Therefore, if
you think the other player will defect, you should defect as well.
So, you should defect, no matter what option your opponent chooses.
Of course, the same logic holds for your opponent. And, if you both defect
you receive a payoff of 1 each, whereas the better outcome would have
been mutual cooperation with a payoff of 3. The payoff for an individual
is less than what could have been achieved by two cooperating players;
hence the dilemma, and the research challenge of finding strategies that
promote mutual cooperation.
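The dominance argument can be checked mechanically. Below is a minimal sketch in Java (the language of the competition software described in the appendix); the class, constants, and method names are ours, purely for illustration.

```java
public class Dominance {
    // Payoff to the row player: [myMove][opponentMove], index 0 = cooperate, 1 = defect.
    static final int[][] PAYOFF = {
        {3, 0},   // I cooperate: R = 3 vs a cooperator, S = 0 vs a defector
        {5, 1}    // I defect:    T = 5 vs a cooperator, P = 1 vs a defector
    };

    public static void main(String[] args) {
        for (int opp = 0; opp <= 1; opp++) {
            int ifCooperate = PAYOFF[0][opp];
            int ifDefect = PAYOFF[1][opp];
            System.out.printf("Opponent %s: cooperate pays %d, defect pays %d -> %s%n",
                    opp == 0 ? "cooperates" : "defects",
                    ifCooperate, ifDefect,
                    ifDefect > ifCooperate ? "defect" : "cooperate");
        }
        // Prints "defect" in both cases: defection strictly dominates cooperation.
    }
}
```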
In defining a prisoner’s dilemma, certain conditions have to hold. The
values we used above, to demonstrate the game, are not the only values
that could have been used, but they do have to adhere to the conditions
listed below.
Firstly, the order of the payoffs is important. The best a player can
do is T (temptation to defect). The worst a player can do is to get the
sucker payoff, S. If the two players cooperate then the reward for that
mutual cooperation, R, should be better than the punishment for mutual
defection, P . Therefore, the following must hold.
T > R > P > S. (1.1)
Secondly, players should not be allowed to get out of the dilemma by
taking it in turns to exploit each other. Or, to be a little more precise, the
players should not play the game so that they end up with half the time
being exploited and the other half of the time exploiting their opponent.
In other words, an even chance of being exploited or doing the exploiting is
not as good an outcome as both players mutually cooperating. Therefore,
the reward for mutual cooperation should be greater than the average of
the payoff for the temptation and the sucker. That is, the following must
hold.
R > (S + T)/2. (1.2)
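Both conditions are easy to verify in code. A minimal sketch follows, with names of our own choosing.

```java
public class PDCheck {
    /** Returns true if (T, R, P, S) define a valid prisoner's dilemma. */
    static boolean isValidPD(double t, double r, double p, double s) {
        boolean ordering = t > r && r > p && p > s;   // condition (1.1): T > R > P > S
        boolean noAlternation = r > (s + t) / 2.0;    // condition (1.2): R > (S + T)/2
        return ordering && noAlternation;
    }

    public static void main(String[] args) {
        System.out.println(isValidPD(5, 3, 1, 0));  // true: the values used above
        System.out.println(isValidPD(5, 2, 1, 0));  // false: violates R > (S + T)/2
    }
}
```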
Playing a "one-shot" prisoner's dilemma, it is not difficult to decide
which strategy to adopt, but the question arises: can cooperation evolve
from playing the game over and over again, against the same opponent?
If you know how many times you are to play, then there is an argument
that the game is exactly the same as playing the "one-shot" prisoner's
dilemma. This is based on the observation that you will defect on the last
iteration, as that is the sensible thing to do when you are, in effect, playing a
single iteration. Knowing this, it is sensible to defect on the second-to-last
iteration as well; and this logic can be applied all the way back to the first
iteration. However, this reasoning cannot be used when the number of iterations
is infinite, as you know there is always another iteration. In practice, this
translates to not knowing when the game will end.
Experiments using human players [Scodel (1962, 1963); Minas et al.
(1960); Scodel and Philburn (1959); Scodel et al. (1959); Scodel et al.
(1960)] showed that they generally did not cooperate, even when it should
have been obvious that the other person was going to cooperate provided
they did too. It has been a long-term aim to find strategies which cause players
to cooperate. If players would only cooperate then their payoff, over an
indefinite number of games, could be maximised, rather than tending towards
defection and hoping the other player would cooperate. In 1979 Axelrod
organised a prisoner’s dilemma competition and invited game theorists to
submit their strategies [Axelrod (1980a)]. Fourteen entries were received,
with an extra one being added (defect or cooperate with equal probability).
Each strategy was played against every other, including itself. The
winner was Anatol Rapoport, who submitted the simple strategy Tit-for-Tat,
which cooperates on the first move and then does whatever its opponent
did on the previous move. In a second tournament [Axelrod (1980b)], 62
entries were received but, again, the winner was Tit-for-Tat. These two
competitions formed the basis of his important book [Axelrod (1984)].
The prisoner's dilemma has a modern-day version in the form of the
TV show "Shafted", a game show recently screened on terrestrial TV in
the UK (note that this show is not a true prisoner's dilemma as defined
by Rapoport [Rapoport (1996)], but it does demonstrate that the ideas have
wider applicability). At the end of the show two contestants have accumulated
a sum of money and they have to decide whether to share the money
or to try to get all the money for themselves. Each decision is made
without knowledge of what the other person has decided to do. If both
contestants cooperate then they share the money. If they both defect then
they both receive nothing. If one cooperates and the other defects, the one
that defected gets all the money and the contestant that cooperated gets
nothing.
Although the prisoner's dilemma, in the context of game theory, has been
an active research area for at least 50 years [Scodel (1962); Scodel (1963);
Minas et al. (1960); Scodel and Philburn (1959); Scodel et al. (1959); Scodel
et al. (1960)] (it can be traced back to von Neumann and Morgenstern
[von Neumann and Morgenstern (1944)] and, of course, John Nash
[Nash (1950, 1953)]), it is still active, with, among other
research aims, researchers trying to evolve strategies [O'Riordan (2000)]
that promote cooperation.
Recent research has also considered the prisoner's dilemma where there
are more than two choices and more than two players. Darwen and Yao have
shown that offering more choices leads to less cooperation [Darwen and Yao
(2001)], although reputation may help [Darwen and Yao (2002); Yao and
Darwen (1999)]. Birk [Birk (1999)] used a multi-player IPD. His model had
continuous degrees of cooperation (as opposed to the binary choice of cooperate
or defect). He used a robotic environment and showed that a justified-snobism
strategy, which tries to cooperate slightly more than the average,
is successful and evolutionarily stable (that is, it cannot be
invaded by another strategy). O'Riordan and Bradish (2000) also simulated
a multi-player game where the players are involved in many types of games.
They show that cooperation can emerge in a high percentage of 2-player
games.
As well as the academic papers on the subject, there are many books
devoted to game theory and/or the prisoner's dilemma. The 1997 book
by Axelrod (1997) reproduces a range of his papers (with commentary)
ranging from 1986 through to 1997. The papers consider areas such as
promoting cooperation using a genetic algorithm, coping with noise and
promoting norms.
1.3. Contents of the Book
This book does not have to be read from cover to cover. Each chapter can
be read independently, with most of the chapters describing the IPD. This
was a conscious decision by the editors as we realised that the book would
be dipped into and we did not want to make any chapter dependent on
any other. Also, each chapter has its own set of references, rather than
having one complete list of references at the end of the book. The book is
structured as follows.
Chapter 1
This chapter provides a general introduction to the book. In keeping with
the rest of the book, we also briefly describe the IPD, as well as briefly
describing each of the other chapters. This chapter also presents the results
of the two competitions that we ran in 2004 and 2005.
Chapter 2
Chapter 2 (“Iterated Prisoner’s Dilemma and Evolutionary Game Theory”)
reviews some of the important work in IPD, with particular emphasis (in
the latter part of the chapter) on evolutionary game theory. The chapter
contains over 250 references, which we hope will be a good starting point
for other researchers who are looking to start work in this area.
We have concentrated on the evolutionary aspects of IPD for two rea-
sons. Firstly, this seemed to be an area that was exploited in the entries
we received. Secondly, the literature on IPD is truly vast (perhaps only
exceeded by literature on the traveling salesman problem), and we had to
draw some boundaries and, given the close links that this competition had
with the Congress on Evolutionary Computation, it seemed appropriate to
report on the evolutionary aspects of IPD.
We apologise to any authors who feel their work should have been in-
cluded in this chapter. We hope you understand that we simply could not
list every paper. However, if you would like to drop us an email, we would
be happy to consider the inclusion of the reference in any later editions.
Chapter 3
Chapter 3 (“Learning IPD Strategies Through Co-evolution”) reviews an-
other area of IPD that has received scientific interest in recent years; that
of co-evolution. This chapter also discusses an extension to the classic IPD
formulation: that is, when there are more than two players and when they
have more than two choices. As in chapter 2, there is an extensive
list of references for the interested reader.
Chapter 4
This chapter reports the winning strategy from competition 4, from the
event held in 2005. This competition mimics the original ones held by
Axelrod. Only one entry was allowed per person, to stop the cooperating
strategies that had dominated the first competition. Although we believe
that having cooperating strategies is a valid tactic, some competitors felt
that this did not truly mimic the original competitions. For this reason we
introduced an additional competition for the 2005 event. The result was a
win for Jiawei Li, who details his winning strategy in chapter 4, entitled
"How to Design a Strategy to Win an IPD Tournament".
Chapter 5
The strategy in this chapter attempts to model its opponent using an ar-
tificial immune system. It is interesting to see how relatively new methodologies
are being applied to problems such as the IPD, demonstrating a continuous
flow of new ideas, some of which may yet prove superior to existing methods.
Whilst the strategy did not appear in the top ten of any of
the competitions it entered, it does present an exciting new research
direction for IPD tournaments.
Chapter 6
Michael Filzmoser reports on a variation of tit-for-tat, which he calls Exponential
Smoothed Tit-for-Tat. Whereas tit-for-tat only considers the last
move of the opponent, exponential smoothed tit-for-tat considers the complete
history of the opponent. The discussion covers the IPD with
noise, as well as the more common IPD in which the players' actions
are reliably reported.
Chapter 7
In chapter 7 ("Opponent Modelling, Evolution, and the Iterated Prisoner's
Dilemma"), the authors explore the idea of modelling an opponent. Their
strategy plays tit-for-tat for the first 50 moves, whilst trying to model the
moves played by the opponent. After 50 moves, subsequent moves are
based on the model that has been built.
It is interesting to compare this strategy (which came 3rd in competition
4 in 2005) with the strategy described in chapter 4, which also uses a type
of modelling but over a shorter time period. Perhaps that shorter period
explains why the chapter 4 strategy was able to achieve better payoffs, as it
could exploit opponents much earlier in the game.
Chapter 8
The strategies reported in this chapter were entered in both the 2004 and
2005 events, and performed well in many of the competitions, winning
competition 1 in the 2005 event.
This chapter, more than any other, touches on the debate about cooperating
strategies, which is why we introduced competition 4 in the 2005
event. If you followed the discussion at the time, many entrants (with some
justification) questioned whether allowing multiple strategies from one person
was in the spirit of the original Axelrod competitions. Whilst we agreed,
and so introduced a single-entry rule in 2005, we would also argue that these
competitions were about the research that was being carried out, and some
of the chapters in this book report on those results. Of course, as the
authors of chapter 8 admit, there are still ways of flouting the rules by
submitting cooperating entries under different names. We hope that the
other entrants will accept this in the spirit of research under which it
was done. As the authors point out, the organisers failed to recognise that
cooperating strategies had been submitted, but, as they also say, this is a
theoretically difficult problem.
We would also like to take this opportunity to apologise to the authors of
chapter 8 for omitting their OTFT strategy from some of the competitions.
It is still unclear to us why this happened.
Chapter 9
A team from Southampton, who took the first three places in competition
1 at the 2004 event, present chapter 9. Their chapter is an excellent
example of how strategies can cooperate. As strategies have no mechanism
to interact directly, the only way to recognise one of your collaborators is
to somehow communicate through the defect/cooperate choices that you
make.
Chapter 10
One of the competitions that we ran included noise, with some low probability.
By noise, we mean that a defect or cooperate signal might be
misinterpreted. This final chapter, by Tsz-Chiu Au and Dana Nau, explores
this issue using a strategy they call the Derived Belief Strategy. It attempts to
model its opponent and then judge whether the opponent's choice has been
affected by noise. The strategy performed very well in the competition, even
when up against strategies which were cooperating.
1.4. Celebrating the 20th Anniversary: The Competitions
We ran two events. The first was held during the Congress on Evolutionary
Computation (CEC) in 2004 (June 19-23, Portland, Oregon, USA) and
the next at the Computational Intelligence and Games Conference in 2005
(April 4-6, 2005, Essex, UK). At the 2004 event we ran three competitions,
with an additional competition being held in 2005.
(1) The first competition aimed to emulate the original Axelrod competition.
We received some enquiries about whether multiple entries were
allowed. As we had not stated this as a restriction, we allowed it (but
did state that we reserved the right to limit the number, otherwise running
the competition might become intractable). At the time, we did not realise the
controversy that this decision would cause, which is why we modified
the competitions in the 2005 event.
(2) The second competition included noise: each decision had a 0.1 probability
of being misinterpreted.
(3) The third competition allowed competitors to submit a strategy to an
IPD that has more than two players and more than two choices, that is,
multi-player and multi-choice.
(4) The fourth competition (which was only run in 2005) emulated the
original Axelrod competition. The definition was exactly the same as
competition 1, but we only allowed one entry per person.
The payoff table we used for competitions 1, 2 and 4 is shown in ta-
ble 1.1. The payoff table for competition 3 is shown in table 1.2.
Table 1.1. Payoff table for all IPD competitions except for the IPD with multiple
players and multiple choices (in each cell, the top value is the column player's payoff
and the bottom value the row player's).

                 Cooperate        Defect
  Cooperate      R = 3            T = 5
                 R = 3            S = 0
  Defect         S = 0            P = 1
                 T = 5            P = 1
Table 1.2. Payoff table for the IPD competition with multiple players and multiple
choices. Rows give Player A's level of cooperation, columns give Player B's, and each
entry is the payoff to Player A.

                        Player B: Levels of Cooperation
                        1       3/4     1/2     1/4     0
  Player A    1         4       3       2       1       0
              3/4       4 1/4   3 1/4   2 1/4   1 1/4   1/4
              1/2       4 1/2   3 1/2   2 1/2   1 1/2   1/2
              1/4       4 3/4   3 3/4   2 3/4   1 3/4   3/4
              0         5       4       3       2       1
To support the competitions, we developed a software framework. This
is discussed in the Appendix, and a URL is supplied so that the software
can be downloaded.
1.5. Competition Results
In the following tables we present the top ten entries from each of
the competitions. The full listings of the results can be seen at
http://www.prisoners-dilemma.com. Also available on the web site is a
log containing all the interactions that took place.
Table 1.3. Results from the 2004 event, competition 1. There were 223 entries (19
web-based entries, 195 Java-based entries and 9 standard entries (RAND, NEG,
ALLC, ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

  Rank  Player                 Strategy                             Won  Drawn  Lost  Total Points
  1     Gopal Ramchurn         StarSN (StarSN)                      105  21     98    117,057
  2     Gopal Ramchurn         StarS (StarS)                        113  48     63    110,611
  3     Gopal Ramchurn         StarSL (StarSL)                      115  46     63    110,511
  4     GRIM (GRIM Trigger) 1  GRIM (GRIM Trigger)                  120  76     28    100,611
  5     Wolfgang Kienreich     OTFT (Omega tit for tat)             90   70     64    100,604
  6     Wolfgang Kienreich     ADEPT (ADEPT Strategy)               95   72     57    96,291
  7     Emp 1                  EMP (Emperor)                        90   73     61    95,927
  8     Bingzhong Wang         ()                                   31   94     99    94,161
  9     Hannes Payer           PRobbary (PRobbary Historylength 2)  95   75     54    94,123
  10    Nanlin Jin             HCO (HCO)                            27   95     102   93,953
Table 1.4. Results from the 2004 event, competition 2. There were 223 entries (19
web-based entries, 195 Java-based entries and 9 standard entries (RAND, NEG, ALLC,
ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

  Rank  Player                 Strategy                                Won  Drawn  Lost  Total Points
  1     Gopal Ramchurn         StarSN (StarSN)                         42   2      180   93,962
  2     Colm O'Riordan         Mem1 (Mem1)                             5    1      218   83,049
  3     Gopal Ramchurn         CoordinateCDCSIAN (CoordinateCDCSIAN)   158  6      60    83,015
  4     Gopal Ramchurn         PoorD (PoorD)                           190  7      27    82,890
  5     Wolfgang Kienreich     OTFT (Omega tit for tat)                158  8      58    82,838
  6     Wayne Davis            ltft (ltft)                             66   8      150   82,765
  7     GRIM (GRIM Trigger) 1  GRIM (GRIM Trigger)                     184  7      33    82,591
  8     Gopal Ramchurn         MooD (MooD)                             193  3      28    82,578
  9     Gopal Ramchurn         AITFT (AITFT)                           60   9      155   82,504
  10    Gopal Ramchurn         GSTFT (GSTFT)                           64   9      151   82,502
Table 1.5. Results from the 2004 event, competition 3. There were 15 entries.
Note that there is only one round in this competition.

  Rank  Player              Strategy                                                                Total Points
  1     Gopal Ramchurn      AgentSoton (SOTON AGENT)                                                3,756
  2     Gopal Ramchurn      HarshTFT (HarshTFT)                                                     3,756
  3     Deirdre Murrihy     PCurvepower1Memory2 (Penalty Curve of 1 using opponent's previous 2 moves)     3,738
  4     Deirdre Murrihy     PCurvepower2Memory2 (Penalty Curve of 2 using opponent's previous 2 moves)     3,738
  5     Deirdre Murrihy     PCurvepower0.5Memory2 (Penalty Curve of 0.5 using opponent's previous 2 moves) 3,738
  6     Enda Howley         PCurvepower2 (Penalty Curve of 2 using opponent's previous move)        3,738
  7     Enda Howley         PCurvepower1 (Penalty Curve of 1 using opponent's previous move)        3,738
  8     Enda Howley         PCurvepower0.5 (Penalty Curve of 0.5 using opponent's previous move)    3,738
  9     Wolfgang Kienreich  CNHM (CosaNostra Hitman)                                                3,738
  10    Wolfgang Kienreich  CNHM (CosaNostra Hitman)                                                3,738
Table 1.6. Results from the 2005 event, competition 1. There were 192 entries (41 web-based
entries, 142 Java-based entries and 9 standard entries (RAND, NEG, ALLC, ALLD,
TFT, STFT, TFTT, GRIM, Pavlov)).

  Rank  Player                   Strategy                                              Won  Drawn  Lost  Total Points
  1     Wolfgang Kienreich       CNGF (CosaNostra Godfather)                           48   96     49    100,905
  2     Jia-wei Li               IMM01 (Intelligent Machine Master 01)                 46   112    35    98,922
  3     Carlos G. Tardon         CLAS- (CLAS-)                                         23   95     75    92,174
  4     Perukrishnen Vytelingum  SWIN (Soton Agent RA - Competition 1)                 61   44     88    90,918
  5     Constantin Ionescu       LORD (the lord strategy)                              20   102    71    87,617
  6     GRIM (GRIM Trigger) 1    GRIM (GRIM Trigger)                                   73   114    6     84,805
  7     Tsz-Chiu Au              LSF (Learning of opponent strategy with forgiveness)  28   94     71    84,698
  8     Tsz-Chiu Au              DBStft (DBS with TFT)                                 23   97     73    83,867
  9     Richard Brunauer         PRobberyL2 (PRobberyL2)                               14   98     81    83,837
  10    Carlos G. Tardon         CLAS2 (CLAS2)                                         72   96     25    83,746
Table 1.7. Results from the 2005 event, competition 2. There were 165 entries (26 web-based
entries, 130 Java-based entries and 9 standard entries (RAND, NEG, ALLC,
ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

  Rank  Player                   Strategy                                       Won  Drawn  Lost  Total Points
  1     Perukrishnen Vytelingum  BWIN (S2Agent1 ZEUS - Competition 2)           85   1      80    73,330
  2     Jia-wei Li               IMM01 (Intelligent Machine Master 01)          108  7      51    70,506
  3     Tsz-Chiu Au              DBSy (DBS (version y))                         35   3      128   68,370
  4     Tsz-Chiu Au              DBSz (DBS (version z))                         27   3      136   68,339
  5     Tsz-Chiu Au              DBSpl (DBS with learning prevention)           37   2      127   67,979
  6     Tsz-Chiu Au              DBSd (Derivative Belief Strategy (version d))  42   6      118   67,392
  7     Tsz-Chiu Au              DBSx (DBS (version x))                         19   9      138   66,719
  8     Tsz-Chiu Au              TFTIc (TFT improved (ver. c))                  41   4      121   66,409
  9     Tsz-Chiu Au              DBSf (Derivative Belief Strategy (version f))  48   2      116   66,269
  10    Tsz-Chiu Au              TFTIm (TFT improved (ver. m))                  38   3      125   66,239
Table 1.8. Results from the 2005 event, competition 3. There were 34 entries.
Note that there is only one round in this competition.

  Rank  Player                   Strategy                                                                Total Points
  1     Perukrishnen Vytelingum  $AgentSoton ($SOTON AGENT)                                              7,558
  2     Deirdre Murrihy          PCurvepower1Memory2 (Penalty Curve of 1 using opponent's previous 2 moves)     7,521
  3     Deirdre Murrihy          PCurvepower2Memory2 (Penalty Curve of 2 using opponent's previous 2 moves)     7,521
  4     Deirdre Murrihy          PCurvepower0.5Memory2 (Penalty Curve of 0.5 using opponent's previous 2 moves) 7,521
  5     Enda Howley              PCurvepower2 (Penalty Curve of 2 using opponent's previous move)        7,521
  6     Enda Howley              PCurvepower1 (Penalty Curve of 1 using opponent's previous move)        7,521
  7     Enda Howley              PCurvepower0.5 (Penalty Curve of 0.5 using opponent's previous move)    7,521
  8     Wolfgang Kienreich       CNHM (CosaNostra Hitman)                                                7,521
  9     Wolfgang Kienreich       CNHM (CosaNostra Hitman)                                                7,521
  10    Wolfgang Kienreich       CNHM (CosaNostra Hitman)                                                7,521
Table 1.9. Results from the 2005 event, competition 4. There were 50 entries (26 web-based
entries, 15 Java-based entries and 9 standard entries (RAND, NEG, ALLC,
ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

  Rank  Player                 Strategy                             Won  Drawn  Lost  Total Points
  1     Jia-wei Li             APavlov (Adaptive Pavlov)            11   34     6     30,096
  2     Wolfgang Kienreich     OTFT (Omega tit for tat)             9    36     6     29,554
  3     Philip Hingston        Mod (Modeller)                       7    36     8     29,003
  4     Bruno Beaufils         GRAD (Gradual)                       8    32     11    28,707
  5     Tim Romberg            tro1 (tro1)                          13   32     6     28,692
  6     Richard Brunauer       DETerminatorL6C4 (DETerminatorL6C4)  12   32     7     28,523
  7     Hannes Payer           DETerminatorL4C4 (DETerminatorL4C4)  11   33     7     28,292
  8     Bennett McElwee        LOOKDB (LookaheadDB)                 22   11     18    28,110
  9     Gerhard Mitterlechner  PRobberyM5C4 (PRobberyM5C4)          11   32     8     27,893
  10    Wayne Davis            ltft (ltft)                          1    44     6     27,834
1.6. Acknowledgements
We would like to thank the following people who acted as reviewers for the
chapters in this book.
• Muhammad A. Ahmad
• Oscar Alonso
• Dan Ashlock
• Tsz-Chiu Au
• Carlos Eduardo Rodriguez Calderon
• Michel Charpentier
• Wayne Davis
• Jorg Denzinger
• Eugene Eberbach
• Michael Filzmoser
• Nelis Franken
• Nicholas Gessler
• Michal Glomba
• Philip Hingston
• Enda Howley
• Nick Jennings
• Nanlin Jin
• Jacint Jordana
• Wolfgang Kienreich
• Eun-Youn Kim
• Jia-wei Li
• Helmut A. Mayer
• Bennett McElwee
• Gerhard Mitterlechner
• Colm O’Riordan
• Sarvapali Ramchurn
• Alex Rogers
• Tim Romberg
• Darryl A. Seale
• Wolfgang Slany
• Elpida Tzafestas
• Perukrishnen Vytelingum
• Georgios N. Yannakakis
• Lukas Zebedin
Appendix: Software Framework
A software library and corresponding application were developed to make it
easy to implement prisoner's dilemma strategies and to run tournament competitions
between populations of them. Although a vast array of software is available
for the same purpose, none of it contained all the features we required. For
several of our experiments we required a game engine that would, among
other things, handle a continuous [normalised] range of moves, arbitrarily
sized payoff matrices, different types of signal noise, multiple (> 2) strategies
per game, and logging of partial and completed game results.
The software suite was developed in Java, allowing ease of development
and web deployment. New strategies are easily implemented by subclassing
the Strategy class. The principal requirements are implementations of
the getMove() and reset() methods, which return the current strategy move
and clear the strategy state between games, respectively.
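As an illustration, a tit-for-tat strategy might look something like the sketch below. Note that this is our own reconstruction: getMove() and reset() are named above, but the exact signatures of the Strategy base class, and the mechanism by which the opponent's previous move is delivered (here a hypothetical setOpponentMove() callback), are assumptions rather than the library's documented API.

```java
// Sketch of a tit-for-tat strategy; the base class and callback are assumed.
public class TitForTat extends Strategy {
    private double lastOpponentMove = 1.0; // moves assumed normalised to [0, 1], 1 = cooperate

    // Hypothetical callback: the engine reports the opponent's previous move.
    public void setOpponentMove(double move) {
        lastOpponentMove = move;
    }

    @Override
    public double getMove() {
        return lastOpponentMove; // echo the opponent's last move; cooperate initially
    }

    @Override
    public void reset() {
        lastOpponentMove = 1.0; // clear state between games
    }
}
```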
Currently we define two types of games: standard and multi-player. A
standard game involves two competing strategies playing for a number of
rounds, and mimics the basic game mechanics of the competitions
run by Axelrod. A multi-player game involves several competing strategies,
each obtaining payoffs from every other opponent it plays against on each round.
A tournament involves every participating strategy and differs for standard
and multi-player games. A standard tournament pits every strategy
against every other (including itself) in a standard game [a la round robin].
A multi-player tournament plays a single multi-player game.
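A standard tournament therefore reduces to a double loop over the entrants. The sketch below is our own illustrative code, not the library's; playStandardGame() is a stand-in for the game engine, and Strategy is as in the sketch above.

```java
import java.util.List;

public class StandardTournament {
    // Stand-in for the game engine: returns {payoff to a, payoff to b} over the rounds.
    static double[] playStandardGame(Strategy a, Strategy b, int rounds) {
        return new double[] {0.0, 0.0}; // the real engine would play the game here
    }

    /** Round robin: every strategy meets every other strategy, including itself. */
    static void roundRobin(List<Strategy> entrants, int rounds, double[] totals) {
        for (int i = 0; i < entrants.size(); i++) {
            for (int j = i; j < entrants.size(); j++) {   // j starts at i to include self-play
                double[] p = playStandardGame(entrants.get(i), entrants.get(j), rounds);
                totals[i] += p[0];
                if (j != i) totals[j] += p[1];            // avoid double-crediting self-play
            }
        }
    }
}
```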
An option is available to introduce a Gaussian distributed random num-
ber of rounds to be played, so as to discourage strategies from using the
knowledge of a predefined or static parameter for an unfair advantage.
There is also an option to introduce noise into the output moves, in prin-
ciple to test the robustness of the algorithms. Besides the programming
API, a graphical user interface is available to set up and run PD tourna-
ment competitions (see Figure 1.1).
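Both options amount to only a few lines. The sketch below is ours; the mean, standard deviation, and noise level are illustrative parameters rather than the framework's defaults (competition 2 used a noise level of 0.1).

```java
import java.util.Random;

public class GameOptions {
    static final Random RNG = new Random();

    /** Gaussian-distributed round count, so entrants cannot exploit a fixed game length. */
    static int sampleRounds(double mean, double stdDev) {
        return Math.max(1, (int) Math.round(mean + stdDev * RNG.nextGaussian()));
    }

    /** With probability noise, a binary move (1 = cooperate, 0 = defect) is misreported. */
    static double transmit(double move, double noise) {
        return RNG.nextDouble() < noise ? 1.0 - move : move;
    }
    // e.g. sampleRounds(200, 20); transmit(move, 0.1) matches competition 2's noise level.
}
```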
The software monitors and allows users to log the output of a tourna-
ment with different degrees of detail. However, detailed logs will degrade
performance.
Besides the standard 2 × 2 payoff matrix for classic games, there is the
ability to define an arbitrarily sized payoff matrix, allowing for a wider range
of allowable moves. Moves are normalised, and payoffs are calculated from
the closest allowable move in the payoff matrix.
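One way to realise this, sketched under our own assumptions (moves normalised to [0, 1] and k equally spaced cooperation levels), is to snap each submitted move to the nearest allowed level and use that as an index into the payoff matrix.

```java
public class MoveQuantiser {
    /** Snap a normalised move in [0, 1] to the nearest of k equally spaced levels. */
    static int nearestLevel(double move, int k) {
        double clamped = Math.min(1.0, Math.max(0.0, move));
        return (int) Math.round(clamped * (k - 1)); // index into a k x k payoff matrix
    }

    public static void main(String[] args) {
        // With k = 5 levels {0, 1/4, 1/2, 3/4, 1}, a move of 0.6 snaps to level 1/2 (index 2).
        System.out.println(nearestLevel(0.6, 5)); // prints 2
    }
}
```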
Fig. 1.1. IPD tournament application.
A number of standard classic strategies are included in the library.
The software can be downloaded from http://prisoners-dilemma.com.
References
Axelrod, R. (1980a). Effective Choices in the Prisoner’s Dilemma, J. Conflict
Resolution, 24, pp. 3-25.
Axelrod, R. (1980b). More Effective Choices in the Prisoner’s Dilemma, J. Con-
flict Resolution, 24, pp. 379-403.
Axelrod R. M. (1984). The Evolution of Cooperation (BASIC Books, New York).
Axelrod R. and D’Ambrosio L. (1995). Announcement for Bibliography on the
Evolution of Cooperation, Journal of Conflict Resolution 39, pp. 190.
Axelrod R. (1997). The Complexity of Cooperation (Princeton University Press).
Birk A. (1999). Evolution of Continuous Degrees of Cooperation in an N-Player
Iterated Prisoner’s Dilemma, Technical Report under review, Vrije Univer-
siteit Brussel, AI-Laboratory.
Boyd R. and Lorberbaum J. P. (1987). No Pure Strategy is Evolutionarily Stable
in the Repeated Prisoner's Dilemma, Nature, 327, pp. 58-59.
Darwen P. and Yao X. (2002). Co-Evolution in Iterated Prisoner's Dilemma with
Intermediate Levels of Cooperation: Application to Missile Defense, In-
ternational Journal of Computational Intelligence and Applications, 2, 1,
pp. 83-107.
Darwen P. and Yao X. (1995). On Evolving Robust Strategies for the Iterated
Prisoner's Dilemma, In Progress in Evolutionary Computation, LNAI, 956,
pp. 276-292.
Darwen P. and Yao X. (2001). Why More Choices Cause Less Cooperation in
Iterated Prisoner’s Dilemma, Proc. Congress of Evolutionary Computation,
pp. 987-994.
Davis M. (1997). Game Theory: A Nontechnical Introduction (Dover Publications).
Fogel D. (1993). Evolving Behaviours in the Iterated Prisoner's Dilemma, Evolutionary
Computation, 1, 1, pp. 77-97.
Goldstein J. (1991). Reciprocity in Superpower Relations: An Empirical Analysis,
International Studies Quarterly, 35, pp. 195-209.
Maynard Smith J. (1982). Evolution and the Theory of Games (Cambridge Uni-
versity Press).
Minas J. S., Scodel A., Marlowe D. and Rawson H. (1960). Some Descriptive
Aspects of Two-Person, Non-Zero-Sum Games, II, Journal of Conflict Res-
olution, 4, pp. 193-197.
Nash J. (1950). The Bargaining Problem, Econometrica, 18, pp. 150-155.
Nash J. (1953). Two-Person Cooperative Games, Econometrica, 21, pp. 128-140.
O'Riordan C. and Bradish S. (2000). Experiments in the Iterated Prisoner's Dilemma
and the Voter’s Paradox. 11th Irish Conference on Artificial Intelligence and
Cognitive Science.
O’Riordan C. (2000). A Forgiving Strategy for the Iterated Prisoner’s Dilemma,
Journal of Artificial Societies and Social Simulation, 3, 1.
Poundstone W. (1992). Prisoner's Dilemma (Doubleday).
Rapoport A. (1996). Optimal Policies for the Prisoner's Dilemma, Tech. Report
No. 50, Psychometric Laboratory, Univ. North Carolina, NIH Grant, MH-
10006.
Scodel A. and Philburn R. (1959). Some Personality Correlates of Decision Mak-
ing under Conditions of Risk, Behavioral Science, 4, pp. 19-28.
Scodel A., Minas J. S., Ratoosh P. and Lipetz M. (1959). Some Descriptive Aspects
of Two-Person, Non-Zero-Sum Games, Journal of Conflict Resolution, 3,
pp. 114-119.
Scodel A. and Minas J. S. (1960). The Behavior of Prisoners in a “Prisoner’s
Dilemma” Game, Journal of Psychology, 50, pp. 133-138.
Scodel A. (1962). Induced Collaboration in Some Non-Zero-Sum Games, Journal
of Conflict Resolution, 6, pp. 335-340.
Scodel A. (1963). Probability Preferences and Expected Values. Journal of Psy-
chology, 56, pp. 429-434.
von Neumann J. and Morgenstern O. (1944). Theory of Games and Economic
Behavior (Princeton University Press).
Yao X. and Darwen P. (1999). How Important is Your Reputation in a Multi-Agent
Environment? Proc. of the 1999 IEEE Conference on Systems, Man
and Cybernetics, IEEE Press, Piscataway, NJ, USA, pp. II-575 – II-580, Oct.
Chapter 2
Iterated Prisoner’s Dilemma and Evolutionary
Game Theory
Siang Yew Chong1, Jan Humble2, Graham Kendall2, Jiawei Li2,3,
Xin Yao1
University of Birmingham1, University of Nottingham2, Harbin Institute
of Technology3
2.1. Introduction
The prisoner’s dilemma is a type of non-zero-sum game in which two players
try to maximize their payoff by cooperating with, or betraying, the other
player. The term non-zero-sum indicates that whatever benefits accrue to
one player do not necessarily imply similar penalties imposed on the other
player. The prisoner's dilemma was originally framed by Merrill Flood and
Melvin Dresher, working at the RAND Corporation in 1950. Albert W. Tucker
formalized the game with prison-sentence payoffs and gave it the "Prisoner's
Dilemma" name. The classical prisoner's dilemma (PD) is as follows:
Two suspects, A and B, are arrested by the police. The police
have insufficient evidence for a conviction, and, having sepa-
rated both prisoners, visit each of them to offer the same deal:
if one testifies for the prosecution against the other and the
other remains silent, the betrayer goes free and the silent ac-
complice receives the full 10-year sentence. If both stay silent,
the police can sentence both prisoners to only six months in
jail for a minor charge. If each betrays the other, each will re-
ceive a two-year sentence. Each prisoner must make the choice
of whether to betray the other or to remain silent. However,
neither prisoner knows for sure what choice the other prisoner
will make. So the question this dilemma poses is: What will
happen? How will the prisoners act?
The general form of the PD is represented as the following matrix [Scodel
et al. (1959)]:
                             Prisoner 2
                        Cooperate   Defect
  Prisoner 1  Cooperate  (R, R)      (S, T)
              Defect     (T, S)      (P, P)
where R, S, T , and P denote Reward for mutual cooperation, Sucker’s
payoff, Temptation to defect, and Punishment for mutual defection respec-
tively, and T > R > P > S and R > (S + T)/2. The two constraints
motivate each player to play noncooperatively and prevent any incentive to
alternate between cooperation and defection [Rapoport (1966, 1999)].
Neither prisoner knows the choice of his accomplice. Even if they were
able to talk to each other, neither could be sure that he could trust the
other. The “dilemma” faced by the prisoners here is that, whatever the
other does, each is better off confessing than remaining silent. However,
the payoff when both confess is worse for each player than the outcome
they would have received if they had both remained silent. Traditional
game theory predicts that the outcome of the PD will be mutual defection,
based on the concept of Nash equilibrium. To defect is dominant because,
if both players choose to defect, neither has anything to gain by unilaterally
changing their own strategy [Hardin (1968); Nash (1950, 1951, 1996)].
In the Iterated Prisoner's Dilemma (IPD) game, two players have to
choose their moves repeatedly, and have memory of their previous
behaviors. Because players who defect in one round can be "punished"
by defections in subsequent rounds, and those who cooperate can be rewarded
by cooperation, the appropriate strategy for self-interested players
is no longer obvious in IPD games. If the precise length of an IPD is
known to the players, then the optimal strategy is to defect on each round
(often called Always Defect, or AllD) [Luce and Raiffa (1957)]. This single
rational strategy, which is deduced by propagating the single-stage Nash
equilibrium of mutual defection backwards through every stage of the game,
prevents players from cooperating to achieve higher payoffs [Selten (1965,
1983, 1988); Noldeke and Samuelson (1993)]. If the game has infinite length,
or at least the players are not aware of the length of the game, backward
or at least the players are not aware of the length of the game, backward
induction is no longer effective and there exists the possibility that cooper-
ation can take place. In fact, there is still controversy about whether or not
backward induction can be applied to infinite (or finite) IPDs [Sobel (1975,
1976); Kavka (1986); Becker and Cudd (1990); Binmore (1997); Binmore et
al. (2002); Bovens (1997)]. However, in IPD experiments, it was not uncom-
mon to see people cooperate to gain a greater payoff not only in repeated
games but even in one-shot games [Cooper et al. (1996); Croson (2000);
Davis and Holt (1999); Milinski and Wedekind (1998)]. Traditional game
theory interprets the cooperation phenomena in IPDs by means of repu-
tation [Fudenberg and Maskin (1986); Kreps and Wilson (1982); Milgrom
and Roberts (1982)], incomplete information [Harsanyi (1967); Kreps et al.
(1982); Sarin (1999)], or bounded rationality [Anthonisen (1999); Harborne
(1997); Radner (1980, 1986); Simon (1955, 1990); Vegaredondo (1994)].
Evolutionary game theory differs from classical game theory in that it
focuses on the dynamics of strategy change in a population more than on
the properties of strategy equilibria. In evolutionary game theory, the IPD
is an ideal experimental platform for the problem of how cooperation
occurs and persists, which is considered impossible in a static or
deterministic environment. The IPD attracted wide interest after Robert
Axelrod's famous book "The Evolution of Cooperation". In 1979, Robert
Axelrod organized a prisoner's dilemma tournament and solicited strategies
from game theorists [Axelrod (1980a, 1980b)]. Each of the 14 entries
competed against all others (including itself) over a sequence of 200 moves.
The specific payoff function used is as follows.
                             Prisoner 2
                        Cooperate   Defect
  Prisoner 1  Cooperate  (3, 3)      (0, 5)
              Defect     (5, 0)      (1, 1)
The winner of the tournament was "tit-for-tat" (TFT), submitted by
Anatol Rapoport. TFT always cooperates on the first move and then mimics
whatever the other player did on the previous move. In a second tournament
with 62 entries, again the winner was TFT. Axelrod discovered that
"greedy" strategies tended to do very poorly in the long run, while "altruistic"
strategies did better when the PD was repeated over a long period of
time with many players. Genetic algorithms were then introduced to show
how these altruistic strategies evolve in populations that are initially
dominated by selfishness. The prisoner's dilemma is therefore of interest
to the social sciences such as economics, politics and sociology, and to the
biological sciences such as ethology and evolutionary biology, as well as to
applied mathematics, including evolutionary computing. Many social and
natural processes, for example arms races between states and price setting by
duopolistic firms, have been abstracted into models in which independent
groups or individuals are engaged in PD games [Brelis (1992); Bunn and
Payne (1988); Hauser (1992); Hemelrijk (1991); Surowiecki (2004)].
The optimal strategy for the one-shot PD game is simply defection.
However, in the IPD game the optimal strategy depends upon the strate-
gies of the possible opponents. For example, the strategy of Always Co-
operate (AllC) is dominated by the strategy of Always Defect (AllD), and
AllD is optimal in a population consisting of AllD and AllC. However, in
a population consisting of AllD, AllC, and TFT, AllD is not necessarily
optimal. It appears that all the strategies in the population together
determine which strategy is optimal. Although TFT proved efficient
in many IPD tournaments and was long considered to be the best
basic strategy, it can be defeated in some specific circumstances [Beaufils,
Delahaye and Mathieu (1996); Wu and Axelrod (1994)]. Therefore, game
theorists have a lasting interest in finding optimal strategies, or at least
novel strategies which outperform TFT in IPD tournaments.
Since Axelrod, two types of approaches have been developed to test the
efficiency or robustness of a strategy, and further to derive optimal strategies:
(1) Round-robin tournaments.
(2) Evolutionary dynamics.
Round-robin tournaments show the efficiency of a strategy in competing
with others, while ecological simulation illustrates the evolutionary robustness
of a strategy in terms of the number of descendants, or survivability,
in a certain environment. Many novel strategies have been developed and
analyzed by means of these approaches.
By using round-robin tournaments, the interactions between different
strategies can be observed and analyzed. If the statistical distribution of
opposing strategies can be determined, an optimal counter-strategy can be
derived mathematically. For example, if the population consists of 50%
TFT and 50% AllC, the optimal strategy should cooperate with TFT and
defect against AllC in order to maximize the payoff. It is easy to design such a
strategy: defect on the first two moves, and then always play C if the
opponent defected on the second move, otherwise always play D (see the
sketch below). A similar concept for analyzing optimal strategies is the
Bayesian Nash equilibrium, which is widely used in experimental economics
[Bedford and Meilijson (1997); Gilboa and Schmeidler (2001); Kagel and
Roth (1995); Kalai and Lehrer (1993); Rubinstein (1998)]. In evolutionary
dynamics, processes like
natural selection are simulated where individuals with low scores die off,
and those with high scores flourish. The evolutionary rule that describes
what future states follow from the current state is fixed and deterministic:
for a given time interval only one future state follows from the current state
[Katok and Hasselblatt (1996)]. The common methodology for the evolution
rule is the replicator equation, which assumes infinite populations, continuous
time, complete mixing, and that strategies breed true.
of strategies and the dynamic equations, the evolutionary process can be
simulated, and how strategies evolve in the population over a short or long
time period can be shown. Optimal strategies can be developed in this way
[Axelrod (1987); Darwen and Yao (1995, 1996, 2001); Lindgren (1992);
Miller (1996)].
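The probe strategy for the 50% TFT / 50% AllC population described above can be sketched as follows (our own illustrative Java; true denotes cooperation).

```java
// Probe strategy for a population of 50% TFT and 50% AllC: defect on the first
// two moves, classify the opponent from its second move, then exploit accordingly.
public class ProbeStrategy {
    private int round = 0;
    private boolean opponentIsTFT = false;

    /** opponentsPreviousMove is null on the first move; true denotes cooperation. */
    public boolean nextMove(Boolean opponentsPreviousMove) {
        round++;
        if (round <= 2) return false;                  // defect on moves 1 and 2
        if (round == 3) {
            // TFT retaliates against our first defection on its second move;
            // AllC never defects, whatever we do.
            opponentIsTFT = !opponentsPreviousMove;
        }
        return opponentIsTFT;  // always cooperate with TFT, always defect against AllC
    }
}
```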
2.2. Strategies in IPD Tournaments
Axelrod was the first to attempt to search for efficient strategies by means
of IPD tournaments [Axelrod (1980a, 1980b)]. TFT had long been studied as
a strategy for the IPD game [Komorita, Sheposh and Braver (1968); Rapoport
and Chammah (1965)]. However, it was after Axelrod's tournaments that
TFT became well known.
According to Axelrod, several conditions are necessary for a strategy to
be successful. These conditions include:
Nice
The most important condition is that the strategy must be "nice".
That is, it will not defect before its opponent does. Almost all of the
top-scoring strategies are nice; therefore, even a purely selfish strategy
should never defect first.
Retaliating
Axelrod contended that a successful strategy must not be a blind op-
timist. It must always retaliate. An example of a non-retaliating
strategy is AllC. This is a very bad choice, as “nasty” strategies will
ruthlessly exploit such strategies.
Forgiving
Another quality of successful strategies is that they must be forgiving.
Though they will retaliate, they will fall back to cooperating if the
opponent does not continue to defect. This stops long runs of revenge
and counter-revenge, thus maximising payoffs.
Clear
The last quality is being clear, that is, making it easy for other strategies
to predict its behavior, so as to facilitate mutual cooperation.
Stochastic strategies, however, are not clear because of the uncertainty
in their choices.
In a further study, Axelrod noted that just a few of the 62 entries in the
second tournament had a significant influence on the performance of a given
strategy. He utilized eight strategies as opponents for a simulated evolving
population based on a genetic algorithm approach [Axelrod (1987)]. The
population consisted of deterministic strategies that use the outcomes of the
three previous moves to determine the current move. The simulation was
conducted using a population of 20 strategies, out of a total of 2^70 possible
strategies, played repeatedly against the eight representatives. Mutation and
crossover were used to generate new strategies. The typical results indicated
that populations initially moved towards mutual defection, but subsequently
evolved toward mutual cooperation. Moreover, most of the strategies that
evolved in the simulation actually resembled TFT, having the properties of
"Nice", "Forgiving", and "Retaliating".
Although TFT has been considered to be the most successful strategy
in IPD for several decades, there still is some controversy about it. There
seems to be a lack of theoretical explanation for strategies like TFT in
traditional game theory. TFT is not subgame perfect, and there are always
subgame-perfect equilibria that dominate TFT according to the Folk Theorem
[Binmore (1992); Hargreaves and Varoufakis (1995); Myerson (1991);
Rubinstein (1979); Selten (1965, 1975)]. On the other hand, whether or not
TFT is the most efficient singleton strategy in the IPD game is still unclear;
therefore, many researchers are attempting to develop novel strategies that
can outperform TFT.
2.2.1. Heterogeneous TFTs
Since TFT had such success in IPD tournaments and experiments, it is natural to ask whether TFT might be improved by slightly modifying its rule. Many heterogeneous TFTs have been developed in order to overcome TFT's shortcomings or to adapt to a particular environment, for example the IPD with noise. Tit-for-Two-Tats (TFTT), Generous TFT (GTFT), and Contrite TFT (CTFT) are examples.
A situation that TFT does not handle well is a long series of mutual
retaliations evoked by an occasional defection. The deadlock can be bro-
ken if the co-player behaves more generously than TFT and forgives at
least one defection. TFTT retaliates with defection only after two succes-
sive defections and thus attempts to avoid becoming involved in mutual
retaliations. Usually, TFTT performs well in a population of mostly cooperative strategies but does poorly in a population of mostly permanently defecting strategies. Similar to TFTT, Benevolent TFT (BTFT) always cooperates after cooperation and normally defects after defection, but occasionally BTFT responds to defection with cooperation in order to break up a run of mutual retaliation [Komorita, Sheposh and Braver (1968)]. In the experiments of Manarini (1998) and Micko (1997), fixed-interval BTFT strategies were shown to be superior to, or at least equivalent to, TFT in terms of cooperation as well as cumulative pay-off. However, BTFT tends to produce irregularly alternating exploitations and sometimes falls back into mutual retaliation.
Allowing some percentage of the other player's defections to go unpunished has been widely accepted as a good way to cope with noise [Molander (1985); May (1987); Axelrod and Dion (1988); Bendor et al. (1991); Godfray (1992); Wu and Axelrod (1994)]. A reciprocating strategy such as TFT can be modified to forgive the other player's defections with a certain probability in order to decrease the influence of noise. GTFT behaves like TFT but cooperates with probability q = min[1 − (T − R)/(R − S), (R − P)/(T − P)] when it would otherwise defect. This prevents a single error from echoing indefinitely. For example, in the case of T = 5, R = 3, P = 1, and S = 0, q = 1/3. GTFT is reported to take over a population of homogeneous TFT strategies in an evolutionary environment with noise [Nowak and Sigmund (1992)].
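As a quick check of this formula, the generosity probability can be computed directly from the payoff values. The function below is our own illustrative sketch (the name gtft_generosity is not from the literature):

    def gtft_generosity(T, R, P, S):
        # q = min[1 - (T - R)/(R - S), (R - P)/(T - P)]
        return min(1 - (T - R) / (R - S), (R - P) / (T - P))

    print(gtft_generosity(5, 3, 1, 0))   # prints 0.333..., i.e. q = 1/3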
In a noisy environment, retaliating against an unintended defection often leads to permanent bilateral retaliation. Forgiving a defection that was itself evoked by one's own unintended defection therefore offers a quick way to recover from error. The idea is that one should not be provoked by the other player's response to one's own unintended defection [Sugden (1986); Boyd (1989)]. The strategy CTFT has three states: “contrite”, “content” and “provoked”. It begins in the content state, cooperating, and stays there unless there is a unilateral defection. If it was the victim while content, it becomes provoked and defects until a cooperation from the other player causes it to become content again. If it was the defector while content, it becomes contrite and co-
operates. When contrite, it becomes content only after it has successfully
cooperated. CTFT can thus correct its own unintended defections in a noisy environment. If one of two CTFT players defects, the defecting player will contritely cooperate on the next move while the other player defects, and then both will be content to cooperate on the following move. However, CTFT is not effective at correcting the other player's errors. For example, if CTFT is playing TFT and the TFT player defects by accident, the retaliation will continue until another error occurs. In an ecological simulation with noise, GTFT and CTFT competed with the 63 rules of the second round of Axelrod's computer tournament for the Prisoner's Dilemma [Axelrod (1984)]. CTFT was the dominant strategy, making up 97% of the population at generation 2000 [Wu and Axelrod (1994)].
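The three-state rule just described can be written as a small state machine. The following Python sketch is our own illustration; it assumes each player observes the moves actually realized (after any noise) on both sides:

    class ContriteTFT:
        # States follow Sugden (1986) and Boyd (1989): content, contrite, provoked
        def __init__(self):
            self.state = 'content'

        def move(self):
            # Content and contrite players cooperate; provoked players defect
            return 'C' if self.state in ('content', 'contrite') else 'D'

        def update(self, my_actual, opp_actual):
            if self.state == 'content':
                if my_actual == 'C' and opp_actual == 'D':
                    self.state = 'provoked'   # victim of a unilateral defection
                elif my_actual == 'D' and opp_actual == 'C':
                    self.state = 'contrite'   # own (possibly accidental) defection
            elif self.state == 'provoked' and opp_actual == 'C':
                self.state = 'content'        # opponent's cooperation ends retaliation
            elif self.state == 'contrite' and my_actual == 'C':
                self.state = 'content'        # successfully cooperated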
2.2.2. Pavlov (Win-Stay Lose-Shift)
A possible drawback of TFT is that it performs poorly in a noisy environment. Assume that a population of TFT strategies plays the IPD with one another in a noisy environment, where every choice may occasionally be implemented in error. Although a TFT strategy cooperates with its twin at the beginning, it falls out of cooperation as soon as the other player's action is misinterpreted, since this induces the other player's defection in the next round. Therefore, after an error, the game settles into a CD, DC, CD . . . cycle. If a second error happens, the outcome is as likely to fall into mutual defection as it is to resume cooperation. Cooperation between TFT strategies is thus easily broken even when the noise frequency is low [Donninger (1986); Kraines and Kraines (1995)].
The Pavlov strategy, also known as Win-Stay Lose-Shift or Simpleton
[Rapoport and Chammah (1965)], has been shown to outperform TFT in noisy environments [Fudenberg and Maskin (1990); Kraines and Kraines (1995, 2000)]. Pavlov cooperates when both sides cooperated or both defected on the previous move, and defects otherwise. Pavlov, like TFT, is a memory-one strategy: the player remembers and uses only its own move and its opponent's move from the last round. The major difference is that Pavlov cooperates after a mutual defection where TFT would defect, and this helps Pavlov resume cooperation with cooperative strategies, such as TFT, in a noisy environment. When restricted to an environment of memory-one agents interacting in iterated Prisoner's Dilemma games with a 1% noise level, Pavlov is the only cooperative strategy, and one of the very few strategies, that cannot be invaded by a similar strategy [Nowak and Sigmund (1993, 1995)].
Simulation of evolutionary dynamics of win-stay lose-shift strategies
shows that these strategies are able to adapt to the uncertain environment
even when the noise level is high [Posch (1997)]. In simulations of stochastic memory-one strategies for the IPD, Nowak and Sigmund (1993, 1995) report that cooperative agents using a Pavlov-type strategy eventually dominate a random population. Memory-one strategies can be ex-
pressed in the form of S(p1, p2, p3, p4), where p1 denotes the probability
of playing C (Cooperate) after a CC outcome, p2 denotes the probability
of playing C after a CD outcome, p3 denotes the probability of playing
C after a DC outcome, and p4 denotes the probability of playing C af-
ter a DD outcome. Most of the well-known strategies can be expressed
in this form. For example, AllC = S(1, 1, 1, 1), AllD = S(0, 0, 0, 0), TFT
= S(1, 0, 1, 0), Pavlov = S(1, 0, 0, 1). Noise is conveniently introduced by restricting the conditional probabilities p_i to lie strictly between 0 and 1. For example, S(0.999, 0.001, 0.999, 0.001) is a TFT strategy in which each choice has a 0.001 probability of being implemented in error. In a computer simulation with a population using
the totally random strategy S(0.5, 0.5, 0.5, 0.5), the win-stay lose-shift strategy shows its evolutionary robustness in a noisy environment. Over a total of 10^7 generations, randomly generated mutant strategies are introduced every 100 generations (10^5 mutants in all). Simulation results show that the populations are dominated by the win-stay lose-shift strategy in 33 of a total of 40 simulations. TFT strategies perform poorly in large part because they do not exploit overly cooperative strategies.
Simulations reveal that Pavlov loses against AllD but can invade TFT,
and that Pavlov cannot be invaded by AllD [Milinski (1993)].
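The S(p1, p2, p3, p4) representation is easy to simulate. The fragment below is our own minimal sketch of one noisy round between two memory-one players; the 1% noise level matches the figure quoted above:

    import random

    PAVLOV = (1.0, 0.0, 0.0, 1.0)   # S(1, 0, 0, 1)
    TFT    = (1.0, 0.0, 1.0, 0.0)   # S(1, 0, 1, 0)

    def memory_one_move(strategy, my_last, opp_last):
        # Index the four outcomes CC, CD, DC, DD from the player's own view
        index = {'CC': 0, 'CD': 1, 'DC': 2, 'DD': 3}[my_last + opp_last]
        return 'C' if random.random() < strategy[index] else 'D'

    def with_noise(move, noise=0.01):
        # With probability `noise`, the intended move is implemented in error
        if random.random() < noise:
            return 'D' if move == 'C' else 'C'
        return move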
2.2.3. Gradual
The Gradual strategy is like TFT but responds to the opponent with a graduated pattern. It acts as TFT does, except when it is time to forgive and remember the past. It cooperates on the first move and continues to do so as long as the other player cooperates. After the first defection of the other player, it defects once and cooperates twice; after the second defection of the opponent, it defects twice and cooperates twice; . . . after the nth defection it reacts with n consecutive defections and then calms down its opponent with two cooperations [Beaufils, Delahaye and Mathieu (1996)].
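A minimal sketch of Gradual's bookkeeping follows (our own reading of the rule above; details such as how defections arriving mid-punishment are handled vary between implementations):

    class Gradual:
        def __init__(self):
            self.opp_defections = 0   # total defections seen from the opponent
            self.pending = []         # queued moves of the current punishment burst

        def move(self, opp_last=None):
            if opp_last == 'D':
                self.opp_defections += 1
                # After the nth defection: n defections, then two calming cooperations
                self.pending = ['D'] * self.opp_defections + ['C', 'C']
            if self.pending:
                return self.pending.pop(0)
            return 'C'   # cooperate on the first move and whenever not punishing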
Both round-robin competitions and ecological evolution experiments are
conducted in order to compare the performance of Gradual with TFT.
Gradual wins in round-robin competitions against several well-known strategies, such as TFT and GRIM. In ecological evolutionary experiments, Gradual and TFT show the same qualitative pattern of evolution, but with a quantitative difference in favor of Gradual, which finishes far ahead of all other survivors when the population stabilises. However, these experiments suffice to show that TFT is not always the best strategy, not that Gradual always outperforms TFT. Gradual receives fewer points than TFT when interacting with AllD, because Gradual forgives too many defections. Therefore, if a competition contains many defecting strategies like AllD, TFT may well outperform Gradual.
Beaufils, Delahaye and Mathieu (1996) try to improve the performance of Gradual by using a genetic algorithm. Nineteen different genes are used, and a fitness function evaluates the quality of the strategies. Several new strategies are found after 150 generations of evolution. One of them beats Gradual and TFT in a round-robin tournament as well as in an ecological simulation. In both cases it finishes first, just ahead of Gradual, with TFT two or three places behind and a wide gap in score, or in the size of the stabilised population.
The evolutionary dynamics of populations including Gradual have also been studied by Delahaye and Mathieu (1996), Doebeli and Knowlton (1998), Glomba, Filak, and Kwasnicka (2005), and Beaufils, Delahaye, and Mathieu (1996).
2.2.4. Adaptive strategies
From the viewpoint of automation, the strategies in IPD games can be regarded as automatic agents with or without feedback mechanisms. Most well-known IPD strategies are not adaptive, because their responses to any given opponent are fixed. It is impossible to improve their performance, since the parameters of their response mechanism cannot be adjusted. However, some IPD strategies are adaptive. Although there is still no experimental evidence of adaptive strategies outperforming non-adaptive ones in IPD games, adaptive strategies are worth studying, since creatures with higher intelligence are all adaptive.
There have been two approaches to developing adaptive strategies.
Firstly, adaptive mechanisms can be implemented by making the parame-
ters of a non-adaptive strategy adjustable. Secondly, new adaptive strate-
gies can be developed by using evolutionary computation, reinforcement
learning, and other computational techniques [Darwen and Yao (1995,
1996)].
Tzafestas (2000a, 2000b) introduced the adaptive tit-for-tat (ATFT) strategy, which embeds an adaptive factor into the conventional TFT strategy. ATFT keeps the advantages of tit-for-tat, in the sense of retaliating and forgiving, and adds a behavioural gradualness that shows up as fewer oscillations between Cooperate and Defect. It maintains an estimate of the opponent's behavior, whether cooperative or defecting, and reacts to it in a tit-for-tat manner. To represent degrees of cooperation and defection, a continuous variable named “world”, ranging from 0 (total defection) to 1 (total cooperation), is used. The ATFT strategy can then be formulated as a simple model:
If (opponent played C in the last cycle) then
    world = world + r*(1 - world)
else
    world = world + r*(0 - world)
If (world >= 0.5) play C, else play D
Here r is the adaptation rate. The TFT strategy corresponds to the case r = 1 (immediate convergence to the opponent's current move), so ATFT is an extension of the conventional TFT strategy. Simulations of spatial IPD games between ATFT, AllD, AllC, and TFT on a 2D grid show that ATFT is fairly stable and resistant to perturbations. Since a fairly small adaptation rate r allows more gradual behavior, ATFT tends to be more robust than TFT in a noisy environment.
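Translated directly, the rule above becomes a short function. The following Python rendering is our own sketch; the default adaptation rate r = 0.2 is an assumed illustrative value, not taken from Tzafestas:

    def atft_move(world, opp_last, r=0.2):
        # Move the continuous estimate of the opponent's cooperativeness
        # toward 1 after a cooperation and toward 0 after a defection,
        # then respond to the estimate in a tit-for-tat manner.
        target = 1.0 if opp_last == 'C' else 0.0
        world = world + r * (target - world)
        return world, ('C' if world >= 0.5 else 'D')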
Since evolutionary computation has been widely used in simulating the
dynamics of IPD games, it is natural to consider obtaining IPD strategies
directly by using evolutionary approaches [Lindgren (1991); Fogel (1993);
Darwen and Yao (1995, 1996)]. Axelrod (1987) studied how to find effective strategies by using genetic algorithms as the simulation method. He established an initial population of deterministic strategies that use the outcomes of the three previous moves to choose the current move. By playing IPD games against one another, successful strategies are selected to have more offspring. The new population then displays patterns of behavior that are more like those of the successful strategies of the previous population, and less like those of the unsuccessful ones. As the evolutionary process continues, strategies with relatively high scores flourish while unsuccessful strategies die out. Simulation results show
that most of the strategies evolved in the simulation actually resemble TFT and do substantially better than TFT. However, it would not be accurate to say that these strategies are better than TFT, because they are probably not very robust in other environments [Axelrod (1987)].
Many researchers have found that evolved strategies may lack robustness; i.e., the strategies did well against the local population, but failed when something new and innovative appeared [Lindgren (1991); Fogel (1993)]. Darwen and Yao (1996) applied a technique to prevent the genetic algorithm from converging to a single optimum and attempted to develop new IPD strategies without human intervention. They conclude that adding static opponents to the round-robin tournament improves the quality of the final population.
Optimal strategies can be determined only if the strategy of the oppo-
nent is known. By means of reinforcement learning, model-based strategies capable of on-line identification of an opponent can be built [Sandholm and Crites (1996); Freund et al. (1995); Schmidhuber (1996)]. How can a player acquire a model of its opponent's strategy? One possible source of information available to the player is the history of the game. Another is observed games between the opponent and other agents. In the case of IPD games, a player can infer a model of its opponent from the outcomes of past moves and then adapt its strategy during the game. Reinforcement learning (RL) is based on the idea that the tendency to produce an action should be strengthened if it produces favorable results, and weakened if it produces unfavorable results [Watkins (1989); Watkins and Dayan (1992); Kaelbling and Moore (1996)]. A model-based RL approach generates expectations about the opponent's behavior by making use of a model of its strategy [Carmel and Markovitch (1997, 1998)]. It is well suited for use in IPD tournaments against an unknown opponent because of its small computational complexity. The major problem in designing a model-based strategy (MBS) is the risk involved in exploration, and thus the trade-off between exploitation and exploration. An exploring action taken by the MBS tests unfamiliar aspects of the opponent, which can yield a more accurate model of the opponent. However, such an action also carries the risk of putting the MBS into a much worse position. For example, in order to distinguish the strategy AllC from GRIM and TFT in an IPD tournament, an MBS has to defect at least once and therefore loses the chance to cooperate with GRIM. An exploratory action affects not only the current payoff but also future rewards [Berry and Fristedt (1985)]. There have been several approaches developed to solve this
problem [Berry and Fristedt (1985); Gittins (1989); Sutton (1990); Naren-
dra and Thathachar (1989); Kaelbling (1993); Moore and Atkeson (1993);
Carmel and Markovitch (1998)]. Since the set of possible strategies for a repeated game is usually infinite, computational complexity is another problem that needs to be addressed [Ben-porath (1990); Carmel and Markovitch (1998)]. There are few records of an effective MBS in round-robin IPD tournaments. However, the strategy that won Competition 4 in the 2005 IPD tournament, Adaptive Pavlov, is such a strategy [Prisoner's dilemma tournament result (2005)]. Furthermore, it seems that each of the strategies that ranked above TFT incorporated a mechanism to explore the opponent.
2.2.5. Group strategies
In the 2004 IPD competition [20th-anniversary Iterated Prisoner's Dilemma competition], a team from Southampton University led by Professor N. Jennings introduced a group of strategies which proved to be more successful than Tit-for-Tat (see Chapter 9).
The group of strategies was designed to recognise each other through a known series of five to ten moves at the start. Once two Southampton players recognized each other, they would assume “master” or “slave” roles – a master always defects, while a slave always cooperates so that the master wins the maximum number of points. If the program recognized that another player was not a Southampton entry, it would immediately defect to minimise the score of the opposition. The Southampton group strategies succeeded in defeating all non-grouped strategies and won the top three positions in the competition [Prisoner's dilemma tournament result (2004)].
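In outline, the recognition mechanism works as follows; the code is our own sketch, and the opening sequence shown is hypothetical, not the actual Southampton handshake:

    HANDSHAKE = ['C', 'D', 'C', 'C', 'D']   # hypothetical known opening sequence

    def group_move(round_no, opp_history, role):
        if round_no < len(HANDSHAKE):
            return HANDSHAKE[round_no]        # perform the handshake first
        if opp_history[:len(HANDSHAKE)] == HANDSHAKE:
            # Teammate recognized: masters always defect, slaves always cooperate
            return 'D' if role == 'master' else 'C'
        return 'D'   # not a teammate: defect to minimise the opponent's score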
According to Grossman (2004), it was difficult to tell whether a group strategy would really beat TFT, because most of the “slave” group members received scores far below average and were ranked at the bottom of the table. The average score of the group strategies is not necessarily higher than that of TFT.
The significance of group strategies may lie in their evolutionary character. None of the known strategies in IPD games is an evolutionarily stable strategy [Boyd and Lorberbaum (1987)]. The strategies that are most likely to be evolutionarily stable, such as AllD or GRIM, can resist the invasion of some types of strategies but cannot resist the invasion of others. For example, a small group of TFT strategies cannot invade a large population of AllD; however, STFT can. There exists the possibility
that TFT can successfully invade a population of AllD indirectly. Suppose
that a large population of AllD is continuously attacked by small groups of
STFT. Because every invasion leaves a small positive proportion of STFT in the population of AllD, the number of STFT players increases gradually.
When the number of STFT is large enough, a small group of TFT can
successfully invade and AllD will die out.
However, group strategies may be evolutionarily stable. By means of
cooperating with group members and defecting against non-group members,
a population of group strategies can prevent any foreigner from successfully
invading. This is, perhaps, the real value of group strategies.
2.3. Evolutionary Dynamics in Games
Traditional game theorists have developed several effective approaches to the study of static games based on the assumption of rationality. Using Von Neumann-Morgenstern utility, refinements of Nash equilibrium, and formal reasoning, both cooperative and non-cooperative games can be analyzed within a theoretical framework. In the area of repeated games, however, especially games where dynamics are concerned, few approaches from traditional game theory are available.
Evolutionary game theory provides novel approaches to solving dynamic games. If the precise length of an IPD is known to the players, then the optimal strategy is to defect on every round. If the game has infinite length, or at least the players are not aware of the length of the game, cooperation becomes possible [Dugatkin (1989); Darwen and Yao (2002); Akiyama and Kaneko (1995); Doebeli, Blarer, and Ackermann (1997); Axelrod (1999); Glance and Huberman (1993, 1994); Ikegami and Kaneko (1990); Schweitzer (2002)].
Nowak and May (1992, 1993) introduced spatial evolutionary games and showed that cooperators and defectors can coexist in certain circumstances. Two types of players – cooperators, who always cooperate, and defectors, who always defect – are placed in a two-dimensional spatial array. In each round, every individual plays the PD game with its immediate neighbors. The selection scheme is that each lattice site is occupied either by its original owner or by one of the neighbors, whichever scored the highest total in that round; the game then proceeds to the next round. Simulation results show that in some cases cooperators remain a considerable percentage of the population: defectors can invade any part of the lattice but cannot occupy the whole area.
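One generation of this model can be sketched as follows (our own illustration; it assumes the eight surrounding neighbors and periodic boundaries, whereas Nowak and May also considered other neighborhood and boundary choices):

    import numpy as np

    def step(grid, T, R, P, S):
        # grid[i, j] = 1 for a cooperator, 0 for a defector
        n = grid.shape[0]
        payoff = {(1, 1): R, (1, 0): S, (0, 1): T, (0, 0): P}
        offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if di or dj]
        scores = np.zeros(grid.shape)
        for i in range(n):
            for j in range(n):
                scores[i, j] = sum(payoff[(grid[i, j], grid[(i + di) % n, (j + dj) % n])]
                                   for di, dj in offsets)
        new = grid.copy()
        for i in range(n):
            for j in range(n):
                # Each site is taken over by the highest scorer in its neighborhood
                best = max([(i, j)] + [((i + di) % n, (j + dj) % n) for di, dj in offsets],
                           key=lambda c: scores[c])
                new[i, j] = grid[best]
        return new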
When the parameters of the payoff matrix are set to T = 2.8, R = 1.1, P = 0.1, and S = 0, and the initial state is a random mixture of the two types of strategies, the evolutionary dynamics of the local interaction model lead to a state where every player chooses the strategy Defect, the only ESS in the Prisoner's Dilemma. Figure 2.1 shows the population converging to a state where everyone defects; no Cooperate strategy survives after five generations.
Fig. 2.1. Spatial Prisoner's Dilemma with the values T = 2.8, R = 1.1, P = 0.1, and S = 0; panels show generations 1, 2, 3, and 6 [Nowak and May (1993)].
However, when the parameters of the payoff matrix are set to T = 1.2, R = 1.1, P = 0.1, and S = 0, the evolutionary dynamics do not converge to the stable state of defection. Instead, they reach a stable oscillating state in which cooperators and defectors coexist and some regions are occupied in turn by different strategies.
Fig. 2.2. Spatial Prisoner's Dilemma with the values T = 1.2, R = 1.1, P = 0.1, and S = 0; panels show generations 1, 2, 19, and 20 [Nowak and May (1993)].
Moreover, when the parameters of the payoff matrix are set to T = 1.61, R = 1.01, P = 0.01, and S = 0, the evolutionary dynamics lead to a chaotic state: regions occupied predominantly by Cooperators may be successfully
invaded by Defectors, and regions occupied predominantly by Defectors
may be successfully invaded by Cooperators.
Fig. 2.3. Spatial Prisoner's Dilemma with the values T = 1.61, R = 1.01, P = 0.01, and S = 0; panels show generations 1, 3, 13, and 15 [Nowak and May (1993)].
If the starting configurations are sufficiently symmetrical, this spatial version of the PD game can generate chaotically changing spatial patterns in which cooperators and defectors both persist indefinitely. For example, set R = 1, P = 0.01, S = 0, and T = 1.4, and let the initial state be a 69 × 69 square lattice of cooperators with a single defector in the middle. The structure of the evolving lattice then varies like a kaleidoscope, and the ever-changing sequences of spatial patterns can be very beautiful, as shown in Fig. 2.4. The role of spatial interaction in the evolution of cooperation is further studied by Durrett and Levin (1998), Schweitzer, Behera, and Muhlenbein (2002), and Ifti, Killingback, and Doebeli (2004).
Fig. 2.4. Spatial Prisoner's Dilemma with the values T = 1.4, R = 1, P = 0.01, and S = 0; panels show generations 10, 40, 4000, and 6000. Blue, red, green, and yellow denote cooperators, defectors, new cooperators, and new defectors respectively [Nowak and May (1993)].
2.3.1. Evolutionarily stable strategy
Just like the Nash equilibrium in traditional game theory, the Evolutionarily Stable Strategy (ESS) is an important concept in the theoretical analysis of evolutionary games. According to Maynard Smith (1982), an ESS is a strategy such that, if all the members of a population adopt it, then no mutant strategy can invade the population under the influence of natural selection. The ESS can be seen as an equilibrium refinement of the Nash equilibrium. Suppose that a player in a game can choose between two strategies, I and J. Let E(J, I) denote the payoff he receives if he chooses strategy J while all other players choose I.
Then, the strategy I is evolutionarily stable if either
(1) E(I, I) > E(J, I), or
(2) E(I, I) = E(J, I) and E(I, J) > E(J, J)
holds for all J ≠ I [Maynard Smith and Price (1973); Maynard Smith (1982)].
Thomas (1985) rewrites the definition of the ESS in a different form. Following the terminology of the first definition above, strategy I is evolutionarily stable if
(1) E(I, I) ≥ E(J, I), and
(2) E(I, J) > E(J, J).
From this alternative form of the definition, we see that the ESSs form a subset of the Nash equilibria. The benefit of this refinement of the Nash equilibrium is not just that it eliminates weak Nash equilibria, but that it provides an efficient mathematical tool for dynamic games. Following the concept of the ESS, two approaches to evolutionary game theory have been developed.
The first approach directly applies the concept of ESS to analyze static
games. The second approach simulates the evolutionary process of dynamic
games by constructing a dynamic model, which may take into consideration
the factors of the population, replication dynamics, and strategy fitness.
As an example of using the ESS in static games, consider the Hawk-Dove game. Two types of animals employ different means to obtain resources (a favorable habitat, for example): Hawk always fights for the resource, while Dove never fights. Let V denote the value of the resource, which can be considered the Darwinian fitness gained by an individual obtaining the resource, as described by Maynard Smith (1982). Let E(H, D) denote the payoff to a Hawk against a Dove opponent. If we assume that (1) whenever two Hawks meet, conflict eventually results and the two individuals are equally likely to be injured, (2) the cost of the conflict reduces individual fitness by some constant value C, (3) when a Hawk meets a Dove, the Dove immediately retreats and the Hawk obtains the resource, and (4) when two Doves meet the resource is shared equally between them, the payoff matrix for the Hawk-Dove game is

              Hawk                         Dove
Hawk    ((V − C)/2, (V − C)/2)           (V, 0)
Dove    (0, V)                           (V/2, V/2)
It is then easy to verify that the strategy Dove is not an ESS, because E(D, D) < E(H, D): a pure population of Doves can be invaded by a Hawk mutant. When the value V of the resource is greater than the cost C of injury, the strategy Hawk is an ESS, because E(H, H) > E(D, H): a Dove mutant cannot invade a group of Hawks. If V < C, the Hawk-Dove game becomes the game of Chicken, named after the 1955 film Rebel Without a Cause. Neither pure Hawk nor pure Dove is an ESS in this game. However, there is an ESS if mixed strategies are permitted [Bishop and Cannings (1978)].
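These conditions can be checked mechanically. The small sketch below (ours) verifies that Hawk is an ESS when V > C, using illustrative values V = 4 and C = 2:

    def is_ess(E, I, J):
        # Maynard Smith's conditions: E(I,I) > E(J,I), or
        # E(I,I) = E(J,I) and E(I,J) > E(J,J)
        return E(I, I) > E(J, I) or (E(I, I) == E(J, I) and E(I, J) > E(J, J))

    V, C = 4.0, 2.0   # resource value exceeds the cost of injury
    def E(p, q):      # payoff to strategy p against strategy q
        table = {('H', 'H'): (V - C) / 2, ('H', 'D'): V,
                 ('D', 'H'): 0.0,         ('D', 'D'): V / 2}
        return table[(p, q)]

    print(is_ess(E, 'H', 'D'))   # True: Hawk resists invasion by Dove
    print(is_ess(E, 'D', 'H'))   # False: a pure Dove population can be invaded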
An evolutionarily stable state is a dynamical property of a population: the population returns to using a strategy, or mix of strategies, if it is perturbed from that strategy or mix [Maynard Smith (1982)]. A population playing an ESS must be evolutionarily stable, because it is impossible for any mutant to invade it. Many biologists and sociologists attempt to explain animal and human behavior and social structures in terms of ESSs [Cohen and Machalek (1988); Mealey (1995)]. However, a dynamic game does not necessarily converge to a stable state in which an ESS is prevalent. For example, using a spatial model in which each individual plays the Prisoner's Dilemma with his or her neighbors, Nowak and May (1992, 1993) show that the result of the game depends on the specific form of the payoff matrix.
Now imagine a population of players in a society where each one plays the Prisoner's Dilemma with the others, and whether one survives and breeds is determined by one's payoff in the game. How will the population evolve? To show the evolutionary process of the population, a model of dynamics that takes time t into consideration is needed.
2.3.2. Genetic algorithm
A genetic algorithm maintains a population of sample points from the search space. Each point is represented by a string of characters, known as a genotype [Holland (1975, 1992, 1995)]. Given a fitness function to evaluate the genotypes, a genetic algorithm initializes a population of solutions randomly and then improves it through repeated application of mutation, crossover, and selection operators.
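In outline, such a loop looks like the following sketch (our own, with a caller-supplied fitness function; the 70-bit representation mentioned in Section 2.2 is assumed only for concreteness):

    import random

    def evolve(fitness, genome_len=70, pop_size=20, generations=50):
        pop = [[random.randint(0, 1) for _ in range(genome_len)]
               for _ in range(pop_size)]
        for _ in range(generations):
            ranked = sorted(pop, key=fitness, reverse=True)
            parents = ranked[:pop_size // 2]           # keep the fitter half
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, genome_len)  # single-point crossover
                child = a[:cut] + b[cut:]
                child[random.randrange(genome_len)] ^= 1   # point mutation
                children.append(child)
            pop = parents + children
        return max(pop, key=fitness)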
The common methodology for studying evolutionary dynamics in games is through replicator equations. Replicator equations usually assume infinite populations, continuous time, complete mixing, and that strategies breed true [Taylor (1979); Maynard Smith (1982); Weibull (1995); Hofbauer and Sigmund (1998)]. Originating in biology and introduced into evolutionary game theory by Taylor and Jonker (1978), replicator equations provide a continuous dynamic model for evolutionary games.
Consider a population of n types of strategies, and let x_i be the frequency of type i. Let A be the n × n payoff matrix. Under the assumptions that the population is infinitely large, that strategies are completely mixed, and that the x_i are differentiable functions of time t, a strategy's fitness, or expected payoff, can be written as (Ax)_i if strategies meet one another randomly. The average fitness of the population as a whole can be written as x^T Ax. Then, the replicator equation is

    dx_i/dt = x_i ((Ax)_i − x^T Ax)                                (2.1)

Under the replicator dynamic (2.1), an evolutionary game converges toward a state in which strategies with above-average fitness flourish in the population.
For the Prisoner's Dilemma, the expected fitnesses of the strategies Cooperate and Defect, E_C and E_D respectively, are

    E_C = x_C R + x_D S,    E_D = x_C T + x_D P                    (2.2)

where x_C and x_D denote the proportions of the strategies Cooperate and Defect in the population respectively. Let E denote the average fitness of the entire population:

    E = x_C E_C + x_D E_D                                          (2.3)

Then, the replicator equations for this game are

    dx_C/dt = x_C (E_C − E),    dx_D/dt = x_D (E_D − E)            (2.4)

Since T > R and P > S, we have E_D − E_C = x_C (T − R) + x_D (P − S) > 0, and therefore E_D > E > E_C. It follows that dx_C/dt < 0 and dx_D/dt > 0. This means that the proportion of Cooperate strategies always declines while the proportion of Defect strategies increases as the game goes on. Sooner or later, the strategy Cooperate will, in theory, become extinct.
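Integrating (2.4) numerically confirms this; the few lines below are our own sketch using the standard payoffs:

    T, R, P, S = 5.0, 3.0, 1.0, 0.0
    xC, dt = 0.99, 0.01             # start with 99% cooperators

    for _ in range(100000):
        xD = 1.0 - xC
        EC = xC * R + xD * S        # fitness of Cooperate, eq. (2.2)
        ED = xC * T + xD * P        # fitness of Defect, eq. (2.2)
        Ebar = xC * EC + xD * ED    # average fitness, eq. (2.3)
        xC += dt * xC * (EC - Ebar) # replicator equation (2.4)

    print(xC)   # tends to 0: Cooperate goes extinct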
Besides replicator dynamics, there exist other types of dynamic equations that can be used in modeling evolutionary systems [Akin (1993); Thomas (1985); Bomze (1998, 2002); Balkenborg and Schlag (2000); Cressman, Garay and Hofbauer (2001); Weibull (1995); Hofbauer (1996); Gilboa and Matsui (1991); Matsui (1992); Fudenberg and Levine (1998); Skyrms (1990); Swinkels (1993); Smith and Gray (1994)]. Lindgren (1995) and Hofbauer and Sigmund (2003) give comprehensive reviews of them.
In general, dynamic games are of great complexity. How an evolutionary
system evolves depends not only on the population and dynamic structures
but also on where the evolution starts. Because of dynamic interactions
between multiple players, especially those players with intelligence, genetic
algorithms may converge towards local optima rather than the global opti-
mum. Also, operating on dynamic data sets is difficult as genomes begin to
converge early on towards solutions which may no longer be valid for later
data [Michalewicz (1999); Schmitt (2001)]. Analysis of evolutionary dynamic systems is not just a problem of evolutionary game theory, but a new
direction in applied mathematics [Garay and Hofbauer (2003); Gaunersdor-
fer (1992); Gaunersdorfer, Hofbauer, and Sigmund (1991); Hofbauer (1981,
1984, 1996); Krishna and Sjostrom (1998); Plank (1997); Smith (1995);
Zeeman (1993), Zeeman and Zeeman (2002, 2003)].
2.3.3. Strategies
Which strategies should be involved in evolutionary dynamics is a difficult question. One approach is to consider a large set of representative strategies, as in Axelrod (1984), Dacey and Pendegraft (1988), and Akimov and Soutchanski (1994), since it is impossible to enumerate all possible strategies. However, it is difficult to say which strategies should be included and which should not, and there is little comparability between evolutionary processes with different strategy sets, because the selection of strategies may greatly influence the outcome of the dynamics. Another approach is to study the interactions between specific strategies, as in Nowak and Sigmund (1990, 1992) and Goldstein and Freeman (1990). In this way, it is easier to make clear the relationships between strategies in the evolutionary process; however, some generality of complex evolutionary systems is lost.
Strategies in PD games (or in non-PD games) can be characterized as either deterministic or stochastic. Deterministic strategies leave nothing to chance and respond to the opponent with predetermined actions; stochastic strategies, however, leave some uncertainty in their choices.
Oskamp (1971) presents a thorough review of the early studies on the
strategies involved in PD games and non-PD games, for example AllD,
TFT, and lots of stochastic strategies that play C or D with some certain
probabilities [Lave (1965); Bixenstine, Potash, and Wilson (1963); Solomon
(1960); Crumbaugh and Evans (1967); Wilson (1969); Oskamp and Perlman
(1965); Sermat (1967); Heller (1967); Knapp and Podell (1968); Lynch
(1968); Swingle and Coady (1967); Whitworth and Lucker (1969)].
After Axelrod's IPD tournament, memory-one strategies that interact with the opponent according to both sides' behavior in the previous move became prevalent. TFT, Pavlov, Grim Trigger, and many other memory-one strategies have been analyzed in a variety of environments: round-robin tournaments, and evolutionary dynamics with or without noise [Nowak and Sigmund (1990, 1992, 1993); Pollock (1989); Wedekind and Milinski (1996); Milinski and Wedekind (1998); Sigmund (1995); Stephens (2000); Stephens, Mclinn and Stevens (2002); Sandholm and Crites (1996); Doebeli and Knowlton (1998); Brauchli, Killingback and Doebeli (1999); Sasaki, Taylor and Fudenberg (2000)].
No strategy has been shown to be superior in a dynamic environment,
and even deterministic cooperators can invade defectors in specific circum-
stances. It is not sensible to discuss which strategy is best unless the context
is defined. Comparing TFT with GTFT, Grim (1995) suggests that, in the
non-stochastic Axelrod models, it is TFT that is the general winner; within
a purely stochastic model, the greater generosity of GTFT pays off; in a
model with both stochastic and spatial elements, a level of generosity twice
that of GTFT proves optimal. Pavlov has an obvious advantage over TFT
in noisy environments [Nowak and Sigmund (1993); Kraines and Kraines
(1995)]. In an evolutionary process where AllC, AllD, TFT, and GTFT
strategies are involved, evolution starts off toward defection but then veers
toward cooperation. TFT strategies play a key role in invading the pop-
ulation of defectors. However, GTFT strategies and then more generous
AllCs gradually become dominant once cooperation is widely established,
and this provides an opportunity to AllD to invade again [Nowak and Sig-
mund (1992)]. Additionally, Selten and Stoecker (1986) have studied the
end game behavior in finite IPD supergames, and find that cooperative
behaviors last until shortly before the end of the supergame.
Machine Learning approaches have been introduced into evolutionary
game theory to develop adaptive strategies, especially those for IPD games
[Carmel and Markovitch (1996, 1997, 1998); Littman (1994); Tekol and
Acan (2003); Hingston and Kendall (2004)]. Adaptive strategies, at least
in theory, have obvious advantages over fixed strategies. Among the set of adaptive strategies there may be an evolutionarily stable strategy for IPD games, and a potential winner of future IPD tournaments.
2.3.4. Population
Population size and structure are of great importance in evolutionary dynamics. In general, evolutionary processes in large populations are quite different from those in small populations [Maynard Smith (1982); Fogel and Fogel (1995); Fogel, Fogel and Andrew (1997, 1998); Ficici and Pollack (2000)].
Young and Foster (1991) have studied stochastic effects in a population
consisting of three strategies: AllD, AllC, and TFT. They show that the
outcome of the evolutionary process depends crucially on the amount of
noise, which is inversely proportional to the population size. The more
people there are, the more that random variations in their behavior are
smoothed out in the population proportions. For large populations, the
system tends to drift from TFT to AllC, which is then invaded by AllD.
As a result, most of the players behave as AllD, even though initially most
players may have started as TFT. They conclude that cooperation is viable
in the short run, but not stable in the long run in a large population.
Boyd and Richerson (1988, 1989) suggest that reciprocity is unlikely to evolve in large groups as a result of natural selection, because reciprocators punish defection by withholding future cooperation, which also penalizes other cooperators in the group. Boyd and Richerson (1990, 1992) analyze a model in which the punishment response to defection is directed solely at defectors. In this model, cooperation reinforced by retribution can lead to the evolution of cooperation in different ways. Strategies which cooperate and punish defectors, strategies which cooperate only if punished, and strategies which cooperate but do not punish may coexist in the long run, or a single type may prevail. As the group size grows larger, however, the conditions for cooperators to survive become more stringent.
Glance and Huberman (1994) discuss how to achieve cooperation in groups of various sizes in n-person PD games and find that there are two stable points in large groups: either a great deal of cooperation or very little. Cooperation is more likely in smaller groups than in larger ones, and there is greater cooperation when players are allowed more communication with each other. Large random fluctuations are related to group size. Groups beyond a certain size may experience increased difficulty of informational exchange and coordination; further, reneging on contracts may become prevalent, as each member may expect the effect of his or her action on other members to be diluted. However, Dugatkin (1990) finds that cooperation may invade large populations more easily than smaller ones, though it is likely to represent a smaller proportion of the population in larger groups. To examine the relationship between population size and cooperative behaviour, two N-person game-theoretic models are presented. The results show that cooperation is frequently not a pure evolutionarily stable strategy, and that many metapopulations should be polymorphic for both cooperators and defectors.
It is well accepted that communication among members of a society leads to more cooperative behavior [Insko et al. (1987); Orbell, Kragt, and Dawes (1988)]. Insko et al. (1987, 1988, 1990, 1993) explore the role of communication in interindividual-intergroup discontinuity in the context of an extended PD game that adds a third, withdrawal choice to the usual cooperative and uncooperative choices. Interindividual-intergroup discontinuity is the tendency of intergroup relations to be more competitive and less cooperative than interindividual relations. The lesser tendency of individuals to cooperate when there is no communication with the opponent partially explains the group discontinuity.
Choice and refusal of partners may accelerate the emergence of cooperation. Experiments have shown that people who are given the option of playing or not are more likely to choose to play if they are themselves planning to cooperate, and more cooperative players are more likely to anticipate that others will be cooperative [Orbell and Dawes (1993)]. Defecting players may be shunned by cooperators [Schuessler (1989); Kitcher (1992); Batali and Kitcher (1994)]. In the N-person PD game, players may change groups if they are dissatisfied with the size of their groups [Hirshleifer and Rasmusen (1989)]. The option of choice and refusal of partners in the IPD means that players will attempt to select partners rationally. Analytical studies reveal that the subtle interplay between choice and refusal in N-player IPD games can result in various long-run player interaction patterns: mutual cooperation; mixed mutual cooperation and mutual defection; parasitism; and wallflower seclusion. Simulation studies
indicate that choice and refusal can accelerate the emergence of coopera-
tion in evolutionary IPD games [Stanley, Ashlock, and Tesfatsion (1994);
Stanley, Ashlock and Smucker (1995)].
The effects of freedom to play, reciprocity and interchange, coalitions
and alliances, and various sizes of groups on evolution are also studied [Or-
bell and Robyn (1993); Alexander and Frans (1992); Glance and Bernardo
(1994); Hemelrijk (1991)]. In a specific scenario, the prestructuration of the
population may determine the evolution of the patterns of interaction that
constitute the final social structure [Eckert, Koch, and Mitlohner (2005)].
2.3.5. Selection scheme
Evolutionary selection schemes can be characterized as either generational or steady-state schemes [Thierens (1997)]. In generational schemes, which are widely used in evolutionary game theory, each generation of a population is replaced in one step by a new generation. In a system with a steady-state scheme, only a small percentage of the population is replaced in each generation. Evolutionary selection schemes can be further subdivided into pure or elitist selection schemes according to whether or not there is an overlap between successive generations. Pure selection schemes allow no overlap between successive generations: all parents from the previous generation are discarded and the next generation is filled entirely with their offspring. In elitist schemes, successive generations may overlap: parents with higher fitness are carried over to the next generation and only poorly performing parents are replaced [Mitchell (1996)].
Pure selection schemes are commonly used in IPD research [Axelrod (1987); Axelrod and Dion (1988); Huberman and Glance (1993); Akimov and Soutchanski (1994); Miller (1996)]. These schemes use fitness-proportional selection of the parents in combination with single-point crossover, or select the fittest agents from a simple uniform random sample to produce offspring. A robust society of cooperators emerges only if the level of competition between the players is neither too small nor too large.
In elitist selection schemes, the population is first shuffled randomly and partitioned into pairs of parents. Each pair of parents then creates two offspring, and a local competition between parents and their offspring is held. Finally, the best two players of each parental pair are transferred to the next generation [Thierens and Goldberg (1994)]. In this case, stable societies of highly cooperative players evolve. This shows that a suitable model of the selection process is of crucial importance when simulating real-world economic situations [Ficici, Melnik, and Pollack (2000); Bragt, Kemenade and Poutre (2001)].
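In code, one generation of this elitist scheme can be sketched as follows (our own rendering of the description above; fitness and make_offspring are caller-supplied):

    import random

    def elitist_generation(pop, fitness, make_offspring):
        random.shuffle(pop)                        # shuffle and pair the parents
        next_gen = []
        for i in range(0, len(pop) - 1, 2):
            a, b = pop[i], pop[i + 1]
            kids = [make_offspring(a, b), make_offspring(a, b)]
            family = sorted([a, b] + kids, key=fitness, reverse=True)
            next_gen.extend(family[:2])            # best two of each family survive
        return next_gen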
Selection is clearly an important genetic operator, but opinion is di-
vided over the importance of crossover versus mutation. Some argue that
crossover is the most important, while mutation is only necessary to ensure
that potential solutions are not lost [Grefenstette, Ramsey and Schultz
(1990); Wilson (1987)]. Others argue that crossover in a largely uniform
population only serves to propagate innovations originally found by muta-
tion, and in a non-uniform population crossover is nearly always equivalent
to a very large mutation [Spears (1992)].
2.4. Evolution of Cooperation
A fundamental problem in evolutionary game theory is to explain how cooperation can emerge in a population of self-interested individuals. Axelrod (1984, 1987) attributes the emergence of cooperation to the “shadow of the future”: the likelihood and importance of future interaction. This implies that the rewards from cooperation are mutually expected, so that cooperation is a rational choice for self-interested individuals [Martinez-Coll and Hirshleifer (1991)]. Axelrod's work has been subjected to a number of criticisms because his conclusions conflict with traditional game theory [Binmore (1994, 1998)], as in Nachbar's criticism that “Axelrod mistakenly ran an evolutionary simulation of the finitely repeated Prisoners' Dilemma. Since the use of a Nash equilibrium in the finitely repeated Prisoners' Dilemma necessarily results in both players always defecting, we then wouldn't need a computer simulation to know what would survive if every strategy were present in the initial population of entries. The winning strategies would never co-operate.” [Nachbar (1992)].
There are also arguments that the conflict stems from the assumption of Von Neumann-Morgenstern utility. According to Spiro (1988), the problem with Axelrod's argument is the oft-discussed problem of interpersonal utility comparison. Axelrod's argument, and in fact all game-theoretic modeling, welfare economics, and utilitarian moral philosophy, would require that it be possible to measure and compare the utilities of different people. The problem with this assumption is that it is quite impossible to construct a scale of measurement for human preferences [Rothbard (1997)]. Although evolutionary game theory is aimed primarily at dynamic games, while traditional game theory deals with non-dynamic games, there are still areas of intersection, for instance in the field of repeated games.
Furthermore, although evolutionary game theory mainly depends on ex-
periments and computer simulations, its theoretical foundations, i.e. indi-
vidual utility (or preference) and payoff-maximizing, stem from traditional
game theory. Controversies about Axelrod’s work reflect the bifurcation
between evolutionary approaches and the basic assumptions of game the-
ory. Based on the assumption of “rational players”, traditional game theory
regards a finite repeated game as a combination of many singleton games.
“Backward induction” is applied in order to dissect the link between these
singleton games, and then each of them can be analyzed statically [Harsanyi
and Selten (1988)]. The concept of backward induction was first employed
by Von Neumann and Morgenstern (1944) and then developed by Selten
(1965, 1975) based on Nash equilibrium. First, one determines the optimal
strategy of the player who makes the last move of the game. Then, the
optimal action of the next-to-last moving player is determined taking the
last player’s action as given. The process continues in this way backwards
through time until all players’ actions have been determined. Subgame
perfect Nash equilibrium deduced directly from backward induction is an
equilibrium such that players’ strategies constitute a Nash equilibrium in
every subgame of the original game [Aumann (1995)]. Selten proved that
any game which can be broken into “sub-games” containing a sub-set of all
the available choices in the main game will have a subgame perfect Nash
equilibrium. In the case of a finite number of iterations in IPD games, the
unique subgame perfect Nash equilibrium is AllD. However, many psycho-
logical and economic experiments have shown that subjects would not nec-
essarily apply a strategy like AllD [Kahn and Murnighan (1993); McKelvey
and Palfrey (1992); Cooper et al. (1996)]. Game theorists explain these
experimental results in terms of incomplete information, reputation, and
bounded rationality, which are all based on theoretical analysis [Harsanyi
(1967); Kreps et al. (1982); Simon (1990); Bolton (1991); Bolton and Ock-
enfels (2000); Binmore et al. (2002); Samuelson (2001)]. In some sense, Axelrod's work parallels these explanations, though his approach seems entirely different. Until a sound theoretical explanation is established, the problem of how cooperation emerges remains unsolved.
As to the problem of how cooperation can persist during evolution, sufficient evidence has been provided to support the point that cooperation can survive and flourish in a wide range of circumstances, provided certain conditions are satisfied. Nowak and Sigmund (1990) have shown that cooperation can
emerge among a population of randomly chosen reactive strategies, as long
as a stochastic version of TFT is added to the population. If cooperators
can recognize each other with the help of some label they can increase their
payoff by interacting selectively with one another [Frank (1988)]. Social
norms aid in cooperation in many ways [Bendor and Mookherjee (1990);
Kandori (1992); Sethi and Somanathan (1996)]. As to the influence of payoff variations, Mueller (1988) finds that payoff settings with increasing values of T relative to P promote cooperative behaviour, whereas Fogel (1993) finds that smaller values of T promote the evolution of cooperative behaviour. Nachbar (1992) selects a payoff setting strongly favouring the relative reward of cooperating and finds that this setting elicits an increased degree of cooperation. Kirchkamp (1995) finds that the value of S
becomes less important with longer memory. Also, the effects of popula-
tion structure, repetition, and noise have been studied [Hirshleifer and Coll
(1988); Mueller (1988); Boyd (1989); Marinoff (1992); Hoffmann (2001)].
To end, we note that Binmore (1998) stated:
“. . .One simply cannot get by without learning the underlying theory.
Without any knowledge of the theory, one has no way of assessing the
reliability of a simulation and hence no idea of how much confidence to
repose in the conclusions that it suggests”.
There is still a need for an underlying theory for IPD tournaments. Evo-
lutionary game theory has provided us with many experimental approaches;
however, better theoretical explanations are still needed. Even though IPD
tournaments have been run for over 40 years, we suspect there will be more
as we search for new strategies and new theories which explain the complex
interactions that take place.
Finally, this review has been restricted to the IPD literature. Even so,
we have not been able to include every article and there are, no doubt, omissions. However, we hope that this chapter has provided enough information for the interested reader to follow up.
References
Akimov V. and Soutchanski M. (1994) Automata simulation of N-person social
dilemma games, Journal of Conflict Resolution, 38, pp. 138-148.
Akin E. (1993) The general topology of dynamical systems, American Mathematical Society, Providence.
Akiyama E. and Kaneko K. (1995) Evolution of cooperation, differentiation, com-
plexity and diversity in an iterated three-person game, Artificial Life, 2,
pp. 293-304.
Alexander H. and Frans B. (1992) Coalitions and Alliances in Humans and Other
Animals. Oxford: Oxford University Press.
Anthonisen N. (1999) Strong rationalizability for two-player noncooperative
games, Economic Theory, 13, pp. 143-169.
Aumann R. (1995) Backward Induction and Common Knowledge of Rationality,
Games and Economic Behavior, 18, pp. 6-19.
Axelrod R. (1980a) Effective choice in the prisoner’s dilemma, Journal of Conflict
Resolution, 24, pp. 3-25.
Axelrod R. (1980b) More effective choice in the prisoner’s dilemma, Journal of
Conflict Resolution, 24, pp. 379-403.
Axelrod R. (1984) The Evolution of Cooperation. Basic Books, New York.
Axelrod R. (1987) The evolution of strategies in the iterated prisoner’s dilemma,
In Davis L., Genetic Algorithms and Simulated Annealing, pp. 32-41.
Axelrod R. (1999) The Complexity of Cooperation: Agent-based Models of Com-
petition and Collaboration. University Press, Princeton, NJ.
Axelrod R. and Dion D. (1988) The further evolution of cooperation, Science,
242, pp. 1385-1390.
Axelrod R. and Hamilton W. (1981) The evolution of cooperation, Science, 211,
4489, pp. 1390-1396.
Balkenborg D. and Schlag K. (2000) Evolutionarily stable sets, International
Journal of Game Theory, 29, pp. 571-595.
Batali J. and Kitcher P. (1994) Evolutionary dynamics of altruistic behaviour
in optional and compulsory versions of the iterated prisoner’s dilemma, In
Rodney A. and Maes P. Artificial Life IV. MIT Press, pp. 343-348.
Beaufils B., Delahaye J., and Mathieu P. (1996) Our meeting with gradual: A
good strategy for the iterated prisoner’s dilemma, Proceedings of the Arti-
ficial Life V, pp. 202-209.
Becker N. and Cudd A. (1990) Indefinitely repeated games: a response to Carroll,
Theory and Decision, 28, pp. 189-195.
Bendor J. and Mookherjee D. (1990) Norms, third-party sanctions, and cooper-
ation, Journal of Law, Economics, and Organization, 6, pp. 33-63.
Bendor J., Kramer R., and Stout S. (1991) When in doubt: cooperation in a
noisy prisoner’s dilemma, Journal of Conflict Resolution, 35, pp. 691-719.
Ben-porath E. (1990) The complexity of computing a best response automaton
in repeated games with mixed strategies, Games and Economic Behavior,
2, pp. 1-12.
Berry D. and Fristedt B. (1985) Bandit problems: sequential allocation of experi-
ments. Chapman and Hall, London.
Binmore K. (1992) Fun and games. Lexington, MA: D.C. Heath and Company.
Binmore K. (1994) Game theory and the social contract, Vol. 1: Playing fair. MIT Press.
Binmore K. (1997) Rationality and backward induction, Journal of Economic
Methodology, 4, pp. 23-41.
Binmore K. (1998) Review of R. Axelrod’s ‘The complexity of cooperation: agent
based models of competition and collaboration’, Journal of Artificial Soci-
eties and Social Simulation, 1, 1.
Binmore K., McCarthy J., Ponti G., Samuelson L. and Shaked A. (2002) A back-
ward induction experiment, Journal of Economic Theory, 104, pp. 48-88.
Bishop, D. and Cannings, C. (1978) A generalized war of attrition, Journal of
Theoretical Biology, 70, pp. 85-124.
Bixenstine V., Potash H., and Wilson K. (1963) Effects of level of cooperative
choice by the other player on choices in a Prisoner’s Dilemma game, Journal
of Abnormal and Social Psychology, 66, pp. 308-313.
Bolton G. (1991) A comparative model of bargaining: theory and evidence, The
American Economic Review, 81, 5, pp. 1096-1136.
Bolton G. and Ockenfels A. (2000) ERC: a theory of equity, reciprocity, and
competition, The American Economic Review, 90, pp. 166-193.
Bomze I. (1998) Uniform barriers and evolutionarily stable sets, Game Theory,
Experience, Rationality, pp. 225-244.
Bomze I. (2002) Regularity vs. degeneracy in dynamics, games, and optimization:
a unified approach to different aspects, SIAM Review, 44, pp. 394-414.
Boyd R. (1989) Mistakes allow evolutionary stability in the repeated prisoner’s
dilemma game, Journal of Theoretical Biology, 136, 11, pp. 47-56.
Boyd R. (1992) The evolution of reciprocity when conditions vary, In Harcourt
A. and de Waal F. (eds.) Coalitions and Alliances in Humans and Other
Animals. Oxford: Oxford University Press, pp. 473-489.
Boyd R. and Lorberbaum J. (1987) No pure strategy is evolutionarily stable in
the repeated Prisoner’s Dilemma game, Nature, 327, pp. 58-59.
Boyd R. and Richerson P. (1988) The evolution of reciprocity in sizable groups,
Journal of Theoretical Biology, 132, pp. 337-356.
Boyd R. and Richerson P. (1989) The evolution of indirect reciprocity, Social
Networks, 11, pp. 213-236.
Boyd R. and Richerson P. (1990) Group selection among alternative evolutionarily
stable strategies. Journal of Theoretical Biology, 145, pp. 331-342.
Boyd R. and Richerson P. (1992) Punishment allows the evolution of cooperation
(or anything else) in sizable groups, Ethology and Sociobiology, 13, pp. 171-
195.
Bovens L. (1997) The backward induction argument for the finite iterated
prisoner's dilemma and the surprise exam paradox, Analysis, 57, 3, pp. 179-186.
Bragt D., Kemenade C. and Poutre H. (2001) The influence of evolutionary selec-
tion schemes on the iterated prisoner’s dilemma, Computational Economics,
17, pp. 253-263.
Brauchli K., Killingback T. and Doebeli M. (1999) Evolution of cooperation
in spatially structured populations, Journal of Theoretical Biology, 200,
pp. 405-417.
Brelis M. (1992) Reputed mobster defends his honor. Boston Globe, 1, pp. 23.
Bunn G. and Payne R. (1988) Tit-for-tat and the negotiation of nuclear arms
control, Arms Control, 9, pp. 207-233.
Carmel D. and Markovitch S. (1996) Learning models of intelligent agents, Pro-
ceedings of the 13th National Conference on Artificial Intelligence and the
8th Innovative Applications of Artificial Intelligence Conference, 2, pp. 62-
67.
Carmel D. and Markovitch S. (1997) Model-based learning of interaction strate-
gies in multi-agent systems, Journal of Experimental and Theoretical Arti-
ficial Intelligence, 10, 3, pp. 309-332.
Carmel D. and Markovitch S. (1998) How to explore your opponent’s strategy
(almost) optimally, Proceedings of the International Conference on Multi
Agent Systems, pp. 64-71.
Cohen L. and Machalek R. (1988) A general theory of expropriative crime: an
evolutionary ecological approach, American Journal of Sociology, 94, 3,
pp. 465-501.
Cooper R., Jong D., Forsythe R., and Ross T. (1996) Cooperation without repu-
tation: experimental evidence from prisoner’s dilemma games, Games and
Economic Behavior, 12, 2, pp. 187–218.
Cressman R., Garay J. and Hofbauer J. (2001) Evolutionary stability concepts for
N-species frequency-dependent interactions, Journal of Theoretical Biology,
211, pp. 1-10.
Croson R. (2000) Thinking like a game theorist: Factors affecting the frequency
of equilibrium play, Journal of Economic Behavior and Organization, 41,
3, pp. 299–314.
Crumbaugh C. and Evans G. (1967) Presentation format, other-person strategies,
and cooperative behaviour in the prisoner’s dilemma, Psychological Reports,
20, pp. 895-902.
Dacey R. and Pendegraft N. (1988) The optimality of Tit-For-Tat, International
Interactions, 15, pp. 45-64.
Darwen P. and Yao X. (1995) On evolving robust strategies for iterated prisoner’s
dilemma, Progress in Evolutionary Computation, volume 956 in Lecture
Notes in Artificial Intelligence, Springer, pp. 276-292.
Darwen P. and Yao X. (1996) Automatic modularization by speciation, IEEE
International Conference on Evolutionary Computation, pp. 88-93.
Darwen P. and Yao X. (2001) Why more choices cause less cooperation in Iterated
Prisoner’s Dilemma, Proceedings of the 2001 IEEE Congress on Evolution-
ary Computation.
Darwen P. and Yao X. (2002) Coevolution in iterated prisoner’s dilemma with
intermediate levels of cooperation: Application to missile defense, Interna-
tional Journal of Computational Intelligence and Applications, 2, 1, pp. 83-
107.
Davis D. and Holt C. (1999) Equilibrium cooperation in two-stage games: Exper-
imental evidence, International Journal of Game Theory, 28, 1, pp. 89-109.
Delahaye J. and Mathieu P. (1996) Étude sur les dynamiques du Dilemme Itéré
des Prisonniers avec un petit nombre de stratégies : Y a-t-il du chaos dans
le Dilemme pur ?, Publication Interne IT-294, Laboratoire d'Informatique
Fondamentale de Lille.
Doebeli M., Blarer A., and Ackermann M. (1997) Population dynamics, demo-
graphic stochasticity, and the evolution of cooperation, Proceedings of the
National Academy of Sciences of the USA, 94, pp. 5167-5171.
Doebeli M. and Knowlton N. (1998) The evolution of interspecific mutualisms,
Proceedings of the National Academy of Sciences, 95, 15, pp. 8676-8680.
Donninger C. (1986) Is it always efficient to be nice?, In Paradoxical effects of so-
cial behavior, edited by Dickmann A. and Mitter P., Heidelberg, Germany:
Physica Verlag, pp. 123-134.
Dugatkin L. (1989) N-person games and the evolution of cooperation: a model
based on predator inspection in fish, Journal of Theoretical Biology, 142,
pp. 123–135.
Dugatkin L. (1990) N-person Games and the Evolution of Co-operation: A Model
Based on Predator Inspection in Fish, Journal of Theoretical Biology, 142,
pp. 123-135.
Durrett R. and Levin S. (1998) Spatial aspects of interspecific competition, The-
oretical Population Biology, 53, 1, pp. 30-43.
Eckert D., Koch S., and Mitlohner J. (2005) Using the iterated prisoner’s dilemma
for explaining the evolution of cooperation in open source communities,
Proceedings of the First Conference on Open Source Systems, pp. 186-191.
Ficici S., Melnik O., and Pollack J. (2000) A game-theoretic investigation of
selection methods used in evolutionary algorithms, Proceedings of the 2000
Congress on Evolutionary Computation, 2, pp. 880-887.
Ficici S. and Pollack J. (2000) Effects of finite populations on evolutionary stable
strategies, Proceedings of the 2000 Genetic and Evolutionary Computation,
pp. 927-934.
Fogel D. (1993) Evolving behaviors in the iterated prisoner's dilemma, Evolution-
ary Computation, 1, 1, pp. 77-97.
Fogel D. and Fogel G. (1995) Evolutionary stable strategies are not always stable
under evolutionary dynamics, Evolutionary Programming IV, pp. 565-577.
Fogel D., Fogel G., and Andrews P. (1997) On the instability of evolutionary stable
strategies, BioSystems, 44, pp. 135-152.
Fogel G., Andrews P., and Fogel D. (1998) On the instability of evolutionary stable
strategies in small populations, Ecological Modelling, 109, pp. 283-294.
Frank R. (1988) Passions within reason. The strategic role of the emotions, New
York: W.W. Norton & Co.
Freund Y., Kearns M., Mansour Y., Ron D., Rubinfeld R., and Schapire R.
(1995) Efficient algorithms for learning to play repeated games against com-
putationally bounded adversaries, Proceedings of the Annual Symposium on
the Foundations of Computer Science, pp. 332–341.
Fudenberg D. and Maskin E. (1986) The Folk Theorem in repeated games with
discounting and incomplete information, Econometrica, 54, pp. 533–554.
Fudenberg D. and Maskin E. (1990) Evolution and cooperation in noisy repeated
games, New Developments in Economic Theory, 80, pp. 274-279.
Fudenberg D. and Levine D. (1998) The theory of learning in games. MIT Press.
Garay B. and Hofbauer J. (2003) Robust permanence for ecological differential
equations: minimax and discretizations, SIAM Journal on Mathematical
Analysis, 34, pp. 1007-1093.
Gaunersdorfer A. (1992) Time averages for heteroclinic attractors, SIAM Journal
on Applied Mathematics, 52, pp. 1476-1489.
Gaunersdorfer A., Hofbauer J., and Sigmund K. (1991) On the dynamics of asym-
metric games, Theoretical Population Biology, 39, pp. 345-357.
Gilboa I. and Matsui A. (1991) Social stability and equilibrium, Econometrica,
59, pp. 859-867.
Gilboa I. and Schmeidler D. (2001) A theory of case-based decisions. Cambridge
University Press.
Gittins J. (1989) Multi-armed bandit allocation indices. Wiley, Chichester, NY.
Glance N. and Huberman B. (1993) The outbreak of cooperation, Journal of
Mathematical Sociology, 17, 4, pp. 281-302.
Glance N. and Huberman B. (1994) The dynamics of social dilemmas, Scientific
American, 270, pp. 76-81.
Glomba M., Filak T., and Kwasnicka H. (2005) Discovering effective strategies for
the iterated prisoner’s dilemma using genetic algorithms, 5th International
Conference on Intelligent Systems Design and Applications, pp. 356-363.
Godfray H. (1992) The evolution of forgiveness, Nature, 355, pp. 206-207.
Goldstein J. and Freeman J. (1990) Three-Way Street: Strategic Reciprocity in
World Politics. Chicago: University of Chicago Press.
Grefenstette J., Ramsey C., and Schultz A. (1990) Learning sequential deci-
sion rules using simulation models and competition, Machine Learning, 5,
pp. 355-381.
Grim P. (1995) The greater generosity of the spatialized prisoner’s dilemma, Jour-
nal of Theoretical Biology, 173, pp. 242-248.
Grossman W. (2004) New tack wins Prisoner’s Dilemma, Wired News, Lycos.
Harborne S. (1997) Common belief of rationality in the finitely repeated prisoners’
dilemma, Games and Economic Behavior, 19, 1, pp. 133-143.
Harcourt A. and de Waal F. (eds.) (1992) Coalitions and Alliances in Humans
and Other Animals. Oxford: Oxford University Press.
Hardin G. (1968) The tragedy of the commons, Science, 162, pp. 1243-1248.
Hargreaves H. and Varoufakis Y. (1995) Game theory: a critical introduction.
Routledge, London.
Harsanyi J. (1967) Games with incomplete information played by Bayesian play-
ers, Management Science, 14, 3, pp. 159-182.
Harsanyi, J., and Selten, R. (1988) A General Theory of Equilibrium Selection in
Games. Cambridge: MIT Press.
Hauser M. (1992) Costs of deception: cheaters are punished in rhesus monkeys
(Macaca mulatta). Proceedings of the National Academy of Sciences, 89,
pp. 12137-12139.
Heller J. (1967) The effects of racial prejudice, feedback strategy, and race on
cooperative-competitive behaviour, Dissertation Abstracts, 27, pp. 2507-
2508.
Hemelrijk C. (1991) Interchange of ’Altruistic’ Acts as an Epiphenomenon. Jour-
nal of Theoretical Biology, 153, pp. 131-139.
Hingston P. and Kendall G. (2004) Learning versus evolution in iterated prisoner’s
dilemma, Proceedings of Congress on Evolutionary Computation, pp. 364-
372.
Hirshleifer J. and Coll J. (1988) What strategies can support the evolutionary
emergence of cooperation?, Journal of Conflict Resolution, 32, 2, pp. 367-
398.
Hirshleifer D. and Rasmusen E. (1989) Cooperation in a repeated prisoner’s
dilemma with ostracism, Journal of Economic Behavior and Organization,
12, pp. 87-106.
Hofbauer J. (1981) On the occurrence of limit cycles in the Volterra-Lotka equa-
tion, Nonlinear Analysis, 5, pp. 1003-1007.
Hofbauer J. (1984) A difference equation model for the hypercycle, SIAM Journal
on Applied Mathematics, 44, pp. 762-772.
Hofbauer J. (1996) Evolutionary dynamics for bimatrix games: a Hamiltonian
system, Journal of Mathematical Biology, 34, pp. 675-688.
Hofbauer J. and Sigmund K. (1998) Evolutionary games and population dynamics.
Cambridge University Press.
Hofbauer J. and Sigmund K. (2003) Evolutionary game dynamics, Bulletin of the
American Mathematical Society, 40, pp. 479-519.
Hoffmann R. (2001) The ecology of cooperation, Theory and Decision, 50,
pp. 101-118.
Holland J. (1975) Adaptation in Natural and Artificial Systems. University of
Michigan Press, Ann Arbor.
Holland J. (1992) Genetic algorithm, Scientific American, 267, 4, pp. 44-50.
Holland J. (1995) Hidden Order - How adaptation builds complexity, Reading,
Mass.: Addison-Wesley.
Huberman B. and Glance N. (1993) Evolutionary games and computer simula-
tions, Proceedings of the National Academy of Sciences, 90, pp. 7716-7718.
Ifti M., Killingback T., and Doebeli M. (2004) Effects of neighborhood size and
connectivity on the spatial continuous prisoner’s dilemma, Journal of The-
oretical Biology, 231, pp. 97-106.
Ikegami T. and Kaneko K. (1990) Computer symbiosis - emergence of symbiotic
behavior through evolution, Physica D, 42, pp. 235-243.
Insko C., Pinkley R., Hoyle R., Dalton B., Hong G., Slim R., Landry P., Holton
B., Ruffin P., and Thibaut J. (1987) Individual-group discontinuity: the
role of intergroup contact, Journal of Experimental Social Psychology, 23,
pp. 250-267.
Insko C., Hoyle R., Pinkley R., and Hong G. (1988) Individual-group discontinu-
ity: the role of a consensus rule, Journal of Experimental Social Psychology,
24, pp. 505-519.
Insko C., Schopler J., Hoyle R., Dardis G., and Graetz K. (1990) Individual-group
discontinuity as a function of fear and greed, Journal of Personality and
Social Psychology, 58, pp. 68-79.
Insko C., Schopler J., Drigotas S., Graetz K., Kennedy J., Cox C., and Bornstein
G. (1993) The role of communication in interindividual-intergroup discon-
tinuity, Journal of Conflict Resolution, 37, pp. 108-138.
Kaelbling L. (1993) Learning in embedded systems. The MIT Press, Cambridge,
MA.
Kaelbling L., Littman M., and Moore A. (1996) Reinforcement learning: a survey, Journal of
Artificial Intelligence Research, 4, pp. 237-285.
Kagel J. and Roth A. (1995) The Handbook of Experimental Economics. Princeton
University Press.
Kahn L. and Murnighan J. (1993) Conjecture, uncertainty, and cooperation in
Prisoners’ Dilemma games: Some Experimental Evidence, Journal of Eco-
nomic Behavior and Organisms, 22, pp. 91–117.
Kalai E. and Lehrer E. (1993) Rational learning leads to Nash equilibrium,
Econometrica, 61, 5, pp. 1019-1045.
Kandori M. (1992) Social norms and community enforcement, The Review of
Economic Studies, 59, 1, pp. 63-80.
Katok A. and Hasselblatt B. (1996) Introduction to the modern theory of dynam-
ical systems. Cambridge University Press.
Kavka G. (1986) Hobbesean Moral and Political Theory. Princeton: Princeton
University Press.
Kirchkamp O. (1995) Spatial Evolution of Automata in the Prisoners’ Dilemma.
University of Bonn SFB 303, Discussion Paper B-330.
Kitcher P. (1992) Evolution of altruism in repeated optional games, Working
Paper of University of California at San Diego.
Knapp W. and Podell J. (1968) Mental patients, prisoners, and students with
simulated partners in a mixed-motive game, Journal of Conflict Resolution,
12, pp. 235-241.
Komorita S., Sheposh J., and Braver S. (1968) Power, the use of power, and
cooperative choice in a two-person game, Journal of Personality and Social
Psychology, 8, pp. 134-142.
Kraines D. and Kraines V. (1995) Evolution of learning among Pavlov strategies in
a competitive environment with noise, The Journal of Conflict Resolution,
39, 3, pp. 439-466.
Kraines D. and Kraines V. (2000) Natural selection of memory-one strategies
for the iterated Prisoner’s Dilemma, Journal of Theoretical Biology, 203,
pp. 335-355.
Kreps D., Milgrom P., Roberts J., and Wilson R. (1982) Rational cooperation in
the finitely repeated prisoner’s dilemma, Journal of Economic Theory, 27,
pp. 245–252.
Kreps, D., and Wilson R. (1982) Reputation and imperfect information, Journal
of Economic Theory, 27, pp. 253–279.
Krishna V. and Sjostrom T. (1998) On the convergence of fictitious play, Mathe-
matics of Operations Research, 23, pp. 479-511.
Lave L. (1965) Factors affecting cooperation in the prisoner’s dilemma, Behavioral
Science, 10, pp. 26-38.
Lindgren K. (1991) Evolutionary phenomena in simple dynamics, In Langton
C. et al. (eds.) Santa Fe Institute Studies in the Sciences of Complexity, 10,
pp. 295-312.
Lindgren K. (1992) Evolutionary phenomena in simple dynamics, In Langton C.
(ed.) Artificial Life II. Addison-Wesley.
Lindgren K. (1995) Evolutionary dynamics in game-theoretic models, The econ-
omy as an evolving complex system II, Santa Fe Institute.
Littman M. (1994) Markov games as a framework for multiagent reinforcement
learning, Proceedings of the 11th International Conference on Machine
Learning, pp. 157-163.
Luce R. and Raiffa H. (1957) Games and decisions. New York: Wiley.
Lynch G. (1968) Defense preference and cooperation and competition in a game,
Dissertation Abstracts, 29, p. 1174.
Manarini S. (1998) The prisoner’s dilemma, experiments for the study of coopera-
tion. Strategies, theories and mathematical models, Ph.D. thesis, University
of Padova.
Marinoff L. (1992) Maximizing expected utilities in the Prisoner’s Dilemma, Jour-
nal of Conflict Resolution, 36, 1, pp. 183-216.
Martinez-Coll J. and Hirshleifer J. (1991) The limits of reciprocity, Rationality
and Society, 3, pp. 35-64.
Matsui A. (1992) Best response dynamics and socially stable strategies, Journal
of Economic Theory, 57, pp. 343-362.
May R. (1987) More evolution of cooperation, Nature, 327, pp. 15-17.
Maynard Smith J. and Price G. (1973) The logic of animal conflict, Nature, 246,
pp. 15-18.
Maynard Smith J. (1982) Evolution and the Theory of Games, Cambridge Uni-
versity Press.
McKelvey R. and Palfrey T. (1992) An experimental study of the centipede game,
Econometrica, 60, pp. 803-836.
Mealey L. (1995) The sociobiology of sociopathy: an integrated evolutionary
model, Behavioral and Brain Sciences, 18, 3, pp. 523-599.
Michalewicz Z. (1999) Genetic Algorithms + Data Structures = Evolution Pro-
grams, Springer-Verlag.
Micko H. (1997) Benevolent tit for tat strategies with fixed intervals between
offers of cooperation, Meeting of Experimental Psychologists, pp. 250-256.
Micko H. (2000) Experimental matrix games, In Open and Distance Learning -
Mathematical Psychology, Institut für Sozial- und Persönlichkeitspsychologie,
Universität Bonn.
Milgrom, P. and Roberts J. (1982): Predation, reputation and entry deterrence,
Journal of Economic Theory, 27, pp. 280-312.
Milinski M. (1993) Cooperation wins and stays, Nature, 364, pp. 12-13.
Milinski M. and Wedekind C. (1998) Working memory constrains human coop-
eration in the prisoner’s dilemma, Proceedings of the National Academy of
Sciences of the United States of America, 95, 23, pp. 13755-13758.
Miller J. (1996) The coevolution of automata in the repeated prisoner’s dilemma,
Journal of Economic Behavior and Organization, 29, pp. 87-112.
Mitchell M. (1996) An introduction to Genetic Algorithms. The MIT Press, Cam-
bridge MA.
Molander P. (1985) The optimal level of generosity in a selfish, uncertain envi-
ronment, Journal of Conflict Resolution, 29, pp. 611-618.
Moore A. and Atkeson C. (1993) Prioritized sweeping: reinforcement learning
with less data and less real time, Machine Learning, 13, pp. 103-130.
Mueller U. (1988) Optimal retaliation for optimal cooperation, Journal of Conflict
Resolution, 31, 4, pp. 692-724.
Myerson R. (1991) Game Theory, Analysis of Conflict. Cambridge, Harvard Uni-
versity Press.
Nachbar J. (1992) Evolution in the finitely repeated Prisoners’ Dilemma, Journal
of Economic Behavior and Organization, 19, pp. 307-326.
Narendra K. and Thathachar M. (1989) Learning automata: an introduction.
Prentice-Hall, Englewood Cliffs, NJ.
Nash J. (1950) Equilibrium points in n-person games, Proceedings of the National
Academy of Sciences of the USA, 36, 1, pp. 48-49.
Nash J. (1951) Non-cooperative games, The Annals of Mathematics, 54, 2,
pp. 286-295.
Nash J. (1996) Essays on Game Theory. Elgar. Cheltenham.
Nöldeke G. and Samuelson L. (1993) An evolutionary analysis of backward and
forward induction, Games and Economic Behavior, 5, pp. 425-454.
Nowak M., Bonhoeffer S., and May R. (1994) More spatial games, International
Journal of Bifurcation and Chaos, 4, 1, pp. 33-56.
Nowak M. and May R. (1992) Evolutionary games and spatial chaos, Nature,
359, pp. 826-829.
Nowak M. and May R. (1993) The spatial dilemmas of evolution, International
Journal of Bifurcation and Chaos, 3, pp. 35-78.
Nowak M., Sasaki A., Taylor C. and Fudenberg D. (2004) Emergence of coopera-
tion and evolutionary stability in finite populations, Nature, 428, pp. 646-650.
Nowak M. and Sigmund K. (1990) The evolution of stochastic strategies in the
prisoner’s dilemma, Acta Applicandae Mathematicae, 20, pp. 247-265.
Nowak M. and Sigmund K. (1992) Tit for tat in heterogeneous populations, Na-
ture, 359, pp. 250-253.
Nowak M. and Sigmund K. (1993) A strategy of win-stay lose-shift that outper-
forms Tit-for-Tat in the Prisoner’s Dilemma game, Nature, 364, pp. 56-58.
Nowak M., Sigmund K. and El-Sedy E. (1995) Automata, repeated games, and
noise, Journal of Mathematical Biology, 33, pp. 703-722.
Orbell J., Kragt A., and Dawes R. (1988) Explaining discussion-induced cooper-
ation, Journal of Personality and Social Psychology, 54, pp. 811-819.
Orbell J. and Dawes R. (1993) Social welfare, cooperators' advantage, and the
option of not playing the game, American Sociological Review, 58, pp. 787-
800.
Oskamp S. (1971) Effects of programmed strategies on cooperation in the pris-
oner’s dilemma and other mixed-motive games, The Journal of Conflict
Resolution, 15, 2, pp. 225-259.
Oskamp S. and Perlman D. (1965) Factors affecting cooperation in a prisoner’s
dilemma game, Journal of Conflict Resolution, 9, pp. 359-374.
Plank M. (1997) Some qualitative differences between the replicator dynamics of
two player and n player games, Nonlinear Analysis, 30, pp. 1411-1417.
Pollock G. (1989) Evolutionary Stability of Reciprocity in a Viscous Lattice.
Social Networks, 11, pp. 175-212.
Posch M. (1997) Win Stay–Lose Shift: An Elementary Learning Rule
for Normal Form Games, Working Paper of Santa Fe Institute,
http://ideas.repec.org/p/wop/safire/97-06-056e.html.
Prisoner’s dilemma tournament result (2004) http://www.prisoners-dilemma.
com/results/cec04/ipd cec04 full run.html.
Prisoner’s dilemma tournament result (2005) http://www.prisoners-dilemma.
com/results/cig05/cig05.html.
Radner R. (1980) Collusive behaviour in non-cooperative epsilon-equilibria in
oligopolies with long but finite lives, Journal of Economic Theory, 22,
pp. 136-154.
Radner R. (1986) Can bounded rationality resolve the prisoner's dilemma?, In
Mas-Colell A. and Hildenbrand W. (eds.) Essays in Honor of Gerard Debreu,
pp. 387-399.
Rapoport A. (1966) Optimal policies for the prisoner’s dilemma, Technical Report
No. 50, Psychometric Laboratory, University of North Carolina, MH-10006.
Rapoport A. (1999) Two-person Game Theory. Dover Publications, New York.
Rapoport A. and Chammah A. (1965) Prisoner's dilemma: a study in conflict and
cooperation. Ann Arbor: University of Michigan Press.
Rothbard M. (1997) Toward a Reconstruction of Utility and Welfare Economics,
In The Logic of Action One: Method, Money, and the Austrian School,
pp. 211-255.
Rubinstein A. (1979) Equilibrium in supergames with the overtaking criterion,
Journal of Economic Theory, 21, pp. 1-9.
Rubinstein A. (1998) Modeling bounded rationality. The MIT Press.
Samuelson L. (2001) Introduction to the evolution of preferences, Journal of Eco-
nomic Theory, 97, pp. 225-230.
Sandholm T. and Crites R. (1996) Multiagent reinforcement learning in the iter-
ated Prisoner’s Dilemma, Biosystems, 37, 1-2, pp. 147-66.
Sarin R. (1999) Simple play in the prisoner’s dilemma, Journal of Economic
Behavior and Organization, 40, 1, pp. 105–113.
Schmidhuber J. (1996) A general method for multi-agent learning and incremental
self-improvement in unrestricted environments, In Yao X. (ed.) Evolution-
ary Computation: Theory and Applications. World Scientific Publishing Co.
Schmitt L. (2001) Theory of genetic algorithms, Theoretical Computer Science,
259, pp. 1-61.
Schuessler R. (1989) Exit threats and cooperation under anonymity, Journal of
Conflict Resolution, 33, pp. 728-749.
Schweitzer F. (2002) Modeling Complexity in Economic and Social Systems. World
Scientific, Singapore.
Schweitzer F., Behera L., and Mühlenbein H. (2002) Evolution of cooperation
in a spatial prisoner's dilemma, Advances in Complex Systems, 5, 2-3,
pp. 269-299.
Scodel A., Minas J., Ratoosh P., and Lipetz M. (1959) Some descriptive aspects of
two-person non-zero sum games, Journal of Conflict Resolution, 3, pp. 114-
119.
Selten, R. (1965) Spieltheoretische Behandlung eines Oligopolmodells mit Nach-
frageträgheit, Zeitschrift für die gesamte Staatswissenschaft, 121, pp. 301-
324.
Selten, R. (1975) Reexamination of the perfectness concept for equilibrium points
in extensive games, International Journal of Game Theory, 4, pp. 25-55.
Selten R. (1983) Evolutionary stability in extensive two-person games, Mathe-
matical Social Sciences, 5, pp. 269-363.
Selten R. (1988) Evolutionary stability in extensive two-person games: correction
and further development, Mathematical Social Sciences, 16, pp. 223-266.
Selten R. and Stoecker R. (1986) End behaviour in sequences of finite Prisoner’s
Dilemma supergames: a learning theory approach, Journal of Economic
Behaviour and Organisation, 7, pp. 47-70.
Sethi R. and Somanathan E. (1996) The evolution of social norms in common
property resource use, The American Economic Review, 86, 4, pp. 766-788.
Simon H. (1955) A behavioral model of rational choice, Quarterly Journal of
Economics, 69, 1, pp. 99-118.
Simon H. (1990) A mechanism for social selection and successful altruism, Science,
250, 4988, pp. 1665-1668.
Sermat V. (1967) Cooperative behaviour in a mixed-motive game, Journal of
Social Psychology, 62, pp. 217-239.
Sigmund K. (1995) Games of Life: Explorations in Ecology, Evolution and Be-
haviour. Penguin, Harmondsworth.
Skyrms B. (1990) The Dynamics of Rational Deliberation. Harvard UP.
Smith H. (1995) Monotone dynamical systems: an introduction to the theory
of competitive and cooperative systems, AMS Mathematical Surveys and
Monographs, 41.
Smith R. and Gray B. (1994) Co-adaptive genetic algorithms: an example in Oth-
ello strategy, Proceedings of the 1994 Florida Artificial Intelligence Research
Symposium, pp. 259-264.
Sobel J. (1976) Utility maximization in iterated Prisoner’s Dilemmas, Dialogue,
15, pp. 38-53.
Solomon L. (1960) The influence of some types of power relationships and game
strategies upon the development of interpersonal trust, Journal of Abnormal
and Social Psychology, 61, pp. 223-230.
Spears W. (1992) Crossover or mutation?, In Whitley D. (ed.) Foundations of
Genetic Algorithms 2 (FOGA-92), California: Morgan Kaufmann.
Spiro D. (1988) The state of cooperation in theories of state cooperation: the evo-
lution of a category mistake, Journal of International Affairs, 42, pp. 205-
225.
Stanley E., Ashlock D., and Smucker M. (1995) Iterated prisoner’s dilemma with
choice and refusal of partners: Evolutionary results, Lecture Notes in Arti-
ficial Intelligence, 929, pp. 490-502.
Stanley E., Ashlock D., and Tesfatsion L. (1994) Iterated prisoner’s dilemma
with choice and refusal of partners, In Langton C. (ed.) Artificial Life III.
Addison-Wesley, pp. 131-176.
Stephens D. (2000) Cumulative benefit games: achieving cooperation when play-
ers discount the future, Journal of Theoretical Biology, 205, 1, pp. 1-16.
Stephens D., Mclinn C., and Stevens J. (2002) Discounting and Reciprocity in an
Iterated Prisoner’s Dilemma, Science, 298, 5601, pp. 2216-2218.
Sugden, R. (1986) The Economics of Cooperation, Rights and Welfare. Basil
Blackwell.
Surowiecki J. (2004) The Wisdom of Crowds: Why the Many Are Smarter Than
the Few and How Collective Wisdom Shapes Business, Economies, Societies
and Nations. Little, Brown.
Sutton R. (1990) Integrated architectures for learning, planning, and reacting
based on approximating dynamic programming, Proceedings of the 7th In-
ternational Conference on Machine Learning, pp. 216-224.
Swingle P. and Coady H. (1967) Effects of the partner’s abrupt strategy change
upon subject’s responding in the prisoner’s dilemma, Journal of Personality
and Social Psychology, 5, pp. 357-363.
Swinkels J. (1993) Adjustment dynamics and rational play in games, Games and
Economic Behavior, 5, pp. 455-484.
Taylor, P. D. (1979). Evolutionarily stable strategies with two types of players,
Journal of Applied Probability, 16, pp. 76-83.
Taylor, P. and Jonker, L. (1978) Evolutionarily stable strategies and game dynam-
ics, Mathematical Biosciences, 40, pp. 145-156.
Tekol Y. and Acan A. (2003) Ants can play Prisoner’s Dilemma, Proceedings of
the 2003 Congress on Evolutionary Computation, pp. 1151-1157.
Thierens D. (1997) Selection schemes, elitist recombination, and selection inten-
sity, Proceedings of the 7th International Conference on Genetic Algorithms,
pp. 152-159.
Thierens D. and Goldberg D. (1994) Elitist recombination: an integrated se-
lection recombination GA, Proceedings of the First IEEE Conference on
Evolutionary Computation, pp. 508-512.
Thomas B. (1985) On evolutionarily stable sets, Journal of Mathematical Biology,
22, pp. 105-115.
Tzafestas E. (2000a) Toward adaptive cooperative behavior, Proceedings of the
Simulation of Adaptive Behavior Conference, pp. 334-340.
Tzafestas E. (2000b) Spatial games with adaptive tit-for-tats, Proceedings of the
6th Parallel Problem Solving from Nature (PPSN-VI), pp. 507-516.
Vega-Redondo F. (1994) Bayesian boundedly rational agents play the finitely re-
peated prisoner’s dilemma, Theory and Decision, 36, 2, pp. 187–206.
Von Neumann J. and Morgenstern O. (1944) Theory of Games and Economic
Behavior. Princeton UP.
Watkins C. (1989) Learning from delayed rewards. Ph.D. thesis, King’s College,
Cambridge, UK.
Watkins C. and Dayan P. (1992) Q-learning, Machine Learning, 8, 3, pp. 279-292.
Wedekind C. and Milinski M. (1996) Human cooperation in the simultaneous
and the alternating Prisoner’s Dilemma: Pavlov versus Generous Tit-for-
Tat, Proceedings of the National Academy of Sciences of the United States
of America, 93, 7, pp. 2686-2689.
Weibull J. (1995) Evolutionary Game Theory. MIT Press, Cambridge, Mass.
Wilson W. (1969) Cooperation and the cooperativeness of the other player, Jour-
nal of Conflict Resolution, 13, pp. 110-117.
Wilson S. (1987) Classifier systems and the animat problem, Machine Learning,
2, pp. 199-228.
Whitworth R. and Lucker W. (1969) Effective manipulation of cooperation with
college and culturally disadvantaged populations, Proceedings of 77th An-
nual Convention of American Psychological Association, 4, pp. 305-306.
Wu J. and Axelrod R. (1995) How to cope with noise in the Iterated Prisoner’s
Dilemma, Journal of Conflict Resolution, 39, pp. 183-189.
Young H. and Foster D. (1991) Cooperation in the Short and in the Long Run,
Games and Economic Behavior, 3, pp. 145-156.
Zeeman M. (1993) Hopf bifurcations in competitive three dimensional Lotka-
Volterra systems, Dynamics and Stability of Systems, 8, pp. 189-217.
Zeeman E., Zeeman M. (2002) An n-dimensional competitive Lotka-Volterra sys-
tem is generically determined by its edges, Nonlinearity, 15, pp. 2019-2032.
Zeeman E., Zeeman M. (2003) From local to global behavior in competitive
Lotka-Volterra systems, Transactions of the American Mathematical Society,
355, pp. 713-734.
Chapter 3
Learning IPD Strategies Through Co-evolution
Siang Yew Chong (1), Jan Humble (2), Graham Kendall (2), Jiawei Li (2,3)
and Xin Yao (1)
(1) University of Birmingham, (2) University of Nottingham,
(3) Harbin Institute of Technology
3.1. Introduction
Complex behavioral interactions can be abstracted and modelled using a
game. One particular aspect in modelling interactions that is of great
interest is in understanding the specific conditions that lead to cooperation
between selfish individuals. The iterated prisoner’s dilemma (IPD) game is
one famous example. In its classical form, two players, engaged in repeated
interactions, are given two choices: cooperate or defect [Axelrod (1984)].
The dilemma of the game is that although both players are better off
mutually cooperating than mutually defecting, each remains vulnerable to
exploitation by a party who defects. Although the IPD game has
become a popular model to study conditions for cooperation to occur among
selfish individuals, due in large part to a series of tournaments
reported in [Axelrod (1980a,b)], it has also received much attention in many
other areas of study, and has been used to model social, economic, and
biological interactions [Axelrod (1984)].
The classical IPD can be easily defined as a nonzero-sum, noncooper-
ative, two-player game [Chellapilla and Fogel (1999)]. It is nonzero-sum
because the benefits that a player obtains do not necessarily lead to similar
penalties given to the other player. It is noncooperative because it assumes
no preplay communication between the two players.
The IPD game can be formulated by considering a predefined payoff
matrix that specifies the payoff that a player receives for the choice it makes
for a particular move given the choice that the opponent makes. Referring
to the payoff matrix given by figure 3.1, both players receive R (reward)
units of payoff if both cooperate. They both receive P (punishment) units
of payoff if they both defect. However, when one player cooperates while
the other defects, the cooperator will receive S (sucker) units of payoff
while the defector receives T (temptation) units of payoff.
With the IPD game, the values R, S, T, and P must satisfy the con-
straints T > R > P > S and R > (S + T)/2. Axelrod in [Axelrod
(1980a,b)] used the following set of values: R = 3, S = 0, T = 5, and
P = 1. However, any set of values can be used as long as they satisfy the
IPD constraints. The game is played when both players choose between
the two alternative choices over a series of moves (i.e., repeated interac-
tions). Note that the game is fully symmetric, i.e., the same payoff matrix
is applied to both players.
                 Cooperate    Defect

   Cooperate      R / R       S / T

   Defect         T / S       P / P

Fig. 3.1. The payoff matrix framework of a two-player, two-choice game. In each cell,
the first payoff is assigned to the player (row) choosing the move, while the second is
assigned to the opponent (column).
For the simple case of the one-shot prisoner’s dilemma (both players
only get to make one move), the rational play will be to defect [Chellapilla
and Fogel (1999)]. This can be seen by comparing the payoff a player
obtains for each choice, given the opponent's choice. For example, a co-
operating player will receive either R (opponent cooperates) or S (opponent
defects). A defecting player will receive either T (opponent cooperates) or
P (opponent defects). As such, from the player’s point of view (i.e., self-
interested), the rational play will be to defect because regardless of the
opponent’s play, a higher payoff is obtained (T > R and P > S).
However, when the game is iterated over many rounds and players
can adopt game strategies in which a response is based on what happened
in previous moves, defection is not necessarily the best choice of play.
Instead, many studies have shown cooperative play to be a viable
strategy, starting with the tournaments organized by Axelrod (reported in
[Axelrod (1980a,b)]). More importantly, later studies (of which Axelrod
himself is one of the early pioneers) showed that cooperative strategies can
be learned from an initial, random population using evolutionary algorithms
[Axelrod (1987); Fogel (1991, 1993); Darwen and Yao (1995)].
In particular, studies made in [Axelrod (1987); Fogel (1991, 1993); Dar-
wen and Yao (1995)] (and many others) used a co-evolutionary learning
approach. The motivation for the co-evolutionary learning approach is
to learn strategy behaviors through an adaptation process on strategy
representations based solely on interactions (i.e., game-play). This
approach differs from the classical evolutionary game approach (and also
the ecological game approach used in [Axelrod (1980b); Axelrod and
Hamilton (1981)]), which is mainly concerned with frequency-dependent
reproduction of fixed and predetermined strategies. As such, the
co-evolutionary learning approach allows one to construct a game (i.e.,
specifying the possible interactions between players, the rules that
govern the interactions, and the payoffs) and then to search for effective
game strategies without the need for human intervention (e.g., specifying
viable strategies) [Chellapilla and Fogel (1999)].
Within the framework of the co-evolutionary learning of game strategies,
it is natural to explore more complex interactions that are closer to real-world
interactions than highly abstracted models like the classical IPD.
This review aims to provide a survey of studies applying the co-evolutionary
learning approach to more complex IPD games since the tournaments orga-
nized by Axelrod almost 20 years ago. In particular, focus
is placed on the motivations of certain extensions to the classical IPD and
the general observations made when co-evolutionary learning systems are
used.
The following section (Section 3.2) describes the framework of co-evolutionary learn-
ing and the general issues of co-evolving IPD strategies. Section 3.3 surveys
studies that extend the classical IPD with more choices, noise, N-player
variants, and others. The review concludes with some remarks on the future di-
rections for research in co-evolutionary learning of IPD strategies. It is
emphasized again that this review focuses on the co-evolutionary learning
approach to IPD games, rather than all possible work related to IPD games.
3.2. Co-evolving Strategies for the IPD Game
3.2.1. Co-evolutionary Learning Framework
Co-evolutionary learning refers to a broad class of population-based,
stochastic search algorithms that involve the simultaneous evolution of
competing solutions (to a problem) with coupled fitness [Yao (1994)]. A co-
evolutionary learning system can be implemented using evolutionary algo-
rithms (EAs) [Fogel (1994a); Back et al. (1997)]. That is, a co-evolutionary
learning system iteratively applies the processes of variation (e.g., mutation,
crossover, and others) and selection (e.g., choosing solutions to procreate
in the next iterative step) to the competing solutions in the population.
With this view, the framework of co-evolutionary learning (and also that
of EAs) can be illustrated using figure 3.2.
(1) Initialize the population, X(t=0)
(2) Evaluate the fitness of each individual through a comparison process
with other individuals in X(t)
(3) Select parents from X(t) based on their evaluated fitness
(4) Generate offspring from parents to produce X(t+1)
(5) Repeat steps (2-4) until some termination criteria are reached
Fig. 3.2. The general framework of co-evolutionary learning.
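This framework can be sketched in a few lines of Python (a minimal,
illustrative sketch only: random_strategy, mutate, and play_ipd stand for
representation-specific operators that are not specified here, and
fitness-proportional selection is just one of the possible choices
discussed later in this section):

import random

def coevolve(random_strategy, mutate, play_ipd,
             pop_size=20, generations=100):
    # Step (1): initialize the population.
    population = [random_strategy() for _ in range(pop_size)]
    for _ in range(generations):
        # Step (2): fitness comes only from game-play (comparison)
        # with the other individuals in the current population.
        fitness = [sum(play_ipd(s, o) for o in population)
                   for s in population]
        # Step (3): select parents based on their evaluated fitness.
        parents = random.choices(population, weights=fitness, k=pop_size)
        # Step (4): variation produces the next population.
        population = [mutate(p) for p in parents]
    # Step (5): repeat until the termination criterion (here, a fixed
    # number of generations) is reached.
    return population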
Co-evolutionary learning differs from traditional EAs in how fitness,
i.e., the quality or worth of a solution, is assigned (Step 2 in Fig. 3.2). EAs are
often viewed and constructed in terms of an optimization context, whereby
an absolute fitness function is required to assign fitnesses to contending so-
lutions. With co-evolutionary learning, the fitness of a solution is obtained
through its interactions with other contending solutions in the population.
That is, fitness in a co-evolutionary learning system is relative
and dynamic: a solution's fitness not only depends on the popu-
lation, but also changes as the composition of solutions in the population
changes.
Although the difference between co-evolutionary learning systems and
traditional EAs appears small at first, in the context of certain
problems it can lead to significantly different outcomes. For example,
consider the problem of searching for optimal solutions. In many real-world
problems, designing a suitable fitness function that can guide the search
for solutions can be very difficult, if not impossible [Yao (1994)]. However,
with co-evolutionary learning, this need for a fitness function is essentially
removed. Instead, a co-evolutionary learning system only needs to be able
to rank contending solutions based on how they compare to one another.
Here, games are well-suited, natural problem applications for co-
evolutionary learning systems. In particular, although games can be ap-
proached from an optimization context, it may not be possible to construct
a fitness function that fully represents the problem of the game and fully
discriminates between solutions found through optimization algorithms.
With co-evolutionary learning, however, the search can be directed to find
better game strategies (e.g., strategies that defeat more opponents) as the
evolutionary process
continues [Chellapilla and Fogel (1999)].
In particular, for the IPD game, there have been many different ap-
proaches since Axelrod’s early study in [Axelrod (1987)] that investigated
a particular co-evolutionary learning system. Like the study of EAs (com-
monly known as Evolutionary Computation) [Yao (1994); Fogel (1994a);
Back et al. (1997); Fogel (1995); Back (1996)], there are a wide variety of
specific strategy representations, selection and variation operators in the
co-evolutionary learning approach used for the IPD game. A complete sur-
vey is beyond the scope of this chapter. Instead, the more popular choices
will be reviewed here. The important thing to note is that all the co-
evolutionary learning systems used were based on the framework illustrated
in figure 3.2, i.e., they involved an adaptation process (variation and
selection) on IPD strategies in some form of representation, based on
interactions (game-play between strategies).
For strategy representations, particularly the deterministic and reactive
IPD strategies that were most studied, Axelrod and Lindgren [Axelrod
(1987); Lindgren (1991)] were among the first to use binary strings
of ones (cooperation) and zeroes (defection) to encode a look-up table
(essentially a binary decision tree) representation. The look-up table
determines the strategy's response based on the pairs of
previous moves made by the strategy and the opponent. Since the strategies
require histories of previous moves in order to make a response, they are
also encoded with the fictitious pre-game histories needed to start the game. We [Chong and
Yao (2005)] recently introduced a look-up table representation that directly
represents IPD strategies based on responses to previous moves. For the
case of looking back the previous pair of moves made by the strategy and
the opponent, direct look-up table represents the strategy responses as a
two-dimensional table. Each table element represents the response based
on the pair of previous moves. Instead of some fictitious histories required
to start the game, the direct look-up table specifies the first move directly.
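As a concrete sketch, a memory-one strategy under a direct look-up table
representation might be written as follows in Python (the class and
encoding are illustrative assumptions, not the exact scheme of [Chong and
Yao (2005)]):

class DirectLookupStrategy:
    # A 2x2 table maps the previous pair of moves (own move,
    # opponent's move) to the next move; the first move is specified
    # directly instead of being derived from a fictitious history.
    def __init__(self, first_move, table):
        self.first_move = first_move     # 'C' or 'D'
        self.table = table               # {('C', 'C'): move, ...}

    def next_move(self, my_last, opp_last):
        if my_last is None:              # first move of the game
            return self.first_move
        return self.table[(my_last, opp_last)]

# Tit for tat in this representation: cooperate first, then
# repeat the opponent's previous move.
tit_for_tat = DirectLookupStrategy(
    'C', {(m, o): o for m in 'CD' for o in 'CD'})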
Fogel among many others [Fogel (1991, 1993, 1996); Miller (1989); Stan-
ley et al. (1995)] used finite state machines (FSMs) for their capability of
representing complex behaviors of IPD strategies. With FSMs, behavioral
responses of an IPD strategy based on previous moves depend on the states
and the next-state transitions. The motivation for using FSMs compared
to look-up tables is to have a behavioral representation of IPD strategies
instead of a table of responses based on histories
of previous moves (see [Fogel (1993)] for the full discussion on the origin of
using FSMs and evolution to simulate intelligent behaviors).
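In the same spirit, a minimal FSM strategy might look like this (the
encoding below, with one output move per state, is an illustrative
assumption rather than a transcription of any of the cited studies):

class FSMStrategy:
    # Each state fixes an output move; the opponent's last move
    # selects the next-state transition.
    def __init__(self, start_state, outputs, transitions):
        self.state = start_state
        self.outputs = outputs           # state -> 'C' or 'D'
        self.transitions = transitions   # (state, opp_move) -> state

    def next_move(self, opp_last):
        if opp_last is not None:
            self.state = self.transitions[(self.state, opp_last)]
        return self.outputs[self.state]

# A two-state 'grim trigger': cooperate until the opponent defects
# once, then defect forever.
grim = FSMStrategy(0, {0: 'C', 1: 'D'},
                   {(0, 'C'): 0, (0, 'D'): 1,
                    (1, 'C'): 1, (1, 'D'): 1})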
In addition to the simple look-up table and FSM, neural network rep-
resentations have also been experimented with and studied [Harrald and
Fogel (1996); Darwen and Yao (2000); Chong and Yao (2005); Franken and
Engelbrecht (2005)]. Although neural networks are primarily used for their
ability to provide nonlinear input-output responses [Chellapilla and Fogel
(1999)], the initial motivation for representing IPD strategies this way also
included the capability of neural networks to process and represent a
continuous range of behaviors [Harrald and Fogel (1996)].
After selecting a strategy representation, the next step is to consider
the design of variation operators that are aimed at providing variations of
IPD strategies in the population. In most cases, variation operators are
dependent on the strategy representation considered. For example, a look-up
table encoded as a binary string can use crossover and bit-flip mutation as
in the case of standard genetic algorithms [Axelrod (1987)]. For the case
of FSMs, variation operators may include altering a next-state transition,
adding or removing states, and altering the output symbol (corresponding
to making a choice). With neural networks, especially those that are real-
valued representations, self-adapting mutations based on some probability
distribution (e.g., Gaussian or Cauchy) can be used [Chong and Yao (2005)]
(one of us has provided a comprehensive review on evolving neural networks
in [Yao (1999)]).
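For illustration, two such representation-specific operators might be
sketched as follows (the mutation rates are arbitrary assumptions):

import random

def bitflip_mutation(bits, rate=0.01):
    # Bit-flip mutation for a look-up table encoded as a binary
    # string (1 = cooperate, 0 = defect), as in a standard GA.
    return [b ^ 1 if random.random() < rate else b for b in bits]

def gaussian_mutation(weights, sigma=0.1):
    # Gaussian mutation for a real-valued (e.g., neural network)
    # representation; a self-adaptive scheme would evolve sigma too.
    return [w + random.gauss(0.0, sigma) for w in weights]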
As for designing the process of selecting IPD strategies for the next
generation, many other selection operators can be used (those found in
EAs [Fogel (1994a); Back et al. (1997)]), not just the proportional
selection used by Axelrod in the first study of co-evolving IPD strategies
[Axelrod (1987)]. For the case of obtaining the fitness for a particular IPD
strategy in the population, payoffs obtained from the IPD game are usually
used. In particular, many studies calculated an expected IPD-payoff-based
fitness using a round-robin tournament whereby all pairs of
strategies compete, including the pair where a strategy plays itself.
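A sketch of this fitness calculation (assuming a function play_ipd that
returns the first player's accumulated payoff over one game):

def round_robin_fitness(population, play_ipd):
    # Every strategy plays every strategy in the population,
    # including itself; fitness is the mean payoff per opponent.
    return [sum(play_ipd(s, opponent) for opponent in population)
            / len(population)
            for s in population]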
3.2.2. Shadow of the Future
In the IPD game, the shadow of the future refers to the situation whereby
the number of moves of a game is known in advance. In this situation,
there is no incentive to cooperate in the last move because there is no risk
of retaliation from the opponent. However, if every player defects on the
last move, then there is no incentive to cooperate in the move prior to the
last one. If every player defects in the last two moves, then there is no
incentive to cooperate in the move before that, and so forth. As such, we
would end up with mutual defection in all moves.
One popular way to address this issue and to allow for cooperation to
emerge is to have a fixed probability in ending the game on every move,
thereby keeping the game length uncertain. Most of the studies that used
the co-evolutionary learning approach considered a fixed game length (num-
ber of moves) in all game plays. For example, Axelrod [Axelrod (1987)]
and others such as [Fogel (1991, 1993); Chong and Yao (2005)] used 150
moves (with moves numbered from 0). Other game lengths can be used, although the
choice depends on the motivation of the study, e.g., a sufficiently long game
length to allow for strategies to reciprocate cooperation. In any case, a
fixed game length can be used because the strategy representations cannot
count the number of moves that have been played or how many more remain.
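The two conventions can be contrasted in a short sketch (the continuation
probability w below is an illustrative value, not one taken from the studies
cited):

import random

def fixed_game_length(moves=150):
    # Fixed length, as used in [Axelrod (1987)] and later studies.
    return moves

def stochastic_game_length(w=0.99):
    # After every move the game continues with probability w, so the
    # length is uncertain and no player can identify a last move.
    length = 1
    while random.random() < w:
        length += 1
    return length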
3.2.3. Issues for Co-evolutionary Learning of IPD Strategies
For the IPD game, there are two main contexts in which co-evolutionary
learning can be considered. First, co-evolutionary learning can be used to
search for effective strategies, given the specific rules of the game that
govern the complexity of strategy interactions. Second, a co-evolutionary
learning system can serve as a model for investigating how certain condi-
tions (e.g., game rules, co-evolutionary learning system setup, or others)
can lead to the evolution of certain behaviors.
For the context of using co-evolutionary learning to search for effective
strategies, the main issue is to evolve IPD strategies that perform well
against (e.g., defeat) a large number of opponents. Axelrod [Axelrod (1987)]
used a co-evolutionary learning system and compared the evolved strategies
with the representative strategies (e.g., tit for tat) obtained from his earlier
tournaments, which accounted for the average performance of all strategies that
participated in the tournaments [Axelrod (1980a,b)]. He noted that some of
the evolved strategies outperformed these representative strategies.
Although results obtained from evolving effective IPD strategies were
promising, the study in [Axelrod (1987)] had the important implication of
specifying a principled method to determine the effectiveness (or robustness
[Axelrod and Hamilton (1981)]) of evolved IPD strategies by testing them
against some representative strategies. One of us (Yao) first framed this
particular study in the context of generalization [Darwen and Yao (1995);
Yao et al. (1996)]. In particular, co-evolutionary learning is a machine
learning system that can be analyzed for its generalization performance.
Here, the generalization performance of a co-evolutionary learning system
for the IPD game can be thought of as the performance of the best strategy
in the population or the population itself (e.g., using a gating algorithm that
effectively combines different IPD strategies of the population as a single
strategy entity [Darwen (1996); Darwen and Yao (1997)]) against a large
number of IPD strategies, especially those that the evolved strategies have
yet to play with during evolution.
For the context of using co-evolutionary learning as a model to under-
stand the conditions determining how, why, and what IPD strategy behaviors are
evolved, there are many issues that can be studied. First, one can con-
sider the impact of specific IPD game specifications (e.g., payoff matrices
[Fogel (1993)] and duration of interactions or game length [Fogel (1996)])
on evolved IPD strategy behaviors. Second, there are also studies that
have focused on the impact of the interaction or game-play itself, including
noisy interactions [Julstrom (1997)], continuous behavioral responses
[Harrald and Fogel (1996)], and the possibility of refusal to interact
[Stanley et al. (1995)]. Third, the specific design of the
co-evolutionary learning system itself can have an impact whereby certain
IPD behaviors are favored and persist for a long period (e.g., investigating
whether systems that provided genotypic diversity actually lead to a diverse
population of IPD strategies with a variety of behaviors [Darwen and Yao
(2000, 2001, 2002)]).
3.3. Extending the IPD Game
The primary motivation in most studies that extend the classical IPD game
is to model more complex IPD interactions that are closer to real-world
interactions. This section describes some of the extended IPD games that
have been investigated using the co-evolutionary learning approach. Each
subsection starts with the motivation for extending the IPD game in a
specific manner, and the important issues of studying the more complex
IPD games. Each subsection then discusses the general observations
obtained from the co-evolutionary learning of the particular extended IPD
game.
3.3.1. Extending the IPD with More Choices
Several studies have extended the classical IPD with more than two extreme
choices that are available for play. That is, there are intermediate choices
between full cooperation and full defection that strategies can response
with. Fogel [Harrald and Fogel (1996)] investigated a continuous IPD game.
We have investigated the IPD with multiple, discrete levels of cooperation
[Darwen and Yao (2000, 2001, 2002); Chong and Yao (2005)], which could
be use to approximate the continuous IPD game when the number of levels
is sufficiently large.
The main motivation of extending the IPD with more choices is to al-
low for the modelling of subtle behavioral interactions that are not possible
with only two extreme choices. With the classical IPD game, the possible
behaviors that strategies can exhibit are severely limited. For example, a
strategy for the classical IPD game cannot play intermediate choices that
allow for some degree of exploitation of the opponent without risking retal-
iation from an otherwise cooperative opponent [Harrald and Fogel (1996)].
The co-evolutionary learning approach usually considers a neural net-
work strategy representation because it can be used to process a continuous
range of behaviors (i.e., real numbers for representing the degree of coopera-
tion) easily. Furthermore, for the case of IPD games with multiple, discrete
levels of cooperation, a neural network is scalable to the number of levels
considered.
Fogel [Harrald and Fogel (1996)] showed that for the IPD extended with a
continuous range of choices, the evolution of cooperation is unstable, with
fluctuations of average scores representing short periods of cooperation and
defection. We have further shown that with an increasingly higher number of
choices to play in the IPD game with multiple, discrete levels of cooperation,
evolution to cooperation is more difficult to achieve [Darwen and Yao
(2000, 2001, 2002)].
From these studies, it appears that a co-evolving population of IPD
strategies has a higher tendency of evolving to play full defection. However,
this does not mean that evolution to cooperation is not possible, or that
cooperative behaviors that persist cannot be evolved. For example, it has
been shown that evolving cooperative behaviors depends on the complexity
of strategy representation that is used. In the case of neural networks, the
number of nodes in the hidden layer can affect whether the co-evolutionary
learning system produces IPD strategies with cooperative responses [Harrald
and Fogel (1996)].
In addition to the complexity of strategy representation, another impor-
tant factor for evolving cooperative strategies is that of behavioral diversity.
Early studies [Darwen and Yao (2000, 2001)] have shown that genetic di-
versity (i.e., variations at the genotypic level of strategy representations)
does not equate to behavioral diversity (i.e., variations of IPD strategy
responses) in the population. Without sufficient behavioral diversity, the
co-evolving population can overspecialize to a specific strategy behavior
that is vulnerable to invasion (e.g., cycles between tit for tat, naive coop-
erators, and defectors). As such, increasing the level of genetic diversity
in the co-evolutionary learning system does not necessarily lead to an in-
crease in behavioral diversity that can help with the evolution of cooperative
strategies.
We have recently shown that strategy representation also plays an
important role in introducing behavioral diversity into the co-evolutionary
learning system [Chong and Yao (2005)]. We considered the n-choice IPD
game, which was obtained based on the following linear interpolation:

    pA = 2.5 − 0.5cA + 2cB,    −1 ≤ cA, cB ≤ 1,

where pA is the payoff to player A, given that cA and cB are the cooperation
levels of the choices that players A and B make, respectively. Fogel [Harrald
and Fogel (1996)] also considered a similar interpolation process. However,
we considered multiple, discrete levels of cooperation. For example, we used
the four-choice IPD game, where the four cooperation levels are represented
as +1 (full cooperation), +1/3, −1/3, and −1 (full defection). These choices
can be used with the linear interpolation equation shown above to obtain
the payoff. Figure 3.3 illustrates the payoff matrix of the four-choice IPD
game that was used [Chong and Yao (2005)].
Note that in generating the payoff matrix for an n-choice IPD game, the
following conditions must be satisfied [Chong and Yao (2005)]:
                            PLAYER B
                 +1       +1/3      −1/3      −1
        +1       4        2 2/3     1 1/3     0
PLAYER  +1/3     4 1/3    3         1 2/3     1/3
   A    −1/3     4 2/3    3 1/3     2         2/3
        −1       5        3 2/3     2 1/3     1

Fig. 3.3. The payoff matrix for the two-player four-choice IPD used in [Chong and Yao
(2005)]. Each element of the matrix gives the payoff for Player A; rows are Player A's
choices, columns are Player B's.
(1) For cA < c′A and constant cB: pA(cA, cB) > pA(c′A, cB),
(2) For cA ≤ c′A and cB < c′B: pA(cA, cB) < pA(c′A, c′B), and
(3) For cA < c′A and cB < c′B: pA(c′A, c′B) > (pA(cA, c′B) + pA(c′A, cB))/2.
These conditions are analogous to those for the classical IPD. The first
condition ensures that defection always pays more. The second condition
ensures that mutual cooperation has a higher payoff than mutual defection.
The third condition ensures that alternating between cooperation
and defection does not pay in comparison to just playing cooperation.
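As an illustration, a short Python check (our own sketch, not from [Chong
and Yao (2005)]) confirms that the linear payoff function above satisfies
conditions (1)-(3) over the four discrete levels:

```python
from itertools import product

LEVELS = [1.0, 1/3, -1/3, -1.0]  # the four cooperation levels

def payoff(c_a, c_b):
    # Linear interpolation from Section 3.3.1.
    return 2.5 - 0.5 * c_a + 2.0 * c_b

pairs = list(product(LEVELS, repeat=2))
for ca, ca2 in pairs:
    for cb, cb2 in pairs:
        if ca < ca2:                 # (1) defection always pays more
            assert payoff(ca, cb) > payoff(ca2, cb)
        if ca <= ca2 and cb < cb2:   # (2) mutual cooperation beats mutual defection
            assert payoff(ca, cb) < payoff(ca2, cb2)
        if ca < ca2 and cb < cb2:    # (3) alternating does not pay
            assert payoff(ca2, cb2) > (payoff(ca, cb2) + payoff(ca2, cb)) / 2
print("all conditions satisfied")
```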
We investigated two strategy representations: neural networks and direct
look-up tables. We considered these two strategy representations because
they allow the investigation of the impact of strategy representation on
the introduction and maintenance of variations of behavioral responses in
the population of IPD strategies. On the one hand, the neural network
indirectly represents the input-output response mappings of IPD strategies,
with possibilities of many-to-one mappings between representations
and actual behavioral responses [Fogel (1994b); Atmar (1994)]. On the
other hand, the direct look-up table directly represents the input-output
response mappings of IPD strategies. We hypothesized that a more direct
representation of IPD strategies would allow more behavioral variations to
be introduced and maintained in the population through co-evolution.
For the neural network representation, we used a fixed-architecture feed-
forward multilayer perceptron (MLP) [Chong and Yao (2005)]. Specifically,
the neural network consists of an input layer, a single hidden layer of ten
nodes, and an output node. The network is fully connected and strictly
layered (i.e., there is no short-cut connection from the input layer to the
output node). The transfer (activation) function used for all nodes is the hyperbolic
tangent function, tanh(x). The input layer consists of the following four
input nodes:
(1) The neural network’s previous choice, i.e., level of cooperation, in [−1,
+1].
(2) The opponent’s previous level of cooperation.
(3) An input of +1 if the opponent played a lower cooperation level com-
pared to the neural network, and 0 otherwise.
(4) An input of +1 if the neural network played a lower cooperation level
compared to the opponent, and 0 otherwise.
The input layer is effectively a function of two variables (the neural network's
previous choice and the opponent's previous choice), since the last two inputs
are derived from the first two. These additional inputs facilitate learning to
recognize being exploited and exploiting. Given the inputs, the neural
network's output determines the choice for its next move. The output is a
real value between +1 and −1 that is discretized to either +1, +1/3, −1/3
or −1, depending on which discrete value the neural network output is
closest to.
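A minimal sketch of this representation in Python (our illustration, with
hypothetical class and method names; it omits the two pre-game inputs
used to produce the first move):

```python
import numpy as np

LEVELS = np.array([1.0, 1/3, -1/3, -1.0])  # the four cooperation levels

class MLPStrategy:
    """Fixed-architecture MLP: 4 inputs, 10 hidden tanh nodes, 1 tanh output.
    The 61 weights and biases below, plus the two pre-game inputs,
    give the Nw = 63 evolvable parameters mentioned in the text."""

    def __init__(self, rng):
        self.w1 = rng.normal(size=(10, 4))  # input-to-hidden weights
        self.b1 = rng.normal(size=10)       # hidden biases
        self.w2 = rng.normal(size=10)       # hidden-to-output weights
        self.b2 = rng.normal()              # output bias

    def move(self, own_prev, opp_prev):
        # Inputs (1)-(4): own and opponent's previous levels, plus the
        # two derived "being exploited" / "exploiting" indicator inputs.
        x = np.array([own_prev, opp_prev,
                      1.0 if opp_prev < own_prev else 0.0,
                      1.0 if own_prev < opp_prev else 0.0])
        h = np.tanh(self.w1 @ x + self.b1)
        out = np.tanh(self.w2 @ h + self.b2)            # real value in (-1, +1)
        return LEVELS[np.argmin(np.abs(LEVELS - out))]  # nearest discrete level
```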
We considered self-adaptive mutation as the variation operator for the
real-valued representation of the neural networks that we used [Chong and
Yao (2005)]. This approach associates a neural network with a self-adaptive
parameter vector [σi(j)] that controls the mutation step sizes of the
respective weights and biases [wi(j)] of the neural network. Offspring neural
networks ([w′i(j)] and [σ′i(j)]) are generated from parent neural networks
([wi(j)] and [σi(j)]) through mutations. Two different mutations based on
Gaussian and Cauchy distributions were used in order to further investigate
the impact of indirect strategy representation on variation operators that
could increase genetic diversity but do not necessarily lead to an increase
in behavioral diversity.
For the self-adaptive Gaussian mutation, offspring neural networks are
generated according to the following equations:

    σ′i(j) = σi(j) · exp(τ · Nj(0, 1)),   i = 1, ..., 15,  j = 1, ..., Nw,
    w′i(j) = wi(j) + σ′i(j) · Nj(0, 1),   i = 1, ..., 15,  j = 1, ..., Nw,

where Nw = 63, τ = (2(Nw)^(1/2))^(−1/2) ≈ 0.251, and Nj(0, 1) is a Gaussian
random variable (zero mean and standard deviation of one) resampled for
every j. Nw is the total number of weights, biases, and pre-game inputs
required for an IPD strategy based on a memory length of one.
For the self-adaptive Cauchy mutation, which is known to provide bigger
changes to the neural network weights (i.e., provide more genetic diversity)
[Yao et al. (1999)], the following equations are used:

    σ′i(j) = σi(j) · exp(τ · Nj(0, 1)),   i = 1, ..., 15,  j = 1, ..., Nw,
    w′i(j) = wi(j) + σ′i(j) · Cj(0, 1),   i = 1, ..., 15,  j = 1, ..., Nw,

where Cj(0, 1) is a Cauchy random variable (centered at zero and with a
scale parameter of one) resampled for every j. All other variables remain
the same as those in the self-adaptive Gaussian mutation.
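A sketch of the two mutation operators in Python/numpy (our own
illustration; w and sigma stand for one parent's parameter and step-size
vectors):

```python
import numpy as np

N_W = 63                             # weights, biases and pre-game inputs
TAU = (2.0 * np.sqrt(N_W)) ** -0.5   # ~ 0.251, as in the text

def self_adaptive_mutate(w, sigma, rng, cauchy=False):
    """Return mutated copies (w', sigma') of one parent's parameters.
    Step sizes are perturbed log-normally; weights are then perturbed by
    Gaussian noise, or heavier-tailed Cauchy noise when cauchy=True."""
    sigma_new = sigma * np.exp(TAU * rng.normal(size=N_W))
    noise = rng.standard_cauchy(size=N_W) if cauchy else rng.normal(size=N_W)
    return w + sigma_new * noise, sigma_new
```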
For the direct look-up table representation, the details are illustrated in
Fig. 3.4 [Chong and Yao (2005)], which shows the behavioral response
of a four-choice IPD strategy. The element mij specifies the choice to be
made, given the inputs i (the player's own previous choice) and j (the
opponent's previous choice). Rather than using pre-game inputs (two for
memory-length-one strategies), the first move is specified independently.
Each table element can take any of the four possible choices (+1, +1/3,
−1/3, −1).
                      Opponent's previous move
                  +1      +1/3    −1/3    −1
Player's  +1      m11     m12     m13     m14
previous  +1/3    m21     m22     m23     m24
move      −1/3    m31     m32     m33     m34
          −1      m41     m42     m43     m44

Fig. 3.4. The look-up table representation for the two-player IPD with four choices and
memory length one [Chong and Yao (2005)].
A simple mutation operator was used to generate offspring. Mutation
replaces the original element, mij, by one of the other three possible choices
with equal probability. For example, if mutation occurs at m13 = +1/3,
then the mutated element m′13 can take either +1, −1/3, or −1 with
equal probability. Each table element has a fixed probability, pm, of being
replaced by one of the remaining three choices. The value of pm is not
optimized. Crossover is not used in any of the experiments. With a direct
representation of IPD strategy behaviors, a simple mutation is more than
sufficient to provide behavioral diversity in the population.
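This mutation is simple enough to state in a few lines of Python (our
sketch; a strategy's table is a 4x4 list of cooperation levels):

```python
import random

LEVELS = [1.0, 1/3, -1/3, -1.0]

def mutate_table(table, p_m=0.05):
    """With probability p_m per element, replace m_ij by one of the
    other three choices, chosen with equal probability."""
    return [[random.choice([c for c in LEVELS if c != m])
             if random.random() < p_m else m
             for m in row]
            for row in table]
```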
The following co-evolutionary procedure was used [Chong and Yao
(2005)]:
(1) Generation step, t = 0:
    Initialize N/2 parent strategies, Pi, i = 1, 2, ..., N/2, randomly.
(2) Generate N/2 offspring, Oi, i = 1, 2, ..., N/2, from the N/2 parents
    using a variation operator.
(3) All pairs of strategies compete, including the pair where a strategy plays
    itself (i.e., a round-robin tournament). For N strategies in a population,
    every strategy competes in a total of N games.
(4) Select the best N/2 strategies based on the total payoffs of all games
    played. Increment the generation step, t = t + 1.
(5) Steps 2 to 4 are repeated until the termination criterion (i.e., a fixed
    number of generations) is met.
In particular, we used N = 30 and repeated the co-evolutionary process
for 600 generations (which is sufficiently long to observe an evolutionary
outcome, e.g., persistent cooperation). A fixed game length of 150 iterations
is used for all games. Experiments are repeated for 30 independent
runs. Note that additional steps were taken to ensure that the initial
population has sufficient behavioral diversity in addition to genotypic diversity
[Darwen and Yao (2000)], to avoid early convergence of results. All details
are available in [Chong and Yao (2005)]. The procedure involves setting
particular parameters for the specific strategy representation and resampling
new strategies to make sure that the frequency at which each of the
four choices (+1, +1/3, −1/3, −1) is played is approximately equal, so
that there is no bias towards a particular choice early in the evolution.
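A minimal sketch of this selection loop in Python (our own illustration;
init_strategy, mutate and play_game are hypothetical helpers standing in
for the representation-specific details above):

```python
def coevolve(init_strategy, mutate, play_game, n=30, generations=600):
    """(N/2 parents + N/2 offspring) co-evolution with round-robin scoring."""
    parents = [init_strategy() for _ in range(n // 2)]
    for _ in range(generations):
        population = parents + [mutate(p) for p in parents]  # N strategies
        totals = [0.0] * n
        for i in range(n):            # round-robin, self-play included:
            for j in range(n):        # every strategy competes in N games
                my_payoff, _ = play_game(population[i], population[j])
                totals[i] += my_payoff
        ranked = sorted(range(n), key=lambda k: totals[k], reverse=True)
        parents = [population[k] for k in ranked[: n // 2]]  # best N/2 survive
    return parents
```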
Results showed that there were fewer runs where the population evolved
to play mutual cooperation in experiments that used the neural network
representation [Chong and Yao (2005)]. For example, some runs had
intermediate outcomes while a few had defection outcomes (Fig. 3.5).
This is quite different from the case of the classical IPD [Axelrod (1987);
Darwen and Yao (1995)], where each run converged to mutual cooperation
quite consistently and quickly. Increasing genetic diversity (e.g., using
self-adaptive Cauchy mutation) does not necessarily lead to more behavioral
diversity in the population, since some runs still evolved to intermediate
or defection outcomes (Fig. 3.6). The results further illustrate that more
choices have made cooperation more difficult to evolve.
However, when direct look-up table representation was used, results
Fig. 3.5. Five sample runs of a co-evolutionary learning system that used the neural
network representation with self-adaptive Gaussian mutation in the four-choice IPD
[Chong and Yao (2005)]. (The figure plots average payoff, on a scale of 1 to 4, against
generation, 0 to 600.)
Fig. 3.6. Five sample runs of a co-evolutionary learning system that used the neural
network representation with self-adaptive Cauchy mutation in the four-choice IPD
[Chong and Yao (2005)]. (The figure plots average payoff, on a scale of 1 to 4, against
generation, 0 to 600.)
showed that the evolution of cooperation was not difficult [Chong and Yao
(2005)]. For example, even when a simple mutation with a low probability
of mutation (e.g., pm = 0.05) was used, no run evolved to mutual defection,
even though intermediate outcomes were obtained (Fig. 3.7). However,
increasing the probability of mutation resulted in the populations of all
runs evolving to mutual cooperation. The results showed that the choice
of strategy representation can have an impact on the evolution of
cooperation if it allows for greater behavioral diversity in the population.
3.3.2. IPD with Noise
A natural extension to the classical IPD is to consider the impact of noisy
interactions on the evolution of certain behaviors. Axelrod noted two types
Fig. 3.7. Five sample runs of a co-evolutionary learning system that used the direct
look-up table representation with a simple mutation at pm = 0.05 in the four-choice IPD
[Chong and Yao (2005)]. (The figure plots average payoff, on a scale of 1 to 4, against
generation, 0 to 600.)
of noise, i.e., misimplementation and misperception, that can affect a
strategy's response to the opponent's choice of play [Axelrod and Dion (1988)].
With misimplementation, the strategy knows that a mistaken play was
made but the opponent does not. With misperception, one or both
interacting strategies may not know that a different choice was made. The main
motivation for this extension is to study the impact of noise on the learning
of certain behaviors through co-evolution when interactions can be noisy.
In particular, one issue that can be considered is whether cooperative
strategies based on reciprocity (such as tit for tat) can still perform well in the
presence of noise, which affects a strategy's behavioral response based on
previous moves.
Julstrom [Julstrom (1997)] investigated the effects of noise in the two-
choice IPD through a co-evolutionary learning system. In particular, noise
was modelled as mistakes. That is, there is a probability that the choice
played by a strategy is changed to the other choice (e.g., defection is played
instead of the original cooperation, and vice versa). Results from the ex-
periments showed that noise (starting around 2%) can reduce the level of
cooperation in the population.
Recently, we further extended the IPD game with more choices by
introducing noise, and used a co-evolutionary learning system as a model for
investigation [Chong and Yao (2005)], as detailed in the earlier subsection.
We also modelled noise as mistakes that a player makes. For the
four-choice IPD game, there is a fixed probability, pn, throughout a game
that a strategy intends to play a particular choice but ends up with a
different choice instead. For example, with pn = 0.05, there is a 0.05
probability that if +1/3 is intended to be played, one of the other three
possible cooperation levels, i.e., +1, −1/3,
and −1, will be chosen uniformly at random.
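In code, this noise model amounts to a few lines (our sketch):

```python
import random

LEVELS = [1.0, 1/3, -1/3, -1.0]

def apply_noise(intended, p_n):
    """With probability p_n, the intended choice is replaced by one of
    the other three cooperation levels, chosen uniformly at random."""
    if random.random() < p_n:
        return random.choice([c for c in LEVELS if c != intended])
    return intended
```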
Results from experiments again showed the importance of behavioral
diversity for the evolution of cooperation in noisy IPD games with more
choices. For noise introduced at very low probabilities (less than 1.5%, or
pn < 0.015), evolution to cooperation is more likely than when noise is
not introduced. Strategies were observed to be more forgiving, confirming
the predictions of other studies noted in [Axelrod and Dion (1988); Wu and
Axelrod (1995)]. However, when noise was introduced at high probabilities
(starting around 5%, or pn = 0.05), evolution to cooperation was more
difficult. The population was more likely to evolve to defection.
Despite this, if the co-evolutionary learning system has sufficient behavioral
diversity (e.g., using the direct look-up table representation, which allows
behavioral diversity to be introduced and maintained more easily and
effectively), the evolution of cooperation is not greatly affected [Chong and Yao
(2005)]. Evolved strategies still played high levels of cooperation even when
there were more choices to play and the interactions were noisy, both of
which can make evolving cooperative behaviors more difficult. For example,
Table 3.1 compares co-evolutionary learning systems with different levels
of behavioral diversity, i.e., C-CEP (neural network and self-adaptive
Gaussian mutation), C-FEP (neural network and self-adaptive Cauchy
mutation), and C-PM05 (direct look-up table and mutation at pm = 0.05), at
different noise levels (%) [Chong and Yao (2005)]. The entries give the
number of runs in each experiment that evolved to mutual defection, i.e.,
an average payoff of less than 1.5. No runs evolved to mutual defection
when the direct look-up table representation was used in the co-evolutionary
learning system [Chong and Yao (2005)].
Table 3.1. Comparison of results for three
different co-evolutionary learning systems.
Noise (%) C-CEP C-FEP C-PM05
0 4 1 0
5 4 9 0
10 7 11 0
15 8 17 0
20 18 26 0
It should be noted that although both mutation and noise can be consid-
ered as sources of behavioral variations in models that encourage coopera-
tion [McNamara et al. (2004)], they produce behavioral diversity differently.
Mutation introduces strategies with different behaviors into the population.
Noise gives access to parts of a strategy's behavior that would not otherwise
be played in a noiseless IPD game. Our results [Chong and Yao
(2005)] showed that noise does not necessarily promote the kind of behavioral
diversity in the population that leads to a stable evolution of cooperation,
although noise at low levels does help. With higher levels of noise, closer
inspection of evolved strategies showed the population overspecializing to
a specific behavior that is vulnerable to invasion, leading to cyclic dynamics
between cooperation and defection in the evolutionary process.
In particular, noise and mutation have different impacts on the
evolutionary process [Chong and Yao (2005)]. For example, increasingly higher
levels of noise lead to mutual defection outcomes. Given a very noisy
environment, strategies overspecialized to play defection only. This was not
observed in the noiseless case of the IPD with increasingly more mutation.
For example, increasingly higher mutation rates in the co-evolutionary
learning system that used the direct look-up table representation did not
lead to mutual defection outcomes. Strategies were not observed to
overspecialize to defection, or to any specific play.
3.3.3. N-Player IPD
Real-world interactions may involve more than two players. One famous
example is the “tragedy of the commons” [Hardin (1968)], which illustrates
how the self-interested actions that players take over a public good for
initial rewards lead to a situation where everyone loses out in the end. For
the case of the IPD, the original two-player formulation can be extended to
N-player interactions [Axelrod and Dion (1988)]. This allows for the study
of whether cooperative behaviors are possible when interactions involve
more than two players, since strategies that are effective in the two-player
case may not be effective (or may even fail) in large-group interactions
[Glance and Huberman (1994)].
One of us formulated an N-player IPD (NIPD) game for investigation
using the co-evolutionary learning approach [Yao and Darwen (1994)]
(other studies include [Bankes (1994); Lindgren and Johansson (2001)]).
The NIPD game is defined by the following three properties [Colman (1982)]
(page 159):
• Each player faces a choice between cooperation and defection.
• Defection is dominant for each player, i.e., each player is better off
defecting than cooperating, regardless of how many of the other players
cooperate.
• The dominant defection strategies intersect in a deficient equilibrium. In
particular, the outcome when all players choose their non-dominant
cooperation strategies is preferable from every player's point of view to the
one in which everyone defects, but no one is motivated to deviate
unilaterally from defection.
The payoff matrix (Fig. 3.8) for the NIPD game can then be constructed
based on the following conditions, which must be satisfied [Yao and Darwen
(1994)]:

• Di > Ci for 0 ≤ i ≤ n − 1,
• Di+1 > Di and Ci+1 > Ci for 0 ≤ i < n − 1, and
• Ci > (Di + Ci−1)/2 for 1 ≤ i ≤ n − 1,

where Ci (respectively, Di) is the payoff to a cooperator (defector) when i
of the remaining n − 1 players cooperate.
A large number of values satisfy these conditions. For the study in [Yao
and Darwen (1994)], the values were chosen such that if nc is the number of
cooperators in the NIPD game, then the payoff for cooperation is 2nc − 2
and the payoff for defection is 2nc + 1 (Fig. 3.9). For this payoff matrix, the
average per-move payoff a can be calculated as follows if Nc cooperative
moves are made out of N moves:

    a = 1 + (Nc/N)(2n − 3),

which allows one to measure how common cooperation was by examining
the average per-round payoff.
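As a check on this expression (a short derivation of our own from the
payoff values above): in a single round with k cooperators among the n
players, each cooperator receives 2(k − 1) and each defector receives 2k + 1,
so the total payoff in that round is

    k · 2(k − 1) + (n − k)(2k + 1) = n + k(2n − 3),

i.e., a mean per-player payoff of 1 + (k/n)(2n − 3). Averaging over all
moves replaces k/n by Nc/N, giving the formula above.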
            Number of cooperators among the remaining n−1 players
                  0     1     2     ...   n−1
Player A    C     C0    C1    C2    ...   Cn−1
            D     D0    D1    D2    ...   Dn−1

Fig. 3.8. The payoff matrix for the NIPD game. The value in the table gives the payoff
to the player based on its choice of play [Yao and Darwen (1994)].
            Number of cooperators among the remaining n−1 players
                  0     1     2     ...   n−1
Player A    C     0     2     4     ...   2(n−1)
            D     1     3     5     ...   2(n−1)+1

Fig. 3.9. An example of the payoff matrix for the NIPD game [Yao and Darwen (1994)].
NIPD game interactions took the form of a large number of randomly
selected groups of N players, sampled with replacement (e.g., 1000 NIPD
games for a population of 100 strategies). Results from the experiments
in [Yao and Darwen (1994)] showed that the group size (i.e., the value of N
in the NIPD game) has a negative impact on the evolution of cooperation.
As N increases, there are fewer runs in which the population evolves to
play cooperation. For example, in the case of memory-two strategies, only
one out of 20 runs had defection outcomes for the 3IPD. However, the
number of runs with defection outcomes increased to nine for the 6IPD.
Increasing N to 16 (i.e., the 16IPD) resulted in all runs evolving to defection
outcomes [Yao and Darwen (1994)].
3.3.4. Other Extensions
There are many other extensions to the classical IPD game, or further
extensions to already extended IPD games (such as the NIPD), that
can be studied through a co-evolutionary learning approach. For example,
we examined the impact of localized interactions in NIPD games in
[Seo et al. (1999, 2000)]. The earlier study of the NIPD [Yao and Darwen
(1994)] showed that the evolution of cooperation is more difficult to achieve
through a co-evolutionary learning process as N increases. However, in some
real-world interactions, it is unlikely that a player interacts with everybody
(or has an equal probability of interacting with anyone in the population).
Instead, a player might interact with specific other players (e.g.,
neighbours, relatives, or colleagues at the workplace). Such localized interactions
may involve spatial models [Nowak and May (1992); Ishibuchi and Namikawa
(2005)]. In particular, localized interactions can have a positive impact
on the evolution of cooperation in the NIPD game. That is, a population
structured in a spatial model is more likely to evolve cooperation [Seo et al.
(1999); Lindgren and Johansson (2001)].
Another extension that can be considered is to incorporate indirect
interactions into the IPD game, which originally considers only direct
interactions between strategies. Most of the previous studies have focused on
modelling either direct interactions (e.g., cooperative behaviors through
direct reciprocity, which involves repeated encounters, i.e., IPD games
[Axelrod (1984)]) or indirect interactions (e.g., cooperative behaviors through
mechanisms of indirect reciprocity, such as reputation, where an individual
receives cooperation from third parties due to the individual's cooperative
behaviors towards others [Nowak and Sigmund (1998b)]).
However, it has been suggested that complex real-world interactions involve
both direct and indirect interactions (although, for simplicity of modelling
and analysis, only one kind of interaction is considered at a time) [Nowak
and Sigmund (1998a)]. In this respect, we have investigated a model with
both direct and indirect interactions [Yao and Darwen (1999)]. In
particular, each strategy is tagged with a reputation score, which is calculated
based on payoffs received from a small random sample of pre-games. A
co-evolutionary approach was used to show that, with the addition of
reputation, cooperative outcomes are possible and more likely, even for the case
of the IPD with more choices and shorter game durations [Yao and Darwen
(1999)].
In addition, another extension is to consider the adaptation of payoff
matrices. We recently conducted a preliminary study on evolving strategy
payoff matrices and how such an adaptation process can affect the learning
of strategy behaviors [Chong and Yao (2006)]. The motivation for the study
is to relax the assumption of a fixed, symmetric payoff matrix for all
evolving strategies. This assumption may not be realistic, considering that not
all players are alike in real-world interactions. We focus specifically on an
adaptation process for the payoff matrix based on past behavioral
interactions. In particular, a simple update rule is used that provides a
reinforcement feedback process between strategy behaviors and payoff matrices
during the co-evolutionary process. Results from experiments [Chong and
Yao (2006)] showed that the evolutionary outcome depends on the
adaptation process of both behaviors (i.e., strategy behavioral responses) and the
utility expectations that determine how behaviors are rewarded (i.e.,
strategy payoff matrices). Defection outcomes are more likely to be obtained if
IPD-like update rules that favor the exploitation of opponents are used.
However, cooperative outcomes can easily be obtained when mutualism-like
update rules that favor mutual cooperation are used.
3.4. Conclusion and Future Directions
The greatest advantage and most important feature of co-evolutionary
learning is that the adaptation of the representation depends on the
interactions between members of the population. In this respect, the
co-evolutionary learning approach is well-suited to the problem of IPD games
in two contexts. First, co-evolutionary learning can be used as a search
algorithm for effective strategies without requiring human knowledge; all that
is required are the rules of the game. Second, the adaptation of strategy
behaviors based on interactions in co-evolution provides a natural way to
investigate the conditions that lead to the evolution of certain behaviors.
In both contexts, the advantage of co-evolutionary learning over other
approaches is that strategy behaviors are not fixed or predefined. Instead,
co-evolutionary learning provides a means to realize strategy behavioral
responses that are not necessarily bounded by expert human knowledge,
thus providing new insight into the problem.
Since the first study of co-evolutionary learning on the classical IPD by
Axelrod [Axelrod (1987)], there has been a wide range of studies that
further extended the classical IPD game with additional features such as,
but not limited to, continuous or multiple levels of cooperation, noisy
interactions, N-player interactions, spatial interactions, and indirect
interactions. The motivation in all of these studies is to bridge the gap between
abstract IPD interactions and complex real-world interactions. As such, by
identifying the specific conditions that lead to the evolution of specific IPD
strategy behaviors, these studies have helped to provide a more in-depth
view of complex real-world interactions such as those found in human
society.
There is still much more that can be explored using the co-evolutionary
learning approach. One direction is to further extend the more complex
IPD games and investigate the impact of the additional extensions.
This is important because the extensions might interact with one another
in some unknown and nonlinear fashion. Understanding these interactions
will help to further unravel complex human interactions. Another direction
is to investigate a more rigorous approach to determining the robustness
of evolved strategy behaviors. In this particular respect, the notion of
generalization might provide a more natural approach for co-evolutionary
learning, in addition to the classical evolutionary game theory approach of
evolutionarily stable strategies.
References
Atmar, W. (1994). Notes on the simulation of evolution, IEEE Transactions on
Neural Networks 5, 1, pp. 130–147.
Axelrod, R. (1980a). Effective choice in the prisoner’s dilemma, The Journal of
Conflict Resolution 24, 1, pp. 3–25.
Axelrod, R. (1980b). More effective choice in the prisoner’s dilemma, The Journal
of Conflict Resolution 24, 3, pp. 379–403.
Axelrod, R. (1984). The Evolution of Cooperation (Basic Books, New York).
Axelrod, R. (1987). The evolution of strategies in the iterated prisoner’s dilemma,
in L. D. Davis (ed.), Genetic Algorithms and Simulated Annealing, chap. 3
(Morgan Kaufmann, New York), pp. 32–41.
Axelrod, R. and Dion, D. (1988). The further evolution of cooperation, Science
242, 4884, pp. 1385–1390.
Axelrod, R. and Hamilton, W. D. (1981). The evolution of cooperation, Science
211, pp. 1390–1396.
Bäck, T. (1996). Evolutionary Algorithms in Theory and Practice (Oxford Uni-
versity Press, New York).
Bäck, T., Hammel, U. and Schwefel, H. P. (1997). Evolutionary computation:
Comments on the history and current state, IEEE Transactions on Evolu-
tionary Computation 1, 1, pp. 3–17.
Bankes, S. (1994). Exploring the foundations of artificial societies: Experiments
in evolving solutions to iterated n-player prisoner’s dilemma, in R. Brookes
and P. Maes (eds.), Artificial Life IV (Addison-Wesley), pp. 337–342.
Chellapilla, K. and Fogel, D. B. (1999). Evolution, neural networks, games, and
intelligence, Proc. IEEE 87, 9, pp. 1471–1496.
Chong, S. Y. and Yao, X. (2005). Behavioral diversity, choices, and noise in the
iterated prisoner’s dilemma, IEEE Transactions on Evolutionary Compu-
tation 9, 6, pp. 540–551.
Chong, S. Y. and Yao, X. (2006). Self-adaptive payoff matrices in repeated in-
teractions, in 2006 IEEE Symposium on Computational Intelligence and
Games (CIG’06) (IEEE Press, Piscataway, NJ), pp. 103–110.
Colman, A. M. (1982). Game Theory and Experimental Games (Pergamon Press,
Oxford).
Darwen, P. and Yao, X. (1995). On evolving robust strategies for iterated pris-
oner’s dilemma, in Progress in Evolutionary Computation, Lecture Notes in
Artificial Intelligence, Vol. 956, pp. 276–292.
Darwen, P. and Yao, X. (2000). Does extra genetic diversity maintain escalation
in a co-evolutionary arms race, International Journal of Knowledge-Based
Intelligent Engineering Systems 4, 3, pp. 191–200.
Darwen, P. and Yao, X. (2001). Why more choices cause less cooperation in
iterated prisoner’s dilemma, in Proc. 2001 Congress on Evolutionary Com-
putation (CEC’01) (IEEE Press, Piscataway, NJ), pp. 987–994.
Darwen, P. and Yao, X. (2002). Co-evolution in iterated prisoner’s dilemma with
intermediate levels of cooperation: Application to missile defense, Inter-
national Journal of Computational Intelligence and Applications 2, 1, pp.
83–107.
Darwen, P. J. (1996). Co-evolutionary Learning by Automatic Modularization with
Speciation, Ph.D. thesis, University of New South Wales, Sydney, Australia.
Darwen, P. J. and Yao, X. (1997). Speciation as automatic categorical modulariza-
tion, IEEE Transactions on Evolutionary Computation 1, 2, pp. 101–108.
Fogel, D. B. (1991). The evolution of intelligent decision making in gaming, Cy-
bernetics and Systems: An International Journal 22, pp. 223–236.
Fogel, D. B. (1993). Evolving behaviors in the iterated prisoner’s dilemma, Evo-
lutionary Computation 1, 1, pp. 77–97.
Fogel, D. B. (1994a). An introduction to simulated evolutionary optimization,
IEEE Transactions on Neural Networks 5, 1, pp. 3–14.
Fogel, D. B. (1994b). An introduction to simulated evolutionary optimization,
IEEE Transactions on Neural Networks 5, 1, pp. 3–14.
Fogel, D. B. (1995). Evolutionary Computation: Toward a New Philosophy of
Machine Intelligence (IEEE Press, Piscataway, NJ).
Fogel, D. B. (1996). On the relationship between the duration of an encounter and
the evolution of cooperation in the iterated prisoner’s dilemma, Evolution-
ary Computation 3, 3, pp. 349–363.
Franken, N. and Engelbrecht, A. P. (2005). Particle swarm optimization ap-
proaches to coevolve strategies for the iterated prisoner’s dilemma, IEEE
Transactions on Evolutionary Computation 9, 6, pp. 562–579.
Glance, N. S. and Huberman, B. A. (1994). The dynamics of social dilemmas,
Scientific American, pp. 58–63.
Hardin, G. (1968). The tragedy of the commons, Science 162, pp. 1243–1248.
Harrald, P. G. and Fogel, D. B. (1996). Evolving continuous behaviors in the
iterated prisoner’s dilemma, BioSystems: Special Issue on the Prisoner’s
Dilemma 37, pp. 135–145.
Ishibuchi, H. and Namikawa, N. (2005). Evolution of iterated prisoner’s dilemma
game strategies in structured demes under random pairing in game playing,
IEEE Transactions on Evolutionary Computation 9, 6, pp. 552–561.
Julstrom, B. A. (1997). Effects of contest length and noise on reciprocal altruism,
cooperation, and payoffs in the iterated prisoner’s dilemma, in Proc. 7th
International Conf. on Genetic Algorithms (ICGA’97) (Morgan Kauffman,
San Francisco, CA), pp. 386–392.
Lindgren, K. (1991). Evolutionary phenomena in simple dynamics, in C. G. Lang-
ton, C. Taylor, J. D. Farmer and S. Rasmussen (eds.), Artificial Life II
(Addison-Wesley), pp. 295–312.
Lindgren, K. and Johansson, J. (2001). Coevolution of strategies in n-person
prisoner’s dilemma, in J. Crutchfield and P. Schuster (eds.), Evolutionary
Dynamics - Exploring the Interplay of Selection, Neutrality, Accident, and
Function (Addison-Wesley).
McNamara, J. M., Barta, Z. and Houston, A. I. (2004). Variation in behaviour
promotes cooperation in the prisoner’s dilemma, Nature 428, pp. 745–748.
Miller, J. (1989). The coevolution of automata in the iterated prisoner’s dilemma,
Tech. Rep. 89-003, Santa Fe Institute Report.
Nowak, M. A. and May, R. M. (1992). Evolutionary games and spatial chaos,
Nature 355, pp. 250–253.
Nowak, M. A. and Sigmund, K. (1998a). The dynamics of indirect reciprocity,
Journal of Theoretical Biology 194, pp. 561–574.
Nowak, M. A. and Sigmund, K. (1998b). Evolution of indirect reciprocity by
image scoring, Nature 393, pp. 573–577.
Seo, Y. G., Cho, S. B. and Yao, X. (1999). Emergence of cooperative coalition
in nipd game with localization of interaction and learning, in Proc. IEEE
1999 Congress on Evolutionary Computation (CEC’99) (IEEE Press, Pis-
cataway, NJ), pp. 877–884.
Seo, Y. G., Cho, S. B. and Yao, X. (2000). Exploiting coalition in co-evolutionary
learning, in Proc. IEEE 2000 Congress on Evolutionary Computation
(CEC’00) (IEEE Press, Piscataway, NJ), pp. 1268–1275.
Stanley, E. A., Ashlock, D. and Smucker, M. D. (1995). Prisoner’s dilemma with
choice and refusal of partners: Evolutionary results, in Proc. Third Euro-
pean Conf. on Advances in Artificial Life, pp. 490–502.
Wu, J. and Axelrod, R. (1995). How to cope with noise in the iterated prisoner’s
dilemma, The Journal of Conflict Resolution 39, 1, pp. 183–189.
Yao, X. (1994). Introduction, Informatica (Special Issue on Evolutionary Com-
putation) 18, pp. 375–376.
Yao, X. (1999). Evolving artificial neural networks, Proc. IEEE 87, 9, pp. 1423–
1447.
Yao, X. and Darwen, P. (1999). How important is your reputation in a multi-
agent environment, in Proc. 1999 Conf. on Systems, Man, and Cybernetics
(SMC’99) (IEEE Press, Piscataway, NJ), pp. 575–580.
Yao, X. and Darwen, P. J. (1994). An experimental study of n-person iterated
prisoner’s dilemma games, Informatica 18, pp. 435–450.
Yao, X., Liu, Y. and Darwen, P. J. (1996). How to make best use of evolutionary
learning, in R. Stocker, H. Jelinck, B. Burnota and T. Bossomaier (eds.),
Complex Systems - From Local Interactions to Global Phenomena (IOS
Press, Amsterdam), pp. 229–242.
Yao, X., Liu, Y. and Lin, G. (1999). Evolutionary programming made faster,
IEEE Transactions on Evolutionary Computation 3, 2, pp. 82–102.
Chapter 4
How to Design a Strategy to Win an IPD Tournament
Jiawei Li
University of Nottingham, Harbin Institute of Technology
4.1. Introduction
Imagine that a player in an IPD tournament knows the strategy of each of
his opponents; he will defect against opponents such as ALLC or ALLD and
cooperate with opponents such as GRIM or TFT in order to maximize his
payoff. This means that he can interact with each opponent optimally and
receive higher payoffs. Although such a priori information is not available,
one can identify a strategy during the game. For example, if a strategy
cooperated with its opponent over the previous 10 rounds while its opponent
defected, it seems sensible to deduce that it will always cooperate. In fact,
each strategy gradually reveals itself through the IPD game; moreover, we
need not wait until after the game to identify the strategy, but can possibly
do so after only a few rounds. With an efficient identification mechanism,
it is possible for a strategy to interact with most of its opponents optimally.
However, two main problems must be solved in designing an efficient
identification mechanism. Firstly, it is impossible, in theory, for a strategy
to identify an opponent within a finite number of rounds because the
number of possible strategies is huge. Only strategies belonging to a
preconcerted finite set can be identified, which may be just a small proportion
of all those possible, because identification is of no use if it takes too long.
Secondly, there is a risk that exploring an opponent puts the player into a
much worse position. In other words, such an action may have a negative
effect on future rewards. For example, in order to distinguish between ALLC
and GRIM, a strategy has to defect at least once, losing the chance to
cooperate with GRIM in the future.
In this chapter we will discuss how to resolve these problems, how to
design an identification mechanism for IPD games, and how the strategy of
Adaptive Pavlov was designed, which was ranked first in Competition 4 of
the 2005 IPD tournament.
4.2. Analysis of strategies involved in IPD games
Every strategy has its disadvantages as well as its advantages. A
strategy may receive high payoffs when its opponent belongs to some set of
strategies, and receive lower payoffs when the opponent belongs to another
set. However, some strategies always do better than others in
IPD tournaments.
The strategies involved in IPDs can be classified according to whether
or not they respond to their opponents. One set of strategies is fixed and
plays a predetermined action no matter what the opponent does; ALLD,
ALLC and RAND are typical. Other strategies are more complicated, and
their actions depend on their opponent's behavior. TFT, for example, starts
with COOPERATE and then repeats its opponent's last move. The second
set is clearly superior to the first, since strategies like TFT, TFTT
and GRIM have always performed better than 'fixed' strategies in past IPD
tournaments.
The question, then, is what the optimal response to every opponent is.
Is TFT's imitation of the opponent's last move the best response? Although
TFT has been shown to be superior to many other strategies, it is not good
enough to win every IPD tournament.
Let's consider a simulated IPD tournament with 9 players. These
players are ALLC, ALLD, RAND, GRIM, TFT, STFT, TFTT, TTFT, and
Pavlov. The strategies of these players are described in Table 4.1. These
strategies are simple and representative, and have all appeared in past IPD
tournaments.
The rule of our simulation is that each strategy plays a 200-round
IPD game with every strategy (including itself). The payoffs per round
are as shown in Fig. 4.1. The total payoff received by a given strategy
is the sum of the payoffs throughout the tournament.
The results of the tournaments vary because there are random choices
in the strategies of Pavlov and RAND. In order to decrease the variability
of the results, the tournament is repeated several times and the average
score for each strategy is calculated. Simulation results show that TFT,
TFTT and GRIM acquire higher scores than the others, and their average
scores across several tournaments are quite close.
Table 4.1. Description of the players of the IPD simulation.

Players   Descriptions
ALLC      This strategy always plays COOPERATE.
ALLD      This strategy always plays DEFECT.
RAND      It plays DEFECT or COOPERATE with probability 1/2 each.
GRIM      Starts with COOPERATE, but after one defection always plays
          DEFECT.
TFT       Starts with COOPERATE, and then repeats the opponent's moves.
TFTT      Like TFT, but it plays DEFECT only after two consecutive defections.
STFT      Like TFT, but its first move is DEFECT.
TTFT      Like TFT, but it plays two DEFECTs after an opponent's defection.
Pavlov    The result of each move is divided into two groups:
          SUCCESS (payoff 5 or 3) and DEFEAT (payoff 1 or 0).
          If the last result belongs to the SUCCESS group it plays the same
          move again; otherwise it plays the other move.
                          Player 2's choice
Player 1's choice      COOPERATE    DEFECT
COOPERATE              (3, 3)       (0, 5)
DEFECT                 (5, 0)       (1, 1)

Fig. 4.1. Payoff table of the IPD tournament. The numbers in brackets denote the
payoffs the two players receive in a round of a game.
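A minimal sketch of such a round-robin tournament in Python (our own
illustration; strategies are written as functions from the two move histories
to the next move, and TFT is shown as an example):

```python
import itertools

# Payoffs from Fig. 4.1, keyed by (my move, opponent's move).
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def tft(own_history, opp_history):
    """Starts with COOPERATE, then repeats the opponent's last move."""
    return opp_history[-1] if opp_history else 'C'

def play_game(s1, s2, rounds=200):
    h1, h2, p1, p2 = [], [], 0, 0
    for _ in range(rounds):
        m1, m2 = s1(h1, h2), s2(h2, h1)
        p1, p2 = p1 + PAYOFF[(m1, m2)], p2 + PAYOFF[(m2, m1)]
        h1.append(m1); h2.append(m2)
    return p1, p2

def tournament(strategies, rounds=200):
    """Every strategy plays a game against every strategy, itself included."""
    totals = {name: 0 for name in strategies}
    for a, b in itertools.product(strategies, repeat=2):
        totals[a] += play_game(strategies[a], strategies[b], rounds)[0]
    return totals
```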
TFTT, however, wins more times than the others in a single tournament.
For example, TFTT wins 11 tournaments out of a total of 20, while TFT
wins 4 and GRIM wins 5. In addition, if Pavlov and RAND are removed,
TFTT always wins.
One of the limitations of TFT is that it inevitably runs into the
defecting-defected cycle (in which TFT plays COOPERATE while its
opponent defects, and then TFT plays DEFECT while its opponent
cooperates) when its opponent happens to be STFT. However, cooperation,
resulting in higher payoffs, would be achieved if TFT cooperated once more
after its opponent defects. TFTT is superior to TFT in this regard, and
this is the reason why TFTT wins more tournaments than TFT in the above
IPD simulation. It is easy to verify that TFT will not get lower scores than
TFTT if STFT is removed from the simulation.
Thus, we can improve the strategy of TFT as follows: when TFT
enters a defecting-defected cycle (for example, a sequence of 3
defecting-defected pairs), it chooses COOPERATE in two consecutive rounds.
This modified TFT (MTFT) achieves higher payoffs than TFT when the
opponent is STFT. By substituting MTFT for TFT, IPD
experiments show that MTFT gets the highest average score and wins more
single tournaments than the others.
MTFT uses an identification technique. It identifies STFT by
detecting defecting-defected cycles in the course of an IPD game. When
the opponent is considered to be STFT, the optimal action (cooperating in
two sequential rounds) is carried out in order to maximize future
payoffs. In this way, it is natural to deduce that MTFT can be further
improved so that it can identify more strategies and then interact with them
optimally.
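A sketch of MTFT in the same style as the tournament code above (our
own illustration; the echo is detected as 3 consecutive rounds in which
exactly one side defected):

```python
def mtft(own_history, opp_history):
    """Modified TFT: plays TFT, but cooperates to break a detected
    defecting-defected echo; against STFT this produces the two
    consecutive COOPERATEs that restore mutual cooperation."""
    if not opp_history:
        return 'C'
    last3 = list(zip(own_history[-3:], opp_history[-3:]))
    if len(last3) == 3 and all(a != b for a, b in last3):
        return 'C'                # break the cycle
    return opp_history[-1]        # otherwise, plain tit-for-tat
```

Against STFT, the echo check fires in round 4, MTFT's cooperation is
echoed back in round 5, and the plain tit-for-tat rule then sustains mutual
cooperation.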
In the following sections, an approach to identifying each strategy in a
finite set will be introduced. A strategy can interact with its opponents
almost optimally by using this identification mechanism.
4.3. Estimation of possible strategies in an IPD tournament
In this section, we seek to define a finite set of types of strategies to be
identified. Since the number of possible IPD strategies is infinite, it is
impossible to identify each of them in a finite number of rounds. For example,
suppose that a strategy cooperated with its opponent for 10 sequential
rounds while its opponent defected continuously. Although it is very likely
to be ALLC, there are always other possibilities. It may be GRIM with a
trigger of 11 defections; it may be RAND, which has just happened to play 10
sequential COOPERATEs; or it may be a combination of ALLC and TFT,
and will behave as a TFT type in the following rounds. However, if
only ALLC belongs to the identification set, those other possibilities will
be eliminated.
How to choose the identification set depends on prior knowledge and
subjective estimation. Some strategies, like TFT, are likely to appear, while
others are designated as default strategies.
There are numerous strategies one can design for an IPD tournament.
However, most of them seldom appear because their chances of winning are
very small. For example, there may be a strategy that cooperates
in the first two rounds, defects in the following two rounds, and then
cooperates and defects alternately. Few players would apply such a
strategy because it is unlikely to win any IPD tournament. Clearly, the
strategies that usually win appear frequently, and the others appear
infrequently.
We define two classifications of IPD strategies: cooperating and defect-
ing. Cooperating strategies, for example TFT and TFTT, wish to coop-
erate with their opponents and never start defecting. Defecting strategies,
for example ALLD and Pavlov beginning with DEFECT (PavlovD), wish
to defect in order to maximize their payoffs, and they always start by
defecting.
The cooperating strategies differ in the way they respond to the
opponent's defections. For example, TFTT is more forgiving than TFT,
as it retaliates only if its opponent has defected twice. GRIM is sterner
than TFT, as it never forgives a defection. These strategies can be classified
according to their responses to the opponent's defections, as shown in
Fig. 4.2; the rules are the same as those described in the previous simulation.
Forgiving  <--  ALLC   TFTT   TFT   TTFT   GRIM  -->  Stern

Fig. 4.2. The cooperating strategies, arranged from forgiving to stern.
The defecting strategies differ in how strongly they insist on defecting.
PavlovD is a representative strategy in this set. It starts with DEFECT.
If the opponent is too forgiving to retaliate, it defects forever; otherwise,
it tries to cooperate with the opponent.a The defecting strategies can be
classified as shown in Fig. 4.3.
Defect less  <--  PavlovD   STFT   ALLD  -->  Defect more

Fig. 4.3. The defecting strategies, arranged from defecting less to defecting more.
Other simple strategies which lack a clear objective differ from the co-
operating and defecting strategies and hardly ever get high scores in IPD
tournaments.
Most of the players in an IPD tournament at the present time will be
cooperating strategies, since cooperating strategies have been dominant
in most of the tournaments. There will also be a small quantity of
defecting strategies.a Based on the above idea, we have designed the Adaptive
Pavlov strategy, which applies a simple mechanism to distinguish cooperating
strategies from several representative defecting strategies.

a Although PavlovD tries to cooperate with an opponent when the opponent retaliates
upon its defection, it seldom succeeds. For example, even if PavlovD meets a forgiving
strategy like TFTT, they cannot keep cooperating in the game. In fact, if PavlovD
cooperated just one more time, cooperation could be achieved. We have examined a
modified PavlovD (MPavlovD) strategy that starts with DEFECT and cooperates twice
when the opponent retaliates. Simulation results show that MPavlovD always gains more
than PavlovD.
4.4. Interacting with a strategy optimally
For any strategy there must be another strategy that deals with it
optimally. Because the strategies ALLC, ALLD and RAND are independent of
the opponent's behavior, ALLD is the optimal response to them. Because
GRIM, TFT, STFT and TTFT retaliate as soon as their opponent defects,
the optimal strategy against them is to always cooperate but defect in the
last round. TFTT is more charitable and forgives a single defection;
therefore, its opponent can maximize its payoff by alternately choosing DEFECT
and COOPERATE. If Pavlov starts with COOPERATE, its opponent
should always cooperate except in the last round; otherwise, its opponent
should start with DEFECT and then always cooperate except in the last
round. Table 4.2 shows the optimal strategies for dealing with each strategy
shown in Table 4.1.
Table 4.2. Optimal strategies to interact with a known strategy.
Strategies Optimal strategy of opponent
ALLC It always plays DEFECT.
ALLD It always plays DEFECT.
RAND It always plays DEFECT.
GRIM It always plays COOPERATE except DEFECT in the last move.
TFT It always plays COOPERATE except DEFECT in the last move.
TFTT It starts with DEFECT, and then plays COOPERATE and
DEFECT in turn.
STFT It always plays COOPERATE except DEFECT in the last move.
TTFT It always plays COOPERATE except DEFECT in the last move.
Pavlov If Pavlov starts with DEFECT it starts with DEFECT, and then
always plays COOPERATE except that it plays DEFECT in the
last round; If Pavlov starts with COOPERATE it always plays
COOPERATE except that it plays DEFECT in the last round.
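As a quick check of the TFTT entry: alternating DEFECT and
COOPERATE never gives TFTT the two consecutive defections it needs
to retaliate, so TFTT keeps cooperating, and the exploiter earns
(5 + 3)/2 = 4 points per round on average, against 3 for mutual
cooperation (payoffs from Fig. 4.1).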
Given an IPD tournament with n players, a player will win the
tournament if it interacts with each of its opponents optimally. For example, a
unique ALLD will win when the other n − 1 players in an IPD tournament
are all ALLC. Hence, the winning strategy of an IPD tournament must be
optimal in its interactions with most of the others.
Although the strategy of a player is unknown to his opponent before
a game, the strategy gradually emerges as the game progresses. It is not
difficult for a human player to identify the strategy of his opponent, but it is
more difficult for a computer program to possess this ability of identification.
To make it feasible, a method is needed to distinguish each type
of strategy from the others; a computer program can then respond to
different types of strategies appropriately. Under the assumption
that every player belongs to a pre-defined finite set of strategies, an example
is given to show how the method of identification is realized and how the
winning strategy is designed.
Consider an IPD tournament with 10 players. Besides the players shown
in Table 4.1, let us add a new player, MyStrategy (MS), which applies an
identification mechanism to identify its opponent. The rules are the same
as those described in the previous simulation.
MS starts with DEFECT. If its opponent chooses DEFECT in the first
round, MS chooses COOPERATE in round two; otherwise MS chooses
DEFECT. MS always chooses COOPERATE in the third round. In this
way, most of the strategies can be identified after just three rounds.
For example, suppose that the choices of MS and its opponent in the
first 3 rounds are as shown in Fig. 4.4. The strategy of the opponent can
then be confirmed to be RAND. Because the opponent starts with DEFECT,
it must be one of ALLD, STFT, RAND and Pavlov. Since
MS defects in the first round and the opponent cooperates in round two, it
cannot be ALLD or STFT. Since MS and the opponent both cooperate
in the second round, the opponent would not defect in the third round
if it were Pavlov. Therefore, the opponent must be RAND. The optimal
strategy for interacting with RAND is ALLD, so MS behaves as ALLD
in the following rounds of the game.
                   Round 1    Round 2      Round 3
MS's moves         Defect     Cooperate    Cooperate
Opponent's moves   Defect     Cooperate    Defect

Fig. 4.4. A possible opening of a game (showing that the opponent is RAND).
Some possible results of identification for the 9 strategies are listed in
Table 4.3, where ’C’ denotes COOPERATE and ’D’ denotes DEFECT.
Because the strategy RAND chooses its move randomly it may behave like
any other strategy during a short period; therefore, more rounds are needed
to distinguish RAND from the other strategies. If a game unfolds in a way
different from every row of Table 4.3, the strategy of the opponent must
be RAND.
Table 4.3. Identification of the 9 strategies.
Players Possible moves of two players Identification result
MyStrategy D C C Pavlov (RAND)
The opponent D C C
MyStrategy D C C ALLD (RAND)
The opponent D D D
MyStrategy D C C STFT (RAND)
The opponent D D D C
MyStrategy D D C ALLC (RAND)
The opponent C C C C
MyStrategy D D C TFTT (RAND)
The opponent C C C D
MyStrategy D D C Pavlov (RAND)
The opponent C C D C
MyStrategy D D C C TFT (RAND)
The opponent C C D D C
MyStrategy D D C C C TTFT (RAND)
The opponent C D D D C
MyStrategy D D C C C GRIM (RAND)
The opponent C D D D D
In this way, a strategy can be identified after several rounds of the game,
and then the optimal strategy can be applied.
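A sketch of this prefix-matching identification in Python (our own
illustration; the signatures are transcribed from Table 4.3, and RAND is
the fallback when no row fits):

```python
# (MS's moves, opponent's moves, identified strategy), from Table 4.3.
SIGNATURES = [
    ('DCC',   'DCC',   'Pavlov'), ('DCC',   'DDD',   'ALLD'),
    ('DCC',   'DDDC',  'STFT'),   ('DDC',   'CCCC',  'ALLC'),
    ('DDC',   'CCCD',  'TFTT'),   ('DDC',   'CCDC',  'Pavlov'),
    ('DDCC',  'CCDDC', 'TFT'),    ('DDCCC', 'CDDDC', 'TTFT'),
    ('DDCCC', 'CDDDD', 'GRIM'),
]

def _compatible(observed, expected):
    """True if the shorter string is a prefix of the longer one."""
    k = min(len(observed), len(expected))
    return observed[:k] == expected[:k]

def identify(my_moves, opp_moves):
    """Return the identified strategy, 'RAND' if no signature fits,
    or None while several signatures are still possible."""
    labels = {label for mine, theirs, label in SIGNATURES
              if _compatible(my_moves, mine) and _compatible(opp_moves, theirs)}
    if not labels:
        return 'RAND'
    return labels.pop() if len(labels) == 1 else None
```

For the history of Fig. 4.4, identify('DCC', 'DCD') returns 'RAND', since
no signature is consistent with it.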
Ten IPD tournaments with the above 10 players were carried out.b The
simulation results are shown in Fig. 4.5. MS gains the highest average
payoff compared with the other strategies and achieves the highest score
in each tournament. The reason for MS's success is that

b The number of rounds in an IPD game is usually not fixed, in order to avoid the
players' knowing when the end of the game is due. The simulation applies a fixed number
of rounds in order to decrease the complexity of computation. However, the strategy of MS
does not make use of this to get extra payoff; that is to say, MS does not purposely
choose DEFECT in the last round of a game.
Players Points in 10 tournaments Average Rank
MS 6134 6213 6179 6127 6202 6175 6152 6172 6212 6187 6175.3 1
TFTT 5957 5996 5970 6003 5994 5959 5965 5969 5966 5976 5975.5 2
TFT 5961 5936 5919 5946 5959 5938 5940 5929 5954 5978 5946.0 3
Pavlov 5718 5691 5725 5775 5816 5763 5748 5763 5733 5745 5747.7 4
TTFT 5725 5723 5725 5717 5719 5725 5746 5732 5722 5716 5725.0 5
GRIM 5404 5394 5416 5410 5440 5468 5322 5400 5390 5384 5402.8 6
ALLC 5115 5091 5103 5127 5103 5103 5103 5082 5109 5091 5102.7 7
RAND 4339 4349 4254 4340 4216 4219 4258 4241 4228 4274 4271.8 8
STFT 4165 4187 4160 4169 4179 4144 4173 4158 4142 4158 4163.5 9
ALLD 3800 3792 3852 3792 3848 3856 3832 3864 3832 3832 3830.0 10
Fig. 4.5. Simulation results of 10 IPD tournaments.
it has almost optimally interacted with most of the strategies in this IPD
tournament.
Most IPD strategies, such as TFT or Pavlov, are memory-one strategies,
which can only respond to the opponent's last move; however, the past
course of the game contains more information. The identification mechanism
of MS uses information about the opponent's strategy; thus MS responds
not just to the opponent's past moves but to the opponent's strategy. By
identifying different opponents, MS makes use of more information than the
simple strategies. This is the reason MS is able to win IPD tournaments.
Different identification approaches may lead to different results for MS.
For example, all of the strategies GRIM, TFT and ALLC start with CO-
OPERATE, and they will not defect if their opponents don’t. To identify
each of these strategies, MS starts with DEFECT and loses the chance to
cooperate with GRIM. On the other hand, if MS does not defect first, it cannot distinguish the three strategies and cannot interact with ALLC optimally. The risk involved in exploring the opponent must be considered in order to choose an efficient, payoff-maximizing identification approach.
4.5. Escape from the trap of defection
When a player begins to explore the opponent, there is a risk that the identification process puts the player in a much worse position. Some strategies, especially those with a trigger mechanism such as GRIM, will change their behavior at the trigger point. For example, the strategy MS described in the previous section defects at the beginning of IPD games in order to distinguish the cooperating strategies ALLC, TFT and GRIM; as a result, the chance to cooperate with GRIM is lost. In IPD games, the risk of identification is mainly the trap of defection: an identification process that leads the opponent to defect permanently, with nothing that can be done to rescue the situation.
It appears that a strategy will not run into the trap of defection if it never defects first. But this is not the case. Suppose a strategy keeps playing COOPERATE as long as its opponent defects, and defects forever once its opponent cooperates; then any cooperating strategy will be defected against when interacting with it, while most defecting strategies will enjoy its cooperation. If this reverse-GRIM strategy is as likely to appear in a game as GRIM, then cooperating and defecting carry equal risk of provoking future defection. This means that the risk of the defection trap always exists, whether or not an identification mechanism is applied.
One may argue that reverse-GRIM types of strategies will not appear as frequently as GRIMs in IPDs, so cooperating is safer than defecting, and the MS strategy is more likely to run into the defection trap than TFT. That is right. But this does not show that the defection trap is inevitable for a strategy with an identification mechanism, because many different identification approaches can be applied. For example, a simple way to avoid retaliation from GRIM is never to defect first. The identification mechanism that Adaptive Pavlov used in the 2005 IPD tournament only probed defecting strategies, precisely so as to maintain cooperation with every cooperating strategy.
Again, what kind of identification mechanism should be applied depends on prior knowledge and subjective estimation. If there are enough ALLC strategies in an IPD game, it is worth distinguishing them from the other cooperating strategies. But if GRIMs prevail, it is better not to defect first. Generally speaking, we can compare different identification approaches and choose the most efficient one, although uncertainty still exists.
4.6. Adaptive Pavlov and Competition 4 of the 2005 IPD tournament
The 2005 IPD tournament comprised 4 competitions. Competition 4 mir-
rored the original competition of Axelrod. There were a total of 50 players
including 8 default strategies. The strategy of Adaptive Pavlov (AP) that
was ranked first in Competition 4 will be analyzed in this section.
The strategy of AP groups every 6 consecutive rounds into a period and applies different tactics in different periods. AP behaves as a TFT strategy in the first period, and then changes its strategy according to the identification of its opponent.
AP classifies the possible opponents into 5 categories: cooperating strategies, STFT, PavlovD, ALLD and RAND.c By identifying the opponent's strategy at the end of a period, AP shifts its strategy in the new period in order to deal with each opponent optimally.
AP is never the first to defect, and thus it will cooperate with every cooperating strategy. AP tries to cooperate with STFT and PavlovD, and defects against strategies such as ALLD or RAND. The processes of AP's interaction with cooperating strategies, ALLD, STFT, and PavlovD in the first 6 rounds (in which AP behaves as TFT) are shown in Fig. 4.6.
cRAND is claimed to be a default strategy.
(a)  Round:   1 2 3 4 5 6
     AP       C C C C C C
     Co-op    C C C C C C

(b)  Round:   1 2 3 4 5 6
     AP       C D D D D D
     ALLD     D D D D D D

(c)  Round:   1 2 3 4 5 6
     AP       C D C D C D
     STFT     D C D C D C

(d)  Round:   1 2 3 4 5 6
     AP       C D D C D D
     PavlovD  D D C D D C

Fig. 4.6. Identifying the opponent according to the process of interaction in six rounds. (a) AP cooperates with any cooperating strategy. (b) ALLD strategy always defects. (c) If a strategy alternately plays D and C when interacting with TFT, it is identified to be STFT. (d) If a strategy periodically plays D-D-C when interacting with TFT, it is identified to be PavlovD.
For example, when a process of interaction like that shown in Fig. 4.6(c) occurs, the opponent is identified as STFT, and AP will cooperate twice in the next period in order to achieve cooperation. If the opponent is determined to be PavlovD, AP will defect once and then always cooperate in the next period. If the process of interaction differs from all of those shown in Fig. 4.6, the opponent is identified as RAND. In this way, any strategy that is not defined in the identification set is likely to be identified as RAND. Once cooperation has been established, AP will always cooperate unless a defection occurs. Identification of the opponent is performed in each period throughout the IPD tournament, in order to correct misidentification and to deal with players who change their strategies during a game.
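The period-based identification can be sketched in code as follows. This is a hypothetical reconstruction for illustration only: the four reference patterns are transcribed from Fig. 4.6, while the function names, the RAND fallback, and the next-period tactics are assumptions based on the description above.

# Opponent's replies over one 6-round period in which AP plays TFT.
PATTERNS = {
    "CCCCCC": "cooperating",  # Fig. 4.6(a): any cooperating strategy
    "DDDDDD": "ALLD",         # Fig. 4.6(b): always defects
    "DCDCDC": "STFT",         # Fig. 4.6(c): alternating D and C
    "DDCDDC": "PavlovD",      # Fig. 4.6(d): periodic D-D-C
}

def classify_period(opp_moves: str) -> str:
    """Classify one 6-round period; any unrecognised pattern counts as RAND."""
    return PATTERNS.get(opp_moves, "RAND")

def next_period_opening(category: str) -> str:
    """AP's opening moves for the next period (per the text above)."""
    if category == "STFT":
        return "CC"       # cooperate twice to break the echo of defections
    if category == "PavlovD":
        return "D"        # defect once, then always cooperate
    if category in ("ALLD", "RAND"):
        return "DDDDDD"   # defect against ALLD- and RAND-like opponents
    return "CCCCCC"       # keep cooperating with cooperating strategies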
As we have mentioned, most of the players are cooperating strategies. The results show that there were 34 cooperating strategies in Competition 4 (including the 4 default strategies TFT, TFTT, GRIM and ALLC). Excluding the default strategies, there were still 3 strategies that behaved like ALLD, 5 that behaved like STFT, and 2 that behaved like NEG. As shown in Table 4.4, AP can identify most of the strategies involved in Competition 4.d
Table 4.4. Categories of the strategies in Competition 4.
Categories Number of the strategies
Cooperating strategies 34
Strategies like STFT 6
Strategies like ALLD 4
Strategies like NEG 3
Strategies like RAND 1
Others 2
4.7. Discussion and conclusion
AP belongs to the class of adaptive automata for the IPD. However, it differs from other adaptive strategies in how adaptation is achieved; AP's approach belongs squarely to the set of artificial intelligence approaches. Rather than adjusting parameters when computing responses, as most adaptive strategies do, AP uses an identification mechanism
dAP regards NEG as RAND. It still maximizes its score against NEG-like strategies, because the optimal strategy against both NEG and RAND is ALLD.
which acts as an expert system. Knowledge about different opponents is expressed in the form of 'If ..., then ...'; for example, if the opponent cooperates in 6 rounds, then it is determined to be ALLC. In this way, the information acquired and used can be expressed transparently, and thus AP can tell which strategy the opponent is using.
Recent years have seen many AI approaches applied to evolutionary game theory and the IPD, for example reinforcement learning, artificial neural networks, and fuzzy logic [Sandholm and Crites (1996); Macy and Carley (1996); Fort and Perez (2005)]. Computing a best response to an unknown strategy has been one of the objectives of these AI approaches. The problem is, in general, intractable because of its computational complexity, and finding the best response to an arbitrary strategy can be non-computable [Papadimitriou (1992); Nachbar and Zame (1996)]. Reinforcement learning, which is based on the idea that the tendency to produce an action should be reinforced if it produces favourable results and weakened if it produces unfavourable ones [Gilboa (1988); Gilboa and Zemel (1989)], is widely used to let automata learn from interaction with others. With respect to the IPD, several approaches have been developed to learn optimal responses to a deterministic or mixed strategy [Carmel and Markovitch (1998); Darwen and Yao (2002)]. However, computational complexity remains the main difficulty in applying these approaches in real IPD tournaments.
AP's identification mechanism is implemented in a simple way by making use of a priori knowledge, which greatly reduces the computational complexity and makes it practical for AP to respond to the opponent almost optimally. First, a priori knowledge about which strategies are more likely to appear in the IPD tournament is used to determine the identification set; the size of the identification set is restricted in order to reduce computational complexity. Second, a priori knowledge about how well different identification approaches work in a given environment is used to select an efficient identification approach, with which AP can avoid the risk of identification and maximize its payoff. Third, a priori knowledge about how to identify the opponent from the process of interaction is used to construct the identification rules. With these simple rules, the AP strategy is easy to understand.
It is obvious that the identification set can be extended to include more identifiable strategies; however, more computation is involved as the size of the identification set increases. We have to make a tradeoff between the wish to identify any strategy and the wish to develop a less complicated strategy. Compared to the NP-completeness
of those reinforcement learning approaches [Papadimitriou (1992)], AP's computational complexity lies between O(√n) and O(n), depending on the similarity of the strategies to be identified. Therefore, the algorithm of AP is suitable for real IPD tournaments.
An identification mechanism can also work in an environment with noise, where each strategy might, with some probability, misperceive the outcome of the game. Noise blurs the boundaries between different strategies. However, identification is still applicable if a small identification error is admitted. In this circumstance, we can set a threshold value such that the opponent is considered identified once the probability of misidentification falls below this value. Just as in the case of identifying RAND, the probability of mistakenly identifying a strategy decreases to zero as the process of observation and identification repeats.
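As a small numerical illustration (not from the chapter), consider distinguishing a deterministic strategy from RAND: assuming RAND cooperates with probability 0.5 on each move, it reproduces any fixed n-move signature with probability 0.5^n, so the number of rounds needed to push the misidentification probability below a chosen threshold follows directly.

import math

def rounds_needed(threshold: float, p_match: float = 0.5) -> int:
    """Smallest n such that p_match**n < threshold."""
    return math.ceil(math.log(threshold) / math.log(p_match))

# To push the chance of confusing RAND with a deterministic strategy
# below 1%, observe ceil(log(0.01)/log(0.5)) = 7 rounds (0.5**7 < 0.01).
print(rounds_needed(0.01))  # -> 7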
Information plays a key role in intelligent activities. In most circumstances, individuals with more information consequently gain an advantage over others. With an identification mechanism, strategies such as AP acquire information about their opponents and are, in this sense, more intelligent than simple strategies such as TFT or Pavlov. This type of strategy is suitable for modeling the decision-making processes of human beings, where learning and improvement happen frequently.
References
Carmel, D. and Markovitch, S. (1998). How to explore your opponent’s strategy
(almost) optimally, in Proceedings of the International Conference on Multi
Agent Systems, pp. 64–71.
Darwen, P. and Yao, X. (2002). Co-evolution in iterated prisoner’s dilemma with
intermediate levels of cooperation: Application to missile defense, Inter-
national Journal of Computational Intelligence and Applications 2, 1, pp.
83–107.
Fort, H. and Perez, N. (2005). The fate of spatial dilemmas with different fuzzy
measures of success, Journal of Artificial Societies and Social Simulation
8, 3.
Gilboa, I. (1988). The complexity of computing best response automata in re-
peated games, Journal of Economic Theory 45, pp. 342–352.
Gilboa, I. and Zemel, E. (1989). Nash and correlated equilibria: some complexity
considerations, Games and Economic Behavior 1, pp. 80–93.
Macy, M. and Carley, K. (1996). Natural selection and social learning in pris-
oner’s dilemma: co-adaptation with genetic algorithms and artificial neural
networks, Sociological Methods and Research 25, 1, pp. 103–137.
Nachbar, J. and Zame, W. (1996). Non-computable strategies and discounted
repeated games, Economic Theory 8, pp. 103–122.
Papadimitriou, C. (1992). On players with bounded number of states, Games and
Economic Behavior 4, pp. 122–131.
Sandholm, T. and Crites, R. (1996). Multiagent reinforcement learning in the
iterated prisoner’s dilemma, Biosystems 37, 1-2, pp. 147–166.
Chapter 5
An Immune Adaptive Agent for the Iterated Prisoner’s
Dilemma
Oscar Alonso, Fernando Nino
National University of Colombia
5.1. Introduction
The Prisoner’s Dilemma [Tucker (1950)] is a game in which two players have
to decide between two options: cooperate, doing something that is good for
both players, and defect, doing something that is worse for the other player
but better for oneself. No pre-play communication is permitted between
the players. The dilemma arises because no matter what the other does,
each player will do better defecting than cooperating, but as both players
defect, both will do worse than if both had cooperated [Alonso et al.]. The
payoff obtained by each player is given by a payoff matrix, as shown in
table 5.1. The first number in each cell represents the payoff for the row
player, and the second value represents the payoff for the column player.
Table 5.1. Payoff matrix
        C       D
  C   3, 3    0, 5
  D   5, 0    1, 1
When the game is played several times between the same players, and
the players are able to remember past interactions, it is called the Iterated
Prisoner’s Dilemma (IPD). Each player is said to have a strategy, i.e., a way
to decide its next move depending on previous interactions. Accordingly,
complex patterns of strategic interactions may emerge, which may lead to
exploitation, retaliation or mutual cooperation.
The Iterated Prisoner’s Dilemma game has attracted the interest of
many researchers in a wide set of fields, including game theorists, social
105
scientists, economists and computer scientists [Axelrod (1984); Angeline (1994);
Hofstadter (1985); Yao and Darwen (1994)]. From the computational point
of view, there has been a deep interest in the development of effective strate-
gies for the IPD game [Yao and Darwen (1994); Axelrod (1984); Delahaye
and Mathieu (1995)]. Most well-known IPD strategies have been proposed
by humans, specifying the decision rules that a player will follow depend-
ing on the opponent's behaviour [Beaufils et al. (1997); Nowak and Sigmund
(1993)]. Clearly, this has mainly depended on the researcher’s assumptions
about the game. In a first computational approach, Axelrod explored hu-
man designed strategies by confronting them through a tournament [Axel-
rod (1984)].
Conversely, there has also been some interest in obtaining IPD strate-
gies using evolutionary computation, coevolution, reinforcement learning
and other computational techniques, without explicitly specifying the de-
cision rules [Sandholm and Crites (1995); Darwen and Yao (1995)]. These
methods have found good IPD strategies, requiring little or no intervention
from a human. For instance, in Axelrod’s work, human-designed strategies
were compared to strategies obtained through evolution and coevolution
[Axelrod (1984)]. Further research has been done towards finding strate-
gies that generalise well without human intervention. Studies have focused
on coevolutionary approaches, since no human intervention is required in
the evaluation process. For instance, Darwen and Yao [Darwen and Yao
(1996)] proposed a speciation scheme in order to get a modular system that
played the IPD, in which coevolution and fitness sharing were used in order
to get a diverse population that played as a whole against the opponent.
The scheme showed a significant degree of generalisation.
The model proposed in this work falls into this second kind of method. Thus, the main goal of this research is to generate an agent which learns to play the IPD game and is able to adapt to the opponent's behaviour. Learning, memory and adaptation capabilities are argued to be desirable in an IPD agent; consequently, the agent is implemented using artificial immune networks, a computational technique inspired by the natural immune system that exhibits such capabilities.
The rest of this chapter is organised as follows. First, some funda-
mentals about artificial immune systems, namely, immune networks are
summarised. Subsequently, a general model for an adaptive agent is intro-
duced. Then, a specific immune-based model of this agent is explained in
detail. An implementation of the immune model was developed and some
experiments were carried out to validate the agent capabilities. The imple-
mented agent showed adaptation and learning; however, in some cases, the
immune agent exhibited a poor performance.
5.2. Immune network fundamentals
Antigens are substances capable of inducing a specific immune response.
They may be viruses, bacteria, fungi, or protozoa. They are invaders
assumed to cause harm in the body. However, an antigen may be harmless,
such as grass pollen [Jonathan (2001)].
On the other hand, antibodies are proteins found in the blood, produced
by specialised white blood cells, called B-cells. B-cells make antibodies
when the body recognises that something foreign (antigen) is present. An-
tibodies are the antigen-binding proteins that are present on the B-cell
membrane. They are also secreted by plasma cells.
The affinity between an antigen and an antibody is given by the com-
plementarity of their binding proteins. If the antigen/antibody affinity is
higher than an affinity threshold, the corresponding B-cell becomes stim-
ulated. In the early stages of the immune response, the affinity between
the antibodies and antigens may be low, but as the B-cells undergo clonal
selection, the binding B-cells mutate and clone again and again to improve
the affinity of the binding between a particular antigen and a B-cell. Then,
the mature and activated B-cells produce plasma cells, which secrete antibodies with a high affinity for the antigen.
The Immune Network Theory tries to explain the way in which a natural
immune system achieves immunological memory [Perelson and Weisbuch
(1997)]. Jerne [Jerne (1974)] hypothesised that the immune system is a
regulated network of molecules and cells that recognise one another even in
the absence of antigens, rather than being a set of isolated cells that respond
only when stimulated by antigens. Though in immune network theory
the main elements are B-cells, most models only consider the antibodies
attached to the B-cell membranes. Therefore, here only antibodies will be
considered.
The basic idea behind immune network theory is that antibodies are
stimulated not only by antigens, but also by other antibodies, allowing the
generated antibodies to be preserved over time for future encounters with
the same or similar antigens. Therefore, when the same antigen reappears,
the immune response is faster, since the immune system already contains
suitable antibodies to deal with such antigen. This is known as secondary
response, which is depicted in figure 5.1 [Jonathan (2001)].
Fig. 5.1. Secondary Response. The amount of antibodies is greater and the response
time is shorter when the antigen is presented for the second time to the immune system.
Even though antibodies stimulate each other, there is also a suppres-
sion relation between them, which controls the size of the network. Thus,
the network structure is a result of the interactions among antibodies. A
graphical representation of an immune network model is shown in figure
5.2.
An Artificial Immune Network (AIN) is a computational model based
on immune network theory. In a broad sense, immune networks are mainly
suitable to solve clustering and classification problems, due to their natural
dynamics by which affine antibodies stimulate each other, thus forming
clusters of antibodies with similar features. Typically, an immune network
is stimulated by a set of antigens, corresponding to input data to a problem,
and the resulting structure of the immune network will give the solution to
the related problem [Castro and Zuben (2000)].
When an antigen is presented to the AIN, the internal dynamics of the
AIN develops antibodies with high affinity to the antigen, through a process
called affinity maturation. This process implies selection of high affinity
antibodies and a mutation process called somatic hypermutation; this is
Fig. 5.2. Immune Network Theory
an evolutionary process that, in a short period of time, evolves antibodies capable of dealing with the presented antigen.
Several computational models for immune networks have been proposed,
which are mainly derived from aiNet, a model used for optimisation and
data clustering proposed by de Castro, and RAIN, a model proposed by
Timmis, also used for data analysis [Castro and Zuben (2000); Castro
(2003)].
In the RAIN model, the resulting set of antibodies exhibits a spatial
distribution that reflects the data concentration in the data space. On the
other hand, the result of the aiNet model does not present this behaviour,
as highly concentrated data are considered redundant and then eliminated.
In the aiNet model the interaction among antibodies leads to network sup-
pression, i.e., antibodies that are affine (close) will suppress each other in
order to control the size of the network and eliminate redundant informa-
tion. Consequently, in this work the aiNet model will be used. Notice that
this model does not consider stimulation among antibodies.
When using an immune network to solve a problem, it is necessary to
specify the following aspects:
(1) identify the entities of the problem and find the corresponding elements in an immune network, i.e., antibodies and antigens;
(2) define an appropriate representation of such elements;
(3) define an affinity measure between antigens and antibodies, and among antibodies themselves; and
(4) establish the algorithms that model the behaviour of the immune network.
5.3. A general adaptive agent model
In this section, a general adaptive agent model to play the IPD game is
proposed. The model is based on trying to figure out the opponent’s strat-
egy, which is further used to determine the next move of the agent. The
information about the recent history of the game is used to model the op-
ponent’s strategy. Accordingly, in order to decide the next move, the IPD
agent will accomplish the following three phases:
(1) Recognition of the opponent’s strategy
(2) Development of a good strategy to face the opponent
(3) Selection of the next move to play
In the first phase, the Agent attempts to guess the strategy the opponent
is playing, based on the recent history of moves from both players. As a
result of this phase, an IPD strategy which resembles the behaviour of the
opponent is obtained, which will be used in the next stage. In the second
phase, the Agent generates a strategy which obtains a good score when faced with the strategy generated in the first phase. Finally, in the third
phase, the Agent uses the strategy obtained in the second phase to decide
its next move.
The adaptive IPD agent consists of a memory, a recognition module,
strategy generation module and a decision module (see figure 5.3), which
are explained next.
• The memory is responsible for storing the recent history of moves played by both the agent and the opponent.
• The recognition module is responsible for recognising the opponent's strategy based on the recent history; it produces a strategy that resembles the opponent's.
• The strategy generation module is responsible for generating a strategy which obtains a good score when faced with the strategy that resembles the opponent's.
• The decision module is responsible for using the strategy obtained by the strategy generation module to decide the next move that the agent will play.

Fig. 5.3. General model of the IPD agent
Though the model may look simple at first, it should be emphasised
that the implementation of each one of the modules is not trivial. The
recognition module should try to infer the strategy that the opponent is
playing, which may be a difficult task. Also, the strategy generation module
should be able to adapt to the changes in the opponent’s strategy.
5.4. Immune agent model
The definition of a particular agent based on the general model presented
above requires the stipulation of each module, as well as the representation
that will be used for the strategies.
The recognition of the opponent and the generation of a good strat-
egy against it require adaptability and learning. Additionally, it would
be desirable to preserve the strategies generated, which implies a memory
mechanism. For these reasons, Artificial Immune Networks are used to im-
plement the recognition and strategy generation modules. The structure of
the general IPD agent can be seen in figure 5.4, and the global IPD decision
making process is described in algorithm 5.1.
Fig. 5.4. Structure of the immune agent
Algorithm 5.1. Decision making algorithm
Decision making
1 while playing
2 do
3 Present history to the recognition AIN
4 Find recognised strategy from the recognition AIN
5 Present recognised strategy to strategy generation AIN
6 Find best payoff strategy from strategy generation AIN
7 Obtain suggested next move from best payoff strategy
8 Play next move
5.4.1. Strategy representation
First, each strategy is represented using a look-up table [Axelrod (1984)]. This representation indicates the next move to play, based on the n previous moves of both players. The representation consists of a vector of moves, where each position in the vector indicates the next move to be played given a specific history of the game. Thus, there are 2^(2n) possible histories given a memory of n previous moves. Additionally, since there is no initial history, this representation requires 2n assumed pre-game moves at the beginning of the game. Hence, the total length of the vector of moves is 2^(2n) + 2n, and given that each position of the vector has 2 possible values, cooperate and defect, the number of strategies that can be represented is 2^(2^(2n) + 2n). An example of a look-up table is shown in figure 5.5.
Fig. 5.5. Example of a Look up Table representing the strategy TFT
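The look-up-table representation can be sketched in code as follows. This is a hypothetical encoding for illustration; the chapter does not fix a particular bit ordering, so the interleaving of the two players' moves and all names here are assumptions.

from dataclasses import dataclass
from typing import List

C, D = 1, 0  # cooperate / defect

@dataclass
class LookUpTable:
    n: int               # memory length (previous rounds remembered)
    table: List[int]     # 2**(2n) next-move entries
    pregame: List[int]   # 2n assumed pre-game moves (n own, n opponent)

    def next_move(self, my_hist: List[int], opp_hist: List[int]) -> int:
        # Pad short histories with the assumed pre-game moves.
        mine = (self.pregame[: self.n] + my_hist)[-self.n:]
        theirs = (self.pregame[self.n:] + opp_hist)[-self.n:]
        index = 0
        for m, o in zip(mine, theirs):  # two bits per remembered round
            index = (index << 2) | (m << 1) | o
        return self.table[index]

# A memory-1 table encoding TFT: repeat the opponent's last move,
# regardless of one's own. The pre-game moves [C, C] make it open
# with cooperation. Total vector length: 2**(2*1) + 2*1 = 6.
tft = LookUpTable(n=1, table=[D, C, D, C], pregame=[C, C])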
5.4.2. Memory
The memory of the agent is represented by 2 vectors, containing the last k
moves played by the agent and the opponent.
5.4.3. Recognition module
An antibody of the Recognition AIN is represented by an IPD strategy. The Recognition AIN receives as an antigen the history of recent moves of both the opponent and the agent itself.
As the agent should obtain a strategy similar to the opponent's, the antibodies are stimulated according to their similarity with the opponent. This is measured by presenting the moves played by the agent to each strategy and comparing its responses with those of the opponent. The measure is given by the Hamming distance between the move sequences of the strategy and the opponent.
Additionally, the AIN model requires a measure of stimulation between antibodies, given by the similarity between the strategies. The similarity between two strategies is measured indirectly as follows: both strategies play against a randomly generated sequence of moves; the moves of the two strategies are then compared using the Hamming distance, and the percentage of coincidences determines their similarity. The interaction between antibodies leads to suppression, i.e., similar strategies suppress each other.
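Both affinity measures can be sketched in code as follows. This is a hypothetical reconstruction: play_against (returning the moves a strategy makes when the other side plays a given sequence) is an assumed helper, not part of the authors' code.

import random

def affinity_to_opponent(strategy, my_moves, opp_moves) -> float:
    """Antigen/antibody affinity: replay the agent's moves to a candidate
    strategy and count how often its replies coincide with what the
    opponent actually played (1 minus the normalised Hamming distance)."""
    replies = strategy.play_against(my_moves)
    matches = sum(a == b for a, b in zip(replies, opp_moves))
    return matches / len(opp_moves)

def similarity(strat_a, strat_b, length: int = 50) -> float:
    """Antibody/antibody affinity: confront both strategies with the same
    random move sequence and compare their responses move by move."""
    seq = [random.randint(0, 1) for _ in range(length)]
    a_moves = strat_a.play_against(seq)
    b_moves = strat_b.play_against(seq)
    return sum(x == y for x, y in zip(a_moves, b_moves)) / length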
After presenting the history of recent moves to the Recognition AIN, the most stimulated antibody is taken, because it represents the strategy which best resembles the opponent.
A summary of the representation of the elements in the recognition AIN
is given in table 5.2.
Table 5.2. Recognition immune network representation
Immune network Representation
Antigen History of moves
Antibody IPD strategy
Antibody/Antigen affinity Similarity between the strategy and the opponent's
Antibody/Antibody affinity Similarity between the strategies
In addition, the process of affinity maturation requires the strategies
to be mutated. Particularly, strategies will be mutated in two fashions.
The first one consists of changing the number of previous interactions re-
membered by the strategy (memory length), and the second one consists
of mutating each position of the vector that defines the strategy according
to the mutation rate.
The process of changing the memory length is performed as follows: the new length is selected randomly between one and the maximum allowed memory length. If the new length is the same as the old one, nothing needs to be done. If it is longer, the new positions of the vector are filled in such a way that the strategy encodes the same decision rules as before. This operation is shown in figure 5.6.
Fig. 5.6. Example of mutation when new LuT memory length is larger

If the new history length is shorter than before, the process is performed as follows: notice that there are four histories which differ only in the last move, so removing the last move causes those four histories to be compressed into one. The value of the compressed history is therefore the value held by the majority of the corresponding histories in the original vector; if there is a tie, it is resolved as Defect. This operation is shown in figure 5.7 and sketched in code below.
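A sketch of the compression step, reusing the interleaved encoding of the LookUpTable sketch above; which remembered round is dropped (here, the one stored in the two lowest bits of the index) is an assumption of that encoding.

def shrink_memory(table: list) -> list:
    """Compress a 2**(2n)-entry table into a 2**(2(n-1))-entry one.

    Each group of four consecutive entries corresponds to histories that
    differ only in the dropped round; the compressed entry is decided by
    majority vote, with a 2-2 tie resolved as Defect (0)."""
    compressed = []
    for i in range(0, len(table), 4):
        votes = sum(table[i : i + 4])          # number of Cooperate entries
        compressed.append(1 if votes >= 3 else 0)
    return compressed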
5.4.3.1. Immune network model
In the aiNet model, all the antigens are known a priori, and they are presented to the network repeatedly until the structure of the network adapts to the antigen set. In the proposed IPD agent, in contrast, the opponents are not known a priori, and the agent has to adapt to them as they appear. Accordingly, to deal with this problem, a slightly modified version of the aiNet algorithm is used.
The main modification of the aiNet algorithm is introduced in the mech-
anism used by the network to add antibodies to the memory. An antibody
interacts with the antibodies that have been already memorised. If the sup-
pression it receives from memorised antibodies is less than the suppression
threshold, it is added to the memory and will never be removed. Notice
that if an antibody is suppressed by the memorised ones, it means that
an antibody capable of recognising such antigen is already present in the
Fig. 5.7. Example of mutation when new LuT memory length is shorter
memory. Thus, in order to avoid redundancy, this new antibody is not
added to the memory.
When a new opponent starts playing a game, there is not enough information to consider that the recognised antibodies correspond to the opponent; therefore, adding antibodies at the very beginning of the game is not a good idea. Additionally, since the agent confronts the same opponent over many moves, it is not necessary to add antibodies to the memory at every move, given that the history of moves does not change significantly with a single new move. Thus, it is more efficient to add the generated antibodies periodically, every k moves.
The modified version of the aiNet algorithm is summarised in algorithm
5.2.
Algorithm 5.2. Modified aiNet algorithm
Modified aiNet
1 for each antigen
2 do
3 Add new random antibodies to the network
4 Calculate antigen/antibody affinity
5 Select the n antibodies with highest affinity
6 Clone and hypermutate selected antibodies
7 Re-calculate antigen/antibody affinity
8 Re-select a percentage of highest affinity antibodies
9 Remove low affinity antibodies
10 Calculate suppression among antibodies
11 Remove highly suppressed antibodies
12 Add resultant antibodies to the memory
In the algorithm, the affinity (suppression) of the antibodies is normalised in the interval [0,1]. It is then considered low (high) relative to an affinity (suppression) threshold, which is a parameter of the algorithm. Additionally, in the hypermutation process, the mutation rate is inversely related to the antibody's affinity; in particular, it was defined as 1 - affinity. This means that high-affinity antibodies are mutated less than low-affinity antibodies, which helps to keep good antibodies while exploring new regions of the search space. When a new antigen is presented, the network dynamics develop antibodies with high affinity (similarity) to it.
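One antigen presentation of the modified aiNet loop might look as follows. This is a compact sketch under assumed helper functions (affinity, similarity, mutate, new_antibody) and the parameter values reported in section 5.5; it is not the authors' implementation, and the low-affinity removal step is folded into the re-selection for brevity.

import random

def ainet_step(antigen, network, memory,
               affinity, similarity, mutate, new_antibody,
               n_select=5, reselect=0.2, n_random=4, sup_thresh=0.8):
    """Process one antigen: select, clone, hypermutate, suppress, memorise."""
    network.extend(new_antibody() for _ in range(n_random))
    # Select the n antibodies most stimulated by the antigen.
    network.sort(key=lambda ab: affinity(antigen, ab), reverse=True)
    clones = []
    for ab in network[:n_select]:
        # Hypermutation rate is 1 - affinity: strong binders change
        # little, weak binders explore more of the search space.
        rate = 1.0 - affinity(antigen, ab)
        clones += [mutate(ab, rate) for _ in range(random.randint(5, 10))]
    # Re-select the best fraction of the clones.
    clones.sort(key=lambda ab: affinity(antigen, ab), reverse=True)
    survivors = clones[: max(1, int(reselect * len(clones)))]
    # Suppression: a survivor enters the (permanent) memory only if it is
    # not too similar to an antibody that is already memorised.
    for ab in survivors:
        if all(similarity(m, ab) < sup_thresh for m in memory):
            memory.append(ab)
    network.extend(survivors)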
5.4.4. Strategy generation module
For this module, the antibodies are also represented as game strategies. In this case, the strategy obtained in phase one is presented as an antigen to the second AIN. As the agent is interested in obtaining a good strategy against the one obtained in the first phase, the antibodies are stimulated according to the result of a short IPD game between the antigen and each antibody, starting from the current history of the game between the agent and the opponent. The affinity between antibodies, the mutation operator and the immune network algorithm are defined in the same way as in the Recognition AIN.
Therefore, the most stimulated antibody corresponds to the best strat-
egy against the one that resembles the opponent, and is selected as the
output of this phase.
A summary of the representation in the strategy generation AIN is
shown in table 5.3.
Table 5.3. Strategy generation immune network representation
Immune network Representation
Antigen IPD strategy
Antibody IPD strategy
Antibody/Antigen affinity Payoff of a short IPD game
Antibody/Antibody affinity Similarity between the strategies
5.4.5. Decision module
Once a good strategy against the opponent has been found, it is used to
look up the next move that the agent will play, given the recent history of
the game.
5.5. Experimental results
Some experiments were carried out in order to explore the capabilities of
the proposed agent. All the experiments used a payoff matrix where Temp-
tation=5, Punishment=1, Sucker’s Payoff=0 and Reward=4.
The values of the parameters of an immune network affect some aspects
of it, such as the number of antibodies of the network and the performance
of the affinity maturation process. After testing several values for the pa-
rameters, the following were found to give the agent good behaviour:
the suppression threshold was 0.8, and the affinity threshold was 0.9; the
number of stimulated antibodies that were selected in each iteration was 5,
and the percentage of stimulated antibodies that were selected after being
cloned and hypermutated was 20%. In each iteration of the immune net-
works, four new random antibodies were added to the network. In clonal
selection, the minimum number of clones that a stimulated antibody could generate was 5, and the maximum was 10. New antibodies were
added to the memory of each network every 20 moves.
In the recognition process, the length of the history of moves was 10,
and the maximum length of the memory of the lookup table representation
was set to 3 previous moves.
The experiments were designed to answer some key questions about the
agent’s capabilities, which are addressed in the following subsections.
5.5.1. Can the agent adapt to a new opponent?
In order to test the adaptability of the immune agent when confronting one
opponent, it was faced with opponents playing the well-known strategies TFT,
ALLD, Pavlov and GRIM. The length of the game was 100 moves, and there
were 100 repetitions for each opponent. The average score obtained in this
experiment is shown in figure 5.8.
[Figure 5.8: four plots of average payoff (0 to 6) against move number (0 to 100), one panel per opponent: (a) adaptation for TFT, (b) adaptation for ALLD, (c) adaptation for Pavlov, (d) adaptation for GRIM; each panel compares the agent's payoff with the optimal payoff.]
Fig. 5.8. Adaptability tests. Optimal is obtained from mutual cooperation in (a), (c) and (d), and mutual defection in (b).
As can be seen, the agent adapts its behaviour to that of the opponent: the mean payoff increases over the first 20 moves and then stabilises.
5.5.2. Can the agent adapt to consecutive opponents?
The agent was confronted with two opponents one after the other, in order
to evaluate the adaptability of the agent to further opponents (i.e. not
only the first opponent it confronts). The results for consecutive opponents
playing ALLD-TFT and PAVLOV-GRIM can be seen in figure 5.9.
[Figure 5.9: two plots of average payoff (0 to 6) against move number (0 to 200), one for each consecutive-opponent pair: adaptation for ALLD-TFT and adaptation for Pavlov-GRIM; each compares the agent's payoff with the optimal payoff.]
Fig. 5.9. Tests of adaptation to consecutive opponents
Experimental results showed that the immune agent adapts to every new opponent it confronts. Moreover, the curves described by the mean payoff are very similar to those found in the first experiment, which shows that the agent preserves its adaptability through multiple games.
5.5.3. Can the agent remember previous opponents?
Since immune networks possess a memory mechanism, this was evaluated
in the agent. In this setup, the agent first confronts an opponent, then it
is faced with another opponent and once again it is confronted by the first
opponent. In this case two experiments were performed: the first confronted TFT-ALLD-TFT, and the second Pavlov-GRIM-Pavlov-GRIM. Again, 100 repetitions of the experiment were carried out, and the length of every game was 100 moves. The average payoff can be seen in figure 5.10.
The results showed that the mean payoff curves stabilised faster the second time the agent faced an opponent, as a result of the agent's memory capability. However, the value at which the payoff stabilises did not increase.
5.5.4. Results from the IPD competition
The agent proposed in this chapter participated in the IPD competitions held at CEC 2004 and CIG 2005, under the name 'Immune Based Agent'. It competed twice in the first competition, where it was ranked 126th and 160th out of 223. In the second competition, it participated in category 4 (one entry per participant) and was ranked 40th out of 50.
5.6. Discussion
Experimental results show that the proposed agent exhibits the expected behaviour: it adapts its behaviour to the opponents it confronts in order to increase its payoff, and it is also able to remember its interactions with opponents in order to recognise them faster in future encounters. However, the following was observed in the agent's behaviour:
• The payoff stabilises at a mean value which is less than the best possible payoff.
• For some opponents, such as GRIM, the performance of the agent is very poor: it obtains a payoff much lower than the best possible.
[Figure 5.10: two plots of average payoff (0 to 6) against move number, one for TFT-ALLD-TFT (0 to 300 moves) and one for Pavlov-GRIM-Pavlov-GRIM (0 to 400 moves); each compares the agent's payoff with the optimal payoff.]
Fig. 5.10. Test of memory of previously met opponents
Some explanations for the agent's behaviour can be hypothesised as follows:
April 16, 2007 10:32 World Scientific Review Volume - 9in x 6in chapter5
An Immune Adaptive Agent for the Iterated Prisoner’s Dilemma 123
• The recognition module finds a strategy similar to the opponent's, which is good enough in most cases. However, a given history of moves could correspond to several opponent strategies, which makes it very difficult for the recognition module to find the exact strategy that the opponent is playing. For instance, a history where all previous moves are COOPERATE could correspond to players playing TFT or ALLC, and the best response is different in each case.
• Since the recognition process is imperfect, the strategy found by the strategy generation module may not be the most adequate, and could lead the agent to make bad decisions. This produces a non-optimal payoff and, with some strongly retaliative opponents, may lead to mutual defection and low payoffs. This explains why the immune agent does not obtain a good payoff against strategies such as GRIM: it tries to take advantage of the opponent and, consequently, receives strong retaliation from GRIM.
• The model does not implement a feedback mechanism that could help determine how good the selected strategy is. Notice that the agent knows the history of moves, but it does not analyse whether the strategy it is currently using is good or bad, so it cannot change a strategy that is performing badly.
The experiments also show that since the agent does not reach the best
possible payoff, it is slightly exploited by very uncooperative strategies,
such as ALLD.
An analysis of the performance of the strategy during the competition showed that the agent frequently evolved to mutual defection against opponents that were not fully uncooperative, such as go-by-majority. There were also some cases where the agent was exploited by some opponents, probably due to the perception limitations of the agent described above. As a consequence, the agent performed poorly in the competition.
5.7. Conclusions
This work presented an agent model that played the IPD game. The model
is based on artificial immune systems in order to achieve adaptability, learn-
ing and memory.
Some experiments were carried out to evaluate the behaviour of the proposed agent. The results showed that the agent presents the expected capabilities: it adapted its own behaviour to suit the opponent's, going through a learning process which increased the mean payoff until it reached a stable value. Additionally, the learning process was faster when the agent met an opponent for the second time, which evidenced a memory mechanism.
However, although the mean payoff increased and stabilised due to the learning process, it did not reach the optimum value. Additionally, against some strategies such as GRIM, the agent did not even obtain a payoff close to the best possible. This shows that although the agent performs as expected, it still needs to be tuned in order to avoid poor performance in some special cases.
Particularly, the proposed model could be modified by using different
computational techniques, such as evolutionary algorithms, to implement
some of the modules. It may also be extended to include multiple levels
of cooperation and multiple opponents. Additionally, the agent could be
endowed with a feedback mechanism, such as reinforcement learning.
References
Alonso, O., Nino, F. and Velez, M. (2004). A robust immune based approach
to the iterated prisoner’s dilemma, in Proceedings of the 3rd International
Conference on Artificial Immune Systems, pp. 290–301.
Angeline, P. J. (1994). An alternate interpretation of the iterated prisoner’s
dilemma and the evolution of non-mutual cooperation, in Proceedings 4th
Artificial Life Conference, pp. 353–358.
Axelrod, R. (1984). The Evolution of Cooperation (Basic Books, New York, USA).
Beaufils, B., Delahaye, J.-P. and Mathieu, P. (1997). Our meeting with gradual:
A good strategy for the iterated prisoner’s dilemma, in Artificial Life V
(Proceedings of the Fifth Int’l Workshop on the Synthesis and Simulation
of Living Systems) (MIT Press), pp. 202–209.
Castro, L. N. D. (2003). The immune response of an artificial immune network
(ainet), in Congress on Evolutionary Computation (CEC’03) (Canberra),
pp. 146–153.
Castro, L. N. D. and Zuben, F. J. V. (2000). An evolutionary immune network
for data clustering, in IEEE Brazilian Symposium on Artificial Neural Net-
works (Rio de Janeiro), pp. 84–89.
Darwen, P. and Yao, X. (1995). On evolving robust strategies for iterated pris-
oner’s dilemma, in Progress in Evolutionary Computation, Lecture Notes in
Artificial Intelligence, Vol. 956, pp. 276–292.
Darwen, P. and Yao, X. (1996). Automatic modularization by speciation, in Proc.
of the 1996 IEEE Int’l Conf. on Evolutionary Computation (ICEC’96)
(IEEE Press, Nagoya, Japan), pp. 88–93.
Delahaye, J.-P. and Mathieu, P. (1995). Complex strategies in the iterated pris-
oner’s dilemma, in A. Albert (ed.), Chaos and Society, Frontiers in Arti-
ficial Intelligence and Applications, Vol. 29 (IOS Press, Amsterdam), pp.
283–292.
Hofstadter, D. R. (1985). The prisoner’s dilemma computer tournaments and the
evolution of cooperation, in Metamagical Themas: Questing for the essence
of mind and pattern (Basic Books, New York).
Jerne, N. K. (1974). Towards a network theory of the immune system, Ann.
Immunol. 125, pp. 373–389.
Jonathan, T. (2001). Artificial Immune Systems: A novel data analysis technique
inspired by the immune network theory, Ph.D. thesis, University of Wales,
Aberystwyth, Wales.
Nowak, M. A. and Sigmund, K. (1993). A strategy of win-stay lose-shift that
outperforms tit-for-tat in the prisoner’s dilemma game, Nature 364, pp.
56–58.
Perelson, A. S. and Weisbuch, R. (1997). Immunology for physicists, Rev. Modern
Physics 69, pp. 1219–1267.
Sandholm, T. and Crites, R. (1995). Multiagent reinforcement learning in the
iterated prisoner’s dilemma, BioSystems: Special Issue on the Prisoner’s
Dilemma 37, pp. 147–166.
Tucker, A. W. (1950). A two person dilemma, unpublished notes, Stanford University.
Yao, X. and Darwen, P. J. (1994). An experimental study of n-person iterated
prisoner’s dilemma games, Informatica 18, pp. 435–450.
Chapter 6
Exponential Smoothed Tit-for-Tat
Michael Filzmoser
University of Vienna
Reciprocating strategies, such as Tit-for-Tat, have been shown to be very successful in IPD tournaments without noise, while other tournaments and analytical studies show that they perform rather poorly in noisy environments. The implementation of generosity or contrition into reciprocating strategies has been proposed as a solution to this poor performance. We propose a third possibility, a relaxation of the provocability property of reciprocating strategies, which we design by exponential smoothing. This chapter explores how exponential smoothing and Tit-for-Tat can be combined into 'Exponential Smoothed Tit-for-Tat' strategies for the Iterated Prisoners' Dilemma, and how these strategies perform in competitions with and without noise compared to Tit-for-Tat.
6.1. Introduction
Robert Axelrod (1980a,b, 1984) was the first to perform computer tour-
naments of the Iterated Prisoners’ Dilemma (IPD). In these tournaments
strategies played the Prisoners’ Dilemma repeatedly with additional infor-
mation about the history of their own moves as well as of the moves of
the opponent strategy. In two tournaments with 14 and 62 entries respec-
tively the winner both times was Tit-for-Tat (TFT), submitted by Anatol
Rapoport, the simplest of all participating strategies. TFT starts with coop-
eration and afterwards mirrors the opponent’s move of the previous round.
Niceness and provocability were identified as important properties of successful strategies in the IPD, and both are embodied in TFT. Niceness in this context denotes that a strategy should never be the first to defect, while provocability denotes that an 'uncalled for' defection of the opponent should be punished immediately by a defection (Axelrod, 1980b).
127
In recent years the original IPD has been extended by the integration
of ’noise’. Noise in the context of IPD can either denote measurement er-
rors — a strategy receives incorrect information that its opponent defected
while it actually cooperated and vice versa — or implementation errors —
a strategy which is intended to cooperate in a given situation erroneously
defects and vice versa (Bendor, 1993).a Axelrod and Wu (1995) state that
noise is an important feature of real-world interaction as errors in the imple-
mentation of choice can never be completely excluded. It has been shown
analytically (Molander, 1985; Bendor, 1993) as well as by further IPD tour-
naments which incorporated noise (Donninger, 1986; Bendor et al., 1991)
that the existence of noise undermines the performance of reciprocating
strategies like TFT dramatically. Bendor, Kramer and Stout (1991) argue
that a main reason for the poor performance of TFT in noisy environments
is the unintended involvement in vendettas of mutual or alternating defec-
tion with other nice and provocable strategies, which can be caused by one
single implementation error on either side.
For coping with noise Axelrod and Wu (1995) propose to make recipro-
cating strategies more generous or more contrite. Generosity denotes that
some of the opponent’s defections are not punished as they could be the
result of noise. Such generosity can of course easily be exploited, but it prevents a single error from echoing throughout the whole game and therefore can maintain mutual cooperation among reciprocating strategies.
Contrition on the other hand means that a defection as a reaction to a
defection of the opponent in the last round, which in turn occurred as an
answer to one’s own implementation error in the round before last, should
be avoided.
While generosity can be conceived as a correction of the opponent's implementation errors, contrition can be interpreted as the correction of one's own implementation errors in a noisy environment. However, both of these refinements of reciprocating strategies for noisy environments are one-sided, insofar as each focuses on correcting only one's own or only the opponent's implementation errors; neither concept attempts to correct both kinds.
Moreover the mitigation of the provocability property of reciprocating
strategies takes place in a rather indiscriminate way by an increase of gen-
erosity or the implementation of contrition. In an effort to improve the
aWe focus exclusively on implementation errors, as this was the category of noise implemented in the IPD tournament of G. Kendall, P. Darwen, and X. Yao performed in April 2005, on which this study is based (see http://www.prisoners-dilemma.com).
performance of reciprocating strategies against other reciprocating strategies, one must not neglect the existence of non-reciprocating
strategies. Such strategies could capitalize on the combination of noise and
generosity by infrequent but intentional defections. Moreover if the history
of the opponent’s moves consists of a series of defections, a single coopera-
tion, which could be an implementation error of the opponent, should not
induce a reciprocating strategy to switch from defection to cooperation. In
such a case of continuous defection of the opponent an increase of generosity
will only reduce the performance of a reciprocating strategy.
We share the opinion that a mitigation of the provocability property is
essential to overcome the comparatively poor performance of reciprocating
strategies like TFT in IPD tournaments with noise. To do so we propose a
third alternative beside generosity and contrition. We hold the view that
the whole history of the opponent’s moves as well as the misperceptions
should be taken into consideration by a reciprocating strategy in the deci-
sion to cooperate or defect. Generous or contrite reciprocating strategies as
proposed by Axelrod and Wu (1995) only take into account the last move
of the opponent and use some additional modification rules to adapt to the
situation of noise. The analysis of the entire series of moves of the opponent
should allow filtering out reactions to our own implementation errors as well
as the opponent’s implementation errors, which in turn should improve the
performance of a so-designed reciprocating strategy.
In section 6.2 we present exponential smoothing which we suggest as a
method to implement the concept of considering the whole history of moves
in the decision making process of reciprocating IPD strategies. Further-
more ’Exponential Smoothed Tit-for-Tat’ (ESTFT) strategies are developed.
Section 6.3 reports on the performance of the ESTFT strategies in an IPD
tournament in competitions with and without noise, and in comparison to
TFT. Section 6.4 summarizes the main results and concludes.
6.2. Exponential Smoothed Tit-for-Tat
The intention of ESTFT is to incorporate the two properties of TFT, niceness and provocability, which have been demonstrated to be important ingredients of successful strategies in the IPD without noise, and to mitigate provocability to adjust to the existence of noise. To do so, ESTFT uses exponential smoothing. Exponential smoothing was used by Tzafestas (2000) as the basis for the development of his meta-regulated adaptive TFT (a strategy that drops the cooperation rate when the opponent is perceived as cooperative and increases it otherwise) and by Ashlock et al. (1996) for memory weighting in a study on partner selection for the IPD. However, exponential smoothing has not
yet been applied to cope with the problem of noise in the IPD, which is
the focus of this study. In the next two subsections, first the concept of
exponential smoothing will be briefly presented and afterwards applied for
the design of exponential smoothed Tit-for-Tat strategies for competitions
with and without noise.
6.2.1. Exponential Smoothing
Exponential smoothing was originally a time series analysis approach, which
can be used for the analysis of time series that neither exhibit trend nor
seasonal components. It allows for weighting past – possibly not so im-
portant – observations differently from recent ones. From the original
time series X_t the exponentially smoothed time series S_t can be calculated
by (6.1). For the calculations it is necessary to indicate a starting value S_0,
as for the first period no observations of the original time series exist.

$$S_t = \begin{cases} S_0 & \text{if } t = 0,\\ (1-\alpha)\,S_{t-1} + \alpha X_{t-1} & \text{otherwise} \end{cases} \qquad (6.1)$$
In (6.1), α is the smoothing parameter that indicates the weight assigned
to the last observation.b The higher α, the lower the smoothing of the time
series: for α = 1 exponential smoothing reproduces the original time
series, while for α = 0 the smoothed time series is a constant, S_t = S_0.

b The notion 'smoothing parameter' is somewhat misleading, as a higher value for this
parameter leads to a stronger consideration of currently observed values and therefore
results in a less smoothed time series.
Exponential smoothing can be customized for the design of reciprocat-
ing or simple deterministic IPD strategies. We conceive the series of
transformations m_t of the opponent's moves as the observations to be
smoothed, where the opponent's moves are transformed into discrete
numbers by applying (6.2).

$$m_t = \begin{cases} 1 & \text{if the opponent's move in } t \text{ is 'cooperate'},\\ 0 & \text{if the opponent's move in } t \text{ is 'defect'} \end{cases} \qquad (6.2)$$
With the adapted exponential smoothing formula (6.3), different kinds of
simple deterministic strategies can be designed that either defect if S_t = 0
or cooperate if S_t = 1. The parameter combination S_0 = 1 and α = 1
exactly equals the TFT strategy (Tzafestas, 2000); with α = 0 and S_0 = 1
(respectively S_0 = 0), a constant series of cooperations (respectively de-
fections), and therefore an ALLC (respectively ALLD) strategy, can be mod-
elled. Many other combinations of the two variables, starting value S_0 and
smoothing parameter α, are possible, which allows a large number of
strategies to be modelled.

$$S_t = \begin{cases} S_0 & \text{if } t = 0,\\ (1-\alpha)\,S_{t-1} + \alpha m_{t-1} & \text{otherwise} \end{cases} \qquad (6.3)$$
We refer to the internal register S_t as the 'mood' of the strategy. This
mood is a continuous variable ranging from 0 – in the case of total defection
by the opponent – to 1 – for total cooperation of the opponent. Inter-
mediate values between these two extremes represent different degrees of
cooperation (closer to 1) and defection (closer to 0). In the spirit of TFT, the
next own move (either cooperate or defect) is derived from this mood by a
threshold rule (see section 6.2.2). Furthermore, we need an initial mood I –
an expectation about the opponent's behavior – to calculate the bounds
on the smoothing parameter α for the ESTFT strategies designed for the
competition with noise (see section 6.2.2). We derive I from the optimistic
assumption that the opponent strategy is cooperative or reciprocating and
therefore will cooperate when playing against ESTFT strategies, except for the ex-
pected 10% of implementation errors due to noise (i.e. I = 0.9).
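To make the mechanism concrete, the following is a minimal Python sketch of the mood update (6.3) together with the threshold rule introduced in section 6.2.2 below (equation (6.4)). The function name and the default starting mood of 1 are our own illustrative choices, not the chapter's implementation (section 6.2.2 notes that the starting value can be derived from α).

```python
# Illustrative sketch only: s0 = 1.0 is an assumed starting mood, not the
# value the chapter derives from alpha.
def estft_move(opponent_history, alpha, s0=1.0):
    """Return 'cooperate' or 'defect' from the smoothed opponent mood."""
    mood = s0  # S_0 for t = 0
    for move in opponent_history:
        m = 1.0 if move == "cooperate" else 0.0  # transformation (6.2)
        mood = (1 - alpha) * mood + alpha * m    # smoothing update (6.3)
    # threshold rule (6.4): cooperate while the mood is at least 0.5
    return "cooperate" if mood >= 0.5 else "defect"

# With alpha = 0.35, a single defection leaves the mood at 0.65 (cooperate),
# while two consecutive defections drop it to about 0.42 (defect).
print(estft_move(["defect"], alpha=0.35))            # cooperate
print(estft_move(["defect", "defect"], alpha=0.35))  # defect
```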
6.2.2. Strategies for Competitions with and without Noise
The ESTFT strategies are in fact a two-parameter family of IPD strate-
gies,c where the two decision parameters are i) the smoothing parameter α
and ii) the threshold rule determining for which values of S_t the strategy
should cooperate or defect. For all ESTFT strategies the threshold rule for
cooperation and defection in round t is determined as follows: for S_t ≥ 0.5
ESTFT cooperates, otherwise it defects (see (6.4)).

$$\text{move in } t = \begin{cases} \text{'cooperate'} & \text{if } 0.5 \le S_t \le 1,\\ \text{'defect'} & \text{if } 0 \le S_t < 0.5 \end{cases} \qquad (6.4)$$

c The starting value S_0 can be derived from α.

By defining one threshold rule for all ESTFT strategies, the only vari-
able parameter of these strategies is the α-value. For the ESTFT strategies
designed for the competition with noise, we demand two additional char-
acteristics to cope with the problem of noise in the IPD: i) they should
never defect in response to a single defection by the opponent, as this single
defection could be an implementation error by the opponent or a reaction
to an implementation error of the ESTFT strategy itself, and ii) they should
react with defection to two consecutive defections by the oppo-
nent, to avoid exploitation. These additional requirements
restrict the range of possible values for the smoothing parameter α. The
ranges implied by restrictions i) and ii) are calculated in (6.5) and
(6.6) respectively, for an initial mood of I = 0.9.
Restriction i): ESTFT should cooperate after a single defection.

$$S_t = (1-\alpha)I + \alpha m_{t-1} \ge 0.5$$

For m_{t−1} = 0 and I = 0.9:

$$\alpha \le 1 - \frac{0.5}{0.9} = 0.44444 \qquad (6.5)$$
Restriction ii): ESTFT should defect after two consecutive defections.

$$S_{t-1} = (1-\alpha)I + \alpha m_{t-2}, \qquad S_t = (1-\alpha)S_{t-1} + \alpha m_{t-1} < 0.5$$

For m_{t−2} = m_{t−1} = 0 and I = 0.9:

$$\alpha > 1 - \sqrt{\frac{0.5}{0.9}} = 0.25464 \qquad (6.6)$$
From (6.5) we derive that the ESTFT strategy will not defect after a single
defection (m_{t−1} = 0) of assumed cooperative or reciprocating strategies
(I = 0.9) when α ≤ 0.44444, as for such values of the smoothing parameter
S_t remains above the defection threshold of 0.5. An α > 0.25464 guarantees
that after two consecutive defections of the opponent (m_{t−2} = m_{t−1} = 0),
S_t lies below the threshold value and the ESTFT strategy therefore defects
(6.6).
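As a quick sanity check, the bounds (6.5) and (6.6) and the behaviour they guarantee can be verified numerically; the short Python snippet below assumes the initial mood I = 0.9 used in the text.

```python
# Numerical check of the bounds (6.5) and (6.6) for I = 0.9.
I = 0.9
alpha_upper = 1 - 0.5 / I             # restriction i):  about 0.44444
alpha_lower = 1 - (0.5 / I) ** 0.5    # restriction ii): about 0.25464
print(alpha_upper, alpha_lower)

# At any alpha between the bounds, one defection keeps S_t >= 0.5 while a
# second consecutive defection pushes it below the threshold.
alpha = (alpha_lower + alpha_upper) / 2
s1 = (1 - alpha) * I                  # mood after one defection (m = 0)
s2 = (1 - alpha) * s1                 # mood after a second defection
print(s1 >= 0.5, s2 < 0.5)            # True True
```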
Three ESTFT strategies, lowESTFT_noise, mediumESTFT_noise, and
highESTFT_noise, were designed for the competition with noise, using α-
values that represent the upper bound, the average between the two bounds
(α ≈ 0.34), and the lower bound, respectively. As large values of the
smoothing parameter α lead to a higher weighting of the current observa-
tions and thus to a less smoothed value, the bound induced by (6.5) is applied in
the lowESTFT_noise strategy, the bound induced by (6.6) in the highESTFT_noise
strategy, and the average of these bounds in the mediumESTFT_noise
strategy. For the competition without noise neither of the two restrictions
mentioned above is necessary, as the true move of the opponent can be ob-
served with certainty. We set smoothing parameters of α = 0.2 for
the highESTFT_classic, α = 0.35 for the mediumESTFT_classic, and
α = 0.5 for the lowESTFT_classic strategy respectively, and start with
cooperation. Due to the decision rule mentioned above, α-values above 0.5
would not change the result and are therefore omitted. The results these
strategies achieved compared to TFT in competitions with and without noise
are summarized in the next section.
6.3. Tournament Results
The IPD computer tournament organized by Graham Kendall, Paul Dar-
wen, and Xin Yao in April 2005 offered an excellent opportunity to test the
ESTFT strategies and to compare them to TFT in situations with and without
noise. In addition to the classical competition without noise (competition
1) – a re-run of Robert Axelrod’s original tournaments – a competition with
a 10% chance of noise in the form of implementation error (competition 2)
was conducted. The Java applet used to run the tournament,
as well as the entries, were based on the Java IPDLX software library; in addi-
tion, simple deterministic strategies with a history of at most three rounds
could be entered via a web interface. In each of the competitions five runs
were performed, and each of these runs lasted 200 rounds.
For each competition we calculate, for the ESTFT strategies and TFT, the
payoff against each specific opponent averaged over all five runs. Using
the average filters out random effects induced by noise
or by the strategies themselves (e.g. RAND). Furthermore, we consider only
the payoffs against the 141 strategies that are represented in both the classi-
cal competition and the competition with noise. This establishes a common
basis for analysis that allows us to perform paired tests on the difference
between the average payoffs against each of these 141 reference strategies.
Figure 6.1 presents box-whisker diagrams for the six ESTFT strategies and
TFT (the corresponding data are given in Table 6.1).
First we apply a non-parametric paired Wilcoxon test to test the dif-
ference in the payoffs of the ESTFT strategies and TFT between the com-
petition without noise and the competition with noise. The alternative
hypothesis that payoffs are higher in the competition without noise than
in the competition with noise can be accepted for all seven reciprocating
[Fig. 6.1. Box-whisker plot of the average payoff (averaged over the five runs) of the ESTFT and TFT strategies, by smoothing parameter, for the competition with noise.]
strategies. The results are highly significant as can be seen from Ta-
ble 6.2.d
Next we test the difference in the payoff between the ESTFT strate-
gies and TFT in the competitions without and with noise. Again we use a
non-parametric paired Wilcoxon test. The alternative hypothesis that the
payoffs of the focal ESTFT strategy are greater than the payoffs of TFT can
be accepted only for the lowESTFT_classic and mediumESTFT_classic
(p < 0.05).
From Table 6.2 we see that the tournament results reproduce what has
been argued analytically and already shown in previous IPD tournaments
(Molander, 1985; Donninger, 1986; Bendor et al., 1991; Bendor,
1993): reciprocating strategies like TFT or ESTFT are less successful in noisy
d In Tables 6.2 and 6.3 the column α gives the specific value of the smoothing
parameter for each strategy, µ the mean payoff averaged over the five runs per
competition, ± the standard deviation of payoffs, V the test statistic, and p the signifi-
cance of the non-parametric paired Wilcoxon test.
Table 6.1. Data for the box-whisker plot of the average payoff of the ESTFT and
TFT strategies for the competition with noise

strategy            | min   | 1st quartile | median | 3rd quartile | max
highESTFT_classic   | 232.6 | 255.4        | 432.8  | 500.2        | 613.4
highESTFT_noise     | 242.0 | 268.8        | 435.4  | 515.6        | 594.2
mediumESTFT_noise   | 241.0 | 260.2        | 430.6  | 514.0        | 585.6
mediumESTFT_classic | 243.0 | 270.6        | 430.4  | 505.0        | 593.2
lowESTFT_noise      | 240.0 | 264.2        | 435.4  | 513.0        | 597.6
lowESTFT_classic    | 245.8 | 300.6        | 428.0  | 512.2        | 586.4
TFT                 | 223.4 | 269.0        | 430.6  | 502.6        | 614.2
Table 6.2. Performance of the ESTFT and TFT strategies in environments with and
without noise

strategy            | α    | µ (no noise) | ± (no noise) | µ (noise) | ± (noise) | V       | p
highESTFT_classic   | 0.20 | 467.22       | 181.14       | 397.73    | 123.93    | 7,817.0 | < 0.0001
highESTFT_noise     | 0.26 | 470.97       | 180.20       | 399.73    | 117.57    | 8,131.0 | < 0.0001
mediumESTFT_noise   | 0.34 | 470.43       | 180.45       | 399.45    | 120.43    | 8,115.5 | < 0.0001
mediumESTFT_classic | 0.35 | 468.47       | 179.45       | 404.26    | 114.16    | 7,733.5 | < 0.0001
lowESTFT_noise      | 0.44 | 469.43       | 181.72       | 401.94    | 120.24    | 8,000.5 | < 0.0001
lowESTFT_classic    | 0.50 | 469.76       | 179.64       | 408.43    | 109.95    | 7,734.0 | < 0.0001
TFT                 | 1.00 | 467.49       | 181.06       | 400.39    | 121.78    | 7,615.0 | < 0.0001
Table 6.3. Comparison of the TFT and ESTFT strategies in environ-
ments with and without noise

strategy            | α    | V (no noise) | p (no noise) | V (noise) | p (noise)
highESTFT_classic   | 0.20 | 3.0          | 0.6054       | 4,779.0   | 0.6798
highESTFT_noise     | 0.26 | 449.0        | 0.9980       | 5,342.5   | 0.2443
mediumESTFT_noise   | 0.34 | 366.0        | 0.9998       | 4,962.0   | 0.5361
mediumESTFT_classic | 0.35 | 500.5        | 0.6830       | 6,071.0   | 0.0142
lowESTFT_noise      | 0.44 | 282.0        | 1.0000       | 5,308.5   | 0.1758
lowESTFT_classic    | 0.50 | 628.5        | 0.2342       | 5,961.0   | 0.0247
environments than in environments without noise. The comparison of the
performance of the ESTFT strategies with that of TFT for the competitions
without and with noise, summarized in Table 6.3, shows
two noteworthy results. First, while there are no significant differences in
performance between the ESTFT strategies and TFT in the case of no noise, the
ESTFT strategies lowESTFT_classic and mediumESTFT_classic are sig-
nificantly better (p < 0.05) in the presence of noise. Moreover, the three
ESTFT strategies designed for the competition with noise (lowESTFT_noise,
mediumESTFT_noise, and highESTFT_noise) were, though still less suc-
cessful than TFT, able to reduce the gap to TFT in the competition with
noise compared to the one without.
Above we mentioned that the α-value is the only parameter that varies
across the six ESTFT strategies. In the competition without noise the av-
erage performance of all ESTFT strategies except highESTFT_classic – the
strategy with the lowest α-value – exceeded the performance of TFT; how-
ever, these results are not significant according to the non-parametric paired
Wilcoxon tests (see Table 6.3). From Table 6.2 one can see that in the com-
petition with noise the three ESTFT strategies with the higher α-values
(mediumESTFT_classic, lowESTFT_noise, and lowESTFT_classic) reach
higher average payoffs than TFT, while the three ESTFT strategies with the lower
α-values (highESTFT_classic, highESTFT_noise, and mediumESTFT_noise)
reach lower average payoffs. Generally, the performance of the ESTFT strate-
gies in the competition with noise increases with the smoothing parameter.
That two ESTFT strategies designed for the classical competition with-
out noise outperformed TFT in the competition with noise while the ESTFT
strategies designed for the competitions with noise did rather poorly, does
not necessarily contradict the statements made above. We stated that an
unbalanced mitigation of the provocability property of reciprocating strate-
gies or too much generosity is insufficient to improve the performance of
reciprocating strategies in the IPD with noise. A one-sided reduction of
provocability, which focuses only on not punishing some of the opponent's
defections because they could be the direct or indirect result of implementation
errors, neglects the possibility of implementation errors in combination with
the opponent's cooperation, while too much generosity could lead to exploitation.
On the one hand, the ESTFT strategies for the competition with noise were
more generous, as they only defect when two consecutive defections of the
opponent occur or the smoothed value declines below the cooperation
threshold. On the other hand, highESTFT_classic probably smoothed too much,
which reduced its performance. The two strategies that outperformed TFT used
higher values of the smoothing parameter, which leads to a higher weight-
ing of the currently observed opponent moves and reduces the smoothing
effect.
6.4. Conclusions
Based on the shortcomings of existing approaches that attempt to improve the
poor performance of reciprocating strategies in the IPD with noise, we
suggest exponential smoothing as an approach that allows a bal-
anced mitigation of the provocability property of reciprocating strategies.
By exponential smoothing the whole series of the opponent’s moves rather
than only the previous move of the opponent can be taken into considera-
tion in the decision of cooperation or defection. Six ESTFT strategies were
designed and participated in an IPD tournament in competitions with and
without noise.
The results of the tournament show that in noisy environments the per-
formance of ESTFT strategies increases with the smoothing parameter, and
that mild exponential smoothing improves the performance of reciprocating
strategies. While exponential smoothing improves the ability of TFT to deal
with noise in the IPD, it still does not deal with it very well. Moreover, the
results indicate that our design concept for determining the smoothing param-
eters of the strategies for the competition with noise seems to be inadequate,
as in the competition with noise the strategies designed for the competition
without noise outperformed the strategies designed especially for this environ-
ment. The optimistic assumption concerning the initial mood of the ESTFT
strategies and the simplistic restrictions on the smoothing parameter α for
the strategies for the competition with noise may be the cause of this weaker
than expected performance. While we have found a seemingly promising way to
improve the performance of reciprocating strategies in noisy environments,
further research in this direction is obviously necessary. Moreover, we used
TFT – as the most important representative of reciprocating strategies –
as a benchmark; clearly, ESTFT strategies have to be compared to other
(reciprocating) strategies as well.
References
Ashlock, D., Smucker, M. D., Stanley, E. A. and Tesfatsion, L. (1996). Preferential partner selection in an evolutionary study of prisoner's dilemma, BioSystems 37, pp. 99–125.
Axelrod, R. (1980a). Effective choice in the prisoner's dilemma, Journal of Conflict Resolution 24, 2, pp. 3–25.
Axelrod, R. (1980b). More effective choice in the prisoner's dilemma, Journal of Conflict Resolution 24, 3, pp. 379–403.
Axelrod, R. (1984). Genetic Algorithms and Simulated Annealing, chap. The evolution of strategies in the iterated prisoner's dilemma (Pitman, London), pp. 32–41.
Axelrod, R. and Wu, J. (1995). How to cope with noise in the iterated prisoner's dilemma, Journal of Conflict Resolution 39, 1, pp. 183–189.
Bendor, J. (1993). Uncertainty and the evolution of cooperation, Journal of Conflict Resolution 37, 4, pp. 709–734.
Bendor, J., Kramer, R. M. and Stout, S. (1991). When in doubt... Cooperation in a noisy prisoner's dilemma, Journal of Conflict Resolution 35, 4, pp. 691–719.
Donninger, C. (1986). Paradoxical Effects of Social Behavior: Essays in Honor of Anatol Rapoport, chap. Is it always efficient to be nice? A computer simulation of Axelrod's computer tournament (Physica, Heidelberg).
Molander, P. (1985). The optimal level of generosity in a selfish, uncertain environment, Journal of Conflict Resolution 29, 4, pp. 611–618.
Tzafestas, E. S. (2000). Toward adaptive cooperative behavior, in Proceedings of the Simulation of Adaptive Behavior Conference (Paris).
Chapter 7
Opponent Modelling, Evolution, and the Iterated
Prisoner’s Dilemma
Philip Hingston1, Dan Dyer2, Luigi Barone2, Tim French2,
Graham Kendall3
Edith Cowan University1 , The University of Western Australia2 ,
The University of Nottingham3
In this chapter, we report on a series of studies exploring the interplay
between evolution and intelligence. The evolutionary setting is a population
of agents playing the Iterated Prisoner's Dilemma, a setting which provides
a choice between cooperative and selfish behaviour in interactions between
agents. Intelligence is represented using opponent modelling agents. Our
studies show that, while opponent modellers can survive in such a setting,
an evolving population of less intelligent agents can limit their success. We
also report on the performance of our opponent modelling agent, which
competed in the CIG’05 IPD competition.
7.1. Introduction
IPD has served as a model for cooperation between self-interested individ-
uals for 40 years. Sometimes, these individuals are taken to be animals,
sometimes humans, and sometimes some other kind of agency, such as a
corporation or a nation. A useful way to categorise studies based on the
IPD model is by what is assumed about the cognitive abilities of the players.
On an increasing scale of rationality, well-studied assumptions include
• That populations of players can evolve good strategies. This is
the traditional evolutionary computation approach.
• That the players can learn good strategies. This is the traditional
machine learning approach.
• That the players are perfectly rational. This is the traditional
mathematical game theory approach.
But there is another point on this scale, somewhere between the last
two points, that has, surprisingly, been largely neglected — the assumption
that players adapt their play based on a learned model of their
opponents’ play. In this chapter, we will argue the merits of opponent
modelling as a realistic approach to the study of IPD, and present some
results of experiments designed to explore this approach.
The expanded, four-point scale corresponds roughly to some theories
concerning stages in the evolution of intelligence. A recent and controver-
sial example is the Machiavellian intelligence hypothesis – “that apes and
humans have evolved special cognitive adaptations for predicting and ma-
nipulating the behaviour of other individuals" [Miller (1997), p. 313]. There
are various stronger or weaker interpretations of this hypothesis. A strong
version postulates a “theory of mind”, a module that attributes beliefs and
desires to others in order to better predict their behaviour. In other words,
in our terms, apes and humans use opponent modelling.
Researchers disagree about whether or not, and to what degree, vari-
ous primates have such a module. Everyone seems to agree that humans
do, but there is evidence supporting both sides of the argument regarding
distinctions between strepsirrhine primates (lemurs and lorises) and haplorhine
primates (the rest), or between monkeys, great apes, and ancient and mod-
ern humans. A well-known example is the ability of great apes to recognize
themselves in mirrors, whereas monkeys cannot [Parker et al. (1994), as
cited in Miller (1997)].
If IPD is to teach us about human behaviour, then it makes sense to
model intelligence at the correct level. To evolve agents that play pre-
determined, fixed strategies seems appropriate for studies of animals with
low levels of intelligence. Game theorists might argue that corporations or
nations are best modelled as perfectly rational, though many popular com-
mentators would disagree. For humans, neither of these seems realistic –
humans do not ignore what experience teaches. Rote learning of strategies,
via some mechanism such as reinforcement learning, or learning by imita-
tion, may be sufficient to explain animal behaviours, and some aspects of
human behaviour, but even our great ape cousins are known to go beyond
this. Thus, we believe, to realistically model strategies employed by hu-
mans, we must include learning to predict our opponent’s behaviour, and
applying our reasoning abilities to devise a plan based on our predictions.
Just how sophisticated the prediction and reasoning method needs to be is
another question, but some kind of opponent modelling is called for.
In the field of multi-agent systems, our approach would be called model-
based learning, as distinct from model-free learning. We see many of the
acknowledged advantages and disadvantages of model-based learning in this
study. Many variations and subtleties of approaches to problems of learning
in multi-agent systems have been studied (see, e.g. [Markovitch and Reger
(2005)] for one such variation, and a nice overview). It is not our aim
in this study to survey this field. Likewise, there are many aspects of
intelligence that we do not concern ourselves with, including the question
of what intelligence actually is! One of our reviewers pointed out that
our “four point scale” could have many other intermediate points on it,
depending on the level of sophistication of the modeller. For example,
should the modeller assume that the other players are also modellers and
reason about them on that level (we have chosen to answer “no”)? Again,
should the modeller model only individual players, or the population of
players in its environment (we have chosen the former)? We neglect these
questions not because we see them as uninteresting – far from it – in fact
it is because there are so many aspects, so many possibilities, that we must
make our one set of choices and stick with them. From our point of view,
the key requirement is that the players must be opponent modellers of some
sort, and that we want to learn about what happens when such players are
subjected to the forces of evolution.
In the following pages, we discuss the advantages of opponent modelling,
and the problems that a successful opponent modeller must solve. With this
background, we then describe our opponent modelling entry for the IPD
competition held at CIG’05, the 2005 IEEE Computational Intelligence in
Games conference. This entry was adapted from an earlier IPD opponent
modeller used to study the role of intelligence in the evolution of cooperation
[Hingston and Kendall (2004)]. We revisit this work, and follow up with a
report on some new experiments carried out recently to better understand
our earlier findings.
7.1.1. Opponent modelling
Opponent modelling is the term used to describe the process of constructing
some form of representation (called the model) for an opponent’s strategy,
typically in order to exploit inherent weaknesses in their play. It is worth
pointing out here that we take the view that all game players are ultimately
self-interested. Even in games where cooperation is possible, players only
cooperate because it is to their advantage to do so. This is not so much
a value judgement on our part – since we are going to be dealing with
evolution, those who are not self-interested (or at least those whose genes
are not self-interested!) will cease to be relevant.
Consider then a simple example, the two-player game of rock, paper,
scissors (also known as Roshambo). In this game, each player selects one of
the three options, rock, paper, or scissors, ensuring their selection is hidden
from the other player. After both players have made their selection, players
reveal their choices and the winner is determined as follows: rock defeats
scissors, scissors defeats paper, and paper defeats rock. Should both players
select the same option, the game is deemed a draw.
Simple analysis shows that if an opponent truly selects randomly, the
best a player can do is to also choose randomly, thus assuring the overall
expectation is neutral (each player winning one third of games, each player
losing one third of games, with the remaining third of games drawn). How-
ever, if the opponent is not selecting randomly (or truly randomly), a player
can potentially do better than this neutral expectation by “guessing” (or in
artificial intelligence speak, “predicting”) which option the opponent will
select next. Using this prediction, the player can then choose the option
(the counter-strategy) that ensures victory in the game (choosing rock if
the prediction suggests the opponent will select scissors, choosing scissors if
the prediction suggests paper, and choosing paper if the prediction suggests
rock). This is the domain of opponent modelling – building a model, typically
from observation or experience, of the next most likely action (move) of the
opponent.
Note that building a model of the opponent’s next most likely action is
equivalent to building a model of the opponent’s strategy directly since a
player’s strategy directly determines the next move of the player. Once a
model of an opponent’s strategy is determined, the model can be analysed
(or deconstructed) to identify weaknesses in the opponent’s play. From this
analysis, a counter-strategy that best “improves” the player’s position in the
game (typically by exploiting any identified weaknesses in the opponent’s
strategy) can then be determined and executed, allowing the player to
maximise personal gain in the game.
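As a toy illustration of this modelling-and-countering loop (not taken from the chapter), the following Python sketch predicts a Roshambo opponent's next move from a simple frequency count of their history and plays the counter-move.

```python
from collections import Counter

# For each predicted opponent move, the move that defeats it.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def counter_strategy(opponent_moves):
    """Predict the opponent's most frequent move and play its counter."""
    if not opponent_moves:
        return "rock"  # arbitrary opening before any observations exist
    prediction = Counter(opponent_moves).most_common(1)[0][0]
    return BEATS[prediction]

# Against an always-rock opponent, the model quickly locks onto 'paper'.
print(counter_strategy(["rock", "rock", "rock"]))  # paper
```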
All other things being equal, the overall success of a player employing
opponent modelling (an opponent modeller) depends on the accuracy of its
prediction of the opponent’s next action. For example, imagine an opponent
that always selects rock as their hidden choice in the aforementioned game
of Roshambo. Obviously, the optimal counter-strategy is to select paper,
thus ensuring a win against the opponent’s rock selection. An opponent
modeller that is able to correctly deduce this strategy weakness is then
able to ensure victory in all games against this opponent.
While this type of obvious strategy flaw is unlikely, experience shows
most players of games contain some form of strategy weakness, especially
in games containing many different game states. For example, standard
5-card poker has over 2,500,000 ways of forming a hand; 7-card vari-
ants have over 133 million ways. Factoring in the complications due to
betting, considerations of the different playing abilities and styles, and
the large number of situations a player must respond to, every poker
player is likely to contain some (and probably many) predictabilities in
their strategy (if they didn’t, playing the game would be pointless —
ignoring short-term variance, all players would end up level in the long
run).
Even in simple games like Roshambo, players often contain subtle weak-
nesses in their play that can be exploited by an opponent modeller (world
championships in the game pit players’ abilities to determine and exploit
these weaknesses). For example, some players may always choose rock af-
ter winning the previous game with paper. Other players may never select
any single option four times in a row. Both of these examples demonstrate
non-random choices by an opponent and hence can be exploited by an op-
ponent modeller capable of deciphering predictable patterns in opponent
behaviour. Indeed, any non-random choice in the selection of the hidden
option may be exploited. If the game is played often enough, subtle weak-
nesses in strategy may well give the advantage to the opponent modeller
in the long run. Artificial intelligence research into opponent modelling is
interested in just this – finding subtle flaws in an opponent’s strategy in
order to maximise personal gain in the game.
While it seems opponent modelling is an obviously good way of identify-
ing strategy weaknesses, care must be taken to ensure against over-reliance
on the inferred model of the opponent’s strategy. Two immediate prob-
lems can arise: the inferred model may be incomplete, or even worse, in-
correct for certain scenarios, or even if the model is definitely correct at
some moment in time, an opponent may dynamically modify their strategy
over time invalidating the model. The first problem is obvious – incorrect
or incomplete models affect an opponent modeller’s capability to identify
weaknesses in an opponent’s strategy and hence determine the next best
action to select. The second problem motivates the need for adaptation –
the opponent modeller must constantly re-assess its inferences and resulting
counter-strategies in order to stay abreast of the strategy employed by the
opponent.
For example, consider the always-select-rock strategy flaw discussed ear-
lier for the game of Roshambo. Obviously, an opponent modeller capable of
correctly inferring this strategy weakness will have no problem exploiting
the weakness to ensure victory against this opponent. However, with such
an obvious weakness in strategy, the opponent is likely soon to realise their
flaw and try another strategy instead. The opponent modeller must now
adapt their counter-strategy in order to exploit the new strategy, otherwise
they run the risk of becoming predictable and may well be exploited them-
selves (recall that in Roshambo, a player must select randomly, otherwise
an opponent may be able to predict their next action). The opponent may
be “setting-up” the opponent modeller with a false model in order to exploit
the opponent modeller later on with rapid successive changes in strategy
(the hunter becoming the hunted).
The other major problem for an opponent modeller is striking a balance
between exploring unknown regions of an opponent’s strategy to discover
new information (and new weaknesses) and using existing information to ex-
ploit weaknesses in the strategy. A trade-off occurs: insufficient exploration
may prevent the opponent modeller from finding better counter-strategies
that yield higher returns, but exploration is costly since it is a distraction
from the primary task of exploiting the opponent by using the information
the player already has. Exploring new counter-strategies may mean sacri-
ficing short-term performance (the player may need to accept short-term
losses), and in the worst case, may even lead to inescapable sub-graphs of
the opponent’s strategy that yield sub-optimal returns in the long run (for
example, exploring the strategy of defecting against a grim-like player in
IPD – see later, in 7.1.4).
Opponent modelling is not only useful in games, but also in other situa-
tions involving responding to opponent actions. Examples include evolving
cooperative behaviour, stock market prediction, negotiation and diplomacy,
and military strategy planning. These types of problems can benefit from
opponent modelling — building a model of the behaviour of the opponent
in order to exploit strategy weaknesses and to respond “well” to opponent
actions. Indeed, most environments containing adversarial situations can
benefit from opponent modelling — that is, exploitation of opponent weak-
nesses in order to maximise personal gain in the game. The question is to
what extent.
IPD is a game often touted as being an example of human behaviour.
Due to the iterated nature of the game, players may choose to take their
opponent’s previous actions into account when deciding how to act in sub-
sequent rounds of the game. This opens up the possibility of a player being
predictable, and hence the possibility of exploiting the player’s predictabil-
ities. This is the thesis of our work — that the use of opponent modelling
to construct a model of an opponent’s strategy can offer an advantage to
a player in the IPD game. Using this approach, we aim to construct au-
tomated computer players capable of exploiting observable strategy weak-
nesses in opponent’s strategies in IPD. This means that we need some way
to automatically construct a model of the opponent’s strategy, some way of
automatically analysing the constructed model to determine weaknesses in
the strategy, and some way of automatically determining the best counter-
strategy to counter-act the inferred strategy of the opponent. In general,
all three of these tasks may indeed be difficult. In the next section, we
detail how we did these things in the context of an IPD competition.
7.1.2. Modeller, the competition entry
In this section, we describe Modeller, the strategy that we entered into the
IPD competition held in conjunction with CIG’05, the 2005 IEEE Com-
putational Intelligence in Games conference. It is a modified version of an
opponent modelling agent described in a paper presented at CEC’04, the
IEEE Congress on Evolutionary Computation in Seattle in 2004, [Hingston
and Kendall (2004)]. The focus of that work was the interplay of evolution
and learning, which was explored by simulating co-evolving populations
of IPD playing agents using fixed strategies with agents using opponent
modelling. It is discussed further in the next section, Opponent modelling
versus evolution. We made some minor changes to the opponent modelling
agent, for the purposes of the competition, but the precise details of the
implementation are less important than the overall spirit of the opponent
modelling approach.
7.1.3. Anatomy of the modeller
Modeller plays tit-for-tat for a fixed number (50) of moves. (Recall that
tit-for-tat cooperates on the first move, and copies the opponent’s previous
move from then on.) During that time, it builds up a predictive model
of the opponent. After the fixed number of moves, it uses the model to
calculate expected future payoffs for each possible move, depending on the
game position, choosing the move with the highest expected future payoff.
In the case of ties, it chooses randomly between the moves with the highest
expected future payoff.
The opponent model used is a 1st order lookup table. It is assumed
that the opponent’s probability of cooperation on a given move is deter-
mined by what happened on the previous move, e.g. both cooperated, or
we cooperated but our opponent defected, etc. This assumption was prob-
ably incorrect for most of the strategies entered in the competition, but
we hoped that it would be approximately true, or at least true enough to
obtain good average scores.
We could employ more complicated models. For example, the oppo-
nent’s probability of cooperation could depend on the previous two moves,
or even more generally, could be described by a probabilistic finite state
automaton. We opted for the simplest choice that demonstrates the op-
ponent modelling approach. In any case, we reasoned, more complicated
models might not be warranted for the competition, because they have
more parameters to estimate, requiring more time to learn. However, if the
expected game length was very long, and our opponents were sophisticated,
we conjecture that using more complicated models would produce a more
capable strategy.
The hypothetical probabilities that determine a 1st order model are
estimated by counting how many times the opponent cooperated or defected
after each possible previous move. These counters are initialized with values
that are consistent with the opponent playing tit-for-tat, that is, we used
tit-for-tat as an a priori model. This seemed like a good choice for the
competition, as we expected that many opponents would play variants of tit-
for-tat. The counters are used to compute an estimate of the probability of
the opponent cooperating as the ratio of the cooperation counter to the sum
of the cooperation and defection counters. We continue to update the model
by incrementing the counters during subsequent play. More sophisticated
updating could also be used, for example, weighting evidence on recency,
to respond faster to opponents with dynamic strategies, as was done in
Hingston and Kendall (2004). Since our aim was to see how well opponent
modelling would do in the competition environment, rather than to compare
and tweak implementation details, we again decided to opt for simplicity.
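A minimal Python sketch of such a 1st order model is given below. The exact initial counter values are not stated in the chapter; we assume counts of 1 and 0 per position, chosen to be consistent with a tit-for-tat opponent (which copies our previous move), and this choice also reproduces the 0.5 estimate in the Always Defect example discussed later.

```python
# States encode the previous joint move as (our move, opponent's move):
# cc, cd, dc, dd. The prior counts below are our assumption of values
# "consistent with the opponent playing tit-for-tat".
coop_count = {"cc": 1, "cd": 1, "dc": 0, "dd": 0}
defect_count = {"cc": 0, "cd": 0, "dc": 1, "dd": 1}

def p_cooperate(state):
    """Estimated probability that the opponent cooperates in this state."""
    return coop_count[state] / (coop_count[state] + defect_count[state])

def observe(state, opponent_cooperated):
    """Update the counters after observing the opponent's actual move."""
    if opponent_cooperated:
        coop_count[state] += 1
    else:
        defect_count[state] += 1

observe("cd", False)      # we cooperated, he defected; he then defected again
print(p_cooperate("cd"))  # 0.5
```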
Assuming that the opponent model is correct, and given knowledge of the
probability of the game continuing to another move, we can calculate the
expected future payoff for any move in any game position. According to
Fig. 7.1. Game tree for the first few moves of a game of IPD.
the competition rules, this probability was constant at 1-0.00346, giving an
expected game length of 200 moves.
To see how expected payoffs can be calculated, consider the initial seg-
ment of an IPD game tree shown in figure 7.1. Starting at the bottom,
there is one branch for our choice to cooperate (labeled C) and one for
our choice to defect (labeled D). Following the C branch, there is then one
branch for our opponent’s choice to cooperate (labeled C0) and one for his
choice to defect (labeled 1-C0). These labels represent the probability that
our opponent will cooperate, or respectively, defect, on the first move of the
game. Following the “cooperate” branch, we reach a node that represents
the game position in which both players cooperated on the previous move
(labeled cc). There is then a branch representing our next move (C or D),
and then our opponent’s next move, where the label C1(cc) represents the
probability that the opponent will cooperate when both players cooper-
ated on the previous move. Likewise, the label on the rightmost branch,
1 − C1(dd) represents the probability that the opponent will defect when
both players defected on the previous move. The topmost nodes are shown
with labels like cccd, representing a game position where both players coop-
erated two moves ago, and we cooperated but our opponent defected on the
previous move. Since we are assuming that our opponent (and therefore we
also) only consider the previous move, when deciding on his next play, this
node might as well be labeled cd, and be identified with the other nodes
labeled cd. Thus the infinite game tree collapses to become a finite graph
(not drawn).
Thus, to determine a counter-strategy, we need only decide on our choice
at the start of the game, and at each of the game positions cc, cd, dc and
dd. Hence there are 2⁵ = 32 possible counter-strategies to consider. We can
choose between these by calculating their expected payoffs.
Let V (cc) be the value of the game at position cc, by which we mean the
expected discounted future payoff starting from this position. Define the
value of the game for the other positions similarly. Let δ be the probability
of continuing the game, P be the penalty for mutual defection, R be the
reward for mutual cooperation, T be the temptation to defect, and S be
the sucker payoff. If we choose to cooperate at position cc, then the ex-
pected future payoff, V1(cc), is equal to the probability that the opponent
cooperates (that is C1(cc)) times future payoff given that he cooperates,
plus the probability that the opponent defects (that is 1 − C1(cc)) times
the future payoff given that he defects. The future payoff given that he
cooperates is equal to the immediate payoff for both cooperating (R), plus
the expected future payoff after that (V1(cc)) times the probability that
the game continues (δ). The future payoff given that he defects is equal
to the immediate payoff for us cooperating while he defects (S), plus the
expected future payoff after that (V1(cd)) times the probability that the
game continues (δ). Putting that all together:
$$V(cc) = C_1(cc)\,\big(R + \delta\,V(cc)\big) + \big(1 - C_1(cc)\big)\big(S + \delta\,V(cd)\big).$$

Similarly, if we choose to defect, then:

$$V(cc) = C_1(cc)\,\big(T + \delta\,V(dc)\big) + \big(1 - C_1(cc)\big)\big(P + \delta\,V(dd)\big).$$

Analogous equations hold for the other positions, giving a system of
equations that can be solved for the values V(·). Finally, the value of the
game at the start of the game is either

$$V = C_0\,\big(R + \delta\,V(cc)\big) + \big(1 - C_0\big)\big(S + \delta\,V(cd)\big), \quad\text{or}$$

$$V = C_0\,\big(T + \delta\,V(dc)\big) + \big(1 - C_0\big)\big(P + \delta\,V(dd)\big),$$
depending on whether we choose to cooperate or defect on the first move.
The best counter-strategy is the set of five choices that maximizes the value
of V(·) for the current game position. After playing a move selected by this
method, and observing the opponent’s move, the model is updated and the
calculation above must be repeated to choose our next move. This tech-
nique can be extended to calculate a best response against any finite-order
stochastic strategy, or indeed against any strategy defined by a probabilistic
finite state automaton.
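To show how the calculation can be carried out, here is a Python sketch in our own notation. Rather than solving the linear system symbolically, it iterates the equations to a fixed point (which converges because δ < 1) and enumerates the 2⁴ = 16 in-game policies; the first-move choice supplies the remaining factor of two mentioned above. The payoff values and δ are illustrative assumptions, not the competition settings.

```python
import itertools

R, S, T, P = 3.0, 0.0, 5.0, 1.0   # standard IPD payoffs (assumed values)
DELTA = 0.995                      # continuation probability (assumed value)
POSITIONS = ("cc", "cd", "dc", "dd")

def values(policy, C1):
    """Iterate the value equations to a fixed point for a fixed policy."""
    V = {pos: 0.0 for pos in POSITIONS}
    for _ in range(5000):
        for pos in POSITIONS:
            p = C1[pos]  # modelled probability that the opponent cooperates
            if policy[pos] == "C":
                V[pos] = p * (R + DELTA * V["cc"]) + (1 - p) * (S + DELTA * V["cd"])
            else:
                V[pos] = p * (T + DELTA * V["dc"]) + (1 - p) * (P + DELTA * V["dd"])
    return V

def best_move(C1, current_pos):
    """Enumerate the 16 in-game policies and return the best move now."""
    best_policy, best_value = None, float("-inf")
    for choices in itertools.product("CD", repeat=4):
        policy = dict(zip(POSITIONS, choices))
        v = values(policy, C1)[current_pos]
        if v > best_value:
            best_policy, best_value = policy, v
    return best_policy[current_pos]

# Against a modelled Always Defect (cooperation probability 0 everywhere),
# the calculation recommends defection.
print(best_move({pos: 0.0 for pos in POSITIONS}, "dd"))  # D
```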
This implementation of an opponent modeller can likely be improved
(in terms of achieving a higher score in IPD competitions), by more careful
choices of the target class of opponent models, exploration/exploitation
balance, and updating method. We intend to test this claim in future IPD
competitions! However, the good performance of this simple implementa-
tion in the CIG’05 competition is evidence that opponent modelling is a
viable strategy for IPD.
7.1.4. Competition performance
While our main target was Competition 4, we also entered Modeller in
Competitions 1 and 2. Competition 4 was a faithful reproduction of Ax-
elrod's original conception [Axelrod (1984)]. Modeller performed very well
in this competition, placing 3rd out of 50 entries in four runs out of five,
and 5th in the remaining run. In all cases, it was just over 3% behind the
winning entry and 1% ahead of the next best opponent.
Despite this good performance, a detailed examination of individual
games reveals some weaknesses in our implementation. One thorny issue
that we side-stepped by using tit-for-tat as an a priori model, is the “cost”
of learning. In order to develop an accurate model of an opponent, one
would like to “explore” — that is, to sample the opponent’s moves in every
possible game position, many times. As discussed earlier, there are two
barriers to this. One is that the opponent’s play may be such that some
game positions are never reached. The second, and more troublesome, is
that such exploration does not come for free: if we deliberately play a
certain move to see what our opponent will do, our decision will affect
the payoff that we receive for this move, and possibly for future moves
too. Playing against the grim strategy is an extreme example. A grim
player cooperates on the first move of a game, and continues to cooperate
as long as the opponent does, but if ever the opponent defects, then grim
continues to defect forever more. In Hingston and Kendall (2004), we used
the device of deliberately playing the “wrong” move from time to time – the
so-called “trembling hand” device. Because of this, the opponent modeller
was frequently punished by grim-like opponents. A single experimental
defection against grim ensures that the opponent will defect for the rest of
the game, locking both players into low payoffs.
For the purpose of the competition, we avoided this problem by playing
tit-for-tat at first, and then the moves that we calculate to be optimal.
This is simple and reduces our risk of offending grim-like opponents, but
also reduces the accuracy of our models, so that we may miss the chance of
truly optimal play. For example, Modeller loses badly in games against the
fixed strategy Always Defect, which simply defects at all times. The reason
for this is outlined below.
During the first 50 moves against Always Defect, Modeller gets no infor-
mation about what the opponent would do if both players cooperated last
move, or if we defected while he cooperated (because he never cooperates).
Also, since we play tit-for-tat up to this point, we only cooperate on the
first move, and thereafter defect, so we only see one example of the oppo-
nent’s play after we cooperate and he defects. The problem of this lack of
data is the reason we begin with an a priori model, specifically, tit-for-tat.
After 50 moves against Always Defect, the model looks like this:
Probability of cooperating after we both cooperate = 1
Probability of cooperating after I cooperate and he defects = 0.5
Probability of cooperating after I defect and he cooperates = 0
Probability of cooperating after we both defect = 0
(The 0.5 arises because the a priori tit-for-tat counts give this position an
initial estimate of certain cooperation, and the single observed defection
then halves it: one cooperation count against one defection count.)
Thus, when first called on to apply the model, Modeller reasons like
this:
“We just both defected. If I cooperate on this move, I’m sure he’ll
defect. After that, there’s a 50% chance he’ll cooperate on the next move.
(If not, I can try again.) If I cooperate too, that makes a 50% chance that
we will both cooperate. From then on, I’m sure we’ll keep cooperating.”
So Modeller expects to reach mutual cooperation and good payoffs after
a few more moves, if he continues to cooperate. The problem is that the
50% estimate is wrong (the true probability is 0). Although this incorrect
value will continue to be updated, it will take many more sucker moves
before the model is accurate enough for Modeller to make the right choice
(defect).
So we see that, for several reasons, the models learned by Modeller are
imperfect. Sometimes, this hurts us, but we push ahead regardless, hoping
that, on average, it will not hurt us too much. At least in this competition,
this was a reasonable assumption.
We expected that Modeller would have a tougher time in Competitions 1
and 2, because, in these competitions, collusion between entries was allowed.
With collusion allowed, the best approaches will use a “champion” strategy
that takes advantage of conditions by relying on “confederate” strategies
to sacrifice themselves by cooperating whilst their champion constantly de-
fects. In addition, confederate strategies can damage other competitors by
constantly defecting against them. The idea is for champion and confed-
erates to use the first few moves to identify each other. It is clear that no
non-colluding strategy can hope to compete in this environment. Never-
theless, we entered Modeller to provide an opponent for other entries, and
out of curiosity. It performed creditably, finishing 62nd, 61st, 64th, 60th
and 65th out of 192 entries in the five runs. It would be interesting to know
where it was placed among the non-colluding entries.
Modeller fared better in Competition 2, which allowed collusion, but
introduced “noise” – that is, with low probability, signals may be misinter-
preted by players. This upsets colluders by interfering with the identifica-
tion of confederates, but doesn’t inconvenience Modeller much at all, as it
is designed to deal with stochastic opponent strategies. In this competition,
Modeller finished 20th, 18th, 5th, 13th and 18th out of 165 entries in the
five runs.
It could be argued that, by allowing collusion, Competitions 1 and 2
changed the nature of the problem under consideration. The problem be-
comes one of teamwork, rather than one of cooperation with a self-interested
other. One can think of real-world scenarios that IPD-with-collusion use-
fully models — for example, teams of riders in the Tour de France, in which
team members sacrifice their own chances in order to protect a teammate
and improve his chance of a high-placed finish. The analogy is imperfect,
though, as in the Tour, teammates do not compete against each other di-
rectly. It is harder to think of examples from Nature. At least at first
sight, it would seem that colluding strategies would not work very well in
simulated evolution experiments like those that we describe in the next sec-
tion. Strategies acting as confederates would be selected against. In such
a scenario, it would be the average fitness of all members of the species
that determined reproductive success of the species as a whole. We won-
der what the average scores of the teams entered in Competitions 1 and 2
were, but we cannot calculate this because we do not know who was col-
luding with whom. Perhaps there are examples in Nature that we are not
aware of, and it may be a matter of appropriately structuring the simula-
tion to make collusion profitable. It would be interesting to hear of such
examples.
7.2. Opponent Modelling Versus Evolution
The opponent modeller described in the previous section was based on that
used in the CEC’04 study, which had nothing to do with the competition,
or, really, with Axelrod’s original competitions. It did take inspiration,
though, from Axelrod's experiments with evolution and IPD.
Those experiments by Axelrod were motivated in part by an apparent
anomaly in evolutionary theory. Cooperation between organisms in nature
entails one organism changing its behaviour in order to benefit another,
possibly to its own detriment. A commonly used example is that of the
lookout in groups of social animals, that makes an alarm call to warn the
rest of the group of the presence of a predator, placing itself at risk by
calling attention to itself. If evolution favours survival of the fittest, then
why does it not work against this kind of cooperation? Would it not rather
favour the cheat, who benefits from the alarm calls of others, but stays
silent when his own turn comes to act as lookout?
One can make plausible arguments to resolve this puzzle, invoking ideas
like kin-selection [Maynard-Smith (1988), pp. 192-193], or social reputation
[Maynard-Smith and Harper (2003), pp. 121-122], or one can build and
analyse mathematical models to test hypothetical mechanisms to explain
it, as in evolutionary game theory [Maynard-Smith (1988), pp. 194-200].
Or one can design and carry out simulated evolution experiments, which
is what Axelrod did, using IPD as his model. Subsequently, many others
have carried out their own, similar experiments, using variations on the
classic IPD model, exploring issues such as spatial effects [Nowak and May
(1992)], more complex strategies [Fogel (1993); Miller (1996)], the ability
to choose partners [Ashlock et al. (1996)] and so on.
There is another natural phenomenon that evolutionary theory must ex-
plain – intelligence. The central question is: Why and how did intelligence
evolve? This is a large topic, much debated, and one that has many facets.
Theories include Calvin’s “throwing theory” (that bigger brains evolved in
order to better throw rocks) [Calvin (1983)], the theory that greater intel-
ligence resulted as a response to the last ice age [Calvin (1991)], that the
evolution of intelligence was a result of sexual selection [Miller (1997)], and
the idea that intelligence is about being better at deceiving and detect-
ing deception – as in the Machiavellian intelligence hypothesis [Byrne and
Whiten (1988); Whiten and Byrne (1997)].
Our CEC’04 study was in part our attempt to contribute to the debate.
Just as Axelrod used IPD to study the evolution of cooperation, we used
similar experiments to study the evolution of intelligence. To better explain
what we did, we ask the reader to keep in mind the following hypothetical
scenario:
Imagine a world inhabited by simple creatures who interact and ex-
change resources by playing IPD with each other. Those who get bet-
ter payoffs live longer and have more offspring. These creatures are not
intelligent. Their moves are determined by their genes. They can recog-
nize each other, and they can only remember what moves were played the
last time they played each other — but no more than that. They act in-
stinctively, and never learn anything at all. This is the world of Axelrod’s
experiments.
Now imagine that a strange mutation arises amongst these creatures.
Mutants have abnormally large brains, large enough for them to remember
quite a bit about what happened in previous encounters with each other
creature. Enough for them to be able to make a good guess about what
move the other will make the next time they play. In fact, their brains are so
big and complex, that they can use this information to plan what move they
should make next, and what would happen after that, and so on, and choose
a move that will maximize the payoff in games against each other in the
future. These mutants are intelligent. What will happen to these intelligent
mutants? What will happen to the original, unintelligent creatures? This
is the world, and these are the questions that were addressed in the CEC’04
study.
This was not the first study to consider how an intelligent player
might play IPD, or to use simulation to study the evolution of intelligence,
but it may be the first to consider opponent modelling as an approach to
IPD, and also the first to combine evolution and intelligence in
the context of IPD.
The opponent modelling implementation for this study was similar to
that used in the competition, except that there was no initial period in
which the model was not used, the model update method was different,
and the players were all equipped with a “trembling hand”. As explained
earlier, the competition variant used an initial waiting period as a safeguard
against grim-like opponents, and because we guessed that many competi-
tion opponents would be tit-for-tat-like. The model update method in the
CEC’04 study used a “forgetting factor”, γ, to give greater weight to more
recent events. After each move, both counters pertaining to the current
game position were multiplied by γ, and the relevant counter was incre-
mented by 2× (1− γ), keeping the sum of the two counters constant. All
players in the CEC’04 study had a “trembling hand”. That is, they would
occasionally play defect when they intended to cooperate, or vice versa.
The advantage of this is that it makes all parts of an opponent model
reachable, and offers some hope of recovery against a grim-like opponent.
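To make this concrete, a minimal Python sketch of the forgetting-factor update and the trembling hand is given below. The counter initialisation, the value of γ, and the tremble probability are illustrative assumptions on our part; the study's actual parameter values are not stated here.

    import random

    GAMMA = 0.95     # forgetting factor (illustrative value, not the study's)
    TREMBLE = 0.01   # trembling-hand probability (also an assumed value)

    # model[state] = [coop_weight, defect_weight]; initialising each pair to
    # [1.0, 1.0] makes it sum to 2, and the update keeps that sum constant:
    # gamma * 2 + 2 * (1 - gamma) = 2.
    def update_model(model, state, opponent_cooperated):
        c, d = model[state]
        c, d = c * GAMMA, d * GAMMA          # give less weight to older events
        if opponent_cooperated:
            c += 2 * (1 - GAMMA)
        else:
            d += 2 * (1 - GAMMA)
        model[state] = [c, d]

    def tremble(intended_move):
        # occasionally play the opposite of the intended move ('C' or 'D')
        if random.random() < TREMBLE:
            return 'D' if intended_move == 'C' else 'C'
        return intended_move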
Players using 1st order lookup tables were used for the unintelligent
players. Only pure strategies were used – that is, ones in which each
[Figure: mean fitness (left axis, 1-3) and cooperation level (right axis, 0.5-1) plotted against generation (0-1000); curves labelled "mean" and "coop".]
Fig. 7.2. A typical run with unintelligent strategies.
cooperation probability is either 0 or 1. No crossover was used, and mutation simply flipped a probability from 0 to 1, or vice versa. Though they were not reported in the paper, experiments were also conducted with stochastic strategies, giving broadly similar results.
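A pure 1st order strategy and its bit-flip mutation can be sketched in a few lines of Python. The genome layout (a first move plus one entry per previous-round outcome CC, CD, DC, DD) and the mutation rate are our assumptions for illustration; the text above only states that each cooperation probability is 0 or 1 and that mutation flips it.

    import random

    def mutate_pure(genome, rate=0.01):      # rate is illustrative
        # flip each 0/1 gene independently with a low probability
        return [1 - g if random.random() < rate else g for g in genome]

    # [first move, then responses to CC, CD, DC, DD], with 1 = cooperate;
    # this particular genome encodes tit-for-tat
    tft = [1, 1, 0, 1, 0]
    child = mutate_pure(tft)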
There were two experiments reported. In the first experiment, fixed
strategy, unintelligent players were evolved in a simulation similar to Axel-
rod’s:
(1) An initial population is created.
(2) A round-robin IPD tournament is held between the members of the population. Every player plays every other player in a game of IPD in which the game continues to another round with probability δ (set to 0.96, for an average game length of 25 moves). The fitness of each individual is assigned to be that player's average payoff per move in the tournament.
(3) Fitness-proportionate selection is used to select parents for the next generation (stochastic uniform selection).
(4) Each parent, when selected, produces one child, by a process of copying the genome of the parent (with a low mutation rate – the probability of mutation as each gene is copied), and the development of a new individual from this genome. The children become the next generation.
(5) Steps (2)-(4) are repeated for 1000 generations.
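The loop above can be summarised in Python roughly as follows. This is a sketch under stated assumptions: the mutation rate, the genome representation, and the payoff values (the standard 5/3/1/0) are our choices; only δ = 0.96 and the 1000-generation run length come from the description above.

    import random

    PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
              ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}
    DELTA = 0.96       # continuation probability; expected game length 25 moves
    MUT_RATE = 0.01    # "low" mutation rate; exact value assumed

    def move(genome, my_last, opp_last):
        first, table = genome            # table: (my_last, opp_last) -> 'C'/'D'
        return first if my_last is None else table[(my_last, opp_last)]

    def play(g1, g2):
        s1 = s2 = 0.0
        m1 = m2 = None
        n = 0
        while n == 0 or random.random() < DELTA:
            a, b = move(g1, m1, m2), move(g2, m2, m1)
            p1, p2 = PAYOFF[(a, b)]
            s1, s2, m1, m2, n = s1 + p1, s2 + p2, a, b, n + 1
        return s1 / n, s2 / n            # average payoff per move

    def flip(m):
        return 'D' if m == 'C' else 'C'

    def mutate(genome):
        first, table = genome
        first = flip(first) if random.random() < MUT_RATE else first
        table = {s: (flip(m) if random.random() < MUT_RATE else m)
                 for s, m in table.items()}
        return (first, table)

    def generation(pop):
        fitness = [0.0] * len(pop)
        for i in range(len(pop)):        # round-robin tournament
            for j in range(i + 1, len(pop)):
                a, b = play(pop[i], pop[j])
                fitness[i] += a
                fitness[j] += b
        # summing per-opponent averages is proportional to the average payoff
        # per move, so fitness-proportionate selection is unaffected
        parents = random.choices(pop, weights=fitness, k=len(pop))
        return [mutate(p) for p in parents]   # one mutated child per parent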
The results of this experiment were similar to those reported by Axel-
rod. The populations evolved in a few generations to a mixture of generally
Table 7.1. Summary statistics for evolution of unintelligent strategies, n = 20, mean ± std. dev.

    Mean fitness     Mean coop%    Grim%         TFT%
    2.783 ± 0.013    86.7 ± 0.8    26.4 ± 1.5    19.7 ± 1.8
[Figure: percentage of the population (0-80) plotted against generation (0-1000); curves labelled "nGrim" and "nTFT".]
Fig. 7.3. Percentage of grim and tit-for-tat strategies for the run in figure 7.2.
cooperative players, cooperating around 87% of the time. As can be seen in table 7.1, the mean reward was close to 2.783 in all runs. The average percentages of grim and tit-for-tat (TFT) strategies
were around 26% and 20% respectively. Figure 7.2 shows a typical run,
with defection initially popular, and cooperation taking over after about
20 generations. Although the mean reward and degree of cooperation of the
population have stabilised, the composition of the population is constantly
fluctuating, with grim and tit-for-tat always present in large numbers, ap-
pearing to be loosely tied together in a cycle of period about 100 genera-
tions. Figure 7.3 shows the percentages for the same typical run.
In the second experiment, the players’ genomes were extended by adding
a “smart bit”. With the smart bit turned on, the player becomes an in-
telligent mutant, and plays as an opponent modeller. With the smart bit
turned off, the player remains an unintelligent player. In the initial pop-
ulation, all the smart bits were off. The scene is set. The mutants are
equipped to exploit the weak amongst the normal players. Will this ability
enable them to take over the population? Will they merely weed out the
exploitable players?
[Figure: mean fitness (left axis, 1-3) and cooperation level (right axis, 0.5-1) plotted against generation (0-1000); curves labelled "mean" and "coop".]
Fig. 7.4. A typical run with unintelligent players and opponent modellers.
Table 7.2. Summary statistics for coevolution of unintelligent players with opponent modellers, n = 20, mean ± std. dev.

    Mean fitness    Mean modeller fitness    Mean coop%    Modeller%     Grim%         TFT%
    2.67 ± 0.01     2.51 ± 0.01              80.6 ± 0.6    13.5 ± 0.7    21.7 ± 2.0    24.1 ± 2.0
Figure 7.4 shows the mean fitness and level of cooperation in a typical
run. The picture is similar to that of the first experiment, with a slightly
lower degree of cooperation at around 81%, and slightly lower mean re-
wards around 2.67. Figure 7.5 shows the percentage of tit-for-tat and grim
strategies and also the percentage of opponent modellers for the same run.
As table 7.2 shows, a significant number of opponent modellers, a mean of
around 13.5% of the population, is able to survive. Compared to the first
experiment, some of the grim strategies have been displaced, but the per-
centage of tit-for-tat strategies has actually increased. We conjecture that
the increase in tit-for-tat was at the expense of more exploitable strategies,
which are under pressure from the opponent modellers. While grim play-
ers can’t be exploited, they are involved in a lot of unprofitable mutual
defection with opponent modellers, and also suffer.
Although opponent modellers are able to survive in this simulated envi-
ronment, their mean fitness is lower than that of the rest of the population.
Without mutation, they would be driven to extinction. One problem for
[Figure: percentage of the population (0-60) plotted against generation (0-1000); curves labelled "nGrim", "nTFT" and "nSmart".]
Fig. 7.5. Percentage of opponent modellers, grim and tit-for-tat strategies for the run in figure 7.4.
opponent modellers is the poor payoff from games with grim. But, the
main reason for their relatively poor performance is that when two oppo-
nent modellers meet, their average payoff is only 1.69. (In the competition,
this doesn’t happen, as after the first 50 moves, each thinks the other is
playing tit-for-tat, so mutual cooperation is locked in. In any case, each
player only plays itself once in the competition, so self-play is not an im-
portant factor.)
As an explanation of how intelligence might evolve, this model has raised
some questions. One could regard it as an illustration of the self-limiting
nature of exploitative behaviour in human and animal societies. Taking
these results as a starting point, one could ask under what conditions intel-
ligent players would do better, or worse, against unintelligent opponents,
than they did in this experiment. Answers to this question might provide
clues as to how and why intelligence has evolved in Nature, and why various
successful species have varying degrees of intelligence. In the next section,
we describe some new experiments in which we investigate some of the
effects that contributed to the results of this section.
7.2.1. The new experiments
As described above, one finding of the CEC’04 work was that the presence
of opponent modellers in an evolving population of IPD playing agents has
an influence on the kinds of fixed strategy players favoured by evolution.
The experiments described in this section seek to explore this further.
The main difference between these experiments and the earlier ones is
that opponent modellers do not directly take part in the evolution process,
but are used to test the fitness of the members of an evolving population of
fixed strategy IPD players. This makes it possible to isolate and manipulate
the influence of the opponent modellers.
There are some minor differences between the implementation of oppo-
nent modelling used in these experiments and the one used in the earlier
study. Instead of using a default model for an opponent strategy (based
on TFT) as used in the CEC’04 work, for these new experiments, we in-
stead start with an empty model of the opponent. As before, we count the
number of times the opponent cooperated for each game state to determine
a probability of cooperation for that game state based on observation of
the opponent’s moves. From this probability of cooperation, we are able
to calculate the next best move by calculating the best expectation for
all possibilities by looking ahead in the game state graph to consider the
consequences of each possible course of action. Since look-ahead is compu-
tationally expensive, we consider only a look-ahead of 5 moves (sufficient
to prevent short-term gains from taking precedence over long-term consid-
erations). Unlike the earlier work, we do not include a recency factor to
discount older observations as we do not make any short-cut assumption
about the opponent strategy at the start.
Exploration of an opponent’s strategy is also undertaken differently. In
the earlier CEC’04 work, a trembling hand was used for exploration of the
opponent strategy. In this work, exploration is more immediate – the op-
ponent modeller makes random decisions when it encounters games states
for which it has no information about the opponent’s strategy. The ad-
vantage of this approach is that exploration occurs earlier in the modelling
process, thus meaning more information is available earlier in the game,
hopefully leading to better exploitation of an opponent’s weaknesses in the
short term.
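The look-ahead and the exploration rule just described might look as follows in Python. This is a sketch, not the authors' code: the game-state encoding (the joint outcome of the previous round), the payoff values, and the treatment of unexplored states inside the look-ahead (valued at 0 here) are our assumptions.

    import random

    PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
    DEPTH = 5    # look-ahead of 5 moves, as stated above

    def p_coop(counts, state):
        """Observed probability that the opponent cooperates in `state`,
        or None if the state has never been seen (the model starts empty)."""
        coop, seen = counts.get(state, (0, 0))
        return None if seen == 0 else coop / seen

    def lookahead(counts, state, depth):
        """Best expected total payoff over the next `depth` moves."""
        if depth == 0:
            return 0.0
        p = p_coop(counts, state)
        if p is None:
            return 0.0   # unexplored territory; valuing it at 0 is our simplification
        return max(
            sum(prob * (PAYOFF[(my, opp)] + lookahead(counts, (my, opp), depth - 1))
                for opp, prob in (('C', p), ('D', 1 - p)))
            for my in ('C', 'D'))

    def choose_move(counts, state):
        p = p_coop(counts, state)
        if p is None:                    # explore: random move in unseen states
            return random.choice(('C', 'D'))
        values = {my: sum(prob * (PAYOFF[(my, opp)] +
                                  lookahead(counts, (my, opp), DEPTH - 1))
                          for opp, prob in (('C', p), ('D', 1 - p)))
                  for my in ('C', 'D')}
        return max(values, key=values.get)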
Of course, the key difference between our new experiments and those
presented at CEC’04 has to do with the effects of the opponent modeller on
the course of the evolutionary process. In the CEC’04 work, the opponent
modeller was considered another instance of the evolving population that
could replicate (so there could be multiple copies of the opponent modeller),
and needed to compete to earn their position in the population in order to
survive (i.e., opponent modellers were subjected to the same evolutionary
selection pressure as the unintelligent players). Results from the CEC’04
work showed that due to poor performance against other opponent mod-
ellers (the average return for self-play was 1.69), the number of opponent
modellers in the population fell away over the course of the evolutionary
run. In these new experiments, we take a different approach – we do not
involve the opponent modeller in the evolutionary process (so the oppo-
nent modeller is not subjected to the same evolutionary pressures), and it
is instead treated separately from the evolving population. As before, we
still maintain a population of unintelligent players that must compete for
their right to remain (and reproduce) in the population, but now assess-
ment of an individual’s ability (its fitness) is calculated as a weighted sum
of its performance against the other (unintelligent) members of the evolv-
ing population and its performance against the separate opponent modeller.
Below, we report on experiments with different weightings to determine and
isolate the effects of the opponent modeller on the evolutionary process.
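For concreteness, the fitness combination we understand from this description is a simple convex combination; the chapter says only "weighted sum", so the exact form is our reading:

    def fitness(avg_vs_population, avg_vs_modeller, w):
        # w = 0 ignores the opponent modeller entirely; w = 1 scores a player
        # only on its games against the modeller (the weightings used below)
        return (1.0 - w) * avg_vs_population + w * avg_vs_modeller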
While there are a number of differences between these two studies,
analysis shows that the results are mostly robust with respect to these
differences. Indeed, compensating for the effects of self-play in the ear-
lier CEC’04 work yields results mostly similar to the results found using
this new methodology (some differences occur due to the differences in ex-
ploration between the two approaches). We use our simpler approach in
the experiments below, thus allowing us to explore longer-term effects and
longer-term IPD games (these new experiments investigate games lasting 1000 rounds while the earlier work investigated games lasting only
25 rounds).
Our baseline experiment is to play the opponent modeller against a se-
lection of eight commonly known IPD strategies. Each strategy is played
against each of the others for 1000 iterations, giving a total of 8000 iter-
ations for each strategy. The results of the round-robin tournament are
presented below in table 7.3. Table 7.4 reports a breakdown of the oppo-
nent modeller’s performance (average payoff) against each of the strategies
in table 7.3.
Table 7.3 lists a couple of strategies we have yet to describe. STFT (Suspicious tit-for-tat) is like tit-for-tat except that it defects on the first move. Gradual is another variation on tit-for-tat: this strategy acts as tit-for-tat, except that after the first defection of the other player, it defects one time and cooperates two times; after the second defection of the opponent, it defects two times and cooperates two times, and so on. The Pavlov strategy
is similar to grim, except that it is more forgiving. Based around the
Table 7.3. Round-robin tournament results in-
volving the opponent modeller against eight other
commonly known IPD strategies.
Rank Strategy Average payoff
1 Opponent Modeller 2.74
2 Gradual 2.68
3 TFT 2.59
4 Grim 2.26
5 STFT 2.22
6 Pavlov 2.15
7 Always Cooperate 2.07
8 Always Defect 2.05
9 Random 1.64
Table 7.4. Breakdown of the opponent modeller's average payoff against each of the strategies in table 7.3.

    Strategy            Average payoff
                        Opponent Modeller    Opponent
    Gradual             2.87                 2.75
    TFT                 2.99                 2.99
    Grim                1.00                 1.01
    STFT                2.99                 3.00
    Pavlov              3.00                 0.50
    Always Cooperate    5.00                 0.01
    Always Defect       1.00                 1.00
    Random              3.04                 0.51
principle of continuing to do the same thing when performing well and only
changing when performing poorly, Pavlov starts cooperating and continues
to cooperate until its opponent defects. Upon defection, Pavlov switches
to defection. The difference between grim and Pavlov is that Pavlov will
return to cooperation if defection does not prove to be profitable (i.e., if its
opponent also begins to defect), hoping to return to a state of mutual
cooperation.
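Minimal sketches of these two strategies, based only on the descriptions above, might look as follows in Python (the bookkeeping details of Gradual's punishment runs are our reading of the text):

    PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

    def pavlov(my_last, opp_last):
        # keep the previous move while it earned at least the reward R = 3,
        # otherwise switch; after unprofitable mutual defection (payoff 1)
        # this brings Pavlov back to cooperation
        if my_last is None:
            return 'C'                       # start by cooperating
        if PAYOFF[(my_last, opp_last)] >= 3:
            return my_last
        return 'D' if my_last == 'C' else 'C'

    class Gradual:
        """After the opponent's n-th defection, defect n times, then
        cooperate twice; otherwise play like tit-for-tat."""
        def __init__(self):
            self.defections_seen = 0
            self.pending = []                # queued forced moves

        def move(self, opp_last):
            if opp_last == 'D':
                self.defections_seen += 1
                if not self.pending:         # start a new punishment run
                    self.pending = ['D'] * self.defections_seen + ['C', 'C']
            return self.pending.pop(0) if self.pending else 'C'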
That the opponent modeller emerged as the winner of the tournament
is encouraging, but is not particularly significant given the arbitrary se-
lection of opponents in the tournament. What is more interesting is the
performance of the opponent modeller against each individual strategy.
The first thing to note is the ability of the opponent modeller to suc-
cessfully identify Always Defect as the best counter-strategy against the
non-reactive opponents (Always Cooperate, Always Defect, and random),
achieving near-perfect scores against Always Cooperate and Always Defect,
and the best possible result against random.
Against tit-for-tat and STFT, the opponent modeller is able to identify
cooperation as the best course of action without falling into the defection
echo trap. As expected, the inevitable strategy exploration against grim
is punished, resulting in a poor score for the opponent modeller. The rel-
atively poor performance of Pavlov in the round-robin tournament is at
least partially due to the opponent modeller settling on an Always Defect
counter-strategy, rather than the equally effective Always Cooperate alter-
native.
Our next experiments examine the effect of the opponent modeller on
the course of a population of IPD players subjected to evolutionary selec-
tion pressure. As seen in the earlier CEC’04 work, the presence of op-
ponent modellers in the population effects the kinds (and distribution) of
fixed strategy players selected by evolution. These new experiments further
elaborate on these effects.
First, we report on the performance of the opponent modeller against
an evolving population of fixed pure strategies. Figure 7.6 plots the average
payoff for the opponent modeller against each member of the population
along with the average payoff of the evolving population.
[Figure: mean payoff (0-4) plotted against generation (0-1000); curves labelled "population" and "modeller".]
Fig. 7.6. Average payoffs for the evolving population of fixed pure strategies and an opponent modeller playing against each member of the population over time.
We can see that in most generations, the opponent modeller is able
to outperform the evolving population, obtaining a higher average payoff
than the average payoff of the evolving population. However, there are sev-
eral generations where the population outperforms the opponent modeller.
Analysis of the population composition at these points reveals that this
occurs when there are a large number of grim strategies in the population
(recall that exploration against the unforgiving grim is fatal – one defection
against grim locks the opponent modeller into a payoff at best 1.0 from then
on). For example, in generation 988, where the opponent modeller is at its least effective (scoring on average 1.10 less than the evolving population), the number of grim strategies reaches its peak of 68% of the population.
The first row of table 7.6 reports the composition of the evolving popu-
lation for the corresponding experiments plotted in figure 7.6. We can see
grim, tit-for-tat, and Pavlov are the most prevalent in the population.
The results from these experiments show that the opponent modeller
is successful against an evolving population of fixed pure IPD strategies,
provided the proportion of grim strategies in the population is not high.
However, these experiments have not rewarded fixed strategies that score
well against the opponent modeller, only those that perform well against
the rest of the evolving population. Next, we examine experiments that
incorporate scores achieved against the opponent modeller into the fitness
evaluations of the fixed strategies.
Table 7.5 reports the average payoffs for the members of the evolving
population of pure strategies and the opponent modeller, along with the
composition of selected strategies in the population (dashed entries indicate
low numbers) for different ratios of the weighted sum that constitutes the
fitness of a member of the evolving population.
Table 7.5. Average payoffs for the members of an evolving population of pure strategies and an opponent modeller playing against each member of the population, for different weightings in the weighted sum.

    Weighting             Population average payoff                 Modeller
    (against Modeller)    Against population    Against Modeller    average payoff
    0                     2.50(0.19)            1.07                2.62(0.32)
    0.05                  2.51(0.21)            1.31                2.81(0.27)
    0.1                   2.60(0.12)            1.60                2.94(0.21)
    0.2                   2.68(0.09)            2.08                2.99(0.12)
    0.5                   2.65(0.09)            2.53                2.93(0.06)
    1.0                   2.14(0.25)            2.72                2.91(0.05)
Table 7.6. Distribution of strategies for the members of an
evolving population of pure strategies and an opponent mod-
eller playing against each member of the population, for dif-
ferent weightings in the weighted sum as in table 7.5.
Weighting Number of each fixed strategy
(against Modeller) Grim TFT Pavlov STFT
0 25(10) 15(7) 10(5) -
0.05 15(6) 25(10) 7(5) -
0.1 12(6) 33(9) 5(3) -
0.2 8(3) 49(8) - 8(3)
0.5 4(1) 61(6) - 13(4)
1.0 2(1) 38(11) - 41(11)
The first obvious difference in the experiments that include performance against the opponent modeller in fitness calculations (rows 2 onwards in table 7.5) is the increased average payoff of the opponent modeller. In comparison to the first row of the table, we see that the average payoff of
the opponent modeller increases up to a point, before leveling off at around
2.95. This is due to the changes in the composition of the resulting evolved
population (see table 7.6). As we saw in our baseline experiment, grim and
Pavlov do not perform well against the opponent modeller (scoring 1.01
and 0.50 respectively) and hence even a very low degree of influence from
the opponent modeller on fitness scores is enough to reduce the appear-
ance in the evolving population of these strategies. With a weighting of
0.2, Pavlov is unable to score highly enough to survive in any significant
quantities and the presence of grim is much reduced. The reduction in the
number of grim strategies explains the increase in the average payoff of
the opponent modeller (recall that the opponent modeller performs poorly
against grim because of the high cost of strategy exploration). With a 0.5
weighting, grim becomes marginalised. The increasing number of STFT
strategies at the higher weightings explains the small decrease in average
payoff of the opponent-modeller – STFT is not exploitable and indeed may
benefit from its suspicious nature at the beginning of the game. At the
higher weightings, the only strategies other than tit-for-tat and STFT to
appear in the population are single-step (differing in just one state) mu-
tants from tit-for-tat and STFT (including grim) induced by the mutation
in the evolutionary process. These mutants are not able to survive in the
evolving population and are quickly eliminated.
Variance in the performance of the opponent modeller also decreases as
we increase the relative importance of performance against the opponent
modeller in the evaluation of the success of a population member. This is
because the opponent modeller acts as a stabilising influence on the fitness
of the evolving population since it is a constant in the environment. The
more that the fitness is derived from games against the remainder of the
population (low weightings), the more performance is affected by changes
in the population.
As seen in column 3 of table 7.5, the average payoff of the evolving pop-
ulation against the opponent modeller increases as the relative importance
of performance against the opponent modeller in fitness calculations for a
population member increases. This is as expected, because survival in the
population now depends more and more on this metric than performance
against the other members of the evolving population. Indeed, at a weight-
ing of 1.0, performance against the opponent modeller is maximal, at a
sacrifice of performance against the other members of the evolving popula-
tion. Somewhat strangely though, as the weighting increases from 0 to 0.2,
performance of the evolving population against other members of the pop-
ulation increases, even though fitness now depends more on performance
against the opponent modeller. This is due to the decreased numbers of
grim strategies (driven out by the opponent modeller) – defection is no
longer as costly as it was before. At the highest weighting, even the small
numbers of grim strategies ensure that the abundance of STFT strate-
gies perform relatively poorly, lowering the average-payoff of the evolving
population in play against each other.
Importantly, we observe in table 7.5 that while the performance of the
evolving population against the opponent modeller increases as the weight-
ing increases, the evolving population is never able to obtain a level of
performance comparable to that of the opponent modeller (contrast col-
umn 4 of table 7.5 against column 3). Of course, evolution does its best –
evolving a population consisting of predominately non-exploitable strate-
gies (tit-for-tat and STFT ). However, due to the stochastic nature of the
evolutionary process in the mutation operation, other strategies find their
way into the population, thus allowing the opponent modeller to exploit
weaknesses and obtain an average payoff higher than that of the evolving
population.
Our analysis of table 7.5 shows the opponent modeller to be effective against populations of pure strategies, outperforming the evolving population in terms of average payoff in play against each other. The oppo-
nent modeller is able to outperform the evolving population, learning with
a high degree of certainty what its opponent will do in any given situation
(game state). In our next experiment, we repeat these tests using stochastic
strategies in place of pure strategies.
Stochastic IPD strategies differ from pure IPD strategies as they allow
the player the flexibility of selecting a cooperate/defect action probabilis-
tically given a particular game state. Whereas a pure strategy will always
select the same action for a given game state, a stochastic strategy may
(probabilistically) decide which action to take. This means that successive
calls of a stochastic strategy for the same input game state may produce
different output actions. This cannot occur for a pure strategy – the pure
strategy will always select the same response given an input game state.
Stochastic strategies are implemented as follows: for each unique game state (recall, we are assuming 1st order strategies only), the stochastic strategy stores the probability of cooperating in that game state, and an action is chosen directly according to this stored probability. Mutation of a stochastic strategy occurs by adjusting each internal probability by a random value sampled from a Gaussian distribution with mean 0 and a standard deviation of 0.025.
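The mutation of a stochastic strategy is then a one-liner. Clamping the perturbed probability back into [0, 1] is our assumption, as the text does not say how out-of-range values are handled:

    import random

    SIGMA = 0.025    # standard deviation given above

    def mutate_stochastic(genome, sigma=SIGMA):
        # each gene is a probability of cooperating in one game state
        return [min(1.0, max(0.0, p + random.gauss(0.0, sigma)))
                for p in genome]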
Against stochastic strategies, the opponent modeller can still observe
the probability with which its opponent will cooperate, but it cannot be
sure that the opponent will cooperate on any given move. This experiment
against stochastic strategies will report on the effects of this uncertainty in
behaviour on the performance of the opponent modeller.
Table 7.7 reports the average payoffs for the members of the evolving
population of stochastic strategies and the opponent modeller, along with
the composition of selected strategies in the population (dashed entries
Table 7.7. Average payoffs for the members of an evolving population of stochastic strategies and an opponent modeller playing against each member of the population, for different weightings in the weighted sum.

    Weighting             Population average payoff                 Modeller
    (against Modeller)    Against population    Against Modeller    average payoff
    0                     2.08(0.61)            1.29                2.16(0.59)
    0.05                  2.43(0.22)            2.23                2.54(0.22)
    0.1                   2.59(0.14)            2.49                2.68(0.12)
    0.2                   2.61(0.12)            2.76                2.68(0.10)
    0.5                   2.54(0.11)            2.96                2.62(0.09)
    1.0                   2.13(0.18)            3.08                2.44(0.08)
Table 7.8. Distribution of strategies for the members of an
evolving population of stochastic strategies and an opponent
modeller playing against each member of the population, for
different weightings in the weighted sum as in table 7.7.
Weighting Number of each fixed strategy
(against Modeller) Grim TFT Pavlov STFT
0 22(18) 12(14) 7(8) 2(2)
0.05 6(10) 23(23) - 23(22)
0.1 - 22(19) - 37(27)
0.2 - 40(29) - 29(28)
0.5 - 40(30) - 36(27)
1.0 - 60(29) - 31(28)
indicate low numbers) for different ratios of the weighted sum that consti-
tutes the fitness of a member of the evolving population.
Against a population of evolved stochastic strategies (row 1 of table 7.7),
the opponent modeller, on average, does outscore the evolving population,
performing well in certain generations, but not in others. As in the equiv-
alent experiment against pure strategies, this performance depends on the
number of grim-like strategies in the evolving population – when the num-
ber of grim-like strategies is high, performance is relatively weak; when
the number of grim-like strategies is low, performance is relatively high.
However, unlike the experiment involving pure strategies, the performance
of the opponent modeller is more unstable, perhaps due to large dynamic
changes in the composition of the opponent strategies observed in the evo-
lution of a population of stochastic strategies. No such large-scale changes
in strategy composition were evident in the evolution of a population of
pure strategies (contrast the variance in the numbers of each fixed strategy
in table 7.5 and table 7.8).
As in the experiment with pure strategies, as the importance of the per-
formance against the opponent modeller increases, the average payoff of the
evolving population against the opponent modeller increases. However, in
contrast to the experiments involving pure strategies, we see that the evolv-
ing population is able to obtain a higher average payoff than the opponent
modeller for weightings greater than 0.2 (recall previously that an evolv-
ing population of pure strategies was unable to surpass the performance
of an opponent modeller regardless of the relative weighting). Indeed, at
a weighting of 1.0, the evolving population is able to achieve an average
payoff of greater than 3 against the opponent modeller, whilst the oppo-
nent modeller scores less than 2.5 on average, suggesting that the evolving
population is exploiting the opponent modeller. Why is the opponent mod-
eller scoring less than its opponent in these scenarios? Does this represent
a failure for our opponent modelling approach, or even for opponent modelling in general? The key to understanding these observations has to do with the
composition of the evolving population.
At the higher weightings, tit-for-tat-like and STFT-like strategies ac-
count for the majority of strategies making up the evolving population
(indeed, grim-like strategies have mostly disappeared). If these were pure
strategies, we would expect to see them achieve an average payoff of no
more than 3 (mutual cooperation). However, these strategies are not pure,
instead behaving stochastically, acting mostly like their pure strategy coun-
terpart, but sometimes not. This means that a stochastic TFT-like strategy
will typically play like tit-for-tat and enter into mutual cooperation. How-
ever, occasionally, this stochastic tit-for-tat-like strategy will attempt an
unprovoked defection.
To understand why these stochastic variants are successful, particularly
against the opponent modeller, we need to consider the nature of the game.
As IPD is not a zero-sum game, and since the objective of the opponent
modeller is to achieve the highest payoff it can (and not to achieve a higher
payoff than its opponent), it is often better for the opponent modeller to
accept the occasional defection without retaliating in order to achieve a
higher average payoff in the long run (provided the defection doesn’t occur
too frequently). Indeed, if the opponent modeller was to reciprocate every
defection by its opponent, it would be able to prevent its opponent from
significantly out-scoring it, but at the cost of lowering its own average payoff
(for example, it is better to accept an average payoff of 2 and allow your
opponent an average payoff of 4 than to retaliate and restrict both players
to an average payoff of 1).
This scenario provides an interesting demonstration of the interactions
between evolution and learning in a competitive environment. We have ob-
served that evolution produces IPD players that can improve their average
payoffs against the opponent modeller by employing occasional unprovoked
defections. It would seem that as long as the evolved strategies do not de-
fect often enough to evoke retaliation from the opponent modeller, they will
achieve higher average payoffs than the opponent modeller. In response, the
opponent modeller seemingly recognises that, although it is being exploited,
it will achieve better future rewards by not retaliating, since its opponent
will resume cooperation after each unprovoked defection. Indeed, this sug-
gests that Axelrod’s third guideline for playing IPD (“always reciprocate
cooperation and defection”) does not apply against stochastic strategies.
We still deem this a success for the opponent modeller – indeed, the oppo-
nent modeller is still able to achieve the highest payoff possible against this
particularly “nasty” opponent. Sometimes, you just have to grin and bear
it.
7.3. Conclusions
IPD is a game that models human choices in self-interested environments.
Previous studies of the game have focused on both evolution and standard
artificial intelligence techniques to study game strategies. However, some-
thing has been missed in these previous investigations – the role of a theory
of mind, specifically, of adapting one’s play based upon a learned model of
an opponent’s strategy. This is the area of opponent modelling – building
a representation of an opponent’s strategy, typically from experience, in or-
der to exploit weaknesses in their play. The trade-off between exploration
(searching for better ways to exploit an opponent) and exploitation (taking
advantage of the weaknesses in an opponent’s strategy) is paramount to
the success of the opponent modeller – too much strategy exploration and
the opponent modeller may not solidify its advantage; too little strategy
exploration and the opponent modeller may be sacrificing potential gains.
A balance between the two must be achieved for near-optimal play.
Using an observational model of the choices made by an opponent and a
simple technique to select the best choice given the next most likely action of
the opponent, we have introduced a simple approach to construct computer
IPD players capable of exploiting observable strategy weaknesses in oppo-
nents’ play. Our experiments show that a computer opponent modelling
IPD player is able to outperform an evolving population of fixed pure-
strategy opponents in terms of average payoff in play against each other
and perform as well as possible against a population of stochastic-strategy
opponents. Further, the strong performance of our entry in the IPD compe-
tition held at CIG’05, the 2005 IEEE Computational Intelligence in Games
conference, supports our claims of the benefits of opponent modelling – our
entry, based on the ideas presented in this work, consistently finished in
the top five in the classical IPD competitions, and performed honourably
in the collusion-based competitions.
Beyond the IPD game, this work makes a contribution to the question of
how intelligent behaviour evolves. Higher intelligence is more than simple
mimicry or rote learning, requiring the ability to predict and respond to
specific “opponent” choices. Our work reflects on a Machiavellian view of
intelligence, in which the manipulation of the behaviour of other individuals
is crucial. High levels of intelligence are not universal in Nature – the
majority of life is simple and unintelligent, and human-level intelligence is unique. A traditional explanation for this invokes the cost in terms of energy
needs of a highly developed brain. One of our reviewers pointed out that our
approach offers a fundamentally different explanation in terms of the cost
of exploration. Yet another explanation is the self-limiting dynamics of
having an intelligent sub-population. Our experiments show, for example,
that opponent modelling is a viable strategy in an IPD environment, and
moreover, that the presence of opponent modellers affects the success of
other strategies, which in turn alters the characteristics of that environment.
This may be an important factor to consider in any study of the evolution of
intelligence. The subtleties and parameters of such interactions might offer
an explanation as to why the varying requirements of different ecological
niches lead to co-existence of species having different levels of intelligence.
Further study is needed to understand such interactions and the factors
that determine their outcomes.
References
Ashlock, D., Smucker, M. D., Stanley, E. A. and Tesfatsion, L. (1996) Preferential partner selection in an evolutionary study of prisoner's dilemma, BioSystems, 37, pp. 99-125.
Axelrod, R. (1984) The Evolution of Cooperation. New York, Basic Books.
Byrne, R. W. and Whiten, A. (1988) Machiavellian Intelligence: Social Expertise and the Evolution of Intellect in Monkeys, Apes and Humans. Oxford, Clarendon Press.
Calvin, W. H. (1983) A Stone's Throw and its Launch Window: Timing Precision and its Implications for Language and Hominid Brains, Journal of Theoretical Biology, 104, pp. 121-135.
Calvin, W. H. (1991) The Ascent of Mind. Bantam.
Fogel, D. B. (1993) Evolving behaviors in the iterated prisoner's dilemma, Evolutionary Computation, 1, 1, pp. 77-97.
Hingston, P. and Kendall, G. (2004) Learning versus Evolution in Iterated Prisoner's Dilemma, Proceedings of the IEEE Congress on Evolutionary Computation (CEC'04), Portland, IEEE, pp. 364-372.
Markovitch, S. and Reger, R. (2005) Learning and Exploiting Relative Weaknesses of Opponent Agents, Autonomous Agents and Multi-Agent Systems, 10, pp. 103-130.
Maynard-Smith, J. (1988) Did Darwin Get It Right? Essays on Games, Sex and Evolution. Penguin Books.
Maynard-Smith, J. and Harper, D. (2003) Animal Signals. Oxford, Oxford University Press.
Miller, G. F. (1997) Protean primates: The evolution of adaptive unpredictability in competition and courtship, in Machiavellian Intelligence II: Extensions and Evaluations. Cambridge, Cambridge University Press, pp. 312-340.
Miller, J. H. (1996) The coevolution of automata in the repeated prisoner's dilemma, Journal of Economic Behavior and Organization, 29, pp. 87-112.
Nowak, M. and May, R. (1992) Evolutionary games and spatial chaos, Nature, 359, pp. 826-829.
Parker, S. T., Mitchell, R. W. and Boccia, M. L., Eds. (1994) Self-awareness in Animals and Humans: Developmental Perspectives. Cambridge, Cambridge University Press.
Whiten, A. B. and Byrne, R. W. (1997) Machiavellian Intelligence II: Extensions and Evaluations. Cambridge, Cambridge University Press.
Chapter 8
On some winning strategies for the Iterated Prisoner’s
Dilemma or Mr. Nice Guy and the Cosa Nostra
Wolfgang Slany and Wolfgang Kienreich
Technical University, Graz, Austria
We submitted two kinds of strategies to the iterated prisoner’s dilemma
(IPD) competitions organized by Graham Kendall, Paul Darwen and Xin
Yao in 2004 and 2005.a Our strategies performed exceedingly well in both
years. One type is an intelligent and optimistic enhanced version of the well
known TitForTat strategy which we named OmegaTitForTat. It recognizes
common behaviour patterns and detects and recovers from repairable mutual defection deadlock situations, otherwise behaving much like TitForTat.
OmegaTitForTat was placed as the first or second individual strategy in
both competitions in the leagues in which it took part. The second type
consists of a set of strategies working together as a team. The call for par-
ticipation of the competitions explicitly stated that cooperative strategies
would be allowed to participate. This allowed a form of implicit communi-
cation which is not in keeping with the original IPD idea, but represents a
natural extension to the study of cooperative behaviour in reality, which is what the simple, yet insightful, iterated prisoner's dilemma model aims at. Indeed, one's behaviour towards another person in reality
is very often influenced by one’s relation to the other person.
In particular, we submitted three sets of strategies that work together as
groups. In the following, we will refer to these types of strategies as group
strategies. We submitted the CosaNostra,b the StealthCollusion, and the
EmperorAndHisClones group strategies. These strategies each have one dis-
tinguished individual strategy, respectively called the CosaNostraGodfather
aSee http://www.prisoners-dilemma.com/ for more details.
bOne of us, Slany, had submitted the CosaNostra group strategy previously to an iterated
prisoner’s dilemma competition organized by Thomas Grechenig in 1988. Our submitted
group strategies are inspired by this first formulation of such a group strategy that we
are aware of.
(called ADEPT in 2004), the Lord strategy, and the Emperor, that heavily
profit from the behaviour of the other members of their respective groups:
the CosaNostraHitmen (10 to 20 members), the Peons (open number of
members), and the CloneArmy (with more than 10,000 individually named
members), which willingly let themselves be abused by their masters while lowering the scores of all other players as much as possible, thus further maximizing the performance of their masters relative to the other participants. Our group strategies were placed first, second and third in several leagues of the competitions and also were likely the most efficient
of all group strategies that took part in the competitions. Such group
strategies have since been described as collusion group strategies. We will
show that the study of collusion in the simplified framework of the iterated
prisoner’s dilemma allows us to draw parallels to many common aspects
of reality, in Nature as well as in Human Society, and therefore further
extends the scope of the iterated prisoner’s dilemma as a metaphor for the
study of cooperative behaviour in a new and natural direction. We fur-
ther provide evidence that it will be unavoidable that such group strategies
will dominate all future iterated prisoner’s dilemma competitions as they
can be stealthy camouflaged as non-group strategies with arbitrary sub-
tlety. Moreover, we show that the general problem of recognizing stealth
colluding strategies is undecidable in the theoretical sense.
The organization of this chapter is as follows: Section 8.1 introduces the terminology. Section 8.2 evaluates our results in the competitions. Section 8.3 describes our strategies. Section 8.4 analyses the performance of our and similar strategies and proves the undecidability of recognizing collusion. Section 8.5 relates the findings to phenomena observed in Nature and Human Society and draws conclusions.
8.1. Introduction
The payoff values in an iterated prisoner’s dilemma are traditionally called
T (for temptation to betray a cooperating opponent), S (for sucker’s payoff
when being betrayed while cooperating oneself), P (for punishment when
both players betray each other), and R (for reward when both players coop-
erate with each other). Their values vary from formulation to formulation
of the prisoner’s dilemma. Nevertheless, the inequalities S < P < R < T
and 2R > T + S are always observed between them. The last one ensures
that cooperating twice (2R) pays more than alternating one’s own betrayal
of one’s partner (T) with allowing oneself to be betrayed by him or her (S)
[Kuhn (2003)]. In the iterated prisoner’s dilemma competitions organized
by Graham Kendall, Paul Darwen and Xin Yao in 2004 and 2005, these
values were, respectively, S = 0, P = 1, R = 3, and T = 5. Note that
the general results in Section 8.4 are true for arbitrary values constrained by
the inequalities stated above.
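With the competition values these constraints are easy to verify; the following lines simply check the two inequalities:

    S, P, R, T = 0, 1, 3, 5    # the values used in the 2004 and 2005 competitions

    assert S < P < R < T       # the ordering that creates the dilemma
    assert 2 * R > T + S       # 2R = 6 > T + S = 5: two rounds of mutual
                               # cooperation beat taking turns at betrayal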
8.2. Analysis of the Tournament Results
The strategies we submitted to the competitions were the OmegaTitForTat
individual, single-player strategy (OTFT), the CosaNostra group strategy,
the StealthCollusion group strategy, and the EmperorAndHisClones group
strategy. The following subsections summarize the results, followed by two
sections commenting on real and presumed irregularities in some of the
results.
8.2.1. 2004 competition, league 1 (standard IPD rules, with
223 participating strategies)
• Our OTFT was the best non-group, individual strategy.
• Our Godfather strategy (called ADEPT in 2004) of our CosaNostra group
was the second best group strategy (with fewer than 10 members) after the
STAR group strategy of Gopal Ramchurn (with 112 members, though
we are not sure that all strategies colluded as one group). Note that
even badly performing group strategies can score arbitrarily higher than
individually better group strategies by sheer numerical superiority (see
below and Section 0). We also initially noted with one eyebrow raised
that 112 is exactly the smallest integer larger than 223 divided by 2, so the
STAR group members were just more than 50% of the total population.
However, we now believe that this might have been just a coincidence.
• Our EmperorAndHisClones group strategy was not allowed to fully compete but would have won by a large margin (it had more than 10,000 individually named clones, of which unfortunately only one was eventually allowed to participate); for payoff values see below. EMP scored as well as ADEPT, as it followed the same recognition protocol.
• Our StealthCollusion group strategy (sent in by a virtual person Con-
stantin Ionescu and called LORD and PEON) participated as a proof of
the collusion concept, apparently without detection of the collusion by
the organizers, as further variants of members of the CosaNostra group
strategy. Constantin asked the organizers to clone his PEON strategy
as often as possible; however, only one copy was eventually allowed to
participate. Read more about Constantin later in Section 0.
Simple calculations show that a numerical advantage would have vastly improved the results of our ADEPT and EmperorAndHisClones strategies. In all the following calculations we neglect protocol losses among group members, as doing so makes the numbers reported below only insignificantly higher than the scores that would really have been achieved had the competitions taken place as described. Table 8.1a shows the results of the tournament with the number of clones actually allocated. Table 8.1b shows the estimated results if 100 additional clones had been allowed for our collusion strategy. Table 8.1c shows how 10,000 additional clones would have influenced the results. These results were computed for an average of 200 turns per game, giving on the one hand the full temptation payoff T to EMP/ADEPT from their CosaNostraHitmen, Peons, and clones of the CloneArmy, whereas EMP/ADEPT played OmegaTitForTat against all strategies outside our group, thus achieving the same result against these as if the very well performing OmegaTitForTat strategy had been used by itself. CosaNostraHitmen, Peons, and clones of the CloneArmy on the other hand always cooperated with their EMP/ADEPT bosses while permanently betraying all strategies outside our group, thus resulting in the full punishment payoff P, or even the sucker's payoff S, both for strategies outside our group and for themselves. Clearly, had our strategies been composed of as many members as the STAR strategy or, even better, as many as we had submitted, they very plausibly would have won by large factors (43% with 100 additional members, 800% with 10,000 additional members as we had submitted).
We can therefore plausibly conjecture, under the assumption that the STAR strategy had more than 100 strategies colluding with each other, that our group strategies would have been vastly more efficient than the winning STAR group strategy. They would have won had we been allowed to play as we had submitted our strategies, which was positively hinted at by one of the organisers in a mail received from Graham Kendall on May 29, 2004; otherwise we would have inflated our stealth collusion strategies, for which we had prepared a respectable number of virtual persons similar to Constantin Ionescu, as described in Section 0. Also note that a sufficiently large group of real people (e.g., one of us, Slany, has to teach 750 computer science students each year who in theory could all be enticed to participate) would have produced a similar effect.
Table 8.1a. Original tournament results.

    Rank    Player                   Strategy                    Score
    1       Gopal Ramchurn           StarSN (StarSN)             117,057
    2       Gopal Ramchurn           StarS (StarS)               110,611
    3       Gopal Ramchurn           StarSL (StarSL)             110,511
    4       GRIM (GRIM Trigger) 1    GRIM (GRIM Trigger)         100,611
    5       Wolfgang Kienreich       OTFT (Omega tit for tat)    100,604
    6       Wolfgang Kienreich       ADEPT (ADEPT Strategy)      96,291
    7       Emp 1                    EMP (Emperor)               95,927
    8       Bingzhong Wang           (noname)                    94,161
    9       Hannes Payer             Probbary                    94,123
    10      Nanlin Jin               HCO (HCO)                   93,953
Table 8.1b. Tournament results with additional 100 clones.

    Rank    Player                   Strategy                    Score
    1       Wolfgang Kienreich       ADEPT (ADEPT Strategy)      196,291
    2       Emp 1                    EMP (Emperor)               195,927
    3       Gopal Ramchurn           StarSN (StarSN)             137,057
    4       Gopal Ramchurn           StarS (StarS)               130,611
    5       Gopal Ramchurn           StarSL (StarSL)             130,511
    6       GRIM (GRIM Trigger) 1    GRIM (GRIM Trigger)         120,611
    7       Wolfgang Kienreich       OTFT (Omega tit for tat)    120,604
    8       Bingzhong Wang           (noname)                    114,161
    9       Hannes Payer             Probbary                    114,123
    10      Nanlin Jin               HCO (HCO)                   113,953
8.2.2. 2004 competition, league 2 (uncertainty IPD vari-
ant, same 223 participating strategies as in the first
league)
• OTFT was a very close 2nd.
• ADEPT and other Godfather variants ranked as the 2nd group strategy.
8.2.3. 2005 competition, league 1 (standard IPD rules, with
192 participating strategies)
• CosaNostra Godfather was overall winner, with 20 CosaNostra Hitmen
participating in the CosaNostra group strategy.
• OTFT did not participate; it remains unclear why.
Table 8.1c. Tournament results with additional 10,000 clones.

    Rank    Player                   Strategy                    Score
    1       Wolfgang Kienreich       ADEPT (ADEPT Strategy)      10,096,291
    2       Emp 1                    EMP (Emperor)               10,095,927
    3       Gopal Ramchurn           StarSN (StarSN)             2,117,057
    4       Gopal Ramchurn           StarS (StarS)               2,110,611
    5       Gopal Ramchurn           StarSL (StarSL)             2,110,511
    6       GRIM (GRIM Trigger) 1    GRIM (GRIM Trigger)         2,100,611
    7       Wolfgang Kienreich       OTFT (Omega tit for tat)    2,100,604
    8       Bingzhong Wang           (noname)                    2,094,161
    9       Hannes Payer             Probbary                    2,094,123
    10      Nanlin Jin               HCO (HCO)                   2,093,953
• Our StealthCollusion group strategy member LORD was placed 5th, the
collusion again apparently being undetected by the organizers.
8.2.4. 2005 competition, league 4 (standard IPD rules, but
only non-group, individual strategies were allowed to
participate; 50 participating strategies)
OTFT was a very close 2nd. Detailed analysis of the results initially suggested that the first-placed strategy APavlov might have been a member of a stealth colluding group strategy; this later turned out most likely not to be true. However, our probably mistaken analysis of some strategies that seemed to be involved illustrates how difficult it can be to clearly differentiate between stealth collusion strategies and strategies that only appear to behave as colluding strategies, seemingly showing a cooperative behaviour that in fact emerges randomly among strategies that are not consciously cooperating with each other. A more detailed analysis
follows in the discussion below.
8.2.5. Analysis of OmegaTitForTat’s (OTFT) performance
In the following, we review the performance of our single player, individual
OTFT strategy in more detail. In the first league of the 2004 competi-
tion, which was intended to be a replay of the famous first iterated pris-
oner’s dilemma competition organized by Robert Axelrod in 1984 [Axelrod
(1984)], our OTFT strategy was arguably placed second together with the
default GRIM strategy out of a total of 223 participating strategies. Ac-
tually OTFT was placed third after the GRIM strategy, GRIM leading
by a mere 0.007%. However, this lead was later seriously put into
question by the fact that GRIM on average had played 0.92% more games
than OTFT in the tournament, as pointed out by Abraham Heifets in an
email sent to the organizers on March 29, 2005, which the organizers kindly forwarded to us. More rounds obviously add to the score, so this difference was significant. When the results are scaled to reflect the difference, OTFT would have been placed as the first non-group strategy before GRIM, with an estimated payoff of 101,530 points compared to the 100,611 of GRIM.
OTFT and GRIM were clearly outperformed only by winning strategies that were members of the same stealth colluding group of strategies, sent in by Gopal Ramchurn.
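The scaling is simple arithmetic, shown below for transparency:

    otft_raw, grim_raw = 100_604, 100_611
    extra_games = 0.0092                 # GRIM played 0.92% more games on average

    otft_scaled = otft_raw * (1 + extra_games)
    print(round(otft_scaled), grim_raw)  # 101530 vs 100611: OTFT ahead once scaled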
In the following we will refer to Ramchurn’s group as the STAR group
strategy. More on group strategies against individual strategies will fol-
low in Section 0. Let us just remark here that we will show in Section 0
that group strategies can perform arbitrarily better than non-group, single-
player strategies. This basically means that OTFT was the best single-
player strategy. Moreover, the good results of GRIM are very likely due to
the tournament having been dominated by the STAR group strategy, with
its individual group members accounting for more than 50% of the partic-
ipating strategies. GRIM scores best against STAR group members that
always defect against members outside their group, the purpose being to
damage competing strategies by always defecting (ALLD), because GRIM
has a very short (one turn) interval of determination before it switches
to ALLD itself. OTFT loses some points in comparison because of inter-
spaced recovery trials during which OTFT cooperates instead of continuing
to defect. However, in Section 0 we show that, with and without a high
percentage of ALLD strategies OTFT is robustly superior to GRIM.
In the second league of the 2004 competition, which was the league with a small probability of erroneous interpretation of the other player's last move, OTFT was placed as the second best non-group, individual strategy, finishing after three members of Ramchurn's STAR group and an individual strategy sent in by Colm O'Riordan.c GRIM again ranked high but
was slightly outperformed by OTFT, a result that was to be expected in the
slightly randomized setting of this league. Miscommunication does happen
in the real world, so this illustrates again that in a non-perfect environment
an optimistic strategy like OTFT fares better than one with a pessimistic
world-view such as GRIM. It also shows that OTFT was again among the
cOne of our reviewers learned from O'Riordan that this strategy is actually very similar to OTFT.
best single-player strategies, now also in an environment in which miscom-
munication happens inherently.
For reasons that remain unclear to the authors, OTFT was not allowed
to participate in the first and second leagues in the 2005 competition.
However, OTFT achieved a second place in league number four in the
2005 competition, which was the league allowing participation of only one
strategy by each team, thereby supposedly eliminating the participation of
group strategies. The winner was the strategy APavlov, sent in by Jia-Wei Li, outperforming our second-placed OTFT by 1.2%.
8.2.6. The practical difficulty of detecting collusion
The small margin by which APavlov outperformed OTFT caused us to take
a very close look at the tournament results of the single-player league. We
first note that in the overall results there were strategies present that
achieved a lower score than ALLC (always cooperates), RAND (randomly
cooperates or defects), NEG (always plays the opposite of what the
opponent played last; its first move is random), and the other standard
strategies that usually rank lowest in tournaments in which only
single-player strategies are present. These scores are shown in Table 8.2.

It takes quite an amount of ingenuity to achieve scores as low as those of
the last three candidates. Each one scored even lower than standard RAND
and NEG, and all three scores lie within an interval below the variance
introduced by the RAND strategy. We initially suspected that the last three
strategies represented part of a collusion strategy somebody had tried to
introduce into
Table 8.2. Strategies having the lowest score in
2005’s league 4.
Rank Player Strategy Score
39 (Standard) ALLC 22,182
40 Oscar Alonso IBA 22,054
41 Oliver Jackson OJ 21,694
42 Bin Xiang A1 19,586
43 Quek Han Yang SPILA 19,518
44 (Standard) ALLD 18,764
45 Kaname Narukawa (noname) 18,592
46 (Standard) RAND 18,153
47 (Standard) NEG 17,176
48 Bernat Ricardo ALT 16,934
49 Yusuke Nojima (noname) 16,383
50 Yannis Aikater TCO3 16,228
Table 8.3. Collusion suspects: TCO3 and ALT cooperating with APav.
TCO3 C D D C C D D C C C C C C...
ALT C D D C C D D C C C C C C...
APav C C D D C C D D D D D D D...
Table 8.4. Collusion suspects: TCO3 and ALT cooperating with OTFT.
TCO3 C D D C C D D C C D D C C...
ALT C D D C C D D C C D D C C...
OTFT C C D D C C D D C C D D D...
Table 8.5. Collusion suspect: TCO3 showing TFT a cold shoulder.
TCO3 C D D C C D D C C D D C...
TFT C C D D C C D D C C D D...
the single-player league and therefore took a closer look at their style of
play with respect to standard strategies and to player strategies, including
the winning strategy APavlov and our OTFT strategy.

Analysis of two suspect strategies showed that they looked very much as if
they cooperated with the winning APavlov strategy (compare Table 8.3),
but also with our OTFT strategy (compare Table 8.4), raising their score
by cooperating in the face of continuous defection. On the other hand, the
suspect strategies did not exhibit this kind of cooperative behaviour against
defection by standard strategies (compare Table 8.5).

Obviously, a trigger sequence of moves similar to the protocol exchange
employed by our CosaNostra strategy (see Section 8.3.2.1) caused the switch
to an exploitable ALLC behaviour in the strategies analysed above.
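To make the trigger mechanism concrete, the following is a minimal, purely
hypothetical sketch of such a "sleeper" strategy (our own illustration; the
class name, the four-move trigger sequence, and the plain-TFT default are
assumptions, not reconstructed competition code). It plays inconspicuously
until it observes a fixed trigger pattern in the opponent's moves and then
switches to exploitable ALLC:

    import java.util.LinkedList;

    // Hypothetical "sleeper": default TFT, switches to ALLC after a trigger.
    public class SleeperStrategy {
        private static final char[] TRIGGER = { 'C', 'D', 'D', 'C' }; // assumed trigger
        private final LinkedList<Character> recent = new LinkedList<>();
        private boolean triggered = false;

        public char getMove(char opponentsLastMove) {
            recent.add(opponentsLastMove);
            if (recent.size() > TRIGGER.length) recent.removeFirst();
            triggered = triggered || matchesTrigger();
            if (triggered) return 'C';   // exploitable ALLC from now on
            return opponentsLastMove;    // default: plain TFT (retaliate in kind)
        }

        private boolean matchesTrigger() {
            if (recent.size() < TRIGGER.length) return false;
            for (int k = 0; k < TRIGGER.length; k++)
                if (recent.get(k) != TRIGGER[k]) return false;
            return true;
        }
    }

The point is that the default behaviour is entirely reasonable for a strategy
playing to win, so nothing short of knowing the trigger distinguishes such a
strategy from an honest player.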
Now, we cannot speak for the authors of APavlov, but we swear on our
honour and solemnly declare that we did not consciously implement
collusion features into OTFT, nor did we introduce any of the suspect
strategies above ourselves.d Both OTFT and APavlov, if its name is any
indicator of the type of algorithm used, are strategies that try to correct
for occasional mistakes. Such strategies have generally been known to
outperform TitForTat (see, for example, [Nowak and Sigmund (1993)]) and
to rank highly in single-player tournaments. In this case, the correction
algorithm in both
d One reviewer suggested that swearing on our honour and solemnly declaring this would
not be necessary. However, since this chapter involves so many aspects of stealth collusion,
we felt it would help to assure readers that OTFT was not involved in any collusion.
strategies obviously triggered the exploitable behaviour in the collusion
suspects, effectively "taking over someone else's hitman" in the terminology
of our CosaNostra collusion strategy (compare Section 8.3.2.1).

We conclude that in the presence of strategies which exhibit exploitable
behaviour based on very simple trigger mechanisms, collusion as a concept
is essentially undetectable. It is not possible to denounce a strategy for
using collusion if the behaviour triggering the collusion is entirely reasonable
in the context of standard strategies playing to win. In IPD competitions
in which cooperation and defection can be expressed gradually, that is,
when more than one payoff level and multiple choices exist, as in league 3
of the two competitions of 2004 and 2005, this cooperation can be hidden
with even more subtlety. In Section 8.4.3 we will show that, in general,
deciding whether a set of strategies is involved in a collusion group is among
the most difficult questions that can theoretically arise.
8.3. Details of Our Strategies
8.3.1. OmegaTitForTat, or Mr. Nice Guy meets the iterated
prisoner’s dilemma
The OmegaTitForTat (OTFT) strategy is based on heuristics targeting
several tournament situations which have been identified, by tests and
statistical analysis, as being both common and damaging to conventional
strategies for the IPD. In a tournament environment, certain types of
strategy behaviour are very common, both in standard strategies added to
provide a baseline for performance comparisons and in custom strategies
designed to dominate. Several such types of behaviour have been identified,
and solutions to optimize the interaction with them have been implemented
in OTFT. Let us note that, while we constructed OTFT from scratch,
similar forgiving strategies have been described in the literature; see, for
example, [Nowak and Sigmund (1993); Beaufils, Delahaye, and Mathieu
(1996); Tzafestas (2000); O'Riordan (2000)].
8.3.1.1. Suspicion
A common trait of many strategies, including the SuspiciousTitForTat
(STFT) strategy from the standard set of strategies used in the tournament,
is suspicion: the strategy starts by playing defect, or plays defect after a
succession of mutual cooperation. Such a move can prove beneficial for a
strategy if the opponent does not immediately counter a defection;
Table 8.6. Deadlock between TFT and STFT.
TFT C D C D C D C D ...
STFT D C D C D C D C ...
for example, TFTT (TitForTwoTat) would not react to occasional, singular
defections, thus giving a suspicious strategy a clear advantage. Note that
suspicious strategies need not keep defecting after an initial defection:
the STFT strategy, for example, simply plays standard TFT but starts
each game with a defection.
The problem many strategies encounter when facing suspicion is deadlock:
if a strategy is programmed to counter defection in a TitForTat manner,
and the suspicious strategy itself is programmed the same way, one
suspicious defection can cause a mutual exchange of defections between
two strategies which could cooperate perfectly if only one player would
once forgive a defection. In general, we define deadlock as any situation
where a succession of defections is played by two strategies because of
an out-of-phase TitForTat behaviour, as shown in Table 8.6.

OTFT counters deadlocks by forgiving a certain number of defections
when a strategy has cooperated for a long time. OTFT starts by cooperating
and then tracks the number of cooperations encountered. The initial
idea was that for a certain amount of cooperation, a certain number of
defections would be forgivable. The final OTFT algorithm incorporates this
idea, together with other adaptations, into a single strategy as described
below.
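The deadlock of Table 8.6 is easy to reproduce in a few lines. The following
standalone snippet is our own illustration (it does not use the IPDLX
framework of the competitions); it simply lets TFT and STFT echo each
other's previous move:

    // Reproduces the TFT-vs-STFT deadlock of Table 8.6: one initial
    // suspicious defection locks both players into out-of-phase retaliation.
    public class DeadlockDemo {
        public static void main(String[] args) {
            char tft = 'C';   // TFT opens with cooperation
            char stft = 'D';  // STFT opens with a suspicious defection
            StringBuilder rowTft = new StringBuilder(), rowStft = new StringBuilder();
            for (int round = 0; round < 8; round++) {
                rowTft.append(tft).append(' ');
                rowStft.append(stft).append(' ');
                char nextTft = stft;   // each side simply echoes ...
                char nextStft = tft;   // ... the other's last move
                tft = nextTft;
                stft = nextStft;
            }
            System.out.println("TFT  " + rowTft);  // C D C D C D C D
            System.out.println("STFT " + rowStft); // D C D C D C D C
        }
    }

After the first suspicious defection, the two otherwise identical strategies
never resynchronize: each forgives exactly when the other retaliates.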
8.3.1.2. Randomness
Randomness, in the form of cooperative and defective moves varying without
any discernible pattern, can be introduced by simulated noise in the
command transmission, as used in several specific tournament environments,
or it can be a trait of a strategy as such. Strategies trying to gain
by finding a cooperative base with an opponent face a difficult problem
when the opponent acts erratically: finding a cooperative base requires
some small sacrifice (for example, STFT and TFTT, in contrast to TFT,
can cooperate for the whole game because TFTT sacrifices the initial
defection). However, a random strategy is highly unlikely to stick to
cooperative behaviour, so the sacrifice cost mounts and damages the score
of an otherwise successful, cooperative strategy.
As a consequence, randomness must be detected in an opponent's behaviour
and countered appropriately: by playing ALLD (full defection).
There is no way to gain from mutual cooperation if an opponent plays
completely randomly. Nevertheless, a strategy can at least deny such an
opponent gains by defecting itself, and can moreover profit from defecting
on any unrelated cooperative moves of the random strategy.

OTFT counters randomness by playing ALLD once a strategy has exhibited
a certain amount of random behaviour. The initial idea was to cut
losses against the standard RAND strategy. However, in the final OTFT
algorithm, the randomness-detection routine was merged with other traits
into the single strategy described below.
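As a quick sanity check (our own arithmetic, assuming the classic payoff
values T = 5, R = 3, P = 1, S = 0 used in the competitions): against an
opponent that cooperates and defects with equal probability,

E[ALLD vs RAND] = (T + P)/2 = (5 + 1)/2 = 3 points per round,
E[ALLC vs RAND] = (R + S)/2 = (3 + 0)/2 = 1.5 points per round,

so systematic defection earns on average twice as much per round against
pure randomness as systematic cooperation does.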
8.3.1.3. Exploits
Many strategies can be devised that try to exploit forgiving behaviour. For
example, a simple strategy could be designed to check once whether it is
playing against any type of TFTT opponent, who forgives one defection
"for free", and to exploit such behaviour. Table 8.7 shows the result of such
an exploit strategy at work on TFTT.

Fully countering such exploits leads to a strategy similar to PAV: constant
checks would ensure that an opponent does not gain more from the
current play mode than oneself. When devising a scheme to implement
such checks, a solution was found which also addresses the above-mentioned
problems of randomness and suspicion. The result is the final version of
the OTFT algorithm.
8.3.1.4. OTFT
The OTFT algorithm starts by playing C and then plays TFT. It maintains
a variable tracking the behaviour of the opponent according to the typical
situations described above: every time the opponent's move differs from the
opponent's previous move, and every time the opponent's move differs
from OTFT's previous move, the variable is increased. Every time the
opponent cooperates with OTFT, the variable is decreased. These rules
allow tracking of randomness and exploits: taking mutual cooperation
Table 8.7. A strategy exploiting TFTT.
EXPL D D C D D C D D C D D ...
TFTT C C D C C D C C D C C ...
as the mutually most beneficial case, each change of move by the opponent
indicates either some kind of randomness or an attempt to exploit the
TFT behaviour used by OTFT. When this so-called exploit tracker in
OTFT reaches a certain value, the algorithm switches to all-out defection
(ALLD) to cut losses against an opponent repeatedly breaking cooperation.

A second mechanism is at work that allows recovery from deadlocks
as described above. When OTFT plays standard TFT, it is vulnerable
to deadlock, so independently of the exploit tracker just described, a second
variable counts the number of times the opponent's move was the opposite
of OTFT's move. If this so-called deadlock tracker registers a certain
number of exchanges of C and D, an additional C is played and the deadlock
counter is reset. As a consequence, OTFT is able to recover from deadlocks
occurring anywhere in a given exchange of moves.
8.3.1.5. Examples
Table 8.8 demonstrates how the desired avoidance of deadlocks is achieved
in a game played by OTFT versus STFT.
8.3.1.6. OTFT’s behaviour laid bare
In the end, there is no more detailed and exact description of OTFT’s inner
workings than the source code of its implementation. Luckily, the code is
short and easy to understand. We therefore reproduce it in Table 8.10,
leaving aside only the general parts required for the IPDLX framework
that was used in the competitions.e
Table 8.8. Deadlock resolved by OTFT.
OTFT C D C D C C C C C...
STFT D C D C D C C C C...
Table 8.9 shows how OTFT counters random strategies with all-out
defection after a certain amount of random behaviour has been detected.
Table 8.9. Random recognized and countered by OTFT.
OTFT C C D C D C C D C C C D D D D...
RAND C D C D D D C C C D D C D C C ... (random Cs and Ds)
e For details of IPDLX see http://www.prisoners-dilemma.com/competition.html#java
Table 8.10. Main parts of OTFT’s source code.
private static final int DEADLOCK_THRESHOLD = 3;
private static final int RANDOMNESS_THRESHOLD = 8;

public void reset() {
  super.reset();
  deadlockCounter = 0;
  randomnessMeasure = 0;
  opponentMove = COOPERATE;
  opponentsPreviousMove = COOPERATE;
  myPreviousMove = COOPERATE;
}

public double getMove() {
  if( deadlockCounter >= DEADLOCK_THRESHOLD ) {
    // OTFT assumes a deadlock and tries to break it by cooperating ...
    myReply = COOPERATE; // ... twice ...
    if( deadlockCounter == DEADLOCK_THRESHOLD )
      deadlockCounter = DEADLOCK_THRESHOLD + 1;
    else // ... and then assumes the deadlock has been broken
      deadlockCounter = 0;
  } else { // OTFT assumes that there is no deadlock (yet)
    // OTFT assesses the randomness of the opponent's behaviour
    if( opponentMove == COOPERATE && opponentsPreviousMove == COOPERATE )
      randomnessMeasure--;
    if( opponentMove != opponentsPreviousMove )
      randomnessMeasure++;
    if( opponentMove != myPreviousMove )
      randomnessMeasure++;
    if( randomnessMeasure >= RANDOMNESS_THRESHOLD ) {
      // OTFT switches to ALLD (randomnessMeasure can only increase)
      myReply = DEFECT;
    } else { // OTFT assumes the opponent is not (yet) behaving randomly
      // OTFT behaves like TFT ...
      myReply = opponentMove;
      // ... but checks whether a deadlock situation seems to arise
      if( opponentMove != opponentsPreviousMove )
        deadlockCounter++;
      else // OTFT recognizes that there is no sign of a deadlock
        deadlockCounter = 0;
    }
  }
  // OTFT memorizes the current moves for the next round
  opponentsPreviousMove = opponentMove;
  myPreviousMove = myReply;
  return( super.getFinalMove(myReply) );
}
8.3.2. Our group strategies
8.3.2.1. The CosaNostra group strategy, or Organized crime meets
the iterated prisoner’s dilemma
The CosaNostra strategy is based on the concept of one strategy, denoted
Godfather, exploiting another strategy, denoted Hitman, to achieve a higher
total score in an IPD tournament scenario. In this context, exploitation
denotes the ability to deliberately extract cooperative moves from a strategy
while playing defect, a situation yielding a high payoff for the exploiting
strategy. It is obvious that most opponents would avoid such a situation,
ceasing to cooperate with an opponent who repeatedly played defection
in the past. Hence, a special opponent strategy, the Hitman, is designed to
provide exactly this kind of behaviour, and is introduced into the tournament
in as large a number as possible.

A Hitman strategy which indiscriminately plays cooperation, however,
is of no use to a Godfather. In mimicking the ALLC standard strategy,
such a Hitman would benefit all other strategies in the tournament able
to recognize and exploit ALLC. Consequently, the Hitman must be able
to conditionally exhibit two types of behaviour:
• By default, Hitman must play a strategy which does not benefit other
strategies and which is not easily exploitable. Extending this idea, Hitman
should play the strategy most damaging to other strategies in order to
lower their scores. Such a strategy is simply ALLD.
• When confronted with a certain stimulus, Hitman must switch to the
cooperative behaviour defined above.
Complementing the Hitman, Godfather should by default play the best
standard strategy available against any non-Hitman and switch to ALLD
when it encounters a Hitman, relying on the Hitman's unconditional
cooperation to raise its score. In our case, the Godfather plays OTFT when
not playing against a Hitman.

The critical part of CosaNostra is the identification of opponents: the
way in which a Godfather detects a Hitman, and a Hitman detects a
Godfather. We have employed sequences of defections and cooperations to
implement a bit-wise protocol which both sides use to mutually establish,
and check, identities (with multiple choices and multiple payoffs, this
protocol could be made very short, depending on the number of choices,
possibly down to one exchange). If Godfather is aware he is not facing a Hitman,
he must switch to a good non-group strategy like OTFT or GRIM, and if
Hitman is aware it is not facing a Godfather, it must switch to the ALLD
strategy that punishes all strategies outside its group. This occurs in
the following cases:

• "Unhonorable behaviour": a presumed Hitman defecting, or a presumed
Godfather cooperating, outside protocol exchanges.
• "Protocol breach": either side not following the rules during protocol
exchanges.

Putting the rules in other words, the CosaNostra strategy is based on a
Godfather which can be sure that the next n moves of its opponent will be
cooperation, because it identifies the opponent through a simple exchange
protocol. A problematic aspect of such a strategy is the notion of Godfather
or Hitman being "taken over": both are prone to wrongly identify an
opponent as their strategic counterpart and thereby grant it an advantage
(in the case of Hitman) or depend on predefined behaviour (in the case of
Godfather), thus lowering their score.
The effect of Godfather being taken over: Godfather thinks it is exploiting
a Hitman and plays DEFECT, but the opponent plays DEFECT too, so
Godfather only gets the low mutual-defection payoff for the exchange. This
situation is easy to counter: if Godfather detects any defections while it
believes it is exploiting a Hitman, it assumes a takeover and switches to its
good non-group strategy like OTFT or GRIM.

The effect of a Hitman being taken over is more subtle: Hitman thinks
it is being exploited by a Godfather and plays COOPERATE, a behaviour
which benefits the opponent. Countering this situation is complex: a first
solution would be for Hitman to start playing ALLD as soon as it detects a
cooperative move outside the defined protocol exchanges (Hitman then
assumes it is being exploited). But another strategy could still play mostly
DEFECT and sometimes cooperate, thus fooling a Hitman: for example, a
random opponent strategy with 1/10 of its moves being cooperative could
by chance emulate a protocol exchange, at least for some time, when an
interval of fixed length ten is used by Hitman (and Godfather).
CosaNostra solves the takeover problem by varying the intervals between
cooperation-protocol exchanges, with the time between exchanges (the
number of turns) in one interval being communicated within the protocol
exchange itself. Godfather and Hitman both maintain an internal counter
which tells them when to synchronize by executing a protocol exchange
and to check that
the other strategy truly is part of CosaNostra. Godfather communicates
to the Hitman a modification of the interval during each handshake.
Thus, no other strategy is likely to take over a Hitman or manipulate a
Godfather.
The communication protocol consists of a 1-bit signature plus a 2-bit
sequence encoding the length of the next interval, as depicted in Table 8.11
(the numbers at the beginning of the rows are countdown steps until the
start of the next interval).
A sample exchange then looks as illustrated in Table 8.12. In this example
an offset of 2 (CD, read with C = 1 and D = 0 as binary 10 = 2) is encoded.
Internally, the offset is used to select an interval length from a table roughly
adjusted to match the number of iterations in the tournament, in our
implementation as follows: DD → 26, DC → 15, CD → 19, and CC → 8.
Thus, after sending DD, Godfather will defect 23 times, then again start
the handshake with a single COOPERATE followed by the next two
bits encoding the length of the following interval, and so on. On his side,
the Hitman will expect 23 DEFECTs and then one COOPERATE from
Table 8.11. CosaNostra communication protocol.

Countdown index 3: Godfather plays C (a single signature COOPERATE);
Hitman plays D (a single signature DEFECT).
Countdown index 2: Godfather plays D/C (first message bit); Hitman
plays C (COOPERATEs to minimize protocol loss).
Countdown index 1: Godfather plays D/C (second message bit); Hitman
plays C (stores the first message bit, COOPERATEs to minimize protocol
loss).
Countdown restart (e.g., 26): Godfather plays D (starts the new interval
and plays DEFECT from now on until the next handshake, when the
countdown index is 3 again); Hitman plays C (stores the second message
bit, starts the new interval, and plays COOPERATE from now on until the
next handshake, when the countdown index is 3 again).
Table 8.12. Synchronization between Godfather and Hitman.
# ... 6 5 4 3 2 1 19 ...
Godfather D D D D C C D D D...
Hitman C C C C D C C C C...
his opponent in order to continue to believe that the opponent is a Godfather
and is thus allowed to exploit the Hitman.
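The published interval table can be reproduced in a few lines of code. The
following sketch is our reconstruction from the numbers given in the text
and in the source code of Tables 8.13 and 8.14 below, not verbatim
competition code; it decodes the two message bits into the length of the
next exploitation interval:

    // Decodes the two protocol bits into the next interval length,
    // per the mapping DD -> 26, DC -> 15, CD -> 19, CC -> 8.
    public final class IntervalCodec {
        public static int nextIntervalLength(char firstBit, char secondBit) {
            int length = 9;                     // base value before decrements
            if (firstBit == 'D')  length += 7;  // first message bit
            if (secondBit == 'D') length += 11; // second message bit
            return length - 1;                  // one round is consumed by the restart
        }

        public static void main(String[] args) {
            System.out.println(nextIntervalLength('D', 'D')); // 26
            System.out.println(nextIntervalLength('D', 'C')); // 15
            System.out.println(nextIntervalLength('C', 'D')); // 19
            System.out.println(nextIntervalLength('C', 'C')); // 8
        }
    }

The base value 9 and the increments 7 and 11 correspond to the constants
appearing in the code of Tables 8.13 and 8.14; the final decrement accounts
for the restart round.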
Note that it would be easy to make the communication protocol more
efficient by increasing the length of each new interval according to a
function known only to the Hitman and his Godfather, e.g., by adding a
number proportional to the number of intervals completed so far. Indeed,
the likelihood that a non-Godfather strategy can, by coincidence, continually
fool a Hitman into believing it is serving its Godfather, while instead
allowing the non-Godfather to take advantage of the Hitman, decreases
very quickly with each successful exchange. Conversely, the longer the
opponent of a Hitman follows the Godfather's behaviour, the more likely
it is that the opponent really is his Godfather, and so it becomes safer and
safer for the Hitman to let the opponent exploit him for longer and longer
intervals.
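A rough estimate shows why (our own calculation, under the simplifying
assumption that the impostor picks its moves independently and uniformly
at random): the probability of matching a fixed pattern of L prescribed
moves is

2^(-L), and over k successive intervals of lengths L1, ..., Lk it drops to
2^(-(L1 + ... + Lk)),

so already a single interval of length 19 is survived by pure chance with
probability only 2^(-19) ≈ 1.9 · 10^(-6).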
The bootstrap for the two strategies is that the Hitman starts with a
defection and the Godfather with a cooperation, mimicking countdown
step 3 as shown above. The initial cooperation move is important for the
Godfather's standard strategy: to achieve a good score against certain
standard opponents (GRIM being an extreme example), it is necessary to
start off with cooperation.

Godfather's protocol loss per interval is at a minimum 5 points (for the
single protocol cooperation) and at a maximum 9 points: a base loss of 5
for the single protocol bit is inevitable. Then, at worst, Godfather sends
CC and the Hitman cooperates to minimize loss, yielding 3 + 3 = 6 points
instead of the 5 + 5 = 10 points of the best case, in which Godfather sends
two defections as protocol bits.
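In terms of the payoff parameters (with the standard values T = 5, R = 3,
P = 1, S = 0 assumed), this decomposes as

minimum loss = T - S = 5 (the signature COOPERATE answered by the
Hitman's DEFECT),
maximum loss = (T - S) + 2(T - R) = 5 + 2 · 2 = 9 (signature plus two
cooperative message bits earning R instead of T each),

matching the 5 and 9 points stated above.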
The CosaNostra group strategies were not designed to fare well in a noisy
environment such as league 2 of the 2004 competition, though in practice
they did quite well (as discussed earlier in this chapter). Note that it would
not be very difficult to make them more noise resistant by introducing some
error-correcting mechanism, e.g., by allowing a certain number of mistakes
by the other player (or unexpected replies that are explainable as answers
to possibly wrongly communicated signals from oneself) before deciding
that it is not part of one's group.
Table 8.13. Main parts of CosaNostra Godfather’s source code.
>> private variables and constants like in Table 8.10 <<
private static final int SYNC_GF_COOPERATES = 3;
private static final int SYNC_HM_REPLIES_WITH_DEFECT = 2;
private static final int GF_SENDS_FIRST_MESSAGE_BIT = 2;
// private static final int GF_SENDS_SECOND_MESSAGE_BIT = 1;
private int nextCountdownRestartValue;

public void reset() {
  >> Content of OTFT's reset() method from Table 8.10 <<
  countdownIndex = SYNC_GF_COOPERATES; // First COOPERATE
  opponentPlayedSoFarLikeHitman = true;
}

public double getMove() {
  if( opponentPlayedSoFarLikeHitman ) {
    // Did the opponent just break the Hitman behaviour pattern?
    if( ( countdownIndex == SYNC_HM_REPLIES_WITH_DEFECT
          && opponentMove == COOPERATE )
        || ( countdownIndex != SYNC_HM_REPLIES_WITH_DEFECT
          && opponentMove == DEFECT ) ) {
      // Yes, so the opponent cannot be a Hitman, so Godfather ...
      myReply = DEFECT;                      // ... defects and switches ...
      opponentPlayedSoFarLikeHitman = false; // ... to OTFT
    } else { // No, the opponent again played like a Hitman.
      if( countdownIndex > SYNC_GF_COOPERATES ) {
        myReply = DEFECT; // Godfather thus exploits Hitman
      } else if( countdownIndex == SYNC_GF_COOPERATES ) {
        myReply = COOPERATE;           // COOPERATE once to synchronize
        nextCountdownRestartValue = 9; // GF starts to prepare
      } else if( countdownIndex == GF_SENDS_FIRST_MESSAGE_BIT ) {
        myReply = (Math.random() > 0.5) ? DEFECT : COOPERATE;
        nextCountdownRestartValue += (myReply == DEFECT) ? 7 : 0;
      } else { // if( countdownIndex == GF_SENDS_SECOND_MESSAGE_BIT )
        myReply = (Math.random() > 0.5) ? DEFECT : COOPERATE;
        nextCountdownRestartValue += (myReply == DEFECT) ? 11 : 0;
        countdownIndex = nextCountdownRestartValue; // restart
      }
      countdownIndex--;
    }
  } else { // Opponent surely is no Hitman and thus Godfather plays OTFT
    >> Content of OTFT's getMove() method from Table 8.10 <<
  }
}
Table 8.14. Main parts of CosaNostra Hitman’s source code.
private static final int SYNC_HM_DEFECTS = 3;
private static final int SYNC_GF_REPLIES_WITH_COOPERATE = 2;
private static final int FIRST_MESSAGE_BIT_FROM_GF = 1;
private static final int SECOND_MESSAGE_BIT_FROM_GF = 0;
private int nextCountdownRestartValue;

public void reset() {
  super.reset();
  opponentPlayedSoFarLikeGodfather = true; // Assume the best
  opponentMove = DEFECT;            // As a Godfather would have been doing
  countdownIndex = SYNC_HM_DEFECTS; // First DEFECT to synchronize
}

public double getMove() {
  if( opponentPlayedSoFarLikeGodfather ) {
    // Did the opponent just break the Godfather behaviour pattern?
    if( ( countdownIndex == SYNC_GF_REPLIES_WITH_COOPERATE
          && opponentMove == DEFECT )
        || ( countdownIndex > SYNC_GF_REPLIES_WITH_COOPERATE
          && opponentMove == COOPERATE ) ) {
      // Yes, so the opponent cannot be a Godfather, so Hitman ...
      myReply = DEFECT;                         // ... defects and switches ...
      opponentPlayedSoFarLikeGodfather = false; // ... to ALLD
    } else { // No, the opponent again played like a Godfather.
      if( countdownIndex != SYNC_HM_DEFECTS ) {
        myReply = COOPERATE; // Godfather thus can exploit Hitman
        if( countdownIndex == FIRST_MESSAGE_BIT_FROM_GF ) {
          nextCountdownRestartValue += (opponentMove == DEFECT) ? 7 : 0;
        } else if( countdownIndex == SECOND_MESSAGE_BIT_FROM_GF ) {
          nextCountdownRestartValue += (opponentMove == DEFECT) ? 11 : 0;
          countdownIndex = nextCountdownRestartValue - 1; // restart
        }
      } else { // if( countdownIndex == SYNC_HM_DEFECTS )
        myReply = DEFECT;              // Hitman DEFECTs once to synchronize
        nextCountdownRestartValue = 9; // HM starts to prepare
      }
      countdownIndex--;
    }
  } else { // Opponent surely is no Godfather and thus Hitman ...
    myReply = DEFECT; // ... plays ALLD
  }
  return( super.getFinalMove(myReply) );
}
8.3.2.2. The gory details of the CosaNostra group strategy
As in OTFT's case, there is no more detailed and exact description of the
CosaNostra group strategy's inner workings than the source code of its
implementation. Again, the code is short and easy to understand. We
therefore reproduce it in Table 8.13 for the Godfather and Table 8.14 for
the Hitman strategy, again leaving aside only the general parts required for
the IPDLX framework that was used in the competitions. As Godfather
uses the OTFT strategy against strategies other than Hitman, the parts of
Godfather's code that are identical to OTFT's in Table 8.10 are not
repeated but referred to.
8.3.2.3. TheEmperorAndHisCloneWarriors
This group strategy is based on the same principles as the CosaNostra group
strategy, with one emperor playing the role of the Godfather and his clone
warriors playing the Hitman strategy in large numbers (the number being
the major difference). Each clone strategy carries an individual number in
its name, since the submission procedure of the competition required every
individual strategy to have a different name. After enquiring via email, we
had trusted the organizers that open group strategies would be allowed in
the 2004 competition, and accordingly submitted the EmperorAndHisClones
strategy with altogether 11,110 individually numbered clones as one group
strategy, as it was not clear how large groups would be permitted to be. For
reasons that, especially in hindsight, are not entirely clear to us, the
organizers decided to let altogether only one clone (together with the
emperor) participate in the competitions. We are still perplexed with
respect to this point. In particular, we had initially been prepared to submit
a much larger collusion group within the CosaNostra group strategy but,
after hearing that groups would be allowed, decided to submit only one
such collusion strategy as a proof of concept, counting on the fact that our
clone army would overwhelm all competitors.
8.3.2.4. The StealthCollusion group strategy
As a proof of concept (see the previous section), we submitted under the
name of Constantin Ionescu a group strategy that cooperates with our
CosaNostra group strategy, though not perfectly so. The mail with which
we submitted the strategy was deliberately written with some typos, a few
grammatical glitches, and sloppy formatting, all in order to add to the look
of authenticity of the submission by distracting from its real intention. It
was sent from a free mail account hosted in Romania, the sender claiming
to be a student of informatics at the technical school of Timisoara. As
expected, the deception went undetected.
8.4. Analysis of the Performance of the Strategies
8.4.1. OmegaTitForTat
Table 8.15 shows how OTFT clearly dominates a standard tournament with
strategies commonly used as test cases. Table 8.16 illustrates how OTFT
dominates in harsh environments where a lot of unconditional defection
occurs. Table 8.17 demonstrates OTFT's dominance in random environments.
The slight lead of GRIM in league 4 of the 2005 competition was due
to the higher number of games GRIM was allowed to play, as we already
explained earlier in this chapter.
8.4.2. Group strategies
In this section we study general characteristics of important possible group
strategies. We first classify and name the group strategy classes as follows:

• Democracy during peace (DP): all group members are equals and treat
each other nicely by always cooperating; they play TFT or a better strategy
such as OTFT or GRIM outside of their community.
• Democracy at war (DW): all group members are equals and treat each
other nicely; however, they continually defect (ALLD) against all other
strategies (after a short recognition interval).
Table 8.15. OTFT in a standard envi-
ronment, standard strategy sample, 200
turns.
Rank Strategy Score
1 OTFT 5,978
2 GRIM 5,538
3 TFT 5,180
4 TFTT 5,134
5 ALLC 4,515
6 RAND 4,062
7 STFT 4,018
8 ALLD 4,016
9 NEG 3,726
Table 8.16. OTFT in a harsh environ-
ment, 50% ALLD opponents, 200 turns.
Rank Strategy Score
1 OTFT 7,358
2 GRIM 6,959
3 TFT 6,577
4 TFTT 6,524
5 ALLD 5,512
6 ALLD 5,464
7 ALLD 5,452
8 ALLD 5,428
9 ALLD 5,428
10 ALLD 5,416
11 STFT 5,415
12 ALLD 5,404
13 ALLD 5,400
14 RAND 4,658
15 ALLC 4,530
16 NEG 3,728
Table 8.17. OTFT in a random en-
vironment with 50% RAND opponents,
200 turns.
Rank Strategy Score
1 OTFT 10,114
2 GRIM 9,867
3 TFT 8,338
4 ALLD 8,236
5 TFTT 7,806
6 RAND 7,357
7 RAND 7,212
8 RAND 7,195
9 STFT 7,192
10 RAND 7,150
11 RAND 7,150
12 RAND 7,099
13 RAND 7,099
14 RAND 7,082
15 NEG 6,947
16 ALLC 6,624
• Empire during peace (EP): there is one special group member, the
emperor, who is allowed to take advantage of all other members of his
empire by playing defect while they cooperate with him. The subjects
otherwise cooperate among each other, and play TFT or a better strategy
such as OTFT or GRIM outside their community, after a short recognition
interval.
• Empire at war (EW): Again, the emperor is allowed to take advantage of
all other members of his empire by playing defect while they cooperate
with him. Again, the subjects otherwise cooperate among each other,
but now they play, after a short recognition interval, ALLD against all
other strategies.
In the following, we will show that groups can perform arbitrarily better
than individual strategies; that, for equal group sizes, EW groups can
achieve arbitrarily higher payoffs (for the emperor) than EP groups; that
EP groups can achieve arbitrarily higher payoffs (for the emperor) than
members of a DP group; and that these in turn can achieve arbitrarily
higher payoffs than members of a DW group. When group sizes vary, we
show that even the weak DW group members can achieve arbitrarily higher
payoffs than the emperor of a competing EW group by sheer numerical
superiority.

First some preliminaries: we know that the payoff values satisfy the
relations S < P < R < T and 2R > T + S. Let us assume in the following
that the group in the democracy variants and the group of subjects in the
empire variants are of size m (m for members), and that there are altogether
n players in total (so m < n) who play i iterations during the IPD
competition.
We further assume that:
• The best single-player (non-group) strategy IOPT (for individual optimal
strategy) achieves payoff X · i after i iterations.
• The emperor strategy achieves payoff E · i after i iterations.
• The individual members (or subjects) achieve payoff M·i after i iterations.
• The loss due to recognition of members of the same group is negligible
due to the size of i.
• We further assume that the emperor always plays the best non-group
strategy against non-members of his group.
• During peace, individual members always play the best non-group strat-
egy against non-members of their group.
• We assume that the best single-player strategy achieves an average payoff
of A against other non-group strategies. The relations P < A < T
are plausible, and a value of A near R is likely under the assumption
that most individual strategies are similar to TFT. We therefore assume
that A = R in the following unless stated otherwise. This implies that
members of groups of type DP achieve more or less the same payoff
as the best individual strategy IOPT, so we assume that MDP = XDP.
This assumption simplifies the calculations in the following claim without
sacrificing the fundamental relations between the different strategies.
• We also assume that most single-player strategies achieve an average
score near A (and thus near R according to the previous assumption)
when playing against other single-player strategies (so more or less all of
them are optimal) and against DP, EP, or emperors of EW strategies (so
they all play fairly against each other), and an average score of P when
playing against members of groups at war. This would roughly correspond
to the payoff achievable by OTFT and similar strategies. Again,
this assumption simplifies the calculations in the following claim without
sacrificing the fundamental relations between the different strategies.
Claim 8.1: Under the above assumptions and unless stated otherwise,
the following relations hold:
(1) Members of groups of type DW can achieve larger payoffs than members
of groups of type DP only when the DW members constitute more than
50% of the total population. When group sizes are equal and there are
other strategies, DP has an advantage over DW. By increasing i, this
advantage can be made arbitrarily large: mDP ≥ mDW → MDP · i >>
MDW · i.
(2) Emperors from EP groups can achieve larger payoffs than members of
groups of type DP (assuming equal group size). By increasing i, this
advantage can be made arbitrarily large: EEP · i >> MDP · i. Because
of our assumption that MDP = XDP the relation also holds for the best
individual strategy IOPT, so emperors from EP groups can achieve
arbitrarily larger payoffs than the best individual strategy.
(3) Emperors from EW groups can achieve larger payoffs than an emperor
from an EP group (assuming equal group size). By increasing i, this
advantage can be made arbitrarily large: EEW · i >> EEP · i.
(4) When two groups of unequal size compete, then:
(a) Independently of the group sizes and the values of S, P, R, and
T, emperors (at war or during peace) fare better than democrats
at peace. By increasing i, this advantage can be made arbitrarily
large: EE · i >> MDP · i.
(b) Depending on the values of P, R, and T, and when i increases, a
democracy at war can fare arbitrarily better than an emperor (at
war or during peace) when it is sufficiently large: mDW >> mE →
MDW · i >> EE · i.
(5) We now assume that IOPT scores a higher average payoff value A
against non-group strategies than the group strategies achieve against
non-group strategies; let B with B < A < T be the (bad) score that an
emperor achieves on average against non-group strategies (we here
deliberately drop the initial assumption that emperors play IOPT against
non-group strategies). In order for the emperor to nevertheless win
despite playing worse in general than IOPT, the following inequalities
must be satisfied: in case of EP,

mEP > (A − B)/(T − B) · n ,

and in case of EW,

mEW > (A − B)/(T − B − P + A) · n .

Again, larger group size helps even when the group strategies perform
badly. We also see that as B approaches A, emperors can win against
IOPT even with very few other group members.
(6) When two DW, EP, or EW groups of the same type but of different
sizes and with different "efficiencies" compete (we here again deliberately
drop the initial assumption that emperors play IOPT against non-group
strategies), larger group size can compensate for lower efficiency,
and vice versa. Note that this is not true for DP groups.
Proof.
(1) MDP = R (n − mDW) + P mDW and MDW = R mDW + P (n − mDW),
assuming that no other group at war is present in the population. Thus,
MDW > MDP if and only if mDW > n/2.
(2) MDP = R n and EEP = R(n−m) + T m. Since T > R, EEP > MDP.
(3) EEP = R (n− 2m) + T m + P m and EEW = R (n−m) + T m. Since
R > P, EEW > EEP.
(4) For groups of unequal size:
(a) It suffices to show that EEP > MDP is independent of the size of
the groups. EEP = R (n−mEP) + T mEP and MDP = R n. Since
T > R, EEP > MDP holds independently of the size of the groups.
(b) It suffices to show that there exists a large enough mDW such that
MDW > EEW. MDW = R (n − mEW) + P mEW and EEW = R
(n − mEW − mDW) + T mEW + P mDW. Then MDW > EEW if
and only if mDW > (T − P)/(R − P)mEW. In the 2004 and 2005
competitions, P = 1, R = 3, and T = 5, so mDW would have to
be larger than 2mEW. If only the two group strategies competed, this
would mean that the DW group would need to comprise 2/3 of the whole
population.
(5) In case of EP: EEP = B (n −mEP) + T mEP and XEP = A n. Then
EEP > XEP if and only if mEP > (A − B)/(T − B)n (assuming that
T > A > B). In case of EW: EEW = B (n − mEW) + T mEW and
XEW = A (n − mEW) + P mEW. Then EEW > XEW if and only if
mEW > (A− B)/(T− B− P + A)n.
(6) We show this here for two unequal EW strategies, and note that similar
arguments work for the EP and DW cases. Let B1 and B2 be the scores
that the two emperors achieve on average against non-group strategies,
with B1 < B2 and |B1 − B2| = α (T − P) with 0 < α < 1. Then E1 > E2
if and only if

m1 > (1 − α)/(1 + α) · m2 + α/(1 + α) · n .

Example: suppose B1 = 2.5 and B2 = 2.6, and as before P = 1 and T = 5,
so that α = 0.025, and

m1 > 0.9513 · m2 + 0.0244 · n .

Thus, when m2 = 20 and n = 100, then m1 must be at least 22 for the
first emperor to triumph over his more efficient opponent.
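As a sanity check of these formulas, the following small program (our own
illustration, using the competition payoffs P = 1, R = 3, T = 5 and an
arbitrary example population of our choosing) evaluates the payoff
expressions used in the proof:

    public class GroupPayoffCheck {
        static final double P = 1, R = 3, T = 5; // competition payoff values

        public static void main(String[] args) {
            double n = 100, m = 20; // example population and (equal) group size
            // Per-iteration payoffs from the proof of Claim 8.1:
            double mDP = R * (n - m) + P * m;             // DP member, DW group present
            double mDW = R * m + P * (n - m);             // DW member
            double eEP = R * (n - 2 * m) + T * m + P * m; // EP emperor, EW group present
            double eEW = R * (n - m) + T * m;             // EW emperor
            System.out.println("M_DP = " + mDP); // prints 260.0
            System.out.println("M_DW = " + mDW); // prints 140.0 (m < n/2, point (1))
            System.out.println("E_EP = " + eEP); // prints 300.0
            System.out.println("E_EW = " + eEW); // prints 340.0 (> E_EP, point (3))
            // Point (4b): a DW group overtakes an EW emperor once
            // mDW > (T - P)/(R - P) * mEW:
            System.out.println("DW needs > " + (T - P) / (R - P) * m); // 40.0
            // Point (5): group size needed by an emperor whose individual
            // play (B) is weaker than the best individual strategy's (A);
            // A and B are illustrative values of our own choosing.
            double A = 3, B = 2.5;
            System.out.println("EP: m > " + (A - B) / (T - B) * n);         // 20.0
            System.out.println("EW: m > " + (A - B) / (T - B - P + A) * n); // ~11.1
        }
    }

For this parameter choice the numbers confirm the ordering established in
the claim, M_DW < M_DP < E_EP < E_EW, and a DW group needs more
than 40 members to beat an EW emperor with 20 subjects.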
8.4.3. Collusion detection is an undecidable problem
The practical difficulty of detecting collusion has been demonstrated in
previous parts of this chapter. The difficulty of recognizing collusion is also
supported by the difficulty of solving the problem from a theoretical point
of view: we show below that the general question of whether two strategies,
whose source code is known and which do not depend on any third-party
source of randomness, are actually colluding or not is undecidable. Of
course, the problem is even harder when the strategies are only known as
black boxes, without access to their source code. Simpler arguments than
ours would also do, but in our approach we try to define the formal collusion
problem as closely to the practical collusion detection problem as possible.
Recall the definition of the Halting problem: is there a finite deterministic
Turing machine H that is able to decide in finitely many steps whether an
arbitrary finite deterministic Turing machine M will ultimately halt or not?
The Halting problem was famously shown to be undecidable by Turing.
Exact definitions of Turing machines and of the other notions appearing in
this section, as well as references to the original sources, can easily be found,
e.g., in any theoretical computer science reference book such as
Papadimitriou (1994).
Let the Simplified Collusion problem formally be defined as follows: is
there a deterministic Turing machine SC that is able to decide in finitely
many steps whether, given two arbitrary integers i and j, two arbitrary
finite deterministic Turing machines S1 and S2 will both output a sequence
of at least i + j characters (one character per tape position), composed
only of the letters "C" and "D", on their two separate write-once output
tapes T1 and T2, such that the j letters starting from tape position i + 1
are all "D"s on T1 and all "C"s on T2?
This simplistic definition covers many (but surely not all) real collusion
cases. It also implies that pairs of strategies usually not considered to be
consciously colluding, like ALLD as S1 and ALLC as S2, would be classified
as colluding in the Simplified Collusion terminology. However, ALLD really
could be colluding with a large group of ALLC, where other, more cautious
strategies like OTFT would not be able to take advantage of ALLC since
they would never defect first. Thus, when a player or a group of players is
able to introduce an ALLD and many ALLC into a competition, they could
well be part of an intentional collusion, and thus the classification in the
Simplified Collusion terminology would not be completely wrong.
Ultimately, deciding what really is collusion and what is not cannot be
solved by formal methods alone. Nevertheless, we can at least show the
following:
Claim 8.2: The Simplified Collusion problem is undecidable.
Proof. To formally show the undecidability of the Simplified Collusion
problem, we follow the standard argument by reducing the Halting problem
to it. Take any finite deterministic one-tape Turing machine M for which we
want to know whether it halts or not. Without loss of generality, we assume
that the tape of M is infinite in both directions, that each combination of
the finitely many characters of the alphabet, which includes the letters “C”
and “D”, and of the finitely many states of M defines exactly one of the
finitely many rules of M, and that only the special state h stops M.
To decide whether M halts or not, we construct for each M two new
Turing machines N1 and N2. N1 is defined as follows in comparison to M:
it has an additional, initially empty output tape T; an additional tape IJ
that initially contains the numbers i and j in binary with the character ":"
written between the two numbers; an additional state s; a constant number
of further states needed to count down the two binary numbers and do the
other things described below; and almost the same set of rules as M, with
only the following changes: each rule of M leading to h instead leads to
state s, and there is a constant number of additional rules that ensure the
following: when N1 enters state s, it counts down from i to zero, each time
writing one letter "C" on the output tape T and then moving one position
to the right on T, so that at the end a sequence of i "C"s is written on T.
Then it counts down from j to zero, each time writing one letter "D" on T
and then moving one position to the right on T, so that at the end a
sequence of i "C"s followed by j "D"s is written on T. Then it changes to
state h and halts. N2 is defined as follows: it simply writes i + j letters "C"
to its output tape T. Finally, we choose the two numbers i and j, e.g.,
i = 1 and j = 1.
It is clear that this construction always leads to a valid instance of the
Simplified Collusion problem. It is also clear that the question posed in the
Simplified Collusion problem has a positive answer for the constructed
instance if and only if M halts.

Now, if a finite deterministic Turing machine SC that is able to decide
the Simplified Collusion problem in finitely many steps existed, then we
could also decide the Halting problem in finitely many steps, as follows: we
would define a new finite deterministic Turing machine R that, for any
given Turing machine M (properly encoded on R's input tape), first
constructs (in finitely many steps) an encoding of the corresponding finite
deterministic Turing machines N1 and N2 with i = 1 and j = 1 as described
above (this surely can be done in finitely many steps), then simulates SC
applied to this instance of the Simplified Collusion problem, thereby
deciding in finitely many steps (SC takes only finitely many steps, and
simulating it on R is also easily feasible in finitely many steps) whether it
is a yes- or a no-instance, and returns this answer of SC as the answer of R,
which must also be the answer to the question of whether M halts or not.
So, if the Simplified Collusion problem were decidable, then the Halting
problem would also be decidable. Since we know for sure that the latter is
not true, the former cannot be true either, and thus the Simplified Collusion
problem is undecidable.
8.5. Conclusion
We have described our submissions to the iterated prisoner's dilemma (IPD)
competitions of 2004 and 2005: the OmegaTitForTat (OTFT) single-player
strategy and the CosaNostra group strategy, composed of one Godfather
(CNGF) and several Hitmen (CNHM). We also studied their performance
in the different leagues of the competitions.
The observed slight superiority of OTFT in comparison to GRIM is
psychologically a reassuring result. The charm of OTFT compared to GRIM
is that OTFT is an intelligent, forgiving strategy, whereas GRIM, as the
name implies, is an unforgiving, iron-handed pig-head that falls into an
eternal revenge mode after being deceived a single time.
We also have established a taxonomy of generalized group strategies
for IPD competitions. In it, the types of group strategies are classified
according to their behaviour towards other members of the same group
and towards strategies outside of their group. We labelled the four classes
of group strategies studied as democracies during peace (DP), democracies
at war (DW), empires during peace (EP), and empires at war (EW). As
we have shown in the previous section, group strategies can easily
outperform any individual strategy by sheer numerical superiority. Group
strategies appear everywhere in Nature and human society, and group
strategies competing in IPD competitions can serve as simplified objects of
study for them. It is interesting to note that, in the analysis of the last
section, individual strategies that are members of a DW group fare less
well than those of a DP group, and that this relation is reversed for empires,
EW faring better than EP, not because the emperor itself fares better, but
because his competitors are harmed more. This is clear from the fact that
members of DW lose individually more than members of DP, whereas
emperors at war (EW) fare better than emperors during peace (EP), and
these in turn better than DW and DP members. For instance, emperors at
war do not have to suffer from their own aggressive acts, and actually do
better in comparison with their opponents by letting their underling
members lower the payoffs of individuals outside their group, while at the
same time retaliation from others does not hit the emperors directly (think
of real emperors, Mafia bosses, etc.).
But it does not even have to be fights for life and death, wars, or outright
genocide: the same pattern appears in business, where larger or more
advanced companies (in particular their owners) that are more or less
aggressive
can crush competitors or, in the extreme, take advantage of cheap child-slave
labour, thus severely abusing their own workforce.
It is also interesting to note that better resources, be it people, money,
or technology, corresponding to a higher number of individual strategies
in the group or to better average payoff values against non-group strategies,
positively influence the overall payoff values of the groups. Thus, numerical
superiority does not have to mean that the number of soldiers is higher,
but can also be due to better technology, be it military, commercial, or
biological. It is also not surprising that, as described in point 4(a) of
Claim 8.1 in Section 8.4.2, individual strategies in democracies during peace
always "lose" against emperors, the latter always being able to get more
from their subjects than what they give in return, and certainly more than
their unorganized competitors. However, given enough superiority, again
either in numbers, money, or technology, even democracies at war can win
against empires at war (point 4(b) of the same claim); the Second World
War, for instance, provides several examples of such situations.
We also showed that group strategies can be subtly camouflaged to
look like unrelated single-player strategies. Such stealth collusion group
strategies will elude detection with high probability, e.g., by introducing
a certain amount of noise into the interaction with one's group members to
make the collusion less evident. We showed that the differentiation between
colluding and non-colluding behaviour can be very difficult in practice and
is, from a theoretical point of view, undecidable in general.
In the study of economics, collusion takes place within an industry when
rival companies cooperate for their mutual benefit. According to game the-
ory, the independence of suppliers forces prices to their minimum, increas-
ing efficiency and decreasing the price determining ability of each individual
firm. If one firm decreases its price, other firms will follow suit in order to
maintain sales, and if one firm increases its price, its rivals are unlikely
to follow, as their sales would only decrease. These rules are used as the
basis of kinked-demand theory. If firms collude to increase prices as a co-
operative, however, loss of sales is minimized as consumers lack alternative
choices at lower prices. This benefits the colluding firms at the cost of
efficiency to society [Wikipedia: Collusion 2005].
There was some discussion of whether collusion group strategies were
actually cheating in the 2004 and 2005 IPD competitions, but since the
organizers clearly said that cooperating strategies were to be allowed, it
would have been strange to deny participation to such group strategies.
What we can say at least is that the detection of stealth collusion, both in
future IPD
competitions and in real life, is very difficult in practice. The Mafia,
or for that matter any human organization that is not readily recognizable
as a group, be it Masonic lodges, secret religious groups, or corporate
cartels, exists, and as such is certainly worth modelling. Being able to
communicate secretly, thereby "colluding" in a general sense, is quite
common, and forbidding it is in practice nearly infeasible whenever
intelligent individuals exchange information repeatedly. An exception,
where a biological occurrence of an IPD without information exchange has
been reported, was described by Turner and Chao (1999). They show
that certain viruses that infect and reproduce in the same host cells seem to
be engaged in a survival-of-the-fittest-driven prisoner's dilemma. However,
in light of the ways different types of bird-flu viruses infecting the same
human cell can exchange RNA in order to increase their fitness, it can be
argued that such emerging colluding group behaviour appears already at
this relatively low level of life.
In commerce, collusion is largely illegal due to antitrust law, but im-
plicit collusion in the form of price leadership and tacit understandings is
unavoidable. Several recent examples of explicit collusion in the United
States include [Wikipedia: Collusion 2005]:
• Price fixing and market division among manufacturers of heavy electrical
equipment in the 1960s.
• An attempt by Major League Baseball owners to restrict players’ salaries
in the mid-1980s.
• Price fixing within food manufacturers providing cafeteria food to schools
and the military in 1993.
• Market division and output determination of livestock feed additive by
companies in the US, Japan and South Korea in 1996.
There are many ways in which implicit collusion tends to develop
[Wikipedia: Collusion 2005]:
• The practice of stock analyst conference calls and industry meetings
almost necessarily causes tremendous amounts of strategic and price
transparency. This allows each firm to see how and why every other firm
is pricing its products. Again, the line between insider information and
just being better informed is often very thin.
• If the practice of the industry causes more complicated pricing which is
hard for the consumer to understand (such as risk-based pricing, hidden
taxes and fees in the wireless industry, or negotiable pricing), this can cause
competition based on price to become meaningless (because it would be too
complicated to explain to the customer in a short ad). This causes
industries to have essentially the same prices and to compete on advertising
and image, something theoretically as damaging to consumers as normal
price fixing.
We predict that all future iterated prisoner's dilemma competitions will be
dominated by group strategies. Even if, in a future IPD competition, all
strategies were chosen by the same single person who consciously tries to
avoid any "group cooperation" among his strategies, random and
involuntary cooperation that is mathematically identical to voluntary
cooperation can never be excluded. Indeed, group cooperation can be
self-emerging in a population, some strategies involuntarily faring better
together, possibly against other groups or individuals, however loosely
these are constituted. We predict that when evolutionary algorithms are
used to breed new species of IPD strategies, such cooperation will
automatically emerge at some point.
Cooperation in groups of strategies in IPD competitions mimics the cooperation
of groups in nature and human society. It therefore allows modelling another
common aspect of cooperative behaviour that so far has not been explicitly
studied in the IPD framework: the more or less open cooperation of subgroups
against other subgroups or individuals. The number of members of a group does
not have to correspond to an actual number of individuals; it could instead
represent the amount of money involved, or the technological advantage of one
subgroup relative to another.
Acknowledgments
The authors would like to thank the anonymous reviewers for many useful
comments and corrections.
References
Axelrod, R. (1984) The evolution of cooperation. Basic Books.
Beaufils, B., Delahaye, J.-P., and Mathieu, P. (1996). Our meeting with gradual:
A good strategy for the iterated prisoner’s dilemma, Proceedings Artificial
Life V, Nara, Japan, 1996.
Kuhn, S. (2003) Prisoner’s Dilemma. The Stanford Encyclopedia of Philoso-
phy (Fall 2003 Edition), Edward N. Zalta (ed.), http://plato.stanford.edu/
archives/fall2003/entries/prisoner-dilemma/.
Mehlmann, A. (2000) The Game’s Afoot! Game Theory in Myth and Paradox.
AMS Press.
Nowak, M. and K. Sigmund (1993) A strategy of win-stay, lose-shift that outper-
forms tit-for-tat in the Prisoner’s Dilemma game, Nature, 364, pp. 56-58.
O’Riordan, C. (2000) A Forgiving Strategy for the Iterated Prisoner’s Dilemma.
Journal of Artificial Societies and Social Simulation, 3, 4.
Papadimitriou, C. H. (1994) Computational Complexity. Addison-Wesley.
Turner, P. and L. Chao (1999). Prisoner’s dilemma in an RNA virus, Nature,
398, pp. 441-443.
Tzafestas, E.S. (2000) Toward adaptive cooperative behavior, From Animals to
Animats, Proceedings of the 6th International Conference on the Simulation
of Adaptive Behavior (SAB-2000), 2, pp. 334-340.
Wikipedia: Collusion (2005). http://en.wikipedia.org/w/index.php?title=
Collusion&oldid=33029071.
Chapter 9
Error-Correcting Codes for Team Coordination within a
Noisy Iterated Prisoner’s Dilemma Tournament
Alex Rogers, Rajdeep K. Dash, Sarvapali D. Ramchurn, Perukrishnen
Vytelingum, Nicholas R. Jennings
University of Southampton
9.1. Introduction
The mechanism by which cooperation arises within populations of selfish
individuals has generated significant research within the biological, social
and computer sciences. Much of this interest derives from the original re-
search of Axelrod and Hamilton[Axelrod and Hamilton (1981)], and, in
particular, the two computer tournaments that Axelrod organised in or-
der to investigate successful strategies for playing the Iterated Prisoner’s
Dilemma (IPD)[Axelrod (1984)]. These tournaments were significant because
they demonstrated that a simple strategy based on reciprocity, namely tit-
for-tat, was extremely effective in promoting and maintaining cooperation
when playing against a wide range of seemingly more complex opponents.
To mark the twentieth anniversary of the publication of this work,
these two computer tournaments were recently recreated (see http://www.
prisoners-dilemma.com/) with separate events being hosted at the 2004
IEEE Congress on Evolutionary Computing (CEC’04) and the 2005 IEEE
Symposium on Computational Intelligence and Games (CIG’05). To stim-
ulate novel research, the rules of Axelrod’s original tournaments were ex-
tended in two key ways. Firstly, noise was introduced, whereby the moves
of each player would be mis-executed with some small probability. Sec-
ondly, and most significantly, researchers were invited to enter more than
one player into the round-robin style tournament. This second extension
to the original rules prompted several researchers to enter teams of players
into the tournament. This choice was motivated by the intuition that the
members of such a team could, in principle, recognise and collaborate with
one another in order to gain an advantage over other competing players.
This proved to be the case, and teams of players performed well in both
competitions. Indeed, a member of such a team, entered by the authors,
won the noisy IPD tournaments held at both events.
Now, for this approach to be effective in practice, two key questions have
to be addressed. Firstly, the players, who have no access to external means
of communication, have to be able to recognise one another when they meet
within the IPD tournament. Secondly, having achieved this recognition, the
players have to adopt a strategy that increases the probability that one of
their own kind wins the tournament. In this chapter, we present our work
investigating these two questions. Specifically:
(1) We show how our players are able to use a pre-agreed sequence of
moves, made at the start of each interaction, to transmit a
covert signal to one another, and thus detect whether they are facing
a competing player or a member of their own team.
(2) We show that by recognising and then cooperating with one another,
the members of the team can act together to mutually improve their
performance within the tournament. In addition, by recognising and
acting preferentially toward a single member of the team, the team
can further increase the probability that this member wins the overall
tournament. In both cases, this can be achieved with a team that is
small in comparison to the population (typically less than 15%).
(3) Given this approach, we show with an experimental IPD tournament
that the performance of our team is highly dependent on the length
of the pre-agreed sequence of moves. The length of this sequence de-
termines both the cost and the effectiveness of the signalling between
team members, and these factors contribute to an optimum sequence
length that is independent of both the size of the team and the number
of competing players within the tournament.
(4) Using the results of these experimental IPD tournaments, we show that
signalling with a pre-agreed sequence of moves, within the noisy IPD
tournament, is exactly analogous to the problem, studied in informa-
tion theory, of communicating reliably over a noisy channel. Thus we
demonstrate that we can implement error correcting codes in order to
further optimise the performance of the team.
(5) Finally, we discuss how the results of these investigations guided the
design of the teams that we entered into the two recent IPD competi-
tions, and thus we follow this analysis with a discussion of the results
of these competitions.
The remainder of this chapter is organised as follows: section 9.2 describes
the Iterated Prisoner’s Dilemma setting and related work. Section 9.3 de-
scribes the team players that we implemented in our investigations and
section 9.4 describes the results of the experimental IPD tournaments that
we implemented. In section 9.5 we analyse these results and in section 9.6
we discuss our use of coding theory to optimise the performance of the
team. Finally, we discuss the application of these techniques within the
two computer tournaments in section 9.7 and we conclude in section 9.8.
9.2. The Iterated Prisoner’s Dilemma and Related Work
In our investigations, we consider the standard Iterated Prisoner’s Dilemma
(IPD) as used by Axelrod in his original computer tournaments. Thus, in
each individual IPD game, two players engage in repeated rounds of the
normal form Prisoner’s Dilemma game, where, at each round, they must
choose one of two actions: either to cooperate (C) or to defect (D). These
actions are chosen simultaneously and depending on the combination of
moves revealed, each player receives the payoff indicated in the game matrix
shown in table 9.1. For example, should player 1 cooperate (C) whilst player
2 defects (D), then player 1 receives zero points whilst player 2 receives five
points. The scores of each player in the overall IPD game are then simply
the sum of the payoffs achieved in each of these rounds. In our experiments
we assume that each IPD game consists of 200 such rounds; however, this
number is, of course, unknown to the participating players.

Table 9.1. Pay-off matrix of the normal form Prisoner’s Dilemma game.

                            Player 2
                            C          D
    Player 1     C          3,3        0,5
                 D          5,0        1,1
As in the original tournaments, a large number of such players (each
using a different strategy to choose its actions in each individual IPD game)
are entered into a round-robin tournament. In such a tournament, each
player faces every other player (including a copy of itself) in separate IPD
games, and the winner of the tournament is the player whose total score,
summed over each of these individual interactions, is the greatest.
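To make the preceding description concrete, the following minimal sketch (in Python; `play_game`, `round_robin` and `tit_for_tat` are our own illustrative names, not code from the tournament) implements the pay-off matrix of table 9.1, a single IPD game, and the round-robin scoring just described. The `noise` parameter anticipates the noisy variant discussed below, in which each intended move is actually mis-executed with some small probability.

```python
import random

# Pay-off matrix of table 9.1: (own move, opponent's move) -> (own, opponent) pay-offs
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def tit_for_tat(own_history, opponent_history):
    """Start by cooperating, then repeat the opponent's last move."""
    return opponent_history[-1] if opponent_history else 'C'

def play_game(strategy1, strategy2, rounds=200, noise=0.0, rng=random):
    """One IPD game of `rounds` simultaneous rounds; each intended move is
    independently mis-executed with probability `noise`."""
    history1, history2 = [], []
    score1 = score2 = 0
    for _ in range(rounds):
        move1 = strategy1(history1, history2)
        move2 = strategy2(history2, history1)
        if rng.random() < noise:                 # accidental mis-execution
            move1 = 'D' if move1 == 'C' else 'C'
        if rng.random() < noise:
            move2 = 'D' if move2 == 'C' else 'C'
        pay1, pay2 = PAYOFF[(move1, move2)]
        score1, score2 = score1 + pay1, score2 + pay2
        history1.append(move1)
        history2.append(move2)
    return score1, score2

def round_robin(players, **game_args):
    """Each player faces every other player and a copy of itself; the winner
    is the player with the greatest total score over all its games."""
    totals = [0] * len(players)
    for i in range(len(players)):
        for j in range(i, len(players)):
            score_i, score_j = play_game(players[i], players[j], **game_args)
            totals[i] += score_i
            if i != j:          # the self-copy's own score is not double-counted
                totals[j] += score_j
    return totals
```

For example, `round_robin([tit_for_tat, tit_for_tat], rounds=200, noise=0.1)` would score a small field under the noise level used later in this chapter.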
Given this problem description, the goal of Axelrod’s original tourna-
ments was to find the most effective strategies that the players should adopt.
Whilst in a single instance of the Prisoner’s Dilemma game it is a domi-
nant strategy for each player to defect, in the iterated game this immediate
temptation is tempered by the possibility of cooperation in future rounds.
This is often termed the shadow of the future[Trivers (1971)], and, thus, in
order to perform well in an IPD tournament, it is preferable for a player to
attempt to establish mutual cooperation with the opponent. Thus, strate-
gies based on reciprocity have proved to be successful, and, indeed, the
simplest such strategy, tit-for-tat (i.e. start by cooperating and then de-
fect whenever the opponent defected in the last move) famously won both
tournaments[Axelrod (1984)].
More recent research has extended this reciprocity-based approach, and
has led to strategies that out-perform tit-for-tat in general populations.
For example, Gradual[Beaufils et al. (1997)] is an adaptation of tit-for-tat that
incrementally increases the severity of its retaliation to defections (i.e. the
first defection is punished by a single defection, the second by two consec-
utive defections, and so on). Likewise, Adaptive[Tzafestas (2000)] follows
the same intuition as Gradual but addresses the fact that the opponent’s
behaviour may change over time and thus a permanent count of past de-
fections may not be the best approach. Rather, it maintains a continually
updated estimate of the opponent’s behaviour, and uses this estimate to
condition its future actions.
However, this reciprocity is challenged within the noisy IPD tournament.
Here, there is a small probability (typically around 1 in 10) that a move
proposed by either of the players is mis-executed. Thus a player who intended
to cooperate may defect accidentally (or vice versa),^a and this noise makes
maintaining mutual cooperation much more difficult. For example, a single
accidental defection in a game where two players are using the tit-for-tat
strategy will lead to a series of mutual defections in which each player’s
score is reduced. This detrimental effect is often resolved by implementing
more generous strategies which do not retaliate immediately. For example,
tit-for-two-tats (TFTT) will only retaliate after two successive
defections[Axelrod (1997); Axelrod and Wu (1995)] and generous tit-for-tat
(GTFT) only retaliates a small percentage of the times that tit-for-tat
would[Axelrod and Wu (1995)]. However, whilst these strategies manage to
maintain mutual cooperation when playing against similar generous strategies,
their generosity is also vulnerable to exploitation by more complex
strategies. Thus effective strategies for noisy IPD tournaments must carefully
balance generosity against vulnerability to exploitation, and in practice,
this is difficult to achieve.

^a Note that this noise can be implemented in two different ways: either the
cooperation is actually mis-executed as a defection, or it is simply perceived
by the other player as a defection. The difference between these two
implementations results in different payoffs to the players in that round of
the IPD game. Whilst this does result in slightly different scores in the
overall IPD tournament, it does not significantly affect the results, as, in
general, the performance of a player is determined by its actions in the moves
that follow either the real or perceived defection. In our experiments, we use
the first implementation and assume that noisy moves are actually mis-executed.
Now, the possibility of entering a team of players within a noisy IPD
tournament offers an alternative to this reciprocity based approach. If the
members of the team are able to recognise one another, they can uncon-
ditionally mutually cooperate and thus do not need to retaliate against
defections that are the result of mis-executed moves. In addition, by de-
fecting against players who they do not recognise as fellow team members,
they are immune to exploitation from these competing players. As such,
this approach resembles the notion of kin selection from the evolutionary
biology literature, where individuals act altruistically toward those that
they recognise as being their genetic relatives[Hamilton (1963, 1964)].
However, to use this approach in practice, we must address two specific
issues. Firstly, we must enable the players to recognise one another and
we do so by using a pre-agreed sequence of moves that each player makes
at the start of each IPD interaction. Secondly, since our goal is to ensure
that one member of the team wins the tournament, we explicitly identify
one team member as the team leader, and have the other team members
favour this individual. We describe these steps, in more detail, in the next
section.
9.3. Team Players
Thus, as described in the previous section, we initially implement a team
of players who recognise one another through the initial sequence of moves
they make at the start of each IPD interaction. To this end, each team
player uses a fixed length binary code word to describe this initial sequence
of moves. Specifically, we denote 0 as defect and 1 as cooperate, and the
binary code word indicates the fixed sequence of moves that the player
should make, regardless of the actions of the opponent. This binary code
word is known to all members of the team, and by comparing the moves
of their opponents against this code word, players within the team can
recognise if they are playing against another member of the team or against
an unknown opponent.^b

^b Note that this recognition will not be perfectly reliable; the code word
may be corrupted by noise, or competing players may accidentally make a
sequence of moves that matches the team code word. These are effects that we
explicitly consider in section 9.6.

Fig. 9.1. Diagram showing the sequence of actions played by each of the team
members: after playing the team member code, cooperate continually (CCCCCCCC)
on recognising a team member, and defect continually (DDDDDDDD) otherwise.
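As an illustration of this mechanism, here is a sketch of the figure 9.1 behaviour written against the game interface of the earlier sketch (`make_team_member` and the move-string encoding of the code word are our own illustrative choices, not the authors’ implementation):

```python
def make_team_member(code):
    """Team member of figure 9.1. `code` is the pre-agreed binary code word
    written as a move string, e.g. 'CCD' (1 = cooperate, 0 = defect)."""
    length = len(code)
    def strategy(own_history, opponent_history):
        round_no = len(own_history)
        if round_no < length:
            return code[round_no]          # transmit the covert signal
        if ''.join(opponent_history[:length]) == code:
            return 'C'                     # recognised: cooperate unconditionally
        return 'D'                         # unrecognised: defect continually
    return strategy
```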
Now, whenever a team member meets another team member within the
IPD tournament, they can recognise one another and then cooperate with
one another unconditionally. In addition, the team members can recognise
when they are playing against a competing player and then defect contin-
ually (see figure 9.1). In this way, since the team players no longer have to
reciprocate any mis-executed moves in order to maintain cooperation, they
achieve close to the maximum possible score whenever they play against
other team members. In addition, since they defect against competing
players, they are also immune to exploitation from these players. Thus
given a sufficient number of team members within the IPD tournament,
the team players perform well, compared to reciprocity based strategies.
However, our goal is to form a team that maximises the probability that
one of its members will be the most successful player within the IPD tourna-
ment. Thus, we can improve the performance of the team by identifying one
of the team members as the team leader, and allowing the other ordinary
team members to act preferentially towards this team leader. Thus, when
the ordinary team members encounter the team leader, they continually
cooperate, whilst allowing the team leader to exploit them by continually
defecting. In this way, whilst competing players derive the minimum possi-
ble score in interactions with the ordinary team members, the team leader
derives the maximum possible score in these same interactions. Hence,
by allowing the team leader to exploit them, the ordinary team members
sacrifice their own chance of winning the tournament, but by changing the
tournament environment, they are able to increase the chance that the team
leader will win.^c

^c Thus the team that we implement is similar to the ‘master’ and ‘slave’
approach suggested by Delahaye and Mathieu[Delahaye and Mathieu (1993)].
However, unlike this example, where the slaves were simple strategies that
could potentially be exploited by any member of the population, all of our
team players explicitly recognise one another and condition their actions on
this recognition.
Fig. 9.2. Diagram showing the sequence of actions played by each of the team
players. After playing the team leader code, the team leader defects
continually (DDDDDDDD) on recognising an ordinary team member, cooperates
continually (CCCCCCCC) on recognising another team leader, and otherwise plays
CC followed by tit-for-tat. After playing the team member code, an ordinary
team member cooperates continually (CCCCCCCC) on recognising either a team
member or the team leader, and defects continually (DDDDDDDD) otherwise.
The case above describes the instances in which the team leader encoun-
ters another team member. However, when the team leader encounters any
other competing players it should adopt some default strategy. Clearly,
using the best performing strategy available will increase the chances of
the team leader winning the tournament. However, since our purpose here
is to demonstrate the factors that influence the effectiveness of the team,
rather than to optimise a single example case, in the investigations that
we present here, we use tit-for-tat as this default strategy. Tit-for-tat
is well understood, and whilst it does not exploit other strategies
as effectively as the more recently developed alternatives discussed in the
previous section, it is immune to being exploited itself. Thus in the case
that the team leader does not recognise another team player, it cooperates
on the next two moves in an attempt to reestablish cooperation and then
continues by playing tit-for-tat for the rest of the interaction.
Finally, since the rules of the IPD tournament mean that each player
must play against a copy of themselves, we also enable the team leader to
recognise and cooperate with a copy of itself. Thus, the actions of both the
ordinary team members and the team leader are shown schematically in
figure 9.2. Note that it is not strictly necessary to implement two different
codes (i.e. one for the team leader and one for ordinary team members),
however, we do so to reduce the chances of a competing player exploiting
the ordinary team members (see section 9.7 for a more detailed discussion).
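A corresponding sketch of the team leader of figure 9.2 follows (again our own illustrative rendering, assuming equal-length codes; an ordinary member is extended analogously so that it also cooperates on recognising the leader code):

```python
def make_team_leader(leader_code, member_code):
    """Team leader of figure 9.2: exploit recognised ordinary members,
    cooperate with a recognised copy of itself, and otherwise play two
    cooperations followed by tit-for-tat as the default strategy."""
    length = len(leader_code)
    def strategy(own_history, opponent_history):
        round_no = len(own_history)
        if round_no < length:
            return leader_code[round_no]   # transmit the leader's signal
        opening = ''.join(opponent_history[:length])
        if opening == member_code:
            return 'D'                     # exploit the ordinary team member
        if opening == leader_code:
            return 'C'                     # cooperate with a copy of itself
        if round_no < length + 2:
            return 'C'                     # two cooperations to re-establish cooperation
        return opponent_history[-1]        # then tit-for-tat
    return strategy
```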
9.4. Experimental Results
Now, given the team players described in the previous section, two imme-
diate questions are posed: (i) how does the number of team players within
the population affect the probability that the team leader does in fact win
the tournament? and (ii) how does the length of the code word (i.e. the
length of the initial sequence of moves that the team players use to signal
to one another) affect the performance of the team leader? In order to
address these questions and to test the effectiveness of the team, we imple-
ment an IPD tournament (with and without noise) using a representative
population of competing players. To ensure consistency between differ-
ent comparisons within the literature, we adopt the same test population
as previous researchers[Beaufils et al. (1997); O’Riordan (2000); Tzafestas
(2000)] and thus the population consists of eighteen players implementing
the base strategies used in the original Axelrod competition (e.g. All C,
All D, Random and Negative), simple strategies that play periodic moves
(e.g. periodic CD, CCD and DDC) and state-of-the-art strategies that have
been shown to outperform these simple strategies (e.g. Adaptive, Forgiving
and Gradual). A full list and description of the strategies adopted by these
players is provided in Appendix A.
We first run this tournament, using this fixed competing population,
whilst varying the number of team players within the population, from 2
to 5 (i.e. one team leader and 1 to 4 ordinary team members), and varying
the length of code word, L, from 1 to 16 bits. To ensure representative
results, we also average over all possible code words, and in total, we run
the tournament 1000 times and average the results. Since our aim is to
show the benefit that the team has yielded, compared to the default
strategy of the team leader (in this case tit-for-tat), we divide the total score
of the team leader by the total score of the player adopting the simple tit-
for-tat strategy. Thus, we calculate 〈ScoreLeader〉 / 〈ScoreTFT〉 and note
that the greater this value, the better the performance of the team. The
results of these experiments are shown in figure 9.3 for the noise free IPD
tournament and in figure 9.5 for the noisy IPD tournament. In these figures,
the experimental results are plotted with error bars, along with a continuous
best fit curve (see section 9.5 for a discussion of the calculation of this line).
Now, in order to investigate the effect of larger population sizes, we
also run experiments where we fix the number of team players within the
population to be five (again composed of one team leader and four ordi-
nary team members), but then generate competing populations of differ-
Fig. 9.3. Experimental results showing the benefit of the team in a noise free IPD
tournament. Results show code word lengths from 1 to 16 bits where the total population
consists of 2 to 5 team players (i.e. one team leader and 1 to 4 ordinary team members)
and 18 competing players. Results are averaged over 1000 tournament runs.
Fig. 9.4. Experimental results showing the benefit of the team in a noise free IPD
tournament. Results show code word lengths from 1 to 16 bits where the total population
consists of 5 team players (i.e. one team leader and 4 ordinary team members) and 6,
12, 18, 24 and 30 competing players. Results are averaged over 10000 tournament runs.
Fig. 9.5. Experimental results showing the benefit of the team in a noisy IPD tour-
nament. Results show code word lengths from 1 to 16 bits where the total population
consists of 2 to 5 team players (i.e. one team leader and 1 to 4 ordinary team members)
and 18 competing players. Results are averaged over 1000 tournament runs.
Fig. 9.6. Experimental results showing the benefit of the team in a noisy IPD tour-
nament. Results show code word lengths from 1 to 16 bits where the total population
consists of 5 team players (i.e. one team leader and 4 ordinary team members) and 6,
12, 18, 24 and 30 competing players. Results are averaged over 10000 tournament runs.
ent sizes by randomly selecting players from our pool of 18 base strate-
gies (always ensuring that we have at least one player using the tit-for-tat
strategy). We run the tournament 10000 times (more than before, as we must
also average over the stochastic competing population) and again calculate
〈ScoreLeader〉 / 〈ScoreTFT〉. Figure 9.4 shows these results for the noise free
IPD tournament and figure 9.6 shows the results for the noisy IPD tournament.
The results clearly indicate that, as expected, increasing the number
of team players, or more exactly, increasing the percentage of the popula-
tion represented by the team, improves the performance of the team (i.e.
increases 〈ScoreLeader〉 / 〈ScoreTFT〉). In addition, in both the noise free
and noisy IPD tournaments there is clearly an optimum code word length
whereby the benefit of the team decreases when the code word length is
longer or shorter than this optimum. Most significantly, this optimum code
word length is clearly independent of both the size of the team and the
population. In addition, in the case of the noisy IPD tournament, the re-
sults are very sensitive to this optimum code word length and, overall, the
benefit of the team is much less than that achieved in the noise free IPD
tournament. In the next section, we analyse these results and propose error
correcting codes to improve performance in the noisy IPD tournament.
9.5. Analysis
The optimum code word lengths observed in the previous experimental re-
sults are the result of a number of opposing factors. If we initially consider
the noise free IPD tournament, we can identify two such factors. The first
represents the cost of the signalling between team players. As the length of
the code word is increased, the team players have fewer remaining
moves available in which to manipulate the outcome of the tournament and, thus,
this factor favours shorter code word lengths. However, for this signalling
to be effective, the team players must be able to distinguish between com-
peting players and other team players. If the code word becomes too short,
it becomes increasingly likely that a competing player will, through pure
chance, make a sequence of moves that corresponds to either of the code
words of the team players. Thus the second factor represents the effec-
tiveness of the signalling. It has the opposite effect to the first and thus
favours longer code word lengths. The balance of these two opposing fac-
tors gives rise to the behaviour seen in figures 9.3 and 9.4 where we observe
an optimum code length near seven bits; at greater lengths we observe an
approximately linear decrease in performance, whilst at shorter lengths, we
observe a more rapid decrease in performance.

Fig. 9.7. Experimental and theoretical results showing the probability of a team player
successfully discriminating between another team player and a competing player in an
IPD tournament.
When noise is added to the IPD tournament, a third factor, which also
affects the effectiveness of the signalling, becomes apparent. In order for
the team players to recognise one another, the sequence of moves made
by each player must be correctly executed. In the noisy IPD tournament,
there is a small probability that one or more of the moves that constitute
these code words will be mis-executed and, in this case, the team players
will fail to recognise one another. The effect of this additional factor is
clearly seen in a comparison of figures 9.3 and 9.4 and figures 9.5 and 9.6.
In the noisy IPD tournament the optimum code word length is significantly
shorter than the noise free case and there is a very rapid non-linear decrease
in performance at code word lengths greater than this optimum. This final
factor is very significant, and thus in the noisy IPD tournament, the team
yields much less benefit than in the noise free IPD tournament.
Now, the two factors that describe the effectiveness of the signalling can
usefully be expressed as two probabilities. These are the probability that a
team player will successfully discriminate a competing player from another
team player, Pd, and the probability that two team players will successfully
recognise one another, Pr. We can directly measure these probabilities from
the experimental results presented in the last section, and then compare
them to theoretical predictions.
Thus, to calculate the probability of successful discrimination, Pd, we
consider that out of the 2^L possible code words, one is required for the team
leader code and one for the team member code. Thus, when we consider
the average over all possible code words, this probability is given by:

\[ P_d = 1 - \frac{2}{2^L} \tag{9.1} \]
In the case of the probability of successful recognition, Pr, we require that
both code word sequences are played with no mis-executed moves. If the
probability of mis-executing a move is γ (in our case γ = 1/10), then this
probability is simply given by:

\[ P_r = (1 - \gamma)^{2L} \tag{9.2} \]
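Equations (9.1) and (9.2) are straightforward to evaluate; the short sketch below (illustrative only) tabulates both probabilities for γ = 1/10, making the opposing pressures on the code word length visible: Pd favours longer codes, Pr shorter ones.

```python
def p_discriminate(length):
    """Eq. (9.1): probability of discriminating a competing player."""
    return 1 - 2 / 2 ** length

def p_recognise(length, gamma=0.1):
    """Eq. (9.2): probability that both code sequences survive the noise."""
    return (1 - gamma) ** (2 * length)

for L in range(1, 17):
    print(f"L={L:2d}  Pd={p_discriminate(L):.3f}  Pr={p_recognise(L):.3f}")
```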
Figures 9.7 and 9.8 show a comparison of these analytical results against the
probabilities measured from the experimental results presented in the last
section. Clearly, the theoretical predictions match the experimental data
extremely well,^d and these results indicate that the benefit of the team is
strongly dependent on the effectiveness of the signalling between the team
members. Most surprising is that, in the case of the noisy IPD tournament,
with anything but the very shortest code word lengths, the chances of two
team players successfully recognising one another are extremely small. At
first sight, this result suggests that the use of teams is unlikely to be very
effective in noisy environments. However, the problem that we face here
(i.e. how to reliably recognise code words in the presence of mis-executed
moves) is exactly analogous to the problem, studied in information theory, of
communicating reliably over a noisy channel. As such, we can use the results
of this field (specifically error correcting codes) to increase the probability
that the team members successfully recognise one another, and thus, in
turn, increase the benefit that the team will yield.

^d Further confirmation of this analysis is provided by the observation that
the best-fit lines shown in figures 9.3 to 9.6 are calculated by postulating
that the shape of the line is given by y = A + Bx + C/2^x + D(1 − γ)^{2x}. The
coefficients A, B, C and D are then found via regression so as to minimise the
sum of the squared errors between observed and calculated results. In the case
of the noise free IPD tournament, the value of D is fixed at zero.

Fig. 9.8. Experimental and theoretical results showing the probability of two team
players successfully recognising one another in a noisy IPD tournament.
9.6. Error Correcting Codes
The problem of communicating reliably over a noisy channel, or in our case,
reliably recognising code words when moves of the IPD game are subject to
mis-execution, is fundamental to the field of information theory[Shannon
(1948)]. One of the most widely used results of this work is the concept
of error correcting codes; codes that allow random transmission errors to
be detected and corrected[MacKay (2003); Peterson and Weldon (1972)].
Such codes typically take a binary code word of length Lc and encode it
into a longer binary message of length Lm (i.e. Lm > Lc). Should any
errors occur in the transmission of this message (e.g. a 1 transmitted by
the sender is interpreted as a 0 by the receiver), the decoding procedure and
the redundancy that has been incorporated into the longer message mean
that these errors can be corrected and the original code word retrieved.
Different coding algorithms are distinguished by the length of the initial
code word, the degree of redundancy added to the message and by the
number of errors that they can correct. Thus, in our application, all the
team members must implement the same coding algorithm, but now, rather
than using the code word directly to describe their initial sequence of moves,
they use the longer encoded message. Likewise, they observe the moves of
their opponent and then compare the results of the decoding algorithm to
their reference code words.
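For concreteness, here is a self-contained sketch of the single block [7,4] Hamming code used below. This is the textbook construction with parity bits at positions 1, 2 and 4, not necessarily the authors’ exact implementation; a team player would encode its 4-bit code word, play the 7 resulting moves, and decode the opponent’s first 7 moves before comparing.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (layout: p1 p2 d1 p3 d2 d3 d4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(word):
    """Correct up to one flipped bit and return the 4 data bits."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    error_pos = s1 + 2 * s2 + 4 * s3   # 1-indexed error position, 0 if none
    if error_pos:
        w[error_pos - 1] ^= 1
    return [w[2], w[4], w[5], w[6]]

# A single mis-executed move (one bit flip) is corrected:
codeword = hamming74_encode([0, 1, 0, 1])
corrupted = codeword.copy()
corrupted[4] ^= 1
assert hamming74_decode(corrupted) == [0, 1, 0, 1]
```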
The improvement that such error-correcting codes can achieve is signifi-
cant, but we have several requirements when selecting an appropriate coding
algorithm. The coding algorithm should increase the effectiveness of the
signalling, by increasing the probability that the team members can suc-
cessfully discriminate between team members and other competing players
(i.e. increase Pd) and by increasing the probability that the team members
recognise one another successfully (i.e. increase Pr). However, it should not
increase the cost of the signalling such that this increase in effectiveness is
lost. The need to limit the increase in the cost of signalling, and thus limit
the length of the encoded message, Lm, is the key factor in restricting our
choice of coding algorithm. As shown in figures 9.3 and 9.4, even with the
perfect recognition that is achieved in the noise free case, the performance
of the team begins to degrade when Lm > 7, and whilst many coding algo-
rithms exist, the vast majority generate message lengths far in excess of this
value[Peterson and Weldon (1972)]. Thus, our choice of coding algorithm
is limited to the three presented below:
(1) A single block Hamming code that takes a 4-bit code word and generates
a 7-bit message that can be corrected for a single error.
(2) A two block Hamming code that simply concatenates two 4-bit code words
and thus produces a 14-bit message that can be corrected for a
single error in each 7-bit block.
(3) A [15,5] Bose-Chaudhuri-Hocquenghem (BCH) code that encodes a
5-bit code word into a 15-bit message, but is capable of correcting
up to three errors.
Now, in each case, the probability of successfully discriminating between
team players and competing players is still determined by the initial code
word length (i.e. the decoding algorithm maps the 2^Lm possible encoded
messages onto 2^Lc possible code words), and thus, as before, is given by:

\[ P_d = 1 - \frac{2}{2^{L_c}} \tag{9.3} \]
However, the probability that the team players successfully recognise one
another is determined by the message length and by the error correcting
ability of the code. Thus, for the Hamming code with n blocks, this prob-
ability is given by the probability that no more than one error occurs in each
seven bit encoded message:

\[ P_r = \left[ \sum_{k=0}^{1} \binom{7}{k} \gamma^k (1-\gamma)^{7-k} \right]^{2n} \tag{9.4} \]
Table 9.2. Calculated results for the probability of discrimination, Pd, and the
probability of recognition, Pr, for the three different error correcting codes considered.

                                       Direct   Hamming             BCH
                                       L=3      1 block   2 blocks  [15,5]
Lc – Code Word Length                  3        4         8         5
Lm – Message Length                    3        7         14        15
Pd – Probability of Discrimination     0.750    0.875     0.992     0.937
Pr – Probability of Recognition        0.531    0.723     0.527     0.892
For the [15,5] BCH code, the probability of recognition is given by consid-
ering that the code word can be correctly decoded if no more than three errors
occur in the fifteen bit encoded message, and thus:

\[ P_r = \left[ \sum_{k=0}^{3} \binom{15}{k} \gamma^k (1-\gamma)^{15-k} \right]^{2} \tag{9.5} \]
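Equations (9.3) to (9.5) can be checked directly; the sketch below (our own, illustrative) evaluates them for γ = 1/10 and reproduces the entries of table 9.2 to within rounding (the two block Hamming entry comes out near 0.52).

```python
from math import comb

GAMMA = 0.1

def p_discriminate(code_len):
    """Eq. (9.3): discrimination depends only on the code word length Lc."""
    return 1 - 2 / 2 ** code_len

def p_recognise_hamming(n_blocks, gamma=GAMMA):
    """Eq. (9.4): at most one error in each of the 2n seven-bit blocks."""
    block_ok = sum(comb(7, k) * gamma**k * (1 - gamma)**(7 - k) for k in range(2))
    return block_ok ** (2 * n_blocks)

def p_recognise_bch(gamma=GAMMA):
    """Eq. (9.5): at most three errors in each fifteen-bit message."""
    msg_ok = sum(comb(15, k) * gamma**k * (1 - gamma)**(15 - k) for k in range(4))
    return msg_ok ** 2

print(p_discriminate(3), (1 - GAMMA) ** 6)        # direct, L=3:  0.750, 0.531
print(p_discriminate(4), p_recognise_hamming(1))  # 1 block:      0.875, 0.723
print(p_discriminate(8), p_recognise_hamming(2))  # 2 blocks:     0.992, ~0.523
print(p_discriminate(5), p_recognise_bch())       # [15,5] BCH:   0.937, 0.892
```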
These calculated values are shown in table 9.2 for the three coding algo-
rithms considered, along with the original case results in which the direct
code words are used (we use the value of L = 3 which was shown to be
optimal for the noisy IPD tournament presented in section 9.4). Note that
all of the coding algorithms result in improvements in Pd
since they all im-
plement a code word of length greater than three. However, only the single
block Hamming code and the [15,5] BCH code improve upon Pr. In the case
of the two block Hamming code, the error correcting ability is not sufficient
to overcome the long message length that results. Of the three algorithms,
the [15,5] BCH code is superior; it creates the longest message length, yet
its error correcting ability is such that it also displays the best probability of
recognition. This result is confirmed by implementing the different coding
algorithms within the team players and repeating the experimental noisy
IPD tournament, with a fixed competing population, described in section
9.4. As before, to ensure representative results, we run the tournament
1000 times and average over all possible choices of code words. Table 9.3
shows the results of this comparison when 2 to 5 team players (i.e. one
team leader and 1 to 4 ordinary team members) are included within the
population. As expected, the [15,5] BCH code outperforms the others and,
in the case where there are five team members, the performance of the
[15,5] BCH algorithm is very close to the best achieved in the noise free
IPD tournament presented in figure 9.3.
Table 9.3. Experimental results for 〈ScoreLeader〉 / 〈ScoreTFT〉 for
the three different error correcting codes considered here. Tournaments
are averaged over 1000 runs and the standard error of the mean is
±0.002.

                          Direct   Hamming             BCH
Number of Team Players    L=3      1 block   2 blocks  [15,5]
2                         1.043    1.055     1.044     1.062
3                         1.079    1.101     1.083     1.120
4                         1.112    1.145     1.121     1.173
5                         1.141    1.184     1.159     1.221
Finally, we present results from implementing this [15,5] BCH code in
the noisy IPD tournament, again with a fixed competing population. In
table 9.4 we show the total scores achieved by each player when the number
of team players increases from 2 to 5. To enable comparison with other
populations, we normalise these scores and divide the total score achieved
by each player, by the size of the population and by the number of rounds
in each IPD game (in this case 200). Thus, the values shown are the
ranked average pay-off received by the player in each round of the Prisoner’s
Dilemma game. Within this table, the competing players are denoted by
the mnemonic given in Appendix A, the team leader is denoted by LEAD
and the ordinary team members by MEMB.
Clearly, as more team members are added to the population, they are
increasingly able to change the environment in which the team leader must
interact and thus they are able to influence the outcome of the tournament
in favour of the team leader. In three out of the four cases, the team leader
is in fact the winner of the tournament, despite the fact that this player is
based upon the tit-for-tat strategy that performs relatively poorly against
this population (see the results shown in Appendix A). In addition, these
results clearly show that the mutual cooperation of the other team
members also leads them to perform well. Indeed, when the team consists
of five (or more) such team members, all five occupy the top positions.
In table 9.5, rather than showing the averaged scores of the tournament
players, we present the probability that one of the team players actually
wins the overall noisy IPD tournament. In addition to the previous results
where the probability that a move was mis-executed was 1/10, we present
a range of values from 0 to 1/5. The results indicate that whilst we have
assumed a noise level of 1/10 throughout the analysis, our results are not
particularly sensitive to this value. Indeed, the more significant factor is
Table 9.4. Experimental results for the noisy IPD tournament when the team players
implement a [15,5] BCH coding algorithm and there are increasing numbers of team
players, cases (a) to (d). The tournaments are averaged over 1000 runs and
the standard error of the mean is ±0.002.
(a)
Player Score
GRAD 2.347
LEAD 2.344
ADAP 2.263
SMAJ 2.256
GRIM 2.239
ALLD 2.219
MEMB 2.219
TFT 2.207
TFTT 2.175
FORG 2.171
GTFT 2.160
PCD 2.138
PCCD 2.136
STFT 2.124
HMAJ 2.109
RAND 2.101
PAVL 2.099
PDDC 2.072
NEG 2.049
ALLC 1.996
(b)
Player Score
LEAD 2.427
GRAD 2.298
MEMB 2.246
MEMB 2.246
ADAP 2.228
SMAJ 2.221
GRIM 2.221
ALLD 2.192
TFT 2.168
TFTT 2.135
FORG 2.126
GTFT 2.114
PCD 2.091
STFT 2.090
HMAJ 2.084
PCCD 2.078
RAND 2.058
PAVL 2.047
PDDC 2.033
NEG 1.991
ALLC 1.934
(c)
Player Score
LEAD 2.503
MEMB 2.273
MEMB 2.272
MEMB 2.271
GRAD 2.256
ADAP 2.191
SMAJ 2.186
GRIM 2.181
ALLD 2.161
TFT 2.133
TFTT 2.099
FORG 2.086
GTFT 2.068
STFT 2.061
HMAJ 2.054
PCD 2.047
PCCD 2.027
RAND 2.013
PDDC 2.005
PAVL 2.004
NEG 1.938
ALLC 1.877
(d)
Player Score
LEAD 2.568
MEMB 2.296
MEMB 2.294
MEMB 2.294
MEMB 2.292
GRAD 2.218
ADAP 2.164
SMAJ 2.157
GRIM 2.156
ALLD 2.136
TFT 2.103
TFTT 2.062
FORG 2.054
STFT 2.036
GTFT 2.031
HMAJ 2.030
PCD 1.999
PCCD 1.982
RAND 1.969
PDDC 1.969
PAVL 1.966
NEG 1.886
ALLC 1.820
the loss of performance of the competing players as the noise level increases.
The table shows that with just two team members and no noise, a team
player will win the tournament just 3.4% of the time. However, as the noise
level increases, the performance of the other players within the tournament
degrades at a faster rate than that at which the effectiveness of the signalling
between team members diminishes. At a noise level of 1/5 the same team
members win 70.2% of the time. Indeed with 3 or 4 team members, the
results are independent of the noise level within this range.
Table 9.5. Experimental results showing the probability that one of the team
members wins the noisy IPD tournament. Results are for different numbers
of team members and a range of noise levels. Results are averaged over 1000
tournament runs and the standard error of the mean for each result is ±0.5.
                          Noise Level (γ)
Number of Team Players    0.00     0.05     0.10     0.15     0.20
2                         2.8 %    10.6 %   22.4 %   30.0 %   32.6 %
3                         3.4 %    81.0 %   80.4 %   81.6 %   70.2 %
4                         97.6 %   99.0 %   96.4 %   96.6 %   97.2 %
5                         97.4 %   96.6 %   97.2 %   96.6 %   96.8 %
9.7. Competition Entry
The results of the previous sections clearly indicate that there is an advan-
tage to be gained by entering a team of players into the noisy IPD tour-
nament. However, when using these results to actually design the players
for the IPD competition entries, a number of additional factors must be
considered. Firstly, in our experimental investigations we have averaged
over all possible code words to produce representative results. However,
for the competition entry we must actually select two code words: one for
the team members and one for the team leader. Whilst the probability of
recognising a team player is independent of the choice of code word (this
is a property of the codes that are implemented), the probability of succes-
fully discriminating between team and competing players is not. Clearly,
code words that are close (in Hamming distance) to the initial moves of
competing players are more likely to be corrupted by noise and thus falsely
recognised. Thus we must select code words that are most unlike the moves
that we expect to observe from competing players. Actually making this
choice is complicated by the fact that we do not know the strategies that
the competing players will use, and the moves that they make will them-
selves depend on the actual code words that the team players use. Thus, we
again use our test population of eighteen default strategies, and by exhaus-
tive test, we select two code words which most often lead to the correct
recognition of team players and the correct discrimination of competing
players.
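A simplified sketch of such an exhaustive search follows (all names are our own; the `openings` argument stands in for recorded opening sequences of the eighteen test strategies, whereas the authors’ actual selection was made by exhaustive tournament testing):

```python
from itertools import product

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def choose_code_words(openings, length):
    """Pick the (member, leader) code pair whose minimum Hamming distance to
    the expected opening move sequences is largest, so that noise is least
    likely to turn a competing player's opening into a false recognition.
    The double loop over all 2^length words is feasible for short codes."""
    words = [''.join(w) for w in product('CD', repeat=length)]
    best_pair, best_margin = None, -1
    for member in words:
        for leader in words:
            if member == leader:
                continue
            margin = min(hamming_distance(code, opening)
                         for code in (member, leader) for opening in openings)
            if margin > best_margin:
                best_pair, best_margin = (member, leader), margin
    return best_pair
```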
Secondly, throughout these investigations, we have not considered the
possibility of another competing player learning the code words of the team
members and then attempting to exploit them. Within our competition en-
tries, we greatly reduce the possibility of this occurring by having each team
player monitor the behaviour of their opponent, in order to check that they
behave as expected. Thus, if an ordinary team member recognises their
opponent to be another ordinary team member, they check that the oppo-
nent does in fact cooperate in the subsequent rounds of the game. Should
the opponent attempt to defect (with some allowance for the possibility
of mis-executed moves), it is assumed that the opponent has been falsely
recognised and thus the team member begins to defect to avoid the possibil-
ity of being exploited. Given this additional checking, the only possibility
of exploitation is that a competing player learns the code word of the team
leader, and thus tricks the ordinary team members into allowing themselves
to be exploited. However, in the IPD tournament, this is extremely unlikely
to occur. The players within the tournament only interact with each other
once, thus, whilst a competing player may encounter several ordinary team
members, there is little possibility of them learning the code word of the
team leader in this single interaction. This is the reason for implementing
separate team member and team leader code words.
Finally, we must decide how many team members to submit into the
competition. Clearly, our results indicate that the larger the number of
players, the better the performance of the team leader. However, typically,
this number is limited by the rules of the competition (e.g. the rules of the
second IPD tournament capped this number at 20), and thus, we should
submit the maximum allowable number of players.
Thus, the teams that we entered into the two recent IPD competitions
held at the 2004 IEEE Congress on Evolutionary Computing (CEC’04)
and the 2005 IEEE Symposium on Computational Intelligence and Games
(CIG’05), followed these guidelines and were successful. In the first compe-
tition, we entered several teams that used the single block Hamming code,
and a range of default strategies for the team leader. Whilst a few other
researchers entered teams of players, the policy was not widely adopted and
the team leader from the largest team won with a clear advantage.
In the second round of competitions we entered a single team using
the more complex [15,5] BCH coding scheme, and, as in our investigations
here, we used tit-for-tat as the default strategy of the team leader. In this
competition, separate noise free and noisy IPD tournaments were held, and
these tournaments were more competitive, as given the results of the first
competition, many more researchers adopted the policy of submitting a
team of players. Within the noise free IPD tournament, three of the top
four positions were occupied by representatives of different teams. However,
within the noisy IPD tournament, our team leader again won with a clear
advantage, despite using tit-for-tat as a default strategy. The other
teams entered into this tournament performed poorly compared to the noise
free IPD tournament. Thus, these results clearly illustrate the advantage
that the use of error-correcting codes has yielded by enabling our team
players to recognise one another in the noisy environment.
9.8. Conclusions
In this chapter, we presented our investigations into the use of a team of
players within an Iterated Prisoner’s Dilemma tournament. We have shown
that if the team players are capable of recognising one another, they can
condition their actions to increase the probability that one of their mem-
bers wins the tournament. Since outside means of communication are not
available to these players, we have shown that they are able to make use
of a covert channel (specifically, a pre-agreed sequence of moves that they
make at the start of each interaction) to signal to one another and thus
perform this recognition. By carefully considering both the cost and effec-
tiveness of the signalling, we have shown that we can use error correcting
codes to optimise the performance of the team and that this coding allows
the teams to be extremely effective in the noisy IPD tournament; a noisy
environment which initially appears to preclude their use.
Our future work in this area concerns the use of these team players in
an evolutionary model of the IPD tournament. That is, rather than the
static IPD tournament presented here (where the population of competing
players is fixed), we consider a model where the population of competing
players evolves over time (i.e. the survival of any individual within the
population is dependent on their performance within an IPD tournament
held at each generation). Here we are particularly interested in searching
for evolutionarily stable strategies (ESS), and thus are interested in whether an
explicit team leader is required (or indeed, can even be implemented) and
how team players may attempt to exploit other team players to their own
advantage. As such, this work attempts to compare the roles of kin selection
and reciprocity for maintaining cooperation in noisy environments.
A.1. Test Population
The test population consists of eighteen players implementing the base
strategies used in the original Axelrod competition (e.g. All C, All D,
Random and Negative) plus simple strategies that play periodic moves (e.g.
periodic CD, CCD and DDC) and state-of-the-art strategies that have been
shown to outperform these simple strategies (e.g. Adaptive, Forgiving and
Gradual). A full list and description of the strategies adopted by these
players is shown in table A.1, and table A.2 shows the results of running
noise free and noisy IPD tournaments using just these players. To ensure
repeatable results, we run the tournament 1000 times and present the aver-
age results. To allow easy comparison with other publications, we normalise
the scores and thus divide them by the size of the population and the num-
ber of rounds in each IPD game (in this case 200). Thus, the values shown
are the ranked average pay-off received by the player in each round of the
Prisoner’s Dilemma game.
Note that in this population, tit-for-tat performs relatively poorly and
is easily beaten by a number of strategies. In addition, in general the scores
in the noisy IPD tournament are less than those in the noise free tourna-
ment, since it is far harder to ensure mutual cooperation in the presence of
accidental defections.
Table A.1. Description of the strategies adopted by the competing players in the
test population.
Strategy Name Description
Adaptive ADAP Uses a continuously updated estimate of the
opponent player’s propensity to defect to
condition future actions[Tzafestas (2000)].
All C ALLC Cooperates continually.
All D ALLD Defects continually.
Forgiving FORG Modified tit-for-tat strategy that attempts
to reestablish mutual cooperation after a
sequence of mutual defections[O’Riordan (2000)].
Gradual GRAD Modified tit-for-tat strategy that uses progressively
longer sequences of defections in retaliation[Beaufils et al. (1997)].
Grim GRIM Cooperates until a strategy defects against
it. From that point on defects continually.
Generous Tit-For-Tat GTFT Like tit-for-tat but cooperates 1/3 of the
times that tit-for-tat would defect[Axelrod
and Wu (1995)].
Hard Majority HMAJ Plays the majority move of the opponent.
On the first move, or when there is a tie, it
cooperates.
Negative NEG Plays the negative of the opponents last
move.
Pavlov PAVL Plays win-stay, lose-shift[Nowak and Sigmund (1993)].
Periodic CD PCD Plays ‘cooperate, defect’ periodically.
Periodic CCD PCCD Plays ‘cooperate, cooperate, defect’ periodically.
Periodic DDC PDDC Plays ‘defect, defect, cooperate’ periodically.
Random RAND Cooperates and defects at random.
Suspicious Tit-For-Tat STFT Identical to tit-for-tat but starts by defecting.
Soft Majority SMAJ Plays the majority move of the opponent.
On the first move, or when there is a tie, it
defects.
Tit-For-Tat TFT Starts by cooperating and then plays the last
move of the opponent.
Tit-For-Two-Tats TFTT Like tit-for-tat but only defects after two
consecutive defections against it.
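To illustrate, a few of these strategies written against the history-based interface of the earlier sketches (our own renderings of the standard definitions; the table’s descriptions are authoritative, and the first move of Negative is an assumption):

```python
def all_c(own, opp):
    return 'C'

def all_d(own, opp):
    return 'D'

def grim(own, opp):
    # Cooperates until the opponent defects once, then defects continually.
    return 'D' if 'D' in opp else 'C'

def negative(own, opp):
    # Plays the negative of the opponent's last move (first move assumed 'C').
    if not opp:
        return 'C'
    return 'D' if opp[-1] == 'C' else 'C'

def pavlov(own, opp):
    # Win-stay, lose-shift: repeat the last move if the opponent cooperated,
    # otherwise switch to the other move.
    if not own:
        return 'C'
    return own[-1] if opp[-1] == 'C' else ('D' if own[-1] == 'C' else 'C')

def periodic_ccd(own, opp):
    return 'CCD'[len(own) % 3]

def suspicious_tit_for_tat(own, opp):
    return opp[-1] if opp else 'D'
```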
Table A.2. Reference performance of the test popu-
lation in the (a) noise free and (b) noisy IPD tourna-
ment. Results are averaged over 1000 repeated tour-
naments and the standard error of the mean is ±0.002.
(a)
Strategy Score
ADAP 2.888
GRAD 2.860
GRIM 2.773
TFT 2.647
FORG 2.627
GTFT 2.591
SMAJ 2.575
TFTT 2.544
PAVL 2.390
ALLC 2.332
PCD 2.279
HMAJ 2.277
STFT 2.233
PCCD 2.190
ALLD 2.175
RAND 2.114
NEG 2.111
PDDC 2.081
(b)
Strategy Score
GRAD 2.410
ADAP 2.329
GRIM 2.297
SMAJ 2.292
ALLD 2.278
TFT 2.245
FORG 2.211
TFTT 2.204
GTFT 2.198
PCCD 2.185
PCD 2.179
STFT 2.155
RAND 2.143
PAVL 2.140
HMAJ 2.134
NEG 2.112
PDDC 2.110
ALLC 2.043
References
Axelrod, R. (1984). The Evolution of Cooperation (Basic Books).
Axelrod, R. (1997). The Complexity of Cooperation (Princeton University Press).
Axelrod, R. and Hamilton, W. D. (1981). The evolution of cooperation, Science
211, pp. 1390–1396.
Axelrod, R. and Wu, J. (1995). How to cope with noise in the iterated prisoner’s
dilemma, Journal of Conflict Resolution 39, 1, pp. 183–189.
Beaufils, B., Delahaye, J. P. and Mathieu, P. (1997). Our meeting with gradual:
A good strategy for the iterated prisoner’s dilemma, in Proceedings of the
Fifth International Workshop on the Synthesis and Simulation of Living
Systems (MIT Press), pp. 202–212.
Delahaye, J. P. and Mathieu, P. (1993). L’altruisme perfectionne, Pour la Science
187, pp. 102–107.
Hamilton, W. D. (1963). The evolution of altruistic behaviour, Am. Nat. 97, pp.
354–356.
Hamilton, W. D. (1964). The genetical evolution of social behaviour, J. Theor.
Biol. 7, pp. 1–16.
MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms
(Cambridge University Press).
Nowak, M. and Sigmund, K. (1993). A strategy of win-stay, lose-shift that out-
performs tit-for-tat in the prisoner’s dilemma game, Nature 364, pp. 56–58.
O’Riordan, C. (2000). A forgiving strategy for the iterated prisoner’s dilemma,
Journal of Artificial Societies and Social Simulation 3, 4.
Peterson, W. W. and Weldon, E. J. (1972). Error-Correcting Codes (MIT Press).
Shannon, C. E. (1948). A mathematical theory of communication, The Bell Sys-
tem Technical Journal 27, pp. 379–423, 623–656.
Trivers, R. (1971). The evolution of reciprocal altruism, Quarterly Review of
Biology 46, pp. 35–57.
Tzafestas, E. S. (2000). Toward adaptive cooperative behavior, in Proceedings of
the Sixth International Conference on the Simulation of Adaptive Behavior
(SAB-2000), Vol. 2, pp. 334–340.
Chapter 10
Is it Accidental or Intentional? A Symbolic Approach to
the Noisy Iterated Prisoner’s Dilemma
Tsz-Chiu Au, Dana Nau
University of Maryland
10.1. Introduction
The Iterated Prisoner’s Dilemma (IPD) has become well known as an ab-
stract model of a class of multi-agent environments in which agents accu-
mulate payoffs that depend on how successful they are in their repeated
interactions with other agents. An important variant of the IPD is the
Noisy IPD, in which there is a small probability, called the noise level, that
accidents will occur. In other words, the noise level is the probability of
executing “cooperate” when “defect” was the intended move, or vice versa.
Accidents can cause difficulty in cooperating with others in real-life situations, and the same is true in the Noisy IPD. Strategies that do quite well
in the ordinary (non-noisy) IPD may do quite badly in the Noisy IPD [Axel-
rod and Dion (1988); Bendor (1987); Bendor et al. (1991); Molander (1985);
Mueller (1987); Nowak and Sigmund (1990)]. For example, if two players
both use the well-known Tit-For-Tat (TFT) strategy, then an accidental
defection may cause a long series of defections by both players as each of
them punishes the other for defecting.
This chapter reports on a strategy called the Derived Belief Strategy
(DBS), which was the best-performing non-master-slave strategy in Cate-
gory 2 (noisy environments) of the 2005 Iterated Prisoner’s Dilemma com-
petition (see Table 10.1).
Like most opponent-modeling techniques, DBS attempts to learn a model of the other player's strategy (i.e., the opponent model∗) during the games. Our main innovation involves how to reason about noise using the opponent model.
∗The term “opponent model” appears to be the most common term for a model of the other player, even though this player is not necessarily an “opponent” (since the IPD is not zero-sum).
Table 10.1. Scores of the best programs in Competition 2 (IPD with Noise). The
table shows each program’s average score for each run and its overall average over
all five runs. The competition included 165 programs, but we have listed only the
top 25.
Score
Rank Program Author Run1 Run2 Run3 Run4 Run5 Avg.
1 BWIN P. Vytelingum 441.7 431.7 427.1 434.8 433.5 433.8
2 IMM01 J.W. Li 424.7 414.6 414.7 409.1 407.5 414.1
3 DBSz T.C. Au 411.7 405.0 406.5 407.7 409.2 408.0
4 DBSy T.C. Au 411.9 407.5 407.9 407.0 405.5 408.0
5 DBSpl T.C. Au 409.5 403.8 411.4 403.9 409.1 407.5
6 DBSx T.C. Au 401.9 410.5 407.7 408.4 404.4 406.6
7 DBSf T.C. Au 399.2 402.2 405.2 398.9 404.4 402.0
8 DBStft T.C. Au 398.4 394.3 402.1 406.7 407.3 401.8
9 DBSd T.C. Au 406.0 396.0 399.1 401.8 401.5 400.9
10 lowESTFT classic M. Filzmoser 391.6 395.8 405.9 393.2 399.4 397.2
11 TFTIm T.C. Au 399.0 398.8 395.0 396.7 395.3 397.0
12 Mod P. Hingston 394.8 394.2 407.8 394.1 393.7 396.9
13 TFTIz T.C. Au 397.7 396.1 390.7 392.1 400.6 395.5
14 TFTIc T.C. Au 400.1 401.0 389.5 388.9 389.2 393.7
15 DBSe T.C. Au 396.9 386.8 396.7 394.5 393.7 393.7
16 TTFT L. Clement 389.1 395.8 394.1 393.4 394.7 393.4
17 TFTIa T.C. Au 389.5 394.4 395.1 389.6 397.7 393.3
18 TFTIb T.C. Au 391.7 390.0 390.5 401.0 392.4 393.1
19 TFTIx T.C. Au 398.3 391.3 390.8 391.0 393.7 393.0
20 mediumESTFT classic M. Filzmoser 396.7 392.6 398.3 390.8 386.0 392.9
21 TFTIy T.C. Au 391.7 394.6 390.8 392.1 394.9 392.8
22 TFTId T.C. Au 395.6 393.1 388.8 385.7 391.3 390.9
23 TFTIe T.C. Au 396.7 391.1 385.2 388.2 393.5 390.9
24 DBSb T.C. Au 393.2 386.1 392.6 391.1 391.0 390.8
25 T4T D. Fogel 391.5 387.6 400.4 387.3 383.5 390.0
The key idea used in DBS is something that we call symbolic noise
detection—the use of the other player’s deterministic behavior to tell
whether an action has been affected by noise. More precisely, DBS builds
a symbolic model of how the other player behaves, and watches for any
deviation from this model. If the other player’s next move is inconsistent
with its past behavior, this inconsistency can be due either to noise or to
a genuine change in its behavior; and DBS can often distinguish between
these two cases by waiting to see whether this inconsistency persists in the
next few iterations of the game.†
Of the nine different versions of DBS that we entered into the competition, all of them placed in the top 25, and seven of them placed among the top ten (see Table 10.1). Our best version, DBSz, placed third; the two players that placed higher were both masters of master-and-slaves teams.
DBS operates in a distinctly different way from the master-and-slaves
strategy used by several other entrants in the competition. Each participant
in the competition was allowed to submit up to 20 programs as contestants.
Some participants took advantage of this to submit collections of programs
that worked together in a conspiracy in which 19 of their 20 programs (the
“slaves”) worked to give as many points as possible to the 20th program
(the “master”). DBS does not use a master-and-slaves strategy, nor does it
conspire with other programs in any other way. Nonetheless, DBS remained
competitive with the master-and-slaves strategies in the competition, and
performed much better than the master-and-slaves strategies if the score of
each master is averaged with the scores of its slaves. Furthermore, a more
extensive analysis [Au and Nau (2005)] shows that if each master-and-slaves
team had been limited to 10 programs or less, DBS would have placed first
in the competition.
10.2. Motivation and Approach
The techniques used in DBS are motivated by a British army officer's story that was quoted in (Axelrod, 1997, page 40):
I was having tea with A Company when we heard a lot of
shouting and went out to investigate. We found our men and
the Germans standing on their respective parapets. Suddenly
a salvo arrived but did no damage. Naturally both sides got
down and our men started swearing at the Germans, when all
at once a brave German got onto his parapet and shouted out:
“We are very sorry about that; we hope no one was hurt. It
is not our fault. It is that damned Prussian artillery.” (Rutter
1934, 29)
Such an apology was an effective way of resolving the conflict and preventing retaliation, because it told the British that the salvo was not the intention of the German infantry, but instead an unfortunate accident that the German infantry neither expected nor desired. The apology was convincing because it was consistent with the German infantry's past
†An iteration has also been called a period or a round by some authors.
behavior. The British had ample evidence to believe that the German
infantry wanted to keep the peace just as much as the British infantry did.
More generally, an important question for conflict prevention in noisy
environments is whether an act of misconduct is intentional or accidental. A devia-
tion from the usual course of action in a noisy environment can be explained
in either way. If we form the wrong belief about which explanation is cor-
rect, our response may potentially destroy our long-term relationship with
the other player. If we ground our belief on evidence accumulated before
and after the incident, we should be in a better position to identify the true
cause and prescribe an appropriate solution. To accomplish this, DBS uses
the following key techniques:
(1) Learning about the other player’s strategy. DBS uses an induc-
tion technique to identify a set of rules that model the other player’s
recent behavior. The rules give the probability that the player will
cooperate under different situations. As DBS learns these probabili-
ties during the game, it identifies a set of deterministic rules that have
either 0 or 1 as the probability of cooperation.
(2) Detecting noise. DBS uses the above rules to detect anomalies that
may be due either to noise or a genuine change in the other player’s
behavior. If a move is different from what the deterministic rules pre-
dict, this inconsistency triggers an evidence collection process that will
monitor the persistence of the inconsistency in the next few iterations
of the game. The purpose of the evidence-collection process is to deter-
mine whether the violation is likely to be due to noise or to a change
in the other player’s policy. If the inconsistency does not persist, DBS
asserts that the deviation is due to noise; if the inconsistency persists,
DBS assumes there is a change in the other player’s behavior.
(3) Temporarily tolerating possible misbehaviors by the other
player. Until the evidence-collection process finishes, DBS assumes
that the other player’s behavior is still as described by the determin-
istic rules. Once the evidence collection process has finished, DBS de-
cides whether to believe the other player’s behavior has changed, and
updates the deterministic rules accordingly.
Since DBS emphasizes the use of deterministic behaviors to distinguish
noise from the change of the other player’s behavior, it works well when
the other player uses a pure (i.e., deterministic) strategy or a strategy that
makes decisions deterministically most of the time. Fortunately, determin-
istic behaviors are abundant in the Iterated Prisoner’s Dilemma. Many
well-known strategies, such as TFT and GRIM, are pure strategies. Some
strategies, such as Pavlov or the Win-Stay, Lose-Shift strategy (WSLS) [Kraines
and Kraines (1989, 1993, 1995); Nowak and Sigmund (1993)] are not pure
strategies, but a large part of their behavior is still deterministic. The rea-
son for the prevalence of determinism is discussed by Axelrod in [Axelrod
(1984)]: clarity of behavior is an important ingredient of long-term cooper-
ation. A strategy such as TFT benefits from its clarity of behavior, because
it allows other players to make credible predictions of TFT’s responses to
their actions. We believe our strategy succeeded in the competition because this clarity of behavior also helped us to fend off noise.
The results of the competition show that the techniques used in DBS
are indeed an effective way to fend off noise and maintain cooperation in
noisy environments. When DBS defers judgment about whether the other
player’s behavior has changed, the potential cost is that DBS may not
be able to respond to a genuine change of the other player’s behavior as
quickly as possible, thus losing a few points by not retaliating immediately.
But this delay is only temporary, and after it DBS will adapt to the new
behavior. More importantly, the techniques used in DBS greatly reduce
the probability that noise will cause it to end a cooperation and fall into
a mutual-defect situation. Our experience has been that it is hard to re-
establish cooperation from a mutual-defection situation, so it is better to avoid getting into mutual-defection situations in the first place. Compared with the potential cost of ending a cooperation, the cost of temporarily tolerating some defections is worthwhile.
Temporary tolerance also benefits us in another way. In the noisy It-
erated Prisoner’s Dilemma, there are two types of noise: one that affects
the other player's move, and one that affects our move. While our method
effectively handles the first type of noise, it is the other player’s job to deal
with the second type of noise. Some players such as TFT are easily pro-
voked by the second type of noise and retaliate immediately. Fortunately, if
the retaliation is not a permanent one, our method will treat the retaliation
in the same way as the first type of noise, thus minimizing its effect.
10.3. Iterated Prisoner’s Dilemma with Noise
In the Iterated Prisoner’s Dilemma, two players play a finite sequence of
classical prisoner’s dilemma games, whose payoff matrix is:
                       Player 2
                       Cooperate        Defect
Player 1  Cooperate    (u_CC, u_CC)     (u_CD, u_DC)
          Defect       (u_DC, u_CD)     (u_DD, u_DD)

where u_DC > u_CC > u_DD > u_CD and 2u_CC > u_DC + u_CD. In the competition, u_DC, u_CC, u_DD and u_CD are 5, 3, 1 and 0, respectively.
At the beginning of the game, each player knows nothing about the
other player and does not know how many iterations it will play. In each
iteration, each player chooses either to cooperate (C) or defect (D), and
their payoffs in that iteration are as shown in the payoff matrix. We call
this decision a move or an action. After both players choose a move, they
will each be informed of the other player’s move before the next iteration
begins.
If a_k, b_k ∈ {C, D} are the moves of Player 1 and Player 2 in iteration k, then we say that (a_k, b_k) is the interaction of iteration k. If there are N iterations in a game, then the total scores for Player 1 and Player 2 are Σ_{1≤k≤N} u_{a_k b_k} and Σ_{1≤k≤N} u_{b_k a_k}, respectively.
The Noisy Iterated Prisoner’s Dilemma is a variant of the Iterated Pris-
oner’s Dilemma in which there is a small probability that a player’s moves
will be mis-implemented. The probability is called the noise level.‡ In other
words, the noise level is the probability of executing C when D was the in-
tended move, or vice versa. The incorrect move is recorded as the player’s
move, and determines the interaction of the iteration.§ Furthermore, nei-
ther player has any way of knowing whether the other player’s move was
executed correctly or incorrectly.¶
For example, suppose Player 1 chooses C and Player 2 chooses D in iteration k, and noise occurs and affects Player 1's move. Then the interaction of iteration k is (D,D). However, since neither player knows that Player 1's move has been changed by noise, Player 1 and Player 2 perceive the interaction differently: for Player 1, the interaction is (C,D), but for Player 2, the interaction is (D,D). As in real life, this misunderstanding would become an obstacle to establishing and maintaining
‡The noise level in the competition was 0.1.
§Hence, a mis-implementation is different from a misperception, which would not change
the interaction of the iteration. The competition included mis-implementations but no
misperceptions.
¶As far as we know, the definitions of “mis-implementation” used in the existing litera-
ture are ambiguous about whether either of the players should know that an action has
been mis-executed.
cooperation between the players.
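To make the mis-implementation model concrete, here is a minimal Python sketch (ours, not from any competition code) of one noisy iteration, assuming the competition's payoff values and noise level:

import random

U = {('C','C'): (3, 3), ('C','D'): (0, 5),     # payoffs (Player 1, Player 2)
     ('D','C'): (5, 0), ('D','D'): (1, 1)}

def execute(intended, noise_level=0.1):
    # Mis-implement the intended move with probability equal to the noise level.
    if random.random() < noise_level:
        return 'D' if intended == 'C' else 'C' # the flipped move is recorded
    return intended

a, b = execute('C'), execute('D')   # intended moves: Player 1 C, Player 2 D
payoff1, payoff2 = U[(a, b)]
# Neither player is told whether the executed moves match the intended ones.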
10.4. Strategies, Policies, and Hypothesized Policies
A history H of length k is the sequence of interactions of all iterations up to and including iteration k. We write H = 〈(a_1, b_1), (a_2, b_2), ..., (a_k, b_k)〉. Let ℋ = {(C,C), (C,D), (D,C), (D,D)}∗ be the set of all possible histories. A strategy M : ℋ → [0, 1] associates with each history H a real number called the degree of cooperation. M(H) is the probability that M chooses to cooperate at iteration k + 1, where k = |H| is H's length.
For example, TFT can be considered as a function M_TFT such that (1) M_TFT(H) = 1.0 if k = 0 or a_k = C (where k = |H|), and (2) M_TFT(H) = 0.0 otherwise. Tit-for-Two-Tats (TFTT), which is like TFT except that it defects only after it receives two consecutive defections, can be considered as a function M_TFTT such that (1) M_TFTT(H) = 0.0 if k ≥ 2 and a_{k−1} = a_k = D, and (2) M_TFTT(H) = 1.0 otherwise.
We can model a strategy as a policy. A condition Cond : ℋ → {True, False} is a mapping from histories to boolean values. A history H satisfies a condition Cond if and only if Cond(H) = True. A policy schema Ω is a set of conditions such that each history in ℋ satisfies exactly one of the conditions in Ω. A rule is a pair (Cond, p), which we will write as Cond → p, where Cond is a condition and p is a degree of cooperation (a real number in [0, 1]). A rule is deterministic if p is either 0.0 or 1.0; otherwise, the rule is probabilistic. In this chapter, we define a policy to be a set of rules whose conditions constitute a policy schema.
M_TFT can be modeled as a policy as follows: we define Cond_{a,b} to be a condition about the interaction of the last iteration of a history, such that Cond_{a,b}(H) = True if and only if (1) k ≥ 1, a_k = a and b_k = b (where k = |H|), or (2) k = 0 and a = b = C. For simplicity, we also write Cond_{a,b} as (a, b). The policy for M_TFT is π_TFT = {(C,C) → 1.0, (C,D) → 1.0, (D,C) → 0.0, (D,D) → 0.0}. Notice that the policy schema for π_TFT is Ω = {(C,C), (C,D), (D,C), (D,D)}.
Given a policy π and a history H, there is one and only one rule Cond → p in π such that Cond(H) = True. We write p as π(H). A policy π is complete for a strategy M if and only if π(H) = M(H) for every H ∈ ℋ. In other words, a complete policy for a strategy is one that completely models the strategy. For instance, π_TFT is a complete policy for M_TFT.
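Under this last-interaction policy schema, a policy is essentially a lookup table from the previous interaction to a degree of cooperation. A minimal Python sketch (our illustration, not the authors' code) of π_TFT and of evaluating π(H):

PI_TFT = {('C','C'): 1.0, ('C','D'): 1.0,
          ('D','C'): 0.0, ('D','D'): 0.0}      # pi_TFT as given above

def degree_of_cooperation(policy, history):
    # Return pi(H): the value of the unique rule whose condition H satisfies.
    cond = history[-1] if history else ('C', 'C')  # empty history satisfies (C,C)
    return policy[cond]

H = [('C','C'), ('D','C')]                     # we defected in the last iteration
print(degree_of_cooperation(PI_TFT, H))        # 0.0: TFT will defect next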
Some strategies are much more complicated than TFT—we need a large
number of rules in order to completely model these strategies. If the number
of iterations is small and the strategy is complicated enough, it is difficult
or impossible for DBS to obtain a complete model of the other player’s
strategy. Therefore, DBS does not aim at obtaining a complete policy of
the other player's strategy; instead, DBS learns an approximation of the
other player’s strategy during a game, using a small number of rules. In
order to distinguish this approximation from the complete policies for a
strategy, we call this approximation a hypothesized policy.
Given a policy schema Ω, DBS constructs a hypothesized policy π whose policy schema is Ω. The degrees of cooperation of the rules in π are estimated by a learning function (e.g., the learning methods in Section 10.6), which computes the degrees of cooperation according to the current history. For example, suppose the other player's strategy is M_TFTT, the given policy schema is Ω = {(C,C), (C,D), (D,C), (D,D)}, and the current history is H = 〈(C,C), (D,C), (C,C), (D,C), (D,C), (D,D), (C,D), (C,C)〉. If we use a learning method that computes the degrees of cooperation by averaging how often the next action is C when a condition holds, then the hypothesized policy is π = {(C,C) → 1.0, (C,D) → 1.0, (D,C) → 0.66, (D,D) → 0.0}. Notice that the rule (D,C) → 0.66 does not accurately model M_TFTT; this probabilistic rule is just an approximation of what M_TFTT does when the condition (D,C) holds. This approximation is inaccurate as long as the policy schema contains (D,C): there is no complete policy for M_TFTT whose policy schema contains (D,C). If we want to model M_TFTT correctly, we need a different policy schema that allows us to specify more complicated rules.
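The averaging learning method in this example is easy to reproduce; the short Python check below (our illustration) recovers the hypothesized policy above from H (Python rounds 2/3 to 0.67 where the text truncates it to 0.66):

from collections import defaultdict

H = [('C','C'), ('D','C'), ('C','C'), ('D','C'),
     ('D','C'), ('D','D'), ('C','D'), ('C','C')]

counts = defaultdict(lambda: [0, 0])           # condition -> [cooperations, occurrences]
for i in range(len(H)):                        # prefix of length i, response b_{i+1}
    cond = H[i - 1] if i >= 1 else ('C', 'C')  # empty prefix satisfies (C,C)
    counts[cond][0] += int(H[i][1] == 'C')
    counts[cond][1] += 1

pi = {cond: round(c / n, 2) for cond, (c, n) in counts.items()}
print(pi)  # {('C','C'): 1.0, ('D','C'): 0.67, ('D','D'): 0.0, ('C','D'): 1.0}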
We interpret a hypothesized policy as a belief about what the other player will do in the next few iterations in response to our next few moves. This belief does not necessarily hold in the long run, since the other player can behave differently at different times in a game. Even worse, there is no guarantee that this belief is true in the next few iterations. Nonetheless, hypothesized policies constructed by DBS usually have a high degree of accuracy in predicting what the other player will do.
This belief is subjective: it depends on the choice of the policy schema and the learning function. We formally define this subjective viewpoint as follows. The hypothesized policy space spanned by a policy schema Ω and a learning function L : Ω × ℋ → [0, 1] is the set of policies Π = {π(H) : H ∈ ℋ}, where π(H) = {Cond → L(Cond, H) : Cond ∈ Ω}. Let H be a history of a game in which the other player's strategy is M. The set of all possible hypothesized policies for M in this game is {π(H_k) : H_k ∈ prefixes(H)} ⊆ Π, where prefixes(H) is the set of all prefixes of H, and H_k is the prefix
of length k of H. We say π(H_k) is the current hypothesized policy of M in iteration k. A rule Cond → p in π(H_k) describes a particular behavior of the other player's strategy in iteration k. The behavior is deterministic if p is either zero or one; otherwise, the behavior is random or probabilistic. If π(H_k) ≠ π(H_{k+1}), we say there is a change of the hypothesized policy in iteration k + 1, and the behaviors described by the rules in π(H_k) \ π(H_{k+1}) have changed.
10.5. Derived Belief Strategy
In the ordinary Iterated Prisoner's Dilemma (i.e., without any noise), if we know the other player's strategy and how many iterations the game will last, we can compute an optimal strategy against the other player by trying every possible sequence of moves to see which sequence yields the highest score, assuming we have sufficient computational power. However, we are missing both pieces of information, so it is impossible for us to compute an optimal strategy, even with sufficient computing resources. Therefore, we can at most predict the other player's moves based on the history of a game, subject to the fact that the game may terminate at any time.
Some strategies for the Iterated Prisoner’s Dilemma do not predict the
other player’s moves at all. For example, Tit-for-Tat and GRIM react de-
terministically to the other player’s previous moves according to fixed sets
of rules, no matter how the other player actually plays. Many strategies
adapt to the other player’s strategy over the course of the game: for exam-
ple, Pavlov [Kraines and Kraines (1989)] adjusts its degree of cooperation
according to the history of a game. However, these strategies do not take
any prior information about the other player’s strategy as an input; thus
they are unable to make use of this important piece of information even
when it is available.
Let us consider a class of strategies that make use of a model of the other
player’s strategy to make decisions. Figure 10.1 shows an abstract represen-
tation of these strategies. These strategies start out by assuming
that the other player’s strategy is TFT or some other strategy. In every
iteration of the game, the model is updated according to the current history
(using UpdateModel). These strategies decide which move it should make
in each iteration using a move generator (GenerateMove), which depends
on the current model of the other player’s strategy of the iteration.
DBS belongs to this class of strategies. DBS maintains a model of
the other player in form of a hypothesized policy throughout a game, and
Procedure StrategyUsingModelOfTheOtherPlayer()
π ← InitialModel() // the current model of the other player
H ← ∅ // the current history
a← GenerateMove(π,H) // the initial move
Loop until the end of the game
Output our move a and obtain the other player’s move b
H ← 〈H, (a, b)〉
π ← UpdateModel(π,H)
a← GenerateMove(π,H)
End Loop
Fig. 10.1. An abstract representation of a class of strategies that generate moves using
a model of the other player.
makes decisions based on this hypothesized policy. The key issue for DBS in
this process is how to maintain a good approximation of the other player’s
strategy, despite the fact that some actions in the history are affected by noise. A
good approximation will increase the quality of moves generated by DBS,
since the move generator in DBS depends on an accurate model of the other
player’s behavior.
The approach DBS uses to minimize the effect of noise on the hypoth-
esized policy has been discussed in Section 10.2: temporarily tolerate pos-
sible misbehaviors by the other player, and then update the hypothesized
policy only if DBS believes that the misbehavior is due to a genuine change
of behaviors. Figure 10.2 shows an outline of the implementation of this
approach in DBS. As we can see, DBS does not maintain the hypothesized
policy explicitly; instead, DBS maintains three sets of rules: the default
rule set (R_d), the current rule set (R_c), and the probabilistic rule set (R_p).
DBS combines these rule sets to form a hypothesized policy for move gen-
eration. In addition, DBS maintains several auxiliary variables (promotion
counts and violation counts) to facilitate the update of these rule sets. We
will explain every line in Figure 10.2 in detail in the next section.
10.6. Learning Hypothesized Policies in Noisy Environments
We will describe how DBS learns and maintains a hypothesized policy for
the other player’s strategy in this section. Section 10.6.1 describes how
DBS uses discounted frequencies for each behavior to estimate the degree of
Procedure DerivedBeliefStrategy()
1. R_d ← π_TFT // the default rule set
2. R_c ← ∅ // the current rule set
3. a_0 ← C ; b_0 ← C ; H ← 〈(a_0, b_0)〉 ; π ← R_d ; k ← 1 ; v ← 0
4. a_1 ← MoveGen(π, H)
5. Loop until the end of the game
6.   Output a_k and obtain the other player's move b_k
7.   r+ ← ((a_{k−1}, b_{k−1}) → b_k)
8.   r− ← ((a_{k−1}, b_{k−1}) → ({C, D} \ {b_k}))
9.   If r+, r− ∉ R_c, then
10.    If ShouldPromote(r+) = true, then insert r+ into R_c.
11.  If r+ ∈ R_c, then set the violation count of r+ to zero
12.  If r− ∈ R_c and ShouldDemote(r−) = true, then
13.    R_d ← R_c ∪ R_d ; R_c ← ∅ ; v ← 0
14.  If r− ∈ R_d, then v ← v + 1
15.  If v > RejectThreshold, or (r+ ∈ R_c and r− ∈ R_d), then
16.    R_d ← ∅ ; v ← 0
17.  R_p ← {(Cond → p′) ∈ ψ_{k+1} : Cond does not appear in R_c or R_d}
18.  π ← R_c ∪ R_d ∪ R_p // construct a hypothesized policy
19.  H ← 〈H, (a_k, b_k)〉 ; a_{k+1} ← MoveGen(π, H) ; k ← k + 1
20. End Loop
Fig. 10.2. An outline of the DBS strategy. ShouldPromote first increases r+'s promotion count, and then if r+'s promotion count exceeds the promotion threshold, ShouldPromote returns true and resets r+'s promotion count. Likewise, ShouldDemote first increases r−'s violation count, and then if r−'s violation count exceeds the violation threshold, ShouldDemote returns true and resets r−'s violation count. R_p in Line 17 is the probabilistic rule set; ψ_{k+1} in Line 17 is calculated from Equation 10.1.
cooperation of each rule in the hypothesized policy. Section 10.6.2 explains
why using discounted frequencies alone is not sufficient for constructing an
accurate model of the other player’s strategy in the presence of noise, and
how symbolic noise detection and temporary tolerance can help overcome
the difficulty in using discounted frequencies alone. Section 10.6.3 presents
the induction technique DBS uses to identify deterministic behaviors in the
other player. Section 10.6.4 illustrates how DBS defers judgment about
whether an anomaly is due to noise. Section 10.6.5 discusses how DBS
updates the hypothesized policy when it detects a change of behavior.
10.6.1. Learning by Discounted Frequencies
We now describe a simple way to estimate the degree of cooperation of
the rules in the hypothesized policy. The idea is to maintain a discounted
frequency for each behavior: instead of keeping an ordinary frequency count
of how often the other player cooperates under a condition in the past, DBS
applies discount factors based on how recent each occurrence of the behavior
was.
Given a history H = 〈(a_1, b_1), (a_2, b_2), ..., (a_k, b_k)〉, a real number α between 0 and 1 (called the discount factor), and an initial hypothesized policy π^0 = {Cond_1 → p^0_1, Cond_2 → p^0_2, ..., Cond_n → p^0_n} whose policy schema is C = {Cond_1, Cond_2, ..., Cond_n}, the probabilistic policy at iteration k + 1 is ψ_{k+1} = {Cond_1 → p^{k+1}_1, Cond_2 → p^{k+1}_2, ..., Cond_n → p^{k+1}_n}, where p^{k+1}_i is computed by the following equation:

    p^{k+1}_i = [ Σ_{0≤j≤k} α^{k−j} g_j ] / [ Σ_{0≤j≤k} α^{k−j} f_j ]        (10.1)

and where

    g_j = p^0_i if j = 0;  1 if 1 ≤ j ≤ k, Cond_i(H_{j−1}) = True and b_j = C;  0 otherwise;
    f_j = 1 if j = 0;  1 if 1 ≤ j ≤ k, Cond_i(H_{j−1}) = True;  0 otherwise;
    H_{j−1} = 〈〉 if j = 1, and H_{j−1} = 〈(a_1, b_1), (a_2, b_2), ..., (a_{j−1}, b_{j−1})〉 otherwise.

In short, the current history H has k + 1 possible prefixes (counting the empty prefix), and f_j is basically a boolean function indicating whether the prefix of H up to the (j − 1)'th iteration satisfies Cond_i; g_j is a restricted version of f_j.
When α = 1, p_i is approximately equal to the frequency with which the other player cooperates when Cond_i holds. When α is less than 1, p_i becomes a weighted average that gives more weight to recent events than to earlier ones. For our purposes, it is important to use α < 1, because it may happen that the other player changes its behavior suddenly, and therefore we should forget about its past behavior and adapt to its new behavior (for instance, when GRIM is triggered). In the competition, we used α = 0.75.
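Equation 10.1 is straightforward to implement. The following Python sketch (ours) assumes the last-interaction policy schema used in the competition; as a check, with α = 0.75 and an initial degree of cooperation of zero, the estimate after six mutual cooperations is about 0.95, matching the discounted-frequency curve discussed in Section 10.6.3:

def discounted_degree_of_cooperation(history, cond, p0=0.0, alpha=0.75):
    # Discounted-frequency estimate (Equation 10.1) of the probability that
    # the other player cooperates after an interaction equal to cond.
    k = len(history)
    num = (alpha ** k) * p0    # g_0 term: the prior acts as one pseudo-observation
    den = (alpha ** k) * 1.0   # f_0 term
    for j in range(1, k + 1):  # iterations 1..k
        satisfied = (cond == ('C','C')) if j == 1 else (history[j - 2] == cond)
        if satisfied:          # Cond_i(H_{j-1}) = True
            w = alpha ** (k - j)
            den += w
            if history[j - 1][1] == 'C':   # b_j = C
                num += w
    return num / den

print(discounted_degree_of_cooperation([('C','C')] * 6, ('C','C')))  # ~0.949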
An important question is how large a policy schema to use for the hy-
pothesized policy. If the policy schema is too small, the policy schema won’t
provide enough detail to give useful predictions of the other player’s behav-
ior. But if the policy schema is too large, DBS will be unable to compute
an accurate approximation of each rule’s degree of cooperation, because the
number of iterations in the game will be too small. In the competition, we
used a policy schema of size 4: {(C,C), (C,D), (D,C), (D,D)}. We have
found this to be good enough for modeling a large number of strategies.
It is essential to have a good initial hypothesized strategy because at
the beginning of the game the history is not long enough for us to derive
any meaningful information about the other player’s strategy. In the com-
petition, the initial hypothesized policy is π_TFT = {(C,C) → 1.0, (C,D) → 1.0, (D,C) → 0.0, (D,D) → 0.0}.
10.6.2. Deficiencies of Discounted Frequencies in Noisy Environments
It may appear that the probabilistic policy learned by the discounted-
frequency learning technique should be inherently capable of tolerating
noise, because it takes many, if not all, moves in the history into account:
if the number of terms in the calculation of the average or weighted average
is large enough, the effect of noise should be small. However, there is a
problem with this reasoning: it neglects the effect of multiple occurrences
of noise within a small time interval.
A mis-implementation that alters the move of one player would distort
an established pattern of behavior observed by the other player. The general effect of such a distortion on Equation 10.1 is hard to predict; it varies with the values of the parameters and the history. But if several distortions
occur within a small time interval, the distortion may be big enough to al-
ter the probabilistic policy and hence change our decision about what move
to make. This change of decision may potentially destroy an established
pattern of mutual cooperation between the players.
At first glance, it might seem rare for several noise events to occur at
nearly the same time. But if the game is long enough, the probability of it
happening can be quite high. The probability of getting two noise events in
two consecutive iterations out of a sequence of i iterations can be computed
recursively as Xi
= p(p + qXi−2) + qX
i−1, providing that X0 = X1 = 0,
where p is the probability of a noise event and q = 1−p. In the competition,
the noise level was p = 0.1 and i = 200, which gives X200 = 0.84. Similarly,
the probabilities of getting three and four noises in consecutive iterations
are 0.16 and 0.018, respectively.
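The recursion is easy to check numerically; a small Python sketch (ours) reproduces the figure quoted above:

def prob_consecutive_noise(i, p=0.1):
    # X_i: probability of two noise events in consecutive iterations
    # somewhere within i iterations.
    q = 1.0 - p
    x = [0.0, 0.0]                              # X_0 = X_1 = 0
    for n in range(2, i + 1):
        x.append(p * (p + q * x[n - 2]) + q * x[n - 1])
    return x[i]

print(round(prob_consecutive_noise(200), 2))    # 0.84, as quoted in the text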
In the 2005 competition, there were 165 players, and each player played each of the other players five times, so every player played 825 games. On average, 693 of a player's games would contain two noise events in two consecutive iterations, 132 would contain three noise events in three consecutive iterations, and 15 would contain four noise events in four consecutive iterations. Clearly, we did not want to ignore situations in which several noise events occur at nearly the same time.
Symbolic noise detection and temporary tolerance, outlined in Section 10.2, provide a way to reduce susceptibility to multiple occurrences of noise within a small time interval. Deterministic rules enable
DBS to detect anomalies in the observed behavior of the other player. DBS
temporarily ignores the anomalies which may or may not be due to noise,
until a better conclusion about the cause of the anomalies can be drawn.
This temporary tolerance prevents DBS from learning from the moves that
may be affected by noise, and hence protects the hypothesized policy from
the influence of errors due to noise. Since the amount of tolerance (and the
accuracy of noise detection) can be controlled by adjusting parameters in
DBS, we can reduce this susceptibility by increasing the amount of tolerance, at the expense of a higher cost of noise detection: losing more points when a genuine change of behavior occurs.
10.6.3. Identifying Deterministic Rules Using Induction
As we discussed in Section 10.2, deterministic behaviors are abundant in the
Iterated Prisoner’s Dilemma. Deterministic behaviors can be modeled by
deterministic rules, whereas random behavior would require probabilistic
rules.
A nice feature about deterministic rules is that they have only two
possible degrees of cooperation: zero or one, as opposed to an infinite set of
possible degrees of cooperation of the probabilistic rules. Therefore, there
should be ways to learn deterministic rules that are much faster than the
discounted frequency method described earlier. For example, if we knew at
the outset which rules were deterministic, it would take only one occurrence
to learn each of them: each time the condition of a deterministic rule was
satisfied, we could assign a degree of cooperation of 1 or 0 depending on
whether the player’s move was C or D.
The trick, of course, is to determine which rules are deterministic. We
have developed an inductive-reasoning method to distinguish deterministic
rules from probabilistic rules during learning and to learn the correct degree
of cooperation for the deterministic rules.
In general, induction is the process of deriving general principles from
particular facts or instances. To learn deterministic rules, the idea of induc-
tion can be used as follows. If a certain kind of behavior occurs repeatedly
several times, and during this period of time there is no other behavior
that contradicts this kind of behavior, then we will hypothesize that the
chance of the same kind of behavior occurring in the next few iterations is
pretty high, regardless of how the other player behaved in the remote past.
More precisely, let K ≥ 1 be a number which we will call the promotion threshold. Let H = 〈(a_1, b_1), (a_2, b_2), ..., (a_k, b_k)〉 be the current history. For each condition Cond_j ∈ C, let I_j be the set of indexes such that for all i ∈ I_j, i < k and Cond_j(〈(a_1, b_1), (a_2, b_2), ..., (a_i, b_i)〉) = True. Let Î_j be the set of the largest K indexes in I_j. If |I_j| ≥ K and for all i ∈ Î_j, b_{i+1} = C (i.e., the other player chose C whenever the previous history up to the i'th iteration satisfied Cond_j), then we will hypothesize that the other player will choose C whenever Cond_j is satisfied; hence we will use Cond_j → 1 as a deterministic rule. Likewise, if |I_j| ≥ K and for all i ∈ Î_j, b_{i+1} = D, we will use Cond_j → 0 as a deterministic rule. See Line 7 to Line 10 in Figure 10.2 for an outline of the induction method we use in DBS.
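The induction step can be sketched in a few lines of Python. This is our illustration for the last-interaction policy schema with promotion threshold K = 3, not the authors' code; it returns the degree of cooperation of a newly hypothesized deterministic rule, or None:

def induce_deterministic_rule(history, cond, K=3):
    responses = []                     # b_{i+1} for every prefix satisfying cond
    for i in range(len(history)):      # prefix lengths i = 0, ..., k-1 (so i < k)
        satisfied = (cond == ('C','C')) if i == 0 else (history[i - 1] == cond)
        if satisfied:
            responses.append(history[i][1])
    if len(responses) >= K:            # |I_j| >= K
        last_K = responses[-K:]        # the largest K indexes in I_j
        if all(b == 'C' for b in last_K):
            return 1.0                 # deterministic rule: cond -> 1
        if all(b == 'D' for b in last_K):
            return 0.0                 # deterministic rule: cond -> 0
    return None                        # keep treating the rule as probabilistic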
The induction method can be faster at learning deterministic rules than
the discounted frequency method that regards a rule as deterministic when
the degree of cooperation estimated by discounted frequencies is above or
below certain thresholds. As can be seen in Figure 10.3, the induction
method takes only three iterations to infer the other player’s moves cor-
rectly, whereas the discounted frequency technique takes six iterations to
obtain a 95% degree of cooperation, and it never becomes 100%.‖ We may
want to set the threshold in the discounted frequency method to be less than
0.8 to make it faster than the induction method. However, this will increase
the chance of incorrectly identifying a random behavior as deterministic.
A faster learning speed allows us to infer deterministic rules with a
shorter history, and hence increase the effectiveness of symbolic noise de-
tection by having more deterministic rules at any time, especially when a
change of the other player’s behavior occurs. The promotion threshold K
controls the speed of the identification of deterministic rules. The larger the
value of K, the slower the speed of identification, but the less likely we will
mistakenly hypothesize that the other player’s behavior is deterministic.
‖If we modify Equation 10.1 to discard the early interactions of a game, the degree of
cooperation of a probabilistic rule can attain 100%.
[Figure 10.3 here: degree of cooperation (y-axis) plotted against iteration (x-axis) for the induction method and the discounted frequency method.]
Fig. 10.3. Learning speeds of the induction method and the discounted frequency method when the other player always cooperates. The initial degree of cooperation is zero, the discount factor is 0.75, and the promotion threshold is 3.
10.6.4. Symbolic Noise Detection and Temporary Tolerance
Once DBS has identified the set of deterministic rules, it can readily use them to detect noise. As we said earlier, if the other player's move violates a deterministic rule, this can be caused either by noise or by a change in the other player's behavior, and DBS uses an evidence collection process to figure out which is the case. More precisely, once a deterministic rule Cond_i → o_i is violated (i.e., the history up to the previous iteration satisfies Cond_i but the other player's move in the current iteration is different from o_i), DBS keeps the violated rule but marks it as violated. Then DBS starts an evidence collection process, which in the implementation of our competition entries is violation counting: for each violated deterministic rule, DBS maintains a counter called the violation count to record how many violations of the rule have occurred (Line 12).∗∗ In the subsequent iterations, DBS increases the violation count by one every time a violation of the rule occurs. However, if DBS encounters a positive example of the rule, DBS resets the violation count to zero and unmarks the rule (Line 11). If any violation count exceeds a threshold called the violation threshold, DBS concludes that the violation is not due to noise; it is due to a change of the other player's behavior. In this case, DBS invokes a special procedure
∗∗We believe that a better evidence collection process should be based on statistical
hypothesis testing.
(described in Section 10.6.5) to handle this situation (Line 13).
This evidence collection process takes advantage of the fact that the pattern of moves affected by noise is often quite different from the pattern of moves generated by the new behavior after a change of behavior occurs. Therefore, it can often distinguish noise from a change of behavior by observing the moves in the next few iterations and gathering enough evidence.
As discussed in Section 10.6.2, we want to set a larger violation threshold in order to avoid the drawback the discounted frequency method has in dealing with several mis-implementations caused by noise within a small time interval. However, if the threshold is too large, it will slow down the speed of adaptation to changes in the other player's behavior. In the competition, we entered DBS several times with several different violation thresholds; in the one that performed best, the violation threshold was 4.
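A bare-bones Python sketch of this violation counting (the class and method names are ours, and the threshold of 4 is the setting quoted above):

class DeterministicRule:
    # One deterministic rule Cond -> move, together with its violation count.
    def __init__(self, cond, move):
        self.cond, self.move = cond, move    # e.g. ('C','C') -> 'C'
        self.violations = 0                  # nonzero means 'marked as violated'

    def observe(self, their_move, violation_threshold=4):
        if their_move == self.move:
            self.violations = 0              # positive example: unmark the rule
            return 'consistent'
        self.violations += 1                 # tolerate the anomaly, but count it
        if self.violations > violation_threshold:
            return 'behavior-changed'        # conclude it is not noise
        return 'possibly-noise'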
10.6.5. Coping with Ignorance of the Other Player's New Behavior
When the evidence collection process detects a change in the other player’s
behavior, DBS knows little about the other player’s new behavior. How
DBS copes with this ignorance is critical to its success.
Since DBS knows little about the other player's new behavior when it detects a change of the other player's behavior, DBS temporarily uses the previous hypothesized policy as the current hypothesized policy, until it deems that this substitution no longer works. More precisely, DBS maintains two sets of deterministic rules: the current rule set R_c and the default rule set R_d. R_c is the set of deterministic rules that is learned after the change of behavior occurs, while R_d is the set of deterministic rules from before the change of behavior occurs. At the beginning of a game, R_d is π_TFT and R_c is an empty set (Line 1 and Line 2). When DBS constructs a hypothesized policy π for move generation, it uses every rule in R_c and R_d. In addition, any missing rule (i.e., a rule whose condition differs from the condition of every rule in R_c or R_d) is regarded as a probabilistic rule, and its degree of cooperation is approximated by Equation 10.1 (Line 17). These probabilistic rules form the probabilistic rule set R_p ⊆ ψ_{k+1}.
While DBS can insert any newly found deterministic rule into R_c, it inserts rules into R_d only when the evidence collection process detects a change of the other player's behavior. When that happens, DBS copies all the rules in R_c to R_d, and then sets R_c to an empty set (Line 13).
The default rule set is designed to be rejected: we maintain a violation
count to record the number of violations of the rules in R_d. Every time any rule in R_d is violated, the violation count is increased by 1 (Line 14). Once the violation count exceeds a rejection threshold, we drop the default rule set entirely (set it to an empty set) and reset the violation count (Line 15 and Line 16). We also reject R_d whenever any rule in R_c contradicts any rule in R_d (Line 15).
We preserve the rules in R_c mainly for the sake of providing a smooth transition: we don't want to convert all deterministic rules to probabilistic rules at once, as doing so might suddenly alter the course of our moves, since the move generator in DBS generates moves according to the current hypothesized policy only. This sudden change in DBS's behavior could potentially disrupt the cooperative relationship with the other player. Furthermore, some of the rules in R_c may still hold, and we don't want to learn them from scratch.
Notice that symbolic noise detection and temporary tolerance make use of the rules in R_c but not the rules in R_d, although DBS makes use of the rules in both R_c and R_d when it decides the next move (Line 18). We do not use R_d for symbolic noise detection and temporary tolerance because when DBS inserts rules into R_d, a change of the other player's behavior has already occurred, so there is little reason to believe that anomalies detected using the rules in R_d are due to noise. Furthermore, we want to turn off symbolic noise detection and temporary tolerance temporarily when a change of behavior occurs, in order to identify a whole new set of deterministic rules from scratch.
10.7. The Move Generator in DBS
We devised a simple and reasonably effective move generator for DBS. As
shown in Figure 10.1, the move generator takes the current hypothesized policy π and the current history H_current, whose length is l = |H_current|, and then decides whether DBS should cooperate in the current iteration. It is difficult to devise a good move generator, because our move could lead to a change of the hypothesized policy and complicate our projection of the long-term payoff. Perhaps the move generator should take the other player's model of DBS into account [Carmel and Markovitch (1994)]. However, we found that by assuming that the hypothesized policy will not change for the rest of the game, we can devise a simple move generator that generates fairly good moves. The idea is that we compute the
erator that generates fairly good moves. The idea is that we compute the
maximum expected score we can possibly earn for the rest of the game, us-
ing a technique that combines some ideas from both game-tree search and
Markov Decision Processes (MDPs). Then we choose the first move in the
set of moves that leads to this maximum expected score as our move for
the current iteration.
To accomplish the above, we consider all possible histories whose prefix is H_current as a tree. In this tree, each path starting from the root represents a possible history, which is a sequence of past interactions in H_current
plus
a sequence of possible interactions in future iterations. Each node on a path
represents the interaction of an iteration of a history. Figure 10.4 shows an
example of such a tree. The root node of the tree represents the interaction
of the first iteration.
Let interaction(S) be the interaction represented by a node S. Let 〈S_0, S_1, ..., S_k〉 be the sequence of nodes on the path from the root S_0 to S_k. We define the depth of S_k to be k − l, and the history of S_k to be H(S_k) = 〈interaction(S_1), interaction(S_2), ..., interaction(S_k)〉. S_i is called the current node if the depth of S_i is zero; the current node represents the interaction of the last iteration, and H(S_i) = H_current. As we do not know when the game will end, we assume it will go on for N∗ more iterations; thus each path in the tree has length at most l + N∗.
Our objective is to compute a non-negative real number called the maximum expected score E(S) for each node S with a non-negative depth. Like a conventional game-tree search in computer chess or checkers, the maximum expected scores are defined recursively: the maximum expected score of a node at depth i is determined by the maximum expected scores of its children at depth i + 1. The maximum expected score of a node S of depth N∗ is assumed to be the value computed by an evaluation function f. This is a mapping from histories to non-negative real numbers, such that E(S) = f(H(S)). The maximum expected score of a node S of depth k, where 0 ≤ k < N∗, is computed by the maximizing rule: suppose the four possible nodes after S are S_CC, S_CD, S_DC, and S_DD, and let p be the degree of cooperation predicted by the current hypothesized policy π (i.e., p is the right-hand side of a rule (Cond → p) in π such that H(S) satisfies the condition Cond). Then E(S) = max{E_C(S), E_D(S)}, where

    E_C(S) = p(u_CC + E(S_CC)) + (1 − p)(u_CD + E(S_CD)) and
    E_D(S) = p(u_DC + E(S_DC)) + (1 − p)(u_DD + E(S_DD)).

Furthermore, we let move(S) be the decision made by the maximizing rule at each node S, i.e., move(S) = C if E_C(S) ≥ E_D(S) and move(S) = D otherwise. By applying this maximizing rule recursively, we obtain the maximum expected score of every node with a non-negative depth. The move that we choose for the current iteration is move(S_i), where S_i is the current node.
[Figure 10.4 here: a tree of interactions; the top four nodes form the path from the root (the first iteration) to the current node (the previous iteration, depth 0), below which the tree branches over possible future interactions at depths 1 and 2.]
Fig. 10.4. An example of the tree that we use to compute the maximum expected scores. Each node denotes the interaction of an iteration. The top four nodes constitute a path representing the current history H_current. The length of H_current is l = 2, and the maximum depth N∗ is 2. There are four edges emanating from each node S after the current node; each of these edges corresponds to a possible interaction of the iteration after S. The maximum expected scores (not shown) of the nodes with depth 2 are set by an evaluation function f; these values are then used to calculate the maximum expected scores of the nodes with depth 1 by using the maximizing rule. Similarly, the maximum expected score of the current node is calculated using the four maximum expected scores of the nodes with depth 1.
The number of nodes in the tree increases exponentially with N∗. Thus, the tree can be huge: there are over a billion nodes when N∗ ≥ 15. It is infeasible to compute the maximum expected score for every node one by one. Fortunately, we can use dynamic programming to speed up the computation. As an example, suppose the hypothesized policy is π = {(C,C) → p_CC, (C,D) → p_CD, (D,C) → p_DC, (D,D) → p_DD}, and suppose the evaluation function f returns a constant f_{o1o2} for any history that satisfies the condition (o_1, o_2), where o_1, o_2 ∈ {C, D}. Then, given our assumption that the hypothesized policy does not change, it is not hard to show by induction that all nodes whose histories have the same length and satisfy the same condition have the same maximum expected score. By using this property, we construct a table of size 4 × (N∗ + 2) in which each entry, denoted by E^k_{o1o2}, stores the maximum expected score of the nodes whose histories have length l + k and satisfy the condition (o_1, o_2), where o_1, o_2 ∈ {C, D}. We also have another table of the same size to record the decisions the procedure makes; the entry m^k_{o1o2} of this table is the decision made at E^k_{o1o2}. Initially, we set E^{N∗+1}_{CC} = f_CC, E^{N∗+1}_{CD} = f_CD,
E^{N∗+1}_{DC} = f_DC, and E^{N∗+1}_{DD} = f_DD. Then the maximum expected scores in the remaining entries can be computed by the following recursive equation:

    E^k_{o1o2} = max( p_{o1o2}(u_CC + E^{k+1}_{CC}) + (1 − p_{o1o2})(u_CD + E^{k+1}_{CD}),
                      p_{o1o2}(u_DC + E^{k+1}_{DC}) + (1 − p_{o1o2})(u_DD + E^{k+1}_{DD}) ),

where o_1, o_2 ∈ {C, D}. Similarly, m^k_{o1o2} = C if p_{o1o2}(u_CC + E^{k+1}_{CC}) + (1 − p_{o1o2})(u_CD + E^{k+1}_{CD}) ≥ p_{o1o2}(u_DC + E^{k+1}_{DC}) + (1 − p_{o1o2})(u_DD + E^{k+1}_{DD}), and m^k_{o1o2} = D otherwise. If the interaction of the previous iteration is (o_1, o_2), we pick m^0_{o1o2} as the move for the current iteration. The pseudocode of this dynamic programming algorithm is shown in Figure 10.5.
Procedure MoveGen(π, H)
  〈p_CC, p_CD, p_DC, p_DD〉 ← π
  〈(a_1, b_1), (a_2, b_2), ..., (a_k, b_k)〉 ← H
  (a_0, b_0) ← (C,C) ; (a, b) ← (a_k, b_k)
  〈E^{N∗+1}_{CC}, E^{N∗+1}_{CD}, E^{N∗+1}_{DC}, E^{N∗+1}_{DD}〉 ← 〈f_CC, f_CD, f_DC, f_DD〉
  For k = N∗ down to 0
    For each (o_1, o_2) in {(C,C), (C,D), (D,C), (D,D)}
      F^k_{o1o2} ← p_{o1o2}(u_CC + E^{k+1}_{CC}) + (1 − p_{o1o2})(u_CD + E^{k+1}_{CD})
      G^k_{o1o2} ← p_{o1o2}(u_DC + E^{k+1}_{DC}) + (1 − p_{o1o2})(u_DD + E^{k+1}_{DD})
      E^k_{o1o2} ← max(F^k_{o1o2}, G^k_{o1o2})
      If F^k_{o1o2} ≥ G^k_{o1o2}, then m^k_{o1o2} ← C
      If F^k_{o1o2} < G^k_{o1o2}, then m^k_{o1o2} ← D
    End For
  End For
  Return m^0_{ab}
Fig. 10.5. The procedure for computing a recommended move for the current iteration. In the competition, we set N∗ = 60, f_CC = 3, f_CD = 0, f_DC = 5, and f_DD = 1.
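The dynamic program is compact enough to transliterate directly. The following Python sketch (ours, not the authors' code) implements MoveGen under the settings in the caption, where the terminal values f happen to coincide with the payoffs u:

U = {('C','C'): 3, ('C','D'): 0, ('D','C'): 5, ('D','D'): 1}   # payoffs u
F_EVAL = dict(U)                     # terminal values f from the caption

def move_gen(policy, last_interaction, n_star=60):
    E = dict(F_EVAL)                 # E^{N*+1}: evaluation-function values
    m = {}
    for _ in range(n_star + 1):      # k = N*, N*-1, ..., 0
        E_new, m_new = {}, {}
        for cond, p in policy.items():   # p: predicted prob. of their cooperation
            F = p * (U['C','C'] + E['C','C']) + (1 - p) * (U['C','D'] + E['C','D'])
            G = p * (U['D','C'] + E['D','C']) + (1 - p) * (U['D','D'] + E['D','D'])
            E_new[cond], m_new[cond] = max(F, G), ('C' if F >= G else 'D')
        E, m = E_new, m_new
    return m[last_interaction]

pi_tft = {('C','C'): 1.0, ('C','D'): 1.0, ('D','C'): 0.0, ('D','D'): 0.0}
print(move_gen(pi_tft, ('C','C')))   # 'C': sustained cooperation beats exploiting TFT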
10.8. Competition Results
The 2005 IPD Competition was actually a set of four competitions, each for
a different version of the IPD. The one for the Noisy IPD was Category 2,
which used a noise level of 0.1.
Of the 165 programs entered into the competition, eight of them were
provided by the organizer of the competition. These programs included
ALLC (always cooperates), ALLD (always defects), GRIM (cooperates un-
til the first defection of the other player, and thereafter it always defects),
NEG (cooperates (or defects) if the other player defected (or cooperated) in the previous iteration), RAND (defects or cooperates with probability 1/2), STFT (suspicious TFT, which is like TFT except that it defects in the first iteration), TFT, and TFTT. All of these strategies are well known in
the literature on IPD.
The remaining 157 programs were submitted by 36 different partici-
pants. Each participant was allowed to submit up to 20 programs. We
submitted the following 20:
• DBS. We entered nine different versions of DBS into the competition,
each with a different set of parameters or different implementation.
The one that performed best was DBSz, which makes use of the exact
set of features we mentioned in this chapter. Versions that have fewer
features or additional features did not do as well.
• Learning of Opponent’s Strategy with Forgiveness (LSF). Like
DBS, LSF is a strategy that learns the other player’s strategy during
the game. The difference between LSF and DBS is that LSF does not
make use of symbolic noise detection. It uses the discounted frequency (Equation 10.1) to learn the other player's strategy, plus a forgiveness strategy that decides when to cooperate if mutual defection occurs. We entered one instance of LSF. It placed around 30th in three of the runs and around 70th in the other two. We believe the poor ranking of LSF is due to the deficiency of using discounted frequencies alone, as we discussed at the beginning of Section 10.6.
• Tit-for-Tat Improved (TFTI). TFTI is a strategy based on a to-
tally different philosophy from DBS’s. It is not an opponent-modeling
strategy, in the sense that it does not model the other player’s behavior
using a set of rules. Instead, it is a variant of TFT with a sophisticated
forgiveness policy that aims at overcoming some of the deficiencies of
TFT in noisy environments. We entered ten instantiations of TFTI in
the competition, each with a different set of parameters or some dif-
ferences in the implementation. The best of these, TFTIm, did well in
the competition (see Table 10.1), but not as well as DBS.
Three of the other participants each entered the full complement
of twenty programs: Wolfgang Kienreich, Jia-wei Li, and Perukrishnen
Vytelingum. All three of them appear to have adopted the master-and-
slaves strategy that was first proposed by Vytelingum’s team from the Uni-
versity of Southampton. A master-and-slaves strategy is not a strategy for
a single program, but instead for a team of collaborating programs. One of
the programs in such a team is the master, and the remaining programs are
slaves. The basic idea is that at the start of a run, the master and slaves
would each make a series of moves using a predefined protocol, in order to
identify themselves to each other. From then on, the master program would
always play “defect” when playing with the slaves, and the slave programs
would always play “cooperate” when playing with the master, so that the
master would gain the highest possible payoff at each iteration. Further-
more, a slave would always play “defect” when playing with a program
other than the master, in order to try to minimize that player’s score.
Wolfgang Kienreich’s master program was CNGF (CosaNostra Godfa-
ther), and its slaves were 19 copies of CNHM (CosaNostra Hitman). Jia-wei
Li’s master program was IMM01 (Intelligent Machine Master 01), and its
slaves were IMS02, IMS03, . . . , IMS20 (Intelligent Machine Slave n, for
n = 02, 03, . . .20). Perukrishnen Vytelingum’s master program was BWIN
(S2Agent1 ZEUS), and its slaves were BLOS2, BLOS3, . . . , BLOS20 (like
BWIN, these programs also had longer names based on the names of ancient
Greek gods).
We do not know what strategies the other participants used in their
programs.
10.8.1. Overall Average Scores
Category 2 (IPD with noise) consisted of five runs. Each run was a round-
robin tournament in which each program played with every program, in-
cluding itself. Each program participated in 166 games in each run (recall
that there is one game in which a player plays against itself, which counts
as two games for that player). Each game consisted of 200 iterations. A
program’s score for a game is the sum of its payoffs over all 200 iterations
(note that this sum will be at least 0 and at most 1000). The program’s
total score for an entire run is the sum of its scores over all 166 games. On
the competition's website, there is a ranking for each of the five runs, in which each
program is ranked according to its total score for the run.
A program’s average score within a run is its total score for the run
divided by 166. The program’s overall average score is its average over all
five runs, i.e., its total over all five runs divided by 830 = 5× 166.
Table 10.1 shows the average scores in each of the five runs
of the top twenty-five programs when the programs are ranked by their
overall average scores. Of our nine different versions of DBS, all nine of
them are among the top twenty-five programs, and they dominate the top
ten places. This suggests that DBS's performance is insensitive to the parameters in the programs and the implementation details of an individual program. The same holds for TFTI: nine out of ten programs using TFTI are ranked between the 11th place and the 25th
place, and the last one is at the 29th place.
10.8.2. DBS versus the Master-and-Slaves Strategies
Recall from Table 10.1 that DBSz placed third in the competition: it lost
only to BWIN and IMM01, the masters of two master-and-slaves strategies.
DBS does not use a master-and-slaves strategy, nor does it conspire with
other programs in any other way—but in contrast, BWIN’s and IMM01’s
performance depended greatly on the points fed to them by their slaves. In
particular,
(1) If we average the score of each master with the scores of its slaves, we get
379.9 for BWIN and 351.7 for IMM01, both of which are considerably
less than DBSz’s score of 408.
(2) A more extensive analysis [Au and Nau (2005)] shows that if the size of
each master-and-slaves team had been limited to less than or equal to
10, DBSz would have outperformed BWIN and IMM01 in the compe-
tition, even without averaging the score of each master with its slaves.
The reason for both phenomena is that the master-and-slaves teams did not cooperate with the other players as much as they did amongst
themselves. In particular, Table 10.2 gives the percentages of each of the
four possible interactions when any program from one group plays with any
program from another group. Note that:
• When BWIN and IMM01 play with their slaves, about 64% and 47% of
the interactions are (D,C), but when non-master-and-slaves strategies
play with each other, only 19% of the interactions are (D,C).
• When the slave programs play with non-master-and-slaves programs,
over 60% of interactions are (D,D), but when non-master-and-slaves
programs play with other non-master-and-slaves programs, only 31%
of the interactions are (D,D).
• The master-and-slaves strategies decrease the overall percentage of
(C,C) from 31% to 13%, and increase the overall percentage of (D,D)
from 31% to 55%.
Table 10.2. Percentages of different interactions. “All but
M&S” means all 105 programs that did not use master-and-slaves
strategies, and “all” means all 165 programs in the competition.
Player 1 Player 2 (C,C) (C,D) (D,C) (D,D)
BWIN BWIN’s slaves 12% 5% 64% 20%
IMM01 IMM01’s slaves 10% 6% 47% 38%
CNGF CNGF’s slaves 2% 10% 10% 77%
BWIN’s slaves all but M&S 5% 9% 24% 62%
IMM01’s slaves all but M&S 7% 9% 23% 61%
CNGF’s slaves all but M&S 4% 8% 24% 64%
TFT all but M&S 33% 20% 20% 27%
DBSz all but M&S 54% 15% 13% 19%
TFTT all but M&S 55% 20% 11% 14%
TFT all 23% 19% 16% 42%
DBSz all 36% 14% 11% 39%
TFTT all 38% 21% 10% 31%
all but M&S all but M&S 31% 19% 19% 31%
all all 13% 16% 16% 55%
10.8.3. A Comparison Between DBSz, TFT, and TFTT
Next, we consider how DBSz performs against TFT and TFTT. Table 10.2
shows that when playing with another cooperative player, TFT cooperates
((C,C) in the table) 33% of the time, DBSz does so 54% of the time, and
TFTT does so 55% of the time. Furthermore, when playing with a player
who defects, TFT defects ((D,D) in the table) 27% of the time, DBSz
does so 19% of the time, and TFTT does so 14% of the time. From this,
one might think that DBSz’s behavior is somewhere between TFT’s and
TFTT’s.
On the other hand, when playing with a player who defects, DBSz
cooperates ((C,D) in the table) only 15% of the time, which is a lower
percentage than for TFT and TFTT (both 20%). Since cooperating with
a defector generates no payoff, this makes TFT and TFTT perform worse
than DBSz overall. DBSz’s average score was 408 and it ranked 3rd, whereas TFTT’s and TFT’s average scores were 388.4 and 388.2, ranking them 30th and 33rd, respectively.
10.9. Related Work
Early studies of the effect of noise in the Iterated Prisoner’s Dilemma fo-
cused on how TFT, a highly successful strategy in noise-free environments,
would do in the presence of noise. TFT is known to be vulnerable to noise;
for instance, if two players use TFT at the same time, noise would trigger long sequences of mutual defections [Molander (1985)]. A number of
people confirmed the negative effects of noise on TFT [Molander (1985);
Bendor (1987); Mueller (1987); Axelrod and Dion (1988); Nowak and Sig-
mund (1990); Bendor et al. (1991)]. Axelrod found that TFT was still the
best decision rule in the rerun of his first tournament with a one percent chance of misperception [Axelrod (1984), page 183], but TFT finished sixth
out of 21 in the rerun of Axelrod’s second tournament with a 10 percent
chance of misperception [Donninger (1986)]. In Competition 2 of the 2005
IPD competition, the noise level was 0.1, and TFT’s overall average score
placed it 33rd out of 165.
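The echo effect is easy to reproduce. The following sketch (ours, for illustration) plays two TFT players against each other for 200 iterations with a noise level of 0.1 and counts the mutual defections:

    import random

    def flip(move, noise=0.1):
        """With probability `noise` the executed move is the opposite of the intended one."""
        return ("D" if move == "C" else "C") if random.random() < noise else move

    def tft(opponent_history):
        return "C" if not opponent_history else opponent_history[-1]

    def count_mutual_defections(iterations=200, noise=0.1):
        h1, h2, count = [], [], 0
        for _ in range(iterations):
            m1 = flip(tft(h2), noise)   # each player reacts to executed moves
            m2 = flip(tft(h1), noise)
            h1.append(m1)
            h2.append(m2)
            count += (m1 == "D" and m2 == "D")
        return count

    # A single accidental defection echoes back and forth, so (D,D) occurs
    # far more often than the noise alone would cause directly:
    print(count_mutual_defections())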
The oldest approach to remedy TFT’s deficiency in dealing with noise
is to be more forgiving in the face of defections. A number of studies found
that more forgiveness promotes cooperation in noisy environments [Bendor
et al. (1991); Mueller (1987)]. For instance, Tit-For-Two-Tats (TFTT), a
strategy submitted by John Maynard Smith to Axelrod’s second tourna-
ment, retaliates only when it receives two defections in two previous itera-
tions. TFTT can tolerate isolated defections caused by noise and more readily avoids the long sequences of mutual defections triggered by
noise. However, TFTT is susceptible to exploitation of its generosity and
was beaten in Axelrod’s second tournament by TESTER, a strategy that
may defect every other move. In Competition 2 of the 2005 IPD Competi-
tion, TFTT ranked 30th, a slightly better ranking than TFT’s. In contrast
to TFTT, DBS can tolerate not only an isolated defection but also a se-
quence of defections caused by noise, and at the same time DBS monitors
the other player’s behavior and retaliates when exploitation behavior is
detected (i.e., when the exploitation causes a change of the hypothesized
policy, which initially is TFT). Furthermore, the retaliation caused by ex-
ploitation continues until the other player shows a high degree of remorse
(i.e., cooperations when DBS defects) that changes the hypothesized policy
to one with which DBS favors cooperations instead of defections.
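TFTT itself takes only a few lines (a sketch in our notation):

    def tftt(opponent_history):
        """Tit-For-Two-Tats: defect only after two consecutive opponent
        defections (opponent_history is a list over 'C'/'D')."""
        return "D" if opponent_history[-2:] == ["D", "D"] else "C"

The exploitability mentioned above is visible directly in the code: an opponent that alternates defection and cooperation never produces two consecutive defections, so TFTT never retaliates.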
[Molander (1985)] proposed to mix TFT with ALLC to form a new
strategy which is now called Generous Tit-For-Tat (GTFT) [Nowak and
Sigmund (1992)]. Like TFTT, GTFT avoids an infinite echo of defections
by cooperating when it receives a defection in certain iterations. The differ-
ence is that GTFT forgives randomly: for each defection it receives, GTFT chooses to cooperate with a small probability (say 10%) and to defect otherwise. DBS, however, does not make use of forgiveness explicitly as in
GTFT; its decisions are based entirely on the hypothesized policy that it
learned. But temporary tolerance can be regarded as a form of forgiveness,
since DBS does not retaliate immediately when a defection occurs in a mu-
tual cooperation situation. This form of forgiveness is carefully planned
and there is no randomness in it.
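For comparison with that deterministic tolerance, GTFT's randomized forgiveness can be sketched as follows (the 10% figure follows the text above; the encoding is ours):

    import random

    def gtft(opponent_history, forgiveness=0.1):
        """Generous Tit-For-Tat: as TFT, but forgive a received defection
        with a small probability (10% here, following the text)."""
        if not opponent_history or opponent_history[-1] == "C":
            return "C"
        return "C" if random.random() < forgiveness else "D"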
Another way to improve TFT in noisy environments is to use contrition:
unilaterally cooperate after making mistakes. One strategy that makes use
of contrition is Contrite TFT (CTFT) [Sugden (1986); Boyd (1989); Wu
and Axelrod (1995)], which does not defect when it knows that noise has
occurred and affected its previous action. However, this is less useful in the
Noisy IPD, since a program does not know whether its action has been affected by noise. DBS does not make use of contrition, though the effect of
temporary tolerance resembles contrition.
A family of strategies called “Pavlovian” strategies, or simply Pavlov, was found to be more successful than TFT in noisy environ-
ments [Kraines and Kraines (1989, 1993, 1995); Nowak and Sigmund
(1993)]. The simplest form of Pavlov is called Win-Stay, Lose-Shift [Nowak
and Sigmund (1993)], because it cooperates only after mutual cooperation
or mutual defection, an idea similar to Simpleton [Rapoport and Chammah
(1965)]. When an accidental defection occurs, Pavlov can resume mu-
tual cooperation in fewer iterations than TFT [Kraines and
Kraines (1989, 1993)]. Pavlov learns by conditioned response through re-
wards and punishments; it adjusts its probability of cooperation according
to the previous interaction. Like Pavlov, DBS learns from its past experi-
ence and makes decisions accordingly. DBS, however, has an intermediate
step between learning from experience and decision making: it maintains a
model of the other player’s behavior, and uses this model to reason about
noise. Although there are probabilistic rules in the hypothesized policy, there is no randomness in its decision-making process.
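The deterministic core of Win-Stay, Lose-Shift is compact (a sketch; the Pavlovian strategies in the literature cited above also adjust cooperation probabilities over time):

    def pavlov(my_history, opponent_history):
        """Win-Stay, Lose-Shift: cooperate exactly after mutual cooperation
        or mutual defection, i.e. iff both players made the same move in
        the previous iteration."""
        if not my_history:
            return "C"
        return "C" if my_history[-1] == opponent_history[-1] else "D"

Equivalently: after a good outcome (the other player cooperated) Pavlov repeats its previous move, and after a bad outcome (the other player defected) it switches.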
For readers who are interested, there are several surveys on the Iterated
Prisoner’s Dilemma with noise [Axelrod and Dion (1988); Hoffmann (2000);
O’Riordan (2001); Kuhn (2001)].
The use of opponent modeling is common in games of imperfect infor-
mation such as Poker [Billings et al. (1998); Barone and While (1998, 1999,
2000); Davidson et al. (2000); Billings et al. (2003)] and RoShamBo [Egnor
(2000)]. One entry in Axelrod’s original IPD tournament used opponent
modeling, but it was not successful. There has been much work on learning the opponent’s strategy in the non-noisy IPD [Dyer (2004); Hingston
and Kendall (2004); Powers and Shoham (2005)]. By assuming the oppo-
nent’s next move depends only on the interactions of the last few iterations,
these works model the opponent’s strategy as probabilistic finite automata,
and then use various learning methods to learn the probabilities in the au-
tomata. For example, [Hingston and Kendall (2004)] proposed an adaptive
agent called an opponent modeling agent (OMA) of order n, which main-
tains a summary of the moves made over the previous n iterations. Like DBS,
OMA learns the probabilities of the other player’s cooperation in different situations using an updating rule similar to Equation 10.1, and
generates a move based on the opponent model by searching a tree similar
to that shown in Figure 10.4. The opponent model in [Dyer (2004)] also has a similar structure. The main way these agents differ from DBS is how they
learn the other player’s strategy, but there are several other differences: for
example, the tree they used has a maximum depth of 4, whereas ours has
a depth of 60.
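To convey the flavor of such an update rule, here is a generic discounted-frequency estimate written for illustration; it is not a reproduction of Equation 10.1, and the learning rate and optimistic initialization are our assumptions:

    ALPHA = 0.3   # assumed learning rate

    def update_model(prob_cooperate, situation, observed_move):
        """Discount the old estimate of P(other player cooperates | situation)
        toward the newly observed move."""
        old = prob_cooperate.get(situation, 1.0)   # assumed optimistic start
        observed = 1.0 if observed_move == "C" else 0.0
        prob_cooperate[situation] = (1 - ALPHA) * old + ALPHA * observed

    model = {}                               # keyed, e.g., by the previous interaction
    update_model(model, ("C", "C"), "D")
    print(model[("C", "C")])                 # 0.7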
The agents of both [Hingston and Kendall (2004)] and [Dyer (2004)]
learned the other player’s strategy by exploration—deliberately making
moves in order to probe the other player’s strategy. The use of exploration
for learning an opponent’s behavior was studied by [Carmel and Markovitch
(1998)], who developed a lookahead-based exploration strategy to balance
between exploration and exploitation and avoid making risky moves during
exploration. [Hingston and Kendall (2004)] and [Dyer (2004)] used a differ-
ent exploration strategy than [Carmel and Markovitch (1998)]; [Hingston
and Kendall (2004)] introduced noise to 1% of their agent’s moves (they
call this method the trembling hand), whereas the agent in [Dyer (2004)]
makes decisions at random when it uses the opponent’s model and finds a
missing value in the model. Both of their agents used a random opponent
model at the beginning of a game.
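The trembling-hand mechanism is particularly simple to express (the wrapper below is ours; the 1% rate is the figure reported above):

    import random

    def trembling_hand(intended_move, rate=0.01):
        """Occasionally replace the intended move with its opposite, so that
        the agent's own 'mistakes' probe the opponent's reactions."""
        if random.random() < rate:
            return "D" if intended_move == "C" else "C"
        return intended_move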
DBS does not make deliberate moves to attempt to explore the other
player’s strategy, because we believe that this is a high-risk, low-payoff proposition in the IPD. We believe it incurs a high risk because many programs in
the competition are adaptive; our defections made in exploration may affect
our long-term relationship with them. We believe it has a low payoff because
the length of a game is usually too short for us to learn any non-trivial
strategy completely. Moreover, the other player may alter its behavior at
the middle of a game, and therefore it is difficult for any learning method
to converge. This is especially true in the noisy IPD, since noise can provoke the other player (e.g., GRIM). Furthermore, our objective is to cooperate with
the other players, not to exploit their weaknesses in order to beat them. So as long as the other player cooperates with us, there is no need to concern ourselves with the rest of its behavior. For these reasons, DBS does not aim at learning the other
player’s strategy completely; instead, it learns the other player’s recent
behavior, which is subject to change. In contrast to the OMA strategy
described earlier in this section, most of our DBS programs cooperated
with each other in the competition.
Our decision-making algorithm combines elements of both minimax
game tree search and the value iteration algorithm for Markov Decision
Processes. In contrast to [Carmel and Markovitch (1994)], we do not model
the other player’s model of our strategy; we assume that the hypothesized
policy does not change for the rest of the game. Obviously this assump-
tion is not valid, because our decisions can affect the decisions of the other
players in the future. Nonetheless, we found that the moves returned by
our algorithm are fairly good responses. For example, if the other player
behaves like TFT, the move returned by our algorithm is to cooperate re-
gardless of the previous interactions; if the other player does not behave
like TFT, our algorithm is likely to return defection, a good move in many
situations.
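The general idea can be sketched as follows. This is our illustration, not DBS's actual algorithm: the payoffs are the standard values consistent with the 0–1000 score range noted earlier, and the discount factor and the number of sweeps are assumptions (DBS searches a tree of depth 60 rather than iterating to convergence).

    # Sketch: choose a move against a fixed hypothesized policy by value
    # iteration. A state is the previous interaction (my move, their move).
    PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
    STATES = list(PAYOFF)
    GAMMA = 0.95   # assumed discount factor

    def q(policy, state, my_move, V):
        p = policy[state]   # hypothesized P(other player cooperates next)
        return sum(prob * (PAYOFF[(my_move, their)] + GAMMA * V[(my_move, their)])
                   for their, prob in (("C", p), ("D", 1 - p)))

    def best_move(policy, state, sweeps=200):
        V = {s: 0.0 for s in STATES}
        for _ in range(sweeps):                # value iteration
            V = {s: max(q(policy, s, m, V) for m in "CD") for s in STATES}
        return max("CD", key=lambda m: q(policy, state, m, V))

    # Against a hypothesized TFT (cooperate iff my previous move was "C"),
    # cooperation is returned in every state, as the text asserts:
    tft_policy = {s: (1.0 if s[0] == "C" else 0.0) for s in STATES}
    print(all(best_move(tft_policy, s) == "C" for s in STATES))   # True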
To the best of our knowledge, ours is the first work on using opponent
models in the IPD to detect errors in the execution of another agent’s
actions.
10.10. Summary and Future Work
For conflict prevention in noisy environments, a critical problem is to distin-
guish between situations where another player has misbehaved intentionally
and situations where the misbehavior was accidental. That is the problem
that DBS was formulated to deal with. DBS’s impressive performance in
the 2005 Iterated Prisoner’s Dilemma competition occurred because DBS
was better able to maintain cooperation in spite of noise than any other
program in the competition.
To distinguish between intentional and unintentional misbehaviors, DBS
uses a combination of symbolic noise detection and temporary tolerance: if
an action of the other player is inconsistent with the player’s past behavior,
we continue as if the player’s behavior has not changed, until we gather
sufficient evidence to see whether the inconsistency was caused by noise or
by a genuine change in the other player’s behavior.
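In outline, the mechanism looks like the following sketch (ours, for illustration; DBS's actual evidence threshold and bookkeeping differ):

    THRESHOLD = 4   # assumed number of anomalies counted as "sufficient evidence"

    def interpret(hypothesized, situation, observed, anomalies):
        """Return the move we act on, deferring judgment on anomalies."""
        if observed == hypothesized[situation]:
            anomalies.clear()            # consistent again: probably noise
            return observed
        anomalies.append((situation, observed))
        if len(anomalies) < THRESHOLD:
            return hypothesized[situation]   # temporary tolerance
        for s, o in anomalies:           # sufficient evidence: the behavior
            hypothesized[s] = o          # really changed, so update the policy
        anomalies.clear()
        return observed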
Since clarity of behavior is an important ingredient of long-term coop-
eration in the IPD, most IPD programs have behavior that follows clear
deterministic patterns. The clarity of these patterns made it possible for
DBS to construct policies that were good approximations of the other play-
ers’ strategies, and to use these policies to fend off noise.
We believe that clarity of behavior is also likely to be important in
other multi-agent environments in which agents have to cooperate with
each other. Thus it seems plausible that techniques similar to those used
in DBS may be useful in those domains.
In the future, we are interested in studying the following issues:
• The evidence collection process takes time, and the delay may invite
exploitation. For example, the policy of temporary tolerance in DBS
may be exploited by a “hypocrite” strategy that behaves like TFT most
of the time but occasionally defects even though DBS did not defect
in the previous iteration. DBS cannot distinguish this kind of inten-
tional defection from noise, even though DBS has a built-in mechanism to monitor exploitation. We are interested in seeing how to avoid this kind of exploitation.
• In multi-agent environments where agents can communicate with each
other, the agents might be able to detect noise by using a predefined
communication protocol. However, we believe there is no protocol that
is guaranteed to tell which action has been affected by noise, as long as
the agents cannot completely trust each other. It would be interesting
to compare these alternative approaches with symbolic noise detection
to see how symbolic noise detection could enhance these methods or
vice versa.
• The type of noise in the competition assumes that no agent knows whether the execution of an action has been affected by noise.
Perhaps there are situations in which some agents may be able to ob-
tain partial information about the occurrence of noise. For example,
some agents may obtain a plan of the malicious third party by counter-
espionage. We are interested in seeing how to incorporate such information into symbolic noise detection.
• It would be interesting to put DBS in an evolutionary environment to
see whether it can survive after a number of generations. Is it evolu-
tionarily stable?
Acknowledgment. This work was supported in part by ISLE contract
0508268818 (subcontract to DARPA’s Transfer Learning program), UC
Berkeley contract SA451832441 (subcontract to DARPA’s REAL program),
and NSF grant IIS0412812. The opinions in this paper are those of the au-
thors and do not necessarily reflect the opinions of the funders.
This work is based on an earlier work: Accident or Intention: That Is
the Question (in the Noisy Iterated Prisoner’s Dilemma), in AAMAS’06
(May 8–12, 2006), © ACM, 2006.
We would like to thank the anonymous reviewers for their comments.
References
Au, T.-C. and Nau, D. (2005). An Analysis of Derived Belief Strategy’s Perfor-
mance in the 2005 Iterated Prisoner’s Dilemma Competition, Tech. Rep.
CSTR-4756/UMIACS-TR-2005-59, University of Maryland, College Park.
Axelrod, R. (1984). The Evolution of Cooperation (Basic Books).
Axelrod, R. (1997). The Complexity of Cooperation: Agent-Based Models of Com-
petition and Collaboration (Princeton University Press).
Axelrod, R. and Dion, D. (1988). The further evolution of cooperation, Science
242, 4884, pp. 1385–1390.
Barone, L. and While, L. (1998). Evolving adaptive play for simplified poker, in
Proceedings of the IEEE International Conference on Computational Intelligence
(ICEC-98), pp. 108–113.
Barone, L. and While, L. (1999). An adaptive learning model for simplified poker
using evolutionary algorithms, in Proceedings of the Congress on Evolutionary Computation (CEC-1999), pp. 153–160.
Barone, L. and While, L. (2000). Adaptive learning for poker, in Proceedings of
the Genetic and Evolutionary Computation Conference, pp. 566–573.
Bendor, J. (1987). In good times and bad: Reciprocity in an uncertain world,
American Journal of Political Science 31, 3, pp. 531–558.
Bendor, J., Kramer, R. M. and Stout, S. (1991). When in doubt... cooperation in
a noisy prisoner’s dilemma, The Journal of Conflict Resolution 35, 4, pp.
691–719.
Billings, D., Burch, N., Davidson, A., Holte, R. and Schaeffer, J. (2003). Approxi-
mating game-theoretic optimal strategies for full-scale poker, in IJCAI, pp.
661–668.
Billings, D., Papp, D., Schaeffer, J. and Szafron, D. (1998). Opponent modeling
in poker, in AAAI, pp. 493–499.
Boyd, R. (1989). Mistakes allow evolutionary stability in the repeated prisoner’s
dilemma game, Journal of Theoretical Biology 136, pp. 47–56.
Carmel, D. and Markovitch, S. (1994). The M* algorithms: Incorporating oppo-
nent models into adversary search, Tech. Rep. CIS9402, Computer Science
Department, Technion.
Carmel, D. and Markovitch, S. (1998). How to explore your opponent’s strategy
(almost) optimally, in Proceedings of the Third International Conference on
Multi-Agent Systems, pp. 64–71.
Davidson, A., Billings, D., Schaeffer, J. and Szafron, D. (2000). Improved oppo-
nent modeling in poker, in Proceedings of the 2000 International Conference
on Artificial Intelligence (ICAI’2000), pp. 1467–1473.
Donninger, C. (1986). Paradoxical Effects of Social Behavior, chap. Is it always
efficient to be nice? (Heidelberg: Physica Verlag), pp. 123–134.
Dyer, D. W. (2004). Opponent Modelling and Strategy Evolution in the Iterated
Prisoner’s Dilemma, Master’s thesis, School of Computer Science and Soft-
ware Engineering, The University of Western Australia.
Egnor, D. (2000). Iocaine powder explained, ICGA Journal 23, 1, pp. 33–35.
Hingston, P. and Kendall, G. (2004). Learning versus evolution in iterated pris-
oner’s dilemma, in Proceedings of the Congress on Evolutionary Computa-
tion (CEC’04).
Hoffmann, R. (2000). Twenty years on: The evolution of cooperation revisited,
Journal of Artificial Societies and Social Simulation 3, 2.
Kraines, D. and Kraines, V. (1989). Pavlov and the prisoner’s dilemma, Theory
and Decision 26, pp. 47–79.
Kraines, D. and Kraines, V. (1993). Learning to cooperate with Pavlov: an adaptive strategy for the iterated prisoner’s dilemma with noise, Theory and
Decision 35, pp. 107–150.
Kraines, D. and Kraines, V. (1995). Evolution of learning among Pavlov strategies
in a competitive environment with noise, The Journal of Conflict Resolution
39, 3, pp. 439–466.
Kuhn, S. T. (2001). Prisoner’s dilemma, Stanford Encyclopedia of Philosophy,
http://karmak.org/archive/2002/11/Prisoner’s Dilemma.html
Molander, P. (1985). The optimal level of generosity in a selfish, uncertain envi-
ronment, The Journal of Conflict Resolution 29, 4, pp. 611–618.
Mueller, U. (1987). Optimal retaliation for optimal cooperation, The Journal of
Conflict Resolution 31, 4, pp. 692–724.
Nowak, M. and Sigmund, K. (1990). The evolution of stochastic strategies in the
prisoner’s dilemma, Acta Applicandae Mathematicae 20, pp. 247–265.
Nowak, M. and Sigmund, K. (1993). A strategy of win-stay, lose-shift that out-
performs tit-for-tat in the prisoner’s dilemma game, Nature 364, pp. 56–58.
Nowak, M. A. and Sigmund, K. (1992). Tit for tat in heterogeneous populations,
Nature 355, pp. 250–253.
O’Riordan, C. (2001). Iterated prisoner’s dilemma: A review, Tech. Rep. NUIG-
IT-260601, Department of Information Technology, National University of
Ireland, Galway.
Powers, R. and Shoham, Y. (2005). Learning against opponents with bounded
memory, in IJCAI.
Rapoport, A. and Chammah, A. M. (1965). Prisoner’s dilemma (University of
Michigan Press).
Sugden, R. (1986). The economics of rights, co-operation and welfare (Blackwell).
Wu, J. and Axelrod, R. (1995). How to cope with noise in the iterated prisoner’s
dilemma, Journal of Conflict Resolution 39, pp. 183–189.