Co-evolution: time, changing environments, agents
Dec 29, 2015
Static search space
solutions are determined by the optimization process only, by doing variation and selection
evaluation determines fitness of each solution
BUT…
Dynamic search space
what happens if the state of the environment changes between actions
of the optimization process
AND / OR other processes are concurrently trying to
change the state as well?
Agents to act in changing environments
the search is now for an agent strategy to act successfully in a changing environment
example:
agent is defined with an algorithm based on a set of parameters
‘learning’ means searching for parameter settings to make the algorithm successful in the environment
agent algorithm takes current state of environment as input and outputs an action to change the environment
Optimizing an agent
procedural optimization
recall 7.1.3 finite state machines (recognize a string)
and 7.1.4 symbolic expressions (control operation of a cart)
which are functions
The environment of activity
parameter space, e.g. a game:
nodes are game states
edges are legal moves that transform one state to another
examples: tic-tac-toe, rock-scissors-paper
Agent in the environment
procedure to input current state and determine an action to alter the state
evaluated according to its success at achieving goals in the environment
external goal - environment in desirable state (win game)
internal goal - maintain internal state (survive)
Models of agent implementation
1. look-up table: state / action
2. deterministic algorithm
hard-coded intelligence
ideal if perfect strategy is known; otherwise, parameterized algorithm
'tune' the parameters to optimize performance
3. neural network
Optimizing agent procedure
initialize agent parameters
evaluate agent success in environment
repeat
modify parameters
evaluate modified agent success
if improved success, retain modified parameters
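The procedure above is a simple hill-climber. A minimal Python sketch (the toy environment in the last line is a hypothetical stand-in for a real evaluation):

```python
import random

def optimize_agent(evaluate, n_params=4, iterations=200, seed=0):
    """Hill-climb agent parameters: keep a modification only if success improves."""
    rng = random.Random(seed)
    params = [rng.uniform(-1, 1) for _ in range(n_params)]   # initialize agent parameters
    best = evaluate(params)                                  # evaluate success in environment
    for _ in range(iterations):
        candidate = [p + rng.gauss(0, 0.1) for p in params]  # modify parameters
        score = evaluate(candidate)                          # evaluate modified agent
        if score > best:                                     # if improved, retain modification
            params, best = candidate, score
    return params, best

# toy environment: success is highest when every parameter equals 0.5
params, best = optimize_agent(lambda ps: -sum((p - 0.5) ** 2 for p in ps))
```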
Environment features
1. other agents?
2. state changes independent of agent(s)?
3. randomness?
4. discrete or continuous?
5. sequential or parallel?
Environment features: game examples
2. state changes independent of agent(s)?
YES
simulators (weather, resources)
NO
board games
Environment features: game examples
5. sequential or parallel? turn-taking, simultaneous
which sports? marathon, javelin, hockey, tennis
Models of agent optimization
evolving agent vs environment
evolving agent vs skilled player
co-evolving agents
competing as equals e.g., playing a symmetric game
competing as non-equals (predator - prey) e.g., playing an asymmetric game
co-operative
Example: Game Theory
invented by John von Neumann
models concurrent* interaction between agents
each player has a set of choices
players concurrently make a choice
each player receives a payoff based on the outcome of play: the vector of choices
e.g. paper-scissors-rock
2 players, with 3 choices each
3 outcomes: win, lose, draw
*there are sequential Game Theory models also
Payoff matrix
tabulates payoffs for all outcomes e.g. paper-scissors-rock
P1\P2 paper scissors rock
paper (draw,draw) (lose,win) (win,lose)
scissors (win,lose) (draw,draw) (lose,win)
rock (lose,win) (win,lose) (draw,draw)
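A payoff matrix translates directly into a lookup table. A sketch, assuming numeric payoffs win = 1, draw = 0, lose = -1 (the numbers are an illustrative choice, not from the slides):

```python
# rock-paper-scissors payoff matrix as a dict:
# (P1 choice, P2 choice) -> (P1 payoff, P2 payoff), with win = 1, draw = 0, lose = -1
PAYOFF = {
    ("paper", "paper"): (0, 0),     ("paper", "scissors"): (-1, 1),  ("paper", "rock"): (1, -1),
    ("scissors", "paper"): (1, -1), ("scissors", "scissors"): (0, 0), ("scissors", "rock"): (-1, 1),
    ("rock", "paper"): (-1, 1),     ("rock", "scissors"): (1, -1),   ("rock", "rock"): (0, 0),
}

def play(choice1, choice2):
    """Both players choose concurrently; payoffs come from the outcome vector."""
    return PAYOFF[(choice1, choice2)]
```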
Two-player two-choice ordinal games (2 x 2 games)
Four outcomes
Each player has four ordered payoffs
example game:
P1\P2 L R
U 1,2 3,1
D 2,4 4,3
2 x 2 games
144 distinct games based on relation of payoffs to outcomes: (4! x 4!) / (2 x 2)
model many real-world encounters
environment for studying agent strategies
how do players decide what choice to make?
P1\P2 L R
U 1,2 3,1
D 2,4 4,3
2 x 2 games
player strategy: minimax - minimize losses
pick row / column with largest minimum
assumes nothing about other player
P1\P2 L R
U 1,2 3,1
D 2,4 4,3
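For the example game above, the maximin rule can be computed mechanically (a sketch; `maximin_row` and `maximin_col` are illustrative names): P1 picks D (worst case 2 beats worst case 1), P2 picks L (worst case 2 beats worst case 1), giving outcome (2, 4).

```python
def maximin_row(game):
    """Pick the row whose worst-case Player 1 payoff is largest."""
    # game[row][col] = (P1 payoff, P2 payoff)
    return max(range(len(game)), key=lambda r: min(p1 for p1, _ in game[r]))

def maximin_col(game):
    """Pick the column whose worst-case Player 2 payoff is largest."""
    ncols = len(game[0])
    return max(range(ncols),
               key=lambda c: min(game[r][c][1] for r in range(len(game))))

# the example game: rows U, D; columns L, R
game = [[(1, 2), (3, 1)],
        [(2, 4), (4, 3)]]
```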
2 x 2 games
player strategies: dominant strategy
preferred choice, whatever opponent does
does not determine a choice in all games
P1\P2 L R
U 1,3 4,2
D 3,4 2,1
P1\P2 L R
U 1,2 4,3
D 3,4 2,1
P1\P2 L R
U 2,1 1,2
D 4,3 3,4
2 x 2 games
some games are dilemmas
“prisoner’s dilemma”
C = ‘cooperate’, D = ‘defect’
P1\P2 C D
C 3,3 1,4
D 4,1 2,2
2 x 2 games - playing strategies
what happens if players use different strategies?
how do/should players decide in difficult games?
modeling players and evaluating in games
P1\P2 C D
C 3,3 1,4
D 4,1 2,2
iterated games
can players analyze games after playing and determine a better strategy for next encounter?
iterated play
e.g., prisoner’s dilemma:
can players learn to trust each other and get better payoffs?
P1\P2 C D
C 3,3 1,4
D 4,1 2,2
iterated prisoner’s dilemma
player strategy for choice as a function of previous choices and outcomes
P1\P2 C D
C 3,3 1,4
D 4,1 2,2
game 1 2 … i-1 i i+1 …
P1 C C … D D ?
P2 D C … C D ?
P1: choice_{i+1,P1} = f( choice_{i,P1}, choice_{i,P2} )
iterated prisoner’s dilemma
evaluating strategies based on payoffs in iterated play
P1\P2 C D
C 3,3 1,4
D 4,1 2,2
P1: choice_{n+1,P1} = f( choice_{n,P1}, choice_{n,P2} )
game       1  2  …  i-1  i  i+1  …  n-1  n    Σ
P1 payoff  1  3  …   4   2   3   …   4   3   298
P1         C  C  …   D   D   C   …   D   C
P2         D  C  …   C   D   C   …   C   C
P2 payoff  4  3  …   1   2   3   …   1   3   343
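Iterated play with memory-one strategies can be simulated directly (a sketch; tit-for-tat and always-defect are used as illustrative strategies, with the payoff matrix from the slides):

```python
# prisoner's dilemma payoffs: (P1 choice, P2 choice) -> (P1 payoff, P2 payoff)
PD = {("C", "C"): (3, 3), ("C", "D"): (1, 4), ("D", "C"): (4, 1), ("D", "D"): (2, 2)}

def tit_for_tat(own_prev, opp_prev):
    return "C" if opp_prev is None else opp_prev  # first move C, then copy opponent

def always_defect(own_prev, opp_prev):
    return "D"

def iterated_play(f1, f2, n=100):
    """choice_{i+1} = f(own choice_i, opponent's choice_i); sum payoffs over n games."""
    prev1 = prev2 = None
    total1 = total2 = 0
    for _ in range(n):
        c1, c2 = f1(prev1, prev2), f2(prev2, prev1)
        p1, p2 = PD[(c1, c2)]
        total1, total2 = total1 + p1, total2 + p2
        prev1, prev2 = c1, c2
    return total1, total2
```

Two tit-for-tat players cooperate throughout; against always-defect, tit-for-tat loses only the first game and then matches defection.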
tournament play
set of K strategies for a game, e.g., iterated prisoner's dilemma
evaluate strategies by performance against all other strategies
round-robin tournament
example of tournament play
        P1   P2   …  P(K-1)  PK    Σ
P1      308  298  …  340     278   7140
P2 343 302 … 297 254 6989
… … … … … … …
P(K-1) 280 288 … 301 289 6860
PK 312 322 … 297 277 7045
tournament play
P1 P2 … P(K-1) PK Σ
P1 308 298 … 340 278 7140
P2 343 302 … 297 254 6989
… … … … … … …
P(K-1) 280 288 … 301 289 6860
PK 312 322 … 297 277 7045
order players by total payoff
change ‘weights’ (proportion of population) based on success
repeat tournament until weights stabilize
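The round-robin scoring and weight update above can be sketched as follows (the proportional, replicator-style weight rule is an assumption for illustration; the slides only say weights change based on success):

```python
def round_robin(strategies, play):
    """Score each strategy against every other strategy (round-robin tournament)."""
    K = len(strategies)
    totals = [0] * K
    for i in range(K):
        for j in range(K):
            if i != j:
                totals[i] += play(strategies[i], strategies[j])[0]
    return totals

def update_weights(weights, totals):
    """Replicator-style update: a player's share of the population
    grows in proportion to its share of the weighted total payoff."""
    fitness = [w * t for w, t in zip(weights, totals)]
    s = sum(fitness)
    return [f / s for f in fitness]
```

Repeating `round_robin` and `update_weights` until the weights stop changing yields the stable distribution the slides describe.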
tournament survivors
final weights of players define who is successful
typical distribution:
1 dominant player
a few minor players
most players eliminated
variations on fitness evaluation
spatially distributed agents
random players on an n x n grid
each player plays 4 iterated games against neighbours and computes total payoff
each player compares total payoff with 4 neighbours; replaced by neighbour with best payoff
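The spatial variant can be sketched as a synchronous grid update (wrap-around edges and keep-own-strategy-on-ties are assumptions; `payoff` is a hypothetical callback computing a cell's total payoff against its neighbours):

```python
def spatial_step(grid, payoff):
    """One update: each cell adopts the strategy of its best-scoring
    von Neumann neighbour, keeping its own strategy on ties."""
    n = len(grid)
    scores = [[payoff(grid, r, c) for c in range(n)] for r in range(n)]
    new = [[None] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            best_r, best_c = r, c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = (r + dr) % n, (c + dc) % n  # wrap-around neighbourhood
                if scores[rr][cc] > scores[best_r][best_c]:
                    best_r, best_c = rr, cc
            new[r][c] = grid[best_r][best_c]
    return new
```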
defining agents, e.g., prisoner’s dilemma
algorithms from experts Axelrod 1978
exhaustive variations on a model e.g. react to previous choices of both players:
need choices for:
first move
after (C,C) // (self, opponent)
after (C,D)
after (D,C)
after (D,D)
5 bit representation of strategy
32 possible strategies, 32 players in tournament
tit-for-tat
first: C
(C,C) C
(C,D) D
(D,C) C
(D,D) D
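The 5-bit encoding maps directly to code (a sketch; `decode` is an illustrative helper name):

```python
# a memory-one strategy as 5 symbols: first move, then the response after
# (C,C), (C,D), (D,C), (D,D), where each pair is (own previous, opponent's previous);
# 2**5 = 32 possible strategies
def decode(bits):
    """bits: 5-char string of 'C'/'D'. Returns a strategy function."""
    first, cc, cd, dc, dd = bits
    table = {("C", "C"): cc, ("C", "D"): cd, ("D", "C"): dc, ("D", "D"): dd}
    def strategy(own_prev, opp_prev):
        return first if own_prev is None else table[(own_prev, opp_prev)]
    return strategy

tit_for_tat = decode("CCDCD")  # first C; thereafter copy the opponent's last move
```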
tit-for-tat interpreting the strategy
start by trusting
repeat opponent's last behaviour
cooperate after cooperation, punish defection
don't hold a grudge
BUT… Axelrod's payoffs became the standard
tit-for-tat
first: C
(C,C) C
(C,D) D
(D,C) C
(D,D) D
P1\P2 C D
C 3,3 0,5
D 5,0 1,1
problems with the research
if payoffs are changed but the prisoner’s dilemma pattern remains,
tournaments can produce different winners
--> relative payoff does matter
P1\P2 C D
C 3,3 1,4
D 4,1 2,2
P1\P2 C D
C 3,3 0,5
D 5,0 1,1
game       1  2  …  i-1  i  i+1  …  n-1  n    Σ
P1 payoff  1  3  …   4   2   3   …   4   3   298
P1         C  C  …   D   D   C   …   D   C
P2         D  C  …   C   D   C   …   C   C
P2 payoff  4  3  …   1   2   3   …   1   3   343
agent is a function of more than the previous strategy
choice_{P1,n} influenced by variables:
previous choices - how many? 2, 4, 6, ..?
payoffs - of both players: 4, 8?
individual's goals: 1
same, different measures? maximum payoff, maximum difference, altruism, equality, social utility
other factors? ?
agent research (Robinson & Goforth)
choice_{P1,n} influenced by:
previous choices - how many?
payoffs - of both players
repeated tournaments by sampling the space of payoffs
- demonstrated that payoffs matter
example of discovery: extra level of trust beyond tit-for-tat
effect of players' goals on outcomes and payoffs and classic strategies
3 x 3 games: ~3 billion games -> grid computing
general model
environment is payoff space where encounters take place
multiple agents interact in the space
repeated encounters where agents get payoffs
agents 'display' their own strategies and agents 'learn' the strategies of others
goals of research
reproduce real-world behaviour by simulating strategies
develop effective strategies
Developing effective agent strategies
co-evolution bootstrapping
symmetric competition: games, symmetric game theory
asymmetric competition: asymmetric game theory
cooperation: common goal, social goal
Coevolution in symmetric competition
Fogel's checkers player
evolving player strategies
search space of strategies
fitness is determined by game play
variation by random change to parameters
selection by ranking in play against others
initial strategies randomly generated
Checkers game representation
32 element vector for the playable squares on the board (one colour)
Domain of each element: {-K,-1,0,1,K} e.g., start state of game:
{-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, 0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,1,1,1,1}
Game play algorithm
board.initialize()
while (not board.gameOver())
    player1.makeMove(board)
    if (board.gameOver()) break
    player2.makeMove(board)
end while
board.assignPoints(player1, player2)
makeMove algorithm
makeMove(board) {
    boardList = board.getNextBoards(this)
    bestBoardVal = -2    // below the minimum possible evaluation of -1
    bestBoard = null
    forAll (tempBoard in boardList)
        boardVal = evaluate(tempBoard)
        if (boardVal > bestBoardVal)
            bestBoardVal = boardVal
            bestBoard = tempBoard
    board = bestBoard
}
evaluate algorithm
evaluate(board) {
    function of 32 elements of board
    parameters are variables of search space, including K
    output is in range [-1, 1]
    used for comparing boards
}
what to put in the evaluation?
Chellapilla and Fogel's evaluation
no knowledge of checkers rules
board generates legal moves and updates state
no 'look-ahead' to future moves
implicitly a pattern recognition system
BUT only relational patterns predefined - see diagram
(neural network)
Neural network
input layer, hidden layer(s), output layer
h1 = w11·i1 + w12·i2 + w13·i3, g1 = g(h1)
o1 = w31·g1 + w32·g2 + w33·g3
Neural network agent
32 inputs - board state
1st hidden layer: 91 nodes
3x3 squares (36), 4x4 (25), 5x5 (16), 6x6 (9), 7x7 (4), 8x8 (1)
2nd hidden layer: 40 nodes
3rd hidden layer: 10 nodes
1 output - evaluation of board [-1, 1]
parameters - weights on neural network plus K (value of king)
Genetic algorithm
population: 15 player agents
one child agent from each parent: variation by random changes in K and weights
fitness based on score in tournament among 30 agents (win: 1, loss: -2, draw: 0)
agents with best 15 scores selected for next generation
after 840 generations, effective checkers player in competition against humans
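The evolutionary scheme above can be sketched generically (a minimal sketch; `evaluate_tournament`, `mutate`, and `random_agent` are hypothetical callbacks, and the toy usage below, where an agent is just a number scored by its own value, is purely illustrative):

```python
import random

def evolve(evaluate_tournament, mutate, random_agent,
           pop_size=15, generations=50, seed=0):
    """(mu + lambda)-style evolution as in the slides: each of pop_size parents
    produces one child by random variation, all 2 * pop_size agents are scored
    in a tournament, and the best pop_size survive to the next generation."""
    rng = random.Random(seed)
    population = [random_agent(rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(agent, rng) for agent in population]
        everyone = population + children
        scores = evaluate_tournament(everyone)  # e.g. win: 1, loss: -2, draw: 0
        ranked = sorted(zip(scores, range(len(everyone))), reverse=True)
        population = [everyone[i] for _, i in ranked[:pop_size]]
    return population

# toy usage: agents are numbers, a tournament just scores each agent by its value
pop = evolve(lambda agents: agents,
             lambda a, rng: a + rng.gauss(0, 0.1),
             lambda rng: rng.uniform(0, 1))
```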