
CHAPTER 10

EVOLUTIONARY COMPUTATION II: GENERAL METHODS AND THEORY

• Organization of chapter in ISSO

–Introduction

–Evolution strategy and evolutionary programming; comparisons with GAs

–Schema theory for GAs

–What makes a problem hard?

–Convergence theory

–No free lunch theorems

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall


Methods of EC

• Genetic algorithms (GAs), evolution strategy (ES), and evolutionary programming (EP) are the most common EC methods

• Many modern EC implementations borrow aspects from one or more EC methods

• Generally: ES for function optimization; EP for AI applications such as automatic programming


ES Algorithm with Noise-Free Loss Measurements

Step 0 (initialization): Randomly or deterministically generate an initial population of N values of θ and evaluate L for each of the values.

Step 1 (offspring): Generate λ offspring from the current population of N candidate θ values such that all values satisfy direct or indirect constraints on θ.

Step 2 (selection): For (N+λ)-ES, select the N best θ values from the combined population of N original values plus λ offspring; for (N,λ)-ES, select the N best values from the population of λ > N offspring only.

Step 3 (repeat or terminate): Repeat steps 1 and 2 or terminate (a code sketch of these steps follows below).
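A minimal Python sketch of Steps 0–3, assuming a scalar θ, Gaussian mutation with an illustrative step size sigma, and simple box constraints; the loss function and all parameter values below are placeholders, not from ISSO:

```python
import numpy as np

def evolution_strategy(loss, lo, hi, N=10, lam=70, sigma=0.3,
                       plus=True, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 0: random initial population of N values of theta
    pop = rng.uniform(lo, hi, size=N)
    for _ in range(iters):
        # Step 1: lam offspring by Gaussian mutation of randomly drawn
        # parents; clipping enforces the constraint theta in [lo, hi]
        parents = rng.choice(pop, size=lam)
        offspring = np.clip(parents + sigma * rng.standard_normal(lam), lo, hi)
        # Step 2: (N+lam)-ES selects from parents plus offspring;
        # (N,lam)-ES selects from the lam > N offspring only
        cand = np.concatenate([pop, offspring]) if plus else offspring
        pop = cand[np.argsort(loss(cand))[:N]]
    # Step 3 here is a fixed iteration budget rather than a stopping test
    return pop

# Usage on a placeholder quadratic loss (vectorized over candidates)
print(evolution_strategy(lambda th: (th - 2.0) ** 2, lo=-5.0, hi=5.0)[0])
```

The plus flag switches between the (N+λ) and (N,λ) selection rules of Step 2.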


Schema Theory for GAs

• Key innovation in Holland (1975) is a form of theoretical foundation for GAs based on schemas

– Represents first attempt at serious theoretical analysis

– But not entirely successful, as a “leap of faith” is required to relate schema theory to actual convergence of GA

• “GAs work by discovering, emphasizing, and recombining good ‘building blocks’ of solutions in a highly parallel fashion.” (Melanie Mitchell, An Introduction to Genetic Algorithms [p. 27], 1996, paraphrasing John Holland)

– Statement above is more intuitive than formal

– Notion of building block is characterized via schemas

– Schemas are propagated or destroyed according to the laws of probability


Schema Theory for GAs (cont’d)

• Schema is a template for chromosomes in GAs

• Example: [* 1 0 * * * * 1], where the * symbol represents a don’t care (or free) element

– [1 1 0 0 1 1 0 1] is a specific instance of this schema (see the matching sketch below)

• Schemas sometimes called building blocks of GAs

• Two fundamental results: schema theorem and implicit parallelism

• Schema theorem says that better templates dominate the population as generations proceed

• Implicit parallelism says that GA processes many more than N schemas (>> N) at each iteration

• Schema theory is controversial

– Not connected to algorithm performance in same direct way as usual convergence theory for iterates of algorithm
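A small sketch of schema matching, with a hypothetical helper matches() that tests whether a chromosome is an instance of a template over {0, 1, *}:

```python
def matches(schema, chromosome):
    # a chromosome instantiates a schema if it agrees at every fixed position
    return all(s == "*" or s == c for s, c in zip(schema, chromosome))

schema = "*10****1"
print(matches(schema, "11001101"))  # True: positions 2, 3, and 8 agree
print(matches(schema, "11101101"))  # False: position 3 is 1, schema wants 0
```

Each chromosome of length B is an instance of 2^B schemas (each position is either kept or replaced by *), which is the counting fact behind implicit parallelism.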


Convergence Theory via Markov Chains

• Schema theory inadequate

– Mathematics behind schema theory not fully rigorous

– Unjustified claims about implications of schema theory

• More rigorous convergence theory exists

– Pertains to noise-free loss (fitness) measurements

– Pertains to finite representation (e.g., bit coding or floating point representation on digital computer)

• Convergence theory relies on Markov chains

• Each state in chain represents a possible population

• Markov transition matrix P contains all information for Markov chain analysis


GA Markov Chain Model

• GAs with binary bit coding can be modeled as (discrete state) Markov chains

• Recall states in chain represent possible populations

• The ith element of probability vector $p_k$ represents the probability of achieving the ith population at iteration k

• Transition matrix: the (i, j) element of P represents the probability of population i producing population j through the selection, crossover, and mutation operations

– Depends on loss (fitness) function, selection method, and reproduction and mutation parameters

• Given transition matrix P, it is known that

$p_{k+1}^T = p_k^T P$

(a toy numerical illustration follows below)
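A toy numerical illustration of this recursion; the 3-state transition matrix is an illustrative stand-in, since a real GA transition matrix would be enormous:

```python
import numpy as np

# Illustrative 3-state chain (rows sum to 1); entry (i, j) is the probability
# that population i produces population j in one generation
P = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])
p = np.array([1.0, 0.0, 0.0])  # p_0: start in population (state) 1 for sure
for _ in range(50):
    p = p @ P                  # the recursion p_{k+1}^T = p_k^T P
print(p)                       # approaches the chain's limiting distribution
```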


Rudolph (1994) and Markov Chain Analysis for Canonical GA

• Rudolph (1994, IEEE Trans. Neural Nets.) uses Markov chain analysis to study “canonical GA” (CGA)

• CGA includes binary bit coding, crossover, mutation, and “roulette wheel” selection

– CGA is focus of seminal book, Holland (1975)

• CGA does not include elitism; the lack of elitism is a critical aspect of the theoretical analysis

• CGA assumes mutation probability 0 < Pm < 1 and single-point crossover probability 0 ≤ Pc ≤ 1

• Key preliminary result: CGA is an ergodic Markov chain:

– There exists a unique limiting distribution for the states of the chain

– Nonzero probability of being in any state regardless of initial condition


Rudolph (1994) and Markov Chain Analysis for CGA (cont’d)

• Ergodicity for CGA provides a negative result on convergence in Rudolph (1994)

• Let $\hat{L}_{\min,k}$ denote the lowest of the N (= population size) loss values within the population at iteration k

– $\hat{L}_{\min,k}$ represents the loss value for the θ in population k that has the maximum fitness value

• Main theorem: CGA satisfies

$\lim_{k \to \infty} P\left(\hat{L}_{\min,k} = L(\theta^\ast)\right) < 1$

(the limit on the left-hand side exists by ergodicity)

• Implies CGA does not converge to the global optimum


Rudolph (1994) and Markov Chain Analysis for CGA (cont’d)

• Fundamental problem with CGA is that optimal solutions are found but then lost

• CGA has no mechanism for retaining optimal solution

• Rudolph discusses modification to CGA yielding positive convergence results

• Appends “super individual” to each population

– Super individual represents best chromosome so far

– Not eligible for GA operations (selection, crossover, mutation)

– Not same as elitism

• CGA with added super individual converges in probability (a bookkeeping sketch follows below)
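A minimal sketch of the super-individual bookkeeping, where ga_step is a hypothetical stand-in for one generation of CGA operations; the record lives outside the population:

```python
def run_with_super_individual(ga_step, loss, pop, iters):
    best = min(pop, key=loss)      # "super individual": best chromosome so far
    for _ in range(iters):
        pop = ga_step(pop)         # population may find and then lose optima
        cand = min(pop, key=loss)
        if loss(cand) < loss(best):
            best = cand            # record improves monotonically and is never
    return best                    # selected, crossed over, or mutated
```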


Contrast of Suzuki (1995) and Rudolph (1994) in Markov Chain Analysis for GA

• Suzuki (1995, IEEE Trans. Systems, Man, and Cyber.) uses Markov chain analysis to study GA with elitism

– Same as CGA of Rudolph (1994) except for elitism

• Suzuki (1995) only considers unique states (populations)

– Rudolph (1994) includes redundant states

• With N = population size and B = no. of bits/chromosome:

– $\frac{(2^B + N - 1)!}{(2^B - 1)!\, N!}$ unique states in Suzuki (1995)

– $2^{NB}$ states in Rudolph (1994) (much larger than the number of unique states above)

• Above affects bookkeeping; does not fundamentally change relative results of Suzuki (1995) and Rudolph (1994)


Convergence Under Elitism

• In both the CGA case (Rudolph, 1994) and the case with elitism (Suzuki, 1995) the limit exists:

$\bar{p}^T = \lim_{k \to \infty} p_0^T P^k$

(dimension of $\bar{p}$ differs according to definition of states, unique or nonunique as on previous slide)

• Suzuki (1995) assumes each population includes one elite element and that crossover probability Pc = 1

• Let $\bar{p}_j$ represent the jth element of $\bar{p}$, and let J represent the indices j where population j includes a chromosome achieving $L(\theta^\ast)$

• Then from Suzuki (1995): $\sum_{j \in J} \bar{p}_j = 1$

• Implies GA with elitism converges in probability to the set of optima (a toy illustration follows below)
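A toy illustration of this accounting: power iteration computes $\bar{p}$ for an illustrative absorbing chain, and the probability mass on the assumed optimum-containing states J approaches 1; the chain and J are assumptions, not a real GA model:

```python
import numpy as np

# Illustrative absorbing chain: state 2 stands for "population contains an
# optimum"; with elitism, probability mass can only flow toward such states
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
J = [2]                          # assumed indices of optimum-containing states
p = np.array([1.0, 0.0, 0.0])    # p_0
for _ in range(500):
    p = p @ P                    # power iteration toward pbar^T = lim p_0^T P^k
print(p[J].sum())                # approaches 1: sum over j in J of pbar_j = 1
```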


Calculation of Stationary Distribution

• Markov chain theory provides useful conceptual device

• Practical calculation difficult due to explosive growth of number of possible populations (states)

• Growth is in terms of factorials of N and bit string length (B)

• Practical calculation of $p_k$ usually impossible due to difficulty in getting P

• Transition matrix can be very large in practice

– E.g., if N = B = 6, P is approximately a $10^8 \times 10^8$ matrix (see the count sketch below)!

– Real problems have N and B much larger than 6

• Ongoing work attempts to severely reduce dimension by limiting states to only most important (e.g., Spears, 1999; Moey and Rowe, 2004)
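A short sketch of the state-count growth, using the multiset-count formula from the Suzuki (1995) slide:

```python
from math import comb

def num_states(N, B):
    # number of multisets of size N drawn from the 2^B possible chromosomes:
    # (2^B + N - 1)! / ((2^B - 1)! N!) = C(2^B + N - 1, N)
    return comb(2 ** B + N - 1, N)

print(num_states(2, 4))  # 136 states for a tiny GA (N = 2, B = 4)
print(num_states(6, 6))  # 119,877,472: the roughly 1e8 x 1e8 P cited above
```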


Example 10.2 from ISSO: Markov Chain Calculations for Small-Scale Implementation

• Consider a scalar loss L(θ) involving a sin(θ) term, with θ ∈ [0, 15] (see ISSO for the exact form)

• Function has local and global minimum; plot on next slide

• Several GA implementations with very small population sizes (N) and numbers of bits (B)

• Small-scale implementations imply Markov transition matrices are computable

– But still not trivial, as matrix dimensions range from approximately 2000 to 4000


Loss Function for Example 10.2 in ISSO

[Plot of L(θ) over θ ∈ [0, 15] omitted. Markov chain theory provides the probability of finding the solution (θ = 15) in a given number of iterations.]


Example 10.2 (cont’d): Probability Calculations for Very Small-Scale GAs

Probability that GA with elitism produces a population containing the optimal solution:

GA iteration                        0     5     10    20    30    40    50    100   150
Pc = 1.0, Pm = 0.05, N = 2, B = 6   0.03  0.08  0.15  0.32  0.48  0.62  0.74  0.97  1.00
Pc = 1.0, Pm = 0.05, N = 4, B = 4   0.21  0.51  0.69  0.92  1.00  --    --    --    --
Pc = 1.0, Pm = 0.05, N = 2, B = 4   0.12  0.23  0.34  0.55  0.75  0.93  1.00  --    --


Summary of GA Convergence Theory

• Schema theory (Holland, 1975) was most popular method for theoretical analysis until approximately mid-1990s

– Schema theory not fully rigorous and not fully connected to actual algorithm performance

• Markov chain theory provides more formal means of convergence (and convergence rate) analysis

• Rudolph (1994) used Markov chains to provide largely negative result on convergence for canonical GAs

– Canonical GA does not converge to optimum

• Suzuki (1995) considered GAs with elitism; unlike Rudolph (1994), GA is now convergent

• Challenges exist in practical calculation of Markov transition matrix


No Free Lunch Theorems (Reprise, Chap. 1)

• No free lunch (NFL) theorems apply to EC algorithms

– Theorems imply there can be no universally efficient EC algorithm

– Performance of one algorithm when averaged over all problems is identical to that of any other algorithm

• Suppose EC algorithm A is applied to loss L

– Let $\hat{L}_n$ denote the lowest loss value from the most recent N population elements after n ≥ N unique function evaluations

• Consider the probability $P(\hat{L}_n \mid L, A)$ of attaining a given best-loss value $\hat{L}_n$ after n unique evaluations of the loss

• NFL theorems state that the sum of the above probabilities over all loss functions is independent of A (a tiny enumeration check follows below)
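A tiny enumeration check in the NFL spirit: over all 27 loss functions mapping a 3-point domain to 3 values, the number of functions for which a fixed (non-repeating) search order sees the global minimum within n = 2 evaluations is the same for every order; this toy setup is illustrative, not the general theorem:

```python
from itertools import product

n = 2  # number of unique loss evaluations allowed

def hits(order):
    # count loss functions f: {0,1,2} -> {0,1,2} whose global minimum is
    # seen within the first n evaluations of the given search order
    return sum(min(f[x] for x in order[:n]) == min(f)
               for f in product(range(3), repeat=3))

print(hits([0, 1, 2]), hits([2, 0, 1]), hits([1, 2, 0]))  # all equal
```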


Comparison of Algorithms for Stochastic Optimization in Chaps. 2–10 of ISSO

• Table on next slide is rough summary of relative merits of several algorithms for stochastic optimization

– Comparisons based on semi-subjective impressions from numerical experience (author and others) and theoretical or analytical evidence

– NFL theorems not generally relevant, as only “typical” problems of interest are considered, not all possible problems

• Table does not consider root-finding per se

• Table is for “basic” implementation forms of algorithms

• Ratings are L (low), ML (medium-low), M (medium), MH (medium-high), and H (high)

– These scales are for the stochastic optimization setting and have no meaning relative to classical deterministic methods


Comparison of Algorithms

Criterion                        Rand. search   RLS  Stoch. grad.  SPSA (basic)  ASP  SAN  GA
Ease of implementation           H              MH   M             M             M    M    ML
Efficiency in high dimen.        M (algs. B&C)  H    MH            MH            MH   M    Highly variable
Generality of loss fn.           H              L    M             MH            M    MH   MH
Global optimization              H              N/A  ML            MH            ML   MH   MH
Handles noise in loss/gradient   ML             MH   H             H             M    ML   ML
Real-time applications           L              H    H             MH            M    ML   L
Theoretical foundation           MH             H    H             H             H    MH   MH
Sexiness                         L              M    M             M             M    MH   H