Jan 29, 2016
GPU-Accelerated Genetic Algorithms
Rajvi Shah+, P J Narayanan+, Kishore Kothapalliˆ
IIIT Hyderabad
Hyderabad, India
+ : Center for Visual Information Technology ˆ : Center for Security, Theory and Algorithmic Research
International Institute of Information Technology, Hyderabad, India
GAs – an introduction
Genetic Algorithms
  A class of evolutionary algorithms
  Efficiently solve optimization tasks
  Potential applications in many fields
Challenges
  Large execution time
Typical flow of a GA
User specifies:
  A representation for the chromosome
  A method for fitness evaluation
  GA parameters
  Crossover operator
  Mutation operator
  Termination criteria
Flow:
  Create Initial Population
  Evaluate Fitness
  Terminate? Yes: Exit. No: continue
  Select Parents
  Create New Population, then return to Evaluate Fitness
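The flow on this slide can be sketched as a short serial loop. This is a minimal CPU illustration in Python, not the GPU implementation: the function name `run_ga`, its parameters, the fixed generation count used as the termination criterion, and the tracking of the best chromosome seen so far are all assumptions made for the example. It requires a chromosome length of at least 2.

```python
import random

def run_ga(fitness, length, pop_size=20, p_mut=0.01, max_gens=50, seed=0):
    """Serial sketch of a GA loop: select, cross over, mutate, keep the best."""
    rng = random.Random(seed)
    # Create the initial population of binary chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(max_gens):  # assumed termination criterion: fixed generation count
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = rng.choice(pop), rng.choice(pop)  # uniform selection
            cut = rng.randint(1, length - 1)           # one-point crossover
            for child in (p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]):
                # Flip mutation, applied gene by gene.
                new_pop.append([g ^ 1 if rng.random() < p_mut else g for g in child])
        pop = new_pop[:pop_size]
        best = max(pop + [best], key=fitness)          # evaluate fitness
    return best
```

For example, `run_ga(sum, 8)` evolves 8-gene binary strings toward the all-ones string, using the number of ones as a toy fitness.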
Accelerating Genetic Algorithms
High degree of parallelism
  Fitness evaluation
  Crossover
  Mutation
Most obvious: chromosome-level parallelism
  Same operations on each chromosome
  Use a thread per chromosome
Gene-level Parallelism
Thread-per-chromosome model
  Good enough for small to moderately sized multi-core processors
  Doesn't map well to massively multithreaded GPUs
Solution: identify and exploit gene-level parallelism
CUDA
Our Approach
A column of threads reads a chromosome gene by gene, and the threads cooperate to perform operations.
This results in coalesced reads and faster processing.
[Figure: population matrix in memory; thread blocks in a grid]
Program Execution Flow
Steps: Parse GA Parameters, Construct Initial Population, Generate Random Numbers
Kernels: Evaluation Kernel, Statistics Update Kernel, Selection Kernel, Crossover Kernel, Mutation Kernel
GPU Global Memory: Random Numbers, Old Population, New Population, Fitness Scores, Statistics
[Figure: flow diagram with regions labelled On CPU and On GPU]
[Figure: program execution flow, highlighting the Evaluation Kernel (Population, Scores)]
Fitness Evaluation
Partially parallel method
  The user specifies a serial code fragment for fitness evaluation.
  Threads are arranged in a 1D grid.
  Each thread executes the user's code on one chromosome, providing chromosome-level parallelism.
  Benefit: Abstraction
Fully parallel method
  A user familiar with CUDA can use a 2D thread layout effectively.
  Use gene-level parallelism for fitness evaluation.
  Benefit: Efficiency
International Institute of Information Technology, Hyderabad, India
Example – 0/1 Knapsack
Task: given item weights, item costs, and a knapsack capacity
Aim: maximize the total cost
Representation: 1D binary string
  0/1: absence/presence of an item
  W and C are the total weight and cost of a given representation
Best solution: the one with maximum C given W < Wmax
Fully Parallel Method
  Use a group of threads to compute the total cost and weight in logarithmic time
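The logarithmic-time summation can be illustrated with a pairwise (tree) reduction. The sketch below emulates it serially in Python: each index `i` of the inner loop stands for one GPU thread, and all additions within a round would run in parallel. The function names and the choice to score infeasible chromosomes as 0 are assumptions for the example, not details given in the slides.

```python
def knapsack_totals(chromosome, weights, costs):
    """Sum the weight and cost of the selected items by pairwise (tree) reduction."""
    # Round 0: each "thread" loads the contribution of its own gene.
    w = [g * wt for g, wt in zip(chromosome, weights)]
    c = [g * ct for g, ct in zip(chromosome, costs)]
    n = len(w)
    stride = 1
    while stride < n:  # about log2(n) rounds
        # Every i in this loop would be a separate thread on a GPU.
        for i in range(0, n - stride, 2 * stride):
            w[i] += w[i + stride]
            c[i] += c[i + stride]
        stride *= 2
    return (w[0], c[0]) if n else (0, 0)

def knapsack_fitness(chromosome, weights, costs, w_max):
    """Fitness is the total cost C when W < Wmax."""
    total_w, total_c = knapsack_totals(chromosome, weights, costs)
    # Assumption: chromosomes that violate the capacity simply score 0.
    return total_c if total_w < w_max else 0
```

With chromosome `[1, 0, 1, 1, 0]`, weights `[2, 3, 4, 5, 9]` and costs `[3, 4, 5, 8, 10]`, the totals are W = 11 and C = 16, so the fitness is 16 for Wmax = 12 and 0 for Wmax = 11.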
[Figure: program execution flow, highlighting the Statistics Update Kernel (Scores, Statistics)]
Statistics
Selection and termination most often use population statistics.
We use a standard parallel reduction to calculate the maximum, minimum, and average scores.
We use the highly optimized public library CUDPP to sort and rank chromosomes.
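As a rough illustration of these statistics, the sketch below reduces (max, min, sum) triples in a single tree-shaped pass and derives the average from the sum. It is a serial Python emulation; the function name and the returned dictionary are assumptions, and Python's built-in `sorted` stands in for the CUDPP sort used for ranking. It expects at least one score.

```python
def population_stats(scores):
    """Tree-reduce (max, min, sum) triples, then rank chromosomes by score."""
    n = len(scores)
    part = [(s, s, s) for s in scores]  # per-thread (max, min, sum)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):  # one GPU thread per i
            a, b = part[i], part[i + stride]
            part[i] = (max(a[0], b[0]), min(a[1], b[1]), a[2] + b[2])
        stride *= 2
    hi, lo, total = part[0]
    # Ranking: chromosome indices ordered from best to worst score.
    ranking = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return {"max": hi, "min": lo, "avg": total / n, "ranking": ranking}
```

Combining the three reductions into one pass means the score array is read only once.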
[Figure: program execution flow, highlighting the Selection Kernel (Statistics, Parents)]
Selection
Selection Kernel
  Uses N/2 threads
  Each thread selects two parents for producing offspring
Uniform Selection: selects parents uniformly at random
Roulette Wheel Selection: a fitness-based approach; the higher the fitness, the better the chance of selection
Selection
Roulette Wheel
  Sort the fitness scores.
  Compute a roulette wheel array by doing a prefix-sum scan of the scores and normalizing it.
  Generate a random number in the range 0 to 1.
  Perform a binary search in the roulette wheel array for the nearest number smaller than the random number.
  Return the index of that entry in the array.
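The roulette wheel steps can be sketched in a few lines of Python. The sketch assumes an exclusive prefix sum, so that the index of the nearest smaller wheel entry is directly the index of the selected chromosome; the slides do not say which scan variant is used. It also skips the sorting step, since the selection probabilities do not depend on the order of the scores. The function names are hypothetical, scores are assumed non-negative with a positive total, and `r` is assumed to lie in [0, 1).

```python
from bisect import bisect_right
from itertools import accumulate

def build_wheel(scores):
    """Normalized exclusive prefix sum: wheel[i] is the score share before index i."""
    total = sum(scores)
    inclusive = list(accumulate(scores))
    return [(c - s) / total for c, s in zip(inclusive, scores)]

def spin(wheel, r):
    """Binary search for the last wheel entry that is <= r; return its index."""
    return bisect_right(wheel, r) - 1
```

For scores `[1, 2, 3, 4]` the wheel is `[0.0, 0.1, 0.3, 0.6]`, so a random number of 0.59 selects index 2 and 0.99 selects index 3. Each chromosome owns a slice of the wheel proportional to its score.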
Image Courtesy : xyz
[Figure: program execution flow, highlighting the Crossover Kernel (Old Population, New Population)]
Crossover
[Figure: arrays in GPU global memory. Parent1: 02 08 12 05 15. Parent2: 04 13 07 19 14. Crossover: 03 02 02 04 01. Each thread row (thread idy) works on one entry of these arrays; threads idx 1 to L along the row cover genes 1 to 8 of the population.]
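A one-point crossover over this kind of layout might look as follows. This is a serial Python emulation of a 2D thread grid, where row `y` plays the role of thread idy (one parent pair) and column `x` plays the role of thread idx (one gene). The function name, the argument names, and the assumption that each pair yields two children are illustrative choices, not details confirmed by the slides.

```python
def crossover_kernel(old_pop, parent1, parent2, points):
    """Emulate a 2D thread grid: row y is one parent pair, column x is one gene."""
    length = len(old_pop[0])
    new_pop = []
    for y in range(len(points)):  # thread idy
        a, b, cut = old_pop[parent1[y]], old_pop[parent2[y]], points[y]
        # Each x below would be one thread (thread idx) copying a single gene.
        child1 = [a[x] if x < cut else b[x] for x in range(length)]
        child2 = [b[x] if x < cut else a[x] for x in range(length)]
        new_pop += [child1, child2]
    return new_pop
```

Because every gene copy is independent, the work for one offspring spreads across L threads instead of running as a loop inside a single thread.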
[Figure: program execution flow, highlighting the Mutation Kernel (New Population)]
Mutation
Flip Mutator: each thread handles one gene and mutates it with the probability of mutation.
[Figure: population grid indexed by Thread Id x and Thread Id y; each thread, for example thread (1,4), flips a coin and uses the coin state to decide whether to mutate its gene]
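The flip mutator can be emulated serially as a double loop in which each (y, x) pair stands for one GPU thread. In this Python sketch the function name, the `seed` parameter, and the use of Python's `random` module in place of the GPU random number pool are assumptions for illustration.

```python
import random

def flip_mutation(population, p_mut, seed=0):
    """Each (y, x) "thread" flips gene x of chromosome y with probability p_mut."""
    rng = random.Random(seed)
    mutated = []
    for chromosome in population:      # thread Id y
        row = []
        for gene in chromosome:        # thread Id x
            coin = rng.random()        # the coin flip for this gene
            row.append(gene ^ 1 if coin < p_mut else gene)
        mutated.append(row)
    return mutated
```

With `p_mut = 0` the population comes back unchanged, and with `p_mut = 1` every gene is flipped.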
International Institute of Information Technology, Hyderabad, India
Thre
ad Id x
Thread Id y
Population
Mutation
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
x
X
xx
xx
F
F
F
F
F
F
T
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
T
F
F
F
F
F
F
F
F
F
T
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
T
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
F
T
F
F
F
F
F
F
F
F
F
T
F
F
F
F
F
F
F
T
F
Thread 1,4Coin State Gene
X
Flip Coin
Coin State Gene
T
Flip Mutator Each thread handles
one gene and mutates it with probability of mutation
International Institute of Information Technology, Hyderabad, India
[Figure: program execution flow, highlighting Generate Random Numbers (Random No.s)]
Random Number Generation
GAs make extensive use of random numbers.
There is no primitive for generating a single random number on the fly.
Solution: generate a pool of random numbers and copy it to the GPU.
We use a CUDPP routine to generate a large pool of random numbers on the GPU (faster).
If better-quality random numbers are needed, this can be replaced by a CPU-based routine.
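The pool idea can be sketched as a pre-generated array that consumers index into instead of calling a generator. In this Python illustration the class name `RandomPool`, the `take` method, and the wrap-around when the pool is exhausted are all assumptions; the slides do not describe how the pool is refreshed or indexed.

```python
import random

class RandomPool:
    """Pre-generated uniform numbers that consumers read by index."""

    def __init__(self, size, seed=0):
        rng = random.Random(seed)
        self.values = [rng.random() for _ in range(size)]
        self.offset = 0

    def take(self, count):
        """Hand out `count` consecutive numbers, wrapping at the end of the pool."""
        size = len(self.values)
        block = [self.values[(self.offset + i) % size] for i in range(count)]
        self.offset = (self.offset + count) % size
        return block
```

A kernel launch that needs one number per thread would then read `block[thread_id]`, so no thread has to generate a number itself.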
Results
Test Device : A quarter of Nvidia Tesla S1030 GPU
Test Problem : Solve a 0/1 knapsack problem
Test Parameters:
  Representation: A 1D binary string
  Crossover: One-point crossover
  Mutation: Flip mutation
  Selection: Uniform and Roulette Wheel
Results
[Figure: Average run-time for 100 iterations (Uniform Selection)]
[Figure: Average run-time for 100 iterations (Roulette Wheel Selection)]
[Figure: Growth in run-time with increasing N x L, where N is the population size and L is the chromosome length]
Scope
Our approach is modeled after GAlib and maintains structures for GA, Genome, and Statistics.
It is abstracted enough from the user program that the user does not need to know the CUDA architecture or CUDA programming.
This can be extended to build a GPU-accelerated GA library.
Thank You