Evaluation of parallel particle swarm optimization algorithms within the CUDA™ architecture
Luca Mussi, Fabio Daolio, Stefano Cagnoni
Information Sciences, 181(20), 2011, pp. 4642-4657.
Presenter: Guan-Yu Chen (Stu. No.: MA0G0202); Advisor: Shu-Chen Cheng
Transcript
Page 1

Evaluation of parallel particle swarm optimization algorithms within the CUDA™ architecture

Luca Mussi, Fabio Daolio, Stefano Cagnoni
Information Sciences, 181(20), 2011, pp. 4642-4657.

Presenter: Guan-Yu Chen (Stu. No.: MA0G0202)
Advisor: Shu-Chen Cheng

Page 2

Outline

1. Particle swarm optimization (PSO)

2. PSO parallelization

3. The CUDA™ architecture

4. Parallel PSO within the CUDA™ architecture

5. Results

6. Final remarks

Page 3

1. Particle swarm optimization (1/3)

• Kennedy & Eberhart (1995).

– Velocity function.

– Fitness function.

Page 4

1. Particle swarm optimization (2/3)

• Velocity function

$$\mathbf{V}(t) = w\,\mathbf{V}(t-1) + C_1 R_1 \left[\mathbf{X}_{lbest}(t-1) - \mathbf{X}(t-1)\right] + C_2 R_2 \left[\mathbf{X}_{gbest}(t-1) - \mathbf{X}(t-1)\right]$$

$$\mathbf{X}(t) = \mathbf{X}(t-1) + \mathbf{V}(t)$$

(A per-coordinate CUDA sketch of this update follows the symbol definitions below.)

V: the velocity of a particle.

Xgbest: the best-fitness point ever found by the whole swarm.

C1, C2: two positive constants.

R1, R2: two random numbers uniformly drawn between 0 and 1.

w: inertia weight.

X: the position of a particle.

Xlbest: the best-fitness position reached by the particle.

t: the current time step (generation).
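A minimal CUDA sketch of these two update equations, assuming one thread per particle coordinate and pre-generated uniform random numbers; the kernel name and signature are hypothetical, not the authors' code:

```cuda
// Hypothetical kernel: one thread updates one coordinate of one particle.
// x, v, xLbest are n*d-element arrays; xGbest holds the d coordinates of
// the swarm best; r1, r2 hold pre-drawn uniform random numbers in [0, 1].
__global__ void updateParticles(float *x, float *v,
                                const float *xLbest, const float *xGbest,
                                const float *r1, const float *r2,
                                float w, float c1, float c2, int n, int d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global coordinate index
    if (i >= n * d) return;

    int j = i % d;                                  // dimension of this coordinate
    float vi = w * v[i]
             + c1 * r1[i] * (xLbest[i] - x[i])      // cognitive component
             + c2 * r2[i] * (xGbest[j] - x[i]);     // social component
    v[i] = vi;                                      // V(t)
    x[i] += vi;                                     // X(t) = X(t-1) + V(t)
}
```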

Page 5

1. Particle swarm optimization (3/3)

• Fitness function

$$\mathbf{X}^{*} = \arg\min_{\mathbf{X} \in Z} f(\mathbf{X})$$

f(X): the fitness function over the search space Z, defined by the user for the problem at hand ("self-defined").
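For instance, the generalized Rastrigin function used in the paper's experiments can serve as the self-defined fitness; a sketch (the function name is ours):

```cuda
// Generalized Rastrigin: f(x) = 10*d + sum_j (x_j^2 - 10*cos(2*pi*x_j)).
// Global minimum f(0, ..., 0) = 0; lower fitness is better.
__device__ float rastrigin(const float *x, int d)
{
    const float twoPi = 6.28318530718f;
    float f = 10.0f * d;
    for (int j = 0; j < d; ++j)
        f += x[j] * x[j] - 10.0f * cosf(twoPi * x[j]);
    return f;
}
```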

Page 6

2. PSO parallelization

• Master-Slave paradigm.

• Island model (coarse-grained algorithms).

• Cellular model (fine-grained paradigm).

• Synchronous or Asynchronous.

Page 7

3. The CUDA™ architecture (1/5)

• CUDA™ (nVIDIA™, Nov. 2006).
– A handy tool for developing scientific programs oriented to massively parallel computation.

• Execution hierarchy: Kernels → Grid → Thread blocks → Threads.

• Two design questions (see the launch sketch below):
– How many thread blocks for the problem?
– How many threads per thread block?
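A minimal sketch of how these two questions are answered in practice; the kernel and the sizes are hypothetical examples, not the paper's values:

```cuda
// Hypothetical element-wise kernel used to illustrate a launch configuration.
__global__ void scaleKernel(float *data, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < len) data[i] *= 2.0f;                   // placeholder work
}

void launchExample(float *dData, int len)
{
    int threadsPerBlock = 256;                      // a multiple of the 32-thread warp
    int blocks = (len + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    scaleKernel<<<blocks, threadsPerBlock>>>(dData, len);
}
```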

Page 8

3. The CUDA™ architecture (2/5)

• Streaming Multiprocessors (SMs)
– 8 scalar processing cores,
– A number of fast 32-bit registers,
– A parallel data cache shared between all cores,
– A read-only constant cache,
– A read-only texture cache.

Page 9

3. The CUDA™ architecture (3/5)

• SIMT (Single Instruction, Multiple Thread)
– Creates, manages, schedules, and executes groups (warps) of 32 parallel threads.
– The main difference from a SIMD (Single Instruction, Multiple Data) architecture is that SIMT instructions specify the whole execution and branching behavior of a single thread.

Page 10

3. The CUDA™ architecture (4/5)

Each kernel should reflect the following structure (a minimal skeleton follows the list):

a) Load data from global/texture memory;

b) Process data;

c) Store results back to global memory.
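An illustrative skeleton of this three-phase structure (example code, not from the paper):

```cuda
__global__ void threePhaseKernel(const float *in, float *out, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= len) return;

    float value = in[i];           // a) load data from global memory
    value = value * value + 1.0f;  // b) process data (placeholder computation)
    out[i] = value;                // c) store results back to global memory
}
```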

Page 11

3. The CUDA™ architecture (5/5)

The most important specific programming guidelines (illustrated in the sketch after this list):

a) Minimize data transfers between the host and the graphics card;

b) Minimize the use of global memory: shared memory should be preferred;

c) Ensure global memory accesses are coalesced whenever possible;

d) Avoid different execution paths within the same warp.
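A sketch illustrating guidelines (b) and (c) on a block-wise sum; this is example code, not the paper's, and blockDim.x is assumed to be 256:

```cuda
__global__ void blockSum(const float *in, float *blockSums, int len)
{
    __shared__ float tile[256];                     // fast per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads read
                                                    // consecutive addresses: coalesced
    tile[threadIdx.x] = (i < len) ? in[i] : 0.0f;   // one global read per thread
    __syncthreads();

    // Tree reduction entirely in shared memory: no further global traffic.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];            // one global write per block
}
```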

Page 12

4. Parallel PSO within the CUDA™ architecture

• The main obstacle to PSO parallelization is the dependence between particles' updates.

• SyncPSO (sketched below)
– Xgbest and Xlbest are updated only at the end of each generation.

• RingPSO
– Relaxes the synchronization constraint.
– Allows the computation load to be distributed over all available SMs.
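A compact sketch of the SyncPSO idea; this is our simplification, not the authors' code: sphere fitness for brevity, pre-drawn random numbers, one swarm per thread block, one thread per particle. The host is assumed to initialize pbestFit to FLT_MAX, pbest to the initial positions, and to supply generations*n*DIM*2 uniform random numbers in rnd:

```cuda
#include <float.h>

#define DIM 5   // problem dimensionality (example value)

// The whole swarm runs inside ONE thread block, so __syncthreads() acts as
// the end-of-generation barrier: bests are updated once per generation only.
__global__ void syncPso(float *x, float *v, float *pbest, float *pbestFit,
                        const float *rnd, int n, int generations,
                        float w, float c1, float c2)
{
    __shared__ float gbestPos[DIM];
    __shared__ float gbestFit;
    int p = threadIdx.x;                       // particle index

    if (p == 0) gbestFit = FLT_MAX;            // no swarm best yet
    __syncthreads();

    for (int g = 0; g < generations; ++g) {
        float f = 0.0f;
        for (int j = 0; j < DIM; ++j) {
            int idx = p * DIM + j;
            float r1 = rnd[((g * n + p) * DIM + j) * 2];
            float r2 = rnd[((g * n + p) * DIM + j) * 2 + 1];
            // Social term is zero until a swarm best exists.
            float gb = (gbestFit < FLT_MAX) ? gbestPos[j] : x[idx];
            v[idx] = w * v[idx] + c1 * r1 * (pbest[idx] - x[idx])
                                + c2 * r2 * (gb - x[idx]);
            x[idx] += v[idx];
            f += x[idx] * x[idx];              // sphere fitness, for brevity
        }
        if (f < pbestFit[p]) {                 // update personal best
            pbestFit[p] = f;
            for (int j = 0; j < DIM; ++j) pbest[p * DIM + j] = x[p * DIM + j];
        }
        __syncthreads();                       // wait for the whole generation
        if (p == 0)                            // then update the swarm best once
            for (int q = 0; q < n; ++q)
                if (pbestFit[q] < gbestFit) {
                    gbestFit = pbestFit[q];
                    for (int j = 0; j < DIM; ++j)
                        gbestPos[j] = pbest[q * DIM + j];
                }
        __syncthreads();
    }
}
```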

Page 13

4.1 Basic parallel PSO design (1/2)

Page 14

4.1 Basic parallel PSO design (2/2)

Page 15

4.2 Multi-kernel parallel PSO algorithm (1/3)

posID = ( swarmID * n + particleID ) * D + dimensionID
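As a helper function (the wrapper is ours; the names follow the slide), this index locates one coordinate of one particle inside the single flat array holding all swarms:

```cuda
// Flat layout: all swarms' positions in one array, D floats per particle.
// n = particles per swarm, D = problem dimensions.
__host__ __device__ inline int posID(int swarmID, int particleID,
                                     int dimensionID, int n, int D)
{
    return (swarmID * n + particleID) * D + dimensionID;
}
```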

Page 16

4.2 Multi-kernel parallel PSO algorithm (2/3)

• PositionUpdateKernel (1st kernel)
– Updates the particles' positions, scheduling a number of thread blocks equal to the number of particles.

• FitnessKernel (2nd kernel)
– Computes the fitness of each particle.

• BestUpdateKernel (3rd kernel)
– Updates Xgbest and Xlbest.

(A host-side launch sketch follows this list.)
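A host-side sketch of one generation under this scheme; the kernel names come from the slide, but the signatures and launch sizes are our assumptions:

```cuda
// Assumed signatures for the three kernels named above.
__global__ void PositionUpdateKernel(float *x, float *v, const float *best);
__global__ void FitnessKernel(const float *x, float *fit);
__global__ void BestUpdateKernel(const float *fit, const float *x, float *best);

// One PSO generation = three kernel launches. Kernels issued on the same
// CUDA stream execute in order, so global memory safely carries the swarm
// state from one kernel to the next.
void runGeneration(float *x, float *v, float *fit, float *best,
                   int swarms, int n, int D)
{
    int particles = swarms * n;
    PositionUpdateKernel<<<particles, D>>>(x, v, best); // one block per particle
    FitnessKernel<<<particles, D>>>(x, fit);            // one block per particle
    BestUpdateKernel<<<swarms, n>>>(fit, x, best);      // one block per swarm
}
```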

Page 17

4.2 Multi-kernel parallel PSO algorithm (3/3)

Page 18

5. Results

• Parameter settings: w = 0.729844 and C1 = C2 = 1.49618.

Page 19

5.1 SyncPSO (1/2)

• Hardware: Asus GeForce EN8800GT GPU; Intel Core2 Duo™ CPU @ 1.86 GHz.

a) 100 consecutive runs of a single swarm of 32, 64, and 128 particles on the 5-dimensional Rastrigin function, versus the number of generations.

b) How the time to run 10,000 generations of one swarm with 32, 64, and 128 particles scales with the dimension of the generalized Rastrigin function (up to nine dimensions).

Page 20

5.1 SyncPSO (2/2)

Page 21

5.2 RingPSO (1/5)

• GPUs: nVIDIA™ Quadro FX 5800; Zotac GeForce GTX260 AMP² edition; Asus GeForce EN8800GT.
• SPSO run on a 64-bit Intel(R) Core(TM) i7 CPU @ 2.67 GHz.

Compared versions (a sketch of the ring neighborhood follows this list):

1) The sequential SPSO version, modified to implement the ring topology;
2) The 'basic' three-kernel version of RingPSO;
3) RingPSO implemented with two kernels only (one kernel that fuses BestUpdateKernel and PositionUpdateKernel, plus FitnessKernel).
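A sketch of the ring neighborhood that RingPSO relies on (the helper name is ours): each particle takes its social attractor from the best among itself and its two ring neighbors, so no swarm-wide synchronization point is needed:

```cuda
// Returns the index of the particle, among p and its two ring neighbors,
// whose personal best fitness is lowest. n = particles on the ring.
__device__ int bestRingNeighbor(const float *pbestFit, int p, int n)
{
    int left  = (p + n - 1) % n;        // wrap around the ring
    int right = (p + 1) % n;
    int best  = p;
    if (pbestFit[left]  < pbestFit[best]) best = left;
    if (pbestFit[right] < pbestFit[best]) best = right;
    return best;                        // whose personal best is used as attractor
}
```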

Page 22

5.2 RingPSO (2/5)

Sphere function, search domain [-100, 100]^D.

Page 23

5.2 RingPSO (3/5)

Rastrigin function, search domain [-5.12, 5.12]^D.

Page 24

5.2 RingPSO (4/5)

Rosenbrock function, search domain [-30, 30]^D.

Page 25

5.2 RingPSO (5/5)

Page 26

6. Final remarks (1/2)

• The maximum swarm size that SyncPSO allows (bounded by the number of threads per block) is usually more than enough for any practical application.

• However, SyncPSO uses the GPU's computation resources very inefficiently when only one or a few swarms need to be simulated.

• SyncPSO also becomes inefficient when the problem size increases above a certain threshold.

Page 27

6. Final remarks (2/2)

• For the multi-kernel version, the drawbacks of accessing global memory are more than compensated for by the advantages of parallelization.

• The speed-up of the multi-kernel version increases with problem size.

• Both versions far outperform the most recent results published on the same task.

Page 28

The End~

Thanks for your attention!!