Multi-GPU Island-Based Genetic Algorithm for Solving the Knapsack Problem
Jiri Jaros

1 Overview
Genetic algorithms (GAs) have become widely applied optimization tools since their development by Holland in 1975 [1]. One of the well-known NP-hard problems successfully solved by GAs is the knapsack problem. However, large problem instances require millions of candidate solutions to be created and evaluated, raising the execution time to hours or even days [2]. The latest GPUs are about 15 times faster than six-core Intel CPUs, which opens new possibilities for massive acceleration of GAs [3].

2 Multi-GPU Cluster Systems
The availability of multiple PCI-Express buses, even on very low-cost commodity computers, makes it possible to construct cluster nodes with multiple GPUs. Inter-node communication is done via MPI over a high-speed network, while intra-node communication exploits CPU shared memory.

3 Multi-GPU Island-Based Genetic Algorithm
The population of the GA is distributed over multiple GPUs. Every GPU, controlled by a single MPI process, evolves a single island in its entirety. Migration occurs after a predefined number of generations: each island sends its best local solution and an optional number of randomly selected individuals to its neighbour in a ring topology.

4 Local GPU Island Implementation Details [4]
• 32 knapsack items packed into a single integer value
• Individuals processed by CUDA warps in multiple rounds
• Most CUDA block barriers removed
• Negligible thread divergence (< 0.5%)
• Blocks of knapsack data shared within the CUDA thread block
• Uniform crossover, bit-flip mutation, binary tournament selection

5 Experimental Results
• Highly optimized CPU implementation running on four 6-core Intel Xeon processors with a 40 Gb/s InfiniBand interconnect
• CUDA implementation running on 14 NVIDIA GTX 580 GPUs
Benchmark: knapsack problem with 10,000 items

6 Conclusions
The proposed multi-GPU island-based GA allows the solution of large-scale instances of the knapsack problem.
The significant benefits:
• Speedups of up to 35, 194, and 781 (14 GPUs vs. 24, 6, and 1 CPU cores, respectively)
• Overall performance of 5.67 TFLOPS (14 GPUs)
• Overall efficiency of 26%
The code will be released as open-source software (http://www.fit.vutbr.cz/~jarosjir).

[1] J. H. Holland, “Adaptation in Natural and Artificial Systems”, Ann Arbor, no. 53, University of Michigan Press, 1975, p. 211.
[2] Z. Michalewicz and J. Arabas, “Genetic algorithms for the 0/1 knapsack problem”, in Lecture Notes in Computer Science, vol. 869, 1994, pp. 134-143.
[3] V. W. Lee et al., “Debunking the 100X GPU vs. CPU myth”, in Proceedings of the 37th Annual International Symposium on Computer Architecture - ISCA ’10, 2010, p. 451.
[4] J. Jaros and P. Pospichal, “A Fair Comparison of Modern CPUs and GPUs Running the Genetic Algorithm under the Knapsack Benchmark”, in Applications of Evolutionary Computation, Heidelberg, DE, Springer, 2012, pp. 426-435.

College of Engineering and Computer Science, Australian National University
This research has been partially supported by the research grant "Natural Computing on Unconventional Platforms", GP103/10/1517, Czech Science Foundation (2010-13).

Figure: steps performed by each MPI process
• Evaluate the local GPU island
• Select emigrants in the local GPU island
• Transfer the emigrants from GPU to CPU and then over the network
• Receive immigrants from the network by CPU and upload them to GPU
• Incorporate the immigrants into the local GPU island

Figure: Solution quality. Fitness value (x1000) vs. individuals per local island (128 to 2048); series: 1, 6, 7, 12, and 14 GPUs.
Figure: Total execution time. Execution time [s] vs. individuals per local island (128 to 2048); series: 1, 6, 7, 12, and 14 GPUs.
Figure: Speedup on a single island reached by multicore CPU and GPU. Speedup vs. single-thread CPU vs. individuals per local island (128 to 2048); series: 1 GPU, 2x6 CPU threads, 6 CPU threads.
Figure: GPU speedup vs. four 6-core Xeons. Speedup vs. individuals per local island (128 to 2048); series: 1, 6, 7, 12, and 14 GPUs.
Figure (diagram label): Interconnection Network
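
The island scheme of Section 3 and the operators listed in Section 4 can be illustrated with a minimal CPU-only sketch in Python. It is not the CUDA/MPI implementation described on the poster: islands are plain lists inside one process, and the "transfer" step is a list copy instead of a GPU-to-CPU download followed by an MPI message. All function names (`random_individual`, `fitness`, `tournament`, `offspring`, `evolve`, `migrate`, `run`) are hypothetical. The zero-fitness penalty for overweight solutions, the elitism, the mutation rate, and the rule that immigrants replace the worst local individuals are assumptions of this sketch, since the poster does not state them.

```python
import random

WORD_BITS = 32  # the poster packs 32 knapsack items into one integer


def random_individual(n_items, rng):
    """A chromosome is a list of 32-bit words; bit j of word w is item 32*w + j."""
    n_words = (n_items + WORD_BITS - 1) // WORD_BITS
    return [rng.getrandbits(WORD_BITS) for _ in range(n_words)]


def fitness(ind, prices, weights, capacity):
    """Total price of the selected items; overweight solutions score 0 (assumed)."""
    price = weight = 0
    for i in range(len(prices)):  # bits beyond the last item are ignored
        if (ind[i // WORD_BITS] >> (i % WORD_BITS)) & 1:
            price += prices[i]
            weight += weights[i]
    return price if weight <= capacity else 0


def tournament(pop, fits, rng):
    """Binary tournament: draw two individuals at random, keep the fitter one."""
    a, b = rng.randrange(len(pop)), rng.randrange(len(pop))
    return pop[a] if fits[a] >= fits[b] else pop[b]


def offspring(p1, p2, rng, mut_rate):
    """Uniform crossover and bit-flip mutation, applied one 32-bit word at a time."""
    child = []
    for w1, w2 in zip(p1, p2):
        mask = rng.getrandbits(WORD_BITS)  # 1 = take the bit from p1, 0 = from p2
        word = (w1 & mask) | (w2 & ~mask)
        flips = sum(1 << j for j in range(WORD_BITS) if rng.random() < mut_rate)
        child.append(word ^ flips)
    return child


def evolve(pop, prices, weights, capacity, rng, mut_rate=0.01):
    """One generation on one island; keeping the best individual is an assumption."""
    fits = [fitness(ind, prices, weights, capacity) for ind in pop]
    new = [pop[max(range(len(pop)), key=fits.__getitem__)]]
    while len(new) < len(pop):
        p1, p2 = tournament(pop, fits, rng), tournament(pop, fits, rng)
        new.append(offspring(p1, p2, rng, mut_rate))
    return new


def migrate(islands, prices, weights, capacity, rng, n_random=1):
    """Ring migration: island i sends its best plus n_random others to island i+1."""
    def score(ind):
        return fitness(ind, prices, weights, capacity)

    packets = []
    for pop in islands:  # select emigrants on every island first
        emigrants = [max(pop, key=score)] + rng.sample(pop, n_random)
        packets.append([list(e) for e in emigrants])  # copy stands in for the transfer
    for i, packet in enumerate(packets):  # then deliver to the ring neighbour
        dest = islands[(i + 1) % len(islands)]
        dest.sort(key=score)         # worst individuals first
        dest[:len(packet)] = packet  # immigrants replace the worst (assumed)


def run(n_islands=4, island_size=32, n_items=100, generations=60, interval=10, seed=1):
    """Evolve all islands, migrating every `interval` generations; return best fitness."""
    rng = random.Random(seed)
    prices = [rng.randint(1, 100) for _ in range(n_items)]
    weights = [rng.randint(1, 100) for _ in range(n_items)]
    capacity = sum(weights) // 2
    islands = [[random_individual(n_items, rng) for _ in range(island_size)]
               for _ in range(n_islands)]
    for gen in range(1, generations + 1):
        islands = [evolve(pop, prices, weights, capacity, rng) for pop in islands]
        if gen % interval == 0:
            migrate(islands, prices, weights, capacity, rng)
    return max(fitness(ind, prices, weights, capacity) for pop in islands for ind in pop)
```

Calling `run()` evolves four small islands on a randomly generated 100-item instance and returns the best fitness found. In a real multi-GPU setting each island would live in its own MPI process, and the two loops in `migrate` would become a send to the next rank and a receive from the previous one.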