Accelerating Parallel Monte Carlo Tree Search using CUDA

Kamil Rocki and Reiji Suda
Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo

This work was partially supported by the Core Research of Evolutional Science and Technology (CREST) project "ULP-HPC: Ultra Low-Power, High-Performance Computing via Modeling and Optimization of Next Generation HPC Technologies" of the Japan Science and Technology Agency (JST) and by a Grant-in-Aid for Scientific Research of MEXT, Japan.

Monte Carlo Tree Search (MCTS) is a method for making optimal decisions in artificial intelligence (AI) problems, typically move planning in combinatorial games. It combines the generality of random simulation with the precision of tree search. It can theoretically be applied to any domain that can be described in terms of state-action pairs and a simulation used to forecast outcomes, such as decision support, control, delayed-reward problems or complex optimization. This work is motivated by emerging GPU-based systems and their high computational potential combined with relatively low power usage compared to CPUs. As the problem to be solved we chose to develop a GPU-based AI agent for the game of Reversi (Othello), which provides a sufficiently complex tree-search problem with a non-uniform structure and an average branching factor of over 8. We present an efficient parallel GPU MCTS implementation based on the introduced 'block-parallelism' scheme, which combines GPU SIMD thread groups and performs independent searches without any need for intra-GPU or inter-GPU communication. The obtained results show that with our GPU MCTS implementation on the TSUBAME 2.0 system, one GPU is comparable to 100-200 CPU threads in terms of obtained results, depending on factors such as the search time and other MCTS parameters. We also propose and analyze simultaneous CPU/GPU execution, which improves the overall result.
Introduction

• The basic MCTS algorithm is simple:
• 1. Selection
• 2. Expansion
• 3. Simulation
• 4. Backpropagation

Selection uses the standard UCB formula: the mean value of a node (i.e. its success/loss ratio) plus an exploration bonus weighted by C, a tunable exploitation/exploration ratio factor:

UCB_i = X_i + C * sqrt(ln(n) / n_i)

where X_i is the mean reward of child i, n_i is its visit count and n is the parent's visit count.

MCTS - Coulom (2006); UCB - Kocsis and Szepesvári (2006); Parallel MCTS schemes - Chaslot et al. (2008)

Parallel MCTS schemes:
• a. Leaf parallelism - easy to implement
• b. Root parallelism - n independent trees, efficient
• c. Block parallelism (our approach - parallel MCTS on GPU): n = blocks (trees) x threads (simulations at once)
• Advantage: works well with SIMD hardware; improves the overall result on 2 levels of parallelization
• Weakness: the sequential tree-management part on the CPU (proportional to the number of simulations)

The search consists of 2 parts:
• 1. Tree building - the tree is stored in CPU memory
• 2. Simulating - temporary, not remembered; done by the CPU or the GPU; the results are used to affect the tree's expansion strategy; the final result of each playout is 0 or 1

[Figure: an example MCTS tree with win/visit counts (e.g. 3/6, 1/3, 3/5) stored at each node.]

• MCTS already has many applications, and new ones keep appearing
• The (GPU) architecture is likely to follow this trend in the future
• Programming GPUs may become easier, rather not harder

TSUBAME 2.0
• CPUs - Intel(R) Xeon(R) X5670 @ 2.93 GHz, ~1400 nodes of 12 cores each
• GPUs - NVIDIA Tesla C2050: 14 multiprocessors (MPs) x 32 cores/MP = 448 cores @ 1.15 GHz, ~1400 nodes with 3 GPUs each (around 515 GFlops peak capability per GPU)
• If not specified otherwise, the MCTS search time = 500 ms and the GPU block size = 128

[Figure: MCTS seen as an optimization problem over the state space. A sequential or leaf-parallel search starts from a single point and may converge to a local solution (extremum); root-parallel MCTS has many starting points, giving a greater chance of reaching the global solution; with leaf parallelism the search is broader/more accurate (more samples); block (root-leaf) parallelism combines both.]

Problem statement

Parallel tree search is one of the basic problems in computer science and is used to solve many kinds of problems. Effective parallelization is hard, especially for more than hundreds of threads. SIMD hardware (i.e. the GPU) is fast, but hard to utilize. How can GPUs/CUDA be utilized?

Mapping MCTS trees to blocks
[Figure: GPU program blocks mapped onto the hardware multiprocessors. The numbers of blocks and of threads per block are configurable; the number of MPs is fixed, and each SIMD warp has 32 threads (fixed for current hardware). Blocks correspond to root parallelism, threads within a block to leaf parallelism; together they form block parallelism.]

Scalability - MPI parallel scheme
The root process (id = 0) broadcasts the input data to the other n-1 processes; all N processes simulate independently (possibly on different machines, e.g. a Core i7 box running Fedora or a Phenom box running Ubuntu); the results are then collected (reduced) into the output data.
The current state of the game is sent over the network to all processes; each process thinks, the results are accumulated, the best move is chosen and sent to the opponent, and the opponent's move is received. All simulations are independent; process number 0 controls the game.

Results and findings

Simultaneous CPU/GPU simulating:
[Figure: simulations/second (10^6-10^7) and average point difference (26.5-29.5) vs. number of GPUs (1-32, each running 112 blocks x 64 threads). There is no communication bottleneck, but the improvement diminishes; 32 GPUs run 229,376 threads at ~20 million simulations/s.]

[Figure: win ratio of one GPU vs. 1 CPU thread as a function of the number of GPU threads (1-14,336), averaged over 2000 games: leaf parallelism (block size = 64) compared with block parallelism (block size = 32, 448 trees; block size = 128, 112 trees).]

[Figure: average point difference (score) and average tree depth per game step, 1 GPU vs. 128 CPUs, 500 ms search time; GPU + CPU outperforms GPU alone.]

[Figure: average score per game step for 256 GPUs (3,670,016 threads) and 2048 CPU threads vs. sequential MCTS.]

• Findings:
• Weak scaling of the algorithm - the problem's complexity affects the scalability
• Exploitation/exploration ratio - higher exploitation is needed for more trees
• No communication bottleneck
• Much more efficient than the CPU version

Exploration/exploitation in parallel MCTS
[Figure: per-tree and summed simulation counts over 5 trees under high-exploitation vs. high-exploration settings.]

[Figure: simulations/second vs. number of GPU threads (1-14,336) for the same leaf- and block-parallel configurations; 1 CPU thread achieves around 10,000 simulations/s - the GPU is much faster.]
• More trees = higher score
• More simulations = higher score
• More trees = fewer simulations per tree
• The block size needs to be adjusted
• 1 GPU ~ 64-128 CPUs (in AI strength)
• While the GPU runs a kernel, the CPU can work too
• This increases the tree depth and improves the overall result

Hybrid CPU/GPU search
[Figure: timeline of the hybrid search - the CPU issues the kernel call, the GPU executes the kernel, and until the GPU-ready event fires the CPU control thread can keep working ("CPU can work here!"), expanding in the meantime the part of the tree being processed by the GPU.]