Efficient Intra-SM Slicing for GPU Multiprogramming
Qiumin Xu, EE / Dr. Annavaram
[email protected], [email protected]

Motivation
App 1, App 2, ..., App N share CPU + GPU servers. How should their kernels be assigned to the different GPU SMs*?

Intra-SM Slicing vs. Inter-SM Slicing
- Inter-SM slicing: each SM runs CTAs* of only one kernel. Variants: (b) even partitioning of the SMs between kernels A and B; (c) left-over allocation, where kernel B receives SMs only once no CTAs of kernel A are left.
- (d) Intra-SM slicing (Warped-Slicer): CTAs of kernel A and kernel B share the resources of the same SM, dispatched by a kernel-aware thread block scheduler.

Dynamic Resource Partitioning
Notation:
  P(i, T_i) : normalized IPC of application i when running T_i CTAs
  K         : number of applications sharing the SM
  R_Ti      : the resource requirement of T_i CTAs
  R_tot     : the total resources available in an SM
  N         : maximum number of CTAs per SM

Warped-Slicer:
  1. Assign a sequentially increasing number of CTAs from each kernel.
  2. Employ a sampling phase to measure the IPC of each SM.
  3. Calculate the best partitioning based on Equ. (1):

       maximize  min_{1 <= i <= K} P(i, T_i)
       subject to  R_T1 + R_T2 + ... + R_TK <= R_tot        (1)

  4. Continue to assign CTAs to SMs based on that decision.

Searching for the best partitioning:
  Naive brute-force algorithm: O(N^K)
  Proposed water-filling algorithm: O(NK). Repeat until R_tot is exhausted:
    i.  identify the kernel with the minimum performance (kernel #2 in the running example);
    ii. assign 1 more CTA to that kernel (kernel #2).

CTA Scheduling for Intra-SM Slicing
A kernel-aware thread-block scheduler dispatches CTAs of Kernel 1 and Kernel 2 to SM 0 through SM 7. During the sampling phase, different SMs receive different CTA mixes so that the sampler can measure the IPC of each candidate partition. The sampler then chooses between:
  Case 1: intra-SM slicing at (2, 2): every SM shares its resources between CTAs of both kernels;
  Case 2: spatial multitasking: every SM is dedicated to CTAs of a single kernel.
[Figure: per-SM CTA placements for the sampling phase, Case 1, and Case 2.]

Experimental Results
  Application pairs     : 30
  Speedup               : 1.23x
  Increased utilization : 15%
  Fairness improvement  : 14%
  Hardware overhead     : ~0

* Streaming Multiprocessor (SM), a.k.a. GPU shader core.
* Cooperative Thread Array (CTA), a.k.a. GPU thread block.
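The O(NK) water-filling search under Dynamic Resource Partitioning can be sketched in Python. This is only an illustration of the two repeated steps (grow the minimum-performance kernel by one CTA until R_tot is exhausted); the function name, the shape of the sampled-performance table, and the per-CTA cost model are assumptions of the sketch, not details from the poster.

```python
def water_fill(perf, cost, r_tot):
    """Water-filling search for a CTA partition (illustrative sketch).

    perf[i][t] : normalized IPC P(i, t) of kernel i running t CTAs,
                 as measured during the sampling phase (perf[i][0] = 0);
                 len(perf[i]) - 1 plays the role of N, the max CTAs per SM.
    cost[i]    : resources consumed by one CTA of kernel i (models R_Ti).
    r_tot      : total resources available in the SM (R_tot).
    Returns the per-kernel CTA counts T_i.
    """
    K = len(perf)
    t = [1] * K                        # start every kernel with 1 CTA
    used = sum(cost[i] for i in range(K))
    while True:
        # i. identify the kernel with the minimum normalized performance
        worst = min(range(K), key=lambda i: perf[i][t[i]])
        # ii. assign 1 more CTA to that kernel, if the CTA limit and the
        #     resource budget still allow it (the sketch simply stops otherwise)
        if t[worst] + 1 >= len(perf[worst]) or used + cost[worst] > r_tot:
            break
        t[worst] += 1
        used += cost[worst]
    return t
```

For example, with two kernels whose sampled curves are [0, 0.3, 0.5, 0.6] and [0, 0.4, 0.7, 0.8], unit CTA cost, and a budget of 4, the search settles at the (2, 2) partition, matching the Case 1 decision above.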
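The kernel-aware thread-block scheduler from the CTA Scheduling section can likewise be sketched as a simple software model. This is not the hardware design: the function name, the queue representation, and the quota interface are assumptions. It fills each SM with up to quota[k] CTAs of each kernel k, which reproduces an intra-SM split such as (2, 2).

```python
from collections import deque

def dispatch(num_sms, quota, pending):
    """Sketch of a kernel-aware thread-block scheduler.

    num_sms : number of SMs to fill.
    quota   : quota[k] = CTAs of kernel k allowed per SM, e.g. (2, 2)
              for intra-SM slicing Case 1 (spatial multitasking would
              instead dedicate whole SMs to a single kernel).
    pending : pending[k] = list of CTA ids of kernel k awaiting dispatch.
    Returns {sm: [(kernel, cta_id), ...]}.
    """
    queues = [deque(p) for p in pending]
    placement = {sm: [] for sm in range(num_sms)}
    for sm in range(num_sms):
        for k, q in enumerate(quota):
            for _ in range(q):             # honor kernel k's per-SM quota
                if queues[k]:
                    placement[sm].append((k, queues[k].popleft()))
    return placement
```

With two SMs, a (2, 2) quota, and four pending CTAs per kernel, every SM ends up hosting two CTAs from each kernel, i.e. the Case 1 placement.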