Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects
Carl Pearson 1, Abdul Dakkak 1, Sarah Hashash 1, Cheng Li 1, I-Hsin Chung 2, Jinjun Xiong 2, Wen-Mei Hwu 1
1 University of Illinois Urbana-Champaign, Urbana, IL; 2 IBM T. J. Watson Research, Yorktown Heights, NY
Page 1:

Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects

Carl Pearson1, Abdul Dakkak1, Sarah Hashash1, Cheng Li1, I-Hsin Chung2, Jinjun Xiong2, Wen-Mei Hwu1

1 University of Illinois Urbana-Champaign, Urbana, IL; 2 IBM T. J. Watson Research, Yorktown Heights, NY

Page 2:

Why GPU Interconnect Bandwidth?

GPU Consumes: 15.7 TFLOP/s FP32 = 31.4 x 10^12 operands/s
Host Produces: 15.8 GB/s over PCIe = 3.95 x 10^9 operands/s

~8000 FP32 operations per operand transferred, or ~2000 FP32 operations per byte transferred

(Nvidia V100 attached by PCIe 3)
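The slide's arithmetic can be checked with a short sketch (ours): the ratios compare the rate at which the V100 consumes operands against the rate at which PCIe 3 can deliver them, assuming two 4-byte FP32 input operands per operation.

```python
# Sketch of the consumption-to-delivery ratios behind the ~8000 and ~2000
# figures on this slide (our reconstruction of the slide's arithmetic).
flops = 15.7e12                      # V100 FP32 throughput, FLOP/s
consumed_operands = 2 * flops        # two input operands per FP32 op
delivered_bytes = 15.8e9             # measured PCIe 3 x16 bandwidth, B/s
delivered_operands = delivered_bytes / 4   # 4 bytes per FP32 operand

per_operand = consumed_operands / delivered_operands   # ~8000
per_byte = consumed_operands / delivered_bytes         # ~2000
print(round(per_operand), round(per_byte))
```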

Page 3:

Challenges and Contributions

Challenge: CUDA data transfer bandwidth depends on allocation and transfer method
Bad Result: Incomplete system characterization
Comm|Scope Solution: Microbenchmarks for all CUDA communication methods; avoid synchronization overhead from measurements

Page 4:

Challenges and Contributions

Challenge: CUDA data transfer bandwidth depends on allocation and transfer method
Bad Result: Incomplete system characterization
Comm|Scope Solution: Microbenchmarks for all CUDA communication methods; avoid synchronization overhead from measurements

Challenge: Bandwidth influenced by non-CUDA knobs and system topology
Bad Result: Variability across measurements
Comm|Scope Solution: Explore effect of system topology; understand and control non-CUDA system parameters during measurements

Page 5:

Challenges and Contributions

Challenge: CUDA data transfer bandwidth depends on allocation and transfer method
Bad Result: Incomplete system characterization
Comm|Scope Solution: Microbenchmarks for all CUDA communication methods; avoid synchronization overhead from measurements

Challenge: Bandwidth influenced by non-CUDA knobs and system topology
Bad Result: Variability across measurements
Comm|Scope Solution: Explore effect of system topology; understand and control non-CUDA system parameters during measurements

Challenge: Complicated API behavior and system interaction
Bad Result: Reinventing the wheel and repeating mistakes
Comm|Scope Solution: Open-source, cross-platform, error reporting, plotting results

Page 6:

Comprehensive Coverage of CUDA Bulk Transfers

▪ Explicit transfers
▪ Peer Access
▪ "Zero-Copy"
▪ Unified Memory
▪ Unidirectional Transfers
▪ Bidirectional Transfers

[Diagram: two components, each with local storage, connected by a communication path; a data transfer moves data between them.]

Page 7:

Non-CUDA Parameter: NUMA Pinning

▪ Not all cudaMemcpy calls are created equal on high-bandwidth interconnects

[Diagram: a GPU is "local" to one CPU and "remote" to the other.]

Configuration (Limiter)        Theoretical (GB/s)   Observed (GB/s)
AC922 Local (3x NVLink 2)      75                   66.6 ± 0.013
AC922 Remote (X-bus)           64                   41.3 ± 0.009
S822LC Local (2x NVLink 1)     40                   31.9 ± 0.008
S822LC Remote (X-bus)          38.4                 29.3 ± 0.013
4029GP Local (PCIe 3)          15.8                 12.4 ± 0.0002
4029GP Remote (PCIe 3)         15.8                 12.4 ± 0.0002

1GB pinned host allocation transferred to GPU
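The "mean ± stddev" entries in the table above summarize repeated measurements; a sketch (ours) of that reduction, with hypothetical samples rather than the paper's raw data:

```python
import statistics

def bandwidth_stats(gb_per_s_samples):
    """Summarize repeated bandwidth measurements as (mean, sample stddev),
    the format used in the table above."""
    mean = statistics.mean(gb_per_s_samples)
    stdev = statistics.stdev(gb_per_s_samples)  # sample standard deviation
    return mean, stdev

# Hypothetical samples for illustration; not the paper's data.
samples = [66.59, 66.61, 66.60, 66.62, 66.58]
mean, stdev = bandwidth_stats(samples)
print(f"{mean:.2f} ± {stdev:.3f} GB/s")
```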

Page 8:

Non-CUDA Parameters

▪ Variable CPU Clock Speeds

$ cpupower frequency-set --governor performance

▪ CPU Data Caching

// arch/x86/include/asm/special_insns.h
void flush(void *p) {
    // flush the cache line holding *p (not the pointer variable itself)
    asm volatile("clflush %0" : "+m"(*(volatile char *)p));
}

// linux/arch/powerpc/include/asm/cache.h
void flush(void *p) {
    asm volatile("dcbf 0, %0" : /* no outputs */ : "r"(p) : "memory");
}

Page 9:

Pinned Allocation and cudaMemcpy

▪ GPU does DMA to access pinned data on CPU

[Diagram: cudaMemcpy(…, cudaMemcpyHostToDevice) copies CPU DRAM → GPU memory; cudaMemcpy(…, cudaMemcpyDeviceToHost) copies GPU memory → CPU DRAM.]

Page 10:

cudaMemcpy & CPU Cache

▪ CPU writes values to initialize data
▪ For small allocations, data may reside entirely in cache

[Diagram: dirty data from the CPU cache, not yet written back to DRAM, participates in both cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost.]

Page 11:

cudaMemcpy & CPU Cache

▪ Flushing the cache forces data to start in the DRAM
▪ Flushing the cache prevents write-back of dirty data

[Diagram: after a cache flush, both cudaMemcpy directions operate on data resident in CPU DRAM, with a clean CPU cache.]

Page 12:

[Diagram: CPU DRAM, CPU cache, and GPU memory in four configurations: with and without flushing, for each transfer direction.]

              Host to GPU   GPU to Host
No Flushing   52.14 GB/s    14.91 GB/s
Flushing      45.20 GB/s    29.40 GB/s

Flushing forces data to start in DRAM, slowing the host-to-GPU transfer.
Flushing prevents dirty data from being evicted, speeding the GPU-to-host transfer.

Page 13:

▪ Using Google Benchmark Support Library
– Each benchmark run consists of some number of iterations
– The number of iterations n satisfies 1 < n < 10^9, with total time under measurement >= 0.5 s
▪ Support synchronous and asynchronous operations
▪ Report variability across runs
– High variability suggests not all relevant system parameters are fixed

Benchmark Design

[Diagram: a Benchmark Run is Initialization, then repeated Benchmark Iterations (each: Setup, then the Timing Strategy), then Cleanup.]
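The iteration policy described above (grow the iteration count until at least 0.5 s is under measurement, capped at 10^9) can be sketched as follows. This is our simplification: Google Benchmark predicts the next count from the observed rate rather than simply doubling.

```python
import time

def run_benchmark(op, min_time_s=0.5, max_iters=10**9):
    """Sketch of the iteration policy: grow the iteration count until
    the total measured time reaches min_time_s (or the cap is hit),
    then report the mean time per iteration."""
    iters = 1
    while True:
        start = time.perf_counter()
        for _ in range(iters):
            op()
        elapsed = time.perf_counter() - start
        if elapsed >= min_time_s or iters >= max_iters:
            return iters, elapsed / iters  # mean time per iteration
        iters = min(iters * 2, max_iters)

# toy operation (~1 ms); short min_time keeps the demo quick
iters, per_iter = run_benchmark(lambda: time.sleep(0.001), min_time_s=0.05)
```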

Page 14:

▪ Resetting CUDA devices

▪ NUMA pinning

▪ Creating allocations

▪ Creating CUDA streams and events

▪ Zeroing allocations

▪ Configuring CUDA device peer access

Initialization (as needed)


Page 15:

▪ Move unified memory data to a source device

▪ Flush caches

▪ Set CUDA devices

▪ Adjust NUMA pinning

Setup (as needed)


Page 16:

▪ Timing the data transfer operation

▪ Different approaches for different transfer types:

– Synchronous

– Asynchronous

– Simultaneous

Timing Strategies


Page 17:

▪ An operation that may complete at any time (from the perspective of the host)
▪ CUDA API call may return before the operation is complete

Asynchronous Operations

Page 18:

▪ cudaMemcpy
– CUDA Runtime API §2: "for transfers from pageable host memory to device memory…the function will return once the pageable buffer has been copied to the staging memory, but the DMA to final destination may not have completed"

// wrong: the timer stops before the DMA to the device has completed
start = std::chrono::system_clock::now();
cudaMemcpy(..., cudaMemcpyHostToDevice);
end = std::chrono::system_clock::now();

Asynchronous Behavior in Synchronous APIs
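The pitfall above applies to any API that can return before its work completes. A Python analogy (ours, not the paper's code) using a thread pool: stopping the timer at submission measures only launch overhead, while stopping it after waiting for completion measures the whole operation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def transfer():          # stand-in for an asynchronous copy (~50 ms)
    time.sleep(0.05)

with ThreadPoolExecutor() as pool:
    # wrong: submit() returns immediately (like a copy that completes
    # asynchronously), so this measures only the launch overhead
    t0 = time.perf_counter()
    fut = pool.submit(transfer)
    wrong = time.perf_counter() - t0

    # right: wait for completion before stopping the timer
    fut.result()
    right = time.perf_counter() - t0

print(f"wrong={wrong * 1e3:.1f} ms  right={right * 1e3:.1f} ms")
```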

Page 19:

▪ No spurious synchronization costs!

Timing Single Operations

[Diagram. Synchronous timing: the host thread records start wall time, runs the operation, records stop wall time. Asynchronous timing: "start" and "stop" CUDA events bracket the operation in its stream. Reported time is the span between the two marks.]

Page 20:

Timing Simultaneous Sync/Async Operations

[Diagram: the host thread records start wall time, launches the asynchronous operation into a CUDA stream, runs the synchronous operation, calls cudaStreamSynchronize, then records stop wall time. The unavoidable stream synchronization is included in the reported time.]

Page 21:

Timing Simultaneous Asynchronous Operations

[Diagram. Single device: streams 0 and 1 of device 0 each bracket their operation with "start" and "stop" events; no spurious synchronization costs. Multiple devices: device 1 / stream 1 waits on device 0's "start" event before running operation 1 and recording an "other" event, so the stream synchronization event is included in the measurement. Reported time spans the "start" event to the last "stop"/"other" event.]
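Given per-stream event timestamps, the reported time for simultaneous operations spans the earliest start to the latest stop; a small sketch (ours) of that reduction:

```python
def combined_time(intervals):
    """intervals: (start, stop) event timestamps, one pair per stream.
    The reported time for simultaneous operations spans the earliest
    start event to the latest stop event."""
    starts = [s for s, _ in intervals]
    stops = [e for _, e in intervals]
    return max(stops) - min(starts)

# two streams, hypothetical timestamps in seconds
print(combined_time([(0.000, 0.012), (0.001, 0.015)]))
```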

Page 22:

IBM S822LC and IBM AC922


Spec S822LC AC922

CPU 2x IBM POWER 8 2x IBM POWER 9

GPU 4x Nvidia P100 (Pascal) 4x Nvidia V100 (Volta)

CPU ↔ CPU X-bus (38.4 GB/s) X-bus (64 GB/s)

CPU ↔ GPU 2x NVLink 1 (80 GB/s) 3x NVLink 2 (150 GB/s)

GPU ↔ GPU 2x NVLink 1 (80 GB/s) 3x NVLink 2 (150 GB/s)

Page 23:

SuperMicro 4029GP-TVRT


Spec

CPU 2x Intel Xeon Gold 6148

GPU 8x Nvidia V100 (Volta)

CPU ↔ CPU Intel UPI (62.4 GB/s)

CPU ↔ GPU PCIe 3.0 x16 (31.6 GB/s)

GPU ↔ GPU 1x/2x NVLink 2 (25-50 GB/s)

Page 24:

No Locality or Anisotropy on PCIe

▪ Low-bandwidth PCIe 3.0 on the 4029GP hides interesting behavior

[Plots: cudaMemcpyAsync; explicit vs zero-copy CPU/GPU; unified memory demand transfers; cudaMemcpyAsync vs zero-copy CPU/GPU]

Page 25:

▪ The implicit pageable-to-pinned copy prevents exploiting fast interconnects

▪ Multiple threads should speed up the pageable-to-pinned copy

– Application could use simultaneous transfers

– CUDA runtime could use multiple worker threads

Pageable Host Allocations and Fast Interconnects


Page 26:

Transfers across NVLink 2 show strong locality effects

Strong Locality with High Bandwidth Configurations

[Plots: cudaMemcpyAsync CPU-GPU and cudaMemcpyAsync GPU-GPU; 80 GB/s reference lines]

Page 27:

▪ CUDA system software limits performance available in hardware
– Page faults
– Per-page driver heuristics
▪ Underlying interconnect performance not so important

Demand Page Migration

[Plots; 50 GB/s reference lines]

Page 28:

▪ Multiple host threads are needed to make UM faster

Demand Page Migration vs Explicit Transfer

[Plots; 50 GB/s reference lines]

Page 29:

▪ Implicit, like unified memory
▪ Unlike unified memory, can achieve near the interconnect's theoretical bandwidth

Zero-Copy

[Plots; 80 GB/s reference lines]

Page 30:

▪ Unified memory prefetch is slow at intermediate sizes

Unified Memory Prefetch vs Explicit

[Plot; 80 GB/s reference line]

Page 31:

▪ v0.7.2 released April 8th

▪ GitHub: c3sr/comm_scope

▪ Docker: c3sr/comm_scope

▪ CUDA 8.0+, CMake 3.12+

▪ x86 and POWER

▪ Apache 2.0 license

▪ Python scope_plot package for plotting results

Open-source & Docker


Page 32:

▪ Unified Memory Microbenchmarks

– Access patterns & driver heuristics

▪ System-aware CPU/GPU and GPU/GPU data structures
– How to allocate and move data depending on who produces and who consumes
• Hints from application or records from previous executions

▪ System health status
– Sanity check during system firmware development or system bring-up

Future Work


Page 33:

Conclusion

▪ Comprehensive coverage of CUDA communication methods
▪ Bandwidth affected by CUDA APIs, non-CUDA system knobs, and system topology
▪ High-bandwidth interconnects expose idiosyncrasies of the hardware/software system
▪ Open-source, cross-platform, artifact evaluation stamp

Page 34:

Thank you / Questions

[email protected]

https://cwpearson.github.io

Other C3SR System Performance Research Projects
System microbenchmarks: https://scope.c3sr.com

Full-stack machine learning with tracing: https://mlmodelscope.org


This work is supported by the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), a research collaboration as part of the IBM AI Horizon Network.

This research is part of the Blue Waters sustained-petascale computing project, which is supported by National Science Foundation award OCI-0725070 and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.