John Danskin Vice President GPU Architecture Salishan 2017 HETEROGENEOUS COMPUTING CHALLENGES & DIRECTIONS

Heterogeneous Computing Challenges & Directions · salishan.ahsc-nm.org/uploads/4/9/7/0/49704495/2017... · 2019-11-18

Aug 11, 2020


John Danskin Vice President GPU Architecture

Salishan 2017

HETEROGENEOUS COMPUTING CHALLENGES & DIRECTIONS


TOPICS

Why Throughput Optimized Processors are Efficient

Throughput Optimized = GPU

Latency Optimized = CPU

Why You Need Both Kinds of Processors

Why Fat Nodes are Better than Thin Nodes

A Qualitative Discussion


THROUGHPUT VS LATENCY

One Brick

Many Bricks


WHY THROUGHPUT IS EFFICIENT

Power = Capacitance * Voltage^2 * frequency

Resistance

frequency ∝ Voltage

Power ∝ frequency^3

Energy/Operation ∝ frequency^2

Frequency

Caution: Qualitative Approximations

Serial Performance Costs Power

Wide & Slow is Efficient

Implication: 1.5 GHz should be ~7x more efficient than 4 GHz
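
The ~7x figure follows directly from the proportionalities on this slide. A minimal sketch, assuming only the slide's qualitative model (frequency ∝ Voltage, so Energy/Operation ∝ frequency^2 and Power ∝ frequency^3); the function names are illustrative, not from the talk:

```python
def relative_energy_per_op(f_fast_ghz, f_slow_ghz):
    """Energy/op of the fast clock relative to the slow one, assuming energy/op ∝ f^2."""
    return (f_fast_ghz / f_slow_ghz) ** 2


def relative_power(f_fast_ghz, f_slow_ghz):
    """Power of the fast clock relative to the slow one, assuming power ∝ f^3."""
    return (f_fast_ghz / f_slow_ghz) ** 3


energy_ratio = relative_energy_per_op(4.0, 1.5)
power_ratio = relative_power(4.0, 1.5)

print(f"Energy/op at 4 GHz vs 1.5 GHz: {energy_ratio:.1f}x")  # 7.1x
print(f"Power at 4 GHz vs 1.5 GHz: {power_ratio:.1f}x")       # 19.0x
```

As the slide cautions, these are qualitative approximations; real parts do not scale voltage linearly with frequency across the whole range.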


WHY THROUGHPUT IS EFFICIENT

Power = Capacitance * Voltage^2 * frequency

Capacitance ∝ Area (!! Caution)

Energy/Op ∝ Area/Op

=> Small Simple Cores are Efficient*

*Unless Communication Explodes

Core Area

Caution: Qualitative Argument

NVIDIA GP100: 610 mm^2, 3840 Cores

IBM P8: 650 mm^2, 12*8 Cores

40x Difference in Area/Core*

*Redefined core
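
The 40x figure can be checked from the die areas and core counts listed above. A small sketch; the dictionary layout and names are illustrative, and the P8 count uses the slide's "redefined core" of 12*8:

```python
# Die areas and core counts as listed on the slide.
chips = {
    "NVIDIA GP100": {"area_mm2": 610, "cores": 3840},
    "IBM P8": {"area_mm2": 650, "cores": 12 * 8},  # slide's "redefined core" count
}


def area_per_core(chip):
    """Die area divided by core count, in mm^2 per core."""
    return chip["area_mm2"] / chip["cores"]


gp100_area = area_per_core(chips["NVIDIA GP100"])
p8_area = area_per_core(chips["IBM P8"])
area_ratio = p8_area / gp100_area

print(f"GP100: {gp100_area:.3f} mm^2/core")  # 0.159
print(f"P8:    {p8_area:.2f} mm^2/core")     # 6.77
print(f"Ratio: {area_ratio:.1f}x")           # 42.6x, i.e. roughly 40x
```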


GP100: Highly Evolved Throughput Engine

Not a Sea of CPUs

Mostly Computation

Tiny Caches

14MB Register File

Single Instruction Multiple Thread (SIMT)

Not vectors

manages divergence

SW Coherence

Programming model

Compilers
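
As a toy illustration of what "manages divergence" means for SIMT: when threads that run in lockstep take different sides of a branch, each path is executed in turn with the off-path threads masked out. The Python below is only a sketch of that idea; `simt_if_else` and its arguments are hypothetical names, and it does not model GP100's actual scheduling hardware.

```python
def simt_if_else(values, predicate, then_op, else_op):
    """Run an if/else over a lockstep group of threads using an active mask.

    Returns the per-thread results and the number of passes the group needed.
    """
    mask = [predicate(v) for v in values]
    results = list(values)
    passes = 0

    # Pass over the "then" path: only threads whose predicate is true are active.
    if any(mask):
        passes += 1
        for i, active in enumerate(mask):
            if active:
                results[i] = then_op(values[i])

    # Pass over the "else" path: the remaining threads are active.
    if not all(mask):
        passes += 1
        for i, active in enumerate(mask):
            if not active:
                results[i] = else_op(values[i])

    return results, passes


def is_even(v):
    return v % 2 == 0


# Divergent branch: even and odd threads take different paths, so two passes.
diverged, diverged_passes = simt_if_else(
    list(range(8)), is_even, lambda v: v * 2, lambda v: v + 100
)

# Uniform branch: every thread takes the same path, so one pass.
uniform, uniform_passes = simt_if_else(
    [0, 2, 4, 6], is_even, lambda v: v * 2, lambda v: v + 100
)

print(diverged, diverged_passes)
print(uniform, uniform_passes)
```

Each thread keeps its own control flow, which is the contrast with a vector programming model where the programmer manages the lanes explicitly.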


BASIC PATTERN OF COMPUTATION

Parallel

“Hurry Up”

Serial

“Wait”

Parallel

“Hurry Up”

Serial

“Wait”

Serial Sections
• Parallel Resources Wait
• Resource Utilization Low

Push Latency When Serial

Parallel Sections
• Every Unit Active

Push Efficiency When Parallel


ARCHITECTURAL OPTIONS

Sea of Latency Processors

Poor Throughput

Sea of Throughput Processors

Poor Utilization

Heterogeneous Mix

Possibly Perfect, But:

Processor Balance

Processor Transitions

I/O Balance
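
The three options can be compared with a simple serial/parallel time model. This is a sketch under stated assumptions, not a measurement: the 10x parallel rate is taken from the low end of the 10-30x Ops range quoted later in the talk, while the 0.25x serial rate for a throughput processor and the 10% serial fraction are purely hypothetical. The model also ignores balance, transition and I/O costs.

```python
def run_time(serial_fraction, serial_rate, parallel_rate):
    """Relative run time when serial and parallel work proceed at the given rates."""
    return serial_fraction / serial_rate + (1.0 - serial_fraction) / parallel_rate


# (serial_rate, parallel_rate) relative to a sea of latency processors.
# All values are illustrative assumptions.
options = {
    "sea of latency processors": (1.0, 1.0),
    "sea of throughput processors": (0.25, 10.0),
    "heterogeneous mix": (1.0, 10.0),
}

serial_fraction = 0.1  # hypothetical workload: 10% serial, 90% parallel

times = {
    name: run_time(serial_fraction, serial_rate, parallel_rate)
    for name, (serial_rate, parallel_rate) in options.items()
}

for name, t in times.items():
    print(f"{name}: {t:.2f}")
```

With these assumed numbers the relative times come out as 1.00, 0.49 and 0.19: the throughput-only system spends most of its time in the serial sections, which is the poor utilization the slide refers to.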


BALANCE: DGX-1 NODE FOR DEEP LEARNING

Balance: 2 CPUs, 8 GPUs, 4 NICs

Flexible Balance via I/O Fabric Switches

PCIe limits bandwidth

Future: Disaggregate? Ick?


BALANCE: SUMMIT NODE

2 CPUs, 6 GPUs, 2 NICs

I/O via CPU fixes ratio

Very high bandwidth CPU/GPU Connection

Flexible Ratio

Future: NICs closer to GPUs?

[Diagram: Summit node in two halves. Each half: one P9 CPU with 256GB DDR4, one NIC, and three Volta GPUs with 16GB HBM2 each, connected by Coherent NVLINK2]


LOC/TOC PROCESSING TRANSITIONS

Dependency Loop Bottlenecks:

Bandwidth:

CPU Scale (1x)

=> Fast Fabric

=> See IBM Minsky

Latency:

Interesting Work in Progress

=> Hide Dependent Latencies!

[Diagram: Basic Computational Loop. GPUs – Throughput (10-30x Ops); CPU – Latency (1x Ops); Stimulus and Response form the Dependency Loop]

NVLINK is Fast Enough™


FAT VS THIN NODES

Property | THIN (1X) | FAT (10X)
Node Count | 10X | 1X
Flat Hierarchy | Y | N (But Hierarchy can be Hidden)
Memory per Node (Flat System Total) | 1X | 10X
NIC BW Per System (Surface/Volume) | 100% | 3D: 46%, 5D: 60%
Targets for All to All | 10X | 1X


FAT NODE NETWORK I/O

Some I/O Hidden Inside Fast Efficient Local Network

8 Thin Nodes vs. 2 Fat Nodes

% of I/O Hidden Inside Fat Nodes Depends on Dimensionality of Problem
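
To make the dependence on dimensionality concrete, the sketch below counts nearest-neighbor links for 8 thin nodes grouped into 2 fat nodes, arranged as a 1D chain, a 2D grid and a 3D cube. The arrangements, the non-periodic grid and the nearest-neighbor traffic pattern are assumptions chosen for illustration; `hidden_link_fraction` is a hypothetical helper, not something from the talk.

```python
from itertools import product


def hidden_link_fraction(grid_shape, block_shape):
    """Fraction of nearest-neighbor links that stay inside a fat node.

    grid_shape:  thin nodes per dimension (non-periodic grid).
    block_shape: thin nodes per fat node in each dimension; assumed to
                 divide grid_shape evenly.
    """
    internal = 0
    external = 0
    for node in product(*(range(n) for n in grid_shape)):
        for axis in range(len(grid_shape)):
            if node[axis] + 1 >= grid_shape[axis]:
                continue  # no neighbor in the +axis direction at the grid edge
            neighbor = list(node)
            neighbor[axis] += 1
            # Two thin nodes share a fat node if their block coordinates match.
            same_fat_node = all(
                a // b == c // b for a, c, b in zip(node, neighbor, block_shape)
            )
            if same_fat_node:
                internal += 1
            else:
                external += 1
    return internal / (internal + external)


# 8 thin nodes grouped into 2 fat nodes of 4, in 1, 2 and 3 dimensions.
hidden_1d = hidden_link_fraction((8,), (4,))
hidden_2d = hidden_link_fraction((4, 2), (2, 2))
hidden_3d = hidden_link_fraction((2, 2, 2), (2, 2, 1))

for label, value in (("1D", hidden_1d), ("2D", hidden_2d), ("3D", hidden_3d)):
    print(f"{label}: {value:.0%} of links hidden inside fat nodes")
```

In this toy setup the hidden share falls as the dimensionality rises (6 of 7 links in 1D, 8 of 10 in 2D, 8 of 12 in 3D), which is the trend the slide describes.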


FAT NODE I/O & DIMENSIONAL GRIDS

[Chart: External I/O Requirements (0% to 100%) at x-axis values 1, 4, 8, 16, 32, one curve per dimensionality from 1D to 6D]

Ideal Node Surface (Network) vs Node Volume (Capacity) in 1-6 Dimensions
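
One way to reproduce numbers of this kind is an ideal surface-to-volume scaling: if a fat node holds k thin nodes' worth of capacity in d dimensions, its surface grows as k^((d-1)/d), so its external I/O relative to k separate thin nodes is k^(-1/d). The sketch below assumes that model; it is not confirmed to be the exact formula behind the chart, and `external_io_fraction` is an illustrative name.

```python
def external_io_fraction(fatness, dims):
    """External I/O of one fat node relative to `fatness` thin nodes.

    Assumes ideal surface/volume scaling: fraction = fatness ** (-1 / dims).
    """
    return fatness ** (-1.0 / dims)


sizes = (1, 4, 8, 16, 32)
print("      " + "  ".join(f"{k:>4}" for k in sizes))
for dims in range(1, 7):
    row = [external_io_fraction(k, dims) for k in sizes]
    print(f"{dims}D:   " + "  ".join(f"{x:4.0%}" for x in row))

# Check against the 10X fat node figures quoted on the FAT VS THIN NODES slide.
fat10_3d = external_io_fraction(10, 3)
fat10_5d = external_io_fraction(10, 5)
print(f"10X fat node, 3D: {fat10_3d:.0%}")
print(f"10X fat node, 5D: {fat10_5d:.0%}")
```

Under this assumption a 10X fat node needs about 46% of the thin-node NIC bandwidth in 3D, matching the figure quoted earlier, and about 63% in 5D, close to the 60% quoted there.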


FAT NODE I/O & GLOBAL HASHING

Applications: Crypto, Gene Splicing

Characteristic Operation: Atomic to Random System Location

Problem: Tiny Global References Never Local to Node, However Fat

Solution: Batch & Sort Updates by Target

10x Fat Nodes have 10x fewer Targets, 10x Fewer Remote Operations

Fat Targets can sort local operations for memory locality

Fat Nodes Increase Target Locality
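
The "Batch & Sort Updates by Target" idea can be sketched as follows. This is a minimal single-process model, assuming a global table split evenly across nodes; the table size, the update stream and the names `batch_updates` and `apply_batches` are hypothetical, and no real network or atomic operation is involved.

```python
import random
from collections import defaultdict


def batch_updates(updates, slots_per_node):
    """Group (global_slot, delta) updates by target node, sorted by local slot."""
    batches = defaultdict(list)
    for global_slot, delta in updates:
        node, local_slot = divmod(global_slot, slots_per_node)
        batches[node].append((local_slot, delta))
    for batch in batches.values():
        batch.sort()  # address order gives the target better memory locality
    return dict(batches)


def apply_batches(batches, num_nodes, slots_per_node):
    """Apply each batch at its target node; one batch stands in for one message."""
    memory = [[0] * slots_per_node for _ in range(num_nodes)]
    for node, batch in batches.items():
        for local_slot, delta in batch:
            memory[node][local_slot] += delta
    return memory


TOTAL_SLOTS = 1000  # hypothetical global table size
random.seed(0)
updates = [(random.randrange(TOTAL_SLOTS), 1) for _ in range(200)]

# Same table, two layouts: 100 thin nodes of 10 slots vs 10 fat nodes of 100 slots.
thin_batches = batch_updates(updates, slots_per_node=10)
fat_batches = batch_updates(updates, slots_per_node=100)

thin_memory = apply_batches(thin_batches, num_nodes=100, slots_per_node=10)
fat_memory = apply_batches(fat_batches, num_nodes=10, slots_per_node=100)

print("targets with thin nodes:", len(thin_batches))
print("targets with fat nodes: ", len(fat_batches))
```

Because each fat node covers the slots of ten thin nodes, the fat layout can never have more distinct targets than the thin one, and each target receives a longer, sorted batch.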


FAT NODE DIRECTION

Scaling Fat Nodes Introduces New Problems: Coherence, Wire Length, Locality

Questions:

What is a node? What matters? What can be jettisoned?

Coherency Domain? Operating System? Physical Volume?

What is optimal size for a node?

What does physics have to say?

Challenges, not answers

Fat Node Redefinition


SUMMARY

Throughput Processors Fundamentally More Efficient than Latency Processors

Heterogeneous Systems More Efficient than Homogeneous Systems

Heterogeneous LOC/TOC Transition

Bandwidth Solved (Summit/Sierra)

Watch Dependent Latency

Fat Nodes

Increase Node Memory

Increase Network Efficiency
