Transcript
Page 1:

From Here to ExaScale: Challenges and Potential Solutions

Bill Dally

Chief Scientist, NVIDIA

Bell Professor of Engineering, Stanford University

Page 2:

Two Key Challenges

Programmability: Writing an efficient parallel program is hard

Strong scaling required to achieve ExaScale

Locality required for efficiency

Power: 1-2 nJ/operation today

20 pJ/operation required for ExaScale

Dominated by data movement and overhead

Other issues (reliability, memory bandwidth, etc.) are subsumed by these two or are less severe
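
For context: at 20 pJ per operation, sustaining 10^18 operations per second dissipates 20 pJ/op x 10^18 op/s = 20 MW, roughly the commonly cited power envelope for an ExaScale system; at today's 1-2 nJ/op the same rate would require 1-2 GW.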

Page 3:

ExaScale Programming

Page 4:

Fundamental and Incidental Obstacles to Programmability

Fundamental: Expressing 10^9-way parallelism

Expressing locality to deal with >100:1 global:local energy

Balancing load across 10^9 cores

Incidental: Dealing with multiple address spaces

Partitioning data across nodes

Aggregating data to amortize message overhead

Page 5:

The fundamental problems are hard enough. We must eliminate the incidental ones.

Page 6:

Very simple hardware can provide

Shared global address space (PGAS): no need to manage multiple copies with different names

Fast and efficient small (4-word) messages: no need to aggregate data to make KByte messages

Efficient global block transfers (with gather/scatter): no need to partition data by “node”

Vertical locality is still important
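
To make the small-message point concrete, here is a minimal, self-contained sketch (not from the talk) of what a hypothetical 4-word active-message primitive might look like. AmMsg, am_send, and remote_add are invented names, and am_send is stubbed to run the handler locally so the example executes on a single node; the point is that a remote update can be sent directly instead of being batched into kilobyte-scale messages.

    // A 4-word active message: a handler plus three 64-bit argument words.
    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    struct AmMsg {                                   // exactly four 64-bit words
        void (*handler)(uint64_t, uint64_t, uint64_t);
        uint64_t arg[3];
    };

    // Stand-in for the assumed hardware send: here it simply runs the handler
    // locally so the sketch is runnable on one node.
    void am_send(int /*node*/, const AmMsg& m) {
        m.handler(m.arg[0], m.arg[1], m.arg[2]);
    }

    // Handler executed at the node that owns the address: accumulate a double.
    void remote_add(uint64_t addr, uint64_t value_bits, uint64_t /*unused*/) {
        double v;
        std::memcpy(&v, &value_bits, sizeof v);
        *reinterpret_cast<double*>(addr) += v;
    }

    int main() {
        double force_slot = 0.0;                     // imagine this lives on another node
        double contribution = 2.5;
        uint64_t bits;
        std::memcpy(&bits, &contribution, sizeof bits);
        am_send(/*node=*/1, {remote_add, {reinterpret_cast<uint64_t>(&force_slot), bits, 0}});
        std::printf("force_slot = %g\n", force_slot); // prints 2.5
    }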

Page 7:

A Layered Approach to Fundamental Programming Issues

Hardware mechanisms for efficient communication, synchronization, and thread management

Programmer limited only by fundamental machine capabilities

A programming model that expresses all available parallelism and locality

hierarchical thread arrays and hierarchical storage

Compilers and run-time auto-tuners that selectively exploit parallelism and locality

Page 8:

Execution Model

[Diagram: threads and objects (A, B) in a shared Global Address Space on top of an Abstract Memory Hierarchy; a thread reaches data by Load/Store, by sending an Active Message to run work where the data lives, or by a Bulk Transfer.]

Page 9:

Thread array creation, messages, block transfers, collective operations – at the “speed of light”

Page 10:

Language Describes all Parallelism and Locality – not mapping

forall molecule in set {                      // launch a thread array
    forall neighbor in molecule.neighbors {   // nested
        forall force in forces {
            molecule.force = reduce_sum(force(molecule, neighbor))
        }
    }
}
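
As one possible simplified mapping (an illustration, not the mapping the talk prescribes), the outer forall could become one CUDA thread per molecule, with the tools left free to decide how the inner loops are executed; pair_force, the neighbor-list layout, and the single force term below are placeholders invented to keep the sketch self-contained.

    #include <cuda_runtime.h>

    // Placeholder pairwise force (roughly inverse-square) so the kernel compiles.
    __device__ float3 pair_force(float3 a, float3 b) {
        float3 d = make_float3(b.x - a.x, b.y - a.y, b.z - a.z);
        float r2 = d.x * d.x + d.y * d.y + d.z * d.z + 1e-6f;
        float s  = rsqrtf(r2) / r2;
        return make_float3(d.x * s, d.y * s, d.z * s);
    }

    // One CUDA thread per molecule ("forall molecule in set").
    __global__ void compute_forces(const float3* pos, const int* neighbors,
                                   const int* num_neighbors, int max_neighbors,
                                   float3* force, int num_molecules)
    {
        int m = blockIdx.x * blockDim.x + threadIdx.x;
        if (m >= num_molecules) return;

        float3 f = make_float3(0.f, 0.f, 0.f);
        for (int j = 0; j < num_neighbors[m]; ++j) {   // "forall neighbor"
            int n = neighbors[m * max_neighbors + j];
            float3 t = pair_force(pos[m], pos[n]);     // "forall force" collapsed to one term
            f.x += t.x; f.y += t.y; f.z += t.z;        // reduce_sum
        }
        force[m] = f;
    }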

Page 11:

Language Describes all Parallelism and Locality – not mapping

compute_forces::inner(molecules, forces) {
    tunable N;
    set part_molecules[N];
    part_molecules = subdivide(molecules, N);

    forall (i in 0:N-1) {
        compute_forces(part_molecules[i]);
    }
}
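
A rough host-side sketch of how an auto-tuner might treat the tunable N: try candidate subdivision factors, time each variant, and keep the fastest. This is not the talk's toolchain; compute_forces_with_subdivision is a dummy stand-in defined only so the sketch runs.

    #include <chrono>
    #include <cstdio>

    // Stand-in for the generated code variant with subdivision factor N; a real
    // tuner would run the actual compute_forces mapping on representative input.
    static double compute_forces_with_subdivision(int N) {
        volatile double acc = 0.0;
        for (int i = 0; i < (1 << 20); ++i) acc += 1.0 / (i % (N + 1) + 1);
        return acc;
    }

    // Exhaustively try a small set of candidate values for the tunable N,
    // timing each one and keeping the fastest (iterative-compilation style).
    static int tune_N() {
        const int candidates[] = {1, 2, 4, 8, 16, 32, 64};
        int best_N = candidates[0];
        double best_time = 1e300;
        for (int N : candidates) {
            auto t0 = std::chrono::steady_clock::now();
            compute_forces_with_subdivision(N);
            auto t1 = std::chrono::steady_clock::now();
            double dt = std::chrono::duration<double>(t1 - t0).count();
            std::printf("N = %2d: %.4f s\n", N, dt);
            if (dt < best_time) { best_time = dt; best_N = N; }
        }
        return best_N;
    }

    int main() { std::printf("best N = %d\n", tune_N()); }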

Page 12:

Autotuning Search Spaces

T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle, "Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation," in IEEE PACT, pp. 237-248, 2000.

Execution Time of Matrix Multiplication for Unrolling and Tiling

Architecture enables simple and effective autotuning

Page 13:

Performance of Auto-tuner

                      Conv2D   SGEMM   FFT3D   SUmb
Cell             Auto   96.4     129      57   10.5
                 Hand     85     119      54      -
Cluster          Auto   26.7    91.3     5.5   1.65
                 Hand     24      90     5.5      -
Cluster of PS3s  Auto   19.5    32.4    0.55   0.49
                 Hand     19      30    0.23      -

Measured Raw Performance of Benchmarks: auto-tuner vs. hand-tuned version in GFLOPS.

For FFT3D, performance is with fusion of leaf tasks.

SUmb is too complicated to be hand-tuned.

Page 14:

What about legacy codes?

They will continue to run – faster than they do now

But… they don't have enough parallelism to begin to fill the machine

Their lack of locality will cause them to bottleneck on global bandwidth

As they are ported to the new model:
The constituent equations will remain largely unchanged

The solution methods will evolve to the new cost model

Page 15:

The Power Challenge

Page 16:

Addressing The Power Challenge (LOO)

Locality: Bulk of data must be accessed from nearby memories (2 pJ), not across the chip (150 pJ), off chip (300 pJ), or across the system (1 nJ)

Application, programming system, and architecture must work together to exploit locality

Overhead: Bulk of execution energy must go to carrying out the operation, not scheduling instructions (100x today)

Optimization: At all levels, to operate efficiently

Page 17:

Locality

Page 18:

The High Cost of Data Movement: Fetching operands costs more than computing on them

[Figure: energies on a 20 mm die in a 28 nm process. 64-bit DP operation: 20 pJ. 256-bit access to an 8 kB SRAM: 50 pJ. Moving 256 bits over on-chip buses: 26 pJ over a short distance, roughly 256 pJ across the 20 mm chip. Efficient off-chip link: 500 pJ. DRAM read/write: 16 nJ. Roughly 1 nJ to move data across the system.]
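
Reading the figure's numbers: the 64-bit operation itself costs about 20 pJ, a 256-bit operand access from even a local 8 kB SRAM costs about 50 pJ, moving those bits across the 20 mm die costs on the order of 256 pJ (more than 10x the arithmetic), and a DRAM read or write costs about 16 nJ (roughly 800x).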

Page 19:

Scaling makes locality even more important

Page 20:

It's not about the FLOPS

It's about data movement

Algorithms should be designed to perform more work per unit data movement.

Programming systems should further optimize this data movement.

Architectures should facilitate this by providing an exposed hierarchy and efficient communication.

Page 21:

Locality at all Levels

Application: Do more operations if it saves data movement

E.g., recompute values rather than fetching them

Programming system: Optimize subdivision

Choose when to exploit spatial locality with active messages

Choose when to compute vs. fetch

Architecture: Exposed storage hierarchy

Efficient communication and bulk transfer
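
A small sketch of the kind of cost-model reasoning this slide calls for, using the approximate energies quoted earlier in the talk; the constants, the should_recompute helper, and the example operation counts are illustrative assumptions only.

    #include <cstdio>

    // Rough per-operation and per-fetch energies in picojoules (order of magnitude).
    constexpr double PJ_OP_LOCAL   = 20.0;    // one 64-bit DP operation
    constexpr double PJ_FETCH_CHIP = 150.0;   // fetch from across the chip
    constexpr double PJ_FETCH_OFF  = 300.0;   // fetch from off chip
    constexpr double PJ_FETCH_SYS  = 1000.0;  // fetch from across the system

    // Recompute when rebuilding the value arithmetically costs less energy
    // than moving it from where it currently lives.
    constexpr bool should_recompute(int ops_to_recompute, double fetch_cost_pj) {
        return ops_to_recompute * PJ_OP_LOCAL < fetch_cost_pj;
    }

    static_assert(should_recompute(5, PJ_FETCH_OFF),    "100 pJ of math beats a 300 pJ fetch");
    static_assert(!should_recompute(10, PJ_FETCH_CHIP), "200 pJ of math loses to a 150 pJ fetch");

    int main() {
        std::printf("recompute 40 ops vs. a system-wide fetch? %s\n",
                    should_recompute(40, PJ_FETCH_SYS) ? "yes" : "no");  // 800 pJ < 1000 pJ: yes
    }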

Page 22:

System Sketch

Page 23:

Echelon Chip Floorplan

[Floorplan diagram: a die roughly 17 mm on a side (~290 mm^2) in a projected 10 nm process, tiled with throughput-optimized SMs (each built from multiple lanes) grouped in fours around NOC routers, plus a small number of latency-optimized cores (LOCs); L2 banks are reached through a crossbar (XBAR); DRAM I/O and network (NW) I/O sit along the chip edges.]

Page 24:

Overhead

Page 25:

(Slide: Milad Mohammadi, 4/11/11)

An Out-of-Order Core: Spends 2 nJ to schedule a 50 pJ FMA (or a 0.5 pJ integer add)
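
For scale: 2 nJ / 50 pJ = 40x the energy of the FMA is spent scheduling it, and 2 nJ / 0.5 pJ = 4,000x the energy of the integer add.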

Page 26:

SM Lane Architecture

[Lane diagram: a control path (L0 I$, thread PCs and active PCs, scheduler) issues instructions to a data path of two FP/Int units and an LS/BR unit, each fed from a small operand register file (ORF) backed by the main RF; loads/stores go through L0/L1 address stages and a network to four local-memory banks (LM Bank 0-3) and the LD/ST path.]

Per lane: 64 threads, 4 active threads; 2 DFMAs (4 FLOPS/clock); ORF bank: 16 entries (128 Bytes); L0 I$: 64 instructions (1 KByte); LM Bank: 8 KB (32 KB total)

Page 27:

Optimization

Page 28:

Optimization needed at all levels, guided by where most of the power goes

Circuits: Optimize VDD, VT

Communication circuits – on-chip and off

Architecture: Grocery-list approach – know what each operation costs

Example: temporal SIMT, an evolution of the classic vector architecture

Programming systems: Tuning for particular architectures

Macro-optimization

Applications: New methods driven by the new cost equation

Page 29:

On-Chip Communication Circuits

Page 30:

Temporal SIMT

Existing Single Instruction Multiple Thread (SIMT) architectures amortize instruction fetch across multiple threads, but:

Perform poorly (and energy inefficiently) when threads diverge

Execute redundant instructions that are common across threads

Solution: Temporal SIMT
Execute the threads of a thread group in sequence on a single lane – amortize fetch
Shared registers for common values

Scalarization – amortize execution
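
A schematic, software-only sketch of the idea (added for illustration; the Lane struct and run_group function are invented): one lane steps through the four threads of a group in sequence, the group's common code is handled once per group rather than once per thread, and a value shared by all threads lives in a single scalar register.

    #include <cstdio>

    constexpr int GROUP = 4;   // threads executed in sequence on one lane

    struct Lane {
        float thread_rf[GROUP];   // per-thread registers (divergent values)
        float shared_rf;          // one copy of a value common to the whole group
    };

    void run_group(Lane& lane, const float* x, float* y, float scale) {
        // Scalarized work: computed once for the group, kept in the shared register.
        lane.shared_rf = scale * 0.5f;

        // Temporal SIMT: the same instruction sequence is replayed for each
        // thread of the group, one after another, on this single lane.
        for (int t = 0; t < GROUP; ++t) {
            lane.thread_rf[t] = x[t] * lane.shared_rf;
            y[t] = lane.thread_rf[t] + 1.0f;
        }
    }

    int main() {
        Lane lane{};
        float x[GROUP] = {1, 2, 3, 4}, y[GROUP];
        run_group(lane, x, y, /*scale=*/4.0f);
        for (float v : y) std::printf("%g ", v);   // prints: 3 5 7 9
        std::printf("\n");
    }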

[Diagram: on the left, a conventional SIMT group with one set of PCs and one I$ driving per-lane register files holding thread groups T0-3, T4-7, T8-11, and T12-15; on the right, temporal SIMT, where each lane has its own PCs and I$, runs its four threads (e.g., T0-T3) in sequence, and keeps group-common values in a shared RF alongside the per-thread RF.]

Page 31:

Solving the Power Challenge – 1, 2, 3

Page 32:

Solving the ExaScale Power Problem

[Stacked bar chart, energy per operation in pJ (0 to 2,500), for four cases: Today, Scale, Ovh, and Local; each bar is broken into Overhead, On-Chip, Off-Chip, Op, and Local components and shrinks from left to right as process scaling, overhead reduction, and locality are applied.]

Page 33:

Log Scale

[The same breakdown on a log scale (1 to 10,000 pJ), with Overhead, On-Chip, Off-Chip, Op, and Local components for Today, Scale, Ovh, and Local.]

Bars on top are larger than they appear

Page 34:

The Numbers (pJ)

Page 35:

CUDA GPU Roadmap

[Chart: DP GFLOPS per Watt versus year (2007-2013), rising through Tesla, Fermi, Kepler, and Maxwell toward roughly 16 GFLOPS/W.]

Jensen Huang’s Keynote at GTC 2010

Page 36:

Investment Strategy

Page 37:

Do we need exotic technology? Semiconductor, optics, memory, etc.

Page 38:

Do we need exotic technology? Semiconductor, optics, memory, etc.

No, but we’ll take what we can get

… and that’s the wrong question

Page 39:

The right questions are:

Can we make a difference in core technologies like semiconductor fab, optics, and memory?

What investments will make the biggest difference (risk reduction) for ExaScale?

Page 40:

Can we make a difference in core technologies like semiconductor fab, optics, and memory?

No, there is a $100B+ industry already driving these technologies in the right direction.

The little we can afford to invest (<$1B) won’t move the needle (in speed or direction)

Page 41:

What investments will make the biggest difference (risk reduction) for ExaScale?

Look for long poles that aren’t being addressed by the data center or mobile industries.

Page 42:

What investments will make the biggest difference (risk reduction) for ExaScale?

Programming systems – they are the long pole of the tent and modest investments will make a huge difference.

Scalable, fine-grain architecture – the communication, synchronization, and thread management mechanisms needed to achieve strong scaling; conventional machines will stick with weak scaling for now.
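
For reference: strong scaling means solving a fixed-size problem faster as processors are added, while weak scaling grows the problem with the machine so that the work per processor stays roughly constant.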

Page 43:

Summary

Page 44:

ExaScale Requires Change

Programming Systems

Eliminate incidental obstacles to parallelism

Provide global address space, fast, short messages, etc.

Express all of the parallelism and locality, abstractly (not the way current codes are written)

Use tools to map these applications to different machines (performance portability)

Power

Locality: In the application, mapped by the programming system, supported by the architecture

Overhead: From 100x to 2x by building throughput cores

Optimization: At all levels

The largest challenge is admitting we need to make big changes.

This requires investment in research, not just procurements