GPUs

Dec 19, 2015
Transcript
Page 1: GPUs

Page 2: Performance Advantage of GPUs

• An enlarging peak performance advantage:
– Calculation: 1 TFLOPS vs. 100 GFLOPS
– Memory Bandwidth: 100-150 GB/s vs. 32-64 GB/s
– GPU in every PC and workstation: massive volume and potential impact

Courtesy: John Owens

Page 3: Heart of Blue Waters: Two Chips

AMD Interlagos: 157 GF peak performance
Features:
– 2.3-2.6 GHz
– 8 core modules, 16 threads
– On-chip caches: L1 (I: 8x64 KB; D: 16x16 KB), L2 (8x2 MB)
– Memory subsystem: four memory channels, 51.2 GB/s bandwidth

NVIDIA Kepler: 1,300 GF peak performance
Features:
– 15 streaming multiprocessors (SMX)
– SMX: 192 CUDA SPs, 64 DP units, 32 special function units
– L1 caches / shared memory (64 KB, 48 KB)
– L2 cache (1,536 KB)
– Memory subsystem: six memory channels, 250 GB/s bandwidth

Page 4:

[Figure: side-by-side block diagrams of a CPU (control unit, large cache, a few ALUs, DRAM) and a GPU (many simple ALUs, DRAM)]

CPUs and GPUs have fundamentally different design philosophies.

Page 5: CPUs: Latency Oriented Design

• Large caches
– Convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
• Powerful ALUs
– Reduced operation latency

[Figure: CPU block diagram with control, cache, a few ALUs, and DRAM]

Page 6: GPUs: Throughput Oriented Design

• Small caches
– To boost memory throughput
• Simple control
– No branch prediction
– No data forwarding
• Energy-efficient ALUs
– Many, long latency but heavily pipelined for high throughput
• Require a massive number of threads to tolerate latencies

[Figure: GPU block diagram with many simple ALUs and DRAM]

Page 7: Winning Applications Use Both CPU and GPU

• CPUs for sequential parts where latency matters
– CPUs can be 10+X faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
– GPUs can be 10+X faster than CPUs for parallel code
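To make this division of labor concrete, here is a minimal CUDA sketch (not from the slides; the kernel, problem size, and launch configuration are made up for illustration). The sequential, latency-sensitive setup stays on the CPU, and the data-parallel loop is offloaded to the GPU.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Illustrative kernel: the data-parallel part (one thread per element).
    __global__ void scaleKernel(float *data, float alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= alpha;     // throughput-oriented work on the GPU
    }

    int main() {
        const int n = 1 << 20;
        float *h = (float *)malloc(n * sizeof(float));

        // Sequential, latency-sensitive part stays on the CPU
        // (setup, control flow, small serial computations).
        for (int i = 0; i < n; ++i) h[i] = (float)i;

        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        // Massively parallel part runs on the GPU.
        scaleKernel<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("h[123] = %f\n", h[123]);

        cudaFree(d);
        free(h);
        return 0;
    }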

Page 8: GPU Architecture

[Figure: SIMT pipeline with a single FETCH/DECODE/ISSUE front end feeding lanes 0 through N, each with its own register file (RF) and functional unit (FU)]

Good for data-parallel code. Bad when threads diverge.
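As a rough illustration of the data-parallel case this architecture favors, here is a minimal SAXPY-style CUDA kernel (an assumed example, not part of the slides). Every thread in a warp follows the same instruction sequence, so the shared front end is used efficiently.

    // Regular, data-parallel kernel: y = a*x + y, one thread per element.
    // All lanes of a warp execute the same instructions on different data.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                 // uniform except in the very last warp
            y[i] = a * x[i] + y[i];
    }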

Page 9: Control Divergence

[Figure: two threads share the control-flow graph A → (B or C) → D; thread 0 takes B while thread 1 takes C, so the execution is serialized: A for both, then C, then B, then D for both]

Threads must execute in lockstep, but different threads execute different instruction sequences.
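A hypothetical CUDA kernel that exhibits this kind of divergence (the kernel and its branch condition are invented for illustration): threads with even and odd indices take different paths, so the warp executes one path with half its lanes masked off, then the other path with the remaining lanes masked off, before reconverging.

    __global__ void divergentKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // A: common path
        if (i >= n) return;
        if (i % 2 == 0)
            data[i] = data[i] * 2.0f;                    // B: even lanes only
        else
            data[i] = data[i] + 1.0f;                    // C: odd lanes only
        data[i] = data[i] - 0.5f;                        // D: reconverged path
    }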

Page 10: Memory Divergence

[Figure: several SMs with private L1 caches connected through an interconnect to multiple memory controllers, each with an L2 slice and DRAM; Load 0 and Load 1 from one SM go to different memory partitions]

Dependent instructions must stall until the longest memory request finishes.
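A sketch of how this looks at the source level (both kernels are assumed examples, not from the slides): a warp whose lanes read consecutive addresses needs only a few memory transactions, while a warp whose lanes read scattered addresses issues many, and any instruction that depends on the loaded values waits for the slowest transaction.

    // Coalesced: the 32 lanes of a warp touch one contiguous region,
    // typically served by a small number of memory transactions.
    __global__ void coalescedCopy(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];            // consecutive addresses per warp
    }

    // Divergent: lanes read arbitrary addresses through an index array,
    // so one warp can generate many transactions to different partitions.
    __global__ void scatteredCopy(const float *in, float *out,
                                  const int *idx, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[idx[i]];       // scattered addresses per warp
    }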

Page 11: Massive Parallelism Requires Regularity

Page 12: Undesirable Patterns

• Serialization due to conflicting use of critical resources
• Oversubscription of global memory bandwidth
• Load imbalance among parallel threads

Page 13: Conflicting Data Accesses Cause Serialization and Delays

• Massively parallel execution cannot afford serialization
• Contention in accessing critical data causes serialization

Page 14: A Simple Example

• A naïve inner-product algorithm for two vectors of one million elements each
– All multiplications can be done in one time unit (in parallel)
– Additions into a single accumulator take one million time units (serial)

[Figure: timeline with all multiplications happening in parallel, followed by a long serial chain of additions]
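A minimal CUDA sketch of the contrast (assumed code, not from the slides; it relies on floating-point atomicAdd and a block size of 256). The naive version funnels every addition through one accumulator, which is exactly the serialization described above; the second version reduces each block's partial products in shared memory first, so only one atomic per block reaches the shared accumulator.

    // Naive: multiplications are parallel, but every thread's addition
    // contends for the same memory location and is serialized.
    __global__ void dotNaive(const float *a, const float *b,
                             float *result, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(result, a[i] * b[i]);   // contention on one address
    }

    // Better: per-block tree reduction in shared memory, one atomic per block.
    // Assumes blockDim.x == 256 and *result initialized to 0 before launch.
    __global__ void dotReduced(const float *a, const float *b,
                               float *result, int n) {
        __shared__ float partial[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        partial[threadIdx.x] = (i < n) ? a[i] * b[i] : 0.0f;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // log2(256) = 8 steps
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) atomicAdd(result, partial[0]);
    }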

Page 15: How much can conflicts hurt?

• Amdahl's Law
– If a fraction X of a computation is serialized, the speedup cannot be more than 1/X
• In the previous example, X = 50%
– Half the calculations are serialized
– No more than 2X speedup, no matter how many computing cores are used
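For reference, the general form of Amdahl's Law with a finite core count can be written as a tiny helper (illustrative only, not from the slides); the slide's bound of 1/X is the limit as the number of cores grows without bound.

    // Upper bound on speedup: serial_fraction is X from the slide,
    // cores is the number of processors working on the parallel part.
    static double amdahlSpeedup(double serial_fraction, double cores) {
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
    }
    // Example from the slide: X = 0.5 gives at most 1/0.5 = 2x speedup;
    // amdahlSpeedup(0.5, 1e6) is about 1.999998, no matter how many cores.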

Page 16: Load Balance

• The total time to complete a parallel job is limited by the thread that takes the longest to finish

[Figure: two thread-workload charts, one well balanced ("good") and one badly skewed ("bad")]

Page 17: How bad can it be?

• Assume a job takes 100 units of time for one person to finish
– If we break the job into 10 parts of 10 units each and have 10 people do it in parallel, we get a 10X speedup
– If we break the job into parts of 50, 10, 5, 5, 5, 5, 5, 5, 5, 5 units, the same 10 people take 50 units to finish, with 9 of them idle most of the time. We get no more than a 2X speedup.
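The same arithmetic can be written as a small host-side helper (illustrative only, not from the slides): the parallel completion time is the longest chunk, so the speedup is the total work divided by that longest chunk.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Speedup of a parallel job given the work assigned to each worker.
    static double speedupFromChunks(const std::vector<double> &chunks) {
        double total   = std::accumulate(chunks.begin(), chunks.end(), 0.0);
        double longest = *std::max_element(chunks.begin(), chunks.end());
        return total / longest;      // completion time equals the longest chunk
    }
    // speedupFromChunks({10,10,10,10,10,10,10,10,10,10}) == 10.0  (balanced)
    // speedupFromChunks({50,10,5,5,5,5,5,5,5,5})         ==  2.0  (imbalanced)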

Page 18: How does imbalance come about?

©Wen-mei W. Hwu and David Kirk/NVIDIA, Urbana, Illinois, August 2-5, 2010

• Non-uniform data distributions
– Highly concentrated spatial data areas
– Astronomy, medical imaging, computer vision, rendering, …
• If each thread processes the input data of a given spatial volume unit, some will do a lot more work than others

Page 19: Global Memory Bandwidth

[Figure: "Ideal" vs. "Reality" illustration of global memory access]

Page 20: Global Memory Bandwidth

• Many-core processors have limited off-chip memory access bandwidth compared to peak compute throughput
• Fermi
– 1.5 TFLOPS SPFP peak throughput
– 0.75 TFLOPS DPFP peak throughput
– 144 GB/s peak off-chip memory access bandwidth
  • 36 G SPFP operands per second
  • 18 G DPFP operands per second
– To achieve peak throughput, how many operations must a program perform for each operand?
  • 1,500/36 = ~42 SPFP arithmetic operations for each operand value fetched from off-chip memory
  • How about DPFP?
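A back-of-the-envelope sketch of that calculation (illustrative; the constants are the Fermi figures quoted above). By the same reasoning, DPFP also needs roughly 750/18 ≈ 42 operations per operand fetched, since both the peak throughput and the operand rate are halved.

    #include <cstdio>

    // Required arithmetic intensity to reach peak on the Fermi numbers above.
    int main() {
        const double bandwidth_GBps = 144.0;                  // off-chip bandwidth
        const double sp_operands_G  = bandwidth_GBps / 4.0;   // 36 G 4-byte operands/s
        const double dp_operands_G  = bandwidth_GBps / 8.0;   // 18 G 8-byte operands/s

        printf("SPFP: %.0f ops per operand\n", 1500.0 / sp_operands_G);  // ~42
        printf("DPFP: %.0f ops per operand\n",  750.0 / dp_operands_G);  // ~42
        return 0;
    }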