University of Michigan EECS PEPSC : A Power Efficient Computer for Scientific Computing. Ganesh Dasika 1 Ankit Sethia 2 Trevor Mudge 2 Scott Mahlke 2 1 – ARM R&D, Austin Tx 2 – ACAL, University of Michigan

Jan 01, 2016


Leona Chase
Page 1:

PEPSC: A Power Efficient Computer for Scientific Computing

Ganesh Dasika (1), Ankit Sethia (2), Trevor Mudge (2), Scott Mahlke (2)

1 – ARM R&D, Austin, TX
2 – ACAL, University of Michigan

Page 2:

The Efficiency of High-Performance Compute

[Figure: log-log scatter plot of performance (GFLOPs, 1 to 10,000) versus power (Watts, 1 to 1,000). Power regions: ultra-portable, portable with frequent charges, wall power, dedicated power network. Plotted processors: Pentium M, Core 2, Cortex A8, Core i7, GTX 280, GTX 295, S1070, IBM Cell, AMD 6850.]

Target Efficiency: To reach 1 petaflop, 200 kilowatts will be required.
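
As a quick check of what this target implies, dividing the two figures on the slide gives the required efficiency. The conversion to GFLOPs per watt is plain arithmetic on the slide's numbers, not a figure quoted by the authors:

```latex
\[
\frac{1\ \text{PFLOP}}{200\ \text{kW}}
  = \frac{10^{6}\ \text{GFLOPs}}{2 \times 10^{5}\ \text{W}}
  = 5\ \text{GFLOPs/W}
\]
```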

Page 3:

General-Purpose Scientific Computing

• Currently, the best performance comes from GPGPUs:
– Generalized shader pipelines
– Graphics is the 1st priority, generality the 2nd
– Power-inefficient graphics-specific hardware

• Can we improve efficiency by building a processor from the ground up?

Page 4:

Important Domains

• We take scientific computing to mean:
– Dense matrix.
– Large datasets.
– Floating-point computation intensive.

• We specifically look at:
– Communications, signal processing.
– Mathematics.
– Financial applications.

Page 5:

Inefficiency Sources

[Figure: stacked bar chart of GPU utilization (0% to 100%) for the benchmarks binOpt, black, fft, fwt, lps, lu, mc, nw, sde, and srad, broken down into Mem Stall, Control Div, Datapath stall, and Running.]

% GPU unutilized (values as listed on the slide): 37%, 35%, 55%, 60%, 54%, 90%, 42%, 95%, 43%, 37%

• No single common source, so no panacea.
• GPUs remain underutilized even with thousands of threads.
• A multi-faceted approach is needed to mitigate inefficiency.

Page 6:

PEPSC

• Prefetching rather than threading for energy efficiency.

• Wide SIMD architecture as a baseline.

• Novel datapath to eliminate datapath and divergence stalls efficiently.

Page 7:

Outline

• Datapath stalls.
• Memory stalls.
• Control stalls.
• Experiments and results.
• Conclusion.

Page 8:

Traditional SIMD

C[i] = A[i]*B[i]-B[i]-A[i]+x;

[Figure: dataflow graph for the statement above (two loads L, a multiply, two subtracts, an add with x, and a store S), and its execution on four SIMD lanes (Lane 0 to Lane 3). The operations *, -, -, + are issued across the lanes one after another, each with a write-back (WB) to the Register File.]

Page 9:

2D SIMD using FPU Chaining

[Figure: animation of the dataflow graph (L, L, *, -, -, +, x, S) being mapped onto a chain of four ALUs (ALU0 to ALU3) fed from the Register File, with a MUX that selects the result according to subgraph depth.]

No intermediate values are written back to the Register File.
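
To make the contrast with traditional SIMD concrete, the sketch below evaluates the running example C[i] = A[i]*B[i]-B[i]-A[i]+x both ways and counts register-file writes. The counting model (one vector write-back per instruction in the traditional case, one per chain in the chained case) and the function names are assumptions made for illustration; this is not the authors' simulator.

```python
# Illustrative model only: counts vector register-file (RF) write-backs for
# C[i] = A[i]*B[i] - B[i] - A[i] + x on a SIMD machine.


def traditional_simd(a, b, x):
    """Each operation is its own SIMD instruction with its own write-back."""
    rf_writes = 0
    t1 = [ai * bi for ai, bi in zip(a, b)]    # multiply
    rf_writes += 1
    t2 = [ti - bi for ti, bi in zip(t1, b)]   # first subtract
    rf_writes += 1
    t3 = [ti - ai for ti, ai in zip(t2, a)]   # second subtract
    rf_writes += 1
    c = [ti + x for ti in t3]                 # add x
    rf_writes += 1
    return c, rf_writes


def chained_simd(a, b, x):
    """The four dependent operations flow through a 4-deep chain in each lane;
    only the final result is written back (assumed model)."""
    c = [((ai * bi - bi) - ai) + x for ai, bi in zip(a, b)]
    rf_writes = 1
    return c, rf_writes
```

Under this model both versions produce the same values, while the chained version performs one write-back instead of four.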

Page 10:

2D SIMD using FPU Chaining

• Increased performance
– Fewer cycles/operation
– Fewer instructions

• Reduced power
– Fewer intermediate values => fewer RF reads/writes

• Trade-offs:
– Increased # of RF ports
– Increased area

[Figure: a 1-deep FPU (3-cycle latency) built from Preprocessor, Arithmetic, Postprocessor, and Normalizer stages, next to a 4-deep chained FPU (9-cycle latency) in which FPU0 to FPU3 each have Pre-Proc, Arithmetic, and Postprocessor stages, a MUX selects by subgraph depth, and a single Normalizer writes to the Register File. Annotations: use intermediate, non-standard values; normalize at end.]

Page 11:

SIMD Width Efficiency Trade-offs

• Efficiency improves as chain length and SIMD width increase.
• Control-flow divergence limits this efficiency.

[Figure: normalized performance/power efficiency (1.0 to 3.5) versus chain length (1 to 5) and SIMD width (8, 16, 32, 64).]

Page 12:

Dual FPU Chains

• Targets datapath stalls
• Exploits ILP among chains
• Requires:
– Extra RF write port
– Extra RF read port
– Extra normalizer stage

[Figure: the single-chain FPU datapath (FPU0 to FPU3 with Pre-Proc, Arithmetic, and Postprocessor stages, a MUX selecting by subgraph depth, and one Normalizer) next to the dual-chain datapath, in which the stages form two chains with two Normalizers connected to the Register File.]

Page 13:

Dual-FPU Speedup

[Figure: bar chart of % speedup (0% to 25%) for each benchmark (binOpt, black, fft, fwt, lps, lu, mc, nw, sde, srad) and the mean. Legend entries: Dual FPU Chains, Single FPU chain; chain depths: 2-deep, 4-deep, 5-deep.]

By exploiting ILP among chains, an average speedup of 18% was obtained.

Page 14:

Memory Stalls

• GPUs use massive multithreading to hide memory stalls:
– Inefficient in terms of static power.
– GPUs already have several MBs of register file.

• Caches + prefetching:
– Reduce the # of thread contexts.
– Work across different memory latencies.

Page 15:

Weighted Average Degree

• Significant variation between benchmarks.
• Single-stride is insufficient.
• Significant variation in degree between different loops.

[Figure: bar chart of weighted average degree (0 to 7) for the benchmarks binOpt, black, fft, fwt, lps, lu, mc, nw, sde, and srad.]

Page 16:

Dynamic Degree Prefetcher

Prefetch Table (example entry):

Load PC    Address     Stride   Confidence   Degree
0x253ad    0xfcface    8        2            2

[Figure: prefetcher datapath with the current PC and the miss address as inputs, the Prefetch Table, subtract, compare, and add logic, and Prefetch Address and Prefetch Enable outputs feeding the Prefetch Queue.]

• Create an entry in the PF table based on the last two requests.
• On every subsequent miss, re-check the stride and improve confidence.
• Increase the degree if a miss in the cache is already queued in the PFQ.
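
The three update rules on this slide can be sketched as a small software model. The confidence threshold, the maximum degree, the queue size, the reset policy on a stride change, and every class and method name below are assumptions chosen for illustration; the slides do not specify them. The queue here only ages entries out when full, whereas real hardware would also retire a prefetch once its data arrives.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Entry:
    address: int          # last miss address seen for this load PC
    stride: int           # difference between the last two miss addresses
    confidence: int = 0   # how often the stride has been confirmed
    degree: int = 1       # how many strides ahead to prefetch


class DynamicDegreePrefetcher:
    CONF_THRESHOLD = 2    # assumed value
    MAX_DEGREE = 8        # assumed value

    def __init__(self, queue_size=16):
        self.table = {}                       # prefetch table, indexed by load PC
        self.first_miss = {}                  # PC -> address of its first miss
        self.pfq = deque(maxlen=queue_size)   # prefetch queue (issued addresses)

    def on_miss(self, pc, addr):
        """Handle one cache miss and return the prefetch addresses issued."""
        entry = self.table.get(pc)
        if entry is None:
            if pc in self.first_miss:
                # Second request from this PC: create the table entry.
                prev = self.first_miss.pop(pc)
                self.table[pc] = Entry(address=addr, stride=addr - prev)
            else:
                self.first_miss[pc] = addr
            return []

        # Re-check the stride on every subsequent miss.
        stride = addr - entry.address
        if stride == entry.stride:
            entry.confidence += 1
        else:
            # Assumed reset policy when the stride changes.
            entry.stride, entry.confidence, entry.degree = stride, 0, 1
        entry.address = addr

        # A miss on an address that is already queued means the prefetch was
        # issued too late, so look further ahead from now on.
        if addr in self.pfq:
            entry.degree = min(entry.degree + 1, self.MAX_DEGREE)

        if entry.confidence < self.CONF_THRESHOLD:
            return []
        issued = [addr + entry.stride * k for k in range(1, entry.degree + 1)]
        self.pfq.extend(issued)
        return issued
```

Feeding the model a stream of misses with a constant stride of 8 from one load PC first builds confidence, then issues one prefetch, and then raises the degree once a miss hits an address that is still in the queue.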

Page 17:

DDP Results

• ~2.5X speedup from Degree-1.
• ~3.5X speedup from DDP.
• 80% reduction in D-cache wait latency.

[Figure: bar chart of speedup over no-prefetch (1.0 to 7.0) for Degree-1 and DDP, per benchmark (binOpt, black, fft, fwt, lps, lu, mc, nw, sde, srad) and the mean.]

Page 18:

Divergence-Folding

• Reduce control-flow overhead.
• Use dual-FPU chaining to execute opposite sides of branches concurrently.
• Support simultaneous dual-path execution.

if (cond2) a = b + c + d;
else a = e + f + g;

Vt1 = Vb + Vc [0xFF00]
Va = Vt1 + Vd [0xFF00]
Vt2 = Ve + Vf [0x00FF]
Va = Vt2 + Vg [0x00FF]

[Figure: dual FPU chain datapath (Pre., Arith., Post. stages and two Normalizers) connected to the Register File. One chain computes the "then" side (b + c + d) and the other the "else" side (e + f + g).]
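
The masked vector code on this slide can be emulated lane by lane. The sketch below is a software illustration only: the 16-lane width is inferred from the 16-bit masks, the mapping of mask bit i to lane i is an assumption, and the helper names are hypothetical. The two chains run one after the other here, whereas the slide describes them executing concurrently in hardware.

```python
WIDTH = 16  # assumed from the 16-bit masks (0xFF00, 0x00FF) on the slide


def masked_add(dst, x, y, mask):
    """Lane-wise dst = x + y, written only where the lane's mask bit is set."""
    return [xi + yi if (mask >> lane) & 1 else di
            for lane, (di, xi, yi) in enumerate(zip(dst, x, y))]


def fold_branches(cond_mask, vb, vc, vd, ve, vf, vg):
    """Compute a = b + c + d where cond_mask is set, else a = e + f + g."""
    then_mask = cond_mask
    else_mask = ~cond_mask & ((1 << WIDTH) - 1)
    zero = [0.0] * WIDTH
    # First chain: the "then" side of the branch.
    vt1 = masked_add(zero, vb, vc, then_mask)
    va = masked_add(zero, vt1, vd, then_mask)
    # Second chain: the "else" side, merged into the same destination.
    vt2 = masked_add(zero, ve, vf, else_mask)
    va = masked_add(va, vt2, vg, else_mask)
    return va
```

With `cond_mask = 0xFF00`, lanes 8 to 15 take the "then" result and lanes 0 to 7 take the "else" result, matching the complementary masks in the slide's code.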

Page 19:

Other Serializing Constructs

• Several instances of inner loops aggregating a single value.

• An FP adder tree could help mitigate this:
– log(SIMD width) depth
– Minimal area overhead with FPU chaining
– Potential for 5-6X speedup of aggregation

[Figure: adder tree built from FPU0, FPU1, and FPU2 across Lane0 to Lane3.]
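
The log(SIMD width) depth of the adder tree can be seen in a short sketch of a pairwise reduction. The function name and the power-of-two restriction are assumptions made for illustration; the slide does not describe the hardware at this level.

```python
def tree_reduce(lanes):
    """Sum a power-of-two number of lane values with a pairwise adder tree.

    Returns (total, number_of_tree_levels).
    """
    values = list(lanes)
    if not values or len(values) & (len(values) - 1):
        raise ValueError("lane count must be a power of two")
    levels = 0
    while len(values) > 1:
        # One tree level: add neighbouring pairs, halving the value count.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels
```

For 4 lanes this takes 2 levels and 3 additions, which matches the three FPUs in the slide's figure; for 32 lanes it takes 5 levels. Note that a tree sums floating-point values in a different order than a sequential loop, so the rounding of the result can differ slightly.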

Page 20:

Evaluation Methodology

• Trimaran compiler used to find chains of dependent instructions.

• Major components synthesized using Design Compiler. Power measured using PrimeTime.

• For memory system simulation, we used M5:
– L1: 512B per SIMD lane, 1-cycle latency.
– L2: 4kB per SIMD lane, 10-cycle latency.
– Memory: 200-cycle latency.

Page 21:

PEPSC Architecture

• Power-Efficient Processor for Scientific Computing
• 750MHz @ 45nm
• 32-way SIMD
• 120 GFLOPs
• ~2.1 W/core
• 1 TFLOP @ 9 cores, 19 W
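
The figures on this slide are mutually consistent, as the arithmetic below shows. The last step, which backs out the floating-point operations per lane per cycle from the peak rate, is an inference; the slides do not state how the 120 GFLOPs figure was derived, although 5 operations per lane per cycle would be consistent with a 5-deep FPU chain.

```latex
\[
9 \times 120\ \text{GFLOPs} = 1080\ \text{GFLOPs} \approx 1\ \text{TFLOP},
\qquad
9 \times 2.1\ \text{W} = 18.9\ \text{W} \approx 19\ \text{W}
\]
\[
\frac{120\ \text{GFLOPs}}{2.1\ \text{W}} \approx 57\ \text{GFLOPs/W},
\qquad
\frac{120 \times 10^{9}}{750 \times 10^{6} \times 32} = 5\ \text{FLOPs per lane per cycle}
\]
```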

Page 22:

Utilization Improvement on GPU

[Figure: stacked bar chart of GPU utilization (0.0% to 100.0%) for the benchmarks binOpt, black, fft, fwt, lps, lu, mc, nw, sde, and srad, broken down into Mem Stall, Control Div, Datapath stall, and Running.]

% improvement in utilization (values as listed on the slide): 5%, 21%, 47%, 85%, 41%, 370%, 29%, 1340%, 33%, 67%

On average, our techniques give a 2x increase in utilization for GPUs.
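
For reference, the arithmetic mean of the ten improvement values listed on this slide is shown below. The slide does not say how its average was computed, so treating the "2x" figure as this mean is an assumption:

```latex
\[
\frac{5 + 21 + 47 + 85 + 41 + 370 + 29 + 1340 + 33 + 67}{10}
  = \frac{2038}{10}
  = 203.8\%
\]
```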

Page 23:

Comparisons

[Figure: log-log scatter plot of performance (GFLOPs, 1 to 10,000) versus power (Watts, 1 to 1,000). Power regions: ultra-portable, portable with frequent charges, wall power, dedicated power network. Plotted points: GTX 280 peak, GTX 280 realized (with enhancements), GTX 280 realized (without enhancements), PEPSC peak, PEPSC realized.]

Page 24:

Conclusion

• GPUs are generally energy-inefficient for scientific computing.

• PEPSC addresses several sources of GPU inefficiency:
– Chained FPU datapath design.
– Dynamic prefetching reduces memory stalls.
– Hardware for handling control divergence.

• PEPSC provides a 10x efficiency improvement over modern GPUs.

Page 25:

Questions?