Page 1: Title

Auto-tuning Performance on Multicore Computers
Samuel Williams ([email protected])
Ph.D. Dissertation Talk
Committee: David Patterson (advisor and chair), Kathy Yelick, Sara McMains
Parallel Computing Laboratory, EECS (Electrical Engineering and Computer Sciences), Berkeley Par Lab

Page 2: Multicore Processors (section divider)

Page 3: Superscalar Era

- Single-thread performance scaled at ~50% per year.
- Memory bandwidth increases much more slowly, but we could add additional bits or channels.
- Lack of diversity in architecture = lack of individual tuning.
- The power wall has capped single-thread performance.

[Figure: log(performance) vs. year (1990-2005) for processor performance, memory bandwidth, and memory latency. The processor curve grows at ~50%/year while the memory curves improve far more slowly (annotations of ~25%, ~12%, 7%, and 0% per year). Takeaway: multicore is the agreed solution.]

Page 4: Superscalar Era (continued)

(Same bullets and figure as Page 3, with a new annotation: but there is no agreement on what multicore should look like.)

Page 5: Multicore Era (evolutionary approach)

- Take existing power-limited superscalar processors and add cores.
- Serial code shows no improvement.
- Parallel performance might improve at 25% per year, i.e., the improvement in power efficiency from process technology.
- DRAM bandwidth is currently improving by only ~12% per year.
- This is the Intel / AMD model.

[Figure: log(performance) vs. year (2003-2009): +0%/year (serial), +25%/year (parallel), ~12%/year (DRAM bandwidth).]

Page 6: Multicore Era (radical approach)

- A radically different approach: give up superscalar out-of-order cores in favor of small, efficient, in-order cores.
- Huge, abrupt, shocking drop in single-thread performance.
- May eliminate the power wall, allowing many cores and performance scaling of up to 40% per year.
- Troublingly, the number of DRAM channels may need to double every three years.
- Examples: Niagara, Cell, GPUs.

[Figure: log(performance) vs. year (2003-2009): +0%/year (serial), +40%/year (parallel), +12%/year (DRAM bandwidth).]

Page 7: Multicore Era (scaling down cores)

- Another option would be to reduce per-core performance (and power) every year.
- Graceful degradation in serial performance.
- Bandwidth is still an issue.

[Figure: log(performance) vs. year (2003-2009): -12%/year (serial), +40%/year (parallel), +12%/year (DRAM bandwidth).]

Page 8: Multicore Era (design space)

There are still many other architectural knobs:
- Superscalar? Dual issue? VLIW?
- Pipeline depth
- SIMD width?
- Multithreading (vertical, simultaneous)?
- Shared caches vs. private caches?
- FMA vs. MUL+ADD vs. MUL or ADD
- Clustering of cores
- Crossbars vs. rings vs. meshes

Currently there is no consensus on the optimal configuration. As a result, there is a plethora of multicore architectures.

Page 9: Computational Motifs (section divider)

Page 10: Computational Motifs

- Evolved from Phil Colella's "Seven Dwarfs" of parallel computing: numerical methods common throughout scientific computing.
  - Dense and sparse linear algebra
  - Computations on structured and unstructured grids
  - Spectral methods
  - N-body / particle methods
  - Monte Carlo
- Within each dwarf, there are a number of computational kernels.
- The Berkeley View, and subsequently the Par Lab, expanded these to many other domains in computing (embedded, SPEC, databases, games): graph algorithms, combinational logic, finite state machines, etc. They were rechristened "computational motifs."
- Each could be black-boxed into libraries or frameworks by domain experts. But how do we get good performance given the diversity of architectures?

Page 11: Auto-tuning (section divider)

Page 12: Auto-tuning (motivation)

- Given the huge diversity of processor architectures, code hand-optimized for one architecture will likely deliver poor performance on another. Moreover, code optimized for one input data set may deliver poor performance on another.
- We want a single code base that delivers performance portability across the breadth of architectures today and into the future.
- Auto-tuners are composed of two principal components: a code generator based on high-level functionality rather than parsing C, and the auto-tuner proper, which searches for the optimal parameters for each optimization.
- Auto-tuners don't invent or discover optimizations; they search through the parameter space of a variety of known optimizations.
- Auto-tuning has proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI).

Page 13: Auto-tuning (code generation)

- The code generator produces many code variants of a numerical kernel using known optimizations and transformations. For example:
  - cache blocking adds several parameterized loop nests;
  - prefetching adds parameterized intrinsics to the code;
  - loop unrolling and reordering explicitly unroll the code, and each unrolling is a unique code variant.
- Kernels can have dozens of different optimizations, some of which can produce hundreds of code variants.
- The code generators used in this work are kernel-specific and were written in Perl.

Page 14: Auto-tuning (search)

In this work, we use two search techniques:
- Exhaustive: examine every combination of parameters for every optimization (often intractable).
- Heuristics: use knowledge of the architecture or algorithm to restrict the search space.

[Figure: two plots of the parameter space for optimization A vs. the parameter space for optimization B, contrasting an exhaustive sweep of the full grid with a heuristically restricted search.]
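A minimal sketch of the exhaustive strategy over two parameters; benchmark() is an assumed helper that runs the generated variant for a parameter pair and returns its runtime:

```c
#include <float.h>

/* Assumed: times the code variant built with parameters (a, b). */
extern double benchmark(int a, int b);

void exhaustive_search(const int *as, int na,
                       const int *bs, int nb,
                       int *best_a, int *best_b)
{
    double best = DBL_MAX;
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++) {
            double t = benchmark(as[i], bs[j]); /* time one variant */
            if (t < best) {
                best = t;
                *best_a = as[i];
                *best_b = bs[j];
            }
        }
}
```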

Page 15: Outline

- Multicore SMPs
- The Roofline Model
- Auto-tuning LBMHD
- Auto-tuning SpMV
- Summary
- Future Work

Page 16: Thesis Contributions

- Introduced the Roofline Model.
- Extended auto-tuning to the structured grid motif (specifically LBMHD).
- Extended auto-tuning to multicore. This is fundamentally different from running auto-tuned serial code on multicore SMPs; the concept is applied to LBMHD and SpMV.
- Analyzed the breadth of multicore architectures in the context of auto-tuned SpMV and LBMHD.
- Discussed future directions in auto-tuning.

Page 17: Multicore SMPs (Chapter 3, section divider)

Page 18: Multicore SMP Systems

[Figure: block diagrams of the four evaluated SMPs.
- Intel Xeon E5345 (Clovertown): four dual-core chips, each pair of cores sharing a 4MB L2; two front-side buses at 10.66 GB/s into an MCH (4x64b controllers) driving 667MHz FBDIMMs at 21.33 GB/s (read) / 10.66 GB/s (write).
- AMD Opteron 2356 (Barcelona): two sockets of four Opteron cores (512KB L2 each, 2MB shared victim cache, SRI / crossbar); per socket, 2x64b controllers to 667MHz DDR2 DIMMs at 10.6 GB/s; sockets linked by HyperTransport at 4 GB/s each direction.
- Sun UltraSPARC T2+ T5140 (Victoria Falls): two sockets of eight multithreaded SPARC cores behind a crossbar (179 GB/s / 90 GB/s) and a 4MB shared L2 (16-way, 64B interleaved); per socket, 4 coherency hubs and 2x128b controllers to 667MHz FBDIMMs at 21.33 GB/s / 10.66 GB/s; sockets linked by 8 x 6.4 GB/s (one per hub per direction).
- IBM QS20 (Cell Blade): two Cell chips, each with a VMT PPE (512KB L2) and eight SPEs (256KB local store + MFC each) on the EIB ring network; XDR memory controllers to 512MB XDR DRAM at 25.6 GB/s per chip; chips linked by the BIF at <20 GB/s each direction.]

Page 19: Multicore SMP Systems (conventional cache-based memory hierarchy)

[Figure: the Page 18 diagrams, highlighting the three machines with conventional cache hierarchies: Clovertown, Barcelona, and Victoria Falls.]

Page 20: Multicore SMP Systems (local store-based memory hierarchy)

[Figure: the Page 18 diagrams, highlighting the Cell Blade, whose SPEs use 256KB software-managed local stores with MFCs (DMA engines) rather than caches.]

Page 21: Multicore SMP Systems (CMT = Chip Multithreading)

[Figure: the Page 18 diagrams, highlighting Victoria Falls: two sockets of eight multithreaded SPARC cores each.]

Page 22: Multicore SMP Systems (peak double-precision FLOP rates)

[Figure: the Page 18 diagrams annotated with peak double-precision rates: Xeon E5345 (Clovertown) 74.66 GFLOP/s, Opteron 2356 (Barcelona) 73.60 GFLOP/s, UltraSPARC T2+ T5140 (Victoria Falls) 18.66 GFLOP/s, QS20 Cell Blade 29.25 GFLOP/s.]

Page 23: Multicore SMP Systems (DRAM pin bandwidth)

[Figure: the Page 18 diagrams annotated with aggregate DRAM pin bandwidth: Clovertown 21 GB/s (read) / 10 GB/s (write), Barcelona 21 GB/s, Victoria Falls 42 GB/s (read) / 21 GB/s (write), Cell Blade 51 GB/s.]

Page 24: Multicore SMP Systems (Non-Uniform Memory Access)

[Figure: the Page 18 diagrams, highlighting the NUMA machines. Barcelona, Victoria Falls, and the Cell Blade pair each socket/chip with its own memory controllers and DRAM (linked by HyperTransport, inter-socket hub links, and the BIF, respectively); Clovertown places all memory behind a single MCH.]

Page 25: The Roofline Model (Chapter 4, section divider)

Page 26: Memory Traffic

- Total bytes to/from DRAM.
- Can be categorized into: compulsory misses, capacity misses, conflict misses, write allocations, ...
- Oblivious to any lack of sub-cache-line spatial locality.

Page 27: Arithmetic Intensity

- For the purposes of this talk, we deal with floating-point kernels.
- Arithmetic Intensity (AI) ~ total FLOPs / total DRAM bytes; this includes cache effects.
- Many interesting problems have constant AI with respect to problem size. That is bad given slowly increasing DRAM bandwidth, so bandwidth and traffic are the key optimizations.

[Figure: kernels ordered by arithmetic intensity. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods. O(log N): FFTs. O(N): dense linear algebra (BLAS3), particle methods.]

Page 28: Basic Idea

- Synthesize communication, computation, and locality into a single visually intuitive performance figure using bound-and-bottleneck analysis.
- Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.
- Moreover, it provides insight into which optimizations will potentially be beneficial.

Attainable Performance(i,j) = min( FLOP/s with optimizations 1..i, AI × bandwidth with optimizations 1..j )
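Since the model is just the minimum of two bounds, it is trivial to compute. Here is a small illustrative helper (names and example numbers are mine, not from the dissertation's tooling):

```c
#include <stdio.h>

/* Roofline bound: peak_gflops and stream_gbs are measured machine
 * ceilings; ai is the kernel's arithmetic intensity in FLOPs/byte. */
static double roofline_gflops(double peak_gflops,
                              double stream_gbs,
                              double ai)
{
    double bw_bound = ai * stream_gbs;        /* memory-bound ceiling  */
    return bw_bound < peak_gflops ? bw_bound  /* min(compute, AI x BW) */
                                  : peak_gflops;
}

int main(void)
{
    /* Example using numbers from this talk: Barcelona's ~73.6 GFLOP/s
     * peak and ~21 GB/s DRAM pin bandwidth, with SpMV at AI = 0.166. */
    printf("%.2f GFLOP/s\n", roofline_gflops(73.6, 21.0, 0.166));
    return 0;
}
```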

Page 29: Constructing a Roofline Model (computational ceilings)

- Plot on a log-log scale.
- Given AI, we can easily bound performance.
- But architectures are much more complicated: we will bound performance as we eliminate specific forms of in-core parallelism.

[Figure: roofline for the Opteron 2356 (Barcelona): attainable GFLOP/s (0.5 to 256) vs. actual FLOP:byte ratio (1/8 to 16), with the "peak DP" ceiling and the "Stream Bandwidth" diagonal.]

Page 30: Constructing a Roofline Model (computational ceilings)

- Opterons have dedicated multipliers and adders. If the code is dominated by adds, then attainable performance is half of peak.
- We call these ceilings; they act like constraints on performance.

[Figure: the Barcelona roofline with a "mul / add imbalance" ceiling below peak DP.]

Page 31: Constructing a Roofline Model (computational ceilings)

- Opterons have 128-bit datapaths. If instructions aren't SIMDized, attainable performance will be halved.

[Figure: the Barcelona roofline, adding a "w/out SIMD" ceiling.]

Page 32: Constructing a Roofline Model (computational ceilings)

- On Opterons, floating-point instructions have a 4-cycle latency. If we don't express 4-way ILP, performance will drop by as much as 4x.

[Figure: the Barcelona roofline, adding a "w/out ILP" ceiling.]

Page 33: Constructing a Roofline Model (communication ceilings)

- We can perform a similar exercise, taking away parallelism from the memory subsystem.

[Figure: the Barcelona roofline with the bare "Stream Bandwidth" diagonal.]

Page 34: Constructing a Roofline Model (communication ceilings)

- Explicit software prefetch instructions are required to achieve peak bandwidth.

[Figure: the Barcelona roofline, adding a "w/out SW prefetch" bandwidth ceiling.]

Page 35: Constructing a Roofline Model (communication ceilings)

- Opterons are NUMA. As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.
- We could continue this by examining strided or random memory access patterns.

[Figure: the Barcelona roofline, adding a "w/out NUMA" bandwidth ceiling.]

Page 36: Constructing a Roofline Model (computation + communication)

- We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

[Figure: the full Barcelona roofline with all computational ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP) and all bandwidth ceilings (Stream Bandwidth, w/out SW prefetch, w/out NUMA).]

Page 37: Constructing a Roofline Model (locality walls)

- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

AI = FLOPs / (compulsory miss traffic)

[Figure: the full Barcelona roofline with a vertical "only compulsory miss traffic" wall.]

Page 38: Constructing a Roofline Model (locality walls)

- (Same caveats as Page 37.)

AI = FLOPs / (write allocation + compulsory miss traffic)

[Figure: the same roofline, with the wall moved left by write allocation traffic.]

Page 39: Constructing a Roofline Model (locality walls)

- (Same caveats as Page 37.)

AI = FLOPs / (capacity + write allocation + compulsory miss traffic)

[Figure: the same roofline, with the wall moved further left by capacity miss traffic.]

Page 40: Constructing a Roofline Model (locality walls)

- (Same caveats as Page 37.)

AI = FLOPs / (conflict + capacity + write allocation + compulsory miss traffic)

[Figure: the same roofline, with the wall moved further left by conflict miss traffic.]

Page 41: Roofline Models for SMPs

- Note: the multithreaded Niagara is limited by instruction mix (ceilings at 25%, 12%, and 6% FP instructions) rather than by a lack of expressed in-core parallelism.
- Clearly, some architectures are more dependent on bandwidth optimizations, while others depend on in-core optimizations.

[Figure: rooflines for the Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), UltraSPARC T2+ T5140 (Victoria Falls), QS20 Cell Blade (PPEs), and QS20 Cell Blade (SPEs). Each shows peak DP plus its relevant in-core ceilings (mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, or % FP instruction mix) and bandwidth ceilings (Stream Bandwidth, w/out SW prefetch, w/out NUMA, misaligned DMA; Clovertown instead shows bandwidth on small vs. large datasets).]

Page 42: Auto-tuning Lattice-Boltzmann Magnetohydrodynamics (LBMHD) (Chapter 6, section divider)

Page 43: Introduction to Lattice Methods

- Structured grid code with a series of time steps; popular in CFD.
- Allows for complex boundary conditions.
- No temporal locality between points in space within one time step.
- Higher-dimensional phase space: a simplified kinetic model that maintains the macroscopic quantities.
- Distribution functions (e.g., 5-27 velocities per point in space) are used to reconstruct macroscopic quantities.
- Significant memory capacity requirements.

[Figure: a 27-velocity lattice (velocities numbered 0-26) drawn on +X/+Y/+Z axes.]

Page 44: LBMHD (general characteristics)

- Plasma turbulence simulation.
- Two distributions: a momentum distribution (27 scalar components) and a magnetic distribution (15 vector components).
- Three macroscopic quantities: density, momentum (vector), and magnetic field (vector).
- Must read 73 doubles and update 79 doubles per point in space.
- Requires about 1300 floating-point operations per point in space.
- Just over 1.0 FLOPs/byte (ideal).

[Figure: the momentum distribution (27 components), the magnetic distribution (15 vector components), and the macroscopic variables, each drawn on +X/+Y/+Z axes.]
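As a quick arithmetic check of the "just over 1.0" figure, using only the numbers on this slide (and assuming no write-allocate or other extra traffic):

```latex
% 73 doubles read + 79 doubles written, 8 bytes per double:
\[
  \mathrm{AI}_{\mathrm{ideal}}
    = \frac{1300\ \text{FLOPs}}{(73 + 79) \times 8\ \text{bytes}}
    = \frac{1300}{1216}
    \approx 1.07\ \text{FLOPs/byte}
\]
```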

Page 45: LBMHD (implementation details)

- Data structure choices (see the sketch below):
  - Array of Structures: no spatial locality, strided access.
  - Structure of Arrays: a huge number of memory streams per thread, but guarantees spatial locality and unit stride, and vectorizes well.
- Parallelization: the Fortran version used MPI to communicate between nodes, a bad match for multicore. The version in this work uses pthreads for multicore (this thesis is not about innovation in the threading model or programming language); MPI is not used when auto-tuning.
- Two problem sizes: 64^3 (~330 MB) and 128^3 (~2.5 GB).
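A minimal layout sketch (illustrative sizes and names, not the dissertation's code) contrasting the two data-structure choices for a lattice with NPTS points and NV components per point:

```c
#define NV   27
#define NPTS (16 * 16 * 16)

struct point_aos { double f[NV]; };      /* Array of Structures */
static struct point_aos grid_aos[NPTS];

static double grid_soa[NV][NPTS];        /* Structure of Arrays */

/* Sweeping one component over all points: */
void scale_soa(int v, double a)
{
    for (int i = 0; i < NPTS; i++)
        grid_soa[v][i] *= a;             /* unit stride: vectorizes well */
}

void scale_aos(int v, double a)
{
    for (int i = 0; i < NPTS; i++)
        grid_aos[i].f[v] *= a;           /* stride of NV doubles */
}
```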

Page 46: SoA Memory Access Pattern

- Consider a simple D2Q9 lattice method using SoA: there are 9 read arrays and 9 write arrays, but all accesses are unit-stride.
- LBMHD has 73 read and 79 write streams per thread.

[Figure: the nine D2Q9 velocities (0,0), (+1,0), (-1,0), (0,+1), (0,-1), (+1,+1), (+1,-1), (-1,+1), (-1,-1), with the corresponding read_array[][] and write_array[][] streams along the x dimension.]

Page 47: Roofline Models for SMPs (LBMHD expectations)

- LBMHD has an AI of 0.7 on write-allocate architectures, and 1.0 on those with cache bypass or no write allocation.
- There is MUL/ADD imbalance.
- Some architectures will be bandwidth-bound, while others will be compute-bound.

[Figure: the five rooflines of Page 41 with LBMHD's arithmetic intensity marked on each.]

Page 48: LBMHD Performance (reference implementation)

- The standard cache-based implementation can easily be parallelized with pthreads.
- NUMA is implicitly exploited.
- Although scalability looks good, is performance?

[Figure: performance vs. concurrency and problem size.]

Page 49: LBMHD Performance (reference implementation, continued; repeats Page 48)

Page 50: LBMHD Performance (reference implementation vs. Roofline)

- Superscalar performance is surprisingly good given the complexity of the memory access pattern.
- Cell PPE performance is abysmal.

[Figure: the five rooflines with the reference implementation's attained performance marked.]

Page 51: LBMHD Performance (array padding)

- LBMHD touches more than 150 arrays, and most caches have limited associativity, so conflict misses are likely.
- Apply a heuristic to pad the arrays.
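A minimal sketch of the padding idea (the stagger rule here is invented for illustration; the dissertation applies its own heuristic): offset the start of each array by a different number of cache lines so the ~150 arrays do not all map onto the same cache sets.

```c
#include <stdlib.h>

#define CACHE_LINE 64

/* Hypothetical helper: allocate array `array_id` with a per-array
 * stagger.  A real implementation would also retain the base pointer
 * so the allocation can later be freed.                              */
double *alloc_staggered(size_t n_doubles, int array_id)
{
    size_t pad = (size_t)(array_id % 16) * CACHE_LINE;
    char *base = malloc(n_doubles * sizeof(double) + pad);
    return base ? (double *)(base + pad) : NULL;
}
```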

Page 52: LBMHD Performance (vectorization)

- LBMHD touches more than 150 arrays, but most TLBs have far fewer than 128 entries.
- The vectorization technique creates a vector of points that are being updated: loops are interchanged and strip-mined.
- Exhaustively search for the optimal "vector length" that balances page locality against L1 cache misses.
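A minimal strip-mining sketch (assumed shapes and a stand-in update, not the actual LBMHD loops): process VL points per component so only one component array's pages are touched at a time, with VL as the tuned vector length.

```c
#define NV 27

void update_vectorized(double *restrict dst[NV],
                       const double *restrict src[NV],
                       int npts, int VL)
{
    for (int base = 0; base < npts; base += VL) {
        int end = base + VL < npts ? base + VL : npts;
        for (int v = 0; v < NV; v++)          /* one stream at a time */
            for (int i = base; i < end; i++)  /* VL-point strip       */
                dst[v][i] = 0.5 * src[v][i];  /* stand-in for the real
                                                 collision operator   */
    }
}
```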

Page 53: LBMHD Performance (other optimizations)

- Heuristic-based software prefetching.
- Exhaustive search over low-level optimizations: loop unrolling/reordering and SIMDization.
- Cache bypass increases arithmetic intensity by 50%.
- Small TLB pages on Victoria Falls.
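A minimal sketch of the cache-bypass optimization on x86 (the function is illustrative, not the dissertation's code): non-temporal stores write results to DRAM without first reading the destination line into cache, eliminating write-allocate traffic, which is what lifts LBMHD's effective AI from ~0.7 toward 1.0. The SSE2 intrinsics used here all exist in emmintrin.h; `dst` is assumed 16-byte aligned and `n` even.

```c
#include <emmintrin.h>  /* SSE2 */

void stream_copy_scale(double *dst, const double *src, long n, double a)
{
    __m128d va = _mm_set1_pd(a);
    for (long i = 0; i < n; i += 2) {
        __m128d v = _mm_mul_pd(_mm_loadu_pd(&src[i]), va);
        _mm_stream_pd(&dst[i], v);   /* bypass the cache on the store */
    }
    _mm_sfence();                    /* order the non-temporal stores */
}
```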

Page 54: LBMHD Performance (Cell SPE implementation)

- We can write a local-store implementation and run it on the Cell SPEs.
- Ultimately, Cell's weak double-precision throughput hampers performance.

Page 55: LBMHD Performance (speedup for the largest problem)

- Auto-tuned speedups over the reference implementation: 1.6x, 4x, 3x, and 130x across the four systems, the 130x coming on the Cell Blade, where the SPE implementation replaces the abysmal PPE baseline.

Page 56: LBMHD Performance (vs. Roofline)

- Most architectures reach their roofline-bound performance.
- Clovertown's snoop filter is ineffective.
- Niagara suffers from instruction mix issues.
- Cell PPEs are latency-limited; Cell SPEs are compute-bound.

[Figure: the five rooflines with auto-tuned LBMHD performance marked.]

Page 57: LBMHD Performance (summary)

- The reference code is clearly insufficient.
- Portable C code is insufficient on Barcelona and Cell.
- Cell gets all of its performance from the SPEs, despite only 2x the area and 2x the peak DP FLOPs.

Page 58: Auto-tuning Sparse Matrix-Vector Multiplication (Chapter 8, section divider)

Page 59: Sparse Matrix-Vector Multiplication

- Sparse matrix: most entries are 0.0, so there is a performance advantage in storing and operating only on the nonzeros, at the cost of significant metadata.
- Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.
- Challenges: difficult to exploit ILP (bad for superscalar); difficult to exploit DLP (bad for SIMD); irregular memory access to the source vector; difficult to load balance.

[Figure: y = Ax with a sparse A.]
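For reference, a minimal CSR (compressed sparse row) SpMV kernel, the formulation this chapter starts from (the reference implementation on Page 62 is CSR): row_ptr has nrows+1 entries, and col_idx/vals hold the nonzeros row by row.

```c
void spmv_csr(int nrows,
              const int *restrict row_ptr,
              const int *restrict col_idx,
              const double *restrict vals,
              const double *restrict x,
              double *restrict y)
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += vals[k] * x[col_idx[k]];  /* irregular access to x */
        y[r] = sum;
    }
}
```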

Page 60: Dataset (Matrices)

- Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache.
- Subdivided them into 4 categories; rank ranges from 2K to 1M.

[Figure: the 14 matrices. Dense: a 2K x 2K dense matrix stored in sparse format. Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship. Poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, webbase. Extreme aspect ratio (linear programming): LP.]

Page 61: Roofline Models for SMPs (SpMV expectations)

- The reference SpMV implementation has an AI of 0.166, but can readily exploit FMA.
- The best we can hope for is an AI of 0.25 for non-symmetric matrices.
- All architectures are memory-bound, but some may also need in-core optimizations.

[Figure: the five rooflines of Page 41 with SpMV's arithmetic intensity marked.]

Page 62: SpMV Performance (reference implementation)

- The reference implementation is CSR.
- Simple parallelization by rows, balancing nonzeros (see the sketch below).
- No implicit NUMA exploitation.
- Despite the superscalars' use of 8 cores, they see little speedup.
- Niagara and the PPEs show near-linear speedups.

[Figure: performance per matrix.]
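A minimal sketch of the row-based partitioning described above (helper and names are mine): assign each thread a contiguous range of rows whose nonzero counts are roughly equal, by walking row_ptr toward each thread's share of the nonzeros.

```c
void partition_rows(int nrows, const int *row_ptr, int nthreads,
                    int *row_start /* nthreads+1 entries */)
{
    int nnz = row_ptr[nrows];
    row_start[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        long target = (long)nnz * t / nthreads; /* thread t's nnz goal */
        int r = row_start[t - 1];
        while (r < nrows && row_ptr[r] < target)
            r++;
        row_start[t] = r;
    }
    row_start[nthreads] = nrows;
}
```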

Page 63: SpMV Performance (reference implementation, continued; repeats Page 62)


SpMV Performance (Reference Implementation)

Roofline for a dense matrix in sparse format.
The superscalars achieve bandwidth-limited performance.
Niagara comes very close to its bandwidth limit.
Clearly, NUMA and prefetching will be essential.

[Figure: the same five Roofline plots (axes and ceilings as above), with the reference SpMV result for the dense matrix placed on each.]


SpMV Performance (NUMA and Software Prefetching)

NUMA-aware allocation is essential on memory-bound NUMA SMPs.
Explicit software prefetching can boost bandwidth and change cache replacement policies; a sketch of both follows.
The Cell PPEs are likely latency-limited.
Used exhaustive search.
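A sketch of both optimizations, assuming a first-touch page-placement policy and GCC's __builtin_prefetch; the prefetch distance of 64 nonzeros is a made-up tuning knob, exactly the kind of parameter the exhaustive search above would sweep:

    #include <omp.h>

    /* First-touch initialization: the thread that will later process a row
       block touches its pages first, so the OS places those pages in that
       thread's NUMA domain. */
    void numa_first_touch(int n, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int r = 0; r < n; r++)
            y[r] = 0.0;
    }

    /* CSR row with explicit software prefetch of upcoming nonzeros.
       Locality hint 0 requests a non-temporal fill; prefetch hints that run
       past the end of the arrays are harmless in practice. */
    double row_dot_prefetch(int lo, int hi, const int *col_idx,
                            const double *val, const double *x)
    {
        double sum = 0.0;
        for (int k = lo; k < hi; k++) {
            __builtin_prefetch(&val[k + 64], 0, 0);
            __builtin_prefetch(&col_idx[k + 64], 0, 0);
            sum += val[k] * x[col_idx[k]];
        }
        return sum;
    }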


SpMV Performance (Matrix Compression)

After maximizing memory bandwidth, the only hope is to minimize memory traffic.
Exploit: register blocking, other formats, smaller indices.
Use a traffic-minimization heuristic rather than search.
The benefit is clearly matrix-dependent.
Register blocking enables efficient software prefetching (one prefetch per cache line); a sketch follows.
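A minimal sketch of a 2x2 register-blocked (BCSR) kernel: one column index is amortized over four stored values, and the dense inner block unrolls into registers. The 2x2 shape is illustrative; the auto-tuner would pick the block size per matrix and machine:

    /* y = A*x for 2x2 BCSR. brow_ptr indexes block rows; bcol_idx holds one
       column index per 2x2 block; val stores each block row-major. */
    void spmv_bcsr_2x2(int n_block_rows,
                       const int *brow_ptr, const int *bcol_idx,
                       const double *val, const double *x, double *y)
    {
        for (int br = 0; br < n_block_rows; br++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = brow_ptr[br]; k < brow_ptr[br+1]; k++) {
                const double *b  = &val[4*k];          /* the 2x2 block    */
                const double *xx = &x[2*bcol_idx[k]];  /* matching x pair  */
                y0 += b[0]*xx[0] + b[1]*xx[1];
                y1 += b[2]*xx[0] + b[3]*xx[1];
            }
            y[2*br]     = y0;
            y[2*br + 1] = y1;
        }
    }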


SpMV Performance (Cache and TLB Blocking)

Based on limited architectural knowledge, create a heuristic to choose a good cache and TLB block size; a sketch of the flavor follows.
Hierarchically store the resultant blocked matrix.
The benefit can be significant on the most challenging matrix.
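One way such a heuristic might look: cap the column span of a cache block so the touched portion of the source vector fits in part of the cache and within a TLB-page budget. The knobs and constants here are placeholders, not the dissertation's values:

    /* Pick a column-block width (in elements of x touched) bounded by both
       a cache budget and a TLB-page budget; illustrative heuristic only. */
    int choose_col_block(long cache_bytes, long page_bytes, int max_pages)
    {
        long by_cache = (cache_bytes / 2) / (long)sizeof(double); /* half for x */
        long by_tlb   = (long)max_pages * page_bytes / (long)sizeof(double);
        return (int)(by_cache < by_tlb ? by_cache : by_tlb);
    }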


SpMV Performance (Cell SPE Implementation)

Cache blocking can be easily transformed into local-store blocking.
With a few small tweaks for DMA, we can run a simplified version of the auto-tuner on Cell: BCOO only, 2x1 and larger, always blocked (a sketch of a BCOO record follows).
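One hypothetical shape for a BCOO record; storing explicit block coordinates keeps every DMA-fetched local-store tile self-describing. The field names and widths are mine, not the dissertation's:

    /* One 2x1 register block in blocked-coordinate (BCOO) storage: explicit
       block coordinates within the current local-store tile plus its values. */
    typedef struct {
        unsigned short brow;  /* block-row within the tile    */
        unsigned short bcol;  /* block-column within the tile */
        double v[2];          /* the 2x1 block's values       */
    } bcoo_2x1_t;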


SpMV Performance (Median Speedup)


[Figure: median speedup from auto-tuning on each machine — 2.8x, 2.6x, 1.4x, and 15x.]


SpMV Performance (vs. Roofline)

Roofline for a dense matrix in sparse format.
Compression improves AI.
Auto-tuning can allow us to slightly exceed Stream bandwidth (but not pin bandwidth).
The Cell PPEs perennially deliver poor performance.

[Figure: the same five Roofline plots (axes and ceilings as above), with the auto-tuned SpMV result for the dense matrix placed on each.]


SpMV Performance (Summary)

Unlike LBMHD, SSE was unnecessary to achieve median SpMV performance.
Cell still requires a non-portable, ISA-specific implementation to achieve good performance.
Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.


Summary

Outline: Overview, Multicore SMPs, The Roofline Model, Auto-tuning LBMHD, Auto-tuning SpMV, Summary, Future Work


Summary

Introduced the Roofline Model:
- Apply bound and bottleneck analysis.
- Performance and the requisite optimizations are inferred visually.

Extended auto-tuning to multicore:
- Fundamentally different from running auto-tuned serial code on multicore SMPs.
- Applied the concept to LBMHD and SpMV.

Auto-tuning LBMHD and SpMV:
- Multicore has had a transformative effect on auto-tuning (a move from latency-limited to bandwidth-limited).
- Maximizing memory bandwidth and minimizing memory traffic is key.
- Compilers are reasonably effective at in-core optimizations, but totally ineffective at cache and memory issues.
- A library or framework is a necessity in managing these issues.

Comments on architecture:
- Ultimately, machines are bandwidth-limited without new algorithms.
- Architectures with caches required significantly more tuning than the local store-based Cell.


Future Directions in Auto-tuning
Chapter 9

Outline: Overview, Multicore SMPs, The Roofline Model, Auto-tuning LBMHD, Auto-tuning SpMV, Summary, Future Work


Future Work (Roofline)

Automatic generation of Roofline figures: kernel-oblivious; select the computational metric of interest; select the communication channel of interest; designate common "optimizations"; requires a benchmark.
Using performance counters to generate runtime Roofline figures: given a real kernel, we wish to understand the bottlenecks to performance; a much friendlier visualization of performance counter data (a sketch of the arithmetic follows).
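A sketch of the arithmetic such a runtime figure needs, assuming the counters supply total flops, DRAM bytes, and wall-clock time; the counter plumbing is machine-specific and omitted:

    /* Place a measured kernel on the Roofline:
       AI = flops / DRAM bytes; achieved rate = flops / seconds. */
    typedef struct { double ai; double gflops; } roofline_point_t;

    roofline_point_t roofline_point(double flops, double dram_bytes, double secs)
    {
        roofline_point_t p;
        p.ai     = flops / dram_bytes;
        p.gflops = flops / secs * 1e-9;
        return p;
    }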


Future Work (Making Search Tractable)

[Figure: two plots over the parameter space for optimization A vs. the parameter space for optimization B.]

Given the explosion in optimizations, exhaustive search is clearly not tractable. Moreover, heuristics require extensive architectural knowledge.
In our SC08 work, we tried a greedy approach (one optimization at a time); a sketch follows.
We could make it iterative, or we could make it look like steepest descent (with some local search).
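A minimal sketch of that greedy loop; the benchmark() hook and the shape of the parameter spaces are placeholders:

    /* Greedy auto-tuning: sweep one optimization's parameter values at a
       time, lock in the winner, then move on. Re-running the outer loop
       until nothing improves gives the iterative/steepest-descent variant. */
    #define NOPT 4
    extern double benchmark(const int cfg[NOPT]);  /* placeholder: times the kernel */

    void greedy_tune(const int nvals[NOPT], int best[NOPT])
    {
        for (int o = 0; o < NOPT; o++) best[o] = 0;
        for (int o = 0; o < NOPT; o++) {
            double best_perf = -1.0;
            int best_v = 0;
            for (int v = 0; v < nvals[o]; v++) {
                best[o] = v;
                double perf = benchmark(best);
                if (perf > best_perf) { best_perf = perf; best_v = v; }
            }
            best[o] = best_v;  /* lock in this dimension's winner */
        }
    }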


Future Work (Auto-tuning Motifs)

We could certainly auto-tune other individual kernels in any motif, but this requires building a kernel-specific auto-tuner. However, we should strive for motif-wide auto-tuning. Moreover, we want to decouple the data type (e.g. double precision) from the parallelization structure.
1. A motif description or pattern language for each motif: e.g. a taxonomy of structured grids plus a code snippet for the stencil; write an auto-tuner that parses these and produces optimized code.
2. A series of DAG rewrite rules for each motif. Rules allow: insertion of additional nodes, duplication of nodes, and reordering.


Rewrite Rules

[Figure: SpMV DAGs for y = Ax, showing the original nonzeros and the explicit zeros that BCSR padding adds.]

Consider SpMV. In FP, each node in the DAG is a MAC. The DAG makes locality explicit (e.g. local-store blocking). BCSR adds zeros to the DAG. We can cut edges and reconnect them to enable parallelization, and we can reorder operations. Any other data type/node type conforming to these rules can reuse all our auto-tuning efforts (a sketch of such a node follows).

auto-tuning effortsx

yA0

0

0

0

0

0

0

0
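One hypothetical representation of a node under these rules; this is illustrative only, not an implemented system:

    /* A MAC node in the SpMV DAG: y[row] += a * x[col]. Rewrite rules may
       insert zero-valued nodes (BCSR padding), duplicate nodes, cut and
       reconnect edges (parallelization), or reorder a partition's chain. */
    typedef struct dag_node {
        int row, col;           /* which y and x elements this MAC touches */
        double a;               /* matrix value; 0.0 for padded nodes      */
        struct dag_node *next;  /* order within one sequential partition   */
    } dag_node_t;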


Acknowledgments


Acknowledgments

Berkeley ParLab:
- Thesis committee: David Patterson, Kathy Yelick, Sara McMains.
- BeBOP group: Jim Demmel, Kaushik Datta, Shoaib Kamil, Rich Vuduc, Rajesh Nishtala, etc.
- The rest of ParLab.

Lawrence Berkeley National Laboratory:
- FTG group: Lenny Oliker, John Shalf, Jonathan Carter, ...

Hardware donations and remote access: Sun Microsystems, IBM, AMD, FZ Julich, Intel.

This research was supported by the ASCR Office in the DOE Office of Science under contract number DE-AC02-05CH11231, by Microsoft and Intel funding through award #20080469, and by matching funding by U.C. Discovery through award #DIG07-10227.


Questions?