
Auto-tuning Performance on Multicore Computers

Transcript
Page 1: Auto-tuning Performance on Multicore Computers

Auto-tuning Performance on Multicore Computers

Samuel Williams
David Patterson, advisor and chair
Kathy Yelick
Sara McMains

Ph.D. Dissertation Talk
[email protected]

Page 2: Auto-tuning Performance on Multicore Computers

Multicore Processors

Page 3: Auto-tuning Performance on Multicore Computers

Superscalar Era

Single-thread performance scaled at ~50% per year.
Memory bandwidth increased much more slowly, but we could add additional bits or channels.
Lack of diversity in architecture = lack of individual tuning.
The power wall has capped single-thread performance.

[Figure: log(performance) vs. year, 1990-2005. Processor performance grows ~50%/year, while memory bandwidth and latency improve far more slowly (~25%, ~12%, ~7%, and 0% per year curves).]

Multicore is the agreed solution.

Page 4: Auto-tuning Performance on Multicore Computers

Superscalar Era

Single-thread performance scaled at ~50% per year.
Memory bandwidth increased much more slowly, but we could add additional bits or channels.
Lack of diversity in architecture = lack of individual tuning.
The power wall has capped single-thread performance.

[Figure: same log(performance) vs. year plot (processor ~50%/year; memory ~25%, ~12%, and 0% per year).]

But there is no agreement on what it should look like.

Page 5: Auto-tuning Performance on Multicore Computers

Multicore Era

Take existing power-limited superscalar processors and add cores.
Serial code shows no improvement.
Parallel performance might improve at ~25% per year (the improvement in power efficiency from process technology).
DRAM bandwidth is currently only improving by ~12% per year.
The Intel / AMD model.

[Figure: log(performance) vs. year, 2003-2009: +0%/year (serial), +25%/year (parallel), ~12%/year DRAM bandwidth.]

Page 6: Auto-tuning Performance on Multicore Computers

Multicore Era

Radically different approach: give up superscalar OOO cores in favor of small, efficient, in-order cores.
Huge, abrupt, shocking drop in single-thread performance.
May eliminate the power wall, allowing many cores and performance to scale at up to 40% per year.
Troublingly, the number of DRAM channels may need to double every three years.
Niagara, Cell, GPUs.

[Figure: log(performance) vs. year, 2003-2009: +0%/year (serial), +40%/year (parallel), +12%/year DRAM bandwidth.]

Page 7: Auto-tuning Performance on Multicore Computers

Multicore Era

Another option would be to reduce per-core performance (and power) every year.
Graceful degradation in serial performance.
Bandwidth is still an issue.

[Figure: log(performance) vs. year, 2003-2009: -12%/year (serial), +40%/year (parallel), +12%/year DRAM bandwidth.]

Page 8: Auto-tuning Performance on Multicore Computers

Multicore Era

There are still many other architectural knobs:
Superscalar? Dual issue? VLIW?
Pipeline depth
SIMD width?
Multithreading (vertical, simultaneous)?
Shared caches vs. private caches?
FMA vs. MUL+ADD vs. MUL or ADD
Clustering of cores
Crossbars vs. rings vs. meshes

Currently there is no consensus on the optimal configuration. As a result, there is a plethora of multicore architectures.

Page 9: Auto-tuning Performance on Multicore Computers

Computational Motifs

Page 10: Auto-tuning Performance on Multicore Computers

Computational Motifs

Evolved from Phil Colella's Seven Dwarfs of Parallel Computing: numerical methods common throughout scientific computing.
Dense and Sparse Linear Algebra
Computations on Structured and Unstructured Grids
Spectral Methods
N-Body Particle Methods
Monte Carlo

Within each dwarf, there are a number of computational kernels.

The Berkeley View, and subsequently the Par Lab, expanded these to many other domains of computing (embedded, SPEC, databases, games):
Graph Algorithms
Combinational Logic
Finite State Machines
etc.
and rechristened them Computational Motifs.

Each could be black-boxed into libraries or frameworks by domain experts. But how do we get good performance given the diversity of architectures?

Page 11: Auto-tuning Performance on Multicore Computers

Auto-tuning

Page 12: Auto-tuning Performance on Multicore Computers

Auto-tuning (motivation)

Given the huge diversity of processor architectures, a code hand-optimized for one architecture will likely deliver poor performance on another. Moreover, code optimized for one input data set may deliver poor performance on another.

We want a single code base that delivers performance portability across the breadth of architectures today and into the future.

Auto-tuners are composed of two principal components:
a code generator based on high-level functionality rather than parsing C, and
the auto-tuner proper, which searches for the optimal parameters for each optimization.

Auto-tuners don't invent or discover optimizations; they search the parameter space of a variety of known optimizations.

Proven value in Dense Linear Algebra (ATLAS), Spectral Methods (FFTW, SPIRAL), and Sparse Methods (OSKI).

Page 13: Auto-tuning Performance on Multicore Computers

Auto-tuning (code generation)

The code generator produces many code variants of a numerical kernel using known optimizations and transformations. For example:
Cache blocking adds several parameterized loop nests.
Prefetching adds parameterized intrinsics to the code.
Loop unrolling and reordering explicitly unroll the code; each unrolling is a unique code variant.

Kernels can have dozens of different optimizations, some of which can produce hundreds of code variants.

The code generators used in this work are kernel-specific and were written in Perl; an illustrative sketch of one generated variant follows.
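As a purely hypothetical illustration (my sketch, not the dissertation's generated code), one cache-blocked, unrolled variant of a simple loop nest might look like the following C, where the block sizes and the unroll factor are the tunable parameters a generator would vary across variants:

/* Hypothetical generated variant: 2D loop nest with cache blocking (BX, BY)
 * and a 4-way unrolled inner loop.  The generator would emit one such
 * variant per (BX, BY, unroll) combination. */
#define BX 64
#define BY 64

void scale_blocked(double *a, const double *b, int nx, int ny, double alpha)
{
    for (int jj = 0; jj < ny; jj += BY)              /* block over rows    */
        for (int ii = 0; ii < nx; ii += BX)          /* block over columns */
            for (int j = jj; j < jj + BY && j < ny; j++) {
                int iend = (ii + BX < nx) ? ii + BX : nx;
                int i = ii;
                for (; i + 4 <= iend; i += 4) {      /* unrolled by 4 */
                    a[j*nx + i+0] = alpha * b[j*nx + i+0];
                    a[j*nx + i+1] = alpha * b[j*nx + i+1];
                    a[j*nx + i+2] = alpha * b[j*nx + i+2];
                    a[j*nx + i+3] = alpha * b[j*nx + i+3];
                }
                for (; i < iend; i++)                /* cleanup loop */
                    a[j*nx + i] = alpha * b[j*nx + i];
            }
}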

Page 14: Auto-tuning Performance on Multicore Computers

Auto-tuning (search)

In this work, we use two search techniques:
Exhaustive - examine every combination of parameters for every optimization (often intractable).
Heuristics - use knowledge of the architecture or algorithm to restrict the search space.

[Figure: a 2D parameter space (Optimization A x Optimization B) contrasting an exhaustive sweep with a heuristically restricted search.]
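A minimal sketch of an exhaustive search driver (my illustration; wall_time() and run_variant() are assumed helper functions, not the dissertation's API): time every parameter combination and keep the fastest.

#include <float.h>

extern double wall_time(void);                    /* assumed timer helper  */
extern void   run_variant(int block, int unroll); /* assumed kernel variant */

void exhaustive_search(const int *blocks, int nb,
                       const int *unrolls, int nu,
                       int *best_block, int *best_unroll)
{
    double best = DBL_MAX;
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nu; j++) {
            double t0 = wall_time();
            run_variant(blocks[i], unrolls[j]);
            double t = wall_time() - t0;
            if (t < best) {                        /* keep fastest variant */
                best = t;
                *best_block  = blocks[i];
                *best_unroll = unrolls[j];
            }
        }
}

A heuristic search would simply restrict which (blocks, unrolls) combinations are passed in, based on cache sizes, register counts, or other architectural knowledge.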

Page 15: Auto-tuning Performance on Multicore Computers


Overview

Multicore SMPs

The Roofline Model

Auto-tuning LBMHD

Auto-tuning SpMV

Summary

Future Work


Page 16: Auto-tuning Performance on Multicore Computers

Thesis Contributions

Introduced the Roofline Model.
Extended auto-tuning to the structured grid motif (specifically LBMHD).
Extended auto-tuning to multicore:
fundamentally different from running auto-tuned serial code on multicore SMPs;
applied the concept to LBMHD and SpMV.
Analyzed the breadth of multicore architectures in the context of auto-tuned SpMV and LBMHD.
Discussed future directions in auto-tuning.

Page 17: Auto-tuning Performance on Multicore Computers

Multicore SMPs (Chapter 3)

Overview

Multicore SMPs

The Roofline Model

Auto-tuning LBMHD

Auto-tuning SpMV

Summary

Future Work

Page 18: Auto-tuning Performance on Multicore Computers

Multicore SMP Systems

Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
Sun UltraSPARC T2+ T5140 (Victoria Falls)
IBM QS20 (Cell Blade)

[Block diagrams of the four systems. Xeon E5345: two quad-core sockets with a 4MB shared L2 per pair of cores, front-side buses (10.66 GB/s each) into an MCH with 4x64b controllers to 667MHz FBDIMMs (21.33 GB/s read, 10.66 GB/s write). Opteron 2356: two quad-core sockets, 512K L2 per core plus a 2MB shared victim cache, SRI/crossbar, 2x64b controllers to 667MHz DDR2 DIMMs (10.6 GB/s per socket), HyperTransport links (4 GB/s each direction). UltraSPARC T2+ T5140: two sockets of multithreaded SPARC cores, a crossbar to a 4MB shared L2 (16-way, 64b interleaved; 179 GB/s / 90 GB/s), 4 coherency hubs, 2x128b controllers to 667MHz FBDIMMs (21.33 GB/s read, 10.66 GB/s write per socket), 8 x 6.4 GB/s links (1 per hub per direction). QS20 Cell Blade: two Cell chips, each a VMT PPE with 512K L2 plus eight SPEs (256K local store and MFC each) on the EIB ring network, XDR memory controllers to 512MB XDR DRAM (25.6 GB/s per chip), and a BIF link between chips (<20 GB/s each direction).]

Page 19: Auto-tuning Performance on Multicore Computers

Multicore SMP Systems (Conventional cache-based memory hierarchy)

Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
Sun UltraSPARC T2+ T5140 (Victoria Falls)
IBM QS20 (Cell Blade)

[Same four block diagrams as the previous slide, with the conventional cache-based memory hierarchies highlighted.]

Page 20: Auto-tuning Performance on Multicore Computers

Multicore SMP Systems (local store-based memory hierarchy)

Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
Sun UltraSPARC T2+ T5140 (Victoria Falls)
IBM QS20 (Cell Blade)

[Same four block diagrams, with the Cell Blade's local store-based memory hierarchy (SPE local stores and MFCs) highlighted.]

Page 21: Auto-tuning Performance on Multicore Computers

Multicore SMP Systems (CMT = Chip Multithreading)

Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
Sun UltraSPARC T2+ T5140 (Victoria Falls)
IBM QS20 (Cell Blade)

[Same four block diagrams, with the chip-multithreaded UltraSPARC T2+ (Victoria Falls) highlighted.]

Page 22: Auto-tuning Performance on Multicore Computers

Multicore SMP Systems (peak double-precision FLOP rates)

Intel Xeon E5345 (Clovertown): 74.66 GFLOP/s
AMD Opteron 2356 (Barcelona): 73.60 GFLOP/s
Sun UltraSPARC T2+ T5140 (Victoria Falls): 18.66 GFLOP/s
IBM QS20 (Cell Blade): 29.25 GFLOP/s

[Same four block diagrams, annotated with each system's peak double-precision FLOP rate.]

Page 23: Auto-tuning Performance on Multicore Computers

Multicore SMP Systems (DRAM pin bandwidth)

Intel Xeon E5345 (Clovertown): 21 GB/s (read), 10 GB/s (write)
AMD Opteron 2356 (Barcelona): 21 GB/s
Sun UltraSPARC T2+ T5140 (Victoria Falls): 42 GB/s (read), 21 GB/s (write)
IBM QS20 (Cell Blade): 51 GB/s

[Same four block diagrams, annotated with each system's aggregate DRAM pin bandwidth.]

Page 24: Auto-tuning Performance on Multicore Computers

Multicore SMP Systems (Non-Uniform Memory Access)

Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona)
Sun UltraSPARC T2+ T5140 (Victoria Falls)
IBM QS20 (Cell Blade)

[Same four block diagrams; the Opteron, Victoria Falls, and Cell Blade systems are NUMA, with each socket/chip owning its local DRAM.]

Page 25: Auto-tuning Performance on Multicore Computers

Roofline Model (Chapter 4)

Overview

Multicore SMPs

The Roofline Model

Auto-tuning LBMHD

Auto-tuning SpMV

Summary

Future Work

Page 26: Auto-tuning Performance on Multicore Computers

Memory Traffic

Total bytes to/from DRAM.

Can be categorized into:
Compulsory misses
Capacity misses
Conflict misses
Write allocations
...

This categorization is oblivious to a lack of sub-cache-line spatial locality.

Page 27: Auto-tuning Performance on Multicore Computers

Arithmetic Intensity

For the purposes of this talk, we'll deal with floating-point kernels.
Arithmetic Intensity (AI) ~ Total FLOPs / Total DRAM Bytes (includes cache effects).
Many interesting problems have constant AI (with respect to problem size), which is bad given slowly increasing DRAM bandwidth.
Bandwidth and traffic are the key optimizations.

[Figure: arithmetic intensity spectrum. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods. O(log N): FFTs. O(N): dense linear algebra (BLAS3), particle methods.]

Page 28: Auto-tuning Performance on Multicore Computers

Basic Idea

Synthesize communication, computation, and locality into a single, visually intuitive performance figure using bound-and-bottleneck analysis.

Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.

Moreover, it provides insight into which optimizations will potentially be beneficial.

Attainable Performance(i,j) = min( FLOP/s with Optimizations 1..i , AI x Bandwidth with Optimizations 1..j )
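A minimal sketch of that bound in C (my illustration, not the dissertation's code): attainable GFLOP/s is the minimum of the in-core compute ceiling and arithmetic intensity times attained bandwidth.

double roofline_gflops(double peak_gflops,          /* ceiling after in-core opts 1..i   */
                       double bandwidth_gbs,        /* bandwidth after memory opts 1..j  */
                       double arithmetic_intensity) /* FLOPs per DRAM byte               */
{
    double memory_bound = arithmetic_intensity * bandwidth_gbs;
    return (memory_bound < peak_gflops) ? memory_bound : peak_gflops;
}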

Page 29: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (computational ceilings)

Plot on a log-log scale. Given AI, we can easily bound performance. But architectures are much more complicated.

We will bound performance as we eliminate specific forms of in-core parallelism.

[Roofline figure for the Opteron 2356 (Barcelona): attainable GFLOP/s vs. actual FLOP:byte ratio, bounded by peak DP and Stream Bandwidth.]

Page 30: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (computational ceilings)

Opterons have dedicated multipliers and adders. If the code is dominated by adds, then attainable performance is half of peak.

We call these ceilings; they act like constraints on performance.

[Roofline figure: a mul/add imbalance ceiling added below peak DP.]

Page 31: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (computational ceilings)

Opterons have 128-bit datapaths. If instructions aren't SIMDized, attainable performance will be halved.

[Roofline figure: a w/out SIMD ceiling added below the mul/add imbalance ceiling.]

Page 32: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (computational ceilings)

On Opterons, floating-point instructions have a 4-cycle latency. If we don't express 4-way ILP, performance will drop by as much as 4x.

[Roofline figure: a w/out ILP ceiling added below the w/out SIMD ceiling.]

Page 33: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (communication ceilings)

We can perform a similar exercise, taking away parallelism from the memory subsystem.

[Roofline figure: peak DP and Stream Bandwidth only.]

Page 34: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (communication ceilings)

Explicit software prefetch instructions are required to achieve peak bandwidth.

[Roofline figure: a w/out SW prefetch ceiling added below Stream Bandwidth.]

Page 35: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (communication ceilings)

Opterons are NUMA. As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.

We could continue this by examining strided or random memory access patterns.

[Roofline figure: a w/out NUMA ceiling added below the w/out SW prefetch ceiling.]

Page 36: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (computation + communication)

We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

[Roofline figure for the Opteron 2356 combining the computational ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP) and the bandwidth ceilings (Stream Bandwidth, w/out SW prefetch, w/out NUMA).]

Page 37: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (locality walls)

Remember, memory traffic includes more than just compulsory misses. As such, the actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination.

AI = FLOPs / Compulsory Misses

[Roofline figure: a vertical locality wall at the AI implied by compulsory miss traffic only.]

Page 38: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (locality walls)

Remember, memory traffic includes more than just compulsory misses. As such, the actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination.

AI = FLOPs / (Write Allocations + Compulsory Misses)

[Roofline figure: the wall moves left as write-allocation traffic is added.]

Page 39: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (locality walls)

Remember, memory traffic includes more than just compulsory misses. As such, the actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination.

AI = FLOPs / (Capacity + Write Allocations + Compulsory Misses)

[Roofline figure: the wall moves further left as capacity miss traffic is added.]

Page 40: Auto-tuning Performance on Multicore Computers

Constructing a Roofline Model (locality walls)

Remember, memory traffic includes more than just compulsory misses. As such, the actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination.

AI = FLOPs / (Conflict + Capacity + Write Allocations + Compulsory Misses)

[Roofline figure: the wall moves further left as conflict miss traffic is added.]

Page 41: Auto-tuning Performance on Multicore Computers

Roofline Models for SMPs

[Roofline figures for the Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), and QS20 Cell Blade PPEs and SPEs. Each shows peak DP with in-core ceilings (mul/add imbalance, w/out SIMD, w/out ILP, or w/out FMA on Cell) and Stream Bandwidth with w/out SW prefetch and w/out NUMA ceilings; Clovertown instead shows separate bandwidths for small and large datasets, and the Cell SPEs show a misaligned DMA ceiling.]

Note that the multithreaded Niagara is limited by the instruction mix rather than by a lack of expressed in-core parallelism.

Clearly, some architectures depend more on bandwidth optimizations, while others depend more on in-core optimizations.

[Roofline figure for the UltraSPARC T2+ T5140 (Victoria Falls): peak DP with 25%/12%/6% FP instruction-mix ceilings, and Stream Bandwidth with w/out SW prefetch and w/out NUMA ceilings.]

Page 42: Auto-tuning Performance on Multicore Computers

Auto-tuning Lattice-Boltzmann Magnetohydrodynamics (LBMHD) (Chapter 6)

Overview

Multicore SMPs

The Roofline Model

Auto-tuning LBMHD

Auto-tuning SpMV

Summary

Future Work

Page 43: Auto-tuning Performance on Multicore Computers

Introduction to Lattice Methods

Structured grid code with a series of time steps.
Popular in CFD.
Allows for complex boundary conditions.
No temporal locality between points in space within one time step.
Higher-dimensional phase space.

Simplified kinetic model that maintains the macroscopic quantities.
Distribution functions (e.g. 5-27 velocities per point in space) are used to reconstruct macroscopic quantities.
Significant memory capacity requirements.

[Figure: 3D lattice with 27 velocity directions (components 0-26) along the +X, +Y, +Z axes.]

Page 44: Auto-tuning Performance on Multicore Computers

LBMHD (general characteristics)

Plasma turbulence simulation.
Two distributions:
momentum distribution (27 scalar components)
magnetic distribution (15 vector components)
Three macroscopic quantities:
density
momentum (vector)
magnetic field (vector)

Must read 73 doubles and update 79 doubles per point in space.
Requires about 1300 floating-point operations per point in space.
Just over 1.0 FLOPs/byte (ideal).
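(A quick sanity check of that figure - my arithmetic, not the slide's: 73 + 79 = 152 doubles is 1216 bytes per point, and 1300 FLOPs / 1216 bytes ~ 1.07 FLOPs/byte. On a write-allocate cache the 79 updated doubles are also read, giving (73 + 79 + 79) x 8 = 1848 bytes and ~0.70 FLOPs/byte, matching the two LBMHD arithmetic intensities quoted later.)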

[Figures: the momentum distribution (27 scalar components), the magnetic distribution (15 vector components), and the macroscopic variables, each drawn on the 3D lattice with +X, +Y, +Z axes.]

Page 45: Auto-tuning Performance on Multicore Computers

LBMHD (implementation details)

Data structure choices (a small sketch of the two layouts follows):
Array of Structures: no spatial locality, strided access.
Structure of Arrays: a huge number of memory streams per thread, but guarantees spatial locality, unit stride, and vectorizes well.

Parallelization:
The Fortran version used MPI to communicate between nodes - a bad match for multicore.
The version in this work uses pthreads for multicore (this thesis is not about innovation in the threading model or programming language).
MPI is not used when auto-tuning.

Two problem sizes: 64^3 (~330MB) and 128^3 (~2.5GB).
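A hypothetical C sketch of the two layout choices (the names are mine, not the dissertation's); NV stands in for the 27 momentum components per point:

#define NV 27

/* Array of Structures: the components of one lattice point are contiguous,
 * so sweeping over points for a single component is a strided access with
 * poor spatial locality. */
struct point_aos { double f[NV]; };   /* grid_aos[i].f[v] */

/* Structure of Arrays: each component is its own unit-stride array, which
 * vectorizes well but creates one memory stream per component
 * (73 read and 79 write streams for LBMHD). */
struct grid_soa  { double *f[NV]; };  /* grid_soa.f[v][i] */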

Page 46: Auto-tuning Performance on Multicore Computers

SOA Memory Access Pattern

Consider a simple D2Q9 lattice method using SOA: there are 9 read arrays and 9 write arrays, but all accesses are unit stride.
LBMHD has 73 read and 79 write streams per thread.

[Figure: D2Q9 stencil offsets (0,0), (+/-1,0), (0,+/-1), (+/-1,+/-1) and the corresponding unit-stride read_array[][] and write_array[][] streams along the x dimension.]

Page 47: Auto-tuning Performance on Multicore Computers

Roofline Models for SMPs

LBMHD has an AI of 0.7 on write-allocate architectures, and 1.0 on those with cache bypass or no write allocate.
There is MUL/ADD imbalance.
Some architectures will be bandwidth-bound, while others will be compute-bound.

[Roofline figures for the five chips (Xeon E5345, Opteron 2356, UltraSPARC T2+ T5140, Cell Blade PPEs and SPEs) with LBMHD's arithmetic intensity overlaid.]

Page 48: Auto-tuning Performance on Multicore Computers

LBMHD Performance (reference implementation)

The standard cache-based implementation can be easily parallelized with pthreads.
NUMA is implicitly exploited.
Although scalability looks good, is performance?

[Figure: performance vs. concurrency and problem size.]

Page 49: Auto-tuning Performance on Multicore Computers

LBMHD Performance (reference implementation)

The standard cache-based implementation can be easily parallelized with pthreads.
NUMA is implicitly exploited.
Although scalability looks good, is performance?

Page 50: Auto-tuning Performance on Multicore Computers

LBMHD Performance (reference implementation)

Superscalar performance is surprisingly good given the complexity of the memory access pattern.
Cell PPE performance is abysmal.

[Roofline figures for the five chips with the reference LBMHD performance marked.]

Page 51: Auto-tuning Performance on Multicore Computers

LBMHD Performance (array padding)

LBMHD touches more than 150 arrays.
Most caches have limited associativity, so conflict misses are likely.
Apply a heuristic to pad the arrays (a small sketch follows).
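A minimal sketch of the padding idea (my illustration; the actual heuristic in the dissertation may differ): give each of the many arrays a slightly different allocation offset so their starting addresses do not all map to the same cache sets.

#include <stdlib.h>

double *alloc_padded(size_t n_elems, int array_id)
{
    /* hypothetical heuristic: stagger arrays by a few cache lines each
     * (8 doubles = one 64-byte line) */
    size_t pad_elems = (size_t)(array_id % 16) * 8;
    return malloc((n_elems + pad_elems) * sizeof(double));
}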

Page 52: Auto-tuning Performance on Multicore Computers

LBMHD Performance (vectorization)

LBMHD touches more than 150 arrays, but most TLBs have far fewer than 128 entries.
The vectorization technique creates a vector of points that are being updated; loops are interchanged and strip mined (see the sketch below).
Exhaustively search for the optimal "vector length" that balances page locality against L1 cache misses.
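A minimal C sketch of the strip-mining structure (mine, not the dissertation's generated code; the real update is far more involved than the placeholder copy used here). The loop over lattice points is broken into chunks of vlen points, and each component is streamed over the whole chunk before moving to the next component:

void sweep_vectorized(double **f_new, double **f_old,
                      int npts, int nv, int vlen)   /* vlen = tuned vector length */
{
    for (int base = 0; base < npts; base += vlen) {
        int chunk = (base + vlen < npts) ? vlen : npts - base;
        for (int v = 0; v < nv; v++)                 /* one stream at a time   */
            for (int i = 0; i < chunk; i++)          /* unit stride over chunk */
                f_new[v][base + i] = f_old[v][base + i];  /* placeholder update */
    }
}

Small vlen keeps the working set of pages (and L1 lines) manageable; large vlen amortizes loop overhead and improves page locality per stream, hence the exhaustive search.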

Page 53: Auto-tuning Performance on Multicore Computers

LBMHD Performance (other optimizations)

Heuristic-based software prefetching.
Exhaustive search for low-level optimizations: loop unrolling/reordering and SIMDization.
Cache bypass increases arithmetic intensity by 50%.
Small TLB pages on Victoria Falls.
(A sketch of the prefetch and cache-bypass idea follows.)
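A hedged sketch of two of these low-level optimizations using x86 SSE2 intrinsics (my illustration, not the dissertation's generated code): software prefetching hides DRAM latency, and a non-temporal (cache-bypassing) store avoids the write-allocate read, which is what raises arithmetic intensity. The destination is assumed 16-byte aligned.

#include <emmintrin.h>   /* SSE2: _mm_loadu_pd, _mm_stream_pd, _mm_prefetch */

void copy_bypass(double *dst, const double *src, long n, long prefetch_dist)
{
    long i;
    for (i = 0; i + 2 <= n; i += 2) {
        _mm_prefetch((const char *)&src[i + prefetch_dist], _MM_HINT_T0);
        __m128d v = _mm_loadu_pd(&src[i]);
        _mm_stream_pd(&dst[i], v);          /* cache-bypassing store */
    }
    for (; i < n; i++)                      /* remainder */
        dst[i] = src[i];
    _mm_sfence();                           /* order the streaming stores */
}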

Page 54: Auto-tuning Performance on Multicore Computers

LBMHD Performance (Cell SPE implementation)

We can write a local-store implementation and run it on the Cell SPEs.
Ultimately, Cell's weak double precision hampers performance.

Page 55: Auto-tuning Performance on Multicore Computers

LBMHD Performance (speedup for largest problem)

We can write a local-store implementation and run it on the Cell SPEs.
Ultimately, Cell's weak double precision hampers performance.

Auto-tuned speedups over the reference implementation: 1.6x, 4x, 3x, and 130x.

Page 56: Auto-tuning Performance on Multicore Computers

LBMHD Performance (vs. Roofline)

Most architectures reach their Roofline-bound performance.
Clovertown's snoop filter is ineffective.
Niagara suffers from instruction mix issues.
Cell PPEs are latency-limited; Cell SPEs are compute-bound.

[Roofline figures for the five chips with the auto-tuned LBMHD performance marked.]

Page 57: Auto-tuning Performance on Multicore Computers

LBMHD Performance (Summary)

The reference code is clearly insufficient.
Portable C code is insufficient on Barcelona and Cell.
Cell gets all of its performance from the SPEs, despite only 2x the area and 2x the peak DP FLOPs.

Page 58: Auto-tuning Performance on Multicore Computers

Auto-tuning Sparse Matrix-Vector Multiplication (Chapter 8)

Overview

Multicore SMPs

The Roofline Model

Auto-tuning LBMHD

Auto-tuning SpMV

Summary

Future Work

Page 59: Auto-tuning Performance on Multicore Computers

Sparse Matrix-Vector Multiplication

Sparse matrix: most entries are 0.0, so there is a performance advantage in only storing/operating on the nonzeros, but this requires significant metadata.

Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors (a CSR sketch of the kernel follows).

Challenges:
difficult to exploit ILP (bad for superscalar);
difficult to exploit DLP (bad for SIMD);
irregular memory access to the source vector;
difficult to load balance.

[Figure: y = Ax with sparse A and dense vectors x, y.]
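For reference, the kernel over a compressed sparse row (CSR) matrix - the format the reference implementation uses later in the talk - looks roughly like this C (a sketch, not the dissertation's exact code):

void spmv_csr(int nrows,
              const int    *row_ptr,   /* nrows+1 row offsets          */
              const int    *col_idx,   /* column index of each nonzero */
              const double *val,       /* value of each nonzero        */
              const double *x,         /* dense source vector          */
              double       *y)         /* dense destination vector     */
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r+1]; k++)
            sum += val[k] * x[col_idx[k]];   /* irregular access to x */
        y[r] = sum;
    }
}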

Page 60: Auto-tuning Performance on Multicore Computers

Dataset (Matrices)

Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache. Subdivided them into 4 categories. Rank ranges from 2K to 1M.

Dense: a 2K x 2K dense matrix stored in sparse format.
Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship.
Poorly structured hodgepodge: Economics, Epidemiology, FEM/Accelerator, Circuit, webbase.
Extreme aspect ratio (linear programming): LP.

Page 61: Auto-tuning Performance on Multicore Computers

Roofline Models for SMPs

The reference SpMV implementation has an AI of 0.166, but can readily exploit FMA.
The best we can hope for is an AI of 0.25 for non-symmetric matrices.
All architectures are memory-bound, but some may also need in-core optimizations.
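(A rough check of the 0.166 figure - my arithmetic: CSR SpMV performs 2 FLOPs per nonzero, a multiply and an add, while moving about 12 bytes per nonzero - an 8-byte double plus a 4-byte column index - so AI ~ 2/12 ~ 0.166. Removing the index overhead can at best approach 2/8 = 0.25.)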

[Roofline figures for the five chips with SpMV's arithmetic intensity range (0.166-0.25) overlaid.]

Page 62: Auto-tuning Performance on Multicore Computers


SpMV Performance (Reference Implementation)

Reference implementation is CSR.

Simple parallelization by rows, balancing nonzeros across threads.

No implicit NUMA exploitation.

Despite the superscalars' use of 8 cores, they see little speedup.

Niagara and the PPEs show near-linear speedups.
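A minimal sketch of that row partitioning (illustrative only; the helper name and signature are ours): give each thread a contiguous range of rows whose nonzero count is roughly nnz/nthreads.

/* Partition CSR rows so each thread gets roughly equal nonzeros rather
 * than equal rows.  row_start must hold nthreads+1 entries. */
void partition_rows_by_nnz(int nrows, const int *row_ptr,
                           int nthreads, int *row_start)
{
    long long nnz = row_ptr[nrows];
    int r = 0;
    row_start[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        long long target = (nnz * t) / nthreads;   /* nonzeros before thread t */
        while (r < nrows && row_ptr[r] < target) r++;
        row_start[t] = r;
    }
    row_start[nthreads] = nrows;
}

Thread t then runs the CSR loop over rows [row_start[t], row_start[t+1]).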

[Bar chart: performance for each matrix and platform.]

Page 63: Auto-tuning Performance on Multicore Computers


SpMV Performance (Reference Implementation)


Page 64: Auto-tuning Performance on Multicore Computers


SpMV Performance (Reference Implementation)

Roofline for dense matrix in sparse format.

Superscalars achieve bandwidth-limited performance.

Niagara comes very close to the bandwidth limit.

Clearly, NUMA and prefetching will be essential.

[Roofline figures for the same five platforms, with the same axes and ceilings as above.]

Page 65: Auto-tuning Performance on Multicore Computers


SpMV Performance (NUMA and Software Prefetching)

NUMA-aware allocation is essential on memory-bound NUMA SMPs.

Explicit software prefetching can boost bandwidth and change cache replacement policies.

Cell PPEs are likely latency-limited.

An exhaustive search was used for these optimizations.
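A hedged sketch of these two optimizations in C with OpenMP (the names, the row_start[] partition from the earlier sketch, and the prefetch distance PF_DIST are all illustrative; such parameters are what the tuner searches over): first-touch initialization so each thread's share of the matrix lands on its local memory controller, and software prefetch ahead of the CSR inner loop.

#include <omp.h>

#define PF_DIST 64   /* prefetch distance in elements; a tunable */

/* First-touch placement: the thread that will later process rows
 * [row_start[t], row_start[t+1]) writes those pages first. */
void first_touch(const int *row_ptr, const int *row_start, int nthreads,
                 double *vals, int *col_idx)
{
#pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        for (int k = row_ptr[row_start[t]]; k < row_ptr[row_start[t + 1]]; k++) {
            vals[k] = 0.0;     /* real nonzeros are copied in afterward */
            col_idx[k] = 0;
        }
    }
}

/* One CSR row with software prefetch of the streamed matrix arrays
 * (prefetch slightly past the end of the arrays is only a hint). */
static inline double csr_row(const double *vals, const int *col_idx,
                             int kbeg, int kend, const double *x)
{
    double sum = 0.0;
    for (int k = kbeg; k < kend; k++) {
        __builtin_prefetch(&vals[k + PF_DIST], 0, 0);
        __builtin_prefetch(&col_idx[k + PF_DIST], 0, 0);
        sum += vals[k] * x[col_idx[k]];
    }
    return sum;
}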

Page 66: Auto-tuning Performance on Multicore Computers


SpMV Performance (Matrix Compression)

After maximizing memory bandwidth, the only hope is to minimize memory traffic.

Exploit register blocking, other formats, and smaller indices.

Use a traffic-minimization heuristic rather than search.

The benefit is clearly matrix-dependent. Register blocking enables efficient software prefetching (one prefetch per cache line).
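For instance, a 2x2 register-blocked (BCSR) kernel stores one column index per block instead of per nonzero, cutting index traffic and exposing independent multiply-adds. A sketch under our own naming (not the auto-tuner's generated code); explicit zeros pad incomplete blocks:

/* BCSR SpMV with 2x2 blocks stored row-major.  nbrows = nrows/2;
 * bcol_idx holds one block-column index per block. */
void spmv_bcsr_2x2(int nbrows, const int *brow_ptr, const int *bcol_idx,
                   const double *bvals, const double *x, double *y)
{
    for (int br = 0; br < nbrows; br++) {
        double y0 = 0.0, y1 = 0.0;
        for (int k = brow_ptr[br]; k < brow_ptr[br + 1]; k++) {
            const double *b  = &bvals[4 * k];        /* the 2x2 block     */
            const double *xp = &x[2 * bcol_idx[k]];  /* two x elements    */
            y0 += b[0] * xp[0] + b[1] * xp[1];
            y1 += b[2] * xp[0] + b[3] * xp[1];
        }
        y[2 * br]     = y0;
        y[2 * br + 1] = y1;
    }
}

The traffic heuristic then picks the r x c block size whose estimated bytes per flop (including the fill-in zeros) is smallest, rather than timing every variant.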

Page 67: Auto-tuning Performance on Multicore Computers


SpMV Performance (Cache and TLB Blocking)

Based on limited architectural knowledge, create a heuristic to choose good cache and TLB block sizes.

Hierarchically store the resultant blocked matrix.

The benefit can be significant on the most challenging matrix.
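A rough illustration of such a heuristic (our own simplification, not the dissertation's exact rule): bound the number of source-vector elements per block both by the cache capacity and by the pages the TLB can cover, and take the smaller.

/* Pick how many columns of x one cache block may touch. */
static long choose_col_block(long cache_bytes, long tlb_entries, long page_bytes)
{
    /* leave roughly half the cache for streaming A and for y */
    long cols_by_cache = cache_bytes / (2 * (long)sizeof(double));
    long cols_by_tlb   = (tlb_entries * page_bytes) / (long)sizeof(double);
    return cols_by_cache < cols_by_tlb ? cols_by_cache : cols_by_tlb;
}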

Page 68: Auto-tuning Performance on Multicore Computers


SpMV Performance (Cell SPE implementation)

Cache blocking can be easily transformed into local store blocking.

With a few small tweaks for DMA, we can run a simplified version of the auto-tuner on Cell: BCOO format only, 2x1 and larger register blocks, and blocking is always applied.
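For reference, a blocked coordinate (BCOO) entry might look like the following (the field names and the 2x1 block choice are illustrative): each entry carries its block coordinates explicitly, so a DMA-ed local-store block of the matrix can be processed against the resident slice of x without any row pointers.

#include <stdint.h>

/* One 2x1 BCOO entry: block coordinates plus two values. */
typedef struct {
    uint16_t brow;    /* block row within the current local-store block    */
    uint16_t bcol;    /* block column (index into the resident slice of x) */
    double   v[2];    /* 2x1 block of nonzero values                       */
} bcoo_2x1;

/* Process one local-store block of n entries. */
void spmv_bcoo_2x1(const bcoo_2x1 *e, int n, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double xv = x[e[i].bcol];
        y[2 * e[i].brow]     += e[i].v[0] * xv;
        y[2 * e[i].brow + 1] += e[i].v[1] * xv;
    }
}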

Page 69: Auto-tuning Performance on Multicore Computers


SpMV Performance (median speedup)


Median speedups from auto-tuning: 2.8x, 2.6x, 1.4x, and 15x across the four platforms.

Page 70: Auto-tuning Performance on Multicore Computers


SpMV Performance (vs. Roofline)

Roofline for dense matrix in sparse format.

Compression improves AI.

Auto-tuning can allow us to slightly exceed Stream bandwidth (but not pin bandwidth).

Cell PPEs perennially deliver poor performance.

[Roofline figures for the same five platforms, with the same axes and ceilings as above.]

Page 71: Auto-tuning Performance on Multicore Computers


SpMV Performance (summary)

Unlike LBMHD, SSE was unnecessary to achieve median SpMV performance.

Cell still requires a non-portable, ISA-specific implementation to achieve good performance.

Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.

Page 72: Auto-tuning Performance on Multicore Computers


Summary

Overview

Multicore SMPs

The Roofline Model

Auto-tuning LBMHD

Auto-tuning SpMV

Summary

Future Work

Page 73: Auto-tuning Performance on Multicore Computers


Summary

Introduced the Roofline Model: it applies bound and bottleneck analysis, so performance and the requisite optimizations can be inferred visually.

Extended auto-tuning to multicore: this is fundamentally different from running auto-tuned serial code on multicore SMPs; we applied the concept to LBMHD and SpMV.

Auto-tuning LBMHD and SpMV: multicore has had a transformative effect on auto-tuning (a move from latency-limited to bandwidth-limited). Maximizing memory bandwidth and minimizing memory traffic is key. Compilers are reasonably effective at in-core optimizations but totally ineffective at cache and memory issues, so a library or framework is a necessity for managing them.

Comments on architecture: ultimately, machines are bandwidth-limited without new algorithms, and architectures with caches required significantly more tuning than the local store-based Cell.

Page 74: Auto-tuning Performance on Multicore Computers


Future Directions in Auto-tuning

Chapter 9

Overview

Multicore SMPs

The Roofline Model

Auto-tuning LBMHD

Auto-tuning SpMV

Summary

Future Work

Page 75: Auto-tuning Performance on Multicore Computers


Future Work (Roofline)

Automatic generation of Roofline figures: kernel-oblivious; select the computational metric of interest; select the communication channel of interest; designate common "optimizations"; requires a benchmark.

Using performance counters to generate runtime Roofline figures: given a real kernel, we wish to understand the bottlenecks to performance; this is a much friendlier visualization of performance counter data.
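The bound such a generator would plot is simple; a one-line sketch in C (generic parameter names, nothing machine-specific):

#include <math.h>

/* Roofline bound: attainable GFLOP/s is the lesser of peak in-core
 * throughput and arithmetic intensity times memory bandwidth. */
static double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                              double flops_per_byte)
{
    return fmin(peak_gflops, flops_per_byte * bandwidth_gbs);
}

Each ceiling (w/out SIMD, w/out NUMA, and so on) is the same min with a reduced peak or bandwidth term.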

Page 76: Auto-tuning Performance on Multicore Computers


Future Work (Making search tractable)

[Figure: search over the parameter spaces of Optimization A and Optimization B.]

Given the explosion in optimizations, exhaustive search is clearly not tractable. Moreover, heuristics require extensive architectural knowledge.

In our SC08 work, we tried a greedy approach (one optimization at a time).

We could make it iterative, or we could make it look like steepest descent (with some local search).
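A sketch of that greedy loop (the hooks num_settings(), apply(), and benchmark() are hypothetical stand-ins for the real tuner's machinery): each optimization is tuned in turn and its best setting is locked in before moving on.

#define NUM_OPTS 4                      /* e.g. prefetch distance, block size, ... */
static int    num_settings(int opt);    /* settings available for optimization opt */
static void   apply(int opt, int s);    /* configure that setting on the kernel    */
static double benchmark(void);          /* time the kernel; higher is better       */

void greedy_tune(void)
{
    for (int opt = 0; opt < NUM_OPTS; opt++) {
        int best_s = 0;
        double best_perf = -1.0;
        for (int s = 0; s < num_settings(opt); s++) {
            apply(opt, s);
            double perf = benchmark();
            if (perf > best_perf) { best_perf = perf; best_s = s; }
        }
        apply(opt, best_s);   /* lock in before tuning the next optimization */
    }
}

Iterating the outer loop until no setting changes, or exploring a small neighborhood of the current point, gives the steepest-descent-like variants mentioned above.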

Page 77: Auto-tuning Performance on Multicore Computers


Future Work (Auto-tuning Motifs)

We could certainly auto-tune other individual kernels in any motif, but this requires building a kernel-specific auto-tuner

However, we should strive for motif-wide auto-tuning. Moreover, we want to decouple the data type (e.g. double precision) from the parallelization structure.

1. A motif description or pattern language for each motif, e.g. a taxonomy of structured grids plus a code snippet for the stencil; the auto-tuner parses these and produces optimized code.

2. A series of DAG rewrite rules for each motif. Rules allow:

Insertion of additional nodes

Duplication of nodes

Reordering

Page 78: Auto-tuning Performance on Multicore Computers


Rewrite Rules

[Figures: SpMV DAGs for y = Ax, before and after adding explicit zeros and restructuring.]

Consider SpMV: in FP, each node in the DAG is a MAC. The DAG makes locality explicit (e.g. local store blocking). BCSR adds zeros to the DAG. We can cut edges and reconnect them to enable parallelization, and we can reorder operations. Any other data type or node type conforming to these rules can reuse all our auto-tuning efforts.


Page 79: Auto-tuning Performance on Multicore Computers


Acknowledgments

Page 80: Auto-tuning Performance on Multicore Computers


Acknowledgments

Berkeley ParLab

Thesis Committee: David Patterson, Kathy Yelick, Sara McMains

BeBOP group: Jim Demmel, Kaushik Datta, Shoaib Kamil, Rich Vuduc, Rajesh Nishtala, etc.

Rest of ParLab

Lawrence Berkeley National Laboratory

FTG group: Lenny Oliker, John Shalf, Jonathan Carter, …

Hardware donations and remote access: Sun Microsystems, IBM, AMD, FZ Julich, Intel

This research was supported by the ASCR Office in the DOE Office of Science under contract number DE-AC02-05CH11231, by Microsoft and Intel funding through award #20080469, and by matching funding by U.C. Discovery through award #DIG07-10227.

Page 81: Auto-tuning Performance on Multicore Computers


Questions?