Design and Optimization of a Multi-stencil CFD Solver

Bahareh Davani · Ferran Marti Duran · Feng Liu · Aparna Chandramowlishwaran

August 2, 2017 · PADAL

HPC Factory

Slides: newport.eecs.uci.edu/~amowli/hpcfactory/pdf/PADAL-17.pdf

CONTEXT: HIPER (“HIGH PERFORMANCE TURBULENT FLOW SIMULATIONS”)

Problem Formulation

GOVERNING EQUATIONS

๏ 3D Unsteady Reynolds-Averaged Navier-Stokes (URANS) equations

๏ Dual time-stepping scheme

  ๏ Pseudo-time marching: multi-stage Runge-Kutta scheme

  ๏ Marched to a steady state in pseudo time

๏ Spatial discretization of the residual

  ๏ 2nd-order accurate
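The bullets above can be written out as equations. This is a sketch only: the slides name dual time-stepping but give no formula, so the second-order backward difference in physical time is an assumption (a common choice), and the symbols W (flow variables), R (spatially discretized residual), Δt (physical time step) and t* (pseudo time) are introduced here for illustration.

```latex
% Semi-discrete equations after spatial discretization
\frac{dW}{dt} + R(W) = 0

% Assumed second-order backward difference in physical time
\frac{3W^{n+1} - 4W^{n} + W^{n-1}}{2\,\Delta t} + R\!\left(W^{n+1}\right) = 0

% Pseudo-time problem, marched to steady state with multi-stage Runge-Kutta
\frac{dW}{dt^{*}} + R^{*}(W) = 0,
\qquad
R^{*}(W) = \frac{3W - 4W^{n} + W^{n-1}}{2\,\Delta t} + R(W)
```

When the pseudo-time derivative vanishes, R*(W) = 0 and W is the solution at physical time level n+1.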

Solver flowchart:

๏ Read mesh, calculate volume & area

๏ Start iteration (5-stage Runge-Kutta)

๏ One stage of RK:

  ๏ Update boundary condition

  ๏ Calculate viscous flux, inviscid flux and artificial dissipation

  ๏ Update values at each cell

๏ After the 5th stage: calculate the residual

๏ Solution not converged: start another iteration

๏ Solution converged: collect results
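The control flow of the flowchart can be sketched in a few lines of Python. This is a toy model, not the solver from the talk: the residual is a stand-in (`w - 1` per cell) for the viscous flux, inviscid flux and artificial dissipation, the boundary update is a placeholder, and the stage coefficients in `ALPHAS` are an assumed, commonly used 5-stage set that the slides do not list. All function names are hypothetical.

```python
# Assumed stage coefficients (a common 5-stage choice); not given in the slides.
ALPHAS = (0.25, 1.0 / 6.0, 0.375, 0.5, 1.0)


def residual(w):
    """Toy residual R(w) = w - 1 per cell; the steady state is w = 1."""
    return [wi - 1.0 for wi in w]


def apply_boundary_conditions(w):
    """Placeholder boundary update: pin the first and last cells to 1.0."""
    w = list(w)
    w[0] = 1.0
    w[-1] = 1.0
    return w


def rk5_iteration(w0, dt):
    """One iteration: five RK stages, each restarting from the same w0."""
    w0 = apply_boundary_conditions(w0)
    w = list(w0)
    for alpha in ALPHAS:
        w = apply_boundary_conditions(w)      # update boundary condition
        r = residual(w)                       # stands in for fluxes + dissipation
        w = [w0i - alpha * dt * ri for w0i, ri in zip(w0, r)]  # update each cell
    return w


def solve(w, dt=0.5, tol=1e-10, max_iters=1000):
    """Iterate until the residual norm drops below tol, then return results."""
    for it in range(1, max_iters + 1):
        w = rk5_iteration(w, dt)
        norm = max(abs(ri) for ri in residual(w))  # after the 5th stage
        if norm < tol:                             # solution converged
            return w, it
    raise RuntimeError("solution not converged within max_iters")
```

Calling `solve([0.0] * 8)` returns the converged values together with the number of iterations used.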

Stencil Patterns

๏ Cell-centered stencils

  ๏ Most well-studied in literature

๏ Vertex-centered stencils

  ๏ More complex memory access pattern

  ๏ More memory-bound than cell-centered stencils
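One way to picture the two patterns is on a small 2D grid. The sketch below is an illustration under assumptions, not the solver's actual kernels: the cell-centered example is a 5-point stencil on cell values, and the vertex-centered example averages the four cells that share a vertex, so its output lives on a different index space than its input. Function names are hypothetical.

```python
def cell_centered(u):
    """5-point stencil: each interior cell reads its four face neighbours."""
    n, m = len(u), len(u[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = (u[i - 1][j] + u[i + 1][j]
                         + u[i][j - 1] + u[i][j + 1] - 4.0 * u[i][j])
    return out


def vertex_centered(u):
    """Each interior vertex averages the four cells that share it.

    An n x m grid of cells has (n - 1) x (m - 1) interior vertices, so the
    output is indexed by vertices rather than by cells.
    """
    n, m = len(u), len(u[0])
    return [[0.25 * (u[i][j] + u[i + 1][j] + u[i][j + 1] + u[i + 1][j + 1])
             for j in range(m - 1)]
            for i in range(n - 1)]
```

In 3D a vertex is shared by eight cells, so the vertex-centered version touches more input values per output and in a less regular order.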

OPTIMIZATIONS

๏ Single-core, manually coded & tuned

  ๏ Low-level: SIMD vectorization (x86), strength reduction

  ๏ Data: Structure reorganization (transpose or “SOA”)

  ๏ Traffic: Intra-stencil and inter-stencil fusion, cache blocking

๏ NUMA-aware OpenMP parallelization
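Two of the single-core items above, the structure reorganization (“SOA”) and strength reduction, can be illustrated with a short sketch. The field names `rho`, `u`, `v`, `w`, `p` and both function names are hypothetical, chosen only for the example. In pure Python this shows the data layout and the arithmetic rewrite only; the SIMD benefit of a structure-of-arrays layout appears in compiled code.

```python
def aos_to_soa(cells):
    """Transpose a list of per-cell tuples (array of structures) into one
    list per field (structure of arrays)."""
    rho, u, v, w, p = (list(col) for col in zip(*cells))
    return {"rho": rho, "u": u, "v": v, "w": w, "p": p}


def kinetic_energy(soa):
    """Kinetic energy per cell from contiguous field arrays.

    Strength reduction: multiply by 0.5 instead of dividing by 2, and
    square by multiplication instead of calling a power function.
    """
    return [0.5 * r * (a * a + b * b + c * c)
            for r, a, b, c in zip(soa["rho"], soa["u"], soa["v"], soa["w"])]
```

With the fields stored separately, a loop over one field reads consecutive memory locations, which is the access pattern vector loads prefer.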

Single- and Multi-core Optimizations (Step flow with 2 million cells)

[Figure: two panels, Haswell and Abu Dhabi, plotting speedup (ticks 1 to 64) against number of threads (1, 2, 4, 8, 14, 28 on Haswell; 1, 2, 4, 8, 16, 32, 64 on Abu Dhabi). Legend (Optimization): +Blocking, +SIMD, +SIMD Code Struct, +NUMA, +Parallelism, +Fusion, +Strength Reduction. Plot regions are labeled "Multiple sockets" and "Hyperthreading". Annotated speedups: ~349, ~9x, ~277x, ~19x.]

[Figure: stencil fusion on an i, j, k grid. Panels: Gradients of Velocity; Add incoming flux in i direction; Add incoming flux in j direction; Add incoming flux in k direction; Viscous Flux. Labels: Intra-stencil fusion, Inter-stencil fusion.]
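The idea of fusing a gradient stencil into the flux stencil that consumes it can be sketched in 1D. This is a simplified illustration, not the kernels from the talk: `u` is a hypothetical cell array, `nu` a hypothetical constant viscosity, and the "flux" is just the difference of two face gradients.

```python
def unfused(u, nu):
    """Two passes: store all face gradients, then read them back."""
    n = len(u)
    grad = [u[i + 1] - u[i] for i in range(n - 1)]  # stencil 1, written to memory
    out = [0.0] * n
    for i in range(1, n - 1):                       # stencil 2, reads grad
        out[i] = nu * (grad[i] - grad[i - 1])
    return out


def fused(u, nu):
    """One pass: gradients are computed where they are used, never stored.

    Each face gradient is now evaluated twice (once per adjacent cell),
    which trades extra arithmetic for less memory traffic.
    """
    n = len(u)
    out = [0.0] * n
    for i in range(1, n - 1):
        g_right = u[i + 1] - u[i]
        g_left = u[i] - u[i - 1]
        out[i] = nu * (g_right - g_left)
    return out
```

Both versions perform the same floating-point operations on the same operands, so their results are identical; only the intermediate array disappears.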

Improving locality and parallelism requires trading off redundant work.
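A minimal sketch of this trade-off, assuming a 1D array and a hypothetical 3-point smoothing kernel applied twice: each block recomputes a 2-cell halo on either side so that blocks never exchange intermediate values. The blocks are then independent (and could run in parallel), at the cost of the extra halo cells counted in `redundant`.

```python
def smooth(u):
    """One 3-point smoothing pass; the two end points are held fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def two_passes_reference(u):
    """Two full sweeps over the whole array."""
    return smooth(smooth(u))


def two_passes_blocked(u, block, halo=2):
    """Run both passes block by block on a slice widened by `halo` cells.

    Two passes of a 3-point stencil need a 2-cell halo for the block
    interior to come out exact. Returns the result and the number of
    extra (halo) cells that were processed.
    """
    n = len(u)
    out = [0.0] * n
    redundant = 0
    for start in range(0, n, block):
        stop = min(start + block, n)
        lo = max(start - halo, 0)
        hi = min(stop + halo, n)
        local = smooth(smooth(u[lo:hi]))          # both passes stay in the block
        out[start:stop] = local[start - lo:stop - lo]
        redundant += (hi - lo) - (stop - start)   # halo cells touched by this block
    return out, redundant
```

For 16 cells and a block size of 4, the blocked version matches the reference while touching 12 halo cells in addition to the 16 it owns.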

Further improving locality requires trading off accuracy.

The preceding optimizations were manually coded. Can such CFD solvers be expressed in stencil DSLs?

Yes! 1 month effort in Halide.

→ J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. PLDI ’13

Can stencil DSLs deliver a sufficient combination of optimizations to compete with hand-tuned code?

                   Haswell                Abu Dhabi
                   Hand-tuned   Halide    Hand-tuned   Halide
Optimization       4x           2x        3.3x         1.23x
+Parallelization   21.8x        6x        37.8x        5x
+Vectorization     2.8x         1.1x      1.65x        1x

The gap in the Optimization row is due to strength reduction and inter-stencil fusion in the hand-tuned code.

The gap in the +Parallelization row is partly due to NUMA-aware parallelization in the hand-tuned code. (Halide is currently not NUMA-aware.)
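The size of each gap can be read off by dividing the hand-tuned speedup by the Halide speedup in the same row and on the same machine. The sketch below does only that arithmetic on the numbers in the table; treating each row as an independent factor is an assumption, since the slides do not say how the rows combine. The names `SPEEDUPS` and `gaps` are hypothetical.

```python
# (hand-tuned, Halide) speedups per row and machine, copied from the table.
SPEEDUPS = {
    "Optimization": {"Haswell": (4.0, 2.0), "Abu Dhabi": (3.3, 1.23)},
    "+Parallelization": {"Haswell": (21.8, 6.0), "Abu Dhabi": (37.8, 5.0)},
    "+Vectorization": {"Haswell": (2.8, 1.1), "Abu Dhabi": (1.65, 1.0)},
}


def gaps(table):
    """Ratio of hand-tuned to Halide speedup for every row and machine."""
    return {
        row: {machine: hand / halide
              for machine, (hand, halide) in machines.items()}
        for row, machines in table.items()
    }


if __name__ == "__main__":
    for row, machines in gaps(SPEEDUPS).items():
        for machine, ratio in machines.items():
            print(f"{row:18s} {machine:10s} {ratio:.2f}x")
```

By this measure the widest gap is in the +Parallelization row on Abu Dhabi (37.8 / 5), which is the row the slides attribute partly to NUMA-aware parallelization.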

Can stencil DSLs deliver a sufficient combination of optimizations to compete with hand-tuned code?

Not yet! But there is hope.

CONCLUSIONS

Improving locality and parallelism requires trading off redundant work and accuracy.

CFD solvers can be expressed in stencil DSLs with minimal effort.

Limitations

๏ Finding the optimal schedule for performance is non-trivial.

๏ Most DSLs are only optimized for cell-centered stencils.

๏ DSLs do not yet support a sufficient combination of optimizations to compete with hand-tuned code.