
The Scalable HeterOgeneous Computing - AMDdeveloper.amd.com/wordpress/media/2013/06/2907_1_final.pdf · The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite ... (MPI) scientific

May 07, 2018


The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite
Kyle Spafford | Oak Ridge National Laboratory

SHOC Contributors: Anthony Danalis | Collin McCurdy | Gabriel Marin | Jeremy Meredith | Phil Roth | Vinod Tipparaju | Jeffrey Vetter


History & Motivation


“An experimental high performance computing system of innovative design.”

“Outside the mainstream of what is routinely available from computer vendors.”

National Science Foundation, Track2D Call, Fall 2008


Questions
In 2008, there was no available benchmark suite for OpenCL.
• New heterogeneous architectures
  - Which architecture performs specific operations best?
• New programming systems
  - New and rapidly improving: by how much?
  - How efficient is it?
  - How does it compare to existing methods?
• Multiple OpenCL stacks
  - Compare different runtimes, compilers, SDKs
• Energy efficiency

©2010 Advanced Micro Devices, Inc. All rights reserved. 4


The SHOC Benchmark Suite
What?
• An OpenCL benchmark suite focused on distributed (MPI) scientific computing workloads
Where?
• Download source code at http://ft.ornl.gov/doku/shoc/start
• “The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite.” GPGPU 2010 Workshop. ACM Portal
Why?
1) Use SHOC to help you understand hardware performance
2) Measure the performance of scientific kernels
3) Validate your hardware and set standards for a procurement
4) Use SHOC as example code to help you learn OpenCL and get familiar with the toolchain


SHOC - Organization
Organized into 3 Levels
• Level 0: “Feeds and Speeds”
• Level 1: Parallel Primitives
• Level 2: Real Application Kernels
3 Modes of Parallelism
• Serial: uses just a single OpenCL device
• Embarrassingly Parallel: runs a copy of the same small problem on more than one device
• True Parallel: uses multiple devices to work on the same problem (MPI)


Performance Tests
“Feeds and Speeds”
• MaxFLOPS, DeviceMemory (Global, Local, Image), BusSpeed (PCIe)
• kernelCompile, queueDelay
Parallel Primitives
• FFT, GEMM
• Reduction, Scan, Sort
• SpMV
• Stencil2D
• Triad
Real Applications
• S3D
• MD (from LAMMPS)


Use #1: Better Understand Your Hardware


NB: Unless otherwise noted, results are shown using OpenCL on AMD APP SDK 2.4 or NV SDK 4.0


Simple Example: Memory Bandwidth
GPUs attain the best memory bandwidth by coalescing memory accesses.

SHOC’s DeviceMemory test quantifies this advantage.


[Diagram: memory access patterns of Work Items 1 to 4 under two layouts, coalesced and work-item-sequential (uncoalesced).]
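The two access patterns can be illustrated with a small index calculation. This is only a sketch of the idea, not SHOC's DeviceMemory kernel: the function names and the 4-item, 4-element sizes are assumptions chosen for illustration. In the coalesced layout, neighbouring work items touch neighbouring addresses at each step; in the work-item-sequential layout, each work item walks its own contiguous chunk, so neighbouring work items are far apart at any given step.

```python
def coalesced_indices(num_items, elems_per_item):
    """At step s, work item w reads element s * num_items + w."""
    return [[s * num_items + w for w in range(num_items)]
            for s in range(elems_per_item)]


def sequential_indices(num_items, elems_per_item):
    """Each work item owns a contiguous chunk; at step s it reads w * elems_per_item + s."""
    return [[w * elems_per_item + s for w in range(num_items)]
            for s in range(elems_per_item)]


def max_gap(step):
    """Largest address distance between neighbouring work items in one step."""
    return max(b - a for a, b in zip(step, step[1:]))


if __name__ == "__main__":
    for name, pattern in (("coalesced", coalesced_indices),
                          ("sequential", sequential_indices)):
        first_step = pattern(4, 4)[0]
        print(name, first_step, "gap between neighbours:", max_gap(first_step))
```

A gap of 1 means one wide memory transaction can serve all the work items in a step; a larger gap typically forces several transactions.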


Results

[Chart: Global Memory BW (4 byte granularity, GB/s), Coalesced vs. Strided, for AMD FirePro v8800 (Cypress), Intel Xeon 5500 2.7 GHz, and NV Tesla C2050 (Fermi). Data labels in extraction order: 130, 0.98, 99.94, 60.42, 26.93, 15.02.]


Simple Example #2: Measure (and pay attention to) Precision

MaxFLOPS (GFLOPS)

Device                          Single     Double
AMD FirePro v8800 (Cypress)     2540.88    512.85
Intel Xeon 5500 2.7 GHz           60.44     34.04


A Little More Complicated: NUMA System Paths


[Diagram: NUMA node with CPU #0 and CPU #1 (each with DDR3 RAM) joined by QPI/HT, two I/O hubs, GPU #0, GPU #1, GPU #2, and an IB adapter, connected by PCIe x8 and PCIe x16 links.]


Bandwidth Penalty CPU #0 H->D Copy


Bandwidth Penalty CPU #0 D->H Copy

~2 GB/s

K. Spafford, J. Meredith, J. S. Vetter. Quantifying NUMA and Contention Effects in Multi-GPU Systems. Proceedings of the Workshops on General Purpose Computation on Graphics Processors (GPGPU '11).


A Little More Complicated: Bus Contention
Simple idea: GPUs and IB HCAs share the same PCIe bus. Will large amounts of concurrent MPI transfers and GPU transfers degrade performance?

Measurement performed on ORNL Lens cluster, NV OCL 3.1, PCIe 1.0 x16


Use #2: Measure the Performance of Scientific Kernels


Example -- Stencil2D
Motivation
• Supports investigation of accelerator usage within parallel application context
• Good representative of data movement in real apps
Basic design
• 9-point stencil operation applied to 2D data set
• MPI uses 2D Cartesian data distribution, with periodic halo exchanges
• Applies stencil to data in local memory
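As a rough illustration of the 9-point stencil operation, the sketch below applies one sweep to the interior of a 2D grid held in a single process. It is a minimal serial sketch, not SHOC's OpenCL kernel: the function name and the three weights are hypothetical, and the outermost ring of cells stands in for the halo that the MPI version would fill from neighbouring ranks.

```python
def stencil9(grid, w_center=0.25, w_cardinal=0.15, w_diagonal=0.0375):
    """Apply one 9-point stencil sweep to the interior cells of a 2D list.

    The weights are illustrative assumptions (they sum to 1.0); boundary
    cells play the role of a halo and are copied through unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            cardinal = (grid[i - 1][j] + grid[i + 1][j]
                        + grid[i][j - 1] + grid[i][j + 1])
            diagonal = (grid[i - 1][j - 1] + grid[i - 1][j + 1]
                        + grid[i + 1][j - 1] + grid[i + 1][j + 1])
            out[i][j] = (w_center * grid[i][j]
                         + w_cardinal * cardinal
                         + w_diagonal * diagonal)
    return out


if __name__ == "__main__":
    spike = [[0.0] * 5 for _ in range(5)]
    spike[2][2] = 1.0
    for row in stencil9(spike):
        print(row)
```

In a distributed run, each rank would own one tile of the Cartesian decomposition and refresh its halo ring from its neighbours before each sweep.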


Stencil2D Scaling Study on Keeneland


Our FAQ page walks you through generating scaling studies like this one.


Example -- S3D
Motivation
• Used by DoE to model the combustion of biofuels
• FLOP intensive and scales to 230k+ cores on Jaguar
Basic design
• S3D solves the compressible Navier-Stokes equations for a regular 3D domain.
• Parallelize by assigning each grid point to a work item
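Assigning each grid point to a work item amounts to a one-to-one mapping between a linear work-item ID and a 3D grid coordinate. The sketch below shows one common way to write that mapping. The function names and the x-fastest ordering are assumptions for illustration; the source does not say which ordering the SHOC kernel uses.

```python
def work_item_to_grid_point(gid, nx, ny, nz):
    """Map a linear work-item ID to (i, j, k), with i varying fastest."""
    if not 0 <= gid < nx * ny * nz:
        raise ValueError("work-item ID outside the grid")
    i = gid % nx
    j = (gid // nx) % ny
    k = gid // (nx * ny)
    return i, j, k


def grid_point_to_work_item(i, j, k, nx, ny):
    """Inverse mapping: (i, j, k) back to the linear work-item ID."""
    return (k * ny + j) * nx + i


if __name__ == "__main__":
    nx, ny, nz = 2, 3, 4
    for gid in range(nx * ny * nz):
        print(gid, work_item_to_grid_point(gid, nx, ny, nz))
```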


S3D Performance


K. Spafford, J. Meredith, J. S. Vetter, J. Chen, R. Grout, and R. Sankaran. Accelerating S3D: A GPGPU Case Study. Proceedings of the Seventh International Workshop on Algorithms, Models, and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar 2009) Delft, The Netherlands. 

[Chart: S3D Single Precision per TDP (GFLOPS / watt), series S3D-SP, for AMD FirePro v8800 (Cypress), NV Tesla C2050 (Fermi), and NV ION. Data labels in extraction order: 0.199, 0.169, 0.192.]


Example: Sparse Matrix Vector Multiplication (SpMV)
Motivation
• Extremely common scientific kernel
• Bandwidth bound, and much harder to get performance than GEMM
Basic design
• 3 algorithms, padded & unpadded data
• CSR and ELLPACKR data formats
• Supports random matrices or matrix market format
• Example: Gould, Hu, & Scott: expanded system-3D PDE, visualized at UF.
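For readers unfamiliar with the CSR format, the sketch below multiplies a CSR matrix by a dense vector one row at a time. It is a serial sketch with hypothetical names, not SHOC's kernel. The outer loop corresponds to what a scalar CSR kernel hands to one work item per row; a vector CSR kernel would instead split the inner loop across a group of work items.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[r]..row_ptr[r + 1] delimits the nonzeros of row r inside
    col_idx (column indices) and values (the nonzero entries).
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):  # one work item per row in a scalar CSR kernel
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


if __name__ == "__main__":
    # CSR form of [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 2, 1, 0, 2]
    values = [4.0, 1.0, 2.0, 3.0, 5.0]
    print(spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0]))
```

The irregular, data-dependent reads of x are one reason the kernel is bandwidth bound and harder to tune than GEMM.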


SpMV Performance

[Chart: DP SpMV Random Matrix (10k x 10k, 1% sparsity), series CSR-Scalar, CSR-Vector, and ELLPACKR, for AMD FirePro v8800 (Cypress), NV Tesla C2050 (Fermi), and Intel Xeon 5500 2.7 GHz. Data labels in extraction order: 1.1, 0.64, 3.99, 4.69, 3.62, 0.179, 3.95, 4.05, 0.18.]


Example: Molecular Dynamics
Motivation
• Classic n-body pairwise computation, important to all MD codes such as GPU-LAMMPS, AMBER, NAMD, and Gromacs
Basic design
• Computation of the Lennard-Jones potential force
• 3D domain, random distribution
• Neighbor list algorithm
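The design above can be sketched in a few lines: build a neighbor list once, then sum the Lennard-Jones force over each atom's neighbors. This is a minimal serial sketch, not the SHOC kernel. The function names, the brute-force O(n^2) list construction, and the default epsilon, sigma, and cutoff values are all assumptions for illustration. The force on atom i from atom j uses the standard form 24 * epsilon * (2 * (sigma/r)^12 - (sigma/r)^6) / r^2 times the separation vector.

```python
import math


def build_neighbor_list(positions, list_radius):
    """Brute-force neighbor search: for each atom, the atoms within list_radius."""
    neighbors = []
    for i, pi in enumerate(positions):
        near = [j for j, pj in enumerate(positions)
                if j != i and math.dist(pi, pj) < list_radius]
        neighbors.append(near)
    return neighbors


def lj_forces(positions, neighbors, epsilon=1.0, sigma=1.0, cutoff=2.5):
    """Lennard-Jones force on every atom, summed over its neighbor list."""
    forces = [[0.0, 0.0, 0.0] for _ in positions]
    cutoff2 = cutoff * cutoff
    for i, (xi, yi, zi) in enumerate(positions):
        for j in neighbors[i]:
            dx = xi - positions[j][0]
            dy = yi - positions[j][1]
            dz = zi - positions[j][2]
            r2 = dx * dx + dy * dy + dz * dz
            if r2 >= cutoff2:
                continue  # listed neighbor, but outside the interaction cutoff
            s2 = sigma * sigma / r2
            s6 = s2 * s2 * s2
            scale = 24.0 * epsilon * s6 * (2.0 * s6 - 1.0) / r2
            forces[i][0] += scale * dx
            forces[i][1] += scale * dy
            forces[i][2] += scale * dz
    return forces


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    print(lj_forces(atoms, build_neighbor_list(atoms, 3.0)))
```

Each atom's force depends only on its own neighbor list, so the outer loop maps naturally onto one work item per atom.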


Reduction and Scan


Motivation
• Two fundamental primitives for almost all parallel algorithms
• Scan, e.g. (1, 1, 1, 1) → (1, 2, 3, 4)
Basic design
• Both use a tree-based algorithm operating on local memory
• Reduction uses linear access strided by total number of threads for coalesced reads
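Both primitives can be sketched serially to show the log-step structure. The reduction below combines pairs at doubling strides, as a tree reduction does. The scan uses the Hillis-Steele doubling scheme, a simple log-step relative of tree-based scans that was picked here for brevity; it may differ from the scan SHOC implements. The function names are hypothetical, and each pass of the inner loops stands in for work that a GPU would do in parallel between barriers.

```python
def tree_reduce(values):
    """Sum a list by combining pairs at strides 1, 2, 4, ... (a reduction tree)."""
    data = list(values)
    if not data:
        return 0
    n = len(data)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
    return data[0]


def inclusive_scan(values):
    """Inclusive prefix sum using Hillis-Steele doubling steps."""
    data = list(values)
    offset = 1
    while offset < len(data):
        prev = data[:]  # snapshot plays the role of a barrier between steps
        for i in range(offset, len(data)):
            data[i] = prev[i] + prev[i - offset]
        offset *= 2
    return data


if __name__ == "__main__":
    print(tree_reduce([1, 1, 1, 1]))
    print(inclusive_scan([1, 1, 1, 1]))
```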


Reduction and Scan (SP, GB/s)

Device                          Reduction    Scan
AMD FirePro v8800 (Cypress)        107.1    19.97
NV Tesla C2050 (Fermi)              89.5     25.7
NV ION                              6.92     1.37


Radix Sort
Motivation
• Most common GPU sort, but nontrivial to obtain performance
• Popular GPU libraries use radix sort, so it ends up in a lot of apps
Basic design
• Iteratively sort data 4 bits at a time
• Highly parallel, takes advantage of fast shared memory for shuffling results
Performance Observations
• OpenCL/CUDA exhibit comparable performance
• Both perform at a small fraction of peak memory bandwidth
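Sorting 4 bits at a time can be sketched as a least-significant-digit radix sort with 16 buckets per pass. This is a serial sketch assuming non-negative 32-bit integer keys; the function name and the key width are assumptions, and SHOC's kernel is organized differently for the GPU. Each pass has three steps that parallelize well: a histogram of digit values, an exclusive scan of the histogram, and a stable scatter.

```python
def radix_sort_4bit(keys, key_bits=32):
    """LSD radix sort of non-negative integers, 4 bits (16 buckets) per pass."""
    keys = list(keys)
    for shift in range(0, key_bits, 4):
        # Histogram: how many keys carry each 4-bit digit value.
        counts = [0] * 16
        for key in keys:
            counts[(key >> shift) & 0xF] += 1
        # Exclusive scan: first output slot for each digit value.
        offsets = [0] * 16
        total = 0
        for digit in range(16):
            offsets[digit] = total
            total += counts[digit]
        # Stable scatter: keys with equal digits keep their relative order.
        out = [0] * len(keys)
        for key in keys:
            digit = (key >> shift) & 0xF
            out[offsets[digit]] = key
            offsets[digit] += 1
        keys = out
    return keys


if __name__ == "__main__":
    print(radix_sort_4bit([170, 45, 75, 90, 802, 24, 2, 66]))
```

The scan in the middle step is the same primitive covered on the Reduction and Scan slide.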

24 June 2011


Use #3: Validate your hardware and set standards for procurement


Validate - Prime95-Style “Torture Test”
SHOC includes a stability test which repeatedly runs and checks FFTs
Just like prime95
• Extremely sensitive to:
  - Stuck bits
  - Other HW errors
Looking to burn in that new cluster?
mpirun -np num_nodes $SHOC_BIN/Stability –minutes 60

[Diagram: alternating FFT and inverse FFT steps.]
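The idea behind an FFT-based stability test can be sketched as a round trip: transform a signal, invert it, and check that the original comes back within a tolerance. The sketch below is not SHOC's Stability benchmark. It uses a small recursive radix-2 FFT written for illustration, and the function names, pass count, signal size, and tolerance are all assumptions. On healthy hardware the round-trip error stays near floating-point rounding; a stuck bit or similar fault would typically push it far above the tolerance.

```python
import cmath
import random


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. No 1/n scaling."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def round_trip_error(signal):
    """Largest element-wise difference after a forward and an inverse FFT."""
    n = len(signal)
    restored = [v / n for v in fft(fft(signal), inverse=True)]
    return max(abs(a - b) for a, b in zip(signal, restored))


def stability_check(passes=10, size=256, tol=1e-9, seed=0):
    """Repeat the round trip on random signals; False on the first mismatch."""
    rng = random.Random(seed)
    for _ in range(passes):
        signal = [complex(rng.random(), rng.random()) for _ in range(size)]
        if round_trip_error(signal) > tol:
            return False
    return True


if __name__ == "__main__":
    print("stable" if stability_check() else "mismatch detected")
```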


Summary
• SHOC is an open source OpenCL benchmark suite focused on scientific computing
• It’s the only OpenCL + MPI benchmark suite
• It’s available for download today: http://ft.ornl.gov/doku/shoc/start
• Send any questions to: [email protected]

Questions?


Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.