HPC Advisory Council China Workshop October 28, 2012

CENTER for MANYCORE PROGRAMMING

SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU Clusters

Jaejin Lee
Center for Manycore Programming
School of Computer Science and Engineering
Seoul National University
[email protected]
http://aces.snu.ac.kr

Heterogeneous Computing Systems

Contain different types of processors
  Processors: CPUs, DSPs, GPUs, FPGAs, or ASICs
  For extra performance and power efficiency

Heterogeneity in ISAs, processing power, power consumption, memory hierarchies, micro-architectures, etc.

GPGPU systems and clusters are widening their user base

[Figure: compute nodes joined by an interconnection network; each compute node has a multicore CPU with main memory and four GPUs, each with its own memory, attached over PCI-E]



Parallel Programming Models

An interface between the programmer and the parallel machine when developing an application
  Languages, libraries, language extensions, compiler directives, etc.

Important to strike a balance between delivering high performance and ease of programming

[Figure labels: High performance, Ease of programming]



Ease of Programming

How to handle the heterogeneity?




OpenCL

Open Computing Language

A framework (parallel programming model) for heterogeneous parallel computing

A language, API, libraries, and a runtime system

The specification of OpenCL 1.0 was released in late 2008

Now, OpenCL 1.2




OpenCL (contd.)

From mobile devices to supercomputers

Portable code across different architectures
  CPUs, GPUs, Cell BE processors, etc.
  Not yet portable performance

Based on the ANSI/ISO C99 standard

Supported by many vendors, including Apple, AMD, ARM, IBM, Intel, NVIDIA, and Samsung



Data Parallelism

Also known as loop-level parallelism
Performing the same operation on different items of data at the same time
More data, more parallelism

    for (i = 0; i < 16; i++) {
        c[i] = a[i] + b[i];
    }

[Figure: the 16 element-wise additions a[i] + b[i] = c[i], drawn as independent operations]



OpenCL Application

The combination of programs running on a host processor and OpenCL compute devices
  Compute devices: CPUs, GPUs, etc.

A host program + OpenCL programs
  OpenCL program: a set of kernels
  Based on ISO C99

Host program:

    main()
    {
        …
        context = clCreateContextFromType( … );
        …
        cmd_queue = clCreateCommandQueue(…);
        …
        memobj[0] = clCreateBuffer(…);
        ...
    }

OpenCL program:

    __kernel void mat_mul(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        ...
    }

    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        ...
    }



Vector Addition Example

Serial C function:

    void vec_add(int n, const float *A, const float *B, float *C)
    {
        int i;
        for (i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }

Calling it from the program:

    int main(void)
    {
        float srcA[N], srcB[N], srcC[N];

        // initialize srcA and srcB
        ...
        vec_add(N, srcA, srcB, srcC);
        ...
    }

OpenCL kernel (each instance computes one element):

    __kernel void vec_add(__global const float *A,
                          __global const float *B,
                          __global float *C)
    {
        int id = get_global_id(0);

        C[id] = A[id] + B[id];
    }
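The relationship between the serial loop and the kernel can be sketched in plain C. This is only an emulation under stated assumptions: `vec_add_work_item` and `run_vec_add` are hypothetical names that do not appear in the slides or in OpenCL, and the loop in `run_vec_add` stands in for what an OpenCL runtime does when it creates one kernel instance per index (a real runtime may execute the instances in parallel and in any order).

```c
#include <stddef.h>

/* Hypothetical stand-in for one instance of the vec_add kernel.
 * In OpenCL C, gid would be obtained with get_global_id(0). */
void vec_add_work_item(size_t gid, const float *A, const float *B, float *C)
{
    C[gid] = A[gid] + B[gid];
}

/* Hypothetical stand-in for launching a one-dimensional index space of
 * global_size instances. The instances are independent of each other,
 * so running them sequentially here gives the same result as running
 * them in parallel. */
void run_vec_add(size_t global_size, const float *A, const float *B, float *C)
{
    for (size_t gid = 0; gid < global_size; gid++)
        vec_add_work_item(gid, A, B, C);
}
```

The loop counter of the serial version becomes the index of the instance, which is why the kernel body contains no loop.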



OpenCL Application

Host program
  Executes on the host and manages kernel execution

Kernels
  Basic unit of executable code (a function) on compute devices
  When executed, many instances are created
  Exploits data parallelism

The host program and kernels all run in parallel

[Figure: the host, Device 0, and Device 1 shown side by side, with Kernel-A, Kernel-B, and Kernel-C running on the devices]



Limitations

Current OpenCL implementations target parallelism across multiple compute devices under a single OS instance

An application for a heterogeneous CPU/GPU cluster
  MPI + OpenCL or MPI + CUDA
  Complicated, less portable, and hard to maintain



An Illusion of a Single OS Instance

What if the programmer could write applications for heterogeneous CPU/GPU clusters using only OpenCL?
  Easy to program and more portable

[Figure: SnuCL presents a cluster (compute nodes with CPUs, main memory, and PCI-E-attached GPUs, joined by an interconnection network) as a system image running a single OS instance with many CPU and GPU devices]



SnuCL

An OpenCL framework [ICS ’12]
  Freely available, open-source software developed at Seoul National University
  Supports x86 CPUs, AMD GPUs, and NVIDIA GPUs
  Its source code is publicly available at http://aces.snu.ac.kr

SnuCL version 1.2 beta released June 13, 2012 (supports OpenCL 1.2)
  Passed most of the OpenCL conformance tests

Platform layer + runtime + kernel compiler



SnuCL (contd.)

Naturally extends the original OpenCL semantics to the heterogeneous cluster environment

Provides an illusion of a heterogeneous system running a single OS instance

With SnuCL, an OpenCL application written for a single heterogeneous system runs on a heterogeneous cluster without any modification




The Effect of Using SnuCL

Copy buffers between different nodes in the cluster environment (Buffer A → Buffer B)

Previous approach (mixture of MPI and OpenCL):

    MPI_Init(..);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    …
    cl_mem bufferA = clCreateBuffer(…);
    cl_mem bufferB = clCreateBuffer(…);
    …
    void *temp = malloc(…);
    if (rank == SRC_DEV) {
        /* source node: read buffer A into host memory, then send it */
        clEnqueueReadBuffer(cq, bufferA, …, temp, …);
        MPI_Send(temp, …, DST_DEV, …);
    } else if (rank == DST_DEV) {
        /* destination node: receive into host memory, then write buffer B */
        MPI_Recv(temp, …, SRC_DEV, …);
        clEnqueueWriteBuffer(cq, bufferB, …, temp, …);
    }
    …
    MPI_Finalize();

SnuCL (OpenCL only):

    …
    cl_mem bufferA = clCreateBuffer(…);
    cl_mem bufferB = clCreateBuffer(…);
    …
    clEnqueueCopyBuffer(cq, bufferA, bufferB, …);
    …



How to Achieve the Single System Image?

SnuCL runtime provides the illusion
  Mapping components between the OpenCL platform and underlying hardware resources

Source-to-source kernel restructuring techniques
  OpenCL C to C for CPUs

Buffer management techniques
  Efficient node-to-node data transfer
  Consistency management

[Figure: software stack. An OpenCL application sits on the SnuCL runtime, which uses the SnuCL OpenCL-to-C compiler for x86 CPUs, the NVIDIA OpenCL runtime for NVIDIA GPUs, and the AMD OpenCL runtime for AMD GPUs]



SnuCL Extensions to OpenCL

SnuCL has extensions to OpenCL for copying buffers (i.e., memory objects)

Buffer-copy memory commands are often inefficient in the cluster environment, depending on the access pattern
  The extensions are similar to MPI collective communication operations

SnuCL                         MPI equivalent
clEnqueueAlltoAllBuffer       MPI_Alltoall
clEnqueueBroadcastBuffer      MPI_Bcast
clEnqueueScatterBuffer        MPI_Scatter
clEnqueueGatherBuffer         MPI_Gather
clEnqueueAllGatherBuffer      MPI_Allgather
clEnqueueReduceBuffer         MPI_Reduce
clEnqueueAllReduceBuffer      MPI_Allreduce
clEnqueueReduceScatterBuffer  MPI_Reduce_scatter
clEnqueueScanBuffer           MPI_Scan
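To make one of these access patterns concrete, here is a plain-C model of the all-to-all exchange, the data movement performed by MPI_Alltoall. It is not SnuCL code: the slides do not give the signatures of the extension commands, so `alltoall_model` and its flat array layout are assumptions made purely for illustration. Each of `ndev` buffers holds `ndev` blocks; after the exchange, block `i` of destination buffer `j` holds what was block `j` of source buffer `i`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of an all-to-all exchange among ndev buffers.
 * src and dst are flat arrays of ndev * ndev * block floats:
 * buffer i occupies elements [i * ndev * block, (i + 1) * ndev * block),
 * and block j of buffer i starts at element (i * ndev + j) * block. */
void alltoall_model(size_t ndev, size_t block, const float *src, float *dst)
{
    for (size_t i = 0; i < ndev; i++) {        /* source buffer */
        for (size_t j = 0; j < ndev; j++) {    /* destination buffer */
            memcpy(dst + (j * ndev + i) * block,
                   src + (i * ndev + j) * block,
                   block * sizeof(float));
        }
    }
}
```

Expressed with pairwise buffer copies, this pattern takes `ndev * ndev` separate copy commands, which suggests why a single collective command can be handled more efficiently on a cluster.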



Matrix Multiplication Performance




SnuCL for CPU Devices in the Cluster

The kernel code is portable across different types of compute devices

An OpenCL application written for GPUs will run on CPU devices with SnuCL
  Replace CL_DEVICE_TYPE_GPU with CL_DEVICE_TYPE_CPU



SNU NPB Suite

Most of the applications in the NAS Parallel Benchmarks (NPB 3.3) are implemented in C, OpenMP C, and OpenCL [IISWC ’11]
  NPB-SER-C: a serial C version of the NPB code
  NPB-OMP-C: an OpenMP C version of the NPB code
  NPB-OCL: an OpenCL version of the NPB code for a single device
  NPB-OCL-MD: an OpenCL version of the NPB code for multiple OpenCL compute devices

Source code is publicly available
  http://aces.snu.ac.kr

Applications

Application     Source   Description               Input                                                Global memory size (MB)  Extensions used
BinomialOption  AMD      Binomial option pricing   65504 or 2097152 samples, 512 steps, 100 iterations  2.0 or 64.0
BlackScholes    PARSEC   Black-Scholes PDE         33538048 options, 100 iterations                     895.6
BT              NAS      Block tridiagonal solver  Class C or Class D                                   1982.1 or 30686.7
CG              NAS      Conjugate gradient        Class C or Class D                                   1102.6 or 20399.1
CP              Parboil  Coulombic potential       16384x16384, 1000 atoms                              4.1
EP              NAS      Embarrassingly parallel   Class D                                              0.8
FT              NAS      3-D FFT PDE               Class B or Class C                                   2816.0 or 11264.0        AlltoAll
MatrixMul       NVIDIA   Matrix multiplication     10752x10752 or 16384x16384                           1323.0 or 3072.0         Broadcast
MG              NAS      Multigrid                 Class C or Class D                                   3575.3 or 28343.7
Nbody           NVIDIA   N-Body simulation         1048576 bodies                                       64.0
SP              NAS      Pentadiagonal solver      Class C or Class D                                   1477.9 or 19974.4



Speedup (over 1 CPU core)

CPU devices only (11 cores in a CPU device), 1 CPU device per node
  SnuCL-Static: uses static scheduling for the kernel workload distribution
  The numbers on the x-axis represent the number of CPU compute devices

GPU devices only (4 GPU devices per node)
  The numbers on the x-axis represent the number of GPU compute devices

[Figure: speedup bar charts in three panels. One panel covers BinomialOption, BlackScholes, CP, EP.D, MatrixMul, and Nbody on 1 to 32 devices; one covers BT.C, CG.C, FT.B, MG.C, and SP.C on up to 36 devices; one covers all eleven applications on up to 9 devices with SnuCL-Static and SnuCL series. Y-axis: speedup; x-axis: number of compute devices. Off-scale bar labels: 9515; 76, 82; 70, 73]
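The slide names static scheduling without describing how SnuCL-Static divides the work. As a rough illustration only, one common form of static scheduling splits the work into contiguous, nearly equal blocks that are fixed before execution starts. The sketch below assumes the unit of distribution is a work-group; `wg_range` and `static_partition` are hypothetical names and do not come from SnuCL.

```c
#include <stddef.h>

/* Half-open range [begin, end) of work-group indices given to one device. */
typedef struct {
    size_t begin;
    size_t end;
} wg_range;

/* Static block partition of num_groups work-groups over num_devices
 * devices. Every device gets num_groups / num_devices groups, and the
 * first (num_groups % num_devices) devices get one extra group each.
 * The assignment depends only on the sizes, not on how fast devices run. */
wg_range static_partition(size_t num_groups, size_t num_devices, size_t dev)
{
    size_t base = num_groups / num_devices;
    size_t extra = num_groups % num_devices;
    wg_range r;

    r.begin = dev * base + (dev < extra ? dev : extra);
    r.end = r.begin + base + (dev < extra ? 1 : 0);
    return r;
}
```

For example, 10 work-groups over 4 devices are assigned as [0, 3), [3, 6), [6, 8), and [8, 10). A scheme like this has no run-time overhead but cannot adapt when devices differ in speed.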



SnuCL vs. MPI-Fortran

Speedup over a single compute node (a CPU compute device with 4 CPU cores)
  Normalized to MPI-Fortran

MPI-Fortran: the unmodified original MPI-Fortran versions from NPB

The numbers on the x-axis represent the number of compute nodes (CPU compute devices)

[Figure: speedup of MPI-Fortran and SnuCL for BinomialOption, BlackScholes, BT.D, CG.D, CP, EP.D, FT.C, MatrixMul, MG.D, Nbody, and SP.D. Y-axis: speedup; x-axis: number of compute nodes, from 1 to 256]



Future Directions

Scalability for more than 1000 nodes

Achieving a single compute device image for multiple heterogeneous devices in a heterogeneous CPU/GPU cluster

Autotuning
  To make performance portable across heterogeneous devices

Intelligent load balancing between multiple heterogeneous compute devices



References

[ICS ’12] Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee. SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU Clusters. ICS ’12: Proceedings of the 26th International Conference on Supercomputing, San Servolo Island, Venice, Italy, June 2012.

[IISWC ’11] Sangmin Seo, Gangwon Jo, and Jaejin Lee. Performance Characterization of the NAS Parallel Benchmarks in OpenCL. IISWC ’11: Proceedings of the 2011 IEEE International Symposium on Workload Characterization, Austin, Texas, USA, November 2011.

[PACT ’11] Jun Lee, Jungwon Kim, Junghyun Kim, Sangmin Seo, and Jaejin Lee. An OpenCL Framework for Homogeneous Manycores with no Hardware Cache Coherence. PACT ’11: Proceedings of the 20th ACM/IEEE/IFIP International Conference on Parallel Architectures and Compilation Techniques, Galveston Island, Texas, USA, October 2011.



References (contd.)

[LCPC ’11] Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee. OpenCL as a Programming Model for GPU Clusters. LCPC ’11: Proceedings of the 24th International Workshop on Languages and Compilers for Parallel Computing, Fort Collins, Colorado, USA, September 2011.

[PPoPP ’11] Jungwon Kim, Honggyu Kim, Joo Hwan Lee, and Jaejin Lee. Achieving a Single Compute Device Image in OpenCL for Multiple GPUs. PPoPP ’11: Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 277-288, San Antonio, Texas, USA, February 2011. DOI: 10.1145/1941553.1941591.

[PACT ’10] Jaejin Lee, Jungwon Kim, Sangmin Seo, Seungkyun Kim, Jungho Park, Honggyu Kim, Thanh Tuan Dao, Yongjin Cho, Sung Jong Seo, Seung Hak Lee, Seung Mo Cho, Hyo Jung Song, Sang-Bum Suh, and Jong-Deok Choi. An OpenCL Framework for Heterogeneous Multicores with Local Memory. PACT ’10: Proceedings of the 19th ACM/IEEE/IFIP International Conference on Parallel Architectures and Compilation Techniques, pp. 193-204, Vienna, Austria, September 2010. DOI: 10.1145/1854273.1854301.



Contributors to SnuCL

Jungwon Kim
Sangmin Seo
Jun Lee
Jeongho Nah
Gangwon Jo
Jaejin Lee

SnuCL is publicly available at http://aces.snu.ac.kr