HPC Advisory Council China Workshop October 28, 2012
CENTER for MANYCORE PROGRAMMING
SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU Clusters
Jaejin Lee
Center for Manycore Programming
School of Computer Science and Engineering
Seoul National University
[email protected]
http://aces.snu.ac.kr
Heterogeneous Computing Systems
Contain different types of processors
  Processors: CPUs, DSPs, GPUs, FPGAs, or ASICs
  For extra performance and power efficiency
Heterogeneity in
  ISAs, processing power, power consumption, memory hierarchies, micro-architectures, etc.
GPGPU systems and clusters are widening their user base
[Figure: compute nodes connected by an interconnection network; each compute node contains multicore CPUs, main memory, and GPUs with their own memory attached over PCI-E]
Parallel Programming Models
An interface between the programmer and the parallel machine when developing an application
  Languages, libraries, language extensions, compiler directives, etc.
It is important to balance high performance against ease of programming
[Figure: high performance vs. ease of programming]
Ease of Programming
How to handle the heterogeneity?
OpenCL
Open Computing Language
A framework (parallel programming model) for heterogeneous parallel computing
A language, API, libraries, and a runtime system
The specification of OpenCL 1.0 was released in late 2008
Now, OpenCL 1.2
OpenCL (contd.)
From mobile devices to supercomputers
Portable code across different architectures
  CPUs, GPUs, Cell BE processors, etc.
  Not yet portable performance
Based on ANSI/ISO C99 standard
Supported by many vendors, such as Apple, AMD, ARM, IBM, Intel, NVIDIA, Samsung, etc.
Data Parallelism
Also known as loop-level parallelism
Performing the same operation on different items of data at the same time
More data, more parallelism

    for (i = 0; i < 16; i++) {
        c[i] = a[i] + b[i];
    }

[Figure: the element-wise additions a[i] + b[i] = c[i], one per array element]
OpenCL Application
The combination of programs running on a host processor and OpenCL compute devices
  Compute devices: CPUs, GPUs, etc.
A host program + OpenCL programs
  OpenCL program: a set of kernels
  Based on ISO C99

Host program:

    int main()
    {
        …
        context = clCreateContextFromType(…);
        …
        cmd_queue = clCreateCommandQueue(…);
        …
        memobj[0] = clCreateBuffer(…);
        ...
    }

OpenCL program:

    __kernel void mat_mul(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        ...
    }

    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        ...
    }
Vector Addition Example
C program:

    void vec_add(int n, const float *A, const float *B, float *C)
    {
        int i;
        for (i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }

    int main()
    {
        float srcA[N], srcB[N], srcC[N];

        // initialize srcA and srcB
        ...
        vec_add(N, srcA, srcB, srcC);
        ...
    }

OpenCL kernel:

    __kernel void vec_add(__global const float *A,
                          __global const float *B,
                          __global float *C)
    {
        int id = get_global_id(0);

        C[id] = A[id] + B[id];
    }
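The link between the two versions above can be sketched in plain C. Launching the kernel over a one-dimensional index space of N work-items behaves as if the kernel body ran once per index, with get_global_id(0) returning that index. The names `vec_add_work_item` and `launch_vec_add` are hypothetical and exist only for this illustration; no OpenCL library is used, and the work-items run one after another here rather than in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the vec_add kernel for a single work-item. The `id` parameter
 * stands in for the value get_global_id(0) returns inside a real kernel. */
void vec_add_work_item(size_t id, const float *A, const float *B, float *C)
{
    C[id] = A[id] + B[id];
}

/* Hypothetical stand-in for a 1-D kernel launch: one call of the kernel
 * body per index in [0, global_size). A real device may run these calls
 * in any order and concurrently, which is safe here because each
 * work-item touches only its own element. */
void launch_vec_add(size_t global_size, const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t id = 0; id < global_size; id++)
        vec_add_work_item(id, A, B, C);
}
```

Calling `launch_vec_add(N, srcA, srcB, srcC)` produces the same result as the sequential `vec_add(N, srcA, srcB, srcC)`; the loop over `id` is what the OpenCL runtime replaces with many kernel instances.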
OpenCL Application
Host program
  Executes on the host and manages kernel execution
Kernels
  Basic unit of executable code (a function) on compute devices
  When executed, many instances are created
  Exploits data parallelism
The host program and kernels all run in parallel
[Figure: Host, Device 0, and Device 1 side by side, with Kernel-A, Kernel-B, and Kernel-C running on the devices]
Limitations
Current OpenCL implementations target only the compute devices under a single OS instance
An application for a heterogeneous CPU/GPU cluster
  MPI + OpenCL or MPI + CUDA
  Complicated, less portable, and hard to maintain
An Illusion of a Single OS Instance
If the programmer can write applications for heterogeneous CPU/GPU clusters using only OpenCL, the applications become easier to write and more portable
[Figure: SnuCL presents the cluster (compute nodes with multicore CPUs, main memory, and PCI-E-attached GPUs, joined by an interconnection network) as a system image running a single OS instance, with one main memory, a host CPU, and many CPU and GPU compute devices]
SnuCL
An OpenCL framework [ICS ’12]
  Freely available, open-source software developed at Seoul National University
  Supports x86 CPUs, AMD GPUs, and NVIDIA GPUs
  Its source code is publicly available at http://aces.snu.ac.kr
SnuCL version 1.2 beta released June 13, 2012 (supports OpenCL 1.2)
  Passed most of the OpenCL conformance tests
Platform layer + runtime + kernel compiler
SnuCL (contd.)
Naturally extends the original OpenCL semantics to the heterogeneous cluster environment
Provides an illusion of a heterogeneous system running a single OS instance
With SnuCL, an OpenCL application written for a single heterogeneous system runs on a heterogeneous cluster without any modification
The Effect of Using SnuCL
Copy buffers between different nodes in the cluster environment (Buffer A → Buffer B)
Previous approach (mixture of MPI and OpenCL):

    MPI_Init(..);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    …
    cl_mem bufferA = clCreateBuffer(…);
    cl_mem bufferB = clCreateBuffer(…);
    …
    void *temp = malloc(…);
    if (rank == SRC_DEV) {
        clEnqueueReadBuffer(cq, bufferA, …, temp, …);
        MPI_Send(temp, …, DST_DEV, …);
    } else if (rank == DST_DEV) {
        MPI_Recv(temp, …, SRC_DEV, …);
        clEnqueueWriteBuffer(cq, bufferB, …, temp, …);
    }
    …
    MPI_Finalize();

SnuCL (OpenCL only):

    …
    cl_mem bufferA = clCreateBuffer(…);
    cl_mem bufferB = clCreateBuffer(…);
    …
    clEnqueueCopyBuffer(cq, bufferA, bufferB, …);
    …
How to Achieve the Single System Image?
SnuCL runtime provides the illusion
  Mapping components between the OpenCL platform and underlying hardware resources
Source-to-source kernel restructuring techniques
  OpenCL C to C for CPUs
Buffer management techniques
  Efficient node-to-node data transfer
  Consistency management
[Figure: software stack. The OpenCL application runs on the SnuCL runtime, which reaches x86 CPUs through the SnuCL OpenCL-to-C compiler, NVIDIA GPUs through the NVIDIA OpenCL runtime, and AMD GPUs through the AMD OpenCL runtime]
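The slides do not show what the OpenCL-C-to-C translation produces. A minimal sketch of one common approach, assumed here for illustration only, is to turn the work-items of a work-group into iterations of an ordinary loop, so that one C function call executes a whole work-group on a CPU core. The kernel `scale`, the type `ndrange_1d`, and the functions `scale_work_group` and `run_scale` are all hypothetical names and are not taken from SnuCL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a 1-D index space. */
typedef struct {
    size_t global_size;  /* total number of work-items  */
    size_t local_size;   /* work-items per work-group   */
} ndrange_1d;

/* A kernel such as
 *     __kernel void scale(__global float *x, float f)
 *     { size_t i = get_global_id(0); x[i] = x[i] * f; }
 * rewritten as plain C: the loop enumerates the work-items of one
 * work-group, and the global ID is rebuilt from group ID and local ID. */
void scale_work_group(size_t group_id, ndrange_1d r, float *x, float f)
{
    for (size_t local_id = 0; local_id < r.local_size; local_id++) {
        size_t global_id = group_id * r.local_size + local_id;
        x[global_id] = x[global_id] * f;
    }
}

/* Runs every work-group in turn. A CPU compute device could instead hand
 * different work-groups to different cores. */
void run_scale(ndrange_1d r, float *x, float f)
{
    assert(r.local_size > 0 && r.global_size % r.local_size == 0);
    size_t num_groups = r.global_size / r.local_size;
    for (size_t g = 0; g < num_groups; g++)
        scale_work_group(g, r, x, f);
}
```

This simple form only works for kernels without barriers; a kernel that synchronizes its work-items needs the loop split at each barrier, which the sketch does not attempt.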
SnuCL Extensions to OpenCL
SnuCL has extensions to OpenCL for copying buffers (i.e., memory objects)
Buffer-copy memory commands are often inefficient in the cluster environment, depending on the access pattern
  The extensions are similar to MPI collective communication operations

SnuCL                           MPI equivalent
clEnqueueAlltoAllBuffer         MPI_Alltoall
clEnqueueBroadcastBuffer        MPI_Bcast
clEnqueueScatterBuffer          MPI_Scatter
clEnqueueGatherBuffer           MPI_Gather
clEnqueueAllGatherBuffer        MPI_Allgather
clEnqueueReduceBuffer           MPI_Reduce
clEnqueueAllReduceBuffer        MPI_Allreduce
clEnqueueReduceScatterBuffer    MPI_Reduce_scatter
clEnqueueScanBuffer             MPI_Scan
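The argument lists of the SnuCL collective commands are not given in these slides, so they are not reproduced here. To show the kind of access pattern involved, the following plain-C sketch emulates the data movement of an all-to-all exchange, the pattern of MPI_Alltoall, among n buffers held in ordinary host memory. The function `alltoall_emulated` is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Emulates an all-to-all exchange among n participants. Every source and
 * destination buffer holds n blocks of `block` floats. After the call,
 * block i of dst[j] contains what was block j of src[i]. */
void alltoall_emulated(size_t n, size_t block,
                       float *const src[], float *const dst[])
{
    assert(n > 0 && block > 0);
    for (size_t i = 0; i < n; i++)          /* sender   */
        for (size_t j = 0; j < n; j++)      /* receiver */
            memcpy(dst[j] + i * block,
                   src[i] + j * block,
                   block * sizeof(float));
}
```

Expressed with pairwise buffer copies, this pattern takes n × n separate copy commands, which suggests why a single collective command can be handled more efficiently by the runtime.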
Matrix Multiplication Performance
SnuCL for CPU Devices in the Cluster
The kernel code is portable across different types of compute devices
The OpenCL application for GPUs will run on CPU devices with SnuCL
Replace CL_DEVICE_TYPE_GPU with CL_DEVICE_TYPE_CPU
SNU NPB Suite
Most of the applications in NAS Parallel Benchmarks (NPB 3.3) are implemented in C, OpenMP C, and OpenCL [IISWC ’11]
  NPB-SER-C: a serial C version of the NPB code
  NPB-OMP-C: an OpenMP C version of the NPB code
  NPB-OCL: an OpenCL version of the NPB code for a single device
  NPB-OCL-MD: an OpenCL version of the NPB code for multiple OpenCL compute devices
Source code is publicly available
  http://aces.snu.ac.kr
Applications

Application     Source   Description               Input                                                Global memory size (MB)  Extensions used
BinomialOption  AMD      Binomial option pricing   65504 or 2097152 samples, 512 steps, 100 iterations  2.0 or 64.0
BlackScholes    PARSEC   Black-Scholes PDE         33538048 options, 100 iterations                     895.6
BT              NAS      Block tridiagonal solver  Class C or Class D                                   1982.1 or 30686.7
CG              NAS      Conjugate gradient        Class C or Class D                                   1102.6 or 20399.1
CP              Parboil  Coulombic potential       16384x16384, 1000 atoms                              4.1
EP              NAS      Embarrassingly parallel   Class D                                              0.8
FT              NAS      3-D FFT PDE               Class B or Class C                                   2816.0 or 11264.0        AlltoAll
MatrixMul       NVIDIA   Matrix multiplication     10752x10752 or 16384x16384                           1323.0 or 3072.0         Broadcast
MG              NAS      Multigrid                 Class C or Class D                                   3575.3 or 28343.7
Nbody           NVIDIA   N-Body simulation         1048576 bodies                                       64.0
SP              NAS      Pentadiagonal solver      Class C or Class D                                   1477.9 or 19974.4
Speedup (over 1 CPU core)
CPU devices only (11 cores in a CPU device), 1 CPU device per node
  SnuCL-Static: uses static scheduling for the kernel workload distribution
  The numbers on the x-axis represent the number of CPU compute devices
GPU devices only (4 GPU devices per node)
  The numbers on the x-axis represent the number of GPU compute devices
[Figure: three speedup bar charts. (1) BinomialOption, BlackScholes, CP, EP.D, MatrixMul, and Nbody on 1 to 32 devices, y-axis from 0 to 6000, with one off-scale bar labeled 9515. (2) BT.C, CG.C, FT.B, MG.C, and SP.C on up to 36 devices, y-axis from 0 to 16. (3) All eleven applications on up to 9 devices, y-axis from 0 to 50, comparing SnuCL-Static with SnuCL, with off-scale bars labeled 76, 82 and 70, 73]
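The slide names static scheduling without describing it. One common meaning, assumed here purely for illustration, is that the work-groups of a kernel are divided into fixed contiguous ranges before execution starts, one range per worker (for example, per CPU core). The type `wg_range` and the function `static_partition` are hypothetical and not part of SnuCL.

```c
#include <assert.h>
#include <stddef.h>

/* A contiguous range of work-group IDs: [first, first + count). */
typedef struct {
    size_t first;
    size_t count;
} wg_range;

/* Static block partition of num_groups work-groups over num_workers
 * workers. Every worker gets num_groups / num_workers work-groups, and
 * the first (num_groups % num_workers) workers get one extra. */
wg_range static_partition(size_t num_groups, size_t num_workers, size_t w)
{
    assert(num_workers > 0 && w < num_workers);
    size_t base  = num_groups / num_workers;
    size_t extra = num_groups % num_workers;
    wg_range r;
    r.count = base + (w < extra ? 1 : 0);
    r.first = w * base + (w < extra ? w : extra);
    return r;
}
```

Because the ranges are fixed up front, a worker that finishes early cannot take over work from a slower one, which is the usual trade-off against dynamic scheduling.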
SnuCL vs. MPI-Fortran
Speedup over a single compute node (a CPU compute device with 4 CPU cores), normalized to MPI-Fortran
MPI-Fortran: the unmodified original MPI-Fortran versions from NPB
The numbers on the x-axis represent the number of compute nodes (CPU compute devices)
[Figure: speedup bar chart comparing MPI-Fortran and SnuCL for BinomialOption, BlackScholes, BT.D, CG.D, CP, EP.D, FT.C, MatrixMul, MG.D, Nbody, and SP.D on up to 256 compute nodes; y-axis ticks at 0, 1, 4, 16, 64, and 256]
Future Directions
Scalability for more than 1000 nodes
Achieving a single compute device image for multiple heterogeneous devices in a heterogeneous CPU/GPU cluster
Autotuning
  To make performance portable across heterogeneous devices
Intelligent load balancing between multiple heterogeneous compute devices
References
[ICS ’12] Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee. SnuCL: an OpenCL Framework for Heterogeneous CPU/GPU Clusters, ICS ’12: Proceedings of the 26th International Conference on Supercomputing, San Servolo Island, Venice, Italy, June 2012.
[IISWC ’11] Sangmin Seo, Gangwon Jo, and Jaejin Lee. Performance Characterization of the NAS Parallel Benchmarks in OpenCL, IISWC ’11: Proceedings of the 2011 IEEE International Symposium on Workload Characterization, Austin, Texas, USA, November 2011.
[PACT ’11] Jun Lee, Jungwon Kim, Junghyun Kim, Sangmin Seo, and Jaejin Lee. An OpenCL Framework for Homogeneous Manycores with no Hardware Cache Coherence, PACT ’11: Proceedings of the 20th ACM/IEEE/IFIP International Conference on Parallel Architectures and Compilation Techniques, Galveston Island, Texas, USA, October 2011.
References (contd.)
[LCPC ’11] Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee. OpenCL as a Programming Model for GPU Clusters, LCPC ’11: Proceedings of the 24th International Workshop on Languages and Compilers for Parallel Computing, Fort Collins, Colorado, USA, September 2011.
[PPoPP ’11] Jungwon Kim, Honggyu Kim, Joo Hwan Lee, and Jaejin Lee. Achieving a Single Compute Device Image in OpenCL for Multiple GPUs, PPoPP ’11: Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 277-288, San Antonio, Texas, USA, February 2011, DOI: 10.1145/1941553.1941591.
[PACT ’10] Jaejin Lee, Jungwon Kim, Sangmin Seo, Seungkyun Kim, Jungho Park, Honggyu Kim, Thanh Tuan Dao, Yongjin Cho, Sung Jong Seo, Seung Hak Lee, Seung Mo Cho, Hyo Jung Song, Sang-Bum Suh, and Jong-Deok Choi. An OpenCL Framework for Heterogeneous Multicores with Local Memory, PACT ’10: Proceedings of the 19th ACM/IEEE/IFIP International Conference on Parallel Architectures and Compilation Techniques, pp. 193-204, Vienna, Austria, September 2010, DOI: 10.1145/1854273.1854301.
Contributors to SnuCL
Jungwon Kim
Sangmin Seo
Jun Lee
Jeongho Nah
Gangwon Jo
Jaejin Lee
SnuCL is publicly available at http://aces.snu.ac.kr