Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications

Ziming Zhong, Vladimir Rychkov, Alexey Lastovetsky
Heterogeneous Computing Laboratory, University College Dublin, Ireland

Cluster 2012
Model-based Data Partitioning
Motivation

Hybrid GPU-accelerated parallel computers
- Higher power efficiency, performance/price ratio, etc.
- Successfully applied to bioinformatics, astrophysics, molecular dynamics, computational fluid dynamics, etc.

(Figures: hybrid clusters; hybrid multicore & multi-GPU system)
Hybrid multicore and multi-GPU systems present challenges:
- Parallel programming is hard
- Load balancing problem
- Heterogeneity: processor, memory, etc.
- Hierarchical levels of parallelism: node, socket, core, etc.
- and others
Leveraging Hybrid Multicore/GPU Systems

In this work, we target:
- Data-parallel applications
  - Divisible computational workload
  - Workload proportional to data size
  - Dependent on data locality
- Dedicated hybrid systems
- Reuse of the optimized software stack

Our approach:
- Treat the platform as a heterogeneous distributed-memory system
- Performance modeling of the hybrid system
- Model-based data partitioning to balance the load

(Figure: processing flow)
Data Partitioning on a Heterogeneous Platform

1. Workload is divisible and proportional to data size
2. Workload is partitioned in proportion to processor speed
3. Workload is distributed in proportion to processor speed

(Figure: the three partitioning steps)
Data partitioning relies on accurate performance models. Traditionally, performance is defined by a single constant number:

Constant Performance Model (CPM)
- Computed from clock speed or by performing a benchmark
- Computational units are partitioned as d_i = N × s_i / (s_1 + ... + s_p)
- Simplistic; algorithms may fail to converge to a balanced solution [1]
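As a sketch, CPM-based partitioning is a single proportional split. A minimal example (integer workload units; the helper name is illustrative):

```python
def cpm_partition(n, speeds):
    """Split n workload units in proportion to constant speeds s_i:
    d_i = n * s_i / sum(s_j). Integer division; the remainder goes
    to the fastest device so that the parts sum to n exactly."""
    total = sum(speeds)
    parts = [n * s // total for s in speeds]
    parts[speeds.index(max(speeds))] += n - sum(parts)
    return parts

# 1000 units over devices with relative speeds 1, 2 and 5:
print(cpm_partition(1000, [1, 2, 5]))  # [125, 250, 625]
```

The split is computed once from constants, which is exactly why it cannot react to speeds that vary with problem size.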
Functional Performance Model (FPM) [2]:
- Represents speed as a function of problem size
- Realistic
- Application centric
- Hardware specific

(Figure: speed (GFLOPS) vs. problem size (matrix elements), matrix multiplication benchmark on Grid5000-Lille; nodes chirloute-3, chimint-1, chinqchint-1, chicon-1)
[1] D. Clarke et al.: Dynamic Load Balancing of Parallel Iterative Routines on Heterogeneous HPC Platforms, 2010.
[2] A. Lastovetsky et al.: Data Partitioning with a Functional Performance Model of Heterogeneous Processors, 2007.
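In practice, an FPM can be stored as benchmark points and evaluated by interpolation. A minimal sketch (the measurements below are invented for illustration; real speed functions come from benchmarking the application kernel):

```python
import bisect

def make_fpm(points):
    """Build a piecewise-linear speed function s(d) from benchmark
    measurements given as a sorted list of (problem_size, speed) pairs."""
    sizes = [d for d, _ in points]
    speeds = [s for _, s in points]

    def s(d):
        if d <= sizes[0]:
            return speeds[0]
        if d >= sizes[-1]:
            return speeds[-1]
        i = bisect.bisect_left(sizes, d)
        # Linear interpolation between the two surrounding measurements.
        t = (d - sizes[i - 1]) / (sizes[i] - sizes[i - 1])
        return speeds[i - 1] + t * (speeds[i] - speeds[i - 1])

    return s

# Invented data: speed peaks, then degrades once the problem
# no longer fits in fast memory.
s = make_fpm([(1000, 5.0), (10000, 20.0), (100000, 12.0)])
print(s(55000))  # halfway between 10000 and 100000 -> 16.0
```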
The solution lies on a line passing through the origin: the points (d_i, s_i(d_i)) satisfy d_i / s_i(d_i) = constant.

However, FPM-based partitioning was originally designed only for heterogeneous uniprocessor clusters [2].
Data Partitioning on Hybrid System

Outline

1. Model-based Data Partitioning
2. Data Partitioning on Hybrid System
3. Experimental Results
4. Conclusions and Future Work
Performance Modeling of Hybrid System

Multicore CPUs and GPUs are modeled independently:
- Separate memory, programming models
- Represented by speed functions (FPMs)
- Benchmarked with computational kernels

Performance model of multicore:
- Approximates the speed of multiple cores, e.g. all cores in a processor except the ones dedicated to GPUs

Performance model of GPU:
- Approximates the combined speed of a GPU and its dedicated core

(Figure: processing flow)
Performance Measurement of Hybrid System

Generic measurement techniques:
- Process binding: avoid process migration
- Synchronization: ensure resources are shared between cores
- Repeated measurements: ensure statistically reliable results

However, how can processor performance be measured accurately on a hybrid system?

(Figure: hybrid multicore & multi-GPU system)
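The repetition technique can be sketched as: rerun the benchmark until the mean is statistically reliable, for example until the standard error of the mean falls below a tolerance (the kernel and thresholds below are illustrative):

```python
import time
from statistics import mean, stdev

def reliable_time(kernel, rel_err=0.05, min_reps=5, max_reps=100):
    """Repeat a timed measurement until the standard error of the mean
    drops below rel_err of the mean, or max_reps is reached."""
    times = []
    for _ in range(max_reps):
        start = time.perf_counter()
        kernel()
        times.append(time.perf_counter() - start)
        if len(times) >= min_reps:
            sem = stdev(times) / len(times) ** 0.5
            if sem < rel_err * mean(times):
                break
    return mean(times)

# Illustrative stand-in for a real computational kernel:
t = reliable_time(lambda: sum(i * i for i in range(100000)))
print(t > 0)  # True
```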
Performance measurement of multiple cores:
- Programming model: one process (thread) per core to achieve high performance
- Cores interfere with each other due to resource contention
- Performance is therefore evaluated per group
- All cores in the group execute the same amount of workload in parallel
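The group measurement can be sketched with one process per core, a barrier for a synchronized start, and the group's speed taken from the slowest member (process binding, e.g. `os.sched_setaffinity` on Linux, is omitted here for portability):

```python
import time
from multiprocessing import Barrier, Process, Queue

def _worker(barrier, queue, n):
    barrier.wait()                        # synchronized start for the group
    start = time.perf_counter()
    sum(i * i for i in range(n))          # stand-in for the real kernel
    queue.put(time.perf_counter() - start)

def group_speed(n_cores, work_per_core):
    """Run the same workload on every core in parallel; the aggregate
    speed is the total work divided by the slowest core's time."""
    barrier, queue = Barrier(n_cores), Queue()
    procs = [Process(target=_worker, args=(barrier, queue, work_per_core))
             for _ in range(n_cores)]
    for p in procs:
        p.start()
    times = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    return n_cores * work_per_core / max(times)

if __name__ == "__main__":
    print(group_speed(2, 200000) > 0)  # True
```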
Performance measurement of GPU:
- One core dedicated to the GPU, the other cores left idle
- Kernel computation time and data transfer time are both included
- Additional issue: host NUMA affects PCIe transfer throughput in dual-IOH systems
Application: Matrix Multiplication on a Heterogeneous Platform [3]

- Matrices are partitioned unevenly to achieve load balancing
- Processors are arranged so that communication is minimized
- Computational kernel: panel-panel update
  - Reuses the vendor-optimized GEMM routine
  - Computation is proportional to the area of submatrix C_i
  - Same memory access pattern as the whole application

[3] O. Beaumont et al.: Matrix Multiplication on Heterogeneous Platforms. IEEE Trans. Parallel Distrib. Syst., 2001.
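The panel-panel update itself (sketched in plain Python; the real kernel calls a vendor GEMM) accumulates C += A_panel × B_panel, so its flop count 2·m·n·b is proportional to the m × n area of the C submatrix:

```python
def panel_update(C, A_panel, B_panel):
    """C += A_panel * B_panel, with A_panel m x b and B_panel b x n.
    Matrices are lists of row lists; C is updated in place."""
    m, b, n = len(A_panel), len(B_panel), len(B_panel[0])
    for i in range(m):
        for j in range(n):
            for k in range(b):
                C[i][j] += A_panel[i][k] * B_panel[k][j]
    return C

C = [[0, 0, 0] for _ in range(4)]
panel_update(C, [[1, 1]] * 4, [[1, 1, 1]] * 2)
print(C[0])  # [2, 2, 2]: each entry accumulates b = 2 products
```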
Development of Computational Kernels

Multicore CPU:
- Use the GEMM routine from the ACML library
- Multiple processes, each running the sequential routine
- Alternative: a single process running the threaded routine

GPU accelerator:
- Use the GEMM routine from the CUBLAS library
- Develop an out-of-core kernel to overcome the device memory limitation
- Overlap data transfers with kernel execution to hide latency

Out-of-core kernel, overlapping data transfers and kernel execution:
- Five buffers allocated in device memory: A0, A1, B0, C0, C1
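The five buffers suggest double-buffering A and C while keeping B resident, so the transfer of block i+1 can overlap the GEMM on block i. A structural sketch (plain Python, sequential; a real kernel would issue the uploads asynchronously on a second CUDA stream):

```python
def out_of_core_gemm(A_blocks, B, C_blocks, upload, download, gemm):
    """Stream blocks of A and C through two device buffer pairs
    (A0/A1 and C0/C1) with B resident in B0 -- five buffers in total.
    gemm(C, A, B) returns the updated C block."""
    dev_B = upload(B)                       # B0: resident operand
    dev_A = [upload(A_blocks[0]), None]     # A0 prefetched, A1 empty
    dev_C = [upload(C_blocks[0]), None]     # C0 prefetched, C1 empty
    for i in range(len(A_blocks)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(A_blocks):
            # In CUDA, these copies would overlap the gemm below.
            dev_A[nxt] = upload(A_blocks[i + 1])
            dev_C[nxt] = upload(C_blocks[i + 1])
        dev_C[cur] = gemm(dev_C[cur], dev_A[cur], dev_B)  # C_i += A_i * B
        C_blocks[i] = download(dev_C[cur])
    return C_blocks

# Toy demo: scalar "blocks", identity transfers, gemm as C + A*B.
identity = lambda x: x
print(out_of_core_gemm([1, 2, 3], 10, [0, 0, 0],
                       identity, identity,
                       lambda C, A, B: C + A * B))  # [10, 20, 30]
```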
Column 1: block size is 640 × 640
Column 2: 24 CPU cores, homogeneous data partitioning
Column 3: 1 CPU core + 1 GPU
Column 4: 24 CPU cores + 2 GPUs, FPM-based data partitioning
Experimental Results

Computation time of each process

(Figures: computation time (sec) vs. process rank (0-24) under CPM-based and FPM-based partitioning; GPU processes: Tesla C870, GeForce GTX 680)

Matrix size 60 × 60; computation time reduced by 40%
Execution time of the application under different partitioning algorithms

Execution time reduced by 23% and 45%, respectively
Conclusions and Future Work

Conclusions:
- Defined and built functional performance models (FPMs) of a hybrid multicore and multi-GPU system, treating it as a distributed-memory system
- Adapted FPM-based data partitioning to the hybrid system, achieved load balancing, and delivered good performance

Future Work:
- Apply the approach to hybrid clusters
- Partitioning with respect to interconnect speed
Thank You!

Acknowledgments: Science Foundation Ireland, University College Dublin, Heterogeneous Computing Laboratory, China Scholarship Council
Partitioning with functional performance models

We want all devices to compute their assigned workloads d_i within the same time. The points (d_i, s_i(d_i)) then lie on a line passing through the origin, since d_i / s_i(d_i) = constant. The total problem size determines the slope. The algorithm iteratively bisects the solution space, between a lower line L with d_1^L + d_2^L + d_3^L + d_4^L < n and an upper line U with d_1^U + d_2^U + d_3^U + d_4^U > n, until it finds the values d_i with d_1 + d_2 + d_3 + d_4 = n.

(Figure: speed functions s_1(d), s_2(d), s_3(d), s_4(d); absolute speed vs. problem size; the partitioning line through the origin intersects each curve at d_i)
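The bisection can be sketched by searching on the common execution time t = d_i / s_i(d_i): each trial t yields one d_i per device from its speed curve, and t is bisected until the parts sum to n. A minimal sketch assuming d / s(d) increases with d (the constant-speed demo is illustrative; any speed functions with monotone execution time work):

```python
def fpm_partition(n, speed_fns, tol=1e-6):
    """Equal-time partitioning: find d_i with d_i / s_i(d_i) constant
    and sum(d_i) = n, by bisecting on the common execution time t."""
    def d_for_time(s, t):
        lo, hi = 0.0, float(n)          # solve d / s(d) = t for d
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if mid / s(mid) < t:
                lo = mid
            else:
                hi = mid
        return lo

    t = 1.0                             # grow an upper bound on t
    while t < 1e12 and sum(d_for_time(s, t) for s in speed_fns) < n:
        t *= 2
    t_lo, t_hi = 0.0, t
    while t_hi - t_lo > tol:            # bisect on t until parts sum to n
        t = (t_lo + t_hi) / 2
        if sum(d_for_time(s, t) for s in speed_fns) < n:
            t_lo = t
        else:
            t_hi = t
    return [d_for_time(s, t_hi) for s in speed_fns]

# Constant speeds reduce to the proportional (CPM) split:
parts = fpm_partition(100, [lambda d: 1.0, lambda d: 3.0])
print([round(p) for p in parts])  # [25, 75]
```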