Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
{ndros,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
• 3 programming models for the parallelization of nested loop algorithms:
    – pure MPI
    – fine-grain hybrid MPI-OpenMP
    – coarse-grain hybrid MPI-OpenMP
• Advanced hyperplane scheduling:
    – minimize synchronization need
    – overlap computation with communication
Introduction
Algorithmic Model:
FOR j_0 = min_0 TO max_0 DO
    ...
    FOR j_{n-1} = min_{n-1} TO max_{n-1} DO
        Computation(j_0, ..., j_{n-1});
    ENDFOR
    ...
ENDFOR
• Perfectly nested loops
• Constant flow data dependencies
Introduction
Target Architecture: SMP clusters
Overview
• Introduction
• Pure MPI Model
• Hybrid MPI-OpenMP Models
    – Hyperplane Scheduling
    – Fine-grain Model
    – Coarse-grain Model
• Experimental Results
• Conclusions – Future Work
Pure MPI Model
• Tiling transformation groups iterations into atomic execution units (tiles)
• Pipelined execution
• Overlapping computation with communication
• Makes no distinction between inter-node and intra-node communication
Pure MPI Model
Example:
FOR j1 = 0 TO 9 DO
    FOR j2 = 0 TO 7 DO
        A[j1, j2] := A[j1-1, j2] + A[j1, j2-1];
    ENDFOR
ENDFOR
Flow dependence matrix: D = {(1,0), (0,1)}
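To make the tiling transformation concrete, here is a minimal sequential C sketch of the example above. The tile sizes T1 x T2 are illustrative assumptions, and the iteration space is shifted by one so that boundary values live in row 0 and column 0 (the slides do not specify boundary handling):

#include <stdio.h>

#define X 10            /* j1 ranges over 10 values, as in the example */
#define Y 8             /* j2 ranges over 8 values */
#define T1 4            /* illustrative tile sizes */
#define T2 4
#define MIN(a, b) ((a) < (b) ? (a) : (b))

double A[X + 1][Y + 1]; /* row 0 and column 0 hold boundary values */

int main(void)
{
    /* arbitrary boundary initialization for the sketch */
    for (int j1 = 0; j1 <= X; j1++) A[j1][0] = 1.0;
    for (int j2 = 0; j2 <= Y; j2++) A[0][j2] = 1.0;

    /* each T1 x T2 tile is an atomic execution unit; visiting tiles in
       lexicographic order respects the (1,0) and (0,1) dependencies */
    for (int jj1 = 1; jj1 <= X; jj1 += T1)
        for (int jj2 = 1; jj2 <= Y; jj2 += T2)
            for (int j1 = jj1; j1 <= MIN(jj1 + T1 - 1, X); j1++)
                for (int j2 = jj2; j2 <= MIN(jj2 + T2 - 1, Y); j2++)
                    A[j1][j2] = A[j1 - 1][j2] + A[j1][j2 - 1];

    printf("A[%d][%d] = %g\n", X, Y, A[X][Y]);
    return 0;
}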
Pure MPI Model
[Figure: iteration space (j1, j2) of the example mapped onto 4 MPI processes – two SMP nodes (NODE0, NODE1) with two CPUs each (CPU0, CPU1)]
Pure MPI Model
tile_0 = nod_0; ...; tile_{n-2} = nod_{n-2};
FOR tile_{n-1} = 0 TO #tiles_{n-1} - 1 DO
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod)); MPI_Irecv(recv_buf, src(nod));
    Compute(tile);
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, nod);
ENDFOR
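In concrete MPI terms, one pipeline step might look like the following C sketch. The helpers pack_tile, unpack_tile and compute_tile are hypothetical; prev and next are the pipeline neighbors, and passing MPI_PROC_NULL at the pipeline ends turns the corresponding transfers into no-ops:

#include <mpi.h>

void pack_tile(double *buf, int tile);     /* hypothetical helpers for      */
void unpack_tile(double *buf, int tile);   /* boundary packing/unpacking    */
void compute_tile(int tile);               /* and the tile computation      */

/* One step of the pipelined execution: boundary data of the previous tile
   is sent downstream and data needed by the next tile is received upstream
   while the current tile is being computed. */
void pipeline_step(int tile, double *snd_buf, double *rcv_buf, int buf_len,
                   int prev, int next, MPI_Comm comm)
{
    MPI_Request req[2];

    pack_tile(snd_buf, tile - 1);
    MPI_Isend(snd_buf, buf_len, MPI_DOUBLE, next, 0, comm, &req[0]);
    MPI_Irecv(rcv_buf, buf_len, MPI_DOUBLE, prev, 0, comm, &req[1]);

    compute_tile(tile);                    /* overlaps with the transfers */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    unpack_tile(rcv_buf, tile + 1);
}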
Hybrid MPI-OpenMP Models
Hyperplane Scheduling
• Implements coarse-grain parallelism assuming inter-tile data dependencies
• Tiles are organized into data-independent subsets (groups)
• Tiles of the same group can be concurrently executed by multiple threads
• Barrier synchronization between threads
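A minimal OpenMP sketch of this scheduling, assuming hypothetical helpers tile_of (maps a group/thread pair to a candidate tile) and tile_is_valid (rejects tiles outside the iteration space):

#include <omp.h>

int tile_of(int group, int thread);        /* hypothetical mapping helpers */
int tile_is_valid(int tile);
void compute_tile(int tile);

void execute_groups(int num_groups)
{
    #pragma omp parallel
    {
        int th = omp_get_thread_num();

        for (int g = 0; g < num_groups; g++) {
            /* tiles within group g are data-independent, so each thread
               may execute its own tile concurrently with the others */
            int tile = tile_of(g, th);
            if (tile_is_valid(tile))
                compute_tile(tile);

            /* group g must be complete before group g+1 starts */
            #pragma omp barrier
        }
    }
}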
Fine-grain Model
• Incremental parallelization of computationally intensive parts
• Relatively straightforward from pure MPI
• Threads are (re)spawned at each computation phase
• Inter-node communication outside of the multi-threaded part
• Thread synchronization through the implicit barrier of the omp parallel directive
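A fine-grain sketch under the same illustrative assumptions as before (prev/next pipeline neighbors; compute_row is a hypothetical per-row kernel): communication stays single-threaded, and a parallel region is opened anew around every tile's computation:

#include <mpi.h>

void compute_row(int tile, int row);       /* hypothetical per-row kernel */

void fine_grain_tile_loop(int num_tiles, int rows_per_tile,
                          double *snd_buf, double *rcv_buf, int buf_len,
                          int prev, int next, MPI_Comm comm)
{
    for (int t = 0; t < num_tiles; t++) {
        MPI_Request req[2];

        /* inter-node communication is issued outside the threaded part */
        MPI_Isend(snd_buf, buf_len, MPI_DOUBLE, next, 0, comm, &req[0]);
        MPI_Irecv(rcv_buf, buf_len, MPI_DOUBLE, prev, 0, comm, &req[1]);

        /* threads are (re)spawned here for the computation only; the
           implicit barrier at the end of the parallel region provides
           the thread synchronization */
        #pragma omp parallel for
        for (int i = 0; i < rows_per_tile; i++)
            compute_row(t, i);

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }
}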
Coarse-grain Model
• SPMD paradigm
• Requires more programming effort
• Threads are only spawned once
• Inter-node communication inside the multi-threaded part (requires MPI_THREAD_MULTIPLE)
• Thread synchronization through an explicit barrier (omp barrier directive)
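A coarse-grain sketch, where exchange_boundaries and compute_my_tile are hypothetical helpers and the overlap of communication with computation is omitted for brevity:

#include <mpi.h>
#include <omp.h>

void exchange_boundaries(int group);       /* hypothetical: Isend/Irecv/Waitall */
void compute_my_tile(int group, int thread);

int main(int argc, char **argv)
{
    int provided, num_groups = 64;         /* illustrative group count */

    /* MPI is called from inside the parallel region, so the required
       thread support level must be requested at initialization */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    #pragma omp parallel                   /* threads are spawned only once */
    {
        for (int g = 0; g < num_groups; g++) {
            #pragma omp master             /* communication assumed by the */
            exchange_boundaries(g);        /* master thread                */

            #pragma omp barrier            /* received data must be in place */

            compute_my_tile(g, omp_get_thread_num());

            #pragma omp barrier            /* explicit inter-group sync */
        }
    }

    MPI_Finalize();
    return 0;
}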
Summary: Fine-grain vs Coarse-grain
• Fine-grain: threads are re-spawned at each parallel region; Coarse-grain: threads are only spawned once.
• Fine-grain: inter-node MPI communication outside of the multi-threaded region; Coarse-grain: inter-node MPI communication inside the multi-threaded region, assumed by the master thread.
• Fine-grain: intra-node synchronization through the implicit barrier of omp parallel; Coarse-grain: intra-node synchronization through an explicit OpenMP barrier.
Experimental Results
• 8-node SMP Linux cluster (800 MHz Pentium III, 128 MB RAM, kernel 2.4.20)
• MPICH v1.2.5 (--with-device=ch_p4, --with-comm=shared)
• Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static)
• FastEthernet interconnection
• ADI micro-kernel benchmark (3D)
Alternating Direction Implicit (ADI)
• Unitary data dependencies
• 3D iteration space (X x Y x Z)
[Figure: 3D iteration space with axes X, Y, Z]
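The exact micro-kernel is not reproduced in the slides; a schematic C sweep with unitary dependencies along all three dimensions would look like this (sizes and coefficients are illustrative assumptions):

#define NX 64                    /* illustrative sizes; the experiments go */
#define NY 64                    /* up to 512 x 512 x 8192                 */
#define NZ 64

double A[NX][NY][NZ];

void adi_like_sweep(void)
{
    /* every interior point depends on its immediate predecessor along
       each axis: dependence vectors (1,0,0), (0,1,0), (0,0,1) */
    for (int i = 1; i < NX; i++)
        for (int j = 1; j < NY; j++)
            for (int k = 1; k < NZ; k++)
                A[i][j][k] += A[i-1][j][k] + A[i][j-1][k] + A[i][j][k-1];
}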
ADI – 4 nodes
[Figure: X-Y decomposition on 4 nodes – Pure MPI vs Hybrid (MPI across nodes, OpenMP within nodes)]
ADI – 4 nodes
[Figure: Pure MPI vs Hybrid decomposition for the X < Y and X > Y cases]
ADI X=512 Y=512 Z=8192 – 4 nodes [performance chart]
ADI X=128 Y=512 Z=8192 – 4 nodes [performance chart]
ADI X=512 Y=128 Z=8192 – 4 nodes [performance chart]
ADI – 2 nodes
[Figure: X-Y decomposition on 2 nodes – Pure MPI vs Hybrid (MPI across nodes, OpenMP within nodes)]
ADI – 2 nodes
[Figure: Pure MPI vs Hybrid decomposition for the X < Y and X > Y cases]
ADI X=128 Y=512 Z=8192 – 2 nodes [performance chart]
ADI X=256 Y=512 Z=8192 – 2 nodes [performance chart]
ADI X=512 Y=512 Z=8192 – 2 nodes [performance chart]
ADI X=512 Y=256 Z=8192 – 2 nodes [performance chart]
ADI X=512 Y=128 Z=8192 – 2 nodes [performance chart]
ADI X=128 Y=512 Z=8192 – 2 nodes [computation/communication breakdown chart]
ADI X=512 Y=128 Z=8192 – 2 nodes [computation/communication breakdown chart]
Conclusions
• Nested loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm
• Hybrid models can be competitive with the pure MPI paradigm
• The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated
• Programming efficiently in OpenMP is not easier than programming efficiently in MPI
Future Work
• Application of the methodology to real applications and benchmarks
• Work balancing for the coarse-grain model
• Performance evaluation on advanced interconnection networks (SCI, Myrinet)
• Generalization as a compiler technique