PARALLELIZATION OF GCCG SOLVER
Programming of Supercomputers, Team 03
Roberto Camacho, Miroslava Slavcheva
Nov 27, 2015
INTRODUCTION
o Fire benchmark suite
o GCCG solver
  • Generalized orthomin solver with diagonal scaling
  • Based on a linearized continuity equation with:
    BP: [known] boundary pole coefficients
    BE, BS, etc.: [known] boundary cell coefficients
    SU: [known] source values
    VAR: [unknown] variation vector/flow to be transported
  • While residual > tolerance (see the sketch below):
    - Compute the new directional values (direc) from the old ones
    - Normalize and update values
    - Compute the new residual
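As an illustration of the first step of this loop, a hedged C sketch of the directional-value update. The coefficient names follow the legend above; the exact neighbor ordering in LCC and the loop bounds are assumptions made for illustration, not the benchmark's exact code.

/* Hedged sketch of the directional-value update inside the GCCG loop.
 * lcc[nc][k] is assumed to be the k-th neighbor of cell nc; the neighbor
 * ordering and the bounds nintci..nintcf are illustrative assumptions. */
void update_direc(int nintci, int nintcf, int **lcc,
                  double *bp, double *bs, double *be, double *bn,
                  double *bw, double *bl, double *bh,
                  double *direc1, double *direc2) {
    for (int nc = nintci; nc <= nintcf; nc++) {
        direc2[nc] = bp[nc] * direc1[nc]
                   - bs[nc] * direc1[lcc[nc][0]]
                   - be[nc] * direc1[lcc[nc][1]]
                   - bn[nc] * direc1[lcc[nc][2]]
                   - bw[nc] * direc1[lcc[nc][3]]
                   - bl[nc] * direc1[lcc[nc][4]]
                   - bh[nc] * direc1[lcc[nc][5]];
    }
    /* Normalization, the update of VAR and the residual vector, and the new
     * residual norm follow in the remaining steps of the while loop. */
}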
PROGRAMMING OF SUPERCOMPUTERS LAB FLOW
o Sequential optimization of the GCCG solver
o Definition of performance objectives for parallelization
o Parallelization:
  • Domain decomposition into volume cells
  • Definition of a communication model
  • Implementation using MPI
o Performance analysis and tuning
SEQUENTIAL OPTIMIZATION
o Performance metrics via the PAPI library (see the sketch below):
  • Execution time, total Mflops, L2 and L3 cache miss rates
o Compiler optimization flags:
  • -g, -O0, -O1, -O2, -O3, -O3p
o ASCII vs. binary input files
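A minimal sketch of how such PAPI counters could be read around the solver; this is illustrative only, not the lab's actual instrumentation, and assumes the L2/L3 preset events are available on the machine.

#include <stdio.h>
#include <papi.h>

int main(void) {
    int eventset = PAPI_NULL;
    long long counters[4];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_L2_TCM);   /* L2 total cache misses   */
    PAPI_add_event(eventset, PAPI_L2_TCA);   /* L2 total cache accesses */
    PAPI_add_event(eventset, PAPI_L3_TCM);   /* L3 total cache misses   */
    PAPI_add_event(eventset, PAPI_L3_TCA);   /* L3 total cache accesses */

    long long t0 = PAPI_get_real_usec();
    PAPI_start(eventset);

    /* ... run the GCCG computation here ... */

    PAPI_stop(eventset, counters);
    long long t1 = PAPI_get_real_usec();

    printf("time: %lld us, L2 miss rate: %.3f, L3 miss rate: %.3f\n",
           t1 - t0,
           (double)counters[0] / counters[1],
           (double)counters[2] / counters[3]);
    return 0;
}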
BENCHMARK PARALLELIZATION
DATA DISTRIBUTION
o Main objective: avoid data replication
o Implemented distributions: classic, METIS dual, METIS nodal
o Problem:
  • The application is for irregular geometries
  • Indirect addressing (LCC)
o METIS: graph partitioning tool (see the sketch after the figure)
  • Dual approach: elements -> vertices; balanced computational time
  • Nodal approach: nodes -> vertices; "optimal neighborhood"
[Figure: classic distribution (elements distributed evenly, max difference of 1 element) vs. METIS dual distribution, shown across processors 0-4]
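A hedged sketch of how the dual partitioning could be obtained, assuming the METIS 5.x mesh-partitioning API and a hexahedral mesh (8 nodes per cell, face-neighbors share 4 nodes); the array names are illustrative.

#include <stdlib.h>
#include <metis.h>

/* Partition a hexahedral mesh of ne elements and nn nodes into nparts parts
 * using the dual approach; elems[8*i .. 8*i+7] are the node ids of element i. */
void partition_dual(idx_t ne, idx_t nn, idx_t *elems, idx_t nparts,
                    idx_t *epart, idx_t *npart) {
    idx_t *eptr = malloc((ne + 1) * sizeof(idx_t));
    for (idx_t i = 0; i <= ne; i++) eptr[i] = 8 * i;   /* 8 nodes per hex cell */

    idx_t ncommon = 4;   /* two hexahedra are neighbors if they share a face */
    idx_t objval;

    METIS_PartMeshDual(&ne, &nn, eptr, elems, NULL, NULL, &ncommon, &nparts,
                       NULL, NULL, &objval, epart, npart);
    /* epart[i] now holds the rank assigned to element i. The nodal variant is
     * METIS_PartMeshNodal(&ne, &nn, eptr, elems, NULL, NULL, &nparts,
     *                     NULL, NULL, &objval, epart, npart). */
    free(eptr);
}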
DATA DISTRIBUTION
Global-to-local mapping (example: rank 2)
o Global-to-local map: 0 -> X, 1 -> X, ..., 8 -> 0, 9 -> 1, 10 -> 2, 11 -> 3, 12 -> 4, 13 -> X, 14 -> 5
o Neighbors taken from LCC (e.g. 9, 12, 14) are mapped to local indices via a lookup in this map
Local-to-global mapping (sketched below)
o Internal elements: 0 -> 8, 1 -> 9, 2 -> 10, 3 -> 11
o External elements: 4 -> 12, 5 -> 14
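A hedged C sketch of how these two maps could be built from a partitioning array; the names (epart, global_to_local, local_to_global) are illustrative, and the handling of external (ghost) neighbors is only indicated in a comment.

#include <stdlib.h>

/* epart[g] is the rank owning global element g (e.g. from METIS). For one rank
 * we build local_to_global for its internal elements and a global_to_local
 * lookup table, with -1 (the "X" above) marking non-local elements. */
void build_maps(int nglobal, const int *epart, int myrank,
                int **local_to_global_out, int **global_to_local_out,
                int *nlocal_out) {
    int *global_to_local = malloc(nglobal * sizeof(int));
    int *local_to_global = malloc(nglobal * sizeof(int));
    int nlocal = 0;

    for (int g = 0; g < nglobal; g++) {
        if (epart[g] == myrank) {
            global_to_local[g] = nlocal;     /* e.g. 8 -> 0, 9 -> 1, ... */
            local_to_global[nlocal] = g;     /* e.g. 0 -> 8, 1 -> 9, ... */
            nlocal++;
        } else {
            global_to_local[g] = -1;         /* not local ("X") */
        }
    }
    /* External (ghost) neighbors found in LCC would then be appended with
     * local indices nlocal, nlocal+1, ... (e.g. 12 -> 4, 14 -> 5). */
    *local_to_global_out = local_to_global;
    *global_to_local_out = global_to_local;
    *nlocal_out = nlocal;
}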
DATA DISTRIBUTION
Two strategies for distributing the input:
o Distribution by master process: rank 0 reads the input file, computes the distribution, and transmits it to ranks 1-4
o Distribution computed by each process: every rank reads the input file and computes its own part of the distribution
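A minimal sketch of the master-process variant, assuming the whole element-to-rank assignment fits in a single array that can be broadcast; the routine names in the comment are hypothetical.

#include <mpi.h>

/* Rank 0 reads the input and computes the element-to-rank assignment, then
 * transmits it to all other ranks. In the alternative strategy, every rank
 * reads the file and computes the distribution itself, with no communication. */
void distribute_from_master(int nglobal, int *epart, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {
        /* read_input_file(...) and compute_distribution(epart, ...) would run here */
    }
    /* every rank receives the full distribution array */
    MPI_Bcast(epart, nglobal, MPI_INT, 0, comm);
}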
COMMUNICATION MODEL
o Main objective: optimize communication
o Ghost cells
o Communication lists:
  • Send
  • Receive
o Synchronization between lists
COMMUNICATION MODEL
Send list as a 2D array:
o One row per rank (0-4), each sized for the max #elements in one processor
o Total storage: N * max_loc_elems
Send list as a 1D array:
o An index array must be created: Idx(0) = 0, Idx(1) = Idx(0) + Send_count(0), ...
o Send_count already exists
o Total storage: Total_neighbors + N
(see the index-array sketch below)
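A short C sketch of building that index array from the existing send counts; names are illustrative. These indices are exactly the per-rank displacements that a collective exchange such as MPI_Alltoallv expects.

/* Build the index (displacement) array for the flattened 1D send list,
 * following the recurrence Idx(i) = Idx(i-1) + Send_count(i-1). */
int build_send_index(int nprocs, const int *send_count, int *idx) {
    idx[0] = 0;
    for (int p = 1; p < nprocs; p++)
        idx[p] = idx[p - 1] + send_count[p - 1];
    /* total number of entries in the flattened send list */
    return idx[nprocs - 1] + send_count[nprocs - 1];
}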
COMMUNICATION MODEL: MPI_ALLTOALLV
[Diagram: the per-node send lists of nodes 0-4 are exchanged with MPI_Alltoallv so that each node obtains its receive lists from all other nodes]
MPI IMPLEMENTATION
o Main loop: compute_solution
o Collective communication
  • Low overhead
  • No deadlock
o direc1 transmission
  • Indirect addressing
  • Includes ghost cells and external cells
  • Uses communication lists and mappings
Data flow: direc1 -> send_list -> send_buffer -> MPI_Alltoallv -> recv_buffer -> recv_list -> direc1 (see the sketch below)
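A hedged C sketch of that exchange; the buffer and list names are illustrative, and the count/displacement arrays are assumed to follow the 1D layout from the previous slide.

#include <mpi.h>

/* Exchange of the ghost-cell values of direc1: pack from the send list,
 * perform one collective MPI_Alltoallv, unpack via the receive list. */
void exchange_direc1(double *direc1,
                     int *send_list, int *send_count, int *sdispls,
                     int *recv_list, int *recv_count, int *rdispls,
                     int total_send, int total_recv,
                     double *send_buffer, double *recv_buffer, MPI_Comm comm) {
    /* pack: gather the direc1 values that each neighbor rank needs */
    for (int i = 0; i < total_send; i++)
        send_buffer[i] = direc1[send_list[i]];

    /* one collective exchange: low overhead, no deadlock */
    MPI_Alltoallv(send_buffer, send_count, sdispls, MPI_DOUBLE,
                  recv_buffer, recv_count, rdispls, MPI_DOUBLE, comm);

    /* unpack: scatter received values into the ghost/external cells of direc1 */
    for (int i = 0; i < total_recv; i++)
        direc1[recv_list[i]] = recv_buffer[i];
}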
MPI IMPLEMENTATION
o MPI_Allreduce with MPI_IN_PLACE
[Diagram: each processor (0-3) computes a local residual; MPI_Allreduce with MPI_IN_PLACE sums them so that every processor holds the global residual (res_updated)]
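A minimal sketch of this reduction in C; only the variable names are assumptions.

#include <mpi.h>

/* Each rank passes its local residual contribution; the in-place MPI_Allreduce
 * leaves the global sum in the same variable on every rank. */
double global_residual(double local_residual, MPI_Comm comm) {
    double res_updated = local_residual;
    MPI_Allreduce(MPI_IN_PLACE, &res_updated, 1, MPI_DOUBLE, MPI_SUM, comm);
    return res_updated;   /* identical on all processors */
}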
FINALIZATION
o Added to verify results
o Processor 0 does all the work
o Imbalanced load
o High communication (see the sketch below):
  • Data
  • Local-to-global map
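A hedged sketch of how this finalization gather could look: every rank sends both its local VAR values and its local-to-global map to rank 0, which reassembles the global solution for verification. The function and buffer names are illustrative, and the counts/displacements on rank 0 are assumed to be known.

#include <mpi.h>

void gather_var(double *var_local, int nlocal, int *local_to_global,
                double *var_global, double *recv_buf, int *map_buf,
                int *recv_counts, int *displs, int ntotal, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* gather the data and the local-to-global maps of all ranks on rank 0 */
    MPI_Gatherv(var_local, nlocal, MPI_DOUBLE,
                recv_buf, recv_counts, displs, MPI_DOUBLE, 0, comm);
    MPI_Gatherv(local_to_global, nlocal, MPI_INT,
                map_buf, recv_counts, displs, MPI_INT, 0, comm);

    if (rank == 0) {
        /* place each received value at its global position */
        for (int i = 0; i < ntotal; i++)
            var_global[map_buf[i]] = recv_buf[i];
    }
}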
PERFORMANCE ANALYSIS AND TUNING
SPEEDUP
o Speedup objective: linear speedup with the METIS dual distribution on 1-8 processors
o PAPI measurements with up to 8 processors:
  • Linear (super-linear) speedup with the METIS distributions
o Scalasca/Periscope/Vampir/Cube measurements and visualization (code/makefile instrumentation) with up to 64 processors:
  • Speedup plateaus
LOAD IMBALANCE
o Minimal load imbalance in the computation phase by design
o Severe load imbalance in the finalization phase: the root process does all the work
MPI OVERHEAD
o Acceptance criterion: MPI overhead (excluding MPI_Init() and MPI_Finalize()) below 25% on the pent input with 4 processors
o Overhead plateaus
CONCLUSION
o No severe load imbalance or data replication
o Collective operations used to achieve the performance goals
o The METIS distributions lead to better performance than the classic distribution
o The optimal number of processors (best speedup, least communication overhead) depends on:
  • the distribution
  • the input file size