Topology-aware Optimization of Communications for Parallel Matrix Multiplication on Hierarchical Heterogeneous HPC Platforms
Tania Malik, Vladimir Rychkov, Alexey Lastovetsky, Jean-Noël Quintin
Heterogeneous Computing Laboratory, University College Dublin, Ireland
Heterogeneity in Computing Workshop, Phoenix, Arizona, USA
19-25 May, 2014
Tania Malik (UCD HCL), Topology-aware Communication Optimization, IPDPS 2014
Outline
Motivation
Problem Formulation
Topology-aware Communication Optimization Approach
Cost function
Heuristic
Experiments
Conclusion
Motivation
Introduction
For efficient execution of data-parallel applications on an HPC platform:
Balance the load between processors
Optimize the communication cost
Problem Formulation
Communication Flow of Heterogeneous SUMMA
Figure: Communication flow of heterogeneous SUMMA implementing FPM-BR (ring), shown for matrices A and B
Comparison of some SUMMA-based algorithms
Table: Comparison of some SUMMA-based algorithms
Algorithm | Data partitioning | Communication vol. | Communication flow
SUMMA | homogeneous | – | broadcasts
BR | constant speeds | min | nb-p2p one-to-all
FPM-BR | speed functions | min | nb-p2p one-to-all/ring
Matrix Partitioning Algorithm
FPM-BR algorithm:
Balances the workload
Minimizes the total volume of communication
However, none of the matrix multiplication load-balancing algorithms takes the underlying network topology into account
The goal is to reduce the communication cost of the parallel application that implements the FPM-BR matrix multiplication algorithm
Approach: rearrange the existing heterogeneous data partition based on the network topology and the application's communication flow
Exhaustive Search Partitions
Performed an exhaustive search over all possible arrangements of rectangles
Found arrangements that reduced the communication cost and others that increased it
Figure: Communication-optimal arrangements
Figure: Worst-case arrangements
Observed a regularity in the communication-optimal arrangements that is related to the topology:
Rectangles were grouped by clusters
Less inter-cluster communication
Table: Exhaustive search experimental results
 | Cost (worst case) | Cost (optimal) | Exec time, sec (worst case) | Exec time, sec (optimal)
Exhaustive search | 89.80 | 73.59 | 6.00 | 2.78
Search Space Size
Column widths are different:
A rectangle cannot be moved to another column unless whole columns are interchanged
Within a column, there are no restrictions on interchanging rectangles
Let
$c$ be the number of columns
$r_i$ be the number of rectangles in column $i$, $1 \le i \le c$
The number of combinations is then the product $r_1! \times \dots \times r_c!$
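The size formula can be checked on a small case. The sketch below is only an illustration: the function names and the three-column layout with processor labels `p0` to `p7` are made up for the example, and columns are kept in place, so only permutations within a column are counted.

```python
from itertools import permutations, product
from math import factorial, prod


def search_space_size(rects_per_column):
    """Product r_1! * ... * r_c! for the given per-column rectangle counts."""
    return prod(factorial(r) for r in rects_per_column)


def enumerate_arrangements(columns):
    """Yield every arrangement obtained by permuting rectangles inside each column."""
    for choice in product(*(permutations(col) for col in columns)):
        yield [list(col) for col in choice]


# Hypothetical layout: three columns holding 3, 2 and 3 rectangles.
columns = [["p0", "p1", "p2"], ["p3", "p4"], ["p5", "p6", "p7"]]
size = search_space_size([len(col) for col in columns])
count = sum(1 for _ in enumerate_arrangements(columns))
print(size, count)  # 72 72
```

Even this toy layout has 72 arrangements, and the count grows factorially with the number of rectangles per column.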
NP-Complete
Which arrangement of rectangles is communication-optimal?
This is an NP-complete problem
Exhaustive search can be avoided by applying a heuristic that efficiently finds a near-optimal arrangement
This requires estimating the communication cost incurred by each data partitioning
Topology-aware Optimization Approach
Cost Function
Based on observations from the exhaustive search
Propose a cost function for FPM-BR, assuming:
Ring communication flow
Two-level network hierarchy
Cost function for Matrix A
Figure: Inter-cluster communication related to matrix A
Let
$o$ = number of overlaps of matrix rectangles
$h$ = number of inter-cluster communications
$v$ = height of the overlap
$$\mathrm{cost}_A = \sum_{i=1}^{o} h(i) \times v(i)$$
Worst case: $2 \times (11 + 3 + 3 + 3 + 4 + 2 + 6) = 64$
Optimal: $1 \times (6 + 8) + 2 \times (1 + 9 + 2 + 6) = 50$
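The two worked values can be reproduced with a few lines of code. This is a minimal sketch, assuming each overlap is given as an (h, v) pair; the function name `overlap_cost` is made up here, and the pairs are read off the two sums above (in the worst case every overlap is taken to have h = 2).

```python
def overlap_cost(terms):
    """Sum h(i) * v(i) over a list of (h, v) pairs, one pair per overlap."""
    return sum(h * v for h, v in terms)


# Worst case: seven overlaps, each crossing clusters twice (h = 2).
worst_a = [(2, v) for v in (11, 3, 3, 3, 4, 2, 6)]
# Optimal case: two overlaps with h = 1, four overlaps with h = 2.
optimal_a = [(1, 6), (1, 8)] + [(2, v) for v in (1, 9, 2, 6)]

print(overlap_cost(worst_a), overlap_cost(optimal_a))  # 64 50
```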
Cost function for Matrix B
Figure: Inter-cluster communication related to matrix B
Let
$c$ = total number of columns
$h$ = number of inter-cluster communications
$v$ = column width
$$\mathrm{cost}_B = \sum_{i=1}^{c} h(i) \times v(i)$$
Worst case: $(1 \times 12) + (2 \times 12) + (3 \times 9) = 63$
Optimal: $(1 \times 12) + (2 \times 12) + (2 \times 9) = 54$
Cost function for M Arrangement
Use the Euclidean norm
It represents the combined cost and can be used to compare any two arrangements:
$$\|(\mathrm{cost}_A(M), \mathrm{cost}_B(M))\|$$
Worst case: $\sqrt{64^2 + 63^2} = 89.80$
Optimal case: $\sqrt{50^2 + 54^2} = 73.59$
Finding the communication-optimal arrangement can be formulated as minimization of the Euclidean norm:
$$\|(\mathrm{cost}_A(M), \mathrm{cost}_B(M))\| \to \min$$
Use this cost function in the heuristic
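The combined cost is straightforward to compute once the two component costs are known. The sketch below is an illustration only: `weighted_cost` and `arrangement_cost` are names invented for the example, the matrix B terms are the (h, v) pairs from the worked sums for matrix B, and the matrix A costs 64 and 50 are taken as given from the slides.

```python
import math


def weighted_cost(terms):
    """Sum h(i) * v(i) over (h, v) pairs."""
    return sum(h * v for h, v in terms)


def arrangement_cost(cost_a, cost_b):
    """Euclidean norm of the pair (cost_A, cost_B)."""
    return math.hypot(cost_a, cost_b)


# Matrix B terms: (inter-cluster communications, column width) per column.
worst_b = weighted_cost([(1, 12), (2, 12), (3, 9)])    # 63
optimal_b = weighted_cost([(1, 12), (2, 12), (2, 9)])  # 54

worst = arrangement_cost(64, worst_b)
optimal = arrangement_cost(50, optimal_b)
print(worst, optimal)      # roughly 89.8 and 73.6
print(optimal < worst)     # True: the optimal arrangement has the smaller norm
```

Because the norm is a single number, any two arrangements can be ranked by comparing their values directly.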
Heuristic
Heuristic for the Communication-Optimal Arrangement
Propose a heuristic to avoid testing too many combinations
Permutations are based on groups
This requires testing $g_2! + \dots + g_c!$ arrangements of submatrices
Heuristic for the Communication-Optimal Arrangement-2
Figure: Permutation order k=1
Figure: Permutation order k=2
Accept column $c_1$ as the optimal order
Generate the $g_i!$ group permutations
For each permutation $k$ of the groups, evaluate the cost function for the extended sub-matrix
Find the $k$ with the minimum cost; in this example the cost is 45 for $k_1$ and 35 for $k_2$
Add the minimum-cost $k$ to the resulting arrangement
Heuristic for the Communication-Optimal Arrangement-3
Figure: Permutation order k=1
Figure: Permutation order k=2
Repeat the same steps for all $c$ columns
The cost function is 74 for $k_1$ and 65 for $k_2$
Choose $k_2$ as the optimal order
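The column-by-column procedure described on these slides can be sketched as a greedy loop. This is a rough sketch under stated assumptions, not the authors' implementation: `greedy_arrangement` is a made-up name, each column is modelled as a list of group labels, and `toy_cost` is a hypothetical stand-in that counts row positions where neighbouring columns hold different clusters. The real cost function would be the Euclidean norm of the matrix A and matrix B costs.

```python
from itertools import permutations


def greedy_arrangement(columns, cost):
    """Fix the first column, then add each next column in its cheapest group order.

    columns: list of columns, each a list of groups.
    cost: callable mapping a partial arrangement (list of columns) to a number.
    """
    if not columns:
        return []
    result = [list(columns[0])]  # the first column is accepted as is
    for column in columns[1:]:
        # Try every permutation of this column's groups on the extended sub-matrix.
        best = min(permutations(column),
                   key=lambda order: cost(result + [list(order)]))
        result.append(list(best))
    return result


def toy_cost(arrangement):
    """Hypothetical cost: mismatched cluster labels between adjacent columns."""
    total = 0
    for left, right in zip(arrangement, arrangement[1:]):
        total += sum(a != b for a, b in zip(left, right))
    return total


columns = [["c1", "c2"], ["c2", "c1"], ["c2", "c1"]]
arranged = greedy_arrangement(columns, toy_cost)
print(arranged)  # [['c1', 'c2'], ['c1', 'c2'], ['c1', 'c2']]
```

In this toy run the loop evaluates 2! + 2! = 4 candidate orders, which matches the count of tested arrangements stated for the heuristic. Since each column is fixed once chosen, the result is near-optimal at best and is not guaranteed to match the exhaustive-search optimum.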
Experimental Result
Heterogeneous Inter-Cluster Experiments
Figure: Matrix partitioning for 32 nodes
Figure: Matrix partitioning for 90 nodes
Table: Inter-cluster experimental results
Nodes | Cost (Orig) | Cost (Heuristic) | Exec time, sec (Orig) | Exec time, sec (Heuristic) | Ratio