Matrix Multiplication on Two Interconnected Processors
Brett A. Becker and Alexey Lastovetsky
Heterogeneous Computing Laboratory, School of Computer Science and Informatics, University College Dublin
HeteroPar’06, Barcelona, Sept. 28, 2006
Matrix Multiplication on Two Interconnected Processors
Total Volume of Inter-Processor Communication (TVC) = $N^2$
Introduction: Square-Corner Partitioning
$\mathrm{TVC} \to 0$ as $X \to 0$
$\mathrm{TVC} = N^2$
Square-Corner Partitioning
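Sum of half-perimeters ($w_i$, $h_i$: width and height of partition $i$):
$S = \sum_{i=1}^{2} (w_i + h_i) = (w_1 + h_1) + (w_2 + h_2)$
$S \to 2N$ as $a_2 \to 0$
Lower bound ($a_i$: area of partition $i$):
$L = 2 \sum_{i=1}^{2} \sqrt{a_i} = 2(\sqrt{a_1} + \sqrt{a_2})$
$L \to 2N$ as $a_2 \to 0$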
The Square-Corner Partitioning can meet the lower bound, L
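The claim that the Square-Corner Partitioning can meet the lower bound can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in the slides: Processor 2 owns an s x s square, Processor 1's half-perimeter is taken from its bounding rectangle (the whole N x N matrix), and the lower bound is twice the sum of the square roots of the partition areas. The function names are hypothetical.

```python
import math

def half_perimeter_sum(n, s):
    """Sum of half-perimeters S for a square-corner partition.

    Assumption: Processor 1's region is measured by its bounding
    rectangle (n x n); Processor 2 owns an s x s square.
    """
    return (n + n) + (s + s)

def lower_bound(n, s):
    """Lower bound L = 2 * (sqrt(a1) + sqrt(a2)) for areas a1, a2."""
    a2 = s * s          # area of the slower processor's square
    a1 = n * n - a2     # remaining area, owned by the faster processor
    return 2 * (math.sqrt(a1) + math.sqrt(a2))

n = 1000
for s in (500, 100, 10, 1):
    S = half_perimeter_sum(n, s)
    L = lower_bound(n, s)
    print(f"s={s:4d}  S={S}  L={L:.2f}  L/S={L / S:.6f}")
```

Under these assumptions L never exceeds S, and the ratio L/S approaches 1 as the square shrinks, so both quantities tend to 2N.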
Square-Corner Partitioning
Average and Minimum values of $L/S$ for 2 million randomly generated areas
Power Ratio > 3:1
Adapted From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol.12, No.10, pp.1033-1051.
Square-Corner Partitioning Minimizing the TVC
The Square-Corner Partitioning has a lower Total Volume of Communication than the Straight-Line Partitioning, provided the processor power ratio is greater than 3:1
The Total Volume of Communication is minimized when the slower processor’s partition is a square
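The 3:1 threshold can be reproduced with a short calculation. The sketch below is not taken from the slides; it assumes that areas are assigned in proportion to processor power, that the slower processor's square has side s = N / sqrt(1 + r) for a power ratio r:1, and that the square-corner TVC is 2·N·s (each processor must receive the parts of the s rows of A and s columns of B that it lacks). The straight-line TVC is taken as N^2, as stated in the slides. All function names are hypothetical.

```python
import math

def square_side(n, ratio):
    """Side of the slower processor's square for a power ratio `ratio`:1.

    Assumption: areas are proportional to processor power, so the
    slower processor gets n*n / (1 + ratio) elements.
    """
    return n / math.sqrt(1 + ratio)

def tvc_straight_line(n):
    """TVC of the straight-line partitioning, N^2 regardless of ratio."""
    return n * n

def tvc_square_corner(n, ratio):
    """TVC of the square-corner partitioning under the 2*N*s assumption."""
    return 2 * n * square_side(n, ratio)

n = 1000
for ratio in (1, 2, 3, 4, 9, 24):
    sl = tvc_straight_line(n)
    sc = tvc_square_corner(n, ratio)
    verdict = "square-corner lower" if sc < sl else "straight-line not worse"
    print(f"ratio {ratio:2d}:1  straight-line={sl}  square-corner={sc:.0f}  {verdict}")
```

With these assumptions the two volumes are equal exactly at 3:1 (s = N/2), and the square-corner volume is smaller for any larger ratio.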
Lower TVC → Lower Communication Time → Lower Execution Time
Average Reduction in Execution Time = 10%
Square-Corner Partitioning Overlapping Communication and Computation
A sub-partition of Processor 1’s C Partition is Immediately Calculable
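To make "immediately calculable" concrete, the sketch below counts the elements of C that Processor 1 can compute before any data arrives. It is an illustration under assumptions the slides do not spell out: A, B and C share the same partitioning, Processor 2's s x s square sits in the top-right corner, and an element C[i][j] is immediately calculable when Processor 1 already holds all of row i of A and all of column j of B. The function name and layout are hypothetical.

```python
def immediately_calculable(n, s):
    """Count C elements Processor 1 can compute with no communication.

    Assumed layout: Processor 2 owns rows 0..s-1, columns n-s..n-1
    of A, B and C; Processor 1 owns everything else.
    """
    def owner(i, j):
        return 2 if (i < s and j >= n - s) else 1

    count = 0
    for i in range(n):
        for j in range(n):
            if owner(i, j) != 1:
                continue  # element belongs to Processor 2
            # C[i][j] needs all of row i of A and all of column j of B.
            row_local = all(owner(i, k) == 1 for k in range(n))
            col_local = all(owner(k, j) == 1 for k in range(n))
            if row_local and col_local:
                count += 1
    return count

print(immediately_calculable(8, 2))  # 36, i.e. (8 - 2) ** 2
```

Under this layout the immediately calculable region is an (N - s) x (N - s) block, which Processor 1 can work on while the remaining data is in transit.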
Square-Corner Partitioning Overlapping Communication and Computation
Overlapping more than doubled the advantage of the Square-Corner algorithm.
● No Overlapping → 17% faster than Straight-Line algorithm.
● Overlapping → 39% faster than Straight-Line algorithm.