High Performance Computing Group

Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE™ Architecture

A. Kumar¹, G. Senthilkumar¹, M. Krishna¹, N. Jayam¹, P.K. Baruah¹, R. Sarma¹, S. Kapoor², A. Srinivasan³
¹ Sri Sathya Sai University, Prashanthi Nilayam, India
² IBM, Austin, [email protected]
³ Florida State University, [email protected]

Goals
1. Determine the feasibility of intra-Cell MPI
2. Evaluate the impact of different design choices on performance
Cell features
- In-order execution, but DMAs can complete out of order
- Over 100 simultaneous DMAs can be in flight
Constraints
- Unconventional, heterogeneous architecture
- SPEs have limited functionality and can act directly only on their local stores
- SPEs access main memory through DMA
- Use of the PPE should be limited to get good performance
MPI design choices
- Application data in: (i) local store or (ii) main memory
- MPI meta-data in: (i) local store or (ii) main memory
- PPE involvement: (i) active or (ii) only during initialization and finalization
- Point-to-point communication mode: (i) synchronous or (ii) buffered
Blocking Point-to-Point Communication Performance
Results are from a 3.2 GHz Cell Blade, at IBM Rochester
The final version uses buffered mode for small messages and synchronous mode for long messages
The threshold for switching to synchronous mode is set to 2 KB
In these figures, the default configuration places application data in main memory and MPI meta-data in the local store, with no congestion and limited PPE involvement
MPI/Platform        Latency (0 Byte)   Maximum throughput
MPICELL             0.41 µs            6.01 GB/s
MPICELL Congested   NA                 4.48 GB/s
MPICELL Small       0.65 µs            23.12 GB/s
Nemesis/Xeon        1.0 µs             0.65 GB/s
Shm/Xeon            1.3 µs             0.5 GB/s
Open MPI/Xeon       2.8 µs             0.5 GB/s
Nemesis/Opteron     0.34 µs            1.5 GB/s
Open MPI/Opteron    0.6 µs             1.0 GB/s
Comparison of MPICELL with MPI on Other Hardware
Collective Communication Example – Broadcast
Broadcast on 16 SPEs (2 processors)
- TREE: Pipelined tree-structured communication based on the local store
- TREEMM: Tree-structured Send/Recv type implementation
- AG: Each SPE is responsible for a different portion of the data
- OTA: Each SPE copies data to its own location
- G: Root copies all data
Broadcast with a good choice of algorithm for each data size and SPE count; the maximum main memory bandwidth is also shown
Achieved a peak double-precision throughput of 7.8 Gflop/s for matrices of size 1024
The collective used was from an older implementation on the Cell, built on top of Send/Recv using a tree structured communication
The Opteron results used LAM MPI
Performance of double-precision matrix-vector multiplication
Conclusions and Future Work
Conclusions
- The Cell processor has good potential for MPI applications
- The PPE should have a very limited role
- Very high bandwidths with application data in the local store
- High bandwidth and low latency even with application data in main memory
- The local store should be used effectively, with double buffering to hide latency; main memory bandwidth then becomes the bottleneck
- Good performance for collectives even with two Cell processors
Current and future work
Implemented:
- Collective communication operations optimized for contiguous data
- Blocking and non-blocking communication
Future work:
- Optimize collectives for derived data types with non-contiguous data
- Optimize point-to-point communication on a blade with two processors
- More features, such as topologies