
High Performance Computing Group
Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE (TM) Architecture

Page 1:

Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE (TM) Architecture

A. Kumar1, G. Senthilkumar1, M. Krishna1, N. Jayam1, P.K. Baruah1, R. Sarma1, S. Kapoor2, A. Srinivasan3

1 Sri Sathya Sai University, Prashanthi Nilayam, India
2 IBM, Austin, [email protected]
3 Florida State University, [email protected]


Goals

1. Determine the feasibility of Intra-Cell MPI

2. Evaluate the impact of different design choices on performance


Page 2:

Cell Architecture

• A PowerPC core (PPE), with 8 co-processors (SPEs), each with a 256 KB local store
• Shared 512 MB - 2 GB main memory; SPEs access it through DMA
• Peak SPE speeds of 204.8 Gflop/s in single precision and 14.64 Gflop/s in double precision
• 204.8 GB/s EIB bandwidth, 25.6 GB/s memory bandwidth
• Two Cell processors can be combined to form a Cell blade with global shared memory

DMA put times

Memory to memory copy using:

• SPE local store

• memcpy by PPE

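For illustration, here is a minimal sketch of how an SPE stages a memory-to-memory copy through its local store with the MFC DMA intrinsics from the Cell SDK (spu_mfcio.h). The buffer name, chunk size, and tag choice are assumptions for the example, not taken from the measured code.

    /* Copy one chunk of main memory to another location by staging it
       through the SPE local store, as in the measurement above. */
    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 16384   /* maximum size of a single DMA transfer */
    static char ls_buf[CHUNK] __attribute__((aligned(128)));

    void copy_chunk(uint64_t src_ea, uint64_t dst_ea, uint32_t size, uint32_t tag)
    {
        mfc_get(ls_buf, src_ea, size, tag, 0, 0);   /* main memory -> local store */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();                  /* wait for the get */

        mfc_put(ls_buf, dst_ea, size, tag, 0, 0);   /* local store -> main memory */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();                  /* wait for the put */
    }

A real copy would pipeline many such chunks so that DMA transfers overlap with the rest of the work; a double-buffering sketch appears with the conclusions.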

Page 3:

Intra-Cell MPI Design Choices

Cell features
• In-order execution, but DMAs can be out of order
• Over 100 simultaneous DMAs can be in flight

Constraints
• Unconventional, heterogeneous architecture
• SPEs have limited functionality, and can act directly only on their local stores
• SPEs access main memory through DMA
• Use of the PPE should be limited to get good performance

MPI design choices
• Application data in: (i) local store or (ii) main memory
• MPI meta-data in: (i) local store or (ii) main memory
• PPE involvement: (i) active or (ii) only during initialization and finalization
• Point-to-point communication mode: (i) synchronous or (ii) buffered

Page 4:

Blocking Point-to-Point Communication Performance

Results are from a 3.2 GHz Cell blade at IBM Rochester

The final version uses buffered mode for small messages and synchronous mode for long messages

The threshold for switching to synchronous mode is set to 2 KB

In these figures, the default configuration is application data in main memory, MPI meta-data in the local store, no congestion, and limited PPE involvement
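The policy just described can be pictured with a small sketch of the send-side decision: below the threshold the sender copies the message eagerly into a pre-allocated buffer and returns, above it the transfer is synchronous so data moves directly once the receiver is ready. The names, buffer handling, and return convention here are illustrative assumptions, not the MPICELL source.

    #include <string.h>
    #include <stddef.h>

    #define EAGER_THRESHOLD 2048      /* the 2 KB switch point noted above */

    typedef struct {
        void  *eager_buf;             /* pre-allocated bounce buffer (hypothetical) */
        size_t eager_capacity;
    } send_channel_t;

    /* Returns 0 if the message was sent in buffered (eager) mode,
       1 if it must complete in synchronous (rendezvous) mode. */
    int choose_send_mode(send_channel_t *ch, const void *data, size_t len)
    {
        if (len <= EAGER_THRESHOLD && len <= ch->eager_capacity) {
            memcpy(ch->eager_buf, data, len);   /* eager copy; sender can return */
            return 0;
        }
        /* Long message: publish (address, length) and complete only after
           the receiver has pulled the data directly from the source buffer. */
        return 1;
    }

The trade-off is the usual one: the eager copy costs extra memory traffic but lets small sends complete with low latency, while the synchronous path avoids the copy for large messages where bandwidth dominates.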

Page 5:

MPI/Platform          Latency (0 Byte)   Maximum throughput
MPICELL               0.41 µs            6.01 GB/s
MPICELL Congested     NA                 4.48 GB/s
MPICELL Small         0.65 µs            23.12 GB/s
Nemesis/Xeon          1.0 µs             0.65 GB/s
Shm/Xeon              1.3 µs             0.5 GB/s
Open MPI/Xeon         2.8 µs             0.5 GB/s
Nemesis/Opteron       0.34 µs            1.5 GB/s
Open MPI/Opteron      0.6 µs             1.0 GB/s

Comparison of MPICELL with MPI on other hardware
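Latency and throughput figures of this kind are normally obtained with a simple ping-pong test between two ranks; a generic sketch is shown below (this is standard MPI code, not the benchmark actually used for the table, and the iteration count and message size are arbitrary).

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, iters = 1000, len = 0;     /* 0-byte messages measure latency */
        char *buf = malloc(1 << 20);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, len, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, len, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)   /* one-way latency is half of the round-trip time */
            printf("latency: %g us\n", (t1 - t0) / (2.0 * iters) * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Throughput is measured the same way with large messages, dividing the bytes moved by the one-way time.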

Page 6:

Collective Communication Example – Broadcast

Broadcast on 16 SPEs (2 processors):
• TREE: Pipelined tree-structured communication based on the local store
• TREEMM: Tree-structured Send/Recv-type implementation
• AG: Each SPE is responsible for a different portion of the data
• OTA: Each SPE copies the data to its location
• G: The root copies all the data

Broadcast with a good choice of algorithm for each data size and SPE count; maximum main memory bandwidth is also shown
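To make the TREEMM variant concrete, here is a minimal binomial-tree broadcast built only from Send/Recv; it is a generic sketch of the idea, not the Cell implementation, which additionally pipelines data through the SPE local stores.

    #include <mpi.h>

    /* Binomial-tree broadcast: each process receives from its parent,
       then forwards the buffer to its children. */
    void tree_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
    {
        int rank, size, mask, rel, src, dst;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        rel = (rank - root + size) % size;     /* rank relative to the root */

        /* Walk up the tree until the parent is found, and receive from it. */
        for (mask = 1; mask < size; mask <<= 1) {
            if (rel & mask) {
                src = (rank - mask + size) % size;
                MPI_Recv(buf, count, type, src, 0, comm, MPI_STATUS_IGNORE);
                break;
            }
        }
        /* Forward the data to the children at decreasing distances. */
        for (mask >>= 1; mask > 0; mask >>= 1) {
            if (rel + mask < size) {
                dst = (rank + mask) % size;
                MPI_Send(buf, count, type, dst, 0, comm);
            }
        }
    }

This takes about log2(P) communication steps for P processes, which is why the tree variants scale better than G (where the root alone copies all the data) as the SPE count grows.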

Page 7:

Application Performance – Matrix-Vector Multiplication

Used a 1-D decomposition (not very efficient)

Achieved a peak double precision throughput of 7.8 Gflop/s for matrices of size 1024

The collective used was from an older implementation on the Cell, built on top of Send/Recv using tree-structured communication

The Opteron results used LAM MPI

Performance of double precision matrix-vector multiplication
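As a reference for the measurement above, here is a minimal sketch of a 1-D (row-block) decomposed double precision matrix-vector multiply: each rank owns a block of rows, multiplies it by the full vector, and the result slices are gathered everywhere. The function name and the assumption that n divides evenly among the ranks are for illustration only.

    #include <mpi.h>
    #include <stdlib.h>

    /* y = A * x with a 1-D row decomposition.  A_rows holds this rank's
       n/P rows of A (row-major); x and y are full-length vectors. */
    void matvec_1d(const double *A_rows, const double *x, double *y, int n, MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);
        int rows = n / size;                          /* rows owned by this rank */
        double *y_local = malloc(rows * sizeof(double));

        for (int i = 0; i < rows; i++) {              /* local block times vector */
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += A_rows[i * n + j] * x[j];
            y_local[i] = sum;
        }
        /* Gather every rank's slice so all ranks end up with the full y. */
        MPI_Allgather(y_local, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, comm);
        free(y_local);
    }

The final gather is the collective step referred to on this slide.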

Page 8:

Conclusions and Future Work

Conclusions
• The Cell processor has good potential for MPI applications
• The PPE should have a very limited role
• Very high bandwidths with application data in the local store
• High bandwidth and low latency even with application data in main memory
• The local store should be used effectively, with double buffering to hide latency (a sketch follows this list); main memory bandwidth is then the bottleneck
• Good performance for collectives even with two Cell processors
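A minimal sketch of the double-buffering pattern mentioned above: while the SPE computes on one local-store buffer, the DMA for the next chunk is already in flight. Buffer layout, chunk size, and the compute callback are illustrative assumptions.

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 16384
    static char buf[2][CHUNK] __attribute__((aligned(128)));

    /* Stream 'nchunks' chunks from main memory (starting at ea), overlapping
       the DMA of chunk i+1 with the computation on chunk i. */
    void process_stream(uint64_t ea, int nchunks, void (*compute)(char *, int))
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);              /* prefetch chunk 0 */

        for (int i = 0; i < nchunks; i++) {
            int next = cur ^ 1;
            if (i + 1 < nchunks)                              /* start the next DMA early */
                mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();                        /* wait only for the current chunk */
            compute(buf[cur], CHUNK);                         /* overlaps with the in-flight DMA */

            cur = next;
        }
    }

With this pattern the SPE is limited by whichever is slower, the computation or the 25.6 GB/s memory interface, which is the sense in which main memory bandwidth becomes the bottleneck.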

Current and future work

Implemented:
• Collective communication operations optimized for contiguous data
• Blocking and non-blocking communication

Future work:
• Optimize collectives for derived data types with non-contiguous data
• Optimize point-to-point communication on a blade with two processors
• More features, such as topologies, etc.