Job Startup at Exascale: Challenges and Solutions
Sourav Chakraborty and Dhabaleswar K. Panda (Advisor), The Ohio State University

Current Trends in HPC
§ Tremendous increase in system and job sizes
§ Dense many-core systems becoming popular
§ Less memory available per process
§ Fast and scalable job startup is essential

Importance of Fast Job Startup
§ Development and debugging
§ Regression/acceptance testing
§ Checkpoint-restart

Performance Bottlenecks

Static Connection Setup
§ Setting up O(num_procs²) connections is expensive
§ OpenSHMEM, UPC, and other PGAS libraries lack on-demand connection management

Network Address Exchange over PMI
§ Limited scalability, no potential for overlap
§ Not optimized for symmetric exchange

Global Barriers
§ Unnecessary synchronization and connection setup

Memory Scalability Issues
§ Each node requires O(number of processes × processes per node) memory for storing remote endpoint information

Proposed Solutions

On-demand Connection Management
§ Exchange information and establish a connection only when two peers try to communicate [1]

PMIX_Ring Extension
§ Move the bulk of the data exchange to high-performance networks [2]

Non-blocking PMI Collectives
§ Overlap the PMI exchange with other tasks [3]

Shared-memory based PMI Get/Allgather
§ All clients access data directly from the launcher daemon through shared memory regions [4]

Summary
§ Near-constant MPI and OpenSHMEM initialization time at any process count
§ 10x and 30x improvement in startup time of MPI and OpenSHMEM, respectively, at 16,384 processes
§ Memory consumption reduced by O(ppn)
§ 1 GB of memory saved at 1M processes and 16 ppn

References
[1] On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. (Chakraborty et al., HIPS ’15)
[2] PMI Extensions for Scalable MPI Startup. (Chakraborty et al., EuroMPI/Asia ’14)
[3] Non-blocking PMI Extensions for Fast MPI Startup. (Chakraborty et al., CCGrid ’15)
[4] SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. (Chakraborty et al., CCGrid ’16)

More Information
§ Available in the latest MVAPICH2 and MVAPICH2-X
§ http://mvapich.cse.ohio-state.edu/downloads/
§ https://go.osu.edu/mvapich-startup
Shortcomings of the Current PMI Design
§ Puts and Gets are local operations
§ Fence consumes most of the time
§ Time taken by Fence grows approximately linearly with the amount of data transferred (number of keys)
[Figure: Breakdown of MPI_Init (PMI exchanges, shared memory, other), 32 to 8K processes]
[Figure: Time spent in Put, Fence, and Get, 16 to 16K processes]
[Figure: PMI Fence with different numbers of Puts (100% Put + Fence, 50% Put + Fence, Fence only), 16 to 16K processes]

New Collective – PMIX_Ring
§ A ring can be formed by exchanging data with only the left and the right neighbors
§ Once the ring is formed, data can be exchanged over high-speed networks such as InfiniBand
§ int PMIX_Ring(char value[], char left[], char right[], …)
§ (A usage sketch follows the non-blocking PMI results below.)
[Figure: Time taken by PMIX_Ring, 16 to 16K processes]

Results – PMI Ring Extension [2]
§ MPI_Init based on PMIX_Ring performs 34% better than the default PMI2_KVS_Fence
§ Hello World runs 33% faster with 8K processes
§ Up to 20% improvement in total execution time of the NAS parallel benchmarks
[Figure: Performance of MPI_Init and Hello World with PMIX_Ring (Fence vs. Ring), 16 to 8K processes]
[Figure: NAS benchmarks with 1K processes, Class B data (Fence vs. Ring): EP, MG, CG, FT, BT, SP]

Non-blocking PMI Collectives
§ PMI operations are progressed by the separate processes that handle process management
§ The MPI library is not involved in progressing PMI communication
§ Similar to functional partitioning approaches
§ Can be overlapped with other initialization tasks

PMIX_Request
§ Non-blocking collectives return before the operation is completed
§ They return an opaque handle to a request object that can be used to check for completion

PMIX_KVS_Ifence
§ Non-blocking version of PMI2_KVS_Fence
§ int PMIX_KVS_Ifence(PMIX_Request *request)

PMIX_Iallgather
§ Optimized for symmetric data movement
§ Reduces data movement by up to 30% (286 KB → 208 KB with 8,192 processes)
§ int PMIX_Iallgather(const char value[], char buffer[], PMIX_Request *request)

PMIX_Wait
§ Waits for the specified request to complete
§ int PMIX_Wait(PMIX_Request request)

Results – Non-blocking PMI [3]
§ Near-constant MPI_Init at any scale
§ MPI_Init with Iallgather performs 288% better than the default, Fence-based exchange
§ Blocking Allgather is 38% faster than blocking Fence
[Figure: Performance of MPI_Init (Fence, Ifence, Allgather, Iallgather), 64 to 16K processes]
[Figure: Performance comparison of Fence and Allgather (PMI2_KVS_Fence vs. PMIX_Allgather), 64 to 16K processes]
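To show how the non-blocking extensions above fit together, here is a minimal sketch of overlapping the address exchange with other initialization work. It assumes the standard PMI2_KVS_Put call for publishing the local endpoint key, a hypothetical header pmix_ext.h exposing the proposed PMIX_* calls, and a placeholder do_local_initialization() for the overlapped work; it illustrates the idea rather than MVAPICH2's actual implementation.

```c
#include <pmi2.h>        /* standard PMI2 API: PMI2_KVS_Put / PMI2_KVS_Get */
#include "pmix_ext.h"    /* hypothetical header for the proposed PMIX_* extensions */

extern void do_local_initialization(void);   /* placeholder for overlapped work */

/* Publish the local endpoint address, start the non-blocking fence,
 * overlap it with local setup, and wait only when remote keys are needed. */
static void exchange_addresses_nonblocking(const char *my_key, const char *my_addr)
{
    PMIX_Request req;

    /* Stage the local key/value pair exactly as in the blocking design. */
    PMI2_KVS_Put(my_key, my_addr);

    /* Non-blocking version of PMI2_KVS_Fence: returns immediately. */
    PMIX_KVS_Ifence(&req);

    /* Overlap the PMI exchange with other initialization tasks
     * (e.g., shared-memory setup, memory registration). */
    do_local_initialization();

    /* Block only when the exchanged values are actually required;
     * remote addresses can then be read with PMI2_KVS_Get as usual. */
    PMIX_Wait(req);
}
```

When every process contributes one fixed-size value, PMIX_Iallgather could be used in place of the Put/Ifence pair, which is what enables the symmetric-exchange optimization described above.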
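The PMIX_Ring extension described earlier can be sketched in the same spirit: each process contributes its own endpoint address and receives only its left and right neighbors' addresses over PMI, with everything else exchanged later over InfiniBand. The buffer size, the elided trailing arguments of PMIX_Ring, and the comments describing the follow-on steps are assumptions for illustration, not the actual MVAPICH2 code.

```c
#include <string.h>
#include "pmix_ext.h"    /* hypothetical header for the proposed PMIX_Ring extension */

#define ADDR_LEN 64      /* assumed maximum length of an encoded endpoint address */

/* Form a ring over PMI, then move the bulk of the exchange to the network. */
static void bootstrap_with_ring(const char *my_addr)
{
    char value[ADDR_LEN];
    char left[ADDR_LEN];     /* filled with the left neighbor's address  */
    char right[ADDR_LEN];    /* filled with the right neighbor's address */

    strncpy(value, my_addr, sizeof(value) - 1);
    value[sizeof(value) - 1] = '\0';

    /* Each process exchanges data only with its two neighbors, so the
     * launcher never has to gather and broadcast all the keys. */
    PMIX_Ring(value, left, right /* , ... trailing arguments elided on the poster */);

    /* Next step (not shown): connect to the two neighbors using `left` and
     * `right`, then propagate the remaining endpoint addresses around the
     * ring over InfiniBand instead of the PMI channel. */
}
```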
On-demand Connection Establishment

Static Connection Setup
§ Setting up connections takes over 85% of the total startup time with 4,096 processes
§ RDMA operations require exchanging information about memory segments registered with the HCA

[Figure: On-demand connection establishment between Process 1 and Process 2. The first Put/Get targeting P2 causes P1's main thread to create a QP, move it to INIT, and enqueue the send; the connection manager threads then exchange a connect request and reply carrying (LID, QPN) and (address, size, rkey). Both QPs transition through RTR to RTS, the connection is established, and the queued send is dequeued and issued.]
(A verbs-level sketch of these QP transitions is given at the end, after the shared-memory PMI results.)

Average number of communication peers per process:
Application   Processes   Average Peers
BT            64          8.7
BT            1024        10.6
EP            64          3.0
EP            1024        5.0
MG            64          9.5
MG            1024        11.9
SP            64          8.8
SP            1024        10.7
2D Heat       64          5.3
2D Heat       1024        5.4

Results – On-demand Connection [1]
§ Initialization time is 29.6x faster
§ Hello World performs 8.31x better
§ Execution time of the NAS benchmarks improved by up to 35% with 256 processes and Class B data
[Figure: Performance of OpenSHMEM initialization (start_pes) and Hello World, static vs. on-demand, 16 to 8K processes]
[Figure: Execution time of OpenSHMEM NAS parallel benchmarks (BT, EP, MG, SP), static vs. on-demand]
[Figure: Breakdown of OpenSHMEM initialization (connection setup, PMI exchange, memory registration, shared memory setup, other), 32 to 4K processes]

Memory Scalability in PMI
§ PMI communication between the server and the clients is based on local sockets
§ Latency is high with a large number of clients
§ Copying data into each client's memory causes large memory overhead
[Figure: Time taken by one PMI_Get, 1 to 32 processes per node]
[Figure: PMI memory usage per node, PPN = 16, 32, and 64]

Shared Memory (shmem) based PMI
§ Open a shared memory channel between the server and the clients
§ A hash table is suitable for Fence, while Allgather only requires an array of values
§ Use a hash table based on two shmem regions for efficient insertion and merging, and for compactness
[Figure: Layout of the shared-memory key-value store: a hash table region (head/tail/next links) and a key-value store (KVS) region]

Results – Shared Memory based PMI [4]
§ PMI Get takes 0.25 ms with 32 ppn
§ 1,000x reduction in PMI Get latency compared to the default socket-based protocol
§ Memory footprint reduced by O(processes per node): ≈ 1 GB saved at 1M processes and 16 ppn
§ Backward compatible, with negligible overhead
[Figure: Time taken by one PMI_Get, default vs. shmem, 1 to 32 processes per node]
[Figure: PMI memory usage per node for Fence and Allgather, default vs. shmem, 32K to 1M processes]

[Figure: Overview: job startup performance vs. memory required to store endpoint information, comparing state-of-the-art PGAS (P) and MPI (M) with the optimized design (O) built from PMIX_Ring, PMIX_Ibarrier, PMIX_Iallgather, shmem-based PMI, and on-demand connections]
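To make the shared-memory PMI design above concrete, here is a minimal client-side sketch: the launcher daemon is assumed to have published key/value pairs into a POSIX shared-memory region, and the client maps that region once and then serves every PMI Get locally. The flat slot array, the region name, and the sizes are simplifications of SHMEMPMI's two-region hash-table layout.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define KEY_LEN 64
#define VAL_LEN 256

/* Simplified slot layout; the actual design uses a hash table split across
 * two shared-memory regions (table + key-value store). */
struct kvs_slot {
    char key[KEY_LEN];
    char value[VAL_LEN];
};

struct kvs_region {
    int             num_slots;   /* filled in by the launcher daemon */
    struct kvs_slot slots[];     /* published key/value pairs */
};

/* Client side: map the region once, then serve every PMI Get locally
 * without a socket round-trip to the daemon. */
static struct kvs_region *kvs_attach(const char *shm_name, size_t region_size)
{
    int fd = shm_open(shm_name, O_RDONLY, 0);
    if (fd < 0)
        return NULL;
    void *base = mmap(NULL, region_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    return (base == MAP_FAILED) ? NULL : (struct kvs_region *)base;
}

static int shmem_pmi_get(const struct kvs_region *kvs, const char *key,
                         char *value, size_t value_len)
{
    for (int i = 0; i < kvs->num_slots; i++) {
        if (strncmp(kvs->slots[i].key, key, KEY_LEN) == 0) {
            strncpy(value, kvs->slots[i].value, value_len - 1);
            value[value_len - 1] = '\0';
            return 0;        /* found: no data copied through the daemon */
        }
    }
    return -1;               /* key not published (yet) */
}
```

Because the daemon writes the Fence or Allgather result once and clients read it in place, both the Get latency and the per-client copies (the source of the O(ppn) memory overhead) are avoided.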
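For the on-demand connection establishment sequence shown earlier, each side drives its InfiniBand queue pair through the INIT, RTR, and RTS states once the connect request/reply carrying (LID, QPN) and the registered-memory information has arrived. The libibverbs sketch below shows those transitions for one reliable-connected QP; the attribute values (MTU, PSNs, timeouts, retry counts) are generic defaults chosen for illustration, not the values MVAPICH2 uses.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Transition an RC queue pair INIT -> RTR -> RTS after the connect
 * request/reply carrying the peer's LID and QPN has been received. */
static int connect_qp(struct ibv_qp *qp, uint8_t port, uint16_t remote_lid,
                      uint32_t remote_qpn, uint32_t local_psn, uint32_t remote_psn)
{
    struct ibv_qp_attr attr;

    /* QP -> INIT */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state        = IBV_QPS_INIT;
    attr.pkey_index      = 0;
    attr.port_num        = port;
    attr.qp_access_flags = IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                                 IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
        return -1;

    /* QP -> RTR: needs the remote QPN and LID from the connect message. */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = IBV_MTU_2048;
    attr.dest_qp_num        = remote_qpn;
    attr.rq_psn             = remote_psn;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.dlid       = remote_lid;
    attr.ah_attr.port_num   = port;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                                 IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                                 IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* QP -> RTS: the connection is established; queued sends can be issued. */
    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.sq_psn        = local_psn;
    attr.max_rd_atomic = 1;
    if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
                                 IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                                 IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC))
        return -1;

    return 0;   /* connection established; dequeue and post the pending send */
}
```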