MPI+Threads: Runtime Contention and Remedies
Abdelhalim Amer*, Huiwei Lu+, Yanjie Wei#, Pavan Balaji+, Satoshi Matsuoka*
* Tokyo Institute of Technology
+ Argonne National Laboratory
# Shenzhen Institute of Advanced Technologies, Chinese Academy of Sciences
PPoPP’15, February 7–11, 2015, San Francisco, CA, USA.
[1] Peter Kogge. Pim & memory: The need for a revolution in architecture. The Argonne Training Program on Extreme-Scale Computing (ATPESC), 2013.
Evolution of the memory capacity per core in the Top500 list [1]
MPI+Threads Interoperation
MPI_Init_thread (…, required, …)
• MPI_THREAD_SINGLE
  – No additional threads
• MPI_THREAD_FUNNELED
  – Master thread communication only
• MPI_THREAD_SERIALIZED
  – Multithreaded communication serialized
• MPI_THREAD_MULTIPLE
  – No restrictions
• Restriction (MPI_THREAD_SINGLE end): low thread-safety costs
• Flexibility (MPI_THREAD_MULTIPLE end): high thread-safety costs
Test Environment
Fusion cluster at Argonne National Laboratory
Architecture        Nehalem
Processor           Xeon E5540
Clock frequency     2.6 GHz
Number of Sockets   2
Cores per Socket    4
L3 Size             8192 KB
L2 Size             256 KB
Number of nodes     310
Interconnect        Mellanox QDR
MPI Library         MPICH
Network Module      Nemesis:MXM
Contention in Multithreaded Communication
[Figure: two plots of Message Rate [10^3 msgs/s] versus Message Size [Bytes].
 Panel "Multithreaded Point-to-Point BW": processes P0 and P1; series for 1, 2, 4, and 8 tpn.
 Panel "Multi-process Point-to-Point BW": processes P0 P4, P1 P5, P2 P6, P3 P7; series for 1, 2, 4, and 8 ppn.]
Dimensions of Thread-Safety
• Critical Section Granularity
  – Shorter is better but more complex
• Synchronization Mechanism
  – How to hand-off to the next thread?
    • Atomic ops, memory barriers, system calls, NUMA-awareness
  – Arbitration: Who enters the CS?
    • Fairness
    • Random, FIFO, Priority
[Figure: threads contending for a critical section, annotated with the three dimensions: Critical Section Length, Arbitration, Hand-Off]
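To make the FIFO arbitration option above concrete, here is a minimal sketch of a ticket lock, one common way to serve threads in arrival order. It illustrates the arbitration dimension only and is not taken from the slides; the names `ticket_lock_t`, `ticket_lock_acquire`, and `ticket_lock_release` are hypothetical, and the hand-off here is plain spinning on C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  /* next ticket to hand out */
    atomic_uint now_serving;  /* ticket currently allowed into the CS */
} ticket_lock_t;

/* Take a ticket, then wait until it is served. Returns the ticket number. */
unsigned ticket_lock_acquire(ticket_lock_t *l)
{
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);

    while (atomic_load_explicit(&l->now_serving,
                                memory_order_acquire) != my_ticket)
        ;  /* spin: arrival order alone decides who enters next */
    return my_ticket;
}

/* Pass the critical section to the holder of the next ticket. */
void ticket_lock_release(ticket_lock_t *l)
{
    /* Releasing a lock that nobody holds would be a caller bug. */
    assert(atomic_load(&l->now_serving) != atomic_load(&l->next_ticket));
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```

Both counters start at zero. Because each thread waits for its own ticket number, a thread that has just released the lock cannot re-enter ahead of threads that were already waiting.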
Reducing Contention by Refining Critical Section Granularity
Balaji, Pavan, et al. "Fine-grained multithreading support for hybrid threaded MPI programming." International Journal of High Performance Computing Applications 24.1 (2010): 49-57.
Thread-Safety in MPICH
• GCS: Global CS only
• POCS: Per-Object CS supported
• Supports a 1:1 threading model: only sees kernel threads
[Figure: MPICH software stack with components MPI, MPID, Headers, Thread, CH3, PAMID, Nemesis, MRail, PSM, Sock, IB, MXM, TCP, …; annotations: Current Work (GCS), MVAPICH (GCS), BlueGene (POCS), POCS]
Baseline Thread-Safety in MPICH:Nemesis: Pthread Mutex
• Global critical section
• Implementation: NPTL Pthread mutex
• Pthread mutex
  – CAS in the user-space
  – Futex wait/wake in contended cases
  – Arbitration: fastest thread first; possible unfairness
[Figure: pthread_mutex_lock flow across user space (CAS, go inside the critical section) and kernel space (FUTEX_WAIT, sleep, FUTEX_WAKE)]
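The user-space CAS plus futex wait/wake pattern described above can be sketched as follows. This is a simplified, Linux-only illustration of the general technique and not NPTL's actual code; the names `sketch_mutex_t`, `futex_call`, `sketch_mutex_lock`, and `sketch_mutex_unlock` are hypothetical, and the three-state lock word is an assumption of this sketch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock word: 0 = free, 1 = held, 2 = held and other threads may be asleep. */
typedef atomic_int sketch_mutex_t;

static long futex_call(sketch_mutex_t *m, int op, int val)
{
    return syscall(SYS_futex, (int *)m, op, val, NULL, NULL, 0);
}

void sketch_mutex_lock(sketch_mutex_t *m)
{
    int seen = 0;

    /* Fast path: a single CAS in user space, no system call. */
    if (atomic_compare_exchange_strong(m, &seen, 1))
        return;

    /* Contended path: flag possible sleepers, then sleep in the kernel. */
    if (seen != 2)
        seen = atomic_exchange(m, 2);
    while (seen != 0) {
        futex_call(m, FUTEX_WAIT, 2);  /* sleeps only while *m == 2 */
        seen = atomic_exchange(m, 2);  /* first thread to get here wins */
    }
}

void sketch_mutex_unlock(sketch_mutex_t *m)
{
    /* Old value 2 means someone may be asleep: free the lock, wake one. */
    if (atomic_fetch_sub(m, 1) != 1) {
        atomic_store(m, 0);
        futex_call(m, FUTEX_WAKE, 1);
    }
}
```

Nothing in this scheme orders the contenders: a thread arriving on the fast path can take the lock before a freshly woken sleeper has been rescheduled, which is the "fastest thread first" arbitration noted on the slide.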
Unfairness May Occur!
• Flat memory: access should be random
• Hierarchical memory: access biased by the proximity to the cache containing the mutex
[Figure: in both cases four threads (T0 to T3), one per core, issue CAS operations from user space on a shared mutex; the hierarchical case adds four L1 caches and two L2 caches between the cores and the mutex]
• Bandwidth benchmark
• Unfairness levels
  – Core level: a single thread is monopolizing the lock
  – Socket level: threads on the same socket are monopolizing the lock
• Bias factor
  – How much a fair arbitration is biased
  – Bias factor = 1: fair arbitration
[Figure: Fairness analysis of the BW benchmark with 8 threads]
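The slides do not spell out how the bias factor is computed. As a rough illustration only, the sketch below assumes one plausible core-level definition: the observed fraction of lock acquisitions won by the thread that held the lock immediately before, divided by the 1/N fraction expected if each of N contending threads were equally likely to win. The function name `core_level_bias` and the trace format (an array of thread ids in acquisition order) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition, not taken from the slides:
 *   bias = P(next owner == previous owner) / (1 / nthreads)
 * owners[i] is the id of the thread that won the i-th acquisition. */
double core_level_bias(const int *owners, size_t n, int nthreads)
{
    size_t repeats = 0;

    assert(n >= 2 && nthreads >= 1);
    for (size_t i = 1; i < n; i++)
        if (owners[i] == owners[i - 1])
            repeats++;  /* same thread re-acquired the lock */

    double observed = (double)repeats / (double)(n - 1);
    double expected = 1.0 / (double)nthreads;
    return observed / expected;
}
```

Under this assumed definition a value of 1 corresponds to fair arbitration, matching the slide, while a single thread monopolizing the lock pushes the value toward the number of threads.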