Transcript
Parallel Application Memory Scheduling
Eiman Ebrahimi*, Rustam Miftakhutdinov*, Chris Fallin‡,
Chang Joo Lee*+, Jose Joao*, Onur Mutlu‡, Yale N. Patt*

* HPS Research Group, The University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University
+ Intel Corporation, Austin
Background
[Diagram: Cores 0..N with a shared cache and memory controller on-chip, connected across the chip boundary to off-chip DRAM Banks 0..K — the shared memory resources.]
2
Background
Memory requests from different cores interfere in shared memory resources. For multi-programmed workloads, this interference affects system performance and fairness. What about a single multi-threaded application?
3
Memory System Interference in A Single Multi-Threaded Application
Inter-dependent threads from the same application slow each other down. Most importantly, the critical path of execution can be significantly slowed down.

The problem and goal are very different from interference between independent applications:
- Threads of the same application are interdependent
- Goal: reduce execution time of a single application
- No notion of fairness among the threads
4
Potential in A Single Multi-Threaded Application

[Bar chart: execution time of hist, mg, cg, is, bt, ft, and their gmean, normalized to a system using FR-FCFS memory scheduling.]

If all main-memory-related interference is ideally eliminated, execution time is reduced by 45% on average.

5
Outline
Problem Statement
Parallel Application Memory Scheduling
Evaluation
Conclusion
6
Parallel Application Memory Scheduler
- Identify the set of threads likely to be on the critical path as limiter threads
- Prioritize requests from limiter threads
  - Among limiter threads: prioritize requests from latency-sensitive threads (those with lower MPKI)
- Among non-limiter threads:
  - Shuffle priorities of non-limiter threads to reduce inter-thread memory interference
  - Prioritize requests from threads falling behind others in a parallel for-loop
8
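The prioritization order on this slide can be sketched in software. This is an illustrative model only: the request fields, MPKI values, and shuffled ranks below are assumptions for the sketch, while the real design implements this ordering as comparator logic in the memory controller.

```python
# Sketch of the PAMS request-prioritization order (assumed data layout):
#   1. limiter threads over non-limiter threads,
#   2. among limiters, lower MPKI (latency-sensitive) first,
#   3. among non-limiters, the current shuffled rank,
#   4. oldest request first as the final tie-breaker.
from dataclasses import dataclass

@dataclass
class MemRequest:
    thread_id: int
    arrival_time: int  # used as the FCFS tie-breaker

def priority_key(req, limiter_ids, mpki, shuffle_rank):
    """Lower tuple = higher priority."""
    if req.thread_id in limiter_ids:
        return (0, mpki[req.thread_id], req.arrival_time)
    return (1, shuffle_rank[req.thread_id], req.arrival_time)

# Usage: sort the pending request queue each scheduling cycle.
pending = [MemRequest(2, 5), MemRequest(0, 7), MemRequest(1, 3)]
limiters = {0, 1}                      # identified by the runtime system
mpki = {0: 1.2, 1: 8.5, 2: 3.0}        # misses per kilo-instruction
shuffle_rank = {2: 0}                  # periodically re-shuffled ranks
pending.sort(key=lambda r: priority_key(r, limiters, mpki, shuffle_rank))
```

With these inputs the limiter with the lower MPKI (thread 0) is served first, then the other limiter, then the non-limiter.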
Runtime System Limiter Identification
- Contended critical sections are often on the critical path of execution
- Extend the runtime system to identify the thread executing the most contended critical section as the limiter thread:
  - Track the total amount of time all threads wait on each lock in a given interval
  - Identify the lock with the largest waiting time as the most contended
  - The thread holding the most contended lock is a limiter, and this information is exposed to the memory controller
10
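The runtime-system mechanism above can be sketched as follows. This is a minimal model under assumed names (`ContentionTracker`, `record_wait`, etc. are illustrative, not the authors' API); the real runtime would hook these calls into its lock acquire/release paths.

```python
# Sketch of per-interval lock-contention tracking: accumulate the time all
# threads wait on each lock, then name the holder of the most contended
# lock as the limiter thread (exposed to the memory controller).
from collections import defaultdict

class ContentionTracker:
    def __init__(self):
        self.wait_time = defaultdict(float)  # lock id -> total wait this interval
        self.holder = {}                     # lock id -> thread currently holding it

    def record_wait(self, lock_id, seconds):
        # Called by each thread with the time it spent blocked on lock_id.
        self.wait_time[lock_id] += seconds

    def record_acquire(self, lock_id, thread_id):
        self.holder[lock_id] = thread_id

    def limiter_thread(self):
        # Thread holding the lock with the largest accumulated waiting time.
        if not self.wait_time:
            return None
        hottest = max(self.wait_time, key=self.wait_time.get)
        return self.holder.get(hottest)

    def new_interval(self):
        # Contention is re-measured each interval.
        self.wait_time.clear()

# Usage: thread 3 holds the lock that others wait on the most.
tracker = ContentionTracker()
tracker.record_acquire("lockA", thread_id=3)
tracker.record_wait("lockA", 0.8)
tracker.record_wait("lockB", 0.1)
limiter = tracker.limiter_thread()
```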
Prioritizing Requests from Limiter Threads
11
[Timeline diagram: Threads A-D between two barriers, executing critical sections 1 and 2, non-critical sections, and waiting on synchronization or locks. Critical section 1 is the most contended, so the thread holding it (D, then B, C, A over time) is identified as the limiter. Prioritizing the limiter thread's memory requests shortens the critical path and saves cycles before the barrier.]
Time-based classification of threads as latency- vs. BW-sensitive
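The latency- vs. bandwidth-sensitive distinction (lower vs. higher MPKI, measured over an interval) can be sketched as below. The threshold value is an assumed parameter for illustration, not a number from the talk.

```python
# Sketch of classifying a thread by misses per kilo-instruction (MPKI)
# over the last measurement interval. Low-MPKI threads are treated as
# latency-sensitive and prioritized among the limiter threads.
MPKI_THRESHOLD = 5.0  # assumed cutoff; a real design could tune this

def classify(misses: int, instructions: int) -> str:
    """Return 'latency-sensitive' for low-MPKI threads, else 'bw-sensitive'."""
    mpki = misses / (instructions / 1000.0)
    return "latency-sensitive" if mpki < MPKI_THRESHOLD else "bw-sensitive"
```

For example, a thread with 100 misses over 100,000 instructions (MPKI = 1.0) is latency-sensitive, while one with 2,000 misses over the same window (MPKI = 20.0) is bandwidth-sensitive.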
Conclusions

- Inter-thread main memory interference within a multi-threaded application increases execution time
- Parallel Application Memory Scheduling (PAMS) improves a single multi-threaded application's performance by:
  - Identifying a set of threads likely to be on the critical path and prioritizing requests from them
  - Periodically shuffling priorities of non-likely-critical threads to reduce inter-thread interference among them
- PAMS significantly outperforms:
  - The best previous memory scheduler designed for multi-programmed workloads
  - A memory scheduler that uses a state-of-the-art