Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich
NUMA multicores
[Figure: a NUMA multicore system: two processors (Processor 0, Processor 1), each with its own DRAM memory, memory controller (MC), and cache, coupled by an interconnect (IC)]
NUMA multicores
Two problems:
• NUMA: interconnect overhead
[Figure: processes A and B with their memories MA and MB on the two-processor system; accesses to memory attached to the other processor cross the interconnect (IC)]
NUMA multicores
Two problems:
• NUMA: interconnect overhead
• multicore: cache contention
[Figure: the same system; processes A and B also compete for a shared cache, in addition to the interconnect overhead]
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
Multi-clone experiments
• Intel Xeon E5520
• 4 clones of soplex (SPEC CPU2006)
– local clone: runs on the processor that holds its memory
– remote clone: runs on the other processor and accesses its memory over the interconnect
• Memory behavior of unrelated programs
[Figure: the four clones and their memories placed across the two processors (cores 0–7)]
[Figure: five placements of the four clones and their memories, giving local memory bandwidth fractions of 100%, 80%, 57%, 32%, and 0%]
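The placements above can be reproduced by pinning each clone to a core and binding its memory to a NUMA node. A minimal sketch (not part of the talk) that launches the clones through the numactl command-line tool; the core/node assignments and the soplex invocation are placeholders:

```python
import subprocess

# Assumed layout of the two-processor machine: cores 0-3 on NUMA node 0,
# cores 4-7 on node 1. clone -> (core it runs on, node holding its memory).
placement = {
    "clone0": (0, 0),  # local clone: runs next to its memory
    "clone1": (1, 0),  # local clone
    "clone2": (4, 0),  # remote clone: runs on node 1, memory stays on node 0
    "clone3": (5, 0),  # remote clone
}

procs = []
for name, (core, mem_node) in placement.items():
    # numactl pins the clone to one core and binds its allocations to one
    # memory node, so the local/remote split is fully under our control.
    cmd = ["numactl", f"--physcpubind={core}", f"--membind={mem_node}",
           "./soplex", "ref.mps"]  # benchmark arguments are placeholders
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```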
Performance of schedules
• Which is the best schedule?
• Baseline: single-program execution mode
[Figure: baseline configuration with a single clone, its memory, and the cache to itself]
[Figure: slowdown relative to the single-program baseline (1.0–2.4) as a function of the local memory bandwidth fraction (0%–100%), plotted for the local clones, the remote clones, and their average]
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Step 1: Maximum-local mapping
[Figure: processes A–D and their memories MA–MD; each process is placed on the processor that holds its memory]
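A minimal sketch of the maximum-local idea (names and data structures are illustrative, not the paper's implementation): each process is placed on the processor that holds the largest share of its memory, so as many accesses as possible stay local.

```python
def maximum_local_mapping(memory_bytes):
    """Step 1 (sketch): place each process on the processor that holds most
    of its memory, maximizing the fraction of local accesses.

    memory_bytes: dict process -> {numa_node: bytes allocated on that node}
    Returns: dict process -> chosen processor
    """
    return {proc: max(per_node, key=per_node.get)
            for proc, per_node in memory_bytes.items()}

# Example with the four processes from the slides; the byte counts are made up.
memory = {
    "A": {0: 512 << 20},                # MA entirely on processor 0
    "B": {0: 256 << 20},                # MB on processor 0
    "C": {1: 384 << 20},                # MC on processor 1
    "D": {1: 128 << 20, 0: 16 << 20},   # MD mostly on processor 1
}
print(maximum_local_mapping(memory))    # {'A': 0, 'B': 0, 'C': 1, 'D': 1}
```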
Default OS scheduling
[Figure: the default OS scheduler places processes A–D on cores without regard to where their memories MA–MD are allocated]
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Step 2: Cache-aware refinement
In an SMP:
[Figure: processes A–D, ranked by their performance degradation (NUMA penalty), are spread across the two caches to balance cache pressure]
Step 2: Cache-aware refinement
In a NUMA system:
[Figure: the refinement on the NUMA system; processes are again ranked by performance degradation (NUMA penalty), and each process's NUMA allowance is taken into account when trading cache pressure against remote accesses]
Performance factors
Two factors cause performance degradation:
1. NUMA penalty: slowdown due to remote memory access
2. cache pressure
– local processes: misses / KINST (MPKI)
– remote processes: MPKI × NUMA penalty
[Figure: NUMA penalty (1.0–1.5) of the SPEC programs]
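These two factors translate directly into a per-processor cache-pressure metric: a local process contributes its MPKI, a remote process contributes MPKI multiplied by its NUMA penalty. Below is a minimal sketch of that metric together with one plausible way the cache-aware refinement step could use it; the imbalance threshold and the victim-selection rule are assumptions for illustration, not the paper's exact policy.

```python
def cache_pressure(mapping, memory_node, mpki, numa_penalty):
    """Cache pressure per processor, as defined on the slide:
    local processes contribute MPKI, remote ones MPKI * NUMA penalty."""
    pressure = {0: 0.0, 1: 0.0}
    for proc, cpu in mapping.items():
        if memory_node[proc] == cpu:                 # local process
            pressure[cpu] += mpki[proc]
        else:                                        # remote process
            pressure[cpu] += mpki[proc] * numa_penalty[proc]
    return pressure

def cache_aware_refinement(mapping, memory_node, mpki, numa_penalty,
                           imbalance=1.5):
    """Step 2 (illustrative sketch): if one cache is much more loaded than
    the other, move the process with the smallest NUMA penalty off the busy
    processor, since it loses the least from running remote."""
    pressure = cache_pressure(mapping, memory_node, mpki, numa_penalty)
    busy, idle = (0, 1) if pressure[0] >= pressure[1] else (1, 0)
    if pressure[busy] > imbalance * pressure[idle]:
        on_busy = [p for p, cpu in mapping.items() if cpu == busy]
        if on_busy:
            victim = min(on_busy, key=lambda p: numa_penalty[p])
            mapping[victim] = idle                   # victim now runs remotely
    return mapping
```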
Implementation
• User-mode extension to the Linux scheduler
• Performance metrics
– hardware performance counter feedback
– NUMA penalty
  • perfect information from program traces
  • estimate based on MPKI
• All memory for a process allocated on one processor
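Since N-MASS runs as a user-mode extension, the mapping it chooses can be enforced with standard affinity calls. A minimal sketch using Python's os.sched_setaffinity (Linux only); the core lists per processor are an assumption about the test machine:

```python
import os

# Assumed core layout: cores 0-3 belong to processor 0, cores 4-7 to processor 1.
CORES = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

def apply_mapping(pid_to_processor):
    """Restrict each process to the cores of the processor chosen for it."""
    for pid, processor in pid_to_processor.items():
        os.sched_setaffinity(pid, CORES[processor])  # Linux-only system call
```

Binding the memory itself to one processor, as the last bullet assumes, is a separate step (e.g., via numactl or libnuma); affinity calls only control where the threads run.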
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
Workloads
• SPEC CPU2006 subset
• 11 multi-program workloads (WL1–WL11)
– 4-program workloads (WL1–WL9)
– 8-program workloads (WL10, WL11)
[Figure: NUMA penalty (0.9–1.5) versus MPKI (log scale, 0.00001–100) for the SPEC programs, marking used and not-used programs and ranging from CPU-bound to memory-bound]
Memory allocation setup
• Where the memory of each process is allocated influences performance
• Controlled setup: memory allocation maps
Memory allocation maps
[Figure: processes A–D and their memories MA–MD on the two processors; allocation map 0000 places all four memories on one processor (unbalanced), map 0011 splits them evenly between the two processors (balanced)]
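Read left to right, each digit of an allocation map can be taken as the processor that holds the corresponding process's memory. A small sketch of this reading, which is inferred from the slides rather than stated on them:

```python
def decode_allocation_map(alloc_map, processes=("A", "B", "C", "D")):
    """One digit per process: the processor holding that process's memory."""
    return {proc: int(digit) for proc, digit in zip(processes, alloc_map)}

print(decode_allocation_map("0000"))  # unbalanced: every memory on processor 0
print(decode_allocation_map("0011"))  # balanced: two memories per processor
```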
Evaluation
• Baseline: Linux average
– Linux scheduler is non-deterministic
– average performance degradation over all possible cases
• N-MASS with perfect NUMA penalty information
WL9: Linux average
[Figure: average slowdown relative to single-program mode (1.0–1.6) for the Linux average across allocation maps 0000, 1000, 0100, 0010, 0001, 1100, 1010, 1001]
WL9: N-MASS
[Figure: the same plot with N-MASS added]
WL1: Linux average and N-MASS
[Figure: the same comparison for WL1]
N-MASS performance
• N-MASS reduces performance degradation by up to 22%
• Which factor is more important: interconnect overhead or cache contention?
• Compare:
– maximum-local
– N-MASS (maximum-local + cache refinement step)
Data-locality vs. cache balancing (WL9)
[Figure: performance improvement relative to the Linux average (-10% to 25%) across the allocation maps, for maximum-local and for N-MASS (maximum-local + cache refinement step)]
Data-locality vs. cache balancing (WL1)
[Figure: the same comparison for WL1]
Data locality vs. cache balancing
• Data-locality more important than cache balancing
• Cache-balancing gives performance benefits mostly with unbalanced allocation maps
• What if information about the NUMA penalty is not available?
Estimating NUMA penalty
• NUMA penalty is not directly measurable
• Estimate: fit a linear regression onto MPKI data
[Figure: NUMA penalty (1.0–1.5) as a function of MPKI (0–50)]
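A minimal sketch of such an estimate: fit a straight line to (MPKI, NUMA penalty) pairs measured offline, then predict the penalty of a process from its current MPKI. The sample points below are made up for illustration, not the paper's data:

```python
import numpy as np

# Offline calibration: MPKI and the measured NUMA penalty (slowdown when the
# program runs with remote memory). Values are illustrative only.
mpki = np.array([0.1, 1.0, 5.0, 12.0, 25.0, 40.0])
penalty = np.array([1.00, 1.02, 1.08, 1.18, 1.32, 1.45])

# Least-squares fit: penalty ~= slope * MPKI + intercept.
slope, intercept = np.polyfit(mpki, penalty, deg=1)

def estimate_numa_penalty(current_mpki):
    """Estimate a process's NUMA penalty from its measured MPKI."""
    return max(1.0, slope * current_mpki + intercept)

print(round(estimate_numa_penalty(20.0), 2))
```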
Estimate-based N-MASS: performance
[Figure: performance improvement relative to the Linux average (-2% to 8%) for workloads WL1–WL11, comparing maximum-local, N-MASS, and estimate-based N-MASS]
Conclusions
• N-MASS: NUMA-multicore-aware scheduler
• Data locality optimizations more beneficial than cache contention avoidance
• Better performance metrics needed for scheduling
Thank you! Questions?