Memory System Performance in a NUMA Multicore Multiprocessor
Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich
Feb 08, 2016
Summary
• NUMA multicore systems are unfair to local memory accesses
• Local execution sometimes suboptimal
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
NUMA multicores: how it happened
First generation: SMP
[Bar chart: total bandwidth (MB/s) vs. number of active cores (1-8) for the SMP system; diagram: eight cores connected through bus controllers (BusC) to a Northbridge memory controller (MC) and shared DRAM]
NUMA multicores: how it happened
Next generation: NUMA
[Bar chart: SMP total bandwidth vs. active cores; diagram: two quad-core processors, each with its own memory controller (MC) and local DRAM, linked by an interconnect (IC)]
NUMA multicores: how it happened
Next generation: NUMA
[Bar chart: total bandwidth vs. active cores, comparing SMP with NUMA (local accesses)]
NUMA multicores: how it happened
[Bar chart: total bandwidth vs. active cores, comparing SMP, NUMA (local), and NUMA (remote)]
Next generation: NUMA
Bandwidth sharing
[Diagram: cores from both processors access Processor 0's DRAM]
• Frequent scenario: bandwidth shared between cores
• Sharing model for the Intel Nehalem
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
Evaluation system
Intel Nehalem E5520
2 x 4 cores
8 MB level 3 cache
12 GB DDR3 RAM
5.86 GT/s QPI
[Diagram: two quad-core processors (Processor 0: cores 0-3, Processor 1: cores 4-7), each with a level 3 cache, a Global Queue, a memory controller (MC), and local DRAM, connected via QPI]
Bandwidth sharing: local accesses
[Diagram: cores 0 and 3 on Processor 0 access the local DRAM through Processor 0's Global Queue and memory controller]
Bandwidth sharing: remote accesses
[Diagram: cores on one processor access the other processor's DRAM remotely, passing through both Global Queues and the QPI interconnect]
Bandwidth sharing: combined accesses
[Diagram: local accesses from cores 0 and 3 and remote accesses from cores 4 and 5 meet at Processor 0's Global Queue and DRAM]
Global Queue
• Mechanism to arbitrate between different types of memory accesses
• We look at fairness of the Global Queue:
– local memory accesses
– remote memory accesses
– combined memory accesses
Benchmark program
• STREAM triad
for (i = 0; i < SIZE; i++) {
    a[i] = b[i] + SCALAR * c[i];
}
• Multiple co-executing triad clones
Multi-clone experiments
• All memory allocated on Processor 0
• Local clones run on Processor 0 (where the memory is); remote clones run on Processor 1
• Example benchmark configurations: (2L, 0R), (0L, 3R), (2L, 3R)
[Diagram: clones (C) placed on the cores of Processor 0 and Processor 1 for each example configuration]
GQ fairness: local accesses
[Bar chart: total bandwidth (MB/s) per core for local-only configurations 1L-4L; diagram: clones on Processor 0 accessing the local DRAM]
GQ fairness: remote accesses
[Bar charts: total bandwidth (MB/s) per core for remote-only configurations 1R-4R, shown alongside the local-only results (1L/1R through 4L/4R); diagram: clones on Processor 1 accessing Processor 0's DRAM]
Global Queue fairness
• The Global Queue is fair when there are only local or only remote accesses in the system
• What about combined accesses?
GQ fairness: combined accesses
Execute clones in all possible configurations
[Matrix: number of local clones (0-4) × number of remote clones (0-4); the cell (2L, 3R) is highlighted as an example]
GQ fairness: combined accesses
[Bar chart: total bandwidth (MB/s) for configurations (4L, 0R) through (4L, 4R), split into the local and remote clones' shares]
Combined accesses
[Bar chart: total bandwidth (MB/s) for configurations (1L, 1R) through (4L, 1R), split into the remote clone's and each local clone's share]
Combined accesses
• In configuration (4L, 1R) the remote clone gets 30% more bandwidth than a local clone
• Remote execution can be better than local
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
Bandwidth sharing model
total bandwidth = α · local bandwidth + (1 − α) · remote bandwidth
[Diagram: one core with local accesses and one core with remote accesses share Processor 0's Global Queue and DRAM]
Sharing factor (α)
• Characterizes the fairness of the Global Queue
• Dependence of sharing factor on contention?
Contention affects sharing factor
[Diagram: a local clone and a remote clone access Processor 0's DRAM while additional local contender clones run on Processor 0]
Contention affects sharing factor
[Plot: sharing factor (α) vs. additional local contention (+0L through +3L), on a 0-50% scale; α drops as contenders are added]
Combined accesses (revisited)
[Bar chart repeated: per-clone bandwidth for configurations (1L, 1R) through (4L, 1R)]
Contention affects sharing factor
• Sharing factor decreases with contention
• With local contention remote execution becomes more favorable
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
The next generation
Intel Westmere X5680
2 x 6 cores
12 MB level 3 cache
144 GB DDR3 RAM
6.4 GT/s QPI
[Diagram: two six-core Westmere processors, each with a level 3 cache, a Global Queue, an IMC, and local DRAM, connected via QPI]
The next generation
[Bar chart: total bandwidth (MB/s) for configurations (1L, 1R) through (6L, 1R) on Westmere, split into the remote clone's and each local clone's share]
Conclusions
• Optimizing for data locality can be suboptimal
• Applications:
– OS scheduling (see ISMM’11 paper)
– data placement and computation scheduling
Thank you! Questions?