Memory System Performance in a NUMA Multicore Multiprocessor
Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich
Feb 08, 2016
Summary
• NUMA multicore systems are unfair to local memory accesses
• Local execution sometimes suboptimal
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
NUMA multicores: how it happened
First generation: SMP
[Bar chart: total bandwidth (MB/s) vs. number of active cores (1-8) for the SMP system; diagram: eight cores connected through bus controllers (BusC) to a Northbridge memory controller (MC) and shared DRAM]
NUMA multicores: how it happened
Next generation: NUMA
[Bar chart: SMP total bandwidth vs. active cores; diagram: two quad-core processors, each with its own memory controller (MC) and local DRAM, linked by an interconnect (IC)]
NUMA multicores: how it happened
Next generation: NUMA
[Bar chart: total bandwidth vs. active cores, comparing SMP with NUMA (local accesses)]
NUMA multicores: how it happened
[Bar chart: total bandwidth vs. active cores, comparing SMP, NUMA (local), and NUMA (remote)]
Next generation: NUMA
Bandwidth sharing
[Diagram: cores from both processors access Processor 0's DRAM]
• Frequent scenario: bandwidth shared between cores
• Sharing model for the Intel Nehalem
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
Evaluation system
Intel Nehalem E5520
2 x 4 cores
8 MB level 3 cache
12 GB DDR3 RAM
5.86 GT/s QPI
[Diagram: two quad-core processors (Processor 0: cores 0-3, Processor 1: cores 4-7), each with a level 3 cache, a Global Queue, a memory controller (MC), and local DRAM, connected via QPI]
Bandwidth sharing: local accesses
[Diagram: cores 0 and 3 on Processor 0 access the local DRAM through Processor 0's Global Queue and memory controller]
Bandwidth sharing: remote accesses
[Diagram: cores on one processor access the other processor's DRAM remotely, passing through both Global Queues and the QPI interconnect]
Bandwidth sharing: combined accesses
[Diagram: local accesses from cores 0 and 3 and remote accesses from cores 4 and 5 meet at Processor 0's Global Queue and DRAM]
Global Queue
• Mechanism to arbitrate between different types of memory accesses
• We look at fairness of the Global Queue:
– local memory accesses
– remote memory accesses
– combined memory accesses
Benchmark program
• STREAM triad
for (i = 0; i < SIZE; i++) {
    a[i] = b[i] + SCALAR * c[i];
}
• Multiple co-executing triad clones
Multi-clone experiments
• All memory allocated on Processor 0
• Local clones run on Processor 0 (where the memory is); remote clones run on Processor 1
• Example benchmark configurations: (2L, 0R), (0L, 3R), (2L, 3R)
[Diagram: clones (C) placed on the cores of Processor 0 and Processor 1 for each example configuration]
GQ fairness: local accesses
[Bar chart: total bandwidth (MB/s) per core for local-only configurations 1L-4L; diagram: clones on Processor 0 accessing the local DRAM]
GQ fairness: remote accesses
[Bar charts: total bandwidth (MB/s) per core for remote-only configurations 1R-4R, shown alongside the local-only results (1L/1R through 4L/4R); diagram: clones on Processor 1 accessing Processor 0's DRAM]
Global Queue fairness
• The Global Queue is fair when there are only local or only remote accesses in the system
• What about combined accesses?
GQ fairness: combined accesses
Execute clones in all possible configurations
[Matrix: number of local clones (0-4) × number of remote clones (0-4); the cell (2L, 3R) is highlighted as an example]
GQ fairness: combined accesses
[Bar chart: total bandwidth (MB/s) for configurations (4L, 0R) through (4L, 4R), split into the local and remote clones' shares]
Combined accesses
[Bar chart: total bandwidth (MB/s) for configurations (1L, 1R) through (4L, 1R), split into the remote clone's and each local clone's share]
Combined accesses
• In configuration (4L, 1R) the remote clone gets 30% more bandwidth than a local clone
• Remote execution can be better than local
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
Bandwidth sharing model
total bandwidth = α · local bandwidth + (1 − α) · remote bandwidth
[Diagram: one core with local accesses and one core with remote accesses share Processor 0's Global Queue and DRAM]
Sharing factor (α)
• Characterizes the fairness of the Global Queue
• Dependence of sharing factor on contention?
Contention affects sharing factor
[Diagram: a local clone and a remote clone access Processor 0's DRAM while additional local contender clones run on Processor 0]
Contention affects sharing factor
[Plot: sharing factor (α) vs. additional local contention (+0L through +3L), on a 0-50% scale; α drops as contenders are added]
Combined accesses (revisited)
[Bar chart repeated: per-clone bandwidth for configurations (1L, 1R) through (4L, 1R)]
Contention affects sharing factor
• Sharing factor decreases with contention
• With local contention remote execution becomes more favorable
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
The next generation
Intel Westmere X5680
2 x 6 cores
12 MB level 3 cache
144 GB DDR3 RAM
6.4 GT/s QPI
[Diagram: two six-core Westmere processors, each with a level 3 cache, a Global Queue, an IMC, and local DRAM, connected via QPI]
The next generation
[Bar chart: total bandwidth (MB/s) for configurations (1L, 1R) through (6L, 1R) on Westmere, split into the remote clone's and each local clone's share]
Conclusions
• Optimizing for data locality can be suboptimal
• Applications:
– OS scheduling (see ISMM’11 paper)
– data placement and computation scheduling
Thank you! Questions?