Memory System Performance in a NUMA Multicore Multiprocessor

Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich

Feb 08, 2016

Transcript
Page 1: Memory System Performance in a NUMA  Multicore  Multiprocessor

1

Memory System Performance in a NUMA Multicore Multiprocessor

Zoltan Majo and Thomas R. Gross

Department of Computer Science, ETH Zurich

Page 2: Memory System Performance in a NUMA  Multicore  Multiprocessor

2

Summary

• NUMA multicore systems are unfair to local memory accesses

• Local execution sometimes suboptimal

Page 3: Memory System Performance in a NUMA  Multicore  Multiprocessor

3

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

Page 4: Memory System Performance in a NUMA  Multicore  Multiprocessor

4

NUMA multicores: how it happened

[Figure: total bandwidth (labeled [GB/s]; ticks 0 to 25000) vs. active cores (1 to 8); series: SMP]

[Diagram: cores 0-7 attached through bus controllers (BusC) to a Northbridge with a memory controller (MC) and DRAM memory]

First generation: SMP

Page 5: Memory System Performance in a NUMA  Multicore  Multiprocessor

5

NUMA multicores: how it happened

[Figure: total bandwidth (labeled [GB/s]; ticks 0 to 25000) vs. active cores (1 to 8); series: SMP]

[Diagram: cores 0-3 and cores 4-7 with bus controllers (BusC), Northbridge, memory controllers (MC), two DRAM memories, and IC blocks]

Next generation: NUMA

Page 6: Memory System Performance in a NUMA  Multicore  Multiprocessor

6

NUMA multicores: how it happened

[Figure: total bandwidth (labeled [GB/s]; ticks 0 to 25000) vs. active cores (1 to 8); series: SMP, NUMA (local)]

[Diagram: cores 0-3 and cores 4-7, each group with its own MC, DRAM memory, and IC]

Next generation: NUMA

Page 7: Memory System Performance in a NUMA  Multicore  Multiprocessor

7

NUMA multicores: how it happened

[Figure: total bandwidth (labeled [GB/s]; ticks 0 to 25000) vs. active cores (1 to 8); series: SMP, NUMA (local), NUMA (remote)]

[Diagram: cores 0-3 and cores 4-7, each group with its own MC, DRAM memory, and IC]

Next generation: NUMA

Page 8: Memory System Performance in a NUMA  Multicore  Multiprocessor

8

Bandwidth sharing

• Frequent scenario: bandwidth shared between cores

• Sharing model for the Intel Nehalem

[Diagram: cores 0-3 and cores 4-7, each group with its own MC, DRAM memory, and IC]

Page 9: Memory System Performance in a NUMA  Multicore  Multiprocessor

9

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

Page 10: Memory System Performance in a NUMA  Multicore  Multiprocessor

10

Evaluation system

Intel Nehalem E5520

2 x 4 cores

8 MB level 3 cache

12 GB DDR3 RAM

5.86 GT/s QPI

[Diagram: Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with a level 3 cache, Global Queue, MC with DRAM memory, and QPI]

Page 11: Memory System Performance in a NUMA  Multicore  Multiprocessor

11

Bandwidth sharing: local accesses

[Diagram: Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with a level 3 cache, Global Queue, MC with DRAM memory, and QPI; highlighted: cores 0 and 3, Global Queue, DRAM memory]

Page 12: Memory System Performance in a NUMA  Multicore  Multiprocessor

12

Bandwidth sharing: remote accesses

[Diagram: Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with a level 3 cache, Global Queue, MC with DRAM memory, and QPI; highlighted: cores 4 and 5, Global Queue, DRAM memory; also labeled: cores 0 and 3]

Page 13: Memory System Performance in a NUMA  Multicore  Multiprocessor

13

Bandwidth sharing: combined accesses

[Diagram: Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with a level 3 cache, Global Queue, MC with DRAM memory, and QPI; highlighted: cores 0, 3, 4, and 5, both Global Queues, DRAM memory]

Page 14: Memory System Performance in a NUMA  Multicore  Multiprocessor

14

Global Queue

• Mechanism to arbitrate between different types of memory accesses

• We look at fairness of the Global Queue:

– local memory accesses

– remote memory accesses

– combined memory accesses

Page 15: Memory System Performance in a NUMA  Multicore  Multiprocessor

15

Benchmark program

• STREAM triad

for (i = 0; i < SIZE; i++) {
    a[i] = b[i] + SCALAR * c[i];
}

• Multiple co-executing triad clones
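A minimal sketch of how one triad clone could be written and its bandwidth derived from a measured run time. The function names are illustrative, not from the slides. The byte count assumes 8-byte doubles and the usual STREAM accounting of two reads and one write per iteration; the slides do not state how bandwidth was computed.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM-style triad kernel: a[i] = b[i] + scalar * c[i]. */
void triad(double *a, const double *b, const double *c,
           size_t n, double scalar)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/*
 * Bandwidth in MB/s (10^6 bytes per second) for one triad pass over
 * n elements that took `seconds` to run.
 * Assumption: 3 arrays touched per iteration (read b, read c, write a),
 * write-allocate traffic not counted.
 */
double triad_bandwidth_mbs(size_t n, double seconds)
{
    assert(seconds > 0.0);
    double bytes = 3.0 * (double)sizeof(double) * (double)n;
    return bytes / seconds / 1.0e6;
}
```

A caller would time one or more calls to `triad` on arrays much larger than the level 3 cache and pass the elapsed time to `triad_bandwidth_mbs`.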

Page 16: Memory System Performance in a NUMA  Multicore  Multiprocessor

16

Multi-clone experiments

• All memory allocated on Processor 0

• Local clones: run on Processor 0. Remote clones: run on Processor 1.

• Example benchmark configurations, written (xL, yR) for x local and y remote clones: (2L, 0R), (0L, 3R), (2L, 3R)
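The clone placement for a configuration (xL, yR) can be sketched as a mapping from clone index to core. This is an illustration only: `clone_core` is a hypothetical helper, and the core numbering (cores 0-3 on Processor 0, cores 4-7 on Processor 1) is taken from the diagrams in the slides and may differ from how an operating system numbers the cores.

```c
#include <assert.h>

#define CORES_PER_PROCESSOR 4   /* evaluation system in the slides: 2 x 4 cores */

/*
 * Core on which clone `idx` runs in configuration (n_local L, n_remote R).
 * Clones 0 .. n_local-1 are local (Processor 0, where the memory lives);
 * the remaining clones are remote (Processor 1).
 * Returns -1 if idx does not name a clone of this configuration.
 */
int clone_core(int n_local, int n_remote, int idx)
{
    assert(n_local >= 0 && n_local <= CORES_PER_PROCESSOR);
    assert(n_remote >= 0 && n_remote <= CORES_PER_PROCESSOR);

    if (idx < 0 || idx >= n_local + n_remote)
        return -1;
    if (idx < n_local)
        return idx;                                   /* cores 0-3 */
    return CORES_PER_PROCESSOR + (idx - n_local);     /* cores 4-7 */
}
```

For (2L, 3R) this places the two local clones on cores 0 and 1 and the three remote clones on cores 4, 5, and 6. On Linux, the returned core number could then be used to pin each clone, for example with `sched_setaffinity`.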

Page 17: Memory System Performance in a NUMA  Multicore  Multiprocessor

17

GQ fairness: local accesses

[Figure: total bandwidth (labeled [GB/s]; ticks 0 to 14000) for benchmark configurations 1L, 2L, 3L, 4L; series: Core 0, Core 1, Core 2, Core 3]

[Diagram: Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with cache, GQ, IMC with DRAM, and QPI; clones (C) run on Processor 0 and access its DRAM memory]

Page 18: Memory System Performance in a NUMA  Multicore  Multiprocessor

18

GQ fairness: remote accesses

[Figure: total bandwidth (labeled [GB/s]; ticks 0 to 14000) for benchmark configurations 1R, 2R, 3R, 4R; series: Core 0, Core 1, Core 2, Core 3]

[Figure: the same measurement shown next to the local configurations, in the order 1L, 1R, 2L, 2R, 3L, 3R, 4L, 4R]

[Diagram: Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with cache, GQ, IMC with DRAM, and QPI; clones (C) access DRAM memory]

Page 19: Memory System Performance in a NUMA  Multicore  Multiprocessor

19

Global Queue fairness

• Global Queue is fair when there are only local or only remote accesses in the system

• What about combined accesses?
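The slides do not define a numeric fairness metric. One hypothetical way to turn per-clone bandwidth measurements into a single number is the ratio of the smallest to the largest bandwidth, sketched below; a value near 1 means the clones receive nearly equal shares.

```c
#include <assert.h>

/*
 * Hypothetical fairness measure (not from the slides):
 * min(bw) / max(bw) over the n per-clone bandwidths in bw[].
 * 1.0 means perfectly equal shares; smaller values mean less fair.
 */
double fairness_ratio(const double *bw, int n)
{
    assert(n > 0);
    double lo = bw[0];
    double hi = bw[0];
    for (int i = 1; i < n; i++) {
        if (bw[i] < lo) lo = bw[i];
        if (bw[i] > hi) hi = bw[i];
    }
    assert(hi > 0.0);
    return lo / hi;
}
```

The same function can be applied to a mixed set of local and remote clones to check whether combined accesses are still shared evenly.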

Page 20: Memory System Performance in a NUMA  Multicore  Multiprocessor

20

GQ fairness: combined accesses

Execute clones in all possible configurations

[Table: grid of configurations, # local clones (0 to 4) by # remote clones (0 to 4); example cell: (2L, 3R)]

Page 21: Memory System Performance in a NUMA  Multicore  Multiprocessor

21

GQ fairness: combined accesses

Execute clones in all possible configurations

[Table: grid of configurations, # local clones (0 to 4) by # remote clones (0 to 4)]

Page 22: Memory System Performance in a NUMA  Multicore  Multiprocessor

22

GQ fairness: combined accesses

[Figure: total bandwidth (labeled [GB/s]; ticks 0 to 14,000) for benchmark configurations (4L, 0R), (4L, 1R), (4L, 2R), (4L, 3R), (4L, 4R); series: local clones, remote clones]

Page 23: Memory System Performance in a NUMA  Multicore  Multiprocessor

23

GQ fairness: combined accesses

Execute clones in all possible configurations

[Table: grid of configurations, # local clones (0 to 4) by # remote clones (0 to 4)]

Page 24: Memory System Performance in a NUMA  Multicore  Multiprocessor

24

Combined accesses

[Figure: total bandwidth (labeled [GB/s]; ticks 0 to 16000) for configurations (1L,1R), (2L,1R), (3L,1R), (4L,1R); series: remote clone, local clone 1, local clone 2, local clone 3, local clone 4]

Page 25: Memory System Performance in a NUMA  Multicore  Multiprocessor

25

Combined accesses

• In configuration (4L, 1R), the remote clone gets 30% more bandwidth than a local clone

• Remote execution can be better than local
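To put the 30% figure in terms of shares, the sketch below computes the fraction of the total bandwidth that a single remote clone receives when it gets a given multiple of one local clone's bandwidth. It assumes all local clones receive equal bandwidth, which the slides report for local-only runs but do not state for this configuration; the function names are illustrative.

```c
#include <assert.h>

/*
 * Share of total bandwidth received by one remote clone that gets
 * `ratio` times the bandwidth of each of n_local local clones.
 * Assumption: all local clones receive the same bandwidth.
 */
double remote_share(int n_local, double ratio)
{
    assert(n_local >= 0 && ratio > 0.0);
    return ratio / ((double)n_local + ratio);
}

/* Share one clone would get if all n_local + 1 clones were treated equally. */
double equal_share(int n_local)
{
    assert(n_local >= 0);
    return 1.0 / ((double)n_local + 1.0);
}
```

For (4L, 1R) with a ratio of 1.3, `remote_share(4, 1.3)` is 1.3 / 5.3, about 24.5% of the total, compared with the 20% that `equal_share(4)` gives for an even split among five clones.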

Page 26: Memory System Performance in a NUMA  Multicore  Multiprocessor

26

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

Page 27: Memory System Performance in a NUMA  Multicore  Multiprocessor

27

Bandwidth sharing model

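$\text{bandwidth}_{\text{total}} = (1 - \beta) \times \text{bandwidth}_{\text{local}} + \beta \times \text{bandwidth}_{\text{remote}}$

[Diagram: Processor 0 (cores 0-3) and Processor 1 (cores 4-7), each with a level 3 cache, Global Queue, IMC with DRAM memory, and QPI; two clones (C) access the same DRAM memory]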

Page 28: Memory System Performance in a NUMA  Multicore  Multiprocessor

28

Sharing factor (β)

• Characterizes the fairness of the Global Queue

• Dependence of sharing factor on contention?
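A minimal sketch of the sharing model as a function. It assumes the model is a weighted combination of a local and a remote bandwidth term, with the sharing factor (written `beta` here) weighting the remote term and 1 - beta weighting the local term; the exact form and the meaning of the two terms should be checked against the original slides or the paper.

```c
#include <assert.h>

/*
 * Assumed sharing model:
 *   total = (1 - beta) * bw_local + beta * bw_remote
 * beta is the sharing factor, expected in [0, 1].
 * bw_local and bw_remote must use the same unit; the result has that unit.
 */
double total_bandwidth(double beta, double bw_local, double bw_remote)
{
    assert(beta >= 0.0 && beta <= 1.0);
    return (1.0 - beta) * bw_local + beta * bw_remote;
}
```

Under this reading, beta = 0 reduces the total to the local term alone, and beta = 1 to the remote term alone.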

Page 29: Memory System Performance in a NUMA  Multicore  Multiprocessor

29

Contention affects sharing factor

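[Diagram: experiment setup with DRAM on Processor 0; one clone (C) accesses it locally, one clone (C) accesses it over QPI, and further clones (C) act as contenders]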

Page 30: Memory System Performance in a NUMA  Multicore  Multiprocessor

30

Contention affects sharing factor

[Figure: sharing factor (β), axis 0% to 50%, vs. additional contention (+0L, +1L, +2L, +3L)]

Page 31: Memory System Performance in a NUMA  Multicore  Multiprocessor

31

Combined accesses

[Figure: total bandwidth (labeled [GB/s]; ticks 0 to 16000) for configurations (1L,1R), (2L,1R), (3L,1R), (4L,1R); series: remote clone, local clone 1, local clone 2, local clone 3, local clone 4]

Page 32: Memory System Performance in a NUMA  Multicore  Multiprocessor

32

Contention affects sharing factor

• Sharing factor decreases with contention

• With local contention remote execution becomes more favorable

Page 33: Memory System Performance in a NUMA  Multicore  Multiprocessor

33

Outline

• NUMA multicores: how it happened

• Experimental evaluation: Intel Nehalem

• Bandwidth sharing model

• The next generation: Intel Westmere

Page 34: Memory System Performance in a NUMA  Multicore  Multiprocessor

34

The next generation

Intel Westmere X5680

2 x 6 cores

12 MB level 3 cache

144 GB DDR3 RAM

6.4 GT/s QPI

[Diagram: two processors with cores labeled 0 to B (six per processor), each with a level 3 cache, Global Queue, IMC with DRAM memory, and QPI]

Page 35: Memory System Performance in a NUMA  Multicore  Multiprocessor

35

The next generation

[Figure: total bandwidth (labeled [GB/s]; ticks 0 to 12000) for benchmark configurations (1L, 1R), (2L, 1R), (3L, 1R), (4L, 1R), (5L, 1R), (6L, 1R); series: remote clone, local clone 1, local clone 2, local clone 3, local clone 4, local clone 5, local clone 6]

Page 36: Memory System Performance in a NUMA  Multicore  Multiprocessor

36

Conclusions

• Optimizing for data locality can be suboptimal

• Applications:

– OS scheduling (see ISMM’11 paper)

– data placement and computation scheduling

Page 37: Memory System Performance in a NUMA  Multicore  Multiprocessor

37

Thank you! Questions?