An Empirical Guide to Scalable Persistent Memory
Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R. Dulloor, Jishen Zhao, Steven Swanson
Non-Volatile Systems Laboratory, Department of Computer Science & Engineering, University of California, San Diego
Department of Electrical, Computer & Energy Engineering, University of Colorado, Boulder
Transcript
1
An Empirical Guide to Scalable Persistent Memory
Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu,
Subramanya R. Dulloor, Jishen Zhao, Steven Swanson
Non-Volatile Systems Laboratory, Department of Computer Science & Engineering
University of California, San Diego
Department of Electrical, Computer & Energy Engineering
University of Colorado, Boulder
2
Optane DIMMs are Here!
• Not Just Slow, Dense DRAM™
• Slower, denser media
  -> More complex architecture
  -> Second-order performance anomalies
  -> Fundamentally different
6
Basics: Emulation is misleading
[Chart: RocksDB throughput on real Optane vs. DRAM-based emulation]
7
Outline
• Background
• Basics: Optane DIMM Performance
• Lessons: Optane DIMM Best Practices
• Conclusion
8
Background
9
Background: Optane in the Machine
[Diagram, built up across slides 9–16: each core has private L1/L2 caches and shares an L3 cache; two integrated memory controllers (iMCs) connect the socket to DRAM and Optane DIMMs. In App Direct Mode, software accesses the Optane DIMMs directly alongside DRAM. In Memory Mode, DRAM acts as a direct-mapped cache ("near memory") in front of the Optane DIMMs ("far memory").]
17
Background: Optane Internals
[Diagram, built up across slides 17–23: stores from the CPU land in the write pending queue (WPQ) inside the iMC; the WPQ sits within the ADR (Asynchronous DRAM Refresh) power-fail domain. On the Optane DIMM, the Optane controller services requests through a small Optane buffer, translates addresses via the AIT (backed by an on-DIMM AIT cache), and accesses the Optane media in 256 B blocks.]
24
Background: Optane Interleaving
• Optane is interleaved across NVDIMMs at 4 KB granularity.
[Diagram, slides 24–25: with six DIMMs, physical addresses 0, 4 KB, 8 KB, 12 KB, 16 KB, and 20 KB map to DIMMs 1–6 in turn; at 24 KB the mapping wraps back to DIMM 1.]
26
Basics: How does Optane DC perform?
27
Basics: Our Approach
• Microbenchmark sweeps across the state space
  – Access patterns (random, sequential)
  – Operations (read, ntstore, clflush, etc.)
  – Access/stride size
  – Power budget
  – NUMA configuration
  – Address space interleaving
• Targeted experiments
• Total: 10,000 experiments
28
Test Platform
• CPU
  – Intel Cascade Lake, 24 cores at 2.2 GHz in 2 sockets
  – Hyperthreading off
• DRAM
  – 32 GB Micron DDR4 2666 MHz
  – 384 GB across 2 sockets w/ 6 channels
• Optane
  – 256 GB Intel Optane 2666 MHz QS
  – 3 TB across 2 sockets w/ 6 channels
• OS
  – Fedora 27, kernel 4.13.0
29
Basics: Latency
• 2×–3× the latency of DRAM
• Write latency masked by ADR
31
Basics: Bandwidth
• Reads scale (approximately) with thread count; writes do not
32
Basics: Bandwidth
• Access size matters
33
Basics: Bandwidth
• A mystery!
*answer in ~10 minutes
34
Lessons: What are Optane Best Practices?
35
Lessons: What are Optane Best Practices?
• Avoid small random accesses
• Use ntstores for large writes
• Limit threads accessing one NVDIMM
36
Lesson #1: Avoid small random accesses
37
Lesson #1: Avoid small random accesses
[Diagram, slides 37–39: every access reaches the Optane media as a full 256 B block, so a small random access pays for an entire block through the Optane buffer.]
40
Lessons: Optane Buffer Size
• Write amplification if working set is larger than Optane Buffer
42
Lesson #1: Avoid small random accesses
• Bad bandwidth with:
  – Small random writes (<256 B)
  – Working sets per NVDIMM larger than the Optane buffer (>16 KB)
• Good bandwidth with:
  – Sequential accesses
43
Lesson #2: Use ntstores for large writes
44
Lesson #2: Use ntstores for large writes
• Non-temporal stores bypass the cache
45
Lessons: Store instructions
[Diagram, slides 45–46: a plain store first reads the cache line into the cache (lost bandwidth) and later evictions reach the DIMM out of order (lost locality); store + clwb writes the line back explicitly but still pays the cache-line read; ntstore bypasses the cache, sending data straight to the memory controller.]
48
Lesson #2: Use ntstores for large writes
• Non-temporal stores bypass the cache
– Avoid cache-line read
– Maintain locality
49
Lesson #3: Limit threads accessing one NVDIMM
50
Lesson #3: Limit threads accessing one NVDIMM
• Contention at Optane Buffer
• Contention at iMC
51
Lessons: Contention at Optane Buffer
• More threads → more access amplification → lower bandwidth
52
Lessons: Contention at media & iMC
• The iMC's queues are small and the media is slow
  – The short queues end up clogged
54
Lessons: Contention at media & iMC
• Contention is largest when random access size = interleave size (4 KB)
• Load is fairest when access size = #DIMMs × interleave size (24 KB)
56
Lesson #3: Limit threads accessing one NVDIMM
• Contention at the Optane buffer
  – Increases access amplification
• Contention at the media/iMC
  – Loses bandwidth through uneven NVDIMM load
  – Avoid interleave-aligned random accesses
61
Conclusion
62
Conclusion
• Not Just Slow, Dense DRAM™
• Slower media
  -> More complex architecture
  -> Second-order performance anomalies
  -> Fundamentally different
• Max performance is tricky
  – Avoid small random accesses
  – Use ntstores for large writes
  – Limit threads accessing one NVDIMM