An Empirical Guide to Scalable Persistent Memory
Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R. Dulloor, Jishen Zhao, Steven Swanson
Non-Volatile Systems Laboratory, Department of Computer Science & Engineering, University of California, San Diego
Department of Electrical, Computer & Energy Engineering, University of Colorado, Boulder
Transcript
1
An Empirical Guide to Scalable Persistent Memory
Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu,
Subramanya R. Dulloor, Jishen Zhao, Steven Swanson
Non-Volatile Systems Laboratory, Department of Computer Science & Engineering
University of California, San Diego
Department of Electrical, Computer & Energy Engineering
University of Colorado, Boulder
2
Optane DIMMs are Here!
• Not Just Slow, Dense DRAM™
• Slower, denser media
  -> More complex architecture
  -> Second-order performance anomalies
  -> Fundamentally different
6
Basics: Emulation is misleading
[Chart: RocksDB throughput on real Optane vs. DRAM-based emulation]
7
Outline
• Background
• Basics: Optane DIMM Performance
• Lessons: Optane DIMM Best Practices
• Conclusion
8
Background
9
Background: Optane in the Machine
[Diagram, built up across slides 9–16: each core has private L1/L2 caches and shares an L3 cache; two integrated memory controllers (iMCs) connect the socket to DRAM and Optane DIMMs. In App Direct Mode, software accesses the Optane DIMMs directly alongside DRAM. In Memory Mode, DRAM acts as a direct-mapped cache ("near memory") in front of the Optane DIMMs ("far memory").]
17
Background: Optane Internals
[Diagram, built up across slides 17–23: stores from the CPU land in the write pending queue (WPQ) inside the iMC; the WPQ sits within the ADR (Asynchronous DRAM Refresh) power-fail domain. On the Optane DIMM, the Optane controller services requests through a small Optane buffer, translates addresses via the AIT (backed by an on-DIMM AIT cache), and accesses the Optane media in 256 B blocks.]
24
Background: Optane Interleaving
• Optane is interleaved across NVDIMMs at 4 KB granularity.
[Diagram, slides 24–25: with six DIMMs, physical addresses 0, 4 KB, 8 KB, 12 KB, 16 KB, and 20 KB map to DIMMs 1–6 in turn; at 24 KB the mapping wraps back to DIMM 1.]
26
Basics: How does Optane DC perform?
27
Basics: Our Approach
• Microbenchmark sweeps across the state space
  – Access patterns (random, sequential)
  – Operations (read, ntstore, clflush, etc.)
  – Access/stride size
  – Power budget
  – NUMA configuration
  – Address space interleaving
• Targeted experiments
• Total: 10,000 experiments
28
Test Platform
• CPU
  – Intel Cascade Lake, 24 cores at 2.2 GHz in 2 sockets
  – Hyperthreading off
• DRAM
  – 32 GB Micron DDR4 2666 MHz
  – 384 GB across 2 sockets w/ 6 channels
• Optane
  – 256 GB Intel Optane 2666 MHz QS
  – 3 TB across 2 sockets w/ 6 channels
• OS
  – Fedora 27, kernel 4.13.0
29
Basics: Latency
• 2×–3× the latency of DRAM
• Write latency masked by ADR
31
Basics: Bandwidth
• Reads scale (approximately) with thread count; writes do not
32
Basics: Bandwidth
• Access size matters
33
Basics: Bandwidth
• A mystery!
*answer in ~10 minutes
34
Lessons: What are Optane Best Practices?
35
Lessons: What are Optane Best Practices?
• Avoid small random accesses
• Use ntstores for large writes
• Limit threads accessing one NVDIMM
36
Lesson #1: Avoid small random accesses
37
Lesson #1: Avoid small random accesses
[Diagram, slides 37–39: every access reaches the Optane media as a full 256 B block, so a small random access pays for an entire block through the Optane buffer.]
40
Lessons: Optane Buffer Size
• Write amplification if working set is larger than Optane Buffer
42
Lesson #1: Avoid small random accesses
• Bad bandwidth with:
  – Small random writes (<256 B)
  – Working sets per NVDIMM larger than the Optane buffer (>16 KB)
• Good bandwidth with:
  – Sequential accesses
43
Lesson #2: Use ntstores for large writes
44
Lesson #2: Use ntstores for large writes
• Non-temporal stores bypass the cache
45
Lessons: Store instructions
[Diagram, slides 45–46: a plain store first reads the cache line into the cache (lost bandwidth) and later evictions reach the DIMM out of order (lost locality); store + clwb writes the line back explicitly but still pays the cache-line read; ntstore bypasses the cache, sending data straight to the memory controller.]
48
Lesson #2: Use ntstores for large writes
• Non-temporal stores bypass the cache
– Avoid cache-line read
– Maintain locality
49
Lesson #3: Limit threads accessing one NVDIMM
50
Lesson #3: Limit threads accessing one NVDIMM
• Contention at Optane Buffer
• Contention at iMC
51
Lessons: Contention at Optane Buffer
• More threads → more access amplification → lower bandwidth
52
Lessons: Contention at media & iMC
• The iMC's queues are small and the media is slow
  – The short queues end up clogged
54
Lessons: Contention at media & iMC
• Contention is largest when random access size = interleave size (4 KB)
• Load is fairest when access size = #DIMMs × interleave size (24 KB)
56
Lesson #3: Limit threads accessing one NVDIMM
• Contention at the Optane buffer
  – Increases access amplification
• Contention at the media/iMC
  – Loses bandwidth through uneven NVDIMM load
  – Avoid interleave-aligned random accesses
61
Conclusion
62
Conclusion
• Not Just Slow, Dense DRAM™
• Slower media
  -> More complex architecture
  -> Second-order performance anomalies
  -> Fundamentally different
• Max performance is tricky
  – Avoid small random accesses
  – Use ntstores for large writes
  – Limit threads accessing one NVDIMM