Page 1
Dynamic Near Data Processing Framework for SSDs
Gunjae Koo*, Kiran Kumar Matam*, Te I†, H.V. Krishina Giri Nara*, Jing Li‡, Hung-Wei Tseng†, Steven Swanson‡, Murali Annavaram*
*University of Southern California
†North Carolina State University
‡University of California, San Diego
Page 2
Conventional Storage = Cheap Passive Devices
Conventional storage devices
• Slow, limited bandwidth (SATA: 150 ~ 600 MB/s)
• Passive devices (read, write, erase)
* Figures from Intel and Western Digital
Page 3
Storage in Modern Server Systems
Storage devices for Big Data
• Huge volumes of data
• Data movement is critical for performance
[Figure labels: slow, slower, much slower]
Page 4
Intelligent Storage
NVM-based storage devices
• No seek time, higher bandwidth over PCIe
• Potential to be active systems
* Figures from Intel
Page 5
Intelligent Storage
NVM-based storage devices
• No seek time, higher bandwidth (PCIe)
• Potential to be active systems
* Figures from Intel
[Figure: SSD internals, labeled SSD processor, DRAM, NAND flash packages]
Page 6
Near Data Processing (NDP)
[Figure: Host CPU, storage interface, Storage Processor (SP). Timeline legend: data computation @ host; data transfer from storage, internal and external (host – storage).]
Page 7
Near Data Processing (NDP)
[Figure: Host CPU, storage interface, Storage Processor (SP). Timeline rows: W/O NDP, With NDP. Legend: data computation @ host; data transfer from storage, internal and external (host – storage); data computation @ storage.]
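The two timeline rows on this slide can be read as a simple additive cost model. The sketch below is a toy illustration, not the authors' model: the function names, the assumption that stages run back to back without overlap, and all numbers are hypothetical.

```python
def time_without_ndp(internal, external, host_compute):
    """All data moves inside the SSD (internal), then to the host (external), then the host computes."""
    return internal + external + host_compute


def time_with_ndp(internal, storage_compute, external_reduced):
    """Data is read internally and processed in the SSD; only reduced output crosses the host link."""
    return internal + storage_compute + external_reduced


# Hypothetical per-query costs in arbitrary time units (not measurements).
baseline = time_without_ndp(internal=4.0, external=6.0, host_compute=2.0)
offloaded = time_with_ndp(internal=4.0, storage_compute=5.0, external_reduced=0.5)
print(baseline, offloaded)
```

In this model, NDP helps only while the slower in-storage computation costs less than the external transfer and host computation it replaces.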
Page 8
Near Data Processing (NDP) on SSDs
[Figure: Host CPU, storage interface, SP. Timeline rows: W/O NDP, With NDP. Legend: data computation @ host; data transfer from storage, internal and external (host – storage); data computation @ storage. SP timeline also shows garbage collection and wear-leveling.]
Page 9
Near Data Processing (NDP) on SSDs
[Figure: Host CPU, storage interface, SP. Timeline rows: W/O NDP, With NDP. Legend: data computation @ host; data transfer from storage, internal and external (host – storage); data computation @ storage. SP timeline also shows garbage collection and wear-leveling.]
Obstacles to in-SSD processing
• Less powerful embedded processor
• Dynamic computation resource availability
• Manual workload partitioning is difficult
Summarizer: Dynamic NDP framework for SSDs
Page 10
Summarizer – Basic Concept
[Figure: Host CPU, storage interface, AP. Label: Monitoring resources.]
Page 12
Summarizer – Detailed Firmware Architecture
[Figure: firmware architecture block diagram.
Host side: User Applications / Operating Systems; NVMe Host Driver; Host Memory holding SQ and CQ; Host CPU.
Storage interface (PCIe / NVMe).
SSD Firmware: I/O Controller (NVMe command decoder) with request queue and response queue; Flash Translation Layer (FTL); Summarizer, containing the Task Controller, TQ and User Functions.
SSD hardware: SSD Embedded Processors; SSD SoC Interconnection; DRAM Controller and SSD DRAM; Flash Controller and NAND Flash.]
Page 13
Summarizer – Initialization (Function Offloading)
[Figure: firmware architecture diagram as on Page 12. Labels added on this slide: New NVMe command INIT(foo); function offloading of foo(); function registration, with foo() stored under User Functions as f#1.]
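The registration step labeled on this slide can be sketched as a small table that assigns an ID to each offloaded function. This is an illustration only: the class, its methods and the example function are hypothetical, and the slides do not say how functions are actually stored or identified beyond the f#1 label.

```python
class UserFunctionTable:
    """Hypothetical stand-in for the 'User Functions' area of the Summarizer."""

    def __init__(self):
        self._functions = {}
        self._next_id = 1

    def init(self, func):
        """Model of INIT(foo): store the offloaded function and return its ID (1 for f#1, ...)."""
        func_id = self._next_id
        self._functions[func_id] = func
        self._next_id += 1
        return func_id

    def lookup(self, func_id):
        """Return the function registered under func_id."""
        return self._functions[func_id]


def foo(page):
    """Example user function (an assumption): count the zero bytes in a page."""
    return page.count(0)


table = UserFunctionTable()
foo_id = table.init(foo)
print(foo_id, table.lookup(foo_id)(bytes([0, 1, 0, 2])))
```

Later commands can then refer to the function by its ID instead of shipping the code again.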
Page 14
Summarizer – Computation (Dynamic mode)
[Figure: firmware architecture diagram as on Page 12. Labels added on this slide: New NVMe command RD&PROC(LBA, foo); New NVMe command decode; RD&PROC(PPA, foo); user functions foo() f#1 and goo() f#2.]
Page 15
Summarizer – Computation (Dynamic mode)
[Figure: firmware architecture diagram as on Page 12. Labels added on this slide: RD&PROC(PPA, foo); RD&P(PPA1, foo); RD&P(PPA2, foo); Page data; user functions foo() f#1 and goo() f#2.]
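The decode path labeled on Pages 14 and 15, where RD&PROC(LBA, foo) becomes RD&PROC(PPA, foo) and then one RD&P request per page, can be sketched as follows. The mapping table contents, the one-logical-address-per-page simplification and all function names are hypothetical.

```python
def translate(mapping, lba_start, num_pages):
    """FTL lookup: map each logical address to its physical page address (PPA)."""
    return [mapping[lba] for lba in range(lba_start, lba_start + num_pages)]


def split_request(mapping, lba_start, num_pages, func_name):
    """Turn one RD&PROC(LBA, func) into page-level RD&P(PPA, func) requests."""
    return [("RD&P", ppa, func_name) for ppa in translate(mapping, lba_start, num_pages)]


# Hypothetical logical-to-physical table; a real FTL updates it as data is rewritten.
l2p = {100: 7012, 101: 3344}
requests = split_request(l2p, lba_start=100, num_pages=2, func_name="foo")
print(requests)
```

Splitting at page granularity is what lets the firmware decide per page where the computation should run.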
Page 16
Summarizer – Computation (Dynamic mode)
[Figure: firmware architecture diagram as on Page 12. Labels added on this slide: RD&PROC(PPA, foo); Page data; RD&P(PPA1, foo); buf1, foo; CC/Proc; Register in TQ; user functions foo() f#1 and goo() f#2.]
Page 17
Summarizer – Computation (Dynamic mode)
[Figure: firmware architecture diagram as on Page 12. Labels added on this slide: RD&PROC(PPA, foo); Page data; RD&P(PPA1, foo); CC; TQ is full; user functions foo() f#1 and goo() f#2.]
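A minimal sketch of the per-page decision these slides suggest: a fetched page is registered in the TQ for in-SSD processing when there is room, and otherwise goes to the host unprocessed. Reading "TQ is full" as a fallback to host processing is an interpretation of the slide labels; the queue capacity, class name and method names are hypothetical.

```python
from collections import deque


class TaskController:
    """Hypothetical model of the Summarizer task controller."""

    def __init__(self, tq_capacity):
        self.tq = deque()              # task queue (TQ) of (page, func) entries
        self.tq_capacity = tq_capacity
        self.to_host = []              # pages returned to the host unprocessed

    def on_page_fetched(self, page, func):
        """Register the page in the TQ if there is room, else hand it to the host."""
        if len(self.tq) < self.tq_capacity:
            self.tq.append((page, func))
            return "in-ssd"
        self.to_host.append(page)
        return "host"

    def run_one(self):
        """The embedded processor pops one task and applies the user function."""
        page, func = self.tq.popleft()
        return func(page)


ctrl = TaskController(tq_capacity=2)
decisions = [ctrl.on_page_fetched(p, sum) for p in ([1, 2], [3, 4], [5, 6])]
print(decisions, ctrl.run_one(), ctrl.to_host)
```

Because the queue drains only as fast as the embedded processor works, the split between SSD and host adapts to the load without the programmer choosing a fixed ratio.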
Page 18
Summarizer – Finalization
[Figure: firmware architecture diagram as on Page 12. Labels added on this slide: New NVMe command FINAL(foo); Results; user functions foo() f#1 and goo() f#2.]
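If some pages were processed in the SSD and others on the host, the host presumably has to combine both partial results once FINAL(foo) returns. The sketch below assumes a sum-style aggregate; the slides do not specify the merge step, and the function names and values are hypothetical.

```python
def final(ssd_partials):
    """Model of FINAL(foo): return the accumulated in-SSD result to the host."""
    return sum(ssd_partials)


def host_merge(ssd_result, host_pages, func):
    """The host applies the same function to pages it received unprocessed, then merges."""
    return ssd_result + sum(func(page) for page in host_pages)


ssd_result = final([3, 7])                      # partial results of in-SSD tasks
total = host_merge(ssd_result, [[5, 6]], sum)   # one page fell back to the host
print(total)
```

This only works directly for functions whose partial results can be combined, such as sums or counts.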
Page 19
Evaluation Platform
• LS2085a intelligent SSD development platform
• ARM cores running FTL and Summarizer firmware
• FPGA implementing NAND flash controller
• PCIe Gen. 3 x4 link for host communication
[Figure: LS2085a block diagram. CPU cores, each with L1D (32KB) and L1I (48KB), and L2 (1MB); interconnection; DDR4 memory controller and DRAM; PCIe (host – LS2085a); PCIe (LS2085a – FPGA); FPGA (Altera Stratix V); NAND flash DIMMs.]
Page 20
Evaluation Platform
[Figure: block diagram from Page 19 with a picture of the board, labeled ARM Processor, DRAM, Altera Stratix V, NAND flash DIMMs, PCIe (to host).]
Page 21
Evaluation - Performance
[Figure: stacked bar chart for TPC-H Query6. X-axis: Static (0, 0.2, 0.4, 0.6, 0.8, 1) and Dynamic. Y-axis ticks: 0 to 4. Series: SSD time, Host time. Callout: Static workload offloading.]
Page 22
Evaluation - Performance
[Figure: TPC-H Query6 chart as on Page 21. Callouts: CPU only processing (baseline); SSD only processing.]
Page 23
Evaluation - Performance
[Figure: TPC-H Query6 chart as on Page 21. Callout: Summarizer Dynamic Offloading.]
Page 24
Evaluation - Performance
[Figure: TPC-H Query6 chart as on Page 21. Legend explained: SSD time = SSD processing + transfer time (internal + external + in-SSD processing); Host time = host CPU processing time.]
Page 25
Evaluation - Performance
[Figure: TPC-H Query6 chart as on Page 21. Series: SSD time, Host time. Callout: Execution time normalized to baseline (CPU only).]
Page 26
Evaluation - Performance
[Figure: TPC-H Query6 chart as on Page 21. Series: SSD time, Host time. Y-axis: Execution time (normalized to baseline).]
Page 27
Evaluation - Performance
[Figure: TPC-H Query6 chart as on Page 21, with an inset stacked bar chart. Inset y-axis: Execution time (normalized to baseline), 0.0 to 1.2. Inset bars: CPU only, Dynamic. Series: SSD time, Host time. Data labels (CPU only / Dynamic): 0.70 / 0.60 and 0.30 / 0.24.]
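Taking the data labels on this slide at face value, the totals for the two bars can be worked out directly. Assigning 0.70 and 0.60 to SSD time and 0.30 and 0.24 to Host time is an assumption based on the order of the labels; the variable names are hypothetical.

```python
# Stacked-bar labels as extracted from this slide (normalized execution time).
# Which label belongs to which series is an assumption, not stated on the slide.
cpu_only = {"ssd_time": 0.70, "host_time": 0.30}
dynamic = {"ssd_time": 0.60, "host_time": 0.24}

cpu_total = sum(cpu_only.values())
dyn_total = sum(dynamic.values())
reduction_pct = round(100 * (cpu_total - dyn_total) / cpu_total)
print(round(cpu_total, 2), round(dyn_total, 2), reduction_pct)
```

Under this reading the Dynamic bar totals 0.84 of the CPU-only baseline, a reduction of about 16%. Page 28 shows 0.62 in place of 0.60, which would give 0.86 and about 14% instead.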
Page 28
Evaluation - Performance
[Figure: TPC-H Query6 chart as on Page 21, with an inset stacked bar chart. Inset y-axis: 0.0 to 1.2. Inset bars: CPU only, Dynamic. Series: SSD time, Host time. Data labels (CPU only / Dynamic): 0.70 / 0.62 and 0.30 / 0.24. Timeline rows: W/O NDP, With NDP. Timeline legend: data computation @ host; data transfer from storage, internal and external (host – storage); data computation @ storage.]
Page 29
Evaluation - Performance
[Figure: TPC-H Query6 chart as on Page 21. Callout: Performance degraded by static NDP.]
Page 30
Evaluation - Performance
[Figure: four chart panels, each with y-axis "Execution time (normalized to baseline)". Callouts: 16%, 10%, 20%, 7%.]
Page 31
Design Exploration – Better SSD Processor
[Figure: Host CPU, storage interface, AP.]
Better embedded processor is cost-effective
Page 32
Design Exploration – Higher Internal Bandwidth
[Figure: bar chart. Y-axis: Speedup, 0% to 120%. X-axis: X1, X2, X4, X8, X16 for each of TPC-H Query6, TPC-H Query1, TPC-H Query14, String Similarity Join, Average. Callout: Embedded processor performance.]
Page 33
Design Exploration – Higher Internal Bandwidth
[Figure: bar chart. Y-axis: Speedup, 0% to 120%. X-axis: X1, X2, X4, X8, X16 for each of TPC-H Query6, TPC-H Query1, TPC-H Query14, String Similarity Join, Average.]
Summarizer is a cost-effective NDP solution with powerful storage processors
Page 34
Conclusion
▪ Dynamic computation offloading framework
• Opportunistic in-SSD computation
• Page-level task control
• Optimal performance improvement
▪ Summarizer programming model
✓ Dynamic NDP framework for SSDs
• Opportunistically enables in-SSD processing
• Page-level NDP control
• Automatic workload partitioning
✓ Summarizer programming model
• Evaluation on the real development platform
• Explored design space for future SSDs
Page 35
Thank you
(We thank Dell EMC for supporting the SSD development board)
Summarizer: Trading Communication with Computing Near Storage (MICRO '17)