BIG MEMORIES
Bruce Jacob, University of Maryland
Source: ece.umd.edublj/talks/ISC-2012.pdf

OUTLINE
• The Capacity Problem
• Solution I: BOB Memory Systems
• Solution II: Hybrid Memory Cube
• Solution III: Non-Volatile Main Memories
The Capacity Problem
[Figure: Two DDR2-400 DIMMs vs. four DDR2-400 DIMMs. Source: Steve Woo, "DRAM and Memory System Trends," October 2004.]
The Capacity Problem
… but wait, there’s more:

[Chart: "Release of Increasing DIMM Capacities" — DIMM capacity (GB) vs. release year, 2000–2012: 256 MB, 1 GB, 4 GB, 8 GB, 16 GB.]
Problem: Capacity

[Diagram: two memory-controller configurations — JEDEC DDRx: ~10 W/DIMM, ~20 W total; FB-DIMM: ~10 W/DIMM, ~300 W total.]
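The power gap above is just arithmetic on channel population: DDRx supports only a couple of DIMMs per channel, while FB-DIMM daisy-chains many buffered DIMMs, so aggregate power balloons even at the same ~10 W per DIMM. A back-of-envelope sketch, using the slide's rough per-DIMM figure (the DIMM counts are illustrative assumptions, not measurements):

```python
# Rough aggregate-power comparison implied by the figures above.
# ~10 W/DIMM is the slide's estimate; DIMM counts are assumptions.

WATTS_PER_DIMM = 10  # approximate, per the slide

def total_power(dimms: int) -> int:
    """Aggregate DIMM power, ignoring controller/buffer overheads."""
    return dimms * WATTS_PER_DIMM

ddrx_dimms = 2       # typical DDRx channel population
fbdimm_dimms = 30    # large FB-DIMM configuration implied by ~300 W

print(total_power(ddrx_dimms))    # 20
print(total_power(fbdimm_dimms))  # 300
```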
Attempts at a Solution
• Highly Engineered DIMMs (can cost $1000+ per DIMM)
• Fully-Buffered DIMM (pushes the power envelope)
Observations
• Cannot increase power significantly (e.g. to CPU scale)
• Cannot sacrifice aggregate bandwidth
• Need to approach commodity pricing
• Future-proof design would be highly desirable
Solution I: BOB
Buffer On (mother-)Board

Examples: AMD G3MX, Intel SMI/SMB, IBM Power 795
Solution II: Micron HMC
A single-chip BOB system
Solution III: Non-Volatiles
Obvious Conclusions II
• Flash/NV is inexpensive, is fast (relative to disk), and has a better capacity roadmap than DRAM
• Make it a first-class citizen in the memory hierarchy
• Access it via load/store interface, use DRAM to buffer writes, software management
• Probably reduces capacity pressure on DRAM system
Can have TB-scale DIMMs today
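The bullets above describe NV as a first-class, load/store-addressable memory with DRAM buffering writes under software management. A minimal sketch of that idea (all class and method names here are illustrative, not from the talk):

```python
# Sketch of the bullets above: the NV array is the large backing
# store, a small DRAM buffer absorbs stores, and dirty data is
# flushed back to NV on eviction. Names are illustrative.

class HybridMemory:
    def __init__(self, buffer_capacity: int):
        self.nv = {}              # slow, large non-volatile array
        self.dram = {}            # small DRAM write buffer
        self.capacity = buffer_capacity

    def store(self, addr: int, value: int) -> None:
        # Writes land in DRAM; flush the oldest entry to NV when full.
        if addr not in self.dram and len(self.dram) >= self.capacity:
            old_addr, old_val = next(iter(self.dram.items()))
            self.nv[old_addr] = old_val
            del self.dram[old_addr]
        self.dram[addr] = value

    def load(self, addr: int) -> int:
        # Reads check the DRAM buffer first, then fall back to NV.
        if addr in self.dram:
            return self.dram[addr]
        return self.nv[addr]

mem = HybridMemory(buffer_capacity=2)
mem.store(0, 10)
mem.store(1, 11)
mem.store(2, 12)                 # flushes addr 0 to NV
print(mem.load(0), mem.load(2))  # 10 12
```

The point of the design choice is that writes (slow and wear-inducing on flash) are absorbed by DRAM, while capacity comes from the NV array.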
6 MB/s. However, this was not considered a major drawback, as transfer times to such cards were less important than their capacity. Until recently, NAND flash chips utilized a 40 MHz asynchronous 8-bit interface that was capable of 40 MB/s. This was also acceptable for some time, as the access latency of flash was still faster than other external storage media of the time and this was not the bottleneck in the applications that utilized it. However, as flash has taken on a new role with the introduction of SSDs, its transfer times have begun to matter.
One major problem with flash devices was that each manufacturer had their own interface standard. This problem made designing SSD hardware difficult and expensive, as it had to be tailored to a specific manufacturer's standard. To foster easier integration of flash devices and drive SSD adoption, the NAND flash industry developed the ONFi 1.0 standard [3].
Another problem with flash devices is that the array of flash cells within the chip is actually capable of producing data at a rate of 330 MB/s without any modifications [12]. Realizing that the asynchronous interface was the primary bottleneck in flash performance, manufacturers have developed synchronous standards such as ONFi 2.1 or Toggle Mode DDR. These new standards enable much faster transfers of data by running at faster frequencies than was possible with an asynchronous approach. As a result, newer flash chips are capable of bandwidths of up to 200 MB/s. Furthermore, a new standard, ONFi 3.0, has even more recently been defined, which will allow for bandwidths of up to 400 MB/s. Therefore, the full bandwidth potential of the flash array will soon be utilized to provide faster data transfers and improve overall performance when accessing flash. As a result of this additional bandwidth, the host interface and software likely need to evolve in order to fully expose the improved performance of the flash devices.
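To see why the interface generation matters, one can compute the bus transfer time for a single flash page at each of the cited rates. A quick sketch (the 8 KB page size is an illustrative assumption; the rates are the ones named in the text):

```python
# Bus transfer time for one flash page at the interface speeds the
# text cites: 40 MB/s (async), 200 MB/s (ONFi 2.1 / Toggle Mode DDR),
# 400 MB/s (ONFi 3.0). The 8 KB page size is an assumption.

PAGE_BYTES = 8 * 1024

def transfer_us(mb_per_s: float, nbytes: int = PAGE_BYTES) -> float:
    """Bus transfer time in microseconds (1 MB/s taken as 10**6 B/s)."""
    return nbytes / (mb_per_s * 1e6) * 1e6

for rate in (40, 200, 400):
    print(f"{rate} MB/s -> {transfer_us(rate):.1f} us per page")
```

At 40 MB/s the transfer alone exceeds 200 microseconds per page, which is why the asynchronous interface, not the array, became the bottleneck.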
3. Hybrid Main Memory Overview
3.1. Current State of the Art - SSD Design
A block diagram of a typical flash-based solid state drive is shown in Figure 1. The system consists of three main components: a host interface, an SSD controller, and a set of NAND flash devices. The host interface is typically SATA, although recently PCIe interfaces have become available for enterprise applications. The SSD controller is the core of the system and creates the abstractions necessary for utilizing NAND flash devices in such a way that creates a useful storage system. It performs tasks such as memory mapping, garbage collection, wear leveling, error correction, and access scheduling. The SSD controller also typically has a small amount of memory, either in the form of SRAM or DRAM, to cache metadata and buffer writes [6].
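The controller tasks named above (memory mapping, garbage collection, wear leveling) all stem from one mechanism: a logical-to-physical page map with out-of-place writes. A toy sketch of that core idea, with our own structure and names rather than anything from the paper:

```python
# Illustrative core of the SSD-controller tasks named above: a
# page-level logical-to-physical map with out-of-place writes.
# Rewriting a logical page leaves a stale physical page behind,
# which is what garbage collection must later reclaim.

class TinyFTL:
    def __init__(self, num_phys_pages: int):
        self.mapping = {}                      # logical page -> physical page
        self.free = list(range(num_phys_pages))
        self.invalid = []                      # stale pages awaiting GC erase

    def write(self, lpn: int) -> int:
        """Write logical page lpn out-of-place; return new physical page."""
        ppn = self.free.pop(0)
        if lpn in self.mapping:                # old copy becomes garbage
            self.invalid.append(self.mapping[lpn])
        self.mapping[lpn] = ppn
        return ppn

ftl = TinyFTL(num_phys_pages=4)
ftl.write(7)                         # lpn 7 -> ppn 0
ftl.write(7)                         # rewrite: lpn 7 -> ppn 1
print(ftl.mapping[7], ftl.invalid)   # 1 [0]
```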
The NAND flash devices are where the data is stored on the drive. SSDs leverage multiple devices to achieve high throughput. These are typically organized into parallel channels with one or more devices per channel. Internally, the NAND devices are organized into planes, blocks, and pages. Planes are functionally independent units that allow for concurrent operations on the device. Each plane has a set of registers that allow for interleaved accesses. Blocks form the physical granularity at which erase operations occur. Finally, each block consists of multiple pages, which are the physical granularity at which read and write operations occur.

[Figure 1: System design for SSD (top) and hybrid memory (bottom). Top: a Core i7 CPU (four x86 cores with a shared last-level cache) whose memory controller drives DDR3 channels to DRAM DIMMs, with PCIe lanes to a PCIe solid state drive containing a PCIe controller, an SSD controller, DRAM, and NAND devices on ONFi channels. Bottom: the same CPU with a hybrid memory controller driving a DDR3 channel to a DRAM DIMM and a buffer channel to an NV DIMM containing an NV controller, DRAM, and NAND devices on ONFi channels.]
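The plane/block/page hierarchy just described can be expressed as simple address arithmetic. A sketch, with illustrative geometry numbers not taken from the paper:

```python
# The plane/block/page hierarchy described above, as address
# arithmetic. Geometry is illustrative: reads/writes address a
# page, erases address an entire block.

PAGES_PER_BLOCK = 128
BLOCKS_PER_PLANE = 1024

def decompose(flat_page: int):
    """Split a flat page number into (plane, block, page)."""
    block, page = divmod(flat_page, PAGES_PER_BLOCK)
    plane, block = divmod(block, BLOCKS_PER_PLANE)
    return plane, block, page

def erase_unit(flat_page: int):
    """Erase granularity: every page in the enclosing block."""
    start = (flat_page // PAGES_PER_BLOCK) * PAGES_PER_BLOCK
    return range(start, start + PAGES_PER_BLOCK)

print(decompose(131_333))   # (1, 2, 5)
```

The mismatch between the write unit (a page) and the erase unit (a block of many pages) is the root cause of the garbage-collection and wear-leveling work the controller performs.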
In terms of the computer system performance, the delay for an operation to a solid state drive starts when the user application issues a request for some data that triggers a page fault and ends when the operating system returns control to the user application after the request has completed. At the hardware level, the SSD controller receives an access for a particular address and then later the controller raises an interrupt request (IRQ) on the CPU to tell the operating system the data is ready. A typical access to an SSD is shown in Figure 2. The time from point B to point C is the amount of time needed for the disk to process the request. The time from point A to point D is the total amount of time spent waiting for the request from the perspective of the application that made the request.
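The A-to-D timeline above reduces to simple interval accounting: B-to-C is device time, and everything else is software and interface overhead. A sketch with made-up placeholder latencies (not measurements from the paper):

```python
# The A-D timeline above as interval accounting. B->C is the time
# the device spends processing; A->B and C->D are software and
# interface overhead. Timestamp values are hypothetical.

timeline_us = {"A": 0.0, "B": 8.0, "C": 58.0, "D": 70.0}

device_time = timeline_us["C"] - timeline_us["B"]   # disk processes request
total_time  = timeline_us["D"] - timeline_us["A"]   # app-visible latency
overhead    = total_time - device_time              # software + interface

print(device_time, total_time, overhead)   # 50.0 70.0 20.0
```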
There are many intermediate software and hardware layers involved in an SSD access. The software side on a Linux-based system includes the virtual memory system, the virtual file system, the specific file system for the partition that holds the data (e.g. NTFS or ext3), the block device driver for the disk, and the device driver for the host interface such as the Advanced Host Controller Interface (AHCI) for Serial ATA (SATA) drives [11]. At the hardware level, the interfaces involved include the host interface to the drive, the direct memory access (DMA) engine, and the SSD internals. The host interface is typically a SATA interface, which resides on the southbridge for modern Intel processors. This means that the request must first cross the Intel Direct Media Interface (DMI) or equivalent before crossing the SATA interface. However, our model for this paper assumes the pure PCIe 3.0 NVM Express interface and we utilize 16 lanes, which makes the model
Solution III: Non-Volatiles
Performance normalized to that of TB-sized DRAM system
[Figure 9: System performance (normalized IPC, y-axis 0-1.2) when combining all techniques, comparing Un-Optimized Hybrid SLC, SSD MLC, Hybrid MLC, SSD SLC, and Hybrid SLC. The IPC is normalized to the ideal case with enough DRAM to store the entire working set.]
many realistic workloads, we show that the hybrid memory design can provide significant performance improvements compared to an enterprise-class solid state drive. We believe this design space is worth investigating further, as our paper is only an initial glimpse into using hybrid memories as a faster storage system. In particular, there is much work to be done optimizing both the flash system and the operating system to deal with this new design. We intend to investigate both of these areas in future work.
[9] N. Agrawal et al., "Design Tradeoffs for SSD Performance," in Proceedings of the 2008 USENIX Technical Conference (USENIX '08), 2008.
[10] A. Badam and V. S. Pai, "SSDAlloc: Hybrid SSD/RAM Memory Management Made Easy," in Proc. 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI '11), 2011.
[11] D. P. Bovet and M. Cesati, Understanding the Linux Kernel, 3rd ed. O'Reilly Media, 2005.
[12] J. Cooke, "Choosing the Right NAND for Your Application," Micron, 2009.
[13] E. Cooper-Balis, P. Rosenfeld, and B. Jacob, "Buffer On Board memory systems," in Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12), 2012.
[14] C. Dirik and B. Jacob, "The performance of PC Solid-State Disks (SSDs) as a function of bandwidth, concurrency, device architecture, and system organization," in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), 2009, pp. 279-289.
[15] E. Harari, "The Non-Volatile Memory Industry - A Personal Journey," in 3rd IEEE International Memory Workshop (IMW), May 2011, pp. 1-4.
[16] E. Harari, "Flash Memory: The Great Disruptor!" in International Solid-State Circuits Conference (ISSCC), Feb. 2012, pp. 10-15.
[17] J. Jex, "Flash memory BIOS for PC and notebook computers," in IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, vol. 2, 1991, pp. 692-695.
[18] S. Jiang and X. Zhang, "Token-ordered LRU: an effective page replacement policy and its implementation in Linux systems," Perform. Eval., vol. 60, no. 1-4, pp. 5-29, May 2005.
[19] T. Kgil and T. Mudge, "FlashCache: a NAND Flash Memory File Cache for Low Power Web Servers," in Proceedings of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES '06), 2006, pp. 103-112.
[20] E. Koldinger, J. Chase, and S. Eggers, "Architectural Support for Single Address Space Operating Systems," in 5th Int. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), vol. 27, no. 9, 1992, pp. 175-186.
[21] B. C. Lee et al., "Architecting Phase Change Memory as a Scalable DRAM Alternative," in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), 2009, pp. 2-13.
[22] A. Patel et al., "MARSSx86: A Full System Simulator for x86 CPUs," in Design Automation Conference 2011 (DAC '11), 2011.
[23] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable High Performance Main Memory System Using Phase-Change Memory Technology," in Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09), 2009, pp. 24-33.
[24] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, "DRAMSim2: A Cycle Accurate Memory System Simulator," Computer Architecture Letters, vol. 10, no. 1, pp. 16-19, Jan.-June 2011.
Bottom Line
• All three solutions are composable (this is GOOD)