Small amounts of noise on Vin cause Ids > Isd or Isd > Ids ... and the output bucket randomly fills and empties. Result: Vout randomly flips between logic 0 and logic 1.
DRAM has a 6-10X density advantage at the same technology generation.
Big win for DRAM
SRAM is much faster: transistors drive bitlines on reads. SRAM is easy to design in a logic fabrication process (and premium logic processes have SRAM add-ons).
SRAM has deterministic latency: its cells do not need to be refreshed.
Because processor logic operates at low voltage, achieving a low SRAM minimum operating voltage is desirable to avoid the integration, routing, and control overheads of multiple supply domains.
In the 22 nm tri-gate technology, fin quantization eliminates the fine-grained width tuning conventionally used to optimize read stability and write margin, and presents a challenge in designing minimum-area SRAM bitcells constrained by fin pitch. The 22 nm process technology includes both a high density 0.092 µm² 6T SRAM bitcell (HDC) and a low voltage 0.108 µm² 6T SRAM bitcell (LVC) to support tradeoffs in area, performance, and minimum operating voltage across a range of application requirements. In Fig. 1, a 45-degree image of an LVC tri-gate SRAM is pictured, showing the thin silicon fins wrapped on three sides by a polysilicon gate. The top-down bitcell images in Fig. 2 illustrate that tri-gate device sizing and minimum device dimensions are quantized by the dimensions of each uniform silicon fin. The HDC bitcell features a 1-fin pullup, passgate, and pulldown transistor to deliver the highest 6T SRAM density, while the LVC bitcell has a 2-fin pulldown transistor for an improved SRAM ratio (passgate to pulldown), which enhances read stability in low voltage conditions. Bitcell optimization via threshold voltage (V_T) adjustment can be used to tune the bitcell ratios (pullup to pulldown, and passgate to pulldown) for adjustments to read and write margin, in lieu of geometric customization, but low-V_T usage is constrained by bitcell leakage and high-V_T usage is limited by performance degradation at low voltage.

Fig. 3. High density SRAM bitcell scales at 2X per technology node.

Fig. 4. 22 nm tri-gate SRAM array density scales by 1.85X with an unprecedented increase in performance at low voltage.
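For intuition only, here is a minimal sketch of the fin-count arithmetic behind those ratios, assuming drive strength scales simply with integer fin count (the function name and that assumption are mine, not the paper's). With a 2-fin pulldown, the passgate-to-pulldown ratio halves, i.e. the pulldown is relatively stronger, which is what improves read stability:

    # Hypothetical illustration: fin-quantized devices make bitcell ratios
    # take only discrete values set by integer fin counts.
    def ratios(pullup_fins, passgate_fins, pulldown_fins):
        passgate_to_pulldown = passgate_fins / pulldown_fins  # read-stability knob
        pullup_to_pulldown = pullup_fins / pulldown_fins      # write-margin knob
        return passgate_to_pulldown, pullup_to_pulldown

    print(ratios(1, 1, 1))  # HDC: all 1-fin devices -> (1.0, 1.0)
    print(ratios(1, 1, 2))  # LVC: 2-fin pulldown -> (0.5, 0.5), stronger pulldown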
In the 22 nm process technology, the individual bitcell device targets are co-optimized with the array design and integrated assist circuits to deliver maximum yield and process margin at a given performance target. Optical proximity correction and resolution enhancement technologies extend the capabilities of 193 nm immersion lithography to allow 54% scaling of the bitcell topologies from the 32 nm node, as shown in Fig. 3. Fig. 4 shows that SRAM bitcell density scaling is preserved at the 128 kb array level and the array is capable of 2.0 GHz operation at 625 mV, a 175 mV reduction in the supply voltage required to reach 2 GHz from the prior technology node.
III. 22 NM 128 KB SRAM MACRO DESIGN
The 162 Mb SRAM array implemented on the 22 nm SRAM test chip is composed of a tileable 128 kb SRAM macro with integrated read and write assist circuitry. As shown in Fig. 5, the array macro floorplan integrates 258 bitcells per local bitline (BL) and 136 bitcells per local wordline (WL) to maintain high array efficiency (71.6%) and achieve 1.85X density scaling (7.8 Mb/mm²) over the 32 nm design [11] despite the addition of integrated assist circuits. The macro floorplan uses a folded bitline layout with 8:2 column multiplexing on each side of the shared I/O column circuitry. Two redundant row elements and two redundant column elements are integrated into the macro to improve manufacturing yield and provide the capability to repair defective rows and columns.
RAM Compilers
On average, 30% of a modern logic chip is SRAM, which is generated by RAM compilers.
Compile-time parameters set the number of bits, aspect ratio, number of ports, etc.
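As a concrete illustration, here is a minimal sketch of the kind of compile-time parameter record such a compiler might consume; the type and field names are hypothetical, not any vendor's actual interface:

    # Illustrative compile-time parameters for a hypothetical RAM compiler.
    from dataclasses import dataclass

    @dataclass
    class SramMacroParams:
        words: int          # number of addressable words
        bits_per_word: int  # word width
        ports: str          # port configuration, e.g. "1rw", "1r1w"
        column_mux: int     # aspect-ratio knob: wider mux -> shorter, wider array

    # A 64 kb single-port macro with 4:1 column muxing.
    cfg = SramMacroParams(words=2048, bits_per_word=32, ports="1rw", column_mux=4)
    print(cfg.words * cfg.bits_per_word, "bits")  # 65536 bits = 64 kb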
Assume C_cell = 1 fF. A bitline may have 2000 nFET drains; assume a bitline capacitance of 100 fF, or 100 × C_cell. C_cell holds Q = C_cell × (Vdd − Vth). When we dump this charge onto the bitline, what voltage do we see?
dV = [C_cell × (Vdd − Vth)] / [100 × C_cell] = (Vdd − Vth) / 100 ≈ tens of millivolts!
In practice, scale the array to get a 60 mV signal.
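The charge-sharing estimate above is easy to check numerically. A minimal sketch, assuming Vdd = 2.5 V and Vth = 0.5 V (values chosen only for illustration; the slide does not specify them):

    # Bitline charge-sharing estimate from the slide: dV = Q_cell / C_bitline.
    C_CELL = 1e-15            # 1 fF storage cell
    C_BITLINE = 100 * C_CELL  # ~100 fF of bitline capacitance (2000 nFET drains)
    VDD, VTH = 2.5, 0.5       # illustrative supply and threshold voltages

    q_cell = C_CELL * (VDD - VTH)  # charge stored on the cell
    dv = q_cell / C_BITLINE        # voltage swing seen on the bitline
    print(f"dV = {dv * 1e3:.1f} mV")  # 20.0 mV: tens of millivolts, as claimed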
Read latency: Time to return first byte of a random access
(1) Parallelism. Request data from N 1-bit-wide memories at the same time. Overlaps latency cost for all N bits. Provides N times the bandwidth. Requests to N memory banks (interleaving) have the potential of N times the bandwidth (see the sketch below).
(2) Pipeline memory. If memory has N cycles of latency, issue a request each cycle, receive it N cycles later.
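A minimal sketch of idea (1), assuming low-order interleaving (one common address-mapping convention; the function name is mine):

    # Low-order interleaving across N banks: consecutive addresses map to
    # different banks, so N sequential requests can overlap their latencies.
    N_BANKS = 4

    def decompose(addr):
        return addr % N_BANKS, addr // N_BANKS  # (bank, index within bank)

    for addr in range(8):
        bank, index = decompose(addr)
        print(f"addr {addr} -> bank {bank}, index {index}")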
Memory Access Scheduling

Scott Rixner¹, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens

1. Scott Rixner is an Electrical Engineering graduate student at the Massachusetts Institute of Technology.
The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.
1 Introduction
Modern computer systems are becoming increasingly limited by memory performance. While processor performance increases at a rate of 60% per year, the bandwidth of a memory chip increases by only 10% per year, making it costly to provide the memory bandwidth required to match the processor performance [14] [17]. The memory bandwidth bottleneck is even more acute for media processors with streaming memory reference patterns that do not cache well. Without an effective cache to reduce the bandwidth demands on main memory, these media processors are more often limited by memory system bandwidth than other computer systems.
To maximize memory bandwidth, modern DRAM components allow pipelining of memory accesses, provide several independent memory banks, and cache the most recently accessed row of each bank. While these features increase the peak supplied memory bandwidth, they also make the performance of the DRAM highly dependent on the access pattern. Modern DRAMs are not truly random access devices (equal access time to all locations) but rather are three-dimensional memory devices with dimensions of bank, row, and column. Sequential accesses to different rows within one bank have high latency and cannot be pipelined, while accesses to different banks or different words within a single row have low latency and can be pipelined.
The three-dimensional nature of modern memory devices makes it advantageous to reorder memory operations to exploit the non-uniform access times of the DRAM. This optimization is similar to how a superscalar processor schedules arithmetic operations out of order. As with a superscalar processor, the semantics of sequential execution are preserved by reordering the results.
This paper introduces memory access scheduling, in which DRAM operations are scheduled, possibly completing memory references out of order, to optimize memory system performance. The several memory access scheduling strategies introduced in this paper increase the sustained memory bandwidth of a system by up to 144% over a system with no access scheduling when applied to realistic synthetic benchmarks. Media processing applications exhibit a 30% improvement in sustained memory bandwidth with memory access scheduling, and the traces of these applications offer a potential bandwidth improvement of up to 93%.
To see the advantage of memory access scheduling, consider the sequence of eight memory operations shown in Figure 1A. Each reference is represented by the triple (bank, row, column). Suppose we have a memory system utilizing a DRAM that requires 3 cycles to precharge a bank, 3 cycles to access a row of a bank, and 1 cycle to access a column of a row. Once a row has been accessed, a new column access can issue each cycle until the bank is precharged. If these eight references are performed in order, each requires a precharge, a row access, and a column access for a total of seven cycles per reference, or 56 cycles for all eight references. If we reschedule these operations as shown in Figure 1B, they can be performed in 19 cycles.
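A toy cycle-count model makes the gap concrete. The sketch below uses the stated timings (3-cycle precharge, 3-cycle activate, 1-cycle column access) on a hypothetical eight-reference sequence, since Figure 1A itself is not reproduced here; it also ignores overlap between banks, so its reordered count is pessimistic relative to the paper's 19-cycle schedule:

    # Toy DRAM timing model: precharge = 3, activate = 3, column access = 1.
    from collections import Counter

    PRECHARGE, ACTIVATE, COLUMN = 3, 3, 1
    # Hypothetical (bank, row, column) references, standing in for Figure 1A.
    refs = [(0, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0),
            (0, 1, 1), (1, 0, 1), (0, 0, 2), (1, 0, 2)]

    def in_order_cycles(refs):
        # Worst case from the text: every reference pays the full
        # precharge + activate + column cost of 7 cycles.
        return len(refs) * (PRECHARGE + ACTIVATE + COLUMN)

    def row_grouped_cycles(refs):
        # Reorder so all references to one (bank, row) are serviced while
        # that row sits in the row buffer: one precharge + activate per row,
        # then one cycle per column access.
        per_row = Counter((bank, row) for bank, row, _ in refs)
        return sum(PRECHARGE + ACTIVATE + n * COLUMN for n in per_row.values())

    print(in_order_cycles(refs))     # 56 cycles, matching the text
    print(row_grouped_cycles(refs))  # 26 cycles; overlapping banks cuts further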
The following section discusses the characteristics of modern DRAM architecture. Section 3 introduces the concept of memory access scheduling and the possible algorithms that can be used to reorder DRAM operations. Section 4 describes the streaming media processor and benchmarks that will be used to evaluate memory access scheduling. Section 5 presents a performance comparison of the various memory access scheduling algorithms. Finally, Section 6 presents related work to memory access scheduling.
2 Modern DRAM Architecture
As illustrated by the example in the Introduction, the order in which DRAM accesses are scheduled can have a dramatic impact on memory throughput and latency. To improve memory performance, a memory controller must take advantage of the characteristics of modern DRAM.
Figure 2 shows the internal organization of modern DRAMs. These DRAMs are three-dimensional memories with the dimensions of bank, row, and column. Each bank operates independently of the other banks and contains an array of memory cells that are accessed an entire row at a time. When a row of this memory array is accessed (row activation), the entire row of the memory array is transferred into the bank’s row buffer. The row buffer serves as a cache to reduce the latency of subsequent accesses to that row. While a row is active in the row buffer, any number of reads or writes (column accesses) may be performed, typically with a throughput of one per cycle. After completing the available column accesses, the cached row must be written back to the memory array by an explicit operation (bank precharge) which prepares the bank for a subsequent row activation. An overview of several different modern DRAM types and organizations, along with a performance comparison for in-order access, can be found in [4].
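The bank behavior just described (row buffer as a cache, explicit precharge) reduces to a small state machine. A minimal sketch, reusing the illustrative 3/3/1-cycle timings from the Introduction's example; the class and method names are mine:

    # Minimal model of one DRAM bank with a row buffer acting as a cache.
    PRECHARGE, ACTIVATE, COLUMN = 3, 3, 1

    class Bank:
        def __init__(self):
            self.open_row = None  # row currently held in the row buffer

        def access(self, row):
            """Return the cycle cost of a column access to `row`."""
            if self.open_row == row:
                return COLUMN  # row-buffer hit: column access only
            cost = ACTIVATE + COLUMN
            if self.open_row is not None:
                cost += PRECHARGE  # write the cached row back first
            self.open_row = row
            return cost

    b = Bank()
    print([b.access(r) for r in (5, 5, 5, 9, 5)])  # [4, 1, 1, 7, 7]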
For example, the 128Mb NEC µPD45128163 [13], a typical SDRAM, includes four internal memory banks, each composed of 4096 rows and 512 columns. This SDRAM may be operated at 125MHz, with a precharge latency of 3 cycles (24ns) and a row access latency of 3 cycles (24ns). Pipelined column accesses that transfer 16 bits may issue at the rate of one per cycle (8ns), yielding a peak transfer rate of 250MB/s. However, it is difficult to achieve this rate on non-sequential access patterns for several reasons: a bank cannot be accessed during the precharge/activate latency, a single cycle of high impedance is required on the data pins when switching between read and write column accesses, and a single set of address lines is shared by all DRAM operations (bank precharge, row activation, and column access). The amount of bank parallelism that is exploited and the number of column accesses that are made per row access dictate the sustainable memory bandwidth out of such a DRAM, as illustrated in Figure 1 of the Introduction.
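The quoted peak rate follows directly from the clock and the transfer width; a quick sanity check of the arithmetic, nothing more:

    # Peak transfer rate of the quoted SDRAM: one 16-bit column access per
    # 125 MHz cycle when a row is open and accesses are fully pipelined.
    freq_hz = 125e6
    bytes_per_access = 16 / 8
    print(freq_hz * bytes_per_access / 1e6, "MB/s")  # 250.0 MB/s, as quoted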
A memory access scheduler must generate a schedule that conforms to the timing and resource constraints of these modern DRAMs. Figure 3 illustrates these constraints for the NEC SDRAM with a simplified bank state diagram and a table of operation resource utilization. Each DRAM operation makes different demands on the three DRAM resources: the internal banks, a single set of address lines, and a single set of data lines. The scheduler must ensure that the required resources are available for each DRAM operation it schedules.
Figure 1. Time to complete a series of memory references without (A) and with (B) access reordering.
3 Memory Access Scheduling

Memory access scheduling is the process of ordering the DRAM operations (bank precharge, row activation, and column access) necessary to complete the set of currently pending memory references. Throughout the paper, the term operation denotes a command, such as a row activation or a column access, issued by the memory controller to the DRAM. Similarly, the term reference denotes a memory reference generated by the processor, such as a load or store to a memory location. A single reference generates one or more memory operations depending on the schedule.
Given a set of pending memory references, a memory access scheduler may choose one or more row, column, or precharge operations each cycle, subject to resource constraints, to advance one or more of the pending references. The simplest, and most common, scheduling algorithm only considers the oldest pending reference, so that references are satisfied in the order that they arrive. If it is currently possible to make progress on that reference by performing some DRAM operation, then the memory controller makes the appropriate access. While this does not require a complicated access scheduler in the memory controller, it is clearly inefficient, as illustrated in Figure 1 of the Introduction.
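The oldest-first baseline fits in a few lines. A minimal sketch, assuming (bank, row, column) references and tracking only which row each bank has open; the helper names are illustrative, and timing is ignored so that only the policy (expand the oldest reference fully before looking at anything else) is visible:

    # In-order (oldest-first) scheduling baseline: only the oldest pending
    # reference may advance; everything behind it waits.
    from collections import deque

    def ops_for(ref, open_rows):
        """Expand a reference into the DRAM operations it still needs."""
        bank, row, col = ref
        ops = []
        if open_rows.get(bank) not in (None, row):
            ops.append(("precharge", bank))        # close the wrong open row
        if open_rows.get(bank) != row:
            ops.append(("activate", bank, row))    # pull the row into the buffer
        ops.append(("column", bank, col))          # then do the column access
        return ops

    def in_order_schedule(refs):
        open_rows, schedule = {}, []
        pending = deque(refs)
        while pending:
            ref = pending.popleft()                # only ever look at the oldest
            schedule.extend(ops_for(ref, open_rows))
            open_rows[ref[0]] = ref[1]             # row now cached in the buffer
        return schedule

    print(in_order_schedule([(0, 0, 0), (0, 0, 1), (0, 1, 0)]))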
If the DRAM is not ready for the operation required by the oldest pending reference, or if that operation would leave available resources idle, it makes sense to consider operations for other pending references. Figure 4 shows the structure of a more sophisticated access scheduler. As memory references arrive, they are allocated storage space while they await service from the memory access scheduler. In the figure, references are initially sorted by DRAM bank. Each pending reference is represented by six fields: valid (V), load/store (L/S), address (Row and Col), data, and whatever additional state is necessary for the scheduling algorithm. Examples of state that can be accessed and modified by the scheduler are the age of the reference and whether or not that reference targets the currently active row. In practice, the pending reference storage could be shared by all the banks (with the addition of a bank address field) to allow dynamic allocation of that storage at the cost of increased logic complexity in the scheduler.
As shown in Figure 4, each bank has a precharge manager and a row arbiter. The precharge manager simply decides when its associated bank should be precharged. Similarly, the row arbiter for each bank decides which row, if any, should be activated when that bank is idle. A single column arbiter is shared by all the banks. The column arbiter grants the shared data line resources to a single column access out of all the pending references to all of the banks. Finally, the precharge managers, row arbiters, and column arbiter send their selected operations to a single address arbiter which grants the shared address resources to one or more of those operations.
The precharge managers, row arbiters, and column arbiter can use several different policies to select DRAM operations, as enumerated in Table 1. The combination of policies used by these units, along with the address arbiter’s policy, determines the memory access scheduling algorithm. The address arbiter must decide which of the selected precharge, activate, and column operations to perform, subject to the constraints of the address line resources. As with all of the other scheduling decisions, the in-order or priority policies can be used by the address arbiter to make this selection. Additional policies that can be used are those that select precharge operations first, row operations first, or column operations first. A column-first scheduling policy would reduce the latency of references to active rows, whereas a precharge-first or row-first scheduling policy would increase the amount of bank parallelism.
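How these per-unit policies compose is easy to sketch. The in-order and column-first policies below follow the descriptions in the text; the dictionary-based operation records are my own illustration, not the paper's implementation:

    # Each per-bank arbiter nominates one candidate operation per cycle under
    # its own policy; the address arbiter then chooses among the nominees.
    def oldest(candidates):
        # "In-order" policy: nominate the candidate from the oldest reference.
        return min(candidates, key=lambda op: op["age"]) if candidates else None

    def column_first(precharge, activate, column):
        # Address-arbiter policy favoring column accesses, minimizing the
        # latency of references that already target an active row.
        for op in (column, activate, precharge):
            if op is not None:
                return op
        return None

    # Hypothetical nominees for one cycle:
    pre = {"kind": "precharge", "bank": 0, "age": 5}
    act = {"kind": "activate", "bank": 1, "row": 42, "age": 3}
    col = oldest([{"kind": "column", "bank": 2, "age": 7},
                  {"kind": "column", "bank": 3, "age": 2}])
    print(column_first(pre, act, col))  # the age-2 column access wins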
If the address resources are not shared, it is possible for both a precharge operation and a column access to the same bank to be selected. This is likely to violate the timing constraints of the DRAM. Ideally, this conflict can be handled by having the column access automatically precharge the bank.
Figure 4. Memory access scheduler architecture.