Dezső Sima
Spring 2008
(Ver. 1.0) Sima Dezső, 2008
FB-DIMM technology
Motivations to introduce FB-DIMMs in servers/workstations
Shortcomings of the stub-bus topology used with conventional DRAM architectures [2]
Impedance discontinuities affect signal integrity [2]
Stub-bus topology
Data lines of the memory controller are electrically connected
to the data lines of every DRAM device on the bus (memory channel)
Memory channels may have 8 DIMMs with 8 DRAM devices/DIMM (i.e. 72 devices/channel once a ninth ECC device per DIMM is counted)
Heavy signal loading due to the large number of devices and impedance discontinuities on the bus
limit the number of DRAM devices that can be connected to the channel, the more so the higher the data rate
Figure: Scaling number of channels with memory hubs [7]. Two ranks of DRAM devices per DIMM are assumed. With a single rank per DIMM the number of DIMMs per channel may be doubled, but the declining trend shown in the figure remains the same.
At higher DRAM speeds fewer DRAM devices can be connected
per memory channel [2]
Stub-bus channel capacity (device density x nr. of devices)
has hit its ceiling [2]
but
increasing server performance doubles memory capacity demand
about every two years [2]
from Jacob mem systems 2007
Increasing the number of memory channels
Each DDR2 memory channel requires 240 pins
• introduce packet-based serial transmission (like in the PCI-E, SATA, SAS buses)
• introduce full buffering (registered DIMMs buffer only addresses)
• CRC error checking (cyclic redundancy check)
FB-DIMM technology (1)
Principle of operation
Figure: FB-DIMM memory architecture [4]
FB-DIMM technology (2)
Figure: Maximum supported FB-DIMM configuration [6] (6 channels/8 DIMMs)
• Serial transmission between the North Bridge and the DIMMs (each bit needs a pair of wires)
• Read packets (frames, bursts): 168 bits (12 x 14 bits)
• 144 data bits
(equals the number of data bits produced by a 72 bit wide DDR2 module (64 data bits + 8 ECC bits) in two memory cycles)
• 24 CRC bits.
• Every 12 cycles (that is every two memory cycles) constitute a packet.
• Write packets (frames, bursts): 120 bits (12 x 10 bits)
• 98 payload bits
• 22 CRC bits.
• Clocked at 6 x the double-pumped data rate
e.g. for a DDR2-667 DRAM the clock rate is: 6 x 667 MHz ≈ 4 GHz
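The frame arithmetic above can be checked with a short sketch. The function and constant names are illustrative only; the lane counts, cycle count and bit splits are the ones quoted in the bullets.

```python
# Sketch only: names are illustrative, figures are taken from the slide text.
CYCLES_PER_FRAME = 12  # link cycles (bit times) per frame


def frame_bits(lanes):
    """Bits carried by one frame: one bit per lane per link cycle."""
    return lanes * CYCLES_PER_FRAME


def link_rate_mhz(dram_data_rate_mhz):
    """Link clocking: 6 x the double-pumped DRAM data rate."""
    return 6 * dram_data_rate_mhz


read_bits = frame_bits(14)   # read frame: 14 lanes
write_bits = frame_bits(10)  # write frame: 10 lanes

# Payload + CRC must add up to the frame size.
assert read_bits == 144 + 24
assert write_bits == 98 + 22

print(read_bits, write_bits, link_rate_mhz(667))
```

For a 667 MHz data rate the link rate comes out at 4002 MHz, which the slide rounds to 4 GHz.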
FB-DIMM technology (3)
Implementation details (1)
• Number of serial links
• 14 read lanes (2 wires each)
• 10 write lanes (2 wires each)
• The 98 payload bits of a write frame consist of
• 2 frame type bits,
• 24 bits of command,
• 72 bits for data and commands, according to the frame type, e.g. 72 bits of data, 36 bits of data + one command, or two commands.
Commands
• row select, precharge, refresh, read, write etc.
• all commands include a 3-bit FB-DIMM module address to select one of 8 modules.
FB-DIMM technology (4)
Implementation details (2)
Read bandwidth:
One FB-DIMM channel transfers in one frame (that is in 12 cycles): 128 data bits + 16 ECC bits
One frame lasts 2 memory cycles
One DDR2 DIMM channel transfers in 2 memory cycles: 2 x 72 bits (2 x 64-bit data + 2 x 8-bit ECC)
The read bandwidth of an FB-DIMM channel
equals
the bandwidth of a DDR2 channel
Write bandwidth:
The write bandwidth of an FB-DIMM channel is up to 0.5 x the read bandwidth.
But FB-DIMMs allow simultaneous read and write operations
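A minimal sketch of the bandwidth comparison above. It assumes that a "memory cycle" here means one data transfer at the DRAM data rate, so a frame spans two transfers, and that a full write frame carries 72 bits, i.e. 64 data bits plus 8 ECC bits. The function name is hypothetical.

```python
def fbdimm_channel_bandwidth(data_rate_mts):
    """Return (read, write) peak data bandwidth in GB/s for one FB-DIMM channel.

    data_rate_mts: DRAM data rate in million transfers per second (e.g. 667).
    ECC bits are not counted as data.
    """
    # Assumption: one frame lasts two DRAM transfers.
    frame_time_ns = 2 * 1000.0 / data_rate_mts
    read_bytes = 128 // 8   # 128 data bits per read frame
    write_bytes = 64 // 8   # assumption: 64 data bits per write frame
    # bytes per nanosecond equals GB/s
    return read_bytes / frame_time_ns, write_bytes / frame_time_ns


read_gbs, write_gbs = fbdimm_channel_bandwidth(667)
print(f"read {read_gbs:.2f} GB/s, write {write_gbs:.2f} GB/s")
```

Under these assumptions the read figure equals the 8 bytes per transfer of a 64-bit DDR2 channel, and the write figure is half of it.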
FB-DIMM technology (5)
Implementation details (3)
Source: PCStats [6]
FB-DIMM-4300 (DDR2-533 SDRAM); Clock Speed: 133 MHz, Data Rate: 532 MHz, Throughput: 4300 MB/s
PC2-5300 (DDR2-667 SDRAM); Clock Speed: 167 MHz, Data Rate: 667 MHz, Throughput: 5300 MB/s
PC2-6400 (DDR2-800 SDRAM); Clock Speed: 200 MHz, Data Rate: 800 MHz, Throughput: 6400 MB/s
FB-DIMM data buffer
Figure: Different implementations of FB-DIMMs
FB-DIMM technology (6)
(Advanced Memory Buffer, AMB)
Manages the read/write operations of the module
(There are two Command/Address buses (C/A) to limit loads of 9 to 36 DRAMs)
Figure: Block diagram of the AMB [3]
Necessary routing to connect the north bridge to the DIMM socket
a) In case of a DDR2 DIMM (240 pins): a 3-layer PCB is needed
b) In case of an FB-DIMM (69 pins): a 2-layer PCB is needed (but a third layer is used for power lines)
Figure: PCB routing [4]
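The routing argument can be illustrated by comparing how many memory controller pins a given number of channels needs. This is a rough sketch using only the per-channel pin counts quoted in the slides (240 and 69); the function name is illustrative, and any pins outside those figures are ignored.

```python
DDR2_PINS_PER_CHANNEL = 240    # from the slide text
FBDIMM_PINS_PER_CHANNEL = 69   # from the slide text


def controller_pins(channels, pins_per_channel):
    """Pins needed on the controller side for the given number of channels."""
    return channels * pins_per_channel


for n in (1, 2, 4, 6):
    print(n,
          controller_pins(n, DDR2_PINS_PER_CHANNEL),
          controller_pins(n, FBDIMM_PINS_PER_CHANNEL))
```

With these figures, six FB-DIMM channels (414 pins) need fewer pins than two DDR2 channels (480 pins).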
FB-DIMM technology (7)
Figure: Latency and bandwidth figures of different DRAM technologies for a mix of SPEC applications [5]
FB-DIMM technology (8)
Advantages of FB-DIMMs vs DDR2 and DDR3 DIMMs
• more memory channels (up to 6), hence higher total bandwidth
• more DIMM modules (up to 8) per channel, hence higher memory capacity (up to 192 GB)
• fewer wires, hence simplified PCB routing
• simultaneous read/write operation in a channel
Disadvantages of FB-DIMMs vs DDR2 and DDR3 DIMMs
• higher latency and lower bandwidth figures for 4 to 8 DIMM modules
• higher cost
• higher dissipation
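The 192 GB capacity figure quoted in the list is consistent with the maximum configuration of 6 channels and 8 DIMMs per channel if each DIMM holds 4 GB. The 4 GB module size is an assumption made for this sketch and is not stated in the slides; the function name is illustrative.

```python
def max_capacity_gb(channels, dimms_per_channel, gb_per_dimm):
    """Total memory capacity of a fully populated configuration."""
    return channels * dimms_per_channel * gb_per_dimm


# Assumption: 4 GB per FB-DIMM (not given in the source).
print(max_capacity_gb(channels=6, dimms_per_channel=8, gb_per_dimm=4))
```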
FB-DIMM technology (9)
Pros and cons of FB-DIMMs
(Typical dissipation figures: DDR2: about 5 W; AMB: about 5 W; DDR2 FB-DIMM: about 10 W)
Latency
Latency is the potentially more troubling issue. Intel addressed it by not storing and then retransmitting the signals: the data travels along a special fast pass-through channel in the buffer itself. This avoids much of the latency that a store-and-forward architecture would induce.
Figure: FB-DIMM heat sinks (heat spreaders)
• 5/2006 Intel adopts it in its Bensley platform (5000) for DPs
• 9/2006 AMD removes it from its roadmap
• 8/2007 Sun introduces it in the Niagara II
• 9/2007 Intel uses it in the Caneland platform (7000) for MPs
• 2007 Major memory manufacturers intend to develop DDR3 DIMMs instead of DDR3-based FB-DIMMs
FB-DIMM technology (10)
Market penetration of the FB-DIMM technology
Standardisation
3/2007 JESD205 DDR2 SDRAM Fully Buffered DIMM (FBDIMM) Design Specification
DDR2-533, DDR2-667, DDR2-800; x72 ECC; 240 pin; 256 Mb, 512 Mb, 1 Gb, 2 Gb, 4 Gb devices
1/2007 JESD206 FBDIMM Architecture and Protocol
The key difference between DDR and DDR2 is that the DDR2 data bus is clocked at twice the speed of the memory cells, so four data words can be transferred in each memory cell cycle without speeding up the memory cells themselves.
DDR2 vs (SDRAM) DDR
FB-DIMM technology (11)
Figure: Clocking schemes of the SDR, DDR and DDR2 SDRAM technologies [1]
Although introduced in Q2 2003 at 200/266 MHz, DDR2 initially could not be competitive because of its high latency figures. As lower-latency parts became available by the end of 2004, DDR2 became widespread.
DDR2's bus frequency is boosted by electrical interface improvements, on-die termination, prefetch buffers and off-chip drivers. However, latency is greatly increased as a trade-off. The DDR2 prefetch buffer is 4 bits deep, whereas it is 2 bits deep for DDR (and 8 bits deep for DDR3). While DDR SDRAM has typical read latencies of between 2 and 3 bus cycles, early DDR2 may have read latencies between 4 and 6 cycles.
Memory            Timings    Latency    Bandwidth in dual-channel mode
DDR400 SDRAM      2.5–3–3    12.5 ns    6.4 GB/sec
DDR400 SDRAM      2–3–2      10 ns      6.4 GB/sec
DDR533 SDRAM      3–4–4      11.2 ns    8.5 GB/sec
DDR533 SDRAM      2.5–3–3    9.4 ns     8.5 GB/sec
DDR2-533 SDRAM    5–5–5      18.8 ns    8.5 GB/sec
DDR2-533 SDRAM    4–4–4      15 ns      8.5 GB/sec
DDR2-533 SDRAM    3–3–3      11.2 ns    8.5 GB/sec
DDR2-600 SDRAM    5–5–5      16.6 ns    9.6 GB/sec
DDR2-600 SDRAM    4–4–4      13.3 ns    9.6 GB/sec
Table: Burst timing, latency and bandwidth figures of DDR and DDR2 DRAM technologies [1]
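The bandwidth column of the table can be reproduced from the data rate alone. The sketch below assumes a 64-bit (8-byte) data bus per channel and two channels; the function name and default parameters are illustrative.

```python
def peak_bandwidth_gbs(data_rate_mts, channels=2, bus_bytes=8):
    """Peak bandwidth in GB/s: transfers per second x bytes per transfer x channels."""
    return data_rate_mts * bus_bytes * channels / 1000.0


for rate in (400, 533, 600):
    print(rate, round(peak_bandwidth_gbs(rate), 1))
```

The results (6.4, 8.5 and 9.6 GB/s) match the table. Since the figure depends only on the data rate, DDR and DDR2 parts of equal rating share the same peak bandwidth and differ only in latency.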
Early DDR2-533 SDRAM modules available at the time of the announcement of i925 and i915 chipsets (6/2004) had 4-4-4 timings (CAS Latency - RAS to CAS Delay - RAS Precharge Time).
CAS latency (Column Address Strobe latency, CL)
the delay (in number of clock cycles) between the moment a memory chip is accessed for data and the moment the first data bit becomes available
For instance, after accessing a 400 MHz CL3 device, the first bit arrives in 3 x 2.5 ns = 7.5 ns
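The conversion from CL cycles to nanoseconds used in the example above can be written as a one-line helper. The function name is illustrative; the clock is the bus clock, which for DDR-type memories is half the data rate (e.g. 200 MHz for DDR400).

```python
def cas_latency_ns(cl_cycles, clock_mhz):
    """CAS latency in ns: number of cycles x cycle time (1000 / clock in MHz)."""
    return cl_cycles * 1000.0 / clock_mhz


print(cas_latency_ns(3, 400))    # the 400 MHz CL3 example
print(cas_latency_ns(2.5, 200))  # DDR400 with CL 2.5 (200 MHz bus clock)
```

The second call reproduces the 12.5 ns latency listed for DDR400 with 2.5-3-3 timings.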
FB-DIMM technology ()
DDR2 DIMMs have 240 pins instead of the 184 pins used by DDR DIMMs
Power savings are achieved primarily due to a drop in operating voltage (1.8 V compared to DDR's 2.5 V).
Official JEDEC Specifications
                   DDR2               DDR3
Rated Speed        400-800 Mbps       800-1600 Mbps
Vdd/Vddq           1.8V +/- 0.1V      1.5V +/- 0.075V
Internal Banks     4                  8
Termination        Limited            All DQ signals
Topology           Conventional T     Fly-by
Driver Control     OCD Calibration    Self Calibration with ZQ
Thermal Sensor     No                 Yes (Optional)
Source: Anandtech
DDR3
Appeared mid 2007 e.g. in Intel’s P35 Bearlake
Source: Wiki
5.2. Speed gap between processor and memory (1a)
Figure 5.1a: DRAM types
DRAM types compared (numbered 1 to 6 in the figure, in order of increasing level of overlapping):
1 DRAM (Dynamic RAM): random access, typ. access time 60/70/80/100 ns; cycle time (60 ns); full burst timing (5-5-5-5)
2 FPM (Fast Page Mode DRAM): access to 4 subsequent columns; cycle time within a burst (for a 60 ns part) ~40 ns; full burst timing (5-7)-3-3-3 or (5-7)-4-4-4; examples: Triton I: 7-3-3-3, Triton III: 6-3-3-3
3 EDO (Extended Data Out DRAM): overlapping the read and address transfer operations; cycle time within a burst ~25 ns; full burst timing (5-7)-2-2-2; examples: Triton I: 7-2-2-2, Triton II, III: 6-2-2-2
4 BEDO (Burst mode EDO): internal 2-bit address generator, dual banks; cycle time within a burst ~15 ns; full burst timing 5-1-1-1; remark: developed by Micron
5 SDRAM (Synchronous DRAM): fully pipelined operation, assuming at least dual banks; 66/100/133 MHz; cycle time within a burst ~15/10/7.5 ns; full burst timing (5-7)-1-1-1; examples: Triton III: 7-1-1-1, 430 ZX: 7-1-1-1
6 DRDRAM (Direct Rambus DRAM): internal on-chip SRAM cache, page is filled in 1 clock cycle; 1-2 B wide data path; 256/300/356/400 MHz transfer rate; cycle time within a burst (4/3.3/2.8/2.5 ns); examples: 820, 840; remark: developed by Rambus
Further labels in the figure whose exact span is not recoverable from the extracted text: Asynchronous; Burst mode access (4*8B) on the same row (page); Up to 66 MHz bus frequency; Synchronous; Since 1996; Cached structure. The rows "Max. bandwidth MB/s" and "Effective bandwidth MB/s" carry no values in the extracted text.
5.2. Speed gap between processor and memory (1b)
Figure 5.1b: Latency of DRAM chips
[Figure: row access time tRAC (ns, axis 20 to 200) versus year (1981 to 2000). The labelled tRAC values fall from 200 ns to 30 ns (200, 150, 100, 80, 70, 60, 50, 30 ns). The time axis is annotated with processors (PC AT, 386 DX, 486 DX, P, PPro, PII, PIII), chipsets (430 NX, 450 KX/GX, 440 BX, 815) and typical DRAM parts (16 K to 256 M).]
tRAC: Row access time (time from row address until data valid)
5.2. Speed gap between processor and memory (1c)
Figure 5.1c: System-level memory latency in x86-based PCs
[Figure: system-level memory latency versus year (1981 to 2000) for the PC (8088), AT (286), 386 DX, 486 DX, P, PPro, PII, PIII and P4, plotted twice: latency in ns (linear axis, 100 to 500 ns) and latency in processor cycles (logarithmic axis, 1 to 1000). The individual data labels are not recoverable from the extracted text.]
5.2. Speed gap between processor and memory (1d)
Figure 5.1d: Latency of DRAM chips (in clock cycles)
[Figure: memory latency (cycles, axis 10 to 130) versus processor clock frequency fc (GHz, axis 0.5 to 4.0) for the 386, 486, Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4, with the memory types FPM, EDO, PC 66, PC 100, PC 133, DDR 266, DDR 333, DDR 400, DDR2 533, RDRAM-40 and RDRAM-60.]
Figure 5.2: Relative transfer rate of memories (D: dual channel)
[Figure: relative transfer rate Tmemory/fc (axis 0.10 to 1.00) versus processor clock frequency fc (GHz, axis 0.5 to 4.0) for the Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4, with the memory types FPM, EDO, PC-66, PC-100, PC-133, DDR 266, PC-800D, DDR 333, DDR 333D, DDR 400, DDR 400D and DDR 533D.]
5.2. Speed gap between processor and memory (2)
References
[1]: Gavrichenkov I., "DDR2 vs. DDR: Revenge Gained," Xbit Laboratories, 12/17/2004, http://www.xbitlabs.com/articles/memory/display/ddr2-ddr.html
[2]: Vogt P., "Fully Buffered DIMM (FB-DIMM) Server Memory Architecture," Intel Developer Forum, Feb. 18, 2004, http://www.idt.com/content/OSA_S008_FB-DIMM-Arch.pdf
[3]: McTague M. & David H., "Fully Buffered DIMM (FB-DIMM) Design Considerations," Intel Developer Forum, Feb. 18, 2004, http://www.idt.com/content/OSA-S009.pdf
[4]: Haas J. & Vogt P., "Fully-Buffered DIMM Technology Moves Enterprise Platforms to the Next Level," Technology Intel Magazine, March 2005, pp. 1-7
[5]: Ganesh B., Jaleel A., Wang D., Jacob B., "Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling," Proc. HPCA 2007
[6]: -, "Introducing FB-DIMM Memory: Birth of Serial RAM?," PCStats, Dec. 23, 2005, http://www.pcstats.com/articleview.cfm?articleid=1812&page=1
[7]: Haas J. & Vogt P., "Fully-Buffered DIMM Technology Moves Enterprise Platforms to the Next Level," Technology Intel Magazine, http://www.intel.com/technology/magazine/computing/fully-buffered-dimm-0305.htm