Design and Analysis of a Robust Pipelined Memory System

Hao Wang†, Haiquan (Chuck) Zhao*, Bill Lin†, and Jun (Jim) Xu*
†University of California, San Diego   *Georgia Institute of Technology

Infocom 2010, San Diego
Memory Wall
• Modern Internet routers need to manage large amounts of packet- and flow-level data at line rates
• e.g., need to maintain per-flow records during a monitoring period, but
– Core routers have millions of flows, translating to 100s of megabytes of storage
– On a 40 Gb/s OC-768 link, a new packet can arrive every 8 ns
Memory Wall
• SRAM/DRAM dilemma
• SRAM: access latency typically between 5 and 15 ns (fast enough for the 8 ns line rate)
• But SRAM capacity is substantially inadequate in many cases: typically 4 MB at most (much less than the 100s of MBs needed)
Memory Wall
• DRAM provides inexpensive bulk storage
• But random access latency is typically 50-100 ns (much slower than the 8 ns needed for a 40 Gb/s line rate)
• Conventional wisdom is that DRAMs are not fast enough to keep up with ever-increasing line rates
Memory Design Wish List
• Line rate memory bandwidth (like SRAM)
• Inexpensive bulk storage (like DRAM)
• Predictable performance
• Robustness to adversarial access patterns
Main Observation
• Modern DRAMs can be fast and cheap!
  – Graphics, video games, and HDTV
  – At commodity pricing, just $0.01/MB currently, $20 for 2 GB!
Memory Interleaving
• Performance achieved through memory interleaving
  – e.g., suppose we have B = 6 DRAM banks and the access pattern is sequential
  – Effective memory bandwidth is B times faster
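The B-fold speedup can be sketched in a few lines. This is an illustrative model of our own (bank index = (addr − 1) mod B), not the slide's exact mapping:

```python
# Sketch: with B interleaved banks, a sequential access pattern
# touches every bank once per round of B accesses.
B = 6
sequential = list(range(1, 19))            # addresses 1..18
banks = [(a - 1) % B for a in sequential]  # bank index per access

# each consecutive window of B accesses hits all B distinct banks,
# so the B banks work in parallel
for i in range(0, len(banks), B):
    assert len(set(banks[i : i + B])) == B

# a strided pattern 1, 7, 13, ... hits only bank 0 (the failure mode
# shown on the next slide)
bad = [1 + B * k for k in range(5)]
assert all((a - 1) % B == 0 for a in bad)
```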
[Figure: addresses interleaved across the 6 banks (bank 1 holds 1, 7, 13, …; bank 2 holds 2, 8, 14, …; …; bank 6 holds 6, 12, 18, …); the sequential pattern 1 2 3 4 5 6, 7 8 9 10 11 12 visits each bank once per round.]
Memory Interleaving
• But suppose the access pattern is as follows:
• Memory bandwidth degrades to the worst-case DRAM latency
[Figure: same six-bank layout, but the access pattern 1, 7, 13, 19, 25, … references only the first bank, so every access incurs the full DRAM latency.]
Memory Interleaving
• One solution is to apply pseudo-randomization of memory locations

[Figure: the six banks with addresses scattered pseudo-randomly across them.]
Adversarial Access Patterns
• However, memory bandwidth can still degrade to worst-case DRAM latency even with randomization:
1. Lookups to the same global variable will trigger accesses to the same memory bank
2. An attacker can flood packets with the same TCP/IP header, triggering updates to the same memory location and memory bank, regardless of the randomization function.
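Both points can be seen in a small sketch. The hash-based map below is our illustration (the design uses a random address permutation, not necessarily a hash), but the conclusion is the same for any fixed mapping:

```python
import hashlib

def bank_of(addr, B=6, salt=b"secret"):
    """Illustrative pseudo-random address-to-bank map
    (hash-based stand-in for a random address permutation)."""
    h = hashlib.sha256(salt + addr.to_bytes(8, "little")).digest()
    return int.from_bytes(h[:4], "little") % B

# A stride pattern no longer concentrates on one bank...
strided = [1 + 6 * k for k in range(100)]
assert len({bank_of(a) for a in strided}) > 1

# ...but repeated requests to the SAME address still hit the same
# bank, whatever the permutation -- the attack the slide describes.
assert len({bank_of(42) for _ in range(10)}) == 1
```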
Pipelined Memory Abstraction Emulates SRAM with Fixed Delay

[Figure: timing diagrams. An ideal SRAM services the operation sequence W(a), R, W(b), W(c), R, R issued at cycles 0-5, returning each read's data in the same cycle; the emulation accepts the same sequence of ops on its addr/data inputs but returns the read data at cycles D, D+1, …, D+5, i.e., exactly D cycles later.]
Implications of Emulation
• Fixed pipeline delay: if a read operation is issued at time t to an emulated SRAM, the data is available from the memory controller at exactly t + D (instead of in the same cycle).
• Coherency: The read operations output the same results as an ideal SRAM system.
Proposed Solution: Basic Idea
• Keep an SRAM reservation table of the memory operations and data that occurred in the last C cycles
• Avoid introducing a new DRAM operation for memory references to the same location within C cycles
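A minimal sketch of this idea (the names and data layout here are ours, not the paper's): remember the last C cycles of operations, and suppress a new DRAM request when the address was already referenced inside the window.

```python
from collections import deque

C = 4                        # window length, in cycles (toy value)
window = deque()             # (cycle, addr) pairs from the last C cycles
dram_requests = []           # DRAM operations actually issued

def access(cycle, addr):
    """Returns True if the reference is absorbed by the reservation
    table, False if a new DRAM request must be issued."""
    while window and window[0][0] <= cycle - C:
        window.popleft()     # expire entries older than C cycles
    hit = any(a == addr for _, a in window)
    if not hit:
        dram_requests.append((cycle, addr))
    window.append((cycle, addr))
    return hit

assert access(0, 7) is False   # first reference goes to DRAM
assert access(1, 7) is True    # repeat within C cycles: no new DRAM op
assert access(5, 7) is False   # outside the window: new DRAM request
```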
Details of Memory Architecture
[Figure: block diagram. Input operations (op, addr, data) pass through a random address permutation into a reservation table of C entries (each holding op, addr, data, and an R-link), with MRI and MRW tables (CAMs) of C entries tracking recent reads and writes; merged requests are queued in per-bank request buffers feeding B DRAM banks, and read data returns on the data-out path.]
Merging of Operations
• Requests arrive from right to left.
1. READ + WRITE → WRITE: the read copies its data from the write
2. WRITE + WRITE → WRITE: the 2nd write overwrites the 1st write
3. READ + READ → READ: the 2nd read copies its data from the 1st read
4. WRITE + READ → READ + WRITE: a write after a read cannot be merged; both are kept
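The merging rules can be encoded in a few lines. This is our own toy encoding (ops as tuples, with the older op first), meant only to make the rules concrete:

```python
def merge(old_op, new_op):
    """Merge two ops to the same address; old_op arrived first.
    Ops are ("R",) or ("W", data). Returns (remaining ops, data
    forwarded to the new read, if any)."""
    if old_op[0] == "W" and new_op[0] == "R":
        # READ + WRITE -> WRITE: the read copies data from the write
        return [old_op], old_op[1]
    if old_op[0] == "W" and new_op[0] == "W":
        # WRITE + WRITE -> WRITE: the 2nd write overwrites the 1st
        return [new_op], None
    if old_op[0] == "R" and new_op[0] == "R":
        # READ + READ -> READ: the 2nd read copies the 1st read's result
        return [old_op], "copy-of-first-read"
    # WRITE after READ cannot be merged: keep both
    return [old_op, new_op], None

assert merge(("W", 5), ("R",)) == ([("W", 5)], 5)
assert merge(("W", 5), ("W", 9)) == ([("W", 9)], None)
assert merge(("R",), ("W", 9)) == ([("R",), ("W", 9)], None)
```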
Proposed Solution
• Rigorously prove that with merging, the worst-case delay for a memory operation is bounded by some fixed D w.h.p.
• Provide a pipelined memory abstraction in which operations issued at time t are completed exactly at time t + D (instead of in the same cycle).
• The reservation table with C > D is also used to implement the pipeline delay, as well as serving as a "cache".
Robustness
• At most one write operation to a particular memory address enters a request buffer every C cycles.
• At most one read operation to a particular memory address enters a request buffer every C cycles.
• Hence at most one read operation followed by one write operation per address enters a request buffer every C cycles.
Theoretical Analysis
• Worst case analysis
• Convex ordering
• Large deviation theory
• Prove: with a cache of size C, the best an attacker can do is to send repetitive requests every C+1 cycles.
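The "every C+1 cycles" claim can be checked numerically in a toy model of our own (one address, deterministic merging over a C-cycle window; this illustrates the claim, it is not the paper's proof):

```python
def dram_requests_for_period(n, C, period):
    """Attacker requests one address every `period` cycles over n
    cycles; a request within C cycles of the previous *issued* one
    is merged away. Returns the DRAM requests actually generated."""
    issued, last = 0, None
    for t in range(0, n, period):
        if last is None or t - last > C:
            issued += 1
            last = t
    return issued

n, C = 10000, 100
# requesting every C+1 cycles beats both faster patterns (which
# get merged) and slower ones (which fit fewer requests)
best = dram_requests_for_period(n, C, C + 1)
assert best >= dram_requests_for_period(n, C, 50)
assert best >= dram_requests_for_period(n, C, 200)
```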
Bound on Overflow Probability
• Want to bound the probability that a request buffer overflows in n cycles:

  Pr[overflow] ≤ Σ_{s,t: 0 ≤ s < t ≤ n} Pr[D_{s,t}]

• X_{s,t} is the number of updates to a bank during cycles [s, t], D_{s,t} := {X_{s,t} ≥ K}, and K is the length of a request queue.
• For the total overflow probability bound, multiply by B.
Chernoff Inequality

  Pr[D_{s,t}] ≤ Pr[X_{s,t} ≥ K]

  Pr[X ≥ K] ≤ E[e^{θX}] / e^{θK}

• Since this is true for all θ > 0,

  Pr[D_{s,t}] ≤ min_{θ>0} E[e^{θX}] / e^{θK}

• We want to find the update sequence that maximizes E[e^{θX}]
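The Chernoff step can be made concrete with a toy arrival model. The binomial model below (i.i.d. Bernoulli arrivals to one bank, X ~ Binomial(n, p)) is our assumption for illustration; the paper's X_{s,t} and its worst-case distribution are more general:

```python
import math

def chernoff_bound(n, p, K):
    """min over theta of E[e^{theta X}] / e^{theta K} for
    X ~ Binomial(n, p), whose MGF is (1 - p + p e^theta)^n;
    computed in log space to avoid overflow."""
    best_log = 0.0                     # theta -> 0 gives the trivial bound 1
    for i in range(1, 500):
        theta = 0.01 * i               # grid search over theta > 0
        log_mgf = n * math.log1p(p * (math.exp(theta) - 1.0))
        best_log = min(best_log, log_mgf - theta * K)
    return math.exp(best_log)

def exact_tail(n, p, K):
    """Exact Pr[X >= K] for comparison."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(K, n + 1))

b = chernoff_bound(1000, 1/32, 60)    # e.g. n = 1000 cycles, p = 1/32, K = 60
e = exact_tail(1000, 1/32, 60)
assert e <= b <= 1.0                  # Chernoff upper-bounds the exact tail
```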
Worst Case Request Patterns
• q_1 + q_2 + 1 requests for distinct counters a_1, …, a_{q_1+q_2+1}
• q_1 requests repeat 2T times each
• q_2 requests repeat 2T − 1 times each
• 1 request repeats r times, with 2T·q_1 + (2T − 1)·q_2 + r equal to the total number of requests in the window
Evaluation
• Overflow probability for 16 million addresses, µ = 1/10, and B = 32.

[Figure: overflow probability bound (from 10^0 down to 10^−14) vs. queue length K (80 to 180), one curve each for C = 6000, 7000, 8000, and 9000.]

• SRAM 156 KB, CAM 24 KB
Evaluation
• Overflow probability for 16 million addresses, µ = 1/10, and C = 8000.

[Figure: overflow probability bound (from 10^0 down to 10^−30) vs. request buffer size K (80 to 180), one curve each for B = 32, 34, 36, and 38.]
Conclusion
• Proposed a robust memory architecture that provides the throughput of SRAM with the density of DRAM.
• Unlike conventional caching, which has unpredictable hit/miss performance, our design guarantees w.h.p. a pipelined memory abstraction that can support a new memory operation every cycle with a fixed pipeline delay.
• Used convex ordering and large deviation theory to rigorously prove robustness under adversarial accesses.