Worst Case Analysis of DRAM Latency in Multi-Requestor Systems
Zheng Pei Wu, Yogen Krish, Rodolfo Pellizzoni
Multi-Requestor Systems
CPU CPU CPU
Inter-connect
DRAM DMA I/O
INTERFERENCE!!!
Hard Real Time Systems Must be Predictable!!!
Multi-Requestor Systems
• Schedulability Analysis: needs WCET as input
• WCET depends on hardware platform
• WCET: needs Latency to access shared resource (e.g. cache, DRAM)
• Existing approaches can bound the interference but they assume the latency for DRAM access is constant
Problem: DRAM latency is variable and changes depending on its state
Contribution
CPU CPU CPU
Inter-connect
DRAM DMA I/O
Timing analysis that bounds the worst case latency for DRAM access
• Requestor Under Analysis: the requestor whose DRAM access latency is bounded
• Interfering Requestors: we do not know what they are doing, so we assume they cause the worst case interference
Outline
1. Background & Related Work
2. Memory Controller Model
3. Worst Case Latency Analysis
4. Results & Conclusion
Background
• The storage array contains the data; reads and writes can only access the row buffer
• Close request: a READ targets data in one row, but the row buffer contains data from a different row
– The front end generates the needed commands (P, A, R)
– The back end issues the commands on the command bus
– Pre-charge (PRE): store the data in the row buffer back into the array
– Activate (ACT): load the data from the array into the row buffer
– After the pre-charge command is issued on the command bus, ACT must wait until a timing constraint is satisfied
– Resulting sequence: P, A, R, Data
• Open request: a READ targets data already in the row buffer
– Only a Read command is needed, and it can be issued immediately
– Resulting sequence: R, Data
• The latency of a close request is much longer than the latency of an open request
• The latency of a memory access is variable!
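The command generation described above can be sketched as follows. This is only an illustration under stated assumptions, not the controller's actual logic: `Bank` and `commands_for` are hypothetical names, and the handling of an empty (already pre-charged) row buffer is an assumption that the slides do not cover.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bank:
    # Row currently held in the row buffer; None means the buffer is empty
    # (assumption: the slides only show a buffer holding some row).
    open_row: Optional[int] = None


def commands_for(bank: Bank, row: int, op: str = "R") -> List[str]:
    """Return the commands needed to serve a request under an open row policy."""
    if bank.open_row == row:
        cmds = [op]              # open request: data is already in the row buffer
    elif bank.open_row is None:
        cmds = ["A", op]         # assumed case: nothing to pre-charge
    else:
        cmds = ["P", "A", op]    # close request: pre-charge, activate, then access
    bank.open_row = row          # the row stays open after the access
    return cmds


bank = Bank(open_row=7)
print(commands_for(bank, 3))     # close request
print(commands_for(bank, 3))     # open request
```

The second call needs fewer commands than the first, which is the source of the variable latency noted above.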
Predictable Memory Controllers
• Close Row Policy
– After each access, the row buffer is automatically pre-charged (implicit pre-charge)
– Memory latency is the same for all requests, even when the next request targets the same bank
– Cannot take advantage of locality (row hits)
– Latency is much longer than that of an open request
• Interleaving Banks (Bank 1, Bank 2, Bank 3, Bank 4)
– Each request accesses data in multiple banks
– The accesses to the different banks can be pipelined
• Problem: requestors can close each other's row buffers since they can access all banks
• Thus the close row policy is used to make latency predictable, so the long latency of the close row policy still exists!
• Interleaving four banks is good for systems with a small DRAM data bus width (e.g. 16 bits)
• Larger data buses can transfer the same amount of data without interleaving so many banks
• Interleaving two banks for a wider data bus (e.g. 32 bits): time wasted!!
• Interleaving Problems:
1. Requestors can close each other's rows (interference)
2. Must be used with close row policy to make latency predictable
3. For a wider data bus, the effectiveness of interleaving is diminished
Predictable Memory Controllers
• Private Banks
Core 1 Core 2 DMA
Bank 1 Bank 2 Bank 3 Bank 4
• Banks can be partitioned among either requestors or tasks
• This can be done:
– In hardware, if the memory controller supports it
– By the compiler
– In the OS, using virtual memory
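A minimal sketch of bank partitioning among requestors is shown below. The round-robin assignment and the names `partition_banks` and `owner_of` are assumptions made for illustration; the slides do not say which requestor receives which bank.

```python
from typing import Dict, List


def partition_banks(requestors: List[str], num_banks: int) -> Dict[str, List[int]]:
    """Give every bank to exactly one requestor (round-robin, an assumed policy)."""
    if num_banks < len(requestors):
        raise ValueError("need at least one bank per requestor")
    assignment: Dict[str, List[int]] = {r: [] for r in requestors}
    for bank in range(1, num_banks + 1):          # banks numbered 1..num_banks
        owner = requestors[(bank - 1) % len(requestors)]
        assignment[owner].append(bank)
    return assignment


def owner_of(assignment: Dict[str, List[int]], bank: int) -> str:
    """Return the single requestor allowed to access the given bank."""
    for requestor, banks in assignment.items():
        if bank in banks:
            return requestor
    raise KeyError(bank)


banks = partition_banks(["Core 1", "Core 2", "DMA"], 4)
print(banks)
```

Because no bank has two owners, one requestor can never close a row that another requestor opened.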
Related Work
• AMC [1] and Predator [2]:
– Close Row Policy
– Interleaved Bank
• Conservative Open-Page [3]:
– Interleaved Bank
– Leave row open for a small window of time
• PRET DRAM Controller [4]:
– Close Row Policy
– Private Bank
Our Approach
• Private Bank
– eliminates row buffer interference from other requestors
• Open Row Policy
– reduces latency and takes advantage of the row hit ratio (locality)
Challenge:
1. Analysis is more complex
2. More than 20 timing constraints
3. Latency depends on the dynamic state of DRAM
Memory Controller Model
• Requestors: Core 1, Core 2, DMA
• Front End: command generator
• Back End: per-requestor buffers, global FIFO queue, command bus, data bus
• We ignore the CONSTANT front end delay and focus on the back end latency
• Each requestor has a private buffer for memory commands
• The global FIFO is used for arbitration
• The command at the head of each private buffer is inserted into the FIFO
• The controller scans the global FIFO from front to end for a command that can be issued
• Once a command is issued, the requestor's next command must wait until its timing constraints are satisfied before it can be inserted into the FIFO
• Intuitively, the arbitration is fair and is similar to a round robin policy
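The arbitration steps above can be sketched as a small cycle-by-cycle simulation. This is a simplified illustration, not the authors' controller: `simulate`, `can_insert`, and `can_issue` are hypothetical names, the two predicates stand in for the DRAM timing checks, and the rule of at most one queued command per requestor and one issued command per cycle is an assumption.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Check = Callable[[str, str, int], bool]   # (requestor, command, cycle) -> allowed?


def simulate(buffers: Dict[str, Deque[str]], can_insert: Check,
             can_issue: Check, max_cycles: int = 100) -> List[Tuple[int, str, str]]:
    """Return (cycle, requestor, command) for every issued command."""
    fifo: Deque[Tuple[str, str]] = deque()
    issued: List[Tuple[int, str, str]] = []
    for cycle in range(max_cycles):
        # Move the head of each private buffer into the global FIFO.
        queued = {req for req, _ in fifo}
        for req, buf in buffers.items():
            if buf and req not in queued and can_insert(req, buf[0], cycle):
                fifo.append((req, buf.popleft()))
        # Scan the FIFO from front to end for the first issuable command.
        for i, (req, cmd) in enumerate(fifo):
            if can_issue(req, cmd, cycle):
                del fifo[i]
                issued.append((cycle, req, cmd))
                break
        if not fifo and not any(buffers.values()):
            break
    return issued


always = lambda req, cmd, cycle: True     # placeholder: no timing constraints
trace = simulate({"Core 1": deque(["A", "R"]), "Core 2": deque(["W"]),
                  "DMA": deque(["R"])}, always, always)
print(trace)
```

With no timing constraints, Core 1's second command is served only after Core 2 and DMA have each issued one command, which matches the round-robin intuition.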
Worst Case Analysis
• Part 1 – Worst Case Single Request Latency Analysis (main contribution; works for any type of core)
– Inputs: total # of requestors, memory device parameters
– Output: latency for the different types of request (open read, close read, open write, close write)
• Part 2 – Cumulative Worst Case Execution Time (only provided for in-order cores)
– Inputs: the latencies from Part 1 and, for the task under analysis, the # of open reads, # of close reads, # of open writes, and # of close writes
– Output: WCET
• Assumption: we do not know about the activity of the other interfering requestors, so we assume those requestors produce the worst case pattern to cause maximum interference
Single Request Latency
• Decomposed into two parts:
– Arrival to Read/Write: from request arrival until the Read/Write command is inserted into the global FIFO
– Read/Write to Data: from the Read/Write command being inserted into the FIFO until the data has finished transmitting
• Arrival to Read/Write may include Pre-charge and ACT commands; its latency depends on the previous request (i.e., the state of the DRAM)
• Read/Write to Data latency does not depend on the state of the DRAM
• Both parts depend on the # of interfering requestors as well as the DRAM timing constraints
• We will focus on Read/Write to Data; for details on Arrival to Read/Write, refer to the paper
Read/Write to Data Latency
• Read to Read has no timing constraints, only contention on the data bus
• The same holds for Write to Write
• Write to Read and Read to Write each impose a timing constraint
• Therefore, an alternation of read and write commands produces longer latency
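The effect of alternation can be illustrated with a deliberately simplified model. All names and numbers here are placeholders chosen for the example: `t_burst`, `t_wtr`, and `t_rtw` are hypothetical cycle counts, not values from a DRAM datasheet or from the paper, and the model ignores every constraint except the switching penalty.

```python
def data_bus_time(seq: str, t_burst: int, t_wtr: int, t_rtw: int) -> int:
    """Total cycles to serve a sequence of 'R'/'W' commands back to back.

    Same-type neighbours only contend for the data bus (one burst each);
    a switch in direction adds the corresponding penalty before the burst.
    """
    total = 0
    prev = None
    for cmd in seq:
        if prev == "W" and cmd == "R":
            total += t_wtr       # write-to-read constraint
        elif prev == "R" and cmd == "W":
            total += t_rtw       # read-to-write constraint
        total += t_burst
        prev = cmd
    return total


# Placeholder timing values, in cycles.
grouped = data_bus_time("RRWW", t_burst=4, t_wtr=6, t_rtw=2)
alternating = data_bus_time("RWRW", t_burst=4, t_wtr=6, t_rtw=2)
print(grouped, alternating)
```

Under these placeholder values the alternating sequence takes longer than the grouped one even though both contain two reads and two writes.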
• Interference on a Write command
– All other requestors insert R/W commands ahead of it in the FIFO to create maximum interference
– A write command could have finished immediately before t0
– This further delays the first Read command
Worst Case Analysis
• Part 2 – Cumulative Worst Case Execution Time (only provided for in-order cores)
Cumulative Latency
• The task under analysis issues a sequence of requests over time t, each one an open read, close read, open write, or close write
• If the worst case request order is known, we can sum the latency of each request
• The worst case request order depends on input values, code path, cache state, etc.
• Static analysis tools can be used to obtain a safe bound for the # of each type of request
• Which pattern leads to the worst case latency? This problem can be solved in constant time; see the paper for details
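For the case where the request order is known, the summation can be sketched as below. The latency values are placeholders invented for the example, not results from the paper, and `cumulative_latency` is a hypothetical helper; the constant-time method for an unknown order is described in the paper and is not reproduced here.

```python
from typing import Dict, Sequence


def cumulative_latency(order: Sequence[str], latency: Dict[str, float]) -> float:
    """Sum the per-request worst case latency over a known request order."""
    return sum(latency[req] for req in order)


# Placeholder worst case latencies in ns (NOT taken from the paper).
LATENCY = {
    "open_read": 10.0,
    "close_read": 40.0,
    "open_write": 12.0,
    "close_write": 45.0,
}

order = ["close_read", "open_read", "open_read", "close_write"]
print(cumulative_latency(order, LATENCY))
```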
Outline
1. Background & Related Work
2. Memory Controller Model
3. Worst Case Latency Analysis
– Single Request Latency
– Cumulative Latency
4. Results & Conclusion
Results
• Comparison against Analyzable Memory Controller [1]
– Chosen because it uses fair arbitration (Round Robin), which is similar to our approach
• Synthetic Benchmarks
– Used to show how worst case latency varies as parameters are changed
• CHStone Benchmarks
– Memory traces are obtained from the gem5 simulator
– Memory traces are used as input to the worst case analysis
Results
• Synthetic Benchmarks
[Figures: synthetic benchmark results]
Results
• As memory devices become faster, the difference between open and close accesses grows, and therefore the close row policy becomes too pessimistic

Device          800D    1066F   1333H   1600K   1866L   2133N   % better
AMC (64 bits)   185     185.27  180.9   178     169.84  163     11.89%
Our (64 bits)   125.2   112.47  104.85  102.18  96.97   92.85   25.84%

Latencies in ns. 50% Row Hit Ratio, 4 Requestors, 20% Writes
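The "% better" column appears to be the relative latency reduction from the slowest device (800D) to the fastest (2133N); that reading is an inference from the numbers, not something the slide states. A short check using the values from the results table, with `percent_better` as a hypothetical helper name:

```python
DEVICES = ["800D", "1066F", "1333H", "1600K", "1866L", "2133N"]
AMC = [185, 185.27, 180.9, 178, 169.84, 163]          # ns, from the results table
OUR = [125.2, 112.47, 104.85, 102.18, 96.97, 92.85]   # ns, from the results table


def percent_better(latencies):
    """Relative reduction (in %) from the first to the last device."""
    slowest, fastest = latencies[0], latencies[-1]
    return 100 * (slowest - fastest) / slowest


print(round(percent_better(AMC), 2))   # 11.89
print(round(percent_better(OUR), 2))   # 25.84

# Per-device reduction of our bound relative to AMC.
for device, amc, our in zip(DEVICES, AMC, OUR):
    print(device, round(100 * (amc - our) / amc, 2))
```

The two printed percentages reproduce the table's last column, which supports the interpretation above.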
Results
• CHStone Benchmarks for 64-bit bus
[Figure: CHStone benchmark results]
Conclusion
• A novel worst case analysis that takes dynamic state into account
• Open row policy can reduce memory latency as devices become faster
• Private bank scheme is used to eliminate row buffer interference from other requestors
Future Work
• Discussion of shared data
• Bus utilization is still poor due to read/write switching
• Read/Write optimization to reduce latency bound
• Handle multiple ranks
• Implementation in hardware
References
[1] M. Paolieri, E. Quiñones, F. Cazorla, and M. Valero, “An Analyzable Memory Controller for Hard Real-Time CMPs,” Embedded Systems Letters, IEEE, vol. 1, no. 4, pp. 86–90, 2009.
[2] B. Akesson, K. Goossens, and M. Ringhofer, “Predator: a predictable SDRAM memory controller,” in CODES+ISSS, 2007, pp. 251–256.
[3] S. Goossens, B. Akesson, and K. Goossens, “Conservative Open-page Policy for Mixed Time-Criticality Memory Controllers,” in DATE, 2013.
[4] J. Reineke, I. Liu, H. D. Patel, S. Kim, and E. A. Lee, “PRET DRAM controller: Bank privatization for predictability and temporal isolation,” in CODES+ISSS, 2011, pp. 99–108.