Worst Case Analysis of DRAM Latency in Multi-Requestor Systems

Zheng Pei Wu, Yogen Krish, Rodolfo Pellizzoni

Multi-Requestor Systems

[Figure: three CPUs, DRAM, DMA and I/O connected through a shared interconnect]

• Requestors that share the interconnect and the DRAM interfere with each other

• Hard real-time systems must be predictable


• Schedulability Analysis: needs WCET as input

• WCET depends on hardware platform

• WCET: needs Latency to access shared resource (e.g. cache, DRAM)

• Existing approaches can bound the interference but they assume the latency for DRAM access is constant

Problem: DRAM latency is variable and changes depending on the state of the DRAM

Contribution

[Figure: CPUs, DRAM, DMA and I/O on a shared interconnect; one requestor is the requestor under analysis, the others are interfering requestors]

• Timing analysis that bounds the worst-case latency of a DRAM access for the requestor under analysis

• We do not know what the interfering requestors are doing, so we assume they cause the worst-case interference

Outline

1. Background & Related Work

2. Memory Controller Model

3. Worst Case Latency Analysis

4. Results & Conclusion

Background

• The storage array contains the data

• Reads and writes can only access the row buffer

• Example: a READ targets data in one row while the row buffer contains data from a different row

• The front end generates the needed commands for the READ: P, A, R

• The back end issues the commands on the command bus

• Pre-charge (PRE, P): stores the data in the row buffer back into the array

• Activate (ACT, A): loads the data from the array into the row buffer

• After the pre-charge command is issued on the command bus, a timing constraint must elapse before ACT can be issued

[Timeline: P, A, R, Data]

• If a READ targets data that is already in the row buffer, only the read command is needed, and it can be issued immediately

[Timeline: close request P, A, R, Data; open request R, Data]

• The latency of a close request is much longer than the latency of an open request

• The latency of a memory access is variable
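The gap between the two request types can be illustrated numerically. The following is a minimal sketch, assuming made-up timing values in clock cycles (T_RP, T_RCD, T_CL and T_BUS are hypothetical names and numbers, not taken from the slides or from a datasheet) and ignoring every other timing constraint:

```python
# Illustrative timing parameters in clock cycles (hypothetical values,
# not taken from the slides or from any specific device datasheet).
T_RP = 10    # pre-charge (P) to activate (A)
T_RCD = 10   # activate (A) to read (R)
T_CL = 10    # read command to first data on the data bus
T_BUS = 4    # cycles the data occupies the data bus


def open_request_latency():
    """Target row is already in the row buffer: only R is needed."""
    return T_CL + T_BUS


def close_request_latency():
    """A different row is in the row buffer: P, then A, then R."""
    return T_RP + T_RCD + open_request_latency()


print("open request :", open_request_latency(), "cycles")
print("close request:", close_request_latency(), "cycles")
```

With these assumed values the close request takes more than twice as long as the open request, which is the variability the analysis has to account for.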

Predictable Memory Controllers

• Close Row Policy:
– After each access, the row buffer is automatically pre-charged

[Timeline: A, R, Data, implicit pre-charge, then A, R, Data for the next request, which targets the same bank]

• Memory latency is the same for all requests
– Cannot take advantage of locality (row hits)
– Latency is much longer than that of an open request

• Interleaving Banks (Bank 1, Bank 2, Bank 3, Bank 4)

[Timeline: A, R and Data for each of the four banks, overlapping in time]

– Each request accesses data in multiple banks, so the data transfers can be pipelined

– Problem: requestors can close each other's row buffers, since they can all access every bank

– Thus the close row policy is used to make latency predictable, and the long latency of the close row policy remains

– This works well for systems with a narrow DRAM data bus (e.g. 16 bits); wider data buses can transfer the same amount of data without interleaving so many banks


• Interleaving Banks (Bank 1, Bank 2)

[Timeline: interleaving two banks for a wider data bus (e.g. 32 bits), with an idle interval marked "Time Wasted"]

Interleaving problems:
1. Requestors can close each other's rows (interference)
2. Must be used with the close row policy to make latency predictable
3. For wider data buses, the effectiveness of interleaving is diminished
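The link between bus width and the number of interleaved banks can be sketched with a short calculation. This is only an illustration, assuming a 64-byte request and a burst length of 8 (both are assumptions, not values stated on the slides), with one burst served per bank:

```python
def banks_per_request(request_bytes, bus_width_bits, burst_length):
    """Number of bursts (one per bank) needed to transfer one request."""
    bytes_per_burst = (bus_width_bits // 8) * burst_length
    # Ceiling division: a partial burst still occupies a whole bank access.
    return -(-request_bytes // bytes_per_burst)


# Assumed request size (64 bytes) and burst length (8).
for width in (16, 32, 64):
    print(width, "bit bus ->", banks_per_request(64, width, 8), "bank(s)")
```

Under these assumptions a 16-bit bus spreads a request over four banks and a 32-bit bus over two, while a 64-bit bus needs only one, leaving nothing to interleave.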


• Private Banks

[Figure: Core 1, Core 2 and DMA each mapped to their own banks among Bank 1 to Bank 4]

• Banks can be partitioned among either requestors or tasks

• This can be done:
– In hardware, if the memory controller supports it
– By the compiler
– In the OS, using virtual memory
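As a rough illustration of the OS-based option, the sketch below filters physical pages by the bank they fall into. The address layout (BANK_SHIFT, BANK_BITS) and the bank ownership table are hypothetical; real controllers use their own address mappings:

```python
# Hypothetical address layout: 2 bank bits starting at address bit 13.
BANK_SHIFT = 13
BANK_BITS = 2

# Hypothetical ownership: which bank indices each requestor may use.
OWNER = {"core1": {0}, "core2": {1}, "dma": {2, 3}}


def bank_of(phys_addr):
    """Extract the bank index from a physical address."""
    return (phys_addr >> BANK_SHIFT) & ((1 << BANK_BITS) - 1)


def pages_for(requestor, page_addrs):
    """Keep only the pages that lie in banks owned by the requestor."""
    return [a for a in page_addrs if bank_of(a) in OWNER[requestor]]


# Eight consecutive 4 KiB pages.
pages = [i * 0x1000 for i in range(8)]
print([hex(a) for a in pages_for("core2", pages)])
```

An allocator that hands each requestor only pages from its own list keeps every requestor out of the other requestors' banks.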


Related Work

• AMC [1] and Predator [2]:
– Close Row Policy
– Interleaved Bank

• Conservative Open-Page [3]:
– Interleaved Bank
– Leaves the row open for a small window of time

• PRET DRAM Controller [4]:
– Close Row Policy
– Private Bank

Our Approach

• Private Bank
– Eliminates row buffer interference from other requestors

• Open Row Policy
– Reduces latency and takes advantage of the row hit ratio (locality)

Challenges:
1. Analysis is more complex
2. More than 20 timing constraints
3. Latency depends on the dynamic state of the DRAM






Memory Controller Model

[Figure: memory controller block diagram showing Core 1, Core 2 and DMA, the front end, the command generator, per-requestor buffers, the global FIFO queue, the back end, the command bus and the data bus]

• We ignore the constant front end delay and focus on the back end latency



• Each requestor has a private buffer for memory commands

• A global FIFO is used for arbitration



• The command at the head of each private buffer is inserted into the FIFO



• The controller scans the global FIFO from front to end for a command that can be issued



• Once a command is issued, the requestor's next command must wait until its timing constraints are satisfied before it can be inserted into the FIFO

• Intuitively, the arbitration is fair and is similar to a round robin policy
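The arbitration described above can be sketched as a small simulation. This is a simplified illustration, not the controller from the paper: each command carries a hypothetical `ready` cycle that stands in for all timing constraints, and the constraint is checked when the FIFO is scanned rather than at insertion time. All names are made up for the example:

```python
from collections import deque


def run(buffers, cycles):
    """Issue at most one command per cycle and return the issue order."""
    fifo = deque()   # global FIFO of (requestor, command, ready_cycle)
    queued = set()   # requestors that currently have a command in the FIFO
    issued = []
    for cycle in range(cycles):
        # The command at the head of each private buffer enters the FIFO,
        # at most one command per requestor at a time.
        for name, buf in buffers.items():
            if buf and name not in queued:
                cmd, ready = buf.popleft()
                fifo.append((name, cmd, ready))
                queued.add(name)
        # Scan from the front for the first command that can be issued.
        for entry in fifo:
            name, cmd, ready = entry
            if ready <= cycle:
                fifo.remove(entry)
                queued.discard(name)
                issued.append((cycle, name, cmd))
                break
    return issued


# Private buffers: (command, hypothetical earliest issue cycle).
example = {
    "core1": deque([("A", 0), ("R", 0)]),
    "core2": deque([("W", 2)]),
    "dma": deque([("R", 0)]),
}
order = run(example, 6)
print(order)
```

In this run the write from core2 is not ready at cycle 1, so the scan skips it and issues the DMA read behind it; core1's second command re-enters at the back of the FIFO, which is what gives the scheme its round-robin flavour.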






Worst Case Analysis

[Figure: analysis flow in two parts]

Part 1 – Main Contribution: Worst Case Single Request Latency Analysis
– Inputs: total number of requestors, memory device parameters
– Outputs: latency for the different types of request (open read, close read, open write, close write)
– Works for any type of core

Part 2 – Only provided for in-order cores: Cumulative Worst Case Execution Time
– Inputs: the per-type latencies from Part 1 and, for the task under analysis, the number of open reads, close reads, open writes and close writes
– Output: WCET

Assumption: we do not know the activity of the other interfering requestors, so we assume those requestors produce the worst-case pattern to cause maximum interference

Single Request Latency

[Timeline: request arrival, P, A, R/W, Data]

• The latency is decomposed into two parts:
– Arrival to Read/Write: from request arrival until the read/write command is inserted into the global FIFO
– Read/Write to Data: from the read/write command being inserted into the FIFO until the data has finished transmitting

• The Arrival to Read/Write part may include pre-charge and ACT commands; its latency depends on the previous request (i.e., the state of the DRAM)

• The Read/Write to Data latency does not depend on the state of the DRAM

• Both parts depend on the number of interfering requestors as well as the DRAM timing constraints

• We will focus on the Read/Write to Data part; for details on the Arrival to Read/Write part, refer to the paper

Read/Write to Data Latency

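[Timeline: three consecutive reads, each followed by its data transfer]

• Read to read has no timing constraint, only contention on the data bus

• The same holds for write to write

[Timeline: a read followed by writes, showing the gaps between the data transfers]

• Write to read is subject to a timing constraint

• Read to write is subject to a timing constraint

• Therefore, an alternation of read and write commands produces a longer latency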
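The effect of alternation can be shown with a toy model. The sketch below assumes that every command occupies the data bus for T_BUS cycles and that a switch of direction adds a fixed gap; T_BUS, T_RTW and T_WTR are hypothetical values, and real devices impose more detailed constraints:

```python
T_BUS = 4   # cycles each data transfer occupies the data bus (assumed)
T_RTW = 2   # extra gap when a write follows a read (assumed)
T_WTR = 6   # extra gap when a read follows a write (assumed)


def data_finish_time(commands):
    """Cycle at which the last data transfer in the sequence completes."""
    time = 0
    prev = None
    for cmd in commands:
        if prev == "R" and cmd == "W":
            time += T_RTW
        elif prev == "W" and cmd == "R":
            time += T_WTR
        time += T_BUS
        prev = cmd
    return time


for pattern in ("RRRR", "WWRR", "RWRW"):
    print(pattern, data_finish_time(pattern))
```

With these numbers four reads finish after 16 cycles, a single switch costs 22 cycles, and the fully alternating pattern costs 26, so alternation is the pattern an adversarial set of requestors would choose.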


• Interference on a write command

[Figure: global FIFO holding alternating R and W commands from its front, with the resulting command and data bus timeline]

• All other requestors insert R/W commands to create the maximum interference


• A write command could have finished immediately before t0


• Therefore, the first read command is delayed further


Cumulative Latency

[Timeline: the task under analysis issues a sequence of open read, close read, open write and close write requests over time t]

• If the worst-case request order is known, we can sum the latency of each request

• The worst-case request order depends on input values, code path, cache state, etc.

• Static analysis tools can be used to obtain a safe bound on the number of each type of request

• Which pattern leads to the worst-case latency? This problem can be solved in constant time; see the paper for details
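The known-order case can be written down directly. The sketch below only covers that case: it sums a fixed per-type latency over a given request order. The latency values are hypothetical, and the paper's treatment of an unknown order is not reproduced here:

```python
# Hypothetical per-type worst-case latencies in nanoseconds.
LATENCY_NS = {
    "open_read": 40,
    "close_read": 100,
    "open_write": 45,
    "close_write": 105,
}


def cumulative_latency(order, latency=LATENCY_NS):
    """Sum the latency of each request in a known request order."""
    return sum(latency[req] for req in order)


order = ["close_read", "open_read", "open_read", "close_write"]
print(cumulative_latency(order), "ns")
```

When only the counts of each request type are available, as from static analysis, the order has to be chosen adversarially, which is the question the last bullet refers to.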




Results

• Comparison against the Analyzable Memory Controller (AMC) [1]
– AMC uses fair arbitration (round robin), which is similar to our approach

• Synthetic Benchmarks
– Used to show how the worst-case latency varies as parameters are changed

• CHStone Benchmarks
– Memory traces are obtained from the gem5 simulator
– Memory traces are used as input to the worst-case analysis

Results

• Synthetic Benchmarks

[Figures: two slides of synthetic benchmark plots]

Results

• As memory devices become faster, the difference between open and close accesses grows, and therefore the close row policy becomes too pessimistic

Latency in ns (50% row hit ratio, 4 requestors, 20% writes)

Device          800D     1066F    1333H    1600K    1866L    2133N    % better (800D to 2133N)
AMC (64 bits)   185      185.27   180.9    178      169.84   163      11.89%
Our (64 bits)   125.2    112.47   104.85   102.18   96.97    92.85    25.84%
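The "% better" column appears to be the latency reduction from the slowest device (800D) to the fastest (2133N). The sketch below recomputes it from the table under that assumption and reproduces both values; the function name is made up for the example:

```python
devices = ["800D", "1066F", "1333H", "1600K", "1866L", "2133N"]
amc = [185, 185.27, 180.9, 178, 169.84, 163]          # ns, from the table
ours = [125.2, 112.47, 104.85, 102.18, 96.97, 92.85]  # ns, from the table


def percent_better(latencies):
    """Latency reduction from the first to the last device, in percent."""
    return round((latencies[0] - latencies[-1]) / latencies[0] * 100, 2)


print("AMC:", percent_better(amc), "%")
print("Our:", percent_better(ours), "%")

# Gap between the two controllers on each device, in ns.
for dev, a, o in zip(devices, amc, ours):
    print(dev, round(a - o, 2))
```

The open row bound improves by about a quarter across the device range while the close row bound improves by about an eighth, which is the trend the bullet above describes.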

Results

• CHStone Benchmarks for a 64-bit bus

[Figure: CHStone benchmark results]

Conclusion

• A novel worst-case analysis that takes the dynamic state of the DRAM into account

• The open row policy can reduce memory latency as devices become faster

• A private bank scheme is used to eliminate row buffer interference from other requestors

Future Work

• Discussion of shared data

• Bus utilization is still poor due to read/write switching

• Read/write optimization to reduce the latency bound

• Handling multiple ranks

• Implementation in hardware

References

[1] M. Paolieri, E. Quiñones, F. Cazorla, and M. Valero, “An Analyzable Memory Controller for Hard Real-Time CMPs,” Embedded Systems Letters, IEEE, vol. 1, no. 4, pp. 86–90, 2009.

[2] B. Akesson, K. Goossens, and M. Ringhofer, “Predator: a predictable SDRAM memory controller,” in CODES+ISSS, 2007, pp. 251–256.

[3] S. Goossens, B. Akesson, and K. Goossens, “Conservative Open-page Policy for Mixed Time-Criticality Memory Controllers,” in DATE, 2013.

[4] J. Reineke, I. Liu, H. D. Patel, S. Kim, and E. A. Lee, “PRET DRAM controller: Bank privatization for predictability and temporal isolation,” in CODES+ISSS, 2011, pp. 99–108.
