Top Banner
Memory Hierarchy Cache DRAM Flash Disk L: 0.5ns, C: 10MB L: 50ns, C: 100GB BW: 100GB/s L: 10us, C: 2TB BW: 2GB/s L: 10ms, C: 4TB BW: 600MB/s Latency, Capacity, Bandwidth Controlle r
28

Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto [email protected].

Dec 14, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Memory Hierarchy

Cache

DRAM

Flash

Disk

L: 0.5ns, C: 10MB

L: 50ns, C: 100GB

BW: 100GB/s

L: 10us, C: 2TB

BW: 2GB/s

L: 10ms, C: 4TB

BW: 600MB/s

Latency, Capacity, Bandwidth

Controller

Page 2: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

DRAM Primer<bank, row, column>

Page buffer per bank

Page 3: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

DRAM Characteristics

DRAM page crossing Charge ~10K DRAM cells and bitlines

Increase power & latency

Decrease effective bandwidth

Sequential access VS. random access Less page crossing

Lower power consumption

4.4x shorter latency

10x better BW

5

Page 4: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Take Away: DRAM = Disk

Page 5: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Embedded Controller

Opportunities for customization

Bad News

None available as in general purpose processor

Good News

Page 6: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Agenda

Overview

Multi-Port Memory Controller (MPMC)

Design

“Out-of-Core” Algorithmic Exploration

Page 7: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

9

Motivating Example: H.264 Decoder

6.4 9.6 1.2 164.8 0.09 31.0 156.7 94MB/s Dynamic latency,

BW and power

Diverse QoS requirements

Latency sensitive

Bandwidth sensitive

Page 8: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

10

Wanted

Bandwidth guarantee

Prioritized access

Reduced page crossing

Page 9: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Previous Works Bandwidth guarantee

• Q0: Distinguish bandwidth guarantee for different classes of ports

• Q1: Distinguish bandwidth guarantee for each port Q2: Prioritized access Q3: Residual bandwidth allocation Q4: Effective DRAM bandwidth

Q0 Q1 Q2 Q3 Q4

[Rixner,00][McKee,00][Hur,04] ✓

[Heighecker,03,05][Whitty,08] ✓ ✓ ✓

[Lee,05] ✓ ✓

[Burchard,05] ✓ ✓

Proposed BCBR ✓ ✓ ✓ ✓11

Page 10: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

12

Key Observations

Port locality: Same port requests

same DRAM page

Service time flexibility 1/24 second to decode a

video frame 4M cycles at 100 MHz for

request reordering

Residual bandwidth Statically allocated BW Underutilized at runtime

Weighted round robin: Minimum BW guarantee Busting service

Credit borrow & repay Reorder requests according

to priority

Dynamic BW calculation Capture and re-allocate

residual BW

Page 11: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

13

R20

T(Rij): arriving time of jth requests for Qi

Weighted Round Robin

Assume bandwidth requirement Q2: 30% Q1: 50% Q0: 20%

Request time:

Service time:

Clock:

Tround = 10

Time: scheduling cycles

0 1 2 3 4 5 6 7 8 9

T(R2)

Q2

T(R1)

Q1

T(R0)

Q0

R00

R20

R10

R01

R21

R11

R21

R22

R12

R22

R13R10

R14R11 R12 R13 R14

R00 R01

Page 12: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

14

Problem with WRR

Priority: Q0 > Q2

8 cycles of waiting time! Could be worse!

R20

Clock: 0 1 2 3 4 5 6 7 8 9

T(R2)

Q2

T(R1)

Q1

T(R0)

Q0

R00

R20

R10

R01

R21

R11

R21

R22

R12

R22

R13R10

R14R11 R12 R13 R14

R00 R01

Page 13: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

15

Borrow Credits

Zero Waiting time for Q0!

Clock: 0 1 2 3 4 5 6 7 8 9

T(R2)

Q2

T(R1)

Q1

T(R0)

Q0*R00

R20

R10

R01

R21

R11

R22

R12

R20

R00 R01

debtQ0 Q2 Q2Q2

borrow

Page 14: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

16

Repay Later

At Q0’s turn, BW guarantee is recovered

Clock: 0 1 2 3 4 5 6 7 8 9

T(R2)

Q2

T(R1)

Q1

T(R0)

Q0*R00

R20

R10

R01

R21

R11

R22

R12 R13R10

R14R11 R12 R13 R14

R00 R01

debtQ0 Q2 Q2Q2

R20

Q2Q2

Q2Q2

Q2Q2

Q2Q2

Q2Q2

R21 R22

Q2Q2

Q2

repay

Prioritized access!

Page 15: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

17

Problem: Depth of DebtQ DebtQ as residual BW collector

BW allocated to Q0 increases to: 20% + residual BW

Requirement for the depth of DebtQ0 decreasesClock: 0 1 2 3 4 5 6 7 8 9

T(R2)Q2

T(R1)Q1

T(R0)Q0*

R00

R20

R10

R01

R21

R11

R22

R12 R13

R10

R03

R11 R12 R13

R00 R01

debtQ0 Q2 Q2Q2

R20

Q2Q2

Q2Q2

Q2Q2

Q2Q2

Q2Q2

R21 R22

Q2Q2

Q2

Help repay

R03

Page 16: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

18

Evaluation Framework Simulation Framework

Workload: ALPBench suite DRAMSim: simulates DRAM latency+BW+power Reference schedulers: PQ, RR, WRR, BGPQ

Page 17: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Port 0 1 2 3 4

RR 1.08% 24% 24% 24% 24%PQ 0.73% 80% 18% 0% 0%BGPQ 1.07% 39% 20% 20% 20%WRR 0.76% 33% 22% 22% 22%BCBR 0.76% 33% 22% 22% 22%

19

Bandwidth Guarantee

Bandwidth guarantees: P0: 2% P1: 30% P2: 20% P3:20%

P4:20%

System residual: 8%No BW

guarantee

Provides BW guarantee!

Page 18: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

20

Cache Response Latency

Average 16x faster than WRR As fast as PQ (prioritized access)

Late

ncy

(ns)

Page 19: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

21

DRAM Energy & BW Efficiency

30% less page crossing (compared to RR) 1.4x more energy efficient 1.2x higher effective DRAM BW

As good as WRR (exploit port locality)

RR BGPQ WRR BCBR

GB/J 0.298 0.289 0.412 0.411

Act-Pre Ratio 29.6% 30.1% 23.0% 23.0%

Improvement 1.0x 0.97x 1.38x 1.38x

Page 20: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Hardware Cost

22

Xilinx MPMC: frontend + backend 3450 LUTs 5540 registers 1-9 BRAMs

BCBR + Speedy 3379 LUTs 2264 registers 4 BRAMs

BCBR: frontend 1393 LUTs 884 registers 0 BRAM

Reference backend: speedy DDRMC 1986 LUTs 1380 registers 4 BRAMs

Better performance without higher cost!

Page 21: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Agenda

Overview

Multi-Port Memory Controller (MPMC)

Design

“Out-of-Core” Algorithm / Architecture

Exploration

Page 22: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Idea

24

Remember DRAM=DISK

So let’sAsk the same questionPlug-on DRAM parametersGet DRAM-specific answers

Out-of-core algorithmsData does not fit DRAMPerformance dominated by IO

Key questionsReduce #IOsBlock granularity

Page 23: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Motivating Example: CDN

Caches in CDN Get closer to users

Save bandwidth

Zipf’s law 80-20 rule hit

rate

25

Page 24: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Video Cache

Page 25: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

27

Defining the KnobsTransaction

a number of column access commands enclosed by row activation / precharge

W: burst sizes : # bursts

Function of array organization & timing params

Function of array organization & timing params

Function of algorithmic parameters

Page 26: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

D-nary Heap

Algorithmic Design Variable:Branching Factor

Record Size

Page 27: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

B+ Tree

Page 28: Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu.

Lessons Learned

Optimal result can be beautifully

derived!

Big O does not matter in some cases Depending on data input characteristics