System-Level Memory Bus Power And Performance Optimization for Embedded Systems Ke Ning [email protected] David Kaeli [email protected]

Jan 13, 2016

Page 1: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Ke Ning [email protected]

David Kaeli [email protected]

Page 2: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Why Is Power More Important?

“Power: A First Class Design Constraint for Future Architecture” – Trevor Mudge 2001

Increasing complexity for higher performance (MIPS)
    Parallelism, pipeline, memory/cache size
    Higher clock frequency, larger die size
    Rising dynamic power consumption

CMOS process continues to shrink
    Smaller size logic gates reduce Vthreshold
    Lower Vthreshold causes higher leakage
    Leakage power will exceed dynamic power

Things are getting worse in embedded systems
    Low power and low cost systems
    Fixed or limited applications/functionalities
    Real-time systems with timing constraints

Page 3: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Power Breakdown of An Embedded System

[Figure: pie chart of system power by component: Internal Dynamic, Internal Leakage, RTC, PPI, SPORT0, SPORT1, UART, SDRAM. Callout: Research Target.]

Operating conditions: 25°C; 1.2V internal; 400MHz CCLK Blackfin processor; 3.3V external; 133MHz SDRAM; 27MHz PPI

Source: Analog Devices Inc.

Page 4: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Introduction

Related work on microprocessor power
    Low power design trend
    Power metrics
    Power performance tradeoffs
    Power optimization techniques

Power estimation framework
    Experimental framework built from Blackfin cycle accurate simulator
    Validated through a Blackfin EZKit board

Power aware bus arbitration

Memory page remapping

Page 5: System-Level Memory Bus Power And Performance Optimization for Embedded Systems


Outline

Research Motivation and Introduction

Related Work

Power Estimation Framework

Optimization I – Power-Aware Bus Arbitration

Optimization II – Memory Page Remapping

Summary

Page 6: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Power Modeling

Dynamic power estimation
    Instruction level model: [Tiwari94], JouleTrack [Sinha01]
    Function level model: [Qu00]
    Architecture model: Cai-Lim Model, TEMPEST [CaiLim99], Wattch [Brooks00], Simplepower [Ye00]

Static power estimation
    Butts-Sohi model [Butts00]

Previous memory system power estimation
    Activity model: CACTI [Wilton96]
    Trace driven model: Dinero IV [Elder98]

Page 7: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

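Power Equation

$$P = \underbrace{A\,C\,V_{DD}^{2}\,f}_{\text{dynamic}} + \underbrace{V_{DD}\,N\,k_{design}\,I_{leakage}}_{\text{leakage}}$$

$A$: Activity Factor
$C$: Total Capacitance
$V_{DD}$: Voltage
$f$: Frequency
$N$: Transistor Number
$k_{design}$, $I_{leakage}$: Technology factor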
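As a rough illustration of the power equation on this slide, the sketch below evaluates the dynamic and leakage terms separately. The function names are hypothetical, and any numbers passed in are placeholders rather than measurements from the Blackfin platform.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """Dynamic term A * C * Vdd^2 * f, in watts."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


def leakage_power(vdd_v, transistors, k_design, i_leak_a):
    """Leakage term Vdd * N * k_design * I_leakage, in watts."""
    return vdd_v * transistors * k_design * i_leak_a


def total_power(activity, capacitance_f, vdd_v, freq_hz,
                transistors, k_design, i_leak_a):
    """Sum of the dynamic and leakage terms, in watts."""
    return (dynamic_power(activity, capacitance_f, vdd_v, freq_hz)
            + leakage_power(vdd_v, transistors, k_design, i_leak_a))
```

Because the dynamic term is quadratic in the supply voltage while the leakage term is linear, halving the voltage in this model cuts dynamic power to a quarter but leakage power only to a half.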

Page 8: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Common Power Optimization Techniques

Gating (turn off unused components)
    Clock gating
    Voltage gating: Cache decay [Hu01]

Scaling (scale the operating point of a component)
    Voltage scaling: Drowsy cache [Flautner02]
    Frequency scaling: [Pering98]
    Resource scaling: DRAM power mode [Delaluz01]

Banking (break a single component into smaller sub-units)
    Vertical sub-banking: Filter cache [Kin97]
    Horizontal sub-banking: Scratchpad [Kandemir01]

Clustering (partition components into clusters)

Switching reduction (redesigning with lower activity)
    Bus encoding: Permutation Code [Mehta96], redundant code [Stan95, Benini98], WZE [Musoll97]

Page 9: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Power Aware Figure of Merit

Delay, D
    Performance, MIPS

Power, P
    Battery life (mobile), packaging (high performance)

Obvious choice for power performance tradeoff, PD
    Joules/instruction, inversely MIPS/W
    Energy figure
    Mobile / low power applications

Energy Delay, PD^2
    MIPS^2/W [Gonzalez96]

Energy Delay Square, PD^3
    MIPS^3/W
    Voltage and frequency independent

More generically, MIPS^m/W

Page 10: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Power Optimization Effect on Power Figure

Most optimization schemes sacrifice performance for lower power consumption; switching reduction is the exception.

All optimization schemes achieve higher power efficiency.

All optimization schemes increase hardware complexity.

Page 11: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Outline

Research Motivation and Introduction

Related Work

Power Estimation Framework

Optimization I – Power-Aware Bus Arbitration

Optimization II – Memory Page Remapping

Summary

Page 12: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

External Bus

External Bus Components
    Typically an off-chip bus
    Includes: Control Bus, Address Bus, Data Bus

External Bus Power Consumption
    Dynamic power factors: activity, capacitance, frequency, voltage
    Leakage power: supply voltage, threshold voltage, CMOS technology

Different from internal memory bus power:
    Longer physical distance, higher bus capacitance, lower speed
    Cross line interference, higher leakage current
    Different communication protocols (memory/peripheral dependent)
    Multiplexed row/column address bus, narrower data bus

Page 13: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Embedded SOC System Architecture

[Figure: block diagram of a media processor SOC. Blocks: Core, Data Cache, Instruction Cache, System DMA Controller, Memory DMA 0, Memory DMA 1, PPI DMA, SPORT DMA, NTSC/PAL Encoder, Streaming Interface, S-Video/CVBS, NIC. The External Bus Interface Unit (EBIU) sits between the Internal Bus and the External Bus, which connects SDRAM, FLASH Memory, and Asynchronous Devices. Callout: Power Modeling Area.]

Page 14: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

ADSP-BF533 EZ-Kit Lite Board

[Figure: annotated ADSP-BF533 EZ-Kit Lite board. Labels: BF533 Blackfin Processor, SDRAM Memory, FLASH Memory, SPORT Data I/O, Video Codec/ADV Converter, Audio Codec/AD Converter, Video In & Out, Audio In, Audio Out.]

Page 15: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

External Bus Power Estimator

Previous Approaches
    Used Hamming distance [Benini98]
    Control signal was not considered
    Shared row and column address bus
    Memory state transitions were not considered

In Our Estimator
    Integrate memory control signal power into the model
    Consider the case where row and column address are shared
    Memory state transitions and stalls also cost power
    Consider page miss penalty and traffic reverse penalty

P(bus) = P(page miss) + P(bus turnaround) + P(control signal) + P(address generation) + P(data transmission) + P(leakage)
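A minimal sketch of the two ideas on this slide: counting toggled bus lines with the Hamming distance, and summing the six terms of P(bus). The function names, the dictionary keys, and the per-toggle energy are illustrative assumptions; the slides do not give the estimator's actual interface or constants.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bus lines that toggle between two consecutive bus values."""
    return bin(a ^ b).count("1")


def transfer_energy(words, energy_per_toggle):
    """Switching energy for a sequence of values driven on one bus."""
    toggles = sum(hamming_distance(prev, cur)
                  for prev, cur in zip(words, words[1:]))
    return toggles * energy_per_toggle


def bus_power(components):
    """P(bus) as the sum of the six terms named on the slide."""
    names = ("page_miss", "bus_turnaround", "control_signal",
             "address_generation", "data_transmission", "leakage")
    return sum(components[name] for name in names)
```

In this reading, the address-generation and data-transmission terms would come from `transfer_energy` applied to the address and data traces, while the page miss, turnaround, and control terms depend on memory state transitions.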

Page 16: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Two External Bus SDRAM Timing Models

[Figure: command timelines in system clock cycles (SCLK) for a Bank 0 request (P A R R R R) and a Bank 1 request (P A R R), with NOP cycles and the tRP, tRCD, and tCAS intervals marked.]

(a) SDRAM Access in Sequential Command Mode

(b) SDRAM Access in Pipelined Command Mode

P - PRECHARGE   A - ACTIVATE   N - NOP   R - READ

Page 17: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Bus Power Simulation Framework

[Figure: simulation flow. Blocks: Program, Compiler, Target Binary, Instruction Level Simulator, Memory Trace Generator, Memory Hierarchy Model, Memory Technology Timing Model, Memory Power Model, External Bus Power Estimator, Bus Power. Legend: Developed software modules.]

Page 18: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Multimedia Benchmark Configurations

Name | Description | I-Cache Size | D-Cache Size
MPEG2-ENC | MPEG-2 Video encoder with 720x480 4:2:0 input frames. | 16k | 16k
MPEG2-DEC | MPEG-2 Video decoder of 720x480 sequence with 4:2:2 CCIR frame output. | 16k | 16k
H264-ENC | H.264/MPEG-4 Part 10 (AVC) digital video encoder for achieving very high data compression. | 16k | 16k
H264-DEC | H.264/MPEG-4 Part 10 (AVC) video decompression algorithm. | 16k | 16k
JPEG-ENC | JPEG image encoder for 512x512 image. | 8k | 8k
JPEG-DEC | JPEG image decoder for 512x512 image. | 8k | 8k
PGP-ENC | Pretty Good Privacy encryption and digital signature of text message. | 8k | 4k
PGP-DEC | Pretty Good Privacy decryption of encrypted message. | 8k | 4k
G721-ENC | G.721 Voice Encoder of 16bit input audio samples. | 4k | 2k
G721-DEC | G.721 Voice Decoder of encoded bits. | 4k | 2k

Page 19: System-Level Memory Bus Power And Performance Optimization for Embedded Systems


Outline

Research Motivation and Introduction

Related Work

Power Estimation Framework

Optimization I – Power-Aware Bus Arbitration

Optimization II – Memory Page Remapping

Summary

Page 20: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Optimization I – Bus Arbitration

Multiple bus access masters in an SOC system
    Processor cores
    Data/Instruction caches
    DMA
    ASIC modules

Multimedia applications
    High bus bandwidth throughput
    Large memory footprint

Efficient arbitration algorithm can:
    Increase power awareness
    Increase bus throughput
    Reduce bus power

Page 21: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Bus Arbitration Target Region

[Figure: the media processor SOC block diagram (Core, Data Cache, Instruction Cache, System DMA Controller, Memory DMA 0, Memory DMA 1, PPI DMA, SPORT DMA, NTSC/PAL Encoder, Streaming Interface, S-Video/CVBS, NIC) with the EBIU shown with arbitration enabled, between the Internal Bus and the External Bus (SDRAM, FLASH Memory, Asynchronous Devices).]

Page 22: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Bus Arbitration Schemes

EBIU with arbitration enabled
    Handle core-to-memory and core-to-peripheral communication
    Resolve bus access contention
    Schedule bus access requests

Traditional Algorithms
    First Come First Serve (FCFS)
    Fixed Priority

Power Aware Algorithms (categorized by power metric / cost function)
    Minimum Power (P1D0) or (1, 0)
    Minimum Delay (P0D1) or (0, 1)
    Minimum Power Delay Product (P1D1) or (1, 1)
    Minimum Power Delay Square Product (P1D2) or (1, 2)
    More generically (PnDm) or (n, m)

Page 23: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Bus Arbitration Schemes (Continued)

Power Aware Arbitration
    From the current pending requests in the waiting queue, find a permutation of the external bus requests to achieve the minimum total power and/or performance cost.
    Reducible to minimum Hamiltonian path problem in a graph G(V,E).

Vertex = Request R(t,s,b,l)
    t – request arrival time
    s – starting address
    b – block size
    l – read / write

Edge = Transition of Request i and j
    i, j – Request i and j
    edge weight w(i, j) is cost of transition

Page 24: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Minimum Hamiltonian Path Problem

[Figure: request graph with vertices R0, R1, R2, R3 and edges w(0,1), w(0,2), w(0,3), w(1,2), w(2,1), w(1,3), w(3,1), w(2,3), w(3,2).]

R0 – Last Request on the Bus. Must be the starting point of a path.
R1, R2, R3 – Requests in the queue

$w(i,j) = P(i,j)^{n} D(i,j)^{m}$

P(i,j) – Power of Rj after Ri
D(i,j) – Delay of Rj after Ri

Hamiltonian Path: R0->R3->R1->R2

Minimum Path weight = w(0,3)+w(3,1)+w(1,2)

NP-Complete Problem
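To make the problem statement concrete, here is a brute-force sketch that enumerates every ordering of the pending requests and keeps the cheapest path starting at R0. The function names and the dictionary-based weight table are assumptions made for illustration; the slides do not show the authors' implementation.

```python
from itertools import permutations


def edge_weight(power, delay, n, m):
    """w(i,j) = P(i,j)^n * D(i,j)^m for one transition."""
    return power ** n * delay ** m


def min_hamiltonian_path(weights, start, pending):
    """Exhaustively search all orderings of `pending`, beginning at `start`.

    `weights[(i, j)]` is the cost of serving request j right after request i.
    Returns (best_cost, best_path). Runs in O(n!) time.
    """
    best_cost, best_path = float("inf"), None
    for order in permutations(pending):
        path = (start,) + order
        cost = sum(weights[(a, b)] for a, b in zip(path, path[1:]))
        if cost < best_cost:
            best_cost, best_path = cost, path
    return best_cost, best_path
```

With three pending requests this checks only six orderings, but the factorial growth is why an exhaustive arbiter is impractical in hardware for longer queues.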

Page 25: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Greedy Solution

[Figure: the same request graph, with vertices R0, R1, R2, R3 and edge weights w(i,j).]

Greedy Algorithm (local min)

Only the next request in the path is needed:

min{w(0,j) | w(i,j) is the edge weight of graph G(V,E)}

In each iteration of arbitration:

1. A new graph G(V,E) needs to be constructed.
2. A greedy solution request is arbitrated to use the bus.
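The greedy step can be sketched as a single minimum over the pending requests. The names below are hypothetical, and `weight` stands in for whatever cost function (n, m) the arbiter uses. The loop in `greedy_schedule` drains a fixed queue only for illustration; in the scheme described above, new requests may arrive between iterations and the graph is rebuilt each time.

```python
def greedy_arbitrate(last_request, pending, weight):
    """Pick the pending request with the smallest cost after `last_request`."""
    return min(pending, key=lambda j: weight(last_request, j))


def greedy_schedule(start, pending, weight):
    """Repeatedly apply the greedy choice until the queue is empty."""
    order, current, remaining = [], start, list(pending)
    while remaining:
        nxt = greedy_arbitrate(current, remaining, weight)
        remaining.remove(nxt)
        order.append(nxt)
        current = nxt
    return order
```

Each arbitration inspects every pending request once, so the cost per decision is linear in the queue length.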

Page 26: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Experimental Setup

Utilized embedded power modeling framework

Implemented eleven different arbitration schemes inside EBIU
    FCFS, Fixed Priority
    Minimum power (P1D0) or (1, 0), minimum delay (P0D1) or (0, 1), and (1, 1), (1, 2), (2, 1), (1, 3), (3, 1), (3, 2), (2, 3)

10 multimedia application benchmarks are ported to Blackfin architecture and simulated, including MPEG-2, H.264, JPEG, PGP and G.721.

Page 27: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Power Improvement

Power-aware arbitration schemes consume less power than Fixed Priority and FCFS.

The difference across power-aware arbitration strategies is small.

The Pipelined Command model saves 6-7% power relative to the Sequential Command model for MPEG2 ENC & DEC.

The results are consistent across all other benchmarks.

[Figure: MPEG2 Encoder External Bus Power. Average bus power (mW) per arbitration algorithm: FP, FCFS, (0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (1, 3), (2, 3), (3, 2), (3, 1). Series: Sequential Command, Pipelined Command.]

[Figure: MPEG2 Decoder External Bus Power. Same axes and series.]

Page 28: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Speed Improvement

Power-aware schemes have a smaller bus delay than traditional Fixed Priority and FCFS.

The difference across power-aware arbitration strategies is small.

The Pipelined Command model is 3-9% faster than the Sequential Command model for MPEG2 ENC & DEC.

The results are consistent across all other benchmarks.

[Figure: MPEG2 Encoder External Bus Delay. Average delay (SCLK) per arbitration algorithm: FP, FCFS, (0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (1, 3), (2, 3), (3, 2), (3, 1). Series: Sequential Command, Pipelined Command.]

[Figure: MPEG2 Decoder External Bus Delay. Same axes and series.]

Page 29: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Comparison with Exhaustive Algorithm

The greedy algorithm can fail in certain cases.

Complexity of O(n) vs O(n!).

Performance difference is negligible.

[Figure: example request graph (R0, R1, R2, R3) with numeric edge weights, marking the path found by exhaustive search and the path found by greedy search.]

Page 30: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Comments on Experimental Results

Power aware arbitrators significantly reduce the external bus power for all 8 benchmarks. On average, the power saving is 14%.

Power aware arbitrators reduce the bus access delay. The delay is reduced by 21% on average across the 8 benchmarks.

The pipelined SDRAM model has a big performance advantage over the sequential SDRAM model. It achieves a 6% power saving and a 12% speedup.

Power and delay on the external bus are highly correlated. Minimum power also achieves minimum delay.

Minimum power schemes lead to simpler design options. Scheme (1,0) is preferred due to its simplicity.

Page 31: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Design of A Power Estimation Unit (PEU)

[Figure: Power Estimation Unit (PEU) datapath. The next request address is split into bank, row, and column address fields. Registers: Last Bank Address, Bank(0) to Bank(3) Open Row Addr, Last Column Address (with an Updated Column Addr path). Output: Estimated Power.]

Annotations:
    If not equal, output bank miss power
    If not equal, output page miss penalty power, update last column address register
    Use Hamming distance to calculate column address data power
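The PEU can be modeled in software roughly as follows. This is a sketch under several assumptions that the slides do not confirm: the 2/12/10-bit bank/row/column split, the power constants (arbitrary units), the split into a side-effect-free `estimate` and a state-updating `commit`, and the reading of "update last column address register" as meaning that after a page miss the row address is the last value seen on the shared address lines.

```python
BANK_BITS, ROW_BITS, COL_BITS = 2, 12, 10   # assumed field widths


def split_address(addr):
    """Split an address into (bank, row, column) under the assumed layout."""
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (COL_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return bank, row, col


class PowerEstimationUnit:
    def __init__(self, bank_miss_power=1.0, page_miss_power=8.0,
                 power_per_toggle=0.1):
        self.last_bank = 0
        self.open_row = [0] * (1 << BANK_BITS)   # one open row per bank
        self.last_column = 0
        self.bank_miss_power = bank_miss_power
        self.page_miss_power = page_miss_power
        self.power_per_toggle = power_per_toggle

    def estimate(self, addr):
        """Estimated cost of issuing `addr` next; leaves the state unchanged."""
        bank, row, col = split_address(addr)
        power = 0.0
        last_on_bus = self.last_column
        if bank != self.last_bank:
            power += self.bank_miss_power
        if row != self.open_row[bank]:
            power += self.page_miss_power
            last_on_bus = row   # assumption: the row was driven on the shared lines
        toggles = bin(last_on_bus ^ col).count("1")
        return power + toggles * self.power_per_toggle

    def commit(self, addr):
        """Record `addr` as the request that was actually granted the bus."""
        bank, row, col = split_address(addr)
        self.last_bank, self.open_row[bank], self.last_column = bank, row, col
```

A minimum power (1, 0) arbiter built on this model would pick `min(pending, key=peu.estimate)` and then call `peu.commit` on the winner.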

Page 32: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Two Arbitrator Implementation Structures

[Figure: two arbitrator block diagrams, labeled Shared PEU Structure and Dedicated PEU Structure. Each contains a Request Queue Buffer holding requests (t, s, b, l), Power Estimator Units (PEU), Memory/Bus States Info, a Comparator that selects the Minimum Power Request, a State Update path, and an Access Command Generator driving the External Bus.]

Page 33: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Performance of two structures

Higher PEU delay lowers the external bus performance for both the MPEG-2 encoder and decoder.

When the PEU delay is 5 cycles or higher, the dedicated structure is preferred over the shared structure. Otherwise, the shared structure is enough.

[Figure: MPEG-2 Encoder (1,0) Arbitrator Estimator Unit Implementation Performance Comparison. Average delay (cycles) versus estimator logic delay (0 to 10 cycles). Series: Estimator Unit Shared, Estimator Unit Dedicated.]

[Figure: MPEG-2 Decoder (1,0) Arbitrator Estimator Unit Implementation Performance Comparison. Same axes and series.]

Page 34: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Summary of Bus Arbitration Schemes

Efficient bus arbitration can benefit both power and performance over traditional arbitration schemes.

Minimum power and minimum delay are highly correlated on the external bus.

The pipelined SDRAM model has a significant advantage over the sequential SDRAM model.

Arbitration scheme (1, 0) is recommended.

The minimum power approach provides more design options and leads to simpler design implementations.

The trade-off between design complexity and performance was presented.

Page 35: System-Level Memory Bus Power And Performance Optimization for Embedded Systems


Outline

Research Motivation and Introduction

Related Work

Power Estimation Framework

Optimization I – Power-Aware Bus Arbitration

Optimization II – Memory Page Remapping

Summary

Page 36: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Data Access Pattern in Multimedia Apps

[Figure: three address-versus-time plots, one per access pattern: Fixed Stride, 2-Way Stream, 2-D Stride.]

3 common data access patterns in multimedia applications

Majority of cycles in loop bodies and array accesses

High data access bandwidth

Poor locality, cross page references

Page 37: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Previous work on Access Pattern

Previous work was performance driven and used OS/compiler-related approaches
    Data Pre-fetching [Chen94] [Zhang00]
    Memory Customization [Adve00] [Grun01]
    Data Layout Optimization [Catthoor98] [DeLaLuz04]

Shortcomings of OS/compiler-based strategies:
    Multimedia benchmarks' dominant activities are within large monolithic data buffers.
    Buffers generally contain many memory pages and cannot be further optimized.
    Constrained by the OS and compiler capability. Poor flexibility.

Page 38: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Optimization II - Page Remapping

Technique currently used in large memory space peripheral memory access.

External memories in embedded multimedia systems
    High bus access overhead
    Page miss penalty

Efficient page remapping can
    Reduce page misses
    Improve external bus throughput
    Reduce power / energy consumption

Page 39: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Page Remapping Target Region

[Figure: the media processor SOC block diagram (Core, Data Cache, Instruction Cache, System DMA Controller, Memory DMA 0, Memory DMA 1, PPI DMA, SPORT DMA, NTSC/PAL Encoder, Streaming Interface, S-Video/CVBS, NIC), with the External Bus Interface Unit (EBIU) between the Internal Bus and the External Bus (SDRAM, FLASH Memory, Asynchronous Devices).]

Page 40: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

SDRAM Memory Pages

[Figure: SDRAM organized as Bank 0 to Bank M-1, each holding Page 0 to Page N-1, with accessed pages marked X and X*.]

High memory access latency
    Minimum latency of an sclk cycle
    Page miss penalty
    Additional latency due to refresh cycle
    No guaranteed access due to arbitration logic
    Non-sequential read/write would suffer

Page 41: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

SDRAM Page Miss Penalty

[Figure: two COMMAND/DATA timelines in system clock cycles (SCLK), annotated with tRP, tRCD, and tCAS. One shows P A R R R R followed by a second P A R R; the other shows a single P A followed by consecutive reads.]

P - PRECHARGE   A - ACTIVATE   R - READ   N - NOP   D - DATA

Page 42: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

SDRAM Timing Parameters

Access type | Number of cycles
Read cycle | trp + n*(tcas)
Write cycle | twp
Page miss | trp + trcd
Refresh cycle | 2*(trcd) * nrows

SDRAM parameter | Sclk cycles
trcd | 1-15
trp | 1-7
trcd = tras + trp | 1-15
tcas | 2-3

twp = write to precharge
trp = read to precharge
tras = activate to precharge
tcas = read latency

~8-10 sclk penalty associated with a page miss

Page 43: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

SDRAM Page Access Sequence (I)

[Figure: 12 reads across 4 banks (Bank 0 to Bank 3, Page 0 to Page 3) on a system clock timeline, with repeated P A R sequences.]

P – Precharge   A – Activation   R – Read

Typical access pattern of 2-D stride / 2-way stream.

Poor data layout causes significant access overhead.

Page 44: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

SDRAM Page Access Sequence (II)

[Figure: 12 reads across 4 banks (Bank 0 to Bank 3, Page 0 to Page 3) on a system clock timeline, with one P A R per bank followed by consecutive reads.]

P – Precharge   A – Activation   R – Read

Less access overhead with distributed data layout.
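The difference between the two access sequences can be illustrated by counting page misses under an open-row model, where each bank keeps one row open and an access to any other row in that bank costs a precharge and an activate. The function below is a hypothetical sketch; the access lists in the note that follows are made up and are not the exact sequences drawn on the slides.

```python
def count_page_misses(accesses, num_banks=4):
    """Count page misses for a sequence of (bank, row) accesses.

    Each bank holds one open row; touching a different row in that bank
    is a miss and replaces the open row.
    """
    open_row = [None] * num_banks
    misses = 0
    for bank, row in accesses:
        if open_row[bank] != row:
            misses += 1
            open_row[bank] = row
    return misses
```

Under this model, twelve reads that alternate between two rows of one bank, `[(0, 0), (0, 1)] * 6`, miss on every access, while the same reads split across two banks, `[(0, 0), (1, 1)] * 6`, miss only on the first access to each bank.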

Page 45: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Why we use Page Remapping

[Figure: Bank 0 to Bank 3 with the accessed data (X) of Page 2, shown before and after remapping.]

Page Remapping Entry of Page 2: {2,0,1,3}

Page 46: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Module in an SOC System

Address translation unit, only translates bank address

A non-MMU system inserts a page remapping module before the EBIU

An MMU system can take advantage of the existing address translation unit. No extra hardware needed

[Figure: a Page Remapping block placed between the Internal Bus and the External Bus Interface Unit (EBIU), which drives the External Bus to SDRAM, FLASH Memory, and Asynchronous Devices.]

Page 47: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Sequence (I) after Remapping

[Figure: 12 reads across 4 banks (Bank 0 to Bank 3, Page 0 to Page 3) on a system clock timeline, with one P A R per bank followed by consecutive reads.]

P – Precharge   A – Activation   R – Read

Same performance as sequence II.

Applicable for monolithic data buffers (e.g. frame buffers).

Page 48: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Page Remapping Algorithm

NP complete problem. Reducible to graph coloring problem in a page transition graph G(V,E).

Vertex = Page Im,n
    m – page bank number
    n – page row number

Edge = Transition of Page Im,n to Ip,q
    Weighted edges capture page traversal during the program execution
    Edge weight is the number of transitions from Page Im,n to Page Ip,q

Color = Bank
    Each bank has one distinct color.
    Every page will be assigned one color.

Page 49: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Page Remapping Algorithm (continued)

Page Remapping Algorithm
    From the page transition graph, find the color (bank) assignment for each page, such that the transition cost between same color pages is minimized.

Algorithm Steps:
    Sort the edges based on their transition weight
    Edges are processed in decreasing weight order
    Color the pages associated with each edge
    Weight parameter array for each page represents the cost of mapping that page into each bank, e.g. {500, 200, 0, 0}
    5 different situations of processing each edge
    Page remapping table (PMT) is generated as a result of mapping.
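The steps above can be sketched as a greedy pass over the sorted edges. This is a simplified illustration, not the authors' full algorithm: it assumes each row offers one slot per bank (so pages in the same row always land in different banks), breaks ties by lowest bank number, and omits the conflict-resolving case where two already-mapped pages share a bank, only recording that cost. Pages are written as hypothetical `(original_bank, row)` tuples.

```python
def remap_pages(edges, num_banks=4):
    """Greedy bank (color) assignment from a weighted page transition graph.

    `edges` holds (page_a, page_b, weight) triples. Returns (bank_of, cost),
    where cost[page][k] is the total transition weight between `page` and
    its neighbors placed in bank k.
    """
    bank_of, cost, taken_in_row = {}, {}, {}

    def place(page, neighbor):
        # Assumed rule: pages of the same row must occupy different banks.
        taken = taken_in_row.setdefault(page[1], set())
        free = [b for b in range(num_banks) if b not in taken]
        # Prefer the lowest free bank that differs from the neighbor's bank.
        bank = min(free, key=lambda b: (bank_of.get(neighbor) == b, b))
        bank_of[page] = bank
        taken.add(bank)

    for a, b, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if a[1] == b[1]:
            continue  # same-row pages never share a bank: nothing to do
        for page, other in ((a, b), (b, a)):
            if page not in bank_of:
                place(page, other)
        # Both endpoints are placed now; record the weight per bank.
        cost.setdefault(a, [0] * num_banks)[bank_of[b]] += w
        cost.setdefault(b, [0] * num_banks)[bank_of[a]] += w
    return bank_of, cost
```

A page's residual conflict cost is `cost[page][bank_of[page]]`; a fuller version would use the weight arrays to move one of the two pages when that value is nonzero. Pages touched only by same-row edges are never placed by this sketch.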

Page 50: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Example Case

[Figure: original page allocation of pages I0,0, I0,1, I1,1, I1,2, I1,3, I2,1, I3,1 across Bank 0 to Bank 3 (Page 0 to Page 3), and the page transition graph over the same pages with edge weights 500, 200, 100, 80, 60, 50, 40, 30.]

Page 51: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Initial Step

[Figure: empty bank/page grid (Bank 0 to Bank 3, Page 0 to Page 3).]

No page is mapped. All slots are available.

Page 52: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Step (1) – two unmapped pages

Selected Edge: I0,0 to I0,1 (weight 500)

Weight Parameters Updates:
I0,0[0]: { 0, 500, 0, 0}
I0,1[1]: { 500, 0, 0, 0}

Actions: Allocate unmapped pages I0,0 and I0,1

[Figure: bank/page grid (Bank 0 to Bank 3, Page 0 to Page 3) after this step.]

Page 53: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Step (2) – two unmapped pages

Selected Edge: I1,1 to I1,2 (weight 200)

Weight Parameters Updates:
I1,1[0]: { 0, 200, 0, 0}
I1,2[1]: { 200, 0, 0, 0}

Actions: Allocate unmapped pages I1,1 and I1,2

[Figure: bank/page grid (Bank 0 to Bank 3, Page 0 to Page 3) after this step.]

Page 54: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Step (3) – one unmapped page

Selected Edge: I0,0 to I3,1 (weight 100)

Weight Parameters Updates:
I3,1[2]: { 100, 0, 0, 0}
I0,0[0]: { 0, 500, 100, 0}

Actions: Map page I3,1; no change for I0,0

[Figure: bank/page grid (Bank 0 to Bank 3, Page 0 to Page 3) after this step.]

Page 55: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Step (4) – one unmapped page

Selected Edge: I1,2 to I2,1 (weight 80)

Weight Parameters Updates:
I2,1[3]: { 0, 80, 0, 0}
I1,2[1]: { 200, 0, 0, 80}

Actions: Map page I2,1; no change for I1,2

[Figure: bank/page grid (Bank 0 to Bank 3, Page 0 to Page 3) after this step.]

Page 56: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Step (5) – one unmapped page

Selected Edge: I3,1 to I1,3 (weight 60)

Weight Parameters Updates:
I1,3[0]: { 0, 0, 60, 0}
I3,1[2]: { 160, 0, 0, 0}

Actions: Map page I1,3; no change for I3,1

[Figure: bank/page grid (Bank 0 to Bank 3, Page 0 to Page 3) after this step.]

Page 57: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Step (6) – same row pages

Selected Edge: I1,1 to I3,1 (weight 50)

Actions: Both I1,1 and I3,1 are on the same row, no actions.

[Figure: bank/page grid (Bank 0 to Bank 3, Page 0 to Page 3), unchanged by this step.]

Page 58: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Step (7) – two mapped pages

Selected Edge: I2,1 to I1,3 (weight 40)

Weight Parameters Updates:
I1,3[0]: { 0, 0, 60, 40}
I2,1[3]: { 40, 80, 0, 0}

Actions: Both I2,1 and I1,3 are mapped, no conflicts.

[Figure: bank/page grid (Bank 0 to Bank 3, Page 0 to Page 3), unchanged by this step.]

Page 59: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Step (8) – conflict resolving

Selected Edge: I0,0 to I1,1 (weight 30)

Actions: Both I0,0 and I1,1 are mapped and in same bank.

Current Weight Parameters:
I2,1[3]: { 40, 80, 0, 0}
I3,1[2]: { 160, 0, 0, 0}
I1,1[0]: { 30, 200, 0, 0}
I0,1[1]: { 500, 0, 0, 0}

Updated Weight Parameters:
I0,0[0]: {0, 500, 100, 30}

No Conflict

[Figure: two bank/page grids (Bank 0 to Bank 3, Page 0 to Page 3), before and after conflict resolution.]

Page 60: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

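Generated PMT table

[Figure: address path from the I-Cache and D-Cache to the EBIU and a 16MB External SDRAM. The External Memory Address supplies a Memory Page Address (14bits) that indexes the Page Remapping Table (4kB); the table supplies the Bank Address (2bits), which accompanies the Row/Column Address (22bits). The final grid holds pages I0,0, I3,1, I2,1, I1,2, I0,1, I1,1, I1,3; the table entries shown are 00, 01, 00, 01, 10, 11, 00, with the remaining entries marked xx.]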
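A table lookup of this kind might look like the sketch below. The bit widths come from the slide; the placement of the bank field in the top two address bits, the page address as the top 14 bits, and the function names are assumptions made for illustration.

```python
PAGE_BITS = 14                        # memory page address width (from the slide)
BANK_BITS = 2                         # bank address width (from the slide)
ADDR_BITS = 24                        # 16 MB external SDRAM
OFFSET_BITS = ADDR_BITS - PAGE_BITS   # address bits below the page address
LOW_BITS = ADDR_BITS - BANK_BITS      # the 22 row/column bits


def pmt_size_bytes():
    """Storage for one 2-bit bank entry per page."""
    return (1 << PAGE_BITS) * BANK_BITS // 8


def translate(addr, pmt):
    """Replace the bank field of `addr` with the PMT entry for its page."""
    page = addr >> OFFSET_BITS
    new_bank = pmt[page]
    return (new_bank << LOW_BITS) | (addr & ((1 << LOW_BITS) - 1))
```

With 2^14 entries of 2 bits each, `pmt_size_bytes()` gives 4096, which matches the 4kB table size on the slide. Only the bank field changes, so the row/column bits pass through untouched.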

Page 61: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Experimental Setup

Utilized embedded power modeling framework

Extended address translation unit for page remapping

Page coloring program to generate PMT

Same 10 multimedia application benchmarks
    MPEG-2 encoder and decoder
    H.264 encoder and decoder
    JPEG encoder and decoder
    PGP encoder and decoder
    G.721 encoder and decoder

Page 62: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Page Miss Reduction

[Figure: page misses per 100 requests for each benchmark (MPEG2-ENC, MPEG2-DEC, H264-ENC, H264-DEC, JPEG-ENC, JPEG-DEC, PGP-ENC, PGP-DEC, G721-ENC, G721-DEC). Series: 2 Bank Original, 4 Bank Original, 8 Bank Original, 2 Bank Remapped, 4 Bank Remapped, 8 Bank Remapped.]

Page 63: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

External Bus Power

[Figure: external power (mW) for each benchmark (MPEG2-ENC, MPEG2-DEC, H264-ENC, H264-DEC, JPEG-ENC, JPEG-DEC, PGP-ENC, PGP-DEC, G721-ENC, G721-DEC). Series: 2 Bank Original, 4 Bank Original, 8 Bank Original, 2 Bank Remapped, 4 Bank Remapped, 8 Bank Remapped.]

Page 64: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Average Access Delay

[Figure: average request delay (cycles) for each benchmark (MPEG2-ENC, MPEG2-DEC, H264-ENC, H264-DEC, JPEG-ENC, JPEG-DEC, PGP-ENC, PGP-DEC, G721-ENC, G721-DEC). Series: 2 Bank Original, 4 Bank Original, 8 Bank Original, 2 Bank Remapped, 4 Bank Remapped, 8 Bank Remapped.]

Page 65: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Comments on Page Remapping

Page remapping algorithm is presented by example.

Our algorithm can significantly reduce the memory page miss rate, by 70-80% on average.

For a 4-bank SDRAM memory system, we reduced external memory access time by 12.6%.

The proposed algorithm can reduce power consumption in the majority of the benchmarks, with an average power reduction of 13.2%.

Combining the effects of both power and delay, our algorithm can significantly reduce the total energy cost.

Stability study was done in the dissertation. A PMT generated from one test vector input performs well on different inputs.

Page 66: System-Level Memory Bus Power And Performance Optimization for Embedded Systems


Outline

Research Motivation and Introduction

Related Work

Power Estimation Framework

Optimization I – Power-Aware Bus Arbitration

Optimization II – Memory Page Remapping

Summary

Page 67: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Summary

Reviewed the issues of external bus power in a system-on-a-chip (SOC) embedded system.

Built external bus power estimation framework and experimental methodology.
    PACS’04

Proposed a series of power aware bus arbitration schemes and their performance improvement over traditional schemes.
    HiPEAC’05, also appeared in LNCS Transactions on High-Performance Embedded Architectures and Compilers

Proposed page remapping algorithm to reduce page misses and its power and delay improvements.
    LCTES’07

Page 68: System-Level Memory Bus Power And Performance Optimization for Embedded Systems


Future Work

Integration of power estimation framework in complete tool chain

Extend arbitration schemes to multiple memory interfaces and other peripheral interfaces.

Compare performance of page remapping with corresponding OS/Compiler schemes

Page 69: System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Thank You!