System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Ke [email protected]

David [email protected]

2

Why Is Power More Important?

“Power: A First Class Design Constraint for Future Architecture” – Trevor Mudge 2001

Increasing complexity for higher performance (MIPS)
  Parallelism, pipeline, memory/cache size
  Higher clock frequency, larger die size
  Rising dynamic power consumption

CMOS process continues to shrink
  Smaller logic gates reduce Vthreshold
  Lower Vthreshold means higher leakage
  Leakage power will exceed dynamic power

Things are getting worse in embedded systems
  Low power and low cost systems
  Fixed or limited applications/functionalities
  Real-time systems with timing constraints

3

Power Breakdown of An Embedded System

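[Figure: power breakdown chart with the segments Internal Dynamic, Internal Leakage, RTC, PPI, SPORT0, SPORT1, UART, and SDRAM. One part of the chart is labeled as the research target.]

Conditions: 25°C, 1.2 V internal, 400 MHz CCLK Blackfin processor, 3.3 V external, 133 MHz SDRAM, 27 MHz PPI.

Source: Analog Devices Inc.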

4

Introduction

Related work on microprocessor power
  Low power design trend
  Power metrics
  Power performance tradeoffs
  Power optimization techniques

Power estimation framework
  Experimental framework built from Blackfin cycle accurate simulator
  Validated through a Blackfin EZKit board

Power aware bus arbitration

Memory page remapping

5

Outline

Research Motivation and Introduction

Related Work

Power Estimation Framework

Optimization I – Power-Aware Bus Arbitration

Optimization II – Memory Page Remapping

Summary

6

Power Modeling

Dynamic power estimation
  Instruction level model: [Tiwari94], JouleTrack [Sinha01]
  Function level model: [Qu00]
  Architecture model: Cai-Lim Model, TEMPEST [CaiLim99], Wattch [Brooks00], Simplepower [Ye00]

Static power estimation
  Butts-Sohi model [Butts00]

Previous memory system power estimation
  Activity model: CACTI [Wilton96]
  Trace driven model: Dinero IV [Elder98]

7

Power Equation

$$P = \underbrace{A\,C\,V_{DD}^{2}\,f}_{\text{dynamic}} + \underbrace{V_{DD}\,N\,k_{design}\,I_{leakage}}_{\text{leakage}}$$

A: activity factor
C: total capacitance
V_DD: voltage
f: frequency
N: transistor number
k_design, I_leakage: technology factors
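As a rough numerical illustration of the two terms of the power equation on this slide, the sketch below evaluates the dynamic and leakage components separately. The function names and every numeric value are assumptions chosen for illustration; they are not measurements from the slides.

```python
def dynamic_power(activity, capacitance, vdd, frequency):
    """Dynamic term A * C * V_DD^2 * f, in watts."""
    return activity * capacitance * vdd ** 2 * frequency


def leakage_power(vdd, n_transistors, k_design, i_leakage):
    """Leakage term V_DD * N * k_design * I_leakage, in watts."""
    return vdd * n_transistors * k_design * i_leakage


# Hypothetical operating point (illustrative values only).
p_dyn = dynamic_power(activity=0.1, capacitance=1e-9, vdd=1.2, frequency=400e6)
p_leak = leakage_power(vdd=1.2, n_transistors=1e6, k_design=10.0, i_leakage=1e-9)
p_total = p_dyn + p_leak
```

Because the dynamic term is quadratic in V_DD while the leakage term is linear, lowering the supply voltage shrinks the dynamic share faster than the leakage share.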

8

Common Power Optimization Techniques

Gating (turn off unused components)
  Clock gating
  Voltage gating: Cache decay [Hu01]

Scaling (scale operating point of a component)
  Voltage scaling: Drowsy cache [Flautner02]
  Frequency scaling: [Pering98]
  Resource scaling: DRAM power mode [Delaluz01]

Banking (break single component into smaller sub-units)
  Vertical sub-banking: Filter cache [Kin97]
  Horizontal sub-banking: Scratchpad [Kandemir01]

Clustering (partition components into clusters)

Switching reduction (redesigning with lower activity)
  Bus encoding: Permutation Code [Mehta96], redundant code [Stan95, Benini98], WZE [Musoll97]

9

Power Aware Figure of Merit

Delay, D
  Performance, MIPS

Power, P
  Battery life (mobile), packaging (high performance)

Obvious choice for power performance tradeoff, PD
  Joules/instruction, inversely MIPS/W
  Energy figure
  Mobile / low power applications

Energy Delay, PD^2
  MIPS^2/W [Gonzalez96]

Energy Delay Square, PD^3
  MIPS^3/W
  Voltage and frequency independent

More generically, MIPS^m/W
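The claim that the cubic metric is voltage and frequency independent can be checked with a small sketch. It assumes an idealized scaling model that is not stated on the slide: frequency and supply voltage scale together by a factor s, so power (proportional to V^2 f) scales by s^3 while MIPS scales by s. The function names and the operating point are hypothetical.

```python
def figure_of_merit(mips, watts, m=1):
    """Generic power-aware figure of merit MIPS^m / W."""
    return mips ** m / watts


def scale_operating_point(mips, watts, s):
    """Idealized scaling: f and V both scale by s, so P ~ V^2 * f scales by s^3."""
    return mips * s, watts * s ** 3


base = (400.0, 0.5)                       # hypothetical (MIPS, W) operating point
slow = scale_operating_point(*base, s=0.5)

fom1_base, fom1_slow = figure_of_merit(*base, m=1), figure_of_merit(*slow, m=1)
fom3_base, fom3_slow = figure_of_merit(*base, m=3), figure_of_merit(*slow, m=3)
```

Under this model MIPS/W improves simply by slowing down, whereas MIPS^3/W stays the same, which is why the higher-order metric is better at comparing designs rather than operating points.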

10

Power Optimization Effect on Power Figure

Most optimization schemes sacrifice performance for lower power consumption, except switching reduction.

All optimization schemes yield higher power efficiency.

All optimization schemes increase hardware complexity.

11


12

External Bus

External Bus Components
  Typically an off-chip bus
  Includes: control bus, address bus, data bus

External Bus Power Consumption
  Dynamic power factors: activity, capacitance, frequency, voltage
  Leakage power: supply voltage, threshold voltage, CMOS technology

Different from internal memory bus power:
  Longer physical distance, higher bus capacitance, lower speed
  Cross line interference, higher leakage current
  Different communication protocols (memory/peripheral dependent)
  Multiplexed row/column address bus, narrower data bus

13

Embedded SOC System Architecture

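[Figure: block diagram of the embedded SOC. A media processor core with data cache and instruction cache and a system DMA controller (Memory DMA 0, Memory DMA 1, PPI DMA, SPORT DMA) sit on the internal bus, together with an NTSC/PAL encoder (S-Video/CVBS), a streaming interface, and a NIC. The External Bus Interface Unit (EBIU) connects the internal bus to the external bus, which serves SDRAM, FLASH memory, and asynchronous devices. The external bus side is marked as the power modeling area.]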

14

ADSP-BF533 EZ-Kit Lite Board

[Figure: ADSP-BF533 EZ-Kit Lite board with the BF533 Blackfin processor, SDRAM memory, FLASH memory, SPORT data I/O, video codec/ADV converter (video in and out), and audio codec/AD converter (audio in, audio out).]

15

External Bus Power Estimator

Previous Approaches
  Used Hamming distance [Benini98]
  Control signal was not considered
  Shared row and column address bus
  Memory state transitions were not considered

In Our Estimator
  Integrate memory control signal power into the model
  Consider the case where row and column address are shared
  Memory state transitions and stalls also cost power
  Consider page miss penalty and traffic reverse penalty

P(bus) = P(page miss) + P(bus turnaround) + P(control signal) + P(address generation) + P(data transmission) + P(leakage)
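A minimal sketch of how the terms of P(bus) could be accumulated for one access is shown below. It is not the estimator from the slides: the `Access` record, the cost table, and all cost values are hypothetical, and the address bus is assumed to be multiplexed so that a page miss drives the row address before the column address.

```python
from collections import namedtuple

# One bus access: open-row index, column address, data word, direction.
Access = namedtuple("Access", "row col data is_write")

# Hypothetical per-event costs in arbitrary units (not values from the slides).
COSTS = {"page_miss": 5.0, "turnaround": 2.0, "control": 0.5,
         "toggle": 0.1, "leakage": 0.2}


def hamming(a, b):
    """Number of bit positions that differ between two words."""
    return bin(a ^ b).count("1")


def access_cost(prev, cur, costs=COSTS):
    """Cost of issuing `cur` right after `prev` on the external bus."""
    total = costs["control"] + costs["leakage"]
    if cur.row != prev.row:
        # Page miss: the shared address bus carries the row, then the column.
        total += costs["page_miss"]
        addr_toggles = hamming(prev.col, cur.row) + hamming(cur.row, cur.col)
    else:
        addr_toggles = hamming(prev.col, cur.col)
    if cur.is_write != prev.is_write:
        total += costs["turnaround"]          # read/write direction reversal
    data_toggles = hamming(prev.data, cur.data)
    return total + costs["toggle"] * (addr_toggles + data_toggles)


hit = access_cost(Access(1, 0x0F, 0x00, False), Access(1, 0x0C, 0xFF, True))
miss = access_cost(Access(1, 0x0F, 0x00, False), Access(2, 0x0F, 0x00, False))
```

With these illustrative costs the page-miss access is more expensive than the page-hit access even though the hit also pays for a bus turnaround and eight data toggles.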

16

Two External Bus SDRAM Timing Models

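[Figure: two SDRAM timing diagrams over system clock cycles (SCLK), each showing a Bank 0 request (P A R R R R) and a Bank 1 request (P A R R), with the intervals tRP, tRCD, and tCAS marked.
(a) SDRAM access in sequential command mode: NOP cycles fill the waits between commands.
(b) SDRAM access in pipelined command mode: commands of the two requests overlap, leaving fewer NOP cycles.
Legend: P = PRECHARGE, A = ACTIVATE, N = NOP, R = READ.]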

17

Bus Power Simulation Framework

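[Figure: block diagram of the simulation flow with the blocks Program, Compiler, Target Binary, Instruction Level Simulator, Memory Trace Generator, Memory Hierarchy Model, Memory Technology Timing Model, Memory Power Model, and External Bus Power Estimator, producing the bus power as output. A legend marks the software modules that were developed.]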

18

Multimedia Benchmark Configurations

Name        Description                                                                                         I-Cache Size  D-Cache Size
MPEG2-ENC   MPEG-2 video encoder with 720x480 4:2:0 input frames.                                               16k           16k
MPEG2-DEC   MPEG-2 video decoder of 720x480 sequence with 4:2:2 CCIR frame output.                              16k           16k
H264-ENC    H.264/MPEG-4 Part 10 (AVC) digital video encoder for achieving very high data compression.          16k           16k
H264-DEC    H.264/MPEG-4 Part 10 (AVC) video decompression algorithm.                                           16k           16k
JPEG-ENC    JPEG image encoder for 512x512 image.                                                               8k            8k
JPEG-DEC    JPEG image decoder for 512x512 image.                                                               8k            8k
PGP-ENC     Pretty Good Privacy encryption and digital signature of text message.                               8k            4k
PGP-DEC     Pretty Good Privacy decryption of encrypted message.                                                8k            4k
G721-ENC    G.721 voice encoder of 16-bit input audio samples.                                                  4k            2k
G721-DEC    G.721 voice decoder of encoded bits.                                                                4k            2k

19


20

Optimization I – Bus Arbitration

Multiple bus access masters in an SOC system
  Processor cores
  Data/instruction caches
  DMA
  ASIC modules

Multimedia applications
  High bus bandwidth throughput
  Large memory footprint

Efficient arbitration algorithm can:
  Increase power awareness
  Increase bus throughput
  Reduce bus power

21

Bus Arbitration Target Region

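[Figure: the embedded SOC block diagram (media processor core, system DMA controller with Memory DMA 0, Memory DMA 1, PPI DMA, and SPORT DMA, NTSC/PAL encoder with S-Video/CVBS, streaming interface, NIC, internal bus, external bus, SDRAM, FLASH memory, asynchronous devices), with the EBIU shown as "EBIU with arbitration enabled".]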

22

Bus Arbitration Schemes

EBIU with arbitration enabled
  Handle core-to-memory and core-to-peripheral communication
  Resolve bus access contention
  Schedule bus access requests

Traditional Algorithms
  First Come First Serve (FCFS)
  Fixed Priority

Power Aware Algorithms (categorized by power metric / cost function)
  Minimum Power (P1D0) or (1, 0)
  Minimum Delay (P0D1) or (0, 1)
  Minimum Power Delay Product (P1D1) or (1, 1)
  Minimum Power Delay Square Product (P1D2) or (1, 2)
  More generically (PnDm) or (n, m)

23

Bus Arbitration Schemes (Continued)

Power Aware Arbitration
  From the current pending requests in the waiting queue, find a permutation of the external bus requests to achieve the minimum total power and/or performance cost.

Reducible to minimum Hamiltonian path problem in a graph G(V,E).

Vertex = Request R(t,s,b,l)
  t – request arrival time
  s – starting address
  b – block size
  l – read / write

Edge = Transition of Request i and j
  i, j – Request i and j
  edge weight w(i, j) is cost of transition

24

Minimum Hamiltonian Path Problem

[Figure: directed graph over the requests R0, R1, R2, R3 with edge weights w(0,1), w(0,2), w(0,3), w(1,2), w(2,1), w(1,3), w(3,1), w(2,3), w(3,2).]

R0 – last request on the bus. Must be the starting point of a path.
R1, R2, R3 – requests in the queue

w(i,j) = P(i,j)^n * D(i,j)^m
  P(i,j) – power of Rj after Ri
  D(i,j) – delay of Rj after Ri

Hamiltonian path: R0 -> R3 -> R1 -> R2

Minimum path weight = w(0,3) + w(3,1) + w(1,2)

NP-complete problem
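The exhaustive form of this search can be sketched in a few lines: enumerate every ordering of the queued requests that starts at R0 and keep the cheapest. The weight matrix below is hypothetical and chosen so that the minimum path is R0 -> R3 -> R1 -> R2, as in the slide's example; `edge_weight` follows the (n, m) cost definition above, and the other names are illustrative.

```python
from itertools import permutations


def edge_weight(power, delay, n, m):
    """Transition cost w(i,j) = P(i,j)^n * D(i,j)^m for scheme (n, m)."""
    return power ** n * delay ** m


def path_cost(order, w):
    """Sum of edge weights along consecutive requests in `order`."""
    return sum(w[a][b] for a, b in zip(order, order[1:]))


def exhaustive_order(w, start=0):
    """Try all (n-1)! orderings of the pending requests after `start`."""
    pending = [i for i in range(len(w)) if i != start]
    best = min(permutations(pending),
               key=lambda p: path_cost((start,) + p, w))
    order = (start,) + best
    return order, path_cost(order, w)


# Hypothetical weights: w[i][j] is the cost of serving Rj right after Ri.
W = [[0, 5, 5, 1],
     [5, 0, 1, 5],
     [5, 5, 0, 5],
     [5, 1, 5, 0]]

best_order, best_cost = exhaustive_order(W)
```

The factorial growth of `permutations` is what makes this approach impractical inside an arbiter, which motivates the greedy approximation on the next slide.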

25

Greedy Solution

[Figure: the same directed graph over the requests R0, R1, R2, R3 with edge weights w(i,j).]

Greedy algorithm (local minimum)
  Only the next request in the path is needed:
  min{ w(0,j) }, where w(i,j) is the edge weight of graph G(V,E)

In each iteration of arbitration:
1. A new graph G(V,E) needs to be constructed.
2. The request chosen by the greedy solution is granted the bus.
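The greedy step can be sketched as follows: from the last request on the bus, grant whichever pending request has the smallest transition weight, then repeat from the newly granted request. The weight matrix is hypothetical, and for simplicity the sketch drains a fixed queue, whereas a real arbiter would see new requests arrive between iterations.

```python
def greedy_next(last, pending, w):
    """Pick the pending request with the smallest transition cost from `last`."""
    return min(pending, key=lambda j: w[last][j])


def greedy_schedule(w, start=0):
    """Repeatedly grant the locally cheapest request until the queue is empty."""
    pending = [i for i in range(len(w)) if i != start]
    order, total, last = [], 0, start
    while pending:
        nxt = greedy_next(last, pending, w)
        total += w[last][nxt]
        pending.remove(nxt)
        order.append(nxt)
        last = nxt
    return order, total


# Hypothetical weights: w[i][j] is the cost of serving Rj right after Ri.
W = [[0, 4, 6, 2],
     [0, 0, 3, 9],
     [0, 5, 0, 7],
     [0, 1, 8, 0]]

order, total = greedy_schedule(W)
```

Each arbitration decision needs only one pass over the pending requests, which is the source of the O(n) cost quoted later in the comparison with exhaustive search.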

26

Experimental Setup

Utilized embedded power modeling framework

Implemented eleven different arbitration schemes inside EBIU
  FCFS, Fixed Priority
  Minimum power (P1D0) or (1, 0), minimum delay (P0D1) or (0, 1), and (1, 1), (1, 2), (2, 1), (1, 3), (3, 1), (3, 2), (2, 3)

10 multimedia application benchmarks are ported to the Blackfin architecture and simulated, including MPEG-2, H.264, JPEG, PGP and G.721.

27

Power Improvement

Power-aware arbitration schemes consume less power than Fixed Priority and FCFS.

The difference across the power-aware arbitration strategies is small.

The pipelined command model saves 6-7% power compared with the sequential command model for MPEG2 ENC and DEC.

The results are consistent across all other benchmarks.

[Figure: two bar charts, "MPEG2 Encoder External Bus Power" and "MPEG2 Decoder External Bus Power". X-axis: arbitration algorithm (FP, FCFS, (0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (1, 3), (2, 3), (3, 2), (3, 1)). Y-axis: average bus power (mW). Series: Sequential Command, Pipelined Command.]

28

Speed Improvement

Power-aware schemes have smaller bus delay than traditional Fixed Priority and FCFS.

The difference across the power-aware arbitration strategies is small.

The pipelined command model gives a 3-9% speedup over the sequential command model for MPEG2 ENC and DEC.

The results are consistent across all other benchmarks.

[Figure: two bar charts, "MPEG2 Encoder External Bus Delay" and "MPEG2 Decoder External Bus Delay". X-axis: arbitration algorithm (FP, FCFS, (0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (1, 3), (2, 3), (3, 2), (3, 1)). Y-axis: average delay (SCLK). Series: Sequential Command, Pipelined Command.]

29

Comparison with Exhaustive Algorithm

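The greedy algorithm can fail in certain cases.

Complexity of O(n) vs. O(n!).

Performance difference is negligible.

[Figure: example request graph over R0, R1, R2, R3 with edge weights 20, 20, 18, 17, 15, 7, 5, 18, 17, contrasting the path found by exhaustive search with the path found by greedy search.]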

30

Comments on Experimental Results

Power aware arbitrators significantly reduce the external bus power for all 8 benchmarks. On average, the power saving is 14%.

Power aware arbitrators reduce the bus access delay. The delay is reduced by 21% on average across the 8 benchmarks.

The pipelined SDRAM model has a large performance advantage over the sequential SDRAM model. It achieves 6% power saving and 12% speedup.

Power and delay on the external bus are highly correlated. Minimum power also achieves minimum delay.

Minimum power schemes lead to simpler design options. Scheme (1,0) is preferred due to its simplicity.

31

Design of A Power Estimation Unit (PEU)

[Figure: Power Estimation Unit (PEU) datapath. The next request address is split into bank address, row address, and column address and compared against stored state: the last bank address, the open row address of each bank (Bank(0) onward), and the last column address.]

Bank address: if not equal, output bank miss power.
Row address: if not equal, output page miss penalty power.
Column address: use Hamming distance to calculate column address data power; update last column address register.
Output: estimated power.
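The comparisons in the PEU diagram can be mimicked in software as a rough sketch. The address split (2 bank bits, 12 row bits, 10 column bits), the cost constants, and the class and method names are all assumptions made for illustration; the slides do not give them.

```python
class PowerEstimationUnit:
    """Software sketch of the PEU: estimate the cost of a candidate request."""

    def __init__(self, num_banks=4, bank_miss=1.0, page_miss=5.0, toggle=0.1):
        self.open_row = [None] * num_banks   # open row address per bank
        self.last_bank = None                # last bank address register
        self.last_col = 0                    # last column address register
        self.bank_miss, self.page_miss, self.toggle = bank_miss, page_miss, toggle

    @staticmethod
    def split(addr):
        """Assumed 24-bit layout: bank[23:22], row[21:10], column[9:0]."""
        return (addr >> 22) & 0x3, (addr >> 10) & 0xFFF, addr & 0x3FF

    def estimate(self, addr):
        """Estimated power of serving `addr` next; does not change state."""
        bank, row, col = self.split(addr)
        power = 0.0
        if bank != self.last_bank:
            power += self.bank_miss          # bank miss power
        if row != self.open_row[bank]:
            power += self.page_miss          # page miss penalty power
        # Column address power from the Hamming distance to the last column.
        power += self.toggle * bin(col ^ self.last_col).count("1")
        return power

    def commit(self, addr):
        """Update the state registers once `addr` is granted the bus."""
        bank, row, col = self.split(addr)
        self.last_bank, self.last_col = bank, col
        self.open_row[bank] = row


def pick_min_power(peu, pending):
    """Scheme (1, 0): grant the pending request with the lowest estimate."""
    return min(pending, key=peu.estimate)


peu = PowerEstimationUnit()
first = (1 << 10) | 3                        # bank 0, row 1, column 3
cold_cost = peu.estimate(first)
peu.commit(first)
winner = pick_min_power(peu, [(2 << 10) | 1, (1 << 10) | 1])
```

Separating `estimate` from `commit` mirrors the diagram, where candidates are only scored and the state registers change once a request actually wins arbitration.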

32

Two Arbitrator Implementation Structures

[Figure: two arbitrator block diagrams, the shared PEU structure and the dedicated PEU structure. In both, a request queue buffer holds requests (t, s, b, l); power estimator units (PEU) use the memory/bus state information to estimate the power of queued requests; a comparator selects the minimum power request, which is passed to the access command generator and the external bus and triggers a state update. The two structures differ in whether the queued requests share estimator units or each have a dedicated one.]

33

Performance of two structures

Higher PEU delay lowers the external bus performance for both the MPEG-2 encoder and decoder.

When the PEU delay is 5 cycles or higher, the dedicated structure is preferred over the shared structure. Otherwise, the shared structure is sufficient.

[Figure: two line charts, "MPEG-2 Encoder (1,0) Arbitrator Estimator Unit Implementation Performance Comparison" and "MPEG-2 Decoder (1,0) Arbitrator Estimator Unit Implementation Performance Comparison". X-axis: estimator logic delay (cycles), 0 to 10. Y-axis: average delay (cycles). Series: Estimator Unit Shared, Estimator Unit Dedicated.]

34

Summary of Bus Arbitration Schemes

Efficient bus arbitration can benefit both power and performance over traditional arbitration schemes.

Minimum power and minimum delay are highly correlated on the external bus.

The pipelined SDRAM model has a significant advantage over the sequential SDRAM model.

Arbitration scheme (1, 0) is recommended.

The minimum power approach provides more design options and leads to simpler design implementations.

The trade-off between design complexity and performance was presented.

35


36

Data Access Pattern in Multimedia Apps

[Figure: three address-versus-time plots illustrating the fixed stride, 2-way stream, and 2-D stride access patterns.]

3 common data access patterns in multimedia applications

Majority of cycles in loop bodies and array accesses

High data access bandwidth

Poor locality, cross page references

37

Previous work on Access Pattern

Previous work was performance driven, using OS/compiler related approaches
  Data pre-fetching [Chen94] [Zhang00]
  Memory customization [Adve00] [Grun01]
  Data layout optimization [Catthoor98] [DeLaLuz04]

Shortcomings of OS/compiler-based strategies:
  Multimedia benchmarks' dominant activities are within large monolithic data buffers.
  Buffers generally contain many memory pages and cannot be further optimized.
  Constrained by the OS and compiler capability. Poor flexibility.

38

Optimization II - Page Remapping

Technique currently used in large memory space peripheral memory access.

External memories in embedded multimedia systems
  High bus access overhead
  Page miss penalty

Efficient page remapping can
  Reduce page misses
  Improve external bus throughput
  Reduce power / energy consumption

39

Page Remapping Target Region

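[Figure: the embedded SOC block diagram (media processor core, system DMA controller with Memory DMA 0, Memory DMA 1, PPI DMA, and SPORT DMA, NTSC/PAL encoder with S-Video/CVBS, streaming interface, NIC, internal bus, External Bus Interface Unit (EBIU), external bus, SDRAM, FLASH memory, asynchronous devices), with the target region for page remapping highlighted.]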

40

SDRAM Memory Pages

[Figure: SDRAM organized as M banks (Bank 0 to Bank M-1), each holding N pages (Page 0 to Page N-1). Pages are marked X, with one page in each bank marked X*.]

High memory access latency:
  Minimum latency of an SCLK cycle
  Page miss penalty
  Additional latency due to refresh cycle
  No guaranteed access due to arbitration logic
  Non-sequential read/write would suffer

41

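SDRAM Page Miss Penalty

[Figure: two COMMAND/DATA timing diagrams over system clock cycles (SCLK), with the intervals tRP, tRCD, and tCAS marked. In the first, the command sequence P A R R R R P A R R includes a second PRECHARGE/ACTIVATE pair for a page miss, which delays the data. In the second, a single P A is followed by back-to-back reads and continuous data. Legend: P = PRECHARGE, A = ACTIVATE, R = READ, N = NOP, D = DATA.]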

42

SDRAM Timing Parameters

Access type      Number of cycles
Read cycle       trp + n*(tcas)
Write cycle      twp
Page miss        trp + trcd
Refresh cycle    2*(trcd) * nrows

SDRAM parameter        SCLK cycles
trcd                   1-15
trp                    1-7
trcd = tras + trp      1-15
tcas                   2-3

twp = write to precharge
trp = read to precharge
tras = activate to precharge
tcas = read latency

~8-10 SCLK penalty associated with a page miss

43

SDRAM Page Access Sequence (I)

[Figure: four banks (Bank 0 to Bank 3), with Pages 0 to 3, and a command timeline over the system clock showing 12 reads across 4 banks, in which PRECHARGE/ACTIVATE/READ sequences are repeated many times. Legend: P = precharge, A = activation, R = read.]

Typical access pattern of 2-D stride / 2-way stream.

Poor data layout causes significant access overhead.

44




SDRAM Page Access Sequence (II)

[Figure: command timeline over the system clock showing four P A R sequences followed by runs of back-to-back reads.]

Less access overhead with distributed data layout.


45

Why we use Page Remapping

[Figure: four banks (Bank 0 to Bank 3) with the pages in row Page 2 marked X, shown before and after remapping.]

Page remapping entry of Page 2: {2,0,1,3}

46

Module in an SOC System

Address translation unit; it only translates the bank address.

A non-MMU system inserts a page remapping module before the EBIU.

An MMU system can take advantage of the existing address translation unit. No extra hardware is needed.

[Figure: internal bus -> page remapping module -> External Bus Interface Unit (EBIU) -> external bus -> SDRAM, FLASH memory, asynchronous devices.]

47




Sequence (I) after Remapping

[Figure: command timeline over the system clock showing four P A R sequences followed by runs of back-to-back reads.]

Same performance as sequence (II).

Applicable for monolithic data buffers (e.g. frame buffers).


48

Page Remapping Algorithm

NP-complete problem.

Reducible to graph coloring problem in a page transition graph G(V,E).

Vertex = Page Im,n
  m – page bank number
  n – page row number

Edge = Transition of Page Im,n to Ip,q
  Weighted edges capture page traversal during the program execution
  Edge weight is the number of transitions from Page Im,n to Page Ip,q

Color = Bank
  Each bank has one distinct color.
  Every page will be assigned one color.

49

Page Remapping Algorithm (continued)

Page Remapping Algorithm
  From the page transition graph, find the color (bank) assignment for each page, such that the transition cost between same color pages is minimized.

Algorithm Steps:
  Sort the edges based on their transition weight
  Edges are processed in decreasing weight order
  Color the pages associated with each edge
  Weight parameter array for each page represents the cost of mapping that page into each bank, e.g. {500, 200, 0, 0}

5 different situations of processing each edge

Page remapping table (PMT) is generated as a result of mapping.
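The steps above can be sketched as a greedy edge-ordered coloring. This is a simplified illustration, not the algorithm from the slides: pages are written as (bank, row) tuples, the edge list uses the weights of the worked example that follows, and the conflict handling shown here (move one endpoint to a free, cheaper bank in its own row) is an assumption that differs from the slides' own conflict-resolution step.

```python
from collections import defaultdict


def remap_pages(edges, num_banks):
    """Assign a bank (color) to every page, heaviest transitions first."""
    adj = defaultdict(dict)      # processed transitions: page -> {neighbor: weight}
    bank_of = {}                 # page -> assigned bank
    taken = defaultdict(set)     # row -> banks already used in that row

    def cost(page, bank):
        """Weight of processed transitions to neighbors currently in `bank`."""
        return sum(w for nb, w in adj[page].items() if bank_of.get(nb) == bank)

    def free_banks(page):
        return [b for b in range(num_banks) if b not in taken[page[1]]]

    def place(page, bank):
        if page in bank_of:
            taken[page[1]].discard(bank_of[page])
        bank_of[page] = bank
        taken[page[1]].add(bank)

    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        if u[1] == v[1]:
            continue             # same-row pages always sit in different banks
        adj[u][v] = w
        adj[v][u] = w
        for page in (u, v):
            if page not in bank_of:
                place(page, min(free_banks(page), key=lambda b: (cost(page, b), b)))
        if bank_of[u] == bank_of[v]:
            # Simplified conflict handling (an assumption, see lead-in).
            for page in (v, u):
                better = [b for b in free_banks(page)
                          if cost(page, b) < cost(page, bank_of[page])]
                if better:
                    place(page, min(better, key=lambda b: (cost(page, b), b)))
                    break
    return bank_of


def same_bank_cost(edges, bank_of):
    """Total weight of transitions between different rows of the same bank."""
    return sum(w for u, v, w in edges
               if u[1] != v[1] and bank_of[u] == bank_of[v])


# Transitions of the example case, pages written as (bank, row).
EDGES = [((0, 0), (0, 1), 500), ((1, 1), (1, 2), 200), ((0, 0), (3, 1), 100),
         ((1, 2), (2, 1), 80), ((3, 1), (1, 3), 60), ((1, 1), (3, 1), 50),
         ((2, 1), (1, 3), 40), ((0, 0), (1, 1), 30)]

banks = remap_pages(EDGES, num_banks=4)
```

On this input the sketch leaves no weighted transition between two rows of the same bank; the final bank numbers it picks need not match those in the slides, since its conflict handling is different.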

50

Example Case

[Figure: the original page allocation (banks by Pages 0 to 3) holding pages I0,0, I0,1, I1,1, I1,2, I1,3, I2,1, and I3,1, next to the page transition graph over the same pages with edge weights 500, 200, 100, 80, 60, 50, 40, and 30.]

51

Initial Step

[Figure: empty bank allocation table.]

No page is mapped. All slots are available.

52

Step (1) – two unmapped pages

Selected edge: I0,0 to I0,1, weight 500

Weight parameter updates:
  I0,0[0]: {0, 500, 0, 0}
  I0,1[1]: {500, 0, 0, 0}

Actions: Allocate unmapped pages I0,0 and I0,1

[Figure: bank allocation table now holding I0,0 and I0,1.]

53

Step (2) – two unmapped pages

Selected edge: I1,1 to I1,2, weight 200

Weight parameter updates:
  I1,1[0]: {0, 200, 0, 0}
  I1,2[1]: {200, 0, 0, 0}

Actions: Allocate unmapped pages I1,1 and I1,2

[Figure: bank allocation table now holding I0,0, I0,1, I1,1, and I1,2.]

54

Step (3) – one unmapped page

Selected edge: I0,0 to I3,1, weight 100

Weight parameter updates:
  I3,1[2]: {100, 0, 0, 0}
  I0,0[0]: {0, 500, 100, 0}

Actions: Map page I3,1; no change for I0,0

[Figure: bank allocation table now holding I0,0, I0,1, I1,1, I1,2, and I3,1.]

55

Step (4)

Selected edge: I1,2 to I2,1, weight 80

Weight parameter updates:
  I2,1[3]: {0, 80, 0, 0}
  I1,2[1]: {200, 0, 0, 80}

[Figure: bank allocation table now holding I0,0, I0,1, I1,1, I1,2, I3,1, and I2,1.]




56

Step (5)

Selected edge: I3,1 to I1,3, weight 60

Weight parameter updates:
  I1,3[0]: {0, 0, 60, 0}
  I3,1[2]: {160, 0, 0, 0}

[Figure: bank allocation table now holding I0,0, I0,1, I1,1, I1,2, I3,1, I2,1, and I1,3.]




57

Step (6) – same row pages

Selected edge: I1,1 to I3,1, weight 50

Actions: Both I1,1 and I3,1 are on the same row, no actions.

[Figure: bank allocation table, unchanged, holding I0,0, I0,1, I1,1, I1,2, I3,1, I2,1, and I1,3.]

58

Step (7) – two mapped pages

Selected edge: I2,1 to I1,3, weight 40

Weight parameter updates:
  I1,3[0]: {0, 0, 60, 40}
  I2,1[3]: {40, 80, 0, 0}

Actions: Both I2,1 and I1,3 are mapped, no conflicts.

[Figure: bank allocation table, unchanged, holding I0,0, I0,1, I1,1, I1,2, I3,1, I2,1, and I1,3.]

59

Step (8) – conflict resolving

Selected edge: I0,0 to I1,1, weight 30

Actions: Both I0,0 and I1,1 are mapped and in the same bank.

Current weight parameters:
  I2,1[3]: {40, 80, 0, 0}
  I3,1[2]: {160, 0, 0, 0}
  I1,1[0]: {30, 200, 0, 0}
  I0,1[1]: {500, 0, 0, 0}

Updated weight parameters:
  I0,0[0]: {0, 500, 100, 30}

No conflict

[Figure: bank allocation tables before and after conflict resolution, holding I0,0, I0,1, I1,1, I1,2, I1,3, I2,1, and I3,1.]

60

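[Figure: address path from the I-Cache and D-Cache through the page remapping table (4 kB) to the EBIU and the 16 MB external SDRAM. The external memory address supplies a 14-bit memory page address that indexes the table; the table supplies the 2-bit bank address, and the 22-bit row/column address goes to the EBIU. The generated PMT shows 2-bit bank entries (00, 01, 10, 11) for pages I0,0, I3,1, I2,1, I1,2, I0,1, I1,1, and I1,3, with unused entries marked xx.]

Generated PMT table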
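A minimal sketch of the table lookup follows, assuming a 24-bit address laid out as 2 bank bits, 12 row bits, and 10 column bits; the slide gives only the 14-bit page address, 2-bit bank address, and 22-bit row/column widths, so the row/column split and the bit positions are assumptions. The helper names are hypothetical, and the per-row entry {2,0,1,3} is interpreted here as "original bank b maps to entry[b]", which is also an assumption.

```python
BANK_BITS = 2
ROW_BITS = 12                        # assumed split of the 22 row/column bits
COL_BITS = 10
ROWCOL_BITS = ROW_BITS + COL_BITS    # 22 bits, passed through unchanged


def identity_pmt():
    """One 2-bit entry per 14-bit page address; initially bank -> same bank."""
    return [page >> ROW_BITS for page in range(1 << (BANK_BITS + ROW_BITS))]


def set_row_entry(pmt, row, perm):
    """Install a per-row bank permutation: original bank b -> perm[b]."""
    for old_bank, new_bank in enumerate(perm):
        pmt[(old_bank << ROW_BITS) | row] = new_bank


def translate(addr, pmt):
    """Replace only the bank bits of `addr` with the PMT entry of its page."""
    page = addr >> COL_BITS                       # 14-bit (bank, row) index
    rowcol = addr & ((1 << ROWCOL_BITS) - 1)      # row and column untouched
    return (pmt[page] << ROWCOL_BITS) | rowcol


pmt = identity_pmt()
set_row_entry(pmt, row=2, perm=[2, 0, 1, 3])

table_bytes = len(pmt) * BANK_BITS // 8
remapped = translate((0 << ROWCOL_BITS) | (2 << COL_BITS) | 5, pmt)
```

With 2^14 entries of 2 bits each, the table occupies 4 kB, which matches the size shown in the figure.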

61

Experimental Setup

Utilized embedded power modeling framework

Extended address translation unit for page remapping

Page coloring program to generate PMT

Same 10 multimedia application benchmarks
  MPEG-2 encoder and decoder
  H.264 encoder and decoder
  JPEG encoder and decoder
  PGP encoder and decoder
  G.721 encoder and decoder

62

Page Miss Reduction

[Figure: bar chart of page misses per 100 requests for each benchmark (MPEG2-ENC, MPEG2-DEC, H264-ENC, H264-DEC, JPEG-ENC, JPEG-DEC, PGP-ENC, PGP-DEC, G721-ENC, G721-DEC). Series: 2, 4, and 8 banks, original and remapped.]

63

External Bus Power

[Figure: bar chart of external power (mW) for each benchmark (MPEG2-ENC, MPEG2-DEC, H264-ENC, H264-DEC, JPEG-ENC, JPEG-DEC, PGP-ENC, PGP-DEC, G721-ENC, G721-DEC). Series: 2, 4, and 8 banks, original and remapped.]

64

Average Access Delay

[Figure: bar chart of average request delay (cycles) for each benchmark (MPEG2-ENC, MPEG2-DEC, H264-ENC, H264-DEC, JPEG-ENC, JPEG-DEC, PGP-ENC, PGP-DEC, G721-ENC, G721-DEC). Series: 2, 4, and 8 banks, original and remapped.]

65

Comments on Page Remapping

The page remapping algorithm is presented by example.

Our algorithm can significantly reduce the memory page miss rate, by 70-80% on average.

For a 4-bank SDRAM memory system, we reduced external memory access time by 12.6%.

The proposed algorithm reduces power consumption in the majority of the benchmarks, with an average power reduction of 13.2%.

Combining the effects of both power and delay, our algorithm significantly reduces the total energy cost.

A stability study was done in the dissertation. A PMT table generated from one test vector input performs well on different inputs.

66


67

Summary

Reviewed the issues of external bus power in a system-on-a-chip (SOC) embedded system.

Built external bus power estimation framework and experimental methodology.
  PACS'04

Proposed a series of power aware bus arbitration schemes and their performance improvement over traditional schemes.
  HiPEAC'05
  Also appeared in LNCS Transactions on High-Performance Embedded Architectures and Compilers

Proposed page remapping algorithm to reduce page misses and its power and delay improvements.
  LCTES'07

68

Future Work

Integration of power estimation framework in complete tool chain

Extend arbitration schemes to multiple memory interfaces and other peripheral interfaces.

Compare performance of page remapping with corresponding OS/Compiler schemes

69

Thank You !
