System-Level Memory Bus Power And Performance Optimization for Embedded Systems


System-Level Memory Bus Power And Performance Optimization for Embedded Systems

Ke Ning kning@ece.neu.edu
David Kaeli kaeli@ece.neu.edu

2

Why Is Power More Important?

“Power: A First Class Design Constraint for Future Architecture” – Trevor Mudge 2001

Increasing complexity for higher performance (MIPS)
- Parallelism, pipelining, memory/cache size
- Higher clock frequency, larger die size
- Rising dynamic power consumption

CMOS process continues to shrink: smaller logic gates reduce V_threshold
- Lower V_threshold leads to higher leakage
- Leakage power will exceed dynamic power

Things get worse in embedded systems
- Low power and low cost systems
- Fixed or limited applications/functionality
- Real-time systems with timing constraints

3

Power Breakdown of An Embedded System

[Pie chart: power breakdown across Internal Dynamic, Internal Leakage, RTC, PPI, SPORT0, SPORT1, UART, and SDRAM. Conditions: 25°C; 1.2V internal, 400MHz CCLK Blackfin processor; 3.3V external, 133MHz SDRAM; 27MHz PPI. The external SDRAM portion is the research target. Source: Analog Devices Inc.]

4

Introduction

Related work on microprocessor power
- Low power design trends
- Power metrics
- Power/performance tradeoffs
- Power optimization techniques

Power estimation framework
- Experimental framework built from a Blackfin cycle-accurate simulator
- Validated through a Blackfin EZ-Kit board

Power-aware bus arbitration
Memory page remapping

5

Outline

Research Motivation and Introduction

Related Work

Power Estimation Framework

Optimization I – Power-Aware Bus Arbitration

Optimization II – Memory Page Remapping

Summary

6

Power Modeling

Dynamic power estimation
- Instruction-level model: [Tiwari94], JouleTrack [Sinha01]
- Function-level model: [Qu00]
- Architecture model: Cai-Lim model, TEMPEST [CaiLim99], Wattch [Brooks00], SimplePower [Ye00]

Static power estimation
- Butts-Sohi model [Butts00]

Previous memory system power estimation
- Activity model: CACTI [Wilton96]
- Trace-driven model: Dinero IV [Elder98]

7

Power Equation

P = A · C · V_DD^2 · f + V_DD · k_design · N · I_leakage

The first term is the dynamic (switching) power; the second is the leakage power.

A – activity factor
C – total capacitance
V_DD – supply voltage
f – clock frequency
N – transistor number
k_design – technology factor
I_leakage – leakage current
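The equation can be evaluated directly as a sanity check. A minimal sketch; the operating-point numbers below are illustrative assumptions, not values from the slides.

```python
def total_power(A, C, f, Vdd, k_design, N, I_leak):
    """Slide's model: P = A*C*Vdd^2*f (dynamic) + Vdd*k_design*N*I_leak (leakage)."""
    dynamic = A * C * Vdd ** 2 * f          # switching term
    leakage = Vdd * k_design * N * I_leak   # static term
    return dynamic + leakage

# Assumed operating point: 20% activity, 1 nF switched capacitance,
# 400 MHz core clock, 1.2 V supply, 10M transistors, 1 nA leakage each.
p = total_power(A=0.2, C=1e-9, f=400e6, Vdd=1.2, k_design=1.0, N=1e7, I_leak=1e-9)
```

With these placeholder numbers the dynamic term dominates, which matches the slide's point that leakage only overtakes it as V_threshold keeps shrinking.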

8

Common Power Optimization Techniques

Gating (turn off unused components)Clock gatingVoltage gating: Cache decay [Hu01]

Scaling: (scale operating point of an component)Voltage scaling: Drowsy cache [Flautner02]Frequency scaling: [Pering98]Resource scaling: DRAM power mode [Delaluz01]

Banking: (break single component into smaller sub-units)Vertical sub-banking: Filter cache[Kin97]Horizontal sub-banking: Scratchpad [Kandemir01]

Clustering: (partition components into clusters)Switching reduction: (redesigning with lower activity)

Bus encoding: Permutation Code [Mehta96], redundant code[Stan95, Benini98], WZE[Musoll97]

9

Power Aware Figure of Merit

Delay, DPerformance, MIPS

Power, PBattery life (mobile), packaging (high performance)

Obvious choice for power performance tradeoff, PDJoules/instruction, inversely MIPS/WEnergy figureMobile / low power applications

Energy Delay PD2

MIPS2/W [Gonzalez96]Energy Delay Square PD3

MIPS3/WVoltage and frequency independent

More generically, MIPSm/W
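The family of metrics above differs only in the exponent m. A small sketch with hypothetical numbers (a 400-MIPS core drawing 0.5 W is an assumption for illustration):

```python
def power_figure_of_merit(mips, watts, m=1):
    """MIPS**m per watt: m=1 is the energy figure (MIPS/W),
    m=2 corresponds to energy-delay, m=3 to energy-delay-square."""
    return mips ** m / watts

# Hypothetical 400-MIPS core drawing 0.5 W:
power_figure_of_merit(400, 0.5, m=1)  # MIPS/W
power_figure_of_merit(400, 0.5, m=2)  # MIPS^2/W
```

Raising m weights performance more heavily relative to power, which is why PD^3 (MIPS^3/W) is largely insensitive to voltage/frequency scaling.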

10

Power Optimization Effect on Power Figure

Most optimization schemes sacrifice performance for lower power consumption; switching reduction is the exception.
All optimization schemes yield higher power efficiency.
All optimization schemes increase hardware complexity.

11

Outline

Research Motivation and Introduction

Related Work

Power Estimation Framework

Optimization I – Power-Aware Bus Arbitration

Optimization II – Memory Page Remapping

Summary

12

External Bus

External bus components
- Typically an off-chip bus
- Includes control bus, address bus, data bus

External bus power consumption
- Dynamic power factors: activity, capacitance, frequency, voltage
- Leakage power: supply voltage, threshold voltage, CMOS technology

Different from internal memory bus power:
- Longer physical distance, higher bus capacitance, lower speed
- Cross-line interference, higher leakage current
- Different communication protocols (memory/peripheral dependent)
- Multiplexed row/column address bus, narrower data bus

13

Embedded SOC System Architecture

[Block diagram: a media processor core with data and instruction caches, a system DMA controller (Memory DMA 0, Memory DMA 1, PPI DMA, SPORT DMA), an NTSC/PAL encoder, a streaming interface, and S-Video/CVBS and NIC ports, all connected over the internal bus to the External Bus Interface Unit (EBIU). The EBIU drives the external bus to SDRAM, FLASH memory, and asynchronous devices. The power modeling area covers the EBIU, external bus, and external memories.]

14

ADSP-BF533 EZ-Kit Lite Board

[Board diagram: a BF533 Blackfin processor connected to FLASH memory, SDRAM memory, SPORT data I/O, a video codec/ADV converter (video in and out), and an audio codec/A-D converter (audio in, audio out).]

15

External Bus Power Estimator

Previous approaches
- Used Hamming distance [Benini98]
- Control signals were not considered
- Shared row and column address bus
- Memory state transitions were not considered

In our estimator
- Integrate memory control signal power into the model
- Consider the case where row and column addresses are shared
- Memory state transitions and stalls also cost power
- Consider page miss penalty and traffic reversal penalty

P(bus) = P(page miss) + P(bus turnaround) + P(control signal) + P(address generation) + P(data transmission) + P(leakage)
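The decomposition can be mirrored as a toy per-access estimator. Every coefficient below is an illustrative assumption, not a measured Blackfin value.

```python
def bus_access_power(page_miss, turnaround, h_ctrl, h_addr, h_data,
                     e_miss=2.0, e_turn=1.0, e_bit=0.1, p_leak=0.5):
    """Sum the P(bus) terms for one access.
    h_* are Hamming distances (bit toggles) on the control, address,
    and data buses; all coefficients are hypothetical units."""
    p = p_leak                                # P(leakage)
    p += e_miss if page_miss else 0.0         # P(page miss)
    p += e_turn if turnaround else 0.0        # P(bus turnaround)
    p += e_bit * (h_ctrl + h_addr + h_data)   # P(control) + P(address) + P(data)
    return p

bus_access_power(page_miss=True, turnaround=False, h_ctrl=1, h_addr=2, h_data=3)
```

The point of the structure, as in the slide, is that page misses and turnarounds are charged explicitly rather than folded into a single Hamming-distance term.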

16

Two External Bus SDRAM Timing Models

[Timing diagrams: back-to-back requests to bank 0 and bank 1, showing PRECHARGE (P), ACTIVATE (A), READ (R), and NOP (N) commands over system clock cycles (SCLK), with tRP, tRCD, and tCAS latencies. (a) SDRAM access in sequential command mode: the bank 1 request's commands start only after the bank 0 burst completes, inserting NOP cycles. (b) SDRAM access in pipelined command mode: bank 1's PRECHARGE/ACTIVATE overlap the bank 0 burst, eliminating most NOPs.]

17

Bus Power Simulation Framework

[Flow diagram: a compiler produces the program target binary, which feeds an instruction-level simulator and a memory trace generator; the trace drives the memory hierarchy model and the memory technology timing model, which feed the external bus power estimator and memory power model to produce the bus power output. The software modules were developed in this work.]

18

Multimedia Benchmark Configurations

Name — Description — I-Cache Size — D-Cache Size
MPEG2-ENC — MPEG-2 video encoder with 720x480 4:2:0 input frames — 16k — 16k
MPEG2-DEC — MPEG-2 video decoder of a 720x480 sequence with 4:2:2 CCIR frame output — 16k — 16k
H264-ENC — H.264/MPEG-4 Part 10 (AVC) digital video encoder for very high data compression — 16k — 16k
H264-DEC — H.264/MPEG-4 Part 10 (AVC) video decompression algorithm — 16k — 16k
JPEG-ENC — JPEG image encoder for a 512x512 image — 8k — 8k
JPEG-DEC — JPEG image decoder for a 512x512 image — 8k — 8k
PGP-ENC — Pretty Good Privacy encryption and digital signature of a text message — 8k — 4k
PGP-DEC — Pretty Good Privacy decryption of an encrypted message — 8k — 4k
G721-ENC — G.721 voice encoder of 16-bit input audio samples — 4k — 2k
G721-DEC — G.721 voice decoder of encoded bits — 4k — 2k

19

Outline

Research Motivation and Introduction

Related Work

Power Estimation Framework

Optimization I – Power-Aware Bus Arbitration

Optimization II – Memory Page Remapping

Summary

20

Optimization I – Bus Arbitration

Multiple bus access masters in an SOC system
- Processor cores
- Data/instruction caches
- DMA
- ASIC modules

Multimedia applications
- High bus bandwidth throughput
- Large memory footprint

An efficient arbitration algorithm can:
- Increase power awareness
- Increase bus throughput
- Reduce bus power

21

Bus Arbitration Target Region

[Block diagram: the same SOC architecture as before, with arbitration enabled inside the EBIU between the internal bus masters (core, caches, DMA channels, peripherals) and the external bus to SDRAM, FLASH memory, and asynchronous devices.]

22

Bus Arbitration Schemes

EBIU with arbitration enabled
- Handles core-to-memory and core-to-peripheral communication
- Resolves bus access contention
- Schedules bus access requests

Traditional algorithms
- First Come First Serve (FCFS)
- Fixed Priority

Power-aware algorithms (categorized by power metric / cost function)
- Minimum power (P1D0) or (1, 0)
- Minimum delay (P0D1) or (0, 1)
- Minimum power-delay product (P1D1) or (1, 1)
- Minimum power-delay-square product (P1D2) or (1, 2)
- More generically, (PnDm) or (n, m)

23

Bus Arbitration Schemes (Continued)

Power-aware arbitration
- From the current pending requests in the waiting queue, find a permutation of the external bus requests that achieves the minimum total power and/or performance cost.
- Reducible to the minimum Hamiltonian path problem in a graph G(V, E).

Vertex = request R(t, s, b, l)
- t – request arrival time
- s – starting address
- b – block size
- l – read/write

Edge = transition from request i to request j
- Edge weight w(i, j) is the cost of the transition

24

Minimum Hamiltonian Path Problem

[Graph: R0 and queued requests R1, R2, R3, connected by directed edges with weights w(i, j).]

R0 – the last request on the bus; it must be the starting point of any path.
R1, R2, R3 – requests in the queue.

w(i, j) = P(i, j)^n · D(i, j)^m
- P(i, j) – power of Rj issued after Ri
- D(i, j) – delay of Rj issued after Ri

Example Hamiltonian path: R0 -> R3 -> R1 -> R2
Minimum path weight = w(0,3) + w(3,1) + w(1,2)

Finding the minimum-weight Hamiltonian path is an NP-complete problem.

25

Greedy Solution

[Graph: the same request graph as in the Hamiltonian path formulation.]

Greedy algorithm (local minimum)
- Only the next request in the path is needed:
  min { w(0, j) | w(i, j) is an edge weight of graph G(V, E) }

In each iteration of arbitration:
1. A new graph G(V, E) is constructed.
2. The greedy-minimum request is arbitrated onto the bus.
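The greedy rule is a one-step lookahead over the pending queue. A minimal sketch, assuming a caller-supplied cost(i, j) that returns the (power, delay) of issuing request j after request i; the cost table below is hypothetical:

```python
def greedy_arbitrate(last_req, pending, cost, n=1, m=0):
    """Grant the pending request j minimizing w(i, j) = P(i, j)**n * D(i, j)**m,
    where i is the last request granted on the bus."""
    return min(pending, key=lambda j: cost(last_req, j)[0] ** n *
                                      cost(last_req, j)[1] ** m)

# Hypothetical (power, delay) transition costs from the last request R0:
costs = {(0, 1): (20.0, 4), (0, 2): (15.0, 6), (0, 3): (5.0, 9)}
lookup = lambda i, j: costs[(i, j)]
greedy_arbitrate(0, [1, 2, 3], lookup, n=1, m=0)  # (1, 0) metric: lowest power wins
greedy_arbitrate(0, [1, 2, 3], lookup, n=0, m=1)  # (0, 1) metric: lowest delay wins
```

Changing (n, m) swaps the arbitration policy without changing the selection loop, which is how the eleven schemes in the experiments share one mechanism.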

26

Experimental Setup

Utilized the embedded power modeling framework.
Implemented eleven different arbitration schemes inside the EBIU:
- FCFS, Fixed Priority
- Minimum power (P1D0) or (1, 0), minimum delay (P0D1) or (0, 1), and (1, 1), (1, 2), (2, 1), (1, 3), (3, 1), (3, 2), (2, 3)

Ported 10 multimedia application benchmarks to the Blackfin architecture and simulated them, including MPEG-2, H.264, JPEG, PGP, and G.721.

27

Power Improvement

Power-aware arbitration schemes have lower power consumption than Fixed Priority and FCFS.
The difference across power-aware arbitration strategies is small.
The pipelined command model saves 6-7% power over the sequential command model for the MPEG-2 encoder and decoder. The results are consistent across all other benchmarks.

[Bar charts: MPEG-2 encoder and decoder external bus power — average bus power (mW) vs. arbitration algorithm (FP, FCFS, (0,1), (1,0), (1,1), (1,2), (2,1), (1,3), (2,3), (3,2), (3,1)) — for sequential and pipelined command modes.]

28

Speed Improvement

Power-aware schemes have smaller bus delay than traditional Fixed Priority and FCFS.
The difference across power-aware arbitration strategies is small.
The pipelined command model achieves a 3-9% speedup over the sequential command model for the MPEG-2 encoder and decoder. The results are consistent across all other benchmarks.

[Bar charts: MPEG-2 encoder and decoder external bus delay — average delay (SCLK cycles) vs. arbitration algorithm (FP, FCFS, (0,1), (1,0), (1,1), (1,2), (2,1), (1,3), (2,3), (3,2), (3,1)) — for sequential and pipelined command modes.]

29

Comparison with Exhaustive Algorithm

The greedy algorithm can fail in certain cases.
Complexity is O(n) vs. O(n!).
The performance difference is negligible.

[Graph: requests R0-R3 with edge weights (5, 7, 15, 17, 18, 20); on a newly arrived request, greedy search takes the locally cheapest edge from R0 while exhaustive search finds a cheaper total path.]

30

Comments on Experimental Results

Power-aware arbitrators significantly reduce the external bus power across the 8 benchmarks, with 14% power savings on average.

Power-aware arbitrators also reduce bus access delay, by 21% on average across the 8 benchmarks.

The pipelined SDRAM model has a large performance advantage over the sequential SDRAM model, achieving 6% power savings and a 12% speedup.

Power and delay on the external bus are highly correlated: minimum power also achieves minimum delay.

Minimum-power schemes lead to simpler design options. Scheme (1, 0) is preferred due to its simplicity.

31

Design of A Power Estimation Unit (PEU)

[Block diagram: the Power Estimation Unit (PEU) splits the next request address into bank, row, and column fields. The bank address is compared against the last bank address; a mismatch outputs bank miss power. The row address is compared against the per-bank open row address registers (banks 0-3); a mismatch outputs page miss penalty power and updates the last column address register. The Hamming distance between the column address and the last column address gives the column address data power. The sum is the estimated power.]
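The column-address term in the PEU reduces to a Hamming-distance count. A sketch, with an assumed per-bit toggle energy:

```python
def addr_switch_energy(prev_addr, next_addr, e_bit=0.1):
    """Switching cost of driving next_addr after prev_addr on the bus:
    one e_bit unit (illustrative) per toggled address line (Hamming distance)."""
    return e_bit * bin(prev_addr ^ next_addr).count("1")

addr_switch_energy(0b1010, 0b0110)  # two address lines toggle
```

In hardware this is an XOR of the two address registers followed by a population count, which is why the PEU can compute it in a cycle or two.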

32

Two Arbitrator Implementation Structures

[Block diagrams: both structures hold pending requests (t, s, b, l) in a request queue buffer and consult memory/bus state information; Power Estimation Units score the entries, a comparator selects the minimum-power request, the state is updated, and an access command generator drives the external bus. The shared PEU structure time-multiplexes one PEU across the queue entries; the dedicated PEU structure provides one PEU per entry.]

33

Performance of two structures

Higher PEU delay lowers external bus performance for both the MPEG-2 encoder and decoder.

When the PEU delay is 5 cycles or more, the dedicated structure is preferred over the shared structure; otherwise the shared structure is sufficient.

[Line charts: MPEG-2 encoder and decoder with the (1,0) arbitrator — average delay (cycles) vs. estimator logic delay (0-10 cycles) — comparing the shared and dedicated estimator unit implementations.]

34

Summary of Bus Arbitration Schemes

Efficient bus arbitration can provide both power and performance benefits over traditional arbitration schemes.

Minimum power and minimum delay are highly correlated on the external bus.

The pipelined SDRAM model has a significant advantage over the sequential SDRAM model.

Arbitration scheme (1, 0) is recommended: the minimum-power approach provides more design options and leads to simpler design implementations. The trade-off between design complexity and performance was presented.

35

Outline

Research Motivation and Introduction

Related Work

Power Estimation Framework

Optimization I – Power-Aware Bus Arbitration

Optimization II – Memory Page Remapping

Summary

36

Data Access Pattern in Multimedia Apps

[Plots: address vs. time for three common access patterns — fixed stride, 2-way stream, and 2-D stride.]

3 common data access patterns in multimedia applications
- Majority of cycles spent in loop bodies and array accesses
- High data access bandwidth
- Poor locality, cross-page references

37

Previous Work on Access Patterns

Previous work was performance driven and took OS/compiler-related approaches
- Data prefetching [Chen94] [Zhang00]
- Memory customization [Adve00] [Grun01]
- Data layout optimization [Catthoor98] [DeLaLuz04]

Shortcomings of OS/compiler-based strategies:
- Multimedia benchmarks' dominant activities are within large monolithic data buffers.
- Buffers generally contain many memory pages and cannot be further optimized.
- Constrained by OS and compiler capabilities; poor flexibility.

38

Optimization II - Page Remapping

A technique currently used for large-memory-space peripheral memory access.

External memories in embedded multimedia systems
- High bus access overhead
- Page miss penalty

Efficient page remapping can
- Reduce page misses
- Improve external bus throughput
- Reduce power/energy consumption

39

Page Remapping Target Region

[Block diagram: the same SOC architecture, with the page remapping optimization targeting the path from the internal bus through the EBIU and the external bus to SDRAM.]

40

SDRAM Memory Pages

[Diagram: banks 0 through M-1, each holding pages 0 through N-1, with accessed pages marked.]

High memory access latency
- Minimum latency of one SCLK cycle
- Page miss penalty
- Additional latency due to refresh cycles
- No guaranteed access due to arbitration logic
- Non-sequential reads/writes suffer

41

SDRAM Page Miss Penalty

[Timing diagrams: command and data buses over system clock cycles (SCLK), with PRECHARGE (P), ACTIVATE (A), READ (R), NOP (N), and DATA (D). A page hit issues ACTIVATE and then READs after tRCD and tCAS; a page miss first requires PRECHARGE (tRP) before ACTIVATE, delaying the data burst.]

42

SDRAM Timing Parameters

Access type — number of cycles:
- Read cycle: trp + n·tcas
- Write cycle: twp
- Page miss: trp + trcd
- Refresh cycle: 2·trcd·nrows

SDRAM parameter — SCLK cycles:
- trcd: 1-15
- trp: 1-7
- trc = tras + trp: 1-15
- tcas: 2-3

twp = write to precharge; trp = read to precharge; tras = activate to precharge; tcas = read latency

~8-10 SCLK penalty associated with a page miss
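The table translates directly into a small cycle estimator. The formulas follow the slide; the parameter values below are example choices within the stated ranges, not from a specific device.

```python
def read_cycles(n, t_rp, t_cas, page_miss, t_rcd):
    """Cycles for an n-word read: the read-cycle cost (trp + n*tcas) plus,
    on a page miss, the extra precharge + activate penalty (trp + trcd)."""
    cycles = t_rp + n * t_cas       # read cycle: trp + n*tcas
    if page_miss:
        cycles += t_rp + t_rcd      # page miss: trp + trcd
    return cycles

# Example parameters within the slide's ranges: trp=3, trcd=5, tcas=2.
read_cycles(4, t_rp=3, t_cas=2, page_miss=False, t_rcd=5)  # 4-word burst, page hit
read_cycles(4, t_rp=3, t_cas=2, page_miss=True, t_rcd=5)   # same burst on a page miss
```

With these values the miss adds 8 extra cycles, consistent with the ~8-10 SCLK penalty quoted above.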

43

SDRAM Page Access Sequence (I)

[Timing diagram: 12 reads across 4 banks (pages 0-3 per bank). With a poor data layout, each bank switch requires a full PRECHARGE-ACTIVATE-READ (P, A, R) sequence before its reads can issue.]

A typical access pattern of 2-D stride / 2-way stream. Poor data layout causes significant access overhead.

44

SDRAM Page Access Sequence (II)

[Timing diagram: the same 12 reads across 4 banks. With a distributed data layout, PRECHARGE/ACTIVATE commands for different banks overlap earlier bursts, so the reads proceed with far fewer stalls.]

Less access overhead with a distributed data layout.

45

Why We Use Page Remapping

[Diagram: accesses to page 2 that originally conflict within the same banks are remapped across banks 0-3; the page remapping entry of page 2 is {2, 0, 1, 3}.]

46

Module in an SOC System

An address translation unit that only translates the bank address.
- A non-MMU system inserts a page remapping module before the EBIU.
- An MMU system can take advantage of the existing address translation unit; no extra hardware is needed.

[Block diagram: the page remapping module sits on the internal bus in front of the External Bus Interface Unit (EBIU), which drives the external bus to SDRAM, FLASH memory, and asynchronous devices.]

47

Sequence (I) after Remapping

[Timing diagram: the 12 reads of sequence (I), after remapping, interleave across the 4 banks like sequence (II).]

Same performance as sequence (II). Applicable to monolithic data buffers (e.g., frame buffers).

48

Page Remapping Algorithm

An NP-complete problem, reducible to a graph coloring problem on a page transition graph G(V, E).

Vertex = page Im,n
- m – page bank number
- n – page row number

Edge = transition from page Im,n to page Ip,q
- Weighted edges capture page traversals during program execution
- The edge weight is the number of transitions from page Im,n to page Ip,q

Color = bank
- Each bank has one distinct color.
- Every page is assigned one color.

49

Page Remapping Algorithm (continued)

Page remapping algorithm
- From the page transition graph, find the color (bank) assignment for each page such that the transition cost between same-color pages is minimized.

Algorithm steps:
1. Sort the edges by transition weight.
2. Process edges in decreasing weight order.
3. Color the pages associated with each edge.
4. A weight parameter array for each page represents the cost of mapping that page into each bank, e.g. {500, 200, 0, 0}.
5. There are 5 different situations when processing each edge.

The page remapping table (PMT) is generated as a result of the mapping.
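The steps above can be sketched as a compact greedy coloring loop. This follows the sorted-edge processing and per-page bank-cost arrays described above, but simplifies away the tie-breaking and bank-slot constraints of the full algorithm (so its assignments need not match the worked example exactly):

```python
def remap_pages(edges, n_banks):
    """Greedy page-to-bank assignment over a page transition graph.
    edges: (weight, page_a, page_b) tuples; returns {page: bank}."""
    cost = {}   # page -> cost of mapping it into each bank
    bank = {}   # page -> assigned bank
    for w, a, b in sorted(edges, reverse=True):   # decreasing weight order
        for p in (a, b):
            cost.setdefault(p, [0] * n_banks)
        if a not in bank:                         # place a in its cheapest bank
            bank[a] = min(range(n_banks), key=cost[a].__getitem__)
        cost[b][bank[a]] += w                     # sharing a's bank now costs w
        if b not in bank:
            bank[b] = min(range(n_banks), key=cost[b].__getitem__)
        cost[a][bank[b]] += w
    return bank

remap_pages([(500, "A", "B"), (200, "A", "C")], n_banks=2)
```

Heavily linked pages end up in different banks, so their back-to-back accesses avoid same-bank page misses.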

50

Example Case

[Figures: the original page allocation places pages I0,0, I0,1, I1,1, I1,2, I1,3, I2,1, and I3,1 across banks 0-3 (pages 0-3 per bank). The page transition graph connects these pages with edge weights 500, 200, 100, 80, 60, 50, 40, and 30.]

51

Initial Step

[Diagram: banks 0-3, pages 0-3. No page is mapped; all slots are available.]

52

Step (1) – two unmapped pages

Selected edge: I0,0 – I0,1 (weight 500)

Weight parameter updates:
I0,0[0]: { 0, 500, 0, 0}
I0,1[1]: { 500, 0, 0, 0}

Actions: allocate unmapped pages I0,0 and I0,1.

53

Step (2) – two unmapped pages

Selected edge: I1,1 – I1,2 (weight 200)

Weight parameter updates:
I1,1[0]: { 0, 200, 0, 0}
I1,2[1]: { 200, 0, 0, 0}

Actions: allocate unmapped pages I1,1 and I1,2.

54

Step (3) – one unmapped page

Selected edge: I0,0 – I3,1 (weight 100)

Weight parameter updates:
I3,1[2]: { 100, 0, 0, 0}
I0,0[0]: { 0, 500, 100, 0}

Actions: map page I3,1; no change for I0,0.

55

Step (4) – one unmapped page

Selected edge: I1,2 – I2,1 (weight 80)

Weight parameter updates:
I2,1[3]: { 0, 80, 0, 0}
I1,2[1]: { 200, 0, 0, 80}

Actions: map page I2,1; no change for I1,2.

56

Step (5) – one unmapped page

Selected edge: I3,1 – I1,3 (weight 60)

Weight parameter updates:
I1,3[0]: { 0, 0, 60, 0}
I3,1[2]: { 160, 0, 0, 0}

Actions: map page I1,3; no change for I3,1.

57

Step (6) – same-row pages

Selected edge: I1,1 – I3,1 (weight 50)

Actions: both I1,1 and I3,1 are on the same row; no actions.

58

Step (7) – two mapped pages

Selected edge: I2,1 – I1,3 (weight 40)

Weight parameter updates:
I1,3[0]: { 0, 0, 60, 40}
I2,1[3]: { 40, 80, 0, 0}

Actions: both I2,1 and I1,3 are already mapped; no conflicts.

59

Step (8) – conflict resolving

Selected edge: I0,0 – I1,1 (weight 30)

Actions: both I0,0 and I1,1 are mapped and in the same bank.

Current weight parameters:
I1,1[0]: { 30, 200, 0, 0}
I2,1[3]: { 40, 80, 0, 0}
I3,1[2]: { 160, 0, 0, 0}
I0,1[1]: { 500, 0, 0, 0}

Updated weight parameters:
I0,0[0]: { 0, 500, 100, 30}

No conflict.

[Diagram: bank/page allocation of the mapped pages before and after resolving the edge.]

60

Generated PMT table

[Diagram: an external memory address from the I-cache or D-cache indexes the 4 kB page remapping table with its 14-bit memory page address; the table supplies the 2-bit bank address, which the EBIU combines with the 22-bit row/column address to access the 16 MB external SDRAM. Entries for the mapped example pages (I0,0, I0,1, I1,1, I1,2, I1,3, I2,1, I3,1) hold their assigned bank codes (00, 01, 10, 11); unmapped entries remain xx.]
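Address translation through the PMT touches only the bank field. A sketch, assuming (as an illustrative layout, since the slide does not fix the bit positions) that the 2 bank bits sit at the top of the 24-bit (16 MB) address and the 14-bit page index is the high address bits:

```python
ADDR_BITS, PAGE_BITS, BANK_BITS = 24, 14, 2  # 16 MB SDRAM, field widths from the slide

def remap_address(addr, pmt):
    """Replace only the bank bits of addr using the page remapping table.
    pmt maps a 14-bit page index to a 2-bit bank; unmapped pages ('xx'
    entries) keep their original bank. The bit layout is an assumption."""
    page = addr >> (ADDR_BITS - PAGE_BITS)                # 14-bit page index
    orig_bank = addr >> (ADDR_BITS - BANK_BITS)           # top 2 bits
    new_bank = pmt.get(page, orig_bank)
    offset = addr & ((1 << (ADDR_BITS - BANK_BITS)) - 1)  # row/column bits kept
    return (new_bank << (ADDR_BITS - BANK_BITS)) | offset

remap_address(0x000123, {0x000123 >> 10: 3})  # move this page's bank from 0 to 3
```

Because only the bank field changes, the translation is a single table lookup plus bit splice, which is why an existing MMU can absorb it with no extra hardware.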

61

Experimental Setup

Utilized the embedded power modeling framework
Extended the address translation unit for page remapping
Page coloring program to generate the PMT
Same 10 multimedia application benchmarks:
- MPEG-2 encoder and decoder
- H.264 encoder and decoder
- JPEG encoder and decoder
- PGP encoder and decoder
- G.721 encoder and decoder

62

Page Miss Reduction

[Bar chart: page misses per 100 requests for each benchmark (MPEG2, H264, JPEG, PGP, G721 encoders and decoders), comparing 2-, 4-, and 8-bank configurations, original vs. remapped.]

63

External Bus Power

[Bar chart: external bus power (mW) for each benchmark, comparing 2-, 4-, and 8-bank configurations, original vs. remapped.]

64

Average Access Delay

[Bar chart: average request delay (cycles) for each benchmark, comparing 2-, 4-, and 8-bank configurations, original vs. remapped.]

65

Comments on Page Remapping

The page remapping algorithm is presented by example.

Our algorithm significantly reduces the memory page miss rate, by 70-80% on average.

For a 4-bank SDRAM memory system, we reduced external memory access time by 12.6%.

The proposed algorithm reduces power consumption in the majority of the benchmarks, with an average power reduction of 13.2%.

Combining the power and delay effects, our algorithm significantly benefits the total energy cost.

A stability study was done in the dissertation: a PMT generated from one test vector input performs well on different inputs.

66

Outline

Research Motivation and Introduction

Related Work

Power Estimation Framework

Optimization I – Power-Aware Bus Arbitration

Optimization II – Memory Page Remapping

Summary

67

Summary

Reviewed the issues of external bus power in a system-on-a-chip (SOC) embedded system.

Built an external bus power estimation framework and experimental methodology. [PACS'04]

Proposed a series of power-aware bus arbitration schemes and showed their performance improvement over traditional schemes. [HiPEAC'05; also appeared in LNCS Transactions on High-Performance Embedded Architectures and Compilers]

Proposed a page remapping algorithm to reduce page misses, with its power and delay improvements. [LCTES'07]

68

Future Work

Integrate the power estimation framework into a complete tool chain.

Extend the arbitration schemes to multiple memory interfaces and other peripheral interfaces.

Compare the performance of page remapping with corresponding OS/compiler schemes.

69

Thank You !
