Distributed L0 Buffer Architecture and Exploration for Low Energy Embedded Systems

Feb 03, 2016

Page 1: Distributed L0 Buffer Architecture and Exploration for Low Energy Embedded Systems

Murali Jayapala, Francisco Barat, Pieter Op de Beeck, Tom Vander Aa, Geert Deconinck
ESAT/ACCA, K.U.Leuven, Belgium

Francky Catthoor, Henk Corporaal
IMEC, Leuven, Belgium

Page 2: Overview

• Context: Introduction to the problem

• Motivation for L0 Buffer organization and status

• Distributed L0 Buffer organization

• Instruction Memory Exploration Software and Compiler Transformation

• Conclusions

Page 3: Context

Low Power Embedded Systems
• Battery operated (low energy): 10-50 MOPS/mW
• Small
• Low cost
• Flexible

Multimedia Applications
• Video, audio, wireless
• High performance: 10-100 GOPS
• Real-time constraints

Low Energy Embedded Systems

Page 4: Context

Embedded systems: programmable processor based

Embedded processors
• Power breakdown
– 43% of power in on-chip memory (StrongARM SA110: “A 160MHz 32b 0.5W CMOS ARM processor”)
– 40% of power in internal memory (C6x, Texas Instruments Inc.)
– 25-30% of power in instruction memory

To address the data memory issues:
• Data Transfer and Storage Methodology (DTSE), F. Catthoor et al.

Page 5: Related Work

Significant power consumption in the instruction memory hierarchy

[Diagram: Core, L1 cache (on-chip), Main Memory (off-chip)]

Compression (code size reduction)
- L. Benini et al., “Selective Instruction Compression for Memory Energy Reduction...”, ISLPED 1999
- P. Centoducatte et al., “Compressed Code Execution on DSP Architectures”, ISSS 1999
- T. Ishihara et al., “A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors”, DATE 2000

Software Transformations
- N. D. Zervas et al., “A Code Transformation-Based Methodology for Improving I-Cache Performance of DSP Applications”, ICECS 2001
- S. Parameswaran et al., “I-CoPES: Fast Instruction Code Placement for Embedded Systems to Improve Performance and Energy Efficiency”, ICCAD 2001

Page 6: Overview

• Context: Introduction to the problem

• Motivation for L0 Buffer organization and status

• Distributed L0 Buffer organization

• Instruction Memory Exploration Software and Compiler Transformation

• Conclusions

Page 7: Application Domain: Multimedia Characteristics (1)

High locality in the instruction count: ICstatic < 1% of ICdynamic

[Figure: static vs. dynamic instruction count (IC static, IC dynamic) as percentages; axis ticks 0% to 100% and 0% to 2%]

Page 8: Application Domain: Multimedia Characteristics (2)

[Figure: normalized dynamic instruction count vs. normalized static instruction count]

Within a program, a few basic blocks or instructions take up most of the execution time (ICdynamic)
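
The locality claim can be checked against an execution profile. The following is a minimal sketch, assuming a hypothetical per-basic-block profile that records the static size (in instructions) and the execution count of each block; the block names and numbers are invented for illustration and are not taken from the presentation.

```python
from typing import Dict, List, Tuple


def coverage_curve(static_size: Dict[str, int],
                   exec_count: Dict[str, int]) -> List[Tuple[float, float]]:
    """Return (static fraction, dynamic fraction) points, hottest blocks first."""
    total_static = sum(static_size.values())
    total_dynamic = sum(static_size[b] * exec_count[b] for b in static_size)
    # Visit the most frequently executed basic blocks first.
    blocks = sorted(static_size, key=lambda b: exec_count[b], reverse=True)
    points = []
    static_so_far = dynamic_so_far = 0
    for b in blocks:
        static_so_far += static_size[b]
        dynamic_so_far += static_size[b] * exec_count[b]
        points.append((static_so_far / total_static,
                       dynamic_so_far / total_dynamic))
    return points


# Hypothetical profile: one small loop body dominates the execution.
profile_sizes = {"init": 40, "loop_body": 10, "cleanup": 50}
profile_counts = {"init": 1, "loop_body": 10_000, "cleanup": 1}
curve = coverage_curve(profile_sizes, profile_counts)
print(curve[0])  # first point: the share covered by the hottest block alone
```

In this made-up profile the hottest block holds 10% of the static instructions but accounts for more than 99% of the dynamic instruction count, which is the shape of curve the slide describes.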

Page 9: Motivation for additional small memory

Application domain: high locality in a few basic blocks

A small memory, in addition to the conventional L1 cache, should be used to reduce energy without compromising performance

Size (of the basic blocks with high locality) is still large. If the L1 cache (on-chip) is made small:
• performance degrades
– capacity (compulsory) misses
• system power increases
– off-chip memory / bus activity increases

[Diagram: Core, L1 cache (on-chip), Main Memory (off-chip)]

Page 10: Related Work (Microarchitecture): Cache Design

N. Jouppi, “Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers”, ISCA 1990
• Aim: to reduce miss penalty cycles
• Miss caching, victim caching, stream buffers

[Diagram: Core, L1 cache (on-chip), cache, Main Memory (off-chip)]

Page 11: Related Work (Microarchitecture): Cache Design

J. D. Bunda, “Instruction-Processing Optimization Techniques for VLSI Microprocessors”, PhD thesis, 1993
• Aim: to reduce instruction cache energy
• L0 buffer: cache block buffer (1 cache block + 1 tag)
• Limitations: block thrashing

J. Kin et al., “Filtering memory references to increase energy efficiency”, IEEE Trans. on Computers, 2000
• Aim: to reduce instruction cache energy
• L0 buffer: filter cache
– Small regular cache (< 1KB)
– L0 access (hit) latency: 1 cycle
– L1 access (hit) latency: 2 cycles
• Limitations:
– Energy reduced at the expense of performance
– 256-byte filter cache: 58% power reduction with 21% performance degradation

[Diagram: Core, L0 Buffer, L1 cache (on-chip), Main Memory (off-chip)]

Page 12: Related Work (Architecture): Software controlled L0 buffers

R. S. Bajwa et al., “Instruction Buffering to Reduce Power in Processors for Signal Processing”, IEEE Trans. VLSI Systems, vol. 5, no. 4, 1997

L. H. Lee et al. (M-CORE), “Instruction Fetch Energy Reduction Using Loop Caches for Applications with Small and Tight Loops”, ISLPED 1999

- L0 Buffer: Buffer (< 1KB) + Local Controller (LC); [no tags]
- L0 / L1 access latency: 1 cycle
- Used only for specific program segments (innermost loops)
- Software control: special instructions (lbon, sbb) to map program segments to the L0 buffer

[Diagram: Core, L0 Buffer with LC, L1 cache (on-chip), Main Memory (off-chip)]

[Diagram: three Datapath / L1 / L0 panels; labels: Normal Operation, Filling, L0 Buffer Operation, Initiation, Execution, Termination]

Page 13: Related Work (Architecture): Software controlled L0 buffers

• Assumed Architecture
– MIPS 4000 ISA
– Single issue processor
– L1 cache: 16KB, direct mapped
– Loop buffer (2KB): depth = 128 instructions, width = 16 bytes

• Tools
– Simplescalar 2.0
– Wattch power estimator

• Loops with fewer than 128 instructions were hand-mapped onto the loop buffer

[Figure: normalized energy consumption (0 to 100) for cav_det, cjpeg, djpeg, epic, g721, gsm, mpeg2d, pegwit, unepic]

Page 14: Related Work (Architecture): Software controlled L0 buffers

• Advantages
– 50% (avg) energy reduction, with no performance degradation
– Software control: makes it possible to map only selected program segments

• Limitations
– Supports only innermost loops (regular basic blocks)
– Other frequently executed basic blocks are still fetched from the L1 cache
– No support for control constructs within loops

F. Vahid et al. [2001-2002]: hardware support for conditional constructs within loops
– Identifying the loop address bounds (preloading the program segment/loop)
– Sub-routines, conditional constructs, 1-level nested loops

Page 15: Related Work (Architecture): Compiler controlled L0 buffers

N. Bellas et al., “Architectural and Compiler Support for Energy Reduction in Memory Hierarchy of High Performance Microprocessors”, ISLPED 1998

• Aim: reduce instruction cache energy by letting the compiler take on the role of allocating basic blocks to the L0 buffer
• L0 Buffer: regular cache (< 1KB; 128 instr)
• Technique:
– profile
– function inlining
– identify basic blocks
– code layout

[Diagram: Core, L0 Buffer, L1 cache (on-chip), Main Memory (off-chip); code layout with the basic blocks allocated to the L0 buffer placed in the L0 Buffer address space]

Advantages
- Automated: a ‘tool’ can do this job
- Use of basic block as atomic unit of allocation
- 60% (avg) energy reduction in i-mem hierarchy [SPEC95]

Limitations
- Tag overhead

Page 16: Loop Buffers: Commercial Processors

• RISC DSP Processors
– SH-DSP: decoded instruction buffers; supports regular loops (no conditional constructs/nested loops)

• VLIW Processors
– StarCore SC140: supports regular and nested loops; conditional constructs through predication
– STMicroelectronics ST120: supports nested loops and loops with conditional constructs

Page 17: Overview

• Context: Introduction to the problem

• Motivation for L0 Buffer organization and status

• Distributed L0 Buffer organization

• Instruction Memory Exploration Software and Compiler Transformation

• Conclusions

Page 18: Shortcomings

• So far...
– Hardware, software and compiler optimizations to increase accesses/activity at the L0 buffers

[Diagram: Core, L0 Buffer, L1 cache (on-chip), Main Memory (off-chip); increased accesses (activity) at the L0 Buffer]

• Bottleneck to solve
– L0 Buffer organization
– Interconnect: from L0 Buffer to Datapath
– Efficient buffer controller

• Organization scalable with increase in #FUs

[Diagram: Centralized Organization: one L0 Buffer with LC feeding FU FU FU FU]

Page 19: Current Organizations for L0 Buffers

Uncompressed L0 Buffer
• Buffer: width follows the issue width (#FUs)
• Interconnect: long
• LC: simple addressing (counter based)
Ref: Bajwa et al., L. H. Lee et al., F. Vahid et al.

[Diagram: one L0 Buffer feeding FU FU FU FU]

Compressed L0 Buffer
• Buffer:
– High storage density (no NOPs)
– Width follows the issue width (#FUs)
– Overhead in decompressing
• Interconnect: still centralized, long lines
• LC: simple addressing (counter based)
Ref: TI (execute packet fetch mechanism)

[Diagram: one L0 Buffer feeding FU FU FU FU through a Decompressor/Dispatch stage]

Page 20: Current Organizations for L0 Buffers (continued)

Sub-banked/Partitioned L0 Buffer with Compression
• Buffer: smaller memories, overhead in re-organizer
• Interconnect: still centralized
• LC: complex addressing (needs expensive tags)
• No correlation between partitioning and FUs
Ref: T. Conte et al. [TINKER]

[Diagram: Bank 1 to Bank 4 with one LC, feeding FU FU FU FU through a Re-organizer]

Partitioned L0 Buffer
• Buffer: smaller memories
• Interconnect: still long
• LC:
– Simple addressing (counter based)
– Need to access all the banks simultaneously, even if some of the FUs are not active
Ref: Sub-banking

[Diagram: partitions par 1 to par 4 with one LC, feeding FU FU FU FU]

Page 21: Solution: Distributed Instruction Buffer Organization

A balance of energy consumption between buffers, interconnect and local controllers is needed

Buffer Control
• Stores instructions in each partition
• Fetches instructions during loop execution
• Regulates the accesses to each partition

Buffers
• Sub-banked/partitioned in correlation with FU activation

Interconnect
• Localized (limited connectivity between FUs and buffers)

ATC: Address Translation and Control
IROC: Instruction Registers Operation and Control

[Diagram: IROC, Distributor/Dispatch, and instruction clusters, each with Buffers, an ATC and FUs]

Page 22: Distributed L0 Buffer Operation

• Similar to conventional L0 buffer operation
• Initiation: special instruction LBON <offset>
• Filling: pre-fetching instructions from <start> to <end>
• Termination: when the program flow jumps to an address outside the <start> to <end> range

[Diagram: three Datapath / L1 / Distributed L0 panels; labels: Normal Operation, Filling, L0 Buffer Operation, Initiation, Execution, Termination]
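
The initiation, filling and termination steps can be sketched as a small controller model. This is a rough illustration under stated assumptions, not the hardware described in the presentation: the class `LoopBufferController`, the instruction strings, and the reading of `<offset>` as "number of loop instructions following LBON" are all hypothetical, and filling is modelled as a single pre-fetch of the whole range.

```python
class LoopBufferController:
    """Toy model: fetch from L0 while the PC stays inside [start, end]."""

    def __init__(self, l1):
        self.l1 = l1          # L1 contents, indexed by instruction address
        self.l0 = {}          # L0 buffer contents, address -> instruction
        self.active = False
        self.start = self.end = 0

    def lbon(self, pc, offset):
        # Initiation (assumed encoding): the loop occupies the `offset`
        # instructions that follow the LBON at address `pc`.
        self.start, self.end = pc + 1, pc + offset
        # Filling: pre-fetch every instruction from <start> to <end>.
        self.l0 = {a: self.l1[a] for a in range(self.start, self.end + 1)}
        self.active = True

    def fetch(self, pc):
        # Termination: the flow left the <start> to <end> range.
        if self.active and not (self.start <= pc <= self.end):
            self.active = False
        if self.active:
            return self.l0[pc], "L0"
        return self.l1[pc], "L1"


program = ["op0", "LBON 3", "op2", "op3", "op4", "op5"]
ctrl = LoopBufferController(program)
trace = [0, 1, 2, 3, 4, 2, 3, 4, 5]   # two loop iterations, then exit
sources = []
for pc in trace:
    instr, src = ctrl.fetch(pc)
    sources.append(src)
    if instr.startswith("LBON"):
        ctrl.lbon(pc, int(instr.split()[1]))
print(sources)
```

In this trace both loop iterations are served from the L0 buffer, and the first fetch outside the range switches the controller back to normal L1 operation.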

Page 23: The Buffer Operation: An Illustration

Source loop:

for (..)
{ …
  if (..) {.….}
  else {.….}
  … }

Scheduled code (one instruction word per row):

     LBON <offset>
S:   OP11  OP21  OP31  NOP
     NOP   OP22  OP32  BNZ ‘x’
     OP12  NOP   NOP   BR ‘y’       (if block)
X:   OP13  NOP   OP33  NOP          (else block)
Y:   OP14  OP23  NOP   BNZ ‘s’

Page 24: The Buffer Operation: An Illustration

The same loop and schedule as on Page 23, now with the control state.

IROC registers: START_ADDR, END_ADDR, IR_USE, NEW_PC, PC

Per-partition buffer contents and table entries (one index / flag pair per instruction word; “-” with flag 0 marks a NOP slot):

Partition | Buffer contents           | Word 1 | Word 2 | Word 3 | Word 4 | Word 5
FU1       | OP11, OP12, OP13, OP14    | 0 / 1  | - / 0  | 1 / 1  | 2 / 1  | 3 / 1
FU2       | OP21, OP22, OP23          | 0 / 1  | 1 / 1  | - / 0  | - / 0  | 2 / 1
FU3       | OP31, OP32, OP33          | 0 / 1  | 1 / 1  | - / 0  | 2 / 1  | - / 0
BR        | BNZ ‘x’, BR ‘y’, BNZ ‘s’  | - / 0  | 0 / 1  | 1 / 1  | - / 0  | 2 / 1
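
One way to read the per-unit tables on this slide is that every instruction word carries, for each partition, a local buffer index and a use flag, so that a partition is only accessed when its unit has a real operation. The sketch below rebuilds those tables from the five-word schedule of the illustration. The function names, the `(index, flag)` tuple layout and the use of `None` for the “-” entries are assumptions made for this example.

```python
NOP = "NOP"

# The five instruction words of the illustrated loop (columns: FU1, FU2, FU3, BR).
schedule = [
    ["OP11", "OP21", "OP31", NOP],
    [NOP, "OP22", "OP32", "BNZ x"],
    ["OP12", NOP, NOP, "BR y"],
    ["OP13", NOP, "OP33", NOP],
    ["OP14", "OP23", NOP, "BNZ s"],
]
units = ["FU1", "FU2", "FU3", "BR"]


def build_partition(column):
    """Pack the non-NOP operations densely and record (index, flag) per word."""
    buffer, table = [], []
    for op in column:
        if op == NOP:
            table.append((None, 0))          # no access for this word
        else:
            table.append((len(buffer), 1))   # next free slot in the partition
            buffer.append(op)
    return buffer, table


partitions = {
    unit: build_partition([word[k] for word in schedule])
    for k, unit in enumerate(units)
}


def fetch(word_index):
    """Return the operations issued for one word, reading only flagged partitions."""
    issued = {}
    for unit, (buffer, table) in partitions.items():
        index, flag = table[word_index]
        if flag:
            issued[unit] = buffer[index]
    return issued


# Partition accesses for one pass over all five words.
accesses = sum(flag for _, table in partitions.values() for _, flag in table)
print(fetch(2), accesses)
```

Under this reading, one pass over the five words touches 13 partition entries, compared with 20 slots if every word were read at full width.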

Page 30: Energy Trade-Offs

$$\text{Energy} = \sum_{i=1}^{\#\text{partitions}} E_{\text{buffer},i} + \sum_{i=1}^{\#\text{partitions}} E_{\text{LC},i} + \sum_{i=1}^{\#\text{partitions}} E_{\text{interconnect},i}$$

[Figure: normalized energy vs. #partitions (from 1 to #FUs); curves for E buffer, E interconnect and E LC, with the baseline at 1]

Page 31: Profile Based Clustering

Instruction Cluster: a group of functional units with a separate local controller and an instruction buffer partition

Inputs to Instruction Clustering
- Dynamic Trace (during loop execution): 0/1 matrix, with rows such as 1 1 1 0 0 … 1
- Static Trace (loops mapped to L0): 0/1 rows grouped per loop between begin and end markers
- Energy Models (Register File)

Formulation

$$\min \; \text{Energy}(clust, \text{Dynamic}_{profile}, \text{Static}_{profile})$$

$$\text{s.t.} \quad \sum_{i=1}^{max\_clusters} clust(i,j) = 1 \quad \forall j$$

where clust(i,j) = 1 if the jth FU is assigned to cluster ‘i’, and 0 otherwise

Output: Instruction Clusters
- FU grouping
- Width and depth of instruction buffers in each partition
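
To make the formulation concrete, here is a brute-force sketch that enumerates every assignment of FUs to clusters (each FU in exactly one cluster, which is the constraint above) and keeps the cheapest one. The energy model is a stand-in invented for this example: a cluster is charged on every cycle in which any of its FUs is active, at a cost that grows with the cluster width plus a fixed local controller cost. The constants `E_BUFFER_PER_FU` and `E_LC` and the trace are hypothetical; the presentation uses register-file energy models instead.

```python
E_BUFFER_PER_FU = 1.0   # assumed buffer access cost per FU of cluster width
E_LC = 0.5              # assumed local controller cost per activation


def set_partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # Put `first` into each existing group in turn...
        for k in range(len(smaller)):
            yield smaller[:k] + [[first] + smaller[k]] + smaller[k + 1:]
        # ...or give it a group of its own.
        yield [[first]] + smaller


def energy(clusters, trace):
    """Toy model: a cluster is accessed whenever any of its FUs is active."""
    total = 0.0
    for group in clusters:
        activations = sum(1 for row in trace if any(row[fu] for fu in group))
        total += activations * (len(group) * E_BUFFER_PER_FU + E_LC)
    return total


def best_clustering(trace, max_clusters):
    fus = list(range(len(trace[0])))
    candidates = (c for c in set_partitions(fus) if len(c) <= max_clusters)
    return min(candidates, key=lambda c: energy(c, trace))


# Hypothetical activation trace: FU0/FU1 busy every cycle, FU2/FU3 rarely.
trace = [[1, 1, 0, 0]] * 6 + [[1, 1, 1, 1]] * 2
best = best_clustering(trace, max_clusters=4)
print(best, energy(best, trace))
```

Exhaustive enumeration only works for a handful of FUs, since the number of set partitions grows very quickly; it is shown here because it keeps the objective and the constraint easy to see.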

Page 32: Results

[Figure: normalized energy (0 to 1.4) vs. #partitions (1 to 10) for adpcmd, djpeg, idct, mpeg2d]

$$\text{Energy} = \sum_{i=1}^{\#\text{partitions}} E_{\text{buffer},i} + \sum_{i=1}^{\#\text{partitions}} E_{\text{LC},i}$$

Assumptions
- Only the buffers and controllers are modeled (no interconnect as yet)
- #FUs in datapath = 10
- Fixed schedule (activation trace)
- Schedule generated using Trimaran 2.0

Page 33: In Comparison With Other Schemes

Results shown for ADPCM

[Figure: normalized energy (0 to 1), split into Buffers and Controller/Overhead, for the five schemes below]

- Uncompressed: centralized L0 buffer
- Compressed: centralized L0 buffer, 2 additional registers for VLDecoding
- Partitioned (sub-banked, no access regulation): 2 partitions
- Clustered (varying width only): 3 partitions
- Clustered (varying both width and depth): 2 partitions

Page 34: Fully Distributed Instruction Memory Hierarchy

[Diagram: Main Memory (off-chip), two L1 caches (on-chip), and four L0 Buffers serving groups of 3, 2, 3 and 4 FUs; labels: L0 Cluster, L1 Cluster]

Page 35: Overview

• Context: Introduction to the problem

• Motivation for L0 Buffer organization and status

• Distributed L0 Buffer organization

• Instruction Memory Exploration Software and Compiler Transformation

• Conclusions

Page 36: Exploration Methodology: What we have

Flow: Application, Software Transformations, Compiler (Scheduling), Clustering Tool (with Energy Models), Instruction Clusters

Pareto Curve Generation
- For choosing the operating point at run-time
- Enables the designer to assess the trade-off between energy and performance

[Figure: Energy vs. Delay curve; one end optimized for performance (maximum cluster activity), the other optimized for energy (minimal cluster activity)]
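
As an illustration of the Pareto curve step, the sketch below filters a set of (delay, energy) design points down to the non-dominated ones, where lower is better on both axes. The function name and the sample points are hypothetical; in the flow above the points would come from evaluating different schedules and clusterings.

```python
def pareto_front(points):
    """Keep the (delay, energy) points that no other point beats on both axes."""
    front = []
    best_energy = float("inf")
    # After sorting by delay, a point is on the front only if it lowers
    # the best energy seen so far.
    for delay, energy in sorted(points):
        if energy < best_energy:
            front.append((delay, energy))
            best_energy = energy
    return front


# Hypothetical design points: (delay in cycles, normalized energy).
design_points = [(10, 9.0), (12, 6.0), (11, 9.5), (15, 6.5), (20, 4.0)]
front = pareto_front(design_points)
print(front)
```

The fastest point on the front corresponds to the "optimized for performance" end of the curve and the lowest-energy point to the "optimized for energy" end.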

Page 37: Exploration Methodology: What we want to achieve…

Flow: Application, Software Transformations, Compiler (Scheduling & Clustering, with Energy Models), producing Instruction Clusters and Schedule

Pareto Curve Generation
- For choosing the operating point at run-time
- Enables the designer to assess the trade-off between energy and performance

[Figure: Energy vs. Delay curve; one end optimized for performance (maximum cluster activity), the other optimized for energy (minimal cluster activity)]

Page 38: Compiler Scheduling

Compiler scheduling can change the functional unit activity, and hence the clustering result, and hence energy and performance

Energy reduction without performance loss

OP11  OP12  -     OP13  -     OP14      All 3 clusters need to be active

OP11  OP12  OP13  OP14  -     -         Only 2 clusters need to be active

Energy reduction at the expense of performance loss

OP11  OP12  -     OP13  -     OP14
OP21  -     OP22  -     OP23  -         2 activations of all 3 clusters

OP11  OP12  -     -     -     -
OP21  -     -     -     -     -
-     -     OP22  OP13  OP23  OP14      2 activations for 1st, 1 activation for 2nd and 3rd cluster
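
The activation counts quoted on this slide can be reproduced with a few lines of code. This sketch assumes six issue slots grouped into three clusters of two adjacent FUs each, which is inferred from the counts on the slide rather than stated there; schedules are written as 0/1 activity rows (1 for an operation, 0 for an empty slot).

```python
# Assumed grouping: three clusters of two adjacent FUs.
CLUSTERS = [(0, 1), (2, 3), (4, 5)]


def cluster_activations(schedule, clusters):
    """Count, per cluster, the instruction words in which any of its FUs is busy."""
    return [
        sum(1 for word in schedule if any(word[fu] for fu in group))
        for group in clusters
    ]


# First pair: same length, operations packed into fewer clusters.
before = [[1, 1, 0, 1, 0, 1]]
after = [[1, 1, 1, 1, 0, 0]]

# Second pair: one extra word, but fewer cluster activations overall.
two_cycles = [[1, 1, 0, 1, 0, 1],
              [1, 0, 1, 0, 1, 0]]
three_cycles = [[1, 1, 0, 0, 0, 0],
                [1, 0, 0, 0, 0, 0],
                [0, 0, 1, 1, 1, 1]]

for name, sched in [("before", before), ("after", after),
                    ("two_cycles", two_cycles), ("three_cycles", three_cycles)]:
    print(name, len(sched), cluster_activations(sched, CLUSTERS))
```

The first pair keeps the schedule length at one word while dropping one cluster activation; the second pair trades one extra word for a drop from six cluster activations to four.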

Page 39: Software Transformations

High level code transformations can also change the clustering result and hence energy and performance

Loop Transformations
- Loop splitting
- Loop merging
- Loop peeling (for nested loops)
- Loop collapsing (nested loops)
- Code movement across loops
- ...etc.

[Diagram: Loop Splitting: one Loop becomes loop 1 and loop 2]
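
As a small source-level illustration of loop splitting, the sketch below shows one loop with two independent statements and the equivalent pair of loops. The function names and data are made up for this example. The idea, as far as it relates to the slide, is that each of the smaller loop bodies may keep fewer functional units (and so fewer clusters) busy than the combined body.

```python
def single_loop(a, b):
    """One loop whose body does two independent computations."""
    sums, products = [], []
    for x, y in zip(a, b):
        sums.append(x + y)
        products.append(x * y)
    return sums, products


def split_loops(a, b):
    """The same work after loop splitting: loop 1, then loop 2."""
    sums = []
    for x, y in zip(a, b):        # loop 1: additions only
        sums.append(x + y)
    products = []
    for x, y in zip(a, b):        # loop 2: multiplications only
        products.append(x * y)
    return sums, products


a, b = [1, 2, 3], [4, 5, 6]
print(single_loop(a, b))
print(split_loops(a, b))
```

Splitting is only legal when the two statements do not depend on each other across iterations, as is the case here.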

Page 40: Overview

• Context: Introduction to the problem

• Motivation for L0 Buffer organization and status

• Distributed L0 Buffer organization

• Instruction Memory Exploration Software and Compiler Transformation

• Conclusions

Page 41: Conclusions

• L0 Buffer Organization
– Multimedia applications have high locality in small program segments
– An additional small L0 buffer should be used
– Current options for L0 buffers are still not energy efficient
– A distributed L0 buffer organization should be sought
– But the clustering/partitioning should be application specific

• L1 Cache Organization
– Distributed (?)

• Instruction Memory Exploration
– Software transformations and compiler scheduling can change the clustering results
– An exploration methodology should be sought to analyze the trade-offs in energy and performance (Pareto curves)