Distributed L0 Buffer Architecture and Exploration for Low Energy Embedded Systems
Murali Jayapala, Francisco Barat, Pieter Op de Beeck, Tom Vander Aa, Geert Deconinck (ESAT/ACCA, K.U.Leuven, Belgium)
Francky Catthoor, Henk Corporaal (IMEC, Leuven, Belgium)
Overview
• Context: Introduction to the problem
• Motivation for L0 Buffer organization and status
• Distributed L0 Buffer organization
• Instruction Memory Exploration: Software and Compiler Transformations
• Conclusions
Context
• Low Power Embedded Systems
- Battery operated (low energy): 10-50 MOPS/mW
- Small, low cost, flexible
• Multimedia Applications
- Video, audio, wireless
- High performance: 10-100 GOPS, real-time constraints
• Together: Low Energy Embedded Systems
Context
Embedded systems: programmable processor based
• Power breakdown of embedded processors
- 43% of power in on-chip memory (StrongARM SA110: A 160MHz 32b 0.5W CMOS ARM processor)
- 40% of power in internal memory (C6x, Texas Instruments Inc.)
- 25-30% of power in instruction memory
• To address the data memory issues: Data Transfer and Storage Methodology (DTSE), F. Catthoor et al.
Related Work
Significant power consumption in the instruction memory hierarchy
[Figure: instruction memory hierarchy with Core, L1 cache (on-chip) and Main Memory (off-chip)]
Compression (code size reduction)
- L. Benini et al., "Selective Instruction Compression for Memory Energy Reduction...", ISLPED 1999
- P. Centoducatte et al., "Compressed Code Execution on DSP Architectures", ISSS 1999
- T. Ishihara et al., "A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors", DATE 2000
Software Transformations
- N. D. Zervas et al., "A Code Transformation-Based Methodology for Improving I-Cache Performance of DSP Applications", ICECS 2001
- S. Parameswaran et al., "I-CoPES: Fast Instruction Code Placement for Embedded Systems to Improve Performance and Energy Efficiency", ICCAD 2001
Related Work (Architecture): Software controlled L0 buffers
• Loops with less than 128 instructions were hand-mapped onto the loop buffer
[Figure: bar chart of normalized energy consumption (axis 0 to 100) for the benchmarks cav_det, cjpeg, djpeg, epic, g721, gsm, mpeg2d, pegwit and unepic]
Related Work (Architecture): Software controlled L0 buffers
• Advantages
- 50% (avg) energy reduction, with no performance degradation
- Software control: allows only selected program segments to be mapped
• Limitations
- Supports only innermost loops (regular basic blocks)
- Other frequently executed basic blocks are still fetched from the L1 cache
- No support for control constructs within loops
• F. Vahid et al. [2001-2002]: hardware support for conditional constructs within loops
- Identifying the loop address bounds (preloading the program segment/loop)
- Sub-routines
- Conditional constructs
- 1-level nested loops
Related Work (Architecture): Compiler controlled L0 buffers
N. Bellas et al., "Architectural and Compiler Support for Energy Reduction in Memory Hierarchy of High Performance Microprocessors", ISLPED 1998
• Aim: Reduce instruction cache energy by letting the compiler take over the allocation of basic blocks to the L0 buffer.
• L0 Buffer: Regular cache (< 1KB; 128 instr)
• Technique:
– profile
– function inlining
– identify basic blocks
– code layout
[Figure: memory hierarchy with Core, L0 Buffer, L1 cache (on-chip) and Main Memory (off-chip); the code layout places the basic blocks allocated to the L0 buffer in the L0 Buffer address space]
• Advantages
- Automated: a 'tool' can do this job
- Use of basic block as atomic unit of allocation
- 60% (avg) energy reduction in i-mem hierarchy [SPEC95]
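The allocation step listed under "Technique" can be illustrated with a small sketch. This is not the algorithm from Bellas et al.; it is a hypothetical greedy heuristic that assumes profile data is available as per-block execution counts and that the L0 buffer holds 128 instructions (the figure given on the slide). The block names, sizes and counts are invented for illustration.

```python
from dataclasses import dataclass

L0_CAPACITY = 128  # instructions; from the slide ("< 1KB; 128 instr")


@dataclass
class BasicBlock:
    name: str        # hypothetical label for the block
    size: int        # number of instructions in the block
    exec_count: int  # executions observed during profiling


def allocate_to_l0(blocks, capacity=L0_CAPACITY):
    """Greedy sketch: take the most frequently executed blocks that still fit."""
    chosen, used = [], 0
    for bb in sorted(blocks, key=lambda b: b.exec_count, reverse=True):
        if used + bb.size <= capacity:
            chosen.append(bb.name)
            used += bb.size
    return chosen, used


# Invented profile data, for illustration only.
profile = [
    BasicBlock("init", 40, 1),
    BasicBlock("dct_inner", 60, 50000),
    BasicBlock("quant", 50, 20000),
    BasicBlock("huff", 90, 8000),
    BasicBlock("error_path", 30, 2),
]

chosen, used = allocate_to_l0(profile)
print(chosen, used)
```

With this invented profile the two hottest blocks fill 110 of the 128 instruction slots, and the remaining blocks are left to be fetched from the L1 cache. A real compiler pass would also have to handle the code layout so that the chosen blocks land in the L0 buffer address space.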
Exploration Methodology: What we have
[Figure: tool flow with the boxes Application, Software Transformations, Compiler (Scheduling), Clustering Tool, Energy Models and Instruction Clusters]
Pareto Curve Generation
- For choosing the operating point at run-time
- Enables the designer to assess the trade-off between energy and performance
[Figure: energy versus delay Pareto curve; one end is optimized for performance (maximum cluster activity), the other is optimized for energy (minimal cluster activity)]
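A minimal sketch of Pareto curve generation, assuming each explored design point is summarized as a (delay, energy) pair where lower is better on both axes. The function name and the sample numbers are hypothetical; the slides do not specify how the curve is computed.

```python
def pareto_front(points):
    """Return the (delay, energy) points that no other point beats on both axes."""
    front = []
    best_energy = float("inf")
    # Visit points from fastest to slowest; keep one only if it lowers the energy.
    for delay, energy in sorted(points):
        if energy < best_energy:
            front.append((delay, energy))
            best_energy = energy
    return front


# Invented (delay, energy) pairs, for illustration only.
explored = [(10, 9.0), (12, 6.5), (11, 9.5), (15, 4.0), (14, 7.0), (18, 4.2)]
front = pareto_front(explored)
print(front)
```

The first point of the returned list is the performance-optimized end of the curve and the last point is the energy-optimized end; a run-time system would pick an operating point between the two.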
Exploration Methodology: What we want to achieve…
[Figure: tool flow with the boxes Application, Software Transformations, Compiler (Scheduling & Clustering), Energy Models, Instruction Clusters and Schedule]
Pareto Curve Generation
- For choosing the operating point at run-time
- Enables the designer to assess the trade-off between energy and performance
[Figure: energy versus delay Pareto curve; one end is optimized for performance (maximum cluster activity), the other is optimized for energy (minimal cluster activity)]
Compiler Scheduling
Compiler scheduling can change the functional unit activity, and hence the clustering result, and hence energy and performance.
Example 1: energy reduction without performance loss
  Original schedule:
    OP11 OP12 -    OP13 -    OP14
    All 3 clusters need to be active
  Alternative schedule:
    OP11 OP12 OP13 OP14 -    -
    Only 2 clusters need to be active
Example 2: energy reduction at the expense of performance loss
  Original schedule:
    OP11 OP12 -    OP13 -    OP14
    OP21 -    OP22 -    OP23 -
    2 activations of all 3 clusters
  Alternative schedule:
    OP11 OP12 -    -    -    -
    OP21 -    -    -    -    -
    -    -    OP22 OP13 OP23 OP14
    2 activations for the 1st cluster, 1 activation each for the 2nd and 3rd clusters
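The activation counts quoted for these schedules can be reproduced with a short sketch. It assumes six issue slots grouped into three clusters of two adjacent slots each, which is inferred from the counts on the slide rather than stated there, and it uses the second instruction of the rescheduled example with OP21 in its first slot. The function names are hypothetical.

```python
# Assumption: 6 issue slots, grouped as 3 clusters of 2 adjacent slots.
CLUSTERS = [(0, 1), (2, 3), (4, 5)]


def row(text):
    """Parse one instruction word such as 'OP11 OP12 - OP13 - OP14'."""
    return text.split()


def cluster_activations(schedule, clusters=CLUSTERS):
    """Count, per cluster, the instruction words in which it holds an operation."""
    counts = [0] * len(clusters)
    for instr in schedule:
        for i, slots in enumerate(clusters):
            if any(instr[s] != "-" for s in slots):
                counts[i] += 1
    return counts


# Example 1: same length, fewer active clusters.
ex1_before = [row("OP11 OP12 - OP13 - OP14")]
ex1_after = [row("OP11 OP12 OP13 OP14 - -")]

# Example 2: one extra instruction word, fewer activations overall.
ex2_before = [row("OP11 OP12 - OP13 - OP14"),
              row("OP21 - OP22 - OP23 -")]
ex2_after = [row("OP11 OP12 - - - -"),
             row("OP21 - - - - -"),
             row("- - OP22 OP13 OP23 OP14")]

for name, sched in [("ex1_before", ex1_before), ("ex1_after", ex1_after),
                    ("ex2_before", ex2_before), ("ex2_after", ex2_after)]:
    print(name, len(sched), cluster_activations(sched))
```

The schedule length stands in for delay and the activation counts stand in for the energy spent fetching into each cluster, which is the trade-off the two examples illustrate.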
Software Transformations
High level code transformations can also impact/change the clustering result and hence energy and performance
Loop Transformations
- Loop splitting
- Loop merging
- Loop peeling (for nested loops)
- Loop collapsing (nested loops)
- Code movement across loops
- etc.
[Figure: Loop splitting; a single loop is split into loop 1 and loop 2]
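As a reminder of what loop splitting does at the source level, here is a minimal sketch with invented loop bodies. One loop that performs two independent computations is split into two loops that each perform one. The premise, taken from the slide, is that such a transformation changes which operations are scheduled together and therefore the clustering result; the sketch only shows that the results are unchanged.

```python
def single_loop(a, b):
    """Original form: one loop body does two independent computations."""
    scaled, shifted = [], []
    for i in range(len(a)):
        scaled.append(a[i] * 2)
        shifted.append(b[i] + 1)
    return scaled, shifted


def split_loops(a, b):
    """After loop splitting: two smaller loop bodies, same results."""
    scaled = []
    for i in range(len(a)):  # loop 1
        scaled.append(a[i] * 2)
    shifted = []
    for i in range(len(b)):  # loop 2
        shifted.append(b[i] + 1)
    return scaled, shifted


a, b = [1, 2, 3], [4, 5, 6]
print(single_loop(a, b))
print(split_loops(a, b))
```

Splitting is only legal when the two computations do not depend on each other across iterations, which holds here because each one reads a different input and writes a different output.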
Conclusions
• L0 Buffer Organization
- Multimedia applications have high locality in small program segments
- An additional small L0 buffer should be used
- Current options for L0 buffers are still not energy efficient
- A distributed L0 buffer organization should be sought
- But the clustering/partitioning should be application specific
• L1 Cache Organization
- Distributed (?)
• Instruction Memory Exploration
- Software transformations and compiler scheduling can change the clustering results
- An exploration methodology should be sought to analyze the trade-offs