- 1 -EE898- Compiler
Compilers for embedded systems: Why are compilers an issue?
• Many reports about low efficiency of standard compilers
• Special features of embedded processors have to be exploited.
• High levels of optimization are more important than compilation speed.
• Compilers can help to reduce the energy consumption (energy optimization).
• Compilers could help to meet real-time constraints.
• Fewer legacy problems than for PCs.
• There is a large variety of instruction sets.
• Design space exploration for optimized processors makes sense
- 2 -EE898- Compiler
Use of assembly languages in embedded systems
[Figure: chart comparing Assembler (DSP), Assembler (µController), C-Code (µController) and C-Code (DSP)]
[Paulin, 1995]
Similar situation more recently
- 3 -EE898- Compiler
Optimizations considered
• Energy-aware compilation
• Compilation for digital signal processors
• Compilation for multimedia processors
• Compilation for VLIW (very long instruction word) processors
• Compilation for network processors
• Compiler generation, retargetable compilers and design space exploration
- 4 -EE898- Compiler
Efforts for Reducing Energy
• Device Level
– Development of Low Power Devices
– Reducing Power Supply Voltage
– Reducing Threshold Voltage
No!
• High performance if the available memory bandwidth is fully used; low energy consumption if memories are in stand-by mode
• Reduced energy if more values are kept in registers
- 6 -EE898- Compiler
Energy models
• Commercial tools are frequently very imprecise
• Model of Tiwari (Dissertation, Princeton 1996): cost of instructions and of transitions between instructions; does not separate out the cost of memory access
• Model of Simunic, de Micheli (DAC 99): model based on data sheets; does not require measurements; does not take transitions into account
• Russell, Jacome (ICCD, 1998): based on precise measurement for two fixed configurations; cannot predict the effect of changes to the memory architecture
• Lee (LCTES 2001): detailed analysis of the effect of pipeline stages; does not include multi-cycle operations and stalls
Dedicated energy models.
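As a rough illustration of an instruction-level model in the spirit of the Tiwari entry above (a base cost per instruction plus a cost for each transition between consecutive instructions), the following Python sketch sums both contributions over an instruction trace. The opcode names, the cost tables and every number in them are made-up placeholders, not measured data, and the rule that a repeated opcode adds no transition cost is an assumption of this sketch.

```python
# Hypothetical per-instruction base costs (arbitrary units, not measured values).
BASE_COST = {"add": 1.0, "mul": 3.0, "load": 2.5}

# Hypothetical extra cost when one opcode directly follows a different one.
# Pairs that are not listed fall back to DEFAULT_TRANSITION_COST.
TRANSITION_COST = {("add", "mul"): 0.4, ("mul", "load"): 0.6}
DEFAULT_TRANSITION_COST = 0.2


def trace_energy(trace):
    """Return base costs plus inter-instruction transition costs for a trace."""
    energy = sum(BASE_COST[op] for op in trace)
    for prev, curr in zip(trace, trace[1:]):
        if prev != curr:  # assumption: repeating the same opcode adds no transition cost
            energy += TRANSITION_COST.get((prev, curr), DEFAULT_TRANSITION_COST)
    return energy
```

A compiler could use such a function as the cost metric when comparing alternative instruction sequences; like the model it imitates, the sketch does not account for memory accesses separately.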
• It is not important which address bit is set to '1'
• The number of '1's on the address bus is irrelevant
• The cost of flipping a bit on the address bus is independent of the bit position.
• It is not important which data bit is set to '1'
• The number of '1's on the data bus has a minor effect (3%)
• The cost of flipping a bit on the data bus is independent of the bit position.
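If the cost of a bit flip does not depend on the bit position, one plausible way to estimate bus switching cost is to count the bits that differ between consecutive bus values (their Hamming distance). The Python sketch below does this; the function names and the `cost_per_flip` constant are illustrative assumptions, and the small effect of the number of '1's on the data bus is ignored.

```python
def bit_flips(previous, current):
    """Number of bus lines that change between two consecutive bus values."""
    return bin(previous ^ current).count("1")


def bus_switching_cost(values, cost_per_flip=1.0):
    """Sum the flip cost over a sequence of non-negative bus values.

    cost_per_flip is a hypothetical constant; every bit position is
    weighted equally, following the position-independence noted above.
    """
    return cost_per_flip * sum(
        bit_flips(a, b) for a, b in zip(values, values[1:])
    )
```

For example, the address sequence 0, 1, 3, 2 changes exactly one bit per step, so it costs three flips in total under this sketch.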
- 11 -EE898- Compiler
Compiler optimizations for improving energy efficiency
• Energy-aware scheduling
• Energy-aware instruction selection
• Operator strength reduction: e.g. replace * by + and <<
• Minimize the bitwidth of loads and stores
• Standard compiler optimizations with energy as a cost function
E.g.: Register pipelining:
for i := 1 to 10 do
  C := 2 * a[i] + a[i-1];
becomes
R2 := a[0];
for i := 1 to 10 do
begin
  R1 := a[i];
  C := 2 * R1 + R2;
  R2 := R1;
end;
• Exploitation of the memory hierarchy
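The operator strength reduction mentioned on this slide (replacing * by + and <<) can be illustrated with a small Python sketch in which a multiplication by the constant 10 is rewritten as two shifts and an addition. The function names and the constant are chosen for illustration only; whether the shifted form actually saves energy depends on the cost model of the target processor.

```python
def times_ten_mul(x):
    """Original form: one multiplication."""
    return x * 10


def times_ten_shift_add(x):
    """Strength-reduced form: 10*x = 8*x + 2*x = (x << 3) + (x << 1)."""
    return (x << 3) + (x << 1)
```

Both functions return the same value for every integer argument, so a compiler is free to pick whichever form is cheaper under its energy cost function.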
- 12 -EE898- Compiler
3 key problems for future memory systems
[Figure: energy and access times]
1. (Average) Speed
2. Energy/Power
3. Predictability/Worst Case Execution Time
smaller, faster, less energy
- 13 -EE898- Compiler
1. (Average) Speed
Speed gap between processor and main DRAM increases
[Figure: speed over years; CPU (1.5-2 p.a.), DRAM (1.07 p.a.)]
• Early 1960s (Atlas): page fault ~ 2500 instructions
• 2002 (2 GHz µP): access to DRAM ~ 500 instructions
Penalty for a cache miss will soon be the same as for a page fault in Atlas
[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]
[Figure annotation: 2x every 2 years]
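The growth rates quoted on this slide (CPU speed growing by a factor of 1.5 to 2 per year, DRAM by 1.07 per year) can be turned into a small worked calculation. The Python sketch below, with function names chosen for illustration, computes how long it takes for the processor/DRAM speed ratio to double under the assumption of constant annual growth factors.

```python
import math


def gap_doubling_time(cpu_growth, dram_growth):
    """Years until the CPU/DRAM speed ratio doubles, given annual growth factors."""
    relative_growth = cpu_growth / dram_growth  # growth of the gap per year
    return math.log(2) / math.log(relative_growth)


def gap_after(years, cpu_growth, dram_growth):
    """Factor by which the speed gap has grown after the given number of years."""
    return (cpu_growth / dram_growth) ** years
```

With the lower CPU figure, `gap_doubling_time(1.5, 1.07)` comes out at roughly two years, which is consistent with the "2x every 2 years" annotation; with a CPU growth factor of 2 the gap doubles in little more than one year.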
- 14 -EE898- Compiler
2. Power/Energy
Example (CACTI Model):
[Figure: energy per access [nJ] as a function of memory size [bytes], for sizes from 64 to 8192 bytes]
[Steinke et al., Inf 12, UniDo, 2002]
- 15 -EE898- Compiler
3. Predictability/WCET
• Predictability: For satisfying timing constraints in hard real-time systems, predictability is the most important concern; pre run-time scheduling is often the only practical means of providing predictability in a complex system [Xu, Parnas]
Time-triggered, statically scheduled operating systems
• What about memory accesses?
– Currently available caches don't solve the problem:
  • Improve the average case behavior
  • Use "non-deterministic" cache replacement algorithms
Scratch-pad/tightly coupled memory based predictability
- 16 -EE898- Compiler
Hierarchical memories using scratch pad memories (SPM)
Example: ARM7TDMI cores, well-known for low power consumption
[Figure: hierarchy of processor, SPM and main memory; address space from 0 to FFF.. containing the scratch pad memory]
no tag memory
- 17 -EE898- Compiler
Exploitation of SPM
Which segment (array, loop, etc.) to be stored in SPM?
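Gain $g_i$ and size $s_i$ for each segment $i$.
Maximise gain $G = \sum_i g_i$, respecting constraint $K \ge \sum_i s_i$, where both sums run over the segments stored in the SPM and $K$ is the SPM capacity.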
Static memory allocation:
Solution: knapsack algorithm.
Dynamic reloading:
Finding optimal reloading points.
[Figure: example program with candidate segments (for loops, while and repeat loops, calls, arrays, int variables) to be placed either in the SPM (capacity K) of the processor or in the on-board main memory]
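The static allocation problem on this slide (pick the segments that maximise the total gain without exceeding the SPM capacity K) is a 0/1 knapsack problem. The Python sketch below solves it with the textbook dynamic-programming formulation; it is only an illustration, not necessarily the formulation used by the authors, and the function name, the tuple layout and the requirement of integer sizes are assumptions of this sketch.

```python
def allocate_spm(segments, capacity):
    """0/1 knapsack: choose segments maximising total gain within SPM capacity.

    segments: list of (name, gain, size) tuples with integer sizes (e.g. bytes).
    Returns (best_gain, list_of_chosen_names).
    """
    # best[c] = (gain, chosen names) achievable with at most c units of SPM
    best = [(0, [])] * (capacity + 1)
    for name, gain, size in segments:
        # Iterate capacities downwards so each segment is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate_gain = best[c - size][0] + gain
            if candidate_gain > best[c][0]:
                best[c] = (candidate_gain, best[c - size][1] + [name])
    return best[capacity]
```

With three invented segments `[("loop", 60, 10), ("array", 100, 20), ("func", 120, 30)]` and a capacity of 50, the sketch selects `array` and `func` for a total gain of 220. Dynamic reloading is not covered by this static formulation.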
- 18 -EE898- Compiler
Reduction in energy and average run-time
Multi_sort (mix of sort algorithms)
[Figure axis: Cycles]
- 19 -EE898- Compiler
Energy consumption per functional unit, as a function of the SPM size
Parameters different from previous slide
- 20 -EE898- Compiler
Hardware-support for block-copying
• The DMA unit was modeled in VHDL, simulated, and synthesized. The unit only makes up 4% of the processor chip.
• The unit can be put to sleep when it is unused.
• Code size reductions of up to 23% for a 256 byte SPM were determined using the DMA unit instead of the dynamic approach that uses processor instructions for copying.