1
Single-ISA Heterogeneous Multi-Core Architectures:
The Potential for Processor Power Reduction
Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi,
Parthasarathy Ranganathan, Dean M. Tullsen
Presenter: Borys Bradel
2
Introduction
Different programs have different requirements (e.g. ILP)
  Extends to phases of a single program
Heterogeneous cores
  Use the core that matches the requirements
Reuse existing cores
  Use multiple generations of the same family of processors
3
Outline
Methodology
  Hardware
  Assumptions
  Power
Experiments
  Optimal: energy / energy-delay product
  Heuristic based: static / dynamic
Related Work
Conclusion
4
Single ISA Multi-Core Benefits
Small area overhead because of the growth in core sizes between generations
Clock frequencies of older cores would scale with technology
  P3 1 GHz = P4 1.4 GHz
  Pipeline depth was increased precisely because frequency could not scale otherwise
5
Hardware – Alpha Family
2 in-order cores
  EV4 = 21064
  EV5 = 21164
2 out-of-order cores
  EV6 = 21264
  EV8- = 21464 (multithreading support removed)
6
Hardware Size
15% more area than just using 21464
7
Assumptions
Can switch cores dynamically
Private L1 caches and a common L2 cache
All cores use 0.10 micron technology
A single process executes on a single core at any one time
2.1 GHz clock (the 600 MHz, 0.35 micron 21264 scaled to 0.10 micron)
Input voltage 1.2 V
Cores shut down when idle
1000 cycle restart cost (staged; phase-locked loop left alone)
150 ns memory access
Stall cycles determined using CACTI
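The 2.1 GHz figure is consistent with scaling the 21264's 600 MHz clock linearly with feature size, from 0.35 micron down to 0.10 micron. The sketch below reproduces that arithmetic and converts the 1000 cycle restart cost into time. The linear scaling rule and the function name are assumptions made for illustration, not something the slides spell out.

```python
def scale_frequency_mhz(base_mhz, base_feature_um, target_feature_um):
    """Assumed rule: clock frequency scales inversely with feature size."""
    return base_mhz * base_feature_um / target_feature_um


# 21264 (EV6): 600 MHz at 0.35 micron, scaled to the 0.10 micron process.
clock_mhz = scale_frequency_mhz(600.0, 0.35, 0.10)

# A clock of N MHz completes N cycles per microsecond, so the restart
# cost in microseconds is cycles / MHz.
restart_cycles = 1000
restart_us = restart_cycles / clock_mhz

print(f"scaled clock: {clock_mhz:.0f} MHz")
print(f"restart cost: {restart_us:.3f} microseconds")
```

Under these assumptions the restart penalty is well under a microsecond, which is small next to switching intervals measured in millions of instructions.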
8
Core Configurations
9
Power Model
Use Wattch to account for activity based dissipation
Use scaling and offset factors to account for other factors
This hybrid model is closer to manufacturer’s data points
Peak power: data sheets less L2 cache and output pins
Typical power: scaled based on Intel chips
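One way to picture the hybrid model is as a linear correction applied to the activity-based estimate. The sketch below assumes the form `scale * activity_power + offset` and fits the two factors from two calibration points. The linear form, the function names, and the numbers are all hypothetical; the slides state only that scaling and offset factors are used.

```python
def fit_scale_offset(point_a, point_b):
    """Fit scale and offset from two (simulated_w, reference_w) pairs.

    Assumes reference = scale * simulated + offset.
    """
    (sim_a, ref_a), (sim_b, ref_b) = point_a, point_b
    scale = (ref_b - ref_a) / (sim_b - sim_a)
    offset = ref_a - scale * sim_a
    return scale, offset


def hybrid_power(activity_power_w, scale, offset_w):
    """Apply the assumed linear correction to an activity-based estimate."""
    return scale * activity_power_w + offset_w


# Hypothetical calibration points, not values from the paper.
scale, offset = fit_scale_offset((10.0, 25.0), (20.0, 45.0))
print(hybrid_power(15.0, scale, offset))
```

The offset term stands in for dissipation that does not depend on activity, which an activity-only model would miss.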
10
Power and Area Statistics
11
Performance Modeling
Use SMTSIM, a cycle-accurate simulator
SimPoint is used to identify a representative portion of each program and how many instructions need to be fast-forwarded to reach it
12
Varying Performance Ratio
13
Varying Energy Efficiency Ratio
14
Oracle Switching for Energy
Performance always within 10% of EV8-
15
Oracle Switching for Energy
16
Oracle Switching for Energy Delay Product
Performance always within 50% of EV8-
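The oracle policies can be sketched as a per-interval selection: among the cores whose performance stays within a given fraction of EV8-, pick the one that minimizes the chosen metric (energy or energy-delay product). The sketch below treats the performance bound as a hard constraint, which is an assumption, and the per-core numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IntervalResult:
    core: str
    energy: float  # energy for a fixed-instruction interval
    time: float    # execution time for the same interval


def energy(r):
    return r.energy


def energy_delay(r):
    return r.energy * r.time


def oracle_pick(results, metric, max_perf_loss, reference="EV8-"):
    """Pick the core with the lowest metric whose performance is within
    max_perf_loss of the reference core for this interval."""
    ref_time = next(r.time for r in results if r.core == reference)
    # Performance is 1/time, so perf >= (1 - loss) * ref_perf
    # is equivalent to time <= ref_time / (1 - loss).
    limit = ref_time / (1.0 - max_perf_loss)
    allowed = [r for r in results if r.time <= limit]
    return min(allowed, key=metric).core


# Hypothetical measurements for one interval.
interval = [
    IntervalResult("EV4", 1.0, 3.0),
    IntervalResult("EV5", 2.0, 1.6),
    IntervalResult("EV6", 4.0, 1.05),
    IntervalResult("EV8-", 10.0, 1.0),
]

print(oracle_pick(interval, energy, 0.10))
print(oracle_pick(interval, energy_delay, 0.50))
```

The reference core always satisfies its own bound, so the allowed set is never empty.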
17
Oracle Switching for Energy Delay Product
18
Others
Voltage/frequency scaling: not as good
Static core selection: only EV6 and EV8- are used
Dynamic heuristic
  Running average performance kept within 10%
  Every 100 time intervals (100 million instructions), cores are sampled for 5 intervals
  The best core is selected based on the sampling
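The sample-then-run pattern of the dynamic heuristic can be sketched as a loop over periods of 100 intervals. The sketch assumes that every core is sampled for 5 intervals each at the start of a period and that the core with the lowest energy-delay product then runs for the rest of the period. It leaves out the running-average performance check, and the `measure` callback and its numbers are hypothetical.

```python
def dynamic_schedule(num_intervals, cores, measure, period=100, sample_len=5):
    """Return the core used in each interval.

    measure(core, interval_index) -> (energy, time) for that interval.
    """
    schedule = []
    while len(schedule) < num_intervals:
        # Sampling phase: run each core for sample_len intervals.
        scores = {}
        for core in cores:
            total = 0.0
            for _ in range(sample_len):
                e, t = measure(core, len(schedule))
                total += e * t  # accumulate energy-delay product
                schedule.append(core)
            scores[core] = total
        best = min(scores, key=scores.get)
        # Steady phase: stay on the best core until the period ends.
        steady = period - sample_len * len(cores)
        schedule.extend([best] * steady)
    return schedule[:num_intervals]


CORES = ["EV4", "EV5", "EV6", "EV8-"]


def example_measure(core, i):
    """Hypothetical two-phase program: EV4 is best early, EV6 later."""
    early = {"EV4": (1.0, 3.0), "EV5": (2.0, 1.6),
             "EV6": (4.0, 1.05), "EV8-": (10.0, 1.0)}
    late = {"EV4": (1.0, 9.0), "EV5": (2.0, 4.0),
            "EV6": (4.0, 1.0), "EV8-": (10.0, 0.9)}
    return (early if i < 100 else late)[core]


plan = dynamic_schedule(200, CORES, example_measure)
print(plan[20], plan[120])
```

With four cores, sampling occupies 20 of every 100 intervals, so the cost of running on a poorly matched core during sampling is bounded.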
19
Results for Heuristics
20
Results for Heuristics/Static Core
21
Related Work
Gating-based power optimization
  Cannot gate at a fine enough granularity
  May still have leakage
  This work could be thought of as gating to reduce the capabilities of different units
Voltage and frequency scaling
  Chip-wide: one size does not fit all
  Fine-grained: granularity problems
22
Conclusions
Heterogeneous multi-core architectures reduce the energy-delay product
  More fine-grained than other approaches
Using several cores from the same family is good
  Reduces development/testing costs
Is it scalable?
Just use EV6?