Frank Vahid, UC Riverside 1
Self-Improving Configurable IC Platforms
Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
Co-PI: Walid Najjar, Professor, CS&E, UCR
Frank Vahid, UC Riverside 2
Goal: Platform Self-Tunes to Executing Application
Download a standard binary; the platform adjusts to the executing application; the result is better speed and energy. Why, and how?
Profile, decompile frequent loops, optimize, synthesize, place and route onto the FPGA, then update the software to call the FPGA.
Transparent: no impact on the tool flow. Dynamic software optimization, software binary updating, and dynamic binary translation are proven technologies.
But how can you profile, decompile, optimize, synthesize, and place and route, on-chip?
[Architecture diagram: processor with L1 cache and memory, an on-chip profiler and explorer, and a dynamic partitioning module (decompiler, optimizer, synthesis, place and route) targeting an on-chip FPGA]
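The transparent-update idea can be sketched in miniature: the running code reaches its hot loop through one level of indirection, and the platform rebinds that entry once a faster version exists. Every name below is an illustrative stand-in, not an actual on-chip tool or API.

```python
# Toy sketch of transparent self-tuning: the "binary" calls its hot loop
# through a dispatch table, and the platform later rebinds the entry to a
# faster "hardware" version without touching the caller.

def sw_loop(data):                  # stand-in for the original software loop
    total = 0
    for x in data:
        total += x
    return total

def hw_loop(data):                  # stand-in for the synthesized FPGA version
    return sum(data)

dispatch = {"hot_loop": sw_loop}    # how the running binary reaches the loop

def run(data):
    return dispatch["hot_loop"](data)

before = run(range(100))
dispatch["hot_loop"] = hw_loop      # "update sw to call FPGA"
after = run(range(100))
assert before == after == 4950      # same answer, transparently
```

The caller (`run`) never changes, which is the point of the transparency claim above.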
Frank Vahid, UC Riverside 6
Dynamic Partitioning Requires Lean Tools
How can you run Synopsys/Cadence/Xilinx-class tools on-chip, when they currently run on powerful workstations?
Key: our tools only need to be good enough to speed up critical loops.
Most execution time is spent in small loops (e.g., in MediaBench, NetBench, and EEMBC), so we created ultra-lean versions of the tools.
Quality is not necessarily as good, but it is good enough, and the tools run on a 60 MHz ARM7.
[Chart: for the ten most frequent loops, cumulative % of execution time vs. % of program size — a few small loops account for most of the execution time]
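One reason the profiler can be lean: detecting frequent loops only requires counting taken backward branches, since each taken backward branch marks one loop iteration. A minimal sketch, with an invented trace format:

```python
# Frequent-loop detection by counting taken backward branches.
# The (pc, target) trace format here is invented for illustration.

from collections import Counter

def hot_loops(trace):
    """trace: list of (pc, target) pairs for taken branches."""
    counts = Counter()
    for pc, target in trace:
        if target < pc:              # backward branch => loop back-edge
            counts[target] += 1      # key loops by their start address
    return counts.most_common()

# Toy trace: the loop at 0x100 iterates 3 times, the one at 0x200 once.
trace = [(0x110, 0x100), (0x110, 0x100), (0x110, 0x100), (0x210, 0x200)]
assert hot_loops(trace)[0] == (0x100, 3)
```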
Frank Vahid, UC Riverside 7
Dynamic Hw/Sw Partitioning Tool Chain
[Architecture diagram repeated: processor, L1 cache, memory, profiler, explorer, partitioner, decompiler/optimizer, synthesis, place and route, FPGA]
Tool chain: Binary → Loop Profiler → Small, Frequent Loops → Loop Decompilation → Hw Synthesis → Tech. Mapping → Place & Route → Bitfile Creation → Binary Modification → Updated Binary (plus DMA Configuration)
The architecture is targeted for loop speedup, enabling simple place and route.
We've developed efficient profiler hardware.
We're continuing to extend these tools to handle more benchmarks.
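The binary-modification step at the end of the chain can be sketched as follows; the instruction strings, addresses, and stub behavior are invented for illustration only:

```python
# Sketch of binary modification: overwrite the first instruction of the
# hardware-mapped loop with a jump to a stub. The stub (at stub_addr)
# would start the FPGA via DMA, wait for completion, then jump to
# loop_end so software resumes after the loop. Encodings are invented.

def patch_binary(code, loop_start, loop_end, stub_addr):
    """code: list of instruction strings indexed by address."""
    patched = list(code)
    patched[loop_start] = f"jump {stub_addr}"  # divert loop entry to the stub
    return patched

code = ["add", "loop: cmp", "bne loop", "done"]
patched = patch_binary(code, loop_start=1, loop_end=3, stub_addr=100)
assert patched[1] == "jump 100"                # loop entry now reaches the FPGA
assert patched[0] == "add" and patched[3] == "done"  # rest untouched
```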
Frank Vahid, UC Riverside 8
Dynamic Hw/Sw Partitioning Results
[Architecture diagram repeated]
UCR Tools:

Tool | Code Size (lines) | Memory (bytes) | Avg. Time (s) | Binary Size (bytes)
Decompilation | 4,695 | 360K | 1.60 | 47K
FPGA Config. (RT synthesis, logic min., tech. mapping, place & route) | 7,203 | 452K | 0.20 | 67K
Frank Vahid, UC Riverside 9
Dynamic Hw/Sw Partitioning Results
Example | Sw Time | Sw Loop Time | Hw Loop Time | Sw/Hw Time | Speedup
brev | 0.07 | 0.05 | 0.001 | 0.02 | 3.1
g3fax1 | 33.84 | 10.58 | 1.19 | 24.45 | 1.4
g3fax2 | 33.84 | 10.64 | 2.15 | 25.35 | 1.3
url | 547.06 | 437.39 | 19.13 | 128.80 | 4.2
logmin | 23.50 | 15.00 | 0.31 | 8.81 | 2.7
pktflow | 1.19 | 0.42 | 0.09 | 0.86 | 1.4
canrdr | 1.18 | 0.41 | 0.07 | 0.84 | 1.4
bitmnp | 6.98 | 3.75 | 0.04 | 3.27 | 2.1
Avg. | | 59.78 | 2.87 | 24.05 | 2.2
Powerstone, NetBench, and EEMBC examples, partitioning only the single most frequent loop. The average speedup is very close to the ideal speedup of 2.4, so not much is left on the table in these examples. Dynamically speeding up inner loops on FPGAs is feasible using on-chip tools. ICCAD'02 (Stitt/Vahid): binary-level partitioning in general is very effective.
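The table's columns are consistent with the usual relation for partitioned execution time, Sw/Hw time = total Sw time − Sw loop time + Hw loop time, which can be checked against, for example, the g3fax1 row:

```python
# Verifying the results table: the combined sw/hw time is the software
# time with the loop's software time replaced by its hardware time.

def sw_hw_time(sw_total, sw_loop, hw_loop):
    return sw_total - sw_loop + hw_loop

# g3fax1 row: 33.84 s total, 10.58 s in the loop, 1.19 s in hardware.
t = sw_hw_time(33.84, 10.58, 1.19)
assert round(t, 2) == 24.45              # matches the Sw/Hw Time column
assert round(33.84 / t, 1) == 1.4        # matches the reported speedup
```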
Frank Vahid, UC Riverside 10
Configurable Cache: Why?
ARM920T: caches consume half of total processor system power (Segars '01).
M*CORE: the unified cache consumes half of total processor system power (Lee/Moyer/Arends '99).
[Architecture diagram repeated, highlighting the L1 cache]
Frank Vahid, UC Riverside 11
Best Cache for Embedded Systems?
Embedded processors show great diversity in cache associativity, line size, and total size.
[Table: instruction-cache and data-cache size, associativity, and line size for a range of embedded processors]
Frank Vahid, UC Riverside 12
Cache Design Dilemmas
Associativity — Low: low power, good performance for many programs. High: better performance on more programs.
Total size — Small: lower power if the working set is small (and less area). Big: better performance/power if the working set is large.
Line size — Small: better with poor spatial locality. Big: better with good spatial locality.
Most caches are a compromise across many programs: they work best on average. But embedded systems run one or a few programs, and we want the best cache for that one program.
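A tiny LRU cache model illustrates the associativity dilemma above: two addresses that fall in the same set thrash a direct-mapped cache but coexist in a 2-way cache. The model and its parameters are illustrative only:

```python
# Minimal set-associative cache with per-set LRU replacement, used only
# to show how associativity changes the miss count for a conflicting trace.

def misses(trace, ways, sets=4):
    cache = {s: [] for s in range(sets)}     # per-set LRU lists of tags
    n = 0
    for addr in trace:
        s, tag = addr % sets, addr // sets
        if tag in cache[s]:
            cache[s].remove(tag)             # hit: refresh to MRU position
        else:
            n += 1                           # miss
            if len(cache[s]) == ways:
                cache[s].pop(0)              # evict the LRU way
        cache[s].append(tag)
    return n

trace = [0, 4, 0, 4, 0, 4]                   # both addresses map to set 0
assert misses(trace, ways=1) == 6            # direct-mapped: thrashes
assert misses(trace, ways=2) == 2            # 2-way: only the two cold misses
```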
Frank Vahid, UC Riverside 13
Solution to the Cache Design Dilemma
Configurable cache: design a physical cache that can be reconfigured.
1, 2, or 4 ways — way concatenation, a new technique, ISCA'03 (Zhang/Vahid/Najjar): four 2K ways plus concatenation logic.
8K, 4K, or 2K byte total size — way shutdown, ISCA'03: gates Vdd, saving both dynamic and static power, with some performance overhead (5%).
16, 32, or 64 byte line size — variable line fetch size, ISVLSI'03: a physical 16-byte line, fetched one, two, or four physical lines at a time.
Note: this is a single physical cache, not a synthesizable core.
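Assuming, for illustration, that the three mechanisms compose freely (the actual silicon may restrict some combinations), the reachable configurations can be enumerated:

```python
# Illustrative enumeration of configurable-cache settings: four 2 KB
# banks, way shutdown (8/4/2 KB total), way concatenation (4/2/1 ways),
# and a fetch of one, two, or four 16-byte physical lines.

def configs():
    out = []
    for banks in (4, 2, 1):                  # way shutdown: 8 KB, 4 KB, 2 KB
        size = banks * 2048
        for ways in (4, 2, 1):               # way concatenation
            if ways > banks:
                continue                     # can't have more ways than banks
            for fetch in (16, 32, 64):       # variable line fetch size
                out.append((size, ways, fetch))
    return out

c = configs()
assert (8192, 1, 64) in c                    # 8 KB direct-mapped, 64-byte lines
assert (2048, 1, 16) in c                    # smallest configuration
```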
Frank Vahid, UC Riverside 14
Configurable Cache Design: Way Concatenation (4, 2 or 1 way)
[Circuit diagram: configuration registers reg0/reg1 generate concatenation signals c0–c3, which control whether address bits a11 and a12 select among the four data-array banks or are compared as part of the tag; address split: tag a31–a13, configurable bits a12/a11, index a10–a5, line offset a4–a0; the critical path through the sense amps, column mux, and mux drivers is unchanged]
Trivial area overhead, no performance overhead
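The effect of the concatenation signals can be sketched in software: with 32-byte lines, bits a10–a5 always index within a 2 KB bank, and a11/a12 either extend the index (fewer, bigger ways) or move to the tag (more, smaller ways). A simplified model, not the actual circuit:

```python
# Simplified model of way-concatenation set indexing for an 8 KB cache
# built from four 2 KB ways with 32-byte lines. Bit positions follow the
# slide's address split; the real selection is done by signals c0-c3.

def set_index(addr, ways):
    base = (addr >> 5) & 0x3F                # a10..a5: 64 sets per 2 KB bank
    if ways == 2:                            # 2 ways of 4 KB: a11 joins the index
        return base | ((addr >> 5) & 0x40)
    if ways == 1:                            # 1 way of 8 KB: a11 and a12 join
        return base | ((addr >> 5) & 0xC0)
    return base                              # 4 ways of 2 KB: a11/a12 go to the tag

assert set_index(0x800, 4) == 0              # a11 ignored when 4-way
assert set_index(0x800, 2) == 64             # a11 selects the upper bank pair
```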
Frank Vahid, UC Riverside 15
Configurable Cache Design Metrics
We computed power, performance, energy, and size using CACTI models, our own layout (0.13 µm TSMC CMOS), and Cadence tools. Energy accounts for the cache, memory, bus, and CPU stalls.
Benchmarks: Powerstone, MediaBench, and SPEC; simulations used SimpleScalar.
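The kind of energy accounting described above can be sketched as follows; all constants here are made up for illustration, not the paper's CACTI-derived values:

```python
# Illustrative energy model: every access pays the cache's access energy;
# each miss additionally pays memory + bus energy and CPU stall energy.

def memory_energy(accesses, miss_rate, e_cache, e_mem_bus, e_stall):
    access_energy = accesses * e_cache             # cache access cost
    miss_count = accesses * miss_rate
    return access_energy + miss_count * (e_mem_bus + e_stall)

# Example tradeoff: a smaller cache costs less per access but misses more.
big   = memory_energy(1_000_000, 0.02, e_cache=1.0, e_mem_bus=50, e_stall=20)
small = memory_energy(1_000_000, 0.05, e_cache=0.6, e_mem_bus=50, e_stall=20)
assert big < small                                 # here the bigger cache wins
```

Which configuration wins depends entirely on the program's miss rate, which is why a per-program configurable cache pays off.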
Frank Vahid, UC Riverside 16
Configurable Cache Energy Benefits
40%-50% energy savings on average, compared to conventional 4-way and 1-way associative caches with 32-byte lines. And the configurable cache is best for every example (remember, a conventional cache is a compromise).
[Bar chart: normalized energy of a conventional 4-way, 32-byte-line cache (cnv4w32), a conventional 1-way, 32-byte-line cache (cnv1w32), and the configurable cache (con4) across Powerstone, MediaBench, and SPEC benchmarks (padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjpeg, ucbqsort, v42, adpcm, epic, g721, pegwit, mpeg, jpeg, art, mcf, parser, vpr) and their average; a few conventional-cache bars run off scale (126.1%, 619.6%, 126.8%), and the configurable cache is lowest in every case]
Frank Vahid, UC Riverside 17
Future Work
Dynamic cache tuning.
More advanced dynamic partitioning: automatic frequent-loop detection, an on-chip exploration tool, better decompilation and synthesis, and a better FPGA fabric with better place and route. Approach: continue extending the tools to support more benchmarks.
Extend to platforms with multiple processors — this scales well, since processors can share the on-chip partitioning tools.
Frank Vahid, UC Riverside 18
Conclusions
Self-improving configurable ICs provide excellent speed and energy improvements and require no modification to existing software flows, so they can be widely adopted.
We've shown the idea is practical: lean on-chip tools are possible. Now we need to make them even better, through extensive research into algorithms, designs and