Frank Vahid, UC Riverside 1 System-on-a-Chip Platform Tuning for Embedded Systems Frank Vahid Associate Professor Dept. of Computer Science and Engineering University of California, Riverside Also with the Center for Embedded Computer Systems at UC Irvine http://www.cs.ucr.edu/~vahid This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend
Frank Vahid, UC Riverside
1
System-on-a-Chip Platform Tuning for Embedded Systems
Frank Vahid
Associate Professor
Dept. of Computer Science and Engineering
University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend
Frank Vahid, UC Riverside 2
How Much is Enough?
Frank Vahid, UC Riverside 3
How Much is Enough?
Perhaps a bit small
Frank Vahid, UC Riverside 4
How Much is Enough?
Reasonably sized
Frank Vahid, UC Riverside 5
How Much is Enough?
Probably plenty big
Frank Vahid, UC Riverside 6
How Much is Enough?
More than typically necessary
Frank Vahid, UC Riverside 7
How Much is Enough?
Very few people could use this
Frank Vahid, UC Riverside 8
How Much is Enough for an IC?
1993: ~ 1 million logic transistors
[Figure: IC package and IC]
Perhaps a bit small
Frank Vahid, UC Riverside 9
How Much is Enough for an IC?
1996: ~ 5-8 million logic transistors
Reasonably sized
Frank Vahid, UC Riverside 10
How Much is Enough for an IC?
1999: ~ 10-50 million logic transistors
Probably plenty big
Frank Vahid, UC Riverside 11
How Much is Enough for an IC?
2002: ~ 100-200 million logic transistors
More than typically necessary
Frank Vahid, UC Riverside 12
How Much is Enough for an IC?
2008: >1 BILLION logic transistors
1993: 1 M
Perhaps very few people could design this
Point of diminishing returns
8-bit uC: ~15K
32-bit ARM: ~30K
MPEG dcd: ~1M
100M good enough
Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)
Frank Vahid, UC Riverside 27
Speedup Gained with Relatively Few Gates
[Figure: speedup (1.0 to 5.0) versus gates (0 to 25,000) for G721 (MB), ADPCM (MB), PEGWIT (MB), DH (NB), MD5 (NB), TL (NB), and URL (NB); one off-scale point annotated "2.05 at 90,000"]
Created several partitioned versions of each benchmark
Most speedup gained with first 20,000 gates; diminishing returns after that
Surprisingly few gates
Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002
Stitt and Vahid, IEEE Design and Test, Dec. 2002
J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).
Frank Vahid, UC Riverside 28
Other Types of Configurability
Microprocessor (other researchers): VLIW configurations, voltage scaling
Memory hierarchy
Our focus: build a highly-configurable cache that can be tuned to a particular program
Work by Chuanjun Zhang, along with Walid Najjar, at UCR
Frank Vahid, UC Riverside 29
Cache Contributes Much to Performance and Power
Well-known for performance
Energy
ARM920T: caches consume nearly half of total power (Segars 01)
M*CORE: unified cache consumes half of total power (Lee/Moyer/Arends 99)
[Figure: ARM920T, with labels Processor, L1 Cache, and Mem. Source: Segars ISSCC'01]
Frank Vahid, UC Riverside 30
Associativity Plays a Big Role
Reduces miss rate, thus improving performance
Impact on power and energy?
(Energy = Power * Time)
[Figure: miss rate (0.0% to 2.0%) versus associativity (1, 2, 4) for epic and mpeg2]
Frank Vahid, UC Riverside 31
Associativity is Costly
Associativity improves hit rate, but at the cost of more power per access
Are the power savings from reduced misses outweighed by the increased power per hit?
[Figure: Energy access breakdown for 8 Kbyte, 4-way set associative cache (considering dynamic power only). Components: sa_data, wordline_data, bitline_data, decode_data, data output driver, mux driver, comparator, bitline_tag, sa_tag, wordline_tag, decode_tag]
[Figure: Energy per access for 8 Kbyte cache. Energy per access (nJ, 0.0 to 1.0) versus associativity (1-way, 2-way, 4-way)]
Frank Vahid, UC Riverside 32
Associativity and Energy
Best performing cache is not always lowest energy
[Figure: miss rate (0.0% to 2.0%) versus associativity (1, 2, 4) for epic and mpeg2]
[Figure: normalized energy (0.0 to 1.0) versus associativity (1, 2, 4) for epic and mpeg2]
Significantly poorer energy
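The trade-off between miss rate and energy per access can be sketched with a simple dynamic-energy model: every access pays the per-access energy of the chosen associativity, and every miss pays an additional miss energy. This is a minimal illustration only. The function names, the miss rates, the per-access energies, and the miss energies below are all hypothetical and are not taken from the talk's measurements; static energy and processor stall energy are ignored.

```python
def total_energy_nj(accesses, miss_rate, energy_per_access_nj, energy_per_miss_nj):
    """Dynamic energy: each access pays the per-access cost, each miss pays extra."""
    misses = accesses * miss_rate
    return accesses * energy_per_access_nj + misses * energy_per_miss_nj


def best_associativity(configs, accesses, energy_per_miss_nj):
    """Return the associativity (key of configs) with the lowest total energy."""
    return min(
        configs,
        key=lambda ways: total_energy_nj(
            accesses, configs[ways][0], configs[ways][1], energy_per_miss_nj
        ),
    )


# Hypothetical values: associativity -> (miss rate, energy per access in nJ).
# Higher associativity lowers the miss rate but raises the energy per access.
HYPOTHETICAL_CONFIGS = {1: (0.020, 0.30), 2: (0.012, 0.45), 4: (0.010, 0.85)}

if __name__ == "__main__":
    # Two assumed miss energies (nJ), standing in for cheap and costly misses.
    for miss_cost_nj in (10.0, 100.0):
        ways = best_associativity(HYPOTHETICAL_CONFIGS, 1_000_000, miss_cost_nj)
        print(f"miss energy {miss_cost_nj} nJ: lowest total energy at {ways}-way")
```

With these assumed numbers the lowest-energy choice is one-way when a miss costs 10 nJ and two-way when a miss costs 100 nJ, while four-way has the lowest miss rate in both cases. That is the point of the slide: the best-performing cache is not always the lowest-energy one.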
Frank Vahid, UC Riverside 33
So What’s the Best Cache?
Looking at popular embedded processors, there’s obviously no standard cache
Dilemma
Direct mapped: good performance and energy for most programs
Four-way: good performance for all programs, but at cost of higher power per access for all programs
Do we design for the average case or the worst case?
[Table: instruction cache and data cache size, associativity (As.), and line size, by processor]
Frank Vahid, UC Riverside 34
Solution to the Dilemma
Configurable cache
Can be configured as four way, two way, or one way
Ways can be concatenated
Furthermore, ways can even be shut down to decrease total size
[Figure: memory and cache, with the cache shown as four-way, then two-way, then one-way (direct mapped cache)]
Frank Vahid, UC Riverside 35
Configurable Cache Design: Way Concatenation
[Figure: way-concatenation cache circuit. Address fields: tag address (a31 to a13), a12, a11, index (a10 to a5), line offset (a4 to a0). Labels: configuration circuit with a11, a12, reg0, reg1, and signals c0 to c3; tag part; data array (6x64 blocks); bitline; column mux; sense amps; mux driver; data output; critical path]
Small area and performance overhead
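The idea behind way concatenation can be sketched in software using the address fields shown on this slide: line offset a4 to a0, index a10 to a5, and the two bits a11 and a12 that feed the configuration circuit. The sketch below is an assumption-laden illustration, not the circuit itself. The function name, the numbering of the four banks, the choice of which banks are paired in two-way mode, and the tag widths per mode are all hypothetical.

```python
LINE_OFFSET_BITS = 5  # a4..a0, i.e. 32-byte lines
INDEX_BITS = 6        # a10..a5, i.e. 64 lines per bank


def decompose(addr, ways):
    """Split an address for a 4-bank cache configured as 4-, 2-, or 1-way.

    Returns (tag, index, offset, banks), where banks lists the banks that
    would be enabled and searched for this address.
    """
    if ways not in (1, 2, 4):
        raise ValueError("ways must be 1, 2, or 4")
    offset = addr & ((1 << LINE_OFFSET_BITS) - 1)
    index = (addr >> LINE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    a11 = (addr >> 11) & 1
    a12 = (addr >> 12) & 1
    if ways == 4:
        # All four banks are searched; a11 and a12 belong to the tag.
        banks, tag = [0, 1, 2, 3], addr >> 11
    elif ways == 2:
        # a11 selects one pair of banks (pairing is an assumption).
        banks, tag = [2 * a11, 2 * a11 + 1], addr >> 12
    else:
        # a12 and a11 together select a single bank: direct mapped.
        banks, tag = [2 * a12 + a11], addr >> 13
    return tag, index, offset, banks


if __name__ == "__main__":
    for ways in (4, 2, 1):
        print(ways, decompose(0x1ABC, ways))
```

In every mode the same 6-bit index selects a line within each bank, so the total capacity stays the same; only the number of banks activated per access changes, which is where the energy saving of the lower-associativity modes would come from.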
Frank Vahid, UC Riverside 36
Configurable Cache Experiments
Configurable cache with both way concatenation and way shutdown is superior on every benchmark
Considered Powerstone, MediaBench, and Spec2000
Tuning the cache to the program is important
Work submitted to High-Performance Computer Architectures 2003, Zhang, Vahid and Najjar
[Figure: normalized energy (0% to 100%, where 100% = 4-way conventional cache) per benchmark: padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjpeg, ucbqsort, v42, adpcm, epic, jpeg, mpeg2, pegwit, g721, art, mcf, parser, vpr, and Average. Legend: Cnv, I1D1, cnct, shut, both. Off-scale bars labeled 114%, 268%, and 116%]
Frank Vahid, UC Riverside 37
Conclusions
Trend is away from semi-custom IC fabrication
Big enough; other pressures encourage buying pre-fabricated platforms
Platforms must be highly configurable
To be useful for a variety of applications, and hence mass produced
We have discussed
Software speedup/energy benefits of on-chip configurable logic: 3x speedups with only ~10,000 gates
Creating a highly-configurable cache architecture: 40% energy savings compared to conventional cache
Current/future work (collaborators: Walid Najjar UCR, Nik Dutt UCI)
Automatically partitioning software loops to configurable logic
Several approaches: platform-assisted, and dynamically on-chip
Work being done by Roman Lysecky, Susan Cotterell, Greg Stitt, and Shawn Nematbakhsh at UCR
Automatically tuning a configurable cache