REGIMap: Register-Aware Application Mapping on Coarse-Grained Reconfigurable Architectures Mahdi Hamzeh, Aviral Shrivastava, and Sarma Vrudhula School of Computing, Informatics, and Decision Systems Engineering Arizona State University June 2013 This work was supported in part by CSR-EHS 0509540, CCF-0916652, CCF 1055094, NSF IUCRC for Embedded Systems (IIP-0856090), Center for Embedded Systems grant DWS-0086; Science Foundation Arizona grant SRG 0211-07, Raytheon and by the Stardust Foundation.
REGIMap: Register-Aware Application Mapping on Coarse-Grained Reconfigurable Architectures
Mahdi Hamzeh, Aviral Shrivastava, and Sarma Vrudhula
School of Computing, Informatics, and Decision Systems Engineering
Arizona State University
June 2013
This work was supported in part by CSR-EHS 0509540, CCF-0916652, CCF 1055094, NSF IUCRC for Embedded Systems (IIP-0856090), Center for Embedded Systems grant DWS-0086; Science Foundation Arizona grant SRG 0211-07, Raytheon and by the Stardust Foundation.
Accelerators for Energy Efficiency
• Demand for performance
• Power consumption
• Technology scaling
[Figure: performance (Giga Ops per sec) versus power (W) for ADRES [1] CGRA, Intel Core i7, and NVIDIA Tesla™ c2050; efficiency labels: 60 GOpS/W, 1.4 GOpS/W, 4.3 GOpS/W]
[Diagram labels: Core, Accelerator, Private cache (×2), Shared Cache]
[1] BOUWENS, F., BEREKOVIC, M., SUTTER, B. D., AND GAYDADJIEV, G. Architecture enhancements for the ADRES coarse-grained reconfigurable array. In Proc. HiPEAC (2008), pp. 66–81.
Coarse-grained Reconfigurable Architectures
• 2D array of Processing Elements (PEs)
• ALU + Local register file → PE
• Mesh interconnection
• Shared data bus
– Data memory
• PE inputs:
– 4 Neighboring PEs
– Local register file
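The PE input structure listed above can be sketched in a few lines of Python. This is only an illustration: the row-major PE numbering, the function names `mesh_neighbors` and `pe_inputs`, and the choice of a mesh whose edges do not wrap around are assumptions, not details taken from the slides.

```python
def mesh_neighbors(pe, rows, cols):
    """Return the ids of the PEs directly above, below, left and right of `pe`.

    PEs are numbered row by row (an assumption). Border PEs get fewer than
    four neighbours because this sketch assumes the mesh does not wrap around.
    """
    r, c = divmod(pe, cols)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [nr * cols + nc
            for nr, nc in candidates
            if 0 <= nr < rows and 0 <= nc < cols]


def pe_inputs(pe, rows, cols):
    """List the possible operand sources of a PE: neighbours plus its own register file."""
    return mesh_neighbors(pe, rows, cols) + ["local_rf"]


if __name__ == "__main__":
    # Operand sources of PE 5 in a 4 x 4 array.
    print(pe_inputs(5, 4, 4))
```

A mapper would use such a function to decide whether a consumer operation placed on one PE can read the result of a producer placed on another.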
Map Loops on CGRA and Minimize Initiation Interval
[Figure: operations a, b, c, d mapped onto PEs P and Q across time steps 1–4]
II is the performance metric
Register utilization decreases II
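The initiation interval (II) is the number of cycles between the starts of successive loop iterations, so a smaller II means higher throughput. A common way to judge a mapping is to compare its II with the minimum initiation interval (MII), the larger of a resource bound and a recurrence bound. The sketch below uses the standard modulo-scheduling definitions; the function names are illustrative, and the resource bound counts only PEs (it ignores the shared data bus), which is a simplifying assumption.

```python
from math import ceil


def res_mii(num_ops, num_pes):
    """Resource bound: every operation needs one PE slot per iteration."""
    return ceil(num_ops / num_pes)


def rec_mii(cycles):
    """Recurrence bound over the dependence cycles of the loop.

    `cycles` holds one (total_latency, total_iteration_distance) pair per cycle.
    A loop without recurrences gets the trivial bound 1.
    """
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)


def mii(num_ops, num_pes, cycles=()):
    """Minimum initiation interval: the tighter of the two bounds."""
    return max(res_mii(num_ops, num_pes), rec_mii(cycles))


def performance_ratio(mii_value, achieved_ii):
    """MII / II; a value of 1.0 means the mapping reached the lower bound."""
    return mii_value / achieved_ii


if __name__ == "__main__":
    bound = mii(num_ops=20, num_pes=16, cycles=[(3, 1)])
    print(bound, performance_ratio(bound, achieved_ii=4))
```

A loop is resource bounded when `res_mii` dominates and recurrence bounded when `rec_mii` dominates.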
Register Files for Inter-Iteration Dependencies
[Figure: operations a, b, c, e, f mapped onto PEs P and Q across time steps 1–8]
Register Utilization is essential for Inter-iteration Data Dependencies
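One way to see why registers matter here: when iterations start every II cycles, a value that must stay alive for several cycles overlaps with the copies produced by later iterations. The sketch below counts how many copies are live at once, using the usual modulo-scheduling argument. Treating each live copy as one register file entry, and the name `live_copies`, are assumptions made for illustration; the slides do not give this formula.

```python
from math import ceil


def live_copies(def_time, last_use_time, ii):
    """Maximum number of simultaneously live copies of one loop value.

    A new copy is produced every `ii` cycles and each copy lives for
    (last_use_time - def_time) cycles, so at most ceil(lifetime / ii)
    copies overlap. Assumption: each live copy occupies one register.
    """
    lifetime = last_use_time - def_time
    if lifetime <= 0:
        raise ValueError("a value must be used after it is defined")
    return ceil(lifetime / ii)


if __name__ == "__main__":
    # A value defined at cycle 1 and last used at cycle 6, with II = 2.
    print(live_copies(1, 6, ii=2))
```

Under this model, lowering II raises the register demand of long-lived values, which is why a mapper that cannot use the register file may be forced to a larger II.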
Contributions
• Size of resource graph ≈ O(n)
• Partition the resources into n+1 partitions
• Huge number of possible partitions (exponential)
• Assign operations to sets such that
– All operations are mapped
– Data dependencies between operations are obeyed
• General problem formulation
• Reduce search space
– Partition the problem into scheduling and integrated placement and register allocation
– No registers in the resource graph
• Constructive search
• Integrated placement and register allocation
• REGIMap
– Schedule the DFG
– Construct the resource graph
– Construct a compatibility graph between the DFG and the resource graph
– Model the register requirement of operations in the weights of arcs in the compatibility graph
– Find a restricted maximal clique
[Figure: DFG with operations a, b, c, d; PEs P and Q; compatibility graph whose nodes pair a resource with each of the operations a, b, c, d]
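The compatibility-graph idea in the steps above can be illustrated with a heavily simplified sketch. It is not the REGIMap algorithm: it has no arc weights, does not check register file capacity, and finds the clique by plain backtracking. The assumptions are that the DFG is already scheduled, that a value consumed one cycle after it is produced may come from the same or an adjacent PE, and that a value consumed later must stay in the producer's local register file (so the consumer sits on the same PE). All names are hypothetical.

```python
def adjacent(p, q, cols):
    """True if PEs p and q are the same PE or mesh neighbours (row-major ids)."""
    (r1, c1), (r2, c2) = divmod(p, cols), divmod(q, cols)
    return abs(r1 - r2) + abs(c1 - c2) <= 1


def compatible(n1, n2, time, deps, ii, cols):
    """Decide whether two (pe, op) nodes may appear in the same mapping."""
    (p1, o1), (p2, o2) = n1, n2
    if o1 == o2:
        return False  # an operation is placed exactly once
    if p1 == p2 and time[o1] % ii == time[o2] % ii:
        return False  # same PE in the same modulo time slot
    for (u, pu), (v, pv) in (((o1, p1), (o2, p2)), ((o2, p2), (o1, p1))):
        if (u, v) in deps:
            gap = time[v] - time[u]
            if gap == 1 and not adjacent(pu, pv, cols):
                return False  # result cannot reach the consumer in one hop
            if gap > 1 and pu != pv:
                return False  # assumed: value waits in the producer's registers
    return True


def find_mapping(ops, deps, time, ii, rows, cols):
    """Search for a clique that contains one node per operation.

    Returns a dict op -> pe, or None if no such clique exists for this II.
    """
    pes = range(rows * cols)
    order = sorted(ops, key=lambda o: time[o])

    def extend(i, chosen):
        if i == len(order):
            return {o: p for p, o in chosen}
        for p in pes:
            node = (p, order[i])
            if all(compatible(node, c, time, deps, ii, cols) for c in chosen):
                result = extend(i + 1, chosen + [node])
                if result is not None:
                    return result
        return None

    return extend(0, [])


if __name__ == "__main__":
    ops = ["a", "b", "c", "d"]
    deps = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}
    time = {"a": 0, "b": 1, "c": 1, "d": 2}
    print(find_mapping(ops, deps, time, ii=2, rows=1, cols=2))
    print(find_mapping(ops, deps, time, ii=1, rows=1, cols=2))
```

With four operations and two PEs, the search succeeds for II = 2 and returns `None` for II = 1, where only two PE slots exist per iteration.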
Experimental Setup
• Loops from SPEC2006 and multimedia benchmarks
• 4 × 4 CGRA with enough instruction and data memory
• Shared data bus for each row
• Latency is 1 cycle
• Compared with register-aware DRESC [2]
[2] DE SUTTER, B., COENE, P., VANDER AA, T., AND MEI, B. Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays. In Proc. LCTES (2008), pp. 151–160.
Mapping Results
[Figure: two bar charts of performance ratio (MII/II), from 0 to 1, for REGIMap and DRESC, one for register file size 2 and one for register file size 4. Resource-bounded loops: Swim_Calc, YUV2RGB, Sobel, Lowpass, SOR, Laplace, GSR, Wavelet, Forward, Compress, Mpeg2, and their average. Recurrence-bounded loops: h264ref, gobmk, hmmer, dealII, bzip2, astar, omnetpp, perl, povray, sphinx, gcc, soplex, libquantum, and their average.]
REGIMap improves performance by 1.8X on average over DRESC*
Reasonable Running Time
[Figure: two bar charts of compilation time in seconds (log scale) for REGIMap and DRESC over the resource-bounded and recurrence-bounded loops, one for register file size 2 and one for register file size 4]
REGIMap maps loops on average 56X faster than DRESC*
• Accelerators for energy efficiency
• Coarse-grained reconfigurable architecture, a programmable accelerator
• Contributions
– Problem formulation
– Search space reduction
– Constructive search
– Integrated register allocation
– REGIMap
• Better mappings: 1.8X performance improvement
• On average 56 times faster compilation
• Please join my poster presentation for more details