A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation
Roman Lysecky (a), Frank Vahid (a*), Sheldon X.-D. Tan (b)
(a) Department of Computer Science and Engineering
(b) Department of Electrical Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx
2/22
Introduction: Standard Binary - Separating Function and Architecture
[Figure: SW source -> Profiling / Standard Compiler -> Binary (x86 Binary)]
Software binaries of the past
  Binary reflected the specific language of the underlying architecture (limited portability)
Current “standard binary”
  Concept: separate function from detailed architecture
  Develop new architectures for existing applications
  Trend towards dynamic translation and optimization
3/22
Introduction: But Today’s Binaries are More than just Software
[Figure: SW sources -> Profiling / Standard Compiler -> SW Binary, targeting Processor1, Processor2, and Processor3; SW and HW sources -> Profiling / Compiler/Synthesis -> Binary, targeting platforms that combine processors (Proc.) with FPGAs]
4/22
Introduction: Just-in-Time FPGA Compilation?
JIT FPGA compilation
  Idea: standard binary for FPGA
  Similar benefits as standard binary for microprocessor
    Portability, transparency, standard tools
  Embedded JIT compilation tools optimized for each FPGA
[Figure: VHDL/Verilog -> Profiling / Standard CAD Tools -> Std. HW Binary -> JIT FPGA Comp. on each target FPGA]
5/22
Introduction: One Use of JIT FPGA Compilation
[Figure: a cable TV company distributes a feature upgrade as a single SW binary for ARM7, ARM9, ARM10, and ARM11 processors]
6/22
Introduction: One Use of JIT FPGA Compilation
[Figure: the cable TV company's feature upgrade now includes HW; each of the four processor + FPGA platforms (FPGA 1 to FPGA 4) receives the SW binary plus its own HW netlist (HW Netlist1 to HW Netlist4, built from HW1 to HW4)]
7/22
Introduction: One Use of JIT FPGA Compilation
[Figure: the feature upgrade is distributed as one SW binary and one HW binary; each of the four processor + FPGA platforms (FPGA 1 to FPGA 4) runs its own JIT FPGA Comp.]
8/22
Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Figure: warp processor with µP, I$, D$, FPGA, Profiler, and Dynamic Part. Module (DPM); bar chart of time and energy for SW Only vs. HW/SW]
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic & update software binary
5. Partitioned application executes faster with lower energy consumption
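The profiling and partitioning steps above can be pictured with a small sketch. It is only an illustration, not the warp processor's actual profiler or partitioning module: the region names, the execution trace, and the 80% coverage threshold are all hypothetical.

```python
from collections import Counter

def profile(trace):
    """Count how often each region (e.g. a loop label) appears in the trace."""
    return Counter(trace)

def select_critical_regions(counts, coverage=0.8):
    """Pick the most frequently executed regions until they account for
    at least `coverage` of all recorded executions (assumed threshold)."""
    total = sum(counts.values())
    chosen, covered = [], 0
    for region, n in counts.most_common():
        if covered >= coverage * total:
            break
        chosen.append(region)
        covered += n
    return chosen

# Hypothetical execution trace: each entry names the region that ran.
trace = ["loop_a"] * 70 + ["loop_b"] * 20 + ["init"] * 5 + ["cleanup"] * 5
critical = select_critical_regions(profile(trace))
print(critical)  # candidate regions to move to hardware
```

In this toy trace the two loops cover 90% of the executions, so they would be the candidates for hardware; the remaining regions stay in software.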
Introduction: Existing FPGAs Not Suitable for JIT FPGA Compilation
Existing FPGAs require extremely complex CAD tools
  Designed to handle large arbitrary circuits, ASIC prototyping, etc.
  Require long execution times and very large memory usage
  Not suitable for dynamic on-chip execution
[Figure: CAD flow with execution times: Log. Syn. (1 min), Tech. Map (1 min), Place (1-2 mins), Route (2-30 mins); memory labels: 50 MB, 60 MB, 10 MB, 10 MB]
11/22
JIT FPGA Compilation: CAD-Oriented FPGA
Solution: Develop a custom CAD-oriented FPGA
  Careful simultaneous design of FPGA and CAD
  FPGA features evaluated for impact on CAD
  Enables development of fast, lean JIT FPGA compilation tools
[Figure: JIT FPGA Comp. flow on the FPGA: Logic Synthesis, Tech. Mapping/Packing, Placement, Routing; time labels: 1s, <1s, <1s, 10s; memory labels: .5 MB, 1 MB, 1 MB, 3.6 MB]
Lysecky/Vahid, DATE’04
12/22
Simple Configurable Logic Fabric: CAD-Oriented FPGA
[Figure: array of CLBs and SMs]
Simple Configurable Logic Fabric (CLF)
  Hundreds of existing commercial and research FPGA fabrics
    Most designed to balance circuit density and speed
  Analyzed FPGA features to determine their impact on CAD
  Designed our CLF in conjunction with JIT FPGA compilation tools
  Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
  Each CLB is directly connected to an SM
    Along with SM design, allows for design of lean JIT routing
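One way to picture the fabric described above is as a small graph. The sketch below is an assumption-laden toy model, not the authors' CLF or ROCR: it pairs every CLB with exactly one SM (as the slide states) and additionally assumes each SM links to its four nearest neighbor SMs, which this transcript does not spell out. The function names are hypothetical.

```python
from collections import deque

def build_fabric(rows, cols):
    """Adjacency sets for a rows x cols grid of CLB/SM pairs (toy model)."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for r in range(rows):
        for c in range(cols):
            link(("CLB", r, c), ("SM", r, c))         # CLB attaches to its own SM
            if r + 1 < rows:
                link(("SM", r, c), ("SM", r + 1, c))  # assumed vertical SM link
            if c + 1 < cols:
                link(("SM", r, c), ("SM", r, c + 1))  # assumed horizontal SM link
    return adj

def hops(adj, src, dst):
    """Breadth-first search: fewest edges between two nodes, or None."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

fabric = build_fabric(2, 2)
print(hops(fabric, ("CLB", 0, 0), ("CLB", 1, 1)))
```

Because every CLB has a single entry point into the routing network, a router only has to search paths between SMs, which hints at why such a fabric can keep routing tools small.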
123 MCNC benchmark circuits
  Circuit complexity ranges from a few LUTs to tens of thousands of LUTs
Performed technology mapping, packing, and placement using FlowMap, T-VPack, and VPR’s bounding box placement
Routed each HW benchmark circuit using:
  VPR’s timing-driven router
  VPR’s fast timing-driven router (-fast option)
  Riverside On-Chip Router (ROCR)
18/22
Scalability of On-chip Routing: Memory Usage
[Figure: bar chart of minimum, average, and maximum memory usage (KBytes) for VPR, VPR (Fast), and ROCR; y-axis 0 to 140000; data labels: 126602, 8352, 113235]
VPR requires over 100 MB of memory on average
ROCR requires at most 8.3 MB
VPR requires 18X more memory than ROCR on average
19/22
Scalability of On-chip Routing: Algorithm Performance
[Figure: execution time (s), 0 to 200, vs. circuit size (CLBs), 0 to 4000, for VPR, VPR (Fast), and ROCR]
ROCR is over 40X faster than VPR for small HW circuits
ROCR is 2X-3X faster than VPR for large HW circuits
20/22
Scalability of On-chip Routing: Critical Path
[Figure: critical path (ns), 0 to 200, vs. circuit size (CLBs) for VPR, VPR (Fast), and ROCR]
ROCR for the largest HW circuit: 19% longer critical path than VPR, 2.6% shorter than VPR (Fast)
ROCR on average: 30%/27% longer critical path than VPR/VPR (Fast)
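The "longer" and "shorter" percentages above are relative differences against a baseline router. A minimal sketch of that arithmetic follows; the delay values are hypothetical placeholders chosen for round numbers, not measurements from the study.

```python
def percent_difference(candidate, baseline):
    """Signed difference of candidate relative to baseline, in percent.
    Positive means the candidate is larger (e.g. a longer critical path)."""
    return 100.0 * (candidate - baseline) / baseline

# Hypothetical critical-path delays in ns (not values from the slides).
rocr_ns, vpr_ns = 119.0, 100.0
longer = percent_difference(rocr_ns, vpr_ns)
print(f"{longer:.1f}% longer than the baseline")
```

A negative result would correspond to a shorter critical path than the baseline.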
21/22
Scalability of On-chip Routing: Wire Segments
[Figure: wire segments, 0 to 90000, vs. circuit size (nets) for VPR, VPR (Fast), and ROCR]
ROCR requires 2%/8% fewer wire segments than VPR/VPR (Fast) for larger HW circuits
22/22
Conclusions and Future Work
Conclusions
  Demonstrated ROCR scales well as circuit size increases
    On average 2X faster than VPR’s fast timing-driven router
    Requiring 18X less memory than VPR
  Produces good circuit quality
    Critical path 27% longer than VPR (Fast) on average
    2.6% shorter critical path for largest HW circuit
    Requires on average 5% fewer wire segments
Future Work
  Current project: Major microprocessor vendor is fabricating our custom FPGA
  Improvements to Riverside On-Chip Router (ROCR)
    Improve ROCR’s performance for large HW circuits
    Incorporating timing information to achieve
    Analyze the scalability of ROCR as circuit size approaches FPGA capacity
  JIT FPGA Compilation
    Development of standard HW binary
    Support more complex FPGA architectures
    JIT FPGA compilation