Warp Processors
Frank Vahid (Task Leader)
Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Task ID: 1331.001, July 2005 – June 2008
Ph.D. students:
  Greg Stitt, Ph.D. expected June 2006
  Ann Gordon-Ross, Ph.D. expected June 2006
  David Sheldon, Ph.D. expected 2009
  Ryan Mannion, Ph.D. expected 2009
  Scott Sirowy, Ph.D. expected 2010
Industrial Liaisons:
  Brian W. Einloth, Motorola
  Serge Rutman, Dave Clark, Intel
  Jeff Welser, IBM
On-chip profiler
  Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. ACM/IEEE Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2003; extended version in special issue "Best of CASES/MICRO" of IEEE Trans. on Comp., Oct 2005.
Warp-tuned FPGA
  A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe Conf. (DATE), Feb 2004.
On-chip CAD, including Just-in-Time FPGA compilation
  A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), 2005.
  A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
  Dynamic FPGA Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid, and S. Tan. Design Automation Conf. (DAC), June 2004.
  A Codesigned On-Chip Logic Minimizer. R. Lysecky and F. Vahid. ISSS/CODES conf., Oct 2003.
  Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky and F. Vahid. Design Automation Conf. (DAC), 2003.
  On-Chip Logic Minimization. R. Lysecky and F. Vahid. Design Automation Conf. (DAC), 2003.
  The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic. G. Stitt and F. Vahid. IEEE Design and Test of Computers, Nov./Dec. 2002.
  Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), Nov. 2002.
Related
  A Self-Tuning Cache Architecture for Embedded Systems. C. Zhang, F. Vahid and R. Lysecky. ACM Transactions on Embedded Computing Systems (TECS), Vol. 3, Issue 2, May 2004.
  Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
Task Description
Warp processing background
  Two seed SRC CSR grants (2002-2005) showed feasibility
  Idea: Transparently move critical binary regions from microprocessor to FPGA
    10x perf./energy gains or more
Task – Mature warp technology
  Year 1 (in progress)
    Automatic high-level construct recovery from binaries
    In-depth case studies (with Freescale)
      Also discovered unanticipated problem, developed solution
    Warp-tailored FPGA prototype (with Intel)
  Years 2/3
    Reduce memory bottleneck by using smart buffer
    Investigate domain-specific-FPGA concepts (with Freescale)
    Consider desktop/server domains (with IBM)
Automatic High-Level Construct Recovery from Binaries
Challenge: Binary lacks high-level constructs (loops, arrays, ...)
Decompilation can help recover
  Extensive previous work (e.g., [Cifuentes 93, 94, 99])
multiplies by shifts and adds) prevents synthesis from using hard-core multipliers, sometimes hurting circuit performance
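[Figure: FIR Filter. Dataflow graph computing A[i] = B[i]*10 + B[i+1]*18 + B[i+2]*34 + B[i+3]*66.]
[Figure: Strength-Reduced FIR Filter. Each multiplication is replaced by two shifts and an add: (B[i] << 3) + (B[i] << 1), (B[i+1] << 4) + (B[i+1] << 1), (B[i+2] << 5) + (B[i+2] << 1), (B[i+3] << 6) + (B[i+3] << 1).]
Strength-reduced multiplication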
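The equivalence between the FIR filter and its strength-reduced form can be checked with a small Python sketch. The function names `fir_multiply` and `fir_shift_add` are hypothetical and chosen only for illustration; the constants (10, 18, 34, 66) and shift amounts (3, 4, 5, 6, each paired with a shift by 1) are the ones shown in the slide's figure, since 10 = 8 + 2, 18 = 16 + 2, 34 = 32 + 2, and 66 = 64 + 2.

```python
def fir_multiply(B, i):
    """One output sample of the FIR filter, written with multiplications."""
    return B[i] * 10 + B[i + 1] * 18 + B[i + 2] * 34 + B[i + 3] * 66


def fir_shift_add(B, i):
    """The same sample after strength reduction: each multiply by a
    constant becomes two left shifts and an add."""
    return (
        ((B[i] << 3) + (B[i] << 1))            # B[i]   * (8 + 2)
        + ((B[i + 1] << 4) + (B[i + 1] << 1))  # B[i+1] * (16 + 2)
        + ((B[i + 2] << 5) + (B[i + 2] << 1))  # B[i+2] * (32 + 2)
        + ((B[i + 3] << 6) + (B[i + 3] << 1))  # B[i+3] * (64 + 2)
    )


samples = [3, -7, 12, 0, 5, 9]
# Both forms produce the same value at every valid window position.
results_match = all(
    fir_multiply(samples, i) == fir_shift_add(samples, i)
    for i in range(len(samples) - 3)
)
```

The two forms compute identical values; the difference is only in which hardware resources a synthesis tool would infer from each (multipliers versus shifters and adders).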
Strength Promotion
Solution: Promote strength-reduced code to muls
  Identify strength-reduced subgraphs
  Replace with multiplication
[Figure: animation of the strength-reduced FIR filter dataflow graph. One shift-add subgraph at a time is replaced by a multiplication (B[i]*10, then B[i+1]*18, B[i+2]*34, B[i+3]*66) until the original FIR filter graph is recovered.]
Strength promotion lets synthesis decide on strength reduction based on available resources
  Synthesis can of course apply strength reduction itself
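The two steps named on this slide (identify strength-reduced subgraphs, replace with multiplication) can be illustrated with a minimal Python sketch over a toy expression tree. This is not the authors' decompiler: the node classes `Var`, `Shl`, `Add`, `Mul` and the functions `promote` and `evaluate` are hypothetical, and the sketch only recognizes the simple pattern of two shifted (or already promoted) copies of the same operand being added.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Shl:  # operand << amount
    operand: "Expr"
    amount: int


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Mul:  # operand * constant
    operand: "Expr"
    constant: int


Expr = Union[Var, Shl, Add, Mul]


def as_scaled(e):
    """Return (operand, constant) if e is a constant multiple of an operand."""
    if isinstance(e, Shl):
        return e.operand, 1 << e.amount
    if isinstance(e, Mul):
        return e.operand, e.constant
    return None


def promote(e):
    """Replace shift-add subgraphs over one operand with a multiplication."""
    if isinstance(e, Add):
        left, right = promote(e.left), promote(e.right)
        l, r = as_scaled(left), as_scaled(right)
        if l is not None and r is not None and l[0] == r[0]:
            return Mul(l[0], l[1] + r[1])
        return Add(left, right)
    return e


def evaluate(e, env):
    """Evaluate an expression tree given variable values in env."""
    if isinstance(e, Var):
        return env[e.name]
    if isinstance(e, Shl):
        return evaluate(e.operand, env) << e.amount
    if isinstance(e, Mul):
        return evaluate(e.operand, env) * e.constant
    return evaluate(e.left, env) + evaluate(e.right, env)


b0, b1 = Var("B[i]"), Var("B[i+1]")
reduced = Add(Add(Shl(b0, 3), Shl(b0, 1)), Add(Shl(b1, 4), Shl(b1, 1)))
promoted = promote(reduced)  # Add(Mul(B[i], 10), Mul(B[i+1], 18))
```

After promotion the tree contains explicit multiplications, so a synthesis tool is free to map them to hard-core multipliers or to re-apply strength reduction, depending on the resources available.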
New Decompilation Methods' Benefits
Rerolling
  Speedups from better use of smart buffers
  Other potential benefits: faster synthesis, less area
Strength promotion
  Speedups from fewer cycles
  Speedups from faster clock
New methods to be developed
  e.g., pointer DS to arrays
[Chart: Speedups from Loop Rerolling]
[Chart: strength promotion speedups. Y axis = speedup, X axis = x_y_z => x adder constraint, y multiplier constraint, z = adders needed for reduction]
[Chart: No Strength Promotion vs. Strength Promotion. Y axis = clock frequency, X axis = adders needed for reduction]
Decompilation is Effective Even with High Compiler-Optimization Levels
[Chart: Average Speedup of 10 Examples, for MIPS -O1, MIPS -O3, ARM -O1, ARM -O3, MicroBlaze -O1, MicroBlaze -O3]
Speedups similar on MIPS for -O1 and -O3 optimizations
Speedups similar on ARM for -O1 and -O3 optimizations
Speedups similar between ARM and MIPS
  Complex instructions of ARM didn't hurt synthesis
MicroBlaze speedups much larger
  MicroBlaze is a slower microprocessor
  -O3 optimizations were very beneficial to hardware
Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
Research Problem: Make Synthesis from Binaries Competitive with Synthesis from High-Level Languages
Performed in-depth study with Freescale
  H.264 video decoder
  Highly-optimized proprietary code, not reference code
    Huge difference
    A benefit of SRC collaboration
  Research question: Is synthesis from binaries competitive on highly-optimized code?
  Several-month study
MPEG 2 vs. H.264: H.264 gives better quality, or smaller files, using more computation
Optimized H.264
Larger than most benchmarks
  H.264: 16,000 lines
  Previous work: 100 to several thousand lines
Highly-optimized
  H.264: Many man-hours of manual optimization
  10x faster than reference code used in previous works
[Chart legend: Ideal Speedup (Zero-time Hw Execution); Speedup After Rewrite (C Partitioning); Speedup After Rewrite (Binary Partitioning); Speedup from C Partitioning; Speedup from Binary Partitioning]
Studied More Benchmarks, Developed More Guidelines
Studied guidelines further on standard benchmarks
  Further synthesis speedups (again, independent of C vs. binary issue)
  More guidelines to be developed
Publications
  Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005 (joint publication with Freescale)
  Submitted: A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar, 2006.
[Chart: speedups for g3fax, mpeg2, jpeg, brev, fir, crc. Series: Sw, Hw/sw with original code, Hw/sw with guidelines. Axis from 0 to 10; labels 573, 1616, 842 also appear on the chart.]
[Chart: Performance Overhead and Size Overhead for g3fax, mpeg2, jpeg, brev, fir, crc. Axis ticks run from -30% to 30%; labels -88% and -47% also appear on the chart.]
Warp-Tailored FPGA Prototype
Developed FPGA fabric tailored to fast/small-memory on-chip CAD
Building chip prototype with Intel
  Created synthesizable VHDL models, running through Intel shuttle tool flow
  Plan to incorporate with ARM processor and other IP on shuttle seat
  Bi-weekly phone meetings with Intel engineers since summer 2005, ongoing, scheduled tapeout 2006 Q3
[Figure: warp-tailored FPGA architecture. Blocks: DADG, LCH, 32-bit MAC, and a Configurable Logic Fabric built from CLBs and switch matrices (SM). CLB detail: two LUTs with inputs a-f and outputs o1-o4, plus connections to adjacent CLBs. Switch matrix detail: channels 0-3 and 0L-3L on each side.]
Industrial Interactions
Freescale
  Numerous phone conferences, emails, and reports on technical subjects
  Co-authored paper (CODES/ISSS'05), another pending
  Summer internship: Scott Sirowy (new UCR graduate student), summer 2005, Austin
Intel
  Three visits by PI, one by graduate student Roman Lysecky, to Intel Research in Santa Clara
  PI presented at Intel System Design Symposium, Nov. 2005
  PI served on Intel Research Silicon Prototyping Workshop panel, May 2005
  Participating in Intel's Research Shuttle (chip prototype), bi-weekly phone conferences since summer 2005 involving PI, Intel engineers, and Roman Lysecky (now Prof. at UA)
IBM
  Embarking on studies of warp processing results on server applications
  UCR group to receive Cell-based prototyping platform (w/ Prof. Walid Najjar)
Several interactions with Xilinx also
Task Description – Coming Up
Warp processing background
  Two seed SRC CSR grants (2002-2005) showed feasibility
  Idea: Transparently move critical binary regions from microprocessor to FPGA
    10x perf./energy gains or more
Task – Mature warp technology
  Years 1/2 (in progress)
    Automatic high-level construct recovery from binaries
    In-depth case studies (with Freescale)
      Also discovered unanticipated problem, developed solution
    Warp-tailored FPGA prototype (with Intel)
  Years 2/3 – All three sub-tasks just now underway
    Reduce memory bottleneck by using smart buffer
    Investigate domain-specific-FPGA concepts (with Freescale)
    Consider desktop/server domains (with IBM)
Recent Publications
New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2005.
Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005. (Co-authored paper with Freescale)
Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. IEEE Trans. on Computers, Special Issue- Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct. 2005.
A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April 2005.
A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.