Thread Warping: A Framework for Dynamic Synthesis of Thread Accelerators
Greg Stitt, Dept. of ECE, University of Florida
Frank Vahid, Dept. of CS&E, University of California, Riverside (also with the Center for Embedded Computer Systems, UC Irvine)
This research was supported in part by the National Science Foundation and the Semiconductor Research Corporation
2/30
Background
Motivated by commercial dynamic binary translation of the early 2000s, e.g., Transmeta Crusoe “code morphing”
Warp processing (Lysecky/Stitt/Vahid 2003-2007): dynamically translate binary to circuits on FPGAs
[Figure: binary translation converts an x86 binary to a VLIW binary for a VLIW µP; warp processing's binary “translation” maps a binary onto a µP plus FPGA; both target performance]
3/30
Warp Processing Background
Initially, software binary loaded into instruction memory
[Figure: µP, FPGA, profiler, and on-chip CAD]
Decompilation is surprisingly effective at recovering high-level program structures (Stitt et al., ICCAD’02, DAC’03, CODES/ISSS’05, ICCAD’05, FPGA’05, TODAES’06, TODAES’07)
Recover loops, arrays, subroutines, etc. – needed to synthesize good circuits
Each thread accesses different addresses – but addresses may overlap
17/30
Framework
[Figure: framework tool flow. Blocks: Thread Queue, Queue Analysis, Accelerator Synthesis, Accelerator Library, Accelerator Instantiation, Place&Route, FPGA. Decisions (true/false): "Accelerators Synthesized?", "Not In Library?", ending in Done. Data: Thread Functions, Thread Counts, Netlist, Bitfile, Thread Group Table, Schedulable Resource List, Updated Binary]
Also developed initial algorithms for:
Queue analysis
Accelerator instantiation
OS scheduling of threads to accelerators and cores
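The framework figure lists thread functions and thread counts alongside queue analysis. A minimal sketch of that bookkeeping, assuming each waiting thread is identified by an integer function ID, might look like the following. The names `fn_count` and `count_threads` and the fixed-size table are hypothetical; the slides do not give the actual algorithm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry: one thread function and how many
 * threads in the queue are waiting to run it. */
typedef struct {
    int    fn_id;  /* assumed integer ID of the thread function */
    size_t count;  /* number of waiting threads with that function */
} fn_count;

/* Tally waiting threads per thread function.
 * queue[i] holds the function ID of the i-th waiting thread.
 * Returns the number of distinct functions written to out[]. */
size_t count_threads(const int *queue, size_t n,
                     fn_count *out, size_t max_out)
{
    size_t used = 0;
    assert(queue != NULL || n == 0);
    assert(out != NULL || max_out == 0);

    for (size_t i = 0; i < n; i++) {
        size_t j = 0;
        while (j < used && out[j].fn_id != queue[i])
            j++;
        if (j == used) {              /* first thread of this function */
            if (used == max_out)
                continue;             /* table full: skip this entry */
            out[used].fn_id = queue[i];
            out[used].count = 0;
            used++;
        }
        out[j].count++;
    }
    return used;
}
```

A CAD tool could then treat functions with high counts as candidates for synthesis; that selection rule is likewise an assumption.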
18/30
Thread Warping Example
int main( ) {
    . . .
    for (i = 0; i < 50; i++) {
        thread_create( filter, a, b, i );
    }
    . . .
}

void filter( int a[53], int b[50], int i ) {
    b[i] = avg( a[i], a[i+1], a[i+2], a[i+3] );
}
[Figure: multiprocessor with an FPGA. Labels: µPs running the OS, main(), filter(), and on-chip CAD; thread queue; queue analysis; thread functions: filter()]
filter() threads execute on available cores
Remaining threads added to queue
OS invokes CAD (due to queue size or periodically)
CAD tools identify filter() for synthesis
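The steps on this slide (run threads on free cores, queue the rest, invoke CAD on queue size or periodically) can be modeled roughly as below. This is only a sketch: `dispatch`, `should_invoke_cad`, the threshold, and the tick-based period are all hypothetical, and the slides do not specify the real OS policy.

```c
#include <assert.h>

/* Hypothetical OS bookkeeping for a batch of newly created threads. */
typedef struct {
    int running;  /* threads placed on free cores */
    int queued;   /* threads left waiting in the thread queue */
} dispatch_result;

/* Place as many threads as there are free cores; queue the remainder. */
dispatch_result dispatch(int new_threads, int free_cores)
{
    dispatch_result r;
    assert(new_threads >= 0 && free_cores >= 0);
    r.running = new_threads < free_cores ? new_threads : free_cores;
    r.queued  = new_threads - r.running;
    return r;
}

/* Invoke the CAD tools when the queue is long enough or when the
 * assumed invocation period has elapsed. Returns 1 or 0. */
int should_invoke_cad(int queued, int queue_threshold,
                      long ticks_since_last, long period_ticks)
{
    return queued >= queue_threshold || ticks_since_last >= period_ticks;
}
```

For instance, with 50 new threads and an assumed two free cores, `dispatch(50, 2)` leaves 48 threads queued, which would exceed any small queue threshold.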
19/30
Example
[Figure: filter() binary passes through decompilation to a CDFG, then memory access synchronization (MAS)]
CAD reads filter() binary
MAS detects thread group
MAS detects overlapping windows
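To make "overlapping windows" concrete: in the example, thread i reads a[i] through a[i+3], so neighboring threads share three elements. The sketch below computes that overlap and the number of distinct elements a group of consecutive threads needs. It only illustrates the idea; the `window` type and helper names are hypothetical, and this is not the MAS algorithm itself.

```c
#include <assert.h>

/* Elements read by one thread: indices [first, first + len). */
typedef struct {
    int first;
    int len;
} window;

/* Window read by filter() thread i in the example: a[i] .. a[i+3]. */
window filter_window(int i)
{
    window w = { i, 4 };
    return w;
}

/* Number of elements two windows have in common (0 if disjoint). */
int overlap(window x, window y)
{
    int lo   = x.first > y.first ? x.first : y.first;
    int hi_x = x.first + x.len;
    int hi_y = y.first + y.len;
    int hi   = hi_x < hi_y ? hi_x : hi_y;
    return hi > lo ? hi - lo : 0;
}

/* Distinct elements needed by n consecutive threads whose windows of
 * length len each start one element after the previous one. */
int group_span(int n, int len)
{
    assert(n > 0 && len > 0);
    return n + len - 1;
}
```

Under these assumptions, eight consecutive filter() threads need only `group_span(8, 4)` = 11 distinct elements rather than 8 x 4 = 32 separate reads, which is the reuse a shared buffer can exploit.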
20/30
Example
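[Figure: tool flow from the filter() binary through decompilation, CDFG, memory access synchronization, and high-level synthesis. The synthesized filter() datapath is two adders feeding a third adder, followed by >>2. The filter() accelerators connect through a smart buffer to RAM, and filter is stored in the accelerator library]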
Synthesis creates pipelined accelerator for filter() group: 8 accelerators
Stored for future use
Accelerators loaded into FPGA
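The datapath labels in the figure (two adders, a third adder, then >>2) suggest that avg() sums four values and divides by four with a shift. A software sketch of that computation, assuming non-negative inputs so the right shift is well defined, could be written as follows; `avg4` is a hypothetical name standing in for the slide's avg().

```c
#include <assert.h>

/* Average of four values using an adder tree and a shift, mirroring
 * the "+ +, +, >>2" labels in the figure. Assumes non-negative inputs
 * (right-shifting a negative int is implementation-defined in C). */
int avg4(int x0, int x1, int x2, int x3)
{
    int s01 = x0 + x1;        /* first-level adder */
    int s23 = x2 + x3;        /* first-level adder */
    return (s01 + s23) >> 2;  /* second-level adder, divide by 4 */
}

/* One filter() thread from the example: b[i] is the average of
 * a[i] .. a[i+3]. The caller must ensure a[i+3] is in bounds. */
void filter(const int a[], int b[], int i)
{
    assert(i >= 0);
    b[i] = avg4(a[i], a[i + 1], a[i + 2], a[i + 3]);
}
```

In hardware the two first-level additions can proceed in parallel, which is one reason an adder tree pipelines well.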
21/30
Example
[Figure: RAM streams a[0-52] into the smart buffer; the buffer delivers windows a[2-5] through a[9-12] to the eight filter() accelerators, which receive an enable signal from the OS]
Smart buffer streams a[] data
After buffer fills, delivers a window to all eight accelerators
OS schedules threads to accelerators
22/30
Example
[Figure: the smart buffer streams a[0-52] from RAM and delivers windows a[10-13] through a[17-20] to the eight filter() accelerators]
Each cycle, smart buffer delivers eight more windows – pipeline remains full
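A simple software model of one such cycle is sketched below: the buffer reads 11 consecutive elements once and hands each of the eight accelerators its own four-element window. This is an illustrative model under assumed constants (`WINDOW`, `LANES`), not the actual smart buffer design, and it ignores pipeline latency.

```c
#include <assert.h>

#define WINDOW 4                     /* elements per filter() window */
#define LANES  8                     /* accelerators fed per cycle */
#define SPAN   (LANES + WINDOW - 1)  /* distinct elements needed: 11 */

/* Model one cycle: fetch SPAN elements starting at a[base] once,
 * then let each lane compute its output from its own window.
 * The caller must ensure a[base + SPAN - 1] and b[base + LANES - 1]
 * are in bounds. Assumes non-negative data for the shift. */
void smart_buffer_cycle(const int a[], int b[], int base)
{
    int buf[SPAN];
    assert(base >= 0);

    for (int k = 0; k < SPAN; k++)
        buf[k] = a[base + k];        /* each element is read only once */

    for (int lane = 0; lane < LANES; lane++) {
        const int *w = &buf[lane];   /* window for thread base + lane */
        b[base + lane] = (w[0] + w[1] + w[2] + w[3]) >> 2;
    }
}
```

Calling `smart_buffer_cycle(a, b, 2)` and then `smart_buffer_cycle(a, b, 10)` corresponds to the windows a[2-5] through a[9-12] and a[10-13] through a[17-20] shown on these slides.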
23/30
Example
[Figure: the smart buffer streams a[0-52] from RAM; the eight filter() accelerators write outputs b[2-9]]
Accelerators create 8 outputs after pipeline latency passes
24/30
Example
[Figure: the smart buffer streams a[0-52] from RAM; the eight filter() accelerators write outputs b[10-17]]
Additional 8 outputs each cycle
Thread warping: 8 pixel outputs per cycle
Software: 1 pixel output every ~9 cycles
72x cycle count improvement
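The 72x figure follows from the two rates above: 8 outputs take about 8 x 9 = 72 cycles in software and 1 cycle on the accelerators once the pipeline is full. The sketch below reproduces that arithmetic. It is a steady-state estimate only; the function names are hypothetical, and buffer fill time and pipeline latency are deliberately ignored.

```c
#include <assert.h>

/* Cycles for software on one core at a fixed cost per output. */
long sw_cycles(long outputs, long cycles_per_output)
{
    assert(outputs >= 0 && cycles_per_output > 0);
    return outputs * cycles_per_output;
}

/* Steady-state cycles for `lanes` accelerators that together produce
 * `lanes` outputs per cycle (rounds up for a partial last group).
 * Ignores buffer fill time and pipeline latency. */
long hw_cycles(long outputs, long lanes)
{
    assert(outputs >= 0 && lanes > 0);
    return (outputs + lanes - 1) / lanes;
}
```

With these definitions, `sw_cycles(8, 9)` is 72 and `hw_cycles(8, 8)` is 1, matching the 72x ratio. For all 50 outputs of the example the same model gives 450 versus 7 cycles, a somewhat lower ratio because the last group is only partly filled.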
25/30
Experiments to Determine Thread Warping Performance: Simulator Setup