
Warp Processors

Page 1: Warp Processors

Warp Processors
Frank Vahid (Task Leader)
Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine

Task ID: 1331.001, July 2005 – June 2008

Ph.D. students: Greg Stitt (Ph.D. expected June 2006), Ann Gordon-Ross (Ph.D. expected June 2006), David Sheldon (Ph.D. expected 2009), Ryan Mannion (Ph.D. expected 2009), Scott Sirowy (Ph.D. expected 2010)

Industrial Liaisons: Brian W. Einloth, Motorola; Serge Rutman and Dave Clark, Intel; Jeff Welser, IBM

Page 2: Warp Processors

Task Description

Warp processing background:
- Two seed SRC CSR grants (2002-2005) showed feasibility
- Idea: Transparently move critical binary regions from microprocessor to FPGA
- 10x perf./energy gains or more

Task: Mature warp technology
- Years 1/2 (in progress):
  - Automatic high-level construct recovery from binaries
  - In-depth case studies (with Freescale); also discovered an unanticipated problem, developed a solution
  - Warp-tailored FPGA prototype (with Intel)
- Years 2/3:
  - Reduce memory bottleneck by using a smart buffer
  - Investigate domain-specific-FPGA concepts (with Freescale)
  - Consider desktop/server domains (with IBM)

Page 3: Warp Processors

[Figure: single chip containing a µP with I Mem and D$, a profiler, an FPGA, and on-chip CAD]

Warp Processing Background: Basic Idea

Step 1: Initially, the software binary is loaded into instruction memory.

Software Binary:
        Mov reg3, 0
        Mov reg4, 0
    loop:
        Shl reg1, reg3, 1
        Add reg5, reg2, reg1
        Ld  reg6, 0(reg5)
        Add reg4, reg4, reg6
        Add reg3, reg3, 1
        Beq reg3, 10, -5
        Ret reg4

Page 4: Warp Processors

Warp Processing Background: Basic Idea

Step 2: The microprocessor executes the instructions in the software binary, establishing the baseline time and energy.

Page 5: Warp Processors

Warp Processing Background: Basic Idea

Step 3: The profiler monitors instructions and detects critical regions in the binary, e.g., the repeatedly executed add/beq instructions of the example's loop. Critical loop detected.

Page 6: Warp Processors

Warp Processing Background: Basic Idea

Step 4: The on-chip CAD reads in the critical region.

Page 7: Warp Processors

Warp Processing Background: Basic Idea

Step 5: The on-chip CAD decompiles the critical region into a control/data flow graph (CDFG):

        reg3 := 0
        reg4 := 0
    loop:
        reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]
        reg3 := reg3 + 1
        if (reg3 < 10) goto loop
        ret reg4

Page 8: Warp Processors

Warp Processing Background: Basic Idea

Step 6: The on-chip CAD synthesizes the decompiled CDFG into a custom (parallel) circuit, e.g., a tree of adders.

Page 9: Warp Processors

Warp Processing Background: Basic Idea

Step 7: The on-chip CAD maps the circuit onto the FPGA's configurable logic blocks (CLBs) and switch matrices (SMs).

Page 10: Warp Processors

Warp Processing Background: Basic Idea

Step 8: The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more:

        Mov reg3, 0
        Mov reg4, 0
    loop:
        // instructions that interact with FPGA
        Ret reg4

[Chart: software-only vs. "warped" time and energy]

Page 11: Warp Processors

Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms

- FPGAs with hard core processors (Xilinx Virtex II Pro, Altera Excalibur)
- FPGAs with soft core processors (Xilinx Spartan)
- Computer boards with FPGAs (Cray XD1; source: FPGA Journal, Apr. '05)

Page 12: Warp Processors

Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms

Programming is a key challenge.
- Soln 1: Compile a high-level language to custom binaries
- Soln 2: Use standard binaries, dynamically re-mapped (warped)
  - Cons: less high-level information, so less optimization
  - Pros: available to all software developers, not just specialists; data-dependent optimization; most importantly, standard binaries enable an "ecosystem" among tools, architectures, and applications

Standard binaries are the most significant concept presently absent in FPGAs and other new programmable platforms.

Page 13: Warp Processors

[Figure: µP with I$/D$, profiler, FPGA, and on-chip CAD on one chip]

Warp Processing Background: Basic Technology

Warp processing requires:
- An on-chip profiler
- A warp-tuned FPGA
- On-chip CAD, including Just-in-Time (JIT) FPGA compilation

JIT FPGA compilation flow: Binary → Decompilation → Partitioning → Behav./RT Synthesis → Logic Synthesis → Technology Mapping → Placement & Routing → Binary Updater → Updated Binary

Page 14: Warp Processors

Warp Processing Background: Initial Results

JIT FPGA compilation (partitioning, decompilation, RT synthesis, logic synthesis, technology mapping, placement, and routing):
- Xilinx ISE: 60 MB memory, 9.1 s (with some steps performed manually)
- ROCCAD: 3.6 MB memory, 0.2 s; on a 75 MHz ARM7, only 1.4 s
- 46x improvement, at a 30% performance penalty

Page 15: Warp Processors

Warp Processing Background: Publications 2002-2005

On-chip profiler:
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. ACM/IEEE Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2003; extended version in the special issue "Best of CASES/MICRO" of IEEE Trans. on Computers, Oct. 2005.

Warp-tuned FPGA:
- A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe Conf. (DATE), Feb. 2004.

On-chip CAD, including Just-in-Time FPGA compilation:
- A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), 2005.
- A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
- Dynamic FPGA Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. Design Automation Conf. (DAC), June 2004.
- A Codesigned On-Chip Logic Minimizer. R. Lysecky and F. Vahid. CODES/ISSS Conf., Oct. 2003.
- Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky and F. Vahid. Design Automation Conf. (DAC), 2003.
- On-Chip Logic Minimization. R. Lysecky and F. Vahid. Design Automation Conf. (DAC), 2003.
- The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic. G. Stitt and F. Vahid. IEEE Design and Test of Computers, Nov./Dec. 2002.
- Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2002.

Related:
- A Self-Tuning Cache Architecture for Embedded Systems. C. Zhang, F. Vahid and R. Lysecky. ACM Transactions on Embedded Computing Systems (TECS), Vol. 3, Issue 2, May 2004.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.

Page 16: Warp Processors

Task Description

Warp processing background:
- Two seed SRC CSR grants (2002-2005) showed feasibility
- Idea: Transparently move critical binary regions from microprocessor to FPGA
- 10x perf./energy gains or more

Task: Mature warp technology
- Year 1 (in progress):
  - Automatic high-level construct recovery from binaries
  - In-depth case studies (with Freescale); also discovered an unanticipated problem, developed a solution
  - Warp-tailored FPGA prototype (with Intel)
- Years 2/3:
  - Reduce memory bottleneck by using a smart buffer
  - Investigate domain-specific-FPGA concepts (with Freescale)
  - Consider desktop/server domains (with IBM)

Page 17: Warp Processors

Automatic High-Level Construct Recovery from Binaries

Challenge: a binary lacks high-level constructs (loops, arrays, ...). Decompilation can help recover them; there is extensive previous work (e.g., [Cifuentes 93, 94, 99]).

Original C Code:
    long f( short a[10] ) {
      long accum;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }

Corresponding Assembly:
        Mov reg3, 0
        Mov reg4, 0
    loop:
        Shl reg1, reg3, 1
        Add reg5, reg2, reg1
        Ld  reg6, 0(reg5)
        Add reg4, reg4, reg6
        Add reg3, reg3, 1
        Beq reg3, 10, -5
        Ret reg4

Control/Data Flow Graph Creation:
        reg3 := 0
        reg4 := 0
    loop:
        reg1 := reg3 << 1
        reg5 := reg2 + reg1
        reg6 := mem[reg5 + 0]
        reg4 := reg4 + reg6
        reg3 := reg3 + 1
        if (reg3 < 10) goto loop
        ret reg4

Data Flow Analysis:
        reg3 := 0
        reg4 := 0
    loop:
        reg4 := reg4 + mem[ reg2 + (reg3 << 1) ]
        reg3 := reg3 + 1
        if (reg3 < 10) goto loop
        ret reg4

Function Recovery:
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
    loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }

Control Structure Recovery:
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }

Array Recovery:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }

The original and recovered representations are almost identical.

Page 18: Warp Processors

New Method: Loop Rerolling

Problem: Compiler unrolling of loops (to expose parallelism) causes synthesis problems: huge input (slow synthesis), inability to unroll to the desired amount, and inability to use advanced loop methods (loop pipelining, fusion, splitting, ...).

Solution: A new decompilation method, loop rerolling: identify unrolled iterations and compact them into one iteration.

Original loop:
    for (int i=0; i < 3; i++)
      accum += a[i];

After compiler loop unrolling:
    Ld  reg2, 100(0)
    Add reg1, reg1, reg2
    Ld  reg2, 100(1)
    Add reg1, reg1, reg2
    Ld  reg2, 100(2)
    Add reg1, reg1, reg2

After loop rerolling:
    for (int i=0; i < 3; i++)
      reg1 += array[i];

Page 19: Warp Processors

Loop Rerolling: Identify Unrolled Iterations

Original C Code:
    x = x + 1;
    for (i=0; i < 2; i++)
      a[i] = b[i] + 1;
    y = x;

Unrolled Loop:
    x = x + 1;
    a[0] = b[0] + 1;
    a[1] = b[1] + 1;
    y = x;

Binary, mapped to a string (one letter per instruction kind):
    Add r3, r3, 1  => B
    Ld  r0, b(0)   => A
    Add r1, r0, 1  => B
    St  a(0), r1   => C
    Ld  r0, b(1)   => A
    Add r1, r0, 1  => B
    St  a(1), r1   => C
    Mov r4, r3     => D

String representation: BABCABCD

Find consecutive repeating substrings, i.e., adjacent suffix-tree nodes with the same substring (derived from bioinformatics techniques). The unrolled loop appears as 2 unrolled iterations, each iteration = ABC (Ld, Add, St).

Page 20: Warp Processors

Loop Rerolling: Compacting Iterations

Unrolled loop identified in the binary:
    Add r3, r3, 1
    Ld  r0, b(0)
    Add r1, r0, 1
    St  a(0), r1
    Ld  r0, b(1)
    Add r1, r0, 1
    St  a(1), r1
    Mov r4, r3

1) Determine the relationship of the constants across iterations.

2) Replace the constants with an induction-variable expression:
        Add r3, r3, 1
        i = 0
    loop:
        Ld  r0, b(i)
        Add r1, r0, 1
        St  a(i), r1
        Bne i, 2, loop
        Mov r4, r3

3) Rerolled, decompiled code:
    reg3 = reg3 + 1;
    for (i=0; i < 2; i++)
      array1[i] = array2[i] + 1;
    reg4 = reg3;

Original C code, for comparison:
    x = x + 1;
    for (i=0; i < 2; i++)
      a[i] = b[i] + 1;
    y = x;

Page 21: Warp Processors

Method: Strength Promotion

Problem: The compiler's strength reduction (replacing multiplies by shifts and adds) prevents synthesis from using hard-core multipliers, sometimes hurting circuit performance.

FIR filter: A[i] = B[i]*10 + B[i+1]*18 + B[i+2]*34 + B[i+3]*66, an adder tree over four multiplications.

Strength-reduced FIR filter: each multiplication becomes a shift-add pair, e.g., B[i]*10 = (B[i] << 3) + (B[i] << 1), B[i+1]*18 = (B[i+1] << 4) + (B[i+1] << 1), B[i+2]*34 = (B[i+2] << 5) + (B[i+2] << 1), and B[i+3]*66 = (B[i+3] << 6) + (B[i+3] << 1).

Page 22: Warp Processors

Strength Promotion

Solution: Promote strength-reduced code back to multiplications.

Identify each strength-reduced subgraph, e.g., (B[i] << 3) + (B[i] << 1), and replace it with a multiplication, B[i] * 10; repeat for B[i+1] * 18, B[i+2] * 34, and B[i+3] * 66 until the full FIR adder tree of multiplications is recovered.

Strength promotion lets synthesis decide on strength reduction based on available resources; synthesis can of course apply strength reduction itself.

Page 23: Warp Processors

New Decompilation Methods' Benefits

Rerolling:
- Speedups from better use of smart buffers
- Other potential benefits: faster synthesis, less area

Strength promotion:
- Speedups from fewer cycles
- Speedups from a faster clock

New methods to be developed, e.g., recovering pointer data structures as arrays.

[Chart: speedups from loop rerolling, 0.0-3.0 (Y axis = speedup; X axis labels x_y_z: x = adder constraint, y = multiplier constraint, z = adders needed for reduction)]
[Chart: strength promotion vs. no strength promotion (Y axis = clock frequency; X axis = adders needed for reduction)]

Page 24: Warp Processors

Decompilation is Effective Even with High Compiler-Optimization Levels

Average speedup of 10 examples (MIPS -O1, MIPS -O3, ARM -O1, ARM -O3, MicroBlaze -O1, MicroBlaze -O3):
- Speedups similar on MIPS for -O1 and -O3 optimizations
- Speedups similar on ARM for -O1 and -O3 optimizations
- Speedups similar between ARM and MIPS; the complex instructions of the ARM didn't hurt synthesis
- MicroBlaze speedups much larger: MicroBlaze is a slower microprocessor, and -O3 optimizations were very beneficial to hardware

Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.

Page 25: Warp Processors

Task Description

Warp processing background:
- Two seed SRC CSR grants (2002-2005) showed feasibility
- Idea: Transparently move critical binary regions from microprocessor to FPGA
- 10x perf./energy gains or more

Task: Mature warp technology
- Year 1 (in progress):
  - Automatic high-level construct recovery from binaries
  - In-depth case studies (with Freescale); also discovered an unanticipated problem, developed a solution
  - Warp-tailored FPGA prototype (with Intel)
- Years 2/3:
  - Reduce memory bottleneck by using a smart buffer
  - Investigate domain-specific-FPGA concepts (with Freescale)
  - Consider desktop/server domains (with IBM)

Page 26: Warp Processors

Research Problem: Make Synthesis from Binaries Competitive with Synthesis from High-Level Languages

Performed an in-depth, several-month study with Freescale:
- H.264 video decoder: highly-optimized proprietary code, not reference code (a huge difference, and a benefit of SRC collaboration)
- Research question: is synthesis from binaries competitive on highly-optimized code?
- H.264 vs. MPEG-2: better quality, or smaller files, using more computation

Page 27: Warp Processors

Optimized H.264

Larger than most benchmarks:
- H.264: 16,000 lines; previous work: 100 to several thousand lines

Highly optimized:
- Many man-hours of manual optimization
- 10x faster than the reference code used in previous works

Different profiling results:
- Previous examples: ~90% of time in several loops
- H.264: ~90% of time in ~45 functions, so harder to speed up

Function Name                   Instrs  %Time (cum.)  Cum. Speedup
MotionComp_00                       33      6.8%          1.1
InvTransform4x4                     63     12.5%          1.1
FindHorizontalBS                    47     16.7%          1.2
GetBits                             51     20.8%          1.3
FindVerticalBS                      44     24.7%          1.3
MotionCompChromaFullXFullY          24     28.6%          1.4
FilterHorizontalLuma               557     32.5%          1.5
FilterVerticalLuma                 481     35.8%          1.6
FilterHorizontalChroma             133     39.0%          1.6
CombineCoefsZerosInvQuantScan       69     42.0%          1.7
memset                              20     44.9%          1.8
MotionCompensate                   167     47.7%          1.9
FilterVerticalChroma               121     50.3%          2.0
MotionCompChromaFracXFracY          48     53.0%          2.1
ReadLeadingZerosAndOne              56     55.6%          2.3
DecodeCoeffTokenNormal              93     57.5%          2.4
DeblockingFilterLumaRow            272     59.4%          2.5
DecodeZeros                         79     61.3%          2.6
MotionComp_23                      279     63.0%          2.7
DecodeBlockCoefLevels               56     64.6%          2.8
MotionComp_21                      281     66.2%          3.0
FindBoundaryStrengthPMB             44     67.7%          3.1

Page 28: Warp Processors

C vs. Binary Synthesis on Optimized H.264

Binary partitioning is competitive with source partitioning:
- Speedups compared to ARM9 software: binary 2.48, C 2.53
- Decompilation recovered nearly all high-level information needed for partitioning and synthesis

Discovered another research problem: why aren't the speedups (from binary or C) closer to "ideal" (zero-time hardware execution per function)?

[Chart: speedup (0-10) vs. number of functions in hardware (1-51), comparing ideal speedup (zero-time hardware execution), speedup from C partitioning, and speedup from binary partitioning]

Page 29: Warp Processors

Coding Guidelines

Are there C-coding guidelines to improve partitioning speedups?
- Orthogonal to the C vs. binary question; guidelines may help both

Examined the H.264 code further: several phone conferences with Freescale liaisons, plus several email exchanges and reports.

[Charts: ideal vs. C speedups, before and after applying guidelines: competitive, but both could be better; coding guidelines get closer to ideal]

Page 30: Warp Processors

Synthesis-Oriented Coding Guidelines

Pass by value-return:
- Declare a local array and copy in all data needed by a function (makes the lack of aliases explicit)

Function specialization:
- Create a function version having frequent parameter values as constants

Original:
    void f(int width, int height ) {
      . . . .
      for (i=0; i < width; i++)
        for (j=0; j < height; j++)
          . . . . . .
    }

Rewritten:
    void f_4_4() {
      . . . .
      for (i=0; i < 4; i++)
        for (j=0; j < 4; j++)
          . . . . . .
    }

Bounds are explicit, so the loops are now unrollable.

Page 31: Warp Processors

Synthesis-Oriented Coding Guidelines

Algorithmic specialization:
- Use parallelizable hardware algorithms when possible

Hoisting and sinking of error checking:
- Keep error checking out of loops to enable unrolling

Lookup table avoidance:
- Use expressions rather than lookup tables

Original:
    int clip[512] = { . . . };
    void f() {
      . . .
      for (i=0; i < 10; i++)
        val[i] = clip[val[i]];
      . . .
    }

Rewritten:
    void f() {
      . . .
      for (i=0; i < 10; i++)
        if (val[i] > 255) val[i] = 255;
        else if (val[i] < 0) val[i] = 0;
      . . .
    }

The comparisons can now be parallelized: each val[i] feeds comparators and a 3x1 mux instead of a table lookup.

Page 32: Warp Processors

Synthesis-Oriented Coding Guidelines

Use explicit control flow:
- Replace function pointers with if statements and static function calls

Original:
    void (*funcArray[]) (char *data) = { func1, func2, . . . };
    void f(char *data) {
      . . .
      funcPointer = funcArray[i];
      (*funcPointer) (data);
      . . .
    }

Rewritten:
    void f(char *data) {
      . . .
      if (i == 0) func1(data);
      else if (i == 1) func2(data);
      . . .
    }

Page 33: Warp Processors

Coding Guideline Results on H.264

Simple coding guidelines made a large improvement:
- The rewritten software is only ~3% slower than the original
- Binary partitioning is still competitive with C partitioning: speedups of 6.55 (binary) vs. 6.56 (C)
- The small difference is caused by switch statements that used indirect jumps

[Charts: speedup (0-10) vs. number of functions in hardware (1-51), before and after the rewrite, comparing ideal speedup (zero-time hardware execution), C partitioning, and binary partitioning]

Page 34: Warp Processors

Studied More Benchmarks, Developed More Guidelines

Studied the guidelines further on standard benchmarks (g3fax, mpeg2, jpeg, brev, fir, crc):
- Further synthesis speedups (again, independent of the C vs. binary issue)
- More guidelines to be developed

[Chart: speedups (0-10) per benchmark for sw, hw/sw with original code, and hw/sw with guidelines]
[Chart: performance overhead and size overhead per benchmark]

Publications:
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005 (joint publication with Freescale).
- Submitted: A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar, 2006.

Page 35: Warp Processors

Task Description

Warp processing background:
- Two seed SRC CSR grants (2002-2005) showed feasibility
- Idea: Transparently move critical binary regions from microprocessor to FPGA
- 10x perf./energy gains or more

Task: Mature warp technology
- Year 1 (in progress):
  - Automatic high-level construct recovery from binaries
  - In-depth case studies (with Freescale); also discovered an unanticipated problem, developed a solution
  - Warp-tailored FPGA prototype (with Intel)
- Years 2/3:
  - Reduce memory bottleneck by using a smart buffer
  - Investigate domain-specific-FPGA concepts (with Freescale)
  - Consider desktop/server domains (with IBM)

Page 36: Warp Processors

Warp-Tailored FPGA Prototype

- Developed an FPGA fabric tailored to fast, small-memory on-chip CAD
- Building a chip prototype with Intel:
  - Created synthesizable VHDL models, running through Intel's shuttle tool flow
  - Plan to incorporate with an ARM processor and other IP on a shuttle seat
  - Bi-weekly phone meetings with Intel engineers since summer 2005, ongoing; scheduled tapeout 2006 Q3

[Figure: warp-tailored fabric with DADG/LCH, a 32-bit MAC, and a configurable logic fabric of CLBs and switch matrices (SMs); each CLB holds two LUTs with inputs a-f and outputs o1-o4, connected to adjacent CLBs and to switch-matrix routing channels 0-3/0L-3L]

Page 37: Warp Processors

Industrial Interactions

Freescale:
- Numerous phone conferences, emails, and reports on technical subjects
- Co-authored paper (CODES/ISSS'05); another pending
- Summer internship: Scott Sirowy (new UCR graduate student), summer 2005, Austin

Intel:
- Three visits by the PI, and one by graduate student Roman Lysecky, to Intel Research in Santa Clara
- PI presented at the Intel System Design Symposium, Nov. 2005
- PI served on an Intel Research Silicon Prototyping Workshop panel, May 2005
- Participating in Intel's Research Shuttle (chip prototype); bi-weekly phone conferences since summer 2005 involving the PI, Intel engineers, and Roman Lysecky (now a professor at UA)

IBM:
- Embarking on studies of warp processing results on server applications
- UCR group to receive a Cell-based prototyping platform (with Prof. Walid Najjar)

Several interactions with Xilinx as well.

Page 38: Warp Processors

Task Description – Coming Up

Warp processing background:
- Two seed SRC CSR grants (2002-2005) showed feasibility
- Idea: Transparently move critical binary regions from microprocessor to FPGA
- 10x perf./energy gains or more

Task: Mature warp technology
- Years 1/2 (in progress):
  - Automatic high-level construct recovery from binaries
  - In-depth case studies (with Freescale); also discovered an unanticipated problem, developed a solution
  - Warp-tailored FPGA prototype (with Intel)
- Years 2/3 – all three sub-tasks just now underway:
  - Reduce memory bottleneck by using a smart buffer
  - Investigate domain-specific-FPGA concepts (with Freescale)
  - Consider desktop/server domains (with IBM)

Page 39: Warp Processors

Recent Publications

- New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2005.
- Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005. (Co-authored with Freescale.)
- Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. IEEE Trans. on Computers, special issue "Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau", Oct. 2005.
- A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
- A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April 2005.
- A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
- A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.