Optimizing high speed arithmetic circuits using three-term extraction
Anup Hosangadi, Ryan Kastner (ECE Department, University of California, Santa Barbara)
Farzan Fallah (Fujitsu Laboratories of America)

Dec 19, 2015


Page 2

Outline
• Carry Save Arithmetic
• Related Work
• Problem formulation
• Algebraic methods
• Delay aware optimization
• Experimental results

Page 3

Carry Save Arithmetic
• Multi-operand addition
  • F = A + B + C + D + E + F
  • Carry propagation is the major bottleneck
  • Fast adders such as the Carry Lookahead Adder (CLA) and Carry Select Adder are not fast enough
• Solution: defer carry propagation to the final step
  • Generate sums and carries separately
  • Treat them as separate numbers
  • Keep adding until only two numbers remain
  • Add the two remaining numbers using a fast adder (CLA)
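The carry-save steps can be sketched in a few lines of Python. This is a minimal illustration on non-negative integers rather than a hardware model; the names `csa` and `multi_operand_add` are made up for this example.

```python
def csa(a, b, c):
    """3:2 carry-save adder: three numbers in, a (sum, carry) pair out.

    Every bit position is an independent full adder, so no carry
    ripples across the word.
    """
    s = a ^ b ^ c                                # per-bit sum
    carry = ((a & b) | (b & c) | (a & c)) << 1   # per-bit carry, weight x2
    return s, carry


def multi_operand_add(operands):
    """Apply CSAs until two numbers remain, then do one ordinary addition."""
    nums = list(operands)
    while len(nums) > 2:
        a, b, c = nums.pop(), nums.pop(), nums.pop()
        nums.extend(csa(a, b, c))
    return sum(nums)  # stands in for the final fast adder (CLA)


print(multi_operand_add([3, 5, 7, 11, 13, 17]))  # 56
```

The only carry-propagating addition is the final `sum(nums)`; everything before it is bitwise logic.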

Page 4

Carry Save Arithmetic
[Figure: tree of four CSAs reducing the six operands A, B, C, D, E, F to one sum (S) and carry (C) pair, followed by a final CLA that produces F]
• Delay = 3 + log2(M + 3)
  • 3 = height of CSA tree
  • M = bitwidth of operands
• Tree height = log1.5(N/2), where N = number of operands
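The tree height can also be counted directly, level by level. The sketch below assumes that each level packs as many 3:2 CSAs as possible (one common way to build the tree) and prints the log1.5(N/2) estimate next to the counted height; `csa_tree_height` is a name chosen for this example.

```python
import math


def csa_tree_height(n):
    """Number of CSA levels needed to reduce n operands to 2."""
    levels = 0
    while n > 2:
        n -= n // 3   # every CSA in a level turns 3 operands into 2
        levels += 1
    return levels


for n in (3, 4, 6, 9, 13):
    # counted height next to the closed-form estimate log1.5(N/2)
    print(n, csa_tree_height(n), round(math.log(n / 2, 1.5), 2))
```

For the six-operand example the counted height is 3, matching the figure.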

Page 5

Carry Save arithmetic
Using Ripple carry adders (RCAs)
[Figure: chain of five RCAs adding the six operands; the adders are labelled (M+1), (M+2), (M+3), (M+4), (M+5)]
• Delay through RCA chain = (M + 5) + 4
• Delay through CSA network = 3 + log2(M + 3)
[Chart: Delay comparison. x-axis: # of operands (2 to 50); y-axis: Delay (full adder delays), 0 to 120; series: RCA, CSA]
[Chart: Area comparison. x-axis: # of operands; y-axis: Area (full adder units), 0 to 2000; series: RCA, CSA]
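The two delay expressions can be compared numerically. The sketch below generalizes the six-operand formulas to N operands, which is an extrapolation and not something stated on the slide: the RCA chain is assumed to cost (M + N - 1) + (N - 2) full-adder delays, and the CSA network is assumed to cost the tree height plus a base-2 logarithmic final adder, as in Delay = 3 + log2(M + 3). All function names are hypothetical.

```python
import math


def csa_tree_height(n):
    """CSA levels needed to reduce n operands to 2 (3 inputs -> 2 outputs per CSA)."""
    levels = 0
    while n > 2:
        n -= n // 3
        levels += 1
    return levels


def rca_chain_delay(n, m):
    """Assumed generalization of (M + 5) + 4 for n = 6 operands of m bits."""
    return (m + n - 1) + (n - 2)


def csa_network_delay(n, m):
    """Assumed generalization of 3 + log2(M + 3): tree height plus a log-depth final adder."""
    h = csa_tree_height(n)
    return h + math.log2(m + h)


for n in (2, 6, 10, 20, 50):
    print(n, rca_chain_delay(n, 16), round(csa_network_delay(n, 16), 1))
```

With these assumptions the RCA delay grows linearly in the number of operands while the CSA delay grows logarithmically, which is the trend the delay chart shows.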

Page 6

Related Work
• Kim et al., "Arithmetic optimization using Carry Save Adders", DAC'98
[Figure: an adder network computing F from A, B, C, D, E, shown next to an equivalent network built from three CSAs and adders]

Page 7

Related Work
• Kim et al., "Optimal allocation of CSAs", ICCAD'99
  • Delay aware CSA allocation
• Kim et al., "High performance, low power synthesis", DAC 2000
• Synopsys™ Behavioral optimization for arithmetic (BOA)
• A. Verma and P. Ienne, "Improved use of the carry save representation for the synthesis of complex arithmetic circuits", ICCAD 2004
Arithmetic Optimizer?

Page 8

Problem formulation
• No methodology for detecting redundancy in CSA computations
  • Can reduce the number of CSAs
  • Can reduce the number of wires
• Common subexpression elimination
  • Standard compiler technique
  • Applied to 2-term arithmetic operations
    – Polynomial expressions (ICCAD'04, VLSI'05)
    – Constant multiplications (ASAP'04, ASPDAC'05)
• CSA expressions (common 3-term subexpressions)

Page 9

Problem formulation
Y1 = X1 + X1<<2 + X2 + X2<<1 + X2<<2
Y2 = X1<<2 + X2<<2 + X2<<3

D1 = X1 + X2 + X2<<1
Y1 = (D1^S + D1^C) + X1<<2 + X2<<2
Y2 = (D1^S + D1^C)<<2

(D1^S and D1^C are the sum and carry outputs of the CSA that computes D1.)
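The rewrite can be checked numerically. In the sketch below D1 is computed once with a carry-save adder and reused in both outputs; Y2 equals D1 shifted left by two positions, since X1<<2 + X2<<2 + X2<<3 = (X1 + X2 + X2<<1)<<2. The helper names are made up for this check.

```python
import random


def csa(a, b, c):
    """3:2 carry-save adder returning (sum, carry) with sum + carry == a + b + c."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1


def rewrite_matches(x1, x2):
    # Original expressions
    y1 = x1 + (x1 << 2) + x2 + (x2 << 1) + (x2 << 2)
    y2 = (x1 << 2) + (x2 << 2) + (x2 << 3)
    # Shared three-term subexpression D1 = X1 + X2 + X2<<1
    d1s, d1c = csa(x1, x2, x2 << 1)
    y1_shared = (d1s + d1c) + (x1 << 2) + (x2 << 2)
    y2_shared = (d1s + d1c) << 2
    return y1 == y1_shared and y2 == y2_shared


print(all(rewrite_matches(random.randrange(1 << 16), random.randrange(1 << 16))
          for _ in range(1000)))
```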

Page 10

Algebraic methods
• Polynomial transformation
  • X<<i = XL^i
  • Detects shifted common subexpressions and also extends to multiple variables
C × X = Σ_i (±X × L^i)
(14)_10 × X = (1110)_2 × X = X<<3 + X<<2 + X<<1 = XL^3 + XL^2 + XL^1
            = (1 0 0 -1 0)_CSD × X = XL^4 - XL^1
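A constant can be turned into its signed shift terms with the standard non-adjacent-form recoding, which yields the canonical signed digit (CSD) representation. The sketch below assumes a positive integer constant; `csd_terms` and `multiply` are names chosen for this example.

```python
def csd_terms(c):
    """Return [(shift, sign)] such that c == sum(sign << shift).

    Assumes c > 0. No two nonzero digits end up in adjacent positions.
    """
    terms, shift = [], 0
    while c != 0:
        if c & 1:
            digit = 2 - (c & 3)   # +1 if c % 4 == 1, -1 if c % 4 == 3
            terms.append((shift, digit))
            c -= digit
        c >>= 1
        shift += 1
    return terms


def multiply(c, x):
    """Compute c * x as a sum of shifted copies of x (the terms ±X·L^i)."""
    return sum(sign * (x << shift) for shift, sign in csd_terms(c))


print(csd_terms(14))    # [(1, -1), (4, 1)], i.e. XL^4 - XL^1
print(multiply(14, 5))  # 70
```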

Page 11

Algebraic methods
• 3-term divisors = all potential common subexpressions
• Divisor generation
  • One for every combination of 3 terms
  • e.g. F1 = X1 + X1L^2 + X2 + X2L + X2L^2
  • d1 = X1L^2 + X2L + X2L^2
  • MinL = L
  • Divisor D1 = d1/L = X1L + X2 + X2L
  • # of divisors = C(N, 3), i.e. "N choose 3"
• Theorem:
  • There exists a 3-term common subexpression iff there exists a non-overlapping intersection among the set of 3-term divisors
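Divisor generation can be sketched as follows. Terms are modelled as (variable, shift) pairs, so X1L^2 becomes ("X1", 2); this encoding and the function name `divisors` are assumptions made for illustration.

```python
from itertools import combinations


def divisors(terms):
    """Return one (divisor, MinL shift) pair per combination of three terms."""
    result = []
    for triple in combinations(terms, 3):
        min_shift = min(shift for _, shift in triple)   # MinL
        # Divide the triple by L^min_shift so shifted copies compare equal
        divisor = tuple(sorted((var, shift - min_shift) for var, shift in triple))
        result.append((divisor, min_shift))
    return result


# F1 = X1 + X1L^2 + X2 + X2L + X2L^2
F1 = [("X1", 0), ("X1", 2), ("X2", 0), ("X2", 1), ("X2", 2)]
divs = divisors(F1)
print(len(divs))  # 10, i.e. C(5, 3)
# d1 = X1L^2 + X2L + X2L^2 normalizes to X1L + X2 + X2L with MinL = L
print(((("X1", 1), ("X2", 0), ("X2", 1)), 1) in divs)  # True
```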

Page 12

Algebraic methods
• Greedy iterative algorithm
  • Extracts the "best" 3-term divisor
  • Rewrites the expressions containing it
  • Terminates when there are no more common subexpressions

F1 = a + b + c + d + e
F2 = a + b + c + d + f

>> D1 = a + b + c
F1 = D1^S + D1^C + d + e
F2 = D1^S + D1^C + d + f

>> D2 = D1^S + D1^C + d
F1 = D2^S + D2^C + e
F2 = D2^S + D2^C + f

Page 13

Algebraic methods
• Algorithm details

Optimize({Pi})
{
    {Pi} = Set of expressions in polynomial form;
    {D} = Set of divisors = ∅;

    // Step 1. Creating divisors and their frequency statistics
    for each expression Pi in {Pi}
    {
        {Dnew} = Divisors(Pi);
        Update frequency statistics of divisors in {D};
        {D} = {D} ∪ {Dnew};
    }

    // Step 2. Iterative selection and elimination of best divisor
    while (1)
    {
        Find d = divisor in {D} with most number of non-overlapping intersections;
        if (d == NULL) break;
        Rewrite affected expressions in {Pi} using d;
        Remove divisors in {D} that have become invalid;
        Update frequency statistics of affected divisors;
        {Dnew} = Set of new divisors from new terms added by division;
        {D} = {D} ∪ {Dnew};
    }
}
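A compact Python sketch of the greedy loop is given below. It is a simplification, not the authors' implementation: it regenerates all divisors on every iteration instead of updating frequency statistics incrementally, it picks non-overlapping instances greedily, and it breaks ties by sorted order. Terms are (name, shift) pairs, and the names `D1S`/`D1C` for the sum and carry outputs of an extracted divisor are invented for this example.

```python
from collections import defaultdict
from itertools import combinations


def normalize(triple):
    """Divide a 3-term combination by its minimum shift (MinL)."""
    m = min(shift for _, shift in triple)
    return tuple(sorted((v, s - m) for v, s in triple)), m


def best_divisor(exprs):
    """Find the divisor with the most non-overlapping instances (at least 2)."""
    occurrences = defaultdict(list)
    for i, terms in enumerate(exprs):
        for triple in combinations(sorted(terms), 3):
            d, m = normalize(triple)
            occurrences[d].append((i, m, frozenset(triple)))
    best, best_uses = None, []
    for d in sorted(occurrences):
        used, uses = defaultdict(set), []
        for i, m, triple in occurrences[d]:
            if not (triple & used[i]):        # skip overlapping instances
                used[i] |= triple
                uses.append((i, m, triple))
        if len(uses) > len(best_uses):
            best, best_uses = d, uses
    return (best, best_uses) if len(best_uses) >= 2 else (None, [])


def optimize(expressions):
    """Repeatedly extract the best divisor and rewrite the expressions."""
    exprs = [set(e) for e in expressions]
    extracted = []
    while True:
        d, uses = best_divisor(exprs)
        if d is None:
            break
        extracted.append(d)
        k = len(extracted)
        for i, m, triple in uses:
            exprs[i] -= triple
            # The divisor's CSA outputs re-enter the expression as two new terms
            exprs[i] |= {(f"D{k}S", m), (f"D{k}C", m)}
    return extracted, exprs


F1 = [(v, 0) for v in "abcde"]
F2 = [(v, 0) for v in "abcdf"]
extracted, rewritten = optimize([F1, F2])
print(extracted)
print(rewritten)
```

On F1 = a + b + c + d + e and F2 = a + b + c + d + f this extracts a + b + c first and then D1S + D1C + d, leaving three terms in each expression.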

Page 14

Algebraic methods
• Algorithm complexity
  • M expressions, each with N terms
  • Divisor generation = M × C(N, 3) = O(MN^3)
  • Iterative algorithm, worst case
    – N terms reduced to 2 terms = (N - 2) steps
    – M expressions = O(MN) steps
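The cubic bound on divisor generation follows from expanding the binomial coefficient:

```latex
M \binom{N}{3} \;=\; M \cdot \frac{N (N-1) (N-2)}{6} \;=\; O(M N^{3})
```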

Page 15

Delay aware optimization
• Sharing subexpressions can increase the total delay
• Traditional high level synthesis approach: reduce delay by Tree Height Reduction (THR)
• Our solution: control delay during optimization itself
  • Optimal delay CSA allocation (T. Kim, J. Um, "Timing driven synthesis", ASPDAC 2000)
    – Use this to get the minimum possible delay

Example (signal arrival times in parentheses):
F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)

Page 16

Delay aware optimization
• Optimal allocation
• Delay ignorant extraction
[Figure: CSA trees for F1 and F2. In each tree a first CSA adds b, c, d (arrival time 0); a second CSA adds its outputs (time 1) to e for F1 or f for F2; a third CSA adds those outputs (time 2) to a (time 2); its outputs (time 3) feed the final adder.]
Delay(F1) = Delay(F2) = 3 + D(Add)
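The delay of such a tree can be estimated with an earliest-first rule: repeatedly feed the three earliest-arriving signals into a CSA, whose two outputs become available one unit after its latest input. This is a sketch of that greedy rule under a unit CSA delay, not a reproduction of the cited allocation algorithm; `csa_tree_delay` is a name chosen for this example.

```python
import heapq


def csa_tree_delay(arrival_times):
    """Arrival time of the last two signals, before the final adder.

    Assumes at least one operand and a delay of 1 unit per CSA.
    """
    heap = list(arrival_times)
    heapq.heapify(heap)
    while len(heap) > 2:
        heapq.heappop(heap)
        heapq.heappop(heap)
        latest = heapq.heappop(heap)          # latest of the three earliest signals
        heapq.heappush(heap, latest + 1)      # sum output
        heapq.heappush(heap, latest + 1)      # carry output
    return max(heap)


# F1 = a(2) + b(0) + c(0) + d(0) + e(0)
print(csa_tree_delay([2, 0, 0, 0, 0]))  # 3, so Delay(F1) = 3 + D(Add)
```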

Page 17

Delay aware extraction
• Control delay during optimization
  • Evaluate each candidate divisor for delay
  • Only consider those divisors that do not increase the delay

F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)

>> D1(3) = a(2) + b(0) + c(0)
F1 = D1^S(3) + D1^C(3) + d(0) + e(0)    Delay = 5 + D(Add)
F2 = D1^S(3) + D1^C(3) + d(0) + f(0)    Delay = 5 + D(Add)

Page 18

Delay aware extraction
• Control delay during optimization
  • Evaluate each candidate divisor for delay
  • Only consider those divisors that do not increase the delay

F1 = a(2) + b(0) + c(0) + d(0) + e(0)
F2 = a(2) + b(0) + c(0) + d(0) + f(0)

>> D2(1) = b(0) + c(0) + d(0)
F1 = D2^S(1) + D2^C(1) + e(0) + a(2)    Delay = 3 + D(Add)
F2 = D2^S(1) + D2^C(1) + f(0) + a(2)    Delay = 3 + D(Add)
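The filtering step can be sketched as follows: each candidate divisor is evaluated by the delay of the expressions after extraction, and it is kept only if neither expression gets slower. The sketch assumes a unit CSA delay, an earliest-first tree for the remaining terms, and that both outputs of a divisor arrive one unit after its latest input; all function names are hypothetical.

```python
import heapq
from itertools import combinations


def csa_tree_delay(arrival_times):
    """Earliest-first CSA tree: arrival time of the last two signals."""
    heap = list(arrival_times)
    heapq.heapify(heap)
    while len(heap) > 2:
        heapq.heappop(heap)
        heapq.heappop(heap)
        latest = heapq.heappop(heap)
        heapq.heappush(heap, latest + 1)
        heapq.heappush(heap, latest + 1)
    return max(heap)


def delay_after(expr, divisor):
    """Delay of expr (name -> arrival time) once divisor is extracted."""
    d_time = max(expr[v] for v in divisor) + 1      # D^S and D^C arrival time
    rest = [t for v, t in expr.items() if v not in divisor]
    return csa_tree_delay(rest + [d_time, d_time])


def delay_aware_candidates(f1, f2):
    """Common 3-term divisors that do not increase the delay of f1 or f2."""
    base1, base2 = csa_tree_delay(f1.values()), csa_tree_delay(f2.values())
    common = sorted(set(f1) & set(f2))
    return [d for d in combinations(common, 3)
            if delay_after(f1, d) <= base1 and delay_after(f2, d) <= base2]


F1 = {"a": 2, "b": 0, "c": 0, "d": 0, "e": 0}
F2 = {"a": 2, "b": 0, "c": 0, "d": 0, "f": 0}
print(delay_after(F1, ("a", "b", "c")))   # 5: rejected, baseline is 3
print(delay_aware_candidates(F1, F2))     # [('b', 'c', 'd')]
```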

Page 19

Experimental results
• Comparing # of CSAs
[Chart: Comparing # of CSAs. x-axis: Example; y-axis: # CSAs, 0 to 250; series: Original, Optimized]
Average 38.4% reduction

Page 20

Experimental results
• Synthesis for Standard Cell Designs
  • Synopsys™ Design Compiler
  • 0.25 micron library
  • Synthesized for minimum delay
[Chart: Area results. x-axis: Example; y-axis: Area, 0 to 1800; series: Series1, Series2]
Avg 32.7% area reduction
Avg 3.7% increase in delay

Page 21

Experimental results
• FPGA synthesis
  • Virtex II FPGAs
  • Synthesized designs and performed place & route
[Chart: Reduction in LUTs and slices. x-axis: Examples (H.264, DCT8, IDCT8, 6 tap FIR, 20 tap FIR, 41 tap FIR, Average); y-axis: % Reduction, 0 to 40; series: LUTs, Slices]
Avg 14.1% reduction in # slices and avg 12.9% reduction in # LUTs
Avg 5.7% increase in delay

Page 22

Experimental results
• Evaluate delay aware extraction algorithm
  • Consider different arrival times of the signals
  • Assume delay dominated by gate delay (FA delay)
  • Only consider best case delay

Example     # of CSAs                        Delay (FA units)
            Delay ignorant   Delay aware     Delay ignorant   Delay aware
H.264       78               79              9                8
DCT8        222              232             14               13
IDCT8       195              201             14               13
FIR6tap     11               15              5                4
FIR20tap    34               45              6                5
FIR41tap    79               91              6                5
Average     103.2            110.5           9                8

Best delay with an average 15.5% increase in # CSAs
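The summary figures can be recomputed from the table. The check below suggests that the 15.5% figure is the mean of the per-example percentage increases rather than the ratio of the two column averages; that reading is an inference from the numbers, not something the slide states.

```python
# (CSAs delay-ignorant, CSAs delay-aware, delay ignorant, delay aware)
rows = {
    "H.264":    (78, 79, 9, 8),
    "DCT8":     (222, 232, 14, 13),
    "IDCT8":    (195, 201, 14, 13),
    "FIR6tap":  (11, 15, 5, 4),
    "FIR20tap": (34, 45, 6, 5),
    "FIR41tap": (79, 91, 6, 5),
}

n = len(rows)
avg_csa_ignorant = sum(r[0] for r in rows.values()) / n
avg_csa_aware = sum(r[1] for r in rows.values()) / n
# Mean of per-example percentage increases in the number of CSAs
increases = [100 * (aware - ign) / ign for ign, aware, _, _ in rows.values()]
avg_increase = sum(increases) / n

print(round(avg_csa_ignorant, 1), round(avg_csa_aware, 1))  # 103.2 110.5
print(round(avg_increase, 1))                               # 15.5
```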

Page 23

Conclusions
• First methodology for common subexpression elimination for Carry Save Arithmetic
• Significant area/power reduction
• Delay aware optimization algorithm also developed
• Can be combined with CSA tree extraction methods for actual application improvement

Page 24

Thank you!!
• Questions?