IEEE/ACM Asia South Pacific Design Automation Conference (ASP-DAC), Shanghai, 2005

Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions

Anup Hosangadi

Ryan Kastner

ECE Department, UCSB

Farzan Fallah

Advanced CAD Research

Fujitsu Labs of America

Outline

  Introduction
  Related Work
  Polynomial transformation
  Common Subexpression elimination
  Results
  Conclusions

Introduction

Multiplications by constants are encountered in many application areas
  DSP transforms in audio, video, and image processing (DFT, DCT, IDCT, etc.)
  Filtering operations in communication (FIR, IIR filters)
  Multiple Input Multiple Output (MIMO) systems
  Polynomials in computer graphics

Introduction

Multiplication is expensive in hardware
Decompose constant multiplications into shifts and additions
  13*X = (1101)2*X = X + X<<2 + X<<3
Signed digits can reduce the number of additions/subtractions
  Canonical Signed Digits (CSD) (Knuth’74)
  (57)10 = (0111001)2 = (100-1001)CSD
Further reduction possible by common subexpression elimination
  Up to 50% reduction (R. Hartley, TCS’96)
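The shift-and-add decomposition and the CSD recoding above can be sketched in a few lines of Python. This is only an illustration: the function names `to_csd` and `shift_add` are made up for this sketch, and the recoding rule (pick +1 or -1 so that the remainder is divisible by 4) is one standard way to obtain a non-adjacent signed-digit form, not necessarily the procedure used by the authors.

```python
def to_csd(n: int) -> list[int]:
    """CSD digits of a non-negative integer, least significant digit first."""
    digits = []
    while n:
        if n & 1:
            # n % 4 == 1 -> digit +1, n % 4 == 3 -> digit -1.
            # Either way n - d is divisible by 4, so the next digit is 0.
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def shift_add(x: int, digits: list[int]) -> int:
    """Multiply x by the constant encoded in digits using only <<, + and -."""
    total = 0
    for i, d in enumerate(digits):
        if d > 0:
            total += x << i
        elif d < 0:
            total -= x << i
    return total


csd57 = to_csd(57)
print(csd57[::-1])           # most significant digit first: [1, 0, 0, -1, 0, 0, 1]
print(shift_add(5, csd57))   # 285, i.e. 57 * 5
```

For 57 the binary form has four non-zero digits (three additions), while the CSD form has three non-zero digits (two additions/subtractions).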

Introduction

Common subexpressions = common digit patterns
  F1 = 7*X = (0111)*X = X + X<<1 + X<<2
  F2 = 13*X = (1101)*X = X + X<<2 + X<<3
  Cost: 4+, 4<<
Common pattern “0101” => X + X<<2
  D1 = X + X<<2
  F1 = D1 + X<<1
  F2 = D1 + X<<3
  Cost: 3+, 3<<
Good for single variable: FIR filters (transposed form)
Multiple variables (DFT, DCT, etc.)?
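As a quick check of the example above, the following Python sketch computes 7*X and 13*X both directly and through the shared pattern D1. The function names `direct` and `shared` are illustrative only.

```python
def direct(x: int) -> tuple[int, int]:
    f1 = x + (x << 1) + (x << 2)   # 7*x: 2 additions, 2 shifts
    f2 = x + (x << 2) + (x << 3)   # 13*x: 2 additions, 2 shifts
    return f1, f2


def shared(x: int) -> tuple[int, int]:
    d1 = x + (x << 2)              # common pattern "0101": 1 addition, 1 shift
    f1 = d1 + (x << 1)             # 1 addition, 1 shift
    f2 = d1 + (x << 3)             # 1 addition, 1 shift
    return f1, f2


print(direct(3), shared(3))        # (21, 39) (21, 39)
```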

Related Work

Simple bipartite matching (Potkonjak et al., TCAD’95)
  (10101) and (01101) => common pattern = “101”
  (10010) and (010010) => cannot detect pattern “1001”
Recursive Shift and Add (RESANDS) (H. Nguyen et al., TVLSI 2000)
  (10010) and (010010) => common pattern “1001”
Exhaustive enumeration of all digit patterns (Pasko et al., TCAD’99)
  (1011) => “0011”, “1001”, “1010”, “0101”, “1011”

Related Work

Extending techniques for multiple variables

  [Y1]   [a11 a12 a13]   [X1]
  [Y2] = [a21 a22 a23] x [X2]
  [Y3]   [a31 a32 a33]   [X3]

  Y_i = \sum_j S_{ij} X_j + \sum_k C_{ik} D_k

  Y1:  1 0 1 1 0 0
  Y2:  0 1 1 1 0 1
  Y3:  1 0 0 1 0 1

  All distinct S_{ij} X_j and C_{ik} D_k

  Potkonjak et al., TCAD’95

Related Work

Multiple-variable common subexpression elimination (A. Hosangadi et al., ASAP’04)
  Polynomial transformation of linear systems
  Uses rectangle covering methods
Cannot find subexpressions with reversed signs
  e.g. (X1 – X2<<1) ≠ (X2<<1 – X1)
  A common occurrence when signed digits are used
Rectangle covering has exponential complexity
A method to overcome these limitations?

Related Work

Algebraic methods in multi-level logic synthesis (MLLS)
  Reducing literal count in a set of Boolean expressions
  Factoring, decomposition: established algebraic techniques
  Typically used for thousands of variables and literals
Apply these methods to optimize linear systems?

D1 = X1+ X2<<2

Y1 = D1 + D1<<3 + X1<<3

Y2 = D1 + X2<<2

Linear systems and polynomial transformation

View linear systems as a set of arithmetic expressions
  Expressions consisting of +, -, << operators
Develop a methodology for extracting common subexpressions

Polynomial formulation
  C × X = \sum_i (±X L^i), where L^i stands for a left shift by i bits
  (14)10 × X = (1110)2 × X = X<<3 + X<<2 + X<<1 = XL^3 + XL^2 + XL^1
             = (100-10)CSD × X = XL^4 – XL^1
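The polynomial formulation can be illustrated with a small Python sketch that turns a digit string into a list of terms. The term representation `(sign, variable, exponent)` and the helper names are assumptions made for this example and do not come from the slides.

```python
def binary_digits(c: int) -> list[int]:
    """Binary digits of a non-negative constant, least significant first."""
    return [(c >> i) & 1 for i in range(c.bit_length())]


def poly_terms(digits: list[int], var: str = "X") -> list[tuple[int, str, int]]:
    """One (sign, variable, exponent of L) term per non-zero digit."""
    return [(d, var, i) for i, d in enumerate(digits) if d != 0]


def format_terms(terms) -> str:
    out = []
    for sign, var, exp in terms:
        out.append(("+" if sign > 0 else "-") + var + (f"L^{exp}" if exp else ""))
    return " ".join(out)


def evaluate(terms, value: int) -> int:
    """Interpret L^e as multiplication by 2**e (a left shift by e bits)."""
    return sum(sign * value * 2 ** exp for sign, _, exp in terms)


binary_form = poly_terms(binary_digits(14))
csd_form = poly_terms([0, -1, 0, 0, 1])      # (100-10)CSD, least significant first
print(format_terms(binary_form))             # +XL^1 +XL^2 +XL^3
print(format_terms(csd_form))                # -XL^1 +XL^4
print(evaluate(binary_form, 5), evaluate(csd_form, 5))   # 70 70
```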

Linear Systems and polynomial transformation

H.264 Integer Transform

  [Y0]   [1  1  1  1]   [X0]
  [Y1] = [2  1 -1 -2] x [X1]
  [Y2]   [1 -1 -1  1]   [X2]
  [Y3]   [1 -2  2 -1]   [X3]

Decomposing constant multiplications

  Y0 = X0 + X1 + X2 + X3
  Y1 = X0<<1 + X1 - X2 - X3<<1
  Y2 = X0 - X1 - X2 + X3
  Y3 = X0 - X1<<1 + X2<<1 - X3

Cost: 12+, 4<<

Linear Systems and polynomial transformation

H.264 Integer Transform

  [Y0]   [1  1  1  1]   [X0]
  [Y1] = [2  1 -1 -2] x [X1]
  [Y2]   [1 -1 -1  1]   [X2]
  [Y3]   [1 -2  2 -1]   [X3]

Polynomial transformation

  Y0 = X0 + X1 + X2 + X3
  Y1 = X0L + X1 - X2 - X3L
  Y2 = X0 - X1 - X2 + X3
  Y3 = X0 - X1L + X2L - X3

Cost: 12+, 4<<
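A minimal Python sketch of this expansion, assuming each coefficient is written in plain signed binary (not CSD) and each term is stored as a hypothetical `(sign, input index, exponent of L)` tuple:

```python
H264 = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def expand(matrix):
    """Expand every row of a constant matrix into shift/add terms."""
    exprs = []
    for row in matrix:
        terms = []
        for j, c in enumerate(row):
            sign = 1 if c > 0 else -1
            mag, exp = abs(c), 0
            while mag:                       # one term per non-zero bit
                if mag & 1:
                    terms.append((sign, j, exp))
                mag >>= 1
                exp += 1
        exprs.append(terms)
    return exprs


exprs = expand(H264)
additions = sum(len(terms) - 1 for terms in exprs)
shifts = sum(1 for terms in exprs for _, _, exp in terms if exp > 0)
print(additions, shifts)   # 12 4
```

The counts agree with the 12+, 4<< cost stated for the unoptimized transform.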

Fx algorithm

Concurrent Decomposition and Factorization of Boolean Expressions (J. Rajski et al., TCAD’92)
  Popular as the Fast-Extract (Fx) algorithm
  Expression f = gh + r
    g = (ab + c) => double-cube divisor
    g = ab => single-cube divisor
Fx algorithm for linear systems?

Two-term divisors

Obtained from every pair of terms in each expression
  Divide by the minimum exponent of L
  e.g. F = X1 + X2L + X3L^3
    {+X2L, +X3L^3}: divide by L => (X2 + X3L^2)
    Divisors = (X1 + X2L), (X1 + X3L^3), (X2 + X3L^2)
Two divisors intersect if the terms involved are distinct
  (X1 – X2L) ∩ (X1 – X2L) = φ
  (X1 – X2L) ∩ (–X1 + X2L) = φ (reversed signs allowed!)
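Divisor generation can be sketched as follows. The `(sign, variable, exponent)` term tuples and the `normalise` helper are assumptions of this sketch. In particular, treating a divisor and its negation as the same key is one possible way to realise the "reversed signs allowed" remark, not necessarily how the authors implement it.

```python
from itertools import combinations


def two_term_divisors(terms):
    """All two-term divisors of one expression, divided by the minimum power of L."""
    divisors = []
    for a, b in combinations(terms, 2):
        m = min(a[2], b[2])
        divisors.append(((a[0], a[1], a[2] - m), (b[0], b[1], b[2] - m)))
    return divisors


def normalise(divisor):
    """Order the two terms and make the first one positive, so D and -D match."""
    first, second = sorted(divisor, key=lambda t: (t[1], t[2]))
    flip = first[0]
    return tuple((s * flip, v, e) for s, v, e in (first, second))


# F = X1 + X2L + X3L^3
F = [(1, "X1", 0), (1, "X2", 1), (1, "X3", 3)]
for d in two_term_divisors(F):
    print(d)

# (X1 - X2L) and (-X1 + X2L) map to the same key
same = normalise(((1, "X1", 0), (-1, "X2", 1))) == normalise(((-1, "X1", 0), (1, "X2", 1)))
print(same)   # True
```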

Two-term divisors

Theorem: a multiple-term common subexpression exists in a set of expressions iff there is a non-overlapping intersection among the two-term divisors

Many divisors with intersections: which one to choose?
  Use greedy selection of the divisor with the most intersections
Selecting divisors changes the expressions
  Perform concurrent decomposition of expressions

Algorithm (Step 1): Creating the set of divisors

  {Divisors} = φ;
  for each expression Pi {
      {Dnew} = Divisors for Pi;
      {Divisors} = {Divisors} ∪ {Dnew};
      Update frequency statistics of {Divisors};
  }

Algorithm (Step 2): Common Subexpression Elimination

  {Divisors} = Set of all 2-term divisors;
  while (intersections present) {
      Find Best_Divisor in {Divisors};
      {T} = Set of terms involved in intersection;
      {D} = Set of divisors involving any term in {T};
      {Divisors} = {Divisors} – {D};
      Rewrite Expressions;
      {Dnew} = New Divisors involving new terms;
      {Divisors} = {Divisors} ∪ {Dnew};
  }
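The two steps can be put together in a compact Python sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: terms are hypothetical `(sign, variable, exponent)` tuples, the divisor set is rebuilt from scratch in every iteration instead of being updated incrementally, ties are broken by first occurrence, divisor definitions are not themselves searched for further subexpressions, and input variables are assumed not to be named `D0`, `D1`, and so on.

```python
from itertools import combinations


def canonical(a, b):
    """Return (key, shift, sign) such that a + b == sign * L^shift * key."""
    m = min(a[2], b[2])
    pair = sorted([(a[0], a[1], a[2] - m), (b[0], b[1], b[2] - m)],
                  key=lambda t: (t[1], t[2], t[0]))
    flip = pair[0][0]
    return tuple((s * flip, v, e) for s, v, e in pair), m, flip


def find_occurrences(exprs):
    """Step 1: map every two-term divisor to the places where it occurs."""
    occ = {}
    for ei, terms in enumerate(exprs):
        for i, j in combinations(range(len(terms)), 2):
            key, m, flip = canonical(terms[i], terms[j])
            occ.setdefault(key, []).append((ei, i, j, m, flip))
    return occ


def non_overlapping(hits):
    """Keep only occurrences that do not share a term with an earlier one."""
    used, chosen = set(), []
    for ei, i, j, m, flip in hits:
        if (ei, i) not in used and (ei, j) not in used:
            used.update({(ei, i), (ei, j)})
            chosen.append((ei, i, j, m, flip))
    return chosen


def eliminate(exprs):
    """Step 2: greedily extract the divisor with the most intersections."""
    exprs = [list(e) for e in exprs]
    divisors = {}
    while True:
        best_key, best = None, []
        for key, hits in find_occurrences(exprs).items():
            chosen = non_overlapping(hits)
            if len(chosen) > len(best):
                best_key, best = key, chosen
        if len(best) < 2:                    # no intersections left
            return exprs, divisors
        name = f"D{len(divisors)}"
        divisors[name] = list(best_key)
        for ei in {h[0] for h in best}:      # rewrite the affected expressions
            drop = {k for h in best if h[0] == ei for k in (h[1], h[2])}
            kept = [t for k, t in enumerate(exprs[ei]) if k not in drop]
            exprs[ei] = kept + [(h[4], name, h[3]) for h in best if h[0] == ei]


def evaluate(terms, values):
    return sum(s * values[v] * 2 ** e for s, v, e in terms)


def count_ops(exprs, divisors):
    groups = list(exprs) + list(divisors.values())
    adds = sum(len(g) - 1 for g in groups)
    shifts = sum(1 for g in groups for _, _, e in g if e > 0)
    return adds, shifts


# H.264 integer transform in polynomial form
H264 = [
    [(1, "X0", 0), (1, "X1", 0), (1, "X2", 0), (1, "X3", 0)],
    [(1, "X0", 1), (1, "X1", 0), (-1, "X2", 0), (-1, "X3", 1)],
    [(1, "X0", 0), (-1, "X1", 0), (-1, "X2", 0), (1, "X3", 0)],
    [(1, "X0", 0), (-1, "X1", 1), (1, "X2", 1), (-1, "X3", 0)],
]
outputs, divisors = eliminate(H264)
print(count_ops(outputs, divisors))   # (8, 2)
```

On the H.264 example this sketch extracts four divisors and reaches 8 additions/subtractions and 2 shifts. The numbering of the extracted divisors can differ from the slides because the tie-breaking rule here is arbitrary.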

Algorithm complexity

M x M constant matrix; N digits of precision

  Y0      1111 1111 1011 1001
  Y1      ...
  ...
  YM-1    1111 1110 0011 1010

  (M constants per row, MN digits per row)

  Y0 = X0 + X0L + ... + XM-1L^3 + XM-1

O(MN) terms per expression => O(M^2 N^2) divisors
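The divisor count follows from choosing two terms out of at most MN. As a worked bound, assuming the worst case in which every one of the N digits of every constant is non-zero, and taking the 8x8, 16-digit matrices of the experiments as the numeric example:

```latex
\binom{MN}{2} = \frac{MN\,(MN-1)}{2} = O(M^2 N^2),
\qquad
M = 8,\ N = 16:\quad \binom{128}{2} = \frac{128 \cdot 127}{2} = 8128 .
```

Over all M = 8 expressions this gives at most 8 × 8128 = 65024 candidate divisors. CSD constants have fewer non-zero digits, so the actual count is lower.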

Algorithm (Step 1): complexity

  O(M^2 N^2) distinct divisors
  O(M^2 N^2) divisors generated per expression
  O(M^3 N^2) over all M expressions

Algorithm (Step 2): complexity

  Annotated costs: O(M^2 N^2), O(M^2 N^2)

Algorithm

H.264 example

>> Select D0 = (X0 + X3)

Y0 = X0 + X1 + X2 + X3

Y1 = X0L + X1 - X2 - X3L

Y2 = X0 - X1 - X2 + X3

Y3 = X0 - X1L + X2L - X3


Algorithm

H.264 example

>> Select D1 = (X1 – X2)

Y0 = D0 + X1 + X2

Y1 = X0L + X1 - X2 - X3L

Y2 = D0 - X1 - X2

Y3 = X0 - X1L + X2L - X3


Algorithm

H.264 example

>> Select D2 = (X1 + X2)

Y0 = D0 + X1 + X2

Y1 = X0L + D1 - X3L

Y2 = D0 - X1 - X2

Y3 = X0 - D1L - X3


Algorithm

H.264 example

>> Select D3 = (X0 – X3)

Y0 = D0 + D2

Y1 = X0L + D1 - X3L

Y2 = D0 - D2

Y3 = X0 - D1L - X3


Final Implementation

Extracting 4 divisors

  D0 = X0 + X3    Y0 = D0 + D2
  D1 = X1 – X2    Y1 = D1 + D3L
  D2 = X1 + X2    Y2 = D0 – D2
  D3 = X0 – X3    Y3 = D3 – D1L

Cost
  Original: 12+, 4<<
  Rectangle covering: 10+, 3<<
  Two-term CSE: 8+, 2<<
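The final implementation can be checked against the original matrix with a short Python sketch. The function names are illustrative, and `L` is taken to mean a left shift by one bit.

```python
H264 = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def transform_direct(x):
    """Reference: plain matrix-vector product."""
    return [sum(c * xi for c, xi in zip(row, x)) for row in H264]


def transform_shared(x):
    """Implementation with the four extracted divisors: 8 add/sub, 2 shifts."""
    x0, x1, x2, x3 = x
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return [d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)]


sample = [3, -1, 4, 2]
print(transform_direct(sample), transform_shared(sample))
```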

Experimental Setup

Goal
  Reduction in the number of additions/subtractions
  Effect on area/latency after synthesis
  Simulate designs to estimate power consumption
Transforms
  DCT, IDCT, DFT, DST, DHT
  8x8 constant matrices
  16 digits of precision (CSD representation)
Compare with
  Potkonjak (TCAD’95)
  RESANDS (Nguyen et al., TVLSI 2000)
  Rectangle Covering (A. Hosangadi et al., ASAP’04)

Experimental Results

Number of additions/subtractions

  Example    Original (I)  Potkonjak (II)  RESANDS (III)  Rectangle Covering (IV)  Two-term CSE (V)
  DCT        274           202             227            174                      153
  IDCT       242           183             222            162                      143
  RealDFT    253           193             208            165                      144
  ImagDFT    207           178             198            134                      124
  DST        320           238             252            200                      187
  DHT        284           209             211            175                      158
  Average    263.3         200.5           219.7          168.3                    151.5
  Run time                                                0.81s                    0.08s

Experimental results

Synthesis results (minimum latency constraints)

             Area (library units)          Latency (clock cycles)
  Example    (III)     (IV)     (V)        (III)   (IV)    (V)
  DCT        90667     73311    66759      10      11      10
  IDCT       81868     66864    62883      10      11      10
  R-DFT      90496     69827    64026      10      11      10
  I-DFT      75140     55940    54606      10      10      10
  DST        108101    84715    81214      11      11      11
  DHT        93939     71272    67775      11      11      10
  Average    90110     70322    66211      10.3    10.8    10.2

  (III) RESANDS   (IV) Rect. Covering   (V) 2-term CSE

Experimental results

Power consumption (µW)

  Example    (III)    (IV)     (V)
  DCT        729      504      531
  IDCT       662      547      569
  R-DFT      707      544      554
  I-DFT      644      575      490
  DST        607      718      595
  DHT        598      545      527
  Average    657.8    572.2    544.3

  (III) RESANDS   (IV) Rect. Covering   (V) 2-term CSE

Conclusions

A new technique for eliminating common subexpressions in linear systems
  Fewer operations than known methods
  Much faster than rectangle covering
Combine with scheduling on given resources

Thank you
Questions?
