
FPGA Multipliers

Bogdan PASCA

projet Arenaire, ENS-Lyon/INRIA/CNRS/Universite de Lyon, France

RAIM’11, February 7-10, 2011

Outline

Background & Context

Algorithmic techniques for reducing DSP count of large multipliers
  Karatsuba-Ofman algorithm
  Non-Standard tilings
  Squarers
  Truncated multipliers

Conclusions


What’s an FPGA?

Field Programmable Gate Array

integrated circuit

has a regular architecture (hence array)

logic elements can be programmed to perform various functions


Modern FPGA Architecture

a set of configurable logic elements

on-chip memory blocks

digital signal processing (DSP) blocks (including multipliers)

connected by a configurable wire network

all connected to the outside world by I/O pins



[Figure: modern FPGA fabric, built up over several slides: columns of RAM blocks, DSP blocks and LUT-based logic elements; the last build details a DSP block with 18-bit inputs and a 17-bit shift]







What can we compute?

[Figure: a 3-bit by 2-bit multiplier built from LUTs. LUTs compute the partial-product bits l0..l2 and u0..u2 from x2x1x0 and y1y0; three full adders (FA) sum them into p4p3p2p1p0]

          x2 x1 x0
  ×          y1 y0
  ----------------
          l2 l1 l0
  +    u2 u1 u0
  ----------------
    p4 p3 p2 p1 p0

l0 = y0 ∧ x0

l1 = y0 ∧ x1

l2 = y0 ∧ x2

u0 = y1 ∧ x0

u1 = y1 ∧ x1

u2 = y1 ∧ x2
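The 3-bit by 2-bit example above can be sketched in software as follows. This is only an illustration of the bit-level structure (AND gates for the partial products, three full adders for the sum); the function names `full_adder` and `multiply_3x2` are hypothetical and not taken from the slides.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum bit, carry out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return sum_bit, carry_out


def multiply_3x2(x: int, y: int) -> int:
    """Multiply a 3-bit x by a 2-bit y using only AND gates and full adders."""
    x0, x1, x2 = x & 1, (x >> 1) & 1, (x >> 2) & 1
    y0, y1 = y & 1, (y >> 1) & 1

    # Partial products, exactly as on the slide.
    l0, l1, l2 = y0 & x0, y0 & x1, y0 & x2
    u0, u1, u2 = y1 & x0, y1 & x1, y1 & x2

    # Row u is shifted left by one position relative to row l.
    p0 = l0
    p1, carry1 = full_adder(l1, u0, 0)
    p2, carry2 = full_adder(l2, u1, carry1)
    p3, p4 = full_adder(0, u2, carry2)

    return p0 | (p1 << 1) | (p2 << 2) | (p3 << 3) | (p4 << 4)
```

Checking `multiply_3x2(x, y) == x * y` for all 8 × 4 input pairs exercises every path through the three adders.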


The need for DSP blocks

Multiplication in logic is expensive

n × n bit ≈ $n^2$ (partial products) + $n(n-1)$ (adder tree) LUTs

18 × 18 bit ≈ 324 LUTs + 306 LUTs = 630 LUTs

1 DSP block = 8 LEs (size on FPGA layout)

DSP blocks are a necessity in modern FPGAs
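The cost estimate above is simple enough to tabulate. The helper below is a minimal sketch of that formula only; the name `lut_multiplier_cost` is hypothetical, and real synthesis results will differ from this first-order count.

```python
def lut_multiplier_cost(n: int) -> tuple[int, int, int]:
    """First-order LUT count of an n x n multiplier built in logic.

    Returns (partial-product LUTs, adder-tree LUTs, total).
    """
    partial_products = n * n
    adder_tree = n * (n - 1)
    return partial_products, adder_tree, partial_products + adder_tree
```

For `n = 18` this returns the 324 + 306 = 630 LUTs quoted on the slide, against a silicon footprint of about 8 logic elements for the DSP block.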

[Figure: DSP block diagram with 18-bit inputs A and B, input C, output P, 48-bit datapaths and 17-bit shifts]




DSP-Hungry Applications

FPGA floating point performance – a pencil and paper evaluation [1]

→ DSP blocks are a scarce resource for accelerating DP apps.

Efficient reconfigurable design for pricing asian options [2]

→ LUTs 46%, RAM 4%, DSP 100% (192)

Implementation and evaluation of an arithmetic pipeline on FLOPS-2D: multi-FPGA system [3]

→ (a) LE 30%, DSP 86%, (b) LE 52%, DSP 88%, (c) LE 63%, DSP 100%

A temporal coding hardware implementation for spiking neural networks [4]

→ 16 PE: LE 22%, RAM 3%, DSP 74% (100/136)

Four recipes for saving DSPs

[1] D. Strenski (HPCWire, 2007)
[2] Anson H.T. Tse, David B. Thomas, K. H. Tsoi, Wayne Luk (HEART’10)
[3] H. Morisita, K. Inakagata, Y. Osana, N. Fujita, H. Amano (HEART’10)
[4] Marco Nuno-Maganda, Cesar Torres-Huitzil (HEART’10)












Perceiving Multiplications Visually

[Figure: the partial products of X × Y arranged in a diamond and summed (∑)]

classical binary multiplication

all sub-products can be properly located inside the diamond

rotate the diamond so as to obtain a rectangle



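[Figure: a 6-bit × 6-bit multiplication board with axis ticks at 0, 3 and 5. Highlighted sub-products: $X_{2:0}Y_{2:0}$, $X_{5:3}Y_{5:3}$ and $X_{3:1}Y_{4:3}$. The board is then split into the four tiles $X_0Y_0$, $X_1Y_0$, $X_0Y_1$ and $X_1Y_1$]

$XY = X_0 Y_0 + 2^3 X_0 Y_1 + 2^3 X_1 Y_0 + 2^{3+3} X_1 Y_1$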





Karatsuba-Ofman algorithm

trading multiplications for additions


The Karatsuba-Ofman algorithm

Basic principle for two way splitting

split X and Y into two chunks:

$X = 2^k X_1 + X_0$ and $Y = 2^k Y_1 + Y_0$

computation goal: $XY = 2^{2k} X_1 Y_1 + 2^k (X_1 Y_0 + X_0 Y_1) + X_0 Y_0$

precompute $D_X = X_1 - X_0$ and $D_Y = Y_1 - Y_0$

make the observation: $X_1 Y_0 + X_0 Y_1 = X_1 Y_1 + X_0 Y_0 - D_X D_Y$

XY requires only 3 DSP blocks ($X_1 Y_1$, $X_0 Y_0$, $D_X D_Y$)

overhead: two k-bit subtractions and one 2k-bit subtraction

overhead ≪ DSP-block emulation
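The two-way splitting above can be checked with a few lines of integer arithmetic. This is a software sketch of the identity, not the hardware design: the function name `karatsuba_two_way` is hypothetical, and the default `k = 17` is an assumption borrowed from the 17-bit chunks used later in the talk.

```python
def karatsuba_two_way(x: int, y: int, k: int = 17) -> int:
    """Multiply x and y with three sub-products instead of four."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k          # X = 2^k * X1 + X0
    y0, y1 = y & mask, y >> k          # Y = 2^k * Y1 + Y0

    dx = x1 - x0                       # may be negative
    dy = y1 - y0                       # may be negative

    high = x1 * y1                     # sub-product 1
    low = x0 * y0                      # sub-product 2
    diff = dx * dy                     # sub-product 3 (signed)

    middle = high + low - diff         # equals X1*Y0 + X0*Y1
    return (high << (2 * k)) + (middle << k) + low
```

Note that `dx` and `dy` are signed, so the third product is a signed multiplication; in this sketch Python's integers absorb that detail.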













Visual Interpretation

[Figure: the Karatsuba-Ofman algorithm drawn on the multiplication board, built up over several slides for operands split into two chunks (X0, X1 and Y0, Y1), three chunks (X0..X2 and Y0..Y2) and four chunks (X0..X3 and Y0..Y3)]


Implementation

fairly trivial starting from the equation:

$XY = 2^{2k} X_1 Y_1 + 2^k (X_1 Y_1 + X_0 Y_0 - D_X D_Y) + X_0 Y_0$

[Figure: datapath built from DSP48 blocks, fed with the chunks X0, X1, Y0 and Y1. Labelled intermediate values: X0Y0, X0Y0 − DXDY, X1Y1 + X0Y0 − DXDY and X1Y1; output P]

34x34-bit multiplier using Virtex-4 DSP48

X1Y1 + X0Y0 − DXDY is implemented inside the DSPs

need to recover X1Y1 with a subtraction


Results

            latency   frequency (MHz)   slices [5]   DSPs
LogiCore       6           447              26         4
LogiCore       3           176              34         4
K-O-2          3           317              95         3

Table: 34x34-bit multipliers on Virtex-4

trade-off: 1 DSP (> 630 Logic Elements) for 138 Logic Elements

            latency   frequency (MHz)   slices   DSPs
LogiCore      11           353            185      9
LogiCore       6           264            122      9
K-O-3          6           317            331      6

Table: 51x51-bit multipliers on Virtex-4

trade-off: 3 DSPs (> 1890 Logic Elements) for 292 Logic Elements

[5] On Virtex-4 devices, 1 slice = 2 Logic Elements

Non-Standard tilings

new multiplication algorithms


Non-standard tilings

optimize use of rectangular multipliers on Virtex5 (25x18 signed)

classical decomposition may produce suboptimal results

chunk size for X is 24

chunk size for Y is 17

translate the operand decomposition into a tiling problem







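[Figure: a 6-bit × 6-bit multiplication board with axis ticks at 0, 1, 3 and 5, built up over several slides. Highlighted sub-products: $X_{2:0}Y_{2:0}$, $X_{5:3}Y_{5:3}$ and $X_{3:1}Y_{4:3}$. The board is then covered by five non-standard tiles]

$XY = 2^{3+1} X_{3:1} Y_{4:3} + 2^{1+5} X_{3:1} Y_{5} + 2^{3} X_{0} Y_{5:3} + X_{3:0} Y_{2:0} + 2^{4} X_{5:4} Y_{5:0}$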
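The 6-bit tiling above can be checked mechanically: each tile contributes the product of two bit-fields, shifted by the sum of their low indices. The sketch below is an illustration under that reading; the names `bits`, `TILES_6X6`, `tiled_product` and `covers_board_exactly` are hypothetical.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return the bit-field value[hi:lo], both ends inclusive."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


# Each tile is (x_hi, x_lo, y_hi, y_lo), read off the equation above.
TILES_6X6 = [
    (3, 1, 4, 3),  # 2^(3+1) * X[3:1] * Y[4:3]
    (3, 1, 5, 5),  # 2^(1+5) * X[3:1] * Y[5]
    (0, 0, 5, 3),  # 2^3     * X[0]   * Y[5:3]
    (3, 0, 2, 0),  #           X[3:0] * Y[2:0]
    (5, 4, 5, 0),  # 2^4     * X[5:4] * Y[5:0]
]


def tiled_product(x: int, y: int, tiles) -> int:
    """Sum the shifted sub-products of all tiles."""
    total = 0
    for x_hi, x_lo, y_hi, y_lo in tiles:
        sub_product = bits(x, x_hi, x_lo) * bits(y, y_hi, y_lo)
        total += sub_product << (x_lo + y_lo)
    return total


def covers_board_exactly(tiles, n: int) -> bool:
    """True if every cell of the n x n board lies in exactly one tile."""
    count = [[0] * n for _ in range(n)]
    for x_hi, x_lo, y_hi, y_lo in tiles:
        for i in range(x_lo, x_hi + 1):
            for j in range(y_lo, y_hi + 1):
                count[i][j] += 1
    return all(c == 1 for row in count for c in row)
```

A tiling is valid exactly when `covers_board_exactly` holds; in that case `tiled_product(x, y, TILES_6X6)` equals `x * y` for every pair of 6-bit operands.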


Tilings

Performing a 53× 53-bit multiplication on Virtex5

[Figure: three tilings of the multiplication board: (a) standard tiling, (b) Logicore tiling, (c) proposed tiling with tiles M1 to M8 and tile boundaries at 0, 17, 24, 34, 41 and 58]

standard tiling ≡ classical decomposition (12 DSPs)

Logicore 11.1 tiling uses 10 DSPs (4 DSPs used as 17x17-bit)

our proposed tiling does it in 8 DSPs and a few LUTs


Tiling Architecture - 53x53bit

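[Figure: the proposed tiling, with tiles M1 to M8 and tile boundaries at 0, 17, 24, 34, 41 and 58]

$$
\begin{aligned}
XY ={} & X_{0:23} Y_{0:16} && (M1) \\
 & + 2^{17} \big( X_{0:23} Y_{17:33} && (M2) \\
 & \quad + 2^{17} \big( X_{0:16} Y_{34:57} && (M3) \\
 & \quad\quad + 2^{17} X_{17:33} Y_{34:57} \big)\big) && (M4) \\
 & + 2^{24} \big( X_{24:40} Y_{0:23} && (M8) \\
 & \quad + 2^{17} \big( X_{41:57} Y_{0:23} && (M7) \\
 & \quad\quad + 2^{17} \big( X_{34:57} Y_{24:40} && (M6) \\
 & \quad\quad\quad + 2^{17} X_{34:57} Y_{41:57} \big)\big)\big) && (M5) \\
 & + 2^{48} X_{24:33} Y_{24:33}
\end{aligned}
$$

$X_{24:33} Y_{24:33}$ (10x10 multiplier) is probably best implemented in LUTs.

this parenthesization makes best use of the DSP48E internal adders (17-bit shifts)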
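The nested form above can be mirrored directly in software, with each `<< 17` standing in for the 17-bit shift of a DSP48E cascade. This is a sketch for checking the equation only; the names `bits` and `tiled_58x58` are hypothetical, and bit ranges are written low:high as in the equation.

```python
def bits(value: int, lo: int, hi: int) -> int:
    """Return the bit-field value[lo:hi], both ends inclusive."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def tiled_58x58(x: int, y: int) -> int:
    """58x58-bit product from eight 24x17 tiles plus one 10x10 tile."""
    m1 = bits(x, 0, 23) * bits(y, 0, 16)
    m2 = bits(x, 0, 23) * bits(y, 17, 33)
    m3 = bits(x, 0, 16) * bits(y, 34, 57)
    m4 = bits(x, 17, 33) * bits(y, 34, 57)
    m5 = bits(x, 34, 57) * bits(y, 41, 57)
    m6 = bits(x, 34, 57) * bits(y, 24, 40)
    m7 = bits(x, 41, 57) * bits(y, 0, 23)
    m8 = bits(x, 24, 40) * bits(y, 0, 23)
    center = bits(x, 24, 33) * bits(y, 24, 33)  # 10x10 tile, LUT candidate

    # Two chains of additions, each step shifting by 17 bits.
    chain_a = m1 + ((m2 + ((m3 + (m4 << 17)) << 17)) << 17)
    chain_b = (m8 + ((m7 + ((m6 + (m5 << 17)) << 17)) << 17)) << 24

    return chain_a + chain_b + (center << 48)
```

Every tile other than the central one has one 24-bit and one 17-bit operand, which is what lets it map onto a single asymmetric multiplier.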


Tiling Results

58x58 multipliers on Virtex-5 (5vlx50ff676-3) [6]

            latency   Freq. (MHz)   REGs   LUTs   DSPs
LogiCore      14         440        300    249     10
LogiCore       8         338        208    133     10
LogiCore       4          95        208     17     10
Tiling         4         366        247    388      8

Remarks

saves 2 DSP48E for a few LUTs/REGs

large latency saving at a comparable frequency

good use of internal adders due to the 17-bit shifts

[6] Results for 53 bits are almost identical

Squarers

simple methods to save resources


Squarers

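appear in norms, statistical computations, polynomial evaluation...

a dedicated squarer saves as many DSP blocks as the Karatsuba-Ofman algorithm, but without its overhead*.

Squaring with k = 17 on a Virtex-4

[Figure: squaring boards for two and three chunks. The diagonal tiles $X_0^2$, $X_1^2$, $X_2^2$ appear once; the off-diagonal tiles $X_0X_1$, $X_0X_2$, $X_1X_2$ each appear twice]

$(2^k X_1 + X_0)^2 = 2^{2k} X_1^2 + 2 \cdot 2^k X_1 X_0 + X_0^2$

$(2^{2k} X_2 + 2^k X_1 + X_0)^2 = 2^{4k} X_2^2 + 2^{2k} X_1^2 + X_0^2 + 2 \cdot 2^{3k} X_2 X_1 + 2 \cdot 2^{2k} X_2 X_0 + 2 \cdot 2^k X_1 X_0$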
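The two expansions above follow one pattern: each diagonal term is computed once, and each off-diagonal term is computed once and doubled. A small sketch of that pattern, which also counts the sub-products, is shown below; the names `split_chunks` and `square_by_chunks` are hypothetical.

```python
def split_chunks(x: int, k: int, count: int) -> list[int]:
    """Split x into `count` chunks of k bits, least significant first."""
    mask = (1 << k) - 1
    return [(x >> (i * k)) & mask for i in range(count)]


def square_by_chunks(x: int, k: int, count: int) -> tuple[int, int]:
    """Square x (assumed to fit in k*count bits).

    Returns (x squared, number of chunk-sized multiplications used).
    """
    chunks = split_chunks(x, k, count)
    total = 0
    products = 0
    for i in range(count):
        # Diagonal term X_i^2, weight 2^(2ik).
        total += (chunks[i] * chunks[i]) << (2 * i * k)
        products += 1
        for j in range(i + 1, count):
            # Off-diagonal term, computed once and doubled.
            total += (2 * chunks[i] * chunks[j]) << ((i + j) * k)
            products += 1
    return total, products
```

With `k = 17`, two chunks use 3 products and three chunks use 6, compared with 4 and 9 for a generic multiplier of the same size.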




However ...

$(2^k X_1 + X_0)^2 = 2^{34} X_1^2 + 2^{18} X_1 X_0 + X_0^2$ (with $k = 17$)

the previous equation requires shifts of 0, 18 and 34 bits

the DSP48 blocks of the Virtex-4 only allow shifts of 17 bits, so their internal adders would go unused

Workaround for ≤ 33-bit multiplications

rewrite the equation:

$(2^{17} X_1 + X_0)^2 = 2^{34} X_1^2 + 2^{17} (2 X_1) X_0 + X_0^2$

compute $2X_1$ by shifting $X_1$ left by one bit before feeding it into the DSP48 block
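The rewritten equation can be sketched as an accumulation in which every term sits at a multiple of 17 bits. The reading of the 33-bit limit used here is an assumption of this sketch: with a 17-bit low chunk, the high chunk has at most 16 bits, so its doubled value still fits in 17 bits. The function name `square_33bit_shift17` is hypothetical.

```python
def square_33bit_shift17(x: int) -> int:
    """Square an operand of at most 33 bits using only shifts of 0, 17 and 34."""
    assert 0 <= x < (1 << 33), "operand must fit in 33 bits"

    x0 = x & ((1 << 17) - 1)   # low 17 bits
    x1 = x >> 17               # high chunk, at most 16 bits
    doubled_x1 = x1 << 1       # 2 * X1, still fits in 17 bits
    assert doubled_x1 < (1 << 17)

    accumulator = x0 * x0                      # shift 0
    accumulator += (doubled_x1 * x0) << 17     # shift 17
    accumulator += (x1 * x1) << 34             # shift 34 = 2 * 17
    return accumulator
```

The factor of two moves from the shift amount into one operand, so the three terms line up with the 17-bit steps of the cascade.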




Results – 32-bit and 53-bit squarers on Virtex-4

bits              latency   frequency (MHz)   slices   DSPs
32    LogiCore       6           489             59      4
32    LogiCore       3           176             34      4
32    Squarer        3           317             18      3
53    LogiCore      18           380            279     16
53    LogiCore       7           176            207     16
53    Squarer        7           317            332      6

DSPs saved without any overhead

impressive 10 DSPs saved for double precision squarer


Squarers on Virtex5 using tilings

the tiling technique can be extended to squaring

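[Figure: two tilings of the squaring board with tiles M1 to M6 (left) and M1 to M5 (right); axis ticks at 0, 17, 36 and 53 (left) and 0, 19, 24, 36, 41 and 53 (right); darker squares mark areas covered twice]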

Issues

the darker squares are computed twice and thus need to be removed.

thanks to symmetry, a diagonal multiplication of size n should consume only $n(n+1)/2$ LUTs instead of $n^2$.


Truncated multipliers



Classical technique

reduce resources, delay, or power consumption

controlled accuracy degradation

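[Figure: multiplication board of A × B summed (∑) into the result; the d least-significant columns are discarded, the result is kept down to weight $2^k$ (n − k bits); u and v label parts of the board]

remove some of the least-significant d columns

keep the error smaller than $2^k$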
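Column truncation can be simulated at the bit level: every partial-product bit whose weight falls in the d lowest columns is simply not added. The sketch below is a software illustration only; the names `truncated_product` and `worst_case_truncation_error` are hypothetical, and the worst-case sum assumes the operands have at least d bits, so that column c holds c + 1 partial-product bits.

```python
def truncated_product(a: int, b: int, n: int, d: int) -> int:
    """n x n-bit product that ignores partial-product bits in columns below d."""
    total = 0
    for i in range(n):
        for j in range(n):
            in_kept_column = (i + j) >= d
            bit_is_set = ((a >> i) & 1) and ((b >> j) & 1)
            if in_kept_column and bit_is_set:
                total += 1 << (i + j)
    return total


def worst_case_truncation_error(d: int) -> int:
    """Largest value the d discarded columns can hold (assumes n >= d)."""
    return sum(i * 2 ** (i - 1) for i in range(1, d + 1))
```

The truncated result never exceeds the exact product, and the gap reaches `worst_case_truncation_error(d)` when both operands are all ones.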







Error budget

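[Figure: truncated multiplication board of A × B with the d discarded columns and the result kept down to weight $2^k$]

$E_{total} = E_{approx} + E_{round} \le 2^k$

$E_{round}$ – caused by rounding the (n − d)-bit result to n − k bits
  use a compensation bit to center the error
  round to nearest bounds $E_{round} \le 2^{k-1}$

$E_{approx}$ – caused by the truncation of the d columns
  $0 \le E_{approx} \le \sum_{i=1}^{d} i \, 2^{i-1}$
  $E_{approx} < 2^{k-1} \rightarrow d = f(k)$

Precision    k     Discarded (d)
Single       23    18
Double       52    46
Quadruple    112   105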
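The relation d = f(k) can be explored numerically. The sketch below searches for the largest d whose truncation error stays under $2^{k-1}$. One assumption should be stated loudly: the table values (18, 46, 105) are reproduced only if the truncation error is taken to be centered, so that its magnitude is half of the worst-case sum; applying the bound to the uncentered sum gives values one smaller. That centering is an interpretation made for this sketch, and the function names are hypothetical.

```python
def worst_case_truncation_error(d: int) -> int:
    """Upper bound on the value held by the d discarded columns."""
    return sum(i * 2 ** (i - 1) for i in range(1, d + 1))


def max_discarded_columns(k: int, centered: bool = True) -> int:
    """Largest d such that the truncation error magnitude is below 2^(k-1).

    centered=True  : ASSUMPTION, the error is centered so its magnitude
                     is half the worst-case sum.
    centered=False : the worst-case sum itself must stay below 2^(k-1).
    """
    budget_times_two = 2 ** k          # 2 * 2^(k-1), keeps everything integer
    d = 0
    while True:
        error = worst_case_truncation_error(d + 1)
        magnitude_times_two = error if centered else 2 * error
        if magnitude_times_two >= budget_times_two:
            return d
        d += 1
```

Under the centered assumption, `max_discarded_columns(23)`, `max_discarded_columns(52)` and `max_discarded_columns(112)` give 18, 46 and 105.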




Tiling the truncated board

[Figure: three candidate tilings of the truncated multiplication board using tiles M1 to M4, with the labels k and d as on the previous slides]

Sol 1: tile and discard columns (saves additions)
  wastes DSPs

Sol 2: use a softcore multiplier (trade a DSP for logic)

Best: tile with softcore multipliers so that $E_{approx} \le 2^{k-1}$
  use the extra precision for free








Reality Check – faithful rounding

Mantissa multipliers for SP, DP, QP on Virtex-4 and Virtex-5

FPGA      Prec.   Latency, Freq.         Resources
Virtex5   DP      6 cycles @ 414MHz      320LUT 302REG 5DSP
          QP      20 cycles @ 334MHz     2497LUT 2321REG 19DSP
          QP      14 cycles @ 245MHz     2249LUT 1576REG 19DSP
Virtex4   DP      11 cycles @ 368MHz     358sl. 7DSP
          QP      21 cycles @ 368MHz     1735sl. 26DSP

Virtex4
  DP: reduces DSPs from 10 to 7 while also reducing slice count
  QP: reduces DSPs from 49 to 26 without any slice penalty

Virtex5
  DP: reduces DSPs from 6 to 5, using roughly half the LUTs and REGs
  QP: reduces DSPs from 34 to 19 at a small increase in logic resources


Another point of view

(wE, wF) correctly rounded has the same accuracy as (wE, wF + 1) faithfully rounded
→ in FPGAs the extra bit comes for free*

truncate multipliers when IEEE-754 compliance is not needed

function approximation by polynomial evaluation

log2(1 + x) (53-bit)

default                                     27 DSPs
optimized Horner                            23 DSPs
optimized Horner + truncated multipliers    11* DSPs




Conclusion

save DSPs by exploiting the flexibility of the FPGA

Karatsuba-Ofman reduces DSP cost at small price in logic elements

tiling techniques adapt better to asymmetric DSPs

dedicated squarers significantly reduce DSP count

control accuracy and save DSPs using truncated multipliers


Thank you for your attention!

http://flopoco.gforge.inria.fr/

Questions?

