FPGA Multipliers
Bogdan PASCA
projet Arénaire, ENS-Lyon/INRIA/CNRS/Université de Lyon, France
RAIM’11, February 7-10, 2011
Outline
Background & Context
Algorithmic techniques for reducing DSP count of large multipliers
  Karatsuba-Ofman algorithm
  Non-standard tilings
  Squarers
  Truncated multipliers
Conclusions
Bogdan PASCA FPGA Multipliers 1
What’s an FPGA?
Field Programmable Gate Array
integrated circuit
has a regular architecture (hence array)
logic elements can be programmed to perform various functions
Modern FPGA Architecture
a set of configurable logic elements
on-chip memory blocks
digital signal processing (DSP) blocks (including multipliers)
connected by a configurable wire network
all connected to the outside world by I/O pins
[Figure: FPGA floorplan with columns of RAM blocks and DSP blocks among LUT-based logic elements; the DSP block detail shows two 18-bit inputs and a 17-bit shift]
What can we compute?
[Figure: a 3-bit by 2-bit multiplier mapped onto the fabric: LUTs compute the partial-product bits l0..l2 and u0..u2, and full adders (FA) sum them into the product bits p0..p4]
         x2 x1 x0
  ×         y1 y0
  ---------------
         l2 l1 l0
  +   u2 u1 u0
  ---------------
   p4 p3 p2 p1 p0
l0 = y0 ∧ x0
l1 = y0 ∧ x1
l2 = y0 ∧ x2
u0 = y1 ∧ x0
u1 = y1 ∧ x1
u2 = y1 ∧ x2
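The 3-bit by 2-bit example can be modelled in software to check that AND gates plus a short adder chain really produce the product. The following is only a bit-level sketch: the names `full_adder` and `mul3x2` are illustrative, and the mapping of each line to a physical LUT or carry cell is an assumption, not something taken from the slides.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def mul3x2(x, y):
    """Multiply a 3-bit x by a 2-bit y with AND gates and full adders only."""
    xb = [(x >> i) & 1 for i in range(3)]
    yb = [(y >> i) & 1 for i in range(2)]

    # Partial-product bits, one AND gate each (one LUT each in the figure).
    l = [yb[0] & xb[i] for i in range(3)]  # row for y0, weight 1
    u = [yb[1] & xb[i] for i in range(3)]  # row for y1, weight 2

    # Sum l + 2*u column by column with a ripple-carry chain.
    p0 = l[0]
    p1, c = full_adder(l[1], u[0], 0)
    p2, c = full_adder(l[2], u[1], c)
    p3, p4 = full_adder(0, u[2], c)

    return p0 | (p1 << 1) | (p2 << 2) | (p3 << 3) | (p4 << 4)
```

An exhaustive check over the 8 × 4 input pairs is cheap and confirms the model against ordinary integer multiplication.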
Need of DSP blocks
Multiplication in logic is expensive
$n \times n$ bit $\approx n^2$ LUTs (partial products) $+\ n(n-1)$ LUTs (adder tree)
$18 \times 18$ bit $\approx 324 + 306 = 630$ LUTs
1 DSP block = 8 LEs (size on FPGA layout)
DSP blocks are a necessity in modern FPGAs
[Figure: DSP block with 18-bit inputs A and B, 48-bit paths C and P, and 17-bit shifts]
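The cost estimate above is easy to tabulate for other operand sizes. The helper below is a small sketch of that formula only; the name `lut_cost` is illustrative, and real synthesis results will differ from this first-order count.

```python
def lut_cost(n):
    """First-order LUT estimate for an n x n-bit multiplier built in logic.

    Returns (partial_products, adder_tree, total), following the
    n^2 + n(n-1) estimate quoted on the slide.
    """
    partial_products = n * n
    adder_tree = n * (n - 1)
    return partial_products, adder_tree, partial_products + adder_tree


print(lut_cost(18))  # (324, 306, 630)
```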
DSP-Hungry Applications
FPGA floating point performance – a pencil and paper evaluation [1]
→ DSP blocks are a scarce resource for accelerating double-precision (DP) applications.
Efficient reconfigurable design for pricing Asian options [2]
→ LUTs 46%, RAM 4%, DSP 100% (192)
Implementation and evaluation of an arithmetic pipeline on FLOPS-2D: multi-FPGA system [3]
→ a) LE 30%, DSP 86%, b) LE 52%, DSP 88%, c) LE 63%, DSP 100%
A temporal coding hardware implementation for spiking neural networks [4]
→ 16PE: LE 22%, RAM 3%, DSP 74% (100/136)
Four recipes for saving DSPs
[1] D. Strenski (HPCWire, 2007)
[2] Anson H.T. Tse, David B. Thomas, K. H. Tsoi, Wayne Luk (HEART’10)
[3] H. Morisita, K. Inakagata, Y. Osana, N. Fujita, H. Amano (HEART’10)
[4] Marco Nuno-Maganda, Cesar Torres-Huitzil (HEART’10)
Perceiving Multiplications Visually
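[Figure: multiplication board for 6-bit operands, both axes running from bit 0 to bit 5 and split at bit 3 into chunks X0, X1 and Y0, Y1; each sub-product, such as X0Y0, is a rectangle on the board]
$XY = X_0Y_0 + 2^{3}X_0Y_1 + 2^{3}X_1Y_0 + 2^{3+3}X_1Y_1$
classical binary multiplication
all sub-products can be properly located inside the diamond
rotate the diamond to obtain a rectangle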
Karatsuba-Ofman algorithm
trading multiplications for additions
The Karatsuba-Ofman algorithm
Basic principle for two-way splitting
split X and Y into two chunks:
$X = 2^{k}X_1 + X_0$ and $Y = 2^{k}Y_1 + Y_0$
computation goal: $XY = 2^{2k}X_1Y_1 + 2^{k}(X_1Y_0 + X_0Y_1) + X_0Y_0$
precompute $D_X = X_1 - X_0$ and $D_Y = Y_1 - Y_0$
make the observation: $X_1Y_0 + X_0Y_1 = X_1Y_1 + X_0Y_0 - D_XD_Y$
XY requires only 3 DSP blocks ($X_1Y_1$, $X_0Y_0$, $D_XD_Y$)
overhead: two k-bit and one 2k-bit subtraction
overhead $\ll$ DSP-block emulation
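The two-way splitting can be checked with a few lines of integer arithmetic. This is a behavioural sketch of the identity, not a hardware description: the function name `karatsuba_2way` is illustrative, and each `*` stands for one multiplication that would be mapped to a DSP block.

```python
def karatsuba_2way(x, y, k):
    """Multiply two non-negative 2k-bit integers with three k-bit products."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k
    y0, y1 = y & mask, y >> k

    # Pre-subtractions; the differences are signed and need k+1 bits.
    dx = x1 - x0
    dy = y1 - y0

    # The only three multiplications.
    p_hi = x1 * y1
    p_lo = x0 * y0
    p_d = dx * dy

    # X1*Y0 + X0*Y1 = X1*Y1 + X0*Y0 - DX*DY, always non-negative.
    middle = p_hi + p_lo - p_d

    return (p_hi << (2 * k)) + (middle << k) + p_lo
```

With `k = 17` this covers 34-bit operands, the case implemented on the Virtex-4 DSP48 blocks.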
Visual Interpretation
[Figure: multiplication boards for operands split into two chunks (X0, X1 and Y0, Y1), three chunks (X0..X2 and Y0..Y2) and four chunks (X0..X3 and Y0..Y3)]
Implementation
fairly trivial starting from the equation:
$XY = 2^{2k}X_1Y_1 + 2^{k}(X_1Y_1 + X_0Y_0 - D_XD_Y) + X_0Y_0$
[Figure: 34x34-bit multiplier using Virtex-4 DSP48 blocks. The operands are split into 17-bit chunks X0, X1, Y0, Y1 (bits 16:0 and 33:17); the DSP48 chain produces X0Y0, then X0Y0 − DXDY, then X1Y1 + X0Y0 − DXDY, and the partial results are assembled into the product P]
$X_1Y_1 + X_0Y_0 - D_XD_Y$ is implemented inside the DSPs
$X_1Y_1$ must be recovered with a subtraction
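The accumulation order in the DSP chain, and the extra subtraction that recovers $X_1Y_1$, can be mimicked in software. The sketch below assumes a three-stage chain in which each stage adds or subtracts its own product to the value cascaded from the previous stage; the stage assignment and the name `mult34_dsp_chain` are assumptions made for illustration.

```python
K = 17  # chunk size used for the 34x34-bit multiplier


def mult34_dsp_chain(x, y):
    """Behavioural model of a 34x34-bit Karatsuba-Ofman multiplier."""
    mask = (1 << K) - 1
    x0, x1 = x & mask, x >> K
    y0, y1 = y & mask, y >> K
    dx, dy = x1 - x0, y1 - y0      # 18-bit signed differences, done in logic

    p_lo = x0 * y0                 # stage 1: X0*Y0
    acc1 = p_lo - dx * dy          # stage 2: X0*Y0 - DX*DY
    acc2 = acc1 + x1 * y1          # stage 3: X1*Y1 + X0*Y0 - DX*DY

    # X1*Y1 never leaves the chain on its own, so it is recovered
    # with one subtraction outside the DSP blocks.
    p_hi = acc2 - acc1

    return (p_hi << (2 * K)) + (acc2 << K) + p_lo
```

The model only checks the arithmetic; pipelining and the exact bit widths of the cascade ports are not represented.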
Results
            latency  frequency (MHz)  slices [5]  DSPs
LogiCore    6        447              26          4
LogiCore    3        176              34          4
K-O-2       3        317              95          3
Table: 34x34-bit multipliers on Virtex-4
trade-off: 1 DSP (>630 Logic Elements) for 138 Logic Elements
            latency  frequency (MHz)  slices      DSPs
LogiCore    11       353              185         9
LogiCore    6        264              122         9
K-O-3       6        317              331         6
Table: 51x51-bit multipliers on Virtex-4
trade-off: 3 DSPs (>1890 Logic Elements) for 292 Logic Elements
[5] On Virtex-4 devices 1 slice = 2 Logic Elements
Non-Standard tilings
new multiplication algorithms
Non-standard tilings
optimize use of rectangular multipliers on Virtex5 (25x18 signed)
classical decomposition may produce suboptimal results
  chunk size for X is 24
  chunk size for Y is 17
translate the operand decomposition into a tiling problem
[Figure: 6×6 multiplication board (bits 0 to 5 on both axes) covered by five rectangular tiles]
$XY = 2^{3+1}X_{3:1}Y_{4:3} + 2^{1+5}X_{3:1}Y_{5} + 2^{3}X_{0}Y_{5:3} + X_{3:0}Y_{2:0} + 2^{4}X_{5:4}Y_{5:0}$
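Any tiling of the board is a valid multiplier as long as the tiles cover every bit pair exactly once and each tile is weighted by the position of its lower corner. The sketch below checks this for the five-tile example; the helper names `bits` and `tiled_product` and the tuple layout of `TILES` are illustrative choices.

```python
def bits(v, hi, lo):
    """Extract bits hi..lo (inclusive) of v as an integer."""
    return (v >> lo) & ((1 << (hi - lo + 1)) - 1)


# (x_hi, x_lo, y_hi, y_lo) for each tile of the 6x6 example.
TILES = [
    (3, 1, 4, 3),  # 2^(3+1) * X[3:1] * Y[4:3]
    (3, 1, 5, 5),  # 2^(1+5) * X[3:1] * Y[5]
    (0, 0, 5, 3),  # 2^3     * X[0]   * Y[5:3]
    (3, 0, 2, 0),  #           X[3:0] * Y[2:0]
    (5, 4, 5, 0),  # 2^4     * X[5:4] * Y[5:0]
]


def tiled_product(x, y, tiles=TILES):
    """Sum the tile products, each shifted by x_lo + y_lo."""
    return sum(
        (bits(x, xh, xl) * bits(y, yh, yl)) << (xl + yl)
        for xh, xl, yh, yl in tiles
    )


def covered_area(tiles=TILES):
    """Total number of bit pairs covered by the tiles."""
    return sum((xh - xl + 1) * (yh - yl + 1) for xh, xl, yh, yl in tiles)
```

The exhaustive check over all 64 × 64 operand pairs is what establishes correctness; the area test alone would not catch overlapping tiles.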
Tilings
Performing a 53×53-bit multiplication on Virtex5
[Figure: three tilings of the multiplication board: (a) standard tiling, (b) LogiCore tiling, (c) proposed tiling with tiles M1..M8 and boundaries at bits 0, 17, 24, 34, 41 and 58]
standard tiling ≡ classical decomposition (12 DSPs)
LogiCore 11.1 tiling uses 10 DSPs (4 DSPs used as 17x17-bit)
our proposed tiling does it in 8 DSPs and a few LUTs
Tiling Architecture - 53x53bit
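[Figure: proposed tiling with tiles M1..M8 and boundaries at bits 0, 17, 24, 34, 41 and 58]
$$
\begin{aligned}
XY ={} & X_{0:23}Y_{0:16} && (M1) \\
& + 2^{17}\bigl(X_{0:23}Y_{17:33} && (M2) \\
& \quad + 2^{17}\bigl(X_{0:16}Y_{34:57} && (M3) \\
& \quad\quad + 2^{17}X_{17:33}Y_{34:57}\bigr)\bigr) && (M4) \\
& + 2^{24}\bigl(X_{24:40}Y_{0:23} && (M8) \\
& \quad + 2^{17}\bigl(X_{41:57}Y_{0:23} && (M7) \\
& \quad\quad + 2^{17}\bigl(X_{34:57}Y_{24:40} && (M6) \\
& \quad\quad\quad + 2^{17}X_{34:57}Y_{41:57}\bigr)\bigr)\bigr) && (M5) \\
& + 2^{48}X_{24:33}Y_{24:33}
\end{aligned}
$$
$X_{24:33}Y_{24:33}$ (10x10 multiplier) is probably best implemented in LUTs.
the parenthesization makes best use of the DSP48E internal adders (17-bit shifts)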
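The nested 17-bit shifts of the tiling equation can be verified numerically. The following sketch reproduces the two chains and the central 10x10 product with plain integers; the helper `field` uses the slide's low:high index order, and the function names are illustrative.

```python
def field(v, lo, hi):
    """Extract bits lo..hi (inclusive) of v, in the slide's lo:hi order."""
    return (v >> lo) & ((1 << (hi - lo + 1)) - 1)


def mult58_tiled(x, y):
    """58x58-bit product from eight 24x17-bit tiles plus one 10x10 tile."""
    m1 = field(x, 0, 23) * field(y, 0, 16)
    m2 = field(x, 0, 23) * field(y, 17, 33)
    m3 = field(x, 0, 16) * field(y, 34, 57)
    m4 = field(x, 17, 33) * field(y, 34, 57)
    m5 = field(x, 34, 57) * field(y, 41, 57)
    m6 = field(x, 34, 57) * field(y, 24, 40)
    m7 = field(x, 41, 57) * field(y, 0, 23)
    m8 = field(x, 24, 40) * field(y, 0, 23)
    center = field(x, 24, 33) * field(y, 24, 33)  # small tile left to logic

    # Each step shifts the running sum by 17 bits, as a DSP cascade would.
    chain_a = m1 + ((m2 + ((m3 + (m4 << 17)) << 17)) << 17)
    chain_b = m8 + ((m7 + ((m6 + (m5 << 17)) << 17)) << 17)

    return chain_a + (chain_b << 24) + (center << 48)
```

A 53-bit mantissa product is obtained by passing operands whose top five bits are zero.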
Tiling Results
58x58 multipliers on Virtex-5 (5vlx50ff676-3) [6]
            latency  Freq. (MHz)  REGs  LUTs  DSPs
LogiCore    14       440          300   249   10
LogiCore    8        338          208   133   10
LogiCore    4        95           208   17    10
Tiling      4        366          247   388   8
Remarks
save 2 DSP48E for a few LUTs/REGs
huge latency saving at a comparable frequency
good use of internal adders due to the 17-bit shifts
[6] Results for 53 bits are almost identical
Squarers
simple methods to save resources
Squarers
appear in norms, statistical computations, polynomial evaluation...
a dedicated squarer saves as many DSP blocks as the Karatsuba-Ofman algorithm, but without its overhead*.
Squaring with k = 17 on a Virtex-4
[Figure: squaring boards for two-way and three-way splitting; the squares of X0, X1 and X2 lie on the diagonal and each cross product X0X1, X0X2, X1X2 appears twice]
$(2^{k}X_1 + X_0)^2 = 2^{2k}X_1^2 + 2 \cdot 2^{k}X_1X_0 + X_0^2$
$(2^{2k}X_2 + 2^{k}X_1 + X_0)^2 = 2^{4k}X_2^2 + 2^{2k}X_1^2 + X_0^2 + 2 \cdot 2^{3k}X_2X_1 + 2 \cdot 2^{2k}X_2X_0 + 2 \cdot 2^{k}X_1X_0$
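Both squaring identities can be checked directly. In this sketch each `*` between chunks corresponds to one multiplication, so the two-way squarer uses 3 instead of 4 and the three-way squarer 6 instead of 9; the function names are illustrative, and the doubling of the cross products is folded into the shift amount.

```python
def square_2way(x, k=17):
    """Square a 2k-bit integer with three chunk multiplications."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k
    return ((x1 * x1) << (2 * k)) + ((x1 * x0) << (k + 1)) + x0 * x0


def square_3way(x, k=17):
    """Square a 3k-bit integer with six chunk multiplications."""
    mask = (1 << k) - 1
    x0, x1, x2 = x & mask, (x >> k) & mask, x >> (2 * k)

    squares = ((x2 * x2) << (4 * k)) + ((x1 * x1) << (2 * k)) + x0 * x0

    # Each cross product is needed twice: one extra bit of shift.
    cross = (
        ((x2 * x1) << (3 * k + 1))
        + ((x2 * x0) << (2 * k + 1))
        + ((x1 * x0) << (k + 1))
    )
    return squares + cross
```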
However ...
$(2^{k}X_1 + X_0)^2 = 2^{34}X_1^2 + 2^{18}X_1X_0 + X_0^2$ (with $k = 17$)
the previous equation needs shifts of 0, 18 and 34
the DSP48 blocks of Virtex-4 only provide shifts of 17, so their internal adders stay unused
Workaround for ≤ 33-bit multiplications
rewrite the equation:
$(2^{17}X_1 + X_0)^2 = 2^{34}X_1^2 + 2^{17}(2X_1)X_0 + X_0^2$
compute $2X_1$ by shifting $X_1$ left by one bit before feeding it into the DSP48 block
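The rewritten equation only uses shifts of 0, 17 and 34. A plausible reading of the 33-bit limit, stated here as an assumption, is that the doubled chunk $2X_1$ must still fit in the 17-bit unsigned input range, which leaves 16 bits for $X_1$. The sketch below checks the arithmetic under that assumption; `square33` is an illustrative name.

```python
def square33(x):
    """Square an integer of at most 33 bits using shifts of 0, 17 and 34 only."""
    assert 0 <= x < (1 << 33)
    x0 = x & ((1 << 17) - 1)   # low 17 bits
    x1 = x >> 17               # at most 16 bits

    # Doubling is a one-bit left shift done before the multiplier input.
    x1_doubled = x1 << 1
    assert x1_doubled < (1 << 17)  # still fits an unsigned 17-bit input

    return x0 * x0 + ((x1_doubled * x0) << 17) + ((x1 * x1) << 34)
```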
Results – 32-bit and 53-bit squarers on Virtex-4
bits  core      latency  frequency (MHz)  slices  DSPs
32    LogiCore  6        489              59      4
32    LogiCore  3        176              34      4
32    Squarer   3        317              18      3
53    LogiCore  18       380              279     16
53    LogiCore  7        176              207     16
53    Squarer   7        317              332     6
DSPs saved without any overhead
impressive 10 DSPs saved for the double-precision squarer
Squarers on Virtex5 using tilings
the tiling technique can be extended to squaring
[Figure: two tilings of the squaring board, with axes running from bit 0 to bit 53, one using tiles M1..M6 and one using tiles M1..M5; darker squares mark areas covered twice]
Issues
darker squares are computed twice and thus need to be removed
thanks to symmetry, a diagonal multiplication of size n should consume only $n(n+1)/2$ LUTs instead of $n^2$
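The $n(n+1)/2$ figure follows from the symmetry of the partial products of a square: $x_i x_j$ and $x_j x_i$ are the same bit, and $x_i x_i = x_i$. The sketch below counts partial-product terms as a rough proxy for LUTs, which is an assumption about the mapping; `square_bits` is an illustrative name.

```python
def square_bits(x, n):
    """Square an n-bit integer from its partial products, using symmetry.

    Returns (square, number_of_partial_product_terms).
    """
    xb = [(x >> i) & 1 for i in range(n)]
    total = 0
    terms = 0
    for i in range(n):
        # Diagonal term: x_i AND x_i is just x_i, weight 2^(2i).
        total += xb[i] << (2 * i)
        terms += 1
        for j in range(i + 1, n):
            # Off-diagonal pair counted once and doubled: weight 2^(i+j+1).
            total += (xb[i] & xb[j]) << (i + j + 1)
            terms += 1
    return total, terms
```

For `n = 10` this gives 55 terms instead of the 100 of a general 10x10 multiplier.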
Truncated multipliers
Truncated multipliers
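Classical technique
reduce resources, delay, or power consumption
controlled accuracy degradation
[Figure: partial-product array of A × B; the result keeps n − k bits, and d of the k least-significant columns are not computed]
remove the d least-significant columns of the partial-product array
keep the error smaller than $2^{k}$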
Error budget
[Figure: partial-product array of A × B with the n − k kept result bits, the k rounded columns and the d discarded columns marked]
$E_{total} = E_{approx} + E_{round} \le 2^{k}$
$E_{round}$: caused by rounding the $(n-d)$-bit result to $n-k$ bits
  use a compensation bit to center the error
  round to nearest bounds $E_{round} \le 2^{k-1}$
$E_{approx}$: caused by the truncation of the $d$ columns
  $0 \le E_{approx} \le \sum_{i=1}^{d} i\,2^{i-1}$
  $E_{approx} < 2^{k-1} \rightarrow d = f(k)$
Precision   k    Discarded (d)
Single      23   18
Double      52   46
Quadruple   112  105
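The relation $d = f(k)$ can be explored numerically. The sketch below makes one assumption that is not spelled out on the slide: the compensation constant centers the truncation error, so its magnitude is half the worst-case sum and the condition becomes $\sum_{i=1}^{d} i\,2^{i-1} < 2^{k}$. Under that reading the function reproduces the three table entries. All names are illustrative.

```python
def approx_error_bound(d):
    """Worst-case sum of the d discarded columns: sum of i * 2^(i-1)."""
    return sum(i * 2 ** (i - 1) for i in range(1, d + 1))


def discarded_columns(k):
    """Largest d whose centered truncation error stays below 2^(k-1).

    Assumes the error is centered, i.e. bound / 2 < 2^(k-1).
    """
    d = 0
    while approx_error_bound(d + 1) < 2 ** k:
        d += 1
    return d


def truncated_product(a, b, n, d):
    """n x n-bit product that skips every partial product in columns < d."""
    total = 0
    for i in range(n):
        for j in range(n):
            if i + j >= d and (a >> i) & 1 and (b >> j) & 1:
                total += 1 << (i + j)
    return total


for name, k in [("Single", 23), ("Double", 52), ("Quadruple", 112)]:
    print(name, k, discarded_columns(k))
```

`truncated_product` makes it possible to check on small operands that the uncompensated error never exceeds `approx_error_bound(d)`.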
Tiling the truncated board
[Figure: three truncated multiplication boards with DSP tiles M1..M4 and the k and d column limits, illustrating the solutions below]
Sol 1: tile and discard columns (save additions)
  waste DSPs
Sol 2: use softcore multiplier (trade a DSP for logic)
Best: tile with softcore multipliers so that $E_{approx} \le 2^{k-1}$
  use the extra precision for free
Reality Check – faithful rounding
Mantissa multipliers for SP, DP, QP on Virtex4 and Virtex5
FPGA     Prec.  Latency, Freq.        Resources
Virtex5  DP     6 cycles @ 414MHz     320 LUT, 302 REG, 5 DSP
Virtex5  QP     20 cycles @ 334MHz    2497 LUT, 2321 REG, 19 DSP
Virtex5  QP     14 cycles @ 245MHz    2249 LUT, 1576 REG, 19 DSP
Virtex4  DP     11 cycles @ 368MHz    358 sl., 7 DSP
Virtex4  QP     21 cycles @ 368MHz    1735 sl., 26 DSP
Virtex4
  DP: reduce DSPs from 10 to 7 while also reducing slice count
  QP: reduce DSPs from 49 to 26 without any slice penalty
Virtex5
  DP: reduce DSPs from 6 to 5 with roughly half the LUTs and REGs
  QP: reduce DSPs from 34 to 19 at a small increase in logic resources
Another point of view
$(w_E, w_F)$ correctly rounded $=_{\text{accuracy}}$ $(w_E, w_F + 1)$ faithfully rounded
→ in FPGAs the extra bit comes for free*
truncate multipliers when IEEE-754 compliance is not needed
function approximation by polynomial evaluation
$\log_2(1 + x)$ (53-bit)
  default: 27 DSPs
  optimized Horner: 23 DSPs
  optimized Horner + truncated multipliers: 11* DSPs
Conclusion
save DSPs by exploiting the flexibility of the FPGA
Karatsuba-Ofman reduces DSP cost at a small price in logic elements
tiling techniques adapt better to asymmetric DSPs
dedicated squarers significantly reduce DSP count
control accuracy and save DSPs using truncated multipliers
Thank you for your attention !
http://flopoco.gforge.inria.fr/
Questions ?