A 5 GHz floating point multiply-accumulator in 90 nm dual VT CMOS

ISSCC 2003 / SESSION 19 / PROCESSOR BUILDING BLOCKS / PAPER 19.1

19.1 A 5GHz Floating Point Multiply-Accumulator in 90nm Dual VT CMOS

Sriram Vangal, Yatin Hoskote, Dinesh Somasekhar, Vasantha Erraguntla, Jason Howard, Gregory Ruhl, Venkat Veeramachaneni, David Finan, Sanu Mathew, Nitin Borkar

Microprocessor Research Labs, Intel, Hillsboro, OR

A single precision, floating point multiply accumulator (FPMAC) uses base 32, internal carry save format and delayed addition techniques to enable single cycle accumulate at 5GHz. In addition, an improved Leading Zero Anticipator (LZA) and overflow detection logic applicable to carry save format are implemented.

The FPMAC architecture is shown in Fig. 19.1.1. Operands A and B are 32b inputs in IEEE 754 single precision format [1]. The multiplier is designed using a Wallace tree of 4-2 adders. The well-matched delays of each Wallace tree stage allow for very efficient pipelining (pipe stages S1-S5). To achieve fast single cycle accumulate operation, the following optimizations are employed: (1) The accumulator retains the multiplier output in carry save format and uses an array of 4-2 adders to “accumulate” the result. This removes the need for an expensive carry propagate adder in the critical path. (2) Accumulation is performed in a base 32 system. The incoming number is converted to base 32 by shifting the mantissa to the left by an amount given by the five least significant exponent bits (Exp[4:0]), thus extending the mantissa width to 55b. This reduces the exponent from 8 to 3 bits (Exp[7:5]), allowing faster exponent comparison and generation of critical control signals, accelerating the accumulate operation. (3) Expensive variable shifters in the accumulate loop are replaced with constant shifters. (4) The costly post-normalization step is moved outside the loop, where the accumulation result is added and the sum is normalized and converted back to base 2.
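The base 32 conversion in optimization (2) can be illustrated with a small software sketch. It is applied here to a single precision value rather than to the multiplier output, it covers normal numbers only, and the function names `to_base32` and `from_base32` are hypothetical; none of this is taken from the chip's implementation.

```python
import math
import struct

def to_base32(x: float):
    """Split a normal single precision value into (sign, exp3, mant55)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp8 = (bits >> 23) & 0xFF          # biased 8b exponent
    frac = bits & 0x7FFFFF              # 23b stored fraction
    if exp8 == 0 or exp8 == 0xFF:
        # Assumption of this sketch: zero, denormals, Inf and NaN are excluded.
        raise ValueError("sketch covers normal numbers only")
    mant24 = (1 << 23) | frac           # restore the hidden leading 1
    mant55 = mant24 << (exp8 & 0x1F)    # left shift by Exp[4:0], up to 31 places
    exp3 = exp8 >> 5                    # keep only Exp[7:5]
    return sign, exp3, mant55

def from_base32(sign: int, exp3: int, mant55: int) -> float:
    """Rebuild the value: mant55 * 2**(32*exp3 - 127 - 23)."""
    value = math.ldexp(mant55, 32 * exp3 - 150)
    return -value if sign else value
```

For example, `to_base32(1.0)` returns `(0, 3, 1 << 54)`: the biased exponent 127 splits into Exp[7:5] = 3 and Exp[4:0] = 31, so the 24b mantissa is shifted left by 31 places and occupies the full 55b width.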

The accumulator mantissa loop uses 4-2 adders at its core [2] to accumulate the result using the extended un-normalized mantissas and 3b exponents. The algorithm has the following four cases, implemented as concurrent operations and shown as four distinct paths in Fig. 19.1.2. (A) If the feedback mantissa is zero, the result of accumulation is the mantissa from the multiplier. (B) If the feedback mantissa has more than 31 leading zeroes or ones, and the feedback exponent is greater than the multiplier exponent by 1 or 2, then the mantissas are partially aligned by shifting the feedback mantissa left and the multiplier output mantissa right by 32b before addition. (C) If the two exponents differ by more than 1, the mantissa of the larger number is chosen, bypassing the adder. (D) If the two exponents differ by 1 or are equal, the mantissas are aligned by shifting the mantissa of the smaller number right by 32b and then added. The exponent loop (Fig. 19.1.3) mirrors the four cases described above. The control signals are generated from 3b exponent comparisons. To reduce critical path delay, the signals are re-timed to the previous pipe stage. This approach enables accumulation to be implemented in nine logic stages. In the absence of normalization in the accumulate loop, alignment of mantissas is necessary to avoid precision loss. The use of two 4-2 adders in the mantissa loop (parallel paths B, D) arises from the need to correctly handle non-commutative input data streams. Our implementation also handles overflow during mantissa addition and subsequent recovery on the fly. Each 4-2 adder array has its own overflow detection logic, which triggers a right shift of the addition result.
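The role of the 4-2 adder in the loop can be sketched in software. The model below is a behavioral illustration only: it builds a 4-2 adder from two chained 3-2 carry save adders, ignores the alignment cases (A)-(D), and uses a 57b width as an assumption. The names `csa`, `add_4_2` and `accumulate` are hypothetical.

```python
WIDTH = 57                 # illustrative datapath width; an assumption of this sketch
MASK = (1 << WIDTH) - 1

def csa(a: int, b: int, c: int):
    """3-2 carry save adder: three words in, (sum, carry) out, no carry ripple."""
    s = a ^ b ^ c                               # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1  # majority bits, weighted by 2
    return s & MASK, carry & MASK

def add_4_2(a: int, b: int, c: int, d: int):
    """4-2 adder modelled as two chained 3-2 adders."""
    s1, c1 = csa(a, b, c)
    return csa(s1, c1, d)

def accumulate(products):
    """Accumulate (sum, carry) pairs, keeping the running total in carry save form."""
    acc_s, acc_c = 0, 0
    for p_s, p_c in products:
        acc_s, acc_c = add_4_2(acc_s, acc_c, p_s, p_c)
    # The only carry propagate addition happens once, outside the loop.
    return (acc_s + acc_c) & MASK
```

Each loop iteration costs only two levels of bitwise logic regardless of word width, which is the property that lets the carry propagate adder be moved out of the critical path.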

The tasks of overflow detection and leading zero anticipation [3] for numbers in carry-save format are challenging and require new logic. The “toggle detector” circuit (Fig. 19.1.4) looks at a pair of 3b vectors and predicts a toggle (0 → 1 or 1 → 0) in their addition result (for example, 100), without actual summation. This circuit is used to predict mantissa overflow in the accumulate loop. With the data path sign-extended by 1 bit, a toggle in the 2 most significant bits (MSBs) of the extended result implies overflow. Because the “toggle detector” circuit flags toggles in a 3b window, this overflow detection is conservative. To avoid precision loss from such a conservative right shift, we extend the internal data path by 2 bits and shift the addition result right only if overflow is flagged for this extended width. The same “toggle detector” circuit is used to determine the number of leading zeroes or ones in a mantissa in carry-save format. For an N-bit number, an array of N-2 of these circuits is used. Overlapping groups of three bits at a time of the carry and sum vectors, from the MSB down to the least significant bit, are fed to the “toggle detector” array to identify toggles. Counting the bits from the MSB down to the first observed toggle gives the number of leading zeroes or leading ones. This count can be off by one bit position and requires a subsequent compensatory shift in the post-normalization block (Fig. 19.1.5). The post-normalization block uses a 57b sparse-tree dual adder core [4] that computes (A+B) and –(A+B) in parallel by preconditioning the inputs of two identical adders. The LZA operates in parallel with the dual adder to compute the normalization distance. It also checks the feedback mantissa for more than 31 leading zeroes or ones and generates the control signal required in the accumulate loop.
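As a reference for what the toggle-based logic has to predict, the following sketch computes the same two quantities exactly. Unlike the circuit, it performs the actual summation, so it is a behavioral check rather than a model of the toggle detector itself. The function names `overflows` and `leading_sign_bits` are hypothetical.

```python
def overflows(a: int, b: int, width: int) -> bool:
    """True if a + b does not fit in `width`-bit two's complement.

    a and b are signed integers that each fit in `width` bits.
    """
    ext = (a + b) & ((1 << (width + 1)) - 1)  # sum on a datapath extended by 1 bit
    top = (ext >> width) & 1                  # extended sign bit
    below = (ext >> (width - 1)) & 1          # original MSB position
    return top != below                       # a toggle in the top 2 bits

def leading_sign_bits(value: int, width: int) -> int:
    """Count identical bits from the MSB down to the first toggle."""
    bits = [(value >> i) & 1 for i in range(width - 1, -1, -1)]
    count = 1
    while count < width and bits[count] == bits[0]:
        count += 1
    return count
```

With `width=4`, `overflows(7, 1, 4)` is true because the 5b extended sum is 01000, whose top two bits differ, while `overflows(3, 4, 4)` is false because the sum is 00111.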

Layout of the FPMAC core and summary of chip characteristics are shown in Fig. 19.1.6. The 2mm² custom design contains 230k transistors. To enable 5GHz operation, the FPMAC core uses implicit-pulsed semi-dynamic flip-flops [5]. More than 80% of the device widths in the high-speed core are low-VT. The design includes three 32-word × 32b FIFO buffers, each operating at core speed. Two of the FIFOs provide the input operands to the FPMAC, and the third FIFO captures the results at speed. A scan chain is used to load the input FIFOs and read the output FIFO.

A frequency versus Vcc plot characterizing execution of the FPMAC core is shown in Fig. 19.1.7. Simulations show that at 25°C and 1.2V, the FPMAC functions at 5GHz. At 1.5V, the operating frequency increases to 6.2GHz. Average estimated power consumption for 1.2V, 5GHz operation is 1.2W.
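Two derived figures follow directly from the numbers quoted above. The short calculation below is only arithmetic on those values; the variable names are illustrative.

```python
power_w = 1.2          # average estimated power at 1.2V, 5GHz
freq_hz = 5e9          # operating frequency at 1.2V

# Energy spent per clock cycle, in joules (about 240 pJ).
energy_per_cycle_j = power_w / freq_hz

# Relative frequency gain versus relative supply increase, 1.2V to 1.5V.
freq_gain = 6.2 / 5.0  # about 1.24x
vcc_gain = 1.5 / 1.2   # about 1.25x
```

Over this range the quoted frequency scales roughly in proportion to supply voltage.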

Acknowledgements

The authors thank A. Pangal, V. Govindarajulu, H. Wilson, Y. Niu and J. Tschanz for design efforts, K. Ikeda, K. Truong, C. Parsons and H. Nguyen for chip layout, and S. Borkar and J. Rattner for encouragement and support.

References

[1] IEEE Standards Board, “IEEE Standard for Binary Floating-Point Arithmetic,” Technical Report ANSI/IEEE Std. 754-1985, IEEE, New York, 1985.

[2] Z. Luo, et al., “Accelerating Pipelined Integer and Floating-Point Accumulations in Configurable Hardware with Delayed Addition Techniques,” IEEE Trans. on Computers, Mar. 2000, pp. 208-218.

[3] H. Suzuki, et al., “Leading-Zero Anticipatory Logic for High-speed Floating Point Addition,” IEEE J. Solid State Circuits, Aug. 1996, pp. 1157-1164.

[4] S. Mathew, et al., “A 4GHz 130nm Address Generation Unit with a 32-bit Sparse-tree Adder Core,” VLSI Circuits Symp. 2002, pp. 126-127.

[5] F. Klass, “Semi-Dynamic and Dynamic Flip-Flops with Embedded Logic,” VLSI Circuits Symp. 1998, Digest of Technical Papers, pp. 108-109.

• 2003 IEEE International Solid-State Circuits Conference 0-7803-7707-9/03/$17.00 ©2003 IEEE


Figure 19.1.1: FPMAC pipe stages and organization.

Figure 19.1.2: Floating point accumulator mantissa datapath.

Figure 19.1.3: Floating point accumulator exponent logic.

Figure 19.1.4: Toggle detection circuit.

Figure 19.1.5: Post-normalization pipeline diagram.

Figure 19.1.6: Chip micrograph and process characteristics.

Figure 19.1.7: Accumulator frequency vs. supply voltage.

