ISSCC 2003 / SESSION 19 / PROCESSOR BUILDING BLOCKS / PAPER 19.1

19.1 A 5GHz Floating Point Multiply-Accumulator in 90nm Dual VT CMOS

Sriram Vangal, Yatin Hoskote, Dinesh Somasekhar, Vasantha Erraguntla, Jason Howard, Gregory Ruhl, Venkat Veeramachaneni, David Finan, Sanu Mathew, Nitin Borkar

Microprocessor Research Labs, Intel, Hillsboro, OR

A single precision, floating point multiply accumulator (FPMAC) uses base 32, internal carry save format and delayed addition techniques to enable single cycle accumulate at 5GHz. In addition, an improved Leading Zero Anticipator (LZA) and overflow detection logic applicable to carry save format are implemented.

The FPMAC architecture is shown in Fig. 19.1.1. Operands A and B are 32b inputs in IEEE 754 single precision format [1]. The multiplier is designed using a Wallace tree of 4-2 adders. The well-matched delays of each Wallace tree stage allow for very efficient pipelining (pipe stages S1-S5). In an effort to achieve fast single cycle accumulate operation, the following optimizations are employed: (1) The accumulator retains the multiplier output in carry save format and uses an array of 4-2 adders to “accumulate” the result. This removes the need for an expensive carry propagate adder in the critical path. (2) Accumulation is performed in a base 32 system. The incoming number is converted to base 32 by shifting the mantissa to the left by an amount given by the last five exponent bits (Exp[4:0]), thus extending the mantissa width to 55b. This reduces the exponent from 8 to 3 bits (Exp[7:5]), allowing faster exponent comparison and generation of critical control signals, accelerating the accumulate operation. (3) Expensive variable shifters in the accumulate loop are replaced with constant shifters. (4) The costly post-normalization step is moved outside the loop, where the accumulation result is added, and the sum is normalized and converted back to base 2.
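Optimizations (1) and (2) can be illustrated with a small behavioural sketch in Python. It is not the circuit of Fig. 19.1.1. The function names (`csa3`, `add_4to2`, `to_base32`, `from_base32`) are hypothetical, the 4-2 adder is modelled as two chained 3-2 carry-save stages on unbounded non-negative integers, and the conversion assumes a normalized single precision input. The scale factor 2^(32·exp3 − 150) follows from the IEEE 754 bias of 127 and the 23 fraction bits.

```python
import struct
from fractions import Fraction


def csa3(a: int, b: int, c: int) -> tuple[int, int]:
    """3-2 carry-save stage: returns (sum, carry) with sum + carry == a + b + c."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def add_4to2(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """Behavioural 4-2 adder: four inputs reduced to a (sum, carry) pair.

    No carry ripples across the word, which is why the accumulate loop
    can avoid a carry propagate adder.
    """
    s1, c1 = csa3(a, b, c)
    return csa3(s1, c1, d)


def to_base32(x: float) -> tuple[int, int, int]:
    """Split a normalized float32 into (sign, 3b exponent, up to 55b mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp8 = (bits >> 23) & 0xFF
    if exp8 in (0, 0xFF):
        raise ValueError("sketch handles normalized numbers only")
    mant24 = (1 << 23) | (bits & 0x7FFFFF)   # restore the hidden one
    mant55 = mant24 << (exp8 & 0x1F)         # left shift by Exp[4:0]
    exp3 = exp8 >> 5                         # keep Exp[7:5]
    return sign, exp3, mant55


def from_base32(sign: int, exp3: int, mant55: int) -> Fraction:
    """Exact value of the base 32 triple, for checking the conversion."""
    value = Fraction(mant55) * Fraction(2) ** (32 * exp3 - 150)
    return -value if sign else value


if __name__ == "__main__":
    # Accumulate four values while staying in carry-save form.
    acc_s, acc_c = 0, 0
    for product in (3, 5, 7, 11):
        acc_s, acc_c = add_4to2(acc_s, acc_c, product, 0)
    print(acc_s + acc_c)                     # one final carry propagate add
    print(to_base32(1.5), from_base32(*to_base32(1.5)))
```

The single `acc_s + acc_c` at the end plays the role of the deferred carry propagate addition that the paper moves out of the loop.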

The accumulator mantissa loop uses 4-2 adders at its core [2] to accumulate the result using the extended un-normalized mantissas and 3b exponents. The algorithm has the following four cases, implemented as concurrent operations and shown as four distinct paths in Fig. 19.1.2. (A) If the feedback mantissa is zero, the result of accumulation is the mantissa from the multiplier. (B) If the feedback mantissa has more than 31 leading zeroes or ones, and the feedback exponent is greater than the multiplier exponent by 1 or 2, then the mantissas are partially aligned by shifting the feedback mantissa left and the multiplier output mantissa right by 32b before addition. (C) If the two exponents differ by more than 1, the mantissa of the larger number is chosen, bypassing the adder. (D) If the two exponents differ by 1 or are equal, the mantissas are aligned by shifting the mantissa of the smaller number right by 32 and then added. The exponent loop (Fig. 19.1.3) mirrors the four cases described above. The control signals are generated from 3b exponent comparisons. To reduce critical path delay, the signals are re-timed to the previous pipe stage. This approach enables accumulation to be implemented in nine logic stages. In the absence of normalization in the accumulate loop, alignment of mantissas is necessary to avoid precision loss. The use of two 4-2 adders in the mantissa loop (parallel paths B, D) arises from the need to correctly handle non-commutative input data streams. Our implementation also handles overflow during mantissa addition and subsequent recovery on the fly. Each 4-2 adder array has its own overflow detection logic which triggers a right shift of the addition result.
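One possible reading of cases (A) to (D) is sketched below as a sequential Python model. Several points are assumptions rather than statements from the paper: mantissas are plain signed integers instead of carry-save pairs, a value is taken to be mantissa × 2^(32·exponent), the 57b word width is used only to count leading sign bits, case (B) is given priority over case (C) where the two overlap, and in case (B) the multiplier mantissa is shifted right only when its exponent is still smaller after the feedback shift. The names `leading_sign_bits` and `accumulate` are hypothetical.

```python
WIDTH = 57  # assumed word width, used only for the leading-bit count


def leading_sign_bits(m: int, width: int = WIDTH) -> int:
    """Leading zeroes (m >= 0) or leading ones (m < 0) in a width-bit word."""
    magnitude = m if m >= 0 else ~m
    return width - magnitude.bit_length()


def accumulate(fb_m: int, fb_e: int, mul_m: int, mul_e: int) -> tuple[int, int]:
    """Return (mantissa, exponent) after adding the multiplier output
    (mul_m, mul_e) to the feedback value (fb_m, fb_e)."""
    diff = fb_e - mul_e

    # (A) Empty accumulator: pass the multiplier output through.
    if fb_m == 0:
        return mul_m, mul_e

    # (B) Feedback has spare leading bits and a slightly larger exponent:
    #     shift it left by 32b (exponent drops by one) to preserve precision.
    if leading_sign_bits(fb_m) > 31 and diff in (1, 2):
        fb_m, fb_e = fb_m << 32, fb_e - 1
        if mul_e < fb_e:
            mul_m >>= 32
        return fb_m + mul_m, fb_e

    # (C) Exponents far apart: the larger number wins, no addition.
    if abs(diff) > 1:
        return (fb_m, fb_e) if diff > 0 else (mul_m, mul_e)

    # (D) Exponents equal or one apart: constant 32b right shift, then add.
    if diff > 0:
        return fb_m + (mul_m >> 32), fb_e
    if diff < 0:
        return (fb_m >> 32) + mul_m, mul_e
    return fb_m + mul_m, fb_e


if __name__ == "__main__":
    print(accumulate(0, 0, 9, 4))              # case A
    print(accumulate(5, 2, 3 << 32, 1))        # case B
    print(accumulate(1 << 40, 5, 7, 1))        # case C
    print(accumulate(1 << 40, 3, 1 << 40, 2))  # case D
```

Every shift in the model is by a constant 32 bits, which matches optimization (3): no variable shifter is needed inside the loop. In hardware the four cases are evaluated concurrently and one result is selected, whereas this sketch evaluates them in order.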

The tasks of overflow detection and leading zero anticipation [3] for numbers in carry-save format are challenging and require new logic. The “toggle detector” circuit (Fig. 19.1.4) looks at a pair of 3b vectors and predicts a toggle (0→1 or 1→0) in their addition result (for example, 100), without actual summation. This circuit is used to predict mantissa overflow in the accumulate loop. Sign-extending the data path by 1 bit and detecting a toggle in the 2 most significant bits (MSB) of the extended result implies overflow. Because the “toggle detector” circuit flags toggles in a 3b window, this overflow detection is conservative. To avoid precision loss from such a conservative right shift, we extend the internal data path by 2 bits and shift the addition result right only if overflow is flagged for this extended width. The same “toggle detector” circuit is used to determine the number of leading zeroes or ones in a mantissa in carry-save format. An N-2 array of these circuits, for an N bit number, is used. Overlapping groups of three bits at a time of the carry and sum vectors, from MSB down to the least significant bit, are fed to the “toggle detector” array to identify toggles. Counting the bits from MSB down to the first observed toggle gives the number of leading zeroes or leading ones. This count can be off by one bit position and requires a subsequent compensatory shift in the post-normalization block (Fig. 19.1.5). The post-normalization block uses a 57b sparse-tree dual adder core [4] that computes (A+B) and -(A+B) in parallel by preconditioning the inputs of two identical adders. The LZA operates in parallel with the dual adder to compute the normalization distance. It also checks the feedback mantissa for more than 31 leading zeroes or ones and generates the control signal required in the accumulate loop.
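The quantities that the toggle detector predicts can be pinned down with a golden-reference model in Python. Unlike the circuit, this sketch does perform the addition, so it is only useful as a checker for a faster predictor. It does not reproduce the logic of Fig. 19.1.4, and the function names `overflows` and `leading_sign_run` are hypothetical.

```python
def overflows(a: int, b: int, n: int) -> bool:
    """True if a + b does not fit in n-bit two's complement.

    Both operands are sign-extended by one bit; a toggle between the two
    most significant bits of the (n+1)-bit result indicates overflow.
    """
    total = (a + b) & ((1 << (n + 1)) - 1)   # (n+1)-bit wrapped sum
    top = (total >> n) & 1                   # extended sign bit
    below = (total >> (n - 1)) & 1           # original sign position
    return top != below


def leading_sign_run(carry: int, save: int, width: int) -> int:
    """Number of identical bits from the MSB down to the first toggle
    in the width-bit sum of a carry-save pair."""
    total = (carry + save) & ((1 << width) - 1)
    msb = (total >> (width - 1)) & 1
    run = 1
    for i in range(width - 2, -1, -1):
        if (total >> i) & 1 != msb:
            break                            # first 0->1 or 1->0 toggle
        run += 1
    return run


if __name__ == "__main__":
    print(overflows(7, 1, 4), overflows(3, 4, 4))
    print(leading_sign_run(0b00010000, 0b00000110, 8))
```

A predictor built from 3b windows could be compared against `leading_sign_run` over random carry-save pairs; according to the text above, its count may differ from the exact run by one position, which is what the compensatory shift corrects.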

Layout of the FPMAC core and summary of chip characteristics are shown in Fig. 19.1.6. The 2mm² custom design contains 230k transistors. To enable 5GHz operation, the FPMAC core uses implicit-pulsed semi-dynamic flip-flops [5]. More than 80% of the device widths in the high-speed core are low-VT. The design includes three 32 word x 32b FIFO buffers, each operating at core speed. Two of the FIFOs provide the input operands to the FPMAC and the third FIFO captures the results at-speed. A scan chain is used to load the input FIFOs and read the output FIFO.

A frequency versus Vcc plot characterizing execution of the FPMAC core is shown in Fig. 19.1.7. Simulations show that at 25°C and 1.2V, the FPMAC functions at 5GHz. At 1.5V, the operating frequency increases to 6.2GHz. Average estimated power consumption for 1.2V, 5GHz operation is 1.2W.
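The figures quoted above imply a few derived quantities, worked out in the short Python sketch below. The energy figure is simply power divided by clock frequency and assumes one accumulate per cycle; it inherits the "average estimated" nature of the 1.2W number. The function names are hypothetical.

```python
def energy_per_cycle_pj(power_w: float, freq_hz: float) -> float:
    """Average energy per clock cycle in picojoules."""
    return power_w / freq_hz * 1e12


def ratio(new: float, old: float) -> float:
    """Relative change expressed as a simple ratio."""
    return new / old


if __name__ == "__main__":
    # 1.2 W at 5 GHz, as quoted for 1.2 V operation.
    print(f"energy per cycle: {energy_per_cycle_pj(1.2, 5e9):.0f} pJ")
    # Raising Vcc from 1.2 V to 1.5 V raises frequency from 5 GHz to 6.2 GHz.
    print(f"voltage ratio:   {ratio(1.5, 1.2):.2f}")
    print(f"frequency ratio: {ratio(6.2, 5.0):.2f}")
```

Under these assumptions the core spends roughly 240 pJ per cycle, and a 25% increase in supply voltage yields a 24% increase in frequency.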

Acknowledgements
The authors thank A. Pangal, V. Govindarajulu, H. Wilson, Y. Niu and J. Tschanz for design efforts, K. Ikeda, K. Truong, C. Parsons and H. Nguyen for chip layout, and S. Borkar and J. Rattner for encouragement and support.

References
[1] IEEE Standards Board, “IEEE Standard for Binary Floating-Point Arithmetic,” Technical Report ANSI/IEEE Std. 754-1985, IEEE, New York, 1985.
[2] Z. Luo, et al., “Accelerating Pipelined Integer and Floating-Point Accumulations in Configurable Hardware with Delayed Addition Techniques,” IEEE Trans. on Computers, Mar. 2000, pp. 208-218.
[3] H. Suzuki, et al., “Leading-Zero Anticipatory Logic for High-speed Floating Point Addition,” IEEE J. Solid State Circuits, Aug. 1996, pp. 1157-1164.
[4] S. Mathew, et al., “A 4GHz 130nm Address Generation Unit with a 32-bit Sparse-tree Adder Core,” VLSI Circuits Symp. 2002, pp. 126-127.
[5] F. Klass, “Semi-Dynamic and Dynamic Flip-Flops with Embedded Logic,” VLSI Circuits Symp. 1998, Digest of Technical Papers, pp. 108-109.

• 2003 IEEE International Solid-State Circuits Conference 0-7803-7707-9/03/$17.00 ©2003 IEEE

Figure 19.1.1: FPMAC pipe stages and organization.
Figure 19.1.2: Floating point accumulator mantissa datapath.
Figure 19.1.3: Floating point accumulator exponent logic.
Figure 19.1.4: Toggle detection circuit.
Figure 19.1.5: Post-normalization pipeline diagram.
Figure 19.1.6: Chip micrograph and process characteristics.
Figure 19.1.7: Accumulator frequency vs. supply voltage.
