Design of High Performance Multiply-Accumulate Computation Unit

S. Ahish 1, Y. B. N. Kumar 2, Dheeraj Sharma 3, M. H. Vasantha 4
Department of Electronics and Communication
National Institute of Technology Goa
Email: [email protected] 1, [email protected] 2, [email protected] 3, [email protected] 4

Abstract: In Digital Signal Processing (DSP), the Multiply-Accumulate Computation (MAC) unit plays a very important role and lies in the critical path. The multiplier is one of the most important blocks in the MAC unit, and the overall performance of the MAC unit depends on the resources used by the multiplier. This paper therefore describes the design of a Partial Product Reduction Block (PPRB) that is used to implement a multiplier with better area, delay and power performance. The PPRB reduces the partial products row-wise using different multi-bit adder blocks, instead of the conventional column-wise reduction. Compared to a MAC unit realised using a conventional multiplier architecture, the MAC unit whose multiplier uses the proposed partial product reduction technique reduces delay by 46%, power consumption by 39% and area by 17%.

Index Terms: Carry-lookahead adder, Brent-Kung adder, Wallace tree, Booth multiplier, multiply-accumulate unit.

I. INTRODUCTION

Multiply-accumulate is the main computational kernel and is considered one of the fundamental operations in DSP [1]. The MAC unit is an integral part of a DSP architecture and generally lies in the critical path, so it determines the speed of the overall system. Developing a high-performance MAC unit is therefore crucial for real-time DSP applications. In addition, with the ever-growing demand for portable electronic products, a component with low power consumption and a small area is necessary from a market standpoint.
Hence, designing a MAC unit with high speed, small area and low power consumption is an important aspect of real-time video coding and DSP systems [2].

The critical path delays and hardware complexities of multiply-accumulate units have been investigated in order to derive a high-performance MAC [3]. To improve the power and speed (delay) performance of the MAC unit, the performance parameters of the multiplier, which constitutes the major part of the MAC unit, have to be improved. Much work has been done on advanced multiplication algorithms and designs [1].

Improving the performance of the multiplier means reducing the resources used by the partial product reduction block. Carry propagation is time-consuming, so performing two separate carry propagations in the same MAC circuit is inefficient. This can be overcome by feeding the multiplier output back to the input of the partial product (PP) reduction tree, which removes the need for a conventional accumulate adder [4], [5], [6]. Accumulation is then handled by the final adder of the multiplier, and only one carry-propagation stage is needed. The problem is that this optimization applies only to single-cycle MACs, where the long critical path delay is a limiting factor in many applications.

The multiplier architecture proposed in this paper is based on the basic multiplication algorithm, the classic paper-and-pencil approach [7], and passes through three fundamental stages: 1) Partial Product (PP) generation, 2) Partial product reduction, and 3) Final (carry-propagated) addition. The partial product reduction block is the most resource-intensive of the three. In this work, parallel prefix circuits that take n inputs and produce the outputs are used to realize fast adders [8], [9]; these adders are in turn used to realize the partial product reduction block and obtain a multiplier with improved performance [10].
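The three stages, and the idea of folding the accumulator into the reduction tree, can be illustrated with a small behavioural model. The following Python sketch is only an illustration under stated assumptions: the function names (partial_products, reduce_rows, multiply, mac) are hypothetical, each "+" inside reduce_rows stands in for one multi-bit adder block, and the pairwise row grouping is a generic choice rather than the grouping used by the proposed PPRB.

```python
def partial_products(x, h, width=16):
    """Stage 1: one shifted copy of the multiplicand h per set bit of the multiplier x."""
    return [(h << i) if (x >> i) & 1 else 0 for i in range(width)]


def reduce_rows(rows):
    """Stage 2: reduce the rows pairwise, level by level, until two rows remain.

    Each '+' models one multi-bit adder block working on whole rows
    (row-wise reduction), not a column of single-bit compressors.
    """
    rows = list(rows)
    while len(rows) > 2:
        paired = [rows[i] + rows[i + 1] for i in range(0, len(rows) - 1, 2)]
        if len(rows) % 2:              # an unpaired row passes to the next level
            paired.append(rows[-1])
        rows = paired
    while len(rows) < 2:               # pad so that two rows are always returned
        rows.append(0)
    return rows[0], rows[1]


def multiply(x, h, width=16):
    """Unsigned multiplication through the three stages."""
    a, b = reduce_rows(partial_products(x, h, width))
    return a + b                       # Stage 3: final carry-propagated addition


def mac(acc, x, h, width=16):
    """Merged MAC: the accumulator enters the reduction as one extra row,
    so no separate accumulate adder (second carry propagation) is needed."""
    a, b = reduce_rows(partial_products(x, h, width) + [acc])
    return a + b


if __name__ == "__main__":
    print(multiply(1234, 5678))        # same value as 1234 * 5678
    print(mac(100, 12, 13))            # same value as 100 + 12 * 13
```

In the merged form the only full-width carry propagation is the last addition, which is the property the feedback arrangement described above relies on.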
The paper is organised as follows: Section II addresses existing methods of multiplier implementation. Section III describes the proposed partial product reduction block, along with a brief description of the Brent-Kung adder and the CLA used in the proposed architecture. Section IV describes the simulation setup and the experimental results. Finally, conclusions are drawn in Section V.

A. General MAC unit operation [11]:

The multiply-accumulate operation is one of the most frequently used operations in DSP architectures. To evaluate expressions such as

y[n] = \sum_{k} x[k] \, h[n - k]    (1)

the products of the different values of x and h must first be computed and then added to obtain the output y[n]. Instead of waiting for all the multiplication results to become available, the addition can be carried out in parallel with the multiplication using the MAC operation. The hardware used for this is called a MAC unit. The general expression for the MAC operation is

y[n + 1] = y[n] + x[n + 1] \times h[n + 1]    (2)

where x[i] is the multiplier and h[i] is the multiplicand, each n bits wide. The basic block diagram of the MAC unit is shown in Fig. 1.

978-1-4799-8047-5/15/$31.00 © 2015 IEEE
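As a software analogy of equations (1) and (2), the sketch below evaluates one output sample of the sum in (1) by repeating the accumulate step of (2). The names mac_step and convolve_at are hypothetical, and finite-length Python lists stand in for the signals x and h; this is a behavioural illustration, not a model of the hardware in Fig. 1.

```python
def mac_step(acc, x, h):
    """One MAC operation, as in equation (2): new accumulator = old + x * h."""
    return acc + x * h


def convolve_at(x, h, n):
    """Evaluate y[n] of equation (1) for finite sequences by repeated MAC steps."""
    acc = 0
    for k in range(len(x)):
        if 0 <= n - k < len(h):        # skip terms where h[n - k] is outside the sequence
            acc = mac_step(acc, x[k], h[n - k])
    return acc


if __name__ == "__main__":
    x = [1, 2, 3]
    h = [4, 5]
    print([convolve_at(x, h, n) for n in range(len(x) + len(h) - 1)])
    # prints [4, 13, 22, 15]
```

Each loop iteration corresponds to one pass through the MAC unit, so the number of MAC operations per output sample equals the number of overlapping terms in the sum.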
Parameter           MAC with Booth multiplier   MAC using [i]   MAC using [ii]
Total Power (mW)    0.38                        0.227           0.2344
Arrival (ns)        13.594                      6.972           7.365
Fig. 5. (a) Area performance of different MAC implementations, (b) Delay performance of different MAC implementations.
Fig. 6. Power consumed by different MAC units.
to the MAC unit implemented using [ii]. The power consumption
of the MAC unit using [i] is 40.26% lower than that of the
MAC with Booth multiplier implementation and 3.16% lower
than that of the MAC unit implemented using [ii]. The delay
of the MAC unit using [i] is 48.71% lower than that of the
MAC unit using the Booth multiplier implementation and
5.34% lower than that of the MAC unit implemented using [ii].
The area, delay and power performance of the different MAC
unit implementations are compared graphically in Fig. 5 and
Fig. 6.
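The quoted percentages can be reproduced from the tabulated power and arrival-time values. The short Python check below is a sketch that assumes the three table columns correspond, in order, to the Booth-multiplier MAC, the MAC using [i] and the MAC using [ii]; that column order is inferred from the percentages in the text, and the helper name percent_reduction is hypothetical.

```python
def percent_reduction(reference, value):
    """Relative reduction of value with respect to reference, in percent."""
    return (reference - value) / reference * 100.0


# Values read from the results table (assumed column order: Booth, [i], [ii]).
power_mw = {"booth": 0.38, "i": 0.227, "ii": 0.2344}
arrival_ns = {"booth": 13.594, "i": 6.972, "ii": 7.365}

if __name__ == "__main__":
    print(round(percent_reduction(power_mw["booth"], power_mw["i"]), 2))      # 40.26
    print(round(percent_reduction(power_mw["ii"], power_mw["i"]), 2))         # 3.16
    print(round(percent_reduction(arrival_ns["booth"], arrival_ns["i"]), 2))  # 48.71
    print(round(percent_reduction(arrival_ns["ii"], arrival_ns["i"]), 2))     # 5.34
```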
V. CONCLUSION
In this paper, the design and implementation of a MAC unit
in Verilog has been presented. The MAC unit discussed
performs 16-bit unsigned operations. In this work, the
multiplier is realized using Carry-Lookahead Adders (CLA)
of different bit widths and a Brent-Kung adder. The MAC
unit implemented with the multiplier that uses the proposed
partial product reduction block achieves better delay, power
and area performance than a MAC unit built around the
conventional Booth multiplier.
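For readers unfamiliar with the Brent-Kung adder named above, the following Python sketch models the textbook parallel-prefix structure at bit level: generate and propagate signals are combined in an up-sweep and a down-sweep to obtain every carry. It is a generic behavioural illustration with a hypothetical function name (brent_kung_add) and an assumed power-of-two width; it does not reproduce the Verilog implementation or the adder bit widths used in this work.

```python
def brent_kung_add(a, b, width=16):
    """Add two unsigned width-bit integers using a Brent-Kung prefix network.

    Assumes width is a power of two and the carry-in is zero.
    """
    assert width > 0 and width & (width - 1) == 0, "width must be a power of two"
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]   # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]   # bit propagate
    G, P = g[:], p[:]                                       # group signals

    def combine(hi, lo):
        # Prefix operator: (G, P)[hi] absorbs the adjacent lower group (G, P)[lo].
        G[hi] |= P[hi] & G[lo]
        P[hi] &= P[lo]

    levels = width.bit_length() - 1
    for d in range(levels):                  # up-sweep: build power-of-two groups
        step = 1 << d
        for i in range(2 * step - 1, width, 2 * step):
            combine(i, i - step)
    for d in range(levels - 2, -1, -1):      # down-sweep: fill in remaining prefixes
        step = 1 << d
        for i in range(3 * step - 1, width, 2 * step):
            combine(i, i - step)

    # After both sweeps, G[i] is the carry out of bit position i.
    result = 0
    for i in range(width):
        carry_in = G[i - 1] if i else 0
        result |= (p[i] ^ carry_in) << i
    return result | (G[width - 1] << width)  # include the final carry-out


if __name__ == "__main__":
    print(hex(brent_kung_add(0xFFFF, 0x0001)))   # 0x10000
```

The up-sweep and down-sweep together use about 2·log2(width) − 1 combining levels, which is the trade-off that makes this structure attractive when wiring and area matter as well as delay.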
REFERENCES
[1] A. Farooqui and V. Oklobdzija, “General Data-Path Organization of a MAC Unit for VLSI Implementation of DSP Processors,” in Proc. IEEE Intl. Symposium on Circuits and Systems, May 1998, pp. 260-263.
[2] O. Chen, N. Y. Shen, and C. C. Shen, “A Low-Power Multiplication Accumulation Calculation Unit for Multimedia Applications,” in Proc. IEEE Intl. Conference on Acoustics, Speech, and Signal Processing, Apr. 2003, pp. II-645-8.
[3] L. H. Chen, L. H. Chen, T. Y. Wang, and Y. C. Ma, “A Multiplication Accumulation Computation Unit with Optimized Compressors and Minimized Switching Activities,” in Proc. IEEE Intl. Symposium on Circuits and Systems, May 2005, pp. 6118-6121.
[4] A. Abdelgawad, “Low Power Multiply Accumulate Unit (MAC) for Future Wireless Sensor Networks,” in Proc. IEEE Sensors App. Symposium, Feb. 2013, pp. 129-132.
[5] A. Abdelgawad and M. Bayoumi, “High Speed and Area-Efficient Multiply Accumulate (MAC) Unit for Digital Signal Processing Applications,” in Proc. IEEE Intl. Symposium on Circuits and Systems, May 2007, pp. 3199-3202.
[6] T. T. Hoang, M. Sjalander, and P. Larsson-Edefors, “A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and its Application to a Double-Throughput MAC Unit,” IEEE Trans. on Circuits and Systems, vol. 57, no. 12, pp. 3073-3081, Dec. 2010.
[7] C. Wallace, “A Suggestion for a Fast Multiplier,” IEEE Trans. on Electronic Computers, vol. EC-13, no. 1, pp. 14-17, Feb. 1964.
[8] M. S. Schmookler and A. Weinberger, “High Speed Decimal Addition,” IEEE Trans. on Computers, vol. C-20, no. 8, pp. 862-866, Aug. 1971.
[9] R. P. Brent and H. Kung, “A Regular Layout for Parallel Adders,” IEEE Trans. on Computers, vol. C-31, no. 3, pp. 260-264, Mar. 1982.
[10] W. Chu, A. Unwala, P. Wu, and E. Swartzlander, “Implementation of a High Speed Multiplier using Carry Lookahead Adders,” in Proc. IEEE Asilomar Conf. on Signals, Systems and Computers, Nov. 2013, pp. 400-404.
[11] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, “Discrete-time Signal Processing (2nd Ed.),” Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1999.
[12] A. Booth, “A Signed Binary Multiplication Technique,” Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236-240, Jun. 1951.
918 2015 IEEE International Advance Computing Conference (IACC)