CHAPTER 1
INTRODUCTION

1.1 INTRODUCTION
Multipliers are key components of many high-performance systems
such as FIR filters, microprocessors, and digital signal processors.
A system's performance is generally determined by the performance of
the multiplier, because the multiplier is generally the slowest
element in the system. Furthermore, it is generally the most area
consuming. Hence, optimizing the speed and area of the multiplier is
a major design issue. However, area and speed are usually conflicting
constraints, so improving speed mostly results in larger areas. As a
result, a whole spectrum of multipliers with different area-speed
trade-offs has been designed, with fully parallel multipliers at one
end of the spectrum and fully serial multipliers at the other. In
between are digit-serial multipliers, which operate on single digits
consisting of several bits. These multipliers have moderate
performance in both speed and area. However, existing digit-serial
multipliers have been plagued by complicated switching systems
and/or irregularities in design. Radix 2^n multipliers, which
operate on digits in a parallel fashion instead of bits, bring the
pipelining to the digit level and avoid most of the above problems.
They were introduced by M. K. Ibrahim in 1993. These structures are
iterative and modular. The pipelining done at the digit level
brings the benefit of constant operation speed irrespective of the
size of the multiplier. The clock speed is determined only by the
digit size, which is fixed before the design is implemented.
1.2 MOTIVATION

As the scale of integration keeps growing, more and more
sophisticated signal processing systems are being implemented on a
VLSI chip. These signal processing applications not only demand great
computation capacity but also consume considerable amounts of energy.
While performance and area remain the two major design goals, power
consumption has become a critical concern in today's VLSI system
design. The need for
low-power VLSI systems arises from two main forces. First, with the
steady growth of operating frequency and processing capacity per
chip, large current has to be delivered and the heat due to large
power consumption must be removed by proper cooling techniques.
Second, battery life in portable electronic devices is limited. Low
power design directly leads to prolonged operation time in these
portable devices.
Multiplication is a fundamental operation in most signal
processing algorithms. Multipliers have large area, long latency,
and consume considerable power. Therefore, low-power multiplier
design has been an important part of low-power VLSI system design.
The primary objective is power reduction with small area and delay
overhead. By using the radix-4 Booth algorithm, we design a
multiplier with low power and smaller area.
1.3 POWER OPTIMIZATION
Power refers to the number of joules dissipated per unit time,
whereas energy is a measure of the total number of joules dissipated
by a circuit. Strictly speaking, low-power design is a different goal
from low-energy design, although they are related. Power is a problem
primarily when cooling is a concern.
The maximum power at any time, peak power, is often used for power
and ground wiring design, signal noise margin and reliability
analysis. Energy per operation or task is a better metric of the
energy efficiency of a system, especially in the domain of
maximizing battery lifetime.
In digital CMOS design, the well-known power-delay product is
commonly used to assess the merits of designs. In a sense, this is
a misnomer, as power × delay = (energy/delay) × delay = energy, which
implies delay is irrelevant. Instead, the term energy-delay product
should be used, since it involves two independent measures of circuit
behavior. Therefore, when power-delay products are used as a
comparison metric, different schemes should be measured at the same
frequency to ensure that it is equivalent to an energy-delay product
comparison.
There are two major sources of power dissipation in digital CMOS
circuits: dynamic power and static power. Dynamic power is related
to circuit switching activities, or the changing events of logic
states, including power dissipation due to capacitance charging and
discharging, and dissipation due to short-circuit current (SCC). In
CMOS logic, unintended leakage current, either reverse-biased
PN-junction current or subthreshold channel conduction current, is
the only source of static current. However, occasional deviations
from the strict CMOS logic style, such as pseudo-NMOS logic, can
cause intended static current. The total power consumption is
summarized in the following equations:
Ptotal = Pdynamic + Pstatic = Pcap + Pscc + Pstatic (1.1)
Pcap = α0→1 · fclk · CL · VDD² (1.2)
Pscc = α0→1 · fclk · Ipeak · (tr + tf)/2 · VDD (1.3)
Pstatic = Istatic · VDD (1.4)
Pcap in Equation 1.2 represents the dynamic power due to
capacitance charging and discharging of a circuit node, where CL is
the loading capacitance, fclk is the clock frequency, and α0→1 is the
0→1 transition probability in one clock period. In most cases, the
voltage swing Vswing is the same as the supply voltage VDD;
otherwise, Vswing should replace VDD in this equation. Pscc is a
first-order average power consumption due to short-circuit current.
The peak current, Ipeak, is determined by the saturation current of
the devices and is hence directly proportional to the sizes of the
transistors. tr and tf are rising time and falling time of
short-circuit current, respectively. The static power Pstatic is
primarily determined by fabrication technology considerations and is
usually several orders of magnitude smaller than the dynamic power.
The leakage power problem mainly appears in very low frequency
circuits or ones with sleep modes where dynamic activities are
suppressed. The dominant term in a well-designed circuit during its
active state is the dynamic term due to switching activity on loading
capacitance, and thus low-power design often becomes the task of
minimizing α0→1, CL, VDD and fclk, while retaining the required
functionality. In the future, static power will become increasingly
important as the supply voltage keeps scaling. To avoid performance
degradation, the threshold voltage Vt is lowered accordingly, and
subthreshold leakage current increases exponentially. Leakage power
reduction depends heavily on circuit and technology techniques such
as dual-Vt partitioning and multi-threshold CMOS. In this work, we
will not consider leakage power reduction.
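As a quick numeric illustration of Equation 1.2, the sketch below evaluates the switching power of a single node. All parameter values (activity factor, clock rate, load, supply) are illustrative assumptions, not measurements from any particular process.

```python
# Sketch: first-order dynamic power from Equation 1.2,
# Pcap = alpha * fclk * CL * VDD^2.

def dynamic_power(alpha, f_clk, c_load, vdd):
    """Switching power of one node in watts."""
    return alpha * f_clk * c_load * vdd ** 2

# Hypothetical node: 25% switching activity, 100 MHz clock,
# 50 fF load, 1.2 V supply.
p = dynamic_power(0.25, 100e6, 50e-15, 1.2)
print(f"{p * 1e6:.2f} uW")
```

The quadratic dependence on VDD is why supply-voltage scaling is the single most effective dynamic-power lever mentioned above.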
Power optimization of digital systems has been studied at
different abstraction levels, from the lowest technology level to the
highest system level. At the technology level, power consumption
is reduced by improvements in the fabrication process such as smaller
feature sizes, very low voltages, copper interconnects, and
insulators with low dielectric constants. With fabrication support
for multiple supply voltages, lower voltages can be applied to
non-critical system blocks. At the layout level, placement and
routing are adjusted to reduce wire capacitance and signal delay
imbalances. At the circuit level, power reduction is achieved by
transistor sizing, transistor network restructuring and
reorganization, and different circuit logic styles.

1.4 LOW POWER MULTIPLIER DESIGN
Multiplication consists of three steps: generation of partial
products or PPs (PPG), reduction of partial products (PPR), and
final carry-propagate addition (CPA). In general, there are
sequential and combinational multiplier implementations. We only
consider combinational multipliers in this work because the scale of
integration is now large enough to accommodate parallel multiplier
implementations in digital VLSI systems. Different multiplication
algorithms vary in their approaches to PPG, PPR, and CPA. For PPG,
radix-2 digit-vector multiplication is the simplest form because
the digit-vector multiplication is produced by a set of AND gates.
To reduce the number of PPs, and consequently reduce the area/delay
of PP reduction, one operand is usually recoded into high-radix
digit sets. The most popular is the radix-4 digit set {−2, −1, 0,
1, 2}. For PPR, two alternatives exist: reduction by rows,
performed by an array of adders, and reduction by columns,
performed by an array of counters. In reduction by rows, there are
two extreme classes: linear array and tree array. A linear array has
a delay of O(n), while both tree arrays and column reduction have
a delay of O(log n), where n is the number of PPs. The final CPA
requires a fast adder scheme because it is on the critical path. In
some cases, the final CPA is postponed if it is advantageous to keep
redundant results from PPG for further arithmetic operations.
The difficulty of low-power multiplier design lies in three
aspects. First, the multiplier area is quadratically related to the
operand precision. Second, parallel multipliers have many logic
levels that introduce spurious transitions or glitches. Third, the
structure of parallel multipliers could be very complex in order to
achieve high speed, which deteriorates the efficiency of layout and
circuit level optimization. As a fundamental arithmetic operation,
multiplication has many algorithm-level and bit-level computation
features in which it differs from random logic. These features have
not been considered well in low-level power optimization. It is
also difficult to consider input data characteristics at low
levels. Therefore, it is desirable to develop algorithm- and
architecture-level power optimization techniques that consider
multiplication's arithmetic features and operand characteristics.
There has been some work on low-power multipliers at the
algorithm and architecture level. As smaller area usually leads to
less switching capacitance, published area results can provide a
rough estimate of the relative power consumption of different
multiplication schemes. Callaway studied the power/delay/area
characteristics of four classical multipliers. Angel proposed
low-power sign extension schemes and self-timed design with
bypassing logic for zero PPs in radix-4 multipliers. Cherkauer and
Friedman proposed a hybrid radix-4/radix-8 low-power signed
multiplier architecture. For multiplication data with large dynamic
range, several approaches have been proposed. Architecture-level
signal gating techniques have been studied. A mixed number
representation for radix-4 two's-complement multiplication has been
proposed. Radix-4 recoding has been applied to the constant input
instead of the dynamic input in low-power multiplication for FIR
filters. Multiplication has been separated into higher and lower
parts, with the results of the higher part stored in a cache in
order to reduce redundant computation. Two techniques have been
proposed for data with large dynamic range: a
most-significant-digit-first carry-save array for PP reduction, and
a dynamically generated reduced two's-complement representation.
Finally, the precisions of the two input data can be compared at
runtime and the two operands exchanged if necessary, so that radix-4
recoding is applied to the operand with smaller precision in order
to generate more zero PPs.

CHAPTER 2
MULTIPLIERS
2.1 MULTIPLIERS: OVERVIEW
Multipliers can be classified as hardware multipliers and
software multipliers. In older digital systems, there was no
hardware multiplier, and multiplication was implemented with a
microprogram. The microprogram needed many microinstruction cycles
to complete the multiplication process, which made microprogrammed
multipliers slow. For high-speed digital systems, hardware
multipliers are usually used. In modern microprocessors and ASIC
processors, most arithmetic logic units (ALUs) contain a hardware
multiplier. High-speed hardware multipliers have been of interest
for some time. More sophisticated approaches to multiplier design
can be implemented today due to the increased density of integrated
circuits.
Hardware multipliers can be divided into two main categories:
sequential and parallel array multipliers. For a sequential
multiplier, multiplication of the multiplier and multiplicand is the
operation of repeatedly adding the multiplicand and shifting. The
advantage of a sequential multiplier is that the circuit is simple
and occupies less chip area; the disadvantage is that it is slower.
For parallel array multipliers, the summation of partial products
is carried out using a linear adder array. Because the operation is
in parallel, the speed is much faster than that of a sequential
multiplier.
There are a number of algorithms used for multiplication. The
3-bit recoding algorithm is one of the best known. It is used in
the design of many kinds of hardware and software multipliers.
This algorithm is used to reduce the number of partial product rows
by about half, so the speed of multiplication increases
significantly and the chip area is reduced. The 3-bit recoding
algorithm is also called the Modified Booth's Algorithm and was
developed from Booth's algorithm. A number of other multiple-bit
recoding algorithms for multiplication have been developed.
Recently, a parallel hardware multiplier based on a 5-bit recoding
algorithm has been proposed. From the view of optimization, the
5-bit recoding algorithm is preferred to a 4-bit recoding
algorithm. While more partial product rows can be reduced with the
5-bit recoding algorithm than with a 3-bit recoding algorithm, more
complicated circuits are required to determine the odd multiples of
the multiplicand. With the potential of improving both performance
and the hardware requirements, the 5-bit recoding algorithm may be
good for a high-bit multiplier, but not for a low-bit multiplier
such as an 8 x 8 bit multiplier.
Using the 5-bit recoding algorithm reduces the number of partial
product rows to two. The partial products are selected from 17
different multiples of the multiplicand Y (0, ±Y, ±2Y, ±3Y, ±4Y,
±5Y, ±6Y, ±7Y, ±8Y). Using the 3-bit recoding algorithm reduces the
number of partial product rows to four. The partial products are
selected from 5 different multiples of the multiplicand Y (0, ±Y,
±2Y). The addition of four partial products can be changed to the
addition of two binary numbers by using two rows of carry save adder
arrays (CSA), with only a two-gate delay introduced. The even
multiples of the multiplicand Y can be implemented using a hardwired
shift. For the 3-bit recoding algorithm, only the two's complement
of Y needs to be determined. For the 5-bit recoding algorithm,
additional high-speed adders are required to determine the odd
multiples of Y. These high-speed adders require more circuitry to
implement and suffer more time delay. For higher-bit multipliers,
the advantage of the 5-bit recoding algorithm can be seen. For
example, for a 32 x 32 bit multiplier, the 5-bit recoding algorithm
reduces the number of partial product rows to 8, while the 3-bit
recoding algorithm reduces the number of partial product rows to 16.
The reduction in the number of partial product rows is apparent.

2.2 BINARY MULTIPLIER

A binary multiplier is an electronic hardware device used in digital
electronics, a computer, or another electronic device to perform
rapid multiplication of two numbers in binary representation. It is
built using binary adders.
The rules for binary multiplication can be stated as follows:
1. If the multiplier digit is a 1, the multiplicand is simply
copied down and represents the product.
2. If the multiplier digit is a 0, the product is also 0.
For designing a multiplier circuit, we should have circuitry to
provide or do the following:
1. It should be capable of identifying whether a bit is 0 or 1.
2. It should be capable of shifting partial products left.
3. It should be able to add all the partial products to give the
product as the sum of the partial products.
4. It should examine the sign bits. If they are alike, the sign
of the product will be positive; if the sign bits are opposite, the
product will be negative. The sign bit of the product determined by
the above criteria should be displayed along with the product.
From the above discussion, we observe that it is not necessary to
wait until all the partial products have been formed before summing
them. In fact, the addition of partial products can be carried out
as soon as each partial product is formed.
2.3 DIRECT MULTIPLICATION OF TWO UNSIGNED BINARY NUMBERS
The process of digital multiplication is based on addition, and
many of the techniques useful in addition carry over to
multiplication. The general scheme for unsigned multiplication is
shown in Figure 2.1.
Figure 2.1 Digital multiplication of unsigned four bit binary
numbers
For the multiplication of an n-bit multiplier and an m-bit
multiplicand, the product is represented with an (n + m)-bit binary
number. To complete the multiplication:
(1) the partial products can be added sequentially, or
(2) the partial products can be added using a parallel adder
array.
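The AND-gate partial-product scheme of Figure 2.1 can be sketched behaviorally as follows; the 4-bit width is chosen only to match the figure.

```python
# Sketch of unsigned multiplication as in Figure 2.1: each partial
# product row is the multiplicand ANDed with one multiplier bit,
# shifted left by that bit's position, and all rows are summed.

def unsigned_multiply(x, y, n=4):
    """Multiply two n-bit unsigned numbers by summing partial products."""
    product = 0
    for i in range(n):
        if (x >> i) & 1:          # multiplier bit is 1: copy multiplicand
            product += y << i     # shifted partial-product row
        # multiplier bit is 0: the partial-product row is all zeros
    return product

print(unsigned_multiply(13, 11))  # 4-bit operands, 8-bit product
```

Note that for n-bit by m-bit operands the accumulated product always fits in n + m bits, as stated above.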
2.4 MULTIPLY ACCUMULATE CIRCUITS
Multiplication followed by accumulation is an operation in many
digital systems, particularly highly interconnected ones like
digital filters, neural networks, and data quantisers. One typical
MAC (multiply-accumulate) architecture is illustrated in the figure.
It consists of multiplying two values, then adding the result to the
previously accumulated value, which must then be stored back in the
register for future accumulations. Another feature of a MAC circuit
is that it must check for overflow, which might happen when the
number of MAC operations is large.
This design can be done using components, because we have already
designed each of the units shown in the figure. However, since it is
a relatively simple circuit, it can also be designed directly. In
any case, the MAC circuit as a whole can be used as a component in
applications like digital filters and neural networks.
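The multiply, accumulate, store-back and overflow-check sequence described above can be captured in a few lines. The 16-bit accumulator width below is an assumption for illustration, not taken from the figure.

```python
# Behavioral sketch of a MAC loop: multiply two values, add the
# result to the stored accumulator, and flag overflow against a
# fixed accumulator width (16 bits assumed here).

class MAC:
    def __init__(self, width=16):
        self.acc = 0
        self.limit = 1 << width       # accumulator capacity
        self.overflow = False

    def step(self, a, b):
        self.acc += a * b             # multiply, then accumulate
        if self.acc >= self.limit:    # overflow check
            self.overflow = True
            self.acc %= self.limit    # wrap, as a fixed register would
        return self.acc

mac = MAC()
for a, b in [(3, 4), (5, 6), (7, 8)]:
    mac.step(a, b)
print(mac.acc, mac.overflow)          # 12 + 30 + 56 = 98, no overflow
```

In an FIR filter, this loop would run once per tap, which is why the overflow check matters when the number of MAC operations is large.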
2.5 SEQUENTIAL MULTIPLIER/ARRAY MULTIPLIER
A sequential multiplier implements multiplication by repeatedly
adding the multiplicand and shifting the partial product. The
advantage of a sequential multiplier is that the circuit is simple
and occupies less chip area; the disadvantage is that it is slower.
A sequential multiplier usually consists of a register, MD, which
holds the multiplicand; a shift register, MR, which initially holds
the multiplier; a shift accumulator, which holds the partial
product; and a shift counter.
Figure 2.2 Flow chart for multiplication process of a sequential
multiplier
Figure 2.2 shows the multiplication process of a sequential
multiplier.
The steps for multiplication are given by:
1. The multiplier and multiplicand are loaded into the registers
MR and MD, respectively, and the accumulator and counter are reset
to zero.
2. The least significant bit of the shift register MR is tested;
if it is 1, the multiplicand Y is added to the partial product.
3. The partial product and multiplier are shifted one place
right and the least significant bit of the multiplier is
discarded.
4. The counter number is increased by one.
5. If the count is equal to the number n of bits in the
multiplier, the multiplication process is complete and the product
is equal to the number held in the accumulator; otherwise, the
operation returns to step 2.
A sequential multiplier is a simple circuit and occupies less
chip area, but it is slow. To increase the speed of multiplication,
parallel adder arrays are used to add partial products.
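The five steps of Figure 2.2 can be sketched directly; the conditional add targets the upper half of the double-width accumulator, so the right shift in step 3 aligns everything automatically.

```python
# Behavioral sketch of the shift-and-add sequence in Figure 2.2:
# test the LSB of MR, conditionally add MD into the accumulator,
# shift right, and repeat n times.

def sequential_multiply(md, mr, n=8):
    """n-bit unsigned sequential (shift-and-add) multiplication."""
    acc = 0                       # shift accumulator, 2n bits wide
    for _ in range(n):            # the shift counter counts to n
        if mr & 1:                # step 2: test LSB of MR
            acc += md << n        # add multiplicand into the upper half
        acc >>= 1                 # step 3: shift partial product right...
        mr >>= 1                  # ...and discard the multiplier LSB
    return acc

print(sequential_multiply(23, 19))
```

Each iteration models one clock cycle, which is why an n-bit sequential multiplier needs n cycles while a parallel array needs only combinational delay.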
2.6 PARALLEL MULTIPLIER
In parallel array multipliers, the summation of partial products
is carried out using a linear adder array. Because the operation is
in parallel, the speed is much faster than that of a sequential
multiplier.
There are a number of algorithms used for multiplication; two of
them are the radix-2 and radix-4 Booth algorithms. The radix-4
algorithm is also called the modified Booth algorithm.
2.7 ARCHITECTURE OF RADIX 2^n MULTIPLIER
The architecture of a radix 2^n multiplier is given in the
figure. This block diagram shows the multiplication of two numbers
with four digits each. These numbers are denoted as V and U, while
the digit size was chosen as four bits. The reason for this will
become apparent in the following sections. Each circle in the
figure corresponds to a radix cell, which is the heart of the
design. Every radix cell has four digit inputs and two digit
outputs. The input digits are also fed through the corresponding
cells. The dots in the figure represent latches for pipelining.
Every dot consists of four latches. The ellipses represent adders
which are included to calculate the higher-order bits. They do not
fit the regularity of the design, as they are used to terminate the
design at the boundary. The outputs are again in terms of four-bit
digits and are shown by the W's. The labels denote the clock period
at which the data appear.
Figure 2.3 Radix 2^n multiplier architecture
2.8 BOOTH MULTIPLICATION ALGORITHM
Booth's multiplication algorithm will multiply two signed binary
numbers in two's complement notation.
Procedure:
If x is the number of bits of the multiplicand and y is the
number of bits of the multiplier:
Draw a grid of three lines, each with squares for x + y + 1
bits. Label the lines respectively A (add), S (subtract), and P
(product).
In two's complement notation, fill the first x bits of each line
with:
A: the multiplicand
S: the negative of the multiplicand
P: zeroes
Fill the next y bits of each line with:
A: zeroes
S: zeroes
P: the multiplier
Fill the last bit of each line with a zero.
Do both of these steps y times:
1. If the last two bits in the product are...
a) 00 or 11: do nothing.
b) 01: P = P + A. Ignore any overflow.
c) 10: P = P + S. Ignore any overflow.
2. Arithmetically shift the product right one position.
Drop the last bit from the product for the final result.
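The A/S/P procedure above can be sketched with plain integers standing in for the three bit lines; the masking and the manual arithmetic right shift mimic the fixed (x + y + 1)-bit grid.

```python
# Sketch of the Booth A/S/P procedure for two's-complement operands.
# x and y are the bit counts of multiplicand m and multiplier r.

def booth_multiply(m, r, x, y):
    total = x + y + 1
    mask = (1 << total) - 1
    A = (m & ((1 << x) - 1)) << (y + 1)      # multiplicand in the high bits
    S = ((-m) & ((1 << x) - 1)) << (y + 1)   # its two's-complement negative
    P = (r & ((1 << y) - 1)) << 1            # multiplier, appended 0 at bit 0
    for _ in range(y):
        last_two = P & 0b11
        if last_two == 0b01:
            P = (P + A) & mask               # 01: P = P + A, ignore overflow
        elif last_two == 0b10:
            P = (P + S) & mask               # 10: P = P + S, ignore overflow
        # arithmetic right shift within the (x + y + 1)-bit grid
        sign = P >> (total - 1)
        P = (P >> 1) | (sign << (total - 1))
    P >>= 1                                  # drop the appended last bit
    if P >> (x + y - 1):                     # reinterpret as signed
        P -= 1 << (x + y)
    return P

print(booth_multiply(3, -4, 4, 4))
```

Running it on 3 × (−4) with 4-bit operands returns −12, matching ordinary signed multiplication.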
2.9 BOOTH MULTIPLICATION ALGORITHM FOR RADIX 4
One of the ways of realizing high-speed multipliers is to
enhance parallelism, which helps to decrease the number of
subsequent calculation stages. The original version of the Booth
algorithm (radix-2) had two drawbacks:
(i) the number of add/subtract operations and the number of
shift operations become variable, which is inconvenient when
designing parallel multipliers; and
(ii) the algorithm becomes inefficient when there are isolated
1's.
These problems are overcome by using the radix-4 Booth algorithm.
This algorithm is used to reduce the number of partial product rows
by about half, so the speed of multiplication increases
significantly and it also consumes less power.
The radix-4 Booth algorithm scans strings of three bits, as given
below:
1) Extend the sign bit one position if necessary to ensure that n
is even.
2) Append a 0 to the right of the LSB of the multiplier.
3) According to the value of each vector, each partial product
will be 0, +y, −y, +2y or −2y.
The negative values of y are made by taking the two's complement,
and in this work carry-look-ahead (CLA) fast adders are used.
Multiplication of y by 2 is done by shifting y one bit to the left.
Thus, in designing an n-bit parallel multiplier, only n/2 partial
products are generated.
Table 2.1 Radix-4 Booth recoding table

Let us see an example demonstrating the whole procedure of the
radix-4 Booth multiplier using a Wallace tree and sign extension
correctors. Take the example of calculating 34 × (−42).
Multiplicand A = 34 = 00100010
Multiplier B = −42 = 11010110 (two's complement form)
A × B = 34 × (−42) = −1428
First of all, the multiplier has to be converted into radix-4
digits, as in the figure below. The first partial product is
determined by the three LSB digits of the multiplier, B1, B0, and
one appended zero. This 3-bit group is 100, which means the
multiplicand A has to be multiplied by −2. To multiply by −2, the
process takes the two's complement of the multiplicand value and
then shifts the result left by one bit. Hence, the first partial
product is 110111100. All of the partial products are nine bits
long.
Next, the second partial product is determined by bits B3, B2,
B1, which indicate multiplication by 2. Multiplying by 2 means the
multiplicand value is shifted left by one bit. So, the second
partial product is 001000100. The third partial product is
determined by bits B5, B4, B3, which indicate multiplication by 1.
So, the third partial product is the multiplicand value itself,
namely 000100010. The fourth partial product is determined by bits
B7, B6, B5, which indicate multiplication by −1. Multiplying by −1
means the multiplicand is converted to its two's complement value.
So, the fourth partial product is 111011110.
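The four digits worked out above (−2, +2, +1, −1) can be cross-checked numerically: weighting each digit's partial product by its 2-bit position and summing should reproduce the expected product.

```python
# Cross-check of the worked example: recode digits of -42 taken from
# the text, form the four shifted partial products, and confirm the
# sum equals 34 * (-42) = -1428.

A = 34
digits = [-2, 2, 1, -1]                  # radix-4 digits of -42, LSB first
partials = [d * A << (2 * i) for i, d in enumerate(digits)]
print(partials, sum(partials))
```

The shift of 2·i bits per row is exactly the "shifted 2(k−1) bits" alignment used when the rows are fed to the Wallace tree.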
The figure below shows the arrangement of all four partial
products to be added using the Wallace tree adder method. The
sign-extension terms E are obtained based on Table 4.2. The way this
sign E is arranged has been shown in the Wallace tree multiplication
method above. The Wallace tree for the example is given below.
Figure 2.4 Method showing how partial products should be added

To prove the output result is correct:
11111101001101100 = 2^0(0) + 2^1(0) + 2^2(1) + 2^3(1) + 2^4(0) +
2^5(1) + 2^6(1) + 2^7(0) + 2^8(0) + 2^9(1) + 2^10(0) + 2^11(−1)
= 4 + 8 + 32 + 64 + 512 − 2048 = −1428

2.10 ARCHITECTURE OF AN 8 X 8 BIT MULTIPLIER
Figure 2.5 shows the architecture of an 8 x 8 bit parallel
multiplier using the 3-bit recoding algorithm. The number of
partial product rows is reduced to four. The two's complement block
is used to determine the two's complement of the multiplicand Y,
which represents the negative of Y. The 3-bit encoder gives four
4-bit codes, S1, S2, S3 and S4, which are used to determine the
partial products Pk.
Figure 2.5 Architecture of an 8 x 8 bit multiplier
The partial product selectors are four identical multiplexers
which are used to determine the partial products Pk (k = 1, 2, 3,
4). The product P of X and Y is:
P = P1 + P2 + P3 + P4
The partial products Pk (k = 1, 2, 3, 4) are shifted 2(k−1) bits to
the right of P1 and need to be sign extended to 16 bits. Carry save
adder arrays (CSA) are used to add multiple inputs and change the
summation of multiple numbers into the summation of two numbers.
There is no carry propagation delay in the carry save adder array,
so the speed is high. Two 16-bit numbers, A[15...0] and B[15...0],
are obtained from the CSA. The 4 least significant bits of B are
zero, so it is only necessary to add the two numbers A[15...4] and
B[15...4]. For the least significant 4 bits, P[3...0] = A[3...0].
A ripple carry adder array (RCA) is used to add A[9...4] and
B[9...4] to obtain P[9...4]. A carry select adder array is used to
add A[15...10] and B[15...10] to obtain P[15...10].
The logic implementations of the different multiplier blocks are
detailed in the following sections. These blocks are based on logic
gates and half and full adders.
2.11 TWO'S COMPLEMENT BLOCK
The two's complement block is used to determine the two's
complement of the multiplicand Y, which represents the negative of
Y. The figure shows the circuit of the two's complement block.
According to the definition of two's complement, the negative of Y
is obtained by inverting every bit of Y and adding one.
Figure 2.6 Two's Complement Circuit
To obtain the highest speed, carry skip adders combined with
carry select adders are adopted.
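The block's invert-and-add-one function can be sketched in one line; the 8-bit width matches the multiplicand width used in this chapter.

```python
# Sketch of the two's complement block: invert every bit of Y and
# add one, within a fixed n-bit width (8 bits assumed here).

def twos_complement(y, n=8):
    """Return -y as an n-bit two's-complement bit pattern."""
    return ((~y) + 1) & ((1 << n) - 1)   # invert all bits, add one

print(format(twos_complement(0b00100010), '08b'))  # -34 as 8 bits
```

Note that applying the operation twice returns the original value, which is a convenient self-check for the hardware block.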
2.12 3-BIT ENCODER BLOCK

The 3-bit encoder gives four code words, S1, S2, S3 and S4,
simultaneously, based on the table. Sk (k = 1, 2, 3, 4) are used to
determine Pk. The table shows how to determine Sk and Pk by looking
up three consecutive bits of the multiplier X.
Table 2.2 Encodes [30]
Figure 2.7 shows the circuit for determining Sk [30]
Figure 2.7 Part of 3-bit encoder circuit
2.13 THE PARTIAL PRODUCTS SELECTOR
The circuit shown in the figure is a basic cell used to build the
partial product select circuits. This basic cell is used to
determine the ith bit of the kth partial product (before shifting).
In the figure, b1i, b2i, b3i and b4i are the ith bits of Y, 2Y, −Y
and −2Y, and the output is the ith bit of the kth partial product
(before shifting to the right).
Figure 2.8 Circuit to determine ith bit of the kth partial
product
2.14 CARRY SAVE ADDERS ARRAY BLOCK

The figure is a diagram of the two carry save adder (CSA) arrays
used to add four binary numbers. The carry save adder array CSA1 is
used to add three binary numbers. There is no carry propagation on
carry-in. The carry of the ith bit is saved as the value of C[i],
and the sum of the ith bit is saved as the value of S[i].
Figure 2.9 Carry Save Adders Array Block
The summation of the three numbers A1[MSB:0], A2[MSB:0] and
A3[MSB:0] is equal to the summation of the two numbers C[MSB:0] and
S[MSB:0]. The carry save adder array CSA2 is used to add A4[MSB:0],
C[MSB:0] and S[MSB:0]. So, the summation of four numbers is
transformed into the summation of two numbers quickly by using
carry save adder arrays, with only a two-gate time delay.
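The two-level 4-to-2 reduction described above can be sketched bitwise: each CSA level compresses three numbers into a sum word and a carry word without propagating carries, since every bit position is handled independently.

```python
# Bitwise sketch of one CSA level (a 3:2 compressor per bit column):
# three inputs reduce to a sum word and a carry word, with the carry
# word shifted left one position to keep its weight.

def csa(a1, a2, a3):
    s = a1 ^ a2 ^ a3                                 # per-bit sum
    c = ((a1 & a2) | (a1 & a3) | (a2 & a3)) << 1     # per-bit carry
    return s, c

# Two CSA levels turn four numbers into two, as in Figure 2.9.
s1, c1 = csa(9, 14, 25)        # CSA1: three numbers -> two
s2, c2 = csa(41, s1, c1)       # CSA2: fold in the fourth number
print(s2 + c2, 9 + 14 + 25 + 41)
```

The invariant s + c = a1 + a2 + a3 at every level is exactly why the final pair can be handed to a single carry-propagate adder.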
2.15 RIPPLE CARRY ADDERS

The figure shows a 6-bit ripple carry adder consisting of six
1-bit full adders.
Figure 2.10 Ripple Carry Adders
2.16 CARRY SELECT ADDERS
The figure shows the carry select adder array. There are two
identical 6-bit ripple carry adder arrays. For one ripple carry
adder array, the carry-in is '0'; for the other, the carry-in is
'1'. Carry C[9] is used to determine which set of bits forms the
most significant 6 bits, P[15:10].
Figure 2.11 Carry select adders array
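The carry-select idea above can be sketched behaviorally: precompute the upper sum for both possible carry-ins, then let the real carry out of the lower part act as the multiplexer select. The 6-bit width matches the P[15:10] slice described in the text.

```python
# Behavioral sketch of a carry select stage: compute the upper sum
# twice (carry-in 0 and carry-in 1), then let the actual lower-part
# carry choose between the two precomputed results.

def carry_select_add(a_hi, b_hi, carry_in, n=6):
    mask = (1 << n) - 1
    sum0 = (a_hi + b_hi) & mask          # precomputed with carry-in 0
    sum1 = (a_hi + b_hi + 1) & mask      # precomputed with carry-in 1
    return sum1 if carry_in else sum0    # mux selected by C[9]

print(carry_select_add(0b001010, 0b000101, 1))
```

Because both candidate sums are ready before the lower carry arrives, the stage delay is just one multiplexer rather than a full ripple chain.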
CHAPTER 3
ADDERS

In electronics, an adder is a digital circuit that performs
addition of numbers. In modern computers, adders reside in the
arithmetic logic unit (ALU), where other operations are also
performed. Although adders can be constructed for many numerical
representations, such as binary-coded decimal or excess-3, the most
common adders operate on binary numbers. In cases where two's
complement is being used to represent negative numbers, it is
trivial to modify an adder into an adder-subtractor.
Addition is the most common and most often used arithmetic
operation in microprocessors, digital signal processors, and
especially digital computers. It also serves as a building block
for synthesizing all other arithmetic operations. Therefore, for
the efficient implementation of an arithmetic unit, the binary
adder structure becomes a very critical hardware unit.
In any book on computer arithmetic, one finds that there exists a
large number of different circuit architectures with different
performance characteristics that are widely used in practice.
Although much research dealing with binary adder structures has
been done, studies based on their comparative performance analysis
are few.
In this project, qualitative evaluations of the classified
binary adder architectures are given. Among the many adders, we
wrote VHDL (VHSIC Hardware Description Language) code for the
ripple-carry, carry-select and carry-look-ahead adders to emphasize
the common performance properties belonging to their classes. In
the following section, we give a brief description of the studied
adder architectures.
The first class consists of the very slow ripple-carry adder
with the smallest area. In the second class, the carry-skip and
carry-select adders with multiple levels have small area
requirements and shortened computation times. In the third class,
the carry-look-ahead adder, and in the fourth class, the parallel
prefix adder, represent the fastest addition schemes with the
largest area complexities.
Types of adders
For single-bit adders, there are two general types. A half adder
has two inputs, generally labeled A and B, and two outputs, the sum
S and carry C. S is the two-input XOR of A and B, and C is the AND
of A and B. Essentially, the output of a half adder is the sum of
two one-bit numbers, with C being the more significant of the two
outputs. The second type of single-bit adder is the full adder. The
full adder takes into account a carry input, so that multiple
adders can be used to add larger numbers. To remove ambiguity
between the input and output carry lines, the carry-in is labeled
Ci or Cin while the carry-out is labeled Co or Cout.
3.1 HALF ADDER

Figure 3.1 Half adder logic diagram

A half adder is a logical circuit that performs an addition
operation on two binary digits. The half adder produces a sum and a
carry value, which are both binary digits.
Following is the logic table for a half adder:
Inputs Outputs
A B S C
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1
Table 3.1 Half Adder Truth Table
3.2 FULL ADDER
Figure 3.2 Full adder logic diagram
Inputs: {A, B, Carry In}
Outputs: {Sum, Carry Out}
Figure 3.3 Schematic symbol for a 1-bit full adder
A full adder is a logical circuit that performs an addition operation on three binary digits. The full adder produces a sum and a carry value, which are both binary digits. It can be combined with other full adders (see below) or work on its own.
Inputs Outputs
A B Ci Co S
0 0 0 0 0
0 0 1 0 1
0 1 0 0 1
0 1 1 1 0
1 0 0 0 1
1 0 1 1 0
1 1 0 1 0
1 1 1 1 1
Table 3.2 Full Adder Truth Table
Note that the final OR gate before the carry-out output may be replaced by an XOR gate without altering the resulting logic. This is because the only discrepancy between OR and XOR gates occurs when both inputs are 1, and for the adder shown here one can check that this is never possible. Using only two types of gates is convenient if one desires to implement the adder directly using common IC chips. A full adder can be constructed from two half adders by connecting A and B to the inputs of one half adder, connecting the sum from that to an input of the second half adder, connecting Ci to the other input, and OR-ing the two carry outputs. Equivalently, S could be made the three-bit XOR of A, B, and Ci, and Co could be made the three-bit majority function of A, B, and Ci. The output of the full adder is the two-bit arithmetic sum of three one-bit numbers.
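The construction described above can be sketched as an illustrative software model (not the project's VHDL): a full adder built from two half adders with the two carry outputs OR-ed together, then chained into a ripple-carry adder.

```python
# Illustrative software model of the construction described above: a full
# adder from two half adders (carry outputs OR-ed), then a ripple chain.

def half_adder(a, b):
    """S = A XOR B, C = A AND B."""
    return a ^ b, a & b

def full_adder(a, b, cin):
    s1, c1 = half_adder(a, b)     # first half adder on A and B
    s, c2 = half_adder(s1, cin)   # second half adder on partial sum and Ci
    return s, c1 | c2             # OR the two carry outputs

def ripple_add(a_bits, b_bits):
    """Add two little-endian bit lists by rippling the carry."""
    carry, result = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        result.append(s)
    return result, carry
```

Because c1 and c2 can never both be 1 (the first half adder only generates a carry when A = B = 1, in which case its sum bit is 0 and the second half adder cannot carry), the final OR here could equally be an XOR, as noted in the text.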
CHAPTER 4
FILTERS
4.1 FIR FILTER
Digital filters can be divided into two categories: finite impulse response (FIR) filters and infinite impulse response (IIR) filters. Although FIR filters in general require more taps than IIR filters to obtain similar frequency characteristics, FIR filters are widely used because they have linear phase characteristics, guarantee stability, and are easy to implement with multipliers, adders and delay elements. The number of taps in digital filters varies according to the application. In commercial filter chips with a fixed number of taps, zero coefficients are loaded into the registers of unused taps, so unnecessary calculations have to be performed. To alleviate this problem, FIR filter chips providing variable-length taps have been widely used in many application fields. However, these FIR filter chips use memory, an address generation unit, and a modulo unit to access memory in a circular manner. This work proposes two special features, called a data reuse structure and a recurrent-coefficient scheme, to provide variable-length taps efficiently. Since the proposed architecture only requires several MUXes, registers, and a feedback loop, the number of gates can be reduced by over 20% compared with existing chips.
Fig 4.1 FIR filter block diagram
In general, FIR filtering is described by a simple convolution operation, as expressed in equation (4.1):

y[n] = Σ_{k=0}^{N-1} h[k] x[n-k]    (4.1)

where x[n], y[n], and h[k] represent the data input, the filtering output, and a coefficient, respectively, and N is the filter order. The equation for a FIR filter using the bit-serial algorithm can be represented as (4.2), where hj is the jth bit of the coefficient, N is the filter order, and M is the coefficient word length.
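The convolution of equation (4.1) can be sketched as an illustrative software model (a word-parallel reference, not the bit-serial hardware scheme; the function name is our own):

```python
# Illustrative software model of equation (4.1): direct-form FIR
# convolution y[n] = sum over k = 0..N-1 of h[k] * x[n-k].

def fir_filter(h, x):
    """Filter the input sequence x with coefficients h (N = len(h) taps)."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, coeff in enumerate(h):
            if n - k >= 0:          # x[n-k] is taken as zero before the input starts
                acc += coeff * x[n - k]
        y.append(acc)
    return y

# 3-tap moving-average filter applied to a unit-step input: the output
# ramps up over the first N-1 samples and then settles at 1.
y = fir_filter([1/3, 1/3, 1/3], [1.0, 1.0, 1.0, 1.0, 1.0])
```

Each output sample requires N multiplications and N-1 additions, which is exactly the multiplier/adder/delay-element structure of the block diagram in Fig. 4.1.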
4.2 TRANSVERSAL FILTER
An N-tap transversal filter was assumed as the basis for this adaptive filter. The value of N is determined by practical considerations. An FIR filter was chosen because of its stability. The use of the transversal structure allows relatively straightforward construction of the filter, as shown in Fig. 4.2.
Figure 4.2 Transversal filter
As the input, coefficients and output of the filter are all assumed to be complex valued, the natural choice for the property measurement is the modulus, or instantaneous amplitude. If y(k) is the complex-valued filter output, then |y(k)| denotes its amplitude. The convergence error p(k) can be defined as follows:

p(k) = A - |y(k)|    (4)

where A is the amplitude in the absence of signal degradations. The error p(k) should be zero when the envelope has the proper value, and non-zero otherwise. The error carries sign information to indicate in which direction the envelope is in error.
The adaptive algorithm is defined by specifying a
performance/cost/fitness function based on the error p (k) and then
developing a procedure that adjusts the filter impulse response so
as to minimize or maximize that performance function.
y(k) = Σ_{i=0}^{N-1} w_k(i) x(k-i)    (5)
The gradient search algorithm was selected to simplify the
filter design. The filter coefficient update equation is given
by:
w(k+1) = w(k) + μ e(k) x(k)    (6)

where x(k) is the filter input at sample k, e(k) = p(k)·y(k) is the error term at sample k, and μ is the step size for updating the weight values.
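The gradient-search weight update can be sketched as an illustrative software model. For simplicity this sketch uses real-valued signals and a standard LMS variant driven by a hypothetical training target d, rather than the envelope-error criterion p(k) defined above; the target system h and all names are our own illustration:

```python
# Illustrative software model of the coefficient update in equation (6):
# w(k+1) = w(k) + mu * e(k) * x(k). Real-valued LMS with a training
# target d (a simplification of the envelope-based error in the text).
import random

def lms_step(w, x_window, d, mu):
    """One adaptation step: filter, form the error, update the weights."""
    y = sum(wi * xi for wi, xi in zip(w, x_window))  # y(k) = sum w(i) x(k-i)
    e = d - y                                        # error term at sample k
    return [wi + mu * e * xi for wi, xi in zip(w, x_window)], e

# Identify a hypothetical 2-tap system h = [0.5, -0.25] from input/desired
# pairs; the weights should converge toward h.
random.seed(0)
h = [0.5, -0.25]
w = [0.0, 0.0]
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
for k in range(1, len(x)):
    window = [x[k], x[k - 1]]                        # x(k), x(k-1)
    d = h[0] * window[0] + h[1] * window[1]          # desired response
    w, e = lms_step(w, window, d, mu=0.1)
```

The step size mu trades convergence speed against stability, which is the usual design choice when the gradient-search algorithm is "selected to simplify the filter design" as the text puts it.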
CHAPTER 5
VHDL
Many DSP applications demand high throughput and real-time
response, performance constraints that often dictate unique
architectures with high levels of concurrency. DSP designers need
the capability to manipulate and evaluate complex algorithms to
extract the necessary level of concurrency. Performance constraints
can also be addressed by applying alternative technologies. A
change at the implementation level of design by the insertion of a
new technology can often make viable an existing marginal algorithm
or architecture.
The VHDL language supports these modeling needs at the algorithm
or behavioral level,
and at the implementation or structural level. It provides a
versatile set of description facilities to model DSP circuits from
the system level to the gate level. Recently, we have also noticed
efforts to include circuit-level modeling in VHDL. At the system
level we can build behavioral models to describe algorithms and
architectures. We would use concurrent processes with constructs
common to many high-level languages, such as if, case, loop, wait,
and assert statements. VHDL also includes user-defined types,
functions, procedures, and packages. In many respects VHDL is a
very powerful, high-level, concurrent programming language. At the
implementation level we can build structural models using component
instantiation statements that connect and invoke subcomponents. The
VHDL generate statement provides ease of block replication and
control. A dataflow level of description offers a combination of
the behavioral and structural levels of description. VHDL lets us
use all three levels to describe a single component. Most
importantly, the standardization of VHDL has spurred the
development of model libraries and design and development tools at
every level of abstraction. VHDL, as a consensus description
language and design environment, offers design tool portability,
easy technical exchange, and technology insertion.
VHDL: The language
An entity declaration, or entity, combined with architecture or
body constitutes a VHDL model. VHDL calls the entity-architecture
pair a design entity. By describing alternative architectures for
an entity, we can configure a VHDL model for a specific level of
investigation. The entity contains the interface description common
to the alternative architectures. It communicates with other
entities and the environment through ports and generics. Generic
information particularizes an entity by specifying environment
constants such as register size or delay value. For example,
entity A is
  generic (delay : time);
  port (x, y : in real; z : out real);
end A;
The architecture contains declarative and statement sections.
Declarations form the region before the reserved word begin and can
declare local elements such as signals and components. Statements
appear after begin and can contain concurrent statements. For
instance,
architecture B of A is
  component M
    port (j : in real; k : out real);
  end component;
  signal a, b, c : real := 0.0;
begin
  "concurrent statements"
end B;
The variety of concurrent statement types gives VHDL the
descriptive power to create and combine models at the structural,
dataflow, and behavioral levels into one simulation model. The
structural type of description makes use of component instantiation
statements to invoke models described elsewhere. After declaring
components, we use them in the component instantiation statement,
assigning ports to local signals or other ports and giving values
to generics. For instance:

invert : M port map (j => a, k => c);

We can then bind the components to other design entities through
configuration specifications in VHDL's architecture declarative
section or through separate configuration declarations. The
dataflow style makes wide use of a number of types of concurrent
signal assignment statements, which associate a target signal with
an expression and a delay. The list of signals appearing in the
expression is the sensitivity list; the expression must be
evaluated for any change on any of these signals. The target
signals obtain new values after the delay specified in the signal
assignment statement. If no delay is specified, the signal
assignment occurs during the next simulation cycle:
c <= a + b;