CHAPTER 1
INTRODUCTION

1.1 INTRODUCTION
Multipliers are key components of many high-performance systems
such as FIR filters, microprocessors, and digital signal processors.
A system's performance is generally determined by the performance of
the multiplier, because the multiplier is generally the slowest
element in the system. Furthermore, it is generally the most area
consuming. Hence, optimizing the speed and area of the multiplier is
a major design issue. However, area and speed are usually conflicting
constraints, so improving speed mostly results in larger areas. As a
result, a whole spectrum of multipliers with different area-speed
trade-offs has been designed, with fully parallel multipliers at one
end of the spectrum and fully serial multipliers at the other. In
between are digit-serial multipliers, which operate on single digits
consisting of several bits. These multipliers have moderate
performance in both speed and area. However, existing digit-serial
multipliers have been plagued by complicated switching systems
and/or irregularities in design. Radix 2^n multipliers, which
operate on digits in a parallel fashion instead of bits, bring the
pipelining to the digit level and avoid most of the above problems.
They were introduced by M. K. Ibrahim in 1993. These structures are
iterative and modular. The pipelining done at the digit level
brings the benefit of constant operation speed irrespective of the
size of the multiplier. The clock speed is determined only by the
digit size, which is fixed before the design is implemented.
1.2 MOTIVATION

As the scale of integration keeps growing, more and more
sophisticated signal processing systems are being implemented on a
VLSI chip. These signal processing applications not only demand great
computation capacity but also consume considerable amounts of energy.
While performance and area remain the two major design goals, power
consumption has become a critical concern in today's VLSI system
design. The need for
low-power VLSI systems arises from two main forces. First, with the
steady growth of operating frequency and processing capacity per
chip, large current has to be delivered and the heat due to large
power consumption must be removed by proper cooling techniques.
Second, battery life in portable electronic devices is limited. Low
power design directly leads to prolonged operation time in these
portable devices.
Multiplication is a fundamental operation in most signal
processing algorithms. Multipliers have large area, long latency,
and consume considerable power. Therefore, low-power multiplier
design has been an important part of low-power VLSI system design.
The primary objective is power reduction with small area and delay
overhead. By using the radix-4 Booth algorithm, we design a
multiplier with low power and smaller area.
1.3 POWER OPTIMIZATION
Power refers to the number of joules dissipated per unit time,
whereas energy is a measure of the total number of joules dissipated
by a circuit. Strictly speaking, low-power design is a different goal
from low-energy design, although they are related. Power is a problem
primarily when cooling is a concern.
The maximum power at any time, peak power, is often used for power
and ground wiring design, signal noise margin and reliability
analysis. Energy per operation or task is a better metric of the
energy efficiency of a system, especially in the domain of
maximizing battery lifetime.
In digital CMOS design, the well-known power-delay product is
commonly used to assess the merits of designs. In a sense, this is
a misnomer, as power × delay = (energy/delay) × delay = energy, which
implies delay is irrelevant. Instead, the term energy-delay product
should be used, since it involves two independent measures of circuit
behavior. Therefore, when power-delay products are used as a
comparison metric, different schemes should be measured at the same
frequency to ensure that it is equivalent to an energy-delay product
comparison.
There are two major sources of power dissipation in digital CMOS
circuits: dynamic power and static power. Dynamic power is related
to circuit switching activities, or the changing events of logic
states, including power dissipation due to capacitance charging and
discharging, and dissipation due to short-circuit current (SCC). In
CMOS logic, unintended leakage current, either reverse-biased
PN-junction current or subthreshold channel conduction current, is
the only source of static current. However, occasional deviations
from the strict CMOS logic style, such as pseudo-NMOS logic, can
cause intended static current. The total power consumption is
summarized in the following equations:
Ptotal = Pdynamic + Pstatic = Pcap + Pscc + Pstatic (1.1)
Pcap = α0→1 · fclk · CL · VDD² (1.2)
Pscc = α0→1 · fclk · Ipeak · (tr + tf)/2 · VDD (1.3)
Pstatic = Istatic · VDD (1.4)
Pcap in Equation 1.2 represents the dynamic power due to
capacitance charging and discharging of a circuit node, where CL is
the loading capacitance, fclk is the clock frequency, and α0→1 is the
0→1 transition probability in one clock period. In most cases, the
voltage swing Vswing is the same as the supply voltage VDD;
otherwise, Vswing should replace VDD in this equation. Pscc is a
first-order average power consumption due to short-circuit current.
The peak current, Ipeak, is determined by the saturation current of
the devices and is hence directly proportional to the sizes of the
transistors. tr and tf are rising time and falling time of
short-circuit current, respectively. The static power Pstatic is
primarily determined by fabrication technology considerations and is
usually several orders of magnitude smaller than the dynamic power.
The leakage power problem mainly appears in very low frequency
circuits or ones with sleep modes where dynamic activities are
suppressed. The dominant term in a well-designed circuit during its
active state is the dynamic term due to switching activity on loading
capacitance, and thus low-power design often becomes the task of
minimizing α0→1, CL, VDD and fclk, while retaining the required
functionality. In the future, static power will become increasingly
important as the supply voltage keeps scaling. To avoid performance
degradation, the threshold voltage Vt is lowered accordingly, and
subthreshold leakage current increases exponentially. Leakage power
reduction depends heavily on circuit and technology techniques such
as dual-Vt partitioning and multi-threshold CMOS. In this work, we
will not consider leakage power reduction.
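As a quick numeric illustration of Equation 1.2, the sketch below evaluates the switching power of a single node. All parameter values (activity factor, clock rate, load, supply) are illustrative assumptions, not measurements from any particular process.

```python
# Sketch: first-order dynamic power from Equation 1.2,
# Pcap = alpha * fclk * CL * VDD^2.

def dynamic_power(alpha, f_clk, c_load, vdd):
    """Switching power of one node in watts."""
    return alpha * f_clk * c_load * vdd ** 2

# Hypothetical node: 25% switching activity, 100 MHz clock,
# 50 fF load, 1.2 V supply.
p = dynamic_power(0.25, 100e6, 50e-15, 1.2)
print(f"{p * 1e6:.2f} uW")
```

The quadratic dependence on VDD is why supply-voltage scaling is the single most effective dynamic-power lever mentioned above.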
Power optimization of digital systems has been studied at
different abstraction levels, from the lowest technology level to the
highest system level. At the technology level, power consumption
is reduced by improvements in the fabrication process such as smaller
feature sizes, very low voltages, copper interconnects, and
insulators with low dielectric constants. With fabrication support
for multiple supply voltages, lower voltages can be applied to
non-critical system blocks. At the layout level, placement and
routing are adjusted to reduce wire capacitance and signal delay
imbalances. At the circuit level, power reduction is achieved by
transistor sizing, transistor network restructuring and
reorganization, and different circuit logic styles.

1.4 LOW POWER MULTIPLIER DESIGN
Multiplication consists of three steps: generation of partial
products or PPs (PPG), reduction of partial products (PPR), and
final carry-propagate addition (CPA). In general, there are
sequential and combinational multiplier implementations. We only
consider combinational multipliers in this work because the scale of
integration is now large enough to accommodate parallel multiplier
implementations in digital VLSI systems. Different multiplication
algorithms vary in their approaches to PPG, PPR, and CPA. For PPG,
radix-2 digit-vector multiplication is the simplest form because
the digit-vector multiplication is produced by a set of AND gates.
To reduce the number of PPs, and consequently reduce the area/delay
of PP reduction, one operand is usually recoded into high-radix
digit sets. The most popular is the radix-4 digit set {−2, −1, 0,
1, 2}. For PPR, two alternatives exist: reduction by rows,
performed by an array of adders, and reduction by columns,
performed by an array of counters. In reduction by rows, there are
two extreme classes: linear array and tree array. A linear array has
a delay of O(n), while both tree arrays and column reduction have
a delay of O(log n), where n is the number of PPs. The final CPA
requires a fast adder scheme because it is on the critical path. In
some cases, the final CPA is postponed if it is advantageous to keep
redundant results from PPG for further arithmetic operations.
The difficulty of low-power multiplier design lies in three
aspects. First, the multiplier area is quadratically related to the
operand precision. Second, parallel multipliers have many logic
levels that introduce spurious transitions or glitches. Third, the
structure of parallel multipliers could be very complex in order to
achieve high speed, which deteriorates the efficiency of layout and
circuit level optimization. As a fundamental arithmetic operation,
multiplication has many algorithm-level and bit-level computation
features in which it differs from random logic. These features have
not been considered well in low-level power optimization. It is
also difficult to consider input data characteristics at low
levels. Therefore, it is desirable to develop algorithm- and
architecture-level power optimization techniques that consider
multiplication's arithmetic features and operand characteristics.
There has been some work on low-power multipliers at the
algorithm and architecture level. As smaller area usually leads to
less switching capacitance, published area results can provide a
rough estimate of the relative power consumption of different
multiplication schemes. Callaway studied the power/delay/area
characteristics of four classical multipliers. Angel proposed
low-power sign extension schemes and self-timed design with
bypassing logic for zero PPs in radix-4 multipliers. Cherkauer and
Friedman proposed a hybrid radix-4/radix-8 low-power signed
multiplier architecture. For multiplication data with large dynamic
range, several approaches have been proposed. Architecture-level
signal gating techniques have been studied. A mixed number
representation for radix-4 two's-complement multiplication has been
proposed. Radix-4 recoding has been applied to the constant input
instead of the dynamic input in low-power multiplication for FIR
filters. Multiplication has been separated into higher and lower
parts, with the results of the higher part stored in a cache in
order to reduce redundant computation. Two techniques have been
proposed for data with large dynamic range: a
most-significant-digit-first carry-save array for PP reduction, and
a dynamically generated reduced two's-complement representation.
Finally, the precisions of the two input data can be compared at
runtime and the two operands exchanged if necessary, so that radix-4
recoding is applied to the operand with smaller precision in order
to generate more zero PPs.

CHAPTER 2
MULTIPLIERS
2.1 MULTIPLIERS: OVERVIEW
Multipliers can be classified as hardware multipliers and
software multipliers. In older digital systems, there was no
hardware multiplier, and multiplication was implemented with a
microprogram. The microprogram needed many microinstruction cycles
to complete the multiplication process, which made microprogrammed
multipliers slow. For high-speed digital systems, hardware
multipliers are usually used. In modern microprocessors and ASIC
processors, most arithmetic logic units (ALUs) contain a hardware
multiplier. High-speed hardware multipliers have been of interest
for some time. More sophisticated approaches to multiplier design
can be implemented today due to the increased density of integrated
circuits.
Hardware multipliers can be divided into two main categories:
sequential and parallel array multipliers. For a sequential
multiplier, multiplication of the multiplier and multiplicand is the
operation of repeatedly adding the multiplicand and shifting. The
advantage of a sequential multiplier is that the circuit is simple
and occupies less chip area; the disadvantage is that it is slower.
For parallel array multipliers, the summation of partial products
is carried out using a linear adder array. Because the operation is
in parallel, the speed is much faster than that of a sequential
multiplier.
There are a number of algorithms used for multiplication. The
3-bit recoding algorithm is one of the best known. It is used in
the design of many kinds of hardware and software multipliers.
This algorithm is used to reduce the number of partial product rows
by about half, so the speed of multiplication increases
significantly and the chip area is reduced. The 3-bit recoding
algorithm is also called the Modified Booth's Algorithm and was
developed from Booth's algorithm. A number of other multiple-bit
recoding algorithms for multiplication have been developed.
Recently, a parallel hardware multiplier based on a 5-bit recoding
algorithm has been proposed. From the view of optimization, the
5-bit recoding algorithm is preferred to a 4-bit recoding
algorithm. While more partial product rows can be reduced with the
5-bit recoding algorithm than with a 3-bit recoding algorithm, more
complicated circuits are required to determine the odd multiples of
the multiplicand. With the potential of improving both performance
and the hardware requirements, the 5-bit recoding algorithm may be
good for a high-bit multiplier, but not for a low-bit multiplier
such as an 8 x 8 bit multiplier.
Using the 5-bit recoding algorithm reduces the number of partial
product rows to two. The partial products are selected from 17
different multiples of the multiplicand Y (0, ±Y, ±2Y, ±3Y, ±4Y,
±5Y, ±6Y, ±7Y, ±8Y). Using the 3-bit recoding algorithm reduces the
number of partial product rows to four. The partial products are
selected from 5 different multiples of the multiplicand Y (0, ±Y,
±2Y). The addition of four partial products can be changed to the
addition of two binary numbers by using two rows of carry save adder
arrays (CSA), with only a two-gate delay introduced. The even
multiples of the multiplicand Y can be implemented using a hardwired
shift. For the 3-bit recoding algorithm, only the two's complement
of Y needs to be determined. For the 5-bit recoding algorithm,
additional high-speed adders are required to determine the odd
multiples of Y. These high-speed adders require more circuitry to
implement and suffer more time delay. For higher-bit multipliers,
the advantage of the 5-bit recoding algorithm can be seen. For
example, for a 32 x 32 bit multiplier, the 5-bit recoding algorithm
reduces the number of partial product rows to 8, while the 3-bit
recoding algorithm reduces the number of partial product rows to 16.
The reduction in the number of partial product rows is apparent.

2.2 BINARY MULTIPLIER

A binary multiplier is an electronic hardware device used in digital
electronics, a computer, or another electronic device to perform
rapid multiplication of two numbers in binary representation. It is
built using binary adders.
The rules for binary multiplication can be stated as follows:
1. If the multiplier digit is a 1, the multiplicand is simply
copied down and represents the product.
2. If the multiplier digit is a 0, the product is also 0.
For designing a multiplier circuit, we should have circuitry to
provide or do the following:
1. It should be capable of identifying whether a bit is 0 or 1.
2. It should be capable of shifting partial products left.
3. It should be able to add all the partial products to give the
product as the sum of the partial products.
4. It should examine the sign bits. If they are alike, the sign
of the product will be positive; if the sign bits are opposite, the
product will be negative. The sign bit of the product determined by
the above criteria should be displayed along with the product.
From the above discussion, we observe that it is not necessary to
wait until all the partial products have been formed before summing
them. In fact, the addition of partial products can be carried out
as soon as each partial product is formed.
2.3 DIRECT MULTIPLICATION OF TWO UNSIGNED BINARY NUMBERS
The process of digital multiplication is based on addition, and
many of the techniques useful in addition carry over to
multiplication. The general scheme for unsigned multiplication is
shown in Figure 2.1.
Figure 2.1 Digital multiplication of unsigned four bit binary
numbers
For the multiplication of an n-bit multiplier and an m-bit
multiplicand, the product is represented with an (n + m)-bit binary
number. To complete the multiplication:
(1) the partial products can be added sequentially, or
(2) the partial products can be added using a parallel adder
array.
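The AND-gate partial-product scheme of Figure 2.1 can be sketched behaviorally as follows; the 4-bit width is chosen only to match the figure.

```python
# Sketch of unsigned multiplication as in Figure 2.1: each partial
# product row is the multiplicand ANDed with one multiplier bit,
# shifted left by that bit's position, and all rows are summed.

def unsigned_multiply(x, y, n=4):
    """Multiply two n-bit unsigned numbers by summing partial products."""
    product = 0
    for i in range(n):
        if (x >> i) & 1:          # multiplier bit is 1: copy multiplicand
            product += y << i     # shifted partial-product row
        # multiplier bit is 0: the partial-product row is all zeros
    return product

print(unsigned_multiply(13, 11))  # 4-bit operands, 8-bit product
```

Note that for n-bit by m-bit operands the accumulated product always fits in n + m bits, as stated above.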
2.4 MULTIPLY ACCUMULATE CIRCUITS
Multiplication followed by accumulation is an operation in many
digital systems, particularly highly interconnected ones like
digital filters, neural networks, and data quantisers. One typical
MAC (multiply-accumulate) architecture is illustrated in the figure.
It consists of multiplying two values, then adding the result to the
previously accumulated value, which must then be stored back in the
register for future accumulations. Another feature of a MAC circuit
is that it must check for overflow, which might happen when the
number of MAC operations is large.
This design can be done using components, because we have already
designed each of the units shown in the figure. However, since it is
a relatively simple circuit, it can also be designed directly. In
any case, the MAC circuit as a whole can be used as a component in
applications like digital filters and neural networks.
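The multiply, accumulate, store-back and overflow-check sequence described above can be captured in a few lines. The 16-bit accumulator width below is an assumption for illustration, not taken from the figure.

```python
# Behavioral sketch of a MAC loop: multiply two values, add the
# result to the stored accumulator, and flag overflow against a
# fixed accumulator width (16 bits assumed here).

class MAC:
    def __init__(self, width=16):
        self.acc = 0
        self.limit = 1 << width       # accumulator capacity
        self.overflow = False

    def step(self, a, b):
        self.acc += a * b             # multiply, then accumulate
        if self.acc >= self.limit:    # overflow check
            self.overflow = True
            self.acc %= self.limit    # wrap, as a fixed register would
        return self.acc

mac = MAC()
for a, b in [(3, 4), (5, 6), (7, 8)]:
    mac.step(a, b)
print(mac.acc, mac.overflow)          # 12 + 30 + 56 = 98, no overflow
```

In an FIR filter, this loop would run once per tap, which is why the overflow check matters when the number of MAC operations is large.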
2.5 SEQUENTIAL MULTIPLIER/ARRAY MULTIPLIER
A sequential multiplier implements multiplication by repeatedly
adding the multiplicand and shifting the partial product. The
advantage of a sequential multiplier is that the circuit is simple
and occupies less chip area; the disadvantage is that it is slower.
A sequential multiplier usually consists of a register, MD, which
holds the multiplicand; a shift register, MR, which initially holds
the multiplier; a shift accumulator, which holds the partial
product; and a shift counter.
Figure 2.2 Flow chart for multiplication process of a sequential
multiplier
Figure 2.2 shows the multiplication process of a sequential
multiplier.
The steps for multiplication are given by:
1. The multiplier and multiplicand are loaded into the registers
MR and MD, respectively, and the accumulator and counter are reset
to zero.
2. The least significant bit of the shift register MR is tested;
if it is 1, the multiplicand Y is added to the partial product.
3. The partial product and multiplier are shifted one place
right and the least significant bit of the multiplier is
discarded.
4. The counter number is increased by one.
5. If the count is equal to the number n of bits in the
multiplier, the multiplication process is complete and the product
is equal to the number held in the accumulator; otherwise, the
operation returns to step 2.
A sequential multiplier is a simple circuit and occupies less
chip area, but it is slow. To increase the speed of multiplication,
parallel adder arrays are used to add partial products.
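The five steps of Figure 2.2 can be sketched directly; the conditional add targets the upper half of the double-width accumulator, so the right shift in step 3 aligns everything automatically.

```python
# Behavioral sketch of the shift-and-add sequence in Figure 2.2:
# test the LSB of MR, conditionally add MD into the accumulator,
# shift right, and repeat n times.

def sequential_multiply(md, mr, n=8):
    """n-bit unsigned sequential (shift-and-add) multiplication."""
    acc = 0                       # shift accumulator, 2n bits wide
    for _ in range(n):            # the shift counter counts to n
        if mr & 1:                # step 2: test LSB of MR
            acc += md << n        # add multiplicand into the upper half
        acc >>= 1                 # step 3: shift partial product right...
        mr >>= 1                  # ...and discard the multiplier LSB
    return acc

print(sequential_multiply(23, 19))
```

Each iteration models one clock cycle, which is why an n-bit sequential multiplier needs n cycles while a parallel array needs only combinational delay.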
2.6 PARALLEL MULTIPLIER
In parallel array multipliers, the summation of partial products
is carried out using a linear adder array. Because the operation is
in parallel, the speed is much faster than that of a sequential
multiplier.
There are a number of algorithms used for multiplication; two of
them are the radix-2 and radix-4 Booth algorithms. The radix-4
algorithm is also called the modified Booth algorithm.
2.7 ARCHITECTURE OF RADIX 2^n MULTIPLIER
The architecture of a radix 2^n multiplier is given in the
figure. This block diagram shows the multiplication of two numbers
with four digits each. These numbers are denoted as V and U, while
the digit size was chosen as four bits. The reason for this will
become apparent in the following sections. Each circle in the
figure corresponds to a radix cell, which is the heart of the
design. Every radix cell has four digit inputs and two digit
outputs. The input digits are also fed through the corresponding
cells. The dots in the figure represent latches for pipelining.
Every dot consists of four latches. The ellipses represent adders
which are included to calculate the higher-order bits. They do not
fit the regularity of the design, as they are used to terminate the
design at the boundary. The outputs are again in terms of four-bit
digits and are shown by the W's. The labels denote the clock period
at which the data appear.
Figure 2.3 Radix 2^n multiplier architecture
2.8 BOOTH MULTIPLICATION ALGORITHM
Booth's multiplication algorithm will multiply two signed binary
numbers in two's complement notation.
Procedure:
If x is the number of bits of the multiplicand and y is the
number of bits of the multiplier:
Draw a grid of three lines, each with squares for x + y + 1
bits. Label the lines respectively A (add), S (subtract), and P
(product).
In two's complement notation, fill the first x bits of each line
with:
A: the multiplicand
S: the negative of the multiplicand
P: zeroes
Fill the next y bits of each line with:
A: zeroes
S: zeroes
P: the multiplier
Fill the last bit of each line with a zero.
Do both of these steps y times:
1. If the last two bits in the product are...
a) 00 or 11: do nothing.
b) 01: P = P + A. Ignore any overflow.
c) 10: P = P + S. Ignore any overflow.
2. Arithmetically shift the product right one position.
Drop the last bit from the product for the final result.
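The A/S/P procedure above can be sketched with plain integers standing in for the three bit lines; the masking and the manual arithmetic right shift mimic the fixed (x + y + 1)-bit grid.

```python
# Sketch of the Booth A/S/P procedure for two's-complement operands.
# x and y are the bit counts of multiplicand m and multiplier r.

def booth_multiply(m, r, x, y):
    total = x + y + 1
    mask = (1 << total) - 1
    A = (m & ((1 << x) - 1)) << (y + 1)      # multiplicand in the high bits
    S = ((-m) & ((1 << x) - 1)) << (y + 1)   # its two's-complement negative
    P = (r & ((1 << y) - 1)) << 1            # multiplier, appended 0 at bit 0
    for _ in range(y):
        last_two = P & 0b11
        if last_two == 0b01:
            P = (P + A) & mask               # 01: P = P + A, ignore overflow
        elif last_two == 0b10:
            P = (P + S) & mask               # 10: P = P + S, ignore overflow
        # arithmetic right shift within the (x + y + 1)-bit grid
        sign = P >> (total - 1)
        P = (P >> 1) | (sign << (total - 1))
    P >>= 1                                  # drop the appended last bit
    if P >> (x + y - 1):                     # reinterpret as signed
        P -= 1 << (x + y)
    return P

print(booth_multiply(3, -4, 4, 4))
```

Running it on 3 × (−4) with 4-bit operands returns −12, matching ordinary signed multiplication.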
2.9 BOOTH MULTIPLICATION ALGORITHM FOR RADIX 4
One of the ways of realizing high-speed multipliers is to
enhance parallelism, which helps to decrease the number of
subsequent calculation stages. The original version of the Booth
algorithm (radix-2) had two drawbacks:
(i) the number of add/subtract operations and the number of
shift operations become variable, which is inconvenient when
designing parallel multipliers; and
(ii) the algorithm becomes inefficient when there are isolated
1's.
These problems are overcome by using the radix-4 Booth algorithm.
This algorithm is used to reduce the number of partial product rows
by about half, so the speed of multiplication increases
significantly and it also consumes less power.
The radix-4 Booth algorithm scans strings of three bits, as given
below:
1) Extend the sign bit one position if necessary to ensure that n
is even.
2) Append a 0 to the right of the LSB of the multiplier.
3) According to the value of each vector, each partial product
will be 0, +y, −y, +2y or −2y.
The negative values of y are made by taking the two's complement,
and in this work carry-look-ahead (CLA) fast adders are used.
Multiplication of y by 2 is done by shifting y one bit to the left.
Thus, in designing an n-bit parallel multiplier, only n/2 partial
products are generated.
Table 2.1 Radix-4 Booth recoding table

Let us see an example demonstrating the whole procedure of the
radix-4 Booth multiplier using a Wallace tree and sign extension
correctors. Take the example of calculating 34 × (−42).
Multiplicand A = 34 = 00100010
Multiplier B = −42 = 11010110 (two's complement form)
A × B = 34 × (−42) = −1428
First of all, the multiplier has to be converted into radix-4
digits, as in the figure below. The first partial product is
determined by the three LSB digits of the multiplier, B1, B0, and
one appended zero. This 3-bit group is 100, which means the
multiplicand A has to be multiplied by −2. To multiply by −2, the
process takes the two's complement of the multiplicand value and
then shifts the result left by one bit. Hence, the first partial
product is 110111100. All of the partial products are nine bits
long.
Next, the second partial product is determined by bits B3, B2,
B1, which indicate multiplication by 2. Multiplying by 2 means the
multiplicand value is shifted left by one bit. So, the second
partial product is 001000100. The third partial product is
determined by bits B5, B4, B3, which indicate multiplication by 1.
So, the third partial product is the multiplicand value itself,
namely 000100010. The fourth partial product is determined by bits
B7, B6, B5, which indicate multiplication by −1. Multiplying by −1
means the multiplicand is converted to its two's complement value.
So, the fourth partial product is 111011110.
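The four digits worked out above (−2, +2, +1, −1) can be cross-checked numerically: weighting each digit's partial product by its 2-bit position and summing should reproduce the expected product.

```python
# Cross-check of the worked example: recode digits of -42 taken from
# the text, form the four shifted partial products, and confirm the
# sum equals 34 * (-42) = -1428.

A = 34
digits = [-2, 2, 1, -1]                  # radix-4 digits of -42, LSB first
partials = [d * A << (2 * i) for i, d in enumerate(digits)]
print(partials, sum(partials))
```

The shift of 2·i bits per row is exactly the "shifted 2(k−1) bits" alignment used when the rows are fed to the Wallace tree.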
The figure below shows the arrangement of all four partial
products to be added using the Wallace tree adder method. The
sign-extension terms E are obtained based on Table 4.2. The way this
sign E is arranged has been shown in the Wallace tree multiplication
method above. The Wallace tree for the example is given below.
Figure 2.4 Method showing how partial products should be added

To prove the output result is correct:
11111101001101100 = 2^0(0) + 2^1(0) + 2^2(1) + 2^3(1) + 2^4(0) +
2^5(1) + 2^6(1) + 2^7(0) + 2^8(0) + 2^9(1) + 2^10(0) + 2^11(−1)
= 4 + 8 + 32 + 64 + 512 − 2048 = −1428

2.10 ARCHITECTURE OF AN 8 X 8 BIT MULTIPLIER
Figure 2.5 shows the architecture of an 8 x 8 bit parallel
multiplier using the 3-bit recoding algorithm. The number of
partial product rows is reduced to four. The two's complement block
is used to determine the two's complement of the multiplicand Y,
which represents the negative of Y. The 3-bit encoder gives four
4-bit codes, S1, S2, S3 and S4, which are used to determine the
partial products Pk.
Figure 2.5 Architecture of an 8 x 8 bit multiplier
The partial product selectors are four identical multiplexers
which are used to determine the partial products Pk (k = 1, 2, 3,
4). The product P of X and Y is:
P = P1 + P2 + P3 + P4
The partial products Pk (k = 1, 2, 3, 4) are shifted 2(k−1) bits to
the right of P1 and need to be sign extended to 16 bits. Carry save
adder arrays (CSA) are used to add multiple inputs and change the
summation of multiple numbers into the summation of two numbers.
There is no carry propagation delay in the carry save adder array,
so the speed is high. Two 16-bit numbers, A[15...0] and B[15...0],
are obtained from the CSA. The 4 least significant bits of B are
zero, so it is only necessary to add the two numbers A[15...4] and
B[15...4]. For the least significant 4 bits, P[3...0] = A[3...0].
A ripple carry adder array (RCA) is used to add A[9...4] and
B[9...4] to obtain P[9...4]. A carry select adder array is used to
add A[15...10] and B[15...10] to obtain P[15...10].
The logic implementations of the different multiplier blocks are
detailed in the following sections. These blocks are based on logic
gates and half and full adders.
2.11 TWO'S COMPLEMENT BLOCK
The two's complement block is used to determine the two's
complement of the multiplicand Y, which represents the negative of
Y. The figure shows the circuit of the two's complement block.
According to the definition of two's complement, the negative of Y
is obtained by inverting every bit of Y and adding one.
Figure 2.6 Two's Complement Circuit
To obtain the highest speed, carry skip adders combined with
carry select adders are adopted.
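The block's invert-and-add-one function can be sketched in one line; the 8-bit width matches the multiplicand width used in this chapter.

```python
# Sketch of the two's complement block: invert every bit of Y and
# add one, within a fixed n-bit width (8 bits assumed here).

def twos_complement(y, n=8):
    """Return -y as an n-bit two's-complement bit pattern."""
    return ((~y) + 1) & ((1 << n) - 1)   # invert all bits, add one

print(format(twos_complement(0b00100010), '08b'))  # -34 as 8 bits
```

Note that applying the operation twice returns the original value, which is a convenient self-check for the hardware block.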
2.12 3-BIT ENCODER BLOCK

The 3-bit encoder gives four code words, S1, S2, S3 and S4,
simultaneously, based on the table. Sk (k = 1, 2, 3, 4) are used to
determine Pk. The table shows how to determine Sk and Pk by looking
up three consecutive bits of the multiplier X.
Table 2.2 Encodes [30]
Figure 2.7 shows the circuit for determining Sk [30]
Figure 2.7 Part of 3-bit encoder circuit
2.13 THE PARTIAL PRODUCTS SELECTOR
The circuit shown in the figure is a basic cell used to build the
partial product select circuits. This basic cell is used to
determine the ith bit of the kth partial product (before shifting).
In the figure, b1i, b2i, b3i and b4i are the ith bits of Y, 2Y, −Y
and −2Y, and the output is the ith bit of the kth partial product
(before shifting to the right).
Figure 2.8 Circuit to determine ith bit of the kth partial
product
2.14 CARRY SAVE ADDERS ARRAY BLOCK

The figure is a diagram of the two carry save adder (CSA) arrays
used to add four binary numbers. The carry save adder array CSA1 is
used to add three binary numbers. There is no carry propagation on
carry-in. The carry of the ith bit is saved as the value of C[i],
and the sum of the ith bit is saved as the value of S[i].
Figure 2.9 Carry Save Adders Array Block
The summation of the three numbers A1[MSB:0], A2[MSB:0] and
A3[MSB:0] is equal to the summation of the two numbers C[MSB:0] and
S[MSB:0]. The carry save adder array CSA2 is used to add A4[MSB:0],
C[MSB:0] and S[MSB:0]. So, the summation of four numbers is
transformed into the summation of two numbers quickly by using
carry save adder arrays, with only a two-gate time delay.
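The two-level 4-to-2 reduction described above can be sketched bitwise: each CSA level compresses three numbers into a sum word and a carry word without propagating carries, since every bit position is handled independently.

```python
# Bitwise sketch of one CSA level (a 3:2 compressor per bit column):
# three inputs reduce to a sum word and a carry word, with the carry
# word shifted left one position to keep its weight.

def csa(a1, a2, a3):
    s = a1 ^ a2 ^ a3                                 # per-bit sum
    c = ((a1 & a2) | (a1 & a3) | (a2 & a3)) << 1     # per-bit carry
    return s, c

# Two CSA levels turn four numbers into two, as in Figure 2.9.
s1, c1 = csa(9, 14, 25)        # CSA1: three numbers -> two
s2, c2 = csa(41, s1, c1)       # CSA2: fold in the fourth number
print(s2 + c2, 9 + 14 + 25 + 41)
```

The invariant s + c = a1 + a2 + a3 at every level is exactly why the final pair can be handed to a single carry-propagate adder.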
2.15 RIPPLE CARRY ADDERS

The figure shows a 6-bit ripple carry adder consisting of six
1-bit full adders.
Figure 2.10 Ripple Carry Adders
2.16 CARRY SELECT ADDERS
The figure shows the carry select adder array. There are two
identical 6-bit ripple carry adder arrays. For one ripple carry
adder array, the carry-in is '0'; for the other, the carry-in is
'1'. Carry C[9] is used to determine which set of bits forms the
most significant 6 bits, P[15:10].
Figure 2.11 Carry select adders array
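The carry-select idea above can be sketched behaviorally: precompute the upper sum for both possible carry-ins, then let the real carry out of the lower part act as the multiplexer select. The 6-bit width matches the P[15:10] slice described in the text.

```python
# Behavioral sketch of a carry select stage: compute the upper sum
# twice (carry-in 0 and carry-in 1), then let the actual lower-part
# carry choose between the two precomputed results.

def carry_select_add(a_hi, b_hi, carry_in, n=6):
    mask = (1 << n) - 1
    sum0 = (a_hi + b_hi) & mask          # precomputed with carry-in 0
    sum1 = (a_hi + b_hi + 1) & mask      # precomputed with carry-in 1
    return sum1 if carry_in else sum0    # mux selected by C[9]

print(carry_select_add(0b001010, 0b000101, 1))
```

Because both candidate sums are ready before the lower carry arrives, the stage delay is just one multiplexer rather than a full ripple chain.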
CHAPTER 3
ADDERS

In electronics, an adder is a digital circuit that performs
addition of numbers. In modern computers, adders reside in the
arithmetic logic unit (ALU), where other operations are also
performed. Although adders can be constructed for many numerical
representations, such as binary-coded decimal or excess-3, the most
common adders operate on binary numbers. In cases where two's
complement is being used to represent negative numbers, it is
trivial to modify an adder into an adder-subtractor.
Addition is the most common and most often used arithmetic
operation in microprocessors, digital signal processors, and
especially digital computers. It also serves as a building block
for synthesizing all other arithmetic operations. Therefore, for
the efficient implementation of an arithmetic unit, the binary
adder structure becomes a very critical hardware unit.
In any book on computer arithmetic, one finds that there exists a
large number of different circuit architectures with different
performance characteristics that are widely used in practice.
Although much research dealing with binary adder structures has
been done, studies based on their comparative performance analysis
are few.
In this project, qualitative evaluations of the classified
binary adder architectures are given. Among the many adders, we
wrote VHDL (VHSIC Hardware Description Language) code for the
ripple-carry, carry-select and carry-look-ahead adders to emphasize
the common performance properties belonging to their classes. In
the following section, we give a brief description of the studied
adder architectures.
The first class consists of the very slow ripple-carry adder
with the smallest area. In the second class, the carry-skip and
carry-select adders with multiple levels have small area
requirements and shortened computation times. In the third class,
the carry-look-ahead adder, and in the fourth class, the parallel
prefix adder, represent the fastest addition schemes with the
largest area complexities.
Types of adders
For single-bit adders, there are two general types. A half adder
has two inputs, generally labeled A and B, and two outputs, the sum
S and carry C. S is the two-input XOR of A and B, and C is the AND
of A and B. Essentially, the output of a half adder is the sum of
two one-bit numbers, with C being the more significant of the two
outputs. The second type of single-bit adder is the full adder. The
full adder takes into account a carry input, so that multiple
adders can be used to add larger numbers. To remove ambiguity
between the input and output carry lines, the carry-in is labeled
Ci or Cin while the carry-out is labeled Co or Cout.
3.1 HALF ADDER

Figure 3.1 Half adder logic diagram

A half adder is a logical circuit that performs an addition
operation on two binary digits. The half adder produces a sum and a
carry value, which are both binary digits.
Following is the logic table for a half adder:
Inputs Outputs
A B S C
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1
Table 3.1 Half Adder Truth Table
3.2 FULL ADDER
Figure 3.2 Full adder logic diagram
Inputs: {A, B, Carry In}
Outputs: {Sum, Carry Out}
Figure 3.3 Schematic symbol for a 1-bit full adder
A full adder is a logical circuit that performs an addition operation on three binary digits. The full adder produces a sum and a carry value, which are both binary digits. It can be combined with other full adders (see below) or work on its own.
Inputs Outputs
A B Ci Co S
0 0 0 0 0
0 0 1 0 1
0 1 0 0 1
0 1 1 1 0
1 0 0 0 1
1 0 1 1 0
1 1 0 1 0
1 1 1 1 1
Table 3.2 Full Adder Truth Table
Note that the final OR gate before the carry-out output may be replaced by an XOR gate without altering the resulting logic. This is because the only discrepancy between OR and XOR gates occurs when both inputs are 1, and for the adder shown here one can check that this is never possible. Using only two types of gates is convenient if one desires to implement the adder directly using common IC chips. A full adder can be constructed from two half adders by connecting A and B to the inputs of one half adder, connecting the sum from that to an input of the second half adder, connecting Ci to the other input, and OR-ing the two carry outputs. Equivalently, S could be made the three-bit XOR of A, B, and Ci, and Co could be made the three-bit majority function of A, B, and Ci. The output of the full adder is the two-bit arithmetic sum of three one-bit numbers.
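The construction described above can be sketched as an illustrative software model (not the project's VHDL): a full adder built from two half adders with the two carry outputs OR-ed together, then chained into a ripple-carry adder.

```python
# Illustrative software model of the construction described above: a full
# adder from two half adders (carry outputs OR-ed), then a ripple chain.

def half_adder(a, b):
    """S = A XOR B, C = A AND B."""
    return a ^ b, a & b

def full_adder(a, b, cin):
    s1, c1 = half_adder(a, b)     # first half adder on A and B
    s, c2 = half_adder(s1, cin)   # second half adder on partial sum and Ci
    return s, c1 | c2             # OR the two carry outputs

def ripple_add(a_bits, b_bits):
    """Add two little-endian bit lists by rippling the carry."""
    carry, result = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        result.append(s)
    return result, carry
```

Because c1 and c2 can never both be 1 (the first half adder only generates a carry when A = B = 1, in which case its sum bit is 0 and the second half adder cannot carry), the final OR here could equally be an XOR, as noted in the text.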
CHAPTER 4
FILTERS
4.1 FIR FILTER
Digital filters can be divided into two categories: finite impulse response (FIR) filters and infinite impulse response (IIR) filters. Although FIR filters in general require more taps than IIR filters to obtain similar frequency characteristics, FIR filters are widely used because they have linear phase characteristics, guarantee stability, and are easy to implement with multipliers, adders and delay elements. The number of taps in digital filters varies according to the application. In commercial filter chips with a fixed number of taps, zero coefficients are loaded into the registers of unused taps, so unnecessary calculations have to be performed. To alleviate this problem, FIR filter chips providing variable-length taps have been widely used in many application fields. However, these FIR filter chips use memory, an address generation unit, and a modulo unit to access memory in a circular manner. This work proposes two special features, called a data reuse structure and a recurrent-coefficient scheme, to provide variable-length taps efficiently. Since the proposed architecture only requires several MUXes, registers, and a feedback loop, the number of gates can be reduced by over 20% compared with existing chips.
Fig 4.1 FIR filter block diagram
In general, FIR filtering is described by a simple convolution operation, as expressed in equation (4.1):

y[n] = Σ_{k=0}^{N-1} h[k] x[n-k]    (4.1)

where x[n], y[n], and h[k] represent the data input, the filtering output, and a coefficient, respectively, and N is the filter order. The equation for a FIR filter using the bit-serial algorithm can be represented as (4.2), where hj is the jth bit of the coefficient, N is the filter order, and M is the coefficient word length.
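The convolution of equation (4.1) can be sketched as an illustrative software model (a word-parallel reference, not the bit-serial hardware scheme; the function name is our own):

```python
# Illustrative software model of equation (4.1): direct-form FIR
# convolution y[n] = sum over k = 0..N-1 of h[k] * x[n-k].

def fir_filter(h, x):
    """Filter the input sequence x with coefficients h (N = len(h) taps)."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, coeff in enumerate(h):
            if n - k >= 0:          # x[n-k] is taken as zero before the input starts
                acc += coeff * x[n - k]
        y.append(acc)
    return y

# 3-tap moving-average filter applied to a unit-step input: the output
# ramps up over the first N-1 samples and then settles at 1.
y = fir_filter([1/3, 1/3, 1/3], [1.0, 1.0, 1.0, 1.0, 1.0])
```

Each output sample requires N multiplications and N-1 additions, which is exactly the multiplier/adder/delay-element structure of the block diagram in Fig. 4.1.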
4.2 TRANSVERSAL FILTER
An N-tap transversal filter was assumed as the basis for this adaptive filter. The value of N is determined by practical considerations. An FIR filter was chosen because of its stability. The use of the transversal structure allows relatively straightforward construction of the filter, as shown in Fig. 4.2.
Figure 4.2 Transversal filter
As the input, coefficients and output of the filter are all assumed to be complex valued, the natural choice for the property measurement is the modulus, or instantaneous amplitude. If y(k) is the complex-valued filter output, then |y(k)| denotes its amplitude. The convergence error p(k) can be defined as follows:

p(k) = A - |y(k)|    (4)

where A is the amplitude in the absence of signal degradations. The error p(k) should be zero when the envelope has the proper value, and non-zero otherwise. The error carries sign information to indicate in which direction the envelope is in error.
The adaptive algorithm is defined by specifying a
performance/cost/fitness function based on the error p (k) and then
developing a procedure that adjusts the filter impulse response so
as to minimize or maximize that performance function.
y(k) = Σ_{i=0}^{N-1} w_k(i) x(k-i)    (5)
The gradient search algorithm was selected to simplify the
filter design. The filter coefficient update equation is given
by:
w(k+1) = w(k) + μ e(k) x(k)    (6)

where x(k) is the filter input at sample k, e(k) = p(k)·y(k) is the error term at sample k, and μ is the step size for updating the weight values.
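The gradient-search weight update can be sketched as an illustrative software model. For simplicity this sketch uses real-valued signals and a standard LMS variant driven by a hypothetical training target d, rather than the envelope-error criterion p(k) defined above; the target system h and all names are our own illustration:

```python
# Illustrative software model of the coefficient update in equation (6):
# w(k+1) = w(k) + mu * e(k) * x(k). Real-valued LMS with a training
# target d (a simplification of the envelope-based error in the text).
import random

def lms_step(w, x_window, d, mu):
    """One adaptation step: filter, form the error, update the weights."""
    y = sum(wi * xi for wi, xi in zip(w, x_window))  # y(k) = sum w(i) x(k-i)
    e = d - y                                        # error term at sample k
    return [wi + mu * e * xi for wi, xi in zip(w, x_window)], e

# Identify a hypothetical 2-tap system h = [0.5, -0.25] from input/desired
# pairs; the weights should converge toward h.
random.seed(0)
h = [0.5, -0.25]
w = [0.0, 0.0]
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
for k in range(1, len(x)):
    window = [x[k], x[k - 1]]                        # x(k), x(k-1)
    d = h[0] * window[0] + h[1] * window[1]          # desired response
    w, e = lms_step(w, window, d, mu=0.1)
```

The step size mu trades convergence speed against stability, which is the usual design choice when the gradient-search algorithm is "selected to simplify the filter design" as the text puts it.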
CHAPTER 5
VHDL
Many DSP applications demand high throughput and real-time
response, performance constraints that often dictate unique
architectures with high levels of concurrency. DSP designers need
the capability to manipulate and evaluate complex algorithms to
extract the necessary level of concurrency. Performance constraints
can also be addressed by applying alternative technologies. A
change at the implementation level of design by the insertion of a
new technology can often make viable an existing marginal algorithm
or architecture.
The VHDL language supports these modeling needs at the algorithm
or behavioral level,
and at the implementation or structural level. It provides a
versatile set of description facilities to model DSP circuits from
the system level to the gate level. Recently, we have also noticed
efforts to include circuit-level modeling in VHDL. At the system
level we can build behavioral models to describe algorithms and
architectures. We would use concurrent processes with constructs
common to many high-level languages, such as if, case, loop, wait,
and assert statements. VHDL also includes user-defined types,
functions, procedures, and packages. In many respects VHDL is a
very powerful, high-level, concurrent programming language. At the
implementation level we can build structural models using component
instantiation statements that connect and invoke subcomponents. The
VHDL generate statement provides ease of block replication and
control. A dataflow level of description offers a combination of
the behavioral and structural levels of description. VHDL lets us
use all three levels to describe a single component. Most
importantly, the standardization of VHDL has spurred the
development of model libraries and design and development tools at
every level of abstraction. VHDL, as a consensus description
language and design environment, offers design tool portability,
easy technical exchange, and technology insertion.
VHDL: The language
An entity declaration, or entity, combined with architecture or
body constitutes a VHDL model. VHDL calls the entity-architecture
pair a design entity. By describing alternative architectures for
an entity, we can configure a VHDL model for a specific level of
investigation. The entity contains the interface description common
to the alternative architectures. It communicates with other
entities and the environment through ports and generics. Generic
information particularizes an entity by specifying environment
constants such as register size or delay value. For example,
entity A is
  generic (delay : time);
  port (x, y : in real; z : out real);
end A;
The architecture contains declarative and statement sections.
Declarations form the region before the reserved word begin and can
declare local elements such as signals and components. Statements
appear after begin and can contain concurrent statements. For
instance,
architecture B of A is
  component M
    port (j : in real; k : out real);
  end component;
  signal a, b, c : real := 0.0;
begin
  "concurrent statements"
end B;
The variety of concurrent statement types gives VHDL the
descriptive power to create and combine models at the structural,
dataflow, and behavioral levels into one simulation model. The
structural type of description makes use of component instantiation
statements to invoke models described elsewhere. After declaring
components, we use them in the component instantiation statement,
assigning ports to local signals or other ports and giving values
to generics. For instance:

invert : M port map (j => a, k => c);

We can then bind the components to other design entities through
configuration specifications in VHDL's architecture declarative
section or through separate configuration declarations. The
dataflow style makes wide use of a number of types of concurrent
signal assignment statements, which associate a target signal with
an expression and a delay. The list of signals appearing in the
expression is the sensitivity list; the expression must be
evaluated for any change on any of these signals. The target
signals obtain new values after the delay specified in the signal
assignment statement. If no delay is specified, the signal
assignment occurs during the next simulation cycle:
c <= a + b;