Abstract
This paper presents the design exploration and applications of a spurious-
power suppression technique (SPST) which can dramatically reduce the power
dissipation of combinational VLSI designs for multimedia/DSP purposes. The
proposed SPST separates the target designs into two parts, i.e., the most significant
part and least significant part (MSP and LSP), and turns off the MSP when it does not
affect the computation results to save power. Furthermore, this paper proposes an
original glitch-diminishing technique to filter out useless switching power by
asserting the data signals after the data transient period.
There are several quantities that one would like to optimize when designing a
VLSI circuit. These quantities can often not be optimized simultaneously; improving
one usually comes at the expense of one or more of the others. Designing an integrated
circuit that is efficient in power, area, and speed at once has therefore become a very
challenging problem. Power dissipation is recognized as a critical parameter in
modern VLSI design, and the objective of a good multiplier is to provide a physically
compact, high-speed, and low-power chip. To save significant power in a VLSI
design, a good direction is to reduce its dynamic power, which is the major part
of the total power dissipation. In this paper, we propose a high-speed, low-power
multiplier adopting the new SPST implementation approach. This multiplier is designed
by equipping the Spurious Power Suppression Technique (SPST) on a modified
Booth encoder which is controlled by a detection unit using an AND gate. The
modified Booth encoder reduces the number of partial products generated by a
factor of 2. The SPST adder avoids unwanted additions and thus minimizes the
switching power dissipation.
In this project, we used ModelSim for logic verification, and further
synthesized the design on the Xilinx ISE tool using the target technology, performing
place-and-route for system verification on the targeted FPGA.
CHAPTER 1
INTRODUCTION AND LITERATURE SURVEY
1.1 Introduction:
Power dissipation is recognized as a critical parameter in modern VLSI design
field. To satisfy MOORE’S law and to produce consumer electronics goods with
more backup and less weight, low power VLSI design is necessary.
Fast multipliers are essential parts of digital signal processing systems. The
speed of multiply operation is of great importance in digital signal processing as well
as in general-purpose processors today, especially since media processing took
off. In the past, multiplication was generally implemented via a sequence of addition,
subtraction, and shift operations. Multiplication can be considered as a series of
repeated additions. The number to be added is the multiplicand, the number of times
that it is added is the multiplier, and the result is the product. Each step of addition
generates a partial product. In most computers, the operand usually contains the same
number of bits. When the operands are interpreted as integers, the product is generally
twice the length of operands in order to preserve the information content. This
repeated addition method suggested by the arithmetic definition is so slow that it
is almost always replaced by an algorithm that makes use of positional representation.
It is possible to decompose multipliers into two parts. The first part is dedicated to the
generation of partial products, and the second one collects and adds them.
The basic multiplication principle is twofold, i.e., evaluation of partial
products and accumulation of the shifted partial products. It is performed by the
successive additions of the columns of the shifted partial-product matrix. The
multiplier is successively shifted and gates the appropriate bit of the multiplicand.
The delayed, gated instances of the multiplicand must all be in the same column of the
shifted partial-product matrix. They are then added to form the product bit for the
particular position. Multiplication is therefore a multi-operand operation. To extend
multiplication to both signed and unsigned numbers, a convenient number system
would be the representation of numbers in two's complement format.
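The shift-and-add view of multiplication described above can be sketched in a few lines of Python. This is a behavioral sketch of the arithmetic only, not a model of any hardware structure discussed later:

```python
def shift_add_multiply(multiplicand: int, multiplier: int) -> int:
    """Unsigned shift-and-add multiplication: each multiplier bit gates a
    shifted copy of the multiplicand (one partial product), and the partial
    products are accumulated into the product."""
    product = 0
    shift = 0
    while multiplier:
        if multiplier & 1:                    # this bit gates the multiplicand
            product += multiplicand << shift  # add the shifted partial product
        multiplier >>= 1
        shift += 1
    return product

# The product of two N-bit operands needs up to 2N bits:
assert shift_add_multiply(0b1111, 0b1111) == 225   # 15 * 15 needs 8 bits
assert shift_add_multiply(13, 11) == 143
```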
Multipliers are key components of many high performance systems such as
FIR filters, microprocessors, digital signal processors, etc. A system’s performance is
generally determined by the performance of the multiplier because the multiplier is
generally the slowest element in the system. Furthermore, it is generally the most area
consuming. Hence, optimizing the speed and area of the multiplier is a major design
issue. However, area and speed are usually conflicting constraints, so improving
speed mostly results in larger areas. As a result, a whole spectrum of multipliers with
different area-speed trade-offs has been designed, ranging from fully serial to fully parallel processing. In
between are digit serial multipliers where single digits consisting of several bits are
operated on. These multipliers have moderate performance in both speed and area.
However, existing digit serial multipliers have been plagued by complicated
switching systems and/or irregularities in design. Radix 2^n multipliers which operate
on digits in a parallel fashion instead of bits bring the pipelining to the digit level and
avoid most of the above problems. They were introduced by M. K. Ibrahim in 1993.
These structures are iterative and modular. The pipelining done at the digit level
brings the benefit of constant operation speed irrespective of the size of the
multiplier. The clock speed is only determined by the digit size which is already fixed
before the design is implemented.
The growing market for fast floating-point co-processors, digital signal
processing chips, and graphics processors has created a demand for high speed, area-
efficient multipliers. Current architectures range from small, low-performance shift
and add multipliers, to large, high-performance array and tree multipliers.
Conventional linear array multipliers achieve high performance in a regular structure,
but require large amounts of silicon. Tree structures achieve even higher performance
than linear arrays but the tree interconnection is more complex and less regular,
making them even larger than linear arrays. Ideally, one would want the speed
benefits of a tree structure, the regularity of an array multiplier, and the small size of a
shift and add multiplier.
To reduce the size of the multiplier a partial tree is used together with a 4-2
carry-save accumulator placed at its outputs to iteratively accumulate the partial
products. This allows a full multiplier to be built in a fraction of the area required by a
full array. Higher performance is achieved by increasing the hardware utilization of
the partial 4-2 tree through pipelining. To ensure optimal performance of the
pipelined 4-2 tree, the clock frequency must be tightly controlled to match the delay
of the 4-2 adder pipe stages.
The SPST (Spurious Power Suppression Technique) is used for digital signal
processing (DSP), transformations in digital image processing, the versatile
multimedia functional unit (VMFU), etc. Booth's radix-4 algorithm, the Modified
Booth Multiplier, and the 34-bit CSA improve the speed of multipliers, while the SPST adder
reduces the power consumption in the addition process.
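As a rough software model of the SPST idea, the sketch below splits a 16-bit addition into an 8-bit most significant part and an 8-bit least significant part, and skips the MSP addition when both MSPs are zero. This is a hypothetical illustration only; the actual design uses detection logic and latches inside the adder and also covers cases this sketch omits (e.g., negative operands):

```python
def spst_add_16(a: int, b: int) -> int:
    """Illustrative SPST-style 16-bit add: split each operand into an 8-bit
    MSP and an 8-bit LSP; when both MSPs are zero, the MSP adder would be
    frozen in hardware and only the LSP carry-out is forwarded."""
    a_msp, a_lsp = a >> 8, a & 0xFF
    b_msp, b_lsp = b >> 8, b & 0xFF
    lsp_sum = a_lsp + b_lsp
    carry = lsp_sum >> 8
    if a_msp == 0 and b_msp == 0:
        # MSP adder "turned off": its output is just the LSP carry-out
        return (carry << 8) | (lsp_sum & 0xFF)
    return ((a_msp + b_msp + carry) << 8) | (lsp_sum & 0xFF)

assert spst_add_16(0x00FF, 0x0001) == 0x0100   # MSP addition skipped
assert spst_add_16(0x1234, 0x0101) == 0x1335   # MSP addition performed
```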
1.2 Background
Webster’s dictionary defines multiplication as “a mathematical operation that
at its simplest is an abbreviated process of adding an integer to itself a specified
number of times”. A number (multiplicand) is added to itself a number of times as
specified by another number (multiplier) to form a result (product). In elementary
school, students learn to multiply by placing the multiplicand on top of the multiplier.
The multiplicand is then multiplied by each digit of the multiplier beginning with the
rightmost, Least Significant Digit (LSD). Intermediate results (partial-products) are
placed one atop the other, offset by one digit to align digits of the same weight. The
final product is determined by summation of all the partial-products. Although most
people think of multiplication only in base 10, this technique applies equally to any
base, including binary. Figure 1.1 shows the data flow for the basic multiplication
technique just described. Each black dot represents a single digit.
Fig 1.1 Basic Multiplication Data flow
1.3 Different type of Multipliers:
1.3.1 Binary Multiplication
In the binary number system the digits, called bits, are limited to the set {0, 1}. The
result of multiplying any binary number by a single binary bit is either 0, or the
original number. This makes forming the intermediate partial-products simple and
efficient. Summing these partial-products is the time consuming task for binary
multipliers. One logical approach is to form the partial-products one at a time and sum
them as they are generated. Often implemented by software on processors that do not
have a hardware multiplier, this technique works fine, but is slow because at least one
machine cycle is required to sum each additional partial-product. For applications
where this approach does not provide enough performance, multipliers can be
implemented directly in hardware.
1.3.2 Hardware Multipliers
Direct hardware implementations of shift and add multipliers can increase
performance over software synthesis, but are still quite slow. The reason is that as
each additional partial-product is summed a carry must be propagated from the least
significant bit (LSB) to the most significant bit (MSB). This carry propagation is time
consuming, and must be repeated for each partial product to be summed.
One method to increase multiplier performance is by using encoding
techniques to reduce the number of partial products to be summed. Just such a
technique was first proposed by Booth [BOO 51]. The original Booth's algorithm
skips over contiguous strings of 1's by using the property that 2^n + 2^(n-1) + 2^(n-2) +
. . . + 2^m = 2^(n+1) - 2^m. Although Booth's algorithm produces at most N/2
encoded partial products from an N-bit operand, the number of partial products
produced varies. This has caused designers to use modified versions of Booth's
algorithm for hardware multipliers. Modified 2-bit Booth encoding halves the number
of partial products to be summed.
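The string-of-1's identity that underlies Booth recoding can be checked numerically; the following quick Python verification exhausts small cases:

```python
# A run of 1's occupying bit positions m through n satisfies
#   2^n + 2^(n-1) + ... + 2^m = 2^(n+1) - 2^m,
# so the whole run can be replaced by one addition and one subtraction.
def run_of_ones(n: int, m: int) -> int:
    return sum(2**k for k in range(m, n + 1))

for n in range(1, 12):
    for m in range(0, n + 1):
        assert run_of_ones(n, m) == 2**(n + 1) - 2**m
```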
Since the resulting encoded partial-products can then be summed using any
suitable method, modified 2-bit Booth encoding is used on most modern floating-
point chips [LU 88], [MCA 86]. A few designers have even turned to modified 3-bit
Booth encoding, which reduces the number of partial products to be summed by a
factor of three [BEN 89]. The problem with 3-bit encoding is that the carry-propagate
addition required to form the 3X multiples often overshadows the potential gains of
3-bit Booth encoding.
To achieve even higher performance advanced hardware multiplier
architectures search for faster and more efficient methods for summing the partial-
products. Most increase performance by eliminating the time consuming carry
propagate additions. To accomplish this, they sum the partial-products in a redundant
number representation. The advantage of a redundant representation is that two
numbers, or partial-products, can be added together without propagating a carry
across the entire width of the number. Many redundant number representations are
possible. One commonly used representation is known as carry-save form. In this
redundant representation two bits, known as the carry and sum, are used to represent
each bit position. When two numbers in carry-save form are added together any
carries that result are never propagated more than one bit position. This makes adding
two numbers in carry-save form much faster than adding two normal binary numbers
where a carry may propagate. One common method that has been developed for
summing rows of partial products using a carry-save representation is the array
multiplier.
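Carry-save form can be imitated in software. The sketch below (an arithmetic illustration, not the array-multiplier circuit itself) keeps each number as a (sum, carry) pair and combines operands with a bitwise full adder, so carries move only one bit position per step:

```python
def full_add_bitwise(a: int, b: int, c: int):
    """One level of carry-save addition: per bit position, sum = a^b^c and
    carry = majority(a, b, c), shifted left one position."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

# Two numbers held in carry-save (sum, carry) form:
n1 = (0b1010, 0b0100)   # represents 10 + 4 = 14
n2 = (0b0110, 0b1000)   # represents  6 + 8 = 14
# Adding them takes two 3:2 steps; no carry ever crosses the full word:
s, c = full_add_bitwise(n1[0], n1[1], n2[0])
s, c = full_add_bitwise(s, c, n2[1])
assert s + c == 28      # carries propagate fully only in this final add
```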
1.3.3 Array Multipliers
Conventional linear array multipliers consist of rows of carry-save adders
(CSA). A portion of an array multiplier with the associated routing can be seen in
Figure 1.2. In a linear array multiplier, as the data propagates down through the array,
each row of CSA’s adds one additional partial-product to the partial sum. Since the
intermediate partial sum is kept in a redundant, carry-save form there is no carry
propagation. This means that the delay of an array multiplier is only dependent upon
the depth of the array, and is independent of the partial-product width. Linear array
multipliers are also regular, consisting of replicated rows of CSA’s. Their high
performance and regular structure have perpetuated the use of array multipliers for
VLSI math co-processors and special purpose DSP chips.
The biggest problem with full linear array multipliers is that they are very
large. As operand sizes increase, linear arrays grow in size at a rate equal to the square
of the operand size. This is because the number of rows in the array is equal to the
length of the multiplier, with the width of each row equal to the width of multiplicand.
The large size of full arrays typically prohibits their use, except for small operand
sizes, or on special purpose math chips where a major portion of the silicon area can
be assigned to the multiplier array.
Another problem with array multipliers is that the hardware is underutilized.
As the sum is propagated down through the array, each row of CSA’s computes a
result only once, when the active computation front passes that row. Thus, the
hardware is doing useful work only a very small percentage of the time. This low
hardware utilization in conventional linear array multipliers makes performance gains
possible through increased efficiency. For example, by overlapping calculations
pipelining can achieve a large gain in throughput [NOL 86]. Figure 1.3 shows a full
array pipelined after each row of CSA's. Once the partial sum has passed the first row
of CSA's, represented by the shaded row of CSA's in cycle 1, a subsequent multiply
can be started on the next cycle. In cycle 2, the first partial sum has passed to the
second row of CSA's, and the second multiply, represented by the cross-hatched row
of CSA's, has begun. Although pipelining a full array can greatly increase throughput,
both the size and latency are increased due to the additional latches. While high
throughput is desirable, for general-purpose computers size and latency tend to be
more important; thus, fully pipelined linear array multipliers are seldom found.
1.4 Iterative Techniques
To reduce area, some designers use partial arrays and iterate using a clock. At
the limit, a minimal iterative structure would have one row of CSA’s and a latch.
Clearly, this structure requires the least amount of hardware, and has the highest
utilization since each CSA is used every cycle. An important observation is that
iterative structures are fast if the latch delays are small, and the clock is matched to
the combinational delay of the CSA’s. If both of these conditions are met, iterative
structures approach the same throughput and latency as full arrays. The only
difference in latency is due to the latch and clock overhead. Although they require
very fast clocks, a few companies use iterative structures in their new high-
performance floating point processors.
Figure 1.4 Minimal Iterative Structures
In an attempt to increase performance of the minimal iterative structure
additional rows of CSA’s could be added to make a bigger array. For example, the
addition of one row of CSA's to the minimal structure would yield a partial array with
two rows of CSA's. This structure provides two advantages over the single row of
CSA cells:
1) It reduces the required clock frequency, and
2) It requires only half as many latch delays.
It is important to note that although the number of CSA’s has been doubled,
the latency was reduced only by halving the number of latch delays. The number of
CSA delays remains the same. Thus, assuming the latch delays are small relative to
the CSA delays, increasing the depth of the partial array by adding additional rows of
CSA’s in a linear structure yields only a slight increase in performance.
1.5 Proposed SPST Architecture:
In this section, the expression for the new arithmetic will be derived from
equations of the standard design.
1.5.1 Basic Concept:
Consider an operation that multiplies two N-bit numbers and accumulates the result
into a 2N-bit number, together with addition, subtraction, Sum of Absolute Difference
(SAD), and interpolation. The critical path is determined by the 2N-bit accumulation
operation. If a pipeline scheme is applied to each step in the standard design of Fig. 1,
the delay of the last accumulator must be reduced in order to improve the performance
of the MAC. The overall performance of the proposed VMFU is improved by eliminating
the accumulator itself by combining it with the CSA function. If the accumulator has
the accumulator itself by combining it with the CSA function. If the accumulator has
been eliminated, the critical path is then determined by the final adder in the
multiplier. The basic method to improve the performance of the final adder is to
decrease the number of input bits. In order to reduce this number of input bits, the
multiple partial products are compressed into a sum and a carry by CSA. The number
of bits of sums and carries to be transferred to the final adder is reduced by adding the
lower bits of sums and carries in advance within the range in which the overall
performance will not be degraded. A 2-bit CLA is used to add the lower bits in the
CSA. In addition, to increase the output rate when pipelining is applied, the sums and
carries from the CSA are accumulated instead of the outputs from the final adder, in
the manner that the sum and carry from the CSA in the previous cycle are fed back to
the CSA. Due to this feedback of both sum and carry, the number of inputs to the CSA
increases compared to the standard design. In order to efficiently handle this increase
in the amount of data, the CSA architecture is modified to treat the sign bit.
1.6. Versatile Multimedia Functional Unit
VMFU is composed of an adder, multiplier and an accumulator.
The adders implemented are usually Carry-Select or Carry-Save adders,
as speed is of utmost importance in DSP (Chandrakasan, Sheng, &
Brodersen, 1992 and Weste & Harris, 3rd Ed). One implementation of
the multiplier could be as a parallel array multiplier. The inputs for
the VMFU are to be fetched from memory location and fed to the
multiplier block, which will perform multiplication and give the result
to adder which will accumulate the result and then will store the
result into a memory location. This entire process is to be achieved
in a single clock cycle (Weste & Harris, 3rd Ed).
Figure 2 is the architecture of the MAC unit designed
in this work. The design consists of one 16-bit register, one
16-bit Modified Booth multiplier, a 33-bit accumulator using
ripple carry, and two 16-bit accumulator registers. To multiply the
values of A and B, Modified Booth multiplier is used instead of
conventional multiplier because Modified Booth multiplier can
increase the MAC unit design speed and reduce multiplication
complexity. Carry save Adder (CSA) is used as an accumulator in
this design. Together with the utilization of the Wallace tree
multiplier approach, carry save adder in the final stage of the
Modified Booth multiplier and Carry save Adder as the accumulator,
this VMFU unit design is not only reducing the standby power
consumption but also can enhance the VMFU unit speed so as to
gain better system performance. The operation of the designed
VMFU unit is as in Equation 2.1. The product Ai x Bi is always fed
back into the 34-bit carry-save accumulator and then added again
with the next product Ai x Bi. This MAC unit is capable of multiplying
and adding with the previous product consecutively up to as many as
eight times.
Operation: Output = Σ Ai Bi (2.1)
In this paper, the design of a 16x16 multiplier unit is carried out that can perform
accumulation on a 34-bit number. This MAC unit has a 34-bit output, and
its operation is to repeatedly add the multiplication results. The total
design area is also inspected by observing the total gate count. The
power-delay product is calculated by multiplying the power consumption
by the time delay.
1.7. Modified Booth's Algorithm:
Booth's algorithm is simple but powerful. The speed of the VMFU depends on
the number of partial products and on the speed of accumulating them. Booth's
algorithm allows us to reduce the number of partial products. We choose the radix-4
algorithm for the reasons below.
The original Booth's algorithm has an inefficient case: the number of encoded
partial products varies.
Seventeen partial products are generated in 16-bit x 16-bit signed or unsigned
multiplication.
Modified Booth's radix-8 algorithm incurs extra encoding time in 16-bit x 16-bit
multiplication: the radix-8 algorithm has a 3x term, which means that a partial
product cannot be generated by shifting alone. Therefore, 2x + 1x is needed in the
encoding process. One solution is handling the additional 1x term in the Wallace
tree. However, a large Wallace tree has problems of its own.
A radix-4 modified Booth's algorithm: Booth's radix-4 algorithm is widely
used to reduce the area of the multiplier and to increase its speed. Grouping 3 bits of
the multiplier with overlapping halves the number of partial products, which improves
the system speed. The radix-4 modified Booth's algorithm is as follows:
1) Set y-1 = 0, i.e., insert a 0 to the right of the LSB of the multiplier.
2) Group the multiplier bits into overlapping 3-bit groups, starting from y-1.
3) If the number of multiplier bits is odd, add an extra sign bit on the left side of the MSB.
4) Generate each partial product from the truth table.
5) Add each newly generated partial product with a 2-bit left shift relative to the
previous one.
x: multiplicand, y: multiplier
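The recoding steps above can be followed in software. The sketch below is an illustrative Python model (the function names are our own): it appends y-1 = 0, scans overlapping 3-bit groups, maps each group to a digit in {-2, -1, 0, +1, +2}, and weights successive digits by powers of 4 (the 2-bit left shift):

```python
def booth_radix4_digits(y: int, bits: int):
    """Radix-4 modified Booth recoding of a two's-complement multiplier y."""
    y &= (1 << bits) - 1                  # two's-complement bit pattern
    padded = y << 1                       # insert y_{-1} = 0 below the LSB
    digits = []
    for k in range(bits // 2):
        t = (padded >> (2 * k)) & 0b111   # triplet (y_{2k+1}, y_{2k}, y_{2k-1})
        digits.append(-2 * ((t >> 2) & 1) + ((t >> 1) & 1) + (t & 1))
    return digits

def booth_multiply(x: int, y: int, bits: int = 16) -> int:
    # each digit's partial product is shifted 2 bits (weight 4^k)
    return sum(d * x * 4**k for k, d in enumerate(booth_radix4_digits(y, bits)))

assert booth_radix4_digits(0b0110, 4) == [-2, 2]   # 6 = -2 + 2*4
assert booth_multiply(123, 45) == 123 * 45
assert booth_multiply(123, -45) == -(123 * 45)
```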
1.7.1. Sign or Zero Extension
Our MAC supports signed or unsigned multiplication, and the produced result
is 64 bits, stored in two special 32-bit registers. The MAC first receives a multiplicand
and a multiplier, but only 32-bit operands are signed numbers in Booth's radix-4
algorithm. Hence, an extension bit is required to express a 32-bit signed number. The
core idea is that a 32-bit unsigned number can be expressed as a 33-bit signed number.
Then 17 partial products are generated in the 33-bit x 33-bit case (16 partial products
in the 32-bit x 32-bit case). Here is an example of signed versus unsigned
multiplication. When x (multiplicand) is the 3-bit pattern 111 and y (multiplier) is the
3-bit pattern 111, signed and unsigned multiplication differ: in the signed case
x × y = 1 (-1 x -1 = 1), and in the unsigned case x × y = 49 (7 x 7 = 49).
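The 3-bit example can be checked directly in Python. This is a small illustration; the `as_signed` helper is our own:

```python
def as_signed(v: int, bits: int) -> int:
    """Interpret a bit pattern v as a two's-complement signed number."""
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

x = y = 0b111
assert x * y == 49                              # unsigned: 7 * 7
assert as_signed(x, 3) * as_signed(y, 3) == 1   # signed: -1 * -1
# One extension bit lets any k-bit unsigned value be treated as signed:
assert as_signed(0xFFFF, 17) == 0xFFFF
```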
1.7.2. Carry-Save Adder
When three or more operands are to be added simultaneously
using two operand adders, the time consuming carry propagation
must be repeated several times. If the number of operands is ‘k’,
then carries have to propagate (k-1) times (Weste & Harris, 3rd Ed).
In the carry save addition, we let the carry propagate only in the last
step, while in all the other steps we generate the partial sum and
sequence of carries separately. A CSA is capable of reducing the
number of operands to be added from 3 to 2 without any carry
propagation. A CSA can be implemented in different ways. In the
simplest implementation, the basic element of carry save adder is
the combination of two half adders or 1 bit full adder (Weste &
the combination of two half adders or a 1-bit full adder (Weste & Harris, 3rd Ed).
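The 3-to-2 reduction can be modeled as follows (an arithmetic sketch of the principle, not a gate-level design): a chain of CSAs compresses k operands to a (sum, carry) pair, and the single carry-propagate addition happens only at the end:

```python
def csa(a: int, b: int, c: int):
    """3:2 compressor: three operands in, (sum, carry) out; every carry moves
    exactly one bit position, never across the whole word."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def add_operands(operands):
    """Reduce k operands to a (sum, carry) pair with a chain of CSA steps,
    then perform the single carry-propagate addition at the end."""
    s, c = 0, 0
    for op in operands:
        s, c = csa(s, c, op)
    return s + c          # the only full carry propagation

assert add_operands([3, 14, 15, 9, 26]) == 67
```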
1.8. Circuit Design Features
One of the most advanced types of MAC for general-purpose digital signal
processing has been proposed by Elguibaly. It is an architecture in which
accumulation has been combined with the carry-save adder (CSA) tree that
compresses partial products. In that architecture, the critical path was reduced by
eliminating the adder for accumulation and decreasing the number of input bits in the
final adder. While it has better performance because of the reduced critical path
compared to previous VMFU architectures, there is a need to improve the output rate
due to the use of the final adder results for accumulation. An architecture that merges
the adder block into the accumulator register in the VMFU operator was proposed to
provide the possibility of using two separate N/2-bit adders instead of one N-bit adder
to accumulate the MAC results. Recently, Zicari proposed an architecture that uses a
merging technique to fully utilize the 4-2 compressor. It also takes this compressor as
the basic building block of the multiplication circuit.
1.9. MAC
1.9.1 Block Diagram of MAC:
In this paper, a new architecture for a high-speed MAC is proposed. In this MAC, the
computations of multiplication and accumulation are combined and a hybrid-type
CSA structure is proposed to reduce the critical path and improve the output rate. It
uses the MBA algorithm based on the 1's complement number system. A modified array
structure for the sign bits is used to increase the density of the operands. A carry look-
ahead adder (CLA) is inserted in the CSA tree to reduce the number of bits in
the final adder. In addition, in order to increase the output rate by optimizing the
pipeline efficiency, intermediate calculation results are accumulated in the form of
sum and carry instead of the final adder outputs.
A multiplier can be divided into three operational steps. The first is radix-2
Booth encoding in which a partial product is generated from the multiplicand X and
the multiplier Y . The second is adder array or partial product compression to add all
partial products and convert them into the form of sum and carry. The last is the final
addition in which the final multiplication result is produced by adding the sum and the
carry. If the process to accumulate the multiplied results is included, a MAC consists
of four steps, as shown in Fig. 1, which shows the operational
steps explicitly.
CHAPTER 2
Design of VMFU
In the majority of digital signal processing (DSP) applications
the critical operations usually involve many multiplications and/or
accumulations. For real-time signal processing, a high speed and
high throughput Multiplier-Accumulator (MAC) is always a key to
achieve a high performance digital signal processing system and
versatile multimedia functional units. In the last few years, the main
consideration of MAC design has been to enhance its speed, because
speed and throughput rate are always the concern of a VMFU.
But in the era of personal communication, low-power design has also
become another main design consideration, because the
battery energy available for portable products limits the
power consumption of the system. Therefore, the main motivation
of this work is to investigate various Pipelined
multiplier/accumulator architectures and circuit design techniques
which are suitable for implementing high throughput signal
processing algorithms and at the same time achieve low power
consumption. A conventional VMFU unit consists of a fast
multiplier and an accumulator that contains the sum of the previous
consecutive products. The function of the VMFU unit is given by the
following equation:
F = Σ Ai Bi (2.1)
The main goal of a VMFU design is to enhance the speed of
the MAC unit, and at the same time limit the power consumption. In
a pipelined MAC circuit, the delay of pipeline stage is the delay of a
1-bit full adder. Estimating this delay will assist in identifying the
overall delay of the pipelined MAC. In this work, 1-bit full adder is
designed. Area, power and delay are calculated for the full adder,
based on which the pipelined MAC unit is designed for low power.
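The accumulate operation of Equation (2.1) can be modeled in a few lines of Python. The 34-bit accumulator width follows the design described earlier; the masking is our own illustrative way of modeling a fixed-width register:

```python
def mac(pairs, acc_bits: int = 34) -> int:
    """F = sum(Ai * Bi), accumulated into a fixed-width register."""
    acc = 0
    for a, b in pairs:
        acc = (acc + a * b) & ((1 << acc_bits) - 1)  # register wraps at 34 bits
    return acc

samples = [(3, 4), (10, 20), (7, 7)]
assert mac(samples) == 3*4 + 10*20 + 7*7   # 261 fits easily in 34 bits
```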
2.1. High-Speed Booth Encoded Parallel Multiplier Design:
Fast multipliers are essential parts of digital signal
processing systems, and the speed of the multiply operation is of
great importance in general-purpose processors as well. As
described in Chapter 1, a multiplier can be decomposed into two
parts: the generation of partial products, and their collection and
addition. A high-speed Booth encoded parallel multiplier speeds up
both of these steps.
2.2. Modified Booth Encoder:
In order to achieve high-speed multiplication, multiplication
algorithms using parallel counters, such as the modified Booth
algorithm, have been proposed, and some multipliers based on these
algorithms have been implemented for practical use. This type of
multiplier operates much faster than an array multiplier for longer
operands because its computation time is proportional to the
logarithm of the word length of operands.
Booth multiplication is a technique that allows for smaller,
faster multiplication circuits, by recoding the numbers that are
multiplied. It is possible to reduce the number of partial products by
half, by using the technique of radix-4 Booth recoding. The basic
idea is that, instead of shifting and adding for every column of the
multiplier term and multiplying by 1 or 0, we only take every second
column, and multiply by ±1, ±2, or 0, to obtain the same results.
The advantage of this method is the halving of the number of partial
products. To Booth recode the multiplier term, we consider the bits
in blocks of three, such that each block overlaps the previous block
by one bit. Grouping starts from the LSB, and the first block only
uses two bits of the multiplier. Figure 2.2 shows the grouping of bits
from the multiplier term for use in modified Booth encoding.
Fig.2.2 Grouping of bits from the multiplier term
Each block is decoded to generate the correct partial product. The
encoding of the multiplier Y, using the modified Booth algorithm,
generates the following five signed digits, -2, -1, 0, +1, +2. Each
encoded digit in the multiplier performs a certain operation on the
multiplicand, X, as illustrated in Table 1
For the partial product generation, we adopt the Radix-4 Modified
Booth algorithm to reduce the number of partial products by
roughly one half. For multiplication of 2's complement numbers, the
two-bit encoding using this algorithm scans a triplet of bits. When
the multiplier B is divided into groups of two bits, the algorithm is
applied to this group of divided bits.
Figure 2.3 shows a computing example of Booth multiplying
two numbers, "2AC9" and "006A". The shadow denotes that the
numbers in this part of the Booth multiplication are all zero, so this
part of the computation can be neglected. Saving those
computations can significantly reduce the power consumption
caused by the transient signals. According to the analysis of the
multiplication shown in Figure 2.3, we propose the SPST-equipped
modified-Booth encoder, which is controlled by a detection unit. The
detection unit has one of the two operands as its input to decide
whether the Booth encoder calculates redundant computations. As
shown in Fig. 2.4, the latches can, respectively, freeze the inputs of
MUX-4 to MUX-7, or only those of MUX-6 to MUX-7, when PP4 to
PP7 or PP6 to PP7 are zero, to reduce the transition power
dissipation. Figure 2.5 shows the Booth partial product generation
circuit. It includes AND/OR/EX-OR logic.
Fig.2.3 Illustration of multiplication using modified Booth encoding
The PP generator generates five candidates for the partial
products, i.e., {-2A, -A, 0, A, 2A}. These are then selected according
to the Booth encoding results of operand B. When the operand
other than the Booth-encoded one has a small absolute value, there
are opportunities to reduce the spurious power dissipated in the
compression tree.
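The saving opportunity in the "2AC9" x "006A" example can be checked with a short sketch (illustrative Python, not the hardware): when the operand B = 0x006A has an all-zero MSP byte, the upper Booth digits, and hence PP4 to PP7, are all zero, which is exactly what the detection unit exploits.

```python
def booth_digits(y, n=16):
    """Radix-4 Booth recoding (LSB-first digits in {-2,-1,0,+1,+2})."""
    y_ext = (y & ((1 << n) - 1)) << 1      # implicit 0 below the LSB
    return [-2 * ((y_ext >> (i + 2)) & 1) + ((y_ext >> (i + 1)) & 1)
            + ((y_ext >> i) & 1) for i in range(0, n, 2)]

B = 0x006A                      # small positive operand: MSP byte is all zeros
digits = booth_digits(B, 16)    # selection digits for PP0..PP7
# Detection: since B[15:8] are all zeros, the upper encoder rows select 0,
# so PP4..PP7 can be frozen without changing the product.
assert digits[4:] == [0, 0, 0, 0]
```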
Fig.2.4 SPST equipped modified Booth encoder
2.3. Partial product generator:
Fig.2.5 Booth partial product selector logic
2.4. Partial Product Generator:
The first step of the multiplication generates, from A and X, a set of bits
whose weighted sum is the product P. For unsigned multiplication, the
weight of the most significant bit of P is positive, while in 2's complement
it is negative.
The partial products are generated by ANDing 'a'
and 'b', which are 4-bit vectors, as shown in the figure. If we take a 4-bit
multiplier and a 4-bit multiplicand, we get sixteen partial product bits, of
which the first partial product row is stored in 'q'. Similarly, the second,
third, and fourth partial product rows are stored in the 4-bit vectors n, x, and y.
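This step can be sketched as follows (a behavioral illustration, not the gate-level circuit; the operand values are arbitrary and the row names q, n, x, y follow the text):

```python
a, b = 0b1011, 0b0110          # 4-bit multiplicand and multiplier (11 and 6)
# rows[j][i] = a[i] AND b[j]: sixteen partial product bits in four rows
rows = [[(a >> i) & (b >> j) & 1 for i in range(4)] for j in range(4)]
q, n, x, y = rows              # the four 4-bit partial product vectors
# summing every bit at its weight 2**(i + j) recovers the product
product = sum(bit << (i + j) for j, row in enumerate(rows)
              for i, bit in enumerate(row))
assert product == a * b        # 11 * 6 = 66
```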
Fig.2.6 Booth partial products Generation
The second step of the multiplication reduces the partial products
from the preceding step into two numbers while preserving the
weighted sum. The sought-after product P is the sum of those two
numbers, which will be added during the third step. The
"Wallace tree" synthesis follows Dadda's algorithm, which
guarantees the minimum number of counters. If, on top of that, we
impose reducing as late as (or as soon as) possible, then the
solution is unique. The two binary numbers to be added during the
third step may also be seen as one number in carry-save (CSA) notation
(2 bits per digit).
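The reduction idea can be sketched with word-level full adders used as 3:2 counters (a behavioral illustration; a real Wallace/Dadda tree wires individual bit positions rather than whole words):

```python
def csa(x, y, z):
    """3:2 compressor on whole words: three addends -> (sum, carry)."""
    s = x ^ y ^ z                             # bitwise sum, no carries
    c = ((x & y) | (x & z) | (y & z)) << 1    # carries, shifted to their weight
    return s, c                               # invariant: x + y + z == s + c

def reduce_to_two(addends):
    """Wallace/Dadda-style reduction: repeatedly replace 3 rows by 2."""
    rows = list(addends)
    while len(rows) > 2:
        x, y, z = rows[:3]
        rows = list(csa(x, y, z)) + rows[3:]
    return rows                               # the final two rows for the CPA

rows = reduce_to_two([13, 250, 7, 99, 42])
assert sum(rows) == 13 + 250 + 7 + 99 + 42    # weighted sum is preserved
```

The final two rows are then summed once by a carry-propagate adder, which is the third step described above.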
Fig.2.7 Booth single partial product selector logic
2.5. Modified Booth Encoder:
Multiplication consists of three steps: 1) generating the
partial products; 2) adding the generated partial products until only
two rows remain; and 3) computing the final multiplication result by
adding the last two rows. The modified Booth algorithm halves the
number of partial products in the first step. We used the
modified Booth encoding (MBE) scheme proposed in the literature,
which is known as the most efficient Booth encoding and decoding
scheme. Multiplying X by Y with the modified Booth algorithm starts
by grouping Y into overlapping three-bit blocks and encoding each
block into one of {-2, -1, 0, 1, 2}. Table I shows the rules to generate
the encoded signals by the MBE scheme, and Fig. 2.8 shows the
corresponding logic diagram. The Booth decoder generates the
partial products using the encoded signals, as shown in Fig. 2.9.
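Putting the encoder and decoder together, a minimal behavioral model of the whole MBE multiplication might look like this (illustrative Python; the hardware computes the same sum with a Wallace tree and a final adder):

```python
def mbe_multiply(x, y, n=8):
    """Multiply x by an n-bit two's-complement y via modified Booth encoding."""
    y_ext = (y & ((1 << n) - 1)) << 1            # implicit 0 below the LSB
    acc = 0
    for k in range(n // 2):
        i = 2 * k
        b0 = (y_ext >> i) & 1
        b1 = (y_ext >> (i + 1)) & 1
        b2 = (y_ext >> (i + 2)) & 1
        digit = -2 * b2 + b1 + b0                # encoder output
        pp = {-2: -2 * x, -1: -x, 0: 0, 1: x, 2: 2 * x}[digit]  # decoder/selector
        acc += pp << i                           # partial product at weight 4**k
    return acc

# The n//2 partial products reproduce the exact signed product.
assert all(mbe_multiply(x, y) == x * y
           for x in (-128, -7, 0, 31, 127) for y in (-128, -3, 0, 5, 127))
```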
Fig.2.8 Booth Encoder
Fig.2.9 Booth Decoder
The figure shows the generated partial products and the sign-extension
scheme of the 8-bit modified Booth multiplier. The partial products
generated by the modified Booth algorithm are added in parallel
using the Wallace tree until only two rows remain. The
final multiplication result is generated by adding the last two
rows; a carry-propagate adder is usually used in this step.
Fig 2.10 Truth table for MBE Scheme
2.6. Spurious power suppression technique:
The figure shows the five cases of a 16-bit addition in which spurious
switching activities occur. The first case illustrates a transient state in which
spurious transitions of carry signals occur in the MSP even though the final result
of the MSP is unchanged. The second and third cases describe the situations of one
negative operand added to a positive operand, without and with a carry from the LSP,
respectively. The fourth and fifth cases demonstrate the addition
of two negative operands, without and with a carry-in from the LSP. In all these
cases, the result of the MSP is predictable; therefore, the computations in the MSP
are useless and can be neglected. The data are separated into the Most Significant
Part (MSP) and the Least Significant Part (LSP). To know whether the MSP affects
the computation result, we need a detection-logic unit that detects the effective
ranges of the inputs. The Boolean equations shown below express the behavioral
principles of the detection-logic unit in the MSP circuits of the SPST-based
adder/subtractor:
Figure 2. Spurious transition cases in multimedia/DSP processing
AMSP = A[15:8]; BMSP = B[15:8];
Aand = A[15]·A[14]·…·A[8]; Band = B[15]·B[14]·…·B[8];
Anor = ¬A[15]·¬A[14]·…·¬A[8]; Bnor = ¬B[15]·¬B[14]·…·¬B[8];
where A[m] and B[n] denote the mth bit of operand A and the nth bit of
operand B, respectively, and AMSP and BMSP denote the MSP parts,
i.e., the 9th to the 16th bits, of operands A and B. When the bits in AMSP
and/or those in BMSP are all ones, Aand and/or Band respectively
become one; when the bits in AMSP and/or those in BMSP are all zeros,
Anor and/or Bnor respectively turn into one. Being one of the three outputs of
the detection-logic unit, close denotes whether the MSP circuits can be neglected.
When the two input operands fall into one of the five classes shown in Figure 2,
the value of close becomes zero, which indicates that the MSP circuits can be
shut off. Figure 2 also shows that it is necessary to compensate the sign bit of
the computing results. Accordingly, we derive the Karnaugh maps which lead to
the Boolean equations (7) and (8) for the Carr_ctrl and sign signals, respectively.
In equations (7) and (8), CLSP denotes the carry propagated from the LSP circuits.
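A simplified behavioral model of the detection logic can clarify the equations above (illustrative Python; it assumes close should be zero exactly when both MSP bytes are pure sign extensions, i.e., all ones or all zeros, and omits the Carr_ctrl and sign outputs):

```python
def detect_close(A, B):
    """Simplified SPST detection for a 16-bit adder (behavioral sketch).

    Returns close = 0 when the MSP (bits 15:8) of both operands is a pure
    sign extension, so the MSP circuits may be shut off; 1 otherwise.
    """
    A_msp, B_msp = (A >> 8) & 0xFF, (B >> 8) & 0xFF
    Aand, Band = A_msp == 0xFF, B_msp == 0xFF    # MSP bits all ones
    Anor, Bnor = A_msp == 0x00, B_msp == 0x00    # MSP bits all zeros
    return 0 if (Aand or Anor) and (Band or Bnor) else 1

assert detect_close(0x0023, 0x0011) == 0   # both MSPs zero: shut off the MSP
assert detect_close(0xFF80, 0x0011) == 0   # sign-extended negative + small positive
assert detect_close(0x1234, 0x0011) == 1   # A's MSP carries real information
```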
Fig. shows a 16-bit adder/subtractor design example adopting the proposed
SPST. In this example, the 16-bit adder/subtractor is divided into the MSP and
the LSP between the eighth and the ninth bits. Latches implemented by simple AND
gates are used to control the input data of the MSP. When the MSP is necessary,
its input data remain unchanged. However, when the MSP is negligible, its input
data become zeros to avoid glitching power consumption. The two operands of
the MSP enter the detection-logic unit in addition to the adder/subtractor, so
that the detection-logic unit can decide whether to turn off the MSP. Based on the
derived Boolean equations (1) to (8), the detection-logic unit of the SPST is shown
in Fig. 6(a); it determines whether the input data of the MSP should be latched.
Moreover, we propose a novel glitch-diminishing technique that adds three 1-bit
registers to control the assertion of the close, sign, and Carr_ctrl signals,
further decreasing the transient signals occurring in the cascaded circuits
usually adopted in VLSI architectures designed for multimedia/DSP applications.
The timing diagram is shown in Fig. 6(b). A certain amount of delay is used to
assert the close, sign, and Carr_ctrl signals after the period of data transition,
which is achieved by controlling the three 1-bit registers at the outputs of the
detection-logic unit.
Hence, the transients of the detection-logic unit can be filtered out; thus, the
data latches shown in the figure can prevent the glitch signals from flowing into
the MSP at tiny cost. The data transient time and the earliest required time of
all the inputs are also illustrated. The delay should be set within the range
shown as the shaded area in the figure, so as to filter out the glitch signals
while keeping the computation results correct. Based on Figs. 5 and 6, the timing
behavior of the SPST is analyzed as follows.
2.6.1. When the detection-logic unit turns off the MSP:
At this moment, the outputs of the MSP are directly compensated by the SE unit;
therefore, the time saved from skipping the computations in the MSP circuits shall
cancel out the delay caused by the detection-logic unit.
2.6.2. When the detection-logic unit turns on the MSP:
The MSP circuits must wait for the notification of the detection-logic unit to turn on
the data latches to let the data in. Hence, the delay caused by the detection-logic unit
will contribute to the delay of the whole combinational circuitry, i.e., the 16-bit
adder/subtractor in this design example.
2.6.3. When the detection-logic unit retains its decision:
No matter whether the last decision is turning on or turning off the MSP, the
delay of the detection logic is negligible because the path of the combinational
circuitry (i.e., the 16-bit adder/subtractor in this design example) remains the same.
From the foregoing analysis, we know that the total delay is affected only when
the detection-logic unit turns on the MSP. Therefore, the detection-logic unit
should be a speed-oriented design. When the SPST is applied to combinational
circuits, we should first determine the longest transitions of the cross sections
of interest in each combinational circuit; this is a timing characteristic that
also depends on the adopted technology. The longest transitions can be obtained
by analyzing the timing differences between the earliest-arriving and the
latest-arriving signals at the cross sections of a combinational circuit.
CHAPTER 3
Results and Discussion
3.1. Simulation Results of VMFU:
3.1.1 Partial products Generators:
3.1.2 Booth Encoder:
3.1.3 Carry-Save Adder:
3.1.4 Versatile Multimedia Functional Unit:
3.2. Introduction to FPGA
FPGA stands for Field Programmable Gate Array, which has an array of logic
modules, I/O modules, and routing tracks (programmable interconnect). An FPGA can
be configured by the end user to implement specific circuitry. Speeds were once
limited to around 100 MHz, but present devices operate much faster.
Main applications are DSP, FPGA-based computers, logic emulation, and ASIC/ASSP
prototyping. FPGAs are mainly programmed using SRAM (Static Random Access
Memory) cells. SRAM is volatile, and the main advantage of SRAM programming
technology is re-configurability. Issues in FPGA technology include the complexity
of the logic element, clock support, I/O support, and interconnections (routing).
3.2.1. FPGA Design Flow:
An FPGA contains a two-dimensional array of logic blocks and interconnections
between the logic blocks. Both the logic blocks and the interconnects are
programmable. Logic blocks are programmed to implement a desired function, and
the interconnects are programmed using switch boxes to connect the logic blocks.
To be clearer: if we want to implement a complex design (a CPU, for
instance), the design is divided into small sub-functions, and each sub-function
is implemented using one logic block. Then, to obtain the desired design (the CPU),
all the sub-functions implemented in logic blocks must be connected, and this is
done by programming the interconnects. The internal structure of an FPGA is
depicted in the following figure.
FPGAs, an alternative to custom ICs, can be used to implement an entire
System on one Chip (SoC). The main advantage of an FPGA is the ability to
reprogram it: the user can reprogram an FPGA to implement a design after the FPGA
is manufactured, hence the name "Field-Programmable."
Custom ICs are expensive and take a long time to design, so they are worthwhile
only when produced in bulk. FPGAs, in contrast, are easy to implement within a
short time with the help of Computer-Aided Design (CAD) tools, because there is no
physical layout process, no mask making, and no IC manufacturing.
Some disadvantages of FPGAs are that they are slower than custom ICs, cannot
handle very complex designs, and draw more power.
A Xilinx logic block consists of one Look-Up Table (LUT) and one flip-flop. An
LUT is used to implement any of a number of different functions. The input lines
to the logic block go into the LUT and address it. The output of the LUT gives
the result of the logic function it implements, and the output of the logic block
is either the registered or the unregistered output of the LUT.
SRAM is used to implement an LUT: a k-input logic function is implemented
using a 2^k x 1 SRAM. The number of different possible functions for a k-input
LUT is 2^(2^k). The advantage of such an architecture is that it supports the
implementation of very many logic functions; the disadvantage is the unusually
large number of memory cells required when the number of inputs is large.
The figure below shows a 4-input LUT-based implementation of a logic block.
LUT-based design provides good logic-block utilization. A k-input LUT-based
logic block can be implemented in a number of different ways, with trade-offs
between performance and logic density.
An n-LUT can be seen as a direct implementation of a function truth table: each
latch holds the value of the function for one input combination. For example, a
2-LUT can implement any of the 16 two-input functions, such as AND, OR,
A + NOT B, etc.
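The LUT-as-truth-table idea can be sketched as follows (an illustrative software model; `table_bits` stands in for the contents of the 2^k SRAM cells):

```python
class LUT:
    """A k-input LUT modeled as a 2**k-bit truth table (one SRAM bit each)."""
    def __init__(self, k, table_bits):
        self.k = k
        self.table = table_bits          # integer whose bit i is the output
                                         # for input combination i
    def __call__(self, *inputs):
        addr = sum(bit << i for i, bit in enumerate(inputs))  # inputs form the address
        return (self.table >> addr) & 1

# Two of the 2**(2**2) = 16 possible 2-input functions:
and2 = LUT(2, 0b1000)   # output 1 only for input combination (1, 1)
or2  = LUT(2, 0b1110)   # output 1 for every combination except (0, 0)
assert [and2(a, b) for a, b in ((0, 0), (0, 1), (1, 0), (1, 1))] == [0, 0, 0, 1]
assert or2(1, 0) == 1
```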
Interconnects
A wire segment can be described as two end points of an interconnect with no
programmable switch between them. A sequence of one or more wire segments in an
FPGA is termed a track.
Typically an FPGA has logic blocks, interconnects, switch blocks, and
Input/Output blocks. Switch blocks lie at the periphery of the logic blocks and
interconnect; wire segments are connected to logic blocks through switch blocks.
Depending on the required design, one logic block is connected to another, and so on.
FPGA DESIGN FLOW
In this part we give a short introduction to the FPGA design flow. A
simplified version of the flow is given in the following diagram.
FPGA Design Flow
Design Entry
There are different techniques for design entry: schematic-based, Hardware
Description Language (HDL), and combinations of both. Selection of a method
depends on the design and the designer. If the designer wants to deal more
directly with the hardware, then schematic entry is the better choice. When the
design is complex, or the designer thinks about the design in an algorithmic way,
then HDL is the better choice. Language-based entry is faster but lags in
performance and density.
HDLs represent a level of abstraction that can isolate designers from the
details of the hardware implementation. Schematic-based entry gives designers
much more visibility into the hardware, so it is the better choice for those who
are hardware oriented. Another, rarely used, method is state machines; it is the
better choice for designers who think of the design as a series of states, but
the tools for state-machine entry are limited. In this document we deal with
HDL-based design entry.
Synthesis
Synthesis is the process which translates VHDL or Verilog code into a device
netlist format, i.e., a complete circuit with logical elements (gates, flip-flops,
etc.) for the design. If the design contains more than one sub-design (for
example, to implement a processor we need a CPU as one design element and a RAM
as another), then the synthesis process generates a netlist for each design
element. The synthesis process checks the code syntax and analyzes the hierarchy
of the design, which ensures that the design is optimized for the architecture
the designer has selected. The resulting netlist(s) is saved to an NGC (Native
Generic Circuit) file (for Xilinx® Synthesis Technology (XST)).
FPGA Synthesis
Implementation
This process consists of a sequence of three steps:
1. Translate
2. Map
3. Place and Route
Translate:
This process combines all the input netlists and constraints into a logic design
file. This information is saved as an NGD (Native Generic Database) file, which
is produced by the NGDBuild program. Here, defining constraints means assigning
the ports in the design to the physical elements (e.g., pins, switches, buttons)
of the targeted device and specifying the timing requirements of the design. This
information is stored in a file named UCF (User Constraints File). Tools used to
create or modify the UCF are PACE, the Constraints Editor, etc.
FPGA Translate
Map
This process divides the whole circuit of logical elements into sub-blocks such
that they fit into the FPGA logic blocks. That is, the map process fits the logic
defined by the NGD file into the targeted FPGA elements (Configurable Logic
Blocks (CLBs) and Input/Output Blocks (IOBs)) and generates an NCD (Native
Circuit Description) file which physically represents the design mapped to the
components of the FPGA. The MAP program is used for this purpose.
FPGA map
Place and Route:
The PAR program is used for this process. The place-and-route process places the
sub-blocks from the map process into logic blocks according to the constraints,
and connects the logic blocks. For example, if a sub-block is placed in a logic
block very near an I/O pin, it may save time but may affect some other
constraint; the trade-off between all the constraints is taken into account by
the place-and-route process.
The PAR tool takes the mapped NCD file as input and produces a completely routed
NCD file as output, which contains the routing information.
FPGA Place and route
Device Programming:
Now the design must be loaded onto the FPGA, but first it must be converted to a
format the FPGA can accept. The BITGEN program handles this conversion: the
routed NCD file is given to BITGEN to generate a bitstream (a .BIT file), which
can be used to configure the target FPGA device. This is done using a programming
cable; the choice of cable depends on the setup.
Behavioral Simulation (RTL Simulation):
This is the first of the simulation steps encountered throughout the design
flow. It is performed before the synthesis process to verify the RTL (behavioral)
code and to confirm that the design functions as intended. Behavioral simulation
can be performed on either VHDL or Verilog designs. In this process, signals and
variables are observed, procedures and functions are traced, and breakpoints are
set. This is a very fast simulation, allowing the designer to change the HDL code
quickly if the required functionality is not met. Since the design is not yet
synthesized to gate level, timing and resource-usage properties are still
unknown.
3.4 Synthesis Result
The developed MAC design is simulated and its functionality verified. Once
functional verification is done, the RTL model is taken through the synthesis
process using the Xilinx ISE tool. In the synthesis process, the RTL model is
converted to a gate-level netlist mapped to a specific technology library. This
MAC design was synthesized for the Spartan-3E family.
Within the Spartan-3E family, many different devices are available in the
Xilinx ISE tool. To synthesize this design, the device "XC3S500E" was chosen,
with the package "FG320" and the speed grade "-4". The MAC design was synthesized
and its results analyzed as follows.
Device utilization summary:
This device utilization includes the following.
Logic Utilization
Logic Distribution
Total Gate count for the Design
The device utilization summary is shown above; it gives the number of resources
used out of those available in the chosen device and package, also expressed as
a percentage. This summary is the result of the synthesis process for the
selected device and package.
Timing Summary:
Speed Grade: -4
Minimum period: 35.100ns (Maximum Frequency: 28.490MHz)
Minimum input arrival time before clock: 23.605ns
Maximum output required time after clock: 4.283ns
Maximum combinational path delay: No path found
In the timing summary, the reported period and frequency are approximate at
synthesis time; the exact timing summary is obtained after place and route. The
maximum operating frequency of this synthesized design is 28.490 MHz, and the
minimum period is 35.100 ns. Here, OFFSET IN is the minimum input arrival time
before the clock, and OFFSET OUT is the maximum output required time after the
clock.
RTL Schematic
The RTL (Register Transfer Level) view can be seen as a black box after the
design is synthesized. It shows the inputs and outputs of the system. By
double-clicking on the diagram we can see the gates, flip-flops, and MUXes.
Figure 3.6 Schematic with Basic Inputs and Output
Figure 3.7 Schematic of Booth Encoder with SPST Adder
3.4 Summary
The developed VMFU design is modelled and is simulated using the
Modelsim tool.
The simulation results are discussed by considering different cases.
The RTL model is synthesized using the Xilinx tool in Spartan 3E and their
synthesis results were discussed with the help of generated reports.
CHAPTER 4
CONCLUSION
This work presents a versatile multimedia functional unit
designed with the low-power SPST technique: a 16x16 multiplier-
accumulator (MAC) supporting addition, subtraction, sum of absolute
differences, and interpolation. A radix-4 modified Booth multiplier
circuit is used for the MAC architecture. Compared to other circuits,
the Booth multiplier has the highest operational speed and a lower
hardware count. The basic building blocks of the VMFU are identified,
and each block is analyzed for its performance; power and delay
are calculated for the blocks. The MAC unit is designed with an enable
signal to reduce the total power consumption based on a block-enable
technique. Using this block, the N-bit MAC unit is constructed and
the total power consumption is calculated for the MAC unit.
This work presented the low-power SPST technique and explored
its applications in multimedia/DSP computations, where the
theoretical analysis and the realization issues of the SPST are fully
discussed. The proposed SPST can markedly decrease the switching
(dynamic) power dissipation, which comprises a significant
portion of the total power dissipation in integrated circuits.
Moreover, the proposed SPST achieves a 24% saving in power
consumption at the expense of only a 10% area overhead for the
proposed VMFU.