  • Prasann D. Kulkarni, et al International Journal of Computer and Electronics Research [Volume 2, Issue 2, April 2013]

    http://ijcer.org ISSN: 2278-5795 Page 94

    LOW POWER ADD AND SHIFT MULTIPLIER DESIGN

    BZFAD ARCHITECTURE

    Prof. Prasann D. Kulkarni (1), Prof. S. P. Deshpande (2), Dr. G. R. Udupi (3)

    (1) Lecturer, Dept. of E&CE, KLS's VDRIT, Haliyal, India

    (2) Asst. Prof., Dept. of E&CE, KLS's G.I.T, Belgaum, India

    (3) Principal, KLS's VDRIT, Haliyal, India

    [email protected], [email protected], [email protected]

    Abstract - A multiplier is one of the key hardware blocks in most digital and high-performance systems such as FIR filters, digital signal processors and microprocessors. With advances in technology, many researchers have tried, and are still trying, to design multipliers that offer high speed, low power consumption, regularity of layout (and hence less area), or a combination of these, making them suitable for various high-speed, low-power and compact VLSI implementations. However, area and speed are two conflicting constraints, so improving speed generally results in larger area. Here we try to find the best trade-off between them. Multiplication proceeds in two basic steps: partial product generation and then addition. Hence we first consider the design of the Wallace tree multiplier, followed by the Booth-Wallace multiplier, and compare their speed and power consumption.

    Motivation - As the scale of integration keeps growing, more and more sophisticated signal processing systems are being implemented on a VLSI chip. These signal processing applications not only demand great computation capacity but also consume a considerable amount of energy. While performance and area remain the two major design criteria, power consumption has become a critical concern in today's VLSI system design. The need for low-power VLSI systems arises from two main forces. First, with the steady growth of operating frequency and processing capacity per chip, large currents have to be delivered and the heat due to large power consumption must be removed by proper cooling techniques. Second, battery life in portable electronic devices is limited. Low-power design directly leads to prolonged operation time in these portable devices.

    Multiplication is a fundamental operation in most signal processing algorithms. Multipliers have large area and long latency, and consume considerable power. Therefore low-power multiplier design has been an important part of low-power VLSI system design. A system's performance is generally determined by the performance of the multiplier, because the multiplier is generally the slowest element in the system. Furthermore, it is generally the most area-consuming element. Hence, optimizing the speed and area of the multiplier is a major design issue. However, area and speed are usually conflicting constraints, so improving speed mostly results in larger area.

    We study different adders and compare them in order to judge which adder is best suited to a given situation.

    The Ripple Carry Adder has a smaller area but a lower speed.

    Carry Select Adders are high-speed but possess a larger area.

    The Carry Look Ahead Adder lies between the two, offering a reasonable trade-off between time and area complexity.

    Coming to multipliers, we consider different designs, from the array multiplier to the Wallace tree and the Booth multipliers, both Radix-2 and Radix-4.

    The array multiplier is the worst case, consuming the highest amount of power. Next comes the Radix-2 Booth multiplier, which consumes less power than the array multiplier. The Wallace tree multiplier and the Radix-4 Booth multiplier have nearly the same delay, while the Radix-4 Booth multiplier consumes less power than the Wallace tree. Hence we conclude that the Radix-4 Booth multiplier is best suited to low-power applications. However, the benefit achieved comes at the expense of increased hardware complexity: this implementation requires hardware for the encoding and for the selection of the partial products. Among other multipliers, shift-and-add multipliers have been used in many applications for their simplicity and relatively small area requirement. The BZ-FAD architecture optimizes both power and area.

    Table 1: Comparison of adders

    Adder                     Delay for n bit    Area for n bit    Area-delay product
    Ripple carry adder        $2n$               $7n$              $14n^2$
    Carry select adder        $2.8\,n^{1/2}$     $14n$             $39.6\,n^{3/2}$
    Carry look ahead adder    $4\log_2 n$        $4n$              $16\,n\log_2 n$
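To get a feel for the Table 1 expressions at a concrete word length, the sketch below simply evaluates them for a given n. The function name `adder_costs` and the choice n = 16 are illustrative assumptions; the coefficients are copied from the table and are in its normalized units, not measured values.

```python
import math

def adder_costs(n):
    """Evaluate the Table 1 expressions for an n-bit adder.

    Returns {adder name: (delay, area, area-delay product)} in the
    table's normalized units.
    """
    return {
        "ripple carry": (2 * n, 7 * n, 14 * n ** 2),
        "carry select": (2.8 * math.sqrt(n), 14 * n, 39.6 * n ** 1.5),
        "carry look ahead": (4 * math.log2(n), 4 * n, 16 * n * math.log2(n)),
    }

# Hypothetical word length chosen only for illustration.
costs = adder_costs(16)
for name, (delay, area, product) in costs.items():
    print(f"{name:17s} delay={delay:6.1f}  area={area:4d}  area*delay={product:8.1f}")
```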

    Table 2: Comparison of multipliers

    Multiplier                  Power consumption    Speed
    Array multiplier            High                 Limited
    Radix-2 Booth multiplier    Less than array      Moderate
    Radix-4 Booth multiplier    Less than other      Highest
    Wallace tree multiplier     Less than radix-2    High

    1. INTRODUCTION

    Power dissipation of VLSI chips was traditionally a neglected subject. In the past, device density and operating frequency were low enough that power was not a constraining factor in chips. As the scale of integration improves, more transistors, faster and smaller than their predecessors, are being packed into a chip. This leads to steady growth of the operating frequency and processing capacity per chip, resulting in increased power dissipation.

    The power consumption of a digital CMOS circuit can be described by

    $P_{avg} = P_{dynamic} + P_{shortcircuit} + P_{leakage} + P_{static}$    (1)

    The dynamic power dissipation is caused by the charging and discharging of capacitances in the circuit. The short-circuit power consumption is caused by the current flowing through the direct path that exists between the power supply and ground during the transition phase. The n-MOS and p-MOS transistors used in a CMOS logic circuit commonly have non-zero reverse leakage and sub-threshold currents, which give rise to the leakage power. A multiplier manipulates two input operands to generate many partial products for subsequent addition operations, which in a CMOS circuit requires many switching activities. The switching activity within the functional unit of a multiplier accounts for the majority of its power dissipation, as given by

    $P_{switching} = \alpha \, C \, V_{dd}^{2} \, f_{clk}$    (2)

    where $\alpha$ is the switching activity parameter, $C$ is the loading capacitance, $V_{dd}$ is the operating voltage and $f_{clk}$ is the operating frequency.
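As a quick numerical illustration of equation (2), the sketch below evaluates the switching power for two activity factors. The function name `switching_power` and all component values (10 pF, 1.8 V, 50 MHz) are made-up assumptions for illustration, not figures from this work; the point is only that the power scales linearly with the switching activity.

```python
def switching_power(alpha, c_load, vdd, f_clk):
    """Equation (2): P = alpha * C * Vdd^2 * f_clk (watts for SI inputs)."""
    return alpha * c_load * vdd ** 2 * f_clk

# Hypothetical values: 10 pF switched capacitance, 1.8 V supply, 50 MHz clock.
base = switching_power(alpha=0.5, c_load=10e-12, vdd=1.8, f_clk=50e6)
reduced = switching_power(alpha=0.25, c_load=10e-12, vdd=1.8, f_clk=50e6)

print(f"alpha = 0.50: {base * 1e3:.3f} mW")
print(f"alpha = 0.25: {reduced * 1e3:.3f} mW")
```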

    Shift-and-add multiplication is similar to multiplication performed with paper and pencil. Let X denote the multiplicand and Y the multiplier. To multiply two numbers by paper and pencil, the digits of the multiplier are taken one at a time from right to left; the multiplicand is multiplied by that single digit and the intermediate product is placed in the appropriate position, to the left of the earlier results. To perform all the operations needed for the final product, the conventional architecture for shift-and-add multipliers requires many switching activities, so its dynamic power dissipation is high. By eliminating or reducing the sources of switching activity in the conventional multiplier, a low-power multiplier architecture can be derived. Since the multiplier is a functional component of many digital systems, its power dissipation should be reduced as much as possible.
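The paper-and-pencil procedure can be sketched in a few lines for unsigned binary operands. This is a software illustration only, not the hardware datapath; the function name `shift_and_add` and the parameter `k` (operand width in bits) are hypothetical.

```python
def shift_and_add(a, b, k):
    """Multiply two unsigned k-bit integers by scanning b from its least significant bit."""
    product = 0
    for i in range(k):
        if (b >> i) & 1:          # multiplier bit i is 1: add the shifted multiplicand
            product += a << i     # the shift places the partial product in position
    return product

print(shift_and_add(13, 11, 4))
```

In binary each multiplier digit is either 0 or 1, so every intermediate product is either zero or a shifted copy of the multiplicand.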


    BZFAD

    A low-power structure called BZ-FAD (Bypass Zero, Feed A Directly) for shift-and-add multipliers is proposed. The architecture considerably lowers the switching activity of conventional multipliers. The modifications to the multiplier, which multiplies A by B, include removal of the shifting of the B register, direct feeding of A to the adder, bypassing the adder whenever possible, use of a ring counter instead of a binary counter, and removal of the partial product shift. The architecture makes use of a low-power ring counter proposed in this work. Simulation results for 32-bit radix-2 multipliers show that the BZ-FAD architecture lowers the total switching activity by up to 76% and the power consumption by up to 30% compared with the conventional architecture. The proposed multiplier can be used in low-power applications where speed is not a primary design parameter.

    The rest of the paper is organized as follows. Section II briefly reviews the background of the conventional shift-and-add multiplier. Section III describes the architecture of the low-power multiplier. Section IV describes the low-power ring counter architecture. Results are discussed in Section V, and the conclusion is given in the last section.

    2. TYPES OF ADDERS

    Addition is the most common arithmetic operation in microprocessors, digital signal processors and digital computers in general. It also serves as a building block for synthesizing all other arithmetic operations. Therefore, for the efficient implementation of an arithmetic unit, the binary adder structure is a very critical hardware unit. Although much research has dealt with binary adder structures, only a few studies are based on their comparative performance analysis.

    With respect to asymptotic delay time and area complexity, the binary adder architectures can be categorized into four primary classes, as given below.

    2.1 Ripple Carry Adder (RCA)

    The ripple carry adder is the best-known adder architecture. It is constructed by cascading full adder blocks in series, as shown in Figure 1: the carry-out of one stage is fed directly to the carry-in of the next stage. An n-bit parallel adder requires n full adders.

    Figure 1: A 4-bit Ripple Carry Adder

    Logic equations

    $g_i = a_i b_i$,   $p_i = a_i \oplus b_i$

    $c_{i+1} = g_i + p_i c_i$,   $s_i = p_i \oplus c_i$
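The ripple carry logic equations can be checked with a small bit-level model. The sketch below is a behavioural illustration, not synthesizable hardware; the function name `ripple_carry_add` and its argument names are assumptions.

```python
def ripple_carry_add(a, b, n, carry_in=0):
    """Add two n-bit unsigned integers one bit at a time; returns (sum, carry_out)."""
    carry = carry_in
    total = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        g = ai & bi                  # generate:  g_i = a_i b_i
        p = ai ^ bi                  # propagate: p_i = a_i xor b_i
        total |= (p ^ carry) << i    # s_i = p_i xor c_i
        carry = g | (p & carry)      # c_{i+1} = g_i + p_i c_i
    return total, carry

print(ripple_carry_add(0b1011, 0b0110, 4))
```

Each iteration needs the carry produced by the previous one, which is why the delay grows linearly with the number of bits.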

    Complexity and delay for an n-bit RCA structure

    $A_{RCA} = O(n) = 7n$

    $T_{RCA} = O(n) = 2n$

    Not very efficient when numbers with a large number of bits are used.

    Delay increases linearly with the bit length.

    2.2 Carry Select Adder (CSLA)

    In the carry select adder scheme, blocks of bits are added in two ways: one assuming a carry-in of 0 and the other assuming a carry-in of 1. This results in two precomputed sum and carry-out signal pairs $(s^0_{i-1:k}, c^0_i;\ s^1_{i-1:k}, c^1_i)$. Later, when the block's true carry-in $c_k$ becomes known, the correct signal pair is selected. Generally multiplexers are used to propagate the carries.

    Figure 2: A Carry Select Adder with 1 level using n/2-bit RCA

    Logic equations

    $s_{i-1:k} = c_k'\, s^0_{i-1:k} + c_k\, s^1_{i-1:k}$

    $c_i = c_k'\, c^0_i + c_k\, c^1_i$
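The selection step can be illustrated with a one-level behavioural model. In this sketch the helper `block_add` stands in for the ripple carry adder inside each block, and the block width of 4 bits is an arbitrary assumption; both names are hypothetical.

```python
def block_add(a, b, width, carry_in):
    """Reference adder for one block; stands in for a ripple carry adder."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width

def carry_select_add(a, b, n, block=4):
    """Add two n-bit unsigned integers with a one-level carry select scheme."""
    result, carry = 0, 0
    for k in range(0, n, block):
        width = min(block, n - k)
        mask = (1 << width) - 1
        a_blk, b_blk = (a >> k) & mask, (b >> k) & mask
        s0, c0 = block_add(a_blk, b_blk, width, 0)   # precomputed for carry-in 0
        s1, c1 = block_add(a_blk, b_blk, width, 1)   # precomputed for carry-in 1
        s, carry = (s1, c1) if carry else (s0, c0)   # multiplexer driven by c_k
        result |= s << k
    return result, carry

print(carry_select_add(200, 100, 8))
```

In hardware both candidate results of every block are computed in parallel, so only the multiplexer selection waits for the incoming carry.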

    Complexity and delay for an n-bit CSLA structure

    $A_{CSLA} = O(n) = 14n$

    $T_{CSLA} = O(n^{1/(l+1)}) = 2.8\,n^{1/2}$

    Because of the multiplexers, a larger area is required.

    It has a smaller delay than the ripple carry adder (half the delay of the RCA).

    Hence we always go for the carry select adder when working with a small number of bits.

    2.3 Carry Look Ahead Adder (CLA)

    The carry look ahead adder can produce carries faster because the carry bits are generated in parallel by additional circuitry whenever the inputs change. This technique uses carry bypass logic to speed up the carry propagation.

    Figure 3: 4-bit CLA

    Logic equations

    Let $a_i$ and $b_i$ be the augend and addend inputs, $c_i$ the carry input, and $s_i$ and $c_{i+1}$ the sum and carry-out of the i-th bit position. With the auxiliary functions $p_i$ and $g_i$, called the propagate and generate signals, the sum and carry outputs are defined as follows.

    $p_i = a_i + b_i$,   $g_i = a_i b_i$

    $s_i = a_i \oplus b_i \oplus c_i$,   $c_{i+1} = g_i + p_i c_i$

    As the number of bits in a carry look ahead adder increases, the complexity increases because the number of gates in the expression for $c_{i+1}$ grows. In practice it is therefore not desirable to use the traditional CLA shown above, because it increases both the area and the power required.

    Instead we use smaller carry look ahead adders, arranged in levels, to create a larger CLA. The smaller CLA is commonly a 4-bit CLA, so we define carry look ahead over a group of 4 bits. We redefine carry generate as the group generated carry $g_{[i,i+3]}$ and carry propagate as the group propagated carry $p_{[i,i+3]}$, which are defined below.

    Redefined equations

    $g_{[i,i+3]} = g_{i+3} + g_{i+2}\,p_{i+3} + g_{i+1}\,p_{i+2}\,p_{i+3} + g_i\,p_{i+1}\,p_{i+2}\,p_{i+3}$

    $p_{[i,i+3]} = p_i\,p_{i+1}\,p_{i+2}\,p_{i+3}$
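The group signals can be checked against ordinary addition with a short model. The sketch uses the definitions above ($p_i$ as OR, $g_i$ as AND) together with the standard relation that the carry out of a group is the group generate OR-ed with the group propagate AND the group carry-in; that relation is implied by the bit-level carry equation rather than stated explicitly in the text. The function names are hypothetical.

```python
def group_signals(a, b, i):
    """Return (G, P) for the 4-bit group starting at bit i, with p = a OR b and g = a AND b."""
    g = [((a >> (i + j)) & 1) & ((b >> (i + j)) & 1) for j in range(4)]
    p = [((a >> (i + j)) & 1) | ((b >> (i + j)) & 1) for j in range(4)]
    group_g = g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3]) | (g[0] & p[1] & p[2] & p[3])
    group_p = p[0] & p[1] & p[2] & p[3]
    return group_g, group_p

def carry_out_of_group(a, b, i, carry_in):
    """Carry into bit i+4, computed without rippling through the four bit positions."""
    group_g, group_p = group_signals(a, b, i)
    return group_g | (group_p & carry_in)

print(carry_out_of_group(0b1001, 0b0111, 0, 0))
```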

    The modified block diagram for an 8-bit carry look ahead adder built from levels of 4-bit CLAs is shown in Figure 4.

    Figure 4: 8-BIT Carry Look Ahead Generator (using 2-bit CLA)

    Complexity and delay for an n-bit CLA structure

    $A_{CLA} = O(n) = 14n$

    $T_{CLA} = O(\log n) = 4\log_2 n$

    3. TYPES OF MULTIPLIERS

    3.1. Wallace Tree Multiplier

    The Wallace tree multiplier is considerably faster than a simple array multiplier because its height is logarithmic in the word size, not linear. However, in addition to the large number of adders required, the Wallace tree's wiring is much less regular and more complicated. As a result, Wallace trees are often avoided by designers when design complexity is a concern.

    Wallace tree styles use a log-depth tree network for reduction. Faster but irregular, they trade ease of layout for speed. Wallace tree styles are generally avoided for low-power applications, since the excess wiring is likely to consume extra power.

    While substantially faster than the carry-save structure for large multipliers, the Wallace tree multiplier has the disadvantage of being very irregular, which complicates the task of coming up with an efficient layout.

    Figure 5: Wallace Tree Block Diagram


    A three-step process is used to multiply two numbers:

    Formation of bit products.

    Reduction of the bit product matrix into a two-row matrix by means of carry save adders.

    Summation of the remaining two rows using a fast Carry Look Ahead Adder (CLA).
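The three steps can be sketched at word level as follows. This is a simplified model: each row is held as an integer and a 3:2 carry save compressor is applied to whole rows, whereas a real Wallace tree groups individual bits column by column. The function name `wallace_multiply` is hypothetical, and the final `sum` merely stands in for the carry look ahead adder.

```python
def wallace_multiply(a, b, n):
    """Unsigned n-bit multiply: bit products, 3:2 carry save reduction, final add."""
    # Step 1: one row of bit products per multiplier bit.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    # Step 2: compress three rows into two until only two rows remain.
    while len(rows) > 2:
        reduced = []
        for j in range(0, len(rows) - 2, 3):
            x, y, z = rows[j], rows[j + 1], rows[j + 2]
            reduced.append(x ^ y ^ z)                            # sum bits
            reduced.append(((x & y) | (x & z) | (y & z)) << 1)   # carry bits
        reduced.extend(rows[len(rows) - len(rows) % 3:])         # leftover rows pass through
        rows = reduced
    # Step 3: a single carry-propagating addition of the last two rows.
    return sum(rows)

print(wallace_multiply(13, 11, 4))
```

No carry propagates inside step 2, so the number of reduction levels grows only logarithmically with the number of rows.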

    3.2 Booth's Multiplier

    Though Wallace tree multipliers are faster than the traditional carry save method, they are also very irregular and hence complicate layout. When the operand width grows beyond 32 bits, a large number of logic gates is required, and with them more interconnecting wires, which makes the chip design large and slows down the operating speed.

    The Booth multiplier can be used in different modes such as radix-2, radix-4, radix-8, etc. We decided to use the radix-4 Booth algorithm because it reduces the number of partial products to n/2.

    3.2.1. Booth Multiplication Algorithm (Radix 4)

    One way of realizing high-speed multipliers is to enhance parallelism, which helps to decrease the number of subsequent calculation stages. The original version of Booth's multiplier (radix 2) had two drawbacks.

    The number of add/subtract operations is variable, which is inconvenient when designing parallel multipliers.

    The algorithm becomes inefficient when there are isolated 1s.

    These problems are overcome by the radix-4 Booth algorithm, which scans strings of three bits. The Booth multiplier designed in this project consists of four Modified Booth Encoders (MBE), four sign extension correctors, four partial product generators (each comprising a 5:1 multiplexer) and finally a Wallace tree adder. This Booth multiplier technique increases speed by reducing the number of partial products by half. Since an 8-bit Booth multiplier is used in this project, only four partial products need to be added instead of the eight partial products generated by a conventional multiplier. The architecture of the modified Booth algorithm used in this project is shown in Figure 6.
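The three-bit scanning can be illustrated with the standard radix-4 Booth recoding, in which each overlapping triplet of multiplier bits maps to a digit in {-2, -1, 0, 1, 2} (five choices, matching a 5:1 multiplexer per partial product). The sketch below is a software model of that textbook recoding, not the encoder circuit of this project; the function names are hypothetical and n is assumed to be even.

```python
def booth_radix4_digits(b, n):
    """Recode an n-bit two's complement pattern b (n even) into n/2 digits in -2..2."""
    digits = []
    prev = 0                                  # implicit bit to the right of bit 0
    for i in range(0, n, 2):
        b0 = (b >> i) & 1
        b1 = (b >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)     # value of the triplet (b1, b0, prev)
        prev = b1
    return digits

def booth_multiply(a, b, n):
    """Multiply signed a by the n-bit two's complement pattern b using n/2 partial products."""
    return sum(d * a * 4 ** j for j, d in enumerate(booth_radix4_digits(b, n)))

print(booth_radix4_digits(0b01100110, 8))
print(booth_multiply(13, -3 & 0xFF, 8))
```

For an 8-bit multiplier the recoding yields four digits, hence four partial products.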

    Figure 6: Architecture of designed Booth Multiplier.

    4. CONVENTIONAL SHIFT & ADD MULTIPLIER

    Figure 7 shows the architecture of a conventional shift-and-add multiplier. The dashed ovals show the major sources of switching activity. The multiplier is shifted in each cycle, and the bit shifted out of register B is connected to the select pin of the multiplexer mux_A. As the select signal changes, the output of mux_A also changes, which causes the adder to operate. The partial product must be shifted in every cycle. The counter checks whether the required number of operations has been performed. The major sources of switching activity are summarized below.

    Shifting of the B register

    Activity in the counter

    Activity in the adder

    Switching between 0 and A in the multiplexer

    Activity in the multiplexer select

    Shifting of the partial product register

    By eliminating or reducing the switching activity described above, a low-power architecture can be derived.

    Figure 7: Architecture of conventional shift and add multiplier with major sources of switching activity.


    4.1 State Diagram

    Figure 8: Conventional add shift multiplier state diagram

    5. THE PROPOSED LOW POWER MULTIPLIER: BZ-FAD

    5.1 Architecture

    To derive a low-power architecture, we concentrate our effort on eliminating or reducing the sources of switching activity discussed in the previous section. The proposed architecture, which is shown in Figure 11, is called BZ-FAD.

    5.1.1 Shift of the B Register

    An example of shifting of the register is shown here.

    Figure 9: Shift and add multiplication example

    In the traditional architecture (see Figure 9), B(0) is used to decide between A and 0 when generating the partial product. If the bit is 1, A should be added to the previous partial product, whereas if it is 0, no addition is needed to generate the partial product. Hence, in each cycle, register B must be shifted to the right so that its next bit appears at B(0); this operation gives rise to some switching activity.

    Figure 10: Multiplier with ring counter

    For a 3-bit multiplier, a 3-bit ring counter is used. Table 3 gives the required bit and the corresponding counter output.

    Table 3: Counter output with required bit.

    To avoid this, in the proposed architecture (Figure 11) a multiplexer (M1) with a one-hot encoded bus selector chooses the hot bit of B in each cycle. A ring counter is used to select B(n) in the nth cycle. As will be seen later, the same counter can be used for block M2 as well. The ring counter used in the proposed multiplier is noticeably wider (32 bits vs. 5 bits for a 32-bit multiplier) than the binary counter used in the conventional architecture; therefore an ordinary ring counter, if used in BZ-FAD, would cause more transitions than its binary counterpart in the conventional architecture. To minimize the switching activity of the counter, we use the low-power ring counter, which is described in the next section.
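The difference between shifting B and selecting its hot bit can be illustrated with a small model. The sketch assumes a plain (not low-power) ring counter and counts only the bit toggles inside the B register; clock load and counter activity are deliberately left out. All function names are hypothetical.

```python
def ring_counter_states(k):
    """Yield the k one-hot states of a k-bit ring counter, starting at bit 0."""
    state = 1
    for _ in range(k):
        yield state
        state = ((state << 1) | (state >> (k - 1))) & ((1 << k) - 1)

def select_hot_bit(b, one_hot):
    """One-hot multiplexer M1: return the bit of b picked by the single set bit."""
    return 1 if b & one_hot else 0

def shift_register_toggles(b, k):
    """Count flip-flop toggles when b is shifted right k times (conventional scheme)."""
    toggles = 0
    for _ in range(k):
        shifted = b >> 1
        toggles += bin(b ^ shifted).count("1")
        b = shifted
    return toggles

b = 0b101
hot_bits = [select_hot_bit(b, s) for s in ring_counter_states(3)]
print(hot_bits, shift_register_toggles(b, 3))
```

With the one-hot selection the B register itself never changes during the multiplication, so the toggles counted by `shift_register_toggles` do not occur in it.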

    5.1.2 Reducing Switching Activity of the Adder

    In the conventional multiplier architecture (Figure 7), in each cycle the current partial product is added to A (when B(0) is one) or to 0 (when B(0) is zero). This leads to unnecessary transitions in the adder when B(0) is zero. In these cases the adder can be bypassed and the partial product simply shifted to the right by one bit. This is what the proposed architecture does, eliminating unnecessary switching activity in the adder. As shown in Figure 11, the Feeder and Bypass registers are used to bypass the adder in the cycles where B(n) is zero. In each cycle, the hot bit of the next cycle, B(n + 1), is checked. If it is 0, i.e., the adder is not needed in the next cycle, the Bypass register is clocked to store the current partial product. If B(n + 1) is 1, i.e., the adder is needed in the next cycle, the Feeder register is clocked to store the current partial product, which must be fed to the adder in the next cycle. Note that the selection between the Feeder and Bypass registers uses NAND and NOR gates, which are inverting logic; therefore the inverted clock (~Clock in Figure 11) is fed to them. Finally, in each cycle, B(n) determines whether the partial product comes from the Bypass register or from the adder output.

    In each cycle when the hot bit B(n) is zero, there is no transition in the adder since its inputs do not change. The reason is that in the previous cycle the partial product was stored in the Bypass register, so the value of the Feeder register, which is an input of the adder, remains unchanged. The other input of the adder is A, which is constant during the multiplication. This enables us to remove the multiplexer and feed input A directly to the adder, resulting in a noticeable power saving. Finally, note that the BZ-FAD architecture does not put any constraint on the adder type. In this work we have used the ripple carry adder, which has the lowest average number of transitions per addition among the look-ahead, carry-skip, carry-select and conditional-sum adders.
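The Feeder/Bypass mechanism can be modelled behaviourally as follows. This is a sketch under stated assumptions, not the circuit: the partial product is kept as a single full-width value, the result of the last cycle is assumed to be written to the Feeder register, and the function name `bzfad_bypass` and the counter `adder_cycles` are hypothetical.

```python
def bzfad_bypass(a, b, k):
    """Behavioural model of the Feeder/Bypass scheme; returns (product, adder_cycles)."""
    feeder = bypass = 0
    adder_cycles = 0                 # cycles in which the adder output is actually used
    for n in range(k):
        hot = (b >> n) & 1
        if hot:
            pp = feeder + (a << k)   # adder output: Feeder + A (A fed directly)
            adder_cycles += 1
        else:
            pp = bypass              # adder bypassed; Feeder and A are unchanged
        pp >>= 1                     # partial product moves one bit to the right
        # Look at the hot bit of the next cycle to decide which register is clocked.
        next_hot = 1 if n == k - 1 else (b >> (n + 1)) & 1   # assumption for the last cycle
        if next_hot:
            feeder = pp
        else:
            bypass = pp
    return feeder, adder_cycles

print(bzfad_bypass(13, 11, 4))
```

In this model the adder result is needed only in cycles whose multiplier bit is 1, so a multiplier with many zero bits exercises the adder correspondingly less.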

    5.1.3 Shift of the PP Register

    In the conventional architecture, the partial product is shifted in each cycle, giving rise to transitions. Inspecting the multiplication algorithm reveals that the multiplication can be completed by processing only the most significant bits of the partial product; hence it is not necessary to shift the least significant bits of the partial product. We take advantage of this observation in the BZ-FAD architecture. Notice that in Figure 11, for PLow, the lower half of the partial product, we use k latches (for a k-bit multiplier). These latches are indicated by the dotted rectangle M2 in Figure 11.

    Figure 11: The proposed low power multiplier architecture (BZ-FAD)

    In the first cycle, the least significant bit of the product, PP(0), becomes final and is stored in the rightmost latch of PLow. The ring counter output is used to open (unlatch) the proper latch. This is achieved by connecting the S/~H line of the nth latch to the nth bit of the ring counter, which is '1' in the nth cycle. In this way, the nth latch samples the value of the nth bit of the final product (Figure 11). In the subsequent cycles, the next least significant bits become final and are stored in the proper latches. When the last bit has been stored in the leftmost latch, the higher and lower halves of the partial product together form the final product. With this method, no shifting of the lower half of the partial product is required. The higher half of the partial product, however, is still shifted.

    Comparing the two architectures, BZ-FAD saves power for two reasons: first, the lower half of the partial product is not shifted, and second, this half is implemented with latches instead of flip-flops. Note that in the conventional architecture (Figure 7) the data transparency problem of latches prohibits the use of latches instead of flip-flops for the lower half of the partial product. This problem does not exist in BZ-FAD, since the lower half is not formed by shifting bits through a shift register.
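The latch-based lower half can be sketched as below. The model keeps the upper half as an integer that is shifted every cycle and the lower half as a list of k single-bit latches, each opened by one bit of a one-hot ring counter. The names `multiply_with_plow_latches`, `p_low` and `ring` are illustrative assumptions.

```python
def multiply_with_plow_latches(a, b, k):
    """Upper half is shifted each cycle; each lower-half latch is written exactly once."""
    high = 0                      # upper half of the partial product (plus carry)
    p_low = [0] * k               # k latches, index 0 is the rightmost
    ring = 1                      # one-hot ring counter, bit n is set in cycle n
    for n in range(k):
        if (b >> n) & 1:
            high += a
        for i in range(k):        # only the latch whose S/~H line is 1 samples
            if (ring >> i) & 1:
                p_low[i] = high & 1
        high >>= 1                # the upper half is still shifted
        ring <<= 1
    low = sum(bit << i for i, bit in enumerate(p_low))
    return (high << k) | low      # concatenate the upper and lower halves

print(multiply_with_plow_latches(13, 11, 4))
```

Each latch is written in exactly one cycle and then holds its value, which is the behaviour that replaces shifting the lower half.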


    Figure 12: Manual approach for BZFAD

    5.2 State Diagram

    Figure 13: BZFAD state diagram

    6. CONVENTIONAL MULTIPLIER CODE DESCRIPTION

    Simulation results are obtained by following the architecture of the conventional add-and-shift multiplier. The complete operation takes four states. The first state loads the registers and the second state calculates the first partial product. In the third state, the counter value is incremented and compared with k, the number of multiplier bits. With every increment of the counter, until the required value is reached, the remaining shift and add operations are performed. The output is visible at the transition from the third state to the fourth state, when the done signal goes high. The counter is then reset for further operations.
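One possible reading of this four-state description is sketched below as a software state machine. The state encoding, the helper `add_shift` and the cycle count are assumptions made for illustration; the original HDL code is not reproduced here and may differ in detail.

```python
def add_shift(pp, a_reg, b_reg, k):
    """One datapath step: add A if B(0) is 1, then shift PP and B right by one bit."""
    if b_reg & 1:
        pp += a_reg << k
    return pp >> 1, b_reg >> 1

def conventional_multiplier(a, b, k):
    """Four-state model of the conventional multiplier; returns (product, clock cycles)."""
    state, cycles, done = 0, 0, False
    a_reg = b_reg = pp = count = 0
    while not done:
        cycles += 1
        if state == 0:                    # first state: load the registers
            a_reg, b_reg, pp, count = a, b, 0, 0
            state = 1
        elif state == 1:                  # second state: first partial product
            pp, b_reg = add_shift(pp, a_reg, b_reg, k)
            state = 2
        elif state == 2:                  # third state: count, then add and shift
            count += 1
            if count == k:
                state = 3
            else:
                pp, b_reg = add_shift(pp, a_reg, b_reg, k)
        else:                             # fourth state: done goes high, counter reset
            done, count = True, 0
    return pp, cycles

print(conventional_multiplier(13, 11, 4))
```

Note that in this model the B register and the partial product are shifted in every add/shift step regardless of the multiplier bit, which is the activity the BZ-FAD modifications target.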

    6.1 BZFAD Multiplier Code Description

    We made a number of adjustments to the conventional multiplier
    architecture to reduce power. Simulation results were obtained for
    a design that follows the resulting BZFAD architecture. In the
    first state, the multiplier and multiplicand registers are loaded
    with their respective values and all other signals are initialized
    to zero. In the next state, in each cycle, the hot bit of the next
    cycle, that is, B(n+1), is checked. If it is 0, the adder is not
    needed in the next cycle, and the Bypass register is clocked to
    store the current partial product. If B(n+1) is 1, the adder is
    needed in the next cycle, and the Feeder register is clocked to
    store the current partial product, which must be fed to the adder
    in the next cycle. In each cycle the ring counter is advanced and
    its MSB is checked; when the MSB becomes 1, the state is
    incremented. In the next state, the lower half of the partial
    product is held in the Plow latch and the upper half in the Feeder
    register, and these two are concatenated to form the final
    product.
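    The Feeder/Bypass behaviour described above can be sketched as follows. This is a behavioural Python model written for illustration, not the VHDL used for the reported results. The variable names, the unsigned n-bit operands, the use of a plain loop in place of the MSB test on the ring counter, and the choice to route the last partial product to the Feeder register (so that the final product can be read from it) are assumptions of this sketch.

```python
def bzfad_multiply(a, b, n):
    """Behavioural model of the BZ-FAD data path for unsigned n-bit operands.

    Returns (product, adder_cycles), where adder_cycles counts the cycles in
    which the adder was actually used.
    """
    feeder = 0        # partial product that will be fed to the adder
    bypass = 0        # partial product that skips the adder
    p_low = 0         # lower half of the product, one latch per bit
    adder_cycles = 0
    ring = 1          # one-hot ring counter; its hot bit selects B(i)

    for i in range(n):
        if b & ring:                  # hot bit is 1: the adder is needed
            total = feeder + a        # A is fed to the adder directly
            adder_cycles += 1
        else:                         # hot bit is 0: bypass the adder
            total = bypass
        p_low |= (total & 1) << i     # LSB goes into latch i, no shifting
        upper = total >> 1            # only the upper part is shifted

        # Look at the hot bit of the NEXT cycle to decide which register
        # is clocked (assumption: the last cycle always clocks the Feeder).
        last_cycle = (i == n - 1)
        if last_cycle or (b & (ring << 1)):
            feeder = upper
        else:
            bypass = upper
        ring <<= 1                    # advance the ring counter

    return (feeder << n) | p_low, adder_cycles


print(bzfad_multiply(3, 5, 4))
```

    In this model the adder is exercised once per 1 bit of the multiplier, while B itself and the lower half of the product are never shifted.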

    7. HARDWARE IMPLEMENTATION

    7.1 Basics About Spartan-II Trainer Kit

    The Spartan-II trainer MXSFK-LC-208 is useful for realizing and
    verifying digital designs. The user can write VHDL or Verilog code
    and verify the results by implementing the design physically in
    the target device, a Field Programmable Gate Array (FPGA). With
    the help of this trainer, the user can apply and observe various
    input and output conditions to verify the implemented design.
    Various I/O standards can also be selected for interfacing to the
    device.

    7.2 Programmable Logic Devices [PLDs]

    A Programmable Logic Device is a device

    whose logic characteristics can be changed and

    manipulated or stored through programming.

    7.2.1 Different Types of PLDs.

    7.2.1.1 Programmable Array Logic [PALs]

    The most common and simplest device in this category is the PAL,
    which consists of an array of AND gates and an array of OR gates.
    The AND array is programmable, while the OR array is fixed.
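    As a minimal sketch of this structure, the Python model below evaluates one PAL output as a fixed OR over a programmable list of AND (product) terms. The function name, the dictionary encoding of a product term and the XOR example are hypothetical and chosen only for illustration.

```python
def pal_output(inputs, and_plane):
    """Evaluate one PAL output.

    inputs:    dict mapping an input name to 0 or 1.
    and_plane: the programmable part, a list of product terms. Each term maps
               an input name to the level it must have (1 for the true input,
               0 for the complemented input).
    """
    products = [
        all(inputs[name] == level for name, level in term.items())
        for term in and_plane
    ]
    return int(any(products))  # the fixed OR gate sums all product terms


# Programming the AND array for a XOR b = a.(not b) + (not a).b
XOR_TERMS = [{"a": 1, "b": 0}, {"a": 0, "b": 1}]

print(pal_output({"a": 1, "b": 0}, XOR_TERMS))
```

    Changing the function means changing only the list of product terms; the OR stage stays the same, which mirrors the programmable-AND, fixed-OR arrangement.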

    7.2.1.2 Field Programmable Gate Arrays [FPGAs]

    FPGAs are arrays of logic blocks that can be linked together to
    form complex logic implementations. They are separated into two
    categories: fine grained and coarse grained. Fine-grained devices
    are made up of a sea of gates, transistors or small macro cells,
    while coarse-grained devices are made up of bigger macro cells,
    which often consist of flip-flops and look-up tables that
    implement the combinational logic functions. These are RAM-based
    devices, i.e. they lose their configuration when power is switched
    off. Hence they have to be configured every time power is applied.
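    The role of a look-up table and the volatility of a RAM-based configuration can be sketched in a few lines of Python. The class below is a hypothetical model, assuming a four-input table; it is not a description of any particular device's internals.

```python
class Lut4:
    """Four-input look-up table whose contents live in volatile RAM."""

    def __init__(self):
        self.table = None  # no configuration is present after power-up

    def configure(self, func):
        # Fill the 16 RAM cells from a Python function of four bits.
        self.table = [
            bool(func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1))
            for i in range(16)
        ]

    def power_cycle(self):
        # Switching the power off erases the RAM-based configuration.
        self.table = None

    def read(self, a, b, c, d):
        if self.table is None:
            raise RuntimeError("LUT is not configured")
        return self.table[(a << 3) | (b << 2) | (c << 1) | d]


lut = Lut4()
lut.configure(lambda a, b, c, d: a ^ b ^ c)  # sum bit of a full adder
print(lut.read(1, 1, 1, 0))
```

    After `power_cycle()` the table must be configured again before it can be read, which is the behaviour the paragraph above attributes to RAM-based devices.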

    7.2.1.3 Complex Programmable Logic Devices [CPLDs]

    CPLDs are made up of smaller common macro cells, which are
    programmable. A CPLD consists of multiple PAL-like function blocks
    that can be interconnected through a switch matrix. These are
    [Flash] EPROM-based devices, i.e. they retain their configuration
    even when power is switched off. Hence they need not be configured
    every time power is applied.

    7.2.1.4 Application Specific Integrated Circuits [ASICs]

    ASICs are prefabricated, pre-doped silicon chips that carry
    application-specific designs. They cannot be reconfigured once
    manufactured. Once a design is completely finalized, it can be
    made as an ASIC. Design changes are then no longer possible, but
    the size and speed are better.

    7.3 SPARTAN-II [FPGA]

    The Spartan-II family is a second-generation, high-volume
    production FPGA solution. Devices in this family are available
    with up to 200,000 gates and up to 200 MHz system performance at
    a 2.5 V supply.

    Features of the Spartan-II family are:

    1. On-chip RAM (block and distributed).
    2. Fully PCI compliant.
    3. Dedicated carry logic for high-speed arithmetic.
    4. Dedicated multiplier support.
    5. Low-power segmented routing architecture.
    6. 16 high-performance interface standards.
    7. Four dedicated delay-locked loops (DLLs) for advanced clock control.
    8. Power-down mode (ICCO = 100 mA).
    9. Unlimited re-programmability.

    7.3.1 Trainer Description

    Technical Data

    On-board Spartan-II XC2S50 FPGA in a PQ 208 package, compatible with the
    XC2S100, XC2S150 and XC2S200 in the PQ 208 package.
    Two keys for keyboard interface.
    Eight digital inputs and outputs with LED indication.
    Two seven-segment displays.
    On-board 4 MHz clock and power-on reset circuit.
    User-selectable interface hardware.
    The supply required for VCCO is generated on board; no external supply is required.
    Probing facility: all I/Os are available to the user.

    Power Supply

    9-volt adapter supplied with the Spartan-II trainer.
    The required VCCO (3.3 V) and VCCINT (2.5 V) voltages are generated on board.

    Seven-Segment LED Display

    Two seven-segment LED displays are provided. The user can use them
    as an aid to verify a design; they come in handy in counter-related
    applications for monitoring the results.

    LEDs

    There are 18 LEDs in total on the trainer, grouped as follows.

    1. The POWER-ON LED is used for power supply indication.
    2. The DONE LED indicates successful configuration of the Spartan-II device.
    3. Eight LEDs [IL0 to IL7] indicate the inputs applied by the user.
    4. Eight LEDs [LD0 to LD7] indicate output conditions.

    Test Points [TPs]

    The user can use these points to verify ground, supply voltage and clock.

    DIP Switch

    A single 8-way DIP switch [SW1] is provided for use as input to the
    FPGA. The logic level applied to the FPGA through SW1 is seen on
    LEDs LD0 to LD7.

    Jumpers

    Jumpers are provided for:
    Selection of clock.
    Selection of configuration mode.

    Keys

    Two keys are provided for keyboard interface.

    Downloading Cable

    For downloading the design from a PC, a 9-pin D-type male
    connector (J7) is provided on board. The trainer can be connected
    to the PC's parallel port with a cable that has a 25-pin D-type
    (male) connector and a 9-pin D-type (female) connector. This cable
    is provided with the trainer.

    Figure 14: SPARTAN 2 Trainer


    To implement a design on hardware, the design software must be
    interfaced to the hardware. Any code can be downloaded to a
    hardware kit (in this case the Spartan-II FPGA) with the help of a
    software interfacing tool (Xilinx).

    When we downloaded our programs for the conventional architecture
    and the BZFAD architecture to the Spartan-II kit, the results were
    obtained successfully. Images of the Spartan-II kit executing the
    programs are shown.

    8. RESULTS AND ANALYSIS

    After studying the architecture of both the conventional and the
    BZFAD multipliers, the next step was to implement them. To
    accomplish this we wrote code in the Very High Speed Integrated
    Circuit Hardware Description Language (VHDL). This code was
    synthesized using Xilinx tools, simulated using the ISE simulator
    (ISim), and implemented by downloading it to the Spartan-II FPGA
    kit. The simulation results, timing summary, area utilization and
    power analysis reports are shown below.

    8.1 Simulation Results

    The simulation results for both the conventional and BZFAD
    architectures follow in the order given below:

    4-bit conventional multiplier
    8-bit conventional multiplier
    4-bit BZFAD multiplier
    8-bit BZFAD multiplier

    8.2 Timing Summary

                                  Conventional 8 bit    BZFAD 8 bit
    Minimum period                8.258 ns              6.975 ns
    Maximum frequency             121.094 MHz           143.362 MHz
    Minimum input arrival time    8.426 ns              7.167 ns

                                  Conventional 16 bit   BZFAD 16 bit
    Minimum period                9.946 ns              6.564 ns
    Maximum frequency             100.540 MHz           152.352 MHz
    Minimum input arrival time    10.281 ns             7.502 ns
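    As a quick consistency check on the 8-bit and 16-bit timing figures above, the maximum frequency can be recomputed as the reciprocal of the minimum period, and the relative reduction in minimum period can be derived from the same numbers. The short Python sketch below does this; the dictionary layout and function names are hypothetical, and only the period values are taken from the tables.

```python
# Minimum clock periods in nanoseconds, copied from the timing summary.
MIN_PERIOD_NS = {
    "8 bit": {"conventional": 8.258, "bzfad": 6.975},
    "16 bit": {"conventional": 9.946, "bzfad": 6.564},
}


def max_frequency_mhz(period_ns):
    """f_max = 1 / T_min; a period in ns gives a frequency in MHz via 1000 / T."""
    return 1000.0 / period_ns


def period_reduction_percent(conventional_ns, bzfad_ns):
    """Relative reduction of the minimum period achieved by BZFAD."""
    return 100.0 * (conventional_ns - bzfad_ns) / conventional_ns


for width, periods in MIN_PERIOD_NS.items():
    conv, bz = periods["conventional"], periods["bzfad"]
    print(width,
          round(max_frequency_mhz(conv), 3),
          round(max_frequency_mhz(bz), 3),
          round(period_reduction_percent(conv, bz), 1))
```

    The recomputed frequencies agree with the tabulated values to within rounding, and the derived reduction in minimum period is larger for the 16-bit design than for the 8-bit design.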

    8.3 Area Utilization

                                  Conventional 4 bit    BZFAD 4 bit
    Minimum period                5.943 ns              4.918 ns
    Maximum frequency             168.264 MHz           203.33 MHz
    Minimum input arrival time    6.682 ns              5.160 ns


    8.4 Power Analysis

    8.5 Result Summary

    Figure 15: Area, power and delay comparison for the conventional and proposed BZFAD multipliers for various bit sizes.

    Figure 16: Relationship between power reduction and bit size of the multiplier.


    Figure 17: Simulation for 8 bit BZFAD

    Figure 18: Simulation for 8 bit conventional

    CONCLUSION

    In this paper, a low-power architecture for shift-and-add
    multipliers was proposed. The modifications to the conventional
    architecture included the removal of the shift of the B register
    (in A × B), direct feeding of A to the adder, bypassing the adder
    whenever possible, use of a ring counter instead of the binary
    counter, and removal of the partial product shift. The results
    showed an average power reduction of 30% by the proposed
    architecture. We also compared our multiplier with SPST [6], a
    low-power tree-based array multiplier. The comparison showed that
    the power saving of BZ-FAD was only 6% lower than that of SPST,
    whereas the SPST area was five times higher than that of the
    BZ-FAD. Thus, for applications where small area and high speed are
    important concerns, BZ-FAD is an excellent choice. Additionally,
    we proposed a low-power architecture for ring counters based on
    partitioning the counter into blocks of flip-flops that are clock
    gated with a special clock gating structure, the complexity of
    which is independent of the block sizes. The simulation results
    showed that, in comparison with the conventional architecture, the
    proposed architecture reduced the power consumption by more than
    75% for the 64-bit counter.

    REFERENCES

    [1] M. Mottaghi-Dastjerdi, A. Afzali-Kusha, and M. Pedram, "BZ-FAD: A Low-Power Low-Area Multiplier Based on Shift-and-Add Architecture," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 2, pp. 302-306, Feb. 2009.

    [2] O. Chen, S. Wang, and Y. W. Wu, "Minimization of switching activities of partial products for designing low-power multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 3, pp. 418-433, Jun. 2003.

    [3] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 1st ed. Oxford, U.K.: Oxford Univ. Press, 2000.

    [4] M. D. Ercegovac and Z. Huang, "High performance low power left to right array multiplier design," IEEE Trans. Comput., vol. 54, no. 2, pp. 272-283, Mar. 2006.

    [5] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-Power CMOS Digital Design," Journal of Solid-State Circuits, vol. 27, no. 4, Apr. 1992.

    [6] N. M. Botros, HDL Programming (VHDL and Verilog), Dreamtech Press (available through John Wiley India and Thomson Learning), 2006 edition.

    [7] C. H. Roth, Jr., Digital Systems Design Using VHDL, Thomson Learning, Inc., 9th reprint, 2006.

    AUTHORS PROFILE

    Mr. Prasann D. Kulkarni completed his B.E. in Electronics and
    Communication Engineering at KLS's Vishwanathrao Deshpande Rural
    Institute of Technology, Haliyal, Uttar Kannada, Karnataka, India.
    He is presently pursuing an M.Tech. in Digital Electronics at
    KLS's G.I.T, Belgaum, Karnataka, India, and since 2008 he has been
    working as a lecturer at KLS's Vishwanathrao Deshpande Rural
    Institute of Technology, Haliyal, Uttar Kannada, Karnataka, India.
    His research interests are low-power embedded system design and
    fuzzy logic in neural applications.
