ELEC516/10 Lecture 4
ELEC 516 VLSI System Design and Design Automation, Spring 2010
Lecture 4 - Shifter and Multiplier Design
Reading Assignment:
Weste: Chapter 8
Rabaey: Chapter 11
Note: some of the figures in this slide set are adapted from the slide set of "Digital Integrated Circuits" by Rabaey et al., 2002.
Shifter Design
• Shifting operations are important and are used extensively for
– arithmetic shifting, logical shifting, rotation
– floating-point operations, scaling, and multiplication by a constant
– data alignment
– field extraction/combination
– address generation
• Shifting a data word left or right by a constant amount is a trivial hardware operation. A programmable shifter, however, is more complex.
• E.g. shift left or right by a variable number of bits
• Design style
– Two-dimensional arrays
– Variable size
– Rotate
– Padding with zeros/ones
A simple shifter
[Figure: bit-slice i of a simple one-bit shifter, with inputs Ai and Ai-1, outputs Bi and Bi-1, and control signals Right, nop, and Left.]
• This design rapidly becomes complex and slow for larger shift values.
• A more structured approach is advisable. Two commonly used shift structures are the barrel shifter and the logarithmic shifter.
Barrel Shifter
• It consists of an array of transmission gates, where the number of rows equals the word length of the data and the number of columns equals the maximum shift length.
• A major advantage of this shifter is that the signal passes through at most one transmission gate, so the delay is theoretically constant and independent of the shift value or shifter size. This is not true in reality, since the capacitance at the input of the buffers rises linearly with the maximum shift width.
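As a rough behavioral sketch of the array described above (not the transistor-level circuit), the following Python models a switch array in which exactly one shift-control line is active, so each output reaches its input through a single switch. The function name, the LSB-first bit lists, the one-hot control encoding, and the zero padding of vacated positions are assumptions made for illustration.

```python
def barrel_shift_right(a_bits, sh_onehot):
    """Behavioral model of a barrel shifter (right shift, zero padding).

    a_bits    : list of input bits, index 0 = LSB (assumed convention)
    sh_onehot : one-hot control list; sh_onehot[k] == 1 selects a shift by k
    """
    n = len(a_bits)
    if sum(sh_onehot) != 1:
        raise ValueError("exactly one shift-control line must be active")
    b_bits = []
    for i in range(n):                      # one row per output bit B_i
        out = 0                             # assumed padding value
        for k, sh in enumerate(sh_onehot):  # one column per shift amount
            src = i + k
            if sh and src < n:              # the single conducting switch
                out = a_bits[src]
        b_bits.append(out)
    return b_bits


# 0b1010 (LSB first: 0,1,0,1) shifted right by 1 gives 0b0101
print(barrel_shift_right([0, 1, 0, 1], [0, 1, 0, 0]))
```

The nested loops mirror the rows-by-columns layout: the cost of the array grows with the product of word length and maximum shift, while any given signal crosses only one switch.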
Barrel Shifter (2)
[Figure: 4-bit barrel shifter with data inputs A3 to A0, outputs B3 to B0, and shift-control lines Sh0 to Sh3; the legend distinguishes control wires from data wires.]
Area dominated by wiring
Logarithmic Shifter
• While the barrel shifter implements the whole shifter as a single array of pass transistors, the logarithmic shifter uses a staged approach. It uses stages of multiplexers that decompose the shift into power-of-two stages.
• A shifter with a maximum shift width of M consists of $\log_2 M$ stages, where the i-th stage either shifts by $2^i$ or passes the data unchanged.
• The logarithmic shifter is usually smaller than the barrel shifter. For larger values of M, it is definitely the structure of choice.
• The speed depends on the shift width logarithmically, since an n-bit shifter requires $\log_2 n$ stages.
• Other shift options are frequently required, for instance, shuffles, bit reversals, and interchanges.
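The staged decomposition can be sketched in a few lines of Python. This is a behavioral illustration only: the function name, the choice of a right shift with zero fill, and the assumption that the width is a power of two are not from the slides.

```python
def log_shift_right(value, shift, width):
    """Logarithmic shifter model: stage i shifts by 2**i or passes data through.

    Assumes `width` is a power of two and 0 <= shift < width.
    """
    mask = (1 << width) - 1
    value &= mask
    stages = (width - 1).bit_length()      # log2(width) multiplexer stages
    for i in range(stages):
        if (shift >> i) & 1:               # bit i of the shift amount controls stage i
            value >>= (1 << i)             # shift by 2**i
        # otherwise the stage passes the data unchanged
    return value


print(log_shift_right(0b10110000, 5, 8))   # shift by 5 = stage 0 (by 1) + stage 2 (by 4)
```

Each bit of the binary shift amount directly drives one stage, which is why no decoder is needed and why the structure is regular enough to generate automatically.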
Logarithmic Shifter (2)
• In general, a barrel shifter is appropriate for smaller shifters. For large shift values, the logarithmic shifter becomes more effective in terms of both area and speed. The logarithmic shifter is also more regular and hence can easily be generated automatically.
[Figure: logarithmic shifter with stage controls Sh1, Sh2, and Sh4, inputs A3 to A0, and outputs B3 to B0.]
Multiplexer-based shifter
Shifter design - Summary
• The design of a shifter is a trade-off between area and delay.
• Barrel shifter: fastest but requires more transistors. Speed: O(1), area: $n^2$ transistors
• Logarithmic shifter: slower but fewer transistors. Speed: O(log n), area: $n \log n$ transistors
• The barrel shifter is a wire-dominated circuit
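To get a feel for how quickly the two area estimates diverge, the sketch below simply evaluates the growth terms from the summary. These are order-of-growth figures with constant factors ignored, not real transistor counts; the function names are hypothetical and the logarithm is taken base 2 and rounded up.

```python
def barrel_shifter_transistors(n):
    """Growth term for the barrel shifter: n^2."""
    return n * n


def log_shifter_transistors(n):
    """Growth term for the logarithmic shifter: n * ceil(log2 n)."""
    stages = (n - 1).bit_length()   # ceil(log2 n) for n >= 1
    return n * stages


for n in (4, 8, 16, 32, 64):
    print(n, barrel_shifter_transistors(n), log_shifter_transistors(n))
```

At 4 bits the two estimates differ by a factor of two; at 64 bits they differ by more than a factor of ten, which is consistent with preferring the logarithmic structure for wide shifters.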
The Multiplier
• Very important operation. Often the speed of multiplication limits the performance of the digital processor.
• Multiplications are used in many digital signal processing applications:
– correlation, convolution, filtering, and frequency analysis
– vector products, matrix multiplication
– weighted sums required in many DSP tasks such as neural networks, filtering, etc.
• Multipliers are in fact complex adder arrays.
• The analysis of the multiplier gives us some further insight on how to optimize the performance (or the area) of complex circuit topologies.
Example
• The multiplication process may be viewed as consisting of two steps:
– Evaluation of partial products
– Accumulation of the shifted partial products
• Partial products can be generated using an array of AND gates.
• Example: 10 x 5

Multiplicand:        1 0 1 0    (10)
Multiplier:          0 1 0 1    (5)
                     -------
                     1 0 1 0
                   0 0 0 0
                 1 0 1 0
               0 0 0 0
               -------------
               0 1 1 0 0 1 0    (50)

4 partial products
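The two steps of the example can be reproduced with a short Python sketch. The function name and the integer encoding of each partial-product row are assumptions for illustration; the bitwise AND stands in for the AND-gate array.

```python
def multiply_by_partial_products(x, y, n):
    """Return (partial_products, product) for two unsigned n-bit operands.

    Row j is the multiplicand ANDed bit by bit with multiplier bit j;
    the rows are then accumulated with row j shifted left by j positions.
    """
    partials = []
    for j in range(n):
        y_j = (y >> j) & 1
        # one AND gate per multiplicand bit
        row = sum((((x >> i) & 1) & y_j) << i for i in range(n))
        partials.append(row)
    product = sum(row << j for j, row in enumerate(partials))
    return partials, product


rows, p = multiply_by_partial_products(10, 5, 4)
print([format(r, "04b") for r in rows], p)
```

For 10 x 5 the rows are 1010, 0000, 1010, 0000, matching the four partial products in the example, and their shifted sum is 50.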
The Multiplier(II)
• Single-bit binary multiplication is equivalent to an AND operation. Evaluation of the partial products consists of the logical ANDing of the multiplicand and the relevant multiplier bit.
• Different techniques exist. The choice of technique is based on factors such as speed, throughput, numerical accuracy, and area.
• An n*n multiplier has a 2n-bit output
– Integer multiplier: takes the n LSBs
– Floating-point multiplier (or fixed point with the binary point at the MSB), e.g. FP, 1.XXX * 1.XXX: takes the n MSBs
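A small sketch of the two ways of truncating the 2n-bit result. The helper name is hypothetical, and the operands are treated as unsigned n-bit values; interpreting the upper half as a fixed-point fraction is the assumption stated in the comments.

```python
def split_product(x, y, n):
    """Return (n LSBs, n MSBs) of the 2n-bit product of two unsigned n-bit values."""
    p = x * y
    low = p & ((1 << n) - 1)    # what an integer multiplier keeps
    high = p >> n               # what a fixed-point/fractional multiplier keeps
    return low, high


low, high = split_product(10, 5, 4)
print(format(low, "04b"), format(high, "04b"))
# Reading the operands as fractions 0.1010 and 0.0101 (assumed interpretation),
# the upper half 0.0011 approximates the true product 50/256.
```

With n = 4 the integer view keeps 0010, so the true result 50 no longer fits, while the fractional view keeps 0011 and only loses precision in the discarded low bits.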
Simple multiplier
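• Generates and adds one partial product in each cycle.
• Takes n cycles.
[Figure: sequential multiplier datapath. The multiplicand and multiplier feed a partial-product generation block, followed by an adder and a shift; the result is shifted right every cycle.]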
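A cycle-by-cycle model of such a multiplier might look like the following. The register names (`acc`, `low`) and the choice to let the product's low bits shift into the multiplier register are assumptions about a typical shift-and-add datapath, not details taken from the figure.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Sequential shift-and-add multiplication of two unsigned n-bit values.

    One partial product is generated and added per cycle, then the combined
    {acc, low} register is shifted right by one bit. Takes n cycles.
    """
    acc = 0            # upper half of the running product
    low = multiplier   # multiplier register; fills with product low bits
    for _ in range(n):
        if low & 1:                        # partial-product generation
            acc += multiplicand            # adder
        # shift right every cycle: acc's LSB moves into the top of `low`
        low = (low >> 1) | ((acc & 1) << (n - 1))
        acc >>= 1
    return (acc << n) | low


print(shift_add_multiply(10, 5, 4))
```

Only one adder is needed, at the price of n cycles per multiplication.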
Issues in designing a fast multiplier
• Reduce the number of partial products
• Fast adder cells
• Reduce the number of additions required to sum the partial products
– e.g. use tree adders
The Array Multiplier
• Consider two unsigned binary numbers X and Y that are M and N bits wide, respectively
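$$X = \sum_{i=0}^{M-1} X_i 2^i, \qquad Y = \sum_{j=0}^{N-1} Y_j 2^j$$

$$X \times Y = \left(\sum_{i=0}^{M-1} X_i 2^i\right)\left(\sum_{j=0}^{N-1} Y_j 2^j\right) = \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} X_i Y_j 2^{i+j}$$

• $P_k$: the partial product terms, called summands. There are M*N summands, which are generated in parallel by a set of M*N AND gates

$$P = X \times Y = \sum_{k=0}^{M+N-1} P_k 2^k$$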
The Array Multiplier (II)
• An n*n multiplier requires n(n-2) full adders, n half adders, and $n^2$ AND gates. The worst-case delay is (2n+1)g, where g is the worst-case adder delay.
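The cell counts and the delay estimate are easy to tabulate. The sketch below just evaluates the expressions quoted above; the function names and the dictionary layout are illustrative assumptions.

```python
def array_multiplier_resources(n):
    """Cell counts for an n*n array multiplier, using the figures quoted above."""
    return {
        "full_adders": n * (n - 2),
        "half_adders": n,
        "and_gates": n * n,
    }


def array_multiplier_worst_delay(n, g):
    """Worst-case delay (2n+1)*g, where g is the worst-case adder delay."""
    return (2 * n + 1) * g


print(array_multiplier_resources(4))
print(array_multiplier_worst_delay(4, 1))   # in units of g
```

For n = 4 this gives 8 full adders and 4 half adders, i.e. 12 adder cells in total, which is three rows of four cells.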
The Array Multiplier (III)
• The following is a basic cell used in an array multiplier
[Figure: basic cell of the array multiplier, built around an adder (+); signal labels X, Y, B, C, CO, and PO.]
A 4*4 array multiplier
[Figure: 4*4 array multiplier with inputs X3 to X0 and Y3 to Y0 and outputs Z7 to Z0. The three adder rows (for Y1, Y2, Y3) are HA FA FA HA, FA FA FA HA, and FA FA FA HA.]
The MxN Array Multiplier - Critical Path
[Figure: array of half adders (HA) and full adders (FA) with two equally long paths marked Critical Path 1 and Critical Path 2.]

$$t_{mult} = [(M-1) + (N-2)]\,t_{carry} + (N-1)\,t_{sum} + t_{and}$$
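The critical-path expression for the M*N array multiplier, read as t_mult = [(M-1)+(N-2)] t_carry + (N-1) t_sum + t_and, can be evaluated directly. The delay values used below are arbitrary placeholders chosen for illustration, not measured numbers.

```python
def array_multiplier_delay(m, n, t_carry, t_sum, t_and):
    """Critical-path delay of an M*N array multiplier.

    t_mult = [(M-1) + (N-2)] * t_carry + (N-1) * t_sum + t_and
    """
    return ((m - 1) + (n - 2)) * t_carry + (n - 1) * t_sum + t_and


# Hypothetical gate delays in picoseconds (assumed values)
print(array_multiplier_delay(4, 4, t_carry=20, t_sum=30, t_and=10))
print(array_multiplier_delay(8, 8, t_carry=20, t_sum=30, t_and=10))
```

Because both the carry and the sum delays appear with coefficients that grow with the operand widths, a cell that balances the two delays helps more than one that optimizes the carry path alone.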
Carry-Save Adder (old style)
• We don't need to optimize the carry chain of each of the rows. Postpone the carry to a later stage.
• CSA delay = $N \cdot t_{carry} + t_{and} + t_{merge}$
[Figure: carry-save multiplier array of half adders (HA) and full adders (FA), with dimensions N and M marked, followed by a vector merging stage. Source: [Rab96] p.411]
Booth Encoding
• The multipliers we studied before use radix-2 multiplication, i.e. they observe one bit of the multiplier at a time.
• Higher-radix multipliers may be designed to reduce the number of adders and hence the delay required to compute the partial sums.
• Booth encoding performs two's complement multiplication and performs several steps of the multiplication at once.
• It takes advantage of the fact that an adder-subtractor is nearly as fast and small as a simple adder.
• The most common form of Booth’s algorithm looks at three bits of the multiplier at a time to perform two stages of multiplication.
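A minimal sketch of the three-bits-at-a-time recoding, assuming the usual radix-4 digit rule d = -2*y(2k+1) + y(2k) + y(2k-1) with y(-1) = 0. The function names, the requirement that n be even, and the use of Python integers for two's-complement values are assumptions for illustration.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement multiplier into n/2 digits in {-2..2}.

    Each digit is derived from three bits: y[2k+1], y[2k], y[2k-1],
    so overlapping windows advance two bit positions at a time. n must be even.
    """
    y_bits = y & ((1 << n) - 1)
    digits = []
    prev = 0                               # implicit y[-1] = 0
    for i in range(0, n, 2):
        b0 = (y_bits >> i) & 1
        b1 = (y_bits >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits                          # least significant digit first


def booth_multiply(x, y, n):
    """Each digit selects 0, +/-x or +/-2x; stage k is weighted by 4**k."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, n)))


print(booth_radix4_digits(7, 4), booth_multiply(6, 7, 4))
print(booth_radix4_digits(-3, 4), booth_multiply(6, -3, 4))
```

Only n/2 add or subtract steps are needed, and the multiples 0, x, and 2x require no adder of their own since 2x is a wired shift.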
Booth Multiplier: Example
• $2^a = 2^{a+1} - 2^a$, and hence we can recode each 1 in the multiplier as "+2-1"
– Converts a sequence of 1s to 10…0(-1)
– Might reduce the number of 1s
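The recoding of a run of 1s can be checked with a few lines of Python. This sketch uses the standard radix-2 Booth rule, digit(i) = y(i-1) - y(i), on an unsigned n-bit input extended by one zero bit; the function name and the LSB-first digit list are assumptions.

```python
def booth_radix2_digits(y, n):
    """Recode an unsigned n-bit value into n+1 signed digits in {-1, 0, 1}.

    Digit i is y[i-1] - y[i], so a run of 1s becomes +1 just above the run
    and -1 at its lowest bit. Digits are returned LSB first.
    """
    digits = []
    prev = 0                       # implicit y[-1] = 0
    for i in range(n + 1):         # bit n is 0 for an unsigned input
        bit = (y >> i) & 1
        digits.append(prev - bit)
        prev = bit
    return digits


d = booth_radix2_digits(0b0111, 4)
print(d[::-1])                                   # MSB first
print(sum(v * 2 ** i for i, v in enumerate(d)))  # value is preserved
```

For 0111 the three 1s collapse into two nonzero digits, 1 0 0 (-1), i.e. 8 - 1. For an alternating pattern such as 0101 the recoding produces more nonzero digits, which is why the slide says it only might reduce their number.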
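Booth Multiplier
• The following shows the structure of a Booth multiplier
[Figure: two cascaded Booth multiplier stages, stage j and stage j+1. Each stage has a code block driving a mux select (mux inputs 0, x, 2x), an adder/subtractor, and a left shift by 2. Stage j uses multiplier bits yi, yi+1, yi+2; stage j+1 uses yi+2, yi+3, yi+4. Partial-product labels: Pj, Pj+1.]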
Modified Booth Multiplier - Summary
• Uses a high radix to reduce the number of intermediate addition operands
– Can go higher: radix-8, radix-16
– Radix-8 should implement *3, *-3, *4, *-4
– Recoding and partial product generation becomes more complex
• Can automatically take care of signed multiplication
Wallace-Tree Based Multiplier
• Principle
– Sum N shifted partial products
– Do the N-input addition efficiently
– Reduce the N-input addition in steps
– Use counters, e.g. a carry-save adder (CSA) (3/2 reduction)
• A CSA is simple; it is just a full adder
– At the end of the array you need to add the two parts together.
– This takes a fast adder, but you only need one at the end, not one for each partial product.
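The 3/2 reduction principle can be sketched with word-level carry-save adders. The grouping of operands three at a time per level, the function names, and the use of non-negative Python integers as operands are assumptions for illustration; a real Wallace tree wires individual bit columns rather than whole words.

```python
def carry_save_add(a, b, c):
    """3:2 reduction: three operands in, sum and carry words out (no carry propagation)."""
    s = a ^ b ^ c                                   # per-bit full-adder sum
    carry = ((a & b) | (a & c) | (b & c)) << 1      # per-bit carry, weighted one position up
    return s, carry


def csa_tree_sum(operands):
    """Reduce operands with levels of CSAs until two remain, then add them once."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        for i in range(0, len(ops) - 2, 3):         # each group of three becomes two
            nxt.extend(carry_save_add(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[len(ops) - len(ops) % 3:])   # leftovers pass to the next level
        ops = nxt
    return sum(ops)                                  # the single final fast adder


def csa_levels(n_operands):
    """Number of CSA levels needed to reduce n operands to two."""
    levels = 0
    while n_operands > 2:
        n_operands = 2 * (n_operands // 3) + n_operands % 3
        levels += 1
    return levels


print(csa_tree_sum([10, 0, 40, 0]))                  # shifted partial products of 10 x 5
print([csa_levels(n) for n in (4, 8, 16, 32)])
```

The level count grows roughly logarithmically with the number of partial products, whereas a linear array adds one row per partial product.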
Reduction by Carry-save adders
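• Example: X(2,1,0)*Y(2,1,0). Let A0 = X(0)*Y(0), A1 = X(1)*Y(0), A2 = X(2)*Y(0), etc.

        A2 A1 A0
     B2 B1 B0
  C2 C1 C0

[Figure: the partial-product bits, grouped by column (A0; A1, B0; A2, B1, C0; B2, C1; C2), are reduced by carry-save adders (CSA) and then summed by a final carry-propagate adder (CPA).]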
Carry-Save Multiplier
[Figure: carry-save multiplier built from rows of half adders (HA) and full adders (FA), followed by a vector merging adder.]
Wallace Tree Multiplier
• The Wallace tree multiplier uses logic tricks to speed up the required addition. It is an adder tree built from carry-save adders using 3-to-2 reduction.
• A 1-bit full adder provides a 3:2 compression in the number of bits. The addition of partial products in a column of an array multiplier may be thought of as totaling up the number of 1s in that column, with a carry being passed to the next column to the left.
• Adding these few bits is equivalent to complete sign extension
Other Multiplier structures
• Serial multiplier: very compact but very slow; an (M+N)-bit product requires Td = MN clock cycles
• Serial/parallel multiplier: very modular, good trade-off; Td = M+N cycles
Multipliers - Summary
• Optimization goals differ from those of the binary adder
• Once again: identify the critical path
• Other possible techniques
- Logarithmic versus linear (Wallace tree multiplier)
- Data encoding (Booth)
- Pipelining
FIRST GLIMPSE AT SYSTEM LEVEL OPTIMIZATION
Floating-point units
• More complex operation/more time
• Fewer accesses
• Often designed outside the normal ALU
• Co-processor
• Floating-point representation
• $\text{Data} = (-1)^{\text{sign}} \times 0.1\,\text{Fraction} \times 2^{\text{exp}}$
• Normalization:
– ½ <= Data < 1 (Exp = 0, Sign = 0)
– The first digit after the binary point is one
– No need for representing it
• A single-cycle comparator based on the priority-encoding algorithm and dynamic circuit design technique [Huang 2002]
• 4 steps:
1. An XOR gate is used to determine whether each corresponding bit of the two numbers is equal or not.
2. A priority encoder is used to set the most significant unequal bit of the result from step 1 to ‘1’ and reset all other bits to ‘0’.
3. The result of step 2 is “ANDed” with the two input numbers.
4. All the bits of the results of step 3 are “ORed” together to determine which number is greater.
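The four steps can be mirrored in a short behavioral model. This is only a functional sketch of the steps as listed, not the dynamic-circuit implementation from [Huang 2002]; the function name, the unsigned interpretation of the inputs, and the pair of boolean outputs are assumptions.

```python
def pe_compare(a, b, width):
    """Priority-encoder-based comparison of two unsigned `width`-bit values.

    Returns (a_greater, b_greater); both are False when a == b.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    diff = a ^ b                                              # step 1: bitwise XOR
    onehot = (1 << (diff.bit_length() - 1)) if diff else 0    # step 2: keep only the MS unequal bit
    a_sel = a & onehot                                        # step 3: AND with each input
    b_sel = b & onehot
    a_greater = a_sel != 0                                    # step 4: OR all bits together
    b_greater = b_sel != 0
    return a_greater, b_greater


print(pe_compare(0b1010, 0b0110, 4))
```

At the most significant differing bit exactly one of the two inputs holds a 1, and that input is the larger number, so at most one of the two OR results can be true.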
Dynamic Priority Encoder
Critical path: 7 transistors because of the NAND gate implementation
Wide bit width comparator – 64 bits
• Hierarchical, multi-stage
• Phase pipelining to achieve single clock
New comparator not using Priority encoder
• The new algorithm uses a parallel MSB checking method instead of priority encoding to locate the most significant bit in which the two inputs differ.
• This method facilitates the use of NOR-type logic gates and results in faster dynamic logic implementations.
New algorithm
• 4 steps
1. Both AB' and A'B are computed. Unlike the original PE algorithm, which uses an XOR gate to find the bits in which A and B differ, this step also keeps the information about which number is larger at each bit location. E.g. a result AB' of 4'b0010 indicates that at bit 1, A is larger than B.
2. A data conversion (calculating A* and B*) is done to determine the most significant bit that is a '1' in the result of step 1. Unlike the priority encoder, which sets the most significant 1-bit to '1' and resets all the other bits to '0', we set all the bits preceding the most significant 1-bit (not including the most significant 1-bit itself) to '1' and reset all the other bits to '0'. By doing so, the implementation can be done using NOR-type dynamic logic.
3. We calculate (A*)'B* and A*(B*)'. If A* has a longer run of zeros, A*(B*)' will be all zero and (A*)'B* will have some bits equal to 1, and vice versa.
4. We check whether each result of step 3 is an all-zero vector by ORing all its bits together. A corresponding zero vector means that the other input is the greater one.
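One possible reading of these four steps, as a behavioral sketch: "preceding bits" is taken to mean the bits more significant than the leading 1, and a step-1 result of all zeros is converted to all ones. Both of these, along with the function names, are assumptions made so the model is complete; the slides do not spell them out.

```python
def leading_ones_above_msb(v, width):
    """Step 2 conversion: 1s in every position above the most significant 1 of v.

    Assumed edge case: if v is all zeros, the result is all ones.
    """
    mask = (1 << width) - 1
    return mask & ~((1 << v.bit_length()) - 1)


def new_compare(a, b, width):
    """Comparator without a priority encoder; returns (a_greater, b_greater)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    ab_n = a & ~b & mask                          # step 1: AB' (bits where A is larger)
    an_b = ~a & b & mask                          #         A'B (bits where B is larger)
    a_star = leading_ones_above_msb(ab_n, width)  # step 2: A*
    b_star = leading_ones_above_msb(an_b, width)  #         B*
    na_b = ~a_star & b_star & mask                # step 3: (A*)'B*
    a_nb = a_star & ~b_star & mask                #         A*(B*)'
    a_greater = na_b != 0                         # step 4: OR-reduce each vector
    b_greater = a_nb != 0
    return a_greater, b_greater


print(new_compare(0b1010, 0b0110, 4))
```

For A = 1010 and B = 0110, AB' = 1000 and A'B = 0100, so A* = 0000 and B* = 1000; A*(B*)' is all zero and (A*)'B* is nonzero, which identifies A as the larger input.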