
EE 382 Processor Design Winter 98/99 Michael Flynn 1

AT Arithmetic

• Most attention has gone into creating fast implementations, especially of FP arithmetic.

• Under the AT (area-time) rule, area is (almost) as important.

• So it’s important to know the latency, bandwidth and area that any particular algorithm requires.


Integer addition

• Adders are the fundamental building block of the processor; the adder delay defines the time unit t.

• Adder types include

– carry chain, carry select (conditional sum), carry lookahead (Brent-Kung), canonic (prefix), carry skip, Ling

• Most high speed 32b adders take about the same area (normalized to feature size f)… 1 A to 1.5 A


Integer addition

• Both area and time scale with n, the adder precision. The delay, t, scales slowly (log n).

• Area scales about linearly with n; so a 64b adder takes 2-3 A, but still fits into t … maybe by definition of a “cycle”.
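
The log n delay scaling can be illustrated with a parallel prefix carry computation. The following is a minimal sketch, assuming a Kogge-Stone style prefix network (one of several prefix organizations, not necessarily the one intended in these notes); the name prefix_add and the level counter are illustrative only.

```python
def prefix_add(a, b, n=32):
    """Add two n-bit unsigned integers with a Kogge-Stone style prefix network.

    Returns (sum mod 2**n, number of prefix levels used).
    """
    mask = (1 << n) - 1
    p = (a ^ b) & mask          # per-bit propagate
    g = (a & b) & mask          # per-bit generate
    G, P = g, p                 # group generate / propagate, widened each level
    levels = 0
    dist = 1
    while dist < n:
        # Combine each group with the group ending `dist` bits below it.
        G = (G | (P & (G << dist))) & mask
        P = (P & (P << dist)) & mask
        dist <<= 1
        levels += 1
    carries = (G << 1) & mask   # carry into bit i comes from bits 0..i-1
    return (p ^ carries) & mask, levels


if __name__ == "__main__":
    print(prefix_add(123456, 654321, 32))
    print(prefix_add(1, 1, 64))
```

In this model, doubling the width from 32b to 64b adds one prefix level, while the number of bit cells per level doubles.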


Carry skip adder


Manchester carry chain


Carry skip logic


Carry select addition


FP addition

• A basic FP adder has 5 steps: exponent difference, pre align, significand add, post align, and round.

• Assuming that a full shifter has about the same complexity (delay and area) as an add, then 64b FP addition takes 7 - 10 A, and has about 5 t execution
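
The five steps can be walked through in a small software model. This is only a sketch under simplifying assumptions that are not in the notes: both operands are positive, significands are p-bit integers normalized so that 2**(p-1) <= s < 2**p, three extra low-order bits (guard, round, sticky) are kept, and rounding is to nearest even. The name fp_add is hypothetical.

```python
def fp_add(ea, sa, eb, sb, p=8):
    """Add sa*2**ea and sb*2**eb (both positive, p-bit normalized significands).

    Returns (exponent, significand) of the rounded, normalized sum.
    """
    extra = 3  # guard, round and sticky bits (assumption of this sketch)

    # 1. Exponent difference: make (ea, sa) the operand with the larger exponent.
    if ea < eb:
        ea, sa, eb, sb = eb, sb, ea, sa
    d = ea - eb

    # 2. Pre align: shift the smaller significand right by d, folding any
    #    bits shifted out into the sticky bit.
    big = sa << extra
    small = sb << extra
    kept = small >> d
    sticky = 1 if (kept << d) != small else 0
    small = kept | sticky

    # 3. Significand add.
    total = big + small
    e = ea

    # 4. Post align: a carry out needs a one place right shift.
    if total >> (p + extra):
        total = (total >> 1) | (total & 1)
        e += 1

    # 5. Round to nearest even using the extra bits.
    low = total & ((1 << extra) - 1)
    sig = total >> extra
    half = 1 << (extra - 1)
    if low > half or (low == half and sig & 1):
        sig += 1
        if sig >> p:        # rounding overflowed the significand
            sig >>= 1
            e += 1
    return e, sig
```

Steps 2 and 4 are the two shifts and steps 3 and 5 are the two additions that the area estimate above counts; subtraction, which can need a full renormalizing shift in step 4, is left out of this sketch.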


FP addition

Advanced FP adders are faster and use more area:

1) Two path FADD creates separate paths for operands:

• a path for operands whose exponents are close in value (subtract); this is the only case that needs a full shift to renormalize the result

• a path for other cases where the exponent difference is > 2 (this is the only case that uses a full shift to prealign significands)

2) A FADD with integrated rounding. Here the rounding step is eliminated by computing both the sum/difference and the result plus 1… this is done by using 2 adders (or a compound adder) and then MUXing out the final result.
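
The integrated rounding idea can be sketched as follows. The assumptions here are illustrative and not from the notes: operands are positive, only the aligned smaller operand carries g >= 1 bits below the result LSB, rounding is to nearest even, and a carry out of the top is not renormalized. The name add_with_integrated_rounding is hypothetical.

```python
def add_with_integrated_rounding(a_hi, b, g):
    """Return round-to-nearest-even of a_hi + b / 2**g without a separate rounding add.

    a_hi: significand of the larger operand (no bits below the result LSB).
    b:    aligned smaller operand, carrying g >= 1 extra low-order bits.
    """
    b_hi = b >> g
    low = b & ((1 << g) - 1)    # bits that will be rounded away

    # Compound adder: both candidates are produced side by side.
    s0 = a_hi + b_hi            # sum
    s1 = s0 + 1                 # sum plus 1

    # Rounding decision from the discarded bits and the LSB of the sum.
    half = 1 << (g - 1)
    round_up = low > half or (low == half and (s0 & 1) == 1)

    # Final MUX selects one of the two precomputed results.
    return s1 if round_up else s0
```

The point of the sketch is that the rounding decision only drives a select line; no carry propagate addition follows the significand add.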


FP adders

• The two path FP adder uses an additional significand adder and exponent adder… about 3-4 A. It reduces FADD delay by one t

• Integrated rounding adds another rounding adder plus MUX…another 3-4 A while reducing delay by another t


FP adders

• Net area time tradeoff

• Basic… area 10 A and delay 4-5 t

• Two path… area 13.5 A and delay 3-4 t

• Integrated round (with two paths)… area 17 A and delay 2-3 t

• For pipelining add 1 A per pipe stage and use upper range on t
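
The tradeoff can be tabulated directly. This is a small sketch that assumes the AT figure of merit is simply the product of area and delay; the notes do not state an exact formula, and the names designs, at_range and pipelined_at are illustrative.

```python
# (area in A, (min delay, max delay) in t), taken from the bullets above
designs = {
    "basic": (10.0, (4, 5)),
    "two path": (13.5, (3, 4)),
    "integrated round": (17.0, (2, 3)),
}


def at_range(area, delay_range):
    """Area x delay product at the low and high ends of the delay range."""
    lo, hi = delay_range
    return area * lo, area * hi


def pipelined_at(area, delay_range, stages):
    """Pipelining rule from the notes: add 1 A per stage, use the upper delay."""
    return (area + stages) * delay_range[1]


if __name__ == "__main__":
    for name, (area, delay) in designs.items():
        print(name, at_range(area, delay))
```

Under this assumption the three designs have overlapping area x delay ranges, so the faster adders buy latency mostly at proportional area cost.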


Multipliers

• After add, the most important arithmetic op

• Approaches

– encode the multiplier bits (Booth 2, Booth 3...)

– assimilate the partial products

• one, two or n pass (iterated arrays or trees)

• arrays (simple, double, higher level)

• trees (Wallace, binary[4:2], ZD,….)

– CPA to produce product


Multipliers

• Integer and FP multipliers usually have about the same execution time (with same precision, n)

• Booth reduces number of pp’s but adds MUXs to generate the pp’s.

• Most of the area, and probably delay too, is in the pp reduction tree.
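
How Booth 2 reduces the number of pp's can be seen in a short model. This is a sketch assuming an unsigned n-bit multiplier that is zero extended so the top digit is never negative; the function names are illustrative. Each digit is in {-2, -1, 0, 1, 2}, so every pp is 0, ±x or ±2x, which is what the selection MUXs choose between.

```python
def booth2_digits(y, n):
    """Recode an unsigned n-bit multiplier y into radix-4 Booth digits.

    Digit k has weight 4**k and is b[2k] + b[2k-1] - 2*b[2k+1], with b[-1] = 0.
    """
    digits = []
    prev = 0                          # bit just below the current pair
    for i in range(0, n + 1, 2):      # n + 1 covers the zero-extension digit
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth2_multiply(x, y, n):
    """Multiply by summing one partial product per Booth digit."""
    product = 0
    for k, d in enumerate(booth2_digits(y, n)):
        product += (d * x) << (2 * k)  # pp is 0, +/-x or +/-2x, shifted by 2k
    return product
```

Under this convention a 16b multiplier yields 9 pp's instead of 16, and a 54b multiplier yields 28.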


16 bit Booth 2 multiply


16 bit Booth 2 example


16 bit Booth 2 pp selector logic


16 bit Booth 3 multiply


5 x 5 unsigned multiplication


1-bit adder


Wallace tree


Wallace tree reduction


Multipliers

• A full tree implementation of a 54b (FP type) multiplier with Booth 2 has tree height 28 and uses about 2500 CSAs (or about 50 A in the tree). Maybe a total of 10 A in MUXs plus 50 A in the tree and 3 A in the CPA, about 63 A total. The fastest multiplier is, maybe, 2 t.

• Using a 2 pass tree reduces the hardware considerably; height is 14 using about 700 CSAs or 14 A…total area 5 + 14 + 3 = 22A; 3-4 t
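
The reduction step itself can be modeled with word-wide carry save adders. This is a minimal sketch, assuming a simple Wallace style grouping of rows three at a time; it counts CSA levels, not the per-bit CSA cells quoted above, and the names csa, wallace_reduce and tree_multiply are illustrative.

```python
def csa(a, b, c):
    """3:2 carry save adder on whole words: a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def wallace_reduce(rows):
    """Reduce a list of addends to at most two rows; return (rows, CSA levels)."""
    levels = 0
    while len(rows) > 2:
        nxt = []
        for i in range(0, len(rows) - 2, 3):
            nxt.extend(csa(*rows[i:i + 3]))
        nxt.extend(rows[len(rows) - len(rows) % 3:])  # leftovers pass through
        rows = nxt
        levels += 1
    return rows, levels


def tree_multiply(x, y, n):
    """Unsigned multiply: AND-style pp's, CSA tree, then one final CPA."""
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    rows, levels = wallace_reduce(rows)
    return sum(rows), levels  # sum() stands in for the carry propagate adder
```

With 28 input rows this grouping needs 7 CSA levels before the final carry propagate add.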


Multipliers

• To pipeline the Multiplier we need a full tree implementation; probably 3-4 t.

• Perhaps Booth3, followed by a full tree (h = 17) and CPA stage.

• Probably area = 50 - 60A


Divide

• Infrequent op, but long latency can affect IPC achieved.

• Algorithms:

– SRT 2 or 3 bit (32 - 36 t) maybe 6-10 A

– NR or Binomial expansion (10 - 14 t); needs at least 6 A for table and control plus use of MPY

– Bipartite tables for small n (less than 24b)


Divide

SRT creates quotient 2 or 3 bits/iteration

– uses divisor - partial remainder lookup table for trial quotient then subtracts

– result (partial rem.) is in redundant form so no restoration is needed; also result is left as a sum and carry pair (no cpa needed)

– fast iteration is possible, sometimes 2x per t

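
The no-restoration property can be shown with a simpler variant than the one described above. This sketch uses radix-2 SRT (1 quotient bit per iteration, digits in {-1, 0, 1}) with exact fractions, assuming 1/2 <= d < 1 and |x| <= d; a real design would select digits from a few leading bits of a carry save remainder via a table. The name srt2_divide is hypothetical.

```python
from fractions import Fraction


def srt2_divide(x, d, steps):
    """Radix-2 SRT division of x by d, assuming 1/2 <= d < 1 and |x| <= d.

    Returns (digits, w) with x == d * sum(q_i * 2**-i) + w * 2**-steps.
    """
    half = Fraction(1, 2)
    w = Fraction(x)                 # partial remainder
    d = Fraction(d)
    digits = []
    for _ in range(steps):
        w2 = 2 * w
        # Trial quotient digit from a coarse comparison; a wrong-looking
        # choice is corrected by later negative digits, never restored.
        if w2 >= half:
            q = 1
        elif w2 < -half:
            q = -1
        else:
            q = 0
        w = w2 - q * d
        digits.append(q)
    return digits, w


def digits_to_value(digits):
    """Convert signed quotient digits to an exact fraction."""
    return sum(Fraction(q, 2 ** (i + 1)) for i, q in enumerate(digits))
```

The partial remainder stays bounded by the divisor in magnitude, which is what lets each iteration use a coarse comparison instead of a full subtract and test.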


Divide

Multiply based methods use either Newton Raphson or Binomial series

– if $f(x) = b - 1/x$, the root is at $x = 1/b$; then the NR iteration is $x_{i+1} = x_i (2 - b x_i)$

– convergence is quadratic; it doubles the precision of the result each iteration

– so start with table lookup of 1/b to 8b, then 3 iterations give a 64b result (8 → 16 → 32 → 64); then a x (1/b) is the quotient
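
The doubling of precision can be checked exactly with rational arithmetic. This is a sketch assuming b is normalized to [1, 2) and that the table simply holds 1/b rounded to 8 fractional bits; the names nr_reciprocal and nr_divide are illustrative.

```python
from fractions import Fraction


def nr_reciprocal(b, table_bits=8, iterations=3):
    """Approximate 1/b for b in [1, 2) by Newton Raphson from a table seed."""
    b = Fraction(b)
    # Stand-in for the lookup table: 1/b rounded to `table_bits` fraction bits.
    x = Fraction(round(2 ** table_bits / b), 2 ** table_bits)
    for _ in range(iterations):
        x = x * (2 - b * x)         # x_{i+1} = x_i (2 - b x_i)
    return x


def nr_divide(a, b):
    """Quotient a / b formed as a times the approximate reciprocal."""
    return Fraction(a) * nr_reciprocal(b)
```

Writing e_i = 1 - b x_i, one iteration gives e_{i+1} = e_i^2, so an 8b seed reaches better than 64b relative accuracy after three iterations.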


Divide

• Divide is not usually pipelined, except for small n implementations.

• Frequently combined with square root in the same implementation.


Sub word concurrency

• Provides 8, 16, 32b concurrent ops within “existing” integer or FP hardware

• In 64b integer unit can do 8x8, or 4x16, or 2x32 ops concurrently

• Since FP units are designed to be faster, it may make sense to use them: 8x4, or 2x16, or 2x24.
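
The key hardware idea for sub word add is to break the carry chain at lane boundaries. The following sketch emulates that in software for eight 8b lanes in a 64b word, using a common masking trick rather than anything specified in the notes; the name packed_add_8x8 is illustrative.

```python
MASK64 = (1 << 64) - 1
HIGH_BITS = 0x8080808080808080      # most significant bit of each 8b lane
LOW_BITS = MASK64 ^ HIGH_BITS       # the other seven bits of each lane


def packed_add_8x8(a, b):
    """Add eight unsigned 8b lanes of a and b with wraparound in each lane."""
    # Add the low 7 bits of every lane; the result fits in 8 bits per lane,
    # so no carry crosses a lane boundary.
    low = (a & LOW_BITS) + (b & LOW_BITS)
    # Restore each lane's top bit as a XOR b XOR carry-in, without carry-out.
    return (low ^ ((a ^ b) & HIGH_BITS)) & MASK64
```

In a real adder the same effect comes from gating the carry between partitions, which is why the add case is cheap.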


Sub word concurrency

• Usually only for add and multiply

• Implementations straightforward for add; more complicated for multiply

– requires reorganizing partitions of the pp tree

– affects multiply area and delay marginally (maybe 10% delay and 20% area)

• The ISA must define “saturating” arithmetic.
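
As an illustration of what saturating arithmetic means for sub words, the sketch below clamps each unsigned 8b lane at 255 instead of wrapping. The lane width, the unsigned interpretation and the name saturating_add_u8_lanes are assumptions for the example; an actual ISA defines its own variants.

```python
def saturating_add_u8_lanes(a, b, lanes=8):
    """Add packed unsigned 8b lanes, clamping each lane at 0xFF."""
    result = 0
    for i in range(lanes):
        shift = 8 * i
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        s = min(x + y, 0xFF)        # saturate instead of wrapping around
        result |= s << shift
    return result
```

For example, a lane holding 0xF0 plus 0x20 yields 0xFF here, where a wraparound add would give 0x10.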
