
Dec 19, 2015


EE 382 Processor Design Winter 98/99 Michael Flynn 1

AT Arithmetic

• Most concern has gone into creating fast implementation of (especially) FP Arith.

• Under the AT (area-time) rule, area is (almost) as important.

• So it’s important to know the latency, bandwidth and area that any particular algorithm requires.

Integer addition

• Adders are the fundamental building block of the processor, defining the time unit t.

• Adder types include
– carry chain
– carry select (conditional sum)
– carry lookahead (Brent-Kung)
– canonic (prefix)
– carry skip
– Ling

• Most high speed 32b adders take about the same area (f normalized): 1 A to 1.5 A.

Integer addition

• Both area and time scale with n, the adder precision. The delay scales slowly (as log n).

• Area scales about linearly with n, so a 64b adder takes 2-3 A but still fits into t, perhaps by definition of a “cycle”.
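The log n delay claim can be illustrated with a parallel-prefix adder. The sketch below uses a Kogge-Stone-style prefix network, chosen only for brevity; it is not taken from the slides, and the function name `prefix_add` is hypothetical. It counts the prefix levels needed for an n-bit add.

```python
def prefix_add(a, b, n):
    """Add two n-bit unsigned integers with a Kogge-Stone-style prefix network.

    Returns (sum mod 2**n, carry_out, number_of_prefix_levels).
    """
    # Per-bit generate and propagate signals.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]

    G, P = g[:], p[:]
    levels = 0
    dist = 1
    while dist < n:
        new_g, new_p = G[:], P[:]
        for i in range(dist, n):
            # Combine the group ending at bit i with the group ending dist bits lower.
            new_g[i] = G[i] | (P[i] & G[i - dist])
            new_p[i] = P[i] & P[i - dist]
        G, P = new_g, new_p
        dist *= 2
        levels += 1

    # After the last level, G[i] is the carry out of bit i (carry-in is 0).
    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[n - 1], levels
```

Under these assumptions a 32b add needs 5 prefix levels and a 64b add needs 6, while the number of cells per level grows with n, which is consistent with delay growing as log n and area growing roughly linearly or a little faster.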

Carry skip adder

Manchester carry chain

Carry skip logic

Carry select addition

FP addition

• A basic FP adder has 5 steps
– exponent difference
– pre align
– significand add
– post align
– round

• Assuming that a full shifter has about the same complexity (delay and area) as an add, 64b FP addition takes 7-10 A and has about 5 t execution.
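The five steps can be sketched on a toy format. This is a minimal illustration under stated assumptions, not the slides' design: operands are positive, each is a hypothetical pair `(exponent, significand)` with a p-bit normalized integer significand (value = significand × 2^exponent), three extra low-order bits (guard, round, sticky) are kept, and rounding is round-to-nearest-even.

```python
def fp_add_same_sign(a, b, p):
    """Add two positive toy FP numbers (exp, sig), sig normalized to p bits."""
    (ea, sa), (eb, sb) = a, b

    # Step 1: exponent difference (make 'a' the operand with the larger exponent).
    if ea < eb:
        (ea, sa), (eb, sb) = (eb, sb), (ea, sa)
    d = ea - eb

    # Step 2: pre-align the smaller operand, keeping 3 extra bits;
    # bits shifted out are ORed into the lowest (sticky) bit.
    big = sa << 3
    wide = sb << 3
    small = wide >> d
    if wide & ((1 << d) - 1):
        small |= 1

    # Step 3: significand add.
    total = big + small
    e = ea

    # Step 4: post-align (a same-sign add can overflow by at most one bit).
    if total >= 1 << (p + 3):
        total = (total >> 1) | (total & 1)
        e += 1

    # Step 5: round to nearest even using the 3 extra bits.
    sig, low = total >> 3, total & 7
    if low > 4 or (low == 4 and sig & 1):
        sig += 1
    if sig == 1 << p:  # rounding overflowed the significand
        sig >>= 1
        e += 1
    return e, sig
```

Effective subtraction, which needs the full renormalizing shift in the post-align step, is deliberately left out of this sketch.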

FP addition

Advanced FP adders are faster and use more area:

1) Two path FADD creates separate paths for operands:

• a path for operands whose exponents are close in value (subtract); this is the only case that needs a full shift to renormalize the result

• a path for all other cases, where the exponent difference is > 2 (this is the only case that uses a full shift to prealign significands)

2) A FADD with integrated rounding. Here the rounding step is eliminated by computing both the sum/difference and the result plus 1; this is done by using 2 adders (or a compound adder) and then MUXing out the final result.
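Integrated rounding can be sketched as a compound add followed by a select. The example below is a simplified illustration with hypothetical names: it assumes the larger operand has no bits below the result LSB, so the discarded low bits come only from the aligned smaller operand, and it ignores carry-out renormalization. Rounding is round-to-nearest-even.

```python
def compound_add(x, y):
    """Model of a compound adder: produces x + y and x + y + 1 together."""
    return x + y, x + y + 1


def integrated_round_add(hi_a, hi_b, low_bits, nlow):
    """Add the high parts and pick the rounded result with a MUX-like select.

    low_bits holds the nlow discarded bits of the aligned smaller operand.
    """
    s0, s1 = compound_add(hi_a, hi_b)
    half = 1 << (nlow - 1)
    # Round up above the halfway point, or at a tie when s0 is odd.
    round_up = low_bits > half or (low_bits == half and (s0 & 1) == 1)
    return s1 if round_up else s0
```

Because both candidates are available at the same time, the select replaces a separate rounding increment after the add, which is where the saved t comes from.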

FP adders

• The two path FP adder uses an additional significand adder and exponent adder, about 3-4 A. It reduces FADD delay by one t.

• Integrated rounding adds another rounding adder plus MUX, another 3-4 A, while reducing delay by another t.

FP adders

• Net area time tradeoff
– Basic: area 10 A and delay 4-5 t
– Two path: area 13.5 A and delay 3-4 t
– Integrated round (with two paths): area 17 A and delay 2-3 t

• For pipelining, add 1 A per pipe stage and use the upper range on t.
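To compare the three designs under the AT rule, the area-time product can be worked out directly from the figures above. The sketch below only multiplies the quoted numbers; the dictionary layout and function names are illustrative, and the number of pipe stages is a free parameter because the slides do not fix it.

```python
# (area in A, (min delay, max delay) in t), taken from the tradeoff list above.
DESIGNS = {
    "basic": (10.0, (4, 5)),
    "two-path": (13.5, (3, 4)),
    "integrated round": (17.0, (2, 3)),
}


def at_range(area, delay_range):
    """Return the (low, high) area-time product in units of A*t."""
    t_lo, t_hi = delay_range
    return area * t_lo, area * t_hi


def pipelined_area(area, stages):
    """Apply the rule of thumb of 1 A extra per pipe stage."""
    return area + 1.0 * stages


for name, (area, delay) in DESIGNS.items():
    lo, hi = at_range(area, delay)
    print(f"{name}: {lo} to {hi} A*t")
```

The products come out as 40 to 50 A·t for the basic adder, 40.5 to 54 A·t for two path, and 34 to 51 A·t for integrated rounding, so on these estimates the faster designs cost roughly the same in AT terms.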

Multipliers

• After add, the most important arithmetic op

• Approaches
– encode the multiplier bits (Booth 2, Booth 3, ...)
– assimilate the partial products
   • one, two or n pass (iterated arrays or trees)
   • arrays (simple, double, higher level)
   • trees (Wallace, binary [4:2], ZD, ...)
– CPA to produce product

Multipliers

• Integer and FP multipliers usually have about the same execution time (with same precision, n)

• Booth reduces number of pp’s but adds MUXs to generate the pp’s.

• Most of the area, and probably delay too, is in the pp reduction tree.
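How Booth encoding reduces the number of pp's can be sketched as follows, assuming "Booth 2" means recoding the multiplier two bits at a time into digits in {-2, -1, 0, 1, 2}. The function names are hypothetical, and the multiplier is taken to be an unsigned n-bit value with n even.

```python
def booth2_digits(y, n):
    """Recode an unsigned n-bit multiplier (n even) into n//2 + 1 signed digits.

    Each digit is in {-2, -1, 0, 1, 2} and has weight 4**i.
    """
    digits = []
    prev = 0  # the bit just below the current pair (y[-1] = 0)
    for i in range(n // 2 + 1):
        b0 = (y >> (2 * i)) & 1
        b1 = (y >> (2 * i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def booth2_multiply(x, y, n):
    """Form one partial product per digit (a select among 0, ±x, ±2x) and sum them."""
    partial_products = [(d * x) << (2 * i) for i, d in enumerate(booth2_digits(y, n))]
    return sum(partial_products)
```

For a 16b unsigned multiplier this gives 9 partial products instead of 16; each one is a selection among 0, ±x and ±2x, which is where the extra MUXs come from.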

16 bit Booth 2 multiply

16 bit Booth 2 example

16 bit Booth 2 pp selector logic

16 bit Booth 3 multiply

5 x 5 unsigned multiplication

1-bit adder

Wallace tree

Wallace tree reduction

Multipliers

• A full tree implementation of a 54b (FP type) multiplier with Booth 2 has tree height 28 and uses about 2500 CSAs (or about 50 A in the tree). Maybe a total of 10 A in MUXs plus 50 A in the tree and 3 A in the CPA, about 63 A total. The fastest multiplier is, maybe, 2 t.

• Using a 2 pass tree reduces the hardware considerably; height is 14 using about 700 CSAs or 14 A; total area 5 + 14 + 3 = 22 A; 3-4 t.
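The partial-product assimilation step can be sketched with 3:2 carry-save adders (CSAs). This is a minimal word-level model, not the slides' circuit: each CSA takes three rows and returns a sum row and a carry row, rows are reduced greedily in groups of three, and a final ordinary addition stands in for the CPA. Reading the slides' "height" as the number of partial-product rows is an assumption.

```python
def csa(a, b, c):
    """3:2 carry-save adder on whole rows: a + b + c == sum_row + carry_row."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1
    return sum_row, carry_row


def reduce_tree(rows):
    """Reduce rows to at most two using CSA levels; return (rows, levels)."""
    rows = list(rows)
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(csa(rows[i], rows[i + 1], rows[i + 2]))
        nxt.extend(rows[full:])  # leftover rows pass through to the next level
        rows = nxt
        levels += 1
    return rows, levels


def tree_multiply(x, y, n):
    """Unsigned n-bit multiply: simple pp generation, CSA tree, then a CPA."""
    partial_products = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    rows, levels = reduce_tree(partial_products)
    return sum(rows), levels  # sum() plays the role of the carry-propagate adder
```

With this greedy grouping, 28 rows reduce to two in 7 CSA levels and 14 rows in 6, so halving the row count with a second pass saves far more area than delay per pass.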

Multipliers

• To pipeline the multiplier we need a full tree implementation; probably 3-4 t.

• Perhaps Booth 3, followed by a full tree (h = 17) and a CPA stage.

• Probably area = 50-60 A.

Divide

• Infrequent op, but long latency can affect IPC achieved.

• Algorithms:
– SRT 2 or 3 bit (32-36 t), maybe 6-10 A
– NR or binomial expansion (10-14 t); needs at least 6 A for table and control plus use of MPY
– Bipartite tables for small n (less than 24b)

Divide

SRT creates quotient 2 or 3 bits/iteration

– uses a lookup table indexed by divisor and partial remainder for the trial quotient, then subtracts

– result (partial rem.) is in redundant form so no restoration is needed; also result is left as a sum and carry pair (no CPA needed)

– fast iteration is possible, sometimes 2x per t
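The no-restoration idea can be shown with a much simpler variant than the one on the slide. The sketch below is radix-2 SRT (one quotient digit per iteration, digits in {-1, 0, 1}) using exact rationals rather than a redundant sum/carry remainder; digit selection compares the shifted remainder against the constants ±1/2 instead of using a lookup table. All names are hypothetical, and the divisor is assumed normalized to [1/2, 1) with 0 ≤ x < d.

```python
from fractions import Fraction


def srt2_divide(x, d, n):
    """Radix-2 SRT division: return (quotient, remainder, digits) after n steps.

    Satisfies x == quotient * d + remainder with 0 <= remainder < d / 2**n.
    """
    x, d = Fraction(x), Fraction(d)
    half = Fraction(1, 2)
    assert half <= d < 1 and 0 <= x < d

    r = x
    q = Fraction(0)
    digits = []
    for i in range(1, n + 1):
        r *= 2
        # Selection needs only a coarse comparison, never the exact divisor.
        if r >= half:
            digit = 1
        elif r < -half:
            digit = -1
        else:
            digit = 0
        r -= digit * d  # a negative remainder is allowed; no restore step
        q += Fraction(digit, 2 ** i)
        digits.append(digit)

    # One final correction if the last remainder is negative.
    if r < 0:
        q -= Fraction(1, 2 ** n)
        r += d
    return q, r / 2 ** n, digits
```

Because a wrong-signed remainder is repaired by a later negative digit, the selection only has to be approximately right, which is what lets a hardware version look at a few leading bits of a redundant remainder.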

Divide

Multiply based dividers use either Newton-Raphson or a binomial series

– if $f(x) = b - 1/x$, the root is at $x = 1/b$; then the NR iteration is $x_{i+1} = x_i (2 - b x_i)$

– convergence is quadratic: each iteration doubles the precision of the result

– so start with a table lookup of 1/b to 8b; 3 iterations then give a 64b result, and a × (1/b) is the quotient
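The doubling of precision can be checked numerically. The sketch below uses exact rationals so the error can be tracked without floating-point noise; the function names are hypothetical, the 8b starting value is obtained by rounding rather than from a real table, and b is assumed normalized to [1, 2).

```python
from fractions import Fraction


def reciprocal_nr(b, table_bits=8, iterations=3):
    """Approximate 1/b by Newton-Raphson; return (x, list of relative errors)."""
    b = Fraction(b)
    assert 1 <= b < 2
    scale = 1 << table_bits
    # Stand-in for the lookup table: 1/b rounded to table_bits fractional bits.
    x = Fraction(round(scale / b), scale)
    errors = [abs(1 - b * x)]
    for _ in range(iterations):
        x = x * (2 - b * x)  # x_{i+1} = x_i (2 - b x_i)
        errors.append(abs(1 - b * x))
    return x, errors


def divide_nr(a, b):
    """Quotient a / b formed as a times the refined reciprocal."""
    x, _ = reciprocal_nr(b)
    return a * x
```

Substituting the iteration into the error term gives $1 - b x_{i+1} = (1 - b x_i)^2$, so an initial relative error of at most $2^{-8}$ becomes at most $2^{-16}$, $2^{-32}$ and $2^{-64}$ after one, two and three iterations.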

Divide

• Divide is not usually pipelined, except for small n implementations.

• Frequently combined with square root in the same implementation.

Sub word concurrency

• Provides 8, 16, 32b concurrent ops within “existing” integer or FP hardware

• In 64b integer unit can do 8x8, or 4x16, or 2x32 ops concurrently

• Since FP units are designed to be faster, one may use them instead: 8x4, or 2x16, or 2x24.

Sub word concurrency

• Usually only for add and multiply

• Implementations are straightforward for add; more complicated for multiply
– requires reorganizing partitions of the pp tree
– affects multiply area and delay marginally (maybe 10% delay and 20% area)

• ISA must define “saturating” arithmetic.
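Sub word addition on an existing 64b datapath amounts to breaking the carry chain at lane boundaries. The sketch below models eight 8b lanes in a 64b word; the masking trick and the function names are illustrative, and unsigned saturation is an assumption, since the slides do not say which saturating behavior the ISA defines.

```python
HI = 0x8080808080808080  # top bit of each 8b lane
LO = 0x7F7F7F7F7F7F7F7F  # low 7 bits of each 8b lane


def packed_add8(a, b):
    """Eight independent wrapping 8b adds inside one 64b word."""
    # Add the low 7 bits of each lane (cannot carry across a lane boundary),
    # then fold in the top bits with XOR so no carry leaves the lane.
    return ((a & LO) + (b & LO)) ^ ((a ^ b) & HI)


def packed_add8_sat(a, b):
    """Eight independent unsigned saturating 8b adds (clamp at 0xFF)."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        total = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= min(total, 0xFF) << shift
    return result
```

For example, adding 0x0001 to 0x01FF gives 0x0100 with wrapping lanes and 0x01FF with saturating lanes; in neither case does the low lane's overflow disturb its neighbor.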