Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4): Presentation 12, MAD MAC 525, 26th April, 2006, Short Final Presentation.

1

Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4)

Presentation 12

MAD MAC 525

26th April, 2006

Short Final Presentation

W2

Project Objective: Design a crucial part of a GPU, the Multiply Accumulate (MAC) unit, which will revolutionize graphics.

Design Manager: Zack Menegakis

2

Agenda

• Marketing (Jigar)

• Project Description (Farhan)

• Algorithmic Description (Farhan)

• Design Process (Sonali)

• Floorplan Evolution (Sonali)

• Layout (Avni)

• Design Specifications (Avni)

• Conclusion (Jigar)

3

MARKETING

• Application of product: HDR rendering in gaming graphics

• Why HDR? Used in games like Far Cry

• Optimization for speed (chosen because of market demand)

• Competition: possible barriers to entry if we enter the market

4

MAD MAC and HDR

• What is HDR?

• Show animation explaining concept

5

MAD MAC and HDR

• MAD MAC accelerates FP16 blending to enable true HDR graphics

• What is HDR?

• HDR = High Dynamic Range

• Dynamic range is defined as the ratio of the largest value of a signal to the lowest measurable value

• Dynamic range of luminance in real-world scenes can be 100,000 : 1

• With HDR rendering, pixel intensities are allowed to extend beyond the [0..1] range of traditional graphics

• Nature isn’t clamped to [0..1] and neither should CG be

• In lay terms:

• Bright things can be really bright

• Dark things can be really dark

• And the details can be seen in both
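The effect of leaving the [0..1] range can be illustrated with a small sketch. It reads the FP16 limits through Python's `struct` module (the `e` format is IEEE 754 half precision, the same 1-5-10 bit layout as OpenEXR half) and compares a clamped blend with an unclamped one. The pixel values `sun` and `lamp` and the helper `blend` are hypothetical, chosen only for illustration.

```python
import struct

def half_from_bits(bits):
    """Interpret a 16-bit pattern as a half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

largest = half_from_bits(0x7BFF)          # largest finite FP16 value
smallest_normal = half_from_bits(0x0400)  # smallest normal FP16 value
fp16_range = largest / smallest_normal    # dynamic range of normal FP16 values

def blend(src, alpha, dst, clamp):
    """One blend step, src*alpha + dst, optionally clamped to [0, 1]."""
    value = src * alpha + dst
    return min(max(value, 0.0), 1.0) if clamp else value

# Hypothetical intensities: a very bright source and a moderately bright one.
sun, lamp = 40.0, 2.0
ldr = [blend(x, 0.5, 0.25, clamp=True) for x in (sun, lamp)]
hdr = [blend(x, 0.5, 0.25, clamp=False) for x in (sun, lamp)]

print("FP16 range covers 100,000:1:", fp16_range > 100_000)
print("clamped:", ldr)
print("unclamped:", hdr)
```

With clamping, both pixels end up at the same value, so the difference between them is lost; without clamping the two intensities stay distinct.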

6

7

PROJECT DESCRIPTION

• Multiply Accumulate unit (MAC)

  • Executes the function AB+C on 16-bit floating-point inputs in OpenEXR format

  • Multiplies and adds in parallel to greatly speed up operation

  • Rounding is performed only once, giving greater accuracy than separate multiply and add operations

• Also known as:

  • Fused Multiply Add (FMA)

  • Multiply Add (MAD/MADD) in graphics shader programs

• Many applications benefit from a fast FMA

  • Graphics – HDR rendering, blending and shader ops

  • DSPs – computing vector dot-products in digital filters

  • Fast division, square root – eliminates extra hardware

• Available in many newer CPUs and DSPs because it’s so cool

• One ring (circuit) to rule them all!
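The accuracy benefit of rounding once can be seen in a small numeric sketch. It uses Python's `struct` `e` format to round to half precision (round-to-nearest-even), and the operand values are picked by hand so that the double-precision intermediate is exact. This is an illustration of the general FMA property, not a model of the MAD MAC circuit.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# All three operands are exactly representable in half precision.
a = 1.0 + 2.0 ** -10
b = 1.0 + 2.0 ** -9
c = -1.0

# Separate multiply and add: the product is rounded, then the sum is rounded.
separate = to_half(to_half(a * b) + c)

# Fused multiply-add: the exact value of a*b + c is rounded once.
fused = to_half(a * b + c)

print("separate:", separate)
print("fused:   ", fused)
print("difference:", fused - separate)
```

The separate version loses the low-order term of the product before the subtraction exposes it, while the fused version keeps it.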

8

ALGORITHMIC DESCRIPTION

• Step through entire process

• Multiply and align occur concurrently; C is always aligned to A*B

• Results then pass through the adder, normalize, round, overflow checker and output register
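The flow above can be sketched as a behavioral model on 16-bit patterns. This is a minimal sketch under stated assumptions, not the team's Verilog: round-to-nearest-even is assumed (the slides do not name the rounding mode), denormals are flushed to zero in line with the decision not to implement denormal arithmetic, infinity/NaN inputs are not handled, and a zero result is always returned as +0. The names `decode`, `fma_half`, `bits` and `value` are hypothetical.

```python
import struct

BIAS, FRAC_BITS = 15, 10  # half precision: 1 sign, 5 exponent, 10 fraction bits

def bits(x):
    """Half-precision bit pattern of a Python float."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def value(h):
    """Python float for a half-precision bit pattern."""
    return struct.unpack("<e", struct.pack("<H", h))[0]

def decode(h):
    """Split a pattern into (sign, unbiased exponent, 11-bit significand)."""
    sign, exp, frac = h >> 15, (h >> FRAC_BITS) & 0x1F, h & 0x3FF
    if exp == 0:                      # zero or denormal: treated as zero (assumption)
        return sign, 0, 0
    return sign, exp - BIAS, frac | 0x400   # restore the hidden leading 1

def fma_half(a, b, c):
    sa, ea, ma = decode(a)
    sb, eb, mb = decode(b)
    sc, ec, mc = decode(c)
    # Multiply: 11 x 11 significands give a product scaled by 2^(ea+eb-20).
    prod, ep, sp = ma * mb, ea + eb - 2 * FRAC_BITS, sa ^ sb
    # Align: bring C and the product to a common scale (exact with big integers).
    ecs = ec - FRAC_BITS
    low = min(ep, ecs)
    p = (prod << (ep - low)) * (-1 if sp else 1)
    q = (mc << (ecs - low)) * (-1 if sc else 1)
    # Add/subtract and determine the sign.
    total = p + q
    if total == 0:
        return 0                      # sign of zero simplified to +0
    sign, mag = (1 if total < 0 else 0), abs(total)
    # Normalize: locate the leading 1.
    msb = mag.bit_length() - 1
    exp = low + msb
    # Round once, to 11 significant bits, nearest-even.
    shift = msb - FRAC_BITS
    if shift > 0:
        kept, rem, half = mag >> shift, mag & ((1 << shift) - 1), 1 << (shift - 1)
        if rem > half or (rem == half and kept & 1):
            kept += 1
            if kept == 0x800:         # rounding carried into a new bit
                kept >>= 1
                exp += 1
    else:
        kept = mag << -shift
    # Overflow check (and flush of results below the normal range).
    if exp > 15:
        return (sign << 15) | 0x7C00  # infinity
    if exp < -14:
        return sign << 15
    return (sign << 15) | ((exp + BIAS) << FRAC_BITS) | (kept & 0x3FF)

print(value(fma_half(bits(2.0), bits(3.0), bits(1.0))))
```

In hardware the alignment shifter has a fixed width, whereas this sketch uses Python's arbitrary-precision integers to keep the intermediate sum exact.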

9

Block Diagram

[Figure: three 16-bit inputs feed RegArray A, RegArray B and RegArray C. Blocks shown: Multiplier, Exp Calc, Align, Adder/Subtractor, Control Logic & Sign Determine, Leading 0 Anticipator, Normalize, Round, Ovf Checker, and the 16-bit output register RegY.]

10

IMPLEMENTATION

• Implementation of each module: how and why we chose a particular method, keeping in mind the goal of speed (multiplier, adder)

11

Design Decisions (contd.):

• Multiplier Implementation

– 11 x 11 Carry-Save Multiplier

– Reasons:

• Fast because it avoids having ripple carry in every stage

• Enables Compact Layout
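The carry-save idea can be sketched in software: each partial product is folded into a (sum, carry) pair with a 3:2 compressor, and a carry-propagating addition happens only once at the end. This is a bit-parallel illustration of the principle with hypothetical function names, not the team's circuit.

```python
def carry_save_add(x, y, z):
    """3:2 compressor on whole words: x + y + z == s + c, with no carry ripple."""
    s = x ^ y ^ z                               # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1      # per-bit carry, moved up one place
    return s, c

def carry_save_multiply(a, b, width=11):
    """Multiply two unsigned width-bit values by accumulating in carry-save form."""
    s, c = 0, 0
    for i in range(width):
        partial = (a << i) if (b >> i) & 1 else 0   # one row of the array
        s, c = carry_save_add(s, c, partial)
    return s + c                                # single carry-propagate add at the end

# 11-bit significands, as in a half-precision multiplier (hidden bit included).
print(carry_save_multiply(1025, 1026))
```

Each row costs only one level of full-adder delay because carries are passed to the next row instead of rippling across the word.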

12

Design Process

• Verilog -> Schematic -> Layout

– Behavioral -> Structural Verilog

– Transistors/gates -> Full Schematic

– Gate/Component Layout -> Top Level

• Transistor count fluctuated from 20,200 to 12,800

• Major design decisions

– Decided against implementing denormal arithmetic because it would increase the complexity of the project beyond the scope of the class

– Rounding performed only once, at the end

– Picked nPass over Tgate in the normalize shifter

– Adder: variable-length carry select -> Han-Carlson binary tree adder
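For readers unfamiliar with the Han-Carlson structure, a minimal software sketch follows. It models the prefix network on (generate, propagate) pairs: one Brent-Kung style stage on odd bits, Kogge-Stone stages on odd bits only, and a final stage that fills in the even bits. The default width of 16 and the function names are assumptions for illustration; the slides do not give the adder width or its exact cell arrangement.

```python
def combine(hi, lo):
    """Prefix operator: merge (generate, propagate) of a span with the span below it."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def han_carlson_add(a, b, width=16):
    """Add two unsigned width-bit values using a Han-Carlson prefix network."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # bit propagate
    node = list(zip(g, p))
    # Stage 1: each odd bit absorbs its even neighbour.
    for i in range(1, width, 2):
        node[i] = combine(node[i], node[i - 1])
    # Kogge-Stone stages, applied to odd bits only (distances 2, 4, 8, ...).
    d = 2
    while d < width:
        prev = list(node)            # all cells in a stage read the previous stage
        for i in range(d + 1, width, 2):
            node[i] = combine(prev[i], prev[i - d])
        d *= 2
    # Final stage: each even bit picks up the full prefix of the odd bit below it.
    for i in range(2, width, 2):
        node[i] = combine(node[i], node[i - 1])
    # Carry into bit i+1 is the group generate of bits 0..i (carry-in is 0).
    carries = [0] + [gp[0] for gp in node]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total | (carries[width] << width)

print(hex(han_carlson_add(0x1234, 0xABCD)))
```

Restricting the Kogge-Stone stages to odd bits roughly halves the prefix cells and wiring, at the cost of one extra logic level at the end.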

13

VERIFICATION OF DESIGN

• Verilog simulations (show outputs)

– Overview

– How/Why it works

– Behavioral/Structural

• Explain why we couldn’t get a high-level simulator and how we tested our Verilog design

14

SCHEMATICS

• Show schematics of major blocks: adder, multiplier, and top-level

• HOW WE VERIFIED: analog simulation

15

Top Level Schematic

16

Multiplier Schematic

17

Adder Schematic

18

FLOORPLAN EVOLUTION

• Initial floorplan

• How it evolved (with animation): why and how we changed it

19

Main Floorplan

[Figure: floorplan with blocks Reg A, Reg B, Reg C, Multiplier, Align C, ExpCalc, Adder, Ld Zero, Normalize, Round and Reg Y, separated by three Pipeline Reg rows.]

20

Floorplan

21

Full Chip Layout

[Figure: full chip layout with labeled regions Exponent, Align, Zero, Adder, Multiplier, Normalize, Round and Ovf.]

22

Pipelining

• Initially planned 5-6 pipeline stages

• Reduced to 4 pipeline stages – made possible by implementing fast carry lookahead adders in critical path modules (adder and multiplier)

23

Pipelining Stages

[Figure: floorplan annotated with pipeline stages. Blocks shown: Reg A, Reg B, Reg C, Multiplier, Align C, ExpCalc, Adder, Ld Zero, Normalize, Round, Overflow checker and Reg Y, with Pipeline Reg rows between stages.]

24

LAYOUT

• Final Layout

• Layout of large blocks such as multiplier, adder and normalize

25

Layout Decisions

• 3 standard cell heights

• Uniform width vdd and ground rails

• Wider vdd and ground rails in power hungry modules

• Max of 8 flip flops per clock pulse generator

• Metal directionality

26

Multiplier Layout with pipelining

27

Adder Layout

28

Normalize Layout

29

FINAL LAYOUT

30

Design Specifications

• Worst case delay = 2.25ns

• Long buses are all buffered (not tested yet)

• Estimated clocking speed = 400MHz

• Height by width = 193.86 um * 301.545 um

• Area = 58,458 um^2

• Aspect ratio = 1:1.55

• Total transistor density = 0.22 transistors/um^2
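The derived figures on this slide can be cross-checked from the stated dimensions and delay. The sketch below uses only numbers from the slides (the total transistor count of 12,730 comes from the transistor-count table); the variable names are hypothetical.

```python
height_um, width_um = 193.86, 301.545   # layout dimensions from the slide
worst_case_delay_ns = 2.25              # worst-case stage delay from the slide
transistors = 12_730                    # total from the transistor-count table

area_um2 = height_um * width_um             # layout area
aspect = width_um / height_um               # width relative to height
fmax_mhz = 1e3 / worst_case_delay_ns        # upper bound on clock rate from the delay
density = transistors / area_um2            # transistors per square micron

print(f"area    = {area_um2:,.0f} um^2")
print(f"aspect  = 1:{aspect:.3f}")
print(f"f_max   = {fmax_mhz:.0f} MHz")
print(f"density = {density:.2f} transistors/um^2")
```

The delay alone would allow a clock somewhat above the quoted 400 MHz estimate, which leaves some margin for clock overhead.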

31

Layout densities

• Active : 14.05%

• Poly : 9.25%

• Metal 1 : 33.89%

• Metal 2 : 18.00%

• Metal 3 : 14.99%

• Metal 4 : 6.29%

32

Layer Masks - Poly

33

Layer Masks – Metal 1

34


35


36


37

Module          Schematic Power   Layout Power   Schematic Delay   Layout Delay
                (mW, 350 MHz)     (mW)
Multiplier      2.97              N/A            3.38n             N/A
 -w/ pipeline   ??                ??             1.9n              2.25n
Exponents       1.608             2.21           1.01n             1.2n
Align           0.094             0.113          480p              637p
Adder           8.48              9.73           1.34n             1.7n
Leading 0       0.232             0.857          506p              551p
Normalize       1.458             1.546          407p              437p
Round           0.631             1.21           864p              986p
OvfCheck        0.13              0.19           453p              475p
Registers       ??                ??             179p              193p
Total           ??                ??             -                 -

38

Module                    Area (um^2)   Transistor Count   Transistor Density
Multiplier w/ pipeline    20,388        4,496              0.22
Exponents                 5,163         738                0.14
Align                     3,995         500                0.13
Adder                     13,202        3,174              0.24
Leading 0                 1,253         364                0.29
Normalize                 3,190         942                0.3
Round                     1,802         494                0.28
OvfCheck                  200           70                 0.35
Registers, etc            N/A           1,948              N/A
Total                     58,458        12,730             0.22

39

Conclusion

• More marketing

• Summarize chip functionality

• Extending applications of chip

40

Comments?
