Top Banner
1 Lecture 6 Processor Technology
96

1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

Jan 11, 2016

Download

Documents

Kelley Sanders
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

1

Lecture 6Processor

Technology

Page 2: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

2

Advance in Hardware

INTEL Family: (8086/1978 -- Pentium II/1998)exponential performance improvement over

time• number of transitors: increased almost 2500

times (29 K --> 7.5 M)• clock rate: 45 times (10 MHz -> 450 MHz)

Page 3: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

3

Moore’s Law (1969) The number of transistors on a microchip doubles

about every 18-24 months, assuming the price of the chip stays the same

The speed of a microprocessor doubles about every 18-24 months, assuming price stays the same

The price of a microchip drops about 48% every 18-24 months, assuming the performance metric (processor speed or

memory capacity) of the chip stays the same.

Page 4: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

4

Milestones of Chip Density

1K

10K

100K

1M

10M

70 90 92 94 96 98 00

100M

1G

Num

ber

of T

rans

isto

rs p

er

Chi

p

72 74 76 78 80 82 84 86 88 02

4004

•8080

•8085

8086••68000

80286• 68020•• 80386

LSI LogicGate Array

•80486

••

IBMGateArray

• Pentium ProMPU Only

••

•P7

Pentium ProMPU and CacheMemory Chip

▲4K ▲

16K

▲64K

256K▲

▲ 1M

▲4M

▲16M

▲64M

▲256MLSI LogicGate Array

= Memory (DRAM)

= Microprocossor and Logic

Pentium

1G

Memory Increase = 1.5/yearMPU Increase = 1.35/year

Year

Source: ICE

•▲

Page 5: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

5

Outline

Instruction Set Architecture (ISA)

Pipelining Concepts Processor Technology

CISC, RISC, superscalar, VLIW

Case Study Future processor

Page 6: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

6

Part 1: Instruction Set Architecture

(ISA)

Page 7: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

7

Computer Architecture’s Changing Definition 1950s to 1960s: Computer Arithmetic 1970s to mid 1980s: Instruction Set Design,

especially ISA appropriate for compilers 1990s: Design of CPU, memory system, I/O

system, Multiprocessors

Page 8: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

8

Instruction Set Architecture (ISA)

instruction set

software

hardware

Page 9: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

9

Interface DesignA good interface:

• Lasts through many implementations (portability, compatability)

• Is used in many different ways (generality)

• Provides convenient functionality to higher levels

• Permits an efficient implementation at lower levels

Interfaceimp 1

imp 2

imp 3

use

use

use

time

Page 10: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

10

What Operations Should be in Instruction Set?

How many are possible ? Which ones do we need ? Circuit complexity ? How frequently is each used ? How much slower would each be, if

implemented in terms of simpler ones ?

Page 11: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

11

Typically include:

ALU (25-40 % frequency of use) Data transfer (~15-40 %) Control flow (~15-25 %) System (~ 2%) Floating point (~ 15 %) Decimal and string (~ 15 %)

Page 12: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

12

4 types of flow operations:

Conditional branch (Branches): ~73%Unconditional branches (Jumps): ~14 %Procedure calls + return (Jump): ~ 13%

Page 13: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

13

Data types and sizes ?

How many are possible ? Which ones do we need ? How frequently are they used ? How much slower if implemented in

software ?

Page 14: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

14

Data Types

Integer: short, long, extra long. floating-point: single-, double-, quad-

precision. characters: char, strings. bit fields. binary coded decimal.

Page 15: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

15

Other Issues

What are the most common accesses (profile) ?

What should the instruction format be ?

Page 16: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

16

Conflicting goals:

code compactness less no. of lines in program (at machine level

after compilation) less memory, less I-Fetch bandwidth.

easy decoding want fixed format less expensive and faster I-decode.

Page 17: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

17

Ways to get code compactness: (An ideal case)

Huffman encoding --; e.g,50 % 'A' --- "0"25 % 'B' -- '10'12.5 % 'C" -- '110'12.5 % 'D' -- '111”

Variable length according to frequency Easy to implement ? Cost ?

Page 18: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

18

Evolution of Instruction Sets

Design decisions must take into account: technologymachine organizationprogramming languagescompiler technologyoperating systems

And they in turn influence these

Page 19: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

19

Aspects of CPU Performance

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

Inst Count CPI Clock Rate

Program X

Compiler X (X)

Inst. Set. X X

Organization X (X) X

Technology X

Page 20: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

20

Cycles Per Instruction (CPI)

CPU time = CycleTime * CPI * Ii = 1

n

i i

CPI = CPI * F where F = I i = 1

n

i i i i

Instruction Count

“Instruction Frequency”

Invest Resources where time is Spent!

CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count

“Average Cycles per Instruction”

Page 21: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

21

Example: Calculating CPI

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles CPI(i) (% Time)

ALU 50% 1 .5 (33%)

Load 20% 2 .4 (27%)

Store 10% 2 .2 (13%)

Branch 20% 2 .4 (27%)

1.5

Page 22: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

22

Part 2: Pipelining

Page 23: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

23

Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave

each have one load of clothes to wash, dry, and fold

Washer takes 30 minutes Dryer takes 40 minutes “Folder” takes 20 minutes

A B C D

Page 24: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

24

Sequential Laundry

Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take?

A

B

C

D

30 40 20 30 40 20 30 40 20 30 40 20

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

Page 25: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

25

Pipelined Laundry: (Start work ASAP)

Pipelined laundry takes 3.5 hours for 4 loads

A

B

C

D

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

30 40 40 40 40 20

Page 26: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

26

Computer Pipelining

Overlapping the execution of instructions.Instruction fetch (IF)Instruction decode (ID)Execute (EX)Write back (WB)

Some operation (e.g., IF, ID, EX) is performed on every instruction in the pipeline.

Page 27: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

27

Computer Pipelining

Pipelining increases the Throughput Throughput = no. of instructions executed in

a given time period

Hence, reduces the average execution time per instruction (or CPI).

Page 28: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

28

Pipelining Speedup:

k-stage linear pipeline, n tasks.Pipelined = k+(n-1) cycles.Unpipelined = n x k cycles.See the laundry example.

Speedup = Sk = nk / (k+n-1)

Sk k as n .

Page 29: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

29

Speedup (explanation)

At time t=0, the first pipeline operation enters the pipe.

After k pipeline clock cycles, the 1st result exits.

Then, 1 result exits per clock cycle.

Page 30: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

30

Pipelining Lessons Doesn’t help latency of

single task, but throughput of entire workload

Pipeline rate limited by slowest pipeline stage

Multiple tasks operating simultaneously

Potential speedup = Number pipe stages

Unbalanced lengths of pipe stages reduces speedup

A

B

C

D

6 PM 7 8 9

Task

Order

Time

30 40 40 40 40 20

Page 31: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

31

Computer Pipelines

Execute billions of instructions, so throughput is what matters

desirable features: all instructions same length, registers located in same place in instruction

format, memory operands only in loads or stores

Page 32: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

32

MIPS example: 5 Stage Pipelining

MemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

IRLMD

Page 33: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

33

Visualizing Pipelining

Instr.

Order

Time (clock cycles)

Page 34: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

34

Limits to pipelining: Hazards !!

Hazards prevent next instruction from executing during its designated clock cycleStructural hazards: HW cannot support this

combination of instructions (single person to fold and put clothes away)

Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)

Control hazards: Pipelining of branches & other instructions stall the pipeline until the hazard “bubbles” in the pipeline

Page 35: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

35

One Memory Port/Structural Hazards

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Instr 3

Instr 4

Use reg A

Use reg A

Page 36: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

36

One Memory Port/Structural Hazards

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

stall

Instr 3

Wait for one cycle

Page 37: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

37

Data Hazard on R1

Instr.

Order

Time (clock cycles)

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

IF ID/RF EX MEM WB

All wait for the result of r1

Page 38: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

38

3 Generic “Data Hazards”

Assume InstrI followed by InstrJ

Read After Write (RAW) InstrJ tries to read operand before InstrI

writes it

Page 39: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

39

Read After Write (RAW)

1 2 543

1 2 543

Write

Read

Inst i

Inst j

read the old data.

Page 40: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

40

Write After Read (WAR)

InstrJ tries to write operand before InstrI reads i

Gets wrong operand

Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5

Page 41: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

41

Write After Read (WAR)

1 2 543

1 2 543

Read

Write

Inst i

Inst j

Always read the correct data.

Page 42: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

42

Write After Write (WAW)

InstrJ tries to write operand before InstrI writes it

Leaves wrong result ( InstrI not InstrJ )

Can’t happen in MIPS’s 5 stage pipeline because:

All instructions take 5 stages, and Writes are always in stage 5

Page 43: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

43

Write After Write (WAW)

1 2 543

1 2 543

Write

Inst i

Inst j

Write

Page 44: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

44

Part 3: Processor Technology

Page 45: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

45

Evolution of Processor Design

Single Accumulator (EDSAC 1950)

Accumulator + Index Registers(Manchester Mark I, IBM 700 series 1953)

Separation of Programming Model from Implementation

High-level Language Based Concept of a Family(B5000 1963) (IBM 360 1964)

General Purpose Register Machines

Complex Instruction Sets Load/Store Architecture

RISC

(Vax, Intel 432 1977-80) (CDC 6600, Cray 1 1963-76)

(Mips,Sparc,HP-PA,IBM RS6000, . . .1987)Mixed (1998)

Page 46: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

46

CISC: Complex Instruction Set Computer.

more than 300 instructions ISA variable instruction/data formats small set of 8 to 24 general-purpose registers allow many memory reference operations

(addressing modes) CPI: 1 to 20 cycles, average CPI: 4 cycles Examples: INTEL x86 series (Pentium, Pentium Pro,

Pentium II), Motorola M680X0, Digital VAX 8600, IBM 390, AMD 486, Cyrix 686

Page 47: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

47

Intel CPU FamilyYear Model Features

1981 Intel 8088 16-bit, 29K, max speed 10 MHz

1982 Intel 80286 16-bit, 130K, max speed 12 MHz

1985 Intel 80386 32-bit, 275K, max speed 20 MHz

1989 Intel 80486 32-bit, 2M, 25 MHz

1993- Intel Pentium 3.11M, 66MHz, CMOS, 2-issue, 266 MHzwith MMX (16/16 KB cache)

1995- Intel Pentium Pro(Socket 7)

5.5M, 200 MHz, 8/8 KB cache, Bi-CMOS,64-bit data bus, 3-issue, no MMX

1997- Intel Pentium II(Slot 1)

16/16 KB L1 cache, 256/512KB L2 cache,300 MHz, 32-bit, 3-issue, with MMX

1998 512KB L2 (half speed), 450 MHz, 7.5 Mtransistors

1998 Intel Pentium II Xeon 512KB L2 (full speed), 450 MHz, 32-bit

Page 48: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

48

RISC: Reduced Instruction Set Computer.

Observation: Only 25% of a complex inst set is frequently

used about 95% of the time 75% of the hardware-supported instructions

are rarely used All instructions are of the same length Push rarely used inst into software Adding cache and Floating Point Units (FPU) in

processor chips

Page 49: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

49

RISC: other features Instruction set: less than 100 instructions Fixed 32- or 64-bit instruction format Only 3 to 5 simple addressing modes Single address mode for load/store: base +

displacementno indirection

Simple branch conditions

Page 50: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

50

RISC

Large register files: 32 integer registers + 32 floating point registers,

some has over 100

execute majority of the instruction in one cycle (average CPI: 1.5)

higher clock rate easy for compiler optimization

Page 51: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

51

Examples of RISC processors

SUN: SPARC, MicroSPARC, SuperSPARC, UltraSPARC,

MIPS: R2000/3000/4000/5000/8000, R10000, INTEL: i860, Digital : Alpha 21164, 21264, 21364, IBM, Apple, Motorola : PowerPC 601, 603, 604e,

620, 630, IBM : POWER2 (SP2), POWER 3 (ASCI Blue Pacific), HP: HP PA-RISC PA-8000

Page 52: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

52

Example: MIPS

Op

31 26 01516202125

Rs1 Rd immediate

Op

31 26 025

Op

31 26 01516202125

Rs1 Rs2

target

Rd Opx

Register-Register

561011

Register-Immediate

Op

31 26 01516202125

Rs1 Rs2/Opx immediate

Branch

Jump / Call

Page 53: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

53

Advanced Pipelining

cycle

Successive instruction

(a) Single-issue base pipeline

(b) 3-issue superscalar pipeline(c) Single-issue superpipeline

(d) 3-issue superscalar superpipeline

Page 54: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

54

Superscalar Processors (RISC+CISC)

Multiple instructions issued per cycle (IPC > 1). Clock rate matches that of generic scalar RISC. CPI is lower than generic scalar RISC.

Examples: Alpha 21064 (2-issue), 21164 (4-issue), PowerPC: 604e (4), 620(4) HP PA-7200 (2), PA-8000 (4), MIPS: R5000 (2), R10000 (4), SUN: MicroSPARC (2), UltraSparc-2 (4), INTEL i860 (RISC, 2 issues), Pentium (CISC, 2), Pentium

Pro (3), Pentium II (3)

Page 55: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

55

SUN SPARC (RISC) Scalable Processor ARChitecture, a specification,

not a chip. Larger register set: SPARC 128-144. Generations:

SPARC (1987), MicroSPARC, SuperSparc (1993), UltraSparc (1995), UltraSPACR II, Ultra III, Ultra IV,..

Machines:CM-5 : SPARC 33 MHz.CS-2 : SuperSparc 40 MHz (viking).SUN Sunfire (Enterprise 1000): UltraSparc

Page 56: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

56

UltraSPARC Roadmap

UltraSPARC II: 64-bit, 0.25 micron (same as

Pentium II, AMD K6-2), Max 360 MHz, 30 W, (400

MHz later 1998)

UltraSPARC III: 600 MHz late 1999, 1000 MHz

UltraSPARC IV: mid-2000, 1000 MHz, 0.15

micron, Sun’s first copper-based chip

UltraSPARC V: 1500 MHz, 0.07 micron (, by 2002

Page 57: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

57

PowerPC (RISC)

1991, by Apple, IBM, and Motorola. OS: IBM AIX, Apple Mac OS, NetWare 4,

OS/2, Sun Solaris, and Window NT, MS-DOS.

Technology update: September1998: IBM 400-MHz PowerPC

(copper wiring)

Page 58: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

58

PowerPC Family :

First generation : PowerPC 601: desktop PCs.

The 2nd generation : PowerPC 603 (603e, 166 MHz, 3 W, 81 mm2):

portable+battery-powered applications. PowerPC 604 (604e, 5.6 M transistors, Power 10 W,

dynamic branch prediction logic, 4-issue, 6-stage): sophisticated PCs and servers.

PowerPC 620: integrated L2 controller and dedicated cache interface, 4-issue, 5-stage, 30 W, used in servers or supercomputer.

Page 59: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

59

PowerPC 3rd generation: G3

L2 cache support, new bus architecture 32-bit processor, used in iMAC (1998) 0.25-micron, 67 mm2, 250 MHz, 5 W, 6.35 M

transistors. 5 execution units (similar to 603e),

1 floating point unit, 1 branch unit, 1 load/store unit, 2 single-cycle integer unit (603e only 1), 1 system unit

Page 60: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

60

PowerPC 3rd generation: G3

4-stage pipeline: fetch, decode-dispatch,

execute, complete-writeback

fetch unit fetches 4 instructions per clock peak rate: complete 3 instructions per clock Two 32 KB on-chip L1 caches (data + instruction)

: same as 604e, 8-way set associative

Page 61: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

61

PowerPC 3rd generation: G3

On-chip L2 cache: 2-way set associative of sizes 256 KB, 512 KB or 1 MB

Performance: 250 MHz CPU clock, 50 MHz system bus, half-speed 1-MB L2 cache: 10 SPECint95

Bus protocol: MEI (modified exclusive, invalid), used for single or dual-processor design

Page 62: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

62

IBM POWER

Performance Optimized With Enhanced RISC POWER2, 66.7 MHz, 6-issue, used in IBM SP2

4 floating-point operations at once cycle. peak performance: 266 Mflops (66.7 x 4).

ASCI Blue Pacific : using POWER3 3.9 trillion calculation per second 15,000 times faster than the desktop PC at Lawrence Livermore National Lab. 2.6 trillion bytes of memory 1 second = 63,000 years using hand calculator 4096 POWER3 CPUs (8 CPUs per node)

Page 63: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

63

POWER3: Superscalar RISC 8-issue, (most other processors 4-issue) 200 MHz (slow but fast), 0.25 micron, 5 metal layers, 1088 pin IBM’s first 64-bit microprocessor. Used in ASCI Blue Pacific, 4096 nodes, Memory subsystem: 6.4 GB/s POWER3 workstation, Oct. 23 1998: RS/6000 43P

Model 260, single or dual-processor, up to 8 (in SMP form), 4MB L2 instruction cache+256 MB SDRAM memory

Compatable with PowerPC design

Page 64: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

64

IBM

1999: 0.18 micron, copper wiring 2000: silicon-on-insulator 2001: “gigachip” (POWER4 ??)

Page 65: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

65

MIPS:

R4000 (1991, 64-bit), R8000 (1994), R10000

1995 or 1996-, 30 W. 5.9 M transistors, 32/32 KB cache5-7 pipeline stages, 4-issue

SGI Power Challenge up to 18 X R8000 or x 36 R10000

Page 66: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

66

Other Commodity Processors

HP PA-RISC (Precision Architecture): PA-RISC 7200 (CONVEX Examplar, 128

processors)

DEC: Alpha 21064 (CRAY T3D), 21164 (300 MHz, T3E), 21264 (T3E1200)

Intel 80860 (i860): ``Cray on a Chip'' 66 MFLOPS (Cray 1S = 85

MFLOPS)

Page 67: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

67

VLIW: Very Long Instruction Word. Use even more functional units than that of a superscalar

processor. All instructions are the same length The operations in each work are chosen by the compiler. CPI is further lower than superscalar. Clock rate is slow. Microprogrammed control, synchronization of parallel

operations is entirely done at compile time No commodity processor is designed in VLIW (but it is

coming back !! INTEL 64-bit Merced)

Page 68: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

68

Register File

Load/Store Integer ALU FP Unit Branch Unit

Mainmemory

Load/store FP ADD FP Multiply Branch .... Integer ALU

(b) pipeline execution of VLIW instruction

Cycle

(a) A VLIW processor architecture and instruction format

Page 69: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

69

SummaryCISC RISC VLIW

InstructionComplexity

Varies fromSimple tocomplex

One simple operation Many simpleIndependentoperations

Instruction size Varies One size, usually 32bits

One size

Instructionformat

Field placementvaries

Regular, consistentfield placement

Regular, consistentfield placement

Memory reference Bundled withoperations

No bundled,load/store

architecture

No bundled,load/store

architecture

Hardware designfocus

MicrocodedImplementations;

One or morepipeline

No microcode; oneor more pipelines

Multiple pipelines, nomicrocode, no

complex dispatchlogic

Registers Few, sometimesspecial

Many, generalpurpose

Many, generalpurpose

Page 70: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

70

Processor Performance Processor Clock IPC stage SPEC95int

SPEC95fp

Alpha 21164 500 4(2+2) 7-12 12.6 18.3 PowerPC620 200 4 5 9.0 9.0 PowerPC 604e 225 4 6 8.5 7.0 UltraSPARC II 250 4 6-9 8.5 15 HA-8000 180 4 7-9 10.8 18.3 MIPS R10000 200 4 5-7 8.9 17.2 Pentium Pro 200 3(2+1) 12-14 8.7 6.0

Only 1 floating point unit active at a time

Page 71: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

71

Case Study 1: INTEL Processors

Page 72: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

72

Pentium (430 TX Mother Board)

Main Memory(DRAM

Max 256 MB)

Main Memory(DRAM

Max 256 MB)

Memory controller82349TX (MTXC)

82371AB(PIIX4)

Bus Master

PCI Bus (3.3 V or 5V, 30/33 MHz)

PCIDevice

EISA/ISADevice

EISA/ISADevice

ISA/EIO Bus (3.3 V; 5V)

Pentium processor 32-bit address64-bit data500 MB/s (8x60)Host Bus (3.3 V; 60-66 MHz)

USB USB

L2 cache(Max 512 KB)

HD CD-ROM

BMI IDE (33 MB/s)

(up to 5)

Page 73: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

73

Pentium Pro cache:

8 KB data+8 KB instruction cache (L1)On-board L2 cache: 256 KB or 512 KB

40 general purpose registers Data TLB: 64 entries No MMX up to 200 MHz, 35W:

integration of high-speed CPU with high-speed cache is not easy

Page 74: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

74

Execution in Pentium Pro

Five functional units:Store data unitStore address unitLoad address unit Integer ALU unitFloating point/integer unit

3-issue but only one floating point op Peak flop rate= 200 MFLOP at 200 MHz

Page 75: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

75

Pentium Pro (P6)

Pentium ProPentium ProCoreCore

8 KB L18 KB L1DataData

CacheCache

8 KB L18 KB L1InstructionInstruction

CacheCache

Bus Interface Unit

256/512 KB256/512 KBUnified Unified

L2 cacheL2 cache

LocalAPIC

External Bus

Substrate

Half-speed

Full-speed Backside Bus

Page 76: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

76

Pentium Pro Memory Subsystem

L1 cache (Data cache 8 KB): supporting one load and one store per cycle (full-

speed), peak bandwidth of 3.2 GB/s on a 200 MHz

L2 cache: run at full CPU clock speed, can transfer 64 bit per

cycle (1.6 GB/sec on a 200 MHz Pentium Pro)

External bus: 64-bit, 64-bit per bus cycle SMP support

Full cache coherence up to 4 processors

Page 77: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

77

INTEL Pentium Pro SMP

Processor Bus: 532 MB/s (66.6 MHz x 64 bits) Four-way interleaved DRAMs, EDO or

synchronous DRAM Interface to EISA or PCI Bus operations:

write-back cache, MESI protocolpipeline depth: 8

Page 78: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

78

P6 P6P6P6

PCI BridgeDRAM

ControllerDataPath

MICMIC MICMIC MICMIC MICMIC

Memory controller

MIC: memory interface controller

EISA/ISABridge

PCI Bus

PCIDevice

PCIDevice

EISA/ISADevice

EISA/ISADevice

EISA/ISA Bus

Pentium Pro processor bus

Interleave data(288 bits)

32-bit address64-bit data500 MB/s

32-bit address32-bit data132 MB/s

Mem data(72 bits)

Page 79: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

79

Pentium II Larger L1 cache: 16/16 KB L2 cache: 512 KB unified cache thru backside bus Add MMX features back (like Pentium MMX) Slot 1 architecture (240 pins) Clock speed improved: 233, 266, 300,... 450 MHz. SMP support: up to TWO only Deschutes version (1998): >= 333 MHz, 100 MHz

external bus (440BX chipset), AGP, SMP support: 4 processors, Slot 2 (330 pins) ?

Page 80: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

80

Pentium II (P6)

Pentium IIPentium IICoreCore

16 KB L116 KB L1DataData

CacheCache

16 KB L116 KB L1InstructionInstruction

CacheCache

Bus Interface Unit

512 KB512 KBUnified Unified

L2 cacheL2 cache

LocalAPIC

External Bus

Substrate

Half-speed

Full-speed

Page 81: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

81

AMD K6-2

3DNow technology: MMX support 4 floating point units (4-issue); Pentium II only

one floating point unit 300 MHz AMD K6-2: 1.2 GFLOPS > Pentium II

450 MHz Socket 7, 100 MHz external bus, 0.25 micron 9.3 M transistors K6-3 (SharkTooth): 350, 450 MHz, 256 KB on-

chip L2 cache, 21.3 M transistors

Page 82: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

82

Intel Processors for Each Market Segment

PentiumPentium®® Pro Pro ProcessorProcessorPentiumPentium®® Pro Pro ProcessorProcessor

PentiumPentium®® II Xeon™ II Xeon™ ProcessorProcessorPentiumPentium®® II Xeon™ II Xeon™ ProcessorProcessor

Pentium II Pentium II ProcessorProcessorPentium II Pentium II ProcessorProcessor

Pentium II Pentium II ProcessorProcessorPentium II Pentium II ProcessorProcessor

PentiumPentium®® Processor Processorwith MMXwith MMX™™ TechnologyTechnology

PentiumPentium®® Processor Processorwith MMXwith MMX™™ TechnologyTechnology

Intel Intel ®® Celeron™ Celeron™ ProcessorProcessorIntel Intel ®® Celeron™ Celeron™ ProcessorProcessor

Pentium Processor Pentium Processor with MMX with MMX TechnologyTechnology

Pentium Processor Pentium Processor with MMX with MMX TechnologyTechnology

Mobile Pentium II Mobile Pentium II ProcessorProcessorMobile Pentium II Mobile Pentium II ProcessorProcessor

’’9797’’9797 ’’9898’’9898

Basic Basic PC DesktopPC Desktop

Basic Basic PC DesktopPC Desktop

Mid- to High-End Mid- to High-End Server/Server/

WorkstationWorkstation

Mid- to High-End Mid- to High-End Server/Server/

WorkstationWorkstation

Entry-level Server/Entry-level Server/WorkstationWorkstation

Entry-level Server/Entry-level Server/WorkstationWorkstation

Performance Performance DesktopDesktop

Performance Performance DesktopDesktop

Mobile PCMobile PCMobile PCMobile PC

0.25

mic

ron

P6

Mic

roar

chit

ectu

re C

ore

0.25

mic

ron

P6

Mic

roar

chit

ectu

re C

ore

Page 83: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

83

INTEL Merced (mid-2000)

64-bit processor VLIW concept? Need good compiler technique run UNIX (more scalable than NT)

Page 84: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

84

INTEL McKinley (late-2001)

64-bit, 0.13 micron More cache memory than any other

INTEL processors aims at 1000 MHz, 2x faster than Merced Need good compiler technique

Page 85: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

85

INTEL Foster

32-bit, 0.13 micron 1000 MHz high-end PC server longer pipeline + “instruction trace cache”

Page 86: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

86

Intel Roadmap

32-bit: P5: Pentium (1993), P6: Pentium Pro (1995), Pentium II (450 MHz) Celeron 2, Pentium II Xeon (450 MHz), Tanner and Cascades chip (1999), P7: Willamette (desktop), Foster (1000 MHz, high-end PC server)

64-bit: Merced and McKinley

Page 87: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

87

Intel vs. Compaq 64-bit roadmaps:

Year Intel's IA-64 Compaq's Alpha 1998 in progress 21264 at 575 MHz 1999 first samples 21264 at 750 MHz to 1 GHz mid-2000 Merced at 800 MHz + 21364 at 1 GHz + late 2001 McKinley at 1 GHz + EV8 2002 Madison 2003(?) Deerfield

Page 88: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

88

Digital Alpha 21164

0.35 m , 500 MHz, RISC

4-way issue superscalar Up to 2 Integer and 2

Floating point instructions issues per CPU cycle

Large on-chip L2 cache 96 KB, writable, 3-way set

associative

9.3 M transistors Fully pipelined

7-stage integer pipeline 9-stage floating point

pipeline

High-through memory subsystem (400 MB/s)

Page 89: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

89

Alpha 21164 Block Diagram

Instcache(8KB)

Instcache(8KB)

4-wayissueunit

4-wayissueunit

Int UnitInt Unit

FP +FP +

FP *FP *

Merge Log

Merge Log

Write-through

DataCache(8KB)

Write-through

DataCache(8KB)

Write-backL2

Cache(96KB)

Write-backL2

Cache(96KB)

BusInterface

Unit

BusInterface

UnitL3

CacheL3

CacheInt UnitInt Unit

128-bit internal data bus

Inst Unit Exec Unit Memory Unit

128bit data

40bit address

Page 90: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

90

Current Status of Processor Technology

Still don’t work well for some applications:data bases, CAD tools, sparse matrix,..

Alpha 21164, 300 MHz, 4-way superscalarRunning Microsoft SQLserver database on Windows NT It operates at 12% of peak performance

Caches don’t work. Speed is tied to memory bandwidth.

Page 91: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

91

Microprocessor-DRAM performance gap full cache miss time = 100s instructions Alpha 7000 server: 340 ns/5.0 ns = 68

clks (2-issue, x2 = 136 insts) Alpha 8400 Server: 266 ns/3.3ns = 80

clks (21164 processor, 4-issue, x 4 = 320 insts)

Rely on locality + caches to bridge gap

Page 92: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

92

Processor-Memory Gap

µProc60%/yr.(2X/1.5yr)

DRAM9%/yr.(2X/10 yrs)1

10

100

1000

198

0198

1 198

3198

4198

5 198

6198

7198

8198

9199

0199

1 199

2199

3199

4199

5199

6199

7199

8 199

9200

0

DRAM

CPU198

2

Processor-MemoryPerformance Gap:(grows 50% / year)

Per

form

ance

Time

“Moore’s Law”

Processor-DRAM Memory Gap (latency)

Page 93: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

93

Future Processors

Specialized, very long instruction word (VLIW) machines

Wide, simultaneous multithreaded (SMT) uniprocessor

Single-chip multiprocessor Memory-centric computing engines (IRAM,

PPRAM,CRAM)

Page 94: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

94

IRAM: Berkeley

Growing performance gap between CPU and memory access speed

Microprocessor and DRAM on single chip Bridge the processor-memory

performance gap via on-chip latency and bandwidth

improve power-performance

Page 95: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

95

CRAM (Univ. Toronto)

Computation moved from the CPU into the memory

CRAM= RAM+SIMD PetaOPS performance (1015 operations

per second) Bandwidth internal to memory: 2.9 TB/s Cache/CPU: 800 MB/s

Page 96: 1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

96

Other Projects

PPRAM Project (Kyushu Univ., Japan): Parallel Processing RAM Chip

CMP Project (Stanford)billion-transistor processor architecturesingle-chip multiprocessor (4 to 16)New ISAs