1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

1

Lecture 6Processor

Technology

2

Advance in Hardware

INTEL Family: (8086/1978 -- Pentium II/1998)exponential performance improvement over

time• number of transitors: increased almost 2500

times (29 K --> 7.5 M)• clock rate: 45 times (10 MHz -> 450 MHz)

3

Moore’s Law (1969) The number of transistors on a microchip doubles

about every 18-24 months, assuming the price of the chip stays the same

The speed of a microprocessor doubles about every 18-24 months, assuming price stays the same

The price of a microchip drops about 48% every 18-24 months, assuming the performance metric (processor speed or

memory capacity) of the chip stays the same.

4

Milestones of Chip Density

1K

10K

100K

1M

10M

70 90 92 94 96 98 00

100M

1G

Num

ber

of T

rans

isto

rs p

er

Chi

p

72 74 76 78 80 82 84 86 88 02

4004

•8080

•8085

8086••68000

80286• 68020•• 80386

•

LSI LogicGate Array

•80486

••

IBMGateArray

• Pentium ProMPU Only

••

•P7

Pentium ProMPU and CacheMemory Chip

▲

▲4K ▲

▲

16K

▲64K

256K▲

▲ 1M

▲4M

▲16M

▲64M

▲256MLSI LogicGate Array

= Memory (DRAM)

= Microprocossor and Logic

Pentium

1G

Memory Increase = 1.5/yearMPU Increase = 1.35/year

Year

•

Source: ICE

•▲

5

Outline

Instruction Set Architecture (ISA)

Pipelining Concepts Processor Technology

CISC, RISC, superscalar, VLIW

Case Study Future processor

6

Part 1: Instruction Set Architecture

(ISA)

7

Computer Architecture’s Changing Definition 1950s to 1960s: Computer Arithmetic 1970s to mid 1980s: Instruction Set Design,

especially ISA appropriate for compilers 1990s: Design of CPU, memory system, I/O

system, Multiprocessors

8

Instruction Set Architecture (ISA)

instruction set

software

hardware

9

Interface DesignA good interface:

• Lasts through many implementations (portability, compatability)

• Is used in many different ways (generality)

• Provides convenient functionality to higher levels

• Permits an efficient implementation at lower levels

Interfaceimp 1

imp 2

imp 3

use

use

use

time

10

What Operations Should be in Instruction Set?

How many are possible ? Which ones do we need ? Circuit complexity ? How frequently is each used ? How much slower would each be, if

implemented in terms of simpler ones ?

11

Typically include:

ALU (25-40 % frequency of use) Data transfer (~15-40 %) Control flow (~15-25 %) System (~ 2%) Floating point (~ 15 %) Decimal and string (~ 15 %)

12

4 types of flow operations:

Conditional branch (Branches): ~73%Unconditional branches (Jumps): ~14 %Procedure calls + return (Jump): ~ 13%

13

Data types and sizes ?

How many are possible ? Which ones do we need ? How frequently are they used ? How much slower if implemented in

software ?

14

Data Types

Integer: short, long, extra long. floating-point: single-, double-, quad-

precision. characters: char, strings. bit fields. binary coded decimal.

15

Other Issues

What are the most common accesses (profile) ?

What should the instruction format be ?

16

Conflicting goals:

code compactness less no. of lines in program (at machine level

after compilation) less memory, less I-Fetch bandwidth.

easy decoding want fixed format less expensive and faster I-decode.

17

Ways to get code compactness: (An ideal case)

Huffman encoding --; e.g,50 % 'A' --- "0"25 % 'B' -- '10'12.5 % 'C" -- '110'12.5 % 'D' -- '111”

Variable length according to frequency Easy to implement ? Cost ?

18

Evolution of Instruction Sets

Design decisions must take into account: technologymachine organizationprogramming languagescompiler technologyoperating systems

And they in turn influence these

19

Aspects of CPU Performance

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

CPU time = Seconds = Instructions x Cycles x Seconds

Program Program Instruction Cycle

Inst Count CPI Clock Rate

Program X

Compiler X (X)

Inst. Set. X X

Organization X (X) X

Technology X

20

Cycles Per Instruction (CPI)

CPU time = CycleTime * CPI * Ii = 1

n

i i

CPI = CPI * F where F = I i = 1

n

i i i i

Instruction Count

“Instruction Frequency”

Invest Resources where time is Spent!

CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count

“Average Cycles per Instruction”

21

Example: Calculating CPI

Typical Mix

Base Machine (Reg / Reg)

Op Freq Cycles CPI(i) (% Time)

ALU 50% 1 .5 (33%)

Load 20% 2 .4 (27%)

Store 10% 2 .2 (13%)

Branch 20% 2 .4 (27%)

1.5

22

Part 2: Pipelining

23

Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave

each have one load of clothes to wash, dry, and fold

Washer takes 30 minutes Dryer takes 40 minutes “Folder” takes 20 minutes

A B C D

24

Sequential Laundry

Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take?

A

B

C

D

30 40 20 30 40 20 30 40 20 30 40 20

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

25

Pipelined Laundry: (Start work ASAP)

Pipelined laundry takes 3.5 hours for 4 loads

A

B

C

D

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

30 40 40 40 40 20

26

Computer Pipelining

Overlapping the execution of instructions.Instruction fetch (IF)Instruction decode (ID)Execute (EX)Write back (WB)

Some operation (e.g., IF, ID, EX) is performed on every instruction in the pipeline.

27

Computer Pipelining

Pipelining increases the Throughput Throughput = no. of instructions executed in

a given time period

Hence, reduces the average execution time per instruction (or CPI).

28

Pipelining Speedup:

k-stage linear pipeline, n tasks.Pipelined = k+(n-1) cycles.Unpipelined = n x k cycles.See the laundry example.

Speedup = Sk = nk / (k+n-1)

Sk k as n .

29

Speedup (explanation)

At time t=0, the first pipeline operation enters the pipe.

After k pipeline clock cycles, the 1st result exits.

Then, 1 result exits per clock cycle.

30

Pipelining Lessons Doesn’t help latency of

single task, but throughput of entire workload

Pipeline rate limited by slowest pipeline stage

Multiple tasks operating simultaneously

Potential speedup = Number pipe stages

Unbalanced lengths of pipe stages reduces speedup

A

B

C

D

6 PM 7 8 9

Task

Order

Time

30 40 40 40 40 20

31

Computer Pipelines

Execute billions of instructions, so throughput is what matters

desirable features: all instructions same length, registers located in same place in instruction

format, memory operands only in loads or stores

32

MIPS example: 5 Stage Pipelining

MemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

IRLMD

33

Visualizing Pipelining

Instr.

Order

Time (clock cycles)

34

Limits to pipelining: Hazards !!

Hazards prevent next instruction from executing during its designated clock cycleStructural hazards: HW cannot support this

combination of instructions (single person to fold and put clothes away)

Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)

Control hazards: Pipelining of branches & other instructions stall the pipeline until the hazard “bubbles” in the pipeline

35

One Memory Port/Structural Hazards

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Instr 3

Instr 4

Use reg A

Use reg A

36

One Memory Port/Structural Hazards

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

stall

Instr 3

Wait for one cycle

37

Data Hazard on R1

Instr.

Order

Time (clock cycles)

add r1,r2,r3

sub r4,r1,r3

and r6,r1,r7

or r8,r1,r9

xor r10,r1,r11

IF ID/RF EX MEM WB

All wait for the result of r1

38

3 Generic “Data Hazards”

Assume InstrI followed by InstrJ

Read After Write (RAW) InstrJ tries to read operand before InstrI

writes it

39

Read After Write (RAW)

1 2 543

1 2 543

Write

Read

Inst i

Inst j

read the old data.

40

Write After Read (WAR)

InstrJ tries to write operand before InstrI reads i

Gets wrong operand

Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5

41

Write After Read (WAR)

1 2 543

1 2 543

Read

Write

Inst i

Inst j

Always read the correct data.

42

Write After Write (WAW)

InstrJ tries to write operand before InstrI writes it

Leaves wrong result ( InstrI not InstrJ )

Can’t happen in MIPS’s 5 stage pipeline because:

All instructions take 5 stages, and Writes are always in stage 5

43

Write After Write (WAW)

1 2 543

1 2 543

Write

Inst i

Inst j

Write

44

Part 3: Processor Technology

45

Evolution of Processor Design

Single Accumulator (EDSAC 1950)

Accumulator + Index Registers(Manchester Mark I, IBM 700 series 1953)

Separation of Programming Model from Implementation

High-level Language Based Concept of a Family(B5000 1963) (IBM 360 1964)

General Purpose Register Machines

Complex Instruction Sets Load/Store Architecture

RISC

(Vax, Intel 432 1977-80) (CDC 6600, Cray 1 1963-76)

(Mips,Sparc,HP-PA,IBM RS6000, . . .1987)Mixed (1998)

46

CISC: Complex Instruction Set Computer.

more than 300 instructions ISA variable instruction/data formats small set of 8 to 24 general-purpose registers allow many memory reference operations

(addressing modes) CPI: 1 to 20 cycles, average CPI: 4 cycles Examples: INTEL x86 series (Pentium, Pentium Pro,

Pentium II), Motorola M680X0, Digital VAX 8600, IBM 390, AMD 486, Cyrix 686

47

Intel CPU FamilyYear Model Features

1981 Intel 8088 16-bit, 29K, max speed 10 MHz



1989 Intel 80486 32-bit, 2M, 25 MHz

1993- Intel Pentium 3.11M, 66MHz, CMOS, 2-issue, 266 MHzwith MMX (16/16 KB cache)

1995- Intel Pentium Pro(Socket 7)

5.5M, 200 MHz, 8/8 KB cache, Bi-CMOS,64-bit data bus, 3-issue, no MMX

1997- Intel Pentium II(Slot 1)

16/16 KB L1 cache, 256/512KB L2 cache,300 MHz, 32-bit, 3-issue, with MMX

1998 512KB L2 (half speed), 450 MHz, 7.5 Mtransistors

1998 Intel Pentium II Xeon 512KB L2 (full speed), 450 MHz, 32-bit

48

RISC: Reduced Instruction Set Computer.

Observation: Only 25% of a complex inst set is frequently

used about 95% of the time 75% of the hardware-supported instructions

are rarely used All instructions are of the same length Push rarely used inst into software Adding cache and Floating Point Units (FPU) in

processor chips

49

RISC: other features Instruction set: less than 100 instructions Fixed 32- or 64-bit instruction format Only 3 to 5 simple addressing modes Single address mode for load/store: base +

displacementno indirection

Simple branch conditions

50

RISC

Large register files: 32 integer registers + 32 floating point registers,

some has over 100

execute majority of the instruction in one cycle (average CPI: 1.5)

higher clock rate easy for compiler optimization

51

Examples of RISC processors

SUN: SPARC, MicroSPARC, SuperSPARC, UltraSPARC,

MIPS: R2000/3000/4000/5000/8000, R10000, INTEL: i860, Digital : Alpha 21164, 21264, 21364, IBM, Apple, Motorola : PowerPC 601, 603, 604e,

620, 630, IBM : POWER2 (SP2), POWER 3 (ASCI Blue Pacific), HP: HP PA-RISC PA-8000

52

Example: MIPS

Op

31 26 01516202125

Rs1 Rd immediate

Op

31 26 025

Op

31 26 01516202125

Rs1 Rs2

target

Rd Opx

Register-Register

561011

Register-Immediate

Op

31 26 01516202125

Rs1 Rs2/Opx immediate

Branch

Jump / Call

53

Advanced Pipelining

cycle

Successive instruction

(a) Single-issue base pipeline

(b) 3-issue superscalar pipeline(c) Single-issue superpipeline

(d) 3-issue superscalar superpipeline

54

Superscalar Processors (RISC+CISC)

Multiple instructions issued per cycle (IPC > 1). Clock rate matches that of generic scalar RISC. CPI is lower than generic scalar RISC.

Examples: Alpha 21064 (2-issue), 21164 (4-issue), PowerPC: 604e (4), 620(4) HP PA-7200 (2), PA-8000 (4), MIPS: R5000 (2), R10000 (4), SUN: MicroSPARC (2), UltraSparc-2 (4), INTEL i860 (RISC, 2 issues), Pentium (CISC, 2), Pentium

Pro (3), Pentium II (3)

55

SUN SPARC (RISC) Scalable Processor ARChitecture, a specification,

not a chip. Larger register set: SPARC 128-144. Generations:

SPARC (1987), MicroSPARC, SuperSparc (1993), UltraSparc (1995), UltraSPACR II, Ultra III, Ultra IV,..

Machines:CM-5 : SPARC 33 MHz.CS-2 : SuperSparc 40 MHz (viking).SUN Sunfire (Enterprise 1000): UltraSparc

56

UltraSPARC Roadmap

UltraSPARC II: 64-bit, 0.25 micron (same as

Pentium II, AMD K6-2), Max 360 MHz, 30 W, (400

MHz later 1998)

UltraSPARC III: 600 MHz late 1999, 1000 MHz

UltraSPARC IV: mid-2000, 1000 MHz, 0.15

micron, Sun’s first copper-based chip

UltraSPARC V: 1500 MHz, 0.07 micron (, by 2002

57

PowerPC (RISC)

1991, by Apple, IBM, and Motorola. OS: IBM AIX, Apple Mac OS, NetWare 4,

OS/2, Sun Solaris, and Window NT, MS-DOS.

Technology update: September1998: IBM 400-MHz PowerPC

(copper wiring)

58

PowerPC Family :

First generation : PowerPC 601: desktop PCs.

The 2nd generation : PowerPC 603 (603e, 166 MHz, 3 W, 81 mm2):

portable+battery-powered applications. PowerPC 604 (604e, 5.6 M transistors, Power 10 W,

dynamic branch prediction logic, 4-issue, 6-stage): sophisticated PCs and servers.

PowerPC 620: integrated L2 controller and dedicated cache interface, 4-issue, 5-stage, 30 W, used in servers or supercomputer.

59

PowerPC 3rd generation: G3

L2 cache support, new bus architecture 32-bit processor, used in iMAC (1998) 0.25-micron, 67 mm2, 250 MHz, 5 W, 6.35 M

transistors. 5 execution units (similar to 603e),

1 floating point unit, 1 branch unit, 1 load/store unit, 2 single-cycle integer unit (603e only 1), 1 system unit

60


4-stage pipeline: fetch, decode-dispatch,

execute, complete-writeback

fetch unit fetches 4 instructions per clock peak rate: complete 3 instructions per clock Two 32 KB on-chip L1 caches (data + instruction)

: same as 604e, 8-way set associative

61


On-chip L2 cache: 2-way set associative of sizes 256 KB, 512 KB or 1 MB

Performance: 250 MHz CPU clock, 50 MHz system bus, half-speed 1-MB L2 cache: 10 SPECint95

Bus protocol: MEI (modified exclusive, invalid), used for single or dual-processor design

62

IBM POWER

Performance Optimized With Enhanced RISC POWER2, 66.7 MHz, 6-issue, used in IBM SP2

4 floating-point operations at once cycle. peak performance: 266 Mflops (66.7 x 4).

ASCI Blue Pacific : using POWER3 3.9 trillion calculation per second 15,000 times faster than the desktop PC at Lawrence Livermore National Lab. 2.6 trillion bytes of memory 1 second = 63,000 years using hand calculator 4096 POWER3 CPUs (8 CPUs per node)

63

POWER3: Superscalar RISC 8-issue, (most other processors 4-issue) 200 MHz (slow but fast), 0.25 micron, 5 metal layers, 1088 pin IBM’s first 64-bit microprocessor. Used in ASCI Blue Pacific, 4096 nodes, Memory subsystem: 6.4 GB/s POWER3 workstation, Oct. 23 1998: RS/6000 43P

Model 260, single or dual-processor, up to 8 (in SMP form), 4MB L2 instruction cache+256 MB SDRAM memory

Compatable with PowerPC design

64

IBM

1999: 0.18 micron, copper wiring 2000: silicon-on-insulator 2001: “gigachip” (POWER4 ??)

65

MIPS:

R4000 (1991, 64-bit), R8000 (1994), R10000

1995 or 1996-, 30 W. 5.9 M transistors, 32/32 KB cache5-7 pipeline stages, 4-issue

SGI Power Challenge up to 18 X R8000 or x 36 R10000

66

Other Commodity Processors

HP PA-RISC (Precision Architecture): PA-RISC 7200 (CONVEX Examplar, 128

processors)

DEC: Alpha 21064 (CRAY T3D), 21164 (300 MHz, T3E), 21264 (T3E1200)

Intel 80860 (i860): ``Cray on a Chip'' 66 MFLOPS (Cray 1S = 85

MFLOPS)

67

VLIW: Very Long Instruction Word. Use even more functional units than that of a superscalar

processor. All instructions are the same length The operations in each work are chosen by the compiler. CPI is further lower than superscalar. Clock rate is slow. Microprogrammed control, synchronization of parallel

operations is entirely done at compile time No commodity processor is designed in VLIW (but it is

coming back !! INTEL 64-bit Merced)

68

Register File

Load/Store Integer ALU FP Unit Branch Unit

Mainmemory

Load/store FP ADD FP Multiply Branch .... Integer ALU

(b) pipeline execution of VLIW instruction

Cycle

(a) A VLIW processor architecture and instruction format

69

SummaryCISC RISC VLIW

InstructionComplexity

Varies fromSimple tocomplex

One simple operation Many simpleIndependentoperations

Instruction size Varies One size, usually 32bits

One size

Instructionformat

Field placementvaries

Regular, consistentfield placement

Regular, consistentfield placement

Memory reference Bundled withoperations

No bundled,load/store

architecture

No bundled,load/store

architecture

Hardware designfocus

MicrocodedImplementations;

One or morepipeline

No microcode; oneor more pipelines

Multiple pipelines, nomicrocode, no

complex dispatchlogic

Registers Few, sometimesspecial

Many, generalpurpose

Many, generalpurpose

70

Processor Performance Processor Clock IPC stage SPEC95int

SPEC95fp

Alpha 21164 500 4(2+2) 7-12 12.6 18.3 PowerPC620 200 4 5 9.0 9.0 PowerPC 604e 225 4 6 8.5 7.0 UltraSPARC II 250 4 6-9 8.5 15 HA-8000 180 4 7-9 10.8 18.3 MIPS R10000 200 4 5-7 8.9 17.2 Pentium Pro 200 3(2+1) 12-14 8.7 6.0

Only 1 floating point unit active at a time

71

Case Study 1: INTEL Processors

72

Pentium (430 TX Mother Board)

Main Memory(DRAM

Max 256 MB)

Main Memory(DRAM

Max 256 MB)

Memory controller82349TX (MTXC)

82371AB(PIIX4)

Bus Master

PCI Bus (3.3 V or 5V, 30/33 MHz)

PCIDevice

EISA/ISADevice

EISA/ISADevice

ISA/EIO Bus (3.3 V; 5V)

Pentium processor 32-bit address64-bit data500 MB/s (8x60)Host Bus (3.3 V; 60-66 MHz)

USB USB

L2 cache(Max 512 KB)

HD CD-ROM

BMI IDE (33 MB/s)

(up to 5)

73

Pentium Pro cache:

8 KB data+8 KB instruction cache (L1)On-board L2 cache: 256 KB or 512 KB

40 general purpose registers Data TLB: 64 entries No MMX up to 200 MHz, 35W:

integration of high-speed CPU with high-speed cache is not easy

74

Execution in Pentium Pro

Five functional units:Store data unitStore address unitLoad address unit Integer ALU unitFloating point/integer unit

3-issue but only one floating point op Peak flop rate= 200 MFLOP at 200 MHz

75

Pentium Pro (P6)

Pentium ProPentium ProCoreCore

8 KB L18 KB L1DataData

CacheCache

8 KB L18 KB L1InstructionInstruction

CacheCache

Bus Interface Unit

256/512 KB256/512 KBUnified Unified

L2 cacheL2 cache

LocalAPIC

External Bus

Substrate

Half-speed

Full-speed Backside Bus

76

Pentium Pro Memory Subsystem

L1 cache (Data cache 8 KB): supporting one load and one store per cycle (full-

speed), peak bandwidth of 3.2 GB/s on a 200 MHz

L2 cache: run at full CPU clock speed, can transfer 64 bit per

cycle (1.6 GB/sec on a 200 MHz Pentium Pro)

External bus: 64-bit, 64-bit per bus cycle SMP support

Full cache coherence up to 4 processors

77

INTEL Pentium Pro SMP

Processor Bus: 532 MB/s (66.6 MHz x 64 bits) Four-way interleaved DRAMs, EDO or

synchronous DRAM Interface to EISA or PCI Bus operations:

write-back cache, MESI protocolpipeline depth: 8

78

P6 P6P6P6

PCI BridgeDRAM

ControllerDataPath

MICMIC MICMIC MICMIC MICMIC

Memory controller

MIC: memory interface controller

EISA/ISABridge

PCI Bus

PCIDevice

PCIDevice

EISA/ISADevice

EISA/ISADevice

EISA/ISA Bus

Pentium Pro processor bus

Interleave data(288 bits)

32-bit address64-bit data500 MB/s

32-bit address32-bit data132 MB/s

Mem data(72 bits)

79

Pentium II Larger L1 cache: 16/16 KB L2 cache: 512 KB unified cache thru backside bus Add MMX features back (like Pentium MMX) Slot 1 architecture (240 pins) Clock speed improved: 233, 266, 300,... 450 MHz. SMP support: up to TWO only Deschutes version (1998): >= 333 MHz, 100 MHz

external bus (440BX chipset), AGP, SMP support: 4 processors, Slot 2 (330 pins) ?

80

Pentium II (P6)

Pentium IIPentium IICoreCore

16 KB L116 KB L1DataData

CacheCache

16 KB L116 KB L1InstructionInstruction

CacheCache

Bus Interface Unit

512 KB512 KBUnified Unified

L2 cacheL2 cache

LocalAPIC

External Bus

Substrate

Half-speed

Full-speed

81

AMD K6-2

3DNow technology: MMX support 4 floating point units (4-issue); Pentium II only

one floating point unit 300 MHz AMD K6-2: 1.2 GFLOPS > Pentium II

450 MHz Socket 7, 100 MHz external bus, 0.25 micron 9.3 M transistors K6-3 (SharkTooth): 350, 450 MHz, 256 KB on-

chip L2 cache, 21.3 M transistors

82

Intel Processors for Each Market Segment

PentiumPentium®® Pro Pro ProcessorProcessorPentiumPentium®® Pro Pro ProcessorProcessor

PentiumPentium®® II Xeon™ II Xeon™ ProcessorProcessorPentiumPentium®® II Xeon™ II Xeon™ ProcessorProcessor

Pentium II Pentium II ProcessorProcessorPentium II Pentium II ProcessorProcessor

Pentium II Pentium II ProcessorProcessorPentium II Pentium II ProcessorProcessor

PentiumPentium®® Processor Processorwith MMXwith MMX™™ TechnologyTechnology

PentiumPentium®® Processor Processorwith MMXwith MMX™™ TechnologyTechnology

Intel Intel ®® Celeron™ Celeron™ ProcessorProcessorIntel Intel ®® Celeron™ Celeron™ ProcessorProcessor

Pentium Processor Pentium Processor with MMX with MMX TechnologyTechnology

Pentium Processor Pentium Processor with MMX with MMX TechnologyTechnology

Mobile Pentium II Mobile Pentium II ProcessorProcessorMobile Pentium II Mobile Pentium II ProcessorProcessor

’’9797’’9797 ’’9898’’9898

Basic Basic PC DesktopPC Desktop

Basic Basic PC DesktopPC Desktop

Mid- to High-End Mid- to High-End Server/Server/

WorkstationWorkstation

Mid- to High-End Mid- to High-End Server/Server/

WorkstationWorkstation

Entry-level Server/Entry-level Server/WorkstationWorkstation

Entry-level Server/Entry-level Server/WorkstationWorkstation

Performance Performance DesktopDesktop

Performance Performance DesktopDesktop

Mobile PCMobile PCMobile PCMobile PC

0.25

mic

ron

P6

Mic

roar

chit

ectu

re C

ore

0.25

mic

ron

P6

Mic

roar

chit

ectu

re C

ore

83

INTEL Merced (mid-2000)

64-bit processor VLIW concept? Need good compiler technique run UNIX (more scalable than NT)

84

INTEL McKinley (late-2001)

64-bit, 0.13 micron More cache memory than any other

INTEL processors aims at 1000 MHz, 2x faster than Merced Need good compiler technique

85

INTEL Foster

32-bit, 0.13 micron 1000 MHz high-end PC server longer pipeline + “instruction trace cache”

86

Intel Roadmap

32-bit: P5: Pentium (1993), P6: Pentium Pro (1995), Pentium II (450 MHz) Celeron 2, Pentium II Xeon (450 MHz), Tanner and Cascades chip (1999), P7: Willamette (desktop), Foster (1000 MHz, high-end PC server)

64-bit: Merced and McKinley

87

Intel vs. Compaq 64-bit roadmaps:

Year Intel's IA-64 Compaq's Alpha 1998 in progress 21264 at 575 MHz 1999 first samples 21264 at 750 MHz to 1 GHz mid-2000 Merced at 800 MHz + 21364 at 1 GHz + late 2001 McKinley at 1 GHz + EV8 2002 Madison 2003(?) Deerfield

88

Digital Alpha 21164

0.35 m , 500 MHz, RISC

4-way issue superscalar Up to 2 Integer and 2

Floating point instructions issues per CPU cycle

Large on-chip L2 cache 96 KB, writable, 3-way set

associative

9.3 M transistors Fully pipelined

7-stage integer pipeline 9-stage floating point

pipeline

High-through memory subsystem (400 MB/s)

89

Alpha 21164 Block Diagram

Instcache(8KB)

Instcache(8KB)

4-wayissueunit

4-wayissueunit

Int UnitInt Unit

FP +FP +

FP *FP *

Merge Log

Merge Log

Write-through

DataCache(8KB)

Write-through

DataCache(8KB)

Write-backL2

Cache(96KB)

Write-backL2

Cache(96KB)

BusInterface

Unit

BusInterface

UnitL3

CacheL3

CacheInt UnitInt Unit

128-bit internal data bus

Inst Unit Exec Unit Memory Unit

128bit data

40bit address

90

Current Status of Processor Technology

Still don’t work well for some applications:data bases, CAD tools, sparse matrix,..

Alpha 21164, 300 MHz, 4-way superscalarRunning Microsoft SQLserver database on Windows NT It operates at 12% of peak performance

Caches don’t work. Speed is tied to memory bandwidth.

91

Microprocessor-DRAM performance gap full cache miss time = 100s instructions Alpha 7000 server: 340 ns/5.0 ns = 68

clks (2-issue, x2 = 136 insts) Alpha 8400 Server: 266 ns/3.3ns = 80

clks (21164 processor, 4-issue, x 4 = 320 insts)

Rely on locality + caches to bridge gap

92

Processor-Memory Gap

µProc60%/yr.(2X/1.5yr)

DRAM9%/yr.(2X/10 yrs)1

10

100

1000

198

0198

1 198

3198

4198

5 198

6198

7198

8198

9199

0199

1 199

2199

3199

4199

5199

6199

7199

8 199

9200

0

DRAM

CPU198

2

Processor-MemoryPerformance Gap:(grows 50% / year)

Per

form

ance

Time

“Moore’s Law”

Processor-DRAM Memory Gap (latency)

93

Future Processors

Specialized, very long instruction word (VLIW) machines

Wide, simultaneous multithreaded (SMT) uniprocessor

Single-chip multiprocessor Memory-centric computing engines (IRAM,

PPRAM,CRAM)

94

IRAM: Berkeley

Growing performance gap between CPU and memory access speed

Microprocessor and DRAM on single chip Bridge the processor-memory

performance gap via on-chip latency and bandwidth

improve power-performance

95

CRAM (Univ. Toronto)

Computation moved from the CPU into the memory

CRAM= RAM+SIMD PetaOPS performance (1015 operations

per second) Bandwidth internal to memory: 2.9 TB/s Cache/CPU: 800 MB/s

96

Other Projects

PPRAM Project (Kyushu Univ., Japan): Parallel Processing RAM Chip

CMP Project (Stanford)billion-transistor processor architecturesingle-chip multiprocessor (4 to 16)New ISAs

1 Lecture 6 Processor Technology. 2 Advance in Hardware INTEL Family: (8086/1978 -- Pentium II/1998) exponential performance improvement over time number.

Documents