1 Lecture 6 Processor Technology
1
Lecture 6Processor
Technology
2
Advance in Hardware
INTEL Family: (8086/1978 -- Pentium II/1998)exponential performance improvement over
time• number of transitors: increased almost 2500
times (29 K --> 7.5 M)• clock rate: 45 times (10 MHz -> 450 MHz)
3
Moore’s Law (1969) The number of transistors on a microchip doubles
about every 18-24 months, assuming the price of the chip stays the same
The speed of a microprocessor doubles about every 18-24 months, assuming price stays the same
The price of a microchip drops about 48% every 18-24 months, assuming the performance metric (processor speed or
memory capacity) of the chip stays the same.
4
Milestones of Chip Density
1K
10K
100K
1M
10M
70 90 92 94 96 98 00
100M
1G
Num
ber
of T
rans
isto
rs p
er
Chi
p
72 74 76 78 80 82 84 86 88 02
4004
•8080
•8085
8086••68000
80286• 68020•• 80386
•
LSI LogicGate Array
•80486
••
IBMGateArray
• Pentium ProMPU Only
••
•P7
Pentium ProMPU and CacheMemory Chip
▲
▲4K ▲
▲
16K
▲64K
256K▲
▲ 1M
▲4M
▲16M
▲64M
▲256MLSI LogicGate Array
= Memory (DRAM)
= Microprocossor and Logic
Pentium
1G
Memory Increase = 1.5/yearMPU Increase = 1.35/year
Year
•
Source: ICE
•▲
5
Outline
Instruction Set Architecture (ISA)
Pipelining Concepts Processor Technology
CISC, RISC, superscalar, VLIW
Case Study Future processor
6
Part 1: Instruction Set Architecture
(ISA)
7
Computer Architecture’s Changing Definition 1950s to 1960s: Computer Arithmetic 1970s to mid 1980s: Instruction Set Design,
especially ISA appropriate for compilers 1990s: Design of CPU, memory system, I/O
system, Multiprocessors
8
Instruction Set Architecture (ISA)
instruction set
software
hardware
9
Interface DesignA good interface:
• Lasts through many implementations (portability, compatability)
• Is used in many different ways (generality)
• Provides convenient functionality to higher levels
• Permits an efficient implementation at lower levels
Interfaceimp 1
imp 2
imp 3
use
use
use
time
10
What Operations Should be in Instruction Set?
How many are possible ? Which ones do we need ? Circuit complexity ? How frequently is each used ? How much slower would each be, if
implemented in terms of simpler ones ?
11
Typically include:
ALU (25-40 % frequency of use) Data transfer (~15-40 %) Control flow (~15-25 %) System (~ 2%) Floating point (~ 15 %) Decimal and string (~ 15 %)
12
4 types of flow operations:
Conditional branch (Branches): ~73%Unconditional branches (Jumps): ~14 %Procedure calls + return (Jump): ~ 13%
13
Data types and sizes ?
How many are possible ? Which ones do we need ? How frequently are they used ? How much slower if implemented in
software ?
14
Data Types
Integer: short, long, extra long. floating-point: single-, double-, quad-
precision. characters: char, strings. bit fields. binary coded decimal.
15
Other Issues
What are the most common accesses (profile) ?
What should the instruction format be ?
16
Conflicting goals:
code compactness less no. of lines in program (at machine level
after compilation) less memory, less I-Fetch bandwidth.
easy decoding want fixed format less expensive and faster I-decode.
17
Ways to get code compactness: (An ideal case)
Huffman encoding --; e.g,50 % 'A' --- "0"25 % 'B' -- '10'12.5 % 'C" -- '110'12.5 % 'D' -- '111”
Variable length according to frequency Easy to implement ? Cost ?
18
Evolution of Instruction Sets
Design decisions must take into account: technologymachine organizationprogramming languagescompiler technologyoperating systems
And they in turn influence these
19
Aspects of CPU Performance
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
CPU time = Seconds = Instructions x Cycles x Seconds
Program Program Instruction Cycle
Inst Count CPI Clock Rate
Program X
Compiler X (X)
Inst. Set. X X
Organization X (X) X
Technology X
20
Cycles Per Instruction (CPI)
CPU time = CycleTime * CPI * Ii = 1
n
i i
CPI = CPI * F where F = I i = 1
n
i i i i
Instruction Count
“Instruction Frequency”
Invest Resources where time is Spent!
CPI = (CPU Time * Clock Rate) / Instruction Count = Cycles / Instruction Count
“Average Cycles per Instruction”
21
Example: Calculating CPI
Typical Mix
Base Machine (Reg / Reg)
Op Freq Cycles CPI(i) (% Time)
ALU 50% 1 .5 (33%)
Load 20% 2 .4 (27%)
Store 10% 2 .2 (13%)
Branch 20% 2 .4 (27%)
1.5
22
Part 2: Pipelining
23
Pipelining: Its Natural! Laundry Example Ann, Brian, Cathy, Dave
each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes Dryer takes 40 minutes “Folder” takes 20 minutes
A B C D
24
Sequential Laundry
Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take?
A
B
C
D
30 40 20 30 40 20 30 40 20 30 40 20
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
25
Pipelined Laundry: (Start work ASAP)
Pipelined laundry takes 3.5 hours for 4 loads
A
B
C
D
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
30 40 40 40 40 20
26
Computer Pipelining
Overlapping the execution of instructions.Instruction fetch (IF)Instruction decode (ID)Execute (EX)Write back (WB)
Some operation (e.g., IF, ID, EX) is performed on every instruction in the pipeline.
27
Computer Pipelining
Pipelining increases the Throughput Throughput = no. of instructions executed in
a given time period
Hence, reduces the average execution time per instruction (or CPI).
28
Pipelining Speedup:
k-stage linear pipeline, n tasks.Pipelined = k+(n-1) cycles.Unpipelined = n x k cycles.See the laundry example.
Speedup = Sk = nk / (k+n-1)
Sk k as n .
29
Speedup (explanation)
At time t=0, the first pipeline operation enters the pipe.
After k pipeline clock cycles, the 1st result exits.
Then, 1 result exits per clock cycle.
30
Pipelining Lessons Doesn’t help latency of
single task, but throughput of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = Number pipe stages
Unbalanced lengths of pipe stages reduces speedup
A
B
C
D
6 PM 7 8 9
Task
Order
Time
30 40 40 40 40 20
31
Computer Pipelines
Execute billions of instructions, so throughput is what matters
desirable features: all instructions same length, registers located in same place in instruction
format, memory operands only in loads or stores
32
MIPS example: 5 Stage Pipelining
MemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
IRLMD
33
Visualizing Pipelining
Instr.
Order
Time (clock cycles)
34
Limits to pipelining: Hazards !!
Hazards prevent next instruction from executing during its designated clock cycleStructural hazards: HW cannot support this
combination of instructions (single person to fold and put clothes away)
Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
Control hazards: Pipelining of branches & other instructions stall the pipeline until the hazard “bubbles” in the pipeline
35
One Memory Port/Structural Hazards
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4
Use reg A
Use reg A
36
One Memory Port/Structural Hazards
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
stall
Instr 3
Wait for one cycle
37
Data Hazard on R1
Instr.
Order
Time (clock cycles)
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
IF ID/RF EX MEM WB
All wait for the result of r1
38
3 Generic “Data Hazards”
Assume InstrI followed by InstrJ
Read After Write (RAW) InstrJ tries to read operand before InstrI
writes it
39
Read After Write (RAW)
1 2 543
1 2 543
Write
Read
Inst i
Inst j
read the old data.
40
Write After Read (WAR)
InstrJ tries to write operand before InstrI reads i
Gets wrong operand
Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5
41
Write After Read (WAR)
1 2 543
1 2 543
Read
Write
Inst i
Inst j
Always read the correct data.
42
Write After Write (WAW)
InstrJ tries to write operand before InstrI writes it
Leaves wrong result ( InstrI not InstrJ )
Can’t happen in MIPS’s 5 stage pipeline because:
All instructions take 5 stages, and Writes are always in stage 5
43
Write After Write (WAW)
1 2 543
1 2 543
Write
Inst i
Inst j
Write
44
Part 3: Processor Technology
45
Evolution of Processor Design
Single Accumulator (EDSAC 1950)
Accumulator + Index Registers(Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model from Implementation
High-level Language Based Concept of a Family(B5000 1963) (IBM 360 1964)
General Purpose Register Machines
Complex Instruction Sets Load/Store Architecture
RISC
(Vax, Intel 432 1977-80) (CDC 6600, Cray 1 1963-76)
(Mips,Sparc,HP-PA,IBM RS6000, . . .1987)Mixed (1998)
46
CISC: Complex Instruction Set Computer.
more than 300 instructions ISA variable instruction/data formats small set of 8 to 24 general-purpose registers allow many memory reference operations
(addressing modes) CPI: 1 to 20 cycles, average CPI: 4 cycles Examples: INTEL x86 series (Pentium, Pentium Pro,
Pentium II), Motorola M680X0, Digital VAX 8600, IBM 390, AMD 486, Cyrix 686
47
Intel CPU FamilyYear Model Features
1981 Intel 8088 16-bit, 29K, max speed 10 MHz
1982 Intel 80286 16-bit, 130K, max speed 12 MHz
1985 Intel 80386 32-bit, 275K, max speed 20 MHz
1989 Intel 80486 32-bit, 2M, 25 MHz
1993- Intel Pentium 3.11M, 66MHz, CMOS, 2-issue, 266 MHzwith MMX (16/16 KB cache)
1995- Intel Pentium Pro(Socket 7)
5.5M, 200 MHz, 8/8 KB cache, Bi-CMOS,64-bit data bus, 3-issue, no MMX
1997- Intel Pentium II(Slot 1)
16/16 KB L1 cache, 256/512KB L2 cache,300 MHz, 32-bit, 3-issue, with MMX
1998 512KB L2 (half speed), 450 MHz, 7.5 Mtransistors
1998 Intel Pentium II Xeon 512KB L2 (full speed), 450 MHz, 32-bit
48
RISC: Reduced Instruction Set Computer.
Observation: Only 25% of a complex inst set is frequently
used about 95% of the time 75% of the hardware-supported instructions
are rarely used All instructions are of the same length Push rarely used inst into software Adding cache and Floating Point Units (FPU) in
processor chips
49
RISC: other features Instruction set: less than 100 instructions Fixed 32- or 64-bit instruction format Only 3 to 5 simple addressing modes Single address mode for load/store: base +
displacementno indirection
Simple branch conditions
50
RISC
Large register files: 32 integer registers + 32 floating point registers,
some has over 100
execute majority of the instruction in one cycle (average CPI: 1.5)
higher clock rate easy for compiler optimization
51
Examples of RISC processors
SUN: SPARC, MicroSPARC, SuperSPARC, UltraSPARC,
MIPS: R2000/3000/4000/5000/8000, R10000, INTEL: i860, Digital : Alpha 21164, 21264, 21364, IBM, Apple, Motorola : PowerPC 601, 603, 604e,
620, 630, IBM : POWER2 (SP2), POWER 3 (ASCI Blue Pacific), HP: HP PA-RISC PA-8000
52
Example: MIPS
Op
31 26 01516202125
Rs1 Rd immediate
Op
31 26 025
Op
31 26 01516202125
Rs1 Rs2
target
Rd Opx
Register-Register
561011
Register-Immediate
Op
31 26 01516202125
Rs1 Rs2/Opx immediate
Branch
Jump / Call
53
Advanced Pipelining
cycle
Successive instruction
(a) Single-issue base pipeline
(b) 3-issue superscalar pipeline(c) Single-issue superpipeline
(d) 3-issue superscalar superpipeline
54
Superscalar Processors (RISC+CISC)
Multiple instructions issued per cycle (IPC > 1). Clock rate matches that of generic scalar RISC. CPI is lower than generic scalar RISC.
Examples: Alpha 21064 (2-issue), 21164 (4-issue), PowerPC: 604e (4), 620(4) HP PA-7200 (2), PA-8000 (4), MIPS: R5000 (2), R10000 (4), SUN: MicroSPARC (2), UltraSparc-2 (4), INTEL i860 (RISC, 2 issues), Pentium (CISC, 2), Pentium
Pro (3), Pentium II (3)
55
SUN SPARC (RISC) Scalable Processor ARChitecture, a specification,
not a chip. Larger register set: SPARC 128-144. Generations:
SPARC (1987), MicroSPARC, SuperSparc (1993), UltraSparc (1995), UltraSPACR II, Ultra III, Ultra IV,..
Machines:CM-5 : SPARC 33 MHz.CS-2 : SuperSparc 40 MHz (viking).SUN Sunfire (Enterprise 1000): UltraSparc
56
UltraSPARC Roadmap
UltraSPARC II: 64-bit, 0.25 micron (same as
Pentium II, AMD K6-2), Max 360 MHz, 30 W, (400
MHz later 1998)
UltraSPARC III: 600 MHz late 1999, 1000 MHz
UltraSPARC IV: mid-2000, 1000 MHz, 0.15
micron, Sun’s first copper-based chip
UltraSPARC V: 1500 MHz, 0.07 micron (, by 2002
57
PowerPC (RISC)
1991, by Apple, IBM, and Motorola. OS: IBM AIX, Apple Mac OS, NetWare 4,
OS/2, Sun Solaris, and Window NT, MS-DOS.
Technology update: September1998: IBM 400-MHz PowerPC
(copper wiring)
58
PowerPC Family :
First generation : PowerPC 601: desktop PCs.
The 2nd generation : PowerPC 603 (603e, 166 MHz, 3 W, 81 mm2):
portable+battery-powered applications. PowerPC 604 (604e, 5.6 M transistors, Power 10 W,
dynamic branch prediction logic, 4-issue, 6-stage): sophisticated PCs and servers.
PowerPC 620: integrated L2 controller and dedicated cache interface, 4-issue, 5-stage, 30 W, used in servers or supercomputer.
59
PowerPC 3rd generation: G3
L2 cache support, new bus architecture 32-bit processor, used in iMAC (1998) 0.25-micron, 67 mm2, 250 MHz, 5 W, 6.35 M
transistors. 5 execution units (similar to 603e),
1 floating point unit, 1 branch unit, 1 load/store unit, 2 single-cycle integer unit (603e only 1), 1 system unit
60
PowerPC 3rd generation: G3
4-stage pipeline: fetch, decode-dispatch,
execute, complete-writeback
fetch unit fetches 4 instructions per clock peak rate: complete 3 instructions per clock Two 32 KB on-chip L1 caches (data + instruction)
: same as 604e, 8-way set associative
61
PowerPC 3rd generation: G3
On-chip L2 cache: 2-way set associative of sizes 256 KB, 512 KB or 1 MB
Performance: 250 MHz CPU clock, 50 MHz system bus, half-speed 1-MB L2 cache: 10 SPECint95
Bus protocol: MEI (modified exclusive, invalid), used for single or dual-processor design
62
IBM POWER
Performance Optimized With Enhanced RISC POWER2, 66.7 MHz, 6-issue, used in IBM SP2
4 floating-point operations at once cycle. peak performance: 266 Mflops (66.7 x 4).
ASCI Blue Pacific : using POWER3 3.9 trillion calculation per second 15,000 times faster than the desktop PC at Lawrence Livermore National Lab. 2.6 trillion bytes of memory 1 second = 63,000 years using hand calculator 4096 POWER3 CPUs (8 CPUs per node)
63
POWER3: Superscalar RISC 8-issue, (most other processors 4-issue) 200 MHz (slow but fast), 0.25 micron, 5 metal layers, 1088 pin IBM’s first 64-bit microprocessor. Used in ASCI Blue Pacific, 4096 nodes, Memory subsystem: 6.4 GB/s POWER3 workstation, Oct. 23 1998: RS/6000 43P
Model 260, single or dual-processor, up to 8 (in SMP form), 4MB L2 instruction cache+256 MB SDRAM memory
Compatable with PowerPC design
64
IBM
1999: 0.18 micron, copper wiring 2000: silicon-on-insulator 2001: “gigachip” (POWER4 ??)
65
MIPS:
R4000 (1991, 64-bit), R8000 (1994), R10000
1995 or 1996-, 30 W. 5.9 M transistors, 32/32 KB cache5-7 pipeline stages, 4-issue
SGI Power Challenge up to 18 X R8000 or x 36 R10000
66
Other Commodity Processors
HP PA-RISC (Precision Architecture): PA-RISC 7200 (CONVEX Examplar, 128
processors)
DEC: Alpha 21064 (CRAY T3D), 21164 (300 MHz, T3E), 21264 (T3E1200)
Intel 80860 (i860): ``Cray on a Chip'' 66 MFLOPS (Cray 1S = 85
MFLOPS)
67
VLIW: Very Long Instruction Word. Use even more functional units than that of a superscalar
processor. All instructions are the same length The operations in each work are chosen by the compiler. CPI is further lower than superscalar. Clock rate is slow. Microprogrammed control, synchronization of parallel
operations is entirely done at compile time No commodity processor is designed in VLIW (but it is
coming back !! INTEL 64-bit Merced)
68
Register File
Load/Store Integer ALU FP Unit Branch Unit
Mainmemory
Load/store FP ADD FP Multiply Branch .... Integer ALU
(b) pipeline execution of VLIW instruction
Cycle
(a) A VLIW processor architecture and instruction format
69
SummaryCISC RISC VLIW
InstructionComplexity
Varies fromSimple tocomplex
One simple operation Many simpleIndependentoperations
Instruction size Varies One size, usually 32bits
One size
Instructionformat
Field placementvaries
Regular, consistentfield placement
Regular, consistentfield placement
Memory reference Bundled withoperations
No bundled,load/store
architecture
No bundled,load/store
architecture
Hardware designfocus
MicrocodedImplementations;
One or morepipeline
No microcode; oneor more pipelines
Multiple pipelines, nomicrocode, no
complex dispatchlogic
Registers Few, sometimesspecial
Many, generalpurpose
Many, generalpurpose
70
Processor Performance Processor Clock IPC stage SPEC95int
SPEC95fp
Alpha 21164 500 4(2+2) 7-12 12.6 18.3 PowerPC620 200 4 5 9.0 9.0 PowerPC 604e 225 4 6 8.5 7.0 UltraSPARC II 250 4 6-9 8.5 15 HA-8000 180 4 7-9 10.8 18.3 MIPS R10000 200 4 5-7 8.9 17.2 Pentium Pro 200 3(2+1) 12-14 8.7 6.0
Only 1 floating point unit active at a time
71
Case Study 1: INTEL Processors
72
Pentium (430 TX Mother Board)
Main Memory(DRAM
Max 256 MB)
Main Memory(DRAM
Max 256 MB)
Memory controller82349TX (MTXC)
82371AB(PIIX4)
Bus Master
PCI Bus (3.3 V or 5V, 30/33 MHz)
PCIDevice
EISA/ISADevice
EISA/ISADevice
ISA/EIO Bus (3.3 V; 5V)
Pentium processor 32-bit address64-bit data500 MB/s (8x60)Host Bus (3.3 V; 60-66 MHz)
USB USB
L2 cache(Max 512 KB)
HD CD-ROM
BMI IDE (33 MB/s)
(up to 5)
73
Pentium Pro cache:
8 KB data+8 KB instruction cache (L1)On-board L2 cache: 256 KB or 512 KB
40 general purpose registers Data TLB: 64 entries No MMX up to 200 MHz, 35W:
integration of high-speed CPU with high-speed cache is not easy
74
Execution in Pentium Pro
Five functional units:Store data unitStore address unitLoad address unit Integer ALU unitFloating point/integer unit
3-issue but only one floating point op Peak flop rate= 200 MFLOP at 200 MHz
75
Pentium Pro (P6)
Pentium ProPentium ProCoreCore
8 KB L18 KB L1DataData
CacheCache
8 KB L18 KB L1InstructionInstruction
CacheCache
Bus Interface Unit
256/512 KB256/512 KBUnified Unified
L2 cacheL2 cache
LocalAPIC
External Bus
Substrate
Half-speed
Full-speed Backside Bus
76
Pentium Pro Memory Subsystem
L1 cache (Data cache 8 KB): supporting one load and one store per cycle (full-
speed), peak bandwidth of 3.2 GB/s on a 200 MHz
L2 cache: run at full CPU clock speed, can transfer 64 bit per
cycle (1.6 GB/sec on a 200 MHz Pentium Pro)
External bus: 64-bit, 64-bit per bus cycle SMP support
Full cache coherence up to 4 processors
77
INTEL Pentium Pro SMP
Processor Bus: 532 MB/s (66.6 MHz x 64 bits) Four-way interleaved DRAMs, EDO or
synchronous DRAM Interface to EISA or PCI Bus operations:
write-back cache, MESI protocolpipeline depth: 8
78
P6 P6P6P6
PCI BridgeDRAM
ControllerDataPath
MICMIC MICMIC MICMIC MICMIC
Memory controller
MIC: memory interface controller
EISA/ISABridge
PCI Bus
PCIDevice
PCIDevice
EISA/ISADevice
EISA/ISADevice
EISA/ISA Bus
Pentium Pro processor bus
Interleave data(288 bits)
32-bit address64-bit data500 MB/s
32-bit address32-bit data132 MB/s
Mem data(72 bits)
79
Pentium II Larger L1 cache: 16/16 KB L2 cache: 512 KB unified cache thru backside bus Add MMX features back (like Pentium MMX) Slot 1 architecture (240 pins) Clock speed improved: 233, 266, 300,... 450 MHz. SMP support: up to TWO only Deschutes version (1998): >= 333 MHz, 100 MHz
external bus (440BX chipset), AGP, SMP support: 4 processors, Slot 2 (330 pins) ?
80
Pentium II (P6)
Pentium IIPentium IICoreCore
16 KB L116 KB L1DataData
CacheCache
16 KB L116 KB L1InstructionInstruction
CacheCache
Bus Interface Unit
512 KB512 KBUnified Unified
L2 cacheL2 cache
LocalAPIC
External Bus
Substrate
Half-speed
Full-speed
81
AMD K6-2
3DNow technology: MMX support 4 floating point units (4-issue); Pentium II only
one floating point unit 300 MHz AMD K6-2: 1.2 GFLOPS > Pentium II
450 MHz Socket 7, 100 MHz external bus, 0.25 micron 9.3 M transistors K6-3 (SharkTooth): 350, 450 MHz, 256 KB on-
chip L2 cache, 21.3 M transistors
82
Intel Processors for Each Market Segment
PentiumPentium®® Pro Pro ProcessorProcessorPentiumPentium®® Pro Pro ProcessorProcessor
PentiumPentium®® II Xeon™ II Xeon™ ProcessorProcessorPentiumPentium®® II Xeon™ II Xeon™ ProcessorProcessor
Pentium II Pentium II ProcessorProcessorPentium II Pentium II ProcessorProcessor
Pentium II Pentium II ProcessorProcessorPentium II Pentium II ProcessorProcessor
PentiumPentium®® Processor Processorwith MMXwith MMX™™ TechnologyTechnology
PentiumPentium®® Processor Processorwith MMXwith MMX™™ TechnologyTechnology
Intel Intel ®® Celeron™ Celeron™ ProcessorProcessorIntel Intel ®® Celeron™ Celeron™ ProcessorProcessor
Pentium Processor Pentium Processor with MMX with MMX TechnologyTechnology
Pentium Processor Pentium Processor with MMX with MMX TechnologyTechnology
Mobile Pentium II Mobile Pentium II ProcessorProcessorMobile Pentium II Mobile Pentium II ProcessorProcessor
’’9797’’9797 ’’9898’’9898
Basic Basic PC DesktopPC Desktop
Basic Basic PC DesktopPC Desktop
Mid- to High-End Mid- to High-End Server/Server/
WorkstationWorkstation
Mid- to High-End Mid- to High-End Server/Server/
WorkstationWorkstation
Entry-level Server/Entry-level Server/WorkstationWorkstation
Entry-level Server/Entry-level Server/WorkstationWorkstation
Performance Performance DesktopDesktop
Performance Performance DesktopDesktop
Mobile PCMobile PCMobile PCMobile PC
0.25
mic
ron
P6
Mic
roar
chit
ectu
re C
ore
0.25
mic
ron
P6
Mic
roar
chit
ectu
re C
ore
83
INTEL Merced (mid-2000)
64-bit processor VLIW concept? Need good compiler technique run UNIX (more scalable than NT)
84
INTEL McKinley (late-2001)
64-bit, 0.13 micron More cache memory than any other
INTEL processors aims at 1000 MHz, 2x faster than Merced Need good compiler technique
85
INTEL Foster
32-bit, 0.13 micron 1000 MHz high-end PC server longer pipeline + “instruction trace cache”
86
Intel Roadmap
32-bit: P5: Pentium (1993), P6: Pentium Pro (1995), Pentium II (450 MHz) Celeron 2, Pentium II Xeon (450 MHz), Tanner and Cascades chip (1999), P7: Willamette (desktop), Foster (1000 MHz, high-end PC server)
64-bit: Merced and McKinley
87
Intel vs. Compaq 64-bit roadmaps:
Year Intel's IA-64 Compaq's Alpha 1998 in progress 21264 at 575 MHz 1999 first samples 21264 at 750 MHz to 1 GHz mid-2000 Merced at 800 MHz + 21364 at 1 GHz + late 2001 McKinley at 1 GHz + EV8 2002 Madison 2003(?) Deerfield
88
Digital Alpha 21164
0.35 m , 500 MHz, RISC
4-way issue superscalar Up to 2 Integer and 2
Floating point instructions issues per CPU cycle
Large on-chip L2 cache 96 KB, writable, 3-way set
associative
9.3 M transistors Fully pipelined
7-stage integer pipeline 9-stage floating point
pipeline
High-through memory subsystem (400 MB/s)
89
Alpha 21164 Block Diagram
Instcache(8KB)
Instcache(8KB)
4-wayissueunit
4-wayissueunit
Int UnitInt Unit
FP +FP +
FP *FP *
Merge Log
Merge Log
Write-through
DataCache(8KB)
Write-through
DataCache(8KB)
Write-backL2
Cache(96KB)
Write-backL2
Cache(96KB)
BusInterface
Unit
BusInterface
UnitL3
CacheL3
CacheInt UnitInt Unit
128-bit internal data bus
Inst Unit Exec Unit Memory Unit
128bit data
40bit address
90
Current Status of Processor Technology
Still don’t work well for some applications:data bases, CAD tools, sparse matrix,..
Alpha 21164, 300 MHz, 4-way superscalarRunning Microsoft SQLserver database on Windows NT It operates at 12% of peak performance
Caches don’t work. Speed is tied to memory bandwidth.
91
Microprocessor-DRAM performance gap full cache miss time = 100s instructions Alpha 7000 server: 340 ns/5.0 ns = 68
clks (2-issue, x2 = 136 insts) Alpha 8400 Server: 266 ns/3.3ns = 80
clks (21164 processor, 4-issue, x 4 = 320 insts)
Rely on locality + caches to bridge gap
92
Processor-Memory Gap
µProc60%/yr.(2X/1.5yr)
DRAM9%/yr.(2X/10 yrs)1
10
100
1000
198
0198
1 198
3198
4198
5 198
6198
7198
8198
9199
0199
1 199
2199
3199
4199
5199
6199
7199
8 199
9200
0
DRAM
CPU198
2
Processor-MemoryPerformance Gap:(grows 50% / year)
Per
form
ance
Time
“Moore’s Law”
Processor-DRAM Memory Gap (latency)
93
Future Processors
Specialized, very long instruction word (VLIW) machines
Wide, simultaneous multithreaded (SMT) uniprocessor
Single-chip multiprocessor Memory-centric computing engines (IRAM,
PPRAM,CRAM)
94
IRAM: Berkeley
Growing performance gap between CPU and memory access speed
Microprocessor and DRAM on single chip Bridge the processor-memory
performance gap via on-chip latency and bandwidth
improve power-performance
95
CRAM (Univ. Toronto)
Computation moved from the CPU into the memory
CRAM= RAM+SIMD PetaOPS performance (1015 operations
per second) Bandwidth internal to memory: 2.9 TB/s Cache/CPU: 800 MB/s
96
Other Projects
PPRAM Project (Kyushu Univ., Japan): Parallel Processing RAM Chip
CMP Project (Stanford)billion-transistor processor architecturesingle-chip multiprocessor (4 to 16)New ISAs