CS61CL Machine Structures
Lec 11 – Introduction to Cache Design
David Culler Electrical Engineering and Computer Sciences
University of California, Berkeley
CS61CL Road Map
[Figure: the hardware/software stack — an HLL program (foo.c) is compiled to an assembly language program (foo.s), assembled to a machine language program (foo.o), and linked into foo.exe; below the Instruction Set Architecture sit the machine organization (I/O system, instruction set processor, datapath & control), digital design, circuit design, layout & fab, and semiconductor materials.]
Turning “Moore stuff” into performance
Performance Trends
[Figure: processor performance trends over time (e.g., the MIPS R3000 era).]

Recall: Performance
• Performance is in units of things per second – bigger is better
• If we are primarily concerned with response time: performance(X) = 1 / execution_time(X)
• "X is n times faster than Y" means:
  Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X) = n
• Speedup(E) = Performance(with E) / Performance(without E)
Review: Pipelined Execution
• Speedup with N stages is ≤ N
• Limited by dependences (aka hazards)
  – Structural hazard: two operations want to use the same resource at the same time
  – Data hazard: cannot use a value before it is produced
  – Control hazard: attempt to branch before the condition is determined
[Figure: pipelined datapath — PC and instruction memory feeding the pipeline registers IR, IR_ex, IR_mem, IR_wb, with a data memory stage.]
The Problem: Memory Gap
• 1985: 80386, cache off-chip
• 1989: first Intel CPU with cache on chip (80486)
• 1995: first Intel CPU with two levels of cache on chip (Pentium Pro)
[Figure: performance vs. year, 1980–2000, log scale from 1 to 1000 — µProc improves ~60%/yr while DRAM improves ~7%/yr; the processor-memory performance gap grows ~50%/yr.]
Recall: Where do Objects Live and Work?
[Figure: processor registers beside a memory addressed 000…0 through FFF…F; load and store move words between memory location n and the registers, the processor operates on registers, and a read can be a read-hit or a read-miss.]
Storage Hierarchy
[Figure: pyramid from the processor (Level 1) down through Level 2, Level 3, …, Level n — registers, cache, memory, disk. The size of memory grows at each level; with increasing distance from the processor, speed decreases: as we move to deeper levels, the latency goes up and the price per bit goes down.]
Why Caches Work
• Physics:
  – large memories are slow; fast memories are small
• Statistics: programs exhibit locality
  – Temporal locality: recently accessed locations are likely to be accessed again soon
  – Spatial locality: if a location is accessed, others nearby are likely to be accessed too
• Use statistics to cheat the laws of physics
  – illusion of a large, fast memory
  – on average, access to a large memory can be fast
  – keep recently accessed blocks in a small, fast memory
• Avg Mem Access Time = Hit Time + P(miss) × Miss Penalty
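This formula drives the design discussion for the rest of the lecture. As a quick illustration, here is a minimal sketch of it in C; the numbers are illustrative (they reappear in an example near the end of the deck):

```c
#include <stdio.h>

/* AMAT = hit time + P(miss) * miss penalty, all in cycles. */
static double amat(double hit_time, double p_miss, double miss_penalty) {
    return hit_time + p_miss * miss_penalty;
}

int main(void) {
    /* 1-cycle hit, 5% miss rate, 20-cycle miss penalty. */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 20.0)); /* prints 2.00 */
    return 0;
}
```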
Manual vs. Automatic Management of the Storage Hierarchy
• In everyday life?
  – which books go in your backpack? on your desk? in the library? on Amazon?
  – your music collection?
• Registers? Files? Cache?
Cache: Transparent Memory Acceleration
• Processor performs reads and writes on memory locations
  – instruction fetch, load, store
  – the memory abstraction is unchanged!
• Cache holds a copy of a small portion of the memory
  – hit: present in cache ⇒ respond quickly
  – miss: absent from cache ⇒ obtain it from memory and respond
• Unit of transfer: block
  – several words of memory move into a cache line
• Where can a block be placed? How can we tell if it is there? What happens to memory on a write hit? What happens to the cache on a write miss?
Direct-Mapped Cache
• Each memory address is associated with one possible block within the cache
  – ⇒ only need to look in a single location in the cache for the data, if it exists in the cache
  – the block is the unit of transfer between cache and memory

Direct-Mapped Cache (B=1, S=4)
• 4-byte direct-mapped cache, block size = 1 byte, cache indices 0–3
• Cache line 0 can be occupied by data from:
  – memory locations 0, 4, 8, …
  – 4 blocks ⇒ any memory location that is a multiple of 4
[Figure: memory addresses 0–F mapping onto the 4 cache lines.]
Direct-Mapped Cache (B=2, S=4)
• 8-byte direct-mapped cache, block size = 2 bytes, cache indices 0–3
• How is the block located? How is the byte within the block selected?
• e.g., memory address 11101? (worked through in the sketch below)
[Figure: memory addresses 0–1E (even block starts) mapping onto the 4 two-byte cache lines.]
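A minimal sketch in C of how the slide's question for address 11101 works out under this B=2, S=4 geometry; the bit-field widths follow from the figure (1 offset bit, 2 index bits, the rest tag):

```c
#include <stdio.h>

/* Address split for the 8-byte direct-mapped cache above:
   block size B = 2 bytes -> 1 offset bit; S = 4 lines -> 2 index bits;
   the remaining high bits form the tag. */
int main(void) {
    unsigned addr   = 0x1D;              /* 11101 binary, from the slide */
    unsigned offset = addr & 0x1;        /* byte within the 2-byte block */
    unsigned index  = (addr >> 1) & 0x3; /* which of the 4 cache lines   */
    unsigned tag    = addr >> 3;         /* identifies the memory block  */
    printf("offset=%u index=%u tag=%u\n", offset, index, tag); /* 1 2 3 */
    return 0;
}
```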
How do you tell if the right block is in the line?
• Like luggage at the airport…
Tag Check (B=2, S=4, N=1)
• What should go in the tag?
  – the entire address?
  – no: we don't need the bits we used in getting there
[Figure: each cache line stores a tag next to its data; the tag records which memory block currently occupies the line.]

Address layout: ttt…t iii…i oooo
  – tag: to check if we have the correct block
  – index: to select the block*
  – offset: byte within the block
Mapping a Memory Address to the Cache
• *direct-mapped ⇒ 1 block per "set"
• more generally, the index selects a set
Direct-Mapped Cache Example (1/3)
• Suppose we have 8 KB of data in a direct-mapped cache with 16-byte blocks
• Determine the size of the tag, index, and offset fields if we're using a 32-bit architecture
• Offset
  – need to specify the correct byte within a block
  – block contains 16 = 2^4 bytes
  – ⇒ need 4 bits to specify the correct byte

Direct-Mapped Cache Example (2/3)
• Index: (~index into an "array of blocks")
  – need to specify the correct block in the cache
  – cache contains 8 KB = 2^13 bytes; block contains 16 B = 2^4 bytes
  – # blocks/cache = (bytes/cache) / (bytes/block) = 2^13 / 2^4 = 2^9 blocks/cache
  – ⇒ need 9 bits to specify this many blocks

Direct-Mapped Cache Example (3/3)
• Tag: use the remaining bits as the tag
  – tag length = address length − offset − index = 32 − 4 − 9 = 19 bits
  – so the tag is the leftmost 19 bits of the memory address (a sketch of this split follows)
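A small C sketch of this 19/9/4 field split, applied to an illustrative address:

```c
#include <stdio.h>

/* The 19/9/4 split from the example: 32-bit address, 16-byte blocks
   (4 offset bits), 8 KB / 16 B = 512 lines (9 index bits). */
#define OFFSET_BITS 4
#define INDEX_BITS  9

int main(void) {
    unsigned addr   = 0x00000014;  /* illustrative address */
    unsigned offset = addr & ((1u << OFFSET_BITS) - 1);
    unsigned index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=%u index=%u offset=%u\n", tag, index, offset); /* 0 1 4 */
    return 0;
}
```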
Administration
• Midterms to be returned in Tu/W lab
• HW 8 (the last) out today, due ???
• Proj 4 out today, due ??
• Pick the due dates and plan RRR Week
16 KB Direct-Mapped Cache, 16 B Blocks
• Valid bit: determines whether anything is stored in that row (when the computer is initially turned on, all entries are invalid)
[Figure: 1024 rows (index 0–1023), each holding a valid bit, a tag, and four data words at byte offsets 0x0-3, 0x4-7, 0x8-b, 0xc-f; initially every valid bit is 0. The row layout is sketched below.]
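As an illustration of the row layout just described, a minimal C sketch under this slide's 16 KB / 16 B geometry (the struct is for exposition, not a hardware description):

```c
#include <stdbool.h>
#include <stdint.h>

/* One row of the 16 KB direct-mapped cache with 16-byte blocks:
   a valid bit, an 18-bit tag (held in a 32-bit field here), and
   four words covering byte offsets 0x0-3, 0x4-7, 0x8-b, 0xc-f. */
struct cache_row {
    bool     valid;    /* false at power-on: nothing stored yet */
    uint32_t tag;      /* identifies which memory block is held */
    uint32_t data[4];  /* 16 bytes of cached data               */
};

/* 1024 rows (index 0-1023); zero-initialized, so every row starts
   out invalid, matching the "just turned on" picture. */
struct cache_row cache_rows[1024];
```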
1. Load Byte 0x00000014
• Address = 000000000000000000 0000000001 0100 (tag field | index field | offset)
• So we read block 1 (index 0000000001)
• No valid data there yet, so load that data into the cache, setting the tag and valid bit
• Read from the cache at the offset; return word b
[Cache state: index 1 now holds valid = 1, tag = 0, data words d c b a (offsets 0xc-f, 0x8-b, 0x4-7, 0x0-3); all other entries remain invalid.]
2. Read Byte 0x0000001C
• Address = 000000000000000000 0000000001 1100 (tag field | index field | offset)
• Index 1 is valid, and the tag matches
• Hit: return word d
[Cache state unchanged: index 1 holds valid = 1, tag = 0, data d c b a.]
3. Load Byte 0x00000034
• Address = 000000000000000000 0000000011 0100 (tag field | index field | offset)
• So we read block 3: no valid data there yet
• Load that cache block; return word f
[Cache state: index 1 holds valid = 1, tag = 0, data d c b a; index 3 now holds valid = 1, tag = 0, data h g f e.]
4. Load Byte 0x00008014
• Address = 000000000000000010 0000000001 0100 (tag field | index field | offset)
• So we read cache block 1; the data there is valid
• But the tag does not match (0 ≠ 2)
• Miss, so replace block 1 with the new data & tag; return word j
[Cache state: index 1 now holds valid = 1, tag = 2, data l k j i; index 3 still holds valid = 1, tag = 0, data h g f e.]
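The whole four-access walkthrough can be checked with a tiny simulator. A minimal sketch in C, tracking only valid bits and tags (data movement is omitted), with the 18/10/4 field split used above:

```c
#include <stdio.h>

/* Sketch of the 16 KB direct-mapped cache with 16-byte blocks used above:
   1024 lines, 4 offset bits, 10 index bits, 18 tag bits (32-bit address). */
#define LINES 1024

struct line { int valid; unsigned tag; };
static struct line cache[LINES];   /* zero-initialized: all invalid */

static void access_addr(unsigned addr) {
    unsigned index = (addr >> 4) & (LINES - 1); /* 10 index bits        */
    unsigned tag   = addr >> 14;                /* 4 + 10 bits consumed */
    if (cache[index].valid && cache[index].tag == tag) {
        printf("0x%08x: hit\n", addr);
    } else {
        printf("0x%08x: miss%s\n", addr,
               cache[index].valid ? " (replace)" : "");
        cache[index].valid = 1;
        cache[index].tag   = tag;
    }
}

int main(void) {
    /* The four accesses from the walkthrough; expected output:
       miss, hit, miss, miss (replace). */
    unsigned addrs[] = { 0x00000014, 0x0000001C, 0x00000034, 0x00008014 };
    for (int i = 0; i < 4; i++)
        access_addr(addrs[i]);
    return 0;
}
```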
What to do on a write hit?
• Write-through
  – update the word in the cache block and the corresponding word in memory
• Write-back
  – update the word in the cache block; allow the memory word to be "stale"
  – ⇒ add a 'dirty' bit to each block, indicating that memory needs to be updated when the block is replaced
  – ⇒ the OS flushes the cache before I/O…
• Performance trade-offs? (the two policies are contrasted in the sketch below)
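A sketch in C of where the two policies differ on a write hit; mem_write is a hypothetical stub standing in for the memory bus, and the line layout is illustrative:

```c
#include <stdbool.h>
#include <stdio.h>

struct line { bool valid, dirty; unsigned tag; unsigned data[4]; };

static void mem_write(unsigned addr, unsigned word) {
    printf("  memory[0x%x] <- %u\n", addr, word);  /* stub for the bus */
}

/* Write-through: update the cached word and memory together. */
static void write_hit_through(struct line *l, unsigned addr, unsigned word) {
    l->data[(addr >> 2) & 3] = word;   /* word within the 16-byte block */
    mem_write(addr, word);             /* memory never goes stale       */
}

/* Write-back: update only the cache and mark the line dirty;
   memory is updated later, when the line is evicted. */
static void write_hit_back(struct line *l, unsigned addr, unsigned word) {
    l->data[(addr >> 2) & 3] = word;
    l->dirty = true;                   /* remember that memory is stale */
}

int main(void) {
    struct line l = { .valid = true };
    printf("write-through:\n");
    write_hit_through(&l, 0x14, 42);
    printf("write-back (no memory traffic yet):\n");
    write_hit_back(&l, 0x14, 43);
    printf("  dirty=%d\n", l.dirty);
    return 0;
}
```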
Types of Cache Misses (Three C's)
• 1st C: Compulsory misses
  – occur when a program is first started
  – the cache does not contain any of that program's data yet, so misses are bound to occur
  – reduced with increasing block size
• 2nd C: Conflict misses
  – misses that occur because two distinct memory addresses map to the same cache line
  – when both are needed, they keep overwriting each other
• Dealing with conflict misses
  – Solution 1: make the cache size bigger
    » more lines, fewer conflicts
    » conflicts far apart in the address space remain
  – Solution 2: allow multiple distinct blocks at the same cache index
Fully Associative Cache (B=32)
• Any block can go anywhere
• Memory address fields:
  – Offset: byte within the block
  – Index: none
  – Tag: all the rest
• Compare all tags in parallel
[Figure: each line holds a valid bit, a 27-bit cache tag, and data bytes B0–B31; the incoming tag is compared (=) against every stored tag simultaneously, and the byte offset selects within the matching block.]
Types of Cache Misses (Three C's)
• 3rd C: Capacity misses
  – misses that occur because the cache has a limited size
  – misses that would not occur if we increased the size of the cache
N-Way Set Associative Cache
• Basic idea
  – direct-map to a set, then associative lookup of the N blocks within it
• Memory address fields:
  – Tag: same as before
  – Offset: same as before
  – Index: points us to the correct "row" (called a set in this case)
• Given a memory address (see the lookup sketch below):
  – find the correct set using the index value
  – compare the tag with all tag values in the determined set
  – if a match occurs, hit!, otherwise a miss
  – finally, use the offset field as usual to find the desired data within the block
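A minimal C sketch of the lookup just described; the geometry parameters (N = 4, 64 sets, 16-byte blocks) are illustrative assumptions, not a design fixed by the slides:

```c
#include <stdbool.h>
#include <stdio.h>

#define N           4
#define SETS        64   /* 6 index bits  */
#define OFFSET_BITS 4    /* 16-byte blocks */
#define INDEX_BITS  6

struct way { bool valid; unsigned tag; unsigned char data[16]; };

struct way cache[SETS][N];

/* Returns the matching way on a hit, NULL on a miss. Hardware would
   perform the N tag comparisons in parallel rather than in a loop. */
struct way *lookup(unsigned addr) {
    unsigned index = (addr >> OFFSET_BITS) & (SETS - 1);
    unsigned tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    for (int w = 0; w < N; w++)
        if (cache[index][w].valid && cache[index][w].tag == tag)
            return &cache[index][w];
    return NULL;
}

int main(void) {
    unsigned addr = 0x1230;
    /* Install the block for addr in way 0 of its set, then probe. */
    unsigned index = (addr >> OFFSET_BITS) & (SETS - 1);
    cache[index][0].valid = true;
    cache[index][0].tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("0x%05x -> %s\n", addr, lookup(addr) ? "hit" : "miss");
    printf("0x%05x -> %s\n", addr + 0x10000,
           lookup(addr + 0x10000) ? "hit" : "miss"); /* same set, new tag */
    return 0;
}
```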
Associative Cache Example
• 2-way set-associative cache
[Figure: memory addresses 0–F mapping onto a cache with two sets (index 0 and 1), each holding two blocks.]

4-Way Set Associative Cache Circuit
[Figure: the index selects a set; the tag is compared against all four ways in parallel.]
Block Replacement Policy
• Direct-mapped cache
  – the index completely specifies which position a block can go in on a miss
• N-way set associative
  – the index specifies a set, but the block can occupy any position within the set on a miss
• Fully associative
  – the block can be written into any position
• Question: if we have the choice, where should we write an incoming block?
  – If there are any locations with the valid bit off (empty), then usually write the new block into the first one.
  – If all possible locations already have a valid block, we must pick a replacement policy: the rule by which we determine which block gets "cached out" on a miss.
Block Replacement Policy: LRU
• LRU (Least Recently Used)
  – Idea: cache out the block which has been accessed (read or write) least recently
  – Pro: temporal locality ⇒ recent past use implies likely future use; in fact, this is a very effective policy
  – Con: with 2-way set assoc, easy to keep track (one LRU bit); with 4-way or greater, requires complicated hardware and much time to keep track of this
Block Replacement Example
• We have a 2-way set-associative cache with a four-word total capacity and one-word blocks. We perform the following word accesses (ignore bytes for this problem):
  0, 2, 0, 1, 4, 0, 2, 3, 5, 4
• How many hits and how many misses will there be under the LRU block replacement policy?
Block Replacement: LRU
• Addresses: 0, 2, 0, 1, 4, 0, …
  – 0: miss, bring into set 0 (loc 0)
  – 2: miss, bring into set 0 (loc 1)
  – 0: hit
  – 1: miss, bring into set 1 (loc 0)
  – 4: miss, bring into set 0 (loc 1, replacing 2)
  – 0: hit
[Figure: set contents after each access, with the LRU marker moving between loc 0 and loc 1 of each set; even addresses map to set 0, odd addresses to set 1. The full ten-access sequence is simulated in the sketch below.]
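The full ten-access question can be checked with a small simulation. A sketch in C, assuming a single LRU bit per set (which is exact LRU for a 2-way cache and fills empty ways first under this initialization):

```c
#include <stdio.h>

/* The slide's cache: 2-way set associative, four one-word blocks total
   (2 sets), LRU replacement. lru[set] names the way to evict next. */
#define SETS 2
#define WAYS 2

static int tags[SETS][WAYS];
static int valid[SETS][WAYS];
static int lru[SETS];

static int access_word(int addr) {
    int set = addr % SETS, tag = addr / SETS;
    for (int w = 0; w < WAYS; w++)
        if (valid[set][w] && tags[set][w] == tag) {
            lru[set] = 1 - w;          /* the other way is now LRU */
            return 1;                  /* hit */
        }
    int w = lru[set];                  /* evict LRU way (or fill empty) */
    valid[set][w] = 1; tags[set][w] = tag;
    lru[set] = 1 - w;
    return 0;                          /* miss */
}

int main(void) {
    int trace[] = {0, 2, 0, 1, 4, 0, 2, 3, 5, 4}, hits = 0;
    for (int i = 0; i < 10; i++) {
        int h = access_word(trace[i]);
        hits += h;
        printf("%d: %s\n", trace[i], h ? "hit" : "miss");
    }
    /* Matches the hand trace above for the first six accesses;
       the full sequence gives 2 hits, 8 misses. */
    printf("%d hits, %d misses\n", hits, 10 - hits);
    return 0;
}
```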
Big Idea
• How to choose between associativity, block size, replacement & write policy?
• Design against a performance model
  – minimize: Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
  – influenced by technology & program behavior
• Create the illusion of a memory that is large, cheap, and fast – on average
• How can we improve miss penalty?
Improving Miss Penalty
• When caches first became popular, miss penalty ≈ 10 processor clock cycles
• Today: a 2400 MHz processor (0.4 ns per clock cycle) and 80 ns to go to DRAM ⇒ 200 processor clock cycles!
• Solution: another cache between memory and the processor cache: the second-level (L2) cache
[Figure: Proc → $ → $2 → DRAM (MEM).]
And in Conclusion…
• We would like to have the capacity of disk at the speed of the processor: unfortunately this is not feasible.
• So we create a memory hierarchy:
  – each successively lower level contains the "most used" data from the next higher level
  – exploits temporal & spatial locality
  – do the common case fast, worry less about the exceptions (design principle of MIPS)
• Locality of reference is a Big Idea
And in Conclusion…
• Mechanism for transparent movement of data among levels of a storage hierarchy
  – set of address/value bindings
  – address ⇒ index to a set of candidates
  – compare desired address with tag
  – service hit or miss
    » load new block and binding on miss
[Example row: index 1 with valid = 1, tag = 0, data d c b a; address 000000000000000000 0000000001 1100 splits into tag, index, offset.]
And in Conclusion…
• We've discussed memory caching in detail. Caching in general shows up over and over in computer systems
  – filesystem cache, web page cache, game databases / tablebases, software memoization, others?
• Big idea: if something is expensive but we want to do it repeatedly, do it once and cache the result.
• Cache design choices:
  – size of cache: speed vs. capacity
  – block size (i.e., cache aspect ratio)
  – write policy (write-through vs. write-back)
  – associativity: choice of N (direct-mapped vs. set vs. fully associative)
  – block replacement policy
  – 2nd-level cache? 3rd-level cache?
• Use a performance model to pick between choices, depending on programs, technology, budget, …
Bonus Slides
• These are extra slides that used to be included in lecture notes, but have been moved to this, the "bonus" area, to serve as a supplement.
• The slides will appear in the order they would have in the normal presentation.
TIO: The Great Cache Mnemonic
• AREA (cache size, B) = HEIGHT (# of blocks) × WIDTH (size of one block, B/block)
  – 2^(H+W) = 2^H × 2^W
• Tag | Index | Offset
• Ex.: 16 KB of data, direct-mapped, 4-word blocks
  – Can you work out height, width, area? (checked in the sketch below)
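A quick check of the exercise in C, computing height, width, and the resulting T/I/O bit widths:

```c
#include <stdio.h>

/* 16 KB of data, direct-mapped, 4-word blocks.
   AREA = HEIGHT * WIDTH, i.e. 2^(H+W) = 2^H * 2^W. */
int main(void) {
    unsigned area   = 16 * 1024;     /* cache size in bytes          */
    unsigned width  = 4 * 4;         /* 4 words/block * 4 B/word     */
    unsigned height = area / width;  /* number of blocks = 1024      */
    /* W = log2(16) = 4 offset bits; H = log2(1024) = 10 index bits. */
    printf("height=%u blocks, width=%u B -> offset=4, index=10, tag=%u bits\n",
           height, width, 32 - 4 - 10);
    return 0;
}
```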
• Read 4 addresses:
  1. 0x00000014
  2. 0x0000001C
  3. 0x00000034
  4. 0x00008014
• Memory values here:

  Address (hex)   Value of word
  00000010        a
  00000014        b
  00000018        c
  0000001C        d
  …               …
  00000030        e
  00000034        f
  00000038        g
  0000003C        h
  …               …
  00008010        i
  00008014        j
  00008018        k
  0000801C        l
Accessing Data in a Direct-Mapped Cache
• 4 addresses:
  – 0x00000014, 0x0000001C, 0x00000034, 0x00008014
• 4 addresses divided (for convenience) into Tag, Index, Byte Offset fields:
  000000000000000000 0000000001 0100
  000000000000000000 0000000001 1100
  000000000000000000 0000000011 0100
  000000000000000010 0000000001 0100
  (Tag | Index | Offset)
Accessing Data in a Direct-Mapped Cache
• Do an example yourself. What happens?
• Choose from — Cache: hit, miss, miss with replace; values returned: a, b, c, d, e, …, k, l
• Read address 0x00000030? 000000000000000000 0000000011 0000
• Read address 0x0000001C? 000000000000000000 0000000001 1100
[Cache state: index 1 holds valid = 1, tag = 2, data l k j i; index 3 holds valid = 1, tag = 0, data h g f e; all other entries invalid.]
Answers
• 0x00000030: a hit
  – index = 3, tag matches, offset = 0, value = e
• 0x0000001C: a miss
  – index = 1, tag mismatch, so replace from memory; offset = 0xC, value = d
• Since these are reads, the values must equal the memory values whether or not they are cached:
  – 0x00000030 = e
  – 0x0000001C = d
Block Size Tradeoff (1/3)
• Benefits of larger block size
  – Spatial locality: if we access a given word, we're likely to access other nearby words soon
  – Very applicable with the stored-program concept: if we execute a given instruction, it's likely that we'll execute the next few as well
  – Works nicely in sequential array accesses too

Block Size Tradeoff (2/3)
• Drawbacks of larger block size
  – Larger block size means larger miss penalty
    » on a miss, it takes longer to load a new block from the next level
  – If block size is too big relative to cache size, then there are too few blocks
    » result: miss rate goes up
• In general, minimize Average Memory Access Time (AMAT) = Hit Time + Miss Penalty × Miss Rate
Block Size Tradeoff (3/3)
• Hit time
  – time to find and retrieve data from the current level cache
• Miss penalty
  – average time to retrieve data on a current-level miss (includes the possibility of misses on successive levels of the memory hierarchy)
• Hit rate
  – % of requests that are found in the current level cache
• Miss rate
  – 1 − hit rate
Extreme Example: One Big Block
• Cache size = 4 bytes, block size = 4 bytes
  – only ONE entry (row) in the cache!
• If an item is accessed, it is likely to be accessed again soon
  – but unlikely to be accessed again immediately!
• The next access will likely be a miss again
  – continually loading data into the cache but discarding it (forcing it out) before it is used again
  – nightmare for the cache designer: the Ping-Pong Effect
[Figure: a single row with a valid bit, tag, and bytes B0–B3.]
Block Size Tradeoff Conclusions
[Figure: three curves vs. block size — miss penalty increases with block size; miss rate first falls (exploits spatial locality) and then rises (fewer blocks compromises temporal locality); average access time therefore has a minimum, with increased miss penalty & miss rate at large block sizes.]
Analyzing a Multi-Level Cache Hierarchy
[Figure: Proc → $ → $2 → DRAM, annotated with L1 hit time, L1 miss rate, L1 miss penalty and L2 hit time, L2 miss rate, L2 miss penalty.]
• Avg Mem Access Time = L1 Hit Time + L1 Miss Rate × L1 Miss Penalty
• L1 Miss Penalty = L2 Hit Time + L2 Miss Rate × L2 Miss Penalty
• ⇒ Avg Mem Access Time = L1 Hit Time + L1 Miss Rate × (L2 Hit Time + L2 Miss Rate × L2 Miss Penalty)
Example
• Assume
  – hit time = 1 cycle
  – miss rate = 5%
  – miss penalty = 20 cycles
• Calculate AMAT:
  – Avg mem access time = 1 + 0.05 × 20 = 1 + 1 = 2 cycles
Ways to Reduce Miss Rate
• Larger cache
  – limited by cost and technology
  – hit time of the first-level cache must stay below the cycle time (bigger caches are slower)
• More places in the cache to put each block of memory: associativity
  – fully associative
    » any block in any line
  – N-way set associative
    » N places for each block
    » direct-mapped: N = 1
Typical Scale
• L1
  – size: tens of KB
  – hit time: complete in one clock cycle
  – miss rates: 1–5%
• L2
  – size: hundreds of KB
  – hit time: a few clock cycles
  – miss rates: 10–20%
• The L2 miss rate is the fraction of L1 misses that also miss in L2
  – why so high?
Example: with L2 Cache
• Assume
  – L1 hit time = 1 cycle
  – L1 miss rate = 5%
  – L2 hit time = 5 cycles
  – L2 miss rate = 15% (% of L1 misses that miss)
  – L2 miss penalty = 200 cycles
• L1 miss penalty = 5 + 0.15 × 200 = 35 cycles
• Avg mem access time = 1 + 0.05 × 35 = 2.75 cycles

Example: without L2 Cache
• Assume
  – L1 hit time = 1 cycle
  – L1 miss rate = 5%
  – L1 miss penalty = 200 cycles
• Avg mem access time = 1 + 0.05 × 200 = 11 cycles
• 4× faster with the L2 cache! (2.75 vs. 11; both reproduced in the sketch below)
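Both examples follow from the multi-level AMAT formula on the earlier slide; a small C sketch:

```c
#include <stdio.h>

/* AMAT with an L2: the L1 miss penalty is itself an AMAT into L2. */
static double amat2(double l1_hit, double l1_miss_rate,
                    double l2_hit, double l2_miss_rate, double l2_penalty) {
    double l1_penalty = l2_hit + l2_miss_rate * l2_penalty;
    return l1_hit + l1_miss_rate * l1_penalty;
}

int main(void) {
    double with_l2    = amat2(1, 0.05, 5, 0.15, 200); /* 2.75 cycles */
    double without_l2 = 1 + 0.05 * 200;               /* 11 cycles   */
    printf("with L2: %.2f cycles, without: %.2f cycles, speedup %.1fx\n",
           with_l2, without_l2, without_l2 / with_l2);
    return 0;
}
```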
An Actual CPU – Early PowerPC
• Cache
  – 32 KB instruction and 32 KB data L1 caches
  – external L2 cache interface with integrated controller and cache tags; supports up to 1 MB of external L2 cache
  – dual Memory Management Units (MMU) with Translation Lookaside Buffers (TLB)
• Pipelining
  – superscalar (3 inst/cycle)
  – 6 execution units (2 integer and 1 double-precision IEEE floating point)

An Actual CPU – Pentium M
[Figure: die photo showing the 32 KB I$ and 32 KB D$.]