Review: The Memory Hierarchy
[Figure: memory hierarchy pyramid – Processor, L1$, L2$, Main Memory, Secondary Memory; access time and (relative) memory size increase with distance from the processor]
- Inclusive – what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM
- Typical transfer sizes grow down the hierarchy: 4-8 bytes (word), 8-32 bytes (block), 1 to 4 blocks, 1,024+ bytes (disk sector = page)
Review: Principle of Locality
- Temporal locality: keep most recently accessed data items closer to the processor
- Spatial locality: move blocks consisting of contiguous words to the upper levels
- Hit Time << Miss Penalty
- Hit: data appears in some block in the upper level (Blk X)
  - Hit Rate: the fraction of accesses found in the upper level
  - Hit Time: RAM access time + time to determine hit/miss
- Miss: data must be retrieved from a block in the lower level (Blk Y)
  - Miss Rate = 1 - Hit Rate
  - Miss Penalty: time to replace a block in the upper level with a block from the lower level + time to deliver this block's word to the processor
- Miss types: compulsory, conflict, capacity
[Figure: upper level memory (Blk X) and lower level memory (Blk Y), with data moving to and from the processor]
Measuring Cache Performance
- Assuming cache hit costs are included as part of the normal CPU execution cycle, then
  CPU time = IC × CPI_stall × CC
           = IC × (CPI_ideal + Memory-stall cycles) × CC
- Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls)
  Read-stall cycles  = reads/program × read miss rate × read miss penalty
  Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls
- For write-through caches, we can simplify this to
  Memory-stall cycles = accesses/program × miss rate × miss penalty
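The stall formulas above can be sketched numerically. All parameter values below (instruction count, miss rates, penalties, clock) are invented for illustration, not taken from the slides:

```python
# Assumed workload parameters (illustrative only)
IC = 1_000_000            # instruction count
CC = 0.5e-9               # clock cycle time in seconds (assumed 2 GHz)
CPI_ideal = 1.0           # CPI with a perfect cache

reads, writes = 600_000, 200_000
read_miss_rate, write_miss_rate = 0.04, 0.02
read_miss_penalty = write_miss_penalty = 100   # cycles
write_buffer_stalls = 0   # assume an adequate write buffer

# Read-stall and write-stall cycles, per the slide formulas
read_stalls = reads * read_miss_rate * read_miss_penalty
write_stalls = (writes * write_miss_rate * write_miss_penalty
                + write_buffer_stalls)
memory_stall_cycles = read_stalls + write_stalls

# CPU time = IC × (CPI_ideal + memory-stall cycles per instruction) × CC
CPI_stall = CPI_ideal + memory_stall_cycles / IC
cpu_time = IC * CPI_stall * CC
```

With these numbers the memory stalls dominate: CPI_stall is 3.8 versus an ideal CPI of 1.0, which is why reducing miss rate and miss penalty matters so much.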
Set Associative Cache Example
[Figure: 2-way set associative cache (2 sets × 2 ways, each entry with V, Tag, Data) next to a 16-block main memory with 6-bit byte addresses 0000xx through 1111xx]
- Q1: Is it there? Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache
- Q2: How do we find it? Use the next 1 low order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache)
- Two low order bits define the byte in the word (32-bit words)
- One-word blocks
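The address breakdown in this example can be sketched as below. The helper name `split_address` is invented; the field widths follow the slide (2 byte-offset bits, 1 set-index bit, 3 tag bits for a 6-bit address):

```python
def split_address(addr, offset_bits=2, index_bits=1):
    """Split a byte address into (tag, set_index, byte_offset)."""
    byte_offset = addr & ((1 << offset_bits) - 1)          # low order bits
    set_index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)               # high order bits
    return tag, set_index, byte_offset

# Byte address 0b010100 (block address 0b0101): tag 0b010, set 1, offset 0
tag, set_index, offset = split_address(0b010100)
```

On a lookup, the hardware indexes the chosen set and compares the stored tags of both ways against `tag` in parallel.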
Benefits of Set Associative Caches
- The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation
[Figure: miss rate (%) versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB; data from Hennessy & Patterson, Computer Architecture, 2003]
- Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)
Q1&Q2: Where can a block be placed/found?

  Scheme             # of sets                              Blocks per set
  Direct mapped      # of blocks in cache                   1
  Set associative    (# of blocks in cache)/associativity   Associativity (typically 2 to 16)
  Fully associative  1                                      # of blocks in cache

  Scheme             Location method                        # of comparisons
  Direct mapped      Index                                  1
  Set associative    Index the set; compare set's tags      Degree of associativity
  Fully associative  Compare all blocks' tags               # of blocks
Q3: Which block should be replaced on a miss?
- Easy for direct mapped – only one choice
- Set associative or fully associative:
  - Random
  - LRU (Least Recently Used)
- For a 2-way set associative cache, random replacement has a miss rate about 1.1 times higher than LRU
- LRU is too costly to implement for high levels of associativity (> 4-way), since tracking the usage information for every way is expensive
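LRU replacement within one set can be sketched as follows. This is an illustrative model, not hardware: each set keeps its tags ordered from most- to least-recently used, and a miss evicts the last entry:

```python
def access(set_tags, tag, ways=2):
    """Access `tag` in one cache set (MRU-first list); True on hit."""
    if tag in set_tags:
        set_tags.remove(tag)        # hit: move tag to the MRU position
        set_tags.insert(0, tag)
        return True
    if len(set_tags) == ways:       # miss in a full set: evict the LRU way
        set_tags.pop()
    set_tags.insert(0, tag)         # install the new block as MRU
    return False

s = []
hits = [access(s, t) for t in [1, 2, 1, 3, 2]]
# 1 and 2 are compulsory misses; 1 hits; 3 evicts LRU block 2; 2 misses again
```

Tracking this recency order needs only one bit per set at 2-way, which is why LRU is cheap there and costly at higher associativity.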
Improving Cache Performance
0. Reduce the time to hit in the cache
   - smaller cache
   - direct mapped cache
   - smaller blocks
   - for writes:
     - no write allocate – no "hit" on cache, just write to write buffer
     - write allocate – to avoid two cycles (first check for hit, then write), pipeline writes via a delayed write buffer to cache
1. Reduce the miss rate
   - bigger cache
   - more flexible placement (increase associativity)
   - larger blocks (16 to 64 bytes typical)
   - victim cache – small buffer holding most recently discarded blocks
Improving Cache Performance
2. Reduce the miss penalty
   - smaller blocks
   - use a write buffer to hold dirty blocks being replaced, so reads don't have to wait for the write to complete
   - check the write buffer (and/or victim cache) on a read miss – may get lucky
   - for large blocks, fetch the critical word first
   - use multiple cache levels – the L2 cache is not tied to the CPU clock rate
   - faster backing store / improved memory bandwidth
     - wider buses
     - memory interleaving, page mode DRAMs
Summary: The Cache Design Space
- Several interacting dimensions
  - cache size
  - block size
  - associativity
  - replacement policy
  - write-through vs write-back
  - write allocation
- The optimal choice is a compromise
  - depends on access characteristics
    - workload
    - use (I-cache, D-cache, TLB)
  - depends on technology / cost
- Simplicity often wins
[Figure: design-space sketch – for each factor (associativity, cache size, block size), quality runs from Good to Bad as the factor varies from Less to More]
Next Lecture and Reminders
- Next lecture
  - Reading assignment – PH 7.4
- Reminders
  - HW4 due November 8th
  - Check grade posting on-line (by your midterm exam number) for correctness
  - Final exam (tentative) schedule
    - Tuesday, December 13th, 2:30-4:20, Location TBD
What is RAID?
- Redundant Array of Independent (Inexpensive) Disks
- A set of physical disks treated as one logical disk
- Data are distributed over the disks
- Redundant capacity is used for parity, allowing for data repair
- Reliability is characterized by the Mean Time Between Failures (MTBF)
Levels of RAID
- 6 levels of RAID (0-5) have been accepted by industry
- Other kinds have been proposed in the literature
- Levels 2 and 4 are not commercially available; they are included for clarity
RAID 0
- All data (user and system) are distributed over the disks so that there is a reasonable chance for parallelism
- Each disk is logically a set of strips (blocks, sectors, …); strips are numbered and assigned consecutively to the disks (see picture)

RAID 0
a. Splits data among two or more disks
b. Provides good performance
c. Lack of data redundancy means there is no failover support with this configuration
d. In the diagram to the right, the odd blocks are written to disk 0 and the even blocks to disk 1, such that A1, A2, A3, A4, … would be the order of blocks read sequentially from the beginning
e. Used in read-only NFS systems and gaming systems

RAID 0 (no redundancy) – data mapping for Level 0
[Figure: four disks holding strips 0-15 round robin – disk 0: strips 0, 4, 8, 12; disk 1: strips 1, 5, 9, 13; disk 2: strips 2, 6, 10, 14; disk 3: strips 3, 7, 11, 15]
RAID 1
- RAID 1 is "data mirroring"
- Two copies of the data are held on two physical disks, and the data is always identical
- Twice as many disks are required to store the same data when compared to RAID 0
- The array continues to operate so long as at least one drive is functioning

RAID 1 (mirrored)
[Figure: eight disks – four data disks holding strips 0-15 round robin, as in RAID 0, plus four mirror disks holding identical copies of the same strips]
RAID 2 (redundancy through Hamming code)
[Figure: data bits b0-b2 striped across disks, with check disks holding the Hamming code bits f0(b), f1(b), f2(b)]
- Small strips, one byte or one word
- Synchronized disks; each I/O operation is performed in parallel across all of them
- An error correcting code (Hamming code) allows correction of a single-bit error
- The controller can correct errors without additional delay
- Still expensive; only used where many frequent errors can be expected
Hamming code example
- Stored sequence (positions 7 … 1): 1 0 1 0 1 0 1
  - Data bits 1011 in positions 7, 6, 5, 3; parity bits in positions 4, 2, 1
  - All three parity checks pass (syndrome 000)
- Received sequence with bit 6 flipped: 1 1 1 0 1 0 1
  - Check 4 (positions 4, 5, 6, 7) fails: 1
  - Check 2 (positions 2, 3, 6, 7) fails: 1
  - Check 1 (positions 1, 3, 5, 7) passes: 0
  - Syndrome 110 = 6, so the single error is in position 6 and can be repaired
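The single-error correction above can be sketched directly. This is an illustrative Hamming(7,4) decoder, with bits indexed 1-7 and parity bits at positions 1, 2, 4 as in the slide:

```python
def syndrome(bits):
    """bits[1..7]; returns the position of a single-bit error (0 = none)."""
    s = 0
    for check in (1, 2, 4):
        # Each parity check covers the positions whose index has that bit set,
        # e.g. check 4 covers positions 4, 5, 6, 7.
        parity = 0
        for pos in range(1, 8):
            if pos & check:
                parity ^= bits[pos]
        if parity:
            s += check
    return s

# Stored word, positions 7..1 = 1 0 1 0 1 0 1 (index 0 unused)
stored = [None, 1, 0, 1, 0, 1, 0, 1]

received = stored[:]
received[6] ^= 1              # flip bit 6, as in the example
err = syndrome(received)      # syndrome 110 = 6 points at the flipped bit
received[err] ^= 1            # repair the word
```

The syndrome works because each position's index is exactly the sum of the checks that cover it, so a single flipped bit makes precisely those checks fail.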
RAID 3 (bit-interleaved parity)
[Figure: data bits b0-b2 striped across disks, plus one parity disk P(b)]
- Level 2 needs log2(number of disks) parity disks; Level 3 needs only one, for a single parity bit
- If one disk crashes, the data can still be reconstructed, even on line ("reduced mode"), and written back (X1-X4 data, P parity; ⊕ denotes XOR):
  P  = X1 ⊕ X2 ⊕ X3 ⊕ X4
  X1 = P ⊕ X2 ⊕ X3 ⊕ X4
- RAID 2-3 achieve high data transfer rates, but perform only one I/O at a time, so response times in transaction-oriented environments are not so good
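The reconstruction identity above can be sketched byte-wise over whole strips. The strip contents below are invented single-byte values for illustration:

```python
from functools import reduce

def xor_strips(*strips):
    """XOR corresponding bytes of several equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, bs) for bs in zip(*strips))

# Four data strips (invented values) and their parity strip
x1, x2, x3, x4 = b"\x0f", b"\x33", b"\x55", b"\xf0"
p = xor_strips(x1, x2, x3, x4)        # P = X1 ⊕ X2 ⊕ X3 ⊕ X4

# Disk 1 fails: rebuild its strip from parity and the survivors
rebuilt = xor_strips(p, x2, x3, x4)   # X1 = P ⊕ X2 ⊕ X3 ⊕ X4
```

Reconstruction works because XOR is its own inverse: XORing P with the surviving strips cancels every term except the lost one.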
RAID 4 (block-level parity)
[Figure: five disks – disk 0: blocks 0, 4, 8, 12; disk 1: blocks 1, 5, 9, 13; disk 2: blocks 2, 6, 10, 14; disk 3: blocks 3, 7, 11, 15; dedicated parity disk: P(0-3), P(4-7), P(8-11), P(12-15)]
RAID 4
- Larger strips and one parity disk
- Blocks are kept on one disk, allowing for parallel access by multiple I/O requests
- Writing penalty: when a block is written, the parity disk must be adjusted (e.g., writing X1', with ⊕ as XOR):
  P  = X4 ⊕ X3 ⊕ X2 ⊕ X1
  P' = X4 ⊕ X3 ⊕ X2 ⊕ X1'
     = X4 ⊕ X3 ⊕ X2 ⊕ X1' ⊕ X1 ⊕ X1
     = P ⊕ X1 ⊕ X1'
- The parity disk may be a bottleneck
- Good response times, less good transfer rates
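The small-write parity update P' = P ⊕ X1 ⊕ X1' derived above can be sketched as below; the block values are invented for illustration:

```python
def update_parity(old_parity, old_block, new_block):
    """New parity after overwriting old_block with new_block (byte-wise)."""
    return bytes(p ^ o ^ n
                 for p, o, n in zip(old_parity, old_block, new_block))

# Four data blocks (invented values) and their full parity
x1, x2, x3, x4 = b"\x01", b"\x02", b"\x04", b"\x08"
p = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(x1, x2, x3, x4))

# Overwrite X1 with X1': only X1, X1' and P are read/written,
# not the other data disks
new_x1 = b"\x10"
p_new = update_parity(p, x1, new_x1)
```

This read-modify-write needs 2 reads and 2 writes per small write, and since every write touches the one parity disk, that disk becomes the bottleneck the slide mentions.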
RAID 5 (block-level distributed parity)
[Figure: five disks with parity rotated round robin – disk 0: blocks 0, 4, 8, 12, P(16-19); disk 1: blocks 1, 5, 9, P(12-15), 16; disk 2: blocks 2, 6, P(8-11), 13, 17; disk 3: blocks 3, P(4-7), 10, 14, 18; disk 4: P(0-3), 7, 11, 15, 19]
RAID 5
- Distributes the parity strips over all disks to avoid the parity-disk bottleneck
- Can use round robin placement; for stripe s (blocks 4s … 4s+3) on 5 disks:
  parity disk = (−(s + 1)) mod 5
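The round robin placement can be checked against the figure with a short sketch (the function name is invented; 5 disks and 4 data blocks per stripe as in the diagram):

```python
def parity_disk(block, disks=5, per_stripe=4):
    """Disk holding the parity for the stripe containing `block`."""
    stripe = block // per_stripe
    return (-(stripe + 1)) % disks   # parity rotates backwards, one disk per stripe

# Stripes 0..4 place parity on disks 4, 3, 2, 1, 0 - matching
# P(0-3), P(4-7), P(8-11), P(12-15), P(16-19) in the figure
placement = [parity_disk(4 * s) for s in range(5)]
```

Rotating the parity one disk per stripe spreads parity writes evenly, so no single disk serializes the small-write traffic.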
Overview: RAID 0-2
[Figure: side-by-side data mapping diagrams for RAID 0, RAID 1, and RAID 2]
Overview: RAID 3-5
[Figure: side-by-side data mapping diagrams for RAID 3, RAID 4, and RAID 5]