EENG449b/SavvidesLec 18.1
4/13/04
April 13, 2004
Prof. Andreas Savvides
Spring 2004
http://www.eng.yale.edu/courses/eeng449bG
EENG 449bG/CPSC 439bG Computer Systems
Lecture 18
Memory Hierarchy Design Part II
Q1: Where can a Block Be Placed in a Cache?
Set Associativity
• Direct mapped = one-way set associative
• Fully associative = set associative with 1 set
• Most popular cache configurations in today’s processors
– Direct mapped, 2-way set associative, 4-way set associative
Examples
• 32 KB cache for a byte-addressable processor with a 32-bit address space. Which bits of the address are used for the tag, index, and byte-within-block for the following configuration?
a) 8-byte block size, direct mapped
8-byte block size => 3 bits for byte-within-block
32 KB / 8 B = 4 K blocks in the cache => need 12 bits to index
32 bits – (12 + 3) bits = 17 bits remaining => need 17 bits for every tag
Address fields: tag = bits 31 to 15 | index = bits 14 to 3 | byte-within-block = bits 2 to 0
Examples
• 32 KB cache for a byte-addressable processor with a 32-bit address space. Which bits of the address are used for the tag, index, and byte-within-block for the following configuration?
b) 4-byte block size, 8-way set associative
4-byte block size => 2 bits for byte-within-block
32 KB / (4 B x 8) = 1 K sets in the cache => need 10 bits to index
32 bits – (10 + 2) bits = 20 bits remaining => need 20 bits for every tag
Address fields: tag = bits 31 to 12 | index = bits 11 to 2 | byte-within-block = bits 1 to 0
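The field-width arithmetic in the two examples can be checked with a few lines of Python. This is a minimal sketch: the function name `field_widths` and its parameters are illustrative assumptions, and all sizes are assumed to be powers of two.

```python
def field_widths(cache_bytes, block_bytes, ways, address_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache.

    Hypothetical helper; assumes every size is a power of two.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1  # log2 of the block size
    index_bits = num_sets.bit_length() - 1      # log2 of the number of sets
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# 32 KB cache, 8-byte blocks, direct mapped (1 way)
direct_mapped = field_widths(32 * 1024, 8, ways=1)
# 32 KB cache, 4-byte blocks, 8-way set associative
eight_way = field_widths(32 * 1024, 4, ways=8)

print(direct_mapped)  # (17, 12, 3)
print(eight_way)      # (20, 10, 2)
```

A direct-mapped cache is just the `ways=1` case, so the same formula covers both configurations.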
Q2: How is a block found if it is in the cache?
• Block offset: selects the desired data from the block
• Index: selects the set
• Tag: compared against for a hit
• If the cache size remains the same, increasing associativity increases the number of blocks per set => the index shrinks and the tag grows
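As an illustration of how the three fields are pulled out of an address, the sketch below uses shifts and masks. The helper name `split_address` and the sample address `0x12345678` are assumptions chosen for the example, not values from the lecture.

```python
def split_address(addr, index_bits, offset_bits):
    """Split an address into (tag, index, block offset). Hypothetical helper."""
    offset = addr & ((1 << offset_bits) - 1)                # low bits: data within the block
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # middle bits: selects the set
    tag = addr >> (offset_bits + index_bits)                 # remaining bits: compared for a hit
    return tag, index, offset


addr = 0x12345678  # arbitrary sample address
# 32 KB direct mapped, 8-byte blocks: 12 index bits, 3 offset bits
dm_fields = split_address(addr, index_bits=12, offset_bits=3)
# 32 KB 8-way, 4-byte blocks: 10 index bits, 2 offset bits
sa_fields = split_address(addr, index_bits=10, offset_bits=2)

print([hex(f) for f in dm_fields])  # ['0x2468', '0xacf', '0x0']
print([hex(f) for f in sa_fields])  # ['0x12345', '0x19e', '0x0']
```

The same address yields a shorter index and a longer tag in the 8-way configuration, which matches the last bullet above.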
Examples
• Processor contains a 16-word, direct-mapped cache with a 4-word block size. Which of the following addresses will hit in the cache?
0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12
4-word block size => 2 bits for word-within-block
16 words / 4 words = 4 blocks in the cache => need 2 bits to index
6 bits – (2 + 2) bits = 2 bits remaining => need 2 bits for every tag
Cache contents after accesses 0, 1, 4, 5, 6, 10, 14:
Block 0: 0 1 2 3
Block 1: 4 5 6 7
Block 2: 8 9 10 11
Block 3: 12 13 14 15
Examples (continued)
• Access 32: miss, block replacement in block 0
Block 0: 32 33 34 35
Block 1: 4 5 6 7
Block 2: 8 9 10 11
Block 3: 12 13 14 15
Examples (continued)
• Access 4: hit, cache contents unchanged
Block 0: 32 33 34 35
Block 1: 4 5 6 7
Block 2: 8 9 10 11
Block 3: 12 13 14 15
Examples (continued)
• Access 16: miss, block replacement in block 0
Block 0: 16 17 18 19
Block 1: 4 5 6 7
Block 2: 8 9 10 11
Block 3: 12 13 14 15
Examples (continued)
• Access 23: miss, block replacement in block 1
Block 0: 16 17 18 19
Block 1: 20 21 22 23
Block 2: 8 9 10 11
Block 3: 12 13 14 15
Examples (continued)
• Access 24: miss, block replacement in block 2
Block 0: 16 17 18 19
Block 1: 20 21 22 23
Block 2: 24 25 26 27
Block 3: 12 13 14 15
Examples (continued)
• Access 35: miss, block replacement in block 0
Block 0: 32 33 34 35
Block 1: 20 21 22 23
Block 2: 24 25 26 27
Block 3: 12 13 14 15
Examples (continued)
• Access 12: hit, cache contents unchanged
Block 0: 32 33 34 35
Block 1: 20 21 22 23
Block 2: 24 25 26 27
Block 3: 12 13 14 15
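The direct-mapped walk-through can be replayed with a small simulation. This is a sketch under the slide's parameters (16 words, 4-word blocks, word addresses); the function name `simulate_direct_mapped` is an assumption made for illustration.

```python
def simulate_direct_mapped(trace, num_blocks=4, block_words=4):
    """Replay word addresses on a direct-mapped cache; return (hits, final tags)."""
    tags = [None] * num_blocks  # one stored tag per cache block, None = empty
    hits = []
    for addr in trace:
        block_number = addr // block_words   # drop the word-within-block bits
        index = block_number % num_blocks    # low bits of the block number
        tag = block_number // num_blocks     # remaining high bits
        if tags[index] == tag:
            hits.append(addr)
        else:
            tags[index] = tag                # miss: load the block, replacing the old one
    return hits, tags


trace = [0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12]
hits, final_tags = simulate_direct_mapped(trace)
print(hits)        # [1, 5, 6, 4, 12]
print(final_tags)  # [2, 1, 1, 0]
```

The final tags correspond to blocks 32-35, 20-23, 24-27, and 12-15, the same contents the last step of the walk-through shows.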
Examples
• Processor contains a 16-word, 4-way set-associative cache with a 1-word block size. Which of the following addresses will hit in the cache? (LRU replacement)
0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12
1-word block size => 0 bits for word-within-block
16 words / (4 ways x 1 word) = 4 sets in the cache => need 2 bits to index
6 bits – 2 bits = 4 bits remaining => need 4 bits for every tag
Cache contents after accesses 0 through 23:
Set 0: 0 4 32 16
Set 1: 1 5
Set 2: 6 10 14
Set 3: 23
Examples (continued)
• Access 24: miss, set 0 is full, so LRU replaces 0 with 24
Set 0: 24 4 32 16
Set 1: 1 5
Set 2: 6 10 14
Set 3: 23
Examples (continued)
• Access 35: miss, placed in a free way of set 3 (no replacement)
Set 0: 24 4 32 16
Set 1: 1 5
Set 2: 6 10 14
Set 3: 23 35
Examples (continued)
• Access 12: miss, set 0 is full, so LRU replaces 32 with 12
Set 0: 24 4 12 16
Set 1: 1 5
Set 2: 6 10 14
Set 3: 23 35
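The set-associative walk-through with LRU replacement can be replayed the same way. The sketch below assumes the slide's parameters (4 sets, 4 ways, 1-word blocks) and keeps whole addresses in each set instead of tags to stay short; the function name is hypothetical.

```python
from collections import OrderedDict


def simulate_set_associative(trace, num_sets=4, ways=4):
    """Replay word addresses on an LRU set-associative cache with 1-word blocks."""
    # Each set is ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = []
    for addr in trace:
        lines = sets[addr % num_sets]        # index = low bits of the address
        if addr in lines:
            hits.append(addr)
            lines.move_to_end(addr)          # mark as most recently used
        else:
            if len(lines) == ways:
                lines.popitem(last=False)    # evict the least recently used block
            lines[addr] = True
    return hits, [sorted(s) for s in sets]


trace = [0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12]
hits, final_sets = simulate_set_associative(trace)
print(hits)        # [4]
print(final_sets)  # [[4, 12, 16, 24], [1, 5], [6, 10, 14], [23, 35]]
```

For this trace the 4-way cache hits only on the second access to 4, because with 1-word blocks there is no spatial locality to exploit; the direct-mapped cache with 4-word blocks gets its extra hits from neighbouring words.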
Address Breakdown
• Physical address is 44 bits wide: 38-bit block address and 6-bit block offset
• Calculating cache index size:
$$2^{\text{index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}} = \frac{65{,}536}{64 \times 2} = 512 = 2^{9}$$
• Blocks are 64 bytes, so the offset needs 6 bits
• Tag size = 38 – 9 = 29 bits
How to Improve Cache Performance?
Four main categories of optimizations:
1. Reducing miss penalty
– multilevel caches, critical word first, giving read misses priority over writes, merging write buffers, and victim caches
2. Reducing miss rate
– larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations
3. Reducing the miss penalty or miss rate via parallelism
– non-blocking caches, hardware prefetching, and compiler prefetching
4. Reducing the time to hit in the cache
– small and simple caches, avoiding address translation, pipelined cache access
$\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}$
Reducing miss rate
• Way Prediction:
– Perform tag comparison with a single block in every set
» Fewer comparisons -> simpler hardware -> faster clock
• Pseudoassociative Caches:
– Access proceeds as in a direct-mapped cache for a hit
– On a miss, compare against a second entry for a match, where the second entry can be found fast
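A pseudoassociative lookup can be sketched as a direct-mapped cache with one extra probe. The slide does not say how the second entry is chosen; the sketch below assumes it is the slot whose index has the most significant bit flipped, and it swaps the two slots on a slow hit. The function name, the swap policy, and the sample trace are all assumptions.

```python
def simulate_pseudo_associative(trace, num_blocks=8):
    """Count fast hits, slow hits, and misses for a trace of block numbers."""
    slots = [None] * num_blocks
    msb = num_blocks >> 1  # assumed rule: flip the top index bit to find the second entry
    outcome = {"fast_hit": 0, "slow_hit": 0, "miss": 0}
    for block in trace:
        primary = block % num_blocks
        secondary = primary ^ msb
        if slots[primary] == block:
            outcome["fast_hit"] += 1         # same cost as a direct-mapped hit
        elif slots[secondary] == block:
            outcome["slow_hit"] += 1         # found on the second probe
            slots[primary], slots[secondary] = slots[secondary], slots[primary]
        else:
            outcome["miss"] += 1
            slots[secondary] = slots[primary]  # demote the old block to the second slot
            slots[primary] = block
    return outcome


# Blocks 0 and 8 map to the same primary slot and would keep evicting
# each other in a plain direct-mapped cache.
result = simulate_pseudo_associative([0, 8, 0, 0, 8])
print(result)  # {'fast_hit': 1, 'slow_hit': 2, 'miss': 2}
```

In a plain direct-mapped cache of the same size, this trace would miss four times out of five, so the second probe converts conflict misses into slower hits.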
Reducing miss rate
• Compiler Optimization:
– Loop Interchange & Blocking:
» Loop interchange: exchange the nesting of the loops to make the code access the data in the order it is stored
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];
becomes
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];
» Blocking: maximize the use of the data before replacing it
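The effect of loop interchange can be estimated by counting misses for the two traversal orders on a simple cache model. Everything beyond the 5000 x 100 array shape is an assumption made for this sketch: row-major storage starting at address 0, 8-byte elements, a 32 KB direct-mapped cache with 64-byte blocks, and one cache access per element update.

```python
def count_misses(addresses, cache_bytes=32 * 1024, block_bytes=64):
    """Count misses of a byte-address stream on a direct-mapped cache model."""
    num_blocks = cache_bytes // block_bytes
    resident = [None] * num_blocks  # block number currently held in each slot
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        index = block % num_blocks
        if resident[index] != block:
            resident[index] = block
            misses += 1
    return misses


ROWS, COLS, ELEM_BYTES = 5000, 100, 8  # x[5000][100]; element size is an assumption


def address(i, j):
    """Byte address of x[i][j] in row-major order, array assumed to start at 0."""
    return (i * COLS + j) * ELEM_BYTES


# Original nesting: j outer, i inner (strides through memory 800 bytes at a time)
column_order = (address(i, j) for j in range(COLS) for i in range(ROWS))
# Interchanged nesting: i outer, j inner (walks memory sequentially)
row_order = (address(i, j) for i in range(ROWS) for j in range(COLS))

misses_before = count_misses(column_order)
misses_after = count_misses(row_order)
print(misses_before, misses_after)
```

After the interchange the model misses once per 64-byte block (62,500 times for the 4,000,000-byte array), while the original order misses far more often because each block is evicted before its neighbouring elements are touched.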
Reducing miss Penalty
• Methods include:
1) Multi-level caches
L2 Equations:
$\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Miss Penalty}_{L1}$
$\text{Miss Penalty}_{L1} = \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2}$
$\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times (\text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2})$
Definitions:
– Local Miss Rate: misses in this cache divided by the total number of accesses to this cache ($\text{Miss Rate}_{L2}$)
– Global Miss Rate: misses in this cache divided by the total number of memory accesses generated by the CPU ($\text{Miss Rate}_{L2} \times \text{Miss Rate}_{L1}$)
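The two-level equations translate directly into code. The numbers below (1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, 100-cycle memory penalty) are made-up values for illustration only, and the function names are hypothetical.

```python
def amat_two_level(hit_time_l1, miss_rate_l1, hit_time_l2,
                   local_miss_rate_l2, miss_penalty_l2):
    """Average memory access time for an L1 + L2 hierarchy, in cycles."""
    miss_penalty_l1 = hit_time_l2 + local_miss_rate_l2 * miss_penalty_l2
    return hit_time_l1 + miss_rate_l1 * miss_penalty_l1


def global_miss_rate_l2(miss_rate_l1, local_miss_rate_l2):
    """Fraction of all CPU memory accesses that miss in both L1 and L2."""
    return miss_rate_l1 * local_miss_rate_l2


# Illustrative values, not taken from the lecture
amat = amat_two_level(hit_time_l1=1, miss_rate_l1=0.04, hit_time_l2=10,
                      local_miss_rate_l2=0.5, miss_penalty_l2=100)
global_l2 = global_miss_rate_l2(0.04, 0.5)
print(round(amat, 3), round(global_l2, 3))  # 3.4 0.02
```

A local L2 miss rate of 50% looks poor on its own, but the global rate shows that only 2% of CPU accesses go all the way to memory.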
Reducing miss Penalty
• Methods include:
2) Critical word first and early restart
– Don’t wait for the full block to be loaded before restarting the CPU
» Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
» Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
– Very useful with large blocks
– Spatial locality problem: we often want the next sequential word soon, so early restart is not always a benefit
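The difference between the policies can be seen with a toy timing model. The model is an assumption made for this sketch: the first word of a block arrives after `first_word_latency` cycles and each further word arrives `per_word` cycles later; the default values are arbitrary.

```python
def restart_cycle(policy, requested_word, block_words=8,
                  first_word_latency=40, per_word=4):
    """Cycle at which the CPU can resume after a miss, under a toy memory model."""
    if not 0 <= requested_word < block_words:
        raise ValueError("requested_word must lie inside the block")
    if policy == "wait_for_block":
        # Resume only after the last word of the block has arrived.
        return first_word_latency + (block_words - 1) * per_word
    if policy == "early_restart":
        # Words arrive in order; resume when the requested one shows up.
        return first_word_latency + requested_word * per_word
    if policy == "critical_word_first":
        # The requested word is fetched first.
        return first_word_latency
    raise ValueError(f"unknown policy: {policy}")


policies = ("wait_for_block", "early_restart", "critical_word_first")
cycles = {p: restart_cycle(p, requested_word=6) for p in policies}
print(cycles)  # {'wait_for_block': 68, 'early_restart': 64, 'critical_word_first': 40}
```

With the requested word near the end of the block, early restart saves little, while critical word first removes the transfer time of the other words entirely; the gap grows with the block size.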
Reducing miss Penalty
• Methods include:
3) Prioritize read misses over writes
» Write buffers can create RAW conflicts with main memory reads on cache misses
» Simply waiting for the write buffer to empty might increase the read miss penalty by 50%
» Check the write buffer contents before the read: if there is no conflict, let the memory access continue
» Write back?
• Read miss may require write of dirty blocks
• Normal: write the dirty block to memory and then do the read
• Instead, copy the dirty block to the write buffer, do the read, and then do the write
• CPU stalls less since it can restart as soon as the read completes
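The write-buffer check can be sketched as follows. The class and function names are hypothetical, memory is modelled as a plain dictionary, and the handling of a conflict (drain the buffer, then read) is one simple choice assumed here; the slide only specifies the no-conflict case.

```python
class WriteBuffer:
    """Writes waiting to reach main memory, oldest first (hypothetical model)."""

    def __init__(self):
        self.pending = []  # list of (address, data)

    def add(self, address, data):
        self.pending.append((address, data))

    def conflicts_with(self, address):
        """True if a buffered write targets the address being read (RAW hazard)."""
        return any(a == address for a, _ in self.pending)

    def drain(self, memory):
        """Retire every buffered write to memory in order; return how many."""
        count = len(self.pending)
        for address, data in self.pending:
            memory[address] = data
        self.pending.clear()
        return count


def read_miss(address, memory, write_buffer):
    """Service a read miss; return (data, number of writes the read waited for)."""
    waited = 0
    if write_buffer.conflicts_with(address):
        waited = write_buffer.drain(memory)  # assumed policy on a conflict
    return memory[address], waited


memory = {0x100: "old A", 0x200: "old B"}
wbuf = WriteBuffer()
wbuf.add(0x100, "new A")
no_conflict = read_miss(0x200, memory, wbuf)  # bypasses the pending write
conflict = read_miss(0x100, memory, wbuf)     # must observe "new A"
print(no_conflict, conflict)  # ('old B', 0) ('new A', 1)
```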
Reducing miss Penalty
4) Merging the write buffer
– CPU stalls if the write-back buffer is full
– The buffer may contain an entry matching the address written to
– If so, the writes are merged
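A merging write buffer can be modelled as a small table keyed by block address. The sketch below assumes a 4-entry buffer whose entries each cover 4 consecutive words; the class name, sizes, and addresses are illustrative assumptions.

```python
class MergingWriteBuffer:
    """Write buffer that merges writes to the same block (hypothetical model)."""

    def __init__(self, entries=4, words_per_entry=4):
        self.capacity = entries
        self.words_per_entry = words_per_entry
        self.entries = {}  # block number -> {word offset: data}

    def write(self, word_address, data):
        """Buffer one word; return False if the CPU would have to stall."""
        block, offset = divmod(word_address, self.words_per_entry)
        if block in self.entries:
            self.entries[block][offset] = data      # merge into the matching entry
            return True
        if len(self.entries) < self.capacity:
            self.entries[block] = {offset: data}    # allocate a new entry
            return True
        return False                                # buffer full: stall


wb = MergingWriteBuffer()
writes = (100, 101, 102, 103, 200, 300, 400)
accepted = [wb.write(addr, f"v{addr}") for addr in writes]
entries_used = len(wb.entries)
overflow = wb.write(500, "v500")   # new block while the buffer is full
merged = wb.write(401, "v401")     # same block as address 400
print(entries_used, overflow, merged)  # 4 False True
```

The four sequential writes to addresses 100 through 103 share one entry, so seven writes fit in four entries; without merging, the fifth write would already have stalled the CPU.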
Reducing miss Penalty
5) Victim caches:
• How to get the hit time of direct-mapped yet still avoid conflict misses?
• Add buffer to place data discarded from the cache
• Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4KB direct mapped data cache
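The idea can be illustrated with a direct-mapped cache backed by a small fully associative buffer of evicted blocks. This is a sketch and not Jouppi's design in detail: the function name, the FIFO replacement inside the victim buffer, the cache size, and the trace are assumptions chosen for the example.

```python
from collections import OrderedDict


def simulate_victim_cache(trace, num_blocks=8, victim_entries=4):
    """Count hits, victim-cache hits, and misses for a trace of block numbers."""
    main = [None] * num_blocks  # direct-mapped cache
    victim = OrderedDict()      # recently evicted blocks, oldest first
    counts = {"hit": 0, "victim_hit": 0, "miss": 0}
    for block in trace:
        index = block % num_blocks
        if main[index] == block:
            counts["hit"] += 1
            continue
        if block in victim:
            counts["victim_hit"] += 1   # found among the discarded blocks
            del victim[block]
        else:
            counts["miss"] += 1         # must go to the next level
        evicted = main[index]
        main[index] = block
        if evicted is not None and victim_entries > 0:
            victim[evicted] = True      # keep the discarded block nearby
            if len(victim) > victim_entries:
                victim.popitem(last=False)
    return counts


# Blocks 0 and 8 conflict in the direct-mapped cache and alternate.
trace = [0, 8] * 4
with_victim = simulate_victim_cache(trace, victim_entries=4)
without_victim = simulate_victim_cache(trace, victim_entries=0)
print(with_victim)     # {'hit': 0, 'victim_hit': 6, 'miss': 2}
print(without_victim)  # {'hit': 0, 'victim_hit': 0, 'miss': 8}
```

On this conflict-heavy trace the victim buffer turns all but the two compulsory misses into victim hits, while the main cache keeps the short hit time of a direct-mapped lookup.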