EENG449b/Savvides Lec 18.1 4/13/04 April 13, 2004 Prof. Andreas Savvides Spring 2004 EENG 449bG/CPSC 439bG Computer.


April 13, 2004

Prof. Andreas Savvides

Spring 2004

http://www.eng.yale.edu/courses/eeng449bG

EENG 449bG/CPSC 439bG Computer Systems

Lecture 18

Memory Hierarchy Design Part II



Q1: Where can a Block Be Placed in a Cache?



Set Associativity

• Direct mapped = one-way set associative

• Fully associative = set associative with 1 set

• Most popular cache configurations in today’s processors

– Direct mapped, 2-way set associative, 4-way set associative



Examples

• 32 KB cache for a byte addressable processor, 32-bit address space. Which bits of the address are used for the tag, index and byte-within-block for the following configuration:

a) 8-byte block size, direct mapped

8-byte block size => 3 bits for byte-within-block

32 KB / 8 B = 4 K blocks in the cache => need 12 bits to index

32 bits – (12 + 3) bits = 17 bits remaining => need 17 bits for every tag

Address layout: tag = bits 31–15, index = bits 14–3, byte-within-block = bits 2–0



Examples

• 32 KB cache for a byte addressable processor, 32-bit address space. Which bits of the address are used for the tag, index and byte-within-block for the following configuration:

a) 4-byte block size, 8-way set associative

4-byte block size => 2 bits for byte-within-block

32 KB / (4 B x 8) = 1 K sets in the cache => need 10 bits to index

32 bits – (10 + 2) bits = 20 bits remaining => need 20 bits for every tag

Address layout: tag = bits 31–12, index = bits 11–2, byte-within-block = bits 1–0
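The arithmetic in the two examples above can be checked with a short sketch. The function name `address_fields` and its parameters are illustrative, not from the lecture, and the sketch assumes that cache size, block size and associativity are all powers of two.

```python
def address_fields(cache_bytes, block_bytes, ways, address_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a set-associative cache.

    Assumes every size is a power of two (hypothetical helper for illustration).
    """
    offset_bits = block_bytes.bit_length() - 1      # log2(block size)
    num_sets = cache_bytes // (block_bytes * ways)  # blocks are grouped into sets
    index_bits = num_sets.bit_length() - 1          # log2(number of sets)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# 32 KB cache, 8-byte blocks, direct mapped (1 way)
print(address_fields(32 * 1024, 8, 1))   # (17, 12, 3)
# 32 KB cache, 4-byte blocks, 8-way set associative
print(address_fields(32 * 1024, 4, 8))   # (20, 10, 2)
```

A direct-mapped cache is simply the `ways=1` case, so the same formula covers both configurations.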



Q2: How is a block found if it is in the cache?

The address is divided into three fields:

– Tag: compared against for a hit
– Index: selects the set
– Block offset: selects the desired data from the block

• If the cache size remains the same, increasing associativity increases the number of blocks per set => the index shrinks and the tag grows
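As a minimal sketch of how the three fields are extracted, the helper below splits an address with shifts and masks. The name `split_address` and the sample address are assumptions chosen for illustration; the field widths come from the 32 KB, 8-byte-block, direct-mapped example earlier in the lecture.

```python
def split_address(addr, index_bits, offset_bits):
    """Split an address into (tag, index, block offset) using shifts and masks."""
    offset = addr & ((1 << offset_bits) - 1)                 # lowest bits
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # middle bits
    tag = addr >> (offset_bits + index_bits)                 # remaining high bits
    return tag, index, offset


addr = 0x12345678  # arbitrary sample address
tag, index, offset = split_address(addr, index_bits=12, offset_bits=3)
print(hex(tag), hex(index), offset)   # 0x2468 0xacf 0

# Doubling the associativity at the same cache size halves the number of sets:
# one bit moves from the index to the tag.
tag2, index2, offset2 = split_address(addr, index_bits=11, offset_bits=3)
print(tag2.bit_length() <= 18, index2 < 2 ** 11)
```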



Examples

• Processor contains a 16-word, direct mapped cache with a 4-word block size. Which of the following addresses will hit in the cache?

0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12

4-word block size => 2 bits for word-within-block

16 words / (4 words) = 4 blocks in the cache => need 2 bits to index

6 bits – (2 + 2) bits = 2 bits remaining => need 2 bits for every tag

Cache contents after each step (the four cache blocks are listed in index order, 0 to 3):

After 0, 1, 4, 5, 6, 10, 14: [0 1 2 3] [4 5 6 7] [8 9 10 11] [12 13 14 15]

After 32 (block replacement): [32 33 34 35] [4 5 6 7] [8 9 10 11] [12 13 14 15]

After 4 (hit, no replacement): [32 33 34 35] [4 5 6 7] [8 9 10 11] [12 13 14 15]

After 16 (block replacement): [16 17 18 19] [4 5 6 7] [8 9 10 11] [12 13 14 15]

After 23 (block replacement): [16 17 18 19] [20 21 22 23] [8 9 10 11] [12 13 14 15]

After 24 (block replacement): [16 17 18 19] [20 21 22 23] [24 25 26 27] [12 13 14 15]

After 35 (block replacement): [32 33 34 35] [20 21 22 23] [24 25 26 27] [12 13 14 15]

After 12 (hit, no replacement): [32 33 34 35] [20 21 22 23] [24 25 26 27] [12 13 14 15]

Hits: 1, 5, 6, the second access to 4, and 12. All other accesses miss.
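The direct-mapped walkthrough above can be reproduced with a small simulation. This is a sketch under the example's parameters (4 cache blocks of 4 words each, word addresses); the function name `simulate_direct_mapped` is illustrative and not part of the lecture.

```python
def simulate_direct_mapped(addresses, num_blocks=4, block_words=4):
    """Return a list of (address, 'hit' or 'miss') for a direct-mapped cache."""
    tags = [None] * num_blocks          # one tag per cache block, initially empty
    results = []
    for addr in addresses:
        block = addr // block_words     # drop the word-within-block bits
        index = block % num_blocks      # low bits of the block address
        tag = block // num_blocks       # remaining high bits
        hit = tags[index] == tag
        if not hit:
            tags[index] = tag           # fetch the block, replacing the old one
        results.append((addr, "hit" if hit else "miss"))
    return results


trace = [0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12]
for addr, outcome in simulate_direct_mapped(trace):
    print(addr, outcome)
```

Running it marks 1, 5, 6, the second 4, and 12 as hits, which matches the cache states listed above.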



Examples

• Processor contains a 16-word, 4-way set associative cache with a 1-word block size. Which of the following addresses will hit in the cache? (LRU replacement)

0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12

1-word block size => 0 bits for word-within-block

16 words / (4 ways x 1 word) = 4 sets in the cache => need 2 bits to index

6 bits – (2 + 0) bits = 4 bits remaining => need 4 bits for every tag

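Set contents after each step (LRU replacement):

After 0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23:
Set 0: 0, 4, 32, 16
Set 1: 1, 5
Set 2: 6, 10, 14
Set 3: 23

After 24 (block replacement: set 0 is full and 0 is its least recently used block):
Set 0: 24, 4, 32, 16
Set 1: 1, 5
Set 2: 6, 10, 14
Set 3: 23

After 35 (no replacement: set 3 has a free entry):
Set 0: 24, 4, 32, 16
Set 1: 1, 5
Set 2: 6, 10, 14
Set 3: 23, 35

After 12 (block replacement: 32 is now the least recently used block in set 0):
Set 0: 24, 4, 12, 16
Set 1: 1, 5
Set 2: 6, 10, 14
Set 3: 23, 35

The only hit is the second access to 4.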
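The LRU walkthrough above can be checked with a short simulation. This is a sketch for the example's configuration (4 sets, 4 ways, 1-word blocks, so the block address equals the word address); `simulate_lru` is an illustrative name, and `OrderedDict` is used here only as a convenient way to track recency.

```python
from collections import OrderedDict


def simulate_lru(addresses, num_sets=4, ways=4):
    """Return (address, outcome, evicted_block) tuples for a set-associative LRU cache."""
    # Each set maps block address -> True, ordered from least to most recently used.
    sets = [OrderedDict() for _ in range(num_sets)]
    results = []
    for addr in addresses:
        lines = sets[addr % num_sets]
        if addr in lines:
            lines.move_to_end(addr)                      # mark as most recently used
            results.append((addr, "hit", None))
        else:
            evicted = None
            if len(lines) == ways:                       # set is full: evict the LRU block
                evicted, _ = lines.popitem(last=False)
            lines[addr] = True
            results.append((addr, "miss", evicted))
    return results


trace = [0, 1, 4, 5, 6, 10, 14, 32, 4, 16, 23, 24, 35, 12]
for addr, outcome, evicted in simulate_lru(trace):
    note = f" (replaces {evicted})" if evicted is not None else ""
    print(addr, outcome + note)
```

The output shows a single hit (the second access to 4) and two replacements: 24 replaces 0, and 12 replaces 32.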



Address Breakdown

• Physical address is 44 bits wide: 38-bit block address and 6-bit offset

• Calculating cache index size:

$$2^{\text{Index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}} = \frac{65{,}536}{64 \times 2} = 512 = 2^{9}$$

• Blocks are 64 bytes, so the offset needs 6 bits

• Tag size = 38 – 9 = 29 bits



How to Improve Cache Performance?

Four main categories of optimizations

1. Reducing miss penalty
- multilevel caches, critical word first, giving read misses priority over writes, merging write buffers and victim caches

2. Reducing miss rate
- larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations

3. Reducing the miss penalty or miss rate via parallelism
- non-blocking caches, hardware prefetching and compiler prefetching

4. Reducing the time to hit in the cache
- small and simple caches, avoiding address translation, pipelined cache access

$$\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}$$

Last week Today



Reducing miss rate

• Way Prediction:
– Perform tag comparison with a single block in every set
» Fewer comparisons -> simpler hardware -> faster clock

• Pseudoassociative Caches:
– Access proceeds as in a direct-mapped cache for a hit
– On a miss, compare against a second entry for a match, where the second entry can be found fast
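A minimal sketch of a pseudoassociative lookup follows. The slide does not say how the second entry is located; this sketch assumes one common choice, flipping the most significant bit of the index. The function name and the choice to store the full block address in each line are also assumptions made for illustration.

```python
def pseudo_associative_lookup(lines, block_addr, index_bits):
    """Classify an access as 'fast hit', 'slow hit' or 'miss'.

    lines[i] holds the block address currently stored in line i (or None).
    """
    index = block_addr & ((1 << index_bits) - 1)
    if lines[index] == block_addr:
        return "fast hit"                    # same speed as a direct-mapped hit
    # Assumed rule for finding the second entry: flip the top index bit.
    alternate = index ^ (1 << (index_bits - 1))
    if lines[alternate] == block_addr:
        return "slow hit"                    # found on the second probe
    return "miss"


lines = [None] * 8       # 8 lines, so 3 index bits
lines[1] = 1             # block 1 sits in its direct-mapped location
lines[5] = 9             # block 9 (index 1) sits in the alternate location
print(pseudo_associative_lookup(lines, 1, 3))    # fast hit
print(pseudo_associative_lookup(lines, 9, 3))    # slow hit
print(pseudo_associative_lookup(lines, 17, 3))   # miss
```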



Reducing miss rate

• Compiler Optimization:
– Loop Interchange & Blocking:
» Loop interchange: exchange the nesting of the loops to make the code access the data in the order it is stored

For (j=0 -> 100)
    For (i=0 -> 5000)
        x[i][j] = 2 * x[i][j]

Becomes

For (i=0 -> 5000)
    For (j=0 -> 100)
        x[i][j] = 2 * x[i][j]

» Blocking: maximize the use of the data before replacing it
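To make the effect of loop interchange concrete, the sketch below counts misses for both loop orders on a toy direct-mapped cache. Everything about the model is an assumption for illustration: the array is stored row by row (as in C), one element occupies one cache word, blocks hold 8 elements, and the cache has 512 blocks. The name `count_misses` is hypothetical.

```python
def count_misses(index_pairs, cols=100, block_elems=8, num_blocks=512):
    """Count misses in a toy direct-mapped cache for a sequence of x[i][j] accesses."""
    stored = [None] * num_blocks            # block number held by each cache line
    misses = 0
    for i, j in index_pairs:
        block = (i * cols + j) // block_elems   # row-major element address -> block
        slot = block % num_blocks
        if stored[slot] != block:
            stored[slot] = block
            misses += 1
    return misses


ROWS, COLS = 5000, 100
# Original nesting: j outer, i inner (strides through memory 100 elements at a time).
column_misses = count_misses((i, j) for j in range(COLS) for i in range(ROWS))
# Interchanged nesting: i outer, j inner (walks memory sequentially).
row_misses = count_misses((i, j) for i in range(ROWS) for j in range(COLS))
print("j outer, i inner:", column_misses)
print("i outer, j inner:", row_misses)
```

In this model the interchanged loop misses once per block (500,000 / 8 accesses), while the original nesting misses far more often because consecutive accesses land in different blocks.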



Reducing miss Penalty

• Methods include:

1) Multi-level caches

L2 Equations:

$$\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Miss Penalty}_{L1}$$

$$\text{Miss Penalty}_{L1} = \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2}$$

$$\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times (\text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2})$$

Definitions:

– Local Miss Rate: misses in this cache divided by the total number of accesses to this cache ($\text{Miss Rate}_{L2}$)

– Global Miss Rate: misses in this cache divided by the total number of memory accesses generated by the CPU ($\text{Miss Rate}_{L1} \times \text{Miss Rate}_{L2}$)
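The two-level equations can be evaluated directly. The sketch below uses made-up numbers (1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, 100-cycle L2 miss penalty) purely for illustration; the function names are not from the lecture.

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, miss_penalty_l2):
    """Average memory access time for a two-level cache, in the units of the inputs."""
    miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1


def global_miss_rate_l2(miss_rate_l1, local_miss_rate_l2):
    """Fraction of all CPU memory accesses that miss in both L1 and L2."""
    return miss_rate_l1 * local_miss_rate_l2


# Hypothetical parameters, in clock cycles and fractions.
print(amat_two_level(1, 0.04, 10, 0.5, 100))
print(global_miss_rate_l2(0.04, 0.5))
```

With these numbers the L1 miss penalty is 60 cycles and the AMAT is roughly 3.4 cycles. The local L2 miss rate of 50% looks poor, yet only about 2% of all CPU accesses go to main memory, which is why the global miss rate is the more useful figure for the second level.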




2) Critical word first and early restart

– Don’t wait for the full block to be loaded before restarting the CPU
» Early restart – as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
» Critical word first – request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block

– Very useful with large blocks
– Spatial locality problem: we often want the next sequential word soon, so early restart is not always a benefit
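A rough way to compare the three policies is to count how long the CPU waits for the requested word. The sketch below uses a simplified memory model that is an assumption, not something stated in the lecture: the first word of a block arrives after `first_word_latency` cycles and each following word arrives `cycles_per_word` cycles later. All names and numbers are hypothetical.

```python
def stall_cycles(requested_word, block_words=8, first_word_latency=40, cycles_per_word=4):
    """Cycles the CPU waits for word `requested_word` of a missing block.

    Simplified model: words stream in one after another once the first arrives.
    """
    return {
        # Wait until the last word of the block has been loaded.
        "wait for full block": first_word_latency + (block_words - 1) * cycles_per_word,
        # Block is fetched in order from word 0; restart when the requested word arrives.
        "early restart": first_word_latency + requested_word * cycles_per_word,
        # The requested word is fetched first.
        "critical word first": first_word_latency,
    }


for policy, cycles in stall_cycles(requested_word=5).items():
    print(policy, cycles)
```

In this model the savings grow with the block size, which matches the remark that the techniques are most useful with large blocks. The model ignores the follow-up stall when the CPU immediately asks for the next sequential word.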




3) Prioritize read misses over writes

» Write buffers can create RAW (read-after-write) conflicts with main memory reads on cache misses
» Simply waiting for the write buffer to empty might increase the read miss penalty by 50%
» Check the write buffer contents before the read: if there is no conflict, let the memory access continue
» Write back?
• A read miss may require writing back a dirty block
• Normal: write the dirty block to memory and then do the read
• Instead: copy the dirty block to the write buffer, do the read and then do the write
• The CPU stalls less since it can restart as soon as the read completes



Reducing miss Penalty

4) Merging the write buffer

– CPU stalls if the write-back buffer is full
– The buffer may contain an entry matching the address written to
– If so, the writes are merged
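The merging behaviour can be sketched with a small model. The class below is a hypothetical illustration: it assumes each buffer entry covers one aligned block of `block_words` words and that a write to a word in an already-buffered block is merged into that entry. Draining entries to memory is left out.

```python
class MergingWriteBuffer:
    """Toy write buffer that merges writes to the same block into one entry."""

    def __init__(self, num_entries=4, block_words=4):
        self.num_entries = num_entries
        self.block_words = block_words
        self.entries = {}   # block address -> {word offset: value}

    def write(self, word_addr, value):
        """Buffer a write. Return False if the buffer is full and the CPU would stall."""
        block, offset = divmod(word_addr, self.block_words)
        if block in self.entries:
            self.entries[block][offset] = value    # merge into the matching entry
            return True
        if len(self.entries) == self.num_entries:
            return False                           # no free entry and nothing to merge with
        self.entries[block] = {offset: value}
        return True


buf = MergingWriteBuffer(num_entries=2)
for addr in (100, 101, 102, 103):     # four sequential words in the same block
    buf.write(addr, addr * 10)
print(len(buf.entries))               # 1: the four writes share a single entry
```

Without merging, the four sequential writes would occupy four entries, so the buffer would fill (and stall the CPU) much sooner.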



Reducing miss Penalty

5) Victim caches:

• How to get the hit time of direct-mapped yet still avoid conflict misses?

• Add a buffer to hold data discarded from the cache

• Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4KB direct mapped data cache
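The idea can be illustrated with a toy simulation of a direct-mapped cache backed by a small victim buffer. This is a sketch under assumptions that the slide does not specify: accesses are given as block addresses, the victim buffer is fully associative with oldest-first replacement, and a victim hit swaps the block back into the main cache. The function name is hypothetical.

```python
from collections import OrderedDict


def misses_with_victim_cache(block_addresses, num_blocks=4, victim_entries=0):
    """Count accesses that must go to the next memory level."""
    cache = [None] * num_blocks
    victims = OrderedDict()              # recently discarded blocks, oldest first
    misses = 0
    for block in block_addresses:
        slot = block % num_blocks
        if cache[slot] == block:
            continue                     # ordinary direct-mapped hit
        if block in victims:
            del victims[block]           # victim hit: swap it back into the cache
        else:
            misses += 1                  # fetch from the next level
        evicted = cache[slot]
        cache[slot] = block
        if evicted is not None and victim_entries > 0:
            victims[evicted] = True      # keep the discarded block around
            if len(victims) > victim_entries:
                victims.popitem(last=False)
    return misses


pattern = [0, 4] * 5                     # two blocks that conflict in line 0
print(misses_with_victim_cache(pattern))                     # 10
print(misses_with_victim_cache(pattern, victim_entries=4))   # 2
```

In this ping-pong pattern every access is a conflict miss without the victim buffer, while with it only the first reference to each block reaches the next level.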
