Computer Architecture Chapter 5 Memory Hierarchy Design
Page 1: Cache Memory

Computer Architecture, Chapter 5

Memory Hierarchy Design

Page 2: Cache Memory

Chapter Overview

5.1 Introduction
5.2 The ABCs of Caches
5.3 Reducing Cache Misses
5.4 Reducing Cache Miss Penalty
5.5 Reducing Hit Time
5.6 Main Memory
5.7 Virtual Memory
5.8 Protection and Examples of Virtual Memory

Page 3: Cache Memory

Introduction


The Big Picture: Where are We Now?

The Five Classic Components of a Computer: Control, Datapath, Memory, Input, Output (the processor comprises the control and the datapath).

Topics in this chapter:
SRAM memory technology
DRAM memory technology
Memory organization

Page 4: Cache Memory

Levels of the Memory Hierarchy

Level        Capacity    Access time            Cost                      Staging/xfer unit              Managed by
Registers    100s bytes  ~1 ns                  —                         1-8 bytes (instr. operands)    program/compiler
Cache        K bytes     ~4 ns                  1-0.1 cents/bit           8-128 bytes (blocks)           cache controller
Main memory  M bytes     100-300 ns             0.0001-0.00001 cents/bit  512-4K bytes (pages)           OS
Disk         G bytes     10 ms (10,000,000 ns)  10^-5 to 10^-6 cents/bit  M bytes (files)                user/operator
Tape         infinite    sec-min                10^-8 cents/bit           —                              —

Moving from the upper level to the lower levels of the hierarchy: the upper levels are faster, the lower levels are larger.

Page 5: Cache Memory

The ABCs of Caches
In this section we will:

Learn lots of definitions about caches: you can't talk about something until you understand it (this is true in computer science at least!)

Answer some fundamental questions about caches:

Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)


Page 6: Cache Memory

Cache Memory
The purpose of cache memory is to speed up accesses by storing recently used data closer to the CPU instead of in main memory. Although a cache is much smaller than main memory, its access time is a fraction of that of main memory. Unlike main memory, which is accessed by address, a cache is typically searched by content; hence it is often called content addressable memory. Because of this, a single large cache memory isn't always desirable: it takes longer to search.

Page 7: Cache Memory

Cache
Small amount of fast memory
Sits between normal main memory and the CPU
May be located on the CPU chip or module

Page 8: Cache Memory

Cache/Main Memory Structure

Page 9: Cache Memory

Cache operation – overview
CPU requests the contents of a memory location
Check the cache for this data
If present, get it from the cache (fast)
If not present, read the required block from main memory into the cache
Then deliver it from the cache to the CPU
The cache includes tags to identify which block of main memory is in each cache slot
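The read flow above can be sketched as a small Python model (the 16-byte blocks, 4-slot direct-mapped placement, and toy main memory are illustrative assumptions, not the slides' parameters):

```python
# Minimal model of the read flow: check the cache by tag,
# fetch the block from "main memory" on a miss, then serve the CPU.
BLOCK_SIZE = 16  # bytes per block (assumed for illustration)

cache = {}                       # slot -> (tag, block_data)
main_memory = bytes(range(256))  # toy main memory

def read(addr, num_slots=4):
    block_num = addr // BLOCK_SIZE
    slot = block_num % num_slots     # which cache slot the block may occupy
    tag = block_num // num_slots     # identifies which memory block is there
    entry = cache.get(slot)
    if entry is not None and entry[0] == tag:
        outcome = "hit"
    else:
        # Miss: read the whole block from main memory into the cache slot.
        start = block_num * BLOCK_SIZE
        cache[slot] = (tag, main_memory[start:start + BLOCK_SIZE])
        outcome = "miss"
    # Deliver the requested byte from the cache to the CPU.
    return cache[slot][1][addr % BLOCK_SIZE], outcome

print(read(0x12))   # first access to this block: miss
print(read(0x13))   # same block: hit
```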

Page 10: Cache Memory

Cache Read Operation - Flowchart

Page 11: Cache Memory

Comparison of Cache Sizes

Processor        Type                            Year  L1 cache(a)     L2 cache        L3 cache
IBM 360/85       Mainframe                       1968  16 to 32 KB     —               —
PDP-11/70        Minicomputer                    1975  1 KB            —               —
VAX 11/780       Minicomputer                    1978  16 KB           —               —
IBM 3033         Mainframe                       1978  64 KB           —               —
IBM 3090         Mainframe                       1985  128 to 256 KB   —               —
Intel 80486      PC                              1989  8 KB            —               —
Pentium          PC                              1993  8 KB/8 KB       256 to 512 KB   —
PowerPC 601      PC                              1993  32 KB           —               —
PowerPC 620      PC                              1996  32 KB/32 KB     —               —
PowerPC G4       PC/server                       1999  32 KB/32 KB     256 KB to 1 MB  2 MB
IBM S/390 G4     Mainframe                       1997  32 KB           256 KB          2 MB
IBM S/390 G6     Mainframe                       1999  256 KB          8 MB            —
Pentium 4        PC/server                       2000  8 KB/8 KB       256 KB          —
IBM SP           High-end server/supercomputer   2000  64 KB/32 KB     8 MB            —
CRAY MTA(b)      Supercomputer                   2000  8 KB            2 MB            —
Itanium          PC/server                       2001  16 KB/16 KB     96 KB           4 MB
SGI Origin 2001  High-end server                 2001  32 KB/32 KB     4 MB            —
Itanium 2        PC/server                       2002  32 KB           256 KB          6 MB
IBM POWER5       High-end server                 2003  64 KB           1.9 MB          36 MB
CRAY XD-1        Supercomputer                   2004  64 KB/64 KB     1 MB            —

(a) Two values separated by a slash refer to the instruction and data caches.
(b) Both caches are instruction only; no data caches.

Page 12: Cache Memory

The Principle of Locality

The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.

Three different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array accesses)
Sequential locality: program execution proceeds in sequential order, except at branch instructions

The ABCs of Caches Definitions

Page 13: Cache Memory

A few terms

Inclusion property
Coherence property
Access frequency
Access time
Cycle time
Latency
Bandwidth
Capacity
Unit of transfer

Page 14: Cache Memory

Memory Hierarchy: Terminology

Hit: the data appears in some block in the upper level (example: Block X)
Hit rate: the fraction of memory accesses found in the upper level
Hit time: time to access the upper level, which consists of the upper-level access time + the time to determine hit/miss

Miss: the data must be retrieved from a block in the lower level (Block Y)
Miss rate = 1 - (Hit rate)
Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor

Consider a memory with three levels. The average memory access time (assuming a hit at the 3rd level) is

h1 * t1 + (1 - h1) * [t1 + h2 * t2 + (1 - h2) * (t2 + t3)]

where t1, t2, and t3 are the access times at the three levels and h1, h2 are the hit rates.

Access frequency of level Mi: fi = (1 - h1)(1 - h2)...(1 - h(i-1)) * hi

Effective access time = Σ (fi * ti)
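A quick numeric check of the three-level formula (the hit rates and times below are made-up illustrative values; note that the Σ fi·ti form agrees with the nested formula when ti is taken as the cumulative time to reach level i):

```python
# Three-level average access time, assuming a hit at the third level.
def avg_access_time(h1, h2, t1, t2, t3):
    return h1 * t1 + (1 - h1) * (t1 + h2 * t2 + (1 - h2) * (t2 + t3))

# Access frequency of level i: f_i = (1-h1)...(1-h_{i-1}) * h_i
def access_frequencies(hit_rates):
    freqs, miss_so_far = [], 1.0
    for h in hit_rates:
        freqs.append(miss_so_far * h)
        miss_so_far *= (1 - h)
    return freqs

t1, t2, t3 = 1.0, 10.0, 100.0   # illustrative access times in ns
h1, h2 = 0.9, 0.8               # illustrative hit rates
print(avg_access_time(h1, h2, t1, t2, t3))

# Effective access time via frequencies, with cumulative level times.
f = access_frequencies([h1, h2, 1.0])   # the last level always hits
eff = f[0] * t1 + f[1] * (t1 + t2) + f[2] * (t1 + t2 + t3)
print(eff)   # matches the nested formula
```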


Page 15: Cache Memory

Cache Measures

Hit rate: the fraction of accesses found in that level. It is usually so high that we talk about the miss rate instead.

Average memory-access time = Hit time + Miss rate * Miss penalty (in ns or clocks)

Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
  access time: time to reach the lower level = f(latency to lower level)
  transfer time: time to transfer the block = f(bandwidth between upper and lower levels)


Page 16: Cache Memory

Measures

CPU execution time = (CPU clock cycles + Memory stall cycles) * Clock cycle time

CPU clock cycles include cache hits; the CPU is stalled during a miss.

Memory stall cycles = Number of misses * Miss penalty
                    = IC * (Misses / Instruction) * Miss penalty
                    = IC * (Memory accesses / Instruction) * Miss rate * Miss penalty

Miss rates and miss penalties are different for reads and writes:

Memory stall cycles = IC * (Reads / Instruction) * Read miss rate * Read miss penalty
                    + IC * (Writes / Instruction) * Write miss rate * Write miss penalty

Misses / Instruction = (Miss rate * Memory accesses) / Instruction count
                     = Miss rate * (Memory accesses / Instruction)
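A sketch with assumed numbers (the instruction count, CPI, accesses per instruction, miss rate, and miss penalty below are illustrative, not from the slides):

```python
# CPU time = (CPU clock cycles + memory stall cycles) * clock cycle time,
# with memory stall cycles = IC * accesses/instr * miss rate * miss penalty.
IC = 1_000_000            # instruction count (assumed)
cpi_base = 1.0            # CPU cycles per instruction, all hits (assumed)
accesses_per_instr = 1.3  # memory accesses per instruction (assumed)
miss_rate = 0.02
miss_penalty = 100        # cycles
cycle_time_ns = 0.5

cpu_cycles = IC * cpi_base
stall_cycles = IC * accesses_per_instr * miss_rate * miss_penalty
exec_time_ns = (cpu_cycles + stall_cycles) * cycle_time_ns
print(round(stall_cycles))   # stall cycles dominate here
print(round(exec_time_ns))
```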

Page 17: Cache Memory

Typical Cache Organization

Page 18: Cache Memory

Simplest Cache: Direct Mapped

Memory: 16 locations (addresses 0 through F). Cache: 4 bytes, direct mapped, with cache indexes 0 through 3.

Cache location 0 can be occupied by data from memory location 0, 4, 8, ... etc.
In general: any memory location whose 2 LSBs of the address are 0s
Address<1:0> => cache index

Which one should we place in the cache?
How can we tell which one is in the cache?


Page 19: Cache Memory

Where can a block be placed in the cache?

Block 12 placed in an 8-block cache: fully associative, direct mapped, or 2-way set associative.
Set-associative mapping: set = block number modulo number of sets.

Direct mapped: block 12 can go only into cache block 4 (12 mod 8)
2-way set associative: block 12 can go anywhere in set 0 (12 mod 4); the 8 cache blocks form sets 0-3, two blocks per set
Fully associative: block 12 can go anywhere in the cache

(In the figure, the block-frame address space of main memory runs from block 0 to block 31.)
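The three placements of block 12 can be checked with a small helper (the convention that set s occupies cache blocks s*assoc through s*assoc + assoc - 1 is an assumption for illustration):

```python
# Candidate cache-block positions for a memory block under each policy.
def placement(block, num_blocks, assoc):
    """assoc = 1 (direct mapped) .. num_blocks (fully associative)."""
    num_sets = num_blocks // assoc
    s = block % num_sets                      # set = block mod number of sets
    return [s * assoc + way for way in range(assoc)]

print(placement(12, 8, 1))   # direct mapped: block 4 only (12 mod 8)
print(placement(12, 8, 2))   # 2-way: set 0 (12 mod 4), blocks 0 and 1
print(placement(12, 8, 8))   # fully associative: any of the 8 blocks
```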

Page 20: Cache Memory

The ABCs of Caches: How is a block found if it is in the cache?

The address is divided into Tag | Index | Byte Offset.
Each cache entry stores a block of words, with a tag on each block.
Only the tag needs to be compared; there is no need to check the index or block offset.


Page 22: Cache Memory

Cache Memory
Take advantage of spatial locality: store multiple words per block.

The diagram below is a schematic of what the cache looks like. Block 0 contains multiple words from main memory, identified with the tag 00000000. Block 1 contains words identified with the tag 11110101. The other two blocks are not valid.

Page 23: Cache Memory

Cache Memory

As an example, suppose a program generates the address 1AA. In 14-bit binary, this number is 00000110101010. The first 7 bits of this address go in the tag field, the next 4 bits go in the block field, and the final 3 bits indicate the word within the block.
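The field split for address 1AA can be verified with a few bit operations:

```python
# Split the 14-bit address 0x1AA into 7-bit tag | 4-bit block | 3-bit word.
addr = 0x1AA                     # 0b00000110101010
word  = addr & 0b111             # final 3 bits: word within the block
block = (addr >> 3) & 0b1111     # next 4 bits: block field
tag   = addr >> 7                # first 7 bits: tag field
print(f"{tag:07b} {block:04b} {word:03b}")   # 0000011 0101 010
```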

Page 24: Cache Memory

Cache Organizations I: Direct-Mapped Cache

(Figure: a 1 KB direct-mapped cache with 32-byte lines. The address is split into a cache tag (bits 31-10, example 0x50), a cache index (bits 9-5, example 0x01), and a byte select (bits 4-0, example 0x00); bits 31-5 form the block address. Each line stores a valid bit and the cache tag as part of the cache "state", plus 32 bytes of data; the 32 lines together hold bytes 0 through 1023.)

Page 25: Cache Memory

Direct Mapping Address Structure

Tag (s-r bits) | Line or slot (r bits) | Word (w bits)
      8        |          14           |      2

24-bit address
2-bit word identifier (4-byte block)
22-bit block identifier
8-bit tag (= 22 - 14)
14-bit slot or line

No two blocks that map to the same line have the same tag field.
Check the contents of the cache by finding the line and checking the tag.

Page 26: Cache Memory

Direct Mapping Cache Line Table

Cache line   Main memory blocks held
0            0, m, 2m, 3m, ..., 2^s - m
1            1, m+1, 2m+1, ..., 2^s - m + 1
...
m-1          m-1, 2m-1, 3m-1, ..., 2^s - 1

Page 27: Cache Memory

Direct Mapping Cache Organization

Page 28: Cache Memory
Page 29: Cache Memory

Direct Mapping pros & cons

Simple
Inexpensive
Fixed location for a given block: if a program repeatedly accesses 2 blocks that map to the same line, cache misses are very high

Page 30: Cache Memory

Associative Mapping

A main memory block can load into any line of the cache
The memory address is interpreted as tag and word
The tag uniquely identifies a block of memory
Every line's tag is examined for a match
Cache searching gets expensive

Page 31: Cache Memory

Fully Associative Cache Organization

Page 32: Cache Memory
Page 33: Cache Memory

Associative Mapping Address Structure

Tag (22 bits) | Word (2 bits)

A 22-bit tag is stored with each 32-bit block of data
Compare the tag field with each tag entry in the cache to check for a hit
The least significant 2 bits of the address identify the required byte within the 32-bit (4-byte) data block

e.g.
Address   Tag      Data       Cache line
FFFFFC    FFFFFC   24682468   3FFF

Page 34: Cache Memory

Cache Organizations II: Set Associative Cache

(Figure: a two-way set associative cache with 32-byte lines and 16 sets, set 0 through set 15. The address is split into a cache tag (example 0x50), a cache index selecting the set (example 0x01, i.e., block address mod 16), and a byte select (example 0x00). Each line stores a valid bit and a tag as part of the cache "state"; the data bytes 0 through 1023 are spread across the sets, two lines per set.)

Page 35: Cache Memory

Set Associative Mapping

The cache is divided into a number of sets
Each set contains a number of lines
A given block maps to any line in a given set, e.g., block B can be in any line of set i
e.g., with 2 lines per set (2-way associative mapping), a given block can be in one of 2 lines in exactly one set

Page 36: Cache Memory

Set Associative Mapping Example

13-bit set number
A block maps to set: block number in main memory modulo 2^13
Addresses 2^15 bytes apart, e.g., 000000, 008000, 010000, 018000, ..., map to the same set
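A quick check of the set mapping (the helper name and the example addresses are for illustration, using the 9-bit tag / 13-bit set / 2-bit word structure shown on the next slide):

```python
# With a 2-bit word field and a 13-bit set field, the set number is
# bits <14:2> of the address, so addresses 2**15 bytes apart share a set.
SET_BITS, WORD_BITS = 13, 2

def set_number(addr):
    return (addr >> WORD_BITS) & ((1 << SET_BITS) - 1)

for a in (0x000000, 0x008000, 0x010000, 0x018000):
    print(hex(a), set_number(a))   # all map to set 0
```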

Page 37: Cache Memory

Two Way Set Associative Cache Organization

Page 38: Cache Memory

Set Associative Mapping Address Structure

Tag (9 bits) | Set (13 bits) | Word (2 bits)

Use the set field to determine which cache set to look in
Compare the tag field to see if we have a hit

e.g.
Address     Tag   Data       Set number
1FF 7FFC    1FF   12345678   1FFF
001 7FFC    001   11223344   1FFF

Page 39: Cache Memory

Two Way Set Associative Mapping Example

Page 40: Cache Memory

Replacement Algorithms (1): Direct Mapping

No choice: each block maps to only one line, so replace that line

Page 41: Cache Memory

Replacement Algorithms (2): Associative and Set Associative

Implemented in hardware (for speed)
Least recently used (LRU), e.g., in a 2-way set associative cache: which of the 2 blocks is LRU?
First in first out (FIFO): replace the block that has been in the cache longest
Least frequently used (LFU): replace the block that has had the fewest hits
Random

Page 42: Cache Memory

Write Policy

Must not overwrite a cache block unless main memory is up to date
Multiple CPUs may have individual caches
I/O may address main memory directly

Page 43: Cache Memory

Write through

All writes go to main memory as well as to the cache
Multiple CPUs can monitor main memory traffic to keep their local caches up to date
Generates lots of memory traffic
Slows down writes

Remember bogus write through caches!

Page 44: Cache Memory

Write back

Updates are initially made in the cache only
An update bit for the cache slot is set when an update occurs
If a block is to be replaced, write it to main memory only if the update bit is set
Other caches can get out of sync
I/O must access main memory through the cache
N.B. about 15% of memory references are writes

Page 45: Cache Memory

Let's Do An Example: The Memory Addresses We'll Be Using

Here are a number of addresses. We'll ask for the data at these addresses and see what happens to the cache when we do so.

The cache:
1. Is direct mapped
2. Contains 512 bytes
3. Has 16 sets
4. Each set can hold 32 bytes, or 1 cache line

Address   Tag                      Set    Offset   Result
1090      00000000000000000000010  0010   00010    Miss
1440      00000000000000000000010  1101   00000    Miss
5000      xxxxxxxxxxxxxxxxxxxxxxx  xxxx   01000
1470      xxxxxxxxxxxxxxxxxxxxxxx  xxxx   xxxxx

Page 46: Cache Memory

Here's the Cache We'll Be Touching

Initially the cache is empty. The cache is direct mapped, contains 512 bytes, has 16 sets, and each set can hold 32 bytes, or 1 cache line. Each set (0000 through 1111, i.e., set addresses 0 through 15) has a valid bit V, a tag, and a data field that can hold a 32-byte cache line; all valid bits start out N (not valid).

Page 47: Cache Memory

Doing Some Cache Action

We want to READ data from address 1090 = 010|0010|00010 (tag|set|offset).
Set 0010 is not valid, so this is a miss: the line holding memory locations 1088-1119 is loaded into set 0010, its tag is set to 00000....10, and its valid bit is set to Y.

(Each of these slides also tabulates the tag, set, and offset fields of all the reference addresses used in the example: 256, 512, 1024, 1090, 1099, 1440, 1600, 2048, 4096, 5000, 1470, and 1620.)

Page 48: Cache Memory

Doing Some Cache Action

We want to READ data from address 1440 = 010|1101|00000.
Set 1101 is not valid, so this is a miss: the line holding memory locations 1440-1471 is loaded into set 1101 with tag 00000....10. Set 0010 still holds locations 1088-1119.

Page 49: Cache Memory

Doing Some Cache Action

We want to READ data from address 5000 = 1001|1100|01000.
Set 1100 is not valid, so this is a miss: the line holding memory locations 4992-5023 is loaded into set 1100 with tag 00000....1001.

Page 50: Cache Memory

Doing Some Cache Action

We want to READ data from address 1470 = 0010|1101|11110.
Set 1101 is valid and its tag (00000....10) matches, so this is a hit: the data is delivered from the line already holding memory locations 1440-1471.

Page 51: Cache Memory

Doing Some Cache Action

We want to READ data from address 1600 = 0011|0010|00000.
Set 0010 is valid but holds tag 00000....10 (locations 1088-1119), which does not match, so this is a miss: the old line is replaced by the line holding memory locations 1600-1631, with tag 00000....0011.

Page 52: Cache Memory

Doing Some Cache Action

We want to WRITE data to address 256 = 0000|1000|00000.
Set 1000 is not valid, so this is a write miss: the line holding memory locations 256-287 is loaded into set 1000 with tag 00000....0000, and the write updates it.

Page 53: Cache Memory

Doing Some Cache Action

We want to WRITE data to address 1620 = 0011|0010|10100.
Set 0010 is valid and its tag (00000....0011, locations 1600-1631) matches, so this is a write hit: the write goes to the cached line.

Page 54: Cache Memory

Doing Some Cache Action

We want to WRITE data to address 1099 = 0010|0010|01011.
Set 0010 holds tag 00000....0011 (locations 1600-1631), which does not match, so this is a write miss: the line is replaced by the line holding memory locations 1088-1119, with tag 00000....0010, and the write updates it.
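The whole access sequence from these slides can be replayed with a minimal direct-mapped model (tags are stored as integers rather than bit strings, and write misses are assumed to allocate a line, as the slides' tables show):

```python
# Replay of the worked example: direct mapped, 512 B, 16 sets, 32 B lines.
LINE = 32
SETS = 16
cache = {}   # set index -> tag of the resident line

def access(addr):
    s = (addr // LINE) % SETS          # set field
    tag = addr // (LINE * SETS)        # tag field
    hit = cache.get(s) == tag
    cache[s] = tag                     # allocate/replace on a miss
    return "hit" if hit else "miss"

trace = [("R", 1090), ("R", 1440), ("R", 5000), ("R", 1470),
         ("R", 1600), ("W", 256), ("W", 1620), ("W", 1099)]
results = [access(a) for _, a in trace]
print(results)
# ['miss', 'miss', 'miss', 'hit', 'miss', 'miss', 'hit', 'miss']
```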

Page 55: Cache Memory

What happens on a write?

Write through: the information is written to both the block in the cache and the block in the lower-level memory.

Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. A dirty bit records whether the block is clean or dirty.

Write through is always combined with write buffers so that the processor does not wait for the lower-level memory.

Page 56: Cache Memory

Write Buffer for Write Through

A write buffer is needed between the cache and memory:
Processor: writes data into the cache and the write buffer
Memory controller: writes the contents of the buffer to memory

The write buffer is just a FIFO:
Typical number of entries: 4
Must handle bursts of writes

Processor -> Cache -> Write Buffer -> DRAM
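A sketch of such a buffer (the class name, the 4-entry depth, and the stall signaling are illustrative assumptions):

```python
# A FIFO write buffer between the cache and DRAM: the processor enqueues
# writes; the memory controller drains them in order.
from collections import deque

class WriteBuffer:
    def __init__(self, depth=4):        # typical depth: 4 entries
        self.fifo = deque()
        self.depth = depth

    def write(self, addr, data):
        # Processor side: returns False when the buffer is full,
        # i.e., a burst of writes would stall the processor.
        if len(self.fifo) >= self.depth:
            return False
        self.fifo.append((addr, data))
        return True

    def drain_one(self):
        # Memory-controller side: retire the oldest write to DRAM.
        return self.fifo.popleft() if self.fifo else None

wb = WriteBuffer()
for i in range(5):
    print(wb.write(0x100 + i, i))   # the fifth write finds the buffer full
print(wb.drain_one())               # oldest entry retires first (FIFO order)
```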

Page 57: Cache Memory

Write-miss Policy: Write Allocate vs. Not Allocate

Assume a 16-bit (sub-block) write to memory location 0x0 causes a miss. Do we allocate space in the cache and possibly read in the block?

Yes => Write allocate (typical for write-back caches)
No => No-write allocate (typical for write-through caches)

Example:
WriteMem[100]
WriteMem[100]
ReadMem[200]
WriteMem[200]
WriteMem[100]

No-write allocate: four misses and one hit
Write allocate: two misses and three hits
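The miss/hit counts for this trace can be verified with a small model (block granularity and an initially empty cache are assumed; reads always allocate):

```python
# Count misses and hits for a trace under write-allocate (WA)
# vs. no-write-allocate (NWA), with room for both blocks.
def run(trace, write_allocate):
    cached, misses, hits = set(), 0, 0
    for op, block in trace:
        if block in cached:
            hits += 1
        else:
            misses += 1
            # Reads always allocate; writes only under write-allocate.
            if op == "R" or write_allocate:
                cached.add(block)
    return misses, hits

trace = [("W", 100), ("W", 100), ("R", 200), ("W", 200), ("W", 100)]
print(run(trace, write_allocate=False))   # NWA: (4 misses, 1 hit)
print(run(trace, write_allocate=True))    # WA:  (2 misses, 3 hits)
```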

Page 58: Cache Memory

Pentium 4 Cache

80386: no on-chip cache
80486: 8 KB, using 16-byte lines and a four-way set associative organization
Pentium (all versions): two on-chip L1 caches, one for data and one for instructions
Pentium III: L3 cache added off chip
Pentium 4:
  L1 caches: 8 KB each, 64-byte lines, four-way set associative
  L2 cache: feeds both L1 caches; 256 KB, 128-byte lines, 8-way set associative
  L3 cache: on chip

Page 59: Cache Memory

Intel Cache Evolution

Problem: External memory is slower than the system bus.
Solution: Add an external cache using faster memory technology.
First appears: 386

Problem: Increased processor speed makes the external bus a bottleneck for cache access.
Solution: Move the external cache on-chip, operating at the same speed as the processor.
First appears: 486

Problem: The internal cache is rather small, due to limited space on the chip.
Solution: Add an external L2 cache using faster technology than main memory.
First appears: 486

Problem: Contention occurs when both the instruction prefetcher and the execution unit simultaneously require access to the cache; the prefetcher is stalled while the execution unit's data access takes place.
Solution: Create separate data and instruction caches.
First appears: Pentium

Problem: Increased processor speed makes the external bus a bottleneck for L2 cache access.
Solution: Create a separate back-side bus that runs at a higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache.
First appears: Pentium Pro
Solution: Move the L2 cache onto the processor chip.
First appears: Pentium II

Problem: Some applications deal with massive databases and must have rapid access to large amounts of data; the on-chip caches are too small.
Solution: Add an external L3 cache.
First appears: Pentium III
Solution: Move the L3 cache on-chip.
First appears: Pentium 4

Page 60: Cache Memory

Reducing Cache Misses

Classifying Misses: the 3 Cs

Compulsory: the first access to a block cannot find it in the cache, so the block must be brought in. Also called cold-start misses or first-reference misses. (These are misses in even an infinite cache.)
Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur as blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)