
1

Computer Architecture

Cache Memory

2

Today is brought to you by cache

• What do we want?

– Fast access to data from memory

– Large size of memory

– Acceptable memory system cost

• Where do we get it?

– Interpose a smaller but faster memory, which holds recently accessed data, between the datapath and main memory

3

Cache

• Cache = To conceal or store, as in the earth; hide in a secret place; n. A place for hiding or storing provisions, equipment etc, also the things stored or hidden [F. cacher to hide]

• Cache sounds like cash

• Programs usually exhibit locality:

– temporal locality: if an item is referenced, there is a high probability that it will be referenced again soon

– spatial locality: if an item is referenced, items near it have a high probability of being referenced soon

4

Learning Objectives

1. Know the principle of cache implementation

2. Know the difference between direct mapped, partial set associative and fully associative caches and how they work

3. Know the terms: cache hit, cache miss, word size, block size, row or set size, cache rows, cache tag, cache index, direct mapped cache, partial set associative and fully associative cache.

5

Consider this. It is like caching information

• You are in the library gathering books for an assignment

1) The well-selected books you have gathered probably contain material that you had not expected but will likely use

2) You do not collect ALL the books from the library to your desk

3) It is quicker to access information from a book on your desk than to go back to the stacks

• This is like use of cache principles in computing.

6

Cache Principle

• In a simply configured CPU-memory system with no cache, the time for a memory fetch or store depends on the access speed of main memory. In general, for a given technology, access time increases with memory size.

• A cache is a mechanism that can speed up memory transfers by making use of a proximity principle: instruction fetches and data accesses are often "near" the previous and following accesses.

• By keeping recently used data in fast-access memory and having a separate transfer process between main memory and the cache, the effective memory access time can be reduced, with consequent performance gains.

7

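Why is caching effective in computing?

• Spatial locality arises from

– loops that step through neighbouring data

– data structures, arrays

– sequential access to program instructions

• Temporal locality arises from

– loops, which reuse the same instructions and data on every iteration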

• Memory cost and speed

– SRAM: 5-25 ns, $100-$250 per MByte

– DRAM: 60-120 ns, $5-$10 per MByte

– Magnetic disk: 10-20 ms, $0.10-$0.20 per MByte

8

Memory access time and cost

[Figure: AccessTime and CostPerMB plotted for six memory technologies (numbered 1 to 6) on a logarithmic scale from 0.1 to 1,000,000,000.]

9

Practical usage of memory types

• Advantageous to build a hierarchy of memories:

– fastest and most expensive: small and close to the processor

– slowest and least expensive: large and further from the processor

[Figure: CPU connected to a chain of memory levels. Moving away from the CPU, size goes from smallest to biggest, cost ($/bit) from highest to lowest, and speed from fastest to slowest.]

10

Memory Hierarchy of a Modern Computer System

• By taking advantage of the principle of locality:

– Present the user with as much memory as is available in the cheapest technology.

– Provide access at the speed offered by the fastest technology.

[Figure: memory hierarchy, from the processor (control and datapath) outwards.]

Level                                      Speed (ns)                 Size (bytes)
Registers                                  1                          100s
On-chip cache, second-level cache (SRAM)   10                         Ks
Main memory (DRAM)                         100                        Ms
Secondary storage (disk)                   10,000,000 (10s ms)        Gs
Tertiary storage (disk)                    10,000,000,000 (10s sec)   Ts

11

The Art of Memory System Design

[Figure: workload or benchmark programs drive a processor, which is connected to a cache ($) and then to main memory (MEM).]

• The processor issues a memory reference stream: <op,addr>, <op,addr>, <op,addr>, <op,addr>, . . .

– op: i-fetch, read, write

• Optimize the memory system organization to minimize the average memory access time for typical workloads

12

Notation for accessing data and instructions in memory

• Define a BLOCK as the minimum unit of information transferred between two adjacent levels of the memory hierarchy

• When a word of data is required, the whole block containing that word is transferred.

• There is a high probability that the next word required is also in the block, so the next word is obtained from FAST memory rather than SLOW memory
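As a small illustration of why transferring a whole block helps, the sketch below lists the word addresses that arrive together with a requested word. The 16-byte block and 4-byte word are assumptions chosen for the example, not values from the slide.

```python
BLOCK_BYTES = 16  # assumed block size for this illustration
WORD_BYTES = 4    # assumed word size for this illustration

def words_in_same_block(addr):
    """Return the addresses of all words in the block that contains addr."""
    block_start = (addr // BLOCK_BYTES) * BLOCK_BYTES
    return list(range(block_start, block_start + BLOCK_BYTES, WORD_BYTES))

# Requesting the word at 0x104 also brings in its neighbours.
print([hex(a) for a in words_in_same_block(0x104)])
```

If the program next reads the word at 0x108, it is already in the faster level.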

13

Hits and misses

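• Define a hit as the event in which data requested by the processor is available in some block of the highest level of the memory hierarchy (the cache).

• A miss is the other case: the data is not in that level and must be fetched from a lower level.

• Hit rate is a measure of success in accessing a cache: the fraction of accesses that are hits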

14

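More notation

• Hit rate: fraction of memory accesses found in the cache

• Miss rate: 1 - hit rate

• Hit time: time to access the cache, including the time to determine whether the access is a hit or a miss

• Miss penalty: time to fetch from slow memory

• Memory systems are critical to good performance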
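A minimal sketch of how hit rate and miss rate relate, assuming we simply count hits and misses over a run. The counts used here are made up for illustration.

```python
def hit_and_miss_rate(hits, misses):
    """Return (hit_rate, miss_rate) for the given access counts."""
    accesses = hits + misses
    hit_rate = hits / accesses
    return hit_rate, 1.0 - hit_rate

# Hypothetical counts: 950 hits and 50 misses out of 1000 accesses.
hit_rate, miss_rate = hit_and_miss_rate(hits=950, misses=50)
print(f"hit rate {hit_rate:.2f}, miss rate {miss_rate:.2f}")
```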

15

Basics of caches

• How do we determine if the data is in the cache?

• If data is in the cache, how is it found?

• We only have information on:

– address of data

– how the cache is organized

• Direct mapped cache:

– the data can only be at a specific place

16

Data Address is used to organize cache storage strategy

• Byte field: selects the byte within a word

• Block field: selects the word within a block

• Index field: selects the row of the cache where the block is stored

• Tag field: identifies which block occupies a cache row

Word address bit fields, from most to least significant:

Tag | Index | Block | Byte

17

Example: 24-bit address with 8-byte blocks and 2048 blocks in a cache of 16384 bytes
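The field widths for this example can be worked out from the sizes alone. The sketch below assumes the cache is direct mapped (one block per row), which the slide title does not state, and that all sizes are powers of two. The function name is hypothetical.

```python
def field_widths(address_bits, block_bytes, num_blocks):
    """Return (tag_bits, index_bits, offset_bits) for a direct mapped cache.

    Assumes block_bytes and num_blocks are powers of two.
    """
    offset_bits = block_bytes.bit_length() - 1  # selects a byte within the block
    index_bits = num_blocks.bit_length() - 1    # selects the cache row
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# 24-bit address, 8-byte blocks, 2048 blocks (2048 * 8 = 16384 bytes).
print(field_widths(24, 8, 2048))  # (10, 11, 3)
```

Under these assumptions the 24 address bits split into a 10-bit tag, an 11-bit index and a 3-bit offset within the block.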

18

Bit fields for a 4-byte word in a 32-bit address with 2^b words per block

Field         Address bits   Usage
Word field    0 : 1          address bits within the word being accessed
Block field   2 : 2+b-1      identifies word within the block; field could be empty
Set field     no bits
Tag field     2+b : 31       identifies tag field (unique identifier for block on its row)

19

Example of direct mapped cache

• Example shows address entries that map to the same location in cache for one byte per word, one word per block, one block per row

[Figure: an 8-entry cache (indices 000 to 111) and a memory with 5-bit addresses. Memory addresses 00001, 01001, 10001 and 11001 map to cache index 001; addresses 00101, 01101, 10101 and 11101 map to cache index 101. Word address bit fields: Tag | Index | Block | Byte.]

• Index: 8 cache entries

• Data mapped by address modulo 8
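The modulo mapping in this example can be sketched in a few lines. The addresses are the eight shown on the slide; the helper name is hypothetical.

```python
NUM_ROWS = 8  # the example cache has 8 entries

def cache_row(address):
    """Direct mapped placement: row = address modulo number of rows."""
    return address % NUM_ROWS

addresses = [0b00001, 0b00101, 0b01001, 0b01101,
             0b10001, 0b10101, 0b11001, 0b11101]

for a in addresses:
    # The low 3 bits are the index; the remaining high bits are the tag.
    print(f"{a:05b} -> row {cache_row(a):03b}, tag {a // NUM_ROWS:02b}")
```

Four addresses land on row 001 and four on row 101, so the tag is what tells them apart.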

20

Contents of a direct mapped cache

• DATA == cached block

• TAG == most significant bits of the cached block's address, which distinguish the block in that cache row from other blocks that map to the same row

• VALID == flag bit indicating that the cache row's content is valid

21

Direct cache

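Separate address into fields:

• Byte offset in word

• Index for row of cache

• Tag identifier of block

[Figure: direct mapped cache with 1024 rows (index 0 to 1023), each holding a valid bit, a 20-bit tag and 32 bits of data. Address bits 31-12 form the tag, bits 11-2 the 10-bit index and bits 1-0 the byte offset. The stored tag is compared with the address tag to produce Hit; the row supplies the 32-bit Data.]

A cache of 2^n words, a block being a 4-byte word, has 2^n * (63 - n) bits for a 32-bit address:

• #rows = 2^n

• #bits/row = 32 (data) + (32 - 2 - n) (tag) + 1 (valid) = 63 - n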
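The storage formula above can be checked numerically. This is a sketch under the slide's assumptions (32-bit address, one 4-byte word per block, one valid bit per row); the function name is hypothetical.

```python
def direct_mapped_bits(n, address_bits=32, word_bytes=4):
    """Total storage bits of a direct mapped cache with 2**n one-word rows."""
    rows = 2 ** n
    data_bits = 8 * word_bytes
    byte_offset_bits = word_bytes.bit_length() - 1  # 2 bits for a 4-byte word
    tag_bits = address_bits - n - byte_offset_bits
    bits_per_row = data_bits + tag_bits + 1         # +1 for the valid bit
    return rows * bits_per_row

# The 1024-row cache in the figure: n = 10, so 53 bits per row.
print(direct_mapped_bits(10))  # 54272
```

For n = 10 the cache stores 32768 bits of data but needs 54272 bits in total, so tags and valid bits add roughly 66% overhead.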

22

Reading: Hits and Misses

• A hit requires no special handling. The data is available.

• Instruction fetch cache miss:

– Stall the pipeline, apply the PC to memory and fetch the block. Re-fetch the instruction when the miss has been serviced

– Same for data fetch

23

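Multi-word Blocks

[Figure: direct mapped cache with 4K entries, each holding a valid bit (V), a 16-bit tag and 128 bits of data (four 32-bit words). Address bits 31-16 form the tag, bits 15-4 the 12-bit index, bits 3-2 the block offset and bits 1-0 the byte offset. The 2-bit block offset drives a multiplexer that selects one of the four 32-bit words as Data; the tag comparison produces Hit.]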
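A minimal sketch of the address split used by this multi-word-block cache, following the bit positions in the figure (tag 31-16, index 15-4, block offset 3-2, byte offset 1-0). The function name and the sample address are illustrative only.

```python
def split_address(addr):
    """Split a 32-bit address into (tag, index, block_offset, byte_offset)."""
    byte_offset = addr & 0x3           # bits 1-0: byte within the word
    block_offset = (addr >> 2) & 0x3   # bits 3-2: word within the 4-word block
    index = (addr >> 4) & 0xFFF        # bits 15-4: one of 4K rows
    tag = (addr >> 16) & 0xFFFF        # bits 31-16
    return tag, index, block_offset, byte_offset

tag, index, block_offset, byte_offset = split_address(0x12345678)
print(hex(tag), hex(index), block_offset, byte_offset)  # 0x1234 0x567 2 0
```

The block offset plays the role of the multiplexer select: it picks which of the four words in the 128-bit row is returned.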

24

Miss Rates Vs Block Size

[Figure: miss rate (0% to 40%) versus block size (4, 16, 64 and 256 bytes), with one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB and 256 KB.]

25

Block Size Tradeoff

• In general, a larger block size takes advantage of spatial locality, BUT:

– Larger block size means larger miss penalty:

• Takes longer to fill the block

– If block size is too big relative to cache size, miss rate will go up

• Too few cache blocks

• In general, Average Access Time:

– = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate

[Figure: three sketches against block size. Miss penalty rises with block size. Miss rate first falls (exploits spatial locality), then rises (fewer blocks: compromises temporal locality). Average access time first falls, then rises (increased miss penalty and miss rate).]
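The average access time formula on this slide can be tried out directly. The hit time, miss penalty and miss rates below are hypothetical numbers chosen only to show how the terms combine.

```python
def average_access_time(hit_time, miss_penalty, miss_rate):
    """Average access time as written on the slide."""
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

# Hypothetical values: 1-cycle hit time, 50-cycle miss penalty.
for miss_rate in (0.01, 0.05, 0.10):
    t = average_access_time(hit_time=1, miss_penalty=50, miss_rate=miss_rate)
    print(f"miss rate {miss_rate:.0%}: {t:.2f} cycles")
```

With these numbers a miss rate of only 5% already more than triples the average access time relative to the hit time, which is why both miss rate and miss penalty matter when choosing a block size.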

26

Example: 1 KB Direct Mapped Cache with 32 Byte Blocks

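• For a 2 ** N byte cache:

– The uppermost (32 - N) bits are always the Cache Tag

– The lowest M bits are the Byte Select (Block Size = 2 ** M)

[Figure: the 32-bit address is split into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01) and Byte Select (bits 4-0, example 0x00). The cache has 32 rows (index 0 to 31); each row holds a valid bit, a cache tag stored as part of the cache "state", and 32 bytes of cache data (row 0: bytes 0-31, row 1: bytes 32-63, ..., row 31: bytes 992-1023). In the example, row 1 holds tag 0x50.]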
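The example field values (tag 0x50, index 0x01, byte select 0x00) can be turned into a concrete address and back. This is a sketch for the 1 KB, 32-byte-block case with 32-bit addresses; the helper names are hypothetical.

```python
CACHE_BYTES = 1024  # 2 ** N with N = 10
BLOCK_BYTES = 32    # 2 ** M with M = 5

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1                  # M = 5
INDEX_BITS = (CACHE_BYTES // BLOCK_BYTES).bit_length() - 1  # N - M = 5

def decode(addr):
    """Split an address into (tag, index, byte_select)."""
    byte_select = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_select

def encode(tag, index, byte_select):
    """Rebuild the address from its three fields."""
    return (tag << (OFFSET_BITS + INDEX_BITS)) | (index << OFFSET_BITS) | byte_select

addr = encode(0x50, 0x01, 0x00)
print(hex(addr))     # 0x14020
print(decode(addr))  # (80, 1, 0)
```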

27

Extreme Example: single big line

• Cache Size = 4 bytes Block Size = 4 bytes

– Only ONE entry in the cache

• If an item is accessed, likely that it will be accessed again soon

– But it is unlikely that it will be accessed again immediately!!!

– The next access will likely be a miss again

• Continually loading data into the cache but discarding it (forcing it out) before it is used again

• Worst nightmare of a cache designer: Ping Pong Effect

• Conflict Misses are misses caused by:

– Different memory locations mapped to the same cache index

• Solution 1: make the cache size bigger

• Solution 2: Multiple entries for the same Cache Index

[Figure: a cache with a single entry at index 0, holding a valid bit, a cache tag and cache data (bytes 3, 2, 1, 0).]
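The ping pong effect can be reproduced with a tiny direct mapped cache simulation. This is a sketch only: it stores the whole block number in each row instead of a separate tag, and the access trace is made up.

```python
def count_misses(addresses, num_rows, block_bytes):
    """Count misses in a direct mapped cache for a list of byte addresses."""
    rows = [None] * num_rows  # block number held by each row; None = invalid
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        row = block % num_rows
        if rows[row] != block:
            misses += 1
            rows[row] = block  # the new block forces out the old one
    return misses

# A program that alternates between two words, 4 bytes apart.
trace = [0, 4] * 4

print(count_misses(trace, num_rows=1, block_bytes=4))  # 8: every access misses
print(count_misses(trace, num_rows=2, block_bytes=4))  # 2: only the cold misses
```

With a single entry the two blocks keep evicting each other; doubling the cache (Solution 1) leaves only the two first-access misses.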

28

Another Extreme Example: Fully Associative

• Fully Associative Cache, N blocks of 32 bytes each

– Forget about the Cache Index

– Compare the Cache Tags of all cache entries in parallel

– Example: with 32-byte blocks and 32-bit addresses, the tag is 32 - 5 = 27 bits, so we need N 27-bit comparators

• By definition: Conflict Miss = 0 for a fully associative cache

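[Figure: fully associative cache. The 32-bit address is split into a 27-bit Cache Tag (bits 31-5) and a Byte Select (bits 4-0, example 0x01). The address tag is compared (X) in parallel with the tag of every entry; each entry holds a valid bit, a cache tag and 32 bytes of cache data.]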

29

A Two-way Set Associative Cache

• N-way set associative: N entries for each Cache Index

– N direct mapped caches operate in parallel

• Example: Two-way set associative cache

– Cache Index selects a “set” from the cache

– The two tags in the set are compared in parallel

– Data is selected based on the tag result

[Figure: two-way set associative cache. The Cache Index selects one entry (valid bit, cache tag, cache data) from each of the two ways. Each stored tag is compared with the address tag (Adr Tag); the two compare results are ORed to produce Hit and drive the select inputs (Sel1, Sel0) of a multiplexer that outputs the selected Cache Block.]
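A minimal simulation sketch of an N-way set associative lookup. The slides do not specify a replacement policy, so least-recently-used (LRU) replacement is assumed here; the function name and the trace are illustrative.

```python
def count_misses_set_assoc(addresses, num_sets, ways, block_bytes):
    """Count misses in a set associative cache with (assumed) LRU replacement."""
    # Each set is a list of tags, ordered from least to most recently used.
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        index = block % num_sets   # Cache Index selects the set
        tag = block // num_sets
        entries = sets[index]
        if tag in entries:         # compare against every tag in the set
            entries.remove(tag)    # hit: will be re-appended as most recent
        else:
            misses += 1
            if len(entries) == ways:
                entries.pop(0)     # evict the least recently used entry
        entries.append(tag)
    return misses

# Two blocks that collide in a direct mapped cache of two 32-byte blocks.
trace = [0, 64, 0, 64, 0, 64]

print(count_misses_set_assoc(trace, num_sets=2, ways=1, block_bytes=32))  # 6
print(count_misses_set_assoc(trace, num_sets=1, ways=2, block_bytes=32))  # 2
```

Both configurations hold two blocks in total, but the two-way organization lets the colliding blocks coexist in one set, so only the two first-access misses remain.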

30

Disadvantage of Set Associative Cache

• N-way Set Associative Cache versus Direct Mapped Cache:

– N comparators vs. 1

– Extra MUX delay for the data

– Data comes AFTER Hit/Miss decision and set selection

• In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:

– Possible to assume a hit and continue. Recover later if miss.

[Figure: two-way set associative cache. The Cache Index selects one entry (valid bit, cache tag, cache data) from each of the two ways. Each stored tag is compared with the address tag (Adr Tag); the two compare results are ORed to produce Hit and drive the select inputs (Sel1, Sel0) of a multiplexer that outputs the selected Cache Block.]

31

Three Cs of Caches:

1. Compulsory misses: These are cache misses caused by the first access to a block that has never been in the cache (also known as cold-start misses)

2. Capacity misses: These are cache misses caused when the cache cannot contain all the blocks needed during execution of a program. Capacity misses occur because of blocks being replaced and later retrieved when accessed.

3. Conflict misses: These are cache misses that occur in set-associative or direct-mapped caches when multiple blocks compete for the same set. Conflict misses are those misses in a direct-mapped or set-associative cache that are eliminated in a fully associative cache of the same size. These are also called collision misses.

32

A Summary on Sources of Cache Misses

• Compulsory (cold start or process migration, first reference): first access to a block

– “Cold” fact of life: not a whole lot you can do about it

– Note: If you are going to run “billions” of instructions, Compulsory Misses are insignificant

• Conflict (collision):

– Multiple memory locations mapped to the same cache location

– Solution 1: increase cache size

– Solution 2: increase associativity

• Capacity:

– Cache cannot contain all blocks accessed by the program

– Solution: increase cache size

• Invalidation: other process (e.g., I/O) updates memory

33

Summary:

• The Principle of Locality:

– Programs are likely to access a relatively small portion of the address space at any instant of time.

• Temporal Locality: Locality in Time

• Spatial Locality: Locality in Space

• Three Major Categories of Cache Misses:

– Compulsory Misses: sad facts of life. Example: cold start misses.

– Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect!

– Capacity Misses: increase cache size

• Cache Design Space

– total size, block size, associativity

– replacement policy

– write-hit policy (write-through, write-back)

– write-miss policy

34

Cache design parameters

Design change            Effect on miss rate                            Possible negative performance effect
Increase block size      decreases miss rate due to compulsory misses   may increase miss penalty
Increase size            decreases capacity misses                      may increase access time
Increase associativity   decreases miss rate due to conflict misses     may increase access time