CS61CL Machine Structures
Lec 11 – Introduction to Cache Design
David Culler Electrical Engineering and Computer Sciences
University of California, Berkeley
CS61CL Road Map
[Figure: the hardware/software stack — an HLL program (foo.c) is compiled to an assembly language program (foo.s), assembled to a machine language program (foo.o), and linked into foo.exe; below the Instruction Set Architecture sit the machine organization (I/O system, instruction set processor, datapath & control), digital design, circuit design, layout & fab, and semiconductor materials.]
Turning “Moore stuff” into performance
Performance Trends
[Figure: processor performance trends over time (e.g., the MIPS R3000 era).]

Recall: Performance
• Performance is in units of things per second – bigger is better
• If we are primarily concerned with response time: performance(X) = 1 / execution_time(X)
• "X is n times faster than Y" means:
  Performance(X) / Performance(Y) = Execution_time(Y) / Execution_time(X) = n
• Speedup(E) = Performance(with E) / Performance(without E)
Review: Pipelined Execution
• Speedup with N stages is ≤ N
• Limited by dependences (aka hazards)
  – Structural hazard: two operations want to use the same resource at the same time
  – Data hazard: cannot use a value before it is produced
  – Control hazard: attempt to branch before the condition is determined
[Figure: pipelined datapath — PC and instruction memory feeding the pipeline registers IR, IR_ex, IR_mem, IR_wb, with a data memory stage.]
The Problem: Memory Gap
• 1985: 80386, cache off-chip
• 1989: first Intel CPU with cache on chip (80486)
• 1995: first Intel CPU with two levels of cache on chip (Pentium Pro)
[Figure: performance vs. year, 1980–2000, log scale from 1 to 1000 — µProc improves ~60%/yr while DRAM improves ~7%/yr; the processor-memory performance gap grows ~50%/yr.]
Recall: Where do Objects Live and Work?
[Figure: processor registers beside a memory addressed 000…0 through FFF…F; load and store move words between memory location n and the registers, the processor operates on registers, and a read can be a read-hit or a read-miss.]
Storage Hierarchy
[Figure: pyramid from the processor (Level 1) down through Level 2, Level 3, …, Level n — registers, cache, memory, disk. The size of memory grows at each level; with increasing distance from the processor, speed decreases: as we move to deeper levels, the latency goes up and the price per bit goes down.]
Why Caches Work
• Physics:
  – large memories are slow; fast memories are small
• Statistics: programs exhibit locality
  – Temporal locality: recently accessed locations are likely to be accessed again soon
  – Spatial locality: if a location is accessed, others nearby are likely to be accessed too
• Use statistics to cheat the laws of physics
  – illusion of a large, fast memory
  – on average, access to a large memory can be fast
  – keep recently accessed blocks in a small, fast memory
• Avg Mem Access Time = Hit Time + P(miss) × Miss Penalty
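This formula drives the design discussion for the rest of the lecture. As a quick illustration, here is a minimal sketch of it in C; the numbers are illustrative (they reappear in an example near the end of the deck):

```c
#include <stdio.h>

/* AMAT = hit time + P(miss) * miss penalty, all in cycles. */
static double amat(double hit_time, double p_miss, double miss_penalty) {
    return hit_time + p_miss * miss_penalty;
}

int main(void) {
    /* 1-cycle hit, 5% miss rate, 20-cycle miss penalty. */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 20.0)); /* prints 2.00 */
    return 0;
}
```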
Manual vs. Automatic Management of the Storage Hierarchy
• In everyday life?
  – which books go in your backpack? on your desk? in the library? on Amazon?
  – your music collection?
• Registers? Files? Cache?
Cache: Transparent Memory Acceleration
• Processor performs reads and writes on memory locations
  – instruction fetch, load, store
  – the memory abstraction is unchanged!
• Cache holds a copy of a small portion of the memory
  – hit: present in cache ⇒ respond quickly
  – miss: absent from cache ⇒ obtain it from memory and respond
• Unit of transfer: block
  – several words of memory move into a cache line
• Where can a block be placed? How can we tell if it is there? What happens to memory on a write hit? What happens to the cache on a write miss?
Direct-Mapped Cache
• Each memory address is associated with one possible block within the cache
  – ⇒ only need to look in a single location in the cache for the data, if it exists in the cache
  – the block is the unit of transfer between cache and memory

Direct-Mapped Cache (B=1, S=4)
• 4-byte direct-mapped cache, block size = 1 byte, cache indices 0–3
• Cache line 0 can be occupied by data from:
  – memory locations 0, 4, 8, …
  – 4 blocks ⇒ any memory location that is a multiple of 4
[Figure: memory addresses 0–F mapping onto the 4 cache lines.]
Direct-Mapped Cache (B=2, S=4)
• 8-byte direct-mapped cache, block size = 2 bytes, cache indices 0–3
• How is the block located? How is the byte within the block selected?
• e.g., memory address 11101? (worked through in the sketch below)
[Figure: memory addresses 0–1E (even block starts) mapping onto the 4 two-byte cache lines.]
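A minimal sketch in C of how the slide's question for address 11101 works out under this B=2, S=4 geometry; the bit-field widths follow from the figure (1 offset bit, 2 index bits, the rest tag):

```c
#include <stdio.h>

/* Address split for the 8-byte direct-mapped cache above:
   block size B = 2 bytes -> 1 offset bit; S = 4 lines -> 2 index bits;
   the remaining high bits form the tag. */
int main(void) {
    unsigned addr   = 0x1D;              /* 11101 binary, from the slide */
    unsigned offset = addr & 0x1;        /* byte within the 2-byte block */
    unsigned index  = (addr >> 1) & 0x3; /* which of the 4 cache lines   */
    unsigned tag    = addr >> 3;         /* identifies the memory block  */
    printf("offset=%u index=%u tag=%u\n", offset, index, tag); /* 1 2 3 */
    return 0;
}
```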
How do you tell if the right block is in the line?
• Like luggage at the airport…
Tag Check (B=2, S=4, N=1)
• What should go in the tag?
  – the entire address?
  – no: we don't need the bits we used in getting there
[Figure: each cache line stores a tag next to its data; the tag records which memory block currently occupies the line.]

Address layout: ttt…t iii…i oooo
  – tag: to check if we have the correct block
  – index: to select the block*
  – offset: byte within the block
Mapping a Memory Address to the Cache
• *direct-mapped ⇒ 1 block per "set"
• more generally, the index selects a set
Direct-Mapped Cache Example (1/3)
• Suppose we have 8 KB of data in a direct-mapped cache with 16-byte blocks
• Determine the size of the tag, index, and offset fields if we're using a 32-bit architecture
• Offset
  – need to specify the correct byte within a block
  – block contains 16 = 2^4 bytes
  – ⇒ need 4 bits to specify the correct byte

Direct-Mapped Cache Example (2/3)
• Index: (~index into an "array of blocks")
  – need to specify the correct block in the cache
  – cache contains 8 KB = 2^13 bytes; block contains 16 B = 2^4 bytes
  – # blocks/cache = (bytes/cache) / (bytes/block) = 2^13 / 2^4 = 2^9 blocks/cache
  – ⇒ need 9 bits to specify this many blocks

Direct-Mapped Cache Example (3/3)
• Tag: use the remaining bits as the tag
  – tag length = address length − offset − index = 32 − 4 − 9 = 19 bits
  – so the tag is the leftmost 19 bits of the memory address (a sketch of this split follows)
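A small C sketch of this 19/9/4 field split, applied to an illustrative address:

```c
#include <stdio.h>

/* The 19/9/4 split from the example: 32-bit address, 16-byte blocks
   (4 offset bits), 8 KB / 16 B = 512 lines (9 index bits). */
#define OFFSET_BITS 4
#define INDEX_BITS  9

int main(void) {
    unsigned addr   = 0x00000014;  /* illustrative address */
    unsigned offset = addr & ((1u << OFFSET_BITS) - 1);
    unsigned index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=%u index=%u offset=%u\n", tag, index, offset); /* 0 1 4 */
    return 0;
}
```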
Administration
• Midterms to be returned in Tu/W lab
• HW 8 (the last) out today, due ???
• Proj 4 out today, due ??
• Pick the due dates and plan RRR Week
16 KB Direct-Mapped Cache, 16 B Blocks
• Valid bit: determines whether anything is stored in that row (when the computer is initially turned on, all entries are invalid)
[Figure: 1024 rows (index 0–1023), each holding a valid bit, a tag, and four data words at byte offsets 0x0-3, 0x4-7, 0x8-b, 0xc-f; initially every valid bit is 0. The row layout is sketched below.]
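As an illustration of the row layout just described, a minimal C sketch under this slide's 16 KB / 16 B geometry (the struct is for exposition, not a hardware description):

```c
#include <stdbool.h>
#include <stdint.h>

/* One row of the 16 KB direct-mapped cache with 16-byte blocks:
   a valid bit, an 18-bit tag (held in a 32-bit field here), and
   four words covering byte offsets 0x0-3, 0x4-7, 0x8-b, 0xc-f. */
struct cache_row {
    bool     valid;    /* false at power-on: nothing stored yet */
    uint32_t tag;      /* identifies which memory block is held */
    uint32_t data[4];  /* 16 bytes of cached data               */
};

/* 1024 rows (index 0-1023); zero-initialized, so every row starts
   out invalid, matching the "just turned on" picture. */
struct cache_row cache_rows[1024];
```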
1. Load Byte 0x00000014
• Address = 000000000000000000 0000000001 0100 (tag field | index field | offset)
• So we read block 1 (index 0000000001)
• No valid data there yet, so load that data into the cache, setting the tag and valid bit
• Read from the cache at the offset; return word b
[Cache state: index 1 now holds valid = 1, tag = 0, data words d c b a (offsets 0xc-f, 0x8-b, 0x4-7, 0x0-3); all other entries remain invalid.]
2. Read Byte 0x0000001C
• Address = 000000000000000000 0000000001 1100 (tag field | index field | offset)
• Index 1 is valid, and the tag matches
• Hit: return word d
[Cache state unchanged: index 1 holds valid = 1, tag = 0, data d c b a.]
3. Load Byte 0x00000034
• Address = 000000000000000000 0000000011 0100 (tag field | index field | offset)
• So we read block 3: no valid data there yet
• Load that cache block; return word f
[Cache state: index 1 holds valid = 1, tag = 0, data d c b a; index 3 now holds valid = 1, tag = 0, data h g f e.]
4. Load Byte 0x00008014
• Address = 000000000000000010 0000000001 0100 (tag field | index field | offset)
• So we read cache block 1; the data there is valid
• But the tag does not match (0 ≠ 2)
• Miss, so replace block 1 with the new data & tag; return word j
[Cache state: index 1 now holds valid = 1, tag = 2, data l k j i; index 3 still holds valid = 1, tag = 0, data h g f e.]
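The whole four-access walkthrough can be checked with a tiny simulator. A minimal sketch in C, tracking only valid bits and tags (data movement is omitted), with the 18/10/4 field split used above:

```c
#include <stdio.h>

/* Sketch of the 16 KB direct-mapped cache with 16-byte blocks used above:
   1024 lines, 4 offset bits, 10 index bits, 18 tag bits (32-bit address). */
#define LINES 1024

struct line { int valid; unsigned tag; };
static struct line cache[LINES];   /* zero-initialized: all invalid */

static void access_addr(unsigned addr) {
    unsigned index = (addr >> 4) & (LINES - 1); /* 10 index bits        */
    unsigned tag   = addr >> 14;                /* 4 + 10 bits consumed */
    if (cache[index].valid && cache[index].tag == tag) {
        printf("0x%08x: hit\n", addr);
    } else {
        printf("0x%08x: miss%s\n", addr,
               cache[index].valid ? " (replace)" : "");
        cache[index].valid = 1;
        cache[index].tag   = tag;
    }
}

int main(void) {
    /* The four accesses from the walkthrough; expected output:
       miss, hit, miss, miss (replace). */
    unsigned addrs[] = { 0x00000014, 0x0000001C, 0x00000034, 0x00008014 };
    for (int i = 0; i < 4; i++)
        access_addr(addrs[i]);
    return 0;
}
```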
What to do on a write hit?
• Write-through
  – update the word in the cache block and the corresponding word in memory
• Write-back
  – update the word in the cache block; allow the memory word to be "stale"
  – ⇒ add a 'dirty' bit to each block, indicating that memory needs to be updated when the block is replaced
  – ⇒ the OS flushes the cache before I/O…
• Performance trade-offs? (the two policies are contrasted in the sketch below)
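A sketch in C of where the two policies differ on a write hit; mem_write is a hypothetical stub standing in for the memory bus, and the line layout is illustrative:

```c
#include <stdbool.h>
#include <stdio.h>

struct line { bool valid, dirty; unsigned tag; unsigned data[4]; };

static void mem_write(unsigned addr, unsigned word) {
    printf("  memory[0x%x] <- %u\n", addr, word);  /* stub for the bus */
}

/* Write-through: update the cached word and memory together. */
static void write_hit_through(struct line *l, unsigned addr, unsigned word) {
    l->data[(addr >> 2) & 3] = word;   /* word within the 16-byte block */
    mem_write(addr, word);             /* memory never goes stale       */
}

/* Write-back: update only the cache and mark the line dirty;
   memory is updated later, when the line is evicted. */
static void write_hit_back(struct line *l, unsigned addr, unsigned word) {
    l->data[(addr >> 2) & 3] = word;
    l->dirty = true;                   /* remember that memory is stale */
}

int main(void) {
    struct line l = { .valid = true };
    printf("write-through:\n");
    write_hit_through(&l, 0x14, 42);
    printf("write-back (no memory traffic yet):\n");
    write_hit_back(&l, 0x14, 43);
    printf("  dirty=%d\n", l.dirty);
    return 0;
}
```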
Types of Cache Misses (Three C's)
• 1st C: Compulsory misses
  – occur when a program is first started
  – the cache does not contain any of that program's data yet, so misses are bound to occur
  – reduced with increasing block size
• 2nd C: Conflict misses
  – misses that occur because two distinct memory addresses map to the same cache line
  – when both are needed, they keep overwriting each other
• Dealing with conflict misses
  – Solution 1: make the cache size bigger
    » more lines, fewer conflicts
    » conflicts far apart in the address space remain
  – Solution 2: allow multiple distinct blocks at the same cache index
Fully Associative Cache (B=32)
• Any block can go anywhere
• Memory address fields:
  – Offset: byte within the block
  – Index: none
  – Tag: all the rest
• Compare all tags in parallel
[Figure: each line holds a valid bit, a 27-bit cache tag, and data bytes B0–B31; the incoming tag is compared (=) against every stored tag simultaneously, and the byte offset selects within the matching block.]
Types of Cache Misses (Three C's)
• 3rd C: Capacity misses
  – misses that occur because the cache has a limited size
  – misses that would not occur if we increased the size of the cache
N-Way Set Associative Cache
• Basic idea
  – direct-map to a set, then associative lookup of the N blocks within it
• Memory address fields:
  – Tag: same as before
  – Offset: same as before
  – Index: points us to the correct "row" (called a set in this case)
• Given a memory address (see the lookup sketch below):
  – find the correct set using the index value
  – compare the tag with all tag values in the determined set
  – if a match occurs, hit!, otherwise a miss
  – finally, use the offset field as usual to find the desired data within the block
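A minimal C sketch of the lookup just described; the geometry parameters (N = 4, 64 sets, 16-byte blocks) are illustrative assumptions, not a design fixed by the slides:

```c
#include <stdbool.h>
#include <stdio.h>

#define N           4
#define SETS        64   /* 6 index bits  */
#define OFFSET_BITS 4    /* 16-byte blocks */
#define INDEX_BITS  6

struct way { bool valid; unsigned tag; unsigned char data[16]; };

struct way cache[SETS][N];

/* Returns the matching way on a hit, NULL on a miss. Hardware would
   perform the N tag comparisons in parallel rather than in a loop. */
struct way *lookup(unsigned addr) {
    unsigned index = (addr >> OFFSET_BITS) & (SETS - 1);
    unsigned tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    for (int w = 0; w < N; w++)
        if (cache[index][w].valid && cache[index][w].tag == tag)
            return &cache[index][w];
    return NULL;
}

int main(void) {
    unsigned addr = 0x1230;
    /* Install the block for addr in way 0 of its set, then probe. */
    unsigned index = (addr >> OFFSET_BITS) & (SETS - 1);
    cache[index][0].valid = true;
    cache[index][0].tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("0x%05x -> %s\n", addr, lookup(addr) ? "hit" : "miss");
    printf("0x%05x -> %s\n", addr + 0x10000,
           lookup(addr + 0x10000) ? "hit" : "miss"); /* same set, new tag */
    return 0;
}
```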
Associative Cache Example
• 2-way set-associative cache
[Figure: memory addresses 0–F mapping onto a cache with two sets (index 0 and 1), each holding two blocks.]

4-Way Set Associative Cache Circuit
[Figure: the index selects a set; the tag is compared against all four ways in parallel.]
Block Replacement Policy
• Direct-mapped cache
  – the index completely specifies which position a block can go in on a miss
• N-way set associative
  – the index specifies a set, but the block can occupy any position within the set on a miss
• Fully associative
  – the block can be written into any position
• Question: if we have the choice, where should we write an incoming block?
  – If there are any locations with the valid bit off (empty), then usually write the new block into the first one.
  – If all possible locations already have a valid block, we must pick a replacement policy: the rule by which we determine which block gets "cached out" on a miss.
Block Replacement Policy: LRU
• LRU (Least Recently Used)
  – Idea: cache out the block which has been accessed (read or write) least recently
  – Pro: temporal locality ⇒ recent past use implies likely future use; in fact, this is a very effective policy
  – Con: with 2-way set assoc, easy to keep track (one LRU bit); with 4-way or greater, requires complicated hardware and much time to keep track of this
Block Replacement Example
• We have a 2-way set-associative cache with a four-word total capacity and one-word blocks. We perform the following word accesses (ignore bytes for this problem):
  0, 2, 0, 1, 4, 0, 2, 3, 5, 4
• How many hits and how many misses will there be under the LRU block replacement policy?
Block Replacement: LRU
• Addresses: 0, 2, 0, 1, 4, 0, …
  – 0: miss, bring into set 0 (loc 0)
  – 2: miss, bring into set 0 (loc 1)
  – 0: hit
  – 1: miss, bring into set 1 (loc 0)
  – 4: miss, bring into set 0 (loc 1, replacing 2)
  – 0: hit
[Figure: set contents after each access, with the LRU marker moving between loc 0 and loc 1 of each set; even addresses map to set 0, odd addresses to set 1. The full ten-access sequence is simulated in the sketch below.]
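The full ten-access question can be checked with a small simulation. A sketch in C, assuming a single LRU bit per set (which is exact LRU for a 2-way cache and fills empty ways first under this initialization):

```c
#include <stdio.h>

/* The slide's cache: 2-way set associative, four one-word blocks total
   (2 sets), LRU replacement. lru[set] names the way to evict next. */
#define SETS 2
#define WAYS 2

static int tags[SETS][WAYS];
static int valid[SETS][WAYS];
static int lru[SETS];

static int access_word(int addr) {
    int set = addr % SETS, tag = addr / SETS;
    for (int w = 0; w < WAYS; w++)
        if (valid[set][w] && tags[set][w] == tag) {
            lru[set] = 1 - w;          /* the other way is now LRU */
            return 1;                  /* hit */
        }
    int w = lru[set];                  /* evict LRU way (or fill empty) */
    valid[set][w] = 1; tags[set][w] = tag;
    lru[set] = 1 - w;
    return 0;                          /* miss */
}

int main(void) {
    int trace[] = {0, 2, 0, 1, 4, 0, 2, 3, 5, 4}, hits = 0;
    for (int i = 0; i < 10; i++) {
        int h = access_word(trace[i]);
        hits += h;
        printf("%d: %s\n", trace[i], h ? "hit" : "miss");
    }
    /* Matches the hand trace above for the first six accesses;
       the full sequence gives 2 hits, 8 misses. */
    printf("%d hits, %d misses\n", hits, 10 - hits);
    return 0;
}
```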
Big Idea
• How to choose between associativity, block size, replacement & write policy?
• Design against a performance model
  – minimize: Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
  – influenced by technology & program behavior
• Create the illusion of a memory that is large, cheap, and fast – on average
• How can we improve miss penalty?
Improving Miss Penalty
• When caches first became popular, miss penalty ≈ 10 processor clock cycles
• Today: a 2400 MHz processor (0.4 ns per clock cycle) and 80 ns to go to DRAM ⇒ 200 processor clock cycles!
• Solution: another cache between memory and the processor cache: the second-level (L2) cache
[Figure: Proc → $ → $2 → DRAM (MEM).]
And in Conclusion…
• We would like to have the capacity of disk at the speed of the processor: unfortunately this is not feasible.
• So we create a memory hierarchy:
  – each successively lower level contains the "most used" data from the next higher level
  – exploits temporal & spatial locality
  – do the common case fast, worry less about the exceptions (design principle of MIPS)
• Locality of reference is a Big Idea
And in Conclusion…
• Mechanism for transparent movement of data among levels of a storage hierarchy
  – set of address/value bindings
  – address ⇒ index to a set of candidates
  – compare desired address with tag
  – service hit or miss
    » load new block and binding on miss
[Example row: index 1 with valid = 1, tag = 0, data d c b a; address 000000000000000000 0000000001 1100 splits into tag, index, offset.]
And in Conclusion…
• We've discussed memory caching in detail. Caching in general shows up over and over in computer systems
  – filesystem cache, web page cache, game databases / tablebases, software memoization, others?
• Big idea: if something is expensive but we want to do it repeatedly, do it once and cache the result.
• Cache design choices:
  – size of cache: speed vs. capacity
  – block size (i.e., cache aspect ratio)
  – write policy (write-through vs. write-back)
  – associativity: choice of N (direct-mapped vs. set vs. fully associative)
  – block replacement policy
  – 2nd-level cache? 3rd-level cache?
• Use a performance model to pick between choices, depending on programs, technology, budget, …
Bonus Slides
• These are extra slides that used to be included in lecture notes, but have been moved to this, the "bonus" area, to serve as a supplement.
• The slides will appear in the order they would have in the normal presentation.
TIO: The Great Cache Mnemonic
• AREA (cache size, B) = HEIGHT (# of blocks) × WIDTH (size of one block, B/block)
  – 2^(H+W) = 2^H × 2^W
• Tag | Index | Offset
• Ex.: 16 KB of data, direct-mapped, 4-word blocks
  – Can you work out height, width, area? (checked in the sketch below)
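A quick check of the exercise in C, computing height, width, and the resulting T/I/O bit widths:

```c
#include <stdio.h>

/* 16 KB of data, direct-mapped, 4-word blocks.
   AREA = HEIGHT * WIDTH, i.e. 2^(H+W) = 2^H * 2^W. */
int main(void) {
    unsigned area   = 16 * 1024;     /* cache size in bytes          */
    unsigned width  = 4 * 4;         /* 4 words/block * 4 B/word     */
    unsigned height = area / width;  /* number of blocks = 1024      */
    /* W = log2(16) = 4 offset bits; H = log2(1024) = 10 index bits. */
    printf("height=%u blocks, width=%u B -> offset=4, index=10, tag=%u bits\n",
           height, width, 32 - 4 - 10);
    return 0;
}
```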
• Read 4 addresses:
  1. 0x00000014
  2. 0x0000001C
  3. 0x00000034
  4. 0x00008014
• Memory values here:

  Address (hex)   Value of word
  00000010        a
  00000014        b
  00000018        c
  0000001C        d
  …               …
  00000030        e
  00000034        f
  00000038        g
  0000003C        h
  …               …
  00008010        i
  00008014        j
  00008018        k
  0000801C        l
Accessing Data in a Direct-Mapped Cache
• 4 addresses:
  – 0x00000014, 0x0000001C, 0x00000034, 0x00008014
• 4 addresses divided (for convenience) into Tag, Index, Byte Offset fields:
  000000000000000000 0000000001 0100
  000000000000000000 0000000001 1100
  000000000000000000 0000000011 0100
  000000000000000010 0000000001 0100
  (Tag | Index | Offset)
Accessing Data in a Direct-Mapped Cache
• Do an example yourself. What happens?
• Choose from — Cache: hit, miss, miss with replace; values returned: a, b, c, d, e, …, k, l
• Read address 0x00000030? 000000000000000000 0000000011 0000
• Read address 0x0000001C? 000000000000000000 0000000001 1100
[Cache state: index 1 holds valid = 1, tag = 2, data l k j i; index 3 holds valid = 1, tag = 0, data h g f e; all other entries invalid.]
Answers
• 0x00000030: a hit
  – index = 3, tag matches, offset = 0, value = e
• 0x0000001C: a miss
  – index = 1, tag mismatch, so replace from memory; offset = 0xC, value = d
• Since these are reads, the values must equal the memory values whether or not they are cached:
  – 0x00000030 = e
  – 0x0000001C = d
Block Size Tradeoff (1/3)
• Benefits of larger block size
  – Spatial locality: if we access a given word, we're likely to access other nearby words soon
  – Very applicable with the stored-program concept: if we execute a given instruction, it's likely that we'll execute the next few as well
  – Works nicely in sequential array accesses too

Block Size Tradeoff (2/3)
• Drawbacks of larger block size
  – Larger block size means larger miss penalty
    » on a miss, it takes longer to load a new block from the next level
  – If block size is too big relative to cache size, then there are too few blocks
    » result: miss rate goes up
• In general, minimize Average Memory Access Time (AMAT) = Hit Time + Miss Penalty × Miss Rate
Block Size Tradeoff (3/3)
• Hit time
  – time to find and retrieve data from the current level cache
• Miss penalty
  – average time to retrieve data on a current-level miss (includes the possibility of misses on successive levels of the memory hierarchy)
• Hit rate
  – % of requests that are found in the current level cache
• Miss rate
  – 1 − hit rate
Extreme Example: One Big Block
• Cache size = 4 bytes, block size = 4 bytes
  – only ONE entry (row) in the cache!
• If an item is accessed, it is likely to be accessed again soon
  – but unlikely to be accessed again immediately!
• The next access will likely be a miss again
  – continually loading data into the cache but discarding it (forcing it out) before it is used again
  – nightmare for the cache designer: the Ping-Pong Effect
[Figure: a single row with a valid bit, tag, and bytes B0–B3.]
Block Size Tradeoff Conclusions
[Figure: three curves vs. block size — miss penalty increases with block size; miss rate first falls (exploits spatial locality) and then rises (fewer blocks compromises temporal locality); average access time therefore has a minimum, with increased miss penalty & miss rate at large block sizes.]
Analyzing a Multi-Level Cache Hierarchy
[Figure: Proc → $ → $2 → DRAM, annotated with L1 hit time, L1 miss rate, L1 miss penalty and L2 hit time, L2 miss rate, L2 miss penalty.]
• Avg Mem Access Time = L1 Hit Time + L1 Miss Rate × L1 Miss Penalty
• L1 Miss Penalty = L2 Hit Time + L2 Miss Rate × L2 Miss Penalty
• ⇒ Avg Mem Access Time = L1 Hit Time + L1 Miss Rate × (L2 Hit Time + L2 Miss Rate × L2 Miss Penalty)
Example
• Assume
  – hit time = 1 cycle
  – miss rate = 5%
  – miss penalty = 20 cycles
• Calculate AMAT:
  – Avg mem access time = 1 + 0.05 × 20 = 1 + 1 = 2 cycles
Ways to Reduce Miss Rate
• Larger cache
  – limited by cost and technology
  – hit time of the first-level cache must stay below the cycle time (bigger caches are slower)
• More places in the cache to put each block of memory: associativity
  – fully associative
    » any block in any line
  – N-way set associative
    » N places for each block
    » direct-mapped: N = 1
Typical Scale
• L1
  – size: tens of KB
  – hit time: complete in one clock cycle
  – miss rates: 1–5%
• L2
  – size: hundreds of KB
  – hit time: a few clock cycles
  – miss rates: 10–20%
• The L2 miss rate is the fraction of L1 misses that also miss in L2
  – why so high?
Example: with L2 Cache
• Assume
  – L1 hit time = 1 cycle
  – L1 miss rate = 5%
  – L2 hit time = 5 cycles
  – L2 miss rate = 15% (% of L1 misses that miss)
  – L2 miss penalty = 200 cycles
• L1 miss penalty = 5 + 0.15 × 200 = 35 cycles
• Avg mem access time = 1 + 0.05 × 35 = 2.75 cycles

Example: without L2 Cache
• Assume
  – L1 hit time = 1 cycle
  – L1 miss rate = 5%
  – L1 miss penalty = 200 cycles
• Avg mem access time = 1 + 0.05 × 200 = 11 cycles
• 4× faster with the L2 cache! (2.75 vs. 11; both reproduced in the sketch below)
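Both examples follow from the multi-level AMAT formula on the earlier slide; a small C sketch:

```c
#include <stdio.h>

/* AMAT with an L2: the L1 miss penalty is itself an AMAT into L2. */
static double amat2(double l1_hit, double l1_miss_rate,
                    double l2_hit, double l2_miss_rate, double l2_penalty) {
    double l1_penalty = l2_hit + l2_miss_rate * l2_penalty;
    return l1_hit + l1_miss_rate * l1_penalty;
}

int main(void) {
    double with_l2    = amat2(1, 0.05, 5, 0.15, 200); /* 2.75 cycles */
    double without_l2 = 1 + 0.05 * 200;               /* 11 cycles   */
    printf("with L2: %.2f cycles, without: %.2f cycles, speedup %.1fx\n",
           with_l2, without_l2, without_l2 / with_l2);
    return 0;
}
```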
An Actual CPU – Early PowerPC
• Cache
  – 32 KB instruction and 32 KB data L1 caches
  – external L2 cache interface with integrated controller and cache tags; supports up to 1 MB of external L2 cache
  – dual Memory Management Units (MMU) with Translation Lookaside Buffers (TLB)
• Pipelining
  – superscalar (3 inst/cycle)
  – 6 execution units (2 integer and 1 double-precision IEEE floating point)

An Actual CPU – Pentium M
[Figure: die photo showing the 32 KB I$ and 32 KB D$.]