Cache Memory - UFRGS (flavio/ensino/cmp502/memoria.pdf, 2002-11-07)
Page 1: Cache Memory

CSE 141 Carro

Cache Memory

Technology Trends

DRAM generations:

  Year   Size     Cycle Time
  1980   64 Kb    250 ns
  1983   256 Kb   220 ns
  1986   1 Mb     190 ns
  1989   4 Mb     165 ns
  1992   16 Mb    145 ns
  1995   64 Mb    120 ns

Growth trends, capacity vs. speed (latency):

          Capacity          Speed (latency)
  Logic:  2x in 3 years     2x in 3 years
  DRAM:   4x in 3 years     2x in 10 years
  Disk:   4x in 3 years     2x in 10 years
          (1000:1!)         (2:1!)

Page 2

Who Cares About the Memory Hierarchy?

[Figure: Processor-DRAM performance gap (latency), 1980-2000, log scale. Processor performance grows ~60%/yr (2x every 1.5 years, "Moore's Law"); DRAM performance grows ~9%/yr (2x every 10 years). The processor-memory performance gap grows about 50% per year.]

Impact on Performance

• Suppose a processor executes a program with:
  – CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get a miss penalty of 50 cycles.
• CPI = ideal CPI + average stalls per instruction
• Impact of data misses:

  = 1.1 cycles/instr + 0.30 data mops/instr x 0.10 misses/data mop x 50 cycles/miss
  = 1.1 cycles + 1.5 cycles = 2.6 cycles

• 58% of the time the processor is stalled waiting for memory!
• A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!

  Ideal CPI: 1.1   Data miss: 1.5   Inst miss: 0.5
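The arithmetic above can be replayed in a short sketch (variable names are mine; the values are from the slide):

```python
# Replaying the slide's CPI arithmetic (values from the example above).
ideal_cpi = 1.1
loadstore_frac = 0.30     # 30% of instructions are loads/stores
data_miss_rate = 0.10     # 10% of memory operations miss
miss_penalty = 50         # cycles per miss

data_stalls = loadstore_frac * data_miss_rate * miss_penalty  # 1.5 cycles/instr
cpi = ideal_cpi + data_stalls                                 # 2.6 cycles/instr
stall_fraction = data_stalls / cpi                            # ~0.58

inst_stalls = 0.01 * miss_penalty  # a 1% instruction miss rate adds 0.5 cycles
```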

Page 3

Why hierarchy works

• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.

[Figure: probability of reference plotted over the address space 0 .. 2^n - 1, with accesses clustered in a few small regions.]

Memory Locality

• Memory hierarchies take advantage of memory locality.
• Memory locality is the principle that future memory accesses are near past accesses.
• Memories take advantage of two types of locality:
  – Temporal locality -- near in time => we will often access the same data again very soon.
  – Spatial locality -- near in space/distance => our next access is often very close to our last access (or recent accesses).

This sequence of addresses exhibits both temporal and spatial locality:
1, 2, 3, 1, 2, 3, 8, 8, 47, 9, 10, 8, 8, ...

Page 4

Locality and caching

• Memory hierarchies exploit locality by caching (keeping close to the processor) data likely to be used again.
• This is done because we can build large, slow memories and small, fast memories, but we can't build large, fast memories.
• If it works, we get the illusion of SRAM access time with disk capacity.

SRAM access times are 2-25 ns, at a cost of $100 to $250 per Mbyte.
DRAM access times are 60-120 ns, at a cost of $5 to $10 per Mbyte.
Disk access times are 10 to 20 million ns, at a cost of $0.10 to $0.20 per Mbyte.

A typical memory hierarchy

[Figure: CPU at the top, then on-chip cache, off-chip cache, main memory, and disk. Levels close to the CPU are small and expensive per bit; levels far away are big and cheap per bit.]

• So then, where are my program and data?

Page 5

Cache Fundamentals

• cache hit -- an access where the data is found in the cache.
• cache miss -- an access where it isn't.
• hit time -- the time to access the cache.
• miss penalty -- the time to move data from the further level to the closer one, then to the CPU.
• hit ratio -- the percentage of accesses where the data is found in the cache.
• miss ratio = 1 - hit ratio.

[Diagram: CPU -> lowest-level cache -> next-level memory/cache.]
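These definitions combine into the standard average-memory-access-time formula; a minimal sketch (the function name and example numbers are mine):

```python
def amat(hit_time, miss_ratio, miss_penalty):
    """Average memory access time: every access pays the hit time;
    the fraction of accesses that miss additionally pays the miss penalty."""
    return hit_time + miss_ratio * miss_penalty

# e.g. a 1-cycle hit, a 5% miss ratio, and a 50-cycle miss penalty:
avg = amat(1, 0.05, 50)   # 3.5 cycles per access on average
```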

Cache Fundamentals, cont.

• cache block size or cache line size -- the amount of data that gets transferred on a cache miss.
• instruction cache -- a cache that only holds instructions.
• data cache -- a cache that only holds data.
• unified cache -- a cache that holds both.


Page 6

Caching Issues

On a memory access -

• How do I know if this is a hit or miss?

On a cache miss -

• where to put the new data?

• what data to throw out?

• how to remember what data this is?

[Diagram: an access goes from the CPU to the lowest-level cache; on a miss it goes on to the next-level memory/cache.]

First Example of a Cache

• Our first example:
  – block size is one word of data
  – "direct mapped"

For each item of data at the lower level, there is exactly one location in the cache where it might be.
e.g., lots of items at the lower level share locations in the upper level.

Page 7

Direct Mapped Cache

• Mapping: address modulo the number of blocks in the cache.

[Figure: an 8-entry cache (indices 000-111) below a memory. Addresses are placed at index = address mod 8; e.g., addresses 00001, 01001, 10001, 11001 all map to index 001, and 00101, 01101, 10101, 11101 all map to index 101.]
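The modulo mapping can be sketched in one line (the function name is mine; the addresses are the ones in the figure):

```python
def dm_index(address, num_blocks):
    """Direct-mapped placement: each memory block has exactly one possible
    cache location, chosen by address mod num_blocks."""
    return address % num_blocks

# Classic 8-block example: 0b00001 and 0b11001 collide at index 001,
# while 0b00101 lands at index 101.
```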

Direct Mapped Cache

• For MIPS:

[Figure (address shown with bit positions): the 32-bit address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0). The index selects one of 1024 entries (0-1023), each holding a valid bit, a 20-bit tag, and 32 bits of data. Hit is asserted when the entry is valid and its stored tag equals the address tag; Data is the entry's 32-bit word.]

What kind of locality are we taking advantage of?
Why must we have the valid bit?

Page 8

How many bits in a cache, anyway?

• Suppose a direct-mapped cache with 64 KB of data, one-word blocks, and 32-bit addresses.
• 64 KB -> 16K words = 2^14 words, and in this case 2^14 blocks.
• Each block has 32 bits of data plus a tag (32 - 14 - 2 = 16 bits) plus a valid bit. So:
• 2^14 x (32 + 16 + 1) = 2^14 x 49 = 784 x 2^10 = 784 Kbits
• Another way: 98 KB of storage for 64 KB of data -- about 1.5 times larger.
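The storage-overhead count above can be checked with a short sketch (variable names are mine; the parameters are from the slide):

```python
# Storage for a 64 KB direct-mapped cache with one-word (4-byte) blocks
# and 32-bit addresses, as counted on the slide.
data_bytes = 64 * 1024
word_bytes = 4
num_blocks = data_bytes // word_bytes            # 2**14 blocks
offset_bits = 2                                  # byte offset within a word
index_bits = 14                                  # log2(num_blocks)
tag_bits = 32 - index_bits - offset_bits         # 16
bits_per_block = 32 + tag_bits + 1               # data + tag + valid = 49
total_kbits = num_blocks * bits_per_block / 1024 # 784.0 Kbits
```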

What happens in a cache miss?

• Read: send the address to main memory and wait.
• Write: if we write only in the cache, we have an inconsistency!
• If we always write to cache AND memory, we do a write-through.
• What is our penalty?
• Solution 1: write buffer
• Solution 2: write back

Page 9

Direct Mapped Cache and Locality

• Taking advantage of spatial locality:

[Figure (address shown with bit positions): a cache with four-word (16-byte) blocks. The address splits into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2), and a byte offset (bits 1-0). The index selects one of 4K entries, each with a valid bit, a 16-bit tag, and 128 bits of data; on a hit, a mux uses the block offset to pick one of the four 32-bit words.]

What about misses now?

• Read misses: get a full block from main memory.
• Writes: if we write only one word, the rest of the block will be in what state?
  (We write here -- the rest of the block might belong to someone else!)
• Solution: compare the tag; if it is a miss, read the block first, and then write! Cost?

Page 10

Block Size and Miss Rate

[Figure: miss rate (0% to 40%) versus block size (up to 256 bytes), one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.]

The effect of block size

[Figure: three sketches versus block size. Miss penalty grows with block size. Miss rate first falls (exploits spatial locality), then rises (fewer blocks compromises temporal locality). Average access time eventually increases because of the combined increase in miss penalty and miss rate.]

As block size increases, there will be fewer blocks, and not all data in a block gets used! The overall effect is a strong increase in access time.

Page 11

Performance, that’s what we want!

• CPU time = (CPU execution clock cycles + memory stall cycles) x cycle time

  Memory stall cycles = Inst/Prog x Misses/Inst x Miss penalty

Example: assume an instruction cache miss rate of 2%, a data miss rate of 4%, a machine with CPI = 2, and a miss penalty of 40 cycles. Assume a program mix with 36% loads and stores, and compare this machine with one with a perfect cache.

Continuing the example I

The number of cycles instruction misses cost us is:

  I x 2% x 40 = 0.8 I  (we increase the total cycles by this amount)

Data miss cycles: I x 36% x 4% x 40 = 0.56 I

Total cycle increase: (0.8 + 0.56) I = 1.36 I

The CPI considering memory stalls is 2 + 1.36 = 3.36

CPI_stall / CPI_perfect = 3.36 / 2 = 1.68

[The slide annotates the factors: the number of cycles a miss takes, the percentage of instruction misses, the percentage of loads/stores, and the percentage of data misses.]
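Redoing the slide's arithmetic as a sketch (variable names are mine; note the exact data term is 0.36 x 0.04 x 40 = 0.576 -- the slide presumably rounds 0.36 x 0.04 to 0.014, which gives its 0.56 and hence CPI 3.36):

```python
# Replaying the stall-cycle example; exact arithmetic, with the slide's
# rounded values noted in the comments.
base_cpi = 2
miss_penalty = 40
inst_stalls = 0.02 * miss_penalty           # 0.8 cycles/instr
data_stalls = 0.36 * 0.04 * miss_penalty    # 0.576 (slide rounds to 0.56)
cpi_stall = base_cpi + inst_stalls + data_stalls  # ~3.38 (slide: 3.36)
slowdown = cpi_stall / base_cpi             # ~1.69x vs. a perfect cache
```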

Page 12

Continuing the example II

Repeat the problem with CPI = 1 (thanks to a better pipeline, for example).

The CPI considering memory stalls is 1 + 1.36 = 2.36
CPI_stall / CPI_perfect = 2.36 / 1 = 2.36 -- we are losing performance!

Time in stalls:

  1.36 / 3.36 = 41% in the first case
  1.36 / 2.36 = 58% in the second case!

Continuing the example III

Repeat the problem with CPI = 2, but suppose a CPU with double the clock frequency (thanks to technology improvements, for example).

If we do not change the memory hierarchy, the cost in cycles for a miss will move from 40 to 80!

  I x 2% x 80 = 1.6 I  (we increase the total cycles by this amount)
  Data miss cycles: I x 36% x 4% x 80 = 1.152 I
  Total cycle increase: (1.6 + 1.152) I = 2.752 I
  The CPI considering memory stalls is 2 + 2.752 = 4.752
  CPI_newclock / CPI_oldclock = 4.752 / 3.36 = 1.41

We have doubled the clock (and hence the power) but not the performance...
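The punch line above can be made concrete by comparing execution times, since time per instruction is CPI x cycle time (a sketch; names are mine, numbers from the slides):

```python
# Doubling the clock doubles the miss penalty measured in cycles,
# so the real speedup is far less than 2x.
base_cpi = 2
inst_stalls = 0.02 * 80              # 1.6 cycles/instr
data_stalls = 0.36 * 0.04 * 80       # 1.152 cycles/instr
cpi_fast = base_cpi + inst_stalls + data_stalls   # 4.752
cpi_slow = 3.36                      # from "Continuing the example I"

# The fast machine's cycle time is half, so time/instr = CPI * T/2:
speedup = cpi_slow / (cpi_fast / 2)  # ~1.41, not 2
```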

Page 13

To remember!

• A cache miss has the same effect as a wrong branch prediction:
• THERE IS AN INCREASE OF THE CPI
• This is the clue on HOW to compare machines. All the rest is just common sense.

Extreme Example: single big line

• Cache size = 4 bytes, block size = 4 bytes
  – Only ONE entry in the cache
• If an item is accessed, it is likely to be accessed again soon
  – But it is unlikely to be accessed again immediately!!!
  – The next access will likely be a miss again
  – We continually load data into the cache but discard (force out) it before it is used again
  – The worst nightmare of a cache designer: the Ping-Pong Effect
• Conflict misses are misses caused by:
  – Different memory locations mapped to the same cache index
  – Solution 1: make the cache size bigger
  – Solution 2: multiple entries for the same cache index

[Figure: a one-entry cache holding a valid bit, a cache tag, and bytes 0-3 of data.]

Page 14

Another Extreme Example: Fully Associative

• Fully associative cache:
  – Forget about the cache index
  – Compare the cache tags of all cache entries in parallel
  – Example: with a block size of 32 B, we need N 27-bit comparators
• By definition: conflict misses = 0 for a fully associative cache

[Figure: a fully associative cache. Each entry holds a valid bit, a 27-bit cache tag (address bits 31-5), and a 32-byte block (bytes 0-31, 32-63, ...); the byte select uses the low address bits (e.g., 0x01). The incoming cache tag is compared against every entry's tag in parallel.]

CSE 141 Carro

A Two-way Set Associative Cache• N-way set associative: N entries for each Cache Index

– N direct mapped caches operates in parallel

• Example: Two-way set associative cache– Cache Index selects a “set” from the cache

– The two tags in the set are compared in parallel

– Data is selected based on the tag result

Cache Data

Cache Block 0

Cache TagValid

:: :

Cache Data

Cache Block 0

Cache Tag Valid

: ::

Cache Index

Mux 01Sel1 Sel0

Cache Block

CompareAdr Tag

Compare

OR

Hit

Page 15

Disadvantage of Set Associative Cache

• N-way set associative cache versus direct-mapped cache:
  – N comparators vs. 1
  – Extra MUX delay for the data
  – Data comes AFTER the hit/miss decision and set selection
• In a direct-mapped cache, the cache block is available BEFORE hit/miss:
  – Possible to assume a hit and continue; recover later if it was a miss.
• How do we know what to remove (evict) from the cache?

[Figure: the same two-way set associative datapath as on the previous slide.]

A Summary on Sources of Cache Misses

• Compulsory (cold start or process migration; first reference): the first access to a block
  – "Cold" fact of life: not a whole lot you can do about it
  – Note: if you are going to run "billions" of instructions, compulsory misses are insignificant
• Conflict (collision):
  – Multiple memory locations mapped to the same cache location
  – Solution 1: increase cache size
  – Solution 2: increase associativity
• Capacity:
  – The cache cannot contain all blocks accessed by the program
  – Solution: increase cache size
• Invalidation: another process (e.g., I/O) updates memory

Page 16

Sources of Cache Misses Answer

                     Direct Mapped   N-way Set Associative   Fully Associative
  Cache Size         Big             Medium                  Small
  Compulsory Miss    Same            Same                    Same
  Conflict Miss      High            Medium                  Zero
  Capacity Miss      Low             Medium                  High
  Invalidation Miss  Same            Same                    Same

Note: if you are going to run "billions" of instructions, compulsory misses are insignificant.

Accessing a Sample Cache

• 64 KB cache, direct-mapped, 32-byte cache block size

[Figure: the 32-bit address splits into a 16-bit tag (bits 31-16), an 11-bit index (bits 15-5), and a 5-bit word/byte offset. 64 KB / 32 bytes = 2K cache blocks/sets (indices 0-2047); each entry holds a valid bit, a 16-bit tag, and 256 bits (32 bytes) of data. The stored tag is compared with the address tag to produce hit/miss.]

Page 17

Accessing a Sample Cache

• 32 KB cache, 2-way set-associative, 16-byte block size

[Figure: the 32-bit address splits into an 18-bit tag (bits 31-14), a 10-bit index (bits 13-4), and a 4-bit word/byte offset. 32 KB / 16 bytes / 2 = 1K cache sets (indices 0-1023); each set holds two (valid, tag, data) entries, and both stored tags are compared with the address tag in parallel to produce hit/miss.]
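The address splits in the two worked examples above follow one recipe; a sketch of it (the function and names are mine):

```python
def split_address(addr, cache_bytes, block_bytes, ways):
    """Split a 32-bit address into (tag, index, offset, tag_bits) for a
    cache of the given size, block size, and associativity."""
    num_sets = cache_bytes // block_bytes // ways
    offset_bits = block_bytes.bit_length() - 1   # log2(block_bytes)
    index_bits = num_sets.bit_length() - 1       # log2(num_sets)
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset, 32 - offset_bits - index_bits

# 64 KB direct-mapped, 32-byte blocks -> 2K sets, 11 index bits, 16 tag bits
# 32 KB two-way,       16-byte blocks -> 1K sets, 10 index bits, 18 tag bits
```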

Cache Associativity

[Figure: miss rate (0% to 15%) versus associativity (one-way, two-way, four-way, eight-way), one curve per cache size from 1 KB to 128 KB.]

Page 18

Cache Miss Components

[Figure: miss rate per type (0% to 14%) versus cache size (1 to 128 KB), with curves for one-way, two-way, four-way, and eight-way associativity; the capacity-miss component is shown as the lower region of the graph.]

LRU replacement algorithms

• Only needed for associative caches.
• Requires one bit per set for 2-way set-associative, 8 bits for 4-way, 24 bits for 8-way.
• Can be emulated with log n bits (NMRU).
• Can be emulated with use bits for highly associative caches (like page tables).
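As a behavioral sketch of LRU within one set -- not the bit-level encodings the slide counts, and the class name is mine:

```python
class LRUSet:
    """LRU replacement for one cache set, modeled with an ordered list."""
    def __init__(self, ways):
        self.ways = ways
        self.tags = []            # least recently used first, MRU last

    def access(self, tag):
        """Return True on hit; on a miss, insert tag, evicting the LRU if full."""
        if tag in self.tags:
            self.tags.remove(tag)
            self.tags.append(tag)  # refresh recency
            return True
        if len(self.tags) == self.ways:
            self.tags.pop(0)       # evict least recently used
        self.tags.append(tag)
        return False

s = LRUSet(2)
s.access('A'); s.access('B'); s.access('A')
s.access('C')                      # evicts B (the LRU entry), keeps A
```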

Page 19

Caches in Current Processors

• Often DM at the highest level (closest to the CPU), associative further away.
• Split I and D caches close to the processor (for throughput rather than miss rate), unified further away.
• Write-through and write-back both common, but never write-through all the way to memory.
• 32-byte cache lines very common (but getting larger -- 64, 128).
• Non-blocking:
  – The processor doesn't stall on a miss, but only on the use of a miss (if even then).
  – This means the cache must be able to handle multiple outstanding accesses.

Key Points

• Caches give the illusion of a large, cheap memory with the access time of a fast, expensive memory.
• Caches take advantage of memory locality, specifically temporal locality and spatial locality.
• Cache design presents many options (block size, cache size, associativity, write policy) that an architect must combine to minimize miss rate and access time, to maximize performance.
• Cache misses increase the CPI.

Page 20

Virtual Memory

The magician’s approach to hardware

The problem:

• Our computer has 32 Kbytes of main memory. How can we:

a) run programs that use more than 32 Kbytes?

   Divide the program into chunks that fit; let the user worry about how to bring each chunk from disk to memory at the right time.

b) allow multiple users to use our computer?

   Pay someone to look at each program and do the above tasks, just thinking of the multiple users' programs as one huge single program (this is true!).

Page 21

Virtual Memory

• It’s just another level in the cache/memory hierarchy.
• Virtual memory is the name of the technique that allows us to view main memory as a cache of a larger memory space (on disk).

[Diagram: cpu -> $ -> memory -> disk; the cpu/$/memory levels are labeled "caching", and the memory/disk level is labeled "virtual memory".]

Virtual Memory

• Virtual memory is just caching, but uses different terminology:

  cache        VM
  block        page
  cache miss   page fault
  address      virtual address
  index        physical address (sort of)

Page 22

Virtual Memory

• What happens if another program in the processor uses the same addresses that yours does?
• What happens if your program uses addresses that don't exist in the machine?
• What happens to "holes" in the address space your program uses?
• So, virtual memory provides:
  – performance (through the caching effect)
  – protection
  – ease of programming/compilation
  – efficient use of memory

Virtual Memory

• Virtual memory is just a mapping function from virtual memory addresses to physical memory locations, which allows caching of virtual pages in physical memory.

[Figure: a page table maps virtual page numbers to physical pages or disk addresses. Entries with the valid bit set to 1 point into physical memory; entries with valid bit 0 point into disk storage.]

Page 23

What makes VM different than memory caches

• MUCH higher miss penalty (millions of cycles)! If it is not in memory, it is on the disk!
• Therefore:
  – large pages [the equivalent of cache lines] (4 KB to MBs)
  – associative mapping of pages (typically fully associative)
  – software handling of misses (since we have time)
  – write-through is not an option, only write-back
• Substitution (replacement) policy: LRU

Mapping virtual to physical address

• Page size = 2^12 = 4 KB
• There are 4x fewer physical pages than virtual pages!

[Figure: a 32-bit virtual address splits into a 20-bit virtual page number (bits 31-12) and a 12-bit page offset (bits 11-0); translation maps it to a 30-bit physical address with an 18-bit physical page number (bits 29-12) and the same 12-bit page offset.]
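The translation split above can be sketched directly (the function name and the toy dict-based page table are mine; the offset passes through untranslated):

```python
PAGE_OFFSET_BITS = 12              # 4 KB pages, as on the slide
PAGE_SIZE = 1 << PAGE_OFFSET_BITS  # 4096

def translate(vaddr, page_table):
    """Map a virtual address to a physical one. page_table maps virtual
    page numbers to physical page numbers (a toy stand-in for the real
    structure); a lookup failure here would be a page fault."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    ppn = page_table[vpn]
    return (ppn << PAGE_OFFSET_BITS) | offset

# translate(0x00403ABC, {0x00403: 0x0007F}) -> 0x0007FABC
```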

Page 24

Virtual Memory mapping

[Figure: two processes' virtual address spaces map onto physical addresses and onto disk. We can share code or data! Where have you seen this?]

Address translation via the page table

[Figure: the virtual address splits into a virtual page number and a page offset. The page table register locates the page table; the indexed entry's valid bit and physical page number produce the physical address (physical page number concatenated with the page offset).]

• All page mappings are in the page table, so hit/miss is determined solely by the valid bit (i.e., there is no tag).
• If we do it in software, where do we store the address translation table?

Page 25

Actual (somewhat) hardware

[Figure: the page table register points to the page table. The 32-bit virtual address's 20-bit virtual page number (bits 31-12) indexes the table; each entry has a valid bit (if 0, the page is not present in memory) and an 18-bit physical page number, which is concatenated with the 12-bit page offset to form the physical address.]

Notice: what do we have to do to save the context?

Making Address Translation Fast

• A cache for address translations: the translation-lookaside buffer (TLB)

[Figure: the TLB is a small table of (valid bit, tag = virtual page number, physical page address) entries that caches page-table entries. The full page table behind it maps every virtual page either to physical memory (valid = 1) or to disk storage (valid = 0).]

Page 26

TLBs and caches

[Figure: the virtual address (20-bit virtual page number, 12-bit page offset) is translated by the TLB (valid, dirty, tag, physical page number) into a physical address. The physical address is then split into a 16-bit physical address tag, a 14-bit cache index, and a 2-bit byte offset to access a direct-mapped cache (valid, tag, 32-bit data); the tag comparison produces the cache hit signal.]

Memory system of the DECstation 3100

[Flowchart: a virtual address first goes through a TLB access. A TLB miss raises a TLB miss exception; a TLB hit yields the physical address. A read then tries to read the data from the cache: on a cache hit the data is delivered to the CPU, on a cache miss the processor stalls. A write first checks the write access bit: if it is off, a write protection exception is raised; otherwise the data is written into the cache, the tag is updated, and the data and the address are put into the write buffer.]

Page 27

Modern systems: nightmare!

  Characteristic     Intel Pentium Pro                          PowerPC 604
  Virtual address    32 bits                                    52 bits
  Physical address   32 bits                                    32 bits
  Page size          4 KB, 4 MB                                 4 KB, selectable, and 256 MB
  TLB organization   A TLB for instructions and a TLB for data  A TLB for instructions and a TLB for data
                     Both four-way set associative              Both two-way set associative
                     Pseudo-LRU replacement                     LRU replacement
                     Instruction TLB: 32 entries                Instruction TLB: 128 entries
                     Data TLB: 64 entries                       Data TLB: 128 entries
                     TLB misses handled in hardware             TLB misses handled in hardware

  Characteristic       Intel Pentium Pro                  PowerPC 604
  Cache organization   Split instruction and data caches  Split instruction and data caches
  Cache size           8 KB each for instructions/data    16 KB each for instructions/data
  Cache associativity  Four-way set associative           Four-way set associative
  Replacement          Approximated LRU replacement       LRU replacement
  Block size           32 bytes                           32 bytes
  Write policy         Write-back                         Write-back or write-through

Virtual Memory Key Points - I

• How does virtual memory provide:
  – protection?
  – sharing?
  – performance?
  – the illusion of a large main memory?
• Virtual memory requires twice as many memory accesses, so we cache page table entries in the TLB.
• Three things can go wrong on a memory access: cache miss, TLB miss, page fault.

Page 28

Virtual Memory Key Points - II

• Processor speeds continue to increase very fast -- much faster than either DRAM or disk access times.
• Design challenge: dealing with this growing disparity.
• Trends:
  – synchronous SRAMs (provide a burst of data)
  – redesigned DRAM chips that provide higher bandwidth or processing
  – restructured code to increase locality
  – prefetching (make the cache visible to the ISA)