Transcript
1
IBM 360 Model 85 (1968) had a cache, which helped it outperform the more complex Model 91 (Tomasulo's algorithm).
Maurice Wilkes published the first paper on cache memory in 1965. The first computer to actually include one was probably built at Cambridge (a direct-mapped cache).
2
COMP 740: Computer Architecture and Implementation
Montek Singh
Tue, Apr 7, 2009
Topic: Introduction to Caches
(Will cover Caches, Main Memory and Virtual Memory)
Block replacement policies
Write-back vs. write-through caches
Write buffers
Cache Performance
Means of improving performance
Read Appendix C.1 through C.3
4
The Five Classic Components of a Computer
This lecture (and next few): Memory System
Control
Datapath
Memory
Processor
Input
Output
The Big Picture: Where are We Now?
5
Motivation
Large (cheap) memories (DRAM) are slow
Small (costly) memories (SRAM) are fast
Make the average access time small
service most accesses from a small, fast memory
reduce the bandwidth required of the large memory
Exploit: Locality of Reference
Processor
Memory System
Cache DRAM
The Motivation for Caches
9
Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (e.g.: Block X in previous slide)
Hit Rate = fraction of memory accesses found in upper level
Hit Time = time to access the upper level
memory access time + Time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (e.g.: Block Y in previous slide)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: includes time to fetch a new block from lower level
Time to replace a block in the upper level from lower level + Time to deliver the block to the processor
Hit Time: significantly less than Miss Penalty
10
Cache Addressing
[Figure: cache organized as Sets 0 … j-1; each set holds Blocks 0 … k-1 plus replacement info; each block holds Sectors 0 … m-1 plus a Tag; each sector holds Bytes 0 … n-1 plus Valid, Dirty, and Shared bits.]
Block/line is unit of allocation
Sector/sub-block is unit of transfer and coherence
Cache parameters j, k, m, n are integers, and generally powers of 2
11
Cache Shapes
12
Cache Shapes
Direct-mapped (A = 1, S = 16)
2-way set-associative (A = 2, S = 8)
4-way set-associative (A = 4, S = 4)
8-way set-associative (A = 8, S = 2)
Fully associative (A = 16, S = 1)
13
Cache Organization
Direct Mapped Cache
Each memory location can only be mapped to 1 cache location
No need to make any decision :-)
Current item replaces previous item in that cache location
N-way Set Associative Cache
Each memory location has a choice of N cache locations
Fully Associative Cache
Each memory location can be placed in ANY cache location
Cache miss in an N-way Set Associative or Fully Associative Cache
Bring in new block from memory
Throw out a cache block to make room for the new block
Need to decide which block to throw out!
14
4 Questions for Mem Hierarchy
Where can a block be placed in the upper level? (Block placement)
How is a block found if it is in the upper level? (Block identification)
Which block should be replaced on a miss? (Block replacement)
What happens on a write? (Write strategy)
15
[Figure: direct-mapped cache diagram. A 32-bit address splits into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00). Each cache entry holds a valid bit, a tag stored as part of the cache "state", and 32 bytes of data (Byte 0 … Byte 31; the last entry ends at Byte 1023).]
Example 1: 1KB, Direct-Mapped, 32B Blocks
For a 1024 (2^10) byte cache with 32-byte blocks
The uppermost 22 = (32 - 10) address bits are the tag
The lowest 5 address bits are the Byte Select (Block Size = 2^5)
The next 5 address bits (bit5 - bit9) are the Cache Index
16
[Figure: access with tag 0x0002fe, index 0x00, byte select 0x00. The indexed entry's valid bit is 0 (tag 0xxxxxxx), so the access is a miss on an empty block.]
Example 1a: Cache Miss; Empty Block
17
[Figure: after the miss, the new block of data is read in, the entry's valid bit is set to 1, and its tag is set to 0x0002fe.]
Example 1b: … Read in Data
18
[Figure: access with tag 0x000050, index 0x01, byte select 0x08. The indexed entry is valid and its stored tag 0x000050 matches, so the access is a hit.]
Example 1c: Cache Hit
19
[Figure: access with tag 0x002450, index 0x02, byte select 0x04. The indexed entry is valid but holds a different tag, so the access is a miss: the wrong block occupies the entry.]
Example 1d: Cache Miss; Incorrect Block
20
[Figure: the mismatching block is replaced: a new block of data is read in and the entry's tag is updated to 0x002450.]
Example 1e: … Replace Block
22
Replacement Policy
Random
Easy to implement
LRU
Hard to implement; often approximated
FIFO
Used as approximation to LRU
Little effect (below); most pronounced with small, low-associativity caches
23
Cache Write Policy
Cache read is much easier to handle than cache write
Instruction cache is much easier to design than data cache
Cache write
How do we keep data in the cache and memory consistent?
Two options (decision time again :-)
Write Back: write to cache only. Write the cache block to memory when that cache block is being replaced on a cache miss
Need a "dirty bit" for each cache block
Greatly reduces the memory bandwidth requirement
Control can be complex
Write Through: write to cache and memory at the same time
What!!! How can this be? Isn't memory too slow for this?
24
[Figure: Processor → Cache → Write Buffer → DRAM]
Write Buffer for Write Through
Write Buffer: needed between cache and main mem
Processor: writes data into the cache and the write buffer
Memory controller: writes contents of the buffer to memory
Write buffer is just a FIFO
Typical number of entries: 4
Works fine if store freq. (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare
Store frequency (w.r.t. time) > 1 / DRAM write cycle
Write buffer saturation
25
[Figure: two configurations — Processor → Cache → Write Buffer → DRAM, and Processor → Cache → Write Buffer → L2 Cache → DRAM]
Write Buffer Saturation
Store frequency (w.r.t. time) > 1 / DRAM write cycle
If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row)
Store buffer will overflow no matter how big you make it
CPU Cycle Time << DRAM Write Cycle Time
Solutions for write buffer saturation
Use a write back cache
Install a second level (L2) cache
26
On a Write Miss
Write allocate – block is allocated in cache
No-write allocate – no cache block is allocated. Write is only to main memory (or next level of hierarchy)
27
Opteron Cache
64K bytes in 64 byte blocks
40-bit physical address (1)
2-way set associative. LRU replacement
• Write back
• Write allocate on miss
• Dirty bit
• Victim buffer for replaced blocks
• 8 blocks
Tags indexed (2) and compared (3). Note valid bit.
2 clock read on hit.
Miss: 7 clks for 1st 8 bytes, then 2 clk / 8 bytes
28
Separate I & D
Commonly done
Increases bandwidth to processor
Allows for the different access patterns of instructions and data
29
Cache Performance

Average memory access time = Hit time + Miss rate × Miss penalty

CPU time = IC × (Pipeline CPI + (MM refs / Instruction) × (Misses / MM ref) × Miss penalty) × Cycle time

Bus traffic ratio = (Bus traffic with cache) / (Bus traffic without cache)
30
[Figure: three plots against Block Size — Miss Penalty rises with block size; Miss Rate first falls (exploits spatial locality), then rises (fewer blocks compromises temporal locality); Average Access Time therefore has a minimum, with increased miss penalty & miss rate at large block sizes.]
Block Size Tradeoff
In general, larger block sizes take advantage of spatial locality, BUT:
Larger block size means larger miss penalty
Takes longer time to fill up the block
If block size is too big relative to cache size, miss rate will go up
Too few cache blocks
Average Access Time = Hit Time + Miss Penalty × Miss Rate
31
Sources of Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block
"Cold" fact of life: not a whole lot you can do about it
Conflict/Collision/Interference
Multiple mem locations mapped to the same cache location
Capacity
Cache cannot contain all blocks accessed by the program
Solution 1: Increase cache size
Solution 2: Restructure program
Coherence/Invalidation
Other process (e.g., I/O) updates memory
32
The 3C Model of Cache Misses
Based on comparison with another cache
Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses. (Misses in Infinite Cache)
Capacity: If the cache cannot contain all the blocks needed during execution of a program (its working set), capacity misses will occur due to blocks being discarded and later retrieved. (Misses in fully associative size X Cache)
Conflict: If the block-placement strategy is set-associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses. (Misses in A-way associative size X Cache but not in fully associative size X Cache)
Also: Coherence/Invalidation
Other process (e.g., I/O) updates memory
33
Possible Solutions
Compulsory (cold start or process migration, first reference): first access to a block
"Cold" fact of life: not a whole lot you can do about it
Conflict/Collision/Interference
Multiple mem locations mapped to the same cache location
Solution 1: Increase cache size
Solution 2: Increase associativity
Capacity
Cache cannot contain all blocks accessed by the program
Solution 1: Increase cache size
Solution 2: Restructure program
34
Sources of Cache Misses

                    Direct Mapped   N-way Set Associative   Fully Associative
Cache Size          Big             Medium                  Small
Compulsory Miss     Same            Same                    Same
Conflict Miss       High            Medium                  Zero
Capacity Miss       Low(er)         Medium                  High
Invalidation Miss   Same            Same                    Same

If you are going to run "billions" of instructions, compulsory misses are insignificant.
35
3Cs Absolute Miss Rate
[Figure: stacked miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity, broken into Capacity, Compulsory, and Conflict components; conflict misses shrink with associativity, and capacity misses dominate at small sizes.]
36
3Cs Relative Miss Rate
[Figure: the same data normalized to 100% — miss rate per type as a percentage vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity, with Capacity, Compulsory, and Conflict components.]
37
How to Improve Cache Performance
Latency
Reduce miss rate
Reduce miss penalty
Reduce hit time
Bandwidth
Increase hit bandwidth
Increase miss bandwidth
38
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K to 256K; miss rate falls with larger blocks for big caches but rises again when the block size is large relative to a small cache.]
1. Reduce Misses via Larger Block Size
39
2. Reduce Misses via Higher Associativity
2:1 Cache Rule
Miss Rate DM cache size N ≈ Miss Rate FA cache size N/2
Not merely empirical
Theoretical justification in Sleator and Tarjan, "Amortized efficiency of list update and paging rules", CACM, 28(2):202-208, 1985
Beware: Execution time is only final measure!
Will clock cycle time increase?
Hill [1988] suggested hit time ~10% higher for 2-way vs. 1-way
[Figure: the 3Cs miss-rate chart again (miss rate per type vs. cache size, 1-way through 8-way), showing how higher associativity removes the conflict component.]
40
Example: Ave Mem Access Time vs. Miss Rate
Example: assume clock cycle time is 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. clock cycle time of direct mapped
(Red means A.M.A.T. not improved by more associativity)
3. Reduce Conflict Misses via Victim Cache
How to combine fast hit time of direct mapped yet avoid conflict misses
Add small highly associative buffer to hold data discarded from cache
Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
[Figure: direct-mapped cache (TAG/DATA with comparator) backed by a small fully associative victim cache (TAG/DATA with comparators), sitting between the CPU and memory.]
42
4. Reduce Conflict Misses via Pseudo-Assoc.
How to combine fast hit time of direct mapped and have the lower conflict misses of 2-way SA cache
Divide cache: on a miss, check other half of cache to see if there; if so, have a pseudo-hit (slow hit)
Drawback: CPU pipeline design is hard if hit takes 1 or 2 cycles
Better for caches not tied directly to processor
[Figure: access timeline — Hit Time, then Pseudo Hit Time, then Miss Penalty.]
43
5. Reduce Misses by Hardware Prefetching
Instruction prefetching
Alpha 21064 fetches 2 blocks on a miss
Extra block placed in stream buffer
On miss check stream buffer
Works with data blocks too
Jouppi [1990]: 1 data stream buffer got 25% of misses from 4KB cache; 4 stream buffers got 43%
Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative caches
Prefetching relies on extra memory bandwidth that can be used without penalty
e.g., up to 8 prefetch stream buffers in the UltraSPARC III
44
6. Reducing Misses by Software Prefetching
Data prefetch
Compiler inserts special "prefetch" instructions into program
Load data into register (HP PA-RISC loads)
Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v9)
A form of speculative execution
don't really know if data is needed or if not in cache already
Most effective prefetches are "semantically invisible" to prgm
does not change registers or memory
cannot cause a fault/exception
if they would fault, they are simply turned into NOP's
Issuing prefetch instructions takes time
Is cost of prefetch issues < savings in reduced misses?
45
7. Reduce Misses by Compiler Optzns.
Instructions
Reorder procedures in memory so as to reduce misses
Profiling to look at conflicts
McFarling [1989] reduced cache misses by 75% on 8KB direct mapped cache with 4 byte blocks
Data
Merging Arrays
Improve spatial locality by single array of compound elements vs. 2 arrays
Loop Interchange
Change nesting of loops to access data in order stored in memory
Loop Fusion
Combine two independent loops that have same looping and some variables overlap
Blocking
Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
46
Merging Arrays Example

Reduces conflicts between val and key
Addressing expressions are different

/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];
47
Loop Interchange Example

Sequential accesses instead of striding through memory every 100 words

/* Before */
for (k = 0; k < 100; k++)
  for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
  for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
      x[i][j] = 2 * x[i][j];
48
Loop Fusion Example

Before: 2 misses per access to a and c
After: 1 miss per access to a and c

/* Before */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
49
Blocking Example

Two Inner Loops:
Read all NxN elements of z[]
Read N elements of 1 row of y[] repeatedly
Write N elements of 1 row of x[]
Capacity Misses a function of N and Cache Size
If the 3 NxN matrices fit in the cache, no capacity misses; otherwise ...
Idea: compute on BxB submatrix that fits

/* Before */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    r = 0;
    for (k = 0; k < N; k++)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }
50
Blocking Example (contd.)
Age of accesses
White means not touched yet
Light gray means touched a while ago
Dark gray means newer accesses
51
Blocking Example (contd.)

Work with BxB submatrices
smaller working set can fit within the cache
fewer capacity misses

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i++)
      for (j = jj; j < min(jj+B-1,N); j++) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k++)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }
52
Blocking Example (contd.)
Capacity reqd. goes from (2N³ + N²) to (2N³/B + N²)
B = "blocking factor"
53
[Figure: performance improvement (1x to 3x) from compiler optimizations on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7), broken down by merged arrays, loop interchange, loop fusion, and blocking.]
Summary: Compiler Optimizations to Reduce Cache Misses
54
Reducing Miss Penalty
1. Read Priority over Write on Miss:
Write through:
Using write buffers: RAW conflicts with reads on cache misses
If simply wait for write buffer to empty, might increase read miss penalty by 50% (old MIPS 1000)
Check write buffer contents before read; if no conflicts, let the memory access continue
Write Back?
Read miss replacing dirty block
Normal: Write dirty block to memory, and then do the read
Instead copy the dirty block to a write buffer, then do the read, and then do the write
CPU stalls less since it restarts as soon as read completes
55
[Figure: three cache blocks at addresses 100, 200, 300 with per-subblock valid bits (1 1 1 0, 1 1 0 0, 0 0 0 1), showing that only some subblocks of a block are present.]
2. Fetching Subblocks to Reduce Miss Penalty
Don't have to load full block on a miss
Have bits per subblock to indicate valid
56
3. Early Restart and Critical Word First
Don't wait for full block to be loaded before restarting CPU
Early Restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives
let the CPU continue while filling the rest of the words in the block
also called "wrapped fetch" and "requested word first"
Generally useful only in large blocks
Spatial locality a problem
tend to want next sequential word, so not clear if benefit by early restart
57
4. Non-blocking Caches
Non-blocking cache or lockup-free cache
allows the data cache to continue to supply cache hits during a miss
"Hit under miss"
reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU
"Hit under multiple miss" or "miss under miss"
may further lower the effective miss penalty by overlapping multiple misses
Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
58
Value of Hit Under Miss for SPEC
FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss
[Figure: average memory access time under 0, 1, 2, and 64 outstanding misses ("Hit under i Misses") for SPEC integer benchmarks (eqntott, espresso, xlisp, compress, mdljsp2) and floating-point benchmarks (ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora).]
59
5. Miss Penalty Reduction: L2 Cache
L2 Equations:
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
Definitions:
Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
63
Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
64
1. Fast Hit Times via Small, Simple Caches
Simple caches can be faster
cache hit time increasingly a bottleneck to CPU performance
set associativity requires complex tag matching ⇒ slower
direct-mapped are simpler ⇒ faster ⇒ shorter CPU cycle times
tag check can be overlapped with transmission of data
Smaller caches can be faster
can fit on the same chip as CPU
avoid penalty of going off-chip
for L2 caches: compromise
keep tags on chip, and data off chip
fast tag check, yet greater cache capacity
L1 data cache reduced from 16KB in Pentium III to 8KB in Pentium IV
In Conclusion
Have looked at basic types of caches
Problems
How to improve performance
Next
Methods to ensure cache consistency in SMPs