Transcript
1
IBM 360 Model 85 (1968) had a cache, which helped it outperform the more complex Model 91 (Tomasulo's algorithm).
Maurice Wilkes published the first paper on cache memory in 1965. The first computer to actually include one was probably built at Cambridge (a direct-mapped cache).
2
COMP 740: Computer Architecture and Implementation
Montek Singh
Tue, Apr 7, 2009
Topic: Introduction to Caches
(Will cover Caches, Main Memory and Virtual Memory)
Block replacement policies
Write-back vs. write-through caches
Write buffers
Cache Performance
Means of improving performance
Read Appendix C.1 through C.3
4
The Five Classic Components of a Computer
This lecture (and next few): Memory System
Control
Datapath
Memory
Processor
Input
Output
The Big Picture: Where are We Now?
5
Motivation
Large (cheap) memories (DRAM) are slow
Small (costly) memories (SRAM) are fast
Make the average access time small
service most accesses from a small, fast memory
reduce the bandwidth required of the large memory
Exploit: Locality of Reference
Processor
Memory System
Cache DRAM
The Motivation for Caches
9
Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (e.g.: Block X in previous slide)
Hit Rate = fraction of memory accesses found in upper level
Hit Time = time to access the upper level
memory access time + Time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (e.g.: Block Y in previous slide)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: includes time to fetch a new block from lower level
Time to replace a block in the upper level from lower level + Time to deliver the block to the processor
Hit Time: significantly less than Miss Penalty
10
Cache Addressing
[Figure: cache organized as Sets 0 … j-1; each set holds Blocks 0 … k-1 plus replacement info; each block holds Sectors 0 … m-1 plus a Tag; each sector holds Bytes 0 … n-1 plus Valid, Dirty, and Shared bits.]
Block/line is unit of allocation
Sector/sub-block is unit of transfer and coherence
Cache parameters j, k, m, n are integers, and generally powers of 2
11
Cache Shapes
12
Cache Shapes
Direct-mapped (A = 1, S = 16)
2-way set-associative (A = 2, S = 8)
4-way set-associative (A = 4, S = 4)
8-way set-associative (A = 8, S = 2)
Fully associative (A = 16, S = 1)
13
Cache Organization
Direct Mapped Cache
Each memory location can only be mapped to 1 cache location
No need to make any decision :-)
Current item replaces previous item in that cache location
N-way Set Associative Cache
Each memory location has a choice of N cache locations
Fully Associative Cache
Each memory location can be placed in ANY cache location
Cache miss in an N-way Set Associative or Fully Associative Cache
Bring in new block from memory
Throw out a cache block to make room for the new block
Need to decide which block to throw out!
14
4 Questions for Mem Hierarchy
Where can a block be placed in the upper level? (Block placement)
How is a block found if it is in the upper level? (Block identification)
Which block should be replaced on a miss? (Block replacement)
What happens on a write? (Write strategy)
15
[Figure: direct-mapped cache diagram. A 32-bit address splits into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00). Each cache entry holds a valid bit, a tag stored as part of the cache "state", and 32 bytes of data (Byte 0 … Byte 31; the last entry ends at Byte 1023).]
Example 1: 1KB, Direct-Mapped, 32B Blocks
For a 1024 (2^10) byte cache with 32-byte blocks
The uppermost 22 = (32 - 10) address bits are the tag
The lowest 5 address bits are the Byte Select (Block Size = 2^5)
The next 5 address bits (bit5 - bit9) are the Cache Index
16
[Figure: access with tag 0x0002fe, index 0x00, byte select 0x00. The indexed entry's valid bit is 0 (tag 0xxxxxxx), so the access is a miss on an empty block.]
Example 1a: Cache Miss; Empty Block
17
[Figure: after the miss, the new block of data is read in, the entry's valid bit is set to 1, and its tag is set to 0x0002fe.]
Example 1b: … Read in Data
18
[Figure: access with tag 0x000050, index 0x01, byte select 0x08. The indexed entry is valid and its stored tag 0x000050 matches, so the access is a hit.]
Example 1c: Cache Hit
19
[Figure: access with tag 0x002450, index 0x02, byte select 0x04. The indexed entry is valid but holds a different tag, so the access is a miss: the wrong block occupies the entry.]
Example 1d: Cache Miss; Incorrect Block
20
[Figure: the mismatching block is replaced: a new block of data is read in and the entry's tag is updated to 0x002450.]
Example 1e: … Replace Block
22
Replacement Policy
Random
Easy to implement
LRU
Hard to implement; often approximated
FIFO
Used as approximation to LRU
Little effect (below); most pronounced with small, low-associativity caches
23
Cache Write Policy
Cache read is much easier to handle than cache write
Instruction cache is much easier to design than data cache
Cache write
How do we keep data in the cache and memory consistent?
Two options (decision time again :-)
Write Back: write to cache only. Write the cache block to memory when that cache block is being replaced on a cache miss
Need a "dirty bit" for each cache block
Greatly reduces the memory bandwidth requirement
Control can be complex
Write Through: write to cache and memory at the same time
What!!! How can this be? Isn't memory too slow for this?
24
[Figure: Processor → Cache → Write Buffer → DRAM]
Write Buffer for Write Through
Write Buffer: needed between cache and main mem
Processor: writes data into the cache and the write buffer
Memory controller: writes contents of the buffer to memory
Write buffer is just a FIFO
Typical number of entries: 4
Works fine if store freq. (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare
Store frequency (w.r.t. time) > 1 / DRAM write cycle
Write buffer saturation
25
[Figure: two configurations — Processor → Cache → Write Buffer → DRAM, and Processor → Cache → Write Buffer → L2 Cache → DRAM]
Write Buffer Saturation
Store frequency (w.r.t. time) > 1 / DRAM write cycle
If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row)
Store buffer will overflow no matter how big you make it
CPU Cycle Time << DRAM Write Cycle Time
Solutions for write buffer saturation
Use a write back cache
Install a second level (L2) cache
26
On a Write Miss
Write allocate – block is allocated in cache
No-write allocate – no cache block is allocated. Write is only to main memory (or next level of hierarchy)
27
Opteron Cache
64K bytes in 64 byte blocks
40-bit physical address (1)
2-way set associative. LRU replacement
• Write back
• Write allocate on miss
• Dirty bit
• Victim buffer for replaced blocks
• 8 blocks
Tags indexed (2) and compared (3). Note valid bit.
2 clock read on hit.
Miss: 7 clks for 1st 8 bytes, then 2 clk / 8 bytes
28
Separate I & D
Commonly done
Increases bandwidth to processor
Allows for the different access patterns of instructions and data
29
Cache Performance

Average memory access time = Hit time + Miss rate × Miss penalty

CPU time = IC × (Pipeline CPI + (MM refs / Instruction) × (Misses / MM ref) × Miss penalty) × Cycle time

Bus traffic ratio = (Bus traffic with cache) / (Bus traffic without cache)
30
[Figure: three plots against Block Size — Miss Penalty rises with block size; Miss Rate first falls (exploits spatial locality), then rises (fewer blocks compromises temporal locality); Average Access Time therefore has a minimum, with increased miss penalty & miss rate at large block sizes.]
Block Size Tradeoff
In general, larger block sizes take advantage of spatial locality, BUT:
Larger block size means larger miss penalty
Takes longer time to fill up the block
If block size is too big relative to cache size, miss rate will go up
Too few cache blocks
Average Access Time = Hit Time + Miss Penalty × Miss Rate
31
Sources of Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block
"Cold" fact of life: not a whole lot you can do about it
Conflict/Collision/Interference
Multiple mem locations mapped to the same cache location
Capacity
Cache cannot contain all blocks accessed by the program
Solution 1: Increase cache size
Solution 2: Restructure program
Coherence/Invalidation
Other process (e.g., I/O) updates memory
32
The 3C Model of Cache Misses
Based on comparison with another cache
Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses. (Misses in Infinite Cache)
Capacity: If the cache cannot contain all the blocks needed during execution of a program (its working set), capacity misses will occur due to blocks being discarded and later retrieved. (Misses in fully associative size X Cache)
Conflict: If the block-placement strategy is set-associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses. (Misses in A-way associative size X Cache but not in fully associative size X Cache)
Also: Coherence/Invalidation
Other process (e.g., I/O) updates memory
33
Possible Solutions
Compulsory (cold start or process migration, first reference): first access to a block
"Cold" fact of life: not a whole lot you can do about it
Conflict/Collision/Interference
Multiple mem locations mapped to the same cache location
Solution 1: Increase cache size
Solution 2: Increase associativity
Capacity
Cache cannot contain all blocks accessed by the program
Solution 1: Increase cache size
Solution 2: Restructure program
34
Sources of Cache Misses

                    Direct Mapped   N-way Set Associative   Fully Associative
Cache Size          Big             Medium                  Small
Compulsory Miss     Same            Same                    Same
Conflict Miss       High            Medium                  Zero
Capacity Miss       Low(er)         Medium                  High
Invalidation Miss   Same            Same                    Same

If you are going to run "billions" of instructions, compulsory misses are insignificant.
35
3Cs Absolute Miss Rate
[Figure: stacked miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity, broken into Capacity, Compulsory, and Conflict components; conflict misses shrink with associativity, and capacity misses dominate at small sizes.]
36
3Cs Relative Miss Rate
[Figure: the same data normalized to 100% — miss rate per type as a percentage vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity, with Capacity, Compulsory, and Conflict components.]
37
How to Improve Cache Performance
Latency
Reduce miss rate
Reduce miss penalty
Reduce hit time
Bandwidth
Increase hit bandwidth
Increase miss bandwidth
38
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K to 256K; miss rate falls with larger blocks for big caches but rises again when the block size is large relative to a small cache.]
1. Reduce Misses via Larger Block Size
39
2. Reduce Misses via Higher Associativity
2:1 Cache Rule
Miss Rate DM cache size N ≈ Miss Rate FA cache size N/2
Not merely empirical
Theoretical justification in Sleator and Tarjan, "Amortized efficiency of list update and paging rules", CACM, 28(2):202-208, 1985
Beware: Execution time is only final measure!
Will clock cycle time increase?
Hill [1988] suggested hit time ~10% higher for 2-way vs. 1-way
[Figure: the 3Cs miss-rate chart again (miss rate per type vs. cache size, 1-way through 8-way), showing how higher associativity removes the conflict component.]
40
Example: Ave Mem Access Time vs. Miss Rate
Example: assume clock cycle time is 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. clock cycle time of direct mapped
(Red means A.M.A.T. not improved by more associativity)
3. Reduce Conflict Misses via Victim Cache
How to combine fast hit time of direct mapped yet avoid conflict misses
Add small highly associative buffer to hold data discarded from cache
Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
[Figure: direct-mapped cache (TAG/DATA with comparator) backed by a small fully associative victim cache (TAG/DATA with comparators), sitting between the CPU and memory.]
42
4. Reduce Conflict Misses via Pseudo-Assoc.
How to combine fast hit time of direct mapped and have the lower conflict misses of 2-way SA cache
Divide cache: on a miss, check other half of cache to see if there; if so, have a pseudo-hit (slow hit)
Drawback: CPU pipeline design is hard if hit takes 1 or 2 cycles
Better for caches not tied directly to processor
[Figure: access timeline — Hit Time, then Pseudo Hit Time, then Miss Penalty.]
43
5. Reduce Misses by Hardware Prefetching
Instruction prefetching
Alpha 21064 fetches 2 blocks on a miss
Extra block placed in stream buffer
On miss check stream buffer
Works with data blocks too
Jouppi [1990]: 1 data stream buffer got 25% of misses from 4KB cache; 4 stream buffers got 43%
Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative caches
Prefetching relies on extra memory bandwidth that can be used without penalty
e.g., up to 8 prefetch stream buffers in the UltraSPARC III
44
6. Reducing Misses by Software Prefetching
Data prefetch
Compiler inserts special "prefetch" instructions into program
Load data into register (HP PA-RISC loads)
Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v9)
A form of speculative execution
don't really know if data is needed or if not in cache already
Most effective prefetches are "semantically invisible" to prgm
does not change registers or memory
cannot cause a fault/exception
if they would fault, they are simply turned into NOP's
Issuing prefetch instructions takes time
Is cost of prefetch issues < savings in reduced misses?
45
7. Reduce Misses by Compiler Optzns.
Instructions
Reorder procedures in memory so as to reduce misses
Profiling to look at conflicts
McFarling [1989] reduced cache misses by 75% on 8KB direct mapped cache with 4 byte blocks
Data
Merging Arrays
Improve spatial locality by single array of compound elements vs. 2 arrays
Loop Interchange
Change nesting of loops to access data in order stored in memory
Loop Fusion
Combine two independent loops that have same looping and some variables overlap
Blocking
Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
46
Merging Arrays Example

Reduces conflicts between val and key
Addressing expressions are different

/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];
47
Loop Interchange Example

Sequential accesses instead of striding through memory every 100 words

/* Before */
for (k = 0; k < 100; k++)
  for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
  for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
      x[i][j] = 2 * x[i][j];
48
Loop Fusion Example

Before: 2 misses per access to a and c
After: 1 miss per access to a and c

/* Before */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
49
Blocking Example

Two Inner Loops:
Read all NxN elements of z[]
Read N elements of 1 row of y[] repeatedly
Write N elements of 1 row of x[]
Capacity Misses a function of N and Cache Size
If the 3 NxN matrices fit in the cache, no capacity misses; otherwise ...
Idea: compute on BxB submatrix that fits

/* Before */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    r = 0;
    for (k = 0; k < N; k++)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }
50
Blocking Example (contd.)
Age of accesses
White means not touched yet
Light gray means touched a while ago
Dark gray means newer accesses
51
Blocking Example (contd.)

Work with BxB submatrices
smaller working set can fit within the cache
fewer capacity misses

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i++)
      for (j = jj; j < min(jj+B-1,N); j++) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k++)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }
52
Blocking Example (contd.)
Capacity reqd. goes from (2N³ + N²) to (2N³/B + N²)
B = "blocking factor"
53
[Figure: performance improvement (1x to 3x) from compiler optimizations on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7), broken down by merged arrays, loop interchange, loop fusion, and blocking.]
Summary: Compiler Optimizations to Reduce Cache Misses
54
Reducing Miss Penalty
1. Read Priority over Write on Miss:
Write through:
Using write buffers: RAW conflicts with reads on cache misses
If simply wait for write buffer to empty, might increase read miss penalty by 50% (old MIPS 1000)
Check write buffer contents before read; if no conflicts, let the memory access continue
Write Back?
Read miss replacing dirty block
Normal: Write dirty block to memory, and then do the read
Instead copy the dirty block to a write buffer, then do the read, and then do the write
CPU stalls less since it restarts as soon as read completes
55
[Figure: three cache blocks at addresses 100, 200, 300 with per-subblock valid bits (1 1 1 0, 1 1 0 0, 0 0 0 1), showing that only some subblocks of a block are present.]
2. Fetching Subblocks to Reduce Miss Penalty
Don't have to load full block on a miss
Have bits per subblock to indicate valid
56
3. Early Restart and Critical Word First
Don't wait for full block to be loaded before restarting CPU
Early Restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives
let the CPU continue while filling the rest of the words in the block
also called "wrapped fetch" and "requested word first"
Generally useful only in large blocks
Spatial locality a problem
tend to want next sequential word, so not clear if benefit by early restart
57
4. Non-blocking Caches
Non-blocking cache or lockup-free cache
allows the data cache to continue to supply cache hits during a miss
"Hit under miss"
reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU
"Hit under multiple miss" or "miss under miss"
may further lower the effective miss penalty by overlapping multiple misses
Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
58
Value of Hit Under Miss for SPEC
FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
Int programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss
[Figure: average memory access time under 0, 1, 2, and 64 outstanding misses ("Hit under i Misses") for SPEC integer benchmarks (eqntott, espresso, xlisp, compress, mdljsp2) and floating-point benchmarks (ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora).]
59
5. Miss Penalty Reduction: L2 Cache
L2 Equations:
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
Definitions:
Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
63
Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
64
1. Fast Hit Times via Small, Simple Caches
Simple caches can be faster
cache hit time increasingly a bottleneck to CPU performance
set associativity requires complex tag matching ⇒ slower
direct-mapped are simpler ⇒ faster ⇒ shorter CPU cycle times
tag check can be overlapped with transmission of data
Smaller caches can be faster
can fit on the same chip as CPU
avoid penalty of going off-chip
for L2 caches: compromise
keep tags on chip, and data off chip
fast tag check, yet greater cache capacity
L1 data cache reduced from 16KB in Pentium III to 8KB in Pentium IV
In Conclusion
Have looked at basic types of caches
Problems
How to improve performance
Next
Methods to ensure cache consistency in SMPs