Page 1
Lecture 17 Slide 1 EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar
EECS 470 Lecture 18: NUCA & Prefetching
Fall 2020
Prof. Ronald Dreslinski
http://www.eecs.umich.edu/courses/eecs470
[Figure: correlation prefetching example. A History Table records recent miss addresses (latest: A1); a Correlating Prediction Table maps the pair A0, A1 to A3, triggering "Prefetch A3".]
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lee, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Georgia Tech, Purdue University, University of Michigan, and University of Wisconsin.
Page 2
Non-Uniform Cache Architecture
Proposed at ASPLOS 2002 by UT-Austin (Kim, Burger, Keckler)
Facts
❒ Large shared on-die L2
❒ Wire delay dominating on-die cache access time
Cache access latency scaling:
❒ 3 cycles, 1 MB, 180 nm (1999)
❒ 11 cycles, 4 MB, 90 nm (2004)
❒ 24 cycles, 16 MB, 50 nm (2010)
Page 3
Multi-banked L2 cache
2 MB @ 130 nm: bank = 128 KB; bank access time = 3 cycles, interconnect delay = 8 cycles, total = 11 cycles
Page 4
Multi-banked L2 cache
16 MB @ 50 nm: bank = 64 KB; bank access time = 3 cycles, interconnect delay = 44 cycles, total = 47 cycles
Page 5
Static NUCA-1
❒ Uses a private per-bank channel
❒ Each bank has its own distinct access latency
❒ The location of data for a given address is statically decided
❒ Average access latency = 34.2 cycles
❒ Wire overhead = 20.9% → an issue
[Figure: S-NUCA-1 organization. Address and data buses connect a private channel to each bank; a bank is built from sub-banks with a tag array, predecoder, wordline driver and decoder, and sense amplifiers.]
Page 6
Static NUCA-2
❒ Uses a 2D switched network to alleviate the wire area overhead
❒ Average access latency = 24.2 cycles
❒ Wire overhead = 5.9%
[Figure: S-NUCA-2 organization. Banks are connected through switches on a 2D network; each bank contains a tag array, predecoder, and wordline driver and decoder, sharing a data bus.]
Page 7
Dynamic NUCA
❒ Data can dynamically migrate
❒ Move frequently used cache lines closer to the CPU
Page 8
Dynamic NUCA
Simple Mapping
❒ All 4 ways of each bank set need to be searched
❒ Farther bank sets → longer access
[Figure: 8 bank sets mapped across ways 0-3; one column of banks forms one set.]
Page 9
Dynamic NUCA
Fair Mapping
❒ Average access time across all bank sets is equal
[Figure: 8 bank sets mapped across ways 0-3 so that each set mixes near and far banks.]
Page 10
Dynamic NUCA
Shared Mapping
❒ The closest banks are shared among the farther bank sets
[Figure: 8 bank sets across ways 0-3, with the banks nearest the CPU shared by several sets.]
Page 11
The memory wall
Today: 1 memory access ≈ 500 arithmetic ops
How to reduce memory stalls for existing SW?
[Figure: processor vs. memory performance, 1985-2010, log scale from 1 to 10000; the processor curve pulls away from the memory curve. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.]
Page 12
Conventional approach #1: Avoid main memory accesses
Cache hierarchies:
❒ Trade off capacity for speed
Add more cache levels?
❒ Diminishing locality returns
❒ No help for shared data in MPs
[Figure: CPU backed by a 64K cache (2 clk), a 4M cache (20 clk), and main memory (200 clk); write data flows from the CPU down the hierarchy.]
Page 13
Conventional approach #2: Hide memory latency
Out-of-order execution:
❒ Overlap compute & memory stalls
Expand the OoO instruction window?
❒ Issue & load-store logic are hard to scale
❒ No help for dependent instructions
[Figure: execution timelines for in-order vs. OoO; OoO overlaps compute with part of the memory stall.]
Page 14
Challenges of server apps
❒ Frequent sharing & synchronization
❒ Many linked data structures (e.g., linked list, B+-tree, …): dependent miss chains [Ranganathan98]
❒ Low memory-level parallelism [Chou04]
❒ 50-66% of time stalled on memory [Trancoso97][Barroso98][Ailamaki99]
Our goals:
❒ Read misses: fetch earlier & in parallel
❒ Write misses: never stall
[Figure: execution timelines, today vs. goal; the goal timeline removes most of the memory stall from the compute stream.]
Page 15
What is Prefetching?
• Fetch memory ahead of time
• Targets compulsory, capacity, & coherence misses
Big challenges:
1. Knowing "what" to fetch
• Fetching useless info wastes valuable resources
2. Knowing "when" to fetch it
• Fetching too early clutters storage
• Fetching too late defeats the purpose of "pre"-fetching
Page 16
Software Prefetching
Compiler/programmer places prefetch instructions
❒ requires ISA support
❒ why not use regular loads?
❒ found in recent ISAs such as SPARC V9
Prefetch into
❒ a register (binding)
❒ caches (non-binding)
Page 17
Software Prefetching (Cont.)
e.g.,
for (i = 1; i < rows; i++)
  for (j = 1; j < columns; j++)
  {
    prefetch(&x[i+1][j]);
    sum = sum + x[i][j];
  }
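In compilable C, the same loop can be written with the GCC/Clang builtin __builtin_prefetch. This is a sketch; the matrix dimensions, prefetch distance, and locality hint are illustrative choices, not from the slides.

```c
#include <stddef.h>

#define ROWS 8
#define COLS 8

/* Sum a matrix while prefetching the same column of the next row,
   mirroring the slide's prefetch(&x[i+1][j]) example. */
double sum_with_prefetch(double x[ROWS][COLS]) {
    double sum = 0.0;
    for (size_t i = 1; i < ROWS; i++) {
        for (size_t j = 1; j < COLS; j++) {
            if (i + 1 < ROWS)
                /* non-binding read prefetch; harmless if it misses */
                __builtin_prefetch(&x[i + 1][j], /*rw=*/0, /*locality=*/1);
            sum += x[i][j];
        }
    }
    return sum;
}
```

The prefetch is non-binding (it only warms the cache), so the loop computes exactly the same sum with or without it.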
Page 18
Hardware Prefetching
What to prefetch?
❒ one block spatially ahead?
❒ use address predictors → works for regular patterns (x, x+8, x+16, …)
When to prefetch?
❒ on every reference
❒ on every miss
❒ when prior prefetched data is referenced
❒ upon the last processor reference
❒ use more complicated rate-matching mechanisms
Where to put prefetched data?
❒ auxiliary buffers
❒ caches
Page 19
Generalized Access Pattern Prefetchers
How do you prefetch:
1. Heap data structures?
2. Indirect array accesses?
3. Generalized memory access patterns?
Taxonomy of approaches:
• Spatial prefetchers
• Address-correlating prefetchers
• Precomputation prefetchers
Page 20
Spatial Locality and Sequential Prefetching
Works well for the I-cache
❒ Instruction fetching tends to access memory sequentially
Doesn't work very well for the D-cache
❒ More irregular access patterns
❒ regular patterns may have a non-unit stride (e.g., matrix code)
Relatively easy to implement
❒ Large cache block sizes already have the effect of prefetching
❒ After loading one cache line, start loading the next line automatically if that line is not in the cache and the bus is not busy
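The last bullet (next-line prefetch) can be modeled in a few lines of C. The toy cache below only tracks which line addresses are resident; the table size, helper names, and direct-mapped policy are invented for illustration.

```c
#include <stdbool.h>

#define NLINES 256                 /* toy direct-mapped residence tracker */

static long resident[NLINES];      /* line address per slot, -1 = empty */

static void cache_init(void) {
    for (int i = 0; i < NLINES; i++)
        resident[i] = -1;
}

static bool cache_has(long line)  { return resident[line % NLINES] == line; }
static void cache_fill(long line) { resident[line % NLINES] = line; }

/* Demand access to `line`: fill on a miss, then sequentially prefetch
   line+1 if it is absent (the next-line policy from the slide).
   Returns whether the demand access itself hit. */
static bool access_line(long line) {
    bool hit = cache_has(line);
    if (!hit)
        cache_fill(line);
    if (!cache_has(line + 1))
        cache_fill(line + 1);      /* the "pre" in prefetch */
    return hit;
}
```

A real implementation would also check that the bus is idle before issuing the prefetch; that throttling is omitted here.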
Page 21
PC-based Stride Prefetchers
Array/stride accesses correlated to a static load instruction [Baer'91]
Reference Prediction Table, indexed by load PC:
❒ records the load PC (tag), last address, & stride between the last two addresses
On a load → compute the distance between the current & last address
• if the new distance matches the old stride → found a pattern; go ahead and prefetch "current addr + stride"
• update "last addr" and "last stride" for the next lookup
[Table: Load Inst. PC (tag) | Last Address Referenced | Last Stride | Flags]
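A minimal sketch of such a Reference Prediction Table in C. The table size, tag handling, and the zero return value for "no prefetch" are simplifying assumptions, not the original hardware design.

```c
#include <stdbool.h>

#define RPT_SIZE 64

typedef struct {
    unsigned long pc;          /* tag: the load's PC */
    unsigned long last_addr;   /* last address referenced */
    long          last_stride; /* stride between the last two addresses */
    bool          valid;
} RPTEntry;

static RPTEntry rpt[RPT_SIZE];

/* Update the RPT for one dynamic load; returns the prefetch address,
   or 0 when no stable stride has been confirmed yet. */
unsigned long rpt_access(unsigned long pc, unsigned long addr) {
    RPTEntry *e = &rpt[pc % RPT_SIZE];
    unsigned long prefetch = 0;
    if (e->valid && e->pc == pc) {
        long stride = (long)(addr - e->last_addr);
        if (stride == e->last_stride && stride != 0)
            prefetch = addr + stride;   /* pattern confirmed */
        e->last_stride = stride;        /* update for next lookup */
    } else {
        e->pc = pc;                     /* allocate a fresh entry */
        e->last_stride = 0;
        e->valid = true;
    }
    e->last_addr = addr;
    return prefetch;
}
```

Note the two-step confirmation: the first occurrence of a stride only trains the entry, and a prefetch is issued only once the same stride repeats.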
Page 22
Stream Buffers [Jouppi]
Each stream buffer holds one stream of sequentially prefetched cache lines
On a load miss, check the head of all stream buffers for an address match
❒ if hit, pop the entry from the FIFO and update the cache with the data
❒ if not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)
Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy
Stream buffers can incorporate stride prediction mechanisms to support non-unit-stride streams
No cache pollution, but what about indirect array accesses (e.g., A[B[i]])?
[Figure: several stream-buffer FIFOs sit between the D-cache and the memory interface.]
Page 23
Global History Buffer (GHB) [Nesbit’04]
Holds the miss address history in FIFO order
Linked lists within the GHB connect related addresses
❒ Same static load (PC/DC)
❒ Same global miss address (G/AC)
A linked-list walk is short compared with the L2 miss latency
[Figure: an Index Table, keyed by load PC, points into the Global History Buffer, a FIFO of miss addresses.]
Page 24
GHB – Deltas
Miss address stream: 27 28 36 44 45 49 53 54 62 70 71
Delta stream: 1 8 8 1 4 4 1 8 8 1
[Figure: a Markov graph over the deltas 1, 4, 8 (transition probabilities .3/.7), with width, depth, and hybrid prefetching policies; from the current miss 71 they issue prefetches such as 71 + 8 => 79 and 79 + 8 => 87 (depth), or 71 + 8 => 79 and 71 + 4 => 75 (width).]
Page 25
GHB – Stride Prefetching
GHB-Stride uses the PC to access the index table
The linked lists contain the local history of each load
Compare the last two local strides; if they are the same, then prefetch n+s, n+2s, …, n+ds
[Figure: the Index Table, keyed by load PC, holds head pointers into the Global History Buffer; a per-PC linked list threads the local miss addresses A B C A B C B through the FIFO.]
Page 26
GHB – Delta Correlation (PC/DC)
Form delta correlations within each load's local history
For example, consider the local miss address stream:
Addresses: 0 1 2 64 65 66 128 129
Deltas: 1 1 62 1 1 62 1
Correlation key → prefetch predictions: (1,1) → 62, 1, 1; (1,62) → 1, 1, 62; (62,1) → 1, 62, 1
Best results among data prefetchers for SPEC2K [GraciaPérez'04]
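A software sketch of the PC/DC idea: find the most recent earlier occurrence of the last delta pair in one load's local history and replay the deltas that followed it as predictions. The function name and the fixed pair width of two are assumptions for illustration.

```c
#include <stddef.h>

/* Given one load's local miss-address history, predict up to n future
   addresses by delta correlation. Returns how many were written to out. */
size_t pcdc_predict(const long *hist, size_t len, long *out, size_t n) {
    if (len < 4)
        return 0;
    /* the most recent delta pair */
    long d1 = hist[len - 2] - hist[len - 3];
    long d2 = hist[len - 1] - hist[len - 2];
    /* scan backwards for an earlier occurrence of the pair (d1, d2) */
    for (size_t i = len - 2; i >= 2; i--) {
        if (hist[i - 1] - hist[i - 2] == d1 && hist[i] - hist[i - 1] == d2) {
            size_t count = 0;
            long addr = hist[len - 1];
            /* replay the deltas that followed the earlier occurrence */
            for (size_t j = i + 1; j < len && count < n; j++) {
                addr += hist[j] - hist[j - 1];
                out[count++] = addr;
            }
            return count;
        }
    }
    return 0;
}
```

On the slide's example stream the current pair is (62, 1); the match replays the deltas 1, 62, 1, predicting 130, 192, and 193.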
Page 27
Spatial Correlation
Repetitive spatial relationships between accesses
• Irregular layout → non-strided
• Sparse → can't capture with cache blocks
• But repetitive → predict to improve memory-level parallelism
Not to be confused with spatial locality:
• patterns may repeat over large (e.g., few-kB) regions
Page 28
Example: Spatial correlation [Somogyi'06]
Large-scale spatial access patterns; the pattern is a function of the program
[Figure: an 8 kB database page in memory (page header, tuple data, tuple slot index) touched in a sparse but recurring pattern.]
Page 29
SMS Operation Summary
[Figure: SMS operation. (1) Observe: record the pattern of accesses over a spatial region (PC1 ld A+4, PC2 ld A, PC3 ld A+3) until the trigger block A+3 is evicted. (2) Store: spatial patterns such as 1100000001101… and 1100001010001… are stored in a pattern history table. (3) Predict: on the next trigger access to a new region (PC2 ld B, PC1 ld B+4, PC3 ld B+3), replay the stored pattern over region B.]
Page 30
Correlation-Based Prefetching [Charney 96]
Consider the following history of load addresses emitted by a processor:
A, B, C, D, C, E, A, C, F, F, E, A, A, B, C, D, E, A, B, C, D, C
After referencing a particular address (say A or E), are some addresses more likely to be referenced next?
[Figure: a Markov model over the addresses A-F, with transition probabilities learned from the history (edges weighted .2, .33, .5, .6, .67, 1.0).]
Page 31
Correlation-Based Prefetching
Track the likely next addresses after seeing a particular address
Prefetch accuracy is generally low, so prefetch up to N next addresses to increase coverage (but this wastes bandwidth)
Prefetch accuracy can be improved by using a longer history
❒ Decide which address to prefetch next by looking at the last K load addresses instead of just the current one
❒ e.g., index with the XOR of the data addresses from the last K loads
❒ Using a history of a couple of loads can increase accuracy dramatically
This technique can also be applied to just the load miss stream
[Table, indexed by load data address (tag): Prefetch Candidate 1 … Prefetch Candidate N, each with a confidence]
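One way to sketch this N-candidate correlation table in C. The MRU ordering among candidates and the table sizes are assumptions, and the per-candidate confidence counters from the slide are omitted for brevity.

```c
#include <string.h>

#define TBL   128
#define NCAND 4

typedef struct {
    long addr;            /* tag; -1 = empty */
    long cand[NCAND];     /* successor candidates, MRU first */
    int  ncand;
} CorrEntry;

static CorrEntry tbl[TBL];
static long prev = -1;    /* previously observed address */

void corr_init(void) {
    for (int i = 0; i < TBL; i++)
        tbl[i].addr = -1;
    prev = -1;
}

/* Observe `addr`; write up to NCAND predicted successors of addr into
   out and return how many were predicted. */
int corr_access(long addr, long *out) {
    /* train: record addr as a successor of the previous address */
    if (prev >= 0) {
        CorrEntry *e = &tbl[prev % TBL];
        if (e->addr != prev) { e->addr = prev; e->ncand = 0; }
        int i;
        for (i = 0; i < e->ncand; i++)
            if (e->cand[i] == addr)
                break;
        if (i == e->ncand && e->ncand < NCAND)
            e->ncand++;
        /* move addr to the MRU slot, evicting the LRU candidate if full */
        for (int j = (i < e->ncand ? i : e->ncand - 1); j > 0; j--)
            e->cand[j] = e->cand[j - 1];
        e->cand[0] = addr;
    }
    prev = addr;
    /* predict: every recorded successor of addr is a prefetch candidate */
    CorrEntry *e = &tbl[addr % TBL];
    if (e->addr != addr)
        return 0;
    memcpy(out, e->cand, (size_t)e->ncand * sizeof(long));
    return e->ncand;
}
```

Prefetching all recorded candidates is the coverage-over-accuracy tradeoff from the slide: more useful prefetches, at the cost of bandwidth.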
Page 32
Example: Markov Prefetchers [Joseph'97]
• Correlate subsequent cache misses
• Trigger a prefetch on a miss
• Width prefetching: predict & prefetch four candidates
❒ predicting only one results in low coverage!
• Prefetch into a buffer
Page 33
Tag-Correlating Prefetchers [Kaxiras’04]
Correlation-based prefetching:
• tables are too big
• they grow with the data working-set size
Much similarity among block addresses mapping to different sets
• when marching through arrays, the tags across sets are identical!
• save space in correlation tables by correlating on tags only (not full addresses)
Page 34
Revisit: Global History Buffer (GHB) [Nesbit’04]
Holds the miss address history in FIFO order
Linked lists within the GHB connect related addresses
❒ Same static load (PC/DC)
❒ Same global miss address (G/AC)
A linked-list walk is short compared with the L2 miss latency
[Figure: an Index Table, keyed by miss address, points into the Global History Buffer, a FIFO of miss addresses.]
Page 35
GHB (G/AC) – Example
Miss address stream: 27 28 29 27 28 29 28
[Figure: the Index Table, keyed by miss address, holds head pointers into the GHB FIFO; the linked list for 28 threads its earlier occurrences, so from the current miss 28 the walk recovers the successor history and prefetches 29.]
Page 36
Linked-Data Prefetchers
• When traversing a linked structure:
• Learn/record load-to-load dependences
• Get ahead of the processor by traversing the structure in an FSM
• The FSM gets ahead of the processor by skipping the computation
❒ Similar proposals exist with SW help (e.g., helper/scout threads)
• Example:
while (p != NULL) {
  if (p->key == MATCH)
    p->val++;
  p = p->next;
}
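In software, a similar overlap can be approximated by prefetching a couple of links ahead of the node currently being processed. This sketch uses the GCC/Clang builtin __builtin_prefetch; the node layout and function name are illustrative.

```c
#include <stddef.h>

typedef struct Node {
    int key, val;
    struct Node *next;
} Node;

/* Count nodes whose key matches, prefetching two links ahead so the
   pointer chase overlaps with the per-node work. */
int count_matches(Node *p, int match) {
    int hits = 0;
    while (p != NULL) {
        if (p->next != NULL)
            /* non-binding prefetch; a NULL address is simply ignored */
            __builtin_prefetch(p->next->next, /*rw=*/0, /*locality=*/1);
        if (p->key == match)
            hits++;
        p = p->next;
    }
    return hits;
}
```

The hardware schemes on the following slides do the same thing transparently, by learning which load produces the address that a later load consumes.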
Page 37
Linked Data Structure Access
[Figure: a chain of list nodes; each node has fields at offsets 0, 4, 8, and 12, with the next pointer at offset 14.]
Page 38
Detecting Recursive Accesses
[Figure: the recurring instruction LOAD rdest, rsrc(14) chases the next pointer at offset 14. The instance with rsrc = a produces b; the instance with rsrc = b consumes b and produces c, so the value loaded by the producer and the next rsrc hold the same value.]
Example: p = p->next;
Page 39
Dependence-based prefetching: Roth, Moshovos, Sohi (HW) [Roth'98]
[Figure: for the load pair PC-A: LOAD rdest, rsrc(14) (producer of b) and PC-B: LOAD rdest, rsrc(14) (consumer of b / producer of c), a Potential Producer Window maps the loaded memory value b to the producer instruction address PC-A, and a Correlation Table maps the producer PC-A to the consumer PC-B together with the consumer instruction template LOAD r, r(14).]
Page 40
Runahead Execution [Mutlu’03]
Memory-level parallelism of a large window, without building it!
When the oldest instruction is an L2 miss:
❒ Checkpoint the state and enter runahead mode
In runahead mode:
❒ Instructions are speculatively pre-executed
❒ to discover other L2 misses
❒ the processor continues to run
Runahead mode ends when the original L2 miss returns
❒ The checkpoint is restored and normal execution resumes
Page 41
Runahead Example
[Figure: three execution timelines.
Small window: compute, Load 1 miss, stall, compute, Load 2 miss, stall, compute.
Perfect caches: compute, Load 1 hit, compute, Load 2 hit, compute.
Runahead: compute, Load 1 miss, runahead (discovers the Load 2 miss), then compute with Load 1 and Load 2 hitting; the overlapped misses yield saved cycles.]
Page 42
Benefits of Runahead Execution
Avoids stalling during an L2 cache miss!
Pre-executed loads/stores generate accurate prefetches
❒ for both regular and irregular access patterns
Instructions on the predicted path
❒ are prefetched into the I-cache and L2
The hardware prefetcher and branch predictor
❒ are trained using future access information
Page 43
Improving Cache Performance: Summary
Miss rate
❒ large block size
❒ higher associativity
❒ victim caches
❒ skewed-/pseudo-associativity
❒ hardware/software prefetching
❒ compiler optimizations
Miss penalty
❒ give priority to read misses over writes/writebacks
❒ subblock placement
❒ early restart and critical word first
❒ non-blocking caches
❒ multi-level caches
Hit time (difficult?)
❒ small and simple caches
❒ avoiding translation during L1 indexing (later)
❒ pipelining writes for fast write hits
❒ subblock placement for fast write hits in write-through caches