Page 1
Lecture 17 Slide 1 EECS 470
© Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, Vijaykumar
EECS 470 Lecture 18: NUCA & Prefetching
Fall 2020
Prof. Ronald Dreslinski
http://www.eecs.umich.edu/courses/eecs470
[Figure: correlation prefetching example. A History Table records recent miss addresses (latest: A1); a Correlating Prediction Table maps the pair A0, A1 to A3, triggering "Prefetch A3".]
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lee, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Georgia Tech, Purdue University, University of Michigan, and University of Wisconsin.
Page 2
Non-Uniform Cache Architecture
Proposed at ASPLOS 2002 by UT-Austin (Kim, Burger, Keckler)
Facts
❒ Large shared on-die L2
❒ Wire delay dominating on-die cache access time
Cache access latency scaling:
❒ 3 cycles, 1 MB, 180 nm (1999)
❒ 11 cycles, 4 MB, 90 nm (2004)
❒ 24 cycles, 16 MB, 50 nm (2010)
Page 3
Multi-banked L2 cache
2 MB @ 130 nm: bank = 128 KB; bank access time = 3 cycles, interconnect delay = 8 cycles, total = 11 cycles
Page 4
Multi-banked L2 cache
16 MB @ 50 nm: bank = 64 KB; bank access time = 3 cycles, interconnect delay = 44 cycles, total = 47 cycles
Page 5
Static NUCA-1
❒ Uses a private per-bank channel
❒ Each bank has its own distinct access latency
❒ The location of data for a given address is statically decided
❒ Average access latency = 34.2 cycles
❒ Wire overhead = 20.9% → an issue
[Figure: S-NUCA-1 organization. Address and data buses connect a private channel to each bank; a bank is built from sub-banks with a tag array, predecoder, wordline driver and decoder, and sense amplifiers.]
Page 6
Static NUCA-2
❒ Uses a 2D switched network to alleviate the wire area overhead
❒ Average access latency = 24.2 cycles
❒ Wire overhead = 5.9%
[Figure: S-NUCA-2 organization. Banks are connected through switches on a 2D network; each bank contains a tag array, predecoder, and wordline driver and decoder, sharing a data bus.]
Page 7
Dynamic NUCA
❒ Data can dynamically migrate
❒ Move frequently used cache lines closer to the CPU
Page 8
Dynamic NUCA
Simple Mapping
❒ All 4 ways of each bank set need to be searched
❒ Farther bank sets → longer access
[Figure: 8 bank sets mapped across ways 0-3; one column of banks forms one set.]
Page 9
Dynamic NUCA
Fair Mapping
❒ Average access time across all bank sets is equal
[Figure: 8 bank sets mapped across ways 0-3 so that each set mixes near and far banks.]
Page 10
Dynamic NUCA
Shared Mapping
❒ The closest banks are shared among the farther bank sets
[Figure: 8 bank sets across ways 0-3, with the banks nearest the CPU shared by several sets.]
Page 11
The memory wall
Today: 1 memory access ≈ 500 arithmetic ops
How to reduce memory stalls for existing SW?
[Figure: processor vs. memory performance, 1985-2010, log scale from 1 to 10000; the processor curve pulls away from the memory curve. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.]
Page 12
Conventional approach #1: Avoid main memory accesses
Cache hierarchies:
❒ Trade off capacity for speed
Add more cache levels?
❒ Diminishing locality returns
❒ No help for shared data in MPs
[Figure: CPU backed by a 64K cache (2 clk), a 4M cache (20 clk), and main memory (200 clk); write data flows from the CPU down the hierarchy.]
Page 13
Conventional approach #2: Hide memory latency
Out-of-order execution:
❒ Overlap compute & memory stalls
Expand the OoO instruction window?
❒ Issue & load-store logic are hard to scale
❒ No help for dependent instructions
[Figure: execution timelines for in-order vs. OoO; OoO overlaps compute with part of the memory stall.]
Page 14
Challenges of server apps
❒ Frequent sharing & synchronization
❒ Many linked data structures (e.g., linked list, B+-tree, …): dependent miss chains [Ranganathan98]
❒ Low memory-level parallelism [Chou04]
❒ 50-66% of time stalled on memory [Trancoso97][Barroso98][Ailamaki99]
Our goals:
❒ Read misses: fetch earlier & in parallel
❒ Write misses: never stall
[Figure: execution timelines, today vs. goal; the goal timeline removes most of the memory stall from the compute stream.]
Page 15
What is Prefetching?
• Fetch memory ahead of time
• Targets compulsory, capacity, & coherence misses
Big challenges:
1. Knowing "what" to fetch
• Fetching useless info wastes valuable resources
2. Knowing "when" to fetch it
• Fetching too early clutters storage
• Fetching too late defeats the purpose of "pre"-fetching
Page 16
Software Prefetching
Compiler/programmer places prefetch instructions
❒ requires ISA support
❒ why not use regular loads?
❒ found in recent ISAs such as SPARC V9
Prefetch into
❒ a register (binding)
❒ caches (non-binding)
Page 17
Software Prefetching (Cont.)
e.g.,
for (i = 1; i < rows; i++)
  for (j = 1; j < columns; j++)
  {
    prefetch(&x[i+1][j]);
    sum = sum + x[i][j];
  }
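In compilable C, the same loop can be written with the GCC/Clang builtin __builtin_prefetch. This is a sketch; the matrix dimensions, prefetch distance, and locality hint are illustrative choices, not from the slides.

```c
#include <stddef.h>

#define ROWS 8
#define COLS 8

/* Sum a matrix while prefetching the same column of the next row,
   mirroring the slide's prefetch(&x[i+1][j]) example. */
double sum_with_prefetch(double x[ROWS][COLS]) {
    double sum = 0.0;
    for (size_t i = 1; i < ROWS; i++) {
        for (size_t j = 1; j < COLS; j++) {
            if (i + 1 < ROWS)
                /* non-binding read prefetch; harmless if it misses */
                __builtin_prefetch(&x[i + 1][j], /*rw=*/0, /*locality=*/1);
            sum += x[i][j];
        }
    }
    return sum;
}
```

The prefetch is non-binding (it only warms the cache), so the loop computes exactly the same sum with or without it.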
Page 18
Hardware Prefetching
What to prefetch?
❒ one block spatially ahead?
❒ use address predictors → works for regular patterns (x, x+8, x+16, …)
When to prefetch?
❒ on every reference
❒ on every miss
❒ when prior prefetched data is referenced
❒ upon the last processor reference
❒ use more complicated rate-matching mechanisms
Where to put prefetched data?
❒ auxiliary buffers
❒ caches
Page 19
Generalized Access Pattern Prefetchers
How do you prefetch:
1. Heap data structures?
2. Indirect array accesses?
3. Generalized memory access patterns?
Taxonomy of approaches:
• Spatial prefetchers
• Address-correlating prefetchers
• Precomputation prefetchers
Page 20
Spatial Locality and Sequential Prefetching
Works well for the I-cache
❒ Instruction fetching tends to access memory sequentially
Doesn't work very well for the D-cache
❒ More irregular access patterns
❒ regular patterns may have a non-unit stride (e.g., matrix code)
Relatively easy to implement
❒ Large cache block sizes already have the effect of prefetching
❒ After loading one cache line, start loading the next line automatically if that line is not in the cache and the bus is not busy
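The last bullet (next-line prefetch) can be modeled in a few lines of C. The toy cache below only tracks which line addresses are resident; the table size, helper names, and direct-mapped policy are invented for illustration.

```c
#include <stdbool.h>

#define NLINES 256                 /* toy direct-mapped residence tracker */

static long resident[NLINES];      /* line address per slot, -1 = empty */

static void cache_init(void) {
    for (int i = 0; i < NLINES; i++)
        resident[i] = -1;
}

static bool cache_has(long line)  { return resident[line % NLINES] == line; }
static void cache_fill(long line) { resident[line % NLINES] = line; }

/* Demand access to `line`: fill on a miss, then sequentially prefetch
   line+1 if it is absent (the next-line policy from the slide).
   Returns whether the demand access itself hit. */
static bool access_line(long line) {
    bool hit = cache_has(line);
    if (!hit)
        cache_fill(line);
    if (!cache_has(line + 1))
        cache_fill(line + 1);      /* the "pre" in prefetch */
    return hit;
}
```

A real implementation would also check that the bus is idle before issuing the prefetch; that throttling is omitted here.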
Page 21
PC-based Stride Prefetchers
Array/stride accesses correlated to a static load instruction [Baer'91]
Reference Prediction Table, indexed by load PC:
❒ records the load PC (tag), last address, & stride between the last two addresses
On a load → compute the distance between the current & last address
• if the new distance matches the old stride → found a pattern; go ahead and prefetch "current addr + stride"
• update "last addr" and "last stride" for the next lookup
[Table: Load Inst. PC (tag) | Last Address Referenced | Last Stride | Flags]
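A minimal sketch of such a Reference Prediction Table in C. The table size, tag handling, and the zero return value for "no prefetch" are simplifying assumptions, not the original hardware design.

```c
#include <stdbool.h>

#define RPT_SIZE 64

typedef struct {
    unsigned long pc;          /* tag: the load's PC */
    unsigned long last_addr;   /* last address referenced */
    long          last_stride; /* stride between the last two addresses */
    bool          valid;
} RPTEntry;

static RPTEntry rpt[RPT_SIZE];

/* Update the RPT for one dynamic load; returns the prefetch address,
   or 0 when no stable stride has been confirmed yet. */
unsigned long rpt_access(unsigned long pc, unsigned long addr) {
    RPTEntry *e = &rpt[pc % RPT_SIZE];
    unsigned long prefetch = 0;
    if (e->valid && e->pc == pc) {
        long stride = (long)(addr - e->last_addr);
        if (stride == e->last_stride && stride != 0)
            prefetch = addr + stride;   /* pattern confirmed */
        e->last_stride = stride;        /* update for next lookup */
    } else {
        e->pc = pc;                     /* allocate a fresh entry */
        e->last_stride = 0;
        e->valid = true;
    }
    e->last_addr = addr;
    return prefetch;
}
```

Note the two-step confirmation: the first occurrence of a stride only trains the entry, and a prefetch is issued only once the same stride repeats.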
Page 22
Stream Buffers [Jouppi]
Each stream buffer holds one stream of sequentially prefetched cache lines
On a load miss, check the head of all stream buffers for an address match
❒ if hit, pop the entry from the FIFO and update the cache with the data
❒ if not, allocate a new stream buffer to the new miss address (may have to recycle a stream buffer following an LRU policy)
Stream buffer FIFOs are continuously topped off with subsequent cache lines whenever there is room and the bus is not busy
Stream buffers can incorporate stride prediction mechanisms to support non-unit-stride streams
No cache pollution, but what about indirect array accesses (e.g., A[B[i]])?
[Figure: several stream-buffer FIFOs sit between the D-cache and the memory interface.]
Page 23
Global History Buffer (GHB) [Nesbit’04]
Holds the miss address history in FIFO order
Linked lists within the GHB connect related addresses
❒ Same static load (PC/DC)
❒ Same global miss address (G/AC)
A linked-list walk is short compared with the L2 miss latency
[Figure: an Index Table, keyed by load PC, points into the Global History Buffer, a FIFO of miss addresses.]
Page 24
GHB – Deltas
Miss address stream: 27 28 36 44 45 49 53 54 62 70 71
Delta stream: 1 8 8 1 4 4 1 8 8 1
[Figure: a Markov graph over the deltas 1, 4, 8 (transition probabilities .3/.7), with width, depth, and hybrid prefetching policies; from the current miss 71 they issue prefetches such as 71 + 8 => 79 and 79 + 8 => 87 (depth), or 71 + 8 => 79 and 71 + 4 => 75 (width).]
Page 25
GHB – Stride Prefetching
GHB-Stride uses the PC to access the index table
The linked lists contain the local history of each load
Compare the last two local strides; if they are the same, then prefetch n+s, n+2s, …, n+ds
[Figure: the Index Table, keyed by load PC, holds head pointers into the Global History Buffer; a per-PC linked list threads the local miss addresses A B C A B C B through the FIFO.]
Page 26
GHB – Delta Correlation (PC/DC)
Form delta correlations within each load's local history
For example, consider the local miss address stream:
Addresses: 0 1 2 64 65 66 128 129
Deltas: 1 1 62 1 1 62 1
Correlation key → prefetch predictions: (1,1) → 62, 1, 1; (1,62) → 1, 1, 62; (62,1) → 1, 62, 1
Best results among data prefetchers for SPEC2K [GraciaPérez'04]
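A software sketch of the PC/DC idea: find the most recent earlier occurrence of the last delta pair in one load's local history and replay the deltas that followed it as predictions. The function name and the fixed pair width of two are assumptions for illustration.

```c
#include <stddef.h>

/* Given one load's local miss-address history, predict up to n future
   addresses by delta correlation. Returns how many were written to out. */
size_t pcdc_predict(const long *hist, size_t len, long *out, size_t n) {
    if (len < 4)
        return 0;
    /* the most recent delta pair */
    long d1 = hist[len - 2] - hist[len - 3];
    long d2 = hist[len - 1] - hist[len - 2];
    /* scan backwards for an earlier occurrence of the pair (d1, d2) */
    for (size_t i = len - 2; i >= 2; i--) {
        if (hist[i - 1] - hist[i - 2] == d1 && hist[i] - hist[i - 1] == d2) {
            size_t count = 0;
            long addr = hist[len - 1];
            /* replay the deltas that followed the earlier occurrence */
            for (size_t j = i + 1; j < len && count < n; j++) {
                addr += hist[j] - hist[j - 1];
                out[count++] = addr;
            }
            return count;
        }
    }
    return 0;
}
```

On the slide's example stream the current pair is (62, 1); the match replays the deltas 1, 62, 1, predicting 130, 192, and 193.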
Page 27
Spatial Correlation
Repetitive spatial relationships between accesses
• Irregular layout → non-strided
• Sparse → can't capture with cache blocks
• But repetitive → predict to improve memory-level parallelism
Not to be confused with spatial locality:
• patterns may repeat over large (e.g., few-kB) regions
Page 28
Example: Spatial correlation [Somogyi'06]
Large-scale spatial access patterns; the pattern is a function of the program
[Figure: an 8 kB database page in memory (page header, tuple data, tuple slot index) touched in a sparse but recurring pattern.]
Page 29
SMS Operation Summary
[Figure: SMS operation. (1) Observe: record the pattern of accesses over a spatial region (PC1 ld A+4, PC2 ld A, PC3 ld A+3) until the trigger block A+3 is evicted. (2) Store: spatial patterns such as 1100000001101… and 1100001010001… are stored in a pattern history table. (3) Predict: on the next trigger access to a new region (PC2 ld B, PC1 ld B+4, PC3 ld B+3), replay the stored pattern over region B.]
Page 30
Correlation-Based Prefetching [Charney 96]
Consider the following history of load addresses emitted by a processor:
A, B, C, D, C, E, A, C, F, F, E, A, A, B, C, D, E, A, B, C, D, C
After referencing a particular address (say A or E), are some addresses more likely to be referenced next?
[Figure: a Markov model over the addresses A-F, with transition probabilities learned from the history (edges weighted .2, .33, .5, .6, .67, 1.0).]
Page 31
Correlation-Based Prefetching
Track the likely next addresses after seeing a particular address
Prefetch accuracy is generally low, so prefetch up to N next addresses to increase coverage (but this wastes bandwidth)
Prefetch accuracy can be improved by using a longer history
❒ Decide which address to prefetch next by looking at the last K load addresses instead of just the current one
❒ e.g., index with the XOR of the data addresses from the last K loads
❒ Using a history of a couple of loads can increase accuracy dramatically
This technique can also be applied to just the load miss stream
[Table, indexed by load data address (tag): Prefetch Candidate 1 … Prefetch Candidate N, each with a confidence]
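One way to sketch this N-candidate correlation table in C. The MRU ordering among candidates and the table sizes are assumptions, and the per-candidate confidence counters from the slide are omitted for brevity.

```c
#include <string.h>

#define TBL   128
#define NCAND 4

typedef struct {
    long addr;            /* tag; -1 = empty */
    long cand[NCAND];     /* successor candidates, MRU first */
    int  ncand;
} CorrEntry;

static CorrEntry tbl[TBL];
static long prev = -1;    /* previously observed address */

void corr_init(void) {
    for (int i = 0; i < TBL; i++)
        tbl[i].addr = -1;
    prev = -1;
}

/* Observe `addr`; write up to NCAND predicted successors of addr into
   out and return how many were predicted. */
int corr_access(long addr, long *out) {
    /* train: record addr as a successor of the previous address */
    if (prev >= 0) {
        CorrEntry *e = &tbl[prev % TBL];
        if (e->addr != prev) { e->addr = prev; e->ncand = 0; }
        int i;
        for (i = 0; i < e->ncand; i++)
            if (e->cand[i] == addr)
                break;
        if (i == e->ncand && e->ncand < NCAND)
            e->ncand++;
        /* move addr to the MRU slot, evicting the LRU candidate if full */
        for (int j = (i < e->ncand ? i : e->ncand - 1); j > 0; j--)
            e->cand[j] = e->cand[j - 1];
        e->cand[0] = addr;
    }
    prev = addr;
    /* predict: every recorded successor of addr is a prefetch candidate */
    CorrEntry *e = &tbl[addr % TBL];
    if (e->addr != addr)
        return 0;
    memcpy(out, e->cand, (size_t)e->ncand * sizeof(long));
    return e->ncand;
}
```

Prefetching all recorded candidates is the coverage-over-accuracy tradeoff from the slide: more useful prefetches, at the cost of bandwidth.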
Page 32
Example: Markov Prefetchers [Joseph'97]
• Correlate subsequent cache misses
• Trigger a prefetch on a miss
• Width prefetching: predict & prefetch four candidates
❒ predicting only one results in low coverage!
• Prefetch into a buffer
Page 33
Tag-Correlating Prefetchers [Kaxiras’04]
Correlation-based prefetching:
• tables are too big
• they grow with the data working-set size
Much similarity among block addresses mapping to different sets
• when marching through arrays, the tags across sets are identical!
• save space in correlation tables by correlating on tags only (not full addresses)
Page 34
Revisit: Global History Buffer (GHB) [Nesbit’04]
Holds the miss address history in FIFO order
Linked lists within the GHB connect related addresses
❒ Same static load (PC/DC)
❒ Same global miss address (G/AC)
A linked-list walk is short compared with the L2 miss latency
[Figure: an Index Table, keyed by miss address, points into the Global History Buffer, a FIFO of miss addresses.]
Page 35
GHB (G/AC) – Example
Miss address stream: 27 28 29 27 28 29 28
[Figure: the Index Table, keyed by miss address, holds head pointers into the GHB FIFO; the linked list for 28 threads its earlier occurrences, so from the current miss 28 the walk recovers the successor history and prefetches 29.]
Page 36
Linked-Data Prefetchers
• When traversing a linked structure:
• Learn/record load-to-load dependences
• Get ahead of the processor by traversing the structure in an FSM
• The FSM gets ahead of the processor by skipping the computation
❒ Similar proposals exist with SW help (e.g., helper/scout threads)
• Example:
while (p != NULL) {
  if (p->key == MATCH)
    p->val++;
  p = p->next;
}
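In software, a similar overlap can be approximated by prefetching a couple of links ahead of the node currently being processed. This sketch uses the GCC/Clang builtin __builtin_prefetch; the node layout and function name are illustrative.

```c
#include <stddef.h>

typedef struct Node {
    int key, val;
    struct Node *next;
} Node;

/* Count nodes whose key matches, prefetching two links ahead so the
   pointer chase overlaps with the per-node work. */
int count_matches(Node *p, int match) {
    int hits = 0;
    while (p != NULL) {
        if (p->next != NULL)
            /* non-binding prefetch; a NULL address is simply ignored */
            __builtin_prefetch(p->next->next, /*rw=*/0, /*locality=*/1);
        if (p->key == match)
            hits++;
        p = p->next;
    }
    return hits;
}
```

The hardware schemes on the following slides do the same thing transparently, by learning which load produces the address that a later load consumes.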
Page 37
Linked Data Structure Access
[Figure: a chain of list nodes; each node has fields at offsets 0, 4, 8, and 12, with the next pointer at offset 14.]
Page 38
Detecting Recursive Accesses
[Figure: the recurring instruction LOAD rdest, rsrc(14) chases the next pointer at offset 14. The instance with rsrc = a produces b; the instance with rsrc = b consumes b and produces c, so the value loaded by the producer and the next rsrc hold the same value.]
Example: p = p->next;
Page 39
Dependence-based prefetching: Roth, Moshovos, Sohi (HW) [Roth'98]
[Figure: for the load pair PC-A: LOAD rdest, rsrc(14) (producer of b) and PC-B: LOAD rdest, rsrc(14) (consumer of b / producer of c), a Potential Producer Window maps the loaded memory value b to the producer instruction address PC-A, and a Correlation Table maps the producer PC-A to the consumer PC-B together with the consumer instruction template LOAD r, r(14).]
Page 40
Runahead Execution [Mutlu’03]
Memory-level parallelism of a large window, without building it!
When the oldest instruction is an L2 miss:
❒ Checkpoint the state and enter runahead mode
In runahead mode:
❒ Instructions are speculatively pre-executed
❒ to discover other L2 misses
❒ the processor continues to run
Runahead mode ends when the original L2 miss returns
❒ The checkpoint is restored and normal execution resumes
Page 41
Runahead Example
[Figure: three execution timelines.
Small window: compute, Load 1 miss, stall, compute, Load 2 miss, stall, compute.
Perfect caches: compute, Load 1 hit, compute, Load 2 hit, compute.
Runahead: compute, Load 1 miss, runahead (discovers the Load 2 miss), then compute with Load 1 and Load 2 hitting; the overlapped misses yield saved cycles.]
Page 42
Benefits of Runahead Execution
Avoids stalling during an L2 cache miss!
Pre-executed loads/stores generate accurate prefetches
❒ for both regular and irregular access patterns
Instructions on the predicted path
❒ are prefetched into the I-cache and L2
The hardware prefetcher and branch predictor
❒ are trained using future access information
Page 43
Improving Cache Performance: Summary
Miss rate
❒ large block size
❒ higher associativity
❒ victim caches
❒ skewed-/pseudo-associativity
❒ hardware/software prefetching
❒ compiler optimizations
Miss penalty
❒ give priority to read misses over writes/writebacks
❒ subblock placement
❒ early restart and critical word first
❒ non-blocking caches
❒ multi-level caches
Hit time (difficult?)
❒ small and simple caches
❒ avoiding translation during L1 indexing (later)
❒ pipelining writes for fast write hits
❒ subblock placement for fast write hits in write-through caches