Page 1
CS 152 Computer Architecture and Engineering
CS 252 Graduate Computer Architecture
Lecture 14 – Multithreading
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
Page 2
Last Time in Lecture 13: VLIW
§ In a classic VLIW, compiler is responsible for avoiding all hazards -> simple hardware, complex compiler. Later VLIWs added more dynamic hardware interlocks
§ Use loop unrolling and software pipelining for loops, trace scheduling for more irregular code
§ Static scheduling difficult in presence of unpredictable branches and variable latency memory
Page 3
Multithreading
§ Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
§ Many workloads can make use of thread-level parallelism (TLP)
– TLP from multiprogramming (run independent sequential jobs)
– TLP from multithreaded applications (run one job faster using parallel threads)
§ Multithreading uses TLP to improve utilization of a single processor
Page 4
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
One way is to interleave execution of instructions from different program threads on same pipeline

                      t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LD x1,0(x2)       F  D  X  M  W
T2: ADD x7,x1,x4         F  D  X  M  W
T3: XORI x5,x4,12           F  D  X  M  W
T4: SD 0(x7),x5                F  D  X  M  W
T1: LD x5,12(x1)                  F  D  X  M  W

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe
Prior instruction in a thread always completes write-back before next instruction in same thread reads register file
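
The timing argument for the 4-thread interleave can be checked with a small sketch. It assumes the five stages F, D, X, M, W each take one cycle, that registers are read in D, and that a write-back must finish in a strictly earlier cycle than the dependent read; the function names are illustrative only.

```python
STAGES = ["F", "D", "X", "M", "W"]  # 5-stage pipe, one cycle per stage


def stage_cycles(fetch_cycle):
    """Map each stage to the cycle in which an instruction fetched at fetch_cycle occupies it."""
    return {stage: fetch_cycle + i for i, stage in enumerate(STAGES)}


def interleave_is_hazard_free(num_threads):
    """True if, under fixed round-robin interleave of num_threads threads, an
    instruction's write-back (W) finishes before the next instruction of the
    same thread reads the register file (D).

    Assumption: no bypassing, and W must be in a strictly earlier cycle than D.
    """
    first = stage_cycles(0)
    second = stage_cycles(num_threads)  # same thread fetches again N cycles later
    return first["W"] < second["D"]


for n in range(1, 6):
    print(n, interleave_is_hazard_free(n))
```

Under these assumptions the check first succeeds at four threads, matching the diagram: T1's first instruction writes back in t4 and T1's second instruction decodes in t5.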
Page 5
CDC 6600 Peripheral Processors (Cray, 1964)
§ First multithreaded hardware
§ 10 “virtual” I/O processors
§ Fixed interleave on simple pipeline
§ Pipeline has 100 ns cycle time
§ Each virtual processor executes one instruction every 1000 ns
§ Accumulator-based instruction set to reduce processor state
Page 6
Simple Multithreaded Pipeline
§ Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
§ Appears to software (including OS) as multiple, albeit slower, CPUs
[Figure: datapath of a simple multithreaded pipeline: four PCs and four GPR files chosen by a 2-bit thread select, sharing one I$, IR, and D$]
Page 7
Multithreading Costs
§ Each thread requires its own user state
– PC
– GPRs
§ Also, needs its own system state
– Virtual-memory page-table-base register
– Exception-handling registers
§ Other overheads:
– Additional cache/TLB conflicts from competing threads
– (or add larger cache/TLB capacity)
– More OS overhead to schedule more threads (where do all these threads come from?)
Page 8
Thread Scheduling Policies
§ Fixed interleave (CDC 6600 PPUs, 1964)
– Each of N threads executes one instruction every N cycles
– If thread not ready to go in its slot, insert pipeline bubble
§ Software-controlled interleave (TI ASC PPUs, 1971)
– OS allocates S pipeline slots amongst N threads
– Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
§ Hardware-controlled thread scheduling (HEP, 1982)
– Hardware keeps track of which threads are ready to go
– Picks next thread to execute based on hardware priority scheme
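
The three policies can be compared with a minimal sketch. The `ready` table (whether thread t could issue in cycle c) is a made-up input, and the round-robin priority used for the hardware-controlled case is an assumption, since the slide only says "hardware priority scheme".

```python
def slot_interleave(slot_table, ready, num_cycles):
    """Software-controlled interleave: hardware cycles through slot_table and
    issues from the thread named in the current slot, or inserts a bubble
    (None) if that thread is not ready."""
    trace = []
    for c in range(num_cycles):
        t = slot_table[c % len(slot_table)]
        trace.append(t if ready[t][c] else None)
    return trace


def fixed_interleave(ready, num_cycles):
    """Fixed interleave is the special case of one slot per thread."""
    return slot_interleave(list(range(len(ready))), ready, num_cycles)


def hardware_scheduled(ready, num_cycles):
    """Hardware-controlled scheduling: pick any ready thread each cycle.
    Assumption: round-robin priority starting after the last issued thread."""
    n = len(ready)
    trace = []
    last = n - 1
    for c in range(num_cycles):
        chosen = None
        for offset in range(1, n + 1):
            t = (last + offset) % n
            if ready[t][c]:
                chosen = t
                break
        if chosen is not None:
            last = chosen
        trace.append(chosen)
    return trace


# Hypothetical workload: thread 0 is always ready, thread 1 is stalled throughout.
ready = [[True] * 4, [False] * 4]
print(fixed_interleave(ready, 4))    # [0, None, 0, None]
print(hardware_scheduled(ready, 4))  # [0, 0, 0, 0]
```

With fixed interleave the stalled thread's slots become bubbles, while the hardware scheduler gives them to the ready thread.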
Page 9
Denelcor HEP (Burton Smith, 1982)
First commercial machine to use hardware threading in main CPU
– 120 threads per processor
– 10 MHz clock rate
– Up to 8 processors
– precursor to Tera MTA (Multithreaded Architecture)
Page 10
CS252
Tera MTA (1990-)
§ Up to 256 processors
§ Up to 128 active threads per processor
§ Processors and memory modules populate a sparse 3D torus interconnection fabric
§ Flat, shared main memory
– No data cache
– Sustains one main memory access per cycle per processor
§ GaAs logic in prototype, 1 kW/processor @ 260 MHz
– Second version CMOS, MTA-2, 50 W/processor
– Newer version, XMT, fits into AMD Opteron socket, runs at 500 MHz
– Newest version, XMT2, has higher memory bandwidth and capacity
Page 11
CS252
MTA Pipeline
[Figure: MTA pipeline, with blocks labeled Issue Pool, Inst Fetch, M, A, C, W, Write Pool, Memory Pool, Retry Pool, Memory pipeline, and Interconnection Network]
• Every cycle, one VLIW instruction from one active thread is launched into pipeline
• Instruction pipeline is 21 cycles long
• Memory operations incur ~150 cycles of latency
Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz…
What is single-thread performance?
Effective single-thread issue rate is 260/21 = 12.4 MIPS
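
The single-thread arithmetic can be reproduced in a few lines. The clock rate and pipeline depth come from the slide; the thread-count helper rests on the added assumption that each thread has at most one instruction in flight.

```python
CLOCK_MHZ = 260      # prototype clock rate, from the slide
PIPELINE_DEPTH = 21  # instruction pipeline length in cycles, from the slide


def single_thread_mips(clock_mhz, cycles_between_issues):
    """Millions of instructions per second for a thread that issues once
    every cycles_between_issues cycles."""
    return clock_mhz / cycles_between_issues


def threads_to_issue_every_cycle(cycles_between_issues):
    """Assumption: one instruction in flight per thread, so launching an
    instruction every cycle takes one thread per cycle of issue spacing."""
    return cycles_between_issues


print(round(single_thread_mips(CLOCK_MHZ, PIPELINE_DEPTH), 1))  # 12.4
print(threads_to_issue_every_cycle(PIPELINE_DEPTH))             # 21
```

Under that assumption, 21 active threads are enough to launch one instruction per cycle when no thread is waiting on memory.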
Page 12
Coarse-Grain Multithreading
§ Tera MTA designed for supercomputing applications with large data sets and low locality
– No data cache
– Many parallel threads needed to hide large memory latency
§ Other applications are more cache friendly
– Few pipeline bubbles if cache mostly has hits
– Just add a few threads to hide occasional cache miss latencies
– Swap threads on cache misses
Page 13
MIT Alewife (1990)
§ Modified SPARC chips
– register windows hold different thread contexts
§ Up to four threads per node
§ Thread switch on local cache miss
Page 14
IBM PowerPC RS64-IV (2000)
§ Commercial coarse-grain multithreading CPU
§ Based on PowerPC with quad-issue in-order five-stage pipeline
§ Each physical CPU supports two virtual CPUs
§ On L2 cache miss, pipeline is flushed and execution switches to second thread
– short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
– flush pipeline to simplify exception handling
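
A toy steady-state model gives a feel for why a 4-cycle flush is cheap next to a long miss. Only the 4-cycle flush penalty comes from the slide; the run length between misses (30 cycles) and the miss latency (90 cycles) are invented values, and the model assumes every thread behaves identically and threads are switched round-robin on each miss.

```python
def switch_on_miss_utilization(run_cycles, miss_latency, flush_penalty, num_threads):
    """Fraction of cycles doing useful work in a toy switch-on-miss model.

    Each thread runs run_cycles useful cycles, then misses for miss_latency
    cycles. With one thread the pipeline just waits. With several threads,
    each switch costs flush_penalty cycles, and the pipeline idles only if
    the other threads cannot cover the miss.
    """
    useful = num_threads * run_cycles
    if num_threads == 1:
        period = run_cycles + miss_latency
    else:
        period = max(num_threads * (run_cycles + flush_penalty),
                     run_cycles + miss_latency)
    return useful / period


print(switch_on_miss_utilization(30, 90, 4, 1))  # 0.25
print(switch_on_miss_utilization(30, 90, 4, 2))  # 0.5
```

In this model the second thread doubles utilization because the miss, not the flush, still dominates the period.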
Page 15
Oracle/Sun Niagara processors
§ Target is datacenters running web servers and databases, with many concurrent requests
§ Provide multiple simple cores each with multiple hardware threads, reduced energy/operation though much lower single thread performance
§ Niagara-1 [2004], 8 cores, 4 threads/core
§ Niagara-2 [2007], 8 cores, 8 threads/core
§ Niagara-3 [2009], 16 cores, 8 threads/core
§ T4 [2011], 8 cores, 8 threads/core
§ T5 [2012], 16 cores, 8 threads/core
§ M5 [2012], 6 cores, 8 threads/core
§ M6 [2013], 12 cores, 8 threads/core
Page 16
Oracle/Sun Niagara-3, “Rainbow Falls” 2009
Page 20
CS152 Administrivia
§ PS 3 due today
§ PS 4 out Wednesday March 18, due Friday April 3
§ Lab 3 due Monday April 6
Page 21
CS252
CS252 Administrivia
§ Reading today 3:30 on Zoom, link on Piazza
Page 22
Simultaneous Multithreading (SMT) for OoO Superscalars
§ Techniques presented so far have all been “vertical” multithreading where each pipeline stage works on one thread at a time
§ SMT uses fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on same clock cycle. Gives better utilization of machine resources.
Page 23
For most apps, most execution units lie idle in an OoO superscalar
From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism”, ISCA 1995.
For an 8-way superscalar.
Page 24
Superscalar Machine Efficiency
[Figure: grid of issue slots, axes Issue width and Time, marking instruction issue, completely idle cycles (vertical waste), and partially filled cycles, i.e., IPC < 4 (horizontal waste)]
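
The two kinds of waste can be counted directly from a per-cycle issue trace. This sketch uses the slide's definitions (a completely idle cycle is vertical waste, unused slots in a partially filled cycle are horizontal waste); the trace itself is a made-up example for a 4-wide machine.

```python
def issue_slot_waste(issued_per_cycle, issue_width):
    """Split issue slots into used, vertical waste, and horizontal waste.

    issued_per_cycle[c] is the number of instructions issued in cycle c.
    """
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)
    horizontal = sum(issue_width - n for n in issued_per_cycle if n > 0)
    return {
        "used": sum(issued_per_cycle),
        "vertical": vertical,
        "horizontal": horizontal,
        "total": issue_width * len(issued_per_cycle),
    }


# Hypothetical trace: six cycles on a 4-wide superscalar.
print(issue_slot_waste([2, 0, 4, 1, 0, 3], issue_width=4))
# {'used': 10, 'vertical': 8, 'horizontal': 6, 'total': 24}
```

The three categories always add up to the total number of slots, which makes the trace easy to sanity-check.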
Page 25
Vertical Multithreading
§ Cycle-by-cycle interleaving removes vertical waste, but leaves some horizontal waste
[Figure: grid of issue slots, axes Issue width and Time, with a second thread interleaved cycle-by-cycle; marks instruction issue and partially filled cycles, i.e., IPC < 4 (horizontal waste)]
Page 26
Chip Multiprocessing (CMP)
§ What is the effect of splitting into multiple processors?
– reduces horizontal waste,
– leaves some vertical waste, and
– puts upper limit on peak throughput of each thread.
[Figure: grid of issue slots, axes Issue width and Time]
Page 27
Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
§ Interleave multiple threads to multiple issue slots with no restrictions
[Figure: grid of issue slots, axes Issue width and Time]
Page 28
O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]
§ Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
§ Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads
§ OOO instruction window already has most of the circuitry required to schedule from multiple threads
§ Any single thread can utilize whole machine
Page 29
SMT adaptation to parallelism type
For regions with high thread-level parallelism (TLP) entire machine width is shared by all threads
[Figure: grid of issue slots, axes Issue width and Time]
For regions with low thread-level parallelism (TLP) entire machine width is available for instruction-level parallelism (ILP)
[Figure: grid of issue slots, axes Issue width and Time]
Page 30
Pentium-4 Hyperthreading (2002)
§ First commercial SMT design (2-way SMT)
§ Logical processors share nearly all resources of the physical processor
– Caches, execution units, branch predictors
§ Die area overhead of hyperthreading ~5%
§ When one logical processor is stalled, the other can make progress
– No logical processor can use all entries in queues when two threads are active
§ Processor running only one active software thread runs at approximately same speed with or without hyperthreading
§ Hyperthreading dropped on OoO P6 based follow-ons to Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with Nehalem generation machines in 2008.
§ Intel Atom (in-order x86 core) has two-way vertical multithreading
– Hyperthreading == (SMT for Intel OoO & Vertical for Intel InO)
Page 31
IBM Power 4
Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.
Page 32
[Figure: Power 4 and Power 5 pipelines compared; Power 5 labels: 2 fetch (PC), 2 initial decodes, 2 commits (architected register sets)]
Page 33
Power 5 data flow ...
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck
Page 34
Initial Performance of SMT
§ Pentium-4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
– Pentium-4 is dual-threaded SMT
– SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
§ Running on Pentium-4 each of 26 SPEC benchmarks paired with every other (26² runs), speed-ups from 0.90 to 1.58; average was 1.20
§ Power 5, 8-processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
§ Power 5 running 2 copies of each app, speedup between 0.89 and 1.41
– Most gained some
– Fl.Pt. apps had most cache conflicts and least gains
Page 35
SMT Performance: Application Interaction
Bulpin et al, “Multiprogramming Performance of Pentium 4 with Hyper-Threading”
So long as they aren’t banging on the L2 too.
Not affected by other programs
Your favorite benchmark from Lab 2
Page 36
SMT Performance: Application Interaction
Bulpin et al, “Multiprogramming Performance of Pentium 4 with Hyper-Threading”
Doesn’t play nice
Your favorite benchmark from Lab 2
Page 37
SMT Performance: Application Interaction
Bulpin et al, “Multiprogramming Performance of Pentium 4 with Hyper-Threading”
Very sensitive to second program
Page 38
Icount Choosing Policy
Fetch from thread with the fewest instructions in flight.
Why does this enhance throughput?
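
One way to explore the question is a small fetch model comparing Icount with plain round-robin fetch. Everything about the model is an assumption made for illustration: one thread is fetched per cycle, each thread drains a fixed number of instructions per cycle, and thread 1 starts stalled with 6 instructions already in flight.

```python
def icount_pick(in_flight, cycle):
    """Icount: fetch from the thread with the fewest instructions in flight
    (lowest index wins ties). cycle is unused, kept for a common signature."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])


def round_robin_pick(in_flight, cycle):
    """Baseline for comparison: alternate threads regardless of occupancy."""
    return cycle % len(in_flight)


def simulate_fetch(pick, in_flight, drain_rates, fetch_width, num_cycles):
    """Return (instructions fetched per thread, final in-flight counts)."""
    in_flight = list(in_flight)
    fetched = [0] * len(in_flight)
    for cycle in range(num_cycles):
        t = pick(in_flight, cycle)
        in_flight[t] += fetch_width
        fetched[t] += fetch_width
        for i, rate in enumerate(drain_rates):
            in_flight[i] = max(0, in_flight[i] - rate)
    return fetched, in_flight


# Hypothetical case: thread 0 drains 2 per cycle, thread 1 is stalled (drains 0).
print(simulate_fetch(icount_pick, [0, 6], [2, 0], 2, 6))       # ([12, 0], [0, 6])
print(simulate_fetch(round_robin_pick, [0, 6], [2, 0], 2, 6))  # ([6, 6], [0, 12])
```

In this model Icount steers all fetch bandwidth to the thread that is making progress, while round-robin keeps feeding the stalled thread and lets its instructions pile up in the shared queues.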
Page 39
Summary: Multithreaded Categories
[Figure: issue slots over time (processor cycle) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot]
Page 40
Multithreaded Design Discussion
§ Want to build a multithreaded processor, how should each component be changed and what are the tradeoffs?
§ L1 caches (instruction and data)
§ L2 caches
§ Branch predictor
§ TLB
§ Physical register file
Page 41
SMT & Security
§ Most hardware attacks rely on shared hardware resources to establish a side-channel
– E.g., shared outer caches, DRAM row buffers
§ SMT gives attackers high-BW access to previously private hardware resources that are shared by co-resident threads:
§ TLBs: TLBleed (June ’18)
§ L1 caches: CacheBleed (2016)
§ Functional unit ports: PortSmash (Nov ’18)
OpenBSD 6.4 -> Disabled HT in BIOS, AMD SMT to follow
Page 42
Acknowledgements
§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
– Arvind (MIT)
– Scott Beamer (UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)