
CS 152 Computer Architecture and Engineering
CS 252 Graduate Computer Architecture

Lecture 14 – Multithreading

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley

http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

Last Time, Lecture 13: VLIW

§ In a classic VLIW, the compiler is responsible for avoiding all hazards -> simple hardware, complex compiler. Later VLIWs added more dynamic hardware interlocks.
§ Use loop unrolling and software pipelining for loops, trace scheduling for more irregular code
§ Static scheduling is difficult in the presence of unpredictable branches and variable-latency memory

Multithreading

§ Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
§ Many workloads can make use of thread-level parallelism (TLP)
  – TLP from multiprogramming (run independent sequential jobs)
  – TLP from multithreaded applications (run one job faster using parallel threads)
§ Multithreading uses TLP to improve utilization of a single processor

Multithreading

How can we guarantee no dependencies between instructions in a pipeline?

One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe:

                      t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LD   x1, 0(x2)    F  D  X  M  W
T2: ADD  x7, x1, x4      F  D  X  M  W
T3: XORI x5, x4, 12         F  D  X  M  W
T4: SD   0(x7), x5             F  D  X  M  W
T1: LD   x5, 12(x1)               F  D  X  M  W

Prior instruction in a thread always completes write-back before next instruction in same thread reads register file
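The timing argument above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: registers are read in D and written in W, one instruction is fetched per cycle in strict round-robin order, and a same-cycle write-then-read is not allowed. The function names are illustrative and do not come from the lecture.

```python
STAGES = ["F", "D", "X", "M", "W"]  # the 5-stage pipe from the slide


def stage_cycle(fetch_cycle: int, stage: str) -> int:
    """Cycle in which an instruction fetched at fetch_cycle occupies `stage`."""
    return fetch_cycle + STAGES.index(stage)


def interleave_is_hazard_free(num_threads: int) -> bool:
    """True if write-back of one instruction finishes strictly before the
    next instruction of the same thread reads registers in decode.

    Assumption: fixed round-robin interleave, so the same thread fetches
    again exactly num_threads cycles later, and there is no bypassing.
    """
    first_fetch = 0
    next_fetch = first_fetch + num_threads
    return stage_cycle(first_fetch, "W") < stage_cycle(next_fetch, "D")


def min_threads_for_no_hazards() -> int:
    """Smallest thread count for which the interleave needs no interlocks."""
    n = 1
    while not interleave_is_hazard_free(n):
        n += 1
    return n
```

Under these assumptions the smallest hazard-free interleave is four threads, which matches the T1-T4 diagram: the first LD writes back in t4 and the second T1 instruction decodes in t5.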

CDC 6600 Peripheral Processors (Cray, 1964)

§ First multithreaded hardware
§ 10 “virtual” I/O processors
§ Fixed interleave on simple pipeline
§ Pipeline has 100 ns cycle time
§ Each virtual processor executes one instruction every 1000 ns
§ Accumulator-based instruction set to reduce processor state

Simple Multithreaded Pipeline

[Figure: simple multithreaded pipeline datapath. Labels: +1, 2-bit thread select, four PCs, I$, IR, four GPR files, X, Y, D$.]

§ Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
§ Appears to software (including OS) as multiple, albeit slower, CPUs

Multithreading Costs

§ Each thread requires its own user state
  – PC
  – GPRs
§ Also, needs its own system state
  – Virtual-memory page-table-base register
  – Exception-handling registers
§ Other overheads:
  – Additional cache/TLB conflicts from competing threads
  – (or add larger cache/TLB capacity)
  – More OS overhead to schedule more threads (where do all these threads come from?)
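The per-thread state listed above can be pictured as one record per hardware thread, indexed by the thread-select value. The sketch below is only an illustration: the register count of 32, the two exception register names, and the class names are assumptions, not details from the lecture.

```python
from dataclasses import dataclass, field
from typing import List

NUM_GPRS = 32  # assumed register count; the slide does not give one


@dataclass
class ThreadContext:
    # User state: replicated per thread
    pc: int = 0
    gprs: List[int] = field(default_factory=lambda: [0] * NUM_GPRS)
    # System state: also replicated per thread
    page_table_base: int = 0
    exception_pc: int = 0      # hypothetical exception-handling register
    exception_cause: int = 0   # hypothetical exception-handling register


@dataclass
class MultithreadedCore:
    contexts: List[ThreadContext]

    def read_gpr(self, thread_select: int, reg: int) -> int:
        """Thread select picks which copy of the register file is read."""
        return self.contexts[thread_select].gprs[reg]

    def write_gpr(self, thread_select: int, reg: int, value: int) -> None:
        """Thread select must accompany the write so the right copy is updated."""
        self.contexts[thread_select].gprs[reg] = value
```

For example, `MultithreadedCore([ThreadContext() for _ in range(4)])` models four hardware threads; a write made with thread select 1 is invisible to a read made with thread select 0.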


Thread Scheduling Policies

§ Fixed interleave (CDC 6600 PPUs, 1964)
  – Each of N threads executes one instruction every N cycles
  – If thread not ready to go in its slot, insert pipeline bubble
§ Software-controlled interleave (TI ASC PPUs, 1971)
  – OS allocates S pipeline slots amongst N threads
  – Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
§ Hardware-controlled thread scheduling (HEP, 1982)
  – Hardware keeps track of which threads are ready to go
  – Picks next thread to execute based on hardware priority scheme
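The three policies differ only in how the thread for the next pipeline slot is chosen, which a short sketch can make concrete. The function names, the `ready` bit vector, and the fixed priority list are assumptions made for illustration. In particular, the slide does not describe the HEP priority scheme or what the software-controlled scheme does when a slot's thread is not ready, so a bubble is assumed there.

```python
from typing import Optional, Sequence


def fixed_interleave(cycle: int, ready: Sequence[bool]) -> Optional[int]:
    """Thread (cycle mod N) owns the slot; None models a pipeline bubble."""
    owner = cycle % len(ready)
    return owner if ready[owner] else None


def software_interleave(cycle: int, slots: Sequence[int],
                        ready: Sequence[bool]) -> Optional[int]:
    """Hardware walks an OS-filled table of S slots in fixed order.

    Assumption: a slot whose thread is not ready becomes a bubble.
    """
    owner = slots[cycle % len(slots)]
    return owner if ready[owner] else None


def hardware_scheduled(ready: Sequence[bool],
                       priority: Sequence[int]) -> Optional[int]:
    """Pick the first ready thread in a (hypothetical) fixed priority order."""
    for tid in priority:
        if ready[tid]:
            return tid
    return None
```

With `ready = [True, False, True, True]`, fixed interleave wastes every slot owned by thread 1, while the hardware-controlled version only produces a bubble when no thread at all is ready. The software-controlled table lets the OS give a thread more than one of the S slots.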


Denelcor HEP (Burton Smith, 1982)

First commercial machine to use hardware threading in main CPU
  – 120 threads per processor
  – 10 MHz clock rate
  – Up to 8 processors
  – Precursor to Tera MTA (Multithreaded Architecture)

CS252

Tera MTA (1990-)

§ Up to 256 processors
§ Up to 128 active threads per processor
§ Processors and memory modules populate a sparse 3D torus interconnection fabric
§ Flat, shared main memory
  – No data cache
  – Sustains one main memory access per cycle per processor
§ GaAs logic in prototype, 1 kW/processor @ 260 MHz
  – Second version CMOS, MTA-2, 50 W/processor
  – Newer version, XMT, fits into AMD Opteron socket, runs at 500 MHz
  – Newest version, XMT2, has higher memory bandwidth and capacity

CS252

MTA Pipeline

[Figure: MTA pipeline. Labels: Issue Pool, Inst Fetch, M, A, C, W, Write Pool, Memory Pool, Retry Pool, Interconnection Network, Memory pipeline.]

• Every cycle, one VLIW instruction from one active thread is launched into pipeline
• Instruction pipeline is 21 cycles long
• Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz…

What is single-thread performance?

Effective single-thread issue rate is 260/21 = 12.4 MIPS
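The arithmetic can be written out as a small helper. This is a sketch with illustrative function names. The second function adds one step that is implied rather than stated on the slide: if each thread can issue only once every 21 cycles, at least 21 active threads are needed to launch an instruction every cycle.

```python
def single_thread_mips(clock_mhz: float, cycles_between_issues: int) -> float:
    """Issue rate of one thread, in millions of instructions per second."""
    return clock_mhz / cycles_between_issues


def threads_to_fill_pipeline(cycles_between_issues: int) -> int:
    """Minimum active threads needed to launch one instruction every cycle,
    assuming every thread issues at most once per cycles_between_issues."""
    return cycles_between_issues


mta_rate = single_thread_mips(260.0, 21)   # about 12.4 MIPS
mta_threads = threads_to_fill_pipeline(21)
```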

Coarse-Grain Multithreading

§ Tera MTA designed for supercomputing applications with large data sets and low locality
  – No data cache
  – Many parallel threads needed to hide large memory latency
§ Other applications are more cache friendly
  – Few pipeline bubbles if cache mostly has hits
  – Just add a few threads to hide occasional cache miss latencies
  – Swap threads on cache misses

MIT Alewife (1990)

§ Modified SPARC chips
  – Register windows hold different thread contexts
§ Up to four threads per node
§ Thread switch on local cache miss

IBM PowerPC RS64-IV (2000)

§ Commercial coarse-grain multithreading CPU
§ Based on PowerPC with quad-issue in-order five-stage pipeline
§ Each physical CPU supports two virtual CPUs
§ On L2 cache miss, pipeline is flushed and execution switches to second thread
  – Short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  – Flush pipeline to simplify exception handling
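A rough model shows why a 4-cycle flush penalty is tolerable. The sketch below is a simplified analytical model, not anything taken from the RS64-IV documentation. It assumes every thread runs the same number of cycles between misses, every miss has the same latency, and the switch penalty overlaps with the outstanding miss. The run length and miss latency used in the example are made-up values; only the 4-cycle penalty comes from the slide.

```python
def coarse_grain_utilization(num_threads: int, run_cycles: int,
                             miss_latency: int, switch_penalty: int) -> float:
    """Fraction of cycles doing useful work under switch-on-miss.

    Simplified model (assumption): each thread runs run_cycles, misses,
    and is ready again miss_latency cycles later; each switch costs
    switch_penalty cycles of flushed work.
    """
    if num_threads == 1:
        # No other thread to switch to: the pipeline just waits for the miss.
        return run_cycles / (run_cycles + miss_latency)
    # Too few threads: the miss latency is only partly hidden.
    latency_bound = num_threads * run_cycles / (run_cycles + miss_latency)
    # Enough threads: only the switch penalty is lost.
    switch_bound = run_cycles / (run_cycles + switch_penalty)
    return min(latency_bound, switch_bound)


# Hypothetical workload: 100 cycles between L2 misses, 300-cycle miss latency.
one_thread = coarse_grain_utilization(1, 100, 300, 4)
two_threads = coarse_grain_utilization(2, 100, 300, 4)
```

In this made-up case a second thread doubles utilization from 0.25 to 0.5, and the 4-cycle penalty only becomes the limiting term once enough threads are available to cover the whole miss.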


Oracle/Sun Niagara processors

§ Target is datacenters running web servers and databases, with many concurrent requests
§ Provide multiple simple cores each with multiple hardware threads, reduced energy/operation though much lower single thread performance
§ Niagara-1 [2004], 8 cores, 4 threads/core
§ Niagara-2 [2007], 8 cores, 8 threads/core
§ Niagara-3 [2009], 16 cores, 8 threads/core
§ T4 [2011], 8 cores, 8 threads/core
§ T5 [2012], 16 cores, 8 threads/core
§ M5 [2012], 6 cores, 8 threads/core
§ M6 [2013], 12 cores, 8 threads/core
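The number of hardware threads the OS sees per chip is simply cores times threads per core. The short sketch below tabulates it for the parts listed above; the variable and function names are illustrative.

```python
# (name, year, cores, threads per core), copied from the list above
NIAGARA_FAMILY = [
    ("Niagara-1", 2004, 8, 4),
    ("Niagara-2", 2007, 8, 8),
    ("Niagara-3", 2009, 16, 8),
    ("T4", 2011, 8, 8),
    ("T5", 2012, 16, 8),
    ("M5", 2012, 6, 8),
    ("M6", 2013, 12, 8),
]


def threads_per_chip(family=NIAGARA_FAMILY) -> dict:
    """Map each processor name to its total hardware thread count."""
    return {name: cores * tpc for name, _year, cores, tpc in family}


totals = threads_per_chip()
```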

Oracle/Sun Niagara-3, “Rainbow Falls” 2009

Oracle M6 - 2013

CS152 Administrivia

§ PS 3 due today
§ PS 4 out Wednesday March 18, due Friday April 3
§ Lab 3 due Monday April 6

CS252

CS252 Administrivia

§ Reading today 3:30 on Zoom, link on Piazza

Simultaneous Multithreading (SMT) for OoO Superscalars

§ Techniques presented so far have all been “vertical” multithreading where each pipeline stage works on one thread at a time
§ SMT uses fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on same clock cycle. Gives better utilization of machine resources.

For most apps, most execution units lie idle in an OoO superscalar

For an 8-way superscalar.

From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism”, ISCA 1995.

Superscalar Machine Efficiency

[Figure: grid of issue slots (axes: issue width, time). Legend: instruction issue; completely idle cycle (vertical waste); partially filled cycle, i.e., IPC < 4 (horizontal waste).]
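The two kinds of waste can be counted directly from a per-cycle issue trace. This is a minimal sketch with an illustrative function name and a made-up trace; the issue width of 4 follows the "IPC < 4" label in the figure.

```python
from typing import Dict, Sequence


def issue_slot_waste(issued_per_cycle: Sequence[int],
                     issue_width: int) -> Dict[str, int]:
    """Classify every issue slot as used, vertical waste, or horizontal waste.

    Vertical waste: all slots of a completely idle cycle.
    Horizontal waste: the unused slots of a partially filled cycle.
    """
    used = vertical = horizontal = 0
    for n in issued_per_cycle:
        if not 0 <= n <= issue_width:
            raise ValueError("issued count must be between 0 and issue_width")
        used += n
        if n == 0:
            vertical += issue_width
        else:
            horizontal += issue_width - n
    return {"used": used, "vertical": vertical, "horizontal": horizontal}


# Hypothetical six-cycle trace on a 4-wide machine.
waste = issue_slot_waste([4, 0, 2, 1, 0, 3], issue_width=4)
```

For this trace, 10 of the 24 slots are used, 8 are vertical waste (two idle cycles), and 6 are horizontal waste.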

Vertical Multithreading

§ Cycle-by-cycle interleaving removes vertical waste, but leaves some horizontal waste

[Figure: grid of issue slots (axes: issue width, time) with a second thread interleaved cycle-by-cycle. Legend: instruction issue; partially filled cycle, i.e., IPC < 4 (horizontal waste).]

Chip Multiprocessing (CMP)

§ What is the effect of splitting into multiple processors?
  – reduces horizontal waste,
  – leaves some vertical waste, and
  – puts upper limit on peak throughput of each thread.

[Figure: grid of issue slots (axes: issue width, time).]

Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

§ Interleave multiple threads to multiple issue slots with no restrictions

[Figure: grid of issue slots (axes: issue width, time).]

O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]

§ Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
§ Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads
§ OOO instruction window already has most of the circuitry required to schedule from multiple threads
§ Any single thread can utilize whole machine

SMT adaptation to parallelism type

For regions with high thread-level parallelism (TLP), entire machine width is shared by all threads

For regions with low thread-level parallelism (TLP), entire machine width is available for instruction-level parallelism (ILP)

[Figure: two grids of issue slots (axes: issue width, time), one for each case.]
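The contrast with a chip multiprocessor can be sketched for a single cycle. The model below is idealized and the names are illustrative. It assumes each thread reports how many instructions it could issue this cycle, that SMT places no restriction on which thread fills which slot, and that the CMP splits the same total width evenly into one narrower core per thread.

```python
from typing import Sequence


def smt_issued(ready_per_thread: Sequence[int], issue_width: int) -> int:
    """SMT: all threads draw from one shared pool of issue slots."""
    return min(sum(ready_per_thread), issue_width)


def cmp_issued(ready_per_thread: Sequence[int], issue_width: int) -> int:
    """CMP: each thread is capped by its own core's share of the width.

    Assumption: one core per thread, and the core count divides issue_width.
    """
    per_core = issue_width // len(ready_per_thread)
    return sum(min(r, per_core) for r in ready_per_thread)


# Low TLP: only one thread has work, but it has plenty of ILP.
low_tlp = (smt_issued([4, 0], 4), cmp_issued([4, 0], 4))
# High TLP: both threads have work.
high_tlp = (smt_issued([2, 2], 4), cmp_issued([2, 2], 4))
```

In the low-TLP case SMT issues 4 instructions where the split design issues 2. In the high-TLP case both issue 4, so the advantage of SMT in this model comes entirely from phases where one thread could use more than its static share.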

Pentium-4 Hyperthreading (2002)

§ First commercial SMT design (2-way SMT)
§ Logical processors share nearly all resources of the physical processor
  – Caches, execution units, branch predictors
§ Die area overhead of hyperthreading ~5%
§ When one logical processor is stalled, the other can make progress
  – No logical processor can use all entries in queues when two threads are active
§ Processor running only one active software thread runs at approximately same speed with or without hyperthreading
§ Hyperthreading dropped on OoO P6 based follow-ons to Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with Nehalem generation machines in 2008.
§ Intel Atom (in-order x86 core) has two-way vertical multithreading
  – Hyperthreading == (SMT for Intel OoO & Vertical for Intel InO)

IBM Power 4

Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

[Figure: Power 4 and Power 5 pipelines. Annotations: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets).]

Power 5 data flow ...

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Initial Performance of SMT

§ Pentium-4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
  – Pentium-4 is dual-threaded SMT
  – SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
§ Running on Pentium-4, each of 26 SPEC benchmarks paired with every other (26^2 runs): speed-ups from 0.90 to 1.58; average was 1.20
§ Power 5, 8-processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
§ Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  – Most gained some
  – Fl. Pt. apps had most cache conflicts and least gains

SMT Performance: Application Interaction

Bulpin et al, “Multiprogramming Performance of Pentium 4 with Hyper-Threading”

[Figure annotations, in slide order:]
  – So long as they aren’t banging on the L2 too.
  – Not affected by other programs
  – Your favorite benchmark from Lab 2
  – Doesn’t play nice
  – Very sensitive to second program

Icount Choosing Policy

Fetch from thread with the least instructions in flight.

Why does this enhance throughput?
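A toy model gives some intuition for the question. The sketch below is not the mechanism from any particular processor. It assumes one thread is chosen per cycle, receives a fixed fetch width, and that each thread drains a fixed number of in-flight instructions per cycle. All names and numbers are illustrative.

```python
from typing import List, Sequence


def icount_choose(in_flight: Sequence[int]) -> int:
    """Return the thread with the fewest instructions in flight.

    Assumption: ties go to the lowest thread id.
    """
    return min(range(len(in_flight)), key=lambda tid: (in_flight[tid], tid))


def simulate_fetch(in_flight: Sequence[int], drain_per_cycle: Sequence[int],
                   fetch_width: int, cycles: int) -> List[int]:
    """Record which thread wins fetch each cycle under the toy model."""
    counts = list(in_flight)
    chosen = []
    for _ in range(cycles):
        tid = icount_choose(counts)
        chosen.append(tid)
        counts[tid] += fetch_width
        # Each thread completes some of its in-flight instructions.
        counts = [max(0, c - d) for c, d in zip(counts, drain_per_cycle)]
    return chosen


# Thread 0 drains 2 instructions per cycle, thread 1 only 1.
winners = simulate_fetch([0, 0], drain_per_cycle=[2, 1], fetch_width=4, cycles=6)
```

In this run thread 0, which moves instructions through faster, wins four of the six fetch cycles. That points at one commonly given answer: a thread that is clogging the queues with stalled instructions is automatically given less fetch bandwidth, so it cannot fill the shared window on its own.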

Summary: Multithreaded Categories

[Figure: issue slots over time (processor cycle) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]

Multithreaded Design Discussion

§ Want to build a multithreaded processor, how should each component be changed and what are the tradeoffs?
§ L1 caches (instruction and data)
§ L2 caches
§ Branch predictor
§ TLB
§ Physical register file

SMT & Security

§ Most hardware attacks rely on shared hardware resources to establish a side-channel
  – E.g., shared outer caches, DRAM row buffers
§ SMT gives attackers high-BW access to previously private hardware resources that are shared by co-resident threads:
  – TLBs: TLBleed (June ’18)
  – L1 caches: CacheBleed (2016)
  – Functional unit ports: PortSmash (Nov ’18)
§ OpenBSD 6.4 -> Disabled HT in BIOS, AMD SMT to follow

Acknowledgements

§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
  – Arvind (MIT)
  – Scott Beamer (UCB)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)
