CS 152 Computer Architecture and Engineering / CS252 Graduate Computer Architecture, Lecture 14 – Multithreading. Krste Asanovic, Electrical Engineering and Computer Sciences, University of California at Berkeley. http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152

Oct 11, 2020

Transcript
Page 1:

CS 152 Computer Architecture and Engineering
CS252 Graduate Computer Architecture

Lecture 14 – Multithreading

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley

http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

Page 2:

Last Time, Lecture 13: VLIW

§ In a classic VLIW, the compiler is responsible for avoiding all hazards -> simple hardware, complex compiler. Later VLIWs added more dynamic hardware interlocks.
§ Use loop unrolling and software pipelining for loops, trace scheduling for more irregular code
§ Static scheduling is difficult in the presence of unpredictable branches and variable-latency memory

Page 3:

Multithreading

§ Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
§ Many workloads can make use of thread-level parallelism (TLP)
– TLP from multiprogramming (run independent sequential jobs)
– TLP from multithreaded applications (run one job faster using parallel threads)
§ Multithreading uses TLP to improve utilization of a single processor

Page 4:

Multithreading

How can we guarantee no dependencies between instructions in a pipeline?

One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

Cycle:              t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LD x1,0(x2)     F  D  X  M  W
T2: ADD x7,x1,x4       F  D  X  M  W
T3: XORI x5,x4,12         F  D  X  M  W
T4: SD 0(x7),x5              F  D  X  M  W
T1: LD x5,12(x1)                F  D  X  M  W

The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.
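The interleaving property above can be checked with a small sketch. It assumes a hypothetical model in which an instruction fetched in cycle t reads the register file in D at t+1 and writes it in W at t+4, and in which a read must happen strictly after the earlier write (no same-cycle write-then-read). The function names are illustrative only.

```python
STAGES = ["F", "D", "X", "M", "W"]  # classic 5-stage pipe, one stage per cycle


def schedule(num_threads, instrs_per_thread):
    """Fixed round-robin fetch: one instruction per cycle, threads in turn.

    Returns a list of (thread, instr_index, {stage: cycle}).
    """
    timeline = []
    cycle = 0
    for i in range(instrs_per_thread):
        for t in range(num_threads):
            stage_cycles = {s: cycle + k for k, s in enumerate(STAGES)}
            timeline.append((t, i, stage_cycles))
            cycle += 1
    return timeline


def hazard_free(timeline):
    """True if every write-back precedes the next same-thread register read.

    Assumption: registers are read in D and written in W, and a read in
    the same cycle as the write does not see the new value.
    """
    last_writeback = {}
    for thread, _, stages in timeline:
        if thread in last_writeback and last_writeback[thread] >= stages["D"]:
            return False
        last_writeback[thread] = stages["W"]
    return True


print(hazard_free(schedule(4, 3)))  # four threads: no interlocks needed
print(hazard_free(schedule(3, 3)))  # three threads: W and D collide in this model
```

Under these assumptions four threads is the smallest count that needs no bypassing or interlocks; a register file that writes in the first half of a cycle and reads in the second half would relax that bound.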

Page 5:

CDC 6600 Peripheral Processors (Cray, 1964)

§ First multithreaded hardware
§ 10 “virtual” I/O processors
§ Fixed interleave on simple pipeline
§ Pipeline has 100 ns cycle time
§ Each virtual processor executes one instruction every 1000 ns
§ Accumulator-based instruction set to reduce processor state

Page 6:

Simple Multithreaded Pipeline

§ Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
§ Appears to software (including OS) as multiple, albeit slower, CPUs

[Figure: multithreaded pipeline datapath. A 2-bit thread select chooses among four PCs and four GPR files; other labels: +1, I$, IR, X, Y, D$.]

Page 7:

Multithreading Costs

§ Each thread requires its own user state
– PC
– GPRs
§ Also, needs its own system state
– Virtual-memory page-table-base register
– Exception-handling registers
§ Other overheads:
– Additional cache/TLB conflicts from competing threads
– (or add larger cache/TLB capacity)
– More OS overhead to schedule more threads (where do all these threads come from?)

Page 8:

Thread Scheduling Policies

§ Fixed interleave (CDC 6600 PPUs, 1964)
– Each of N threads executes one instruction every N cycles
– If thread not ready to go in its slot, insert pipeline bubble
§ Software-controlled interleave (TI ASC PPUs, 1971)
– OS allocates S pipeline slots amongst N threads
– Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
§ Hardware-controlled thread scheduling (HEP, 1982)
– Hardware keeps track of which threads are ready to go
– Picks next thread to execute based on hardware priority scheme
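The difference between fixed interleave and hardware-controlled scheduling can be sketched as follows. This is a minimal illustration, not a model of any of the machines named above: the `ready` matrix, the function names, and the round-robin priority scheme are all assumptions made for the example.

```python
def fixed_interleave(ready, num_cycles):
    """Thread (cycle mod N) owns each slot; issue a bubble (None) if it is not ready.

    ready[t][c] is True when thread t could issue in cycle c (hypothetical input).
    """
    n = len(ready)
    issued = []
    for c in range(num_cycles):
        t = c % n
        issued.append(t if ready[t][c] else None)
    return issued


def hardware_scheduled(ready, num_cycles):
    """Pick any ready thread each cycle, using rotating (round-robin) priority.

    A bubble is issued only when no thread is ready.
    """
    n = len(ready)
    issued = []
    start = 0  # thread with highest priority this cycle
    for c in range(num_cycles):
        pick = None
        for k in range(n):
            t = (start + k) % n
            if ready[t][c]:
                pick = t
                break
        issued.append(pick)
        if pick is not None:
            start = (pick + 1) % n
    return issued


# Thread 0 is always ready, thread 1 is stalled for all four cycles.
ready = [[True] * 4, [False] * 4]
print(fixed_interleave(ready, 4))    # [0, None, 0, None]
print(hardware_scheduled(ready, 4))  # [0, 0, 0, 0]
```

In this toy input the fixed policy wastes every slot owned by the stalled thread, while the hardware-controlled policy gives those slots to the ready thread. A real pipeline would also mark a thread not ready while its previous instruction is still in flight.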

Page 9:

Denelcor HEP (Burton Smith, 1982)

First commercial machine to use hardware threading in main CPU
– 120 threads per processor
– 10 MHz clock rate
– Up to 8 processors
– Precursor to Tera MTA (Multithreaded Architecture)

Page 10:

CS252

Tera MTA (1990-)

§ Up to 256 processors
§ Up to 128 active threads per processor
§ Processors and memory modules populate a sparse 3D torus interconnection fabric
§ Flat, shared main memory
– No data cache
– Sustains one main memory access per cycle per processor
§ GaAs logic in prototype, 1 KW/processor @ 260 MHz
– Second version CMOS, MTA-2, 50 W/processor
– Newer version, XMT, fits into AMD Opteron socket, runs at 500 MHz
– Newest version, XMT2, has higher memory bandwidth and capacity

Page 11:

CS252

MTA Pipeline

[Figure: MTA pipeline diagram. Labels: Issue Pool, Inst Fetch, M, A, C, W, Memory Pool, Retry Pool, Write Pool, Interconnection Network, Memory pipeline.]

• Every cycle, one VLIW instruction from one active thread is launched into pipeline
• Instruction pipeline is 21 cycles long
• Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz…

What is single-thread performance?

Effective single-thread issue rate is 260/21 = 12.4 MIPS
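The single-thread figure above is just clock rate divided by the issue interval. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def single_thread_mips(clock_mhz, issue_interval_cycles):
    """Instructions per microsecond (MIPS) when one thread issues once per interval."""
    return clock_mhz / issue_interval_cycles


mips = single_thread_mips(260, 21)
print(f"{mips:.1f} MIPS")  # 12.4 MIPS
```

The same reasoning suggests that roughly 21 active threads are needed before every issue slot can be filled, before accounting for memory latency.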

Page 12:

Coarse-Grain Multithreading

§ Tera MTA designed for supercomputing applications with large data sets and low locality
– No data cache
– Many parallel threads needed to hide large memory latency
§ Other applications are more cache friendly
– Few pipeline bubbles if cache mostly has hits
– Just add a few threads to hide occasional cache miss latencies
– Swap threads on cache misses

Page 13:

MIT Alewife (1990)

§ Modified SPARC chips
– Register windows hold different thread contexts
§ Up to four threads per node
§ Thread switch on local cache miss

Page 14:

IBM PowerPC RS64-IV (2000)

§ Commercial coarse-grain multithreading CPU
§ Based on PowerPC with quad-issue in-order five-stage pipeline
§ Each physical CPU supports two virtual CPUs
§ On L2 cache miss, pipeline is flushed and execution switches to second thread
– Short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
– Flush pipeline to simplify exception handling
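Why a small flush penalty matters can be illustrated with a rough utilization model. This is a sketch under stated assumptions, not a description of the RS64-IV: each thread is assumed to run a fixed number of cycles between misses, every miss has the same latency, and the parameter names and example values (other than the 4-cycle flush) are hypothetical.

```python
def single_thread_utilization(run_cycles, miss_latency):
    """Fraction of cycles doing useful work when the pipeline idles on each miss."""
    return run_cycles / (run_cycles + miss_latency)


def two_thread_utilization(run_cycles, miss_latency, flush_penalty):
    """Steady-state utilization with two threads that swap on every miss.

    One round runs both threads once. It takes at least two run-plus-flush
    periods, and at least long enough for the first thread's miss to return.
    """
    round_length = max(2 * (run_cycles + flush_penalty),
                       run_cycles + miss_latency)
    return 2 * run_cycles / round_length


# Assumed workload: 50 useful cycles between misses, 200-cycle miss latency.
print(single_thread_utilization(50, 200))   # 0.2
print(two_thread_utilization(50, 200, 4))   # 0.4
```

When the miss latency dominates, the second thread doubles utilization in this model and the 4-cycle flush is invisible; the flush only shows up once the two threads can fully cover each other's misses.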

Page 15:

Oracle/Sun Niagara processors

§ Target is datacenters running web servers and databases, with many concurrent requests
§ Provide multiple simple cores, each with multiple hardware threads: reduced energy/operation, though much lower single-thread performance

§ Niagara-1 [2004], 8 cores, 4 threads/core
§ Niagara-2 [2007], 8 cores, 8 threads/core
§ Niagara-3 [2009], 16 cores, 8 threads/core
§ T4 [2011], 8 cores, 8 threads/core
§ T5 [2012], 16 cores, 8 threads/core
§ M5 [2012], 6 cores, 8 threads/core
§ M6 [2013], 12 cores, 8 threads/core
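The hardware thread count per chip follows directly from the core and thread figures listed above. A small sketch that tabulates it; the dictionary and function names are illustrative:

```python
# (year, cores, threads per core), copied from the Niagara list above.
NIAGARA_FAMILY = {
    "Niagara-1": (2004, 8, 4),
    "Niagara-2": (2007, 8, 8),
    "Niagara-3": (2009, 16, 8),
    "T4": (2011, 8, 8),
    "T5": (2012, 16, 8),
    "M5": (2012, 6, 8),
    "M6": (2013, 12, 8),
}


def hardware_threads(name):
    """Total hardware threads on one chip: cores times threads per core."""
    _, cores, threads_per_core = NIAGARA_FAMILY[name]
    return cores * threads_per_core


for name in NIAGARA_FAMILY:
    print(f"{name}: {hardware_threads(name)} hardware threads")
```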

Page 16:

Oracle/Sun Niagara-3, “Rainbow Falls” 2009

Page 17:

Oracle M6 - 2013

Page 18:

Oracle M6 - 2013

Page 19:

Oracle M6 - 2013

Page 20:

CS152 Administrivia

§ PS 3 due today
§ PS 4 out Wednesday March 18, due Friday April 3
§ Lab 3 due Monday April 6

Page 21:

CS252

CS252 Administrivia

§ Reading today 3:30 on Zoom, link on Piazza

Page 22:

Simultaneous Multithreading (SMT) for OoO Superscalars

§ Techniques presented so far have all been “vertical” multithreading, where each pipeline stage works on one thread at a time
§ SMT uses fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. Gives better utilization of machine resources.

Page 23:

For most apps, most execution units lie idle in an OoO superscalar

[Figure: for an 8-way superscalar. From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism”, ISCA 1995.]

Page 24:

Superscalar Machine Efficiency

[Figure: issue slots plotted by issue width and time. Legend: instruction issue; completely idle cycle (vertical waste); partially filled cycle, i.e., IPC < 4 (horizontal waste).]
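The two kinds of waste named in the figure can be counted from a per-cycle issue trace. The sketch below assumes a 4-wide machine, matching the IPC < 4 label, and a made-up trace; the function name and trace values are hypothetical.

```python
def issue_waste(issued_per_cycle, width):
    """Split unused issue slots into vertical and horizontal waste.

    Vertical waste: all slots of a cycle in which nothing issued.
    Horizontal waste: the unused slots of a partially filled cycle.
    """
    vertical = sum(width for n in issued_per_cycle if n == 0)
    horizontal = sum(width - n for n in issued_per_cycle if 0 < n < width)
    return {
        "used": sum(issued_per_cycle),
        "vertical": vertical,
        "horizontal": horizontal,
        "total": width * len(issued_per_cycle),
    }


# Assumed trace: instructions issued in each of six cycles on a 4-wide machine.
print(issue_waste([2, 0, 4, 1, 0, 3], width=4))
# {'used': 10, 'vertical': 8, 'horizontal': 6, 'total': 24}
```

Used, vertical, and horizontal slots always sum to the total, which makes the breakdown easy to sanity-check.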

Page 25:

Vertical Multithreading

§ Cycle-by-cycle interleaving removes vertical waste, but leaves some horizontal waste

[Figure: issue slots plotted by issue width and time. Legend: instruction issue; second thread interleaved cycle-by-cycle; partially filled cycle, i.e., IPC < 4 (horizontal waste).]

Page 26:

Chip Multiprocessing (CMP)

§ What is the effect of splitting into multiple processors?
– reduces horizontal waste,
– leaves some vertical waste, and
– puts upper limit on peak throughput of each thread.

[Figure: issue slots plotted by issue width and time.]

Page 27:

Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

§ Interleave multiple threads to multiple issue slots with no restrictions

[Figure: issue slots plotted by issue width and time.]

Page 28:

O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]

§ Add multiple contexts and fetch engines and allow instructions fetched from different threads to issue simultaneously
§ Utilize wide out-of-order superscalar processor issue queue to find instructions to issue from multiple threads
§ OOO instruction window already has most of the circuitry required to schedule from multiple threads
§ Any single thread can utilize whole machine
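The gain from letting several threads share one cycle's issue slots can be sketched by comparing it with cycle-by-cycle (vertical) interleaving. This is a toy model: the `ready` counts are invented, unissued instructions are not carried over to later cycles, and both function names are hypothetical.

```python
def vertical_mt_issue(ready, width):
    """One thread per cycle, chosen round-robin; returns instructions issued per cycle.

    ready[c][t] is how many instructions thread t could issue in cycle c.
    """
    num_threads = len(ready[0])
    return [min(width, cycle_ready[c % num_threads])
            for c, cycle_ready in enumerate(ready)]


def smt_issue(ready, width):
    """All threads may issue in the same cycle until the issue slots run out."""
    issued = []
    for cycle_ready in ready:
        slots_left = width
        count = 0
        for thread_ready in cycle_ready:
            take = min(slots_left, thread_ready)
            count += take
            slots_left -= take
        issued.append(count)
    return issued


# Assumed ready counts for two threads over four cycles on a 4-wide machine.
ready = [[2, 3], [0, 1], [4, 2], [1, 0]]
print(vertical_mt_issue(ready, 4))  # [2, 1, 4, 0]
print(smt_issue(ready, 4))          # [4, 1, 4, 1]
```

In this example SMT fills 10 of 16 slots against 7 for vertical interleaving, because slots one thread cannot use go to the other in the same cycle. When only one thread has work, it can still take the full width.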

Page 29:

SMT adaptation to parallelism type

For regions with high thread-level parallelism (TLP), entire machine width is shared by all threads

[Figure: issue slots plotted by issue width and time.]

For regions with low thread-level parallelism (TLP), entire machine width is available for instruction-level parallelism (ILP)

[Figure: issue slots plotted by issue width and time.]

Page 30:

Pentium-4 Hyperthreading (2002)

§ First commercial SMT design (2-way SMT)
§ Logical processors share nearly all resources of the physical processor
– Caches, execution units, branch predictors
§ Die area overhead of hyperthreading ~5%
§ When one logical processor is stalled, the other can make progress
– No logical processor can use all entries in queues when two threads are active
§ Processor running only one active software thread runs at approximately same speed with or without hyperthreading
§ Hyperthreading dropped on OoO P6-based follow-ons to Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with Nehalem generation machines in 2008.
§ Intel Atom (in-order x86 core) has two-way vertical multithreading
– Hyperthreading == (SMT for Intel OoO & Vertical for Intel InO)

Page 31:

IBM Power 4

Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

Page 32:

[Figure: Power 4 and Power 5 pipelines compared. Power 5 annotations: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets).]

Page 33:

Power 5 data flow ...

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Page 34:

Initial Performance of SMT

§ Pentium-4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
– Pentium-4 is dual-threaded SMT
– SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
§ Running on Pentium-4, each of 26 SPEC benchmarks paired with every other (26^2 runs): speed-ups from 0.90 to 1.58; average was 1.20
§ Power 5, 8-processor server 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
§ Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
– Most gained some
– Fl. Pt. apps had most cache conflicts and least gains

Page 35:

SMT Performance: Application Interaction

[Figure from Bulpin et al, “Multiprogramming Performance of Pentium 4 with Hyper-Threading”. Annotations: “Not affected by other programs”; “So long as they aren’t banging on the L2 too.”; “Your favorite benchmark from Lab 2”.]

Page 36:

SMT Performance: Application Interaction

[Figure from Bulpin et al, “Multiprogramming Performance of Pentium 4 with Hyper-Threading”. Annotations: “Doesn’t play nice”; “Your favorite benchmark from Lab 2”.]

Page 37:

SMT Performance: Application Interaction

[Figure from Bulpin et al, “Multiprogramming Performance of Pentium 4 with Hyper-Threading”. Annotation: “Very sensitive to second program”.]

Page 38:

Icount Choosing Policy

Fetch from thread with the least instructions in flight.

Why does this enhance throughput?
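The selection rule itself is a minimum over per-thread in-flight counts. A minimal sketch, assuming a hypothetical dictionary of counts and breaking ties toward the lowest thread id:

```python
def icount_pick(in_flight):
    """Return the thread to fetch from next: the one with the fewest instructions in flight.

    in_flight maps thread id -> instructions currently in the front end and
    issue queues (an assumed bookkeeping structure). Ties go to the lowest id.
    """
    return min(in_flight, key=lambda t: (in_flight[t], t))


print(icount_pick({0: 12, 1: 5, 2: 9}))  # 1
```

One plausible reading of why this helps: a thread with few instructions in flight is draining quickly, while a thread with many is probably stalled, so favoring the former keeps a stalled thread from filling the shared queues.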

Page 39:

Summary: Multithreaded Categories

[Figure: issue slots over time (processor cycles) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]

Page 40:

Multithreaded Design Discussion

§ Want to build a multithreaded processor. How should each component be changed, and what are the tradeoffs?
§ L1 caches (instruction and data)
§ L2 caches
§ Branch predictor
§ TLB
§ Physical register file

Page 41:

SMT & Security

§ Most hardware attacks rely on shared hardware resources to establish a side-channel
– E.g., shared outer caches, DRAM row buffers
§ SMT gives attackers high-BW access to previously private hardware resources that are shared by co-resident threads:
§ TLBs: TLBleed (June ’18)
§ L1 caches: CacheBleed (2016)
§ Functional unit ports: PortSmash (Nov ’18)

OpenBSD 6.4 -> Disabled HT in BIOS, AMD SMT to follow

Page 42:

Acknowledgements

§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
– Arvind (MIT)
– Scott Beamer (UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)