Lecture 09: RISC-V Pipeline Implementation
CSE 564 Computer Architecture, Summer 2017
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
www.secs.oakland.edu/~yan
Acknowledgement
• Slides adapted from Computer Science 152: Computer Architecture and Engineering, Spring 2016, by Dr. George Michelogiannakis from UC Berkeley
Introduction
• CPU performance factors
  – Instruction count
    • Determined by ISA and compiler
  – CPI and cycle time
    • Determined by CPU hardware
• Three groups of instructions
  – Memory reference: lw, sw
  – Arithmetic/logical: add, sub, and, or, slt
  – Control transfer: jal, jalr, b*
• CPI
  – Single-cycle, CPI = 1
  – 5-stage unpipelined, CPI = 5
  – 5-stage pipelined, CPI = 1
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
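As a rough worked example of the CPU time equation, the sketch below compares the three design points listed above. The instruction count and the two cycle times are made-up illustrative values (assumptions, not measurements); only the formula and the CPI values come from the slide.

```python
def cpu_time_ns(instructions, cpi, cycle_time_ns):
    """CPU time = instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time_ns

# Hypothetical workload and clock periods, chosen only for illustration.
N = 1_000_000            # instructions in the program (assumed)
SINGLE_CYCLE_NS = 5.0    # one long cycle that covers all five phases (assumed)
STAGE_NS = 1.0           # one short cycle per stage (assumed)

designs = {
    "single-cycle":        cpu_time_ns(N, 1, SINGLE_CYCLE_NS),
    "5-stage unpipelined": cpu_time_ns(N, 5, STAGE_NS),
    "5-stage pipelined":   cpu_time_ns(N, 1, STAGE_NS),
}

for name, t in designs.items():
    print(f"{name:20s} {t / 1e6:.1f} ms")
```

Under these assumed numbers the pipelined design wins because it keeps the short cycle time while holding CPI at 1; real pipelines fall short of that ideal once hazards add bubbles.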
An Ideal Pipeline
• All objects go through the same stages
• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• The scheduling of an object entering the pipeline is not affected by the objects in other stages
[Figure: stage 1 → stage 2 → stage 3 → stage 4]
These conditions generally hold for industrial assembly lines, but instructions depend on each other!
Review: Unpipelined Datapath for RISC-V
[Figure: single-cycle RISC-V datapath with PC, +4 adder (0x4 Add), Inst. Memory, GPRs (rs1, rs2, wa, wd, rd1, rd2, we), Imm Select, ALU, Br Logic (Bcomp?), and Data Memory (addr, wdata, rdata, we). Control signals: OpCode, ImmSel, Op2Sel, ALU Control, MemWrite, WBSel, WASel, RegWriteEn, PCSel (pc+4 / br / rind / jabs)]
Review: Hardwired Control Table
Opcode    ImmSel    Op2Sel  FuncSel  MemWr  RFWen  WBSel  WASel  PCSel
ALU       *         Reg     Func     no     yes    ALU    rd     pc+4
ALUi      IType12   Imm     Op       no     yes    ALU    rd     pc+4
LW        IType12   Imm     +        no     yes    Mem    rd     pc+4
SW        BsType12  Imm     +        yes    no     *      *      pc+4
BEQtrue   BrType12  *       *        no     no     *      *      br
BEQfalse  BrType12  *       *        no     no     *      *      pc+4
JAL       *         *       *        no     yes    PC     rd     jabs
JALR      *         *       *        no     yes    PC     rd     rind
Op2Sel = Reg / Imm    WBSel = ALU / Mem / PC    PCSel = pc+4 / br / rind / jabs
Pipelined Datapath
Clock period can be reduced by dividing the execution of an instruction into multiple cycles
$$t_C > \max\{t_{IM},\ t_{RF},\ t_{ALU},\ t_{DM},\ t_{RW}\} \quad (= t_{DM}\ \text{probably})$$
However, CPI will increase unless instructions are pipelined
[Figure: datapath divided into fetch phase, decode & Reg-fetch phase, execute phase, memory phase, and write-back phase, with PC and IR registers]
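The clock-period bound can be made concrete with a small calculation. The per-phase delays below are hypothetical values picked for illustration, and pipeline-register overhead is ignored (an assumption); the point is only that the slowest phase sets the multi-cycle clock, while a single-cycle design pays the sum of all phases.

```python
# Hypothetical per-phase delays in picoseconds (assumed values).
delays_ps = {"t_IM": 200, "t_RF": 150, "t_ALU": 180, "t_DM": 220, "t_RW": 150}

# A single-cycle design must fit every phase into one clock period.
single_cycle_period = sum(delays_ps.values())

# A design with one phase per cycle is bounded below by the slowest phase.
multi_cycle_period = max(delays_ps.values())
slowest_phase = max(delays_ps, key=delays_ps.get)

print(f"single-cycle period >= {single_cycle_period} ps")
print(f"per-stage period    >  {multi_cycle_period} ps (set by {slowest_phase})")
```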
Technology Assumptions
• A small amount of very fast memory (caches) backed up by a large, slower memory
• Fast ALU (at least for integers)
• Multiported register files (slower!)
Thus, the following timing assumption is reasonable
$$t_{IM} \approx t_{RF} \approx t_{ALU} \approx t_{DM} \approx t_{RW}$$
A 5-stage pipeline will be the focus of our detailed design; some commercial designs have over 30 pipeline stages to do an integer add!
5-Stage Pipelined Execution
time                  t0  t1  t2  t3  t4  t5  t6  t7  ....
instruction1          IF1 ID1 EX1 MA1 WB1
instruction2              IF2 ID2 EX2 MA2 WB2
instruction3                  IF3 ID3 EX3 MA3 WB3
instruction4                      IF4 ID4 EX4 MA4 WB4
instruction5                          IF5 ID5 EX5 MA5 WB5
[Figure: 5-stage datapath with stages I-Fetch (IF), Decode/Reg. Fetch (ID), Execute (EX), Memory (MA), Write-Back (WB)]
5-Stage Pipelined Execution: Resource Usage Diagram
Resources
time  t0  t1  t2  t3  t4  t5  t6  t7  ....
IF    I1  I2  I3  I4  I5
ID        I1  I2  I3  I4  I5
EX            I1  I2  I3  I4  I5
MA                I1  I2  I3  I4  I5
WB                    I1  I2  I3  I4  I5
[Figure: 5-stage datapath with stages I-Fetch (IF), Decode/Reg. Fetch (ID), Execute (EX), Memory (MA), Write-Back (WB)]
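The resource usage diagram for an ideal pipeline follows a simple rule: instruction i occupies stage s at cycle i + s. A minimal sketch of that rule, assuming no stalls or kills (the function name `resource_usage` is made up for this example):

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def resource_usage(num_instructions):
    """Return {stage: [occupant per cycle]} for an ideal pipeline with no stalls."""
    total_cycles = num_instructions + len(STAGES) - 1
    table = {}
    for s, stage in enumerate(STAGES):
        row = []
        for t in range(total_cycles):
            i = t - s  # index of the instruction that is in stage s at cycle t
            row.append(f"I{i + 1}" if 0 <= i < num_instructions else "")
        table[stage] = row
    return table

table = resource_usage(5)
print("    " + " ".join(f"t{t:<2}" for t in range(len(table["IF"]))))
for stage, row in table.items():
    print(f"{stage:3s} " + " ".join(f"{cell:3s}" for cell in row))
```

Five instructions finish in 5 + 4 = 9 cycles here, which is where the steady-state CPI of 1 comes from.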
Pipelined Execution: ALU Instructions
[Figure: pipelined datapath with pipeline registers PC, IR, A, B, Y, R, MD1, MD2 between Inst Memory, GPRs, Imm Select, ALU, and Data Memory]
Not quite correct! We need an Instruction Reg (IR) for each stage
Pipelined RISC-V Datapath without jumps
[Figure: pipelined datapath with stages F, D, E, M, W and an IR copy carried through each stage. Control points: ImmSel, Op2Sel, ALU Control, MemWrite, WBSel, RegWriteEn]
Control Points Need to Be Connected
Instructions interact with each other in pipeline
• An instruction in the pipeline may need a resource being used by another instruction in the pipeline → structural hazard
• An instruction may depend on something produced by an earlier instruction
  – Dependence may be for a data value → data hazard
  – Dependence may be for the next instruction's address → control hazard (branches, exceptions)
Resolving Structural Hazards
• A structural hazard occurs when two instructions need the same hardware resource at the same time
  – Can resolve in hardware by stalling the newer instruction until the older instruction is finished with the resource
• A structural hazard can always be avoided by adding more hardware to the design
  – E.g., if two instructions both need a port to memory at the same time, could avoid the hazard by adding a second port to memory
• Our 5-stage pipeline has no structural hazards by design
  – Thanks to RISC-V ISA, which was designed for pipelining
Data Hazards
...
x1 ← x0 + 10
x4 ← x1 + 17
...
[Figure: pipelined datapath with "x4 ← x1 …" in the decode stage while "x1 ← …" is still in the execute stage]
x1 is stale. Oops!
How Would You Resolve This?
• Three options
  – Wait (stall)
  – Bypass: ask the producer for what you need before their final deliverable
  – Speculate on values to read
Resolving Data Hazards (1)
Strategy 1: Wait for the result to be available by freezing earlier pipeline stages → interlocks
Interlocks to resolve Data Hazards
...
x1 ← x0 + 10
x4 ← x1 + 17
...
[Figure: pipelined datapath with a Stall Condition block that freezes PC and the decode-stage IR and inserts a bubble into the execute stage]
Stalled Stages and Pipeline Bubbles
time                  t0  t1  t2  t3  t4  t5  t6  t7  ....
(I1) x1 ← (x0) + 10   IF1 ID1 EX1 MA1 WB1
(I2) x4 ← (x1) + 17       IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2
(I3)                          IF3 IF3 IF3 IF3 ID3 EX3 MA3 WB3
(I4)                                          IF4 ID4 EX4 MA4 WB4
(I5)                                              IF5 ID5 EX5 MA5 WB5
(the repeated ID2 and IF3 entries are the stalled stages)

Resource Usage
time  t0  t1  t2  t3  t4  t5  t6  t7  ....
IF    I1  I2  I3  I3  I3  I3  I4  I5
ID        I1  I2  I2  I2  I2  I3  I4  I5
EX            I1  -   -   -   I2  I3  I4  I5
MA                I1  -   -   -   I2  I3  I4  I5
WB                    I1  -   -   -   I2  I3  I4  I5
- ⇒ pipeline bubble
Interlock Control Logic
Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.
[Figure: pipelined datapath with a Cstall block that compares rs1 and rs2 of the decode-stage instruction against ws of later stages, drives stall, and inserts a bubble]
Interlock Control Logic ignoring jumps & branches
Should we always stall if an rs field matches some rd?
• not every instruction writes a register => we
• not every instruction reads a register => re
[Figure: pipelined datapath with Cdest blocks producing (wsE, weE), (wsM, weM), (wsW, weW), a Cre block producing re1 and re2 in decode, and Cstall combining them with rs1 and rs2 to drive stall and the bubble mux]
we: write enable, 1-bit on/off
ws: write select, 5-bit register number
re: read enable, 1-bit on/off
rs: read select, 5-bit register number
In RISC-V Sodor Implementation
Source & Destination Registers
Instruction formats:
ALU             func7 | rs2 | rs1 | func3 | rd | opcode
ALUI/LW/JALR    immediate12 | rs1 | func3 | rd | opcode
SW/Bcond        imm | rs2 | rs1 | func3 | imm | opcode
JAL             Jump Offset[19:0] | rd | opcode

        operation                       source(s)   destination
ALU     rd <= rs1 func10 rs2            rs1, rs2    rd
ALUI    rd <= rs1 op imm                rs1         rd
LW      rd <= M[rs1 + imm]              rs1         rd
SW      M[rs1 + imm] <= rs2             rs1, rs2    -
Bcond   rs1, rs2                        rs1, rs2    -
          true:  PC <= PC + imm
          false: PC <= PC + 4
JAL     x1 <= PC, PC <= PC + imm        -           rd
JALR    rd <= PC, PC <= rs1 + imm       rs1         rd
Deriving the Stall Signal
Cdest
  ws = rd
  we = Case opcode
         ALU, ALUi, LW, JAL, JALR => on
         ...                      => off
Cre
  re1 = Case opcode
          ALU, ALUi, LW, SW, Bcond, JALR => on
          JAL                            => off
  re2 = Case opcode
          ALU, SW, Bcond => on
          ...            => off
Cstall (here + denotes logical OR)
  stall = ((rs1D == wsE) && weE + (rs1D == wsM) && weM + (rs1D == wsW) && weW) && re1D
        + ((rs2D == wsE) && weE + (rs2D == wsM) && weM + (rs2D == wsW) && weW) && re2D
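The stall equation can be read as a small boolean function. The sketch below is one way to write it down; the `Dest` record and the argument order are invented for this example, and the logic covers only the no-bypass interlock described on this slide.

```python
from dataclasses import dataclass

@dataclass
class Dest:
    ws: int    # write select: destination register number
    we: bool   # write enable: does the instruction write a register?

def stall(rs1_d, re1_d, rs2_d, re2_d, dest_e, dest_m, dest_w):
    """Stall decode if an enabled source matches a pending destination in E, M, or W."""
    pending = (dest_e, dest_m, dest_w)
    hit1 = any(rs1_d == d.ws and d.we for d in pending)
    hit2 = any(rs2_d == d.ws and d.we for d in pending)
    return (hit1 and re1_d) or (hit2 and re2_d)

# x1 <- x0 + 10 is in E; an ALUi instruction (reads rs1 only) is in D.
idle = Dest(ws=0, we=False)
print(stall(1, True, 0, False, Dest(1, True), idle, idle))   # D reads x1: dependent
print(stall(2, True, 0, False, Dest(1, True), idle, idle))   # D reads x2: independent
```

The `we` and `re` terms are what keep the interlock from stalling on false matches, for example when the older instruction is a store whose rd field is not a real destination.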
Hazards due to Loads & Stores
...
M[x1+7] <= x2
x4 <= M[x3+5]
...
[Figure: pipelined datapath with Stall Condition block and bubble insertion]
Is there any possible data hazard in this instruction sequence?
What if x1+7 = x3+5?
Load & Store Hazards
...
M[x1+7] <= x2
x4 <= M[x3+5]
...
x1+7 = x3+5 => data hazard
However, the hazard is avoided because our memory system completes writes in one cycle!
Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course.
Resolving Data Hazards (2)
Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage → bypass
Bypassing
With stalls:
time                  t0  t1  t2  t3  t4  t5  t6  t7  ....
(I1) x1 ← x0 + 10     IF1 ID1 EX1 MA1 WB1
(I2) x4 ← x1 + 17         IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2
(I3)                          IF3 IF3 IF3 IF3 ID3 EX3 MA3
(I4)                                          IF4 ID4 EX4
(I5)                                              IF5 ID5
(the repeated ID2 and IF3 entries are the stalled stages)
Each stall or kill introduces a bubble in the pipeline => CPI > 1

With a bypass:
time                  t0  t1  t2  t3  t4  t5  t6  t7  ....
(I1) x1 ← x0 + 10     IF1 ID1 EX1 MA1 WB1
(I2) x4 ← x1 + 17         IF2 ID2 EX2 MA2 WB2
(I3)                          IF3 ID3 EX3 MA3 WB3
(I4)                              IF4 ID4 EX4 MA4 WB4
(I5)                                  IF5 ID5 EX5 MA5 WB5
A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input
Hardware Support for Forwarding
Detecting RAW Hazards
• Pass register numbers along the pipeline
  – ID/EX.RegisterRs = register number for Rs in ID/EX
  – ID/EX.RegisterRt = register number for Rt in ID/EX
  – ID/EX.RegisterRd = register number for Rd in ID/EX
• Current instruction being executed is in the ID/EX register
• Previous instruction is in the EX/MEM register
• Second previous is in the MEM/WB register
• RAW data hazards when
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs    (fwd from EX/MEM pipeline reg)
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt    (fwd from EX/MEM pipeline reg)
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs    (fwd from MEM/WB pipeline reg)
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt    (fwd from MEM/WB pipeline reg)
Detecting the Need to Forward
• But only if the forwarding instruction will write to a register!
  – EX/MEM.RegWrite, MEM/WB.RegWrite
• And only if Rd for that instruction is not R0
  – EX/MEM.RegisterRd ≠ 0
  – MEM/WB.RegisterRd ≠ 0
Forwarding Conditions
• Detecting RAW hazard with Previous Instruction
  – if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01 (Forward from EX/MEM pipe stage)
  – if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01 (Forward from EX/MEM pipe stage)
• Detecting RAW hazard with Second Previous
  – if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10 (Forward from MEM/WB pipe stage)
  – if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10 (Forward from MEM/WB pipe stage)
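The four forwarding conditions translate almost line for line into code. In the sketch below the function name and argument layout are invented for illustration. One detail is an assumption added here and not stated on the slide: when both EX/MEM and MEM/WB match the same source, the EX/MEM value is chosen because it is the more recent result.

```python
def forward_select(rs, rt, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) as 2-bit codes: 0b00 none, 0b01 EX/MEM, 0b10 MEM/WB.

    ex_mem and mem_wb are (RegWrite, RegisterRd) pairs for those pipeline registers.
    """
    forward_a = forward_b = 0b00

    # Second previous instruction (MEM/WB pipeline register).
    reg_write, rd = mem_wb
    if reg_write and rd != 0:
        if rd == rs:
            forward_a = 0b10
        if rd == rt:
            forward_b = 0b10

    # Previous instruction (EX/MEM). Checked last so the newer value wins
    # when both stages write the same register (assumed priority).
    reg_write, rd = ex_mem
    if reg_write and rd != 0:
        if rd == rs:
            forward_a = 0b01
        if rd == rt:
            forward_b = 0b01

    return forward_a, forward_b

# Rs comes from the previous instruction, Rt from the one before it.
print(forward_select(rs=1, rt=2, ex_mem=(True, 1), mem_wb=(True, 2)))
```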
Adding a Bypass
[Figure: pipelined datapath (stages D, E, M, W) with a bypass path from the ALU output back to the ALU's A input through a mux controlled by ASrc; stall logic and bubble insertion as before. "x4 <= x1 ..." is in D while "x1 <= ..." is in E]
When does this bypass help?
(I1) x1 <= x0 + 10
(I2) x4 <= x1 + 17      yes

x1 <= M[x0 + 10]
x4 <= x1 + 17           no

JAL 500
x4 <= x1 + 17           no
The Bypass Signal: Deriving it from the Stall Signal
stall = (((rs1D == wsE) && weE + (rs1D == wsM) && weM + (rs1D == wsW) && weW) && re1D
       + ((rs2D == wsE) && weE + (rs2D == wsM) && weM + (rs2D == wsW) && weW) && re2D)
ws = rd
we = Case opcode
       ALU, ALUi, LW, JAL, JALR => on
       ...                      => off
ASrc = (rs1D == wsE) && weE && re1D
Is this correct?
No, because only ALU and ALUi instructions can benefit from this bypass
Split weE into two components: we-bypass, we-stall
Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE
               ALU, ALUi => on
               ...       => off
we-stallE = Case opcodeE
              LW, JAL, JALR => on
              ...           => off
ASrc = (rs1D == wsE) && we-bypassE && re1D
stall = ((rs1D == wsE) && we-stallE + (rs1D == wsM) && weM + (rs1D == wsW) && weW) && re1D
      + ((rs2D == wsE) && weE + (rs2D == wsM) && weM + (rs2D == wsW) && weW) && re2D
Fully Bypassed Datapath
[Figure: pipelined datapath (stages D, E, M, W) with bypass muxes ASrc and BSrc on both ALU inputs, an additional path labeled "PC for JAL, ...", and the stall and bubble logic]
Is there still a need for the stall signal?
stall = (rs1D == wsE) && (opcodeE == LWE) && (wsE != 0) && re1D
      + (rs2D == wsE) && (opcodeE == LWE) && (wsE != 0) && re2D
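With full bypassing, the only remaining stall in this equation is the load-use case. A minimal sketch, with a made-up function name and opcodes represented as plain strings (both assumptions of this example):

```python
def load_use_stall(rs1_d, re1_d, rs2_d, re2_d, opcode_e, ws_e):
    """Stall decode only when the instruction in E is a load whose destination is read in D."""
    load_pending = opcode_e == "LW" and ws_e != 0
    uses_rs1 = rs1_d == ws_e and re1_d
    uses_rs2 = rs2_d == ws_e and re2_d
    return load_pending and (uses_rs1 or uses_rs2)

# x1 <= M[x0 + 10] in E, x4 <= x1 + 17 in D: one bubble is still required.
print(load_use_stall(1, True, 0, False, "LW", 1))
# x1 <= x0 + 10 in E instead: the ALU result can be bypassed, no stall.
print(load_use_stall(1, True, 0, False, "ALUi", 1))
```

The load case survives because the loaded value does not exist until the end of the memory stage, so there is nothing to bypass from E.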
Control Hazards
What do we need to calculate next PC?
• For Jumps
  – Opcode, PC and offset
• For Jump Register
  – Opcode, Register value, and PC
• For Conditional Branches
  – Opcode, Register (for condition), PC and offset
• For all other instructions
  – Opcode and PC (and have to know it's not one of above)
PC Calculation Bubbles
time                  t0  t1  t2  t3  t4  t5  t6  t7  ....
(I1) x1 ← x0 + 10     IF1 ID1 EX1 MA1 WB1
(I2) x3 ← x2 + 17         IF2 IF2 ID2 EX2 MA2 WB2
(I3)                              IF3 IF3 ID3 EX3 MA3 WB3
(I4)                                      IF4 IF4 ID4 EX4 MA4 WB4

Resource Usage
time  t0  t1  t2  t3  t4  t5  t6  t7  ....
IF    I1  -   I2  -   I3  -   I4
ID        I1  -   I2  -   I3  -   I4
EX            I1  -   I2  -   I3  -   I4
MA                I1  -   I2  -   I3  -   I4
WB                    I1  -   I2  -   I3  -   I4
- ⇒ pipeline bubble
Speculate next address is PC+4
I1  096  ADD
I2  100  J 304
I3  104  ADD     (kill)
I4  304  ADD
[Figure: fetch stage with PCSrc mux (pc+4 / jabs / rind / br), Jump? detection, stall, and bubble insertion; I2 is in decode, I1 in execute, and PC = 104]
A jump instruction kills (not stalls) the following instruction
How?
Pipelining Jumps
I1  096  ADD
I2  100  J 304
I3  104  ADD     (kill)
I4  304  ADD
[Figure: same fetch logic with a mux, controlled by IRSrcD, placed before the decode-stage IR to select between the instruction memory output and a bubble; after the jump, PC = 304 and the decode stage holds a bubble]
To kill a fetched instruction: insert a mux before IR
IRSrcD = Case opcodeD
           JAL ⇒ bubble
           ... ⇒ IM
Any interaction between stall and jump?
Jump Pipeline Diagrams
time                  t0  t1  t2  t3  t4  t5  t6  t7  ....
(I1) 096: ADD         IF1 ID1 EX1 MA1 WB1
(I2) 100: J 304           IF2 ID2 EX2 MA2 WB2
(I3) 104: ADD                 IF3 -   -   -   -
(I4) 304: ADD                     IF4 ID4 EX4 MA4 WB4

Resource Usage
time  t0  t1  t2  t3  t4  t5  t6  t7  ....
IF    I1  I2  I3  I4  I5
ID        I1  I2  -   I4  I5
EX            I1  I2  -   I4  I5
MA                I1  I2  -   I4  I5
WB                    I1  I2  -   I4  I5
- ⇒ pipeline bubble
Pipelining Conditional Branches
I1  096  ADD
I2  100  BEQ x1, x2 +200
I3  104  ADD
I4  304  ADD
[Figure: fetch and decode logic with PCSrc mux (pc+4 / jabs / rind / br), BEQ? detection in decode, IRSrcD bubble mux, and the Taken? signal produced by the ALU in the execute stage; I2 is in decode, I1 in execute, and PC = 104]
Branch condition is not known until the execute stage
What action should be taken in the decode stage?
Pipelining Conditional Branches
I1  096  ADD
I2  100  BEQ x1, x2 +200
I3  104  ADD
I4  304  ADD
[Figure: same logic one cycle later; I3 is in decode, I2 in execute (Bcond?, Taken?), I1 in memory, and PC = 108]
If the branch is taken
  – kill the two following instructions
  – the instruction at the decode stage is not valid ⇒ stall signal is not valid
Pipelining Conditional Branches
I1:  096  ADD
I2:  100  BEQ x1, x2 +200
I3:  104  ADD
I4:  304  ADD
[Figure: same logic with bubble muxes on both the decode-stage IR (IRSrcD) and the execute-stage IR (IRSrcE), driven by Jump? in decode and by Bcond? and Taken? in execute; the branch target adder uses the PC carried down the pipeline]
If the branch is taken
  – kill the two following instructions
  – the instruction at the decode stage is not valid ⇒ stall signal is not valid
Branch Pipeline Diagrams (resolved in execute stage)
time                  t0  t1  t2  t3  t4  t5  t6  t7  ....
(I1) 096: ADD         IF1 ID1 EX1 MA1 WB1
(I2) 100: BEQ +200        IF2 ID2 EX2 MA2 WB2
(I3) 104: ADD                 IF3 ID3 -   -   -
(I4) 108:                         IF4 -   -   -   -
(I5) 304: ADD                         IF5 ID5 EX5 MA5 WB5

Resource Usage
time  t0  t1  t2  t3  t4  t5  t6  t7  ....
IF    I1  I2  I3  I4  I5
ID        I1  I2  I3  -   I5
EX            I1  I2  -   -   I5
MA                I1  I2  -   -   I5
WB                    I1  I2  -   -   I5
- ⇒ pipeline bubble
What If…
• We used a simple branch that compares only one register (rs1) against zero
• Can we do any better?
[Figure: pipelined datapath with PC, IR, A, B, Y, R, MD1, MD2 registers, Inst Memory, GPRs, Imm Select, ALU, and Data Memory]
Use simpler branches (e.g., only compare one reg against zero) with compare in decode stage
time                  t0  t1  t2  t3  t4  t5  t6  t7  ....
(I1) 096: ADD         IF1 ID1 EX1 MA1 WB1
(I2) 100: BEQZ +200       IF2 ID2 EX2 MA2 WB2
(I3) 104: ADD                 IF3 -   -   -   -
(I4) 300: ADD                     IF4 ID4 EX4 MA4 WB4

Resource Usage
time  t0  t1  t2  t3  t4  t5  t6  t7  ....
IF    I1  I2  I3  I4  I5
ID        I1  I2  -   I4  I5
EX            I1  I2  -   I4  I5
MA                I1  I2  -   I4  I5
WB                    I1  I2  -   I4  I5
- ⇒ pipeline bubble
Branch Delay Slots (expose control hazard to software)
• Change the ISA semantics so that the instruction that follows a jump or branch is always executed
  – gives compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted.
I1  096  ADD
I2  100  BEQZ r1, +200
I3  104  ADD     (delay slot instruction executed regardless of branch outcome)
I4  300  ADD
Branch Pipeline Diagrams (branch delay slot)
time                  t0  t1  t2  t3  t4  t5  t6  t7  ....
(I1) 096: ADD         IF1 ID1 EX1 MA1 WB1
(I2) 100: BEQZ +200       IF2 ID2 EX2 MA2 WB2
(I3) 104: ADD                 IF3 ID3 EX3 MA3 WB3
(I4) 300: ADD                     IF4 ID4 EX4 MA4 WB4

Resource Usage
time  t0  t1  t2  t3  t4  t5  t6  t7  ....
IF    I1  I2  I3  I4
ID        I1  I2  I3  I4
EX            I1  I2  I3  I4
MA                I1  I2  I3  I4
WB                    I1  I2  I3  I4
Post-1990 RISC ISAs don't have delay slots
• What are the problems with delay slots?
• Encodes microarchitectural detail into ISA
  – C.f. IBM 650 drum layout
• Performance issues
  – E.g., I-cache miss or page fault on delay slot instruction causes machine to wait, even if delay slot is a NOP
• Complicates more advanced microarchitectures
  – 30-stage pipeline with four-instruction-per-cycle issue
• Complicates the compiler's job
• Better branch prediction reduced need for delay slots
Why an Instruction may not be dispatched every cycle (CPI > 1)
• Full bypassing may be too expensive to implement
  – typically all frequently used paths are provided
  – some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
• Loads have two-cycle latency
  – Instruction after load cannot use load result
  – MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II (pipeline interlocks added in hardware)
    • MIPS: "Microprocessor without Interlocked Pipeline Stages"
• Conditional branches may cause bubbles
  – kill following instruction(s) if no delay slots
Machines with software-visible delay slots may execute significant number of NOP instructions inserted by the compiler. NOPs increase instructions/program!
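To see how these bubbles add up, the sketch below computes an effective CPI as the base CPI plus the average number of bubble cycles per instruction. The event frequencies are hypothetical values chosen for illustration; the penalties (one bubble for a load followed by a dependent use, two killed instructions for a taken branch resolved in the execute stage) follow the pipeline described in this lecture.

```python
def effective_cpi(base_cpi, events):
    """events: iterable of (fraction_of_instructions, bubble_cycles_per_event)."""
    return base_cpi + sum(fraction * bubbles for fraction, bubbles in events)

cpi = effective_cpi(1.0, [
    (0.05, 1),  # loads followed by a dependent instruction (assumed 5%), 1 bubble each
    (0.10, 2),  # taken branches resolved in EX (assumed 10%), 2 killed instructions each
])
print(f"effective CPI = {cpi:.2f}")
```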
RISC-V Branches and Jumps
• JAL: unconditional jump to PC + immediate
• JALR: indirect jump to rs1 + immediate
• Branch: if (rs1 conds rs2), branch to PC + immediate
RISC-V Branches and Jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1) Is the preceding instruction a taken branch?
2) If so, what is the target address?

Instruction   Taken known?          Target known?
JAL           After Inst. Decode    After Inst. Decode
JALR          After Inst. Decode    After Reg. Fetch
B<cond.>      After Execute         After Inst. Decode
Branch Penalties in Modern Pipelines
UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750 MHz, 2000)
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode        (Branch Target Address Known)
I  Complete Decode
J  Steer Instructions to Functional units
R  Register File Read
E  Integer Execute                         (Branch Direction & Jump Register Target Known)
Remainder of execute pipeline (+ another 6 stages)
Reducing Control Flow Penalty
• Software solutions
  – Eliminate branches: loop unrolling
    • Increases the run length
  – Reduce resolution time: instruction scheduling
    • Compute the branch condition as early as possible (of limited value because branches are often in the critical path through code)
• Hardware solutions
  – Find something else to do: delay slots
    • Replaces pipeline bubbles with useful work (requires software cooperation)
  – Speculate: branch prediction
    • Speculative execution of instructions beyond the branch
Branch Prediction
• Motivation:
  – Branch penalties limit performance of deeply pipelined processors
  – Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
• Required hardware support:
  – Prediction structures:
    • Branch history tables, branch target buffers, etc.
  – Mispredict recovery mechanisms:
    • Keep result computation separate from commit
    • Kill instructions following branch in pipeline
    • Restore state to that following branch
Static Branch Prediction
Overall probability a branch is taken is ~60-70% but:
  backward 90%   (What C++ statement does this look like?)
  forward 50%    (What C++ statement does this look like?)
ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110
  bne0 (preferred taken)   beq0 (not taken)
Dynamic Branch Prediction: learning based on past behavior
• Temporal correlation (time)
  – If I tell you that a certain branch was taken last time, does this help?
  – The way a branch resolves may be a good predictor of the way it will resolve at the next execution
• Spatial correlation (space)
  – Several branches may resolve in a highly correlated manner
  – For instance, a preferred path of execution
Dynamic Branch Prediction
• 1-bit prediction scheme
  – The low-order bits of the branch address index a one-bit flag that records whether the branch was last Taken or Not Taken
  – Simple
• 2-bit prediction
  – Miss twice to change
Branch Prediction Bits
• Assume 2 BP bits per instruction
• Change the prediction after two consecutive mistakes!
[Figure: four-state diagram with states "take right", "take wrong", "¬take wrong", "¬take right" and transitions labeled taken / ¬taken]
BP state: (predict take / ¬take) x (last prediction right / wrong)
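One reading of this state encoding is sketched below: the predictor keeps its current guess plus a flag saying whether the last prediction was right, and flips the guess only on a second consecutive mistake. The class name and the choice to enter the "right" state after a flip are assumptions of this example, not details confirmed by the slide.

```python
class TwoBitPredictor:
    """State = (predict take / not take) x (last prediction right / wrong)."""

    def __init__(self, predict_taken=False):
        self.predict_taken = predict_taken
        self.last_right = True

    def predict(self):
        return self.predict_taken

    def update(self, taken):
        if taken == self.predict_taken:
            self.last_right = True          # correct: stay with the current guess
        elif self.last_right:
            self.last_right = False         # first mistake: keep the guess
        else:
            self.predict_taken = taken      # second consecutive mistake: flip
            self.last_right = True          # assumed: new guess starts as "right"

# A loop branch: taken nine times, then not taken at loop exit, run twice.
outcomes = ([True] * 9 + [False]) * 2
predictor = TwoBitPredictor(predict_taken=False)
mispredicts = 0
for taken in outcomes:
    if predictor.predict() != taken:
        mispredicts += 1
    predictor.update(taken)
print(f"{mispredicts} mispredictions out of {len(outcomes)}")
```

The single not-taken outcome at loop exit costs one misprediction but does not flip the guess, so the second run of the loop starts out predicted correctly. A 1-bit scheme would mispredict twice per loop execution instead.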
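Branch History Table
[Figure: the fetch PC (low bits 00 dropped) supplies k bits as the BHT index into a 2^k-entry BHT with 2 bits/entry, in parallel with the I-Cache access; the fetched instruction's opcode gives Branch?, its offset is added (+) to the PC to form Target PC, and the BHT gives Taken/¬Taken?]
4K-entry BHT, 2 bits/entry, ~80-90% correct predictions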
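A BHT of this shape can be sketched as an array of 2-bit entries indexed by k bits of the fetch PC. The example below uses saturating counters for the 2-bit entries, which is a common alternative encoding and an assumption of this sketch; the class name and the initial counter value are also made up for illustration.

```python
class BranchHistoryTable:
    def __init__(self, k):
        self.mask = (1 << k) - 1
        # 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * (1 << k)   # assumed initial state: weakly not taken

    def index(self, pc):
        # Instructions are 4-byte aligned, so the two low PC bits carry no information.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bht = BranchHistoryTable(k=12)           # 2^12 = 4K entries
bht.update(0x1000, taken=True)
print(bht.predict(0x1000))               # trained entry
print(bht.predict(0x1000 + (4 << 12)))   # a different PC that maps to the same entry
```

Because there are no tags, two branches whose PCs share the same k index bits share one entry, which is one source of the mispredictions behind the quoted accuracy range.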
Exploiting Spatial Correlation (Yeh and Patt, 1992)
if (x[i] < 7) then
    y += 1;
if (x[i] < 5) then
    c -= 4;
If first condition false, second condition also false
History register, H, records the direction of the last N branches executed by the processor
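Two-Level Branch Predictor
Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct)
[Figure: k bits of the fetch PC (low bits 00 dropped) index four sets of BHT bits; a 2-bit global branch history shift register, into which the Taken/¬Taken result of each branch is shifted, selects one of the four sets to produce Taken/¬Taken?]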
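The two-level idea can be sketched as several BHTs with a global history register choosing among them. The code below is an illustrative model only: the class name, the saturating-counter encoding, and the initial counter values are assumptions, and it is not a description of the actual Pentium Pro hardware.

```python
class TwoLevelPredictor:
    def __init__(self, k, history_bits=2):
        self.pc_mask = (1 << k) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0   # global shift register of recent branch outcomes
        # One table of 2-bit saturating counters per history pattern.
        self.tables = [[1] * (1 << k) for _ in range(1 << history_bits)]

    def _entry(self, pc):
        return self.tables[self.history], (pc >> 2) & self.pc_mask

    def predict(self, pc):
        table, i = self._entry(pc)
        return table[i] >= 2

    def update(self, pc, taken):
        table, i = self._entry(pc)
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        # Shift the newest outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask

# A branch that strictly alternates taken / not taken.
predictor = TwoLevelPredictor(k=4)
mispredicts = 0
for taken in [True, False] * 10:
    if predictor.predict(0x40) != taken:
        mispredicts += 1
    predictor.update(0x40, taken)
print(f"{mispredicts} mispredictions out of 20")
```

An alternating pattern defeats a plain per-branch 2-bit counter, but here each history pattern gets its own counter, so after a short warm-up the pattern is predicted correctly.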
Speculating Both Directions
• An alternative to branch prediction is to execute both directions of a branch speculatively
  – resource requirement is proportional to the number of concurrent speculative executions
  – only half the resources engage in useful work when both directions of a branch are executed speculatively
  – branch prediction takes less resources than speculative execution of both paths
• With accurate branch prediction, it is more cost effective to dedicate all resources to the predicted direction!
  – What would you choose with 80% accuracy?
Are We Missing Something?
• Knowing whether a branch is taken or not is great, but what else do we need to know about it?
Branch target address
Limitations of BHTs
Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined.
UltraSPARC-III fetch pipeline
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional units
R  Register File Read
E  Integer Execute
Remainder of execute pipeline (+ another 6 stages)
[Figure annotations: "Correctly predicted taken branch penalty" and "Jump Register penalty" mark spans of these stages]
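Branch Target Buffer
[Figure: the PC addresses IMEM and, using k bits, a Branch Target Buffer of 2^k entries; each entry holds BPb and a predicted target, which produce BP and target]
BP bits are stored with the predicted target address.
IF stage: If (BP = taken) then nPC = target else nPC = PC + 4
Later: check prediction; if wrong then kill the instruction and update BTB & BPb, else update BPb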
Address Collisions (Mis-Prediction)
Assume a 128-entry BTB
Instruction Memory:
  132   Jump +104
  1028  Add .....
BTB entry: BPb = take, target = 236
What will be fetched after the instruction at 1028?
  BTB prediction = 236
  Correct target = 1032
  => kill PC = 236 and fetch PC = 1032
Is this a common occurrence?
BTB is only for Control Instructions
• Is even branch prediction fast enough to avoid bubbles?
• When do we index the BTB?
  – i.e., what state is the branch in, in order to avoid bubbles?
• BTB contains useful information for branch and jump instructions only => Do not update it for other instructions
• For all other instructions the next PC is PC+4!
• How to achieve this effect without decoding the instruction?
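Branch Target Buffer (BTB)
[Figure: 2^k-entry direct-mapped BTB (can also be associative), indexed by k bits of the PC in parallel with the I-Cache; each entry holds Valid, Entry PC, and predicted target PC; a comparator (=) on Entry PC produces match, and valid together with match selects the target]
• Keep both the branch PC and target PC in the BTB
• PC+4 is fetched if match fails
• Only taken branches and jumps held in BTB
• Next PC determined before branch fetched and decoded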
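Storing the branch PC as a tag is what lets the fetch stage fall back to PC+4 for every instruction that is not a known taken branch or jump, without decoding it. A minimal direct-mapped sketch follows; the class and method names are invented for this example, and the address 644 is a hypothetical PC chosen because it maps to the same entry as 132.

```python
class BranchTargetBuffer:
    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.entries = [None] * (1 << k)   # each entry: (entry_pc, target_pc) or None

    def _index(self, pc):
        return (pc >> 2) & self.mask

    def next_pc(self, pc):
        """Predicted next fetch address, available before the instruction is decoded."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:   # valid and Entry PC matches
            return entry[1]
        return pc + 4                              # match fails: fetch PC+4

    def record_taken(self, pc, target):
        """Called once a taken branch or jump has resolved."""
        self.entries[self._index(pc)] = (pc, target)

btb = BranchTargetBuffer(k=7)        # 128 entries
btb.record_taken(132, 236)           # Jump +104 at address 132
print(btb.next_pc(132))              # hit: predicted target
print(btb.next_pc(644))              # same index, different Entry PC: falls back to PC+4
```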
Are We Missing Something? (2)
• When do we update the BTB or BHT?
[Figure: pipelined datapath with PC, IR, A, B, Y, R, MD1, MD2 registers, Inst Memory, GPRs, Imm Select, ALU, and Data Memory]
Combining BTB and BHT
• BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR)
• BHT can hold many more entries and is more accurate
[Figure: UltraSPARC-III stages A (PC Generation/Mux), P (Instruction Fetch Stage 1), F (Instruction Fetch Stage 2), B (Branch Address Calc/Begin Decode), I (Complete Decode), J (Steer Instructions to Functional units), R (Register File Read), E (Integer Execute), annotated with where the BTB and the BHT are consulted]
BHT in later pipeline stage corrects when BTB misses a predicted taken branch
BTB/BHT only updated after branch resolves in E stage
Uses of Jump Register (JR)
How well does BTB work for each of these cases?
• Switch statements (jump to address of matching case)
  – BTB works well if same case used repeatedly
• Dynamic function call (jump to run-time function address)
  – BTB works well if same function usually called (e.g., in C++ programming, when objects have same type in virtual function call)
• Subroutine returns (jump to return address)
  – BTB works well if usually return to the same place ⇒ Often one function called from many distinct call sites!
Subroutine Return Stack
Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs.
fa() { fb(); }
fb() { fc(); }
fc() { fd(); }
Push call address when function call executed
Pop return address when subroutine return decoded
[Figure: stack of k entries (typically k = 8-16) holding &fb(), &fc(), &fd()]
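The push and pop behavior can be sketched as a small bounded stack. In the example below the class name, the choice to store the address of the instruction after the call (call PC + 4), and the policy of dropping the oldest entry on overflow are all assumptions made for illustration.

```python
class ReturnAddressStack:
    def __init__(self, k=8):
        self.k = k          # number of entries
        self.stack = []

    def on_call(self, call_pc):
        """Function call executed: remember where the matching return should go."""
        if len(self.stack) == self.k:
            self.stack.pop(0)               # full: drop the oldest entry (assumed policy)
        self.stack.append(call_pc + 4)      # assumed: return address is call PC + 4

    def predict_return(self):
        """Subroutine return decoded: pop the predicted return address, if any."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack(k=8)
ras.on_call(0x100)    # fa() calls fb()
ras.on_call(0x200)    # fb() calls fc()
print(hex(ras.predict_return()))   # return from fc()
print(hex(ras.predict_return()))   # return from fb()
```

Because calls and returns nest, the most recent call site is the right prediction for the next return, which a BTB indexed by the return instruction's PC cannot capture when one function is called from many places.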