Pipelining: Performance and Limits
Arquitectura de Computadoras
Arturo Díaz Pérez
Centro de Investigación y de Estudios Avanzados del IPN
Laboratorio de Tecnologías de Información
[email protected]
2014-07-17
Review: Summary of Pipelining Basics
- Pipelines pass control information down the pipe just as data moves down the pipe
- Forwarding and stalls are handled by local control
- Hazards limit performance:
  - Structural: need more HW resources
  - Data: need forwarding, compiler scheduling
  - Control: early evaluation & PC, delayed branch, prediction
- Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency
- What about performance?
Is CPI = 1 for our pipeline?
Remember that CPI is an "average # of cycles per instruction."

IFetch Dcd Exec Mem WB
       IFetch Dcd Exec Mem WB
              IFetch Dcd Exec Mem WB
                     IFetch Dcd Exec Mem WB

CPI here is 1, since the average throughput is 1 instruction every cycle. What if there are stalls or multi-cycle execution? Usually CPI > 1. How close can we get to 1?
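To make the arithmetic concrete, here is a minimal Python sketch (not from the slides; names are illustrative) that measures average CPI from per-instruction stall counts, ignoring pipeline fill and drain:

```python
# Minimal sketch: average CPI for a pipelined machine, computed as
# (issue cycles + stall cycles) / instruction count.
def measured_cpi(stalls_per_instruction):
    """stalls_per_instruction: list of stall cycles each instruction incurs."""
    n = len(stalls_per_instruction)
    cycles = n + sum(stalls_per_instruction)  # 1 issue cycle each, plus stalls
    return cycles / n

print(measured_cpi([0, 0, 0, 0]))  # no stalls: CPI = 1.0
print(measured_cpi([0, 1, 0, 1]))  # two 1-cycle stalls over 4 instrs: CPI = 1.5
```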
Speedup Equation for Pipelining
Pipeline speedup = Average instruction time unpipelined / Average instruction time pipelined
                 = (CPI_unpipelined × Clock cycle_unpipelined) / (CPI_pipelined × Clock cycle_pipelined)

Pipeline speedup = (CPI_Ideal × Pipeline depth / CPI_pipelined) × (Clock cycle_unpipelined / Clock cycle_pipelined)

where:
CPI_unpipelined = CPI_Ideal × Pipeline depth
CPI_pipelined   = CPI_Ideal + Pipeline stall cycles per instruction
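The speedup equation above can be sketched as a small Python function (illustrative, not from the slides; `clock_ratio` is the unpipelined-to-pipelined clock-cycle ratio, ~1 when stages are balanced):

```python
# Sketch of the pipeline speedup equation:
# speedup = (CPI_Ideal * depth / CPI_pipelined) * (clock_unpipe / clock_pipe)
def pipeline_speedup(ideal_cpi, depth, stalls_per_instr, clock_ratio=1.0):
    cpi_pipelined = ideal_cpi + stalls_per_instr
    return (ideal_cpi * depth / cpi_pipelined) * clock_ratio

print(pipeline_speedup(1.0, 5, 0.0))  # ideal 5-stage pipeline: speedup = 5.0
print(pipeline_speedup(1.0, 5, 1.0))  # 1 stall cycle per instr: speedup = 2.5
```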
Speedup Equation for Pipelining, cont.
Example: Dual-port vs. single-port
- Machine A: dual-ported memory
- Machine B: single-ported memory, 1.05 times faster clock rate
- CPI_ideal = 1 for both
- Loads are 40% of instructions executed

Speedup_A = Pipeline depth / (1 + 0) × (clock_unpipe / clock_pipe)
          = Pipeline depth

Speedup_B = Pipeline depth / (1 + 0.4 × 1) × (clock_unpipe / (clock_pipe / 1.05))
          = 0.75 × Pipeline depth

Speedup_A / Speedup_B = Pipeline depth / (0.75 × Pipeline depth) = 1.33

Machine A is 1.33 times faster.
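As a check on the arithmetic, here is a short Python sketch (values from the slide; the `depth` variable is arbitrary since it cancels in the final ratio):

```python
# Machine A: dual-ported memory (no structural stalls).
# Machine B: single-ported memory (40% loads stall 1 cycle), 1.05x faster clock.
depth = 5.0  # arbitrary; cancels in the ratio below

speedup_a = depth / (1 + 0.0)               # no stalls
speedup_b = depth / (1 + 0.4 * 1.0) * 1.05  # stalls, but faster clock

print(speedup_b / depth)       # 0.75: B achieves 0.75 x pipeline depth
print(speedup_a / speedup_b)   # ~1.33: Machine A is 1.33 times faster
```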
Performance Summary
- Just overlap tasks; easy if tasks are independent
- Speedup ≤ Pipeline depth; if ideal CPI is 1, then:

  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction) × (Clock cycle_unpipelined / Clock cycle_pipelined)

- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data: need forwarding, compiler scheduling
  - Control: discuss next time
Recap: Pipeline Hazards
[Pipeline diagrams illustrating each hazard class:]
- Structural hazard: e.g., I-Fetch and MemOpFetch competing for the same resource
- Control hazard: e.g., the instruction fetched after a Jump
- RAW (read after write) data hazard
- WAW (write after write) data hazard
- WAR (write after read) data hazard
Hazard Detection
Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.

A RAW hazard exists on register ρ if ρ ∈ Rregs(i) ∩ Wregs(j)
- Keep a record of pending writes (for instructions in the pipe) and compare with the operand registers of the current instruction.
- When an instruction issues, reserve its result register.
- When an operation completes, remove its write reservation.

A WAW hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Wregs(j)
A WAR hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Rregs(j)
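The set formulation above maps directly onto set intersection in code. A minimal sketch (hypothetical helper; Rregs/Wregs are passed as Python sets):

```python
# Hazards between an issuing instruction i and an in-flight predecessor j,
# following the Rregs/Wregs intersection definitions from the slide.
def hazards(rregs_i, wregs_i, rregs_j, wregs_j):
    return {
        "RAW": rregs_i & wregs_j,  # i reads a register j will write
        "WAW": wregs_i & wregs_j,  # i writes a register j will write
        "WAR": wregs_i & rregs_j,  # i writes a register j still reads
    }

# i: add r3, r1, r2  (reads r1, r2; writes r3)
# j: lw  r1, 0(r4)   (reads r4; writes r1), still in the pipe
h = hazards({"r1", "r2"}, {"r3"}, {"r4"}, {"r1"})
print(h["RAW"])  # {'r1'}: i must stall for (or forward from) j
```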
Record of Pending Writes in Pipeline Registers
Current operand registers are compared against pending writes:

hazard <= ((rs == rwex)  & regWex)  OR
          ((rs == rwmem) & regWmem) OR
          ((rs == rwwb)  & regWwb)  OR
          ((rt == rwex)  & regWex)  OR
          ((rt == rwmem) & regWmem) OR
          ((rt == rwwb)  & regWwb)
[Datapath diagram: IAU/PC and npc, instruction memory, register file, ALU, data memory, with op/rw fields held in each pipeline register and comparators on the rs/rt fields of the instruction in decode]
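The comparator logic above can be modeled as a small Python predicate (a sketch; the `(rw, regW)` pairs stand in for the slide's per-stage pipeline-register fields):

```python
# Model of the hazard-detection comparators: a hazard exists when either
# source register (rs or rt) matches a pending write (rw) whose write-enable
# bit (regW) is set, in any of the EX, MEM, or WB pipeline registers.
def hazard(rs, rt, pipe):
    """pipe: list of (rw, regW) pairs for the EX, MEM, and WB stages."""
    return any(regw and src == rw
               for src in (rs, rt)
               for rw, regw in pipe)

# Instruction in decode reads r1 and r2; a write to r1 is pending in MEM.
print(hazard("r1", "r2", [("r7", True), ("r1", True), ("r9", False)]))  # True
print(hazard("r3", "r4", [("r7", True), ("r1", True), ("r9", False)]))  # False
```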
Resolve RAW by "forwarding" (or bypassing)
- Detect the nearest valid write of an operand register and forward it into the op latches, bypassing the remainder of the pipe
- Increase muxes to add paths from the pipeline registers
- Data forwarding = data bypassing
[Datapath diagram: the same pipeline as before, extended with forwarding muxes so the rs/rt operands can be taken from later pipeline registers instead of the register file]
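The "nearest valid write wins" rule can be sketched as follows (illustrative; the EX/MEM and MEM/WB stage names and the stored values are assumptions, not from the slide):

```python
# Forwarding mux model: scan pipeline registers nearest-first and forward
# the youngest pending write that matches the operand register.
def forward_source(src, pipe_regs):
    """pipe_regs: [(stage_name, rw, regW, value)], ordered nearest-first."""
    for stage, rw, regw, value in pipe_regs:
        if regw and rw == src:
            return stage, value  # nearest (most recent) producer wins
    return "regfile", None       # no pending write: read the register file

pipe = [("EX/MEM", "r1", True, 42), ("MEM/WB", "r1", True, 7)]
print(forward_source("r1", pipe))  # ('EX/MEM', 42): newest value forwarded
print(forward_source("r2", pipe))  # ('regfile', None)
```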
What about memory operations?
[Figure: memory-pipeline datapath with op/Rd/Ra/Rb pipeline registers and the result Rd routed back to the register file]

If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations!

What does delaying WB on arithmetic operations cost?
- cycles?
- hardware?

What about data dependence on loads?
    R1 <- R4 + R5
    R2 <- Mem[ R2 + I ]
    R3 <- R2 + R1
⇒ "Delayed loads": this can be recognized in the decode stage, introducing a bubble while stalling the fetch stage.

Tricky situation:
    R1 <- Mem[ R2 + I ]
    Mem[ R3 + 34 ] <- R1
Handle with a bypass in the memory stage!
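The decode-stage check for a load-use dependence can be sketched as follows (hypothetical encoding, not the slide's hardware):

```python
# Load-use detection: if the instruction in decode reads the register that
# a load ahead of it is about to produce, insert a bubble and stall fetch.
def needs_bubble(load_dest, decode_sources):
    """load_dest: destination of the load in execute (or None if not a load);
    decode_sources: set of registers read by the instruction in decode."""
    return load_dest is not None and load_dest in decode_sources

# lw r2, ... followed immediately by add r3, r2, r1 -> one-cycle bubble
print(needs_bubble("r2", {"r2", "r1"}))  # True
print(needs_bubble("r2", {"r4", "r5"}))  # False: no dependence, no stall
```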
Software Scheduling
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:
    lw  Rb, b
    lw  Rc, c
    add Ra, Rb, Rc
    sw  a, Ra
    lw  Re, e
    lw  Rf, f
    sub Rd, Re, Rf
    sw  d, Rd

Fast code:
    lw  Rb, b
    lw  Rc, c
    lw  Re, e
    add Ra, Rb, Rc
    lw  Rf, f
    sw  a, Ra
    sub Rd, Re, Rf
    sw  d, Rd
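A quick way to see why the fast code wins is to count load-use stalls in each schedule. The following Python sketch (illustrative encoding, assuming a 1-cycle load delay with forwarding for ALU results) does exactly that:

```python
# Count load-use stalls: a load's result is not available to the very next
# instruction, so an immediate consumer costs one stall cycle.
def load_use_stalls(schedule):
    """schedule: list of (op, dest, sources) tuples."""
    stalls = 0
    for (op, dest, _), (_, _, sources) in zip(schedule, schedule[1:]):
        if op == "lw" and dest in sources:
            stalls += 1
    return stalls

slow = [("lw", "Rb", []), ("lw", "Rc", []),
        ("add", "Ra", ["Rb", "Rc"]), ("sw", None, ["Ra"]),
        ("lw", "Re", []), ("lw", "Rf", []),
        ("sub", "Rd", ["Re", "Rf"]), ("sw", None, ["Rd"])]

fast = [("lw", "Rb", []), ("lw", "Rc", []), ("lw", "Re", []),
        ("add", "Ra", ["Rb", "Rc"]), ("lw", "Rf", []),
        ("sw", None, ["Ra"]), ("sub", "Rd", ["Re", "Rf"]),
        ("sw", None, ["Rd"])]

print(load_use_stalls(slow))  # 2: add after lw Rc, sub after lw Rf
print(load_use_stalls(fast))  # 0: the compiler moved loads away from uses
```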
Pipelined Processor
- Separate control at each stage
- Stalls propagate backwards to freeze previous stages
- Bubbles are introduced into the pipeline by placing "noops" into the local stage while stalling previous stages
[Datapath diagram: Next PC/PC, instruction memory, IR, register file, ALU, memory access/data memory, and per-stage control (Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl), with stall and bubble signals]
- IF – first half of instruction fetch; PC selection happens here, as well as initiation of instruction cache access.
- IS – second half of access to instruction cache.
- RF – instruction decode and register fetch, hazard checking, and instruction cache hit detection.
- EX – execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
- DF – data fetch, first half of access to data cache.
- DS – second half of access to data cache.
- TC – tag check, to determine whether the data cache access hit.
- WB – write back for loads and register-register operations.

8 stages: what is the impact on load delay? Branch delay? Why?
Case Study: MIPS Case Study: MIPS R4000R4000
MIPS R4000 Floating Point
- FP adder, FP multiplier, FP divider
- Last step of FP multiplier/divider uses FP adder HW
- 8 kinds of stages in FP units:

Stage  Functional unit  Description
A      FP adder         Mantissa ADD stage
D      FP divider       Divide pipeline stage
E      FP multiplier    Exception test stage
M      FP multiplier    First stage of multiplier
N      FP multiplier    Second stage of multiplier
R      FP adder         Rounding stage
S      FP adder         Operand shift stage
U                       Unpack FP numbers
MIPS FP Pipe Stages
FP instr        1   2     3     4     5     6     7     8  …
Add, subtract   U   S+A   A+R   R+S
Multiply        U   E+M   M     M     M     N     N+A   R
Divide          U   A     R     D28 …       D+A   D+R, D+R, D+A, D+R, A, R
Square root     U   E     (A+R)108 …        A     R
Negate          U   S
Absolute value  U   S
FP compare      U   A     R

Stages:
A  Mantissa ADD stage
D  Divide pipeline stage
E  Exception test stage
M  First stage of multiplier
N  Second stage of multiplier
R  Rounding stage
S  Operand shift stage
U  Unpack FP numbers
R4000 Performance

Not an ideal CPI of 1:
- Load stalls (1 or 2 clock cycles)
- Branch stalls (2 cycles + unfilled slots)
- FP result stalls: RAW data hazard (latency)
- FP structural stalls: not enough FP hardware (parallelism)
[Bar chart: pipeline CPI for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, and tomcatv, broken into base CPI, load stalls, branch stalls, FP result stalls, and FP structural stalls]
Superscalar DLX: 2 instructions, 1 FP & 1 anything else
- Fetch 64 bits/clock cycle; Int on left, FP on right
- Can only issue the 2nd instruction if the 1st instruction issues
- More ports for FP registers to do FP load & FP op in a pair
Type              Pipe stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB
- 1-cycle load delay expands to 3 instructions in SS:
  - the instruction in the right half can't use it, nor can the instructions in the next slot
Unrolled Loop that Minimizes Stalls for Scalar
- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration
- Need more registers in VLIW (EPIC => 128 int + 128 FP)
Software Pipelining

Observation: if iterations from loops are independent, then we can get more ILP by taking instructions from different iterations.
Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop
[Figure: overlapping iterations 0 through 4 combine into a single software-pipelined iteration]
Software Pipelining, cont.
sum = 0.0;
for( i=1; i<=N; i++ )
    sum = sum + a[i]*b[i];

Each iteration:
    load a[i]
    load b[i]
    mult a[i]*b[i]
    add sum[i]
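The software-pipelined steady state can be illustrated in Python: each step of the loop below accumulates the product from two iterations back, multiplies the operands loaded one iteration back, and loads the current operands (purely illustrative scheduling, not generated code):

```python
# Software-pipelined dot product: overlap load(i), mult(i-1), and add(i-2)
# in each steady-state step, with two extra steps to drain the pipeline.
def software_pipelined_dot(a, b):
    n = len(a)
    loaded, product, total = None, None, 0.0
    for i in range(n + 2):                 # n iterations + 2 drain steps
        if product is not None:
            total += product               # add: result from iteration i-2
        product = (loaded[0] * loaded[1]   # mult: operands from iteration i-1
                   if loaded is not None else None)
        loaded = (a[i], b[i]) if i < n else None  # loads for iteration i
    return total

print(software_pipelined_dot([1, 2, 3], [4, 5, 6]))  # 32.0 = 1*4 + 2*5 + 3*6
```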
Until done:
- select the most common path -- a trace
- schedule the trace across basic blocks
- repair the other paths
Trace Scheduling, cont.
Original code:
    b[i] = "old"
    a[i] = ...
    if( a[i] > 0 ) then
        b[i] = "new";  /* common case */
    else
        x
    endif
    c[i] = ...

Trace to be scheduled:
    b[i] = "old"
    a[i] = ...
    b[i] = "new"
    c[i] = ...
    if( a[i] <= 0 ) go to A
B:

Repair code:
A:  restore old b[i]
    x
    maybe recalculate c[i]
    go to B
Summary

Hazards limit performance:
- Structural: need more HW resources
- Data: need forwarding, compiler scheduling
- Control: early evaluation & PC, delayed branch, prediction
Data hazards must be handled carefully:
- RAW data hazards are handled by forwarding
- WAW and WAR hazards don't exist in the 5-stage pipeline
MIPS I instruction set architecture made the pipeline visible (delayed branch, delayed load).

Exceptions in the 5-stage pipeline are recorded when they occur, but acted on only at the WB (end of MEM) stage:
- Must flush all previous instructions
More performance comes from deeper pipelines and parallelism; instruction-level parallelism is used to get CPI < 1.0.