1 Appendix C Pipelining: Basic and Intermediate Concepts Computer Architecture A Quantitative Approach, Fifth Edition
Jan 17, 2016
2
Basic Pipelining
Pipelining is the organizational implementation technique that has been responsible for the most dramatic increase in computer performance.
Overview of basic pipelining
  What is pipelining?
  Computing pipeline speedup
  Clocking pipelines
  Pipelining MIPS
  Pipeline hazards
  Handling interrupts
3
Pipelining
4
Pipelining 3 Stages
Assume a 2 ns flip-flop delay
5
Pipelining: Computing the speedup
Time per instruction
  $\mathrm{TPI} = \mathrm{CPI} \times \text{cycle time}$
  We can think about pipelining as reducing either CPI or cycle time
Ideal speedup
  $\text{Speedup} = \dfrac{\text{TPI without pipeline}}{\text{TPI with pipeline}} = \text{number of pipeline stages}$
  Requires that all stages be perfectly balanced
  No synchronization (latch, flip-flop) overhead
  No stall cycles
The speedup from a pipeline is limited
  $\mathrm{CPI}_{real} = \mathrm{CPI}_{ideal} + \mathrm{CPI}_{stall}$
  $\mathrm{CCT}_{real} = \mathrm{Time}_{\text{longest pipestage}} + \mathrm{Time}_{\text{latch overhead}}$
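As a rough illustration of these limits, the sketch below computes the speedup of a pipeline from per-stage delays, latch overhead, and stall CPI. The function name and the 10 ns stage delays are assumptions chosen for illustration; the 2 ns latch overhead echoes the flip-flop delay assumed in the three-stage example.

```python
def pipeline_speedup(stage_delays_ns, latch_overhead_ns, cpi_stall=0.0):
    """Speedup of a pipelined design over an unpipelined one built from the same logic.

    Assumption: the unpipelined machine completes one instruction per (long)
    cycle, so its time per instruction is the sum of the stage delays.
    """
    tpi_without = sum(stage_delays_ns)                      # CPI = 1, one long cycle
    cycle_with = max(stage_delays_ns) + latch_overhead_ns   # longest stage + latch overhead
    cpi_with = 1.0 + cpi_stall                              # ideal CPI of 1 plus stall CPI
    return tpi_without / (cpi_with * cycle_with)


if __name__ == "__main__":
    # Perfectly balanced, no overhead, no stalls: speedup equals the stage count.
    print(pipeline_speedup([10, 10, 10], 0.0))
    # Adding latch overhead and stalls pulls the speedup below the ideal.
    print(pipeline_speedup([10, 10, 10], 2.0))
    print(pipeline_speedup([10, 10, 10], 2.0, cpi_stall=0.25))
```

With three balanced 10 ns stages the ideal speedup is 3; the 2 ns latch overhead alone reduces it to 2.5, and a stall CPI of 0.25 reduces it further to 2.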
6
MIPS Instruction Formats
7
Basic MIPS Pipeline
8
Basic MIPS Pipeline (simplified)
9
Pipelining By Adding Registers
[Figure: MIPS datapath (PC, instruction memory, register file, sign extend, ALU, data memory, and muxes) divided into five stages]
IF: Instruction fetch
ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access
WB: Write back
10
MIPS Pipelined Execution
Instruction  1   2   3   4   5   6   7   8   9
i            IF  ID  EX  MEM WB
i+1              IF  ID  EX  MEM WB
i+2                  IF  ID  EX  MEM WB
i+3                      IF  ID  EX  MEM WB
i+4                          IF  ID  EX  MEM WB
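The staggered pattern above can be generated mechanically: instruction i enters IF in cycle i+1 and advances one stage per cycle. The following is a minimal sketch with hypothetical names (`STAGES`, `pipeline_chart`), assuming an ideal pipeline with no stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    total_cycles = num_instructions + len(STAGES) - 1
    chart = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for offset, stage in enumerate(STAGES):
            row[i + offset] = stage  # instruction i starts one cycle after instruction i - 1
        chart.append(row)
    return chart


if __name__ == "__main__":
    for i, row in enumerate(pipeline_chart(5)):
        label = "i" if i == 0 else f"i+{i}"
        print(label.ljust(5) + " ".join(stage.ljust(3) for stage in row))
```

Five instructions finish in 5 + 5 - 1 = 9 cycles, which matches the nine columns of the table.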
11
Rules for pipeline registers
Each stage must be independent, so inter-stage registers must hold
  Data values
  Control signals, including
    Decoded instruction fields
    MUX controls
    ALU controls
Think of the register file as two independent units
  Read file, accessed in ID
  Write file, accessed in WB
There is no “final” set of registers after WB (WB/IF), because the instruction is finished and all results are recorded in permanent machine state (register file, memory, and PC)
12
A More Accurate Pipeline Schematic
[Figure: the five-stage MIPS datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages]
13
Pipeline Dataflow: the details
Instruction classes: Reg-Reg ALU, Reg-immed ALU, Load, Store, Branch, Jump
IF (all classes)
  IR2 = IMEM[PC]; PC2 = PC = PC+4
ID (all classes)
  A3 = Regs[IR2[25..21]]; B3 = Regs[IR2[20..16]]; IR3 = IR2; PC3 = PC2
  IM3 = IR2[15]^16 ## IR2[15..0]
EX
  Reg-Reg ALU: ALU4 = A3 op B3; IR4 = IR3; PC4 = PC3
  Reg-immed ALU: ALU4 = A3 op IM3; IR4 = IR3; PC4 = PC3
  Load, Store: ALU4 = A3 + IM3; IR4 = IR3; PC4 = PC3; MD4 = B3
  Branch: ALU4 = PC3 + IM3; CO4 = A3 op 0; IR4 = IR3; PC4 = PC3
  Jump: ALU4 = PC3 + IM3; IR4 = IR3; PC4 = PC3
MEM
  Reg-Reg ALU: IR5 = IR4; PC5 = PC4
  Reg-immed ALU: IR5 = IR4; PC5 = PC4
  Load: WB5 = DMEM[ALU4]
  Store: DMEM[ALU4] = MD4
  Branch: IR5 = IR4; PC5 = PC4; if (CO4) PC = ALU4
  Jump: IR5 = IR4; PC5 = PC4; PC = ALU4
WB
  Reg-Reg ALU, Reg-immed ALU, Load: Din = WB
14
Problems with Pipelining (Dependencies and Hazards)
Dependencies: a property of the program
  Data dependencies
    Instruction j uses the result produced by instruction i
  Control dependencies
    The execution of instruction j depends upon the result of instruction i
15
Dependencies and Hazards
Hazard: a result of dependencies in the pipeline
Hazards lead to pipeline stalls or the execution of the wrong instruction
  Data hazard
    Instruction depends upon the result of an instruction still in the pipeline
  Structural hazard
    Two instructions try to use the same hardware resource in a single cycle
  Control hazard
    Caused by the delay in fetching an instruction and deciding about changes in instruction flow
16
Structural hazards
When two instructions need to use the same hardware resource in the same cycle
  Resource is not duplicated
    Register file write ports
  Resource is not fully pipelined, i.e. takes more than one cycle
    Division, floating point
Fix #1: Stall the later instruction
  Low cost, but increases average CPI
  Best used for rare events
  Examples:
    MIPS R2000 multi-cycle multiply
    SPARC V1 single memory port for instruction and data
Fix #2: Duplicate the resource
  Increases cost, but preserves CPI
  Best used for cheap resources and/or frequent events
17
Structural hazards, continued
Example resource duplication
  Separate instruction and data memory
  Separate ALU and PC adders
  Register files with multiple ports
Fix #3: Pipeline the expensive resource
  Moderate cost compared to duplication, expensive compared to stalling
  Best used for high performance or specialty machines
    Fully pipelined floating point units for scientific machines
How to avoid structural hazards altogether
  Design the ISA so that each resource needed by an instruction:
    Is used once
    Is always used in the same pipeline stage
    Takes one cycle
  MIPS is designed with pipelining in mind, x86 is not
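The cost of the stalling fix can be estimated with a one-line model: each conflicting instruction adds its stall cycles to the average CPI. The sketch below is illustrative only; the function name and the 30% load/store fraction are assumptions, standing in for a machine whose single memory port is shared by instruction fetch and data access.

```python
def cpi_with_structural_stalls(ideal_cpi, conflict_fraction, stall_cycles=1):
    """Average CPI when a fraction of instructions each stall for a resource conflict."""
    return ideal_cpi + conflict_fraction * stall_cycles


if __name__ == "__main__":
    # Hypothetical mix: 30% of instructions are loads or stores, and each one
    # blocks an instruction fetch for one cycle on the shared memory port.
    print(cpi_with_structural_stalls(1.0, 0.30))
```

Under these assumed numbers the CPI rises from 1.0 to 1.3, which is why stalling suits rare events and duplication suits frequent ones.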
18
Types of Data Hazards
RAW (Read After Write)
  Only hazard for “fixed” pipelines
  Later instruction must read after the earlier instruction writes
    F R A M W
      F R A M W
WAW (Write After Write)
  Variable length pipeline
  Later instruction must write after the earlier instruction writes
    F R 1 2 3 4 W
      F R A M W
WAR (Write After Read)
  Pipeline with late read
  Later instruction must write after the earlier instruction reads
    F R 1 2 3 4 R 5 W
      F R A M W
Data hazards can also occur through memory locations
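The three hazard types correspond to three ways the register sets of two instructions can overlap. The sketch below classifies the dependences between an earlier and a later instruction; the dictionary layout and the function name are assumptions made for illustration. Whether a dependence becomes an actual hazard still depends on the pipeline, as described above.

```python
def classify_dependences(earlier, later):
    """Return the set of dependence kinds from `earlier` to `later`.

    Each instruction is a dict with 'reads' and 'writes' sets of register names.
    """
    kinds = set()
    if earlier["writes"] & later["reads"]:
        kinds.add("RAW")  # later reads what earlier writes
    if earlier["writes"] & later["writes"]:
        kinds.add("WAW")  # both write the same register
    if earlier["reads"] & later["writes"]:
        kinds.add("WAR")  # later overwrites what earlier still has to read
    return kinds


if __name__ == "__main__":
    sub = {"reads": {"$1", "$3"}, "writes": {"$2"}}    # sub $2, $1, $3
    and_ = {"reads": {"$2", "$5"}, "writes": {"$12"}}  # and $12, $2, $5
    print(classify_dependences(sub, and_))
```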
19
Example RAW pipeline hazard
[Figure: pipeline diagram over clock cycles CC 1 to CC 9 (stages labeled IM, Reg, DM, Reg) for the sequence below, in program execution order]
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2 (CC 1 to CC 9): 10, 10, 10, 10, 10/–20, –20, –20, –20, –20
20
Stall for RAW hazards
Relatively cheap: just needs some extra compare and control logic
Detected in ID stage by comparing the registers to be read with the registers to be written for the instructions currently in the EX, MEM, or WB stages
Stall if a match is found
  Increases the average CPI
  Would happen much too frequently
ADD R1, R2, R3   F R X M W             (data written to R1 in W)
ADD R4, R1, R5     F Bubble R X M W    (R1 read in R)
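The detection rule above amounts to a set-membership test in ID. The following is a minimal sketch of that comparison, assuming a pipeline without forwarding; the function name and the argument layout are hypothetical.

```python
def must_stall(id_sources, in_flight_dests):
    """Decide whether the instruction in ID has to stall.

    id_sources: registers the instruction in ID wants to read.
    in_flight_dests: destination register of the instruction in EX, MEM, and WB,
    in that order, or None for a stage whose instruction writes no register.
    """
    pending = {dest for dest in in_flight_dests if dest is not None}
    return any(src in pending for src in id_sources)


if __name__ == "__main__":
    # ADD R4, R1, R5 is in ID while ADD R1, R2, R3 is in EX: R1 is not written yet.
    print(must_stall(["R1", "R5"], ["R1", None, None]))
    # No in-flight instruction writes R2 or R3, so no stall is needed.
    print(must_stall(["R2", "R3"], ["R1", None, None]))
```

One comparator is needed per source register and per in-flight destination, which is the extra compare logic the slide refers to.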
21
Stall type #1: Freeze the whole pipeline
Freeze all pipe stages for one or more cycles, and suppress writeback
  Needs only one global stall signal which suppresses all latching in all pipeline stages
  Sometimes called a “fixed pipe” or “frozen pipe” stall
  Works for cache misses
  Will not work to remove pipeline hazards
      1   2   3   4   5   6   7   8   9   10  11
I     IF  ID  EX  MEM WB  WB
I+1       IF  ID  EX  MEM MEM WB
I+2           IF  ID  EX  EX  MEM WB
I+3               IF  ID  ID  EX  MEM WB
I+4                   IF  IF  ID  EX  MEM WB
I+5                           IF  ID  EX  MEM WB
I+6                               IF  ID  EX  MEM
22
Stall type #2: Delay completion of an instruction
Instruction progress stops for one cycle
  Earlier instructions continue towards completion
  Later instructions must suspend and make no more progress
  An “elastic pipe” stall
  Good when the need for stalling is only detected after decode, as for pipeline hazards
      1     2     3     4     5     6     7     8     9     10    11
I     IF    ID    EX    MEM   WB
I+1         IF    ID    EX    MEM   WB
I+2               IF    ID    Stall EX    MEM   WB
I+3                     IF    Stall ID    EX    MEM   WB
I+4                           Stall IF    ID    EX    MEM   WB
I+5                                       IF    ID    EX    MEM   WB
I+6                                             IF    ID    EX    MEM
Bubble in: EX (cycle 5), MEM (cycle 6), WB (cycle 7)
23
Bypass (Forwarding)
If data is available elsewhere in the pipeline, there is no need to stall
  Detect the condition
  Bypass (or forward) data directly to the consuming pipeline stage
Bypass eliminates stalls for single-cycle operations
  Reduces the longest stall to N-1 cycles for N-cycle operations
24
Physical Forwarding Paths
[Figure: pipeline diagram over clock cycles CC 1 to CC 9 for the sequence below, in program execution order, showing the forwarding paths]
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2 (CC 1 to CC 9): 10, 10, 10, 10, 10/–20, –20, –20, –20, –20
Value of EX/MEM (CC 1 to CC 9): X, X, X, –20, X, X, X, X, X
Value of MEM/WB (CC 1 to CC 9): X, X, X, X, –20, X, X, X, X
• The third forwarding operation might not be necessary if the register file can be read after it is written in the same cycle
25
Example forwarding decisions
If EX has just finished an operation for which ID wants to read the value from either operand, we must forward
  If IR4.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS1_Reg_Num then ALUmuxA = SelectALU4
  If IR4.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS2_Reg_Num then ALUmuxB = SelectALU4
Need one comparison and multiplex control for each forwarding path
Be careful: if you can forward from more than one instruction, choose the closest in the pipeline
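The decisions above can be sketched as a priority check per ALU operand: the newest in-flight result is tested first, so the closest producer wins. The names below (`select_alu_input`, the `will_write` and `dest` keys, and the returned labels) are hypothetical and only loosely mirror the slide's IR4 and IR3 fields. The check for register 0 reflects the fact that MIPS register $0 always reads as zero.

```python
def select_alu_input(src_reg, ex_mem, mem_wb):
    """Pick the source for one ALU operand of the instruction in EX.

    ex_mem / mem_wb describe the instructions one and two stages ahead:
    {'will_write': bool, 'dest': register number}.
    """
    if src_reg == 0:
        return "REGFILE"  # $0 is hardwired to zero and is never forwarded
    if ex_mem["will_write"] and ex_mem["dest"] == src_reg:
        return "EX/MEM"   # closest producer takes priority
    if mem_wb["will_write"] and mem_wb["dest"] == src_reg:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    # or $4, $4, $2 in EX; and $4, ... one stage ahead; sub $2, ... two stages ahead.
    ex_mem = {"will_write": True, "dest": 4}
    mem_wb = {"will_write": True, "dest": 2}
    print(select_alu_input(4, ex_mem, mem_wb))
    print(select_alu_input(2, ex_mem, mem_wb))
```

Testing EX/MEM before MEM/WB is what implements the "choose the closest" rule when both in-flight instructions write the same register.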
26
Physical Forwarding Paths
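[Figure: pipelined datapath (PC, instruction memory, registers, ALU, data memory, control) with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and a forwarding unit. The forwarding unit takes EX/MEM.RegisterRd and MEM/WB.RegisterRd together with the Rs and Rt fields carried from IF/ID (IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd) and controls the muxes on the ALU inputs]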
27
Forwarding Animation (1)
[Figure: pipelined datapath with forwarding unit, shown at two clock cycles]
Clock 3: IF: or $4, $4, $2; ID: and $4, $2, $5; EX: sub $2, $1, $3; MEM: before<1>; WB: before<2>
Clock 4: IF: add $9, $4, $2; ID: or $4, $4, $2; EX: and $4, $2, $5; MEM: sub $2, . . .; WB: before<1>
28
Forwarding Animation (2)
29
Forwarding Animation (3)
[Figure: pipelined datapath with forwarding unit, shown at two clock cycles]
Clock 5: IF: after<1>; ID: add $9, $4, $2; EX: or $4, $4, $2; MEM: and $4, . . .; WB: sub $2, . . .
Clock 6: IF: after<2>; ID: after<1>; EX: add $9, $4, $2; MEM: or $4, . . .; WB: and $4, . . .
30
Forwarding Animation (4)
31
Other Data Hazards
WAR (Write After Read)
  Can happen if the instruction pipeline has early writes and/or late reads; something like:
    DIV (R1),        Suppose that it does not read the indirect destination until after the divide
    ADD ..,(R1)+     Incremented value of R1 is written before DIV has read the value of R1
  Cannot happen in DLX because all reads are early (ID) and all writes are late (WB)
WAW (Write After Write)
  Can happen when a fast operation follows a slow one; like
    LW R1,0(R2)      IF ID EX MEM WB
    ADD R1, R2, R3      IF ID EX WB
  Cannot happen in DLX (integer) because there is only one WB stage and instructions use it in order
32
One data hazard left
Loaded data is not available until the end of MEM, which is too late for the EX stage of the following instruction
Forwarding cannot help, so we must stall, or just “decree” that you cannot write code like this. Such a decree is called a “delayed load” and was used in the original MIPS R2000
[Figure: pipeline diagram over clock cycles CC 1 to CC 9 (stages labeled IM, Reg, DM, Reg) for the sequence below, in program execution order]
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
33
Stalling to interlock
[Figure: pipeline diagram over clock cycles CC 1 to CC 10 for the sequence below, in program execution order; a bubble is inserted so that the instructions after the lw are delayed by one cycle]
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
34
The software fix: instruction scheduling to avoid stalls
Since the hardware cannot avoid a stall when a load is immediately followed by a use of the loaded value, avoid the stall by rearranging the code (“pipeline scheduling”), if possible
Replace
  sub r4, r5, r7
  lw r1, 50(r2)
  add r3, r1, r4
With
  lw r1, 50(r2)
  sub r4, r5, r7
  add r3, r1, r4
This can improve the performance of a simple RISC machine
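The benefit of the reordering can be checked by counting load-use pairs. The sketch below assumes a five-stage pipeline with full forwarding, so the only remaining stall is one cycle when an instruction immediately follows the load that produces one of its sources. The tuple layout and the function name are assumptions for illustration.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls caused by a load immediately followed by a consumer.

    program: list of (opcode, dest_register, source_registers) tuples.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "lw" and prev_dest in curr_sources:
            stalls += 1
    return stalls


# The two orderings from the slide.
original = [
    ("sub", "r4", ("r5", "r7")),
    ("lw", "r1", ("r2",)),
    ("add", "r3", ("r1", "r4")),
]
scheduled = [
    ("lw", "r1", ("r2",)),
    ("sub", "r4", ("r5", "r7")),
    ("add", "r3", ("r1", "r4")),
]

if __name__ == "__main__":
    print(count_load_use_stalls(original), count_load_use_stalls(scheduled))
```

Moving the independent `sub` between the `lw` and the `add` fills the load delay, so the scheduled version has no stall while the original has one.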
35
The software fix: instruction scheduling to avoid stalls
But it is limited
  Usually limited to basic blocks between branches, 5-7 instructions
  Difficult to reorder accesses to variables referenced indirectly (pointer, array, or parameter) due to the risk of aliases
36
Branches and jumps
Control point: the stage where the target and condition are known
  MEM control point
  Branch penalty: number of pipeline stages to the control point
Instruction 1     2     3     4     5     6     7     8     9
Branch I    IF    ID    EX    MEM   WB
I+1               IF    Stall Stall IF    ID    EX    MEM   WB
I+2                                       IF    ID    EX    MEM
I+3                                             IF    ID    EX
This 3-cycle penalty works, but since branches occur every 5-7 instructions, it kills performance. What to do:
  Determine the branch condition earlier than EX
  Compute the target address earlier than MEM
37
Characteristics of MIPS branches and jumps
The branch condition
  Only has EQ/NE comparison to zero
  Fast and cheap, no need for a full ALU
  Use a 32-bit NOR gate instead
The branch target
  Always PC-relative
  Needs only a 16-bit adder (and carry propagation)
The jump target
  Always PC-relative
  Target = {PC[31:28], offset, 00}
All can be moved to the ID stage, at the cost of additional hardware (and maybe increased cycle time)
Still requires one stall
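Both target calculations are simple enough to write out as bit operations. The sketch below uses hypothetical function names and follows the MIPS convention that targets are formed from the address of the instruction after the branch (PC + 4); that convention is an assumption layered on top of the slide's `{PC[31:28], offset, 00}` notation.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """PC-relative target: (PC + 4) plus the word offset shifted left by 2."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF


def jump_target(pc, offset26):
    """Jump target: upper 4 bits of PC + 4, then the 26-bit offset, then 00."""
    return ((pc + 4) & 0xF0000000) | ((offset26 & 0x03FFFFFF) << 2)


if __name__ == "__main__":
    print(hex(branch_target(0x00400000, 0x0003)))  # forward by 3 words
    print(hex(branch_target(0x00400000, 0xFFFF)))  # backward by 1 word
    print(hex(jump_target(0x00400000, 0x0100008)))
```

Neither calculation needs the main ALU, which is what allows both to move into the ID stage.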
38
Pipelining and Branch ISA Design
Simple branches
  Make an ID control point possible
  May increase cycle time
  1-cycle branch penalty
Complex branches
  Require an EX control point
  May allow a lower cycle time
  2-cycle branch penalty
39
Reducing branch penalties (1)
Predict that the branch will not be taken
  Continue fetching from sequential addresses; cancel later if the branch was taken
  Easy to do
    If the branch is not taken, continue
    If it is taken, change the following instruction into a NOP and thus take a 1-cycle penalty
  Helps a little, but bets the wrong way for loops
40
Reducing branch penalties (2)
Predict that the branch will be taken
  Only useful if the target address is known before the branch condition (not true for MIPS)
  Cancel later if the branch was not taken
  Always has some delay in fetching the branch target
41
Reducing branch penalties (3)
Change the ISA: delay the effect of the branch
  Always execute the instruction(s) after the branch or jump
  Depends on the compiler to find something useful to do in the branch delay slot(s)
  An ugly dependence of the ISA on the implementation, which may change
  Interacts with branch prediction and interrupts
42
Filling the branch delay slot
a. From before
  add $s1, $s2, $s3
  if $s2 = 0 then
    Delay slot
Becomes
  if $s2 = 0 then
    add $s1, $s2, $s3
b. From target
  sub $t4, $t5, $t6    (branch target)
  …
  add $s1, $s2, $s3
  if $s1 = 0 then
    Delay slot
Becomes
  add $s1, $s2, $s3
  if $s1 = 0 then
    sub $t4, $t5, $t6
c. From fall through
  add $s1, $s2, $s3
  if $s1 = 0 then
    Delay slot
  sub $t4, $t5, $t6
Becomes
  add $s1, $s2, $s3
  if $s1 = 0 then
    sub $t4, $t5, $t6
43
How useful are canceling branches
[Figure: bar chart. X axis: Benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor). Y axis: Percentage of conditional branches, 0% to 50%. Series: Canceled delay slot, Empty slot]
Integer: 35% of slots wasted
Floating point: 25% of slots wasted
44
Performance of Branch schemes?
Effective CPI = 1 + %branches × average branch penalty
For integer MIPS: 20% of instructions are branches or jumps. 70% of them go to the target
Strategy            Branch taken penalty   Branch not taken penalty   Effective CPI
Stall               3                      3                          1.60
Branch in ID        1                      1                          1.20
Predict taken       1                      1                          1.20
Predict not taken   1                      0                          1.14
Delay slot          0.5                    0.5                        1.10
Cancel branch       0.3                    0.3                        1.06
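The table entries follow directly from the formula once the average penalty is weighted by the taken fraction. The sketch below reproduces them from the slide's figures (20% branches, 70% taken); the function and dictionary names are hypothetical.

```python
def effective_cpi(branch_fraction, taken_fraction, taken_penalty, not_taken_penalty):
    """Effective CPI = 1 + branch fraction x average branch penalty."""
    average_penalty = (taken_fraction * taken_penalty
                       + (1.0 - taken_fraction) * not_taken_penalty)
    return 1.0 + branch_fraction * average_penalty


# (taken penalty, not-taken penalty) per strategy, copied from the table.
SCHEMES = {
    "Stall": (3, 3),
    "Branch in ID": (1, 1),
    "Predict taken": (1, 1),
    "Predict not taken": (1, 0),
    "Delay slot": (0.5, 0.5),
    "Cancel branch": (0.3, 0.3),
}

if __name__ == "__main__":
    for name, (taken, not_taken) in SCHEMES.items():
        print(f"{name:18s} {effective_cpi(0.20, 0.70, taken, not_taken):.2f}")
```

Only the predict-not-taken row depends on the 70% taken fraction; every other strategy has equal penalties in both directions.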
45
Pipeline example
Consider the following pipeline, which implements a MIPS-like ISA. The only variation on the MIPS ISA is the support of full register compares in branch instructions
Instruction  1         2         3         4         5         6         7         8         9         10        11
I            IF        ID        EX1       EX2/MEM1  MEM2      WB
I+1                    IF        ID        EX1       EX2/MEM1  MEM2      WB
I+2                              IF        ID        EX1       EX2/MEM1  MEM2      WB
I+3                                        IF        ID        EX1       EX2/MEM1  MEM2      WB
I+4                                                  IF        ID        EX1       EX2/MEM1  MEM2      WB
I+5                                                            IF        ID        EX1       EX2/MEM1  MEM2      WB
46
The Pipeline stages
Stage      Function
IF         Instruction fetch
ID         Instruction decode; register fetch
EX1        Address generation (data and PC-target)
EX2/MEM1   ALU operation; branch condition resolution; first cycle of memory access
MEM2       Second cycle of memory access
WB         Register file writeback
47
Assumptions
Writes to the register file occur in the first half of the clock cycle while reads from the register file occur in the second half
All bypass paths have been implemented to minimize pipeline stalls due to data hazards
The pipeline implements hardware interlocks
48
Questions
How many register file ports does the processor need to minimize structural hazards?
Indicate all forwarding required to minimize stalls in the given pipeline. Also, specify the minimum number of comparators needed to implement forwarding.
What is the worst case delay due to RAW data hazards?
What is the branch delay of this pipeline?
49
Instruction Dependencies
The frequencies in the table are presented as percentages of all instructions executed
Type  Instruction sequence (first ; second)                      Frequency
1     ALUop Rx,-,- ; ALUop -,-,Rx or ALUop -,Rx,-                10%
2     ALUop Rx,-,- ; Store Rx,-(-)                               5%
3     ALUop Rx,-,- ; Load -,-(Rx) or Store -,-(Rx)               5%
4     ALUop Rx,-,- ; JumpRegister Rx                             1%
5     ALUop Rx,-,- ; Branch Rx,-,# or Branch -,Rx,#              2%
6     Load Rx,-(-) ; ALUop -,-,Rx or ALUop -,Rx,-                15%
7     Load Rx,-(-) ; Load -,-(Rx) or Store -,-(Rx)               3%
8     Load Rx,-(-) ; Branch Rx,-,# or Branch -,Rx,#              2%
9     Load Rx,-(-) ; JumpRegister Rx                             1%
50
More Questions
List the instruction sequences from the previous table that cause data stalls in the pipeline. Indicate the corresponding number of stall cycles.
Compute the CPI for the pipeline due to data hazards only. Ignore instruction sequences that are not listed in the table.
If the frequency of conditional branches is 10%, of which 65% are taken, and the frequency of unconditional branches is 6%, compute the overall CPI assuming a TAKEN branch prediction scheme.
51
Summary
Pipelining: overlaps execution of instructions
  Improves instruction throughput, and hence the total execution time of a long program
Problem: structural, data, and control hazards
  Hazards occur if there are dependences and the pipeline exposes them
  Common solutions: stall, forwarding, scheduling
Performance
  $\mathrm{CPI}_{real} = \mathrm{CPI}_{ideal} + \mathrm{Stalls}_{structural} + \mathrm{Stalls}_{data} + \mathrm{Stalls}_{control}$
  $\text{Cycle time}_{real} = \mathrm{Time}_{\text{longest pipestage}} + \text{Register overhead}$
What makes pipelining easier
  Simple instructions (load-stores, branches)
  Fixed length, encoding with few formats