1
Recap (Pipelining)
2
What is Pipelining?
• A way of speeding up execution of tasks
• Key idea: overlap execution of multiple tasks
3
Automobile Manufacturing
1. Build frame. 60 min.
2. Add engine. 50 min.
3. Build body. 80 min.
4. Paint. 40 min.
5. Finish. 45 min.
Total: 275 min.
Latency: Time from start to finish for one car. (smaller is better)
275 minutes per car.
Throughput: Number of finished cars per time unit. (larger is better)
1 car/275 min = 0.218 cars/hour
Issues: How can we make the process better by adding?
4
An Assembly line
[Figure: cars 1 to 4 moving through the five stages over time. Stage times: 60, 50, 80, 40, 45 min. The first two and last two stages are paced at 80 min.]
First two stages can't produce faster than one car/80 min or a backlog will occur at third stage.
Last two stages only receive one car/80 min to work on.
Latency: 400 min/car
Throughput: 4 cars/640 min (1 car/160 min)
Will approach 1 car/80 min as time goes on
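The assembly-line numbers can be reproduced with a short sketch. It assumes every stage advances in lockstep at the pace of the slowest stage (80 min); the function names are illustrative and not part of the slides.

```python
# Stage times from the slide, in minutes.
STAGE_MIN = [60, 50, 80, 40, 45]

def finish_times(num_cars, stage_min=STAGE_MIN):
    """Finish time of each car when all stages advance in lockstep."""
    beat = max(stage_min)            # slowest stage sets the pace: 80 min
    latency = beat * len(stage_min)  # 5 stages * 80 min = 400 min per car
    # Car k (0-based) enters the line k beats after the first car.
    return [latency + k * beat for k in range(num_cars)]

def minutes_per_car(num_cars, stage_min=STAGE_MIN):
    """Average time per finished car, measured from time 0."""
    return finish_times(num_cars, stage_min)[-1] / num_cars

print(finish_times(4))        # [400, 480, 560, 640]
print(minutes_per_car(4))     # 160.0
print(minutes_per_car(1000))  # 80.32
```

With more cars the 400 min fill time is amortized, so the average moves toward one car per 80 min.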
5
Pipelining a Digital System
• Key idea: break big computation up into pieces
Separate each piece with a pipeline register
[Figure: a 1ns block of logic split into five 200ps stages, each separated by a pipeline register.]
6
Pipelining a Digital System
• Why do this? Because it's faster for repeated computations
Non-pipelined (one 1ns block): 1 operation finishes every 1ns
Pipelined (five 200ps stages): 1 operation finishes every 200ps
7
Comments about pipelining
• Pipelining improves throughput, but does not reduce latency
– Answer available every 200ps, BUT
– A single computation still takes 1ns
• Limitations:
– Computations must be divisible into stages of equal sizes
– Pipeline registers add overhead
8
Another Example
Unpipelined System
[Figure: Comb. Logic (30ns) followed by REG (3ns), driven by Clock; timeline showing Op1, Op2, Op3 running one after another.]
Delay = 33ns
Throughput = 30MHz
– One operation must complete before next can begin
– Operations spaced 33ns apart
9
3 Stage Pipelining
– Space operations 13ns apart
– 3 operations occur simultaneously
[Figure: three stages, each Comb. Logic (10ns) followed by REG (3ns), driven by Clock; timeline showing Op1 to Op4 overlapping.]
Delay = 39ns
Throughput = 77MHz
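To make the spacing of operations concrete, here is a minimal sketch that lists when each operation starts and finishes. It assumes the 30ns of logic splits evenly across the stages and that each stage ends in a 3ns register; `schedule` and `throughput_mhz` are hypothetical helper names.

```python
def schedule(num_ops, logic_ns, reg_ns, stages):
    """Return (start, finish) times in ns for each operation."""
    period = logic_ns / stages + reg_ns   # clock period of one stage
    delay = period * stages               # latency of a single operation
    # A new operation enters the pipeline every clock period.
    return [(k * period, k * period + delay) for k in range(num_ops)]

def throughput_mhz(logic_ns, reg_ns, stages):
    """Operations per microsecond; a period in ns gives 1000/period MHz."""
    return 1000.0 / (logic_ns / stages + reg_ns)

print(schedule(3, 30, 3, 1))  # [(0.0, 33.0), (33.0, 66.0), (66.0, 99.0)]
print(schedule(3, 30, 3, 3))  # [(0.0, 39.0), (13.0, 52.0), (26.0, 65.0)]
print(round(throughput_mhz(30, 3, 1), 1))  # 30.3
print(round(throughput_mhz(30, 3, 3), 1))  # 76.9
```

Each operation takes longer in the pipelined case (39ns instead of 33ns), but a new one starts every 13ns.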
10
Limitation: Nonuniform Pipelining
[Figure: three stages driven by Clock: Comb. Logic 5ns + REG 3ns, Comb. Logic 15ns + REG 3ns, Comb. Logic 10ns + REG 3ns.]
Delay = 18 * 3 = 54ns
Throughput = 1/18ns ≈ 55.6MHz
• Throughput limited by slowest stage
• Delay determined by clock period * number of stages
• Must attempt to balance stages
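The effect of unbalanced stages can be checked with a small sketch. It assumes a single clock whose period is set by the slowest stage plus the register delay; `clocked_pipeline` is an illustrative name.

```python
def clocked_pipeline(stage_logic_ns, reg_ns):
    """Return (clock period ns, total delay ns, throughput MHz)."""
    period = max(stage_logic_ns) + reg_ns  # slowest stage sets the clock
    delay = period * len(stage_logic_ns)   # every stage takes a full period
    return period, delay, 1000.0 / period

# Unbalanced split of 30ns of logic, as on the slide.
period, delay, mhz = clocked_pipeline([5, 15, 10], 3)
print(period, delay, round(mhz, 1))   # 18 54 55.6

# Balanced split of the same logic.
period, delay, mhz = clocked_pipeline([10, 10, 10], 3)
print(period, delay, round(mhz, 1))   # 13 39 76.9
```

Both splits contain the same 30ns of logic; only the balance between stages differs.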
11
Limitation: Deep Pipelines
• Diminishing returns as we add more pipeline stages
• Register delays become limiting factor
– Increased latency
– Small throughput gains
– More hazards
[Figure: six stages driven by Clock, each Comb. Logic (5ns) followed by REG (3ns).]
Delay = 48ns, Throughput = 1/8ns = 125MHz
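A quick sweep over pipeline depth illustrates the diminishing returns. This is a sketch under the assumptions used in the examples above: 30ns of total logic that divides evenly, and a 3ns register per stage. The depths 12 and 30 are hypothetical and do not appear on the slides.

```python
LOGIC_NS = 30.0  # total combinational logic, from the earlier example
REG_NS = 3.0     # register delay added to every stage

def timing(stages):
    """Return (delay ns, throughput MHz) for an evenly split pipeline."""
    period = LOGIC_NS / stages + REG_NS
    return period * stages, 1000.0 / period

for stages in (1, 3, 6, 12, 30):
    delay, mhz = timing(stages)
    print(f"{stages:2d} stages: delay {delay:5.1f} ns, throughput {mhz:6.1f} MHz")
```

Delay keeps growing by 3ns per added stage, while the clock period can never drop below the 3ns register delay, so throughput stays under 1/3ns, roughly 333MHz.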
12
MIPS Pipelining
13
MIPS 5-stage pipeline
• The MIPS processor needs 5 stages to execute instructions
• Pipelining stages:
– IF - Instruction Fetch
– ID - Instruction Decode
– EX - Execute / Address Calculation
– MEM - Memory Access (read / write)
– WB - Write Back (results into register file)
• Not all instructions need all the stages (e.g., add instruction does not need the MEM stage)
14
Basic MIPS Pipelined Processor
[Figure: pipelined MIPS datapath. PC, Instruction Memory, Register File, sign extend (EXTND), ALU, and Data Memory, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]
15
Pipelined Example - Executing Multiple Instructions
• Consider the following instruction sequence:
lw $r0, 10($r1)
sw $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10
16
Executing Multiple Instructions
Clock Cycle 1
[Figure: pipelined MIPS datapath. Instructions in the pipeline: LW.]
17
Executing Multiple Instructions
Clock Cycle 2
[Figure: pipelined MIPS datapath. Instructions in the pipeline: LW, SW.]
18
Executing Multiple Instructions
Clock Cycle 3
[Figure: pipelined MIPS datapath. Instructions in the pipeline: LW, SW, ADD.]
19
Executing Multiple Instructions
Clock Cycle 4
[Figure: pipelined MIPS datapath. Instructions in the pipeline: LW, SW, ADD, SUB.]
20
Executing Multiple Instructions
Clock Cycle 5
[Figure: pipelined MIPS datapath. Instructions in the pipeline: LW, SW, ADD, SUB.]
21
Executing Multiple Instructions
Clock Cycle 6
[Figure: pipelined MIPS datapath. Instructions in the pipeline: SW, ADD, SUB.]
22
Executing Multiple Instructions
Clock Cycle 7
[Figure: pipelined MIPS datapath. Instructions in the pipeline: ADD, SUB.]
23
Executing Multiple Instructions
Clock Cycle 8
[Figure: pipelined MIPS datapath. Instructions in the pipeline: SUB.]
24
Alternative View - Multicycle Diagram
                     CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw  $r0, 10($r1)     IM    REG   ALU   DM    REG
sw  $r3, 20($r4)           IM    REG   ALU   DM    REG
add $r5, $r6, $r7                IM    REG   ALU   DM    REG
sub $r8, $r9, $r10                     IM    REG   ALU   DM    REG
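The cycle-by-cycle walk-through can also be generated programmatically. The sketch below assumes an ideal pipeline with no stalls, where one instruction enters IF per cycle; `occupancy` is a hypothetical helper, and the stage names follow the IF/ID/EX/MEM/WB list rather than the unit names in the diagram.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
PROGRAM = ["lw", "sw", "add", "sub"]

def occupancy(program, stages=STAGES):
    """Map each clock cycle (1-based) to {stage: instruction}."""
    total_cycles = len(program) + len(stages) - 1
    table = {}
    for cycle in range(1, total_cycles + 1):
        row = {}
        for s, stage in enumerate(stages):
            i = cycle - 1 - s  # instruction i enters IF in cycle i + 1
            if 0 <= i < len(program):
                row[stage] = program[i]
        table[cycle] = row
    return table

for cycle, row in occupancy(PROGRAM).items():
    print(f"CC {cycle}: {row}")
```

Four instructions through five stages take 4 + 5 - 1 = 8 cycles, matching the eight clock cycles shown.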
25
Processor Pipelining
• There are two ways that pipelining can help:
1. Reduce the clock cycle time, and keep the same CPI
2. Reduce the CPI, and keep the same clock cycle time
CPU time = Instruction count * CPI * Clock cycle time
26
Reduce the clock cycle time, and keep the same CPI
[Figure: MIPS datapath without pipeline registers.]
CPI = 1
Clock = X Hz
27
Reduce the clock cycle time, and keep the same CPI
[Figure: MIPS datapath with pipeline registers.]
CPI = 1
Clock = X*5 Hz
28
Reduce the CPI, and keep the same cycle time
[Figure: MIPS datapath without pipeline registers.]
CPI = 5
Clock = X*5 Hz
29
Reduce the CPI, and keep the same cycle time
[Figure: MIPS datapath with pipeline registers.]
CPI = 1
Clock = X*5 Hz
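Plugging the CPI and clock values from these four slides into CPU time = Instruction count * CPI * Clock cycle time shows both views arriving at the same result. The base clock `X` and instruction count `N` below are hypothetical values chosen only for illustration.

```python
def cpu_time(instruction_count, cpi, clock_hz):
    """CPU time in seconds; clock cycle time is 1 / clock_hz."""
    return instruction_count * cpi / clock_hz

X = 100e6       # hypothetical base clock in Hz
N = 1_000_000   # hypothetical instruction count

no_regs_slow_clock = cpu_time(N, 1, X)      # CPI = 1, clock = X Hz
no_regs_fast_clock = cpu_time(N, 5, 5 * X)  # CPI = 5, clock = X*5 Hz
pipelined = cpu_time(N, 1, 5 * X)           # CPI = 1, clock = X*5 Hz

print(round(no_regs_slow_clock / pipelined, 6))  # 5.0
print(round(no_regs_fast_clock / pipelined, 6))  # 5.0
```

Either starting point ends at the same pipelined design, with an ideal speedup equal to the number of stages.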
30
Pipeline performance
• Ideally we get a speedup (by reducing clock cycle or reducing the CPI) equal to the number of stages.
• In practice, we do not achieve that – but we get close:
– Pipelining has additional overhead (e.g., pipeline registers)
– Pipeline hazards
31
Pipeline Hazards
• Hazards are situations in pipelining which prevent the next instruction in the instruction stream from executing during the designated clock cycle.
• Hazards reduce the ideal speedup gained from pipelining (e.g., CPI = 1) and are classified into three classes:
– Structural hazards
– Data hazards
– Control hazards
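One common way to estimate the cost of hazards is to add the average stall cycles per instruction to the ideal CPI of 1. The sketch below uses that simplified model, which is an assumption not stated on the slides; it ignores pipeline register overhead, and the stall rate of 0.25 is made up for illustration.

```python
def pipeline_speedup(depth, stalls_per_instr):
    """Speedup over an unpipelined design under a simple stall model."""
    cpi = 1.0 + stalls_per_instr  # ideal CPI of 1 plus average stall cycles
    return depth / cpi

print(pipeline_speedup(5, 0.0))   # 5.0
print(pipeline_speedup(5, 0.25))  # 4.0
```

Under this model, an average of one stall every four instructions reduces a 5-stage pipeline's speedup from 5 to 4.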