Jan 03, 2016
Automobile Manufacturing
1. Build frame. 60 min.
2. Add engine. 50 min.
3. Build body. 80 min.
4. Paint. 40 min.
5. Finish. 45 min.
Total: 275 min.
Latency: Time from start to finish for one car. (smaller is better)
275 minutes per car.
Throughput: Number of finished cars per time unit. (larger is better)
1 car/275 min = 0.218 cars/hour
Issues: How can we make the process better by adding more workers?
6.1
An Assembly line
6.1
[Figure: timing diagram of cars 1-4 moving through the five stages over time. Stage times: 60, 50, 80, 40, 45 min; stages are paced at 80 min.]
First two stages can't produce faster than one car/80 min or a backlog will occur at the third stage.
Last two stages only receive one car/80 min to work on.
Latency: 400 min/car
Throughput: 4 cars/640 min (1 car/160 min)
Will approach 1 car/80 min as time goes on
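The numbers on this slide can be reproduced with a short calculation. This is a minimal sketch, assuming every stage advances in lockstep at the pace of the slowest stage and cars enter back to back; the function name `assembly_line` is made up for illustration.

```python
def assembly_line(stage_minutes, cars):
    """Return (latency, total_time) in minutes for a paced assembly line.

    Assumption: all stages advance together, one beat at a time, and the
    beat is the time of the slowest stage.
    """
    beat = max(stage_minutes)                 # slowest stage sets the pace
    latency = beat * len(stage_minutes)       # one car, start to finish
    total_time = latency + (cars - 1) * beat  # then one more car per beat
    return latency, total_time


stages = [60, 50, 80, 40, 45]
print(sum(stages))               # time per car without the line: 275
print(assembly_line(stages, 4))  # (400, 640)
```

As `cars` grows, `total_time / cars` gets closer and closer to the 80-minute beat, which is the "1 car/80 min" limit noted above.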
Applying Assembly Lines to CPUs
• The single-cycle design did everything “at once”
• Can we break the single-cycle design up into stages?
6.1
• Issues:
• Car assembly works well. Will it be as easy to apply the same technique to a CPU?
[Figure: single-cycle datapath: PC, Instruction Memory, Registers, sign extend (16 to 32), Sh. Left 2, two adders (PC+4 and branch target), ALU (Result, Zero), Data Memory. Instruction fields: Rs:[25-21], Rt:[20-16], Rd:[15-11], Imm:[15-0].]
Stages: Instr. Fetch, PC=PC+4 | Instr. Decode, Register Fetch | Execute, Address Calc. | Memory | Reg. Write-back
Breaking up the Single-Cycle Datapath
6.2
Stages from multi-cycle design
[Figure: the single-cycle datapath again, divided into the same five stages: Instr. Fetch, PC=PC+4 | Instr. Decode, Register Fetch | Execute, Address Calc. | Memory | Reg. Write-back.]
The Key - Pipeline Registers
6.2
[Figure: pipelined datapath diagram; labels include clock and PC+4.]
Example: R-type Instruction
6.2
Writes the correct data to the wrong register
In general, arrows that go backwards across pipeline stages may be bad news...
[Figure: pipelined datapath diagram; labels include PC+4 and the instruction fields Rs:[25-21], Rt:[20-16], Rd:[15-11], Imm:[15-0].]
Correcting the Write Register Problem
6.2
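[Figure: corrected pipelined datapath; labels include PC+4 and the 0/1 mux choosing between Rt:[20-16] and Rd:[15-11] for the write register number.]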
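The write register problem can be illustrated with a toy model. This is only a sketch, assuming a 5-stage pipeline with no stalls, so that when instruction i is in write-back, instruction i + 3 is the one being decoded; the function name `write_targets` and the register numbers are made up for illustration.

```python
def write_targets(dests, carry_dest_in_pipeline):
    """For each instruction, return the register its result lands in.

    dests[i] is the destination register of instruction i.
    If the register number is NOT carried along with the instruction,
    write-back uses whatever number the decode stage currently shows.
    """
    targets = []
    for i, dest in enumerate(dests):
        if carry_dest_in_pipeline:
            targets.append(dest)       # number travelled with the data
        else:
            j = i + 3                  # instruction in decode right now
            targets.append(dests[j] if j < len(dests) else None)
    return targets


dests = [3, 6, 9, 12, 15]
print(write_targets(dests, carry_dest_in_pipeline=False))  # [12, 15, None, None, None]
print(write_targets(dests, carry_dest_in_pipeline=True))   # [3, 6, 9, 12, 15]
```

`None` stands for "whatever instruction happens to follow the listed ones". Carrying the write register number forward with the instruction is what makes the second result match `dests`.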
Assembly-line Control Signals
In an assembly line, the manufacturing instructions can be attached to the car. The instructions then move along with the car.
[Figure: five cars on the line, each carrying the instructions for its remaining stages:]
F: Standard | E: 135 HP | B: 2-door | P: Green | F: Leather
E: 190 HP | B: 4-door | P: Blue | F: Cotton
B: 2-door | P: Lavender | F: Leather
P: Green | F: Vinyl
F: Leather
By separating the control signals by stages, only the signals needed for the current stage must be decoded.
All signals for later stages must be passed along.
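The idea of peeling off one group of signals per stage can be sketched as follows. The signal names follow the ones that appear on the pipelined control slide; the grouping into EX, M, and WB and the specific values for an R-type instruction are assumptions made for illustration, as is the helper name `advance`.

```python
# Assumed control word for an R-type instruction, grouped by the stage
# that consumes each signal.
R_TYPE = {
    "EX": {"RegDest": 1, "ALUOp": 0b10, "ALUSrc": 0},
    "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
    "WB": {"RegWrite": 1, "MemToReg": 0},
}


def advance(control, stage):
    """Use this stage's signals and return (used, passed_along)."""
    used = control[stage]
    passed_along = {k: v for k, v in control.items() if k != stage}
    return used, passed_along


control = R_TYPE
for stage in ("EX", "M", "WB"):
    used, control = advance(control, stage)
    print(stage, "uses", sorted(used), "and passes on", sorted(control))
```

Each pipeline register only has to store the groups that are still needed, so the control word shrinks as the instruction moves down the pipeline.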
6.1
[Figure: pipelined datapath diagram: PC, Instruction Memory, Registers, sign extend (16 to 32), Sh. Left 2, adders, ALU (Result, Zero), Data Memory. Instruction fields: Rs:[25-21], Rt:[20-16], Imm:[15-0].]
The Pipelined Control Logic
6.3
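[Figure: pipelined datapath with control. Control decodes Op:[31-26]; its outputs travel through the pipeline registers in groups labeled E, M, and W. Signals shown: RegDest, ALUOp, ALUSrc, ALU control, Branch, PCSrc, MemRead, MemWrite, MemToReg, RegWrite. Other labels: PC+4, Rt:[20-16], Rd:[15-11].]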
How’d we do?
• Compared to Single-cycle
• 5 stages --> Potentially 5x speedup
• Not likely
• Stages won't all be equally long
• Pipeline registers will cause some delays
• Latency --> Greater than in single-cycle design
• More complexity, but nicely divided up
Example 1
• Consider executing the following code
add $3, $4, $5
and $6, $7, $8
sub $9, $10, $11
on
i) A single-cycle machine with a cycle time of 200 ns
ii) A 5-stage pipeline machine with a cycle time of 50 ns
Which one runs faster?
What if there were 100 instructions instead of 3?
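One way to work this example is sketched below. It assumes the pipeline has no stalls: the first instruction takes 5 cycles to come out, and each later instruction finishes one cycle after the previous one. The function names are made up for illustration.

```python
def single_cycle_ns(instructions, cycle_ns=200):
    """Every instruction takes one long cycle."""
    return instructions * cycle_ns


def pipelined_ns(instructions, stages=5, cycle_ns=50):
    """Fill the pipeline once, then finish one instruction per cycle.

    Assumption: no hazards or stalls.
    """
    return (stages + instructions - 1) * cycle_ns


for n in (3, 100):
    print(n, single_cycle_ns(n), pipelined_ns(n))
# 3 600 350
# 100 20000 5200
```

Under these assumptions the pipeline wins in both cases, and the advantage grows with the instruction count because the fill time is paid only once.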
Analyzing Pipelines
6.4
ADD $10, $14, $0
SUB $12, $13, $2
AND $1, $6, $11
SW $3, 200($9)
OR $9, $13, $7

Cycle 1   2   3   4   5   6   7   8   9
ADD   IF  RF  EX  M   WB
SUB       IF  RF  EX  M   WB
AND           IF  RF  EX  M   WB
SW                IF  RF  EX  M   WB
OR                    IF  RF  EX  M   WB
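A staircase diagram like this one can be generated mechanically. The sketch below assumes an ideal pipeline (no stalls), short instruction names, and the stage names used on these slides; `pipeline_diagram` is a made-up helper name.

```python
STAGES = ["IF", "RF", "EX", "M", "WB"]


def pipeline_diagram(names):
    """One text row per instruction; instruction i starts in cycle i + 1."""
    rows = []
    for i, name in enumerate(names):
        cells = [""] * i + STAGES      # i empty cycles, then the 5 stages
        row = f"{name:<4}" + "".join(f"{cell:<3}" for cell in cells)
        rows.append(row.rstrip())
    return "\n".join(rows)


print(pipeline_diagram(["ADD", "SUB", "AND", "SW", "OR"]))
```

Reading down any column shows which instruction occupies which stage in that cycle.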
Data Hazards
6.4
ADD $13, $14, $0     Writes register $13
SUB $12, $13, $2     Reads wrong $13
AND $1, $6, $13      Reads wrong $13
SW $3, 200($13)      Reads ? $13
OR $9, $13, $7       Reads correct $13

Cycle 1   2   3   4   5   6   7   8   9
ADD   IF  RF  EX  M   WB
SUB       IF  RF  EX  M   WB
AND           IF  RF  EX  M   WB
SW                IF  RF  EX  M   WB
OR                    IF  RF  EX  M   WB
Preventing Data Hazards
6.4
ADD $13, $14, $0
NOP
NOP
NOP
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7

Insert NOPs into the instruction stream to allow WB to happen before RF.
Assume we can't write a register and read the new value in the same cycle.

Cycle 1   2   3   4   5   6   7   8   9
ADD   IF  RF  EX  M   WB
SUB                   IF  RF  EX  M   WB
AND                       IF  RF  EX  M
SW                            IF  RF  EX
OR                                IF  RF
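NOP insertion of this kind could be automated along the following lines. This is a sketch, not a real assembler pass: the tuple format `(text, dest, sources)` and the name `insert_nops` are made up, and it relies on the slide's assumption that a register cannot be written and read in the same cycle, so three slots must separate a writer from a reader.

```python
NOP = ("NOP", None, ())


def insert_nops(program, gap=3):
    """Pad the program so no instruction reads a register written by
    any of the `gap` slots immediately before it."""
    out = []
    for instr in program:
        _, _, sources = instr
        needed = 0
        # back = 1 is the slot immediately before this instruction
        for back, prev in enumerate(reversed(out[-gap:]), start=1):
            if prev[1] is not None and prev[1] in sources:
                needed = max(needed, gap - back + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


program = [
    ("ADD", 13, (14, 0)),
    ("SUB", 12, (13, 2)),
    ("AND", 1, (6, 13)),
    ("SW", None, (3, 13)),
    ("OR", 9, (13, 7)),
]
print([text for text, _, _ in insert_nops(program)])
# ['ADD', 'NOP', 'NOP', 'NOP', 'SUB', 'AND', 'SW', 'OR']
```

Only the SUB needs padding: once it has been pushed three slots back, the later readers of $13 are far enough behind the ADD as well.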
Detecting Hazards
6.5
ADD $13, $14, $0     Write: $13
SUB $12, $13, $2     Read A: $13
AND $1, $6, $13      Read B: $13
SW $3, 200($13)      Read A: $13
OR $9, $13, $7

Check each instruction as it is being decoded (RF/ID stage). If it reads a register that will be written by any instruction ahead of it (in the EX, M, or WB stages), there is a hazard.
Compare write reg # in EX with read reg # in RF
Compare write reg # in M with read reg # in RF
Compare write reg # in WB with read reg # in RF

Cycle 1   2   3   4   5   6   7
ADD   IF  RF  EX  M   WB
SUB       IF  RF  EX  M   WB
AND           IF  RF  EX  M   WB
SW                IF  RF  EX  M
OR                    IF  RF  EX
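The three comparisons can be sketched in software as a check of each instruction against the three instructions ahead of it. This is a static view that assumes no stall has happened yet, so the three previous instructions are exactly the ones in EX, M, and WB; the tuple format and the name `hazards` are made up for illustration.

```python
def hazards(program):
    """Return the instructions that read a register still being
    produced by one of the three instructions ahead of them."""
    flagged = []
    for i, (text, dest, sources) in enumerate(program):
        ahead = program[max(0, i - 3):i]   # in EX, M, WB while i is in RF
        if any(d is not None and d in sources for _, d, _ in ahead):
            flagged.append(text)
    return flagged


program = [
    ("ADD", 13, (14, 0)),
    ("SUB", 12, (13, 2)),
    ("AND", 1, (6, 13)),
    ("SW", None, (3, 13)),
    ("OR", 9, (13, 7)),
]
print(hazards(program))  # ['SUB', 'AND', 'SW']
```

The OR is not flagged because the ADD has already left the pipeline by the time the OR is decoded.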
Stalling with Bubbles
6.5
ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7

Stalling:
• Kill the current execution by "neutralizing" all the control signals so that it won't write any registers.
• Don't write PC+4 into PC --> Stay at the current instruction and try again.

Cycle 1   2   3   4   5   6   7   8   9
ADD   IF  RF  EX  M   WB
SUB       IF                              (killed, try again)
SUB           IF                          (killed, try again)
SUB               IF                      (killed, try again)
SUB                   IF  RF  EX  M   WB
AND                       IF  RF  EX  M
SW                            IF  RF  EX
OR                                IF  RF
Register Forwarding
6.6
ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $2

Register $13's value is computed in the EX stage of the ADD even though it isn't written into the register until the WB stage.
--> The pipeline register following the EX stage holds the value of $13 that's needed in the SUB instruction's EX stage.

Cycle 1   2   3   4   5   6   7   8   9
ADD   IF  RF  EX  M   WB
SUB       IF  RF  EX  M   WB
AND           IF  RF  EX  M   WB
SW                IF  RF  EX  M   WB
OR                    IF  RF  EX  M   WB
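The operand selection that forwarding performs can be sketched as a small function. The slide only mentions the pipeline register after EX; checking the register after the M stage as a second source is an assumption added here so that an instruction two slots behind the producer is also covered. All names are made up for illustration.

```python
def forward_operand(reg, regfile_value, ex_mem, mem_wb):
    """Pick the newest available value for register `reg`.

    ex_mem / mem_wb are (dest_reg, value) pairs held in the pipeline
    registers after the EX and M stages, or None if that instruction
    writes no register.
    """
    if ex_mem is not None and ex_mem[0] == reg:
        return ex_mem[1]        # newest: result of the previous instruction
    if mem_wb is not None and mem_wb[0] == reg:
        return mem_wb[1]        # result of the instruction before that
    return regfile_value        # no pending write: register file is current


# SUB reads $13 while the ADD's result sits in the register after EX.
print(forward_operand(13, regfile_value=0, ex_mem=(13, 42), mem_wb=None))  # 42
```

The pipeline register after EX is checked first because it holds the most recent write when both stages target the same register.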
Unforwardable Loads
6.6
LW $2, 30($2)
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $1

Cycle 1   2   3   4   5   6   7   8
LW    IF  RF  EX  M   WB
AND       IF  RF  EX  M   WB
SW            IF  RF  EX  M   WB
OR                IF  RF  EX  M   WB

Loads don't produce the value to write back until the Memory stage. This is one stage too late for the next instruction.
--> We can't prevent stalls if the instruction following a Load uses the result of the Load.
Example 2
• Consider executing the following code on a 5-stage pipeline datapath
add $3, $4, $5
lw $7, 100($3)
sub $8, $7, $9
1. Identify any potential data dependencies
2. How many cycles will it take to execute this code assuming no register forwarding?
3. How many cycles will it take to execute this code assuming register forwarding is available?
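Questions 2 and 3 can be explored with a small timing model. This is a sketch under stated assumptions, not an answer key: without forwarding, a register cannot be written and read in the same cycle; with forwarding, an ALU result is usable in the next cycle's EX stage and a load result one cycle later than that. The tuple format and the name `total_cycles` are made up for illustration.

```python
def total_cycles(program, forwarding):
    """Cycles to finish `program` on a 5-stage in-order pipeline.

    program: list of (op, dest, sources).
    Tracks, per instruction, the cycle of its RF stage (no forwarding)
    or its EX stage (forwarding), whichever is where operands are needed.
    """
    writer = {}      # register -> (index of last writer, its op)
    key_cycle = []   # RF or EX cycle of each instruction
    for i, (op, dest, sources) in enumerate(program):
        first = 3 if forwarding else 2
        cycle = first if i == 0 else key_cycle[-1] + 1
        for reg in sources:
            if reg in writer:
                j, producer_op = writer[reg]
                if forwarding:
                    ready = key_cycle[j] + (2 if producer_op == "lw" else 1)
                else:
                    ready = key_cycle[j] + 4   # the cycle after the writer's WB
                cycle = max(cycle, ready)
        key_cycle.append(cycle)
        if dest is not None:
            writer[dest] = (i, op)
    # WB is 3 stages after RF and 2 stages after EX
    return key_cycle[-1] + (2 if forwarding else 3)


example2 = [
    ("add", 3, (4, 5)),
    ("lw", 7, (3,)),
    ("sub", 8, (7, 9)),
]
print(total_cycles(example2, forwarding=False))  # 13
print(total_cycles(example2, forwarding=True))   # 8
```

With no dependencies at all, three instructions would take 7 cycles, so under these assumptions the two dependencies cost 6 extra cycles without forwarding and 1 extra cycle (the load-use stall) with it.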
Branch Hazards
6.7
BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)

Cycle 1   2   3   4   5   6   7   8   9
BEQ   IF  RF  EX  M   WB
AND       IF  RF  EX  M   WB
SW            IF  RF  EX  M   WB
OR                IF  RF  EX  M   WB
LW                    IF  RF  EX  M   WB

Don't know result of branch until the end of the M stage.
If the branch is taken, we've blown it by executing the intervening instructions.
Solution 1: Stall
6.5
BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)

Branch not taken:
Cycle 1   2   3   4   5   6   7   8   9
BEQ   IF  RF  EX  M   WB
AND       IF                              (killed, try again)
AND           IF                          (killed, try again)
AND               IF                      (killed, try again)
AND                   IF  RF  EX  M   WB
SW                        IF  RF  EX  M
OR                            IF  RF  EX
ADD                               IF  RF

Stalling always solves the problem. If we didn't have so many branches in programs, it would not be a problem.
Solution 2: Assume not Taken
6.7
BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5
SKIP: LW $2, 32($4)

Branch is taken...
Cycle 1   2   3   4   5   6   7   8   9
BEQ   IF  RF  EX  M   WB
AND       IF  RF  EX  M   WB              (must be undone if branch is taken!)
SW            IF  RF  EX  M   WB          (must be undone if branch is taken!)
OR                IF  RF  EX  M   WB      (must be undone if branch is taken!)
LW                    IF  RF  EX  M   WB

If we guess right, we win --> No stall at all
If we guessed wrong,
1. We have to undo all that we did (fortunately, no writebacks have occurred yet).
2. We still take all the time of a stall
6.7
Solution 3: Better Prediction
• Predict that the branch goes the same way as the last time
• Works great for loops
• Works great for “special-case” code
• Need to keep track of the information for each branch, though...
• One or two bits will do
• Keep a small table of recently used branches and which way they went
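The "one or two bits" can be sketched as a small saturating counter kept per branch. This is a minimal model for a single branch; the starting state (predict not taken) and the name `count_correct` are assumptions for illustration.

```python
def count_correct(outcomes, bits=1):
    """Count correct predictions for one branch.

    outcomes: list of booleans, True = branch taken.
    bits=1 predicts "same as last time"; bits=2 needs two wrong
    guesses in a row before it changes its mind.
    """
    top = (1 << bits) - 1      # largest counter value
    counter = 0                # assumed start: predict not taken
    correct = 0
    for taken in outcomes:
        predicted_taken = counter > top // 2
        if predicted_taken == taken:
            correct += 1
        # move the counter toward the actual outcome, saturating at the ends
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return correct


loop_branch = [True] * 9 + [False]         # a loop that runs 10 times
print(count_correct(loop_branch, bits=1))  # 8
print(count_correct(loop_branch, bits=2))  # 7
```

If the same loop is entered a second time (`loop_branch * 2`), the two-bit counter guesses the first iteration of the second run correctly while the one-bit version does not, which is one reason to spend the extra bit.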
6.7
Solution 4: Delayed Branches
Original code:
XOR $1, $3, $3
ADD $2, $3, $4
SUB $4, $3, $1
OR $3, $2, $0
BEQ $10, $11, SKIP
LW $4, 60($2)
SKIP: AND $1, $2, $3

If we had some warning, we could compute the branch ahead of time...

With a delayed branch:
XOR $1, $3, $3
Branch-After-Three-EQ $10, $11, SKIP
ADD $2, $3, $4
SUB $4, $3, $1
OR $3, $2, $0
LW $4, 60($2)
SKIP: AND $1, $2, $3

3 delay slots (ADD, SUB, OR): These instructions are always executed. Branch can't depend on them...
3-slot Delayed Branch
6.7
Branch-After-Three-EQ $10, $11, SKIP
ADD $2, $3, $4
SUB $4, $3, $1
OR $3, $2, $0
LW $4, 60($2)
SKIP: AND $1, $2, $3

Cycle     1   2   3   4   5   6   7   8   9
B3E       IF  RF  EX  M   WB
ADD           IF  RF  EX  M   WB
SUB               IF  RF  EX  M   WB
OR                    IF  RF  EX  M   WB
LW or AND                 IF  RF  EX  M   WB
Branch summary
• Two decent solutions:
• Branch prediction
  • Requires more hardware
  • Used in modern microprocessors
• Delayed branch
  • Requires special software manipulation
  • Often doesn't deliver its promise
  • Used often in CPUs 4-10 years ago
Example 3
• Consider executing the following code
LOOP: add $3, $4, $5
      and $6, $7, $8
      bne $12, $8, LOOP
on
i) A single-cycle machine with a cycle time of 200 ns
ii) A 5-stage pipeline machine with a cycle time of 50 ns
A. Assume the loop executes 10 times
B. Assume the loop executes 100 times
C. Assume the loop executes 1000 times
Which one runs faster?
Example 4
• Consider executing the following code on a 5-stage pipeline datapath
           addi $3, $0, 10
LOOPSTART: lw $5, ARRAY($3)
           addi $5, $5, 1
           sw $5, ARRAY
           addi $3, $3, -1
           bne $3, $0, LOOPSTART
           add $3, $5, $6
           sub $7, $8, $9
           addi $4, $6, 3
1. Identify potential data dependencies
2. How many cycles will it take to execute this code?
A. With nops/stalls
B. With branch prediction assuming branch not taken
C. With branch prediction based on one previous result