Page 1
Lecture 4 Slide 1 EECS 470
EECS 470
Lecture 4
Pipelining & Hazards II Winter 2021
Jon Beaumont
http://www.eecs.umich.edu/courses/eecs470
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin.
Page 2
Class Question
Which of the following best explains why pipelining results
in speedup?
a) Instructions are executed with shorter latency
b) Clock period is reduced
c) More instructions are executed at the same time
d) Magnets
Page 3
Announcements
• Reminder
    Lab #1 due tomorrow by 12:30p
        Get checked off by GSI/IA
    Verilog assignment #1 due tomorrow
        Submit to autograder by 11:59p
    HW #1 due Thursday 2/4
        Submit through Gradescope by 11:59p
• I have OH today from 3-4
    OH format for all staff: join the Zoom link and put yourself on the Office Hour Queue
    You will be let into a breakout room when you are at the head of the queue
Page 4
Last Time
• Baseline processor discussion
    Review 5-stage pipeline from EECS 370
Page 5
Today
• Hazards
    Detection
    Resolution
        Software (avoidance)
        Hardware (stalling, forwarding)
Page 6
Lingering Questions
• "How recent was the pipeline method developed? What will be the next best method?"
    Basic pipelines have been used since the very early days of computing (1930s)
    Deep pipelines became very popular with vector processors in the 1970s
        Less popular now; we'll discuss why
    Recent trends have been not towards better performance, but better reliability and power efficiency
        EECS 573 (Microarchitectures) covers a lot of these interesting topics
• Remember, you can submit lingering questions to cover next lecture at: https://bit.ly/3oSr5FD
Page 7
Balancing Pipeline Stages
Stage latencies:
    IF:  T_IF  = 6 units
    ID:  T_ID  = 2 units
    EX:  T_EX  = 9 units
    MEM: T_MEM = 5 units
    WB:  T_WB  = 8 units
Can we do better in terms of either performance or efficiency?
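One way to frame the question is to work out what the unbalanced stage latencies cost. The sketch below uses the five latencies from this slide; it assumes zero pipeline-register overhead and no hazards, neither of which the slide states, so the numbers are an idealized upper bound.

```python
# Stage latencies from the slide, in arbitrary time units.
stage_latency = {"IF": 6, "ID": 2, "EX": 9, "MEM": 5, "WB": 8}

def clock_period(latencies):
    """The clock must be long enough for the slowest stage."""
    return max(latencies.values())

def unpipelined_latency(latencies):
    """Without pipelining, one instruction takes the sum of all stages."""
    return sum(latencies.values())

def steady_state_speedup(latencies):
    """Throughput speedup once the pipeline is full (assumes no hazards)."""
    return unpipelined_latency(latencies) / clock_period(latencies)

def utilization(latencies):
    """Fraction of each clock period a stage spends doing useful work."""
    period = clock_period(latencies)
    return {name: t / period for name, t in latencies.items()}

period = clock_period(stage_latency)
total = unpipelined_latency(stage_latency)
speedup = steady_state_speedup(stage_latency)
busy = utilization(stage_latency)

print("clock period:", period)
print("unpipelined latency:", total)
print("ideal speedup:", round(speedup, 2))
print("ID stage utilization:", round(busy["ID"], 2))
```

Under these assumptions the five-stage pipeline falls well short of a 5x speedup because every stage is clocked at the EX latency, and ID sits idle for most of each cycle.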
Page 8
Balancing Pipeline Stages
Two Methods for Stage Quantization:
    Merging of multiple stages
    Further subdividing a stage
Recent Trends:
    Deeper pipelines (more and more stages)
        Pipeline depth growing more slowly since Pentium 4. Why?
    Multiple pipelines
    Pipelined memory/cache accesses (tricky)
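The two quantization methods can be tried on the stage latencies from the previous slide (IF=6, ID=2, EX=9, MEM=5, WB=8). This is a rough sketch: the helper names `merge` and `subdivide` are made up for illustration, stages are assumed to split into perfectly equal pieces, and pipeline-register overhead is ignored.

```python
import math

def merge(stages, first, second, merged_name):
    """Merge two adjacent stages; their latencies add."""
    out = []
    i = 0
    while i < len(stages):
        name, t = stages[i]
        if name == first and i + 1 < len(stages) and stages[i + 1][0] == second:
            out.append((merged_name, t + stages[i + 1][1]))
            i += 2
        else:
            out.append((name, t))
            i += 1
    return out

def subdivide(stages, target_period):
    """Split every stage longer than target_period into equal sub-stages."""
    out = []
    for name, t in stages:
        pieces = math.ceil(t / target_period)
        if pieces == 1:
            out.append((name, t))
        else:
            out.extend((f"{name}{k + 1}", t / pieces) for k in range(pieces))
    return out

def period(stages):
    """Clock period is set by the slowest stage."""
    return max(t for _, t in stages)

stages = [("IF", 6), ("ID", 2), ("EX", 9), ("MEM", 5), ("WB", 8)]

merged = merge(stages, "IF", "ID", "IF+ID")   # fewer stages, same clock
deeper = subdivide(stages, target_period=5)   # more stages, faster clock

print(len(merged), period(merged))
print(len(deeper), period(deeper))
```

Merging IF and ID removes a pipeline register without slowing the clock (better efficiency), while subdividing the long stages shortens the clock at the cost of more stages (better performance, but more exposure to hazards).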
Page 9
The Cost of Deeper Pipelines
Instruction pipelines are not ideal, i.e., instructions in different stages can have dependencies

Ideal case:
        t0  t1  t2  t3     t4     t5
Inst0   F   D   E   M      W
Inst1       F   D   E      M      W

Suppose:
    add 1 2 3
    nand 3 4 5

        t0  t1  t2  t3     t4     t5
add     F   D   E   M      W
nand        F   D   Stall  Stall  E

RAW!! (read-after-write dependency)
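The stall count in the add/nand example can be derived from stage positions. The sketch below assumes the register file is written in the first half of a cycle and read in the second half, so the consumer may read its operands in the same cycle the producer is in WB; the function names are illustrative only.

```python
def stalls_needed(distance, write_stage=4, read_stage=1):
    """Stall cycles so the consumer reads no earlier than the producer's WB.

    distance: how many instructions after the producer the consumer is fetched.
    Stage indices: F=0, D=1, E=2, M=3, W=4.
    """
    return max(0, (write_stage - read_stage) - distance)

def timeline(fetch_cycle, stalls=0):
    """Map cycle number -> stage label for one instruction."""
    stages = ["F", "D"] + ["Stall"] * stalls + ["E", "M", "W"]
    return {fetch_cycle + i: stage for i, stage in enumerate(stages)}

add_row = timeline(fetch_cycle=0)
nand_row = timeline(fetch_cycle=1, stalls=stalls_needed(distance=1))

# Print the first six cycles (t0..t5) of each instruction.
for name, row in (("add", add_row), ("nand", nand_row)):
    cells = [row.get(t, "") for t in range(6)]
    print(f"{name:5}" + " ".join(f"{c:6}" for c in cells))
```

With a dependent instruction immediately behind its producer, two stall cycles are needed; a deeper pipeline increases the distance between the read and write stages and therefore the stall count.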
Page 10
Terminology
Pipeline Hazards:
    Potential violations of program dependences
    Must ensure program dependences are not violated
Hazard Resolution:
    Static Method: Performed at compile time in software
    Dynamic Method: Performed at run time using hardware
Pipeline Interlock:
    Hardware mechanisms for dynamic hazard resolution
    Must detect and enforce dependences at run time
Page 11
Handling Data Hazards
Avoidance (static)
    Make sure there are no hazards in the code
Detect and Stall (dynamic)
    Stall until earlier instructions finish
Detect and Forward (dynamic)
    Get correct value from elsewhere in pipeline
Page 12
Handling Data Hazards: Avoidance
Programmer/compiler must know implementation details
Insert noops between dependent instructions

    add 1 2 3       (write R3 in cycle 5)
    noop
    noop
    nand 3 4 5      (read R3 in cycle 6)
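As a sketch of what the compiler-side pass might look like, the function below pads a program with noops. The tuple encoding, the `insert_noops` name, and the default `gap=2` (two instructions between a write and a dependent read, matching the two noops on this slide) are assumptions made for illustration; a different pipeline would need a different gap, which is exactly the implementation detail the compiler has to know.

```python
NOOP = ("noop", 0, 0, 0)

def sources_and_dest(instr):
    """Return (set of source registers, destination register or None)."""
    op, a, b, c = instr
    if op in ("add", "nand"):      # dest = regA op regB
        return {a, b}, c
    if op == "lw":                 # regB = Mem[regA + offset]
        return {a}, b
    if op == "sw":                 # Mem[regA + offset] = regB
        return {a, b}, None
    return set(), None             # noop reads and writes nothing

def insert_noops(program, gap=2):
    """Pad so at least `gap` instructions separate a write from a dependent read."""
    out = []
    for instr in program:
        srcs, _ = sources_and_dest(instr)
        needed = 0
        # Look back over the last `gap` instructions already emitted.
        for distance, prev in enumerate(reversed(out[-gap:]), start=1):
            _, dest = sources_and_dest(prev)
            if dest is not None and dest in srcs:
                needed = max(needed, gap - distance + 1)
        out.extend([NOOP] * needed)
        out.append(instr)
    return out

padded = insert_noops([("add", 1, 2, 3), ("nand", 3, 4, 5)])
for instr in padded:
    print(instr)
```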
Page 13
Problems with Avoidance
Binary compatibility
    New implementations may require more noops
Code size
    Higher instruction cache footprint
    Longer binary load times
    Worse in machines that execute multiple instructions / cycle
        Intel Itanium: 25-40% of instructions are noops
Slower execution
    CPI=1, but many instructions are noops
Page 14
Handling Data Hazards: Detect & Stall
Detection
    Compare regA & regB with DestReg of preceding insn.
        3-bit comparators
Stall
    Do not advance pipeline register for Fetch/Decode
    Pass noop to Execute
Which of the "Avoidance" issues does "Detect & Stall" fix? (select all)
a) Binary compatibility
b) Code size
c) Slower execution
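The detection step can be modeled in a few lines. This is a software sketch of the comparator logic, not the course's Verilog; it assumes the instruction in Decode is compared against the destination registers held in ID/EX and EX/Mem, and that an instruction in WB has already written the register file by the time Decode reads it.

```python
def equal_3bit(x, y):
    """3-bit equality comparator: XOR the fields, then check no bit differs."""
    return ((x ^ y) & 0b111) == 0

def needs_stall(reg_a, reg_b, older_dests):
    """Return True if the instruction in Decode must stall.

    older_dests: destination registers of the instructions in ID/EX and
    EX/Mem (None when that stage holds a noop or writes no register).
    """
    for dest in older_dests:
        if dest is None:
            continue
        if equal_3bit(reg_a, dest) or equal_3bit(reg_b, dest):
            return True
    return False

# nand 3 4 5 in Decode while add 1 2 3 moves down the pipeline.
cycle3 = needs_stall(3, 4, [3, None])     # add is in ID/EX
cycle4 = needs_stall(3, 4, [None, 3])     # add is in EX/Mem, noop behind it
cycle5 = needs_stall(3, 4, [None, None])  # add is in WB, noops behind it
print(cycle3, cycle4, cycle5)
```

A real detector would also check whether the instruction actually uses regB (lw, for example, does not read it), otherwise it stalls on comparisons that do not matter.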
Page 15
[Figure: five-stage pipelined datapath (Fetch, Decode, Execute, Memory, WB) with PC, instruction memory, register file (R0-R7, regA/regB read ports), ALU, data memory, muxes, and pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB. Instruction fields are labeled Bits 0-2, Bits 16-18, and Bits 22-24.]
Page 16
[Figure: five-stage pipelined datapath (Fetch, Decode, Execute, Memory, WB) with pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB, shown without the instruction bit-field labels.]
Page 17
[Figure: pipelined datapath at the end of cycle 1. add 1 2 3 has been fetched into IF/ID.]
Page 18
[Figure: pipelined datapath at the end of cycle 2. add is in ID/EX (operand values 14 and 7, dest 3); nand 3 4 5 is in IF/ID.]
Page 19
[Figure: hazard detection in the first half of cycle 3. nand 3 4 5 in IF/ID reads register 3, which matches the dest (3) of the add in ID/EX.]
Page 20
[Figure: close-up of the hazard detection logic between IF/ID and ID/EX. Comparators check regA and regB against the dest field (3) and raise "Hazard detected".]
Page 21
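[Figure: detail of one 3-bit comparator. The two register fields are both 3 (011 and 011); no bit differs (000), so the comparator outputs 1 and the hazard is detected.]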
Page 22
[Figure: pipelined datapath in the first half of cycle 3 with the hazard signal asserted. The enable (en) inputs of the PC and IF/ID are controlled by the hazard logic so that nand 3 4 5 stays in IF/ID.]
Page 23
[Figure: pipelined datapath at the end of cycle 3. add is in EX/Mem (ALU result 21), a noop has been passed into ID/EX, and nand 3 4 5 is still in IF/ID.]
Page 24
[Figure: pipelined datapath in the first half of cycle 4. The hazard on register 3 is detected again; nand 3 4 5 remains in IF/ID.]
Page 25
[Figure: pipelined datapath at the end of cycle 4. add is in Mem/WB (result 21), noops are in ID/EX and EX/Mem, and nand 3 4 5 is still in IF/ID.]
Page 26
[Figure: pipelined datapath in the first half of cycle 5. No hazard is detected, so nand 3 4 5 can proceed.]
Page 27
[Figure: pipelined datapath at the end of cycle 5. nand is in ID/EX (operand values 21 and 11, dest 5), noops are in EX/Mem and Mem/WB, and the register file now shows 21 in R3.]
Page 28
Problems with Detect & Stall
CPI increases on every hazard
Are these stalls necessary? Not always!
    The new value for R3 is in the EX/Mem register
    Reroute the result to the nand
        Called "forwarding" or "bypassing"
Page 29
Handling Data Hazards: Detect & Forward
Detection
    Same as detect and stall, but…
    each possible hazard requires different forwarding paths
Forward
    Add data paths for all possible sources
    Add mux in front of ALU to select source
"Bypassing logic" is often a critical path in wide-issue machines
    I.e., superscalar machines
    # paths grows quadratically with machine width
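The mux select in front of each ALU input can be sketched as a priority check. This is a simplified model that assumes only two forwarding sources, EX/Mem and Mem/WB; the datapath in the lecture figures may carry additional paths, and the function name and string labels are illustrative only.

```python
def forward_select(src_reg, ex_mem_dest, mem_wb_dest):
    """Choose where one ALU operand should come from.

    Returns "EX/Mem", "Mem/WB", or "regfile". A dest of None means the
    instruction in that pipeline register does not write a register.
    """
    if ex_mem_dest is not None and src_reg == ex_mem_dest:
        return "EX/Mem"    # the youngest producer holds the newest value
    if mem_wb_dest is not None and src_reg == mem_wb_dest:
        return "Mem/WB"
    return "regfile"       # no hazard: use the value read in Decode

# nand 3 4 5 in Execute, add 1 2 3 one stage ahead in EX/Mem.
print(forward_select(3, ex_mem_dest=3, mem_wb_dest=None))
print(forward_select(4, ex_mem_dest=3, mem_wb_dest=None))
```

Forwarding does not remove every stall: a value loaded by lw is not available until after the Memory stage, so an instruction that uses it immediately still waits one cycle.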
Page 30
Sample Code Reminder
Run the following code on a pipelined datapath:
    nand 3 4 5    ; reg 5 = reg 3 ~& reg 4
    add 6 3 7     ; reg 7 = reg 6 + reg 3
    lw 3 6 10     ; reg 6 = Mem[reg 3 + 10]
    sw 6 2 12     ; Mem[reg 6 + 12] = reg 2
Poll: How many data dependencies are here? How many stalls will we see?
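A small script can be used to check an answer to the first poll question. It finds read-after-write dependencies by tracking the most recent writer of each register; the tuple encoding and function names are assumptions for illustration, with operand roles taken from the comments in the sample code. The second list is a hypothetical variant with the earlier example's `add 1 2 3` placed in front.

```python
def reads_writes(instr):
    """Return (source registers, destination register or None)."""
    op, a, b, c = instr
    if op in ("add", "nand"):      # dest = regA op regB
        return (a, b), c
    if op == "lw":                 # regB = Mem[regA + offset]
        return (a,), b
    if op == "sw":                 # Mem[regA + offset] = regB
        return (a, b), None
    raise ValueError(f"unknown opcode: {op}")

def raw_dependencies(program):
    """Return (producer_index, consumer_index, register) triples."""
    deps = []
    last_writer = {}
    for i, instr in enumerate(program):
        srcs, dest = reads_writes(instr)
        for reg in dict.fromkeys(srcs):   # drop duplicate sources, keep order
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dest is not None:
            last_writer[dest] = i
    return deps

sample = [
    ("nand", 3, 4, 5),
    ("add", 6, 3, 7),
    ("lw", 3, 6, 10),
    ("sw", 6, 2, 12),
]
with_leading_add = [("add", 1, 2, 3)] + sample

print(raw_dependencies(sample))
print(raw_dependencies(with_leading_add))
```

Counting dependencies is not the same as counting stalls: with forwarding, only a dependency whose producer is a load immediately followed by its consumer forces a stall.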
Page 31
[Figure: pipelined datapath with forwarding paths (fwd) in the first half of cycle 3. add is in ID/EX and nand 3 4 5 is in IF/ID; a hazard on register 3 is detected.]
Page 32
[Figure: forwarding datapath at the end of cycle 3. nand is in ID/EX, add is in EX/Mem (ALU result 21), and add 6 3 7 is in IF/ID. The hazard is recorded as H1.]
Page 33
[Figure: forwarding datapath in the first half of cycle 4. A new hazard on register 3 is detected for add 6 3 7, while the forwarding mux supplies 21 (with 11) to the ALU for nand.]
Page 34
[Figure: forwarding datapath at the end of cycle 4. add 6 3 7 is in ID/EX, nand is in EX/Mem (ALU result -2), the first add is in Mem/WB (21), and lw 3 6 10 is in IF/ID. Hazards H1 and H2 are recorded.]
Page 35
[Figure: forwarding datapath in the first half of cycle 5. The check on register 3 is annotated "No Hazard"; the values 21 and 1 are shown on the datapath.]
Page 36
[Figure: forwarding datapath at the end of cycle 5. lw is in ID/EX, add is in EX/Mem (ALU result 22), nand is in Mem/WB (-2), and sw 6 2 12 is in IF/ID. The register file now shows 21 in R3.]
Page 37
[Figure: forwarding datapath in the first half of cycle 6. A hazard on register 6 is detected between lw and sw; the enable (en) signals hold sw in IF/ID.]
Page 38
[Figure: forwarding datapath at the end of cycle 6. A noop is in ID/EX, lw is in EX/Mem (ALU result 31), add is in Mem/WB (22), and sw 6 2 12 is still in IF/ID. The register file now shows -2 in R5.]
Page 39
[Figure: forwarding datapath in the first half of cycle 7. The hazard on register 6 is flagged for sw 6 2 12.]
Page 40
[Figure: forwarding datapath at the end of cycle 7. sw is in ID/EX (offset 12), a noop is in EX/Mem, and lw is in Mem/WB (loaded value 99). The register file now shows 22 in R7. The hazard is recorded as H3.]
Page 41
[Figure: forwarding datapath in the first half of cycle 8. The loaded value 99 is forwarded to the ALU together with the offset 12 for sw.]
Page 42
[Figure: forwarding datapath at the end of cycle 8. sw is in EX/Mem (ALU result 111) and the noop is in Mem/WB. The register file now shows 99 in R6.]
Page 43
Control Hazards
    beq 1 1 10
    sub 3 4 5

        t0  t1  t2  t3  t4  t5
beq     F   D   E   M   W
sub         F   D   E   M   W    (squash)
Page 44
Handling Control Hazards
Avoidance (static)
    No branches?
    Convert branches to predication
        Control dependence becomes data dependence
Detect and Stall (dynamic)
    Stop fetch until branch resolves
Speculate and squash (dynamic)
    Keep going past branch, throw away instructions if wrong
Page 45
Avoidance: if-conversion
Source:
    if (a == b) {
        x++;
        y = n / d;
    }

With a branch:
    sub t1 a, b
    jnz t1, PC+2
    add x x, #1
    div y n, d

With predicated instructions:
    sub t1 a, b
    add(t1) x x, #1
    div(t1) y n, d

With conditional moves:
    sub t1 a, b
    add t2 x, #1
    div t3 n, d
    cmov(t1) x t2
    cmov(t1) y t3
If you're interested:
https://en.wikipedia.org/wiki/Predication_(computer_architecture)
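To see that the conditional-move version computes the same result without a branch, here is a rough Python model of both forms. It assumes `cmov(t1)` performs the move when `t1` is zero (that is, when a == b), which the slide does not spell out, and it uses non-negative integers so that Python's `//` agrees with C integer division.

```python
def branchy(a, b, x, y, n, d):
    """Control-dependent version: the body runs only when a == b."""
    if a == b:
        x = x + 1
        y = n // d
    return x, y

def if_converted(a, b, x, y, n, d):
    """cmov-style version: compute both results, then select on the condition."""
    t1 = a - b                   # sub  t1 a, b
    t2 = x + 1                   # add  t2 x, #1  (always executed)
    t3 = n // d                  # div  t3 n, d   (always executed)
    x = t2 if t1 == 0 else x     # cmov(t1) x t2
    y = t3 if t1 == 0 else y     # cmov(t1) y t3
    return x, y

taken = (1, 1, 5, 0, 12, 4)       # a == b: the body should take effect
not_taken = (1, 2, 5, 0, 12, 4)   # a != b: x and y should be unchanged

print(branchy(*taken), if_converted(*taken))
print(branchy(*not_taken), if_converted(*not_taken))
```

One design consequence shows up here: the converted code always executes the divide, so with d == 0 and a != b the branchy version is fine while this model raises ZeroDivisionError. Hardware that predicates or speculates has to keep such side effects from becoming visible.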
Page 46
Handling Control Hazards: Detect & Stall
Detection
    In decode, check if opcode is branch or jump
Stall
    Hold next instruction in Fetch
    Pass noop to Decode
Page 47
Problems with Detect & Stall
CPI increases on every branch
Are these stalls necessary? Not always!
    Branch is only taken half the time
Assume branch is NOT taken
    Keep fetching, treat branch as noop
    If wrong, make sure bad instructions don't complete
Page 48
Handling Control Hazards: Speculate & Squash
Speculate
    Assume branch is not taken
Squash
    Overwrite opcodes in Fetch, Decode, Execute with noop
    Pass target to Fetch
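A back-of-the-envelope CPI comparison shows what speculation buys over stalling. In this sketch the 20% branch frequency is an assumed figure, the 50% taken rate follows the "taken half the time" remark, and the 3-cycle penalty corresponds to the three squashed stages (Fetch, Decode, Execute); the stall scheme is assumed to pay the same penalty on every branch.

```python
def cpi_stall(branch_fraction, penalty_cycles):
    """Every branch pays the full penalty while fetch waits."""
    return 1.0 + branch_fraction * penalty_cycles

def cpi_speculate_not_taken(branch_fraction, taken_fraction, penalty_cycles):
    """Only taken branches (mispredictions) pay the squash penalty."""
    return 1.0 + branch_fraction * taken_fraction * penalty_cycles

BRANCH_FRACTION = 0.20   # assumed share of instructions that are branches
TAKEN_FRACTION = 0.50    # "taken half the time"
PENALTY = 3              # instructions squashed in Fetch, Decode, Execute

stall = cpi_stall(BRANCH_FRACTION, PENALTY)
speculate = cpi_speculate_not_taken(BRANCH_FRACTION, TAKEN_FRACTION, PENALTY)
print(round(stall, 2), round(speculate, 2))
```

With these assumed numbers, speculation halves the branch overhead; the better the guess, the smaller the remaining penalty term.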
Page 49
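[Figure: pipelined datapath with Control, sign-extend, and equality-check blocks, illustrating speculate and squash. beq advances through the pipeline followed by sub, add, and nand; when the branch turns out to be taken, the instructions behind it are overwritten with noops.]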
Page 50
Problems with Speculate & Squash
Always assumes branch is not taken
Can we do better? Yes.
    Predict branch direction and target!
    Why possible? Program behavior repeats.
More on branch prediction to come...
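As a preview of why repeating behavior helps, the toy model below compares "always not taken" with a predictor that simply repeats the branch's last outcome. This predictor is an illustrative assumption rather than a scheme from this lecture, and the outcome trace (a loop branch taken 9 times, then not taken, repeated 10 times) is made up.

```python
def last_outcome_accuracy(outcomes, initial_prediction=False):
    """Predict that the branch does whatever it did last time."""
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken      # remember the most recent outcome
    return correct / len(outcomes)

def always_not_taken_accuracy(outcomes):
    """Static guess used by speculate and squash."""
    return sum(1 for taken in outcomes if not taken) / len(outcomes)

# Hypothetical loop branch: taken 9 times, then falls through, 10 times over.
loop_branch = ([True] * 9 + [False]) * 10

static_acc = always_not_taken_accuracy(loop_branch)
dynamic_acc = last_outcome_accuracy(loop_branch)
print(static_acc, dynamic_acc)
```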
Page 51
Next Time
• Going one step beyond pipelining: dynamic scheduling (a.k.a. out-of-order processing)
    Introduce a specific algorithm: scoreboard scheduling
• Lingering questions / feedback?
    I'll include an anonymous form at the end of every lecture: https://bit.ly/3oSr5FD