Pipelining & Hazards II, Lecture 4, EECS 470 © Wenisch 2016 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar


EECS 470

Lecture 4

Pipelining & Hazards II Winter 2021

Jon Beaumont

http://www.eecs.umich.edu/courses/eecs470


Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin.


Class Question

Which of the following best explains why pipelining results in speedup?

a) Instructions are executed with shorter latency

b) Clock period is reduced

c) More instructions are executed at the same time

d) Magnets


Announcements

• Reminder: Lab #1 is due tomorrow by 12:30 p.m.
  • Get checked off by a GSI/IA
• Verilog assignment #1 is due tomorrow
  • Submit to the autograder by 11:59 p.m.
• HW #1 is due Thursday 2/4
  • Submit through Gradescope by 11:59 p.m.
• I have OH today from 3-4
  • OH format for all staff: join the Zoom link and put yourself on the Office Hour Queue
  • You will be let into a breakout room when you are at the head of the queue


Last Time

• Baseline processor discussion
  • Review of the 5-stage pipeline from EECS 370


Today

• Hazards
  • Detection
  • Resolution
    • Software (avoidance)
    • Hardware (stalling, forwarding)


Lingering Questions

• "How recent was the pipeline method developed? What will be the next best method?"
  • Basic pipelines have been used since the very early days of computing (1930s)
  • Deep pipelines became very popular with vector processors in the 1970s
  • Less popular now; we'll discuss why
  • Recent trends have been not towards better performance, but towards better reliability and power efficiency
  • EECS 573 (Microarchitectures) covers a lot of these interesting topics

• Remember, you can submit lingering questions to cover next lecture at: https://bit.ly/3oSr5FD


Balancing Pipeline Stages

Stage   Latency
IF      T_IF  = 6 units
ID      T_ID  = 2 units
EX      T_EX  = 9 units
MEM     T_MEM = 5 units
WB      T_WB  = 8 units

Can we do better in terms of either performance or efficiency?
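As a rough way to put numbers on the question, the sketch below computes the clock period and the ideal speedup implied by the stage latencies above. It assumes the unpipelined machine takes the sum of all stage latencies per instruction and ignores pipeline-register overhead and hazards; the function names are illustrative only.

```python
# Stage latencies from the slide, in abstract time units.
stage_latency = {"IF": 6, "ID": 2, "EX": 9, "MEM": 5, "WB": 8}

def clock_period(latencies):
    """The clock must be long enough for the slowest stage."""
    return max(latencies.values())

def unpipelined_latency(latencies):
    """Assumption: without pipelining, one instruction takes the sum of all stages."""
    return sum(latencies.values())

def pipelined_latency(latencies):
    """Each instruction spends one full clock period in every stage."""
    return clock_period(latencies) * len(latencies)

def ideal_speedup(latencies):
    """Steady-state throughput gain, ignoring hazards and register overhead."""
    return unpipelined_latency(latencies) / clock_period(latencies)

period = clock_period(stage_latency)          # set by EX, the slowest stage
single = unpipelined_latency(stage_latency)   # 6 + 2 + 9 + 5 + 8
piped = pipelined_latency(stage_latency)      # 5 stages at the EX-limited period
speedup = ideal_speedup(stage_latency)
```

With these numbers the period is 9 units, so a single instruction takes 45 units in the pipeline versus 30 without it, and the ideal speedup is 30/9 rather than 5. The unbalanced stages are what leave throughput (and efficiency) on the table.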


Balancing Pipeline Stages

Two Methods for Stage Quantization:
  • Merging of multiple stages
  • Further subdividing a stage

Recent Trends:
  • Deeper pipelines (more and more stages)
    • Pipeline depth growing more slowly since Pentium 4. Why?
  • Multiple pipelines
  • Pipelined memory/cache accesses (tricky)


The Cost of Deeper Pipelines

Instruction pipelines are not ideal, i.e., instructions in different stages can have dependencies.

Ideal overlap:

        t0  t1  t2  t3  t4  t5
Inst0   F   D   E   M   W
Inst1       F   D   E   M   W

Suppose:

add 1 2 3
nand 3 4 5

        t0  t1  t2     t3     t4  t5
add     F   D   E      M      W
nand        F   Stall  Stall  D   E

RAW!! (read-after-write dependency): nand reads R3, which add writes.


Terminology

Pipeline Hazards:
  • Potential violations of program dependences
  • Must ensure program dependences are not violated

Hazard Resolution:
  • Static Method: performed at compile time in software
  • Dynamic Method: performed at run time using hardware

Pipeline Interlock:
  • Hardware mechanisms for dynamic hazard resolution
  • Must detect and enforce dependences at run time


Handling Data Hazards

Avoidance (static)
  • Make sure there are no hazards in the code

Detect and Stall (dynamic)
  • Stall until earlier instructions finish

Detect and Forward (dynamic)
  • Get correct value from elsewhere in pipeline


Handling Data Hazards: Avoidance

• Programmer/compiler must know implementation details
• Insert noops between dependent instructions

add 1 2 3     ; write R3 in cycle 5
noop
noop
nand 3 4 5    ; read R3 in cycle 6
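The static approach can be sketched as a small pass over the program, as a compiler or assembler might run it. This is a minimal illustration, not the course's tooling: instructions are hypothetical tuples `(op, regA, regB, dest)`, only `add` and `nand` are modeled, and `min_distance=3` encodes the assumption (taken from the two noops in the example) that a consumer must sit at least three instruction slots after its producer.

```python
NOOP = ("noop",)

def sources(inst):
    """Registers read by an instruction (only add/nand are modeled here)."""
    return set(inst[1:3]) if inst[0] in ("add", "nand") else set()

def dest(inst):
    """Register written by an instruction, or None for a noop."""
    return inst[3] if inst[0] in ("add", "nand") else None

def insert_noops(program, min_distance=3):
    """Pad with noops so every consumer is >= min_distance slots after its producer."""
    out = []
    for inst in program:
        srcs = sources(inst)
        needed = 0
        # Look back at the instructions that would still be in flight.
        for back in range(1, min_distance):
            if back > len(out):
                break
            prev = out[-back]
            if dest(prev) is not None and dest(prev) in srcs:
                needed = max(needed, min_distance - back)
        out.extend([NOOP] * needed)
        out.append(inst)
    return out

padded = insert_noops([("add", 1, 2, 3), ("nand", 3, 4, 5)])
```

Note that `min_distance` is a property of one particular pipeline implementation, which is exactly why this approach ties a binary to that implementation.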


Problems with Avoidance

Binary compatibility
  • New implementations may require more noops

Code size
  • Higher instruction cache footprint
  • Longer binary load times
  • Worse in machines that execute multiple instructions / cycle
    • Intel Itanium: 25-40% of instructions are noops

Slower execution
  • CPI = 1, but many instructions are noops


Handling Data Hazards: Detect & Stall

Detection
  • Compare regA & regB with DestReg of preceding instructions
    • 3-bit comparators

Stall
  • Do not advance pipeline register for Fetch/Decode
  • Pass noop to Execute

Which of the "Avoidance" issues does "Detect & Stall" fix? (select all)

a) Binary compatibility

b) Code size

c) Slower execution
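The detection and stall rules can be modeled in a few lines. The sketch below is an illustration under stated assumptions, not the lecture's hardware: it assumes the register file is written in the first half of a cycle and read in the second half, so only destinations held in ID/EX and EX/Mem are compared, and it treats every instruction as reading both regA and regB. The function names are made up for this example.

```python
def field_equal_3bit(a, b):
    """3-bit comparator: XOR the fields; they match when no bit differs."""
    return ((a ^ b) & 0b111) == 0

def raw_hazard(reg_a, reg_b, older_dests):
    """True if regA or regB matches the dest of any older in-flight instruction.

    older_dests holds destination register numbers, with None for a noop.
    """
    return any(
        d is not None and (field_equal_3bit(reg_a, d) or field_equal_3bit(reg_b, d))
        for d in older_dests
    )

def stall_cycles(reg_a, reg_b, id_ex_dest, ex_mem_dest):
    """Count the cycles the instruction in Decode is held before it may proceed."""
    stalls = 0
    while raw_hazard(reg_a, reg_b, [id_ex_dest, ex_mem_dest]):
        stalls += 1
        # Older instructions advance one stage; a noop is passed into ID/EX.
        id_ex_dest, ex_mem_dest = None, id_ex_dest
    return stalls

# nand 3 4 5 in Decode while add 1 2 3 (dest 3) sits in ID/EX.
nand_stalls = stall_cycles(3, 4, id_ex_dest=3, ex_mem_dest=None)
```

For the add/nand pair this yields two stall cycles, after which the producer has reached write-back and the consumer can read the updated register.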

[Figure: five-stage pipelined datapath (Fetch, Decode, Execute, Memory, WB) with PC, instruction memory, register file (R0-R7), ALU, data memory, and muxes, separated by the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers. The pipeline registers carry op, dest, offset, valA, valB, PC+1, target, ALU result, and mdata. Instruction bits 0-2, 16-18, and 22-24 are labeled.]

Detect & Stall walk-through (add 1 2 3 followed by nand 3 4 5):

[Figure, end of cycle 1: add 1 2 3 is in IF/ID.]

[Figure, end of cycle 2: add is in ID/EX (dest 3, operand values 7 and 14); nand 3 4 5 is in IF/ID.]

[Figure, first half of cycle 3, hazard detection: nand's regA (3) is compared with add's dest (3) in ID/EX.]

[Figure, detection logic: compare blocks check regA and regB from IF/ID against the dest fields of older instructions; the match on 3 raises "Hazard detected".]

[Figure, 3-bit comparator: regA = 011 and dest = 011 differ in no bit position (000), so the comparator outputs 1 and the hazard is detected.]

[Figure, cycle 3, hazard: the en (enable) inputs on the PC and the IF/ID register hold nand in Decode.]

[Figure, end of cycle 3: add is in EX/Mem with ALU result 21; a noop is in ID/EX; nand 3 4 5 is still in IF/ID.]

[Figure, cycle 4, hazard: add's dest (3) still matches nand's regA (3), so the en inputs hold nand for another cycle.]

[Figure, end of cycle 4: add is in Mem/WB (result 21); noops are in ID/EX and EX/Mem; nand 3 4 5 is still in IF/ID.]

[Figure, cycle 5, no hazard: add is in WB and nand is allowed to proceed.]

[Figure, end of cycle 5: R3 holds 21; nand is in ID/EX with operand values 21 and 11; noops are in EX/Mem and Mem/WB; the next instruction (add 3 7 7) is in IF/ID.]


Problems with Detect & Stall

CPI increases on every hazard

Are these stalls necessary? Not always!
  • The new value for R3 is in the EX/Mem register
  • Reroute the result to the nand
    • Called “forwarding” or “bypassing”


Handling Data Hazards: Detect & Forward

Detection
  • Same as detect and stall, but…
    • each possible hazard requires different forwarding paths

Forward
  • Add data paths for all possible sources
  • Add mux in front of ALU to select source

“Bypassing logic” is often a critical path in wide-issue machines
  • I.e., superscalar machines
  • Number of paths grows quadratically with machine width
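The mux in front of the ALU can be sketched as a priority selection among the possible sources. This is a simplified model, not the exact datapath in the figures: `ex_mem` and `mem_wb` are hypothetical `(dest, value)` pairs describing the instructions one and two stages ahead (or `None` for a noop), and the most recent producer is assumed to win.

```python
def alu_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """Select the ALU input for src_reg: forwarded value if available, else regfile.

    ex_mem / mem_wb are (dest, value) tuples for older instructions, or None.
    EX/Mem is checked first because it holds the most recent producer.
    """
    for stage in (ex_mem, mem_wb):
        if stage is not None and stage[0] == src_reg:
            return stage[1]
    return regfile_value

def nand(a, b):
    """Bitwise NAND on Python integers."""
    return ~(a & b)

# nand 3 4 5 in Execute: R3 was read as a stale 10, but add's result (21)
# for dest 3 is sitting in EX/Mem. R4 = 11 comes from the register file.
val_a = alu_operand(3, 10, ex_mem=(3, 21), mem_wb=None)
val_b = alu_operand(4, 11, ex_mem=(3, 21), mem_wb=None)
result = nand(val_a, val_b)
```

A load is the case this sketch does not cover: its data is not available until after the Memory stage, so a dependent instruction immediately behind it still needs a stall even with forwarding.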


Sample Code Reminder

Run the following code on a pipelined datapath:

nand 3 4 5    ; reg 5 = reg 3 ~& reg 4
add  6 3 7    ; reg 7 = reg 6 + reg 3
lw   3 6 10   ; reg 6 = Mem[reg 3 + 10]
sw   6 2 12   ; Mem[reg 6 + 12] = reg 2

Poll: How many data dependencies are here? How many stalls will we see?
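One way to check an answer to the first poll question is to enumerate read-after-write pairs mechanically. The sketch below does this for the four listed instructions only; the tuple encoding and helper names are assumptions made for this example, with operand roles taken from the comments in the listing (lw writes its second register field, sw reads both register fields).

```python
def reads_writes(inst):
    """Return (set of source registers, destination register or None)."""
    op, a, b, c = inst
    if op in ("add", "nand"):
        return {a, b}, c       # dest is the third field
    if op == "lw":
        return {a}, b          # regB = Mem[regA + offset]
    if op == "sw":
        return {a, b}, None    # Mem[regA + offset] = regB
    raise ValueError(f"unmodeled opcode: {op}")

def raw_dependences(program):
    """List (producer index, consumer index, register) for each RAW dependence."""
    deps = []
    for j, consumer in enumerate(program):
        srcs, _ = reads_writes(consumer)
        for reg in sorted(srcs):
            # The producer is the most recent earlier instruction writing reg.
            for i in range(j - 1, -1, -1):
                if reads_writes(program[i])[1] == reg:
                    deps.append((i, j, reg))
                    break
    return deps

program = [
    ("nand", 3, 4, 5),
    ("add", 6, 3, 7),
    ("lw", 3, 6, 10),
    ("sw", 6, 2, 12),
]
deps = raw_dependences(program)
```

Within these four instructions the only producer/consumer pair is lw followed by sw on register 6. Reads of register 3 depend on whatever instruction wrote it before this listing begins.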

Detect & Forward walk-through (add 1 2 3, then the sample code):

[Figure, cycle 3, hazard: add is in ID/EX (operand values 7 and 14) and nand 3 4 5 is in IF/ID; the hazard is routed to the forwarding (fwd) logic instead of stalling.]

[Figure, end of cycle 3: nand is in ID/EX with operand values 10 (stale R3) and 11; add is in EX/Mem with ALU result 21; add 6 3 7 is in IF/ID; hazard flag H1 is set.]

[Figure, cycle 4, new hazard: add 6 3 7 also reads R3. The mux in front of the ALU selects the forwarded 21 for nand, so the ALU inputs are 21 and 11.]

[Figure, end of cycle 4: add 6 3 7 is in ID/EX (operand values 1 and 10); nand is in EX/Mem (result -2); add 1 2 3 is in Mem/WB (result 21); lw 3 6 10 is in IF/ID; hazard flags H1 and H2.]

[Figure, cycle 5, no hazard: the ALU inputs for add 6 3 7 are 1 and the forwarded 21.]

[Figure, end of cycle 5: lw is in ID/EX (offset 10, base value 21); add 6 3 7 is in EX/Mem (result 22); nand is in Mem/WB (result -2); sw 6 2 12 is in IF/ID; R3 holds 21.]

[Figure, cycle 6, hazard: sw reads R6, which lw has not loaded yet; the en inputs on the PC and IF/ID hold sw in Decode.]

[Figure, end of cycle 6: lw is in EX/Mem (address 31); add 6 3 7 is in Mem/WB (result 22); a noop is in ID/EX; sw 6 2 12 is still in IF/ID; R5 holds -2.]

[Figure, cycle 7, hazard on register 6: lw is in the Memory stage; sw proceeds and the hazard is handled by forwarding.]

[Figure, end of cycle 7: sw is in ID/EX (offset 12, operand values 7 and 1); a noop is in EX/Mem; lw is in Mem/WB (mdata 99); R7 holds 22; hazard flag H3.]

[Figure, cycle 8: the forwarded load value 99 and the offset 12 are the ALU inputs for sw.]

[Figure, end of cycle 8: sw is in EX/Mem (address 111, store value 7); a noop is in Mem/WB; R6 holds 99.]


Control Hazards

beq 1 1 10
sub 3 4 5

        t0  t1  t2  t3  t4  t5
beq     F   D   E   M   W
sub         F   D   E   M   W   (squash)


Handling Control Hazards

Avoidance (static)
  • No branches?
  • Convert branches to predication
    • Control dependence becomes data dependence

Detect and Stall (dynamic)
  • Stop fetch until branch resolves

Speculate and Squash (dynamic)
  • Keep going past branch, throw away instructions if wrong


Avoidance: if-conversion

Source:

if (a == b) {
    x++;
    y = n / d;
}

With a branch:

sub t1 a, b
jnz t1, PC+2
add x x, #1
div y n, d

With predicated instructions:

sub t1 a, b
add(t1) x x, #1
div(t1) y n, d

With conditional moves:

sub t1 a, b
add t2 x, #1
div t3 n, d
cmov(t1) x t2
cmov(t1) y t3

If you're interested: https://en.wikipedia.org/wiki/Predication_(computer_architecture)
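To see that the conditional-move form computes the same result as the branch form, the two can be modeled side by side. This is only a semantic sketch: Python has no cmov, so `cmov` here is a stand-in function, integer division is assumed for `n / d`, and the predicate is assumed to mean "t1 is zero" (that is, a == b).

```python
def branch_version(a, b, x, y, n, d):
    """Straightforward form: the updates happen only when a == b."""
    if a == b:
        x = x + 1
        y = n // d
    return x, y

def cmov(pred, old, new):
    """Model of a conditional move: keep old unless the predicate holds."""
    return new if pred else old

def cmov_version(a, b, x, y, n, d):
    """If-converted form: both results are always computed, then selected."""
    t1 = a - b            # sub t1 a, b
    t2 = x + 1            # add t2 x, #1
    t3 = n // d           # div t3 n, d (runs even when a != b, so d must be nonzero)
    take = (t1 == 0)
    x = cmov(take, x, t2)
    y = cmov(take, y, t3)
    return x, y

same_taken = branch_version(4, 4, 1, 0, 20, 5) == cmov_version(4, 4, 1, 0, 20, 5)
same_skipped = branch_version(4, 7, 1, 0, 20, 5) == cmov_version(4, 7, 1, 0, 20, 5)
```

The control dependence on the branch has become a data dependence on `t1`, at the cost of always executing the add and the divide.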


Handling Control Hazards: Detect & Stall

Detection
  • In decode, check if opcode is branch or jump

Stall
  • Hold next instruction in Fetch
  • Pass noop to Decode


Problems with Detect & Stall

CPI increases on every branch

Are these stalls necessary? Not always!
  • Branch is only taken half the time

Assume branch is NOT taken
  • Keep fetching, treat branch as noop
  • If wrong, make sure bad instructions don’t complete
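A back-of-the-envelope CPI model makes the trade-off concrete. All numbers below are assumptions chosen for illustration, not measurements: a base CPI of 1, branches making up 20% of instructions, a 3-cycle penalty for both a stall and a squash, and the "taken half the time" figure from the slide.

```python
def cpi_detect_and_stall(branch_fraction, penalty):
    """Every branch pays the full penalty."""
    return 1.0 + branch_fraction * penalty

def cpi_assume_not_taken(branch_fraction, taken_fraction, penalty):
    """Only taken branches (the wrong guesses) pay the penalty."""
    return 1.0 + branch_fraction * taken_fraction * penalty

BRANCH_FRACTION = 0.2   # hypothetical share of instructions that are branches
PENALTY = 3             # hypothetical cycles lost per stalled or squashed branch
TAKEN_FRACTION = 0.5    # "Branch is only taken half the time"

stall_cpi = cpi_detect_and_stall(BRANCH_FRACTION, PENALTY)
speculate_cpi = cpi_assume_not_taken(BRANCH_FRACTION, TAKEN_FRACTION, PENALTY)
```

Under these assumptions the CPI drops from about 1.6 to about 1.3, because the penalty is only paid when the not-taken guess is wrong.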


Handling Control Hazards: Speculate & Squash

Speculate
  • Assume branch is not taken

Squash
  • Overwrite opcodes in Fetch, Decode, Execute with noop
  • Pass target to Fetch
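The squash step can be modeled as a small function applied when the branch resolves. This sketch assumes the branch resolves while it is in the Memory stage, so the three younger instructions occupy Fetch, Decode, and Execute; the dictionary layout and function name are hypothetical.

```python
NOOP = "noop"

def resolve_branch(stages, next_pc, taken, target):
    """Return (stages, next_pc) after the branch in Memory resolves.

    stages maps a stage name to the opcode currently in that stage.
    If the not-taken guess was right, nothing changes.
    """
    if not taken:
        return dict(stages), next_pc
    squashed = dict(stages)
    # Wrong-path instructions are younger than the branch: overwrite them.
    for stage in ("Fetch", "Decode", "Execute"):
        squashed[stage] = NOOP
    # Redirect Fetch to the branch target.
    return squashed, target

in_flight = {"Fetch": "nand", "Decode": "add", "Execute": "sub", "Memory": "beq"}
after_taken, pc_taken = resolve_branch(in_flight, next_pc=4, taken=True, target=11)
after_not_taken, pc_not_taken = resolve_branch(in_flight, next_pc=4, taken=False, target=11)
```

Squashing is safe here because none of the three younger instructions has reached a stage that changes memory or the register file.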

[Figure: pipelined datapath with sign-extend, control, and equality-check logic. beq is followed through the pipeline by sub, add, and nand; when the branch is taken, the opcodes of the three younger instructions are overwritten with noops.]


Problems with Speculate & Squash

Always assumes branch is not taken

Can we do better? Yes. Predict branch direction and target!
  • Why possible? Program behavior repeats.

More on branch prediction to come...


Next Time

• Going one step beyond pipelining: dynamic scheduling (a.k.a. out-of-order processing)
  • Introduce a specific algorithm: scoreboard scheduling

• Lingering questions / feedback? I'll include an anonymous form at the end of every lecture: https://bit.ly/3oSr5FD
