Appendix C. Pipelining: Basic and Intermediate Concepts. Computer Architecture: A Quantitative Approach, Fifth Edition.

2

Basic Pipelining

Pipelining is the organizational implementation technique that has been responsible for the most dramatic increase in computer performance.

Overview of basic pipelining
• What is pipelining?
• Computing pipeline speedup
• Clocking pipelines
• Pipelining MIPS
• Pipeline hazards
• Handling interrupts

3

Pipelining

4

Pipelining 3 Stages

Assume a 2 ns flip-flop delay

5

Pipelining: Computing the speedup

• Time per instruction: $\mathrm{TPI} = \mathrm{CPI} \times \text{cycle time}$
• We can think about pipelining as reducing either CPI or cycle time
• Ideal speedup:

$$\text{Speedup} = \frac{\text{TPI without pipeline}}{\text{TPI with pipeline}} = \text{number of pipeline stages}$$

  • Requires that all stages be perfectly balanced
  • No synchronization (latch, flip-flop) overhead
  • No stall cycles
• The speedup from a pipeline is limited:

$$\mathrm{CPI}_{real} = \mathrm{CPI}_{ideal} + \mathrm{CPI}_{stall}$$

$$\mathrm{CCT}_{real} = \mathrm{Time}_{\text{longest pipestage}} + \mathrm{Time}_{\text{latch overhead}}$$
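As a rough numerical sketch of these limits: the function below compares time per instruction with and without a pipeline. The stage delays are made-up illustrative values, not taken from the slides; the 2 ns latch overhead borrows the flip-flop delay assumed on the earlier 3-stage slide.

```python
def pipeline_speedup(stage_delays_ns, latch_overhead_ns, cpi_stall=0.0):
    """Speedup of a pipelined design over an unpipelined one.

    Unpipelined: one instruction takes the sum of all stage delays (CPI = 1).
    Pipelined:   cycle time = longest stage + latch overhead,
                 CPI = 1 (ideal) + stall cycles per instruction.
    """
    tpi_without = sum(stage_delays_ns)
    cycle_time = max(stage_delays_ns) + latch_overhead_ns
    tpi_with = (1.0 + cpi_stall) * cycle_time
    return tpi_without / tpi_with


# Perfectly balanced, no overhead, no stalls: speedup equals the stage count.
ideal = pipeline_speedup([10, 10, 10], latch_overhead_ns=0)
# Unbalanced stages (illustrative) plus a 2 ns latch: speedup falls below 3.
real = pipeline_speedup([10, 12, 8], latch_overhead_ns=2)
print(ideal, real)
```

The second call shows both limits at once: the 12 ns stage sets the clock, and the latch overhead is paid every cycle.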

6

MIPS Instruction Formats

7

Basic MIPS Pipeline

8

Basic MIPS Pipeline (simplified)

9

Pipelining By Adding Registers

[Figure: MIPS datapath (PC, instruction memory, register file, sign extend, shift left 2, adders, ALU, data memory, and muxes) divided into five stages.]

IF: Instruction fetch
ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access
WB: Write back

10

MIPS Pipelined Execution

Instruction  1    2    3    4    5    6    7    8    9
i            IF   ID   EX   MEM  WB
i+1               IF   ID   EX   MEM  WB
i+2                    IF   ID   EX   MEM  WB
i+3                         IF   ID   EX   MEM  WB
i+4                              IF   ID   EX   MEM  WB
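The staircase pattern above follows a simple rule: with no stalls, instruction k enters stage s in cycle k + s + 1 (both counted from zero). A minimal sketch that regenerates the diagram; the function names are illustrative and not part of the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def ideal_schedule(num_instructions, stages=STAGES):
    """Return {instruction index: {cycle: stage}} for a stall-free pipeline.

    Instruction k (0-based) occupies stage s (0-based) in cycle k + s + 1.
    """
    return {
        k: {k + s + 1: name for s, name in enumerate(stages)}
        for k in range(num_instructions)
    }


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles to finish n instructions: fill the pipe once, then one per cycle."""
    return num_stages + num_instructions - 1


schedule = ideal_schedule(5)
for k, row in schedule.items():
    cells = [row.get(c, "").ljust(5) for c in range(1, total_cycles(5) + 1)]
    print(f"i+{k}".ljust(6) + "".join(cells))
```

Five instructions finish in 5 + 5 - 1 = 9 cycles, matching the nine columns of the diagram.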

11

Rules for pipeline registers
• Each stage must be independent, so the inter-stage registers must hold:
  • Data values
  • Control signals, including decoded instruction fields, MUX controls, and ALU controls
• Think of the register file as two independent units:
  • Read file, accessed in ID
  • Write file, accessed in WB
• There is no “final” set of registers after WB (no WB/IF register), because the instruction is finished and all results are recorded in permanent machine state (register file, memory, and PC)

12

A More Accurate Pipeline Schematic

[Figure: the same MIPS datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages.]

13

Pipeline Dataflow: the details

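Instruction types: Reg-Reg ALU, Reg-immed ALU, Load, Store, Branch, Jump

IF (all instruction types):
  IR2 = IMEM[PC]; PC2 = PC = PC + 4

ID (all instruction types):
  A3 = Regs[IR2[25..21]]; B3 = Regs[IR2[20..16]]; IR3 = IR2; PC3 = PC2
  IM3 = (IR2[15])^16 ## IR2[15..0]

EX:
  Reg-Reg ALU:    ALU4 = A3 op B3;   IR4 = IR3; PC4 = PC3
  Reg-immed ALU:  ALU4 = A3 op IM3;  IR4 = IR3; PC4 = PC3
  Load, Store:    ALU4 = A3 + IM3;   IR4 = IR3; PC4 = PC3; MD4 = B3
  Branch:         ALU4 = PC3 + IM3;  CO4 = A3 op 0; IR4 = IR3; PC4 = PC3
  Jump:           ALU4 = PC3 + IM3;  IR4 = IR3; PC4 = PC3

MEM:
  Reg-Reg ALU, Reg-immed ALU:  IR5 = IR4; PC5 = PC4
  Load:    IR5 = IR4; PC5 = PC4; WB5 = DMEM[ALU4]
  Store:   IR5 = IR4; PC5 = PC4; DMEM[ALU4] = MD4
  Branch:  IR5 = IR4; PC5 = PC4; if (CO4) PC = ALU4
  Jump:    IR5 = IR4; PC5 = PC4; PC = ALU4

WB:
  Reg-Reg ALU, Reg-immed ALU, Load:  Din = WB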

14

Problems with Pipelining (Dependencies and Hazards)

• Dependencies: a property of the program
  • Data dependencies: instruction j uses the result produced by instruction i
  • Control dependencies: the execution of instruction j depends upon the result of instruction i

15

Dependencies and Hazards

• Hazard: a result of dependencies in the pipeline
• Hazards lead to pipeline stalls or the execution of the wrong instruction
• Data hazard: an instruction depends upon the result of an instruction still in the pipeline
• Structural hazard: two instructions try to use the same hardware resource in a single cycle
• Control hazard: caused by the delay between fetching an instruction and deciding about changes in instruction flow

16

Structural hazards
• Occur when two instructions need to use the same hardware resource in the same cycle
  • A resource is not duplicated (for example, register file write ports)
  • A resource is not fully pipelined, i.e., it takes more than one cycle (for example, division, floating point)
• Fix #1: Stall the later instruction
  • Low cost, but increases average CPI
  • Best used for rare events
  • Examples: MIPS R2000 multi-cycle multiply; SPARC V1 single memory port for instruction and data
• Fix #2: Duplicate the resource
  • Increases cost, but preserves CPI
  • Best used for cheap resources and/or frequent events

17

Structural hazards, continued
• Example resource duplication
  • Separate instruction and data memory
  • Separate ALU and PC adders
  • Register files with multiple ports
• Fix #3: Pipeline the expensive resource
  • Moderate cost compared to duplication, expensive compared to stalling
  • Best used for high performance or specialty machines, e.g., fully pipelined floating point units for scientific machines
• How to avoid structural hazards altogether: design the ISA so that each resource needed by an instruction
  • Is used once
  • Is always used in the same pipeline stage
  • Takes one cycle
• MIPS is designed with pipelining in mind; x86 is not

18

Types of Data Hazards
• RAW (Read After Write)
  • The only hazard for “fixed” pipelines
  • The later instruction must read after the earlier instruction writes
      F R A M W
        F R A M W
• WAW (Write After Write)
  • Variable length pipeline
  • The later instruction must write after the earlier instruction writes
      F R 1 2 3 4 W
        F R A M W
• WAR (Write After Read)
  • Pipeline with late read
  • The later instruction must write after the earlier instruction reads
      F R 1 2 3 4 R 5 W
        F R A M W
• Data hazards can also arise through memory locations
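The three names can be read directly off the register sets of two instructions. The sketch below classifies the dependences between an earlier and a later instruction. The dict representation is hypothetical, chosen only for illustration, and whether a dependence becomes an actual hazard still depends on the pipeline, as noted above.

```python
def classify_hazards(earlier, later):
    """Name the potential data hazards between two instructions in program order.

    Each instruction is a dict with a 'reads' set and a 'writes' set of
    register names (a hypothetical representation for this sketch).
    """
    found = set()
    if earlier["writes"] & later["reads"]:
        found.add("RAW")  # later reads what earlier writes
    if earlier["writes"] & later["writes"]:
        found.add("WAW")  # both write the same register
    if earlier["reads"] & later["writes"]:
        found.add("WAR")  # later overwrites what earlier still reads
    return found


sub = {"reads": {"$1", "$3"}, "writes": {"$2"}}    # sub $2, $1, $3
and_ = {"reads": {"$2", "$5"}, "writes": {"$12"}}  # and $12, $2, $5
print(classify_hazards(sub, and_))
```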

19

Example RAW pipeline hazard

[Figure: pipeline diagram of the sequence below over clock cycles CC 1 to CC 9 (IM, Reg, DM, Reg per instruction), with the dependences on $2 drawn from sub to the later instructions.]

Program execution order (in instructions):
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

Time (in clock cycles):  CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:    10    10    10    10    10/-20  -20   -20   -20   -20

20

Stall for RAW hazards
• Relatively cheap: just needs some extra compare and control logic
• Detected in the ID stage by comparing the registers to be read with the registers to be written by the instructions currently in the EX, MEM, or WB stages
• Stall if a match is found
• Increases the average CPI
• Would happen much too frequently

  ADD R1, R2, R3    F R X M W               (writes data to R1 in W)
  ADD R4, R1, R5      F (bubble) R X M W    (reads from R1 in R)
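The ID-stage check described above amounts to a set-membership test. A minimal sketch, assuming no forwarding and a hypothetical encoding in which an in-flight instruction that writes no register is represented by None:

```python
def must_stall(id_sources, in_flight_dests):
    """Return True if the instruction in ID must stall (no forwarding).

    id_sources: registers the instruction in ID wants to read.
    in_flight_dests: destination registers of the instructions currently
        in EX, MEM, and WB (None for instructions that write no register).
    """
    pending = {dest for dest in in_flight_dests if dest is not None}
    return any(src in pending for src in id_sources)


# ADD R1, R2, R3 is in EX while ADD R4, R1, R5 is in ID: R1 is still pending.
print(must_stall(["R1", "R5"], ["R1", None, None]))
# No in-flight instruction writes a register that ID reads: no stall.
print(must_stall(["R2", "R3"], ["R9", None, None]))
```

One comparator is needed per (source register, in-flight stage) pair, which is the "extra compare and control logic" the slide refers to.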

21

Stall type #1: Freeze the whole pipeline

• Freeze all pipe stages for one or more cycles, and suppress writeback
• Needs only one global stall signal, which suppresses all latching in all pipeline stages
• Sometimes called a “fixed pipe” or “frozen pipe” stall
• Works for cache misses
• Will not work to remove pipeline hazards

            1     2     3     4     5     6     7     8     9     10    11
I           IF    ID    EX    MEM   WB    WB
I+1               IF    ID    EX    MEM   MEM   WB
I+2                     IF    ID    EX    EX    MEM   WB
I+3                           IF    ID    ID    EX    MEM   WB
I+4                                 IF    IF    ID    EX    MEM   WB
I+5                                             IF    ID    EX    MEM   WB
I+6                                                   IF    ID    EX    MEM

22

Stall type #2: Delay completion of an instruction

• Instruction progress stops for one cycle
• Earlier instructions continue towards completion
• Later instructions must suspend and make no more progress
• An “elastic pipe” stall
• Good when the need for stalling is only detected after decode, as for pipeline hazards

            1     2     3     4     5     6     7     8     9     10    11
I           IF    ID    EX    MEM   WB
I+1               IF    ID    EX    MEM   WB
I+2                     IF    ID    Stall EX    MEM   WB
I+3                           IF    Stall ID    EX    MEM   WB
I+4                                 Stall IF    ID    EX    MEM   WB
I+5                                             IF    ID    EX    MEM   WB
I+6                                                   IF    ID    EX    MEM
Bubble in:                          EX    MEM   WB

23

Bypass (Forwarding)

• If data is available elsewhere in the pipeline, there is no need to stall
• Detect the condition
• Bypass (or forward) the data directly to the consuming pipeline stage
• Bypass eliminates stalls for single-cycle operations
• Reduces the longest stall to N-1 cycles for N-cycle operations

24

Physical Forwarding Paths

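[Figure: pipeline diagram of the sequence below over clock cycles CC 1 to CC 9 (IM, Reg, DM, Reg per instruction), with forwarding paths drawn from the pipeline registers to the dependent instructions.]

Program execution order (in instructions):
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

                         CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:    10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM:         X     X     X     -20   X       X     X     X     X
Value of MEM/WB:         X     X     X     X     -20     X     X     X     X

• The third forwarding operation might not be necessary if the register file is built so that a read returns the value written in the same cycle (a read-after-write register file)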

25

Example forwarding decisions
• If EX has just finished an operation whose result the following instruction wants to read as either operand, we must forward:
    if IR4.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS1_Reg_Num then ALUmuxA = SelectALU4
    if IR4.Will_Write_Reg and IR4.Write_Reg_Num == IR3.RS2_Reg_Num then ALUmuxB = SelectALU4
• Need one comparison and one multiplexer control for each forwarding path
• Be careful: if you could forward from more than one instruction, choose the closest in the pipeline
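A minimal sketch of one ALU-input mux decision, including the "choose the closest" rule. The dict fields and the return labels are hypothetical names for this illustration, and the sketch ignores special cases such as a hardwired zero register.

```python
def select_alu_input(src_reg, ex_mem, mem_wb):
    """Choose where one ALU operand comes from.

    ex_mem / mem_wb describe the instructions one and two stages ahead of
    the instruction in EX, as dicts with 'will_write' (bool) and 'dest'
    (register number). Returns 'EX/MEM', 'MEM/WB', or 'REGFILE'.
    """
    # Check the closest producer first so it wins when both match.
    if ex_mem["will_write"] and ex_mem["dest"] == src_reg:
        return "EX/MEM"
    if mem_wb["will_write"] and mem_wb["dest"] == src_reg:
        return "MEM/WB"
    return "REGFILE"


# or $4, $4, $2 in EX, with and $4, ... in MEM and sub $2, ... in WB.
and_in_mem = {"will_write": True, "dest": 4}
sub_in_wb = {"will_write": True, "dest": 2}
print(select_alu_input(4, and_in_mem, sub_in_wb))
print(select_alu_input(2, and_in_mem, sub_in_wb))
```

Ordering the two tests is what implements the priority: if both pipeline registers hold a value for the same register, the EX/MEM copy is the more recent one.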

26

Physical Forwarding Paths

[Figure: pipelined datapath (PC, instruction memory, registers, ALU, data memory, control, and the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline registers) with a forwarding unit driving the muxes at the ALU inputs. Labeled signals: IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, EX/MEM.RegisterRd, MEM/WB.RegisterRd.]

27

Forwarding Animation (1)

[Figure: two snapshots of the forwarding datapath.]

Clock 3:
  IF: or $4, $4, $2
  ID: and $4, $2, $5
  EX: sub $2, $1, $3
  MEM: before<1>
  WB: before<2>

Clock 4:
  IF: add $9, $4, $2
  ID: or $4, $4, $2
  EX: and $4, $2, $5
  MEM: sub $2, . . .
  WB: before<1>

29


[Figure: two further snapshots of the forwarding datapath.]

Clock 5:
  IF: after<1>
  ID: add $9, $4, $2
  EX: or $4, $4, $2
  MEM: and $4, . . .
  WB: sub $2, . . .

Clock 6:
  IF: after<2>
  ID: after<1>
  EX: add $9, $4, $2
  MEM: or $4, . . .
  WB: and $4, . . .

31

Other Data Hazards
• WAR (Write After Read)
  • Can happen if the instruction pipeline has early writes and/or late reads; something like:
      DIV (R1),         suppose that it does not read the indirect destination until after the divide
      ADD ..,(R1)+      the incremented value of R1 is written before DIV has read the value of R1
  • Cannot happen in DLX because all reads are early (ID) and all writes are late (WB)
• WAW (Write After Write)
  • Can happen when a fast operation follows a slow one; like:
      LW  R1, 0(R2)     IF ID EX MEM WB
      ADD R1, R2, R3       IF ID EX WB
  • Cannot happen in DLX (integer) because there is only one WB stage and instructions use it in order

32

One data hazard left

• Loaded data is not available until the end of MEM, which is too late for the following instruction
• Forwarding cannot help, so we must stall, or just “decree” that you cannot write code like this. Such a decree is called a “delayed load” and was used in the original MIPS R2000

[Figure: pipeline diagram of the sequence below through clock cycle CC 9 (IM, Reg, DM, Reg per instruction); the and immediately after the lw needs $2 before the load has produced it.]

  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7

33

Stalling to interlock

[Figure: pipeline diagram of the sequence below over clock cycles CC 1 to CC 10 (IM, Reg, DM, Reg per instruction), with a bubble inserted after the lw so that the dependent and can receive the loaded value.]

  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
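The interlock condition can be stated as a single test: stall when the instruction in EX is a load and the instruction in ID reads the register being loaded. A minimal sketch; the parameter names are hypothetical and not taken from the slides.

```python
def load_use_stall(ex_is_load, ex_dest, id_sources):
    """True when the instruction in ID needs the result of a load still in EX.

    ex_is_load: whether the instruction in EX reads data memory.
    ex_dest: destination register number of the instruction in EX.
    id_sources: register numbers read by the instruction in ID.
    """
    return ex_is_load and ex_dest in id_sources


# lw $2, 20($1) is in EX while and $4, $2, $5 is in ID: one bubble is needed.
print(load_use_stall(True, 2, (2, 5)))
```

The or that follows reads $2 as well, but by the time it reaches EX the loaded value can be forwarded, so only the instruction directly behind the load triggers the interlock.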

34

The software fix: instruction scheduling to avoid stalls

• The hardware cannot avoid a stall when an instruction uses the value loaded by the instruction just before it, so avoid the stall by rearranging the code (“pipeline scheduling”), if possible
• Replace
    sub r4, r5, r7
    lw  r1, 50(r2)
    add r3, r1, r4
  with
    lw  r1, 50(r2)
    sub r4, r5, r7
    add r3, r1, r4
• This can improve the performance of a simple RISC machine
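To make the benefit concrete, the sketch below counts load-use stalls in both orderings. It assumes a pipeline with full forwarding where the only remaining stall is the one-cycle bubble after a load; the tuple encoding of instructions is an illustrative choice.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls in a pipeline with full forwarding.

    program: list of (opcode, dest, sources) tuples in program order.
    A stall occurs when an instruction reads the register loaded by the
    immediately preceding lw.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


original = [
    ("sub", "r4", ("r5", "r7")),
    ("lw", "r1", ("r2",)),
    ("add", "r3", ("r1", "r4")),
]
# Hoist the load above the independent sub, as in the slide.
scheduled = [original[1], original[0], original[2]]
print(count_load_use_stalls(original), count_load_use_stalls(scheduled))
```

The reordering is legal here because sub neither reads r1 nor writes a register that lw reads.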

35

The software fix: instruction scheduling to avoid stalls

• But it is limited
  • Usually limited to basic blocks between branches, 5-7 instructions
  • Difficult to reorder accesses to variables referenced indirectly (pointer, array, or parameter) due to the risk of aliases

36

Branches and jumps

• Control point: the stage where both the target and the condition are known
• Here the control point is MEM
• Branch penalty: the number of pipeline stages up to the control point
• This 3-cycle penalty works, but since branches occur every 5-7 instructions, it kills performance
• What to do:
  • Determine the branch condition earlier than EX
  • Compute the target address earlier than MEM

Instruction  1     2     3     4     5     6     7     8     9
Branch I     IF    ID    EX    MEM   WB
I+1                IF    Stall Stall IF    ID    EX    MEM   WB
I+2                                        IF    ID    EX    MEM
I+3                                              IF    ID    EX

37

Characteristics of MIPS branches and jumps

• The branch condition
  • Only has EQ/NE comparison to zero
  • Fast and cheap: no need for a full ALU; use a 32-bit NOR gate instead
• The branch target
  • Always PC-relative
  • Needs only a 16-bit adder (and carry propagation)
• The jump target
  • Always built from the upper bits of the PC: Target = {PC[31:28], offset, 00}
• All can be moved to the ID stage, at the cost of additional hardware (and maybe increased cycle time)
• Still requires one stall
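Both operations are cheap bit manipulations, which is why they fit in ID. A minimal sketch: the 26-bit offset width comes from the MIPS J-format and is not stated on the slide, and the function names are illustrative.

```python
def jump_target(pc, offset26):
    """Target = {PC[31:28], offset, 00}, assuming a 26-bit offset field."""
    upper = pc & 0xF0000000                  # PC[31:28]
    middle = (offset26 & 0x03FFFFFF) << 2    # offset followed by two zero bits
    return upper | middle


def is_zero(reg_value):
    """EQ-to-zero test: true when no bit of the 32-bit value is set,
    which is what a 32-input NOR of the register bits computes."""
    return (reg_value & 0xFFFFFFFF) == 0


print(hex(jump_target(0x40001000, 0x100)))
print(is_zero(0), is_zero(4))
```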

38

Pipelining and Branch ISA Design

• Simple branches
  • Make an ID control point possible
  • Maybe increase cycle time
  • 1-cycle branch penalty
• Complex branches
  • Require an EX control point
  • Maybe lower cycle time
  • 2-cycle branch penalty

39

Reducing branch penalties (1)

• Predict that the branch will not be taken
  • Continue fetching from sequential addresses; cancel later if the branch was taken
  • Easy to do:
    • If the branch is not taken, continue
    • If it is taken, change the following instruction into a NOP and thus take a 1-cycle penalty
  • Helps a little, but bets the wrong way for loops

40


• Predict that the branch will be taken
  • Only useful if the target address is known before the branch condition, which is not true for MIPS
  • Cancel later if the branch was not taken
  • Always has some delay in fetching the branch target

41


• Change the ISA: delay the effect of the branch
  • Always execute the instruction(s) after the branch or jump
  • Depends on the compiler to find something useful to do in the branch delay slot(s)
  • An ugly dependence of the ISA on the implementation, which may change
  • Interacts with branch prediction and interrupts

42

Filling the branch delay slot

a. From before
    add $s1, $s2, $s3
    if $s2 = 0 then
        [Delay slot]
  Becomes
    if $s2 = 0 then
        add $s1, $s2, $s3

b. From target
    sub $t4, $t5, $t6
    …
    add $s1, $s2, $s3
    if $s1 = 0 then
        [Delay slot]
  Becomes
    add $s1, $s2, $s3
    if $s1 = 0 then
        sub $t4, $t5, $t6

c. From fall through
    add $s1, $s2, $s3
    if $s1 = 0 then
        [Delay slot]
    sub $t4, $t5, $t6
  Becomes
    add $s1, $s2, $s3
    if $s1 = 0 then
        sub $t4, $t5, $t6

43

How useful are canceling branches

[Figure: bar chart of the percentage of conditional branches (0% to 50%) with a canceled delay slot or an empty slot, per benchmark: compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor.]

Integer: 35% of slots wasted
Floating point: 25% of slots wasted

44

Performance of Branch schemes?

Effective CPI = 1 + %branches × average branch penalty

For integer MIPS: 20% of instructions are branches or jumps; 70% of them go to the target.

Strategy            Branch taken penalty  Branch not taken penalty  Effective CPI
Stall               3                     3                         1.60
Branch in ID        1                     1                         1.20
Predict taken       1                     1                         1.20
Predict not taken   1                     0                         1.14
Delay slot          0.5                   0.5                       1.10
Cancel branch       0.3                   0.3                       1.06
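The Effective CPI column can be reproduced from the formula, weighting the taken and not-taken penalties by the 70%/30% split. A short sketch using the numbers given on the slide; only the function and variable names are invented.

```python
def effective_cpi(branch_fraction, taken_fraction, taken_penalty, not_taken_penalty):
    """Effective CPI = 1 + branch fraction x average branch penalty."""
    average_penalty = (taken_fraction * taken_penalty
                       + (1.0 - taken_fraction) * not_taken_penalty)
    return 1.0 + branch_fraction * average_penalty


# (taken penalty, not-taken penalty) per strategy, copied from the table.
SCHEMES = {
    "Stall": (3, 3),
    "Branch in ID": (1, 1),
    "Predict taken": (1, 1),
    "Predict not taken": (1, 0),
    "Delay slot": (0.5, 0.5),
    "Cancel branch": (0.3, 0.3),
}

for name, (taken, not_taken) in SCHEMES.items():
    print(f"{name:<18}{effective_cpi(0.20, 0.70, taken, not_taken):.2f}")
```

Predict-not-taken is the only row where the two penalties differ: 1 + 0.20 × (0.70 × 1 + 0.30 × 0) = 1.14.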

45

Pipeline example
• Consider the following pipeline, which implements a MIPS-like ISA. The only variation on the MIPS ISA is the support of full register compares in branch instructions.

Instruction  1         2         3         4         5         6         7         8         9         10        11
I            IF        ID        EX1       EX2/MEM1  MEM2      WB
I+1                    IF        ID        EX1       EX2/MEM1  MEM2      WB
I+2                              IF        ID        EX1       EX2/MEM1  MEM2      WB
I+3                                        IF        ID        EX1       EX2/MEM1  MEM2      WB
I+4                                                  IF        ID        EX1       EX2/MEM1  MEM2      WB
I+5                                                            IF        ID        EX1       EX2/MEM1  MEM2      WB

46

The Pipeline stages

Stage      Function
IF         Instruction fetch
ID         Instruction decode; register fetch
EX1        Address generation (data and PC-target)
EX2/MEM1   ALU operation; branch condition resolution; first cycle of memory access
MEM2       Second cycle of memory access
WB         Register file writeback

47

Assumptions

Writes to the register file occur in the first half of the clock cycle while reads from the register file occur in the second half

All bypass paths have been implemented to minimize pipeline stalls due to data hazards

The pipeline implements hardware interlocks

48

Questions
• How many register file ports does the processor need to minimize structural hazards?
• Indicate all forwarding required to minimize stalls in the given pipeline. Also, specify the minimum number of comparators needed to implement forwarding.
• What is the worst case delay due to RAW data hazards?
• What is the branch delay of this pipeline?

49

Instruction Dependencies
• The frequencies in the table are presented as percentages of all instructions executed

Type  Instruction sequence (first; second)                      Frequency
1     ALUop Rx,-,- ; ALUop -,-,Rx or ALUop -,Rx,-               10%
2     ALUop Rx,-,- ; Store Rx,-(-)                              5%
3     ALUop Rx,-,- ; Load -,-(Rx) or Store -,-(Rx)              5%
4     ALUop Rx,-,- ; JumpRegister Rx                            1%
5     ALUop Rx,-,- ; Branch Rx,-,# or Branch -,Rx,#             2%
6     Load Rx,-(-) ; ALUop -,-,Rx or ALUop -,Rx,-               15%
7     Load Rx,-(-) ; Load -,-(Rx) or Store -,-(Rx)              3%
8     Load Rx,-(-) ; Branch Rx,-,# or Branch -,Rx,#             2%
9     Load Rx,-(-) ; JumpRegister Rx                            1%

50

More Questions
• List the instruction sequences from the previous table that cause data stalls in the pipeline. Indicate the corresponding number of stall cycles.
• Compute the CPI for the pipeline due to data hazards only. Ignore instruction sequences that are not listed in the table.
• If the frequency of conditional branches is 10%, of which 65% are taken, and the frequency of unconditional branches is 6%, compute the overall CPI assuming a TAKEN branch prediction scheme.

51

Summary

• Pipelining overlaps the execution of instructions
  • Improves instruction throughput, which reduces the latency of a long program
• Problem: structural, data, and control hazards
  • Hazards occur if there are dependences and the pipeline exposes them
• Common solutions: stall, forwarding, scheduling
• Performance:

$$\mathrm{CPI}_{real} = \mathrm{CPI}_{ideal} + \mathrm{Stalls}_{structural} + \mathrm{Stalls}_{data} + \mathrm{Stalls}_{control}$$

$$\text{Cycle time}_{real} = \mathrm{Time}_{\text{longest pipestage}} + \text{Register overhead}$$

• What makes pipelining easier
  • Simple instructions (load-stores, branches)
  • Fixed length, encoding with few formats
