Automobile Manufacturing

1. Build frame. 60 min.

2. Add engine. 50 min.

3. Build body. 80 min.

4. Paint. 40 min.

5. Finish. 45 min.

Total: 275 min.

Latency: Time from start to finish for one car. (smaller is better)

275 minutes per car.

Throughput: Number of finished cars per time unit. (larger is better)

1 car/275 min = 0.218 cars/hour

Issues: How can we make the process better by adding more workers?
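The latency and throughput figures for the one-car-at-a-time shop can be checked with a few lines of Python. This is only a sketch of the arithmetic; the names `stage_minutes`, `latency_minutes`, and `throughput_per_hour` are made up for illustration.

```python
# Stage times in minutes, taken from the slide.
stage_minutes = {"frame": 60, "engine": 50, "body": 80, "paint": 40, "finish": 45}

def latency_minutes(stages):
    """Time for one car to pass through every stage, one after another."""
    return sum(stages.values())

def throughput_per_hour(stages):
    """Cars finished per hour when only one car is built at a time."""
    return 60 / latency_minutes(stages)

print(latency_minutes(stage_minutes))                # 275
print(round(throughput_per_hour(stage_minutes), 3))  # 0.218
```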

6.1

An Assembly line

6.1

[Figure: timing diagram of cars 1-4 moving through the five stages (60, 50, 80, 40, 45 min), with every stage paced at 80 min.]

First two stages can’t produce faster than one car/80 min or a backlog will occur at the third stage.

Last two stages only receive one car/80 min to work on.

Latency: 400 min/car

Throughput: 4 cars/640 min (1 car/160 min)

Will approach 1 car/80 min as time goes on.
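The assembly-line numbers follow from pacing every stage at the slowest stage's time. The sketch below assumes exactly that model (one car enters per 80-minute beat, and each car spends one beat in each stage); `line_timing` is a hypothetical helper, not something from the slides.

```python
stage_minutes = [60, 50, 80, 40, 45]

def line_timing(stages, cars):
    """Return (latency, finish time of the last car, average minutes per car)."""
    beat = max(stages)                 # every stage is paced by the slowest one
    latency = beat * len(stages)       # one car spends one beat in each stage
    finish_last = latency + (cars - 1) * beat
    return latency, finish_last, finish_last / cars

print(line_timing(stage_minutes, 4))      # (400, 640, 160.0)
print(line_timing(stage_minutes, 1000))   # (400, 80320, 80.32)
```

With 1000 cars the average is already close to the 80 min/car limit, which is the sense in which throughput approaches 1 car/80 min.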

Applying Assembly Lines to CPUs

• The single-cycle design did everything “at once”

• Can we break the single-cycle design up into stages?

6.1

• Issues:

• Car assembly works well. Will it be as easy to apply the same technique to a CPU?

Breaking up the Single-Cycle Datapath

6.2

[Figure: single-cycle datapath (PC, Instruction Memory, Registers, sign extend, ALU, Data Memory) with instruction fields Rs:[25-21], Rt:[20-16], Rd:[15-11], Imm:[15-0], divided into five stages.]

Stages from multi-cycle design:

1. Instr. Fetch, PC=PC+4

2. Instr. Decode, Register Fetch

3. Execute, Address Calc.

4. Memory

5. Reg. Write-back

The Key - Pipeline Registers

6.2

[Figure: the same datapath with pipeline registers inserted between the five stages (Instr. Fetch, PC=PC+4; Instr. Decode, Register Fetch; Execute, Address Calc.; Memory; Reg. Write-back). Labels added: clock, PC+4.]

Example: R-type Instruction

6.2

[Figure: an R-type instruction traced through the pipelined datapath. Labels: PC+4.]

Writes the correct data to the wrong register. By the time the result reaches write-back, the "Write reg num" input is being driven by a later instruction in the decode stage.

In general, arrows that go backwards across pipeline stages may be bad news...

Correcting the Write Register Problem

6.2

[Figure: pipelined datapath, revised so that the write register number (Rt:[20-16] or Rd:[15-11], chosen by a 0/1 mux) travels through the pipeline registers with its instruction. Other labels: PC+4.]

Assembly-line Control Signals

In an assembly line, the manufacturing instructions can be attached to the car. The instructions then move along with the car.

[Figure: five cars on the line, one per stage, each carrying the work orders it still needs:]

F: Standard, E: 135 HP, B: 2-door, P: Green, F: Leather

E: 190 HP, B: 4-door, P: Blue, F: Cotton

B: 2-door, P: Lavender, F: Leather

P: Green, F: Vinyl

F: Leather

By separating the control signals by stages, only the signals needed for the current stage must be decoded.

All signals for later stages must be passed along.

6.1

The Pipelined Control Logic

6.3

[Figure: pipelined datapath with control. The Control unit decodes Op:[31-26]; the signal groups for the later stages (labelled E, M, and W in the diagram) are stored in the pipeline registers and passed along with PC+4 and the Rt:[20-16]/Rd:[15-11] write-register mux. Signals shown: RegDest, ALUOp, ALUSrc, ALUcontrol, Branch, PCSrc, MemRead, MemWrite, MemToReg, RegWrite.]

How’d we do?

• Compared to Single-cycle

• 5 stages --> Potentially 5x speedup

• Not likely
  • Stages won’t all be equally long
  • Pipeline registers will cause some delays

• Latency --> Greater than in single-cycle design

• More complexity, but nicely divided up

Example 1

• Consider executing the following code

add $3, $4, $5

and $6, $7, $8

sub $9, $10, $11

on

i) A single-cycle machine with a cycle time of 200 ns

ii) A 5-stage pipeline machine with a cycle time of 50 ns

Which one runs faster?

What if there were 100 instructions instead of 3?
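One way to work Example 1 is sketched below. It assumes the pipeline never stalls, which is reasonable here because the three instructions use disjoint registers. The function names are invented for illustration.

```python
def single_cycle_ns(n_instr, cycle_ns=200):
    """Each instruction takes one long cycle."""
    return n_instr * cycle_ns

def pipelined_ns(n_instr, stages=5, cycle_ns=50):
    """The first instruction needs `stages` cycles; each later one
    finishes one cycle after its predecessor (no stalls assumed)."""
    return (stages + n_instr - 1) * cycle_ns

for n in (3, 100):
    print(n, single_cycle_ns(n), pipelined_ns(n))
# 3 600 350
# 100 20000 5200
```

Under these assumptions the pipeline wins in both cases, and its advantage grows toward the 4x cycle-time ratio as the instruction count rises.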

Analyzing Pipelines

6.4

ADD $10, $14, $0
SUB $12, $13, $2
AND $1, $6, $11
SW $3, 200($9)
OR $9, $13, $7

Cycle:  1   2   3   4   5   6   7   8   9
ADD     IF  RF  EX  M   WB
SUB         IF  RF  EX  M   WB
AND             IF  RF  EX  M   WB
SW                  IF  RF  EX  M   WB
OR                      IF  RF  EX  M   WB

Data Hazards

6.4

ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7

Cycle:  1   2   3   4   5   6   7   8   9
ADD     IF  RF  EX  M   WB                   Writes register $13
SUB         IF  RF  EX  M   WB               Reads wrong $13
AND             IF  RF  EX  M   WB           Reads wrong $13
SW                  IF  RF  EX  M   WB       Reads ? $13
OR                      IF  RF  EX  M   WB   Reads correct $13

Preventing Data Hazards

6.4

ADD $13, $14, $0
NOP
NOP
NOP
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7

Insert NOPs into the instruction stream to allow WB to happen before RF.

Assume we can’t write a register and read the new value in the same cycle.

Pipeline diagram (NOP rows omitted):

Cycle:  1   2   3   4   5   6   7   8   9
ADD     IF  RF  EX  M   WB
SUB                     IF  RF  EX  M   WB
AND                         IF  RF  EX  M
SW                              IF  RF  EX
OR                                  IF  RF

Detecting Hazards

6.5

ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7

Check each instruction as it is being decoded (RF-ID stage). If it reads a register that will be written by any instruction ahead of it (in the EX, M, or WB stages), there is a hazard.

Cycle:  1   2   3   4   5   6   7
ADD     IF  RF  EX  M   WB           Write: $13
SUB         IF  RF  EX  M   WB       Read A: $13
AND             IF  RF  EX  M   WB   Read B: $13
SW                  IF  RF  EX  M    Read A: $13
OR                      IF  RF  EX

Compare write reg # in EX with read reg # in RF

Compare write reg # in M with read reg # in RF

Compare write reg # in WB with read reg # in RF
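The three comparisons can be sketched in software as a single membership test. This is an illustrative model, not the hardware itself: the `Instr` record and `has_hazard` function are hypothetical names, and ignoring writes to register $0 is an added assumption based on MIPS hard-wiring $0 to zero.

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    name: str
    writes: Optional[int]      # destination register number, or None
    reads: Tuple[int, ...]     # source register numbers

def has_hazard(decoding, ahead):
    """True if `decoding` (in RF) reads a register that an instruction
    still in EX, M, or WB is going to write. `ahead` lists those three
    stages in order; None marks an empty stage."""
    pending = {i.writes for i in ahead
               if i is not None and i.writes not in (None, 0)}  # assumption: $0 never causes a hazard
    return any(r in pending for r in decoding.reads)

add = Instr("ADD", 13, (14, 0))
sub = Instr("SUB", 12, (13, 2))
and_ = Instr("AND", 1, (6, 13))
sw = Instr("SW", None, (3, 13))
or_ = Instr("OR", 9, (13, 7))

print(has_hazard(sub, [add, None, None]))   # True: ADD is in EX
print(has_hazard(or_, [sw, and_, sub]))     # False: ADD has left the pipeline
```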

Stalling with Bubbles

6.5

ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $7

Cycle:  1   2   3   4   5   6   7   8   9
ADD     IF  RF  EX  M   WB
SUB         IF  (stall)
SUB             IF  (stall)
SUB                 IF  (stall)
SUB                     IF  RF  EX  M   WB
AND                         IF  RF  EX  M
SW                              IF  RF  EX
OR                                  IF  RF

Stalling:
• Kill the current execution by “neutralizing” all the control signals so that it won’t write any registers.
• Don’t write PC+4 into PC --> Stay at the current instruction and try again.

Register Forwarding

6.6

ADD $13, $14, $0
SUB $12, $13, $2
AND $1, $6, $13
SW $3, 200($13)
OR $9, $13, $2

Register $13’s value is computed in the EX stage of the ADD even though it isn’t written in the register until the WB stage.

--> The pipeline register following the EX stage holds the value of $13 that’s needed in the SUB instruction’s EX stage.

Cycle:  1   2   3   4   5   6   7   8   9
ADD     IF  RF  EX  M   WB
SUB         IF  RF  EX  M   WB
AND             IF  RF  EX  M   WB
SW                  IF  RF  EX  M   WB
OR                      IF  RF  EX  M   WB
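A forwarding unit's choice for one ALU operand can be sketched as below. The pipeline-register names "EX/MEM" and "MEM/WB" follow common textbook usage rather than these slides, and `forward_source` is a hypothetical helper. The nearer producer is checked first so that the most recent value of a register wins.

```python
def forward_source(src_reg, ex_mem_dest, mem_wb_dest):
    """Pick where an EX-stage operand should come from.

    ex_mem_dest / mem_wb_dest: destination register numbers of the
    instructions one and two ahead (None if they write no register)."""
    if src_reg != 0 and src_reg == ex_mem_dest:
        return "EX/MEM"          # result of the instruction one ahead
    if src_reg != 0 and src_reg == mem_wb_dest:
        return "MEM/WB"          # result of the instruction two ahead
    return "register file"       # no forwarding needed

# SUB in EX, ADD (writes $13) one ahead:
print(forward_source(13, 13, None))   # EX/MEM
# AND in EX, SUB (writes $12) one ahead, ADD (writes $13) two ahead:
print(forward_source(13, 12, 13))     # MEM/WB
```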

Unforwardable Loads

6.6

LW $2, 30($2)
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $1

Cycle:  1   2   3   4   5   6   7   8
LW      IF  RF  EX  M   WB
AND         IF  RF  EX  M   WB
SW              IF  RF  EX  M   WB
OR                  IF  RF  EX  M   WB

Loads don’t have the value to write back until the end of the Memory stage. This is one stage too late for the next instruction. ---> We can’t prevent stalls if the instruction following a Load uses the result of the Load.

Example 2

• Consider executing the following code on a 5-stage pipeline datapath

add $3, $4, $5

lw $7, 100($3)

sub $8, $7, $9

1. Identify any potential data dependencies

2. How many cycles will it take to execute this code assuming no register forwarding?

3. How many cycles will it take to execute this code assuming register forwarding is available?
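Questions 2 and 3 can be explored with a small timing model. This is a sketch under stated assumptions, not the official answer key: one instruction is fetched per cycle, a register cannot be written and read in the same cycle (as assumed earlier in these slides), ALU results can be forwarded from the end of EX and load results from the end of M, and the count runs from the first IF to the last WB. The function `total_cycles` and the tuple encoding of the program are invented for illustration.

```python
def total_cycles(program, forwarding):
    """program: list of (dest, sources, is_load) tuples in program order."""
    ready = {}      # register -> earliest cycle in which a consumer may be in RF
    prev_rf = 1     # so that the first instruction is in RF during cycle 2
    last_wb = 0
    for dest, sources, is_load in program:
        rf = prev_rf + 1                       # one stage behind the previous instruction
        for reg in sources:
            rf = max(rf, ready.get(reg, 0))    # stall until every operand is available
        ex, mem, wb = rf + 1, rf + 2, rf + 3
        if dest is not None:
            if forwarding:
                # Consumer's EX must come after the producing stage,
                # so its RF may coincide with that stage.
                ready[dest] = mem if is_load else ex
            else:
                # Consumer must read the register file the cycle after write-back.
                ready[dest] = wb + 1
        prev_rf, last_wb = rf, wb
    return last_wb

example2 = [
    (3, (4, 5), False),   # add $3, $4, $5
    (7, (3,), True),      # lw  $7, 100($3)
    (8, (7, 9), False),   # sub $8, $7, $9
]
print(total_cycles(example2, forwarding=False))   # 13
print(total_cycles(example2, forwarding=True))    # 8
```

Under this model the two dependencies ($3 from add to lw, $7 from lw to sub) each cost three stall cycles without forwarding, while with forwarding only the load-use pair costs one.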

Branch Hazards

6.7

BEQ $2, $1, SKIP
AND $1, $2, $13
SW $3, 200($2)
OR $9, $2, $4
ADD $3, $2, $5

SKIP: LW $2,32($4)

Cycle:  1   2   3   4   5   6   7   8   9
BEQ     IF  RF  EX  M   WB
AND         IF  RF  EX  M   WB
SW              IF  RF  EX  M   WB
OR                  IF  RF  EX  M   WB
LW                      IF  RF  EX  M   WB

Don’t know result of branch until the end of the M stage.

If the branch is taken, we’ve blown it by executing the intervening instructions.

Solution 1: Stall

6.5

Cycle:  1   2   3   4   5   6   7   8   9
BEQ     IF  RF  EX  M   WB
AND         IF  (stall)
AND             IF  (stall)
AND                 IF  (stall)
AND                     IF  RF  EX  M   WB
SW                          IF  RF  EX  M
OR                              IF  RF  EX
ADD                                 IF  RF

SKIP: LW $2,32($4)

(The diagram shows the case where the branch is not taken.)

Stalling always solves the problem. If we didn’t have so many branches in programs, it would not be a problem.

Solution 2: Assume not Taken

6.7

Cycle:  1   2   3   4   5   6   7   8   9
BEQ     IF  RF  EX  M   WB
AND         IF  RF  EX  M   WB               Must be undone if branch is taken!
SW              IF  RF  EX  M   WB           Must be undone if branch is taken!
OR                  IF  RF  EX  M   WB       Must be undone if branch is taken!
LW                      IF  RF  EX  M   WB   Branch is taken...

SKIP: LW $2,32($4)

If we guess right, we win --> No stall at all

If we guessed wrong,
1. We have to undo all that we did (fortunately, no writebacks have occurred yet).
2. We still take all the time of a stall.

6.7

Solution 3: Better Prediction

• Predict that the branch goes the same way as the last time

• Works great for loops

• Works great for “special-case” code

• Need to keep track of the information for each branch, though...

• One or two bits will do

• Keep a small table of recently used branches and which way they went
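The difference between one and two bits of history can be seen by replaying a loop branch through a small model. This is a minimal sketch, assuming a saturating counter per branch that starts at "not taken"; `mispredictions` is a hypothetical helper, and a real predictor would index a table of such counters by branch address.

```python
def mispredictions(outcomes, bits=1):
    """Count wrong guesses for one branch given its outcomes (True = taken),
    using a `bits`-wide saturating counter."""
    top = (1 << bits) - 1
    counter = 0                       # start out predicting "not taken"
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter > top // 2
        if predicted_taken != taken:
            wrong += 1
        # Move toward the actual outcome, saturating at 0 and `top`.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return wrong

# A loop branch taken 9 times and then falling through, run three times over.
outcomes = ([True] * 9 + [False]) * 3
print(mispredictions(outcomes, bits=1))   # 6
print(mispredictions(outcomes, bits=2))   # 5
```

With one bit, every loop exit causes two misses (the exit itself and the first iteration of the next run); with two bits, a single exit does not flip the prediction, so later runs miss only once.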

6.7

Solution 4: Delayed Branches

XOR $1, $3, $3
ADD $2, $3, $4
SUB $4, $3, $1
OR $3, $2, $0
BEQ $10, $11, SKIP
LW $4, 60($2)

SKIP: AND $1, $2, $3

If we had some warning, we could compute the branch ahead of time...

XOR $1, $3, $3
Branch-After-Three-EQ $10,$11,SKIP
ADD $2, $3, $4        (delay slot)
SUB $4, $3, $1        (delay slot)
OR $3, $2, $0         (delay slot)
LW $4, 60($2)

SKIP: AND $1, $2, $3

3 delay slots: These instructions are always executed. Branch can’t depend on them...

3-slot Delayed Branch

6.7

Branch-After-Three-EQ $10,$11,SKIP
ADD $2, $3, $4
SUB $4, $3, $1
OR $3, $2, $0
LW $4, 60($2)

SKIP: AND $1, $2, $3

Cycle:      1   2   3   4   5   6   7   8   9
B3E         IF  RF  EX  M   WB
ADD             IF  RF  EX  M   WB
SUB                 IF  RF  EX  M   WB
OR                      IF  RF  EX  M   WB
LW or AND                   IF  RF  EX  M   WB

Branch summary

• Two decent solutions:

• Branch prediction
  • Requires more hardware
  • Used in modern microprocessors

• Delayed branch
  • Requires special software manipulation
  • Often doesn’t deliver its promise
  • Used often in CPUs 4-10 years ago

Example 3

• Consider executing the following code

LOOP: add $3, $4, $5
and $6, $7, $8
bne $12, $8, LOOP

on

i) A single-cycle machine with a cycle time of 200 ns

ii) A 5-stage pipeline machine with a cycle time of 50 ns

A. Assume the loop executes 10 times

B. Assume the loop executes 100 times

C. Assume the loop executes 1000 times

Which one runs faster?

Example 4

• Consider executing the following code on a 5-stage pipeline datapath

addi $3, $0, 10
LOOPSTART: lw $5, ARRAY($3)
addi $5, $5, 1
sw $5, ARRAY
addi $3, $3, -1
bne $3, $0, LOOPSTART
add $3, $5, $6
sub $7, $8, $9
addi $4, $6, 3

1. Identify potential data dependencies

2. How many cycles will it take to execute this code?

A. With nops/stalls

B. With branch prediction assuming branch not taken

C. With branch prediction based on one previous result
