COMP381 by M. Hamdi: Improving Processor Performance with Pipelining

Jan 01, 2016

Page 1: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Improving Processor Performance with Pipelining

Page 2: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Introduction to Pipelining
• Pipelining: an implementation technique that overlaps the execution of multiple instructions. It is a key technique in achieving high performance.
• Laundry example: Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold.
– Washer takes 30 minutes
– Dryer takes 40 minutes
– “Folder” takes 20 minutes

Page 3: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Figure: task order (A, B, C, D) against time from 6 PM to midnight; each load takes 30 + 40 + 20 minutes and starts only after the previous load has finished.]

Page 4: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelined Laundry: Start Work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
• Speedup = 6/3.5 = 1.7
[Figure: task order (A, B, C, D) against time from 6 PM to midnight; the overlapped schedule is divided into intervals of 30, 40, 40, 40, 40, and 20 minutes.]
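The laundry figures can be checked with a short sketch. It assumes each machine handles one load at a time, loads are processed in order, and a finished load can wait for the next machine without blocking the one behind it; the function names are illustrative and not from the slides.

```python
def pipelined_finish_time(stage_minutes, loads):
    """Total time when every stage serves one load at a time, in order."""
    # free_at[s] = time at which stage s finishes the load it last accepted
    free_at = [0] * len(stage_minutes)
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stage_minutes):
            start = max(t, free_at[s])  # wait for the load and for the machine
            t = start + duration
            free_at[s] = t
    return free_at[-1]


def sequential_finish_time(stage_minutes, loads):
    """Total time when a load starts only after the previous one is folded."""
    return loads * sum(stage_minutes)


stages = [30, 40, 20]  # washer, dryer, "folder" (minutes)
seq = sequential_finish_time(stages, 4)
pipe = pipelined_finish_time(stages, 4)
print(seq / 60, pipe / 60, round(seq / pipe, 1))  # prints 6.0 3.5 1.7
```

The pipelined total is 30 + 4 x 40 + 20 minutes because the 40-minute dryer sets the pace once the pipeline is full.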

Page 5: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining Lessons
• Latency vs. throughput
• Question
– What is the latency in both cases?
– What is the throughput in both cases?
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload.
[Figure: pipelined laundry schedule for loads A to D with intervals of 30, 40, 40, 40, 40, and 20 minutes.]

Page 6: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining Lessons [contd…]
• Question
– What is the fastest operation in the example?
– What is the slowest operation in the example?
Pipeline rate is limited by the slowest pipeline stage.
[Figure: pipelined laundry schedule for loads A to D with intervals of 30, 40, 40, 40, 40, and 20 minutes.]

Page 7: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining Lessons [contd…]
Multiple tasks operate simultaneously using different resources.
[Figure: pipelined laundry schedule for loads A to D with intervals of 30, 40, 40, 40, 40, and 20 minutes.]

Page 8: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining Lessons [contd…]
• Question
– Would the speedup increase if we had more steps?
Potential speedup = number of pipe stages
[Figure: pipelined laundry schedule for loads A to D with intervals of 30, 40, 40, 40, 40, and 20 minutes.]

Page 9: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining Lessons [contd…]
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
• Question
– Would it make a difference if the “folder” also took 40 minutes?
Unbalanced lengths of pipe stages reduce speedup.

Page 10: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining Lessons [contd…]
Time to “fill” the pipeline and time to “drain” it reduce speedup.
[Figure: pipelined laundry schedule for loads A to D with intervals of 30, 40, 40, 40, 40, and 20 minutes.]
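The effect of fill and drain time can be made concrete with a short derivation. It assumes an idealized pipeline of k perfectly balanced stages, each taking time t, processing n tasks with no stalls; the symbols k, t, and n are introduced here for illustration and do not appear on the slides.

```latex
\begin{align*}
T_{\text{sequential}} &= n \, k \, t \\
T_{\text{pipelined}}  &= (k + n - 1) \, t
  && \text{($k$ steps for the first task, one more step per later task)} \\
\text{Speedup} &= \frac{n \, k}{k + n - 1}
  \;\longrightarrow\; k \quad \text{as } n \to \infty
\end{align*}
```

For n = 4 tasks and k = 3 stages this gives 12/6 = 2 rather than 3, so with few tasks the fill and drain steps take a visible share of the total time.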

Page 11: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining a Digital System
• Key idea: break a big computation up into pieces
• Separate each piece with a pipeline register
[Figure: a 1 ns block of logic split into five 200 ps pieces separated by pipeline registers.]

Page 12: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining a Digital System
• Why do this? Because it's faster for repeated computations
– Non-pipelined (one 1 ns block): 1 operation finishes every 1 ns
– Pipelined (five 200 ps stages): 1 operation finishes every 200 ps

Page 13: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Comments about pipelining
• Pipelining improves throughput, but not latency
– An answer is available every 200 ps, BUT
– A single computation still takes 1 ns
• Limitations:
– Computations must be divisible into stages of equal size
– Pipeline registers add overhead

Page 14: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Another Example: Unpipelined System
[Figure: one block of combinational logic (30 ns) followed by a clocked register (3 ns); operations Op1, Op2, Op3 follow one another on the time axis.]
Delay = 33 ns
Throughput = 30 MHz
– One operation must complete before the next can begin
– Operations are spaced 33 ns apart

Page 15: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

3 Stage Pipelining
[Figure: three blocks of combinational logic (10 ns each), each followed by a clocked register (3 ns); operations Op1 to Op4 overlap on the time axis.]
Delay = 39 ns
Throughput = 77 MHz
– Operations are spaced 13 ns apart
– 3 operations occur simultaneously

Page 16: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Limitation: Nonuniform Pipelining
[Figure: three blocks of combinational logic with delays of 5 ns, 15 ns, and 10 ns, each followed by a clocked register (3 ns).]
Delay = 18 * 3 = 54 ns
Throughput = 55 MHz
• Throughput is limited by the slowest stage
• Delay is determined by clock period * number of stages
• Must attempt to balance stages

Page 17: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Limitation: Deep Pipelines
• Diminishing returns as more pipeline stages are added
• Register delays become the limiting factor
• Increased latency
• Small throughput gains
• More hazards
[Figure: six blocks of combinational logic (5 ns each), each followed by a clocked register (3 ns).]
Delay = 48 ns, Throughput = 125 MHz (one result per 8 ns clock period)
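The delay and throughput figures in these register-separated examples follow one rule: the clock period is the slowest logic delay plus the register delay, and the delay of one operation is the clock period times the number of stages. A minimal sketch, with an illustrative function name that is not from the slides:

```python
def register_pipeline_metrics(logic_delays_ns, register_delay_ns):
    """Return (clock period in ns, delay in ns, throughput in MHz)."""
    # The clock must accommodate the slowest stage plus its register.
    clock_period = max(logic_delays_ns) + register_delay_ns
    # Every stage takes one full clock period, even the fast ones.
    delay = clock_period * len(logic_delays_ns)
    # One operation completes per clock period; 1 / ns = 1000 MHz.
    throughput_mhz = 1000.0 / clock_period
    return clock_period, delay, throughput_mhz


examples = {
    "unpipelined": [30],
    "3 stages, balanced": [10, 10, 10],
    "3 stages, nonuniform": [5, 15, 10],
    "6 stages": [5] * 6,
}
for name, delays in examples.items():
    period, delay, mhz = register_pipeline_metrics(delays, 3)
    print(f"{name}: period {period} ns, delay {delay} ns, {mhz:.1f} MHz")
```

Going from three balanced stages to six halves the logic delay per stage but not the 3 ns register delay, which is why the throughput gain shrinks as the pipeline gets deeper.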

Page 18: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Computer (Processor) Pipelining
• It is one KEY method of achieving high performance in modern microprocessors
• It is being used in many different designs (not just processors)
– http://www.siliconstrategies.com/story/OEG20020820S0054
• It is a completely hardware mechanism
• A major advantage of pipelining over “parallel processing” is that it is not visible to the programmer
• An instruction execution pipeline involves a number of steps, where each step completes a part of an instruction.
• Each step is called a pipe stage or a pipe segment.

Page 19: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining
• Multiple instructions overlapped in execution
• Throughput optimization: doesn’t reduce time for individual instructions
[Figure: a seven-stage pipeline (Stage 1 to Stage 7), shown first with Instr 1 and Instr 2 in flight and then with Instr 1, Instr 2, and Instr 3 in flight.]

Page 20: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Computer Pipelining
• The stages or steps are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.
• Throughput of an instruction pipeline is determined by how often an instruction exits the pipeline.
• The time to move an instruction one step down the line is equal to the machine cycle (the clock period, i.e. the inverse of the clock rate) and is determined by the stage with the longest processing delay (slowest pipeline stage).

Page 21: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining: Design Goals
• An important pipeline design consideration is to balance the length of each pipeline stage.
• If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls) is:
$$\frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$
• Under these ideal conditions:
– Speedup from pipelining equals the number of pipeline stages: n,
– One instruction is completed every cycle, CPI = 1.

Page 22: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining: Design Goals
• Under these ideal conditions:
– Speedup from pipelining equals the number of pipeline stages: n,
– One instruction is completed every cycle, CPI = 1.
– This is an asymptote of course, but +10% is commonly achieved
– Difference is due to difficulty in achieving balanced stage design
• Two ways to view the performance mechanism
– Reduced CPI (i.e. non-piped to piped change)
• Close to 1 instruction/cycle if you’re lucky
– Reduced cycle-time (i.e. increasing pipeline depth)
• Work split into more stages
• Simpler stages result in faster clock cycles

Page 23: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Implementation of MIPS
• We use the MIPS processor as an example to demonstrate the concepts of computer pipelining.
• The MIPS ISA is designed based on sound measurements and sound architectural considerations (as covered in class).
• It is used by numerous companies (e.g. Nintendo and PlayStation) through licensing agreements.
• These same concepts are being used by ALL other processors as well.

Page 24: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

MIPS64 Instruction Format

I-type instruction
Bit positions:  0-5     6-10   11-15  16-31
Field:          Opcode  rs     rt     Immediate
Width (bits):   6       5      5      16
Encodes: loads and stores of bytes, words, and half words; all immediates (rd <- rs op immediate); conditional branch instructions (rs1 is register, rd unused); jump register and jump and link register (rd = 0, rs = destination, immediate = 0).

R-type instruction
Bit positions:  0-5     6-10   11-15  16-20  21-25  26-31
Field:          Opcode  rs     rt     rd     shamt  func
Width (bits):   6       5      5      5      5      6
Register-register ALU operations: rd <- rs func rt. The function field encodes the data path operation: Add, Sub, ... Also read/write of special registers and moves.

J-type instruction
Bit positions:  0-5     6-31
Field:          Opcode  Offset added to PC
Width (bits):   6       26
Jump and jump and link. Trap and return from exception.
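The field layout can be turned into a small decoder. This is a sketch under stated assumptions: the slide numbers bits from the left, so bit 0 is the most significant bit of the 32-bit word, and the function name `decode` and the dictionary keys are illustrative. The two sample words are the standard MIPS32 encodings of `add $8, $9, $10` and `lw $8, -4($9)`, which do not appear on the slides.

```python
def decode(word):
    """Split a 32-bit instruction word into the fields of all three formats."""
    immediate = word & 0xFFFF            # slide bits 16-31
    if immediate & 0x8000:               # sign-extend the 16-bit immediate
        immediate -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,   # slide bits 0-5
        "rs": (word >> 21) & 0x1F,       # slide bits 6-10
        "rt": (word >> 16) & 0x1F,       # slide bits 11-15
        "rd": (word >> 11) & 0x1F,       # slide bits 16-20 (R-type)
        "shamt": (word >> 6) & 0x1F,     # slide bits 21-25 (R-type)
        "func": word & 0x3F,             # slide bits 26-31 (R-type)
        "immediate": immediate,          # I-type, sign-extended
        "offset": word & 0x3FFFFFF,      # slide bits 6-31 (J-type)
    }


r_type = decode(0x012A4020)  # add $8, $9, $10
i_type = decode(0x8D28FFFC)  # lw $8, -4($9)
print(r_type["rs"], r_type["rt"], r_type["rd"], hex(r_type["func"]))
print(hex(i_type["opcode"]), i_type["rs"], i_type["rt"], i_type["immediate"])
```

The decoder extracts every field regardless of format; which fields are meaningful depends on the opcode, as the format descriptions state.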

Page 25: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

A Basic Multi-Cycle Implementation of MIPS
• Every integer MIPS instruction can be implemented in at most five clock cycles (branch – 2 cycles, store – 4 cycles, other – 5 cycles):
1 Instruction fetch cycle (IF):
IR <- Mem[PC]
NPC <- PC + 4
2 Instruction decode/register fetch cycle (ID):
A <- Regs[rs];
B <- Regs[rt];
Imm <- ((IR16)^16 ## IR16..31), the sign-extended immediate field of IR
Note: IR (instruction register) and NPC (next sequential program counter register); A, B, and Imm are temporary registers.

Page 26: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

A Basic Implementation of MIPS (continued)
3 Execution/Effective address cycle (EX):
– Memory reference:
ALUOutput <- A + Imm;
– Register-Register ALU instruction:
ALUOutput <- A op B;
– Register-Immediate ALU instruction:
ALUOutput <- A op Imm;
– Branch:
ALUOutput <- NPC + Imm;
Cond <- (A == 0)

Page 27: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

A Basic Implementation of MIPS (continued)
4 Memory access/branch completion cycle (MEM):
– Memory reference:
LMD <- Mem[ALUOutput] or Mem[ALUOutput] <- B;
– Branch:
if (cond) PC <- ALUOutput else PC <- NPC
Note: LMD (load memory data) register

Page 28: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

A Basic Implementation of MIPS (continued)
5 Write-back cycle (WB):
– Register-Register ALU instruction:
Regs[rd] <- ALUOutput;
– Register-Immediate ALU instruction:
Regs[rt] <- ALUOutput;
– Load instruction:
Regs[rt] <- LMD;
Note: LMD (load memory data) register
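The five steps can be traced with a tiny interpreter. This is a simplified sketch, not the course's implementation: instructions are Python dictionaries instead of encoded words, only a handful of operations are modelled, and the names `step`, `run`, and the mnemonic `beqz` (for the slides' branch with Cond <- (A == 0)) are assumptions made for illustration.

```python
def step(pc, regs, imem, dmem):
    """Run one instruction through IF, ID, EX, MEM, WB; return the next PC."""
    # 1. IF: IR <- Mem[PC]; NPC <- PC + 4
    ir = imem[pc // 4]
    npc = pc + 4
    # 2. ID: A <- Regs[rs]; B <- Regs[rt]; Imm <- immediate field
    a = regs[ir.get("rs", 0)]
    b = regs[ir.get("rt", 0)]
    imm = ir.get("imm", 0)
    op = ir["op"]
    # 3. EX: effective address, ALU result, or branch target and condition
    cond = False
    if op in ("lw", "sw", "addi"):
        alu_output = a + imm
    elif op == "add":
        alu_output = a + b
    elif op == "beqz":
        alu_output = npc + imm
        cond = (a == 0)
    else:
        raise ValueError(f"unsupported operation: {op}")
    # 4. MEM: load, store, or branch completion
    if op == "beqz":
        return alu_output if cond else npc
    lmd = None
    if op == "lw":
        lmd = dmem.get(alu_output, 0)   # LMD <- Mem[ALUOutput]
    elif op == "sw":
        dmem[alu_output] = b            # Mem[ALUOutput] <- B
    # 5. WB: write the result into the register file
    if op == "add":
        regs[ir["rd"]] = alu_output
    elif op == "addi":
        regs[ir["rt"]] = alu_output
    elif op == "lw":
        regs[ir["rt"]] = lmd
    return npc


def run(imem):
    """Execute a program from PC = 0 until the PC leaves instruction memory."""
    regs, dmem, pc = [0] * 32, {}, 0
    while 0 <= pc < 4 * len(imem):
        pc = step(pc, regs, imem, dmem)
    return regs, dmem


program = [
    {"op": "addi", "rs": 0, "rt": 1, "imm": 5},    # R1 <- 5
    {"op": "sw", "rs": 0, "rt": 1, "imm": 100},    # Mem[100] <- R1
    {"op": "lw", "rs": 0, "rt": 2, "imm": 100},    # R2 <- Mem[100]
    {"op": "add", "rs": 1, "rt": 2, "rd": 3},      # R3 <- R1 + R2
    {"op": "beqz", "rs": 0, "imm": 4},             # R0 == 0, so skip one instruction
    {"op": "addi", "rs": 0, "rt": 3, "imm": 99},   # skipped by the branch
]
regs, dmem = run(program)
print(regs[3], dmem[100])  # prints 10 5
```

The branch returns before the write-back step and the store writes no register, which mirrors the shorter cycle counts listed for branches and stores.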

Page 29: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Basic MIPS Multi-Cycle Integer Datapath Implementation

Page 30: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Simple MIPS Pipelined Integer Instruction Processing

Time in clock cycles (clock number):
Instruction       1    2    3    4    5    6    7    8    9
Instruction I     IF   ID   EX   MEM  WB
Instruction I+1        IF   ID   EX   MEM  WB
Instruction I+2             IF   ID   EX   MEM  WB
Instruction I+3                  IF   ID   EX   MEM  WB
Instruction I+4                       IF   ID   EX   MEM  WB

The cycles before the first instruction completes are the time to fill the pipeline. The first instruction, I, completes in cycle 5; the last instruction, I+4, completes in cycle 9.

MIPS Pipeline Stages:
IF = Instruction Fetch
ID = Instruction Decode
EX = Execution
MEM = Memory Access
WB = Write Back
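A chart like this can be generated for any number of instructions. The sketch below assumes the ideal case with no stalls, where instruction i enters IF in cycle i + 1 and advances one stage per cycle; `pipeline_chart` is an illustrative name, not something defined in the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(num_instructions):
    """Return one row per instruction; row[c] is the stage in cycle c + 1."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i reaches stage s in cycle i + s + 1
        rows.append(row)
    return rows


chart = pipeline_chart(5)
for i, row in enumerate(chart):
    label = "I" if i == 0 else f"I+{i}"
    print(f"{label:<5}" + "".join(f"{cell:<5}" for cell in row))
```

With n instructions and 5 stages the chart is n + 4 cycles wide, so five instructions finish in 9 cycles instead of the 25 that back-to-back five-cycle execution would need.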

Page 31: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining The MIPS Processor
• There are 5 steps in instruction execution:
1. Instruction Fetch
2. Instruction Decode and Register Read
3. Execute operation or calculate address
4. Memory access
5. Write result into register

Page 32: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Datapath for Instruction Fetch
Instruction <- MEM[PC]
PC <- PC + 4
[Figure: the PC, an instruction memory with ADDR input and RD output (the instruction), and an adder (ADD) with the constant 4.]

Page 33: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Datapath for R-Type Instructions
add rd, rs, rt
R[rd] <- R[rs] + R[rt];
Instruction fields: op | rs | rt | rd | shamt | funct
[Figure: register file (5-bit inputs RN1, RN2, WN; data input WD; outputs RD1, RD2; control RegWrite) and ALU (3-bit Operation control, Zero output).]

Page 34: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Datapath for Load/Store Instructions
lw rt, offset(rs)
R[rt] <- MEM[R[rs] + s_extend(offset)];
Instruction fields: op | rs (5 bits) | rt (5 bits) | offset/immediate (16 bits)
[Figure: register file (RN1, RN2, WN, WD, RD1, RD2, RegWrite), sign-extension unit EXTND (16 to 32 bits), ALU (3-bit Operation control, Zero output), and data memory (ADDR, RD, WD, MemRead, MemWrite).]

Page 35: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Datapath for Load/Store Instructions
sw rt, offset(rs)
MEM[R[rs] + sign_extend(offset)] <- R[rt]
Instruction fields: op | rs (5 bits) | rt (5 bits) | offset/immediate (16 bits)
[Figure: register file (RN1, RN2, WN, WD, RD1, RD2, RegWrite), sign-extension unit EXTND (16 to 32 bits), ALU (3-bit Operation control, Zero output), and data memory (ADDR, RD, WD, MemRead, MemWrite).]

Page 36: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

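Datapath for Branch Instructions
beq rs, rt, offset
if (R[rs] == R[rt]) then PC <- PC+4 + s_extend(offset<<2)
Instruction fields: op | rs (5 bits) | rt (5 bits) | offset/immediate (16 bits)
[Figure: register file (RN1, RN2, WN, WD, RD1, RD2, RegWrite), ALU (Operation control, Zero output), sign-extension unit EXTND (16 to 32 bits), a shift-left-by-2 unit (<<2), and an adder (ADD) that takes PC+4 from the instruction datapath.]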

Page 37: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

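Single-Cycle Processor
[Figure: complete single-cycle datapath with PC, instruction memory (ADDR, RD), two adders (ADD, one with the constant 4), register file (RN1, RN2, WN, WD, RD1, RD2), sign-extension unit EXTND (16 to 32 bits), shift-left-by-2 unit (<<2), ALU, data memory (ADDR, RD, WD), and multiplexers (MUX). The datapath is divided into five regions:]
IF: Instruction Fetch
ID: Instruction Decode
EX: Execute / Address Calc.
MEM: Memory Access
WB: Write Back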

Page 38: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining - Key Idea
• Question: What happens if we break execution into multiple cycles?
• Answer: in the best case, we can start executing a new instruction on each clock cycle - this is pipelining
• Pipelining stages:
– IF - Instruction Fetch
– ID - Instruction Decode
– EX - Execute / Address Calculation
– MEM - Memory Access (read / write)
– WB - Write Back (results into register file)

Page 39: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipeline Registers
• Pipeline registers are named with 2 stages (the stages that the register is “between”).
• ANY information needed in a later pipeline stage MUST be passed via a pipeline register
– Example: the IF/ID register gets
• instruction
• PC+4
• No register is needed after WB. Results from the WB stage are already stored in the register file, which serves as a pipeline register between instructions.

Page 40: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Basic Pipelined Processor
[Figure: the single-cycle datapath (PC, instruction memory, adders, register file, EXTND, <<2, ALU, data memory, multiplexers) with four pipeline registers inserted between the stages: IF/ID, ID/EX, EX/MEM, and MEM/WB.]

Page 41: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Single-Cycle vs. Pipelined Execution
[Figure, non-pipelined: instruction order against time from 0 to 1800 ps. lw $1, 100($0), lw $2, 200($0), and lw $3, 300($0) each pass through Instruction Fetch, REG RD, ALU, MEM, and REG WR; a new instruction starts every 800 ps.]
[Figure, pipelined: instruction order against time from 0 to 1600 ps. The same three instructions pass through the same five steps, each step occupying a 200 ps slot; a new instruction starts every 200 ps.]

Page 42: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelined Example - Executing Multiple Instructions
• Consider the following instruction sequence:
lw $r0, 10($r1)
sw $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10

Page 43: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Executing Multiple Instructions: Clock Cycle 1
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. In flight: LW (IF).]

Page 44: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Executing Multiple Instructions: Clock Cycle 2
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. In flight: LW (ID), SW (IF).]

Page 45: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Executing Multiple Instructions: Clock Cycle 3
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. In flight: LW (EX), SW (ID), ADD (IF).]

Page 46: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Executing Multiple Instructions: Clock Cycle 4
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. In flight: LW (MEM), SW (EX), ADD (ID), SUB (IF).]

Page 47: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Executing Multiple Instructions: Clock Cycle 5
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. In flight: LW (WB), SW (MEM), ADD (EX), SUB (ID).]

Page 48: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Executing Multiple Instructions: Clock Cycle 6
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. In flight: SW (WB), ADD (MEM), SUB (EX).]

Page 49: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Executing Multiple Instructions: Clock Cycle 7
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. In flight: ADD (WB), SUB (MEM).]

Page 50: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Executing Multiple Instructions: Clock Cycle 8
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. In flight: SUB (WB).]

Page 51: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Alternative View - Multicycle Diagram

                      CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw $r0, 10($r1)       IM    REG   ALU   DM    REG
sw $r3, 20($r4)             IM    REG   ALU   DM    REG
add $r5, $r6, $r7                 IM    REG   ALU   DM    REG
sub $r8, $r9, $r10                      IM    REG   ALU   DM    REG

Page 52: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining: Design Goals
• Two ways to view the performance mechanism
– Reduced CPI (i.e. non-piped to piped change)
• Close to 1 instruction/cycle if you’re lucky
– Reduced cycle-time (i.e. increasing pipeline depth)
• Work split into more stages
• Simpler stages result in faster clock cycles

Page 53: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining Performance Example
• Example: for an unpipelined CPU:
– Clock cycle = 1 ns; ALU operations and branches take 4 cycles and memory operations take 5 cycles, with instruction frequencies of 40%, 20%, and 40%, respectively.
– If pipelining adds 0.2 ns to the machine clock cycle, then the speedup in instruction execution from pipelining is:
Non-pipelined average instruction execution time = Clock cycle x Average CPI
= 1 ns x ((40% + 20%) x 4 + 40% x 5) = 1 ns x 4.4 = 4.4 ns
In the pipelined implementation, five stages are used, with an average instruction execution time of 1 ns + 0.2 ns = 1.2 ns.
Speedup from pipelining = Instruction time unpipelined / Instruction time pipelined
= 4.4 ns / 1.2 ns = 3.7 times faster
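The arithmetic of this example can be reproduced in a few lines. The variable and function names below are illustrative; the instruction mix, cycle counts, and the 0.2 ns overhead are the values given in the example, and the pipelined machine is assumed to complete one instruction per cycle.

```python
def average_cpi(mix):
    """Weighted average of cycles per instruction over (frequency, cycles) pairs."""
    return sum(frequency * cycles for frequency, cycles in mix)


clock_ns = 1.0
# (frequency, cycles): ALU operations, branches, memory operations
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]

unpipelined_ns = clock_ns * average_cpi(mix)  # 1 ns x 4.4
pipelined_ns = clock_ns + 0.2                 # one instruction per (longer) cycle
speedup = unpipelined_ns / pipelined_ns

print(round(unpipelined_ns, 2), round(pipelined_ns, 2), round(speedup, 1))
# prints 4.4 1.2 3.7
```

The speedup falls short of the five stages because the unpipelined machine already averages fewer than five cycles per instruction and because pipelining lengthens the clock cycle.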

Page 54: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipeline Throughput and Latency: A More Realistic Example

Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns

Consider the pipeline above with the indicated delays. We want to know the pipeline throughput and the pipeline latency.
Pipeline throughput: instructions completed per second.
Pipeline latency: how long it takes to execute a single instruction in the pipeline.

Page 55: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns

Pipeline throughput: how often an instruction is completed.
$$T = \frac{1\ \text{instr}}{\max\big(lat(IF),\, lat(ID),\, lat(EX),\, lat(MEM),\, lat(WB)\big)} = \frac{1\ \text{instr}}{\max(5\,\text{ns},\, 4\,\text{ns},\, 5\,\text{ns},\, 10\,\text{ns},\, 4\,\text{ns})} = \frac{1\ \text{instr}}{10\,\text{ns}} \quad \text{(ignoring pipeline register overhead)}$$
Pipeline latency: how long it takes to execute an instruction in the pipeline.
$$L = lat(IF) + lat(ID) + lat(EX) + lat(MEM) + lat(WB) = 5\,\text{ns} + 4\,\text{ns} + 5\,\text{ns} + 10\,\text{ns} + 4\,\text{ns} = 28\,\text{ns}$$

Page 56: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns

Simply adding the stage latencies to compute the pipeline latency would only work for an isolated instruction.
[Figure: instructions I1 to I4 overlapped in the pipeline; each later instruction waits longer for the 10 ns MEM stage.]
L(I1) = 28 ns
L(I2) = 33 ns
L(I3) = 38 ns
L(I4) = 43 ns
We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
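The growing latencies can be reproduced with a small simulation. It is a sketch under stated assumptions: each stage handles one instruction at a time, an instruction enters IF as soon as IF is free, and an instruction that finds the next stage busy simply waits without blocking the stage behind it. The function name is illustrative.

```python
def instruction_latencies(stage_ns, count):
    """Latency of each instruction, measured from the start of its first stage."""
    free_at = [0] * len(stage_ns)  # time at which each stage becomes free
    latencies = []
    for _ in range(count):
        fetch_start = free_at[0]   # enter the first stage as soon as it is free
        t = fetch_start
        for s, delay in enumerate(stage_ns):
            start = max(t, free_at[s])  # wait if the stage is still busy
            t = start + delay
            free_at[s] = t
        latencies.append(t - fetch_start)
    return latencies


print(instruction_latencies([5, 4, 5, 10, 4], 4))       # prints [28, 33, 38, 43]
print(instruction_latencies([10, 10, 10, 10, 10], 4))   # prints [50, 50, 50, 50]
```

Each instruction arrives at MEM 5 ns after its predecessor, but MEM needs 10 ns, so the wait grows by 5 ns per instruction; padding every stage to 10 ns makes the latency a constant 50 ns.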

Page 57: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Synchronous Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns

The slowest pipeline stage also limits the latency!
[Figure: instructions I1 to I4 on a time axis from 0 to 60 ns; every stage occupies a full 10 ns slot.]
L(I1) = L(I2) = L(I3) = L(I4) = 50 ns

Page 58: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns

How long does it take to execute (issue) 20000 instructions in this pipeline? (Disregard latency and bubbles caused by branches, cache misses, and hazards.)
$$ExecTime_{pipe} = 20000 \times 10\,\text{ns} = 200000\,\text{ns} = 200\,\mu\text{s}$$
How long would it take using the same modules without pipelining?
$$ExecTime_{non\text{-}pipe} = 20000 \times 28\,\text{ns} = 560000\,\text{ns} = 560\,\mu\text{s}$$

Page 59: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns

Thus the speedup that we got from the pipeline is:
$$Speedup_{pipe} = \frac{ExecTime_{non\text{-}pipe}}{ExecTime_{pipe}} = \frac{560\,\mu\text{s}}{200\,\mu\text{s}} = 2.8$$
How can we improve this pipeline design?
We need to reduce the imbalance between stages to increase the clock speed.

Page 60: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM1  MEM2  WB
Delay:  5 ns  4 ns  5 ns  5 ns  5 ns  4 ns

Now we have one more pipeline stage, but the maximum latency of a single stage is cut in half.
$$T = \frac{1\ \text{instr}}{\max\big(lat(IF),\, lat(ID),\, lat(EX),\, lat(MEM1),\, lat(MEM2),\, lat(WB)\big)} = \frac{1\ \text{instr}}{\max(5\,\text{ns},\, 4\,\text{ns},\, 5\,\text{ns},\, 5\,\text{ns},\, 5\,\text{ns},\, 4\,\text{ns})} = \frac{1\ \text{instr}}{5\,\text{ns}}$$
The new latency for a single instruction is:
$$L = 6 \times 5\,\text{ns} = 30\,\text{ns}$$

Page 61: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM1  MEM2  WB
Delay:  5 ns  4 ns  5 ns  5 ns  5 ns  4 ns

[Figure: instructions I1 to I7 overlapped in the six-stage pipeline, each passing through IF, ID, EX, MEM1, MEM2, and WB.]

Page 62: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM1  MEM2  WB
Delay:  5 ns  4 ns  5 ns  5 ns  5 ns  4 ns

How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, etc., for now.)
$$ExecTime_{pipe} = 20000 \times 5\,\text{ns} = 100000\,\text{ns} = 100\,\mu\text{s}$$
Thus the speedup that we get from the pipeline is:
$$Speedup_{pipe} = \frac{ExecTime_{non\text{-}pipe}}{ExecTime_{pipe}} = \frac{560\,\mu\text{s}}{100\,\mu\text{s}} = 5.6$$

Page 63: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipeline Throughput and Latency

Stage:  IF    ID    EX    MEM1  MEM2  WB
Delay:  5 ns  4 ns  5 ns  5 ns  5 ns  4 ns

What have we learned from this example?
1. It is important to balance the delays in the stages of the pipeline.
2. The throughput of a pipeline is 1/max(delay).
3. The latency is N x max(delay), where N is the number of stages in the pipeline.
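These three lessons can be applied to both pipelines of the example in one short sketch. Like the example, it ignores pipeline register overhead, fill time, and stalls; the function name and dictionary keys are illustrative.

```python
def synchronous_pipeline(stage_ns, instructions):
    """Cycle time, latency, execution times, and speedup for a clocked pipeline."""
    cycle = max(stage_ns)                         # lesson 2: throughput = 1 / max(delay)
    latency = len(stage_ns) * cycle               # lesson 3: latency = N x max(delay)
    pipelined = instructions * cycle              # one instruction completes per cycle
    unpipelined = instructions * sum(stage_ns)    # same modules, no overlap
    return {
        "cycle_ns": cycle,
        "latency_ns": latency,
        "pipelined_ns": pipelined,
        "unpipelined_ns": unpipelined,
        "speedup": unpipelined / pipelined,
    }


five_stage = synchronous_pipeline([5, 4, 5, 10, 4], 20000)
six_stage = synchronous_pipeline([5, 4, 5, 5, 5, 4], 20000)
print(five_stage["latency_ns"], five_stage["speedup"])  # prints 50 2.8
print(six_stage["latency_ns"], six_stage["speedup"])    # prints 30 5.6
```

Splitting the 10 ns MEM stage leaves the total work at 28 ns per instruction but halves the cycle time, which is where the doubled speedup comes from.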

Page 64: COMP381 by M. Hamdi 1 Pipelining Improving Processor Performance with Pipelining.

Pipelining is Not That Easy for Computers
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions.
– Data hazards: arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
– Control hazards: arise from the pipelining of conditional branches and other instructions that change the PC.
• A possible solution is to “stall” the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline
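Stalling for a data hazard can be sketched as a scheduling rule. Everything specific here is an assumption rather than something stated in the slides: a five-stage IF, ID, EX, MEM, WB pipeline, no forwarding, registers read in ID and written in WB, a register file that lets a value written in WB be read by an ID in the same cycle, and instructions described only by a destination register and a list of source registers.

```python
def schedule_with_stalls(program):
    """program: list of (dest, sources). Returns (bubbles per instruction, total cycles)."""
    id_cycles = []       # cycle in which each instruction reads its registers (ID)
    bubbles = []         # stall cycles inserted in front of each instruction
    written_in = {}      # register -> WB cycle of its most recent writer
    for dest, sources in program:
        # Without hazards, ID follows the previous instruction's ID by one cycle.
        earliest = id_cycles[-1] + 1 if id_cycles else 2
        # With hazards, ID must wait until every source has been written back.
        ready = max((written_in.get(reg, 0) for reg in sources), default=0)
        cycle = max(earliest, ready)
        bubbles.append(cycle - earliest)
        id_cycles.append(cycle)
        if dest is not None:
            written_in[dest] = cycle + 3  # ID -> EX -> MEM -> WB
    total_cycles = id_cycles[-1] + 3 if id_cycles else 0
    return bubbles, total_cycles


dependent = [
    ("r1", ["r2"]),         # lw  r1, 0(r2)
    ("r3", ["r1", "r4"]),   # add r3, r1, r4   (needs r1 from the load)
    ("r5", ["r6", "r7"]),   # sub r5, r6, r7
]
print(schedule_with_stalls(dependent))  # prints ([0, 2, 0], 9)
```

Under these assumptions the dependent add waits two cycles, so three instructions take 9 cycles instead of 7; an independent sequence produces no bubbles at all.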