Advanced Computer Architectures Laboratory on DLX Pipelining Vittorio Zaccaria

Dec 14, 2015


Nicole Weeks

Advanced Computer Architectures

Laboratory on DLX Pipelining

Vittorio Zaccaria

DLX Load/Store Architecture

Registers are faster than memory
The compiler can do deeper optimization
16-bit offsets and immediates
32-bit integer registers
64-bit floating-point registers
Fixed operation encoding:
–Addressing mode contained in the operation code
–Fits in one word
–Faster decoding

DLX (cont.)

32 general-purpose registers
32-bit instructions:

Format              Fields [bit range]
Register-Register   Op [31:26], Rs1 [25:21], Rs2 [20:16], Rd [15:11], Opx [low-order bits; the figure marks bits 10, 6 and 5]
Register-Immediate  Op [31:26], Rs1 [25:21], Rd [20:16], immediate [15:0]
Branch              Op [31:26], Rs1 [25:21], Rs2/Opx [20:16], immediate [15:0]
Jump / Call         Op [31:26], target [25:0]
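The bit ranges of the Register-Immediate and Jump / Call formats can be checked with a few shifts and masks. The sketch below is only an illustration: the function names are hypothetical, the opcode value used in the example is arbitrary, and treating the 16-bit immediate as a signed value is an assumption.

```python
def decode_register_immediate(word: int) -> dict:
    """Split a 32-bit word using the Register-Immediate layout:
    Op [31:26], Rs1 [25:21], Rd [20:16], immediate [15:0]."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit unsigned word")
    immediate = word & 0xFFFF
    # ASSUMPTION: the immediate is interpreted as a signed 16-bit value.
    if immediate & 0x8000:
        immediate -= 0x10000
    return {
        "op": (word >> 26) & 0x3F,
        "rs1": (word >> 21) & 0x1F,
        "rd": (word >> 16) & 0x1F,
        "immediate": immediate,
    }


def decode_jump(word: int) -> dict:
    """Split a 32-bit word using the Jump / Call layout:
    Op [31:26], target [25:0]."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit unsigned word")
    return {"op": (word >> 26) & 0x3F, "target": word & 0x3FFFFFF}


if __name__ == "__main__":
    # Hypothetical encoding: opcode value 8 is chosen only for illustration.
    word = (8 << 26) | (3 << 21) | (3 << 16) | 4
    print(decode_register_immediate(word))
```

Because every format keeps Op in bits 31 to 26, the decoder can read the opcode first and then pick the layout, which is what makes the fixed encoding fast to decode.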

DLX Pipeline

Pipeline Visualization

Hazards

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.

–Structural hazards: the hardware cannot support this combination of instructions
–Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
–Control hazards: pipelining of branches and other instructions that change the PC

A common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline.

Structural Hazards

Data Hazards

Control Hazards

An example program:

        .data
dati_a: .word 1,2,3,4,5,6,7,8
dati_b: .word 2,3,4,5,6,7,7,9

        .text
        .global main

        add r3,r0,0
loop:   lw r4,dati_a(r3)
        lw r5,dati_b(r3)
        sub r5,r5,r4
        addi r3,r3,4
        bnez r5,loop
exit:

1st Exercise: Draw pipeline chart

Indicate:
–Data hazards between WB stages and ID stages
–Control hazards between the EX stage and the IF stage

Hazard Identification

Instruction       CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13 CK14
add r3,r0,0       IF   ID   EX   MEM  WB
lw r4,dati_a(r3)       IF   ID   EX   MEM  WB
lw r5,dati_b(r3)            IF   ID   EX   MEM  WB
sub r5,r5,r4                     IF   ID   EX   MEM  WB
addi r3,r3,4                          IF   ID   EX   MEM  WB
bnez r5,loop                               IF   ID   EX   MEM  WB
lw r4,dati_a(r3)                                IF   ID   EX   MEM  WB
lw r5,dati_b(r3)                                     IF   ID   EX   MEM  WB
sub r5,r5,r4                                              IF   ID   EX   MEM  WB
addi r3,r3,4                                                   IF   ID   EX   MEM  WB
bnez r5,loop                                                        IF   ID   EX   MEM

2nd Exercise: Hazard Resolution

Software solution
–NOPs insertion
Hardware solutions
–Bubbles/stalls generation
–Register forwarding
Software optimizations
–Code rescheduling

NOP insertion

        add r3,r0,0
        nop
        nop
loop:   lw r4,dati_a(r3)
        lw r5,dati_b(r3)
        nop
        nop
        sub r5,r5,r4
        addi r3,r3,4
        nop
        bnez r5,loop
        nop

NOP dynamic execution

Instruction       CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13 CK14 CK15 CK16 CK17
add r3,r0,0       IF   ID   EX   MEM  WB
nop                    IF   ID   EX   MEM  WB
nop                         IF   ID   EX   MEM  WB
First loop:
lw r4,dati_a(r3)                 IF   ID   EX   MEM  WB
lw r5,dati_b(r3)                      IF   ID   EX   MEM  WB
nop                                        IF   ID   EX   MEM  WB
nop                                             IF   ID   EX   MEM  WB
sub r5,r5,r4                                         IF   ID   EX   MEM  WB
addi r3,r3,4                                              IF   ID   EX   MEM  WB
nop                                                            IF   ID   EX   MEM  WB
bnez r5,loop                                                        IF   ID   EX   MEM  WB
nop                                                                      IF   ID   EX   MEM  WB
Second loop:
lw r4,dati_a(r3)                                                              IF   ID   EX   MEM  WB
........

The loop is composed of 5 instructions and 4 NOPs.

Performance Indexes

CPI = average clock cycles per instruction
Clock cycles = n° instr + n° stalls/nops + 4
(4 is the number of cycles needed to complete the last instruction)
CPI = [Clock cycles]/[n° instr]
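The two indexes can be written down as small helper functions. This is a minimal sketch: the function and parameter names are hypothetical, and the sample values (7 instructions, 6 NOPs, 200 MHz clock) are the ones used for the NOP trace later in this lab.

```python
def trace_cpi(instructions: int, penalty_slots: int, drain_cycles: int = 4) -> float:
    """CPI of a finite trace: (instructions + stalls/nops + drain) / instructions.

    drain_cycles is the number of cycles needed to complete the last
    instruction (4 for the five-stage pipeline used in this lab).
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return (instructions + penalty_slots + drain_cycles) / instructions


def mips(clock_hz: float, cpi: float) -> float:
    """Throughput in millions of instructions per second."""
    return clock_hz / (cpi * 1e6)


if __name__ == "__main__":
    cpi = trace_cpi(instructions=7, penalty_slots=6)
    print(f"CPI  = {cpi:.3f}")
    print(f"MIPS = {mips(200e6, cpi):.2f}")
```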

Performance evaluation of NOPs

Actual CPI = (Instructions+Nops+4)/Instructions = (13+4)/7 =~ 2.42
MIPS = frequency[=200MHz]/(CPI*10^6) = 82.35 MIPS

NOPs Manual Exercise

Manually execute the loop for two iterations (finishing on the nop after the 2nd bnez) and calculate CPI and MIPS.
10 minutes

Results

CPI=(21+4)/11=2.27
MIPS=88

Asymptotic loop performance

Consider an intermediate iteration of the loop.
Count the instructions + nops of that iteration and divide by the number of effective instructions -> asymptotic CPI
10 minutes

Performance evaluation of NOPs (asymptotic)

Asymptotic loop CPI = [(Instructions+Nops)*n+4]/[Instructions*n] = [9n+4]/5n =~ 1.8
MIPS = frequency[=200MHz]/(CPI*10^6) = 111 MIPS
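To see why the +4 term disappears, the loop CPI can be evaluated for a growing number of iterations n. The sketch below uses hypothetical function names and the per-iteration counts of the NOP version of the loop (5 instructions, 4 NOPs).

```python
def loop_cpi(iterations: int, instr_per_iter: int, penalty_per_iter: int,
             drain_cycles: int = 4) -> float:
    """CPI after n iterations: [(instr + penalties) * n + drain] / (instr * n)."""
    if iterations <= 0 or instr_per_iter <= 0:
        raise ValueError("iterations and instr_per_iter must be positive")
    total_cycles = (instr_per_iter + penalty_per_iter) * iterations + drain_cycles
    return total_cycles / (instr_per_iter * iterations)


def asymptotic_cpi(instr_per_iter: int, penalty_per_iter: int) -> float:
    """Limit of loop_cpi for n -> infinity: the drain term vanishes."""
    return (instr_per_iter + penalty_per_iter) / instr_per_iter


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(loop_cpi(n, 5, 4), 4))
    print("limit:", asymptotic_cpi(5, 4))
```

The values start at 13/5 for a single iteration and approach 9/5 as n grows, which is the asymptotic CPI reported on the slide.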

Bubbles

Bubbles are NOPs inserted by the hardware.
Branch instructions cause a NOP to be generated.
Following instructions are stalled.
Preceding instructions continue to execute.

Bubbles Example

Instruction       CK1     CK2     CK3     CK4     CK5     CK6     CK7     CK8     CK9     CK10    CK11    CK12    CK13    CK14    CK15    CK16    CK17
add r3,r0,0       IF      ID      EX      MEM     WB
lw r4,dati_a(r3)          IF      Bubble  Bubble  ID      EX      MEM     WB
lw r5,dati_b(r3)                                  IF      ID      EX      MEM     WB
sub r5,r5,r4                                              IF      Bubble  Bubble  ID      EX      MEM     WB
addi r3,r3,4                                                                      IF      ID      EX      MEM     WB
bnez r5,loop                                                                              IF      Bubble  ID      EX      MEM     WB
lw r4,dati_a(r3)                                                                                          Aborted IF      ID      EX      MEM     WB

Performance evaluation of bubbles

Actual CPI = (Instructions+Bubbles/aborts+4)/Instructions = (7+6+4)/7 =~ 2.42
MIPS = frequency[=200MHz]/(CPI*10^6) = 82.35 MIPS

Verify on the simulator

File -> load code ... -> pipe1.s -> select -> load -> yes
Configuration -> disable forwarding
Open clock cycle diagram
Execute -> single cycle (until the 1st load of the 2nd iteration has been executed)


Result

Manual Exercise

Predict what happens in an intermediate iteration.
Calculate asymptotic CPI and MIPS.
10 minutes

Let’s simulate it

Simulate the program until the 4th iteration.

Solutions

After the 1st iteration, we note the same behavior in every iteration:
–5 instructions
–1 nop
–3 stalls
Asymptotic values: CPI=1.8 MIPS=111.11


Result Forwarding

Forwarding Example

Instruction       CK1     CK2     CK3     CK4     CK5     CK6     CK7     CK8     CK9     CK10    CK11    CK12    CK13
add r3,r0,0       IF      ID      EX      MEM     WB
lw r4,dati_a(r3)          IF      ID      EX      MEM     WB
lw r5,dati_b(r3)                  IF      ID      EX      MEM     WB
sub r5,r5,r4                              IF      ID      Bubble  EX      MEM     WB
addi r3,r3,4                                      IF      Bubble  ID      EX      MEM     WB
bnez r5,loop                                                      IF      ID      EX      MEM     WB
lw r4,dati_a(r3)                                                          Aborted IF      ID      EX      MEM     WB

Simulation of 2 iterations of the loop

Configuration -> enable forwarding
Open clock cycle diagram
File -> Reset DLX
Execute -> single cycle (up to the WB of the 2nd bnez)


Simulation results

Manual Exercise

Calculate CPI and MIPS for the 2 iterations.
Calculate asymptotic CPI and MIPS.
15 minutes

Results

2 iterations:
–11 instructions
–1 nop
–2 stalls
–4 cycles to flush the pipe
CPI=18/11=1.63 MIPS=122

Asymptotic Results

–5 instructions
–1 nop
–1 stall
CPI=[7n+4]/5n=1.4 MIPS=142.86
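The stall and abort counts for the example loop can be reproduced with a small issue-slot model. This is a sketch under stated assumptions, not the WinDLX algorithm: all names are hypothetical, and the minimum producer-to-consumer distances (3 slots without forwarding; with forwarding, 1 after an ALU operation, 2 after a load, plus 1 when the consumer is a branch) are inferred from the counts reported in this lab. One aborted fetch is charged for each taken branch that is followed by another instruction.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read
    kind: str = "alu"        # "alu", "load" or "branch" (assumed taken)


def min_distance(producer: str, consumer: str, forwarding: bool) -> int:
    """Slots a consumer must trail its producer (ASSUMED model, see lead-in)."""
    if not forwarding:
        return 3             # wait until the producer has written back
    distance = 2 if producer == "load" else 1
    if consumer == "branch":
        distance += 1        # ASSUMPTION: branches need their operand one stage earlier
    return distance


def count_penalties(trace: List[Instr], forwarding: bool) -> Tuple[int, int]:
    """Return (stalls, aborts) for a straight-line dynamic trace."""
    produced: Dict[str, Tuple[int, str]] = {}  # register -> (slot, kind) of last writer
    slot = 0                                   # slot the next instruction gets with no hazard
    stalls = aborts = 0
    for index, ins in enumerate(trace):
        issue = slot
        for reg in ins.srcs:
            if reg in produced:
                p_slot, p_kind = produced[reg]
                issue = max(issue, p_slot + min_distance(p_kind, ins.kind, forwarding))
        stalls += issue - slot
        if ins.dest is not None:
            produced[ins.dest] = (issue, ins.kind)
        slot = issue + 1
        if ins.kind == "branch" and index < len(trace) - 1:
            aborts += 1                        # one fetched instruction is discarded
            slot += 1
    return stalls, aborts


PROLOGUE = [Instr("add r3,r0,0", "r3", ("r0",))]
BODY = [
    Instr("lw r4,dati_a(r3)", "r4", ("r3",), "load"),
    Instr("lw r5,dati_b(r3)", "r5", ("r3",), "load"),
    Instr("sub r5,r5,r4", "r5", ("r5", "r4")),
    Instr("addi r3,r3,4", "r3", ("r3",)),
    Instr("bnez r5,loop", None, ("r5",), "branch"),
]

if __name__ == "__main__":
    two_iterations = PROLOGUE + BODY * 2
    for forwarding in (False, True):
        stalls, aborts = count_penalties(two_iterations, forwarding)
        cycles = len(two_iterations) + stalls + aborts + 4
        print(f"forwarding={forwarding}: stalls={stalls}, aborts={aborts}, "
              f"cycles={cycles}, instructions={len(two_iterations)}")
```

Adding one more copy of BODY to the trace adds 1 stall and 1 abort with forwarding and 3 stalls and 1 abort without it, which matches the per-iteration counts used for the asymptotic CPI.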

Speedup

Speedup of A w.r.t. B = [Exec. Time B]/[Exec. Time A]

Calculate asymptotic speedup

Speedup(NOPs,Bubbles)
Speedup(Forwarding,NOPs)
Speedup(Forwarding,Bubbles)
5 minutes

Calculate Asym. speedup

Speedup(NOPs,Bubbles)=1
Speedup(Forwarding,NOPs)=1.29
Speedup(Forwarding,Bubbles)=1.29
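These values follow from the asymptotic CPIs found earlier (1.8 for NOPs and for bubbles, 1.4 for forwarding). The sketch below assumes the same clock frequency and the same number of effective instructions for every version, so the ratio of execution times reduces to a ratio of CPIs; the names are hypothetical.

```python
# Asymptotic CPIs of the example loop, taken from the earlier results.
ASYMPTOTIC_CPI = {"NOPs": 9 / 5, "Bubbles": 9 / 5, "Forwarding": 7 / 5}


def speedup(a: str, b: str) -> float:
    """Speedup of A w.r.t. B = exec time B / exec time A.

    ASSUMPTION: same clock and same effective instruction count,
    so execution time is proportional to CPI.
    """
    return ASYMPTOTIC_CPI[b] / ASYMPTOTIC_CPI[a]


if __name__ == "__main__":
    for a, b in (("NOPs", "Bubbles"), ("Forwarding", "NOPs"), ("Forwarding", "Bubbles")):
        print(f"Speedup({a},{b}) = {speedup(a, b):.2f}")
```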

Scheduling Optimizations

Change the order of operations to minimize stalls/bubbles (forwarding enabled).

Original order:
        lw r3,0(r2)
        add r3,r3,r7
        lw r4,0(r2)
        add r4,r4,r8
        add r4,r4,r3
CPI=(5+2+4)/5

Rescheduled order:
        lw r3,0(r2)
        lw r4,0(r2)
        add r3,r3,r7
        add r4,r4,r8
        add r4,r4,r3
CPI=(5+4)/5
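The two stall counts can be checked by looking for loads whose result is used by the very next instruction. This is a minimal sketch with hypothetical names; it assumes full forwarding, so that a load followed immediately by a consumer is the only case that costs a stall in straight-line code like this.

```python
from typing import List, Optional, Tuple

# (text, destination register, source registers, is_load)
Instr = Tuple[str, Optional[str], Tuple[str, ...], bool]


def load_use_stalls(code: List[Instr]) -> int:
    """Count one stall for every load whose result the next instruction reads.

    ASSUMPTION: full forwarding, no branches, no other hazards.
    """
    stalls = 0
    for previous, current in zip(code, code[1:]):
        _, prev_dest, _, prev_is_load = previous
        _, _, cur_srcs, _ = current
        if prev_is_load and prev_dest in cur_srcs:
            stalls += 1
    return stalls


ORIGINAL: List[Instr] = [
    ("lw r3,0(r2)", "r3", ("r2",), True),
    ("add r3,r3,r7", "r3", ("r3", "r7"), False),
    ("lw r4,0(r2)", "r4", ("r2",), True),
    ("add r4,r4,r8", "r4", ("r4", "r8"), False),
    ("add r4,r4,r3", "r4", ("r4", "r3"), False),
]

RESCHEDULED: List[Instr] = [ORIGINAL[0], ORIGINAL[2], ORIGINAL[1], ORIGINAL[3], ORIGINAL[4]]

if __name__ == "__main__":
    for name, code in (("original", ORIGINAL), ("rescheduled", RESCHEDULED)):
        stalls = load_use_stalls(code)
        print(f"{name}: stalls={stalls}, CPI={(len(code) + stalls + 4) / len(code)}")
```

Moving the second load up puts one independent instruction between each load and its consumer, which is exactly the distance forwarding needs.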

1st Exercise

        addi r1,r0,1
        seq r2,r1,r1
        add r3,r3,r3
Loop:   lw r4,0(r3)
        sub r3,r3,r4
        bnez r1,Loop

Manual Exercises

Draw the conflicts between operations until the end of the 3rd iteration of the loop (last instruction bnez). No forwarding possible.
Insert bubbles/aborts in the right place to solve hazards.
Calculate CPI and throughput of the trace.
Calculate the asymptotic CPI of the loop.
20 minutes

Hazard Diagram

addi r1,r0,1      IF   ID   EX   MEM  WB
seq r2,r1,r1           IF   ID   EX   MEM  WB
add r3,r3,r3                IF   ID   EX   MEM  WB
lw r4,0(r3)                      IF   ID   EX   MEM  WB
sub r3,r3,r4                          IF   ID   EX   MEM  WB
bnez r1,Loop                               IF   ID   EX   MEM  WB
lw r4,0(r3)                                     IF   ID   EX   MEM  WB
sub r3,r3,r4                                         IF   ID   EX   MEM  WB
bnez r1,Loop                                              IF   ID   EX   MEM  WB
lw r4,0(r3)                                                    IF   ID   EX   MEM  WB
sub r3,r3,r4                                                        IF   ID   EX   MEM  WB
bnez r1,Loop                                                             IF   ID   EX   MEM  WB


Bubbles/Stall insertion

CPIs

Trace CPI=[24+4]/12=~2.33
Asymptotic CPI=[6n+4]/3n=~2

Manual Exercises

Suppose now that forwarding is possible.
Draw the new pipeline execution diagram (until the execution of the 3rd bnez) and indicate when stalls must be generated by the hardware.
Calculate CPI and MIPS.
Calculate asymptotic CPI and MIPS.
20 minutes


Pipeline Diagram

Results

CPI=21/12=1.75
Asymptotic CPI=[(4+1)n+4]/3n=~5/3=1.66

2nd exercise

loop:   lw r2,dati_a(r4)
        lw r3,dati_b(r5)
        add r1,r2,r3
        sw dati_a(r6),r1
        addi r4,r4,4
        addi r5,r5,4
        addi r6,r6,4
        j loop

1st part

Assume no forwarding is possible.
Insert bubbles/aborts in the right place to solve hazards.
Calculate the asymptotic CPI of the loop.
Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
20 minutes

Results

8 instructions, 1 NOP, 4 stalls => CPI=~13/8

Results

No forwarding and no scheduling, asymptotic result: 13/8

A Possible Re-Scheduling

loop:   lw r2,dati_a(r4)
        lw r3,dati_b(r5)
        addi r4,r4,4
        addi r5,r5,4
        add r1,r2,r3
        sw dati_a(r6),r1
        addi r6,r6,4
        j loop

Idea: increase the distance of the add from the last lw.

Re-Scheduling results

Scheduled code decreases CPI to 11/8

2nd part

Now assume that forwarding is possible.
Insert the needed bubbles/aborts in the right place to solve hazards.
Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
Calculate the asymptotic CPI of the two loops.
Calculate the speedup between the original code (w/o fw.) and the last rescheduled and forwarded code.
10 minutes

Forwarding Results

With forwarding but without rescheduling we obtain: 10/8

Re-scheduling

We use the same re-scheduled code.
By rescheduling the loop we obtain 9/8

Speedup Results

Total requested speedup is:
CPI[unscheduled,unforwarded]/CPI[scheduled,forwarded] = 13/9

3rd Exercise

loop:   lw r2,dati_a(r1)
        addi r2,r2,4
        lw r3,dati_b(r1)
        addi r3,r3,4
        lw r4,dati_a(r1)
        addi r4,r4,4
        add r2,r2,r3
        add r2,r2,r4
        sw dati_a(r1),r2
        addi r1,r1,4
        bnez r1,loop

1st part

Assume no forwarding is possible.
Insert bubbles/aborts in the right place to solve hazards.
Calculate the asymptotic CPI of the loop.
Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
20 minutes

Bubbles insertion

11 instructions, 1 nop, 12 stalls => CPI=24/11

Rescheduled code

loop:   lw r2,dati_a(r1)
        lw r3,dati_b(r1)
        lw r4,dati_a(r1)
        addi r2,r2,4
        addi r3,r3,4
        addi r4,r4,4
        add r2,r2,r3
        add r2,r2,r4
        sw dati_a(r1),r2
        addi r1,r1,4
        bnez r1,loop

Idea: perform the computations after all data has been loaded.

Scheduled code results

11 instr., 1 nop, 7 stalls => CPI=19/11

2nd part

Now assume that forwarding is possible.
Insert the needed bubbles/aborts in the right place to solve hazards.
Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
Calculate the asymptotic CPI of the loop.
Calculate the speedup between the original code (w/o fw.) and the last rescheduled and forwarded code.
10 minutes

Bubbles insertion

11 instr. + 1 NOP + 4 stalls => CPI=16/11

Rescheduling Results

11 instr. + 1 NOP + 1 stall => CPI=13/11
Requested Speedup=24/13

Floating Point Pipeline Hazards

DLX FPU Pipeline

DLX FPU Pipeline

Latency of a FU = number of cycles that must intervene between an instruction that produces a value through the FU and an instruction that uses this value (-1).
Initiation interval of the FU: time that must elapse between issuing two operations to the same FU.
A stall in one pipeline does not mean a stall in the entire processor.

FPU Latencies and I.I.

FU                       Latency  Initiation Interval
Integer ALU              0        1
FP add                   1        1
FP and integer multiply  4        1
FP and integer divide    18       19 [structural hazards!]

WINDLX default latencies
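The table can be turned into a quick stall estimate. The sketch below is an illustration with hypothetical names: it reads latency as the number of cycles that must separate a producer from its consumer, and the initiation interval as the minimum issue distance between two operations on the same FU. It assumes full forwarding and ignores every other hazard.

```python
# Values from the table of WINDLX default latencies.
LATENCY = {"int_alu": 0, "fp_add": 1, "multiply": 4, "divide": 18}
INITIATION_INTERVAL = {"int_alu": 1, "fp_add": 1, "multiply": 1, "divide": 19}


def raw_stalls(producer_fu: str, independent_between: int = 0) -> int:
    """Stalls seen by a consumer placed after `independent_between`
    unrelated instructions that follow the producer."""
    return max(0, LATENCY[producer_fu] - independent_between)


def structural_stalls(fu: str, issue_gap: int = 1) -> int:
    """Stalls seen by a second operation on the same FU.

    issue_gap is the distance in cycles between the two issues if
    nothing stalled (1 means back to back).
    """
    return max(0, INITIATION_INTERVAL[fu] - issue_gap)


if __name__ == "__main__":
    print("multiply then dependent use:", raw_stalls("multiply"))
    print("FP add, one instruction in between:", raw_stalls("fp_add", 1))
    print("two back-to-back divides:", structural_stalls("divide"))
```

Only the divider has an initiation interval larger than 1, which is why it is the only unit in the table that can cause a structural hazard on the FU itself.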

Problems with FPUs

Divide instructions can cause structural hazards and need to be stalled in the ID stage.
More than one instruction may need to write to the register file in the same cycle.
WAW hazards are possible because WB can be reached out of order.
RAW hazards are more frequent due to the longer latency of operations.


Long Stalls even with Full Forwarding

Register file structural hazard solution

Structural hazards on the register file:
Solution: stall one of the instructions before it enters the MEM stage.

FPU WAW Hazards

        ld f6,dati_a(r2)
        ld f2,dati_b(r3)
        multd f6,f2,f4
        subd f6,f2,f2
        addd f6,f8,f2

subd finishes before multd! There is a WAW conflict: if we don't stall subd, multd will overwrite its result.
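A WAW conflict of this kind can be spotted by comparing when each writer of the same register would finish. The sketch below is a simplification with hypothetical names: instructions are assumed to issue one per cycle with no RAW stalls, and the execution lengths (5 cycles for multd, 2 for subd and addd, 1 for ld) are an assumption derived as latency + 1 from the WINDLX default latencies.

```python
from typing import List, Tuple

# ASSUMPTION: execution cycles = latency + 1 (FP add 1, multiply 4, load treated as 1 cycle).
EX_CYCLES = {"ld": 1, "multd": 5, "subd": 2, "addd": 2}


def waw_conflicts(code: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (earlier, later) mnemonic pairs that write the same register
    where the later instruction would finish before the earlier one.

    code is a list of (mnemonic, destination register) in program order.
    ASSUMPTION: one instruction issues per cycle and nothing stalls.
    """
    conflicts = []
    for i, (op_i, dest_i) in enumerate(code):
        done_i = i + EX_CYCLES[op_i]
        for j in range(i + 1, len(code)):
            op_j, dest_j = code[j]
            if dest_j == dest_i and j + EX_CYCLES[op_j] < done_i:
                conflicts.append((op_i, op_j))
    return conflicts


SEQUENCE = [("ld", "f6"), ("ld", "f2"), ("multd", "f6"), ("subd", "f6"), ("addd", "f6")]

if __name__ == "__main__":
    for earlier, later in waw_conflicts(SEQUENCE):
        print(f"{later} would write f6 before {earlier} does")
```

Under these assumptions the check flags every later writer of f6 that would finish before multd, which includes the subd named on the slide.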

Exercise: execute only one iteration of this loop:

loop:   ld f0,dati_a(r2)
        ld f4,dati_b(r3)
        multd f0,f0,f4
        addd f2,f0,f2
        addi r2,r2,8
        addi r3,r3,8
        sub r5,r4,r2
        bnez r5,loop

How many cycles are there between the IF of the 1st ld and the WB of the 1st bnez?

Results

CPI of the trace = 19/8 (8 instructions).