Tomasulo Algorithm

ASSIGNMENT # 1

Subject

“COMPUTER ARCHITECTURE”

Teacher

“Ma’am Aden Iqbal”

By

“Farwa Abdul Hannan”

(12-CS-13)

Monday, 28 March, 2016

NFC – INSTITUTE OF ENGINEERING AND FERTILIZER RESEARCH, FSD


Tomasulo Algorithm

1) Consider the code sequence shown below.

LD F6, 12(R2)

LD F2, 16(R3)

ADDD F0, F2, F4

DIVD F10, F0, F6

SUBD F8, F6, F2

ADDI R2, R2, 8

ADDI R3, R3, 16

ADDD F6, F8, F2

a) Identify all WAR, WAW, and RAW dependencies in the instruction stream.

WAR                 WAW                 RAW
SUBD F8, F6, F2     LD F6, 12(R2)       LD F2, 16(R3)
ADDD F6, F8, F2     ADDD F6, F8, F2     ADDD F0, F2, F4

NIL                 NIL                 ADDD F0, F2, F4
                                        DIVD F10, F0, F6

NIL                 NIL                 LD F6, 12(R2)
                                        SUBD F8, F6, F2
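
The pairwise check behind part a) can be sketched in Python. This is a minimal sketch, not part of the original answer: `PROGRAM` and `find_dependencies` are hypothetical names, and the destination and source registers are transcribed by hand from the code sequence in question 1. Because it compares every pair of instructions, it may report pairs beyond those listed in the answer, for example the dependences through R2 and R3.

```python
# Each entry: (instruction text, destination register, source registers).
# Transcribed by hand from the code sequence in question 1 (an assumption).
PROGRAM = [
    ("LD F6, 12(R2)",    "F6",  ["R2"]),
    ("LD F2, 16(R3)",    "F2",  ["R3"]),
    ("ADDD F0, F2, F4",  "F0",  ["F2", "F4"]),
    ("DIVD F10, F0, F6", "F10", ["F0", "F6"]),
    ("SUBD F8, F6, F2",  "F8",  ["F6", "F2"]),
    ("ADDI R2, R2, 8",   "R2",  ["R2"]),
    ("ADDI R3, R3, 16",  "R3",  ["R3"]),
    ("ADDD F6, F8, F2",  "F6",  ["F8", "F2"]),
]


def find_dependencies(program):
    """Return {"RAW": [...], "WAR": [...], "WAW": [...]} of (earlier, later, reg)."""
    deps = {"RAW": [], "WAR": [], "WAW": []}
    for i, (_, dst_i, srcs_i) in enumerate(program):
        dst_live = True               # False once a later instruction overwrites dst_i
        pending_reads = set(srcs_i)   # sources of i that no later instruction has overwritten
        for j in range(i + 1, len(program)):
            _, dst_j, srcs_j = program[j]
            if dst_live and dst_i in srcs_j:
                deps["RAW"].append((i, j, dst_i))   # j reads what i wrote
            if dst_j in pending_reads:
                deps["WAR"].append((i, j, dst_j))   # j overwrites what i read
                pending_reads.discard(dst_j)
            if dst_live and dst_j == dst_i:
                deps["WAW"].append((i, j, dst_i))   # j overwrites what i wrote
                dst_live = False
    return deps


deps = find_dependencies(PROGRAM)
for kind, pairs in deps.items():
    for i, j, reg in pairs:
        print(f"{kind} on {reg}: {PROGRAM[i][0]}  ->  {PROGRAM[j][0]}")
```

Indices are zero-based positions in the sequence, so `(1, 2, "F2")` under RAW is the pair LD F2, 16(R3) and ADDD F0, F2, F4.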

b) Draw a pipeline diagram of how instructions would issue in a machine using Tomasulo's algorithm as discussed in class. Assume that the FP Add unit has 4 EX phases, the FP Multiply unit has 7 EX phases, and divide has 24 EX phases. FP Adds, Subtracts, and Multiplies are fully pipelined, while divide operations are NOT pipelined.


Cycle 1, 2, 3

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3               Load1  Yes   12+R2
LD    F2     16+  R3   2                      Load2  Yes   16+R3
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6
SUBD  F8     F6   F2
ADDI  R2     R2   8
ADDI  R3     R3   16
ADDD  F6     F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   Yes   ADDD         R(F4)  Load2
      ADD2   No
      ADD3   No
      MULT1  No
      MULT2  No

Register Result Status
Clock      F0       F2       F4       F6       F8       F10      F12      F14
3      FU  ADD1     Load2             Load1
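
The Vj/Vk/Qj/Qk bookkeeping in these tables can be sketched in Python. This is a minimal sketch under stated assumptions: the `ReservationStation` class and the `broadcast` helper are hypothetical names introduced for illustration, values are kept as the symbolic strings used in the tables (such as `M(A2)`), and timing is not modelled. It shows how ADD1, which waits on Load2 at the end of cycle 3, picks up its operand when Load2 writes its result.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[str] = None   # operand values, once available
    vk: Optional[str] = None
    qj: Optional[str] = None   # tags of the units that will produce the operands
    qk: Optional[str] = None

    def ready(self) -> bool:
        """An occupied station may start executing once no operand is pending."""
        return self.busy and self.qj is None and self.qk is None


def broadcast(tag, value, stations, register_status):
    """Deliver a CDB result to every station and register waiting on `tag`."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(register_status.items()):
        if producer == tag:
            register_status[reg] = value


# State at the end of cycle 3: ADDD F0, F2, F4 sits in ADD1 waiting for Load2.
add1 = ReservationStation("ADD1", busy=True, op="ADDD", vk="R(F4)", qj="Load2")
status = {"F0": "ADD1", "F2": "Load2", "F6": "Load1"}
print(add1.ready())   # operand F2 is still pending

# Load2 writes its result, written M(A2) in the tables.
broadcast("Load2", "M(A2)", [add1], status)
print(add1.ready(), add1.vj, status["F2"])
```

Keeping a tag in Qj/Qk instead of a register name is what lets the station receive the value straight from the CDB, without rereading the register file.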

Cycle 4

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4               Load2  Yes   16+R3
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2
ADDI  R2     R2   8
ADDI  R3     R3   16
ADDD  F6     F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   Yes   ADDD         R(F4)  Load2
      ADD2   No
      ADD3   No
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0       F2       F4       F6       F8       F10      F12      F14
4      FU  ADD1     Load2             M(A1)             MULT1

Cycle 5

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5
ADDI  R2     R2   8
ADDI  R3     R3   16
ADDD  F6     F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
4     ADD1   Yes   ADDD  M(A2)  R(F4)
4     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   No
      MULT2  No

Register Result Status
Clock      F0       F2       F4       F6       F8       F10      F12      F14
5      FU  ADD1     M(A2)             M(A1)    ADD2     MULT1

Cycle 6

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5
ADDI  R2     R2   8    6      6
ADDI  R3     R3   16
ADDD  F6     F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD3   No
      MULT2  No




Cycle 7

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7
ADDD  F6     F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD3   No
      MULT2  No




Cycle 8

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD3   Yes   ADDD         M(A2)  ADD2
      MULT2  No

Register Result Status
Clock      F0       F2       F4       F6       F8       F10      F12      F14
8      FU  ADD1     M(A2)             ADD3     ADD2     MULT1

Cycle 9

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9               Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      MULT2  No




Cycle 10

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9        10     Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   Yes   SUBD  M(A1)  M(A2)
4     ADD3   Yes   ADDD  M-M    M(A2)
24    MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register Result Status
Clock      F0       F2       F4       F6       F8       F10      F12      F14
10     FU  M+R4     M(A2)             ADD3     ADD2     MULT1

Cycle 11

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9        10     Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      9        11
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      MULT2  No

Register Result Status
Clock      F0       F2       F4       F6       F8       F10      F12      F14
11     FU  M+R4     M(A2)             ADD3     M-M      MULT1

Cycle 14

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9        10     Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      8        9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8      14

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      MULT2  No

Register Result Status
Clock      F0       F2       F4       F6       F8       F10      F12      F14
14     FU  M+R4     M(A2)             ADD3     M-M      MULT1

Cycle 15

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9        10     Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      8
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8      14       15

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
      MULT2  No

Register Result Status
Clock      F0       F2       F4       F6       F8       F10      F12      F14
15     FU  M+R4     M(A2)             M-M+M    M-M      MULT1

Cycle 35

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9        10     Load3  No
DIVD  F10    F0   F6   4      35
SUBD  F8     F6   F2   5      8        9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8      14       15

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
      MULT2  No

Register Result Status
Clock      F0       F2       F4       F6       F8       F10      F12      F14
35     FU  M+R4     M(A2)             M-M+M    M-M      MULT1

Cycle 36

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9        10     Load3  No
DIVD  F10    F0   F6   4      35       36
SUBD  F8     F6   F2   5      8        9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8      14       15

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
      MULT1  No
      MULT2  No

Register Result Status
Clock      F0       F2       F4       F6       F8       F10      F12      F14
36     FU  M+R4     M(A2)             M-M+M    M-M      (M+R4)/M

c) Tomasulo’s algorithm has a disadvantage. Only one result can complete per clock, per CDB. Using the same latencies as above, find a code sequence of no more than 12 instructions where Tomasulo’s algorithm must stall due to CDB contention. Indicate where this occurs in your sequence.

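The stall occurs in cycle 9 of the sequence from part b). ADDD F0, F2, F4 and SUBD F8, F6, F2 both finish executing in cycle 9, but only one result can be written on the CDB per cycle, so ADDD writes its result in cycle 10 and SUBD has to wait until cycle 11.

Cycle 9

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9               Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      MULT2  No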
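
The CDB arbitration described in part c) can be sketched in Python. This is a minimal sketch under stated assumptions: `schedule_cdb` is a hypothetical helper, a result is taken to be eligible for the CDB in the cycle after execution finishes, and ties go to the instruction that is earlier in program order. The two completion cycles are the Execute entries for ADDD F0 and SUBD F8 in the part b) tables.

```python
def schedule_cdb(exec_done):
    """Assign write-result cycles when only one result may use the CDB per cycle.

    exec_done maps an instruction label to the cycle in which its execution
    finishes. Dict order is treated as program order and breaks ties
    (an assumption of this sketch).
    """
    taken = set()
    write_cycle = {}
    # sorted() is stable, so equal completion cycles keep program order.
    for label, done in sorted(exec_done.items(), key=lambda item: item[1]):
        cycle = done + 1
        while cycle in taken:      # CDB already claimed: stall one more cycle
            cycle += 1
        taken.add(cycle)
        write_cycle[label] = cycle
    return write_cycle


exec_done = {"ADDD F0": 9, "SUBD F8": 9}
writes = schedule_cdb(exec_done)
for label, cycle in writes.items():
    print(f"{label}: executes through cycle {exec_done[label]}, writes in cycle {cycle}")
```

With a second CDB both results could be written in cycle 10, which is why the question states the limit as one result per clock, per CDB.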




2) Evaluate the performance of several implementation options for the following workload:

LOOP:
L.D     F3, R4(R6)      # F3 = MEM[R4+R6]
MUL.D   F4, F3, F2      # F4 = F3*F2
S.D     F4, R3(R6)      # MEM[R3+R6] = F4
A.D     F4, F3, F3      # F4 = F3+F3
A.D     F10, F10, F4    # F10 = F10 + F4
DSUBUI  R6, R6, #4      # R6 = R6 - 4
BNEQ    R6, LOOP        # if R6 != 0, jump to LOOP

Assume the processor implements Tomasulo’s algorithm (with reservation stations and no reorder buffer), as well as the following:

A single instruction is issued per cycle.

None of the functional units is pipelined.

There is no forwarding between or within functional units; results are communicated via the single CDB.

The memory execution unit takes 3 cycles for a load and 2 cycles for a store. Loads and stores have separate reservation stations, but only one load or store can execute at any one time since they share the memory port.

The issue and write-result stages require one cycle each. Address generation is performed separately from the ALU, in the load and store buffers.

Branches execute in the integer unit, and instructions issued after a branch wait until the branch has been resolved and broadcast on the CDB.

Functional Unit Queues and Latencies:

Functional Unit   # of Functional Units   Latency (cycles in EX)   # of Reservation Stations
Memory – Load     1                       3                        2
Memory – Store    1                       2                        2
Integer           1                       1                        5
FP – Add          1                       4                        3
FP – Multiply     1                       2                        2

a) Perform a simulation of the first two iterations for a single-issue architecture. Create the table below.

Iteration 1

Instruction  j     k     Issue Cycle  Execute Cycle  Write Cycle
L.D     F3   R4    R6    1            2              6
MUL.D   F4   F3    F2    2            6              17
S.D     F4   R3    R6    3            17             21
A.D     F4   F3    F3    4            21             26
A.D     F10  F10   F4    5            26             31
DSUBUI  R6   R6    #4    6            31             36
BNEQ    R6   loop        7            37

Iteration 2

Instruction  j     k     Issue Cycle  Execute Cycle  Write Cycle
L.D     F3   R4    R6    8            38             42
MUL.D   F4   F3    F2    9            42             53
S.D     F4   R3    R6    10           53             56
A.D     F4   F3    F3    11           56             61
A.D     F10  F10   F4    12           61             66
DSUBUI  R6   R6    #4    13           66             71
BNEQ    R6   loop        14           71
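
One way to read a per-iteration cost off these two tables is to subtract the iteration 1 cycle from the iteration 2 cycle for each instruction. The Python sketch below only restates the table entries; `ITER1`, `ITER2` and `iteration_deltas` are hypothetical names, and placing the second BNEQ number in the Execute column is an assumption, since the table does not say which column it belongs to.

```python
# (issue, execute, write) cycles copied from the two iteration tables.
# None marks a cell that is empty in the table; putting BNEQ's second
# number in the Execute column is an assumption.
ITER1 = {
    "L.D": (1, 2, 6),
    "MUL.D": (2, 6, 17),
    "S.D": (3, 17, 21),
    "A.D F4": (4, 21, 26),
    "A.D F10": (5, 26, 31),
    "DSUBUI": (6, 31, 36),
    "BNEQ": (7, 37, None),
}
ITER2 = {
    "L.D": (8, 38, 42),
    "MUL.D": (9, 42, 53),
    "S.D": (10, 53, 56),
    "A.D F4": (11, 56, 61),
    "A.D F10": (12, 61, 66),
    "DSUBUI": (13, 66, 71),
    "BNEQ": (14, 71, None),
}


def iteration_deltas(first, second):
    """Cycles between the same event in two consecutive iterations."""
    return {
        name: tuple(
            None if a is None or b is None else b - a
            for a, b in zip(first[name], second[name])
        )
        for name in first
    }


deltas = iteration_deltas(ITER1, ITER2)
for name, (issue, execute, write) in deltas.items():
    print(f"{name:8s} issue +{issue}  execute +{execute}  write +{write}")
```

If the two iterations were already in steady state, every instruction would show the same difference in every column, so any spread in the printed values points to entries worth rechecking.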

b) What is the performance bottleneck?

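A bottleneck is the point where a system cannot handle work at the rate at which it arrives, so everything behind it is delayed. In the tables above, each instruction's Execute cycle equals the Write cycle of the instruction before it, so the instructions of an iteration execute one after another, and the second iteration's L.D does not execute until cycle 38, after the first iteration's branch in cycle 37.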

c) What is the “steady state” of this loop – that is how many cycles will an average

loop iteration take if loop startup and shutdown effects are ignored?

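Steady state is the behaviour of the loop once every iteration repeats the same timing pattern. Going by the two tables above, corresponding instructions in iterations 1 and 2 execute 34 to 36 cycles apart (for example, L.D executes in cycle 2 and then in cycle 38), so an iteration takes roughly 35 cycles. The loop stops iterating once R6 becomes equal to zero.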

d) Where will the first issue stall occur?

The first stall occurs at the second instruction, MUL.D F4, F3, F2: it cannot execute until L.D has written F3, so it waits on a RAW dependence.
