ASSIGNMENT # 1
Subject: COMPUTER ARCHITECTURE
Teacher: Ma’am Aden Iqbal
By: Farwa Abdul Hannan (12-CS-13)
Monday, 28 March, 2016
NFC INSTITUTE OF ENGINEERING AND FERTILIZER RESEARCH, FSD
Tomasulo Algorithm

Feb 15, 2017

Farwa Ansari
Tomasulo Algorithm

1) Consider the code sequence shown below.

LD    F6, 12(R2)
LD    F2, 16(R3)
ADDD  F0, F2, F4
DIVD  F10, F0, F6
SUBD  F8, F6, F2
ADDI  R2, R2, 8
ADDI  R3, R3, 16
ADDD  F6, F8, F2

a) Identify all WAR, WAW, and RAW dependencies in the instruction stream.

WAR                 WAW                 RAW
SUBD F8, F6, F2     LD F6, 12(R2)       LD F2, 16(R3)
ADDD F6, F8, F2     ADDD F6, F8, F2     ADDD F0, F2, F4

NIL                 NIL                 ADDD F0, F2, F4
                                        DIVD F10, F0, F6

NIL                 NIL                 LD F6, 12(R2)
                                        SUBD F8, F6, F2
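
The hazard classification can also be checked mechanically. The following is a minimal sketch, not part of the assignment: `parse` and `find_hazards` are hypothetical helper names, the first register named in each instruction is assumed to be its destination, and every earlier/later pair is compared without checking for intervening redefinitions. A pairwise scan like this reports pairs beyond those tabulated, for example the WAR on R2 between LD F6, 12(R2) and ADDI R2, R2, 8.

```python
import re

PROGRAM = [
    "LD F6, 12(R2)",
    "LD F2, 16(R3)",
    "ADDD F0, F2, F4",
    "DIVD F10, F0, F6",
    "SUBD F8, F6, F2",
    "ADDI R2, R2, 8",
    "ADDI R3, R3, 16",
    "ADDD F6, F8, F2",
]

REG = re.compile(r"[FR]\d+")


def parse(text):
    """Return (destination, sources); assumes the first register is the destination."""
    regs = REG.findall(text)
    return regs[0], set(regs[1:])


def find_hazards(program):
    """Compare every earlier/later instruction pair in program order."""
    decoded = [parse(line) for line in program]
    found = []
    for i, (dst_i, src_i) in enumerate(decoded):
        for j in range(i + 1, len(decoded)):
            dst_j, src_j = decoded[j]
            if dst_i in src_j:
                found.append(("RAW", dst_i, i, j))  # later reads what earlier writes
            if dst_j in src_i:
                found.append(("WAR", dst_j, i, j))  # later writes what earlier reads
            if dst_i == dst_j:
                found.append(("WAW", dst_i, i, j))  # both write the same register
    return found


for kind, reg, i, j in find_hazards(PROGRAM):
    print(f"{kind} on {reg}: {PROGRAM[i]}  ->  {PROGRAM[j]}")
```

Indices in the returned tuples are zero-based positions in the instruction list.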

b) Draw a pipeline diagram of how instructions would issue in a machine using Tomasulo’s algorithm as discussed in class. Assume that the FP Add unit has 4 EX phases, the FP Multiply unit has 7 EX phases, and divide has 24 EX phases. FP Adds, Subtracts, and Multiplies are fully pipelined, while divide operations are NOT pipelined.
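
The difference between a fully pipelined and a non-pipelined unit can be illustrated with a small sketch. It is only an illustration under stated assumptions: `completion_cycles` is a hypothetical helper, each operation is described only by the first cycle in which it could start, and a pipelined unit is assumed to accept one new operation per cycle.

```python
def completion_cycles(ready_cycles, latency, pipelined):
    """Return the cycle in which each operation finishes its last EX phase.

    ready_cycles: first cycle in which each operation could start, in order.
    """
    finished = []
    unit_free = 0  # first cycle in which the unit can accept a new operation
    for ready in ready_cycles:
        start = max(ready, unit_free)
        end = start + latency - 1
        finished.append(end)
        # A pipelined unit can start another operation in the next cycle;
        # a non-pipelined unit stays occupied until the current one finishes.
        unit_free = start + 1 if pipelined else end + 1
    return finished


print(completion_cycles([1, 2], latency=4, pipelined=True))    # two FP adds
print(completion_cycles([1, 2], latency=24, pipelined=False))  # two divides
```

With these inputs the two adds overlap, while the second divide cannot start until the first has left the unit.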

Cycle 1, 2, 3

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3               Load1  Yes   12+R2
LD    F2     16+  R3   2                      Load2  Yes   16+R3
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6
SUBD  F8     F6   F2
ADDI  R2     R2   8
ADDI  R3     R3   16
ADDD  F6     F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   Yes   ADDD         R(F4)  Load2
      ADD2   No
      ADD3   No
      MULT1  No
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
3      FU  ADD1    Load2           Load1
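
The way the issue stage fills a reservation station from the register result status can be sketched as follows. This is a simplified illustration, not the author's method: `issue` and the dictionary layout are hypothetical, and a register with no pending producer is assumed to be readable straight from the register file.

```python
def issue(op, dest, src_j, src_k, station, reg_status, stations):
    """Fill one reservation station entry and rename the destination register."""
    entry = {"busy": True, "op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for src, v_field, q_field in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
        producer = reg_status.get(src)
        if producer is None:
            entry[v_field] = f"R({src})"  # value is already in the register file
        else:
            entry[q_field] = producer     # wait for this unit to broadcast
    stations[station] = entry
    reg_status[dest] = station            # later readers of dest wait on this tag
    return entry


# State after the two loads have issued (cycles 1 and 2).
reg_status = {"F6": "Load1", "F2": "Load2"}
stations = {}
entry = issue("ADDD", "F0", "F2", "F4", "ADD1", reg_status, stations)
print(entry)
print(reg_status)
```

Issuing ADDD F0, F2, F4 this way leaves Qj pointing at Load2, Vk holding R(F4), and F0 tagged with ADD1, which is the state shown for cycle 3.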

Cycle 4

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4               Load2  Yes   16+R3
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2
ADDI  R2     R2   8
ADDI  R3     R3   16
ADDD  F6     F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   Yes   ADDD         R(F4)  Load2
      ADD2   No
      ADD3   No
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
4      FU  ADD1    Load2           M(A1)           MULT1

Cycle 5

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5
ADDI  R2     R2   8
ADDI  R3     R3   16
ADDD  F6     F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
4     ADD1   Yes   ADDD  M(A2)  R(F4)
4     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   No
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
5      FU  ADD1    M(A2)           M(A1)   ADD2    MULT1
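
In cycle 5 the result of Load2 reaches every entry that was waiting on it. A minimal sketch of that broadcast step, with a hypothetical `broadcast` helper and dictionary layout that are assumptions rather than anything given in the assignment:

```python
def broadcast(tag, value, stations, reg_status):
    """Deliver a result to every station entry and register waiting on `tag`."""
    for entry in stations.values():
        if entry.get("Qj") == tag:
            entry["Vj"], entry["Qj"] = value, None
        if entry.get("Qk") == tag:
            entry["Vk"], entry["Qk"] = value, None
    for reg, producer in reg_status.items():
        if producer == tag:
            reg_status[reg] = value  # the register now holds a value, not a tag


# State just before Load2 writes its result in cycle 5.
stations = {
    "ADD1": {"op": "ADDD", "Vj": None, "Vk": "R(F4)", "Qj": "Load2", "Qk": None},
    "ADD2": {"op": "SUBD", "Vj": "M(A1)", "Vk": None, "Qj": None, "Qk": "Load2"},
}
reg_status = {"F0": "ADD1", "F2": "Load2", "F8": "ADD2"}

broadcast("Load2", "M(A2)", stations, reg_status)
print(stations["ADD1"], stations["ADD2"], reg_status, sep="\n")
```

After the call both ADD1 and ADD2 hold M(A2) and have no outstanding tags, so both can begin executing.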

Cycle 6

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5
ADDI  R2     R2   8    6      6
ADDI  R3     R3   16
ADDD  F6     F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
3     ADD1   Yes   ADDD  M(A2)  R(F4)
3     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   No
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
6      FU  ADD1    M(A2)           M(A1)   ADD2    MULT1

Cycle 7

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7
ADDD  F6     F8   F2

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
2     ADD1   Yes   ADDD  M(A2)  R(F4)
2     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   No
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
7      FU  ADD1    M(A2)           M(A1)   ADD2    MULT1

Cycle 8

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3                      Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
1     ADD1   Yes   ADDD  M(A2)  R(F4)
1     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   Yes   ADDD         M(A2)  ADD2
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
8      FU  ADD1    M(A2)           ADD3    ADD2    MULT1

Cycle 9

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9               Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
0     ADD1   Yes   ADDD  M(A2)  R(F4)
0     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   Yes   ADDD         M(A2)  ADD2
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
9      FU  ADD1    M(A2)           ADD3    ADD2    MULT1
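
The Time column for ADD1 counts down from 4 in cycle 5 to 0 in cycle 9. A small sketch of that bookkeeping, assuming (this is an interpretation of the tables, not something the assignment states) that the counter is loaded with the unit latency in the cycle the last operand arrives and decrements once per cycle; `remaining_ex` is a hypothetical name.

```python
def remaining_ex(latency, loaded_cycle, cycle):
    """Remaining EX phases shown in the Time column at a given cycle.

    Assumes the counter is loaded with `latency` in `loaded_cycle`
    and decrements by one in every following cycle, stopping at 0.
    """
    return max(latency - (cycle - loaded_cycle), 0)


# ADD1 receives F2 from Load2 in cycle 5; the FP Add unit has 4 EX phases.
for cycle in range(5, 10):
    print(cycle, remaining_ex(4, 5, cycle))
```

The same rule with a latency of 24 loaded in cycle 10 gives 23 in cycle 11 and 20 in cycle 14, matching the MULT1 entries for the divide in those cycles.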

Cycle 10

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9        10     Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   Yes   SUBD  M(A1)  M(A2)
4     ADD3   Yes   ADDD  M-M    M(A2)
24    MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
10     FU  M+R4    M(A2)           ADD3    ADD2    MULT1

Cycle 11

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9        10     Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      9        11
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
3     ADD3   Yes   ADDD  M-M    M(A2)
23    MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
11     FU  M+R4    M(A2)           ADD3    M-M     MULT1

Cycle 14

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      10       11     Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      8        9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8      14

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
0     ADD3   Yes   ADDD  M-M    M(A2)
20    MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
14     FU  M+R4    M(A2)           ADD3    M-M     MULT1

Cycle 15

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      10              Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      8
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8      14       15

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
20    MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
15     FU  M+R4    M(A2)           M-M+M   M-M     MULT1

Cycle 35

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      10       11     Load3  No
DIVD  F10    F0   F6   4      35
SUBD  F8     F6   F2   5      8        9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8      14       15

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
0     MULT1  Yes   DIVD  M+R4   M(A1)
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
35     FU  M+R4    M(A2)           M-M+M   M-M     MULT1

Cycle 36

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      10       11     Load3  No
DIVD  F10    F0   F6   4      35       36
SUBD  F8     F6   F2   5      8        9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8      14       15

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
      ADD1   No
      ADD2   No
      ADD3   No
      MULT1  No
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
36     FU  M+R4    M(A2)           M-M+M   M-M     (M+R4)/M

c) Tomasulo’s algorithm has a disadvantage: only one result can complete per clock, per CDB. Using the same latencies as above, find a code sequence of no more than 12 instructions where Tomasulo’s algorithm must stall due to CDB contention. Indicate where this occurs in your sequence.

Using the sequence from part (a), the contention occurs at the end of cycle 9. ADDD F0, F2, F4 and SUBD F8, F6, F2 both complete execution in cycle 9 (both Time counters reach 0), but only one result can be written per cycle on the single CDB. ADDD therefore writes in cycle 10 and SUBD has to wait until cycle 11.

Cycle 9

Instruction Status                            Load Buffers
Instruction  j    k    Issue  Execute  Write  Name   Busy  Address
LD    F6     12+  R2   1      3        4      Load1  No
LD    F2     16+  R3   2      4        5      Load2  No
ADDD  F0     F2   F4   3      9               Load3  No
DIVD  F10    F0   F6   4
SUBD  F8     F6   F2   5      9
ADDI  R2     R2   8    6      6        7
ADDI  R3     R3   16   7      7        8
ADDD  F6     F8   F2   8

Reservation Stations
Time  Name   Busy  Op    Vj     Vk     Qj     Qk
0     ADD1   Yes   ADDD  M(A2)  R(F4)
0     ADD2   Yes   SUBD  M(A1)  M(A2)
      ADD3   Yes   ADDD         M(A2)  ADD2
      MULT1  Yes   DIVD         M(A1)  ADD1
      MULT2  No

Register Result Status
Clock      F0      F2      F4      F6      F8      F10     F12     F14
9      FU  ADD1    M(A2)           ADD3    ADD2    MULT1
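
The effect of a single CDB on write-back can be sketched as a simple arbitration loop. This is an illustrative model only: `schedule_writes` is a hypothetical helper, a result is assumed to be eligible to write in the cycle after it finishes executing, and ties are assumed to go to the older instruction.

```python
def schedule_writes(finished):
    """Assign at most one CDB write per cycle.

    finished: list of (name, cycle in which execution completes), in program order.
    Returns {name: write cycle}.
    """
    writes = {}
    taken = set()
    # sorted() is stable, so instructions finishing together keep program order.
    for name, done in sorted(finished, key=lambda item: item[1]):
        cycle = done + 1
        while cycle in taken:  # the CDB is already claimed in this cycle
            cycle += 1
        taken.add(cycle)
        writes[name] = cycle
    return writes


print(schedule_writes([("ADDD F0", 9), ("SUBD F8", 9)]))
```

For the two instructions that finish in cycle 9, this model gives ADDD the bus in cycle 10 and pushes SUBD to cycle 11, the same write cycles recorded in the cycle 10 and cycle 11 tables.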

2) Evaluate the performance of several implementation options for the following workload:

LOOP:
    L.D     F3, R4(R6)      # F3 = MEM[R4+R6]
    MUL.D   F4, F3, F2      # F4 = F3*F2
    S.D     F4, R3(R6)      # MEM[R3+R6] = F4
    A.D     F4, F3, F3      # F4 = F3+F3
    A.D     F10, F10, F4    # F10 = F10 + F4
    DSUBUI  R6, R6, #4      # R6 = R6 - 4
    BNEQ    R6, LOOP        # if R6 != 0, jump to LOOP

Assume the processor implements Tomasulo’s algorithm (with reservation stations and no reorder buffer), as well as the following:

- A single instruction is issued per cycle.
- None of the functional units are pipelined.
- There is no forwarding between or within functional units; results are communicated via the single CDB.
- The memory execution unit takes 3 cycles for a load and 2 cycles for a store. Loads and stores have separate reservation stations, but only one load or store can execute at a time because they share the memory port.
- The issue and write-result stages take one cycle each. Address generation is performed separately from the ALU, in the load and store buffers.
- Branches execute in the integer unit, and instructions issued after a branch wait until the branch has been resolved and broadcast on the CDB.

Functional Unit Queues and Latencies:

Functional Unit   # of Functional Units   Latency (cycles in EX)   # of Reservation Stations
Memory – Load     1                       3                        2
Memory – Store    1                       2                        2
Integer           1                       1                        5
FP – Add          1                       4                        3
FP – Multiply     1                       2                        2

a) Perform a simulation of the first two iterations for a single-issue architecture. Create the table below.

Iteration 1

Instruction   j    k     Issue  Execute  Write
L.D     F3    R4   R6    1      2        6
MUL.D   F4    F3   F2    2      6        17
S.D     F4    R3   R6    3      17       21
A.D     F4    F3   F3    4      21       26
A.D     F10   F10  F4    5      26       31
DSUBUI  R6    R6   #4    6      31       36
BNEQ    R6    loop       7      37

Iteration 2

Instruction   j    k     Issue  Execute  Write
L.D     F3    R4   R6    8      38       42
MUL.D   F4    F3   F2    9      42       53
S.D     F4    R3   R6    10     53       56
A.D     F4    F3   F3    11     56       61
A.D     F10   F10  F4    12     61       66
DSUBUI  R6    R6   #4    13     66       71
BNEQ    R6    loop       14     71

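b) What is the performance bottleneck?

A bottleneck is a delay in moving data through the processor, which typically occurs when the available bandwidth cannot support the amount of information being relayed at the speed it is produced. In the tables above, no instruction begins execution before the previous instruction has written its result (for example, MUL.D starts in cycle 6, the cycle in which L.D writes), so the loop is effectively serialized: results are delivered one at a time over the single CDB, and the non-pipelined units run back to back.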

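c) What is the “steady state” of this loop – that is, how many cycles will an average loop iteration take if loop startup and shutdown effects are ignored?

Steady state refers to the repeating per-iteration cost once startup effects have passed, not to the point at which R6 reaches zero and the loop exits. In the tables above, corresponding instructions in iterations 1 and 2 execute 34 to 36 cycles apart (for example, L.D executes in cycles 2 and 38, and BNEQ in cycles 37 and 71), so according to this simulation an average iteration takes roughly 35 cycles.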
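
One way to estimate the per-iteration cost is to subtract each instruction's execute cycle in iteration 1 from its execute cycle in iteration 2. The sketch below does only that arithmetic; the numbers are copied from the two iteration tables, while the dictionary names and `execute_gaps` are hypothetical.

```python
# Execute cycles copied from the Iteration 1 and Iteration 2 tables.
ITER1 = {"L.D": 2, "MUL.D": 6, "S.D": 17, "A.D F4": 21,
         "A.D F10": 26, "DSUBUI": 31, "BNEQ": 37}
ITER2 = {"L.D": 38, "MUL.D": 42, "S.D": 53, "A.D F4": 56,
         "A.D F10": 61, "DSUBUI": 66, "BNEQ": 71}


def execute_gaps(first, second):
    """Cycles between the same instruction's execution in consecutive iterations."""
    return {name: second[name] - first[name] for name in first}


gaps = execute_gaps(ITER1, ITER2)
for name, gap in gaps.items():
    print(f"{name:8s} {gap}")
print("range:", min(gaps.values()), "to", max(gaps.values()))
```

If the schedule had fully settled, every gap would be the same; the spread seen here comes from the tabulated values themselves.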

d) Where will the first issue stall occur?

The first stall occurs at the second instruction, MUL.D F4, F3, F2. It needs F3, which L.D has not yet produced, so it is delayed by a RAW dependence. In the table it still issues in cycle 2 and then waits in its reservation station until L.D writes F3 in cycle 6.