CPE 631: ILP, Dynamic Exploitation
Electrical and Computer Engineering
University of Alabama in Huntsville
Aleksandar Milenković, [email protected]
http://www.ece.uah.edu/~milenka
2
©AM
LaCASA
Outline
• Instruction Level Parallelism (ILP)
• Recap: Data Dependencies
• Extended MIPS Pipeline and Hazards
• Dynamic scheduling with a scoreboard
ILP: Concepts and Challenges
• ILP (Instruction Level Parallelism): overlap execution of unrelated instructions
• Techniques that increase the amount of parallelism exploited among instructions
  - reduce the impact of data and control hazards
  - increase the processor's ability to exploit parallelism
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
• Reducing any term on the right-hand side lowers CPI and thus increases instruction throughput
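The CPI equation above is a plain sum, which a few lines of Python can make concrete. This is only an illustrative sketch: the function name `pipeline_cpi` and the stall values (stall cycles per instruction) are assumptions, not measurements from the course or the textbook.

```python
def pipeline_cpi(ideal_cpi, structural=0.0, raw=0.0, war=0.0, waw=0.0, control=0.0):
    """Pipeline CPI = ideal CPI plus the average stall cycles per instruction of each kind."""
    return ideal_cpi + structural + raw + war + waw + control


# Hypothetical stall contributions, chosen only to show the effect of each term.
baseline = pipeline_cpi(1.0, structural=0.05, raw=0.30, control=0.20)
less_raw = pipeline_cpi(1.0, structural=0.05, raw=0.10, control=0.20)

print(f"baseline CPI: {baseline:.2f}")
print(f"CPI with fewer RAW stalls: {less_raw:.2f}")
print(f"speedup from reducing RAW stalls: {baseline / less_raw:.2f}x")
```

Any technique from the following slides can be read as an attempt to shrink one of these arguments.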
Two approaches to exploit parallelism
• Dynamic techniques: largely depend on hardware to locate the parallelism
• Static techniques: rely on software
Techniques to exploit parallelism
(DH = data hazard, CH = control hazard)

Technique (Section in the textbook)                 Reduces
Dynamic branch prediction (3.4)                     CH stalls
Issuing multiple instructions per cycle (3.6)       Ideal CPI
Speculation (3.7)                                   Data and control stalls
Dynamic memory disambiguation (3.2, 3.7)            RAW stalls w. memory
Compiler dependence analysis (4.4)                  Ideal CPI, DH stalls
Software pipelining and trace scheduling (4.3)      Ideal CPI and DH stalls
Compiler speculation (4.4)                          Ideal CPI, and D/CH stalls
Forwarding and bypassing (A.2)                      DH stalls
Delayed branches (A.2)                              CH stalls
Basic dynamic scheduling (A.8)                      DH stalls (RAW)
Dynamic scheduling with register renaming (3.2)     WAR and WAW stalls
Loop unrolling (4.1)                                CH stalls
Basic compiler pipeline scheduling (A.2, 4.1)       DH stalls
Where to look for ILP?
• Amount of parallelism available within a basic block
  - Basic block (BB): straight-line code sequence with no branches in except to the entry, and no branches out except at the exit
  - Example: gcc (GNU C Compiler): 17% control transfer instructions
    - 5 or 6 instructions + 1 branch
    - Dependencies => the amount of parallelism in a basic block is likely to be much less than 5 => look beyond a single block to get more instruction level parallelism
• The simplest and most common way to increase the amount of parallelism among instructions is to exploit parallelism among iterations of a loop => Loop Level Parallelism
• Vector Processing: see Appendix G
for(i=1; i
Definition: Name Dependencies
• Two instructions use the same name (register or memory location) but don’t exchange data
  - Antidependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first
  - Output dependence (WAW if a hazard for HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved
• If dependent, can’t execute in parallel
• Renaming removes name dependencies
• Again, name dependencies are hard to detect for memory accesses
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
Where are the name dependencies?
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    0(R1),F4      ;drop DSUBUI & BNEZ
      L.D    F0,-8(R1)
      ADD.D  F4,F0,F2
      S.D    -8(R1),F4     ;drop DSUBUI & BNEZ
      L.D    F0,-16(R1)
      ADD.D  F4,F0,F2
      S.D    -16(R1),F4    ;drop DSUBUI & BNEZ
      L.D    F0,-24(R1)
      ADD.D  F4,F0,F2
      S.D    -24(R1),F4
      DSUBUI R1,R1,#32     ;alter to 4*8
      BNEZ   R1,Loop
      NOP
How can we remove them?
Where are the name dependencies?
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    0(R1),F4      ;drop DSUBUI & BNEZ
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    -8(R1),F8     ;drop DSUBUI & BNEZ
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    -16(R1),F12   ;drop DSUBUI & BNEZ
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    -24(R1),F16
      DSUBUI R1,R1,#32     ;alter to 4*8
      BNEZ   R1,Loop
      NOP
The original “register renaming”
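To see the effect of renaming on the unrolled loop, the sketch below lists register name dependencies (WAR and WAW) in the first two iterations, before and after renaming. It is a simplified illustration: the `(dest, sources)` encoding and the function `name_dependencies` are assumptions made for this example, every earlier/later pair that reuses a name is reported, and memory names are ignored.

```python
def name_dependencies(code):
    """Return (kind, i, j, register) for each pair i < j that reuses a register name.

    code is a list of (dest, sources) pairs; instruction numbers are 1-based.
    WAW: j writes the register i writes. WAR: j writes a register i reads.
    """
    found = []
    for j, (dest_j, _) in enumerate(code, start=1):
        if dest_j is None:              # stores write memory, not a register
            continue
        for i, (dest_i, sources_i) in enumerate(code[: j - 1], start=1):
            if dest_i == dest_j:
                found.append(("WAW", i, j, dest_j))
            if dest_j in sources_i:
                found.append(("WAR", i, j, dest_j))
    return found


# First two iterations of the unrolled loop, before renaming.
original = [
    ("F0", ("R1",)),        # L.D   F0,0(R1)
    ("F4", ("F0", "F2")),   # ADD.D F4,F0,F2
    (None, ("R1", "F4")),   # S.D   0(R1),F4
    ("F0", ("R1",)),        # L.D   F0,-8(R1)
    ("F4", ("F0", "F2")),   # ADD.D F4,F0,F2
    (None, ("R1", "F4")),   # S.D   -8(R1),F4
]

# The same instructions after renaming F0 -> F6 and F4 -> F8 in the second iteration.
renamed = original[:3] + [
    ("F6", ("R1",)),        # L.D   F6,-8(R1)
    ("F8", ("F6", "F2")),   # ADD.D F8,F6,F2
    (None, ("R1", "F8")),   # S.D   -8(R1),F8
]

for kind, i, j, reg in name_dependencies(original):
    print(f"{kind} on {reg}: instruction {i} -> instruction {j}")
print("after renaming:", name_dependencies(renamed))
```

The true (RAW) dependences inside each iteration remain; only the reuse of F0 and F4 across iterations disappears.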
Definition: Control Dependencies
• Example: if p1 {S1;}; if p2 {S2;};
  S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1
• Two constraints on control dependences:
  - An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch
  - An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution becomes controlled by the branch
      DADDU R5, R6, R7
      ADD   R1, R2, R3
      BEQZ  R4, L
      SUB   R1, R5, R6
   L: OR    R7, R1, R8
Dynamically Scheduled Pipelines
Overcoming Data Hazards with Dynamic Scheduling
• Why in HW at run time?
  - Works when real dependences can’t be known at compile time
  - Simpler compiler
  - Code compiled for one machine runs well on another
• Key idea: allow instructions behind a stall to proceed
• Example:
      DIV.D F0,F2,F4
      ADD.D F10,F0,F8
      SUB.D F12,F8,F12
  SUB.D cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall; yet SUB.D is not data dependent on anything in the pipeline!
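The cost of the in-order stall in this example can be estimated with a toy timing model. The sketch below is an assumption-laden illustration, not a pipeline simulator: it starts at most one instruction per cycle, tracks only RAW dependences, ignores structural, WAR, and WAW hazards, and uses hypothetical latencies of 40 cycles for divide and 2 for add/subtract.

```python
def start_cycles(program, latency, in_order):
    """Return the cycle in which each instruction starts executing.

    program: list of (op, dest, sources). An instruction starts once its source
    registers are ready; with in_order=True it must also start after its predecessor.
    """
    ready_at = {}   # register -> cycle in which its pending value becomes available
    starts = []
    for i, (op, dest, sources) in enumerate(program):
        operands_ready = max((ready_at.get(r, 0) for r in sources), default=0)
        start = max(i, operands_ready)          # instruction i is fetched in cycle i
        if in_order and starts:
            start = max(start, starts[-1] + 1)  # cannot pass the previous instruction
        starts.append(start)
        ready_at[dest] = start + latency[op]
    return starts


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F10", ("F0", "F8")),
    ("SUB.D", "F12", ("F8", "F12")),
]
latency = {"DIV.D": 40, "ADD.D": 2, "SUB.D": 2}   # assumed latencies

print("in-order start cycles:    ", start_cycles(program, latency, in_order=True))
print("out-of-order start cycles:", start_cycles(program, latency, in_order=False))
```

Under these assumptions SUB.D waits behind ADD.D in the in-order case, while the out-of-order case lets it start as soon as it is fetched.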
Overcoming Data Hazards with Dynamic Scheduling (cont’d)
• Enables out-of-order execution => out-of-order completion
• Out-of-order execution divides the ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands
• Scoreboarding: a technique for allowing instructions to execute out of order when there are sufficient resources and no data dependencies (CDC 6600, 1963)
Scoreboarding Implications
• Out-of-order completion => WAR, WAW hazards?
      DIV.D F0,F2,F4
      ADD.D F10,F0,F8
      SUB.D F8,F8,F12      ;WAR on F8 (ADD.D reads F8)

      DIV.D F0,F2,F4
      ADD.D F10,F0,F8
      SUB.D F10,F8,F12     ;WAW on F10 (ADD.D writes F10)
• Solutions for WAR
  - Queue both the operation and copies of its operands
  - Read registers only during the Read Operands stage
• For WAW, must detect the hazard: stall until the other instruction completes
• Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies and the state of operations
• Scoreboard replaces ID, EX, WB with 4 stages
Four Stages of Scoreboard Control
• ID1 (Issue): decode instructions & check for structural hazards
• ID2 (Read operands): wait until no data hazards, then read operands
• EX (Execute): operate on operands; when the result is ready, the functional unit notifies the scoreboard that it has completed execution
• WB (Write result): finish execution; the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction
      DIV.D F0,F2,F4
      ADD.D F10,F0,F8
      SUB.D F8,F8,F12
  Scoreboarding stalls the SUB.D in its write result stage until ADD.D reads its operands
Four Stages of Scoreboard Control
1. Issue: decode instructions & check for structural hazards (ID1)
   If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
2. Read operands: wait until no data hazards, then read operands (ID2)
   A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
Four Stages of Scoreboard Control
3. Execution: operate on operands (EX)
   The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result: finish execution (WB)
   Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes results. If WAR, then it stalls the instruction.
   Example:
      DIV.D F0,F2,F4
      ADD.D F10,F0,F8
      SUB.D F8,F8,F14
   The CDC 6600 scoreboard would stall SUB.D until ADD.D reads its operands
Three Parts of the Scoreboard
1. Instruction status: which of 4 steps the instruction is in (Capacity = window size)
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit
   - Busy: indicates whether the unit is busy or not
   - Op: operation to perform in the unit (e.g., + or –)
   - Fi: destination register
   - Fj, Fk: source-register numbers
   - Qj, Qk: functional units producing source registers Fj, Fk
   - Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register
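The three parts of the scoreboard map naturally onto three small data structures. The following Python sketch is one possible encoding, not the CDC 6600 hardware layout: the class name `FunctionalUnitStatus` and the dictionary names are assumptions, while the unit names and the even-numbered registers F0 to F30 follow the example slides.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class FunctionalUnitStatus:
    """The 9 fields kept for each functional unit."""
    busy: bool = False            # Busy: is the unit in use?
    op: Optional[str] = None      # Op: operation to perform
    fi: Optional[str] = None      # Fi: destination register
    fj: Optional[str] = None      # Fj: first source register
    fk: Optional[str] = None      # Fk: second source register
    qj: Optional[str] = None      # Qj: unit producing Fj (None if no producer)
    qk: Optional[str] = None      # Qk: unit producing Fk (None if no producer)
    rj: bool = False              # Rj: Fj is ready and not yet read
    rk: bool = False              # Rk: Fk is ready and not yet read


# 1. Instruction status: for each instruction, the cycle in which it passed each step.
instruction_status = {
    "L.D F6,34(R2)": {"issue": None, "read": None, "exec": None, "write": None},
}

# 2. Functional unit status: one record per unit.
fu_status = {name: FunctionalUnitStatus()
             for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")}

# 3. Register result status: which unit will write each register (None = blank).
register_result = {f"F{n}": None for n in range(0, 32, 2)}

print([f.name for f in fields(FunctionalUnitStatus)])
```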
MIPS with a Scoreboard
[Figure: block diagram of MIPS with a scoreboard. Labels: Registers, FP Mult, FP Div, Add1 to Add3, Scoreboard, Control/Status.]
Detailed Scoreboard Pipeline Control
(for each instruction status: the wait condition and the bookkeeping)
• Issue
  - Wait until: not Busy(FU) and not Result(D)
  - Bookkeeping: Busy(FU)← yes; Op(FU)← op; Fi(FU)← ’D’; Fj(FU)← ’S1’; Fk(FU)← ’S2’; Qj← Result(’S1’); Qk← Result(’S2’); Rj← not Qj; Rk← not Qk; Result(’D’)← FU
• Read operands
  - Wait until: Rj and Rk
  - Bookkeeping: Rj← No; Rk← No
• Execution complete
  - Wait until: functional unit done
  - Bookkeeping: none
• Write result
  - Wait until: ∀f((Fj(f)≠Fi(FU) or Rj(f)=No) & (Fk(f)≠Fi(FU) or Rk(f)=No))
  - Bookkeeping: ∀f(if Qj(f)=FU then Rj(f)← Yes); ∀f(if Qk(f)=FU then Rk(f)← Yes); Result(Fi(FU))← 0; Busy(FU)← No
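The wait conditions and bookkeeping rules above translate almost line by line into code. The sketch below is a minimal Python rendering under stated assumptions: the function names and the dictionary layout are hypothetical, the example uses two adder units (`Add1`, `Add2`) so that SUB.D can issue, and on write result the whole entry is cleared and only busy units are scanned, which goes slightly beyond the rule Busy(FU)← No.

```python
def new_unit():
    return dict(busy=False, op=None, fi=None, fj=None, fk=None,
                qj=None, qk=None, rj=False, rk=False)


def can_issue(fu, result, unit, dest):
    # Issue: not Busy(FU) and not Result(D), i.e. no structural and no WAW hazard.
    return not fu[unit]["busy"] and result.get(dest) is None


def issue(fu, result, unit, op, dest, s1, s2):
    qj, qk = result.get(s1), result.get(s2)       # producers of the sources, if any
    fu[unit].update(busy=True, op=op, fi=dest, fj=s1, fk=s2,
                    qj=qj, qk=qk, rj=qj is None, rk=qk is None)
    result[dest] = unit


def can_read_operands(fu, unit):
    return fu[unit]["rj"] and fu[unit]["rk"]      # wait until Rj and Rk


def read_operands(fu, unit):
    fu[unit]["rj"] = fu[unit]["rk"] = False       # Rj <- No; Rk <- No


def can_write_result(fu, unit):
    # WAR check: no busy unit may still need to read the old value of Fi(FU).
    fi = fu[unit]["fi"]
    return all((f["fj"] != fi or not f["rj"]) and (f["fk"] != fi or not f["rk"])
               for f in fu.values() if f["busy"])


def write_result(fu, result, unit):
    for f in fu.values():                         # wake up units waiting on this one
        if f["qj"] == unit:
            f["rj"] = True
        if f["qk"] == unit:
            f["rk"] = True
    result[fu[unit]["fi"]] = None
    fu[unit] = new_unit()                         # sketch choice: clear the whole entry


fu = {name: new_unit() for name in ("Divide", "Add1", "Add2")}
result = {}
issue(fu, result, "Divide", "DIV.D", "F0", "F2", "F4")
issue(fu, result, "Add1", "ADD.D", "F10", "F0", "F8")
waw_blocked = not can_issue(fu, result, "Add2", "F10")   # a second writer of F10 must wait
issue(fu, result, "Add2", "SUB.D", "F8", "F8", "F12")
read_operands(fu, "Divide")
read_operands(fu, "Add2")
print("SUB.D may write F8 before ADD.D reads it:", can_write_result(fu, "Add2"))
```

In this run SUB.D has finished reading its own operands, yet its write of F8 is held back because ADD.D has not read F8 yet, which is the WAR stall described on the earlier slides.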
Scoreboard Example

Instruction status
Instruction  j    k    Issue  Read operands  Execution complete  Write result
L.D    F6    34+  R2
L.D    F2    45+  R3
MUL.D  F0    F2   F4
SUB.D  F8    F6   F2
DIV.D  F10   F0   F6
ADD.D  F6    F8   F2

Functional unit status
                           dest  S1  S2  FU for j  FU for k  Fj?  Fk?
Time  Name     Busy  Op    Fi    Fj  Fk  Qj        Qk        Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock      F0  F2  F4  F6  F8  F10  F12  ...  F30
       FU
Scoreboard Example: Cycle 1

Instruction status
Instruction  j    k    Issue  Read operands  Execution complete  Write result
L.D    F6    34+  R2   1
L.D    F2    45+  R3
MUL.D  F0    F2   F4
SUB.D  F8    F6   F2
DIV.D  F10   F0   F6
ADD.D  F6    F8   F2

Functional unit status
                           dest  S1  S2  FU for j  FU for k  Fj?  Fk?
Time  Name     Busy  Op    Fi    Fj  Fk  Qj        Qk        Rj   Rk
      Integer  Yes   Load  F6        R2                           Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock      F0  F2  F4  F6       F8  F10  F12  ...  F30
1      FU              Integer

Issue 1st L.D!
Scoreboard Example: Cycle 2Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0 F6ADD.D
F6 F8 F2Functional unit status dest S1 S2 FU for jFU for kFj?
Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2
YesMult1 NoMult2 NoAdd NoDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Integer
Issue 2nd L.D? Structural hazard!No further instructions will
issue!
Scoreboard Example: Cycle 3Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0
F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for
kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2
YesMult1 NoMult2 NoAdd NoDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Integer
Issue MUL.D?
Scoreboard Example: Cycle 4Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0
F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for
kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2
YesMult1 NoMult2 NoAdd NoDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Integer
Check for WAR hazards!If none, write result!
Scoreboard Example: Cycle 5Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5MUL.D F0 F2 F4SUB.D F8 F6 F2DIV.D F10 F0
F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for
kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3
YesMult1 NoMult2 NoAdd NoDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Integer
Issue 2nd L.D!
Scoreboard Example: Cycle 6Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6MUL.D F0 F2 F4 6SUB.D F8 F6 F2DIV.D F10
F0 F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU for
kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3
YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd NoDivide
No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 Integer
Issue MUL.D!
Scoreboard Example: Cycle 7Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7MUL.D F0 F2 F4 6SUB.D F8 F6 F2 7DIV.D
F10 F0 F6ADD.D F6 F8 F2Functional unit status dest S1 S2 FU for jFU
for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3
YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd Yes Sub F8 F6
F2 Integer Yes NoDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 Integer Add
Issue SUB.D!
Scoreboard Example: Cycle 8Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6SUB.D F8 F6 F2
7DIV.D F10 F0 F6 8ADD.D F6 F8 F2Functional unit status dest S1 S2
FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3
YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd Yes Sub F8 F6
F2 Integer Yes NoDivide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 Integer Add Divide
Issue DIV.D!
Scoreboard Example: Cycle 9Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7
9DIV.D F10 F0 F6 8ADD.D F6 F8 F2Functional unit status dest S1 S2
FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No
10 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 No
2 Add Yes Sub F8 F6 F2 Integer Yes YesDivide Yes Div F10 F0 F6
Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 Add Divide
Read operands for MUL.D and SUB.D!Assume we can feed Mult1 and
Add units in the same clock cycle.Issue ADD.D? Structural Hazard
(unit is busy)!
Scoreboard Example: Cycle 11Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9
11DIV.D F10 F0 F6 8ADD.D F6 F8 F2Functional unit status dest S1 S2
FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No
8 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 No
0 Add Yes Sub F8 F6 F2 Integer Yes YesDivide Yes Div F10 F0 F6
Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 Add Divide
Last cycle of SUB.D execution.
Scoreboard Example: Cycle 12Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9
11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2Functional unit status dest S1
S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No
7 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 NoAdd Yes Sub F8
F6 F2 Integer Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 Add Divide
Check WAR on F8. Write F8.
Scoreboard Example: Cycle 13Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9
11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13Functional unit status dest
S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No
6 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 NoAdd Yes Add F6
F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 Add Divide
Issue ADD.D!
Scoreboard Example: Cycle 14Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9
11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14Functional unit status
dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No
5 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 No
2 Add Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No
Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 Add Divide
Read operands for ADD.D!
Scoreboard Example: Cycle 15Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9
11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14Functional unit status
dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No
4 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 No
1 Add Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No
Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 Add Divide
Scoreboard Example: Cycle 16Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9
11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14 16Functional unit status
dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No
3 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 No
0 Add Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No
Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU Mult1 Add Divide
Scoreboard Example: Cycle 17Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9SUB.D F8 F6 F2 7 9
11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14 16Functional unit status
dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No
2 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 NoAdd Yes Add F6
F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
17 FU Mult1 Add Divide
Why cannot write F6?
Scoreboard Example: Cycle 19Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19SUB.D F8 F6 F2
7 9 11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14 16Functional unit
status dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No
0 Mult1 Yes Mult F0 F2 F4 Integer Yes YesMult2 NoAdd Yes Add F6
F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
19 FU Mult1 Add Divide
Scoreboard Example: Cycle 20Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19 20SUB.D F8 F6
F2 7 9 11 12DIV.D F10 F0 F6 8ADD.D F6 F8 F2 13 14 16Functional unit
status dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 Yes Mult
F0 F2 F4 Integer Yes YesMult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide
Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU Mult1 Add Divide
Scoreboard Example: Cycle 21Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19 20SUB.D F8 F6
F2 7 9 11 12DIV.D F10 F0 F6 8 21ADD.D F6 F8 F2 13 14 16Functional
unit status dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2
NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 Yes
Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Add Divide
Scoreboard Example: Cycle 22Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19 20SUB.D F8 F6
F2 7 9 11 12DIV.D F10 F0 F6 8 21ADD.D F6 F8 F2 13 14 16
22Functional unit status dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2
NoAdd Yes Add F6 F8 F2 Yes Yes
40 Divide Yes Div F10 F0 F6 Mult1 Yes YesRegister result
statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
22 FU Add Divide
Write F6?
Scoreboard Example: Cycle 61Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19 20SUB.D F8 F6
F2 7 9 11 12DIV.D F10 F0 F6 8 21 61ADD.D F6 F8 F2 13 14 16
22Functional unit status dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2
NoAdd No
0 Divide Yes Div F10 F0 F6 Mult1 Yes YesRegister result
statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU Divide
Scoreboard Example: Cycle 62Instruction status Read
ExecutioWriteInstruction j k Issue operandcompletResultL.D F6 34+
R2 1 2 3 4L.D F2 45+ R3 5 6 7 8MUL.D F0 F2 F4 6 9 19 20SUB.D F8 F6
F2 7 9 11 12DIV.D F10 F0 F6 8 21 61 62ADD.D F6 F8 F2 13 14 16
22Functional unit status dest S1 S2 FU for jFU for kFj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2
NoAdd NoDivide Yes Div F10 F0 F6 Mult1 Yes Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU Divide
Scoreboard Results
• For the CDC 6600
  - 70% improvement for Fortran
  - 150% improvement for hand-coded assembly language
  - cost was similar to one of the functional units
    - surprisingly low
    - bulk of the cost was in the extra busses
• Still, this was in ancient times
  - no caches & no main semiconductor memory
  - no software pipelining
  - compilers?
• So, why is it coming back?
  - performance via ILP
Scoreboard Limitations
• Amount of parallelism among instructions
  - can we find independent instructions to execute?
• Number of scoreboard entries
  - how far ahead the pipeline can look for independent instructions (we assume a window does not extend beyond a branch)
• Number and types of functional units
  - avoid structural hazards
• Presence of antidependences and output dependences
  - WAR and WAW stalls become more important
Things to Remember
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
• Data dependencies
• Dynamic scheduling to minimise stalls
• Dynamic scheduling with a scoreboard
Tomasulo’s Algorithm
• Used in IBM 360/91 FPU (before caches)
• Goal: high FP performance without special compilers
• Conditions:
  - Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
  - Long memory accesses and long FP delays
  - This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
• Why study a 1966 computer?
• The descendants of this have flourished!
  - Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …
Tomasulo’s Algorithm (cont’d)
• Control & buffers distributed with Functional Units (FU)
  - FU buffers called “reservation stations” => buffer the operands of instructions waiting to issue
• Registers in instructions replaced by values or pointers to reservation stations (RS) => register renaming
  - avoids WAR, WAW hazards
  - More reservation stations than registers, so can do optimizations compilers can’t
• Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
• Loads and Stores treated as FUs with RSs as well
• Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
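Replacing register names by values or reservation station pointers at issue time can be sketched in a few lines. This is a minimal Python illustration, not the IBM 360/91 logic: the helper `issue`, the dictionaries `reg_value`, `reg_tag`, and `stations`, and the register contents are all hypothetical.

```python
# Hypothetical register contents and renaming table (assumptions for illustration).
reg_value = {"F0": 0.0, "F2": 6.0, "F4": 3.0, "F8": 1.0, "F10": 0.0, "F12": 5.0}
reg_tag = {}      # register -> reservation station that will produce its next value
stations = {}     # reservation station name -> entry


def issue(rs_name, op, dest, src1, src2):
    """Issue one instruction into reservation station rs_name, renaming its operands."""
    entry = {"op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for src, v, q in ((src1, "Vj", "Qj"), (src2, "Vk", "Qk")):
        if src in reg_tag:
            entry[q] = reg_tag[src]      # value still pending: point at its producer
        else:
            entry[v] = reg_value[src]    # value available: copy it into the station
    reg_tag[dest] = rs_name              # later readers of dest wait on this station
    stations[rs_name] = entry
    return entry


issue("Mult1", "DIV.D", "F0", "F2", "F4")
add = issue("Add1", "ADD.D", "F10", "F0", "F8")   # F0 pending, F8 copied
sub = issue("Add2", "SUB.D", "F8", "F8", "F12")   # will overwrite F8 later

print("ADD.D entry:", add)
print("SUB.D entry:", sub)
print("register tags:", reg_tag)
```

Because ADD.D already holds its own copy of F8, SUB.D can write F8 whenever it finishes: the WAR hazard that stalled the scoreboard no longer exists.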
Tomasulo-based FPU for MIPS
[Figure: block diagram of the Tomasulo-based FPU. Labels: From Instruction Unit, FP Op Queue, FP Registers, From Mem, Load Buffers (Load1 to Load6), Store Buffers (Store1 to Store3), To Mem, Reservation Stations (Add1 to Add3 for the FP adders, Mult1 and Mult2 for the FP multipliers), Common Data Bus (CDB).]
Reservation Station Components
• Op: operation to perform in the unit (e.g., + or –)
• Vj, Vk: values of the source operands
  - Store buffers have a V field: the result to be stored
• Qj, Qk: reservation stations producing the source registers (value to be written)
  - Note: Qj/Qk = 0 => source operand is already available in Vj/Vk
  - Store buffers only have Qi for the RS producing the result
• Busy: indicates reservation station or FU is busy
• Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If reservation station free (no structural hazard), control issues instr & sends operands (renames registers)
2. Execute: operate on operands (EX)
   When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
   Write it on Common Data Bus to all awaiting units; mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
  - 64 bits of data + 4 bits of Functional Unit source address
  - Write if matches expected Functional Unit (produces result)
  - Does the broadcast
• Example speed: 2 clocks for floating-point +, -; 10 for *; 40 clocks for /
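The “come from” behaviour of the Common Data Bus can be illustrated with a short sketch: a result is broadcast together with the name of the station that produced it, and every waiting consumer compares that name with its own Qj/Qk tags. The Python below is an assumption-based illustration; the function names, dictionary layout, and numeric values are hypothetical, and the station contents loosely follow the example after ADDD has issued.

```python
def broadcast(tag, value, stations, reg_tag, reg_value):
    """Put (value, producing station) on the CDB; every matching consumer captures it."""
    for rs in stations.values():
        if rs["Qj"] == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == tag:
            rs["Vk"], rs["Qk"] = value, None
    for reg, producer in list(reg_tag.items()):
        if producer == tag:               # register still expects this station's result
            reg_value[reg] = value
            del reg_tag[reg]


def ready(rs):
    """An entry may start executing once both operands hold values."""
    return rs["Qj"] is None and rs["Qk"] is None


# Hypothetical state: Add1 (SUBD) is about to finish; Add2 (ADDD) waits for it.
stations = {
    "Add2":  {"op": "ADDD", "Vj": None, "Vk": 2.5, "Qj": "Add1", "Qk": None},
    "Mult2": {"op": "DIVD", "Vj": None, "Vk": 4.0, "Qj": "Mult1", "Qk": None},
}
reg_tag = {"F0": "Mult1", "F6": "Add2", "F8": "Add1", "F10": "Mult2"}
reg_value = {}

broadcast("Add1", 1.5, stations, reg_tag, reg_value)
print("Add2 ready: ", ready(stations["Add2"]))
print("Mult2 ready:", ready(stations["Mult2"]))
print("registers written:", reg_value)
```

Marking the producing station as free after the broadcast is left out to keep the sketch short.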
Tomasulo Example

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2
LD     F2    45+  R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
                         S1  S2  RS  RS
Time  Name   Busy  Op    Vj  Vk  Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock      F0  F2  F4  F6  F8  F10  F12  ...  F30
0      FU

• Instruction status table: the instruction stream
• 3 load buffers, 3 FP adder reservation stations, 2 FP multiplier reservation stations
• Time column: FU countdown
• Clock column: clock cycle counter
Tomasulo Example Cycle 1

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1
LD     F2    45+  R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No

Reservation Stations:
                         S1  S2  RS  RS
Time  Name   Busy  Op    Vj  Vk  Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock      F0  F2  F4  F6     F8  F10  F12  ...  F30
1      FU              Load1
Tomasulo Example Cycle 2Instruction status: Exec
WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1
Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3
NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj
Qk
Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Load2 Load1
Note: Can have multiple loads outstanding
Tomasulo Example Cycle 3Instruction status: Exec
WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3
Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3
NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj
Qk
Add1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Mult1 Load2 Load1
• Note: registers names are removed (“renamed”) in Reservation
Stations; MULT issued
• Load1 completing; what is waiting for Load1?
Tomasulo Example Cycle 4Instruction status: Exec
WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3
4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3
NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj
Qk
Add1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4)
Load2Mult2 No
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Mult1 Load2 M(A1) Add1
• Load2 completing; what is waiting for Load2?
Tomasulo Example Cycle 5Instruction status: Exec
WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3
4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD
F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj
Qk
2 Add1 Yes SUBD M(A1) M(A2)Add2 NoAdd3 No
10 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Mult1 M(A2) M(A1) Add1 Mult2
• Timer starts down for Add1, Mult1
Tomasulo Example Cycle 6Instruction status: Exec
WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3
4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD
F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj
Qk
1 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No
9 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 M(A2) Add2 Add1 Mult2
• Issue ADDD here despite name dependency on F6?
Tomasulo Example Cycle 7Instruction status: Exec
WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3
4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD
F8 F6 F2 4 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj
Qk
0 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No
8 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 M(A2) Add2 Add1 Mult2
• Add1 (SUBD) completing; what is waiting for it?
Tomasulo Example Cycle 8Instruction status: Exec
WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3
4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD
F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj
Qk
Add1 No2 Add2 Yes ADDD (M-M) M(A2)
Add3 No7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 M(A2) Add2 (M-M) Mult2
Tomasulo Example Cycle 9Instruction status: Exec
WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3
4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD
F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj
Qk
Add1 No1 Add2 Yes ADDD (M-M) M(A2)
Add3 No6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 M(A2) Add2 (M-M) Mult2
Tomasulo Example Cycle 10Instruction status: Exec
WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3
4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD
F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj
Qk
Add1 No0 Add2 Yes ADDD (M-M) M(A2)
Add3 No5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30
10 FU Mult1 M(A2) Add2 (M-M) Mult2
• Add2 (ADDD) completing; what is waiting for it?
Tomasulo Example Cycle 11Instruction status: Exec
WriteInstruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3
4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD
F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11
Reservation Stations: S1 S2 RS RSTime Name Busy Op Vj Vk Qj
Qk
Add1 NoAdd2 NoAdd3 No
4 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1
Register result status:Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 M(A2) (M-M+M) (M-M) Mult2
• Write result of ADDD here?• All quick instructions complete in
this cycle!
Tomasulo Example Cycle 12

Instruction status:
Instruction           Issue  Exec Comp  Write Result
LD     F6   34+  R2     1        3           4
LD     F2   45+  R3     2        4           5
MULTD  F0   F2   F4     3
SUBD   F8   F6   F2     4        7           8
DIVD   F10  F0   F6     5
ADDD   F6   F8   F2     6       10          11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
   3  Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12  ...  F30
   12  FU  Mult1  M(A2)         (M-M+M)  (M-M)  Mult2
Tomasulo Example Cycle 13

Instruction status:
Instruction           Issue  Exec Comp  Write Result
LD     F6   34+  R2     1        3           4
LD     F2   45+  R3     2        4           5
MULTD  F0   F2   F4     3
SUBD   F8   F6   F2     4        7           8
DIVD   F10  F0   F6     5
ADDD   F6   F8   F2     6       10          11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
   2  Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12  ...  F30
   13  FU  Mult1  M(A2)         (M-M+M)  (M-M)  Mult2
Tomasulo Example Cycle 14

Instruction status:
Instruction           Issue  Exec Comp  Write Result
LD     F6   34+  R2     1        3           4
LD     F2   45+  R3     2        4           5
MULTD  F0   F2   F4     3
SUBD   F8   F6   F2     4        7           8
DIVD   F10  F0   F6     5
ADDD   F6   F8   F2     6       10          11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
   1  Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12  ...  F30
   14  FU  Mult1  M(A2)         (M-M+M)  (M-M)  Mult2
Tomasulo Example Cycle 15

Instruction status:
Instruction           Issue  Exec Comp  Write Result
LD     F6   34+  R2     1        3           4
LD     F2   45+  R3     2        4           5
MULTD  F0   F2   F4     3       15
SUBD   F8   F6   F2     4        7           8
DIVD   F10  F0   F6     5
ADDD   F6   F8   F2     6       10          11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
   0  Mult1  Yes   MULTD  M(A2)  R(F4)
      Mult2  Yes   DIVD          M(A1)  Mult1

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12  ...  F30
   15  FU  Mult1  M(A2)         (M-M+M)  (M-M)  Mult2

• Mult1 (MULTD) completing; what is waiting for it?
Tomasulo Example Cycle 16

Instruction status:
Instruction           Issue  Exec Comp  Write Result
LD     F6   34+  R2     1        3           4
LD     F2   45+  R3     2        4           5
MULTD  F0   F2   F4     3       15          16
SUBD   F8   F6   F2     4        7           8
DIVD   F10  F0   F6     5
ADDD   F6   F8   F2     6       10          11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
  40  Mult2  Yes   DIVD   M*F4   M(A1)

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12  ...  F30
   16  FU  M*F4   M(A2)         (M-M+M)  (M-M)  Mult2

• Just waiting for Mult2 (DIVD) to complete
Tomasulo Example Cycle 55

Instruction status:
Instruction           Issue  Exec Comp  Write Result
LD     F6   34+  R2     1        3           4
LD     F2   45+  R3     2        4           5
MULTD  F0   F2   F4     3       15          16
SUBD   F8   F6   F2     4        7           8
DIVD   F10  F0   F6     5
ADDD   F6   F8   F2     6       10          11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
   1  Mult2  Yes   DIVD   M*F4   M(A1)

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12  ...  F30
   55  FU  M*F4   M(A2)         (M-M+M)  (M-M)  Mult2
Tomasulo Example Cycle 56

Instruction status:
Instruction           Issue  Exec Comp  Write Result
LD     F6   34+  R2     1        3           4
LD     F2   45+  R3     2        4           5
MULTD  F0   F2   F4     3       15          16
SUBD   F8   F6   F2     4        7           8
DIVD   F10  F0   F6     5       56
ADDD   F6   F8   F2     6       10          11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
   0  Mult2  Yes   DIVD   M*F4   M(A1)

Register result status:
Clock      F0     F2     F4     F6       F8     F10    F12  ...  F30
   56  FU  M*F4   M(A2)         (M-M+M)  (M-M)  Mult2
• Mult2 (DIVD) is completing; what is waiting for it?
Tomasulo Example Cycle 57

Instruction status:
Instruction           Issue  Exec Comp  Write Result
LD     F6   34+  R2     1        3           4
LD     F2   45+  R3     2        4           5
MULTD  F0   F2   F4     3       15          16
SUBD   F8   F6   F2     4        7           8
DIVD   F10  F0   F6     5       56          57
ADDD   F6   F8   F2     6       10          11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations:
                          S1     S2     RS     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  Yes   DIVD   M*F4   M(A1)

Register result status:
Clock      F0     F2     F4     F6       F8     F10     F12  ...  F30
   57  FU  M*F4   M(A2)         (M-M+M)  (M-M)  Result

• Once again: In-order issue, out-of-order execution and out-of-order completion.
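The Issue / Exec Comp / Write Result columns of the example can be reproduced with a very small timing model. The sketch below is only an illustration, not a full Tomasulo simulator: the names `LATENCY`, `Instr`, `PROGRAM` and `schedule` are made up here, the latencies (2 cycles for loads and add/subtract, 10 for multiply, 40 for divide) are assumptions read off the tables above, and structural hazards (busy reservation stations, CDB conflicts) are ignored because none occur in this instruction sequence.

```python
from dataclasses import dataclass

# Assumed execution latencies in cycles, inferred from the example tables.
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}


@dataclass
class Instr:
    op: str
    dest: str
    srcs: tuple  # FP source registers only (address registers are ignored)


PROGRAM = [
    Instr("LD", "F6", ()),
    Instr("LD", "F2", ()),
    Instr("MULTD", "F0", ("F2", "F4")),
    Instr("SUBD", "F8", ("F6", "F2")),
    Instr("DIVD", "F10", ("F0", "F6")),
    Instr("ADDD", "F6", ("F8", "F2")),
]


def schedule(program):
    """Return a list of (issue, exec_complete, write_result) cycles.

    Simplifications: one instruction issues per cycle in program order,
    a reservation station is always free, and the CDB is never contended.
    """
    producer = {}  # register -> index of the most recent writer at issue time
    times = []
    for i, ins in enumerate(program):
        issue = i + 1
        # An operand is captured in the cycle its producer writes the CDB.
        ready = max(
            [times[producer[r]][2] for r in ins.srcs if r in producer],
            default=0,
        )
        start = max(issue, ready) + 1          # execution starts the next cycle
        comp = start + LATENCY[ins.op] - 1     # last execution cycle
        write = comp + 1                       # result broadcast on the CDB
        times.append((issue, comp, write))
        producer[ins.dest] = i                 # renaming: later readers wait on i
    return times


if __name__ == "__main__":
    for ins, t in zip(PROGRAM, schedule(PROGRAM)):
        print(ins.op, ins.dest, t)
```

Note that DIVD picks up F6 from the first load, not from the later ADDD, because the producer map is consulted at issue time. That is the renaming effect that removes the WAR hazard on F6.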
Tomasulo Drawbacks
• Complexity
  • delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
  • Each CDB must go to multiple functional units ⇒ high capacitance, high wiring density
  • Number of functional units that can complete per cycle limited to one!
  • Multiple CDBs ⇒ more FU logic for parallel associative stores
• Non-precise interrupts!
  • We will address this later
Tomasulo Loop Example
Loop: LD    F0 0(R1)
      MULTD F4 F0 F2
      SD    F4 0(R1)
      SUBI  R1 R1 #8
      BNEZ  R1 Loop
• This time assume Multiply takes 4 clocks
• Assume 1st load takes 8 clocks (L1 cache miss), 2nd load takes 1 clock (hit)
• To be clear, will show clocks for SUBI, BNEZ
  • Reality: integer instructions ahead of Fl. Pt. instructions
• Show 2 iterations
Loop Example

Instruction status:
ITER  Instruction        Issue  Exec Comp  Write Result
1     LD     F0  0   R1
1     MULTD  F4  F0  F2
1     SD     F4  0   R1
2     LD     F0  0   R1
2     MULTD  F4  F0  F2
2     SD     F4  0   R1

Load and store buffers (added store buffers):
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   No
Store1  No
Store2  No
Store3  No

Reservation stations:
                          S1     S2     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Code (instruction loop):
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status:
Clock  R1      F0     F2  F4     F6  F8  F10  F12  ...  F30
    0  80  Fu

Notes: the ITER column is the iteration count; R1 holds the value of the register used for address and iteration control.
Loop Example Cycle 1Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 80Load2 NoLoad3 NoStore1 NoStore2
NoStore3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1
No SUBI R1 R1 #8Mult2 No BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
1 80 Fu Load1
Loop Example Cycle 2Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No
Load3 NoStore1 NoStore2 NoStore3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1
Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
2 80 Fu Load1 Mult1
Loop Example Cycle 3

Instruction status:
ITER  Instruction        Issue  Exec Comp  Write Result
1     LD     F0  0   R1    1
1     MULTD  F4  F0  F2    2
1     SD     F4  0   R1    3

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   80
Load2   No
Load3   No
Store1  Yes   80    Mult1
Store2  No
Store3  No

Reservation stations:
                          S1     S2     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd         R(F2)  Load1
      Mult2  No

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status:
Clock  R1      F0     F2  F4     F6  F8  F10  F12  ...  F30
    3  80  Fu  Load1      Mult1

• Implicit renaming sets up data flow graph
Loop Example Cycle 4Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0
R1 3 Load3 No
Store1 Yes 80 Mult1Store2 NoStore3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1
Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
4 80 Fu Load1 Mult1
Loop Example Cycle 5Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 No1 SD F4 0
R1 3 Load3 No
Store1 Yes 80 Mult1Store2 NoStore3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1
Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
5 72 Fu Load1 Mult1
Loop Example Cycle 6Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD
F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult1
Store2 NoStore3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1
Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
6 72 Fu Load2 Mult1
Loop Example Cycle 7Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD
F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0
F2 7 Store2 No
Store3 NoReservation Stations: S1 S2 RS
Time Name Busy Op Vj Vk Qj Qk Code:Add1 No LD F0 0 R1Add2 No
MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 Yes Multd R(F2) Load1 SUBI R1
R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
7 72 Fu Load2 Mult2
Loop Example Cycle 8Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD
F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0
F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1
Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ
R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
8 72 Fu Load2 Mult2
Loop Example Cycle 9Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 Load1 Yes 801 MULTD F4 F0 F2 2 Load2 Yes 721 SD
F4 0 R1 3 Load3 No2 LD F0 0 R1 6 Store1 Yes 80 Mult12 MULTD F4 F0
F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1
Yes Multd R(F2) Load1 SUBI R1 R1 #8Mult2 Yes Multd R(F2) Load2 BNEZ
R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
9 72 Fu Load2 Mult2
Loop Example Cycle 10

Instruction status:
ITER  Instruction        Issue  Exec Comp  Write Result
1     LD     F0  0   R1    1        9          10
1     MULTD  F4  F0  F2    2
1     SD     F4  0   R1    3
2     LD     F0  0   R1    6       10
2     MULTD  F4  F0  F2    7
2     SD     F4  0   R1    8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   Yes   72
Load3   No
Store1  Yes   80    Mult1
Store2  Yes   72    Mult2
Store3  No

Reservation stations:
                          S1     S2     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
   4  Mult1  Yes   Multd  M[80]  R(F2)
      Mult2  Yes   Multd         R(F2)  Load2

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status:
Clock  R1      F0     F2  F4     F6  F8  F10  F12  ...  F30
   10  64  Fu  Load2      Mult2
Loop Example Cycle 11Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0
R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4
F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1
3 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #84 Mult2 Yes Multd
M[72] R(F2) BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
11 64 Fu Load3 Mult2
Loop Example Cycle 12Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0
R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4
F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1
2 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #83 Mult2 Yes Multd
M[72] R(F2) BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
12 64 Fu Load3 Mult2
Loop Example Cycle 13Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 Load2 No1 SD F4 0
R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12 MULTD F4
F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1
1 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #82 Mult2 Yes Multd
M[72] R(F2) BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
13 64 Fu Load3 Mult2
Loop Example Cycle 14Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 Load2 No1 SD
F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80 Mult12
MULTD F4 F0 F2 7 Store2 Yes 72 Mult22 SD F4 0 R1 8 Store3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1
0 Mult1 Yes Multd M[80] R(F2) SUBI R1 R1 #81 Mult2 Yes Multd
M[72] R(F2) BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
14 64 Fu Load3 Mult2
Loop Example Cycle 15Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1
SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80
[80]*R22 MULTD F4 F0 F2 7 15 Store2 Yes 72 Mult22 SD F4 0 R1 8
Store3 No
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1
No SUBI R1 R1 #8
0 Mult2 Yes Multd M[72] R(F2) BNEZ R1 LoopRegister result
statusClock R1 F0 F2 F4 F6 F8 F10 F12 ... F30
15 64 Fu Load3 Mult2
Loop Example Cycle 16

Instruction status:
ITER  Instruction        Issue  Exec Comp  Write Result
1     LD     F0  0   R1    1        9          10
1     MULTD  F4  F0  F2    2       14          15
1     SD     F4  0   R1    3
2     LD     F0  0   R1    6       10          11
2     MULTD  F4  F0  F2    7       15          16
2     SD     F4  0   R1    8

Load and store buffers:
Name    Busy  Addr  Fu
Load1   No
Load2   No
Load3   Yes   64
Store1  Yes   80    [80]*R2
Store2  Yes   72    [72]*R2
Store3  No

Reservation stations:
                          S1     S2     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
   4  Mult1  Yes   Multd         R(F2)  Load3
      Mult2  No

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status:
Clock  R1      F0     F2  F4     F6  F8  F10  F12  ...  F30
   16  64  Fu  Load3      Mult1
Loop Example Cycle 17Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1
SD F4 0 R1 3 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80
[80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8
Store3 Yes 64 Mult1
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1
Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
17 64 Fu Load3 Mult1
Loop Example Cycle 18Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1
SD F4 0 R1 3 18 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 Yes 80
[80]*R22 MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8
Store3 Yes 64 Mult1
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1
Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
18 64 Fu Load3 Mult1
Loop Example Cycle 19Instruction status: Exec WriteITER
Instruction j k Issue CompResult Busy Addr Fu
1 LD F0 0 R1 1 9 10 Load1 No1 MULTD F4 F0 F2 2 14 15 Load2 No1
SD F4 0 R1 3 18 19 Load3 Yes 642 LD F0 0 R1 6 10 11 Store1 No2
MULTD F4 F0 F2 7 15 16 Store2 Yes 72 [72]*R22 SD F4 0 R1 8 19
Store3 Yes 64 Mult1
Reservation Stations: S1 S2 RS Time Name Busy Op Vj Vk Qj Qk
Code:
Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1
Yes Multd R(F2) Load3 SUBI R1 R1 #8Mult2 No BNEZ R1 Loop
Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 ...
F30
19 56 Fu Load3 Mult1
Loop Example Cycle 20

Instruction status:
ITER  Instruction        Issue  Exec Comp  Write Result
1     LD     F0  0   R1    1        9          10
1     MULTD  F4  F0  F2    2       14          15
1     SD     F4  0   R1    3       18          19
2     LD     F0  0   R1    6       10          11
2     MULTD  F4  F0  F2    7       15          16
2     SD     F4  0   R1    8       19          20

Load and store buffers:
Name    Busy  Addr  Fu
Load1   Yes   56
Load2   No
Load3   Yes   64
Store1  No
Store2  No
Store3  Yes   64    Mult1

Reservation stations:
                          S1     S2     RS
Time  Name   Busy  Op     Vj     Vk     Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   Multd         R(F2)  Load3
      Mult2  No

Code:
LD    F0 0  R1
MULTD F4 F0 F2
SD    F4 0  R1
SUBI  R1 R1 #8
BNEZ  R1 Loop

Register result status:
Clock  R1      F0     F2  F4     F6  F8  F10  F12  ...  F30
   20  56  Fu  Load1      Mult1

• Once again: In-order issue, out-of-order execution and out-of-order completion.
Why can Tomasulo overlap iterations of loops?
• Register renaming
  • Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
• Reservation stations
  • Permit instruction issue to advance past integer control flow operations
  • Also buffer old values of registers, totally avoiding the WAR stall that we saw in the scoreboard
• Other perspective: Tomasulo builds the data flow dependency graph on the fly
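The renaming effect can be shown in a few lines. This is a minimal sketch under simplifying assumptions: the function name `rename_loop` is hypothetical, and each iteration is simply handed a fresh load buffer and multiply station named after the iteration number, which matches the first two iterations of the loop example but ignores buffer reuse. It records which tag each consumer waits on instead of which architectural register it names.

```python
def rename_loop(iterations=2):
    """Issue LD / MULTD / SD per iteration and record the tag each consumer waits on."""
    reg_status = {}  # architectural register -> tag of its pending producer
    waits = []       # (iteration, consumer instruction, producer tag)
    for it in range(1, iterations + 1):
        load_tag = f"Load{it}"   # assumed: a fresh load buffer per iteration
        mult_tag = f"Mult{it}"   # assumed: a fresh multiply station per iteration

        reg_status["F0"] = load_tag                    # LD    F0 0(R1)
        waits.append((it, "MULTD", reg_status["F0"]))  # MULTD F4 F0 F2 reads F0
        reg_status["F4"] = mult_tag
        waits.append((it, "SD", reg_status["F4"]))     # SD    F4 0(R1) reads F4
    return waits


if __name__ == "__main__":
    for row in rename_loop():
        print(row)
```

Both iterations write F0 and F4, yet the second MULTD waits on Load2 and the second SD on Mult2, so the two iterations form independent dependence chains. This is the sense in which the hardware unrolls the loop dynamically.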
Tomasulo’s scheme offers 2 major advantages
(1) The distribution of the hazard detection logic
  • Distributed reservation stations and the CDB
  • If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by broadcast on the CDB
  • If a centralized register file were used, the units would have to read their results from the registers when register buses are available
(2) The elimination of stalls for WAW and WAR hazards
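The first advantage can be sketched as a broadcast over reservation stations. The `Station` class and `broadcast` function below are illustrative names, not part of any real design; the point is only that one (tag, value) pair on the CDB can make several waiting stations ready in the same cycle, while a station still missing another operand keeps waiting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Station:
    name: str
    vj: Optional[float] = None  # operand values, once available
    vk: Optional[float] = None
    qj: Optional[str] = None    # tags of the producers still being waited on
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.qj is None and self.qk is None


def broadcast(stations, tag, value):
    """Put (tag, value) on the CDB; return the names of stations released by it."""
    released = []
    for rs in stations:
        was_ready = rs.ready()
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
        if rs.ready() and not was_ready:
            released.append(rs.name)
    return released


if __name__ == "__main__":
    stations = [
        Station("Add1", vk=1.0, qj="Mult1"),
        Station("Add2", vj=2.0, qk="Mult1"),
        Station("Add3", qj="Mult1", qk="Load2"),  # still needs Load2 afterwards
    ]
    print(broadcast(stations, "Mult1", 6.0))
```

In this usage Add1 and Add2 are released together by the single Mult1 broadcast, and Add3 captures the value but stays blocked on Load2.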
Multiple Issue
• Allow multiple instructions to issue in a single clock cycle (CPI < 1)
• Two flavors
  • Superscalar
    • Issue varying number of instructions per clock
    • Can be statically (compiler tech.) or dynamically (Tomasulo) scheduled
  • VLIW (Very Long Instruction Word)
    • Issue a fixed number of instructions formatted as a single long instruction or as a fixed instruction packet
Multiple Issue with Dynamic Scheduling
[Figure: Tomasulo-style FP unit. Labels: From Instruction Unit, FP Op Queue, FP Registers, From Mem, Load Buffers (Load1 to Load6), Store Buffers (Store1 to Store3), To Mem, Reservation Stations (Add1 to Add3 for the FP adders, Mult1 and Mult2 for the FP multipliers).]
Issue: 2 instructions per clock cycle
Multiple Issue with Dynamic Scheduling: An Example
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    0(R1), F4
      DADDIU R1, R1, #-8
      BNE    R1, R2, Loop
Assumptions:
• 2-issue processor: can issue any pair of instructions if reservation stations are available
• Resources: ALU (int + effective address), a separate pipelined FP unit for each operation type, branch prediction hardware, 1 CDB
• 2 cc for loads, 3 cc for FP Add
• Branches single issue, branch prediction is perfect
Execution in Dual-issue Tomasulo Pipeline
Iter.  Inst.                 Issue  Exe. (begins)  Mem. Access  Write at CDB  Com.
1      L.D    F0,0(R1)       1      2              3            4             first issue
1      ADD.D  F4,F0,F2       1      5                           8             Wait for L.D
1      S.D    0(R1),F4       2      3              9                          Wait for ADD.D
1      DADDIU R1,R1,#-8      2      4                           5             Wait for ALU
1      BNE    R1,R2,Loop     3      6                                         Wait for DADDIU
2      L.D    F0,0(R1)       4      7              8            9             Wait for BNE
2      ADD.D  F4,F0,F2       4      10                          13            Wait for L.D
2      S.D    0(R1),F4       5      8              14                         Wait for ADD.D
2      DADDIU R1,R1,#-8      5      9                           10            Wait for ALU
2      BNE    R1,R2,Loop     6      11                                        Wait for DADDIU
3      L.D    F0,0(R1)       7      12             13           14            Wait for BNE
3      ADD.D  F4,F0,F2       7      15                          18            Wait for L.D
3      S.D    0(R1),F4       8      13             19                         Wait for ADD.D
3      DADDIU R1,R1,#-8      8      14                          15            Wait for ALU
3      BNE    R1,R2,Loop     9      16                                        Wait for DADDIU
Multiple Issue with Dynamic Scheduling: Resource Usage
Clock  Int ALU    FP ALU    Data Cache  CDB
2      1/L.D
3      1/S.D                1/L.D
4      1/DADDIU                         1/L.D
5                 1/ADD.D               1/DADDIU
6
7      2/L.D
8      2/S.D                2/L.D       1/ADD.D
9      2/DADDIU             1/S.D       2/L.D
10                2/ADD.D               2/DADDIU
11
12     3/L.D
13     3/S.D                3/L.D       2/ADD.D
14     3/DADDIU             2/S.D       3/L.D
15                3/ADD.D               3/DADDIU
16
17
18                                      3/ADD.D
19                          3/S.D
Multiple Issue with Dynamic Scheduling
• DADDIU waits for ALU used by S.D
• Add one ALU dedicated to effective address calculation
• Use 2 CDBs
• Draw table for the dual-issue version of Tomasulo’s pipeline
Multiple Issue with Dynamic Scheduling
Iter.  Inst.                 Issue  Exe. (begins)  Mem. Access  Write at CDB  Com.
1      L.D    F0,0(R1)       1      2              3            4             first issue
1      ADD.D  F4,F0,F2       1      5                           8             Wait for L.D
1      S.D    0(R1),F4       2      3              9                          Wait for ADD.D
1      DADDIU R1,R1,#-8      2      3                           4             Executes earlier
1      BNE    R1,R2,Loop     3      5                                         Wait for DADDIU
2      L.D    F0,0(R1)       4      6              7            8             Wait for BNE
2      ADD.D  F4,F0,F2       4      9                           12            Wait for L.D
2      S.D    0(R1),F4       5      7              13                         Wait for ADD.D
2      DADDIU R1,R1,#-8      5      6                           7             Executes earlier
2      BNE    R1,R2,Loop     6      8
3      L.D    F0,0(R1)       7      9              10           11            Wait for BNE
3      ADD.D  F4,F0,F2       7      12                          15
3      S.D    0(R1),F4       8      10             16
3      DADDIU R1,R1,#-8      8      9                           10
3      BNE    R1,R2,Loop     9      11
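One rough way to compare the two dual-issue schedules is to look at how often the loop branch executes. The sketch below uses only the BNE execution cycles from the two timing tables (6, 11, 16 with one integer ALU and one CDB; 5, 8, 11 with the dedicated address adder and two CDBs). The function names are hypothetical, and treating the spacing between successive BNEs as the steady-state iteration time is an assumption made for illustration.

```python
def cycles_per_iteration(bne_exec_cycles):
    """Average spacing between successive BNE executions."""
    gaps = [b - a for a, b in zip(bne_exec_cycles, bne_exec_cycles[1:])]
    return sum(gaps) / len(gaps)


def sustained_ipc(bne_exec_cycles, instructions_per_iteration=5):
    """Instructions per clock implied by that spacing (5 instructions per loop body)."""
    return instructions_per_iteration / cycles_per_iteration(bne_exec_cycles)


# BNE "Exe. (begins)" cycles read from the two tables.
ONE_ALU_ONE_CDB = [6, 11, 16]
ADDR_ADDER_TWO_CDB = [5, 8, 11]

if __name__ == "__main__":
    for name, cycles in [("1 ALU, 1 CDB", ONE_ALU_ONE_CDB),
                         ("address adder, 2 CDBs", ADDR_ADDER_TWO_CDB)]:
        print(name, cycles_per_iteration(cycles), round(sustained_ipc(cycles), 2))
```

Under these assumptions the first configuration executes one iteration every 5 cycles (about 1 instruction per clock despite dual issue), while the second reaches one iteration every 3 cycles, matching the issue rate.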
Multiple Issue with Dynamic Scheduling: Resource Usage
Clock  Int ALU    Adr. Adder  FP ALU    Data Cache  CDB#1     CDB#2
2                 1/L.D
3      1/DADDIU   1/S.D                 1/L.D
4                                                   1/L.D     1/DADDIU
5                             1/ADD.D
6      2/DADDIU   2/L.D
7                 2/S.D                 2/L.D       2/DADDIU
8                                                   1/ADD.D   2/L.D
9      3/DADDIU   3/L.D       2/ADD.D   1/S.D
10                3/S.D                 3/L.D       3/DADDIU
11                                                  3/L.D
12                            3/ADD.D               2/ADD.D
13                                      2/S.D
14
15                                                  3/ADD.D
16                                      3/S.D
What about Precise Interrupts?
• Tomasulo had: in-order issue, out-of-order execution, and out-of-order completion
• Need to “fix” the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream
Hardware-based Speculation
• With wide issue processors control dependences become a burden, even with sophisticated branch predictors
• Speculation: speculate on the outcome of branches and execute the program as if our guesses were correct => need a mechanism to handle situations when the speculations were incorrect
Relationship between precise interrupts and speculation
• Speculation is a form of guessing
• Important for branch prediction:
  • Need to “take our best shot” at predicting branch direction
• If we speculate and are wrong, need to back up and restart execution at the point at which we predicted incorrectly:
  • This is exactly the same as precise exceptions!
• Technique for both precise interrupts/exceptions and speculation: in-order completion or commit
HW support for precise interrupts
• Need HW buffer for results of uncommitted instructions: reorder buffer (ROB)
  • 4 fields: instr. type, destination, value, ready
  • Use reorder buffer number instead of reservation station when execution completes
  • Supplies operands between execution complete & commit
  • (Reorder buffer can be operand source => more registers like RS)
  • Instructions commit
  • Once instruction commits, result is put into register
  • As a result, easy to undo speculated instructions on mispredicted branches or exceptions
[Figure: FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adders]
Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution: operate on operands (EX)
   When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
4. Commit: update register with reorder result
   When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)
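The commit step can be sketched as a queue that retires only from its head. The `ReorderBuffer` class below is a toy model with invented method and field names; it keeps just the four fields mentioned earlier (type, destination, value, ready) plus a misprediction flag, and it omits reservation stations, stores and exceptions entirely.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: results may arrive out of order, commits are in order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left (head)
        self.regs = {}          # committed architectural register state
        self.next_id = 0

    def issue(self, kind, dest=None):
        """Allocate a ROB slot in program order and return its number."""
        entry = {"id": self.next_id, "kind": kind, "dest": dest,
                 "value": None, "ready": False, "mispredicted": False}
        self.next_id += 1
        self.entries.append(entry)
        return entry["id"]

    def write_result(self, rob_id, value=None, mispredicted=False):
        """Record a result broadcast on the CDB for the given ROB slot."""
        for e in self.entries:
            if e["id"] == rob_id:
                e["value"], e["ready"], e["mispredicted"] = value, True, mispredicted

    def commit(self):
        """Retire ready instructions from the head; return the committed slot numbers."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()
            committed.append(e["id"])
            if e["kind"] == "branch" and e["mispredicted"]:
                self.entries.clear()  # flush all younger, speculative instructions
                break
            if e["dest"] is not None:
                self.regs[e["dest"]] = e["value"]
        return committed


if __name__ == "__main__":
    rob = ReorderBuffer()
    a = rob.issue("fp", "F0")
    b = rob.issue("branch")
    c = rob.issue("fp", "F4")       # speculative: younger than the branch
    rob.write_result(c, 3.0)        # finishes first, but cannot commit yet
    print(rob.commit())
    rob.write_result(a, 1.5)
    print(rob.commit(), rob.regs)
    rob.write_result(b, mispredicted=True)
    print(rob.commit(), rob.regs)   # F4 never reaches the register file
```

Because registers are updated only at commit, undoing the speculative write to F4 costs nothing more than discarding its ROB entry.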
What are the hardware complexities with reorder buffer (ROB)?
• How do you find the latest version of a register?
  • (As specified by Smith paper) need associative comparison network
  • Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value
• Need as many ports on ROB as register file
[Figure: FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adders, Compar network. Reorder Table fields: Dest Reg, Result, Exceptions?, Valid, Program Counter]
Summary
• Reservation stations: implicit register renaming to larger set of registers + buffering source operands
  • Prevents registers as bottleneck
  • Avoids WAR, WAW hazards of Scoreboard
  • Allows loop unrolling in HW
• Not limited to basic blocks (integer unit gets ahead, beyond branches)
• Today, helps cache misses as well
  • Don’t stall for L1 Data cache miss (insufficient ILP for L2 miss?)
• Lasting Contributions
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
• 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264