Speculative Execution
CS510 Computer Architectures, Lecture 11: Trace Scheduling, Conditional Execution, Speculation, Limits of ILP

Dec 14, 2015


Nicole Embrey
Transcript
Speculative Execution CS510 Computer Architectures Lecture 11 - 1

Lecture 11: Trace Scheduling, Conditional Execution, Speculation, Limits of ILP

Speculative Execution CS510 Computer Architectures Lecture 11 - 2

Trace Scheduling

• Parallelism across IF branches vs. LOOP branches
– Trace scheduling works when the behavior of the branches is fairly predictable at compile time

• Two steps:
– Trace Selection
  • Find a likely sequence of basic blocks (a trace) that forms a long, statically predicted sequence of straight-line code
– Trace Compaction
  • Squeeze the trace into a few VLIW instructions
  • Need bookkeeping code in case the prediction is wrong
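The trace selection step can be sketched in a few lines. This is a minimal illustration, not the algorithm of any particular compiler: the control-flow graph, the block names, and the branch probabilities are all hypothetical, and the trace simply follows the most probable successor until the path ends or loops back.

```python
def select_trace(successors, start):
    """Follow the most probable successor from `start` until the path ends or revisits a block."""
    trace = [start]
    seen = {start}
    block = start
    while successors.get(block):
        # Pick the successor with the highest statically predicted probability.
        nxt, _prob = max(successors[block], key=lambda edge: edge[1])
        if nxt in seen:  # stop at a loop back-edge
            break
        trace.append(nxt)
        seen.add(nxt)
        block = nxt
    return trace


# Hypothetical CFG: block A branches to B (90%) or C (10%); both rejoin at X.
cfg = {
    "A": [("B", 0.9), ("C", 0.1)],
    "B": [("X", 1.0)],
    "C": [("X", 1.0)],
    "X": [],
}
print(select_trace(cfg, "A"))  # ['A', 'B', 'X']
```

Block C is left off the trace, which is why bookkeeping code is needed for the case where the branch goes the other way.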

Speculative Execution CS510 Computer Architectures Lecture 11 - 3

Trace Scheduling

[Flowchart: A[i] = A[i] + B[i], followed by the test A[i] = 0. The True (T) path executes B[i] = ...; the False (F) path executes X; both paths rejoin at C[i] = ...]

Select this trace if the True branch is taken more frequently.

Trace compaction by speculation
– Move the code associated with B and C to make VLIW word(s) before the branch
– This may cause exceptions when executed

Speculation should not introduce any new exception*

* See the kinds of exceptions on page 179

Speculative Execution CS510 Computer Architectures Lecture 11 - 4

HW Support for More ILP: Conditional Instructions

• Avoid branch prediction by turning branches into conditionally executed instructions:
    if (x) then A = B op C else NOP
– If false, then neither stores the result nor causes an exception*
– Expanded ISAs of Alpha, MIPS, and SPARC have conditional move; PA-RISC can annul any following instruction

• Drawbacks to conditional instructions
– Still takes a clock even if “annulled”
– Stall if the condition is evaluated late
– Complex conditions reduce effectiveness; the condition becomes known late in the pipeline

* See the kinds of exceptions on page 179

Speculative Execution CS510 Computer Architectures Lecture 11 - 5

HW Support for More ILP: Conditional Instructions

Example: a branch around a move can be replaced by a conditional move.

        BNZ R1,L            is equivalent to    CMOVZ R2,R3,R1
        MOV R2,R3
    L:

Two-issue superscalar: each cycle can issue one memory reference and one ALU (or branch) operation.

    First instruction slot      Second instruction slot
    LW   R1,40(R2)              ADD R3,R4,R5
                                ADD R6,R3,R7
    BEQZ R10,L
    LW   R8,20(R10)
    LW   R9,0(R8)

The memory slot in the second cycle is wasted (green on the slide), and LW R9,0(R8) is data dependent on LW R8,20(R10) (red on the slide).

With a conditional load (LWC):

    First instruction slot      Second instruction slot
    LW   R1,40(R2)              ADD R3,R4,R5
    LWC  R8,20(R10),R10         ADD R6,R3,R7
    BEQZ R10,L
    LW   R9,0(R8)

The load executes only when R10 ≠ 0; i.e., LWC is the same as LW unless its third operand is 0. LWC must have no effect if the condition is not satisfied: it can neither write the result nor cause any exception.
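The semantics of the two conditional instructions on this slide can be modeled as follows. This is a sketch under simplifying assumptions: registers and memory are plain Python dictionaries, and the function names `cmovz` and `lwc` are illustrative helpers, not part of any real simulator.

```python
def cmovz(regs, rd, rs, rc):
    """CMOVZ rd, rs, rc: copy rs into rd only if rc holds zero."""
    if regs[rc] == 0:
        regs[rd] = regs[rs]


def lwc(regs, memory, rd, offset, base, rc):
    """LWC rd, offset(base), rc: load only if rc is non-zero; otherwise no effect."""
    if regs[rc] != 0:
        regs[rd] = memory[regs[base] + offset]
    # When rc == 0 the address is never touched, so no exception can be raised.


regs = {"R1": 0, "R2": 5, "R3": 9, "R8": 0, "R10": 0}
memory = {120: 77}

cmovz(regs, "R2", "R3", "R1")              # R1 == 0, so R2 <- R3
lwc(regs, memory, "R8", 20, "R10", "R10")  # R10 == 0: annulled, address 20 is never read
regs["R10"] = 100
lwc(regs, memory, "R8", 20, "R10", "R10")  # R10 != 0: loads memory[120]
print(regs["R2"], regs["R8"])  # 9 77
```

The annulled LWC would have faulted on address 20 (absent from `memory`) had it been an ordinary load, which is the situation the conditional form is meant to avoid.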

Speculative Execution CS510 Computer Architectures Lecture 11 - 6

HW Support for More ILP: Speculation

• Speculation: allow an instruction that depends on a branch (predicted to be taken) to issue, with no consequences (including exceptions) if the branch is not actually taken (“HW undo”)
– Allows the execution of an instruction before the processor knows that the instruction should execute (i.e., it avoids control dependence stalls)

• Often combined with dynamic scheduling (Tomasulo)
– Separate speculative bypassing of results from real bypassing of results
– When an instruction is no longer speculative, write its results (instruction commit)
– Execute out of order but commit in order

Speculative Execution CS510 Computer Architectures Lecture 11 - 7

Compiler Speculation with HW Support: (1) HW-SW Cooperation for Speculation

• HW undo for misprediction
– Simply handle all resumable exceptions when the exception occurs
– Simply return an undefined value for any exception that would cause termination

Source statement:

    if (A==0) A = B; else A = A + 4;

Compiled code:

        LW   R1, 0(R3)     ; load A
        BNEZ R1, L1        ; test A
        LW   R1, 0(R2)     ; if clause
        J    L2            ; skip else
    L1: ADD  R1, R1, 4     ; else clause
    L2: SW   0(R3), R1     ; store A

Compiled code using compiler-based speculation*:

        LW   R1, 0(R3)     ; load A
        LW   R14, 0(R2)    ; speculative load B
        BEQZ R1, L3        ; other branch of the if
        ADD  R14, R1, 4    ; the else clause
    L3: SW   0(R3), R14    ; nonspeculative store

* Assumes the then clause is almost always executed. Register renaming is required, hence the need for an extra register (R14).

Speculative Execution CS510 Computer Architectures Lecture 11 - 8

Compiler Speculation with HW Support: (2) Speculation with Poison Bits

• Speculation with Poison Bits
– Allows compiler speculation with less change to the exception behavior
– A poison bit is added to every register
– Another bit is added to every instruction to indicate whether the instruction is speculative

        LW   R1, 0(R3)     ; load A
        LW*  R14, 0(R2)    ; speculative load B
        BEQZ R1, L3        ; other branch of the if
        ADD  R14, R1, 4    ; the else clause
    L3: SW   0(R3), R14    ; nonspeculative store

If the speculative LW* generates a terminating exception, the poison bit of R14 will be set. When the nonspeculative SW instruction executes, it will raise an exception if the poison bit for R14 is on.
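The poison-bit behavior can be simulated with a small register-file model. This is a sketch with several assumptions that are not stated on the slide: an address missing from the `memory` dictionary stands in for a terminating exception, a normal write to a register clears its poison bit, and the class and function names (`PoisonRegs`, `run`) are made up for illustration.

```python
class PoisonRegs:
    def __init__(self):
        self.value = {}
        self.poison = {}

    def speculative_load(self, rd, memory, addr):
        """LW*: on a terminating exception, set the poison bit instead of trapping."""
        if addr in memory:
            self.value[rd] = memory[addr]
            self.poison[rd] = False
        else:  # stand-in for a terminating exception (assumption)
            self.value[rd] = None
            self.poison[rd] = True

    def write(self, rd, val):
        """A normal write to rd; assumed to clear the poison bit."""
        self.value[rd] = val
        self.poison[rd] = False

    def store(self, rs, memory, addr):
        """Nonspeculative SW: raise the deferred exception if the source is poisoned."""
        if self.poison.get(rs, False):
            raise RuntimeError(f"deferred exception: {rs} is poisoned")
        memory[addr] = self.value[rs]


def run(memory, addr_a, addr_b):
    """Execute the slide's sequence for: if (A == 0) A = B; else A = A + 4;"""
    regs = PoisonRegs()
    regs.write("R1", memory[addr_a])              # LW   R1, 0(R3)
    regs.speculative_load("R14", memory, addr_b)  # LW*  R14, 0(R2)
    if regs.value["R1"] != 0:                     # BEQZ R1, L3 not taken
        regs.write("R14", regs.value["R1"] + 4)   # ADD  R14, R1, 4
    regs.store("R14", memory, addr_a)             # L3: SW 0(R3), R14
    return memory[addr_a]


print(run({0: 0, 8: 42}, 0, 8))  # 42: A was 0, so A = B
print(run({0: 3}, 0, 8))         # 7: the faulting speculative load was not needed
```

In the second call the speculative load of B faults, but the else clause overwrites R14 before the store, so no exception is ever reported. Calling `run({0: 0}, 0, 8)` raises, because the poisoned value is actually used.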

Speculative Execution CS510 Computer Architectures Lecture 11 - 9

Compiler Speculation with HW Support

• The main disadvantages of the two previous schemes
– The need to introduce copies to deal with register renaming
– The possibility of exhausting the registers

• Speculative Instructions with Renaming (Boosting)
– Flagging the instructions that are moved past branches as speculative
– Providing renaming and buffering in the HW

Speculative Execution CS510 Computer Architectures Lecture 11 - 10

Compiler Speculation with HW Support: (3) Speculative Instructions with Renaming

• Extra register is no longer necessary
• Result of the boosted instruction is not written into R1 until after the branch
• Other boosted instructions could use the results of the boosted load

        LW   R1, 0(R3)     ; load A
        LW+  R1, 0(R2)     ; boosted load B
        BEQZ R1, L3        ; other branch of the if
        ADD  R1, R1, 4     ; the else clause
    L3: SW   0(R3), R1     ; nonspeculative store

If the branch goes as predicted, the result of the boosted load is written to R1; otherwise it is never written to R1.
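A rough model of boosting for this code sequence is shown below. It assumes the hardware keeps the boosted result in a separate buffer (here the hypothetical variable `shadow_r1`) and copies it into R1 only when the branch resolves in the predicted direction; the function name `run_boosted` is illustrative.

```python
def run_boosted(a, b):
    """Simulate the boosted sequence for: if (A == 0) A = B; else A = A + 4;"""
    r1 = a            # LW   R1, 0(R3)  ; load A
    shadow_r1 = b     # LW+  R1, 0(R2)  ; boosted load B, held in a shadow buffer
    if r1 == 0:       # BEQZ R1, L3 taken, as predicted
        r1 = shadow_r1    # the boosted result is committed to R1
    else:
        # Misprediction: the shadow value is discarded and never reaches R1.
        r1 = r1 + 4   # ADD  R1, R1, 4  ; the else clause
    return r1         # L3: SW 0(R3), R1


print(run_boosted(0, 9))  # 9
print(run_boosted(3, 9))  # 7
```

Because the ADD still reads the original value of A from R1, no extra architectural register such as R14 is needed.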

Speculative Execution CS510 Computer Architectures Lecture 11 - 11

Hardware-based Speculation

• Hardware-based speculation combines
– Dynamic branch prediction
– Speculation to allow the execution of instructions before the control dependencies are resolved
– Dynamic scheduling to deal with the scheduling of different combinations of basic blocks

• Advantages
– Dynamic run-time disambiguation of memory addresses
– Hardware-based branch prediction
– A completely precise exception model
– Does not require compensation or bookkeeping code
– Does not require different code sequences to achieve good performance for different implementations

Speculative Execution CS510 Computer Architectures Lecture 11 - 12

HW-based Speculation

• Need a HW buffer for the results of uncommitted instructions: the reorder buffer
– Reorder buffer can be an operand source
– Once an instruction commits, its result is found in the register
– 3 fields: instruction type, destination, value
– Use the reorder buffer number instead of the reservation station number
– Instructions commit in order
– As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: block diagram showing the FP Op Queue, the Reorder Buffer, the FP Regs, reservation stations feeding two FP adders, and the load path from memory (LD).]

Speculative Execution CS510 Computer Architectures Lecture 11 - 13

4 Steps of Speculative Tomasulo Algorithm

1. Issue: get an instruction from the FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number to the reservation station.

2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result.

3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.

4. Commit: update register with the reorder buffer result
   When an instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer.
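The commit step is what keeps execution out of order but state updates in order. The sketch below models only that part: reservation stations and the CDB are omitted, each entry stores a destination and value (the instruction-type field is left out for brevity), and the class and method names are illustrative assumptions.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left (head)
        self.registers = {}

    def issue(self, dest):
        """Allocate a reorder buffer slot in program order."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        """Record a finished result; it is not yet visible in the register file."""
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire finished instructions from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["dest"])
        return committed

    def flush(self):
        """Discard all uncommitted entries, e.g. on a mispredicted branch."""
        self.entries.clear()


rob = ReorderBuffer()
first = rob.issue("F0")
second = rob.issue("F2")
rob.write_result(second, 2.0)  # the younger instruction finishes first
print(rob.commit())            # []: the head (F0) is not done yet
rob.write_result(first, 1.0)
print(rob.commit())            # ['F0', 'F2']
```

Since nothing reaches `registers` before commit, undoing speculation amounts to calling `flush()`.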

Speculative Execution CS510 Computer Architectures Lecture 11 - 14

Limits to ILP

Conflicting studies of the amount of parallelism available in the late 1980s and early 1990s. Different assumptions about:
– Benchmarks (vectorized Fortran FP vs. integer C programs)
– Hardware sophistication
– Compiler sophistication

Speculative Execution CS510 Computer Architectures Lecture 11 - 15

Limits to ILP

HW model for ultimate issue performance; MIPS compilers

1. Register renaming: infinite virtual registers, so all WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation and an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses are known, and a store can be moved before a load provided the addresses are not equal

1 cycle latency for all instructions
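Under these idealized assumptions, only true (RAW) data dependences limit issue, so the available ILP of a trace is its instruction count divided by the length of its longest dependence chain. The following is a minimal sketch of that calculation; the instruction encoding (a destination register and a list of source registers) and the example trace are hypothetical, and memory dependences are ignored.

```python
def ideal_ilp(instructions):
    """instructions: list of (dest, [sources]) in program order.

    Models infinite renaming, perfect prediction, and 1-cycle latency:
    each instruction executes one cycle after its latest producer.
    """
    ready_cycle = {}  # register -> cycle in which its most recent value is produced
    depth = 0
    for dest, sources in instructions:
        cycle = 1 + max((ready_cycle.get(src, 0) for src in sources), default=0)
        ready_cycle[dest] = cycle  # a later write to dest gets a fresh name (no WAW/WAR)
        depth = max(depth, cycle)
    return len(instructions) / depth if depth else 0.0


trace = [
    ("R1", ["R2"]),
    ("R3", ["R4"]),
    ("R5", ["R1", "R3"]),  # depends on the first two instructions
    ("R1", ["R6"]),        # reuses the name R1, but renaming removes the hazard
]
print(ideal_ilp(trace))  # 2.0
```

The four instructions finish in two cycles, so the ideal machine sustains two issues per cycle on this trace.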

Speculative Execution CS510 Computer Architectures Lecture 11 - 16

Upper Limit to ILP

[Bar chart: instruction issues per cycle for each program]

    Program    Type             Issues per cycle
    gcc        Integer          54.8
    espresso   Integer          62.6
    li         Integer          17.9
    fpppp      Floating point   75.2
    doduc      Floating point   118.7
    tomcatv    Floating point   150.1

Speculative Execution CS510 Computer Architectures Lecture 11 - 17

Limitations on Window Size and Maximum Issue Count

• Window: the set of instructions examined for simultaneous execution
– n instructions: determining whether they have any register dependencies among them requires
    $(2n - 2) + (2n - 4) + \dots + 2 = n^2 - n$ comparisons
  • 2000 instructions -- about 4 million comparisons
  • 50 instructions -- 2450 comparisons
– Current technology: window size of 4 to 32
  • requires about 900 comparisons

• Multiple issue -- lengthens the clock cycle
– Typically have clock cycles that are 1.5 to 3 times longer
– Typically have CPIs that are 2 to 3 times lower
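The comparison counts quoted above can be checked directly by summing the series term by term. The function name below is an illustrative choice.

```python
def register_comparisons(n):
    """Sum (2n - 2) + (2n - 4) + ... + 2, which equals n**2 - n."""
    return sum(2 * k for k in range(1, n))


for n in (32, 50, 2000):
    assert register_comparisons(n) == n * n - n

print(register_comparisons(50))    # 2450
print(register_comparisons(2000))  # 3998000, about 4 million
print(register_comparisons(32))    # 992
```

The quadratic growth is the point: going from a 50-entry to a 2000-entry window multiplies the comparison count by more than 1600.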

Speculative Execution CS510 Computer Architectures Lecture 11 - 18

Window Size Impact

[Bar chart: instruction issues per cycle by window size; gcc, espresso, and li are integer programs; fpppp, doduc, and tomcatv are FP programs]

    Window size   gcc   espresso   li   fpppp   doduc   tomcatv
    Infinite       55         63   18      75     119       150
    2k             35         40   17      60      60        60
    512            10         15   12      49      16        45
    128            10         13   11      35      15        34
    32              8          8    9      14       9        14
    8               4          4    4       5       4         6
    4               3          3    3       3       3         3

Speculative Execution CS510 Computer Architectures Lecture 11 - 19

More Realistic HW: Branch Impact

Window of 2000 and maximum issue of 64 instructions per clock cycle

[Bar chart: instruction issues per cycle (0 to 60) for gcc, espresso, li, fpppp, doduc, and tomcatv under five branch prediction schemes: Perfect; Selective predictor (correlation + BHT); Standard 2-bit BHT (512); Static (profile); None]

Speculative Execution CS510 Computer Architectures Lecture 11 - 20

Selective History Predictor

[Diagram: the branch address indexes a non-correlating predictor (8192 x 2 bits) and, together with 2 bits of global history (00, 01, 10, 11), a correlating predictor (2048 x 4 x 2 bits). An 8K x 2-bit selector chooses which prediction (Taken/Not Taken) to use.]

– Selector counter: 11 or 10 chooses the non-correlator; 01 or 00 chooses the correlator
– Prediction counter: 11 or 10 predicts Taken; 01 or 00 predicts Not Taken
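A selective predictor of this shape can be sketched as three tables of 2-bit saturating counters. The table sizes follow the slide; everything else is an assumption made for illustration: the indexing by address modulo table size, the initial counter values, and the selector update rule (move toward whichever component was right when the two disagree).

```python
class SelectivePredictor:
    def __init__(self, bits=13, history_bits=2):
        self.size = 1 << bits            # 8192 entries by default
        self.history_bits = history_bits
        self.local = [1] * self.size     # non-correlating 2-bit counters
        self.corr = [1] * self.size      # 2048 rows x 4 correlating 2-bit counters
        self.selector = [2] * self.size  # 2-bit selector; >= 2 picks the non-correlator
        self.history = 0                 # outcomes of the last 2 branches

    def _indices(self, addr):
        i = addr % self.size
        rows = self.size >> self.history_bits
        j = (addr % rows) * (1 << self.history_bits) + self.history
        return i, j

    def predict(self, addr):
        i, j = self._indices(addr)
        counter = self.local[i] if self.selector[i] >= 2 else self.corr[j]
        return counter >= 2              # 11/10 -> taken, 01/00 -> not taken

    def update(self, addr, taken):
        i, j = self._indices(addr)
        local_ok = (self.local[i] >= 2) == taken
        corr_ok = (self.corr[j] >= 2) == taken
        # Train the selector only when exactly one component was correct.
        if local_ok and not corr_ok:
            self.selector[i] = min(3, self.selector[i] + 1)
        elif corr_ok and not local_ok:
            self.selector[i] = max(0, self.selector[i] - 1)
        step = 1 if taken else -1
        self.local[i] = min(3, max(0, self.local[i] + step))
        self.corr[j] = min(3, max(0, self.corr[j] + step))
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


# A branch that alternates taken / not taken defeats a plain 2-bit counter,
# but the correlating component learns it and the selector switches over.
predictor = SelectivePredictor()
hits = 0
for i in range(100):
    taken = i % 2 == 0
    hits += predictor.predict(0x40) == taken
    predictor.update(0x40, taken)
print(f"{hits} of 100 predictions correct")
```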

Speculative Execution CS510 Computer Architectures Lecture 11 - 21

More Realistic HW: Register Impact

2000 instr window, 64 instr issue, 8K 2-level prediction

[Bar chart: instruction issues per cycle (0 to 60) for gcc, espresso, li, fpppp, doduc, and tomcatv with Infinite, 256, 128, 64, 32, and None* renaming registers]

* DLX: 31 integer registers / 16 FP registers

Speculative Execution CS510 Computer Architectures Lecture 11 - 22

More Realistic HW: Alias Impact

2000 instr window, 64 instr issue, 8K 2-level prediction, 256 renaming registers

[Bar chart: instruction issues per cycle (0 to 50) for gcc, espresso, li, fpppp, doduc, and tomcatv under four alias analysis models: Perfect; Global/stack perfect+; Inspection#; None*]

* All memory accesses are assumed to conflict
+ Ongoing research
# Most commercial compilers

Speculative Execution CS510 Computer Architectures Lecture 11 - 23

Realistic HW for 90s: Window Impact

Realistic HW in the 90s: perfect disambiguation (HW), 1K selective prediction, 16-entry return, 64 registers, issue as many as window

[Bar chart: instruction issues per cycle (0 to 60) for gcc, espresso, li, fpppp, doduc, and tomcatv with window sizes Infinite, 256, 128, 64, 32, 16, 8, and 4]

Speculative Execution CS510 Computer Architectures Lecture 11 - 24

Fallacies and Pitfalls

Fallacy: Processors with lower CPIs will always be faster.
– Sophisticated pipelines typically have slower clock rates than processors with simple pipelines
– Example:
  • IBM Power-2 (low CPI): two FP and two load-store, clock rate 71.5 MHz (slower clock rate)
  • DEC Alpha 21064 (high CPI): dual-issue with one load-store and one FP, 200 MHz (faster clock rate)
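The fallacy follows from the basic performance equation, CPU time = instruction count x CPI / clock rate. The sketch below uses the two clock rates from the slide; the CPI values and the instruction count are hypothetical numbers chosen only to illustrate the trade-off, not measurements of either processor.

```python
def cpu_time_seconds(instructions, cpi, clock_mhz):
    """CPU time = instruction count x CPI / clock rate."""
    return instructions * cpi / (clock_mhz * 1e6)


INSTRUCTIONS = 1_000_000_000  # hypothetical program size

# Hypothetical CPIs: the wide machine has half the CPI of the fast-clock machine.
power2_time = cpu_time_seconds(INSTRUCTIONS, cpi=0.8, clock_mhz=71.5)
alpha_time = cpu_time_seconds(INSTRUCTIONS, cpi=1.6, clock_mhz=200.0)

print(f"Power-2: {power2_time:.2f} s, Alpha: {alpha_time:.2f} s")
print("lower CPI wins" if power2_time < alpha_time else "faster clock wins")
```

With these assumed CPIs the 200 MHz machine finishes first even though its CPI is twice as high, because its clock is about 2.8 times faster.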

Speculative Execution CS510 Computer Architectures Lecture 11 - 25

Brainiac vs. Speed Demon

[Bar chart: SPEC ratio (0 to 900) by benchmark: espresso, li, eqntott, compress, sc, gcc, spice, doduc, mdljdp2, wave5, tomcatv, ora, alvinn, ear, mdljsp2, swm256, su2cor, hydro2d, nasa, fpppp]

6-scalar IBM Power-2 @ 71.5 MHz (5 stage pipe) vs. 2-scalar Alpha @ 200 MHz (7 stage pipe)

Speculative Execution CS510 Computer Architectures Lecture 11 - 26

Recent High Performance Processors

Year = year shipped in systems; Clock = initial clock rate (MHz); Max, Load-store, Int ALU, FP, and Branch = issue capability; SPEC = measure or estimate.

    Processor        Year  Clock  Issue structure  Scheduling  Max  Load-store  Int ALU  FP  Branch  SPEC
    IBM Power-2      1994  67     Dynamic          Static      6    2           2        2   2       95 int, 270 FP
    Intel Pentium    1994  66     Dynamic          Static      2    2           2        1   1       65 int, 65 FP
    DEC Alpha 21164  1995  300    Static           Static      4    2           2        2   1       330 int, 500 FP
    Sun Ultra-       1995  167    Dynamic          Static      4    1           1        1   1       275 int, 305 FP
    Intel P6         1995  150    Dynamic          Dynamic     3    1           2        1   1       >200 int
    PowerPC 620      1995  133    Dynamic          Dynamic     4    1           1        1   2       25 int, 300 FP
    MIPS R10000      1996  200    Dynamic          Dynamic     4    1           2        2   1       300 int, 600 FP
    HP 8000          1996  200    Dynamic          Static      4    2           2        2   1       >360 int, >550 FP