Page 1: HY425 Lecture 08: Limits of ILP

Outline: Recap, Limits of ILP, Case study: Intel Netburst, Conclusion

HY425 Lecture 08: Limits of ILP

Dimitrios S. Nikolopoulos

University of Crete and FORTH-ICS

October 31, 2011

Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 1 / 41

Page 2: HY425 Lecture 08: Limits of ILP

ILP techniques

Hardware

- Dynamic scheduling with scoreboard
- Dynamic scheduling with renaming
  - Tomasulo, renaming registers
- Branch prediction
- Multiple issue
- Speculation

Software

- Instruction scheduling
- Code transformations (topic of next lecture)

Page 3: HY425 Lecture 08: Limits of ILP

What limits ILP

Software and hardware issues

- Limits of parallelism in programs
  - Data flow – true data dependencies
  - Control flow – control dependencies
  - Code generation, scheduling by compiler
- Hardware complexity
  - Large storage structures – branch prediction, ROB, window
  - Complex logic – dependence control, associative searches
  - Higher bandwidth – multiple issue, multiple outstanding instructions
  - Long latencies – memory system (caches, DRAM)

Page 4: HY425 Lecture 08: Limits of ILP

What is the upper bound of ILP in programs?

Roofline model of performance analysis

- Study of maximum ILP in programs
- Difficult question dependent on hardware and compiler technology
- Different conclusions with different assumptions
- Optimistic (unrealistic) assumptions of hardware
  - Unlimited storage resources for tables, etc.
  - Perfect branch prediction and speculation
  - Others ...

Page 5: HY425 Lecture 08: Limits of ILP

David Wall (DEC, WRL Technical Report, 1993)

Assumptions

- Infinite virtual registers available for renaming
- Branch prediction is perfect
- All branch targets are perfectly predicted
  - No control dependencies
- All memory addresses known
  - Can move loads before unrelated prior stores
- No dependencies other than true data dependencies

Page 6: HY425 Lecture 08: Limits of ILP

David Wall (DEC, WRL Technical Report, 1993)

Assumptions

- Unlimited instruction window size
- Unlimited number of functional units
- All functional units compute in one cycle
- Perfect caches

Page 7: HY425 Lecture 08: Limits of ILP

David Wall (DEC, WRL Technical Report, 1993)

Comparison to a realistic processor – Alpha 21264

- Four-way instruction issue
- 80 renaming registers
- Branch predictor:
  - 1024-branch history, 2 × 8K branch patterns

Methodology

- Collect trace of instructions and memory references
- Schedule each instruction "by hand" as early as possible
  - Wait until data dependence resolved
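
The methodology above can be sketched in a few lines. This is a minimal illustration, not Wall's actual tool: the trace format `(dest, sources)`, the function name `ideal_ilp`, and the unit-latency assumption are all hypothetical choices made here, and memory dependences are ignored.

```python
def ideal_ilp(trace):
    """Return instructions per cycle for an idealized dataflow schedule.

    trace: program-order list of (dest, sources) pairs, where dest is a
    register name (or None) and sources is a list of register names.
    Assumes unit latency, unlimited issue width, and perfect renaming,
    so only true (read-after-write) dependences delay an instruction.
    """
    ready = {}       # register -> cycle in which its newest value is produced
    last_cycle = 0
    for dest, sources in trace:
        # Issue one cycle after the latest producer among the sources;
        # registers never written in the trace are ready at cycle 0.
        cycle = 1 + max((ready.get(s, 0) for s in sources), default=0)
        if dest is not None:
            ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(trace) / last_cycle if trace else 0.0


example = [
    ("r1", []),            # cycle 1
    ("r2", []),            # cycle 1
    ("r3", ["r1", "r2"]),  # cycle 2
    ("r4", ["r3"]),        # cycle 3
    ("r5", []),            # cycle 1
]
print(ideal_ilp(example))  # 5 instructions scheduled over 3 cycles
```

Because each destination simply overwrites its entry in `ready`, name (WAR/WAW) dependences never delay anything, which mirrors the infinite-renaming assumption.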

Page 8: HY425 Lecture 08: Limits of ILP

David Wall (DEC, WRL Technical Report, 1993)

Theoretical maximum ILP

- gcc, espresso, li: integer programs
- fpppp, doduc, tomcatv: floating-point programs

[Figure: Maximum ILP, in instruction issues per cycle: gcc 55, espresso 63, li 18, fpppp 75, doduc 119, tomcatv 150]

Page 9: HY425 Lecture 08: Limits of ILP

Limiting the instruction window size

Perfect processor

- Can look arbitrarily far ahead to fetch instructions
- Can rename output registers for all instructions that can issue
- Can determine data dependencies for all instructions
  - O(n^2) for n instructions
- Provide functional units for all instructions
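
The quadratic cost can be made concrete with a short count. As a sketch, assume (this is not stated on the slide) that every instruction has two register source operands and that each must be compared against the destination of every earlier instruction in the window:

```latex
\text{comparisons}(n) \;=\; 2\sum_{i=1}^{n-1} i \;=\; 2\cdot\frac{(n-1)\,n}{2} \;=\; n^2 - n
```

Under that assumption a window of n = 2000 instructions needs 2000^2 - 2000 = 3,998,000 comparisons, or roughly 4 million.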

Page 10: HY425 Lecture 08: Limits of ILP

Limiting the instruction window size

Instruction window

- Group of instructions examined for simultaneous execution
- WS × IW × RPI comparators needed
  - WS: window size
  - IW: issue width
  - RPI: registers per instruction to check
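
A quick worked instance of the comparator count, as a sketch only. The function name and the sample values (a 128-entry window, 4-wide issue, 2 source registers per instruction) are illustrative assumptions, not figures from the lecture.

```python
def comparators_needed(window_size, issue_width, regs_per_instruction):
    """Comparator count from the slide's formula: WS x IW x RPI."""
    return window_size * issue_width * regs_per_instruction


# Hypothetical configuration: WS = 128, IW = 4, RPI = 2
print(comparators_needed(128, 4, 2))  # 1024
```

Doubling either the window or the issue width doubles the comparator count, which is why both are kept modest in real designs.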

Page 11: HY425 Lecture 08: Limits of ILP

Limiting the instruction window size

Window size 32 – ∞

- 32–128 realistic values for modern processors

[Figure: Impact of window size. Instruction issues per cycle for gcc, espresso, li, fpppp, doduc, and tomcatv with window sizes of 32, 128, 512, 2048, and infinite.]

Page 12: HY425 Lecture 08: Limits of ILP

Realistic branch predictor

Tournament predictor

- 2-bit correlating, 2-bit non-correlating, 2-bit selectors
- 8192 branches, 2 predictors, 1 selector per branch
- Correlating predictor indexed with PC XORed with history
- Non-correlating predictor indexed with PC
- Average accuracy 97% in SPEC

Alternatives

- 2-bit predictor with 512 entries, 16-entry return address table
- Static predictor using profile of application
- No prediction
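
The tournament organization described above can be sketched as follows. This is a simplified model under stated assumptions: the class and method names, the 13-bit global history, the initial counter values, and indexing with the raw PC are all choices made here for illustration, not details of the predictor used in the study.

```python
def bump(counter, up):
    """Saturating 2-bit counter update (range 0..3)."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self, entries=8192, history_bits=13):
        self.mask = entries - 1                  # entries must be a power of two
        self.history_mask = (1 << history_bits) - 1
        self.history = 0                         # global branch history register
        self.correlating = [1] * entries         # indexed with PC XOR history
        self.non_correlating = [1] * entries     # indexed with PC
        self.selector = [1] * entries            # >= 2 means "trust correlating"

    def predict(self, pc):
        c = self.correlating[(pc ^ self.history) & self.mask] >= 2
        n = self.non_correlating[pc & self.mask] >= 2
        return c if self.selector[pc & self.mask] >= 2 else n

    def update(self, pc, taken):
        ci = (pc ^ self.history) & self.mask
        ni = pc & self.mask
        c_ok = (self.correlating[ci] >= 2) == taken
        n_ok = (self.non_correlating[ni] >= 2) == taken
        if c_ok != n_ok:
            # Move the selector toward whichever component was right.
            self.selector[ni] = bump(self.selector[ni], c_ok)
        self.correlating[ci] = bump(self.correlating[ci], taken)
        self.non_correlating[ni] = bump(self.non_correlating[ni], taken)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def accuracy(pattern, pc=0x400, warmup=300):
    """Fraction of correct predictions on `pattern` after a warm-up prefix."""
    p, correct = TournamentPredictor(), 0
    for i, taken in enumerate(pattern):
        if i >= warmup and p.predict(pc) == taken:
            correct += 1
        p.update(pc, taken)
    return correct / (len(pattern) - warmup)


alternating = [i % 2 == 0 for i in range(400)]   # T, N, T, N, ...
print(accuracy(alternating))
```

On the strictly alternating branch, the PC-indexed 2-bit counter keeps flipping and mispredicts, while the history-indexed component learns the pattern, so the selector drifts toward the correlating predictor.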

Page 13: HY425 Lecture 08: Limits of ILP

Realistic branch predictor

Impact of static vs. dynamic prediction

- 2048-instruction window, 64-way issue, 0-cycle mispredicted branch penalty

[Figure: Impact of branch predictor. Instruction issues per cycle for gcc, espresso, li, fpppp, doduc, and tomcatv with five predictors: perfect, tournament, standard 2-bit, profile-based, and none.]

Page 14: HY425 Lecture 08: Limits of ILP

Realistic branch predictor

Impact on INT versus FP programs

[Figure: Branch misprediction rate. Percentage of branches mispredicted for tomcatv, doduc, fpppp, li, espresso, and gcc with three predictors: tournament, 2-bit counter, and profile-based.]

Page 15: HY425 Lecture 08: Limits of ILP

Number of renaming registers

32 – ∞ renaming registers

- 2048-instruction window, 64-way issue
- 8K-entry tournament predictor

[Figure: Impact of renaming registers. Instruction issues per cycle for tomcatv, doduc, fpppp, li, espresso, and gcc with infinite, 256 INT + 256 FP, 128 INT + 128 FP, 64 INT + 64 FP, 32 INT + 32 FP, and no renaming registers.]
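
To see why the size of the renaming pool matters, here is a minimal renaming sketch. The function name, the `(dest, sources)` trace format, and the `p0, p1, ...` physical register names are hypothetical, and registers are never recycled in this toy model.

```python
def rename(trace, num_physical):
    """Rename destinations in program order until the free list runs out.

    trace: list of (dest, sources) over architectural register names.
    Returns the renamed prefix of the trace. Registers are never freed
    here; real hardware returns them to the free list at commit.
    """
    mapping = {}                                  # architectural -> physical
    free = [f"p{i}" for i in range(num_physical)]
    renamed = []
    for dest, sources in trace:
        if not free:
            break                                 # would stall until a commit
        # Sources read the newest mapping; live-in registers keep their name.
        new_sources = [mapping.get(s, s) for s in sources]
        phys = free.pop(0)
        mapping[dest] = phys
        renamed.append((phys, new_sources))
    return renamed


trace = [("r1", []), ("r2", ["r1"]), ("r1", []), ("r3", ["r1"])]
print(rename(trace, 4))
print(len(rename(trace, 2)))
```

With four physical registers the two writes to `r1` land in different registers, so the second write no longer has to wait for the read by the second instruction. With only two, renaming stops after two instructions, which is the effect a small pool has on the window.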

Page 16: HY425 Lecture 08: Limits of ILP

Alias analysis

Alternatives for static and dynamic alias analysis

- Impossible to disambiguate all references at compile time
  - Compiler can inspect static data in global segment and stacks (known locations)
  - Hard to inspect data in heap (dynamic allocation, pointers)
- Unbounded number of comparisons needed at runtime
- Three options
  - Perfect disambiguation of global and stack data (perfect compiler)
  - Inspection (disambiguate based on registers pointing to memory)
  - None (no disambiguation)

Page 17: HY425 Lecture 08: Limits of ILP

Impact of alias analysis

Alias alternatives

- 2048-instruction window, 64-way issue, 8K-entry tournament predictor, 256 INT and 256 FP renaming registers

[Figure: Impact of memory disambiguation scheme. Instruction issues per cycle for tomcatv, doduc, fpppp, li, espresso, and gcc with four schemes: perfect, global/stack perfect, inspection, and none.]

Page 18: HY425 Lecture 08: Limits of ILP

Impact of alias analysis

Implementing memory disambiguation

- Need to know effective addresses of all earlier stores
- Otherwise:
  - In-order address calculation
  - Effective address speculation

Disambiguation with speculation

- Load assumes no dependence or uses dependence predictor
- Stores check for dependence violations upon commit
- Undo and restart mechanism used upon mis-speculation
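
The commit-time check can be sketched as below. This is a deliberately simplified model: the function name, the `(kind, address)` operation format, and the `speculated` index set are hypothetical, and a real load/store queue would also track access sizes and intervening stores.

```python
def find_violations(ops, speculated):
    """Return indices of speculated loads that must be replayed.

    ops: program-order list of ("load" | "store", address).
    speculated: indices of loads that executed before every earlier
    store had resolved its address (i.e. they assumed no dependence).
    """
    violations = []
    for s, (kind, addr) in enumerate(ops):
        if kind != "store":
            continue
        # At commit, the store compares its address with younger loads
        # that have already executed speculatively.
        for l in range(s + 1, len(ops)):
            lkind, laddr = ops[l]
            if (lkind == "load" and l in speculated
                    and laddr == addr and l not in violations):
                violations.append(l)
    return violations


ops = [
    ("store", 0x10),
    ("load", 0x10),   # reads the location the store above writes
    ("load", 0x20),   # precedes the store to 0x20, so it is safe
    ("store", 0x20),
    ("load", 0x20),
]
print(find_violations(ops, speculated={1, 2, 4}))  # [1, 4]
```

Each index returned corresponds to a load that would trigger the undo-and-restart mechanism.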

Page 19: HY425 Lecture 08: Limits of ILP

Going beyond the limits

Advanced hardware techniques for ILP

- Memory WAW and WAR hazards
  - May happen across procedure calls
- Unnecessary dependencies imposed by software
  - E.g. incrementing the loop induction variable
- Predictable data flow
  - Value prediction
    - Prediction of addresses for memory disambiguation
    - Prediction of values

Page 20: HY425 Lecture 08: Limits of ILP

Other considerations for ILP

Clock rate vs. issue width

- 1994: 2-issue HP PA 7100 @ 99 MHz faster than 3-issue TI SuperSPARC @ 60 MHz
- Focusing on CPI may be traded against a longer cycle time

Amdahl's Law

- A single improvement may not improve overall performance
- Resources should scale proportionally
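
Amdahl's Law in its standard form makes the first bullet concrete. The function name and the sample fractions below are illustrative assumptions.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is improved."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


# Speeding up half of the execution time by 2x gives only about 1.33x overall.
print(amdahl_speedup(0.5, 2.0))
# Even a 1000x improvement of that half stays below 2x overall.
print(amdahl_speedup(0.5, 1000.0))
```

This is why, for example, widening issue without scaling the memory system or the predictor yields diminishing returns.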

Page 21: HY425 Lecture 08: Limits of ILP

Other considerations for ILP (cont.)

Control flow

- Branches more predictable in FP than INT codes
- FP programs have simpler control paths than INT programs

Parallelism beyond basic blocks

- Multi-program and multi-threaded parallelism
- Limited ILP in a single program motivates using many simpler processors to run many programs
- Multiple simpler processors an attractive alternative for servers

Page 22: HY425 Lecture 08: Limits of ILP

Other considerations for ILP (cont.)

Clock speed

- Increased wire delays prevent increasing clock speed
- Pipeline deepening may:
  - Enable higher clock frequency
  - Increase stall cycles and demand for ILP
  - Require multiple memory accesses, branch predictions, register file accesses per cycle
    - Challenging at GHz clock rates

Page 23: HY425 Lecture 08: Limits of ILP

Other considerations for ILP (cont.)

Power issues

- Increasing complexity increases power consumption
  - More transistors switching
- Increasing complexity and frequency will increase power exponentially
  - More transistors switching at a higher clock rate
- Increasing complexity linearly does not increase performance linearly
  - 3-issue Intel Pentium barely gets CPI < 1.0
  - More switching transistors per unit of performance
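
As a hedged aside that is not on the slide, the usual first-order model of dynamic CMOS power shows how these effects compound. Here \alpha is the switching activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency:

```latex
P_{\text{dynamic}} \;\approx\; \alpha \, C \, V^{2} \, f
```

More switching transistors raise \alpha C, and a higher clock rate raises f and typically also requires a higher V, so power grows much faster than either complexity or frequency alone.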

Page 24: HY425 Lecture 08: Limits of ILP

Instruction fetch and decode

CISC translated to RISC

- IA-32 CISC instruction set decoded to micro-ops (uops)
- uops scheduled in dynamically scheduled speculative pipeline
- Trace cache:
  - Frequently executed instruction sequences, including non-adjacent sequences
  - Sequences including multiple branches
- Up to 6 uops (three IA-32 instructions) decoded and translated per cycle
- Low miss rate (0.15% for SPEC CPUINT2000)

Page 25: HY425 Lecture 08: Limits of ILP

Speculative pipeline

Dynamic scheduling

- Out-of-order execution pipeline
- Register renaming for up to 3 uops per cycle
- Commit up to 3 uops per cycle
- Six dispatch ports to functional units
  - Frequently executed instruction sequences, including non-adjacent sequences
  - Sequences including multiple branches

Page 26: HY425 Lecture 08: Limits of ILP

Microarchitecture of Pentium 4

Pentium 4 640

Page 27: HY425 Lecture 08: Limits of ILP

Pentium 4 640 implementation

Microarchitecture quantitative characteristics

- Deep pipelines (21+ stages in various Pentium 4 generations)
- Minimum 31 cycles from fetch to commit

Feature | Size | Comments
Front-end branch-target buffer | 4K entries | Predicts the next IA-32 instruction to fetch; used only when the execution trace cache misses.
Execution trace cache | 12K uops | Trace cache used for uops.
Trace cache branch-target buffer | 2K entries | Predicts the next uop.
Registers for renaming | 128 total | 128 uops can be in execution with up to 48 loads and 32 stores.
Functional units | 7 total; 2 simple ALU, complex ALU, load, store, FP move, FP arithmetic | The simple ALU units run at twice the clock rate, accepting up to two simple ALU uops every clock cycle. This allows execution of two dependent ALU operations in a single clock cycle.
L1 data cache | 16 KB; 8-way associative; 64-byte blocks; write-through | Integer load to use latency is 4 cycles; FP load to use latency is 12 cycles; up to 8 outstanding load misses.
L2 cache | 2 MB; 8-way associative; 128-byte blocks; write back | 256 bits to L1, providing 108 GB/sec; 18-cycle access time; 64 bits to memory, capable of 6.4 GB/sec. A miss in L2 does not cause an automatic update of L1.

Page 28: HY425 Lecture 08: Limits of ILP

Performance of Pentium 4

Memory latency

- Performance critically dependent on memory system
- Memory (DRAM) latency upward of 100 cycles
  - Fastest memories as of 2006, 3.2 GHz clock
- 2 or 3 levels of caches common in modern high-end processors

Branch prediction

- Trace cache and branch prediction
- Misprediction rate
- Percentage of instructions misspeculated

Page 29: HY425 Lecture 08: Limits of ILP

Branch prediction on Pentium 4

Misprediction rate 8× higher in INT vs. FP programs

Page 30: HY425 Lecture 08: Limits of ILP

Branch prediction on Pentium 4

Misspeculated instruction rate follows misprediction rate

Page 31: HY425 Lecture 08: Limits of ILP

Data cache performance on Pentium 4

Multi-level cache hierarchy

- L2 cache miss penalty approx. 10× L1 cache miss penalty
- Hard to overlap stalls due to L2 cache misses

Page 32: HY425 Lecture 08: Limits of ILP

CPI on Pentium 4

Is there any ILP to begin with?

- Translation of IA-32 instructions to uops increases CPI by 1.29× (1.29 uops per instruction)
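
The 1.29× factor follows from splitting CPI into two terms. As a sketch, the subscripted names below are notation introduced here, not the lecture's:

```latex
\text{CPI}_{\text{IA-32}}
  \;=\; \frac{\text{uops}}{\text{IA-32 instruction}} \times \frac{\text{cycles}}{\text{uop}}
  \;=\; 1.29 \times \text{CPI}_{\text{uop}}
```

So the pipeline has to sustain more than 1.29 uops per cycle before the CPI measured in IA-32 instructions drops below 1.0.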

Page 33: HY425 Lecture 08: Limits of ILP

CPI on Pentium 4

Is there any ILP to begin with?

I When does the processor get > 1 uop per cycle?

Page 34: HY425 Lecture 08: Limits of ILP

Pentium 4 @ 3.2 GHz vs. AMD Opteron @ 2.6 GHz

Putting it all together

- Can a processor with a lower clock frequency outperform a processor with a higher clock frequency?

Page 35: HY425 Lecture 08: Limits of ILP

Pentium 4 @ 3.2 GHz vs. AMD Opteron @ 2.6 GHz

Putting it all together

- Pentium CPI = 1.27 × AMD CPI
- Pentium clock freq. = 1.23 × AMD clock freq.
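
The two ratios combine through the usual CPU time equation, time = instruction count × CPI / clock frequency. The sketch below assumes both processors execute the same number of instructions for a given program. That assumption, like the function name, is introduced here and not stated on the slide.

```python
def relative_time(cpi_ratio, clock_ratio, instruction_ratio=1.0):
    """Execution time of processor A relative to processor B.

    Each argument is A's value divided by B's value.
    """
    return instruction_ratio * cpi_ratio / clock_ratio


# Pentium 4 relative to Opteron, using the ratios from the slide.
ratio = relative_time(cpi_ratio=1.27, clock_ratio=1.23)
print(f"{ratio:.3f}")  # 1.033
```

Under that assumption the Pentium 4 takes about 3% longer on average despite its 23% higher clock frequency.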

Page 36: HY425 Lecture 08: Limits of ILP

Pentium 4 @ 3.2 GHz vs. AMD Opteron @ 2.6 GHz

Putting it all together

- Can a processor with a lower clock frequency outperform a processor with a higher clock frequency?

Page 37: HY425 Lecture 08: Limits of ILP

Conclusions

How do we compare processors?

- Lower CPI does not necessarily yield faster processors
- Processors with higher clock frequencies are not necessarily faster
- Instruction-level parallelism faces various limitations (walls):
  - Power wall – exponentially increasing complexity
  - Memory wall – non-overlapped memory latency
- What are we looking at next?
  - Software support for exploiting ILP
  - Designing more effective memory systems (caches, DRAM)
  - Looking at other sources of parallelism (threads, processes)
