Recap | Limits of ILP | Case study: Intel Netburst | Conclusion
HY425 Lecture 08: Limits of ILP
Dimitrios S. Nikolopoulos
University of Crete and FORTH-ICS
October 31, 2011
Dimitrios S. Nikolopoulos HY425 Lecture 08: Limits of ILP 1 / 41
ILP techniques

Hardware
- Dynamic scheduling with scoreboard
- Dynamic scheduling with renaming
  - Tomasulo, renaming registers
- Branch prediction
- Multiple issue
- Speculation

Software
- Instruction scheduling
- Code transformations (topic of next lecture)
What limits ILP

Software and hardware issues
- Limits of parallelism in programs
  - Data flow – true data dependencies
  - Control flow – control dependencies
  - Code generation, scheduling by compiler
- [...] instructions
- Long latencies – memory system (caches, DRAM)
What is the upper bound of ILP in programs?

Roofline model of performance analysis
- Study of maximum ILP in programs
- Difficult question dependent on hardware and compiler technology
- Different conclusions with different assumptions
- Optimistic (unrealistic) assumptions of hardware
  - Unlimited storage resources for tables, etc.
  - Perfect branch prediction and speculation
  - Others ...
David Wall (DEC, WRL Technical Report, 1993)

Assumptions
- Infinite virtual registers available for renaming
- Branch prediction is perfect
- All branch targets are perfectly predicted
  - No control dependencies
- All memory addresses known
  - Can move loads ahead of unrelated earlier stores
- No dependencies other than true data dependencies
David Wall (DEC, WRL Technical Report, 1993)

Assumptions
- Unlimited instruction window size
- Unlimited number of functional units
- All functional units compute in one cycle
- Perfect caches
David Wall (DEC, WRL Technical Report, 1993)

Comparison to a realistic processor – Alpha 21264
- Four-way instruction issue
- 80 renaming registers
- Branch predictor:
  - 1024-branch history, 2 × 8K branch patterns

Methodology
- Collect trace of instructions and memory references
- Schedule each instruction “by hand” as early as possible
  - Wait until data dependence resolved
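The methodology above can be sketched in a few lines of Python. This is only an illustration under the idealized assumptions listed earlier (perfect renaming, one-cycle functional units, unlimited window and units), not the tool used in the study. The `Instr` record, the register names, and the five-instruction trace are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def dataflow_schedule(trace):
    """Issue every instruction as early as its true (RAW) dependences allow.

    Returns (issue_cycles, ilp). Renaming is assumed perfect, so WAW and WAR
    conflicts never delay an instruction; every operation takes one cycle.
    """
    ready = {}     # register -> cycle at which its latest value is available
    cycles = []
    for ins in trace:
        start = max((ready.get(r, 0) for r in ins.srcs), default=0)
        cycles.append(start)
        if ins.dest is not None:
            ready[ins.dest] = start + 1
    total = max(cycles) + 1 if cycles else 0
    ilp = len(trace) / total if total else 0.0
    return cycles, ilp


trace = [
    Instr("r1", ("r0",)),        # no producers in the trace: cycle 0
    Instr("r2", ("r0",)),        # independent of the first: cycle 0
    Instr("r3", ("r1", "r2")),   # waits for r1 and r2: cycle 1
    Instr("r1", ("r0",)),        # reuses r1, but renaming removes the conflict: cycle 0
    Instr("r4", ("r1", "r3")),   # waits for r3: cycle 2
]
cycles, ilp = dataflow_schedule(trace)
print(cycles, round(ilp, 2))
```

Five instructions finish in three cycles here, so the dataflow limit for this toy trace is 5/3 instructions per cycle.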
David Wall (DEC, WRL Technical Report, 1993)

Theoretical maximum ILP
- gcc, espresso, li: integer programs
- fpppp, doduc, tomcatv: floating-point programs

Figure: Maximum ILP (instruction issues per cycle)

Program    Issues per cycle
gcc          55
espresso     63
li           18
fpppp        75
doduc       119
tomcatv     150
Limiting the instruction window size

Perfect processor
- Can look arbitrarily far ahead to fetch instructions
- Can rename output registers for all instructions that can issue
- Can determine data dependencies for all instructions
  - O(n^2) for n instructions
- Provide functional units for all instructions
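The quadratic bound can be made concrete with a short count. As an assumption for illustration (the slide does not state it), let each instruction name up to two source registers, each of which must be compared against the destination of every earlier instruction among the n candidates:

```latex
\[
  \text{comparisons}(n) \;=\; 2\sum_{i=1}^{n-1} i \;=\; 2\cdot\frac{(n-1)\,n}{2} \;=\; n^2 - n
\]
\[
  n = 2000:\qquad 2000^2 - 2000 \;=\; 3{,}998{,}000 \;\approx\; 4 \text{ million comparisons}
\]
```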
Limiting the instruction window size

Instruction window
- Group of instructions examined for simultaneous execution
- WS × IW × RPI comparators needed
  - WS: window size
  - IW: issue width
  - RPI: registers per instruction to check
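A quick way to get a feel for the WS × IW × RPI product is to tabulate it for a few window sizes. The issue width of 4 and the 2 registers per instruction used below are illustrative assumptions, not figures from the slide, and the function name `comparators` is made up for this sketch.

```python
def comparators(window_size, issue_width, regs_per_instr=2):
    """Comparator count as the plain product WS x IW x RPI."""
    return window_size * issue_width * regs_per_instr


for ws in (32, 128, 512, 2048):
    # Assumed: 4-way issue, 2 source registers checked per instruction.
    print(f"WS={ws:5d}  comparators={comparators(ws, issue_width=4)}")
```

Under these assumptions the count grows from 256 comparators at a 32-entry window to 16384 at 2048 entries, which is one reason large windows are expensive to build.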
Limiting the instruction window size

Window size 32 – ∞
- 32–128 realistic values for modern processors

Figure: Impact of window size (instruction issues per cycle)

Program    Infinite   2048    512    128     32
gcc            55       -      10     10      8
espresso       63       -      15     13      6
li             18       -      12     11      9
fpppp          75       -      49     35     14
doduc         119      59      16     15      9
tomcatv       150      60      45     34     14

(- marks a value missing from the source text)
Realistic branch predictor

Tournament predictor
- 2-bit correlating, 2-bit non-correlating, 2-bit selectors
- 8192 branches, 2 predictors, 1 selector per branch
- Correlating predictor indexed with PC XORed with history
- Non-correlating predictor indexed with PC
- Average accuracy 97% in SPEC

Alternatives
- 2-bit predictor with 512 entries, 16-entry return address table
- Static predictor using profile of application
- No prediction
Realistic branch predictor

Impact of static vs. dynamic prediction
- 2048-instruction window, 64-way issue, 0-cycle mispredicted branch penalty

Figure: Impact of branch predictor (instruction issues per cycle)

Program    Perfect   Tournament   Standard 2-bit   Profile-based   None
gcc           35          9              6               6           2
espresso      41         12              7               6           2
li            16         10              6               7           2
fpppp         61         48             46              45          29
doduc         58         15             13              14           4
tomcatv       60         46             45              45          19
Realistic branch predictor

Impact on INT versus FP programs

Figure: Branch misprediction rate (percentage of branches mispredicted)

Program    Tournament   2-bit counter   Profile-based
tomcatv         0             1               1
doduc           3            16               5
fpppp           2            18              14
li              2            23              12
espresso        4            18              14
gcc             6            30              12
Number of renaming registers

32 – ∞ renaming registers
- 2048-instruction window, 64-way issue
- 8K-entry tournament predictor

Figure: Impact of renaming registers (instruction issues per cycle)

Program    Infinite   256 INT + 256 FP   128 INT + 128 FP   64 INT + 64 FP   32 INT + 32 FP   None
tomcatv       54             45                 44                28                7           5
doduc         29             16                 15                11                5           5
fpppp         59             49                 35                20                5           4
li            12             12                 12                11                6           5
espresso      15             15                 13                10                5           4
gcc           11             10                 10                 9                5           4
Alias analysis

Alternatives for static and dynamic alias analysis
- Impossible to disambiguate all references at compile time
  - Compiler can inspect static data in global segment and stacks (known locations)
  - Hard to inspect data in heap (dynamic allocation, pointers)
- Unbounded number of comparisons needed at runtime
- Three options
  - Perfect disambiguation of global and stack data (perfect compiler)
  - Inspection (disambiguate based on registers pointing to memory)
  - None (no disambiguation)
Impact of alias analysis

Alias alternatives
- 2048-instruction window, 64-way issue, 8K-entry tournament predictor, 256 INT + 256 FP renaming registers

Figure: Impact of memory disambiguation scheme (instruction issues per cycle)

Program    Perfect   Global/stack perfect   Inspection   None
tomcatv       45              45                 5          4
doduc         16              16                 6          4
fpppp         49              49                 4          3
li            12               9                 4          3
espresso      15               7                 5          5
gcc           10               7                 4          3
Impact of alias analysis

Implementing memory disambiguation
- Need to know effective addresses of all earlier stores
- Otherwise:
  - In-order address calculation
  - Effective address speculation

Disambiguation with speculation
- Load assumes no dependence or uses dependence predictor
- Stores check for dependence violations upon commit
- Undo and restart mechanism used upon mis-speculation
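The commit-time check in the last three bullets can be illustrated with a small model. This is a sketch under simplifying assumptions, not a description of real hardware: memory operations are given as a program-order list, `early_loads` (a hypothetical input) names the loads that ran before older stores had resolved their addresses, and any address match with an older store counts as a violation.

```python
def violating_loads(ops, early_loads):
    """Find speculatively executed loads that must be undone and restarted.

    ops:         list of ("load" | "store", address) in program order
    early_loads: set of indices of loads that executed while assuming
                 "no dependence" on older, still-unresolved stores
    """
    bad = set()
    for s, (kind, addr) in enumerate(ops):
        if kind != "store":
            continue
        # When the store commits, compare its address with every younger
        # load that has already executed speculatively.
        for l in range(s + 1, len(ops)):
            l_kind, l_addr = ops[l]
            if l_kind == "load" and l in early_loads and l_addr == addr:
                bad.add(l)
    return sorted(bad)


ops = [
    ("store", 0x100),
    ("load", 0x200),    # ran early, no matching older store: speculation was safe
    ("load", 0x100),    # ran early, but the store at index 0 writes 0x100
    ("store", 0x300),
    ("load", 0x300),    # waited for the store, so never speculative
]
replay = violating_loads(ops, early_loads={1, 2})
print(replay)
```

Only the load at index 2 needs the undo-and-restart path; a dependence predictor would aim to make such loads wait in the first place.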
Going beyond the limits

Advanced hardware techniques for ILP
- Memory WAW and WAR hazards
  - May happen across procedure calls
- Unnecessary dependencies imposed by software
  - E.g. incrementing the loop induction variable
- Predictable data flow
  - Value prediction
    - Prediction of addresses for memory disambiguation
    - Prediction of values
Other considerations for ILP

Clock rate vs. issue width
- 1994: HP PA 7100, 2-issue @ 99 MHz, faster than TI SuperSPARC, 3-issue @ 60 MHz
- Focus on CPI may come at the cost of a longer cycle time

Amdahl’s Law
- A single improvement may not improve overall performance
- Resources should scale proportionally
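Both points can be checked with simple arithmetic. The sketch below compares peak issue rates only (issue width times clock), which is an upper bound and not measured performance, and evaluates Amdahl's Law for one hypothetical case. The 50% fraction and the 2× speedup are made-up numbers for illustration.

```python
def peak_mips(issue_width, clock_mhz):
    """Peak instruction rate in millions of instructions per second."""
    return issue_width * clock_mhz


def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is improved."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


pa7100 = peak_mips(2, 99)        # 2-issue at 99 MHz
supersparc = peak_mips(3, 60)    # 3-issue at 60 MHz
print(pa7100, supersparc)        # the narrower, faster-clocked design has the higher peak

# Hypothetical: doubling the speed of a part that accounts for half the time.
overall = amdahl_speedup(0.5, 2.0)
print(round(overall, 3))
```

Even on peak numbers the 2-issue machine comes out ahead (198 vs. 180 million instructions per second), and doubling the speed of half the execution yields only a 1.33× overall gain.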
Other considerations for ILP (cont.)

Control flow
- Branches more predictable in FP than INT codes
- FP programs have simpler control paths than INT programs

Parallelism beyond basic blocks
- Multi-program and multi-threaded parallelism
- Limited ILP in a single program motivates using many simpler processors to run many programs
- Multiple simpler processors an attractive alternative for servers
- Deep pipelines (21+ stages in various Pentium 4 generations)
- Minimum 31 cycles from fetch to commit

Feature | Size | Comments
Front-end branch target buffer | 4K entries | Predicts the next IA-32 instruction to fetch; used only when the execution trace cache misses.
Execution trace cache | 12K uops | Trace cache used for uops.
Trace cache branch-target buffer | 2K entries | Predicts the next uop.
Registers for renaming | 128 total | 128 uops can be in execution with up to 48 loads and 32 stores.
Functional units | 7 total; 2 simple ALU, complex ALU, load, store, FP move, FP arithmetic | The simple ALU units run at twice the clock rate, accepting up to two simple ALU uops every clock cycle. This allows execution of two dependent ALU operations in a single clock cycle.
L1 data cache | 16 KB; 8-way associative; 64-byte blocks; write-through | Integer load-to-use latency is 4 cycles; FP load-to-use latency is 12 cycles; up to 8 outstanding load misses.
L2 cache | 2 MB; 8-way associative; 128-byte blocks; write-back | 256 bits to L1, providing 108 GB/sec; 18-cycle access time; 64 bits to memory, capable of 6.4 GB/sec. A miss in L2 does not cause an automatic update of L1.
Performance of Pentium 4

Memory latency
- Performance critically dependent on memory system
- Memory (DRAM) latency upward of 100 cycles
  - Fastest memories as of 2006, 3.2 GHz clock
- 2 or 3 levels of caches common in modern high-end processors

Branch prediction
- Trace cache and branch prediction
- Misprediction rate
- Percentage of instructions misspeculated
Branch prediction on Pentium 4
Misprediction rate 8× higher in INT vs. FP programs
- What are we looking at next?
  - Software support for exploiting ILP
  - Designing more effective memory systems (caches, DRAM)
  - Looking at other sources of parallelism (threads, processes)