A Monte Carlo Model of In-order Micro-architectural Performance: Decomposing Processor Stalls
Olaf Lubeck, Ram Srinivasan, Jeanine Cook
Jan 02, 2016
Percent of peak on single PE? About 5-8% (12-20x less than peak)
Deterministic Transport Kerbyson, Hoisie, Pautz (2003)
Single Processor Efficiency is Critical in Parallel Systems
Efficiency Loss in Scaling from 1 to 1000’s of PEs?
Roughly 2-3x
Processor Model: A Monte Carlo Approach to Predict CPI
Token Generator
Token: instruction classes
Service Centers:
Delays caused by:
• ALU Latencies
• Memory Latencies
• Branch Misprediction
Max rate: 1 token every CPI_i cycles
Feedback Loop – Stalls: latencies interacting with app characteristics
Retire: non-producing tokens
Stall: producer tokens
Inherent CPI: best application CPI given no processor stalls, i.e., infinite zero-latency resources.
Processor Model: A Monte Carlo Approach to Predict CPI
Producer tokens – service center latencies (clock periods):
• MEM: variable
• TLB: 31
• L3: 16
• L2: 6
• L1: 1
• GSF: 6
• FPU: 4
• INT: 1
• BrM: 6
Retire
Dependence Check and Stall Generator
Dependence Distance Generation
Transition probabilities associated with each path
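The per-path transition probabilities determine which service center a token visits next. A minimal sketch of how such a discrete choice can be sampled (inverse-CDF over a probability vector); the class names and probabilities below are illustrative, not taken from the model:

```c
/* Hypothetical sketch: choosing a token's next instruction class from
 * transition probabilities via inverse-CDF sampling. The class names
 * and any probability values are illustrative, not the model's. */
enum klass { INT_OP, FP_OP, LOAD, BRANCH_MISS, NON_PRODUCER, NUM_CLASSES };

/* u is a uniform random draw in [0,1); probs[] must sum to 1. */
static int sample_class(const double probs[NUM_CLASSES], double u)
{
    double cum = 0.0;
    for (int k = 0; k < NUM_CLASSES; k++) {
        cum += probs[k];
        if (u < cum)
            return k;           /* u fell inside this class's interval */
    }
    return NUM_CLASSES - 1;     /* guard against rounding as u -> 1 */
}
```

In the full model, a draw like this would be made at each branching point of the path diagram (e.g., L1 hit vs. L2 vs. L3 vs. memory), with the probabilities taken from binary instrumentation and performance counters.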
Processor Stalls and Characterization of Application Dependence
The major sources of stalls for in-order processors are RAW and WAW dependences
Token #     1    2               3    4    5    6
Load?       N    Y (hit in L3)   N    N    N    N
Consumer?   N    N               N    N    N    Y (of token 2)
Probability distribution: load-to-use distance (instructions)
Based on the path that token 2 has taken, we compute the stall time (and its cause) for token 6: 16 - 4*CPI_i cycles
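The stall rule in this worked example can be sketched in one function: the consumer pays whatever part of the producer's latency has not already been hidden by the instructions issued in between. The clamp to zero for fully hidden latency is an assumption:

```c
/* Sketch of the stall rule from the token-2/token-6 example, assuming
 *   stall = max(0, latency - distance * CPI_i)
 * i.e., an L3 hit (16 cp) consumed 4 instructions later stalls the
 * consumer for 16 - 4*CPI_i cycles. */
static double stall_cycles(double latency, int distance, double cpi_i)
{
    double hidden = distance * cpi_i;   /* cycles elapsed since the producer issued */
    double stall = latency - hidden;
    return stall > 0.0 ? stall : 0.0;   /* fully hidden latency: no stall */
}
```

With CPI_i = 1, the example above gives a 12-cycle stall charged to the L3 hit, which is how the model decomposes total CPI into stall causes.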
Application pdfs:
1. Load-to-use
2. FP-to-use
3. INT-to-use
Summary of Model Parameters
• Inherent CPI
  – From a binary instrumentation tool
• Instruction classes
  – INT, FP, Reg, branch misprediction, loads, non-producing
  – Note that stores are retired immediately (treated as non-producers)
• Transition probabilities
  – Probabilities of generating each instruction class, computed from binary instrumentation
  – Cache hits computed from performance counters; can be predicted from models
• Distribution functions of dependence distances (measured in instructions)
  – Load-to-use, FP-to-use, INT-to-use, from binary instrumentation
• Processor and memory latencies
  – From architecture manuals
1. Parameters are computed in 1-2 hours
2. Model converges in a few secs
3. ~800 lines of C code
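The model itself is ~800 lines of C; a toy sketch of its top-level Monte Carlo loop might look like the following, where predicted CPI is the inherent CPI plus the average stall charged per retired token. The `stall_of` stub and its 30%/12-cycle behavior are placeholders for the real dependence-check and service-center logic, not the model's actual parameters:

```c
/* Toy sketch of the model's top-level Monte Carlo loop (assumed
 * structure, not the actual ~800-line implementation). */

/* minimal LCG so the sketch does not depend on a library RNG */
static unsigned lcg_next(unsigned *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* placeholder for the dependence-check / service-center logic:
 * here, 30% of tokens stall 12 cycles and the rest do not stall */
static double stall_of(unsigned *state)
{
    return (lcg_next(state) % 100u < 30u) ? 12.0 : 0.0;
}

/* predicted CPI = inherent CPI + mean stall per retired token */
static double predict_cpi(double cpi_i, long n_tokens, unsigned seed)
{
    double stall_sum = 0.0;
    for (long i = 0; i < n_tokens; i++)
        stall_sum += stall_of(&seed);
    return cpi_i + stall_sum / (double)n_tokens;
}
```

Because the loop only accumulates a running mean, convergence in a few seconds is plausible: the estimate tightens as 1/sqrt(n_tokens), independent of program length.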
Bench – 1.3 GHz Itanium 2, 3 MB L3 cache, 260 cp memory latency
Hertz – 900 MHz Itanium 2, 1.5 MB L3 cache, 112 cp memory latency
Model Accuracy
Constant Memory Latencies
Model Extensions for Variable Memory Latency: Compiler-controlled Prefetching
1. Linear relationship between prefetch distance and memory latency
2. A late prefetch can increase observed memory latency
This relationship suggests the prefetch-to-load pdf
Token #       11   12   13   14   15   16
Issue cycle   21   22   25   29   30   31
Load?         N    PF   N    N    Y    N
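Using the issue-cycle trace above, the residual latency seen by a late-prefetched load can be sketched as follows. The linear overlap model is an assumption consistent with the slide's "linear relationship" claim:

```c
/* Sketch of the late-prefetch effect: a prefetch overlaps memory
 * latency with execution, so the load only pays the residual.
 * Assumed linear model: residual = max(0, mem_latency - lead). */
static double prefetched_latency(int pf_cycle, int load_cycle,
                                 double mem_latency)
{
    double lead = load_cycle - pf_cycle;    /* cycles already overlapped */
    double residual = mem_latency - lead;
    return residual > 0.0 ? residual : 0.0; /* timely prefetch: fully hidden */
}
```

In the trace above the prefetch issues at cycle 22 and the load at cycle 30, so with Hertz's 112 cp memory latency the load would still see 104 cycles: a late prefetch hides almost nothing.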
Model Extensions: Toward Multicore Chips (CMPs)
Hertz: slope of 27 cps
Bench: slope of 101 cps
Slopes are obtained empirically and are a function of memory controllers, chip speeds, bus bandwidths, etc.
Memory latency as a function of outstanding loads
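A minimal sketch of that queuing effect: observed memory latency grows linearly with the number of outstanding loads, at the empirically measured slope. Charging the slope only to loads beyond the first is an assumption:

```c
/* Assumed linear contention model from the slide: observed memory
 * latency = base latency + slope * (outstanding loads beyond the
 * first). Slopes are machine-specific and measured empirically
 * (27 cp/load on Hertz, 101 cp/load on Bench). */
static double mem_latency(double base, double slope, int outstanding)
{
    /* a single outstanding load sees the uncontended latency */
    return base + slope * (outstanding - 1);
}
```

On Hertz (112 cp base), for example, three simultaneously outstanding loads would each see about 166 cycles under this model.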
Application Characteristics: Dependence Distributions
Sixtrack L2 hits: 5.7%
Eon L2 hits: 3.6%
But Eon's L2 stalls are 6x larger than Sixtrack's
Dependence distances are needed to explain stalls
What kinds of questions can we explore with the model?
What if FPU was not pipelined?
What if L2 was removed (2 Level cache)?
What if processor freq was changed (power aware)?
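For the frequency what-if, one concrete relation falls out of the machine parameters: memory latency in nanoseconds is fixed by the memory system, so latency in clock periods scales with core frequency (Bench's 260 cp at 1.3 GHz is 200 ns). A hypothetical helper for re-deriving the model's latency parameter at a new frequency:

```c
/* What-if sketch: wall-clock memory latency (ns) is a property of the
 * memory system, so latency in clock periods scales with frequency.
 * E.g., a 200 ns memory at 1.3 GHz costs 260 clock periods. */
static double latency_cp(double latency_ns, double freq_ghz)
{
    return latency_ns * freq_ghz;   /* ns * (cycles/ns) = cycles */
}
```

Re-running the model with latencies rescaled this way is what makes the power-aware frequency question answerable without re-instrumenting the application.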
Summary
• Monte Carlo techniques can be effectively applied to micro-architecture performance prediction
  – Main advantage: whole-program analysis; predictive and extensible
  – Main problems we have seen: small loops are not well predicted; binary instrumentation for prefetch can take >24 hrs
• The model is surprisingly accurate given the architectural & application simplifications
• Distributions that are used to develop predictive models are significant application characteristics that need to be evaluated
• We are ready to go into a “production” mode where we apply the model to a number of in-order architectures: Cell and Niagara