A Monte Carlo Model of In-order Micro-architectural Performance: Decomposing Processor Stalls
Olaf Lubeck, Ram Srinivasan, Jeanine Cook

Transcript
Page 1:

A Monte Carlo Model of In-order Micro-architectural Performance: Decomposing Processor Stalls

Olaf Lubeck, Ram Srinivasan, Jeanine Cook

Page 2:

Single Processor Efficiency is Critical in Parallel Systems

Deterministic Transport: Kerbyson, Hoisie, Pautz (2003)

Percent of peak on a single PE? About 5-8% (12-20x less than peak)

Efficiency loss in scaling from 1 to 1000s of PEs? Roughly 2-3x

Page 3:

Page 4:

Page 5:

Processor Model: A Monte Carlo Approach to Predict CPI

• Token Generator: tokens represent instruction classes; maximum issue rate of one token every CPI_I cycles
• Service Centers: delays caused by ALU latencies, memory latencies, and branch misprediction
• Feedback loop (stalls): latencies interacting with application characteristics
• Retire: non-producing tokens retire immediately; producer tokens can stall their consumers
• Inherent CPI (CPI_I): the best application CPI given no processor stalls, i.e., infinite zero-latency resources
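
Putting these pieces together, the predicted CPI decomposes as (notation added here for clarity, not taken from the slides):

CPI = CPI_I + (stall cycles from memory + FPU + branch misprediction + ...) / (instructions retired)

This is the decomposition reported later as the CPI decomposition on Bench.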

Page 6:

Processor Model: A Monte Carlo Approach to Predict CPI

• Token Generator: tokens represent instruction classes; maximum issue rate of one token every CPI_I cycles
• Non-producing tokens retire immediately; producer tokens pass through a service center
• Service center latencies (cycles): INT: 1, FPU: 4, GSF: 6, BrM: 6, L1: 1, L2: 6, L3: 16, TLB: 31, MEM: variable
• Dependence check and stall generator, driven by dependence distance generation
• Transition probabilities are associated with each path
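
A minimal sketch of this token loop is shown below, in C since the slides note the full model is roughly 800 lines of C. The functional-unit and cache latencies follow the numbers on this slide (memory fixed at a Bench-like 260 cycles); the inherent CPI, the transition probabilities, and the geometric dependence-distance distribution are illustrative placeholders, not the measured parameters.

/*
 * Minimal sketch of the Monte Carlo token loop described above.
 * Latencies follow the slide; probabilities, inherent CPI, and the
 * geometric dependence-distance pdf are illustrative placeholders.
 */
#include <stdio.h>
#include <stdlib.h>

#define N_TOKENS 1000000L

enum { LAT_INT = 1, LAT_FPU = 4, LAT_L1 = 1, LAT_L2 = 6,
       LAT_L3 = 16, LAT_MEM = 260 };

static double urand(void) { return (double)rand() / ((double)RAND_MAX + 1.0); }

/* Placeholder dependence-distance pdf: geometric, mean distance ~4. */
static int sample_distance(void) {
    int d = 1;
    while (urand() < 0.75) d++;
    return d;
}

int main(void) {
    const double cpi_i  = 0.75;               /* assumed inherent CPI     */
    const double p_load = 0.30, p_fp = 0.20;  /* assumed class mix        */
    const double p_l1 = 0.90, p_l2 = 0.06, p_l3 = 0.03; /* rest -> memory */
    double stall_cycles = 0.0;

    for (long i = 0; i < N_TOKENS; i++) {
        double lat;                           /* producer latency         */
        double r = urand();

        if (r < p_load) {                     /* load: pick the hit level */
            double c = urand();
            if      (c < p_l1)               lat = LAT_L1;
            else if (c < p_l1 + p_l2)        lat = LAT_L2;
            else if (c < p_l1 + p_l2 + p_l3) lat = LAT_L3;
            else                             lat = LAT_MEM;
        } else if (r < p_load + p_fp) {       /* FP producer              */
            lat = LAT_FPU;
        } else {                              /* INT / non-producing      */
            lat = LAT_INT;
        }

        /* For brevity every token is given a consumer; the real model
         * samples the measured load-to-use / FP-to-use / INT-to-use pdfs. */
        int dist = sample_distance();
        double stall = lat - dist * cpi_i;    /* latency not hidden       */
        if (stall > 0.0) stall_cycles += stall;
    }

    double cpi = cpi_i + stall_cycles / (double)N_TOKENS;
    printf("predicted CPI = %.3f (inherent %.3f, stalls %.3f)\n",
           cpi, cpi_i, cpi - cpi_i);
    return 0;
}

Because the loop only accumulates positive residual latencies per stall source, the same bookkeeping yields both the total CPI and its decomposition by cause.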

Page 7:

Processor Stalls and Characterization of Application Dependence

The major sources of stalls for in-order processors are RAW and WAW dependences.

Token #      1    2               3    4    5    6
Load?        N    Y (hit in L3)   N    N    N    N
Consumer?    N    N               N    N    N    Y (of token 2)

Probability distribution: load-to-use distance (instructions)

Based on the path that token 2 has taken, we compute the stall time (and cause) for token 6: 16 - 4*CPI_I cycles (a worked example follows the list below).

Application pdf's:

1. Load-to-use

2. FP-to-use

3. INT-to-use
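
For example, under an assumed inherent CPI of 0.75 (an illustrative value, not one given in the slides), token 6 would stall for 16 - 4 x 0.75 = 13 cycles, and those 13 cycles would be attributed to the L3 hit that produced its operand.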

Page 8:

Summary of Model Parameters

• Inherent CPI: from a binary instrumentation tool
• Instruction classes: INT, FP, Reg, branch misprediction, loads, non-producing. Note that stores are retired immediately (treated as non-producers).
• Transition probabilities: the probability of generating each instruction class is computed from binary instrumentation; cache hits are computed from performance counters and can be predicted from models
• Distribution functions of dependence distances (measured in instructions): load-to-use, FP-to-use, INT-to-use, from binary instrumentation (see the sampling sketch after this list)
• Processor and memory latencies: from architecture manuals

1. Parameters are computed in 1-2 hours

2. The model converges in a few seconds

3. ~800 lines of C code
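
As a sketch of how one of these measured distributions might be consumed by the model, the fragment below draws a load-to-use distance from a discrete pdf by inverse-CDF lookup. The histogram values are made up for illustration; in practice the tables would be read from the binary-instrumentation output.

/* Sketch: draw a load-to-use distance from a measured discrete pdf by
 * inverse-CDF lookup. The histogram below is made up for illustration. */
#include <stdlib.h>

#define MAX_DIST 8

/* pdf[d] = probability that the first consumer is d instructions away */
static const double load_to_use_pdf[MAX_DIST + 1] =
    { 0.0, 0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.06, 0.04 };

static int sample_load_to_use(void) {
    double r = (double)rand() / ((double)RAND_MAX + 1.0);
    double cum = 0.0;
    for (int d = 1; d <= MAX_DIST; d++) {
        cum += load_to_use_pdf[d];
        if (r < cum) return d;
    }
    return MAX_DIST;    /* tail: distances beyond the last bucket */
}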

Page 9:

Model Accuracy: Constant Memory Latencies

Bench: 1.3 GHz Itanium 2, 3 MB L3 cache, 260-cycle memory latency

Hertz: 900 MHz Itanium 2, 1.5 MB L3 cache, 112-cycle memory latency

Page 10:

CPI Decomposition on Bench

Page 11:

Model Extensions for Variable Memory Latency: Compiler-controlled Prefetching


Page 12:

Model Extensions for Variable Memory Latency: Compiler-controlled Prefetching

1. Linear relationship between prefetch distance and memory latency

2. A late prefetch can increase memory latency

This relationship suggests the prefetch-to-load pdf.

Token #       11   12   13   14   15   16
Issue cycle   21   22   25   29   30   31
Load?         N    PF   N    N    Y    N
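
A minimal sketch of how the prefetch extension could adjust a load's latency is shown below; the simple "remaining latency" form and the example numbers are assumptions of this sketch, not the fitted relationship from the slides.

/* Sketch: a load that was prefetched 'cycles_ahead' cycles earlier only
 * waits for whatever part of the full memory latency is still outstanding.
 * The linear "remaining latency" form is an assumption of this sketch. */
static double prefetched_load_latency(double full_latency, double cycles_ahead) {
    double remaining = full_latency - cycles_ahead;
    return remaining > 1.0 ? remaining : 1.0;   /* fully hidden: ~L1 cost */
}

For the token stream above, the load (token 15, issue cycle 30) follows the prefetch (token 12, issue cycle 22) by 8 cycles, so under this sketch a 260-cycle miss would still cost about 252 cycles; the prefetch-to-load pdf lets the model sample such gaps instead of enumerating them.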

Page 13:

Model Accuracy: Variable Memory Latencies from Prefetch

Page 14:

Model Extensions: Toward Multicore Chips (CMPs)

Memory latency is modeled as a function of outstanding loads.

Hertz: slope of 27 cycles per outstanding load; Bench: slope of 101 cycles per outstanding load.

Slopes are obtained empirically and are a function of memory controllers, chip speeds, bus bandwidths, etc.
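
A linear form consistent with this description might look like the sketch below; the base latencies come from the earlier machine descriptions, while treating the growth as strictly linear in the number of outstanding loads is an assumption of the sketch.

/* Sketch: memory latency grows with the number of outstanding loads.
 * The slopes quoted above (Hertz ~27, Bench ~101 cycles) would be the
 * empirically measured inputs; the linear form is the modeling assumption. */
static double cmp_mem_latency(double base_latency, double slope_cycles,
                              int outstanding_loads) {
    return base_latency + slope_cycles * outstanding_loads;
}
/* e.g., for Bench: cmp_mem_latency(260.0, 101.0, n) for n outstanding loads */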

Page 15:

Application Characteristics: Dependence Distributions

Sixtrack L2 hits: 5.7%; Eon L2 hits: 3.6%

But Eon's L2 stalls are 6x larger than Sixtrack's.

Dependence distances are needed to explain the stalls.
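
To see why, recall the stall rule from the model: assuming an inherent CPI of about 0.75 as in the earlier illustration, a load that hits in L2 (6-cycle latency) and is consumed 2 instructions later stalls its consumer for roughly 6 - 2 x 0.75 = 4.5 cycles, while the same hit consumed 10 instructions later causes no stall at all. Two codes with similar L2 hit rates can therefore differ sharply in L2 stall time if their load-to-use distributions differ.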

Page 16:

What kinds of questions can we explore with the model?

What if the FPU were not pipelined?

What if the L2 were removed (two-level cache)?

What if the processor frequency were changed (power-aware)?

Page 17:

Summary

• Monte Carlo techniques can be effectively applied to micro-architecture performance prediction
  – Main advantage: whole-program analysis, predictive and extensible
  – Main problems that we have seen: small loops are not well predicted, and binary instrumentation for prefetch can take >24 hrs

• The model is surprisingly accurate given the architectural & application simplifications

• Distributions that are used to develop predictive models are significant application characteristics that need to be evaluated

• We are ready to go into a “production” mode where we apply the model to a number of in-order architectures: Cell and Niagara