
CPR: Composable Performance Regression for Scalable Multiprocessor Models

Benjamin C. Lee
Computer Architecture Group
Microsoft Research

Jamison Collins, Hong Wang
Microarchitecture Research Lab
Intel Corporation

David Brooks
Engineering and Applied Sciences
Harvard University

International Symposium on Microarchitecture
11 November 2008

Benjamin C. Lee, et al. :: MICRO :: 11 Nov 08


Technology Trends

Moore’s Law and increasing transistor densities

Performance and power efficiency

Transition to multi-core and parallelism





Simulation Challenges

Cycle-Accurate Simulation
  Accurately identifies trends in design space
  Tracks instructions’ progress through microprocessor

Simulation Costs
  Costs per simulation :: minutes, hours per design
  Number of simulations :: scales exponentially ($m^p$)
    $p$, $m$ :: parameter count, resolution
  Costs per simulation :: scales superlinearly ($n^\gamma$)
    $n$ cores, $\gamma > 1$
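To make the two scaling terms concrete, here is a minimal sketch. It assumes each of the p parameters takes m values, and that one n-core simulation costs a single-core cost times n raised to gamma. The function names and the example values of m, p, and gamma are hypothetical; the slides do not state them.

```python
def design_count(m: int, p: int) -> int:
    """Number of designs when each of p parameters takes m values."""
    return m ** p


def multicore_sim_cost(t1: float, n: int, gamma: float) -> float:
    """Cost of one n-core simulation, assuming it grows as t1 * n**gamma."""
    return t1 * n ** gamma


# Hypothetical values: 4 levels for each of 15 parameters.
total_designs = design_count(4, 15)

# Hypothetical values: unit cost per core simulation, gamma = 1.5.
costs_by_core_count = {n: multicore_sim_cost(1.0, n, 1.5) for n in (1, 2, 4)}
```

Even with small m, the design count grows too quickly to simulate exhaustively, which is the motivation for sampling.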





Statistical Inference

Construct inferential models from samples [ASPLOS’06]

Use models as efficient surrogates for simulator





Multiprocessor Inference

Expensive CMP Simulations
  Physical resource contention increases host cycles
  Logical resource contention increases simulated cycles
  Synchronization increases cost per simulated cycle

Composable Performance Regression
  Leverage core models to minimize CMP simulations
  Core :: Uniprocessor performance
  Contention :: Shared resource contention
  Penalty :: Performance penalty from contention




Outline

Motivation
  Technology Trends
  Simulation Challenges
  Simulation Paradigm

Uniprocessor
  Regression Theory
  Model Evaluation
  Evolutionary Design

Multiprocessor
  CPR
  Model Evaluation
  Scalability





Regression Theory

Statistical Inference
  Models relationships between data
  Requires initial data to train, formulate model
  Leverages correlation from initial data for prediction

Regression Models
  Low training costs (sample 300 from 4.3B designs)
  Accurate inference (1.5% median error)
  Efficient computation (hundreds of predictions per second)





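Formulation

$n$ simulated design samples, $p$ design parameters

Response :: $Y$ design metrics (e.g., performance)

Predictor :: $X$ design parameters (e.g., ROB, cache)

$$
Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix},
\qquad
X = \begin{bmatrix}
x_{11} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{np}
\end{bmatrix}
$$

Coefficients :: $\beta = [\beta_0, \ldots, \beta_p]^T$

Errors :: $\varepsilon = [\varepsilon_1, \ldots, \varepsilon_n]^T$ where $\varepsilon_i \sim N(0, \sigma^2)$

$$Y = X\beta + \varepsilon$$

$$F(Y) = G(X)\beta_G + \varepsilon$$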
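A minimal sketch of fitting this model by least squares follows. It uses Python with NumPy rather than the R packages the authors used. The data are synthetic stand-ins for simulator output, and the names `fit_least_squares`, `true_beta`, and `y_pow` are hypothetical. A column of ones is prepended to X so that the first coefficient acts as the intercept, and the log transform in the second fit is only one illustrative choice of F and G.

```python
import numpy as np


def fit_least_squares(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return beta = [beta_0, ..., beta_p] minimizing ||y - [1, X] beta||^2."""
    # Prepend a column of ones so beta_0 acts as the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


# Synthetic data: n = 300 samples, p = 2 parameters, small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 8.0, size=(300, 2))
true_beta = np.array([0.5, 2.0, -1.0])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0.0, 0.01, size=300)

# Linear form: Y = X beta + eps.
beta_hat = fit_least_squares(X, y)

# Transformed form: F(Y) = G(X) beta_G + eps, here with F = G = log.
# A power-law response becomes linear after taking logs of both sides.
y_pow = 3.0 * X[:, 0] ** 0.5
beta_g = fit_least_squares(np.log(X[:, :1]), np.log(y_pow))
```

With these synthetic inputs, `beta_hat` should land close to `true_beta`, and `beta_g` should recover the log of the scale factor and the exponent.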





Prediction

Requirements
  $\beta$ known from least squares model training
  $X$ known for a given set of queries

Expected Response
  Response as weighted sum of predictor values
  Computed efficiently as matrix-vector product

$$
\begin{aligned}
E[Y] &= E[X\beta + \varepsilon] \\
     &= E[X\beta] + E[\varepsilon] \\
     &= X\beta
\end{aligned}
$$
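Because the error term has zero mean, prediction reduces to a single matrix-vector product. The sketch below illustrates this in Python with NumPy. The coefficients and query designs are made-up values, and the column of ones for the intercept is an assumption about how X is laid out.

```python
import numpy as np


def predict(X_query: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Expected response E[Y] = X beta for a batch of query designs."""
    # Prepend ones so beta[0] is applied as the intercept.
    A = np.column_stack([np.ones(len(X_query)), X_query])
    return A @ beta


# Hypothetical coefficients and two query designs, for illustration only.
beta = np.array([1.0, 0.5, -0.25])
queries = np.array([[2.0, 4.0],
                    [8.0, 0.0]])
expected = predict(queries, beta)
```

Every query costs one dot product, which is why a trained model can answer many predictions per second where a simulator needs minutes or hours per design.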





Experimental Methodology

Intel Product Simulators
  Models consecutive generations of x86 µ-arch
  Supports dual-, quad-core architectures

Sampling Uniformly at Random (UAR)
  Parameter space includes predictors, ROB, caches
  15 parameters, 4.3B designs
  Sample 300 designs for simulation

Statistical Framework
  R :: software environment for statistical computing
  Hmisc and Design packages [Harrell]
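Uniform random sampling of the design space can be sketched as below. The talk's framework is R; this Python version is a substitute for illustration. The parameter names and value lists in `PARAMETER_SPACE` are hypothetical, since the slides name the kinds of parameters but not their ranges.

```python
import random

# Hypothetical parameter ranges, not taken from the slides.
PARAMETER_SPACE = {
    "rob_entries": [32, 64, 96, 128],
    "l1_kb": [16, 32, 64],
    "l2_kb": [256, 512, 1024, 2048],
    "branch_predictor": ["small", "medium", "large"],
}


def sample_designs(space, count, seed=0):
    """Draw `count` designs, choosing each parameter value independently
    and uniformly at random (duplicates are possible in a small space)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(count)
    ]


# 300 matches the sample size quoted in the slides.
designs = sample_designs(PARAMETER_SPACE, 300)
```

Fixing the seed makes the sample reproducible, so the same 300 designs can be simulated again for a later processor generation.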





Benchmarks

Digital Home
   1  audio      audio conversion
   2  video      video compression
   3  photo      photoshop album
Games
   4  unreal     Unreal Tournament
   5  halflife   Half-Life, modified Quake engine
   6  homeworld  Homeworld, three-dimensional movement
Multimedia
   7  mentalray  rendering, ray tracing
   8  painter    raster graphics package
   9  tachyon    ray tracing
Office
  10  outlook    personal information manager
  11  access     relational database management system
  12  excel      spreadsheet application
Productivity
  13  md2        OpenSSL cryptographic hash function
  14  encrypt    file encryption
  15  flash      multimedia player
Server
  16  specweb    web server
  17  tpcc       on-line transaction processing
  18  specjapp   J2EE 1.3 application servers





Uniprocessor Model Accuracy

Obtain 50 additional random samples for validation

Core :: 1.5% median error
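One plausible way to compute a median error over validation samples is sketched below. It assumes the error is the absolute difference between prediction and simulation, relative to the simulated value; the slides do not spell out the definition. The numbers are made up and are not results from the talk.

```python
from statistics import median


def median_abs_pct_error(predicted, observed):
    """Median of |predicted - observed| / |observed|, in percent."""
    errors = [abs(p - o) / abs(o) * 100.0 for p, o in zip(predicted, observed)]
    return median(errors)


# Made-up validation data for illustration only.
observed = [1.00, 2.00, 4.00, 5.00]
predicted = [1.02, 1.96, 4.20, 5.00]
error_pct = median_abs_pct_error(predicted, observed)
```

The median is less sensitive than the mean to a few badly predicted designs, which may be one reason it is the statistic reported here.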





Evolutionary Design

Evolutionary Approach
  Optimize ProcX
  Design ProcY, enhancing ProcX with µ-arch features
  Re-construct models, accounting for µ-arch features
  Optimize ProcY

Case Study
  Consecutive generations of x86 µ-arch
  Improve FE (e.g., branch prediction)
  Improve MEM (e.g., prefetching)
  Improve OOO (e.g., memory disambiguation)





Evolving Caches

Improve MEM: similar performance with smaller caches





Evolving ROB

Improve FE: more instructions in flight, suggests larger ROB

Improve MEM: fewer cache misses, suggests smaller ROB





Composable Performance Regression

CPR :: build separate core, contention, penalty models

Requires simulations $N_{\mathrm{uni}} > N_{\mathrm{con}} \geq N_{\mathrm{pen}}$

Suppose core sims require $T_1$, multi-core sims require $T_1 n^\gamma$
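The cost argument can be sketched as below. Two assumptions are not from the slides: that a naive approach would run every training sample as a full multi-core simulation, and that each cache-only CMP simulation has some cost `t_con`. All sample counts and costs in the example are hypothetical.

```python
def naive_cost(n_sims, t1, n_cores, gamma):
    """Training cost when every sample is a full multi-core simulation."""
    return n_sims * t1 * n_cores ** gamma


def cpr_cost(n_uni, n_con, n_pen, t1, t_con, n_cores, gamma):
    """Training cost split across core, contention, and penalty models.

    t_con, the cost of one cache-only CMP simulation, is an assumed input.
    """
    core = n_uni * t1                         # uniprocessor simulations
    contention = n_con * t_con                # cache-only CMP simulations
    penalty = n_pen * t1 * n_cores ** gamma   # full CMP simulations
    return core + contention + penalty


# Hypothetical counts satisfying N_uni > N_con >= N_pen, quad-core, gamma = 1.5.
naive = naive_cost(300, 1.0, 4, 1.5)
composed = cpr_cost(300, 100, 50, 1.0, 0.5, 4, 1.5)
ratio = composed / naive
```

Only the penalty term carries the superlinear factor, so the saving grows as the number of full CMP simulations shrinks relative to the uniprocessor sample count.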





CPR: Core

Train with uniprocessor sims from full parameter space

Estimate per-core delay from all design parameters





CPR: Contention

Train with CMP, cache-only sims from reduced subspace

Estimate cache hits/misses from shared cache parameters





CPR: Penalty

Train with composed predictions, few CMP sims from full space

Estimate CMP core delays from core, contention predictions
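The three models compose as sketched below, using Python with NumPy and plain linear least squares for every stage. This illustrates the structure only and is not the authors' implementation. The "simulators" are made-up linear functions, the two-column design layout is hypothetical, and the sample counts are chosen only to respect the ordering of the three sample sizes.

```python
import numpy as np


def fit(X, y):
    """Least-squares coefficients with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]


def predict(X, beta):
    """Evaluate a fitted linear model on the rows of X."""
    return np.column_stack([np.ones(len(X)), X]) @ beta


# Made-up stand-ins for simulators. Design columns: [core param, shared-cache param].
def uni_delay(d):       # stands in for a uniprocessor simulation
    return 10.0 - 0.5 * d[:, 0] - 0.2 * d[:, 1]

def shared_misses(d):   # stands in for a cache-only CMP simulation
    return 5.0 - 0.4 * d[:, 1]

def cmp_delay(d):       # stands in for a full CMP simulation
    return uni_delay(d) + 0.3 * shared_misses(d)


rng = np.random.default_rng(1)
d_uni = rng.uniform(1, 8, size=(300, 2))   # most samples: cheap uniprocessor sims
d_con = rng.uniform(1, 8, size=(100, 2))   # fewer: cache-only CMP sims
d_pen = rng.uniform(1, 8, size=(30, 2))    # fewest: full CMP sims

# Core model: all parameters -> per-core delay.
core_beta = fit(d_uni, uni_delay(d_uni))
# Contention model: shared-cache parameter only -> shared-cache misses.
con_beta = fit(d_con[:, 1:], shared_misses(d_con))


def composed_features(d):
    """Core and contention predictions, used as inputs to the penalty model."""
    return np.column_stack([predict(d, core_beta), predict(d[:, 1:], con_beta)])


# Penalty model: composed predictions -> CMP delay, trained on few CMP sims.
pen_beta = fit(composed_features(d_pen), cmp_delay(d_pen))


def cpr_predict(d):
    """Predict CMP delay for new designs without running a CMP simulation."""
    return predict(composed_features(d), pen_beta)
```

Because the penalty model sees only two inputs, it needs far fewer of the expensive CMP simulations than a model trained directly on every design parameter.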





Benchmarks

Dual-Core Benchmarks
  Set  .1         .2
  1    painter    homeworld
  2    access     mentalray
  3    specjapp   specweb
  4    homeworld  tachyon
  5    dense      flash

Quad-Core Benchmarks
  Set  .1       .2         .3        .4
  1    dense    excel      flash     md2
  2    video    specjapp   specweb   tachyon
  3    excel    homeworld  audio     unreal
  4    outlook  encrypt    halflife  homeworld
  5    painter  mentalray  outlook   encrypt





Multiprocessor Model Accuracy

Dual-core :: 6.6% median error

Quad-core :: 4.8% median error





Scaling Trends

Lower bound CPR costs 0.33x of naïve costs

Approach lower bound as uniprocessor models built


Conclusion

Inference in Industry
  Effective inference for x86 µ-arch
  1.5% median errors relative to simulation
  Evolutionary design for new features across generations

Composable Performance Regression
  Leverage core models to minimize CMP simulations
  Construct separate core, contention, penalty models
  4.8 to 6.6% median errors for dual-, quad-core
  0.33x training costs of prior approaches


Future Directions

Efficient Multiprogramming Analysis
  Evaluate combinations without modeling every combination

Multi-Threaded Workloads
  Extend for homogeneous, heterogeneous threads
  Account for synchronization events

Many-Core Architectures
  Construct models without many-core simulators
  Consider other shared resources (e.g., network)

