Automated Parameter Setting Based on Runtime Prediction: Towards an Instance-Aware Problem Solver
Frank Hutter, Univ. of British Columbia, Vancouver, Canada
Youssef Hamadi, Microsoft Research, Cambridge, UK
Motivation (1): Why automated parameter setting?
• We want to use the best available heuristic for a problem
  – Strong domain-specific heuristics in tree search
    • Domain knowledge helps to pick good heuristics
    • But maybe you don't know the domain ahead of time ...
  – Local search parameters must be tuned
    • Performance depends crucially on the parameter setting
• New application/algorithm:
  – Restart parameter tuning from scratch
  – Waste of time both for researchers and practitioners
• Comparability
  – Is algorithm A faster than algorithm B because more time was spent tuning it?
Motivation (2): Operational scenario
• A CP solver has to solve instances from a variety of domains
• Domains not known a priori
• Solver should automatically use the best strategy for each instance
• Want to learn from the instances we solve
• Previous work on runtime prediction we base on [Leyton-Brown, Nudelman et al. ’02 & ’04]
• Part I: Automated parameter setting based on runtime prediction
• Part II: Incremental learning for runtime prediction in a priori unknown domains
Previous work on runtime prediction for algorithm selection
• General approach
  – Portfolio of algorithms
  – For each instance, choose the algorithm that promises to be fastest
• Examples
  – [Lobjois and Lemaître, AAAI’98] CSP
    • Mostly propagations of different complexity
  – [Leyton-Brown et al., CP’02] Combinatorial auctions
    • CPLEX + 2 other algorithms (which were thought uncompetitive)
  – [Nudelman et al., CP’04] SAT
    • Many tree-search algorithms from the last SAT competition
    • On average considerably faster than each single algorithm
• The learning problem thus reduces to fitting the weights w = (w1, ..., wm)
• To better capture the vast differences in runtime, estimate the logarithm of runtime: e.g. yi = 5 means the runtime is 10^5 sec
• Features can be computed quickly (in seconds)
  – Basic properties like #vars, #clauses, ratio
  – Estimates of search space size
  – Linear programming bounds
  – Local search probes
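The modelling approach above can be sketched in a few lines of Python/NumPy. All feature names and values below are made up for illustration; the point is only the log-runtime transform and the linear least-squares fit:

```python
import numpy as np

# Hypothetical toy data: one row of features per instance
# (e.g. #vars, #clauses, ratio) and measured runtimes in seconds.
X = np.array([[100.0, 430.0, 4.3],
              [150.0, 600.0, 4.0],
              [200.0, 900.0, 4.5]])
runtimes = np.array([12.0, 340.0, 9800.0])

# Regress against log10(runtime) so that errors are measured on
# orders of magnitude rather than raw seconds.
y = np.log10(runtimes)

# Least-squares fit of the weight vector w (with a bias column).
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Predicted runtime in seconds for a new instance.
x_new = np.array([120.0, 500.0, 4.2, 1.0])
predicted_seconds = 10 ** (x_new @ w)
```

Predicting in log space and exponentiating back is what lets one model cover runtimes from milliseconds to days.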
• Linear functions are not very powerful
• But you can use the same methodology to learn more complex functions
  – Let φ = (φ1, ..., φq) be arbitrary combinations of the features x1, ..., xm (so-called basis functions)
  – Learn a linear function of the basis functions: f(φ) = φ * w
• Basis functions used in [Nudelman et al. ’04]
  – Original features: xi
  – Pairwise products of features: xi * xj
  – Only a subset of these (drop useless basis functions)
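A minimal sketch of the quadratic basis-function expansion described above; the subset-selection step from [Nudelman et al. ’04] (dropping useless basis functions) is omitted here:

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_basis(x):
    """Expand a raw feature vector x into basis functions phi:
    the original features plus all pairwise products x_i * x_j."""
    x = np.asarray(x, dtype=float)
    pairwise = [x[i] * x[j] for i, j in
                combinations_with_replacement(range(len(x)), 2)]
    return np.concatenate([x, pairwise])

# For x = (2, 3): originals 2, 3 plus products 2*2, 2*3, 3*3.
phi = quadratic_basis([2.0, 3.0])  # -> [2, 3, 4, 6, 9]
```

The model stays linear in the weights, so the same least-squares machinery applies unchanged to the expanded vector.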
Algorithm selection based on runtime prediction [Leyton-Brown, Nudelman et al. ’02 & ’04]
• Given n different algorithms A1, ..., An
• Training:
  – Learn n separate functions fj: Φ → R, j = 1...n, one per algorithm
• Test: predict each algorithm's runtime on the new instance and run the one that promises to be fastest
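The selection step can be sketched as follows; the weight vectors stand in for the learned functions fj and are illustrative only:

```python
import numpy as np

def select_algorithm(models, phi):
    """Pick the algorithm whose learned model predicts the smallest
    (log-)runtime on the basis-function vector phi. `models` is a
    list of weight vectors, one per algorithm in the portfolio."""
    predictions = [phi @ w_j for w_j in models]
    return int(np.argmin(predictions))

# Two hypothetical algorithms; the second predicts a lower runtime here.
models = [np.array([1.0, 0.5]), np.array([0.2, 0.1])]
best = select_algorithm(models, np.array([2.0, 4.0]))  # -> 1
```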
• Previous work on runtime prediction we base on [Leyton-Brown, Nudelman et al. ’02 & ’04]
• Part I: Automated parameter setting based on runtime prediction
• Part II: Incremental learning for runtime prediction in a priori unknown domains
Parameter setting based on runtime prediction
• Finding the best default parameter setting for a problem class
  – Generate special-purpose code [Minton ’93]
  – Minimize estimated error [Kohavi & John ’95]
  – Racing algorithm [Birattari et al. ’02]
  – Local search [Hutter ’04]
  – Experimental design [Adenso-Díaz & Laguna ’05]
  – Decision trees [Srivastava & Mediratta ’05]
• Runtime prediction for algorithm selection on a per-instance basis
  – Predict runtime for each algorithm and pick the best [Leyton-Brown, Nudelman et al. ’02 & ’04]
• Runtime prediction for setting parameters on a per-instance basis
Naive application of runtime prediction for parameter setting
• Given one algorithm with n different parameter settings P1, ..., Pn
• Training:
  – Learn n separate functions fj: Φ → R, j = 1...n
• Test: Given a new instance zt+1
  – Predict yj(t+1) = fj(φt+1) for each of the parameter settings
  – Run the algorithm with the setting Pj with minimal yj(t+1)
• If there are too many parameter configurations:
  – Cannot run each parameter setting on each instance
  – Need to generalize (cf. human parameter tuning)
  – With separate functions there is no way to generalize
Application of runtime prediction for parameter setting
• View the parameters as additional features; learn a single function
• Training: Given a set of instances z1, ..., zt
  – For each instance zi:
    • Compute features xi
    • Pick some parameter settings p1, ..., pn
    • Run the algorithm with settings p1, ..., pn to get runtimes y1(i), ..., yn(i)
    • Basis functions φ1(i), ..., φn(i) include the parameter settings
  – Collect pairs (φj(i), yj(i)) (n data points per instance)
  – Learn only a single function g: Φ → R
• Test: Given a new instance zt+1
  – Compute features xt+1
  – Search over parameter settings pj for the one with minimal predicted runtime
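The single-function approach above can be sketched end to end. Everything below is synthetic: random "instance features", random parameter settings, and log-runtimes generated from a known linear dependence, so that the search over settings has a verifiable optimum:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: instance features x and parameter settings p are
# concatenated into a single input vector (the simplest basis).
n_train = 200
features = rng.uniform(size=(n_train, 3))   # instance features x
params = rng.uniform(size=(n_train, 2))     # parameter settings p
inputs = np.hstack([features, params])
# Synthetic log-runtimes with a known dependence on the parameters.
log_runtime = inputs @ np.array([1.0, -0.5, 0.2, 2.0, -1.0])

# Learn one function g over features AND parameters.
w, *_ = np.linalg.lstsq(inputs, log_runtime, rcond=None)

# Test phase: fix the features of a new instance, then search a grid
# of candidate settings for the minimal predicted runtime.
x_new = np.array([0.4, 0.6, 0.1])
candidates = [np.array([a, b])
              for a in np.linspace(0, 1, 11)
              for b in np.linspace(0, 1, 11)]
best = min(candidates, key=lambda p: np.concatenate([x_new, p]) @ w)
```

Because the parameters are just more inputs to one model, information generalizes across settings that were never run on a given instance, which the n-separate-functions approach cannot do.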
Summary of automated parameter setting based on runtime prediction
• Learn a single function that maps features and parameter settings to runtime
• Given a new instance
  – Compute the features (they are fixed)
  – Search for the parameter setting that minimizes predicted runtime for these features
• Previous work on runtime prediction we base on [Leyton-Brown, Nudelman et al. ’02 & ’04]
• Part I: Automated parameter setting based on runtime prediction
• Part II: Incremental learning for runtime prediction in a priori unknown domains
Solution: Sequential Bayesian Linear Regression
Update "knowledge" as new data arrives: a probability distribution over the weights w
• Incremental (one (xi, yi) pair at a time)
  – Seamlessly integrate this new data
  – "Optimal": yields the same result as a batch approach
• Efficient
  – Computation: 1 matrix inversion per update
  – Memory: can drop data we have integrated
• Robust
  – Simple to implement (3 lines of Matlab)
  – Provides estimates of uncertainty in predictions
Sequential Bayesian linear regression – intuition
• Instead of predicting a single runtime y, use a probability distribution P(Y)
• The mean of P(Y) is exactly the prediction of the non-Bayesian approach, but we get uncertainty estimates
• Standard linear regression:
  – Training: given training data φ1:n, y1:n, fit the weights w such that y1:n ≈ φ1:n * w
  – Prediction: yn+1 = φn+1 * w
• Bayesian linear regression:
  – Training: given training data φ1:n, y1:n, infer the probability distribution P(w | φ1:n, y1:n) ∝ P(w) * ∏i P(yi | φi, w)
  – Prediction: P(yn+1 | φn+1, φ1:n, y1:n) = ∫ P(yn+1 | w, φn+1) * P(w | φ1:n, y1:n) dw
  – All of these distributions are Gaussian
Sequential Bayesian linear regression – technical
• "Knowledge" about the weights: Gaussian (μw, Σw)
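A sketch of the sequential update, roughly corresponding to the "3 lines of Matlab" mentioned above, here in Python/NumPy. The observation-noise variance is an assumed hyperparameter, and the inversion-based form is chosen for clarity rather than efficiency:

```python
import numpy as np

def bayes_update(mu, Sigma, phi, y, noise_var=1.0):
    """Fold a single observation (phi, y) into the Gaussian weight
    posterior N(mu, Sigma). Equivalent to the batch solution, so the
    data point can be discarded after the update."""
    Sigma_inv = np.linalg.inv(Sigma) + np.outer(phi, phi) / noise_var
    Sigma_new = np.linalg.inv(Sigma_inv)
    mu_new = Sigma_new @ (np.linalg.inv(Sigma) @ mu + phi * y / noise_var)
    return mu_new, Sigma_new

def predict(mu, Sigma, phi, noise_var=1.0):
    """Predictive distribution for a new basis vector phi: Gaussian
    with mean phi^T mu and variance phi^T Sigma phi + noise_var."""
    return phi @ mu, phi @ Sigma @ phi + noise_var

# Start from a broad Gaussian prior; uncertainty shrinks as data arrives.
mu, Sigma = np.zeros(2), 10.0 * np.eye(2)
for phi, y in [(np.array([1.0, 0.0]), 2.0),
               (np.array([0.0, 1.0]), -1.0)]:
    mu, Sigma = bayes_update(mu, Sigma, phi, y)
mean, var = predict(mu, Sigma, np.array([1.0, 1.0]))
```

The predictive variance is what grows large for instances unlike anything seen so far, which is exactly the signal used for unknown domains below.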
Summary of incremental learning for runtime prediction
• Have a probability distribution over the weights:
  – Start with a Gaussian prior, incrementally update it with more data
• Given the Gaussian weight distribution, the predictions are also Gaussians
  – We know how uncertain our predictions are
  – For new domains, we will be very uncertain and only grow more confident after having seen a couple of data points
• Previous work on runtime prediction we base on [Leyton-Brown, Nudelman et al. ’02 & ’04]
• Part I: Automated parameter setting based on runtime prediction
• Part II: Incremental learning for runtime prediction in a priori unknown domains
Domain for our experiments
• SAT
  – Best-studied NP-hard problem
  – Good features already exist [Nudelman et al. ’04]
  – Lots of benchmarks
• Stochastic Local Search (SLS)
  – Runtime prediction has never been done for SLS before
  – Parameter tuning is very important for SLS
  – Parameters are often continuous
• SAPS algorithm [Hutter, Tompkins, Hoos ’02]
  – Still amongst the state of the art
  – Default setting not always best
  – Well, I also know it well ;-)
• But the approach is applicable to almost anything for which we can compute features!
Stochastic Local Search for SAT: Scaling and Probabilistic Smoothing (SAPS) [Hutter, Tompkins, Hoos ’02]
• Clause-weighting algorithm for SAT; was state-of-the-art in 2002
  – Start with all clause weights set to 1
  – Hill-climbing until you hit a local minimum
  – In local minima:
    • Scaling: scale the weights of unsatisfied clauses: wc ← α * wc
    • Probabilistic smoothing: with probability Psmooth, smooth all clause weights: wc ← ρ * wc + (1 - ρ) * average wc
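The scaling and smoothing updates above can be sketched as follows. The default values for α, ρ, and Psmooth are illustrative stand-ins; these continuous parameters are exactly what the automated tuner would set per instance:

```python
import random

def saps_local_minimum_step(clause_weights, unsat, alpha=1.3,
                            rho=0.8, p_smooth=0.05):
    """Sketch of the SAPS clause-weight update in a local minimum:
    scale the weights of unsatisfied clauses by alpha, and with
    probability p_smooth pull every weight toward the mean weight."""
    for c in unsat:
        clause_weights[c] *= alpha                     # scaling
    if random.random() < p_smooth:                     # probabilistic smoothing
        avg = sum(clause_weights) / len(clause_weights)
        clause_weights[:] = [rho * w + (1 - rho) * avg
                             for w in clause_weights]
    return clause_weights
```

Scaling reshapes the search landscape to escape the local minimum, while occasional smoothing keeps the weights from drifting apart without bound.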
• Only satisfiable instances!
• SAT04rand: SAT ’04 competition instances
• mix: mix of lots of different domains from SATLIB: random, graph colouring, blocksworld, inductive inference, logistics, ...
Where uncertainty helps in practice: qualitative differences in training & test set
• Trained on mix, tested on SAT04rand
Where uncertainty helps in practice (2): zoomed to predictions with low uncertainty
• Previous work on runtime prediction we base on [Leyton-Brown, Nudelman et al. ’02 & ’04]
• Part I: Automated parameter setting based on runtime prediction
• Part II: Incremental learning for runtime prediction in a priori unknown domains
• Automated parameter tuning is needed and feasible
  – Algorithm experts waste their time on it
  – A solver can automatically choose appropriate heuristics based on instance characteristics
• Such a solver could be used in practice
  – Learns incrementally from the instances it solves
Future work along these lines
• Increase predictive performance
  – Better features
  – More powerful ML algorithms
• Active learning
  – Run the most informative probes for new domains (needs the uncertainty estimates)
• Use uncertainty
  – Pick the algorithm with maximal probability of success (not the one with minimal expected runtime!)
• More domains
  – Tree search algorithms
  – CP
Future work along related lines
• If there are no features:
  – Local search in parameter space to find the best default parameter setting [Hutter ’04]
• If we can change strategies while running the algorithm:
  – Reinforcement learning for algorithm selection
• Thanks to
  – Youssef Hamadi
  – Kevin Leyton-Brown
  – Eugene Nudelman
  – You, for your attention