8/12/2019 Lion 5 Paper

Robust Gaussian process-based global optimization using a fully Bayesian expected improvement criterion

Romain Benassi, Julien Bect, and Emmanuel Vazquez

SUPELEC, Gif-sur-Yvette, France

Abstract. We consider the problem of optimizing a real-valued continuous function f, which is supposed to be expensive to evaluate and, consequently, can only be evaluated a limited number of times. This article focuses on the Bayesian approach to this problem, which consists in combining evaluation results and prior information about f in order to efficiently select new evaluation points, as long as the budget for evaluations is not exhausted.

The algorithm called efficient global optimization (EGO), proposed by Jones, Schonlau and Welch (J. Global Optim., 13(4):455-492, 1998), is one of the most popular Bayesian optimization algorithms. It is based on a sampling criterion called the expected improvement (EI), which assumes a Gaussian process prior on f. In the EGO algorithm, the parameters of the covariance of the Gaussian process are estimated from the evaluation results by maximum likelihood, and these parameters are then plugged into the EI sampling criterion. However, it is well known that this plug-in strategy can lead to very disappointing results when the evaluation results do not carry enough information about f to estimate the parameters in a satisfactory manner.

We advocate a fully Bayesian approach to this problem, and derive an analytical expression for the EI criterion in the case of Student predictive distributions. Numerical experiments show that the fully Bayesian approach makes EI-based optimization more robust while maintaining an average loss similar to that of the EGO algorithm.

    1 Introduction

Let f be a continuous real-valued function defined on some compact subset X ⊂ R^d. We consider the problem of finding the maximum of f, when f is supposed to be expensive to evaluate because one evaluation takes a long time or a large amount of resources. In this case, the optimization of f must be carried out using a limited number of evaluations. More precisely, given a budget of N evaluations of f, our objective is to choose sequentially N evaluation points X_1, …, X_N ∈ X so that the approximation error ε(X_N, f) = M − M_N is small, where X_N stands for (X_1, …, X_N), M = max_{x ∈ X} f(x) and M_N = max(f(X_1), …, f(X_N)).

In this article, we adopt a Bayesian approach to this sequential decision problem: the unknown function f is considered as a sample path of a real-valued

hal-00607816, version 1 - 11 Jul 2011

Author manuscript, published in "Learning and Intelligent Optimization (LION 5'11), Rome, Italy (2011)". DOI: 10.1007/978-3-642-25566-3_13
http://dx.doi.org/10.1007/978-3-642-25566-3_13
http://hal.archives-ouvertes.fr/
http://hal-supelec.archives-ouvertes.fr/hal-00607816

random process ξ defined on some probability space (Ω, B, P_0) with parameter x ∈ X, and a good strategy is a strategy that achieves, or gets close to, the Bayes risk r_B := inf_{X_N} E_0(ε(X_N, ξ)), where E_0 denotes the expectation with respect to P_0 and the infimum is taken over the set of all sequential strategies. The reader is referred to the books [1-5] for a broader view on the field of global optimization.

It is well known [6-12] that an optimal Bayesian optimization strategy, i.e. a strategy X_N such that E_0(ε(X_N, ξ)) = r_B, can be formally obtained by dynamic programming. Let E_n, n = 1, 2, …, denote the conditional expectation with respect to the σ-algebra F_n generated by the random variables X_1, ξ(X_1), …, X_n, ξ(X_n). Denote by R_N = E_N(ε(X_N, ξ)) the terminal risk and define by backward induction

    R_n = min_{x ∈ X} E_n(R_{n+1} | X_{n+1} = x),   n = N−1, …, 0.   (1)

Then we have R_0 = r_B, and the strategy X_N defined by

    X_{n+1} = argmin_{x ∈ X} E_n(R_{n+1} | X_{n+1} = x),   n = 0, …, N−1,   (2)

is optimal. Unfortunately, solving (1)-(2) over a horizon N of more than a few steps is not numerically tractable, for both the space of possible actions and the space of possible outcomes at each step are continuous.

A natural way of dealing with this problem is to consider a suboptimal one-step lookahead strategy; see, e.g., [13, chapter 6]. This leads to choosing each new evaluation point according to

    X_{n+1} = argmin_{x ∈ X} E_n(M − M_{n+1} | X_{n+1} = x)
            = argmax_{x ∈ X} E_n(M_{n+1} | X_{n+1} = x)
            = argmax_{x ∈ X} ρ_n(x) := E_n((ξ(X_{n+1}) − M_n)_+ | X_{n+1} = x),   (3)

where (z)_+ = max(0, z). The sampling criterion ρ_n, introduced by J. Mockus [6] and popularized through the EGO algorithm [14], is known as the expected improvement (EI).

When ξ is a Gaussian process, or in other words, when a Gaussian process prior is chosen for f, it is well known that the EI can be written in closed form, with the consequence that the maximization of ρ_n can be carried out with a moderate computational effort. However, a Gaussian process prior carries a high amount of information about f, and it is often difficult to elicit such a prior before any evaluation is made. As a result, the covariance function of ξ is usually assumed to belong to some parametric class of positive definite functions, the value of the parameters being assumed unknown. In the EGO algorithm, the parameters are estimated from the evaluation results by maximum likelihood, and then plugged into the EI sampling criterion (computed for a Gaussian process with known covariance function). It has been reported [15] that this plug-in


strategy can lead to very disappointing results when the evaluation results do not carry enough information about f to estimate the parameters satisfactorily.

We advocate a fully Bayesian approach to this problem, following the steps of Locatelli [9, 16] and, more recently, Osborne and co-authors [17-19].

The paper is organized as follows. Section 2 recalls the expression of the EI criterion in the case of a Gaussian process prior with known covariance function, and describes the plug-in approach used in the EGO algorithm to handle the parameters of the covariance function when it is only assumed to belong to some parametric class. Section 3 explains how a fully Bayesian approach can be adopted in this problem, in order to take into account the uncertainty on the parameters of the covariance function. Section 4 presents a new closed-form expression of the EI criterion for Student predictive densities, which arises naturally when a conjugate inverse-gamma prior is used for the variance parameter of the Gaussian process prior. Section 5 illustrates with numerical results the benefits of the fully Bayesian approach, focusing more particularly on the tail of the error distribution, i.e., on the occurrence of large errors.

Nota bene. The analytical expression of the expected improvement for Student predictive distributions, presented in Section 4, has in fact already been obtained by Williams, Santner and Notz [20] in the special case of an improper Jeffreys prior on the variance. We warmly thank Frank Hutter for pointing out this paper to us during the LION 5 conference.

    2 Efficient global optimization

2.1 The expected improvement sampling criterion for a Gaussian process

Recall that the distribution of a Gaussian process ξ is uniquely determined by its mean function m(x) := E_0(ξ(x)), x ∈ X, and its covariance function k(x, y) := E_0((ξ(x) − m(x))(ξ(y) − m(y))), x, y ∈ X. Hereafter, we assume that the mean function is constant on X, and write ξ ∼ GP(m, k) to denote that ξ is a Gaussian process with mean function m(x) = m ∈ R and covariance function k.

Proposition 1. Let k be a stationary covariance function written as k(x, y) = σ² r(x − y), x, y ∈ X, where σ² > 0 and r(0) = 1 (hence, r is a correlation function). Assume that ξ | m ∼ GP(m, k) and m ∼ U(R), where U(R) denotes the (improper) uniform distribution over R. Then, for all x ∈ X,

    ξ(x) | F_n ∼ N(μ_n(x), s_n²(x)),

where

    μ_n(x) = m̂_n + r_n(x)^T R_n^{−1} (ξ_n − m̂_n 1_n),   (4)

with

    ξ_n = (ξ(X_1), …, ξ(X_n))^T,
    1_n = (1, …, 1)^T ∈ R^n,
    R_n the correlation matrix of ξ_n,
    r_n(x) the correlation vector between ξ(x) and ξ_n,
    m̂_n = (1_n^T R_n^{−1} ξ_n) / (1_n^T R_n^{−1} 1_n), the weighted least squares estimate of m,

and

    s_n²(x) = σ² γ_n²(x),   (5)

with

    γ_n²(x) = 1 − r_n(x)^T R_n^{−1} r_n(x) + (1 − r_n(x)^T R_n^{−1} 1_n)² / (1_n^T R_n^{−1} 1_n).   (6)
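Equations (4)-(6) translate directly into a few lines of linear algebra. The following is a minimal sketch in Python; the function name `krige` and the use of an explicit matrix inverse are our own choices (in practice one would factor R_n once and reuse triangular solves):

```python
import numpy as np

def krige(xi_n, R_n, r_nx):
    """Posterior mean (4) and normalized variance (6) of Proposition 1.

    xi_n : evaluation results, shape (n,)
    R_n  : correlation matrix of xi_n, shape (n, n)
    r_nx : correlations between xi(x) and xi_n, shape (n,)
    """
    n = len(xi_n)
    ones = np.ones(n)
    Ri = np.linalg.inv(R_n)  # explicit inverse; fine for small n
    m_hat = (ones @ Ri @ xi_n) / (ones @ Ri @ ones)        # WLS estimate of m
    mu = m_hat + r_nx @ Ri @ (xi_n - m_hat * ones)         # eq. (4)
    gamma2 = (1.0 - r_nx @ Ri @ r_nx
              + (1.0 - r_nx @ Ri @ ones) ** 2 / (ones @ Ri @ ones))  # eq. (6)
    return mu, gamma2
```

At an evaluated point, r_n(x) is a row of R_n and γ_n²(x) = 0 (the predictor interpolates); multiplying γ_n²(x) by σ² gives s_n²(x) as in (5).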

Proposition 2. Under the assumptions of Proposition 1, the expected improvement can be written as

    ρ_n(x) = s_n(x) φ((μ_n(x) − M_n)/s_n(x)) + (μ_n(x) − M_n) Φ((μ_n(x) − M_n)/s_n(x))   if s_n(x) > 0,
    ρ_n(x) = (μ_n(x) − M_n)_+                                                            if s_n(x) = 0,   (7)

where Φ denotes the Gaussian cumulative distribution function and φ its density.

Propositions 1 and 2 show that, given a set of evaluation points and a Gaussian prior, the EI sampling criterion can be computed with a moderate amount of resources (computing (4) at q different points in X involves O(qn²) operations). However, it is rare that a user has enough information about f to choose an adequate covariance function k before any evaluation is made. The approach generally taken consists in choosing k in a parametrized class of covariance functions and estimating the parameters of k from the evaluation results.
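As a concrete illustration of (7), the Gaussian EI can be evaluated in vectorized form over many candidate points at once. A sketch (the function name and array conventions are ours):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu_n, s_n, M_n):
    """Gaussian EI of eq. (7): s*phi(u) + (mu - M)*Phi(u), u = (mu - M)/s."""
    mu_n = np.asarray(mu_n, dtype=float)
    s_n = np.asarray(s_n, dtype=float)
    ei = np.maximum(mu_n - M_n, 0.0)           # s_n(x) = 0 branch of (7)
    pos = s_n > 0
    u = (mu_n[pos] - M_n) / s_n[pos]
    ei[pos] = s_n[pos] * norm.pdf(u) + (mu_n[pos] - M_n) * norm.cdf(u)
    return ei
```

Note that the EI is always nonnegative, and vanishes at already-evaluated points (where s_n(x) = 0 and μ_n(x) ≤ M_n).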

    2.2 Classical parametrized covariance functions

There are chiefly three classes of parametrized covariance functions in the literature of Gaussian processes for modeling computer experiments: the class of the so-called Gaussian covariances, the class of the exponential covariances, and that of the Matérn covariances. Using Matérn covariances makes it possible to tune the mean square differentiability of ξ, which is not the case with the exponential and Gaussian covariances.

Define κ_ν : R_+ → R_+ such that, for all h ≥ 0,

    κ_ν(h) = (1 / (2^{ν−1} Γ(ν))) (2 ν^{1/2} h)^ν K_ν(2 ν^{1/2} h),   (8)

where Γ is the Gamma function and K_ν is the modified Bessel function of the second kind of order ν. The parameter ν > 0 controls the regularity at the origin of κ_ν.

The anisotropic form of the Matérn covariance on R^d may be written as k_θ(x, y) = σ² r_θ(x, y), with

    r_θ(x, y) = κ_ν( ( Σ_{i=1}^d (x[i] − y[i])² / ρ_i² )^{1/2} ),   x, y ∈ R^d,   (9)

where the positive scalar σ² is a variance parameter (we have k_θ(x, x) = σ²), x[i], y[i] denote the i-th coordinates of x and y, the positive scalars ρ_i represent scale or range parameters of the covariance, or in other words, characteristic correlation lengths, and finally θ = (ν, ρ_1, …, ρ_d) ∈ R_+^{d+1} denotes the parameter vector of the Matérn covariance. Note that an isotropic form of the Matérn covariance is obtained by setting ρ_1 = … = ρ_d = ρ. Then, the parameter vector of the Matérn covariance is θ = (ν, ρ) ∈ R_+².
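For reference, (8)-(9) can be implemented directly with SciPy's Bessel function. A sketch under the parametrization above (the function names are ours):

```python
import numpy as np
from scipy.special import gamma, kv

def kappa(h, nu):
    """Matern correlation kappa_nu(h) of eq. (8), with kappa_nu(0) = 1."""
    h = np.asarray(h, dtype=float)
    out = np.ones_like(h)
    pos = h > 0
    z = 2.0 * np.sqrt(nu) * h[pos]
    out[pos] = z ** nu * kv(nu, z) / (2.0 ** (nu - 1.0) * gamma(nu))
    return out

def matern_correlation(x, y, nu, rho):
    """Anisotropic Matern correlation r_theta(x, y) of eq. (9)."""
    x, y, rho = (np.asarray(a, dtype=float) for a in (x, y, rho))
    h = np.sqrt(np.sum((x - y) ** 2 / rho ** 2))
    return float(kappa(np.array([h]), nu)[0])
```

With this parametrization, ν = 1/2 recovers the exponential correlation κ(h) = exp(−√2 h).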

    2.3 The EGO algorithm

The approach taken in the EGO (efficient global optimization) algorithm [14, 21-23] consists in estimating the unknown parameters of the covariance function by maximum likelihood, after each new evaluation. Then, the EI sampling criterion is computed using the current value of the parameters of the covariance. EGO can therefore be viewed as a plug-in approach.

Remark 1 (about maximum likelihood estimation of the parameters of a covariance function of a Gaussian process). Recall that, for ξ ∼ GP(m, k) with k(x, y) = σ² r(x, y; θ), the likelihood of the evaluation results can be written as

    ℓ_n(ξ_n; m, σ², θ) = (2πσ²)^{−n/2} |R_n(θ)|^{−1/2} exp( −(1/(2σ²)) (ξ_n − m 1_n)^T R_n(θ)^{−1} (ξ_n − m 1_n) ),   (10)

where R_n(θ) stands for the correlation matrix of ξ_n, parametrized by θ. Note that setting to zero the partial derivatives of ℓ_n with respect to m and σ² yields the following maximum likelihood estimates for m and σ²:

    m̂(θ) = (1_n^T R_n(θ)^{−1} ξ_n) / (1_n^T R_n(θ)^{−1} 1_n),   (11)

    σ̂²(θ) = (1/n) (ξ_n − m̂ 1_n)^T R_n(θ)^{−1} (ξ_n − m̂ 1_n).   (12)

Thus the maximum likelihood estimate of θ can be obtained by maximizing the profile likelihood ℓ_n(ξ_n; m̂(θ), σ̂²(θ), θ).
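The closed-form estimates (11)-(12) are cheap to compute once R_n(θ) is available. A sketch (the function name is ours; we use linear solves rather than an explicit inverse):

```python
import numpy as np

def ml_estimates(xi_n, R_n):
    """Maximum likelihood estimates (11)-(12) of m and sigma^2, given R_n(theta)."""
    xi_n = np.asarray(xi_n, dtype=float)
    n = len(xi_n)
    ones = np.ones(n)
    Ri_xi = np.linalg.solve(R_n, xi_n)
    Ri_1 = np.linalg.solve(R_n, ones)
    m_hat = (ones @ Ri_xi) / (ones @ Ri_1)                  # eq. (11)
    resid = xi_n - m_hat * ones
    s2_hat = (resid @ np.linalg.solve(R_n, resid)) / n      # eq. (12)
    return m_hat, s2_hat
```

Substituting these into (10) yields the profile likelihood, which is then maximized over θ, for instance on a grid or with a generic optimizer.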


    2.4 The case of deceptive functions

"Deceptive functions" is a term coined by D. Jones (see [15, 25]) to describe functions that appear to be flat based on evaluation results. In fact, any function can potentially appear to be flat depending on how it is sampled.

When the available evaluation results do not bring enough information on the objective function f to estimate the parameters of the covariance function with a reasonable precision, the variance of the error of prediction can be severely under-estimated, as depicted in Figure 1. As will be shown in Section 5.1, this can lead to very unsatisfactory behavior of the EGO algorithm, which tends to waste lots of evaluations in local search around the current maxima (exploitation), very early in the optimization procedure, to the detriment of global search (exploration).

Fig. 1. Example of a deceptive sampling of a function (dash-dot line). Evaluation points (black dots) are chosen such that the value of the function is around zero at these points. After having estimated the parameters of the covariance function by maximum likelihood, the prediction is very flat (solid line) and the confidence intervals derived from the standard deviation of the error of prediction (gray area) are severely underestimated.

    3 Fully Bayesian one-step lookahead optimization

It has been emphasized in Section 1 that the rationale behind the EI criterion is of a Bayesian decision-theoretic nature. Indeed, maximizing the EI criterion at iteration n is equivalent to minimizing the expected loss E_n(max(ξ) − M_{n+1}), where the expectation is taken with respect to the value of the next evaluation, which is unknown and therefore modeled as a random variable.

In a fully Bayesian setting, all the unknown parameters of the model have to be given prior distributions. This has already been done for the unknown


mean m in Proposition 1. Let π_0 denote the prior distribution of the vector θ of covariance parameters (which includes the variance σ²), and let π_n, n = 1, …, N, denote the corresponding posterior distributions. According to Bayes' rule, the posterior distribution of ξ(x) is a mixture of Gaussian distributions N(μ_n(x; θ), s_n²(x; θ)) weighted by π_n(dθ). The expected improvement criterion for this model can thus be written, using the tower property of conditional expectations, as

    E_n((ξ(x) − M_n)_+) = E_n( E_n((ξ(x) − M_n)_+ | θ) ) = ∫ ρ_n(x; θ) π_n(dθ).   (13)

Note that the plug-in EI criterion of Section 2.3 can be seen as an approximation of the fully Bayesian criterion (13):

    ∫ ρ_n(x; θ) π_n(dθ) ≈ ρ_n(x; θ̂_n),

which is justified only if the posterior distribution is concentrated enough around the MLE estimate θ̂_n. In the general case, we claim that it is safer to use the fully Bayesian criterion (13), since the corresponding expected loss integrates the uncertainty related to the fact that θ is not exactly known. This claim will be supported by the numerical results of Section 5.

When π_0 is a finitely supported discrete distribution, the posterior distribution π_n, and therefore the integral (13), can be computed exactly using Bayes' rule. For more general prior distributions, the integral can be approximated by stochastic techniques like MCMC sampling or SMC sampling (see [26-28] and the references therein). An alternative approach using Bayesian quadrature rules [29] has been proposed in [17-19]. In all cases, the EI criterion is approximated by an expression of the form Σ_i w_i ρ_n(x; θ_i), which amounts to saying that π_n is approximated by the discrete distribution Σ_i w_i δ_{θ_i}.
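With a finitely supported prior, (13) is simply a posterior-weighted average of plug-in EI values. A sketch of the Bayes-rule weighting (the function name and the log-scale convention are ours):

```python
import numpy as np

def fully_bayesian_ei(ei_per_theta, log_lik, log_prior):
    """Approximate (13) as sum_i w_i * rho_n(x; theta_i), with posterior
    weights w_i obtained by Bayes' rule from the prior and the likelihood.

    ei_per_theta : plug-in EI values rho_n(x; theta_i), shape (I,) or (I, q)
    log_lik      : log-likelihood of the n evaluations for each theta_i, shape (I,)
    log_prior    : log prior weights log pi_0(theta_i), shape (I,)
    """
    log_w = np.asarray(log_prior, dtype=float) + np.asarray(log_lik, dtype=float)
    log_w -= log_w.max()            # shift for numerical stability
    w = np.exp(log_w)
    w /= w.sum()                    # posterior weights pi_n(theta_i)
    return w @ np.asarray(ei_per_theta, dtype=float)
```

When the likelihood strongly favors a single θ_i, the criterion reduces to the plug-in EI at that value, which is consistent with the approximation above.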

Remark 2. Although fully Bayesian approaches for Gaussian process models have been proposed in the literature for more than two decades (see [30, 31] and the references therein), surprisingly little has been written from this perspective in the context of Bayesian global optimization. An early attempt in this direction can be found in [9, 16], where the variance parameter of a Brownian motion is given an inverse-gamma prior and then integrated out as in (13). More recently, the fully Bayesian approach has been developed in a more general way by [17-19], but the important connection of (13) with the usual (Gaussian) EI criterion was not clearly established.

Remark 3. Discrete mixtures of Gaussian distributions and the corresponding EI criterion have also been introduced in [32] to allow for the use of several parametric classes of covariance functions, in order to provide increased robustness with respect to the choice of a particular class. The approach is not Bayesian, however, since the weights in the mixture are not posterior probabilities.


    4 Student EI

Let us consider the case of a Gaussian process ξ with unknown mean m and covariance function of the form k(x, y) = σ² r(x, y). We assume that m and σ² are independent, with m uniformly distributed on R (as in Proposition 1) and σ² following an inverse-gamma distribution with shape parameter a_0 and scale parameter b_0, hereafter denoted by IG(a_0, b_0). We shall prove that, in this setting, the EI criterion still has an explicit analytical expression, which is a generalization of the usual EI criterion given in Proposition 2.

First, recall that the prior chosen for σ² is conjugate [33]:

Proposition 3. The conditional distribution of σ² given F_n is IG(a_n, b_n), with

    a_n = a_0 + (n − 1)/2,
    b_n = b_0 + (1/2) (ξ_n − m̂_n 1_n)^T R_n^{−1} (ξ_n − m̂_n 1_n).

Using this result and the fact that ξ(x) − μ_n(x) | σ², F_n ∼ N(0, σ² γ_n²(x)), it is easy to show that the predictive distribution of ξ(x) is a Student distribution. More precisely:

Proposition 4. Let t_ν denote the Student distribution with ν > 0 degrees of freedom. Then, for all x ∈ X,

    (ξ(x) − μ_n(x)) / σ_n(x) | F_n ∼ t_{ν_n},

with ν_n = 2 a_n and σ_n²(x) = (b_n / a_n) γ_n²(x).

In other words, the predictive distribution at x is a location-scale Student distribution with ν_n degrees of freedom, location parameter μ_n(x) and scale parameter σ_n(x). The following result is the key to our EI criterion for Student predictive distributions:

Lemma 1. Let T ∼ t_ν with ν > 0. Then

    E((T + u)_+) = +∞                                          if ν ≤ 1,
    E((T + u)_+) = ((ν + u²)/(ν − 1)) F_ν′(u) + u F_ν(u)       otherwise,

where F_ν is the cumulative distribution function of t_ν (and F_ν′ its density).

Combining Lemma 1 and Proposition 4 finally yields an explicit expression of the EI criterion:

Theorem 1. Under the assumptions of this section, for all x ∈ X,

    E_n((ξ(x) − M_n)_+) = σ_n(x) [ ((ν_n + u²)/(ν_n − 1)) F_{ν_n}′(u) + u F_{ν_n}(u) ],   (14)

with u = (μ_n(x) − M_n)/σ_n(x).
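Formula (14) is straightforward to evaluate with SciPy's Student distribution. A sketch (the function name is ours; valid for ν_n > 1, per Lemma 1):

```python
import numpy as np
from scipy.stats import t as student_t

def student_ei(mu_n, sigma_n, M_n, nu_n):
    """Student EI of eq. (14); finite only for nu_n > 1 (Lemma 1)."""
    mu_n = np.asarray(mu_n, dtype=float)
    sigma_n = np.asarray(sigma_n, dtype=float)
    u = (mu_n - M_n) / sigma_n
    return sigma_n * ((nu_n + u ** 2) / (nu_n - 1.0) * student_t.pdf(u, df=nu_n)
                      + u * student_t.cdf(u, df=nu_n))
```

As ν_n → ∞, the Student density and CDF converge to their Gaussian counterparts and the expression recovers the Gaussian EI of Proposition 2.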


It has been assumed, up to this point, that the only unknown parameter in the covariance function is the variance σ². More generally, assume that k(x, y) = σ² r(x, y; θ): in this case we proceed by conditioning as in Section 3. Indeed, assume that θ is independent from (m, σ²), with a prior distribution π_0. Let us denote by ρ_n(x; θ) = E_n((ξ(x) − M_n)_+ | θ) the value of the EI criterion at x provided by Theorem 1 when the value of the unknown parameter is θ. Then

    E_n((ξ(x) − M_n)_+) = E_n(ρ_n(x; θ)) = ∫ ρ_n(x; θ) π_n(dθ),   (15)

where π_n denotes the posterior distribution of θ after n evaluations. As explained in Section 3, the integral (15) boils down to a finite sum that can be computed exactly (using Bayes' rule) when the prior π_0 has a finite support; in the general case, approximation techniques have to be used.

    5 Numerical experiments

    5.1 Optimization of a deceptive function

Experiment. Consider the objective function f : X = [−1, 1] → R defined by

    f(x) = x (sin(10x + 1) + 0.1 sin(15x)),   x ∈ X.

We choose an initial set of four evaluation points with abscissas −0.43, −0.11, 0.515 and 0.85, as shown in Figure 1. Our objective is to compare the evaluation points chosen by the plug-in approach (i.e., the EGO algorithm) and those chosen by the fully Bayesian algorithm (FBA) proposed in Section 4.

In both approaches, we consider a Matérn covariance function with a known regularity parameter ν = 2 (see Section 2.2). In the approach of Section 4, we choose an inverse-gamma distribution IG(0.2, 12) for σ². Since X has dimension one, there is only one range parameter ρ. To simplify the implementation of the proposed approach, we shall assume that ρ has a finite-support distribution. More precisely, define ρ_min and ρ_max, such that ρ_min < ρ_max, and set, for all i = 0, …, I, ρ_i = ρ_min (ρ_max / ρ_min)^{i/I}. We assume a uniform prior distribution over the ρ_i's, with ρ_min = 2 × 10⁻³, ρ_max = 2 and I = 100.

The optimization of the two sampling criteria is performed by a Monte Carlo approach. More precisely, we generate once and for all a set of q = 600 candidate points uniformly distributed over X, and the search for the maximum of each sampling criterion is carried out at each iteration by evaluating the sampling criterion over this finite set (the same set of points is used for both criteria).
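The Monte Carlo maximization described above amounts to a simple argmax over a fixed candidate set. A sketch of this step (the names and the use of f itself as a test criterion are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.uniform(-1.0, 1.0, size=600)   # q = 600 points, drawn once and for all

def f(x):
    """Objective function of Section 5.1."""
    return x * (np.sin(10.0 * x + 1.0) + 0.1 * np.sin(15.0 * x))

def next_point(criterion):
    """Return the candidate maximizing the sampling criterion over the fixed set."""
    vals = criterion(candidates)
    return candidates[int(np.argmax(vals))]
```

At each iteration, `criterion` would be the EI (plug-in or fully Bayesian) computed from the current evaluations; both algorithms reuse the same candidate set, so differences in behavior come only from the sampling criterion.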

    Results. Figures 2, 3 and 4 show that the standard deviation of the error ofprediction is severely underestimated when using the EGO algorithm, as a resultof the maximum likelihood estimation of the parameters of the covariance from


a deceptive set of evaluation points. If the uncertainty about the covariance parameters is taken into account, as explained above, the standard deviation of the error is more satisfactory. Figures 3 and 4 show that the maximum is approximated satisfactorily after only four iterations with FBA, whereas EGO needs nine more iterations before making an evaluation in the neighborhood of the maximizer. Indeed, we observe that EGO stays in the neighborhood of a local optimum for a long time, while the rest of X remains unexplored. This behavior is not desirable in a context of expensive-to-evaluate functions.

    5.2 Comparison on sample paths of a Gaussian process

Experiment. In order to assess the performances of EGO and FBA from a statistical point of view, we study the convergence to the maximum using both algorithms on a set of sample paths of a Gaussian process.

We have built several testbeds T_k, k = 1, 2, …, of functions f_{k,l}, l = 1, …, L, corresponding to sample paths of a Gaussian process, with zero mean and a Matérn covariance function, simulated on a set of q = 600 points in [0, 1]^d generated using Latin hypercube sampling (LHS), with different values for d and for the parameters of the covariance. Here, due to the lack of room, we present only the results obtained for two testbeds in dimension 1 and 4 (the actual parameters are provided in Table 1).

Parameter \ Testbed              T1       T2
Dimension d                      1        4
Number of sample paths L         20000    20000
Variance σ²                      1.0      1.0
Regularity ν                     2.5      2.5
Scale ρ = (ρ_1, …, ρ_d)          0.1      (0.7, 0.7, 0.7, 0.7)

Table 1. Parameters used for building the testbeds of Gaussian-process sample paths.

We shall compare the performance of EGO and FBA based on the approximation error ε(X_n, f_{k,l}), l = 1, …, L. For reference, we also provide the results obtained with two other strategies. The first strategy corresponds to using an EI criterion with the same values for the parameters of the covariance function of ξ as those used to generate the sample paths in the testbeds. In principle this strategy ought to perform very well. The second strategy corresponds to space-filling sampling, which is not necessarily a good optimization strategy.

For FBA, we choose the same priors as those described in Section 5.1. More precisely, whatever the dimension d, we choose an isotropic covariance function (with only one scale parameter) and we set ρ_min = 1/400 and ρ_max = 2√d.

    Results. Figures 5(a) and 6(a) show that EGO and FBA have very similaraverage performances. In fact, both of them perform almost as well, in this


Fig. 2. A comparison of (a) EGO and (b) FBA at iteration 1. Top: objective function (dash-dot line), prediction (solid line), 95% confidence intervals derived from the standard deviation (gray area), sampling points (dots) and position of the next evaluation (vertical dashed line). Bottom: EI criterion (log10 scale). Panel (a): parameters estimated by MLE; panel (b): Bayesian approach for the parameters.

Fig. 3. Iteration 3 (see Figure 2 for details). Panel (a): parameters estimated by MLE; panel (b): Bayesian approach for the parameters.

Fig. 4. Iteration 8 (see Figure 2 for details). Panel (a): parameters estimated by MLE; panel (b): Bayesian approach for the parameters.


experiment, as the reference strategy where the true parameters are assumed to be known. Comparing the tails of the complementary cumulative distribution function of the error max f − M_n makes it clear, however, that using a fully Bayesian approach brings a significant reduction of the occurrence of large errors with respect to the EGO algorithm. In other words, the fully Bayesian approach appears to be statistically more robust than the plug-in approach, while retaining the same average performance.

    References

1. A. Törn and A. Žilinskas. Global Optimization. Springer, Berlin, 1989.

2. J. D. Pintér. Global Optimization. Continuous and Lipschitz Optimization: Algorithms, Implementations and Applications. Springer, 1996.

3. A. Zhigljavsky and A. Žilinskas. Stochastic Global Optimization. Springer, 2007.

4. A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-Free Optimization. SIAM, 2009.

5. Y. Tenne and C. K. Goh. Computational Intelligence in Optimization: Applications and Implementations. Springer, 2010.

6. J. Mockus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. In L. Dixon and G. Szegő, editors, Towards Global Optimization, volume 2, pages 117-129. Elsevier, 1978.

7. J. Mockus. Bayesian Approach to Global Optimization: Theory and Applications. Kluwer Acad. Publ., Dordrecht-Boston-London, 1989.

8. B. Betrò. Bayesian methods in global optimization. Journal of Global Optimization, 1:1-14, 1991.

9. M. Locatelli and F. Schoen. An adaptive stochastic global optimization algorithm for one-dimensional functions. Annals of Operations Research, 58(4):261-278, 1995.

10. A. Auger and O. Teytaud. Continuous lunches are free plus the design of optimal optimization algorithms. Algorithmica, 57(1):121-146, 2008.

11. D. Ginsbourger and R. Le Riche. Towards Gaussian process-based optimization with finite time horizon. In mODa 9, Advances in Model-Oriented Design and Analysis, Contributions to Statistics, pages 89-96. Springer, 2010.

12. S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret bounds for Gaussian process bandit problems. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9 of JMLR W&CP, pages 273-280, 2010.

13. D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995.

14. D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455-492, 1998.

15. A. I. J. Forrester and D. R. Jones. Global optimization of deceptive functions with sparse sampling. In 12th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 10-12 September 2008.


Fig. 5. Average results and error distributions for testbed T1: (a) average error to the maximum, (b) distribution of errors at iteration 13, (c) distribution of errors at iteration 16; FBA (solid black line), EGO (dashed black line), the EI with the parameters used to generate the sample paths (solid gray line), the space-filling strategy (dashed gray line). More precisely, (a) represents the average approximation error max f − M_n as a function of the number of evaluation points. In (b) and (c), F(x) stands for the cumulative distribution function of the approximation error. We plot 1 − F(x) in logarithmic scale in order to analyze the behavior of the tail of the distribution (large errors with small probabilities of occurrence). Small values of 1 − F(x) mean better results.


Fig. 6. Average results and distribution of errors for testbed T2: (a) average error to the maximum, (b) distribution of errors at iteration 20, (c) distribution of errors at iteration 34. See Figure 5 for details.
