8/12/2019 Lion 5 Paper

Robust Gaussian process-based global optimization using a fully Bayesian expected improvement criterion

Romain Benassi, Julien Bect, and Emmanuel Vazquez

SUPELEC, Gif-sur-Yvette, France

Abstract. We consider the problem of optimizing a real-valued continuous function f, which is supposed to be expensive to evaluate and, consequently, can only be evaluated a limited number of times. This article focuses on the Bayesian approach to this problem, which consists in combining evaluation results and prior information about f in order to efficiently select new evaluation points, as long as the budget for evaluations is not exhausted.

The algorithm called efficient global optimization (EGO), proposed by Jones, Schonlau and Welch (J. Global Optim., 13(4):455-492, 1998), is one of the most popular Bayesian optimization algorithms. It is based on a sampling criterion called the expected improvement (EI), which assumes a Gaussian process prior on f. In the EGO algorithm, the parameters of the covariance of the Gaussian process are estimated from the evaluation results by maximum likelihood, and these parameters are then plugged into the EI sampling criterion. However, it is well known that this plug-in strategy can lead to very disappointing results when the evaluation results do not carry enough information about f to estimate the parameters in a satisfactory manner.

We advocate a fully Bayesian approach to this problem, and derive an analytical expression for the EI criterion in the case of Student predictive distributions. Numerical experiments show that the fully Bayesian approach makes EI-based optimization more robust while maintaining an average loss similar to that of the EGO algorithm.

    1 Introduction

Let f be a continuous real-valued function defined on some compact subset X ⊂ R^d. We consider the problem of finding the maximum of f, when f is supposed to be expensive to evaluate because one evaluation takes a long time or a large amount of resources. In this case, the optimization of f must be carried out using a limited number of evaluations. More precisely, given a budget of N evaluations of f, our objective is to choose sequentially N evaluation points X_1, …, X_N ∈ X so that the approximation error ε(X_N, f) = M − M_N is small, where X_N stands for (X_1, …, X_N), M = max_{x ∈ X} f(x) and M_N = max(f(X_1), …, f(X_N)).

In this article, we adopt a Bayesian approach to this sequential decision problem: the unknown function f is considered as a sample path of a real-valued

hal-00607816, version 1 - 11 Jul 2011

Author manuscript, published in "Learning and Intelligent Optimization (LION 5'11), Rome, Italy (2011)". DOI: 10.1007/978-3-642-25566-3_13
http://dx.doi.org/10.1007/978-3-642-25566-3_13
http://hal.archives-ouvertes.fr/
http://hal-supelec.archives-ouvertes.fr/hal-00607816

random process ξ defined on some probability space (Ω, B, P_0) with parameter x ∈ X, and a good strategy is a strategy that achieves, or gets close to, the Bayes risk r_B := inf_{X_N} E_0(ε(X_N, ξ)), where E_0 denotes the expectation with respect to P_0 and the infimum is taken over the set of all sequential strategies. The reader is referred to the books [1-5] for a broader view on the field of global optimization.

It is well known [6-12] that an optimal Bayesian optimization strategy, i.e. a strategy X_N such that E_0(ε(X_N, ξ)) = r_B, can be formally obtained by dynamic programming. Let E_n, n = 1, 2, …, denote the conditional expectation with respect to the σ-algebra F_n generated by the random variables X_1, ξ(X_1), …, X_n, ξ(X_n). Denote by R_N = E_N(ε(X_N, ξ)) the terminal risk and define by backward induction

    R_n = min_{x ∈ X} E_n(R_{n+1} | X_{n+1} = x),   n = N−1, …, 0.   (1)

Then we have R_0 = r_B, and the strategy X_N defined by

    X_{n+1} = argmin_{x ∈ X} E_n(R_{n+1} | X_{n+1} = x),   n = 0, …, N−1,   (2)

is optimal. Unfortunately, solving (1)-(2) over a horizon N of more than a few steps is not numerically tractable, for both the space of possible actions and the space of possible outcomes at each step are continuous.

A natural way of dealing with this problem is to consider a suboptimal one-step lookahead strategy; see, e.g., [13, chapter 6]. This leads to choosing each new evaluation point according to

    X_{n+1} = argmin_{x ∈ X} E_n(M − M_{n+1} | X_{n+1} = x)
            = argmax_{x ∈ X} E_n(M_{n+1} | X_{n+1} = x)
            = argmax_{x ∈ X} ρ_n(x) := E_n((ξ(X_{n+1}) − M_n)_+ | X_{n+1} = x),   (3)

where (z)_+ = max(0, z). The sampling criterion ρ_n, introduced by J. Mockus [6] and popularized through the EGO algorithm [14], is known as the expected improvement (EI).

When ξ is a Gaussian process, or in other words, when a Gaussian process prior is chosen for f, it is well known that the EI can be written in closed form, with the consequence that the maximization of ρ_n can be carried out with a moderate computational effort. However, a Gaussian process prior carries a high amount of information about f, and it is often difficult to elicit such a prior before any evaluation is made. As a result, the covariance function of ξ is usually assumed to belong to some parametric class of positive definite functions, the value of the parameters being assumed unknown. In the EGO algorithm, the parameters are estimated from the evaluation results by maximum likelihood, and then plugged into the EI sampling criterion (computed for a Gaussian process with known covariance function). It has been reported [15] that this plug-in


strategy can lead to very disappointing results when the evaluation results do not carry enough information about f to estimate the parameters satisfactorily.

We advocate a fully Bayesian approach to this problem, following the steps of Locatelli [9, 16] and, more recently, Osborne and co-authors [17-19].

The paper is organized as follows. Section 2 recalls the expression of the EI criterion in the case of a Gaussian process prior with known covariance function, and describes the plug-in approach used in the EGO algorithm to handle the parameters of the covariance function when it is only assumed to belong to some parametric class. Section 3 explains how a fully Bayesian approach can be adopted in this problem, in order to take into account the uncertainty on the parameters of the covariance function. Section 4 presents a new closed-form expression of the EI criterion for Student predictive densities, which arises naturally when a conjugate inverse-gamma prior is used for the variance parameter of the Gaussian process prior. Section 5 illustrates with numerical results the benefits of the fully Bayesian approach, focusing more particularly on the tail of the error distribution, i.e., on the occurrence of large errors.

Nota bene. The analytical expression of the expected improvement for Student predictive distributions, presented in Section 4, has in fact already been obtained by Williams, Santner and Notz [20] in the special case of an improper Jeffreys prior on the variance. We warmly thank Frank Hutter for pointing out this paper to us during the LION 5 conference.

    2 Efficient global optimization

2.1 The expected improvement sampling criterion for a Gaussian process

Recall that the distribution of a Gaussian process ξ is uniquely determined by its mean function m(x) := E_0(ξ(x)), x ∈ X, and its covariance function k(x, y) := E_0((ξ(x) − m(x))(ξ(y) − m(y))), x, y ∈ X. Hereafter, we assume that the mean function is constant on X, and write ξ ∼ GP(m, k) to denote that ξ is a Gaussian process with mean function m(x) = m ∈ R and covariance function k.

Proposition 1. Let k be a stationary covariance function written as k(x, y) = σ² r(x − y), x, y ∈ X, where σ² > 0 and r(0) = 1 (hence, r is a correlation function). Assume that ξ | m ∼ GP(m, k) and m ∼ U(R), where U(R) denotes the (improper) uniform distribution over R. Then, for all x ∈ X,

    ξ(x) | F_n ∼ N(μ_n(x), s_n²(x)),

where

    μ_n(x) = m̂_n + r_n(x)^T R_n^{−1} (ξ_n − m̂_n 1_n),   (4)

with

    ξ_n = (ξ(X_1), …, ξ(X_n))^T,
    1_n = (1, …, 1)^T ∈ R^n,
    R_n the correlation matrix of ξ_n,
    r_n(x) the correlation vector between ξ(x) and ξ_n,
    m̂_n = (1_n^T R_n^{−1} ξ_n) / (1_n^T R_n^{−1} 1_n), the weighted least squares estimate of m,

and

    s_n²(x) = σ² γ_n²(x),   (5)

with

    γ_n²(x) = 1 − r_n(x)^T R_n^{−1} r_n(x) + (1 − r_n(x)^T R_n^{−1} 1_n)² / (1_n^T R_n^{−1} 1_n).   (6)
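Equations (4)-(6) translate directly into a few lines of linear algebra. The following is a minimal sketch in Python; the function name `krige` and the use of an explicit matrix inverse are our own choices (in practice one would factor R_n once and reuse triangular solves):

```python
import numpy as np

def krige(xi_n, R_n, r_nx):
    """Posterior mean (4) and normalized variance (6) of Proposition 1.

    xi_n : evaluation results, shape (n,)
    R_n  : correlation matrix of xi_n, shape (n, n)
    r_nx : correlations between xi(x) and xi_n, shape (n,)
    """
    n = len(xi_n)
    ones = np.ones(n)
    Ri = np.linalg.inv(R_n)  # explicit inverse; fine for small n
    m_hat = (ones @ Ri @ xi_n) / (ones @ Ri @ ones)        # WLS estimate of m
    mu = m_hat + r_nx @ Ri @ (xi_n - m_hat * ones)         # eq. (4)
    gamma2 = (1.0 - r_nx @ Ri @ r_nx
              + (1.0 - r_nx @ Ri @ ones) ** 2 / (ones @ Ri @ ones))  # eq. (6)
    return mu, gamma2
```

At an evaluated point, r_n(x) is a row of R_n and γ_n²(x) = 0 (the predictor interpolates); multiplying γ_n²(x) by σ² gives s_n²(x) as in (5).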

Proposition 2. Under the assumptions of Proposition 1, the expected improvement can be written as

    ρ_n(x) = s_n(x) φ((μ_n(x) − M_n)/s_n(x)) + (μ_n(x) − M_n) Φ((μ_n(x) − M_n)/s_n(x))   if s_n(x) > 0,
    ρ_n(x) = (μ_n(x) − M_n)_+                                                            if s_n(x) = 0,   (7)

where Φ denotes the Gaussian cumulative distribution function and φ its density.

Propositions 1 and 2 show that, given a set of evaluation points and a Gaussian prior, the EI sampling criterion can be computed with a moderate amount of resources (computing (4) at q different points in X involves O(qn²) operations). However, it is rare that a user has enough information about f to choose an adequate covariance function k before any evaluation is made. The approach generally taken consists in choosing k in a parametrized class of covariance functions and estimating the parameters of k from the evaluation results.
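As a concrete illustration of (7), the Gaussian EI can be evaluated in vectorized form over many candidate points at once. A sketch (the function name and array conventions are ours):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu_n, s_n, M_n):
    """Gaussian EI of eq. (7): s*phi(u) + (mu - M)*Phi(u), u = (mu - M)/s."""
    mu_n = np.asarray(mu_n, dtype=float)
    s_n = np.asarray(s_n, dtype=float)
    ei = np.maximum(mu_n - M_n, 0.0)           # s_n(x) = 0 branch of (7)
    pos = s_n > 0
    u = (mu_n[pos] - M_n) / s_n[pos]
    ei[pos] = s_n[pos] * norm.pdf(u) + (mu_n[pos] - M_n) * norm.cdf(u)
    return ei
```

Note that the EI is always nonnegative, and vanishes at already-evaluated points (where s_n(x) = 0 and μ_n(x) ≤ M_n).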

    2.2 Classical parametrized covariance functions

There are chiefly three classes of parametrized covariance functions in the literature of Gaussian processes for modeling computer experiments: the class of the so-called Gaussian covariances, the class of the exponential covariances, and that of the Matérn covariances. Using Matérn covariances makes it possible to tune the mean square differentiability of ξ, which is not the case with the exponential and Gaussian covariances.

Define κ_ν : R_+ → R_+ such that, for all h ≥ 0,

    κ_ν(h) = (1 / (2^{ν−1} Γ(ν))) (2 ν^{1/2} h)^ν K_ν(2 ν^{1/2} h),   (8)

where Γ is the Gamma function and K_ν is the modified Bessel function of the second kind of order ν. The parameter ν > 0 controls the regularity at the origin of κ_ν.

The anisotropic form of the Matérn covariance on R^d may be written as k_θ(x, y) = σ² r_θ(x, y), with

    r_θ(x, y) = κ_ν( ( Σ_{i=1}^d (x[i] − y[i])² / ρ_i² )^{1/2} ),   x, y ∈ R^d,   (9)

where the positive scalar σ² is a variance parameter (we have k_θ(x, x) = σ²), x[i], y[i] denote the i-th coordinates of x and y, the positive scalars ρ_i represent scale or range parameters of the covariance, or in other words, characteristic correlation lengths, and finally θ = (ν, ρ_1, …, ρ_d) ∈ R_+^{d+1} denotes the parameter vector of the Matérn covariance. Note that an isotropic form of the Matérn covariance is obtained by setting ρ_1 = … = ρ_d = ρ. Then, the parameter vector of the Matérn covariance is θ = (ν, ρ) ∈ R_+².
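For reference, (8)-(9) can be implemented directly with SciPy's Bessel function. A sketch under the parametrization above (the function names are ours):

```python
import numpy as np
from scipy.special import gamma, kv

def kappa(h, nu):
    """Matern correlation kappa_nu(h) of eq. (8), with kappa_nu(0) = 1."""
    h = np.asarray(h, dtype=float)
    out = np.ones_like(h)
    pos = h > 0
    z = 2.0 * np.sqrt(nu) * h[pos]
    out[pos] = z ** nu * kv(nu, z) / (2.0 ** (nu - 1.0) * gamma(nu))
    return out

def matern_correlation(x, y, nu, rho):
    """Anisotropic Matern correlation r_theta(x, y) of eq. (9)."""
    x, y, rho = (np.asarray(a, dtype=float) for a in (x, y, rho))
    h = np.sqrt(np.sum((x - y) ** 2 / rho ** 2))
    return float(kappa(np.array([h]), nu)[0])
```

With this parametrization, ν = 1/2 recovers the exponential correlation κ(h) = exp(−√2 h).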

    2.3 The EGO algorithm

The approach taken in the EGO (efficient global optimization) algorithm [14, 21-23] consists in estimating the unknown parameters of the covariance function by maximum likelihood, after each new evaluation. Then, the EI sampling criterion is computed using the current value of the parameters of the covariance. EGO can therefore be viewed as a plug-in approach.

Remark 1 (about maximum likelihood estimation of the parameters of a covariance function of a Gaussian process). Recall that, for ξ ∼ GP(m, k) with k(x, y) = σ² r(x, y; θ), the likelihood of the evaluation results can be written as

    ℓ_n(ξ_n; m, σ², θ) = (2πσ²)^{−n/2} |R_n(θ)|^{−1/2} exp( −(1/(2σ²)) (ξ_n − m 1_n)^T R_n(θ)^{−1} (ξ_n − m 1_n) ),   (10)

where R_n(θ) stands for the correlation matrix of ξ_n, parametrized by θ. Note that setting to zero the partial derivatives of ℓ_n with respect to m and σ² yields the following maximum likelihood estimates for m and σ²:

    m̂(θ) = (1_n^T R_n(θ)^{−1} ξ_n) / (1_n^T R_n(θ)^{−1} 1_n),   (11)

    σ̂²(θ) = (1/n) (ξ_n − m̂ 1_n)^T R_n(θ)^{−1} (ξ_n − m̂ 1_n).   (12)

Thus the maximum likelihood estimate of θ can be obtained by maximizing the profile likelihood ℓ_n(ξ_n; m̂(θ), σ̂²(θ), θ).
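The closed-form estimates (11)-(12) are cheap to compute once R_n(θ) is available. A sketch (the function name is ours; we use linear solves rather than an explicit inverse):

```python
import numpy as np

def ml_estimates(xi_n, R_n):
    """Maximum likelihood estimates (11)-(12) of m and sigma^2, given R_n(theta)."""
    xi_n = np.asarray(xi_n, dtype=float)
    n = len(xi_n)
    ones = np.ones(n)
    Ri_xi = np.linalg.solve(R_n, xi_n)
    Ri_1 = np.linalg.solve(R_n, ones)
    m_hat = (ones @ Ri_xi) / (ones @ Ri_1)                  # eq. (11)
    resid = xi_n - m_hat * ones
    s2_hat = (resid @ np.linalg.solve(R_n, resid)) / n      # eq. (12)
    return m_hat, s2_hat
```

Substituting these into (10) yields the profile likelihood, which is then maximized over θ, for instance on a grid or with a generic optimizer.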


    2.4 The case of deceptive functions

"Deceptive functions" is a term coined by D. Jones (see [15, 25]) to describe functions that appear to be flat based on evaluation results. In fact, any function can potentially appear to be flat depending on how it is sampled.

When the available evaluation results do not bring enough information on the objective function f to estimate the parameters of the covariance function with a reasonable precision, the variance of the error of prediction can be severely under-estimated, as depicted in Figure 1. As will be shown in Section 5.1, this can lead to very unsatisfactory behavior of the EGO algorithm, which tends to waste lots of evaluations in local search around the current maxima (exploitation), very early in the optimization procedure, to the detriment of global search (exploration).

Fig. 1. Example of a deceptive sampling of a function (dash-dot line). Evaluation points (black dots) are chosen such that the value of the function is around zero at these points. After having estimated the parameters of the covariance function by maximum likelihood, the prediction is very flat (solid line) and the confidence intervals derived from the standard deviation of the error of prediction (gray area) are severely underestimated.

    3 Fully Bayesian one-step lookahead optimization

It has been emphasized in Section 1 that the rationale behind the EI criterion is of a Bayesian decision-theoretic nature. Indeed, maximizing the EI criterion at iteration n is equivalent to minimizing the expected loss E_n(max(ξ) − M_{n+1}), where the expectation is taken with respect to the value of the next evaluation, which is unknown and therefore modeled as a random variable.

In a fully Bayesian setting, all the unknown parameters of the model have to be given prior distributions. This has already been done for the unknown


mean m in Proposition 1. Let π_0 denote the prior distribution of the vector θ of covariance parameters (which includes the variance σ²), and let π_n, n = 1, …, N, denote the corresponding posterior distributions. According to Bayes' rule, the posterior distribution of ξ(x) is a mixture of Gaussian distributions N(μ_n(x; θ), s_n²(x; θ)) weighted by π_n(dθ). The expected improvement criterion for this model can thus be written, using the tower property of conditional expectations, as

    E_n((ξ(x) − M_n)_+) = E_n( E_n((ξ(x) − M_n)_+ | θ) ) = ∫ ρ_n(x; θ) π_n(dθ).   (13)

Note that the plug-in EI criterion of Section 2.3 can be seen as an approximation of the fully Bayesian criterion (13):

    ∫ ρ_n(x; θ) π_n(dθ) ≈ ρ_n(x; θ̂_n),

which is justified only if the posterior distribution is concentrated enough around the MLE estimate θ̂_n. In the general case, we claim that it is safer to use the fully Bayesian criterion (13), since the corresponding expected loss integrates the uncertainty related to the fact that θ is not exactly known. This claim will be supported by the numerical results of Section 5.

When π_0 is a finitely supported discrete distribution, the posterior distribution π_n, and therefore the integral (13), can be computed exactly using Bayes' rule. For more general prior distributions, the integral can be approximated by stochastic techniques like MCMC sampling or SMC sampling (see [26-28] and the references therein). An alternative approach using Bayesian quadrature rules [29] has been proposed in [17-19]. In all cases, the EI criterion is approximated by an expression of the form Σ_i w_i ρ_n(x; θ_i), which amounts to saying that π_n is approximated by the discrete distribution Σ_i w_i δ_{θ_i}.
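With a finitely supported prior, (13) is simply a posterior-weighted average of plug-in EI values. A sketch of the Bayes-rule weighting (the function name and the log-scale convention are ours):

```python
import numpy as np

def fully_bayesian_ei(ei_per_theta, log_lik, log_prior):
    """Approximate (13) as sum_i w_i * rho_n(x; theta_i), with posterior
    weights w_i obtained by Bayes' rule from the prior and the likelihood.

    ei_per_theta : plug-in EI values rho_n(x; theta_i), shape (I,) or (I, q)
    log_lik      : log-likelihood of the n evaluations for each theta_i, shape (I,)
    log_prior    : log prior weights log pi_0(theta_i), shape (I,)
    """
    log_w = np.asarray(log_prior, dtype=float) + np.asarray(log_lik, dtype=float)
    log_w -= log_w.max()            # shift for numerical stability
    w = np.exp(log_w)
    w /= w.sum()                    # posterior weights pi_n(theta_i)
    return w @ np.asarray(ei_per_theta, dtype=float)
```

When the likelihood strongly favors a single θ_i, the criterion reduces to the plug-in EI at that value, which is consistent with the approximation above.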

Remark 2. Although fully Bayesian approaches for Gaussian process models have been proposed in the literature for more than two decades (see [30, 31] and the references therein), surprisingly little has been written from this perspective in the context of Bayesian global optimization. An early attempt in this direction can be found in [9, 16], where the variance parameter of a Brownian motion is given an inverse-gamma prior and then integrated out as in (13). More recently, the fully Bayesian approach has been developed in a more general way by [17-19], but the important connection of (13) with the usual (Gaussian) EI criterion was not clearly established.

Remark 3. Discrete mixtures of Gaussian distributions and the corresponding EI criterion have also been introduced in [32] to allow for the use of several parametric classes of covariance functions, in order to provide increased robustness with respect to the choice of a particular class. The approach is not Bayesian, however, since the weights in the mixture are not posterior probabilities.


    4 Student EI

Let us consider the case of a Gaussian process ξ with unknown mean m and covariance function of the form k(x, y) = σ² r(x, y). We assume that m and σ² are independent, with m uniformly distributed on R (as in Proposition 1) and σ² following an inverse-gamma distribution with shape parameter a_0 and scale parameter b_0, hereafter denoted by IG(a_0, b_0). We shall prove that, in this setting, the EI criterion still has an explicit analytical expression, which is a generalization of the usual EI criterion given in Proposition 2.

First, recall that the prior chosen for σ² is conjugate [33]:

Proposition 3. The conditional distribution of σ² given F_n is IG(a_n, b_n), with

    a_n = a_0 + (n − 1)/2,
    b_n = b_0 + (1/2) (ξ_n − m̂_n 1_n)^T R_n^{−1} (ξ_n − m̂_n 1_n).

Using this result and the fact that ξ(x) − μ_n(x) | σ², F_n ∼ N(0, σ² γ_n²(x)), it is easy to show that the predictive distribution of ξ(x) is a Student distribution. More precisely:

Proposition 4. Let t_ν denote the Student distribution with ν > 0 degrees of freedom. Then, for all x ∈ X,

    (ξ(x) − μ_n(x)) / σ_n(x) | F_n ∼ t_{ν_n},

with ν_n = 2 a_n and σ_n²(x) = (b_n / a_n) γ_n²(x).

In other words, the predictive distribution at x is a location-scale Student distribution with ν_n degrees of freedom, location parameter μ_n(x) and scale parameter σ_n(x). The following result is the key to our EI criterion for Student predictive distributions:

Lemma 1. Let T ∼ t_ν with ν > 0. Then

    E((T + u)_+) = +∞                                          if ν ≤ 1,
    E((T + u)_+) = ((ν + u²)/(ν − 1)) F_ν′(u) + u F_ν(u)       otherwise,

where F_ν is the cumulative distribution function of t_ν (and F_ν′ its density).

Combining Lemma 1 and Proposition 4 finally yields an explicit expression of the EI criterion:

Theorem 1. Under the assumptions of this section, for all x ∈ X,

    E_n((ξ(x) − M_n)_+) = σ_n(x) [ ((ν_n + u²)/(ν_n − 1)) F_{ν_n}′(u) + u F_{ν_n}(u) ],   (14)

with u = (μ_n(x) − M_n)/σ_n(x).
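Formula (14) is straightforward to evaluate with SciPy's Student distribution. A sketch (the function name is ours; valid for ν_n > 1, per Lemma 1):

```python
import numpy as np
from scipy.stats import t as student_t

def student_ei(mu_n, sigma_n, M_n, nu_n):
    """Student EI of eq. (14); finite only for nu_n > 1 (Lemma 1)."""
    mu_n = np.asarray(mu_n, dtype=float)
    sigma_n = np.asarray(sigma_n, dtype=float)
    u = (mu_n - M_n) / sigma_n
    return sigma_n * ((nu_n + u ** 2) / (nu_n - 1.0) * student_t.pdf(u, df=nu_n)
                      + u * student_t.cdf(u, df=nu_n))
```

As ν_n → ∞, the Student density and CDF converge to their Gaussian counterparts and the expression recovers the Gaussian EI of Proposition 2.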


It has been assumed, up to this point, that the only unknown parameter in the covariance function is the variance σ². More generally, assume that k(x, y) = σ² r(x, y; θ): in this case we proceed by conditioning as in Section 3. Indeed, assume that θ is independent from (m, σ²), with a prior distribution π_0. Let us denote by ρ_n(x; θ) = E_n((ξ(x) − M_n)_+ | θ) the value of the EI criterion at x provided by Theorem 1 when the value of the unknown parameter is θ. Then

    E_n((ξ(x) − M_n)_+) = E_n(ρ_n(x; θ)) = ∫ ρ_n(x; θ) π_n(dθ),   (15)

where π_n denotes the posterior distribution of θ after n evaluations. As explained in Section 3, the integral (15) boils down to a finite sum that can be computed exactly (using Bayes' rule) when the prior π_0 has a finite support; in the general case, approximation techniques have to be used.

    5 Numerical experiments

    5.1 Optimization of a deceptive function

Experiment. Consider the objective function f : X = [−1, 1] → R defined by

    f(x) = x (sin(10x + 1) + 0.1 sin(15x)),   x ∈ X.

We choose an initial set of four evaluation points with abscissas −0.43, −0.11, 0.515 and 0.85, as shown in Figure 1. Our objective is to compare the evaluation points chosen by the plug-in approach (i.e., the EGO algorithm) and those chosen by the fully Bayesian algorithm (FBA) proposed in Section 4.

In both approaches, we consider a Matérn covariance function with a known regularity parameter ν = 2 (see Section 2.2). In the approach of Section 4, we choose an inverse-gamma distribution IG(0.2, 12) for σ². Since X has dimension one, there is only one range parameter ρ. To simplify the implementation of the proposed approach, we shall assume that ρ has a finite-support distribution. More precisely, define ρ_min and ρ_max, such that ρ_min < ρ_max, and set, for all i = 0, …, I, ρ_i = ρ_min (ρ_max / ρ_min)^{i/I}. We assume a uniform prior distribution over the ρ_i's, with ρ_min = 2 × 10⁻³, ρ_max = 2 and I = 100.

The optimization of the two sampling criteria is performed by a Monte Carlo approach. More precisely, we generate once and for all a set of q = 600 candidate points uniformly distributed over X, and the search for the maximum of each sampling criterion is carried out at each iteration by evaluating the sampling criterion over this finite set (the same set of points is used for both criteria).
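The Monte Carlo maximization described above amounts to a simple argmax over a fixed candidate set. A sketch of this step (the names and the use of f itself as a test criterion are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.uniform(-1.0, 1.0, size=600)   # q = 600 points, drawn once and for all

def f(x):
    """Objective function of Section 5.1."""
    return x * (np.sin(10.0 * x + 1.0) + 0.1 * np.sin(15.0 * x))

def next_point(criterion):
    """Return the candidate maximizing the sampling criterion over the fixed set."""
    vals = criterion(candidates)
    return candidates[int(np.argmax(vals))]
```

At each iteration, `criterion` would be the EI (plug-in or fully Bayesian) computed from the current evaluations; both algorithms reuse the same candidate set, so differences in behavior come only from the sampling criterion.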

    Results. Figures 2, 3 and 4 show that the standard deviation of the error ofprediction is severely underestimated when using the EGO algorithm, as a resultof the maximum likelihood estimation of the parameters of the covariance from


a deceptive set of evaluation points. If the uncertainty about the covariance parameters is taken into account, as explained above, the standard deviation of the error is more satisfactory. Figures 3 and 4 show that the maximum is approximated satisfactorily after only four iterations with FBA, whereas EGO needs nine more iterations before making an evaluation in the neighborhood of the maximizer. Indeed, we observe that EGO stays in the neighborhood of a local optimum for a long time, while the rest of X remains unexplored. This behavior is not desirable in a context of expensive-to-evaluate functions.

    5.2 Comparison on sample paths of a Gaussian process

Experiment. In order to assess the performances of EGO and FBA from a statistical point of view, we study the convergence to the maximum using both algorithms on a set of sample paths of a Gaussian process.

We have built several testbeds T_k, k = 1, 2, …, of functions f_{k,l}, l = 1, …, L, corresponding to sample paths of a Gaussian process, with zero mean and a Matérn covariance function, simulated on a set of q = 600 points in [0, 1]^d generated using Latin hypercube sampling (LHS), with different values for d and for the parameters of the covariance. Here, due to the lack of room, we present only the results obtained for two testbeds in dimension 1 and 4 (the actual parameters are provided in Table 1).

Parameter \ Testbed              T1       T2
Dimension d                      1        4
Number of sample paths L         20000    20000
Variance σ²                      1.0      1.0
Regularity ν                     2.5      2.5
Scale ρ = (ρ_1, …, ρ_d)          0.1      (0.7, 0.7, 0.7, 0.7)

Table 1. Parameters used for building the testbeds of Gaussian-process sample paths.

We shall compare the performance of EGO and FBA based on the approximation error ε(X_n, f_{k,l}), l = 1, …, L. For reference, we also provide the results obtained with two other strategies. The first strategy corresponds to using an EI criterion with the same values for the parameters of the covariance function of ξ as those used to generate the sample paths in the testbeds. In principle this strategy ought to perform very well. The second strategy corresponds to space-filling sampling, which is not necessarily a good optimization strategy.

For FBA, we choose the same priors as those described in Section 5.1. More precisely, whatever the dimension d, we choose an isotropic covariance function (with only one scale parameter) and we set ρ_min = 1/400 and ρ_max = 2√d.

    Results. Figures 5(a) and 6(a) show that EGO and FBA have very similaraverage performances. In fact, both of them perform almost as well, in this


Fig. 2. A comparison of (a) EGO and (b) FBA at iteration 1. Top: objective function (dash-dot line), prediction (solid line), 95% confidence intervals derived from the standard deviation (gray area), sampling points (dots) and position of the next evaluation (vertical dashed line). Bottom: EI criterion (log10 scale). Panel (a): parameters estimated by MLE; panel (b): Bayesian approach for the parameters.

Fig. 3. Iteration 3 (see Figure 2 for details). Panel (a): parameters estimated by MLE; panel (b): Bayesian approach for the parameters.

Fig. 4. Iteration 8 (see Figure 2 for details). Panel (a): parameters estimated by MLE; panel (b): Bayesian approach for the parameters.


experiment, as the reference strategy where the true parameters are assumed to be known. Comparing the tails of the complementary cumulative distribution function of the error max f − M_n makes it clear, however, that using a fully Bayesian approach brings a significant reduction of the occurrence of large errors with respect to the EGO algorithm. In other words, the fully Bayesian approach appears to be statistically more robust than the plug-in approach, while retaining the same average performance.

    References

1. A. Törn and A. Žilinskas. Global Optimization. Springer, Berlin, 1989.

2. J. D. Pintér. Global Optimization. Continuous and Lipschitz Optimization: Algorithms, Implementations and Applications. Springer, 1996.

3. A. Zhigljavsky and A. Žilinskas. Stochastic Global Optimization. Springer, 2007.

4. A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-Free Optimization. SIAM, 2009.

5. Y. Tenne and C. K. Goh. Computational Intelligence in Optimization: Applications and Implementations. Springer, 2010.

6. J. Mockus, V. Tiesis, and A. Žilinskas. The application of Bayesian methods for seeking the extremum. In L. Dixon and G. Szegő, editors, Towards Global Optimization, volume 2, pages 117-129. Elsevier, 1978.

7. J. Mockus. Bayesian Approach to Global Optimization: Theory and Applications. Kluwer Acad. Publ., Dordrecht-Boston-London, 1989.

8. B. Betrò. Bayesian methods in global optimization. Journal of Global Optimization, 1:1-14, 1991.

9. M. Locatelli and F. Schoen. An adaptive stochastic global optimization algorithm for one-dimensional functions. Annals of Operations Research, 58(4):261-278, 1995.

10. A. Auger and O. Teytaud. Continuous lunches are free plus the design of optimal optimization algorithms. Algorithmica, 57(1):121-146, 2008.

11. D. Ginsbourger and R. Le Riche. Towards Gaussian process-based optimization with finite time horizon. In mODa 9, Advances in Model-Oriented Design and Analysis, Contributions to Statistics, pages 89-96. Springer, 2010.

12. S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret bounds for Gaussian process bandit problems. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9 of JMLR W&CP, pages 273-280, 2010.

13. D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995.

14. D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455-492, 1998.

15. A. I. J. Forrester and D. R. Jones. Global optimization of deceptive functions with sparse sampling. In 12th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 10-12 September 2008.


Fig. 5. Average results and error distributions for testbed T1: (a) average error to the maximum, (b) distribution of errors at iteration 13, (c) distribution of errors at iteration 16; FBA (solid black line), EGO (dashed black line), the EI with the parameters used to generate the sample paths (solid gray line), the space-filling strategy (dashed gray line). More precisely, (a) represents the average approximation error max f − M_n as a function of the number of evaluation points. In (b) and (c), F(x) stands for the cumulative distribution function of the approximation error. We plot 1 − F(x) in logarithmic scale in order to analyze the behavior of the tail of the distribution (large errors with small probabilities of occurrence). Small values of 1 − F(x) mean better results.


Fig. 6. Average results and distribution of errors for testbed T2: (a) average error to the maximum, (b) distribution of errors at iteration 20, (c) distribution of errors at iteration 34. See Figure 5 for details.
