Approximation models in optimization functions

Alan Díaz Manríquez

    Abstract

Nowadays, most real-world problems require the optimization of one or more objectives, and different techniques have been used to solve such problems. Evolutionary algorithms have shown flexibility, adaptability and good performance in this class of problems. Their main disadvantage is that they require too many evaluations of the fitness function to achieve acceptable results. Therefore, when the fitness function is computationally expensive (e.g., simulations of engineering problems) the optimization process becomes prohibitive. To reduce the cost, surrogate models, also known as metamodels, are constructed and used instead of the real fitness function. There is a wide variety of work on this class of models; however, most existing work does not justify the choice of the metamodel. There are studies in the literature that compare different techniques for creating metamodels, but most of them compare only two techniques, or take only the accuracy of the metamodel as the point of comparison. A main problem of surrogate models is scalability: most of them perform well for a relatively low number of variables but poorly in high dimensionalities. In this work, we compare four metamodeling techniques (polynomial approximation models, Kriging, radial basis functions, support vector regression), taking different aspects into account to measure their performance on six scalable optimization problems that represent different classes of problems. The objective of this study is to investigate the advantages and disadvantages of the metamodeling techniques on different test problems, measuring performance based on multiple aspects, including scalability.

    1 Introduction

In recent years, Evolutionary Algorithms (EAs) have been applied with great success to complex optimization problems. The main advantage of EAs lies in their ability to locate solutions close to the global optimum. However, for many real-world problems, the number of calls to the objective function needed to locate a near-optimal solution may be too high. In many science and engineering problems, researchers make heavy use of computer simulation codes in order to replace expensive physical experiments and improve the quality and performance of engineered products and devices. For example, Computational Fluid Dynamics (CFD), Computational Electro Magnetics (CEM) and Computational Structural Mechanics (CSM) solvers have been shown to be very accurate. Such simulations are often very expensive computationally; a single simulation can take several minutes, hours, days or even weeks. Hence, in many real-world optimization problems, the number of objective function evaluations needed to obtain a good solution dominates the optimization cost; that is, the optimization process is taken up by runs of the computationally expensive analysis codes.

In order to obtain efficient optimization algorithms, it is crucial to use the prior information gained during the optimization process. Conceptually, a natural approach to utilizing this prior information is building a model of the fitness function to assist in the selection of candidate solutions for evaluation. A variety of techniques for constructing such a model, often also referred to as a surrogate, metamodel or approximation model, have been considered for computationally expensive optimization problems.

There is a variety of work on such problems; however, most existing work does not justify the choice of the metamodel or does not compare different surrogate models. In this work, we focus on a study of surrogate models, evaluating different aspects, in order to correctly choose a metamodel depending on the interests of the algorithm.

The remainder of this work is organized as follows. We begin with a brief overview of surrogate modeling techniques commonly used in the literature. Section 3 presents an overview of the state of the art of evolutionary algorithms with surrogate models. Section 4 presents the proposed methodology for the comparison


of the metamodeling techniques. Experimental results obtained on synthetic test problems are presented in Section 5, together with the discussion of results. Finally, Section 6 summarizes our main conclusions.

    2 Background

A surrogate model is a mathematical model that mimics the behavior of a computationally expensive simulation code over the complete parameter space as accurately as possible, using as few data points as possible. There is a variety of techniques for creating metamodels: rational functions, radial basis functions, artificial neural networks, Kriging models, support vector machines, splines and polynomial approximation models. The following are the most common approaches to constructing approximate models based on learning and interpolation from known fitness values of a small population, also known as metamodeling techniques.

    2.1 Polynomial approximation models

The response surface methodology (RSM) approximation is one of the most well-established metamodeling techniques. This methodology employs the statistical techniques of regression analysis and analysis of variance in order to obtain minimum variances of the responses.

For most response surfaces, the functions used for approximation are polynomials because of their simplicity, although other types of functions are, of course, possible.

In general, a polynomial in the coded inputs x1, x2, ..., xk is a function which is a linear aggregate (or combination) of powers and products of the x's. A term in the polynomial is said to be of order j (or degree j) if it contains the product of j of the x's. A polynomial is said to be of order d, or degree d, if the term(s) of highest order in it is (are) of order or degree d.

The response surface for d = 2 and k = 2, where x1 and x2 denote two coded inputs, is described as follows:

y^(p) = β0 + (β1 x1^(p) + β2 x2^(p)) + (β11 x1^(p) x1^(p) + β22 x2^(p) x2^(p) + β12 x1^(p) x2^(p))   (1)

where y^(p) is the response to x1^(p) and x2^(p). In expression (1) the coefficients β are (empirical) parameters which, in practice, have to be estimated from the data.

The polynomial model is written in matrix notation as:

y^(p) = β^T x^(p)   (2)

where β is the vector of coefficients to be estimated, and x^(p) is the vector corresponding to the form of the x1^(p) and x2^(p) terms in the polynomial model.

As seen from Table 1, the number of parameters increases rapidly as the number, k, of the input variables and the degree, d, of the polynomial are both increased.

    Table 1: Number of coefficients in Polynomials of Degree d involving k inputs

                         Degree of Polynomial, d
Number of inputs, k   1 (Planar)   2 (Quadratic)   3 (Cubic)   4 (Quartic)
        2                 3             6             10           15
        3                 4            10             20           35
        4                 5            15             35           70
        5                 6            21             56          126

To estimate the unknown coefficients of the polynomial model, both the least squares method (LSM) and the gradient method can be used, but either of them requires at least as many samples of the real objective function as there are coefficients in order to obtain good results.
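As an illustration of this estimation step, the following minimal sketch fits a second-degree polynomial response surface by ordinary least squares with NumPy. It is not the SURROGATES Toolbox code used in this work, and the names quadratic_basis, fit_prs and predict_prs are hypothetical.

```python
import numpy as np

def quadratic_basis(X):
    """Expand samples into the degree-2 basis [1, x1, ..., xk, x1*x1, x1*x2, ..., xk*xk]."""
    n, k = X.shape
    cols = [np.ones(n)]
    cols += [X[:, i] for i in range(k)]
    cols += [X[:, i] * X[:, j] for i in range(k) for j in range(i, k)]
    return np.column_stack(cols)

def fit_prs(X, y):
    """Estimate the coefficient vector beta of Eq. (2) by ordinary least squares."""
    beta, *_ = np.linalg.lstsq(quadratic_basis(X), y, rcond=None)
    return beta

def predict_prs(beta, X):
    return quadratic_basis(X) @ beta

# Toy usage: 100 noise-free samples of the 2-D Sphere function
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(100, 2))
beta = fit_prs(X, (X ** 2).sum(axis=1))
print(predict_prs(beta, np.array([[1.0, 1.0]])))  # close to 2
```

Note that for k = 2 and d = 2 the basis has 6 columns, matching the count in Table 1.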

The PRS can be constructed by a full regression or by a stepwise regression. The basic procedure for stepwise regression involves (1) identifying an initial model, (2) iteratively stepping, that is, repeatedly


altering the model of the previous step by adding or removing a predictor variable in accordance with the stepping criteria, and (3) terminating the search when stepping is no longer possible given the stepping criteria, or when a specified maximum number of steps has been reached.

The principal advantage of PRS is that the fitness of the approximated response surface can be evaluated using powerful statistical tools, and the minimum variances of the response surfaces can be obtained using design of experiments with a small number of experiments.

In practice, we can often proceed by supposing that, over limited regions of the factor space, a polynomial of only first or second degree might adequately represent the true function. Higher-order polynomials can be used; however, instabilities may arise [1], or it may be too difficult to take enough sample data to estimate all of the coefficients in the polynomial equation, particularly in large dimensions. In this work, second-degree PRS models are considered.

In this work, the PRS code used is the one reported in [23].

    2.2 Kriging

Kriging is a spatial prediction method based on minimizing the mean squared error. It belongs to the group of geostatistical methods and describes the spatial and temporal correlation between the values of an attribute. It is named in honor of D. G. Krige, a South African engineer who developed an empirical method to determine the distribution of gold deposits based on samples of them.

The DACE model (Design and Analysis of Computer Experiments) is a parametric regression model developed by Sacks et al. [20] using the Kriging approach. Because Kriging has often been used in spaces of only two or three dimensions in geostatistical settings, there is no obvious way to estimate the semivariogram for high-dimensional inputs.

The DACE model can be expressed as a combination of a known function a(x) (e.g., a polynomial function, trigonometric series, etc.) and a Gaussian random process b(x).

    y(x) = a(x) + b(x) (3)

    The Gaussian random process b(x) is assumed to have mean zero and covariance:

Cov(b(x^(i)), b(x^(j))) = σ² R(θ, x^(i), x^(j))   (4)

where σ² is the process variance of the response and R(θ, x^(i), x^(j)) is the correlation model with parameters θ. Table 2 shows different types of correlation models.

Table 2: The correlation functions have the form R(θ, w, x) = ∏_{j=1}^{n} R_j(θ, w_j − x_j), with d_j = w_j − x_j

Name          R_j(θ, d_j)
Exponential   (a)  exp(−θ_j |d_j|)
Gaussian      (b)  exp(−θ_j d_j²)
Linear        (c)  max{0, 1 − θ_j |d_j|}
Spherical     (d)  1 − 1.5ξ_j + 0.5ξ_j³,   ξ_j = min{1, θ_j |d_j|}
Cubic         (e)  1 − 3ξ_j² + 2ξ_j³,      ξ_j = min{1, θ_j |d_j|}
Spline        (f)  ζ(ξ_j),  ξ_j = θ_j |d_j|, with
                   ζ(ξ_j) = 1 − 15ξ_j² + 30ξ_j³   for 0 ≤ ξ_j ≤ 0.2
                   ζ(ξ_j) = 1.25(1 − ξ_j)³        for 0.2 < ξ_j < 1
                   ζ(ξ_j) = 0                     for ξ_j ≥ 1


It is common to choose the correlation function as a decreasing function of the distance between two points. Thus, two points close together will have a small distance and a high correlation. In Figure 1, note that in all cases the correlation decreases with |d_j| and a larger θ_j leads to a faster decrease.

For the set S of design sites (training set) we have the vector of responses Y:

Y = [f(s1), f(s2), ..., f(sm)]^T   (5)

Further, define R as the matrix of stochastic-process correlations between the z's at the design sites,

R_ij = R(θ, s_i, s_j),  i, j = 1, ..., m   (6)

At an untried point x let

r(x) = [R(θ, s1, x), ..., R(θ, sm, x)]^T   (7)

be the vector of correlations between the z's at the design sites and x. Now, for the sake of convenience, consider the linear predictor:

ŷ(x) = c^T Y   (8)

The error is:

ŷ(x) − y(x) = c^T Y − y(x)
            = c^T (Fβ + Z) − (f(x)^T β + z)   (9)
            = c^T Z − z + (F^T c − f(x))^T β

where Z = [z1, ..., zm]^T are the errors at the design sites and F is the matrix of regression functions evaluated at the design sites. To keep the predictor unbiased we demand that F^T c − f(x) = 0, or

F^T c = f(x)   (10)

Under this condition, the mean squared error (MSE) of the predictor (8) is:

MSE(x) = E[(ŷ(x) − y(x))²]   (11)

Using Lagrange multipliers the MSE can be minimized, and the predictor model of DACE can be written as:

ŷ(x) = β* + r^T(x) R^-1 (Y − 1β*),   β* = (F^T R^-1 F)^-1 F^T R^-1 Y   (12)

σ² = (1/m) (Y − 1β*)^T R^-1 (Y − 1β*)

where r(x) is the vector of correlations between the z's at the design sites and x, and R is the matrix of stochastic-process correlations between the z's at the design sites.

The values of β* and σ² depend on the value of θ_j. The parameter θ_j can be found with the maximum likelihood method, i.e., by maximizing the expression:

−(1/2) [ m ln σ² + ln |R| ]   (13)

The principal disadvantage of Kriging is that the model construction can be very time-consuming; moreover, estimating the parameters θ is an n-dimensional optimization problem (n is the number of variables in the design space), which can be computationally expensive to solve.
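To make the predictor of Eq. (12) concrete, the sketch below implements ordinary Kriging (constant regression term, Gaussian correlation from Table 2) with NumPy, with θ fixed by hand instead of estimated by maximum likelihood, and a small nugget added for numerical stability. It is only an illustration under those assumptions, not the DACE toolbox [15] used in this work; the function names are hypothetical.

```python
import numpy as np

def gaussian_corr(A, B, theta):
    """R_ij = exp(-sum_k theta_k * (A_ik - B_jk)^2), the Gaussian correlation of Table 2."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2 * theta).sum(axis=2)
    return np.exp(-d2)

def fit_kriging(S, Y, theta, nugget=1e-10):
    """Build the correlation matrix at the design sites and the constant trend beta."""
    m = len(Y)
    R = gaussian_corr(S, S, theta) + nugget * np.eye(m)
    Rinv = np.linalg.inv(R)
    ones = np.ones(m)
    beta = (ones @ Rinv @ Y) / (ones @ Rinv @ ones)   # generalized least-squares mean
    return {"S": S, "Y": Y, "theta": theta, "Rinv": Rinv, "beta": beta}

def predict_kriging(model, x):
    """Eq. (12) with a constant regression term: beta + r(x)^T R^-1 (Y - 1*beta)."""
    r = gaussian_corr(x[None, :], model["S"], model["theta"]).ravel()
    return model["beta"] + r @ model["Rinv"] @ (model["Y"] - model["beta"])

# Toy usage on the 2-D Sphere function with a hand-picked theta
rng = np.random.default_rng(1)
S = rng.uniform(-5, 5, size=(40, 2))
Y = (S ** 2).sum(axis=1)
model = fit_kriging(S, Y, theta=np.array([0.5, 0.5]))
print(predict_kriging(model, np.array([1.0, 1.0])))  # roughly 2
```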

In this work, the KRG code used is the one reported in [15].


Figure 1: Correlation functions R_j(θ, d_j) plotted against d_j for θ = 0.2, 1 and 5: (a) Exponential, (b) Gaussian, (c) Linear, (d) Spherical, (e) Cubic, (f) Spline. [Plots omitted.]

2.3 Radial Basis Function Network

A radial basis function network (RBFN) is an artificial neural network that uses radial basis functions (RBFs) as activation functions; its output is a linear combination of radial basis functions. RBFNs are used in function approximation, time series prediction, and control.

RBFs were first introduced by R. Hardy in 1971 [10]. An RBF is a real-valued function whose value depends only on the distance from the origin, so that φ(x) = φ(‖x‖), or alternatively on the distance from some other point c, called a center, so that φ(x, c) = φ(‖x − c‖). Any function that satisfies the property φ(x) = φ(‖x‖) is a radial function. The norm is usually the Euclidean distance, although other distance functions are also possible:

‖x‖ = √( Σ_{i=1}^{d} x_i² ) = distance of x to the origin   (14)

Typical choices for the RBF include linear splines, cubic splines, multiquadric splines, thin-plate splines and Gaussian functions, as shown in Table 3.

Table 3: Radial Basis Functions, r = ‖x − c_i‖

Type of RBF            Function
linear splines         |r|
cubic splines          |r|³
multiquadric splines   √(1 + (εr)²)
thin-plate splines     |r|² ln |r|
Gaussian               exp(−(εr)²)

An RBFN typically has three layers: an input layer, a hidden layer with a non-linear RBF activation function, and a linear output layer (Figure 2). The output φ: R^n → R of the network is thus:

φ(x) = Σ_{i=1}^{N} w_i ρ(‖x − c_i‖)   (15)

where N is the number of neurons in the hidden layer, c_i is the center vector for neuron i, and w_i are the weights of the linear output neuron. In the basic form, all inputs are connected to each hidden neuron. The norm is typically taken to be the Euclidean distance and the basis function is taken to be Gaussian.

RBF networks are universal approximators on a compact subset of R^n. This means that an RBF network with enough hidden neurons can approximate any continuous function with arbitrary precision.

In an RBF network there are three types of parameters that need to be chosen to adapt the network to a particular task: the weights w_i, the center vectors c_i, and the RBF width parameters. In sequential training, the weights are updated at each time step as data stream in.

For some tasks it makes sense to define an objective function and select the parameter values that minimize its value. The most common objective function is the least squares function:

K(w) = Σ_t K_t(w)   (16)

where

K_t(w) = [ y(t) − φ(x(t), w) ]²   (17)

Radial basis function networks have been shown to produce good fits to arbitrary contours of both deterministic and stochastic response functions. In this work, the code used for RBF is our own implementation based on [].
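A minimal sketch of Eq. (15) follows: Gaussian basis functions centered on a random subset of the training points, with the linear output weights w_i obtained by least squares. This is an illustration only (the paper's own RBF implementation is not reproduced here), and the names rbf_design, fit_rbf and predict_rbf, as well as the width parameter eps, are hypothetical.

```python
import numpy as np

def rbf_design(X, centers, eps):
    """Gaussian basis phi(r) = exp(-(eps*r)^2) evaluated for every sample/center pair."""
    r = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(eps * r) ** 2)

def fit_rbf(X, y, n_centers=6, eps=1.0, seed=0):
    """Pick centers (here a random subset of the training points) and solve Eq. (15)
    for the output weights w by linear least squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_centers, replace=False)]
    Phi = rbf_design(X, centers, eps)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, w

def predict_rbf(centers, w, eps, X):
    return rbf_design(X, centers, eps) @ w

# Toy usage on samples of the 2-D Sphere function
rng = np.random.default_rng(2)
X = rng.uniform(-5, 5, size=(100, 2))
centers, w = fit_rbf(X, (X ** 2).sum(axis=1), n_centers=20, eps=0.3)
print(predict_rbf(centers, w, 0.3, np.array([[0.0, 0.0]])))  # close to 0
```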


Figure 2: Architecture of a radial basis function network. An input vector x is used as input to all radial basis functions, each with different parameters. The output of the network is a linear combination of the outputs from the radial basis functions. [Diagram omitted: input x → weights → RBF layer → linear weights → output y.]

    2.4 Support Vector Regression

Support Vector Machines (SVMs) are mainly inspired by statistical learning theory [22]. SVMs are a set of related supervised learning methods which analyze data and recognize patterns. A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.

SVM schemes use a mapping into a larger space so that cross products may be computed easily in terms of the variables in the original space, making the computational load reasonable. The cross products in the larger space are defined in terms of a kernel function K(x, x′) which can be selected to suit the problem. Table 4 shows different types of kernel functions.

Table 4: Kernel Functions for SVM

Type of Kernel                       Kernel Function
Polynomial                           K(x, x′) = ⟨x, x′⟩^d
Gaussian Radial Basis Function       K(x, x′) = exp(−‖x − x′‖² / (2σ²))
Exponential Radial Basis Function    K(x, x′) = exp(−‖x − x′‖ / (2σ²))
Multilayer Perceptron                K(x, x′) = tanh(ρ⟨x, x′⟩ + ϱ)

SVMs can also be applied to regression problems¹ by the introduction of an alternative loss function. The loss function must be modified to include a distance measure.

    Consider the problem of approximating the set of data,

¹The SVM for regression problems is known as Support Vector Regression (SVR).


D = {(x¹, y¹), ..., (x^l, y^l)},  x ∈ R^n, y ∈ R   (18)

with a linear function,

f(x) = ⟨w, x⟩ + b   (19)

    the optimal regression function is given by the minimum of the functional,

Φ(w, ξ) = (1/2) ‖w‖² + C Σ_{i=1}^{n} (ξ_i⁺ + ξ_i⁻)   (20)

where C is a pre-specified value, and ξ⁺, ξ⁻ are slack variables representing upper and lower constraints on the outputs of the system.

Different types of loss functions exist (quadratic, Laplace, Huber, ε-insensitive); here we only describe the problem with the ε-insensitive loss.

Using an ε-insensitive loss function,

L_ε = { 0,                 for |f(x) − y| < ε
      { |f(x) − y| − ε,    otherwise          (21)

    the solution is given by,

max_{α,α*} W(α, α*) = max_{α,α*} −(1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} (α_i − α_i*)(α_j − α_j*)⟨x_i, x_j⟩ + Σ_{i=1}^{l} [ α_i(y_i − ε) − α_i*(y_i + ε) ]   (22)

    with constraints,

0 ≤ α_i, α_i* ≤ C,  i = 1, ..., l   (23)

Σ_{i=1}^{l} (α_i − α_i*) = 0

Solving Equation 22 with the constraints of Equation 23 determines the Lagrange multipliers α, α*, and the regression function is given by Equation 19, where:

w̄ = Σ_{i=1}^{l} (α_i − α_i*) x_i   (24)

b̄ = −(1/2) ⟨w̄, x_r + x_s⟩

with x_r and x_s support vectors. Note that the above solution is for a regression with a linear kernel function. In this work, the SVR code used is the one reported in [5].
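For illustration, the sketch below trains an ε-insensitive SVR with a Gaussian (RBF) kernel using scikit-learn, whose SVR estimator is built on LIBSVM [5]. The specific values of C, epsilon and gamma here are arbitrary examples, not the settings used in this study.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.uniform(-5, 5, size=(100, 2))
y = (X ** 2).sum(axis=1)                      # Sphere samples as training data

# epsilon-insensitive SVR with a Gaussian kernel; C, epsilon and gamma play the
# roles of the regularization constant, the tube width and the kernel width above
model = SVR(kernel="rbf", C=100.0, epsilon=0.2, gamma=0.1)
model.fit(X, y)
print(model.predict(np.array([[1.0, 1.0]])))  # roughly 2
```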

    3 State of the art

This section describes some approaches that have successfully used metamodeling techniques.

Ratle [17] proposed a hybrid algorithm that uses a Genetic Algorithm (GA) with Kriging. The first generation is randomly initialized as in a basic GA and fitness is evaluated using the true fitness function. Solutions found in the first generation are used for building up a metamodel with Kriging. The metamodel is then exploited for several generations until a convergence criterion is reached (the convergence criterion is proposed by the authors). The next generation is evaluated using the real fitness function and the metamodel is updated using the new data points. The proposed algorithm was applied to a test function with two variables and to a test function with 20 variables. The authors state that the algorithm seems appropriate for rapidly obtaining moderately good solutions.

Ratle [18] combined a genetic algorithm with an approximate model created by the Kriging method. The metamodel is updated every k generations. Six different strategies were proposed to update the metamodel


and they were evaluated on two test problems. Finally, the best strategy was used for the design of a simple mechanical structure with a noise or vibration level reduction criterion. The authors do not justify the selection of the parameters of the Kriging method.

El-Beltagy et al. [7] proposed an algorithm with a Gaussian regression model. The algorithm creates a metamodel of the original function using the Gaussian regression model; the metamodel is updated every time a generation delay criterion is satisfied, and the update is performed by taking into account the fitness and the minimum distance of the population with respect to the vectors used to build the metamodel. The individuals taken into account for the creation of the model must satisfy a minimum distance with respect to the vectors with which the metamodel was created, which helps the diversity of points in the metamodel and decreases the computation time. The metamodeling approach seems to work best with smooth objective functions; it stalls in situations where the global optimum has strong local features.

Bull [3] proposed a GA with a neural network. The neural network is trained using example individuals with their explicit fitness, and the resulting model is then used by the GA to find a solution. The model is updated every R generations: the current fittest individual from the evolving population is evaluated with the real function, this individual replaces the one with the lowest fitness in the training set, and the neural network is re-trained. The approach was applied to 20 fitness landscapes created with the NK model [13].

Pierret [16] designed an algorithm for turbomachinery blade design. Improving the machine performance requires detailed knowledge that can be provided by Navier-Stokes solvers (these solvers are time-consuming). The algorithm has a database containing the inputs and outputs of previous Navier-Stokes solutions. One starts by scanning the database to select the sample that has a performance closest to the required one; this sample is then adapted to the required performance through an optimization procedure. The optimization algorithm used is simulated annealing, and it uses an approximate model for the performance evaluation. The approximate model is obtained from the database with a neural network. The new solution obtained from the optimization procedure is evaluated by the Navier-Stokes solver and is added to the database. Finally, if the target performance has not been reached, a new iteration is started. The method requires few Navier-Stokes computations to define an optimized blade.

El-Beltagy et al. [6] used a Gaussian Process (GP) with an Evolutionary Algorithm (EA), and presented the advantages of using GPs over other neural-net biologically inspired approaches. The metamodel is updated with an online model expansion: the model can expand to include new data points with minimal computational cost. In the algorithm, the metamodel is not updated in every generation but according to the predicted standard deviation of the metamodel. Results are presented for a real-world engineering problem involving the structural optimization of a satellite boom.

Jin et al. [12] mentioned evolution control for the first time. Evolution control helps to avoid false optima. They propose two methods of evolution control. The first is individual-based control, in which part of the individuals in the population are chosen and evaluated with the real function; if the individuals are chosen randomly they call it a random strategy, and if the best individuals are chosen they call it a best strategy. The second is generation-based control, in which the whole population is evaluated with the real function every k generations. They proposed a framework for managing approximate models in generation-based evolution control. They also proposed an algorithm that combines an evolution strategy (ES) with a neural network and uses the proposed framework. They evaluated the new approach on two theoretical functions and on a real blade design optimization problem.

Emmerich et al. [8] proposed the Metamodel Assisted Evolution Strategies (MAES), which combine an ES with Kriging. The approach takes into account the error associated with each prediction: the estimated value and the predicted error are used to select the individuals to be evaluated with the real function. They evaluated the approach on artificial landscapes and on an airfoil shape optimization problem. The principal advantage of this method is the use of the error associated with the predictions.

Regis and Shoemaker [19] proposed two algorithms: 1) an ES with local quadratic approximation, and 2) an ES with local cubic radial basis functions. The main feature of the algorithms is that the objective function value (or fitness value) of an offspring solution is estimated by fitting a model using its k-nearest neighbors among the previously evaluated points. The algorithms were applied to a twelve-dimensional (12-D) groundwater bioremediation problem involving a complex nonlinear finite-element simulation model.

Bueche et al. [2] proposed an algorithm denominated the Gaussian Process Optimization Procedure (GPOP). They use a Gaussian Process as an inexpensive function that replaces the original function; the Gaussian Process is created with individuals in the neighborhood of the current best solution and with the most recent


individuals evaluated. GPOP was applied to a real-world problem: the optimization of stationary gas turbine compressor profiles. The authors mention that GPOP converged much faster than a range of alternative evolution strategies, and to significantly better results.

Most of the previous works do not justify the choice of the metamodel; probably the authors chose the technique based on their own familiarity with it. There are some works in which the metamodeling technique is chosen because of its characteristics [8]. Some other works compare several metamodeling techniques, but the comparison methodology is inefficient [19].

There are studies in the literature that compare different techniques for creating metamodels. However, most of them compare only two techniques [4, 21, 9, 24], or take only the accuracy of the metamodel as the point of comparison. Another work that takes into account several metamodeling techniques is the one reported in [11]; they compared four metamodeling techniques (Polynomial Regression, Kriging, Radial basis functions, Multivariate Adaptive Regression Splines), and the comparison takes into account multiple criteria to decide the best technique on different problems. However, one disadvantage of this work is that the dimensionality is not taken as an important factor. Another disadvantage is that it does not take into account the fitness landscape of the metamodel; it is natural to think that a metamodel can be very accurate but more difficult to optimize.

4 Methodology

The principal challenge for approximation models is to be as accurate as possible over the complete domain of interest while minimizing the simulation cost (efficiency). Most approaches that use metamodeling techniques take into account only the accuracy of the technique [4, 21, 9, 24]. However, other approaches suggest the use of multiple criteria for assessing the quality of a metamodel [11], for example robustness, efficiency and simplicity. In this work we take these aspects into account, and others have been added, such as scalability and ease of optimization.

To measure the performance of the metamodeling techniques, the following aspects were taken into account:

    Accuracy

Accuracy is the capability to produce a prediction close to the real value of the system. For accuracy we use two data sets: the first one is the training data set, used to train the metamodeling technique; the second is the validation data set, used to validate the accuracy of the technique. To measure the accuracy we use the G-metric:

G = 1 − [ Σ_{i=1}^{N} (y_i − ŷ_i)² ] / [ Σ_{i=1}^{N} (y_i − ȳ)² ] = 1 − MSE / Variance   (25)

where N is the size of the validation data set, ŷ_i is the predicted value for input i, y_i is the real value, and ȳ is the mean of the real values. The MSE (Mean Squared Error) measures the difference between the estimator and the real value; the variance describes how far values lie from the mean, that is, it captures how irregular the problem is. The larger the value of G, the more accurate the metamodel.
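A direct transcription of Eq. (25) is given below as a small helper; the function name g_metric is hypothetical.

```python
import numpy as np

def g_metric(y_true, y_pred):
    """G = 1 - MSE/Variance on a validation set, Eq. (25); larger means more accurate."""
    mse = np.mean((y_true - y_pred) ** 2)
    variance = np.mean((y_true - y_true.mean()) ** 2)
    return 1.0 - mse / variance
```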

    Robustness

Robustness is the capability of the technique to achieve good accuracy on different test problems. Six problems were used, described in Section 4.1. Of the six problems, three are unimodal and the other three are multimodal; the six problems have different features and are common problems for optimization algorithms.

    Scalability

Scalability is the capability of the technique to achieve good accuracy for different numbers of decision variables. The six test problems can be scaled in the number of decision variables. The numbers of variables used are divided into 9 levels, v = [2, 4, 6, 8, 10, 15, 20, 25, 50].

    Efficiency

Efficiency refers to the computational effort required by the technique to construct the metamodel and to predict the response for a new input. The efficiency of each metamodeling technique is measured by the time used for the metamodel construction and for new predictions.


Easy to optimize

Easy to optimize refers to the ease of optimizing a metamodel created by a technique. It is natural to think that a more accurate metamodel may be more complicated to optimize, because its fitness landscape may be more rugged. To measure the ease of optimization, Differential Evolution (DE) is used to optimize a metamodel; the best value found by DE is evaluated with the real function and its distance to the optimum is measured.
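As a sketch of this measurement, and under the assumption that SciPy's differential_evolution is an acceptable stand-in for the DE used by the authors (whose settings are not given here), the helper below optimizes a surrogate's prediction and reports the real function value of the DE optimum and its distance to the known optimum; the name ease_to_optimize and its arguments are hypothetical.

```python
import numpy as np
from scipy.optimize import differential_evolution

def ease_to_optimize(surrogate_predict, real_f, bounds, x_opt, seed=0):
    """Optimize the surrogate with DE, then score the DE optimum on the real function
    and measure its distance to the known optimum x_opt."""
    result = differential_evolution(surrogate_predict, bounds, seed=seed)
    return real_f(result.x), np.linalg.norm(result.x - x_opt)
```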

    Simplicity

Simplicity refers to the ease of use of each technique: the number and size of its parameters, the implementation, and the knowledge necessary to understand the technique.

    4.1 Test problems

To test the metamodeling techniques on different classes of problems, six test problems for unconstrained global optimization were selected. The six problems are scalable in the design space. The test problems were selected based on the shape of the search space and on the number of local minima.

A summary of the features of the six problems is given in Table 5; the test problems are described in more detail in Appendix A.

    Table 5: Features of the test problems

Problem name   Search space shape                        # of local minima                       # of variables   Global minimum
Step           Unimodal                                  no local minima except the global one   n                x* = (0, ..., 0), f(x*) = 0
Sphere         Unimodal                                  no local minima except the global one   n                x* = (0, ..., 0), f(x*) = 0
Rosenbrock     Unimodal for n ≤ 3, otherwise multimodal  several local minima for n > 3          n                x* = (1, ..., 1), f(x*) = 0
Ackley         Multimodal                                several local minima                    n                x* = (0, ..., 0), f(x*) = 0
Rastrigin      Multimodal                                large number of local minima            n                x* = (0, ..., 0), f(x*) = 0
Schwefel       Multimodal                                several local minima                    n                x* = (420.9687, ..., 420.9687), f(x*) = 0

    4.2 Scheme for metamodeling techniques comparison

The scheme proposed for the comparative study is as follows (a code sketch of this scheme is given after the list):

1. Create a training data set of size 100 with Latin hypercube sampling [14].

2. Train each technique used (PRS, KRG, RBF, SVR) with the training set.

3. Create the validation data set of size 200 with Latin hypercube sampling.

4. Predict the validation data set with the metamodel.

5. Measure the accuracy with the G-metric.

6. Repeat from step 1 for 31 different data sets.
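One replication of this scheme is sketched below using SciPy's Latin hypercube sampler. Here fit, predict and f stand for any of the surrogate fit/predict pairs and test functions sketched elsewhere in this document, g_metric is the helper from the accuracy criterion above, and all of these names are hypothetical.

```python
import numpy as np
from scipy.stats import qmc

def one_replication(fit, predict, f, dim, lower, upper, seed):
    """Steps 1-5: 100 LHS training points, 200 LHS validation points, G-metric score."""
    sampler = qmc.LatinHypercube(d=dim, seed=seed)
    X_train = qmc.scale(sampler.random(100), lower, upper)
    X_valid = qmc.scale(sampler.random(200), lower, upper)
    model = fit(X_train, np.apply_along_axis(f, 1, X_train))
    y_valid = np.apply_along_axis(f, 1, X_valid)
    return g_metric(y_valid, predict(model, X_valid))

# Step 6: repeat over 31 independent data sets, e.g. for PRS on the 10-D Sphere:
# scores = [one_replication(fit_prs, predict_prs, sphere, 10, [-5]*10, [5]*10, s)
#           for s in range(31)]
```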


The procedure is applied to the six problems with the nine levels of variables (6 problems × 9 levels = 54 different problems). The objective of this experiment is to measure the accuracy, robustness, scalability and efficiency of the techniques.

For each technique the parameters were discretized and a full factorial design was created, in order to avoid penalizing a technique through poor tuning of its parameters. The parameters used for each technique are the following:

Polynomial Regression: degree of the polynomial = {2}; technique used to construct the polynomial = {Full, Stepwise}

Kriging: correlation function = {Gaussian, Exponential, Cubic, Linear, Spherical, Spline}

Radial basis function: number of neurons in the hidden layer = {3-100}

Support Vector Regression: C = {2^-5 : 2^15}, ε = {0.1 : 2}, σ = {2^-10 : 2^5}

    5 Discussion of results

    5.1 Accuracy, robustness and scalability

For each metamodeling technique, the set of parameters with the best accuracy over all the problems was chosen (one setting per metamodeling technique); this is named the Best overall settings. In the same way, the set of parameters with the best accuracy for each problem was chosen (54 settings per metamodeling technique); this is named the Best local settings.

The Best overall settings found are the following:

Polynomial Regression: degree of the polynomial = {2}; technique used to construct the polynomial = {Stepwise}

Kriging: correlation function = {Exponential}

Radial basis function: number of neurons in the hidden layer = {6}

Support Vector Regression: C = {2^10.5}, ε = {0.2}, σ = {2^-2.5}

To illustrate the performance of the metamodeling techniques we use boxplot graphics: the median is shown with a straight line inside the box and indicates that half of the problems are above it and the other half below it; the mean is shown with a circle and indicates the average accuracy of a technique; and the size of the box indicates the variability of the technique, so the smaller the box, the more robust the technique.

Figure 3 shows the accuracy results for all the problems and different problem sizes. The figure shows that for KRG, RBF and SVR the settings found for all the problems and different problem sizes


Figure 3: Accuracy (G-metric) for the metamodeling techniques (PRS, KRG, RBF, SVR) in all the problems, Best local settings vs. Best overall settings. [Boxplots omitted.]

(Best overall settings) achieve a performance comparable to the Best local settings. Moreover, for PRS the Best overall settings worsen the performance of the technique.

Figure 3 also shows that the accuracies of RBF and KRG are among the best of the techniques; their values are very close to each other (the median is close to 0.9), with RBF slightly better than KRG. However, their results are not conclusive with respect to SVR. The worst performance is that of PRS. Figure 4 shows the average G-metric over all the problems; this figure confirms that RBF and KRG obtain similar results and perform better than SVR and PRS.

In terms of the robustness of the accuracy over all the problems, RBF is the best for both the Best local settings and the Best overall settings, although its results are only slightly better than those of KRG. Overall, RBF is the best in terms of average accuracy and robustness when handling different types of problems.

The problems were divided into two types: unimodal and multimodal. For each type, the performance of each metamodeling technique is illustrated with a boxplot. Figure 5 shows that RBF is slightly better than KRG, and that PRS and SVR have the worst performance. However, the results show that there are certain problems for which PRS and SVR perform well, because the top of their boxes is close to 1. Moreover, RBF is more robust than KRG because its box is smaller.

Figure 6 shows that the best performance is that of RBF, although its results are only slightly better than those of KRG. As for the unimodal problems, the worst performance is that of PRS and SVR. In terms of robustness the best is RBF, but its results show that it is not very robust.

Due to this lack of robustness, we need to know whether the problem lies in the increase in the number of variables or in the type of problem. Next we present an analysis of the techniques for each test problem:

    Step

Figure 7 shows the behavior of the metamodeling techniques on the Step function. With very few variables (2 or 4) SVR behaves well, but as the number of variables increases SVR worsens considerably. PRS shows bad behavior even with few variables. KRG performs better than RBF up to a maximum of 10 variables; for a greater number of variables RBF achieves better results. Moreover, as a special mention, RBF maintains a steady performance despite the increase in the number of variables, so we can say that it is robust to the increase of variables.


Figure 4: Accuracy (G-metric) for the metamodeling techniques: average over all the problems, Best local settings vs. Best overall settings. [Plot omitted.]

Figure 5: Accuracy (G-metric) for the metamodeling techniques in unimodal problems and the nine levels of variables, Best local settings vs. Best overall settings. [Boxplots omitted.]


Figure 6: Accuracy (G-metric) for the metamodeling techniques in multimodal problems and the nine levels of variables, Best local settings vs. Best overall settings. [Boxplots omitted.]

Finally, we note that the results for the Best local settings and the Best overall settings are similar, except for PRS. Thus, one can say that the general parameters found (Best overall settings) can be used without prior adjustment (Best local settings).

    Sphere

Figure 8 shows the behavior of the metamodeling techniques on the Sphere function; the technique with the best performance is RBF. Furthermore, its behavior is very consistent as the number of variables increases. With the Best local settings, PRS achieves good performance up to a maximum of 10 variables, and then its performance decreases significantly. In addition, KRG and SVR achieve comparable results with both the Best local settings and the Best overall settings. As in the previous problem, the Best overall settings and the Best local settings obtain similar results, at least for RBF, KRG and SVR.

    Rosenbrock

Figure 9 shows the behavior of the metamodeling techniques on the Rosenbrock function. PRS is the technique with the worst performance: with fewer than five variables it achieves moderately good results, but the increase in the number of variables decreases its performance significantly. KRG and SVR have comparable results; both are better than RBF for a relatively low number of variables (v < 10), but for a greater number of variables (v ≥ 10) RBF achieves better results. A main feature of RBF is its constant behavior even as the number of variables increases.

    Ackley

Figure 10 shows the behavior of the metamodeling techniques on the Ackley function. KRG and RBF obtain similar results with the Best local settings and the Best overall settings. PRS is the worst-performing technique; even with few variables it cannot approximate the function. KRG is the technique with the best performance for few variables (v < 20), while RBF is the technique with the best performance for a greater number of variables (v ≥ 20). Again, RBF remains constant as the number of variables increases.


Figure 7: The mean accuracy (G-metric) by number of variables for the Step problem: (a) Best local settings, (b) Best overall settings. [Plots omitted.]

Figure 8: The mean accuracy (G-metric) by number of variables for the Sphere problem: (a) Best local settings, (b) Best overall settings. [Plots omitted.]

Figure 9: The mean accuracy (G-metric) by number of variables for the Rosenbrock problem: (a) Best local settings, (b) Best overall settings. [Plots omitted.]

Figure 10: The mean accuracy (G-metric) by number of variables for the Ackley problem: (a) Best local settings, (b) Best overall settings. [Plots omitted.]

Rastrigin

Figure 11 shows the behavior of the metamodeling techniques on the Rastrigin function. RBF is the technique with the best performance. PRS performs well up to a maximum of 15 variables. Moreover, SVR and KRG show similar behavior, and their performance decreases as the number of variables increases. Finally, as in the previous problems, the Best overall settings and the Best local settings are comparable for RBF, SVR and KRG.

Figure 11: The mean accuracy (G-metric) by number of variables for the Rastrigin problem: (a) Best local settings, (b) Best overall settings. [Plots omitted.]

    Schwefel

Figure 12 shows the behavior of the metamodeling techniques on the Schwefel function. On this problem the four techniques are well-behaved; however, the techniques with the best performance are RBF and KRG for the Best local settings, and RBF for the Best overall settings. The performance of PRS decreases as the number of variables increases. For this problem KRG and RBF remain constant as the number of variables increases.

Figure 13 shows the performance of the techniques over the six test functions. While KRG is the technique with the best performance for few variables (v < 15), RBF is the technique with the best performance for high-dimensional functions. The technique with the worst overall performance is PRS. SVR achieves performance similar to KRG, although KRG is slightly better. In general, over all problems, RBF is very robust against the increase in the number of variables, since its results remain constant.

    6 Conclusions

The study presented in this paper has provided interesting results about the performance of different metamodeling techniques (PRS, KRG, RBF, SVR). The metamodeling techniques were evaluated using multiple aspects on several test problems with different features.

We define the size of a problem with respect to its dimensionality: high dimensionality for problems with v > 15 and low dimensionality otherwise. Table 6 shows that KRG is the best for low-dimensionality problems, while RBF is the best for high-dimensionality problems. Overall, the best technique with respect to accuracy and scalability is RBF.

Figure 12: The mean accuracy (G-metric) by number of variables for the Schwefel problem: (a) Best local settings, (b) Best overall settings. [Plots omitted.]

Figure 13: The mean accuracy (G-metric) by number of variables over all the problems: (a) Best local settings, (b) Best overall settings. [Plots omitted.]

Table 6: Best techniques with respect to accuracy and dimensionality: all the problems

                      Unimodal   Multimodal   Overall
Low dimensionality    KRG        KRG          KRG
High dimensionality   RBF        RBF          RBF
Overall               RBF        RBF          RBF

Table 7 shows the best techniques with respect to accuracy and dimensionality by problem. For low-dimensional problems the best technique is KRG, although other techniques achieve a performance comparable to it. For high-dimensionality problems the best is RBF.

Table 7: Best techniques with respect to accuracy and dimensionality: by problem

                      Step   Sphere   Rosenbrock   Ackley   Rastrigin   Schwefel
Low dimensionality    KRG    ALL      KRG, SVR     KRG      ALL         KRG, RBF, SVR
High dimensionality   RBF    RBF      RBF          RBF      RBF         KRG, RBF
Overall               RBF    RBF      RBF          RBF      RBF         RBF

Finally, to improve our study, we need to evaluate the missing aspects (efficiency, ease of optimization, simplicity).


A Test problems

    A.1 Step

Step is representative of the problem of flat surfaces. The Step function consists of many flat plateaus with uniform steep ridges. For algorithms that require gradient information to determine a search direction, this function poses considerable difficulty. Flat surfaces are obstacles for optimization algorithms, because they do not give any information as to which direction is favorable. Unless an algorithm has variable step sizes, it can get stuck on one of the flat plateaus.

f1(x) = Σ_{i=1}^{D} ⌊|x_i| + 0.5⌋²   (26)

Its global minimum f1(x) = 0 is obtained for x_i = 0, i = 1, ..., D. Figure 14 shows the Step function with D = 2.

Figure 14: An overview of the Step function with D = 2. [Surface plot omitted.]

    A.2 Sphere

The so-called first function of De Jong is one of the simplest test benchmarks. The function is continuous, convex and unimodal. It has the following general definition:

f2(x) = Σ_{i=1}^{D} x_i²   (27)

Its global minimum f2(x) = 0 is obtained for x_i = 0, i = 1, ..., D. Figure 15 shows the Sphere function with D = 2.

    A.3 Rosenbrock

Rosenbrock's valley is a classic optimization problem, also known as the banana function or the second function of De Jong. The global optimum lies inside a long, narrow, parabolic-shaped flat valley. Finding the valley is trivial; however, convergence to the global optimum is difficult, and hence this problem has frequently been used to test the performance of optimization algorithms. The function has the following definition:

f3(x) = Σ_{i=1}^{D−1} [ 100 (x_{i+1} − x_i²)² + (x_i − 1)² ]   (28)


Figure 15: An overview of the Sphere function with D = 2. [Surface plot omitted.]

Its global minimum f3(x) = 0 is obtained for x_i = 1, i = 1, ..., D. Figure 16 shows the Rosenbrock function with D = 2.

    21

    01

    2

    21

    01

    2

    0

    1000

    2000

    3000

    4000

    x1

    Rosenbrock function

    x2

    f 3(x 1,

    x 2)

    Figure 16: An overview of Rosenbrock function with D = 2.

    A.4 Ackley

The Ackley problem is a minimization problem. Originally this problem was defined for two dimensions, but it has been generalized to D dimensions. Ackley's is a widely used multimodal test function. It has the following definition:

f4(x) = −20 exp( −0.2 √( (1/D) Σ_{i=1}^{D} x_i² ) ) − exp( (1/D) Σ_{i=1}^{D} cos(2πx_i) ) + 20 + exp(1)   (29)

Its global minimum f4(x) = 0 is obtained for x_i = 0, i = 1, ..., D. Figure 17 shows the Ackley function with D = 2.


Figure 17: An overview of the Ackley function with D = 2. [Surface plot omitted.]

A.5 Rastrigin

The Rastrigin function is a typical example of a non-linear multimodal function. Rastrigin's function is based on the function of De Jong with the addition of cosine modulation in order to produce frequent local minima; thus, the test function is highly multimodal. This function is a fairly difficult problem due to its large search space and its large number of local minima. However, the locations of the minima are regularly distributed. The function has the following definition:

f5(x) = Σ_{i=1}^{D} ( x_i² − 10 cos(2πx_i) + 10 )   (30)

Its global minimum f5(x) = 0 is obtained for x_i = 0, i = 1, ..., D. Figure 18 shows the Rastrigin function with D = 2.

Figure 18: An overview of the Rastrigin function with D = 2. [Surface plot omitted.]


A.6 Schwefel

Schwefel's function is deceptive in that the global minimum is geometrically distant, over the parameter space, from the next-best local minima. Therefore, search algorithms are potentially prone to convergence in the wrong direction. The function has the following definition:

f6(x) = 418.9809 D − Σ_{i=1}^{D} x_i sin(√|x_i|)   (31)

Its global minimum f6(x) = 0 is obtained for x_i = 420.9687, i = 1, ..., D. Figure 19 shows the Schwefel function with D = 2.

    Figure 19: An overview of Schwefel function with D = 2.
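For reference, the six test problems of this appendix can be transcribed directly from Eqs. (26)-(31). The sketch below is a straightforward NumPy transcription (note that the Step function uses the |x_i| form of Eq. (26)); the function names are hypothetical.

```python
import numpy as np

# The six scalable test problems of Appendix A; each takes an array x of length D.
def step(x):        return np.sum(np.floor(np.abs(x) + 0.5) ** 2)               # Eq. (26)
def sphere(x):      return np.sum(x ** 2)                                        # Eq. (27)
def rosenbrock(x):  return np.sum(100 * (x[1:] - x[:-1] ** 2) ** 2
                                  + (x[:-1] - 1) ** 2)                           # Eq. (28)
def ackley(x):                                                                   # Eq. (29)
    d = len(x)
    return (-20 * np.exp(-0.2 * np.sqrt(np.sum(x ** 2) / d))
            - np.exp(np.sum(np.cos(2 * np.pi * x)) / d) + 20 + np.e)
def rastrigin(x):   return np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x) + 10)      # Eq. (30)
def schwefel(x):    return 418.9809 * len(x) - np.sum(x * np.sin(np.sqrt(np.abs(x))))  # Eq. (31)

# Sanity checks at the global minima listed in Table 5 (all should be close to 0)
print(sphere(np.zeros(10)), rosenbrock(np.ones(10)), ackley(np.zeros(10)))
```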

    References

[1] Russell R. Barton. Metamodels for simulation input-output relations. In Proceedings of the 24th Conference on Winter Simulation, WSC '92, pages 289-299, New York, NY, USA, 1992. ACM.

[2] D. Bueche, N. N. Schraudolph, and P. Koumoutsakos. Accelerating evolutionary algorithms with Gaussian process fitness function models. IEEE Transactions on Systems, Man, and Cybernetics: Part C, 2004. In press.

[3] L. Bull. On model-based evolutionary computation. Soft Computing, 3:76-82, 1999.

[4] W. Carpenter and J.-F. Barthelemy. A comparison of polynomial approximation and artificial neural nets as response surfaces. Technical Report 92-2247, AIAA, 1992.

[5] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[6] M. A. El-Beltagy and A. J. Keane. Evolutionary optimization for computationally expensive problems using Gaussian processes. In Proceedings of the International Conference on Artificial Intelligence, pages 708-714. CSREA, 2001.

[7] M. A. El-Beltagy, P. B. Nair, and A. J. Keane. Metamodeling techniques for evolutionary optimization of computationally expensive problems: promises and limitations. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 196-203, Orlando, 1999. Morgan Kaufmann.

[8] M. Emmerich, A. Giotis, M. Ozdenir, T. Back, and K. Giannakoglou. Metamodel-assisted evolution strategies. In Parallel Problem Solving from Nature, number 2439 in Lecture Notes in Computer Science, pages 371-380. Springer, 2002.

[9] Anthony A. Giunta and Layne T. Watson. A comparison of approximation modeling techniques: Polynomial versus interpolating models, 1998.

[10] R. L. Hardy. Multiquadric equations of topography and other irregular surfaces. Journal of Geophysical Research, 76:1905-1915, 1971.

[11] R. Jin, W. Chen, and T. W. Simpson. Comparative studies of metamodeling techniques under multiple modeling criteria. Technical Report 2000-4801, AIAA, 2000.

[12] Y. Jin, M. Olhofer, and B. Sendhoff. Managing approximate models in evolutionary aerodynamic design optimization. In Proceedings of the IEEE Congress on Evolutionary Computation, volume 1, pages 592-599, May 2001.

[13] Stuart A. Kauffman. The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press, USA, 1st edition, June 1993.

[14] M. D. McKay, R. J. Beckman, and W. J. Conover. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239-245, 1979.

[15] S. N. Lophaven, H. B. Nielsen, and J. Søndergaard. DACE - a Matlab Kriging toolbox, 2002.

[16] S. Pierret. Turbomachinery blade design using a Navier-Stokes solver and artificial neural network. ASME Journal of Turbomachinery, 121(3):326-332, 1999.

[17] A. Ratle. Accelerating the convergence of evolutionary algorithms by fitness landscape approximation. In A. Eiben, Th. Back, M. Schoenauer, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature, volume V, pages 87-96, 1998.

[18] A. Ratle. Optimal sampling strategies for learning a fitness model. In Proceedings of the 1999 Congress on Evolutionary Computation, volume 3, pages 2078-2085, Washington D.C., July 1999.

[19] R. G. Regis and C. A. Shoemaker. Local function approximation in evolutionary algorithms for the optimization of costly functions. IEEE Transactions on Evolutionary Computation, 8(5):490-505, 2004.

[20] Jerome Sacks, William J. Welch, Toby J. Mitchell, and Henry P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409-423, November 1989.

[21] T. Simpson, T. Mauery, J. Korte, and F. Mistree. Comparison of response surface and Kriging models for multidisciplinary design optimization. Technical Report 98-4755, AIAA, 1998.

[22] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998.

[23] F. A. C. Viana. SURROGATES Toolbox User's Guide. Gainesville, FL, USA, version 2.1, 2010.

[24] L. Willmes, T. Baeck, Y. Jin, and B. Sendhoff. Comparing neural networks and Kriging for fitness approximation in evolutionary optimization. In Proceedings of the IEEE Congress on Evolutionary Computation, pages 663-670, 2003.
