Predictive Entropy Search for Multi-objective Bayesian Optimization

Daniel Hernández-Lobato [email protected]
Universidad Autónoma de Madrid, Francisco Tomás y Valiente 11, 28049, Madrid, Spain.

José Miguel Hernández-Lobato [email protected]
Harvard University, 33 Oxford Street, Cambridge, MA 02138, USA.

Amar Shah [email protected]
Cambridge University, Trumpington Street, Cambridge CB2 1PZ, United Kingdom.

Ryan P. Adams [email protected]
Harvard University and Twitter, 33 Oxford Street, Cambridge, MA 02138, USA.
Abstract

We present PESMO, a Bayesian method for identifying the Pareto set of multi-objective optimization problems, when the functions are expensive to evaluate. PESMO chooses the evaluation points to maximally reduce the entropy of the posterior distribution over the Pareto set. The PESMO acquisition function is decomposed as a sum of objective-specific acquisition functions, which makes it possible to use the algorithm in decoupled scenarios in which the objectives can be evaluated separately and perhaps with different costs. This decoupling capability is useful to identify difficult objectives that require more evaluations. PESMO also offers gains in efficiency, as its cost scales linearly with the number of objectives, in comparison to the exponential cost of other methods. We compare PESMO with other methods on synthetic and real-world problems. The results show that PESMO produces better recommendations with a smaller number of evaluations, and that a decoupled evaluation can lead to improvements in performance, particularly when the number of objectives is large.
1. Introduction

We address the problem of optimizing K real-valued functions f_1(x), . . . , f_K(x) over some bounded domain X ⊂ R^d, where d is the dimensionality of the input space. This is a more general, challenging and realistic scenario than the one considered in traditional optimization problems, where there is a single objective function. For example, in a complex robotic system, we may be interested in minimizing the energy consumption while maximizing locomotion speed (Ariizumi et al., 2014). When selecting a financial portfolio, it may be desirable to maximize returns while minimizing various risks. In a mechanical design, one may wish to minimize manufacturing cost while maximizing durability. In each of these multi-objective examples, it is unlikely to be possible to optimize all of the objectives simultaneously, as they may be conflicting: a fast-moving robot probably consumes more energy, high-return financial instruments typically carry greater risk, and cheaply manufactured goods are often more likely to break. Nevertheless, it is still possible to find a set of optimal points X⋆ known as the Pareto set (Collette & Siarry, 2003). Rather than a single best point, this set represents a collection of solutions at which no objective can be improved without damaging one of the others.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).
In the context of minimization, we say that x Pareto-dominates x′ if f_k(x) ≤ f_k(x′) ∀k, with at least one of the inequalities being strict. The Pareto set X⋆ is then the subset of non-dominated points in X, i.e., the set such that ∀x⋆ ∈ X⋆, ∀x ∈ X, ∃ k ∈ {1, . . . , K} for which f_k(x⋆) < f_k(x). The Pareto set is considered to be optimal because for each point in that set one cannot improve in one of the objectives without deteriorating some other objective. Given X⋆, the user may choose a point from this set according to their preferences, e.g., locomotion speed vs. energy consumption. The Pareto set is often not finite, and most strategies aim at finding a finite set with which to approximate X⋆ well.
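To make these definitions concrete, the dominance check and the extraction of the non-dominated subset of a finite candidate set can be sketched as follows. This is an illustrative NumPy helper of ours, not part of PESMO itself:

```python
import numpy as np

def dominates(fx, fy):
    """True if objective vector fx Pareto-dominates fy (minimization):
    fx is no worse in every objective and strictly better in at least one."""
    fx, fy = np.asarray(fx), np.asarray(fy)
    return bool(np.all(fx <= fy) and np.any(fx < fy))

def pareto_set(F):
    """Indices of the non-dominated rows of F, an (n, K) array of
    objective values, found by pairwise dominance checks."""
    n = F.shape[0]
    keep = []
    for i in range(n):
        if not any(dominates(F[j], F[i]) for j in range(n) if j != i):
            keep.append(i)
    return keep
```

The quadratic pairwise scan is adequate for the small candidate sets used in the illustrations below; specialized algorithms exist for large n.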
It frequently happens that there is a high cost to evaluating one or more of the functions f_k(·). For example, in the robotic example, the evaluation process may involve a time-consuming experiment with the embodied robot. In this case, one wishes to minimize the number of evaluations required to obtain a useful approximation to the Pareto set X⋆. Furthermore, it is often the case that there is no simple closed form for the objectives f_k(·), i.e., they can be regarded as black boxes. One promising approach in this setting has been to use a probabilistic model such as a Gaussian process to approximate each function (Knowles, 2006; Emmerich, 2008; Ponweiser et al., 2008; Picheny, 2015). At each iteration, these strategies use the uncertainty captured by the probabilistic model to generate an acquisition (utility) function, the maximum of which provides an effective heuristic for identifying a promising location on which to evaluate the objectives. Unlike the actual objectives, the acquisition function is a function of the model and therefore relatively cheap to evaluate and maximize. This approach contrasts with model-free methods based on genetic algorithms or evolutionary strategies that are known to be effective for approximating the Pareto set, but demand a large number of function evaluations (Deb et al., 2002; Li, 2003; Zitzler & Thiele, 1999).
Despite these successes, there are notable limitations to current model-based approaches: 1) they often build the acquisition function by transforming the multi-objective problem into a single-objective problem using scalarization techniques (an approach that is expected to be suboptimal), 2) the acquisition function generally requires the evaluation of all of the objective functions at the same location in each iteration, and 3) the computational cost of evaluating the acquisition function typically grows exponentially with the number of objectives, which limits their applicability to optimization problems with just 2 or 3 objectives.
We describe here a strategy for multi-objective optimization that addresses these concerns. We extend previous single-objective strategies based on stepwise uncertainty reduction to the multi-objective case (Villemonteix et al., 2009; Hernández-Lobato et al., 2014; Hennig & Schuler, 2012). In the single-objective case, these strategies choose the next evaluation location based on the reduction of the Shannon entropy of the posterior estimate of the minimizer x⋆. The idea is that a smaller entropy implies that the minimizer x⋆ is better identified; the heuristic then chooses candidate evaluations based on how much they are expected to improve the quality of this estimate. These information gain criteria have been shown to often provide better results than other alternatives based, e.g., on the popular expected improvement (Hernández-Lobato et al., 2014; Hennig & Schuler, 2012; Shah & Ghahramani, 2015).
The extension to the multi-objective case is obtained by considering the entropy of the posterior distribution over the Pareto set X⋆. More precisely, we choose the next evaluation as the one that is expected to most reduce the entropy of our estimate of X⋆. The proposed approach is called predictive entropy search for multi-objective optimization (PESMO). Several experiments involving real-world and synthetic optimization problems show that PESMO can lead to better performance than related methods from the literature. Furthermore, in PESMO the acquisition function is expressed as a sum across the different objectives, allowing for decoupled scenarios in which we can choose to only evaluate a subset of objectives at any given location. In the robotics example, one might be able to decouple the problems by estimating energy consumption from a simulator even if the locomotion speed could only be evaluated via physical experimentation. Another example, inspired by Gelbart et al. (2014), might be the design of a low-calorie cookie: one wishes to maximize taste while minimizing calories, but calories are a simple function of the ingredients, while taste could require human trials. The results obtained show that PESMO can obtain better results with a smaller number of evaluations of the objective functions in such scenarios. Furthermore, we have observed that the decoupled evaluation provides significant improvements over a coupled evaluation when the number of objectives is large. Finally, unlike other methods (Ponweiser et al., 2008; Picheny, 2015), the computational cost of PESMO grows linearly with the number of objectives.
2. Multi-objective Bayesian Optimization via Predictive Entropy Search
In this section we describe the proposed approach for multi-objective optimization based on predictive entropy search. Given some previous evaluations of each objective function f_k(·), we seek to choose new evaluations that maximize the information gained about the Pareto set X⋆. This approach requires a probabilistic model for the unknown objectives, and we therefore assume that each f_k(·) follows a Gaussian process (GP) prior (Rasmussen & Williams, 2006), with observation noise that is i.i.d. Gaussian with zero mean. GPs are often used in model-based approaches to multi-objective optimization because of their flexibility and ability to model uncertainty (Knowles, 2006; Emmerich, 2008; Ponweiser et al., 2008; Picheny, 2015). For simplicity, we initially consider a coupled setting in which we evaluate all objectives at the same location in any given iteration. Nevertheless, the approach described can be easily extended to the decoupled scenario.
Let D = {(x_n, y_n)}_{n=1}^N be the data (function evaluations) collected up to step N, where y_n is a K-dimensional vector with the values resulting from the evaluation of all objectives at step n, and x_n is a vector in input space denoting the evaluation location. The next query x_{N+1} is the one that maximizes the expected reduction in the entropy H(·) of the posterior distribution over the Pareto set X⋆, i.e., p(X⋆|D). The acquisition function of PESMO is hence:
α(x) = H(X⋆ | D) − E_y[ H(X⋆ | D ∪ {(x, y)}) ],    (1)
where y is the output of all the GP models at x and the expectation is taken with respect to the posterior distribution for y given by these models, p(y|D, x) = ∏_{k=1}^K p(y_k|D, x). The GPs are assumed to be independent a priori. This acquisition function is known as entropy search (Villemonteix et al., 2009; Hennig & Schuler, 2012). Thus, at each iteration we set the location of the next evaluation to x_{N+1} = arg max_{x∈X} α(x).
A practical difficulty, however, is that the exact evaluation of Eq. (1) is generally infeasible and the function must be approximated; we follow the approach described in (Hernández-Lobato et al., 2014; Houlsby et al., 2012). In particular, Eq. (1) is the mutual information between X⋆ and y given D. The mutual information is symmetric and hence we can exchange the roles of the variables X⋆ and y, leading to an expression that is equivalent to Eq. (1):

α(x) = H(y | D, x) − E_{X⋆}[ H(y | D, x, X⋆) ],    (2)
where the expectation is now with respect to the posterior distribution for the Pareto set X⋆ given the observed data, and H(y|D, x, X⋆) measures the entropy of p(y|D, x, X⋆), i.e., the predictive distribution for the objectives at x given D and conditioned to X⋆ being the Pareto set of the objective functions. This alternative formulation is known as predictive entropy search (Hernández-Lobato et al., 2014) and it significantly simplifies the evaluation of the acquisition function α(·). In particular, we no longer have to evaluate or approximate the entropy of the Pareto set, X⋆, which may be quite difficult. The new acquisition function obtained in Eq. (2) favors the evaluation in the regions of the input space for which X⋆ is more informative about y. These are precisely also the regions in which y is more informative about X⋆.
The first term in the r.h.s. of Eq. (2) is straightforward to evaluate; it is simply the entropy of the predictive distribution p(y|D, x), which is a factorizable K-dimensional Gaussian distribution. Thus, we have that

H(y | D, x) = (K/2) log(2πe) + Σ_{k=1}^K 0.5 log(v_k^PD),    (3)
where v_k^PD is the predictive variance of f_k(·) at x. The difficulty comes from the evaluation of the second term in the r.h.s. of Eq. (2), which is intractable and must be approximated; we follow Hernández-Lobato et al. (2014) and approximate the expectation using a Monte Carlo estimate of the Pareto set, X⋆, given D. This involves sampling several times the objective functions from their posterior distribution p(f_1, . . . , f_K|D). This step is done as in Hernández-Lobato et al. (2014) using random kernel features and linear models that accurately approximate the samples from p(f_1, . . . , f_K|D). In practice, we generate 10 samples from the posterior of each objective f_k(·).
Given the samples of the objectives, we must optimize them to obtain a sample from the Pareto set X⋆. Note that, unlike the true objectives, the sampled functions can be evaluated without significant cost. Thus, given these functions, we use a grid search with d × 1,000 points to solve the corresponding multi-objective problem to find X⋆, where d is the number of dimensions. Of course, in high-dimensional problems such a grid search is expected to be sub-optimal; in that case, we use the NSGA-II evolutionary algorithm (Deb et al., 2002). The Pareto set is then approximated using a representative subset of 50 points. Given such a sample of X⋆, the differential entropy of p(y|D, x, X⋆) is estimated using the expectation propagation algorithm (Minka, 2001), as described in the following section.
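As a rough illustration of the sampling step above, a cheap-to-evaluate approximate sample from a GP posterior can be built with random Fourier features (Rahimi & Recht). The sketch below is ours and makes several simplifying assumptions (a squared-exponential kernel, fixed hyper-parameters, a fixed seed); it is not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_sample_rff(X, y, lengthscale=0.3, amp=1.0, noise=0.1, m=200):
    """Draw one approximate sample from a GP posterior with a squared-
    exponential kernel, using m random Fourier features. Returns a
    cheap-to-evaluate deterministic function f(x) = theta^T phi(x)."""
    d = X.shape[1]
    W = rng.normal(size=(m, d)) / lengthscale        # spectral frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=m)          # random phases
    phi = lambda Z: np.sqrt(2.0 * amp / m) * np.cos(Z @ W.T + b)
    P = phi(X)                                       # (n, m) feature matrix
    # Bayesian linear model y = P theta + eps, theta ~ N(0, I):
    A = P.T @ P / noise**2 + np.eye(m)               # posterior precision
    L = np.linalg.cholesky(A)
    mean = np.linalg.solve(A, P.T @ y) / noise**2    # posterior mean of theta
    theta = mean + np.linalg.solve(L.T, rng.normal(size=m))  # one sample
    return lambda Z: phi(np.atleast_2d(Z)) @ theta
```

In PESMO, K such samples (one per objective) define a cheap multi-objective problem whose Pareto set is then found by grid search or NSGA-II, as described above.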
2.1. Approximating the Conditional Predictive Distribution Using Expectation Propagation
To approximate the entropy of the conditional predictive distribution p(y|D, x, X⋆) we consider the distribution p(X⋆|f_1, . . . , f_K). In particular, X⋆ is the Pareto set of f_1, . . . , f_K iff ∀x⋆ ∈ X⋆, ∀x′ ∈ X, ∃ k ∈ {1, . . . , K} such that f_k(x⋆) ≤ f_k(x′), assuming minimization. That is, each point within the Pareto set has to be better than or equal to any other point in the domain of the functions in at least one of the objectives. Let f be the set {f_1, . . . , f_K}. Informally, the conditions just described can be translated into the following un-normalized distribution for X⋆:
p(X⋆ | f) ∝ ∏_{x⋆∈X⋆} ∏_{x′∈X} [ 1 − ∏_{k=1}^K Θ(f_k(x′) − f_k(x⋆)) ] = ∏_{x⋆∈X⋆} ∏_{x′∈X} ψ(x′, x⋆),    (4)
where ψ(x′, x⋆) = 1 − ∏_{k=1}^K Θ(f_k(x′) − f_k(x⋆)), Θ(·) is the Heaviside step function, and we have used the convention that Θ(0) = 1. Thus, the r.h.s. of Eq. (4) is non-zero only for a valid Pareto set. Next, we note that in the noiseless case p(y|x, f) = ∏_{k=1}^K δ(y_k − f_k(x)), where δ(·) is the Dirac delta function; in the noisy case we simply replace the delta functions with Gaussians. We can hence write the unnormalized version of p(y|D, x, X⋆) as:
p(y | D, x, X⋆) ∝ ∫ p(y|x, f) p(X⋆|f) p(f|D) df
    ∝ ∫ ∏_{k=1}^K δ(y_k − f_k(x)) ∏_{x⋆∈X⋆} [ ψ(x, x⋆) × ∏_{x′∈X\{x}} ψ(x′, x⋆) ] p(f|D) df,    (5)
where we have separated out the factors ψ that do not depend on x, the point in which the acquisition function α(·) is going to be evaluated. The approximation to the r.h.s. of Eq. (5) is obtained in two stages. First, we approximate X with the set X̃ = {x_n}_{n=1}^N ∪ X⋆ ∪ {x}, i.e., the union of the input locations where the objective functions have been already evaluated, the current Pareto set, and the candidate location x on which α(·) should be evaluated. Then, we replace each non-Gaussian factor ψ with a corresponding approximate Gaussian factor ψ̃ whose parameters are found using expectation propagation (EP) (Minka, 2001). That is,
ψ(x′, x⋆) = 1 − ∏_{k=1}^K Θ(f_k(x′) − f_k(x⋆)) ≈ ψ̃(x′, x⋆) = ∏_{k=1}^K φ̃_k(f_k(x′), f_k(x⋆)),    (6)
where each approximate factor φ̃_k is an unnormalized two-dimensional Gaussian distribution. In particular, we set φ̃_k(f_k(x′), f_k(x⋆)) = exp{ −(1/2) υ_k^T Ṽ_k υ_k + m̃_k^T υ_k }, where we have defined υ_k = (f_k(x′), f_k(x⋆))^T, and Ṽ_k and m̃_k are parameters to be adjusted by EP, which refines each ψ̃ until convergence to enforce that it looks similar to the corresponding exact factor ψ (Minka, 2001). The approximate factors ψ̃ that do not depend on the candidate input x are reused multiple times to evaluate the acquisition function α(·), and they only have to be computed once. The |X⋆| factors that depend on x must be obtained relatively quickly to guarantee that α(·) is not very expensive to evaluate. Thus, in practice we only update those factors once using EP, i.e., they are not refined until convergence.
Once EP has been run, we approximate p(y|D, x, X⋆) by the normalized Gaussian that results from replacing each exact factor ψ by the corresponding approximate ψ̃. Note that the Gaussian distribution is closed under the product operation, and because all non-Gaussian factors in Eq. (5) have been replaced by Gaussians, the result is a Gaussian distribution. That is, p(y|D, x, X⋆) ≈ ∏_{k=1}^K N(f_k(x) | m_k^CPD, v_k^CPD), where the parameters m_k^CPD and v_k^CPD can be obtained from each ψ̃ and p(f_1, . . . , f_K|D). If we combine this result with Eq. (3), we obtain an approximation to the acquisition function in Eq. (2) that is given by the difference in entropies before and after conditioning on the Pareto sets. That is,
α(x) ≈ Σ_{k=1}^K [ log v_k^PD(x) / 2 − (1/S) Σ_{s=1}^S log v_k^CPD(x | X⋆_(s)) / 2 ],    (7)
where S is the number of Monte Carlo samples, {X⋆_(s)}_{s=1}^S are the Pareto sets sampled to approximate the expectation in Eq. (2), and v_k^PD(x) and v_k^CPD(x | X⋆_(s)) are respectively the variances of the predictive distribution at x, before and after conditioning to X⋆_(s). Last, in the case of noisy observations around each f_k(·), we just increase the predictive variances by adding the noise variance. The next evaluation is simply set to x_{N+1} = arg max_{x∈X} α(x).
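Once the conditional variances are available, evaluating Eq. (7) reduces to a few array operations. The following is a minimal sketch of ours (the array layout is an assumption; in PESMO the v^CPD values come from the EP approximation described above):

```python
import numpy as np

def pesmo_acquisition(v_pd, v_cpd):
    """Approximate PESMO acquisition of Eq. (7).

    v_pd  : (K,)   predictive variances of the K GPs at a candidate x.
    v_cpd : (S, K) conditional predictive variances at x, one row per
            sampled Pareto set X*_(s).
    Returns the scalar acquisition alpha(x) and the per-objective
    contributions alpha_k(x) of Eq. (8)."""
    v_pd = np.asarray(v_pd, dtype=float)
    v_cpd = np.asarray(v_cpd, dtype=float)
    # Entropy before conditioning minus average entropy after conditioning,
    # per objective (constant terms cancel in the difference).
    alpha_k = 0.5 * np.log(v_pd) - np.mean(0.5 * np.log(v_cpd), axis=0)
    return alpha_k.sum(), alpha_k
```

Since conditioning on a Pareto set can only shrink the predictive uncertainty, each α_k should be non-negative up to approximation error.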
Note that Eq. (7) is the sum of K functions

α_k(x) = log v_k^PD(x) / 2 − (1/S) Σ_{s=1}^S log v_k^CPD(x | X⋆_(s)) / 2,    (8)
that intuitively measure the contribution of each objective to the total acquisition. In a decoupled evaluation setting, each α_k(·) can be individually maximized to identify the location x_k^op = arg max_{x∈X} α_k(x), on which it is expected to be most useful to evaluate each of the K objectives. The objective k with the largest individual acquisition α_k(x_k^op) can then be chosen for evaluation in the next iteration. This approach is expected to reduce the entropy of the posterior over the Pareto set more quickly, i.e., with a smaller number of evaluations of the objectives, and to lead to better results.
The total computational cost of evaluating the acquisition function α(x) includes the cost of running EP, which is O(Km³), where m = N + |X⋆_(s)|, N is the number of observations made and K is the number of objectives. This is done once per each sample X⋆_(s). After this, we can re-use the factors that are independent of the candidate location x. The cost of computing the predictive variance at each x is hence O(K|X⋆_(s)|³). In our experiments, the size of the Pareto set sample X⋆_(s) is 50, which means that m is a few hundred at most. The supplementary material contains additional details about the EP approximation to Eq. (5).
3. Related Work

ParEGO is another method for multi-objective Bayesian optimization (Knowles, 2006). ParEGO transforms the multi-objective problem into a single-objective problem using a scalarization technique: at each iteration, a vector of K weights θ = (θ_1, . . . , θ_K)^T, with θ_k ∈ [0, 1] and Σ_{k=1}^K θ_k = 1, is sampled at random from a uniform distribution. Given θ, a single-objective function is built:

f_θ(x) = max_{k=1,...,K} (θ_k f_k(x)) + ρ Σ_{k=1}^K θ_k f_k(x),    (9)
where ρ is set equal to 0.05. See (Nakayama et al., 2009, Sec. 1.3.3) for further details. After step N of the optimization process, and given θ, a new set of N observations of f_θ(·) are obtained by evaluating this function at the already observed points {x_n}_{n=1}^N. Then, a GP model is fit to the new data and expected improvement (Mockus et al., 1978; Jones et al., 1998) is used to find the location of the next evaluation x_{N+1}. The cost of evaluating the acquisition function in ParEGO is O(N³), where N is the number of observations made. This is the cost of fitting the GP to the new data (only done once). Thus, ParEGO is a simple and fast technique. Nevertheless, it is often outperformed by more advanced approaches (Ponweiser et al., 2008).
SMSego is another technique for multi-objective Bayesian optimization (Ponweiser et al., 2008). The first step in SMSego is to find a set of Pareto points X̃⋆, e.g., by optimizing the posterior means of the GPs, or by finding the non-dominated observations. Consider now an optimistic estimate of the objectives at input location x given by m_k^PD(x) − c · v_k^PD(x)^{1/2}, where c is some constant, and m_k^PD(x) and v_k^PD(x) are the posterior mean and variance of the kth objective at location x, respectively. The acquisition value computed at a candidate location x ∈ X by SMSego is given by the gain in hyper-volume obtained by the corresponding optimistic estimate, after an ε-correction has been made. The hyper-volume is simply the volume of points in functional space above the Pareto front (this is simply the function space values associated to the Pareto set), with respect to a given reference point (Zitzler & Thiele, 1999). Because the hyper-volume is maximized by the actual Pareto set, it is a natural measure of performance. Thus, SMSego does not reduce the problem to a single objective. However, at each iteration it has to find a set of Pareto points and to fit a different GP to each one of the objectives. This gives a computational cost that is O(KN³). Finally, evaluating the gain in hyper-volume at each candidate location x is also more expensive than the computation of expected improvement in ParEGO.
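For the two-objective case (minimization), the hyper-volume with respect to a reference point can be computed by sweeping the front sorted along the first objective. The small helper below is ours, for illustration only, and is not taken from any of the cited methods:

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Hyper-volume dominated by a 2-objective Pareto front (minimization)
    with respect to a reference point ref; larger is better.

    front : (n, 2) array of mutually non-dominated objective vectors.
    ref   : (2,) reference point dominated by every front point."""
    front = np.asarray(front, dtype=float)
    # Sort by the first objective; for a non-dominated front the second
    # objective then strictly decreases, so we can sum disjoint strips.
    order = np.argsort(front[:, 0])
    hv, y_prev = 0.0, ref[1]
    for f1, f2 in front[order]:
        hv += (ref[0] - f1) * (y_prev - f2)
        y_prev = f2
    return hv
```

For K > 2 objectives exact computation becomes much more expensive, which is at the root of the exponential cost of the hyper-volume-based methods discussed in this section.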
A similar method to SMSego is the Pareto active learning (PAL) algorithm (Zuluaga et al., 2013). At iteration N, PAL uses the GP prediction for each point x ∈ X to maintain an uncertainty region R_N(x) about the objective values associated with x. This region is defined as the intersection of R_{N−1}(x), i.e., the uncertainty region in the previous iteration, and Q_c(x), defined as the hyper-rectangle with lower corner given by m_k^PD(x) − c · v_k^PD(x)^{1/2}, for k = 1, . . . , K, and upper corner given by m_k^PD(x) + c · v_k^PD(x)^{1/2}, for k = 1, . . . , K, for some constant c. Given these regions, PAL classifies each point x ∈ X as Pareto-optimal, non-Pareto-optimal or uncertain. A point is classified as Pareto-optimal if the worst value in R_N(x) is not dominated by the best value in R_N(x′), for any other x′ ∈ X, with an ε tolerance. A point is classified as non-Pareto-optimal if the best value in R_N(x) is dominated by the worst value in R_N(x′) for some other x′ ∈ X, with an ε tolerance. All other points remain uncertain. After the classification, PAL chooses the uncertain point x with the largest uncertainty region R_N(x). The total computational cost of PAL is hence similar to that of SMSego.
The expected hyper-volume improvement (EHI) (Emmerich, 2008) is a natural extension of expected improvement to the multi-objective setting (Mockus et al., 1978; Jones et al., 1998). Given the predictive distribution of the GPs at a candidate input location x, the acquisition is the expected increment of the hyper-volume of a candidate Pareto set X̃⋆. Thus, EHI also needs to find a Pareto set X̃⋆. This set can be obtained as in SMSego. A difficulty is, however, that computing the expected increment of the hyper-volume is very expensive. For this, the output space is divided in a series of cells, and the probability of improvement is simply obtained as the probability that the observation made at x lies in a non-dominated cell. This involves a sum across all non-dominated cells, whose number grows exponentially with the number of objectives K. In particular, the total number of cells is (|X̃⋆| + 1)^K. Thus, although some methods have been suggested to speed up its calculation, e.g., (Hupkens et al., 2014; Feliot et al., 2015), EHI is only feasible for 2 or 3 objectives at most.
Sequential uncertainty reduction (SUR) is another method proposed for multi-objective Bayesian optimization (Picheny, 2015). The working principle of SUR is similar to that of EHI. However, SUR considers the probability of improving the hyper-volume in the whole domain of the objectives X. Thus, SUR also needs to find a set of Pareto points X̃⋆. These can be obtained as in SMSego. The acquisition computed by SUR is simply the expected decrease in the area under the probability of improving the hyper-volume, after evaluating the objectives at a new candidate location x. The SUR acquisition is also computed by dividing the output space in a total of (|X̃⋆| + 1)^K cells, and the area under the probability of improvement is obtained using a Sobol sequence as the integration points. Although some grouping of the cells has been suggested (Picheny, 2015), SUR is an extremely expensive criterion that is only feasible for 2 or 3 objectives at most.
The proposed approach, PESMO, differs from the methods described in this section in that 1) it does not transform the multi-objective problem into a single-objective one, 2) the acquisition function of PESMO can be decomposed as the sum of K individual acquisition functions, which allows for decoupled evaluations, and 3) the computational cost of PESMO is linear in the total number of objectives K.
4. Experiments

We compare PESMO with the other strategies described in Section 3: ParEGO, SMSego, EHI and SUR. We do not compare results with PAL because it is expected to give similar results to those of SMSego, as both methods are based on a lower confidence bound. We have coded all these methods in the software for Bayesian optimization Spearmint (https://github.com/HIPS/Spearmint). We use a Matérn covariance function for the GPs and all hyper-parameters (noise, length-scales and amplitude) are approximately sampled from their posterior distribution (we generate 10 samples from this distribution). The acquisition function of each method is averaged over these samples. In ParEGO we consider a different scalarization (i.e., a different value of θ) for each sample of the hyper-parameters. In SMSego, EHI and SUR, for each hyper-parameter sample we consider a different Pareto set X̃⋆, obtained by optimizing the posterior means of the GPs. The resulting Pareto set is extended by including all non-dominated observations. At iteration N, each method gives a recommendation in the form of a Pareto set obtained by optimizing the posterior means of the GPs. The acquisition function of each method is maximized using L-BFGS (a grid of size 1,000 is used to find a good starting point). The gradients of the acquisition function are approximated by differences (except in ParEGO).
4.1. Accuracy of the PESMO Approximation
One question is whether the proposed approximations are sufficiently accurate for the effective identification of the Pareto set. We compare, in a one-dimensional problem with 2 objectives, the acquisition function computed by PESMO with a more accurate estimate obtained via expensive Monte Carlo sampling and a non-parametric estimator of the entropy (Singh et al., 2003). Figure 1 (top) shows, at a given step, the observed data and the posterior mean and standard deviation of each objective. The figure at the bottom shows the acquisition function computed by PESMO and by the Monte Carlo method (Exact). Both functions look very similar, including the location of the global maximizer. This indicates that (7), obtained by expectation propagation, is potentially a good approximation of (2), the exact acquisition. The supplementary material has extra results showing that the individual acquisition functions computed by PESMO, i.e., α_k(·), for k = 1, 2, are also accurate.
4.2. Experiments with Synthetic Objectives
We compare PESMO with other approaches in a 3-dimensional problem with 2 objectives obtained by sampling the functions from the GP prior. We generate 100 of these problems and report the average performance when considering noiseless observations and when the observations are contaminated with Gaussian noise with standard deviation equal to 0.1. The performance metric is the hyper-volume indicator, which is maximized by the actual Pareto set (Zitzler & Thiele, 1999). At each iteration we report the logarithm of the relative difference between the hyper-volume of the actual Pareto set (obtained by optimizing the actual objectives) and the hyper-volume of the recommendation (obtained by optimizing the posterior means of the GPs). Figure 2 (left-column) shows, as a function of the evaluations made, the average performance of each method with error bars. PESMO obtains the best results, and when executed in a decoupled scenario, slight improvements are observed (only with noisy observations).
Table 1 shows the average time in seconds needed to determine the next evaluation in each method. The fastest method is ParEGO, followed by SMSego and PESMO. The decoupled version of PESMO, PESMOdec, takes more time because it optimizes α_1(·) and α_2(·). The slowest methods are EHI and SUR. Most of their cost is incurred in the last iterations, in which the Pareto set size, |X̃⋆|, is large. The cost of evaluating the acquisition function in EHI and SUR is O((|X̃⋆| + 1)^K), leading to expensive optimization via L-BFGS. In PESMO the cost of evaluating α(·) is O(K|X⋆_(s)|³) because K linear systems are solved. These computations are faster because they are performed by the OpenBLAS library, which is optimized for each processor. The acquisition function of EHI and SUR does not involve solving linear systems, so these methods cannot use OpenBLAS. Note that we also keep |X⋆_(s)| = 50 fixed in PESMO.

Figure 1. (top) Observations of each objective and posterior mean and standard deviations of each GP model. (bottom) Estimates of the acquisition function (2) by PESMO, and by a Monte Carlo method combined with a non-parametric estimator of the entropy (Exact), which is expected to be more accurate. Best seen in color.
Table 1. Avg. time in seconds doing calculations per iteration.

  PESMO     PESMOdec   ParEGO    SMSego    EHI        SUR
  33±1.0    52±2.5     11±0.2    16±1.3    405±115    623±59
We have carried out additional synthetic experiments with 4 objectives on a 6-dimensional input space. In this case, EHI and SUR become infeasible, so we do not compare results with them. Again, we sample the objectives from the GP prior. Figure 2 (right-column) shows, as a function of the evaluations made, the average performance of each method. The best method is PESMO, and in this case, the decoupled evaluation performs significantly better. This improvement is because in the decoupled setting, PESMO identifies the most difficult objectives and evaluates them more times. In particular, because there are 4 objectives, it is likely that some objectives are more difficult than others just by chance. Figure 3 illustrates this behavior for a representative case in which the first two objectives are non-linear (difficult) and the last two objectives are linear (easy). We note that the decoupled version of PESMO evaluates the first two objectives almost three times more often.

Figure 2. (left-column) Average log relative difference between the hyper-volume of the recommendation and the maximum hyper-volume for each number of evaluations made. We consider noiseless (top) and noisy (bottom) observations. The problem considered has 2 objectives and 3 dimensions. (right-column) Similar results for a problem with 4 objectives and 6 dimensions. We do not compare results with EHI and SUR because they are infeasible due to their exponential cost with respect to the number of objectives. Best seen in color.
Figure 3. (top) Contour curves of 4 illustrative objectives on 6 dimensions, obtained by changing the first two dimensions in input space while keeping the other 4 fixed to zero. The first 2 objectives are non-linear while the last 2 objectives are linear. (bottom) Number of evaluations of each objective done by PESMOdecoupled as a function of the iterations performed, N. Best seen in color.
4.3. Finding a Fast and Accurate Neural Network
We consider the MNIST dataset (LeCun et al., 1998) and evaluate each method on the task of finding a neural network with low prediction error and small prediction time. These are conflicting objectives because reducing the prediction error involves larger networks, which take longer at test time. We consider feed-forward networks with ReLUs at the hidden layers and a soft-max output layer. The networks are coded in the Keras library and trained using Adam (Kingma & Ba, 2014) with a mini-batch size of 4,000 instances for 150 epochs. The adjustable parameters are: the number of hidden units per layer (between 50 and 300), the number of layers (between 1 and 3), the learning rate, the amount of dropout, and the level of ℓ1 and ℓ2 regularization. The prediction error is measured on a set of 10,000 instances extracted from the training set. The rest of the training data, i.e., 50,000 instances, is used for training. We consider a logit transformation of the prediction error because the error rates are very small. The prediction time is measured as the average time required for making 10,000 predictions. We compute the logarithm of the ratio between the prediction time of the network and the prediction time of the fastest network (i.e., a single hidden layer with 50 units). When measuring the prediction time we do not train the network and instead use random weights (the time objective is also set to ignore irrelevant parameters). Thus, the problem is suited to a decoupled evaluation because both objectives can be evaluated separately. We run each method for a total of 200 evaluations of the objectives and report results after 100 and 200 evaluations. Because there is no ground truth and the objectives are noisy, we re-evaluate 3 times the values associated with the recommendations made by each method (in the form of a Pareto set) and average the results.
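The two objective transformations described above can be sketched as follows. This is only an illustrative reconstruction, not the authors' code; the error rate and timing values passed in are assumed to come from hypothetical training and timing routines.

```python
import math

def error_objective(error_rate):
    """Logit transform of the validation error rate, used because
    the error rates on MNIST are very small (0 < error_rate < 1)."""
    return math.log(error_rate / (1.0 - error_rate))

def time_objective(pred_time, fastest_time):
    """Log of the ratio between the network's prediction time and
    that of the fastest architecture (one hidden layer, 50 units)."""
    return math.log(pred_time / fastest_time)
```

The logit spreads out error rates near zero, and the log-ratio makes the time objective exactly zero for the fastest network.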
Figure 4. Avg. Pareto fronts obtained by each method after 100
(left) and 200 (right) evaluations of the objectives. Best seen in
color.
Then, we compute the Pareto front (i.e., the function-space values of the Pareto set) and its hyper-volume. We repeat these experiments 50 times and report the average results.
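The evaluation protocol just described, extracting the non-dominated front from a set of observed objective values and measuring its hyper-volume, can be sketched as follows for two minimized objectives. This is an illustrative implementation, not the one used for the experiments; the reference point `ref` is an assumption and must be dominated by every point of the front.

```python
def pareto_front(points):
    """Return the non-dominated subset of `points` (all objectives minimized)."""
    return [p for p in points
            if not any(q != p and all(qi <= pi for qi, pi in zip(q, p))
                       for q in points)]

def hypervolume_2d(front, ref):
    """Area dominated by a 2-D front with respect to the reference
    point `ref` (both objectives minimized); higher is better."""
    hv, prev_y = 0.0, ref[1]
    for x, y in sorted(front):  # ascending in the first objective
        hv += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return hv
```

For example, the front of {(0, 0.5), (0.5, 0), (0.6, 0.6)} is {(0, 0.5), (0.5, 0)}, whose hyper-volume relative to the reference point (1, 1) is 0.75.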
Table 2 shows the hyper-volumes obtained in the experiments (the higher, the better). The best results after 100 evaluations of the objectives correspond to the decoupled version of PESMO, followed by SUR and then by the coupled version. When 200 evaluations are made, the best method is PESMO in either setting, i.e., coupled or decoupled. After PESMO, SUR gives the best results, followed by SMSego and EHI. ParEGO is the worst-performing method in either setting. In summary, PESMO gives the best overall results, and its decoupled version performs much better than the other methods when the number of evaluations is small.
Table 2. Avg. hyper-volume after 100 and 200 evaluations.
# Eval.  PESMO     PESMOdec  ParEGO     SMSego    EHI       SUR
100      66.2±.2   67.6±.1   62.9±1.2   65.0±.3   64.0±.9   66.6±.2
200      67.8±.1   67.8±.1   66.1±.2    67.1±.2   66.6±.2   67.2±.1
Figure 4 shows the average Pareto front obtained by each method after 100 and 200 evaluations of the objectives. The results displayed are consistent with the ones in Table 2. In particular, PESMO is able to find networks that are faster than the ones found by the other methods, for a similar prediction error on the validation set. This is especially the case for PESMO when executed in the decoupled setting, after making only 100 evaluations of the objectives. We also note that PESMO finds the most accurate networks, with almost 1.5% prediction error on the validation set.
Figure 5. Number of evaluations of each objective made by PESMOdecoupled, as a function of the iteration number N, in the problem of finding good neural networks. Best seen in color.
The good results obtained by PESMOdecoupled are explained by Figure 5, which shows the average number of evaluations of each objective. More precisely, the objective that measures the prediction time is evaluated just a few times. This makes sense because it depends on only two parameters, i.e., the number of layers and the number of hidden units per layer. It is hence simpler than the prediction error. PESMOdecoupled is able to detect this and focuses on the evaluation of the prediction error. Of course, evaluating the prediction error more often is more expensive, since it involves training the neural network more times. Nevertheless, this shows that PESMOdecoupled is able to successfully discriminate between easy and difficult objective functions.
The supplementary material has extra experiments comparing each method on the task of finding an ensemble of decision trees of small size and good prediction accuracy. The results obtained are similar to the ones reported here.
5. Conclusions
We have described PESMO, a method for multi-objective Bayesian optimization. At each iteration, PESMO evaluates the objective functions at the input location that is expected to most reduce the entropy of the posterior estimate of the Pareto set. Several synthetic experiments show that PESMO performs better than other methods from the literature. That is, PESMO obtains better recommendations with a smaller number of evaluations, in the case of both noiseless and noisy observations. Furthermore, the acquisition function of PESMO can be understood as a sum of K individual acquisition functions, one for each of the K objectives. This allows for a decoupled evaluation scenario, in which the most promising objective is identified by maximizing the individual acquisition functions. When run in a decoupled evaluation setting, PESMO is able to identify the most difficult objectives and, by focusing on their evaluation, provides better results. This behavior of PESMO has been illustrated on a multi-objective optimization problem that consists of finding an accurate and fast neural network. Finally, the computational cost of PESMO is small. In particular, it scales linearly with the number of objectives K. Other methods have an exponential cost with respect to K, which makes them infeasible for more than 3 objectives.
Acknowledgments
Daniel Hernández-Lobato gratefully acknowledges the use of the facilities of Centro de Computación Científica (CCC) at Universidad Autónoma de Madrid. This author also acknowledges financial support from the Spanish Plan Nacional I+D+i, Grants TIN2013-42351-P and TIN2015-70308-REDT, and from Comunidad de Madrid, Grant S2013/ICE-2845 CASI-CAM-CM. José Miguel Hernández-Lobato acknowledges financial support from the Rafael del Pino Foundation. Amar Shah acknowledges support from the Qualcomm Innovation Fellowship program. Ryan P. Adams acknowledges support from the Alfred P. Sloan Foundation.
References
Ariizumi, R., Tesch, M., Choset, H., and Matsuno, F. Expensive multiobjective optimization for robotics with consideration of heteroscedastic noise. In 2014 IEEE International Conference on Intelligent Robots and Systems, pp. 2230–2235, 2014.
Collette, Y. and Siarry, P. Multiobjective Optimization: Principles and Case Studies. Springer, 2003.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. 2014. arXiv:1412.6980.
Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6:182–197, 2002.
Emmerich, A. The computation of the expected improvement in dominated hypervolume of Pareto front approximations. Technical Report LIACS TR-4-2008, Leiden University, The Netherlands, 2008.
Feliot, P., Bect, J., and Vazquez, E. A Bayesian approach to constrained single- and multi-objective optimization. 2015. arXiv:1510.00503 [stat.CO].
Gelbart, M. A., Snoek, J., and Adams, R. P. Bayesian optimization with unknown constraints. In Thirtieth Conference on Uncertainty in Artificial Intelligence, 2014.
Hennig, P. and Schuler, C. J. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.
Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems 27, pp. 918–926, 2014.
Houlsby, N., Hernández-Lobato, J. M., Huszár, F., and Ghahramani, Z. Collaborative Gaussian processes for preference learning. In Advances in Neural Information Processing Systems 25, pp. 2096–2104, 2012.
Hupkens, I., Emmerich, M., and Deutz, A. Faster computation of expected hypervolume improvement. 2014. arXiv:1408.7114 [cs.DS].
Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
Knowles, J. ParEGO: A hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Transactions on Evolutionary Computation, 10:50–66, 2006.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Li, X. A non-dominated sorting particle swarm optimizer for multiobjective optimization. In Genetic and Evolutionary Computation, GECCO 2003, pp. 37–48, 2003.
Minka, T. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT, 2001.
Mockus, J., Tiesis, V., and Zilinskas, A. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117-129):2, 1978.
Nakayama, H., Yun, Y., and Yoon, M. Sequential Approximate Multiobjective Optimization Using Computational Intelligence. Springer, 2009.
Picheny, V. Multiobjective optimization using Gaussian process emulators via stepwise uncertainty reduction. Statistics and Computing, 25:1265–1280, 2015.
Ponweiser, W., Wagner, T., Biermann, D., and Vincze, M. Multiobjective optimization on a limited budget of evaluations using model-assisted S-metric selection. In Parallel Problem Solving from Nature, PPSN X, pp. 784–794, 2008.
Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.
Shah, A. and Ghahramani, Z. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems 28, pp. 3312–3320, 2015.
Singh, H., Misra, N., Hnizdo, V., Fedorowicz, A., and Demchuk, E. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences, 23:301–321, 2003.
Villemonteix, J., Vazquez, E., and Walter, E. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44:509–534, 2009.
Zitzler, E. and Thiele, L. Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto approach. IEEE Transactions on Evolutionary Computation, 3:257–271, 1999.
Zuluaga, M., Krause, A., Sergent, G., and Püschel, M. Active learning for multi-objective optimization. In International Conference on Machine Learning, pp. 462–470, 2013.