
BAYESIAN LEARNING TECHNIQUES FOR

NONPARAMETRIC IDENTIFICATION

Marta Neve

Tutor: Prof. Giuseppe De Nicolao

Dept. of Computer Engineering

and Systems Science

University of Pavia

2002-2005


Acknowledgements

This thesis collects most of the work I have carried out during the last three years at the Department of Computer Engineering and Systems Science of the University of Pavia.

I am heartily grateful to my supervisor, Prof. Giuseppe De Nicolao, for his never-ending support and encouragement throughout this “adventure”. His outstanding experience in the field of identification together with his bright intelligence urged me to progress day by day. I also owe him a great debt of gratitude for his friendship: it will always be precious to me.

I would like to thank Prof. Tomaso Poggio for giving me the opportunity to spend some months working at the CBCL laboratory at MIT. My experience in Boston has been very stimulating and helped me to develop an open-minded attitude. There are people I have met there that I will never forget.

I wish to express my gratitude to all the people working in the laboratory of identification and control of dynamic systems: they created a friendly atmosphere while, at the same time, offering me their constant help whenever I needed it.

I am indebted to Claudio for sharing these years with me, from the very beginning onwards. He always believed in me and encouraged me to face all the difficulties I have met along the way. I wish him all the best in the future.

My special thanks go to my mother for her constant fondness and trust in my capabilities. The strength she proved to have is an incessant source of inspiration for me. This thesis is dedicated to her and is written in memory of my father.


Contents

Acknowledgements

Acronyms

1 Introduction
  1.1 Thesis Overview

2 Bayesian Learning
  2.1 Gaussian Processes
  2.2 The Bayes Estimate

3 Nonparametric Bayesian Analysis
  3.1 Stochastic Population Model
    3.1.1 Typical and Individual Curves
    3.1.2 Regularization Network
  3.2 Population Splines
    3.2.1 Estimating the Typical Curve
    3.2.2 Estimating the Individual Curves
    3.2.3 Estimating the Hyper-Parameters
  3.3 Initial Conditions
  3.4 Examples
    3.4.1 AUC Estimation
    3.4.2 Simulated Example: Sparsely Sampled Data
    3.4.3 Analysis of Pharmacokinetic Data
  3.5 Concluding Remarks
  3.6 Appendix: Technical Lemma

4 An MCMC Approach
  4.1 The Population Model
  4.2 MCMC Estimation of the Sampled Model
    4.2.1 MCMC Estimation
    4.2.2 Hyper-Parameter Priors
    4.2.3 Full Conditional Distributions
    4.2.4 The Algorithm
  4.3 Estimation of the Continuous-Time Signals
    4.3.1 The Posterior Expectation
    4.3.2 Population Regularization Network
    4.3.3 The Posterior Variance
  4.4 Examples
    4.4.1 Simulated Example: Sparsely Sampled Data
    4.4.2 Analysis of Pharmacokinetic Data
  4.5 Concluding Remarks

5 Identification of Engine Maps
  5.1 Experimental Setup
  5.2 Engine Model
  5.3 Engine Map Estimation
    5.3.1 Model Definition
    5.3.2 Parameter Estimation
  5.4 Results
    5.4.1 Map Estimation
    5.4.2 Dynamic Validation
  5.5 Concluding Remarks

6 Active Learning
  6.1 Preliminaries
  6.2 Active Learning
  6.3 Bayesian Networks
  6.4 Air Flow Map
  6.5 Concluding Remarks

Bibliography


List of Figures

3.1 Regularization Network structure of the EB estimator
3.2 AUC example: estimated vs. true typical curve
3.3 AUC example: performance of the discrete-time and the continuous-time approaches on 500 data sets
3.4 Simulated example: noisy measurements and real curves
3.5 Simulated example: true vs. estimated typical curve
3.6 Simulated example: true vs. estimated individual curve #4
3.7 Simulated example: true vs. estimated individual curve #6
3.8 Simulated example: RMSE of the typical curve and the individual curves on 500 data sets
3.9 Pharmacokinetic data: average and real curves
3.10 Pharmacokinetic data: estimated typical curve
3.11 Pharmacokinetic data: estimated individual curve #5
3.12 Pharmacokinetic data: estimated individual curve #19
3.13 Lemma 1: graphical interpretation in terms of projections in the Hilbert space of jointly normal random variables

4.1 Regularization Network structure of the MCMC estimator
4.2 Simulated example: histogram of λ²
4.3 Simulated example: histogram of λ̃²
4.4 Simulated example: scatter plot of λ² and λ̃²
4.5 Simulated example: true vs. estimated typical curve
4.6 Simulated example: true vs. estimated individual curve #4
4.7 Simulated example: true vs. estimated individual curve #6
4.8 Simulated example: RMSE of the typical curve and the individual curves on 500 data sets
4.9 Pharmacokinetic data: histogram of λ²
4.10 Pharmacokinetic data: histogram of λ̃²
4.11 Pharmacokinetic data: scatter plot of λ² and λ̃²
4.12 Pharmacokinetic data: estimated typical curve
4.13 Pharmacokinetic data: estimated individual curve #5
4.14 Pharmacokinetic data: estimated individual curve #19

5.1 Air flow map: standard vs. regularized RBFNN approaches (3D)
5.2 Air flow map: standard vs. regularized RBFNN approaches (2D)
5.3 Mean indicated torque map: standard vs. regularized RBFNN approaches (3D)
5.4 Mean friction torque map: standard vs. regularized RBFNN approaches (3D)
5.5 Simulink model: architecture of the overall model
5.6 Simulink model: architecture of the Intake Manifold subsystem
5.7 Simulink model: architecture of the Torque Generation subsystem
5.8 Dynamic validation: simulated vs. experimental manifold pressure
5.9 Dynamic validation: simulated vs. experimental crankshaft speed

6.1 Coincident regions: estimated λ² as a function of the number of data
6.2 Coincident regions: data selected by the two-phase procedure
6.3 Coincident regions: RMSE between estimated and true air flow map
6.4 Coincident regions: air flow map estimated using 10, 20 and 50 data and true map
6.5 Distinct regions: sampling region vs. region of interest
6.6 Distinct regions: estimated λ² as a function of the number of data
6.7 Distinct regions: data selected by the two-phase procedure
6.8 Distinct regions: RMSE between estimated and true air flow map
6.9 Distinct regions: air flow map estimated using 10, 20 and 50 data and true map


Acronyms

AIC    Akaike Information Criterion
AUC    Area Under the Curve
CV     Coefficient of Variation
DOE    Design Of Experiment
EB     Empirical Bayes
ECU    Electronic Control Unit
FPE    Final Prediction Error
GCV    Generalized Cross Validation
GP     Gaussian Process
LT     Load Torque
MCMC   Markov Chain Monte Carlo
MET    Mean Effective Torque
MFT    Mean Friction Torque
MIT    Mean Indicated Torque
ML     Maximum Likelihood
MLP    Multi Layer Perceptron
MSE    Mean Square Error
MVM    Mean Value Model
OCV    Ordinary Cross Validation
RBF    Radial Basis Function
RBFNN  Radial Basis Function Neural Network
RKHS   Reproducing Kernel Hilbert Space
RMSE   Root Mean Square Error
RN     Regularization Network
SALP1  Sequential Active Learning Problem 1
SALP2  Sequential Active Learning Problem 2
SSR    Sum of Squared Residuals


Chapter 1

Introduction

A ubiquitous problem in engineering and science is that of reconstructing functions of one or more variables from discrete and sparse noisy data. This task goes under different names such as regression, function approximation, identification and learning. Among the possible approaches there are: parameter estimation, regularization theory, spline approximation, neural networks and Bayesian learning, which is the framework adopted in this thesis. The strength and the elegance of the Bayesian approach stem from its rigorous processing of the assumptions that leads directly to the final estimate. In fact, the Bayes formula provides the posterior expectation, i.e. the result of the estimation, from the model (the likelihood) and the assumptions (the prior).

In many real-world problems, the key point is finding “the right prior” that incorporates all the available a-priori knowledge without unnecessarily biasing the estimate. A widespread and successful strategy relies on hierarchical priors including unknown tunable hyper-parameters to be learnt from the data.

Within the framework of Bayesian learning this thesis addresses two major challenges. The first one concerns the joint estimation of a population of univariate curves from infrequently and non-uniformly sampled data. Such a population identification problem is of vital importance in the preclinical phase of drug development. In this context, few data are available for each subject, so that individual models cannot be identified separately. The main contributions are the formulation of a proper prior, the derivation of the structure and properties of the Bayes estimate, and the development of effective computational strategies.

The second challenge regards the estimation of a function of two variables satisfying specific boundary conditions. The motivation for this research issue is the problem of reconstructing engine maps from sets of workbench data. The main features that must be taken into account are: (i) the difficulty of collecting many data; (ii) the need for good extrapolation properties in order to simulate the dynamic behaviour of the engine even outside the region where static data can be measured; (iii) the existence of specific boundary conditions that must be satisfied by the map. The main results are the formulation of a prior compatible with all prior knowledge, the design of an optimal sampling strategy maximizing the informative content of the data, and the successful validation carried out on experimental dynamic tests.


1.1 Thesis Overview

Bayesian Learning

This chapter introduces the main theoretical framework at the basis of Bayesian learning. The problem of selecting suitable prior distributions is addressed, focusing on Gaussian Processes (GPs) and Bayesian parametric models. In particular, the basics of GPs are recalled, so that the problem of selecting proper priors turns into the problem of choosing and/or tuning proper covariance functions. Finally, the close relationship existing between Bayesian estimation and Tychonov regularization theory is highlighted.

Nonparametric Bayesian Analysis for Population Data

Population models are used to describe the dynamics of different subjects belonging to a population and play an important role in drug pharmacokinetics. In the present chapter a nonparametric identification scheme is proposed in which both the average impulse response of the population and the individual ones are modelled as Gaussian stochastic processes. Assuming that the average curve is an integrated Wiener process, it is shown that its estimate is a cubic spline. An Empirical Bayes algorithm for estimating both the typical and the individual curves is worked out. The model is tested on simulated data sets as well as on xenobiotics pharmacokinetic data.

The material of this chapter partially appears in [46], [47], [48].

M. Neve, G. De Nicolao, L. Marchesi. Fixed interval smoothing of population pharmacokinetic data. Proc. of the 16th IFAC World Congress, Fr-A19-TO/6, Prague, Czech Republic, July 4-8, 2005.

M. Neve, G. De Nicolao, L. Marchesi. Identification of pharmacokinetic models via population smoothing splines. ANIPLA-BIOSYS 2005, Milan, Italy, June 9-10, 2005.

M. Neve, G. De Nicolao, L. Marchesi. Nonparametric identification of population models via Gaussian processes. Automatica. Provisionally accepted as a Regular Paper.

Nonparametric Analysis for Population Data: an MCMC Approach

The chapter deals with the nonparametric identification of population models, that is, models that explain jointly the behaviour of different subjects drawn from a population, e.g. responses of different patients to a drug. The average response of the population and the individual responses are modelled as continuous-time Gaussian processes with unknown hyper-parameters. The posterior expectation and variance of both the average and individual curves are computed by means of a Markov Chain Monte Carlo scheme. The model and the estimation procedure are tested on xenobiotics pharmacokinetic data.

The material of this chapter partially appears in [49].

M. Neve, G. De Nicolao, L. Marchesi. Nonparametric identification of population pharmacokinetic models: an MCMC approach. Proc. of the 24th American Control Conference, pp. 991-996, Portland, USA, June 8-10, 2005.

Nonparametric Identification of Engine Maps

In this chapter a new methodology for the identification of engine maps from static data is presented. In order to enhance the flexibility of the model and exploit prior knowledge on the boundary conditions of the maps, a basis function neural network with a large number of neurons is used. To ensure smoothness of the estimated map as well as guarantee reliable extrapolation properties, the weights are estimated via a regularization strategy. Dynamic data are used to validate the new methodology. For this purpose, the estimated maps are included in a Mean Value Model whose simulated manifold pressure and crankshaft speed are compared with the experimental ones. The results show a clear improvement with respect to the performance obtained with standard Radial Basis Function Networks.

The material of this chapter partially appears in [50], [51], [52], [53].

M. Neve, G. De Nicolao, G. Prodi, C. Siviero. Estimation of engine maps: a regularized basis-function networks approach. Submitted for publication.

M. Neve, G. De Nicolao, G. Prodi, C. Siviero. Stima di mappe motore mediante reti neurali di regolarizzazione. Atti della Fondazione Ronchi, Anno LIX, 2004, 1, pp. 113-116. Atti della II Giornata di Studio su “Applicazione delle Reti Neurali nell’Ingegneria Elettrica e dell’Informazione”, Pavia, Italy, May 26-27, 2003.

M. Neve, G. De Nicolao, G. Prodi, C. Siviero. Nonparametric estimation of engine maps using regularized basis-function networks. Proc. of the First IFAC Symposium on Advances in Automotive Control, pp. 339-345, Salerno, Italy, April 19-23, 2004.

M. Neve, G. De Nicolao, G. Prodi, C. Siviero. Nonparametric neural identification of the air flow map. Proc. of the 6th International Conference on Engine for Automobile, SAE Paper SAE-NA 2003-01-08, Capri, Italy, September 14-19, 2003.

Active Learning Strategies for the Neural Estimation of Engine Maps

The chapter deals with the sequential optimal design of experiments for the reconstruction of engine maps from static data by means of artificial neural networks. Since the collection of static data is expensive and time consuming, the problem of selecting the most informative set of experiments arises. In the neural network literature, this problem goes under the name of active learning and is solved by either optimizing the entropy or the variance of the estimation error. If Multi Layer Perceptron (MLP) networks are used, the selection of the next experimental point is based on the linearization of the model around the current parameter values. Conversely, if linear-in-parameter models are employed, such as RBF networks with fixed basis functions, the entire sequence of optimal experiments can be computed in advance without knowing the results of the previous experiments. In the chapter, it is shown that for Bayesian RBF networks the optimal sequence of experiments can be designed in advance even when the regularization parameter has to be tuned. In fact, within an optimal sequential design scheme, it is seen that the so-called regularization parameter tends towards a constant, so that it can be fixed from a certain step onward. The proposed active learning strategies are applied to the estimation of the map that gives air flow as a function of manifold pressure and crankshaft speed.

The material of this chapter partially appears in [45].

M. Neve, G. De Nicolao. Active learning strategies for the neural estimation of engine maps. In S. Kalogirou, editor, Artificial Intelligence in Energy and Renewable Energy Systems, Nova Publishers Inc., 2006. In press.


Chapter 2

Bayesian Learning

In the present chapter, the fundamentals of Bayesian learning that permeate the whole thesis are introduced. The aims of this thesis are the simultaneous estimation of several functions of one variable and the reconstruction of a single function of two variables, starting in both cases from discrete and sparse noisy data. In this preliminary chapter, let us consider a single unknown function which is indicated as

f(\cdot) : \mathbb{R}^d \to \mathbb{R}

while the available noisy observations are

y_i = f(x_i) + v_i, \qquad i = 1, \dots, n.

In the following, the additive measurement errors vi are assumed to be Gaussian random variables with zero mean and variance Var[vi] = σ², while the input-output training set is denoted by Dn = {(xi, yi), i = 1, . . . , n}. According to the Bayesian approach, the unknown function is provided with a probabilistic model which incorporates all the a-priori knowledge in the form of a probability measure, p(f), commonly known as the prior. Note that p(f) is usually defined over a suitable functional space to which the unknown function is supposed to belong. Inferences on f(·) can be performed once the posterior distribution of f(·) given the training set Dn has been calculated using the Bayes formula

p(f \mid D_n) = \frac{p(D_n \mid f)\, p(f)}{p(D_n)}

where the likelihood, p(Dn|f), measures the capability of the function f to predict the available observations.

The estimation problems arising in the present thesis imply the necessity of specifying proper priors for the specific problem at hand. In fact, in the following, the unknown function will be alternatively treated as an integrated Wiener process or an Ornstein-Uhlenbeck process or a linear combination of specific basis functions. Nevertheless, it is easy to see that, throughout this thesis, the function f(·) is always modelled as a d-dimensional Gaussian Process (GP) that is assumed to be independent of the measurement errors vi. As already pointed out by a number of statisticians, GPs are attractive because of their flexible nonparametric nature and computational simplicity. For these reasons, they have been applied in a large number of fields to a diverse range of ends, and many deep theoretical analyses are available. The following section recalls the basic concepts underlying the theory of Gaussian Processes. For more details the interested reader may refer to the literature, see e.g. [21, 40, 41, 54, 57, 60, 61, 80, 81].

2.1 Gaussian Processes

A Gaussian Process (GP) is a process whose distribution functions of all orders are Gaussian. Since a Gaussian is determined by its first and second order cumulants, its distributions are completely determined by its mean, m(ξ) = E[f(ξ)], and covariance function

K(\xi_1, \xi_2) = E[(f(\xi_1) - m(\xi_1))(f(\xi_2) - m(\xi_2))].

It can be easily proved that the random vector f = [f(x_1) f(x_2) . . . f(x_n)]^T, sampled in correspondence with the input points x_1, x_2, . . ., x_n, is normally distributed with mean and covariance matrix given by

E[\mathbf{f}] = [m(x_1) \; m(x_2) \; \dots \; m(x_n)]^T

\mathrm{Cov}[\mathbf{f}] = K = \begin{bmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_n) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_n, x_1) & K(x_n, x_2) & \cdots & K(x_n, x_n) \end{bmatrix}

For the sake of simplicity, in the following only Gaussian Processes with null expected values will be considered, so that the GP will be fully determined once a suitable covariance function has been assigned. This result can be obtained by preprocessing the data in order to remove any bias or trend originally present.
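To make the finite-dimensional picture above concrete, the following sketch builds the covariance matrix K of a zero-mean GP evaluated at a set of input points and draws realizations from the resulting multivariate normal. It is only an illustration: the squared-exponential covariance used here is a placeholder and is not one of the priors adopted later in the thesis, and all names are ours.

```python
import numpy as np

def gp_covariance_matrix(xs, kernel):
    """Covariance matrix K with entries K[i, j] = kernel(x_i, x_j)."""
    n = len(xs)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(xs[i], xs[j])
    return K

# Placeholder covariance function (squared exponential), not the thesis prior.
kernel = lambda x1, x2: np.exp(-0.5 * (x1 - x2) ** 2)

xs = np.linspace(0.0, 5.0, 20)            # input points x_1, ..., x_n
K = gp_covariance_matrix(xs, kernel)      # Cov[f] for the zero-mean GP
samples = np.random.multivariate_normal(np.zeros(len(xs)), K, size=3)
print(samples.shape)                      # three realizations of the sampled GP
```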

An indirect way to define a GP is to assume that f(x) is given by a linear combination of m basis functions ϕi(x)

f(x) = \sum_{i=1}^{m} \theta_i \varphi_i(x), \qquad \varphi_i(x) : \mathbb{R}^d \to \mathbb{R}

and that the array of weights

\theta = [\theta_1 \; \theta_2 \; \dots \; \theta_m]^T

is normally distributed with zero mean and variance Σθ. Then, it turns out [80] that f(x) is a Gaussian Process with covariance function

K(\xi_1, \xi_2) = \varphi(\xi_1)^T \Sigma_\theta \varphi(\xi_2), \qquad \varphi(\xi) = [\varphi_1(\xi) \; \varphi_2(\xi) \; \dots \; \varphi_m(\xi)]^T.

Compared to the case of a general covariance function K(ξ1, ξ2), this process is completely specified by the knowledge of the m × m matrix Σθ. This kind of “parametric GP” will be used in Chapters 5 and 6, whereas Chapters 3 and 4 make use of GPs that do not admit a finite-dimensional parametrization. Even though the covariance function of the parametric GP seems somehow determined, it is not fully specified, since it depends on both the choice of the basis functions, ϕj(x), and the variance of the weights, Σθ, which is usually unknown and must be estimated from the data. In fact, this situation is common to all types of GPs and is due to the lack of detailed a-priori information on the problem under study. In many cases the covariance function can be rewritten as

K(\xi_1, \xi_2) = \lambda^2 K'(\xi_1, \xi_2)

where K′(ξ1, ξ2) is fixed and λ² represents the only uncertainty (i.e. Σθ = λ²I), so that the complete determination of the covariance function depends on a proper tuning of the unknown parameter λ². This can be achieved by resorting to tuning methods such as Generalized Cross Validation (GCV), Ordinary Cross Validation (OCV), Cp statistics or Maximum Likelihood (ML) [27, 28, 39]. An example of ML estimation can be found in Chapter 3, while Chapters 5 and 6 resort to the so-called Cp statistics. Finally, the GCV criterion is also used in Chapter 5.

In other cases the covariance function is parametrized as

K(\xi_1, \xi_2) = K(\xi_1, \xi_2; \alpha_1, \alpha_2, \dots, \alpha_r)

where the coefficients α1, α2, . . ., αr are to be treated as hyper-parameters. Although more complex, the tuning of the hyper-parameters can still be performed according to the aforementioned criteria. A different (and “truly Bayesian”) approach is obtained by specifying a prior probability distribution over each coefficient and then applying the Bayes formula to compute their posterior distribution. In this case the unknown function, f(x), and the hyper-parameters, αi, are estimated simultaneously. If explicit formulas for the posterior distribution are lacking, Markov Chain Monte Carlo (MCMC) methodologies can be used to numerically evaluate the posterior [44, 81].

2.2 The Bayes Estimate

Under the given assumptions, the posterior distribution p(f | Dn) is Gaussian [79] with mean

\hat{f}(x) = E[f(x) \mid D_n] = K(x)^T \Sigma_y^{-1} y    (2.1)

and covariance

\mathrm{Cov}[f(\xi_1), f(\xi_2) \mid D_n] = K(\xi_1, \xi_2) - K(\xi_1)^T \Sigma_y^{-1} K(\xi_2)

where

K(x) = \mathrm{Cov}[f(x), \mathbf{f}] = [K(x, x_1) \; K(x, x_2) \; \dots \; K(x, x_n)]^T

\Sigma_y = \mathrm{Var}[y] = K + \sigma^2 I, \qquad y = [y_1 \; y_2 \; \dots \; y_n]^T.

Given that the posterior is Gaussian, f̂(x) can be used as a point estimate of the unknown function and is, therefore, named the Bayes estimate. The posterior covariance Cov[f(ξ1), f(ξ2) | Dn] can be used to compute its confidence intervals. Note that the Bayes estimate in (2.1) may be rewritten as a linear combination of basis functions whose number is equal to the number of available data, i.e.

\hat{f}(x) = \sum_{i=1}^{n} \hat{\theta}_i K(x, x_i)

where

\hat{\theta} = [\hat{\theta}_1 \; \hat{\theta}_2 \; \dots \; \hat{\theta}_n]^T, \qquad \hat{\theta} = \Sigma_y^{-1} y = E[y y^T]^{-1} y.

For a nonparametric GP the number of basis functions (regressors) is not fixed a priori but scales with the size of the data, so that f̂(·) can be regarded as a nonparametric estimate of the unknown function f(·). Whenever the auto-covariance is radial, i.e. K(ξ1, ξ2) = K(‖ξ1 − ξ2‖), it turns out that the Bayes estimate is just a Radial Basis Function (RBF) neural network. The weights of such a network are the entries of the vector θ̂ and are therefore computed as the solution of a system of linear equations.
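As a concrete illustration of (2.1) and of the RBF-network reading of the Bayes estimate, the sketch below computes the posterior mean and variance of f at new inputs from noisy samples. The kernel is again a generic placeholder and the function and variable names are ours, not the thesis's.

```python
import numpy as np

def bayes_estimate(x_train, y, sigma2, kernel, x_test):
    """Posterior mean and variance of a zero-mean GP given noisy samples.

    Implements f_hat(x) = K(x)^T Sigma_y^{-1} y and the corresponding
    posterior variance, with Sigma_y = K + sigma2 * I (cf. Section 2.2).
    """
    K = np.array([[kernel(a, b) for b in x_train] for a in x_train])
    Sigma_y = K + sigma2 * np.eye(len(x_train))
    theta = np.linalg.solve(Sigma_y, y)           # network weights theta = Sigma_y^{-1} y
    k_star = np.array([[kernel(x, b) for b in x_train] for x in x_test])
    mean = k_star @ theta                         # Bayes estimate at the test points
    var = np.array([kernel(x, x) for x in x_test]) - np.einsum(
        "ij,ij->i", k_star, np.linalg.solve(Sigma_y, k_star.T).T)
    return mean, var

# Usage with a placeholder radial kernel (not the priors used in later chapters).
kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
x_train = np.array([0.1, 0.7, 1.5, 2.4, 3.8])
y = np.sin(x_train) + 0.1 * np.random.randn(len(x_train))
mean, var = bayes_estimate(x_train, y, sigma2=0.01, kernel=kernel,
                           x_test=np.linspace(0, 4, 9))
```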

Before concluding the present chapter it is worth mentioning the connection existing between Bayesian estimation and regularization theory [73]. Since it is not necessary for the purpose of this thesis, the theoretical framework at the basis of the regularization approach is not discussed in detail. It is just mentioned that the theory of Reproducing Kernel Hilbert Spaces (RKHS) is needed in order to formulate a unified framework. The interested reader may refer to [3, 76] for a detailed description of RKHS and their connection to Bayesian estimation theory. Consider the following regularization problem

f_\gamma(x) = \arg\min_{f \in H_K} \; \sum_{i=1}^{n} (y_i - f(x_i))^2 + \gamma \|f\|_K^2    (2.2)

where HK is the RKHS associated with the kernel K(ξ1, ξ2) and γ ≥ 0 is a constant. The first term appearing in the cost functional weights the sum of squared residuals (SSR), while the second term takes into account the magnitude of f(·) according to the RKHS norm. Most of the time ‖f‖K introduces a smoothness constraint used to penalize wiggly solutions. This is the reason why the positive constant γ is also known as the regularization parameter. If γ = 0 the regularization problem reduces to the usual minimization of the sum of squared residuals and fγ(x) tries to fit the available data. On the other hand, if γ → +∞ the need for smoothness prevails, so that fγ(x) → 0. In [23, 31] the regularization problem (2.2) is solved, showing that fγ(·) is a regularization network.


Theorem 1 For γ = σ², the Bayes estimate f̂(x) and the solution fγ(x) of (2.2) coincide.

The proof of this theorem follows directly from the definition of regularization network. Such a result is significant since it suggests that the Bayesian viewpoint may provide a criterion for the choice of the regularization parameter.


Chapter 3

Nonparametric Bayesian Analysis for Population Data

An important problem in biomedicine is that of characterizing the average behaviour as well as the inter-individual variability of a population of subjects. As an example, the analysis of population data is of primary importance in pharmacology, where drug responses measured in multiple subjects are used to obtain average and individual pharmacokinetic and pharmacodynamic models.

When it is possible to collect a sufficient number of observations for each subject, model identification can be performed separately for each individual. However, in many cases there are technical, ethical and cost reasons that limit the number of samples that can be collected in each subject. Some examples are given by toxicokinetic studies as well as pharmacological experiments involving critical patients such as neonatal, pediatric or intensive care unit ones. If the individual models cannot be identified separately, it is necessary to resort to so-called “population methods” that provide the average and individual models from the joint analysis of all the available data [1, 5, 11, 25, 62, 63, 64, 75, 82].

In the drug development process, the use of population approaches has been recommended by the Food and Drug Administration, in order to obtain a reliable assessment of intra- and inter-individual variabilities [8]. However, the use of such models is not restricted to pharmacology but is being extended to data analysis problems arising in several contexts ranging from medical imaging [7] and diagnosis of metabolic disorders [74] to genomics [19].

Population methods can be divided into three main branches: parametric, semiparametric and nonparametric. In the parametric approach, a structural model is assumed, e.g. a compartmental one, and the model parameters are regarded as random variables extracted from a distribution representative of the given population [6, 35, 36, 77, 78] (note that the term “nonparametric” in the last two papers refers to the estimation of the probability distributions of the parameters of a grey-box model).

In other cases, for instance in the preliminary phases of a study, a structural model is not available and semiparametric or nonparametric techniques must be used. In the semiparametric approach, the response curves are modelled as regression splines [16, 17, 55], so that the non-trivial problem of deciding the number and the location of the spline knots arises.

Recently, in order to develop a completely nonparametric approach, the individual curves have been modelled as discrete-time stochastic processes (e.g. random walks), reformulating the problem within the framework of Bayesian estimation [42]. This kind of model has also been used for the analysis of gene expression time series measured using DNA micro-arrays [19]. Since the sampling schedules are usually not uniformly spaced in time, it would be more convenient to model the individual curves as continuous-time stochastic processes. In this chapter we develop such a continuous-time population model. More precisely, assuming that the average impulse response of the population is an integrated Wiener process, it is shown that its Bayes estimate is a cubic spline. Explicit formulas are worked out also for the estimates of the individual responses. This estimation approach extends the Gaussian processes methodology for the reconstruction of continuous functions given discrete and noisy samples [41, 67, 80] to the case of population models. Remarkably, the overall estimator can be interpreted as a kind of Regularization Network [56] whose weights are the solution of a system of linear equations.

In the last few years there has been a growing interest in smoothing splines within the control community, especially concerning their interpretation in an optimal control theoretic context [15, 71]. In the present work, conversely, smoothing splines arise as the solution of an optimal mean-square estimation problem. The method is tested on simulated data sets as well as on pharmacokinetic data related to xenobiotics administration in human subjects.

3.1 Stochastic Population Model

Consider the problem of estimating a family of scalar real-valued continuous-time functions z^j(t), j = 1, ..., N, t ≥ 0, on the basis of noisy samples taken at discrete instants. More precisely, assume that the following measurements are available

y_k^j = z^j(t_k^j) + v_k^j, \qquad k = 1, \dots, n_j,    (3.1)

where t_k^j > 0 denotes the k-th sampling instant (“knot”) for the j-th curve, and the measurement errors v_k^j are mutually independent and normally distributed with E[v_k^j] = 0, Var[v_k^j] = (σ_k^j)². In an experimental setting, the j-th curve z^j(t) will be representative of the j-th subject (e.g. an impulse response obtained as a drug concentration profile in plasma after administration of a unit bolus). Note that the number and location of the sampling instants t_k^j may vary from subject to subject. Hereafter, each individual curve will be decomposed as

z^j(t) = z̄(t) + z̃^j(t)

where z̄(t) is the “typical (average) curve” of the population and z̃^j(t) is the “individual shift” with respect to the average behaviour. For ease of notation, the observations will be grouped as follows

y := [y_1^1 \dots y_{n_1}^1 \;\; y_1^2 \dots y_{n_2}^2 \;\; \dots \;\; y_1^N \dots y_{n_N}^N]^T

Letting n = n_1 + n_2 + ... + n_N be the total number of observations, y is an n-dimensional column vector. In a similar way, it is possible to define

z̄ := [z̄(t_1^1) \dots z̄(t_{n_1}^1) \;\; \dots \;\; z̄(t_1^N) \dots z̄(t_{n_N}^N)]^T

z̃ := [z̃^1(t_1^1) \dots z̃^1(t_{n_1}^1) \;\; \dots \;\; z̃^N(t_1^N) \dots z̃^N(t_{n_N}^N)]^T

v := [v_1^1 \dots v_{n_1}^1 \;\; v_1^2 \dots v_{n_2}^2 \;\; \dots \;\; v_1^N \dots v_{n_N}^N]^T

Therefore, in vector notation, (3.1) can be rewritten as

y = z̄ + z̃ + v    (3.2)

where v ∼ N(0, Σ_v), Σ_v := diag{(σ_1^1)², . . ., (σ_{n_N}^N)²}, Σ_v > 0.

3.1.1 Typical and Individual Curves

In the present work, a stochastic approach is adopted: the unknown functions are modelled as stochastic processes and the aim is to compute their posterior distributions given the observed data (note that the data are processed off-line, so that there is no need for the estimator to satisfy causality constraints).

Assumption 1 The Gaussian stochastic processes z̄(t) and z̃^j(t), j = 1, ..., N, are independent of each other and of the noise vector v. □

In the following, R̄(t, τ) := Cov[z̄(t), z̄(τ)] and R̃(t, τ) := Cov[z̃^j(t), z̃^j(τ)], ∀j, will denote the auto-covariance functions of the typical curve and the individual shifts, respectively. Hereafter, it will be assumed that both R̄(t, τ) and R̃(t, τ) are positive definite operators. Recalling that z̃^j(t) is a shift with respect to the typical response, it is reasonable to assume that E[z̃^j(t)] = 0, ∀t, ∀j. As for z̄(t), by properly scaling the data, it can be assumed without loss of generality that E[z̄(t)] = 0. Since all the involved processes are jointly Gaussian, the posterior distributions are Gaussian as well. The following results provide the point estimates and the confidence intervals for the typical curve and the individual ones. In the next proposition and thereafter, Var[y] will denote the covariance matrix of the random vector y.

Proposition 1

ẑ̄(t) := E[z̄(t) \mid y] = \sum_{j=1}^{N} \sum_{k=1}^{n_j} c_k^j \, R̄(t, t_k^j)    (3.3)

ẑ^j(t) := E[z^j(t) \mid y] = ẑ̄(t) + \sum_{k=1}^{n_j} c_k^j \, R̃(t, t_k^j)    (3.4)

c = \Sigma_y^{-1} y    (3.5)

c = [c_1^1 \; c_2^1 \; \dots \; c_{n_1}^1 \; \dots \; c_1^N \; \dots \; c_{n_N}^N]^T

\Sigma_y := \mathrm{Var}[y] = \mathrm{Var}[z̄] + \mathrm{Var}[z̃] + \Sigma_v

\mathrm{Var}[z̄] = R̄ := \begin{bmatrix} R̄(t_1^1, t_1^1) & \cdots & R̄(t_1^1, t_{n_N}^N) \\ \cdots & \cdots & \cdots \\ R̄(t_{n_N}^N, t_1^1) & \cdots & R̄(t_{n_N}^N, t_{n_N}^N) \end{bmatrix}

\mathrm{Var}[z̃] = R̃ := \mathrm{blockdiag}\{R̃_1, \dots, R̃_N\}

R̃_j := \begin{bmatrix} R̃(t_1^j, t_1^j) & \cdots & R̃(t_1^j, t_{n_j}^j) \\ \cdots & \cdots & \cdots \\ R̃(t_{n_j}^j, t_1^j) & \cdots & R̃(t_{n_j}^j, t_{n_j}^j) \end{bmatrix}

Proof: According to a well-known formula for jointly Gaussian random variables

E[z̄(t) \mid y] = E[z̄(t)] + \mathrm{Cov}[z̄(t), y] \, \mathrm{Var}[y]^{-1} (y - E[y])

Under the given assumptions, E[z̄(t)] = 0, E[y] = 0 and

\mathrm{Cov}[z̄(t), y] = \mathrm{Cov}[z̄(t), z̄ + z̃ + v] = \mathrm{Cov}[z̄(t), z̄] = [R̄(t, t_1^1) \dots R̄(t, t_{n_N}^N)]

Concerning z̃^j(t), a completely analogous derivation yields

E[z̃^j(t) \mid y] = \mathrm{Cov}[z̃^j(t), y] \, \mathrm{Var}[y]^{-1} y.

Observing that E[z^j(t) | y] = E[z̄(t) | y] + E[z̃^j(t) | y] and that E[z̃^j(t) | y_k^i] = 0, ∀i ≠ j, equation (3.4) is obtained. Finally, the expressions for Σ_y, Var[z̄] and Var[z̃] follow directly from the assumptions.

Proposition 2

\mathrm{Var}[z̄(t) \mid y] = R̄(t, t) - r \, \Sigma_y^{-1} r^T

r := [R̄(t, t_1^1) \dots R̄(t, t_{n_N}^N)]

\mathrm{Var}[z^j(t) \mid y] = R̄(t, t) + R̃(t, t) - (r + r̃^j) \, \Sigma_y^{-1} (r + r̃^j)^T

r̃^j := \mathrm{Cov}[z̃^j(t), z̃]

Proof: By a well-known formula

\mathrm{Var}[z̄(t) \mid y] = \mathrm{Var}[z̄(t)] - \mathrm{Cov}[z̄(t), y] \, \mathrm{Var}[y]^{-1} \mathrm{Cov}[z̄(t), y]^T

Recalling that y = z̄ + z̃ + v and in view of the independence assumptions, the expression for Var[z̄(t) | y] immediately follows. Analogous considerations hold for Var[z^j(t) | y].

3.1.2 Regularization Network Interpretation

It is interesting to note from (3.3) and (3.4) that the estimates ẑ̄(t) and ẑ^j(t) are obtained as linear combinations of the functions R̄(t, t_k^j), R̃(t, t_k^j). This is the typical structure that comes out in the Bayesian estimation of Gaussian processes [23, 56, 76, 81]. Remarkably, the same estimator can also be obtained via Tychonov regularization theory [23, 56]. This explains why Poggio and Girosi (1990) have introduced the term Regularization Network (RN) to denote such estimators, pointing out their neural network-like structure. Also the estimator of Proposition 1 can be regarded as an RN, although of a special type.

Having to do with the identification of a population model, the number of neurons is twice the number n of the data instead of n as in the standard RN, see Fig. 3.1. A first set of n neurons receive t as input and have R̄(t, t_k^j) as activation function. The estimate ẑ̄(t) of the typical curve is obtained by linearly combining these outputs through the weights c_k^j. A second set of n neurons, having R̃(t, t_k^j) as activation functions, produce outputs that, combined again through the weights c_k^j, yield the estimates of the individual shifts. The weight vector c is obtained as the solution of a system of n linear equations, see (3.5). This is an advantage with respect to other kinds of networks, such as Multi Layer Perceptrons, in which the weights have to be computed using iterative nonlinear optimization [40].

Figure 3.1: Regularization Network structure of the estimator.
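To make the structure of the estimator in Proposition 1 explicit, here is a small sketch that, given the stacked sampling instants, noisy observations and the two covariance functions R̄ and R̃, computes the weight vector c = Σ_y^{-1} y and evaluates the typical curve and one individual curve on a time grid. Function names are illustrative only; the covariance functions are assumed to accept NumPy arrays elementwise, and Σ_v is taken as σ²I for simplicity (the chapter allows per-sample variances).

```python
import numpy as np

def population_estimates(times, subj, y, R_bar, R_tilde, sigma2, t_grid, j):
    """Proposition 1: typical curve and j-th individual curve on t_grid.

    times : sampling instants t_k^j stacked over all subjects
    subj  : subject index of each sample (same ordering as `times`)
    """
    T1, T2 = np.meshgrid(times, times, indexing="ij")
    Rb = R_bar(T1, T2)                                       # Var[z_bar] at the knots
    Rt = R_tilde(T1, T2) * (subj[:, None] == subj[None, :])  # block-diagonal Var[z_tilde]
    Sigma_y = Rb + Rt + sigma2 * np.eye(len(times))          # Sigma_v = sigma2 * I here
    c = np.linalg.solve(Sigma_y, y)                          # weights, eq. (3.5)

    G1, G2 = np.meshgrid(t_grid, times, indexing="ij")
    z_bar_hat = R_bar(G1, G2) @ c                            # eq. (3.3)
    mask = (subj == j).astype(float)
    z_j_hat = z_bar_hat + (R_tilde(G1, G2) * mask) @ c       # eq. (3.4)
    return z_bar_hat, z_j_hat
```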

3.2 Population Splines and Hyper-Parameters Estimation

For the results of the previous section to be of practical use it is necessary to specify the statistics of the stochastic processes z̄(t), z̃^j(t). If frequently sampled observations were available, their statistics could be identified by black-box parametric identification methods. On the other hand, population studies are often characterized by the scarcity of samples per subject. Therefore, it is necessary to introduce signal models that reflect the available a-priori knowledge.


3.2.1 Estimating the Typical Curve

If it is only known that a signal is “smooth”, it is common practice to model it as an integrated Wiener process, as done below.

Assumption 2

\dot{x}(t) = A x(t) + B w(t), \qquad z̄(t) = C x(t)

A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 \end{bmatrix}

where x(0) ∼ N(0, X_0), and w(t) is a scalar continuous-time white Gaussian noise, independent of x(0) and of the measurement error vector v, with E[w(t)w(τ)] = λ² δ(t − τ). □

The model above can describe signals whose initial conditions are deterministically known by setting X_0 = 0. The case of completely unknown initial conditions, corresponding to X_0^{-1} = 0, will be discussed in Section 3.3. The parameter λ² affects the regularity of the realizations (smaller values correspond to smoother signals). The a-priori knowledge is seldom sufficient to specify λ², so that it must be regarded as a hyper-parameter that will have to be estimated from the data, see Subsection 3.2.3.

Theorem 2 Under Assumption 2, ẑ̄(t) defined in Proposition 1 is a cubic spline with knots located at the sampling instants {t_1^1, t_2^1, . . ., t_{n_N}^N}.

Proof: It is well known that X(t) := Var[x(t)] is the solution of the differential Lyapunov equation

\dot{X}(t) = A X(t) + X(t) A^T + λ² B B^T, \qquad X(0) = X_0

Moreover,

R̄(t, τ) = \begin{cases} C X(t) e^{A^T (τ - t)} C^T, & t ≤ τ \\ C e^{A (t - τ)} X(τ) C^T, & t > τ \end{cases}

In view of the definition of A, B, C, it follows that R̄(t, τ), seen as a function of t, is a piecewise cubic polynomial:

R̄(t, τ) = λ² γ(t, τ)    (3.6)

γ(t, τ) = \begin{cases} \frac{t^2}{2}\left(τ - \frac{t}{3}\right), & t ≤ τ \\ \frac{τ^2}{2}\left(t - \frac{τ}{3}\right), & t > τ \end{cases}

Note that R̄(t, τ) is continuous with all its derivatives everywhere but at t = τ, where it is continuous up to the second derivative. Recalling that ẑ̄(t) in (3.3) is a linear combination of the functions R̄(t, t_k^j) (Proposition 1), the thesis immediately follows. □
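The closed-form kernel (3.6) is simple enough to code directly. The sketch below evaluates γ(t, τ) elementwise on arrays; multiplied by λ², it can be passed as the covariance R̄ to the earlier estimator sketch. It is an illustrative helper with a hypothetical name, not code from the thesis.

```python
import numpy as np

def gamma_iw(t, tau):
    """Integrated-Wiener covariance gamma(t, tau) of eq. (3.6), elementwise."""
    t = np.asarray(t, dtype=float)
    tau = np.asarray(tau, dtype=float)
    lo, hi = np.minimum(t, tau), np.maximum(t, tau)   # exploit symmetry in (t, tau)
    return 0.5 * lo ** 2 * (hi - lo / 3.0)

# Example: R_bar(t, tau) = lam2 * gamma_iw(t, tau) for a chosen lam2.
lam2 = 2.0
R_bar = lambda t, tau: lam2 * gamma_iw(t, tau)
print(R_bar(1.0, 3.0))   # lam2 * (1/2) * (3 - 1/3)
```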


In the literature, it is known that the conditional expectation of an integrated Wiener process given discrete observations is a cubic smoothing spline [76]. In some sense, Theorem 2 generalizes such a result to the analysis of a population of signals, so that it is natural to define ẑ̄(t) as a population smoothing spline.

3.2.2 Estimating the Individual Curves

Concerning the model for the individual shifts z̃^j(t), the following assumption is in order.

Assumption 3 For j = 1, . . ., N,

\dot{x}(t) = A x(t) + B w^j(t), \qquad z̃^j(t) = C x(t)

A = \begin{bmatrix} a_1 & 1 \\ 0 & a_2 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 \end{bmatrix}

where a_1 < 0, a_2 < 0, x(0) ∼ N(0, X_0), and w^j(t) is a scalar continuous-time white Gaussian noise (independent of v, w(t) and w^i(t), i ≠ j) with E[w^j(t) w^j(τ)] = λ̃² δ(t − τ). □

The statistics of z̃^j(t) will depend on the three parameters a_1, a_2, λ̃². For λ̃² the same considerations as for λ² hold. The two poles a_1 and a_2 provide two more degrees of freedom for shaping the auto-covariance of z̃^j(t). A possible drawback may be the difficulty in estimating two more hyper-parameters from the data. In this respect, a simpler model describes also the individual shifts as integrated Wiener processes (a_1 = 0, a_2 = 0). However, observe that the measurements can be rewritten as

y_k^j = z̄(t_k^j) + v̄_k^j

where v̄_k^j := z̃^j(t_k^j) + v_k^j. In other words, as far as the estimation of z̄(t) is concerned, v̄_k^j acts as measurement noise. If z̃^j(t) were an integrated Wiener process, its variance would tend to infinity with t, and the confidence intervals for z̄(t) would diverge as t grows. A notable exception occurs when some a-priori knowledge on the asymptotic value of the typical curve z̄(t) is available, in which case an integrated Wiener model for the individual shifts can do as well (this is further discussed in Section 3.4). In view of Assumption 3, the calculation of R̃(t, τ) is completely analogous to that of R̄(t, τ) described in the proof of Thm. 2 and yields

R̃(t, τ) = λ̃² γ̃(t, τ)    (3.7)

γ̃(t, τ) = \begin{cases} e^{a(τ - t)}\left(\frac{t^2 e^{2at}}{2a} - \frac{t e^{2at}}{2a^2} - \frac{e^{2at} - 1}{4a^3}\right) + (τ - t)\, e^{a(τ - t)}\, \frac{2at e^{2at} - e^{2at} + 1}{4a^2}, & t ≤ τ \\ e^{a(t - τ)}\left(\frac{τ^2 e^{2aτ}}{2a} - \frac{τ e^{2aτ}}{2a^2} - \frac{e^{2aτ} - 1}{4a^3}\right) + (t - τ)\, e^{a(t - τ)}\, \frac{2aτ e^{2aτ} - e^{2aτ} + 1}{4a^2}, & t > τ \end{cases}


3.2.3 Estimating the Hyper-Parameters

When one is faced with a Bayesian estimation problem involving unknown hyper-parameters, a simple, yet effective, approach is to resort to the so-called Empirical Bayes (EB) method [39]. In the first step, a Maximum Likelihood (ML) estimate of the hyper-parameters is computed. Then, the Bayes estimate is calculated as if the hyper-parameters were deterministically known and equal to their Maximum Likelihood estimates. In the problem at hand, this leads to the following estimation algorithm, where θ = [λ², λ̃², a_1, a_2] denotes the hyper-parameter vector.

Algorithm:

1. Let θ_ML := \arg\min_θ \{ \ln(\det(\Sigma_y)) + y^T \Sigma_y^{-1} y \}

2. Let [λ², λ̃², a_1, a_2]^T = θ_ML and compute ẑ̄(t) and ẑ^j(t), j = 1, ..., N, according to Proposition 1. □

If the individual shifts are also modelled as integrated Wiener processes, the only hyper-parameters will be λ² and λ̃².
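A minimal sketch of the Empirical Bayes step, assuming the integrated-Wiener case in which the only hyper-parameters are λ² and λ̃²: the negative log-likelihood ln det(Σy) + yᵀΣy⁻¹y is minimized numerically, here over the logarithms of the two variances to keep them positive. The optimizer, the parametrization and the dummy data are our choices, not prescribed by the thesis.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(log_params, times, subj, y, sigma2):
    """ln det(Sigma_y) + y^T Sigma_y^{-1} y for the population model."""
    lam2, lam2_tilde = np.exp(log_params)            # lambda^2, lambda_tilde^2
    T1, T2 = np.meshgrid(times, times, indexing="ij")
    lo, hi = np.minimum(T1, T2), np.maximum(T1, T2)
    gamma = 0.5 * lo ** 2 * (hi - lo / 3.0)          # integrated-Wiener kernel, eq. (3.6)
    same = (subj[:, None] == subj[None, :])
    Sigma_y = lam2 * gamma + lam2_tilde * gamma * same + sigma2 * np.eye(len(y))
    sign, logdet = np.linalg.slogdet(Sigma_y)
    return logdet + y @ np.linalg.solve(Sigma_y, y)

# Toy data: two subjects with three samples each (illustrative values only).
times = np.array([0.5, 2.0, 4.0, 0.7, 2.5, 4.5])
subj = np.array([0, 0, 0, 1, 1, 1])
y = np.array([1.0, 2.1, 2.9, 0.8, 1.7, 2.4])
res = minimize(neg_log_likelihood, x0=np.log([1.0, 1.0]),
               args=(times, subj, y, 0.01), method="Nelder-Mead")
lam2_ml, lam2_tilde_ml = np.exp(res.x)               # step 1 of the EB algorithm
```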

3.3 Completely Unknown Initial Conditions

It is important to be able to estimate typical curves whose initial conditions in t = 0 are completely unknown. As already mentioned, this would correspond to X_0^{-1} = 0. A practical approach is to let X_0^{-1} = εI, where ε is a small enough scalar, but this is far from being numerically robust. The rigorous approach calls for the derivation of specific formulas, as done in the following. Taking into account typical curves whose initial conditions have infinite variance is equivalent to considering a population of the type

z^j(t) = z^*(t) + z̃^j(t)    (3.8)

z^*(t) := φ^T(t) ζ + z̄(t)    (3.9)

where z̄(t) and z̃^j(t) have finite auto-covariances, φ(t) : R → R^M is a deterministic vector function, and ζ ∼ N(0, ρ²I), ρ² = ∞. For instance, with reference to the integrated Wiener process of Assumption 2, letting X_0 = ρ²I, ρ² = ∞ would yield φ^T(t) = [1 t]. In other words, handling completely unknown initial conditions amounts to estimating additional parameters with infinite prior variance. In the following, it will be assumed that the measurements are as in (3.1) and that the n × M matrix

Φ := [φ(t_1^1) \dots φ(t_{n_1}^1) \;\; \dots \;\; φ(t_1^N) \dots φ(t_{n_N}^N)]^T

is full column rank.

Proposition 3 For the model (3.8)-(3.9),

ẑ^*(t) := E[z^*(t) \mid y] = \sum_{j=1}^{N} \sum_{k=1}^{n_j} c_k^j \, R̄(t, t_k^j) + φ^T(t) d

ẑ^j(t) := E[z^j(t) \mid y] = ẑ^*(t) + \sum_{k=1}^{n_j} c_k^j \, R̃(t, t_k^j)

d = (Φ^T M^{-1} Φ)^{-1} Φ^T M^{-1} y

c = M^{-1}(y - Φ d)

M := R̄ + R̃ + \Sigma_v

Proof: Mutatis mutandis, the proof is completely analogous to that of Thm. 1.5.3 in [76] and is therefore omitted. □
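The estimator of Proposition 3 only adds a weighted-least-squares step for the parametric part. A small sketch, with illustrative names and a diagonal Σ_v, is:

```python
import numpy as np

def estimates_unknown_init(Phi, R_bar_knots, R_tilde_knots, Sigma_v, y):
    """Proposition 3: weights c and parametric coefficients d.

    Phi           : n x M regression matrix [phi(t_k^j)]^T (e.g. columns [1, t])
    R_bar_knots   : n x n matrix of R_bar evaluated at the sampling instants
    R_tilde_knots : n x n block-diagonal matrix of R_tilde at the sampling instants
    """
    M = R_bar_knots + R_tilde_knots + Sigma_v
    Minv_Phi = np.linalg.solve(M, Phi)
    Minv_y = np.linalg.solve(M, y)
    d = np.linalg.solve(Phi.T @ Minv_Phi, Phi.T @ Minv_y)   # d = (Phi^T M^-1 Phi)^-1 Phi^T M^-1 y
    c = np.linalg.solve(M, y - Phi @ d)                     # c = M^-1 (y - Phi d)
    return c, d
```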

In the set of sampling instants t_k^j, j = 1, . . ., N; k = 1, . . ., n_j, there may be repeated elements, as more than one individual curve can be measured at the same time. For the subsequent derivations it is useful to introduce the “minimal set” (i.e. without repetitions) of sampling instants {τ_i}, i = 1, . . ., n̄, where τ_{i_1} ≠ τ_{i_2}, ∀ i_1 ≠ i_2, and τ_i is such that there exist j and k such that τ_i = t_k^j. Moreover, let Λ be a matrix whose entries are either 0 or 1 such that

[t_1^1 \dots t_{n_1}^1 \; \dots \; t_1^N \dots t_{n_N}^N]^T := Λ [τ_1 \dots τ_{n̄}]^T

The next two results provide the posterior variance, and hence the confidence intervals, of the typical and individual curves, respectively.

Proposition 4 For the model (3.8)-(3.9),

\mathrm{Var}[z^*(t) \mid y] = R_u(t) + R_o(t)

R_u(t) = R̄(t, t) - r̄ \, R̄_τ^{-1} r̄^T

R_o(t) = L \, \Sigma_η \, L^T

\Sigma_η := (F^T (R̃ + \Sigma_v)^{-1} F + J)^{-1}

J := \begin{bmatrix} 0 & 0 \\ 0 & R̄_τ^{-1} \end{bmatrix}

F := [Φ \;\; Λ]

L := [φ^T(t) \;\; r̄ \, R̄_τ^{-1}]

r̄ := [R̄(t, τ_1) \dots R̄(t, τ_{n̄})]

R̄_τ := \begin{bmatrix} R̄(τ_1, τ_1) & \cdots & R̄(τ_1, τ_{n̄}) \\ \cdots & \cdots & \cdots \\ R̄(τ_{n̄}, τ_1) & \cdots & R̄(τ_{n̄}, τ_{n̄}) \end{bmatrix}


Proof: First of all, observe that the positive definiteness of the operator R̄(t, τ) implies the invertibility of R̄_τ. As for the existence of Σ_η, assume by contradiction that there exists x ≠ 0 such that x^T (F^T (R̃ + Σ_v)^{-1} F + J) x = 0. Let x be partitioned as x = [ζ^T z^T]^T, ζ ∈ R^{M×1}. This implies R̄_τ^{-1} z = 0 and Φζ + Λz = 0, that is z = 0 and Φζ = 0, ζ ≠ 0, which contradicts the full-rank assumption made on Φ. In order to apply Lemma 1 (in Appendix 3.6), let z^* := z^*(t) and observe that

z^* = φ^T(t) ζ + z̄(t)

y = F η + ε

ε ∼ N(0, \Sigma_ε), \qquad \Sigma_ε = R̃ + \Sigma_v

η := [ζ^T \; z̄_τ^T]^T, \qquad z̄_τ := [z̄(τ_1) \dots z̄(τ_{n̄})]^T

Moreover,

Γ := \mathrm{Cov}[z^*, η] = [φ^T(t) \mathrm{Var}[ζ] \;\; r̄]

V := \mathrm{Var}[η] = \begin{bmatrix} \mathrm{Var}[ζ] & 0 \\ 0 & R̄_τ \end{bmatrix}

Γ V^{-1} = [φ^T(t) \;\; r̄ \, R̄_τ^{-1}] = L

Recalling that ρ² = ∞, it is easy to see that

\Sigma_η = (F^T \Sigma_ε^{-1} F + V^{-1})^{-1}

Then,

\mathrm{Var}[z^* \mid η] = \mathrm{Var}[z^*] - \mathrm{Cov}[z^*, η] \mathrm{Var}[η]^{-1} \mathrm{Cov}[z^*, η]^T = φ^T(t) \mathrm{Var}[ζ] φ(t) + R̄(t, t) - Γ V^{-1} Γ^T = R̄(t, t) - r̄ \, R̄_τ^{-1} r̄^T

Finally, the thesis follows straightforwardly from the application of Lemma 1 (Appendix 3.6). □

Proposition 5 For the model (3.8)-(3.9),

\mathrm{Var}[z^j(t) \mid y] = R_u^j(t) + R_o^j(t)

R_u^j(t) = R_u(t) + R̃(t, t) - r̃^j R̃^{-1} r̃^{jT}

R_o^j(t) = L^j \Sigma_η L^{jT}

\Sigma_η := (F^T \Sigma_v^{-1} F + J)^{-1}

J := \begin{bmatrix} 0 & 0 \\ 0 & (R̄ + R̃)^{-1} \end{bmatrix}

F := [Φ \;\; I]

L^j := [φ^T(t) \;\; (r + r̃^j)(R̄ + R̃)^{-1}]


Proof: The invertibility of R̃ follows from the positive definiteness of the operator R̃(t, τ). As for the invertibility of Σ_η, it can be proved in the same way as the invertibility of the corresponding matrix in the proof of Proposition 4. In order to apply Lemma 1, let z^* := z^j(t) and observe that

z^* = φ^T(t) ζ + z̄(t) + z̃^j(t)

y = F η + v

η = [ζ^T \; (z̄ + z̃)^T]^T

Moreover,

Γ := \mathrm{Cov}[z^*, η] = [φ^T(t) \mathrm{Var}[ζ] \;\; r + r̃^j]

V := \mathrm{Var}[η] = \begin{bmatrix} \mathrm{Var}[ζ] & 0 \\ 0 & R̄ + R̃ \end{bmatrix}

Γ V^{-1} = L^j

The rest of the proof is very similar to the proof of Proposition 4.

3.4 Examples

In this section the proposed identification scheme is applied to three different case studies, both simulated and experimental.

3.4.1 AUC Estimation

In pharmacology, an important application of population models is the estimation of the Area Under the plasma concentration-time Curve (AUC) of an administered drug. In fact, such a parameter is one of the most important metrics of systemic exposure, which in turn is related to the therapeutic and toxic effects of the drug under study. As already mentioned in the introduction, the advantage of nonparametric population identification methods, compared to parametric ones that rely on structural models (e.g. compartmental ones), is that arbitrary assumptions can be avoided. Park et al. (1997) showed that a population semiparametric method based on regression splines performed comparably with respect to a population parametric approach, while being much more robust in the face of possible model mismatches. On the other hand, Magni et al. (2002) demonstrated that a discrete-time nonparametric method performed equally well if not better than the semiparametric one. Below, a simulated data set is used to compare the continuous-time method proposed in the present work with the discrete-time one of [42]. The measurements y^j_k in the j-th subject were obtained as

zj(t) = (D ka_j / (Vj (ka_j − kj))) (exp(−kj t) − exp(−ka_j t))
y^j_k = zj(t_k) + v^j_k
t^j_k ∈ {0.5, 2.3, 3.6, 5.9, 10}
Var[v^j_k] = 0.09 + (0.15 zj(t_k))²            (3.10)


where D = 50 stands for the drug dose, (ka_j, kj, Vj) are the parameters characterizing the j-th subject, and the variance of the measurement errors v^j_k is the sum of a constant term and a constant-CV term (with Coefficient of Variation CV = 0.15). The population distribution of the individual parameters is as follows:

kj = 0.14 exp(ν1j)
ka_j = kj + 0.55 exp(ν2j)
Vj = 5 exp(ν3j)
νj = [ν1j ν2j ν3j]ᵀ
νj ~ N(0, diag{0.16, 0.16, 0.16}).

For this model, 500 replicate data sets of N = 20 individuals each were generated. Fig. 3.2 shows the average curve z̄(t) (obtained as the average of 400 individual curves) as well as the data {y^j_k} of data set #51. For each data set, the AUC of the estimated typical curve and the AUCs of the 20 estimated individual curves were computed using both methods.
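As a reference for how such a replicate data set can be produced from the model above, the following is a minimal sketch (the function name and random seed are illustrative, not part of the original study).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 50.0                                   # drug dose
t = np.array([0.5, 2.3, 3.6, 5.9, 10.0])   # sampling instants
N = 20                                     # subjects per data set

def simulate_dataset(rng):
    """Generate one replicate data set according to the equations above."""
    nu = rng.normal(0.0, np.sqrt(0.16), size=(N, 3))   # log-normal population variability
    k  = 0.14 * np.exp(nu[:, 0])
    ka = k + 0.55 * np.exp(nu[:, 1])
    V  = 5.0 * np.exp(nu[:, 2])
    data = []
    for kj, kaj, Vj in zip(k, ka, V):
        z = D * kaj / (Vj * (kaj - kj)) * (np.exp(-kj * t) - np.exp(-kaj * t))
        sd = np.sqrt(0.09 + (0.15 * z) ** 2)           # constant + constant-CV error, eq. (3.10)
        y = z + rng.normal(0.0, sd)
        data.append((z, y))
    return data

dataset = simulate_dataset(rng)
```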

Figure 3.2: Data set #51: estimated typical curve (solid) vs. true typical curve (dashed) and all the available data.

It is worth noting that the individual curves are not stationary but tend to be smoother towards the end of the experiment. In order to compensate for this behaviour, the times were transformed logarithmically by defining a new time axis tnew := ln(t + 1). In fact, on this new time axis the rate of variation of the curves is more uniform.

Since in the model the variance of the measurement errors v^j_k depends on the unknown concentrations zj(t_k), a two-step procedure was adopted. First, the variance matrix Σv was calculated using (3.10) with zj(t_k) replaced by y^j_k. After estimating zj(t), the matrix Σv was updated using (3.10) with zj(t_k) replaced by ẑ_j(t_k), and the identification algorithm was re-run to yield the final estimates. In Fig. 3.2, the estimated typical curve obtained from data set #51 is plotted together with the individual data and the true typical curve.


Figure 3.3: Err(%) for the typical as well as the individual AUCs obtained using the discrete-time (1) and the continuous-time (2) approaches.


Each estimate was evaluated on the basis of the percent relative error defined as

Err(%) := (ÂUC − AUC)/AUC · 100

where AUC is the integral of the true curve up to the last sampling instant (t = 10), i.e.

AUC = ∫₀¹⁰ z(t) dt

and ÂUC denotes the integral of the estimated curve ẑ(t). Analogous percent errors were computed for the individual AUCs. The results are summarized in Fig. 3.3, where the boxplots of the errors for the typical and individual AUCs are reported. The distribution of the errors has about the same dispersion for both methods. However, the new method is less biased. This improvement is not surprising, because the discrete-time method computes the AUC on the basis of a piecewise linear estimate of the curve, which is less accurate than the continuous-time estimate obtained by the new method.
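A short sketch of this error metric is given below. It assumes the AUC is approximated by the trapezoidal rule on a dense evaluation of the curve (the thesis can integrate the continuous-time estimate exactly, since it is a cubic spline); the curve functions are hypothetical placeholders.

```python
import numpy as np

def auc(curve, t_end=10.0, num=2001):
    """Area under a curve on [0, t_end], composite trapezoidal rule on a dense grid."""
    t = np.linspace(0.0, t_end, num)
    return np.trapz(curve(t), t)

def err_percent(curve_hat, curve_true, t_end=10.0):
    """Percent relative AUC error, as defined in the text."""
    auc_hat, auc_true = auc(curve_hat, t_end), auc(curve_true, t_end)
    return (auc_hat - auc_true) / auc_true * 100.0

# example with a hypothetical true curve and a slightly biased estimate
true_curve = lambda t: 10.0 * (np.exp(-0.2 * t) - np.exp(-1.5 * t))
est_curve  = lambda t: 1.02 * true_curve(t)
print(err_percent(est_curve, true_curve))   # approximately +2 %
```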

3.4.2 Simulated Example: Sparsely Sampled Data

In this example the proposed nonparametric identification scheme is applied to a problem in which sampling is not uniform between subjects. In particular, the number of samples per subject ranges from 1 to 9. In such conditions it is clearly impossible to estimate the typical curve by averaging the individual curves estimated by standard identification methods. Conversely, the nonparametric population approach not only reconstructs the typical curve but also provides estimates of the individual ones. The measurements y^j_k in the j-th subject were obtained as

zj(t) = αj exp(−βj t)
y^j_k = zj(t_k) + v^j_k
Var[v^j_k] = (0.1 zj(t^j_k))²

where (αj, βj) are the parameters characterizing the j-th subject. The population distribution of the individual parameters is as follows:

αj = exp(ν1j)
βj = exp(ν2j)
νj = [ν1j ν2j]ᵀ
νj ~ N([0 ln(0.2)]ᵀ, diag{0.01, 0.0259}).

For this model, 500 replicate data sets of N = 7 individuals each were generated. The set of possible sampling instants was {t1, ..., t9} = {0, 0.5, 1, 1.5, 2, 4, 8, 12, 24}. In each data set, subject #1 was fully sampled, #2 was sampled at time points


Figure 3.4: Data set #78: noisy measurements and real individual curves.

{t1, t3, t5, t6, t7, t8}, #3 at {t1, t5, t6, t8, t9}, #4 at {t1, t3, t6, t8}, #5 at {t2, t4, t6}, #6 at {t3, t7} and #7 at {t5} (30 samples in total). For illustrative purposes, the noisy measurements and the individual curves of data set #78 are plotted in Fig. 3.4. In this problem, in order to take into account the prior information that all curves tend to zero, a transformation of the time coordinates was performed. More precisely, tnew = 1/(1 + t/µ), so that t = 0 and t = ∞ correspond to tnew = 1 and tnew = 0, respectively. Then, in the new time coordinates it was assumed that both the typical and the individual curves had zero initial conditions (corresponding to zero terminal conditions at t = ∞), i.e. x̄(0) = 0 and x̃(0) = 0. As the new time range tnew ∈ [0, 1] is finite, the individual shifts were modelled as integrated Wiener processes (since tnew does not go to infinity, the posterior variance of the typical curve cannot diverge). Another advantage of the time transformation has to do with its ability to formalize the prior knowledge that the curves become smoother as time increases. In fact, processes whose second derivative is stationary in the new time coordinate correspond to processes whose second derivative has decreasing variance in the original time coordinate. The parameter µ of the time transformation was chosen so as to maximize the minimum distance between each pair of transformed sampling instants, yielding µ = 3.00. In the estimation algorithm, Var[v^j_k] was approximated by 0.01 (y^j_k)², i.e. by replacing the (unknown) zj(t^j_k) with the observed y^j_k. The hyper-parameters were estimated via Maximum Likelihood.
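The selection rule for µ can be sketched as a one-dimensional grid search (an assumed implementation; the exact value obtained depends on the grid and on numerical details):

```python
import numpy as np

# Candidate sampling instants (hours) used in the simulated example.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 4.0, 8.0, 12.0, 24.0])

def min_pairwise_gap(mu, t=t):
    """Smallest distance between the transformed instants t_new = 1/(1 + t/mu)."""
    tn = np.sort(1.0 / (1.0 + t / mu))
    return np.min(np.diff(tn))

# Simple grid search over mu; values close to 3 maximize the criterion for this schedule.
grid = np.linspace(0.5, 20.0, 2000)
best_mu = grid[np.argmax([min_pairwise_gap(m) for m in grid])]
print(best_mu, min_pairwise_gap(best_mu))
```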

The results of the identification for data set #78 are given in Figs. 3.5, 3.6 and 3.7, where the estimated typical curve and two of the seven individual curves are reported together with their confidence intervals. Both the typical and the individual curves are estimated with reasonable accuracy. The accuracy of the individual curves decreases as the number of available data decreases. This phenomenon can be appreciated by looking at the boxplots of the RMSEs reported in Fig. 3.8. The RMSE was computed as

RMSE = ( (1/t9) ∫₀^{t9} (ẑ_j(τ) − zj(τ))² dτ )^{1/2}            (3.11)

for the individual curves, and analogously for the typical curve z(t).


Figure 3.5: True (dashed) vs. estimated (solid) typical curve with its 95% confidence intervals and available data (open circles).

Figure 3.6: Individual curve #4: true (dashed) vs. estimated (solid) curve with its 95% confidence intervals and available data.


Figure 3.7: Individual curve #6: true (dashed) vs. estimated (solid) curve with its 95% confidence intervals and available data.

Figure 3.8: RMSE of each individual curve and of the typical curve computed on 500 data sets.


3.4.3 Analysis of Pharmacokinetic Data

Finally, the proposed population model was tested on a data set related to xenobiotics administration in 27 human subjects [58].

Figure 3.9: Xenobiotics concentration data after a bolus in 27 human subjects: average curve (bold) and individual curves.

In the experiment, 8 samples were collected in each subject at {t1, t2, t3, t4, t5, t6, t7, t8} = {0.5, 1, 1.5, 2, 4, 8, 12, 24} hours after a bolus administration. The data have a 10% Coefficient of Variation, i.e. Var[v^j_k] = (0.1 zj(t^j_k))². To illustrate the population variability, the 27 experimental concentration curves are reported in Fig. 3.9, together with the average curve which, given the number of subjects, is a reasonable estimate of the typical curve. Starting from these experimental data, different sampling schemes can be simulated by choosing proper subsets of the data. In particular, as an example, the following sparse sampling protocol was adopted: subject #2 is sampled at time points {t6, t7, t8}, #5 at {t2, t4, t8}, #7 is fully sampled, #8 at {t3, t5}, #13 at {t1, t2}, #17 at {t7}, #19 at {t6}, #20 at {t4, t8}, #21 at {t5} and #23 at {t1, t3} (25 samples in total).

Also in this case study the times were transformed by defining a new time axis tnew = 1/(1 + t/µ) with µ = 3.00 (the value of µ coincides with that used for the simulated example because the sampling schedule is the same). In this case, in the new time coordinates all the curves (the typical one and the individual ones) are equal to zero at tnew = 1 (which corresponds to t = 0). This was accommodated by inserting zero-variance null measurements at tnew = 1. The hyper-parameters were estimated via Maximum Likelihood (λ̄²_ML = 102130, λ̃²_ML = 23443).

In Fig. 3.10, the estimated typical curve with its 95% confidence intervals is reported together with the data. The estimated typical curve appears to be a satisfactory reconstruction, especially taking into account that it was obtained using only 25 observations. In Figs. 3.11 and 3.12, the estimates of the individual curves of subjects #5 and #19, respectively, are shown together with their confidence intervals. For the other individuals, reasonable estimates are obtained as well (data not shown).


Figure 3.10: Estimated typical curve (bold) with its 95% confidence intervals.

Figure 3.11: Estimated individual curve of subject #5 (bold) with its 95% confidence intervals. For this individual curve only three data (full circles) were available. In order to assess the quality of the reconstruction, the other five unobserved data (open circles) are also plotted.


Figure 3.12: Estimated individual curve of subject #19 (bold) with its 95% confidence intervals. For this individual curve only one datum (full circle) was available. In order to assess the quality of the reconstruction, the other seven unobserved data (open circles) are also plotted.

3.5 Concluding Remarks

A new nonparametric continuous-time model for the population analysis of multiple experiments has been proposed. The typical curve as well as the individual ones are modelled as continuous-time Gaussian processes. If the statistics of the processes are known, the posterior expectation given the data (the Bayes estimate) is obtained as the output of a Regularization Network, i.e. as the linear combination of auto-covariance functions centred at the sampling knots. The network weights are computed by solving a system of linear equations. Moreover, if the typical curve is modelled as an integrated Wiener process, its estimate is a cubic spline. In general, the statistics of the processes are not completely known and depend on some unknown hyper-parameters. Therefore, an Empirical Bayes scheme has been proposed: first the hyper-parameters are estimated via Maximum Likelihood and subsequently their ML estimates are plugged into the Regularization Network.

In order to develop a truly Bayesian estimation procedure in which the hyper-parameters and the curves can be jointly estimated, in the following chapter the application of Markov Chain Monte Carlo (MCMC) algorithms to the present problem is investigated. A further direction of future research would focus on the implementation of computationally efficient algorithms. In fact, the proposed scheme requires the solution of a system of linear equations and its computational complexity scales with the cube of the number of observations. By exploiting the state-space model it may be possible to work out algorithms based on Kalman filtering whose complexity scales linearly with the number of data, see e.g. [12], where the efficient computation of regularization networks is addressed.


3.6 Appendix: Technical Lemma

Consider the problem of estimating a scalar random variable z* given noisy observations y = Fη + ε, where the vector η is correlated with z* and ε is an independent noise term. The next lemma shows that the conditional variance Var[z*|y] can be decomposed as the sum of two terms: the first one is the conditional variance when η is perfectly known, whereas the second term takes into account the presence of the measurement noise ε. A graphical representation of the lemma in terms of projections in the Hilbert space of jointly normal random variables is provided in Fig. 3.13.

Figure 3.13: Graphical interpretation of Lemma 1 in terms of projections in the Hilbert space of jointly normal random variables (in the figure, a² = Var[z*|y], b² = Var[E[z*|η]|y], c² = Var[z*|η]).

Lemma 1 Assume that

y = Fη + ε,   y ∈ Rⁿ
ε ~ N(0, Σε),   Σε > 0
[z* ηᵀ]ᵀ ~ N(0, Σ),   Σ = [ σ²*   Γ ]
                           [ Γᵀ    V ],   Σ > 0

where z* is a scalar and ε is independent of [z* ηᵀ]ᵀ. Then,

Var[z*|y] = Var[z*|η] + Var[E[z*|η]|y]
Var[z*|η] = σ²* − ΓV⁻¹Γᵀ
Var[E[z*|η]|y] = ΓV⁻¹Var[η|y]V⁻¹Γᵀ
Var[η|y] = (FᵀΣε⁻¹F + V⁻¹)⁻¹

Proof: The expression for Var[z*|η] is a straightforward consequence of well-known properties of jointly Gaussian random variables. As for the computation of Var[E[z*|η]|y], observe that E[z*|η] = ΓV⁻¹η. Therefore,

Var[E[z*|η]|y] = ΓV⁻¹Var[η|y]V⁻¹Γᵀ

On the other hand, in view of standard Bayesian estimation formulas,

Var[η|y] = (FᵀΣε⁻¹F + V⁻¹)⁻¹

By applying the matrix inversion lemma, one has that

Var[η|y] = V − VFᵀ(FVFᵀ + Σε)⁻¹FV

Finally,

Var[z*|η] + Var[E[z*|η]|y] = σ²* − ΓV⁻¹Γᵀ + ΓV⁻¹(V − VFᵀ(FVFᵀ + Σε)⁻¹FV)V⁻¹Γᵀ
                           = σ²* − ΓFᵀ(FVFᵀ + Σε)⁻¹FΓᵀ
                           = Var[z*] − Cov[z*, y]Var[y]⁻¹Cov[z*, y]ᵀ
                           = Var[z*|y]

so proving the thesis.
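A quick numerical sanity check of the decomposition is sketched below on randomly generated matrices satisfying the assumptions of Lemma 1 (all names and dimensions are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3                                  # dimensions of y and of eta

F = rng.normal(size=(n, m))
A = rng.normal(size=(m + 1, m + 1))
Sigma = A @ A.T + 0.1 * np.eye(m + 1)        # joint covariance of [z*, eta], > 0
s2, Gamma, V = Sigma[0, 0], Sigma[0:1, 1:], Sigma[1:, 1:]
Sigma_eps = 0.5 * np.eye(n)                  # noise covariance, > 0

# Left-hand side: Var[z*|y] from the joint covariance of (z*, y).
Cov_zy = Gamma @ F.T                         # Cov[z*, y] = Cov[z*, eta] F^T
Var_y = F @ V @ F.T + Sigma_eps
lhs = s2 - Cov_zy @ np.linalg.solve(Var_y, Cov_zy.T)

# Right-hand side: the two-term decomposition of Lemma 1.
Vinv = np.linalg.inv(V)
Var_eta_y = np.linalg.inv(F.T @ np.linalg.solve(Sigma_eps, F) + Vinv)
rhs = (s2 - Gamma @ Vinv @ Gamma.T) + Gamma @ Vinv @ Var_eta_y @ Vinv @ Gamma.T

print(np.allclose(lhs, rhs))                 # True
```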


Chapter 4

Nonparametric Analysis for Population Data: an MCMC Approach

As already mentioned in Chapter 3, in science and technology it is rather common to analyze data coming from multiple experiments performed on different subjects belonging to some given population. If only a few observations can be collected in each subject, one must resort to so-called population methods, which may rely on parametric, semiparametric or nonparametric models. In the previous chapter, model identification was performed according to a continuous-time nonparametric model based on an Empirical Bayes approach. Such a methodology, though relatively simple, replaces the hyper-parameters with their point estimates and, as such, underestimates the confidence intervals. For this reason, the present chapter develops a fully Bayesian identification scheme, which resorts to a Markov Chain Monte Carlo (MCMC) procedure for estimating the posterior distributions. The method is tested on the same simulated example and pharmacokinetic data already introduced in Chapter 3.

4.1 The Population Model

Since the stochastic population model used in this chapter coincides with the one introduced in Chapter 3, one can refer to Sections 3.1 and 3.2 for a detailed description of the model at issue. In this section, a sampled reformulation of the aforementioned continuous-time model is introduced that will prove useful in the following. More precisely, equation (3.2) is rewritten as

y = Φ̄w̄ + D̃w̃ + v            (4.1)

where Φ̄ and D̃ are such that λ̄²Φ̄Φ̄ᵀ = Var[z̄], λ̃²D̃D̃ᵀ = Var[z̃], w̄ ~ N(0, λ̄²Iₙ) and w̃ ~ N(0, λ̃²Iₙ). Recalling equations (3.6) and (3.7), it is easy to see that Var[z̄] and Var[z̃], as introduced in the previous chapter, can be rewritten as follows:


Var[z̄] = R̄ = λ̄² [ γ̄(t^1_1, t^1_1)     ...  γ̄(t^1_1, t^N_{n_N})     ]
                 [       ...           ...        ...              ]
                 [ γ̄(t^N_{n_N}, t^1_1) ...  γ̄(t^N_{n_N}, t^N_{n_N}) ]

Var[z̃] = R̃ = blockdiag{R̃1, ..., R̃N}

R̃j = λ̃² [ γ̃(t^j_1, t^j_1)       ...  γ̃(t^j_1, t^j_{n_j})     ]
         [       ...             ...        ...               ]
         [ γ̃(t^j_{n_j}, t^j_1)   ...  γ̃(t^j_{n_j}, t^j_{n_j}) ]

In the following, X̄0 and X̃0 are supposed to be known, so that only two hyper-parameters have to be estimated from the data: λ̄² and λ̃².
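One possible way to build the matrices Φ̄ and D̃ of the sampled model (4.1) is through matrix square roots (e.g. a Cholesky factor) of the kernel Gram matrices; the sketch below assumes an integrated Wiener kernel and a hypothetical two-subject sampling schedule, so all numerical values are illustrative.

```python
import numpy as np

def iw_kernel(s, t):
    """Autocovariance of an integrated Wiener process with unit intensity:
    gamma(s, t) = (min^2 / 2) * (max - min / 3)."""
    lo, hi = np.minimum(s, t), np.maximum(s, t)
    return lo ** 2 * (hi - lo / 3.0) / 2.0

# Sampling instants of all subjects, stacked as in y (hypothetical schedule).
t_all = np.array([0.2, 0.5, 0.9, 0.3, 0.6, 1.0])
subj = np.array([0, 0, 0, 1, 1, 1])                 # subject index of each sample
n = t_all.size

# Gram matrices of the unit-variance kernels.
S, T = np.meshgrid(t_all, t_all, indexing="ij")
Gbar = iw_kernel(S, T)                              # Var[z_bar] / lambda_bar^2
Gtilde = Gbar * (subj[:, None] == subj[None, :])    # block-diagonal across subjects

# Square roots: lambda_bar^2 * Phi Phi^T = Var[z_bar], lambda_tilde^2 * D D^T = Var[z_tilde].
jitter = 1e-10 * np.eye(n)
Phi_bar = np.linalg.cholesky(Gbar + jitter)
D_tilde = np.linalg.cholesky(Gtilde + jitter)

# Draw from the sampled model (4.1): y = Phi_bar w_bar + D_tilde w_tilde + v.
rng = np.random.default_rng(0)
lam_bar2, lam_tilde2, sig2 = 1.0, 0.5, 0.01
w_bar = rng.normal(0.0, np.sqrt(lam_bar2), n)
w_tilde = rng.normal(0.0, np.sqrt(lam_tilde2), n)
y = Phi_bar @ w_bar + D_tilde @ w_tilde + rng.normal(0.0, np.sqrt(sig2), n)
```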

4.2 MCMC Estimation of the Sampled Model

Consider a generic identification problem in which y denotes the observed data and θ the unknown parameters of the model. Bayesian inference requires the computation of the posterior distribution p(θ|y). Given p(θ|y), a point estimate of θ is

E[θ|y] = ∫ θ p(θ|y) dθ.

In the present problem, this expression does not admit an analytical solution and its numerical evaluation is tackled by means of an MCMC approach [22].

4.2.1 MCMC Estimation

Monte Carlo integration evaluates E[θ|y] by drawing samples {θᵢ, i = 1, ..., h} from the posterior distribution, so that

E[θ|y] ≈ (1/h) Σ_{i=1}^{h} θᵢ.

It is not necessary that the samples be drawn independently from p(θ|y); it suffices that they explore the whole support of p(θ|y) in the correct proportions. One way of doing this is through a Markov chain having p(θ|y) as its stationary distribution. This result can be obtained by resorting to a well-known family of algorithms named Metropolis-Hastings algorithms [30]. In particular, letting θ = [λ̄² λ̃²]ᵀ, the Gibbs sampler will be applied to the estimation of the population model previously described.

4.2.2 Hyper-Parameter Priors

The very first step consists of assigning a prior distribution to all the random variables involved in the model. A slightly informative prior is to be preferred for parameters that are affected by a large uncertainty. For the hyper-parameters λ̄² and λ̃², a computationally


advantageous choice, which simplifies the subsequent calculation of the so-called full conditional distributions, is to model the priors of 1/λ̄² and 1/λ̃² as Gamma distributions with large (possibly infinite) variance:

p(1/λ̄²) = Γ(g1, g2) ∝ (1/λ̄²)^{g1−1} e^{−g2/λ̄²}
p(1/λ̃²) = Γ(g3, g4) ∝ (1/λ̃²)^{g3−1} e^{−g4/λ̃²}

where the gᵢ are the parameters that characterize the Gamma distributions.

4.2.3 Full Conditional Distributions

The full conditional is the probability distribution of a variable conditioned on all the other variables in the model. It turns out that the full conditional for 1/λ̄² is Γ(g1', g2') with arguments g1' = g1 + n/2 and g2' = g2 + w̄ᵀw̄/2. As for the full conditional of 1/λ̃², it is Γ(g3', g4'), with g3' = g3 + n/2 and g4' = g4 + w̃ᵀw̃/2. The full conditional distributions of w̄ and w̃ turn out to be multivariate Gaussian, characterized by the following statistics:

Var[w̄|·] = (λ̄⁻²Iₙ + Φ̄ᵀ(Var[z̃] + Σv)⁻¹Φ̄)⁻¹
E[w̄|·] = Var[w̄|·] Φ̄ᵀ(Var[z̃] + Σv)⁻¹ y
Var[w̃|·] = (λ̃⁻²Iₙ + D̃ᵀ(Var[z̄] + Σv)⁻¹D̃)⁻¹
E[w̃|·] = Var[w̃|·] D̃ᵀ(Var[z̄] + Σv)⁻¹ y

4.2.4 The Algorithm

In order to initialize the Gibbs sampler, the following steps are performed.

• Initialization of the parameters of the prior distributions for λ̄² and λ̃². Letting g1, g2, g3 and g4 all be equal to zero (implying that the prior variance is infinite) yields a non-informative prior.

• Initialization of the unobserved variables (i.e. w̄, w̃, 1/λ̄², 1/λ̃²) to values obtained by sampling the corresponding prior distributions, or fixed either according to some a-priori knowledge or arbitrarily. In the present case, to speed up convergence, λ̄² and λ̃² are initialized with their Empirical Bayes estimates (see Chapter 3), although an arbitrary initialization would work as well.

Subsequently, the following procedure has to be repeated iteratively so as to collect a sufficient number of samples of both hyper-parameters. During each iteration, the full conditional distributions are updated according to the values of the samples drawn during the preceding step. The structure of the generic (i + 1)-th iteration is described below.

• Compute Var[w̄|·] and E[w̄|·]

• Sample w̄_{i+1} from the updated full conditional distribution of w̄

• Update g1 and g2

• Sample λ̄²_{i+1} from the updated full conditional distribution of λ̄²

• Compute Var[w̃|·] and E[w̃|·]

• Sample w̃_{i+1} from the updated full conditional distribution of w̃

• Update g3 and g4

• Sample λ̃²_{i+1} from the updated full conditional distribution of λ̃²

After having extracted all the samples, the initial "burn-in" part of the chain is discarded. The remaining h samples λ̄²ᵢ, λ̃²ᵢ, i = 1, ..., h, provide a characterization of the posterior distribution of λ̄² and λ̃² given the data.
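The loop can be sketched as follows. This is only a minimal illustration of the updates listed above (with the other curve component marginalized when sampling w̄ and w̃, as in the full conditionals of Section 4.2.3); Φ̄ and D̃ are random placeholders here, and all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30                                    # total number of observations
h_total, burn_in = 2000, 500              # as in the examples of Section 4.4

# Placeholders for the matrices of the sampled model (4.1); in practice they come
# from the factorizations of Var[z_bar] and Var[z_tilde].
Phi = rng.normal(size=(n, n)) / np.sqrt(n)
D = rng.normal(size=(n, n)) / np.sqrt(n)
Sv = 0.01 * np.eye(n)                     # measurement noise covariance
y = Phi @ rng.normal(size=n) + D @ (0.5 * rng.normal(size=n)) \
    + rng.multivariate_normal(np.zeros(n), Sv)

g1 = g2 = g3 = g4 = 0.0                   # non-informative Gamma priors
lam_bar2, lam_tilde2 = 1.0, 1.0           # initialization (EB estimates in the thesis)
samples = []

def draw_w(lam2, B, S_other):
    """Sample w from its Gaussian full conditional, with S_other the covariance
    of the terms treated as noise."""
    P = np.linalg.inv(np.eye(n) / lam2 + B.T @ np.linalg.solve(S_other, B))
    m = P @ B.T @ np.linalg.solve(S_other, y)
    return rng.multivariate_normal(m, (P + P.T) / 2.0)

for _ in range(h_total):
    # typical-curve block: w_bar, then 1/lam_bar^2
    S_tilde = lam_tilde2 * D @ D.T + Sv
    w_bar = draw_w(lam_bar2, Phi, S_tilde)
    lam_bar2 = 1.0 / rng.gamma(g1 + n / 2.0, 1.0 / (g2 + w_bar @ w_bar / 2.0))
    # individual-shift block: w_tilde, then 1/lam_tilde^2
    S_bar = lam_bar2 * Phi @ Phi.T + Sv
    w_tilde = draw_w(lam_tilde2, D, S_bar)
    lam_tilde2 = 1.0 / rng.gamma(g3 + n / 2.0, 1.0 / (g4 + w_tilde @ w_tilde / 2.0))
    samples.append((lam_bar2, lam_tilde2))

post = np.array(samples[burn_in:])        # posterior samples after burn-in
print(post.mean(axis=0))
```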

4.3 Estimation of the Continuous-Time Signals

The MCMC algorithm described in the previous section hinges on the sampled model (4.1) and, as such, can estimate the typical and individual curves only at the sampling knots. In this section, the posterior distribution of λ̄² and λ̃² is exploited to obtain the posterior expectation and variance, as well as the posterior distribution, of the typical and individual curves at any time point.

4.3.1 The Posterior Expectation

The following results provide the point estimates for the typical curve and the individual ones.

Proposition 6 Let

c̄^j_k = (1/h) Σ_{i=1}^{h} λ̄²ᵢ c^j_k(λ̄²ᵢ, λ̃²ᵢ),            (4.2)

c(λ̄²ᵢ, λ̃²ᵢ) := [c^1_1 ... c^1_{n_1}  c^2_1 ... c^2_{n_2}  ...  c^N_1 ... c^N_{n_N}]ᵀ := Var[y|λ̄²ᵢ, λ̃²ᵢ]⁻¹ y,   i = 1, ..., h.            (4.3)

Then,

ẑ(t) := E[z̄(t)|y] ≈ Σ_{j=1}^{N} Σ_{k=1}^{n_j} γ̄(t, t^j_k) c̄^j_k            (4.4)


Proof: By the total probability theorem, the posterior expectation is

E[z̄(t)|y] = ∫∫ E[z̄(t)|y, λ̄², λ̃²] p(λ̄², λ̃²|y) dλ̄² dλ̃²

and, using the samples drawn by the MCMC algorithm, it can be approximated as

E[z̄(t)|y] ≈ (1/h) Σ_{i=1}^{h} E[z̄(t)|y, λ̄²ᵢ, λ̃²ᵢ]

According to a well-known formula for jointly Gaussian random variables,

E[z̄(t)|y, λ̄²ᵢ, λ̃²ᵢ] = E[z̄(t)|λ̄²ᵢ, λ̃²ᵢ] + Cov[z̄(t), y|λ̄²ᵢ, λ̃²ᵢ] Var[y|λ̄²ᵢ, λ̃²ᵢ]⁻¹ (y − E[y|λ̄²ᵢ, λ̃²ᵢ])

Under the given assumptions, E[z̄(t)|λ̄²ᵢ, λ̃²ᵢ] = 0 and E[y|λ̄²ᵢ, λ̃²ᵢ] = 0 for all λ̄²ᵢ, λ̃²ᵢ, while

Cov[z̄(t), y|λ̄²ᵢ, λ̃²ᵢ] = Cov[z̄(t), z̄ + z̃ + v|λ̄²ᵢ, λ̃²ᵢ] = Cov[z̄(t), z̄|λ̄²ᵢ] = λ̄²ᵢ [γ̄(t, t^1_1) ... γ̄(t, t^N_{n_N})]

Var[y|λ̄²ᵢ, λ̃²ᵢ] = Var[z̄|λ̄²ᵢ] + Var[z̃|λ̃²ᵢ] + Σv

Then, recalling the definition of c(λ̄²ᵢ, λ̃²ᵢ), one has that

E[z̄(t)|y] ≈ Σ_{i=1}^{h} (1/h) Σ_{j=1}^{N} Σ_{k=1}^{n_j} λ̄²ᵢ γ̄(t, t^j_k) c^j_k(λ̄²ᵢ, λ̃²ᵢ)            (4.5)

from which the thesis follows. □

The computationally intensive step of the algorithm is (4.3), where the number of operations required to calculate Var[y|λ̄²ᵢ, λ̃²ᵢ]⁻¹y scales as the cube of the number n of data. However, as shown in [12], Var[y] can be inverted in O(n) operations, via Kalman filtering techniques, if y is a vector obtained by sampling a process whose spectrum is rational. Analogous ideas can be used to efficiently evaluate the inverses needed in the MCMC simulation (Section 4.2).

Proposition 7

ẑ_j(t) := E[zj(t)|y] ≈ ẑ(t) + Σ_{k=1}^{n_j} γ̃(t, t^j_k) c̃^j_k

where

c̃^j_k = (1/h) Σ_{i=1}^{h} λ̃²ᵢ c^j_k(λ̄²ᵢ, λ̃²ᵢ)            (4.6)

and the c^j_k(λ̄²ᵢ, λ̃²ᵢ) have been defined in Proposition 6.


Proof: All the considerations made about the estimation of the typical curve apply also to the estimation of the individual curves, with the only difference that γ̄(t, τ) is replaced by γ̃(t, τ).

4.3.2 Population Regularization Network

It is noteworthy that, in analogy with the results of Chapter 3, ẑ(t) and ẑ_j(t) are obtained as linear combinations of the auto-covariance functions γ̄(t, τ) and γ̃(t, τ), centred at the sampling knots τ = t^j_k. As already mentioned in the previous chapter, this structure, which characterizes estimators obtained via Gaussian process estimation or Tychonov regularization, is known as a Regularization Network. This means that the estimators of Propositions 6 and 7 can be regarded as a kind of RN (see Fig. 4.1) in which there are two types of neurons. A first set of n neurons receives t as input and has γ̄(t, t^j_k) as activation functions; their outputs are linearly combined through the weights c̄^j_k to obtain ẑ(t) as output. The individual shifts z̃j(t) are estimated by a second set of n neurons having γ̃(t, t^j_k) as activation functions and c̃^j_k as weights.

Figure 4.1: Regularization Network (RN) with 2n neurons.

The weights c̄^j_k and c̃^j_k are computed as ergodic averages, see (4.2) and (4.6), of the coefficients c^j_k(λ̄²ᵢ, λ̃²ᵢ), which are obtained through the solution of a system of linear equations, (4.3), whose order is equal to the number of sampling knots. Note that this represents the main difference between the aforementioned Regularization Network and the one reported in Fig. 3.1: in the RN described in Chapter 3, the typical curve and the individual shifts are weighted by the same coefficients, while in this case the weights used for reconstructing ẑ(t) and ẑ_j(t) are different. Recalling that the auto-covariances γ̄(t, t^j_k) are piecewise cubic polynomials, another important consequence of the structure of the estimator is that the estimate ẑ(t) is a cubic spline.
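A sketch of how the RN weights and the typical-curve estimate can be assembled from the posterior samples, following (4.2)-(4.6), is given below; the kernels are passed as callables and all names are illustrative.

```python
import numpy as np

def rn_estimates(t_grid, t_knots, subj, gamma_bar, gamma_tilde, Sv, y, post):
    """Population Regularization Network of Propositions 6-7 (a sketch).

    post is a sequence of posterior samples (lam_bar2_i, lam_tilde2_i) drawn by the
    Gibbs sampler; gamma_bar and gamma_tilde are the autocovariance kernels."""
    S, T = np.meshgrid(t_knots, t_knots, indexing="ij")
    Gb = gamma_bar(S, T)
    Gt = gamma_tilde(S, T) * (subj[:, None] == subj[None, :])   # block-diagonal
    cb = np.zeros(len(t_knots))          # ergodic averages of lam_bar2_i  * c_i
    ct = np.zeros(len(t_knots))          # ergodic averages of lam_tilde2_i * c_i
    for lb2, lt2 in post:
        c = np.linalg.solve(lb2 * Gb + lt2 * Gt + Sv, y)        # eq. (4.3)
        cb += lb2 * c / len(post)                               # eq. (4.2)
        ct += lt2 * c / len(post)                               # eq. (4.6)
    Kb = gamma_bar(t_grid[:, None], t_knots[None, :])
    z_bar_hat = Kb @ cb                                         # eq. (4.4)
    # the individual curve of subject j adds the gamma_tilde terms restricted
    # to the knots of subject j, weighted by ct, as in Proposition 7
    return z_bar_hat, cb, ct
```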

4.3.3 The Posterior Variance

In order to compute the posterior variance of z̄(t), recall that

Var[z̄(t)|y] = E[z̄(t)²|y] − E[z̄(t)|y]²            (4.7)

Note that E[z̄(t)|y] is known from (4.4). Moreover,

E[z̄(t)²|y] = ∫ z̄(t)² ∫∫ p(z̄(t)|y, λ̄², λ̃²) p(λ̄², λ̃²|y) dλ̄² dλ̃² dz̄
           = ∫∫ E[z̄(t)²|y, λ̄², λ̃²] p(λ̄², λ̃²|y) dλ̄² dλ̃²
           ≈ (1/h) Σ_{i=1}^{h} E[z̄(t)²|y, λ̄²ᵢ, λ̃²ᵢ]

where

E[z̄(t)²|y, λ̄²ᵢ, λ̃²ᵢ] = Var[z̄(t)|y, λ̄²ᵢ, λ̃²ᵢ] + E[z̄(t)|y, λ̄²ᵢ, λ̃²ᵢ]²

The conditional expectation E[z̄(t)|y, λ̄²ᵢ, λ̃²ᵢ] can be computed according to Proposition 6, while

Var[z̄(t)|y, λ̄²ᵢ, λ̃²ᵢ] = Var[z̄(t)|λ̄²ᵢ] − Cov[z̄(t), y|λ̄²ᵢ] Var[y|λ̄²ᵢ, λ̃²ᵢ]⁻¹ Cov[z̄(t), y|λ̄²ᵢ]ᵀ

Analogous considerations hold for Var[zj(t)|y].

Although this is more demanding from a computational point of view, the 5% and 95% percentiles of the posterior distributions of the typical and individual curves could be calculated in a similar way. Indeed, by the total probability theorem, the posterior distribution of z̄(t) can be approximated as the average of h Gaussian distributions whose means and variances are E[z̄(t)|y, λ̄²ᵢ, λ̃²ᵢ] and Var[z̄(t)|y, λ̄²ᵢ, λ̃²ᵢ], respectively.
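At a fixed time point, the mixture-of-Gaussians summary described above can be sketched as follows; the root-finding step for the percentiles is an assumed implementation choice, and the numbers in the usage line are purely illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def mixture_summary(means, variances, q=(0.05, 0.95)):
    """Posterior mean, variance and percentiles at a fixed time point, given the
    per-sample conditional moments E[z(t)|y, lam_i] and Var[z(t)|y, lam_i]."""
    means, variances = np.asarray(means), np.asarray(variances)
    mean = means.mean()
    var = (variances + means ** 2).mean() - mean ** 2        # eq. (4.7)
    cdf = lambda x: norm.cdf(x, loc=means, scale=np.sqrt(variances)).mean()
    lo = means.min() - 6 * np.sqrt(variances.max())
    hi = means.max() + 6 * np.sqrt(variances.max())
    percentiles = [brentq(lambda x, p=p: cdf(x) - p, lo, hi) for p in q]
    return mean, var, percentiles

# toy usage with hypothetical per-sample moments
m, v, pct = mixture_summary(means=[1.0, 1.2, 0.9], variances=[0.04, 0.05, 0.03])
```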

4.4 Examples

In this section, the MCMC identification scheme is applied to a simulated and an experimental case study. In all the considered cases, the estimates were obtained on the basis of 2000 MCMC runs. The nodes of the Markov chain corresponding to λ̄² and λ̃² were initialized with their Empirical Bayes (EB) estimates. In order to compare the MCMC approach with the Empirical Bayes one, the same problems as in Chapter 3 are considered.


4.4.1 Simulated Example: Sparsely Sampled Data

The results presented here were obtained by applying the MCMC methodology to the simulated data sets introduced in Chapter 3, Section 3.4.2. Again the time coordinates were transformed according to tnew = 1/(1 + t/µ), with µ maximizing the minimum distance between each pair of sampling instants (µ = 3.00). Moreover, both the typical curve and the individual shifts were modelled as integrated Wiener processes. The MCMC identification scheme was run on all the 500 data sets. The results of the identification for data set #78 are given in Figs. 4.2-4.7.

Figure 4.2: Histogram of the 2000 samples extracted from the posterior of λ̄². The Empirical Bayes estimate is also reported (white diamond).

In Figs. 4.2-4.3, the histograms of the samples extracted from the posteriors of λ̄² and λ̃² are plotted together with the Empirical Bayes estimates computed in the previous chapter. The scatter plot of the samples of λ̄² and λ̃² is reported in Fig. 4.4. The most apparent feature is the large variance of the posterior of λ̄² compared to that of λ̃². This is not unexpected, since λ̄² has to do with the variance of the second derivative of the typical curve, which is more time-varying than the individual shifts that correspond to λ̃².

In Figs. 4.5-4.7, the estimated typical curve and two of the seven individual curves are plotted with their 95% confidence intervals. It is seen that the estimates are accurate for both the typical and the individual curves. A comparison with Figs. 3.5-3.7 in Chapter 3 shows that there is an almost perfect agreement between the MCMC and Empirical Bayes estimates, but that the Empirical Bayes method underestimates the size of the confidence intervals. This is a consequence of the fact that the Empirical Bayes intervals neglect the uncertainty in the hyper-parameter estimates. In order to assess how the accuracy of the estimates of the individual curves depends on the number of data collected for each curve, one can look at the boxplots of the RMSEs reported in Fig. 4.8. The RMSE was


Figure 4.3: Histogram of the 2000 samples extracted from the posterior of λ̃². The Empirical Bayes estimate is also reported (white diamond).

Figure 4.4: Scatter plot of the samples of λ̄² and λ̃². The Empirical Bayes estimate is also reported (white diamond).


Figure 4.5: True (dashed) vs. estimated (solid) typical curve with its 95% confidence intervals and available data (open circles).

Figure 4.6: Individual curve #4: true (dashed) vs. estimated (solid) curve with its 95% confidence intervals and available data.


Figure 4.7: Individual curve #6: true (dashed) vs. estimated (solid) curve with its 95% confidence intervals and available data.

Figure 4.8: RMSE of each individual curve and of the typical curve computed on 500 data sets.


computed using equation (3.11) for both the individual curves and the typical curve z̄(t). The comparison between Fig. 4.8 and Fig. 3.8 shows that, except for a few subjects which present a greater RMSE, the overall behaviour of the estimates obtained with the MCMC approach is equivalent to that obtained using the EB methodology.

4.4.2 Analysis of Pharmacokinetic Data

The second data set is the one related to xenobiotics administration in 27 human subjects introduced in Chapter 3, Section 3.4.3. The same sampling protocol and time transformation were adopted. The results are shown in Figs. 4.9-4.14. The histograms of the values of λ̄² and λ̃² sampled from the posterior are plotted in Figs. 4.9-4.10, together with the Empirical Bayes estimates.

Figure 4.9: Histogram of the 2000 samples extracted from the posterior of λ̄². The Empirical Bayes estimate is also reported (white diamond).

The scatter plot is reported in Fig. 4.11. Also in this case, λ̄² exhibits a larger variability than λ̃², a fact which is due to the typical curve accounting for most of the variability in the data. In Figs. 4.12-4.14, the estimated typical curve and the estimated individual curves of subjects #5 and #19 are plotted together with their 95% confidence intervals. A comparison with Figs. 3.10-3.12 in Chapter 3, where the same curves were estimated using the EB approach, shows that the MCMC and the EB estimates are practically coincident, but that the EB intervals are underestimated.


Figure 4.10: Histogram of the 2000 samples extracted from the posterior of λ̃². The Empirical Bayes estimate is also reported (white diamond).

Figure 4.11: Scatter plot of the samples of λ̄² and λ̃². The Empirical Bayes estimate is also reported (white diamond).


Figure 4.12: Estimated typical curve (bold) with its 95% confidence intervals.

Figure 4.13: Estimated individual curve of subject #5 (bold) with its 95% confidence intervals. For this individual curve only three data (full circles) were available. In order to assess the quality of the reconstruction, the other five unobserved data (open circles) are also plotted.


Figure 4.14: Estimated individual curve of subject #19 (bold) with its 95% confidence intervals. For this individual curve only one datum (full circle) was available. In order to assess the quality of the reconstruction, the other seven unobserved data (open circles) are also plotted.

4.5 Concluding Remarks

An MCMC scheme that performs the estimation of a population of curves has been developed and tested on simulated and real data. To improve the computational efficiency, Monte Carlo sampling of the curves is performed only at the sampling instants, and the inter-sample posterior expectation and variance of the curves are subsequently reconstructed on the basis of the posterior distribution of the hyper-parameters. It has been proven that, if the curves are modelled as integrated Wiener processes, their estimates are cubic splines whose knots coincide with the instants at which the individual curves are sampled. The comparison with the EB estimates, performed on real and simulated data, shows an almost perfect agreement between the two point estimates. On the other hand, the EB approach underestimates the size of the confidence intervals, so that the MCMC approach, though more burdensome, is to be preferred.


Chapter 5

Nonparametric Identification of Engine Maps

Mean Value Models (MVMs) are a widespread tool for engine dynamics description and simulation in automotive industries and research centres. A major advantage of this kind of model is its simple structure: no more than three non-linear differential equations, related to the air flow through the intake manifold, the fluid film on the manifold wall and the engine crankshaft dynamics, respectively. Hence, MVMs are well suited to easy and fast implementation in block-based simulators like Simulink or SystemBuild. Another appeal of MVMs is their good predictive performance in terms of Mean Square Error (MSE) between measured and simulated intake manifold pressure and engine speed, which makes them a valuable tool for engine control and diagnosis [26, 33, 65]. Remarkably, MVMs make it possible to create a complete dynamic model using only static bench-test data, i.e. the throttle valve flow, the volumetric efficiency, the net indicated torque and the friction torque in static conditions [32]. Thus, a good engine model can be built without resorting to dynamic tests, and the available dynamic data can be used for fine tuning and validation purposes. For these reasons MVMs are gaining wider acceptance, also in view of the need for fast design and prototyping in the face of increasing engine complexity. MVMs are used for control purposes (for an efficient fluid film compensation in idle control, or for an adequate fuel dosage mostly in rapid transient conditions), for diagnosis purposes (e.g. detecting a hole in the manifold wall or a drift in the pressure sensor) or for safety applications (engine shut-off can be forced if a sudden fault in the drive-by-wire throttle valve is detected).

The construction of an MVM requires the identification of the engine maps that give the air flow entering the cylinders, the net indicated torque and the friction torque as functions of the manifold pressure (p) and the crankshaft speed (n). An engine bench is used that keeps the engine operating point at fixed values, so that steady-state measurements of the dependent variable (the indicated torque, for instance) are collected. Then, from a set of data taken at different points on the (n, p) plane, the entire surface describing the map has to be estimated.

A first reason that renders engine map estimation a nontrivial problem has to do with data acquisition. In fact, every single static measurement is expensive in terms of test bench occupation and operator working time [43]. Moreover, there are regions in the (n, p) plane where static measurements cannot be taken because they are not stability points for the


engine, although they may be reached during engine transients. Therefore, the surface has to be estimated from a finite set of sparse data that cannot even cover the whole region of interest.

The engine map estimation problem has been addressed using various approaches [4, 13, 34], none of which, however, is completely satisfactory:

• Polynomial models tend to produce oscillating functions if the order is too high; this restricts the choice to low-order models, which may prove inadequate to capture all the features of the map;

• Additive models [29] impose the strong assumption that the map is given by the sum of univariate functions;

• Radial Basis Function Neural Networks (RBFNN) [31, 56] to some extent suffer from the same bias/overfitting problems as polynomial models (too many neurons produce overfitting, while a smaller network may yield biased estimates); moreover, it may be difficult to handle boundary conditions for the map;

• Multi Layer Perceptron Neural Networks (MLP) [37] do not overcome the bias/overfitting dilemma; moreover, it may be difficult to control their extrapolation behaviour, which is affected by the initialization of the training procedure [13].

The previous observations motivate the search for new methodologies for the identification of engine maps. A good technique should guarantee:

• flexibility of the surface in order to reproduce the profile of the true map;

• smoothness of the estimated map that should not suffer from overfitting;

• reliable extrapolation in regions where static data are not available;

• proper handling of boundary conditions in order to exploit the prior knowledge on the shape of the map.

In the present chapter, bias reduction is accomplished by searching the model within a rich functional class, i.e. a basis function neural network with a large number of neurons, whereas overfitting is avoided by using regularization rather than least-squares regression for parameter estimation [39, 56, 66]. The proposed model achieves the objective of reliable extrapolation by combining its smoothness properties, guaranteed by the use of regularization, with an explicit handling of boundary conditions. In fact, the model structure is such that zero boundary conditions can be imposed easily on one or two axes. This helps to obtain good predictive performance, because extrapolation typically occurs close to the boundaries.


5.1 Experimental Setup

The experimental data were collected on an M138 engine, planned for the Maserati Coupe and Spider and built by Ferrari Auto at the Maranello plant. Two kinds of tests were performed: a first set in static regime for the map estimation (steady-state tests or engine bench tests), and a second set under dynamic conditions for the model validation (dynamic tests or on-board tests). Engine bench tests were conducted with a completely warmed-up engine to acquire the mean effective torque, the air flow past the throttle plate, the manifold air pressure, the relative air/fuel ratio λ_{A/F}, the throttle opening angle, and the crankshaft speed at N = 92 different operating points. During on-board tests, sequences of fast accelerations/decelerations were performed with the engine in neutral gear, reaching high speeds and then letting the engine, in its free movement, reach the idle speed range (large and fast throttle transients). These tests were conducted acquiring the spark advance, the throttle opening angle, the relative air/fuel ratio λ_{A/F}, the crankshaft speed and the manifold air pressure (sampling period equal to 10 ms). The experiments conducted in neutral gear represent a sort of "worst case" scenario, in that this corresponds to a minimal momentum of inertia and a less stable dynamics.

5.2 Engine Model

The engine model has two major dynamic components: the manifold filling and emptying dynamics and the rotational crankshaft dynamics. The state equation for the manifold pressure is obtained by applying the mass conservation law and the ideal gas law:

ṗ = (R T_man / V_man) (ṁ_at − ṁ_ac)            (5.1)

where the air mass flow rate into the cylinders, ṁ_ac, is a function of the engine speed n and of the intake manifold pressure p. The other variables in the equation are the air gas constant R, the intake manifold air temperature T_man, the intake manifold volume V_man and the air mass flow rate past the throttle plate ṁ_at. The air flow rate ṁ_at depends on the throttle angle α, the intake pressure p, the throttle upstream pressure P0, and the throttle upstream air temperature T0, according to the standard equation for compressible fluid flow through an orifice:

ṁ_at = A(α) · (P0 / √(R T0)) · β(p/P0)            (5.2)

where A(α) is calculated from air flow data collected in the steady-state tests, and

β(p/P0) = √( (2/(κ−1)) [ (p/P0)^{2/κ} − (p/P0)^{(κ+1)/κ} ] ) / √( (2/(κ+1))^{(κ+1)/(κ−1)} )   (subsonic flow)
β(p/P0) = 1   (sonic flow)

where κ = 1.4 is the ratio of the specific heat capacities of the air. The first formula is used in the case of subsonic flow, i.e.


(p/P0) ≥ (2/(κ+1))^{κ/(κ−1)}

whereas the second one is used for sonic flow.

The crankshaft speed is obtained by differentiating the energy conservation law:

ṅ(t) = (30/π) · (1/J) · (MET − LT)            (5.3)

where J is the engine momentum of inertia, LT is the load torque and the mean effective torque (MET) is modelled as the difference between the mean indicated torque (MIT) and the mean friction torque (MFT). The mean indicated torque is in general a function of the controllable variables spark advance (Θ) and air/fuel relative ratio (λ_{A/F}) and of the state variables crankshaft speed (n) and manifold pressure (p). Different experimental campaigns related to various engines (including the Ferrari engine described in this chapter) have clearly shown that this 4-variable function can be split into the product of 1-variable functions of Θ and λ_{A/F} and a 2-variable function of p and n, without any appreciable loss of prediction capability:

MIT(Θ, λ_{A/F}, n, p) = MIT₀(n, p) · η_Θ · η_{λA/F}

Therefore, it follows that:

MET(Θ, λ_{A/F}, n, p) = MIT − MFT = MIT₀(n, p) · η_Θ · η_{λA/F} − MFT(n, p)

where the efficiencies η_Θ (a function of the difference between the optimum spark advance and the actuated one) and η_{λA/F} (a function of the air/fuel relative ratio λ_{A/F}) are calculated by the ECU. The mean friction torque, MFT, and the potential mean indicated torque, MIT₀, are functions of the engine speed n and of the intake manifold pressure p. Note that the model described here refers to a warmed-up engine and that the fluid film dynamics has been feed-forward compensated, thus restricting the complete model to the two non-linear differential equations (5.1) and (5.3).
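The two-state MVM of equations (5.1) and (5.3) can be integrated numerically as sketched below with a simple forward-Euler scheme; the static maps and all numerical values are placeholders, not the identified maps of this chapter.

```python
import numpy as np

# Physical constants and placeholder engine parameters (illustrative values only).
R_air, T0, P0 = 287.0, 293.0, 1.013e5      # J/(kg K), K, Pa
Tman, Vman, J = 310.0, 3.0e-3, 0.15        # K, m^3, kg m^2
kappa = 1.4

def beta(pr):
    """Compressible-flow function of eq. (5.2), with sonic saturation."""
    pr_crit = (2.0 / (kappa + 1.0)) ** (kappa / (kappa - 1.0))
    pr = np.clip(pr, pr_crit, 1.0)         # below the critical ratio the flow is choked
    num = np.sqrt(2.0 / (kappa - 1.0) * (pr ** (2.0 / kappa) - pr ** ((kappa + 1.0) / kappa)))
    den = np.sqrt((2.0 / (kappa + 1.0)) ** ((kappa + 1.0) / (kappa - 1.0)))
    return num / den

# Placeholder static maps (in the thesis these are the identified engine maps).
A_of_alpha = lambda a: 1e-4 * a                      # effective throttle area [m^2]
m_ac = lambda n, p: 1e-8 * n * p                     # air flow into cylinders [kg/s]
MIT0 = lambda n, p: 40.0 * p / 1.0e5                 # potential indicated torque [N m]
MFT = lambda n, p: 5.0 + 1e-3 * n                    # friction torque [N m]

def simulate(alpha, LT=0.0, eta=1.0, dt=1e-3, T=5.0, p0=0.4e5, n0=1000.0):
    """Forward-Euler integration of the manifold pressure (5.1) and speed (5.3)."""
    p, n, traj = p0, n0, []
    for _ in range(int(T / dt)):
        m_at = A_of_alpha(alpha) * P0 / np.sqrt(R_air * T0) * beta(p / P0)
        p += dt * R_air * Tman / Vman * (m_at - m_ac(n, p))
        MET = MIT0(n, p) * eta - MFT(n, p)
        n += dt * 30.0 / np.pi / J * (MET - LT)
        traj.append((p, n))
    return np.array(traj)

traj = simulate(alpha=10.0)
```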

5.3 Engine Map Estimation

5.3.1 Model Definition

Hereafter, the true engine map will be indicated by f⁰(n, p), where n and p denote the crankshaft speed (rpm) and the manifold pressure (mbar). The basis function neural network model is

f(n, p, θ) = θ0 + f_n(n, θ_n) + f_p(p, θ_p) + f_{n,p}(n, p, θ_{np})            (5.4)

where

f_x(x, θ_x) = θ^x_0 x + ∫₀ˣ RBF(ξ, θ_x) dξ,   x = n, p


f_{n,p}(n, p, θ_{np}) = ∫₀ⁿ ∫₀ᵖ RBF(ξ, η, θ_{np}) dξ dη            (5.5)

RBF(x, θ_x) = Σ_{i=1}^{N_x} θ^x_i h(‖x − x_i‖),   x = n, p

RBF(n, p, θ_{np}) = Σ_{i=1}^{N_{np}} θ^{np}_i h(‖[n p]ᵀ − [n_i p_i]ᵀ‖),   h(r) = e^{−r²/(2s²)}

In the above expressions, RBF denotes a Radial Basis Function neural network with Gaussian kernel h(r). The centres (denoted by x_i and [n_i p_i]) of the one-dimensional networks RBF(x, θ_x), x = n, p, and of the two-dimensional network RBF(n, p, θ_{np}) are located on regular grids spanning the operating region (21×1 and 21×21 centres, respectively, so that N_n = N_p = 21 and N_{np} = 441). For a fixed value of s, which controls the width of the Gaussian kernel, the overall model is linear in the 486×1 parameter vector θ = [θ0 θ^n_0 θ^p_0 θ_nᵀ θ_pᵀ θ_{np}ᵀ]ᵀ, with θ0 ∈ R, θ^n_0 ∈ R, θ^p_0 ∈ R, θ_n ∈ R^{N_n}, θ_p ∈ R^{N_p}, θ_{np} ∈ R^{N_{np}}, θ ∈ R^q, q = 3 + N_n + N_p + N_{np}.

The use of integrated RBFs as basis functions is a means to reduce the bias error. In fact, standard RBF networks tend to zero outside the grid, and this hampers the extrapolation properties of the model, especially when the surface is non-stationary as a function of the n and p variables, i.e. it exhibits trends. Herein, the network RBF(n, p, θ_{np}) is used to model the second derivative ∂²f/∂n∂p which, in the map estimation problem, is usually more stationary. Since f_{n,p}(n, p, θ_{np}) = 0 when either n = 0 or p = 0, the univariate networks RBF(x, θ_x), x = n, p, are essential in order to take into account the non-zero boundary conditions of the map.
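Because the Gaussian kernel is separable, the integrated basis functions of (5.4)-(5.5) admit closed-form expressions in terms of the error function. The sketch below builds one regressor row under the assumption that the inputs have been normalized to [0, 1]; grid sizes match the text, while the kernel width and evaluation point are illustrative.

```python
import numpy as np
from scipy.special import erf

def int_gauss(x, c, s):
    """Closed-form integral of a 1-D Gaussian RBF: int_0^x exp(-(xi-c)^2/(2 s^2)) dxi."""
    a = s * np.sqrt(2.0)
    return s * np.sqrt(np.pi / 2.0) * (erf((x - c) / a) - erf((0.0 - c) / a))

def regressor(n, p, n_centres, p_centres, s):
    """Regression-matrix row for one operating point (n, p):
    [1, n, p, integrated 1-D RBFs in n, in p, integrated 2-D RBFs in (n, p)]."""
    phi_n = int_gauss(n, n_centres, s)                 # univariate terms in n
    phi_p = int_gauss(p, p_centres, s)                 # univariate terms in p
    # the 2-D Gaussian is separable, so its double integral factorizes
    phi_np = np.outer(int_gauss(n, n_centres, s), int_gauss(p, p_centres, s)).ravel()
    return np.concatenate(([1.0, n, p], phi_n, phi_p, phi_np))

# 21-point grids over the normalized operating region [0, 1] x [0, 1]
n_centres = np.linspace(0.0, 1.0, 21)
p_centres = np.linspace(0.0, 1.0, 21)
s = 0.08                                               # kernel width in normalized units
row = regressor(0.4, 0.5, n_centres, p_centres, s)     # 3 + 21 + 21 + 441 = 486 entries
```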

5.3.2 Parameter Estimation

The training data are given by the triplets (z_k, n_k, p_k), k = 1, ..., N, where the pair (n_k, p_k) specifies a point in the (n, p) plane, whereas z_k denotes a static measurement which, depending on the considered map, may be a measurement of air flow, mean indicated torque or mean friction torque. Given that the number of parameters (q = 486) is largely greater than the number of static measurements (N = 92), least squares regression is clearly inappropriate. Rather, the estimated parameter vector θ̂ is computed as

θ̂ = arg min_θ J_γ(θ),   J_γ(θ) = SSR(θ) + γ ‖θ̃‖²

where θ̃ = [θ_nᵀ θ_pᵀ θ_{np}ᵀ]ᵀ, SSR is the sum of squared residuals and γ is the so-called regularization parameter, which must be regarded as a hyper-parameter for the problem under study. Note that the calculation of θ̂ reduces to the solution of a system of linear equations [56, 76]. If γ is "small", the surface f(n, p, θ̂) will tend to be wiggly and, in the limit as γ → 0, data interpolation will be obtained. Conversely, larger values of γ will yield smoother surfaces and, in the limit, a plane, because the coefficients θ0, θ^n_0 and θ^p_0 of the linear part of the model are not penalized in the cost function J_γ. In order to find the best trade-off between data fitting and smoothness, the value of γ is selected by minimizing the Generalized Cross Validation (GCV) criterion


GCV(γ) = N · SSR / (N − q(γ))²

where q(γ) are the so-called equivalent degrees of freedom, a number that in this case ranges from 3 (the degrees of freedom of a plane) to the number N of data as γ varies from ∞ to zero. The degrees of freedom q(γ) are defined as q(γ) = trace(H), where H is the so-called "hat matrix" [39, 76]. Letting ẑ_k = f(n_k, p_k, θ̂) be the output of the network fed by the input (n_k, p_k), the matrix H is such that [ẑ1 ... ẑN]ᵀ = H [z1 ... zN]ᵀ. In the actual implementation of the network, GCV minimization is used to jointly optimize γ and s (the width of the Gaussian kernel).
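A compact sketch of the regularized fit and of GCV-based selection of γ is given below, assuming the regression matrix has already been built from the integrated basis functions; the toy data and function names are illustrative.

```python
import numpy as np

def fit_regularized(X, z, gamma, n_unpenalized=3):
    """Regularized least squares with the first n_unpenalized coefficients
    (the linear part of the model) left unpenalized, plus the GCV score."""
    N, q = X.shape
    P = np.eye(q)
    P[:n_unpenalized, :n_unpenalized] = 0.0              # do not penalize theta0, theta0^n, theta0^p
    A = X.T @ X + gamma * P
    theta = np.linalg.solve(A, X.T @ z)
    H = X @ np.linalg.solve(A, X.T)                      # hat matrix: z_hat = H z
    ssr = np.sum((z - X @ theta) ** 2)
    dof = np.trace(H)                                    # equivalent degrees of freedom q(gamma)
    gcv = N * ssr / (N - dof) ** 2
    return theta, gcv

def select_gamma(X, z, gammas):
    """Pick the regularization parameter minimizing GCV over a candidate grid."""
    scores = [fit_regularized(X, z, g)[1] for g in gammas]
    return gammas[int(np.argmin(scores))]

# toy usage with random regressors (in the chapter X comes from the integrated RBFs)
rng = np.random.default_rng(0)
X = rng.normal(size=(92, 486))
z = X[:, :5] @ rng.normal(size=5) + 0.1 * rng.normal(size=92)
best_gamma = select_gamma(X, z, np.logspace(-4, 4, 30))
```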

5.4 Results

5.4.1 Map Estimation

The new surface reconstruction method has been applied to the estimation from static data of the following three engine maps: air flow past the throttle plate, net indicated torque and friction torque. All these maps are functions of the manifold pressure p and the crankshaft speed n.

The air-flow map has zero boundary conditions in the sense that f⁰(n, p) = 0 when either n = 0 or p = 0. Therefore, the regularized network reduces to f(n, p, θ) = f_{n,p}(n, p, θ_{np}). Conversely, it was not necessary to impose zero boundary conditions on the net indicated torque and the friction torque maps. For the sake of comparison, the three maps were also estimated using a standard radial basis function neural network. For this purpose the function "newrb" of the MatLab Neural Network Toolbox [14] was employed. The two design parameters "goal" and "spread" were tuned so as to minimize the Cp statistics [28], Cp = SSR + 2qσ², where the degrees of freedom are q = 1 + N_Ne, with N_Ne representing the number of neurons, and σ² is the variance of the output measurement error. Such variance was estimated as σ̂² = SSR/(N − q), where SSR is the sum of squared residuals obtained using a q-th order over-parameterized polynomial model (air flow map: σ̂² = 169.68 [Kg/h]², mean indicated torque map: σ̂² = 1.026 [Kgm]², mean friction torque map: σ̂² = 0.0204 [Kgm]²). For the three maps, the numbers of degrees of freedom q were 33, 32 and 19, respectively. For the air-flow map the obtained results are illustrated in Figs. 5.1-5.2. From the three-dimensional plots in Fig. 5.1 it is apparent that the regularized network approach produces a smoother surface. This fact is better appreciated in Fig. 5.2, where the constant-speed curves as a function of p are plotted against the data. It is also seen that, for high values of the pressure, the standard RBF map falls to zero outside the range of the data. Moreover, the regularized network provides more regular extrapolations for low pressure values. Analogous considerations hold for the other two maps, Figs. 5.3-5.4. In particular, the mean friction torque map estimated via standard RBFNN (first panel of Fig. 5.4) does not provide sensible extrapolations for low pressure values, as it collapses to zero due to the local nature of the Gaussian basis functions. Summarizing, in spite of the large number of parameters, the use of regularization has not only proven very effective in avoiding any kind of overfitting, but also provides sensible extrapolations.


Figure 5.1: Comparison between the air flow map estimated according to a standard RBFNN methodology (first panel) and the map estimated according to the new regularized basis network approach (second panel).


Figure 5.2: Constant-speed lines of the air flow map against experimental data for the standard RBFNN methodology (first panel) and the new regularized basis network approach (second panel).


Figure 5.3: Comparison between the mean indicated torque map estimated according to a standard RBFNN methodology (first panel) and the map estimated according to the new regularized basis network approach (second panel).


Figure 5.4: Comparison between the mean friction torque map estimated according to a standard RBFNN methodology (first panel) and the map estimated according to the new regularized basis network approach (second panel).


5.4.2 Dynamic Validation

Fine Tuning of the MVM

Two out of the five dynamic tests were used for tuning some parameters of the MVM. The remaining three dynamic tests were kept aside for testing the overall MVM. First of all the intake manifold volume Vman in (5.1) and the function A(α) (described by a second-order polynomial) in (5.2) were adjusted so as to minimize the RMSE (Root Mean Square Error) between measured and simulated manifold pressure. This procedure was carried out feeding the MVM with sequences of crankshaft speed and throttle opening collected during the two considered dynamic tests. Then, also the engine momentum of inertia J in (5.3) was adjusted so that the RMSE between the measured and simulated crankshaft speed was minimized, using the overall MVM as a simulator. The fine tuning was carried out twice, plugging either the standard RBFNN maps or the regularized maps into the MVM. Figs. 5.5, 5.6 and 5.7 show the Simulink diagram implemented in order to simulate the dynamic behaviour of the engine according to the specified MVM. Note that the same simulation scheme has been used for the fine tuning step.
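The fine-tuning step is, in essence, a small nonlinear least-squares problem wrapped around the simulator. A possible sketch is given below, where simulate is any callable standing in for the Simulink model (for instance, mapping a candidate vector [Vman, a0, a1, a2] to the simulated manifold-pressure trace for the recorded speed and throttle sequences); the names and the optimizer choice are illustrative, not taken from the thesis.

    import numpy as np
    from scipy.optimize import minimize

    def rmse(a, b):
        return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

    def fine_tune(simulate, measured, params0):
        # Adjust the physical parameters so that the RMSE between the measured and the
        # simulated signal (manifold pressure first, then crankshaft speed) is minimized.
        cost = lambda params: rmse(measured, simulate(params))
        return minimize(cost, np.asarray(params0, dtype=float), method="Nelder-Mead").x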

Figure 5.5: Architecture of the overall Simulink model used to simulate the engine dynamics.

Dynamic Validation on Pressure Data

A first dynamic validation test was performed by verifying whether the air subsystem MVM based on the estimated air flow map was capable of accurately simulating the dynamic behaviour of the manifold pressure. For this purpose, the sequences of crankshaft speed and throttle opening, recorded during the experimental dynamic tests, were applied as inputs to the MVM. Then, the manifold air pressure computed by the model was


Figure 5.6: Architecture of the Intake Manifold subsystem of the Simulink model used to simulate the engine dynamics.

Figure 5.7: Architecture of the Torque Generation subsystem of the Simulink model used to simulate the engine dynamics.


compared with the experimental pressure p. The RMSE for the manifold pressure relative to the three dynamic tests was 12.01 mbar and 8.46 mbar for the standard RBFNN map and the regularized map, respectively.

Dynamic Validation on Pressure and Speed Data

A further dynamic validation was performed by plugging all the three estimated maps into the MVM and simulating both pressure and speed dynamics. For this purpose, signals collected during on-board tests, such as the spark advance, throttle opening and excess-air factor λA/F, were used to feed the MVM. Then, the crankshaft speed and manifold air pressure computed by the model were compared with the experimental speed n and experimental pressure p.

Figure 5.8: Dynamic on-board test. Simulated vs. experimental manifold pressure. First panel: MVM based on a standard RBFNN. Second panel: MVM based on the new regularized basis network approach.

In Fig. 5.8 the time course of the simulated pressure during a dynamic test is plotted against the experimental pressure. In the first panel, the same comparison is presented for


the pressure simulated by the MVM based on standard RBF networks. Analogous plots comparing simulated and experimental crankshaft speed are given in Fig. 5.9. Considering all the three dynamic tests, the RMSE for the manifold pressure was 32.24 [mbar] and 29.56 [mbar] for the standard RBFNN maps and the regularized maps, respectively. As for the crankshaft speed, the RMSE was 217.52 [rpm] and 189.00 [rpm]. In view of the range of the transients and the type of engine, these errors are more than satisfactory.

Figure 5.9: Dynamic on-board test. Simulated vs. experimental crankshaft speed. First panel: MVM based on a standard RBFNN. Second panel: MVM based on the new regularized basis network approach.

5.5 Concluding Remarks

A good engine map estimation strategy should ensure good interpolation and extrapolation properties, a feature that cannot be guaranteed by traditional regression schemes such as polynomial ones. The need for accurate interpolation is obvious in view of the fact that static measurements are collected on a discrete grid. As for the extrapolation


requirements, note that the dynamic trajectories can explore regions where static data cannot be collected for engine stability reasons. Hence, the map estimation algorithms must be able to produce reliable extrapolations for low values of p and n. The good performance obtained in the dynamic tests demonstrates that such a nontrivial extrapolation problem has been effectively solved by the new regularized network approach. Among the future developments one may cite the joint use of dynamic and static data to refine the estimation of engine maps in critical regions.


Chapter 6

Active Learning Strategies for the Neural Estimation of Engine Maps

One of the problems arising in the identification of engine models is the estimation of nonlinear engine maps. As introduced in Chapter 5, such maps take the form of scalar multivariable functions and are typically reconstructed on the basis of static measurements collected during bench tests. The estimation procedure described in that chapter guarantees that the reconstructed maps enjoy both good interpolation and extrapolation properties. One remaining problem concerns the difficulty of obtaining a large number of static measurements covering the whole region of interest, due to the cost of the data collection campaign. Therefore, it is important to optimize the choice of the static experiments in order to maximize the information they convey.

The aforementioned considerations motivate the interest in methodologies borrowed from the theory of optimal DOE (Design of Experiment) [18], which in the neural network literature also goes under the name of Active Learning [9, 20, 39, 69, 70, 72]. Examples of applications to engine modelling are also available [4, 43, 59]. In [4], an active learning methodology is used to derive a procedure that selects the new experimental point maximizing the state of knowledge about the unknown map given the previous experimental data. The gain of knowledge is measured either by means of the entropy [39] or in terms of variance of the generalization error [9]. Arsie and coworkers [4] concluded that the techniques based on active learning guarantee good accuracy and generalization properties with a significant improvement with respect to heuristic selection of training data. An important consequence is that with a proper design of the experiments, a reduction of the training set is possible without deteriorating the approximation quality. Conversely, with a given number of experiments the optimal design will yield better models.

The active learning scheme is an iterative one. For a given training set, the algorithm looks for the next experimental point (e.g. the pair speed-pressure) which is most informative. In order to compute the gain of knowledge due to the new datum (e.g. the air flow measurement), it is necessary to perform a linearization of the model based on the current training set. After having chosen the experiment that maximizes the gain of knowledge, a new training set is formed and the model is updated. It is clear that the procedure is intrinsically a step-by-step one: it is not possible to select in advance a sequence of experiments because the result of each experiment is needed in order


to choose the subsequent one. The main reason for such a limitation is the need to linearize the current model. If the model is linear with respect to its parameters, its linearization does not depend on the actual values of the parameters. Hence, an entire sequence of experiments can be optimally selected in advance without waiting for the outcomes of the experiments. For this reason, in the present chapter the map estimation problem is approached using RBF neural networks, because they can be handled as linear-in-parameter models. In particular, the estimation procedure introduced in Chapter 5 and described also in [50, 51, 52, 53] will be used. Of course, the results presented in this chapter extend to any other type of linear-in-parameter model, e.g. polynomials and splines.

However, the development of an active learning scheme cannot ignore the issue of the choice of the complexity of the model [10]. In its simplest form this corresponds to deciding the order of the model (e.g. the order of the polynomial approximation) and is carried out resorting to cross-validation or statistical criteria such as GCV, OCV, FPE, AIC, Cp [2, 24, 28, 38, 68].

Alternatively, one can tune the flexibility of the model within a Bayesian paradigm, as done in the present chapter. In the Bayesian approach the model parameters are regarded as random variables and are assigned a prior distribution. In the present framework, such a distribution depends on a positive scalar, the so-called regularization parameter, which has to be tuned in order to optimize the model flexibility [73]. Various tuning methods are possible, e.g. cross-validation, GCV, OCV, Cp, Maximum Likelihood [27, 28]. A common feature of these techniques is that they call for an iterative algorithm: the cost functions depend on the residuals, so that the model has to be estimated for several values of the regularization parameter until the criterion is optimized. The need to recompute the estimates several times entails a significant computational burden. Another important consequence is that the advance design of a whole sequence of experiments (which would be viable if the regularization parameter were fixed) is no longer possible, because each datum is needed for tuning the regularization parameter before the next experimental point is decided.

A possible solution to these drawbacks relies on the behaviour of the estimates of the regularization parameter. In fact, it is expected that these estimates converge to an asymptotic value as the number of training data tends to infinity. In view of this, a two-phase procedure is proposed. First, a certain number of data are collected one at a time, readjusting the regularization parameter at each iteration until its value stabilizes. In the second phase, the regularization parameter is kept equal to the average of the last values calculated in the first phase and the subsequent experimental points can be decided all in one go before performing the corresponding experiments.

The two-phase procedure has been tested on a simulated benchmark problem. The data are obtained by adding pseudo-random noise to the mathematical model of the air flow map identified in the previous chapter. In this context, the selection of the experiments amounts to choosing the pairs (speed, pressure) that are most informative. The advantage of using simulated data is the possibility of assessing the quality of approximation by computing the difference between the true engine map and the estimated one. The results demonstrate that the two-phase procedure provides an effective approach to active learning of engine maps. In particular, after a certain number of data have been collected, the readjustment of the regularization parameter does not bring any significant advantage in terms of model accuracy.

6.1 Preliminaries

In this section, a concise review of the Bayesian approach for the reconstruction of a multi-dimensional function f(x) is provided. The unknown function f(x) : Rn → R has to be estimated on the basis of k discrete and noisy observations

yi = f(xi) + vi, i = 1, . . . , k

where the measurement errors vi are independent and identically distributed with zero mean and variance Var[vi] = σ². In the following, the set Dk = {(xi, yi), i = 1, . . . , k} denotes the input-output training set. Concerning the unknown function, the following assumption is made

f(x) = ϕ^T(x) θ,    θ ∈ Rm        (6.1)

ϕ(x) = [ϕ1(x) ϕ2(x) . . . ϕm(x)]^T

where ϕj(x) are suitable basis functions, e.g. polynomials or trigonometric functions. If ϕj(x) = ϕj(‖x − xj‖), that is ϕj is radially symmetric with respect to the centre xj, equation (6.1) is a Radial Basis Function (RBF) Neural Network and θ is the weight vector. In the Bayesian approach, the parameter vector θ is modelled as a random vector, e.g. normally distributed with zero mean and variance matrix Var[θ] = λ²I. Moreover, θ is assumed to be independent of the measurement errors vi.
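To make the prior assumption concrete, the following sketch draws a few functions from the Gaussian prior θ ~ N(0, λ²I) for a one-dimensional RBF expansion; the centres, kernel width and prior variance are illustrative values, not taken from the thesis.

    import numpy as np

    rng = np.random.default_rng(1)
    centres = np.linspace(0.0, 1.0, 20)     # illustrative RBF centres
    width, lam2 = 0.1, 1.0                  # illustrative kernel width and prior variance lambda^2

    def phi(x):
        # Row vector of Gaussian radial basis functions evaluated at x
        return np.exp(-0.5 * ((x - centres) / width) ** 2)

    x_plot = np.linspace(0.0, 1.0, 200)
    Phi = np.vstack([phi(x) for x in x_plot])
    samples = Phi @ rng.normal(scale=np.sqrt(lam2), size=(len(centres), 5))
    # each column of 'samples' is one function f(x) = phi(x)^T theta drawn from the prior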

Bayesian estimation relies on the computation of the posterior distribution of θ given the training set Dk. It is well known [28] that such a posterior is still Gaussian with

θk := E[θ|Dk] = Ak^{-1} Σ_{i=1}^{k} ϕi yi        (6.2)

Var[θ|Dk] = Ak^{-1}

Ak := Σ_{i=1}^{k} ϕi ϕi^T + (σ²/λ²) I

where the shorthand notation ϕi := ϕ(xi) is used. The ratio σ²/λ² is also known as the regularization parameter. The posterior expectation θk is the so-called Bayes estimate. Consequently, the estimate of the function f(x) will be

fk(x) = ϕ^T(x) θk.

Remarkably, once the basis functions ϕj have been fixed, the Bayes estimate θk is obtained as the solution of a system of linear equations, see (6.2). This is a major advantage with respect to other types of networks, such as Multi Layer Perceptron ones, whose training


calls for nonlinear optimization, which may suffer from local minima and convergence problems.
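For concreteness, computing the Bayes estimate (6.2) amounts to solving a single linear system; a minimal NumPy sketch is reported below, where Phi denotes the k × m matrix whose i-th row is ϕ(xi)^T, a name introduced here only for illustration.

    import numpy as np

    def bayes_estimate(Phi, y, sigma2, lambda2):
        # A_k = sum_i phi_i phi_i^T + (sigma^2/lambda^2) I ;  theta_k = A_k^{-1} sum_i phi_i y_i
        k, m = Phi.shape
        A = Phi.T @ Phi + (sigma2 / lambda2) * np.eye(m)
        return np.linalg.solve(A, Phi.T @ y)

    def predict(Phi_new, theta):
        # f_k(x) = phi(x)^T theta_k evaluated at a set of new inputs
        return Phi_new @ theta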

6.2 Active Learning for Bayesian Models

In this section, the Active Learning problem for Bayesian estimation is discussed. In order to assess the quality of the estimated function fk, one may refer to its generalization capability. If the problem of predicting the value of the function at a specific point xu is considered, the generalization performance can be measured by the Mean Square Error

MSEk(xu) := E[(fk(xu) − f(xu))²].

It can be proven, [72], that

MSEk(xu) = ϕu^T Ak^{-1} ϕu        (6.3)

where ϕu := ϕ(xu). Given a training set Dk, the problem of choosing the next sampling point xk+1 so as to optimize the generalization performance of θk+1 goes under the name of sequential active learning problem.

Sequential active learning problem 1 (SALP1): Let X be the region in the input space where additional measurements can be taken. Select the sampling point xk+1 ∈ X so as to minimize MSEk+1(xu).

□

Let ∆(xk+1) := MSEk(xu) − MSEk+1(xu) be the variation of generalization performance due to the introduction of the additional point xk+1. Recalling (6.3), it would seem that the evaluation of ∆(xk+1) requires the inversion of Ak, which may be burdensome when k is large. The next lemma, [9], shows that such an inversion can be avoided by a proper update formula.

Lemma 2

∆(x_{k+1}) = [ϕ_{k+1}^T Ak^{-1} ϕu ϕu^T Ak^{-1} ϕ_{k+1}] / [σ² + ϕ_{k+1}^T Ak^{-1} ϕ_{k+1}]

Ak^{-1} = A_{k-1}^{-1} − [A_{k-1}^{-1} ϕk ϕk^T A_{k-1}^{-1}] / [σ² + ϕk^T A_{k-1}^{-1} ϕk]

□

Then, the solution of the SALP1 is obtained as

x_{k+1}^{opt} = arg max_{x_{k+1}} ∆(x_{k+1}).
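A sketch of the resulting selection rule is given below. In order for the σ² + · denominators of Lemma 2 to hold exactly, the information matrix is taken here in the rescaled form A = Σ_i ϕi ϕi^T/σ² + I/λ² (so that MSEk(xu) = ϕu^T A^{-1} ϕu); this scaling, together with the restriction of X to a finite set of candidate points, is an assumption of the sketch rather than something stated in the text.

    import numpy as np

    def update_Ainv(Ainv, phi_new, sigma2):
        # Rank-one update of A_k^{-1} when a sample at phi_new is added (Lemma 2)
        v = Ainv @ phi_new
        return Ainv - np.outer(v, v) / (sigma2 + phi_new @ v)

    def gain_salp1(Ainv, phi_cand, phi_u, sigma2):
        # Delta(x_{k+1}): decrease of MSE_k(x_u) if the next sample is taken at x_{k+1}
        return (phi_u @ Ainv @ phi_cand) ** 2 / (sigma2 + phi_cand @ Ainv @ phi_cand)

    def next_point_salp1(Ainv, Phi_candidates, phi_u, sigma2):
        # Greedy solution of SALP1 over a finite grid of admissible sampling points
        gains = [gain_salp1(Ainv, phi, phi_u, sigma2) for phi in Phi_candidates]
        return int(np.argmax(gains))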


In real world problems, one is interested in the generalization capability over a region of interest rather than at a single point xu. In the following, the region of interest will be denoted as Xu. Notably, Xu does not necessarily coincide with the admissible sampling region X. In fact, one may be concerned with the generalization performance in regions where samples cannot be taken because of physical or economic constraints. The following index measures the expected generalization performance at a point xu randomly chosen in Xu according to a probability density function pu, known as environmental probability:

MSEk = ∫_{Xu} MSEk(xu) pu(xu) dxu.

It can be seen that

MSEk = tr(Ak^{-1} C)        (6.4)

C = ∫_{Xu} ϕu ϕu^T pu(xu) dxu        (6.5)

Sequential active learning problem 2 (SALP2): Let X be the region in the input space where additional measurements can be taken. Select the sampling point xk+1 ∈ X so as to minimize MSEk+1.

□

In analogy with Lemma 2, the following lemma, [9], can be exploited to efficiently solve the SALP2.

Lemma 3

∆(x_{k+1}) := MSEk − MSEk+1 = [ϕ_{k+1}^T Ak^{-1} C Ak^{-1} ϕ_{k+1}] / [σ² + ϕ_{k+1}^T Ak^{-1} ϕ_{k+1}]        (6.6)

□

It is worth noting that, in view of (6.3)-(6.5), the solutions of both the SALP1 and the SALP2 do not make use of the output measurements yj, j = 1, . . . , k. Consider the problem of selecting l points xk+i, i = 1, . . . , l, so as to optimize the generalization over Xu. By an iterative application of the solution of the SALP2, all the l points can be decided in advance before any output measurement yk+i is taken. This would not be possible for nonlinear estimators, because the selection of each new point is based on a linearization which depends on past output measurements.
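Under the same conventions as in the previous sketch, the advance design of l points reduces to a loop in which only the inverse information matrix is updated, since no output measurement enters (6.6). The matrix C of (6.5) can be approximated by averaging ϕ(xu)ϕ(xu)^T over a grid or a Monte Carlo sample drawn from pu; the helper names below are illustrative.

    import numpy as np

    def gain_salp2(Ainv, phi_cand, C, sigma2):
        # Eq. (6.6): decrease of the average MSE over the region of interest X_u
        v = Ainv @ phi_cand
        return (v @ C @ v) / (sigma2 + phi_cand @ v)

    def advance_design(Ainv, Phi_candidates, C, sigma2, n_points):
        # Plan n_points sampling locations before any new measurement is taken.
        chosen = []
        for _ in range(n_points):
            gains = [gain_salp2(Ainv, phi, C, sigma2) for phi in Phi_candidates]
            j = int(np.argmax(gains))
            chosen.append(j)                      # the same location may be picked again
            v = Ainv @ Phi_candidates[j]
            Ainv = Ainv - np.outer(v, v) / (sigma2 + Phi_candidates[j] @ v)
        return chosen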

As a matter of fact, such an "advance design" is possible only if all the parameters of the Bayesian model are known. Although the measurement error variance σ² may be known in many cases, the variance λ² is hardly ever known and has to be tuned on the basis of Dk. The possible tuning criteria include OCV, GCV, ML and the Cp statistics [27, 28]. However, whenever λ² is updated when passing from Dk to Dk+1, the numerically efficient formulas for computing Ak^{-1} without matrix inversion are no longer applicable. In order to recover the possibility of an advance design, a two-step procedure is proposed. At first, the sequential active learning strategy is applied updating λ² at each iteration, until the estimated λ² converges to a steady state. After that, λ² is fixed and the advance design can be carried out to obtain the desired number of further sampling points.


6.3 Bayesian Networks for Engine Map Estimation

In Chapter 5, the problem of estimating the nonlinear surfaces involved in the construction of a Mean Value Model has been widely addressed. In this section, a simplified version of the same framework is used. In the following, engine maps with null boundary conditions (such as the air flow map) will be considered, so that equation (5.4) reduces to (5.5). Assuming that the unknown map function is indicated by f(x), with x = [n p]^T, equation (5.5) coincides with (6.1) if the basis functions are defined as follows

ϕj(x) = ∫_0^{x1} ∫_0^{x2} h(‖ξ − xj‖) dξ1 dξ2,        j = 1, . . . , m

h(r) = exp(−r²/(2s²))

where s is a design parameter that controls the width of the Gaussian function h(r). Note that, according to the introduced simplification, the weight vector θ is equivalent to θnp in equation (5.5). Once the value of s and the centres xj have been properly selected, the overall model is linear in the unknown parameter vector θ. The m centres of the network are located on a regular grid spanning the operating region in the (n, p) plane. As for the parameter s, it can be either tuned following statistical criteria or assigned according to some rule-of-thumb (e.g. equal to half the maximum distance between the training inputs xi). As already mentioned in Section 5.3.2, according to the regularization approach the estimated parameter vector is computed as

θ̂ = arg min_θ Jγ(θ)

Jγ(θ) = SSR(θ) + γ‖θ‖²

where SSR is the usual sum of squared residuals and the regularization parameter γ is equivalent to σ²/λ² (see Section 6.1). Since, in most cases, either σ² or λ² (or both) are unknown (in which case they must be regarded as hyper-parameters), the problem of estimating a proper value for γ is in order. A possible solution is given by the minimization of the Cp statistics [28]

Cp = SSR + 2 q(γ) σ²

where q(γ) represents the degrees of freedom, a number that, in the present context, ranges from 0 (the degrees of freedom of the null constant function) to min(m, N) as γ varies from ∞ to zero [76].
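A possible implementation of the two ingredients just introduced, namely the integrated-Gaussian basis functions and the Cp-based tuning of γ, is sketched below (NumPy/SciPy, with a numerical quadrature of the double integral and q(γ) computed as the trace of the hat matrix); this is an illustrative sketch, not the thesis implementation.

    import numpy as np
    from scipy.integrate import dblquad

    def phi_j(x, centre, s):
        # Integrated Gaussian basis: double integral of h(||xi - x_j||) over [0, x1] x [0, x2]
        g = lambda xi2, xi1: np.exp(-((xi1 - centre[0])**2 + (xi2 - centre[1])**2) / (2.0 * s**2))
        val, _ = dblquad(g, 0.0, x[0], lambda _: 0.0, lambda _: x[1])
        return val

    def cp_statistic(Phi, y, gamma, sigma2):
        # Hat matrix of theta = argmin SSR(theta) + gamma*||theta||^2 ;  Cp = SSR + 2 q(gamma) sigma^2
        m = Phi.shape[1]
        H = Phi @ np.linalg.solve(Phi.T @ Phi + gamma * np.eye(m), Phi.T)
        ssr = float(np.sum((y - H @ y) ** 2))
        return ssr + 2.0 * float(np.trace(H)) * sigma2

    def tune_gamma_cp(Phi, y, sigma2, gammas):
        return gammas[int(np.argmin([cp_statistic(Phi, y, g, sigma2) for g in gammas]))]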

6.4 Active Learning of the Air Flow Map

The effectiveness of the proposed active learning strategy has been tested on a simulated benchmark. The data are obtained by adding pseudo-random noise to the mathematical model of the air flow map identified from experimental data as discussed in Chapter 5. The advantage of resorting to simulated data is twofold. First of all, this allows


one to compare the reconstructed map with the "true" one to assess the quality of the estimation. Moreover, it is possible to emulate arbitrary sampling schedules with respect to number and location of training data. The variance of the pseudo-random noise is σ² = 10⁴ (Kg/h)². It has been assumed that static measurements can be collected in the rectangular region 400 ≤ p ≤ 1000 (mbar), 1000 ≤ n ≤ 7500 (rpm). For computational reasons, the sampling region has been discretized using a 10 × 10 grid (the potential sampling points are the knots of the grid). The centres of the neural network are located on a regular 10 × 10 grid spanning the region 0 ≤ p ≤ 1000, 0 ≤ n ≤ 7500. The training has been performed on normalized inputs −1 ≤ x ≤ 1, −1 ≤ y ≤ 1, where x and y correspond to p and n, respectively. In such a normalized setting the width parameter s was set equal to 0.3, which guarantees a reasonable smoothness of the approximating surface.
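A minimal sketch of the benchmark setup is given below, assuming that the network region 0 ≤ p ≤ 1000, 0 ≤ n ≤ 7500 is mapped affinely onto [−1, 1]² (the exact normalization used in the thesis is not stated) and that f_true stands for the air flow map identified in Chapter 5.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2 = 1e4                               # variance of the pseudo-random noise, (Kg/h)^2
    p_grid = np.linspace(400.0, 1000.0, 10)    # admissible sampling region, pressure (mbar)
    n_grid = np.linspace(1000.0, 7500.0, 10)   # admissible sampling region, speed (rpm)
    P, N = np.meshgrid(p_grid, n_grid)
    candidates = np.column_stack([P.ravel(), N.ravel()])   # 100 potential sampling points

    def normalize(p, n):
        # Assumed affine map of the network region onto the normalized square [-1, 1]^2
        return 2.0 * p / 1000.0 - 1.0, 2.0 * n / 7500.0 - 1.0

    def measure(f_true, p, n):
        # Noisy measurement at a chosen sampling point, given the "true" map f_true(p, n)
        return f_true(p, n) + rng.normal(scale=np.sqrt(sigma2))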

Figure 6.1: Problem 1. Estimated λ² as a function of the number of data (asterisks).

Two problems have been considered that differ in the choice of the region of interest. In both cases the environmental probability pu is assumed constant over the region of interest. In the first problem, such a region coincides with the sampling one. The active learning procedure has been applied using the four corners of the sampling region as starting points. At first, active learning has been used for selecting 50 sampling points, performing the tuning of λ² via the Cp statistics at each iteration. The values of the estimated λ² are plotted against the iteration number in Fig. 6.1. The asymptotic convergence towards a steady state value is apparent. This observation justifies the use of the following two-phase procedure. In the first phase, the sampling points are evaluated iteratively, performing the tuning of λ² at each step, until the estimated values become reasonably stable. In practice, convergence is assessed by checking whether |ln λ²_{k+i} − ln λ²_k| < 0.35, i = 1, . . . , 10, where λ²_k denotes the estimate of λ² computed at the k-th step. If the condition is satisfied, in the second phase λ² is kept fixed and equal to the average of the last ten values. It is worth mentioning that it is well known that, in regularization problems, the estimate is relatively robust with respect to the regularization parameter, provided that changes within the same order of magnitude are considered.
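The phase-I stopping rule can be coded as a simple test on the history of the estimated λ²; the sketch below follows the criterion stated above (window of ten iterations, tolerance 0.35 on the log scale), with the function name and the data structure chosen only for illustration.

    import numpy as np

    def freeze_lambda2(history, window=10, tol=0.35):
        # history: estimates of lambda^2 obtained so far, one per phase-I iteration.
        # Returns the value to use in phase II, or None if convergence is not yet reached.
        if len(history) < window + 1:
            return None
        ref = np.log(history[-window - 1])                  # estimate at step k
        recent = np.log(np.asarray(history[-window:]))      # estimates at steps k+1 .. k+window
        if np.all(np.abs(recent - ref) < tol):
            return float(np.mean(history[-window:]))        # average of the last ten values
        return None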


Figure 6.2: Problem 1. First panel: data selected in phase I by the active learning procedure while tuning λ² (circles) and data selected in phase II with λ² fixed to its steady state value (crosses). The initial data are also reported (full circles). Second panel: the first ten points numbered according to the order of choice.


Figure 6.3: Problem 1. Root Mean Square Error (RMSE) between the estimated map and the true one. Comparison between the RMSE obtained tuning λ² for each of the 50 data collected (asterisks) and the RMSE obtained in phase II fixing λ² to its steady state value after the first 20 data (circles).

Figure 6.4: Problem 1. The true map and the air flow maps estimated on the basis of 10, 20 and 20+30 data collected by means of the proposed active learning procedure.


As already mentioned at the end of Section 6.2, using a constant λ² is computationally advantageous and allows one to perform an advance design of the sampling points before collecting the data. In the considered example, the two-phase procedure has been applied collecting 20 data in phase I and another 30 data in phase II. The sampling points selected in the two phases are plotted in the first panel of Fig. 6.2. It is worth noting that there are locations that are chosen more than once. In the second panel of the same figure the first ten points are plotted together with their order of choice. In Fig. 6.3, the performance of the two-phase procedure is compared with that of the original approach that tunes λ² at each step. It can be seen that applying the two-phase procedure entails only a slight degradation in terms of RMSE between the estimated and true engine map evaluated on the region of interest. The precision of the obtained estimates is reasonably good if one considers that the RMSE of the map estimated using all the 100 points in the grid of the sampling region is 22.70 [Kg/h] (λ² = 1.8 × 10⁹). Finally, in Fig. 6.4 the true air flow map and three maps estimated using the active learning procedure (with 10, 20 and 20+30 points) are plotted. The result obtained using 50 data selected by the two-phase scheme is completely satisfactory and comparable to the results obtained in Chapter 5 using 92 experimental data.

Figure 6.5: Problem 2. Sampling region (dash-dot) and region of interest (solid) with respect to the air flow map estimation problem. The initial data are also reported (full circles).

In the second problem the region of interest is 200 ≤ p ≤ 700 (mbar), 1000 ≤ n ≤ 6000 (rpm) and does not coincide with the sampling one, see Fig. 6.5. The choice of this region could be motivated by the need of developing a model for the design of the idle-speed controller. The region of interest is not included in the sampling region because some points are outside the engine stability region where static measurements can be collected. On the other hand, the map model must cover also these points because they can be


reached during dynamic transients.

Figure 6.6: Problem 2. Estimated λ² as a function of the number of data (asterisks).

Again, the starting points are the corners of the sampling region. When 50 sampling points are selected tuning λ² at each iteration, one obtains the values of λ² reported in Fig. 6.6. The same convergence criterion has been applied as that used in Problem 1. In the problem at hand, 20 data have been collected in phase I and another 30 in phase II. The 50 sampling points and the detail of the first ten are reported in Fig. 6.7. A comparison between Figs. 6.2 and 6.7 highlights the different spatial distributions of the selected points, which in the latter case are more concentrated towards the lower left corner, corresponding to the intersection of the sampling region with the region of interest.

In Fig. 6.8, the performance of the two-phase procedure is compared with that of the original approach that tunes λ² at each step. Again, the use of a fixed λ² appears more than satisfactory. For the sake of comparison, the RMSE of the map estimated using all the 100 points in the grid of the sampling region is 34.89 [Kg/h] (λ² = 2.99 × 10⁹). In Fig. 6.9, the true air flow map is compared with the maps estimated using 10, 20 and 20+30 points. The global quality of approximation is worse than that of Problem 1, see Fig. 6.4, but this is not surprising because the region of interest is smaller and shifted.

6.5 Concluding Remarks

In this chapter, the problem of selecting the most informative static experiments for engine map reconstruction has been considered. Typically, such a problem is solved by means of iterative active learning schemes which choose the next experimental point on the basis of the available training set. This means that, in general, an advance design is not possible. However, if the engine map is described by a linear-in-parameter model, the choice of the next point does not depend on the previous experimental outcomes. Hence the interest in linear models such as Bayesian RBF neural networks. Nevertheless, recalling that Bayesian methods use the available training set to tune the so-called regularization parameter, it would seem that an advance design is still not feasible. To


Figure 6.7: Problem 2. First panel: data selected in phase I by the active learning procedure while tuning λ² (circles) and data selected in phase II with λ² fixed to its steady state value (crosses). The initial data are also reported (full circles). Second panel: the first ten points numbered according to the order of choice.


Figure 6.8: Problem 2. Root Mean Square Error (RMSE) between the estimated map and the true one. Comparison between the RMSE obtained tuning λ² for each of the 50 data collected (asterisks) and the RMSE obtained in phase II fixing λ² to its steady state value after the first 20 data (circles).

Figure 6.9: Problem 2. The true map and the air flow maps estimated on the basis of 10, 20 and 20+30 data collected by means of the proposed active learning procedure.


obviate this drawback, in the present chapter a two-phase procedure has been proposed. In the first phase, the regularization parameter is tuned at each step until its estimate has reached, at least approximately, a steady-state value. Then, in the second phase, the regularization parameter is kept fixed and the advance design of the whole sequence of future experiments is made possible.

The proposed procedure has been tested on a simulated benchmark. The results are satisfactory and demonstrate the usefulness of active learning methodologies in order to improve accuracy or reduce the number of experiments. From the practical point of view, one could consider the following strategy. First, an initial set of experiments is decided on heuristic grounds. Assuming that the regularization parameter is reliably estimated on the basis of this training set, an optimal advance design is performed in order to choose all the experiments that will be conducted in a second and final experimental session.

As far as future developments are concerned, three main directions can be envisaged. First of all, other types of Bayesian estimators may be considered. These may include classical models such as polynomial ones, but with Bayesian priors, or other methods such as Gaussian Processes [41, 67, 80], which are the basis of Regularization Networks models [56].

A second development is the extension of the two-phase approach to parametric models estimated via least squares methods. In that case, the advance design of the experiments is hindered by the need of choosing the model complexity (e.g. the polynomial order). Again, a possible solution would be estimating the model order on a reduced set of experiments (possibly heuristically determined) and then designing the future experiments all in one go.

Finally, it may be worth exploring the use of data collected during dynamic on-board tests for engine map estimation. For instance, dynamic data could be processed to obtain a preliminary model on the basis of which a few but informative static experiments are chosen.


Bibliography

[1] L. Aarons. Software for population pharmacokinetics and pharmacodynamics. Clin. Pharmacokinet., 36(4):255–264, 1999.

[2] H. Akaike. A new look at the statistical model identification. IEEE Trans. on Automatic Control, AC-19(6):716–723, 1974.

[3] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337–404, 1950.

[4] I. Arsie, F. Marotta, C. Pianese, and G. Rizzo. Information Based Selection of Neural Networks Training Data for S.I. Engine Mapping. SAE Technical Paper, (2001-01-0561):173–184, 2001.

[5] S. L. Beal and L. B. Sheiner. Estimating population kinetics. Crit. Rev. Biomed. Eng., 8(3):195–222, 1982.

[6] S. L. Beal and L. B. Sheiner. NONMEM Users Guide. University of California, San Francisco, CA, USA, 1998.

[7] A. Bertoldo, G. Sparacino, and C. Cobelli. "Population" approach improves parameter estimation of kinetic models from dynamic PET data. IEEE Trans. on Medical Imaging, 23(3):297–306, 2004.

[8] Center for Drug Evaluation and Research. Guidance for Industry: Population Pharmacokinetics. United States Department of Health and Human Services, Food and Drug Administration, 1999.

[9] D. A. Cohn. Neural network exploration using optimal experiment design. A. I. Memo, Artificial Intelligence Laboratory, MIT, 1491, 1994.

[10] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. J. Artificial Intelligence Research, 4:129–145, 1996.

[11] M. Davidian and D. M. Giltinan. Nonlinear Models for Repeated Measurement Data. Chapman and Hall, New York, NY, USA, 1995.

[12] G. De Nicolao and G. Ferrari-Trecate. Regularization networks: Fast weight calculation via Kalman filtering. IEEE Trans. on Neural Networks, 12(2):228–235, 2001.

[13] G. De Nicolao, R. Scattolini, and C. Siviero. Modelling the Volumetric Efficiency of IC engines: parametric, non-parametric and neural techniques. Control Engineering Practice, 4(10):1405–1415, 1996.


[14] H. Demuth and M. Beale. Neural Network Toolbox User's Guide. The MathWorks Inc., Natick, MA, USA, 1993.

[15] M. Egerstedt and C. F. Martin. Optimal trajectory planning and smoothing splines. Automatica, 37:1057–1064, 2001.

[16] K. E. Fattinger and D. Verotta. A nonparametric subject-specific population method for deconvolution: I. Description, internal validation, and real data examples. J. Pharmacokin. Biopharm., 23:581–610, 1995.

[17] K. E. Fattinger and D. Verotta. A nonparametric subject-specific population method for deconvolution: II. External validation. J. Pharmacokin. Biopharm., 23:611–634, 1995.

[18] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, NY, USA, 1972.

[19] F. Ferrazzi, P. Magni, and R. Bellazzi. Bayesian clustering of gene expression time series. In Proc. of 3rd Int. Workshop on Bioinformatics for the Management, Analysis and Interpretation of Microarray Data (NETTAB 2003), pages 53–55, 2003.

[20] K. Fukumizu. Statistical active learning in multilayer perceptron. IEEE Trans. on Neural Networks, 11:17–26, 2000.

[21] M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1997.

[22] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, London, UK, 1996.

[23] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219–269, 1995.

[24] G. H. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21:215–224, 1979.

[25] V. Guardabasso, P. J. Munson, and D. Rodbard. A versatile method for simultaneous analysis of families of curves. FASEB J., 2:209–215, 1988.

[26] L. Guzzella and C. H. Onder. Introduction to modeling and control of internal combustion engine systems. Springer-Verlag, Berlin, Germany, 2004.

[27] P. Hall and D. M. Titterington. Common structure of techniques for choosing smoothing parameters in regression problems. J. R. Statist. Soc., 49:184–198, 1987.

[28] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, NY, USA, 2001.

[29] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, London, UK, 1990.

[30] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.


[31] S. Haykin. Neural Networks: A Comprehensive Foundation. MacMillan College Publishing Company, New York, NY, USA, 1963.

[32] E. Hendricks and S. C. Sorenson. Mean Value Modelling of Spark Ignition Engines. SAE Technical Paper, (900616), 1990.

[33] J. B. Heywood. Internal Combustion Engine Fundamentals. McGraw-Hill, New York, NY, USA, 1992.

[34] T. Holliday and T. P. Davis. Engine Mapping Experiments: A two-stage regression approach. Technometrics, 40:120–126, 1998.

[35] R. Jelliffe, A. Schumitzky, M. Van Guilder, X. Wang, and R. Leary. Population pharmacokinetic and dynamic models: parametric (P) and nonparametric (NP) approaches. In 14th IEEE Symposium on Computer-Based Medical Systems, pages 407–412, Bethesda, MD, USA, 2001.

[36] R. Leary, R. Jelliffe, A. Schumitzky, and M. Van Guilder. An adaptive grid non-parametric approach to pharmacokinetic and dynamic (PK/PD) population models. In 14th IEEE Symposium on Computer-Based Medical Systems, pages 389–394, Bethesda, MD, USA, 2001.

[37] R. P. Lippmann. An introduction to computing with neural nets. IEEE Acoust., Speech, Signal Processing Magazine, 4:4–22, 1987.

[38] L. Ljung. System Identification: Theory for the User. Prentice-Hall, Englewood Cliffs, NJ, USA, 1987.

[39] D. J. MacKay. Bayesian interpolation. Neural Computation, 4:415–447, 1992.

[40] D. J. MacKay. Gaussian Processes: A Replacement for Supervised Neural Networks? In Lecture Notes on Neural Information Processing Systems (NIPS'97), 1997.

[41] D. J. MacKay. Introduction to Gaussian Processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, volume 168 of NATO ASI Series, Series F, Computer and Systems Sciences, pages 133–166. Kluwer Academic Press, 1998.

[42] P. Magni, R. Bellazzi, G. De Nicolao, I. Poggesi, and M. Rocchetti. Nonparametric AUC estimation in population studies with incomplete sampling: a Bayesian approach. J. Pharmacokin. Pharmacodyn., 29(5/6):445–471, 2002.

[43] N. Muller, M. Hafner, and R. Isermann. A Neuro-Fuzzy Based Method for the Design of Combustion Engine Dynamometer Experiments. SAE Technical Paper, (2000-00P-198), 2000.

[44] R. Neal. Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer-Verlag, New York, NY, USA, 1996.

[45] M. Neve and G. De Nicolao. Active learning strategies for the neural estimation of engine maps. In S. Kalogirou, editor, Artificial Intelligence in Energy and Renewable Energy Systems. Nova Publishers Inc., 2006.


[46] M. Neve, G. De Nicolao, and L. Marchesi. Fixed interval smoothing of population pharmacokinetic data. In Proc. 16th IFAC World Congress, number Fr-A19-TO/6, Prague, Czech Republic, 2005.

[47] M. Neve, G. De Nicolao, and L. Marchesi. Identification of pharmacokinetic models via population smoothing splines. ANIPLA-BIOSYS 2005, Milan, June 2005.

[48] M. Neve, G. De Nicolao, and L. Marchesi. Nonparametric identification of population models via Gaussian processes. Automatica, provisionally accepted for publication, 2005.

[49] M. Neve, G. De Nicolao, and L. Marchesi. Nonparametric identification of population pharmacokinetic models: an MCMC approach. In Proc. 24th American Control Conference, pages 991–996, Portland, OR, USA, June 8-10, 2005.

[50] M. Neve, G. De Nicolao, G. Prodi, and C. Siviero. Estimation of engine maps: a regularized basis-function networks approach. Submitted for publication.

[51] M. Neve, G. De Nicolao, G. Prodi, and C. Siviero. Stima di mappe motore mediante reti neurali di regolarizzazione. In Atti della Fondazione Ronchi, Anno LIX, 2004, number 1, pages 113–116. Atti della II Giornata di Studio su "Applicazione delle Reti Neurali nell'Ingegneria Elettrica e dell'Informazione", Pavia, May 2003.

[52] M. Neve, G. De Nicolao, G. Prodi, and C. Siviero. Nonparametric estimation of engine maps using regularized basis-function networks. In Proc. of the First IFAC Symposium on Advances in Automotive Control, pages 339–345, Salerno, Italy, 2004.

[53] M. Neve, G. De Nicolao, G. Prodi, and C. Siviero. Nonparametric neural identification of the air flow map. In Proc. of the 6th International Conference on Engine for Automobile, ICE2003 (SAE Paper), number SAE-NA 2003-01-08, Capri, Italy, September 14-19, 2003.

[54] C. Paciorek. Nonstationary Gaussian Processes for Regression and Spatial Modeling. PhD thesis, Carnegie Mellon University, Pittsburgh, 2003.

[55] K. Park, D. Verotta, T. F. Blaschke, and L. B. Sheiner. A semiparametric method for describing noisy population pharmacokinetic data. J. Pharmacokin. Biopharm., 25:615–642, 1997.

[56] T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE, 78:1481–1497, 1990.

[57] C. Rasmussen. Evaluation of Gaussian Processes and other methods for nonlinear regression. PhD thesis, Department of Computer Science, University of Toronto, 1996. ftp://ftp.cs.toronto.edu/pub/carl/thesis.ps.gz.

[58] M. Rocchetti and I. Poggesi. Comparison of the Bailer and Yeh methods using real data. In L. Aarons et al., editor, The population approach: measuring and managing variability in response, concentration and dose, pages 385–390, Brussels, Belgium. European cooperation in the field of scientific and technical research, European Commission, 1997.


[59] K. Ropke. Design of Experiments (DOE) in der Motorenentwicklung. Expert Verlag, 2003.

[60] M. Seeger. Bayesian methods for support vector machines and Gaussian Processes. Master's thesis, University of Edinburgh, Division of Informatics, 1999.

[61] M. Seeger. Bayesian model selection for support vector machines, Gaussian Processes and other kernel classifiers. In S. Solla, T. Leen, and K. R. Muller, editors, Advances in Neural Information Processing Systems 12, pages 603–609. MIT Press, 2000.

[62] L. B. Sheiner. The population approach to pharmacokinetic data analysis: rationale and standard data analysis methods. Drug Metabolism Reviews, 15:153–171, 1994.

[63] L. B. Sheiner, B. Rosenberg, and V. V. Marathe. Estimation of population characteristics of pharmacokinetic parameters from routine clinical data. J. Pharmacokin. Biopharm., 5(5):445–479, 1977.

[64] L. B. Sheiner and J. L. Steimer. Pharmacokinetic/pharmacodynamic modeling in drug development. Annu. Rev. Pharmacol. Toxicol., 40:67–95, 2000.

[65] C. Siviero, R. Scattolini, A. Gelmetti, L. Poggio, and G. Serra. Analysis & Validation of Mean Value Models for SI-Engines. In Proc. of the First IFAC-Workshop on Advances in Automotive Control, pages 1–6, 1995.

[66] J. Sjoberg and L. Ljung. Overtraining, regularization and searching for minimum in neural networks. In 4th IFAC Symposium on Adaptive Systems in Control and Signal Processing, pages 669–674, 1992.

[67] A. J. Smola and B. Schölkopf. Bayesian kernel methods. In S. Mendelson and A. J. Smola, editors, Machine Learning, Proceedings of the Summer School, Australian National University, pages 65–117, Berlin, Germany, 2003. Springer-Verlag.

[68] T. Soderstrom and P. Stoica. System Identification. Prentice Hall International, London, UK, 1989.

[69] P. Sollich. Query construction, entropy, and generalization in neural-network models. Physical Review E, 49(5):4637–4651, 1994.

[70] M. Sugiyama and H. Ogawa. Incremental active learning for optimal generalization. Neural Computation, 12(12):2909–2940, 2000.

[71] S. Sun, M. B. Egerstedt, and C. F. Martin. Control theoretic smoothing splines. IEEE Trans. on Automatic Control, 45(12):2271–2279, 2000.

[72] K. K. Sung and P. Niyogi. Active Learning the weights of a RBF network. In Proc. of IEEE Workshop on Neural Networks for Signal Processing, volume 49, pages 40–47, 1995.

[73] A. N. Tykhonov. Solutions of incorrectly formulated problems and the regularization method. Soviet. Math. Dokl., 4(4):1624, 1963.


[74] P. Vicini and C. Cobelli. The iterative two-stage population approach to IVGTT minimal modeling: improved precision with reduced sampling. Am. J. Physiol. Endocrinol. Metab., 280(1):179–186, 2001.

[75] S. Vozeh, J. L. Steimer, M. Rowland, P. Morselli, F. Mentre, L. P. Balant, and L. Aarons. The use of population pharmacokinetics in drug development. Clin. Pharmacokinet., 30(2):81–93, 1996.

[76] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, USA, 1990.

[77] J. Wakefield and J. Bennett. The Bayesian modelling of covariates for population pharmacokinetic models. JASA, 91:917–927, 1996.

[78] J. Wakefield, A. F. M. Smith, A. Racine-Poon, and A. Gelfand. Bayesian analysis of linear and nonlinear population models using the Gibbs Sampler. Appl. Statist., 41:201–221, 1994.

[79] P. Whittle. Prediction and regulation by linear least-square methods. English Universities Press, 1963.

[80] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. Jordan, editor, Learning and Inference in Graphical Models, pages 599–621. MIT Press, 1999.

[81] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, Cambridge, MA, USA, 1996. MIT Press.

[82] L. Yuh, S. Beal, M. Davidian, F. Harrison, A. Hester, K. Kowalski, E. Vonesh, and R. Wolfinger. Population pharmacokinetic/pharmacodynamic methodology and applications: a bibliography. Biometrics, 50:566–575, 1994.