VRIJE UNIVERSITEIT BRUSSEL
Faculteit Geneeskunde en Farmacie
Laboratorium voor Farmaceutische en Biomedische Analyse
NEW TRENDS IN MULTIVARIATE ANALYSIS AND CALIBRATION
Frédéric ESTIENNE
Thesis presented to fulfil the requirements for
the degree of doctor in Pharmaceutical Sciences
Academic year : 2002/2003
Promotor : Prof. Dr. D.L. MASSART
ACKNOWLEDGMENTS
First of all, I would like to thank Professor Massart for allowing me to spend these almost four years in
his team. The knowledge I acquired, the experience I gained, and most probably the reputation of this
training gave a new and far better start to my professional life.
For the rest, the list of people I have to thank would be too long to print here, not to mention that I
might accidentally omit someone. So I will play it safe and simply thank everyone I enjoyed studying,
working, having fun, gossiping (etc.) with during all these years.
Thank you all !
TABLE OF CONTENTS
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
___________ INTRODUCTION ___________
___________ I. MULTIVARIATE ANALYSIS AND CALIBRATION ___________
“Chemometrics and modelling.”
___________ II. COMPARISON OF MULTIVARIATE CALIBRATION METHODS ___________
“A Comparison of Multivariate Calibration Techniques Applied to Experimental NIR Data Sets. Part II : Predictive Ability under Extrapolation Conditions”
“A Comparison of Multivariate Calibration Techniques Applied to Experimental NIR Data Sets. Part III : Robustness Against Instrumental Perturbation Conditions”
“The Development of Calibration Models for Spectroscopic Data using Multiple Linear Regression”
___________ III. NEW TYPES OF DATA : NATURE OF THE DATA SET ___________
“Multivariate calibration with Raman spectroscopic data : a case study”
“Inverse Multivariate Calibration Applied to Eluxyl Raman Data”
___________ IV. NEW TYPES OF DATA : STRUCTURE AND SIZE ___________
“Multivariate calibration with Raman data using fast PCR and PLS methods”
“Multi-Way Modelling of High-Dimensionality Electro-Encephalographic Data”
“Robust Version of Tucker 3 Model”
___________ CONCLUSION ___________
PUBLICATION LIST
LIST OF ABBREVIATIONS
ADPF Adaptive-degree polynomial filter
AES Atomic emission spectroscopy
ALS Alternating least squares
ANOVA Analysis of variance
ASTM American Society for Testing and Materials
CANDECOMP Canonical Decomposition
CCD Charge-coupled device
CV Cross-validation
DTR De-trending
EEG Electro-encephalogram
FFT Fast Fourier transform
FT Fourier Transform
GA Genetic algorithm
GC Gas chromatography
ICP Inductively coupled plasma
IR Infrared
k-NN k-nearest neighbours
LMS Least median of squares
LOO Leave-one-out
LV Latent variable
LWR Locally weighted regression
MCD Minimum covariance determinant
MD Mahalanobis distance
MLR Multiple Linear Regression
MSC Multiplicative scatter correction
MSEP Mean squared error of prediction
MVE Minimum volume ellipsoid
MVT Multivariate trimming
NIPALS Nonlinear iterative partial least squares
NIR Near-infrared
NL-PCR Non-linear principal component regression
NN Neural networks
NPLS N-way partial least squares
OLS Ordinary least squares
PARAFAC Parallel factor analysis
PC Principal component
PCA Principal component analysis
PCC Partial correlation coefficient
PCR Principal component regression
PCRS Principal component regression with selection of PCs
PLS Partial least squares
PP Projection pursuit
PRESS Prediction error sum of squares
QSAR Quantitative structure-activity relationship
RBF Radial basis function
RCE Relevant components extraction
RMSECV Root mean squared error of cross validation
RMSEP Root mean squared error of prediction
RVE Relevant variable extraction
SNV Standard normal variate
SPC Statistical process control
SVD Singular value decomposition
TLS Total least squares
UVE Uninformative variables elimination
NEW TRENDS IN MULTIVARIATE ANALYSIS AND CALIBRATION
INTRODUCTION
Many definitions have been given for chemometrics. One of the most frequently quoted of these
definitions [1] states the following :
Chemometrics is a chemical discipline that uses mathematics, statistics and formal logic (a) to design
or select optimal experimental procedures; (b) to provide the maximum relevant chemical information
by analysing chemical data; and (c) to obtain knowledge about chemical systems.
This thesis focuses specifically on points (b) and (c) of this definition, and a particular emphasis is
placed on multivariate methods and how they are used to model data. It should be noted that, while
modelling is probably the most important area of chemometrics, there are many other applications such
as method validation, optimisation, statistical process control, signal processing, etc.
Modelling methods can be divided into two groups, even if these groups often overlap widely. In
multivariate data analysis, models are used directly for data interpretation. In multivariate calibration,
models relate the data to a given property in order to predict that property.
Modelling methods in general are introduced in Chapter 1. The most common multivariate data
analysis and calibration methods are presented as well as some more advanced ones, in particular
methods applying to data with complex structure.
A particularity of chemometrics is that many of the methods used in the field were developed in other
areas of science before being imported into chemistry. This is for instance the case for Partial Least
Squares, which was initially developed to build econometric models. Chemometrics also covers a very
wide domain of application, and specialists in each field develop or modify the methods best suited to
their particular applications. As a result, many methods are often available for a given
problem. The first step of the chemometrical methodology is therefore to select the most appropriate
method to use. The importance of this step is illustrated in Chapter 2. Multivariate calibration methods
are compared on data with different structures. This comparison is performed in situations challenging
for the methods (data extrapolation, instrumental perturbation). A detailed description of the steps
necessary to develop a multivariate calibration model is also provided using Multiple Linear
Regression as a reference method.
Multivariate calibration and Near Infrared (NIR) spectroscopy have a parallel history. NIR could only
be routinely implemented through the use of sophisticated chemometrical tools and the advent of
modern computing. Chemometrical methods were then widely promoted by the remarkable
achievements of multivariate calibration applied to NIR data. For many years, multivariate calibration
and NIR spectroscopy were therefore almost synonymous for the non-specialist. In the last few years,
chemometrical methods have proved very efficient on other types of analytical data. This was
sometimes the case even for analytical methods that were not considered to require sophisticated data
treatment.
It is shown in chapter 3 how Raman spectroscopic data can benefit from chemometrics in general and
multivariate calibration in particular, allowing the use of Raman in a growing number of industrial
applications. This chapter also illustrates the importance of method selection in chemometrics, and
shows that the choice of the most appropriate method to use can depend on many factors, for instance
quality of the data set.
During the last years, the data treated by chemometricians have tended to become more and more
complex. This complexity can be understood in terms of volume of data, or in terms of data structure.
The increasing size of chemometrical data sets has several causes. For instance, combinatorial
chemistry and high throughput screening are designed to generate large volumes of data. Collections of
samples recorded over time also tend to get larger and larger. The improvement of analytical
instruments leads to better spectral resolutions and therefore larger data sets (sometimes several tens of
thousands of items). This last point is illustrated in chapter 4. It is shown how calibration methods
specifically designed to be fast can considerably reduce the computation time required for calibration
and prediction of new samples. The complexity of a data set can also be understood in terms of data
structure. Methods developed in the area of psychometrics, which allow the treatment of data that are
not only multivariate but also multi-way, were recently introduced into the chemometrical field.
Chapter 4 shows how this kind of
methods can be used to extract information from a very complex data set with up to 6 modes. This
chapter gives another illustration of the fact that chemometrical methods can be applied to new types of
data, even outside the strict domain of chemistry, since the multi-way methods are applied to
pharmaceutical electro-encephalographic data. Another example is given showing how these methods
can be adapted to make them more robust toward difficult data sets.
REFERENCES
[1] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. de Jong, P.J. Lewi and J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics, Elsevier, Amsterdam, 1997.
There are two types of modelling. Modelling can in the first place be applied to extract useful
information from a large volume of data, or to achieve a better understanding of complex phenomena.
This kind of modelling is sometimes done through the use of simple visual representations. Depending
on the type of data studied and the field of application, modelling is then referred to as exploratory
multivariate analysis or data mining. Modelling can in the second place be applied when two or more
characteristics of the same objects are measured or calculated and then related to each other. It is for
instance possible to relate the concentration of a chemical compound to an instrumental signal, the
chemical structure of a drug to its activity or instrumental responses to sensory characteristics. In these
situations, the purpose of modelling usually is, after a calibration process, to make predictions (e.g.
predict the concentration of a certain analyte in a sample from a measured signal), but it can sometimes
simply be to verify the nature of the relationship. The two types of modelling strongly overlap. The
methods introduced in this chapter will therefore not be presented as being exploration- or calibration-oriented, but rather will be introduced by rank of increasing complexity of the type of data or modelling
problem they are applied to.
2. Univariate regression
2.1. Classical univariate least squares : straight line models
Before introducing some of the more sophisticated methods, we should look briefly at the classical
univariate least squares methodology (often called ordinary least squares, OLS), which is what
analytical chemists generally use to construct a (linear) calibration line. In most analytical techniques
the concentration of a sample cannot be measured directly but is derived from a measured signal that is
in direct relation with the concentration. Suppose the vector x represents the concentrations of samples
and y the corresponding measured instrumental signal. To be able to define a model y = f(x) a
relationship between x and y has to exist. The simplest and most convenient situation is when the
relation is linear which leads to a model of the type :
y = b0 + b1x (1)
which is the equation of a straight line. The coefficients b0 and b1 represent the intercept and the slope
of the line. Relationships between y and x that follow a curved line can for instance be represented by a
regression model of the type :
y = b0 + b1x + b11x² (2)
The least squares regression analysis is a methodology that allows one to estimate the coefficients of a
given model. For calibration purposes one usually focuses on straight-line models, which we will also
do in the rest of this section. Conventionally, the x-values represent the so-called controlled or
independent variable, i.e. the variable that is considered not to have a measurement error (or a
negligible one), which is the concentration in our case. The y-values represent the dependent variable,
i.e. the measured response, which is considered to have a measurement error. The least squares
approach allows one to obtain b0 and b1 values such that the model fits the measured points (xi, yi) best.
Fig. 1. Straight line fitting through a series of measured points.
The true relationship between x and y is considered to be y = β0 + β1x, while the relationship between
each xi and its measured yi can be represented as yi = b0 + b1xi + ei. The signal yi is composed of a
component predicted by the model, b0 + b1xi, and a random component, ei, the residual (Fig. 1). The
least squares regression finds the estimates b0 and b1 for β0 and β1 by calculating the values b0 and b1
for which ∑ei² = ∑(yi – b0 – b1xi)², the sum of the squared residuals, is minimal. This explains the
name “least squares”. Standard books about regression, including least squares approaches, are [1,2].
Analytical chemists can find information in [3,4].
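The straight-line fit described above can be sketched in a few lines of code. The following Python sketch uses the closed-form least squares expressions for the slope and intercept; the function name and the data points are invented for illustration.

```python
# A minimal sketch of the least squares estimates for a straight line,
# using the closed-form expressions for the slope b1 and intercept b0.
# The data points are invented and lie exactly on y = 2 + 0.5x.
def ols_line(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # b1 = sum((xi - mx)(yi - my)) / sum((xi - mx)^2)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx                  # slope
    b0 = my - b1 * mx               # intercept
    return b0, b1

b0, b1 = ols_line([1, 2, 3, 4], [2.5, 3.0, 3.5, 4.0])   # b0 = 2.0, b1 = 0.5
```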
2.2. Some variants of the univariate least squares straight-line model
A fundamental assumption of OLS is that there are only errors in the direction of y. In some instances,
two measured quantities are related to each other and the assumption then does not hold, because there
are also measurement errors in x. This is for instance the case when two analytical methods are
compared to each other. Often one of these methods is a reference method and the other a new method,
which is faster or cheaper, and one wants to demonstrate that the results of both methods are
sufficiently similar. A certain number of samples are analysed with both methods and a straight-line
model relating both series of measurements is obtained. If β0, as estimated from b0, does not differ
from 0 by more than an a priori accepted bias, and β1, as estimated by b1, does not differ from 1 by
more than a given amount, then one can accept that for practical purposes y = x. In its simplest
statistical expression, this means testing whether β0 = 0 and β1 = 1 or, to put it another way, whether b0
is statistically different from 0 and/or b1 is statistically different from 1. If this is the case, then it is
concluded that the two
methods do not yield the same result but that there is a constant (intercept) or proportional (slope)
systematic error or bias. This means that one should calculate b0 and b1 and at first sight this could be
done by OLS. However both regression variables (not only yi but now also xi) are subject to error, as
already mentioned. This violates one of the key assumptions of the OLS calculations.
It has been shown [4-7] that the computation of b0 and b1 according to the OLS method leads to wrong
estimates of β0 and β1. Significant errors in the least squares estimate of b1 can be expected if the ratio
between the measurement error on the x-values and the range of the x-values is large. In that case OLS
should not be used. To obtain correct values for b0 and b1, the sum of least squares must now be
obtained in the direction given in figure 2. Such methods are sometimes called errors in variables
models or orthogonal least squares. Detailed studies of the application of models of these types can be
found in [8,9].
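The errors-in-variables idea can be sketched as follows: instead of minimising vertical residuals, the line is taken along the first principal axis of the centred (x, y) cloud, so that the sum of squared perpendicular distances is minimised and errors in x and y are treated symmetrically. This is an illustrative construction, not the specific algorithms of the cited references.

```python
import numpy as np

# Orthogonal (total) least squares for a straight line: the line
# direction is the first principal axis of the centred point cloud,
# obtained here from the SVD of the centred data matrix.
def orthogonal_line(x, y):
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    _, _, vt = np.linalg.svd(np.column_stack([xc, yc]))
    dx, dy = vt[0]                  # direction of largest variation
    b1 = dy / dx                    # slope (undefined for vertical lines)
    b0 = y.mean() - b1 * x.mean()   # line passes through the centroid
    return b0, b1
```

For error-free collinear data, orthogonal and ordinary least squares coincide; the two estimates diverge as the error on x grows.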
Fig. 2. The errors-in-variables model.
Another possibility is to apply inverse regression. The term inverse is applied in opposition to the usual
calibration procedure. Calibration consists of measuring samples with a known characteristic and
deriving a calibration line (or more generally a model). A measurement is then carried out for an
unknown sample and its concentration is derived from the measurement result and the calibration line.
In view of the assumptions of OLS, the measurement is the y-value and the concentration the x-value,
i.e.
measurement = f (concentration) (3)
This relationship can be inverted to become
concentration = f (measurement) (4)
OLS is then applied in the usual way, meaning that the sum of the squared residuals is minimised in the
direction of y, which is now the concentration. This may appear strange, since, when the calibration
line is computed, there are no errors in the concentrations. However, if it is taken into account that
there will be an error in the predicted concentration of the unknown sample, then minimising in this
way means that one minimises the prediction errors, which is what is important to the analytical
chemist. It has been shown indeed that better results are obtained in this way [10-12]. The analytical
chemist should therefore really apply eq. (4) instead of the usual eq. (3). In most cases the difference in
prediction quality between both approaches is very small in practice, so that there is generally no harm
in applying eq. (3). We will see, however, that when multivariate calibration is applied, inverse
regression is the rule. It should be noted that, when the aim is not to predict y-values, but to obtain the
best possible estimates of β0 and β1, inverse regression performs worse than the usual procedure.
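The difference between eqs. (3) and (4) can be made concrete with a small sketch. The data values below are hypothetical; note that `np.polyfit` returns the slope first, then the intercept.

```python
import numpy as np

# Classical vs. inverse calibration for a straight line (invented data).
conc = np.array([1.0, 2.0, 3.0, 4.0])     # known concentrations
signal = np.array([2.1, 3.9, 6.2, 7.8])   # measured responses

# Classical calibration, eq. (3): signal = f(concentration), then invert
b1_c, b0_c = np.polyfit(conc, signal, 1)
def predict_classical(s):
    return (s - b0_c) / b1_c

# Inverse calibration, eq. (4): concentration = f(signal), apply directly
b1_i, b0_i = np.polyfit(signal, conc, 1)
def predict_inverse(s):
    return b0_i + b1_i * s
```

For these well-behaved data the two predictors give almost identical results; the difference only matters for the prediction error on future samples, as discussed above.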
Fig. 3. The leverage effect.
2.3. Robust regression
One of the most frequently occurring difficulties for an experimentalist is the presence of outliers. The
outliers may be due to experimental error or to the fact that the proposed model does not represent the
data well enough. For example, if the postulated model is a straight line, and measurements are made in
a concentration range where this is no longer true, the measurements obtained in that region will be
model outliers. In figure 3 it is clear that the last point is not representative of the straight line fitted by
the rest of the data. The outlier attracts the regression line computed by OLS. It is said to exert leverage
on the regression line. One might think that outliers can be discovered by examining the residuals
towards the line. As can be observed, this is not necessarily true : the outlier’s residual is not much
larger than that of some other data points.
To avoid the leverage effect, the outlier(s) should be eliminated. One way to achieve this is to use more
efficient outlier diagnostics than simply looking at residuals. Cook’s squared distance or the
Mahalanobis distance can for instance be used.
A still more elegant way is to apply so-called robust regression methods. The easiest to explain is
called the single median method [13]. The slope between each pair of points is computed. For instance,
the slope between points 1 and 2 is 1.10, between 1 and 3 it is 1.00, and between 5 and 6 it is 6.20. The
complete list is 1.10, 1.00, 1.03, 0.95, 2.00, 0.90, 1.00, 0.90, 2.23, 1.10, 0.90, 2.67, 0.70, 3.45, 6.20.
These are now ranked and the median slope (here the 8th value, 1.03) is chosen. All pairs of points of
which the outlier is one point have high values and end up at the end of the ranking, so that they do not
have an influence on the chosen median slope : even if the outlier were still more distant, the selected
median would still be the same. A similar procedure for the intercept, which we will not explain in
detail, leads to the straight line equation y = 0.00 + 1.03 x, which is close to the line obtained with OLS
after eliminating the outlier. The single median method is not the best robust regression method. Better
results are obtained with the least median of squares method (LMS) [14], iteratively re-weighted
regression [15] or bi-weight regression [16]. Comparing the results of calibration lines obtained with OLS and with a
robust method is one way of finding outliers towards a regression model [17].
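The single median slope estimate described above can be sketched as follows; the data set is an invented one containing a single gross outlier.

```python
import statistics

# The single median method: compute the slope between every pair of
# points and retain the median slope, so that pairs involving an outlier
# end up at the extremes of the ranking and do not affect the result.
def single_median_slope(x, y):
    slopes = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if x[j] != x[i]:                 # skip vertical pairs
                slopes.append((y[j] - y[i]) / (x[j] - x[i]))
    return statistics.median(slopes)

# Invented data: five points on y = x plus one gross outlier at (6, 20)
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
slope = single_median_slope(x, y)   # 1.0, unaffected by the outlier
```

Here ten of the fifteen pairwise slopes equal 1.0 and the five pairs involving the outlier all rank at the top, so the median is 1.0 no matter how extreme the outlier is.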
3. Multiple Linear Regression
3.1. Multivariate (multiple) regression
Multivariate regression, also often called multiple regression or multiple linear regression (MLR) in the
linear case, is used to obtain values for the b-coefficients in an equation of the type :
y = b0 + b1x1 + b2x2 + … + bmxm (5)
where x1, x2, …, xm are different variables. In analytical spectroscopic applications, these variables
could be the absorbances obtained at different wavelengths, y being a concentration or other
characteristic of the samples to be predicted, in QSAR (the study of quantitative structure-activity
relationships) they could be variables such as hydrophobicity (log P), the Hammett electronic
parameter σ, with y being some measure of biological activity. In experimental design, equations of the
type
y = b0 + b1x1 + b2x2 + b12x1x2 + b11x1² + b22x2² (6)
are used to describe a response y as a function of the experimental variables x1 and x2. Both equations
(5) and (6) are called linear, which may surprise the non-initiated, since the shape of the relationship
between y and (x1,x2) is certainly not linear. The term linear should be understood as linear in the
regression parameters. An equation such as y = b0 + log (x – b1) is non-linear [2].
It can be observed from the applications cited above that multiple regression models occur quite often.
We will first consider the classical solution to estimate the coefficients. Later we will describe some
more sophisticated methodologies introduced by chemometricians, such as those based on latent
vectors.
As for the univariate case, the b-values are estimates of the true b-parameters and the estimation is done
by minimising a (sum of) squares. It can be shown that
b = (XᵀX)⁻¹Xᵀy (7)
where b is the vector containing the b-values from eq. (5), X is an n × m matrix containing the x-values
for n samples (or objects, as they are often called) and m variables, and y is the vector containing the
measurements for the n samples.
A difficulty is that the inversion of the XᵀX matrix leads to unstable results when the x-variables are
highly correlated. There are two ways to avoid this problem. One is to select variables (variable
selection or feature selection) such that correlation is reduced; the other is to combine the variables in
such a way that the resulting summarising variables are not correlated (feature reduction). Both feature
selection and feature reduction lead to a smaller number of variables than the initial number, which by
itself has important advantages.
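Eq. (7) translates directly into code. In the sketch below a column of ones is prepended to X so that the intercept b0 is estimated together with b1 … bm; the data are invented and noise-free.

```python
import numpy as np

# MLR via the normal equations, as in eq. (7).
def mlr(X, y):
    X1 = np.column_stack([np.ones(len(X)), X])   # add intercept column
    return np.linalg.solve(X1.T @ X1, X1.T @ y)  # b = (X'X)^-1 X'y

# Invented data generated from y = 1 + 2*x1 + 3*x2 (no noise)
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
b = mlr(X, y)   # ≈ [1, 2, 3]
```

In practice a least squares solver such as `np.linalg.lstsq` is preferable to forming XᵀX explicitly, precisely because XᵀX becomes nearly singular when the x-variables are highly correlated.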
3.2. Wide data matrices
Chemists often produce wide data matrices, characterised by a relatively small number of objects (a
few tens to a few hundreds) and a very large number of variables (many hundreds, at least). For
instance, analytical chemists now often apply very fast spectroscopic methods, such as near infrared
spectroscopy (NIR). Because of the rapid character of the analysis, there is no time for dissolving the
sample or separating certain constituents. The chemist tries to extract the required information from the
spectrum as such, and to do so he has to relate a y-value, such as the octane number of gasoline
samples or the protein content of wheat samples, to the absorbance at 500 to, in some cases, 10 000
wavelengths. These e.g. 1000 variables for 100 objects constitute the X matrix. Such matrices contain
many more columns than rows and are therefore often called wide. Feature selection/reduction then
takes on a completely different complexity compared to the situations described in the preceding
sections. It should be remarked that variables in such matrices are often highly correlated. This can for
instance be expected for two neighbouring wavelengths in a spectrum. In the following sections, we
will explain which methods chemometricians use to model very large, wide and highly correlated data
matrices.
3.3. Feature selection methods
3.3.1. Stepwise Selection
The classical approach, which is found in many statistical packages, is the so-called stepwise
regression, a feature selection method. The so-called forward selection procedure consists of first
selecting the variable that is best correlated with y. Suppose this is found to be xi. The model at this
stage is restricted to y = f (xi). Then, one tests all other variables by adding them to the model, which
then becomes a model in two variables y = f (xi,xj). The variable xj which is retained together with xi is
the one which, when added to the model, leads to the largest improvement compared to the original
model y = f (xi). It is then tested whether the observed improvement is significant. If not, the procedure
stops and the model is restricted to y = f(xi). If the improvement is significant, xj is incorporated
definitively in the model. It is then investigated which variable should be added as the third one and
whether this yields a significant improvement. The procedure is repeated until finally no further
Chapter 1 – Multivariate Analysis and Calibration
21
improvement is obtained. The procedure is based on analysis of variance and several variants such as
backwards elimination (starting with all variables and eliminating successively the least important
ones) or a combination of forward and backward methods have been proposed. It should be noted that
the criteria applied in the analysis of variance are such that selected variables are less correlated. In
certain contexts, such as experimental design or QSAR, the reason for applying feature selection is
not only to avoid the numerical difficulties described above, but also to explain relationships. The
variables that are included in the regression equation have a chemical and physical meaning and when a
certain variable is retained it is considered that the variable influences the y-value, e.g. the biological
activity, which then leads to proposals for causal relationships. Correct feature selection then becomes
very important in those situations to avoid drawing wrong conclusions. One of the problems is that the
procedures involve regressing many variables on y and chance correlation may then occur [18].
There are other difficulties, for instance, the choice of experimental conditions, the samples or the
objects. These should cover the experimental domain as well as possible and, where possible, follow an
experimental design. This is demonstrated, for instance, in [19]. Outliers can also cause problems.
Detection of multivariate outliers is not evident. As for the univariate regression, robust regression is
possible [14, 20]. An interesting example in which multivariate robust regression is applied concerns an
experimental design [21] carried out to optimise the yield of an organic synthesis.
3.3.2. Genetic algorithms for feature selection
Genetic algorithms are general optimisation tools aiming at selecting the fittest solution to a problem.
Suppose that, to keep it simple, 9 variables are measured. Possible solutions are represented in figure 4.
Selected variables are indicated by a 1, non-selected variables by a 0. Such solutions are sometimes, in
analogy with genetics, called chromosomes in the jargon of the specialists.
By random selection a set of such solutions is obtained (in real applications often several hundreds).
For each solution an MLR model is built using an equation such as (5) and the sum of squares of the
residuals of the objects towards that model is determined. In the jargon of the field, one says that the
fitness of each solution is determined : the smaller the sum of squares, the better the model describes
the data and the fitter the corresponding solution is.
Fig. 4. A set of solutions for feature selection from nine variables for MLR
Then follows what is described as the selection of the fittest (leading to names such as genetic
algorithms or evolutionary computation). For instance out of the, say 100 original solutions, the 50
fittest are retained. They are called the parent generation. From these is obtained a child generation by
reproduction and mutation.
Reproduction is explained in figure 5. Two randomly chosen parent solutions produce two child
solutions by crossover. The crossover point is also chosen randomly. The first part of solution 1 and
the second part of solution 2 together yield child solution 1’. Solution 2’ results from the first part of
solution 2 and the second part of solution 1.
Fig. 5. Genetic algorithms: the reproduction step
The child solutions are added to the selected parent solutions to form a new generation. This is repeated
for many generations and the best solution from the final generation is retained. Each generation is
additionally submitted to mutation steps. Here and there, randomly chosen bits of the solution string are
changed (0 to 1 or 1 to 0). This is applied in figure 6.
Fig. 6. Genetic algorithms: the mutation step.
The need for the mutation step can be understood from figure 5. Suppose that the best solution is close
to one of the child solutions in that figure, but should not include variable 9. However, because the
value for variable 9 is 1 in both parents, it is also unavoidably 1 in the children. Mutation can change
this and move the solutions in a better direction.
Genetic algorithms were first proposed by Holland [22]. They were introduced in chemometrics by
Lucasius et al [23] and Leardi et al [24]. They were applied for instance in QSAR and molecular
modelling [25], conformational analysis [26], multivariate calibration for the determination of certain
characteristics of polymers [27] or octane numbers [28]. Reviews about applications in chemistry can
be found in [29,30]. There are several competing algorithms such as simulated annealing [31] or the
immune algorithm [32].
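The complete cycle of selection, crossover and mutation can be sketched as follows. Population size, mutation rate, number of generations and the example data are illustrative choices; the fitness is simply the residual sum of squares of the corresponding MLR model, as described above.

```python
import random
import numpy as np

# Toy GA for feature selection: solutions are 0/1 strings over the
# variables; smaller residual sum of squares means a fitter solution.
def fitness(chrom, X, y):
    cols = [i for i, bit in enumerate(chrom) if bit]
    if not cols:
        return float(((y - y.mean()) ** 2).sum())
    X1 = np.column_stack([np.ones(len(y)), X[:, cols]])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r = y - X1 @ b
    return float(r @ r)

def ga_select(X, y, pop=20, gens=30, p_mut=0.05, seed=1):
    rng = random.Random(seed)
    m = X.shape[1]
    popn = [[rng.randint(0, 1) for _ in range(m)] for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=lambda c: fitness(c, X, y))
        parents = popn[: pop // 2]                 # selection of the fittest
        children = []
        while len(children) < pop - len(parents):
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, m)              # single-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(m):                     # bit-flip mutation
                if rng.random() < p_mut:
                    child[i] = 1 - child[i]
            children.append(child)
        popn = parents + children                  # parents kept unmutated
    return min(popn, key=lambda c: fitness(c, X, y))

# Invented noise-free data: y depends only on variables 1 and 4
rng_np = np.random.default_rng(1)
X = rng_np.standard_normal((30, 6))
y = 3 * X[:, 1] - 2 * X[:, 4]
best = ga_select(X, y)
```

Keeping the parents unchanged from one generation to the next (elitism) guarantees that the best solution found so far is never lost to mutation.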
4. Feature reduction : Latent Variables
The alternative to feature selection is to combine the variables into what we earlier called summarising
variables. Chemometricians call these latent variables, and the process of obtaining them is called
feature reduction. It should be understood that in this case no variables are discarded.
4.1. Principal Component Analysis
The type of latent variable most commonly used is the principal component (PC). To explain the
principle of PCs, we will first consider the simplest possible situation. Two variables (x1 and x2) were
measured for a certain number of objects and the number of variables should be reduced to one. In
principal component analysis (PCA) this is achieved by defining a new axis or variable on which the
objects are projected. The projections are called the scores, s1, along principal component 1, PC1 (Fig.
7).
Fig. 7. Feature reduction of two variables, x1 and x2, by a principal component.
The projections along PC1 preserve the information present in the x1-x2 plot, namely that there are two
groups of data. By definition, PC1 is drawn in the direction of the largest variation through the data. A
second PC, PC2, can also be obtained. By definition it is orthogonal to the first one (Fig. 8-a). The
scores along PC1 and along PC2 can be plotted against each other yielding what is called a score plot
(Fig. 8-b).
b)
a)
Fig. 8. a) second PC and b) score plot of the data in Fig. 7.
Note that PCA decorrelates: while the data points in the x1-x2 plot are correlated, they are no longer
so in the s1-s2 plot. This also means that there was correlated, and therefore redundant, information
present in x1 and x2. PCA picks up all the important information in PC1 and the rest, along PC2, is
noise and can be eliminated. By keeping only PC1, feature reduction is applied: the number of
variables, originally two, has been reduced to one. This is achieved by computing the score along PC1
as:
s = w1x1 + w2x2 (8)
In other words the score is a weighted sum of the original variables. The weights are known as loadings
and plots of the loadings are called loading plots.
This can now be generalised to m dimensions. In the m-dimensional space, PC1 is obtained as the axis
of largest variation in the data; PC2 is orthogonal to PC1 and is drawn in the direction of largest
remaining variation around PC1. It therefore contains less variation (and information) than PC1. PC3 is
orthogonal to the plane of PC1 and PC2. It is drawn in the direction of largest variation around that
plane, but contains less variation than PC2. In the same way PC4 is orthogonal to the hyperplane
PC1,PC2,PC3 and contains still less variation, etc. For a matrix with dimensions n x m, N = min(n, m)
PCs can be extracted. However, since each of them contains less and less information, at a certain point
they contain only noise and the process can be stopped before reaching N. If only d << N PCs are
retained, feature reduction is achieved.
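The score-and-loading machinery described above can be sketched in a few lines of linear algebra. The snippet below (illustrative simulated data, not from the thesis) obtains scores and loadings from the singular value decomposition of the column-centred data matrix:

```python
import numpy as np

# Illustrative data: 10 objects, 4 strongly correlated variables.
rng = np.random.default_rng(0)
t = rng.normal(size=(10, 1))
X = np.hstack([t, 2 * t, -t, 0.5 * t]) + 0.01 * rng.normal(size=(10, 4))

Xc = X - X.mean(axis=0)            # column-centre the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                     # coordinates of the objects along the PCs
loadings = Vt.T                    # weights (w-values) of the original variables
explained = s**2 / np.sum(s**2)    # fraction of variance carried by each PC

# Feature reduction: keep only d = 1 of the N = min(n, m) = 4 PCs.
d = 1
reduced = scores[:, :d]
```

Since the four variables are essentially one signal plus noise, PC1 carries almost all of the variance and a single score per object suffices.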
A very important application of principal components is to visually display the information present in
the data set and most multivariate data applications therefore start with score and/or loading plots. The
score plots give information about the objects and the loading plots about the variables. Both can be
combined into a biplot, which is all the more effective after certain types of data transformation, e.g.
spectral mapping [33]. In figure 9, a score plot is shown for an investigation into the Maillard reaction,
a reaction between sugars and amino acids [34]. The samples consist of reaction mixtures of different
combinations of sugars and aminoacids. The variables are the areas under the peaks of the reaction
mixtures. The reactions are very complex: 159 different peaks were observed. Each of the samples is
therefore characterized by its value for 159 variables. The PC1-PC2 score plot of figure 9 can be seen as
a projection of the samples from 159-dimensional space to the two-dimensional space that preserves
best the variance in the data. In the score plot different symbols are given to the samples according to
the sugar that was present and it is observed for instance that samples with rhamnose occupy a specific
location in the score plot. This is only possible if they also occupy a different place in the original 159-
dimensional space, i.e. their GC chromatogram is different. By studying different parts of the data and
by including the information from the loading plots, it is then possible to understand the effect of the
starting materials on the obtained reaction mixture.
Fig. 9. PCA score plot of samples from the Maillard reaction. The samples containing rhamnose are marked with a distinct symbol.
Principal components have been used in many different fields of application. Whenever a table of
samples x variables is obtained and some correlation between the variables is expected, a principal
components approach is useful. Let us consider an environmental example [35]. In figure 10 the score
plot is shown. The data consist of air samples taken at different times in the same sampling location.
For each of the samples a capillary GC chromatogram was obtained. The different symbols given to the
samples indicate different wind directions prevailing at the time of sampling. Clearly the wind direction
has an effect on the sample compositions. To understand this better, figure 11 gives a plot of the
loadings of a few of the variables involved. It is observed that the loadings on PC1 are all positive and
not very different. Referring to eq. (8), and remembering that the loadings are the weights (the w-
values), this means that the score on PC1 is simply a weighted sum of the variables and therefore a
global indicator of pollution. The samples with the highest scores on PC1 are those with the highest
degree of pollution. Along PC2 some variables have positive loadings and others negative loadings. Those of
the aliphatic variables are positive and those of the aromatic variables are negative. It follows that
samples with positive scores on PC2 contain more aliphatic than aromatic compounds.
Fig. 10. PCA score plot of air samples.
Fig. 11. PCA loading plot of a few variables measured on the air samples.
Combining PC1 and PC2, one can then conclude that samples with symbol x have an aliphatic character
and that the total content increases with higher values on PC1. The same reasoning can be held for the
samples with symbol • : they have an aromatic character. In fact, one could define new aliphaticity and
aromaticity factors as in figure 12. This can be done in a more formal way using what is called factor
analysis.
Fig. 12. New fundamental factors discovered on a score plot.
4.2. Other latent variables
There are other types of latent variables. In projection pursuit [34,36], a latent variable is chosen such
that it describes the largest inhomogeneity in the data set instead of the largest variation. In that way
clusters or outliers can be observed more easily. Figure 13 shows the result of projection pursuit
applied to the Maillard data of figure 9; the cluster of rhamnose samples can now be observed more clearly.
Fig. 13. Projection pursuit plot of samples from the Maillard reaction. The samples containing rhamnose are marked with a distinct symbol.
If the y-values are not characteristics observed for a set of samples, but the class belongingness of the
samples (e.g. samples 1-10 belong to class A, samples 11-25 to class B), then a latent variable can be
defined that describes the largest discrimination between the classes. Such latent variables are called
canonical variates or sometimes linear discriminant functions, and are the basis for supervised pattern
recognition methods such as linear discriminant analysis. In the partial least squares (PLS) section, still
another type of latent variable will be introduced.
4.3. N-way methods
Some data have a more complex structure than the classical 2-way matrix or table. Typical examples
are met for instance in environmental chemistry [37]. A set of n variables can be measured in m
different locations at p different times. This leads to a 3-way data set with dimensions n x m x p. The
three ways (or modes) are the variable mode, the location mode and the time mode. This can of course
be generalised to a higher number of modes, but for the sake of simplicity we will here restrict figures
and formulas to 3-way. The classical approach to study such data is to perform what is called
unfolding. Unfolding consists in rearranging a 3-way matrix into a 2-way matrix. The 3-way array can
be considered as several 2-way tables (slices of the original matrix), and these tables can be put next to
each other, leading to a new 2-way array (Fig. 14). This rearranged matrix can be treated with PCA.
Considering the example of figure 14, the scores will carry information about the locations, and the
loadings mixed information about the two other modes.
Fig. 14. Unfolding of a 3-way matrix, performed preserving the 'Location' dimension.
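As a small illustration (hypothetical dimensions, not the environmental data of ref. [37]), unfolding while preserving the location mode amounts to a transpose followed by a reshape:

```python
import numpy as np

# Hypothetical 3-way array: 6 variables x 4 locations x 5 times.
X = np.arange(6 * 4 * 5, dtype=float).reshape(6, 4, 5)

# Preserve the 'location' mode: the 6 x 5 slice belonging to each
# location is laid out as one row, giving a 4 x 30 two-way matrix
# that ordinary PCA can handle (the Tucker1 approach).
X_loc = X.transpose(1, 0, 2).reshape(4, 6 * 5)
```

The scores of a PCA on `X_loc` then describe the locations, while the loadings mix the variable and time modes.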
Unfolding can be performed in different directions so that each of the three modes is successively
preserved in the unfolded matrix. In this way, three different PCA models can be built, the scores of
each of these models giving information about one of the modes. This approach is called the Tucker1
model. It is the first of a series of Tucker models [38]. The most important of these is the Tucker3
model. Tucker3 is a true n-way method as it takes into account the multi-way structure of the data. It
consists in building, through an iterative process, a score matrix for each of the modes, and a core
matrix defining the interactions between the modes. As in PCA, the components in each mode are
constrained to be orthogonal. The number of components can be different in each mode. A graphical
representation of the Tucker3 model for 3-way data is given in figure 15. It appears as a sum, weighted
by the core matrix G, of outer products between the factors stored as columns in the A, B and C score
matrices.
Fig. 15. Graphical representation of the Tucker3 model. n, m and p are the dimensions of the original matrix X. w1, w2 and w3 are the numbers of components extracted on modes 1, 2 and 3 respectively, corresponding to the numbers of columns of the loading matrices A, B and C respectively.
Another common n-way model is the Parafac-Candecomp model, proposed independently by Harshman
and by Carroll and Chang [39,40]. Information about n-way methods (and software) can be found in refs.
[41-43]. Applications in process control [44,45], environmental chemistry [37,46], food chemistry [47],
curve resolution [48] and several other fields have been published.
5. Calibration on latent variables
5.1. Principal component regression (PCR)
Until now we have applied latent variables only for display purposes. Principal components can
however also be used as the basis of a regression method. This is applied, among others, when the
x-values constitute a wide X matrix, for example in NIR calibration (see earlier). Instead of the original
x-values one applies the reduced ones, the scores. Suppose m variables (e.g. 1000) were measured for n
samples (e.g. 100). As explained earlier this requires either feature selection or feature reduction. The
latter can be achieved by replacing the m x-values by the scores on the k significant PCs (e.g. 5).
The X matrix now no longer consists of 100 x 1000 absorbance values but of 100 x 5 scores, since each
of the 100 samples is now characterized by 5 scores instead of 1000 variables. The regression model is:
y = a1s1 + a2s2 + … + a5s5 (9)
Since:
s = w1x1 + w2x2 + … + w1000x1000 (10)
eq (9) becomes:
y = b1x1 + b2x2 + … + b1000x1000 (11)
By using the principal components as intermediates it is therefore possible to solve the wide X matrix
regression problem. It should also be noted that the principal components are by definition not
correlated, so that the correlation problem mentioned earlier is therefore also solved.
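A minimal numerical sketch of these two steps (simulated spectra of exactly rank 5; sizes and names are illustrative only, not data from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 100, 1000, 5                    # samples, wavelengths, PCs kept
T_true = rng.normal(size=(n, k))
X = T_true @ rng.normal(size=(k, m))      # wide matrix of correlated 'spectra'
y = T_true @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])

Xc, yc = X - X.mean(axis=0), y - y.mean()
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = (U * s)[:, :k]                        # scores on the k significant PCs

a, *_ = np.linalg.lstsq(T, yc, rcond=None)   # eq. (9): regress y on the scores
b = Vt[:k].T @ a                             # eq. (11): coefficients on the 1000 x's
```

Because the scores are orthogonal, the 100 x 5 regression is perfectly conditioned even though the original 100 x 1000 problem is not.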
5.2. Partial least squares (PLS)
The aim of partial least squares is the same as that of PCR, namely to model a set of y-values with the
data contained in an (often) wide matrix of correlated variables. However the approach is different. In
PCR, one works in two steps: in the first, the scores are obtained and only the X matrix is involved; in
the second, the y-values are related to the scores. In PLS this is done in only one step. The latent variables
are obtained, not with the variation in X as criterion as is the case for principal components, but such
that the new latent variable shows maximal covariance between X and y. This means that the latent
variable is now built directly as a function of the relationship between y and X. In principle one
therefore expects PLS to perform better than PCR, but in practice they often perform equally
well. A tutorial can be found in [49]. Several algorithms are available. A very effective one, requiring
the least computer time according to our experience, is SIMPLS [50].
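For illustration, a didactic PLS1 implementation (classical NIPALS-style deflation, not the SIMPLS algorithm mentioned above) showing how each latent variable is built from the covariance between X and y:

```python
import numpy as np

def pls1(X, y, n_lv):
    """Didactic PLS1 (NIPALS-style); returns the regression vector b
    such that y - mean(y) ~ (X - mean(X)) @ b."""
    X, y = X - X.mean(axis=0), y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = X.T @ y
        w /= np.linalg.norm(w)         # direction of maximal covariance with y
        t = X @ w                      # score of the latent variable
        p = X.T @ t / (t @ t)          # loading of X on the score
        W.append(w); P.append(p); q.append(t @ y / (t @ t))
        X = X - np.outer(t, p)         # deflate X ...
        y = y - t * q[-1]              # ... and y before the next LV
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)
```

With as many latent variables as the rank of X, the PLS solution coincides with ordinary least squares; the interest lies in stopping much earlier.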
5.3. Applications of PCR and PLS
PCR and PLS have been applied in many different fields. The following references constitute a
somewhat haphazard selection from a very large literature. There are many analytical applications in
the pharmaceutical industry [51], the petroleum industry [52], food science [53], environmental
chemistry [54]. The methods are used with near- or mid-infrared [55], chromatographic [56], Raman
[57], UV [58] and potentiometric [59] data. A good overview of applications in QSAR is found in [60].
5.4. PLS2 and other methods describing the relationship between two tables
Instead of relating one y-value to many x-values, it is possible to model a set of y-values with a set of
x-values. This means that one relates two matrices Y and X, or in other words two tables. For instance,
one could measure for a certain set of samples a number of sensory characteristics on the one hand and
obtain analytical measures on the other. This would yield two tables as depicted in figure 16. One could
then wonder if it is possible to predict the sensory characteristics from the (easier to measure) chemical
measurements or at least to understand which (combinations) of analytical measurements are related to
which sensory characteristics. At the same time one wants to obtain information about the structure of
each of the two tables (e.g. which analytical variables give similar information). PLS2 can be used for
this purpose. Other methods that can be applied are for instance canonical correlation and reduced rank
regression. An example relating 20 measurements of mechanical strength of meat patties to the sensory
evaluation of textural attributes can be found in [61] and a comparison of methods in [62].
Fig. 16. Relating two 2-way tables.
5.5. Generalisation
It is also possible to relate multi-way models to a vector of y-values or to 2-way tables. In the same
way as with 2-way data, the latent variables obtained in multi-way models are then used to build the
regression models [63]. The multi-way analogue of PCR would consist in modelling the original data with
Tucker3 or Parafac, and then regress the dependent y-variable on the obtained scores. A more
sophisticated N-way version of PLS (N-PLS) was also developed [64]. The principle of N-PLS is to fit
a model similar to Parafac, but aiming at maximizing the covariance between the dependent and
independent variables instead of fitting a model in a least squares sense. The usefulness of such
approaches will be apparent from figure 17. In process analysis, one is concerned with the quality of
finished batches, and this can be described by a number of quality parameters. At the same time, for
each batch, a number of variables can be measured on the process as a function of time [65]. This yields
a two-way table on the one hand and a three-way one on the other. Relating these tables allows
predicting the quality of a batch from the measurements made during the process.
Fig. 17. Relating a two-way and a three-way table.
6. Conclusion
The most common chemometrical modelling methods were introduced in this chapter, together with
some more advanced ones, in particular methods that apply to data with complex structure. These
concepts will be developed further in the following chapters.
REFERENCES
[1] N.R. Draper and H. Smith, Applied Regression Analysis, Wiley, New York, 1981.
[2] J. Mandel, The Statistical Analysis of Experimental Data, Dover reprint, 1984, Wiley & Sons,
1 Shell International Chemicals B.V., Shell Research and Technology Centre Amsterdam, P.O. Box 38000, 1030 BN Amsterdam, The Netherlands
ABSTRACT
The present study compares the performance of different multivariate calibration techniques when new samples to be predicted can fall outside the calibration domain. Results of the calibration methods are investigated for extrapolation of different types and various levels. The calibration methods are applied to five near-IR data sets including difficulties often met in practical cases (non-linearity, non-homogeneity and presence of irrelevant variables in the set of predictors). The comparison leads to general recommendations about which method to use when samples requiring extrapolation can be expected in a calibration application.
1 Shell International Chemicals B.V., Shell Research and Technology Centre Amsterdam, P.O. Box 38000, 1030 BN Amsterdam, The Netherlands
ABSTRACT
This work is part of a more general research effort aiming at comparing the performance of multivariate calibration methods. In the first and second parts of the study, the performances of multivariate calibration methods were compared in situations of interpolation and extrapolation respectively. This third part of the study deals with the robustness of calibration methods in the case where spectra corresponding to new samples, of which the y value has to be predicted, can be affected by instrumental perturbations not accounted for in the calibration set. This type of perturbation can occur due to instrument ageing, replacement of one or several parts of the spectrometer (e.g. the detector), use of a new instrument, or modifications in the measurement conditions, such as the displacement of the instrument to a different location. Even though no general rules could be drawn, the variety of data sets and calibration methods used made it possible to establish some guidelines for multivariate calibration in this unfavourable case where instrumental perturbation arises.
Chapter 2 – Comparison of Multivariate Calibration Methods
71
1 – Introduction
This study is part of a more general research effort aiming at comparing the performance of multivariate
calibration methods. These methods relate instrumental responses, consisting of a set of
predictors X, to a chemical or physical property of interest y (the response factor). The choice of the
most appropriate method is a crucial step in order to obtain a good prediction of the property y of new
samples. Methods were compared using sets of industrial Near-Infrared (NIR) data, chosen such that
they include difficulties often met in practice, namely data clustering, non-linearity, and presence of
irrelevant variables in the set of predictors. The comparative study was performed in three separate
steps :
• In the first part of the study [1], the performances of multivariate calibration methods were
compared in the ideal situation where test samples are within the calibration domain (interpolation).
• In the second part of the study [2], the performances of multivariate calibration methods were
compared in a situation which sometimes cannot be avoided in practice: the case where some test
samples fall outside the calibration domain (extrapolation). Extrapolation occurring in the X-space and
in the Y-space was considered.
• This third part of the study deals with the case where spectra corresponding to new samples of
which the y value has to be predicted can be affected by instrumental perturbations not accounted for in
the calibration set. The robustness of a calibration model is challenged in this situation in which exactly
superimposing replicate spectra of a stable standard is impossible. The instrumental perturbations can
be due to instrument ageing, replacement of one or several parts of the spectrometer (e.g. the detector),
use of a new instrument, or modifications in the measurement conditions, like the displacement of the
instrument to a different location. In all cases a degradation of the prediction results must be expected.
This third part of the method comparison study aims at evaluating the robustness of the different
calibration methods in the presence of such perturbations.
2 - Experimental
2.1 - Multivariate calibration methods tested
Only the methods that performed best in the first and second part of the comparative study [1,2] were
retained for this part. The calibration methods used in each part of the comparative study are
summarised in Table 1.
Table 1. Methods used in the different parts of the comparative study. Part 3 is the current study.
Method           Part 1   Part 2   Part 3
PCR                X        X
PCR-sel            X        X        X
TLS-PCR            X
TLS-PCR-sel        X
PLS-cv             X
PLS-rand           X        X        X
PLS-pert                             X
Brown              X
MLR-step           X        X        X
GA                 X        X
FT-GA              X        X        X
UVE-PCR            X        X
UVE-PCR-sel        X
UVE-PLS            X        X        X
RCE-PLS            X        X
NL-PCR             X
NL-PCR-sel         X
NL-UVE-PCR         X
NL-UVE-PCR-sel     X
poly-PCR           X
SPL-PLS            X        X
kNN                X
LWR                X        X        X
RBF-PLS            X        X        X
FT-NN                                X
PC-NN              X        X        X
OBS-NN             X
2.1.1 - Principal component regression (PCR)
In classical PCR (sometimes referred to as top-down PCR) [3], the number A of Principal Components
(PCs) is optimised by Leave-One-Out (LOO) Cross-Validation (CV). The PCs from PC1 up to PCA are
retained in order of the variance they explain in the original data matrix X. A limitation of this approach
is that, in some cases, information related to the property to be predicted y is found in high-order PCs,
which account for only a small amount of spectral variance. An alternative version called PCR with
best subset selection (PCR-sel) was therefore used. In this method, PCs are selected according to their
correlation with the target property y [1]. Model complexity was estimated by LOOCV followed by a
randomisation test [4]. This test determines whether models with lower complexity have
significantly worse predictive ability and should therefore not be used.
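The selection criterion of PCR-sel can be sketched as follows (simulated data; the LOOCV and randomisation test for the final complexity are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 40, 60
X = rng.normal(size=(n, m))
y = rng.normal(size=n)

Xc, yc = X - X.mean(axis=0), y - y.mean()
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
keep = s > 1e-10 * s[0]              # drop the null PC created by centring
T = (U * s)[:, keep]                 # scores, still in variance order

# top-down PCR would take the first A columns of T; PCR-sel instead
# ranks the PCs by their correlation with the target property y
r = np.array([np.corrcoef(T[:, j], yc)[0, 1] for j in range(T.shape[1])])
order = np.argsort(-np.abs(r))       # best-correlated PCs first
A = 3
a, *_ = np.linalg.lstsq(T[:, order[:A]], yc, rcond=None)
```

High-order PCs that happen to carry y-related information are thus eligible even though they explain little spectral variance.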
2.1.2 - Partial least squares and its variants
Contrary to PCR, the Latent Variables (LVs) in PLS [5,6] are calculated to maximise the covariance
between X and y. Latent variable selection as performed in PCR is therefore not necessary. The model
complexity A in PLS can be determined in several ways. The most classical way is to perform LOOCV
and retain the complexity associated with the minimum LOOCV error (PLS-cv). However this
approach is rather conservative, since the removal of one sample at a time corresponds to a small
statistical perturbation of the calibration set. The complexity of the model chosen is often too high. Use
of a randomisation test often reduces the complexity of the selected models (PLS-rand), but in
some cases it carries a risk of underfitting, i.e. too few LVs can be retained [7]. This is why an
alternative validation method for selecting the optimal model complexity, based on the simulation of
instrumental perturbations on a subset of calibration sample spectra (PLS-pert) [7], was developed. This
method aims at determining the number of LVs beyond which models are unnecessarily sensitive to
instrumental perturbations affecting the spectra.
2.1.3 - Methods based on variable selection/elimination
In stepwise Multiple Linear Regression (MLR-step), original variables are selected iteratively
according to their correlation with the target property y [8]. For a selected variable xi, a regression
coefficient bi is determined and tested for significance using a t-test at a critical level α (α = 5% was
used in this study). If the coefficient is found to be significant, the variable is retained and another
variable xj is selected according to its partial correlation with the residuals obtained from the model
built with xi. This procedure is called forward selection. The significance of the two regression
coefficients bi and bj associated with the two retained variables is then tested again, and the
non-significant terms are eliminated from the equation (backward elimination). Forward selection and
backward elimination are alternately repeated until no significant improvement of the model fit can
be achieved by including more variables and all regression terms already selected are significant. In
order to reduce the risk of overfitting due to retaining too many variables, a procedure based on
LOOCV followed by a randomisation test is applied to test different sets of variables for significant
differences in prediction.
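A single forward-selection step, including the significance test on the newly entered coefficient, can be sketched as below. This is a simplified illustration: centred data and no intercept are assumed, the critical t-value is hard-coded near the 5% level instead of being looked up, and backward elimination is omitted.

```python
import numpy as np

def forward_step(X, y, selected, t_crit=2.0):
    """One forward step of stepwise MLR (didactic sketch).

    Picks the unselected variable most correlated with the residual of
    the current model, then keeps it only if its coefficient passes an
    approximate t-test (|t| > t_crit, roughly the 5% level for moderate n).
    """
    n = len(y)
    if selected:
        b, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        resid = y - X[:, selected] @ b
    else:
        resid = y - y.mean()
    candidates = [j for j in range(X.shape[1]) if j not in selected]
    corrs = [abs(np.corrcoef(X[:, j], resid)[0, 1]) for j in candidates]
    j = candidates[int(np.argmax(corrs))]          # best partial correlation
    cols = selected + [j]
    A = X[:, cols]
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    s2 = resid @ resid / (n - len(cols))           # residual variance
    cov = s2 * np.linalg.inv(A.T @ A)              # covariance of coefficients
    t = b[-1] / np.sqrt(cov[-1, -1])               # t-statistic of new term
    return cols if abs(t) > t_crit else selected
```

Repeating this step, interleaved with backward elimination passes over the already selected terms, reproduces the procedure described above.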
Genetic algorithms (GA) are probabilistic optimisation tools inspired by the “survival of the fittest”
principle of Darwinian theory of natural evolution and the mechanisms of natural genetics [9]. They
can be used in calibration to select a small subset of original variables to model y using MLR [10,11].
Instead of performing the selection on the set of numerous correlated original variables, one can apply
GA to transformed variables, such as power spectrum coefficients obtained by Fourier transform
(FT-GA) [12]. In this case the variable selection is carried out in the frequency domain, using the first fifty
power spectrum coefficients only.
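The transformation step (independent of the GA itself) can be sketched as follows; the exact normalisation used in ref. [12] may differ, and the spectrum here is a synthetic stand-in:

```python
import numpy as np

# Stand-in for one NIR spectrum of 512 absorbance points.
rng = np.random.default_rng(4)
spectrum = np.sin(np.linspace(0, 8 * np.pi, 512)) + 0.01 * rng.normal(size=512)

coef = np.fft.rfft(spectrum)        # Fourier transform of the spectrum
power = np.abs(coef) ** 2           # power spectrum coefficients
features = power[:50]               # first fifty coefficients offered to the GA
```

The GA then selects a small subset of these 50 features, computed for every sample, to build the MLR model.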
2.1.4 - Methods based on uninformative variable elimination
The idea behind the uninformative variable elimination PLS (UVE-PLS) method is to significantly
reduce the number of original variables before calculating the LVs in the final PLS model [13]. This is
done by removing original variables that are considered unimportant. One first generates vectors of
random variables that are attached to each spectrum in the data set. Then a PLS model is built on the
set of artificially augmented spectra, and all variables with regression coefficients not significantly
more reliable than those of the dummy variables are eliminated. (The reliability of a
coefficient is calculated as the ratio of its magnitude to its standard deviation estimated by leave-one-
out jackknifing). After reduction of the number of original variables, a new PLS model is built. Model
complexities for variable elimination and final modelling are determined by LOOCV. The advantage of
the UVE-PLS approach is that, since noisy or redundant variables have been eliminated, the models
built after variable elimination will be more parsimonious and robust than classical PLS models.
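The elimination logic can be sketched on a toy example; for brevity, plain least squares replaces PLS, and the raw spread of the leave-one-out coefficient estimates stands in for the proper jackknife standard error:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 30, 10
X = rng.normal(size=(n, m))
y = X[:, 0] - 2 * X[:, 1] + 0.05 * rng.normal(size=n)   # only x0, x1 matter

# attach m random 'dummy' variables to each spectrum
Xa = np.hstack([X, rng.normal(size=(n, m))])

# leave-one-out jackknife of the regression coefficients
B = np.array([np.linalg.lstsq(np.delete(Xa, i, axis=0),
                              np.delete(y, i), rcond=None)[0]
              for i in range(n)])
reliability = np.abs(B.mean(axis=0)) / B.std(axis=0)    # |mean| / spread

# keep only real variables more reliable than the best dummy
cutoff = reliability[m:].max()
informative = np.where(reliability[:m] > cutoff)[0]
```

Variables whose coefficients are no more stable than those of pure noise are discarded before the final model is fitted.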
2.1.5 - Methods based on local modelling
In locally weighted regression (LWR), a dedicated local model is developed for each new prediction
sample [14]. This can be advantageous for data sets that exhibit some clustering or some non-linearity
that can be approximated by local linear fits. For each point to be predicted, a local PLS model is built
using the closest (in terms of Euclidean norm in the X space) calibration points. In this study, the points
were given uniform weights in the local model [15].
The radial basis function PLS method (RBF-PLS) bears some similarities to LWR [16]. The PLS
algorithm is applied to the M and y matrices instead of the X and y matrices. M (n × n) is called the
activation matrix (with n the number of samples). Its elements are Gaussian functions placed at the
positions of the calibration objects. A form of local modelling is thus performed, as in LWR. The PLS
algorithm relates the non-linearly transformed distance measures in M to the target property in y. The
width of the Gaussian functions and the number of LVs are optimised by prediction testing using a
training and a monitoring set, as is done for Neural Networks.
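A sketch of the activation matrix construction (the exact width parameterisation of the Gaussians in ref. [16] may differ):

```python
import numpy as np

def activation_matrix(X, width):
    """Gaussian activation matrix M (n x n): element (i, j) is a Gaussian
    function of the distance between calibration samples i and j."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    return np.exp(-d2 / (2 * width ** 2))
```

PLS is then run on (M, y) instead of (X, y); the width and the number of LVs are tuned on a monitoring set.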
2.1.6 - Methods using Neural Networks (NN)
A back-propagation NN using PCs as inputs was used in this study (PC-NN). A method based on the
contribution of each node was applied to find the best numbers of nodes in the input and hidden
layers [17]. NN models using Fourier transform power spectrum coefficients (FT-NN) were also
used. Optimisation of the set of input coefficients was performed on the first 20 coefficients by trial-
and-error (the variance propagation approach for sensitivity estimation cannot be applied in this case
since Fourier coefficients are not orthogonal). All NN models had one hidden layer and were trained
with the Levenberg-Marquardt algorithm [18]. Hyperbolic tangent and/or linear functions were used in
the nodes of the hidden and output layers.
2.2 - Data sets used
All data sets were described in detail in the first two parts of this comparative study [1,2]. The main
characteristics of the data sets used in the comparative study are summarised in Table 2.
Table 2. Main characteristics of the experimental NIR data sets.
Data set     Linearity/nonlinearity   Clustering
WHEAT        Linear                   Strong (PC3)
POLYOL       Linear                   Strong (PC1)
GASOLINE 1   Slightly nonlinear       Strong (PC2)
POLYMER      Strongly nonlinear       Strong (PC1)
GAS OIL 1    Nonlinear                Minor (PC1-PC2)
2.2.1 - WHEAT data
This data set was submitted by Kalivas [19] to the Chemometrics and Intelligent Laboratory Systems
database of proposed standard reference data sets. It consists of NIR spectra of wheat samples with
specified moisture content. Samples were measured in diffuse reflectance from 1100 to 2500 nm (2 nm
step) on a Bran & Luebbe instrument. Offset correction was performed on the spectra to eliminate
baseline shift. After offset correction, a PCA revealed a separation into two clusters on PC3. This
separation can be linked to the clustering present in the y values. An isolated sample on this PC was
detected as an outlier and removed from the data.
2.2.2 - POLYOL data
This data set consists of NIR spectra used for the determination of hydroxyl number in polyether
polyols. Spectra were recorded on a NIR Systems 6250 instrument from 1100 to 2158 nm (2 nm step).
An offset correction was applied to eliminate a baseline shift between spectra. The data set contains
two clusters due to the presence of a peak at 1690 nm in only some of the spectra [10]. The clustering
can be seen on a PC1-PC2 score plot.
2.2.3 - GASOLINE data
This data set was studied for the determination of gasoline MON. The NIR spectra were recorded on a
PIONIR 1024 spectrometer from 800 to 1080 nm (0.5 nm step). Spectra were pre-processed
with first derivatives to eliminate a baseline shift and to separate overlapping peaks. This data set
contains three clusters due to gasolines of different grades, and it is non-linear.
2.2.4 – POLYMER data
This data set was used for the determination of the amount of a minor mineral component in a polymer.
NIR spectra were recorded from 1100 to 2498 nm (2 nm step). SNV transformation was applied to
remove a curved baseline shift between spectra. This data set is clustered and strongly non-linear, both
in the X-y relationship and in the X-space.
2.2.5 – GAS OIL data
This data set was studied for modelling the viscosity of hydro-treated gas oil samples. The NIR spectra
were recorded on a NIR interferometer between 4770 and 6300 cm-1 (1.9 cm-1 step). Spectra were
converted from wavenumbers to wavelengths and linear baseline correction was performed to correct
for a baseline drift. Clusters and zones of unequal density are present in the data set because the
samples come from three different batches. This data set is non-linear, but the non-linearity can
only be seen due to the presence of two extreme samples. These extreme samples could have been
misinterpreted as outliers, but the people in charge of data acquisition established through expert
knowledge that this was not the case.
2.3 - Design of the method comparison study
Models were developed using calibration samples, and their predictive ability was evaluated on
perturbation-free test samples, as was done in the first part of the comparative study [1]. Perturbations
were then simulated on the spectra of the test samples. The following types of perturbation were
studied :
• detector noise
• change in optical pathlength
• wavelength shift
• slope in baseline
• baseline offset
• stray light
For each calibration method, the prediction error on the perturbed test samples was evaluated and
compared to the prediction error on perturbation-free samples. Therefore, this study provided not only
information on the performance of calibration methods in the presence of perturbation, but also on the
relative degradation of performance compared to perturbation-free test samples.
The perturbations were simulated as follows:
2.3.1 - Detector noise
Gaussian white noise can affect detectors in spectroscopy. Since the measured transmitted or reflected
light is log-transformed to absorbance, the Gaussian white noise becomes heteroscedastic (Fig. 1). To
simulate detector noise in each data set, the maximum peak height of the mean spectrum was first
determined. White noise was then simulated with a standard deviation equal to a fraction of the
maximum peak height and added to the transmission or reflection spectra before they were
log-transformed into absorbance. For the GASOLINE data, the raw spectra before application of the
first derivative were used.
Fig. 1. POLYOL data. Standard deviation of simulated detector noise. Absorbance scale.
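The procedure above can be sketched as follows. The function name and the choice of scaling the noise by the maximum of the mean transmission spectrum are illustrative assumptions, not the exact code used in the study:

```python
import numpy as np

def add_detector_noise(transmission, noise_fraction, seed=0):
    """Sketch: Gaussian white noise is added on the transmission scale;
    the log transform to absorbance then makes it heteroscedastic.
    Scaling the noise level by the mean-spectrum maximum is an assumption."""
    rng = np.random.default_rng(seed)
    sigma = noise_fraction * np.abs(transmission.mean(axis=0)).max()
    noisy = transmission + rng.normal(0.0, sigma, size=transmission.shape)
    return -np.log10(np.clip(noisy, 1e-12, None))   # absorbance spectra
```

Because the noise is constant on the transmission scale, its standard deviation on the absorbance scale grows with absorbance, reproducing the heteroscedastic pattern of Fig. 1.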
2.3.2 - Change in optical pathlength
In spectroscopy, scattering due to different particle sizes, presence of water in a sample or change of
the sample cell can modify the effective pathlength of the radiation. This multiplicative effect causes a
modification in absorbance (Fig. 2).
Fig. 2. GAS OIL 1 data. Influence of a 2.5% optical pathlength change.
Let x be the absorbance value at a given wavelength. After a change Δl of the optical pathlength l, the
absorbance for the same sample at the same wavelength becomes :

x_path = x (1 + Δl/l)      (1)
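Eqn (1) is a pure multiplicative correction applied at every wavelength; the function name below is illustrative:

```python
import numpy as np

def pathlength_change(absorbance, delta_l_over_l):
    """Eqn (1): x_path = x * (1 + Δl/l), applied to every wavelength."""
    return np.asarray(absorbance) * (1.0 + delta_l_over_l)
```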
2.3.3 - Wavelength shift
Imperfections in the optics or mechanical parts of spectrometers can cause wavelength shifts. To
simulate wavelength shifts, a second-order polynomial was fitted to each spectrum using 3-point
spectral windows. Once the polynomial coefficients were obtained for each window, the shifted
absorbance values were interpolated at the position defined by the shift value ∆λ (Fig. 3).
Fig. 3. POLYMER data. Influence of a 2 nm wavelength shift.
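A minimal sketch of the described procedure, assuming that the 3-point quadratic fit is evaluated at each interior wavelength (the handling of the two endpoints is an assumption):

```python
import numpy as np

def wavelength_shift(spectrum, wavelengths, delta):
    """Fit a second-order polynomial in each 3-point window and read the
    absorbance off at λ + Δλ. Endpoints, which lack a full window, are
    left unshifted here (an assumption of this sketch)."""
    spectrum = np.asarray(spectrum, dtype=float)
    shifted = spectrum.copy()
    for k in range(1, len(spectrum) - 1):
        coeffs = np.polyfit(wavelengths[k - 1:k + 2], spectrum[k - 1:k + 2], 2)
        shifted[k] = np.polyval(coeffs, wavelengths[k] + delta)
    return shifted
```

A quadratic through 3 points is an exact interpolation, so for a locally smooth spectrum the interpolated value at λ + Δλ is a good estimate of the shifted absorbance.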
2.3.4 - Baseline slope
Baseline slope is often related to multiplicative perturbations such as stray light or optical pathlength
change. A slope is determined as a fraction of the maximum signal of the mean spectrum and added to
all spectra of the data set (Fig. 4).
Fig. 4. WHEAT data. Influence of a 3% baseline slope.
2.3.5 - Baseline offset
Baseline offsets can be due to imperfection in optics, fouling of the sample cell or even changes in the
cell positioning of the fiber optic. The baseline offset was determined as a fraction of the maximum
signal in the mean spectrum and added to all spectra (Fig. 5).
Fig. 5. GAS OIL 1 data. Influence of a 2% baseline offset.
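The two additive perturbations above can be sketched together. Scaling by the maximum signal of the mean spectrum follows the text, while the linear 0→1 ramp across the wavelength axis is an assumption of this sketch:

```python
import numpy as np

def add_baseline(spectra, slope_fraction=0.0, offset_fraction=0.0):
    """Add a baseline offset and/or slope, each expressed as a fraction
    of the maximum signal of the mean spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    max_signal = spectra.mean(axis=0).max()
    ramp = np.linspace(0.0, 1.0, spectra.shape[1])   # assumed slope shape
    return (spectra + slope_fraction * max_signal * ramp
                    + offset_fraction * max_signal)
```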
2.3.6 - Stray light
Stray light is the fraction of detected light that was not transmitted through the sample. It is usually
caused by imperfections in the optical parts of the instruments. At a given wavelength, the effect of
stray light is simulated before log-transformation by adding a fraction s of the maximum signal in the
mean spectrum (Fig. 6).
Fig. 6. GAS OIL 1 data. Influence of 1% stray light.
Therefore the absorbance for a sample at a given wavelength in the presence of stray light is calculated
as :

x_stray = −log10(10^(−x) + s)      (2)
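Eqn (2) translates directly into code; the function name is illustrative:

```python
import numpy as np

def add_stray_light(absorbance, s):
    """Eqn (2): x_stray = -log10(10**(-x) + s); s is the stray-light
    level expressed on the transmission scale."""
    return -np.log10(10.0 ** (-np.asarray(absorbance)) + s)
```

Note that stray light compresses the high-absorbance end of the scale: the higher the true absorbance, the larger the apparent decrease.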
Some instrumental perturbations were not applied to experimental data sets that had been pre-processed
in order to remove instrumental effects of the same type. For each experimental data set, perturbations
were adjusted by visual evaluation of the perturbation effect on the spectra. Details on the simulated
perturbations can be found in Table 3.
Table 3. Perturbations applied to the experimental data sets.
THE DEVELOPMENT OF CALIBRATION
MODELS FOR SPECTROSCOPIC DATA USING
MULTIPLE LINEAR REGRESSION
Based on :
THE DEVELOPMENT OF CALIBRATION MODELS FOR
SPECTROSCOPIC DATA USING PRINCIPAL COMPONENT
REGRESSION
Internet Journal of Chemistry 2 (1999) 19, URL: http://www.ijc.com/articles/1999v2/19/
R. De Maesschalck, F. Estienne, J. Verdú-Andrés, A. Candolfi, V. Centner, F. Despagne, D. Jouan-Rimbaud, B. Walczak†, D.L. Massart*, S. de Jong¹, O.E. de Noord², C. Puel³, B.M.G. Vandeginste¹
¹ Unilever Research Laboratorium Vlaardingen, P.O. Box 114, 3130 AC Vlaardingen, The Netherlands
³ Centre de Recherches Elf-Antar, Centre Automatisme et Informatique, BP 22, 69360 Solaize, France
ABSTRACT
This article explains how to develop a calibration model for spectroscopic data using Multiple Linear Regression (MLR). Building an MLR model on spectroscopic data implies selecting variables; variable selection methods are therefore studied in this article. Before applying the method, the data have to be investigated in order to detect, for instance, outliers, clustering tendency or non-linearities. How to handle replicates and how to perform different data pre-processings and/or pre-treatments is also explained in this tutorial.
(Fig. 1 legend : ... original data, o measured data, * smoothed data)
Fig. 1. c) ... 1st derivative of the cubic polynomial in the different windows in a), * estimated 1st derivative data. d) 1st derivative of the data set in a) : ... real 1st derivative, * estimated values (window size = 13, m = 6; cubic polynomial, n = 3).
It should be noted that in many cases the instrument software will perform, if desired, smoothing by
averaging of scans so that the user does not have to worry about how exactly to proceed. Often this is
then followed by applying Savitzky-Golay, which is also usually present in the software of the
instrument. If the analyst decides to carry out the smoothing with other software, then care must be
taken not to distort the signal.
Differentiation can be used to enhance spectral differences. Second derivatives remove constant and
linear background at the same time. An example is shown in figure 2-b,c. Both first and second
derivatives are used, but second derivatives seem to be applied more frequently. A possible reason for
their popularity is that they have troughs (inverse peaks) at the location of the original peaks. This is
not the case for first derivatives.
In principle, differentiation of data is obtained by using the appropriate derivative of the polynomial
used to fit the data in each window (Fig. 1-c,d). In practice, tables [18,21] or computer algorithms
[19,20] are used to obtain the coefficients ck which are used in the same way as for eqn (7).
Alternatively the differentials can be calculated from the differences in absorbance between two
wavelengths separated by a small fixed distance known as the gap.
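With evenly spaced points, the coefficient-table route described above is what `scipy.signal.savgol_filter` implements; the window size and polynomial degree below mirror the example of Fig. 1-d:

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(x)

# Smoothed first derivative: cubic polynomial (polyorder=3) fitted in a
# 13-point window; delta is the (assumed uniform) sampling step.
dy = savgol_filter(y, window_length=13, polyorder=3, deriv=1,
                   delta=x[1] - x[0])
```

Because smoothing and differentiation are done in a single convolution, this avoids the extra noise amplification of differencing raw data.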
One drawback of the use of derivatives is that they decrease the SNR by enhancing the noise. For that
reason smoothing is needed before differentiation. The higher the degree of differentiation used, the
higher the degradation of the SNR. In addition, and this is also true for smoothing data by using the
Savitzky-Golay method, it is assumed that points are obtained at uniform intervals which is not always
necessarily true. Another drawback [23] is that calibration models obtained with spectra pre-treated by
differentiation are sometimes less robust to instrumental changes, such as the wavelength shifts that
may occur over time, and these changes are then less easily corrected for.
Constant background differences can be eliminated by using offset correction. Each spectrum is
corrected by subtracting either its absorbance at the first wavelength (or other arbitrary wavelength) or
the mean value in a selected range (Fig. 2-d).
Fig. 2. NIR spectra for different wheat samples and several preprocessing methods applied to them : a) original data b) 1st. derivative c) 2nd. Derivative d) offset corrected
Fig. 2. NIR spectra for different wheat samples and several preprocessing methods applied to them : e) SNV corrected f) detrended corrected g) detrended+SNV corrected h) MSC corrected
An interesting method is the one based on contrasts as proposed by Spiegelman [24,25]. A contrast is
the difference between the absorbance at two wavelengths. The differences between the absorbances at
all pairs of wavelengths are computed and used as variables. In this way offset-corrected wavelengths
and derivatives (differences between wavelengths close to each other) are included, as well as
differences between two peak wavelengths, etc. A difficulty is that the number of contrasts equals
p(p−1)/2, which soon becomes very large : 1000 wavelengths, for example, yield 499,500 contrasts. At
the moment there is insufficient experience to evaluate this method.
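Generating all contrasts is a one-liner with an upper-triangular index; the function name is illustrative:

```python
import numpy as np

def contrasts(spectra):
    """All pairwise differences x_j - x_k (j < k): p(p-1)/2 new
    variables per spectrum, as in the contrast method described above."""
    j, k = np.triu_indices(spectra.shape[1], k=1)
    return spectra[:, j] - spectra[:, k]
```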
Other methods that can be used are based on transforms such as the Fourier transform or the wavelet
transform. Multivariate calibration using MLR on Fourier coefficients was compared with PCR (MLR
applied on scores on principal components) [26]. Methods based on the use of wavelet coefficients
have also been described [27]. One can first smooth the signal by applying Fourier or wavelet
transforms to the signal [28] and then apply MLR to the smoothed signal. MLR can also be applied
directly on the Fourier or the wavelet coefficients, which is probably a preferable approach. For NIR
this does not seem useful because the signal contains little random (white) noise, so that the simpler
techniques described above are usually considered sufficient.
3.3. Methods specific for NIR
The following methods are applied specifically to NIR data of solid samples. Variation between
individual NIR diffuse reflectance spectra is the result of three main sources :
• non-specific scatter of radiation at the surface of particles.
• variable spectral path length through the sample.
• chemical composition of the sample.
In calibration we are interested only in the last source of variance. One of the major reasons for
carrying out pre-processing of such data is to eliminate or minimise the effects of the other two sources.
For this purpose, several approaches are possible.
Multiplicative Scatter (or Signal) Correction (MSC) has been proposed [29-31]. The light scattering or
change in path length for each sample is estimated relative to that of an ideal sample. In principle this
estimation should be done on a part of the spectrum which does not contain chemical information, i.e.
a part influenced only by the light scattering. However, the areas in the spectrum that hold no chemical
information often contain the spectral background where the SNR may be poor. In practice the whole
spectrum is sometimes used. This can be done provided that chemical differences between the samples
are small. Each spectrum is then corrected so that all samples appear to have the same scatter level as
the ideal. As an estimate of the ideal sample, we can use for instance the average of the calibration set.
MSC performs best if an offset correction is carried out first. For each sample :
x_i = a + b·x̄ + e      (8)

where x_i is the NIR spectrum of the sample, and x̄ symbolises the spectrum of the ideal sample (the
mean spectrum of the calibration set). For each sample, a and b are estimated by ordinary least-squares
regression of spectrum x_i vs. spectrum x̄ over the available wavelengths. Each value x_ij of the
corrected spectrum x_i(MSC) is calculated as :

x_ij(MSC) = (x_ij − a) / b,   j = 1, 2, ..., p      (9)
The mean spectrum must be stored in order to transform future spectra in the same way (Fig. 2-h).
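Eqns (8)-(9) can be sketched as a per-spectrum linear regression against the mean spectrum:

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative Scatter Correction, eqns (8)-(9): regress each
    spectrum on the reference (by default the mean of the calibration
    set), then correct as (x_ij - a) / b."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference)
    corrected = np.empty_like(spectra)
    for i, xi in enumerate(spectra):
        b, a = np.polyfit(ref, xi, 1)    # least squares: xi ≈ a + b * ref
        corrected[i] = (xi - a) / b
    return corrected, ref                # ref must be stored for future spectra
```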
Standard Normal Variate (SNV) transformation has also been proposed for removing the multiplicative
interference of scatter and particle size [32,33]. An example is given in figure 2-a, where several
samples of wheat are measured. SNV is designed to operate on individual sample spectra. The SNV
transformation centres each spectrum and then scales it by its own standard deviation :

x_ij(SNV) = (x_ij − x̄_i) / SD,   j = 1, 2, ..., p      (10)

where x_ij is the absorbance value of spectrum i measured at wavelength j, x̄_i is the mean absorbance
value of the uncorrected ith spectrum and SD is the standard deviation of the p absorbance values of
that spectrum :

SD = √[ Σ_{j=1..p} (x_ij − x̄_i)² / (p − 1) ]
Spectra treated in this manner (Fig. 2-e) always have zero mean and unit variance, and are thus
independent of the original absorbance values.
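Eqn (10) in code, using the p−1 denominator of the SD definition above:

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate, eqn (10): centre each spectrum and
    scale by its own standard deviation (ddof=1 matches the p-1
    denominator in the text)."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    sd = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / sd
```

Unlike MSC, no reference spectrum needs to be stored: each spectrum is corrected on its own.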
De-trending of spectra accounts for the variation in baseline shift and curvilinearity of powdered or
densely packed samples by using a second degree polynomial to correct the data [32]. De-trending
operates on individual spectra. The global absorbance of NIR spectra generally increases linearly with
the wavelength, but it increases curvilinearly for the spectra of densely packed samples.
A second-degree polynomial can be used to standardise the variation in curvilinearity :
x_i = a + b·λ* + c·λ*² + e_i      (11)

where x_i symbolises the individual NIR spectrum and λ* the wavelength. For each sample, a, b and c
are estimated by ordinary least-squares regression of spectrum x_i vs. wavelength over the range of
wavelengths. The corrected spectrum x_i(DTR) is calculated by :

x_i(DTR) = x_i − a − b·λ* − c·λ*² = e_i      (12)
Normally de-trending is used after SNV transformation (Fig. 2-f,g). Second derivatives can also be
employed to decrease baseline shifts and curvilinearity, but in this case noise and complexity of the
spectra increases.
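Eqns (11)-(12) amount to a per-spectrum quadratic fit against wavelength, keeping the residuals; a minimal sketch:

```python
import numpy as np

def detrend(spectra, wavelengths):
    """De-trending, eqns (11)-(12): fit a second-degree polynomial of
    absorbance vs. wavelength for each spectrum and keep the residuals."""
    spectra = np.asarray(spectra, dtype=float)
    out = np.empty_like(spectra)
    for i, xi in enumerate(spectra):
        c, b, a = np.polyfit(wavelengths, xi, 2)   # xi ≈ a + b*λ + c*λ²
        out[i] = xi - (a + b * wavelengths + c * wavelengths ** 2)
    return out
```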
It has been demonstrated that MSC and SNV transformed spectra are closely related and that the
difference in prediction ability between these methods seems to be fairly small [34,35].
3.4. Selection of pre-processing methods in NIR
The best pre-processing method will be the one that finally produces a robust model with the best
predictive ability. Unfortunately there seem to be no hard rules to decide which pre-processing to use
and often the only approach is trial and error. The development of a methodology that would allow a
systematic approach would be very useful. It is possible to obtain some indication during pre-
processing. For instance, if replicate spectra have been measured, a good pre-processing methodology
will produce minimum differences between replicates [36] though this does not necessarily lead to
optimal predictive value. If only one measurement per sample is available, it can be useful to compute the
correlation between each of the original variables and the property of interest and do the same for the
transformed variables (Fig. 3). It is likely that good correlations will lead to a good prediction.
However, this approach is univariate and therefore does not give a complete picture of predictive
ability. Depending on the physical state of the samples and the trend of the spectra, a background
and/or a scatter correction can be applied. If only background correction is required, offset correction is
usually preferable over differentiation, because with the former the SNR is not degraded and because
differentiation may lead to less robust models over time. If additionally scatter correction is required,
SNV and MSC yield very similar results. An advantage of SNV is that spectra are treated individually,
while in MSC one needs to refer to other spectra. When a change is made in the model, e.g. if, because
of clustering, it is decided to make two local models instead of one global one, it may be necessary to
repeat the MSC pre-processing. Non-linear behaviour between X and y may appear (or increase) after
some of the pre-processing methods, for instance SNV. However, this does not cause problems
provided the differences between spectra are relatively small.
Fig. 3. Correlation coefficients between (corrected) absorbance and moisture content for spectra in fig. 2. : a) original data b) 1st. derivative c) 2nd. Derivative d) offset corrected
Fig. 3. Correlation coefficients between (corrected) absorbance and moisture content for spectra in fig. 2. : e) SNV corrected f) detrended corrected g) detrended+SNV corrected h) MSC corrected
4. Data matrix pre-treatment
Before MLR is performed, some scaling techniques can be used. The most popular pre-treatment,
which is nearly always used for spectroscopic data sets, is column-centering. In the x-matrix, by
convention, each column represents a wavelength and column-centering is thus an operation which is
carried out for each wavelength over all objects in the calibration set. It consists of subtracting, for each
column, the mean of the column from the individual elements of this column, resulting in a zero mean
of the transformed variables and eliminating the need for a constant term in the regression model. The
effect of column-centering on prediction in multivariate calibration was studied in [37]. It was
concluded that if the optimal number of variables/factors decreases upon centering, a model should be
made with mean-centered data. Otherwise, a model should be made with the raw data. Because this
cannot be known in advance, it seems reasonable to consider column-centering as a standard operation.
For spectroscopic data it is usually the only pre-treatment performed, although sometimes autoscaling
(also known as column standardisation) is also employed. In this case, each element of a column-
centered table is divided by its corresponding column standard deviation, so that all columns have a
variance of one. This type of scaling can be applied in order to obtain an idea about the relative
importance of the variables [38], but it is not recommended for general use in spectroscopic
multivariate calibration since it unduly inflates the noise in baseline regions.
After pre-treatment, the mean (and the standard deviation for autoscaled data) of the calibration set
must be stored in order to transform future samples, for which the concentration or other characteristic
must be predicted, using the same values.
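The pre-treatment and its later re-application can be sketched as a pair of functions; the names are illustrative:

```python
import numpy as np

def fit_scaler(X_cal, autoscale=False):
    """Column-centre (and optionally autoscale) using calibration-set
    statistics; return them so that future spectra can be transformed
    with the same values."""
    mean = X_cal.mean(axis=0)
    std = X_cal.std(axis=0, ddof=1) if autoscale else np.ones(X_cal.shape[1])
    return mean, std

def apply_scaler(X, mean, std):
    """Transform any matrix (calibration, test or future samples)
    with the stored calibration statistics."""
    return (np.asarray(X) - mean) / std
```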
5. Graphical information
Certain plots should always be made. One of these is to simply plot all spectra on the same graph (Fig.
2). Evident outliers will become apparent. It is also possible to identify noisy regions and perhaps to
exclude them from the model.
Another plot that one should always make is the Principal Component Analysis (PCA) score plot.
Many books and papers are devoted to PCA [39-41]. PCA is not a new method, and was first described
by Pearson in 1901 [42] and by Hotelling in 1933 [43]. Let us suppose that n samples (objects) have
been spectroscopically measured at p wavelengths (variables). This information can be written in
matrix form as :

X = [ x11  x12  ...  x1p
      x21  x22  ...  x2p
      ...
      xn1  xn2  ...  xnp ]      (13)
where x1 = [x11 x12 ...x1p] is the row vector containing the absorbances measured at p wavelengths (the
spectrum) for the first sample, x2 is the row vector containing the spectrum for the second sample and
so on. We will assume that the reader is more or less familiar with PCA and that, as is usual in PCA in
the context of multivariate calibration, the x-matrix was column-centered (see chapter 4). PCA creates
new orthogonal variables (latent variables) that are linear combinations of the original x-variables. This
can be achieved by the method known as singular value decomposition (SVD) of X :

X(n×p) = U(n×p) Λ(p×p) P'(p×p) = T(n×p) P'(p×p)      (14)
U is the unweighted (normalised) score matrix and T is the weighted (unnormalised) score matrix.
They contain the new variables for the n objects. We can say that they represent the new co-ordinates
for the n objects in the new co-ordinate system. P is the loading matrix and the column vectors of P are
called eigenvectors or loading-PCs. The elements of P are the loadings (weights) of the original
variables on each eigenvector. High loadings for certain original variables on a particular eigenvector
mean that these variables are important in the construction of the new variable or score on that
principal component (PC).
Two main advantages arise from this decomposition. The first one is that the new variables are
orthogonal (U'U = I). This has very important implications in PCR, in particular in the MLR step of the
method [6] when variables are correlated. Moreover, we assume that the first new variables or PCs,
accounting for the majority of the variance of the original data, contain meaningful information, while
the last ones, which account for a small amount of variance, only contain noise and can be deleted.
Since PCA produces new variables, such that the highest amount of variance is explained by the first
eigenvectors, the score plots can be used to give a good representation of the data. By using a small
number of score plots (e.g. t1-t2, t1-t3, t2-t3), useful visual information can be obtained about the data
distribution, inhomogeneities, presence of clusters or outliers, etc. We recommend carrying this out
both on the centered raw data and on the data after the signal pre-processing chosen in step 3. Plots of the
loadings (contribution of the original variables in the new ones) identify spectral regions that are
important in describing the data and those which contain mainly noise, etc. However, the loadings plots
should be used only as an indication when it comes to selecting useful variables.
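Eqn (14) and the score/loading plots it supports can be sketched with a plain SVD:

```python
import numpy as np

def pca(X):
    """PCA via singular value decomposition of the column-centred
    matrix, eqn (14): X = U Λ P' = T P'. Returns the weighted scores T
    and the loading matrix P (columns = eigenvectors)."""
    Xc = X - X.mean(axis=0)
    U, s, Pt = np.linalg.svd(Xc, full_matrices=False)
    T = U * s                      # weighted (unnormalised) scores
    return T, Pt.T
```

A t1-t2 score plot is then simply a scatter plot of `T[:, 0]` against `T[:, 1]`, and the columns of `P` give the loading plots.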
6. Clustering tendency
Clusters are groups of similar objects inside a population. When the population of objects is separated
into several clusters, it is not homogeneous. To perform multivariate calibration modelling, the
calibration objects should preferably belong to the same population. Often this is not possible, e.g. in
the analysis of industrial samples, when these samples belong to different quality grades. The
occurrence of clusters may indicate that the objects belong to different populations. This suggests there
is a fundamental difference between two or more groups of samples, e.g. two different products are
included in the analysis, or a shift or drift has occurred in the measurement technique. When clustering
occurs, the reason must be investigated and appropriate action should be taken. If the clustering is not
due to instrumental reasons that may be corrected (e.g. two sets of samples were measured at different
times and instrumental changes have occurred), then there are two possibilities : to split the data into
groups and make a separate model for each cluster, or to keep all of them in the same calibration
model.
The advantages of splitting the data are that one obtains more homogeneous populations and therefore,
one hopes, better models. However, it also has disadvantages. There will be fewer calibration objects for
each model and it is also considerably less practical since it is necessary to optimise and validate two or
more models instead of one. When a new sample is predicted, one must first determine to which cluster
it belongs before one can start the actual prediction. Another disadvantage is that the range of y-values
can be reduced, leading to less stable models. For that reason, it is usually preferable to make a single
model. The price one pays in doing this is a more complex and therefore potentially less robust model.
Indeed, the model will contain two types of variables, variables that contain information common to the
two clusters and therefore have similar importance for both, and variables that correct for the bias
between the two clusters. Variables belonging to the second type are often due to peaks in the spectrum
that are present in the objects belonging to one cluster and absent or much weaker in the other objects.
An example where two clusters occur is presented in [44]. Some of the variables selected are directly
related with the property to be measured in both clusters, whereas others are related to the presence or
absence of one peak. This peak is due to a difference in chemical structure and is responsible for the
clustering. The inclusion of the latter variables takes into account this difference and improves the
predictive ability of the model, but also increases the complexity.
Clustering techniques have been exhaustively studied (see a review of methods in [45]). Their results
can for example be presented as dendrograms. However, in multivariate calibration model
development, we are less interested in the actual detailed clustering, but rather in deciding whether
significant clusters actually occur. For this reason there is little value in carrying out clustering: we
merely want to be sure that we will be aware of significant clustering if it occurs.
The presence of clusters may be due to the y-variable. If the y-values are available at this step, they can
be assessed with a simple plot. If the distribution is distinctly bimodal, then there are two clusters in y,
which should be reflected by two clusters in X. If y-clustering occurs, one should investigate the reason
for it. If objects with y-values intermediate between the two clusters are available, they should be added
to the calibration and test sets. If this is not the case, and the clustering is very strong (Fig. 4), one
should realise that the model will be dominated by the differences between the clusters rather than by
the differences within clusters. It might then be better to make models for each cluster, or instead of
MLR to use a method that is designed to work with very heterogeneous data such as locally weighted
regression (LWR) [31,46].
Fig. 4. An example of strongly clustered data.
The simplest way to detect clustering in the x-data is to apply PCA and to look at the score plots. In
some cases, the clustering will become apparent only in plots of higher PCs so that one must always
look at several score plots. For this reason, a method such as the one proposed by Szcubialka et al [47]
may have advantages. In this method, the distances between an object and all other objects are
computed, ranked and plotted. This is done for each of the objects. The graph obtained is then
compared with the distances computed in the same way for objects belonging to a normal or to a
homogeneous distribution. A simple example is shown in figure 5 where the distance curves for a
clustered situation are compared with that for a homogeneous distribution of the samples.
Fig. 5. a) Plot of two hundred objects normally distributed in two variables x1 and x2. b) Distance
curves of the two hundred normally distributed objects.
Fig. 5. c) Clustered data, normally distributed within each cluster. d) Distance curves of the clustered
data.
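The distance-curve method of Szcubialka et al. reduces to computing, for each object, the ranked distances to all other objects; a minimal sketch:

```python
import numpy as np

def distance_curves(X):
    """For each object, the sorted distances to all other objects:
    clustered data give curves with a step, a homogeneous set gives
    smooth curves (compare Fig. 5-b and 5-d)."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))   # n x n Euclidean distance matrix
    return np.sort(d, axis=1)[:, 1:]        # drop each zero self-distance
```

Plotting each row of the result against its rank yields the curves of Fig. 5, to be compared with those of a homogeneous reference distribution.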
If a numerical indicator is preferred, the Hopkins index for clustering tendency (Hind) can be applied.
This statistic examines whether objects in a data set differ significantly from the assumption that they
are uniformly distributed in the multidimensional space [15,48,49]. It compares the distances wi
between the real objects and their nearest neighbours to the distances qi between artificial objects,
uniformly generated over the data space, and their nearest real neighbours. The process is repeated
several times for a fraction of the total population. After that, the Hind statistic is computed as :
Hind = Σᵢ qᵢ / (Σᵢ qᵢ + Σᵢ wᵢ)      (15)
If objects are uniformly distributed, qi and wi will be similar, and the statistic will be close to 1/2. If
clusters are present, the distances for the artificial objects will be larger than for the real ones, because
the artificial objects are homogeneously distributed whereas the real ones are grouped together, and
the value of Hind will increase. A value of Hind higher than 3/4 indicates a clustering tendency at the
90% confidence level [49]. Figures 6-a and 6-b show the application of the Hopkins' statistic, i.e. how
the qi- and wi-values are computed for two different data sets, the first unclustered and the second
clustered. Because the artificial data set is homogeneously generated inside a square box that covers all
the real objects and with co-ordinates determined by the most extreme points, an unclustered data set
lying on the diagonal of the reference axis (Fig. 6-c) might lead to a false detection of clustering [50].
For this reason, the statistic should be determined on the PCA scores. After PCA of the data, the new
axis will lie in the direction of maximum variance, in this case coincident with the main diagonal (Fig.
6-d). Since an outlier in the X-space is effectively a cluster, the Hopkins statistic could detect a false
clustering tendency in this example. A modification of the original statistic has been proposed in [49]
to minimise false positives. Further modifications were proposed by Forina et al [50].
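Eqn (15) can be sketched as follows; the function and the choice of sampled fraction are illustrative, and as noted above it is safer to compute the statistic on PCA scores:

```python
import numpy as np

def hopkins(X, m=None, seed=0):
    """Hopkins statistic, eqn (15): close to 1/2 for uniformly
    distributed data, approaching 1 when clusters are present.
    A minimal sketch; m defaults to a tenth of the population."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = m or max(1, n // 10)
    real = X[rng.choice(n, size=m, replace=False)]
    artificial = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, p))

    def nearest(points, exclude_self):
        d = np.sqrt(((points[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
        if exclude_self:
            d[d == 0.0] = np.inf               # skip each point's own copy
        return d.min(axis=1)

    w = nearest(real, exclude_self=True)        # real -> nearest real neighbour
    q = nearest(artificial, exclude_self=False) # artificial -> nearest real
    return q.sum() / (q.sum() + w.sum())
```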
Fig. 6. Hopkins statistic applied to two different data sets. Open circles represent real objects, closed circles selected real objects and asterisks artificial objects generated over the data space. a) H value = 0.49. b) H value = 0.73.
Fig. 6. Hopkins statistic applied to two different data sets. Open circles represent real objects, closed circles selected real objects and asterisks artificial objects generated over the data space. c) H value = 0.69. d) H value = 0.56 (the same data set as in c, after PCA rotation).
Clusters can become more obvious upon data pre-treatment. For instance, a cluster which is not visible
in the raw data may become more apparent after applying SNV. Consequently, it is better to investigate
clustering on the pre-treated data that will be used for modelling.
7. Detection of extreme samples
MLR is a least squares based method, and for this reason is sensitive to the presence of outliers. We
distinguish between two types of outliers : outliers in the x-space and outliers towards the model.
Moreover we can consider outliers in the y-space. The difference is shown in figure 7. Outliers in the x-
space are points lying far away from the rest when looking at the x-values only. This means we do not
use knowledge about the relationship between X and y. Outliers towards the model are those that
present a different relationship between X and y, or in other words, samples that do not fit the model.
An object can also be an outlier in y, i.e. can present extreme values of the concentration to be
modelled. If an object is extreme in y, it is probably also extreme in X.
Fig. 7. Illustration of the different kinds of outliers : (*1) outlier in X and outlier towards the model (*2) outlier in y and towards the model (*3) outlier towards the model (*4) outlier in X and y
At this stage of the process, we have not developed the model and therefore cannot identify outliers
towards the model. However, we can already look for outliers in X and in y separately. Detection of
outliers in y is a univariate problem that can be handled with the usual univariate tests such as the
Grubbs [51,52,15] or the Dixon [5,15] test. Outliers in X are multivariate and therefore represent a
more challenging problem. Our strategy will be to identify the extreme objects in X, i.e. identify
objects with extreme characteristics, and apply a test to decide whether they should be considered
outliers or not. Once the outliers have been identified, we must decide whether we eliminate them or
simply flag them for examination after the model is developed so that we can look at outliers towards
the model. In taking the decision, it may be useful to investigate whether the same object is an outlier
in both y and X. If an object is outlying in concentration (y) but is not extreme in its spectral
characteristics (X), then it will probably prove to be an outlier towards the model at a later stage (chapter 13), and it will be necessary at the minimum to make models with and without the object. A decision to
eliminate the object at this stage may save work.
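The univariate screening of y mentioned above can be sketched in a few lines of Python (a generic sketch of the single-outlier Grubbs statistic; the critical values against which G must be compared come from a Grubbs table and are not reproduced here):

```python
import numpy as np

def grubbs_statistic(y):
    """Single-outlier Grubbs statistic: G = max |y_i - mean(y)| / s,
    with s the standard deviation of the y values. G is compared to a
    tabulated critical value at the chosen significance level."""
    y = np.asarray(y, dtype=float)
    dev = np.abs(y - y.mean())
    g = dev.max() / y.std(ddof=1)
    return g, int(dev.argmax())

# Example: one concentration value far from the rest
g, idx = grubbs_statistic([4.1, 4.3, 4.2, 4.4, 9.8])
```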
Extreme samples in the x-space can be due to measurement or handling errors, in which case they
should be eliminated. They can also be due to the presence of samples that belong to another
population, to impurities in one sample that are not present in the other samples, or to a sample with
extreme amounts of constituents (i.e. with very high or low quantity of analyte). In these cases it may
be appropriate to include the sample in the model, as it represents a composition that could be
encountered during the prediction stage. We therefore have to investigate why the outlier presents
extreme behaviour, and at this stage it can be discarded only if it can be shown to be of no value to the
model or detrimental to it. We should be aware, however, that extreme samples will always have a larger influence on the model than other samples.
Extreme samples in the x-space will probably have extreme values on some variables that will have an
extreme (and possibly deleterious) effect in the regression. The extreme behaviour of an object i in the
x-space can be measured by using the leverage value. This measure is closely related to the Mahalanobis distance (MD) [53,54], and can be seen as a measure of the distance of the object to the centroid of the data. Points close to the centre provide less information for building the model than extreme points. However, outliers in the extremes are more dangerous than those close to the centre.
High leverage points are called bad high leverage points if they are outliers towards the model. If they fit the true model they will stabilise the model and make it more precise; they are then called good high leverage points. However, at this stage we will rarely be able to distinguish between good and bad leverage.
In the original space, leverage values are computed as :
H = X(X'X)-1 X' (16)
H is called the hat matrix. The diagonal elements of H, hii, are the leverage values for the different
objects i. If there are more variables than objects, as is probable for spectroscopic data, X'X cannot be
inverted. The leverage can then be computed in the PC space. There are two ways to compute the
leverage of an object i in the PC-space. The first one is given by the equation :
h_i = Σ_{j=1..a} t_ij² / λ_j² (17)

h_i = 1/n + Σ_{j=1..a} t_ij² / λ_j² (18)

a being the minimum of n and p, t_ij the score of object i on PC_j, and λ_j² the eigenvalue of PC_j. The correction by the value 1/n in eqn (18) is used if column centered data are employed, as is usual in PCA. Then a = min(n-1, p).
The leverage values can also be obtained by applying an equation equivalent to eqn (16) :
H = T(T' T)-1 T' (19)
where T is the matrix with the weighted (unnormalised) scores obtained after PCA of X.
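As an illustration, the equivalence between the sum-of-scores formula and the hat matrix of eqn (19) can be checked numerically (a sketch with simulated data; the weighted scores T = U·S and the eigenvalues λ_j² = S_j² are taken from an SVD of the centered matrix):

```python
import numpy as np

# Simulated "spectral" data: more variables (p) than objects (n)
rng = np.random.default_rng(0)
n, p = 10, 50
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                    # column centering, as usual in PCA

# PCA via SVD: Xc = U S V'; weighted (unnormalised) scores T = U S
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
a = min(n - 1, p)                          # a = min(n-1, p) for centered data
T = (U * S)[:, :a]
S = S[:a]

# eqn (19): H = T (T'T)^-1 T'
H = T @ np.linalg.inv(T.T @ T) @ T.T

# eqn (18): h_i = 1/n + sum_j t_ij^2 / lambda_j^2, with lambda_j^2 = S_j^2
h = 1.0 / n + (T**2 / S**2).sum(axis=1)
```

The diagonal of H reproduces the sum in eqn (18) up to the 1/n centering correction, and its trace equals the number of PCs used.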
Instead of using all the PCs, one can apply only the significant ones. Suppose that r PCs have been
selected to be significant, for instance based on the total percentage of variance they explain [8]. The
total leverage can then be decomposed into contributions due to the significant eigenvectors and the non-significant ones [53] :
h_i = Σ_{j=1..a} t_ij² / λ_j² = Σ_{j=1..r} t_ij² / λ_j² + Σ_{j=r+1..a} t_ij² / λ_j² = h_i1 + h_i2 (20)
For centered data the same correction with 1/n as in eqn (18) is applied. h_i1 can also be obtained by using eqn (19) with T being the matrix with the weighted scores from PC1 to PCr. Because we are only interested in the first r PCs, h_i1 seems a more natural leverage concept than h_i, and complications caused by including noisy PCs are avoided.
The value r/n ((r + 1)/n for centered data) is called the average partial leverage. If the leverage of an extreme object exceeds it by a certain factor, the object is considered to be an outlier. As outlier detection limit one can then set, for example, h_i1 > (constant × r/n), where the constant often equals 2.
The leverage is related to the squared Mahalanobis distance of object i to the centre of the calibration
data. One can compute the squared Mahalanobis distance from the covariance matrix, C :
MD_i² = (x_i - x̄)' C-1 (x_i - x̄) = (n-1)(h_i - 1/n) (21)

where C is computed as

C = X'X / (n-1) (22)

X being as usual the mean-centered data matrix.
In the same way as the leverage, when the number of variables exceeds the number of objects, C
becomes singular and cannot be inverted. There are also two ways to calculate the Mahalanobis
distance in the PC space, using either all a PCs or using only the r significant ones :
MD_i² = (n-1) Σ_{j=1..a} t_ij² / λ_j² = (n-1)(h_i - 1/n) (23)

MD_i² = (n-1) Σ_{j=1..r} t_ij² / λ_j² = (n-1)(h_i1 - 1/n) (24)

where h_i and h_i1 are computed using the centered data.
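The relation between leverage and the squared Mahalanobis distance is easy to verify numerically when C is invertible, i.e. with fewer variables than objects (a sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4                                 # more objects than variables
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                      # mean-centered data matrix

C = Xc.T @ Xc / (n - 1)                      # eqn (22)
Ci = np.linalg.inv(C)
MD2 = np.array([xi @ Ci @ xi for xi in Xc])  # eqn (21): (x_i - mean)' C^-1 (x_i - mean)

# Leverage including the 1/n correction for centered data
H = Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T
h = np.diag(H) + 1.0 / n
```

The identity MD_i² = (n-1)(h_i - 1/n) then holds to machine precision.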
X-space outlier detection can also be performed in the PC space with Rao's statistic [55]. Rao's statistic
sums all the variation from a certain PC on. If there are a PCs, and we start looking at variation from
PCr on, then :
D_i² = Σ_{j=r+1..a} t_ij² (25)
A high value for D_i² means that object i shows a high score on some of the PCs that were not included and therefore cannot be explained completely by r PCs. For this reason it is then suspected to be an outlier. The method is presented here because it uses only information about X. The way in which Rao's statistic is normally used requires the number of PCs entered in the model. This number is put equal to r. To estimate this number of PCs, one can follow the D² value as a function of r, starting from r = 0. High values of r indicate that the object is modelled correctly only when higher PCs are included. If the number of necessary PCs is higher for this object than for the others, it will be an outlier. A test can be applied for checking the significance of high values of Rao's statistic by using these values as input data for the single-outlier Grubbs' test [15] :
z = (D_test² - D̄²) / s(D²) (26)

where D̄² is the mean and s(D²) the standard deviation of the n D_i² values.
Because the information provided by each of these methods is not necessarily the same, we recommend
that more than one is used, for example by studying both leverage values and Rao's statistic with
Grubbs' test, in order to check if the same objects are detected.
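Rao's statistic and the subsequent Grubbs screening can be sketched as follows (simulated data; r = 3 is an arbitrary assumption for the number of significant PCs):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 100))
Xc = X - X.mean(axis=0)

# Weighted scores from an SVD of the centered data
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
T = U * S

r = 3                                   # assumed number of significant PCs
D2 = (T[:, r:]**2).sum(axis=1)          # eqn (25): variation beyond PC r

# Grubbs screening of the D^2 values: largest standardised deviation
z = (D2.max() - D2.mean()) / D2.std(ddof=1)
suspect = int(D2.argmax())              # object flagged for inspection
```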
Unfortunately, outlier detection is not easy. This certainly is the case if more than one outlier is present.
In that case all the above methods are subject to what is called masking and swamping. Masking occurs
when an outlier goes undetected because of the presence of another, usually adjacent, one. Swamping
occurs when good observations are incorrectly identified as outliers because of the presence of another,
usually remote, subset of outliers (Fig. 8). Masking and swamping occur because the mean and the
covariance matrix are not robust to outliers.
Fig. 8. Due to the remote set of outliers (4 upper objects), there is a swamping effect on outlier (*).
Robust methods have been described [56]. Probably the best way to avoid the lack of robustness of the leverage measures is to use the Minimum Volume Ellipsoid (MVE) estimator, defined as the minimum volume ellipsoid covering at least (n/2)+1 points of X. It can be understood as the selection of a subset of objects without outliers in it : a clean subset. In this way, one avoids the measured leverage being affected by the outliers. In fact, in eqn (21) all objects, the outliers included, are used, so that the outliers influence the criterion that will be used to determine whether an object is an outlier. For instance, when an outlier is included in a set of data, it influences the mean value of the variables characterising that set. With the MVE, the densest domain in the x-space including a given amount of samples is selected. This domain does not include the possible outliers, so that they do not influence the criteria.
An algorithm to find the MVE is given in [57-60]. The leverage measures based on this subset are not
affected by the masking and swamping effects. A simulation study showed that in more than 90% of
the cases the proposed algorithm led to the correct identification of x-space outliers, without masked or
swamped observations [60]. For this reason, MVE probably is the best methodology to use, but it
should be noted that there is little practical experience in its application. To apply the algorithm, the
number of objects in the data set must be at least three times higher than the number of selected latent
variables.
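The MVE idea can be sketched with a crude resampling approximation (an illustration only, not the published algorithm of [57-60]: many elemental subsets are drawn, each subset's ellipsoid is inflated until it covers h = n//2 + 1 points, and the smallest-volume one is kept):

```python
import numpy as np

def mve_distances(X, n_trials=300, seed=0):
    """Robust squared Mahalanobis distances from a crude random-subsampling
    approximation of the Minimum Volume Ellipsoid."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = n // 2 + 1
    best = None
    for _ in range(n_trials):
        idx = rng.choice(n, size=p + 1, replace=False)     # elemental subset
        m = X[idx].mean(axis=0)
        C = np.cov(X[idx], rowvar=False)
        try:
            Ci = np.linalg.inv(C)
        except np.linalg.LinAlgError:
            continue                                       # degenerate subset
        d2 = np.array([(x - m) @ Ci @ (x - m) for x in X])
        scale = np.sort(d2)[h - 1]                         # inflate to cover h points
        vol = np.sqrt(abs(np.linalg.det(C))) * scale ** (p / 2)
        if best is None or vol < best[0]:
            best = (vol, m, Ci / scale)
    _, m, Ci = best
    return np.array([(x - m) @ Ci @ (x - m) for x in X])
```

With a clean cluster plus a remote group of outliers, the outliers keep large robust distances instead of masking one another.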
A method of an entirely different type is the potential method proposed by Jouan-Rimbaud et al. [61].
Potential methods first create so-called potential functions around each individual object. Then these
functions are summed (Fig. 9). In dense zones, large potentials are created, while the potential of
outliers does not add to that of other objects and can therefore be detected in that way. An advantage is
that special objects within the x-domain are also detected, for instance, an isolated object between two
clusters. Such objects (we call them inliers) can in certain circumstances have the same effect as
outliers. A disadvantage is that the width of the potential functions around each object has to be
adjusted. It cannot be too small, because many objects would then be isolated; it cannot be too large
because all objects would be part of one global potential function. Moreover, while the method does
very well in flagging the more extreme objects, a decision on their rejection cannot be taken easily.
Fig. 9. Adapted from D. Bouveresse, doctoral thesis (1997), Vrije Universiteit Brussel; contour plot corresponding to k=4 with the 10% percentile method and with (*) the identified inlier.
8. Selection and representativity of the calibration sample subset
Because the model has to be used for the prediction of new samples, all possible sources of variation that can be encountered later must be included in the calibration set. This means that the chemical components present in the samples must be included in the calibration set; with a range of variation in concentration at least as wide, and preferably wider than the one expected for the samples to be
analysed; that sources of variation such as different origins or different batches are included and
possible physical variations (e.g. different temperatures, different densities) among samples are also
covered.
In addition, it is evident that the higher the number of samples in the calibration set, the lower the
prediction error [62]. In this sense, a selection of samples from a larger set is contra-indicated.
However, while a random selection of samples may approach a normal distribution, a selection
procedure that selects samples more or less equally distributed over the calibration space will lead to a
flat distribution. For an equal number of samples, such a distribution is more favourable from a
regression point of view than the normal distribution, so that the loss of predictive quality may be less
than expected by looking only at the reduction of the number of samples [63]. Also, from an
experimental point of view, there is a practical limit on what is possible. While the NIR analysis is
often simple and not costly, this cannot usually be said for the reference method. It is therefore
necessary to achieve a compromise between the number of samples to be analysed and the prediction
error that can be reached. It is advisable to spend some of the resources available in obtaining at least
some replicates, in order to provide information about the precision of the model (chapter 2).
When it is possible to artificially generate a number of samples, experimental design can and should be
used to decide on the composition of the calibration samples [1]. When analysing tablets, for instance, one can make tablets with varying concentrations of the components and compression forces, according to an experimental design. Even then, it is advisable to include samples from the process itself to make sure that unexpected sources of variation are included. In the tablet example, it is for instance unlikely that the tablets for the experimental design would be made with the same tablet press as those from the production process, and this can have an effect on the NIR spectrum [64].
In most cases only real samples are available, so that an experimental design is not possible. This is the
case for the analysis of natural products and for most samples coming from an industrial production
process. One question then arises: how to select the calibration samples so that they are representative of the group.
When many samples are available, we can first measure their spectra and select a representative set that
covers the calibration space (x-space) as well as possible. Normally such a set should also represent the y-space well; this should preferably be verified. The chemical analysis with the reference method,
which is often the most expensive step, can then be restricted to the selected samples.
Several approaches are available for selecting representative calibration samples. The simplest is
random selection, but it is open to the possibility that some source of variation will be lost. These are
often represented by samples that are less common and have little probability of being selected. A
second possibility is based on knowledge about the problem. If one is confident that we are aware of all
the sources of variation, samples can be selected on the basis of that knowledge. However, this
situation is rare and it is very possible that some source of variation will be forgotten.
One algorithm that can be used for the selection is based on the D-optimal concept [65,66]. The D-optimal criterion minimises the variance of the regression coefficients. It can be shown that this is equivalent to maximising the determinant of the variance-covariance matrix, i.e. selecting samples such that the variance is maximised and the correlation minimised. The criterion comes from multivariate regression and experimental
design. In our context, the variance maximisation leads to selection of samples with relatively extreme
characteristics and located on the borders of the calibration domain.
Kennard and Stone proposed a sequential method that should cover the experimental region uniformly
and that was meant for the use in experimental design [67]. The procedure consists of selecting as the
next sample (candidate object) the one that is most distant from the already selected objects (calibration objects). The distance is usually the Euclidean distance, although it is possible, and probably better, to use the Mahalanobis distance. The distances are usually calculated in the PC space since spectroscopic data tend to generate a high number of variables. As starting points we either select
the two objects that are most distant from each other, or preferably, the one closest to the mean. From
all the candidate points, the one is selected that is furthest from those already selected, and added to the set of calibration points. To do this, we measure the distance d(i, i0) from each candidate point i0 to each already selected point i and determine the smallest of these, min_i d(i, i0). From these minimum distances we then select the largest : d_selected = max_{i0} ( min_i d(i, i0) ). In the absence of strong
irregularities in the factor space, the procedure starts first selecting a set of points close to those
selected by the D-optimality method, i.e. on the borderline of the data set (plus the center point, if this
is chosen as the starting point). It then proceeds to fill up the calibration space. Kennard and Stone
called their procedure a uniform mapping algorithm; it yields a flat distribution of the data which, as
explained earlier, is preferable for a regression model.
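The maximin step of the Kennard and Stone procedure is compact to write down (a sketch using Euclidean distances and the point closest to the mean as starting point):

```python
import numpy as np

def kennard_stone(X, k):
    """Kennard-Stone selection: start from the object closest to the mean,
    then repeatedly add the candidate whose smallest distance to the
    already selected objects is largest (the maximin rule)."""
    X = np.asarray(X, dtype=float)
    selected = [int(np.linalg.norm(X - X.mean(axis=0), axis=1).argmin())]
    d_min = np.linalg.norm(X - X[selected[0]], axis=1)  # distance to nearest selected
    while len(selected) < k:
        nxt = int(d_min.argmax())
        selected.append(nxt)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```

On a small one-dimensional example the extremes are picked first, after the centre point: for the objects 0, 1, 2, 3, 10 the first three selections are the object at 3 (closest to the mean), then 10, then 0.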
Næs proposed a procedure based on cluster analysis. The clustering is continued until the number of
clusters matches the number of calibration samples desired [68]. From each cluster, the object that is
furthest away from the mean is selected. In this way the extremes are covered but not necessarily the
centre of the data.
In the method proposed by Puchwein [69], the first step consists in sorting the samples according to the
Mahalanobis distances to the centre of the set and selecting the most extreme point. A limiting distance
is then chosen and all the samples that are closer to the selected point than this distance are excluded.
The sample that is most extreme among the remaining points is selected and the procedure repeated,
choosing the most distant remaining point, until there are no data points left. The number of selected
points depends on the size of the limiting distance: if it is small, many points will be included; if it is
large, very few. The procedure must therefore be repeated several times for different limiting distances
until the limiting distance is reached for which the desired number of samples is selected.
Figure 10 shows the results of applying these four algorithms to a 2-dimensional data set of 250
objects, designed not to be homogeneous. Clearly, the D-optimal design selects points in a completely
different way from the other algorithms. The Kennard-Stone and Puchwein algorithms provide similar
results. Næs' method does not cover the centre. Other methods have been proposed, such as "unique-sample selection" [70]. The results obtained seem similar to those obtained from the previously cited methods.
An important question is how many samples must be included in the calibration set. This value must be selected by the analyst. This number is related to the final complexity of the model. The term complexity should be understood as the number of variables or PCs included plus the number of quadratic and interaction terms. An ASTM standard states that, if the complexity is smaller than three, at least 24 samples must be used. If it is equal to or greater than four, at least 6 objects per degree of complexity are needed [58,71].
Fig. 10. The first 24 points selected using different algorithms : a) D-optimal design (optimal design with the three points denoted by closed circles) b) Puchwein method c) Kennard & Stone method (closest point to the mean included) d) Naes clustering method e) DUPLEX method with (o) the calibration set and (*) the test set
In Chapter 11 we state that the model optimisation (validation) step requires that different independent
sub-sets are created. Two sub-sets are often needed. At first sight, we might use one of the selection
algorithms described above to split up the calibration set for this purpose. However, because of the sample selection step, the sub-sets would no longer be independent unless random selection is applied. Validation in such circumstances might lead us to underestimate prediction errors [72]. A selection
method which appears to overcome this drawback is a modification by Snee of the Kennard-Stone
method, called the DUPLEX method [73]. In the first step, the two points which are furthest away from
each other are selected for the calibration set. From the remaining points, the two objects which are
furthest away from each other are included in the test set. In the third step, the remaining point which is
furthest away from the two previously selected for the calibration set is included in that set. The
procedure is repeated selecting a single point for the test set which is furthest from the existing points
in that set. Following the same procedure, points are added alternately to each set. This approach
selects representative calibration and test data sets of equal size. In figure 10 the result of applying the
DUPLEX method is also presented.
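A minimal sketch of the DUPLEX split described above (Euclidean distances; for an odd number of objects the calibration set receives the extra point, an implementation choice not specified in the text):

```python
import numpy as np

def duplex(X):
    """DUPLEX split (Snee): seed each set with a furthest-apart pair,
    then alternately add the remaining point furthest from the set grown."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    remaining = set(range(n))

    def furthest_pair(idx):
        idx = list(idx)
        Xi = X[idx]
        D = np.linalg.norm(Xi[:, None, :] - Xi[None, :, :], axis=-1)
        i, j = np.unravel_index(int(D.argmax()), D.shape)
        return idx[i], idx[j]

    cal = list(furthest_pair(remaining)); remaining -= set(cal)
    tst = list(furthest_pair(remaining)); remaining -= set(tst)
    grow, turn = [cal, tst], 0
    while remaining:
        rem = list(remaining)
        target = grow[turn]
        # distance of each remaining point to its nearest point in the target set
        d = [min(np.linalg.norm(X[r] - X[t]) for t in target) for r in rem]
        pick = rem[int(np.argmax(d))]
        target.append(pick)
        remaining.remove(pick)
        turn = 1 - turn
    return cal, tst
```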
Of all the proposed methodologies, the Kennard-Stone, DUPLEX and Puchwein's methods need the
minimum a priori knowledge. In addition, they provide a calibration set homogeneously distributed in
space (flat distribution). However, Puchwein's method must be applied several times. The DUPLEX
method seems to be the best way to select representative calibration and test data sets in a validation
context.
Once the calibration set has been selected, several tests can be employed to determine the
representativity of the selected objects with respect to the total set [74]. This appears to be unnecessary
if one of the algorithms recommended for the selection of the calibration samples has been applied. In
practice, however, little attention is often paid to the proper selection. For instance, it may be that the
analyst simply takes the first n samples for the calibration set. In this case a representativity test is
necessary. One possibility is to obtain PC score plots and to compare visually the selected set of
calibration samples to the whole set. This is difficult when there are many relevant PCs. In such cases a
more formal approach can be useful. We proposed an approach that includes the determination of three
different characteristics [75]. The first one checks if both sets have the same direction in the space of
the PCs. The directions are compared by computing the scalar product of two direction vectors
obtained from the PCA decomposition of both data sets. To do this, the normed scalar product between
the vectors d1 and d2 is obtained :
P = d1' d2 / (‖d1‖ ‖d2‖) (27)
where d1 and d2 are the average direction vector for each data set:
d1 = Σ_{i=1..r} λ_1,i² p_1,i and d2 = Σ_{i=1..r} λ_2,i² p_2,i (28)

where λ_1,i² and p_1,i are the corresponding eigenvalues and loading vectors for data set 1, and λ_2,i² and p_2,i are the corresponding eigenvalues and loading vectors for data set 2. If the P value (the cosine of the angle between the directions of the two sets) is higher than 0.7, it can be concluded that the original variables have similar contributions to the latent variables, and the sets are comparable.
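The direction comparison of eqns (27)-(28) can be sketched as follows (eigenvalues taken as squared singular values of the centered matrices; the absolute value removes the arbitrary sign of PCA loadings, an implementation choice not discussed in the text):

```python
import numpy as np

def direction_similarity(X1, X2, r=1):
    """Normed scalar product P (eqn 27) of the eigenvalue-weighted
    average direction vectors d1, d2 (eqn 28), using the first r PCs."""
    def avg_direction(X):
        Xc = X - X.mean(axis=0)
        _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return (S[:r, None]**2 * Vt[:r]).sum(axis=0)   # sum of lambda_i^2 * p_i
    d1, d2 = avg_direction(X1), avg_direction(X2)
    return abs(d1 @ d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
```

Two halves of a data set with one dominant source of variation give P close to 1.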
The second test compares the variance-covariance matrices. The intention is to determine whether the
two data sets have a similar volume both in magnitude and direction. The comparison is made by using
an approximation of Bartlett's test. First the pooled variance-covariance matrix is computed :

C = [ (n1 - 1) C1 + (n2 - 1) C2 ] / (n1 + n2 - 2) (29)
The Box M-statistic is then obtained :

M = ν [ (n1 - 1) ln|C C1-1| + (n2 - 1) ln|C C2-1| ] (30)

with

ν = 1 - [ (2p² + 3p - 1) / (6(p + 1)) ] [ 1/(n1 - 1) + 1/(n2 - 1) - 1/(n1 + n2 - 2) ] (31)

and the parameter CV is defined as :

CV = e^( -M / (n1 + n2 - 2) ) (32)
If CV is close to 1, both the volume and the direction of the data sets are comparable.
The third and last test compares the data set centroids. To do this, the squared Mahalanobis distance D² between the means of the two data sets is computed :

D² = (x̄1 - x̄2)' C-1 (x̄1 - x̄2) (33)
C is the pooled variance-covariance matrix defined in eqn (29), and from this value, a parameter F is defined as :
F = [ n1 n2 (n1 + n2 - p - 1) / ( p (n1 + n2)(n1 + n2 - 2) ) ] D² (34)

F follows a Fisher-Snedecor distribution with p and (n1 + n2 - p - 1) degrees of freedom.
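The centroid comparison can be sketched directly (a sketch; the pooled variance-covariance matrix of eqn (29) is used for C, and the statistic is only computed, not compared to tabulated F values):

```python
import numpy as np

def centroid_f(X1, X2):
    """Squared Mahalanobis distance between centroids (eqn 33) and the
    corresponding F statistic (eqn 34), F ~ F(p, n1+n2-p-1)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    C = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)  # eqn (29)
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    D2 = diff @ np.linalg.inv(C) @ diff                          # eqn (33)
    F = n1 * n2 * (n1 + n2 - p - 1) * D2 / (p * (n1 + n2) * (n1 + n2 - 2))
    return D2, F
```

Two samples from the same population give a small F; shifting one set's centroid inflates it.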
As already stated, these tests are not needed when a selection algorithm is used. With some selection algorithms they would even be contra-indicated. For instance, the test that compares variances cannot be applied to calibration sets selected by the D-optimal design, because the most extreme samples are selected and the calibration set will necessarily have a larger variance than the original set.
9. Non-linearity
Sources of non-linearity in spectroscopic methods are described in [76], and can be summarised as due to :
1 - violations of the Beer-Lambert law
2 - detector non-linearities
3 - stray light
4 - non-linearities in diffuse reflectance/transmittance
5 - chemically-based non-linearities
6 - non-linearities in the property/concentration relationship.
Methods based on ANOVA, proposed by Brown [77] and by Xie et al. (the non-linearity tracking analysis algorithm) [78], detect non-linear variables, which one may decide to delete. There seems to be little
expertise available in the practical use of these methods. Moreover, non-linear regions may contain
interesting information. The methods should therefore be used only as a diagnostic, signalling that non-
linearities occur in specific regions. If it is later found that the MLR model is not as good as was hoped,
or is more complex than expected, it may be useful to see if better results are obtained after elimination of the more non-linear regions.
Most methods for detection of non-linearity depend on visual evaluation of plots. A classical method is to plot the residuals against y or against the fitted (predicted) response ŷ for the complete model [79,80,54]. The latter is to be preferred, since it removes some of the random error which could make the evaluation more difficult (Fig. 11-b). This is certainly the case when the imprecision of y is relatively large. Non-linearity typically leads to residuals of one sign for most of the samples with mid-range y-values, whereas most of the samples with low or high y-value have residuals of the opposite sign. The
runs test [1] examines whether an unusual pattern occurs in a set of residuals. In this context a run is
defined as a series of consecutive residuals with the same sign. Figure 11-d would lead to 3 runs and
the following pattern: “ + + + + + + + - - - - - - + + +“.
From a statistical point of view long runs are improbable and are considered to indicate a trend in the data, in this case a non-linearity. The test therefore consists of comparing the number of runs with the number of samples. Similarly, the Durbin-Watson test examines the null hypothesis that there is no correlation between successive residuals. In this case no trend occurs. The runs or Durbin-Watson tests should be carried out as a complement to the visual evaluation and not as a replacement.
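Counting runs in a residual vector is straightforward (a sketch; comparing the count to its expected value under randomness still requires the tabulated runs-test distribution):

```python
def count_runs(residuals):
    """Number of runs: maximal series of consecutive residuals of equal sign."""
    signs = [r > 0 for r in residuals]
    return 1 + sum(a != b for a, b in zip(signs, signs[1:]))

# The pattern of Fig. 11-d: 7 positive, 6 negative, 3 positive residuals
runs = count_runs([1]*7 + [-1]*6 + [1]*3)   # -> 3 runs
```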
Fig. 11. Tools for visual detection of non-linearities : a) PRP plot b) RP plot c) e-RP plot d) ApaRP plot
A classical statistical way to check for non-linearities in one or more variables in multiple linear regression is based on testing whether the model improves significantly when a squared term is added. One compares
y_i = b0 + b1 x_i + b2 x_i² + e_i (35)
to
y_i = b0* + b1* x_i + e_i* (36)
x_i being the values of the x-variable investigated for object i. A one-sided F-test can be employed to check if the improvement of fit is significant. One can also apply a two-sided t-test to check whether b2 is significantly different from 0. The calculated t-value is compared to the tabulated t-value with (n-3) degrees of freedom, at the desired level of confidence. It can be noted that this can also be applied when the variables entered in the linear model are PC scores [2].
All these methods are lack-of-fit methods and it is probable that they will also indicate lack-of-fit when
the reason is not non-linearity, but the presence of outliers. Caution is therefore required. We prefer the
runs or the Durbin-Watson tests, in conjunction with visual evaluation of the partial response plot or the Mallows plot.
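The comparison of models (35) and (36) can be sketched as follows (a sketch; the statistic must still be compared to a tabulated one-sided F value with 1 and n-3 degrees of freedom):

```python
import numpy as np

def quad_term_f(x, y):
    """F statistic for the improvement of eqn (35) over eqn (36):
    F = (RSS_linear - RSS_quadratic) / (RSS_quadratic / (n - 3))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    def rss(A):
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return float(np.sum((y - A @ coef)**2))
    rss_lin = rss(np.column_stack([np.ones(n), x]))          # eqn (36)
    rss_quad = rss(np.column_stack([np.ones(n), x, x**2]))   # eqn (35)
    return (rss_lin - rss_quad) / (rss_quad / (n - 3))
```

A clearly curved x-y relationship yields a large F, while a linear one yields a value near the F-distribution's bulk.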
It should be noted that many of the methods described here require that a model has already been built. In this sense, this chapter should come after chapters 10 and 11. However, we recommend that non-linearity be investigated at least partly before the model is built, by plotting very significant variables if available (e.g. peak maxima in Raman spectroscopy) or the scores of the first PCs as a function of y (e.g. for NIR data). If a clear non-linear relationship with y is obtained with one of these variables/PCs, it is very probable that a non-linear approach is to be preferred. If no non-linearity is found in this step, then one should, after obtaining a linear model (chapters 10 and 11), check again, e.g. using the Mallows plot and the runs test, to confirm linearity.
10. Building the model
When variables are not correlated and more samples than variables are available, the model can be built simply using all of the variables. This usually happens for non-spectroscopic data. This situation can however also arise in very specific spectroscopic applications, for instance when using a simultaneous ICP-AES instrument equipped with only a few photomultipliers fixed on specific emission wavelengths. In some other particular cases, expert knowledge can be used to select very few variables out of a spectrum. For instance, in Raman or atomic emission spectroscopy, compounds in a mixture can be represented by clean and narrow peaks. Building the model can then simply consist in selecting the variables corresponding to the maxima of peaks representative of the product whose concentration has to be predicted. In the extreme case, only one variable is necessary to obtain satisfactory predictions, leading to a univariate model.
However, modern spectroscopic instruments usually generate a very high number of variables,
exceeding by far the number of available samples (objects). In current applications, and in particular in
NIR spectroscopy, variable selection is therefore needed to overcome the problems of matrix under-
determination and correlated variables. Even when more objects than variables are available, it can be
interesting to select only the most representative variables in order to obtain a simpler model. In the
majority of cases, building the MLR model therefore consists in performing variable selection : finding
the subset of variables that has to be used.
10.1. Stepwise approaches
The most classical variable selection approach, which is found in many statistical packages, is called
stepwise regression [1,2]. This family of methods consists in optimising the subset of variables used for
calibration by adding and/or removing them one by one from the total set.
The so-called forward selection procedure consists in first selecting the variable that is best correlated
with y. Suppose this is found to be xi. The model is at this stage restricted to y = f (xi). The regression
coefficient b obtained from the univariate regression model relating xi to y is tested for significance
using a t-test at the considered critical level α = 1 or 5 %. If it is not found to be significant, the process
stops and no model is built. Otherwise, all other variables are tested for inclusion in the model. The
variable xj which will be retained for inclusion together with xi is the one that, when added to the
model, leads to the largest improvement compared to the original univariate model. It is then tested
whether the observed improvement is significant. If not, the procedure stops and the model is restricted
to y = f(xi). If the improvement is significant, xj is definitively incorporated in the model that becomes
bivariate : y = f (xi,xj). The procedure is repeated for a third variable to be included in the model, and so
on until finally no further improvement can be obtained.
Several variants of this procedure can be used. In backward elimination, the selection is started with all
variables included in the model. The least significant ones are successively eliminated in a comparable
way as in forward selection. Forward and backward steps can be combined in order to obtain a more
sophisticated stepwise selection procedure. As is the case in forward selection, the first variable xi
Chapter 2 – Comparison of Multivariate Calibration Methods
143
entered in the model is the most correlated to the property of interest y. The regression coefficient b
obtained from the univariate regression model relating xi to y is tested for significance. The next step is
forward selection. The variable xj that yields the highest Partial Correlation Coefficient (PCC) is
included in the model. The inclusion of a new variable in the model can decrease the contribution of a
variable already included and make it non-significant. After each inclusion of a new variable, the
significance of the regression terms (bixi) already in the model is therefore tested, and the non-significant terms are eliminated from the equation. This is the backward elimination step. Forward
selection and backward elimination are repeated until no improvement of the model can be achieved by
including a new variable, and all the variables already included are significant. Such stepwise
approaches using both forward and backward steps are usually the most efficient.
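The forward selection step with its significance test can be sketched in Python with NumPy. This is a simplified illustration: the fixed threshold of 2.0 stands in for the tabulated critical t-value at α ≈ 5 % for moderate degrees of freedom, and the function name is illustrative, not part of any standard package.

```python
import numpy as np

def forward_selection(X, y, t_threshold=2.0, max_vars=None):
    """Greedy forward selection for MLR (simplified sketch).

    t_threshold approximates the critical t-value at alpha ~ 5 %
    for moderate degrees of freedom (an assumption of this sketch).
    """
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining and (max_vars is None or len(selected) < max_vars):
        best = None
        for j in remaining:
            # fit an MLR model with an intercept and the candidate subset
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            b = np.linalg.lstsq(A, y, rcond=None)[0]
            resid = y - A @ b
            rss = float(resid @ resid)
            if best is None or rss < best[0]:
                best = (rss, j, b, A)
        rss, j, b, A = best
        # t-test on the coefficient of the newly entered variable
        dof = n - len(selected) - 2          # n - (predictors + intercept)
        if dof <= 0:
            break
        cov = (rss / dof) * np.linalg.inv(A.T @ A)
        if abs(b[-1]) / np.sqrt(cov[-1, -1]) < t_threshold:
            break                            # improvement not significant: stop
        selected.append(j)
        remaining.remove(j)
    return selected
```

A backward elimination step would additionally re-test the already included coefficients after each inclusion and drop the non-significant ones.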
10.2. Genetic algorithms
Genetic algorithms can also be used for variable selection. They were first proposed by Holland [81].
They were introduced in chemometrics by Lucasius et al [82] and Leardi et al [83]. They were applied
for instance in multivariate calibration for the determination of certain characteristics of polymers [84]
or octane numbers [85]. Reviews about applications in chemistry can be found in [86,87]. There are
several competing algorithms such as simulated annealing [88] or the immune algorithm [89].
Genetic Algorithms are general optimisation tools aiming at selecting the fittest solution to a problem.
Suppose that, to keep it simple, 9 variables are measured. Possible solutions are represented in figure
12. Selected variables are indicated by a 1, non-selected variables by a 0.
VARIABLE        1 2 3 4 5 6 7 8 9

CHROMOSOMES     0 1 1 0 0 0 1 0 1
(Solutions)     1 0 0 1 0 0 1 0 0
                0 1 1 1 0 0 0 0 0

Fig. 12. A set of solutions for feature selection from nine variables for MLR.
Such solutions are sometimes called chromosomes in analogy with genetics. A set of such solutions is
obtained by random selection (several hundred chromosomes are often generated in real applications).
For each solution an MLR model is built using an equation such as (1) and the sum of squares of the
residuals of the objects towards that model is determined. One says that the fitness of each solution is
determined : the smaller the sum of squares the better the model describes the data and the fitter the
corresponding solutions are. Then follows what is described as the selection of the fittest (leading to
names such as genetic algorithms or evolutionary computation). For instance, out of, say, 100
original solutions, the 50 fittest are retained. They are called the parent generation. From these is
obtained a child generation by reproduction and mutation.
Reproduction is explained in figure 13. Two randomly chosen parent solutions produce two child
solutions by cross over. The cross over point is also chosen randomly. The first part of solution 1 and
the second part of solution 2 together yield child solution 1’. Solution 2’ results from the first part of
solution 2 and the second of solution 1. The child solutions are added to the selected parent solutions to
form a new generation. This is repeated for many generations and the best solution from the final
generation is retained.
Fig. 13. Genetic algorithms: the reproduction step. The cross over point is indicated by the * symbol.
Each generation is additionally submitted to mutation steps. Randomly chosen bits of the solution
string are changed here and there (0 to 1 or 1 to 0). This is illustrated in figure 14. The need for the
mutation step can be understood from figure 12. Suppose that the best solution is close to one of the
child solutions in that figure, but should not include variable 9. However, because the value for variable
9 is 1 in both parents, it is also unavoidably 1 in the children. Mutation can change this and move the
solutions in a better direction.
Fig. 14. Genetic algorithms: the mutation step. The mutation point is indicated by the * symbol.
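The whole cycle of fitness evaluation, selection of the fittest, cross over and mutation can be sketched as follows. This is an illustrative sketch, not the algorithm of refs [82,83]: the population size, number of parents, number of generations, mutation rate and all names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Sum of squared residuals of the MLR model built on the selected
    variables: the smaller, the fitter. An empty mask is worst."""
    if not mask.any():
        return np.inf
    A = np.column_stack([np.ones(len(y)), X[:, mask]])
    b = np.linalg.lstsq(A, y, rcond=None)[0]
    r = y - A @ b
    return float(r @ r)

def genetic_select(X, y, pop=100, keep=50, gens=25, p_mut=0.01):
    """Variable selection for MLR with a basic genetic algorithm."""
    n_var = X.shape[1]
    population = rng.integers(0, 2, size=(pop, n_var)).astype(bool)
    for _ in range(gens):
        # selection of the fittest: the `keep` best become parents
        scores = np.array([fitness(m, X, y) for m in population])
        parents = population[np.argsort(scores)[:keep]]
        children = []
        while len(children) < pop - keep:
            i, j = rng.choice(keep, size=2, replace=False)
            cut = int(rng.integers(1, n_var))    # random cross over point
            children.append(np.concatenate([parents[i, :cut], parents[j, cut:]]))
            children.append(np.concatenate([parents[j, :cut], parents[i, cut:]]))
        children = np.array(children[:pop - keep])
        # mutation: flip randomly chosen bits with a low probability
        children ^= rng.random(children.shape) < p_mut
        population = np.vstack([parents, children])
    scores = np.array([fitness(m, X, y) for m in population])
    return population[np.argmin(scores)]
```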
11. Model optimisation and validation
11.1. Training, optimisation and validation
The determination of the optimal complexity of the model (the number of variables that should be
included in the model) requires the estimation of the prediction error that can be reached. Ideally, a
distinction should be made between training, optimisation and validation. Training is the step in which
the regression coefficients are determined for a given model. In MLR, this means that the b-coefficients
are determined for a model that includes a given set of variables. Optimisation consists in comparing
different models and deciding which one gives best prediction. Validation is the step in which the
prediction with the chosen model is tested independently. In practice, as we will describe later, because
of practical constraints in the number of samples and/or time, less than three steps are often included.
In particular, analysts rarely make a distinction between optimisation and validation and the term
validation is then sometimes used for what is essentially an optimisation. While this is acceptable to
some extent, in no case should the three steps be reduced to one. In other words, it is not acceptable to
draw conclusions about optimal models and/or quality of prediction using only a training step. The
same data should never be used for training, optimising and validating the model. If this is done, it is
possible and even probable that an overfit of the model will occur, and prediction error obtained in this
way may be over-optimistic. Overfitting is the result of using a too complex model. Consider a
univariate situation in which three samples are measured. The y = f(x) model really is linear (first
order), but the experimenter decides to use a quadratic model instead. The training step will yield a
perfect result: all points are exactly on the line. If, however, new samples are predicted, then the
performance of the quadratic model will be worse than the performance of the linear one.
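The three-sample example can be reproduced numerically; the data values below are invented for illustration.

```python
import numpy as np

# Three training points that truly follow a first-order model y = 2x.
x_train = np.array([0.0, 1.0, 2.0])
y_train = 2.0 * x_train + np.array([0.05, -0.03, 0.02])   # small measurement errors

linear = np.polyfit(x_train, y_train, 1)       # correct complexity
quadratic = np.polyfit(x_train, y_train, 2)    # overfits: 3 points, 3 coefficients

# The quadratic training fit is "perfect" (all points exactly on the curve),
# but on a new sample the simpler model predicts better.
x_new, y_new = 4.0, 8.0
err_lin = abs(np.polyval(linear, x_new) - y_new)
err_quad = abs(np.polyval(quadratic, x_new) - y_new)
```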
11.2. Measures of predictive ability
Several statistics are used for measuring the predictive ability of a model. The prediction error sum of
squares, PRESS, is computed as :
PRESS = Σi=1..n (yi − ŷi)² = Σi=1..n ei²   (37)
where yi is the actual value of y for object i, ŷi the y-value for object i predicted with the model
under evaluation, ei the residual for object i (the difference between the predicted and the actual y-value)
and n the number of objects for which y is obtained by prediction.
The mean squared error of prediction (MSEP) is defined as the mean value of PRESS :
MSEP = PRESS / n = Σi=1..n (yi − ŷi)² / n = Σi=1..n ei² / n   (38)
Its square root is called root mean squared error of prediction, RMSEP:
RMSEP = √MSEP = √( Σi=1..n (yi − ŷi)² / n ) = √( Σi=1..n ei² / n )   (39)
All these quantities give the same information. In the chemometrics literature it seems that RMSEP
values are preferred, partly because they are given in the same units as the y-variable.
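Equations (37)-(39) translate directly into code; the function name is illustrative.

```python
import numpy as np

def prediction_errors(y, y_hat):
    """PRESS, MSEP and RMSEP of equations (37)-(39)."""
    e = y - y_hat                    # residuals
    press = float(np.sum(e ** 2))    # eq (37)
    msep = press / len(y)            # eq (38)
    rmsep = msep ** 0.5              # eq (39), same units as y
    return press, msep, rmsep
```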
11.3. Optimisation
The RMSEP is determined for different models. For instance, with stepwise selection, one model can be
built using a t-test significance level of 1%, and another a t-test significance level of 5%. With genetic
algorithms, various models can be obtained with different numbers of variables. The result can be
presented as a plot showing RMSEP as a function of the number of variables and is called the RMSEP
curve. This curve often shows an intermediate minimum and the number of variables for which this
occurs is then considered to be the optimal complexity of the model. This can be a way of optimising
the output of stepwise selection procedure (optimising the number of variables retained). A problem
which is sometimes encountered is that the global minimum is reached for a model with a very high
complexity. A more parsimonious model is often more robust (the parsimony principle). Therefore, it
has been proposed to use the first local minimum or a deflection point instead of the global
minimum. If there is only a small difference between the RMSEP of the minimum and a model with
less complexity, the latter is often chosen. The decision on whether the difference is considered to be
small is often based on the experience of the analyst. We can also use statistical tests that have been
developed to decide whether a more parsimonious model can be considered statistically equivalent. In
that case the more parsimonious model should be preferred. An F-test [90,91] or a randomisation t-test
[92] have been proposed for this purpose. The latter requires fewer statistical assumptions about data and
model properties, and is probably to be preferred. However, in practice it does not always seem to yield
reliable results.
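A possible sketch of this complexity choice combines the first local minimum with a parsimony back-off. The 5 % relative tolerance used below is an arbitrary illustrative value, not a recommended standard.

```python
def optimal_complexity(rmsep, tol=0.05):
    """Choose model complexity from an RMSEP curve (index 0 = 1 variable).

    Prefers the first local minimum; if a less complex model stays within
    `tol` (relative) of that minimum, the more parsimonious one is taken.
    """
    # first local minimum (or the last point if the curve is monotonic)
    k = len(rmsep) - 1
    for i in range(1, len(rmsep) - 1):
        if rmsep[i] <= rmsep[i - 1] and rmsep[i] <= rmsep[i + 1]:
            k = i
            break
    # parsimony: back off while the loss in RMSEP stays small
    while k > 0 and rmsep[k - 1] <= rmsep[k] * (1 + tol):
        k -= 1
    return k + 1   # number of variables
```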
11.4. Validation
The model selected in the optimisation step is applied to an independent set of samples and the y-values
(i.e. the results obtained with the reference method) and ŷ-values (the results obtained with
multivariate calibration) are compared. An example is shown in figure 15. The interpretation is usually
done visually : does the line with slope 1 and intercept 0 represent the points in the graph sufficiently
well ? It is necessary to check whether this is true over the whole range of concentrations (non-
linearity) and for all meaningful groups of samples, e.g. for different clusters. If a situation is obtained
when most samples of a cluster are found at one side of the line, a more complex modelling method
(e.g. locally weighted regression [31, 46]) or a model for each separate cluster of samples may yield
better results.
Fig. 15. The measured property (y) plotted against the predicted values of the property (ŷ).
Sometimes a least squares regression line between y and ŷ is obtained and a test is carried out to verify
that the joint confidence interval contains slope = 1 and intercept = 0 [93]. Similarly, a paired t-test
between the y and ŷ values can be carried out. This does not obviate, however, the need for checking
non-linearity or looking at individual clusters.
An important question is what RMSEP to expect ? If the final model is correct, i.e. there is no bias,
then the predictions will often be more precise than those obtained with the reference method
[94,10,95], due to the averaging effect of the regression. However, this cannot be proved from
measurements on validation samples, the reference values of which were obtained with the reference
method. The RMSEP value is limited by the precision (and accuracy) of the reference method. For that
reason, the precision of the reference method can be used at the optimisation stage as a kind of target value for the RMSEP. An alternative way of
deciding on model complexity therefore is to select the lowest complexity which leads to an RMSEP
value comparable to the precision of the reference method.
11.5. External validation
In principle, the same data should not be used for developing, optimising and validating the model. If
we do this, it is possible and even probable that we will overfit the model and prediction errors
obtained in this way may be over-optimistic. Terminology in this field is not standardised. We suggest
that the samples used in the training step should be called the training set, those that are used in
optimisation the evaluation set and those used for the validation the validation set. Some multivariate
calibration methods require three data sets. This is the case when neural nets are applied (the evaluation
set is then usually called the monitoring set). In PCR and related methods, often only two data sets are
used (external validation) or even only one (internal validation). In the latter case, the existence of a
second data set is simulated (see section 11.6). We suggest that the sum of all sets should be
called the calibration set. Thus the calibration set can consist of the sum of training, evaluation and
validation sets, or it can be split into a training and a test set, or it can serve as the single set applied in
internal validation. Applied with care, external and internal validation methods will warn against
overfitting.
External validation uses a completely different group of samples for prediction (sometimes called the
test set) from the one used for building the model (the training set). Care should be taken that both
sample sets are obtained in such a way that they are representative for the data being investigated. This
can be investigated using the measures described for representativity in chapter 8. One should be aware
that with an external test set the prediction error obtained may depend to a large extent on how exactly
the objects are situated in space in relationship to each other.
It is important to repeat that, in the presence of measurement replicates, all of them must be kept
together either in the test set or in the training set when data splitting is performed. Otherwise, there is
no perturbation, nor independence, of the statistical sample.
The preceding paragraphs apply when the model is developed from samples taken from a process or a
natural population. If a model was created with artificial samples with y-values outside the expected
range of y-values to be determined, for the reasons explained in chapter 8, then the test set should
contain only samples with y-values in the expected range.
11.6. Internal validation
One can also apply what is called internal validation. Internal validation uses the same data for
developing the model and validating it, but in such a way that external validation is simulated. A
comparison of internal validation procedures usually employed in spectrometry is given in [96]. Four
different methodologies were employed:
a. Random splitting of the calibration set into a training and a test set. The splitting can then
have a large influence on the obtained RMSEP value.
b. Cross-validation (CV), where the data are randomly divided into d so-called cancellation
groups. A large number of cancellation groups corresponds to validation with a small perturbation of
the statistical sample, whereas a small number of cancellation groups corresponds to a heavy
perturbation. The term perturbation is used to indicate that the data set used for developing the model in
this stage is not the same as the one developed with all calibration objects, i.e. the one which will be
applied in chapters 13 and 14. Too small a perturbation means that overfitting is still possible. The
validation procedure is repeated as many times as there are cancellation groups. At the end of the
validation procedure each object has been once in the test set and d-1 times in the training set. Suppose
there are 15 objects and 3 cancellation groups, consisting of objects 1-5, 6-10 and 11-15. We
mentioned earlier that the objects should be assigned randomly to the cancellation groups, but for ease
of explanation we have used the numbering above. The b-coefficients in the model that is being
evaluated are determined first for the training set consisting of objects 6-15 and objects 1-5 function as
test set, i.e. they are predicted with this model. The PRESS is determined for these 5 objects. Then a
model is made with objects 1-5 and 11-15 as training and 6-10 as test set and, finally, a model is made
with objects 1-10 in the training set and 11-15 in the test set. Each time the PRESS value is determined
and eventually the three PRESS values are added, to give a value representative for the whole data set
(PRESS values are more indicated here than RMSEP values, because PRESS values are sums of squares and
therefore additive).
c. Leave-one-out cross-validation (LOO-CV), in which the test sets contain only one object (d =
n). Because the perturbation of the model at each step is small (only one object is set aside), this
procedure tends to overfit the model. For this reason the leave-more-out methods described above may
be preferable. The main drawback of LOO-CV is that the computation is slow because a model has to
be developed for each object.
d. Repeated random splitting (repeated evaluation set method) (RES) [96]. The procedure
described in a is repeated many times. In this way, at the end of the validation procedure, one hopes
that an object has been in the test set several times with different companions. Stable results are
obtained after repetition of the procedure several times (even hundreds of times). To have a good
picture of the prediction error, low and high percentages of objects in the evaluation set have to be
used.
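Procedure b can be sketched as follows for an MLR model; the random assignment of objects to the cancellation groups follows the recommendation above, and the function name is illustrative.

```python
import numpy as np

def cross_validated_press(X, y, d=3, seed=0):
    """PRESS summed over d cancellation groups for an MLR model with
    intercept. Objects are assigned to the groups at random; each object
    is once in the test set and d-1 times in the training set."""
    idx = np.random.default_rng(seed).permutation(len(y))
    press = 0.0
    for g in np.array_split(idx, d):
        train = np.setdiff1d(idx, g)
        A_tr = np.column_stack([np.ones(len(train)), X[train]])
        b = np.linalg.lstsq(A_tr, y[train], rcond=None)[0]
        A_te = np.column_stack([np.ones(len(g)), X[g]])
        press += float(np.sum((y[g] - A_te @ b) ** 2))
    # PRESS values are additive over the groups; dividing the total by n
    # and taking the square root gives an RMSECV-type figure
    return press
```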
12. Random correlation
12.1. The Random Correlation issue
Fig. 16. The 16 wavelengths selected by the Stepwise selection method for a random (20 x 100) spectral matrix and a random (1 x 20) concentration vector.
Let us consider a simulated X spectral matrix made of 20 spectra with 100 wavelengths filled with
random values between 0 and 100, and a y matrix of 20 random values between 0 and 10. Stepwise
selection applied to such a data set will surprisingly sometimes retain a certain number of variables
(Fig. 16). If cross validation is performed to validate the obtained model, the RMSECV results can
even make it look as if the model were very efficient in predicting y (table 1). This phenomenon is
common for stepwise variable selection applied to noisy data. It has already been described [97,98],
and is referred to as random correlation or chance correlation.
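The chance-correlation effect is easy to reproduce in a simulation in the spirit of the one above; the random seed and all names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(20, 100))   # 20 random "spectra", 100 wavelengths
y = rng.uniform(0, 10, size=20)           # 20 random "concentrations"

# Correlation of each of the 100 purely random variables with the random y.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(100)])

# With only 20 objects, some random variables correlate strongly with y
# by chance; this is what misleads the stepwise selection.
best = float(np.abs(r).max())
```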
Table 1. Stepwise-MLR calibration results (RMSECV) obtained for a random (20 x 100) spectral matrix and 3 different random (1 x 20) concentration vectors. In most of the cases, the method finds correlated variables and a model is built solely on chance-correlated variables.
                    α = 1 %                       α = 5 %
                RMSECV      # variables       RMSECV    # variables
Y matrix # 1    2.0495      2                 0.1434    12
Y matrix # 2    No result   (no variable      0.0702    14
                            correlated)
Y matrix # 3    2.0652      2                 0.0041    16
12.2. Random Correlation on real data
This phenomenon is illustrated here in a spectacular manner on simulated data, but it must be noted that
it can also happen on real spectroscopic data. For instance, a model is built relating Raman spectra
of 5-compound mixtures [99] to the concentration of one of these compounds (called MX). Figure 17
shows the variables retained to model the MX product. The selected variables are represented by stars
on the spectrum of a typical mixture containing equivalent quantities of the 5 products. The RMSECV
is found to be suspiciously low compared to the RMSECV of the univariate model built using only the
first selected variable (maximum of the MX peak).
Fig. 17. Wavelengths selected by the Stepwise selection method for the MX model, and order of selection of those variables. Displayed on the spectrum of a typical mixture containing all of the 5 components.
The variable selection does not seem correct. The first variable is, as expected, retained on the maximum
of the MX peak, but all the other variables are selected in uninformative parts of the spectrum. The
correlation coefficients of these variables with y are nevertheless quite high (table 2).
Table 2. Model built with Stepwise selection for the meta-xylene (first 17 variables only). The correlation coefficient and the regression coefficient for each of the selected variables are also given.
Institut Français du Pétrole (I.F.P.), 1-4 Avenue du Bois Préau, 92506 Rueil-Malmaison
France
Ph. Marteau
Université Paris Nord, L.I.M.P.H.,
Av. J.B. Clément, 93430 Villetaneuse
France
ABSTRACT
An industrial process separating p-xylene from mainly other C8 aromatic compounds is monitored with an online remote Raman analyser. The concentrations of six constituents are currently evaluated with a classical calibration method. Since the aim of the study was to improve the precision of the monitoring of the process, inverse linear calibration methods were applied to a synthetic data set, in order to evaluate the improvement in prediction such methods could yield. Several methods were tested, including Principal Component Regression with variable selection, Partial Least Squares Regression, and Multiple Linear Regression with variable selection (Stepwise or based on a Genetic Algorithm). Methods based on selected wavelengths are of great interest because the obtained models can be expected to be very robust toward experimental conditions. However, because of the substantial noise in the spectra due to the short accumulation time, the variable selection methods selected many irrelevant variables through chance correlation. Strategies were investigated to solve this problem and build reliable, robust models. These strategies include the use of signal pre-processing (smoothing and filtering in the Fourier or wavelet domain), and the use of an improved variable selection algorithm based on the selection of spectral windows instead of single wavelengths when this leads to a better model. The best results were achieved with Multiple Linear Regression and Stepwise variable selection applied to spectra denoised in the Fourier domain.
* Corresponding author
KEYWORDS : Chemometrics, Raman Spectroscopy, Multivariate Calibration, random correlation.
Chapter 3 – New Types of Data : Nature of the Data Set
171
1 - Introduction
The Eluxyl process separates para-xylene from other C8 aromatic compounds (ortho and meta-xylene,
and either para-di-ethylbenzene or toluene used as solvent) by simulated moving bed chromatography
[1]. The evolution of the process is monitored online using a Raman analyser equipped with optical
fibres. The Raman scattering studied is in the visible range and is collected on a 2-dimensional Charge
Coupled Device (CCD) detector that allows true simultaneous recordings. The Raman technique gives
access to the fundamental vibrations of molecules by using either a visible or a near-IR excitation. This
allows an easy attribution of the vibrational bands and the possibility to use classical calibration
methods for quantitative analysis in non-complex mixtures. Nevertheless, taking into account small
quantities (< 5 %) of impurities (i.e. C9+ compounds), the classical calibration method is naturally
limited in precision if all the impurities are not clearly identified in the spectrum.
The scope of this paper is to evaluate the improvement that could be achieved in terms of precision of
the quantification by using inverse calibration methods. The work presented here is at the stage of a
feasibility study aiming at showing that inverse calibration should be applied later on the industrial
installations. Synthetic samples were therefore studied using a laboratory instrument. In order not to
overestimate the possible improvements obtained, the study has been performed in the wavelength
domain currently used and optimised for the classical calibration method. Moreover, the synthetic
samples contained no impurities, leading to a situation optimal for the direct calibration method. It can
therefore be expected that any improvement achieved in these conditions would be even more
appreciable on the real industrial process. It is also important to evaluate which inverse calibration
method is the most efficient, so that the implementation of the new system on the industrial process can
be performed as quickly as possible.
2 – Calibration Methods
Bold upper-case letters (X) stand for matrices, bold lower-case letters (y) stand for vectors, and italic
lower-case letters (h) stand for scalars.
2.1 - Comparison of classical and inverse calibration
The main assumption when building a classical calibration model to determine concentrations from spectra
is that the error lies in the spectra. The model can be seen as :
Spectra = f (Concentrations). Or, in a matrix form :
R = C . K + E (1)
where R is the spectral response matrix, C the concentration matrix, K the matrix of molar absorptivities
of the pure components, and E the error matrix. This implies that it is necessary to know all the
concentrations in order to build the model, if a high precision is required.
In inverse calibration, one assumes that the error lies in the measurement of the concentrations. The model
can be seen as : Concentrations = f (Spectra). Or, in a matrix form:
C = P . R + E (2)
where R is the spectral matrix, C the concentration matrix, P the regression coefficients matrix, and E the
error matrix. A perfect knowledge about the composition of the system is then not necessary.
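The two model forms can be contrasted on simulated data. This is an illustrative sketch with invented dimensions and noise level; note that it uses the row-wise layout C = R·P, i.e. the transposed convention of equation (2).

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.uniform(0.0, 1.0, size=(4, 60))        # pure-component "spectra" (4 x 60)
C = rng.uniform(0.0, 1.0, size=(25, 4))        # concentrations of 25 mixtures
R = C @ K + rng.normal(0.0, 1e-3, (25, 60))    # eq (1): R = C.K + E

# Classical calibration: estimate K, which requires fully known concentrations.
K_hat = np.linalg.lstsq(C, R, rcond=None)[0]

# Inverse calibration: regress the concentrations directly on the spectra
# (row-wise form of eq (2)); full knowledge of the system is not needed.
P_hat = np.linalg.lstsq(R, C, rcond=None)[0]
C_pred = R @ P_hat
```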
2.2 - Method currently used for the monitoring
The concentrations are currently evaluated using a software package [2] implementing a classical multivariate
calibration method based on the measurement of the areas of the Raman peaks. It is assumed that there
is a linear relationship between Raman intensity and the molar density of a substance. The Raman
intensity collected also depends on other factors (excitation frequency, laser intensity, etc…), but those
factors are the same for all of the bands in a spectrum. It is therefore necessary to work with relative
concentrations for the substances. The relative concentration of a molecule j in a mixture including n
types of molecules is obtained by calculating :
cj = (pj / σj) / Σi=1..n (pi / σi)   (3)
where pj is the theoretical integrated intensity of the Raman line due specifically to the molecule j, and
σj the relative cross section of this molecule. The cross section of a molecule represents the fact that
different molecules, even when studied at the same concentration, can induce Raman scattering with
different intensity.
The measured intensity mj of a peak is also due to the contribution of peaks from other molecules. For
the method to take overlapping between peaks into account, the theoretical pj values must therefore be
deduced from the experimentally measured integrated intensities mj (Fig. 1). The following system has
to be solved :
a11 p1 + a21 p2 + a31 p3 + … + an1 pn = m1
a12 p1 + a22 p2 + a32 p3 + … + an2 pn = m2
…   (4)
a1n p1 + a2n p2 + a3n p3 + … + ann pn = mn
where the aij coefficients represent the contribution of the ith molecule on the integrated frequency
domain corresponding to the jth molecule (Fig. 1).
The aij coefficients are deduced from the Raman spectra of pure components as being the ratio between
the integrated intensity in the frequency domains of the jth and ith molecules respectively. The aii
coefficients are equal to 1.
The system (4) can be written in a matrix form as :
K . P = M  →  P = K⁻¹ . M   (5)
The integrated intensities m of the matrix M were measured over frequency domains of 7 cm-1 centered
on the maximum of the peaks (Fig. 1). This is of the order of their width at half height. The maxima
have therefore to be determined before the calculation can be performed. The spectra of the five pure
products are used for this purpose. The relative scattering cross-sections σj are obtained from the
spectra of binary equimolar mixtures of each of the molecules with one taken as a reference. Here,
toluene is taken as a reference, this leads to :
σtoluene = 1
σj = σ(j / toluene) = pj / ptoluene (6)
Once the p and σ values are known, the concentrations are obtained using equation (3). A more
detailed description of the method is available in [2].
Fig. 1. Measured intensity mOX of the ortho-xylene peak on the spectrum of a single component sample. The contribution of the meta-xylene peak under the ortho-xylene peak aMX/OX is also represented. The 7 cm-1 integration domains are filled in grey.
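The computation of equations (3)-(6) can be sketched with invented numbers; the aij, m and σ values below are illustrative, not measured Raman data.

```python
import numpy as np

# Illustrative values for a 3-component mixture (not real Raman data).
K = np.array([[1.00, 0.08, 0.00],    # K[j, i] = a_ij: contribution of
              [0.05, 1.00, 0.12],    # molecule i in the integration
              [0.00, 0.10, 1.00]])   # window of molecule j (a_ii = 1)
m = np.array([2.1, 1.4, 0.9])        # measured integrated intensities
sigma = np.array([1.0, 0.8, 1.3])    # relative cross sections, eq (6)

p = np.linalg.solve(K, m)                # eq (5): p = K^-1 . m
c = (p / sigma) / np.sum(p / sigma)      # eq (3): relative concentrations
```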
2.3 - Stepwise Multiple Linear Regression (Stepwise MLR)
Stepwise Multiple Linear Regression [3] is an MLR with variable selection. Stepwise selection is used
to select a small subset of variables from the original spectral matrix X. The first variable xj entered in
the model is the most correlated to the property of interest y. The regression coefficient b obtained
from the univariate regression model relating xj to y is tested for significance using a t-test at the
considered critical level α = 1 or 5 %. The next step is forward selection. This consists in including in
the model the variable xi that yields the highest Partial Correlation Coefficient (PCC). The inclusion of
a new variable in the model can decrease the contribution of a variable already included and make it
non-significant. After each inclusion of a new variable, the significance of the regression terms (bixi)
already in the model is therefore tested, and the non-significant terms are eliminated from the equation.
This is the backward elimination step. Forward selection and backward elimination are repeated until
no improvement of the model can be achieved by including a new variable, and all the variables
already included are significant.
The Stepwise variable selection method is known for sometimes selecting uninformative variables because
of chance correlation to the property of interest. This can occur when the method is applied to noisy
signals. In order to reduce this risk, a modified version of this algorithm was proposed. The main idea
is the same as in Stepwise, the forward selection and backward elimination steps are maintained. The
difference lies in the fact that each time a variable xj is selected for entry in the model, an iterative
process begins :
• A new variable is built. This variable xj1 is made of the average Raman scattering value of a 3-
point window centred on xj (from xj-1 to xj+1). If xj1 yields a higher PCC than xj, it becomes the new
candidate variable.
• A second new variable, xj2 (average Raman scattering value of points xj-2 to xj+2) is built, it is
compared with xj1 , and the process goes on.
• When the enlargement of the window does not lead to a variable xj(n+1) with a better PCC than
xjn, the method stops and xjn enters the model.
Selecting a (2n+1)-point spectral window instead of a single wavelength implies a local averaging of
the signal. This should reduce the effect of noise in the prediction step. Moreover, as the first variables
entered into the model (the most important ones) yield a better PCC, fewer uninformative variables should be
retained.
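The window-enlargement loop can be illustrated as follows. This is a sketch under stated assumptions: plain absolute correlation with y stands in for the partial correlation coefficient, and the function name `best_window` is ours.

```python
import numpy as np

def corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

def best_window(X, y, j):
    """Enlarge a (2n+1)-point averaging window centred on variable j
    while the correlation with y keeps improving; returns the retained
    half-width n and the corresponding correlation."""
    p = X.shape[1]
    best_n, best_c = 0, corr(X[:, j], y)
    n = 1
    while j - n >= 0 and j + n < p:
        candidate = X[:, j - n:j + n + 1].mean(axis=1)  # averaged window
        c = corr(candidate, y)
        if c <= best_c:
            break  # enlargement no longer helps: keep the previous window
        best_n, best_c = n, c
        n += 1
    return best_n, best_c

# Toy 'broad peak': columns 3..7 all carry the same signal plus noise,
# so averaging a window around column 5 improves the correlation.
rng = np.random.default_rng(1)
signal = rng.normal(size=200)
X = rng.normal(size=(200, 11))
X[:, 3:8] += signal[:, None]
n, c = best_window(X, signal, 5)
```

Averaging over the window cancels part of the independent noise of the neighbouring points, which is why the enlarged variable correlates better with y than the single point does.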
2.4 - MLR with selection by Genetic Algorithm (GA MLR)
Genetic Algorithms (GA) are used here to select a small subset of original variables in order to build an
MLR model [4]. A population of k strings (or chromosomes) is randomly chosen from the original
predictor matrix X. The chromosomes are made of genes (or bitfields) representing the parameters to
optimise. In the case of variable selection, each gene is made of a single bit corresponding to an
original variable. The fitness of each string is evaluated in terms of Root Mean Squared Error of
Prediction, defined as:
RMSEP = \sqrt{ \frac{ \sum_{i=1}^{n_t} ( \hat{y}_i - y_i )^2 }{ n_t } }    (7)
where n_t is the number of objects in the test set, y_i the known value of the property of interest for object
i, and ŷ_i the value of the property of interest predicted by the model for object i.
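Equation (7) translates directly into code; a small sketch (the function name is ours):

```python
import math

def rmsep(y_known, y_pred):
    """Root Mean Squared Error of Prediction, eq. (7)."""
    n_t = len(y_known)
    return math.sqrt(sum((yp - yk) ** 2 for yk, yp in zip(y_known, y_pred)) / n_t)

print(rmsep([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # sqrt(4/3) ≈ 1.1547
```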
With a probability depending on their fitness, pairs of strings are selected to undergo cross-over. Cross-
over is a GA operator consisting in mixing the information contained in two existing (parent) strings to
obtain new (children) strings. In order to enable the method to escape a possible local minimum, a
second GA operator, mutation, is introduced with a much lower probability. This means that each bit in
the children strings may be randomly changed. In the algorithm used here [5], the children strings may
replace members of the population of parent strings yielding a worse fit. This whole procedure is called
a generation. It is iterated until convergence to a good solution is reached. In order to improve the
variable selection, a backward elimination was added to ensure that all the selected variables are
relevant for the model. The principle is the same as the backward elimination step in the Stepwise
variable selection method.
Chapter 3 – New Types of Data : Nature of the Data Set
This method requires as input parameters the number of strings in each generation (size of the
population), the number of variables in each string (number of genes per chromosome), the frequency
of cross-over, mutations and backward elimination, and the number of generations.
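A generation (fitness evaluation, cross-over, mutation, replacement of the worst strings) can be sketched as below. This is a deliberately minimal GA, not the algorithm of reference [5]: fitness is the negative RMSEP on a single random train/test split, one child is produced per generation, the backward elimination step is omitted, and all names are ours.

```python
import numpy as np

def fitness(X, y, mask, rng):
    """Negative RMSEP of an OLS model (with intercept) on a random
    2/3 train, 1/3 test split, for the variables flagged in mask."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return -np.inf
    perm = rng.permutation(len(y))
    cut = 2 * len(y) // 3
    tr, te = perm[:cut], perm[cut:]
    A = np.column_stack([np.ones(tr.size), X[np.ix_(tr, idx)]])
    coef, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
    pred = np.column_stack([np.ones(te.size), X[np.ix_(te, idx)]]) @ coef
    return -float(np.sqrt(np.mean((pred - y[te]) ** 2)))

def ga_select(X, y, pop=20, n_on=3, gens=40, p_mut=0.02, seed=0):
    """Minimal GA variable selection: one bit per original variable,
    single-point cross-over, bit-flip mutation, the child replaces the
    worst string of the population."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    popu = np.zeros((pop, p), dtype=int)
    for s in popu:
        s[rng.choice(p, size=n_on, replace=False)] = 1
    for _ in range(gens):
        fit = np.array([fitness(X, y, s, rng) for s in popu])
        order = np.argsort(fit)                    # worst string first
        pa, pb = popu[order[-1]], popu[order[-2]]  # two fittest parents
        cut = int(rng.integers(1, p))              # cross-over point
        child = np.concatenate([pa[:cut], pb[cut:]])
        flips = rng.random(p) < p_mut              # rare bit-flip mutations
        child = np.where(flips, 1 - child, child)
        popu[order[0]] = child                     # replace the worst string
    fit = np.array([fitness(X, y, s, rng) for s in popu])
    return np.flatnonzero(popu[np.argmax(fit)])

# Toy example: only variable 1 is informative.
rng = np.random.default_rng(3)
X = rng.normal(size=(90, 6))
y = 3.0 * X[:, 1] + 0.1 * rng.normal(size=90)
sel = ga_select(X, y)
```

On this toy problem the returned subset contains variable 1, the only informative one.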
2.5 - Principal Component Regression with variable selection (PCR VS)
This method includes two steps. The original data matrix X(n,p) is approximated by a small set of
orthogonal Principal Components (PCs) T(n,a). A Multiple Linear Regression model is then built
relating the scores of the PCs (independent variables) to the property of interest y(n,1). The main
difficulty of this method is choosing the number of PCs to retain. This was done here by
means of Leave One Out (LOO) Cross Validation (CV). The predictive ability of the model is
estimated at several complexities (models including 1, 2, … PCs) in terms of Root Mean Square
Error of Cross Validation (RMSECV). RMSECV is defined as RMSEP (eq. 7) where ŷ_i is obtained
by cross validation. The complexity leading to the smallest RMSECV is considered as optimal in a first
approach. In a second step, in order to avoid overfitting, more parsimonious models (smaller
complexities, one or more of the last selected variables are removed) are tested to determine if they can
be considered as equivalent in performance. The slightly worse RMSECV can in that case be
compensated by a better robustness of the resulting model. This is done using a randomisation test [6].
This test is applied to check the equality of performance of two prediction methods or the same
prediction method at two different complexities. In this study, the probability was estimated as the
average of three calculations with 249 iterations each, and the alpha value used was 5%.
In the usual PCR [7], the variables are introduced into the model according to the percentage of
variance they explain. This is called PCR top-down. But the PCs explaining the largest part of the
global variance in X are not always the most related to y. In PCR with variable selection (PCR VS), the
PCs are included in the model according to their correlation [8] with y, or their predictive ability [9].
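The difference between top-down PCR and PCR VS is only the order in which the PCs enter the model; the selection-by-correlation variant can be sketched as follows (function names ours; the stopping rule by cross validation is not reproduced, the number of PCs is passed in directly):

```python
import numpy as np

def pcr_vs(X, y, n_pcs):
    """PCR where the PCs enter by |correlation| with y rather than by
    explained variance (a simplified 'PCR VS')."""
    xm, ym = X.mean(axis=0), y.mean()
    Xc, yc = X - xm, y - ym
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U * s                                        # scores of all PCs
    r = np.abs([np.corrcoef(T[:, a], yc)[0, 1] for a in range(T.shape[1])])
    order = np.argsort(r)[::-1][:n_pcs]              # most y-correlated first
    q, *_ = np.linalg.lstsq(T[:, order], yc, rcond=None)
    b = Vt[order].T @ q                              # back to original variables
    return b, xm, ym, order

def pcr_predict(model, Xnew):
    b, xm, ym, _ = model
    return (Xnew - xm) @ b + ym

# Toy example: with all PCs retained, the model reproduces least squares.
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.normal(size=40)
model = pcr_vs(X, y, n_pcs=5)
fit_rmse = float(np.sqrt(np.mean((pcr_predict(model, X) - y) ** 2)))
```

With fewer PCs than variables, the `order` array shows which components entered first, mirroring the "Selected PCs" rows of the result tables below.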
2.6 - Partial Least Squares Regression (PLS)
Similarly to PCR, PLS [10] reduces the data to a small number of latent variables. The basic idea is to
focus only on the systematic variation in X that is related to y. PLS maximises the covariance between
the spectral data and the property to be modelled. The original NIPALS [11-12] algorithm was used in
this study. In the same way as for PCR, the optimal complexity is determined by comparing the
RMSECV obtained from models with various complexities. To avoid overfitting, this complexity is
then confirmed or corrected by comparing the model leading to the smallest RMSECV with the more
parsimonious ones using a randomisation test.
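The covariance-maximising factor extraction can be sketched with the classical PLS1 recursion (a compact equivalent of NIPALS for a single y; function names ours):

```python
import numpy as np

def pls1(X, y, n_factors):
    """PLS1: each weight vector w maximises the covariance between the
    X scores t = X w and y; X and y are deflated after each factor."""
    xm, ym = X.mean(axis=0), y.mean()
    Xr, yr = X - xm, y - ym
    W, P, Q = [], [], []
    for _ in range(n_factors):
        w = Xr.T @ yr                       # covariance direction
        w = w / np.linalg.norm(w)
        t = Xr @ w                          # scores
        tt = float(t @ t)
        p = Xr.T @ t / tt                   # X loadings
        q = float(yr @ t) / tt              # y loading
        Xr = Xr - np.outer(t, p)            # deflation
        yr = yr - q * t
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    b = W @ np.linalg.solve(P.T @ W, Q)     # regression vector
    return b, xm, ym

def pls_predict(model, Xnew):
    b, xm, ym = model
    return (Xnew - xm) @ b + ym

# Toy example: with as many factors as variables, PLS1 fits the data.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + 0.01 * rng.normal(size=50)
m = pls1(X, y, n_factors=4)
rmse = float(np.sqrt(np.mean((pls_predict(m, X) - y) ** 2)))
```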
3– Signal Processing Methods
3.1 - Smoothing by moving average
Smoothing by moving average (first order Savitzky-Golay algorithm [13]) is the simplest way to
reduce noise in a signal. It has, however, important drawbacks. For instance, it modifies the shape of
peaks, tending to reduce their height and enlarge their base. The size of the window chosen for the
smoothing must be optimised in order not to reduce the predictive abilities of the models obtained.
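The drawback mentioned above (peak height reduced, base enlarged) is easy to see on a toy peak; a minimal sketch in plain Python, with the endpoints simply left unsmoothed:

```python
def moving_average(signal, window):
    """Smoothing by moving average over an odd-sized window;
    the first and last half-window points are left unsmoothed."""
    half = window // 2
    out = list(signal)
    for i in range(half, len(signal) - half):
        out[i] = sum(signal[i - half:i + half + 1]) / window
    return out

# A 1-point peak of height 3: smoothing lowers it and widens its base.
print(moving_average([0, 0, 3, 0, 0], 3))  # [0, 1.0, 1.0, 1.0, 0]
```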
3.2 - Filtering in Fourier domain
Filtering was carried out in the Fourier domain [14]. The filtering method consists of applying a low-pass
filter [15] in the frequency domain: a cutoff frequency is selected, above which the Fourier coefficients
are discarded. The cutoff frequency value was here automatically calculated on the basis
of the power spectra (PS). The power spectrum of a function is the measurement of the signal energy at
a given frequency. The narrowest peaks of interest in the signal determine the minimum cutoff frequency
that must be kept in the Fourier domain. The energy corresponding to the non-informative peaks is calculated,
and the power spectra are used to determine which frequencies should be kept depending on this value.
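The low-pass step can be sketched as follows (NumPy FFT; the automatic estimation of the cutoff from the power spectrum is not reproduced here, the cutoff index is passed in directly):

```python
import numpy as np

def lowpass_fourier(signal, cutoff):
    """Low-pass filter in the Fourier domain: coefficients at and above
    the cutoff frequency index are set to zero before reconstruction."""
    F = np.fft.rfft(signal)
    F[cutoff:] = 0.0
    return np.fft.irfft(F, n=len(signal))

# A slow 'informative' component plus a fast component standing in for noise.
x = np.arange(256)
low = np.sin(2 * np.pi * 3 * x / 256)
noise = 0.5 * np.sin(2 * np.pi * 60 * x / 256)
denoised = lowpass_fourier(low + noise, cutoff=10)
```

Here every frequency bin above index 10 is zeroed, so the fast component disappears while the slow one is recovered essentially exactly.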
3.3 - Filtering in Wavelet Domain
The main steps of signal denoising in Wavelet domain are the decomposition of the signal, the
thresholding, and the reconstruction of the denoised signal [16].
The wavelet transform of a discrete signal f is obtained by :
w = W f (8)
where w is a vector containing wavelet transform coefficients and W is the matrix of the wavelet filter
coefficients.
The coefficients in W are derived from the mother wavelet function. The Daubechies family wavelet
was used here. To choose the relevant wavelet coefficients (those related to the signal) a threshold
value is calculated. Many methods are available. This was done here using the method known as
universal thresholding [17] (ThU) in which the threshold level is calculated from the standard deviation
of the noise. Once the threshold is known, two different approaches are generally used, namely hard
and soft thresholding. Soft thresholding [18] was used here; in this case, the wavelet coefficients above
the threshold are reduced in magnitude by a quantity equal to the threshold value, and those below it are set to zero.
When the relevant wavelet coefficients wt are determined, the denoised signal ft can be rebuilt as :
ft = W’ wt (9)
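The decompose / threshold / reconstruct sequence of equations (8) and (9) can be sketched as below. Two assumptions are made to keep the sketch short: a one-level Haar transform stands in for the Daubechies wavelet used in the thesis, and the noise standard deviation sigma is taken as known rather than estimated.

```python
import numpy as np

def haar_step(f):
    """One level of the orthonormal Haar transform (eq. 8, w = W f).
    Haar is used here in place of the Daubechies family, for brevity."""
    a = (f[0::2] + f[1::2]) / np.sqrt(2.0)  # approximation coefficients
    d = (f[0::2] - f[1::2]) / np.sqrt(2.0)  # detail coefficients
    return a, d

def inv_haar_step(a, d):
    """Inverse transform (eq. 9, f_t = W' w_t)."""
    f = np.empty(2 * len(a))
    f[0::2] = (a + d) / np.sqrt(2.0)
    f[1::2] = (a - d) / np.sqrt(2.0)
    return f

def denoise_soft(f, sigma):
    """One-level wavelet denoising: universal threshold, then soft
    thresholding of the detail coefficients."""
    a, d = haar_step(np.asarray(f, dtype=float))
    thr = sigma * np.sqrt(2.0 * np.log(len(f)))        # universal threshold
    d = np.sign(d) * np.maximum(np.abs(d) - thr, 0.0)  # soft thresholding
    return inv_haar_step(a, d)
```

With sigma = 0 the threshold vanishes and the signal is reconstructed exactly, which checks that the transform pair is consistent; with a realistic sigma, the noise carried by the detail coefficients is suppressed.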
4 - Experimental
4.1 - Data set
The data set was made of synthetic mixtures prepared from products previously analysed by gas
chromatography in order to assess their purity. Those mixtures were designed to cover a wide range of
concentrations representative of all the possible situations on the process. Only the spectra of the
“pure” products and the binary mixtures are required to build the model in case of the classical
calibration method. For all the inverse calibration methods, all the samples (except the replicates) are
used in the model building phase.
The data set consists of 52 spectra :
- 1 spectrum for each of the 5 pure products (toluene, meta-, para-, and ortho-xylene, and ethyl-benzene)
- 9 spectra of binary p-xylene / m-xylene mixtures (concentrations from 10/90 to 90/10 with a 10%
step)
- 10 equimolar binary mixtures consisting of all binary mixtures which can be prepared from the five
pure products.
- 10 equimolar ternary mixtures
- 5 quaternary mixtures
- 1 mixture including the five constituents
- 10 replicates of randomly chosen mixtures
Raman spectra were recorded using a spectroscopic device quite similar to the one industrially used in
the ELUXYL separation process. The main differences are that a laser diode (SDL 8530) emitting at
785 nm was used instead of an argon ion laser (514.53 nm), and a 1 meter long optical fibre replaced
the 200 meters one used on the process. The back-scattered Raman signal was recovered through a Super
DILOR head equipped with interferential and Notch filters to prevent the Raman signal of silica from being
superimposed on the Raman signal of the sample. The grating spectrometer was equipped with a CCD
camera used in multiple spectra configuration. The emission line of a neon lamp could therefore also be
recorded to allow wavelength calibration. The spectra were acquired from 930 to 650 cm-1, no rotation
of the grating was needed to cover this spectral range. The maximum available power at the output of
the fibre connected to the laser is 250 mW. However, in order to prevent any damage to the filters, this
power was reduced to a sample excitation power of 30 mW. Each spectrum was acquired during 10
seconds. This corresponds to the conditions on the industrial process, considering that concentration
values have to be provided by the system every 15 seconds. The five remaining seconds should be
enough for data treatment (possible pre-treatment, and concentration predictions).
The wavelength domain retained in the spectra was specifically designed to fit the requirements of the
classical calibration method. Thanks to the relatively simple structure of Raman spectra, it is sometimes
possible to find a spectral region in which each of the peaks is readily assignable to one product of the
mixture, and where there is not too much overlap. The spectral region has therefore been chosen so that
each product is represented mainly by one peak (Fig. 2). There are at least two frequency regions with
no Raman back-scattering in this domain. This allows an easy recovery of the baseline. The spectral
domain studied was in any case very restricted because of the focal length of the instrument and the
dispersion of the grating.
Fig. 2. Spectra of the five pure products in the selected spectral domain. (2a) toluene, (2b) m-xylene, (2c) p-xylene, (2d) o-xylene, (2e) ethyl-benzene.
4.2 - Normalisation of the Raman spectra
It is known that the principal source of instability in the intensity of the Raman scattering is possible
variation in the intensity of the laser source. This makes it necessary either to normalise the spectra or to
perform semi-quantitative measurements. In this study, repeatability has been evaluated using replicate
measurements performed over a period of time of several days. This indicated some instability leading
to a variation of about 2% in the Raman scattering intensity. It is therefore probable that a
normalisation would have been desirable. However, given the spectral domain accessible with the
instrument used, and the difference in the cross section of the substances present in the mixtures, a
normalisation performed using for instance the total surface of the peaks was not considered. It was
therefore necessary to study the improvement of the inverse calibration methods compared to the
classical method without normalising the Raman spectra.
4.3 - Spectral shift correction
Variation in ambient temperature has an effect on the monochromator present in the Raman
spectrometer, and produces a spectral shift. The first part of the spectra is then used to perform a
correction. The first 680 points (out of 1360) of each spectrum are not related to the studied mixture,
but to the radiation from a neon lamp (Fig. 3).
Fig. 3. Raman spectrum of a typical mixture.
The spectrum of this lamp shows very narrow peaks whose wavelengths are perfectly known. The
maximum of the most intense peak can be determined very precisely, and the spectrum is then shifted
in such a way that this maximum is set to a given value. This is called the neon correction. At the end
of the pre-treatment procedure, some small spectral regions on the extremities of the spectra were
removed (from 930 to 895 cm-1 and from 685 to 650 cm-1). It was possible to remove these data points
as they are known to be uninformative (containing no significant Raman emission from any of the
compounds). The resulting spectra consisted of 500 points (Fig. 4).
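The principle of the neon correction — locate the reference peak and shift the spectrum so that its maximum lands on a fixed index — can be sketched as follows (plain Python, whole-point shifts with edge padding, function name ours; the actual procedure locates the neon line maximum much more precisely than to the nearest point):

```python
def shift_to_reference(spectrum, ref_index):
    """Shift a spectrum (a list of intensities) so that its most intense
    point lands on ref_index; edges are padded with the end values."""
    peak = max(range(len(spectrum)), key=lambda i: spectrum[i])
    shift = ref_index - peak
    if shift >= 0:
        return [spectrum[0]] * shift + spectrum[:len(spectrum) - shift]
    return spectrum[-shift:] + [spectrum[-1]] * (-shift)
```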
Fig. 4. Raman spectra of a synthetic mixture after “neon correction” (PX = p-xylene, 21.38 %; T = toluene, 20.13 %; EB = ethyl-benzene, 18.07 %; OX = o-xylene, 19.93 %; MX = m-xylene, 20.33 %).
5 – Results and discussion
In all cases, separate models were built for each of the products. The results are given in terms of
percentage of the result obtained with the classical calibration method. Results lower than 100% mean
a lower RMSECV. The first and second derivative did not yield any improvement in the predictive
ability of the models. Other methods, like PLS II [10] or Uninformative Variable Elimination PLS [19]
(UVE PLS), were also used but did not lead to better models.
5.1 - Classical method
This method applies classical multivariate calibration. The intensities of the peaks are represented as
the result of the presence of a given number of chemical components with a certain concentration and a
given cross section. As can be seen in system (4) and equation (5), according to the model built using
this method, the mixture can contain only those components. Impurities that might be present are not
taken into account, as the sum of the concentrations of the modelled components is always 100%. This
method takes into account the variation of the laser intensity and always uses relative concentrations.
These results were computed from the values given by the software after the spectra acquisition with a
calibration performed using spectra from this data set. The results of this method are taken as reference.
The RMSEP values for all the products are therefore set to 100 %.
5.2 - Univariate Linear Regression
Linear regression models were built to relate the concentration of each of the products to the maximum
of the corresponding peak, and to the average Raman scattering value of 3- to 7-point spectral windows
centred on this maximum (Table 1). Compared to those obtained with the classical multivariate
method, the results obtained with linear regression are comparable for some compounds (toluene,
o-xylene), worse for some others (m-xylene, p-xylene) and better in one case (ethyl-benzene). These
differences are due to the fact that models built here are univariate models, therefore not taking into
account overlapping between peaks.
Table 1. Relative RMSECV calibration results obtained using Linear Regression applied to the wavelength corresponding to the maximum of each peak and to the sum of the integrated intensities of 3- to 7-point spectral windows centred on this maximum. The wavenumber corresponding to the maximum of the peak is also given.
                   toluene   m-xylene   p-xylene   o-xylene   eth-benzene
Maximum (cm-1)       790       728        833        737        774
RMSECV 1 point       98.1      211.1      213.9      133.0      66.7
RMSECV 3 points      101.3     204.6      211.5      131.9      63.6
RMSECV 5 points      102.5     193.7      211.8      131.6      68.3
RMSECV 7 points      103.9     162.8      210.8      129.3      68.5
5.3 - Stepwise MLR
Stepwise-MLR appeared to give the best results (table 2). The models built with a critical level of 1 %
are parsimonious (between 1 and 4 variables retained) and all give better results than the ones obtained
with the previous methods except in the case of p-xylene. This model is built retaining only one variable. A
slightly less parsimonious model could be expected to give better results without a significant loss of
robustness.
Table 2. Relative RMSECV calibration results obtained for each of the five products using Stepwise Multiple Linear Regression.
                     toluene   m-xylene   p-xylene   o-xylene   eth-benzene
α = 1 %  RMSECV        96.1      72.2       155.4      69.2       73.6
         # variables   3         4          1          3          2
α = 5 %  RMSECV        23.5      6.6        0.2        9.1        0.0
         # variables   20        22         34         15         35
As expected, the models built with α = 5 % retain more variables. But here, the number of retained
variables is far too high: the models are dramatically overfitted. Moreover, the RMSECV values are so low
that they cannot be considered relevant. Those results are only possible because RMSECV is not
used in the variable selection step of the method. It is only used after the model is built to evaluate its
predictive ability.
The possibility of variables selected by chance correlation was then investigated. Variable selection
methods can retain irrelevant variables because of chance correlation. It has been shown that a
Stepwise selection applied to a simulated X spectral matrix filled with random values and a random Y
concentration matrix will still lead to a certain number of variables being retained [20-21]. The cross
validation performed on the obtained model will even lead to a very good RMSECV result. This can also happen
with more sophisticated variable selection methods like Genetic Algorithms [22]. It was shown that this
behaviour is far less frequent for methods working on the whole spectrum, like PCR or PLS [23].
This is actually what happens in this study. For instance, on the m-xylene model (22 variables
retained), some variables that should not be considered as informative (not located on one of the peaks,
low Raman intensity) have a quite high correlation coefficient with the considered concentration (table
3). Those variables also have high regression coefficients, so that although the Raman intensity at
those wavelengths is quite low (many of them are located in the baseline), they take on a significant
importance in the model.
Table 3. Model built with Stepwise selection for m-xylene (first 18 variables only). The correlation coefficient and the regression coefficient for the selected variables are also given.
Using the regression coefficient obtained for a variable, and the average Raman intensity for the
corresponding wavelength, it is possible to evaluate the weight this variable has in the MLR model
(table 4). One can see that the relative importance of variable 80, selected in eighth position, is about
one third of the importance of the first selected variable. This relative importance explains why the last
selected variables are still considered relevant and lead to a dramatic improvement of the RMSECV. In
this particular case, this is not the sign of a better model, but this shows the failure of cross validation
combined with backward elimination.
Table 4. Evaluation of the relative importance of selected variables in the MLR model built with Stepwise variable selection for m-xylene.
Order of    Index of    Correlation    Regression    Raman        Weight in
selection   variable    coefficient    coefficient   intensity    the model
1           398         0.9981         0.0298        1029.2       30.67
4           493         0.1335         0.9663        8.01         7.74
8           80          -0.69          -3.01         3.41         -10.26
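The "Weight in the model" column of table 4 is the product of the regression coefficient and the average Raman intensity, which is easy to verify:

```python
# (regression coefficient, average Raman intensity) for the rows of table 4
rows = [(0.0298, 1029.2), (0.9663, 8.01), (-3.01, 3.41)]
weights = [round(coef * intensity, 2) for coef, intensity in rows]
print(weights)  # [30.67, 7.74, -10.26] — the 'Weight in the model' column
```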
5.4 - PCR VS and PLS
Calibration models were built with PCR VS and PLS (table 5). These two models gave comparable
results (except for p-xylene) and usually required 4 latent variables, except for ethyl-benzene, which
required 7. These complexities do not appear to be especially high for models predicting the
concentration of a product in a five-compound mixture. Using more latent variables for ethyl-benzene
is logical because its peak is the broadest and the most overlapped by other peaks. It is also the peak
with the smallest Raman scattering intensity, and it therefore has the worst signal/noise ratio.
Compared to Stepwise MLR with α = 1 %, those latent variable methods gave systematically worse
results, except in the case of p-xylene.
Table 5. Relative RMSECV calibration results obtained using Principal Component Regression with Variable Selection (the PCs are given in the order in which they are selected) and Partial Least Square.
                       toluene   m-xylene   p-xylene   o-xylene   eth-benzene
PCR VS  RMSECV           128.1     92.3       208.9      125.6      84.4
        Selected PCs     3 4 2 1   2 1 4 3    1 3 2 4    1 2 3 4    4 3 2 1
PLS     RMSECV           112.43    75.84      149.2      108.8      102.6
        # factors        4         4          5          4          7
5.5 - Improved variable selection
The modified Stepwise selection method made it possible to improve the MLR models built for a critical level
of 5 %. The models are more parsimonious and the RMSECV values seem much more physically
meaningful (table 6).
Table 6. Relative RMSECV calibration results obtained for each of the five products using Stepwise Multiple Linear Regression with the improved variable selection method.
                     toluene   m-xylene   p-xylene   o-xylene   eth-benzene
α = 5 %  RMSECV        80.5      41.9       82.6       69.2       59.6
         # variables   7         11         9          3          6
Some new variables are built with spectral windows of 3 to 13 points (table 7). This enlargement of
variables happens in each model for the maximum of the peak corresponding to the modelled
compound, but also for variables in the baseline or at the location of other peaks. However, this
approach does not seem to solve the problem completely. For some models, variables are still retained
because of chance correlation, leading to excessively high complexities in some cases (11 variables for
m-xylene).
Table 7. Complexity of the MLR calibration models built using variables selected with the modified stepwise selection method. Size is the size of the spectral window centred on the variable and used as the new variable.
Genetic Algorithms were used with the following input parameters: number of strings in each
generation: 20; number of variables in each string: 10; frequency of cross-over: 50 %; mutation: 2 %;
backward elimination: once every 20 generations; number of generations: 200. The models
obtained are much better than the α = 5 % Stepwise-MLR models in terms of complexity. However,
the complexities are still high (table 8), which seems to indicate that the G.A. selection is also affected
by random correlation. Moreover, the RMSECV values are comparable with those obtained with the α
= 1 % Stepwise MLR model, but they are worse than those obtained with the modified Stepwise
approach. Globally, the G.A. approach is therefore not more efficient than the modified Stepwise
procedure.
Table 8. Relative RMSECV calibration results obtained for each of the five products using Genetic Algorithm Multiple Linear Regression.
                     toluene   m-xylene   p-xylene   o-xylene   eth-benzene
α = 5 %  RMSECV        109.1     82.9       179.8      98.8       78.9
         # variables   5         9          8          9          5
5.6 - Improved signal pre-processing
Another possibility to avoid the inclusion of noise variables in MLR is to decrease the noise by signal
pre-processing. By plotting the difference between a spectrum and the average of the three replicates of
the same sample, one can obtain an estimation of the noise structure (Fig. 5). It appears that the noise
variance is not constant along the spectrum but heteroscedastic: it increases as the signal of interest
increases. Unfortunately, it is not possible in practice to use the average of several spectra to achieve a better
signal/noise ratio, because this would lead to acquisition times incompatible with the kinetics of the
process.
Fig. 5. Para-xylene spectrum (5a) and estimation of the noise for this spectrum (5b).
Smoothing by moving average was used to reduce the noise in the signal. The optimisation of the
window size was done for each compound individually using PCR VS and PLS models. The optimal
size for the smoothing window is 5 points. For this window size, the RMSECV values of the
PCR VS and PLS models are slightly improved (table 9).
Table 9. Relative RMSECV calibration results for PCR VS (the PCs are given in the order in which they are selected) and PLS models. Spectra smoothed using a 5-point window moving average.

                     toluene   m-xylene   p-xylene   o-xylene   eth-benzene
PCR VS  RMSECV         95.5      83.9       152.9      70.6       68.7
        PCs            3 4 2 1   2 1 4 3    1 3 2 4    1 2 3 4    4 3 2 1
PLS     RMSECV         95.7      94.2       154.4      69.9       52.3
        # factors      4         4          5          4          7
The complexities are unchanged, showing that no extra component was added because of noise. In the
case of Stepwise MLR, the model complexities are reduced, but the Stepwise variable selection method
is still subject to chance correlation with those smoothed data (table 10).
Table 10. Relative RMSECV results for Stepwise MLR models. Spectra smoothed using a 5-point window moving average.

                     toluene   m-xylene   p-xylene   o-xylene   eth-benzene
α = 1 %  RMSECV        96.1      72.2       155.4      69.2       73.6
         # variables   3         4          1          3          2
α = 5 %  RMSECV        73.5      38.9       82.6       52.9       45.9
         # variables   8         18         9          8          10
Some of those models seem quite reasonable. For instance, the model built for toluene uses 8
variables and gives a relative RMSECV of 73.5 %, but more importantly, the wavelengths retained seem
to have a physical meaning (Fig. 6). The first wavelength selected is located on the peak maximum, the
second takes into account the overlapping due to the p-xylene peak, the third is on the baseline, the
fourth takes into account the overlapping due to the ethyl-benzene peak, and three extra variables are
selected around the peak maximum.
Fig. 6. Wavelengths selected by the Stepwise selection method for the toluene model, and order of selection of those variables displayed on the spectrum of a typical mixture containing all 5 components.
On the other hand, for some models, the method has retained variables in a much more surprising way.
In the case of the model built for m-xylene, for instance, 18 variables are retained. Most of those
variables are located in non-informative parts of the spectra (Fig. 7) and are selected because of chance
correlation. In that case, the denoising has not been efficient and chance correlation still occurs.
Fig. 7. Wavelengths selected by the Stepwise selection method for the m-xylene model, and order of selection of those variables displayed on the spectrum of a typical mixture containing all 5 components.
In order to check if the optimal smoothing window size is the same for PCR/PLS and Stepwise MLR,
the fitness of the Stepwise-MLR models was evaluated depending on this parameter (table 11). The
results show again that, because smoothing by moving average modifies the shape and height of the
peaks, this kind of smoothing can lead to degradation of the models. The optimal window size is in any
case not the same for all of the models, and it is difficult to find a typical behaviour in the calibration
results.
Table 11. Complexity and performance (relative RMSECV) of Stepwise MLR models (α = 5 %) depending on the window size used for smoothing by moving average. The best model in terms of complexity and RMSECV value for each constituent is written in bold.
To apply filtering in the Fourier domain, a slightly wider spectral region had to be retained (removing
fewer points at the extremities of the original data after neon correction) in order to set the number of
points in the spectra to 512 (2^9 points). The Stepwise-MLR models obtained using the denoised spectra
(Fig. 8) are by far better, especially in terms of complexity. The models are much more parsimonious,
with only 3 to 5 wavelengths retained, and the RMSECVs are the best obtained for all the substances
(table 12).
Fig. 8. Example of a typical spectrum of a five compounds mixture before (8-a) and after (8-b) denoising in the Fourier domain.
Table 12. Relative RMSECV calibration results obtained with Stepwise MLR applied to data denoised in the Fourier domain.
                     toluene   m-xylene   p-xylene   o-xylene   eth-benzene
α = 1 %  RMSECV        81.6      70.2       165.0      87.3       69.3
         # variables   5         4          3          4          3
α = 5 %  RMSECV        81.6      70.2       145.8      65.1       69.3
         # variables   5         4          5          4          3
Some models built with a critical level α = 1 % are exactly identical to those built with α = 5 %. The
fact that increasing the critical level does not lead to selecting more variables could mean that the
models are optimal. Some are slightly worse for equal or smaller complexity. PCR VS and PLS models
were also built using the filtered spectra in order to check whether those methods would benefit from this
pre-treatment (table 13). It appears that the PCR VS and PLS models built on denoised data are equivalent
to or worse than the ones built on raw data. This probably means that the denoising was too extensive in
the case of a full-spectrum method. The benefit of removing noise was lost because the peak
shapes were damaged. In this case the pre-treatment has a deleterious effect on the resulting model.
Table 13. Relative RMSECV calibration results obtained with PCR VS (the PCs are given in the order in which they are selected) and PLS (the number of factors retained is given) on the spectra denoised in the Fourier domain
                       toluene   m-xylene   p-xylene   o-xylene   eth-benzene
PCR VS  RMSECV           159.6     87.4       205.1      111.2      132.3
        PCs              3 4 2 1   2 1 4 3    1 3 2 4    1 2 3 4    4 3 2 1
PLS     RMSECV           146.3     87.3       154.0      89.1       101.8
        # factors        5         4          5          5          5
The same spectra (512 points) were used to perform filtering in the wavelet domain. The Daubechies
family wavelet was used on the first level of decomposition only (Fig. 9). Higher decomposition levels
were investigated, but this did not lead to better models. The results obtained are generally good (table
14). However, both the complexities and the RMSECV values are worse than in the case of filtering in the
Fourier domain, except for p-xylene. In the case of o-xylene, only three variables are retained; this is
the same complexity as in the Stepwise-MLR model built with a critical level of 1 % on the data before
denoising, but the RMSECV is worse for the denoised data. This could be expected from looking at
the denoised spectra. Spectra denoised in the wavelet domain (Fig. 9-b) have a more angular shape
than those denoised in the Fourier domain (Fig. 8-b). This indicates that the shape of the peaks is
probably more affected by the wavelet pre-treatment. Filtering in the wavelet domain can therefore be
considered here as less efficient than denoising in the Fourier domain.
Fig. 9. Example of a typical spectrum of a mixture of five compounds before (9-a) and after (9-b) denoising in the wavelet domain.
Table 14. Relative RMSECV calibration results obtained with Stepwise MLR (α = 5 %) applied to data denoised in the wavelet domain.
toluene m-xylene p-xylene o-xylene eth-benzene
RMSECV 92.5 52.9 101.4 107.9 49.6
# variables 4 7 5 3 7
6 - Conclusion
Inverse calibration methods were used on Raman spectroscopic data in order to model the
concentrations of individual compounds in a mixture of C8 compounds. These methods outperformed the
classical calibration method currently used. In the classical calibration method, the sum of the relative
concentrations of the modelled components is always 100 %; impurities are not taken into account.
inverse calibration, the concentrations are assumed to be a function of the spectral values (Raman
scattering). Therefore, a perfect knowledge of the composition of the system is not necessary and the
presence of possible impurities should not be a problem anymore. This is the main limitation of
classical multivariate calibration and the main reason why an even more significant improvement can
be expected when using inverse calibration methods on real data containing impurities. Moreover, the
acquisition conditions and the spectral region studied were chosen based on constraints related to the
instrument, the industrial process and the calibration method used. These conditions were therefore not
optimal for this study. In fact, inverse calibration methods would probably have benefited from using
more information on a wider spectral region. It can be expected that, for a given substance, calibration
performed on several informative peaks would outperform the current models. Another interesting
point is that the total integrated surface of a complex Raman spectrum is directly related to the intensity
of the excitation source. Working in a wider spectral region would allow performing a standardisation
of the spectra to take into account the effect of variations of the laser intensity. This would probably
have improved significantly the calibration results. This will be investigated in a second part of this
study, using an instrument with better performance, particularly in terms of the spectral region covered.
The very specific and simple structure of Raman spectra meant that the most sophisticated methods were not the most efficient: Stepwise Multiple Linear Regression was shown to lead to the best models. One problem is that the Stepwise variable selection method is disturbed by noise in the spectra, which induces the selection of chance-correlated variables. This problem was efficiently resolved by denoising. Whatever denoising method is used, the procedure should always be seen as a compromise between actual noise removal (which improves the performance of the model) and alteration of peak shapes and heights (which is deleterious to the resulting model). The best method for this purpose appeared to be filtering in the Fourier domain. The problems related to noise may also disappear when the instrument with better performance is used, as the signal-to-noise ratio will be much higher.
REFERENCES
[1] Ph. Marteau, N. Zanier, A. Aoufi, G. Hotier, F. Cansell, Vibrational Spectroscopy 9 (1995) 101.
[2] Ph. Marteau, N. Zanier, Spectroscopy 10 (1995) 26.
[3] N. R. Draper, H. Smith, Applied Regression Analysis, second edition (Wiley, New York, 1981).
[4] R. Leardi, R. Boggia, M. Terrile, J. Chemom. 6 (1992) 267.
[5] D. Jouan-Rimbaud, D.L. Massart, R. Leardi, O.E. de Noord, Anal. Chem. 67 (1995) 4295.
[6] H. van der Voet, Chemom. Intell. Lab. Syst. 25 (1994) 313.
[7] T. Naes, H. Martens, J. Chemom. 2 (1988) 155.
[8] J. Sun, J. Chemom. 9 (1995) 21.
[9] J. M. Sutter, J. H. Kalivas, P. M. Lang, J. Chemom. 6 (1992) 217.
[10] H. Martens, T. Naes, Multivariate Calibration (Wiley, Chichester, 1989).
[11] D. M. Haaland, E. V. Thomas, Anal. Chem. 60 (1988) 1193.
[12] P. Geladi, B. R. Kowalski, Anal. Chim. Acta 185 (1986) 1.
[13] A. Savitzky and M. J. E. Golay, Anal. Chem. 36 (1964) 1627.
[14] G. W. Small, M. A. Arnold, L. A. Marquardt, Anal. Chem. 65 (1993) 3279.
[15] H. C. Smit, Chemom. Intell. Lab. Syst. 8 (1990) 15.
[16] C. R. Mittermayer, S. G. Nikolov, H. Hutter, M. Grasserbauer, Chemom. Intell. Lab. Syst. 34
(1996) 187.
[17] D. L. Donoho, in: Y. Meyer, S. Roques (Eds.), Progress in Wavelet Analysis and Applications (Editions Frontières, 1993).
[18] D. L. Donoho, IEEE Transactions on Information Theory 41 (1995) 613.
[19] V. Centner, D. L. Massart, O. E. de Noord, S. de Jong, B. M. V. Vandeginste, C. Sterna, Anal.
Chem. 68 (1996) 3851.
[20] J. G. Topliss, R. J. Costello, Journal of Medicinal Chemistry 15 (1971) 1066.
[21] J. G. Topliss, R. P. Edwards, Journal of Medicinal Chemistry 22 (1979) 1238.
[22] D. Jouan-Rimbaud, D. L. Massart, O. E. de Noord, Chemom. Intell. Lab. Syst. 35 (1996) 213.
[23] M. Clark, R. D. Cramer III, Quantitative Structure-Activity Relationships 12 (1993) 137.
An industrial process separating p-xylene from mainly other C8 aromatic compounds is monitored with an online remote Raman analyser. The concentrations of six constituents are currently evaluated with a classical calibration method. Since the aim of the study was to improve the precision of the monitoring of the process, inverse linear calibration methods were applied to a synthetic data set in order to evaluate the improvement in prediction such methods could yield. Several methods were tested, including Principal Component Regression with variable selection, Partial Least Squares Regression, and Multiple Linear Regression with variable selection (Stepwise or based on a Genetic Algorithm). Methods based on selected wavelengths are of great interest because the obtained models can be expected to be very robust toward experimental conditions. However, because of the substantial noise in the spectra due to short accumulation times, the variable selection methods selected many irrelevant variables through chance correlation. Strategies were investigated to solve this problem and build reliable, robust models. These strategies include signal pre-processing (smoothing, and filtering in the Fourier or wavelet domain) and an improved variable selection algorithm based on the selection of spectral windows instead of single wavelengths when this leads to a better model. The best results were achieved with Multiple Linear Regression and Stepwise variable selection applied to spectra denoised in the Fourier domain.
KEYWORDS : Chemometrics, Raman Spectroscopy, Multivariate Calibration, random correlation.
1 - Introduction
The task of our group in this study was to evaluate whether the use of Inverse Calibration methods could
lead to an improvement in the quality of the online monitoring of the Eluxyl process.
The process is currently monitored using the experimental setup and software developed by Philippe
Marteau. This software implements a classical multivariate calibration method based on the measurement
of the areas of the Raman peaks. The main assumption when building a classical calibration model to
determine concentrations from spectra is that the error lies in the spectra. The model can be seen as :
Spectra = f (Concentrations). Or, in a matrix form: R = C . K , where R is the spectral response matrix, C
the concentration matrix, and K the matrix of molar absorptivities of the pure components. This implies
that it is necessary to know the concentrations of all the products present in the mixture in order to build
the model, at least if a high precision is required. Taking into account that a small quantity (< 5 %) of impurities (i.e. C9+ compounds) is present in the mixture when working on real data, the classical calibration method is inherently limited in precision if all the impurities are not clearly identified in the spectrum.
In inverse calibration, one assumes that the error lies in the measurement of the concentrations. The model
can be seen as : Concentrations = f (Spectra). Or, in a matrix form : C = P . R , where R is the spectral
matrix, C the concentration matrix, and P the matrix of regression coefficients. A perfect knowledge of the composition of the system is then not necessary. Better results are therefore expected, as the presence of impurities does not affect the prediction of the concentrations of the compounds of interest (at least if these impurities were present in the calibration data set used to build the model).
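The difference between the two model forms can be illustrated with a small simulation. The following Python sketch is entirely hypothetical (matrix sizes, random pure spectra and impurity level are invented for illustration): it shows that when the impurity was present in the calibration data, the inverse model C = P·R can recover the known concentrations, while the classical model, which ignores the impurity, is biased.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated system: 3 modelled compounds + 1 unmodelled impurity,
# 40 mixtures, 20 spectral points (all values hypothetical).
n_mix, n_pts = 40, 20
K_known = rng.random((3, n_pts))        # pure spectra of modelled compounds
k_impurity = rng.random(n_pts)          # pure spectrum of the impurity
C_known = rng.random((n_mix, 3))
c_imp = 0.05 * rng.random(n_mix)        # small (< 5 %) impurity level

R = C_known @ K_known + np.outer(c_imp, k_impurity)   # measured spectra

# Classical calibration: estimate K from the known concentrations
# (impurity ignored), then predict concentrations by least squares on K.
K_hat = np.linalg.lstsq(C_known, R, rcond=None)[0]
C_classical = np.linalg.lstsq(K_hat.T, R.T, rcond=None)[0].T

# Inverse calibration: regress the concentrations directly on the spectra
# (here via the pseudo-inverse; in practice PCR/PLS/MLR is used because
# the spectral columns are many and collinear).
P = np.linalg.pinv(R) @ C_known
C_inverse = R @ P

err_classical = np.sqrt(np.mean((C_classical - C_known) ** 2))
err_inverse = np.sqrt(np.mean((C_inverse - C_known) ** 2))
```

Because the impurity contribution lies in the space spanned by the spectra, the inverse regression absorbs it; the classical model, restricted to the three known pure spectra, cannot.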
2 – Data set used in this study
The data set was made of synthetic mixtures prepared from products previously analysed by gas
chromatography in order to assess their purity. Those mixtures were designed to cover a wide range of
concentrations representative for all the possible situations on the process. The data set consists of 71
spectra :
- 1 spectrum for each of the 5 pure products (toluene, meta-, para-, and ortho-xylene, and ethylbenzene)
- 10 equimolar binary mixtures consisting of all binary mixtures which can be prepared from the five
pure products.
- 10 equimolar ternary mixtures
- 5 equimolar quaternary mixtures
- 1 equimolar mixture including the five constituents
- 9 spectra of binary para-xylene / meta-xylene mixtures (concentrations from 10/90 to 90/10 with a
10% step)
- 5 spectra of binary toluene / meta-xylene mixtures (concentrations from 10/90 to 90/10 %)
- 10 replicates of randomly chosen mixtures
- 16 mixtures including the five constituents with various concentrations
Spectra were acquired from 0 to 3400 cm-1 with a 1.7 cm-1 step. After interpolation by the instrument
software, the spectra had a 0.3 cm-1 step, leading to 11579 data points per spectrum.
3 – Pre-processing
The main problem with this data set was the instability of the excitation source during the acquisition of the spectra. The laser used for excitation was ageing: it could deliver only one half of its nominal power at the beginning of the acquisition period, and only one fourth at the end. This is not a problem when relative concentrations have to be evaluated, as is the case with the software developed by Philippe Marteau, but it has to be solved when one wants to evaluate absolute concentrations. The best solution would be a reference sample, independent from the sample studied but measured at the same time with the same excitation source; the spectra could then be corrected for the intensity variations of the source. Such a reference was not available here. The only way left to normalise the spectra was to work on their surface. This would have been easier if the mixtures studied had contained many products, leading to very complex spectra whose total surfaces could have been considered constant; in that case it would have been enough to scale all the spectra to a given value. In the present case, the small number of substances with very different cross-sections forbids such a methodology. It was therefore necessary to find a part of the spectra with a sufficiently constant surface so that the scaling can be
performed according to this part only. The choice was made empirically, testing the results of a
benchmark method on data normalised according to the surface of a given part (or given parts) of the
spectra. The benchmark method chosen was Principal Component Regression with Variable Selection
(PCR-VS) [1,2,3,4]. Suitability of the pre-processing was evaluated according to the results of models
built for para-xylene. Five zones were defined in the spectra (Fig. 1) :
Zone 1 : 0-160 cm-1 → nothing
Zone 2 : 160-1700 cm-1 → actual spectra
Zone 3 : 1700-2500 cm-1 → baseline
Zone 4 : 2500-3200 cm-1 → C-H range
Zone 5 : 3200-3400 cm-1 → noise
Fig. 1. Spectra of the products with
the 5 spectral domains defined.
The models were built using 4 to 7 principal components. The results are given in terms of Root Mean
Squared Error of Cross Validation (RMSECV).
Table 1. RMSECV for a PCR-VS model built for para-xylene, depending on the standardisation.
Reference zone(s)     RMSECV value    Number of PCs retained
No standardisation    4.23            7
1 + 2 + 3 + 4 + 5     0.69            4
2                     0.79            5
3                     1.54            4
4                     0.64            4
2 + 4                 0.56            5
It appears that the best way to normalise the spectra is to scale them according to the total surface corresponding to zones 2 and 4 (actual spectra and C-H range). Such a correction is critically important, as it improves the results by a factor of 10. However, the solution is almost certainly not optimal since, as said before, the assumption that the total surface of these two spectral zones is constant is not strictly valid.
After this standardisation procedure, the baseline shift visible in spectral domain #3 was almost
perfectly removed. The use of a specific baseline removal procedure did not further improve the
calibration results.
The spectra were corrected for wavelength shift using the corresponding neon spectra. However, with spectra from this new experimental setup, this correction turned out to be far less crucial than for the previous Eluxyl Raman data we investigated.
4 – Choice of the calibration method to be used
In a previous study performed on a synthetic data set simulating Eluxyl data, it had been shown that the most effective calibration method was Stepwise Multiple Linear Regression (Stepwise-MLR) [5] applied to spectra denoised in the Fourier domain. At that time, no non-linearity had been detected in the data set. Considering the much better signal-to-noise ratio and repeatability of this data set, it was necessary to investigate non-linearity again. In fact, it now appears very clearly that the mixture effects are not linear. This is the case, for instance, for meta-xylene/para-xylene mixtures. The results of a PCR-VS model of meta-xylene show a clear deviation from linearity (Fig. 2-a) on the first PC. This is especially visible for samples 2-3 and 32 to 40 (corresponding to pure meta- and para-xylene, and binary meta/para mixtures with various concentrations). Adding more components to the model (Fig. 2-b,c) tends to accommodate the non-linearity, but even for the optimal 4-component model, the non-linearity was not completely corrected (Fig. 2-d).
Fig. 2-a. Y vs Yhat, PCR-VS model for meta-xylene, 1 component.
Fig. 2-b. Y vs Yhat, PCR-VS model for meta-xylene, 2 components.
Fig. 2-c. Y vs Yhat, PCR-VS model for meta-xylene, 3 components.
Fig. 2-d. Y vs Yhat, PCR-VS model for meta-xylene, 4 components.
Because of these non-linearities, linear methods such as PCR-VS, Partial Least Squares Regression (PLS) [6-8] and Stepwise MLR did not lead to good results (RMSECV values always around 0.5). It was therefore decided to work with non-linear methods, the most representative of which are artificial Neural Networks (NN) [9,10]. Individual models were built for each of the compounds, using the scores of a PCA as input variables. PCA was applied to the spectra limited to their informative parts (spectral ranges 2 and 4), after column centering. The calibration results, given in terms of Root Mean Squared Error of Monitoring, are much better than those obtained with linear methods (Table 2).
Linear and non-linear calibration methods (Principal Component Regression, Partial Least Squares Regression and Neural Networks) were applied to a slightly non-linear Raman data set. Because of the large size of this data set, recently introduced linear calibration methods specifically optimised for speed were also used. These fast methods achieve their speed improvement by using the Lanczos decomposition for the singular value decomposition steps of the calibration procedures and, for some of their variants, by optimising the models without cross-validation. The linear methods could deal with the slight non-linearity present in the data by including extra components, and therefore performed comparably to Neural Networks. The fast methods performed as well as their classical equivalents in terms of precision in prediction, but the results were obtained considerably faster. It appeared, however, that cross-validation remains the most appropriate method for estimating model complexity.
KEYWORDS : Multivariate Calibration, Raman spectroscopy, Lanczos decomposition, Fast Calibration
methods.
Chapter 4 – New Types of Data : Structure and Size
1 - Introduction
Data sets treated by chemometricians tend to get larger and larger. The data set considered in our study contains 71 spectra that were acquired from 0 to 3400 cm-1 with a 1.7 cm-1 step. After interpolation by the instrument software, the spectra had a 0.3 cm-1 step, leading to 11579 data points per spectrum (Fig. 1). This number of variables was rounded to 10000 by removing points without physical significance at both extremities of the spectra. The data set consists of spectra of mixtures obtained from five pure products (benzene, toluene, and ortho-, meta- and para-xylene), previously analysed by gas chromatography in order to assess their purity. These mixtures were designed to cover a wide range of concentrations representative of all the possible mixtures that can be obtained with these five compounds, and specifically cover binary mixtures in order to investigate non-linear effects. The data set was split into calibration and test sets.
Fig. 1. Spectra of the five pure products.
The calibration set consists of 51 spectra :
- 1 spectrum for each of the 5 pure products.
- 10 equimolar binary mixtures consisting of all binary mixtures which can be prepared from the five
pure products.
- 10 equimolar ternary mixtures
- 5 equimolar quaternary mixtures
- 1 equimolar mixture including the five constituents
- 9 spectra of binary product 2 / product 3 mixtures (concentrations from 10/90 to 90/10 with a 10 %
step)
- 5 spectra of binary product 1 / product 2 mixtures (concentrations from 10/90 to 90/10 with a 20 %
step)
- 10 mixtures including the five constituents with various random concentrations
The test set used to assess the models predictive ability is made of 20 spectra :
- 20 mixtures including the five constituents with various random concentrations
The shape of the obtained design is shown in figure 2-a,b.
Fig. 2-a. Score plot of PC1 vs PC2 vs PC3. The test points are in a circle.
Fig. 2-b. Score plot of PC1 vs PC2 vs PC4. The test points are in a circle.
Calibration methods such as Principal Component Regression (PCR), Partial Least Squares Regression
(PLS) or Neural Networks were used on this data set. Apart from these usual methods, because of the
large size of this data set, it was also interesting to apply calibration methods specifically optimised for
speed. Such fast methods, derived from PCR and PLS, were recently proposed by Wu and Manne [1], who compared them to their classical equivalents on five near-infrared (NIR) data sets. The new methods reportedly achieved equivalent prediction results, using models with identical complexities, but the new algorithms were much faster. These fast methods were therefore applied in this study.
2 - Methods
2.1 - Principal Component Regression with variable selection (PCRS)
This method includes two steps. The original data matrix X(n,p) is approximated by a small set of
orthogonal Principal Components (PC) T(n,a). A Multiple Linear Regression model is then built relating
the scores of the PCs (independent variables) to the property of interest y(n) . The main difficulty of this
method is to choose the number of PCs that have to be retained. This was done here by means of Leave
One Out (LOO) Cross Validation (CV). The predictive ability of the model is estimated at several
complexities (models including 1, 2, … PCs) in terms of the Root Mean Square Error of Cross Validation (RMSECV), defined as:

RMSECV = √( Σ_{i=1}^{n} (ŷ_i − y_i)² / n )        (1)

where n is the number of calibration objects, y_i the known value of the property of interest for object i, and ŷ_i the value of the property of interest predicted by the model for object i.
The complexity leading to the smallest RMSECV is considered optimal in a first approach. In a second step, in order to avoid overfitting, more parsimonious models (smaller complexities, in which one or more of the last selected variables are removed) are tested to determine whether they can be considered equivalent in performance; a slightly worse RMSECV can then be compensated by the better robustness of the resulting parsimonious model. This is done using a randomisation test [2,3], which compares a prediction method at two different complexities. In the usual PCR [4], the variables are introduced into the model according to the percentage of spectral variance (variance in X) they explain. This is called top-down PCR. But the PCs explaining the largest part of the global variance in X are not always the most related to y. PCR with variable selection (PCRS) was therefore used in our study: in PCRS, the PCs are included in the model according to their correlation [5] with y, or their predictive ability [6].
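The PCRS procedure with RMSECV-based complexity selection can be sketched as follows. This Python sketch is a simplification of the procedure described above: the PCs are ranked by their absolute correlation with y, and the leave-one-out residuals are obtained from the closed-form expression e_i / (1 − h_ii) for a fixed linear model, i.e. the PCA itself is not refitted in each cross-validation round as a full implementation would do.

```python
import numpy as np

def pcr_vs_rmsecv(X, y, n_pcs):
    """PCR with PC selection by |correlation| with y; returns the LOO
    RMSECV computed from closed-form leave-one-out residuals."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U * s                                  # all PC scores
    corr = np.array([abs(np.corrcoef(T[:, j], yc)[0, 1])
                     for j in range(T.shape[1])])
    sel = np.argsort(corr)[::-1][:n_pcs]       # PCs most correlated with y
    Ts = T[:, sel]
    b, *_ = np.linalg.lstsq(Ts, yc, rcond=None)
    resid = yc - Ts @ b
    H = Ts @ np.linalg.pinv(Ts.T @ Ts) @ Ts.T  # hat matrix of score regression
    loo = resid / (1.0 - np.diag(H))           # closed-form LOO residuals
    return np.sqrt(np.mean(loo ** 2))

# Toy example: y depends on two directions of a random X
rng = np.random.default_rng(3)
X = rng.standard_normal((40, 30))
y = X[:, 0] + 0.5 * X[:, 1] + 0.01 * rng.standard_normal(40)

rmsecv = [pcr_vs_rmsecv(X, y, k) for k in range(1, 6)]
best = 1 + int(np.argmin(rmsecv))
```

The complexity with the smallest RMSECV would then be compared with more parsimonious models via the randomisation test, which this sketch does not implement.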
2.2 - Partial Least Squares Regression
Similarly to PCR, PLS [7] reduces the data to a small number of latent variables. The basic idea is to
focus only on the systematic variation in X that is related to y. PLS maximises the covariance between
the spectral data and the property to be modelled. De Jong’s modified version [8] of the original
NIPALS [9,10] algorithm was used in this study. As for PCR, the optimal complexity is determined by comparing the RMSECV obtained from models of various complexities. To avoid overfitting, this complexity is then confirmed or corrected by comparing the model leading to the smallest RMSECV with more parsimonious ones using a randomisation test.
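A minimal PLS1 implementation in the NIPALS style can be sketched as follows. De Jong's modified algorithm used in the study differs in detail; this Python sketch shows only the textbook sequence of weight, score, loading and deflation steps for a single y-variable.

```python
import numpy as np

def pls1_nipals(X, y, n_comp):
    """Basic PLS1: each component's weight vector maximises the covariance
    between Xw and y, after which X and y are deflated.  Returns the
    regression coefficients for the centred X."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    W, P, Q = [], [], []
    Xk, yk = X.copy(), y.copy()
    for _ in range(n_comp):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)      # weight: direction of maximal covariance
        t = Xk @ w                  # score
        p = Xk.T @ t / (t @ t)      # X-loading
        q = (yk @ t) / (t @ t)      # y-loading
        Xk = Xk - np.outer(t, p)    # deflation of X
        yk = yk - q * t             # deflation of y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    return W @ np.linalg.inv(P.T @ W) @ Q   # coefficients for centred X

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 10))
y = X @ rng.standard_normal(10) + 0.01 * rng.standard_normal(30)
B = pls1_nipals(X, y, n_comp=5)
y_fit = (X - X.mean(axis=0)) @ B + y.mean()
```

In practice the loop would be run for a range of `n_comp` values and the RMSECV compared, exactly as described above for PCR.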
2.3 - Fast PCR and PLS algorithms
The fast algorithms are based on the Lanczos decomposition scheme [11,12,13]. The Lanczos method is an efficient way to solve eigenvalue problems, with fast convergence properties when applied to a large, sparse, symmetric matrix A. The method generates a sequence of tridiagonal matrices T whose extreme eigenvalues are progressively better estimates of the extreme eigenvalues of A. The method is therefore useful when only a small number of the largest and/or smallest eigenvalues of A are required, as is the case in calibration methods, where the information present in a large X matrix has to be compressed into a small number of PCs. In the present case, the decomposition scheme is applied to A = X'X. The speed improvement is achieved only if T is much smaller than A; the Singular Value Decomposition (SVD) of T is then much faster than that of A, while leading to very similar eigenvalues. Two parameters have to be optimised when performing a Lanczos-based SVD: the size of the small tridiagonal matrix T, which corresponds to the number of Lanczos basis vectors to be estimated (nl), and the number of factors (PCs) to be extracted (nf), with nf ≤ nl. These parameters were estimated in two different ways. The first is based on LOO-CV, which was used to optimise first the size of the Lanczos basis (nl) and then the number of eigenvectors extracted from the resulting matrix (nf). A less time-consuming approach was also used: the iterations of the Lanczos algorithm were stopped before the loss of orthogonality between successive basis vectors becomes important enough to require special corrections. This behaviour of the Lanczos algorithm is well known; the loss of orthogonality leads to rounding errors that greatly affect the outcome of the method. With the size of the Lanczos basis (nl) set this way, the number of factors to be extracted from the resulting matrix (nf) was estimated based on model fit, by estimating how much each individual eigenvector contributes to the model of the property of interest.
The model optimised through the CV procedure is called PCRL (L stands for Lanczos), and the model obtained through the other approach is called PCRF (F stands for Fast). The PLS version of the fast algorithms is presented by the authors of the original article as a special case in which the full space of eigenvectors generated in the Lanczos basis is used, leading to nl = nf. The obtained models are denoted by PLSF.
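The Lanczos scheme applied to A = X'X can be sketched as follows. This Python sketch is a plain Lanczos tridiagonalisation without reorthogonalisation, which is precisely why the basis size must stay small, as discussed above; the matrix sizes and the basis size nl = 10 are arbitrary.

```python
import numpy as np

def lanczos(A, v0, nl):
    """Lanczos tridiagonalisation of a symmetric matrix A: build `nl` basis
    vectors and the tridiagonal matrix T whose extreme eigenvalues
    approximate those of A.  No reorthogonalisation is performed."""
    n = A.shape[0]
    V = np.zeros((n, nl))
    alpha = np.zeros(nl)
    beta = np.zeros(nl - 1)
    v = v0 / np.linalg.norm(v0)
    v_prev = np.zeros(n)
    b = 0.0
    for j in range(nl):
        V[:, j] = v
        w = A @ v - b * v_prev
        alpha[j] = v @ w                  # diagonal element of T
        w = w - alpha[j] * v
        if j < nl - 1:
            b = np.linalg.norm(w)
            beta[j] = b                   # off-diagonal element of T
            v_prev, v = v, w / b
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return T, V

# A = X'X for a tall random X: the top eigenvalue of the small T
# already approximates the top eigenvalue of the full A
rng = np.random.default_rng(5)
X = rng.standard_normal((200, 50))
A = X.T @ X
T, V = lanczos(A, rng.standard_normal(50), nl=10)

ev_T = np.linalg.eigvalsh(T)
ev_A = np.linalg.eigvalsh(A)
rel_err = abs(ev_T[-1] - ev_A[-1]) / ev_A[-1]
```

Diagonalising the 10 × 10 matrix T is far cheaper than diagonalising the 50 × 50 matrix A, which is the source of the speed gain of the fast methods.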
2.4 - Neural Network (NN)
In our study, Neural Network calibration [14,15] was performed on the X data matrix after it was compressed by means of a PC transformation. The most relevant PCs, selected on the basis of explained variance, are used as input to the NN. The number of hidden layers was set to 1, and the transfer function used in the hidden layer was non-linear (a hyperbolic tangent). The weights were optimised by means of the Levenberg-Marquardt algorithm [16]. A method based on the contribution of each node [17] was applied to find the best number of nodes to be used in the input and hidden layers. The optimisation procedure of NN also requires the calibration set to be split into training and monitoring sets in order to avoid overfitting. The last 10 spectra of the calibration set (10 mixtures including the five constituents with various random concentrations) were used as the monitoring set, since they can be expected to be most representative of the future mixtures to be predicted.
3 - Results and discussion
All calculations were performed on a personal computer equipped with an AMD Athlon 600 MHz processor and 256 MB of RAM, in the Matlab environment. The software used was made in-house, except for the new methods, for which the code provided as an annex to the original paper [1] was used. The results used to assess the predictive ability of the methods are given in terms of Root Mean Squared Error of Prediction (RMSEP), defined as:

RMSEP = √( Σ_{i=1}^{n_t} (ŷ_i − y_i)² / n_t )        (2)

where n_t is the number of objects in the test set, y_i the known value of the property of interest for object i, and ŷ_i the value of the property of interest predicted by the model for object i.
The speed of the methods is measured by estimating the number of operations necessary to perform the complete calibration and prediction procedure. This number is estimated using the Matlab 'flops' function, which counts the number of floating-point operations performed, and is expressed in Mflops (millions of operations). Prediction results are given in Tables 1 to 5.
Oxford University, John Radcliffe Hospital, Oxford OX3 9DU, U.K.
ABSTRACT
The aim of this study is to investigate whether useful information can be extracted from an electroencephalographic (EEG) data set with a very high number of modes, and to determine which model is the most appropriate for this purpose. The data was acquired during the testing phase of a new drug expected to have an effect on brain activity. The implemented test program (several patients followed in time, different doses, conditions, etc.) led to a 6-way data set. After it was confirmed that the exploratory analysis of this data set could not be handled with classical PCA, and it was verified that a multi-dimensional structure was present, multi-way methods were used to model the data. Tucker 3 appeared to be the most suitable model. It was possible to extract useful information from this high-dimensionality data. Non-relevant sources of variance (outlying patients, for instance) were identified so that they can be removed before the in-depth physiological study is performed.
Condition dimension : 2 measurement conditions (resting and vigilance controlled)
The calculations were performed on a personal computer with an AMD Athlon 600 MHz CPU and 256 MB of RAM. The software used was made in-house, or was part of the N-way Toolbox from Bro and Andersson [3]. The whole study was performed in the Matlab® environment.
3 - Models
3.1 - Unfolding PCA – Tucker 1
Unfolding Principal Component Analysis (PCA) consists of applying classical two-way PCA to the data matrix after it has been unfolded. The principle of unfolding is to consider the multidimensional array as a collection of regular 2-way matrices and to put them next to one another, leading to a new 2-way matrix containing all the data. A 3-way array can be unfolded along its 3 dimensions (Fig. 2).
Fig. 2. Three possible ways of unfolding a 3-way array X. X(1), X(2) and X(3) are the 2-way matrices obtained after unfolding with preserving the 1st, 2nd and 3rd mode respectively.
This results in 3 different matrices X(1), X(2) and X(3), in which modes 1, 2 and 3 respectively are preserved. The score matrices obtained by building a PCA model on each of these 3 matrices, called A, B and C respectively, are the output of a Tucker 1 model. Tucker 1 is considered a weak multidimensional model, as it does not take into account the multi-way structure of the data: the A, B and C matrices are built independently. The Tucker 1 model is a collection of independent bilinear models, not a multi-linear model.
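The three unfoldings can be written compactly. The following Python sketch (numpy, not the Matlab toolbox used in the study) illustrates them on a small 2 × 3 × 4 array; a Tucker 1 model is then simply a PCA on each of the three resulting matrices.

```python
import numpy as np

def unfold(X, mode):
    """Unfold a 3-way array into a 2-way matrix preserving `mode`
    (0, 1 or 2): the kept mode indexes the rows, the remaining modes
    are concatenated into the columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

X = np.arange(2 * 3 * 4).reshape(2, 3, 4)
X1 = unfold(X, 0)   # preserves mode 1: 2 x 12
X2 = unfold(X, 1)   # preserves mode 2: 3 x 8
X3 = unfold(X, 2)   # preserves mode 3: 4 x 6
```

Every element of the original array appears exactly once in each unfolding; only the arrangement changes.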
3.2 - Tucker 3
The Tucker 3 [4,5] model is a generalisation of bilinear PCA to data with more modes. The Tucker 3
model (limited here to a 3-way case for sake of simplicity) can be formulated as in eq. 1.
∑∑∑= = =
=1 2 3
1 1 1
w
l
w
m
w
n
lmnknjmilijk gcbax (1)
where x ijk is an (lxmxn) multidimensional array, w1, w2 and w3 are the number of components extracted
on the 1st, 2nd and 3rd mode respectively, a, b, and c are the elements of the A, B and C loadings
matrices for the 1st, 2nd and 3rd mode respectively, and g are the elements of the core matrix G.
The information carried by these matrices is therefore of the same nature as the information contained
in the equivalent matrices of the Tucker 1 model. The difference comes from the fact that these
matrices are built simultaneously during the Alternating Least Squares (ALS) fitting process of the
model in order to account for the multidimensional structure. Tucker 3 is a multi-linear model.
Moreover, the G matrix defines how individual loading vectors in the different modes interact. This
information is not available in the Tucker 1 model. The Tucker 3 model can also be seen in a more graphical way, as shown in figure 3: it appears as a weighted sum of outer products between the factors stored as columns in the A, B and C matrices.
Fig. 3. Representation of the Tucker 3 model applied to a 3-way array X. A, B and C are the loadings corresponding respectively to the 1st, 2nd and 3rd dimension. G is the core matrix. E is the matrix of residuals.
One of the interesting properties of the Tucker model is that the number of components does not have to be the same for the different modes (as it must be in the PARAFAC model). In Tucker 3, the components in each mode are usually constrained to orthogonality, which leads to fast convergence. A limitation of this model is that the solution obtained is not unique: an infinity of equivalent solutions can be obtained by rotating the result without changing the fit of the model.
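Eq. 1 can be checked numerically: reconstructing the array from the core G and the loadings A, B and C is a single einsum contraction in Python. The sizes below are arbitrary and chosen only for illustration; fitting the model by ALS is not shown.

```python
import numpy as np

# Reconstruct a 3-way array from a Tucker 3 core and loadings:
# x_ijk = sum_lmn a_il * b_jm * c_kn * g_lmn  (eq. 1)
rng = np.random.default_rng(6)
I, J, K = 4, 5, 6          # array dimensions
w1, w2, w3 = 2, 3, 2       # components per mode (may differ, unlike Parafac)
A = rng.standard_normal((I, w1))
B = rng.standard_normal((J, w2))
C = rng.standard_normal((K, w3))
G = rng.standard_normal((w1, w2, w3))   # core array

X_hat = np.einsum('il,jm,kn,lmn->ijk', A, B, C, G)
```

The core G weights every combination of loading vectors, which is exactly the interaction information that Tucker 1 does not provide.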
3.3 - Parafac
The Parafac model [6,7] is another generalisation of bilinear PCA to higher order data. It can be
mathematically described as in eq. 2 :
x_ijk = Σ_{l=1}^{w} a_il b_jl c_kl        (2)
Like Tucker 3, Parafac is a true multi-linear model. It can be considered a special case of the Tucker 3 model in which the number of components extracted along each mode is the same, and the core matrix contains non-zero elements only on its superdiagonal. This specific structure of the core makes Parafac models much easier to interpret than Tucker 3 models. The Parafac model can also be seen in a more graphical way, as shown in figure 4.
Chapter 4 – New Types of Data : Structure and Size
231
Fig. 4. Representation of the Parafac model applied to a 3-way array X. A, B and C are the loadings corresponding to the 1st, 2nd and 3rd dimension. G is the super-diagonal core matrix. E is the matrix of residuals.
The most interesting feature of the Parafac model is uniqueness. The model provides unique factor
estimates: the solution obtained cannot be rotated without modifying its fit. As the components on
each mode are not constrained to orthogonality, convergence is usually considerably slower than observed
with the Tucker 3 model.
4 - Results and discussion
4.1 - Linear and bi-linear models
Because of the nature of the data set, it was very difficult to explore it visually in the way usually
done, for instance, with spectral data. In order to gain better insight into the data, some averages were
computed directly from the original variables. This corresponds to building simple linear models. The
global average (over patients, doses and conditions) for the energy bands can then be displayed on a map
of the brain for each of the measurement times. It is then possible to see, in a rough way, the evolution of
the activity of the brain as a function of time and of the location in the brain (Fig. 5).
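This averaging can be sketched in a few lines (numpy; the mode ordering below is a hypothetical choice, as the exact array layout is not specified in the text):

```python
import numpy as np

# Hypothetical 6-way array ordered as
# (electrodes, energy bands, times, doses, patients, conditions)
rng = np.random.default_rng(0)
X = rng.normal(size=(28, 7, 12, 4, 12, 2))

# Simple linear model: average over bands, doses, patients and conditions,
# leaving one value per electrode and per measurement time, ready to be
# displayed on the electrode grid for each time point (as in Fig. 5)
brain_maps = X.mean(axis=(1, 3, 4, 5))
assert brain_maps.shape == (28, 12)
```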
Fig. 5. Original data (averaged on patients, doses, conditions, and the 7 energy bands) displayed, for each of the measurement times, on a grid representing the electrodes locations. Dark zones indicate low activity.
It can be seen that the activity of the brain seems to globally increase to reach a maximum at time 6
(11AM, first day). The activity seems to increase mainly in the back part of the brain. The plot
corresponding to time 11 (9AM, second day), shows that the state of the brain seems to be similar on
the first and second day at equivalent times. Studying such plots for individual energy bands shows that
the different bands are not all present and varying in the same parts of the brain (i.e. some are more
present and active in the front or back part of the brain).
Classical two-way PCA can also be used to explore this data set; bi-linear models are then constructed.
The intensities of the 7 energy bands are considered as variables, and the 32256 measurement
conditions as objects. The PCA results (Fig. 6) show that there is some structure in the data. Points of
the score plot corresponding to an individual patient are located in relatively well-defined areas. The
same thing can be observed for points corresponding to a certain electrode or dose. However, the
results are too complex to be readily interpretable, and justify the use of multi-way methods to explore
this data set.
Fig. 6. Results of PCA on the (7 x 32256) matrix : scores on PC1 versus scores on PC2. Points corresponding to patient #9 are highlighted.
4.2 - Assessing multi-linear structure
Many data sets can be arranged in a multi-way form. This does not mean that multi-way methods
should be applied to such data sets, as using these methods makes sense only if multi-linear structure is
present in the data. For instance, if the slices of a three-way array are completely independent, no structure
(or correlation) is present along this mode, and multi-way methods should not be used. Two-way PCA
can be used to ensure that some multi-dimensional structure is actually present in the data. The data can
be reduced to a smaller dimensionality (smaller number of modes) array by extracting parts of the array
corresponding to one element of a given mode. For instance, considering only patient #11, the 30 mg
dose, and the Resting condition, the resulting array is a 3-way array with dimensions (28x12x7). Only
the spatial, time, and variable dimensions are then taken into account. This array has to be unfolded
before ordinary PCA can be performed. If the data is unfolded preserving the first dimension, the
resulting matrix will have dimensions (28x(12x7)). The scores of a PCA model performed on these data
give information about the 28 electrodes, while the loadings give simultaneous information about the
time and the variables: 12 repetitions (one per time point) of the information about the 7 variables are
expected. It is verified that a structure remains in the loadings of the PCA model (Fig. 7).
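The unfold-then-PCA step can be sketched with numpy (random stand-in data; the point is the shapes, and the grouping of the PC1 loadings into 12 blocks of 7 variables, which is what produces the repeated patterns of Fig. 7):

```python
import numpy as np

# Stand-in for the (28 x 12 x 7) sub-array: electrodes x times x bands
rng = np.random.default_rng(0)
X = rng.normal(size=(28, 12, 7))

# Unfold preserving the first dimension: (28 x (12*7))
Xu = X.reshape(28, 12 * 7)

# Ordinary PCA via SVD of the column-centred matrix
Xc = Xu - Xu.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s          # information about the 28 electrodes
loadings = Vt.T         # 84 rows: 12 repetitions of the 7 variables

# Loadings on PC1, grouped as 12 blocks (one per time) of 7 variables
pc1_blocks = loadings[:, 0].reshape(12, 7)
assert scores.shape == (28, 28) and pc1_blocks.shape == (12, 7)
```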
Fig. 7. Loadings on PC1 for a (28 x 12 x 7) model (patient #11, 30 mg dose, resting condition).
The loadings for each variable globally vary following a common time profile. This is an indication of
a multi-dimensional structure between the time and variable dimensions in the data used. A (7x(28x12))
array can also be obtained by rearranging the previous matrix. This time, the loadings show combined
information about the electrode and time dimensions. The plot shows 12 repetitions (one per time point) of
the 28 electrodes. It can be observed that the loading values of the electrodes once again globally
follow a time profile, indicating that there is some multi-way structure relating these two modes.
Considering only the part of the data set corresponding to patient #11 and the resting condition leads to
a 4-way array with dimension (28x7x12x4). The loadings of the PCA model built on this array
unfolded preserving its first mode should give information about variables, time, and doses
simultaneously. A structure due to the dose dimension is visible (Fig. 8). Dose 3 (30mg) seems to be
standing out.
Fig. 8. Loadings on PC1 for a (28 x 7 x 12 x 4) model (patient #11, resting condition).
4.3 - Multi-linear models optimization
The Parafac model should preferably be used, as its simplicity makes the interpretation of the results
easier and also because of its uniqueness property. However, it has first to be investigated whether the
data can be modelled with Parafac. This verification can be performed using the Core Consistency
Diagnostic [8]. This approach is used to estimate the optimal complexity of a Parafac model (or any
other model that can be considered as a restricted Tucker 3 model). It can be seen as building a Tucker
3 model with the same complexity as the Parafac model and with unconstrained components and
analysing its core. In practice, the core consistency diagnostic is performed by calculating a Tucker 3
core matrix from the loading matrices of the Parafac model. If the Parafac model is valid and optimal in
terms of complexity, the core matrix of this Tucker 3 model, after rotation to optimal diagonality,
should contain no significant non-diagonal element. The data was first restricted to simpler 3-way
cases, and 3-way Parafac models were built. For instance, in the case of models built for data restricted
to one patient, one condition, and one dose, the dimensions modelled are the spatial dimension
(position of the electrodes), the time dimension, and the variables dimension. In all cases studied here,
a 2-component Parafac model was always optimal. However, the performance of the Parafac models
depended greatly on the patient studied. For patient #6, for instance (Fig. 9), the model is much better
than for patient #11 (Fig. 9).
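The diagnostic itself reduces to a short computation: fit the least-squares Tucker 3 core for the given Parafac loadings, and compare it to the ideal superdiagonal core. A sketch under stated assumptions (numpy; exact tri-linear toy data, so the diagnostic comes out near 100%; the exact normalisation of the diagnostic varies between formulations, the one below divides by the sum of squared superdiagonal elements):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, F = 10, 8, 6, 2           # toy dimensions, F Parafac components
A = rng.normal(size=(I, F))
B = rng.normal(size=(J, F))
C = rng.normal(size=(K, F))

# Noiseless Parafac array and its unfolding X(I x KJ), k-index outermost
X = np.einsum('if,jf,kf->ijk', A, B, C)
Xu = X.transpose(0, 2, 1).reshape(I, K * J)

# Least-squares Tucker core for the fixed Parafac loadings
G = np.linalg.pinv(A) @ Xu @ np.linalg.pinv(np.kron(C, B)).T
Gc = G.reshape(F, F, F)

# Ideal superdiagonal core T (ones on the superdiagonal, zeros elsewhere)
T = np.zeros((F, F, F))
for f in range(F):
    T[f, f, f] = 1.0

# Core consistency: 100% when the fitted core is exactly superdiagonal
corcondia = 100 * (1 - ((Gc - T) ** 2).sum() / (T ** 2).sum())
assert abs(corcondia - 100) < 1e-6
```

For data that deviate from tri-linearity, significant off-superdiagonal core elements appear and the diagnostic drops well below 100%, which is the behaviour observed here for patient #11.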
Fig. 9-a. Core Consistency Diagnostic for Parafac models built on 3-way data. Patient 6, Resting condition, 30 mg dose.
Fig. 9-b. Core Consistency Diagnostic for Parafac models built on 3-way data. Patient 11, Resting condition, 30 mg dose.
This indicates that the data do not seem to follow a Parafac model, or at least that the modelling is not easy;
the data can therefore not be fitted adequately by this model. By increasing the number of dimensions
modelled, it was verified that a Parafac model is probably not appropriate for this data set. In order to
assess the validity of the Parafac model on a data set, it is also useful to estimate the fit of both Tucker
3 and Parafac models in order to evaluate if the larger flexibility of the Tucker model leads to a
significant improvement in the fit. The fit of the 2-component Parafac model and the (222222) 6-way
Tucker 3 model (2 components extracted on each of the 6 modes) are actually almost identical (around
93.5% of explained variance). However, this complexity does not seem to be optimal at all in the case
of the 6-way Tucker 3 model. In order to keep computation time reasonable, the optimal complexity of
the 6-way Tucker 3 model was evaluated (Fig. 10) taking into account only a number of components
quite close to 2.
Fig. 10. Variance explained by the Tucker 3 models as a function of the model complexity.
The complexity was therefore investigated only from (111111) to (333333). It appeared that the
optimal complexity is (333221), which can be detailed as follows :
EEG dimension : 3 components
Subject dimension : 3 components
Spatial dimension : 3 components
Dose dimension : 2 components
Time dimension : 2 components
Condition dimension : 1 component
This complexity corresponds to the beginning of the last plateau on the curve (more exactly, in this case,
to a part of the curve just after a significant reduction of the slope). The model was deliberately not
chosen to be parsimonious: it would, for instance, have been possible to select the complexity
corresponding to the beginning of the plateau containing the (222222) model. It is however always
possible to discard some components from the model if it appears from the interpretation of the core
that they are not useful in the reconstruction of the original matrix X.
4.4 - 6-way Tucker 3 model
The 6-way Tucker 3 model leads to a core array G with dimensions (3x3x3x2x2x1) and six component
matrices A, B, C, D, E and F, each related to one of the modes.
4.4.1 - Loadings on the variable dimension
The first matrix A holds the loadings for the EEG dimension (7 EEG bands). By calculating from the
original data the average energy (over the five other modes) of each frequency band, it can be seen
(Fig. 11-a) that the first component is used to describe the average energy of the bands. The second
component, as well as the third one (Fig. 11-b), will at this stage be interpreted as showing the effect of
some other parameters (time or effect of the substance) on the distribution of the bands.
Fig. 11-a. Loadings on the variable dimension, 6-way model with complexity (3 3 3 2 2 1). A(1) versus A(2). The mean energies of the bands are also given.
Fig. 11-b. Loadings on the variable dimension, 6-way model with complexity (3 3 3 2 2 1). A(2) versus A(3).
4.4.2 - Loadings on the patient dimension
The second matrix B holds the loadings for the patient dimension (12 patients). The main information
in the loading plots is that some extreme values are present. Patient #6 appears as an extreme value on
component 1 (Fig. 12-a). Patient #11 appears as an outlier on component 3 (Fig. 12-b). At this stage,
without looking at the core array G in order to remove the rotational indeterminacy of the Tucker 3
model, it is not possible to go further in the discussion about this matrix.
Fig. 12-a. Loadings on the patient dimension, 6-way model with complexity (3 3 3 2 2 1). B(1) versus B(2).
Fig. 12-b. Loadings on the patient dimension, 6-way model with complexity (3 3 3 2 2 1). B(2) versus B(3).
4.4.3 - Loadings on the spatial dimension
The third matrix C holds the loadings for the spatial dimension (28 electrodes). The first remarkable
thing in the plot of C(1) versus C(2) is the symmetry of the loadings (Fig. 13-a). All electrodes that are
symmetrical on the brain (Fig. 1), for instance electrodes #17 and 20 appear very close to each other on
the loading plot. Moreover, considering all these pairs of symmetrical electrodes, the one located on the
right part of the brain appears to have systematically higher loading values. For instance, electrode #20
has higher loading values than electrode #17. This rule holds for all of the pairs of electrodes, except
for electrodes #12 and 16. It will be established when interpreting the core matrix that this is due to a
specific problem with one of these leads for one of the patients. If the loading values on component 1
are reported on the map of the electrodes on the brain, a representation of the activity of the brain is
obtained (Fig. 13-b); it looks very similar to what was obtained with the linear models in the data
exploration part (Fig. 5).
Fig. 13-a. Loadings on the spatial dimension, 6-way model with complexity (3 3 3 2 2 1). C(1) versus C (2).
Fig. 13-b. Loadings on the spatial dimension, 6-way model with complexity (3 3 3 2 2 1). Ranking of the electrodes on C(1) reported on the map of the brain.
If the second component of the C matrix is now considered (Fig. 13-c), and the loading values are
reported on the map of the electrodes on the brain, a clear separation between the front and back part of
the brain can be observed (Fig 13-d). Considering directions in the plots, a central part of the brain can
be identified. These patterns are interpreted as showing the activity of the substance on different parts
of the brain.
Fig. 13-c. Loadings on the spatial dimension, 6-way model with complexity (3 3 3 2 2 1). C(1) versus C(2).
Fig. 13-d. Loadings on the spatial dimension, 6-way model with complexity (3 3 3 2 2 1). Patterns on the loading plots are reported on the map of the brain.
It is important to note that, at this stage, with only the information present in the loading matrices, it is
not possible to know whether the high loadings on C(1) for the central part of the brain mean high or
low activity. A basic knowledge of brain physiology indicates that they indeed correspond to high
activity. It is however necessary to remove the rotational indeterminacy of the Tucker 3 model by
interpreting the core matrix in order to extract this information from the model itself.
4.4.4 - Loadings on the dose dimension
The first component on the dose dimension, D(1), can be interpreted quite easily (Fig. 14). It shows that
10mg is quite close to Placebo, indicating that this dose is not effective. 90mg differs more from
Placebo, indicating a better effect of this dose, and the most different is 30mg. This can
appear surprising, but the medical doctors in charge of the study expected this result: with this kind of
substance, the higher dose does not systematically lead to the higher effect. The second dimension,
which differentiates 30mg from the other doses, is much more difficult to interpret at this stage,
but the phenomenon will be explained when interpreting the core matrix G.
Fig. 14. Loadings on the dose dimension, 6-way model with complexity (3 3 3 2 2 1). D(1) versus D(2).
4.4.5 - Loadings on the time dimension
The first component on the time dimension E(1) shows the normal time profile of the evolution of the
state of the brain during day time (Fig. 15). The activity globally increases from 8AM (time 1) until
11AM (time 6). This would of course still have to be confirmed by removing the rotational
indeterminacy using G, but it already fits what was seen in the linear data exploration part (Fig. 5).
Afterwards, the activity reduces. The loading value for the second day at 9AM (point 11) is located
between the ones corresponding to 8:30AM and 9:30AM in the first day, confirming this interpretation.
The second dimension is interpreted as showing the time profile of the drug effect. It has
to be specified that the drug was administered immediately after 8:30AM (time 2). No effect of the
substance can therefore be expected before 9:30AM (time 3). The loadings on component 2 are indeed
negative before 8:30AM and become positive from 9:30AM, regularly increasing until 11:30-
12:00AM. After 12AM, the activity drops back to baseline (no drug activity, with the same negative loading values
as before the administration of the drug), and stays at this level during the second day.
Fig. 15. Loadings on the time dimension, 6-way model with complexity (3 3 3 2 2 1). E(1) versus E(2).
4.4.6 - Loadings on the condition dimension
The last component matrix F gives information about the two different measurement conditions. It is in
fact a vector, as only one component was extracted along this mode. The loading values are 0.701 for
the resting condition and 0.713 for the vigilance controlled condition. The loadings are positive for
both conditions; this indicates that, when interpreting the model, this mode can only have a scale
effect. This means that the effect of the drug can only be larger or smaller depending on the condition,
but one cannot expect to see opposite effects due to this parameter. The loading values for the two
conditions are also very similar, which indicates that the two conditions do not have any effect on the
brain activity that is significant for the model. This dimension was further investigated: 5-way models
were built on data taking into account one condition, the other condition, and the average of the
data over the two conditions. All these models gave almost perfectly identical results, showing that the
two conditions can in fact be considered as replicates of the same 5-way data set. This mode is
therefore not relevant in the data set.
4.4.7 - The core matrix G
The important elements of the core are shown in table 1, together with their squared values (which
represent the relative importance of the core elements), and the variance explained by these elements.
Table 1. Important core elements of the 6-way model with complexity (3 3 3 2 2 1).
A new procedure for the identification of outliers in the Tucker3 model is proposed. It is based on robust initialization of the Tucker3 algorithm using Multivariate Trimming or the Minimum Covariance Determinant. The performance of the algorithm is tested in a Monte Carlo study on simulated data sets, and on a real data set known to contain outliers.
1 – Introduction
N-way methods based on the Alternating Least Squares (ALS) algorithm are least squares methods that
are highly influenced by outlying data points. One outlying sample can strongly influence the resulting
model. As for 2-way PCA and related methods, there are two possibilities to deal with outliers:
statistical diagnostics can be used or a robust algorithm can be constructed. Statistical diagnostics tools
can be applied to the already constructed models and are usually based on the detection of the 'leverage
points', defined as points that are far away from the remaining data points in the model space. This
approach does not always work for multiple outliers because of the so-called masking effect. Robust
versions of modelling procedures aim at building models describing the majority of data without being
influenced by the outlying objects. By data majority we mean the data subset containing at least 51% of
the objects. Robust procedures are characterized by the so-called breakdown point, defined as the percentage
of data objects that may be corrupted while the model still yields proper estimates. A subset of data
containing no outliers is called a 'clean subset'.
In the arsenal of chemometrical methods there are already many robust approaches, such as robust
PCA, PCR, PLS [1,2,3]. The aim of our study was to construct a robust version of the Tucker3
approach, one of the most popular N-way methods.
2 – Theory
2.1 - N-way methods of data exploration
Several methods were proposed for N-way exploratory analysis, for instance
CANDECOMP/PARAFAC [4,5] and the family of Tucker models [6,7]. In the present study, only the
Tucker3 model is considered. Most of the N-way methods are based on ALS. The principle of ALS is
to divide the parameters into several sets; for each set, the least squares solution is found
conditionally on the remaining parameters. The estimation of the parameters is repeated until a
convergence criterion is satisfied. Figure 1 shows the decomposition according to the Tucker3 model.
The 3-way data matrix X is decomposed into 3 orthogonal loading matrices A (I x L), B (J x M), C (K
x N) and the core matrix Z (L x M x N) which describes the relationship among them. The largest
squared elements of the core matrix Z indicate the most important factors in the model of X.
Mathematically, the Tucker3 model can be expressed as
x_ijk = ∑_{l=1}^{L} ∑_{m=1}^{M} ∑_{n=1}^{N} a_il b_jm c_kn z_lmn + e_ijk    (1)
Fig. 1. The Tucker3 model.
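Eq. 1 (without the residual term) can be written in one line with numpy's einsum; the dimensions below are arbitrary illustrative choices:

```python
import numpy as np

I, J, K = 6, 5, 4          # array dimensions (hypothetical)
L, M, N = 2, 3, 2          # numbers of factors per mode
rng = np.random.default_rng(0)
A = rng.normal(size=(I, L))
B = rng.normal(size=(J, M))
C = rng.normal(size=(K, N))
Z = rng.normal(size=(L, M, N))     # core array

# Eq. 1 without noise: x_ijk = sum over l,m,n of a_il b_jm c_kn z_lmn
X = np.einsum('il,jm,kn,lmn->ijk', A, B, C, Z)
assert X.shape == (I, J, K)
```

Unlike Parafac, each mode has its own summation index, and the core Z weights every combination of components from the three modes.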
2.2 - Data unfolding
For computational convenience, the Tucker3 algorithm used does not perform calculations directly on
N-way arrays. The X matrix is unfolded to standard 2-way matrices. This can be done in three different
ways (see Fig. 2). The unfolded matrices are denoted as X(I x JK), X(J x IK) and X(K x IJ). To calculate the
loading matrices several procedures can be used. Anderson and Bro [8] tested most of them with
respect to speed and found NIPALS to be the fastest for large data arrays. In our algorithm, SVD is
used for the estimation of A, B and C matrices.
Fig. 2. Three different ways of unfolding of a 3-way data matrix.
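With numpy, the three unfoldings are a transpose followed by a reshape (the ordering of the concatenated columns is a convention; the one below keeps the first remaining mode outermost):

```python
import numpy as np

I, J, K = 4, 3, 2
X = np.arange(I * J * K, dtype=float).reshape(I, J, K)

X_I = X.reshape(I, J * K)                      # X(I x JK)
X_J = X.transpose(1, 0, 2).reshape(J, I * K)   # X(J x IK)
X_K = X.transpose(2, 0, 1).reshape(K, I * J)   # X(K x IJ)

assert X_I.shape == (4, 6) and X_J.shape == (3, 8) and X_K.shape == (2, 12)
# every unfolding contains exactly the same I*J*K values, just rearranged
assert np.isclose(X_I.sum(), X_J.sum()) and np.isclose(X_J.sum(), X_K.sum())
```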
2.3 - Algorithm of Tucker3 model
0) - Initialize B and C (as random orthogonal matrices)
1) - [A,v,d] = svd(X(I x JK) (C ⊗ B), L)
2) - [B,v,d] = svd(X(J x IK) (C ⊗ A), M)
3) - [C,v,d] = svd(X(K x IJ) (B ⊗ A), N)
4) - Go to step 1 until the relative change in fit is small
5) - Z = AT X (C ⊗ B)
where L, M and N denote the numbers of factors in the matrices A, B and C respectively, and ⊗
denotes the Kronecker product: in A ⊗ B, each element of A multiplies the entire matrix B,
expressed as :

A ⊗ B = [ a_11·B  a_12·B  ⋯
          a_21·B  a_22·B  ⋯
          ⋮       ⋮       ⋱ ]
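Steps 0-5 can be sketched as follows (numpy; a fixed number of sweeps replaces the convergence test of step 4, and the column ordering of the unfoldings is chosen to match np.kron — both are implementation choices for the sketch, not part of the published algorithm):

```python
import numpy as np

def tucker3_als(X, L, M, N, n_sweeps=20, seed=1):
    """Sketch of Tucker3 ALS (steps 0-5) for a 3-way array X of shape (I, J, K)."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    # Step 0: random orthogonal B (J x M) and C (K x N)
    B = np.linalg.qr(rng.normal(size=(J, M)))[0]
    C = np.linalg.qr(rng.normal(size=(K, N)))[0]
    # Unfoldings; column ordering consistent with np.kron (second index outermost)
    XI = X.transpose(0, 2, 1).reshape(I, K * J)
    XJ = X.transpose(1, 2, 0).reshape(J, K * I)
    XK = X.transpose(2, 1, 0).reshape(K, J * I)
    for _ in range(n_sweeps):                  # step 4: fixed sweeps for brevity
        A = np.linalg.svd(XI @ np.kron(C, B), full_matrices=False)[0][:, :L]
        B = np.linalg.svd(XJ @ np.kron(C, A), full_matrices=False)[0][:, :M]
        C = np.linalg.svd(XK @ np.kron(B, A), full_matrices=False)[0][:, :N]
    Z = A.T @ XI @ np.kron(C, B)               # step 5: core, L x (N*M)
    return A, B, C, Z
```

For a noiseless trilinear array, the reconstruction A Z (C ⊗ B)ᵀ matches the unfolded data exactly after convergence.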
2.4 - Robust PCA
One could think of robustly initializing the ALS algorithm, i.e. finding a clean subset for the
matrix X(I x JK), but in reality, as the loading matrices B and C are only just initialized, the resulting
matrix (X(I x JK) (C ⊗ B)) of dimensionality I x MN should be considered instead. The clean subset can
be determined using methods such as Multivariate Trimming (MVT) [11] or the Minimum
Covariance Determinant (MCD) [12]. Robust initialization of the Tucker3 algorithm seems to be the
most important step in determining the final model, and because this step is placed outside the main loop,
the algorithm does not lead to oscillations. In the consecutive steps of the ALS algorithm, the clean subset
is constructed so as to decrease an objective function (see eq. 4), so that oscillations are avoided and
convergence of the algorithm is achieved.
2.4.1 - Multivariate Trimming (MVT) [11]
The MVT procedure can be used for 'clean' subset selection when the input data matrix contains at least
two times more objects than variables. The squared Mahalanobis distance (MD2) is calculated
according to the following equation:
MD_i² = (t_i − t̄) S⁻¹ (t_i − t̄)ᵀ    (2)
where t_i denotes the i-th object (row), t̄ denotes the vector of column means of the data matrix, and S is
the covariance matrix.
A fixed percentage of objects (here 49%) with the highest MD² is removed, and the remaining ones
are used to calculate a new mean and covariance matrix. MD² is then calculated again for all objects using
the new estimates of the mean and covariance matrix. Again, the 49% of objects with the highest MD² are
removed, and the process is repeated until the successive estimates of the covariance matrix
and mean converge. The subset of objects for which covariance and mean are stable is considered to be a clean
subset of the data.
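A compact sketch of this procedure (numpy; the function and parameter names are our own, not from the cited reference):

```python
import numpy as np

def mvt_clean_subset(T, trim=0.49, max_iter=100):
    """MVT sketch: iteratively drop the `trim` fraction of objects with the
    largest squared Mahalanobis distance until the kept subset stabilizes."""
    m = T.shape[0]
    n_keep = int(np.ceil((1 - trim) * m))
    keep = np.arange(m)
    for _ in range(max_iter):
        mu = T[keep].mean(axis=0)
        S = np.cov(T[keep], rowvar=False)
        d = T - mu
        md2 = np.einsum('ij,jk,ik->i', d, np.linalg.pinv(S), d)  # eq. 2
        new_keep = np.sort(np.argsort(md2)[:n_keep])
        if np.array_equal(new_keep, keep):       # mean/covariance stable
            break
        keep = new_keep
    return keep
```

On data containing a minority of gross outliers, the iterations quickly settle on a subset drawn entirely from the data majority.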
2.4.2 - Minimum Covariance Determinant (MCD) [12]
MCD aims at selecting the subset of h (out of m) objects with the smallest covariance determinant, i.e. the smallest
volume in the p-dimensional space.
h = (m + p + 1)/2 (3)
The MCD algorithm can be summarized as follows:
1) - Randomly select 500 subsets of data containing p+1 objects
2) - For each subset:
a) Calculate its mean and covariance, t and S
b) Calculate Mahalanobis distances for all objects using the estimates of data mean and
covariance matrix calculated in step 2 a
c) Sort MD and take h objects with the smallest MD to calculate the next estimate of mean and
covariance matrix,
d) Repeat steps b and c twice
3) - Take the 10 best solutions, i.e. the 10 subsets of h objects with the smallest determinants, and for
each of them, repeat steps b and c until two subsequent determinants are equal
4) - Report the best solution, i.e. the subset with the smallest determinant
The procedure starts with many very small data subsets (containing only p+1 objects) to increase the
probability that some of these subsets contain no outliers. Only two iterations are performed for all 500
subsets (steps 2b and 2c) to speed up the MCD procedure; as demonstrated by P. Rousseeuw [12], a
small number of iterations is sufficient to find good candidate clean subsets. Only for the 10 best
subsets are the calculations repeated until convergence of the algorithm.
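A condensed sketch of these steps (numpy; `n_starts` is a parameter so the 500 starts can be shrunk for illustration, and the names are our own):

```python
import numpy as np

def mcd_clean_subset(T, n_starts=500, seed=0):
    """Condensed fast-MCD sketch: random (p+1)-subsets, a few C-steps,
    then full iteration of the 10 best candidates."""
    rng = np.random.default_rng(seed)
    m, p = T.shape
    h = (m + p + 1) // 2                       # eq. 3

    def c_step(idx):
        # mean/covariance of the subset, MD for all objects, keep h smallest
        mu = T[idx].mean(axis=0)
        S = np.cov(T[idx], rowvar=False)
        d = T - mu
        md2 = np.einsum('ij,jk,ik->i', d, np.linalg.pinv(S), d)
        return np.sort(np.argsort(md2)[:h])

    def det_of(idx):
        return np.linalg.det(np.cov(T[idx], rowvar=False))

    candidates = []
    for _ in range(n_starts):                  # step 1
        idx = np.sort(rng.choice(m, size=p + 1, replace=False))
        for _ in range(3):                     # steps 2a-d: a few quick C-steps
            idx = c_step(idx)
        candidates.append(idx)
    best = sorted(candidates, key=det_of)[:10]  # step 3: 10 smallest determinants
    final = []
    for idx in best:
        prev = None
        while prev is None or not np.array_equal(idx, prev):
            prev, idx = idx, c_step(idx)       # iterate until the subset is stable
        final.append(idx)
    return min(final, key=det_of)              # step 4: smallest determinant
```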
2.5 - Algorithm for robust Tucker3 model
To find possible multiple outliers in the first mode of the X the following algorithm is proposed:
0) - Initialize loadings B and C
1) - Calculate X(I x JK) (C ⊗ B) and determine clean subset (using MVT or MCD)
2) - [A*,v,d] = svd (X(I* x JK) (C ⊗ B), L)
3) - [B*,v,d] = svd (X(J x I*K) (C ⊗ A*), M)
4) - [C*,v,d] = svd (X(K x I*J) (B* ⊗ A*), N)
5) - Z = A*T X* (C* ⊗ B*)
6) - Predict loadings A for all objects
7) - Reconstruct X(I x JK) : X̂(I x JK) = A Z(L x MN) (C ⊗ B)T
8) - Calculate the sum of squared residuals for the I objects in the first mode as the differences between
the original data and the reconstructed data :
residuals_i = ∑_j (x_ij − x̂_ij)²
9) - Sort residuals along the first mode
10) -Find h objects with the smallest residuals. They constitute the clean subset
11) - Go to step 2 until the relative change in fit is small
A*, X* etc. are the matrices A, X etc. limited to the clean subset of objects, and the notation X(I* x JK)
means that the unfolded data set contains objects reduced to the clean subset I*. h is the number of
objects in the clean subset.
In each iteration of the ALS subroutine, the loadings A*, B* and C* are calculated for the clean subset
of objects only. In step 6, the loadings A are predicted for all objects, and the set X(I x JK) is reconstructed
with the predefined number of factors. Residuals between the initial X(I x JK) and the reconstructed
X̂(I x JK) are calculated and sorted, and the 51% of objects with the smallest residuals are selected to
form the clean subset for the next ALS iteration. The objective function F, to be minimized, is the sum of
squared residuals for the h clean objects from the first mode:

F = ∑∑ (X* − X̂*)²    (4)

There is no guarantee that the selected clean subset is optimal, but convergence of the ALS approach is
secured.
In this algorithm, the outliers are identified in the first mode only, but as all modes are treated
symmetrically, one can look for outliers in any mode. This can be done simply by inputting the X
matrix with dimension of interest in the first mode.
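Steps 6-10 amount to a projection, a reconstruction, and a re-ranking of objects by residual. A sketch (numpy; the names are our own, with Xu the unfolded data, Z the core from step 5 and CB standing for C ⊗ B):

```python
import numpy as np

def reselect_clean_subset(Xu, Z, CB, h):
    """One pass of steps 6-10: predict scores for all objects from the
    clean-subset model, reconstruct, and keep the h best-fitting objects."""
    # Step 6: least-squares prediction of the loadings A for all I objects
    A = Xu @ np.linalg.pinv(Z @ CB.T)
    # Step 7: reconstruct the unfolded data
    Xhat = A @ Z @ CB.T
    # Step 8: sum of squared residuals per object of the first mode
    res = ((Xu - Xhat) ** 2).sum(axis=1)
    # Steps 9-10: the h smallest residuals define the next clean subset
    return np.sort(np.argsort(res)[:h])
```

Objects that follow the model have near-zero residuals and are retained, while outlying objects, which lie far from the low-dimensional model subspace, fall to the bottom of the ranking.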
2.5.1 - Outlier identification
Once the robust Tucker3 model is constructed, the standardized residuals from that model are
calculated for all objects of the first mode according to the following equation [10] :
for i = 1,…,I and j = 1,…,JK. In eq. 5, the residuals are divided by a robust version of the standard
deviation, 1.48·√(median((res_i − median(res_i))²)), computed from the residuals of the 51% of objects
which fit the model best. This corresponds to the robust standard deviation of the data residuals.
Objects with standardized residuals higher than 3 times the robust standard deviation are considered as
outlying and are removed from the data set. This is equivalent to using the ratio presented in eq. 5 with
a cut-off equal to one. The final Tucker3 model is constructed as the least squares model for the data
after outlier elimination.
3 - Data
3.1 - Simulated data set
A systematic Monte Carlo study was performed to evaluate the performance of the algorithm. A data set of
dimensionality (50 x 10 x 10) was simulated with 2 factors in all modes. Two Tucker3 models (X1 and
X2) were constructed, explaining 60% and 90% of the data variance respectively. The initial data sets were then
contaminated with different types (T1-T4) and different percentages (20% and 40%) of outliers.
The different types of outliers (T1-T4) can be characterized as follows:
T1 data set constructed according to the same model as the initial data, but with a certain
percentage of randomly permuted variables.
T2 data set with the same dimensionality and the same level of noise, but constructed
according to a different tri-linear model
T3 data set with the same level of noise but with a higher dimensionality than the initial data
set
T4 data set with the same level of noise but with a lower dimensionality than the initial data
set
The simulation of a tri-linear data structure was performed as follows: first, orthogonal loading matrices
A, B and C with predefined dimensions were randomly initialized. For the selected structure and core
matrix Z, the X matrix was constructed as X(I x JK) = A Z(L x MN) (C ⊗ B)T. Then, the Tucker3 model was
built, and a new X was reconstructed with the chosen number of factors in each mode and used as the initial data
set with tri-linear structure. At the end, white Gaussian noise was added to X. In this way, models
which differ in percentage of explained variance, data complexity and structure of core matrix, can be
constructed.
The two following types of calculations were performed for the 2 data models (X1 and X2), each with the 4
types of outliers (T1-T4) and the two percentages of contamination (20 and 40%):
1) One contaminated data set was constructed and the Tucker3 and robust Tucker3 models were built
100 times with random initialization of loadings B, C
2) The construction of Tucker3 and robust Tucker3 models was repeated 100 times for the predefined
type and percentage of outliers, but this time outliers were simulated randomly according to the
chosen type in each run.
The performance of the algorithms is presented in the form of a percentage of unexplained variance for
the constructed final models. In the case of robust Tucker3 approach, the final model is considered to
be the Tucker3 model after outlier removal. The MVT procedure was applied in the Monte Carlo study
to speed up calculations.
3.2 - Real data set
An electroencephalographic (EEG) data set was used. The principle of electroencephalography is to
give a representation of the electrical activity of the brain [13]. This activity is measured using metal
electrodes placed on the scalp. The data was acquired during the testing phase of a new antidepressant
drug. The effect of the drug was followed in time over a two days period (12 measurements). The
EEGs were measured on 28 leads located on the patient’s scalp. Each of the EEG was decomposed
using the Fast Fourier Transform into 7 energy bands commonly used in neuro-sciences [14]. Only the
numerical values corresponding to the average energy of specific frequency bands are taken into
account. This leads, for each patient, to a 3-way array with dimensions (28x7x12). The study was
performed on 12 patients. Only the results corresponding to two patients are shown here. Patient #6
shows a very typical behaviour, while patient #9 has aberrant results for electrode #12.
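The reduction of one lead's signal to band energies can be sketched as follows; the band limits and sampling frequency are illustrative assumptions, since the exact 7-band definition used in the study is that of reference [14].

```python
import numpy as np

# Illustrative EEG frequency bands in Hz (the exact 7-band split
# used in the study follows [14]; these limits are assumptions).
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha1": (8, 10),
         "alpha2": (10, 13), "beta1": (13, 20), "beta2": (20, 30),
         "gamma": (30, 45)}

def band_energies(signal, fs):
    """Average energy per frequency band from the FFT power spectrum."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    return {name: power[(freqs >= lo) & (freqs < hi)].mean()
            for name, (lo, hi) in BANDS.items()}

# A pure 11 Hz oscillation should put almost all energy in "alpha2".
# Repeating this for 28 leads and 12 time points gives the
# (28 x 7 x 12) array per patient.
fs = 128                      # assumed sampling frequency in Hz
t = np.arange(0, 4, 1.0 / fs)
energies = band_energies(np.sin(2 * np.pi * 11 * t), fs)
```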
Chapter 4 – New Types of Data : Structure and Size
4 – Results and discussion
4.1 - Monte Carlo study
Let us consider the data set X1 contaminated with 20% outliers of type T2. The Tucker3 model for
this data set is presented in figure 3. As one can notice, there are ten objects far away from the
remaining ones, and the Tucker3 model is highly influenced by them.
For the same data set, the robust Tucker3 model was constructed, and the object residuals from that
model are presented in figure 4. The 10 outlying objects are correctly identified. After their removal,
the final Tucker3 model is constructed; its results are presented in figure 5.
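The outlier-identification step can be illustrated with per-object residuals and a resistant cutoff; the median-plus-MAD threshold below is a common choice for such a cutoff, not necessarily the one used in the study.

```python
import numpy as np

def object_residuals(X, X_hat):
    """Sum of squared residuals per object (row of the unfolded array)."""
    return np.sum((X - X_hat) ** 2, axis=1)

def flag_outliers(res, k=3.0):
    """Flag objects whose residual exceeds median + k * MAD,
    a resistant cutoff that is not inflated by the outliers themselves."""
    med = np.median(res)
    mad = np.median(np.abs(res - med))
    return np.where(res > med + k * mad)[0]
```

Objects flagged this way are removed before the final (classical) Tucker3 model is refitted on the clean data.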
Fig. 3. Tucker3 model for data set X1 (90 % of explained variance) with 20 % of outliers (type T1).
Fig. 4. Residuals from the robust Tucker3 model, data set X1, 20 % contamination, type T1.
Fig. 5. Final Tucker3 model after elimination of identified outliers.
For each studied data set, the Tucker3 and robust Tucker3 algorithms were run 100 times with random
initialization of loadings. The results for the discussed data set, expressed as the percentage of the
explained variance, are presented in bar form in figure 6-a.
Fig. 6. Monte Carlo study for data set X1 (outlier type T2, 20 % contamination): a) robust Tucker3 and b) Tucker3 model with random initialization of the loadings; c) robust Tucker3 and d) Tucker3 model with outliers randomly regenerated in each run.
The observed results show that the robust Tucker3 algorithm always converges to the proper solution,
and that the outlying objects do not influence the final Tucker3 model.
Analogous results for the (non-robust) Tucker3 model are presented in figure 6-b. They indicate that
the Tucker3 algorithm is highly influenced by outliers and, depending on the initialization of the
loadings, the algorithm converges to different solutions.
In the next step of our study, both algorithms, i.e. Tucker3 and robust Tucker3, were run 100 times,
each time on a different data set contaminated randomly with 20 % of outliers constructed
according to the chosen model (type T2). The results are presented in figure 6-c,d. The robust Tucker3
algorithm always leads to the proper model, not influenced by the outlying objects, whereas the Tucker3
models are highly influenced by them.
The calculations described above were performed for the data sets contaminated with different
percentages of outliers of different types. The final results, presented in figure 7, reveal that the
proposed robust version of the Tucker3 model works properly for data sets containing no more than
20% of outlying samples. The robust models constructed for data sets X1 and X2 with 20% of outliers,
i.e. data sets with a different percentage of explained variance, are not influenced by outliers.
Fig. 7. Final results of the Monte Carlo study for 20 % contamination (data sets X1 and X2, outlier types T1-T4). Top : set X1 (good model); bottom : set X2 (bad model).
The final results for data sets X1 and X2 with 40% of outliers are presented in figure 8. The robust
model performed properly only for two types of outliers (T2 and T4). The results for the types T1 and
T3 were strongly influenced by the procedure used to select the clean subset. Here the MVT results
are presented; those obtained with MCD are somewhat better.
Fig. 8. Final results of the Monte Carlo study for 40 % contamination (data sets X1 and X2, outlier types T1-T4). Top : set X1 (good model); bottom : set X2 (bad model).
Analogous calculations were performed for the data sets with clustering tendency. The results of the
Monte Carlo study for these data sets lead to the same conclusions.
While working with the highly contaminated data sets (40%), it was noticed that there is an essential
difference depending on the method used to select the clean subset. In figure 9, the results for X1 (40%
of outliers of type T1; simulation type 2) obtained with MVT and MCD are presented for illustrative purposes.
Fig. 9-a. Comparison of two algorithms for finding a clean subset. Multivariate trimming (MVT).
Fig. 9-b. Comparison of two algorithms for finding a clean subset. Minimum covariance determinant (MCD).
The observed differences between the MVT and MCD performance for highly contaminated data (40%) are
associated with the different breakdown points of these methods. MCD, with a breakdown point of 50%,
performs better, but due to the relatively long computation time it requires, it was not used in the Monte
Carlo study.
4.2 - Real data set
The classical and robust Tucker3 algorithms were applied to the real data set. The results obtained for
patient #6 (the one without an outlying object) show (Fig. 10-a,b) that the classical and the robust Tucker3
models are equivalent for this normal patient.
Fig. 10-a. A, B and C loading matrices and convergence times for patient #6. Tucker3 model.
Fig. 10-b. A, B and C loading matrices and convergence times for patient #6. Robust Tucker3 model.
Moreover, convergence is equally fast in both cases. The results obtained for patient #9 with the classical
Tucker3 model (Fig. 11-a) already spot object #12 as an outlier on the A loading plot (corresponding
to the electrodes dimension). This is even more obvious when using the robust version of the algorithm
(Fig. 11-b), as the scale is different.
Fig. 11-a. A, B and C loading matrices and convergence times for patient #9. Tucker3 model.
Fig. 11-b. A, B and C loading matrices and convergence times for patient #9. Robust Tucker3 model.
In the case of the robust Tucker3, the loadings on B and C are not influenced anymore by electrode #12
as the corresponding slice of the matrix is not used in the model construction. For patient #6, the
residuals obtained for the 1st mode (electrodes dimension) with the classical method (Fig. 12-a) and the
robust method (Fig. 12-b) show the same pattern. The situation is very different for patient #9. For the
classical Tucker3 model, the residuals for electrode #12 (Fig. 12-c) are not higher than the residuals of
other points corresponding to good electrodes. The outlying electrode is therefore not revealed by the
model residuals. For the robust Tucker3 model, the residuals for electrode #12 (Fig. 12-d) are extremely
high, and the outlier can be found and eliminated. In the robust Tucker3 approach, the loadings on A, B,
and C are truly robust. The reconstruction is good for all of the points except electrode #12.
Fig. 12. Residuals obtained for the reconstruction of the objects on the 1st mode (12 electrodes) : a) Patient #6, Tucker3 model. b) Patient #6, robust Tucker3 model. c) Patient #9, Tucker3 model. d) Patient #9, robust Tucker3 model.
5 – Conclusion
The performed study shows that the robust version of the Tucker3 model always converges to a good
solution when the data are contaminated with up to 20% of outliers. For 40% contamination, the algorithm
converges to a good solution only for two types of outliers (T2 and T4). It can be concluded that MCD
is a better algorithm than MVT for finding the clean subset. The robust Tucker3 algorithm also gives good
results for the real data set.
ACKNOWLEDGEMENT
Professor Massart thanks the FWO project (G.0171.98) and the EU project NWAYQUAL (G6RD-CT-
1999-00039) for funding this research.
REFERENCES
[1] Y. L. Xie, J. H. Wang, Y. Z. Liang, L. X. Sun, X. H. Song, R. Q. Yu, J. Chemom., 7 (1993)
527-541.
[2] B. Walczak, D. L. Massart, Chemom. Intell. Lab. Syst., 27 (1995) 354-362.
[3] I. N. Wakeling, H. J. H. Macfie, J. Chemom., 6 (1992) 189-198.
[4] J. D. Carroll, J. J. Chang, Psychometrika, 35 (1970) 283-319.
[5] R. A. Harshman, UCLA Working Papers in Phonetics, 16 (1970) 1-84.
[6] L. R. Tucker, Problems in measuring change, The University of Wisconsin Press, Madison,
(1963) 122-137.
[7] L. R. Tucker, Psychometrika, 31 (1966) 279-311.
[8] C. A. Andersson, R. Bro, Chemom. Intell. Lab. Syst., 42 (1998) 93-103.
[9] L. P. Ammann, J. Am. Stat. Assoc., 88 (1993) 505-514.
[10] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, Wiley, New York,
1987.
[11] R. Gnanadesikan, J. R. Kettenring, Biometrics, 28 (1972) 81-124.
[12] P. J. Rousseeuw, K. Van Driessen, Technometrics, 41 (1999) 212-223.
[13] M. J. Aminoff, Electrodiagnosis in Clinical Neurology, second edition (Churchill Livingstone).
[14] H. H. Jasper, Electroencephalogr. Clin. Neurophysiol., 10 (1958) 370.
NEW TRENDS IN MULTIVARIATE ANALYSIS AND CALIBRATION
CONCLUSION
Chemometrics is by definition a discipline at the interface of several branches of science (chemistry,
statistics, process engineering, etc.). Chemometricians often have very different backgrounds, and our
discipline was over time enriched with many techniques from their respective original fields of research.
The most common chemometrical modelling methods, together with some more advanced ones, in
particular methods applying to data with complex structure, were presented in Chapter 1. Even from
this necessarily non-exhaustive introduction, it can be seen that a very wide range of methods is
available. The profusion of available options for the resolution of a given problem is usually the first
issue encountered by chemometricians during a typical study. The choice of the best method to be used
is very often done following subjective considerations such as personal preferences or software
availability. The second chapter of this thesis was an attempt to rationalise this step of method selection
in the process of building a multivariate calibration model. A part of the work had already been done,
covering the simplest and somewhat ideal case where the robustness of calibration methods is not
challenged. A very frequently occurring difficulty is extrapolation: in other words, a prediction has to
be made and the new sample lies outside the space covered by the calibration samples. From a purely
statistical point of view, the answer to the problem is simple: no model should be used to predict an
object outside the calibration domain. However, this problem can very often not be avoided when
models are used in real-life industrial applications. All possible sources of variance cannot be foreseen
when the model is constructed, and some are therefore not taken into account. The robustness of 14
methods toward extrapolation was studied using 5 reference data sets presenting challenging
characteristics often found in industrial data (non-linearity, inhomogeneity). Some important
conclusions were drawn from this study. First of all, it illustrated that the inevitable problem of
extrapolation can indeed be dealt with in industrial applications. Some general recommendations and
guidelines could also be made about the best method to be used, depending on the expected level of
extrapolation and the structure of the data set.
Another problem commonly occurs in real-life industrial conditions. Modifications in measurement
conditions, aging, maintenance, or replacement of an instrument can induce drift and changes in the
instrumental response. Most of the time, it is not possible to take these perturbations into account in the
calibration step. The quality of prediction for new samples can therefore be expected to degrade over
time. The second study presented in Chapter 2 aimed at evaluating the robustness of calibration
methods in the case of instrumental perturbations. It was performed on 12 multivariate calibration
methods, using the same 5 industrial data sets as the previous study, and by simulating 6 different
instrumental perturbations on the response obtained for the samples to be predicted. Some general
recommendations could be made, in particular about the type of model, in terms of complexity or
pre-treatment, that has to be avoided in order to increase robustness toward instrumental perturbations.
The third and final part of Chapter 2 follows naturally from the comparative studies just presented. It aims
at explaining, step by step, from data pre-processing to the prediction of new samples, how to develop a
calibration model. Even though this tutorial describes the construction of a Multiple Linear Regression
model on spectroscopic data, most of the strategy can be applied to other calibration methods and/or
data sets of different nature.
The third chapter of this thesis presents some specific case studies. The aim of this chapter is twofold.
First of all, the strategy and guidelines developed in Chapter 2 are applied to industrial data. The whole
chapter illustrates how an industrial process can be improved by proper use of chemometrical tools. It
also gives another illustration of the importance of the method selection step. Using another
instrument for data acquisition can have a dramatic influence on the multivariate calibration model
building process. Even though the studied process was the same and the nature of the spectroscopic
technique remained unchanged, the fact that an instrument with better resolution was used implied that
the best results were achieved by a different calibration method.
The second important aspect of this study is that it was performed on Raman spectroscopic data.
Sophisticated data treatment is usually not considered necessary for Raman data. Specialists in the field
mostly employ direct calibration, as opposed to inverse calibration methods used by chemometricians.
It was demonstrated that chemometrical tools can not only match the results obtained by the methods
classically used on Raman data, but can even outperform them. Whereas the classical methods could only
predict relative concentrations for the monitored chemical process, using inverse calibration it was for
the first time possible to evaluate absolute concentrations, moreover with a much better level of
precision.
The fourth chapter of this thesis continues the effort of broadening the field of applicability of
chemometrics. This chapter is devoted to methodologies used to deal with data that can be considered
original because of their structure and/or size. The first study in this chapter shows that data sets with
a very high number of variables can be treated very efficiently by new algorithms designed specifically
for such computationally intensive cases. Even though current computers have enough power and
speed to deal with very big matrices in relatively short amounts of time, the existence of such methods
can be very important in situations where the time factor is critical, for instance for online analysis or
Statistical Process Control (SPC).
The rest of this chapter was devoted to rather new techniques in the field of chemometrics: N-way
methods. These methods take into account data with a more complex structure than traditional
tables (2 dimensions). It is important to realise that the N-way structure is not only dealt with; it is
actually used to achieve a better understanding of the data structure and a more efficient extraction of
the information contained in the data. A case study on a pharmaceutical data set with very high
dimensionality (6 dimensions, over 225000 data points) showed that these methods (in particular the
Tucker3 model) are unmatched for the exploration of such data sets.
The study of this data set however confirmed that N-way models are just as sensitive to outliers or
extreme samples as classical methods. It was therefore investigated how the Tucker3 model could be
made more robust, and a methodology was proposed to this end. This methodology proved efficient
both on synthetic data sets and on the 6-way pharmaceutical data set.
Overall, this thesis confirmed that chemometrical methods can be applied to data coming from
spectroscopic techniques other than NIR, and of course also to non-spectroscopic data. As was illustrated
by the study of electro-encephalographic data with N-way models, new methods can help chemometrics
set foot in new fields of science. Another example of this phenomenon is the current merging of
chemometrics and Quantitative Structure-Activity Relationships (QSAR), which hopefully represents a
step forward in the direction of the unification of all branches of computational chemistry.
PUBLICATION LIST
“A Comparison of Multivariate Calibration Techniques Applied to Experimental NIR Data Sets. Part
III : Predictive Ability under Instrumental Perturbation Conditions.”
F. Estienne, L. Pasti, V. Centner, B. Walczak, F. Despagne, D. Jouan-Rimbaud, O.E. de Noord, D.L.
Massart.
Submitted for publication.
“Chemometrics and Modelling.”
F. Estienne, Y. Vander Heyden, D.L. Massart.
Chimia, 55 (2001) 70-80.
“Multi-way Modelling of Electro-encephalographic Data.”
F. Estienne, Ph. Ricoux, D. Leibovici, D.L. Massart.
Chemometrics and Intelligent Laboratory Systems, 58 (2001) 59-72.
“Multivariate Calibration with Raman data using Fast PCR and PLS methods.”
F. Estienne, D.L. Massart.
Analytica Chimica Acta, 450 (2001) 123-129.
“A Comparison of Multivariate Calibration Techniques Applied to Experimental NIR Data Sets. Part
II : Predictive Ability under Extrapolation Conditions.”
F. Estienne, L. Pasti, V. Centner, B. Walczak, F. Despagne, D. Jouan-Rimbaud, O.E. de Noord, D.L.
Massart.
Chemometrics and Intelligent Laboratory Systems, 58 (2001) 195-211.
“Robust Version of Tucker 3 Model.”
V. Pravdova, F. Estienne, B. Walczak, D.L. Massart.
Chemometrics and Intelligent Laboratory Systems, 59 (2001) 75-88.
“Multivariate Calibration with Raman Spectroscopic Data : A Case Study.”
F. Estienne, N. Zanier, P. Marteau, D.L. Massart.
Analytica Chimica Acta, 424 (2000) 185-201.
“The Development of Calibration Models for Spectroscopic Data Using Principal Component
Regression.”
R. De Maesschalck, F. Estienne, J. Verdú-Andrés, A. Candolfi, V. Centner, F. Despagne, D. Jouan-
Rimbaud, B. Walczak, D.L. Massart, S. de Jong, O.E. de Noord, C. Puel, B.M.G. Vandeginste.
Internet Journal of Chemistry, 2 (1999) 19.