Multivariate Data Analysis

Introduction to Multivariate Data Analysis
Principal Component Analysis (PCA)
Multivariate Linear Regression (MLR, PCR and PLSR)

Laboratory exercises:
Introduction to MATLAB
Examples of PCA (cluster analysis of samples, identification and geographical distribution of contamination sources/patterns…)
Examples of Multivariate Regression (prediction of concentration of chemicals from spectral analysis, investigation of correlation patterns and of the relative importance of variables, …)

Romà Tauler (IDAEA, CSIC, Barcelona) [email protected]
Multivariate Data Analysis

Introduction to Multivariate Data Analysis
Principal Component Analysis (PCA)
Multivariate Linear Regression (MLR, PCR and PLSR)

Laboratory exercises:
Introduction to MATLAB
Examples of PCA (cluster analysis of samples, identification and geographical distribution of contamination sources/patterns…)
Examples of Multivariate Regression (prediction of concentration of chemicals from spectral analysis, investigation of correlation patterns and of the relative importance of variables, …)
A box plot summarizes the information on the data distribution primarily in terms of the median, the upper quartile, and the lower quartile. The "box" by definition extends from the upper to the lower quartile, and a dot or line within the box marks the median. The extent of the box, i.e. the distance between the upper and lower quartiles, is equal to the interquartile range and is a measure of spread. The median is a measure of location, and the relative distances of the median from the upper and lower quartiles are a measure of symmetry "in the middle" of the distribution. For example, the median is approximately in the middle of the box for a symmetric distribution, and is positioned toward the lower part of the box for a positively skewed distribution.
Probability distributions: Box plots

[Figure: box plots of monthly Tucson precipitation, P (in), January–July, with the median, upper quartile, lower quartile and interquartile range (iqr) annotated.]
"Whiskers" are drawn outside the box at what are called the "adjacent values." The upper adjacent value is the largest observation that does not exceed the upper quartile plus 1.5·iqr, where iqr is the interquartile range. The lower adjacent value is the smallest observation that is not less than the lower quartile minus 1.5·iqr. If no data fall outside this 1.5·iqr buffer around the box, the whiskers mark the data extremes. The whiskers also give information about symmetry in the tails of the distribution.
Probability distributions: Box plots

[Figure: box plots of monthly Tucson precipitation, P (in), January–July, with the whiskers and the interquartile range (iqr) annotated.]
For example, if the distance from the top of the box to the upper whisker exceeds the distance from the bottom of the box to the lower whisker, the distribution is positively skewed in the tails. Skewness in the tails may be different from skewness in the middle of the distribution. For example, a distribution can be positively skewed in the middle and negatively skewed in the tails.
Any points lying outside the 1.5·iqr buffer around the box are marked by individual symbols as "outliers". These points are outliers in comparison to what is expected from a normal distribution with the same mean and variance as the data sample. For a standard normal distribution, the median and mean are both zero, and q(0.25) = −0.67449, q(0.75) = 0.67449, iqr = q(0.75) − q(0.25) = 1.349, where q(0.25) and q(0.75) are the first and third quartiles and iqr is the interquartile range. We see that the whiskers for a standard normal distribution are at the data values: upper whisker = 2.698, lower whisker = −2.698.
Probability distributions: Box plots

[Figure: box plots of monthly Tucson precipitation, P (in), January–July, with the outliers marked by individual symbols.]
From the cdf of the standard normal distribution, we see that the probability of a value lower than x = −2.698 is about 0.0035. This result shows that for a normal distribution, roughly 0.35 percent of the data are expected to fall below the lower whisker. By symmetry, 0.35 percent of the data are expected above the upper whisker. These data values are classified as outliers. Exactly how many outliers might be expected in a sample of normally distributed data depends on the sample size. For example, with a sample size of 100, we expect no outliers, as 0.35 percent of 100 is much less than 1. With a sample size of 10,000, however, we would expect about 35 positive outliers and 35 negative outliers for a normal distribution.
Probability distributions: Box plots
For a normal distribution:
>> varnorm = randn(10000,3);
>> boxplot(varnorm)

0.35% of 10000 is approximately 35 outliers on each whisker side.
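A minimal MATLAB sketch (not part of the original slides; the variable names are illustrative) that checks this expected outlier count by applying the 1.5·iqr whisker rule to simulated standard-normal data; quantile and boxplot assume the Statistics Toolbox used in the laboratory exercises:

x = randn(10000,1);                    % one standard normal variable
q = quantile(x,[0.25 0.75]);           % lower and upper quartiles
w = 1.5*(q(2)-q(1));                   % 1.5 times the interquartile range
nLow  = sum(x < q(1)-w);               % points below the lower whisker
nHigh = sum(x > q(2)+w);               % points above the upper whisker
fprintf('%d low and %d high outliers (about 35 expected on each side)\n', nLow, nHigh);
boxplot(x)                             % same kind of plot as in the slide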
Parametric vs Robust Statistics
Parametric: Mean, Standard Deviation → mean and standard deviation plots
Robust (Box Plot): Median, Minimum, Maximum, Interquartile Range (IQR) → median and IQR plots (box plots)

[Figure: mean/standard deviation plots and box plots of the Sn, Zn, Fe and Ni concentrations.]

These plots help to see the size and range scale differences between variables, and they suggest the use of appropriate data pretreatments to handle these differences.
Parametric vs Robust Statistics
Multivariate Statistics
• Covariance Matrix, S (m,m): it contains all the possible pairwise covariances between variables.

Description of the variable relationships

$$S = \begin{pmatrix} s^2_{11} & s^2_{12} & \cdots & s^2_{1m} \\ s^2_{21} & \cdots & \cdots & s^2_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ s^2_{m1} & \cdots & \cdots & s^2_{mm} \end{pmatrix}$$

Covariance: $s^2_{ij} = \dfrac{\sum_{l=1}^{n}(x_{li}-\bar{x}_i)(x_{lj}-\bar{x}_j)}{n-1}$

Variance: $s^2_{jj} = \dfrac{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}{n-1}$
Multivariate Statistics
• Correlation Matrix, C (m,m): it contains all the possible correlations between variables.
  – Diagonal elements are 1.

$$C = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1m} \\ r_{21} & \cdots & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & \cdots & \cdots & r_{mm} \end{pmatrix}$$

Correlation: $r_{ij} = \dfrac{s^2_{ij}}{s_i\,s_j}$, with $s_i = \sqrt{s^2_{ii}}$
Description of the variable relationships
Multivariate Statistics
Example: covariance matrix of the four metal concentrations

        Sn          Zn          Fe          Ni
Sn    0.047319   -0.002593    0.002518    0.000434
Zn   -0.002593    0.033644    0.000581   -0.000022
Fe    0.002518    0.000581    0.000494    0.000022
Ni    0.000434   -0.000022    0.000022    0.000004
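A minimal MATLAB sketch (illustrative; it uses the nine-sample Sn/Zn/Fe/Ni table shown later in this section) that reproduces this covariance matrix and the corresponding correlation matrix with the built-in cov and corrcoef functions:

X = [0.20 3.40 0.06 0.08; 0.20 2.40 0.04 0.06; 0.15 2.00 0.08 0.16;
     0.61 6.00 0.09 0.02; 0.57 4.20 0.08 0.06; 0.58 4.82 0.07 0.02;
     0.30 5.60 0.02 0.01; 0.60 6.60 0.07 0.06; 0.10 1.60 0.05 0.19];  % samples x [Sn Zn Fe Ni]
S = cov(X);                            % covariance matrix ((n-1) normalization), matches the table above
C = corrcoef(X);                       % correlation matrix, diagonal elements equal to 1
Xc = X - mean(X);                      % mean-centered data (implicit expansion, recent MATLAB)
disp(S - (Xc'*Xc)/(size(X,1)-1))       % ~zero: S equals Xc'*Xc/(n-1)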
CORRMAP: correlation map with variable grouping. CORRMAP produces a pseudocolor map which shows the correlation between variables (columns) in a data set.
Need
• The studied property is selectively correlated to a single variable only on a few occasions (lack of total selectivity).
• The studied property is determined by a set of variables with which it presents high correlation:

P = f(x1, x2, ..., xn)

Observations = Structure + Noise

Structure = part of the signal correlated with the sought property
Noise = all the other contributions: instrumental noise, experimental errors, other components, ...

Need: experimental measures contain information which is not relevant to the property of interest.

Multivariate Data Analysis
Causality vs Correlation
Correlation is a statistical concept which measures the linear relation between two variables.

Causality is a deterministic interpretation that comes from the problem or application.

Example: the number of storks and the number of newborn children in a geographical area.

Multivariate Data Analysis
Information
Initial hypothesis: the data contain the sought information. There exists a relationship that can be modelled between the measured variables and the measured property. When the variables change their value, the property will also change.

X (variables) -------> Y (property); model: Y = f(X)

X is a vector or a matrix (e.g. spectral measures); Y is a scalar, a vector or a matrix (e.g. analyte concentrations).

Multivariate Data Analysis
Visualization of original data: plot of the matrix rows and/or columns
Spectra set
[Figure: plots of the data matrix rows (spectra vs. variables, with an outlier sample marked), of the columns (vs. samples), and of rows and columns together as a 3D surface (samples × variables).]
• Detection of outlier samples/variables.
• Detection of scale and range variable differences.
• Systematic information (structure) is easily detected (instrumental responses).
• Difficult to interpret when the number of samples is high.
Visualization of original data: map of samples in the column space (variables)

Samples are drawn as points in the variable space. Similarities among samples can be detected (distances among samples).
Example: data matrix with samples m1, m2 and variables v1, v2, v3

$$X = \begin{pmatrix} 2 & 3 & 4 \\ 1 & 0 & 6 \end{pmatrix}$$

[Figure: samples m1 = (2,3,4) and m2 = (1,0,6) plotted as points in the 3D space of the variables v1, v2, v3.]
Map of variables in the row space (samples)

Variables are drawn as vectors in the sample subspace. Correlation among variables can be estimated (angle).
$$X = \begin{pmatrix} 2 & 3 & 4 \\ 1 & 0 & 6 \end{pmatrix}$$

r(vi, vj) = cos(vi, vj)
r = 1 → angle 0°
r = 0 → angle 90°

[Figure: variables v1 = (2,1), v2 = (3,0), v3 = (4,6) plotted as vectors in the 2D space of the samples m1, m2.]
Visualization of original data
Samples Sn Zn Fe Ni
1 0.2 3.4 0.06 0.08
2 0.2 2.4 0.04 0.06
3 0.15 2.0 0.08 0.16
4 0.61 6.0 0.09 0.02
5 0.57 4.2 0.08 0.06
6 0.58 4.82 0.07 0.02
7 0.30 5.60 0.02 0.01
8 0.60 6.60 0.07 0.06
9 0.10 1.60 0.05 0.19
Graphical representation of multivariate data in the variable space (3D)
Visualization of original data
[Figure: 3D scatter plot of the nine samples in the Zn–Fe–Sn variable space; two sample groups can be seen.]
There are two sample groups. Is the representation of a 4th variable critical? What about spaces with more than 3 dimensions?

• Qualitative approximations for a few variables – Chernoff faces.
• Efficient compression of the original space of variables – Principal Component Analysis (PCA).
Visualization of original data
• Chernoff faces.
  – It is easy to distinguish different features in human faces. Each sample is a Chernoff face.
  – Each face feature is a variable.
V. 1 High front face of the head
V. 2 Lower front face of the head
V. 3 Eyebrows
V. 4 Smile
Visualization of original data: what about spaces with more than 3 dimensions?
sample 7
Methods Classification
According to their goal
Exploration methods.
Discrimination and classification methods.
Correlation and regression methods.
Resolution methods.

According to data type

Based on original data.
Based on latent variables (factor analysis).

Multivariate Data Analysis
Exploration Methods
• Visualization of the information.
• Sample similarities and clusters.
• Correlations among variables.
• Outlier detection.
• Relevance of the measured variables. Selection.
• Principal Component Analysis (PCA).

Multivariate Data Analysis
Discriminant and Classification Methods
• Separation of the objects (samples) into defined groups or clusters (classes).
• Assignation of new objects to predefined classes.
• Detection of outlier objects not belonging to any group (class).
• PCA, SIMCA, LDA, PLS-DA, SVM, ...

Multivariate Data Analysis
Correlation and Regression Methods
• Finding relations between two blocks of variables.
• Modelling property changes from a group of variables.
• Prediction of a property from the indirect measurement of a group of variables correlated to it.
• Multilinear Regression (MLR), Principal Components Regression (PCR), Partial Least Squares Regression (PLS).
• Non-linear regression methods, kernel, SVM, ...

Multivariate Data Analysis
Factor Analysis based methods
• Factor: source of the observed data variance, of independent and defined nature.
• Extraction of the relevant factors (structure) of the data set. Noise filtering.
• Description of the data variance from basic factors.
• Identification of the chemical nature of these relevant factors.
• They modify the size and the range of the scale of the variables.
• They can be applied in the direction of the columns (variables) or of the rows (objects, samples).
• They are selected as a function of the data nature and of the information to be obtained.
• There is no optimal treatment; it depends on the chemical problem to be investigated.

Data pre-processing

Multivariate Data Analysis
1) Mean centering (axes translation)

On the data matrix columns: $x^*_{ik} = x_{ik} - \bar{x}_k$, with $\bar{x}_k = \dfrac{\sum_{i=1}^{I} x_{ik}}{I}$

On the data matrix rows: $x^*_{ik} = x_{ik} - \bar{x}_i$, with $\bar{x}_i = \dfrac{\sum_{k=1}^{K} x_{ik}}{K}$

Data pre-processing

Multivariate Data Analysis
2) Scaling

On the data matrix columns: $x^*_{ik} = \dfrac{x_{ik}}{s_k}$, with $s_k = \sqrt{\dfrac{\sum_{i=1}^{I}(x_{ik}-\bar{x}_k)^2}{I-1}}$

On the data matrix rows: $x^*_{ik} = \dfrac{x_{ik}}{s_i}$, with $s_i = \sqrt{\dfrac{\sum_{k=1}^{K}(x_{ik}-\bar{x}_i)^2}{K-1}}$

3) Autoscaling = mean centering + scaling

$$x^*_{ik} = \frac{x_{ik}-\bar{x}_k}{s_k} \;\text{(columns)}; \qquad x^*_{ik} = \frac{x_{ik}-\bar{x}_i}{s_i} \;\text{(rows)}$$

Data pre-processing

Multivariate Data Analysis
4) Normalization: $x^*_{ik} = \dfrac{x_{ik}}{c_i}$, with e.g. $c_i = \sum_{k=1}^{K} x_{ik}$, or $c_i = \sqrt{\sum_{k=1}^{K} x_{ik}^2}$ (norm), ...

5) Rotation: X* = Rᵀ X; e.g. in two dimensions:

$$\begin{pmatrix} x^*_1 \\ x^*_2 \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

Rᵀ: rotation matrix

Data pre-processing

Multivariate Data Analysis
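A minimal MATLAB sketch (illustrative; the small matrix is made up) of the column-wise pretreatments 1)-3) and the row normalization 4), written with base MATLAB functions and implicit expansion (recent MATLAB versions):

X  = [0.20 3.40 0.06 0.08;
      0.61 6.00 0.09 0.02;
      0.10 1.60 0.05 0.19];            % samples x variables
Xc = X - mean(X);                      % 1) mean centering of the columns
Xs = X ./ std(X);                      % 2) scaling by the column standard deviations
Xa = (X - mean(X)) ./ std(X);          % 3) autoscaling = centering + scaling (same result as zscore)
Xn = X ./ sqrt(sum(X.^2,2));           % 4) normalization: each row divided by its norm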
[Figure: plot of the original metal concentrations (Sn, Zn, Fe, Ni) vs. sample number.]

• Original data (without pretreatment). The scale, size and range of the variables are kept.

Data pre-processing

Multivariate Data Analysis
• Centered data: the mean of all the values of a variable is subtracted from each value of that variable.

Differences among variables due to scale size are eliminated.

[Figure: plot of the mean-centered metal concentrations (Sn, Zn, Fe, Ni) vs. sample number.]

Data pre-processing

Multivariate Data Analysis
• Autoscaled data: each value of the variable is centered and divided by the standard deviation of the values of that variable.

Differences among variables due to size and range are eliminated.

[Figure: plot of the autoscaled metal concentrations (Sn, Zn, Fe, Ni) vs. sample number.]

Data pre-processing

Multivariate Data Analysis
[Figure: three panels comparing the original, centered and autoscaled metal concentrations (Sn, Zn, Fe, Ni) vs. sample number.]

Original Data – Centered Data – Autoscaled Data

What samples/variables have higher values? What variables discriminate better? What is the correlation among the different variables?

Data pre-processing

Multivariate Data Analysis
• Centered data: each variable value is subtracted by its mean value.
  – The mean value of all variables is zero.

$$X_C(n,m) = \begin{pmatrix} x_{11}-\bar{x}_1 & x_{12}-\bar{x}_2 & \cdots & x_{1m}-\bar{x}_m \\ x_{21}-\bar{x}_1 & x_{22}-\bar{x}_2 & \cdots & x_{2m}-\bar{x}_m \\ \vdots & \vdots & & \vdots \\ x_{n1}-\bar{x}_1 & x_{n2}-\bar{x}_2 & \cdots & x_{nm}-\bar{x}_m \end{pmatrix}$$

Centered data and covariance (n: number of samples, m: number of variables):

$$S(m,m) = \frac{1}{n-1}\,X_C^T(m,n)\,X_C(n,m)$$

Data pre-processing

Multivariate Data Analysis
• Autoscaled data: each value of the variable is centered and divided by the standard deviation of the values of that variable.
  – The mean of all variables is 0.
  – The variance (dispersion) of all variables is 1.

Autoscaled data and correlation:

$$X_T(n,m), \qquad t_{ij} = \frac{x_{ij}-\bar{x}_j}{s_j}$$

$$C(m,m) = \frac{1}{n-1}\,X_T^T(m,n)\,X_T(n,m)$$

Data pre-processing

Multivariate Data Analysis
[Figure: spectra before (X, absorbance vs. wavelength) and after normalization (XN).]

Normalization: each vector (spectrum) value is divided by its length (norm):

$$x^N_{ij} = \frac{x_{ij}}{\sqrt{\sum_j x_{ij}^2}}$$

It equalizes the response intensities and allows a better comparison of the shapes.

Data pre-processing

Multivariate Data Analysis
• To eliminate changes due to instrumental variations without chemical information:
  – Smoothing: noise correction
  – 1st derivative: correction of constant variations
  – 2nd derivative: correction of linear variations
  – Peak alignments
  – Baseline corrections
  – Warping
  – …
Pairwise correlations are difficult to interpret when many variables are involved → need for multivariate data analysis tools.

[Figure: pair-wise correlations between variables across samples.]

CORRMAP: correlation map with variable grouping. CORRMAP produces a pseudocolor map which shows the correlation between variables (columns) in a data set.

Multivariate Data Analysis
• What is the SVD of a data matrix X?
  – Singular Value Decomposition.
  – Singular values are the square roots of the eigenvalues.
  – X = USVᵀ, where U and Vᵀ are orthonormal matrices and S is a diagonal matrix of singular values.
  – SVD is an orthogonal matrix decomposition.
  – The elements in S are ordered according to the variance explained by each component.
  – Variance is concentrated in the first components; this allows reducing the number of variables explaining the variance structure and filtering the noise.

$$X = USV^T + E, \qquad x_{ij} = \sum_{k=1}^{K} u_{ik}\,s_k\,v_{kj} + e_{ij}, \quad i = 1,\dots,I;\; j = 1,\dots,J;\; K \ll I \text{ or } J$$

Multivariate Data Analysis
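A minimal MATLAB sketch (illustrative, simulated data) of the SVD of a data matrix and of the variance explained by each component:

X = randn(20,3)*randn(3,8) + 0.05*randn(20,8);   % rank-3 structure plus noise
[U,S,V] = svd(X,'econ');                         % X = U*S*V'
sv = diag(S);                                    % singular values in decreasing order
explained = 100*sv.^2/sum(sv.^2);                % % of variance per component
disp([sv explained])                             % a clear drop appears after the 3rd component
plot(sv,'o-'); xlabel('component'); ylabel('singular value')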
Effect of data pretreatments on SVD: plot(svds)

[Figure: four panels showing the singular values of X for raw, mean-centered, scaled and autoscaled data; four larger components stand out.]

Multivariate Data Analysis
Methods of data pretreatment: effect of pretreatments

[Figure: singular values of the raw data X (4 components) and of the log10-transformed data X (how many components? non-linearity?).]

Multivariate Data Analysis
Multivariate Data Analysis

Introduction to Multivariate Data Analysis
Principal Component Analysis (PCA)
Multivariate Linear Regression (MLR, PCR and PLSR)

Laboratory exercises:
Introduction to MATLAB
Examples of PCA (cluster analysis of samples, identification and geographical distribution of contamination sources/patterns…)
Examples of Multivariate Regression (prediction of concentration of chemicals from spectral analysis, investigation of correlation patterns and of the relative importance of variables, …)
• They are mathematical variables which describe the data variance efficiently.
  – The relevant variance of the original data is described by a reduced number of components (PCs).
  – Visualization of large data sets (many variables) in the PC space of reduced dimensions.
• Information is not repeated (not overloaded; PCs are orthogonal).
• They describe the main directions of data variance in decreasing order.
• They are linear combinations of the original variables.

Principal Components (PCA)
• They are linear combinations of the original variables.

[Figure: samples plotted in the (x1, x2) plane with the direction of the first principal component t1.]

t1 = x1 p11 + x2 p21

p11, p21: loadings of the original variables in the first PC, t1.
PCA Model: relationship between the data in the PC space and in the original space.

T(n,2) = X(n,3) · P(3,2)

[t_j1  t_j2] = [x_j1  x_j2  x_j3] · P, with  t_j1 = x_j1 p11 + x_j2 p21 + x_j3 p31

T = XP

PCs (linear combinations of the original variables)
PCA Model
T (scores matrix):
• Describes the samples in the principal components space.
• The score vectors are orthogonal: tiᵀ tj = 0 (i ≠ j).

P (loadings matrix):
• Describes the original variables in the principal components space.
• The loading vectors are orthonormal: piᵀ pj = 0 (i ≠ j), ||pi|| = 1, PᵀP = I.

T = X P
(n,npc) = (n,m)(m,npc)

n (samples), m (variables), npc (principal components)
X = T Pᵀ

PCA Model: X = T Pᵀ + E
(scores, loadings/projections, residuals)

X = t1 p1ᵀ + t2 p2ᵀ + … + tn pnᵀ + E

n: number of components (<< number of variables in X); each term ti piᵀ is a rank-1 matrix.

PCA Model

Model: X = T Pᵀ + E
X = structure + noise

It is an approximation to the experimental data matrix X.

• Loadings, Projections: Pᵀ gives the relationships between the original variables and the principal components (eigenvectors of the covariance matrix). The vectors in Pᵀ (loadings) are orthonormal (orthogonal and normalized).
• Scores, Targets: T gives the relationships between the samples (coordinates of the samples or objects in the space defined by the principal components). The vectors in T (scores) are orthogonal.
• Noise: E, experimental error, non-explained variances.
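A minimal MATLAB sketch (illustrative, simulated data) of a PCA model computed through the SVD of the mean-centered data, giving scores T, loadings P and residuals E as in the model above:

X  = randn(30,2)*randn(2,10) + 0.1*randn(30,10); % samples x variables, two real components
Xc = X - mean(X);                                % mean centering
[~,~,V] = svd(Xc,'econ');
npc = 2;                                         % number of principal components kept
P = V(:,1:npc);                                  % loadings, orthonormal (P'*P = I)
T = Xc*P;                                        % scores, T = X*P
E = Xc - T*P';                                   % residuals (noise)
fprintf('explained variance: %.1f%%\n', 100*(1-norm(E,'fro')^2/norm(Xc,'fro')^2));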
PCA Model
Determination of the number of components
• When the expected experimental error is known:
  – Plots of explained or residual variance as a function of the number of components of the model.
    • E.g., are models explaining 95% of the variance satisfactory?
  – Compare the mean residual values with the experimental error size.
    • E.g., absorbance errors in UV are approx. 0.002.

$$\bar{e} = \sqrt{\frac{\sum_{i,j} e_{ij}^2}{n \times m}}$$

n × m: number of elements of the X matrix

Number of PCs → ē ≤ error (0.002).
PCA Model
• When the experimental error is unknown:
  – Plot of singular values (or of eigenvalues).
  – Empirical functions related to experimental errors.
  – Cross-validation methods.
Determination of the number of components
PCA Model
n chromatograms, m spectra

How many components coeluted? The number of coeluting components is deduced from the number of principal components.

Example: determination of the number of components
PCA Model
– Plot of singular values (s_k) or of functions of eigenvalues (λ_k):
  • Singular values or eigenvalues vs. number of PCs (s_k).
  • log(eigenvalues) vs. number of PCs (λ_k).
  • log(reduced eigenvalues) vs. number of PCs (REV_k).

$$REV_k = \frac{\lambda_k}{(r-k+1)(c-k+1)}, \qquad \lambda_k = s_k^2, \quad r: \text{number of rows}, \; c: \text{number of columns}$$

The size of s_k / λ_k / REV_k is proportional to the importance of the associated PC.

Determination of the number of components – PCA Model

[Figure: plots of log(eigenvalues) and log(REV) vs. number of PCs.]

4 significant components; the rest of the components are used to explain the experimental noise.

PCA Model – Determination of the number of components
– Evaluation of empirical functions related to the error.
  • Eigenvalue functions: they take advantage of the relation between the explained variance and the size of the eigenvalues.
  • These functions have minima, or considerable size changes, at the optimal number of PCs.

PCA Model – Determination of the number of components
Malinowski error functions

$$RSD = \sqrt{\frac{\sum_{k=n+1}^{c} \lambda^0_k}{r\,(c-n)}} \qquad\qquad IND = \frac{RSD}{(c-n)^2}$$

c: number of columns; r: number of rows; n: number of components; λ⁰_k: eigenvalue of component k

Indicator Function (IND)

Minimum IND → optimal number of PCs
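A minimal MATLAB sketch (illustrative, simulated data) of Malinowski's RSD and IND functions computed from the eigenvalues λk = sk²; the minimum of IND suggests the number of components:

X = randn(50,3)*randn(3,15) + 0.02*randn(50,15); % three real components plus noise
[r,c] = size(X);
lambda = svd(X).^2;                              % eigenvalues of X'X
RSD = zeros(c-1,1); IND = zeros(c-1,1);
for n = 1:c-1
    RSD(n) = sqrt(sum(lambda(n+1:c))/(r*(c-n))); % residual standard deviation
    IND(n) = RSD(n)/(c-n)^2;                     % indicator function
end
[~,nopt] = min(IND);
fprintf('IND minimum at %d components\n', nopt);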
PCA Model – Determination of the number of components

Malinowski error functions

$$RSD = \sqrt{\frac{\sum_{k=n+1}^{c} \lambda^0_k}{r\,(c-n)}}$$

Imbedded error: $IE = RSD\,\sqrt{\dfrac{n}{c}}$

Extracted error: $XE = RSD\,\sqrt{\dfrac{c-n}{c}}$

Statistical test of eigenvalues (Malinowski):

$$F(1,\,s-n) = \frac{\sum_{k=n+1}^{s}(r-k+1)(c-k+1)}{(r-n+1)(c-n+1)}\;\frac{\lambda_n}{\sum_{k=n+1}^{s}\lambda^0_k}$$

Indicator function: $IND = \dfrac{RSD}{(c-n)^2}$
Eigenvalues and REVs beyond 4 components have lower sizes. IND has a minimum at 4. RE lowers its size. The eigenvalue for PC 4 is significantly larger than those of the higher PCs.

PCA Model – Determination of the number of components
– Cross-validation methods:
  • A part of the data is used to build the model and another part of the data is described by this model. The optimal number of components is the one giving lower residuals in the description of the new data.
  • This procedure is repeated until all the samples have been used both to build the model and as the external data set.
  • The final results are the mean over all repetitions of the modelling/description of the non-included samples.

PCA Model – Determination of the number of components
1. Divide the data sample set into q subsets.
2. Build PCA models with q − 1 data subsets (X_model).
3. Use these PCA models to explain the external data subset (X_extern):
   i. Scores: T_extern = X_extern P.
   ii. Reproduction: X̂_extern = T_extern Pᵀ.
4. PRESS (Predictive Residual Sum of Squares) calculation using different numbers of PCs:
   i. For PC k: PRESS(k) = Σ_{i,j} (x̂_ij − x_ij)².
5. Repeat steps 1-4 until all q subsets have been used as external data sets.
6. Plot PRESS_cum vs. number of PCs:
   i. For PC k: PRESS_cum(k) = Σ_q PRESS(k).

q: number of PCA models
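A minimal MATLAB sketch (illustrative, simulated data) of the cross-validation procedure of steps 1-6: each of q subsets is left out in turn, projected on the PCA model of the remaining samples, reconstructed and accumulated into PRESS_cum:

X = randn(40,3)*randn(3,12) + 0.05*randn(40,12); % three components plus noise
[ns,~] = size(X); q = 5; maxPC = 8;
seg = mod(0:ns-1,q) + 1;                         % segment index of every sample
PRESScum = zeros(maxPC,1);
for s = 1:q
    Xmod = X(seg~=s,:); Xout = X(seg==s,:);      % model subset and external subset
    mu = mean(Xmod);
    [~,~,V] = svd(Xmod - mu,'econ');             % loadings of the PCA model
    for k = 1:maxPC
        P = V(:,1:k);
        Xhat = (Xout - mu)*(P*P') + mu;          % T_extern = Xextern*P, Xhat = T_extern*P'
        PRESScum(k) = PRESScum(k) + sum(sum((Xout - Xhat).^2));
    end
end
plot(PRESScum,'o-'); xlabel('number of PCs'); ylabel('cumulative PRESS')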
PCA Model – Determination of the number of components

[Diagram: each subset X_extern is left out in turn; PCA models are built on X_model1 … X_modeln with 1 PC, 2 PC, …, m PC; the PRESS values PRESS_11 … PRESS_nm are accumulated into PRESS_cum,1 … PRESS_cum,m, i.e. PRESS_cum,i = Σ_j PRESS_ji.]
PCA Model – Determination of the number of components (cross-validation)

[Figure: cumulative PRESS vs. number of PCs.]

PC Nr.   PRESS_cum
1        3.4772
2        0.2320
3        0.1117
4        0.0505
5        0.0515
6        0.0517
7        0.0521
8        0.0524
9        0.0535
10       0.0531

Optimal number of PCs → minimum value of PRESS
Model fitting – PC reliability – Model reliability

How many principal components?

[Figure: explained variance vs. number of PCs.]

Higher PCs explain data noise. A PCA model with noisy PCs is less reliable when it describes new data. With more PCs in the model the data fitting is better, but the model reliability when it is applied to new data may be worse (overfitting).

PCA Model – Determination of the number of components
• The distance between samples shows their similarity (5, 6 and 4, 8 are very similar).
• Detection of sample groups (clusters) (I and II).
• External information can help to identify the nature of the detected groups (e.g. sample origin, ...).
• Very distant samples are extreme samples.

[Figure: samples map, scores plot of the nine samples on PC1 vs. PC2, showing two groups I and II.]

Original data

Example: PCA model visualization
Loadings plot

• Relevant variables for the model have high loadings, far from the origin (Zn, Sn).
• Variables close to the origin do not give information about the data variance (Fe, Ni).
• A high loading on one PC shows a high weight in that component (Zn – PC1, Sn – PC2).
• Variables correlated with the lower PCs have more importance.
• Correlations between variables are described by their angle (Zn and Sn are little correlated).
• Positive (direct) and negative (indirect) correlations between variables can be detected.

[Figure: map of variables, loadings plot of Sn, Zn, Fe, Ni on PC1 vs. PC2.]

Original Data

Example: PCA model visualization
• Samples close to the origin are similar to the average sample. This cannot be distinguished when they are separated in groups.
• The importance of the variables changes because centering eliminates the weight of the scale size. Sn, related with PC1, is now more important than Zn.

Centered data

[Figure: scores plot (samples 1–9) and loadings plot (Sn, Zn, Fe, Ni) on PC1 vs. PC2 for the centered data.]

Example: PCA model visualization
• Autoscaling eliminates the scale size and range. The effect of all variables is enhanced.
• The correlation information among variables is seen more clearly.

Autoscaled data

[Figure: scores plot (samples 1–9) and loadings plot (Sn, Zn, Fe, Ni) on PC1 vs. PC2 for the autoscaled data.]

Example: PCA model visualization
• Objects (samples) close to the origin are 'ordinary' objects.
• Objects (samples) far from the origin are 'extreme' objects or 'outliers'.
• Objects close to each other are 'similar' objects.
• Objects far from each other are 'different' objects.
• Objects can be grouped in 'clusters' which have common characteristics, different from the characteristics of other 'clusters' ==> Cluster Analysis.
• The set of objects should cover the whole scores plot; otherwise there are 'clusters'.
• Principal Components identification may be achieved from the external identification of the clusters of objects.
• Simultaneous analysis of 'loadings' and 'scores' plots helps to identify/interpret the principal components.

Interpretation of the 'scores' (targets)
Principal Components (PCA)
Scores and Loadings
• ‘scores’ show the relationships between samples
• ‘loadings’ show the relationships between variables
• ‘scores’ and ‘loadings’ should be interpreted in pairs
• 'scores' and 'loadings' should be plotted one against the other
Principal Components (PCA)
Outliers
• Outlier samples can have a great influence (leverage) on the PCA model.
• They can be detected.
• To detect them, look for:
  – isolated samples in scores plots
  – samples with large values of Q or T², or both
Principal Components (PCA)
Detection of anomalous objects

Extreme objects:
• Different from the rest of the objects.

Outlier objects:
• Extreme objects that cannot be fitted by the model.

[Figure: samples in the (x1, x2) plane with the PC1 direction; an extreme object and an outlier object are marked.]

An object (sample) is extreme or an outlier when one or more variables have values very different from those of the other samples.

Principal Components (PCA)
Outlier detection
Why is it needed?
• They distort the model.
• They hide the structure of the rest of the data.
When should they be eliminated, and how?
• If it is justified mathematically and chemically.
• It should be done gradually, starting with the most extreme ones.
X = t1 p1ᵀ + t2 p2ᵀ + … + tn pnᵀ + E (each term ti piᵀ has rank 1)

n: number of components (<< number of variables in X)
Principal Component Analysis (PCA)
Model: X = T Pᵀ + E
X = structure + noise

It is an approximation to the experimental data matrix X.

• Loadings, Projections: Pᵀ gives the relationships between the original variables and the principal components (eigenvectors of the covariance matrix). The vectors in Pᵀ (loadings) are orthonormal (orthogonal and normalized).
• Scores, Targets: T gives the relationships between the samples (coordinates of the samples or objects in the space defined by the principal components). The vectors in T (scores) are orthogonal.
• Noise: E, experimental error, non-explained variances.
Principal Component Analysis (PCA)
PCA Model: X = T Pᵀ + E
Determination of the number of principal components, A

X(n,m): when n > m, m is the maximum number of PCs, Amax = m; when m > n, n is the maximum number of PCs, Amax = n.

In general, a much smaller number of PCs is used: 'data compression', 'data reduction', A << n or m.

A is chosen so that the variance in T Pᵀ contains most of the relevant structure of X, whereas the noise remains in E (the noise does not interest us! we want to filter it! ...).

To select the appropriate number of PCs, A, the residuals E (lack of fit, ...) have to be studied: quantitation of the variance in E, e.g. by the residual variance in %.
Principal Component Analysis (PCA)
Determination of the number of principal components
a) Visual inspection of the magnitude of the singular values; graphical representation (search for an inflexion).
b) Representations of the explained/residual variance with respect to the number of principal components.
c) For autoscaled data, keep components until their λ ≈ 1-2.
d) When the noise level is known, select the number of PCs until the residual variance is similar to the noise variance.
e) Consider PCs as long as the 'loadings' have structural features (not noise).
f) Use statistical tests and methods based on previous knowledge of the experimental noise size.
g) Approximate methods when the experimental noise is not known.

Determination of the number of PCs from eigenvalue/singular value plots

[Figure: singular value plot showing 4 components.]
Principal Component Analysis (PCA)
Model fitting – PC reliability – Model reliability

How many principal components?

With more PCs in the model the data fitting is better, but the model reliability when it is applied to new data may be worse (overfitting).

[Figure: explained variance vs. number of PCs.]
Principal Component Analysis (PCA)
Cross-validation methods:
- A data subset is eliminated from the original data matrix X → Xr.
- A number of components k is estimated for Xr.
- The eliminated data subset is predicted with k components and the predicted values are compared with the actual values.

Determination of the number of principal components

PCA on Xr with k PCs: Xr = Tk Pkᵀ
Projection and reconstruction of the eliminated rows x: x̂ = x Pk Pkᵀ
Evaluation of (x − x̂)
Principal Component Analysis (PCA)
Cross-validation methods:
- A data subset is eliminated from the original data matrix X → Xr.
- A number of components k is estimated for Xr.
- The eliminated data subset is predicted with k components and the predicted values are compared with the actual values.

$$PRESS(k) = \sum_{i=1}^{r}\sum_{j=1}^{c}\left(x_{ij} - \hat{x}_{ij}(k)\right)^2$$

PRESS is plotted for the different numbers of considered components k, and the minimum value of PRESS, or the point where it does not decrease any more, is looked for.

Determination of the number of principal components
Principal Component Analysis (PCA)
Loadings

[Figure: data points in the (x1, x2) plane with the PC1 direction; the x1 and x2 loadings are the projections of PC1 onto the variable axes.]
Loadings are orthonormal, PTP = I and PT = P-1
Principal Component Analysis (PCA)
Loadings interpretation
• Determination of the more important variables in the formation of the principal components (those variables with large loadings are important, either negative or positive).
• Identification and qualitative information (fingerprinting) on the variation sources.
Principal Component Analysis (PCA)
Scores

[Figure: data points in the (x1, x2) plane with the PC1 direction; the score t1 of a sample is its projection onto PC1.]

Projection of X on the PCs (loadings) gives the 'scores': T = XP
Principal Component Analysis (PCA)
• Objects (samples) close to the origin are 'ordinary' objects.
• Objects (samples) far from the origin are 'extreme' objects or 'outliers'.
• Objects close to each other are 'similar' objects.
• Objects far from each other are 'different' objects.
• Objects can be grouped in 'clusters' which have common characteristics, different from the characteristics of other 'clusters' ==> Cluster Analysis.
• The set of objects should cover the whole scores plot; otherwise there are 'clusters'.
• Principal Components identification may be achieved from the external identification of the clusters of objects.
• Simultaneous analysis of 'loadings' and 'scores' plots helps to identify/interpret the principal components.

Interpretation of the 'scores' (targets)
Principal Component Analysis (PCA)
Scores and Loadings
• ‘scores’ show the relationships between samples
• ‘loadings’ show the relationships between variables
• ‘scores’ and ‘loadings’ should be interpreted in pairs
• 'scores' and 'loadings' should be plotted one against the other
Principal Component Analysis (PCA)
PCA Statistics

Residuals statistic to measure the lack of fit (large residuals):

Q_i = e_i e_iᵀ = x_i (I − P_k P_kᵀ) x_iᵀ

Samples with large Q_i values are unusual (they are out of the model!!!!)

Hotelling statistic T²:

T_i² = t_i λ⁻¹ t_iᵀ = x_i P_k λ⁻¹ P_kᵀ x_iᵀ

Samples with large values of T_i² are unusual (they are inside the model, with high leverage!!!!)

These statistics are used to develop control charts and limits in Statistical Process Control.
Principal Component Analysis (PCA)
PCA Statistics

Q_i = e_i e_iᵀ = x_i (I − P_k P_kᵀ) x_iᵀ : variation outside the PCA model

T_i² = t_i λ⁻¹ t_iᵀ = x_i P λ⁻¹ Pᵀ x_iᵀ : variation inside the PCA model
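A minimal MATLAB sketch (illustrative, simulated data) of the Q and Hotelling T² statistics of each sample for a k-component PCA model; here the score variances are used as the λ values in T²:

X  = randn(30,2)*randn(2,8) + 0.05*randn(30,8);
Xc = X - mean(X); n = size(Xc,1);
[~,S,V] = svd(Xc,'econ'); k = 2;
P = V(:,1:k); T = Xc*P;                          % loadings and scores
lambda = diag(S(1:k,1:k)).^2/(n-1);              % variance of each score vector
E  = Xc - T*P';                                  % part of X outside the model
Q  = sum(E.^2,2);                                % Q_i = e_i*e_i'
T2 = sum((T.^2)./lambda',2);                     % T_i^2 = t_i*diag(1./lambda)*t_i'
plot(T2,Q,'o'); xlabel('Hotelling T^2'); ylabel('Q residual')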
Principal Component Analysis (PCA)
Outliers
• Outlier samples can have a great influence (leverage) on the PCA model.
• They can be detected.
• To detect them, look for:
  – isolated samples in scores plots
  – samples with large values of Q or T², or both
Principal Component Analysis (PCA)
Outliers in scores plots
[Figure: scores on PC1 vs. scores on PC2; a few isolated samples appear as outliers.]
Principal Component Analysis (PCA)
Detection of 'outliers'
From scores plots ==> outlier samples
From loadings plots ==> outlier variables
Leverage: samples or variables affecting the PCA model very much. It is evaluated from the expression:

$$h_i = \frac{1}{ns} + \sum_{k=1}^{n} \frac{t_{i,k}^2}{\lambda_k}$$

h_i: 'leverage' of sample i; t_{i,k}: 'score' of sample i on the k-th component; λ_k: singular value of the k-th component; ns: number of samples; n: number of considered components
Principal Component Analysis (PCA)
Multivariate Data Analysis

Introduction to Multivariate Data Analysis
Principal Component Analysis (PCA)
Multivariate Linear Regression (MLR, PCR and PLSR)

Laboratory exercises:
Introduction to MATLAB
Examples of PCA (cluster analysis of samples, identification and geographical distribution of contamination sources/patterns…)
Examples of Multivariate Regression (prediction of concentration of chemicals from spectral analysis, investigation of correlation patterns and of the relative importance of variables, …)
Multiple linear regression (MLR) is a method used to model the linear relationship between a dependent variable (predictand) and one or more independent variables (predictors).

MLR is based on least squares: the model is fit such that the sum-of-squares of the differences between observed and predicted values is minimized.

The performance of the model on data not used to fit the model is usually checked in some way by a process called validation.

The reconstruction is a "prediction" in the sense that the regression model is applied to generate estimates of the predictand variable different from those used to fit the data. The uncertainty in the reconstruction is summarized by confidence intervals, which can be computed in various alternative ways.
Multivariate (Multiple) Linear Regression (MLR)
MLR Model

$$y_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \dots + b_K x_{i,K} + e_i$$

x_{i,j} = value of the j-th predictor in sample i
b_0 = regression constant
b_j = coefficient on the j-th predictor
K = total number of predictors
y_i = predictand in sample i
e_i = error term

In vector-matrix form: y = Xb + e
Multivariate Linear Regression
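A minimal MATLAB sketch (illustrative, simulated data) of fitting the MLR model y = Xb + e by least squares, with a column of ones carrying the regression constant b0:

n = 25; K = 3;
X = randn(n,K);                                  % predictors
y = [ones(n,1) X]*[1; 0.5; -2; 3] + 0.1*randn(n,1);   % simulated predictand
Xd = [ones(n,1) X];                              % design matrix with the intercept column
b  = Xd\y;                                       % least-squares estimates [b0; b1; ...; bK]
e  = y - Xd*b;                                   % residuals
R2 = 1 - sum(e.^2)/sum((y-mean(y)).^2);          % explanatory power
fprintf('R^2 = %.3f\n', R2);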
MLR Predictions

$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{i,1} + \hat{b}_2 x_{i,2} + \dots + \hat{b}_K x_{i,K}$$

x_{i,k} = value of the k-th predictor in new sample i
b̂_0, b̂_1, ..., b̂_K = estimated regression constant and coefficients
ŷ_i = predicted value for new sample i

In matrix-vector form: ŷ = X b̂
Multivariate Linear Regression
MLR Prediction

$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{i,1} + \hat{b}_2 x_{i,2} + \dots + \hat{b}_K x_{i,K}$$

Measurement i might be outside the range used for calibration or validation.
Multivariate Linear Regression
MLR Residuals

$$e_i = y_i - \hat{y}_i$$

y_i = observed value of the predictand in sample i
ŷ_i = predicted value of the predictand in sample i
Multivariate Linear Regression
MLR Assumptions
1. Relationships are linear
2. Predictors are nonstochastic
3. Residuals have zero mean
4. Residuals have constant variance
5. Residuals are not autocorrelated
6. Residuals are normally distributed
Multivariate Linear Regression
1. Relationship may be nonlinear or outlier-driven
5. Problems when X variables (predictors) are strongly correlated (common in practice)
Caveats/Problems to interpretation
Multivariate Linear Regression
Alternatives to MLR

Nonlinearity?
• Data transformation, use a kernel and try MLR again
• Neural networks
• Nonparametric regression (e.g., kernel regression)
• Quadratic response surfaces

Categorical predictand?
• Discriminant analysis
• Classification trees
• Logistic regression

Correlation among predictors?
• Reduce the number of variables
• Stepwise regression
• Factor-analysis-based methods

Multivariate Linear Regression
MLR Statistics
• R2 -- explanatory power
• Adjusted R2: R2 adjusted for loss of degrees of freedom due to number of predictors in model
• F and its p-value -- significance of the equation
• se: standard error of the estimate; equivalent to the "root mean square error" (RMSEc); the subscript "c" denotes "calibration"

• Confidence intervals for the parameters
Multivariate Linear Regression
MLR ANOVA Table (testing linearity)

Source               df        Sum of Squares   Mean squares
Regression (model)   K         SSR              MSR = SSR/K
Residual             n−K−1     SSE              MSE = SSE/(n−K−1)
Total                n−1       SST
Multivariate Linear Regression
Validating the MLR regression model
• Regression R-squared, even if adjusted for the loss of degrees of freedom due to the number of predictors in the model, can give a misleading, overly optimistic view of the accuracy of prediction when the model is applied outside the calibration period.
• Several approaches to validation are available. Among these are cross-validation and split-sample validation.
• In cross-validation, a series of regression models is fit, each time deleting a different observation from the calibration set and using the model to predict the predictand for the deleted observation. The merged series of predictions for deleted observations is then checked for accuracy against the observed data.
• In split-sample calibration, the model is fit to some portion of the data (say, the second half), and accuracy is measured on the predictions for the other half of the data. The calibration and validation periods are then exchanged and the process repeated.
Multivariate Linear Regression
Model Calibration vs Validation

Calibration:
1. Fitting the model to the data
2. "calibration", "construction", "estimation" data
3. Accuracy statistics: {R², Ra²}, SSEc, MSEc, RMSEc

Validation:
1. Testing the model on data not used to fit the model
2. "validation", "verification", "independent" data
3. Accuracy statistics: RE, SSEv, MSEv, RMSEv

Definitions: validation, cross-validation, split-sample validation, mean square error (MSE), root-mean-square error (RMSE), standard error of prediction, PRESS statistic, "hat" matrix, extrapolation vs. interpolation. Advantages of cross-validation over alternative validation methods.
Multivariate Linear Regression
Cross-validation stopping rule

[Figure: validation error vs. number of predictors; "Stop here" marks the point where the validation error stops improving.]
Multivariate Linear Regression
Error bars for MLR predictions
1. Standard error of the estimate (calibration statistic)
2. Standard error of prediction (calibration statistic)
3. Root-mean-square error of validation (validation statistic)
Hierarchy
Multivariate Linear Regression
Standard Error of MLR Prediction (equation for simple linear regression)

$$s_{\hat{y}^*} = s_e \left[ 1 + \frac{1}{n} + \frac{(x^*-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \right]^{1/2}$$

Standard error of the estimate: $s_e = \sqrt{MSE} = RMSE_c$

The term 1/n is due to the uncertainty in the estimate of the predictand mean; it is a function of the sample size n.
The last term is due to the departure of the predictor value for the predicted observation from the predictor mean of the calibration period.
Multivariate Linear Regression
Univariate linear regression

$$y_i = b_0 + b_1 x_i + e_i, \qquad y_i = f(x_i), \qquad s^2_y \gg s^2_x \approx 0$$

$$b_1 = \frac{\sum_i (x_i-\bar{X})(y_i-\bar{Y})}{\sum_i (x_i-\bar{X})^2}$$

$$s^2_{b_1} = \frac{s^2_y}{\sum_i (x_i-\bar{X})^2} = \frac{s^2_y}{S_{XX}}, \qquad s^2_{b_0} = s^2_y\left(\frac{\sum_i x_i^2}{n\sum_i (x_i-\bar{X})^2}\right)$$
Least Squares and linear regression
Ordinary Least Squares (OLS) – Multivariate Linear Regression

n experimental measures, m variables
X = {x_ij}, n × m, independent variables
y = {y_1, y_2, ..., y_n}, dependent variable
b = {b_0, b_1, b_2, ..., b_m}, parameters of the linear model
Assumption: experimental errors are only important for y_i

y = X b,   b = (XᵀX)⁻¹Xᵀy
s²(b) = (AᵀA)⁻¹ s²(y) = (XᵀX)⁻¹ s²(y)

where: A = {A_ij}; A_ij = ∂r_i/∂b_j = {−X_ij}; r_i = y_i − y_c,i (residuals), and s²(y) is estimated from:

s²(y) = Σ r_i² / (n − m)
Weighted Least Squares (WLS) – Multivariate Linear Regression

n experimental measures
X = {x_ij}, n × m, independent variables
y = {y_1, y_2, ..., y_n}, dependent variable
b = {b_0, b_1, b_2, ..., b_m}, parameters of the linear model
W = {w_ij}, weights for each value, considering the errors (error standard deviations s_ij; w_ij = 1/s_ij)

unweighted:  y = X b,  b = (XᵀX)⁻¹Xᵀy,   s²(b) = (AᵀA)⁻¹ s²(y)
weighted:    y = X b,  b = (XᵀWX)⁻¹XᵀWy, s²(b) = (AᵀWA)⁻¹ s²(y)

where: A = {A_ij}; A_ij = ∂r_i/∂b_j; r_i = y_i − y_c,i (residuals), and s²(y) is estimated from:

s²(y) = Σ w_i² r_i² / (n − m)
Generalized Least Squares (GLS) – Multivariate Linear Regression

n experimental measures
X = {x_ij}, n × m, independent variables
y = {y_1, y_2, ..., y_n}, dependent variable
b = {b_0, b_1, b_2, ..., b_m}, parameters of the linear model
M = {m_ij}, weights for each value, calculated from the errors in X and y (it is more complex!)

y = X b,  b = (XᵀMX)⁻¹XᵀMy,  s²(b) = (AᵀMA)⁻¹ s²(y)

where: A = {A_ij}; A_ij = ∂r_i/∂b_j; r_i = y_i − y_c,i (residuals), and s²(y) is estimated from:

s²(y) = Σ r_i² / (n − m)
Interpolation vs Extrapolation
Interpolation: prediction based on predictor data "similar" to that in the calibration range.

Extrapolation: prediction based on predictor data "unlike" that in the calibration range.
Multivariate Linear Regression
Classifying MLR predicted values

"Hat" matrix, computed from the calibration-only predictor data:

$$H = X(X^TX)^{-1}X^T$$

Classification statistic for the vector of predictor data x* of some new observation (possibly outside the calibration range):

$$h^* = x^{*T}(X^TX)^{-1}x^*$$

Rule identifying "extrapolation": h* > h_max, where h_max is the maximum value along the diagonal of the hat matrix.
Multivariate Linear Regression
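A minimal MATLAB sketch (illustrative) of the hat-matrix rule above for flagging extrapolation of a new observation x*:

X = [ones(20,1) randn(20,2)];                    % calibration predictor matrix (with intercept)
H = X*((X'*X)\X');                               % hat matrix of the calibration data
hmax = max(diag(H));                             % largest calibration leverage
xnew = [1 4 -3];                                 % a new observation, far from the calibration data
hstar = xnew*((X'*X)\xnew');                     % classification statistic h*
if hstar > hmax, disp('extrapolation'), else, disp('interpolation'), end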
Conventions & Notation in Calibration

Data are arranged in two blocks/tables/matrices X and Y where:
X = matrix of predictor variables
Y = matrix of predicted (predictand) variables
ns = number of samples/observations
nx = number of variables in X
ny = number of variables in Y
n = number of PCs/latent variables/components

[Diagram: matrix X (ns samples × nx predictor variables) and matrix Y, y (ns samples × ny predicted variables).]

Y = f(X) — find f
Multivariate Linear Regression
Multivariate Linear Regression Causal vs Predictive Model
Causal models: X = f(Y)   (1)
Predictive models (inverse): Y = f(X) or y = f(X)   (2)

Independent (predictors) vs. dependent (predictands)

Example:
X (R) is the matrix of multivariate (instrumental) responses for the different samples.
Y is the matrix of concentrations of one chemical component (or more) in the different samples.
y (c) is the concentration of one component in the samples.
f is the calibration function, causal in (1) and predictive in (2); in linear models, f is a linear function.
Multivariate Linear Regression
R = C Sᵀ + E
(ns,nw) = (ns,nc)(nc,nw) + (ns,nw)

R: matrix of sensor responses (ns samples, nw wavelengths)
C: matrix of concentrations (ns samples, nc components)
Sᵀ: matrix of sensitivities (nc components, nw wavelengths)
E: matrix of experimental errors (ns samples, nw wavelengths)

Multicomponent Analysis: Bilinear Model – Multivariate Linear Regression
• Advantages
  • Total selectivity is not needed
  • Allows multicomponent analysis
  • Outlier detection is possible
• Most used methods
  – MLR, Multilinear Regression
    • Classical Least Squares (CLS)
    • Inverse Least Squares (ILS)
  – Factor-based Linear Regression (biased)
    • Principal Components Regression (PCR)
    • Partial Least Squares Regression (PLSR)
  – Non-linear Regression
Multivariate Linear Regression
CLS Model: R = C Sᵀ + E
(ns,nw) = (ns,nc)(nc,nw) + (ns,nw)

The responses are modelled as a function of the concentrations. It is the same causal model as for the generalized Beer's law (generalized multilinear model).

Calibration step:
a) direct: the pure component spectra or sensitivities are previously known; Sᵀ is known.
b) indirect: the pure component spectra are not previously known; Sᵀ is unknown and has to be estimated in the calibration step:

Sᵀ = C⁺R,  where C⁺ = (CᵀC)⁻¹Cᵀ (pseudoinverse)

Prediction step:

For a set of 'nunk' samples with unknown analyte concentrations:

Cunk = Runk (Sᵀ)⁺
(nunk,nc) = (nunk,nw)(nw,nc)

Cunk: matrix of concentrations of the nunk unknown samples (nunk samples, nc components)
Runk: matrix of their instrumental responses (nunk samples, nw wavelengths)
(Sᵀ)⁺: pseudoinverse of the sensitivities matrix (pure spectra) (nw wavelengths, nc components)
(Sᵀ)⁺ = S(SᵀS)⁻¹
Prediction step (one sample):

The concentrations of several analytes in one sample:

cᵀunk = rᵀunk (Sᵀ)⁺
(1,nc) = (1,nw)(nw,nc)

or, which is the same:

cunk = S⁺ runk
(nc,1) = (nc,nw)(nw,1)

where S⁺ = (SᵀS)⁻¹Sᵀ
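A minimal MATLAB sketch (illustrative, simulated spectra) of indirect CLS calibration and prediction with the equations above:

nw = 50; ns = 10;
Strue = rand(2,nw);                              % pure spectra (sensitivities), nc x nw
Ccal  = rand(ns,2);                              % known calibration concentrations
Rcal  = Ccal*Strue + 0.01*randn(ns,nw);          % calibration responses, R = C*S' + E
St    = pinv(Ccal)*Rcal;                         % calibration: S' = C+ * R
Runk  = [0.3 0.7; 0.8 0.1]*Strue + 0.01*randn(2,nw);  % responses of two "unknown" samples
Cpred = Runk*pinv(St);                           % prediction: Cunk = Runk*(S')+
disp(Cpred)                                      % close to [0.3 0.7; 0.8 0.1]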
Classical (Causal) Least Squares (CLS)

Advantages (compared to univariate least squares):
1. Increase of precision in the estimations (signal averaging).
2. Allows the estimation of the pure responses (pure spectra) => qualitative information, identification.
3. Allows multicomponent quantitative analysis.

Disadvantages:
It needs knowing and introducing the whole information of all the components contributing to the measured analytical response. It does not allow calibration in the presence of unknown interferents (so it is not used in the analysis of natural samples).
Multivariate Linear Regression
2. Inverse Calibration. Inverse Least Squares (ILS)

Model: c = R b + e   (y = X b + e)
(ns,1) = (ns,nw)(nw,1) + (ns,1)

The concentrations are modelled as a function of the instrumental responses. It is not a causal model; it is a predictive model. It only needs knowing the analyte concentration in the calibration samples.

Calibration: b = R⁺ c   (b = X⁺c)
(nw,1) = (nw,ns)(ns,1)

b is the calibration vector, evaluated from the responses of the calibration samples R where the analyte concentration c is known.

R⁺ = (RᵀR)⁻¹Rᵀ, pseudoinverse of R
X⁺ = (XᵀX)⁻¹Xᵀ, pseudoinverse of X

Prediction: cunk = rᵀunk b   (yunk = xᵀunk b)
(1,1) = (1,nw)(nw,1)

cunk is the concentration of the analyte in a new sample;
rᵀunk is the instrumental response given by this sample;
b is the calibration vector previously evaluated.

R is not square => calculation of the generalized inverse or pseudoinverse R⁺ = (RᵀR)⁻¹Rᵀ

Problem:
In the evaluation of the calibration vector b = (RᵀR)⁻¹Rᵀc, (RᵀR)⁻¹ (nw,nw) has to be evaluated. Its nw rows and nw columns should be linearly independent!!! and ns > nw (number of calibration samples > number of wavelengths).
Inverse Least squares, ILS (inverse model)
Advantages:
- Allows the determination of one analyte in the presence of unknown interferences (this is not possible with CLS!!!).
- Only needs the calibration information for one analyte.

Disadvantages:
- It does not use all the variables; it only uses a reduced number of selected variables (sensors or wavelengths).
- There is no increase of measurement precision (there is no signal averaging).
Multivariate Linear Regression
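A minimal MATLAB sketch (illustrative, simulated data) of ILS calibration and prediction of one analyte in the presence of an interferent, using a small number of selected wavelengths so that ns > nw:

nw = 5; ns = 30;
Strue = rand(2,nw);                              % analyte plus one (unknown) interferent
C = rand(ns,2);                                  % true concentrations; only column 1 is "known"
R = C*Strue + 0.005*randn(ns,nw);                % calibration responses
b = pinv(R)*C(:,1);                              % calibration vector, b = R+ * c
runk = [0.4 0.6]*Strue + 0.005*randn(1,nw);      % response of a new sample
cunk = runk*b;                                   % predicted analyte concentration (~0.4)
fprintf('predicted analyte concentration: %.3f\n', cunk);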
Methods based on Factor Analysis
• Factor decomposition of the matrix X.
• They resolve the collinearity problem in X.
• Background noise filtering.
• They improve precision (signal averaging).
• 'Compression' of the information into a reduced number of new variables or factors.
Multivariate Linear Regression
- As in PCA
- Linearization, if possible
- Mean centering
- Variance scaling, when the variables are in different units or differ considerably in magnitude
- Outlier elimination is critical
Pretreatment Methods in Multivariate Regression/Calibration
Multivariate Linear Regression
Principal Component Regression PCR
1. Decomposition of X in factors by PCA:
X = T Pᵀ + E
(ns,nw) = (ns,nc)(nc,nw) + (ns,nw)

2. Multilinear regression (MLR) on the scores T (instead of on the original variables in X):
y = T b
(ns,1) = (ns,nc)(nc,1)
Multivariate Linear Regression
Principal Component Regression PCR
3. Evaluation of the regression vector b:

b = T⁺ y = (TᵀT)⁻¹Tᵀ y

The scores (PCA) are orthogonal:
(TᵀT)⁻¹ = diag(1/λ_i), i = 1, ..., nc
Multivariate Linear Regression
Principal Component Regression PCR
4. Prediction of a new sample with response xᵀunk:

score of the new sample: tᵀunk = xᵀunk P
(1,nc) = (1,nw)(nw,nc)

prediction of its concentration: yunk = tᵀunk b
(1,1) = (1,nc)(nc,1)
Multivariate Linear Regression
Principal Component Regression PCR
Direct calculation: ycal = X bcal
bcal = X⁺ ycal
X⁺ ≈ X⁺_PCA = (T Pᵀ)⁺ = P (TᵀT)⁻¹Tᵀ

P orthonormal: PᵀP = I
T orthogonal: TᵀT = diag(λ_i)

yunk = xᵀunk bcal
(1,1) = (1,nw)(nw,1)
Multivariate Linear Regression
[Diagram: PCA decomposes X (ns × nw) into T (ns × nc) and Pᵀ (nc × nw); MLR then regresses y (ns × 1) on T (ns × nc) to obtain b (nc × 1).]

PCR = PCA + MLR

nc << nw: reduction of the number of variables!!!
Principal Components Regression (PCR)

PCR is one of the ways to solve the inversion of ill-conditioned matrices in linear regression. The property of interest is modelled (regressed) on the PCA scores:

y = Tk bPC + e = X Pk bPC + e
bPC = (TkᵀTk)⁻¹Tkᵀ y   (regression vector from the PCA 'scores')
b = Pk bPC ;  y = X b   (regression vector from the original variables)
X⁺ = Pk (TkᵀTk)⁻¹Tkᵀ
(the inverse matrix is calculated from orthogonal matrices)
Multivariate Linear Regression
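A minimal MATLAB sketch (illustrative, simulated data) of PCR as described above: PCA on the mean-centered responses followed by MLR of y on the first k scores:

ns = 40; nw = 60; k = 3;
C = rand(ns,3); S = rand(3,nw);
X = C*S + 0.01*randn(ns,nw);                     % responses with three underlying components
y = C(:,1);                                      % property to be predicted
mx = mean(X); my = mean(y);
[~,~,V] = svd(X - mx,'econ');
P = V(:,1:k); T = (X - mx)*P;                    % PCA loadings and scores
bPC = (T'*T)\(T'*(y - my));                      % regression of y on the scores
b = P*bPC;                                       % regression vector for the original variables
xnew = [0.2 0.5 0.3]*S + 0.01*randn(1,nw);       % a new sample
fprintf('PCR prediction: %.3f (true value 0.2)\n', (xnew - mx)*b + my);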
Possible problems with PCR:
Some of the PCs may not be relevant for the prediction of y; they are only relevant for the description of X.
The PCs are estimated without considering the property to predict, y.
Solution: find the components using the information in y, not only the information in X → PLS.
The number of components has to be estimated using validation methods.
Diagnostics are used to find outliers (Q residuals, T² values, leverage plots).
Multivariate Linear Regression
Partial Least Squares Regression PLSR
The response matrix X is decomposed and 'truncated' in a similar way to PCR, but in the decomposition the information of y in the calibration samples is considered:

X = T Pᵀ + E
Y = U Qᵀ + F

u = y (in the case of a single component)

[Diagram: X → T (variance), T ↔ U (covariance, through the weights W), U → Y (variance).]
Multivariate Linear Regression
[Diagram: comparison of models. CLS: X = f(y). ILS/MLR: y = f(X) directly from the variables x1…x5. PCR = PCA + MLR: x1…x5 → PCA scores t1, t2 → y. PLSR: x1…x5 → latent variables t1, t2 (computed using y) → y.]
Partial Least Squares (PLS)
• PLS is a mixture of PCR and MLR.
• PCR captures the maximum variance in X.
• MLR gets the maximum correlation between X and y.
• PLS tries both things: it maximizes the covariance.
• PLS requires an additional matrix of weights W to keep the orthogonality of the scores and to make the matrix inversion easier.
• The factors are evaluated sequentially by projection of y on X.
• The expression to evaluate the matrix inverse is more complex than for PCR:

X⁺ = Wk (PkᵀWk)⁻¹ (TkᵀTk)⁻¹ Tkᵀ
Multivariate Linear Regression
Comparison of predictive (inverse) linear calibration models

The different methods differ in the way they solve the same calibration equation (calculation of the inverse matrix of X (R)): b = X⁺y

MLR (ILS): X⁺ = (XᵀX)⁻¹Xᵀ
Maximum correlation between X and y is achieved, but the direct inversion of X is problematic.

PCR: X⁺ = Pk (TkᵀTk)⁻¹ Tkᵀ
Maximum variance in X is captured.

PLSR: X⁺ = Wk (PkᵀWk)⁻¹ (TkᵀTk)⁻¹ Tkᵀ
Maximum covariance between X and y is achieved.

Inverses in PCR and PLS are calculated from orthogonal matrices: A⁻¹A = D (diagonal); orthonormal: A⁻¹A = AᵀA = I.
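A minimal MATLAB sketch (illustrative, simulated data), assuming the Statistics and Machine Learning Toolbox function plsregress is available, of fitting a PLS model and predicting a new sample:

ns = 40; nw = 60; ncomp = 3;
C = rand(ns,3); S = rand(3,nw);
X = C*S + 0.01*randn(ns,nw);                     % calibration responses
y = C(:,1);                                      % property of interest
[~,~,~,~,beta,pctvar] = plsregress(X,y,ncomp);   % beta contains the intercept as its first element
xnew = [0.2 0.5 0.3]*S + 0.01*randn(1,nw);
fprintf('PLS prediction: %.3f, Y-variance explained: %.1f%%\n', [1 xnew]*beta, 100*sum(pctvar(2,:)));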
Calibration step: calculation of the model
Xcal, ycal → Model; obtain the number of components

Validation step: test the model
Xval, yval → test of the model; test the number of components

Validation of the calibration models (regression)
Multivariate Linear Regression
Validation of the model with the same calibration samples:

Xcal + Model → ŷcal

Model error: ŷcal − ycal

Residual variance (calibration): $\dfrac{\sum_{i=1}^{ns}(\hat{y}_{i,cal}-y_{i,cal})^2}{ns}$

RMSEC (Root Mean Square Error of Calibration): $\sqrt{\dfrac{\sum_{i=1}^{ns}(\hat{y}_{i,cal}-y_{i,cal})^2}{ns}}$
Multivariate Linear Regression
Validation of the model with new validation samples:

Xval + Model → ŷval

Model error: ŷval − yval

Residual variance (prediction): $\dfrac{\sum_{i=1}^{ns}(\hat{y}_{i,val}-y_{i,val})^2}{ns}$

RMSEP (Root Mean Square Error of Prediction): $\sqrt{\dfrac{\sum_{i=1}^{ns}(\hat{y}_{i,val}-y_{i,val})^2}{ns}}$
Multivariate Linear Regression
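A minimal MATLAB sketch (illustrative, simulated data) of RMSEC and RMSEP; a simple MLR model is used here, but the same two formulas apply to any calibration model:

Xcal = randn(30,3); bref = [2; -1; 0.5];
ycal = Xcal*bref + 0.1*randn(30,1);              % calibration samples
Xval = randn(15,3); yval = Xval*bref + 0.1*randn(15,1);   % new validation samples
b = Xcal\ycal;                                   % model built on the calibration set only
RMSEC = sqrt(mean((Xcal*b - ycal).^2));          % error on the calibration samples
RMSEP = sqrt(mean((Xval*b - yval).^2));          % error on the validation samples
fprintf('RMSEC = %.3f, RMSEP = %.3f\n', RMSEC, RMSEP);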
Validation of the calibration model

The ability of the calibration model to predict has to be evaluated using a new sample set not used in the development of the calibration model.

Samples:
Training set → calibration of the model: Xcal, ycal → fcal → calculation of RMSEC
Test set → validation of the model: Xval → fval → ŷval → calculation of RMSEP
Multivariate Linear Regression
Validation Methods

1) With a calibration set of samples and a different validation set of samples. Both data sets should be representative. It is the best method.

2) Cross-Validation
2A) Two groups of data, X and y: (XA, yA) and (XB, yB); first A is used for calibration and B for validation, then B for calibration and A for validation.
The prediction error is evaluated for A and B and the average is calculated.
Multivariate Linear Regression
2B) Full cross-validation or leave-one-out validation

As many models as samples are built. Successively, one sample is removed, a new model is built and the left-out sample is predicted. This is repeated for every sample and the prediction error is calculated.

2C) Segmented validation (for small groups of samples, e.g. 10% of the samples)

3) Leverage correction (leverage, hi)

Residuals fi = yval − ŷval are weighted according to hi:

fi,corr = fi / (1 − hi)
Multivariate Linear Regression
$$h_i = \frac{1}{ns} + \sum_{k=1}^{n} \frac{t_{i,k}^2}{\lambda_k}$$

h_i values are between 0 and 1.
Samples with low 'leverage': h_i → 0, f_i,corr → f_i.
Samples with high 'leverage': h_i → 1, f_i,corr >> f_i.

This procedure uses only the calibration sample data; it gives a first approximation of the future prediction ability.
Multivariate Linear Regression
Cross-Validation
Data are divided into q subsets.
Build the model with q − 1 subsets.
Calculate PRESS (Predictive Residual Sum of Squares):

$$PRESS = \sum_i\sum_j (y_{ij} - \hat{y}_{ij})^2$$

Repeat until all the groups have been left out one time. Find the minimum (or the inflexion point) in the plot of PRESS vs. the number of components.
Multivariate Linear Regression
Determination of the number of components using PRESS plots

[Figure: cumulative PRESS vs. number of components; the minimum indicates 5 components.]
Multivariate Linear Regression
Model evaluation and validation

Prediction Error Sum of Squares: $PRESS = \sum_i\sum_j (y_{ij}-\hat{y}_{ij})^2$

Root Mean Square Error in Prediction: $RMSEP = \sqrt{\dfrac{\sum_i\sum_j (y_{ij}-\hat{y}_{ij})^2}{n.samples}}$

Standard Error in Prediction: $SEP = \sqrt{\dfrac{\sum_i\sum_j (y_{ij}-\hat{y}_{ij}-bias)^2}{n.samples-1}}$

bias: $bias = \dfrac{\sum_i\sum_j (y_{ij}-\hat{y}_{ij})}{n.samples}$

Relative Error: $RE = 100\,\sqrt{\dfrac{\sum_i\sum_j (y_{ij}-\hat{y}_{ij})^2}{\sum_i\sum_j y_{ij}^2}}$
Relative Error
Multivariate Linear Regression
• Comparison of experimental values versus model-predicted values:
  – for the calibration samples
  – for the external validation samples
• Plot and calculate the regression line of predicted versus actual values:
  predicted values = slope × experimental values + offset
  The slope should be one, the offset should be zero, and r² = 1.
Model evaluation and validation
Multivariate Linear Regression
Partial Least Squares Regression (PLSR) models

X = T Pᵀ + E   (P: X loadings, T: X scores)
Y = U Qᵀ + F   (Q: Y loadings, U: Y scores)

u = y (in the case of a single component)

[Diagram: X → T (variance), T ↔ U (covariance, through the weights W), U → Y (variance).]
Multivariate Linear Regression – PLS model interpretation
Interpretation of PLS models:
• Physical interpretation of PLS models can be obtained from plots of scores and loadings, as in PCA.
• More interestingly, PLS models provide the weights (Wk), which describe the covariance structure between the X and y blocks. Plots of the weights are extremely useful for the interpretation of PLS models.

X⁺ = Wk (PkᵀWk)⁻¹ (TkᵀTk)⁻¹ Tkᵀ
b = X⁺y ;  y = Xb

• Other measures exist, like the variable influence (importance) on projection (VIP) parameter. Plots of the VIPs are also very useful for PLS model interpretation.
Variable influence (importance) on projection, VIP, parameter
Multivariate Linear RegressionPLS model interpretation
∑=
− −−=
A
a aoaaakAk SSYSSY
KSSYSSYwVIP1
12
)()((
A total number of factors considered in the model a considered factor
k considered variable
The variables with larger VIPs (larger than one) are more influential and important for explaining Y
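A minimal MATLAB sketch (illustrative, simulated data), assuming plsregress from the Statistics and Machine Learning Toolbox, of one common way of computing VIP scores from the PLS weights (stats.W) and the Y-variance explained by each factor (PCTVAR(2,:)); it follows the formula above with normalized weights:

ns = 40; K = 20; A = 3;
C = rand(ns,3); W0 = rand(3,K);
X = C*W0 + 0.01*randn(ns,K);
y = 2*C(:,1) + 0.05*randn(ns,1);
[~,~,~,~,~,pctvar,~,stats] = plsregress(X,y,A);
W = stats.W;                                     % X weights, K x A
Wn = W ./ sqrt(sum(W.^2,1));                     % normalize every weight vector
ssy = pctvar(2,:);                               % Y variance explained by each factor
VIP = sqrt(K * (Wn.^2 * ssy') / sum(ssy));       % K x 1 vector of VIP scores
bar(VIP); xlabel('variable'); ylabel('VIP')      % variables with VIP > 1 matter most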