Multivariate Data Analysis - IMEDEA · 2014. 5. 9. · Univariate and Multivariate Data Univariate • Only one measurement per sample (pH, Absorbance, peak height or area) • The

Multivariate Data AnalysisMultivariate Data AnalysisIntroduction to Multivariate Data AnalysisPrincipal Component Analysis (PCA) Multivariate Linear Regression (MLR, PCR and PLSR)

Laboratory exercises:Introduction to MATLAB Examples of PCA (cluster analysis of samples, identification and geographical distribution of contamination sources/patterns…) Examples of Multivariate Regression (prediction of concentrationof chemicals from spectral analysis, investigation of correlation patterns and of the relative importance of variables,…

Romà Tauler (IDAEA, CSIC, Barcelona)[email protected]

Univariate and Multivariate DataUnivariate

• Only one measurement per sample (pH, Absorbance, peak height or area)

• The property is defined by oneone measurement.

• Total Selectivity is needed• Interferences should be

eliminated de beforemeasurements (separation)

• Numerical treatment is easy

Multivariate• Multiple measurements per

sample– Instrumental measuremnts

(spectra, cromatograms,...)– Multiple measurements

(constituent conc., sensorial variables,....)

• Total selectivity is not needed.• Interferences can be present• Different complexity levels

(vectors, matrices, tensors,...).

Univariate Statisticsn

ii 1

n2

i2 2 i 1

X x

n2

ii 1

X x

n

i ii 1

x ,y x ,y

x ,yx ,y

x y

xX

n

(x X)s

n 1

(x X)s

n 1

(x X)(y Y)s

n 1s

rs s

μ

σ

σ

σ

=

=

=

=

→ =

−→ =

−

−→ =

−

− −→ =

−

→

∑

∑

∑

∑

Mean

Variance

Standard Deviation

Covariance

Correlation

One variable summary

Relationbetween 2 variables

Univariate Statistics

• Mean: 4.00– Mean value of the variable.– Size of the scale of the variable.

• Standard Deviation: 0.18.• Variance: 0.032.

– Dispersion around the mean.– Spread (dispersion) of the scale of the

variable.

Description of one variableZn

4.104.043.823.964.074.233.733.804.23

Univariate Statistics

• Covariance: 0.00043.– High values Indicate a linear relationship

between x and y.– Sign: relation direct (+) or inverse (-).– Depends on the scale sizes of x i y.

• Correlation: 0.996.– |1| total linearity, 0 absence of linear

relationship.– Sign: relation direct (+), inverse (-).– Independent from the scale sizes of x and y.

Description of the relation between 2 variables

High Correlation, redundant information.Low Correlation, complementary Information.

Sn Ni0.20 0.00220.20 0.00200.15 0.00150.61 0.00620.57 0.00560.58 0.00550.30 0.00330.60 0.00560.10 0.0014

Multivariate DataExample

X (9,4) (n,m)

Sn Zn Fe Ni 0.20 4.10 0.06 0.0022 0.20 4.04 0.04 0.0020 0.15 3.82 0.08 0.0015 0.61 3.96 0.09 0.0062 0.57 4.07 0.08 0.0056 0.58 4.23 0.07 0.0055 0.30 3.73 0.02 0.0033 0.60 3.80 0.07 0.0056 0.10 4.23 0.05 0.0014

Sam

ples

Variables

Objects

Multivariate Statistics

• Vector of means:– Mean value of each variable– Sample diferences due to the

differences in the size scales of thevariables

• Variables with higher means havea higher influence in dtadescription.

• If the scale sizes are very different, a data pretreatment will probablybe needed.

Description of variables

Sn Zn Fe Ni (0.37 4.00 0.06 0.0037)

11

−=∑=

n

xx

n

iij

j

Multivariate Statisitcs

• Vector of standard deviations:– Dispersion of the different variables.– Shows the differences in the scale

ranges (dispersion, spread) amongvariables.

• Variables with higher dispersions will have higher influence in thedata description.

• If the scale ranges of the differentvariables are very different, a data pretreatment will probably beneeded.

Description of variables

s = (s1, s2, s3, ..., sn)

Sn Zn Fe Ni (0.22 0.18 0.02 0.0020)

1

)(1

2

−

−=∑=

n

xxs

n

ijij

j

Multivariate DataRobust Parameters

Sn Zn Fe Ni

0,100000 3,730000 0,020000 0,0014

0,150000 3,800000 0,040000 0,0015

0,200000 3,820000 0,050000 0,0020,200000 3,960000 0,060000 0,0022

0,300000 4,040000 0,070000 0,0033

0,570000 4,070000 0,070000 0,0055

0,580000 4,100000 0,080000 0,0056

0,600000 4,230000 0,080000 0,0056

0,610000 4,230000 0,090000 0,0062

MedianSpread

interquartile

They are less sensitive to the presence of outliersThe range of the interquartile is not symmetric respect the medianShow the data structure

Descriptive Statistics (Excel)

ppDDD opDDT ppDDT Total DDX Total POCs PCB#28

Media 6.096511644 2.16580938 7.086504433 66.19020173 78.70933531 7.600267864 Mediana 1.928828322 0.035 0.099439743 18.10346088 23.83508898 2.1 Moda 0.01 0.035 0.02 0.145 0.3215 0.9

Desviación estándar 20.01177892 12.61189239 59.35342594 199.7332731 229.5487184 23.05609066

Varianza de la muestra 400.4712955 159.0598297 3522.829171 39893.38037 52692.61412 531.5833166

Curtosis 45.08257127 52.62597243 100.808786 37.034318 41.9234197 37.79168548

Coeficiente de asimetría 6.553464204 7.205359571 10.01488076 5.826281824 6.155842663 6.004360386

Rango 158.99 103.565 598.939 1536.855 1856.1785 165.535

Mínimo 0.01 0.035 0.02 0.145 0.3215 0.065

Máximo 159 103.6 598.959 1537 1856.5 165.6

Suma 621.8441877 220.9125568 722.8234522 6751.400577 8028.352202 752.4265185

Cuenta 102 102 102 102 102 99

Introduction to Multivariate Data Analysis

A box plot summarizes the information on the data distribution primarily in termsof the median, the upper quartile, and lower quartile. The “box” by definitionextends from the upper to lower quartile. Within the box is a dot or line markingthe median. The width of the box, or the distance between the upper and lowerquartiles, is equal to the interquartile range, and is a measure of spread. Themedian is a measure of location, and the relative distances of the median from theupper and lower quartiles is a measure of symmetry “in the middle” of thedistribution. is defined by the upper and lower quartiles. A line or dot in the box marks the median. For example, the median is approximately in the middle of thebox for a symmetric distribution, and is positioned toward the lower part of thebox for a positively skewed distribution.

Jan July

0

1

2

3

4

5

6

7

Box plots, Tucson Precipitation

P (

in)

Month

median

upper quartile

lower quartile

interquartile rangeirq

Probability distributions: Box plots

“Whiskers” are drawn outside the box at what are called the the “adjacent values.”The upper adjacent value is the largest observation that does not exceed the upperquartile plus 1.5 iqr , where iqr is the interquartile range. The lower adjacent valueis the smallest observation than is not less than the lower quartile minus1.5 iqr . Ifno data fall outside this 1.5 iqr buffer around the box, the whiskers mark the data extremes. The whiskers also give information about symmetry in the tails of thedistribution.

Jan July

0

1

2

3

4

5

6

7


P (

in)

Month


Whiskers

interquartile rangeirq

For example, if the distance from thetop of the box to the upper whiskerexceeds the distance from the bottomof the box to the lower whisker, thedistribution is positively skewed in the tails. Skewness in the tails may be different from skewness in the middleof the distribution. For example, a distribution can be positively skewedin the middle and negatively skewedin the tails.

Any points lying outside the 1.5 iqr around the box are marked by individual symbolsas “outliers”. These points are outliers in comparison to what is expected from a normal distribution with the same mean and variance as the data sample. For a standard normal distribution, the median and mean are both zero, and: q at 0.25 = −0.67449, q at 0.75 =0.67449, iqr = q 0.75 − q 0.25 =1.349, where q 0.25and q. 075are the first and third quartiles, and iqr is the interquartile range. We see that the whiskersfor a standard normal distribution are at data values: Upper whisker = 2.698 , Lowerwhisker = -2.698

Jan July

0

1

2

3

4

5

6

7


P (

in)

Month

Outliers


From the cdf of the standard normal distribution, we see that the probability of a lower value than x=−2.698 is 0.00035. This result shows that for a normal distribution, roughly 0.35 percent of the data is expected to fall below the lower whisker. By symmetry, 0.35 percent of thedata are expected above the upper whisker. These data values are classified as outliers. Exactly how many outliers might be expected in a sample of normally distributeddata depends on the sample size. For example, with a sample size of 100, weexpect no outliers, as 0.35 percent of 100 is much less than 1. With a sample sizeof 10,000, however, we would expect 35 positive outliers and 35 negative outliersfor a normal distribution.


For a normal distribution>> varnorm=randn(10000,3);>> boxplot(varnorm)

0.35% of 10000are approx. 35 outliersat each whisker side

Parametric vs Robust Statistics

MeanStandard Dev.

Parametric

Robust

Box Plot

Median

Mínimum

Màximum

Interquartil Range(IQR)

Mean and standarddeviation plots

Median and IQRs plots(Box plots)

Sn Zn Fe Ni Sn Zn Fe Ni

They help to see the size and range scale differencesThey suggest the use of appropriate data pretreatments

to handle these differences.

Parametric vs Robust Statistics


• Covariance Matrix, S (m,m): it has all the possible pairwise combinations between variables.

Description of the variable relationships

2 2 211 12 1m2 221 2m

2 2m1 mm

s s ... ss ... ... s

S... ... ... ...s ... ... s

⎛ ⎞⎜ ⎟⎜ ⎟= ⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠

2ij

n

ij j il li 1s

n 1

(x x )(x -x )==

−

−∑

2

2jj

n

ij ji 1s

n 1

(x x )==

−

−∑

Covariance

Variance


• Correlation Matrix, C(m,m): it has all the possible of correlations between variables.– Diagonal elements are 1.

2 2 211 12 1m2 221 2m

2 2m1 mm

r r ... rr ... ... r

C... ... ... ...r ... ... r

⎛ ⎞⎜ ⎟⎜ ⎟= ⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠

2ij2

iji j

2i ii

sr

s s

s s

=

=

Correlation



Sn Zn Fe NiSn 0,047319 -0,002593 0,002518 0,000434Zn -0,002593 0,033644 0,000581 -0,000022Fe 0,002518 0,000581 0,000494 0,000022Ni 0,000434 -0,000022 0,000022 0,000004

Covariance Matrix

Correlation MatrixSn Zn Fe Ni

Sn 1,000 -0,065 0,521 0,995Zn -0,065 1,000 0,142 -0,060Fe 0,521 0,142 1,000 0,502Ni 0,995 -0,060 0,502 1,000


Pair-wiseCorrelationsvariables

samples

Introduction to Multivariate Data Analysis

CORRMAP Correlation map with variable grouping. CORRMAP produces a pseudocolor map which shows the correlation of between variables (columns) in a data set.

Multivariate Data Analysis

Need• Nature is multivariate.

– Climate = f(T, rain, winds, seasons,...)– Health = f(genetics, diet, climate, habits,...)– Abs. analit = f(solvent, T, interferences, matrix,...)

• Few properties are dependent of only one variable.• Many chemical measurements are multivariate

Multivariate Data Analysis

Need• Many times measurements are indirect.

– Temperature. • Low values (thermometer) (univariate data).• High values (emision spectra FTIR, f(T))

– Nitrogen Concentration.• Kjeldahl chemical method (univariate data).• NIR spectra (f(C)).

Multivariate Data Analyis

Need• The studied property is selectively correlated to a

single variable only in a few ocasions (lack of total selectivity).

• The studied property is determined by a set of variables with which presents high correlatio

P = f(x1, x2, ...,xn);

Observations = Structure + Noise

Structure = part of the signal correlated withthe sought property

Noise = all the other contributions, instrumental noise, experimental errors, other components, ...

NeedExperimental measures contain informationwhich is not relevant to the property of interest


Causality vs Correlation

Correlation is a statistical concept which measures the linear relationbetween two variables

Causality relationship is a deterministic interpretation from the problem or application

Example: the number of stork and the number of new born childrenin a geographical area


Information

Initial Hypothesis: Data have the sought information.Exists a relationship which can be modelled from measured variables and the measured property. When variables change their value, also the property will changed

X (variables) -------> Y propertymodel Y = f(X)

X is a vector or a matrix (e.g. spectral measures)Y is a scalar, vector or matrix (e.g. analytic concentrations)


Visualization of original dataPlot of the matrix rows and/or columns

Spectra set

0 10 20 30 40 50 60 70 80 90 1000

0.5

1

1.5

2

2.5

3

Rows

outlier sample

Variables0 5 10 15 20 25 30

0

0.5

1

1.5

2

2.5

3

ColumnsSamples

020

4060

80100

0

10

20

300

0.5

1

1.5

2

2.5

3

Rows and Columns (3D)Variables

Samples

Detection of outlier samples/variables Detection of scale and range variable differencesSystematic information (structure) is easily detected (instrumental responses).Difficult to interpret when the number of samples is high.

Visualization of original dataMap of samplesIn the column space (variables)

Samples are drawn as points in the variables spaceSimilarities among samples can be detected (distances among samples).

2 3 41 0 6⎛ ⎞⎜ ⎟⎝ ⎠

m1

m2

v1 v2 v3

2 4 6

642

(1,0,6)

(2,3,4)v1

v2

v3

m1

m2

Map of variablesIn the row space (samples)

Variables are drawn as vectors in the sample subspaceCorrelation among variables can be estimated (angle).

2 3 41 0 6⎛ ⎞⎜ ⎟⎝ ⎠

m1

m2

v1 v2 v3

r(vi,vj) = cos(vi,vj)

r = 1, angle 0o

r = 0, angle 90o2 4 6

642

(4,6)

(2,1)(3,0)

m1

m2

v1

v2

v3

Visualization of original data

Samples Sn Zn Fe Ni

1 0.2 3.4 0.06 0.08

2 0.2 2.4 0.04 0.06

3 0.15 2.0 0.08 0.16

4 0.61 6.0 0.09 0.02

5 0.57 4.2 0.08 0.06

6 0.58 4.82 0.07 0.02

7 0.30 5.60 0.02 0.01

8 0.60 6.60 0.07 0.06

9 0.10 1.60 0.05 0.19

Graphical representacion of multivariate data

in the variable space (3D)


02

46

8

0.020.04

0.060.08

0.10

0.2

0.4

0.6

0.8

ZnFeS

n

7

4

8 5

9

3

2

1

6

?

There are two sample groupsIs it critical the representation of a 4th variable?

For more than 3 dimensions space?

• Qualitative approximations for a few variables – Chernoff faces.

• Efficient Compression of the original space of variables– Principal Component Analysis (PCA).


• Chernoff faces. – It is easu to distinguish different features in human faces. Each

sample is a Chernoff face.– Each face feature is a variable.

V. 1 High front face of the head

V. 2 Lower front face of the head

V. 3 Eyebrows

V. 4 Smile

Visualization of original dataFor more than 3 dimensions space?

sample 7

Methods Classification

According to their goal

Exploration methods.Discrimation and Classification methods.Correlation and Regresion methods.Resolution methods

According data type

Based on original data.Based on latent variables (factor analysis)


Exploration Methods

• Visualization of the information.• Sample similarities and clusters• Correlations among variables. • Outlier detection.• Measured variables relevance. Selection.• Principal Component Analysis (PCA).


Discriminant and Classification Methods

• Separation of the objects (samples) in defined groups or clusters (classes).

• Assignation of new objects to predefined classes• Detection of outlier objects not belonging to any group

(classes).• PCA, SIMCA, LDA, PLS-DA, SVM.....


Correlation and Regression Methods

• Finding relations between two blocks of variables.• Modeling property changes from a group of variables.• Prediction of a property from the indirect

measurement from a group of variables correlated to it.

• Multilinear Regression (MLR), Principal ComponentsRegresion (PCR), Partial Least Squares Regression (PLS).

• Non-linear Regression methods, Kernel, SVM,...


Factor Analysis based methods

• Factor: source of the observed data variance of independent and defined nature.

• Extraction of the relevant factors (structure) of the data set. Noise Filtering.

• Description of the data variance from basic factors.• Identification of the chemical nature of these relevant

factors. • PCA, PLS, PCR, SIMCA, • Multivariate Curve Resolution. MCR. PMF, ICA,....


• Modify the size and the range of the scale of thevariables.

• They can be applied in the direction of thecolumns (variables) or of the rows (objects, samples).

• They are selected as a function of the data nature and of the information to be obtained.

• There is no optimal treatment, it depends on thechemical problem to be investigated.

Data pre-processing


1) mean centering (axes translation)Iik

* i 1ik ik k k

K

ik* k 1ik ik i i

xx x x , x

I

xx x x , x

K

=

=

= − =

= − =

∑

∑

on the data matrix columns

on the data matrix rows

centering

Data pre-processing


2) scaling ( )( )

( )( )

I 2

ik k* ik i 1ik k

k

I 2

ik k* ik i 1ik i

i

x xxx , ss I 1

x xxx , ss K 1

=

=

−= =

−

−= =

−

∑

∑

on the data matrix columns

on the data matrix rows

3) autoscaling = mean centering + scaling

autoscaling* *ik k ik iik ik

k i

x x x xx ; xs s− −

= =

Data pre-processing


4) normalization: K K

* 2ik ik i ik i ik

k 1 ñ 1i

Nx x ; c x ; c x ; ....c = =

= = =∑ ∑

5) rotation: X* = RT X

p.e. in two dimensions:

*11

*22

xcosθ sinθxx-sinθ cosθx

⎛ ⎞ ⎛ ⎞⎛ ⎞=⎜ ⎟ ⎜ ⎟⎜ ⎟⎝ ⎠⎝ ⎠⎝ ⎠

RT, rotation matrix

Data pre-processing


0.00

1.00

2.00

3.00

4.00

5.00

0 5 10

Samples

Met

al C

oncs

. SnZnFeNi

• Original data (without pretreatment). Scale, size and range of variables is kept

Data pre-processing


• Centered Data Each variable value is subtractedwith the mean of all the values of that variable.

Diferences among variables due to scale size are eliminated

-0.3-0.2-0.1

00.10.20.3

0 5 10

Samples

Met

all C

oncs Sn

ZnFeNi

Data pre-processing


• Autoescaled Data ades. Each value of the variable is centered and divided by the standard deviation of thevalues of the variable

Differences among variables due to size and range are eliminated

-2.5-2

-1.5-1

-0.50

0.51

1.5

0 5 10

Samples

Met

al c

oncs Sn

ZnFeNi

Data pre-processing


0.00

1.00

2.00

3.00

4.00

5.00

0 5 10

Mostres

Con

c. m

etal

ls

-0.3

-0.2

-0.1

0

0.1

0.2

0.3

0 5 10

MostresC

onc.

met

alls

-2.5-2

-1.5-1

-0.50

0.51

1.5

0 5 10

Mostres

Con

c. m

etal

ls Sn

Zn

Fe

Ni

Original Data Centered Data Autoscaled Data

Whatsamples/variables

have highervalues?

What variables discriminate

better?

What is thecorrelation amongdifferent variables?

Data pre-processing


• Centered data. Each variable value is subtractedby its mean value.

– The mean value of all variables is zero

11 1 12 2 1m m

21 1 22 2 2m m

n1 1 n2 2 nm m

x x x x ... x xx x x x x x

XC(n,m)... ... ... ...

x x x x ... x x

− − −⎛ ⎞⎜ ⎟− − −⎜ ⎟=⎜ ⎟⎜ ⎟⎜ ⎟− − −⎝ ⎠

Centered data and Covariance

n number of samplesm number of variables

),(),(),( 11

mnnmT

mm XCXCn

S−

=

Data pre-processing


• Autoscaled Data. Each value of the variable is centered and divided by the standard deviationof the values of the variable – The mean of all variables is 0.– The variance (dispersion) of all variables is 1.

Autoescaled data and correlation.

XT (n,m) j

jijij s

xxxt

−=

),(),(),( 11

mnnmT

mm XTXTn

C−

=

Data pre-processing


0 10 20 30 40 50 60 70 800

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

Longituds d'ona

Abs

orbà

ncia

0 10 20 30 40 50 60 70 800

0.05

0.1

0.15

0.2

0.25

Longituds d'ona

X

spectrum

Normalitzation

Each vector (spectrum) value is divided by its lenght (norm)

XN (n,m)

X

XN

i

ijij x

xxn = ∑=

jiji xx 2

Equals the response intensities.Allows a better comparison of the shapes

Data pre-processingMultivariate Data Analyis

• To eliminate changes due to instrumental variations without chemical information– Smoothing: noise correction– 1st derivative: correction of constant variations– 2nd derivative: correction of linear variations – Peak alignements.– Baseline corrections– Warping– …

• Different pretreatments can be combined.

In instrumental responses

Data pre-processing


NIR spectra of samples

original data

baseline

vertical offset

2a. derivative

row autoescaling

row autoscaling + baseline correction



Other data pretreatments

– Baseline and background correction– Noise filtering– Shift alignement– Warping– ...


0 5 10 15 20 25 30-50

0

50

100

150

200

250

300

350

Plot of rows (samples)of data matrix X

0 5 10 15 20 25 30 35 40 45 50-50

0

50

100

150

200

250

300

350

Plot of columns (variables)of data matrix X

Data example: environmental monitoiung Data table or matrix X(30,50)

50 parameters are meassured on 30 samples


0 5 10 15 20 25 30-50

0

50

100

150

200

250

300

350

Use of descriptive statistics1) Individual sample plots2) Individual variable plots 3) Descriptive statistics (Excel Statistics)4) Histograms/Box plots5) Binary correlation between variables6) ..............................................................

variables

0 5 10 15 20 25 30 35 40 45 50-50

0

50

100

150

200

250

300

350

samples


0 5 10 15 20 25 30 35 40 45 500

200

400

600

800

1000

1200

1400

Samples sum, average andstandard deviation for everyvariable

0 5 10 15 20 25 30 35 40 45 500

5

10

15

20

25

30

35

40

45

0 5 10 15 20 25 30 35 40 45 500

10

20

30

40

50

60

summean

std

69

1117 29

35

39

41

47

69

1117 29

35

39

41

47

6

9

1117

29

35

39

4147

0 5 10 15 20 25 300

500

1000

1500

2000

2500

3000

3500

4000

Variables sum, average andstandard deviationfor every sample

0 5 10 15 20 25 300

10

20

30

40

50

60

70

80

0 5 10 15 20 25 300

10

20

30

40

50

60

70

sum mean

std

21

26

21

26

26

21

Descriptive Statistics (Excel)

ppDDD opDDT ppDDT Total DDX Total POCs PCB#28

Media 6.096511644 2.16580938 7.086504433 66.19020173 78.70933531 7.600267864 Mediana 1.928828322 0.035 0.099439743 18.10346088 23.83508898 2.1 Moda 0.01 0.035 0.02 0.145 0.3215 0.9

Desviación estándar 20.01177892 12.61189239 59.35342594 199.7332731 229.5487184 23.05609066

Varianza de la muestra 400.4712955 159.0598297 3522.829171 39893.38037 52692.61412 531.5833166

Curtosis 45.08257127 52.62597243 100.808786 37.034318 41.9234197 37.79168548

Coeficiente de asimetría 6.553464204 7.205359571 10.01488076 5.826281824 6.155842663 6.004360386

Rango 158.99 103.565 598.939 1536.855 1856.1785 165.535

Mínimo 0.01 0.035 0.02 0.145 0.3215 0.065

Máximo 159 103.6 598.959 1537 1856.5 165.6

Suma 621.8441877 220.9125568 722.8234522 6751.400577 8028.352202 752.4265185

Cuenta 102 102 102 102 102 99


**

*

**

The box has lines at the lower quartile, median, and upper quartile values.

The whiskers are lines extending from each end of the box to show the extent of the rest of the data.

Outliers are data with values beyond the ends of the whiskers.

1 2 3 4 5 6 7 8 91011121314151617181920212223242526272829303132333435363738394041424344454647484950

0

50

100

150

200

250

300V

alue

s

Column Number

Boxplot


Columns 1 through 7

1.0000 0.7071 0.9088 0.8677 0.9011 0.7807 0.90200.7071 1.0000 0.7413 0.6186 0.7394 0.6311 0.68350.9088 0.7413 1.0000 0.8069 0.8313 0.9304 0.98120.8677 0.6186 0.8069 1.0000 0.9716 0.5649 0.74570.9011 0.7394 0.8313 0.9716 1.0000 0.5945 0.76320.7807 0.6311 0.9304 0.5649 0.5945 1.0000 0.96590.9020 0.6835 0.9812 0.7457 0.7632 0.9659 1.00000.8438 0.9207 0.8171 0.8244 0.9198 0.6322 0.74840.9213 0.6820 0.9842 0.8547 0.8528 0.9081 0.97900.8608 0.5930 0.8594 0.5502 0.5965 0.9161 0.92030.8508 0.6622 0.7248 0.9727 0.9742 0.4484 0.65250.9012 0.7057 0.8973 0.9595 0.9549 0.7050 0.84410.7564 0.6899 0.9133 0.5232 0.5746 0.9909 0.94470.6942 0.5023 0.6083 0.9450 0.9053 0.3058 0.51790.7363 0.4323 0.6508 0.9624 0.8984 0.3683 0.57590.8445 0.3825 0.5904 0.7990 0.7847 0.3878 0.58530.7738 0.6443 0.9330 0.5707 0.6013 0.9988 0.96540.9249 0.7821 0.9756 0.7923 0.8362 0.9140 0.96880.8241 0.4946 0.6426 0.9021 0.8829 0.4065 0.61110.8350 0.6862 0.9750 0.7015 0.7202 0.9747 0.98190.9383 0.7801 0.9716 0.7655 0.8175 0.9282 0.97590.8716 0.6598 0.9818 0.8190 0.8134 0.9242 0.97540.8883 0.5718 0.7118 0.6022 0.6668 0.6651 0.75560.9154 0.8532 0.9245 0.7477 0.8279 0.8566 0.91160.9280 0.6412 0.9564 0.8246 0.8240 0.8884 0.95960.7005 0.9217 0.7575 0.4710 0.6158 0.7461 0.73880.9209 0.8423 0.9303 0.7820 0.8482 0.8428 0.91660.9145 0.8839 0.9470 0.8490 0.9106 0.8218 0.90690.9090 0.8047 0.9795 0.8521 0.8842 0.8838 0.95450.7394 0.3741 0.6021 0.9413 0.8757 0.3190 0.53880.8043 0.6443 0.9127 0.5437 0.5826 0.9850 0.95670.8287 0.7583 0.6477 0.8407 0.9123 0.3899 0.57590.8259 0.6214 0.9145 0.5609 0.5979 0.9828 0.96280.8927 0.6820 0.9768 0.7402 0.7568 0.9627 0.98880.8630 0.6876 0.9399 0.6166 0.6632 0.9806 0.97630.8785 0.5285 0.7587 0.9735 0.9386 0.5223 0.71730.8702 0.7217 0.9775 0.7037 0.7371 0.9769 0.98840.9045 0.6304 0.9432 0.6802 0.7057 0.9552 0.97960.6489 0.5175 0.5127 0.9035 0.8850 0.1833 0.40990.9031 0.8127 0.9118 0.7370 0.8028 0.8697 0.92300.7481 0.6250 0.9203 0.5424 0.5707 0.9977 0.95470.8293 0.6533 0.9664 0.6903 0.7032 0.9785 0.98260.8431 0.8600 0.9018 0.6070 0.7059 0.9062 0.91180.8637 0.5809 0.7495 0.9830 0.9618 0.4834 0.68450.8606 0.5226 0.7253 0.9750 0.9425 0.4707 0.67570.8783 0.6756 0.9791 0.7609 0.7716 0.9575 0.99090.8453 0.6615 0.9699 0.6840 0.7024 0.9872 0.99050.8632 0.6759 0.7094 0.5733 0.6677 0.6640 0.74220.8802 0.6873 0.9784 0.7389 0.7545 0.9698 0.99510.8733 0.6441 0.9746 0.7600 0.7583 0.9495 0.9837

Columns 8 through 14

0.8438 0.9213 0.8608 0.8508 0.9012 0.7564 0.69420.9207 0.6820 0.5930 0.6622 0.7057 0.6899 0.50230.8171 0.9842 0.8594 0.7248 0.8973 0.9133 0.60830.8244 0.8547 0.5502 0.9727 0.9595 0.5232 0.94500.9198 0.8528 0.5965 0.9742 0.9549 0.5746 0.90530.6322 0.9081 0.9161 0.4484 0.7050 0.9909 0.30580.7484 0.9790 0.9203 0.6525 0.8441 0.9447 0.51791.0000 0.7907 0.6191 0.8598 0.8566 0.6577 0.73770.7907 1.0000 0.8493 0.7651 0.9209 0.8765 0.66660.6191 0.8493 1.0000 0.4861 0.6604 0.9066 0.27430.8598 0.7651 0.4861 1.0000 0.9197 0.4205 0.94920.8566 0.9209 0.6604 0.9197 1.0000 0.6721 0.85000.6577 0.8765 0.9066 0.4205 0.6721 1.0000 0.26150.7377 0.6666 0.2743 0.9492 0.8500 0.2615 1.00000.6804 0.7210 0.3517 0.9398 0.8838 0.3068 0.97390.6010 0.6510 0.6050 0.8304 0.7372 0.3345 0.71760.6435 0.9102 0.9031 0.4539 0.7100 0.9909 0.31750.8429 0.9618 0.8833 0.7292 0.8843 0.9064 0.57920.7205 0.7091 0.5315 0.9175 0.8419 0.3639 0.85340.7284 0.9626 0.8686 0.5926 0.8117 0.9601 0.48010.8364 0.9603 0.9239 0.7106 0.8590 0.9243 0.54400.7595 0.9915 0.8207 0.7119 0.8976 0.8930 0.63270.6553 0.7158 0.8884 0.6247 0.6506 0.6550 0.38800.8947 0.9023 0.8610 0.7288 0.8361 0.8697 0.54760.7663 0.9731 0.8794 0.7436 0.8912 0.8571 0.62700.8329 0.6826 0.7322 0.4908 0.5931 0.8099 0.29060.8913 0.9125 0.8429 0.7626 0.8591 0.8498 0.60390.9505 0.9291 0.7747 0.8258 0.9139 0.8266 0.70050.8850 0.9749 0.8049 0.7916 0.9272 0.8739 0.68350.6457 0.6793 0.3610 0.9306 0.8371 0.2543 0.95470.6320 0.8852 0.9511 0.4452 0.6793 0.9810 0.27150.8997 0.6521 0.5012 0.9263 0.8069 0.3972 0.80140.6301 0.8963 0.9689 0.4626 0.6913 0.9724 0.28850.7383 0.9712 0.9148 0.6436 0.8435 0.9435 0.50500.7027 0.9208 0.9668 0.5335 0.7423 0.9743 0.35450.7583 0.8204 0.5713 0.9575 0.9194 0.4688 0.91790.7573 0.9602 0.9091 0.6111 0.8139 0.9685 0.47080.6942 0.9392 0.9745 0.5950 0.7782 0.9347 0.42660.7378 0.5648 0.1928 0.9486 0.8046 0.1520 0.97500.8557 0.9043 0.8666 0.7040 0.8075 0.8764 0.53550.6158 0.8959 0.8922 0.4208 0.6873 0.9904 0.28720.6956 0.9600 0.8774 0.5768 0.8081 0.9584 0.46200.8389 0.8581 0.8934 0.5786 0.7289 0.9331 0.38330.8037 0.7970 0.5222 0.9823 0.9248 0.4407 0.94090.7596 0.7881 0.5305 0.9709 0.9102 0.4186 0.93500.7463 0.9830 0.8776 0.6584 0.8574 0.9316 0.54640.7044 0.9610 0.9051 0.5740 0.7984 0.9696 0.44840.7293 0.6920 0.8699 0.6102 0.6211 0.6794 0.36720.7405 0.9775 0.8998 0.6392 0.8421 0.9490 0.51460.7210 0.9781 0.8811 0.6526 0.8446 0.9213 0.5430


0.7363 0.8445 0.7738 0.9249 0.8241 0.8350 0.93830.4323 0.3825 0.6443 0.7821 0.4946 0.6862 0.78010.6508 0.5904 0.9330 0.9756 0.6426 0.9750 0.97160.9624 0.7990 0.5707 0.7923 0.9021 0.7015 0.76550.8984 0.7847 0.6013 0.8362 0.8829 0.7202 0.81750.3683 0.3878 0.9988 0.9140 0.4065 0.9747 0.92820.5759 0.5853 0.9654 0.9688 0.6111 0.9819 0.97590.6804 0.6010 0.6435 0.8429 0.7205 0.7284 0.83640.7210 0.6510 0.9102 0.9618 0.7091 0.9626 0.96030.3517 0.6050 0.9031 0.8833 0.5315 0.8686 0.92390.9398 0.8304 0.4539 0.7292 0.9175 0.5926 0.71060.8838 0.7372 0.7100 0.8843 0.8419 0.8117 0.85900.3068 0.3345 0.9909 0.9064 0.3639 0.9601 0.92430.9739 0.7176 0.3175 0.5792 0.8534 0.4801 0.54401.0000 0.7787 0.3747 0.6183 0.8824 0.5312 0.58290.7787 1.0000 0.3722 0.6206 0.9100 0.4613 0.63650.3747 0.3722 1.0000 0.9159 0.4040 0.9772 0.92630.6183 0.6206 0.9159 1.0000 0.6820 0.9455 0.97820.8824 0.9100 0.4040 0.6820 1.0000 0.5180 0.66320.5312 0.4613 0.9772 0.9455 0.5180 1.0000 0.95190.5829 0.6365 0.9263 0.9782 0.6632 0.9519 1.00000.6845 0.5635 0.9286 0.9475 0.6420 0.9758 0.94080.4501 0.8270 0.6476 0.7772 0.7038 0.6444 0.81650.5604 0.6168 0.8565 0.9405 0.6462 0.8927 0.96590.6882 0.6942 0.8852 0.9376 0.7333 0.9317 0.95170.2425 0.3147 0.7513 0.8036 0.3726 0.7524 0.82070.6100 0.6456 0.8462 0.9557 0.6918 0.8829 0.95430.6884 0.6141 0.8296 0.9502 0.7058 0.8927 0.94860.6992 0.5910 0.8907 0.9646 0.6811 0.9485 0.96140.9813 0.8440 0.3212 0.5780 0.9173 0.4731 0.55350.3342 0.4371 0.9801 0.9126 0.4224 0.9465 0.93100.7653 0.7952 0.3919 0.6972 0.8544 0.5099 0.68820.3604 0.4827 0.9761 0.9159 0.4596 0.9474 0.93930.5689 0.5764 0.9606 0.9639 0.6018 0.9739 0.96980.4142 0.5181 0.9764 0.9445 0.5102 0.9578 0.96740.9513 0.8847 0.5232 0.7495 0.9375 0.6455 0.73850.5164 0.5060 0.9783 0.9652 0.5407 0.9840 0.97200.4992 0.6229 0.9480 0.9449 0.6044 0.9439 0.96610.9418 0.7208 0.1943 0.5086 0.8534 0.3626 0.47080.5437 0.5980 0.8706 0.9384 0.6293 0.9024 0.95900.3467 0.3385 0.9982 0.8989 0.3687 0.9729 0.91130.5222 0.4659 0.9805 0.9462 0.5140 0.9883 0.94200.3806 0.4650 0.9085 0.9304 0.4991 0.8998 0.95070.9625 0.8569 0.4859 0.7428 0.9309 0.6248 0.72140.9624 0.8954 0.4710 0.7190 0.9437 0.6027 0.70620.6040 0.5514 0.9601 0.9581 0.6012 0.9871 0.96110.5067 0.4843 0.9880 0.9484 0.5200 0.9912 0.95670.3904 0.7438 0.6513 0.7873 0.6794 0.6489 0.82270.5693 0.5432 0.9705 0.9622 0.5833 0.9856 0.96840.6055 0.5699 0.9497 0.9474 0.6119 0.9756 0.9530

Correlation between variables


0.8716 0.8883 0.9154 0.9280 0.7005 0.9209 0.91450.6598 0.5718 0.8532 0.6412 0.9217 0.8423 0.88390.9818 0.7118 0.9245 0.9564 0.7575 0.9303 0.94700.8190 0.6022 0.7477 0.8246 0.4710 0.7820 0.84900.8134 0.6668 0.8279 0.8240 0.6158 0.8482 0.91060.9242 0.6651 0.8566 0.8884 0.7461 0.8428 0.82180.9754 0.7556 0.9116 0.9596 0.7388 0.9166 0.90690.7595 0.6553 0.8947 0.7663 0.8329 0.8913 0.95050.9915 0.7158 0.9023 0.9731 0.6826 0.9125 0.92910.8207 0.8884 0.8610 0.8794 0.7322 0.8429 0.77470.7119 0.6247 0.7288 0.7436 0.4908 0.7626 0.82580.8976 0.6506 0.8361 0.8912 0.5931 0.8591 0.91390.8930 0.6550 0.8697 0.8571 0.8099 0.8498 0.82660.6327 0.3880 0.5476 0.6270 0.2906 0.6039 0.70050.6845 0.4501 0.5604 0.6882 0.2425 0.6100 0.68840.5635 0.8270 0.6168 0.6942 0.3147 0.6456 0.61410.9286 0.6476 0.8565 0.8852 0.7513 0.8462 0.82960.9475 0.7772 0.9405 0.9376 0.8036 0.9557 0.95020.6420 0.7038 0.6462 0.7333 0.3726 0.6918 0.70580.9758 0.6444 0.8927 0.9317 0.7524 0.8829 0.89270.9408 0.8165 0.9659 0.9517 0.8207 0.9543 0.94861.0000 0.6412 0.8737 0.9535 0.6691 0.8853 0.91060.6412 1.0000 0.7950 0.7707 0.6396 0.7997 0.71420.8737 0.7950 1.0000 0.8924 0.8785 0.9540 0.95290.9535 0.7707 0.8924 1.0000 0.6642 0.8923 0.89500.6691 0.6396 0.8785 0.6642 1.0000 0.8461 0.83560.8853 0.7997 0.9540 0.8923 0.8461 1.0000 0.95730.9106 0.7142 0.9529 0.8950 0.8356 0.9573 1.00000.9697 0.6849 0.9345 0.9382 0.7818 0.9398 0.97860.6306 0.5039 0.5324 0.6725 0.1988 0.5754 0.64100.8891 0.7393 0.8683 0.8778 0.7668 0.8557 0.81310.5822 0.7033 0.7501 0.6570 0.6333 0.7632 0.80480.8936 0.7671 0.8698 0.8967 0.7545 0.8484 0.80980.9670 0.7444 0.9098 0.9533 0.7354 0.9003 0.89840.9139 0.7918 0.9114 0.9169 0.7935 0.8968 0.86370.7702 0.6759 0.7088 0.8082 0.3938 0.7568 0.79170.9622 0.7177 0.9159 0.9328 0.7834 0.9112 0.91020.9233 0.8345 0.8970 0.9477 0.7268 0.8874 0.85570.5167 0.3819 0.5056 0.5321 0.2787 0.5518 0.64930.8816 0.7841 0.9493 0.8819 0.8475 0.9298 0.93980.9190 0.6208 0.8405 0.8690 0.7404 0.8248 0.80820.9737 0.6540 0.8717 0.9248 0.7189 0.8752 0.87040.8463 0.7681 0.9522 0.8479 0.9303 0.9305 0.91850.7479 0.6382 0.7077 0.7824 0.4322 0.7487 0.80270.7349 0.6557 0.6838 0.7792 0.3784 0.7232 0.76990.9866 0.6961 0.8997 0.9481 0.7154 0.8971 0.90850.9700 0.6877 0.8878 0.9358 0.7358 0.8825 0.87850.6247 0.9617 0.8305 0.7468 0.7604 0.8218 0.74900.9794 0.7137 0.9027 0.9491 0.7330 0.9029 0.90580.9792 0.6980 0.8862 0.9558 0.6784 0.8777 0.8879


0.9090 0.7394 0.8043 0.8287 0.8259 0.8927 0.86300.8047 0.3741 0.6443 0.7583 0.6214 0.6820 0.68760.9795 0.6021 0.9127 0.6477 0.9145 0.9768 0.93990.8521 0.9413 0.5437 0.8407 0.5609 0.7402 0.61660.8842 0.8757 0.5826 0.9123 0.5979 0.7568 0.66320.8838 0.3190 0.9850 0.3899 0.9828 0.9627 0.98060.9545 0.5388 0.9567 0.5759 0.9628 0.9888 0.97630.8850 0.6457 0.6320 0.8997 0.6301 0.7383 0.70270.9749 0.6793 0.8852 0.6521 0.8963 0.9712 0.92080.8049 0.3610 0.9511 0.5012 0.9689 0.9148 0.96680.7916 0.9306 0.4452 0.9263 0.4626 0.6436 0.53350.9272 0.8371 0.6793 0.8069 0.6913 0.8435 0.74230.8739 0.2543 0.9810 0.3972 0.9724 0.9435 0.97430.6835 0.9547 0.2715 0.8014 0.2885 0.5050 0.35450.6992 0.9813 0.3342 0.7653 0.3604 0.5689 0.41420.5910 0.8440 0.4371 0.7952 0.4827 0.5764 0.51810.8907 0.3212 0.9801 0.3919 0.9761 0.9606 0.97640.9646 0.5780 0.9126 0.6972 0.9159 0.9639 0.94450.6811 0.9173 0.4224 0.8544 0.4596 0.6018 0.51020.9485 0.4731 0.9465 0.5099 0.9474 0.9739 0.95780.9614 0.5535 0.9310 0.6882 0.9393 0.9698 0.96740.9697 0.6306 0.8891 0.5822 0.8936 0.9670 0.91390.6849 0.5039 0.7393 0.7033 0.7671 0.7444 0.79180.9345 0.5324 0.8683 0.7501 0.8698 0.9098 0.91140.9382 0.6725 0.8778 0.6570 0.8967 0.9533 0.91690.7818 0.1988 0.7668 0.6333 0.7545 0.7354 0.79350.9398 0.5754 0.8557 0.7632 0.8484 0.9003 0.89680.9786 0.6410 0.8131 0.8048 0.8098 0.8984 0.86371.0000 0.6454 0.8624 0.7191 0.8614 0.9474 0.90150.6454 1.0000 0.3057 0.7741 0.3339 0.5304 0.38640.8624 0.3057 1.0000 0.4199 0.9885 0.9517 0.98780.7191 0.7741 0.4199 1.0000 0.4358 0.5735 0.51390.8614 0.3339 0.9885 0.4358 1.0000 0.9575 0.99150.9474 0.5304 0.9517 0.5735 0.9575 1.0000 0.97170.9015 0.3864 0.9878 0.5139 0.9915 0.9717 1.00000.7894 0.9573 0.5172 0.8311 0.5393 0.7029 0.59600.9526 0.4697 0.9636 0.5544 0.9631 0.9876 0.97930.8982 0.4896 0.9677 0.5563 0.9799 0.9747 0.98570.6040 0.9290 0.1639 0.8545 0.1795 0.4035 0.25740.9339 0.5161 0.8768 0.7138 0.8819 0.9091 0.91580.8747 0.2898 0.9765 0.3549 0.9715 0.9507 0.96800.9377 0.4652 0.9522 0.4792 0.9528 0.9776 0.96060.9059 0.3472 0.9218 0.6330 0.9149 0.9030 0.94480.7928 0.9588 0.4789 0.8791 0.5015 0.6735 0.55950.7659 0.9702 0.4658 0.8514 0.4963 0.6651 0.54950.9607 0.5551 0.9353 0.5617 0.9388 0.9806 0.95560.9379 0.4578 0.9659 0.4918 0.9694 0.9867 0.97680.6975 0.4436 0.7411 0.7428 0.7609 0.7305 0.78960.9575 0.5219 0.9532 0.5496 0.9560 0.9876 0.97040.9397 0.5673 0.9261 0.5433 0.9357 0.9761 0.9468


0.8785 0.8702 0.9045 0.6489 0.9031 0.7481 0.82930.5285 0.7217 0.6304 0.5175 0.8127 0.6250 0.65330.7587 0.9775 0.9432 0.5127 0.9118 0.9203 0.96640.9735 0.7037 0.6802 0.9035 0.7370 0.5424 0.69030.9386 0.7371 0.7057 0.8850 0.8028 0.5707 0.70320.5223 0.9769 0.9552 0.1833 0.8697 0.9977 0.97850.7173 0.9884 0.9796 0.4099 0.9230 0.9547 0.98260.7583 0.7573 0.6942 0.7378 0.8557 0.6158 0.69560.8204 0.9602 0.9392 0.5648 0.9043 0.8959 0.96000.5713 0.9091 0.9745 0.1928 0.8666 0.8922 0.87740.9575 0.6111 0.5950 0.9486 0.7040 0.4208 0.57680.9194 0.8139 0.7782 0.8046 0.8075 0.6873 0.80810.4688 0.9685 0.9347 0.1520 0.8764 0.9904 0.95840.9179 0.4708 0.4266 0.9750 0.5355 0.2872 0.46200.9513 0.5164 0.4992 0.9418 0.5437 0.3467 0.52220.8847 0.5060 0.6229 0.7208 0.5980 0.3385 0.46590.5232 0.9783 0.9480 0.1943 0.8706 0.9982 0.98050.7495 0.9652 0.9449 0.5086 0.9384 0.8989 0.94620.9375 0.5407 0.6044 0.8534 0.6293 0.3687 0.51400.6455 0.9840 0.9439 0.3626 0.9024 0.9729 0.98830.7385 0.9720 0.9661 0.4708 0.9590 0.9113 0.94200.7702 0.9622 0.9233 0.5167 0.8816 0.9190 0.97370.6759 0.7177 0.8345 0.3819 0.7841 0.6208 0.65400.7088 0.9159 0.8970 0.5056 0.9493 0.8405 0.87170.8082 0.9328 0.9477 0.5321 0.8819 0.8690 0.92480.3938 0.7834 0.7268 0.2787 0.8475 0.7404 0.71890.7568 0.9112 0.8874 0.5518 0.9298 0.8248 0.87520.7917 0.9102 0.8557 0.6493 0.9398 0.8082 0.87040.7894 0.9526 0.8982 0.6040 0.9339 0.8747 0.93770.9573 0.4697 0.4896 0.9290 0.5161 0.2898 0.46520.5172 0.9636 0.9677 0.1639 0.8768 0.9765 0.95220.8311 0.5544 0.5563 0.8545 0.7138 0.3549 0.47920.5393 0.9631 0.9799 0.1795 0.8819 0.9715 0.95280.7029 0.9876 0.9747 0.4035 0.9091 0.9507 0.97760.5960 0.9793 0.9857 0.2574 0.9158 0.9680 0.96061.0000 0.6579 0.6777 0.8802 0.6993 0.4914 0.63900.6579 1.0000 0.9662 0.3664 0.9221 0.9686 0.98520.6777 0.9662 1.0000 0.3274 0.9024 0.9365 0.94810.8802 0.3664 0.3274 1.0000 0.4777 0.1600 0.34150.6993 0.9221 0.9024 0.4777 1.0000 0.8530 0.87850.4914 0.9686 0.9365 0.1600 0.8530 1.0000 0.97660.6390 0.9852 0.9481 0.3415 0.8785 0.9766 1.00000.5576 0.9339 0.9067 0.3259 0.9468 0.8953 0.88870.9768 0.6322 0.6367 0.9226 0.6885 0.4547 0.61150.9872 0.6149 0.6358 0.9087 0.6750 0.4394 0.59700.7235 0.9805 0.9551 0.4358 0.9174 0.9501 0.98270.6406 0.9934 0.9664 0.3287 0.8991 0.9819 0.99410.6190 0.7194 0.8175 0.3708 0.8183 0.6242 0.63820.7004 0.9890 0.9664 0.4021 0.9176 0.9621 0.98870.7284 0.9723 0.9573 0.4249 0.8900 0.9419 0.9740



0.8431 0.8637 0.8606 0.8783 0.8453 0.8632 0.88020.8600 0.5809 0.5226 0.6756 0.6615 0.6759 0.68730.9018 0.7495 0.7253 0.9791 0.9699 0.7094 0.97840.6070 0.9830 0.9750 0.7609 0.6840 0.5733 0.73890.7059 0.9618 0.9425 0.7716 0.7024 0.6677 0.75450.9062 0.4834 0.4707 0.9575 0.9872 0.6640 0.96980.9118 0.6845 0.6757 0.9909 0.9905 0.7422 0.99510.8389 0.8037 0.7596 0.7463 0.7044 0.7293 0.74050.8581 0.7970 0.7881 0.9830 0.9610 0.6920 0.97750.8934 0.5222 0.5305 0.8776 0.9051 0.8699 0.89980.5786 0.9823 0.9709 0.6584 0.5740 0.6102 0.63920.7289 0.9248 0.9102 0.8574 0.7984 0.6211 0.84210.9331 0.4407 0.4186 0.9316 0.9696 0.6794 0.94900.3833 0.9409 0.9350 0.5464 0.4484 0.3672 0.51460.3806 0.9625 0.9624 0.6040 0.5067 0.3904 0.56930.4650 0.8569 0.8954 0.5514 0.4843 0.7438 0.54320.9085 0.4859 0.4710 0.9601 0.9880 0.6513 0.97050.9304 0.7428 0.7190 0.9581 0.9484 0.7873 0.96220.4991 0.9309 0.9437 0.6012 0.5200 0.6794 0.58330.8998 0.6248 0.6027 0.9871 0.9912 0.6489 0.98560.9507 0.7214 0.7062 0.9611 0.9567 0.8227 0.96840.8463 0.7479 0.7349 0.9866 0.9700 0.6247 0.97940.7681 0.6382 0.6557 0.6961 0.6877 0.9617 0.71370.9522 0.7077 0.6838 0.8997 0.8878 0.8305 0.90270.8479 0.7824 0.7792 0.9481 0.9358 0.7468 0.94910.9303 0.4322 0.3784 0.7154 0.7358 0.7604 0.73300.9305 0.7487 0.7232 0.8971 0.8825 0.8218 0.90290.9185 0.8027 0.7699 0.9085 0.8785 0.7490 0.90580.9059 0.7928 0.7659 0.9607 0.9379 0.6975 0.95750.3472 0.9588 0.9702 0.5551 0.4578 0.4436 0.52190.9218 0.4789 0.4658 0.9353 0.9659 0.7411 0.95320.6330 0.8791 0.8514 0.5617 0.4918 0.7428 0.54960.9149 0.5015 0.4963 0.9388 0.9694 0.7609 0.95600.9030 0.6735 0.6651 0.9806 0.9867 0.7305 0.98760.9448 0.5595 0.5495 0.9556 0.9768 0.7896 0.97040.5576 0.9768 0.9872 0.7235 0.6406 0.6190 0.70040.9339 0.6322 0.6149 0.9805 0.9934 0.7194 0.98900.9067 0.6367 0.6358 0.9551 0.9664 0.8175 0.96640.3259 0.9226 0.9087 0.4358 0.3287 0.3708 0.40210.9468 0.6885 0.6750 0.9174 0.8991 0.8183 0.91760.8953 0.4547 0.4394 0.9501 0.9819 0.6242 0.96210.8887 0.6115 0.5970 0.9827 0.9941 0.6382 0.98871.0000 0.5562 0.5264 0.8900 0.9065 0.8271 0.90680.5562 1.0000 0.9874 0.6899 0.6073 0.6046 0.66790.5264 0.9874 1.0000 0.6798 0.5957 0.6020 0.65730.8900 0.6899 0.6798 1.0000 0.9867 0.6814 0.99240.9065 0.6073 0.5957 0.9867 1.0000 0.6793 0.99310.8271 0.6046 0.6020 0.6814 0.6793 1.0000 0.69930.9068 0.6679 0.6573 0.9924 0.9931 0.6993 1.00000.8684 0.6916 0.6858 0.9819 0.9799 0.6804 0.9837

Column 50

0.87330.64410.97460.76000.75830.94950.98370.72100.97810.88110.65260.84460.92130.54300.60550.56990.94970.94740.61190.97560.95300.97920.69800.88620.95580.67840.87770.88790.93970.56730.92610.54330.93570.97610.94680.72840.97230.95730.42490.89000.94190.97400.86840.69160.68580.98190.97990.68040.98371.0000


Pairwise correlations are difficult tointerpret when many variables areInvolved Need of multivariatedata analysis tools

Pair-wiseCorrelationsvariables

samples

CORRMAP Correlation map with variable grouping. CORRMAP produces a pseudocolor map which shows the correlation of between variables (columns) in a data set.


• What is the SVD of a data matrix X?....– singular value decomposition– singular values are the root square of the eigenvalues– X=USVT, U ana VT are orthonormal matrices and S is a diagonal

matrix with singular values– SVD is an orthogonal matrix decomposition– The elements in S are ordered according to the variance

explained by each each component– Variance is concentrated in the first components, it allows for

reducing the number of variables explaining the variance structure and filtering the noise

1

, ,... , ,... , =

= + = = <<

= +

∑TX USV E

K

ij ik k kj ijk

x u s v e i 1 I j 1 J K I or J


0 2 4 6 8 100

200

400

600

800

1000

0 2 4 6 8 100

200

400

600

800

0 2 4 6 8 100

10

20

30

40

50

0 2 4 6 8 100

10

20

30

40

Effect of data pretreatments on SVDplot(svds)

Raw data X Mean-centered data X

Scaled data X Autoscaled data X

4 larger components


0 2 4 6 8 100

100

200

300

400

500

600

700

800

900

0 2 4 6 8 100

5

10

15

20

25

30

35

Methods of data pretreatmentEffect of pretreatments

Raw data X log10 data X

4 components how many components?non-linearity?





Principal Components (PCA)

x1

x2

••

•• •

••

•

Original Variables

Data Representation

PC1

Reduce the dimensions of the original space.Keep the relevant information about data variance.

m1m2

• • • • • • ••

Principal Components

PC1m1m2

PC1 Direction of maximum variation

Principal Components (PCA)Data Representation

Reduce dimensions of the original space.Keep the relevant information about data varianceNot repeat information among PCs (they areorthogonal).

x1

x2

x3

•

••••

••••

•

••••

••••

•••••

••••

PC1PC2

•••••••••• ••••••••

•••••••••

PC1

PC2

PC2 Direcction of maximumremaining variance ⊥ PC1

λ1 λ2 λ3a 1 2 3b 3 3 5c 4 5 8d 5 4 7e 2 1 2f 0 0 0

*c

*e*f

*b

*d

*a

λ1

λ2

λ3

Geometrical interpretation


*a

*b

*c

*d*e

*f

PCA

PC1

PC2


* *

**

**

*

**

**

*PC1directionof maximumvariance

PC2orthogonalto PC1

λ2

λ1

** * **

*

* *

**

λ3



• They are mathematical variables which describe efficiently the data variance.– The relevant variance of the original data is described

by a reduced number of components (PCs)– Visualization of large data sets (many variables) in the

PC space of reduced dimensions.• Information is not repeated (overloaded, PCs are

orthogonal).• Describe the main directions of data variance in

decreasing order.• They are linear combination of the original variables.


• They are linear combination of the original variables.

x1

x2

••

•• •

••

•

t1

m1m2

x1p11

x 2p 2

1

t1 = x1p11 + x2p21

p11, p21

loadings of the original variables in first PC, t1.

PCA ModelRelationship betwen the data in the PC spaceand in the original space.

t1 t2 x1 x2 x3 p1 p2

=tj1 tj2 xj1 xj2 xj3

p11

p21

p31

T (n,2)

X (n,3)

P(3,2)

tj1 = xj1 p11 + xj2 p21 + xj3p31

T = XP

PCs (linear combination of the original variables)

PCA Model

T (scores matrix).• Describe the

samples in theprincipal components space.

• They are orthogonal. – ti

T tj = 0.

P (loadings matrix).• Describe the original

variables in theprincipal components space.

• They are orthonormal. – pi

T pj = 0.– ||pi|| = 1.– PTP = I

T = X P(n,npc) (n,m) (m,npc)

n (samples), m (variables), npc (principal components)

X = TPT

PCA Model: X = T PT

scores loadings (projections)

= +XT

PT

E

X = t1p1T + t2p2

T + ……+ tnpnT + E

X t1p1T t2 p2

T tnpnT E= + +….+ +

n number of components (<< number of variables in X)

rank 1 rank 1 rank 1

PCA Model

Model: X = T PT + EX = structure + noise

It is an approximation to the experimental data matrix X

• Loadings, Projections: PT relationships between original variables and the principal components (eigenvectors of the covariances matrix). Vectors in PT (loadings) are orthonormals (orthogonal and normalized).

• Scores, Targets: T relationships between the samples (coordinates ofsamples or objects in the space defined by the principal componentsVectors in T (scores) are orthogonal

Noise E Experimental error, non-explained variances

PCA Model

Determination of the number of components

• When the expected experimental error is known.– Plots of explained or residual variance as a function of

the nr. of components of the model.• Ex. models explaining a 95% of the variance are satisfactory?.

– Compare mean residual values with experimental error size.

• Ex. absornance errors in UV are aprox. 0.002.

mn

ee j,i ij

×=

∑2

n × m number of elements of X matrix

Number of PC → ē ≤ error (0.002).

PCA Model

• When experimental error is unknown:– Plot of singular values (or of eigenvalues).– Empirical functions related to experimental errors.– Cross Validation Methods.

Determination of the number of components

PCA Model

n cromatograms m spectra

How many components coeluted?The number of coeluting components is deduced from the number ofpricipal components.

ExampleDetermination of the number of components

PCA Model

– Plot of singular values(sk) or of functions of eigenvalues (λk).

• Singular values or eigenvalues vs. number of PCs (sk).• Log(eigenvalues) vs. number of PCs (λk).• Log(reduced eigenvalues) vs. number of PCs (REVk).

)1()1()(

+−+−=

kckrkREV kλ

λk = (sk)2r ne filesc ne columnes

Size of sk /λk /REVk ∝ associate PC importance

Determination of the number of componentsPCA Model

4 significative componentsThe rest of components are used to explain the experimental noise

log eigenvalues log(REV)

PCA ModelDetermination of the number of components

– Evaluation of empirical functions related to error.• Eigenvalue fucntions. Take advantage of the relation

between the explained variance and teh size of theeigenvalues.

• These functions have minima or considerable sizechanges for the optimal number of PCs


Malinowski error functions

( )2RSDIND

c n=

−

c0k

k=n+1RSD =r(c-n)

λ∑

c number of columnsr number of rowsn number of componentsλ0

k eigenvalue of componentk

Indicator Function (IND)

Mínimum IND → optimal number of PCs


n)-r(c

0k

c

1+n=k=RSDλ∑ c

nRSDIE =

cncRSDXE −

=

λ

λ0k

s

1+n=k

n

s

1+n=k

1)+n-1)(c+n-(r

1)+k-1)(c+k-(r=n)-sF(1,

∑

∑

Malinowski error fucntions

Statistical test estadístic of eigenvalues (Malinowski)

( )2RSDIND

c n=

−

Imbedded error

Extracted error


n Valor propi RE IND REV %SL (test F) 1 5.4068e+001 1.6716e-002 6.6864e-006 1.1043e-002 0 2 1.2172e+000 5.1353e-003 2.1388e-006 2.5625e-004 -3.5756e-005 3 6.6753e-002 3.5263e-003 1.5305e-006 1.4493e-005 1.6084e-003 4 1.8613e-002 2.9281e-003 1.3255e-006 4.1695e-006 3.7223e-001 5 2.1991e-003 2.8744e-003 1.3584e-006 5.0858e-007 2.9015e+001 6 2.0762e-003 2.8223e-003 1.3937e-006 4.9599e-007 2.9475e+001 7 1.9640e-003 2.7715e-003 1.4316e-006 4.8495e-007 2.9895e+001 8 1.9102e-003 2.7198e-003 1.4710e-006 4.8779e-007 2.9620e+001 9 1.7319e-003 2.6728e-003 1.5152e-006 4.5770e-007 3.1087e+001 10 1.6755e-003 2.6254e-003 1.5618e-006 4.5852e-007 3.0982e+001

Eigenvalues and REVs > 4 have lower sizesIND has a minimum in 4.RE lower its size.Eigenvalue for PC 4 is significatively larger than higher ones.


– Cross-validation methods.• A part of the data is used to built the model and another

part of the data is described by this model. The optimalnumber of components is teh one giving lower resoidualsin the description of new data.

• This procedure is repeated until all the samples are usedto built the model and as external data set.

• The final results are the mean of all repetitions obtainedin the modeling/description of the non-included samples.


1. Divide the data sample set on q subsets.2. Built PCA modesl with q –1 data subsets (Xmodel).3. Use these PCA models to explain the external data subset (Xextern).

i. Scores. Textern = XexternP.ii. Reproduction Xextern.

4. PRESS (Predictive Residual Sum of Sq uares) calculation usingdiffernt numbert of PCs.i. For PC k.

5. Repeat steps 1-4 until the q subsets have been used as external data sets.

6. Plot PRESScum vs. number of PCs .i. For PC k.

Texternextern PTX =ˆ

( )∑ −=j,i ijij xx̂)k(PRESS 2

∑=qcum )k(PRESS)k(PRESS

q number of PCA models


Xextern

Xmodel1 ... Xmodeln

Xextern

PRESScum,i = Σj PRESSji

Xmodel1

1PC 2PC .... mPC

Xmodel2

Xmodeln

.

.

.

PRESS11 PRESS12 PRESS1m

PRESS21 PRESS22 PRESS2m

PRESSn1 PRESSn2 PRESSnm

PRESScum,1 PRESScum,2 PRESScum,m


Nombre de PCs

PRES

S cum

1 2 3 4 5 6 7 8 9 100

0.5

1

1.5

2

2.5

3

3.5

PC Nr. PRESScum

1.0000 3.4772

2.0000 0.2320

3.0000 0.1117

4.0000 0.0505

5.0000 0.0515

6.0000 0.0517

7.0000 0.0521

8.0000 0.0524

9.0000 0.0535

10.0000 0.0531

Optimal number of PCs → Minimum value of PRESS


(cross-validation)

Model Fitting

PC reliability

Model reliability

How many principal components?Ex

plai

ned

Var

iànc

e

Number of PCs

Higher PCs explain data noiseA PCA model with noisy PCs is less reliable when describes

new dataWith more PCs in the model, better data fitting but the model

reliability when it is applied to new data may be worse (overfitting)


PCA model visualitzation

X = T PT

Scores plot(map of samples on PC space)

Loadings plot(map of variables on PC space)

tj1 tj2mj

T

PC1 PC2

PC1

PC2

•mj (tj1, tj2)

pj1 pj2

P

PC1 PC2

PC1

PC2

xj (pj1, pj2)

xj

Sn Zn Fe Ni 0.20 4.10 0.06 0.0022 0.20 4.04 0.04 0.0020 0.15 3.82 0.08 0.0015 0.61 3.96 0.09 0.0062 0.57 4.07 0.08 0.0056 0.58 4.23 0.07 0.0055 0.30 3.73 0.02 0.0033 0.60 3.80 0.07 0.0056 0.10 4.23 0.05 0.0014

Sam

ples

X (9,4)

Example PCA model visualitzation

Scores plot• distance among

samples shows theirsimilarity (5,6 and 4,8 very similar).

• Detection of samplegroups (clusters) (I i II).

• External informationcan help to identify thenature of the detectedgroups (ex. samplesorigen,...).

• Very distant samplesare extreme samples.

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5

-0.2

-0.1

0

0.1

0.2

0.3

123

45 6

7

8

9

PC1PC2

Samples Map

II

I

Original data


Loadings plot

• Relevant variables for themodel have high loadings andfar from the origen (Zn, Sn).

• Close variables to the origen do not give information aboutthe data variance (Fe, Ni).

• High Loadings in one PC ishow high weight in thiscomponent (Zn – PC1, Sn –PC2).

• Correlated Variables withlower PCs have more importance.

• Correlations between variables are described by their angle(Zn are Sn little correlated).

• Positive (direct) and negative(indirect) correlations betweenvariables can be detected.

-0.8 -0.4 0.4 0.8

-0.8

-0.4

0.4

0.8

PC1

PC2

Sn

ZnFeNi

Map of variables

Original Data


• Close samples to the origen aresililar to the average sample. This cannot be distinguishedwhen they are separated in groups.

• The importance of the variables change because centeringeliminates the weight of the scalesize. Sn, related with PC1, isnow more important than Zn.

-0.4 -0.3 -0.2 -0.1 0.1 0.2 0.3

-0.3

-0.2

-0.1

0.1

0.2

12

3

4

5

6

7

8

9

PC1

PC2

-0.2 0.2 0.4 0.6 0.8 1 1.2

-0.8

-0.6

-0.4

-0.2

0.2

0.4

0.6

0.8

Sn

Zn

FeNi PC1

PC2

Centered dataScores plot Loadings plot


• They eliminate the scale size and range. The effect of all variables is enhanced.

• Correlation information amongvariables is more clearly seen.

-1 -0.5 0 0.5 1 21.5

-1.5

-1

-0.5

0.5

1.5

1

2

34

5

6

7

8

9

PC1

PC2

0.1 0.2 0.3 0.4 0.5 0.60.2

0.4

0.6

0.8

1

1.2

Sn

Zn

Fe

NiPC1

PC2Scores plot

Autoescaled dataLoadings plot


• Objects (samples) close to the origin are ‘ordinary’ objects • Objects (samples) far from the origin are ‘extreme’ objects or ‘outliers’• Objects close between them are ‘similar’ objects• Objects far between them are ‘different• Objects can be grouped in ‘clusters’ which have common characteristics and they have different characteristics to other‘clusters’ ==> Cluster Analysis • The set of objects should cover the whole scores plot, otherwisethere are ‘clusters’• Principal Components identification may be achieved from theexternal identification of the clusters of objects• Simultaneous analysis of ‘loadings’ and ‘scores’ plots help toidentify/interpret the principal components principals

Interpretation of the ‘scores’(targets, punctuations)


Scores and Loadings

• ‘scores’ show the relationships between samples

• ‘loadings’ show the relationships between variables

• ‘scores’ and ‘loadings’ should be interpreted in pairs

• ‘scores’ and ‘loadings’ are plotted ones againstthe others


Outliers

• Outlier samples can have a great influence(leverage) in the PCA model

• They can be detected

• To detect them, find:isolated samples in scores plotssamples with large values of Q or T2 or both


Detection of anomalous objects

Extreme Objects:• Diferent to the rest of

objects.

Outlier objects• Extreme objects that

cannot be fitted by themodel.

x1

x2

•• •

••

••

•

PC1

••

•

•

O. extreme

O. outlier

An object (samples) is extreme or outlierwhen one or more variables have theirvalues very different to the other samples


Outlier detectionWhy is needed?• They distort the model• They hidden the structure of the rest of the data.When they should be eliminated and How?• If it is justified methematically and chemically.• It should be done gradually and stating with the more

outstanding ones.

x1

x2

•• •

••

••

•

PC1••

•O. outlier


Outlier DetectionMathematical Indicadors• Extreme objects

–Mahalanobis Distance (to themean model)

22 ,

1( 1)

ni k

ikk

td ns λ=

= − ∑di distànce of object (sample) i

(for centered data)

Leverage. Related to the scoressize

2,

1

1n

i ki

kk

th ns λ=

= +∑

hi ‘leverage’ of the sample i

Outlier objectsSize of the residual.

∑==j

iji eeQ 22

j variable index

ns number of samplesti,k2 score value for sample i component kλk eigenvalue of component kn number of PCs

T2. Related to the scores size

∑= λ

=n

k k

kii

tT

1

2,2

Detection of outliers• Leverage plot.

Leverage (hi)

Res

idua

l (Q

)

• • •• •• •

•••

•

• • •• ••

•

•••

• • •• •• •

•••

• • •• •• •

•••• • •• •

• ••

•••

Objects poorly described by the model

Outliers

Extreme Objects

To be used with a low number of PCs !!!!(Higher PCs are used to describe outliers)

Similar gràfics Q as a function of T2 eare used in process control

PCA Statistics Residuals statistic to measure the lack of fit (large residuals)Qi = ei ei

T = xi (I-PkPkT) xk

T

samples with large Qi values are unusual(they are out of the model!!!!)

Hotelling statistic T2

Ti2 = ti λi

-1tiT = xi Pk λi

-1 PkT xi

T

samples with large values of Ti2 are unusual

(they are inside the model with high leverage!!!!)

These statistics are used to develop control charts and limits in Statistics Process Control

Principal Component Analysis (PCA)

PCA StatisticsQi = eiei

T = xi(I - PkPkT)xi

T, variation out of the PCA modelTi

2 = ti-1tiT = xiPΤ-1PTxi

T, variation inside of the PCA model


Summary: Building PCA model

• Need of data pretreatment.• Determination of the number of principal

components of the PCA model.• Detection and elimination of outliers• Repetition of previous steps• Assesment of the model quality and reliability

• What is SVD?....– PCA is many times done using SVD – SVD is an orthogonal matrix decomposition– X=USVT, U ana VT are orthonormal matrices and S is

a diagonal matrix with singular values– The elements in S are ordered according to the

variance explained by each each component– Variance is concentrated in the first components, it

allows for reducing the number of variables explaining the variance structure and filtering the noise

– singular values are the root square of the eigenvalues

Singular Value Decomposition (SVD)

1

K

ij ik k kjk

x u s v=

= =∑ T TX = USV TPscores

loadings

Principal Components in m dimensions

t1 = p11x1 + p12x2 + ..............+ p1mxmt2 = p21x1 + p22x2 + ..............+ p2mxm....................................................................................................................tm = pm1x1 + pm2x2 + ..............+ pmmxm

Linear combination of the original variables (already mean centred)

1 11 12 1m 1

2 21 22 2m 2

m m1 m2 mm m

t p p ... p xt p p ... p x... ... ... ... ... ...t p p ... p x

⎛ ⎞ ⎛ ⎞⎛ ⎞⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟=⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎝ ⎠ ⎝ ⎠⎝ ⎠

principalcomponents

originalvariables


PCA Model: X = T PT

scores loadings (projections)

= +XT

PT

E

X = t1p1T + t2p2

T + ……+ tnpnT + E

X t1p1T t2 p2

T tnpnT E= + +….+ +

n number of components (<< number of variables in X)

rank 1 rank 1 rank 1


Model: X = T PT + EX = structure + noise

It is an approximation to the experimental data matrix X

• Loadings, Projections: PT relationships between original variables and the principal components (eigenvectors of the covariances matrix). Vectors in PT (loadings) are orthonormals (orthogonal and normalized).

• Scores, Targets: T relationships between the samples (coordinates ofsamples or objects in the space defined by the principal componentsVectors in T (scores) are orthogonal

Noise E Experimental error, non-explained variances


PCA Model : X = TPT + EDetermination of the number of principal components A

X(n,m), when n > m, m is the maximum number of PCs, Amax= mwhen m > n, n is the maximum number of PCs, Amax= n

In general, a number much smaller of PCs is used ‘data compression’, ‘data reduction’, A << n or m

A is chosen for the variance in TPT having most of the relevant structure of X, whereas noise remains in E(noise does not interest us! we want to filter it! ...)

To select the appropriate number of PCs, A, E (residuals, lack of fit,...) has to be studied quantitation of the variance in E, e.g. by residuals variance in %


Determination of the number of principal components

a) visual inspection of the magnitudee of the singular values. Graphical representation (search for an inflexion)

b) representationss of the explained/residual variance respect the number of principal component

c) For autoscaled data, keep components until their λ aprox 1-2d) When the noise level is known, select the number of PCs until

the residual variance is similar to noise variance. e) Consider PCs until ‘loadings’ have structural features (nor noise)f) Use statistical tests and methods based in the previous

knowledge of experimental noise size. g) Approximate methods when experimental noise is not known

• Malinowski error functions • cross validation• .........


0 2 4 6 8 10 120

50

100

150

200

250

300

350

400

450

Determination of the number of PCsfrom eigenvalue/singular value plots

4 components


Model Fitting

PC reliability

Model reliability

How many principal components?

With more PCs in the model, better data fitting but the model reliability when it is applied to new data may be worse (overfitting)

Expl

aine

d V

arià

nce

Number of PCs


Cross-validation Methods- a data subset is eliminated from the original data matrix X Xr- a number of components k is estimated for Xr- the eliminated data subset is predicted for k componentsand the predicted values are compared with the actual values


X

Xr k PCsPCA

Xr = Tk PkT

x Tx̂ x P Pk k

êvaluation of (x x)

=

−

eliminated

PCAprojection


Cross-validation Methodsa data subset is eliminated from the original data matrix X Xr- a number of components k is estimated for Xr- the eliminated data subset is predicted for k componentsand the predicted values are compared with the actual values

r c2

ij iji=1 j=1

ˆPRESS(k) = (x - x (k))∑∑PRESS is plotted for the different number of considered componentsk, and the minimum value of PRESS or when it does not decreaseany more is looked for



oo

o o

oo

o o

o

o

o

oo

o oo

PC1

x1

x2

x1 loading

x2 lo

adin

gLoadings

Loadings are orthonormal, PTP = I and PT = P-1


Loadings interpretation

• Determination of the more important variables inthe formation of the principal components (those variableswith large loadings are important, either neg. or pos.)

• Multivariate correlation between variables– positive correlation (common variation)– negative correlation (contrary variation)

• Identification and qualitative information (fingerprinting) on the variation sources


Scores

oo

o o

oo

o o

o

o

o

oo

o oo

PC1

sample s

core t 1

x1

x2

Projection of X in the PCs (loadings) gives the ‘scores’T = XP


• Objects (samples) close to the origin are ‘ordinary’ objects • Objects (samples) far from the origin are ‘extreme’ objects or ‘outliers’• Objects close between them are ‘similar’ objects• Objects far between them are ‘different• Objects can be grouped in ‘clusters’ which have common characteristics and they have different characteristics to other‘clusters’ ==> Cluster Analysis • The set of objects should cover the whole scores plot, otherwisethere are ‘clusters’• Principal Components identification may be achieved from theexternal identification of the clusters of objects• Simultaneous analysis of ‘loadings’ and ‘scores’ plots help toidentify/interpret the principal components principals

Interpretation of the ‘scores’(targets, punctuations)


Scores and Loadings

• ‘scores’ show the relationships between samples

• ‘loadings’ show the relationships between variables

• ‘scores’ and ‘loadings’ should be interpreted in pairs

• ‘scores’ and ‘loadings’ are plotted ones againstthe others


PCA Statistics Residuals statistic to measure the lack of fit (large residuals)Qi = ei ei

T = xi (I-PkPkT) xk

T

samples with large Qi values are unusual(they are out of the model!!!!)

Hotelling statistic T2

Ti2 = ti λi

-1tiT = xi Pk λi

-1 PkT xi

T

samples with large values of Ti2 are unusual

(they are inside the model with high leverage!!!!)

These statistics are used to develop control charts and limits in Statistics Process Control


PCA StatisticsQi = eiei

T = xi(I - PkPkT)xi

T, variation out of the PCA modelTi

2 = ti-1tiT = xiPΤ-1PTxi

T, variation inside of the PCA model


Outliers

• Outlier samples can have a great influence(leverage) in the PCA model

• They can be detected

• To detect them, find:isolated samples in scores plotssamples with large values of Q or T2 or both


Outliers in scores plots

xx x

xxx xx

x

x

xx

x

xx

xxxx x

xxx

xxx

xxxx

xx x

x

xx

x x

Scores en PC2

Scor

esen

PC

1


Detection of ‘outliers’From scores plots ==> outlier samplesFrom loadings plots ==> outlier variables Leverage samples or variables affecting very much the PCA modelIt is evaluated from the expression:

2,

1

1n

i ki

kk

th ns λ=

= +∑hi sample i ‘leverage’ti,k ‘score’ of sample i on the k componentλk singular value of k componentns number of samples n number of considered components





Multiple linear regression (MLR) is a method used to model thelinear relationship between a dependent variable (predictand) andone or more independent variables (predictors).

MLR is based on least squares: the model is fit such that thesum-of-squares of differences of observed and predicted values isminimized..

The performance of the model on data not used to fit the model isusually checked in some way by a process called validation.

The reconstruction is a "prediction" in the sense that theregression model is applied to generate estimates of thepredictand variable different to the used to fit the data. TheUncertainty in the reconstruction is summarized by confidenceintervals, which can be computed by various alternative ways.

Multivariate (Multiple) Linear Regression (MLR)

MLR Model0 1 ,1 2 ,2 ,

th,

0

value of predictor in sample

regression constant

coefficient on the predictor

total number of predictors= predictand in sample

error termIn vecto

= + + + + +

=

=

=

=

=

…i i i K i K i

i j

thj

i

i

y b b x b x b x e

x j i

b

b j

Ky ie

r-matrix form: = +y Xb e

Multivariate Linear Regression

MLR Predictions

0 1 ,1 2 ,2 ,

th,

0 1

ˆ ˆ ˆ ˆˆ

value of predictor in new sample ˆ ˆ ˆ, , estimated regression constant

and coefficientsˆ = predicted value for new sample

in matrix-vector

= + + + +

=

=

…i i i K i K

i k

k

i

y b b x b x b x

x k i

b b b

y i

ˆˆform =y Xb


0 1 ,1 2 ,2 ,ˆ ˆ ˆ ˆî i i K i Ky b b x b x b x= + + + +…

MLR Prediction

Measurement i might be outside the range used for calibrationor validation


MLR Residuals

ˆ ôbserved value of predictand in sample

ˆ predicted value of predictand in sample

= −==

i i i

i

i

e y yy iy i


MLR Assumptions

1. Relationships are linear

2. Predictors are nonstochastic

3. Residuals have zero mean

4. Residuals have constant variance

5. Residuals are not autocorrelated

6. Residuals are normally distributed


1. Relationship may be nonlinear or outlier-driven

2. Correlation ≠causation

3. Statistical significance≠ practical significance

4. Lagged effects not measured

5. Problems when X variables (predictors) are strongly correlated (common in practice)

Caveats/Problems to interpretation


Alternatives to MLR

Nonlinearity?

Data transformation,use kernel and

try MLR again Neural Networks

Nonparametric Regression(e.g., kernel regression)

Categorical predictand?

Discriminant analysis

Classification treesLogistic regression

Quadratic response surfaces


Correlation among predictors?

Reduce numberof variables

Stepwise Regrssion

Factor Analysisbased methods

MLR Statistics

• R2 -- explanatory power

• Adjusted R2: R2 adjusted for loss of degrees of freedom due to number of predictors in model

• F and its p-value -- significance of the equation

• se standard error of the estimate;equivalent to “root mean square error” (RMSEc);subscript “c” denotes “calibration”

• Confidence interval for parameters

X


MLR ANOVA Table(testing linearity)

MSE=SSE/(n-K-1)SSEn-K-1Residual

MSR=SSR/KSSRKRegression(model)

SSTn-1Total

Mean squaresSum of Squares

dfSource


Validating the MLR regression model

•Regression R-squared, even if adjusted for loss of degrees of freedom due to thenumber of predictors in the model, can give a misleading, overly optimistic viewof accuracy of prediction when the model is applied outside the calibrationperiod.

•Several approaches to validation are available. Among these are cross-validationand split-sample validation.

•In cross-validation, a series of regression models is fit, each time deleting a different observation from the calibration set and using the model to predict thepredictand for the deleted observation. The merged series of predictions fordeleted observations is then checked for accuracy against the observed data.

•In split-sample calibration, the model is fit to some portion of the data (say, thesecond half), and accuracy is measured on the predictions for the other half ofthe data. The calibration and validation periods are then exchanged and theprocess repeated.


Model Calibration vs Validation

Validation1. Testing the model on data not

used to fit the model2. “validation”, “verification”,

“independent” data3. Accuracy statistics:

RE SSEvMSEvRMSEv

Calibration1. Fitting the model to the data

2. “calibration”, “construction”, “estimation” data

3. Accuracy statistics:{R2, Ra

2,}SSEcMSEcRMSEc

Definitions: validation, cross-validation, split-sample validation, mean square error (MSE), root-mean-square error (RMSE); standard error ofprediction, PRESS statistic, "hat" matrix, extrapolation vs interpolationAdvantages of cross-validation over alternative validation methods


Cross validation stopping rule

Stop here


Error bars for MLR predictions

1. Standard error of the estimate (calibration statistic)

2. Standard error of prediction (calibration statistic)

3. Root-mean-square error of validation (validation statistic)

Heirarchy


Standard Error of MLR Prediction(Equation for simple linear regression)

( )

( )

1/ 2

2*

ˆ2

1

11y e n

ii

x xs s

n x x=

⎡ ⎤⎢ ⎥−⎢ ⎥= + +⎢ ⎥−⎢ ⎥⎣ ⎦

∑Standard errorof the estimate

MSERMSE

e

c

s ==

Term due to uncertaintyIn the estimate of the predictand mean; a function of sample sizen

Term due to departure of the predictor value for the predicted observation from the predictor mean for the calibration period


Univariate linear regression

1 1

0

i o 1 i i i i y x

i i i1 i2 2

i i

2 2y y y2

b b2 2i XX i

2i

b y 2i

y b b x e , y =f(x ), s s 0

(x X)(y Y) (x X)b y

(x X) (x X)

s s ss , s

(x X) S (x X)

xs s

n (x X)

= + + >> ≈

− − −= =

− −

= = =− −

⎛ ⎞⎜ ⎟=⎜ ⎟−⎝ ⎠

∑ ∑∑ ∑

∑ ∑∑

∑

Least Squares and linear regression

Ordinary Least Squares (OLS)Multivariate Linear regression

n experimental measures, m variables X = {xij} n x m independent variablesy = {y1,y2, ..., yn), dependent variables b = {b0,b1,b2,...,bm} m parameters of the linear model Assumption that experimental errors are only important for yi

y = X b, b= (XTX)-1XTys2(b) = (A AT)-1 s2(y) = (X XT)-1 s2(y)

where: A = {Aij}; Aij = δ ri / δ bj = {-Xij};ri = yi – yci, residuals and s2(y) is estimated from:

s2(y) = ∑ ri2 / (m-n)

Weighted Least Squares (WLS)Multivariate Linear regression

n experimental measures X = {xij} n x m independent variablesy = {y1,y2, ..., yn), dependent variables b = {b0,b1,b2,...,bm} m parameters of the linear model W = {wij} n x m weights to each xij value, considering errors in X (error standard deviations of xij,, sij; wij = 1/sij)

unweigthed weightedy = X b y = X bb= (XTX)-1XTy b= (XTWX)-1XTWys2(b) = (A AT)-1 s2(y) s2(b) = (AWAT)-1 s2(y)

where: A = {Aij}; Aij = δ ri / δ bj;ri = yi – yci, residuals and s2(y) is estimated from:

s2(y) = ∑ wi2ri

2 / (m-n)

Generalized Least Squares (GLS)Multivariate Linear regression

n experimental measures X = {xij} n x m independent variablesy = {y1,y2, ..., yn), dependent variables b = {b0,b1,b2,...,bm} m parameters of the linear model M = {mij} n x m weights to each xij value, calculated from errors in X and y (it is more complex!)

y = X b, b= (XTMX)-1XTMys2(b) = (AMAT)-1 s2(y)

where: A = {Aij}; Aij = δ ri / δ bj;ri = yi – yci, residuals and s2(y) is estimated from:

s2(y) = ∑ ri2 / (m-n)

Interpolation vs Extrapolation

Prediction based on predictordata “similar” to that in the calibration range

Prediction based on predictordata “unlike” that in thecalibration range

X


( )−= T 1 TH X X X X

* * *( )h −= T T 1x X X x

* maxh h>

Classifying MLR predicted values

“Hat” matrix, computed from Calibration-only predictor data

Classification statistic

Vector of predictor data for some observationOutside calibration range

Rule identifying “extrapolation”

Maximum value along diagonal of the hat matrix


Conventions & Notation in CalibrationData are arrenged in two blocs/tables/matrices X and Y where:X = matrix of predictor variables Y = matrix of predicted (predictand) variables ns = number of samples/observationsnx = number of variables in Xny = number of variables in Yn = number of PCs/latent variables/components

1 2 3 .... nypredicted (predictant) variables

sam

ples

123..

ns

MatrixY, y

1 2 3 .... nxpredictor variables

Sam

ples 1

23..

ns

MatrixX

Y = f(X)

Find f


Multivariate Linear Regression Causal vs Predictive Model

Causal models X = f (Y) (1)Predictive models (inverse) Y = f(X) or y = f(X) (2)

Independent (predictors) vs dependent (predictands)

Example:X (R) is the matrix of multivariate (instrumental) responses for different samplesY is the matrix of concentrations of one chemical component (or more) in the different samples y (c) is the concentration of one component in the samplesf is the calibration function in the causal (1) and predictive in (2) linear models, f is a linear function


R = C ST + Ens,nw ns,nc nc,nw ns,nw

R matrix of sensor responses (ns samples, nw wavelengths)

C matrix of concentrations (ns samples, nc components)

ST matrix of sensibilities (nc components, nw wavelengths)

E matrix of experimental errors (ns samples, nw wavelengths)

⎛ ⎞ ⎛ ⎞⎛ ⎞⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟= +⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟⎜ ⎟ ⎜ ⎟⎝ ⎠⎜ ⎟ ⎜ ⎟

⎝ ⎠ ⎝ ⎠

1,1 1, 1,1 1, 1,1 1,1,1 1,

2,1 2, 2,1 2, 2,1 2,

,1 ,,1 , ,1 , ,1

....

..

nw nc nwnw

nw nc nw

nc nc nwns ns nw ns ns nc ns

r r c c e es s

r r c c e e

s sr r c c e e

⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠,ns nw

Multicomponent Analysis:Bilinear ModelMultivariate Linear Regression

• Advantages• Total selectivity is not needed• Allow multicomponent analysis• Outlier detection is possible

• More used methods– MLR, Multilinear Regression

• Classical Least Squares (CLS)• Inverse Least squares (ILS)

– Factors based Linear Regression (biased) • Principal Components Regression (PCR)• Partial Least Squares Regression (PLSR)

– Non-linear Regression


CLS Model : R = C ST + Ens,nw ns,nc nc,nw ns,nw

The responses are modelled as a function of the concentrations. It is the same causal model as for the generalized Beer’s law (generalized multilinear model)

Calibration step:a) direct: pure component spectra of the components or

sensibilities are previously known; ST is known b) indirect: pure component spectra of the components are not

previously known; ST is unknown, it has to be estimated in the calibration step:

ST = C+Rwhere: C+=(CTC)-1CT (pseudoinverse)

Prediction step:

foe a set of samples ‘nunk’ with unknown analyte concentration

Cunk = Runk(ST)+

nunk,nc nunk,nw nw,ncCunk matrix of concentrations of the nunk unknown samples

(nunk samples, nc components)Runk matrix of their instrumental responses

(nunk mostres, nw wavelengths)

(ST)+ pseudoinverse of the sensibilities matrix (pure spectra)(nw wavelengths, nc components)

(ST)+=S(STS)-1

Prediction step (one sample):

the concentration of several analytes in one sample:

cTunk= rT

unk(ST)+

1,nc 1,nw nw,nc

or what is the same:

cunk= S+ rnc,1 nc,nw nw,1

where S+ = (STS)-1ST

Classical (Causal) Least Squares (CLS)Advantages (compared to univariate least squares):1. Increase of precision in the estimations(signal averaging)2. Allows the estimation of the pure responses(pure spectra) => qualitative information, identification3. Allows multicomponent quantitative analysisDisadvantages:Needs knowing and introducing the whole information of allthese components contributing to the measured analytical responseIt does not allow calibration in the presence of unknowninterferents (it is not used in the analysis of natural samples)


2. Inverse Calibration. Inverse Least Squares (ILS)

Model: c = R b + e (y = X b)ns,1 ns,nw nw,1 ns,1

The concentrations are modeled as a function of the instrumentalresponses. It is not a causal model. It is a predictive model. Only needs knowing the analyte concentration in the calibration samples.

Calibration: b = R+ c (b = X+c)nw,1 nw,ns ns,1

b is the calibration vector, evaluated from the responses of the calibration samples R where the analyte concentration c is known.

R+ = (RTR)-1RT pseudoinverse of RX+ = (XTX)-1XT pseudoinverse of X

Prediction: cunk = rTunk b (yunk = xT

unk b) 1,1 1,nw nw,1

cunk is the concentration of the analyte in a new samplerT

unk is the instrumental response given by this sampleb is the calibration vector previously evaluated

R is not square => calculation of the generalized inverse or pseudoinverse R+ = (RTR)-1RT

Problem:In the evaluation of the calibration vector b = (RTR)-1RT c,(RTR)-1 nw,nw has to be evaluated The nw rows and nw columns should be linearly independent!!!and ns > nw (number of calibration samples > number of wavelengths

Inverse Least squares, ILS (inverse model)

Advantages- allows the determination of one analyte in the presence of unknowninterferences (this is not possible with CLS!!!)- Only needs the calibration information for one analyteDisadvantages- It does not use all the variables; only uses a reduced number of selected variables (sensors or wavelengths).- There is no increase of measurement precision (there is nosignal averaging)


Methods based on Factor Analysis• Factor decomposition of matrix X• Resolve the colinearity problem in X• Backgroundd noise filtering• Improve precision (signal averaging)• ‘Compression’ of the information in a reduced number

of new variables o factors


- as in PCA- linearization, if possible- mean centering- variance scaling, when the variablesare in different units or differconsiderably in magnitude

- outliers elimination is critical

Pretreatment Methods in Multivariate Regression/Calibration


Principal Component Regression PCR

1. Descomposition in factors of X by PCAX = T PT + E

ns,nw ns,nc nc,nw (ns,nw)

2. Multilinear regression (MLR) on thescores T (instead of on original variables in X)

y = T bns,1 ns,nc nc,1



3. Evaluation of the regression vector b

b = T+ y = (TTT)-1TT y

the scores (PCA) are orthogonals

(TTT)-1 = diag(1/λi), i =1,...,nc



4. Prediction of a new sample with response xT

unkscore of the new sample: tT

unk= xTunkP

1,nc 1,nw nw,nc

prediction of its concentration yunk = tT b1,1 1,nc nc,1



Direct calculation ycal = X bcal

bcal = X+ ycal

X+ XPCA+ = (T PT)+ = P (TTT)-1TT

PT orthonormal P PT = I (PT)-1 = P

T orthogonal TT T = diag(λi)yunk = xT

unk bcal1,1 1,nw nw,1


PCAX T

PT

ns

nw

ns

nc

nc

nw

y T

1

ns ns

nc

MLR

bnc

1PCR = PCA + MLR

nc<<nwreduction of thenumber of variables!!!

Principal Components Regression, PCRPCR is one of the ways to solve the inversion of matrices ill-conditioned in linear regressionThe property of interest is modelled (regressed) on the PCA scores:

y = TkbPC + e = XPkbPC + ebPC = X(Tk

TTk)-1TkTy

(regression vector from PCA ‘sores’)b = PkbPC ; y = Xb(regression vector from the original variables) X+ = Pk(Tk

TTk)-1TkT

(la inverse matrix is calculated from orthogonal matrices)


Possible problems with PCRSome of the PCs cannot be relevant for the prediction of y, only are relevant for the description of XThe PCs are estimated without considering the prperty to predict ySolution: find the components using the information in y, not only the information in PLSThe number of components has to be estimated using validation methods.Diagnostics are used to find out outliers (Q residuals, T2

values, leverage plots)


Partial Least Squares Regression PLSR

The responses matrix X is decomposed and ‘truncated’in a similar way to PCR, but in the decomposition the information of y in the calibration samples is considered

X = T PT + EY = U QT + F

u = y (in the case of a single component)

X T U Yvariance covariance variance

W


CLS: X = f(y)

ILS,MLR y = f(X)

x1x2x3x4x5

x1x2x3x4x5 x1

x2x3x4x5

x1x2x3x4x5

y

y

y

y

t1

t2

t1t2

PCR = PCA ++ MLR

PLSR

PCA

Partial Least Squares, PLS•PLS is a mixture of PCR and MLR•PCR captures the maximum variance in X•MLR gets maxima correlation between X and y•PLS tries both things: makes maximum the covariance•PLS requires an additional matrix of weights W to keep the orthogonality of the scores and makes easier matrix inversion•The factors are evaluated sequentially by projection of y on X•The expression to evaluate the matrix inverse is more complex than for PCR:

X+ = Wk(PkTWk)-1(Tk

TTk)-1TkT


Comparison of predictive (inverse) linear calibration modelsThe different methods differ in the way they resolve the same equation of calibration (calculation of the inverse matrix of R):

b = X+y MLR (ILS) X+ = (XTX)-1XT

Maximum correlation between X and y is achieved, but the direct inversion of X is problematic

PCR X+ = Pk(TkTTk)-1Tk

T

Maximum variance in X is achieved

PLSR X+ = Wk(PkTWk)-1(Tk

TTk)-1TkT

Maximum covariance between X and y is achieved

Inverses in PCR i PLS are calculated from orthogonal matrices A-1 A= D (diagonal); orthonormal A-1A=ATA=I

Calibration step:

Calculation of the Model

Xcalycal

ModelObtain nr. components

Validation step:

Test the model

Xvalyval

Test del ModelTest nr. components

Validation of the calibration models (regression)


Validation of the model; with the same calibration samples

Xcal + Model

Model Error : - ycal

residual variance (calibration)

RMSEC (Root Mean SquareError of Calibration)

ˆ caly

ˆ calyˆ∑

ns2

i,cal i,cali=1

(y -y )

ns

ˆ∑ns

2i,cal i,cal

i=1

(y -y )

ns


Validation of the Model; with new validation samples

Xval + Model

Model Error : - yval

residual variance (prediction)

RMSEP (Root Mean SquareError of Prediction)

ˆ valy

ˆ valyˆ∑

ns2

i,val i,vali=1

(y -y )

ns

ˆ∑ns

2i,val i,val

i=1

(y -y )

ns


Validation of the calibration model

The ability of the calibration model to predict has to beevaluated using a new sample set not used in thedevelopment of the calibration model:

Samples:

Training set

Test set

Calibration of the Model Xcal ycal

fcalCalculation of RMSEC

Validation of the ModelXval

fvalCalculation of RMSEP

ˆ valy


Validation Methods1) With a calibration set of samples and a different

validation set of samples. Both data sets should representative. It is the best method.

2) Cross-Validation2A) Two groups of data X and y

A

B

calibrationvalidation

validationcalibration

X yXA

XB

yA

yB

Prediction error is evaluated for A and B and the average is evaluated


2B) Full cross-validation or leave-one-out validation

The same number of PCA models as samples are. Successively, one sample is removed, a new PCA model is built and the left out sample is predicted. This is repeated for every sample and the prediction error is calculated.

2C) Segmented Validation (for small groups of samples, i.e. 10 % of samples)

3) Leverage correction (leverage, hi)

Residuals fi = yval - are weighted according to hi

ficorr = fi / (1-hi)

ˆ valy


∑2ni,k

ikk=1

t1h = +ns λ

hi values are between 0 and 1

samples with low ‘leverage’ hi 0 ficor fi

samples with high ‘leverage’ hi 1 ficor >> fi

This procedure uses only the calibration samples dada; it gives a first approximation of the future prediction ability.

where


Cross-Validation

Data are divided in q subsetsBuilt the model with the q-1 subsetsCalculate PRESS (predictive Residual Sum of Squares

Repeat until all the groups have been left out one time. Find the minimum (or the inflexion point) in the the plot of PRESS vs nr. components

2ij ij

i j

ˆPRESS (y y )= −∑∑


1 2 3 4 5 6 7 8 9 100

50

100

150

200

250

300

350

400

450

500

Determination of the number of componentsusing PRESS PLOTS

number of components

cum

ulat

ive

PRES

S

5 components


Model evaluation and validation2

ij iji j

2ij ij

i j

2ij ij

i j

ij iji j

2ij ij

i j2

iji j

ˆPRESS (y y )

ˆ(y ÿ )RMSEP

n.samples

ˆ(y y bias)SEP

n.samples 1ˆ(y y )

biasn.samples

ˆ(y y )RE 100

(y )

= −

−=

− −=

−

−=

−=

∑∑

∑∑

∑∑

∑∑

∑∑∑∑

Prediction Error Sum ofSquares, PRESS

Root Mean Square Error inPrediction

Standard Error in Prediction

bias

Relative Error


• Comparison of experimental values versus model predicted values– for the calibration samples– for the external validation samples

• Plot and calculate regression line of predicted versus actual values:predicted values = slope x experimental values + offsetslope should be one, offset should be zero and r2 = 1

Model evaluation and validation


Partial Least Squares Regression PLSR modelsX loadings

X = T PT + EY = U QT + F

Y loadings

u = y (in the case of a single component)

X scores Y scoresX T U Yvariance covariance variance

W (weights)

Multivariate Linear RegressionPLS model interpretation

Interpretation of PLS models.• Physical interpretation of PLS models can be obtained

from plots of scores and loadings like in PCA. • More interestingly PLS models provide the weights (Wk)

which describe the covariance structure between X and y blocks. Plot of the weights are extremely useful for PLS models interpretation

X+ = Wk(PkTWk)-1(TkTk

T)-1TkT

b=X+yy=Xb

• Other measures exist like the variable influence (importance) on projection, VIP, parameter. Plot of VIPs are also very useful for PLS model interpretation


Variable influence (importance) on projection, VIP, parameter


∑=

− −−=

A

a aoaaakAk SSYSSY

KSSYSSYwVIP1

12

)()((

A total number of factors considered in the model a considered factor

k considered variable

The variables with larger VIPs (larger than one) are more influential and important for explaining Y

Multivariate Data Analysis - IMEDEA · 2014. 5. 9. · Univariate and Multivariate Data Univariate • Only one measurement per sample (pH, Absorbance, peak height or area) • The

Documents