Tutorials for Multivariate Analysis on NIR and Raman Spectrum R2


I. PCA analysis on NIR spectrum

1. Data set

The data set consists of raw, unpreprocessed NIR spectra of diesel fuels. A total of 784 diesel samples were measured in absorbance mode, and the spectra cover the region from 750 to 1550 nm in 2 nm increments.

[Figure: NIR spectra; x-axis: Wavelength (nm); y-axis: absorbance]

2. Analysis routine for PCA

The analysis routine is explained using the example data set. A PCA model is usually constructed in the following order.

A. Load data

Data should be loaded from Excel, CSV, text, or another file format. The provided file is in *.CSV format.

B. Edit or Define data

- Assign an observation ID and a variable label to the imported data set.

- Define the category of each variable (e.g., X (independent) vs. Y (dependent)). In PCA, there is only an X data set.

- Exclude or include each variable to control which ones are used in constructing the PCA model.

C. Preview data

- Data should be previewed using a histogram or time-series plot in order to check the statistical properties of each variable visually.

[Figure: Histogram for variable #1; x-axis: X; y-axis: frequency]

[Figure: time-series for variable #1; x-axis: observation ID; y-axis: absorbance]

- Based on the preview results, the data can be edited, or abnormal data can be excluded from the analysis.
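The preview step can also be approximated numerically. The sketch below is only an illustration under stated assumptions: it uses synthetic values rather than the diesel data, the helper name `preview_variable` is hypothetical, and the robust z-score rule (median and MAD with a limit of 3.5) is one common convention for flagging abnormal observations, not a rule taken from this tutorial.

```python
import numpy as np

def preview_variable(x, bins=20, z_limit=3.5):
    """Return histogram counts, bin edges, and indices of suspect observations."""
    x = np.asarray(x, dtype=float)
    counts, edges = np.histogram(x, bins=bins)
    median = np.median(x)
    # 1.4826 * MAD approximates the standard deviation for normal data.
    scale = 1.4826 * np.median(np.abs(x - median))
    if scale == 0:
        return counts, edges, np.array([], dtype=int)
    robust_z = (x - median) / scale
    suspects = np.flatnonzero(np.abs(robust_z) > z_limit)
    return counts, edges, suspects

# Synthetic stand-in for variable #1 across 784 observations (not the diesel data).
rng = np.random.default_rng(0)
x = rng.normal(loc=0.05, scale=0.01, size=784)
x[100] = 0.5  # one deliberately abnormal observation
counts, edges, suspects = preview_variable(x)
```

Here `counts` and `edges` hold what a histogram plot would draw, and `suspects` lists the observation indices worth inspecting before deciding whether to exclude them.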

D. Preprocessing (this is only an example; other methods can be used, depending on the characteristics of the data set, noise level, etc.)

- First, baseline subtraction is performed using a polynomial fitting method.

- After baseline subtraction, first-derivative spectra are computed using the Savitzky-Golay algorithm.

- Finally, each variable is scaled by subtracting its mean value (this is called mean-centering). In other cases, auto-scaling (subtracting the mean value and then dividing by the standard deviation) can be used.
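The three preprocessing steps can be sketched with NumPy alone. This is a minimal illustration, not the tutorial's implementation: the function names, polynomial order, window length, and synthetic spectra are all assumptions. The baseline is fitted to the whole spectrum, which is the simplest variant of polynomial baseline subtraction, and the Savitzky-Golay filter pads the edges by repeating the end values.

```python
import numpy as np
from math import factorial

def subtract_poly_baseline(X, wavelengths, order=2):
    """Fit a polynomial to each spectrum (row of X) and subtract it."""
    # Rescale the axis to [-1, 1] so the least-squares fit is well conditioned.
    t = (wavelengths - wavelengths.mean()) / (np.ptp(wavelengths) / 2.0)
    V = np.vander(t, order + 1)
    coef, *_ = np.linalg.lstsq(V, X.T, rcond=None)
    return X - (V @ coef).T

def savgol(X, window=11, polyorder=2, deriv=0, delta=1.0):
    """Savitzky-Golay filter along each row; delta is the channel spacing."""
    half = window // 2
    k = np.arange(-half, half + 1, dtype=float)
    A = np.vander(k, polyorder + 1, increasing=True)
    # Row `deriv` of pinv(A) estimates the deriv-th polynomial coefficient.
    w = np.linalg.pinv(A)[deriv] * factorial(deriv) / delta**deriv
    Xp = np.pad(X, ((0, 0), (half, half)), mode="edge")
    return np.array([np.correlate(row, w, mode="valid") for row in Xp])

# Synthetic spectra on the 750-1550 nm axis (2 nm steps), not the diesel data.
rng = np.random.default_rng(0)
wavelengths = np.arange(750, 1552, 2)
peak = np.exp(-0.5 * ((wavelengths - 1200) / 30.0) ** 2)
X = (rng.uniform(0.5, 1.5, (20, 1)) * peak
     + rng.uniform(0.0, 0.2, (20, 1))
     + 1e-4 * rng.uniform(0.0, 1.0, (20, 1)) * (wavelengths - 750))

Xb = subtract_poly_baseline(X, wavelengths, order=1)          # baseline subtraction
Xd = savgol(Xb, window=11, polyorder=2, deriv=1, delta=2.0)   # first derivative
Xc = Xd - Xd.mean(axis=0)                                     # mean-centering
```

For auto-scaling, `Xc` would additionally be divided by the column-wise standard deviation of `Xd`.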

[Figure: baseline subtracted spectra; x-axis: wavelength (nm); y-axis: absorbance]

[Figure: First derivative spectra; x-axis: wavelength (nm); y-axis: absorbance]

[Figure: mean-centered data; x-axis: wavelength (nm); y-axis: absorbance]

E. Compute PCA model

- The PCA model is constructed using the preprocessed data set.

- The optimal number of principal components (PCs) is chosen automatically using cross-validation.

- In this example, 7 PCs are optimal because RMSECV is at its minimum at 7 PCs (alternatively, Q2 is at its maximum at 7 PCs).
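Cross-validating a PCA model needs a way to predict held-out values, and several schemes exist. The sketch below assumes one of them: samples are split row-wise into folds, and for each held-out sample every variable is predicted in turn from the remaining variables using loadings fitted on the other folds. The function name, fold assignment, and synthetic data are hypothetical; this is not necessarily the scheme that produced the RMSECV curve in this tutorial.

```python
import numpy as np

def pca_rmsecv(X, max_pcs, n_folds=5):
    """RMSECV for 1..max_pcs components (row-wise folds, one variable left out at a time)."""
    n, m = X.shape
    folds = np.arange(n) % n_folds
    press = np.zeros(max_pcs)
    for f in range(n_folds):
        train, test = X[folds != f], X[folds == f]
        mu = train.mean(axis=0)                      # centering uses training rows only
        _, _, Vt = np.linalg.svd(train - mu, full_matrices=False)
        Z = test - mu
        for k in range(1, max_pcs + 1):
            P = Vt[:k].T                             # loadings, shape (m, k)
            for j in range(m):
                keep = np.arange(m) != j
                # Scores of the held-out rows estimated without variable j.
                T, *_ = np.linalg.lstsq(P[keep], Z[:, keep].T, rcond=None)
                press[k - 1] += np.sum((Z[:, j] - P[j] @ T) ** 2)
    return np.sqrt(press / (n * m))

# Synthetic data with two real components plus noise (not the diesel spectra).
rng = np.random.default_rng(0)
scores = rng.normal(size=(40, 2)) * np.array([3.0, 2.0])
loadings = rng.normal(size=(2, 10))
X = scores @ loadings + 0.05 * rng.normal(size=(40, 10))

rmsecv = pca_rmsecv(X, max_pcs=5)
best = int(np.argmin(rmsecv)) + 1   # number of PCs at the RMSECV minimum
```

Picking the minimum mirrors the rule stated above; for full spectra the inner loop over variables becomes expensive, so a production implementation would normally use a faster formulation.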

[Figure: Eigenvalues and Cross-validation Results for X; x-axis: Principal Component Number; y-axis: RMSECV]

F. Analysis of the computed PCA model

- The following plots can be used to examine the constructed PCA model:

  - Score plot (1D/2D/3D)

  - Loading plot (1D/2D/3D)

  - Hotelling's T2 plot

  - SPE or dModX plot

  - Contribution plot

    o Double-clicking a contribution in the contribution plot brings up the spectral data of the corresponding single sample.

- If any outlying samples are identified using the above plots, the corresponding samples are excluded from the data set and the PCA model is reconstructed.
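The quantities behind the Hotelling's T2, SPE, and contribution plots can be computed directly from a fitted PCA model. The following is a sketch under assumptions: the helper names are hypothetical, the data are synthetic with one spike planted in sample 5, and the SPE contribution of a variable is taken as its squared residual, which is one common definition.

```python
import numpy as np

def fit_pca(X, n_pcs):
    """PCA by SVD of the mean-centered data."""
    mu = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    P = Vt[:n_pcs].T                                  # loadings (variables x PCs)
    score_var = s[:n_pcs] ** 2 / (X.shape[0] - 1)     # variance of each score
    return mu, P, score_var

def diagnostics(X, mu, P, score_var):
    """Scores, Hotelling's T2, SPE, and per-variable SPE contributions."""
    Z = X - mu
    T = Z @ P
    t2 = np.sum(T**2 / score_var, axis=1)
    E = Z - T @ P.T                                   # residuals outside the model
    spe = np.sum(E**2, axis=1)
    return T, t2, spe, E**2

# Synthetic two-component data with one abnormal sample (not the diesel spectra).
rng = np.random.default_rng(0)
X = (rng.normal(size=(50, 2)) * np.array([3.0, 2.0])) @ rng.normal(size=(2, 8))
X += 0.05 * rng.normal(size=X.shape)
X[5, 3] += 5.0                                        # planted spike

mu, P, score_var = fit_pca(X, n_pcs=2)
T, t2, spe, contrib = diagnostics(X, mu, P, score_var)
worst = int(np.argmax(spe))                           # sample with the largest SPE

# Rebuild the model without the flagged sample, as described above.
X_clean = np.delete(X, worst, axis=0)
mu2, P2, score_var2 = fit_pca(X_clean, n_pcs=2)
```

Control limits for T2 and SPE are not shown; they depend on distributional assumptions that this tutorial does not specify.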

G. Prediction using new data set

- The prediction set should be defined as either part of the training data or a secondary data set; the program should be able to handle either case. The following plots should be available:

1. Score prediction

2. Hotelling's T2 prediction

3. SPE or DModX

4. Contribution plot
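Prediction with a PCA model amounts to projecting the new samples onto the loadings obtained from the training set. The sketch below assumes the first option described above, holding out part of the training data as the prediction set; the data are synthetic and all variable names are illustrative. The key point is that the new samples are centered with the training mean, not their own.

```python
import numpy as np

# Synthetic two-component data (not the diesel spectra).
rng = np.random.default_rng(1)
scores = rng.normal(size=(60, 2)) * np.array([3.0, 2.0])
loadings = rng.normal(size=(2, 8))
X = scores @ loadings + 0.05 * rng.normal(size=(60, 8))

# Hold out part of the data as the prediction set.
X_train, X_new = X[:45], X[45:]

# Fit the PCA model on the training set only.
mu = X_train.mean(axis=0)
_, s, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
n_pcs = 2
P = Vt[:n_pcs].T
score_var = s[:n_pcs] ** 2 / (len(X_train) - 1)

# Apply the stored training parameters to the new samples.
Z_new = X_new - mu                                   # training mean, not the new mean
T_new = Z_new @ P                                    # score prediction
t2_new = np.sum(T_new**2 / score_var, axis=1)        # Hotelling's T2 prediction
E_new = Z_new - T_new @ P.T
spe_new = np.sum(E_new**2, axis=1)                   # SPE
contrib_new = E_new**2                               # per-variable SPE contributions
```

A secondary data set would be handled the same way, provided it has passed through the same preprocessing with the parameters learned from the training set.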

II. PLS analysis on Raman spectrum

1. Data set

This data set consists of 80 corn samples measured with an NIR spectrometer (for this tutorial, treat it as Raman data). The wavelength range is 1100-2498 nm at 2 nm intervals (700 channels). The protein value of each sample is also included.
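As a quick check of the stated dimensions, the wavelength axis can be rebuilt from the range and interval given above. The variable names are illustrative only.

```python
import numpy as np

# 1100 to 2498 nm inclusive, in 2 nm steps.
wavelengths = np.arange(1100, 2498 + 2, 2)
n_channels = wavelengths.size

# Expected shapes for this data set: X is samples x channels, Y is one value per sample.
n_samples = 80
x_shape = (n_samples, n_channels)
y_shape = (n_samples,)
```

The count works out to (2498 - 1100) / 2 + 1 = 700 channels, which matches the figure quoted in the text.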

[Figure: spectra of the corn samples; x-axis: wavelength (nm); y-axis: absorbance]

[Figure: protein concentration of each sample; x-axis: observation ID; y-axis: protein concentration]

2. Analysis routine for PLS

The analysis routine is explained using the example data set. In this case, X (independent) is the NIR spectrum and Y (dependent) is the protein concentration in corn. A PLS model is usually constructed in the following order.

A. Load data

Data should be loaded from Excel, CSV, text, or another file format. The provided files are in *.xls format.

B. Edit or Define data

- Assign an observation ID and a variable label to the imported data set.

- Define the category of each variable (e.g., X (independent) vs. Y (dependent)).

- Exclude or include each variable to control which ones are used in constructing the PLS model.

C. Preview data

- Data should be previewed using a histogram or time-series plot in order to check the statistical properties of each variable visually.

[Figure: Histogram for variable #1; x-axis: X; y-axis: frequency]

[Figure: time-series for variable #1; x-axis: observation ID; y-axis: absorbance]

- Based on the preview results, the data can be edited, or abnormal data can be excluded from the analysis.

D. Preprocessing (this is only an example; other methods can be used, depending on the characteristics of the data set, noise level, etc.)

- First, each spectrum in X is smoothed using the Savitzky-Golay algorithm.

- After smoothing, the spectra are preprocessed using SNV (standard normal variate) to remove baseline drift and scattering effects.

- Finally, each variable in X and Y is auto-scaled for normalization.
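The smoothing, SNV, and auto-scaling steps can be sketched as follows. This is an illustration under assumptions: function names, window length, and polynomial order are hypothetical choices, the spectra are synthetic, and the smoothing filter pads the edges by repeating end values. Note the difference in direction: SNV standardizes each spectrum (row), while auto-scaling standardizes each variable (column).

```python
import numpy as np

def savgol_smooth(X, window=11, polyorder=2):
    """Savitzky-Golay smoothing along each row (local polynomial least squares)."""
    half = window // 2
    k = np.arange(-half, half + 1, dtype=float)
    A = np.vander(k, polyorder + 1, increasing=True)
    w = np.linalg.pinv(A)[0]                 # weights for the fitted value at the center
    Xp = np.pad(X, ((0, 0), (half, half)), mode="edge")
    return np.array([np.correlate(row, w, mode="valid") for row in Xp])

def snv(X):
    """Standard normal variate: center and scale each spectrum (row)."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

def autoscale(A):
    """Center and scale each variable (column); also return the parameters."""
    mu = A.mean(axis=0)
    sd = A.std(axis=0, ddof=1)
    return (A - mu) / sd, mu, sd

# Synthetic stand-in with the stated shape: 80 samples x 700 channels.
rng = np.random.default_rng(0)
wavelengths = np.arange(1100, 2500, 2)
band = np.exp(-0.5 * ((wavelengths - 1900) / 80.0) ** 2)
X = (rng.uniform(0.8, 1.2, (80, 1)) * band          # multiplicative scatter
     + rng.uniform(0.0, 0.3, (80, 1))               # baseline offset
     + 0.01 * rng.normal(size=(80, 700)))           # noise
y = rng.normal(8.7, 0.5, size=80)                   # hypothetical protein values

Xs = savgol_smooth(X)
Xsnv = snv(Xs)
Xa, x_mu, x_sd = autoscale(Xsnv)
ya, y_mu, y_sd = autoscale(y)
```

Returning `x_mu`, `x_sd`, `y_mu`, and `y_sd` lets the same scaling be applied later to a prediction set and lets predicted Y values be converted back to the original units.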

[Figure: smoothed spectra; x-axis: wavelength (nm); y-axis: absorbance]

[Figure: SNV treated; x-axis: wavelength (nm); y-axis: absorbance]

[Figure: autoscaled X; x-axis: wavelength (nm); y-axis: absorbance]

[Figure: autoscaled Y; x-axis: observation ID; y-axis: protein]

E. Compute PLS model

- The PLS model is constructed using the preprocessed X and Y data sets.

- The optimal number of latent variables (LVs) is chosen automatically using cross-validation.

- In this example, 6 LVs are optimal because the RMSECV plot shows a knee at 6 LVs (alternatively, Q2 can be used).
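An RMSECV curve over the number of latent variables can be sketched as below. Several assumptions apply: the code uses the NIPALS algorithm for a single Y variable (PLS1) rather than SIMPLS, the folds are assigned by a simple interleaving rule, X is auto-scaled inside each fold, Y is only centered so that RMSECV is reported in the original Y units, and the data are synthetic. Function names are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """NIPALS PLS1 on centered X (n x m) and centered y (n,)."""
    X, y = X.copy(), y.copy()
    m = X.shape[1]
    W, P, q = np.zeros((m, n_lv)), np.zeros((m, n_lv)), np.zeros(n_lv)
    for a in range(n_lv):
        w = X.T @ y
        w /= np.linalg.norm(w)          # weight vector, unit length
        t = X @ w                       # scores
        tt = t @ t
        p = X.T @ t / tt                # X loadings
        q[a] = (y @ t) / tt             # y loading
        X -= np.outer(t, p)             # deflate X
        y -= q[a] * t                   # deflate y
        W[:, a], P[:, a] = w, p
    return W, P, q

def pls1_rmsecv(X, y, max_lv, n_folds=5):
    """RMSECV for 1..max_lv latent variables using interleaved folds."""
    n = X.shape[0]
    folds = np.arange(n) % n_folds
    press = np.zeros(max_lv)
    for f in range(n_folds):
        tr, te = folds != f, folds == f
        x_mu, x_sd = X[tr].mean(axis=0), X[tr].std(axis=0, ddof=1)
        y_mu = y[tr].mean()
        Xtr, Xte = (X[tr] - x_mu) / x_sd, (X[te] - x_mu) / x_sd
        W, P, q = pls1_fit(Xtr, y[tr] - y_mu, max_lv)
        for k in range(1, max_lv + 1):
            # Regression vector of the k-LV model: B = W (P'W)^-1 q.
            b = W[:, :k] @ np.linalg.solve(P[:, :k].T @ W[:, :k], q[:k])
            press[k - 1] += np.sum((y[te] - (Xte @ b + y_mu)) ** 2)
    return np.sqrt(press / n)

# Synthetic data driven by three latent factors (not the corn data).
rng = np.random.default_rng(0)
factors = rng.normal(size=(60, 3)) * np.array([5.0, 2.0, 1.0])
X = factors @ rng.normal(size=(3, 15)) + 0.01 * rng.normal(size=(60, 15))
y = factors @ np.array([0.5, 1.0, 2.0]) + 0.01 * rng.normal(size=60)

rmsecv = pls1_rmsecv(X, y, max_lv=6)
```

The number of LVs is then read from the curve, either at its minimum or at the knee where adding further LVs stops giving a meaningful reduction; the sketch leaves that judgment to the analyst.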

[Figure: SIMPLS Variance Captured and Statistics for X; x-axis: Latent Variable Number; y-axis: RMSECV]

F. Analysis of computed PLS model

- The following plots can be used to examine the constructed PLS model:

  - Score plot (1D/2D/3D)

  - Loading plot (1D/2D/3D)

  - Weight plot (1D/2D/3D)

  - Hotelling's T2 plot

  - SPE or dModX plot

  - Contribution plot

    o Double-clicking a contribution in the contribution plot brings up the spectral data of the corresponding single sample.

  - Regression coefficient plot

  - VIP score plot

  - Y measured vs. predicted plot

  - Residual plot

- If any outlying samples are identified using the above plots, the corresponding samples are excluded from the data set and the PLS model is reconstructed.
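The values shown in the regression coefficient, VIP score, Y measured vs. predicted, and residual plots can be derived from a fitted PLS model. The sketch below assumes a single Y variable fitted with NIPALS PLS1 and the commonly used VIP definition, in which each variable's squared weights are averaged over LVs in proportion to the Y variance each LV explains. The function names and the synthetic data, where only the first two variables carry signal, are illustrative.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """NIPALS PLS1 on centered X and y; returns weights, loadings, y loadings, scores."""
    X, y = X.copy(), y.copy()
    n, m = X.shape
    W, P = np.zeros((m, n_lv)), np.zeros((m, n_lv))
    q, T = np.zeros(n_lv), np.zeros((n, n_lv))
    for a in range(n_lv):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w
        tt = t @ t
        p = X.T @ t / tt
        q[a] = (y @ t) / tt
        X -= np.outer(t, p)
        y -= q[a] * t
        W[:, a], P[:, a], T[:, a] = w, p, t
    return W, P, q, T

def vip_scores(W, q, T):
    """Variable importance in projection for a single-y PLS model."""
    ss = q**2 * np.sum(T**2, axis=0)        # y sum of squares explained per LV
    m = W.shape[0]
    return np.sqrt(m * (W**2 @ ss) / ss.sum())

# Synthetic data: only variables 0 and 1 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=60)
Xc, yc = X - X.mean(axis=0), y - y.mean()

W, P, q, T = pls1_fit(Xc, yc, n_lv=2)
b = W @ np.linalg.solve(P.T @ W, q)         # regression coefficients
vip = vip_scores(W, q, T)
y_pred = Xc @ b + y.mean()                  # for the measured vs. predicted plot
residuals = y - y_pred                      # for the residual plot
```

By construction the squared VIP scores average to 1, which is why a threshold of 1 is often used to separate influential from less influential variables.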

- Alternative options:

To evaluate the computed PLS model, permutation analysis can be performed by randomly permuting a given percentage of the rows and then comparing the resulting Q2 value with that of the original PLS model.

To select informative regions in the original spectra, variable selection methods (e.g., VIP, UVE, GA) can be applied to the data set. This feature is not supported by other commercial software packages, such as SIMCA-P and Unscrambler.
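The permutation analysis can be sketched as follows. The interpretation is an assumption: the Y values of a chosen fraction of rows are shuffled among themselves, the model is re-validated, and the resulting Q2 values are compared with the Q2 of the unpermuted model, where Q2 = 1 - PRESS / TSS. The code uses NIPALS PLS1 with centering only, interleaved folds, and synthetic data; all function names are hypothetical.

```python
import numpy as np

def pls1_coef(X, y, n_lv):
    """Regression vector from NIPALS PLS1 on centered X and y."""
    X, y = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w
        tt = t @ t
        p = X.T @ t / tt
        qa = (y @ t) / tt
        X -= np.outer(t, p)
        y -= qa * t
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def q2_cv(X, y, n_lv, n_folds=5):
    """Cross-validated Q2 = 1 - PRESS / TSS."""
    folds = np.arange(len(y)) % n_folds
    press = 0.0
    for f in range(n_folds):
        tr, te = folds != f, folds == f
        x_mu, y_mu = X[tr].mean(axis=0), y[tr].mean()
        b = pls1_coef(X[tr] - x_mu, y[tr] - y_mu, n_lv)
        press += np.sum((y[te] - ((X[te] - x_mu) @ b + y_mu)) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def permutation_q2(X, y, n_lv, n_perm=20, fraction=1.0, seed=0):
    """Q2 values after shuffling the y values of a fraction of the rows."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_shuffle = int(round(fraction * n))
    out = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.choice(n, size=n_shuffle, replace=False)
        y_perm = y.copy()
        y_perm[idx] = y[rng.permutation(idx)]   # shuffle within the chosen rows
        out[i] = q2_cv(X, y_perm, n_lv)
    return out

# Synthetic data with a real X-Y relationship (not the corn data).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=60)

q2_orig = q2_cv(X, y, n_lv=2)
q2_perm = permutation_q2(X, y, n_lv=2, n_perm=20, fraction=1.0)
```

If the original Q2 is not clearly higher than the distribution of permuted Q2 values, the model is likely fitting chance correlations rather than a real relationship.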

G. Prediction using new data set

1. Score prediction

2. Hotelling's T2 prediction

3. SPE or DModX

4. Contribution plot

5. Y measured vs. predicted plot

6. Residual plot

Revision 1.0 Haewoo Lee (Draft)

Revision 2.0 Seongkyu Yoon (edited a few sentences)
