Tutorials for multivariate analysis on NIR and Raman spectrum
I. PCA analysis on NIR spectrum
1. Data set
The data set consists of raw, unpreprocessed NIR spectra of diesel fuels. A total of 784 diesel samples were measured in absorbance mode, and the spectra cover the region from 750 to 1550 nm in 2 nm increments.
[Figure: absorbance vs. wavelength (nm)]
2. Analysis routine for PCA
The analysis routine is explained using the example data set. A PCA model is usually constructed in the following order.
A. Load data
Data should be loaded from Excel, CSV, text, or another file format. The provided file is in *.csv format.
B. Edit or Define data
- Assign an observation ID and a variable label to the imported data set.
- Define the category of each variable (e.g., X (independent) vs. Y (dependent)). In PCA, there is only an X dataset.
- Exclude or include each variable that will be used during construction of the PCA model.
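Steps A and B can be sketched in Python as follows. This is only an illustration under stated assumptions: the column layout (observation ID in the first column, one wavelength per remaining column), the function name `load_spectra`, and the small in-memory table are hypothetical and are not taken from the provided file.

```python
import csv
import io
import numpy as np

def load_spectra(handle):
    """Split a CSV table into observation IDs, variable labels and X.

    Assumed (hypothetical) layout: a header row holding the wavelengths,
    then one row per sample with the observation ID in the first column.
    """
    rows = list(csv.reader(handle))
    header, body = rows[0], rows[1:]
    wavelengths = np.array([float(v) for v in header[1:]])   # variable labels
    obs_ids = [row[0] for row in body]                       # observation IDs
    X = np.array([[float(v) for v in row[1:]] for row in body])
    return obs_ids, wavelengths, X

# Made-up three-channel table standing in for the real *.csv file
demo_file = io.StringIO(
    "id,750,752,754\n"
    "s1,0.10,0.12,0.11\n"
    "s2,0.20,0.21,0.19\n"
)
obs_ids, wavelengths, X = load_spectra(demo_file)

# Include/exclude variables with a boolean mask (here: drop the 752 nm channel)
keep = wavelengths != 752
X_used = X[:, keep]
```

For a file on disk, the handle would come from `open(path, newline="")`. In PCA every retained column belongs to X, so no Y block is separated here.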
C. Preview data
- Data should be previewed using histogram or time-series plot in order to check the statistical properties of each variable visually.
[Figure: Histogram for variable #1 (frequency vs. X) and time-series for variable #1 (absorbance vs. observation ID)]
- Based on the preview results, data can be edited or some abnormal data can be excluded from the analysis.
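A minimal sketch of the numbers behind such a preview, assuming the spectra are already in a samples-by-variables NumPy array. The function name `preview_variable` and the random data are hypothetical; plotting the returned histogram counts and the column itself against observation ID is left to whichever plotting library is in use.

```python
import numpy as np

def preview_variable(X, j, bins=10):
    """Return histogram counts, bin edges and summary statistics for column j."""
    col = X[:, j]
    counts, edges = np.histogram(col, bins=bins)
    summary = {
        "mean": float(col.mean()),
        "std": float(col.std(ddof=1)),
        "min": float(col.min()),
        "max": float(col.max()),
    }
    return counts, edges, summary

# Random numbers standing in for 784 samples; not the diesel data
rng = np.random.default_rng(0)
X_demo = rng.normal(0.05, 0.02, size=(784, 5))

counts, edges, summary = preview_variable(X_demo, 0)
series = X_demo[:, 0]   # values in observation order, i.e. the time-series view
```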
D. Preprocessing (this is only an example; other methods can be used depending on the characteristics of the data set, noise level, etc.)
- First, baseline subtraction is performed using a polynomial fitting method.
- After baseline subtraction, first-derivative spectra are computed using the Savitzky-Golay algorithm.
- Finally, each variable is scaled by subtracting its mean value (this is called mean-centering). In other cases, auto-scaling (subtracting the mean value and then dividing by the standard deviation) can be used.
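The three preprocessing steps above could look like the sketch below. It rests on assumptions the tutorial does not state: the baseline is a single low-order polynomial fitted to the whole spectrum, and the polynomial degree, Savitzky-Golay window and order are arbitrary illustrative values. The function names and the synthetic spectra are hypothetical.

```python
import numpy as np
from scipy.signal import savgol_filter

def subtract_poly_baseline(X, wavelengths, degree=2):
    """Fit one polynomial per spectrum (row) and subtract it."""
    span = wavelengths.max() - wavelengths.min()
    w = 2.0 * (wavelengths - wavelengths.min()) / span - 1.0   # scale axis to [-1, 1]
    coefs = np.polynomial.polynomial.polyfit(w, X.T, degree)   # (degree+1, n_samples)
    baseline = np.polynomial.polynomial.polyval(w, coefs)      # (n_samples, n_variables)
    return X - baseline

def first_derivative(X, window=11, polyorder=2, delta=2.0):
    """Savitzky-Golay first derivative along the wavelength axis."""
    return savgol_filter(X, window_length=window, polyorder=polyorder,
                         deriv=1, delta=delta, axis=1)

def mean_center(X):
    """Subtract the column mean from every variable."""
    mean = X.mean(axis=0)
    return X - mean, mean

# Synthetic spectra on the 750-1550 nm grid with 2 nm steps (401 channels)
wavelengths = np.arange(750.0, 1551.0, 2.0)
rng = np.random.default_rng(0)
peak = np.exp(-0.5 * ((wavelengths - 1200.0) / 30.0) ** 2)
heights = rng.uniform(0.5, 1.0, size=(20, 1))
offsets = rng.uniform(0.0, 0.2, size=(20, 1))
X_raw = heights * peak + offsets

X_base = subtract_poly_baseline(X_raw, wavelengths, degree=2)
X_deriv = first_derivative(X_base)
X_centered, x_mean = mean_center(X_deriv)
```

A whole-spectrum polynomial fit also absorbs part of broad peaks, so in practice the baseline is often fitted only to peak-free regions. The stored `x_mean` is what a later prediction set would be centered with.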
o The contribution plot should allow the spectrum of a single sample to be brought up by double-clicking a contribution.
- If any outlying samples are identified using the above plots, the corresponding samples are excluded from the data set and the PCA model is reconstructed.
G. Prediction using new data set
- The prediction set should be defined as either part of the training data or a secondary data set. The program should be able to handle either case. The following plots should be available:
1. Score prediction
2. Hotelling’s T2 prediction
3. SPE or DModX
4. Contribution plot
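The quantities behind these prediction plots can be computed as in the sketch below. It is an illustration, not the program's implementation: PCA is done by SVD of the mean-centered training data, SPE is the sum of squared residuals per sample, and the per-variable residuals are used as a simple contribution measure. DModX and control limits are not computed. The function names and the random data are hypothetical.

```python
import numpy as np

def fit_pca(X, n_components):
    """PCA by SVD; keeps the mean, loadings and score variances."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_components].T                                   # variables x components
    score_var = s[:n_components] ** 2 / (X.shape[0] - 1)      # variance of each score
    return {"mean": mean, "P": P, "score_var": score_var}

def predict_pca(model, X_new):
    """Scores, Hotelling's T2, SPE and residuals for new samples."""
    Xc = X_new - model["mean"]
    T = Xc @ model["P"]                                       # predicted scores
    t2 = np.sum(T ** 2 / model["score_var"], axis=1)          # Hotelling's T2
    residual = Xc - T @ model["P"].T                          # part not explained by the model
    spe = np.sum(residual ** 2, axis=1)                       # squared prediction error
    return T, t2, spe, residual

# Random numbers standing in for preprocessed training and prediction sets
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 10))
X_new = rng.normal(size=(5, 10))

model = fit_pca(X_train, n_components=3)
T_new, t2_new, spe_new, resid_new = predict_pca(model, X_new)

# Contribution of each variable to the SPE of the first new sample
spe_contribution = resid_new[0] ** 2
```

The same `predict_pca` call works whether the prediction set is a held-out part of the training data or a secondary data set, as long as it has been preprocessed identically.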
II. PLS analysis on Raman spectrum
1. Data set
This data set consists of 80 corn samples measured with an NIR spectrometer (for this tutorial, treat it as Raman data). The wavelength range is 1100-2498 nm at 2 nm intervals (700 channels). The protein value of each sample is also included.
[Figure: absorbance vs. wavelength (nm), and protein concentration vs. observation ID]
2. Analysis routine for PLS
The analysis routine is explained using the example data set. In this case, X (independent) is the NIR spectrum and Y (dependent) is the protein concentration in corn. A PLS model is usually constructed in the following order.
A. Load data
Data should be loaded from Excel, CSV, text, or another file format. The provided files are in *.xls format.
B. Edit or Define data
- Assign an observation ID and a variable label to the imported data set.
- Define the category of each variable (e.g., X (independent) vs. Y (dependent)).
- Exclude or include each variable that will be used during construction of the PLS model.
C. Preview data
- Data should be previewed using histogram or time-series plot in order to check the statistical properties of each variable visually.
[Figure: Histogram for variable #1 (frequency vs. X) and time-series for variable #1 (absorbance vs. observation ID)]
- Based on the preview results, data can be edited or some abnormal data can be excluded from the analysis.
D. Preprocessing (this is only an example; other methods can be used depending on the characteristics of the data set, noise level, etc.)
- First, each spectrum in X is smoothed using the Savitzky-Golay algorithm.
- After smoothing, the spectra are preprocessed using SNV (standard normal variate) to remove baseline drift and scattering effects.
- Finally, each variable in X and Y is auto-scaled for normalization.
[Figure: smoothed spectra, SNV treated, and autoscaled X (absorbance vs. wavelength (nm)); autoscaled Y (protein vs. observation ID)]
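The preprocessing chain for the PLS example (smoothing, SNV, auto-scaling) can be sketched as follows. The Savitzky-Golay window and polynomial order are assumed values, the function names are hypothetical, and the arrays are random numbers with the same shape as the corn data (80 samples, 700 channels), not the data themselves.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth(X, window=11, polyorder=2):
    """Savitzky-Golay smoothing of each spectrum (row)."""
    return savgol_filter(X, window_length=window, polyorder=polyorder, axis=1)

def snv(X):
    """Standard normal variate: center and scale each spectrum (row)."""
    row_mean = X.mean(axis=1, keepdims=True)
    row_std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - row_mean) / row_std

def autoscale(A):
    """Center each variable (column) and divide by its standard deviation."""
    mean = A.mean(axis=0)
    std = A.std(axis=0, ddof=1)
    return (A - mean) / std, mean, std

# Random stand-ins with the shape of the example: 80 samples x 700 channels
rng = np.random.default_rng(1)
X_raw = rng.normal(0.5, 0.1, size=(80, 700))
y_raw = rng.normal(0.0, 1.0, size=80)

X_smooth = smooth(X_raw)
X_snv = snv(X_smooth)
X_auto, x_mean, x_std = autoscale(X_snv)
y_auto, y_mean, y_std = autoscale(y_raw)
```

Note the different directions: SNV works row by row (per spectrum), whereas auto-scaling works column by column (per variable). The stored means and standard deviations are needed to preprocess a prediction set and to convert predicted Y back to original units.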
E. Compute PLS model
- The PLS model is constructed using the preprocessed X and Y datasets.
- The optimal number of latent variables (LVs) is chosen automatically using cross-validation.
- In this example, 6 LVs are optimal because the RMSECV plot shows a knee at 6 LVs (alternatively, Q2 can be used).
[Figure: RMSECV vs. Latent Variable Number; SIMPLS Variance Captured and Statistics for X]
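An RMSECV curve of this kind can be produced as in the sketch below. It makes several assumptions not taken from the tutorial: a single Y variable, the NIPALS form of PLS1 rather than the SIMPLS algorithm named in the figure, random 5-fold cross-validation, and mean-centering inside each fold. The function names and the synthetic data (three underlying factors) are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """NIPALS PLS1 on centered X and y; returns the regression vector."""
    Xk = np.array(X, dtype=float)
    yk = np.array(y, dtype=float)
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)            # unit-length weight vector
        t = Xk @ w                        # scores
        tt = t @ t
        p = Xk.T @ t / tt                 # X loadings
        qa = yk @ t / tt                  # y loading
        Xk = Xk - np.outer(t, p)          # deflate X
        yk = yk - qa * t                  # deflate y
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def rmsecv(X, y, max_lv, n_folds=5, seed=0):
    """Root mean square error of cross-validation for 1..max_lv LVs."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    press = np.zeros(max_lv)
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        Xtr, ytr = X[train] - x_mean, y[train] - y_mean
        for a in range(1, max_lv + 1):
            b = pls1_fit(Xtr, ytr, a)
            pred = (X[test] - x_mean) @ b + y_mean
            press[a - 1] += np.sum((y[test] - pred) ** 2)
    return np.sqrt(press / n)

# Synthetic data driven by three factors of very different size
rng = np.random.default_rng(0)
T_true = rng.normal(size=(60, 3))
L = rng.normal(size=(3, 30)) * np.array([[5.0], [1.0], [0.2]])
X = T_true @ L + 0.01 * rng.normal(size=(60, 30))
y = T_true @ np.array([1.0, 1.0, 1.0]) + 0.05 * rng.normal(size=60)

curve = rmsecv(X, y, max_lv=6)   # curve[a-1] is the RMSECV with a LVs
```

Plotting `curve` against the number of LVs gives the kind of plot in which a knee is looked for. Picking the knee automatically needs an extra rule (for example, the first LV after which the improvement falls below a threshold), which is not shown here.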
F. Analysis of computed PLS model
- The following plots can be used to examine the constructed PLS model:
o Contribution plot. The contribution plot should allow the spectrum of a single sample to be brought up by double-clicking a contribution.
o Regression coefficient plot
o VIP score plot
o Y measured vs. predicted plot
o Residual plot
- If any outlying samples are identified using the above plots, the corresponding samples are excluded from the data set and the PLS model is reconstructed.
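The numbers behind the regression coefficient, VIP score, measured-vs-predicted and residual plots can be obtained as sketched below. The sketch assumes a single Y variable, NIPALS PLS1, and the commonly used VIP definition based on unit-length weight vectors and the Y variance explained by each LV. The function names and the synthetic data (Y depends only on variables 3 and 7) are hypothetical.

```python
import numpy as np

def pls1_nipals(X, y, n_lv):
    """NIPALS PLS1 with internal mean-centering; returns W, T, q and b."""
    Xk, yk = X - X.mean(axis=0), y - y.mean()
    n, m = X.shape
    W, P = np.zeros((m, n_lv)), np.zeros((m, n_lv))
    T, q = np.zeros((n, n_lv)), np.zeros(n_lv)
    for a in range(n_lv):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        p = Xk.T @ t / (t @ t)
        q[a] = yk @ t / (t @ t)
        Xk = Xk - np.outer(t, p)
        yk = yk - q[a] * t
        W[:, a], P[:, a], T[:, a] = w, p, t
    b = W @ np.linalg.solve(P.T @ W, q)        # regression coefficients
    return W, T, q, b

def vip_scores(W, T, q):
    """Variable importance in projection for a PLS1 model."""
    ss = q ** 2 * np.sum(T ** 2, axis=0)       # Y sum of squares explained per LV
    n_vars = W.shape[0]
    return np.sqrt(n_vars * (W ** 2 @ ss) / ss.sum())

# Synthetic data: only variables 3 and 7 carry information about y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 3] - 1.0 * X[:, 7] + 0.1 * rng.normal(size=200)

W, T, q, b = pls1_nipals(X, y, n_lv=2)
vip = vip_scores(W, T, q)

y_pred = (X - X.mean(axis=0)) @ b + y.mean()   # Y predicted, to plot against measured y
residuals = y - y_pred                         # values for the residual plot
```

With this definition the mean of the squared VIP scores equals 1, which is why VIP > 1 is a widely used rule of thumb for flagging important variables.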
- Alternative options:
o To evaluate the computed PLS model, a permutation analysis can be performed by randomly permuting a given percentage of the rows and then comparing the resulting Q2 value with that of the original PLS model.
o To select informative regions in the original spectra, variable selection methods (e.g., VIP, UVE, GA) can be applied to the dataset. This feature is not supported by other commercial software packages, such as SIMCA-P and Unscrambler.
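The permutation analysis can be sketched as follows. This is one possible reading of the description, with assumptions labeled: the rows of Y (not X) are shuffled, Q2 is computed as 1 - PRESS/SS from 5-fold cross-validation, and PLS1 is fitted by NIPALS with a fixed number of LVs. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_coef(X, y, n_lv):
    """NIPALS PLS1 on centered data; returns the regression vector."""
    Xk, yk = np.array(X, dtype=float), np.array(y, dtype=float)
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        p, qa = Xk.T @ t / tt, yk @ t / tt
        Xk, yk = Xk - np.outer(t, p), yk - qa * t
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def q2_cv(X, y, n_lv, n_folds=5, seed=0):
    """Cross-validated Q2 = 1 - PRESS / total sum of squares of y."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    press = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        b = pls1_coef(X[train] - x_mean, y[train] - y_mean, n_lv)
        pred = (X[test] - x_mean) @ b + y_mean
        press += np.sum((y[test] - pred) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def permute_fraction(y, fraction, rng):
    """Shuffle the y values of a randomly chosen fraction of the rows."""
    y_perm = y.copy()
    k = int(round(fraction * y.size))
    idx = rng.choice(y.size, size=k, replace=False)
    y_perm[idx] = y[rng.permutation(idx)]
    return y_perm

# Synthetic data with a real X-y relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=100)

q2_orig = q2_cv(X, y, n_lv=2)
q2_perm = [q2_cv(X, permute_fraction(y, 1.0, rng), n_lv=2) for _ in range(20)]
```

If the original model reflects a real relationship, `q2_orig` should sit clearly above the distribution of `q2_perm`; permuted values close to the original one suggest overfitting.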
G. Prediction using new data set
1. Score prediction
2. Hotelling’s T2 prediction
3. SPE or DModX
4. Contribution plot
5. Y measured vs. predicted plot
6. Residual plot
Revision 1.0 Haewoo Lee (Draft)
Revision 2.0 Seongkyu Yoon (edited a few sentences)