Krishna Rajan
Data Dimensionality Reduction: Introduction to Principal Component Analysis
Case Study: Multivariate Analysis of Chemistry-Property Data in Molten Salts
C. Suh1, S. Graduciz2, M. Gaune-Escard2, K. Rajan1
Combinatorial Sciences and Materials Informatics Collaboratory
1 Iowa State University; 2 CNRS, Marseilles, France
From a set of N correlated descriptors, we can derive a set of N uncorrelated descriptors (the principal components). Each principal component (PC) is a suitable linear combination of all the original descriptors. PCA thus reduces the dimensionality of vast arrays of data in a way that entails minimal loss of information.
(from Nature Reviews Drug Discovery 1, 882-894 (2002): "Integration of virtual and high throughput screening", Jürgen Bajorath; and Materials Today, October 2005: "Materials informatics", K. Rajan.)
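A minimal NumPy sketch of this idea (not from the original slides; the synthetic descriptors are hypothetical): correlated descriptors go in, uncorrelated principal components come out.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples of 3 descriptors, the first two correlated
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     2 * x1 + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])
Xc = X - X.mean(axis=0)                      # center each descriptor

# Principal components come from the covariance-matrix eigenproblem
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
P = eigvecs[:, np.argsort(eigvals)[::-1]]    # loadings, by decreasing variance
T = Xc @ P                                   # scores: the principal components

# The PCs are uncorrelated: their covariance matrix is diagonal
pc_cov = np.cov(T, rowvar=False)
print(np.allclose(pc_cov - np.diag(np.diag(pc_cov)), 0, atol=1e-9))  # True
```

Each column of T is one uncorrelated descriptor derived from all three original ones, exactly as described above.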
PRINCIPAL COMPONENT ANALYSIS: PCA
Functionality 1 = F(x1, x2, x3, x4, x5, x6, x7, x8, …)
Functionality 2 = F(x1, x2, x3, x4, x5, x6, x7, x8, …)
PC 1 = A1 X1 + A2 X2 + A3 X3 + A4 X4 + …
PC 2 = B1 X1 + B2 X2 + B3 X3 + B4 X4 + …
PC 3 = C1 X1 + C2 X2 + C3 X3 + C4 X4 + …
X1 = f(x2)
X2 = g(x3)
X3 = h(x4)
…
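In code, the coefficients A1, A2, … for PC 1 are simply the entries of the leading eigenvector (loading vector), so each PC really is a plain linear combination of the original variables. A hedged sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))                # hypothetical 4-descriptor data
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
A = eigvecs[:, np.argmax(eigvals)]           # coefficients A1..A4 of PC 1

# PC 1 = A1 X1 + A2 X2 + A3 X3 + A4 X4, evaluated for every sample
pc1_matrix = Xc @ A
pc1_termwise = (A[0]*Xc[:, 0] + A[1]*Xc[:, 1]
                + A[2]*Xc[:, 2] + A[3]*Xc[:, 3])
print(np.allclose(pc1_matrix, pc1_termwise))  # True
```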
The database of molten salts properties tabulates numerous properties for each chemistry:
• What can we learn beyond a "search and retrieve" function?
• Can we find multivariate correlation(s) among all chemistries and properties?
• The challenge of reducing the dimensionality of the data set
DIMENSIONALITY REDUCTION: Case study
Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.
The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
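This ordering can be checked numerically: each PC's variance is its eigenvalue, the eigenvalues are non-increasing, and together they account for all of the original variance. An illustrative sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))  # correlated data
S = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]           # PC variances

print(np.all(np.diff(eigvals) <= 0))           # True: non-increasing variance
print(np.isclose(eigvals.sum(), np.trace(S)))  # True: total variance preserved
```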
[Figure: bivariate scatter plots among the seven variables (V vs. D, Spe.con, Eq.con, Temp, MP, Eq.wt, etc.); the BiCl3 series stands out as a high-viscosity group.
Eq.wt: equivalent weight; MP: melting point; Temp: temperature of the measurements; Eq.con: equivalent conductance; Spe.con: specific conductance; D: density; V: viscosity]
Dimensionality Reduction of Molten Salts Data (Janz's Molten Salts Database: 1700 chemistries with 7 variables)
X1 = f(x2)
X2 = g(x3)
X3 = h(x4)
…
Melting point = F(x1, x2, x3, x4, x5, x6, x7, x8, …)
Density = F(x1, x2, x3, x4, x5, x6, x7, x8, …)
where xi = molten salt compound chemistries
Mathematically, PCA relies on the fact that most of the descriptors are interrelated and these correlations in some instances are high. It results in a rotation of the coordinate system in such a way that the axes show a maximum of variation (covariance) along their directions.
This description can be mathematically condensed to a so-called eigenvalue problem.
•The data manipulation involves decomposition of the data matrix X into two matrices T and P. The two matrices P and T are orthogonal. The matrix P is usually called the loadings matrix, and the matrix T is called the scores matrix.
•The eigenvectors of the covariance matrix constitute the principal components. The corresponding eigenvalues give a hint to how much "information" is contained in the individual components.
• The loadings can be understood as the weights for each original variable when calculating the principal component. The matrix T contains the original data in a rotated coordinate system.
• The mathematical analysis involves finding these new "data" matrices T and P. The rank of T (i.e., the number of components) needed to capture essentially all the information in the data set A is far less than the number of variables in X (ideally 2 or 3). One can then compress the N-dimensional plot of the data matrix X into a 2- or 3-dimensional plot of T and P.
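The decomposition of X into scores T and loadings P can be sketched directly (a synthetic 50 x 7 matrix stands in for real data):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 7))                 # stand-in data matrix
Xc = X - X.mean(axis=0)
eigvals, P = np.linalg.eigh(np.cov(Xc, rowvar=False))
P = P[:, np.argsort(eigvals)[::-1]]          # loadings matrix P
T = Xc @ P                                   # scores matrix T

# P is orthogonal, and T and P together reproduce X exactly
print(np.allclose(P.T @ P, np.eye(7)))       # True
print(np.allclose(Xc, T @ P.T))              # True
# Keeping only the first 2-3 columns of T and P gives the compressed plot
```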
The first principal component accounts for the maximum variance (eigenvalue) in the original dataset. The second, third (and higher-order) principal components are orthogonal (uncorrelated) to the first and account for most of the remaining variance.
• A new row space is constructed in which to plot the data, where the axes represent the weighted linear combinations of the variables affecting the data. Each of these linear combinations is independent of the others and hence orthogonal.
• The data plotted in this new space is essentially a correlation plot, where the position of each data point not only captures all the influences of the variables on that data but also its relative influence compared to the other data.
PC 1= A1 X1 + A2 X2 + A3 X3 + A4 X4 …….
PC 2 = B1 X1 + B2 X2 + B3 X3 +B4 X4 …….
PC 3 = C1 X1 + C2 X2 + C3 X3 + C4 X4 …….
[Scree plot: eigenvalue vs. PC1, PC2, PC3, PC4, PC5, … There is minimal contribution to additional information content beyond the first few principal components; the "scree" plot helps to identify the number of PCs needed to capture the reduced dimensionality. NB: depending upon the nature of the data set, this can be within 2, 3 or higher principal components, but still fewer than the number of variables in the original data set.]
Thus the mth PC is orthogonal to all others and has the mth largest variance in the set of PCs. Once the N PCs have been calculated using eigenvalue/eigenvector matrix operations, only PCs with variances above a critical level are retained (scree test).
The M-dimensional principal component space has retained most of the information from the initial N-dimensional descriptor space, by projecting it into orthogonal axes of high variance. The complex tasks of prediction or classification are made easier in this compressed, reduced dimensional space.
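A numeric stand-in for the visual scree test (hypothetical data in which two latent factors drive seven descriptors; `n_components_to_keep` is an illustrative helper, not from the slides):

```python
import numpy as np

def n_components_to_keep(X, frac=0.95):
    """Smallest number of PCs whose variances (eigenvalues) together
    retain `frac` of the total variance -- a numeric stand-in for
    eyeballing the scree plot."""
    Xc = X - X.mean(axis=0)
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
    cum = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum, frac) + 1)

rng = np.random.default_rng(4)
z1, z2 = rng.normal(size=500), rng.normal(size=500)
noise = lambda: 0.01 * rng.normal(size=500)
# Seven "descriptors" driven by only two underlying factors
X = np.column_stack([3*z1 + noise(), 3*z1 + noise(), 2*z2 + noise(),
                     2*z2 + noise(), noise(), noise(), noise()])
print(n_components_to_keep(X))   # 2
```

Here M = 2 components suffice even though the descriptor space is 7-dimensional, mirroring the N-to-M compression described above.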
Generation of the data matrix, A:

A = ( a11 … a1n
      …
      am1 … amn )

The data matrix A has 1700 rows (different molten salts) and 7 columns (properties). The properties in this example include 1) equivalent weight, 2) melting point, 3) temperature of the measurements, 4) equivalent conductance, 5) specific conductance, 6) density, 7) viscosity.

Scaling (normalization) of the data matrix, X:

X = ( a11/s1 … a1n/sn
      …
      am1/s1 … amn/sn )

X is a scaled matrix of A; the matrix shown is an example of "unit variance" scaling, where each sj represents the standard deviation of column j.

Covariance matrix of the scaled data matrix, S:

S = cov(X) = X^T X / (m - 1)

Eigenvalue decomposition of the covariance matrix:

S = P Λ P^T, or cov(X) pi = λi pi

Λ is the eigenvalue matrix (eigenvalues λi on the diagonal of this diagonal matrix); P is called the loadings (or eigenvector) matrix.

Calculation of scores from the loadings:

X = t1 p1^T + t2 p2^T + … + tk pk^T + E, where k ≤ min{m, n};
ti: scores (orthogonal), pi: loadings (orthonormal)
PCA: algorithmic summary
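The algorithmic summary can be sketched step by step in NumPy (a minimal illustration; the synthetic 30 x 7 matrix stands in for the real 1700 x 7 data, and centering is added to the unit-variance scaling so the covariance formula is exact):

```python
import numpy as np

# Step 1: data matrix A (rows = salts, columns = properties);
# a synthetic 30 x 7 matrix stands in for the real 1700 x 7 one
rng = np.random.default_rng(5)
A = rng.normal(size=(30, 7)) * rng.uniform(1, 10, size=7) \
    + rng.uniform(-5, 5, size=7)

# Step 2: unit-variance scaling of A into X
X = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

# Step 3: covariance matrix S = X^T X / (m - 1)
m = X.shape[0]
S = X.T @ X / (m - 1)

# Step 4: eigenvalue decomposition S = P Lambda P^T
lam, P = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, P = lam[order], P[:, order]

# Step 5: scores from the loadings, T = X P
T = X @ P
print(np.allclose(X, T @ P.T))   # True: X is fully rebuilt from T and P
```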
[Figure: bivariate scatter plots among the seven variables (Eq.wt: equivalent weight; MP: melting point; Temp: temperature of the measurements; Eq.con: equivalent conductance; Spe.con: specific conductance; D: density; V: viscosity); the BiCl3 series stands out as a high-viscosity group.]
[Figure: three-dimensional scores plot with axes PC1 (49.89%), PC2 (22.43%), and PC3 (19.47%).]
Dimensionality Reduction of Molten Salts Data (Janz's Molten Salts Database: 1700 instances with 7 variables)
Bivariate representation of the data sets vs. multivariate (PCA) representation of the data sets
[Figure: scores plot of Scores on PC 1 (49.89%) vs. Scores on PC 2 (22.43%). Each point is a molten salt: TlNO3, HgI2, TlCl, AgI, AgBr, KCNS, InCl2, NaOH, LiNO3, ZnCl2, NaNO2, KNO3, KOH, InCl3, YCl3, MgCl2, LiF, Li2CO3, LiCl, CaCl2, KCl, KF, NaF, Na2SO4, NaCl, LiBr, NaBr, SrCl2, K2SO4, BaCl2, CsF, MgBr2, NaI, CsI, BaI2, BaBr2, HgBr2, PbBr2, CdI2, SrI2, GaI2, InCl, CsNO3, MgI2, CdCl2, BiCl3, AlI3, RbNO3, K2Cr2O7. The plot is annotated "ionic" and "covalent" at opposite ends of the PC1 axis.]
[Figure: loadings plot of Loadings on PC 1 (49.89%) vs. Loadings on PC 2 (22.43%), with the variables equivalent weight, melting point, temperature of the measurement, equivalent conductance, specific conductance, and density labeled.]
INTERPRETATIONS OF PRINCIPAL COMPONENT PROJECTIONS
[Figure: three-dimensional scores plot with axes PC1 (49.89%), PC2 (22.43%), and PC3 (19.47%).]
Trends in bonding captured along the PC1 axis of the scores plot
Correlations between variables captured in loading plot
To summarize, when we start with a multivariate data matrix, PCA permits us to reduce the dimensionality of that data set. This reduction in dimensionality offers us better opportunities to:
• Identify the strongest patterns in the data
• Capture most of the variability of the data with a small fraction of the total set of dimensions
• Eliminate much of the noise in the data, making it beneficial for both data mining and other data analysis algorithms
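The noise-elimination point can be demonstrated directly: reconstructing data from only its leading PCs lands closer to the underlying signal than the raw data does. An illustrative sketch with a made-up rank-2 signal:

```python
import numpy as np

rng = np.random.default_rng(6)
# A rank-2 "signal" spread across 7 descriptors, plus measurement noise
signal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 7))
X = signal + 0.05 * rng.normal(size=(200, 7))

Xc = X - X.mean(axis=0)
sig_c = signal - signal.mean(axis=0)

eigvals, P = np.linalg.eigh(np.cov(Xc, rowvar=False))
P = P[:, np.argsort(eigvals)[::-1]]

# Project onto the first two PCs and back: most of the noise is discarded
Xhat = Xc @ P[:, :2] @ P[:, :2].T
print(np.linalg.norm(Xhat - sig_c) < np.linalg.norm(Xc - sig_c))  # True
```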
PCA : summary