PCA and Kernel PCA

Presented by Shicai Yang
Institute of Systems Engineering
Outline
• PCA
• Kernel Methods
• Kernel PCA
• Others
1. PCA Overview
• Principal component analysis (PCA) is a way to reduce data dimensionality
• PCA projects high-dimensional data to a lower dimension
• PCA projects the data in the least-squares sense: it captures the large (principal) variability in the data and ignores small variability
PCA: An Intuitive Approach
Let us say we have data points $\mathbf{x}_i$, $i = 1,\dots,N$, in $p$ dimensions ($p$ is large).

If we want to represent the data set by a single point $\mathbf{x}_0$, then the natural choice is the sample mean:

$\mathbf{x}_0 = \mathbf{m} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$

Can we justify this choice mathematically? Define the objective

$J_0(\mathbf{x}_0) = \sum_{i=1}^{N}\|\mathbf{x}_0 - \mathbf{x}_i\|^2$

It turns out that if you minimize $J_0$, you get exactly the above solution, namely the sample mean.
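As a quick numerical check of this claim, here is a minimal MATLAB sketch of ours on made-up data (all variable names are our own):

% Check that the sample mean minimizes J0(x0) = sum_i ||x0 - xi||^2
X = randn(100, 5);                          % 100 made-up points in 5 dimensions (rows = points)
m = mean(X, 1);                             % sample mean
J0 = @(x0) sum(sum((X - repmat(x0, size(X,1), 1)).^2, 2));
J0(m)                                       % never larger than J0 at any perturbed point, e.g.
J0(m + 0.1*randn(1, 5))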
PCA: An Intuitive Approach…
Representing the data set $\mathbf{x}_i$, $i = 1,\dots,N$, by its mean alone is quite uninformative.

So let us try to represent the data by a straight line of the form:

$\mathbf{x} = \mathbf{m} + a\,\mathbf{e}$

This is the equation of a straight line passing through $\mathbf{m}$; $\mathbf{e}$ is a unit vector along the line, and the signed distance of a point $\mathbf{x}$ from $\mathbf{m}$ is $a$.

The training points projected onto this straight line would be

$\mathbf{x}_i \approx \mathbf{m} + a_i\,\mathbf{e}, \qquad i = 1,\dots,N$
PCA: An Intuitive Approach…
The squared-error criterion for this one-dimensional representation is:

$J_1(a_1,\dots,a_N,\mathbf{e}) = \sum_{i=1}^{N}\|(\mathbf{m}+a_i\mathbf{e})-\mathbf{x}_i\|^2 = \sum_{i=1}^{N}a_i^2\|\mathbf{e}\|^2 - 2\sum_{i=1}^{N}a_i\,\mathbf{e}^T(\mathbf{x}_i-\mathbf{m}) + \sum_{i=1}^{N}\|\mathbf{x}_i-\mathbf{m}\|^2$

Let us now determine the $a_i$'s. Partially differentiating with respect to $a_i$ we get:

$a_i = \mathbf{e}^T(\mathbf{x}_i-\mathbf{m})$

Plugging this expression for $a_i$ into $J_1$ we get:

$J_1(\mathbf{e}) = -\sum_{i=1}^{N}\mathbf{e}^T(\mathbf{x}_i-\mathbf{m})(\mathbf{x}_i-\mathbf{m})^T\mathbf{e} + \sum_{i=1}^{N}\|\mathbf{x}_i-\mathbf{m}\|^2 = -\mathbf{e}^T S\,\mathbf{e} + \sum_{i=1}^{N}\|\mathbf{x}_i-\mathbf{m}\|^2$

where

$S = \sum_{i=1}^{N}(\mathbf{x}_i-\mathbf{m})(\mathbf{x}_i-\mathbf{m})^T$

is called the scatter matrix.
PCA: An Intuitive Approach…

So minimizing $J_1$ is equivalent to maximizing:

$\mathbf{e}^T S\,\mathbf{e}$

Subject to the constraint that $\mathbf{e}$ is a unit vector:

$\mathbf{e}^T\mathbf{e} = 1$

Use the Lagrange multiplier method to form the objective function:

$\mathbf{e}^T S\,\mathbf{e} - \lambda(\mathbf{e}^T\mathbf{e} - 1)$

Differentiate to obtain the equation:

$2S\mathbf{e} - 2\lambda\mathbf{e} = \mathbf{0} \quad\text{or}\quad S\mathbf{e} = \lambda\mathbf{e}$

The solution is that $\mathbf{e}$ is the eigenvector of $S$ corresponding to the largest eigenvalue.
PCA: An Intuitive Approach…
The preceding analysis can be extended in the following way. Instead of projecting the data points onto a straight line, we may now want to project them onto a d-dimensional plane of the form:

$\mathbf{x} = \mathbf{m} + a_1\mathbf{e}_1 + \cdots + a_d\mathbf{e}_d$

where $d$ is much smaller than the original dimension $p$. In this case one can form the objective function:

$J_d = \sum_{i=1}^{N}\Big\|\Big(\mathbf{m} + \sum_{k=1}^{d} a_{ik}\mathbf{e}_k\Big) - \mathbf{x}_i\Big\|^2$

It can also be shown that the vectors $\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_d$ are the $d$ eigenvectors corresponding to the $d$ largest eigenvalues of the scatter matrix

$S = \sum_{i=1}^{N}(\mathbf{x}_i-\mathbf{m})(\mathbf{x}_i-\mathbf{m})^T$
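The whole procedure on this slide can be sketched in a few lines of MATLAB (an illustrative sketch of ours on made-up data; all variable names are our own):

% Project data onto the d leading eigenvectors of the scatter matrix S
X = randn(200, 10);                  % N = 200 made-up samples, p = 10 dimensions (rows = samples)
m = mean(X, 1);
Xc = X - repmat(m, size(X, 1), 1);   % center the data
S = Xc' * Xc;                        % scatter matrix, p x p
[E, D] = eig(S);
[~, idx] = sort(diag(D), 'descend'); % order eigenvalues from largest to smallest
d = 2;
Ed = E(:, idx(1:d));                 % e1, ..., ed
A = Xc * Ed;                         % coefficients a_ik = e_k' * (x_i - m)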
PCA: Visually
Data points are represented in a rotated orthogonal coordinate system: the origin is the mean of the data points and the axes are provided by the eigenvectors.
PCA Steps
• Let x = (x1, x2, …, xn)^T be an n-dimensional random vector
(1) Arrange the original observations into an observation matrix X, with each column an observation sample and each row a dimension
(2) Compute the covariance matrix of the samples X: covX = COV(X)
(3) Compute the eigenvalues and eigenvectors of covX, and sort the eigenvalues in descending order
(4) Take the eigenvectors corresponding to the m largest eigenvalues to form the matrix V
(5) Y = V^T X; Y is then the dimension-reduced matrix
MATLAB Functions and Algorithms for PCA

1. princomp: principal component analysis
• PC = princomp(X)
• [PC, score, latent, tsquare] = princomp(X)
  – Performs principal component analysis on the data matrix X (N*p, rows = observation samples, columns = feature variables), returning the principal components (PC), the so-called Z-scores (score), the eigenvalues of the covariance matrix of X (latent), and Hotelling's T² statistic for each data point (tsquare).

2. pcacov: principal component analysis from a covariance matrix
• PC = pcacov(X)
• [PC, latent, explained] = pcacov(X)
  – Performs principal component analysis on the covariance matrix X, returning the principal components (PC), the eigenvalues of the covariance matrix X (latent), and the percentage of the total variance of the observations accounted for by each eigenvector (explained).
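A usage sketch of the two calls listed above, on random data of our own (note that newer MATLAB releases replace princomp with pca; the calls below follow the signatures given on this slide):

X = randn(100, 4);                               % 100 observations of 4 variables
[PC, score, latent, tsquare] = princomp(X);      % PCA of the data matrix
[PC2, latent2, explained] = pcacov(cov(X));      % PCA from the covariance matrix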
3. pcares: residuals from principal component analysis
• residuals = pcares(X, ndim)
  – Returns the residuals obtained by retaining ndim principal components of X. Note that ndim is a scalar and must be smaller than the number of columns of X; also, X is the data matrix, not a covariance matrix.

4. barttest: Bartlett's test for the principal components
• ndim = barttest(X, alpha)
• [ndim, prob, chisquare] = barttest(X, alpha)
  – Bartlett's test is a test of equal variances. ndim = barttest(X, alpha) gives, at significance level alpha, the dimension of the non-random model that fits the data matrix X. ndim, the model dimension, is determined by a sequence of hypothesis tests: ndim = 1 indicates that the variances of X along every principal component are equal; ndim = 2 indicates that the variances along the second and all remaining components are equal.
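A corresponding usage sketch for pcares and barttest (again on random data of our own; the number of retained components and the significance level are arbitrary choices):

X = randn(100, 4);                               % 100 observations of 4 variables
residuals = pcares(X, 2);                        % residuals after keeping ndim = 2 principal components
[ndim, prob, chisquare] = barttest(X, 0.05);     % Bartlett's test at significance level alpha = 0.05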
Computing the Covariance

(1) XCOV = COV(X)

(2) % rows = observation samples, columns = feature variables; cv is the returned covariance matrix
xmean = mean(x);
xsize = size(x);
for i = 1:xsize(2)
    xx1 = x(:,i);
    mxx1 = xmean(i);
    for j = 1:xsize(2)
        xx2 = x(:,j);
        mxx2 = xmean(j);
        % unbiased estimate: divide by the number of samples minus 1
        v = ((xx1-mxx1)'*(xx2-mxx2))/(xsize(1)-1);
        cv(i,j) = v;
        cv(j,i) = v;
    end
end
MATLAB Implementation of PCA

function [xeigvsort,xeigdsort,final] = KL_Exp(x)
xmean = mean(x);
xsize = size(x);
for i = 1:xsize(2)
    xadjust(:,i) = x(:,i) - xmean(i);      % center each variable
end
xcov = cov(xadjust);                       % compute the covariance matrix
[xeigv,xeigd] = eig(xcov);                 % compute eigenvalues and eigenvectors (ascending order)
xeigvsort = fliplr(xeigv);                 % reorder the eigenvectors v
xeigdsort = flipud(fliplr(xeigd));         % sort the eigenvalues d in descending order
finaleigs = xeigvsort(:,1:xsize(2));       % choose the transformation basis; xsize(2) can be reduced
pdata = finaleigs'*xadjust';               % apply the transformation
final = pdata';
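A call of the function above might look as follows (made-up data; x has rows as observations and columns as variables, and the output names are our own):

x = randn(50, 6);                          % 50 observations of 6 variables
[V, D, y] = KL_Exp(x);                     % sorted eigenvectors, sorted eigenvalues, transformed data
% y is the data expressed in the principal-component coordinate system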
Assumptions and Limitations

• Linearity assumption
  – The internal model of PCA is linear, which means that the relationships it can capture between principal components are also linear. The now-popular family of Kernel-PCA methods is a nonlinear extension of the original PCA.

• Mean and variance as sufficient statistics
  – Models whose probability distribution is fully described by the mean and variance are limited to exponential-family distributions. If the data distribution is non-Gaussian, PCA breaks down and ICA methods come into play.
• Directions of large variance are the important ones
  – PCA implicitly assumes that the data have a high signal-to-noise ratio, so the direction with the highest variance can be regarded as a principal component, while directions of small variance are regarded as noise. This amounts to the choice of a low-pass filter.

• Orthogonal principal components
  – PCA assumes that the principal component vectors are mutually orthogonal, so a whole set of efficient linear-algebra tools can be applied, which greatly improves efficiency and broadens the range of applications.
2. Kernel Methods
• Find a mapping Φ such that, in the new space, problem solving is easier (e.g. linear)
• The kernel represents the similarity between two objects, defined as the dot product in this new vector space
• But the mapping is left implicit
• Easy generalization of many dot-product (or distance) based pattern recognition algorithms
Kernel Methods: the Mapping

(Figure: the mapping Φ from the original space to the feature (vector) space)
Feature Spaces
$\Phi : \mathbf{x} \mapsto \Phi(\mathbf{x}), \quad R^d \to F$

Non-linear mapping to F:
1. High-dimensional space
2. Infinite-dimensional countable space: $L_2$
3. Function space (Hilbert space)

Example: $\Phi(x, y) = (x^2,\; y^2,\; \sqrt{2}\,xy)$
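A small MATLAB check of this example mapping (our own illustration): the dot product of the mapped vectors equals the squared dot product of the original vectors, i.e. Φ(u)•Φ(w) = (u•w)².

phi = @(v) [v(1)^2, v(2)^2, sqrt(2)*v(1)*v(2)];  % the mapping (x, y) -> (x^2, y^2, sqrt(2)*x*y)
u = [1 2]; w = [3 -1];
dot(phi(u), phi(w))                              % returns 1
(u * w')^2                                       % also 1: the kernel value reproduces the feature-space dot product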
Kernel : more formal definition
• A kernel k(x,y)
  – is a similarity measure
  – defined by an implicit mapping Φ
  – from the original space to a vector space (the feature space)
  – such that: k(x,y) = Φ(x)•Φ(y)
• This similarity measure and the mapping include:
  – Invariance or other a priori knowledge
  – Simpler structure (linear representation of the data)
  – The class of functions the solution is taken from
  – Possibly infinite dimension (hypothesis space for learning)
  – … but still computational efficiency when computing k(x,y)
General Principles governing Kernel Design
Kernel Trick
Note: In the dual representation we used the Gram matrix to express the solution.
Kernel Trick: replace

$\mathbf{x} \to \Phi(\mathbf{x}), \qquad G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle \;\to\; G_{ij} = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle = K(\mathbf{x}_i, \mathbf{x}_j) \quad\text{(the kernel)}$

If we use algorithms that only depend on the Gram matrix G, then we never have to know (compute) the actual features.
This is the crucial point of kernel methods
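As an illustration of this point, the Gram matrix below is built directly from a kernel function, without ever forming Φ(x) explicitly (a sketch of ours, with an assumed degree-2 polynomial kernel and made-up data):

X = randn(20, 2);                                % 20 made-up points in 2-D (rows = points)
k = @(x, y) (x * y')^2;                          % degree-2 polynomial kernel (x, y are row vectors)
N = size(X, 1);
K = zeros(N);
for i = 1:N
    for j = 1:N
        K(i, j) = k(X(i, :), X(j, :));           % K_ij = k(x_i, x_j) = <phi(x_i), phi(x_j)>
    end
end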
Modularity

Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input

Modularity: Any kernel can be used with any kernel-algorithm.

Some kernels:

$k(x,y) = e^{-\|x-y\|^2/c}$
$k(x,y) = \langle x, y\rangle^d$
$k(x,y) = \tanh(\langle x, y\rangle + \theta)$
$k(x,y) = \dfrac{1}{\sqrt{\|x-y\|^2 + c^2}}$

Some kernel algorithms:
- SVM
- Fisher LDA (KFDA)
- Kernel Regression
- Kernel PCA
- Kernel CCA
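The four kernels above written as MATLAB anonymous functions for row vectors x and y (an illustrative sketch; the parameter values c, d, and theta are arbitrary choices of ours):

c = 1; d = 2; theta = 0;
k_rbf  = @(x, y) exp(-norm(x - y)^2 / c);        % Gaussian (RBF) kernel
k_poly = @(x, y) (x * y')^d;                     % polynomial kernel
k_sig  = @(x, y) tanh(x * y' + theta);           % sigmoid kernel
k_imq  = @(x, y) 1 / sqrt(norm(x - y)^2 + c^2);  % inverse multiquadric kernel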
Benefits from kernels
• Generalizes (nonlinearly) pattern recognition algorithms in clustering, classification, density estimation, …
  – When these algorithms are dot-product based, by replacing the dot product (x•y) by k(x,y) = Φ(x)•Φ(y)
    e.g.: linear discriminant analysis, logistic regression, perceptron, SOM, PCA, ICA, …
    NB: this often implies working with the "dual" form of the algorithm.
  – When these algorithms are distance-based, by replacing the squared distance d²(x,y) by k(x,x) + k(y,y) − 2k(x,y)
• Freedom in choosing Φ implies a large variety of learning algorithms
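For example (our own illustration, with an arbitrary RBF kernel), the kernel-induced squared distance between two points can be computed directly from kernel evaluations:

k = @(x, y) exp(-norm(x - y)^2 / 1);     % some kernel (width parameter chosen arbitrarily)
x = [1 2]; y = [0 -1];
d2 = k(x, x) + k(y, y) - 2*k(x, y);      % squared distance between phi(x) and phi(y) in feature space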
3. Kernel PCA
• Assumption behind PCA is that the data points x are multivariate Gaussian
• Often this assumption does not hold
• However, it may still be possible that a transformation Φ(x) is Gaussian; then we can perform PCA in the space of Φ(x)
• Kernel PCA performs this PCA; however, because of the "kernel trick," it never computes the mapping Φ(x) explicitly!
KPCA: Basic Idea
Kernel PCA Formulation
• We need the following fact:
• Let v be an eigenvector of the scatter matrix

$S = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T$

• Then v belongs to the linear space spanned by the data points $\mathbf{x}_i$, $i = 1, 2, \dots, N$.
• Proof:

$S\mathbf{v} = \lambda\mathbf{v} \;\Rightarrow\; \mathbf{v} = \frac{1}{\lambda} S\mathbf{v} = \frac{1}{\lambda}\sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T \mathbf{v} = \sum_{i=1}^{N}\frac{(\mathbf{x}_i^T\mathbf{v})}{\lambda}\,\mathbf{x}_i$
Kernel PCA Formulation…
• Let C be the scatter matrix of the centered mapping Φ(x):

$C = \sum_{i=1}^{N} \Phi(\mathbf{x}_i)\Phi(\mathbf{x}_i)^T$

• Let w be an eigenvector of C; then w can be written as a linear combination of the mapped data points:

$\mathbf{w} = \sum_{k=1}^{N} \alpha_k \Phi(\mathbf{x}_k)$

• Also, we have:

$C\mathbf{w} = \lambda\mathbf{w}$

• Combining, we get:

$\left(\sum_{i=1}^{N} \Phi(\mathbf{x}_i)\Phi(\mathbf{x}_i)^T\right)\left(\sum_{k=1}^{N} \alpha_k \Phi(\mathbf{x}_k)\right) = \lambda \sum_{k=1}^{N} \alpha_k \Phi(\mathbf{x}_k)$
Kernel PCA Formulation…
Pre-multiplying both sides by $\Phi(\mathbf{x}_l)^T$:

$\sum_{i=1}^{N}\sum_{k=1}^{N}\alpha_k\,\Phi(\mathbf{x}_l)^T\Phi(\mathbf{x}_i)\,\Phi(\mathbf{x}_i)^T\Phi(\mathbf{x}_k) \;=\; \lambda\sum_{k=1}^{N}\alpha_k\,\Phi(\mathbf{x}_l)^T\Phi(\mathbf{x}_k), \qquad l = 1, 2, \dots, N$

In matrix form this is

$K^2\boldsymbol{\alpha} = \lambda K\boldsymbol{\alpha}, \quad\text{i.e.}\quad K\boldsymbol{\alpha} = \lambda\boldsymbol{\alpha}, \qquad\text{where } K_{ij} = \Phi(\mathbf{x}_i)^T\Phi(\mathbf{x}_j)$

K is the kernel (or Gram) matrix. (Compare with the eigen equation $S\mathbf{v} = \lambda\mathbf{v}$ of ordinary PCA.)
Kernel PCA Formulation…
From the eigen equation

$K\boldsymbol{\alpha} = \lambda\boldsymbol{\alpha}$

and the fact that the eigenvector w is normalized to 1, we obtain:

$\|\mathbf{w}\|^2 = \left(\sum_{i=1}^{N}\alpha_i\Phi(\mathbf{x}_i)\right)^T\left(\sum_{i=1}^{N}\alpha_i\Phi(\mathbf{x}_i)\right) = \boldsymbol{\alpha}^T K \boldsymbol{\alpha} = \lambda\,\boldsymbol{\alpha}^T\boldsymbol{\alpha} = 1$
KPCA Algorithm

Step 1: Compute the Gram matrix: $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j), \quad i, j = 1, \dots, N$

Step 2: Compute the (eigenvalue, eigenvector) pairs of K: $(\lambda_l, \boldsymbol{\alpha}^l), \quad l = 1, \dots, M$

Step 3: Normalize the eigenvectors: $\boldsymbol{\alpha}^l \leftarrow \dfrac{\boldsymbol{\alpha}^l}{\sqrt{\lambda_l}}$

Thus, an eigenvector $\mathbf{w}^l$ of C is now represented as:

$\mathbf{w}^l = \sum_{k=1}^{N} \alpha_k^l \Phi(\mathbf{x}_k)$

To project a test feature Φ(x) onto $\mathbf{w}^l$ we need to compute:

$(\mathbf{w}^l)^T\Phi(\mathbf{x}) = \sum_{k=1}^{N}\alpha_k^l\,\Phi(\mathbf{x}_k)^T\Phi(\mathbf{x}) = \sum_{k=1}^{N}\alpha_k^l\, k(\mathbf{x}_k, \mathbf{x})$

So we never need Φ explicitly.
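Putting the three steps together, a minimal MATLAB sketch of kernel PCA (our own illustration on made-up data; the RBF kernel width and the number of retained components are arbitrary choices, and centering of the data in feature space is omitted for brevity):

X = randn(100, 2);                               % N = 100 made-up points (rows = points)
N = size(X, 1);
k = @(x, y) exp(-norm(x - y)^2 / 0.5);           % RBF kernel with an assumed width c = 0.5
K = zeros(N);
for i = 1:N
    for j = 1:N
        K(i, j) = k(X(i, :), X(j, :));           % Step 1: Gram matrix
    end
end
[A, L] = eig(K);                                 % Step 2: eigen-decomposition of K
[lambda, idx] = sort(diag(L), 'descend');
A = A(:, idx);
M = 3;                                           % number of components to keep
for l = 1:M
    A(:, l) = A(:, l) / sqrt(lambda(l));         % Step 3: normalize so that lambda_l * ||alpha^l||^2 = 1
end
proj = K * A(:, 1:M);                            % projections of the training points onto w^1, ..., w^M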
Examples of Kernels
Polynomial kernel (n=2)
RBF kernel (n=2)
4. Others
• 2DPCA
  – Prof. Jingyu Yang et al., Nanjing University of Science and Technology, IEEE T-PAMI, 2004(1)
  – The feature extraction performance of 2DPCA is at least as good as that of PCA, but it requires more memory than PCA.
• 2DLDA
  – Prof. Baozong Yuan et al., Beijing Jiaotong University, P. R. Letters, 2005(3)
• Kernel ECA (KECA)
  – Robert Jenssen et al., IEEE T-PAMI, 2010(5)
  – Maximum entropy preservation: it keeps the loss of entropy as small as possible and neatly combines entropy with the kernel data mapping, so that the entropy computation turns naturally into a computation on the kernel matrix, making it an optimization problem in kernel space.
References

[1] J. T. Y. Kwok and I. W. H. Tsang, "The Pre-Image Problem in Kernel Methods," IEEE
Transactions on Neural Networks, vol. 15, pp. 1517-1525, 2004.
[2] S. Mika, et al., "Kernel PCA and De-Noising in Feature Spaces," in Proceedings of the 1998 conference on Advances in Neural Information Processing Systems II, 1999.
[3] B. Schölkopf, et al., "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, vol. 10, pp. 1299-1319, 1998.
[4] R. Jenssen, "Information Theoretic Learning and Kernel Methods," in Information Theory and Statistical Learning, Springer US, 2009, pp. 209-230.
[5] R. Jenssen, et al., "Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm," in Proceedings of the 2006 conference on Advances in Neural Information Processing Systems 19, 2007, pp. 633-640.
[6] R. Jenssen and O. Storås, "Kernel Entropy Component Analysis Pre-images for Pattern Denoising," in Proceedings of the 16th Scandinavian Conference on Image Analysis, Oslo, Norway, 2009, pp. 626-635.
[7] R. Jenssen, "Kernel Entropy Component Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 847-860, 2010.