Transcript
Page 1:

PCA and Kernel PCA

Presented by Shicai Yang
Institute of Systems Engineering
- Southeast University -

June 14, 2015

Page 2:

Outline

• PCA

• Kernel Methods

• Kernel PCA

• Others

Page 3:

1. PCA Overview

• Principal component analysis (PCA) is a way to reduce data dimensionality

• PCA projects high-dimensional data onto a lower-dimensional subspace

• PCA projects the data in the least-squares sense: it captures the large (principal) variability in the data and ignores the small variability

Page 4:

PCA: An Intuitive Approach

Let us say we have data points x_i, i = 1…N, in p dimensions (p is large).

If we want to represent the data set by a single point x_0, then

$$\mathbf{x}_0 = \mathbf{m} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i \quad \text{(the sample mean)}$$

Can we justify this choice mathematically?

$$J_0(\mathbf{x}_0) = \sum_{i=1}^{N}\|\mathbf{x}_0 - \mathbf{x}_i\|^2$$

It turns out that if you minimize J_0, you get the above solution, namely, the sample mean.

Page 5:

PCA: An Intuitive Approach…

Representing the data set x_i, i = 1…N, by its mean is quite uninformative.

So let's try to represent the data by a straight line of the form:

$$\mathbf{x} = \mathbf{m} + a\,\mathbf{e}$$

This is the equation of a straight line that passes through m.

e is a unit vector along the straight line, and the signed distance of a point x from m is a.

The training points projected onto this straight line would be:

$$\mathbf{x}_i = \mathbf{m} + a_i\,\mathbf{e}, \qquad i = 1,\dots,N$$

Page 6:

PCA: An Intuitive Approach…

Form the objective function:

$$J_1(a_1,\dots,a_N,\mathbf{e}) = \sum_{i=1}^{N}\|(\mathbf{m}+a_i\mathbf{e})-\mathbf{x}_i\|^2 = \sum_{i=1}^{N} a_i^2\|\mathbf{e}\|^2 - 2\sum_{i=1}^{N} a_i\,\mathbf{e}^T(\mathbf{x}_i-\mathbf{m}) + \sum_{i=1}^{N}\|\mathbf{x}_i-\mathbf{m}\|^2$$

Let's now determine the a_i's. Partially differentiating with respect to a_i we get:

$$a_i = \mathbf{e}^T(\mathbf{x}_i-\mathbf{m})$$

Plugging this expression for a_i into J_1 we get:

$$J_1(\mathbf{e}) = -\,\mathbf{e}^T S\,\mathbf{e} + \sum_{i=1}^{N}\|\mathbf{x}_i-\mathbf{m}\|^2$$

where

$$S = \sum_{i=1}^{N}(\mathbf{x}_i-\mathbf{m})(\mathbf{x}_i-\mathbf{m})^T$$

is called the scatter matrix.

Page 7:

PCA: An Intuitive Approach…

So minimizing J_1 is equivalent to maximizing:

$$\mathbf{e}^T S\,\mathbf{e}$$

Subject to the constraint that e is a unit vector:

$$\mathbf{e}^T\mathbf{e} = 1$$

Use the Lagrange multiplier method to form the objective function:

$$\mathbf{e}^T S\,\mathbf{e} - \lambda(\mathbf{e}^T\mathbf{e} - 1)$$

Differentiate to obtain the equation:

$$2S\mathbf{e} - 2\lambda\mathbf{e} = \mathbf{0}, \quad \text{or} \quad S\mathbf{e} = \lambda\mathbf{e}$$

Solution is that e is the eigenvector of S corresponding to the largest eigenvalue

Page 8:

PCA: An Intuitive Approach…

The preceding analysis can be extended in the following way. Instead of projecting the data points onto a straight line, we may now want to project them onto a d-dimensional plane of the form:

$$\mathbf{x} = \mathbf{m} + a_1\mathbf{e}_1 + \dots + a_d\mathbf{e}_d$$

where d is much smaller than the original dimension p.

In this case one can form the objective function:

$$J_d = \sum_{i=1}^{N}\Big\|\Big(\mathbf{m} + \sum_{k=1}^{d} a_{ki}\mathbf{e}_k\Big) - \mathbf{x}_i\Big\|^2$$

It can also be shown that the vectors e_1, e_2, …, e_d are the d eigenvectors corresponding to the d largest eigenvalues of the scatter matrix

$$S = \sum_{i=1}^{N}(\mathbf{x}_i-\mathbf{m})(\mathbf{x}_i-\mathbf{m})^T$$

Page 9:

PCA: Visually

Data points are represented in a rotated orthogonal coordinate system: the origin is the mean of the data points and the axes are provided by the eigenvectors.

Page 10:

PCA Steps

• Let x = (x1, x2, …, xn)^T be an n-dimensional random vector.

(1) Arrange the original observations into an observation matrix X, where each column is one observation sample and each row is one dimension (feature).

(2) Compute the covariance matrix of the samples: covX = COV(X).

(3) Compute the eigenvalues and eigenvectors of covX, and sort the eigenvalues in descending order.

(4) Take the eigenvectors corresponding to the m largest eigenvalues to form the matrix V.

(5) Let Y = V^T X; then Y is the dimensionality-reduced data matrix.

(A minimal MATLAB sketch of these steps follows below.)
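To make steps (1)-(5) concrete, here is a minimal MATLAB sketch (not from the original slides). It assumes the slide's convention that each column of X is one sample and each row is one feature; the example data and variable names are illustrative only.

X = randn(5, 200);                               % example data: 5 features, 200 samples (columns)
m = 3;                                           % number of principal components to keep

Xc   = X - repmat(mean(X, 2), 1, size(X, 2));    % (1)-(2) center each feature (row) at zero
covX = (Xc * Xc') / (size(X, 2) - 1);            % sample covariance matrix (5 x 5)

[Vec, D]   = eig(covX);                          % (3) eigenvalues and eigenvectors
[~, order] = sort(diag(D), 'descend');           %     sort eigenvalues, largest first
V = Vec(:, order(1:m));                          % (4) eigenvectors of the m largest eigenvalues

Y = V' * X;                                      % (5) dimensionality-reduced data (m x 200)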

Page 11:

MATLAB Functions and Algorithms for PCA

1. princomp: principal component analysis
• PC = princomp(X)
• [PC, score, latent, tsquare] = princomp(X)
– Performs principal component analysis on the data matrix X (N*p; rows = observation samples, columns = feature variables) and returns the principal components (PC), the so-called z-scores (score), the eigenvalues of the covariance matrix of X (latent), and Hotelling's T² statistic for each data point (tsquare).

2. pcacov: principal component analysis from a covariance matrix
• PC = pcacov(X)
• [PC, latent, explained] = pcacov(X)
– Performs principal component analysis on the covariance matrix X and returns the principal components (PC), the eigenvalues of the covariance matrix X (latent), and the percentage of the total variance of the observations explained by each eigenvector (explained).
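A brief usage sketch (not from the slides); princomp and pcacov are the older Statistics Toolbox names used here (newer MATLAB releases recommend pca), and the example data are arbitrary:

X = randn(100, 4);                               % 100 observations, 4 feature variables

[PC, score, latent, tsquare] = princomp(X);      % PCA on the data matrix
[PC2, latent2, explained]    = pcacov(cov(X));   % PCA from the covariance matrix

disp(latent');                                   % eigenvalues (variance along each component)
disp(explained');                                % percent of total variance per component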

Page 12:

3. pcares: residuals from principal component analysis
• residuals = pcares(X, ndim)
– Returns the residuals obtained by retaining ndim principal components of X. Note that ndim is a scalar and must be less than the number of columns of X; also, X is the data matrix, not a covariance matrix.

4. barttest: Bartlett's test for the principal components
• ndim = barttest(X, alpha)
• [ndim, prob, chisquare] = barttest(X, alpha)
– Bartlett's test is a test of equal variances. ndim = barttest(X, alpha) gives, at significance level alpha, the dimension of a non-random model adequate for the data matrix X. ndim, the model dimension, is determined by a sequence of hypothesis tests: ndim = 1 indicates that the variances of X along all principal components are equal; ndim = 2 indicates that the variances along the second and all remaining components are equal.
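Similarly, a small usage sketch (not from the slides) for pcares and barttest, with an arbitrary example data matrix:

X = randn(100, 4);                    % data matrix: 100 observations, 4 variables

residuals = pcares(X, 2);             % residuals after retaining 2 principal components
ndim      = barttest(X, 0.05);        % model dimension at significance level alpha = 0.05

fprintf('Bartlett''s test suggests %d dimension(s)\n', ndim);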

Page 13:

Computing the Covariance

(1) XCOV = COV(X)

(2) Manual computation (rows = observation samples, columns = feature variables; cv is the returned covariance matrix):

xmean = mean(x);
xsize = size(x);
for i = 1:xsize(2)
    xx1  = x(:,i);
    mxx1 = xmean(i);
    for j = i:xsize(2)
        xx2  = x(:,j);
        mxx2 = xmean(j);
        % unbiased covariance: divide by (number of samples - 1)
        v = ((xx1 - mxx1)' * (xx2 - mxx2)) / (xsize(1) - 1);
        cv(i,j) = v;
        cv(j,i) = v;
    end
end
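A quick self-contained check (not in the original slides) that the manual loop reproduces MATLAB's built-in cov:

x = randn(50, 3);                                % 50 observation samples, 3 feature variables
xmean = mean(x);  xsize = size(x);
cv = zeros(xsize(2));
for i = 1:xsize(2)
    for j = i:xsize(2)
        v = ((x(:,i) - xmean(i))' * (x(:,j) - xmean(j))) / (xsize(1) - 1);
        cv(i,j) = v;  cv(j,i) = v;
    end
end
disp(max(max(abs(cv - cov(x)))));                % difference should be at machine precision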

Page 14:

MATLAB Implementation of PCA

function [xeigvsort, xeigdsort, final] = KL_Exp(x)
% PCA (Karhunen-Loeve expansion): rows of x are samples, columns are features.
xmean = mean(x);
xsize = size(x);
xadjust = zeros(xsize);
for i = 1:xsize(2)
    xadjust(:,i) = x(:,i) - xmean(i);           % center each feature
end
xcov = cov(xadjust);                            % compute the covariance matrix
[xeigv, xeigd] = eig(xcov);                     % eigenvalues and eigenvectors
[~, order] = sort(diag(xeigd), 'descend');      % sort eigenvalues in descending order
xeigvsort = xeigv(:, order);                    % eigenvectors, sorted accordingly
xeigdsort = xeigd(order, order);                % eigenvalues, descending
finaleigs = xeigvsort(:, 1:xsize(2));           % transformation basis; 1:xsize(2) can be reduced
pdata = finaleigs' * xadjust';                  % apply the transformation
final = pdata';
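A brief usage sketch (not from the slides) of KL_Exp:

x = randn(100, 6);                    % 100 samples, 6 features
[V, D, Y] = KL_Exp(x);                % V: sorted eigenvectors, D: eigenvalues (descending), Y: transformed data
plot(Y(:,1), Y(:,2), '.');            % scatter of the first two principal components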

Page 15:

Assumptions and Limitations

• Linearity assumption
– The internal model of PCA is linear, which means the relationships between the principal components it can extract are also linear. The now-popular family of Kernel PCA methods is a nonlinear extension of the original PCA.

• Mean and variance as sufficient statistics
– Models for which the mean and variance fully describe the probability distribution are limited to the exponential family. If the data distribution is non-Gaussian, PCA loses effectiveness and ICA methods come into play.

Page 16:

• Directions of large variance are assumed to be important
– PCA implicitly assumes that the data have a high signal-to-noise ratio, so the direction with the highest variance can be regarded as a principal component, while small-variance variations are regarded as noise. This is a consequence of choosing a low-pass filter.

• Orthogonal principal components
– PCA assumes that the principal component vectors are mutually orthogonal, so a range of efficient tools from linear algebra can be used for the solution, which greatly improves efficiency and broadens the range of applications.

Page 17:

2. Kernel Methods

• Find a mapping such that, in the new space, problem solving is easier (e.g. linear)

• The kernel represents the similarity between two objects, defined as the dot-product in this new vector space

• But the mapping is left implicit

• Easy generalization of a lot of dot-product (or distance) based pattern recognition algorithms

Page 18:

Kernel Methods: the mapping

[Figure: the mapping Φ from the Original Space to the Feature (Vector) Space]

Page 19:

Feature Spaces

$$\Phi: \mathbf{x} \mapsto \Phi(\mathbf{x}), \qquad \mathbb{R}^d \to F$$

Non-linear mapping to a feature space F:
1. High-dimensional space
2. Infinite-dimensional countable space: ℓ² (L2)
3. Function space (Hilbert space)

Example: $\Phi(x, y) = (x^2,\; y^2,\; \sqrt{2}\,xy)$
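A quick numerical check (not from the slides) that this mapping realizes the degree-2 polynomial kernel, i.e. Φ(u)·Φ(v) = (uᵀv)²:

phi = @(u) [u(1)^2, u(2)^2, sqrt(2)*u(1)*u(2)];   % explicit feature map for 2-D input
u = [1; 2];  v = [3; -1];

lhs = phi(u) * phi(v)';        % dot product in the feature space
rhs = (u' * v)^2;              % degree-2 polynomial kernel evaluated in the input space
fprintf('phi(u).phi(v) = %g, (u''*v)^2 = %g\n', lhs, rhs);   % both equal 1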

Page 20:

Kernel: a more formal definition

• A kernel k(x,y)
– is a similarity measure
– defined by an implicit mapping Φ
– from the original space to a vector space (the feature space)
– such that: k(x,y) = Φ(x)•Φ(y)

• This similarity measure and the mapping include:
– Invariance or other a priori knowledge
– Simpler structure (linear representation of the data)
– The class of functions the solution is taken from
– Possibly infinite dimension (hypothesis space for learning)
– … but still computational efficiency when computing k(x,y)

General Principles governing Kernel Design

Page 21:

Kernel Trick

Note: In the dual representation we used the Gram matrix to express the solution.

Kernel Trick: replace

$$\mathbf{x} \;\to\; \Phi(\mathbf{x}), \qquad G_{ij} = \langle \mathbf{x}_i, \mathbf{x}_j\rangle \;\to\; G_{ij} = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j)\rangle = k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{(the kernel)}$$

If we use algorithms that depend only on the Gram matrix G, then we never have to know (or compute) the actual features.

This is the crucial point of kernel methods
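To make this concrete, the following sketch (not from the slides) builds the Gram matrix of a data set directly from kernel evaluations, never forming Φ(x); the Gaussian RBF kernel and its width sigma are an illustrative choice:

X = randn(20, 3);                                % 20 samples, 3 input dimensions
sigma = 1.0;                                     % RBF kernel width (illustrative choice)
k = @(u, v) exp(-norm(u - v)^2 / (2*sigma^2));   % Gaussian RBF kernel

N = size(X, 1);
K = zeros(N);                                    % Gram matrix K(i,j) = k(x_i, x_j)
for i = 1:N
    for j = 1:N
        K(i, j) = k(X(i, :), X(j, :));
    end
end
% Any algorithm written in terms of dot products can now work with K alone.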

Page 22:

Modularity

Kernel methods consist of two modules:
1) The choice of kernel (this is non-trivial)
2) The algorithm which takes kernels as input

Modularity: Any kernel can be used with any kernel-algorithm.

Some Kernels:

$$k(\mathbf{x},\mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^2 / c}$$

$$k(\mathbf{x},\mathbf{y}) = \langle \mathbf{x},\mathbf{y} \rangle^{d}$$

$$k(\mathbf{x},\mathbf{y}) = \tanh(\langle \mathbf{x},\mathbf{y} \rangle + \theta)$$

$$k(\mathbf{x},\mathbf{y}) = \frac{1}{\sqrt{\|\mathbf{x}-\mathbf{y}\|^{2} + c^{2}}}$$

(See the MATLAB sketch after the algorithm list below.)

Some Kernel Algorithms:

- SVM
- Fisher LDA (KFDA)
- Kernel Regression
- Kernel PCA
- Kernel CCA
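As a concrete counterpart to the kernel list above, the following anonymous functions (a sketch, not from the slides; the parameters c, d, theta are illustrative) implement the four kernels for column-vector inputs:

c = 1.0;  d = 3;  theta = 0;                         % illustrative kernel parameters

k_rbf  = @(x, y) exp(-norm(x - y)^2 / c);            % Gaussian RBF kernel
k_poly = @(x, y) (x' * y)^d;                         % polynomial kernel
k_tanh = @(x, y) tanh(x' * y + theta);               % sigmoid (tanh) kernel
k_imq  = @(x, y) 1 / sqrt(norm(x - y)^2 + c^2);      % inverse multiquadric kernel

x = [1; 2];  y = [0.5; -1];
fprintf('%g %g %g %g\n', k_rbf(x, y), k_poly(x, y), k_tanh(x, y), k_imq(x, y));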

Page 23:

Benefits from kernels

• Generalizes (nonlinearly) pattern recognition algorithms in clustering, classification, density estimation, …

– When these algorithms are dot-product based: by replacing the dot product (x•y) by k(x,y) = Φ(x)•Φ(y)

  e.g.: linear discriminant analysis, logistic regression, perceptron, SOM, PCA, ICA, …

  N.B. This often implies working with the "dual" form of the algorithm.

– When these algorithms are distance-based: by replacing the squared distance d(x,y)² by k(x,x) + k(y,y) - 2k(x,y) (see the sketch after this list)

• Freedom in choosing Φ implies a large variety of learning algorithms
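A small sketch (not from the slides) of the feature-space squared distance computed purely from kernel evaluations, for an arbitrary kernel choice:

sigma = 1.0;
k = @(u, v) exp(-norm(u - v)^2 / (2*sigma^2));       % any valid kernel works here

% Squared distance between phi(x) and phi(y), without ever computing phi:
kdist2 = @(x, y) k(x, x) + k(y, y) - 2*k(x, y);

x = [0; 1];  y = [2; -1];
fprintf('feature-space squared distance: %g\n', kdist2(x, y));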

Page 24:

3. Kernel PCA

• The assumption behind PCA is that the data points x are multivariate Gaussian

• Often this assumption does not hold

• However, it may still be possible that a transformation Φ(x) is Gaussian; then we can perform PCA in the space of Φ(x)

• Kernel PCA performs this PCA; however, thanks to the "kernel trick," it never computes the mapping Φ(x) explicitly!

Page 25:

KPCA: Basic Idea

Page 26:

Kernel PCA Formulation

• We need the following fact:

• Let v be an eigenvector of the scatter matrix

$$S = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T$$

• Then v belongs to the linear space spanned by the data points x_i, i = 1, 2, …, N.

• Proof:

$$\lambda \mathbf{v} = S\mathbf{v} = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T \mathbf{v} \;\;\Longrightarrow\;\; \mathbf{v} = \frac{1}{\lambda} \sum_{i=1}^{N} (\mathbf{x}_i^T \mathbf{v})\, \mathbf{x}_i$$

Page 27:

Kernel PCA Formulation…

• Let C be the scatter matrix of the centered mapping Φ(x):

$$C = \sum_{i=1}^{N} \Phi(\mathbf{x}_i)\,\Phi(\mathbf{x}_i)^T$$

• Let w be an eigenvector of C; then w can be written as a linear combination:

$$\mathbf{w} = \sum_{k=1}^{N} \alpha_k\, \Phi(\mathbf{x}_k)$$

• Also, we have:

$$C\mathbf{w} = \lambda \mathbf{w}$$

• Combining, we get:

$$\sum_{i=1}^{N} \Phi(\mathbf{x}_i)\,\Phi(\mathbf{x}_i)^T \left(\sum_{k=1}^{N} \alpha_k\, \Phi(\mathbf{x}_k)\right) = \lambda \sum_{k=1}^{N} \alpha_k\, \Phi(\mathbf{x}_k)$$

Page 28:

Kernel PCA Formulation…

Multiplying both sides by Φ(x_l)^T, for l = 1, 2, …, N:

$$\Phi(\mathbf{x}_l)^T \sum_{i=1}^{N} \Phi(\mathbf{x}_i)\,\Phi(\mathbf{x}_i)^T \sum_{k=1}^{N} \alpha_k \Phi(\mathbf{x}_k) = \lambda\, \Phi(\mathbf{x}_l)^T \sum_{k=1}^{N} \alpha_k \Phi(\mathbf{x}_k)$$

In terms of the kernel (Gram) matrix $K_{ij} = \Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}_j)$ this reads

$$K^2 \boldsymbol{\alpha} = \lambda K \boldsymbol{\alpha}, \qquad \text{which reduces to} \qquad K\boldsymbol{\alpha} = \lambda \boldsymbol{\alpha}$$

(compare with the eigenvalue problem $S\mathbf{v} = \lambda\mathbf{v}$ in ordinary PCA).

Page 29:

Kernel PCA Formulation…

From the eigen-equation

$$K\boldsymbol{\alpha} = \lambda \boldsymbol{\alpha}$$

and the fact that the eigenvector w is normalized to 1, we obtain:

$$\|\mathbf{w}\|^2 = \Big(\sum_{i=1}^{N}\alpha_i \Phi(\mathbf{x}_i)\Big)^T \Big(\sum_{i=1}^{N}\alpha_i \Phi(\mathbf{x}_i)\Big) = \boldsymbol{\alpha}^T K \boldsymbol{\alpha} = \lambda\, \boldsymbol{\alpha}^T \boldsymbol{\alpha} = 1 \;\;\Longrightarrow\;\; \boldsymbol{\alpha}^T\boldsymbol{\alpha} = \frac{1}{\lambda}$$

Page 30:

KPCA Algorithm

Step 1: Compute the Gram matrix: $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j), \; i, j = 1, \dots, N$

Step 2: Compute the (eigenvalue, eigenvector) pairs of K: $(\lambda_l, \boldsymbol{\alpha}^l), \; l = 1, \dots, M$

Step 3: Normalize the eigenvectors: $\boldsymbol{\alpha}^l \leftarrow \boldsymbol{\alpha}^l / \sqrt{\lambda_l}$

Thus, an eigenvector $\mathbf{w}^l$ of C is now represented as:

$$\mathbf{w}^l = \sum_{k=1}^{N} \alpha_k^l\, \Phi(\mathbf{x}_k)$$

To project a test feature Φ(x) onto $\mathbf{w}^l$ we need to compute:

$$\Phi(\mathbf{x})^T \mathbf{w}^l = \sum_{k=1}^{N} \alpha_k^l\, \Phi(\mathbf{x})^T \Phi(\mathbf{x}_k) = \sum_{k=1}^{N} \alpha_k^l\, k(\mathbf{x}, \mathbf{x}_k)$$

So, we never need Φ explicitly. (A MATLAB sketch of these steps follows below.)
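Putting Steps 1-3 together, here is a minimal MATLAB sketch of Kernel PCA (not from the original slides). Following the slide, it assumes the mapped data are already centered in the feature space (in practice the Gram matrix is usually centered first); the RBF kernel, its width sigma, and all variable names are illustrative.

X = randn(50, 2);                               % 50 training samples, 2 input dimensions
sigma = 1.0;
k = @(u, v) exp(-norm(u - v)^2 / (2*sigma^2));  % illustrative kernel choice

N = size(X, 1);
K = zeros(N);                                   % Step 1: Gram matrix
for i = 1:N
    for j = 1:N
        K(i, j) = k(X(i, :), X(j, :));
    end
end

[A, L]       = eig(K);                          % Step 2: eigenvectors/eigenvalues of K
[lam, order] = sort(diag(L), 'descend');
A = A(:, order);

M = 3;                                          % number of kernel principal components kept
for l = 1:M
    A(:, l) = A(:, l) / sqrt(lam(l));           % Step 3: normalize so that alpha'*alpha = 1/lambda
end

% Projection of a test point x onto the l-th component: sum_k alpha_k^l * k(x, x_k)
x = [0.5, -0.2];
kx = zeros(N, 1);
for i = 1:N
    kx(i) = k(x, X(i, :));
end
proj = A(:, 1:M)' * kx;                         % M-dimensional kernel PCA representation of x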

Page 31:

Examples of Kernels

Polynomial kernel (n=2)

RBF kernel (n=2)

Page 32:

4. Others

• 2DPCA
– Prof. Jingyu Yang et al., Nanjing University of Science and Technology, IEEE T-PAMI, 2004(1)
– The feature-extraction performance of 2DPCA is at least as good as that of PCA, but it requires more memory than PCA.

• 2DLDA
– Prof. Baozong Yuan et al., Beijing Jiaotong University, Pattern Recognition Letters, 2005(3)

• Kernel ECA (KECA)
– Robert Jenssen et al., IEEE T-PAMI, 2010(5)
– Maximum entropy preservation: it minimizes the loss of entropy and neatly combines entropy with the kernel-induced data mapping, so that the entropy computation naturally becomes a computation on the kernel matrix, turning the problem into an optimization in the kernel space.

Page 33:

References

[1] J. T. Y. Kwok and I. W. H. Tsang, "The Pre-Image Problem in Kernel Methods," IEEE Transactions on Neural Networks, vol. 15, pp. 1517-1525, 2004.

[2] S. Mika, et al., "Kernel PCA and De-Noising in Feature Spaces," in Proceedings of the 1998 conference on Advances in Neural Information Processing Systems II, 1999.

[3] B. Schölkopf, et al., "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, vol. 10, pp. 1299-1319, 1998.

[4] R. Jenssen, "Information Theoretic Learning and Kernel Methods," in Information Theory and Statistical Learning, Springer US, 2009, pp. 209-230.

[5] R. Jenssen, et al., "Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm," in Proceedings of the 2006 conference on Advances in Neural Information Processing Systems 19, 2007, pp. 633-640.

[6] R. Jenssen and O. Storås, "Kernel Entropy Component Analysis Pre-images for Pattern Denoising," in Proceedings of the 16th Scandinavian Conference on Image Analysis, Oslo, Norway, 2009, pp. 626-635.

[7] R. Jenssen, "Kernel Entropy Component Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, pp. 847-860, 2010.