
Principal Component Analysis

Jieping Ye

Department of Computer Science and Engineering

Arizona State University

http://www.public.asu.edu/~jye02

Outline of lecture

• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis (PCA)
• Nonlinear PCA using Kernels

What is feature reduction?

• Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
– The criterion for feature reduction can differ depending on the problem setting:
• Unsupervised setting: minimize the information loss.
• Supervised setting: maximize the class discrimination.

• Given a set of n data points of p variables:

$$x_1, x_2, \ldots, x_n \in \mathbb{R}^p$$

• Compute the linear transformation (projection):

$$G \in \mathbb{R}^{p \times d}: \quad x \in \mathbb{R}^p \;\mapsto\; y = G^T x \in \mathbb{R}^d \qquad (d \ll p)$$

What is feature reduction?

Linear transformation: the original data $X \in \mathbb{R}^{p}$ is mapped to the reduced data $Y = G^T X \in \mathbb{R}^{d}$ by $G^T \in \mathbb{R}^{d \times p}$.
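As a minimal illustration of this mapping (a sketch with an arbitrary random matrix G, not yet the PCA choice of G; the names here are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 5, 2

G = rng.normal(size=(p, d))   # some projection matrix G in R^{p x d} (not PCA yet)
x = rng.normal(size=p)        # a single data point x in R^p

y = G.T @ x                   # reduced representation y = G^T x in R^d
print(y.shape)                # (2,)
```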

High-dimensional data

Examples: gene expression data, face images, handwritten digits.

Outline of lecture

• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels

Why feature reduction?

• Most machine learning and data mining techniques may not be effective for high-dimensional data.
– Curse of dimensionality.
– Query accuracy and efficiency degrade rapidly as the dimension increases.

• The intrinsic dimension may be small.
– For example, the number of genes responsible for a certain type of disease may be small.

Why feature reduction?

• Visualization: projection of high-dimensional data onto 2D or 3D.

• Data compression: efficient storage and retrieval.

• Noise removal: positive effect on query accuracy.

Application of feature reduction

• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification

Outline of lecture

• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels

Feature reduction algorithms

• Unsupervised
– Latent Semantic Indexing (LSI): truncated SVD
– Independent Component Analysis (ICA)
– Principal Component Analysis (PCA)
– Canonical Correlation Analysis (CCA)

• Supervised
– Linear Discriminant Analysis (LDA)

• Semi-supervised
– Research topic

Outline of lecture

• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels

What is Principal Component Analysis?

• Principal component analysis (PCA)
– Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables.
– Retains most of the sample's information.
– Useful for the compression and classification of data.

• By information we mean the variation present in the sample, given by the correlations between the original variables.
– The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.

Geometric picture of principal components (PCs)

• The 1st PC $z_1$ is a minimum-distance fit to a line in the X space.

• The 2nd PC $z_2$ is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.

PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.

Algebraic definition of PCs

Given a sample of n observations on a vector of p variables

$$x_1, x_2, \ldots, x_n \in \mathbb{R}^p,$$

define the first principal component of the sample by the linear transformation

$$z_1 = a_1^T x_j = \sum_{i=1}^{p} a_{i1} x_{ij}, \qquad j = 1, 2, \ldots, n,$$

where the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})^T$ and $x_j = (x_{1j}, x_{2j}, \ldots, x_{pj})^T$, and $a_1$ is chosen such that $\mathrm{var}[z_1]$ is maximum.

Algebraic derivation of PCs

To find $a_1$, first note that

$$\mathrm{var}[z_1] = E\big[(z_1 - \bar z_1)^2\big] = \frac{1}{n}\sum_{i=1}^{n}\big(a_1^T x_i - a_1^T \bar x\big)^2 = \frac{1}{n}\sum_{i=1}^{n} a_1^T (x_i - \bar x)(x_i - \bar x)^T a_1 = a_1^T S a_1,$$

where

$$S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar x)(x_i - \bar x)^T$$

is the covariance matrix and $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean.

In the following, we assume the data is centered: $\bar x = 0$.
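A quick numerical check of the identity $\mathrm{var}[z_1] = a_1^T S a_1$ on centered data (a sketch; the array shapes and variable names are choices made for this example):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 200
X = rng.normal(size=(p, n))              # columns are the observations x_1, ..., x_n
X = X - X.mean(axis=1, keepdims=True)    # center so that the mean is 0

S = X @ X.T / n                          # covariance matrix S
a1 = rng.normal(size=p)
a1 /= np.linalg.norm(a1)                 # unit-norm coefficient vector a_1

z1 = a1 @ X                              # z_1 = a_1^T x_j for j = 1, ..., n
print(np.allclose(z1.var(), a1 @ S @ a1))   # True: var[z_1] = a_1^T S a_1
```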

Algebraic derivation of PCs

Assume $\bar x = 0$. Form the matrix

$$X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n},$$

then

$$S = \frac{1}{n} X X^T.$$

Obtain the eigenvectors of S by computing the SVD of X:

$$X = U \Sigma V^T.$$
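A small NumPy sketch of this relationship (assuming centered data): the left singular vectors of X are the eigenvectors of S, with eigenvalues $\sigma_k^2 / n$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 300
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)     # centered data, columns are observations

S = X @ X.T / n                           # covariance matrix S = (1/n) X X^T
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

eigvals = sigma**2 / n                    # eigenvalues of S from the singular values
print(np.allclose(S @ U, U * eigvals))    # True: S u_k = (sigma_k^2 / n) u_k
```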

Algebraic derivation of PCs

To find the $a_1$ that maximizes $\mathrm{var}[z_1]$ subject to $a_1^T a_1 = 1$, let $\lambda$ be a Lagrange multiplier and maximize

$$L = a_1^T S a_1 - \lambda\,(a_1^T a_1 - 1).$$

Setting the derivative with respect to $a_1$ to zero gives

$$S a_1 - \lambda a_1 = 0 \quad\Longleftrightarrow\quad (S - \lambda I_p)\, a_1 = 0,$$

therefore $a_1$ is an eigenvector of S corresponding to the largest eigenvalue $\lambda_1$, since $\mathrm{var}[z_1] = a_1^T S a_1 = \lambda$ is maximized by taking $\lambda$ as large as possible.
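A quick check, as a sketch, that the top eigenvector of S does maximize $a^T S a$ over unit vectors (the random-sampling comparison is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 4, 500
X = rng.normal(size=(p, n)) * np.array([3.0, 2.0, 1.0, 0.5])[:, None]
X = X - X.mean(axis=1, keepdims=True)
S = X @ X.T / n

eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
a1 = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue

print(np.isclose(a1 @ S @ a1, eigvals[-1]))       # var[z_1] equals lambda_1
candidates = rng.normal(size=(1000, p))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
variances = np.einsum('ij,jk,ik->i', candidates, S, candidates)
print(np.all(variances <= eigvals[-1] + 1e-12))   # no unit vector does better
```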

Algebraic derivation of PCs

To find the next coefficient vector $a_2$, maximize $\mathrm{var}[z_2]$ subject to $\mathrm{cov}[z_2, z_1] = 0$ (the PCs are uncorrelated) and to $a_2^T a_2 = 1$. First note that

$$\mathrm{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1 a_1^T a_2,$$

then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize

$$L = a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1.$$

Algebraic derivation of PCs

$$L = a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1$$

$$\frac{\partial L}{\partial a_2} = 2 S a_2 - 2\lambda a_2 - \phi a_1 = 0.$$

Left-multiplying by $a_1^T$ and using $a_1^T S a_2 = \lambda_1 a_1^T a_2 = 0$ gives $\phi = 0$, so

$$S a_2 = \lambda a_2 \quad\text{and}\quad \lambda = a_2^T S a_2.$$

We find that $a_2$ is also an eigenvector of S, whose eigenvalue $\lambda = \lambda_2$ is the second largest.

Algebraic derivation of PCs

In general:

• The kth largest eigenvalue $\lambda_k$ of S is the variance of the kth PC: $\mathrm{var}[z_k] = a_k^T S a_k = \lambda_k$.
• The kth PC $z_k$ retains the kth greatest fraction of the variation in the sample.

Algebraic derivation of PCs

• Main steps for computing PCs:
– Form the covariance matrix S.
– Compute its eigenvectors: $\{a_i\}_{i=1}^{p}$.
– Use the first d eigenvectors $\{a_i\}_{i=1}^{d}$ to form the d PCs.
– The transformation G is given by $G = [a_1, a_2, \ldots, a_d]$.

A test point $x \in \mathbb{R}^p \;\mapsto\; G^T x \in \mathbb{R}^d$.
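A minimal NumPy sketch of these steps (the helper name pca_transform is made up for this example):

```python
import numpy as np

def pca_transform(X, d):
    """Compute the first d PCs of X (p x n, columns are observations).

    Returns G = [a_1, ..., a_d] (p x d) and the reduced data Y = G^T X (d x n).
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # center the data
    S = Xc @ Xc.T / X.shape[1]               # form the covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending
    G = eigvecs[:, order[:d]]                # first d eigenvectors
    return G, G.T @ Xc

# Example: reduce 10-dimensional points to d = 2.
rng = np.random.default_rng(3)
X = rng.normal(size=(10, 100))
G, Y = pca_transform(X, d=2)
print(G.shape, Y.shape)                      # (10, 2) (2, 100)
```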

Optimality property of PCA

Dimension reduction: $Y = G^T X \in \mathbb{R}^{d \times n}$, where $X \in \mathbb{R}^{p \times n}$ is the original data and $G \in \mathbb{R}^{p \times d}$.

Reconstruction: $\tilde{X} = G\,Y = G\,G^T X \in \mathbb{R}^{p \times n}$.

Optimality property of PCA

Main theoretical result: the matrix G consisting of the first d eigenvectors of the covariance matrix S solves the following minimization problem:

$$\min_{G \in \mathbb{R}^{p \times d}} \; \|X - G\,(G^T X)\|_F^2 \quad \text{subject to } G^T G = I_d,$$

where $\|X - G\,(G^T X)\|_F^2$ is the reconstruction error.

PCA projection minimizes the reconstruction error among all linear projections of size d.
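A sketch verifying the claim numerically: the PCA projection has no larger reconstruction error than random orthonormal projections of the same size d (the random comparison set is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, d = 8, 200, 3
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)

S = X @ X.T / n
_, eigvecs = np.linalg.eigh(S)
G_pca = eigvecs[:, ::-1][:, :d]              # first d eigenvectors of S

def recon_error(G):
    """Reconstruction error ||X - G G^T X||_F^2."""
    return np.linalg.norm(X - G @ (G.T @ X)) ** 2

errors = []
for _ in range(200):
    Q, _ = np.linalg.qr(rng.normal(size=(p, d)))   # random G with G^T G = I_d
    errors.append(recon_error(Q))
print(recon_error(G_pca) <= min(errors))           # True
```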

Applications of PCA

• Eigenfaces for recognition. Turk and Pentland. 1991.

• Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001.

• Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.

PCA for image compression

[Figure: an image reconstructed with d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, shown alongside the original image.]
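A sketch of how such a comparison could be produced, treating the rows of a grayscale image as observations and reconstructing them from d PCs (the synthetic image and the helper name compress_image are made up for the example):

```python
import numpy as np

def compress_image(img, d):
    """Reconstruct a grayscale image (2-D array, rows as observations) from d PCs."""
    mean = img.mean(axis=0, keepdims=True)
    Xc = img - mean                              # center the rows
    S = Xc.T @ Xc / Xc.shape[0]                  # covariance of the pixel columns
    _, eigvecs = np.linalg.eigh(S)
    G = eigvecs[:, ::-1][:, :d]                  # top-d eigenvectors
    return (Xc @ G) @ G.T + mean                 # reconstruction with d PCs

# Synthetic low-rank "image" plus noise; the error shrinks as d grows.
rng = np.random.default_rng(5)
img = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64)) + 0.1 * rng.normal(size=(64, 64))
for d in (1, 2, 4, 8, 16, 32, 64):
    print(d, round(float(np.linalg.norm(img - compress_image(img, d))), 3))
```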

Outline of lecture

• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels

Motivation

Linear projections will not detect the pattern.

Nonlinear PCA using Kernels

• Traditional PCA applies a linear transformation.
– May not be effective for nonlinear data.

• Solution: apply a nonlinear transformation to a potentially very high-dimensional feature space:

$$\Phi: x \mapsto \Phi(x)$$

• Computational efficiency: apply the kernel trick.
– Requires that PCA can be rewritten in terms of dot products:

$$K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$$

(More on kernels later.)

Nonlinear PCA using Kernels

Rewrite PCA in terms of dot products.

Assume the data has been centered, i.e., $\sum_i x_i = 0$. The covariance matrix S can be written as

$$S = \frac{1}{n} \sum_i x_i x_i^T.$$

Let v be an eigenvector of S corresponding to a nonzero eigenvalue $\lambda$:

$$S v = \frac{1}{n} \sum_i x_i (x_i^T v) = \lambda v \quad\Longrightarrow\quad v = \frac{1}{n\lambda} \sum_i (x_i^T v)\, x_i = \sum_i \alpha_i x_i.$$

Eigenvectors of S lie in the space spanned by all the data points.

Nonlinear PCA using Kernels

The covariance matrix can be written in matrix form:

$$S = \frac{1}{n} X X^T, \qquad \text{where } X = [x_1, x_2, \ldots, x_n].$$

Writing $v = \sum_i \alpha_i x_i = X \alpha$, the eigenvalue problem $S v = \lambda v$ becomes

$$\frac{1}{n} X X^T X \alpha = \lambda X \alpha \quad\Longrightarrow\quad \frac{1}{n} (X^T X)(X^T X)\, \alpha = \lambda (X^T X)\, \alpha \quad\Longrightarrow\quad \frac{1}{n} X^T X \alpha = \lambda \alpha.$$

Any benefits? Only the dot-product (Gram) matrix $X^T X$ appears.

Nonlinear PCA using Kernels

Next consider the feature space $\Phi: x \mapsto \Phi(x)$ and let $X^{\Phi} = [\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_n)]$, so that

$$S^{\Phi} = \frac{1}{n} X^{\Phi} (X^{\Phi})^T, \qquad v = \sum_i \alpha_i \Phi(x_i) = X^{\Phi} \alpha.$$

The (i, j)-th entry of $(X^{\Phi})^T X^{\Phi}$ is $\Phi(x_i)^T \Phi(x_j)$. Apply the kernel trick:

$$K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) \quad\Longrightarrow\quad \frac{1}{n} K \alpha = \lambda \alpha.$$

K is called the kernel matrix.

Nonlinear PCA using Kernels

• Projection of a test point x onto v:

iii

iii

iii

xxKxx

xxvx

),()()(

)()()(

Explicit mapping is not required here.
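A minimal kernel-PCA sketch putting these pieces together, assuming an RBF kernel and the standard feature-space centering of K (the function name, kernel choice, and toy data are assumptions for the example):

```python
import numpy as np

def kernel_pca(X, d, gamma=1.0):
    """Project the rows of X (n x p) onto the first d kernel PCs (RBF kernel)."""
    # Kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)

    # Center the kernel matrix in feature space.
    n = K.shape[0]
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one

    # Solve (1/n) K alpha = lambda alpha; keep the top-d eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(Kc / n)
    order = np.argsort(eigvals)[::-1][:d]
    alphas, lambdas = eigvecs[:, order], eigvals[order]

    # Normalize so each feature-space eigenvector v = sum_i alpha_i Phi(x_i) has unit norm.
    alphas = alphas / np.sqrt(n * lambdas)

    # Projection of the training points: Phi(x_j)^T v = sum_i alpha_i K(x_j, x_i).
    return Kc @ alphas

# Example: two concentric circles, a pattern linear projections will not detect.
rng = np.random.default_rng(6)
t = rng.uniform(0, 2 * np.pi, 200)
r = np.repeat([1.0, 3.0], 100)
X = np.c_[r * np.cos(t), r * np.sin(t)] + 0.05 * rng.normal(size=(200, 2))
Z = kernel_pca(X, d=2, gamma=0.5)
print(Z.shape)   # (200, 2)
```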

References

• Principal Component Analysis. I.T. Jolliffe.

• Kernel Principal Component Analysis. Schölkopf et al.

• Geometric Methods for Feature Extraction and Dimensional Reduction. Burges.
