Page 1: Feature Selection/Extraction

Feature Selection/Extraction

Dimensionality Reduction

Page 2: Feature Selection/Extraction

Feature Selection/Extraction

• Solutions to a number of problems in pattern recognition can be achieved by choosing a better feature space.
• Problems and solutions:
  – Curse of dimensionality: the number of examples needed to train a classifier grows exponentially with the number of dimensions; this affects overfitting and generalization performance.
  – What features best characterize a class? What words best characterize a document class? Which subregions characterize protein function?
  – What features are critical for performance?
  – Inefficiency: fewer features mean reduced complexity and run-time.
  – Can't visualize: reducing to a few dimensions allows 'intuiting' the nature of the problem solution.

Page 3: Feature Selection/Extraction

Curse of Dimensionality: the same number of examples fills more of the available space when the dimensionality is low.

Page 4: Feature Selection/Extraction

Selection vs. Extraction

• Two general approaches for dimensionality reduction:
  – Feature extraction: transforming the existing features into a lower-dimensional space.
  – Feature selection: selecting a subset of the existing features without a transformation.
• Feature extraction:
  – PCA
  – LDA (Fisher's)
  – Nonlinear PCA (kernel and other varieties)
  – The first layer of many networks
• Feature selection (Feature Subset Selection, FSS): although FS is a special case of feature extraction, in practice it is quite different.
  – FSS searches for a subset that minimizes some cost function (e.g. test error).
  – FSS has a unique set of methodologies.

Page 5: Feature Selection/Extraction

Feature Subset Selection Definition

Given a feature set x = {x_i | i = 1, …, N}, find a subset x_M = {x_{i1}, x_{i2}, …, x_{iM}}, with M < N, that optimizes an objective function J(x_M), e.g. the probability of correct classification.

Why Feature Selection?

• Why not use the more general feature extraction methods? Feature selection is necessary in a number of situations:
  – Features may be expensive to obtain.
  – You may want to extract meaningful rules from your classifier.
  – When you transform or project, measurement units (length, weight, etc.) are lost.
  – Features may not be numeric (e.g. strings).

Page 6: Feature Selection/Extraction

Implementing Feature Selection

Page 7: Feature Selection/Extraction

Objective Function

The objective function evaluates candidate subsets and returns a measure of their "goodness".

This feedback is used by the search strategy to select new candidates.

Simple objective function: cross-validation error rate.
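
As a concrete illustration of how a search strategy and this objective fit together, here is a minimal sketch of sequential forward selection driven by leave-one-out cross-validation error. The nearest-centroid classifier, the toy data, and all variable names are assumptions made for the sketch; the slides do not prescribe a particular classifier or search strategy.

% Sequential forward selection with a leave-one-out CV error objective (sketch).
% Assumed setup: a toy two-class problem with 6 features, 2 of them informative.
X = randn(100, 6);
X(1:50, 1:2) = X(1:50, 1:2) + 2;               % class 1 is shifted in features 1 and 2
labels = [ones(50,1); 2*ones(50,1)];
n = size(X, 1);
selected = [];                                 % current feature subset
remaining = 1:size(X, 2);
for step = 1:3                                 % greedily add 3 features
    best_err = inf;  best_f = remaining(1);
    for f = remaining
        trial = [selected f];                  % candidate subset evaluated by J
        err = 0;
        for k = 1:n                            % leave-one-out error of a nearest-centroid rule
            mask = true(n, 1);  mask(k) = false;
            c1 = mean(X(mask & labels == 1, trial), 1);
            c2 = mean(X(mask & labels == 2, trial), 1);
            pred = 1 + (norm(X(k, trial) - c2) < norm(X(k, trial) - c1));
            err = err + (pred ~= labels(k));
        end
        if err < best_err, best_err = err; best_f = f; end
    end
    selected = [selected best_f];              % keep the feature that most reduced the error
    remaining = setdiff(remaining, best_f);
end
disp(selected)                                 % indices of the chosen subset

This is the wrapper pattern in its simplest form: the search strategy proposes subsets, and the cross-validation error plays the role of J.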

Page 8: Feature Selection/Extraction


Page 9: Feature Selection/Extraction
Page 10: Feature Selection/Extraction

Feature Extraction

Page 11: Feature Selection/Extraction

In general, the optimal mapping y = f(x) will be a non-linear function.

• However, there is no systematic way to generate non-linear transforms.
• The selection of a particular subset of transforms is problem dependent.
• For this reason, feature extraction is commonly limited to linear transforms: y = Wx.
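
As a small, purely illustrative example of a linear feature-extraction transform (the matrix W below is arbitrary, not from the slides):

W = [1 0 1 0; 0 1 0 -1];     % an arbitrary 2x4 projection matrix
x = [0.5; 2.0; -1.0; 3.0];   % a single 4-D feature vector
y = W * x;                   % extracted 2-D feature vector y = Wx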

Page 12: Feature Selection/Extraction
Page 13: Feature Selection/Extraction

PCA Derivation: Minimizing Reconstruction Error

Any point in R^n can be perfectly reconstructed in a new orthonormal basis of size n:

x = U y, \quad U^T U = I, \qquad x = [\, u_1 \; u_2 \; \cdots \; u_n \,]\, y = \sum_{i=1}^{n} y_i u_i

Goal: find an orthonormal basis of m vectors, m < n, that minimizes the reconstruction error.

Split the expansion into the first m coordinates and the remaining n - m:

x = [\, u_1 \; \cdots \; u_m \,] [\, y_1, \ldots, y_m \,]^T + [\, u_{m+1} \; \cdots \; u_n \,] [\, y_{m+1}, \ldots, y_n \,]^T

Define a reconstruction \hat{x}(m) based on the 'best' m vectors, replacing the discarded coordinates by constants b_i:

\hat{x} = U_m y_m + U_d b = \hat{x}(m) + \hat{x}_{discard}, \qquad \hat{x}(m) = [\, u_1 \; \cdots \; u_m \,] [\, y_1, \ldots, y_m \,]^T

Err_{recon}^2 = \sum_{k=1}^{N_{samples}} (x_k - \hat{x}_k)^T (x_k - \hat{x}_k)

Page 14: Feature Selection/Extraction

Visualizing Reconstruction Error

[Figure: a 2-D data scatter drawn as vectors x, with a candidate direction u, the projection x^T u along u, the projected point x_p, and the perpendicular residual; u_good marks a direction aligned with the scatter.]

The solution involves finding directions u which minimize the perpendicular distances, and removing them.

Page 15: Feature Selection/Extraction

\Delta x(m) = x - \hat{x}(m) = \sum_{i=1}^{n} y_i u_i - \left( \sum_{i=1}^{m} y_i u_i + \sum_{i=m+1}^{n} b_i u_i \right) = \sum_{i=m+1}^{n} (y_i - b_i)\, u_i

Rewriting the error:

Err_{recon}^2 = E\!\left[ \|\Delta x(m)\|^2 \right]
  = E\!\left[ \sum_{j=m+1}^{n} \sum_{i=m+1}^{n} (y_j - b_j)\, u_j^T\, (y_i - b_i)\, u_i \right]
  = E\!\left[ \sum_{j=m+1}^{n} \sum_{i=m+1}^{n} (y_i - b_i)(y_j - b_j)\, u_i^T u_j \right]
  = E\!\left[ \sum_{i=m+1}^{n} (y_i - b_i)^2 \right]
  = \sum_{i=m+1}^{n} E\!\left[ (y_i - b_i)^2 \right]

(the cross terms vanish because the basis is orthonormal: u_i^T u_j = \delta_{ij}).

Goal: find the basis vectors u_i and constants b_i that minimize the reconstruction error.

Solving for b:

\frac{\partial Err}{\partial b_i} = \frac{\partial}{\partial b_i} E\!\left[ (y_i - b_i)^2 \right] = -2\,(E[y_i] - b_i) = 0 \quad \Rightarrow \quad b_i = E[y_i]

Therefore, replace the discarded dimensions y_i by their expected values.

Page 16: Feature Selection/Extraction

Now rewrite the error, replacing the b_i:

\sum_{i=m+1}^{n} E\!\left[ (y_i - E[y_i])^2 \right]
  = \sum_{i=m+1}^{n} E\!\left[ (x^T u_i - E[x^T u_i])^2 \right]
  = \sum_{i=m+1}^{n} E\!\left[ (x^T u_i - E[x^T u_i])^T (x^T u_i - E[x^T u_i]) \right]
  = \sum_{i=m+1}^{n} E\!\left[ u_i^T (x - E[x])(x - E[x])^T u_i \right]
  = \sum_{i=m+1}^{n} u_i^T\, E\!\left[ (x - E[x])(x - E[x])^T \right] u_i
  = \sum_{i=m+1}^{n} u_i^T C\, u_i

where C is the covariance matrix of x.

Page 17: Feature Selection/Extraction

Thus, finding the best basis u_i involves minimizing the quadratic form

Err = \sum_{i=m+1}^{n} u_i^T C\, u_i

subject to the constraints \|u_i\| = 1.

Using Lagrange multipliers, we form the constrained error function:

Err = \sum_{i=m+1}^{n} \left[ u_i^T C\, u_i + \lambda_i \left( 1 - u_i^T u_i \right) \right]

\frac{\partial Err}{\partial u_i}
  = \frac{\partial}{\partial u_i} \left[ u_i^T C\, u_i + \lambda_i \left( 1 - u_i^T u_i \right) \right]
  = 2 C u_i - 2 \lambda_i u_i = 0

which results in the following eigenvector problem:

C u_i = \lambda_i u_i

Page 18: Feature Selection/Extraction

Plugging back into the error:

Err = \sum_{i=m+1}^{n} \left[ u_i^T C\, u_i + \lambda_i \left( 1 - u_i^T u_i \right) \right]
  = \sum_{i=m+1}^{n} u_i^T (\lambda_i u_i) + 0
  = \sum_{i=m+1}^{n} \lambda_i

Thus the solution is to discard the n - m smallest-eigenvalue eigenvectors.

PCA summary:
1) Compute the data covariance matrix.
2) Perform an eigenanalysis of the covariance matrix.
3) Throw out the smallest-eigenvalue eigenvectors.

Problem: how many to keep? There are many criteria, e.g. the fraction of the total data variance that is discarded: keep the smallest m such that

\frac{\sum_{i=m+1}^{n} \lambda_i}{\sum_{i=1}^{n} \lambda_i} < \theta
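
A minimal MATLAB sketch of the three-step summary above, assuming toy data and a 5% discarded-variance threshold; the data, the threshold, and the variable names are illustrative choices, not part of the slides.

% PCA via eigenanalysis of the data covariance (sketch).
X = randn(200, 5) * diag([3 2 1 0.5 0.1]);          % toy data: samples in rows
mu = mean(X, 1);
Xc = X - repmat(mu, size(X, 1), 1);                 % center the data (b_i = E[y_i])
C = cov(Xc);                                        % 1) data covariance
[U, D] = eig(C);                                    % 2) eigenanalysis of the covariance
[lambda, order] = sort(diag(D), 'descend');
U = U(:, order);                                    % eigenvectors sorted by eigenvalue
retained = cumsum(lambda) / sum(lambda);            % variance fraction kept by the first m
m = find(retained >= 0.95, 1);                      % 3) keep m so less than 5% of the variance is discarded
Y = Xc * U(:, 1:m);                                 % projections y_i = u_i^T (x - E[x])
Xhat = Y * U(:, 1:m)' + repmat(mu, size(X, 1), 1);  % reconstruction \hat{x}(m)
recon_err = sum(sum((X - Xhat).^2));                % total squared reconstruction error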

Page 19: Feature Selection/Extraction
Page 20: Feature Selection/Extraction
Page 21: Feature Selection/Extraction

http://www-white.media.mit.edu/vismod/demos/facerec/basic.html

PCA on aligned face images

Page 22: Feature Selection/Extraction

Extensions: ICA

• Find the 'best' linear basis, minimizing the statistical dependence between the projected components.
• Problem: find the c hidden independent sources x_i.
• Observation model: the sensed signals are (unknown) linear mixtures of the sources, y(t) = A x(t).

Page 23: Feature Selection/Extraction

ICA problem statement: recover the source signals from the sensed signals. More specifically, we seek a real matrix W such that z(t) = W y(t) is an estimate of x(t).

Page 24: Feature Selection/Extraction

Solve via:

Page 25: Feature Selection/Extraction

Depending on the density assumptions, ICA can have easy or hard solutions.

• Gradient approach
• Kurtotic ICA: two lines of MATLAB code
  – http://www.cs.toronto.edu/~roweis/kica.html
  – yy are the mixed measurements (one per column); W is the unmixing matrix.

% W = kica(yy);
xx = sqrtm(inv(cov(yy')))*(yy-repmat(mean(yy,2),1,size(yy,2)));  % whiten (sphere) the mixed data
[W,ss,vv] = svd((repmat(sum(xx.*xx,1),size(xx,1),1).*xx)*xx');   % eigenvectors of the 4th-moment (kurtosis) matrix
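
One way to exercise the two lines above end to end. The sources, the mixing matrix, and the final projection line are assumptions added for the sketch; only the whitening and SVD lines come from the slide.

% Mix two independent, non-Gaussian sources and unmix them with kurtotic ICA.
t = linspace(0, 10, 1000);
s = [sign(sin(7*t)); rand(1, 1000) - 0.5];      % two independent sources, one per row
A = [1 2; 3 1];                                 % "unknown" mixing matrix
yy = A * s;                                     % sensed signals, one sample per column
xx = sqrtm(inv(cov(yy')))*(yy - repmat(mean(yy,2), 1, size(yy,2)));   % whiten
[W, ss, vv] = svd((repmat(sum(xx.*xx,1), size(xx,1), 1).*xx)*xx');    % estimate directions
zz = W' * xx;   % project whitened data onto the estimated directions: sources up to permutation/sign/scale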

Page 26: Feature Selection/Extraction

Kernel PCA

• PCA after non-linear transformation

Page 27: Feature Selection/Extraction

Using Non-linear Components

• Principal Components Analysis (PCA) attempts to efficiently represent the data by finding orthonormal axes which maximally decorrelate the data.
• It makes the following assumptions:
  – Sources are Gaussian.
  – Sources are independent and stationary (iid).

Page 28: Feature Selection/Extraction

Extending PCA

Page 29: Feature Selection/Extraction
Page 30: Feature Selection/Extraction
Page 31: Feature Selection/Extraction
Page 32: Feature Selection/Extraction

Kernel PCA algorithm

Kernel matrix:  K_{ij} = k(x_i, x_j)

Eigenanalysis:  (m \lambda)\, \alpha = K \alpha, \quad K = A \Lambda A^{-1}

Enforce the normalization:  \lambda_n \|\alpha^n\|^2 = 1

Compute projections:  y_n = \sum_{i=1}^{m} \alpha_i^n\, k(x_i, x)
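
A minimal MATLAB sketch of these steps, assuming an RBF kernel and toy data; the kernel choice, its width, the feature-space centering step, and all variable names are assumptions the slide leaves unspecified.

% Kernel PCA (sketch): kernel matrix, eigenanalysis, normalization, projections.
X = [randn(50, 2); randn(50, 2) + 3];           % m = 100 points in 2-D
m = size(X, 1);
sigma = 1.0;                                    % assumed RBF kernel width
sq = sum(X.^2, 2);
D2 = repmat(sq, 1, m) + repmat(sq', m, 1) - 2*(X*X');   % pairwise squared distances
K = exp(-D2 / (2*sigma^2));                     % K_ij = k(x_i, x_j)
H = eye(m) - ones(m)/m;
Kc = H * K * H;  Kc = (Kc + Kc')/2;             % center the kernel matrix in feature space
[A, L] = eig(Kc);                               % eigenanalysis: K alpha = (m lambda) alpha
[ev, order] = sort(diag(L), 'descend');
A = A(:, order);
lambda = ev / m;                                % lambda as defined in the eigenproblem above
ncomp = 2;
for j = 1:ncomp
    A(:, j) = A(:, j) / sqrt(lambda(j));        % enforce lambda_n * ||alpha^n||^2 = 1
end
Y = Kc * A(:, 1:ncomp);                         % y_n = sum_i alpha_i^n k(x_i, x), for the training points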

Page 33: Feature Selection/Extraction
Page 34: Feature Selection/Extraction
Page 35: Feature Selection/Extraction
Page 36: Feature Selection/Extraction

Probabilistic Clustering

EM, mixtures of Gaussians, RBFs, etc.

Page 37: Feature Selection/Extraction
Page 38: Feature Selection/Extraction
Page 39: Feature Selection/Extraction

But only if we are given the distributions and prior

Page 40: Feature Selection/Extraction
Page 41: Feature Selection/Extraction
Page 42: Feature Selection/Extraction
Page 43: Feature Selection/Extraction
Page 44: Feature Selection/Extraction
Page 45: Feature Selection/Extraction
Page 46: Feature Selection/Extraction
Page 47: Feature Selection/Extraction
Page 48: Feature Selection/Extraction

http://www.ncrg.aston.ac.uk/netlab/

* PCA
* Mixtures of probabilistic PCA
* Gaussian mixture model with EM training
* Linear and logistic regression with IRLS
* Multi-layer perceptron with linear, logistic and softmax outputs and error functions
* Radial basis function (RBF) networks with both Gaussian and non-local basis functions
* Optimisers, including quasi-Newton methods, conjugate gradients and scaled conjugate gradients
* Multi-layer perceptron with Gaussian mixture outputs (mixture density networks)
* Gaussian prior distributions over parameters for the MLP, RBF and GLM, including multiple hyper-parameters
* Laplace approximation framework for Bayesian inference (evidence procedure)
* Automatic Relevance Determination for input selection
* Markov chain Monte-Carlo including simple Metropolis and hybrid Monte-Carlo
* K-nearest neighbour classifier
* K-means clustering
* Generative Topographic Map
* Neuroscale topographic projection
* Gaussian Processes
* Hinton diagrams for network weights
* Self-organising map

Page 49: Feature Selection/Extraction

Spectral Clustering: data sampled from a mixture of 3 Gaussians.