
3 - Feature Extraction

Apr 10, 2018


FEATURE EXTRACTION AND SELECTION METHODS

Feature extraction and selection methods

The task of feature extraction and selection methods is to obtain the most relevant information from the original data and represent that information in a lower-dimensionality space.


Selection methods

When the cost of acquiring and manipulating all the measurements is high, we must make a selection of features.

The goal is to select, among all the available features, those that will perform best.

Example: which features should be used for classifying a student as a good or a bad one?
Available features: marks, height, sex, weight, IQ. Feature selection would choose marks and IQ and would discard height, weight and sex.

We have to choose P variables out of a set of M variables so that the separability is maximal.
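To make the selection idea concrete, here is a minimal Matlab sketch of exhaustive feature selection on synthetic data; the simple Fisher-like separability score and all the variable names below are assumptions for illustration, not a prescribed criterion:

%synthetic data: M=5 features, N=100 examples, two classes
X=randn(5,100);
y=[ones(1,50), 2*ones(1,50)];
P=2; %number of features to keep
subsets=nchoosek(1:size(X,1),P);
best=-inf; bestSubset=[];
for k=1:size(subsets,1)
    f=subsets(k,:);
    m1=mean(X(f,y==1),2);  m2=mean(X(f,y==2),2);
    v1=var(X(f,y==1),0,2); v2=var(X(f,y==2),0,2);
    score=sum((m1-m2).^2./(v1+v2+eps)); %class separability of this subset
    if score>best, best=score; bestSubset=f; end
end
bestSubset %indices of the P selected features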

Extraction methods

The goal is to build, using the available features, new features that will perform better.

Example: which features should be used for classifying a student as a good or a bad one?
Available features: marks, height, sex, weight, IQ. Feature extraction may choose marks + IQ² as the best feature (in fact, it is a combination of two features).

The goal is to transform the original space X into a new space Y to obtain new features that work better. This way, we can compress the information.


Principal Component Analysis

PCA = Karhunen-Loève transform = Hotelling transform.

PCA is the most popular feature extraction method.

PCA is a linear transformation.

PCA is used in face recognition systems based on appearance.

Principal Component Analysis

PCA has been successfully applied to human face recognition.

PCA consists of a transformation from a high-dimensional space to another one of reduced dimension.

If the data are highly correlated, there is redundant information. PCA decreases the amount of redundant information by decorrelating the input vectors.

The input vectors, which are high-dimensional and correlated, can be represented in a lower-dimensional space, decorrelated.

PCA is a powerful tool to compress data.


    PCA by Maximizing Variance (I)

We will derive PCA by maximizing the variance in the direction of the principal vectors.

Let us suppose that we have N M-dimensional vectors x_j arranged as the columns of the data matrix X (M = dimension of each vector, N = number of examples).

Let u be a direction (a vector of length 1). The projection of the j-th vector x_j onto the vector u can be computed as:

p_j = u^T x_j = \sum_{i=1}^{M} u_i x_{ij}
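In Matlab, the projections of all N columns of X onto u can be computed at once (a minimal sketch; the variable names u and X are assumed to hold the direction and the data matrix):

p=u'*X; %1xN row vector: p(j) is the projection of x_j onto u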


    PCA by Maximizing Variance (II)

We want to find a direction u that maximizes the variance of the projections of all the input vectors x_j, j = 1, ..., N.

    The function to maximize is:

J_{PCA}(u) = \frac{1}{N} \sum_{j=1}^{N} p_j^2 = \frac{1}{N} \sum_{j=1}^{N} (u^T x_j)^2 = u^T C u

where C is the covariance matrix of the data matrix X:

C = \frac{1}{N} X_m X_m^T, \qquad X_m = [\, x_1 - \bar{x}, \ldots, x_N - \bar{x} \,], \qquad \bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j

Using the technique of Lagrange multipliers, the solution to this maximization problem is to compute the eigenvectors and the eigenvalues of the covariance matrix C.

MORE INFO in PCA.pdf
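For completeness, a short sketch of the Lagrange multiplier step mentioned above (the standard derivation, written out here):

\mathcal{L}(u, \lambda) = u^T C u - \lambda (u^T u - 1)

\frac{\partial \mathcal{L}}{\partial u} = 2 C u - 2 \lambda u = 0 \;\Rightarrow\; C u = \lambda u

J_{PCA}(u) = u^T C u = \lambda u^T u = \lambda

so the stationary directions are the eigenvectors of C, and the variance attained along each of them equals the corresponding eigenvalue.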


    PCA by Maximizing Variance (III)

The largest eigenvalue equals the maximal variance, while the corresponding eigenvector determines the direction of maximal variance.

By performing a singular value decomposition (SVD) of the covariance matrix C we can diagonalize C:

C = U \Lambda U^T

in such a way that the orthonormal matrix U contains the eigenvectors u_1, u_2, ..., u_M in its columns and the diagonal matrix \Lambda contains the eigenvalues \lambda_1, \lambda_2, ..., \lambda_M on its diagonal. The eigenvalues and eigenvectors are arranged in descending order of the eigenvalues, so that \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_M. Therefore, most of the variability of the input random vectors is contained in the first eigenvectors. Hence, the eigenvectors are called principal vectors.
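A minimal Matlab sketch of this diagonalization on synthetic data (svd is used here because, for a symmetric positive semi-definite C, it returns the eigenvectors with the eigenvalues already sorted in descending order):

C=cov(randn(100,3));   %example 3x3 covariance matrix (synthetic data)
[U,Lambda,~]=svd(C);   %for a symmetric C, C = U*Lambda*U'
diag(Lambda)'          %the eigenvalues, already in descending order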


    Computing PCA

Steps to compute the PCA transformation of a data matrix X:

Center the data.
Compute the covariance matrix.
Obtain the eigenvectors and eigenvalues of the covariance matrix.
Project the original data onto the eigenspace.

Matlab code:

%number of examples
N=size(X,2);
%dimension of each example
M=size(X,1);
%mean
meanX=mean(X,2);
%centering the data
Xm=X-meanX*ones(1,N);
%covariance matrix
C=(Xm*Xm')/N;
%computing the eigenspace
[U D]=eig(C);
%projecting the centered data over the eigenspace
P=U'*Xm;

In matrix form: P = U^T X_m.

U can be used as a linear transformation to project the original data of high dimension into a space of lower dimension.
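Note that eig does not guarantee any ordering of the eigenvalues. A minimal sketch, assumed to continue the code above, that sorts the components and keeps only the first k of them (k=2 is an arbitrary choice):

%sorting the eigenvalues (and eigenvectors) in descending order
[d,idx]=sort(diag(D),'descend');
U=U(:,idx);
%keeping only the first k principal components
k=2;
Uk=U(:,1:k);
%data represented in the k-dimensional eigenspace
Pk=Uk'*Xm;
%approximate reconstruction in the original space
Xr=Uk*Pk+meanX*ones(1,N);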


    PCA of a bidimensional dataset

[Figure: scatter plot of a 2-D dataset before and after PCA, showing the original, centered, and uncorrelated (projected) data.]
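A figure like this one can be reproduced with the following minimal Matlab sketch on synthetic correlated 2-D data (all the numerical values below are arbitrary choices for illustration):

N=200;
%synthetic correlated 2-D data
X=[1 0.8; 0.3 1]*randn(2,N)+[0.5; 0.2]*ones(1,N);
%centering
Xm=X-mean(X,2)*ones(1,N);
%PCA
C=(Xm*Xm')/N;
[U,D]=eig(C);
%decorrelated (uncorrelated) data
P=U'*Xm;
plot(X(1,:),X(2,:),'b.'); hold on
plot(Xm(1,:),Xm(2,:),'g.');
plot(P(1,:),P(2,:),'r.');
legend('original','centered','uncorrelated'); axis equal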


    Computing PCA of a set of images

This approach to the calculation of the principal vectors is very clear and widely used. However, if the size M of the data vectors is very large, which is often the case in the field of computer vision, the covariance matrix C becomes very large and the eigenvalue decomposition of C becomes unfeasible.

But if the number of input vectors is smaller than their dimension (N < M), we can instead diagonalize the much smaller N x N matrix (1/N) X_m^T X_m: if v is one of its eigenvectors with eigenvalue \lambda, then X_m v is an eigenvector of C = (1/N) X_m X_m^T with the same eigenvalue, so the leading principal vectors can be obtained at a much lower cost.
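A minimal Matlab sketch of this trick (the sizes and variable names below are assumed for illustration):

N=20; M=10000;      %a few images of very high dimension
X=rand(M,N);        %each column is a vectorized image
meanX=mean(X,2);
Xm=X-meanX*ones(1,N);
%small NxN matrix instead of the huge MxM covariance matrix
L=(Xm'*Xm)/N;
[V,D]=eig(L);
%each column of Xm*V is an eigenvector of C=(Xm*Xm')/N
U=Xm*V;
%normalizing each column to unit length
U=U./(ones(M,1)*sqrt(sum(U.^2,1))+eps);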


    Face recognition using PCA (I)

Turk, M. & Pentland, A., "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, 3, 71-86, 1991.


Linear Discriminant Analysis (I)

LDA = Fisher analysis.

LDA is a linear transformation.

LDA is also used in face recognition.

LDA seeks directions that are efficient for discrimination between classes.

In PCA, the subspace defined by the principal vectors is the one that best describes the data set as a whole; LDA, instead, tries to discriminate between the different classes of data.


Linear Discriminant Analysis (I)

We have a set of N vectors of dimension M in the M x N data matrix X.

We have C classes and k vectors per class.

We want to find the transformation matrix W that best describes the subspace that discriminates between the classes, after projecting the data into the new space.

The objective is to maximize the between-class scatter Sb while minimizing the within-class scatter Sw.

The projected data are P = W X, where the rows of W are the directions (eigenvectors) that maximize Sb relative to Sw.
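A minimal two-class Matlab sketch of this idea on synthetic data (the explicit scatter-matrix construction and the variable names are assumptions for illustration):

%synthetic 2-D data: two classes of 50 vectors each
X=[randn(2,50), randn(2,50)+2];
y=[ones(1,50), 2*ones(1,50)];
m1=mean(X(:,y==1),2); m2=mean(X(:,y==2),2); m=mean(X,2);
%within-class scatter Sw
X1=X(:,y==1)-m1*ones(1,50); X2=X(:,y==2)-m2*ones(1,50);
Sw=X1*X1'+X2*X2';
%between-class scatter Sb
Sb=50*(m1-m)*(m1-m)'+50*(m2-m)*(m2-m)';
%directions that maximize Sb relative to Sw
[W,D]=eig(Sw\Sb);
[~,idx]=max(diag(D));
w=W(:,idx);
%1-D projections that separate the two classes
P=w'*X;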


    Linear Discriminant Analysis (II)

[Figure: class 1 and class 2 data before and after the LDA projection.]

The figure shows the effect of the LDA transform on a set of data composed of 2 classes.


    Linear Discriminant Analysis (III)

    Limitations of LDA

LDA works better than PCA when the training data are well representative of the data in the system.

If the data are not representative enough, PCA performs better.


    Independent Component Analysis (I)

ICA is a statistical technique that represents a multidimensional random vector as a linear combination of non-Gaussian random variables ('independent components') that are as independent as possible.

ICA is somewhat similar to PCA.

ICA has many applications in data analysis, source separation, and feature extraction.


ICA cocktail party problem

ICA is a statistical technique for decomposing a complex dataset into independent sub-parts. Here we show how it can be applied to the problem of blind source separation.

[Figure: four sources s1(t), ..., s4(t) mixed into four observed signals x1(t), ..., x4(t).]


    ICA cocktail party problem

Estimate the sources s_i(t) from the mixed signals x_i(t).


    ICA cocktail party problem

    Linear model:

x_1(t) = a_{11} s_1(t) + a_{12} s_2(t) + a_{13} s_3(t) + a_{14} s_4(t)
x_2(t) = a_{21} s_1(t) + a_{22} s_2(t) + a_{23} s_3(t) + a_{24} s_4(t)
x_3(t) = a_{31} s_1(t) + a_{32} s_2(t) + a_{33} s_3(t) + a_{34} s_4(t)
x_4(t) = a_{41} s_1(t) + a_{42} s_2(t) + a_{43} s_3(t) + a_{44} s_4(t)

We can model the problem as X = A S, where:

S = 4-D vector containing the independent source signals.
A = mixing matrix.
X = observed (mixed) signals.
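A minimal Matlab sketch of this mixing model with synthetic sources (the particular waveforms and the random mixing matrix are arbitrary choices for illustration):

t=0:0.001:1;
S=[sin(2*pi*5*t);          %source 1: sinusoid
   sign(sin(2*pi*3*t));    %source 2: square wave
   mod(t,0.25);            %source 3: sawtooth-like ramp
   0.2*randn(1,numel(t))]; %source 4: noise
A=rand(4,4);               %mixing matrix (unknown in practice)
X=A*S;                     %observed mixed signals, one per row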


    ICA cocktail party problem

    Mixed signals


    ICA cocktail party problem

    Sources


    ICA cocktail party problem

ICA: one possible solution is to assume that the sources are independent:

p(s_1, s_2, \ldots, s_n) = p(s_1)\, p(s_2) \cdots p(s_n)

Estimate the sources s_i(t) from the mixed signals x_i(t).


    ICA cocktail party problem

THE ICA MODEL:

X = A S

(X: mixed signals, A: mixing matrix, S: sources)


    ICA cocktail party problem

ESTIMATING THE SOURCES:

S = W X

(S: estimated sources, W: separation matrix, X: mixed signals)

Since X = A S, the estimate is \hat{S} = W X = W A S, so the sources are recovered when the separation matrix W is (approximately) the inverse of the mixing matrix A.
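Continuing the mixing sketch above, a minimal check of this relation when the separation matrix is taken as the exact inverse of the (in practice unknown) mixing matrix:

W=inv(A);     %ideal separation matrix: the inverse of the mixing matrix
S_hat=W*X;    %recovers the sources S up to numerical precision
%in real ICA, W must be estimated from X alone,
%e.g. by making the recovered signals as independent (non-Gaussian) as possible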
