Page 1: Feature extraction


Feature extraction

A practical approach

Feature selection

Data synthesis

Separability

Feature extraction Guidelines

Cluster

Example

Bibliography

Page 2: Feature extraction


[Figure: geometric structure of a cluster. Red circles (Females) in the height [cm] vs. weight [kg] plane, encircled by an ellipse centered at (152, 51.5).]

Each of the red circles represents an object pattern. The coordinates of the circle are the features (e.g. height and weight).

The ellipse which encircles the major part of the cluster represents the distribution of the patterns.

The center of the ellipse is the mean of the pattern distribution.

The length of the principal axes of the ellipse is proportional to twice the square root of the eigenvalues of the covariance matrix of the pattern distribution (λ1, λ2).

The principal axes of the ellipse coincide with the eigenvectors of the covariance matrix of the pattern distribution.
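Under the assumptions above, this geometry is easy to visualize. A minimal MATLAB sketch (not from the original slides; the data and variable names are illustrative) estimates the covariance of a synthetic cluster and overlays the corresponding one-standard-deviation ellipse:

% Sketch: overlay the covariance ellipse of a synthetic cluster.
X = [8*randn(1,200)+152; 4*randn(1,200)+51.5]; % 2xN cluster, one pattern per column
mu = mean(X,2);                                % centroid of the cluster
E  = cov(X');                                  % sample covariance (rows = observations)
[P,A] = eig(E);                                % eigenvectors P, eigenvalues on diag(A)
t = linspace(0,2*pi,100);
ell = P*sqrt(A)*[cos(t); sin(t)] + mu*ones(1,100); % unit circle mapped to 1-sigma ellipse
figure;
plot(X(1,:),X(2,:),'.r'); hold on;
plot(ell(1,:),ell(2,:),'b');
xlabel('height [cm]'); ylabel('weight [kg]');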


Page 3: Feature extraction


Typical dimension: Mean

The center of the ellipse is the mean of the pattern distribution. It can be estimated by:

\mu = \frac{1}{N} \sum_{k=1}^{N} x_k

where x_k is the kth pattern in the cluster, and N is the number of patterns in the cluster.

The mean is sometimes referred to as the centroid of the cluster.
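In MATLAB, assuming the patterns are stored one per column (the layout the synthesis example later on uses), the estimator is direct; the data here are illustrative:

% Sketch: centroid of a cluster X (2xN, one pattern per column).
X  = randn(2,150);   % illustrative patterns
N  = size(X,2);      % number of patterns in the cluster
mu = (1/N)*sum(X,2); % the estimator above; equivalent to mean(X,2)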

[Figure: a cluster in the height [cm] vs. weight [kg] plane.]


Page 4: Feature extraction


Typical dimension: Scatter matrix S

The scatter matrix S of a cluster is defined as:

S = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T

[Figure: two clusters in the height [cm] vs. weight [kg] plane, one with larger scatter and one with smaller scatter.]

where x_k is the kth pattern in the cluster, N is the number of patterns in the cluster, and μ is the mean of all the patterns in the cluster.

The scatter matrix, normalized by N, may be interpreted as a biased estimate of the covariance matrix of the cluster.
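As a minimal MATLAB sketch (illustrative data, patterns stored one per column), the definition translates directly:

% Sketch: scatter matrix of a cluster X (2xN, one pattern per column).
X  = randn(2,150);                    % illustrative patterns
N  = size(X,2);
mu = mean(X,2);                       % cluster mean
S  = zeros(2);
for k=1:N,
    S = S + (X(:,k)-mu)*(X(:,k)-mu)'; % sum of outer products
end;
% S/N is the biased covariance estimate; compare with cov(X')*(N-1)/N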


Page 5: Feature extraction


Data synthesis

• In order to test and debug pattern recognition algorithms, it is customary to use synthesized data.

• The synthesized data may be drawn from an arbitrary distribution, but in most of the literature it is customary to assume a normal distribution.

• However, it is not infrequent to come across applications involving other pattern distributions.

[Figure: a synthesized cluster in the height [cm] vs. weight [kg] plane.]


Page 6: Feature extraction


Example: Two-cluster synthesis

clear all;
N1 = 150; N2 = 150;                    % number of patterns per cluster
E1 = [150 15; 120 20];                 % covariance matrices of the two clusters
E2 = [100 10; 70 30];
M1 = [170,75]'; M2 = [160,50]';        % cluster means
[P1,A1] = eig(E1); [P2,A2] = eig(E2);  % eigenvectors and eigenvalues
y1 = randn(2,N1); y2 = randn(2,N2);    % standard normal samples
for i=1:N1,
    x1(:,i) = P1*sqrt(A1)*y1(:,i) + M1;   % shape and shift cluster 1
end;
for i=1:N2,
    x2(:,i) = P2*sqrt(A2)*y2(:,i) + M2;   % shape and shift cluster 2
end;
figure;
plot(x1(1,:),x1(2,:),'.',x2(1,:),x2(2,:),'or');
axis([120 220 30 120]);
xlabel('height [cm]');
ylabel('weight [kg]');

[Figure: the resulting Males and Females clusters in the height [cm] vs. weight [kg] plane.]


Page 7: Feature extraction


Exercise

Try to synthesize two classes which have a common centroid at (3,5). One class is aligned with the horizontal axis and is ~40 units in length and ~8 units in width. The other class is inclined at 45° from the horizontal and is ~50 units in length and ~4 units in width.


Hint: Use E = PΛP⁻¹ to build the covariance matrix.
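As a hedged starting point (not the full solution; scaling the eigenvalues to the visible length of a cluster is an assumption), the inclined covariance matrix could be built like this:

% Sketch: covariance with principal axes at 45 degrees via E = P*Lambda*inv(P).
theta  = 45*pi/180;                                       % inclination of the long axis
P      = [cos(theta) -sin(theta); sin(theta) cos(theta)]; % eigenvectors (rotation matrix)
Lambda = diag([(50/2)^2 (4/2)^2]); % eigenvalues chosen so that 2*sqrt(lambda) ~ axis length
E      = P*Lambda*inv(P);          % since P is orthonormal, inv(P) equals P'

Patterns can then be drawn from E exactly as in the two-cluster synthesis example, shifting each one by the common centroid (3,5).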


Page 8: Feature extraction


Separability

Q: How does the feature extraction method affect the pattern recognition performance?

[Figure: clusters in the height [cm] vs. weight [kg] plane.]


Page 9: Feature extraction


Separability

A: More than the classification algorithm does. If the patterns produced by the feature extraction method form non-separable clusters, no classification algorithm can do the job.

Feature extraction method → Classification algorithm


Page 10: Feature extraction


Separability

In fact, the separability achieved by the feature extraction method is the upper limit of the pattern recognition performance.


Page 11: Feature extraction


Separability

Q: How can one assess the separability achieved by the feature extraction method?

[Three figures: clusters in the height [cm] vs. weight [kg] plane with progressively worse separability. Good? Bad? Hopeless?]


Page 12: Feature extraction


Separability

A: Several separability criteria exist; most involve scatter matrices:

S = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T

[Figure: larger and smaller scatter, as on the scatter matrix slide.]


Page 13: Feature extraction


Separability

If we define the within-class scatter matrix as:

S_W = \sum_{i=1}^{C} P_i \frac{1}{N_i} \sum_{k=1}^{N_i} (x_{ki} - \mu_i)(x_{ki} - \mu_i)^T

where x_{ki} is the kth pattern in the ith cluster, N_i is the number of patterns in the ith cluster, μ_i is the mean of all the patterns in the ith cluster, C is the number of clusters, and P_i is the a priori probability of the ith cluster, which may be estimated by

P_i = \frac{N_i}{\sum_{p=1}^{C} N_p}

and the between-class scatter matrix as:

S_B = \sum_{i=1}^{C} P_i (\mu_i - \mu)(\mu_i - \mu)^T

where μ is the mean of all the patterns in all the clusters,

and the total-class scatter matrix as:

S_T = S_B + S_W
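A minimal MATLAB sketch of these definitions (the cell-array layout and the data are illustrative assumptions):

% Sketch: within-, between- and total scatter for C clusters.
clusters = {randn(2,100), 3+randn(2,80)}; % illustrative clusters, one pattern per column
C  = length(clusters);
d  = size(clusters{1},1);
Ns = zeros(1,C); mus = zeros(d,C);
for i=1:C,
    Ns(i)    = size(clusters{i},2);
    mus(:,i) = mean(clusters{i},2);       % cluster means
end;
N  = sum(Ns);
mu = mus*Ns'/N;                           % total mean, weighted by cluster sizes
Sw = zeros(d); Sb = zeros(d);
for i=1:C,
    Pi = Ns(i)/N;                         % a priori probability estimate
    Xi = clusters{i};
    Si = zeros(d);
    for k=1:Ns(i),
        Si = Si + (Xi(:,k)-mus(:,i))*(Xi(:,k)-mus(:,i))';
    end;
    Sw = Sw + Pi*Si/Ns(i);                % within-class scatter
    Sb = Sb + Pi*(mus(:,i)-mu)*(mus(:,i)-mu)'; % between-class scatter
end;
St = Sb + Sw;                             % total scatter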


Page 14: Feature extraction


Separability function J

We can find in the literature several functions which represent the separability of clusters by a scalar:

J = \mathrm{tr}(S_W^{-1} S_B)

J = \det(S_W^{-1} S_B)

J = \frac{\mathrm{tr}(S_B)}{\mathrm{tr}(S_W)}

J = \ln \det(S_W^{-1} S_B)

Pay attention to the fact that J is a scalar. This is of utmost importance because a scalar function is necessary for the optimization later on (we will, of course, try to maximize J).
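For example, given S_W and S_B, the four criteria are one-liners in MATLAB (the matrices below are illustrative):

% Sketch: the separability criteria above for given Sw, Sb.
Sw = [2 0.3; 0.3 1]; Sb = [1.5 0.2; 0.2 0.4]; % illustrative scatter matrices
J1 = trace(Sw\Sb);        % tr(Sw^-1 * Sb)
J2 = det(Sw\Sb);          % det(Sw^-1 * Sb)
J3 = trace(Sb)/trace(Sw); % ratio of traces
J4 = log(det(Sw\Sb));     % ln det(Sw^-1 * Sb)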


Page 15: Feature extraction


Feature extraction guidelines

Example: Features of a student:

• Number of eyes: Useless. Includes no information on gender.
• Hair color: Useless. Very poor correlation with gender.
• Wears glasses or not: Useless. Very poor correlation with gender.
• Hair length: Effective, but hard to measure.
• Shoe size: Effective and simple.
• Height: Effective and simple.
• Weight: Effective and simple.

Q: What will be regarded as a reasonable feature?


Page 16: Feature extraction


Feature extraction guidelines

• Correlation with the classification feature

• Easy to measure

• Maximizing the separability function

"When we have two or more classes, feature extraction consists of choosing those features which are most effective for preserving class separability" (Fukunaga, p. 441)


Page 17: Feature extraction


Feature extraction guidelines

Q: How can one determine which features follow the guidelines?

A1: There is an extensive literature dealing with features for specific applications (e.g. symmetry for face recognition).

A2: A widespread approach is to emulate the human mechanism (e.g. treating all the pixels as features in face recognition).


Page 18: Feature extraction


Example – Handwriting recognition

The following 20 figures depict examples of handwritten digits. Since almost everyone can recognize each of the digits, it is reasonable to assume that treating all the pixels as features will assure a successful pattern recognition process.


Page 19: Feature extraction


Feature selection

• When dealing with 28×28-pixel handwritten digits, the feature space dimension is 784.

• When dealing with face recognition, we may have about 200×200 pixels, which means that the feature space dimension is 40,000.

• When dealing with voice recognition, we may have about 1 [sec] × 16 [kHz] of samples, which means that the feature space dimension is 16,000.


Page 20: Feature extraction


Feature selection

Feature selection is needed because of:

• The complexity of most classification algorithms, which is O(n²).

• Learning time.

• The "curse of dimensionality".


Page 21: Feature extraction


Feature selection

"Feature selection, also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data are selected for application of a learning algorithm. Feature selection is necessary either because it is computationally infeasible to use all available features, or because of problems of estimation when limited data samples (but a large number of features) are present. The latter problem is related to the so-called curse of dimensionality." (Wikipedia)

Object → Feature extraction → High-dimension pattern → Feature selection → Low-dimension pattern


Page 22: Feature extraction


Feature selection

Q: How can one identify the irrelevant and the redundant features?

A: Several options:

1. Heuristics

In most applications, the relevant information resides in the low-frequency range (e.g. images), hence it is logical to reduce dimensionality by taking the first coefficients of the Fourier/DCT transform (see the sketch after this list).

2. Optimization approach (KLT, FLD): We may select a subset of the features (or linear combinations of the features) which best contribute to the separability of the clusters.

3. Grouping: By grouping similar features, one can represent a large set of features by a small one.
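Below is a minimal sketch of option 1, the same idea the handwriting example later in this presentation uses (dct is MATLAB's DCT routine from the Signal Processing Toolbox; the sizes are illustrative):

% Sketch: dimensionality reduction by keeping the first D DCT coefficients.
x = rand(784,1);  % a flattened 28x28 image (illustrative)
D = 50;           % number of low-frequency coefficients to keep
X = dct(x);       % most of the energy concentrates in the first coefficients
feature = X(1:D); % reduced pattern: 784 -> 50 dimensions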


Page 23: Feature extraction


Feature selection – Example

The separability of these two classes is pretty good. But do we really need two features?

[Figures: the Males and Females clusters in the height [cm] vs. weight [kg] plane.]


Page 24: Feature extraction


Feature selection – Example

No! The same separability can be achieved by projecting the patterns onto the blue axis, i.e. only a one-dimensional feature space is needed.

[Figure: the Males and Females clusters in the height [cm] vs. weight [kg] plane, with the one-dimensional projection axis (the blue axis) overlaid.]


Page 25: Feature extraction


Feature selection

But how can we find this "blue axis", i.e. the feature subspace onto which we should project the patterns?


Please refer to:

Lec7.pdf - PCA and the curse of dimensionality
Lec8.pdf - Fisher Linear Discriminant

in the pattern recognition course of Prof. Olga Veksler:
http://www.csd.uwo.ca/faculty/olga/Courses//CS434a_541a//index.html
http://www.csd.uwo.ca/faculty/olga/

Page 26: Feature extraction


Example – Handwriting recognition


The following example describes the application of FLD to handwriting recognition.

The characters were derived from the MNIST database (http://yann.lecun.com/exdb/mnist/).

The example demonstrates the separability of two classes while reducing the dimensionality from 784 (28×28 pixels) to 2.

Page 27: Feature extraction


Example – Handwriting recognition

clear all;
load img0
load img1
N1 = length(img0);
N2 = length(img1);
D  = 50;  % Low-pass filtering: number of DCT coefficients kept
Dd = 2;   % Desired pattern dimension
%-----------------------------------------------------
% Test set synthesis
%-----------------------------------------------------
for i=1:N1,
    x = reshape(squeeze(img0(i,:,:)), 28*28, 1);
    X = dct(x);
    l1(:,i) = X(1:D);
end;
for i=1:N2,
    x = reshape(squeeze(img1(i,:,:)), 28*28, 1);
    X = dct(x);
    l2(:,i) = X(1:D);
end;


The img0 and img1 files contain N 28×28 arrays of grayscale images of the handwritten characters.

Reshaping the 28×28 arrays into a 784×1 feature vector (pattern).

The DCT transform is recommended when dealing with mostly black-and-white images. The abundance of identical values (0 and 255) hampers the FLD computation, and the DCT transform may alleviate this phenomenon.

Taking only the first 50 DCT coefficients decreases dimensionality (most of the energy resides in the low-frequency range), but this step is not essential for this example.

Page 28: Feature extraction


Example – Handwriting recognition

%--------------------------------------------------------------------------

% Compute Sb and Sw

%--------------------------------------------------------------------------

Mu1 = mean(l1')'; % Mean of cluster 1

Mu2 = mean(l2')'; % Mean of cluster 2

Mu = (Mu1*N1 + Mu2*N2)/(N1+N2); % Total mean of all patterns

Sw = zeros(D);

for i=1:N1,

Sw = Sw + (l1(:,i)-Mu1)*(l1(:,i)-Mu1)';

end;

for i=1:N2,

Sw = Sw + (l2(:,i)-Mu2)*(l2(:,i)-Mu2)';

end;

Sb = N1*(Mu1-Mu)*(Mu1-Mu)' + N2*(Mu2-Mu)*(Mu2-Mu)';


Compute the mean of each cluster, and the mean of all the patterns.

Compute the within scatter matrix.

Compute the between scatter matrix.

S_B = \sum_{i=1}^{C} P_i (\mu_i - \mu)(\mu_i - \mu)^T

S_W = \sum_{i=1}^{C} P_i \frac{1}{N_i} \sum_{k=1}^{N_i} (x_{ki} - \mu_i)(x_{ki} - \mu_i)^T

(In the code, the common scaling factors are dropped; they do not affect the directions of the eigenvectors found next.)

Page 29: Feature extraction


Example – Handwriting recognition

%--------------------------------------------------------------------------

% Compute V

% [W,D] = EIG(A,B) produces a diagonal matrix D of generalized

% eigenvalues and a full matrix W whose columns are the

% corresponding eigenvectors so that A*W = B*W*D.

%--------------------------------------------------------------------------

[W,D]=eig(Sb,Sw); % Solve the generalized eigenvalue problem

Lambda = diag(D);       % Extract the eigenvalues

[Lam,p] = sort(Lambda); % Sort the eigenvalues (ascending)

V = [];                      % Build V from the eigenvectors
for i=1:Dd,                  % corresponding to the
    V = [V W(:,p(end+1-i))]; % largest eigenvalues
end;


Solve the generalized eigenvalue problem

S_B V = S_W V \Lambda

Construct the projection matrix V such that its column vectors are the eigenvectors corresponding to the largest eigenvalues, thereby maximizing

J(V) = \frac{|V^T S_B V|}{|V^T S_W V|}

Page 30: Feature extraction


Example – Handwriting recognition

%--------------------------------------------------------------------------

% Project the initial patterns onto the reduced space

%--------------------------------------------------------------------------

for i = 1:N1,

r1(:,i) = V' * l1(:,i);

end;

for i = 1:N2,

r2(:,i) = V' * l2(:,i);

end;

figure;

plot(r1(1,:),r1(2,:),'.',r2(1,:),r2(2,:),'or');

xlabel ('feature1');

ylabel ('feature2');


Project the initial patterns onto the reduced space.

Plot 2-dim clusters

y_i = V^T x_i

Page 31: Feature extraction


Example – Handwriting recognition

The figure below depicts the 2-dim clusters. It is easy to see that the separability of the clusters is perfect even in one dimension, which is not surprising because there are only two clusters.


[Figure: the two clusters in the feature1 vs. feature2 plane, perfectly separated.]

Page 32: Feature extraction


Bibliography

• Prof. Olga Veksler's pattern recognition course: http://www.csd.uwo.ca/faculty/olga/Courses//CS434a_541a//index.html
• Handwriting database (MNIST): http://yann.lecun.com/exdb/mnist/
