March 2006, Alon Slapak
Jan 26, 2016

Feature extraction: A practical approach

Outline:
- Cluster
- Data synthesis
- Separability
- Feature extraction guidelines
- Feature selection
- Example
- Bibliography
Geometric structure of a cluster

[Figure: weight [kg] vs. height [cm] scatter of the Females cluster; the enclosing ellipse is centered at (152, 51.5).]
Each red circle represents an object's pattern; the coordinates of the circle are its features (e.g., height and weight).

The ellipse encircling the major part of the cluster represents the distribution of the patterns.

The center of the ellipse is the mean of the pattern distribution.

The lengths of the principal axes of the ellipse are proportional to twice the square roots of the eigenvalues (\lambda_1, \lambda_2) of the covariance matrix of the pattern distribution.

The principal axes of the ellipse coincide with the eigenvectors of the covariance matrix of the pattern distribution.
Typical dimension: Mean

The center of the ellipse is the mean of the pattern distribution. It can be estimated by:

\mu = \frac{1}{N} \sum_{k=1}^{N} x_k

where x_k is the kth pattern in the cluster and N is the number of patterns in the cluster. The mean is sometimes referred to as the centroid of the cluster.
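As a quick illustration (a NumPy sketch rather than the MATLAB used later in these slides; the cluster parameters are made up), the centroid estimate is just the per-feature average:

```python
import numpy as np

# Hypothetical cluster of (height, weight) patterns,
# roughly matching the Females cluster centered near (152, 51.5).
rng = np.random.default_rng(0)
X = rng.normal(loc=[152.0, 51.5], scale=[8.0, 5.0], size=(500, 2))

# mu = (1/N) * sum_k x_k  -- the centroid of the cluster
mu = X.sum(axis=0) / len(X)

print(mu)                               # close to [152, 51.5]
print(np.allclose(mu, X.mean(axis=0)))  # same as the built-in mean
```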
[Figure: weight [kg] vs. height [cm] scatter with the cluster centroid marked.]
Typical dimension: Scatter matrix

The scatter matrix S of a cluster is defined as:

S = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T

where x_k is the kth pattern in the cluster, N is the number of patterns in the cluster, and \mu is the mean of all the patterns in the cluster.

[Figure: two weight [kg] vs. height [cm] clusters, one with larger scatter and one with smaller scatter.]

The scatter matrix, divided by N, may be interpreted as a biased estimate of the covariance matrix of the cluster.
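A minimal NumPy sketch (the cluster here is hypothetical) showing the definition above, and that S divided by N equals the biased covariance estimate:

```python
import numpy as np

# A hypothetical cluster for illustration.
rng = np.random.default_rng(1)
X = rng.normal(loc=[170.0, 75.0], scale=[10.0, 8.0], size=(200, 2))

mu = X.mean(axis=0)
D = X - mu                  # deviations of each pattern from the centroid
S = D.T @ D                 # S = sum_k (x_k - mu)(x_k - mu)^T

# Divided by N, the scatter matrix is the biased covariance estimate:
print(np.allclose(S / len(X), np.cov(X.T, bias=True)))  # True
```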
Data synthesis

- In order to test and debug pattern recognition algorithms, it is customary to use synthesized data.
- The synthesized data may be drawn from an arbitrary distribution, but most of the literature assumes a normal distribution.
- Still, it is not infrequent to come across applications involving other pattern distributions.
[Figure: weight [kg] vs. height [cm] scatter of synthesized patterns.]
Example: Two clusters synthesis

clear all;
N1 = 150; N2 = 150;
E1 = [150 15; 120 20]; E2 = [100 10; 70 30];
M1 = [170,75]'; M2 = [160,50]';
[P1,A1] = eig(E1); [P2,A2] = eig(E2);
y1=randn(2,N1); y2=randn(2,N2);
for i=1:N1,
x1(:,i) =P1*sqrt(A1)* y1(:,i)+M1;
end;
for i=1:N2,
x2(:,i) =P2*sqrt(A2)* y2(:,i)+M2;
end;
figure;
plot(x1(1,:),x1(2,:),'.',x2(1,:),x2(2,:),'or');
axis([120 220 30 120]);
xlabel ('height [cm]');
ylabel ('weight [kg]');

[Figure: the synthesized Males and Females clusters, weight [kg] vs. height [cm].]
Exercise

Try to synthesize two classes that share a common centroid at (3, 5). One class is aligned with the horizontal axis and is ~40 units long and ~8 units wide. The other class is inclined at 45° from the horizontal and is ~50 units long and ~4 units wide.
[Figure: empty axes spanning -50 to 50 in both dimensions, for plotting the synthesized classes.]
Hint: Use E = P A P^{-1} to build the covariance matrix.
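One possible sketch of the exercise in NumPy (not an official solution; the sample sizes and the "visible length ≈ 4 standard deviations" rule of thumb are assumptions): rotate a diagonal eigenvalue matrix, exactly as the hint E = P A P^{-1} suggests.

```python
import numpy as np

rng = np.random.default_rng(2)

def synth_class(center, length, width, angle_deg, n):
    """Draw n 2-D normal patterns whose cluster is roughly
    `length` long and `width` wide, inclined at angle_deg."""
    # Eigenvalues are variances along the principal axes;
    # assume visible extent ~ 4 standard deviations.
    A = np.diag([(length / 4.0) ** 2, (width / 4.0) ** 2])
    t = np.radians(angle_deg)
    P = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])   # eigenvectors (rotation)
    E = P @ A @ P.T                           # E = P A P^{-1} (P orthogonal)
    return rng.multivariate_normal(center, E, size=n)

c1 = synth_class([3.0, 5.0], 40.0, 8.0,  0.0, 300)  # horizontal class
c2 = synth_class([3.0, 5.0], 50.0, 4.0, 45.0, 300)  # 45-degree class
```

Since P is orthogonal, P^{-1} = P^T, which is why the code builds E as P @ A @ P.T.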
Separability
Q: How does the feature extraction method affect pattern recognition performance?
[Figure: weight [kg] vs. height [cm] scatter of the patterns.]
Separability

A: More than the classification algorithm does. If the patterns produced by the feature extraction method form non-separable clusters, no classification algorithm can do the job.

[Diagram: feature extraction method → classification algorithm.]
Separability
In fact, the separability achieved by the feature extraction method sets the upper limit on pattern recognition performance.
Separability
Q: How can one assess the separability achieved by the feature extraction method?
[Figure: three weight [kg] vs. height [cm] scatter plots with decreasing cluster separability, captioned: Good? Bad? Hopeless?]
Separability
A: Several separability criteria exist; most involve scatter matrices.
[Figure: clusters with larger and smaller scatter.]

S = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T
Separability

If we define the within-class scatter matrix as:

S_W = \sum_{i=1}^{C} P_i \sum_{k=1}^{N_i} (x_{ki} - \mu_i)(x_{ki} - \mu_i)^T

where x_{ki} is the kth pattern in the ith cluster,
N_i is the number of patterns in the ith cluster,
\mu_i is the mean of all the patterns in the ith cluster,
C is the number of clusters,
P_i is the a priori probability of the ith cluster, which may be estimated by

P_i = N_i / \sum_{p=1}^{C} N_p

and the between-class scatter matrix as:

S_B = \sum_{i=1}^{C} P_i (\mu_i - \mu)(\mu_i - \mu)^T

where \mu is the mean of all the patterns in all the clusters,

then the total-class scatter matrix is:

S_T = S_B + S_W
Separability function J

We can find in the literature several functions that represent the separability of clusters by a scalar:

J = \mathrm{tr}(S_W^{-1} S_B)

J = \det(S_W^{-1} S_B)

J = \mathrm{tr}(S_B) / \mathrm{tr}(S_W)

J = \ln \det(S_W^{-1} S_B)

Pay attention to the fact that J is a scalar. This is of utmost importance because a scalar function is necessary for the later optimization (we will, of course, try to maximize J).
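A NumPy sketch of the first criterion, J = tr(S_W^{-1} S_B), on two synthetic configurations (the cluster parameters are made up); better-separated clusters should yield a larger J:

```python
import numpy as np

def scatter_matrices(clusters):
    """S_W and S_B with P_i estimated as N_i / sum_p N_p."""
    N = sum(len(c) for c in clusters)
    mu = np.vstack(clusters).mean(axis=0)          # grand mean
    d = len(mu)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in clusters:
        Pi = len(c) / N
        mui = c.mean(axis=0)
        Dev = c - mui
        Sw += Pi * (Dev.T @ Dev)                   # within-class scatter
        Sb += Pi * np.outer(mui - mu, mui - mu)    # between-class scatter
    return Sw, Sb

rng = np.random.default_rng(3)
near = [rng.normal([0, 0], 1.0, (200, 2)), rng.normal([1, 0], 1.0, (200, 2))]
far  = [rng.normal([0, 0], 1.0, (200, 2)), rng.normal([8, 0], 1.0, (200, 2))]

for name, cl in [("near", near), ("far", far)]:
    Sw, Sb = scatter_matrices(cl)
    J = np.trace(np.linalg.inv(Sw) @ Sb)           # J = tr(Sw^-1 Sb)
    print(name, J)   # "far" gives a much larger J than "near"
```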
Feature extraction guidelines

Q: What would be regarded as a reasonable feature?

Example: features of a student (for gender classification):
- Number of eyes: useless; includes no information on gender.
- Hair color: useless; very poor correlation with gender.
- Wears glasses or not: useless; very poor correlation with gender.
- Hair length: effective, but hard to measure.
- Shoe size: effective and simple.
- Height: effective and simple.
- Weight: effective and simple.
Feature extraction guidelines

A reasonable feature exhibits:
- Correlation with the classification feature
- Ease of measurement
- Maximization of the separability function

"When we have two or more classes, feature extraction consists of choosing those features which are most effective for preserving class separability" (Fukunaga, p. 441).
Feature extraction guidelines

Q: How can one determine which features follow the guidelines?

A1: There is extensive literature on features for specific applications (e.g., symmetry for face recognition).

A2: A widespread approach is to emulate the human mechanism (e.g., treating all the pixels as features in face recognition).
Example - handwritten recognition

The following 20 figures depict examples of handwritten digits. Since almost everyone can recognize each of the digits, it is reasonable to assume that treating all the pixels as features will ensure a successful pattern recognition process.

[Figure: 20 handwritten digit images.]
Feature selection
• When dealing with 28x28-pixel handwritten digits, the feature space dimension is 784.
• When dealing with face recognition, we may have about 200x200 pixels, i.e., a feature space dimension of 40,000.
• When dealing with voice recognition, we may have about 1 [sec] x 16 [kHz], i.e., a feature space dimension of 16,000.
Feature selection
Feature selection is needed because of:
• The complexity of most classification algorithms, which is O(n^2) in the number of features n
• Learning time
• The "curse of dimensionality"
Feature selection

"Feature selection, also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data are selected for application of a learning algorithm. Feature selection is necessary either because it is computationally infeasible to use all available features, or because of problems of estimation when limited data samples (but a large number of features) are present. The latter problem is related to the so-called curse of dimensionality." (WIKIPEDIA)
[Diagram: Object → feature extraction → high-dimension pattern → feature selection → low-dimension pattern.]
Feature selection

Q: How can one identify the irrelevant and the redundant features?

A: Several options:

1. Heuristics
   In most applications the relevant information resides in the low frequency range (e.g., images), hence it is logical to reduce dimensionality by taking the first coefficients of the Fourier/DCT transform.

2. Optimization approach (KLT, FLD)
   We may select a subset of the features (or linear combinations of the features) that best contributes to the separability of the clusters.

3. Grouping
   By grouping features, one can represent every set of features by a small set of features.
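A small NumPy sketch of option 1 (the signal is made up; the basis matrix reproduces the orthonormal DCT-II, the same transform as MATLAB's dct): for a smooth signal, the first few DCT coefficients carry almost all the energy, so truncation loses very little.

```python
import numpy as np

n = 256
m = np.arange(n)
k = m[:, None]
# Orthonormal DCT-II basis matrix.
C = np.cos(np.pi * (m + 0.5) * k / n) * np.sqrt(2.0 / n)
C[0, :] = np.sqrt(1.0 / n)

t = np.linspace(0.0, 1.0, n)
x = np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)   # smooth signal

X = C @ x                                  # DCT coefficients
X_trunc = np.where(m < 20, X, 0.0)         # keep only the first 20
x_rec = C.T @ X_trunc                      # inverse (C is orthogonal)

err = np.linalg.norm(x - x_rec) / np.linalg.norm(x)
print(err)   # small: the low frequencies carry almost all the energy
```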
Feature selection - Example

The separability of these two classes is pretty good. But do we really need two features?
[Figure: Males and Females clusters, weight [kg] vs. height [cm], alongside the distribution of each single feature.]
Feature selection - Example

No! The same separability can be achieved by projecting the patterns onto the blue axis, i.e., only a one-dimensional feature space is needed.

[Figure: Males and Females clusters with the projection axis (blue) and the distribution of the patterns projected onto it.]
Feature selection

But how can we find this "blue axis", i.e., the feature subspace onto which we should project the patterns?
Please refer to:
- Lec7.pdf - PCA and the curse of dimensionality
- Lec8.pdf - Fisher Linear Discriminant
in Prof. Olga Veksler's course: http://www.csd.uwo.ca/faculty/olga/Courses//CS434a_541a//index.html (http://www.csd.uwo.ca/faculty/olga/)
Example – Handwriting recognition
The following example describes the application of FLD to handwriting recognition.
The characters were derived from the MNIST database (http://yann.lecun.com/exdb/mnist/).
The example demonstrates the separability of two classes while reducing the dimensionality from 784 (28X28 pixels) to 2.
Example – Handwriting recognition

clear all;
load img0
load img1
N1 = length(img0);
N2 = length(img1);
D = 50;    % Low pass filtering
Dd = 2;    % Desired pattern dimension
%-----------------------------------------------------
% Test set synthesis
%-----------------------------------------------------
for i=1:N1,
    x = reshape(squeeze(img0(i,:,:)), 28*28, 1);
    X = dct(x);
    l1(:,i) = X(1:D);
end;
for i=1:N2,
    x = reshape(squeeze(img1(i,:,:)), 28*28, 1);
    X = dct(x);
    l2(:,i) = X(1:D);
end;
The img0 and img1 files contain N 28x28 arrays of grayscale images of the handwritten characters.

Reshape each 28x28 array into a 784x1 feature vector (pattern).

The DCT transform is recommended when dealing with mostly black-and-white images: the abundance of identical values (0 and 255) damages the FLD computations, and the DCT transform may alleviate this phenomenon.

Taking only the first 50 DCT coefficients decreases dimensionality (most of the energy resides in the low frequency range), but this step is not essential for this example.
Example – Handwriting recognition%--------------------------------------------------------------------------
% Compute Sb and Sw
%--------------------------------------------------------------------------
Mu1 = mean(l1')'; % Mean of cluster 1
Mu2 = mean(l2')'; % Mean of cluster 2
Mu = (Mu1*N1 + Mu2*N2)/(N1+N2); % Total mean of all patterns
Sw = zeros(D);
for i=1:N1,
Sw = Sw + (l1(:,i)-Mu1)*(l1(:,i)-Mu1)';
end;
for i=1:N2,
Sw = Sw + (l2(:,i)-Mu2)*(l2(:,i)-Mu2)';
end;
Sb = N1*(Mu1-Mu)*(Mu1-Mu)' + N2*(Mu2-Mu)*(Mu2-Mu)';
Compute the mean of each cluster, and the mean of all the patterns.

Compute the within-class scatter matrix:

S_W = \sum_{i=1}^{C} P_i \sum_{k=1}^{N_i} (x_{ki} - \mu_i)(x_{ki} - \mu_i)^T

Compute the between-class scatter matrix:

S_B = \sum_{i=1}^{C} P_i (\mu_i - \mu)(\mu_i - \mu)^T
Example – Handwriting recognition%--------------------------------------------------------------------------
% Compute V
% [W,D] = EIG(A,B) produces a diagonal matrix D of generalized
% eigenvalues and a full matrix W whose columns are the
% corresponding eigenvectors so that A*W = B*W*D.
%--------------------------------------------------------------------------
[W,D]=eig(Sb,Sw); % Solve the generalized eigenvalue problem
Lambda = diag(D); % Sort the eigenvalues.
[Lam,p]=sort(Lambda);
V=[]; % Build V from the eigenvectors
for i=1:Dd, % which are corresponding to the
V= [V W(:,p(end+1-i))]; % biggest eigenvalues
end;
Solve the generalized eigenvalue problem:

S_B V = S_W V \Lambda

Construct the projection matrix V such that its column vectors are the eigenvectors corresponding to the largest eigenvalues; this maximizes

J(V) = \frac{\det(V^T S_B V)}{\det(V^T S_W V)}
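The same steps can be sketched in NumPy (an illustration with made-up 5-D clusters standing in for the DCT patterns l1 and l2; since S_W is invertible here, the generalized problem S_B v = λ S_W v reduces to the ordinary eigenproblem of S_W^{-1} S_B):

```python
import numpy as np

rng = np.random.default_rng(4)
# Two hypothetical 5-D clusters separated along the first coordinate.
c1 = rng.normal(0.0, 1.0, (100, 5)) + np.array([3, 0, 0, 0, 0])
c2 = rng.normal(0.0, 1.0, (100, 5))

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)
mu = (len(c1) * mu1 + len(c2) * mu2) / (len(c1) + len(c2))

# Scatter matrices, as in the MATLAB code on the previous slides.
Sw = (c1 - mu1).T @ (c1 - mu1) + (c2 - mu2).T @ (c2 - mu2)
Sb = len(c1) * np.outer(mu1 - mu, mu1 - mu) \
   + len(c2) * np.outer(mu2 - mu, mu2 - mu)

# Sb v = lambda Sw v  <=>  (Sw^-1 Sb) v = lambda v  (Sw invertible here)
lam, W = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(lam.real)[::-1]
V = W[:, order[:2]].real       # the two leading eigenvectors

y1, y2 = c1 @ V, c2 @ V        # project onto the reduced space
```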
Example – Handwriting recognition%--------------------------------------------------------------------------
% Project the initial patterns onto the reduced space
%--------------------------------------------------------------------------
for i = 1:N1,
r1(:,i) = V' * l1(:,i);
end;
for i = 1:N2,
r2(:,i) = V' * l2(:,i);
end;
figure;
plot(r1(1,:),r1(2,:),'.',r2(1,:),r2(2,:),'or');
xlabel ('feature1');
ylabel ('feature2');
Project the initial patterns onto the reduced space:

y_i = V^T x_i

Plot the 2-dimensional clusters.
Example – Handwriting recognition

The figure below depicts the 2-dimensional clusters. It is easy to see that the separability of the clusters is perfect even in one dimension, which is not surprising because there are only two clusters.
[Figure: the two projected clusters in the (feature1, feature2) plane; the clusters are clearly separated.]
Bibliography

- Prof. Olga Veksler's pattern recognition course: http://www.csd.uwo.ca/faculty/olga/Courses//CS434a_541a//index.html
- Handwriting database (MNIST): http://yann.lecun.com/exdb/mnist/