March 2006, Alon Slapak
Jan 26, 2016

Feature extraction: A practical approach

Outline:
- Cluster
- Data synthesis
- Separability
- Feature extraction guidelines
- Feature selection
- Example
- Bibliography
Geometric structure of a cluster

[Figure: weight [kg] vs. height [cm] scatter of the Females cluster; the enclosing ellipse is centered at (152, 51.5).]
Each red circle represents an object's pattern; the coordinates of the circle are its features (e.g., height and weight).

The ellipse encircling the major part of the cluster represents the distribution of the patterns.

The center of the ellipse is the mean of the pattern distribution.

The lengths of the principal axes of the ellipse are proportional to twice the square roots of the eigenvalues (\lambda_1, \lambda_2) of the covariance matrix of the pattern distribution.

The principal axes of the ellipse coincide with the eigenvectors of the covariance matrix of the pattern distribution.
Typical dimension: Mean

The center of the ellipse is the mean of the pattern distribution. It can be estimated by:

\mu = \frac{1}{N} \sum_{k=1}^{N} x_k

where x_k is the kth pattern in the cluster and N is the number of patterns in the cluster. The mean is sometimes referred to as the centroid of the cluster.
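As a quick illustration (a NumPy sketch rather than the MATLAB used later in these slides; the cluster parameters are made up), the centroid estimate is just the per-feature average:

```python
import numpy as np

# Hypothetical cluster of (height, weight) patterns,
# roughly matching the Females cluster centered near (152, 51.5).
rng = np.random.default_rng(0)
X = rng.normal(loc=[152.0, 51.5], scale=[8.0, 5.0], size=(500, 2))

# mu = (1/N) * sum_k x_k  -- the centroid of the cluster
mu = X.sum(axis=0) / len(X)

print(mu)                               # close to [152, 51.5]
print(np.allclose(mu, X.mean(axis=0)))  # same as the built-in mean
```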
[Figure: weight [kg] vs. height [cm] scatter with the cluster centroid marked.]
Typical dimension: Scatter matrix

The scatter matrix S of a cluster is defined as:

S = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T

where x_k is the kth pattern in the cluster, N is the number of patterns in the cluster, and \mu is the mean of all the patterns in the cluster.

[Figure: two weight [kg] vs. height [cm] clusters, one with larger scatter and one with smaller scatter.]

The scatter matrix, divided by N, may be interpreted as a biased estimate of the covariance matrix of the cluster.
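A minimal NumPy sketch (the cluster here is hypothetical) showing the definition above, and that S divided by N equals the biased covariance estimate:

```python
import numpy as np

# A hypothetical cluster for illustration.
rng = np.random.default_rng(1)
X = rng.normal(loc=[170.0, 75.0], scale=[10.0, 8.0], size=(200, 2))

mu = X.mean(axis=0)
D = X - mu                  # deviations of each pattern from the centroid
S = D.T @ D                 # S = sum_k (x_k - mu)(x_k - mu)^T

# Divided by N, the scatter matrix is the biased covariance estimate:
print(np.allclose(S / len(X), np.cov(X.T, bias=True)))  # True
```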
Data synthesis

- In order to test and debug pattern recognition algorithms, it is customary to use synthesized data.
- The synthesized data may be drawn from an arbitrary distribution, but most of the literature assumes a normal distribution.
- Still, it is not infrequent to come across applications involving other pattern distributions.
[Figure: weight [kg] vs. height [cm] scatter of synthesized patterns.]
Example: Two clusters synthesis

clear all;
N1 = 150; N2 = 150;
E1 = [150 15; 120 20]; E2 = [100 10; 70 30];
M1 = [170,75]'; M2 = [160,50]';
[P1,A1] = eig(E1); [P2,A2] = eig(E2);
y1=randn(2,N1); y2=randn(2,N2);
for i=1:N1,
x1(:,i) =P1*sqrt(A1)* y1(:,i)+M1;
end;
for i=1:N2,
x2(:,i) =P2*sqrt(A2)* y2(:,i)+M2;
end;
figure;
plot(x1(1,:),x1(2,:),'.',x2(1,:),x2(2,:),'or');
axis([120 220 30 120]);
xlabel ('height [cm]');
ylabel ('weight [kg]');

[Figure: the synthesized Males and Females clusters, weight [kg] vs. height [cm].]
Exercise

Try to synthesize two classes that share a common centroid at (3, 5). One class is aligned with the horizontal axis and is ~40 units long and ~8 units wide. The other class is inclined at 45° from the horizontal and is ~50 units long and ~4 units wide.
[Figure: empty axes spanning -50 to 50 in both dimensions, for plotting the synthesized classes.]
Hint: Use E = P A P^{-1} to build the covariance matrix.
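One possible sketch of the exercise in NumPy (not an official solution; the sample sizes and the "visible length ≈ 4 standard deviations" rule of thumb are assumptions): rotate a diagonal eigenvalue matrix, exactly as the hint E = P A P^{-1} suggests.

```python
import numpy as np

rng = np.random.default_rng(2)

def synth_class(center, length, width, angle_deg, n):
    """Draw n 2-D normal patterns whose cluster is roughly
    `length` long and `width` wide, inclined at angle_deg."""
    # Eigenvalues are variances along the principal axes;
    # assume visible extent ~ 4 standard deviations.
    A = np.diag([(length / 4.0) ** 2, (width / 4.0) ** 2])
    t = np.radians(angle_deg)
    P = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])   # eigenvectors (rotation)
    E = P @ A @ P.T                           # E = P A P^{-1} (P orthogonal)
    return rng.multivariate_normal(center, E, size=n)

c1 = synth_class([3.0, 5.0], 40.0, 8.0,  0.0, 300)  # horizontal class
c2 = synth_class([3.0, 5.0], 50.0, 4.0, 45.0, 300)  # 45-degree class
```

Since P is orthogonal, P^{-1} = P^T, which is why the code builds E as P @ A @ P.T.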
Separability
Q: How does the feature extraction method affect pattern recognition performance?
[Figure: weight [kg] vs. height [cm] scatter of the patterns.]
Separability

A: More than the classification algorithm does. If the patterns produced by the feature extraction method form non-separable clusters, no classification algorithm can do the job.

[Diagram: feature extraction method → classification algorithm.]
Separability
In fact, the separability achieved by the feature extraction method sets the upper limit on pattern recognition performance.
Separability
Q: How can one assess the separability achieved by the feature extraction method?
[Figure: three weight [kg] vs. height [cm] scatter plots with decreasing cluster separability, captioned: Good? Bad? Hopeless?]
Separability
A: Several separability criteria exist; most involve scatter matrices.
[Figure: clusters with larger and smaller scatter.]

S = \sum_{k=1}^{N} (x_k - \mu)(x_k - \mu)^T
Separability

If we define the within-class scatter matrix as:

S_W = \sum_{i=1}^{C} P_i \sum_{k=1}^{N_i} (x_{ki} - \mu_i)(x_{ki} - \mu_i)^T

where x_{ki} is the kth pattern in the ith cluster,
N_i is the number of patterns in the ith cluster,
\mu_i is the mean of all the patterns in the ith cluster,
C is the number of clusters,
P_i is the a priori probability of the ith cluster, which may be estimated by

P_i = N_i / \sum_{p=1}^{C} N_p

and the between-class scatter matrix as:

S_B = \sum_{i=1}^{C} P_i (\mu_i - \mu)(\mu_i - \mu)^T

where \mu is the mean of all the patterns in all the clusters,

then the total-class scatter matrix is:

S_T = S_B + S_W
Separability function J

We can find in the literature several functions that represent the separability of clusters by a scalar:

J = \mathrm{tr}(S_W^{-1} S_B)

J = \det(S_W^{-1} S_B)

J = \mathrm{tr}(S_B) / \mathrm{tr}(S_W)

J = \ln \det(S_W^{-1} S_B)

Pay attention to the fact that J is a scalar. This is of utmost importance because a scalar function is necessary for the later optimization (we will, of course, try to maximize J).
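A NumPy sketch of the first criterion, J = tr(S_W^{-1} S_B), on two synthetic configurations (the cluster parameters are made up); better-separated clusters should yield a larger J:

```python
import numpy as np

def scatter_matrices(clusters):
    """S_W and S_B with P_i estimated as N_i / sum_p N_p."""
    N = sum(len(c) for c in clusters)
    mu = np.vstack(clusters).mean(axis=0)          # grand mean
    d = len(mu)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in clusters:
        Pi = len(c) / N
        mui = c.mean(axis=0)
        Dev = c - mui
        Sw += Pi * (Dev.T @ Dev)                   # within-class scatter
        Sb += Pi * np.outer(mui - mu, mui - mu)    # between-class scatter
    return Sw, Sb

rng = np.random.default_rng(3)
near = [rng.normal([0, 0], 1.0, (200, 2)), rng.normal([1, 0], 1.0, (200, 2))]
far  = [rng.normal([0, 0], 1.0, (200, 2)), rng.normal([8, 0], 1.0, (200, 2))]

for name, cl in [("near", near), ("far", far)]:
    Sw, Sb = scatter_matrices(cl)
    J = np.trace(np.linalg.inv(Sw) @ Sb)           # J = tr(Sw^-1 Sb)
    print(name, J)   # "far" gives a much larger J than "near"
```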
Feature extraction guidelines

Q: What would be regarded as a reasonable feature?

Example: features of a student (for gender classification):
- Number of eyes: useless; includes no information on gender.
- Hair color: useless; very poor correlation with gender.
- Wears glasses or not: useless; very poor correlation with gender.
- Hair length: effective, but hard to measure.
- Shoe size: effective and simple.
- Height: effective and simple.
- Weight: effective and simple.
Feature extraction guidelines

A reasonable feature exhibits:
- Correlation with the classification feature
- Ease of measurement
- Maximization of the separability function

"When we have two or more classes, feature extraction consists of choosing those features which are most effective for preserving class separability" (Fukunaga, p. 441).
Feature extraction guidelines

Q: How can one determine which features follow the guidelines?

A1: There is extensive literature on features for specific applications (e.g., symmetry for face recognition).

A2: A widespread approach is to emulate the human mechanism (e.g., treating all the pixels as features in face recognition).
Example - handwritten recognition

The following 20 figures depict examples of handwritten digits. Since almost everyone can recognize each of the digits, it is reasonable to assume that treating all the pixels as features will ensure a successful pattern recognition process.

[Figure: 20 handwritten digit images.]
Feature selection
• When dealing with 28x28-pixel handwritten digits, the feature space dimension is 784.
• When dealing with face recognition, we may have about 200x200 pixels, i.e., a feature space dimension of 40,000.
• When dealing with voice recognition, we may have about 1 [sec] x 16 [kHz], i.e., a feature space dimension of 16,000.
Feature selection
Feature selection is needed because of:
• The complexity of most classification algorithms, which is O(n^2) in the number of features n
• Learning time
• The "curse of dimensionality"
Feature selection

"Feature selection, also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data are selected for application of a learning algorithm. Feature selection is necessary either because it is computationally infeasible to use all available features, or because of problems of estimation when limited data samples (but a large number of features) are present. The latter problem is related to the so-called curse of dimensionality." (WIKIPEDIA)
[Diagram: Object → feature extraction → high-dimension pattern → feature selection → low-dimension pattern.]
Feature selection

Q: How can one identify the irrelevant and the redundant features?

A: Several options:

1. Heuristics
   In most applications the relevant information resides in the low frequency range (e.g., images), hence it is logical to reduce dimensionality by taking the first coefficients of the Fourier/DCT transform.

2. Optimization approach (KLT, FLD)
   We may select a subset of the features (or linear combinations of the features) that best contributes to the separability of the clusters.

3. Grouping
   By grouping features, one can represent every set of features by a small set of features.
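A small NumPy sketch of option 1 (the signal is made up; the basis matrix reproduces the orthonormal DCT-II, the same transform as MATLAB's dct): for a smooth signal, the first few DCT coefficients carry almost all the energy, so truncation loses very little.

```python
import numpy as np

n = 256
m = np.arange(n)
k = m[:, None]
# Orthonormal DCT-II basis matrix.
C = np.cos(np.pi * (m + 0.5) * k / n) * np.sqrt(2.0 / n)
C[0, :] = np.sqrt(1.0 / n)

t = np.linspace(0.0, 1.0, n)
x = np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)   # smooth signal

X = C @ x                                  # DCT coefficients
X_trunc = np.where(m < 20, X, 0.0)         # keep only the first 20
x_rec = C.T @ X_trunc                      # inverse (C is orthogonal)

err = np.linalg.norm(x - x_rec) / np.linalg.norm(x)
print(err)   # small: the low frequencies carry almost all the energy
```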
Feature selection - Example

The separability of these two classes is pretty good. But do we really need two features?
[Figure: Males and Females clusters, weight [kg] vs. height [cm], alongside the distribution of each single feature.]
Feature selection - Example

No! The same separability can be achieved by projecting the patterns onto the blue axis, i.e., only a one-dimensional feature space is needed.

[Figure: Males and Females clusters with the projection axis (blue) and the distribution of the patterns projected onto it.]
Feature selection

But how can we find this "blue axis", i.e., the feature subspace onto which we should project the patterns?
Please refer to:
- Lec7.pdf - PCA and the curse of dimensionality
- Lec8.pdf - Fisher Linear Discriminant
in Prof. Olga Veksler's course: http://www.csd.uwo.ca/faculty/olga/Courses//CS434a_541a//index.html (http://www.csd.uwo.ca/faculty/olga/)
Example – Handwriting recognition
The following example describes the application of FLD to handwriting recognition.
The characters were derived from the MNIST database (http://yann.lecun.com/exdb/mnist/).
The example demonstrates the separability of two classes while reducing the dimensionality from 784 (28X28 pixels) to 2.
Example – Handwriting recognition

clear all;
load img0
load img1
N1 = length(img0);
N2 = length(img1);
D = 50;    % Low pass filtering
Dd = 2;    % Desired pattern dimension
%-----------------------------------------------------
% Test set synthesis
%-----------------------------------------------------
for i=1:N1,
    x = reshape(squeeze(img0(i,:,:)), 28*28, 1);
    X = dct(x);
    l1(:,i) = X(1:D);
end;
for i=1:N2,
    x = reshape(squeeze(img1(i,:,:)), 28*28, 1);
    X = dct(x);
    l2(:,i) = X(1:D);
end;
The img0 and img1 files contain N 28x28 arrays of grayscale images of the handwritten characters.

Reshape each 28x28 array into a 784x1 feature vector (pattern).

The DCT transform is recommended when dealing with mostly black-and-white images: the abundance of identical values (0 and 255) damages the FLD computations, and the DCT transform may alleviate this phenomenon.

Taking only the first 50 DCT coefficients decreases dimensionality (most of the energy resides in the low frequency range), but this step is not essential for this example.
Example – Handwriting recognition%--------------------------------------------------------------------------
% Compute Sb and Sw
%--------------------------------------------------------------------------
Mu1 = mean(l1')'; % Mean of cluster 1
Mu2 = mean(l2')'; % Mean of cluster 2
Mu = (Mu1*N1 + Mu2*N2)/(N1+N2); % Total mean of all patterns
Sw = zeros(D);
for i=1:N1,
Sw = Sw + (l1(:,i)-Mu1)*(l1(:,i)-Mu1)';
end;
for i=1:N2,
Sw = Sw + (l2(:,i)-Mu2)*(l2(:,i)-Mu2)';
end;
Sb = N1*(Mu1-Mu)*(Mu1-Mu)' + N2*(Mu2-Mu)*(Mu2-Mu)';
Compute the mean of each cluster, and the mean of all the patterns.

Compute the within-class scatter matrix:

S_W = \sum_{i=1}^{C} P_i \sum_{k=1}^{N_i} (x_{ki} - \mu_i)(x_{ki} - \mu_i)^T

Compute the between-class scatter matrix:

S_B = \sum_{i=1}^{C} P_i (\mu_i - \mu)(\mu_i - \mu)^T
Example – Handwriting recognition%--------------------------------------------------------------------------
% Compute V
% [W,D] = EIG(A,B) produces a diagonal matrix D of generalized
% eigenvalues and a full matrix W whose columns are the
% corresponding eigenvectors so that A*W = B*W*D.
%--------------------------------------------------------------------------
[W,D]=eig(Sb,Sw); % Solve the generalized eigenvalue problem
Lambda = diag(D); % Sort the eigenvalues.
[Lam,p]=sort(Lambda);
V=[]; % Build V from the eigenvectors
for i=1:Dd, % which are corresponding to the
V= [V W(:,p(end+1-i))]; % biggest eigenvalues
end;
Solve the generalized eigenvalue problem:

S_B V = S_W V \Lambda

Construct the projection matrix V such that its column vectors are the eigenvectors corresponding to the largest eigenvalues; this maximizes

J(V) = \frac{\det(V^T S_B V)}{\det(V^T S_W V)}
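The same steps can be sketched in NumPy (an illustration with made-up 5-D clusters standing in for the DCT patterns l1 and l2; since S_W is invertible here, the generalized problem S_B v = λ S_W v reduces to the ordinary eigenproblem of S_W^{-1} S_B):

```python
import numpy as np

rng = np.random.default_rng(4)
# Two hypothetical 5-D clusters separated along the first coordinate.
c1 = rng.normal(0.0, 1.0, (100, 5)) + np.array([3, 0, 0, 0, 0])
c2 = rng.normal(0.0, 1.0, (100, 5))

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)
mu = (len(c1) * mu1 + len(c2) * mu2) / (len(c1) + len(c2))

# Scatter matrices, as in the MATLAB code on the previous slides.
Sw = (c1 - mu1).T @ (c1 - mu1) + (c2 - mu2).T @ (c2 - mu2)
Sb = len(c1) * np.outer(mu1 - mu, mu1 - mu) \
   + len(c2) * np.outer(mu2 - mu, mu2 - mu)

# Sb v = lambda Sw v  <=>  (Sw^-1 Sb) v = lambda v  (Sw invertible here)
lam, W = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(lam.real)[::-1]
V = W[:, order[:2]].real       # the two leading eigenvectors

y1, y2 = c1 @ V, c2 @ V        # project onto the reduced space
```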
Example – Handwriting recognition%--------------------------------------------------------------------------
% Project the initial patterns onto the reduced space
%--------------------------------------------------------------------------
for i = 1:N1,
r1(:,i) = V' * l1(:,i);
end;
for i = 1:N2,
r2(:,i) = V' * l2(:,i);
end;
figure;
plot(r1(1,:),r1(2,:),'.',r2(1,:),r2(2,:),'or');
xlabel ('feature1');
ylabel ('feature2');
Project the initial patterns onto the reduced space:

y_i = V^T x_i

Plot the 2-dimensional clusters.
Example – Handwriting recognition

The figure below depicts the 2-dimensional clusters. It is easy to see that the separability of the clusters is perfect even in one dimension, which is not surprising because there are only two clusters.
[Figure: the two projected clusters in the (feature1, feature2) plane; the clusters are clearly separated.]
Bibliography

- Prof. Olga Veksler's pattern recognition course: http://www.csd.uwo.ca/faculty/olga/Courses//CS434a_541a//index.html
- Handwriting database (MNIST): http://yann.lecun.com/exdb/mnist/