Jan 25, 2016
Intelligent Database Systems Lab
Advisor : Dr. Hsu
Graduate: Ching-Lung Chen
Author : Pabitra Mitra
Student Member
National Yunlin University of Science and Technology
Unsupervised Feature Selection Using Feature Similarity
Outline
Motivation
Objective
Introduction
Feature Similarity Measure
Feature Selection Method
Feature Evaluation Indices
Experimental Results and Comparisons
Conclusions
Personal Opinion
Review
N.Y.U.S.T.
I.M.
Motivation
Conventional feature selection methods have high computational complexity when the data are large in both dimension and size.
Objective
Propose an unsupervised feature selection algorithm suitable for data sets that are large in both dimension and size.
Introduction 1/3
The sequential floating searches provide better results, though at the cost of a higher computational complexity.
Existing methods can be broadly classified into two categories:
1. Maximization of clustering performance: sequential unsupervised feature selection, maximum entropy, neuro-fuzzy approaches, etc.
2. Based on feature dependency and relevance: correlation coefficients, measures of statistical redundancy, linear dependence, etc.
Introduction 2/3
We propose an unsupervised algorithm that uses feature dependency/similarity for redundancy reduction, but requires no search.
A new similarity measure, called the maximal information compression index, is used in clustering. It is compared with the correlation coefficient and the least-square regression error.
Introduction 3/3
The proposed algorithm is geared toward two goals:
1. Minimizing the information loss.
2. Minimizing the redundancy present in the reduced feature subset.
Unlike most conventional algorithms, the feature selection algorithm does not search for the best subset; it can be computed in much less time than many indices used in other supervised and unsupervised feature selection methods.
Feature Similarity Measure
There are two approaches for measuring similarity between two random variables:
1. To nonparametrically test the closeness of probability distributions of the variables.
2. To measure the amount of functional dependency between the variables.
We discuss below two existing linear dependency measures:
1. Correlation Coefficient (ρ)
2. Least Square Regression Error (e)
Feature Similarity Measure
Correlation Coefficient (ρ)
ρ(x, y) = cov(x, y) / √(var(x) var(y))
var(): the variance of a variable; cov(): the covariance between two variables.
1. 0 ≤ 1 − |ρ(x, y)| ≤ 1.
2. 1 − |ρ(x, y)| = 0 if x and y are linearly related.
3. 1 − |ρ(x, y)| = 1 − |ρ(y, x)| (symmetric).
4. If u = (x − a)/c and v = (y − b)/d for some constants a, b, c, d, then 1 − |ρ(x, y)| = 1 − |ρ(u, v)|; the measure is invariant to scaling and translation of the variables.
5. The measure is sensitive to rotation of the scatter diagram in the (x, y) plane.
Feature Similarity Measure
Least Square Regression Error (e)
e(x, y): the error in predicting y from the linear model y = a + bx, where a and b are the regression coefficients obtained by minimizing the mean square error.
The coefficients are given by b(x, y) = cov(x, y) / var(x) and a(x, y) = ȳ − b x̄, and the mean square error is e(x, y) = var(y)(1 − ρ(x, y)²).
Feature Similarity Measure
Least Square Regression Error (e)
1. 0 ≤ e(x, y) ≤ var(y).
2. e(x, y) = 0 if x and y are linearly related.
3. e(x, y) ≠ e(y, x) (asymmetric).
4. If u = x/c and v = y/d for some constants c, d, then e(x, y) = d² e(u, v); the measure e is sensitive to scaling of the variables.
5. The measure e is sensitive to rotation of the scatter diagram in the (x, y) plane.
Feature Similarity Measure
Maximal Information Compression Index (λ₂)
Let Σ be the covariance matrix of random variables x and y. Define the maximal information compression index λ₂(x, y) as the smallest eigenvalue of Σ:
2 λ₂(x, y) = (var(x) + var(y)) − √( (var(x) + var(y))² − 4 var(x) var(y) (1 − ρ(x, y)²) )
λ₂ = 0 when the features are linearly dependent, and it increases as the amount of dependency decreases.
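The three measures above can be computed directly from their definitions. Below is a minimal Python sketch (function and variable names are illustrative, not from the paper's code) that also verifies that λ₂ equals the smallest eigenvalue of the 2×2 covariance matrix:

```python
import numpy as np

def similarity_measures(x, y):
    """Compute rho, e, and lambda_2 from the slide formulas (a sketch)."""
    vx, vy = np.var(x), np.var(y)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    rho = cov / np.sqrt(vx * vy)              # correlation coefficient
    e = vy * (1.0 - rho ** 2)                 # least square regression error
    s = vx + vy
    lam2 = 0.5 * (s - np.sqrt(s * s - 4.0 * vx * vy * (1.0 - rho ** 2)))
    return rho, e, lam2

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 0.5 * rng.normal(size=1000)     # y strongly depends on x
rho, e, lam2 = similarity_measures(x, y)

# lambda_2 equals the smallest eigenvalue of the 2x2 covariance matrix
cov_mat = np.cov(np.stack([x, y]), bias=True)
assert np.isclose(lam2, np.linalg.eigvalsh(cov_mat).min())
```

For perfectly linearly related features all three measures report full dependency (|ρ| = 1, e = 0, λ₂ = 0).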
Feature Similarity Measure
The corresponding loss of information in reconstructing the pattern is equal to the eigenvalue along the direction normal to the principal component.
Hence, λ₂ is the amount of reconstruction error committed if the data are projected onto a reduced dimension in the best possible way.
Therefore, it is a measure of the minimum amount of information loss, or the maximum amount of information compression.
Feature Similarity Measure
The significance of λ₂ can also be explained geometrically in terms of linear regression.
The value of λ₂(x, y) is equal to the sum of the squares of the perpendicular distances of the points (x, y) to the best fit line ŷ = â + b̂x.
The coefficients of such a best fit line are given by b̂ = cot θ̂ and â = ȳ − x̄ cot θ̂ (so the line passes through the sample means), where θ̂ is the orientation angle of the best fit line.
Feature Similarity Measure
λ₂ has the following properties:
1. 0 ≤ λ₂(x, y) ≤ ½ (var(x) + var(y)).
2. λ₂(x, y) = 0 if and only if x and y are linearly related.
3. λ₂(x, y) = λ₂(y, x) (symmetric).
4. The measure is sensitive to scaling of the variables, but invariant to their translation.
5. The measure is invariant to rotation of the variables about the origin.
Feature Selection method
The task of feature selection involves two steps:
1. Partition the original feature set into a number of homogeneous subsets (clusters).
2. Select a representative feature from each such cluster.
The partitioning of the features is based on the k-NN principle:
1. Compute the k nearest features of each feature.
2. Among them, the feature having the most compact subset is selected, and its k neighboring features are discarded.
3. The process is repeated for the remaining features until all of them are either selected or discarded.
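The steps above can be sketched as follows. This is a simplified sketch: the adaptive reduction of k via the error threshold (covered on a later slide) is omitted, and all names are illustrative:

```python
import numpy as np

def cluster_features(dissim, k):
    """Simplified sketch of the kNN-based feature clustering step.
    dissim: symmetric (D, D) matrix of pairwise feature dissimilarities."""
    remaining = set(range(dissim.shape[0]))
    selected = []
    while remaining:
        k_eff = min(k, len(remaining) - 1)
        if k_eff == 0:                        # a single feature left: keep it
            selected.extend(remaining)
            break
        best, best_r = None, np.inf
        for i in remaining:
            others = sorted(remaining - {i})
            r_i = np.sort(dissim[i, others])[k_eff - 1]  # dist. to kth neighbor
            if r_i < best_r:                  # most compact neighborhood wins
                best, best_r = i, r_i
        selected.append(best)
        others = sorted(remaining - {best})
        order = np.argsort(dissim[best, others])
        discarded = {others[j] for j in order[:k_eff]}
        remaining -= discarded | {best}       # drop the representative's neighbors
    return selected

# toy example: features {0, 1} and {2, 3} are near-duplicate pairs
dissim = np.array([[0.0, 0.1, 0.9, 0.8],
                   [0.1, 0.0, 0.85, 0.9],
                   [0.8, 0.85, 0.0, 0.05],
                   [0.8, 0.9, 0.05, 0.0]])
selected = cluster_features(dissim, k=1)      # one representative per pair
```

With k = 1 the sketch keeps one feature from each near-duplicate pair and discards the other, illustrating the redundancy reduction the method aims at.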
Feature Selection method
While determining the k nearest neighbors of features, we assign a constant error threshold ε, set equal to the distance of the kth nearest neighbor of the feature selected in the first iteration.
In later iterations, if the distance to the kth nearest neighbor of the selected feature is greater than ε, we decrease the value of k.
Feature Selection method
Let D be the original number of features, and let the original feature set be O = {Fᵢ, i = 1, ..., D}.
The dissimilarity between features Fᵢ and Fⱼ is represented by S(Fᵢ, Fⱼ). Let rᵢᵏ represent the dissimilarity between feature Fᵢ and its kth nearest-neighbor feature in R.
Feature Selection method
With respect to the dimension D, the method has complexity O(D²).
Evaluation of the similarity measure for a feature pair has complexity O(l), where l is the number of data points; thus, the feature selection scheme has overall complexity O(D²·l).
k acts as a scale parameter that controls the degree of detail in a direct manner.
The algorithm can work with a nonmetric similarity measure.
Feature Evaluation indices
We now describe some evaluation indices.
Indices that need class information:
1. Class separability
2. k-NN classification accuracy
3. Naïve Bayes classification accuracy
Indices that do not need class information:
1. Entropy
2. Fuzzy feature evaluation index
3. Representation entropy
Feature Evaluation indices
Class Separability
S = trace(S_b⁻¹ S_w)
S_w is the within-class scatter matrix; S_b is the between-class scatter matrix.
π_j is the a priori probability that a pattern belongs to class ω_j, and μ_j is the sample mean vector of class ω_j.
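The class separability index above can be sketched in Python as follows (a sketch, not the authors' code; a pseudo-inverse is used as a robustness assumption, since S_b can be rank-deficient when there are few classes):

```python
import numpy as np

def class_separability(X, y):
    """Sketch of S = trace(S_b^-1 S_w) from the slide's definitions."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)                 # a priori probabilities pi_j
    grand_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c, p in zip(classes, priors):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                 # sample mean vector of class c
        S_w += p * np.cov(Xc, rowvar=False, bias=True)   # within-class scatter
        diff = (mu - grand_mean)[:, None]
        S_b += p * (diff @ diff.T)           # between-class scatter
    return np.trace(np.linalg.pinv(S_b) @ S_w)

# three Gaussian classes; a larger shift means better-separated classes
rng = np.random.default_rng(1)
def toy(shift):
    X = np.vstack([rng.normal(size=(50, 2)) + np.array(c)
                   for c in [(0.0, 0.0), (shift, 0.0), (0.0, shift)]])
    return X, np.repeat([0, 1, 2], 50)

s_far = class_separability(*toy(10.0))
s_near = class_separability(*toy(1.0))
```

Small within-class scatter relative to between-class scatter yields a small value of S, so better-separated classes give a lower index.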
Feature Evaluation indices
K-NN Classification Accuracy
We use the k-NN rule for evaluating the effectiveness of the reduced feature set for classification.
We randomly select 10% of the data as the training set and classify the remaining 90% of the points.
Ten such independent runs are performed, and the average accuracy on the test set is reported.
Feature Evaluation indices
Naïve Bayes Classification Accuracy
A Bayes maximum likelihood classifier, assuming normal distribution of the classes, is used to evaluate classification performance.
The mean and covariance of the classes are estimated from a randomly selected 10% training sample, and the remaining 90% is used as the test set.
Feature Evaluation indices
Entropy
x_{p,j} denotes the feature value of pattern p along the jth direction.
The similarity between patterns p and q is given by sim(p, q) = e^(−α·D_pq), where D_pq is the distance between p and q and α is a positive constant; a possible value of α is −ln(0.5)/D̄, where D̄ is the average distance between data points computed over the entire data set.
If the data are uniformly distributed in the feature space, the entropy is maximum.
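A Python sketch of this index is below. The slide omits the final entropy sum, so the standard form from the underlying paper, E = −Σ_pq [sim·log sim + (1 − sim)·log(1 − sim)], is assumed here:

```python
import numpy as np

def entropy_index(X):
    """Sketch of the entropy evaluation index: sim(p, q) = exp(-alpha * D_pq)
    with alpha = -ln(0.5) / mean distance; the summation form is assumed."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))            # pairwise distances D_pq
    iu = np.triu_indices(len(X), k=1)
    d = D[iu]                                        # each pair counted once
    alpha = -np.log(0.5) / d.mean()                  # sim = 0.5 at mean distance
    sim = np.exp(-alpha * d)
    eps = 1e-12                                      # guard against log(0)
    return float(-np.sum(sim * np.log2(sim + eps)
                         + (1.0 - sim) * np.log2(1.0 - sim + eps)))

rng = np.random.default_rng(0)
uniform = rng.uniform(size=(60, 2))                  # spread-out data
clustered = np.vstack([rng.normal(0.0, 0.01, size=(30, 2)),
                       rng.normal(5.0, 0.01, size=(30, 2))])
```

Uniformly distributed data yields pairwise similarities near 0.5 and hence high entropy, while well-clustered data pushes similarities toward 0 or 1 and lowers the index.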
Feature Evaluation indices
Fuzzy Feature Evaluation Index (FFEI)
μ_pq^O and μ_pq^T are the degrees that patterns p and q belong to the same cluster in the original and transformed feature spaces, respectively.
The membership function may be defined as μ_pq = 1 − d_pq/D_max if d_pq ≤ D_max, and 0 otherwise, where d_pq is the distance between patterns p and q and D_max is a parameter.
The value of FFEI decreases as the intercluster distances increase.
Feature Evaluation indices
Representation Entropy
Let the eigenvalues of the d×d covariance matrix of a feature set of size d be λ_j, j = 1, ..., d, and define λ̃_j = λ_j / Σ_{j=1}^d λ_j.
λ̃_j has properties similar to a probability: 0 ≤ λ̃_j ≤ 1 and Σ_{j=1}^d λ̃_j = 1.
The representation entropy H_R = −Σ_{j=1}^d λ̃_j log λ̃_j is equivalent to the amount of redundancy present in that particular representation of the data set.
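The representation entropy can be sketched directly from this definition (the log base is a convention; natural log is used here, and the example data are illustrative):

```python
import numpy as np

def representation_entropy(X):
    """Entropy of the normalized covariance eigenvalues of a feature
    set (columns of X); low values indicate redundant features."""
    evals = np.linalg.eigvalsh(np.cov(X, rowvar=False, bias=True))
    lam = np.clip(evals, 0.0, None)          # clip tiny negative round-off
    lam_t = lam / lam.sum()                  # behaves like a probability
    nz = lam_t[lam_t > 0]
    return float(-np.sum(nz * np.log(nz)))

rng = np.random.default_rng(2)
base = rng.normal(size=(200, 1))
redundant = np.hstack([base, 2.0 * base, -base])   # rank 1: H_R near 0
independent = rng.normal(size=(200, 3))            # H_R near ln(3)
```

When all variance lies along one direction (maximal redundancy), H_R is minimal; when variance is spread equally over all directions, H_R is maximal.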
Experimental Results and Comparisons
Three categories of real-life public domain data sets are used:
low-dimensional (D ≤ 10), medium-dimensional (10 < D ≤ 100), and high-dimensional (D > 100).
Nine UCI data sets are used:
1. Isolet
2. Multiple Features
3. Arrhythmia
4. Spambase
5. Waveform
6. Ionosphere
7. Forest Cover Type
8. Wisconsin Cancer
9. Iris
Experimental Results and Comparisons
The proposed method is compared with four feature selection algorithms:
1. Branch and Bound Algorithm (BB)
2. Sequential Forward Search (SFS)
3. Sequential Floating Forward Search (SFFS)
4. Stepwise Clustering (SWC), using the correlation coefficient
In our experiments, we mainly used entropy as the feature selection criterion with the first three search algorithms.
Experimental Results and Comparisons
[Result tables and comparison figures omitted.]
Conclusions
An algorithm for unsupervised feature selection using feature similarity measures is described.
Our algorithm is based on pairwise feature similarity measures, which are fast to compute. Unlike other approaches, it does not explicitly optimize either classification or clustering performance.
We have defined a feature similarity measure called the maximal information compression index.
It is also demonstrated through extensive experiments that representation entropy can be used as an index for quantifying both redundancy reduction and information loss in a feature selection method.
Personal Opinion
We can learn from this method in our own feature selection experiments.
This similarity measure is valid only for numeric features; we can think about how to extend it to categorical features.
Review
1. Compute the k nearest features of each feature.
2. Among them, the feature having the most compact subset is selected, and its k neighboring features are discarded.
3. Repeat this process for the remaining features until all of them are either selected or discarded.
N.Y.U.S.T.
I.M.