Jan 25, 2016
Intelligent Database Systems Lab
Advisor : Dr. Hsu
Graduate: Ching-Lung Chen
Author : Pabitra Mitra
Student Member
National Yunlin University of Science and Technology
Unsupervised Feature Selection Using Feature Similarity
Outline
Motivation
Objective
Introduction
Feature Similarity Measure
Feature Selection Method
Feature Evaluation Indices
Experimental Results and Comparisons
Conclusions
Personal Opinion
Review
N.Y.U.S.T.
I.M.
Motivation
Conventional feature selection methods have high computational complexity when the data are large in both dimension and size.
Objective
Propose an unsupervised feature selection algorithm suitable for data sets that are large in both dimension and size.
Introduction 1/3
The sequential floating searches provide better results, though at the cost of a higher computational complexity.
Existing methods can be broadly classified into two categories:
1. Maximization of clustering performance: sequential unsupervised feature selection, maximum entropy, neuro-fuzzy approaches, etc.
2. Based on feature dependency and relevance: correlation coefficients, measures of statistical redundancy, linear dependence, etc.
Introduction 2/3
We propose an unsupervised algorithm that uses feature dependency/similarity for redundancy reduction, but requires no search.
A new similarity measure, called the maximal information compression index, is used in clustering. It is compared with the correlation coefficient and the least-square regression error.
Introduction 3/3
The proposed algorithm is geared toward two goals:
1. Minimizing the information loss.
2. Minimizing the redundancy present in the reduced feature subset.
Unlike most conventional algorithms, the feature selection algorithm does not search for the best subset; it can be computed in much less time than many indices used in other supervised and unsupervised feature selection methods.
Feature Similarity Measure
There are two approaches for measuring similarity between two random variables:
1. To nonparametrically test the closeness of probability distributions of the variables.
2. To measure the amount of functional dependency between the variables.
We discuss below two existing linear dependency measures:
1. Correlation Coefficient (ρ)
2. Least Square Regression Error (e)
Feature Similarity Measure
Correlation Coefficient (ρ)
ρ(x, y) = cov(x, y) / √(var(x) var(y))
var(): the variance of a variable; cov(): the covariance between two variables.
1. 0 ≤ 1 − |ρ(x, y)| ≤ 1.
2. 1 − |ρ(x, y)| = 0 if x and y are linearly related.
3. 1 − |ρ(x, y)| = 1 − |ρ(y, x)| (symmetric).
4. If u = (x − a)/c and v = (y − b)/d for some constants a, b, c, d, then 1 − |ρ(x, y)| = 1 − |ρ(u, v)|; the measure is invariant to scaling and translation of the variables.
5. The measure is sensitive to rotation of the scatter diagram in the (x, y) plane.
Feature Similarity Measure
Least Square Regression Error (e)
e(x, y): the error in predicting y from the linear model y = a + bx, where a and b are the regression coefficients obtained by minimizing the mean square error.
The coefficients are given by b(x, y) = cov(x, y) / var(x) and a(x, y) = ȳ − b x̄, and the mean square error is e(x, y) = var(y)(1 − ρ(x, y)²).
Feature Similarity Measure
Least Square Regression Error (e)
1. 0 ≤ e(x, y) ≤ var(y).
2. e(x, y) = 0 if x and y are linearly related.
3. e(x, y) ≠ e(y, x) (asymmetric).
4. If u = x/c and v = y/d for some constants c, d, then e(x, y) = d² e(u, v); the measure e is sensitive to scaling of the variables.
5. The measure e is sensitive to rotation of the scatter diagram in the (x, y) plane.
Feature Similarity Measure
Maximal Information Compression Index (λ₂)
Let Σ be the covariance matrix of random variables x and y. Define the maximal information compression index λ₂(x, y) as the smallest eigenvalue of Σ:
2 λ₂(x, y) = (var(x) + var(y)) − √( (var(x) + var(y))² − 4 var(x) var(y) (1 − ρ(x, y)²) )
λ₂ = 0 when the features are linearly dependent, and it increases as the amount of dependency decreases.
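The three measures above can be computed directly from their definitions. Below is a minimal Python sketch (function and variable names are illustrative, not from the paper's code) that also verifies that λ₂ equals the smallest eigenvalue of the 2×2 covariance matrix:

```python
import numpy as np

def similarity_measures(x, y):
    """Compute rho, e, and lambda_2 from the slide formulas (a sketch)."""
    vx, vy = np.var(x), np.var(y)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    rho = cov / np.sqrt(vx * vy)              # correlation coefficient
    e = vy * (1.0 - rho ** 2)                 # least square regression error
    s = vx + vy
    lam2 = 0.5 * (s - np.sqrt(s * s - 4.0 * vx * vy * (1.0 - rho ** 2)))
    return rho, e, lam2

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + 0.5 * rng.normal(size=1000)     # y strongly depends on x
rho, e, lam2 = similarity_measures(x, y)

# lambda_2 equals the smallest eigenvalue of the 2x2 covariance matrix
cov_mat = np.cov(np.stack([x, y]), bias=True)
assert np.isclose(lam2, np.linalg.eigvalsh(cov_mat).min())
```

For perfectly linearly related features all three measures report full dependency (|ρ| = 1, e = 0, λ₂ = 0).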
Feature Similarity Measure
The corresponding loss of information in reconstructing the pattern is equal to the eigenvalue along the direction normal to the principal component.
Hence, λ₂ is the amount of reconstruction error committed if the data are projected onto a reduced dimension in the best possible way.
Therefore, it is a measure of the minimum amount of information loss, or the maximum amount of information compression.
Feature Similarity Measure
The significance of λ₂ can also be explained geometrically in terms of linear regression.
The value of λ₂(x, y) is equal to the sum of the squares of the perpendicular distances of the points (x, y) to the best fit line ŷ = â + b̂x.
The coefficients of such a best fit line are given by b̂ = cot θ̂ and â = ȳ − x̄ cot θ̂ (so the line passes through the sample means), where θ̂ is the orientation angle of the best fit line.
Feature Similarity Measure
λ₂ has the following properties:
1. 0 ≤ λ₂(x, y) ≤ ½ (var(x) + var(y)).
2. λ₂(x, y) = 0 if and only if x and y are linearly related.
3. λ₂(x, y) = λ₂(y, x) (symmetric).
4. The measure is sensitive to scaling of the variables, but invariant to their translation.
5. The measure is invariant to rotation of the variables about the origin.
Feature Selection method
The task of feature selection involves two steps:
1. Partition the original feature set into a number of homogeneous subsets (clusters).
2. Select a representative feature from each such cluster.
The partitioning of the features is based on the k-NN principle:
1. Compute the k nearest features of each feature.
2. Among them, the feature having the most compact subset is selected, and its k neighboring features are discarded.
3. The process is repeated for the remaining features until all of them are either selected or discarded.
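The steps above can be sketched as follows. This is a simplified sketch: the adaptive reduction of k via the error threshold (covered on a later slide) is omitted, and all names are illustrative:

```python
import numpy as np

def cluster_features(dissim, k):
    """Simplified sketch of the kNN-based feature clustering step.
    dissim: symmetric (D, D) matrix of pairwise feature dissimilarities."""
    remaining = set(range(dissim.shape[0]))
    selected = []
    while remaining:
        k_eff = min(k, len(remaining) - 1)
        if k_eff == 0:                        # a single feature left: keep it
            selected.extend(remaining)
            break
        best, best_r = None, np.inf
        for i in remaining:
            others = sorted(remaining - {i})
            r_i = np.sort(dissim[i, others])[k_eff - 1]  # dist. to kth neighbor
            if r_i < best_r:                  # most compact neighborhood wins
                best, best_r = i, r_i
        selected.append(best)
        others = sorted(remaining - {best})
        order = np.argsort(dissim[best, others])
        discarded = {others[j] for j in order[:k_eff]}
        remaining -= discarded | {best}       # drop the representative's neighbors
    return selected

# toy example: features {0, 1} and {2, 3} are near-duplicate pairs
dissim = np.array([[0.0, 0.1, 0.9, 0.8],
                   [0.1, 0.0, 0.85, 0.9],
                   [0.8, 0.85, 0.0, 0.05],
                   [0.8, 0.9, 0.05, 0.0]])
selected = cluster_features(dissim, k=1)      # one representative per pair
```

With k = 1 the sketch keeps one feature from each near-duplicate pair and discards the other, illustrating the redundancy reduction the method aims at.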
Feature Selection method
While determining the k nearest neighbors of features, we assign a constant error threshold ε, set equal to the distance of the kth nearest neighbor of the feature selected in the first iteration.
In later iterations, if the distance to the kth nearest neighbor of the selected feature is greater than ε, we decrease the value of k.
Feature Selection method
Let D be the original number of features, and let the original feature set be O = {Fᵢ, i = 1, ..., D}.
The dissimilarity between features Fᵢ and Fⱼ is represented by S(Fᵢ, Fⱼ). Let rᵢᵏ represent the dissimilarity between feature Fᵢ and its kth nearest-neighbor feature in R.
Feature Selection method
With respect to the dimension D, the method has complexity O(D²).
Evaluation of the similarity measure for a feature pair has complexity O(l), where l is the number of data points; thus, the feature selection scheme has overall complexity O(D²·l).
k acts as a scale parameter that controls the degree of detail in a direct manner.
The algorithm can work with a nonmetric similarity measure.
Feature Evaluation indices
We now describe some evaluation indices.
Indices that need class information:
1. Class separability
2. k-NN classification accuracy
3. Naïve Bayes classification accuracy
Indices that do not need class information:
1. Entropy
2. Fuzzy feature evaluation index
3. Representation entropy
Feature Evaluation indices
Class Separability
S = trace(S_b⁻¹ S_w)
S_w is the within-class scatter matrix; S_b is the between-class scatter matrix.
π_j is the a priori probability that a pattern belongs to class ω_j, and μ_j is the sample mean vector of class ω_j.
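The class separability index above can be sketched in Python as follows (a sketch, not the authors' code; a pseudo-inverse is used as a robustness assumption, since S_b can be rank-deficient when there are few classes):

```python
import numpy as np

def class_separability(X, y):
    """Sketch of S = trace(S_b^-1 S_w) from the slide's definitions."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)                 # a priori probabilities pi_j
    grand_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c, p in zip(classes, priors):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                 # sample mean vector of class c
        S_w += p * np.cov(Xc, rowvar=False, bias=True)   # within-class scatter
        diff = (mu - grand_mean)[:, None]
        S_b += p * (diff @ diff.T)           # between-class scatter
    return np.trace(np.linalg.pinv(S_b) @ S_w)

# three Gaussian classes; a larger shift means better-separated classes
rng = np.random.default_rng(1)
def toy(shift):
    X = np.vstack([rng.normal(size=(50, 2)) + np.array(c)
                   for c in [(0.0, 0.0), (shift, 0.0), (0.0, shift)]])
    return X, np.repeat([0, 1, 2], 50)

s_far = class_separability(*toy(10.0))
s_near = class_separability(*toy(1.0))
```

Small within-class scatter relative to between-class scatter yields a small value of S, so better-separated classes give a lower index.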
Feature Evaluation indices
K-NN Classification Accuracy
We use the k-NN rule for evaluating the effectiveness of the reduced feature set for classification.
We randomly select 10% of the data as the training set and classify the remaining 90% of the points.
Ten such independent runs are performed, and the average accuracy on the test set is reported.
Feature Evaluation indices
Naïve Bayes Classification Accuracy
A Bayes maximum likelihood classifier, assuming normal distribution of the classes, is used to evaluate classification performance.
The mean and covariance of the classes are estimated from a randomly selected 10% training sample, and the remaining 90% is used as the test set.
Feature Evaluation indices
Entropy
x_{p,j} denotes the feature value of pattern p along the jth direction.
The similarity between patterns p and q is given by sim(p, q) = e^(−α·D_pq), where D_pq is the distance between p and q and α is a positive constant; a possible value of α is −ln(0.5)/D̄, where D̄ is the average distance between data points computed over the entire data set.
If the data are uniformly distributed in the feature space, the entropy is maximum.
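A Python sketch of this index is below. The slide omits the final entropy sum, so the standard form from the underlying paper, E = −Σ_pq [sim·log sim + (1 − sim)·log(1 − sim)], is assumed here:

```python
import numpy as np

def entropy_index(X):
    """Sketch of the entropy evaluation index: sim(p, q) = exp(-alpha * D_pq)
    with alpha = -ln(0.5) / mean distance; the summation form is assumed."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))            # pairwise distances D_pq
    iu = np.triu_indices(len(X), k=1)
    d = D[iu]                                        # each pair counted once
    alpha = -np.log(0.5) / d.mean()                  # sim = 0.5 at mean distance
    sim = np.exp(-alpha * d)
    eps = 1e-12                                      # guard against log(0)
    return float(-np.sum(sim * np.log2(sim + eps)
                         + (1.0 - sim) * np.log2(1.0 - sim + eps)))

rng = np.random.default_rng(0)
uniform = rng.uniform(size=(60, 2))                  # spread-out data
clustered = np.vstack([rng.normal(0.0, 0.01, size=(30, 2)),
                       rng.normal(5.0, 0.01, size=(30, 2))])
```

Uniformly distributed data yields pairwise similarities near 0.5 and hence high entropy, while well-clustered data pushes similarities toward 0 or 1 and lowers the index.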
Feature Evaluation indices
Fuzzy Feature Evaluation Index (FFEI)
μ_pq^O and μ_pq^T are the degrees that patterns p and q belong to the same cluster in the original and transformed feature spaces, respectively.
The membership function may be defined as μ_pq = 1 − d_pq/D_max if d_pq ≤ D_max, and 0 otherwise, where d_pq is the distance between patterns p and q and D_max is a parameter.
The value of FFEI decreases as the intercluster distances increase.
Feature Evaluation indices
Representation Entropy
Let the eigenvalues of the d×d covariance matrix of a feature set of size d be λ_j, j = 1, ..., d, and define λ̃_j = λ_j / Σ_{j=1}^d λ_j.
λ̃_j has properties similar to a probability: 0 ≤ λ̃_j ≤ 1 and Σ_{j=1}^d λ̃_j = 1.
The representation entropy H_R = −Σ_{j=1}^d λ̃_j log λ̃_j is equivalent to the amount of redundancy present in that particular representation of the data set.
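The representation entropy can be sketched directly from this definition (the log base is a convention; natural log is used here, and the example data are illustrative):

```python
import numpy as np

def representation_entropy(X):
    """Entropy of the normalized covariance eigenvalues of a feature
    set (columns of X); low values indicate redundant features."""
    evals = np.linalg.eigvalsh(np.cov(X, rowvar=False, bias=True))
    lam = np.clip(evals, 0.0, None)          # clip tiny negative round-off
    lam_t = lam / lam.sum()                  # behaves like a probability
    nz = lam_t[lam_t > 0]
    return float(-np.sum(nz * np.log(nz)))

rng = np.random.default_rng(2)
base = rng.normal(size=(200, 1))
redundant = np.hstack([base, 2.0 * base, -base])   # rank 1: H_R near 0
independent = rng.normal(size=(200, 3))            # H_R near ln(3)
```

When all variance lies along one direction (maximal redundancy), H_R is minimal; when variance is spread equally over all directions, H_R is maximal.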
Experimental Results and Comparisons
Three categories of real-life public domain data sets are used:
low-dimensional (D ≤ 10), medium-dimensional (10 < D ≤ 100), and high-dimensional (D > 100).
Nine UCI data sets are used:
1. Isolet
2. Multiple Features
3. Arrhythmia
4. Spambase
5. Waveform
6. Ionosphere
7. Forest Cover Type
8. Wisconsin Cancer
9. Iris
Experimental Results and Comparisons
The proposed method is compared with four feature selection algorithms:
1. Branch and Bound Algorithm (BB)
2. Sequential Forward Search (SFS)
3. Sequential Floating Forward Search (SFFS)
4. Stepwise Clustering (SWC), using the correlation coefficient
In our experiments, we mainly used entropy as the feature selection criterion with the first three search algorithms.
Experimental Results and Comparisons
[Result tables and comparison figures omitted.]
Conclusions
An algorithm for unsupervised feature selection using feature similarity measures is described.
Our algorithm is based on pairwise feature similarity measures, which are fast to compute. Unlike other approaches, it does not explicitly optimize either classification or clustering performance.
We have defined a feature similarity measure called the maximal information compression index.
It is also demonstrated through extensive experiments that representation entropy can be used as an index for quantifying both redundancy reduction and information loss in a feature selection method.
Personal Opinion
We can learn from this method in our own feature selection experiments.
This similarity measure is valid only for numeric features; we can think about how to extend it to categorical features.
Review
1. Compute the k nearest features of each feature.
2. Among them, the feature having the most compact subset is selected, and its k neighboring features are discarded.
3. Repeat this process for the remaining features until all of them are either selected or discarded.
N.Y.U.S.T.
I.M.