A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification
Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee
Accepted by IEEE Transactions on Knowledge and Data Engineering
Reporter: Yi-Xian Lin
National University of Tainan
2010/03/17
Outline
• Motivation & Objective
• Feature Reduction
• Feature Clustering
• Fuzzy Feature Clustering
• Text Classification
• Experimental results
• Advantages
Motivation & Objective
• In text classification, the dimensionality of the feature vector is
usually huge
• Problems with existing feature clustering methods
  ◦ The desired number of extracted features has to be specified in advance
  ◦ The variance of the underlying cluster is not considered when calculating similarities
• How can the dimensionality of feature vectors for text classification be reduced so that classification runs faster?
Feature Reduction
• Purpose
  ◦ Reduce the classifier's computational load
  ◦ Increase data consistency
• Techniques
  ◦ Eliminate redundant data
  ◦ Find representative data
  ◦ Reduce the dimensionality of the feature set
  ◦ Find the set of vectors that best separates the patterns
• Two approaches to feature reduction: feature selection and feature extraction
Feature Reduction
• Feature selection
  ◦ Let the word set $W = \{w_1, w_2, \ldots, w_m\}$ be the feature vector of the document set
  ◦ Find a new word set $W' = \{w'_1, w'_2, \ldots, w'_k\}$, $k < m$
  ◦ $W'$ is then used as the input for classification tasks
• Feature extraction
  ◦ Extracted features are obtained by a projection process through algebraic transformations
  ◦ Let a corpus of documents be represented as an $m \times n$ matrix $X \in \mathbb{R}^{m \times n}$
  ◦ Find an optimal transformation matrix $F^* \in \mathbb{R}^{m \times k}$
Feature Clustering
• Feature clustering is an efficient approach for feature reduction
• Groups all features into some clusters where features in a
cluster are similar to each other
• Let D be the matrix consisting of all the original documents with m features, and D' be the matrix consisting of the converted documents with k new features
• The new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ corresponds to a partition $\{W_1, W_2, \ldots, W_k\}$ of the original feature set W
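A minimal sketch of how a partition of the original features could turn D into D'. The slide does not say how each new feature value is computed, so summing the original feature values inside each cluster is an assumption made here for illustration only; `convert_documents` and the sample data are hypothetical.

```python
def convert_documents(D, partition):
    """Map each document (a row of D with m features) to k cluster-level features.

    partition -- list of k lists of 0-based feature indices (W_1, ..., W_k).
    Assumption: a new feature is the sum of the original features in its cluster.
    """
    return [[sum(row[i] for i in cluster) for cluster in partition] for row in D]


# Two documents, m = 4 original features, k = 2 clusters (made-up values).
D = [[1, 0, 2, 0],
     [0, 3, 0, 1]]
partition = [[0, 2], [1, 3]]

D_new = convert_documents(D, partition)
print(D_new)  # [[3, 0], [0, 4]]
```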
Fuzzy Feature Clustering
• A document set D of n documents d1,d2,...,dn
• Feature vector W of m words w1,w2,...,wm
• p classes c1,c2,...,cp
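• Construct one word pattern for each word in W:
  $x_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip} \rangle = \langle P(c_1|w_i), P(c_2|w_i), \ldots, P(c_p|w_i) \rangle$
  where
  $P(c_j|w_i) = \frac{\sum_{q=1}^{n} d_{qi} \times \delta_{qj}}{\sum_{q=1}^{n} d_{qi}}$, for $1 \le j \le p$
  Here $d_{qi}$ is the number of occurrences of $w_i$ in document $d_q$, and $\delta_{qj}$ is 1 if $d_q$ belongs to class $c_j$ and 0 otherwise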
Fuzzy Feature Clustering
$x_6 = \langle P(c_1|w_6), P(c_2|w_6) \rangle$
$P(c_2|w_6) = \frac{1 \times 0 + 2 \times 0 + 0 \times 0 + 1 \times 0 + 1 \times 1 + 1 \times 1 + 1 \times 1 + 1 \times 1 + 0 \times 1}{1 + 2 + 0 + 1 + 1 + 1 + 1 + 1 + 0} = 0.50$
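The word-pattern computation can be sketched in a few lines of Python. The counts and class indicators below are read off the P(c2|w6) expansion on this slide (nine documents, the first four in c1 and the last five in c2); the function name `word_pattern` and the 0-based class labels are illustrative choices, not part of the original.

```python
def word_pattern(counts, labels, num_classes):
    """Return <P(c_1|w), ..., P(c_p|w)> for a single word.

    counts[q] -- occurrences of the word in document q
    labels[q] -- 0-based class index of document q
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("word does not occur in any document")
    pattern = []
    for j in range(num_classes):
        # Sum the counts of the documents that belong to class j.
        hits = sum(c for c, lab in zip(counts, labels) if lab == j)
        pattern.append(hits / total)
    return pattern


w6_counts = [1, 2, 0, 1, 1, 1, 1, 1, 0]
doc_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 -> c1, 1 -> c2

x6 = word_pattern(w6_counts, doc_labels, 2)
print(x6)  # [0.5, 0.5]
```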
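Fuzzy Feature Clustering
• Let G be a cluster containing q word patterns $x_1, x_2, \ldots, x_q$
• Let $x_j = \langle x_{j1}, x_{j2}, \ldots, x_{jp} \rangle$, $1 \le j \le q$
• The mean $m = \langle m_1, m_2, \ldots, m_p \rangle$, with $m_i = \frac{\sum_{j=1}^{q} x_{ji}}{|G|}$
• The deviation $\sigma = \langle \sigma_1, \sigma_2, \ldots, \sigma_p \rangle$, with $\sigma_i = \sqrt{\frac{\sum_{j=1}^{q} (x_{ji} - m_i)^2}{|G|}}$, for $1 \le i \le p$
• The fuzzy similarity of a word pattern $x = \langle x_1, x_2, \ldots, x_p \rangle$ to cluster G:
  $\mu_G(x) = \prod_{i=1}^{p} \exp\left[ -\left( \frac{x_i - m_i}{\sigma_i} \right)^2 \right]$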
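As a rough illustration of the mean and deviation definitions, the sketch below computes both per dimension for a small cluster. The two patterns are made-up values and `cluster_stats` is a hypothetical helper; the deviation divides by |G|, as in the definition on this slide.

```python
import math


def cluster_stats(patterns):
    """Return (mean, deviation) vectors of a cluster of word patterns."""
    size = len(patterns)          # |G|
    p = len(patterns[0])          # number of classes
    mean = [sum(x[i] for x in patterns) / size for i in range(p)]
    dev = [
        math.sqrt(sum((x[i] - mean[i]) ** 2 for x in patterns) / size)
        for i in range(p)
    ]
    return mean, dev


# Hypothetical cluster with q = 2 patterns over p = 2 classes.
G = [[0.2, 0.8], [0.6, 0.4]]
mean, dev = cluster_stats(G)
print([round(v, 4) for v in mean])  # [0.4, 0.6]
print([round(v, 4) for v in dev])   # [0.2, 0.2]
```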
Fuzzy Feature Clustering
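• A word pattern close to the mean of a cluster is regarded as very similar to this cluster, i.e., $\mu_G(x) \approx 1$
• Suppose $m_1 = \langle 0.4, 0.6 \rangle$ and $\sigma_1 = \langle 0.3, 0.5 \rangle$; then for $x_2 = \langle 0.2, 0.8 \rangle$:
  $\mu_{G_1}(x_2) = \exp\left[ -\left( \frac{0.2 - 0.4}{0.3} \right)^2 \right] \times \exp\left[ -\left( \frac{0.8 - 0.6}{0.5} \right)^2 \right] = 0.6412 \times 0.8521 = 0.5464$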
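The similarity value in this example can be checked with a short script. This is only a sketch of the product-of-Gaussians membership function described on the slides; `fuzzy_similarity` is an illustrative name, and the code assumes every deviation is nonzero.

```python
import math


def fuzzy_similarity(x, mean, dev):
    """Membership degree of pattern x in a cluster with the given mean and deviation."""
    mu = 1.0
    for xi, mi, si in zip(x, mean, dev):
        # One Gaussian-shaped factor per class dimension.
        mu *= math.exp(-((xi - mi) / si) ** 2)
    return mu


mu = fuzzy_similarity([0.2, 0.8], [0.4, 0.6], [0.3, 0.5])
print(round(mu, 4))  # 0.5464
```

A pattern equal to the cluster mean gives exactly 1, which matches the statement that patterns close to the mean are very similar to the cluster.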
Fuzzy Feature Clustering
• A predefined threshold ρ, $0 \le \rho \le 1$
• If $\mu_{G_j}(x_i) \ge \rho$, $x_i$ passes the similarity test on cluster $G_j$
• If the user intends to have larger clusters, give a smaller threshold
• Two cases may occur
  ◦ There is no existing fuzzy cluster on which $x_i$ has passed the similarity test
    ◦ Create a new cluster $G_h$, $h = k + 1$ (k is the number of currently existing clusters), with $m_h = x_i$, $\sigma_h = \sigma_0$
    ◦ $\sigma_0 = \langle \sigma_0, \ldots, \sigma_0 \rangle$ is a user-defined constant vector
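The similarity test and the first case (no cluster passes, so a new one is created) might look like the sketch below. The dictionary layout, the helper names, the cluster size of 4, and the values ρ = 0.6 and σ0 = 0.5 are all assumptions chosen for illustration; only the rule itself comes from the slide.

```python
import math


def fuzzy_similarity(x, mean, dev):
    """Product of per-dimension Gaussian-shaped membership factors."""
    return math.prod(
        math.exp(-((xi - mi) / si) ** 2) for xi, mi, si in zip(x, mean, dev)
    )


def passing_clusters(x, clusters, rho):
    """Indices of clusters on which x passes the similarity test (mu >= rho)."""
    return [
        j for j, c in enumerate(clusters)
        if fuzzy_similarity(x, c["mean"], c["dev"]) >= rho
    ]


def create_cluster(x, sigma0):
    """New cluster G_h with m_h = x and sigma_h = <sigma0, ..., sigma0>."""
    return {"size": 1, "mean": list(x), "dev": [sigma0] * len(x)}


clusters = [{"size": 4, "mean": [0.4, 0.6], "dev": [0.3, 0.5]}]
x = [0.2, 0.8]
rho = 0.6  # hypothetical threshold; the membership here is about 0.5464

if not passing_clusters(x, clusters, rho):
    clusters.append(create_cluster(x, sigma0=0.5))

print(len(clusters))  # 2
```

Lowering `rho` below 0.5464 would let the pattern pass the test on the existing cluster instead, which is why a smaller threshold tends to produce larger clusters.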
Fuzzy Feature Clustering
• If there are existing clusters on which $x_i$ has passed the similarity test, let cluster $G_t$ be the cluster with the largest membership degree:
  $t = \arg\max_{1 \le j \le k} \left( \mu_{G_j}(x_i) \right)$
• Modification to cluster $G_t$:
  $m_{tj} = \frac{S_t \times m_{tj} + x_{ij}}{S_t + 1}$
  $\sigma_{tj} = \sqrt{A - B} + \sigma_0$
  $A = \frac{(S_t - 1)(\sigma_{tj} - \sigma_0)^2 + S_t \times m_{tj}^2 + x_{ij}^2}{S_t}$
  $B = \frac{S_t + 1}{S_t} \left( \frac{S_t \times m_{tj} + x_{ij}}{S_t + 1} \right)^2$
  for $1 \le j \le p$, and $S_t = S_t + 1$
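The incremental update of the winning cluster can be sketched as follows. This is one reading of the formulas on this slide, taking S_t to be the current number of patterns in G_t and computing A and B from the old mean and deviation before they are overwritten. The clamp `max(a - b, 0.0)` is an added guard against floating-point round-off, not part of the original, and the sample values are hypothetical.

```python
import math


def update_cluster(size, mean, dev, x, sigma0):
    """Fold pattern x into a cluster; return the new (size, mean, dev)."""
    new_mean, new_dev = [], []
    for m, s, xj in zip(mean, dev, x):
        # A and B use the values from before the update.
        a = ((size - 1) * (s - sigma0) ** 2 + size * m ** 2 + xj ** 2) / size
        b = (size + 1) / size * ((size * m + xj) / (size + 1)) ** 2
        new_mean.append((size * m + xj) / (size + 1))
        new_dev.append(math.sqrt(max(a - b, 0.0)) + sigma0)
    return size + 1, new_mean, new_dev


# A cluster freshly created from <0.2, 0.8> with sigma0 = 0.5, then updated.
size, mean, dev = 1, [0.2, 0.8], [0.5, 0.5]
size, mean, dev = update_cluster(size, mean, dev, [0.6, 0.4], sigma0=0.5)

print(size)                          # 2
print([round(v, 4) for v in mean])   # [0.4, 0.6]
print([round(v, 4) for v in dev])    # [0.7828, 0.7828]
```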
Fuzzy Feature Clustering
• The order in which the word patterns are fed in influences the
clusters obtained
• Sort all the patterns, in decreasing order, by their largest