The 5th annual UK Workshop on Computational Intelligence London, 5-7 September 2005 Department of Electronic & Electrical Engineering University College London, UK Learning Topic Hierarchies from Text Documents using a Scalable Hierarchical Fuzzy Clustering Method E. Mendes Rodrigues and L. Sacks {mmendes, lsacks}@ee.ucl.ac.uk http://www.ee.ucl.ac.uk/~mmendes/
The 5th annual UK Workshop on Computational Intelligence
London, 5-7 September 2005
Department of Electronic & Electrical Engineering
University College London, UK
Learning Topic Hierarchies from Text Documents using
a Scalable Hierarchical Fuzzy Clustering Method
E. Mendes Rodrigues and L. Sacks
{mmendes, lsacks}@ee.ucl.ac.uk
http://www.ee.ucl.ac.uk/~mmendes/
Outline
• Document clustering process
• H-FCM: Hyper-spherical Fuzzy C-Means
• H2-FCM: Hierarchical H-FCM
• Clustering experiments
• Topic hierarchies
Document Clustering Process
[Figure: document clustering process. Boxes labelled: Document Collection, Pre-processing, Document Encoding, Document Representation, Document Clustering (Document Similarity, Clustering Method), Document Clusters, Cluster Validity, Application]
• Identify all unique words in the document collection
• Discard common words that are included in the stop list
• Apply a stemming algorithm and combine identical word stems
• Discard terms using pre-processing filters
• Apply a term weighting scheme to the final set of k indexing terms
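The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides do not name the stop list, the stemming algorithm, the filters or the weighting scheme, so `STOP_WORDS`, `naive_stem`, the `min_df` document-frequency filter and the TF-IDF weighting are all hypothetical stand-ins.

```python
import math
import re
from collections import Counter

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to", "are"}


def naive_stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenise, drop stop words and stem; returns the list of stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


def build_document_vectors(docs, min_df=1):
    """Return (terms, X), where X[i][j] is the weight of term j in document i.

    Assumptions: the filter keeps stems found in at least min_df documents,
    and the weighting scheme is TF-IDF (term count * log(N / doc frequency)).
    """
    stem_counts = [Counter(preprocess(d)) for d in docs]

    # Document frequency: in how many documents each stem occurs.
    df = Counter()
    for counts in stem_counts:
        df.update(counts.keys())

    # Final set of k indexing terms after filtering.
    terms = sorted(t for t, n in df.items() if n >= min_df)

    n_docs = len(docs)
    X = [
        [counts[t] * math.log(n_docs / df[t]) for t in terms]
        for counts in stem_counts
    ]
    return terms, X


docs = [
    "The cats are sleeping",
    "A cat sleeps in the sun",
    "Clustering of documents",
]
terms, X = build_document_vectors(docs)
```

Each row of `X` is one document vector over the `len(terms)` indexing terms. Raising `min_df` shrinks the vocabulary, which is one simple way to reduce dimensionality before clustering.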
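Document vectors:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nk} \end{bmatrix}$$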
Vector-Space Model of Information Retrieval
Very high-dimensional
Very sparse (more than 95% of entries are zero)
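Because most entries of a document matrix are zero, it is common to store only the non-zero weights. The sketch below is an illustration only: the helper names `sparsity` and `to_sparse_rows` are hypothetical, and the tiny matrix is made-up example data rather than anything from the experiments.

```python
def sparsity(X):
    """Fraction of zero entries in a dense N x k matrix given as a list of rows."""
    total = sum(len(row) for row in X)
    zeros = sum(1 for row in X for x in row if x == 0)
    return zeros / total


def to_sparse_rows(X):
    """Keep only non-zero weights: one {term_index: weight} dict per document."""
    return [{j: x for j, x in enumerate(row) if x != 0} for row in X]


# Made-up 2 x 4 document matrix for illustration.
X_example = [
    [0, 2, 0, 0],
    [1, 0, 0, 0],
]
example_sparsity = sparsity(X_example)
sparse_rows = to_sparse_rows(X_example)
```

With this layout the memory cost grows with the number of non-zero weights rather than with N times k.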
Measures of Document Relationship
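$$D(x_A, x_B) = 1 - S(x_A, x_B) = 1 - \frac{\sum_{j=1}^{k} x_{Aj}\, x_{Bj}}{\left( \sum_{j=1}^{k} x_{Aj}^{2} \, \sum_{j=1}^{k} x_{Bj}^{2} \right)^{1/2}}$$

$$0 \le S(x_A, x_B) \le 1, \quad \forall A, B$$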
• FCM applies the Euclidean distance, which is inappropriate for high-dimensional text clustering
  - the non-occurrence of the same terms in both documents is handled in a similar way to the co-occurrence of terms
• Cosine (dis)similarity measure:
  - widely applied in Information Retrieval
  - represents the cosine of the angle between two document vectors
  - insensitive to different document lengths, since it is normalised by the length of the document vectors
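The contrast between the two measures can be checked on a small example. The sketch below assumes non-negative term weights (so that S lies between 0 and 1) and uses made-up vectors; the function names are hypothetical, and treating an all-zero vector as having similarity 0 is an assumption of this sketch.

```python
import math


def cosine_similarity(x_a, x_b):
    """S(x_A, x_B): cosine of the angle between two document vectors."""
    dot = sum(a * b for a, b in zip(x_a, x_b))
    norm = math.sqrt(sum(a * a for a in x_a) * sum(b * b for b in x_b))
    if norm == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / norm


def cosine_dissimilarity(x_a, x_b):
    """D(x_A, x_B) = 1 - S(x_A, x_B)."""
    return 1.0 - cosine_similarity(x_a, x_b)


def euclidean_distance(x_a, x_b):
    """Standard Euclidean distance, as used by the original FCM."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_a, x_b)))


# Made-up vectors over k = 4 indexing terms.
short_doc = [1.0, 2.0, 0.0, 0.0]
long_doc = [3.0, 6.0, 0.0, 0.0]   # same term proportions, three times longer
other_doc = [0.0, 0.0, 1.0, 0.0]  # shares no terms with short_doc

d_cos_same_topic = cosine_dissimilarity(short_doc, long_doc)
d_cos_other = cosine_dissimilarity(short_doc, other_doc)
d_euc_same_topic = euclidean_distance(short_doc, long_doc)
d_euc_other = euclidean_distance(short_doc, other_doc)
```

Here the cosine dissimilarity rates `short_doc` and `long_doc` as identical in direction and `other_doc` as maximally dissimilar, whereas the Euclidean distance places `short_doc` closer to `other_doc` than to `long_doc`, purely because of the difference in document length.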