A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng WWW 07

Dec 28, 2015

Alexia West

A New Suffix Tree Similarity Measure for Document Clustering

Hung Chim, Xiaotie Deng

WWW 07

1. Document Clustering

• Agglomerative Hierarchical Clustering (AHC)

• Suffix Tree Clustering (STC)

- commonly used for search result clustering

2-1. Suffix Tree Clustering

Ex: 3 documents

• cat ate cheese

• cat ate mouse too

• mouse ate cheese too

cat ate cheese
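
As a rough illustration of what the suffix tree captures for the three example documents, the following Python sketch enumerates every word-level phrase and records which documents contain it. It is a naive quadratic enumeration, not a real suffix tree (which is compact and can be built in linear time). The function name `phrase_index` and the 0-based document ids are assumptions made for this example.

```python
from collections import defaultdict


def phrase_index(docs):
    """Map each word-level phrase to the set of document ids containing it.

    Every phrase is a prefix of some suffix of a document, which is the
    information a word-level suffix tree stores on its paths.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        for start in range(len(words)):                   # every suffix
            for end in range(start + 1, len(words) + 1):  # every prefix of it
                index[" ".join(words[start:end])].add(doc_id)
    return index


docs = ["cat ate cheese", "cat ate mouse too", "mouse ate cheese too"]
index = phrase_index(docs)

# Phrases shared by at least two documents are base cluster candidates.
shared = {p: sorted(ids) for p, ids in index.items() if len(ids) >= 2}
for phrase, ids in sorted(shared.items()):
    print(phrase, ids)
```

Phrases shared by two or more documents, such as "cat ate" (documents 0 and 1) or "ate" (all three), are the candidates for base clusters. A compact suffix tree would not keep "cat" as a separate node here, since "cat" is always followed by "ate".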

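2-2. Base Cluster

score(B) = |B| · f(|P|)

• |B|: number of documents in base cluster B

• |P|: number of words in phrase P, ignoring stopwords and words that appear in <= 3 documents or in > 40% of the documents

• f: penalizes single-word phrases, constant for |P| > 6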
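
The scoring rule can be sketched in Python as follows. The slide only states that f penalizes single words and is constant above 6 words; the linear growth between 2 and 6 words follows the usual description of STC, and the penalty value `0.5` and all function names are assumptions chosen for illustration.

```python
def effective_length(phrase_words, doc_freq, n_docs, stopwords):
    """Count the words of a phrase that contribute to |P|.

    A word is ignored if it is a stopword, appears in 3 or fewer
    documents, or appears in more than 40% of the documents.
    """
    count = 0
    for word in phrase_words:
        df = doc_freq.get(word, 0)
        if word in stopwords or df <= 3 or df > 0.4 * n_docs:
            continue
        count += 1
    return count


def f(p, single_word_penalty=0.5):
    """Phrase-length factor (penalty value is a hypothetical choice)."""
    if p <= 0:
        return 0.0
    if p == 1:
        return single_word_penalty  # penalize single-word phrases
    return float(min(p, 6))         # linear for 2..6, constant above 6


def score(doc_ids, p):
    """score(B) = |B| * f(|P|) for a base cluster with document set doc_ids."""
    return len(doc_ids) * f(p)


print(score({1, 2, 3}, 2))
```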

2-3. Combining Base Cluster

• Keep the top k (= 500) base clusters

• Merge high-overlap base clusters

merge Bi & Bj iff

|Bi ∩ Bj| / |Bi| > 0.5 and

|Bi ∩ Bj| / |Bj| > 0.5
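
A minimal Python sketch of the merge step, assuming each base cluster is a non-empty set of document ids. Treating merged clusters as connected components (implemented here with a small union-find) follows the usual description of STC and is not stated on the slide; the function names are hypothetical.

```python
def should_merge(bi, bj, threshold=0.5):
    """True iff the overlap exceeds the threshold relative to both clusters."""
    common = len(bi & bj)
    return common / len(bi) > threshold and common / len(bj) > threshold


def merge_base_clusters(base_clusters, threshold=0.5):
    """Union the document sets of all transitively similar base clusters."""
    parent = list(range(len(base_clusters)))

    def find(x):
        # Follow parent links to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(base_clusters)):
        for j in range(i + 1, len(base_clusters)):
            if should_merge(base_clusters[i], base_clusters[j], threshold):
                parent[find(i)] = find(j)

    groups = {}
    for i, doc_ids in enumerate(base_clusters):
        groups.setdefault(find(i), set()).update(doc_ids)
    return list(groups.values())


# Base clusters of the three-document example (0-based document ids):
# "cat ate", "ate", "cheese", "mouse", "too", "ate cheese"
example = [{0, 1}, {0, 1, 2}, {0, 2}, {1, 2}, {1, 2}, {0, 2}]
print(merge_base_clusters(example))
```

In the example, {0, 1} and {1, 2} do not satisfy the condition directly (the overlap is exactly 0.5), but both are similar to {0, 1, 2}, so everything ends up in one cluster.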

2-4. Advantages

• High precision even when using only snippets

• Incremental and linear time

• Order independent

• No magic k (no preset number of clusters)

- but what about the top k base clusters and the 0.5 merge threshold?


3. New Suffix Tree Clustering

di^T = [tfidf(n1, di), tfidf(n2, di), …]

Group-average AHC (GAHC)
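
The two ingredients above can be sketched in Python under several assumptions: the features n1, n2, … are taken to be word-level phrases (standing in for suffix tree nodes), the tf-idf weight uses the common variant (1 + log tf) · log(1 + N/df), documents are compared by cosine similarity, and the clustering stops at a caller-supplied number of clusters. None of these details are stated on the slide, and all function names are hypothetical.

```python
import math
from collections import Counter


def phrase_counts(text):
    """Count every word-level phrase of a document (naive enumeration)."""
    words = text.split()
    return Counter(" ".join(words[s:e])
                   for s in range(len(words))
                   for e in range(s + 1, len(words) + 1))


def tfidf_vectors(docs):
    """One sparse vector (dict: phrase -> weight) per document."""
    counts = [phrase_counts(d) for d in docs]
    df = Counter(p for c in counts for p in c)  # document frequency per phrase
    n = len(docs)
    return [{p: (1 + math.log(tf)) * math.log(1 + n / df[p])
             for p, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def gahc(vectors, n_clusters):
    """Repeatedly merge the two clusters with the highest average
    pairwise similarity between their members."""
    sim = [[cosine(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(vectors))]

    def group_avg(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = max(pairs,
                   key=lambda xy: group_avg(clusters[xy[0]], clusters[xy[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


docs = ["cat ate cheese", "cat ate cheese too",
        "dog chased ball", "dog chased ball away"]
vectors = tfidf_vectors(docs)
print(gahc(vectors, 2))
```

This quadratic version is only meant to show the data flow from tf-idf vectors to pairwise similarities to merges; it makes no attempt at the efficiency a real implementation would need.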

4. Evaluation

• Use F-measure

precision(Ci, Gj) = |Ci ∩ Gj| / |Ci|

recall(Ci, Gj) = |Ci ∩ Gj| / |Gj|
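
For concreteness, here is a small Python sketch of these quantities, with clusters Ci and ground-truth classes Gj represented as sets of document ids. Combining precision and recall by their harmonic mean, and weighting each class by its size in the overall score, are the conventional choices and are assumed here; the slide does not spell them out.

```python
def precision(c, g):
    """Fraction of cluster c that belongs to class g."""
    return len(c & g) / len(c)


def recall(c, g):
    """Fraction of class g that is found in cluster c."""
    return len(c & g) / len(g)


def f_measure(c, g):
    """Harmonic mean of precision and recall (assumed F1 form)."""
    p, r = precision(c, g), recall(c, g)
    return 2 * p * r / (p + r) if p + r else 0.0


def overall_f(clusters, classes):
    """Size-weighted average, over classes, of the best-matching cluster's F.

    Assumes the classes do not overlap, so their sizes sum to the
    number of documents.
    """
    n = sum(len(g) for g in classes)
    return sum(len(g) / n * max(f_measure(c, g) for c in clusters)
               for g in classes)


print(f_measure({1, 2, 3, 4}, {3, 4, 5}))
```

In the printed example, precision is 2/4 and recall is 2/3, so the F-measure is 4/7.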

• OHSUMED Document Collection

- MeSH indexing terms

• RCV1 Document Collection

- categories


5. Comparison

• STC: seldom generates large clusters

• NSTC: not incremental