Document Clustering
Sung-Chien Lin
Department of Library and Information Studies
Shih-Hsin University
Contents
• Research on Document Clustering
• Possible Applications of Document Clustering
• Document Clustering in a Networked Environment
• Conclusions
Document Clustering
• Definition
– Documents with similar properties are assigned to automatically created groups
• Importance
– To improve the efficiency and effectiveness of retrieval
• Time
• Space
• Quality
– To determine the structure of the literature of a field
• Exploration of latent information in documents
• Reduction of users’ cognitive load
Block Diagram
[Figure: Document Set → Feature Extraction → Features → Determining Clustering Parameters → Clustering → Clustered Documents → Applications]
Clustering parameters:
• Cluster Structure: Nonhierarchical, Hierarchical
• Halting Criteria: Number of Desired Clusters, Number of Iterations
Research on Document Clustering
• Features to represent documents
– Linguistic structure in documents
• Co-occurrences of terms
• Semantic structure
– Metadata of documents
• Authors
• Citations
• Co-citation: based on the documents that cite the examined documents
• Bibliographic coupling: based on the documents that are cited by the examined documents
Research on Document Clustering
• Measures of relevance between documents
– Highly dependent on the choice of features used to represent documents
– Several relevance measures
• Vector space model (VSM, Salton)
• Latent semantic indexing (LSI, Schütze)
– Based on the Singular Value Decomposition (SVD) algorithm
– Reduces the dimensions of the feature vectors in VSM
– Exploits latent semantic features of documents
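$$R(d_i, d_j) \overset{\mathrm{def}}{=} \frac{\sum_{k=1}^{L} w_{ik}\, w_{jk}}{\sqrt{\sum_{k=1}^{L} w_{ik}^{2}}\,\sqrt{\sum_{k=1}^{L} w_{jk}^{2}}}$$
R(d_i, d_j): measure of relevance between documents d_i and d_j
w_ik and w_jk: weights of the kth term in d_i and d_j
$$w_{ik} \overset{\mathrm{def}}{=} tf_{ik} \times idf_{k}$$
tf_ik: frequency of the kth term in d_i
idf_k: inverse document frequency of the kth term
L: vocabulary size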
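The relevance measure with tf × idf weights can be sketched in a few lines of Python. This is a minimal illustration, not the slides' implementation: it assumes the measure is the cosine of the two weight vectors, that idf_k = log(N / df_k) (the slides do not fix the exact form), and that whitespace tokenization is good enough. The function names `tfidf_vectors` and `relevance` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with w_ik = tf_ik * idf_k."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumption: idf_k = log(N / df_k); other variants are common.
    idf = {term: math.log(n / count) for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in tokenized]

def relevance(wi, wj):
    """Cosine measure between two sparse weight vectors."""
    dot = sum(w * wj.get(t, 0.0) for t, w in wi.items())
    norm_i = math.sqrt(sum(w * w for w in wi.values()))
    norm_j = math.sqrt(sum(w * w for w in wj.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)

# Toy collection (made up for illustration).
docs = [
    "document clustering improves retrieval",
    "clustering of web documents",
    "geology of volcanic rocks",
]
vectors = tfidf_vectors(docs)
print(relevance(vectors[0], vectors[1]))  # shares the term "clustering"
print(relevance(vectors[0], vectors[2]))  # no shared terms
```

Storing only nonzero weights keeps each vector small even when the vocabulary size L is large.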
Research on Document Clustering
• Clustering algorithms
– Agglomerative hierarchical clustering (AHC) algorithm
• Algorithm
1. Put each document in the collection into its own cluster
2. Identify the two closest clusters and combine them into a new cluster
3. Repeat Step 2 until the halting criterion is met
• O(N^2)
– K-Means algorithm
• O(NK)
– Buckshot algorithm
• Fast, linear-time algorithm
• A K-Means algorithm in which the initial cluster centroids are created by applying AHC to a sample of the documents in the collection
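The three algorithms above fit together as follows: Buckshot runs AHC on a sample and hands the resulting centroids to K-Means. The sketch below is one possible reading, under assumptions the slides do not state: documents are dense numeric vectors, distance is Euclidean, AHC merges by centroid distance, and the sample size is about sqrt(K·N). All function names are hypothetical.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(xs) / len(points) for xs in zip(*points)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ahc(points, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]            # Step 1: one cluster per document
    while len(clusters) > k:                    # Step 3: halting criterion
        best = None
        for i in range(len(clusters)):          # Step 2: find the closest pair
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: distance(p, centroids[c]))
            groups[nearest].append(p)
        # Keep the previous centroid if a group ends up empty.
        centroids = [centroid(g) if g else c for g, c in zip(groups, centroids)]
    return groups, centroids

def buckshot(points, k, seed=0):
    """Seed K-Means with centroids obtained from AHC on a random sample."""
    rng = random.Random(seed)
    # Assumption: sample size of roughly sqrt(k * N), but at least k.
    sample_size = min(len(points), max(k, int(math.sqrt(k * len(points)))))
    sample = rng.sample(points, sample_size)
    seeds = [centroid(c) for c in ahc(sample, k)]
    return k_means(points, seeds)

# Two well-separated toy groups (made-up data).
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
groups, centroids = buckshot(points, k=2)
print(sorted(len(g) for g in groups))
```

Running the quadratic AHC step only on the sample is what keeps the overall cost close to linear in the number of documents.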
Query Routing
• Documents are distributed across several information servers
– Relevant documents are clustered and placed on the same server or on nearby servers
– A description is generated to represent all the documents in a cluster
• When retrieval takes place
– Relevant clusters are identified based on the relevance between the query and the cluster descriptions
– The query is forwarded to the servers holding those clusters
– The results are merged
• An example
Query: document clustering
Clusters: Library Science, Computer Science, Zoology, Geology
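A rough sketch of the routing step, assuming each server publishes a bag-of-words description of its cluster and that Jaccard term overlap stands in for the relevance measure. The descriptions, scores, and function names below are made up for illustration.

```python
def route_query(query, cluster_descriptions, top_n=1):
    """Return the servers whose cluster descriptions best match the query."""
    query_terms = set(query.lower().split())
    scores = {}
    for server, description in cluster_descriptions.items():
        desc_terms = set(description.lower().split())
        union = query_terms | desc_terms
        # Assumption: Jaccard overlap approximates query-cluster relevance.
        scores[server] = len(query_terms & desc_terms) / len(union) if union else 0.0
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [s for s in ranked[:top_n] if scores[s] > 0]

def merge_results(per_server_results):
    """Merge (score, doc_id) lists from several servers into one ranked list."""
    merged = [hit for hits in per_server_results for hit in hits]
    return sorted(merged, reverse=True)

# Hypothetical cluster descriptions for the four servers in the example.
descriptions = {
    "Library Science": "document classification cataloging clustering retrieval",
    "Computer Science": "clustering algorithms document indexing",
    "Zoology": "animal species behavior",
    "Geology": "rock mineral cluster formation",
}
targets = route_query("document clustering", descriptions, top_n=2)
print(targets)
print(merge_results([[(0.9, "a")], [(0.95, "b"), (0.2, "c")]]))
```

Only the selected servers receive the query, so the Zoology and Geology servers do no work for this example.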
Cluster-based Browsing
• The problem of expressing a vague information need as a formal query
• Scatter/Gather (Cutting et al., SIGIR’92)
– Documents are clustered into topic-coherent groups
– Descriptive summaries of the clusters are presented to users
– Users browse the summaries and determine which clusters in the hierarchy may be relevant
– Documents in the selected clusters are clustered again and new summaries are generated
– Finally, documents are retrieved
[Figure: top-level clusters Library Science, Computer Science, Zoology, Geology; subclusters Information Retrieval, Library Automation]
Result Set Clustering
• Users’ queries are often very short (about 1-3 words)
– The result set includes both relevant and irrelevant documents
• Clustering documents in the result set according to their degree of relevance
– Helps users figure out their real information needs
– Makes relevant documents easier to retrieve
• An example
Query: Multimedia
Clusters: Hypermedia, Virtual Reality, Video
Result Set Expansion
• Relevant documents may not match the input queries well
• Relevant documents are clustered during the data preparation phase, using sophisticated features and clustering algorithms
• Retrieving a core set of documents that match the query
• Expanding the results with documents not matching the query but clustered with the documents in the core set
[Figure: Query → Core Set → Expanded Result Set]
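The expansion step can be sketched as below. It assumes cluster assignments were computed offline and are available as a lookup table, and that the core set is found by exact term matching. The document texts, cluster ids, and the function name `expand_results` are hypothetical.

```python
def expand_results(query, documents, cluster_of):
    """Return (core, expansion): documents matching the query, then their cluster mates."""
    query_terms = set(query.lower().split())
    # Core set: documents containing every query term.
    core = [doc_id for doc_id, text in documents.items()
            if query_terms <= set(text.lower().split())]
    core_clusters = {cluster_of[doc_id] for doc_id in core}
    # Expansion: non-matching documents that share a cluster with a core document.
    expansion = [doc_id for doc_id in documents
                 if doc_id not in core and cluster_of[doc_id] in core_clusters]
    return core, expansion

# Made-up collection with precomputed cluster ids.
documents = {
    "d1": "multimedia systems overview",
    "d2": "hypermedia authoring tools",
    "d3": "volcanic rock formation",
}
cluster_of = {"d1": 0, "d2": 0, "d3": 1}
core, expansion = expand_results("multimedia", documents, cluster_of)
print(core, expansion)
```

Here "d2" never mentions the query term but is returned because it was clustered with "d1" in advance.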
Query Refinement
• Terms in queries do not match the information needs of users
• Dynamically computing and suggesting recall- and precision-enhancing terms for a given query
• Term suggestion
– Grouping retrieved documents into topic-cohesive clusters
– Terms in centroid documents: general concepts
– Terms in marginal documents: specific concepts
Web Pages vs. Plain Texts
• The lexical distributions of these two kinds of documents are significantly different
– Web pages include more proper nouns and terms but fewer verbs
• Information in web pages may be in multimedia form
– Currently difficult to represent and retrieve
• Web pages contain rich link information
– More than 90% of web pages include <A> tags
– Each web page contains 15 links on average
• Term-based clustering techniques designed for plain texts are not directly applicable to web pages
• Link structure provides useful information for determining relevance among web pages
HTML Tags in Web Pages
• Tags provide helpful information for understanding the meaning expressed by the pages
– Tags for page composition
• Bold <B>, Italic <I>, Underline <U>, Font <Font>
– Tags for document structure
• Title <Title>
• Header <Head>
• Headline <H1>, <H2>, <H3>
• List item <Li>
– Tags for link structure across pages
• Anchor <A>
– Tagged terms are information that the authors consider important
• Tagged terms could be given higher weights to enhance retrieval effectiveness
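One way to weight tagged terms is to parse the page and credit each term with the largest weight among its enclosing tags. The sketch below uses Python's standard `html.parser`; the numeric values in `TAG_WEIGHTS` are hypothetical, since the slides do not give any, and the class and function names are illustrative.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; untagged text counts as 1.0.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0,
               "a": 2.0, "b": 1.5, "i": 1.5, "u": 1.5, "li": 1.2}

class WeightedTermCounter(HTMLParser):
    """Accumulate a weight per term based on the tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)  # the parser lowercases tag names

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max([TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags], default=1.0)
        for term in data.lower().split():
            self.weights[term] += weight

def weighted_terms(html):
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return parser.weights

page = ("<html><head><title>Document Clustering</title></head><body>"
        "<h1>Clustering</h1><p>plain clustering text with <b>bold</b> terms</p>"
        "</body></html>")
weights = weighted_terms(page)
print(weights.most_common(3))
```

The resulting weights could replace raw term frequencies when building the feature vectors for web pages.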
Connectivity Analysis
• A link between two pages establishes a relation between the two pages
• The similarity between two pages could be estimated using
– The length of the shortest path between the two pages
– The path lengths between the two pages and their least common ancestor
– The path lengths between the two pages and their greatest common descendants
[Figure: link hierarchy over pages A through J]
E is more similar to A than D
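The first estimate, shortest path length, can be sketched with a breadth-first search. The link list below is a made-up hierarchy, not the one in the figure; links are treated as undirected, and mapping distance to similarity as 1 / (1 + d) is an assumption.

```python
from collections import deque

def shortest_path_length(links, source, target):
    """Breadth-first search over undirected links; None if the pages are not connected."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for nxt in neighbors.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def path_similarity(links, p, q):
    """Assumption: similarity decays as 1 / (1 + shortest path length)."""
    d = shortest_path_length(links, p, q)
    return 0.0 if d is None else 1.0 / (1.0 + d)

# Hypothetical link hierarchy.
links = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "E"), ("B", "F"), ("C", "G")]
print(shortest_path_length(links, "E", "F"))  # siblings under B
print(shortest_path_length(links, "E", "G"))  # connected only through A
```

Pages that share a parent end up closer, and therefore more similar, than pages that are connected only through the root.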
Information of Link Structure
• Authority page: a page that contains a lot of information about the topic
– Authority: if page p has a link to page q, the author of page p confers authority on q
– Link popularity → page authority
• Hub page: a page that has links to authority pages
• Mutually reinforcing relationship
– A good hub page points to many good authority pages
– A good authority page is pointed to by many good hub pages
[Figure: hubs pointing to authorities]
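The mutually reinforcing relationship can be computed iteratively, in the spirit of Kleinberg's hub and authority scores. This is a minimal sketch: the link graph is made up, the iteration count is fixed rather than tested for convergence, and scores are normalized to unit length after each round.

```python
import math

def hits(links, iterations=20):
    """Iteratively compute hub and authority scores from (source, target) links."""
    pages = sorted({p for link in links for p in link})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: sum(hub[src] for src, dst in links if dst == p) for p in pages}
        # A page's hub score is the sum of the authority scores of pages it links to.
        hub = {p: sum(auth[dst] for src, dst in links if src == p) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: three hub-like pages pointing at two authority-like pages.
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub, auth = hits(links)
print(max(auth, key=auth.get))  # page with the highest authority score
print(max(hub, key=hub.get))    # page with the highest hub score
```

In this toy graph "a1" collects links from all three hubs, so it ends up with the highest authority score, while "h3" points to only one authority and scores lower as a hub than "h1" and "h2".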
Information of Anchor Text
• The text around links pointing to a page is often a description of that page
– The anchor text could be used to determine the relevance of the link
• Distribution of “Yahoo” in the anchor texts of 5000 web pages pointing to Yahoo!
From: http://decweb.ethz.ch/WWW7/1898/com1898.htm
Distance  -100  -75  -50  -25    0   25   50   75  100
Density      1    6   11   31  880   73  112   21    7
Conclusions
• Document clustering is an important technique for improving efficiency and effectiveness in information retrieval
– It has a wide range of possible applications
• Technologies of document clustering
– Extraction of features to represent documents
– Relevance functions between documents
– Clustering algorithms
• Retrieval of web information relies more and more on information about web structure
Important References
• P. Willett, “Recent Trends in Hierarchic Document Clustering: A Critical Review,” Information Processing and Management, 24(5), 577-597.
• E. Rasmussen, “Clustering Algorithms,” Information Retrieval: Data Structures and Algorithms, ed. by W. B. Frakes and R. Baeza-Yates, Chap. 16, 419-442.
• D. R. Cutting, D. Karger and J. O. Pedersen, “A Cluster-based Approach to Browsing Large Document Collections,” Proceedings of SIGIR’92, 318-329.
• J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” IBM Research Report RJ 10076, May 1997.