Document Clustering (文件分類). Sung-Chien Lin (林頌堅), Department of Library and Information Studies, Shih-Hsin University (世新大學圖書資訊學系).


Contents

• Research on Document Clustering

• Possible Applications of Document Clustering

• Document Clustering in a Networked Environment

• Conclusions

Research on Document Clustering

Document Clustering

• Definition

– Documents with similar properties are assigned to automatically created groups

• Importance

– To improve the efficiency and effectiveness of retrieval

• Time

• Space

• Quality

– To determine the structure of the literature of a field

• Exploration of latent information in documents

• Reduction of users’ cognitive load

Block Diagram

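[Figure: block diagram of the clustering process. Document Set → Feature Extraction → Features → Clustering → Clustered Documents → Applications. A "Determining Clustering Parameters" step feeds Clustering with the cluster structure (nonhierarchical or hierarchical) and the halting criteria (number of desired clusters, number of iterations).]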

Research on Document Clustering

• Features to represent documents

– Linguistic structure in documents

• Co-occurrences of terms

• Semantic structure

– Metadata of documents

• Authors

• Citations

• Co-citation: two documents are related when the same later documents cite both of them

• Bibliographic coupling: two documents are related when both cite the same earlier documents


• Measures of relevance between documents

– Highly dependent on the choice of features used to represent documents

– Several relevance measures

• Vector space model (VSM, Salton)

• Latent semantic indexing (LSI, Schütze)

– Based on the Singular Value Decomposition (SVD) algorithm

– Reduces the dimensionality of the feature vectors in the VSM

– Exploits latent semantic features of documents
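The dimension reduction behind LSI can be sketched with a truncated SVD. This is a minimal illustration, assuming NumPy and a made-up term-document matrix; the function name `lsi_project` and the toy terms are hypothetical and do not come from the slides.

```python
import numpy as np

def lsi_project(term_doc, rank):
    """Project documents into a rank-dimensional latent space via truncated SVD."""
    # term_doc has shape (L terms, N documents).
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the `rank` largest singular values; each returned row is one document.
    return (np.diag(s[:rank]) @ vt[:rank]).T  # shape (N, rank)

# Toy term-document matrix (rows: terms, columns: documents); values are made up.
term_doc = np.array([
    [2.0, 1.0, 0.0, 0.0],   # "cluster"
    [1.0, 2.0, 0.0, 0.0],   # "retrieval"
    [0.0, 0.0, 1.0, 2.0],   # "zoology"
    [0.0, 0.0, 2.0, 1.0],   # "animal"
])
docs_2d = lsi_project(term_doc, rank=2)
print(docs_2d.shape)
```

In this toy case the first two documents share vocabulary and end up with the same latent coordinates, while the last two land in an orthogonal direction.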

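$$R(d_i, d_j) \overset{\mathrm{def}}{=} \frac{\sum_{k=1}^{L} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{L} w_{ik}^{2}} \, \sqrt{\sum_{k=1}^{L} w_{jk}^{2}}}$$

R(d_i, d_j): measure of relevance between documents d_i and d_j

w_{ik} and w_{jk}: weights of the kth term in d_i and d_j

$$w_{ik} \overset{\mathrm{def}}{=} tf_{ik} \cdot idf_k$$

tf_{ik}: frequency of the kth term in d_i

idf_k: inverse document frequency of the kth term

L: vocabulary size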
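As a rough sketch of the relevance measure, the snippet below computes tf-idf weights and the cosine between two weight vectors. It assumes the measure is the cosine form used in the vector space model and that idf is log(N / df), which is only one common variant; the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # idf_k = log(N / df_k) is an assumed variant; the slides do not fix one.
    idf = {t: math.log(n / c) for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def relevance(wi, wj):
    """Cosine measure R(d_i, d_j) over sparse weight dicts."""
    dot = sum(w * wj.get(t, 0.0) for t, w in wi.items())
    norm_i = math.sqrt(sum(w * w for w in wi.values()))
    norm_j = math.sqrt(sum(w * w for w in wj.values()))
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0

docs = [
    "document clustering improves retrieval".split(),
    "clustering groups similar document sets".split(),
    "zoology studies animal behaviour".split(),
]
w = tfidf_vectors(docs)
print(relevance(w[0], w[1]), relevance(w[0], w[2]))
```

The first two toy documents share terms and get a positive score; the third shares none with the first, so its score is zero.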


• Clustering algorithms

– Agglomerative hierarchical clustering algorithm (AHC)

• Algorithm

1. Put each document in the collection into its own cluster

2. Identify the two closest clusters and combine them into a new cluster

3. Repeat Step 2 until the halting criterion is met

• O(N^2)

– K-Means algorithm

• O(NK)

– Buckshot algorithm

• Fast, linear-time algorithm

• A K-Means algorithm whose initial cluster centroids are created by applying AHC to a sample of the documents in the collection
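The three algorithms above can be sketched together, since Buckshot is built from the other two. This is a minimal sketch assuming NumPy, Euclidean distance between cluster centroids as the closeness measure, and a sample of size sqrt(K·N) as in the Scatter/Gather paper; the slides fix none of these choices, and the AHC shown is a naive version written for clarity rather than speed.

```python
import numpy as np

def ahc(x, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(x))]          # Step 1: one document per cluster
    while len(clusters) > k:                         # Step 3: halting criterion
        best = None
        for a in range(len(clusters)):               # Step 2: find the closest pair
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(x[clusters[a]].mean(0) - x[clusters[b]].mean(0))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)               # b > a, so index a stays valid
    return clusters

def kmeans(x, centroids, iterations=10):
    """K-Means: assign each document to the nearest centroid, then recompute."""
    centroids = np.array(centroids, dtype=float)     # work on a copy
    for _ in range(iterations):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(len(centroids)):
            if np.any(labels == c):                  # keep the old centroid if a cluster empties
                centroids[c] = x[labels == c].mean(axis=0)
    return labels, centroids

def buckshot(x, k, rng):
    """Buckshot: seed K-Means with centroids from AHC on a random sample."""
    size = min(len(x), max(k, int(np.sqrt(k * len(x)))))
    sample = x[rng.choice(len(x), size=size, replace=False)]
    seeds = np.array([sample[c].mean(0) for c in ahc(sample, k)])
    return kmeans(x, seeds)

# Toy data: two well-separated groups of 20 two-dimensional "documents".
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels, _ = buckshot(x, k=2, rng=rng)
print(labels)
```

Running AHC only on the sample keeps its quadratic cost small, while the K-Means pass over the full collection stays linear in the number of documents.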

Possible Applications of Document Clustering

Query Routing

• Documents are distributed across several information servers

– Related documents are clustered and placed on the same server or on nearby servers

– A description is generated to represent all the documents in a cluster

• When retrieval takes place

– Identifying relevant clusters based on the relevance between the query and the cluster descriptions

– Forwarding the query to the servers that hold those clusters

– Merging the results

• An example

Query: document clustering

[Figure: the query is routed among clusters labeled Library Science, Computer Science, Zoology, and Geology]
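The routing steps can be sketched as follows. This is an illustrative sketch only: the cluster descriptions are made-up term lists, term overlap stands in for whatever relevance measure a real system would use, and `route_query` and `merge_results` are hypothetical names.

```python
def route_query(query, descriptions, top_n=1):
    """Rank servers by term overlap between the query and each cluster description."""
    q = set(query.lower().split())
    scores = {server: len(q & set(terms)) for server, terms in descriptions.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Forward only to servers whose description shares at least one query term.
    return [s for s in ranked[:top_n] if scores[s] > 0]

def merge_results(result_lists):
    """Merge per-server (doc_id, score) lists into one list sorted by score."""
    merged = [hit for hits in result_lists for hit in hits]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)

# Hypothetical cluster descriptions; the cluster names follow the slide's example.
descriptions = {
    "Library Science": ["document", "catalog", "clustering", "retrieval"],
    "Computer Science": ["algorithm", "clustering", "network"],
    "Zoology": ["animal", "species"],
    "Geology": ["rock", "mineral"],
}
targets = route_query("document clustering", descriptions, top_n=2)
print(targets)
```

Here the query would be forwarded to the Library Science and Computer Science servers, and the Zoology and Geology servers would not be contacted.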

Cluster-based Browsing

• It is difficult to express a vague information need as a formal query

• Scatter/Gather (Cutting et al., SIGIR’92)

– Clustering documents into topic-coherent groups

– Presenting descriptive summaries of the clusters to users

– Users browse the cluster hierarchy and select promising clusters

– Documents in the selected clusters are clustered again and new summaries are generated

– Finally, documents are retrieved

[Figure: top-level clusters Library Science, Computer Science, Zoology, and Geology, with subclusters Information Retrieval and Library Automation]

Result Set Clustering

• Users’ queries are often very short (about 1-3 words)

– The result set includes both relevant and irrelevant documents

• Clustering documents in the result set according to their degree of relevance

– Helps users figure out their real information needs

– Makes relevant documents easier to retrieve

• An example

Query: Multimedia

[Figure: result clusters Hypermedia, Virtual Reality, and Video]

Result Set Expansion

• Relevant documents may not match the input queries well

• Clustering related documents using sophisticated features and clustering algorithms during the data preparation phase

• Retrieving a core set of documents that match the query

• Expanding the results with documents not matching the query but clustered with the documents in the core set

[Figure: a query retrieves a core set, which is then expanded into the expanded result set]
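A small sketch of the expansion step, assuming the clusters were already computed in the preparation phase. The document IDs, term sets, and cluster assignments are invented for illustration, and exact term matching stands in for a real retrieval function.

```python
def expand_results(query_terms, docs, cluster_of):
    """Return (core, expanded): documents matching the query, plus their cluster mates."""
    # Core set: documents that contain every query term.
    core = [d for d, terms in docs.items() if query_terms <= terms]
    core_clusters = {cluster_of[d] for d in core}
    # Expansion: non-matching documents that share a cluster with a core document.
    extra = [d for d in docs if d not in core and cluster_of[d] in core_clusters]
    return core, core + extra

# Hypothetical collection and precomputed cluster assignments.
docs = {
    "d1": {"multimedia", "retrieval"},
    "d2": {"hypermedia", "video"},
    "d3": {"geology", "rock"},
}
cluster_of = {"d1": 0, "d2": 0, "d3": 1}
core, expanded = expand_results({"multimedia"}, docs, cluster_of)
print(core, expanded)
```

Document d2 never mentions the query term, yet it is returned because it was clustered with d1 beforehand.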

Query Refinement

• Terms in queries may not match users’ information needs

• Dynamically computing and suggesting recall- and precision-enhancing terms for a given query

• Term suggestion

– Grouping retrieved documents into topic-cohesive clusters

– Terms in centroid documents: general concepts

– Terms in marginal documents: specific concepts

Document Clustering in a Networked Environment

Web Pages vs. Plain Texts

• The lexical distributions of these two kinds of documents are significantly different

– Web pages include more proper nouns and terms but fewer verbs

• Information in web pages may be in multimedia form

– Currently difficult to represent and retrieve

• Web pages contain rich link information

– More than 90% of web pages include <A> tags

– Each web page contains 15 links on average

• Term-based clustering techniques designed for plain text are not directly applicable to web pages

• Link structure provides useful information for determining relevance among web pages

HTML Tags in Web Pages

• Tags provide helpful information for understanding the meaning expressed by the pages

– Tags for web composition

• Bold <B>, Italic <I>, Underline <U>, Font <Font>

– Tags for document structure

• Title <Title>

• Header <Head>

• Headline <H1>, <H2>, <H3>

• List item <Li>

– Tags for link structure across pages

• Anchor <A>

– Tagged terms are information that the authors consider important

• Tagged terms could be weighted more heavily to enhance retrieval effectiveness
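One way to weight tagged terms is sketched below with Python's standard `html.parser`. The weight values in `TAG_WEIGHTS` are hypothetical (the slides give none), and the parser is deliberately simple: tags that are never closed, such as a bare `<li>` or `<br>`, would keep boosting later text.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; the slides do not specify values.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 2.5, "h3": 2.0,
               "a": 2.0, "b": 1.5, "i": 1.5, "u": 1.5, "li": 1.2}

class WeightedTermParser(HTMLParser):
    """Accumulate term weights, boosting terms inside emphasized tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.weights = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy HTML).
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # A term takes the largest weight among its enclosing tags (default 1.0).
        boost = max([TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags], default=1.0)
        for term in data.lower().split():
            self.weights[term] += boost

parser = WeightedTermParser()
parser.feed("<html><head><title>Clustering</title></head>"
            "<body><p>web <b>clustering</b> survey</p></body></html>")
print(parser.weights)
```

The resulting weights could replace raw term frequencies when building the feature vectors for clustering.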

An Example of Web Page

[Figure: example web page with callouts marking anchor text, a list item, and text inside an <I> tag]

Connectivity Analysis

• A link between two pages establishes a relation between the two pages

• The similarity between two pages could be estimated using

– The length of the shortest path between the two pages

– The distance from the two pages to their least common ancestor

– The distance from the two pages to their greatest common descendant

[Figure: link tree with root A, second-level pages B, C, D, and third-level pages E, F, G, H, I, J]

E is more similar to A than D
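The shortest-path estimate can be sketched with a breadth-first search. The edge list below is a hypothetical site tree, not a reconstruction of the slide's figure, and links are treated as undirected, which is an assumption.

```python
from collections import deque

def shortest_path_length(links, start, goal):
    """Breadth-first search over the link graph; returns the hop count or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == goal:
            return dist
        for nxt in links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two pages are not connected

def undirected(edges):
    """Build an adjacency map that treats every link as two-way."""
    links = {}
    for a, b in edges:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links

# Hypothetical site tree for illustration.
links = undirected([("A", "B"), ("A", "C"), ("A", "D"),
                    ("B", "E"), ("B", "F"), ("D", "G")])
print(shortest_path_length(links, "E", "F"), shortest_path_length(links, "E", "G"))
```

In this toy tree, sibling pages E and F are two hops apart, while E and G sit under different parents and are four hops apart, so E would be judged more similar to F.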

Information of Link Structure

• Authority page: a page that contains a lot of information about the topic

– Authority: if page p has a link to page q, the author of page p confers authority on q

– Link popularity indicates page authority

• Hub page: a page that links to authority pages

• Mutually reinforcing relationship

– A good hub page points to many good authority pages

– A good authority page is pointed to by many good hub pages

[Figure: hub pages linking to authority pages]
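The mutually reinforcing relationship can be sketched as an iterative score update in the spirit of Kleinberg's hubs-and-authorities method. The link graph, the iteration count, and the function name `hits` are assumptions made for illustration.

```python
import math

def hits(links, iterations=20):
    """Iterate hub and authority scores over a {page: [linked pages]} map."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of pages that point to the page.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub: sum of the authority scores of pages the page points to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical link graph: three hub-like pages pointing to two candidate authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Page a1 is linked from all three hubs and ends up with the highest authority score, while h3, which points to only one authority, ends up as the weakest hub.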

Information of Anchor Text

• The text around links pointing to a page is often a description of the page

– The anchor text could be used to determine the relevance of the link

• Distribution of “Yahoo” in the anchor texts of 5000 web pages pointing to Yahoo!

From: http://decweb.ethz.ch/WWW7/1898/com1898.htm

Distance  -100  -75  -50  -25    0   25   50   75  100
Density      1    6   11   31  880   73  112   21    7
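A quick calculation shows how strongly the occurrences concentrate at the anchor itself. The figures are copied from the distribution above; the unit of distance is not stated on the slide, so the snippet treats it as an opaque offset.

```python
distance = [-100, -75, -50, -25, 0, 25, 50, 75, 100]
density = [1, 6, 11, 31, 880, 73, 112, 21, 7]

total = sum(density)
# Share of occurrences located exactly at the anchor (distance 0).
at_anchor = density[distance.index(0)] / total
# Share located at the anchor or in the immediately adjacent offsets.
near_anchor = sum(c for d, c in zip(distance, density) if abs(d) <= 25) / total
print(f"{total} occurrences, {at_anchor:.1%} at distance 0, {near_anchor:.1%} within 25")
```

Roughly three quarters of the occurrences fall at distance 0, which supports using anchor text as a description of the target page.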

Conclusions


• Document clustering is an important technique for improving efficiency and effectiveness in information retrieval

– It has a wide range of possible applications

• Technologies of document clustering

– Extraction of features to represent documents

– Relevance functions between documents

– Clustering algorithms

• Retrieval of web information relies more and more on information about the web’s structure

Important References

• P. Willett, “Recent Trends in Hierarchic Document Clustering: A Critical Review,” Information Processing and Management, 24(5), 577-597, 1988.

• E. Rasmussen, “Clustering Algorithms,” in Information Retrieval: Data Structures and Algorithms, ed. by W. B. Frakes and R. Baeza-Yates, Chap. 16, 419-442, 1992.

• D. R. Cutting, D. Karger and J. O. Pedersen, “A Cluster-based Approach to Browsing Large Document Collections,” Proceedings of SIGIR’92, 318-329.

• J. Kleinberg, “Authoritative Sources in a Hyperlinked Environment,” IBM Research Report RJ 10076, May 1997.
