Online Clustering of Web Search Results

Shixian Chu

Two papers:

O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998.

Dell Zhang and Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results, Apr 2004. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China

Introduction

The current state of information retrieval is far from satisfactory, for several possible reasons:

Many returned pages are useless or irrelevant;

Users may be interested in only a small part of the information returned, while the search engine returns thousands of pages;

Different users have different requirements and expectations for search results;

Sometimes a search request cannot be expressed clearly in just a few keywords;

The phenomena of synonymy (several words may correspond to the same concept) and polysemy (one word may have several different meanings) make things more complicated;

......

Search results clustering can help to solve some of these problems

Search results can be viewed as a database composed of thousands of documents.

All the results are clustered into hierarchical groups, with key phrases serving as the cluster names.

With hierarchical clusters, users can get an overview of the whole topic, or browse only the clusters that interest them and skip the irrelevant groups.

Example: clustered search results for the query “Jaguar”

“Web Document Clustering: A Feasibility Demonstration”

O. Zamir and O. Etzioni.

What’s new?

This paper introduces a linear-time (in the size of the document collection) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents.

The paper reports that STC is faster and more precise than standard clustering methods such as K-means and Buckshot.

Key requirements for Web document clustering methods:

Relevance: relevant and irrelevant docs are in different clusters

Browsable Summaries: key phrases that can summarize the cluster

Overlap: one doc may be in several clusters

Snippet-tolerance: produce high-quality clusters even when it only has access to the snippets returned by the search engines

Speed: high

STC has three logical steps:

(1) document “cleaning”,

(2) identifying base clusters using a suffix tree,

(3) combining these base clusters into clusters.

Step 1 - Document "Cleaning"

Deleting word prefixes and suffixes and reducing plural to singular

Marking sentence boundaries

Stripping non-word tokens (such as numbers, HTML tags and most punctuation)
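The cleaning step can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names `clean` and `naive_singular` are hypothetical, the plural rule is deliberately crude (a real system would use a proper stemmer), and sentence boundaries are assumed to be marked by `. ! ? ; :`.

```python
import re

def naive_singular(word):
    # Hypothetical, crude plural stripping; stands in for a real stemmer.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(snippet):
    """Return a list of sentences, each a list of cleaned lowercase words."""
    text = re.sub(r"<[^>]+>", " ", snippet)      # strip HTML tags
    sentences = re.split(r"[.!?;:]+", text)      # mark sentence boundaries
    cleaned = []
    for sentence in sentences:
        # Keep alphabetic tokens only: numbers and punctuation are dropped.
        words = [naive_singular(w.lower())
                 for w in re.findall(r"[A-Za-z]+", sentence)]
        if words:
            cleaned.append(words)
    return cleaned

result = clean("<b>Cats</b> ate 2 cheeses. Mice too!")
```

Keeping each sentence as a separate word list means that phrases found later never span a sentence boundary.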

Step 2 - Identifying Base Clusters

We treat documents as strings of words, not characters, so each suffix contains one or more whole words. In more precise terms, for a string S:

1. A suffix tree is a rooted, directed tree.

2. Each internal node has at least 2 children.

3. Each edge is labeled with a non-empty substring of S.

4. No two edges out of the same node can have edge-labels that begin with the same word (hence it is compact).

5. For each suffix s of S, there exists a suffix-node whose label equals s.


The following may be the snippets of three search result docs:

"cat ate cheese" (document 1)

"mouse ate cheese too" (document 2)

"cat ate mouse too" (document 3)

Step 2 - Identifying Base Clusters

Suffix tree of the strings "cat ate cheese", "mouse ate cheese too" and "cat ate mouse too"


Every parent (internal) node whose phrase occurs in two or more documents is a base cluster.
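To make the base clusters of the three example snippets concrete, here is a small sketch. It does not build a suffix tree: as a plainly named substitute, it enumerates every word phrase (quadratic per document, not linear) and keeps the phrases that would be internal nodes, that is, phrases followed by at least two different continuations. The function name `base_clusters` and the per-document end marker are assumptions made for illustration.

```python
from collections import defaultdict

def base_clusters(docs):
    """Map each shared phrase to the set of (1-based) document ids containing it."""
    docs_of = defaultdict(set)     # phrase -> ids of documents containing it
    followers = defaultdict(set)   # phrase -> distinct tokens that follow it
    for doc_id, text in enumerate(docs, start=1):
        words = text.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                phrase = tuple(words[i:j])
                docs_of[phrase].add(doc_id)
                # A unique end marker per document plays the role of the
                # terminator used in suffix tree construction (assumption).
                nxt = words[j] if j < len(words) else ("<end>", doc_id)
                followers[phrase].add(nxt)
    # Internal node: at least two distinct continuations.
    # Base cluster: additionally shared by two or more documents.
    return {" ".join(p): ids for p, ids in docs_of.items()
            if len(followers[p]) >= 2 and len(ids) >= 2}

snippets = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(snippets)
for phrase, ids in sorted(clusters.items()):
    print(phrase, sorted(ids))
```

Note that "cat" alone is not reported: it is always followed by "ate", so in a compact suffix tree it lies inside the edge labeled "cat ate" rather than at a node.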

Step 2 - Identifying Base Clusters

Each base cluster is assigned a score

s(B) = |B| · f(|P|)

where |B| is the number of documents in base cluster B, P is the phrase of cluster B, and |P| is the number of words in P that have a non-zero score. In the Zamir and Etzioni paper, the function f penalizes single-word phrases, is linear for phrases of two to six words, and is constant for longer phrases.

We maintain a stoplist that is supplemented with Internet-specific words (e.g., “previous”, “java”, “frames” and “mail”). Words that appear in the stoplist, or that appear in too few (3 or fewer) or too many (more than 80% of the collection) documents, receive a score of zero.
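The scoring rule can be sketched as below, assuming the score is the cluster size multiplied by a function f of the number of non-zero-score words in the phrase. The exact shape of `f` is not given on the slide, so the version here (0.5 for one word, linear up to six words, constant afterwards) is an assumption, as are the names `effective_length`, `f` and `score`. The stoplist holds only the four example words from the slide.

```python
STOPLIST = {"previous", "java", "frames", "mail"}  # slide examples only

def effective_length(phrase_words, doc_freq, n_docs, stoplist=STOPLIST):
    """Count the words of the phrase that have a non-zero score."""
    count = 0
    for word in phrase_words:
        df = doc_freq.get(word, 0)  # number of documents containing the word
        too_rare = df <= 3
        too_common = df > 0.8 * n_docs
        if word in stoplist or too_rare or too_common:
            continue                # this word scores zero
        count += 1
    return count

def f(n):
    # Assumed shape: penalize single words, linear for 2..6, constant beyond.
    if n == 0:
        return 0.0
    if n == 1:
        return 0.5
    return float(min(n, 6))

def score(cluster_size, phrase_words, doc_freq, n_docs):
    return cluster_size * f(effective_length(phrase_words, doc_freq, n_docs))

# Hypothetical document frequencies in a collection of 100 snippets.
doc_freq = {"suffix": 10, "tree": 12, "java": 20, "the": 95, "rare": 2}
print(score(5, ["suffix", "tree"], doc_freq, 100))
```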

Step 3 - Combining Base Clusters

Given two base clusters Bm and Bn, with sizes |Bm| and |Bn|, let |Bm∩Bn| be the number of documents common to both base clusters.

Similarity of Bm and Bn =

1 if |Bm∩Bn|/|Bm| > 0.5 and |Bm∩Bn|/|Bn| > 0.5

0 otherwise

(The threshold 0.5 is the value used in the Zamir and Etzioni paper.)
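The combining step can be sketched as follows, under two assumptions taken from the Zamir and Etzioni paper rather than from this slide: the overlap threshold is 0.5, and base clusters with similarity 1 are linked so that each connected component of the resulting graph becomes one final cluster. The names `similar` and `merge_base_clusters` are hypothetical.

```python
def similar(bm, bn, threshold=0.5):
    """Binary similarity: both overlap ratios must exceed the threshold."""
    common = len(bm & bn)
    return common / len(bm) > threshold and common / len(bn) > threshold

def merge_base_clusters(base, threshold=0.5):
    """Group base clusters into connected components of the similarity graph."""
    names = list(base)
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similar(base[a], base[b], threshold):
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    # Each final cluster: (its phrases, the union of its documents).
    return [(sorted(g), set().union(*(base[n] for n in g)))
            for g in groups.values()]

# Base clusters of the three example snippets.
EXAMPLE = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
final = merge_base_clusters(EXAMPLE)
```

In this tiny example every base cluster is similar to "ate", which contains all three documents, so everything collapses into a single final cluster. On real result sets the graph normally splits into several components.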


Experiments


“Semantic, Hierarchical, Online Clustering of Web Search Results”

Dell Zhang and Yisheng Dong.

What’s new?

A document or snippet is treated as a string of characters, not as a string of words

Groups Web search results semantically

Works not only for English but also for oriental languages like Chinese

Step 1 - Document "Cleaning"

Deleting word prefixes and suffixes and reducing plural to singular

Marking sentence boundaries

Stripping non-word tokens (such as numbers, HTML tags and most punctuation)

Step 2 – Key phrase extraction

Extract phrases of high (1) “completeness”, (2) “stability”, and (3) “significance” as key phrases.

DEFINITION: Completeness

Suppose phrase S occurs in k distinct positions p1, p2, … , pk in document D. S is “complete” if and only if the (pi-1)th token in D is different from the (pj-1)th token for at least one (i, j) pair, 1≤i<j≤k (called “left-complete”), and the (pi+|S|)th token is different from the (pj+|S|)th token for at least one (i, j) pair, 1≤i<j≤k (called “right-complete”).
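The completeness definition can be checked directly, as in the sketch below. It scans the document naively instead of using a suffix array, and it assumes that a missing neighbor at the start or end of the document counts as a token different from every real token. The function names are hypothetical. "At least one pair differs" is equivalent to "the neighboring tokens take at least two distinct values", which is what the code tests.

```python
def occurrences(tokens, phrase):
    """Start positions of every occurrence of phrase in tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def is_left_complete(tokens, phrase):
    pos = occurrences(tokens, phrase)
    # None stands for "no token": the phrase starts the document (assumption).
    before = {tokens[p - 1] if p > 0 else None for p in pos}
    return len(pos) >= 2 and len(before) >= 2

def is_right_complete(tokens, phrase):
    pos = occurrences(tokens, phrase)
    n = len(phrase)
    # None stands for "no token": the phrase ends the document (assumption).
    after = {tokens[p + n] if p + n < len(tokens) else None for p in pos}
    return len(pos) >= 2 and len(after) >= 2

def is_complete(tokens, phrase):
    return is_left_complete(tokens, phrase) and is_right_complete(tokens, phrase)

doc = "a b c d b c e".split()
print(is_complete(doc, ["b", "c"]))  # preceded by a/d, followed by d/e
print(is_complete(doc, ["b"]))       # always followed by c
```

Here "b c" is complete, while "b" alone is not right-complete because it is always followed by "c". The definition therefore prefers the longer phrase "b c" over its fragments.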

DEFINITION: Stability

DEFINITION: Significance

Suffix array---result of step 2
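For readers unfamiliar with the data structure, a suffix array lists the starting positions of all suffixes of a string in lexicographic order. The sketch below builds one naively by sorting the suffixes. It is only an illustration of the structure, not the construction or the phrase extraction used in the paper, and the function name is an assumption.

```python
def suffix_array(text):
    """Start indices of all suffixes of text, in lexicographic order."""
    # Naive construction: sort positions by the suffix that starts there.
    return sorted(range(len(text)), key=lambda i: text[i:])

sa = suffix_array("banana")
for i in sa:
    print(i, "banana"[i:])
```

Because suffixes that share a prefix end up next to each other in the array, repeated substrings can be found by comparing adjacent entries.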

Step 3 – Organizing Clusters

X threshold=0.5, y threshold=0.15

Thank you
