Online Clustering of Web Search results Shixian Chu
Page 1: Online Clustering of Web Search results Shixian Chu.

Online Clustering of Web Search results

Shixian Chu

Page 2: Online Clustering of Web Search results Shixian Chu.

Two papers:

O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998.

D. Zhang and Y. Dong. Semantic, Hierarchical, Online Clustering of Web Search Results. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China, April 2004.

Page 3: Online Clustering of Web Search results Shixian Chu.

Introduction

The current state of information retrieval is far from satisfactory, for several possible reasons:

Many returned pages are useless or irrelevant;

Users may be interested in only a small part of the information returned, while the search engine returns thousands of pages;

Different users have different requirements and expectations for search results;

Page 4: Online Clustering of Web Search results Shixian Chu.

Sometimes a search request cannot be expressed clearly in just a few keywords;

Synonymy (several words may correspond to the same concept) and polysemy (one word may have several different meanings) complicate matters further;

......

Page 5: Online Clustering of Web Search results Shixian Chu.

Search results clustering can help to solve some of these problems.

The search results can be viewed as a database composed of thousands of documents.

All the results are clustered into hierarchical groups, with the “key phrases” serving as the cluster names.

With hierarchical clusters, users can get an overview of the whole topic, or select only the clusters of interest to browse and ignore the irrelevant groups.

Page 6: Online Clustering of Web Search results Shixian Chu.

Example: clustered search results for the query “Jaguar”

Page 7: Online Clustering of Web Search results Shixian Chu.

“Web Document Clustering: A Feasibility Demonstration”

O. Zamir and O. Etzioni.

Page 8: Online Clustering of Web Search results Shixian Chu.

What’s new?

This paper introduces a linear-time (in the size of the document collection) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents.

According to the paper, STC is faster and more precise than standard clustering methods such as K-means and Buckshot.

Page 9: Online Clustering of Web Search results Shixian Chu.

Key requirements for Web document clustering methods:

Relevance: relevant and irrelevant documents are placed in different clusters

Browsable summaries: key phrases that summarize each cluster

Overlap: one document may belong to several clusters

Snippet-tolerance: produce high-quality clusters even when only the snippets returned by the search engine are available

Speed: high

Page 10: Online Clustering of Web Search results Shixian Chu.

STC has three logical steps:

(1) document “cleaning”,

(2) identifying base clusters using a suffix tree,

(3) combining these base clusters into clusters.

Page 11: Online Clustering of Web Search results Shixian Chu.

Step 1 - Document "Cleaning"

Deleting word prefixes and suffixes and reducing plural to singular

Marking sentence boundaries

Stripping non-word tokens (such as numbers, HTML tags and most punctuation)
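The cleaning step can be illustrated with a minimal sketch. The function names `singular` and `clean` are hypothetical, and the plural rule is a deliberately naive stand-in for whatever stemming the paper actually uses; the sketch only shows the three operations listed above working together.

```python
import re

def singular(word):
    """Naive plural-to-singular rule (an assumption, not the paper's stemmer)."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(snippet):
    """Return a list of sentences, each a list of lower-cased, stemmed words."""
    text = re.sub(r"<[^>]+>", " ", snippet)       # strip HTML tags
    sentences = re.split(r"[.!?;:]+", text)       # mark sentence boundaries
    cleaned = []
    for sentence in sentences:
        # keep alphabetic tokens only: numbers and punctuation are dropped
        words = [singular(w) for w in re.findall(r"[A-Za-z]+", sentence.lower())]
        if words:
            cleaned.append(words)
    return cleaned

result = clean("<b>Cats</b> ate 3 cheeses. The mouse ate cheese too!")
print(result)
```

Keeping sentences as separate word lists preserves the sentence boundaries, so that later phrase matching does not run across them.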

Page 12: Online Clustering of Web Search results Shixian Chu.

Step 2 - Identifying Base Clusters

We treat documents as strings of words, not characters; thus each suffix contains one or more whole words. In more precise terms:

1. A suffix tree is a rooted, directed tree.

2. Each internal node has at least 2 children.

3. Each edge is labeled with a non-empty substring.

Page 13: Online Clustering of Web Search results Shixian Chu.

Step 2 - Identifying Base Clusters

4. No two edges out of the same node can have edge-labels that begin with the same word (hence it is compact).

5. For each suffix s of S, there exists a suffix-node whose label equals s.

Page 14: Online Clustering of Web Search results Shixian Chu.

Step 2 - Identifying Base Clusters

The following may be the snippets of three search result documents:

"cat ate cheese" --- document 1

"mouse ate cheese too" --- document 2

"cat ate mouse too" --- document 3

Page 15: Online Clustering of Web Search results Shixian Chu.

Step 2 - Identifying Base Clusters

"cat ate cheese", "mouse ate cheese too", "cat ate mouse too"

Page 16: Online Clustering of Web Search results Shixian Chu.

Step 2 - Identifying Base Clusters

Every parent (internal) node of the suffix tree is a base cluster: the documents below the node all share the phrase that labels it.
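The base clusters for the three example snippets can be reproduced with a short sketch. It does not build a real suffix tree and is not linear-time: it enumerates every word phrase directly and keeps the phrases that would sit at a branching node shared by at least two documents. The function name `base_clusters` and the end-of-document marker are assumptions made for illustration.

```python
from collections import defaultdict

def base_clusters(docs):
    """docs: list of word lists. Returns {phrase tuple: set of 1-based doc ids}."""
    doc_ids = defaultdict(set)     # phrase -> documents containing it
    followers = defaultdict(set)   # phrase -> distinct tokens that follow it
    for doc_id, words in enumerate(docs, start=1):
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase = tuple(words[start:end])
                doc_ids[phrase].add(doc_id)
                # a unique end marker per document plays the role of the
                # string terminator in a generalized suffix tree
                nxt = words[end] if end < len(words) else ("<end>", doc_id)
                followers[phrase].add(nxt)
    # a phrase labels a branching node when it has >= 2 distinct continuations;
    # it is a base cluster when it also covers >= 2 documents
    return {p: ids for p, ids in doc_ids.items()
            if len(ids) >= 2 and len(followers[p]) >= 2}

docs = [s.split() for s in ("cat ate cheese",
                            "mouse ate cheese too",
                            "cat ate mouse too")]
clusters = base_clusters(docs)
for phrase, ids in sorted(clusters.items()):
    print(" ".join(phrase), sorted(ids))
```

On this input the sketch yields six base clusters: "cat ate" {1, 3}, "ate" {1, 2, 3}, "cheese" {1, 2}, "mouse" {2, 3}, "too" {2, 3} and "ate cheese" {1, 2}.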

Page 17: Online Clustering of Web Search results Shixian Chu.

Step 2 - Identifying Base Clusters

Each base cluster is assigned a score:

s(B) = |B| · f(|P|)

where |B| is the number of documents in base cluster B, P is the phrase of cluster B, and |P| is the number of words in P that have a non-zero score.

We maintain a stoplist that is supplemented with Internet-specific words (e.g., “previous”, “java”, “frames” and “mail”). Words that appear in the stoplist, or that appear in too few (3 or fewer) or too many (more than 80% of the collection) documents, receive a score of zero.
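A minimal sketch of this scoring, assuming the score is the product |B| · f(|P|) as in Zamir and Etzioni's paper. The paper describes f only qualitatively (single-word phrases are penalized, f grows linearly for two to six words and is constant beyond that), so the concrete values in `f` below are assumptions, as are the function names. The document-frequency cut-offs follow the numbers on the slide.

```python
def effective_length(phrase, doc_freq, n_docs, stoplist,
                     min_df=3, max_ratio=0.8):
    """Count the words of the phrase that receive a non-zero score."""
    count = 0
    for word in phrase:
        df = doc_freq.get(word, 0)
        if word in stoplist:
            continue                      # stoplisted word scores zero
        if df <= min_df:
            continue                      # appears in too few documents
        if df > max_ratio * n_docs:
            continue                      # appears in too many documents
        count += 1
    return count

def f(length):
    """Assumed shape: 0 for empty, penalty for 1 word, linear up to 6, then flat."""
    if length == 0:
        return 0.0
    if length == 1:
        return 0.5                        # assumed penalty value
    return float(min(length, 6))

def base_cluster_score(phrase, cluster_size, doc_freq, n_docs, stoplist):
    """s(B) = |B| * f(|P|), with |P| the effective phrase length."""
    return cluster_size * f(effective_length(phrase, doc_freq, n_docs, stoplist))

# Hypothetical collection statistics for 100 snippets
doc_freq = {"web": 90, "document": 50, "clustering": 10, "java": 20, "rare": 2}
stoplist = {"java", "the"}
s1 = base_cluster_score(("web", "document", "clustering"), 10, doc_freq, 100, stoplist)
s2 = base_cluster_score(("java",), 5, doc_freq, 100, stoplist)
s3 = base_cluster_score(("clustering",), 4, doc_freq, 100, stoplist)
print(s1, s2, s3)
```

In the first call "web" appears in more than 80% of the snippets and is ignored, so the effective phrase length is 2 rather than 3.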

Page 18: Online Clustering of Web Search results Shixian Chu.

Step 3 - Combining Base Clusters

Given two base clusters Bm and Bn, with sizes |Bm| and |Bn|, let |Bm∩Bn| be the number of documents common to both base clusters.

Similarity of Bm and Bn =

1 if |Bm∩Bn|/|Bm| > 0.5 and |Bm∩Bn|/|Bn| > 0.5

0 otherwise
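The combination step can be sketched as follows, under two assumptions taken from Zamir and Etzioni's paper rather than from the slide text: the overlap threshold is 0.5, and the final clusters are the connected components of the graph whose edges join similar base clusters. The function names are hypothetical.

```python
def similar(bm, bn, threshold=0.5):
    """Binary similarity: both overlap ratios must exceed the threshold."""
    common = len(bm & bn)
    return common / len(bm) > threshold and common / len(bn) > threshold

def merge_base_clusters(base, threshold=0.5):
    """base: {phrase: set of doc ids}. Returns a list of (phrases, doc ids)."""
    phrases = list(base)
    parent = {p: p for p in phrases}      # union-find forest over base clusters

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for i, p in enumerate(phrases):
        for q in phrases[i + 1:]:
            if similar(base[p], base[q], threshold):
                parent[find(p)] = find(q)  # join the two components

    groups = {}
    for p in phrases:
        groups.setdefault(find(p), []).append(p)
    return [(sorted(members), set().union(*(base[m] for m in members)))
            for members in groups.values()]

# Base clusters of the "cat ate cheese" example
example = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
merged = merge_base_clusters(example)
print(merged)
```

In the example every base cluster is similar to "ate", so all six fall into a single cluster covering documents 1, 2 and 3, even though pairs such as "cat ate" and "cheese" are not directly similar (their overlap ratio is exactly 0.5, which does not exceed the threshold).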

Page 19: Online Clustering of Web Search results Shixian Chu.

Step 3 - Combining Base Clusters

Page 20: Online Clustering of Web Search results Shixian Chu.

Step 3 - Combining Base Clusters

Page 21: Online Clustering of Web Search results Shixian Chu.

Experiments

Page 22: Online Clustering of Web Search results Shixian Chu.

Experiments

Page 23: Online Clustering of Web Search results Shixian Chu.

“Semantic, Hierarchical, Online Clustering of Web Search Results”

Dell Zhang and Yisheng Dong.

Page 24: Online Clustering of Web Search results Shixian Chu.

What’s new?

A document or snippet is treated as a string of characters, not as a string of words.

Web search results are grouped semantically.

Works not only for English but also for oriental languages like Chinese.

Page 25: Online Clustering of Web Search results Shixian Chu.

Step 1 - Document "Cleaning"

Deleting word prefixes and suffixes and reducing plural to singular

Marking sentence boundaries

Stripping non-word tokens (such as numbers, HTML tags and most punctuation)

Page 26: Online Clustering of Web Search results Shixian Chu.

Step 2 – Key phrase extraction

Extract phrases of high (1) “completeness”, (2) “stability”, and (3) “significance” as key phrases.

Page 27: Online Clustering of Web Search results Shixian Chu.

DEFINITION: Completeness

Suppose phrase S occurs in k distinct positions p1, p2, …, pk in document D. S is “complete” if and only if it is both “left-complete” and “right-complete”. Left-complete: the (pi−1)th token in D differs from the (pj−1)th token for at least one pair (i, j), 1≤i<j≤k. Right-complete: the (pi+|S|)th token differs from the (pj+|S|)th token for at least one pair (i, j), 1≤i<j≤k.
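The completeness test can be checked with a direct, brute-force sketch. It scans the document for every occurrence of the phrase and compares the neighbouring tokens; treating the start and end of the document as a distinct boundary token (`None`) is an assumption, and the function names are hypothetical. Words are used as tokens here for readability, although the paper works on characters.

```python
def occurrences(tokens, phrase):
    """All start positions of phrase in tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def is_left_complete(tokens, phrase):
    """At least two occurrences are preceded by different tokens."""
    left = {tokens[p - 1] if p > 0 else None for p in occurrences(tokens, phrase)}
    return len(left) >= 2

def is_right_complete(tokens, phrase):
    """At least two occurrences are followed by different tokens."""
    n = len(phrase)
    right = {tokens[p + n] if p + n < len(tokens) else None
             for p in occurrences(tokens, phrase)}
    return len(right) >= 2

def is_complete(tokens, phrase):
    return is_left_complete(tokens, phrase) and is_right_complete(tokens, phrase)

tokens = "to be or not to be that is".split()
print(is_complete(tokens, ["to", "be"]))   # preceded by boundary/"not", followed by "or"/"that"
print(is_complete(tokens, ["to"]))         # always followed by "be"
print(is_complete(tokens, ["be"]))         # always preceded by "to"
```

Only "to be" is complete in this example: "to" can always be extended to the right and "be" to the left without losing any occurrence, so neither is a complete phrase on its own.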

Page 28: Online Clustering of Web Search results Shixian Chu.

DEFINITION: Stability

Page 29: Online Clustering of Web Search results Shixian Chu.

DEFINITION: Significance

Page 30: Online Clustering of Web Search results Shixian Chu.
Page 31: Online Clustering of Web Search results Shixian Chu.
Page 32: Online Clustering of Web Search results Shixian Chu.

Suffix array---result of step 2
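For reference, a suffix array lists the start positions of all suffixes of a string in lexicographic order. The sketch below is a naive construction by sorting, shown only to make the data structure concrete; it is not the construction used in the paper, and the function name is hypothetical.

```python
def suffix_array(text):
    """Start positions of the suffixes of text, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

sa = suffix_array("banana")
for i in sa:
    print(i, "banana"[i:])
```

Suffixes that share a prefix end up next to each other in the array ("a", "ana", "anana" and "na", "nana" in the example), which is what makes repeated substrings easy to locate.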

Page 33: Online Clustering of Web Search results Shixian Chu.

Step 3 – Organizing Clusters

Page 34: Online Clustering of Web Search results Shixian Chu.

X threshold=0.5, y threshold=0.15

Page 35: Online Clustering of Web Search results Shixian Chu.

Thank you