Introduction to Web Clustering
D. De Cao, R. Basili
Web Mining and Retrieval course, academic year 2008-9
June 26, 2009
Web Clust. Intro Web Clustering Engines KeySRC approach Tools for Web Clustering

Outline

Introduction to Web Clustering
Some Web Clustering engines
The KeySRC approach
Some tools for building a Web Clustering engine
    Yahoo Search API
    CLUTO - Family of Data Clustering Software Tools

Web data clustering - Basics

Organize data circulated over the Web into groups / collections in order to facilitate data availability and access, and at the same time meet user preferences.
The initial idea was to define a correlation distance / similarity measure between any two “elements”.

Why use Web Clustering?

Increasing Web information accessibility
Decreasing lengths in Web navigation pathways
Improving Web users' request servicing
Improving information retrieval
Improving content delivery on the Web
Understanding users’ navigation behavior
Integrating various data representation standards
Extending current Web information organizational practices

Web Directories vs. Web Clustering

Web Directory:

Represents a widespread scenario where the most relevant web pages are classified with respect to a predefined set of categories organized into a hierarchy.
Google and Yahoo! are well-known examples of such hierarchical organization of knowledge.

The Open Directory Project:
ODP, also known as Dmoz (from directory.mozilla.org, its original domain name), is a multilingual open content directory of World Wide Web links owned by Netscape that is constructed and maintained by a community of volunteer editors.

[Figure: Open Directory Project]

Web Directories are based on taxonomies.
Web Directories are a static view of the WWW.
Extending a Web Directory is a classification problem.

Web Clustering is totally unsupervised.
Clusters are dynamically generated according to user needs.
Irrelevant results are filtered out.
A label needs to be defined for each cluster.

Issues for Web Clustering

Representation for clustering
    How to represent a document?
    Full documents or snapshot?
Need a notion of similarity/distance
How many clusters?
    Fixed a priori?
    Completely data driven?
    Avoid “trivial” clusters - too large or too small
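
The representation and similarity questions above can be made concrete with a small sketch. It assumes a plain bag-of-words representation of short result texts and cosine similarity as the measure; both are illustrative choices rather than something the slides prescribe, and the function names and sample snippets are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its word tokens (assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical search-result snippets for an ambiguous query.
snippets = [
    "Jaguar cars official site: luxury sedans and sports cars",
    "Jaguar luxury cars for sale, new and used",
    "The jaguar is a big cat native to the Americas",
]
vectors = [bag_of_words(s) for s in snippets]

sim_cars = cosine_similarity(vectors[0], vectors[1])   # two car-related snippets
sim_mixed = cosine_similarity(vectors[0], vectors[2])  # car vs. animal snippet
print(f"cars vs cars: {sim_cars:.3f}, cars vs animal: {sim_mixed:.3f}")
```

With such short texts a single shared word carries a lot of weight, which is one reason the choice between full documents and shorter excerpts matters.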

Classic Document Clustering vs. Web Clustering

Web Clustering Architecture

Web Search API

Clusty

Carrot

Grokker

KartOO

KeySRC

Some Web Clustering Engines

Generalized suffix tree (from Zamir and Etzioni, 1998)

The KeySRC algorithm

1. Search results preprocessing
2. Construction of Generalized Suffix Tree (GST)
3. Extraction of keyphrases from GST (internal nodes of GST + ≤ 4 words + POS tagging)
4. Keyphrases clustering and label assignment
5. Cluster ranking
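
Steps 2 and 3 can be approximated with a rough sketch. It does not build a generalized suffix tree: it counts word n-grams of up to four words and keeps those that occur in at least two snippets, a simpler stand-in that yields the same kind of shared-phrase candidates on small inputs. POS tagging is omitted, and the function names, the `min_docs` threshold, and the toy snippets are assumptions made for illustration, not part of KeySRC.

```python
import re
from collections import defaultdict

MAX_WORDS = 4  # the algorithm above keeps phrases of at most 4 words


def tokenize(snippet):
    """Split a snippet into lower-case word tokens."""
    return re.findall(r"[a-z0-9]+", snippet.lower())


def shared_phrases(snippets, max_words=MAX_WORDS, min_docs=2):
    """Map each phrase (1..max_words words) to the snippets containing it.

    Only phrases found in at least min_docs snippets are returned
    (min_docs is an assumed threshold for this sketch).
    """
    docs_by_phrase = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = tokenize(snippet)
        for start in range(len(words)):
            for length in range(1, max_words + 1):
                if start + length > len(words):
                    break
                phrase = " ".join(words[start:start + length])
                docs_by_phrase[phrase].add(doc_id)
    return {p: d for p, d in docs_by_phrase.items() if len(d) >= min_docs}


# Toy snippets, chosen only to keep the output small.
snippets = [
    "cat ate cheese",
    "mouse ate cheese too",
    "cat ate mouse too",
]
candidates = shared_phrases(snippets)
for phrase, docs in sorted(candidates.items()):
    print(phrase, sorted(docs))
```

Each surviving phrase comes with the set of snippets that contain it, which is the raw material a later clustering and labeling step would work on.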

Yahoo! Search APIs example

CLUTO: Clustering High-Dimensional Datasets

About CLUTO
It is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.

Consists of both stand-alone programs and a library via which an application program can directly access the various clustering and analysis algorithms implemented in CLUTO.

Multiple classes of clustering algorithms:
    partitional, agglomerative and graph-partitioning based.
Multiple similarity/distance functions:
    Euclidean distance, cosine, correlation coefficient, extended Jaccard, user-defined.
Numerous novel clustering criterion functions and agglomerative merging schemes.
Traditional agglomerative merging schemes:
    single-link, complete-link, UPGMA
Extensive cluster visualization capabilities and output options:
    postscript, SVG, gif, xfig, etc.
Multiple methods for effectively summarizing the clusters:
    most descriptive and discriminating dimensions, cliques, and frequent itemsets.
Can scale to very large datasets containing hundreds of thousands of objects and tens of thousands of dimensions.
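
To illustrate how the three traditional merging schemes named above differ, here is a small sketch in plain Python. It does not use CLUTO or reproduce its interface; the function names, the Euclidean distance choice, and the sample points are assumptions made for this example.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def linkage_distance(c1, c2, points, scheme):
    """Distance between two clusters, given as lists of point indices."""
    pairwise = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if scheme == "single":      # closest pair of members
        return min(pairwise)
    if scheme == "complete":    # farthest pair of members
        return max(pairwise)
    if scheme == "upgma":       # unweighted average over all member pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown scheme: {scheme}")


def agglomerate(points, k, scheme="upgma"):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage_distance(
                clusters[pair[0]], clusters[pair[1]], points, scheme
            ),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Hypothetical 2-D points: two tight pairs and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 0.0)]
result = agglomerate(points, k=3, scheme="single")
print(sorted(result))
```

On well-separated data like this all three schemes agree; they tend to diverge on elongated or noisy clusters, where single-link chains points together and complete-link favors compact groups. This brute-force version recomputes every pairwise distance at each merge, so it is only suitable for tiny inputs.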