Cluster Selection

May 30, 2018

  • 8/9/2019 Cluster Selection

    1/21

    Improving Web Clustering by Cluster Selection

    By Vishal Rathore

    Regd. No. 0721215022

    (+91) 9861084119


    Web Search

    Iterative Process

    Problems with Standard Web Search:

    Many Irrelevant Results

    Single Long List

    Solution: Identify and Present Implicit Clusters


    Web Clustering

    1. Jaguar

    Official worldwide web site of Jaguar Cars.

    2. Apple - Mac OS X

    The Apple Mac OS X product page.

    3. Jaguar UK - R is for Racing

    The essence of the Jaguar breed

    4. Jaguar

    General information from Big Cats Online.

    5. Jaguar AU - Jaguar Cars

    Services and news

    6. Jaguar -- Defenders of Wildlife

    Size, appearance, life span and diet.

    Search Results for: Jaguar (1-6 of 70,000,000)

    Clusters

    1. Car

    2. Animal

    3. Mac OS

    4. Other

    4. Jaguar

    General information from Big Cats Online.

    6. Jaguar -- Defenders of Wildlife

    Size, appearance, life span and diet.


    Web Clustering Algorithms

    Many standard clustering algorithms.

    Text oriented clustering algorithms:

    STC - Suffix Tree Clustering

    ESTC - Improvement on STC


    Suffix Tree Clustering

    Clean Pages

    Identify Base Clusters

    Combine Base Clusters

    Rank/Select Clusters


    STC: Identify Base Clusters

    car model

    car

    animal

    mac os x

    Base clusters each given a score: # documents × phrase score

    [Figure labels: 5, 10, 4.5, 24]
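The base cluster step above can be sketched in a few lines. This is a minimal illustration, not the slides' implementation: it enumerates word n-grams instead of building a suffix tree, and `phrase_score`, `max_len`, and the sample documents are assumptions made up for the example. Only the rule "score = number of documents × phrase score" comes from the slide.

```python
from collections import defaultdict


def phrase_score(num_words):
    # Hypothetical weighting (an assumption): single words count for less,
    # and very long phrases are capped.
    if num_words == 1:
        return 0.5
    return float(min(num_words, 6))


def base_clusters(docs, max_len=3):
    """Map each phrase shared by at least two documents to (doc ids, score).

    A real STC implementation finds shared phrases with a suffix tree;
    plain n-gram enumeration is swapped in here to keep the sketch short.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    return {
        phrase: (ids, len(ids) * phrase_score(len(phrase.split())))
        for phrase, ids in phrase_docs.items()
        if len(ids) >= 2  # a phrase in a single document is not a cluster
    }


# Made-up snippets for illustration only.
docs = {
    1: "jaguar car model",
    2: "jaguar car dealer",
    3: "jaguar animal habitat",
}
clusters = base_clusters(docs)
```

Here the phrase "jaguar car" appears in two documents and has two words, so under the assumed weighting it scores 2 × 2.0 = 4.0, while the single word "jaguar" appears in all three and scores 3 × 0.5 = 1.5.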


    STC: Combining Base Clusters

    Merge Clusters Based On Overlap

    Merged Cluster Score is sum of base cluster scores

    [Figure labels: 30, 18, 12, 6, 7]
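A rough sketch of overlap-based merging with summed scores follows. The similarity rule used here (two base clusters are linked when their shared documents exceed half of each cluster) and the 0.5 threshold are assumptions, since the slide only says clusters are merged "based on overlap". Linked base clusters are grouped with a small union-find, and each group's score is the sum of its base cluster scores.

```python
def similar(a, b, threshold=0.5):
    # Assumed rule: the shared documents must exceed `threshold` of both
    # clusters. Base clusters are non-empty, so the divisions are safe.
    common = len(a & b)
    return common / len(a) > threshold and common / len(b) > threshold


def merge_base_clusters(base, threshold=0.5):
    """base: list of (doc id set, score). Returns merged (doc id set, score)."""
    parent = list(range(len(base)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of sufficiently overlapping base clusters.
    for i in range(len(base)):
        for j in range(i + 1, len(base)):
            if similar(base[i][0], base[j][0], threshold):
                parent[find(i)] = find(j)

    # Each connected group becomes one merged cluster; scores are summed.
    merged = {}
    for i, (docs, score) in enumerate(base):
        root = find(i)
        old_docs, old_score = merged.get(root, (set(), 0.0))
        merged[root] = (old_docs | docs, old_score + score)
    return list(merged.values())


# Made-up base clusters for illustration only.
base = [({1, 2, 3}, 12.0), ({2, 3, 4}, 6.0), ({7, 8}, 7.0)]
merged = merge_base_clusters(base)
```

With this input the first two base clusters share two of their three documents, so they merge into one cluster with score 12 + 6 = 18, and the third stays on its own.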


    STC: Rank/Select Clusters

    Sort Clusters by Score

    Select Best N
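The rank and select step amounts to a sort followed by a slice. The sketch below assumes clusters are represented as `(label, score)` pairs; the labels and scores are made up for illustration.

```python
def select_top_n(clusters, n):
    """Return the n highest-scoring (label, score) pairs, best first."""
    return sorted(clusters, key=lambda c: c[1], reverse=True)[:n]


# Made-up merged clusters for illustration only.
clusters = [("animal", 12.0), ("car", 30.0), ("mac os x", 7.0), ("car model", 18.0)]
top = select_top_n(clusters, 2)
# top == [("car", 30.0), ("car model", 18.0)]
```

Both selected clusters in this example are about cars, which hints at why plain top-N selection can over-represent a dominant topic.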


    Problems with STC

    STC is better than many other algorithms, but not good enough

    Scores: Poor Cluster Quality Measure

    Selection: Poor Coverage, Excessive Overlap


    ESTC: Better Cluster Scoring

    Base Cluster Scores OK

    Combined Cluster Scores BAD

    Overlap between clusters is over-counted in the sum

    Example - Particularly Similar Pages


    ESTC: Scoring Solution

    Solution: Eliminate the over-counting of the overlap

    Merged Cluster Score: Sum over document scores

    Document Score: Average phrase score of the base clusters in the merged cluster that contain the document
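The difference between the two scoring rules can be seen in a small sketch. It assumes each base cluster is given as a `(doc id set, phrase score)` pair and that the STC base cluster score is documents × phrase score. The function names and sample data are hypothetical.

```python
def stc_merged_score(base):
    """STC-style score: plain sum of the base cluster scores."""
    return sum(len(docs) * phrase for docs, phrase in base)


def estc_merged_score(base):
    """ESTC-style score: sum over documents of the average phrase score
    of the base clusters (within this merged cluster) containing them."""
    all_docs = set().union(*(docs for docs, _ in base))
    total = 0.0
    for doc in all_docs:
        phrase_scores = [phrase for docs, phrase in base if doc in docs]
        total += sum(phrase_scores) / len(phrase_scores)
    return total


# Three base clusters over the same three very similar pages (made-up data).
base = [({1, 2, 3}, 1.0), ({1, 2, 3}, 2.0), ({1, 2, 3}, 3.0)]
stc = stc_merged_score(base)    # 3*1 + 3*2 + 3*3 = 18.0
estc = estc_merged_score(base)  # 3 documents * average 2.0 = 6.0
```

The summed score triples because the same three pages are counted once per base cluster, while the per-document average counts each page once.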


    ESTC: Better Cluster Selection

    Top N Clusters BAD

    Dominant Topic over-represented

    [Figure labels: Cars, Animals, Mac OS, Other]


    ESTC: Smarter Selection

    Heuristic

    Minimize Overlap

    Maximize Coverage
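One way to turn "minimize overlap, maximize coverage" into a single number is sketched below. The slides do not give the formula, so the linear form `coverage - beta * overlap` and the weight `beta` are assumptions for illustration. Each cluster is a set of document ids.

```python
def coverage(selected):
    """Number of distinct documents covered by the selected clusters."""
    return len(set().union(*selected))


def overlap(selected):
    """Number of repeated document appearances across the selected clusters."""
    return sum(len(c) for c in selected) - coverage(selected)


def heuristic(selected, beta=1.0):
    # Assumed combination: reward coverage, penalise overlap by `beta`.
    return coverage(selected) - beta * overlap(selected)


# Made-up clusters for illustration only.
cars = {1, 2, 3, 4}
car_dealers = {1, 2, 3, 5}
animals = {6, 7}

similar_pair = heuristic([cars, car_dealers])  # 5 covered, 3 repeated -> 2.0
diverse_pair = heuristic([cars, animals])      # 6 covered, 0 repeated -> 6.0
```

Under this assumed heuristic the diverse pair scores higher even though the two car clusters are individually larger.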


    ESTC: The Search

    Incremental

    Greedy

    Look-ahead Protection

    Sophisticated Branch and Bound Pruning
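The incremental, greedy part of such a search can be sketched as follows. This is a simplified illustration only: it omits the look-ahead and branch and bound pruning named on the slide, and the gain function (new documents minus `beta` times already-covered documents) is an assumed heuristic, not one taken from the slides.

```python
def marginal_gain(docs, covered, beta):
    # Assumed gain: newly covered documents minus a penalty for repeats.
    new_docs = len(docs - covered)
    repeated = len(docs & covered)
    return new_docs - beta * repeated


def greedy_select(clusters, n, beta=1.0):
    """clusters: dict mapping label -> set of doc ids. Returns chosen labels."""
    selected, covered = [], set()
    remaining = dict(clusters)
    while remaining and len(selected) < n:
        # Pick the cluster that improves the heuristic most at this step.
        best = max(
            remaining,
            key=lambda name: marginal_gain(remaining[name], covered, beta),
        )
        if marginal_gain(remaining[best], covered, beta) <= 0:
            break  # no remaining cluster improves the selection
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Made-up clusters for illustration only.
clusters = {
    "cars": {1, 2, 3, 4},
    "car dealers": {1, 2, 3, 5},
    "animals": {6, 7},
    "mac os": {8},
}
picked = greedy_select(clusters, 3)
# picked == ["cars", "animals", "mac os"]
```

A purely greedy pass like this can commit to an early choice that blocks a better combination, which is presumably the motivation for the look-ahead mentioned on the slide.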


    Evaluation Method

    Gold Standard - Ideal Clustering

    2 Searches and 2 Types of Input Data

    Jaguar and Salsa

    Snippets and Full Text

    Precision: Cluster accuracy against the best matching ideal cluster

    Recall: Coverage of ideal cluster in matched clusters

    F-measure: Combination of precision and recall
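The three measures can be sketched under one possible reading of the definitions above. The following choices are assumptions, since the slides do not specify them: each cluster is matched to the ideal cluster it shares the most documents with, precision and recall are averaged over clusters and ideal clusters respectively, and the F-measure is the harmonic mean.

```python
def evaluate(clusters, ideals):
    """clusters: list of doc id sets; ideals: dict label -> doc id set.

    Returns (precision, recall, f_measure) under the assumptions stated
    in the accompanying text.
    """
    matched = {name: set() for name in ideals}
    precisions = []
    for docs in clusters:
        # Best matching ideal cluster: the one sharing the most documents.
        best = max(ideals, key=lambda name: len(docs & ideals[name]))
        precisions.append(len(docs & ideals[best]) / len(docs))
        matched[best] |= docs
    # Recall: how much of each ideal cluster its matched clusters cover.
    recalls = [
        len(matched[name] & ideals[name]) / len(ideals[name]) for name in ideals
    ]
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Made-up gold standard and clustering for illustration only.
ideals = {"car": {1, 2, 3, 4}, "animal": {5, 6}}
clusters = [{1, 2, 3, 5}, {5, 6}]
p, r, f = evaluate(clusters, ideals)
```

In this example the first cluster has three of its four documents in the "car" ideal cluster and covers three of that cluster's four documents, so precision, recall, and F-measure all come out at 0.875.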


    Results: STC, STC-NS, ESTC

    [Figure: Jaguar Full Text Clustering Results]


    Results: ESTC vs Grokker

    Similar performance without page titles

    Page titles are often very useful

    Algorithm   Input                     F-measure
    ESTC        Snippets                  58%
    Grokker     Snippets + Page Titles    62%
    ESTC        Full Text                 74%


    Conclusions

    ESTC has:

    A new cluster scoring function

    A new cluster selection algorithm

    ESTC is better than STC, and compares favourably with Grokker.

    The ESTC scoring function is applicable to any agglomerative clustering algorithm.

    The ESTC cluster selection algorithm is more widely applicable.


    Future Work

    Make improvements to other stages of STC

    Particularly Combining Base Clusters

    Apply cluster selection method to other algorithms

    Improve cluster selection heuristic
