Syn Presentation(6!05!10)1

Apr 10, 2018

  • 8/8/2019 Syn Presentation(6!05!10)1

    1/29

    Efficient Clustering Approaches for

    Organizing Document Collection

    School of Computer & System Sciences, Jawaharlal Nehru University

    New Delhi-110067

    Dr. Aditi Sharan, Assistant Professor

    Sonia, PhD Scholar

  • 2/29

    Table of Contents

    Information Retrieval

    Efficient Retrieval System

    Document Clustering

    Clustering Algorithm

    Feature Selection

    Dimensionality Reduction

    Subspace Clustering

    Subspace Creation

    Research Proposal

    Objective

  • 3/29

    IR System

    [Figure: IR system. Inputs: query string and document corpus. Output: ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). Web search system: a web spider supplies the corpus.]

  • 4/29

    IR System

    A perfect IR system (IRS) always retrieves all relevant documents without retrieving any non-relevant document.

    In reality, systems retrieve relevant as well as non-relevant documents.

    To measure the effectiveness of retrieval, two ratios are used: precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved).

    Present documents according to the user's need
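
Precision and recall can be computed from two sets: the documents a system retrieved and the documents that are actually relevant. The sketch below is only an illustration; the function name `precision_recall` and the document identifiers are made up for the example and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 documents actually relevant.
precision, recall = precision_recall(
    retrieved=["Doc1", "Doc2", "Doc3", "Doc7"],
    relevant=["Doc1", "Doc3", "Doc5"],
)
print(precision, recall)
```

A perfect system in the sense above would score 1.0 on both ratios; real systems usually trade one against the other.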

  • 5/29

    Document Clustering

    Automatically partition documents into clusters based on content

    Documents within each cluster should be similar

    Documents in different clusters should be different

    Discover categories in an unsupervised manner

    No sample category labels provided by humans

    It is a common and important task that finds many applications in IR and other places
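
"Similar" needs a concrete measure before any clustering can happen. The slides do not name one at this point; a common choice for text, assumed here purely for illustration, is the cosine similarity between term-count vectors. The function name and the sample strings are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the term-count vectors of two documents."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


same_topic = cosine_similarity("star wars movie", "star wars film")
different_topic = cosine_similarity("star wars movie", "stock market report")
print(same_topic, different_topic)
```

Documents that share many terms score close to 1 and documents with no terms in common score 0, which matches the two requirements listed above.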

  • 6/29

    Example Star

  • 7/29

    Why cluster documents?

    Whole corpus analysis/navigation: better user interface

    For improving recall in search applications: better search results

    For better navigation of search results: effective user recall will be higher

    For speeding up vector space retrieval: faster search

  • 8/29

    Challenging Task

    What are the challenges of Web data?

    Why is it difficult to cluster Web data?

    Structure-based problems: unstructured, heterogeneous, distributed, language dependent

    Information-based problems: larger repository, unlabelled, dynamic, duplication, interconnected (hyperlinks)

    User-based problems: insufficient queries, heterogeneous users, dynamic requirements, behavioral changes

  • 9/29

    Why is it difficult to cluster Web data?

    Data is heterogeneous

    High dimensionality of data

    No good definition of similarity itself

    Pre-clustering of data

    Therefore traditional clustering algorithms have to be modified, or new algorithms should be developed, to cluster web data

  • 10/29

    Clustering Algorithms

    Geometric Embedding: Self-organizing maps (SOM), Multidimensional scaling (MDS), Latent Semantic Indexing (LSI)

    Generative Models & Probabilistic: Generative Distributions for Documents, Expectation Maximization (EM), Multiple Cause Mixture Model (MCMM), Aspect Models and Probabilistic LSI

    Hierarchical Clustering Algorithm: Bottom-up clustering, Top-down clustering

    Partitioning: K-means, Buckshot, Fractionation

    Model and Feature Selection

    Combining Clustering with IR: Pre-Clustering, Post-Clustering

    Pre-Clustering: retrieve one or more clusters in their entirety in response to a query

  • 11/29

    Post-Clustering Approaches

    Clustering is used in improving document search and retrieval

    An attempt to improve conventional search techniques

    Enhancing near-neighbor search

    Scatter/Gather Method

    A document browsing technique that employs document clustering as its primary operation

    Document clustering algorithms are often slow, with quadratic running times

    How clustering can be an effective method in its own right

  • 12/29

    Scatter/Gather: A Cluster-Based Approach

    How it works

    The system clusters documents into a small number of groups (Scatter)

    The system displays short summaries of them

    The user chooses one or more of the groups for further study

    Selected groups are gathered together to form a subcollection

    With each successive iteration the groups become smaller and more detailed

    When the groups become small enough, this process bottoms out by displaying individual documents
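
The loop described above can be sketched as follows. This is a structural illustration only: `scatter_gather`, `round_robin_groups` and `pick_first` are hypothetical names, the round-robin split is a placeholder for a real clustering step (such as Buckshot or Fractionation), and the user's choice is simulated by always picking the first group.

```python
def scatter_gather(docs, cluster_fn, choose_fn, k=3, stop_size=4):
    """Alternate scatter (cluster) and gather (merge chosen groups)
    until the working collection is small enough to list directly."""
    collection = list(docs)
    while len(collection) > stop_size:
        groups = cluster_fn(collection, k)                   # Scatter
        chosen = choose_fn(groups)                           # user picks group indices
        gathered = [d for i in chosen for d in groups[i]]    # Gather
        if not gathered or len(gathered) == len(collection):
            break  # nothing was narrowed down; stop instead of looping forever
        collection = gathered
    return collection


def round_robin_groups(docs, k):
    """Placeholder for a clustering algorithm: NOT content-based."""
    return [docs[i::k] for i in range(k)]


def pick_first(groups):
    """Stand-in for the user's selection: always choose group 0."""
    return [0]


docs = [f"doc{i}" for i in range(20)]
remaining = scatter_gather(docs, round_robin_groups, pick_first)
print(remaining)
```

Passing the clustering step and the selection step as functions keeps the browsing loop independent of which clustering algorithm is plugged in.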

  • 13/29

    Application to Scatter/Gather

    Zooming into a large document collection

    Interactive browsing paradigm

    Effective information access tool

    Helpful in situations where the query is unspecified

    Comparatively fast algorithms: Buckshot and Fractionation

    Linear-time preprocessing

    Constant-time query processing

    Effective geometric clustering tool

    Limitations

    Even the Buckshot or Fractionation algorithms may be too slow for a large corpus on the Web

    Quality of clustering

  • 14/29

    Web Document Clustering Using Suffix Tree Algorithm

    Preparing the documents

    Suffix tree construction

    Scoring clusters: SC = NC * p(li)

    Merging clusters: merge clusters x and y if |Bx ∩ By| / |Bx| > k and |Bx ∩ By| / |By| > k

    Labeling clusters: one or more labels in the original suffix tree
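
The merging rule can be illustrated with document sets standing in for base clusters. In this sketch the overlap test follows the two ratios on the slide; the threshold default `k=0.5`, the function names, and the choice to merge transitively (connected components via union-find) are assumptions made for the example and are not stated in the slides.

```python
def should_merge(bx, by, k=0.5):
    """Overlap test: the shared documents must exceed fraction k of BOTH clusters."""
    if not bx or not by:
        return False
    overlap = len(bx & by)
    return overlap / len(bx) > k and overlap / len(by) > k


def merge_base_clusters(base_clusters, k=0.5):
    """Union base clusters linked, directly or transitively, by should_merge."""
    parent = list(range(len(base_clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(base_clusters)):
        for j in range(i + 1, len(base_clusters)):
            if should_merge(base_clusters[i], base_clusters[j], k):
                parent[find(i)] = find(j)

    merged = {}
    for i, docs in enumerate(base_clusters):
        merged.setdefault(find(i), set()).update(docs)
    return sorted(merged.values(), key=sorted)


# Hypothetical base clusters, each a set of document ids.
base = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
clusters = merge_base_clusters(base)
print(clusters)
```

Because a document can belong to several base clusters, the merged clusters may still overlap when the inputs overlap only weakly.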

  • 15/29

    Analysis of STC: Applications

    Definition of STC: an incremental, O(n)-time clustering algorithm that satisfies these requirements

    Effective for information retrieval

    Snippets versus whole-document clustering

    Execution time is less

    Analysis of STC: Drawbacks

    Non-exclusiveness: documents may appear in more than one cluster; they may share only a few short words

    Incompleteness: no specific category; clusters may not contain all documents

    Absoluteness: no information about document lengths or suffix mismatches

    Topic generating: topic identification for document clusters

  • 16/29

    Clustering High-Dimensional Data

    Many applications: text documents, DNA micro-array data

    Major challenges:

    Many irrelevant dimensions may mask clusters

    Distance measures become meaningless due to equi-distance

    Clusters may exist only in some subspaces

    Methods

    Feature transformation: only effective if most dimensions are relevant

    PCA & SVD are useful only when features are highly correlated/redundant

    Feature selection: wrapper or filter approaches; useful to find a subspace where the data have nice clusters

    Subspace clustering: find clusters in all the possible subspaces

    CLIQUE, ProClus, and frequent pattern-based clustering

  • 17/29

    Feature Selection

    Feature selection strategy

    Remove non-informative words from documents

    Improve categorization effectiveness

    Reduce computational complexity

    Remove redundant data

    Result: Dimensionality Reduction

    [Figure: Dimensionality Reduction. Data Space -> Feature Space -> Cluster/Class, with n >> m1 >> m2 >> k]

  • 18/29

    Document Clustering using Feature Selection

    [Figure: pipeline. Documents -> Preprocessing (stop word elimination, stemming) -> Feature Extraction (Document-Term Matrix) -> Feature Selection -> Clustering Algorithm -> Clusters]
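
The preprocessing and feature extraction stages can be sketched in a few lines. Everything named here is an assumption for illustration: the stop-word list is a tiny sample, `crude_stem` is a placeholder suffix stripper (not the Porter stemmer or any stemmer named in the slides), and the matrix holds raw term counts.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}  # tiny sample list


def crude_stem(word):
    """Placeholder stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


vocab, matrix = document_term_matrix(
    ["Clustering the documents", "Documents are clustered in groups"]
)
print(vocab)
print(matrix)
```

Each row of the matrix is one document and each column one term, which is the representation the later feature selection and clustering stages operate on.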

  • 19/29

    Feature Selection

    A good feature set is

    Efficient: as low-dimensional as possible (objective)

    Effective: discriminates documents as much as possible (subjective)

    Feature selection process: an optimization process, minimizing the number of features and maximizing the discriminating property of the feature set

    Problem statement

    Searching the feature space to find an optimum subset of features that satisfies the goal

    Silent about the clusters of different subspaces
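
One simple filter-style way to shrink the feature set is to drop terms that occur in too few or too many documents. The slides do not prescribe this criterion; document frequency, the thresholds `min_df` and `max_df_ratio`, and the function name are assumptions chosen for illustration.

```python
def select_by_document_frequency(vocab, matrix, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents and in at most
    max_df_ratio of all documents; return the reduced (vocab, matrix)."""
    n_docs = len(matrix)
    if n_docs == 0:
        return [], []
    keep = []
    for j in range(len(vocab)):
        df = sum(1 for row in matrix if row[j] > 0)  # document frequency of term j
        if df >= min_df and df / n_docs <= max_df_ratio:
            keep.append(j)
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in matrix]


# Hypothetical document-term matrix: rows are documents, columns follow vocab.
vocab = ["cluster", "rare", "the", "web"]
matrix = [
    [1, 0, 3, 1],
    [2, 0, 1, 0],
    [0, 1, 2, 1],
    [1, 0, 1, 0],
]
small_vocab, small_matrix = select_by_document_frequency(vocab, matrix)
print(small_vocab)
print(small_matrix)
```

Note that such a filter picks a single subspace for the whole collection, which is exactly the limitation stated in the last line above.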

  • 20/29

    The Curse of Dimensionality

    When the number of dimensions increases, the distances between pairs of points become nearly the same

    Surprising results!

    This is the reason why we need to study subspace clustering
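
The effect can be observed with a small experiment on uniformly random points, which is an assumed setup and not data from the slides. The sketch measures the relative contrast, (farthest - nearest) / nearest, of the distances from one query point to a sample of points; `relative_contrast` is a hypothetical helper name.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min over distances from a random query point
    to n_points random points in the unit hypercube of dimension dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(2)
high_dim = relative_contrast(1000)
print(low_dim, high_dim)
```

In low dimensions the nearest point is far closer than the farthest one, so the contrast is large; in high dimensions the contrast shrinks toward zero, meaning "nearest" and "farthest" are hard to tell apart.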

  • 21/29

    Document Clustering using Subspace

    [Figure: pipeline. Documents -> Preprocessing (stop word elimination, stemming) -> Subspace Clustering -> Clusters]

  • 22/29

    Why Subspace Clustering?

    To integrate feature evaluation and clustering in order to find clusters in different subspaces

    Uncover complex relationships in the data set

    Subspace clustering: find clusters in all the subspaces

    Cover the whole document collection when creating subspaces

    Can handle new features

    Extension of feature selection

    Top-down subspace clustering search

    Bottom-up subspace clustering search

    Dense Unit-based Method

    Entropy-Based Method

    Transformation-Based Method

  • 23/29

    Subspace Clustering

    Top-down Subspace Clustering Algorithms

    Multiple iterations of expensive clustering algorithms

    Find an initial clustering in the full set of dimensions

    Evaluate the subspace of each cluster

    Iterate to improve the result

    Bottom-up Subspace Clustering Algorithms

    Integrate clustering and subspace selection

    Find the dense regions in low-dimensional spaces

    Combine them to form clusters

    Text mining applications are particularly relevant and present unique challenges to subspace clustering.
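
The bottom-up idea (find dense regions in low-dimensional spaces, then combine them) can be sketched with a fixed grid. This is a heavily simplified illustration loosely in the spirit of grid-based methods such as CLIQUE, not an implementation of any specific algorithm; the names, the bin count, the density threshold and the sample points are all assumptions, and coordinates are assumed to lie in [0, 1].

```python
from collections import Counter
from itertools import combinations


def dense_units(points, dims, bins=4, min_count=3):
    """Grid cells of the subspace `dims` that hold at least min_count points."""
    cells = Counter(
        tuple(min(int(p[d] * bins), bins - 1) for d in dims) for p in points
    )
    return {cell for cell, n in cells.items() if n >= min_count}


def bottom_up_2d(points, n_dims, bins=4, min_count=3):
    """Find dense 1-D units, then test only 2-D subspaces whose
    1-D projections both contain a dense unit."""
    one_d = {d: dense_units(points, (d,), bins, min_count) for d in range(n_dims)}
    two_d = {}
    for a, b in combinations(range(n_dims), 2):
        if one_d[a] and one_d[b]:  # prune subspaces that cannot be dense
            units = dense_units(points, (a, b), bins, min_count)
            if units:
                two_d[(a, b)] = units
    return one_d, two_d


# Hypothetical points: tight in dimensions 0 and 1, spread out in dimension 2.
points = [
    (0.10, 0.10, 0.05),
    (0.12, 0.15, 0.30),
    (0.20, 0.05, 0.55),
    (0.15, 0.20, 0.80),
    (0.90, 0.60, 0.95),
]
one_d, two_d = bottom_up_2d(points, n_dims=3)
print(two_d)
```

The pruning step relies on the observation that a cell can only be dense in a 2-D subspace if its projections onto each single dimension are dense as well.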

  • 24/29

    Applications of Subspace Clustering

    Information Integration

    Web Text Mining

    DNA Microarray

    Web Text Mining

    Web pages in a document-term matrix: instances (pages) and features (keywords)

    Find the set of keywords (subspace) for a given group of pages

    Keywords connect the group

    Clusters represent the domain
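
As a very rough illustration of "keywords connect the group", the sketch below takes the keywords that occur in every page of a group as a stand-in for that group's subspace. Real subspace clustering methods use softer criteria; the function name, the page ids and the counts are hypothetical.

```python
def shared_keywords(doc_term, pages):
    """Keywords with a non-zero count in every page of the group."""
    keyword_sets = [
        {term for term, count in doc_term[page].items() if count > 0}
        for page in pages
    ]
    return set.intersection(*keyword_sets) if keyword_sets else set()


# Hypothetical document-term data: page id -> {keyword: count}.
doc_term = {
    "p1": {"cluster": 2, "web": 1, "gene": 0},
    "p2": {"cluster": 1, "web": 3},
    "p3": {"cluster": 1, "gene": 4},
}
web_group = shared_keywords(doc_term, ["p1", "p2"])
all_pages = shared_keywords(doc_term, ["p1", "p2", "p3"])
print(web_group, all_pages)
```

Different groups of pages end up with different keyword sets, which is the sense in which each cluster lives in its own subspace.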

  • 25/29

    Example Data Set

    [Figure: a 3-D data set (a, b, c) of 400 instances, made of Clusters I to IV with 100 instances each; the clusters lie in the 2-D subspaces (a, b) and (b, c).]

    Apply k-means

    Does a poor job of finding the clusters

    Because each cluster is spread across irrelevant dimensions

    Consider fewer dimensions

  • 26/29

    Apply Feature Transformation

    Transforms the dimensions from high to low

    Relative distances are preserved

    Irrelevant dimensions remain unaffected

    Apply Feature Selection

    Reduces the dimensionality

    Finds clusters in the same subspace

    Does not explain clusters in different subspaces

    Find the Cluster in each subspace

  • 27/29

    Apply Subspace Clustering

    Represent the clusters in interpretable and meaningful ways

    Represent each cluster as well as the subspace in which it exists

    Uncover the complex relationships found in the data

    In order to do this

    Unique challenges in subspace clustering

    Finding an appropriate result depends on the clustering technique

    Strengths, weaknesses, and biases of potential clustering algorithms

  • 28/29

    Research Proposal

    To investigate computationally efficient ways of combining information retrieval with clustering.

    Efforts will be made to explore efficient clustering algorithms that work better on high-dimensional datasets and to apply them to document clustering.

    Feature vector representation, and the reduction of its dimensionality using feature selection and subspace clustering, will be investigated to make clustering algorithms more efficient for large sets of documents. Specifically, we will focus on word co-occurrence frequency to reduce the feature space for clustering.
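
The proposal does not say how co-occurrence frequency will be used to reduce the feature space, so the sketch below only shows the counting step. It assumes document-level co-occurrence (two words co-occur if they appear in the same document); the function name and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tokenized_docs):
    """For each unordered word pair, count the documents containing both words."""
    counts = Counter()
    for tokens in tokenized_docs:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


# Hypothetical, already preprocessed documents.
docs = [
    ["web", "data", "cluster"],
    ["web", "cluster"],
    ["data", "mining"],
]
counts = cooccurrence_counts(docs)
print(counts.most_common(3))
```

Word pairs with high counts are candidates for being grouped or merged into a single feature, which is one possible route to a smaller feature space.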

  • 29/29

    Thanks

    Suggestions!