Syn Presentation(6!05!10)1

8/8/2019 Syn Presentation(6!05!10)1

1/29

Efficient Clustering Approaches for

Organizing Document Collection

School of Computer & Systems Sciences

Jawaharlal Nehru University

New Delhi-110067

Dr. Aditi Sharan, Assistant Professor

Sonia, PhD Scholar


2/29

Table of Contents

Information Retrieval

Efficient Retrieval System

Document Clustering

Clustering Algorithm

Feature Selection

Dimensionality Reduction

Subspace Clustering

Subspace Creation

Research Proposal

Objective


3/29

IR System

[Figure: Web Search System. Diagram labels: Query String, Document corpus, IR System, Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...), Web, Spider.]


4/29

IR System

A perfect IR system always retrieves all relevant documents without retrieving any non-relevant document.

In reality, systems retrieve relevant as well as non-relevant documents.

To measure the effectiveness of retrieval, two ratios are used: precision and recall.

Present documents according to the user's need
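The two ratios named above can be sketched as follows. This is a minimal illustration using the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the document names and the function name `precision_recall` are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical system output and hypothetical ground truth for one query.
retrieved_docs = ["Doc1", "Doc2", "Doc3", "Doc4"]
relevant_docs = ["Doc1", "Doc3", "Doc5"]

p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

A perfect system in the sense of the slide would score 1.0 on both ratios.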


5/29

Document Clustering

Automatically partition documents into clusters based on content

Documents within each cluster should be similar

Documents in different clusters should be different

Discover categories in an unsupervised manner

No sample category labels provided by humans

It is a common and important task that finds many applications in IR and other places
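One common way to make "similar" and "different" concrete is cosine similarity between term-frequency vectors. The sketch below assumes whitespace tokenization and raw term counts; the slides do not prescribe a particular similarity measure, and the sample sentences are made up.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


d1 = "clustering groups similar documents"
d2 = "documents are grouped by clustering"
d3 = "the spider crawls the web"

# d1 and d2 share terms, so they score higher than d1 and d3.
print(cosine_similarity(d1, d2), cosine_similarity(d1, d3))
```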


6/29

Example Star


7/29

Why cluster documents?

Whole corpus analysis/navigation: better user interface

For improving recall in search applications: better search results

For better navigation of search results: effective user recall will be higher

For speeding up vector space retrieval: faster search


8/29

Challenging Task

What are the challenges of Web data?

Why is it difficult to cluster Web data?

Structure-Based Problems: unstructured, heterogeneous, distributed, language dependent

Information-Based Problems: larger repository, unlabelled, dynamic, duplication, interconnected (hyperlinks)

User-Based Problems: insufficient query, heterogeneous users, dynamic requirements, behavioral changes


9/29

Why is it difficult to cluster Web data?

Data is heterogeneous

High dimensionality of data

No good definition of similarity itself

Pre-clustering of data

Therefore, traditional clustering algorithms have to be modified, or new algorithms should be developed, to cluster Web data


10/29

Clustering Algorithms

Geometric Embedding: Self-organizing maps (SOM), Multidimensional scaling (MDS), Latent Semantic Indexing (LSI)

Generative & Probabilistic Models: Generative distributions for documents, Expectation Maximization (EM), Multiple Cause Mixture Model (MCMM), Aspect models and probabilistic LSI, Model and feature selection

Hierarchical Clustering Algorithms: Bottom-up clustering, Top-down clustering

Partitioning Clustering Algorithms: K-means clustering, Buckshot, Fractionation

Combining Clustering with IR: Pre-clustering, Post-clustering

Pre-Clustering

To retrieve one or more clusters in their entirety in response to a query


11/29

Post-Clustering Approaches

Clustering is used in improving document search and retrieval

An attempt to improve conventional search techniques

Enhancement of near-neighbor search

Scatter/Gather Method: a document browsing technique that employs document clustering as its primary operation

Document clustering algorithms are often slow, with quadratic running times

How clustering can be an effective method in its own right


12/29

Scatter/Gather : A Cluster Based Approach

How it works

The system clusters documents into a small number of groups (Scatter)

The system displays short summaries of them

The user chooses one or more of the groups for further study

Selected groups are gathered together to form a subcollection

With each successive iteration the groups become smaller and more detailed

When the groups become small enough, this process bottoms out by displaying individual documents
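The loop described above can be sketched roughly as follows. This is an illustration only: the clustering step is a simple seed-based stand-in (not Buckshot or Fractionation), summaries are omitted, the user's choice is simulated by a callback, and all names (`scatter`, `gather`, `scatter_gather`, `choose`) are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def scatter(docs, k):
    """Split docs (name -> text) into at most k groups.

    Stand-in clustering: the first k documents act as seeds and every
    other document joins the seed it overlaps with most.
    """
    names = list(docs)
    words = {n: set(docs[n].lower().split()) for n in names}
    seeds = names[:k]
    groups = {s: [s] for s in seeds}
    for n in names[k:]:
        best = max(seeds, key=lambda s: jaccard(words[n], words[s]))
        groups[best].append(n)
    return list(groups.values())


def gather(docs, groups, chosen):
    """Union the chosen groups (by index) into a sub-collection."""
    return {n: docs[n] for i in chosen for n in groups[i]}


def scatter_gather(docs, k, choose, min_size):
    """Repeat scatter and gather until the collection is small enough."""
    while len(docs) > min_size:
        groups = scatter(docs, k)
        selected = gather(docs, groups, choose(groups))
        if not selected or len(selected) == len(docs):
            break  # no narrowing happened; stop instead of looping forever
        docs = selected
    return docs


corpus = {
    "d1": "star galaxy telescope",
    "d2": "film star actor",
    "d3": "galaxy telescope orbit",
    "d4": "actor film award",
    "d5": "star orbit galaxy",
    "d6": "award film festival",
}
# Simulated user who always picks the first group shown.
result = scatter_gather(corpus, k=2, choose=lambda groups: [0], min_size=2)
print(sorted(result))
```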


13/29

Applications of Scatter/Gather

Zooming into a large document collection

Interactive browsing paradigm

Effective information access tool

Helpful in situations where the query is unspecified

Comparatively fast algorithms: Buckshot and Fractionation

Linear-time preprocessing

Constant-time query processing

Effective geometric clustering tool

Limitations

Even the Buckshot or Fractionation algorithms may be too slow for a large corpus on the Web

Quality of clustering


14/29

Web Document Clustering Using Suffix Tree Algorithm

Steps: Preparing the Doc, Suffix Tree Construction, Scoring Cluster, Merging Clusters, Labeling Clusters

Scoring: SC = NC * p(li)

Merging: merge clusters x and y if |Bx ∩ By| / |Bx| > k and |Bx ∩ By| / |By| > k

Labeling: one or more labels in the original suffix tree
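The merge test on this slide can be sketched as follows, taking each base cluster as a set of document ids. Two things are assumptions rather than statements from the slides: the threshold value k = 0.5 is only illustrative, and linked base clusters are merged transitively (as connected components). Base clusters are assumed to be non-empty.

```python
def similar(bx, by, k=0.5):
    """Overlap test: |Bx & By| / |Bx| > k and |Bx & By| / |By| > k."""
    overlap = len(bx & by)
    return overlap / len(bx) > k and overlap / len(by) > k


def merge_base_clusters(base_clusters, k=0.5):
    """Merge base clusters that are linked, directly or transitively."""
    n = len(base_clusters)
    parent = list(range(n))  # union-find forest over cluster indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similar(base_clusters[i], base_clusters[j], k):
                parent[find(i)] = find(j)

    merged = {}
    for i, docs in enumerate(base_clusters):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())


# Hypothetical base clusters (sets of document ids).
base = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
clusters = merge_base_clusters(base)
print(sorted(sorted(c) for c in clusters))
```

Because a document can belong to several base clusters, the merged clusters may still overlap, which matches the non-exclusiveness noted in the analysis of STC.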


15/29

Analysis of STC: Applications

STC is an incremental, O(n)-time clustering algorithm that satisfies these requirements

Effective for information retrieval

Snippets versus whole-document clustering

Execution time is less

Analysis of STC: Drawbacks

Non-Exclusiveness: documents may appear in more than one cluster; they may share only a few short words

Incompleteness: no specific category; clusters do not contain all documents

Absoluteness: no information about document lengths or suffix mismatches

Topic Generating: topic identification for document clusters


16/29

Clustering High-Dimensional Data

Clustering high-dimensional data

Many applications: text documents, DNA micro-array data

Major challenges:

Many irrelevant dimensions may mask clusters

Distance measure becomes meaningless due to equi-distance

Clusters may exist only in some subspaces

Methods

Feature transformation: only effective if most dimensions are relevant

PCA & SVD useful only when features are highly correlated/redundant

Feature selection: wrapper or filter approaches

useful to find a subspace where the data have nice clusters

Subspace-clustering: find clusters in all the possible subspaces

CLIQUE, ProClus, and frequent pattern-based clustering


17/29

Feature Selection

Feature selection strategy

Remove non-informative words from documents

Improve categorization effectiveness

Reduce computational complexity

Remove redundant data

Result: Dimensionality Reduction

[Figure: Dimensionality Reduction. Data Space (n) → Feature Space (m1, m2) → Cluster/Class (k), with n >> m1 >> m2 >> k]
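As one illustration of removing non-informative words, the sketch below applies a document-frequency filter: terms that occur in too few or too many documents are dropped. This is just one possible filter approach; the thresholds `min_df` and `max_df_ratio` and the function name are hypothetical and not taken from the slides.

```python
from collections import Counter


def select_by_document_frequency(tokenized_docs, min_df=2, max_df_ratio=0.9):
    """Keep terms whose document frequency lies within the given bounds."""
    n = len(tokenized_docs)
    # Count in how many documents each term appears (not total occurrences).
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(
        term for term, count in df.items()
        if count >= min_df and count / n <= max_df_ratio
    )


docs = [
    ["the", "star", "galaxy"],
    ["the", "star", "film"],
    ["the", "galaxy", "orbit"],
    ["the", "film", "award"],
]
selected = select_by_document_frequency(docs)
print(selected)
```

In this toy input the vocabulary shrinks from six terms to three: "the" is dropped because it appears everywhere, "orbit" and "award" because they appear only once.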


18/29

Document Clustering using Feature Selection

[Figure: Documents → Preprocessing (Stop Word Elimination, Stemming) → Feature Extraction (Document-Term Matrix) → Feature Selection → Clustering Algorithm → Clusters]
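The preprocessing and feature-extraction stages of this pipeline might look like the sketch below. The stop-word list is a tiny illustrative sample and `naive_stem` is a deliberately crude suffix stripper standing in for a real stemmer; both are assumptions made for this example.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    return [naive_stem(w) for w in text.lower().split() if w not in STOP_WORDS]


def document_term_matrix(texts):
    """Return (vocabulary, matrix) with one row of term counts per document."""
    docs = [Counter(preprocess(t)) for t in texts]
    vocab = sorted(set().union(*docs))
    matrix = [[doc[term] for term in vocab] for doc in docs]
    return vocab, matrix


texts = [
    "Clustering of documents",
    "The documents are clustered",
    "Stars and galaxies",
]
vocab, matrix = document_term_matrix(texts)
print(vocab)
for row in matrix:
    print(row)
```

Each row of the matrix is the feature vector that the later feature selection and clustering stages would operate on.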


19/29

Feature Selection

A good feature set is

Efficient: as low-dimensional as possible (objective)

Effective: discriminates documents as much as possible (subjective)

Feature selection process: an optimization process, minimizing the number of features and maximizing the discriminating property of the feature set

Problem statement

Searching the feature space to find an optimum subset of features that satisfies the goal

Silent about the clusters in different subspaces


20/29

The Curse of Dimensionality

When the number of dimensions increases, the distance between any two points becomes nearly the same

Surprising results!

This is the reason why we need to study subspace clustering
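The effect can be observed with a small experiment. The sketch below assumes points drawn uniformly at random from the unit hypercube and Euclidean distance; it measures how much the farthest point differs from the nearest one, relative to the nearest. The function name `relative_contrast` and the sample sizes are choices made for this illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast is close to zero, "nearest" and "farthest" neighbours are almost equally far away, so distance-based clustering in the full space loses its meaning.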


21/29

Document Clustering using Subspace

[Figure: Documents → Preprocessing (Stop Word Elimination, Stemming) → Subspace Clustering → Clusters]


22/29

Why Subspace Clustering?

To integrate feature evaluation and clustering in order to find

clusters in different subspaces

Uncover complex relationship in data set

Subspace-clustering: find clusters in all the subspaces

Cover the whole document collection to make subspaces

Can handle the new features

Extension of feature selection

Top-down subspace clustering search

Bottom-up subspace clustering search

Dense Unit-based Method

Entropy-Based Method

Transformation-Based Method


23/29

Subspace Clustering

Top-down Subspace Clustering Algorithms

Multiple iterations of expensive clustering algorithms

Find an initial clustering in the full set of dimensions

Evaluate the subspace of each cluster

Iterative processing is done to improve the result

Bottom-up Subspace Clustering Algorithms

Integrate the clustering and subspace selection

Find the dense regions in low-dimensional spaces

Combine them to form clusters

Text mining applications are particularly relevant and present unique challenges to subspace clustering.
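The bottom-up idea (find dense regions in low-dimensional spaces, then combine them) can be sketched with a fixed grid. This is loosely in the style of grid and density methods such as CLIQUE, named earlier in the slides, but it is not a faithful implementation: it stops at two dimensions, and the parameters `bins` and `min_points` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, bins=4, min_points=3):
    """Return dense 1-D and 2-D grid units for points in [0, 1)^d.

    A unit is a (dimension, bin) pair; a 2-D unit is a pair of such units.
    """
    dims = len(points[0])

    def cell(x):
        return min(int(x * bins), bins - 1)

    # Step 1: find dense intervals in every single dimension.
    counts = Counter((d, cell(p[d])) for p in points for d in range(dims))
    dense_1d = {unit for unit, c in counts.items() if c >= min_points}

    # Step 2: combine dense 1-D units from different dimensions and keep
    # only those combinations that are still dense.
    dense_2d = set()
    for (d1, b1), (d2, b2) in combinations(sorted(dense_1d), 2):
        if d1 == d2:
            continue
        count = sum(1 for p in points if cell(p[d1]) == b1 and cell(p[d2]) == b2)
        if count >= min_points:
            dense_2d.add(((d1, b1), (d2, b2)))
    return dense_1d, dense_2d


# Toy data: tight in dimensions 0 and 1, spread out in dimension 2.
points = [
    (0.10, 0.10, 0.05),
    (0.12, 0.15, 0.35),
    (0.20, 0.05, 0.65),
    (0.15, 0.20, 0.95),
]
one_d, two_d = dense_units(points)
print(sorted(one_d), sorted(two_d))
```

Only 1-D units that are already dense are combined, which is what keeps a bottom-up search from enumerating every cell of the full space. In the toy data the cluster shows up in the subspace of dimensions 0 and 1, while dimension 2 contributes no dense unit.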


24/29

Applications of Subspace Clustering

Information Integration

Web Text Mining

DNA Microarray

Web Text Mining

Web pages in a document-term matrix: instances (pages), features (keywords)

Find the set of keywords (subspace) for a given group of pages

Keywords connect the group

Clusters represent the domain


25/29

Example Data Set

[Figure: data set of 400 instances in 3-D (a, b, c), containing Clusters I, II, III, and IV of 100 instances each; the clusters lie in the 2-D subspaces (a, b) and (b, c)]

Apply k-means

Does a poor job of finding the clusters

Each cluster is spread out over an irrelevant dimension

Consider fewer dimensions


26/29

Apply Feature Transformation

Transforms the data from a high-dimensional to a low-dimensional space

Relative distances are preserved

Leaves the irrelevant dimensions unaffected

Apply Feature Selection

Reduces the dimensionality

Finds clusters in the same subspace

Does not explain clusters in different subspaces

Find the clusters in each subspace


27/29

Apply Subspace Clustering

Represents clusters in interpretable and meaningful ways

Represents each cluster as well as the subspace in which it exists

Uncovers the complex relationships found in the data

In order to do this

Unique challenges in subspace clustering

Finding an appropriate result depends on the clustering technique

Strengths, weaknesses, and biases of the potential clustering algorithms


28/29

Research Proposal

To investigate computationally efficient ways of combining information retrieval with clustering.

Efforts will be made to explore efficient clustering algorithms that work better on high-dimensional datasets and to apply them to document clustering.

Feature vector representation and the reduction of its dimensionality using feature selection and subspace clustering will be investigated to make clustering algorithms more efficient for large sets of documents.

Specifically, we will focus on word co-occurrence frequency to reduce the feature space for clustering.
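As a starting point for the co-occurrence idea, the sketch below counts how many documents contain each pair of words. Treating co-occurrence at the document level is an assumption made here; the slides do not specify the window, nor how the counts would then be used to reduce the feature space.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tokenized_docs):
    """For each unordered word pair, count the documents containing both."""
    counts = Counter()
    for doc in tokenized_docs:
        # Sort so that each pair always appears in the same order.
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] += 1
    return counts


docs = [
    ["star", "galaxy", "orbit"],
    ["star", "galaxy"],
    ["film", "star"],
]
counts = cooccurrence_counts(docs)
print(counts.most_common(3))
```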


29/29

Thanks

Suggestions!!!!