A novel clustering algorithm based on weighted support and its application

Authors: Xiang-Rong Yang, Jun-Yi Shen, Qiang Liu

Graduate: Chien-Ming Hsiao

Outline

Motivation
Objective
Introduction
Description of some Terms
Algorithm and Analysis
Experimental results
Conclusions
Personal opinion

Motivation

Many efficient clustering algorithms have been proposed, but most of these works focus on numerical data.

Objective

To present a novel and efficient algorithm WeiSC for clustering categorical data

Introduction

Clustering is an important KDD problem.

Objective: to group data into sets such that

Intra-cluster similarity is maximized

Inter-cluster similarity is minimized

Most of these works focus on numerical data whose inherent geometric properties can be exploited naturally to define distance functions between data points.

Introduction

The basic idea of WeiSC

It repeatedly reads tuples from the dataset one by one.

When the first tuple arrives, it forms a cluster alone.

Each subsequent tuple is either put into an existing cluster or rejected by all existing clusters to form a new cluster, according to a given similarity function defined between a tuple and a cluster.

Only makes one scan over the dataset

Description of some Terms

Let $D$ be a set of tuples over $A_1, A_2, \ldots, A_m$, where $A_i$ ($1 \le i \le m$) are categorical attributes with domains $D_1, D_2, \ldots, D_m$.

Let TID be the set of unique IDs of every tuple.

For each $tid \in TID$, the value for attribute $A_i$ of the corresponding tuple is represented as $val(tid, A_i)$.


DEFINITION 1

$Cluster = \{tid \mid tid \in TID\}$ is a subset of TID.

DEFINITION 2

Given a cluster $C$, the set of attribute values on $A_i$ with respect to $C$ is defined as: $VAL_i(C) = \{val(tid, A_i) \mid tid \in C\}$

DEFINITION 3

Let $CONT(A_i) = |D_i|$, i.e. the count of distinct attribute values of $A_i$, and $SUM\_CONT(A) = \sum_{i=1}^{m} CONT(A_i)$. The weight of attribute $A_i$ is $WEI(A_i) = CONT(A_i) / SUM\_CONT(A)$
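Definition 3 can be illustrated with a minimal sketch. It assumes the weight is read as $CONT(A_i)$ divided by $SUM\_CONT$, and it approximates each domain size by the distinct values observed in a toy dataset. The function name `attribute_weights` and the toy tuples are hypothetical and are not taken from the paper.

```python
def attribute_weights(tuples):
    """Return WEI(A_i) for each attribute position i.

    Assumption: CONT(A_i) is the number of distinct values observed for
    attribute i in `tuples`, standing in for the domain size |D_i|.
    """
    m = len(tuples[0])
    cont = [len({t[i] for t in tuples}) for i in range(m)]
    sum_cont = sum(cont)  # SUM_CONT
    return [c / sum_cont for c in cont]


# Toy categorical data: (colour, shape, size). Purely illustrative.
toy = [
    ("red", "round", "small"),
    ("red", "flat", "large"),
    ("brown", "round", "small"),
    ("white", "round", "large"),
]
weights = attribute_weights(toy)
print(weights)
```

Under this reading the weights sum to 1, so an attribute with more distinct values contributes more to the similarity.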


DEFINITION 4

Given a cluster $C$, let $a \in D_i$. The weighted support of $a$ in $C$ with respect to $A_i$ is defined as: $wei\_sp(a) = WEI(A_i) \cdot |\{tid \mid tid.A_i = a,\ tid \in C\}|$

DEFINITION 5

Given a cluster $C$, the summary for $C$ is defined as: $Summary = \{CID, VS_i \mid 1 \le i \le m\}$, where $VS_i = \{(a, Cont(a), wei\_sp(a)) \mid a = tid.A_i,\ tid \in C\}$
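As a rough illustration of Definitions 4 and 5, the sketch below builds a cluster summary from a list of member tuples. It assumes that Cont(a) is the number of tuples in the cluster holding value a and that the weighted support is the attribute weight times that count. The dictionary layout, the extra `size` field, and the name `cluster_summary` are hypothetical choices made for this example.

```python
from collections import Counter


def cluster_summary(cid, members, weights):
    """Build a summary {CID, VS_1..VS_m} for a cluster.

    Each VS_i maps a value a to (Cont(a), wei_sp(a)), where Cont(a) is
    assumed to be the number of member tuples with value a on attribute i
    and wei_sp(a) = WEI(A_i) * Cont(a).
    """
    vs = []
    for i, w in enumerate(weights):
        counts = Counter(t[i] for t in members)
        vs.append({a: (n, w * n) for a, n in counts.items()})
    # `size` (|C|) is kept alongside the summary for later similarity use.
    return {"cid": cid, "size": len(members), "vs": vs}


members = [("red", "round"), ("red", "flat"), ("brown", "round")]
summary = cluster_summary(0, members, weights=[0.6, 0.4])
print(summary["vs"][0]["red"])
```

The summary holds per-value counts only, so its size depends on the number of distinct values in the cluster rather than on the number of tuples.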

Algorithm and Analysis

Overview

Initially, the first tuple in the database is read and a cluster is constructed. Then the subsequent tuples are read iteratively.

The similarity between the new tuple and each existing cluster is computed according to

$sim(C, tid) = \frac{1}{|C|} \sum_{i=1}^{m} wei\_sp(a_i)$, where $a_i = tid.A_i$ and $|C|$ is the count of tuples in cluster $C$

For the tuple to join a cluster, the similarity must be above the threshold, denoted as σ.

When computing the similarity, we use the clusters' summaries instead of the clusters themselves, since the needed information is contained in the clusters' summaries.
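The one-pass procedure described above might be sketched as follows. This is not the authors' implementation: it assumes the similarity is the sum of the weighted supports of the tuple's values divided by the cluster size, that a tuple joins the most similar cluster whose similarity exceeds σ, and that the attribute weights are known in advance. The names `similarity` and `weisc_sketch` are hypothetical.

```python
from collections import Counter


def similarity(cluster, tup, weights):
    """sim(C, tid) = (1/|C|) * sum_i wei_sp(a_i), using only the summary."""
    total = 0.0
    for i, a in enumerate(tup):
        # A Counter returns 0 for values the cluster has never seen.
        total += weights[i] * cluster["counts"][i][a]
    return total / cluster["size"]


def weisc_sketch(tuples, weights, sigma):
    """Single scan: assign each tuple to a cluster or open a new one."""
    clusters, labels = [], []
    for tup in tuples:
        best, best_sim = None, sigma
        for cid, c in enumerate(clusters):
            s = similarity(c, tup, weights)
            if s > best_sim:  # must be strictly above the threshold
                best, best_sim = cid, s
        if best is None:  # rejected by all existing clusters
            clusters.append({"size": 0, "counts": [Counter() for _ in tup]})
            best = len(clusters) - 1
        c = clusters[best]
        c["size"] += 1
        for i, a in enumerate(tup):
            c["counts"][i][a] += 1  # update the summary incrementally
        labels.append(best)
    return labels, clusters


data = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "y")]
labels, clusters = weisc_sketch(data, weights=[0.5, 0.5], sigma=0.4)
print(labels)
```

Each tuple is compared against per-value counts only, which is why the member tuples themselves never need to be revisited.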

Computational complexities

The time and space complexities of the WeiSC algorithm depend on

The size of the dataset, |D|

The number of attributes, m

The number of clusters, p = f(σ)

The size of each cluster, g(σ)

Time complexity: O(|D| * m * f(σ))

Space complexity: O(|D| + m * f(σ) * g(σ))

Experimental results

The experimental results on the performance of WeiSC

Compare the clustering results with ROCK’s on the same dataset

Quality of clustering results with real-life datasets

Mushroom dataset (real-life), obtained from the UCI Machine Learning Repository

Corresponds to 23 species of gilled mushrooms

Each species is identified as definitely edible or definitely poisonous

Has 21 attributes and 8124 tuples

The number of edible tuples is 4208

The number of poisonous tuples is 3916

The effect of σ

The parameter σ

Is the only parameter needed by the WeiSC algorithm

Affects the clustering results and the speed of the algorithm

The percentage of misclassified tuples can be used as a measure of its effect, since each tuple has been labeled “edible” or “poisonous”
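One way to compute such a measure is sketched below. The slides do not say how a misclassified tuple is counted, so this example assumes that each cluster is assigned its majority label and every tuple with a different label counts as misclassified. The function name `misclassified_percentage` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def misclassified_percentage(cluster_ids, true_labels):
    """Percentage of tuples whose label differs from their cluster's majority.

    Assumption: the majority label of a cluster is treated as its class.
    """
    by_cluster = defaultdict(Counter)
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster[cid][label] += 1
    wrong = sum(sum(c.values()) - max(c.values()) for c in by_cluster.values())
    return 100.0 * wrong / len(true_labels)


cluster_ids = [0, 0, 0, 1, 1]
true_labels = ["edible", "edible", "poisonous", "poisonous", "poisonous"]
rate = misclassified_percentage(cluster_ids, true_labels)
print(rate)
```

Running the clustering for several values of σ and computing this percentage each time would give the kind of curve this slide refers to.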

Conclusions

The WeiSC algorithm is robust and efficient

Supported by both inference and experiments

Reads the dataset only once

Used in IDS

Is fast and achieves good efficiency

Personal Opinion

We can compare WeiSC algorithm with our algorithm.
