Copy of Clustering and Similarity Search Over Sequences

Apr 06, 2018

Abdul Majeed
  • 8/3/2019 Copy of Clustering and Similarity Search Over Sequences

    CLUSTERING & SIMILARITY SEARCH OVER SEQUENCES

    SUBJECT: ADVANCED DATABASE MANAGEMENT SYSTEMS

    SUBMITTED BY:

    NAME : BINDU N V

    REGNO :

    BRANCH : 1ST SEM M.TECH(QIP) CS&E

    COLLEGE : NMAMIT-NITTE

    SUBMITTED TO:

    GURURAJ BIJU

    ASST.PROF

    DEPT OF CS&E

    NMAMIT-NITTE

    CLUSTERING & SIMILARITY SEARCH OVER SEQUENCES

    BY

    Name : BINDU N.V

    Date : 14-11-2010

    TOPICS:

    • CLUSTERING

    • CLUSTERING ALGORITHM: BIRCH

    • SIMILARITY SEARCH OVER SEQUENCES

    • ALGORITHM TO FIND SIMILAR SEQUENCES

    CLUSTERING

    • Clustering is the data mining task of finding clusters in a given set of records.

    • The goal is to partition the records into groups such that records within a group are similar to each other and records that belong to different groups are dissimilar.

    • Each such group is known as a cluster.

    • Similarity between records is measured by a distance function.

    • A distance function takes two input records and returns a value that measures how similar they are.

    • The output of a clustering algorithm consists of a summarized representation of each cluster.

    • The form of the summarized representation depends on the type and shape of the clusters the algorithm computes.

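    • For example, if we have spherical clusters, we can summarize each cluster by its center C and its radius R, where r_1, ..., r_n are the records in the cluster:

    $$C = \frac{1}{n} \sum_{i=1}^{n} r_i \qquad \text{and} \qquad R = \sqrt{\frac{\sum_{i=1}^{n} (r_i - C)^2}{n}}$$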
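    The center and radius summary can be tried out on a small example. The following is a minimal Python sketch, assuming records are tuples of numbers and that (r_i - C)^2 means the squared Euclidean distance between a record and the center; the function names and the sample points are illustrative only.

```python
import math

def center(records):
    """Component-wise mean of the records (the cluster center C)."""
    n = len(records)
    dims = len(records[0])
    return [sum(r[d] for r in records) / n for d in range(dims)]

def radius(records):
    """Root-mean-square Euclidean distance of the records from C."""
    c = center(records)
    n = len(records)
    # Sum of squared distances between each record and the center.
    total = sum(sum((x - cx) ** 2 for x, cx in zip(r, c)) for r in records)
    return math.sqrt(total / n)

# Four hypothetical records at the corners of a square.
cluster = [(1.0, 1.0), (3.0, 1.0), (1.0, 3.0), (3.0, 3.0)]
print(center(cluster))  # [2.0, 2.0]
print(radius(cluster))  # about 1.414, the distance from the center to each corner
```

    Because every record here lies at the same distance from the center, the radius equals that distance; for less regular clusters it is an average in the root-mean-square sense.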

    • Clustering algorithms are of two types:

    i) Partitional: partitions the data into k groups

    ii) Hierarchical: generates a sequence of partitions
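    To make the difference in output concrete, here is a small Python sketch of a hierarchical method on one-dimensional values. It assumes a simple merge rule (always merge the two closest neighbouring groups), which is only one of several possible rules and is not taken from the slides. The result is a sequence of partitions, whereas a partitional algorithm would return a single partition with k groups.

```python
def hierarchical_partitions(values):
    """Return the sequence of partitions produced by repeatedly merging
    the two closest groups of 1-D values (illustrative merge rule)."""
    groups = [[v] for v in sorted(values)]
    sequence = [[list(g) for g in groups]]
    while len(groups) > 1:
        # With sorted 1-D data, the closest pair of groups is always adjacent.
        gaps = [groups[i + 1][0] - groups[i][-1] for i in range(len(groups) - 1)]
        i = gaps.index(min(gaps))
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
        sequence.append([list(g) for g in groups])
    return sequence

seq = hierarchical_partitions([1, 2, 9, 10, 25])
for partition in seq:
    print(partition)
```

    Each printed line is one partition, from five singleton groups down to a single group. Picking the partition with exactly k groups from this sequence gives an output of the same shape as a partitional algorithm's.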

    A clustering algorithm: BIRCH

    • Handles large databases

    • Based on two assumptions:

    i) the number of records is very large

    ii) only a limited amount of main memory is available

    • A user can set two parameters to control the BIRCH algorithm:

    i) k

    - a threshold on the amount of main memory available

    - determines how many cluster summaries can be maintained in main memory

    ii) ε

    - a threshold for the radius of a cluster

    - controls the number of clusters the algorithm discovers

    • If the radius threshold ε is small, the algorithm discovers many small clusters.

    • If ε is large, it discovers very few large clusters.

    • A cluster is compact if its radius is smaller than ε.

    • BIRCH always maintains k or fewer cluster summaries (Ci, Ri) in main memory.

    • If this is not possible with the given amount of memory, the radius threshold ε is increased as described below.

    • The algorithm reads records from the database sequentially and processes each record r as follows:

    1. Compute the distance between record r and each of the existing cluster centers. Let i be the cluster index such that the distance between r and Ci is the smallest.

    2. Compute the value of the new radius Ri' of the ith cluster under the assumption that r is inserted into it. If Ri' < ε, where ε is the radius threshold, then the ith cluster remains compact, and we assign r to the ith cluster by updating its center and setting its radius to Ri'. If Ri' > ε, then the ith cluster would no longer be compact if we inserted r into it. Therefore, we start a new cluster containing only the record r.
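    The two steps above can be sketched in Python as follows. This is a simplified illustration, not the full BIRCH algorithm: it ignores the limit k on the number of summaries. It also assumes that each summary stores a count, a linear sum and a sum of squares so that the new radius can be computed without keeping the records; the slides only mention (Ci, Ri), so this representation and all names below are assumptions.

```python
import math

class ClusterSummary:
    """Summary of one cluster: count n, linear sum ls, sum of squared norms ss."""

    def __init__(self, record):
        self.n = 1
        self.ls = list(record)
        self.ss = sum(x * x for x in record)

    def center(self):
        return [s / self.n for s in self.ls]

    def radius_if_added(self, record):
        """Radius the cluster would have if `record` were inserted."""
        n = self.n + 1
        ls = [s + x for s, x in zip(self.ls, record)]
        ss = self.ss + sum(x * x for x in record)
        # R^2 = (1/n) * sum |r_i|^2 - |C|^2; clamp tiny negative rounding errors.
        c_sq = sum((s / n) ** 2 for s in ls)
        return math.sqrt(max(ss / n - c_sq, 0.0))

    def add(self, record):
        self.n += 1
        self.ls = [s + x for s, x in zip(self.ls, record)]
        self.ss += sum(x * x for x in record)


def insert_record(clusters, r, eps):
    """Apply steps 1 and 2 to a single record r."""
    if clusters:
        # Step 1: find the cluster whose center is closest to r.
        nearest = min(clusters, key=lambda c: math.dist(c.center(), r))
        # Step 2: absorb r only if the cluster would remain compact.
        if nearest.radius_if_added(r) < eps:
            nearest.add(r)
            return
    # No cluster yet, or the nearest one would stop being compact.
    clusters.append(ClusterSummary(r))


# Hypothetical records forming two well-separated groups.
records = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (0.9, 1.1), (8.3, 7.9)]
clusters = []
for r in records:
    insert_record(clusters, r, eps=1.0)
print([c.n for c in clusters])  # [3, 2]
```

    With a much smaller eps the same records would end up in more, smaller clusters, which matches the role of the radius threshold described earlier.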

    • The problem with the second step: if we already have the maximum number of cluster summaries, k, starting a new cluster is not possible. We then have to increase the radius threshold ε in order to merge existing clusters.
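    Why the threshold has to grow before clusters can be merged can be seen on a small example. The sketch below assumes each summary is a tuple (n, linear sum, sum of squares), which is an illustrative representation rather than something the slides specify; merging two summaries then amounts to adding them component by component.

```python
import math

def merge(a, b):
    """Merge two summaries (n, linear_sum, sum_of_squares) by adding them."""
    n = a[0] + b[0]
    ls = [x + y for x, y in zip(a[1], b[1])]
    ss = a[2] + b[2]
    return (n, ls, ss)

def radius(summary):
    """Radius computed from the summary alone."""
    n, ls, ss = summary
    c_sq = sum((s / n) ** 2 for s in ls)
    return math.sqrt(max(ss / n - c_sq, 0.0))

# Hypothetical summaries of the point sets {(0,0), (2,0)} and {(4,0), (6,0)}.
a = (2, [2.0, 0.0], 4.0)
b = (2, [10.0, 0.0], 52.0)
merged = merge(a, b)

print(radius(a), radius(b))  # 1.0 1.0
print(radius(merged))        # about 2.236
```

    With a threshold of 1.5, both clusters are compact on their own but the merged cluster is not, so the merge only becomes admissible once the threshold is raised above the merged radius.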

    SIMILARITY SEARCH OVER SEQUENCES

    • A lot of the information stored in databases consists of sequences.

    • To perform similarity search on these sequences, we assume the following:

    • The user specifies a query sequence and wants to retrieve all data sequences that are similar to the query sequence.

    • We are interested not only in sequences that match the query sequence exactly but also in those that differ only slightly from it.

    • A data sequence X is a series of numbers X = <x1, x2, ..., xk>, where k is the length of the sequence.

    • A subsequence Z = <z1, z2, ..., zj> is obtained from X by deleting numbers from the front and back of the sequence X.

    • Given two sequences X and Y of the same length k, we can define the Euclidean norm as the distance between the two sequences:

    $$\lVert X - Y \rVert = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$$
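    These definitions translate directly into code. The following Python sketch assumes sequences are plain lists of numbers; the function names and sample values are illustrative.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two sequences of the same length."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def is_subsequence(z, x):
    """True if z results from deleting numbers only from the front and
    back of x, that is, if z is a contiguous run inside x."""
    j, k = len(z), len(x)
    return any(list(x[i:i + j]) == list(z) for i in range(k - j + 1))

x = [3, 5, 8, 6, 2]
print(is_subsequence([5, 8, 6], x))  # True: drop 3 at the front and 2 at the back
print(is_subsequence([3, 8], x))     # False: 5 would have to be removed from the middle
print(euclidean_distance([1, 2, 3], [1, 4, 6]))  # square root of 0 + 4 + 9
```

    Note that a subsequence in this sense must be contiguous, which is stricter than the meaning of "subsequence" used in some other contexts.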

    • Similarity search can be classified into two types:

    1) Complete sequence matching: the query sequence and the sequences in the database have the same length.

    2) Subsequence matching: the query sequence is shorter than the sequences in the database.
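    Subsequence matching can be illustrated with a sliding window: every window of the data sequence that has the same length as the query is compared with the query. This is a brute-force Python sketch, assuming a user-chosen distance threshold eps; the function name and sample data are illustrative.

```python
import math

def subsequence_matches(query, data, eps):
    """Return the start offsets at which a window of `data` with the
    same length as `query` lies within distance eps of `query`."""
    m = len(query)
    hits = []
    for start in range(len(data) - m + 1):
        window = data[start:start + m]
        if math.dist(window, query) <= eps:
            hits.append(start)
    return hits

data = [1, 2, 3, 10, 2, 3, 4]
query = [2, 3, 4]
print(subsequence_matches(query, data, eps=1.0))  # [4]
print(subsequence_matches(query, data, eps=2.0))  # [0, 4]
```

    Complete sequence matching is the special case in which the query and the data sequence have the same length, so there is exactly one window to compare.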

    An algorithm to find similar sequences

    i) Simple method: retrieve each data sequence, compute its distance to the query sequence, and keep those that are similar.

    Disadvantage: it retrieves every sequence in the database.

    ii) High-dimensional indexing method:

    • Each data sequence and the query sequence can be represented as a point in a k-dimensional space.

    • So we can find the similar sequences by querying a multidimensional index built on these points.

    • Since we want to retrieve all sequences within a user-specified distance ε of the query sequence, we do not use a point query.

    • Instead, we query the index with a hyper-rectangle that has side length 2ε and the query sequence as its center, and we retrieve all sequences that fall within the rectangle.
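    The simple method and the hyper-rectangle query can be compared in a short Python sketch. Two points are assumptions rather than part of the slides: the rectangle query is simulated with a plain scan, standing in for a real multidimensional index, and a final refinement step is added because the corners of the rectangle lie farther than the distance threshold from its center.

```python
import math

def linear_scan(query, sequences, eps):
    """Simple method: compute the distance to every stored sequence."""
    return [s for s in sequences if math.dist(s, query) <= eps]

def rectangle_query(query, sequences, eps):
    """Candidates inside the hyper-rectangle of side 2*eps centred on the
    query. A real index would answer this without scanning everything;
    the scan here only stands in for that index lookup."""
    return [s for s in sequences
            if all(abs(si - qi) <= eps for si, qi in zip(s, query))]

def similar_sequences(query, sequences, eps):
    """Rectangle query followed by a distance check on the candidates."""
    candidates = rectangle_query(query, sequences, eps)
    return [s for s in candidates if math.dist(s, query) <= eps]

# Hypothetical data sequences of length 3.
sequences = [(5.0, 5.0, 5.0), (5.5, 5.5, 5.5), (6.0, 6.0, 6.0), (9.0, 1.0, 4.0)]
query = (5.0, 5.0, 5.0)

print(rectangle_query(query, sequences, eps=1.0))    # three candidates
print(similar_sequences(query, sequences, eps=1.0))  # two true matches
```

    The sequence (6.0, 6.0, 6.0) falls inside the rectangle but is about 1.73 away from the query, so the refinement step drops it; the final result agrees with the linear scan.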
