C.Watters CS6403 1 Clustering

Dec 14, 2015

Suzan Hill

Clustering

Clustering

• What

• Why

• How

• Results

Clustering

• Assign items to groups based on some calculation of degree of likeness between items

• Groups are not known beforehand

• Uses multivariate analysis techniques

• Feature set determination is critical

Example

• News data

• Sports, World news, Entertainment, etc.

• Short items, items with photos, items with names

Why

• Improve efficiency of retrieval

• Improve effectiveness of retrieval

• Ranking of retrieved results

• Visualization of results

• Kohonen maps, i.e. SOM (self-organizing maps)

• Discovery of content

• Discovery of relationships

How

• Put items into groups so that members have a high degree of association within the group

• AND items have low degree of association with items in other groups

• Association for IR documents?

• Feature set?

Feature Sets for IR Clustering

• Term occurrences

• Citations

• Names

• Structure (tags)

• Co-occurrences (thesaurus construction)

Problems

• Choosing the best feature set

• Choosing the similarity measure

• Evaluation of results

• Updates

• Searching clusters

Measures of Similarity

• Need to quantify the degree of association of an item with others

• Generally want a measure that is normalized by document vector length

• Not clear that weighted document terms are better than binary ones in clustering

General Measures

• Dice coefficient

• Jaccard Coefficient

• Cosine Coefficient

Dice Coefficient

• Binary weights

$\mathrm{Dice}(i, j) = \frac{2C}{A + B}$

where C = terms in common, A = terms in i, and B = terms in j
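As an illustration, the Dice coefficient with binary weights can be computed directly from term sets as 2C / (A + B). This is a minimal sketch; the function name `dice`, the sample documents, and the choice to return 0 for two empty documents are assumptions, not part of the slides.

```python
def dice(terms_i, terms_j):
    """Dice coefficient for two documents given as sets of terms (binary weights)."""
    a = len(terms_i)            # A: number of terms in document i
    b = len(terms_j)            # B: number of terms in document j
    if a + b == 0:
        return 0.0              # assumed convention for two empty documents
    c = len(terms_i & terms_j)  # C: number of terms in common
    return 2 * c / (a + b)


# Hypothetical sample documents
doc_i = {"hockey", "game", "score"}
doc_j = {"hockey", "score", "trade", "player"}
print(dice(doc_i, doc_j))  # 2*2 / (3 + 4), about 0.571
```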

Jaccard Coefficient

• Binary Weights

$\mathrm{Jaccard}(i, j) = \frac{C}{A + B - C}$

where C = terms in common, A = terms in i, and B = terms in j
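A comparable sketch for the Jaccard coefficient with binary weights, C / (A + B - C), where the denominator is the size of the union of the two term sets. The function name `jaccard`, the sample documents, and the empty-document convention are assumptions for illustration.

```python
def jaccard(terms_i, terms_j):
    """Jaccard coefficient for two documents given as sets of terms (binary weights)."""
    union = len(terms_i | terms_j)  # equals A + B - C
    if union == 0:
        return 0.0                  # assumed convention for two empty documents
    c = len(terms_i & terms_j)      # C: number of terms in common
    return c / union


# Hypothetical sample documents
doc_i = {"hockey", "game", "score"}
doc_j = {"hockey", "score", "trade", "player"}
print(jaccard(doc_i, doc_j))  # 2 / (3 + 4 - 2) = 0.4
```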

Cosine Coefficient

• Binary weights

$\mathrm{Cosine}(i, j) = \frac{C}{\sqrt{A \cdot B}}$

where C = terms in common, A = terms in i, and B = terms in j
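With binary weights the cosine coefficient reduces to C / sqrt(A × B), because each vector's length is the square root of its term count. The sketch below assumes documents are given as term sets; `cosine_binary` and the sample data are illustrative names only.

```python
import math


def cosine_binary(terms_i, terms_j):
    """Cosine coefficient for two documents given as sets of terms (binary weights)."""
    a = len(terms_i)            # A: number of terms in document i
    b = len(terms_j)            # B: number of terms in document j
    if a == 0 or b == 0:
        return 0.0              # assumed convention when a document has no terms
    c = len(terms_i & terms_j)  # C: number of terms in common
    return c / math.sqrt(a * b)


# Hypothetical sample documents
doc_i = {"hockey", "game", "score"}
doc_j = {"hockey", "score", "trade", "player"}
print(cosine_binary(doc_i, doc_j))  # 2 / sqrt(3 * 4), about 0.577
```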

Now what?

• Need to be able to compare any doc to any other doc

• Need?

Doc-Doc Similarity Matrix (entry ij = similarity between doc i and doc j)

11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55
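One way to picture the doc-doc similarity matrix is as an N × N table filled by applying a similarity measure to every pair of documents. The sketch below is an assumption-laden illustration: it uses the Jaccard coefficient on term sets, and the names `jaccard`, `similarity_matrix`, and the sample documents are hypothetical. Because the measure is symmetric, only the upper triangle is computed and then mirrored.

```python
def jaccard(terms_i, terms_j):
    """Jaccard coefficient on term sets (binary weights)."""
    union = len(terms_i | terms_j)
    return len(terms_i & terms_j) / union if union else 0.0


def similarity_matrix(docs, sim):
    """Build the full N x N doc-doc similarity matrix as a list of rows."""
    n = len(docs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = sim(docs[i], docs[j])
            matrix[i][j] = s
            matrix[j][i] = s  # similarity is symmetric
    return matrix


# Hypothetical sample collection
docs = [
    {"hockey", "score"},
    {"hockey", "trade"},
    {"election", "vote"},
]
for row in similarity_matrix(docs, jaccard):
    print(row)
```

For N documents this needs N(N + 1)/2 similarity calculations, which is one reason the full matrix is rarely materialized for large collections.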

Generating Similarity Matrix

• Use inverted file

• Documents with no terms in common do not need similarity calculation

• Generally generate only one row at a time as needed
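The points above can be sketched as follows: an inverted file maps each term to the documents containing it, so a single row of the similarity matrix can be produced by visiting only the documents that share at least one term with the given one. The function names, the use of the Dice coefficient, and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the set of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            inverted[term].add(doc_id)
    return inverted


def similarity_row(doc_id, docs, inverted):
    """Return {other_id: Dice similarity} for documents sharing a term with doc_id.

    Documents with no terms in common never appear in the postings visited,
    so no similarity calculation is done for them.
    """
    common = defaultdict(int)  # other_id -> number of shared terms (C)
    for term in docs[doc_id]:
        for other in inverted[term]:
            if other != doc_id:
                common[other] += 1
    a = len(docs[doc_id])
    return {other: 2 * c / (a + len(docs[other])) for other, c in common.items()}


# Hypothetical sample collection
docs = [
    {"hockey", "score"},
    {"hockey", "trade"},
    {"election", "vote"},
]
inverted = build_inverted_file(docs)
print(similarity_row(0, docs, inverted))  # {1: 0.5}
```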

Algorithms

• Problem: sort N things into M groups, where 1 ≤ M ≤ N

• Choice of algorithm determines
– M
– membership

General Classes of Algorithms

• Hierarchical
– Nested groups
– Pairwise connections made

• Non-hierarchical
– No overlap
– Centroid
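To make the hierarchical idea concrete, here is a minimal sketch of agglomerative single-link clustering, which repeatedly joins the two most similar clusters so that the groups end up nested. The slides do not name a specific algorithm; single-link, the Jaccard measure, and all names and data below are assumptions chosen for illustration.

```python
def jaccard(terms_i, terms_j):
    """Jaccard coefficient on term sets (binary weights)."""
    union = len(terms_i | terms_j)
    return len(terms_i & terms_j) / union if union else 0.0


def single_link(docs, sim):
    """Merge clusters pairwise until one remains; return the merge history."""
    clusters = [frozenset([i]) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: cluster similarity is that of the closest pair.
                s = max(sim(docs[i], docs[j])
                        for i in clusters[x] for j in clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        s, x, y = best
        merged = clusters[x] | clusters[y]
        merges.append((s, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


# Hypothetical sample collection
docs = [
    {"hockey", "score"},
    {"hockey", "score", "trade"},
    {"election", "vote"},
    {"election", "poll", "vote"},
]
for similarity, members in single_link(docs, jaccard):
    print(round(similarity, 3), members)
```

Cutting the merge history at a chosen similarity threshold gives a flat grouping; a non-hierarchical method would instead assign each document to exactly one group, typically the one with the closest centroid.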

Evaluation of results

• Was the method appropriate for the data set?

• Do the clusters represent the data well?

• Are the docs in the right cluster?

How to test?

• Overlap test: run a known query set and evaluate against known results

• Randomly select docs and judge relevance to group members

• Examine distribution of docs in groups

• Density test: density = term occurrences / (docs × unique terms)
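The density test can be read as the fraction of cells filled in the document-term matrix: term occurrences divided by (docs × unique terms). The sketch below assumes binary weights, so "term occurrences" is counted as the number of distinct (document, term) pairs; that interpretation, the name `density`, and the sample data are assumptions.

```python
def density(docs):
    """Fraction of non-zero cells in the binary document-term matrix."""
    vocabulary = set().union(*docs) if docs else set()
    cells = len(docs) * len(vocabulary)   # docs x unique terms
    if cells == 0:
        return 0.0                        # assumed convention for an empty collection
    occurrences = sum(len(terms) for terms in docs)
    return occurrences / cells


# Hypothetical sample collection: 6 occurrences, 3 docs, 5 unique terms
docs = [
    {"hockey", "score"},
    {"hockey", "trade"},
    {"election", "vote"},
]
print(density(docs))  # 6 / (3 * 5) = 0.4
```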

Concepts to keep in mind

• Cluster hypothesis

• Nearest neighbour

• Centroid

Cluster Hypothesis

• Associations between documents are related to the relevance of documents to queries

• Van Rijsbergen, 1979

Nearest Neighbour

• Find the document most similar to the given one

• This one is most likely closely related

• Works with terms, citations, & clusters
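A nearest-neighbour lookup over term features can be sketched as a scan for the most similar other document. The Jaccard measure, the function names, and the sample data here are assumptions; any of the similarity measures above could be substituted.

```python
def jaccard(terms_i, terms_j):
    """Jaccard coefficient on term sets (binary weights)."""
    union = len(terms_i | terms_j)
    return len(terms_i & terms_j) / union if union else 0.0


def nearest_neighbour(doc_id, docs, sim):
    """Return (id, similarity) of the document most similar to docs[doc_id]."""
    best_id, best_sim = None, -1.0
    for other, terms in enumerate(docs):
        if other == doc_id:
            continue  # a document is not its own neighbour
        s = sim(docs[doc_id], terms)
        if s > best_sim:
            best_id, best_sim = other, s
    return best_id, best_sim


# Hypothetical sample collection
docs = [
    {"hockey", "score"},
    {"hockey", "score", "trade"},
    {"election", "vote"},
]
print(nearest_neighbour(0, docs, jaccard))  # document 1 is the closest to document 0
```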

Centroids

• Representative of a cluster

• May be a document from that cluster

• May be a composite of doc features from that cluster

• Why:
– query-centroid calculations
– higher-level representations of data set
– build ontologies and thesauri
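As a sketch of the composite case, a centroid can be built by averaging the binary term vectors of a cluster's documents, and a query can then be compared against centroids rather than against every document. The averaging scheme, the cosine comparison, and all names and data below are assumptions for illustration; clusters are assumed to be non-empty.

```python
import math
from collections import Counter


def composite_centroid(cluster):
    """Average term weight over the (non-empty) cluster's documents."""
    counts = Counter(term for doc in cluster for term in doc)
    return {term: n / len(cluster) for term, n in counts.items()}


def query_centroid_cosine(query_terms, centroid):
    """Cosine between a binary query vector and a weighted centroid."""
    dot = sum(centroid.get(term, 0.0) for term in query_terms)
    norm_q = math.sqrt(len(query_terms))
    norm_c = math.sqrt(sum(w * w for w in centroid.values()))
    if norm_q == 0 or norm_c == 0:
        return 0.0
    return dot / (norm_q * norm_c)


def best_cluster(query_terms, clusters):
    """Index of the cluster whose centroid is most similar to the query."""
    scores = [query_centroid_cosine(query_terms, composite_centroid(c))
              for c in clusters]
    return max(range(len(clusters)), key=scores.__getitem__)


# Hypothetical clusters of term-set documents
clusters = [
    [{"hockey", "score"}, {"hockey", "trade"}],
    [{"election", "vote"}, {"election", "poll"}],
]
print(composite_centroid(clusters[0]))
print(best_cluster({"hockey", "trade"}, clusters))  # 0
```

Comparing a query with one centroid per cluster takes M calculations instead of N, which is the efficiency argument for query-centroid matching.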

Visualization of Clusters

• Kohonen Maps

• Star maps

• SOM (self-organizing maps)

• etc.

Samples

Cluster Map

Starfield
