C.Watters CS6403 1

Clustering

Clustering

• What

• Why

• How

• Results

Clustering

• Assign items to groups based on some calculation of degree of likeness between items

• Groups are not known beforehand

• Uses multivariate analysis techniques

• Feature set determination is critical

Example

• News data

• Sports, World news, Entertainment etc

• Short items, items with photos, items with names

Why

• Improve efficiency of retrieval

• Improve effectiveness of retrieval

• Ranking of retrieved results

• Visualization of results

• Kohonen maps and SOM (self-organizing maps)

• Discovery of content

• Discovery of relationships

How

• Put items into groups so that members have a high degree of association within the group

• AND items have low degree of association with items in other groups

• Association for IR documents?

• Feature set?

Feature Sets for IR Clustering

• Term occurrences

• Citations

• Names

• Structure (tags)

• Co-occurrences (thesaurus construction)
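
As a minimal sketch of the first feature set above (term occurrences), the snippet below turns raw text into term counts and into a binary term set. The tokenizer (lowercase, letters only) and the sample headlines are assumptions made for illustration; they are not taken from the course material.

```python
import re
from collections import Counter

def term_counts(text):
    """Term-occurrence features: maps each lowercase word to its count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def term_set(text):
    """Binary features: the set of distinct terms in the text."""
    return set(term_counts(text))

# Hypothetical sample documents.
docs = {
    "d1": "Team wins the cup final",
    "d2": "The cup final draws a record crowd",
}
features = {name: term_set(text) for name, text in docs.items()}
```

The binary term sets are the form assumed by the Dice, Jaccard and cosine measures with binary weights.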

Problems

• Choosing the best feature set

• Choosing the similarity measure

• Evaluation of results

• Updates

• Searching clusters

Measures of Similarity

• Need to quantify the degree of association of an item with others

• Generally want a measure that is normalized by document vector length

• Not clear that weighted document terms are better than binary ones in clustering

General Measures

• Dice coefficient

• Jaccard Coefficient

• Cosine Coefficient

Dice Coefficient

• Binary weights

C = number of terms in common, A = number of terms in doc i, B = number of terms in doc j
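
With binary weights, the standard form of the Dice coefficient is 2C / (A + B). The sketch below computes it on term sets; the function name and the sample term sets are illustrative assumptions.

```python
def dice(terms_i, terms_j):
    """Dice coefficient 2C / (A + B) for two binary term sets."""
    a, b = len(terms_i), len(terms_j)
    if a + b == 0:
        return 0.0  # assumption: two empty docs count as dissimilar
    c = len(terms_i & terms_j)  # C: terms in common
    return 2 * c / (a + b)

# Hypothetical documents: C = 2, A = 4, B = 3.
doc_i = {"cup", "final", "team", "wins"}
doc_j = {"cup", "final", "crowd"}
score = dice(doc_i, doc_j)  # 4/7, about 0.571
```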

Jaccard Coefficient

• Binary Weights
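
Using the same symbols as for the Dice coefficient (C terms in common, A terms in i, B terms in j), the standard binary form of the Jaccard coefficient is C / (A + B - C), that is, shared terms over the union of terms. A minimal sketch, with an illustrative function name and sample data:

```python
def jaccard(terms_i, terms_j):
    """Jaccard coefficient C / (A + B - C) for two binary term sets."""
    union = len(terms_i | terms_j)  # A + B - C
    if union == 0:
        return 0.0  # assumption: two empty docs count as dissimilar
    return len(terms_i & terms_j) / union

# Hypothetical documents: C = 2, A = 4, B = 3.
doc_i = {"cup", "final", "team", "wins"}
doc_j = {"cup", "final", "crowd"}
score = jaccard(doc_i, doc_j)  # 2/5 = 0.4
```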


Cosine Coefficient

• Binary weights
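
For binary vectors the cosine coefficient reduces to C / sqrt(A × B), again with C terms in common, A terms in i and B terms in j. A minimal sketch under that assumption; the function name and sample data are illustrative.

```python
import math

def cosine_binary(terms_i, terms_j):
    """Cosine coefficient C / sqrt(A * B) for two binary term sets."""
    a, b = len(terms_i), len(terms_j)
    if a == 0 or b == 0:
        return 0.0  # assumption: an empty doc is similar to nothing
    return len(terms_i & terms_j) / math.sqrt(a * b)

# Hypothetical documents: C = 2, A = 4, B = 3.
doc_i = {"cup", "final", "team", "wins"}
doc_j = {"cup", "final", "crowd"}
score = cosine_binary(doc_i, doc_j)  # 2 / sqrt(12), about 0.577
```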


Now what?

• Need to be able to compare any doc to any other doc

• Need?

Doc-Doc Similarity Matrix (entry ij = similarity of doc i to doc j)

11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55
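
A doc-doc similarity matrix can be sketched as follows. The example assumes binary term sets and uses the Dice coefficient as the measure; any of the general measures could be passed in instead. Function names and sample documents are illustrative.

```python
def dice(s, t):
    """Dice coefficient on binary term sets."""
    total = len(s) + len(t)
    return 2 * len(s & t) / total if total else 0.0

def similarity_matrix(docs, sim=dice):
    """Full N x N matrix; entry [i][j] is the similarity of doc i to doc j."""
    n = len(docs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # The measure is symmetric, so fill both halves at once.
            matrix[i][j] = matrix[j][i] = sim(docs[i], docs[j])
    return matrix

# Hypothetical documents.
docs = [
    {"cup", "final", "team"},
    {"cup", "final", "crowd"},
    {"election", "vote"},
]
matrix = similarity_matrix(docs)
```

Because the matrix is symmetric, only N(N+1)/2 similarities are actually computed.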

Generating Similarity Matrix

• Use inverted file

• Documents with no terms in common do not need similarity calculation

• Generally generate only one row at a time as needed
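
The three points above can be sketched as follows: build an inverted file (term to documents), then compute a single matrix row on demand, touching only documents that share at least one term with the given one. The Dice coefficient, the function names and the sample documents are assumptions for illustration.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the set of doc ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            inverted[term].add(doc_id)
    return inverted

def similarity_row(doc_id, docs, inverted):
    """One row of the matrix, as {other_id: Dice score}.

    Docs with no terms in common never appear in the postings we walk,
    so no similarity is calculated for them.
    """
    common = defaultdict(int)  # other doc id -> number of shared terms (C)
    for term in docs[doc_id]:
        for other in inverted[term]:
            if other != doc_id:
                common[other] += 1
    a = len(docs[doc_id])
    return {other: 2 * c / (a + len(docs[other])) for other, c in common.items()}

# Hypothetical documents.
docs = {
    "d1": {"cup", "final", "team"},
    "d2": {"cup", "final", "crowd"},
    "d3": {"election", "vote"},
}
inverted = build_inverted_file(docs)
row = similarity_row("d1", docs, inverted)  # only d2 appears in the row
```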

Algorithms

• Problem: sort N things into M groups, where 1 ≤ M ≤ N

• Choice of algorithm determines
– M
– membership

General Classes of Algorithms

• Hierarchical
– Nested groups
– Pairwise connections made

• Non-hierarchical
– No overlap
– Centroid
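
As a hedged illustration of the hierarchical class, the sketch below uses single-link agglomerative clustering, one common method that builds nested groups by repeatedly making the strongest pairwise connection. The slides do not name a specific algorithm, so this choice, the Jaccard measure and the sample documents are all assumptions.

```python
def jaccard(s, t):
    """Jaccard coefficient on binary term sets."""
    union = len(s | t)
    return len(s & t) / union if union else 0.0

def single_link(docs, sim=jaccard):
    """Agglomerative single-link clustering.

    Returns the merges in order as (link strength, merged cluster of doc indices).
    """
    clusters = [frozenset([i]) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: cluster similarity is the best pairwise doc similarity.
                link = max(sim(docs[i], docs[j])
                           for i in clusters[x] for j in clusters[y])
                if best is None or link > best[0]:
                    best = (link, x, y)
        link, x, y = best
        merged = clusters[x] | clusters[y]
        merges.append((link, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges

# Hypothetical documents: two sports items, two politics items.
docs = [
    {"cup", "final", "team"},
    {"cup", "final", "crowd"},
    {"election", "vote"},
    {"election", "vote", "poll"},
]
merges = single_link(docs)
```

Each merge contains the earlier ones, so the list of merges is the nested hierarchy; cutting it at a chosen link strength yields M groups.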

Evaluation of results

• Was the method appropriate for the data set?

• Do the clusters represent the data well?

• Are the docs in the right cluster?

How to test?

• Overlap test: run a known query set and evaluate against known results

• Randomly select docs and judge relevance to group members

• Examine distribution of docs in groups

• Density test = term occurrences / (docs × unique terms)
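
The density test can be sketched as a single ratio. The example assumes binary term sets, so "term occurrences" is counted as the number of document-term postings; that reading, the function name and the sample documents are assumptions.

```python
def density(docs):
    """Term occurrences divided by (number of docs x number of unique terms)."""
    postings = sum(len(terms) for terms in docs)
    unique_terms = set().union(*docs)
    cells = len(docs) * len(unique_terms)
    return postings / cells if cells else 0.0

# Hypothetical documents: 8 postings, 3 docs, 6 unique terms.
docs = [
    {"cup", "final", "team"},
    {"cup", "final", "crowd"},
    {"election", "vote"},
]
d = density(docs)  # 8 / 18, about 0.444
```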

Concepts to keep in mind

• Cluster hypothesis

• Nearest neighbour

• Centroid

Cluster Hypothesis

• Associations between documents are related to the relevance of documents to queries

• Van Rijsbergen, 1979

Nearest Neighbour

• Find the document most similar to the given one

• This one is most likely closely related

• Works with terms, citations, & clusters
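
A minimal nearest-neighbour sketch over term features: compare the given document with every other one and keep the most similar. The Jaccard measure, the function names and the sample documents are illustrative assumptions; the same search works over citation or cluster features.

```python
def jaccard(s, t):
    """Jaccard coefficient on binary term sets."""
    union = len(s | t)
    return len(s & t) / union if union else 0.0

def nearest_neighbour(target, docs, sim=jaccard):
    """Return the id of the doc most similar to `target`, or None if there is no other doc."""
    candidates = [doc_id for doc_id in docs if doc_id != target]
    return max(candidates, key=lambda d: sim(docs[target], docs[d]), default=None)

# Hypothetical documents.
docs = {
    "d1": {"cup", "final", "team"},
    "d2": {"cup", "final", "crowd"},
    "d3": {"election", "vote"},
}
neighbour = nearest_neighbour("d1", docs)  # "d2"
```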

Centroids

• Representative of a cluster

• May be a document from that cluster

• May be a composite of doc features from that cluster

• Why:
– query-centroid calculations
– higher-level representations of data set
– build ontologies and thesauri
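
The sketch below illustrates a composite centroid and a query-centroid calculation. It assumes the centroid weight of a term is the fraction of cluster documents that contain it, and that the query is matched to centroids with the cosine measure; both choices, along with the names and sample clusters, are assumptions for illustration.

```python
import math
from collections import Counter

def centroid(cluster):
    """Composite centroid: term -> fraction of cluster docs containing it."""
    counts = Counter(term for doc in cluster for term in doc)
    return {term: n / len(cluster) for term, n in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse weighted vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Hypothetical clusters of binary term sets.
clusters = {
    "sports": [{"cup", "final", "team"}, {"cup", "final", "crowd"}],
    "politics": [{"election", "vote"}, {"election", "vote", "poll"}],
}
centroids = {name: centroid(docs) for name, docs in clusters.items()}

# Compare the query with one centroid per cluster instead of with every doc.
query = {"cup": 1.0, "team": 1.0}
best = max(centroids, key=lambda name: cosine(query, centroids[name]))
```

Matching against centroids costs one comparison per cluster rather than one per document, which is the efficiency argument for query-centroid calculations.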

Visualization of Clusters

• Kohonen Maps

• Star maps

• SOM (self organizing maps)

• Etc

Samples

Cluster Map

Starfield
