Page 1: Clustering - coli.uni-saarland.de

Computational Linguistics

Clustering

Stefan Thater & Dietrich Klakow

FR 4.7 Allgemeine Linguistik (Computerlinguistik)

Universität des Saarlandes

Summer 2014

Page 2: Clustering - coli.uni-saarland.de

Cluster Analysis

Goal:

group similar items together

Steps:

define a similarity measure between samples

define a loss function

find an algorithm that minimizes this loss function

Page 3: Clustering - coli.uni-saarland.de

Examples

Page 4: Clustering - coli.uni-saarland.de

Clustering Search Results

Cluster Text (e.g. search results)

Page 5: Clustering - coli.uni-saarland.de

Cluster Words

(word cloud of the speech “I have a dream”)

From http://neoformix.com/2011/wcd_KingIHaveADream.png

Page 6: Clustering - coli.uni-saarland.de

Cluster Image Regions: Image Segmentation

http://people.cs.uchicago.edu/~pff/segment/

Page 7: Clustering - coli.uni-saarland.de

Cluster Image Regions

Vector quantization to compress images

(Bishop, PRML)

Page 8: Clustering - coli.uni-saarland.de

Unsupervised learning

Page 9: Clustering - coli.uni-saarland.de

Supervised Classification: Labels known

[figure: data points from Class1 and Class2]

Pages 10-12: Clustering - coli.uni-saarland.de

[figure-only slides showing the Class1 / Class2 data]

Page 13: Clustering - coli.uni-saarland.de

Clustering:

No labels!

Page 14: Clustering - coli.uni-saarland.de

Clustering:

No labels!

Cluster

Page 15: Clustering - coli.uni-saarland.de

Clustering:

No labels!

Page 16: Clustering - coli.uni-saarland.de

Clustering:

No labels!

Page 17: Clustering - coli.uni-saarland.de

Clustering:

No labels!

Cluster???

Page 18: Clustering - coli.uni-saarland.de

Similarity Measures

Page 19: Clustering - coli.uni-saarland.de

Euclidean Distances

x = (5, 5), y = (9, 8)

L2-norm: d(x, y) = √(4² + 3²) = 5

L1-norm: d(x, y) = 4 + 3 = 7

[figure: right triangle between x and y with legs 4 and 3]

Page 20: Clustering - coli.uni-saarland.de

Axioms of a Distance Measure

d is a distance measure if it is a function from pairs of points to the reals such that:

d(x, y) ≥ 0

d(x, y) = 0 iff x = y

d(x, y) = d(y, x)

d(x, y) ≤ d(x, z) + d(z, y)   (triangle inequality)

Page 21: Clustering - coli.uni-saarland.de

Distance Measures

L2 distance (Euclidean distance):

$d_2(x, y) = \sqrt{\sum_{k=1}^{K} |x_k - y_k|^2}$

L1 distance (Manhattan distance):

$d_1(x, y) = \sum_{k=1}^{K} |x_k - y_k|$

L∞ distance (maximum distance):

$d_\infty(x, y) = \max_k |x_k - y_k|$

Page 22: Clustering - coli.uni-saarland.de

Example

• Calculate the distance between x = (3, 0, 1, 3) and y = (1, 2, 2, 1)

• Use all three distance measures introduced on the previous slide
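A minimal sketch of this calculation in Python with NumPy (my choice of language; the slides do not prescribe one):

```python
import numpy as np

x = np.array([3, 0, 1, 3])
y = np.array([1, 2, 2, 1])

d2 = np.sqrt(np.sum((x - y) ** 2))    # L2 (Euclidean): sqrt(4+4+1+4) ≈ 3.61
d1 = np.sum(np.abs(x - y))            # L1 (Manhattan): 2+2+1+2 = 7
dinf = np.max(np.abs(x - y))          # L-infinity (maximum): 2

print(d2, d1, dinf)
```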

Page 23: Clustering - coli.uni-saarland.de

• Cosine

• Edit distance

• Jaccard

• Kernels

Other distance measures

Page 24: Clustering - coli.uni-saarland.de

K-Means Clustering

Page 25: Clustering - coli.uni-saarland.de

Recipe

1. For each cluster, decide on an initial mean

2. Assign each data point to the nearest mean

3. Recalculate the means according to the assignment

4. If any mean changed, go back to 2
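A minimal NumPy sketch of this recipe (an illustration, not the course's reference implementation; the (N, D) data layout and the random initialisation are assumptions):

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Plain k-means with Euclidean distance; data has shape (N, D)."""
    rng = np.random.default_rng(seed)
    # 1. decide on initial means: pick k distinct training vectors
    means = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. assign each data point to the nearest mean
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 3. recalculate means according to the assignment
        new_means = np.array([data[assign == j].mean(axis=0) if np.any(assign == j)
                              else means[j] for j in range(k)])
        # 4. stop once the means no longer change
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, assign

# toy usage: two well-separated blobs
data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
means, assign = kmeans(data, k=2)
```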

Pages 26-34: Clustering - coli.uni-saarland.de

[figure-only slides]
Page 35: Clustering - coli.uni-saarland.de

Example: Assignment

$r_{n,k} = \begin{cases} 1 & \text{if } k = \arg\min_j d(x_n, \mu_j) \\ 0 & \text{otherwise} \end{cases}$

$x_n$ : n-th training sample (vector)

$\mu_j$ : mean of the j-th cluster

$d(x_n, \mu_j)$ : distance (your choice, e.g. $L_2$)
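Written as code, the assignment step is just an argmin over a distance matrix; a sketch with invented toy data and placeholder names X and mu:

```python
import numpy as np

X = np.random.randn(10, 2)     # N training vectors x_n
mu = np.random.randn(3, 2)     # K cluster means mu_j

# d(x_n, mu_j) for every pair, here the L2 distance
d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # shape (N, K)

# r_{n,k} = 1 if k is the closest mean to x_n, 0 otherwise
r = np.zeros_like(d)
r[np.arange(len(X)), d.argmin(axis=1)] = 1
```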

Page 36: Clustering - coli.uni-saarland.de

Example

$r_{n,k} = \begin{cases} 1 & \text{if } k = \arg\min_j d(x_n, \mu_j) \\ 0 & \text{otherwise} \end{cases}$

See white board.

Page 37: Clustering - coli.uni-saarland.de

Example: Update Mean

$\mu_k = \frac{\sum_{n=1}^{N} r_{n,k}\, x_n}{\sum_{n=1}^{N} r_{n,k}}$

Interpret the denominator.
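A short NumPy sketch of this update, with toy data and hypothetical names:

```python
import numpy as np

X = np.random.randn(10, 2)                        # N samples x_n
r = np.zeros((10, 3))                             # one-hot assignments r_{n,k}
r[np.arange(10), np.random.randint(0, 3, 10)] = 1

# denominator sum_n r_{n,k}: the number of samples currently assigned to cluster k
n_k = r.sum(axis=0)

# mu_k = (sum_n r_{n,k} x_n) / (sum_n r_{n,k})   (assumes no cluster is empty)
mu = (r.T @ X) / n_k[:, None]
```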

Page 38: Clustering - coli.uni-saarland.de

Example: Loss Function (Distortion Measure)

$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{n,k}\, d(x_n, \mu_k)$

Which of the two has the smaller J?
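A sketch of the distortion as a function (the squared L2 distance is assumed here; any of the distances introduced earlier could be plugged in):

```python
import numpy as np

def distortion(X, mu, r):
    """J = sum_n sum_k r_{n,k} * d(x_n, mu_k); here d is the squared L2 distance."""
    d = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)   # shape (N, K)
    return float(np.sum(r * d))
```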

Page 39: Clustering - coli.uni-saarland.de

Distortion Function after each iteration

Page 40: Clustering - coli.uni-saarland.de

How to initialize K-Means

• Converges to a local optimum

• The outcome of clustering depends on the initialization

• Heuristic: pick k vectors from the training data that are furthest apart

Page 41: Clustering - coli.uni-saarland.de

How to determine K

$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{n,k}\, d(x_n, \mu_k)$

What about picking K such that J becomes as small as possible?

Page 42: Clustering - coli.uni-saarland.de

How to determine K

• For K = N the distortion is J = 0, so simply minimizing J over K does not work

• Solution: find a large jump in the distortion
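One way to look for this jump is to run k-means for a range of K values and compare the distortions; a sketch using scikit-learn (my library choice, not the slides'), where `inertia_` is scikit-learn's name for the distortion J:

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])

# distortion J for K = 1..8; look for the K after which J stops dropping sharply
for K in range(1, 9):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(data)
    print(K, km.inertia_)
```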

Page 43: Clustering - coli.uni-saarland.de

Other aspects of clustering

Page 44: Clustering - coli.uni-saarland.de

Soft clustering

No strict assignment to a cluster, just probabilities.

[figure panels: original data with overlapping class regions; the same data without class information; soft clustering]
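Such probabilistic assignments are, for example, what a Gaussian mixture model produces; a scikit-learn sketch (this library and the toy data are my additions, not part of the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3.0])

# fit a 2-component Gaussian mixture and read off soft cluster memberships
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
probs = gmm.predict_proba(data)    # shape (N, 2): probability of each cluster
```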

Page 45: Clustering - coli.uni-saarland.de

Hierarchical Clustering

Organize clusters in a hierarchy.

[figure: five items a, b, c, d, e merged step by step (Step 0 to Step 4); merging bottom-up is agglomerative, splitting top-down is divisive]
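An agglomerative sketch with SciPy; the five points and their coordinates are invented to mirror the items a-e in the figure:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[0.0, 0.0],   # a
                   [0.3, 0.1],   # b
                   [2.0, 2.0],   # c
                   [2.1, 2.2],   # d
                   [2.3, 1.9]])  # e

# agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(points, method='average')

# the resulting hierarchy can be drawn as a dendrogram (requires matplotlib)
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e'])
```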

Page 46: Clustering - coli.uni-saarland.de

Text clustering

Derive features from documents

•Frequency of words

•TF-IDF of words

•Stop word removal?

•Stemming?
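A small sketch of this feature pipeline with scikit-learn (library choice and toy documents are mine, not the slides'):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply",
        "the central bank raised interest rates"]

# TF-IDF features with stop-word removal (stemming would need an extra tool, e.g. NLTK)
X = TfidfVectorizer(stop_words='english').fit_transform(docs)

# cluster the TF-IDF document vectors, here with k-means
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```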

Page 47: Clustering - coli.uni-saarland.de

Practical Example

Page 48: Clustering - coli.uni-saarland.de

Exercise

At

http://research.microsoft.com/enus/um/people/cmbishop/prml/webdatasets/faithful.txt

you find some two-dimensional data.

Implement a k-means algorithm for two clusters using just the first (!) column.
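A possible starting point, as a sketch only: it assumes faithful.txt holds two plain whitespace-separated numeric columns and initialises the two means at the extremes of the first column.

```python
import numpy as np

# assumption: faithful.txt contains two whitespace-separated numeric columns
data = np.loadtxt("faithful.txt")
col1 = data[:, 0]                 # the first column only, as required

# crude 1-D k-means for two clusters: start from the two extreme values
means = np.array([col1.min(), col1.max()])
for _ in range(100):
    assign = np.abs(col1[:, None] - means[None, :]).argmin(axis=1)
    new_means = np.array([col1[assign == j].mean() for j in range(2)])
    if np.allclose(new_means, means):
        break
    means = new_means
```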

Page 49: Clustering - coli.uni-saarland.de

Homework

1. Apply your method only to the second column

2. Generalize your algorithm to vector-valued data and an arbitrary number of clusters. Apply it to the full data set with both columns.

3. Suppose the first column has value c1(i) and the second c2(i). Is there a new c(i) = a * c1(i) + b * c2(i), with suitably chosen a and b, such that clustering based on c(i) is better than clustering on c1(i) or c2(i) alone?

Page 50: Clustering - coli.uni-saarland.de

Word Clustering using the Brown Algorithm

Page 51: Clustering - coli.uni-saarland.de

Idea

Cluster words together that have similar neighbours.

Minimize perplexity on the training text.

Page 52: Clustering - coli.uni-saarland.de

The Brown Algorithm

g_w : class of word w

Page 53: Clustering - coli.uni-saarland.de

Example clustering

Page 54: Clustering - coli.uni-saarland.de

Application in Named Entity Tagging

Training

Word         Class label   Tag
Düsseldorf   C2            City
is           X             O
the          X             O
capital      X             O
of           X             O
NRW          X             O

Page 55: Clustering - coli.uni-saarland.de

Application in Named Entity Tagging

Testing

Word          Class label   Tag
The           X             O
Hofbräuhaus   X             O
is            X             O
in            X             O
Munich        C2            ???

How to tag if Munich is not in the training data?

Page 56: Clustering - coli.uni-saarland.de

Application

Use class labels as features in named entity tagging

Page 57: Clustering - coli.uni-saarland.de

Summary

• Clustering: finding similar items

• Distance metrics

• K-Means

• Brown Algorithm