Advanced Topics in Text Mining
Summer Term 2017:
Text Clustering & Topic Modeling
Erich Schubert
Lehrstuhl für Datenbanksysteme,
Institut für Informatik,
Ruprecht-Karls-Universität Heidelberg
Summer Term, 2017, Heidelberg
0 / 166
Preface
This is the print version of the slides of the first ATM class in summer 2017.
Some animations available in the screen version – in particular in the clustering section – are
condensed into a single slide to reduce redundancy in the printout and the number of pages.
This class had 11 lecture sessions of 90 minutes each (including time for organizational matters and Q&A)
plus tutorial sessions with hands-on experience, and gave 4 ECTS points.
Some slides are withheld because of uncertain image copyright.
Organizational slides are not included in this version.
The screen version contained about 172 numbered frames with a total of 361 slides.
Screen version, homework assignments, etc. are currently not available publicly.
This material is made available as-is, with no guarantees on completeness or correctness.
All rights reserved. Re-upload or re-use of these contents requires the consent of the author(s).
E. Schubert Advanced Topics in Text Mining 2017-04-17
0 / 166
Changes for future versions
The following changes are suggested for a future iteration:
I Reduce: word2vec in Foundations, as there is a separate chapter on word and document
embeddings. Initially it was not clear if we will be able to cover this in this class.
I Add: intrinsic dimensionality to the curse of dimensionality (which received unexpected
interest by the attendees)
I Add: a topic modeling on book title example / homework
I Add: Collection frequency vs. document frequency
I Add: discuss n-grams in the preprocessing.
I Add: Evaluation of IR, e.g., NDCG, Perplexity?
I Add: discussion of PMI, PPMI, in word2vec etc.?
I Add: nonnegative matrix factorization (NMF)
I Add: computation examples with word2vec?
E. Schubert Advanced Topics in Text Mining 2017-04-17
Introduction Text Mining is Difficult 1: 1 / 6
Why is Text Mining Difficult?
Example: Homonyms
Apples have become more expensive.
I Apple computers?
I Apple fruit?
Many homonyms:
I Bayern: the state of Bavaria, or the soccer club FC Bayern?
I Word: the Microsoft product, or the linguistic unit?
I Jam: traffic jam, or jelly?
I A duck, or to duck? A bat, or to bat?
I Light: referring to brightness, or to weight?
E. Schubert Advanced Topics in Text Mining 2017-04-17
Introduction Text Mining is Difficult 1: 2 / 6
Why is Text Mining Difficult?
Example: Negation, sarcasm and irony
This phone may be great, but I fail to see why.
I This actor has never been so entertaining.
I The least offensive way possible
I Colloquial: [. . . ] is the shit!
I Sarcasm: Tell me something I don’t know.
I Irony: This cushion is so like a brick.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Introduction Text Mining is Difficult 1: 3 / 6
Why is Text Mining Difficult?
Example: Errors, mistakes, abbreviations
People are lazy and make mistakes, in particular in social media.
I Let’s eat, grandma. (German: Komm, wir essen, Oma.)
I I like cooking, my family, and my pets.
I They’re there with their books.
I You’re going too fast with your car.
I I need food. I am so hungary.
I Let’s grab some bear.
I Next time u r on fb check ur events.
I I’m hangry. (Hungry + angry = hangry)
E. Schubert Advanced Topics in Text Mining 2017-04-17
Introduction Text Mining is Difficult 1: 4 / 6
Recent success is impressive, but also has limits
We cannot “learn” everything
We have seen some major recent successes:
I AI assistants like Google Assistant, Siri, Cortana, Alexa.
I Machine translation like Google, Skype.
But that does not mean this approach works everywhere.
I Require massive training data.
I Require labeled data.
I Most functionality is command-based, with a fallback to web search.
E.g., “take a selfie” is a defined command, and not “understood” by the AI.
For example machine translation: the EU translates millions of pages per year, much of which is
publicly available for training translation systems.
Unsupervised text mining—the focus of this class—is much harder!
E. Schubert Advanced Topics in Text Mining 2017-04-17
Introduction Text Mining is Difficult 1: 5 / 6
Why is Text Mining Difficult?
Example: Stanford CoreNLP
Stanford CoreNLP: The standard solution for Natural Language Processing (NLP) [Man+14].
NLP is still hard, even just sentence splitting:
Example (from a song list):
All About That Bass by Scott Bradlee’s Postmodern Jukebox feat. Kate Davis
Sentence 1: All About That Bass by Scott Bradlee’s Postmodern Jukebox feat.
Sentence 2: Kate Davis
Named entity: Postmodern Jukebox feat
Best accuracy: 97% on news (and <90% on other text [HEZ15]) ⇒ several errors per document!
Many more errors occur, and it is even worse in German (e.g., “1. Bundesliga” is split into two sentences!)
E. Schubert Advanced Topics in Text Mining 2017-04-17
Introduction Text Mining is Difficult 1: 6 / 6
Why is Text Mining Difficult?
Example: Never-Ending Language Learner
Never-Ending Language Learner (NELL) [Mit+15]:
I a learning computer system
I reading the web 24 hours/day since January 2010
I knowledge base with over 80 million confidence-weighted beliefs
What NELL believes to know about apple (plant): (with high confidence!)
“steve” is believed to be a Canadian journalist.
The first “jobs” is a mixture of Steve Jobs, Steve Wozniak, and Steve Ensminger (LSU football coach)
and some other Steve with a wife named Sarah? (The CEO of Apple Bank is Steven Bush)
“steve_jobs” is believed to be a professor and the CEO of “macworld (publication)”.
The second “jobs” is a building located in the city vegas.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Introduction References 1: 7 / 6
References I[HEZ15] T. Horsmann, N. Erbs, and T. Zesch. “Fast or Accurate? - A Comparative Evaluation of PoS Tagging Models”. In: German
Society for Computational Linguistics and Language Technology, GSCL. 2015, pp. 22–30.
[Man+14] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky. “The Stanford CoreNLP Natural
Language Processing Toolkit”. In: ACL System Demonstrations. 2014.
[Mit+15] T. M. Mitchell, W. W. Cohen, E. R. H. Jr., P. P. Talukdar, J. Betteridge, A. Carlson, B. D. Mishra, M. Gardner, B. Kisiel,
J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. A. Platanios, A. Ritter, M. Samadi, B. Settles,
R. C. Wang, D. T. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. “Never-Ending Learning”. In:
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA. 2015,
pp. 2302–2310.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Linear Algebra 2: 1 / 15
Linear Algebra
We will usually be representing documents as vectors!
I Vector space math
I Vectors
I Matrices
I Multiplication
I Transpose
I Inverse
I Matrix factorization
I Principal Component Analysis (PCA, “Hauptachsentransformation”),
Eigenvectors and Eigenvalues, . . .
I Singular Value Decomposition
M = UΣV T
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Linear Algebra 2: 2 / 15
Linear Algebra II
Vector:
    v = (v_1, v_2, ..., v_j)^T
Matrix:
    M = ( m_11 m_12 ... m_1j ; m_21 m_22 ... m_2j ; ... ; m_i1 m_i2 ... m_ij )
      = ( m_1^T ; m_2^T ; ... ; m_i^T ) = (m_1, m_2, ..., m_i)^T
Transpose law:
    (M x)^T = x^T M^T
Orthogonality:
    O^T = O^{-1}
We will omit the vector arrow and the transpose T where they can be easily inferred.
Computer scientists usually store vectors in rows, not columns,
so notation will vary across sources. Double-check dimensions every time.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Linear Algebra 2: 3 / 15
Linear Algebra III – PCA
On centered (or standardized) data, with weights Σ_i ω_i = 1, we have

    X^T X = Σ_i ω_i v_i v_i^T = Cov(X)

Decompose this covariance matrix into:

    X^T X = W^T Σ^2 W = (Σ W)^T Σ W

where W is an orthonormal (rotation) matrix, and Σ is a diagonal (scaling) matrix.
The vectors of W are called eigenvectors, the values of Σ are called eigenvalues.
Project using:

    x' := Σ W x    (W rotates, Σ scales)
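To make the decomposition concrete, here is a minimal numpy sketch (an addition, not part of the original slides): it centers toy data, computes the covariance matrix with uniform weights, its eigendecomposition, and the rotated and rescaled projection. Data and variable names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])   # correlated toy data
X = X - X.mean(axis=0)                          # center the data first

C = (X.T @ X) / len(X)                          # covariance matrix (uniform weights 1/n)
eigenvalues, W = np.linalg.eigh(C)              # eigh returns eigenvalues in ascending order
order = np.argsort(eigenvalues)[::-1]           # sort components by descending eigenvalue
eigenvalues, W = eigenvalues[order], W[:, order]

X_rotated = X @ W                               # rotate the data onto the principal axes
X_scaled = X_rotated / np.sqrt(eigenvalues)     # scale each axis by 1/sqrt(eigenvalue) ("whitening")
print(np.cov(X_scaled, rowvar=False).round(2))  # approximately the identity matrix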
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Linear Algebra 2: 4 / 15
Linear Algebra IV – PCA II
[Figure: scatter plot of the raw data and of its PCA projection; axis tick labels omitted in this print version.]
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Linear Algebra 2: 5 / 15
Linear Algebra V – SVD
The decomposition in PCA is usually implemented using the SVD routine.
Singular Value Decomposition is a more general decomposition procedure.
It allows us to decompose any m× n matrix into
A = UΣV T
where U is an orthogonal m×m matrix,
Σ is a diagonal m× n matrix,
V is an orthogonal n× n matrix.
By convention, Σ is arranged by descending values.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Linear Algebra 2: 6 / 15
Linear Algebra VI – SVD II
[Diagram: A = U × Σ × V^T as block matrices.]
At first sight, this makes the data even larger. So why do we want to do this?
I Matrix properties are beneficial: orthogonal (U , V ) respectively diagonal (Σ).
I If Σ has zeros on the diagonal, we can reduce the matrix sizes without loss.
I We can approximate (with least-squared error) the data by further shrinking the matrix.
Intuition: U maps rows to factors, Σ is the factor weight, and V maps factors to columns.
PCA is often used the same way – by keeping only the most important components!
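As an illustration of keeping only the most important factors, the following numpy sketch (added here, not from the slides) computes the best rank-k approximation of a small toy document-term matrix via SVD.

import numpy as np

# Toy document-term matrix: rows = documents, columns = terms.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt, s sorted descending
k = 2                                              # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation (least-squared error)
print(np.round(A_k, 2))
print("approximation error:", np.linalg.norm(A - A_k).round(3))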
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Statistics 2: 7 / 15
Statistics
We are often using statistical language models!
I Elementary probability theory
I Conditional probabilities
P (A|B)
I Bayes’ rule
    P(A|B) = P(B|A) P(A) / P(B)
I Random variables, probability density, cumulative density
I Expectation and variance.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Statistics 2: 8 / 15
Statistics I – Probabilities
Some basic terms and properties:
I P(A): Probability that A occurs
I P(AB) = P(A ∧ B) = P(A ∩ B) = P(A, B): Probability that A and B occur
I P(A|B) = P(A ∧ B) / P(B): Conditional probability that A occurs if B occurs
I P(A ∨ B) = P(A ∪ B) = P(A) + P(B) − P(AB): A or B (or both) occur
I P(A ∩ B) = P(A) · P(B) ⇔ A and B are independent
I 0 ≤ P(A) ≤ 1; P(∅) = 0; P(¬A) = 1 − P(A)
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Statistics 2: 9 / 15
Statistics II – Bayes’ rule
    P(A|B) = P(A ∧ B) / P(B)
    ⇒ P(A ∧ B) = P(A|B) · P(B)
    ⇒ P(B ∧ A) = P(B|A) · P(A)
Because A ∧ B = B ∧ A, these are equal:
    ⇒ P(A|B) P(B) = P(B|A) P(A)
And we can derive Bayes’ rule:
    ⇒ P(B|A) = P(A|B) P(B) / P(A)
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Statistics 2: 10 / 15
Statistics III — Bayes’ rule II
    P(B|A) = P(A|B) P(B) / P(A)
Why is Bayes’ rule important?
Bayes’ rule is important because it allows us to reason “backwards”:
if we know the probability of B → A, we can compute the probability of A → B.
If one of these values cannot be observed, we can estimate it using Bayes’ rule.
If we do not know either P(A) or P(B), we may still be able to cancel it out in some equations
(if we can look at the relative likelihood of two complementary options, e.g., spam vs. not spam).
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Statistics 2: 11 / 15
Statistics IV — Bayes’ rule III
Example – Breast cancer detection:

                       Probability   Test positive   Test negative
    Breast cancer      1%            80%             20%
    No breast cancer   99%           9.6%            90.4%

Question: is this a good test for breast cancer?
Naive answer: Test results are 80–90% correct.
If the test result is positive, what is the probability of having cancer (= the test is correct)?

    P(cancer|positive) = P(positive|cancer) · P(cancer) / P(positive)
                       = (80% · 1%) / (80% · 1% + 9.6% · 99%)
                       ≈ 7.8%

In > 90% of “cancer detected” cases, the patient is fine! (Because 10% of 99% ≫ 80% of 1%.)
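The same computation as a short Python snippet (added for illustration; the numbers are taken from the table above):

# Posterior P(cancer | positive test) via Bayes' rule, using the numbers from the table above.
p_cancer = 0.01                  # prior P(cancer)
p_pos_given_cancer = 0.80        # sensitivity P(positive | cancer)
p_pos_given_healthy = 0.096      # false positive rate P(positive | no cancer)

p_positive = (p_pos_given_cancer * p_cancer
              + p_pos_given_healthy * (1 - p_cancer))   # law of total probability
posterior = p_pos_given_cancer * p_cancer / p_positive
print(f"P(cancer | positive) = {posterior:.3f}")        # ~ 0.078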
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Statistics 2: 12 / 15
Statistics V – Random Variables and Probability Density
Random variable X: maps outcomes to some measurable quantity (typically: a real value).
Example: throw of actual dice on the table (unique event) ↦ sum of pips on the dice (variable that we model)
Variables are often binary (head=1, tail=0), discrete (2...12 pips), or real valued.
Discrete: Probability mass function: pmf_X(x_i) = P(X = x_i)
Continuous (real valued): Probability density function: pdf_X(x) = d/dx cdf_X(x)
Cumulative density function: cdf_X(x) = P(X ≤ x)    (cdf_X(x) = ∫_{−∞}^{x} pdf_X(t) dt)
Notes: The point probability P(X = x) of a continuous variable is usually 0, because there is an infinite number of real
numbers – consider the probability of measuring a temperature of exactly π: P(Temperature = π).
The cdf(x) exists for discrete and continuous variables, but is less commonly used with discrete variables.
The pdf(x) can be larger than 1, and is not a probability (the cdf and pmf are)!
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Statistics 2: 12 / 15
Statistics V – Random Variables and Probability Density
Probability mass function (pmf): this is a discrete distribution.
[Figure: pmf of the sum of two fair dice, x = 2...12.]
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Statistics 2: 12 / 15
Statistics V – Random Variables and Probability Density
Cumulative density function (cdf):
[Figure: cdf of the sum of two fair dice, x = 2...12.]
E. Schubert Advanced Topics in Text Mining 2017-04-17
Prerequisites Statistics 2: 12 / 15
Statistics V – Random Variables and Probability Density
Probability density function (pdf): no y-axis scale is given (the density is not a probability).
1. Train a neural network (a map function, factorize a matrix) to either:
I predict a word, given the preceding and following words (Continuous Bag of Words, CBOW)
I predict the preceding and following words, given a word (Skip-Gram)
2. Configure one layer of the network to have d dimensions (for small d)
Usually: one layer network (not deep), 100 to 1000 dimensions.
3. Map every word to this layer, and use this as feature.
Note: this maps words, not documents!
We can treat the document ID like a word, and map it the same way. [LM14]
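As a hedged illustration (the lecture does not prescribe a specific tool), this is roughly how such an embedding could be trained with the gensim library. The toy corpus and parameter values are made up, and parameter names follow gensim 4.x (older versions use size instead of vector_size).

# Hypothetical toy corpus; in practice this would be a large tokenized text collection.
from gensim.models import Word2Vec

sentences = [["apples", "have", "become", "more", "expensive"],
             ["apple", "released", "a", "new", "computer"],
             ["we", "ate", "fresh", "apples", "and", "pears"]]

model = Word2Vec(sentences,
                 vector_size=50,   # dimensionality d of the hidden layer
                 window=2,         # context size: preceding/following words
                 min_count=1,      # keep even rare words in this tiny example
                 sg=1)             # sg=1: Skip-Gram, sg=0: CBOW

vec = model.wv["apples"]           # the learned word vector (used as the feature)
print(vec.shape, model.wv.most_similar("apples", topn=2))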
E. Schubert Advanced Topics in Text Mining 2017-04-17
Foundations Contextual Word Representations 3: 17 / 21
Word2Vec
Beware of cherry picking
Famous example (with the famous “Google News” model):
Berlin is to Germany as Paris is to ⇒ France
    Berlin − Germany = Paris − France
Beware of cherry picking!
Berlin is to Germany as Washington_D.C. is to ⇒ Spending_Surges
Ottawa is to Canada as Washington_D.C. is to ⇒ Quake_Damage
Germany is to Berlin as United_States is to ⇒ U.S.
Apple is to Microsoft as Volkswagen is to ⇒ VW
man is to king as boy is to ⇒ kings
king is to man as prince is to ⇒ woman
Computed using https://rare-technologies.com/word2vec-tutorial/
E. Schubert Advanced Topics in Text Mining 2017-04-17
Foundations Contextual Word Representations 3: 18 / 21
Word2Vec II
Beware of data bias
Most similar words to Munich:
    Munich_Germany, Dusseldorf, Berlin, Cologne, Puchheim_westward
⇒ Many stock photos with “Puchheim westward of Munich”, used in gas price articles.
Most similar words to Berlin:
    Munich, BBC_Tristana_Moore, Hamburg, Frankfurt, Germany
⇒ Tristana Moore is a key BBC correspondent in Berlin.
References I[Ben+03] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. “A Neural Probabilistic Language Model”. In: J. Machine Learning
Research 3 (2003), pp. 1137–1155.
[GL14] Y. Goldberg and O. Levy. “word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method”.
In: CoRR abs/1402.3722 (2014).
[LB15] D. Lemire and L. Boytsov. “Decoding billions of integers per second through vectorization”. In: Softw., Pract. Exper. 45.1
(2015), pp. 1–29.
[LC14] R. Lebret and R. Collobert. “Word Embeddings through Hellinger PCA”. In: European Chapter of the Association for Computational Linguistics, EACL. 2014, pp. 482–490.
[LG14a] O. Levy and Y. Goldberg. “Linguistic Regularities in Sparse and Explicit Word Representations”. In: Computational Natural Language Learning, CoNLL. 2014, pp. 171–180.
[LG14b] O. Levy and Y. Goldberg. “Neural Word Embedding as Implicit Matrix Factorization”. In: Neural Information Processing Systems, NIPS. 2014, pp. 2177–2185.
[LGD15] O. Levy, Y. Goldberg, and I. Dagan. “Improving Distributional Similarity with Lessons Learned from Word
Embeddings”. In: TACL 3 (2015), pp. 211–225.
[LKK16] D. Lemire, G. S. Y. Kai, and O. Kaser. “Consistently faster and smaller compressed bitmaps with Roaring”. In: Softw., Pract. Exper. 46.11 (2016), pp. 1547–1569.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Foundations References 3: 23 / 21
References II[LM14] Q. V. Le and T. Mikolov. “Distributed Representations of Sentences and Documents”. In: International Conference on
Machine Learning, ICML. 2014, pp. 1188–1196.
[Mik+13] T. Mikolov, K. Chen, G. Corrado, and J. Dean. “Efficient Estimation of Word Representations in Vector Space”. In: CoRR abs/1301.3781 (2013).
[MK13] A. Mnih and K. Kavukcuoglu. “Learning word embeddings efficiently with noise-contrastive estimation”. In: Neural Information Processing Systems, NIPS. 2013, pp. 2265–2273.
[MRS08] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. Cambridge University Press, 2008.
[PSM14] J. Pennington, R. Socher, and C. D. Manning. “GloVe: Global Vectors for Word Representation”. In: Empirical Methods in Natural Language Processing, EMNLP. 2014, pp. 1532–1543.
[RGP06] D. L. Rohde, L. M. Gonnerman, and D. C. Plaut. “An improved model of semantic similarity based on lexical
co-occurrence”. self-published. 2006.
[RHW86] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. “Learning representations by back-propagating errors”. In: Nature 323.6088 (Oct. 1986), pp. 533–536.
[Rob04] S. Robertson. “Understanding inverse document frequency: on theoretical arguments for IDF”. In: Journal of Documentation 60.5 (2004), pp. 503–520.
[RS76] S. E. Robertson and K. Spärck Jones. “Relevance weighting of search terms”. In: JASIS 27.3 (1976), pp. 129–146.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Foundations References 3: 24 / 21
References III[RZ09] S. E. Robertson and H. Zaragoza. “The Probabilistic Relevance Framework: BM25 and Beyond”. In: Foundations and
Trends in Information Retrieval 3.4 (2009), pp. 333–389.
[Sal71] G. Salton. The SMART Retrieval System—Experiments in Automatic Document Processing. Upper Saddle River, NJ, USA:
Prentice-Hall, Inc., 1971.
[SB88] G. Salton and C. Buckley. “Term-Weighting Approaches in Automatic Text Retrieval”. In: Inf. Process. Manage. 24.5
(1988), pp. 513–523.
[Spä72] K. Spärck Jones. “A statistical interpretation of term specificity and its application in retrieval”. In: Journal of Documentation 28.1 (1972), pp. 11–21.
[Spä73] K. Spärck Jones. “Index term weighting”. In: Information Storage and Retrieval 9.11 (1973), pp. 619–633.
[SWR00a] K. Spärck Jones, S. Walker, and S. E. Robertson. “A probabilistic model of information retrieval: development and
comparative experiments - Part 1”. In: Inf. Process. Manage. 36.6 (2000), pp. 779–808.
[SWR00b] K. Spärck Jones, S. Walker, and S. E. Robertson. “A probabilistic model of information retrieval: development and
comparative experiments - Part 2”. In: Inf. Process. Manage. 36.6 (2000), pp. 809–840.
[ZM16] C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. New York, NY, USA: Association for Computing Machinery and Morgan & Claypool, 2016. isbn:
978-1-97000-117-4.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Covariance matrices are symmetric, non-negative on the diagonal, and can be inverted.
(This may need a robust numerical implementation.)
We can decompose this using

    Σ = V Λ V^{−1}    and therefore    Σ^{−1} = V Λ^{−1} V^{−1}

where V contains the eigenvectors and Λ contains the eigenvalues.
⇒ Interpret this decomposition as V ≅ rotation, Λ ≅ squared scaling!
(Recall foundations: PCA and SVD)
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Gaussian Mixture Modeling 4: 37 / 81
Gaussian Mixture Modeling V
Inverse covariance matrix – Σ^{−1} II
Build Ω using ω_i = 1/√λ_i = λ_i^{−1/2}. Then Ω Ω = Λ^{−1}, and Ω^T = Ω.

    Σ^{−1} = V Λ^{−1} V^{−1} = V Ω^T Ω V^T = (Ω V^T)^T Ω V^T

    d²_Mahalanobis = (x − µ)^T Σ^{−1} (x − µ) = (x − µ)^T (Ω V^T)^T Ω V^T (x − µ)
                   = ⟨Ω V^T (x − µ), Ω V^T (x − µ)⟩ = ‖Ω V^T (x − µ)‖²

⇒ Mahalanobis ≈ Euclidean distance after PCA
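A small numpy sketch (illustrative, not from the slides) that checks this identity on toy data: the norm of Ω V^T (x − µ) equals the Mahalanobis distance computed with the inverse covariance matrix.

import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=500)
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)

lam, V = np.linalg.eigh(cov)                     # cov = V diag(lam) V^T
Omega = np.diag(1.0 / np.sqrt(lam))              # Omega @ Omega == inverse of diag(lam)
x = X[0]
d_rotated = np.linalg.norm(Omega @ V.T @ (x - mu))             # ||Omega V^T (x - mu)||
d_direct = np.sqrt((x - mu) @ np.linalg.inv(cov) @ (x - mu))   # classic definition
print(d_rotated, d_direct)                       # both values agree (up to rounding)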
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Gaussian Mixture Modeling 4: 38 / 81
Gaussian Mixture Modeling VII
Soft-assignment changes slower than k-means
Clustering the mouse data set with k = 3:
[Figure series: snapshots of the EM soft assignments over the iterations (condensed animation from the screen version).]
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Gaussian Mixture Modeling 4: 39 / 81
Expectation-Maximization Clustering
Clustering text data
We cannot use Gaussian EM on text:
I Text is not Gaussian distributed.
I Text is discrete and sparse, Gaussians are continuous.
I Covariance matrices have O(d²) entries:
I Memory requirements (text has a very high dimensionality d)
I Data requirements (to reliably estimate the parameters, we need very many data points)
I Matrix inversion is even O(d³)
But the general EM principle can be used with other distributions.
For example: mixture of Bernoulli or multinomial distributions
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Mixture of Bernoulli Distributions 4: 40 / 81
Mixture of Bernoulli Distributions
EM Clustering meets Bernoulli Naïve Bayes [MRS08]
The Bernoulli model uses boolean vectors, indicating the presence of terms.
A cluster i is modeled by a weight α_i and term frequencies q_{i,t}.
Multivariate Bernoulli probability of a document x in cluster i:

    P(x | i, q_i) = (∏_{t∈x} q_{i,t}) · (∏_{t∉x} (1 − q_{i,t}))

Mixture of clusters 1...k with weights α_1...α_k:

    P(x | α, q) = Σ_{i=1}^{k} α_i · (∏_{t∈x} β_{i,t}) · (∏_{t∉x} (1 − β_{i,t}))
Note: when implementing this, we need a Laplacian correction to avoid zero values!
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Mixture of Bernoulli Distributions 4: 41 / 81
Mixture of Bernoulli Distributions
Probabilistic generative model
This equation arises if we assume the data is generated by:
1. Choose a cluster i with probability αi
2. For every token t, include it in the document with probability βi,t
This is a very simple model:
I No word frequency
I No word order
I No word correlations
If we used this to actually generate documents, they would be gibberish.
But naïve Bayes classification uses this, too, and it often works!
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Mixture of Bernoulli Distributions 4: 42 / 81
Mixture of Bernoulli Distributions
Expectation-Maximization algorithm
To learn the weights α, β we can employ EM:
1. Choose α_i = 1/k, and choose k documents as initial β_i (similar to k-means).
2. Expectation step:

    P(x ∈ C_i | α, β) = α_i (∏_{t∈x} β_{i,t}) (∏_{t∉x} (1 − β_{i,t}))
                        / Σ_{j=1}^{k} α_j (∏_{t∈x} β_{j,t}) (∏_{t∉x} (1 − β_{j,t}))

3. Maximization step:

    β_{i,t} = Σ_x P(x ∈ C_i | α, β) · 1(t ∈ x) / Σ_x P(x ∈ C_i | α, β)
              (probability of a document in cluster i containing token t; 1(c) = 1 if c is true, else 0)
    α_i = Σ_x P(x ∈ C_i | α, β) / N
              (what share of all documents is in the cluster?)

4. Repeat (2.)–(3.) until change < ε.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Mixture of Bernoulli Distributions 4: 42 / 81
Mixture of Bernoulli Distributions
Expectation-Maximization algorithm
To learn the weights α, β we can employ EM:
1. Choose α_i = 1/k, and choose k documents as initial β_i (similar to k-means).
2. Expectation step:

    P(x ∈ C_i | α, β) ∝ α_i (∏_{t∈x} β_{i,t}) (∏_{t∉x} (1 − β_{i,t}))

3. Maximization step:

    β_{i,t} = Σ_x P(x ∈ C_i | α, β) · 1(t ∈ x) / Σ_x P(x ∈ C_i | α, β)
              (probability of a document in cluster i containing token t; 1(c) = 1 if c is true, else 0)
    α_i ∝ Σ_x P(x ∈ C_i | α, β)
              (what share of all documents is in the cluster?)

4. Repeat (2.)–(3.) until change < ε.
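A compact numpy sketch of these two steps (an illustration, not the reference implementation from the tutorials): it assumes a binary document-term matrix X of shape (documents × terms), works in log space for numerical stability, and uses an illustrative Laplacian correction in the M step as noted above.

import numpy as np

def bernoulli_mixture_em(X, k, iters=50, eps=1e-9, seed=0):
    """EM for a mixture of multivariate Bernoulli distributions.
    X: binary document-term matrix of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.full(k, 1.0 / k)                                  # uniform initial cluster weights
    beta = X[rng.choice(n, size=k, replace=False)] * 0.8 + 0.1   # init from k documents, smoothed away from 0/1
    for _ in range(iters):
        # Expectation step: responsibilities P(x in C_i | alpha, beta), computed in log space
        log_p = (X @ np.log(beta.T + eps)
                 + (1 - X) @ np.log(1 - beta.T + eps)
                 + np.log(alpha + eps))                          # shape (n, k)
        log_p -= log_p.max(axis=1, keepdims=True)                # avoid underflow before exponentiation
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # Maximization step, with a Laplacian correction to avoid zero probabilities
        beta = (resp.T @ X + 1.0) / (resp.sum(axis=0)[:, None] + 2.0)
        alpha = resp.sum(axis=0) / n
    return alpha, beta, resp

# Tiny binary toy corpus: 4 documents, 5 terms.
X = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 1, 1]], dtype=float)
alpha, beta, resp = bernoulli_mixture_em(X, k=2)
print(resp.round(2))   # soft cluster assignments of the 4 documents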
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Mixture of Bernoulli Distributions 4: 43 / 81
Mixture of Bernoulli Distributions
From Bernoulli to multinomial
The Bernoulli model is simple, but it ignores quantitative information completely.
The closest distribution that uses quantity is the multinomial distribution.
Again, this is similar to multinomial naïve Bayes classification.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Mixture of Multinomial Distributions 4: 44 / 81
Mixture of Multinomial Distributions
Clustering text data
The multinomial mixture model:

    P(x | α, β) := Σ_{j=1}^{k} α_j · ( len(x)! / ∏_{t=1}^{d} tf_{t,x}! ) · ∏_{t=1}^{d} β_{t,j}^{tf_{t,x}}

Σ_{j=1}^{k}: sum of multinomial distributions (“clusters”, or “topics”, index j = 1...k)
α_j: relative size of topic j (α is a vector of length k)
len!/∏tf!: number of permutations of the document with the same word vector
∏β^tf: probability of seeing this word vector in topic j (β is a k × d matrix)
β_{t,j}: frequency of word t in topic j
tf_{t,x}: number of times the term occurred in the document (document-term matrix)
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Mixture of Multinomial Distributions 4: 45 / 81
Mixture of Multinomial Distributions
Probabilistic generative model
The model assumes our data set was generated by a process like this:
1. Sample a topic distribution α
2. For every topic t, sample a word distribution β_t
3. For every document d
   3.1 Sample a topic t from α
   3.2 Sample l words from the word distribution β_t
Note: this is not what we do, but our underlying assumptions.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Mixture of Multinomial Distributions 4: 46 / 81
Mixture of Multinomial Distributions
Probabilistic generative model II
We want to apply Bayes’ rule:

    P(α, β | X) ∝ P(X | α, β) P(α) P(β)

because α and β are independent, and P(X) can be treated as constant.

    P(α, β | X) ∝ ( ∏_{x∈X} Σ_{j=1}^{k} α_j ∏_{t=1}^{d} β_{t,j}^{tf_{t,x}} ) · ( ∏_{j=1}^{k} α_j^{λ_α−1} ) · ( ∏_{j=1}^{k} ∏_{t=1}^{d} β_{t,j}^{λ_β−1} )

by disregarding everything that does not depend on α, β.
Unfortunately, maximizing this directly is in general intractable.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Mixture of Multinomial Distributions 4: 47 / 81
Mixture of Multinomial Distributions
Back to Expectation Maximization
Expectation step:

    P(x ∈ j | α, β) ∝ α_j ∏_{t=1}^{d} β_{t,j}^{tf_{t,x}}

by removing shared terms independent of j.
Maximization step:

    α_j ∝ λ_α − 1 + Σ_{x∈X} P(x ∈ j | α, β)
    β_{t,j} ∝ λ_β − 1 + Σ_{x∈X} tf_{t,x} · P(x ∈ j | α, β)
This is called PLSA/PLSI [Hof99], and will be discussed in more detail in the next chapter.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Mixture of Multinomial Distributions 4: 48 / 81
Mixture of Multinomial Distributions
Results with Expectation Maximization
Results reported with this approach are mixed:
I Sensitive to initialization [MRS08; MS01]
because there are many local optima. E.g., use k-means result as starting point.
I Sensitive to rare words
I Converges fast to binary assignments, usually
(not necessarily good, tends to get stuck in local optima because of this)
Many more algorithms use this general EM optimization procedure!
E.g., for clustering web site navigation patterns (clickstreams) [YH02; Cad+03; JZM04]
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Biclustering 4: 49 / 81
Biclustering & Subspace ClusteringClustering attributes and variablesPopular in gene expression analysis.
I Every row is a gene
I Every column is a sample
or transposed
I Only a few genes are relevant
I No semantic ordering of rows or columns
I Some samples may be contaminated
I Numerical value may be unreliable, only “high” or “low”
⇒ Key idea of biclustering [CC00]:
Find a subset of rows and columns (a submatrix, after permutation),
such that all values are high/low or exhibit some pattern.
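For a quick hands-on impression, scikit-learn ships a spectral co-clustering algorithm that recovers planted constant-type biclusters on synthetic data; this is one specific bicluster model, not the general [CC00] algorithm, and the data below is made up.

from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# Generate a 60x40 matrix with 3 planted (constant-type) biclusters; rows/columns are shuffled.
data, rows, cols = make_biclusters(shape=(60, 40), n_clusters=3, noise=2, random_state=0)

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(data)
row_idx, col_idx = model.get_indices(0)     # row and column indices of the first bicluster
print(len(row_idx), "rows and", len(col_idx), "columns in bicluster 0")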
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Biclustering 4: 50 / 81
Biclustering
Bicluster patterns [CC00]
Some examples of bicluster patterns:
[Figure: example submatrices for the pattern types constant, constant rows, constant columns, additive, and multiplicative.]
Clusters may overlap in rows and columns!
Patterns will never be this ideal, but noisy!
Many algorithms focus on the constant pattern type only, as there are O(2^{N·d}) possibilities.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Biclustering 4: 51 / 81
Subspace Clustering
Density-based clusters in subspaces
Subspace clusters may be visible in one projection, but not in another:
[Figure: scatter plots of Column 0 vs. Column 1, Column 0 vs. Column 2, and Column 2 vs. Column 3.]
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Biclustering 4: 51 / 81
Subspace Clustering
Density-based clusters in subspaces
Subspace clusters may be visible in one projection, but not in another:
[Figure: 1-dimensional density plots of the individual projections.]
Popular key idea:
I Find dense areas in 1-dimensional projections
I Combine subspaces as long as the cluster remains dense
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Evaluation 4: 78 / 81
Supervised Cluster Evaluation
Other measures and variants
Some further evaluation measures:
I Adjustment for chance is a general principle:
    Adjusted Index = (Index − E[Index]) / (Optimal Index − E[Index])
For example Adjusted Rand Index [HA85] or Adjusted Mutual Information [VEB10]
I B-Cubed evaluation [BB98]
I Set matching purity [ZK01] and F1 [SKK00]
I Edit distance [PL02]
I Visual comparison of multiple clusterings [Ach+12]
I Gini-based evaluation [Sch+15]
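Several of these chance-adjusted measures are available in scikit-learn; a minimal usage sketch (the labels below are made up, and the lecture tutorials may have used different tooling):

from sklearn.metrics import (adjusted_rand_score,
                             adjusted_mutual_info_score,
                             normalized_mutual_info_score)

labels_true = [0, 0, 0, 1, 1, 1, 2, 2]      # ground-truth classes
labels_pred = [0, 0, 1, 1, 1, 1, 2, 2]      # clustering result

print("ARI:", adjusted_rand_score(labels_true, labels_pred))
print("AMI:", adjusted_mutual_info_score(labels_true, labels_pred))
print("NMI:", normalized_mutual_info_score(labels_true, labels_pred))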
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Evaluation 4: 79 / 81
Supervised Cluster Evaluation
Examples: Mouse data
Revisiting k-means on the mouse data:
[Figure: Jaccard, ARI, and NMI (joint) over the number of clusters k; curves for Min, Mean, Max, and Best SSQ.]
On this toy data set, unsupervised methods predicted k = 3.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Evaluation 4: 80 / 81
Supervised Cluster Evaluation
Examples: Tutorial’s Wikipedia data set
Revisiting k-means on the Wikipedia data set (cf. tutorials):
[Figure: Jaccard, ARI, and NMI (joint) over the number of clusters k; curves for Min, Mean, Max, and Best SSQ.]
The best k = 5 matches the true number of clusters in this data set.
Unsupervised measures would have preferred k = 2.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering Evaluation 4: 81 / 81
Text Clustering
Conclusions
Clustering text data is hard because:
I Preprocessing (TF-IDF etc.) has a major impact
(but we do not know which preprocessing is “correct” or “best”)
I Text is high-dimensional, and our intuitions of “distance” and “density” do not work well
I Text is sparse, and many clustering methods assume dense, Gaussian data.
I Text is noisy, and many documents may not be part of a cluster at all.
I Some cases can be handled with biclustering or frequent itemset mining.
I The (proper) evaluation of clustering is very difficult: [JD88]
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art accessible only to
those true believers who have experience and great courage.”
⇒ We need methods designed for text: topic modeling
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering References 4: 82 / 81
References I[Ach+12] E. Achtert, S. Goldhofer, H. Kriegel, E. Schubert, and A. Zimek. “Evaluation of Clusterings - Metrics and Visual
Support”. In: IEEE 28th International Conference on Data Engineering (ICDE 2012). 2012, pp. 1285–1288.
[Agg+99] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. “Fast Algorithms for Projected Clustering”. In: Proc.ACM SIGMOD International Conference on Management of Data. 1999, pp. 61–72.
[Agr+98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. “Automatic Subspace Clustering of High Dimensional Data for
Data Mining Applications”. In: Proc. ACM SIGMOD International Conference on Management of Data. 1998, pp. 94–105.
[Aka77] H. Akaike. “On entropy maximization principle”. In: Applications of Statistics. 1977, pp. 27–41.
[And73] M. R. Anderberg. “Cluster analysis for applications”. In: Probability and mathematical statistics. Academic Press, 1973.
Chap. Hierarchical Clustering Methods, pp. 131–155. isbn: 0120576503.
[AS94] R. Agrawal and R. Srikant. “Fast Algorithms for Mining Association Rules in Large Databases”. In: Proc. VLDB. 1994,
pp. 487–499.
[AV06] D. Arthur and S. Vassilvitskii. “How slow is the k-means method?” In: Symposium on Computational Geometry. 2006,
pp. 144–153.
[AV07] D. Arthur and S. Vassilvitskii. “k-means++: the advantages of careful seeding”. In: ACM-SIAM Symposium on DiscreteAlgorithms (SODA). 2007, pp. 1027–1035.
[Ban+05] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. “Clustering with Bregman Divergences”. In: J. Machine LearningResearch 6 (2005), pp. 1705–1749.
E. Schubert Advanced Topics in Text Mining 2017-04-17
References II[BB98] A. Bagga and B. Baldwin. “Entity-Based Cross-Document Coreferencing Using the Vector Space Model”. In:
COLING-ACL. 1998, pp. 79–85.
[Bey+99] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. “When Is “Nearest Neighbor” Meaningful?” In: International Conference on Database Theory (ICDT). 1999, pp. 217–235.
[BH75] F. B. Baker and L. J. Hubert. “Measuring the Power of Hierarchical Cluster Analysis”. In: Journal American StatisticalAssociation 70.349 (1975), pp. 31–38.
[Boc07] H. Bock. “Clustering Methods: A History of k-Means Algorithms”. In: Selected Contributions in Data Analysis andClassification. Ed. by P. Brito, G. Cucumel, P. Bertrand, and F. Carvalho. Springer, 2007, pp. 161–172.
[Cad+03] I. V. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White. “Model-Based Clustering and Visualization of Navigation
Patterns on a Web Site”. In: Data Min. Knowl. Discov. 7.4 (2003), pp. 399–424.
[CC00] Y. Cheng and G. M. Church. “Biclustering of Expression Data”. In: ISMB. 2000, pp. 93–103.
[CH74] T. Caliński and J. Harabasz. “A dendrite method for cluster analysis”. In: Communications in Statistics 3.1 (1974),
pp. 1–27.
[DB79] D. Davies and D. Bouldin. “A cluster separation measure”. In: Pattern Analysis and Machine Intelligence 1 (1979),
pp. 224–227.
[DD09] M. M. Deza and E. Deza. Encyclopedia of Distances. 3rd. Springer, 2009. isbn: 9783662443415.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering References 4: 84 / 81
References III
[Def77] D. Defays. “An Efficient Algorithm for the Complete Link Cluster Method”. In: The Computer Journal 20.4 (1977),
pp. 364–366.
[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. “Maximum Likelihood from Incomplete Data via the EM algorithm”. In:
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 39.1 (1977), pp. 1–31.
[DM01] I. S. Dhillon and D. S. Modha. “Concept Decompositions for Large Sparse Text Data Using Clustering”. In: MachineLearning 42.1/2 (2001), pp. 143–175.
[Dun73] J. C. Dunn. “A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters”. In:
Journal of Cybernetics 3.3 (1973), pp. 32–57.
[Dun74] J. C. Dunn. “Well separated clusters and optimal fuzzy partitions”. In: Journal of Cybernetics 4 (1974), pp. 95–104.
[FM83] E. B. Fowlkes and C. L. Mallows. “A Method for Comparing Two Hierarchical Clusterings”. In: Journal AmericanStatistical Association 78.383 (1983), pp. 553–569.
[For65] E. W. Forgy. “Cluster analysis of multivariate data: efficiency versus interpretability of classifications”. In: Biometrics 21
(1965), pp. 768–769.
[HA85] L. Hubert and P. Arabie. “Comparing partitions”. In: Journal of Classification 2.1 (1985), pp. 193–218.
[Har75] J. A. Hartigan. Clustering Algorithms. New York, London, Sydney, Toronto: John Wiley&Sons, 1975.
[HL76] L. J. Hubert and J. R. Levin. “A general statistical framework for assessing categorical clustering in free recall.” In:
Psychological Bulletin 83.6 (1976), pp. 1072–1080.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering References 4: 85 / 81
References IV[Hof99] T. Hofmann. “Probabilistic Latent Semantic Indexing”. In: ACM SIGIR. 1999, pp. 50–57.
[Hou+10] M. E. Houle, H. Kriegel, P. Kröger, E. Schubert, and A. Zimek. “Can Shared-Neighbor Distances Defeat the Curse of
Dimensionality?” In: Int. Conf. on Scientific and Statistical Database Management (SSDBM). 2010, pp. 482–500.
[HPY00] J. Han, J. Pei, and Y. Yin. “Mining Frequent Patterns without Candidate Generation”. In: Proc. SIGMOD. 2000, pp. 1–12.
[HW79] J. A. Hartigan and M. A. Wong. “Algorithm AS 136: A k-means clustering algorithm”. In: Journal of the Royal Statistical Society: Series C (Applied Statistics) (1979), pp. 100–108.
[JD88] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Englewood Cliffs: Prentice Hall, 1988.
[JZM04] X. Jin, Y. Zhou, and B. Mobasher. “Web usage mining based on probabilistic latent semantic analysis”. In: ACM SIGKDD. 2004, pp. 197–205.
[KKK04] P. Kröger, H. Kriegel, and K. Kailing. “Density-Connected Subspace Clustering for High-Dimensional Data”. In: Proc. of the Fourth SIAM International Conference on Data Mining. 2004, pp. 246–256.
[KR90] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley&Sons, 1990.
isbn: 9780471878766.
[KSZ16] H. Kriegel, E. Schubert, and A. Zimek. “The (black) art of runtime evaluation: Are we comparing algorithms or
implementations?” In: Knowledge and Information Systems (KAIS) (2016), pp. 1–38.
[Llo82] S. P. Lloyd. “Least squares quantization in PCM”. In: IEEE Transactions on Information Theory 28.2 (1982), pp. 129–136.
E. Schubert Advanced Topics in Text Mining 2017-04-17
References V[LW67] G. N. Lance and W. T. Williams. “A General Theory of Classificatory Sorting Strategies. 1. Hierarchical Systems”. In:
The Computer Journal 9.4 (1967), pp. 373–380.
[Mac67] J. MacQueen. “Some Methods for Classification and Analysis of Multivariate Observations”. In: 5th Berkeley Symposium on Mathematics, Statistics, and Probability. Vol. 1. 1967, pp. 281–297.
[Mei03] M. Meila. “Comparing Clusterings by the Variation of Information”. In: Computational Learning Theory (COLT). 2003,
pp. 173–187.
[Mei05] M. Meila. “Comparing Clusterings – An Axiomatic View”. In: Int. Conf. Machine Learning (ICML). 2005, pp. 577–584.
[Mei12] M. Meilă. “Local equivalences of distances between clusterings–a geometric perspective”. In: Machine Learning 86.3
(2012), pp. 369–389.
[Mou+14] D. Moulavi, P. A. Jaskowiak, R. J. G. B. Campello, A. Zimek, and J. Sander. “Density-based Clustering Validation”. In:
SIAM SDM. 2014, pp. 839–847.
[MRS08] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. Cambridge University Press, 2008.
[MS01] C. D. Manning and H. Schütze. Foundations of statistical natural language processing. MIT Press, 2001. isbn:
978-0-262-13360-9.
[PBM04] M. K. Pakhira, S. Bandyopadhyay, and U. Maulik. “Validity index for crisp and fuzzy clusters”. In: Pattern Recognition 37.3 (2004), pp. 487–501.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering References 4: 87 / 81
References VI[Phi02] S. J. Phillips. “Acceleration of K-Means and Related Clustering Algorithms”. In: ALENEX’02. 2002, pp. 166–177.
[PL02] P. Pantel and D. Lin. “Document clustering with committees”. In: ACM SIGIR. 2002, pp. 199–206.
[PM00] D. Pelleg and A. Moore. “X-means: Extending k-means with efficient estimation of the number of clusters”. In:
Proceedings of the 17th International Conference on Machine Learning (ICML). Vol. 1. 2000, pp. 727–734.
[Ran71] W. M. Rand. “Objective criteria for the evaluation of clustering methods”. In: Journal American Statistical Association66.336 (1971), pp. 846–850.
[Rou87] P. J. Rousseeuw. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis”. In: Journal of computational and applied mathematics 20 (1987), pp. 53–65.
[Sch+15] E. Schubert, A. Koos, T. Emrich, A. Züfle, K. A. Schmid, and A. Zimek. “A Framework for Clustering Uncertain Data”. In:
Proceedings of the VLDB Endowment 8.12 (2015), pp. 1976–1979.
[Sch78] G. Schwarz. “Estimating the dimension of a model”. In: The Annals of Statistics 6.2 (1978), pp. 461–464.
[Sib73] R. Sibson. “SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method”. In: The Computer Journal 16.1
(1973), pp. 30–34.
[SKK00] M. Steinbach, G. Karypis, and V. Kumar. “A comparison of document clustering techniques”. In: KDD workshop on text mining. Vol. 400. 2000, pp. 525–526.
[Sne57] P. H. A. Sneath. “The Application of Computers to Taxonomy”. In: Journal of General Microbiology 17 (1957),
pp. 201–226.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Text Clustering References 4: 88 / 81
References VII[Ste56] H. Steinhaus. “Sur la division des corp materiels en parties”. In: Bull. Acad. Polon. Sci 1 (1956), pp. 801–804.
[VEB10] N. X. Vinh, J. Epps, and J. Bailey. “Information Theoretic Measures for Clustering Comparison: Variants, Properties,
Normalization and Correction for Chance”. In: J. Machine Learning Research 11 (2010), pp. 2837–2854.
[YH02] A. Ypma and T. Heskes. “Automatic Categorization of Web Pages and User Clustering with Mixtures of Hidden Markov
Models”. In: Workshop WEBKDD. 2002, pp. 35–49.
[ZK01] Y. Zhao and G. Karypis. Criterion Functions for Document Clustering: Experiments and Analysis. Tech. rep. 01-40.
University of Minnesota, Department of Computer Science, 2001.
[ZSK12] A. Zimek, E. Schubert, and H. Kriegel. “A Survey on Unsupervised Outlier Detection in High-Dimensional Numerical
Data”. In: Statistical Analysis and Data Mining 5.5 (2012), pp. 363–387.
[ZXF08] Q. Zhao, M. Xu, and P. Fränti. “Knee Point Detection on Bayesian Information Criterion”. In: IEEE International Conference on Tools with Artificial Intelligence (ICTAI). 2008, pp. 431–438.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling
Motivation
Find the latent structure in a text corpus that:
I resembles “topics” (also “concepts”)
I best summarizes the collection
I is based on statistical patterns
I is obscured by synonyms, homonyms, stopwords, . . .
I may overlap
Similar to clustering, but with a slightly different “mindset”:
I In clustering, the emphasis is on the data points / documents
I In topic modeling, the emphasis is on the topics / clusters themselves
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 2 / 25
Literature
General introduction to LDA:
D. M. Blei. “Probabilistic topic models”. In: Commun. ACM 55.4 (2012), pp. 77–84
Lecture by David Blei:
http://videolectures.net/mlss09uk_blei_tm/
Probabilistic graphical modeling textbook:
D. Koller and N. Friedman. Probabilistic Graphical Models - Principles and Techniques. MIT Press,
2009. isbn: 978-0-262-01319-2
Topic modeling chapter (17) of this textbook:
C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. New York, NY, USA: Association for Computing Machinery
and Morgan & Claypool, 2016. isbn: 978-1-97000-117-4
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 3 / 25
LSI/LSA: Topics via Matrix Factorization
Latent Semantic Indexing (LSI) [Fur+88; Dee+90] was developed to improve information retrieval.
Also called Latent Semantic Analysis (LSA).
In information retrieval, synonymy and polysemy are a challenge:
I exact search will not find synonyms
I exact search will include polysemes and homonyms
Idea: identify “factors” that can contain multiple words, or parts of a word
Factors are a lower-dimensional representation of the document.
Factor analysis of the document-term matrix:
I similarity of words based on the documents they cooccur in
I similarity of documents based on the words they contain
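A minimal sketch of this factor analysis with scikit-learn (an added illustration; LSI is essentially a truncated SVD of the TF-IDF weighted document-term matrix, and the toy documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "a cat and a dog played",
        "stock markets fell sharply",
        "investors sold shares as markets dropped"]

tfidf = TfidfVectorizer().fit_transform(docs)         # document-term matrix
lsi = TruncatedSVD(n_components=2, random_state=0)    # keep 2 latent factors
doc_factors = lsi.fit_transform(tfidf)                # documents in factor space
print(doc_factors.round(2))                           # animal docs vs. finance docs separate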
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 16 / 25
Latent Dirichlet Allocation
Likelihood of LDA
LDA adds the Dirichlet prior to model the likeliness of word and topic distributions.

    log P(d) = ∫ Σ_w log [ Σ_t θ_{d,t} · ϕ_{t,w} ] · P(ϕ_t | α) dϕ_t

where log P(d) is the log-likelihood of document d, θ_{d,t} is the weight of topic t in document d,
ϕ_{t,w} is the probability of word w in topic t, and P(ϕ_t | α) is the likelihood of the word distribution.

    log P(D) = ∫ Σ_d Σ_w log [ Σ_t θ_{d,t} ϕ_{t,w} ] · ∏_t P(ϕ_t | α) dϕ_1 · · · dϕ_k

A model is better if the word distributions match our Dirichlet prior better!
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 17 / 25
Latent Dirichlet Allocation
Computation of LDA
We do not generate random documents, but we need to compute the likelihood of a document,
and optimize the (hyper-)parameters to best explain the documents.
We cannot solve this exactly, so we need to approximate it.
I Variational inference [BNJ01; BNJ03]
I Gibbs sampling [PSD00; Gri02]
I Expectation propagation [ML02]
I Collapsed Gibbs sampling [GS04]
I Collapsed variational inference [TNW06]
I Sparse collapsed Gibbs sampling [YMM09]
I Metropolis-Hastings-Walker sampling [Li+14]
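In practice one rarely implements these samplers from scratch. As a hedged example, the gensim library provides an LDA implementation (based on online variational inference, not Gibbs sampling) that can be used like this on a made-up toy corpus:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["apple", "fruit", "juice"], ["apple", "computer", "software"],
         ["fruit", "juice", "orange"], ["software", "computer", "code"]]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]        # bag-of-words counts per document

lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               alpha="auto", passes=20, random_state=0)  # variational inference under the hood
print(lda.print_topics(num_words=3))
print(lda.get_document_topics(corpus[0]))              # topic weights theta_d for document 0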
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 18 / 25
Gibbs Sampling
Monte-Carlo Methods
We need to estimate complex functions that we cannot handle analytically.
Estimates of a function f(x) usually look like this:

    E[f(x)] = Σ_y f(y) p(y) ≃ ∫ f(y) p(y) dy

where p(y) is the likelihood of the input parameters x being x = y.
Monte-Carlo methods estimate from a sample set Y = {y^(i)}:

    E[f(x)] ≈ (1/|Y|) Σ_i f(y^(i))

(no p(y) appears here, because the y^(i) are observed with probability p(y^(i))).
Important: we require the y^(i) to occur with probability p(y^(i)).
Example: Estimate π/4 by choosing points in the unit square uniformly,
and testing if they are within the unit circle (here, uniform sampling is okay).
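The π/4 example as a short numpy sketch (added for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
points = rng.uniform(0.0, 1.0, size=(n, 2))        # uniform samples in the unit square
inside = (points ** 2).sum(axis=1) <= 1.0          # indicator: point lies inside the unit circle
print("pi/4 ~", inside.mean(), " pi ~", 4 * inside.mean())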
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 19 / 25
Gibbs Sampling
Markov Chains
Monte Carlo is simple, but how do we get such y^(i) according to their probabilities p(y^(i))?
In a Markov process, the new state y^(t+1) only depends on the previous state y^(t):

    P(y^(t+1) | y^(1), ..., y^(t)) = P(y^(t+1) | y^(t))

We need to design a transition function g such that y^(t+1) = g(y^(t)) and p(y^(t+1)) is as desired.

    y^(0) →g y^(1) →g y^(2) →g · · · →g y^(t) →g y^(t+1) →g · · ·

The first B samples (“burn-in”) are often ignored; the later samples occur with p(y^(i)).
For g, we can use, e.g., Gibbs sampling. We then can estimate our hidden variables!
Because of autocorrelation, it is common to use only every Lth sample.
(We require P above to be ergodic, but omit details in this lecture.)
A really nice introduction to Markov-Chain-Monte-Carlo (MCMC) and Gibbs sampling can be found in [RH10].
A more formal introduction is in the textbook [Bis07].
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 20 / 25
Gibbs Sampling
Updating variables incrementally
Assume that our state y^(t) is a vector with k > 1 components;
we can update one variable at a time, for i = 1...k:

    y_i^(t+1) ~ P(Y_i | y_1^(t+1), ..., y_{i−1}^(t+1), y_{i+1}^(t), ..., y_k^(t))
                     (already updated)        (not yet updated; y_i itself is omitted)

Our function g then is to do this for each i = 1...k.
Informally: in every iteration (t → t+1), for every variable i, we choose a new value y_i^(t+1)
randomly, but we prefer values that are more likely given the current state of the other variables.
More likely values of y will be returned more often (and with the desired likelihood p(y)).
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 21 / 25
Gibbs Sampling
Benefits and details
P(Y_i | y_1, ...) may not depend on all y_j, but only on the “Markov blanket”.
(Markov blanket: parents, children, and other parents of the node’s children in the diagram.)
Sometimes we can also “integrate out” some y_j to further simplify P.
This is also called a “collapsed” Gibbs sampler.
If we have a conjugate prior (e.g., Beta for Bernoulli, Dirichlet for Multinomial),
then the posterior is of the same family (but with different parameters),
which usually yields much simpler equations.
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 22 / 25
Gibbs Sampling
Collapsed Gibbs sampler [GS04]
We need to draw the topic of each word according to:

    P(z_i = t | w, d, ...) ∝ ∏_k P(ϕ_k | β) · ∏_d P(θ_d | α) · ∏_w P(z_dw | θ_d) P(w_dw | ϕ_{z_dw})
                              [P(topic)]       [P(document)]    [P(word given topic)]

After integrating out ϕ and θ, we get the word-topic probability:

    P(z_i = t | w, d, ...) ∝ ∏_t [ Γ(n_{td} + α_t) · Γ(n_{tw} + β_w) / Γ(Σ_{w'} n_{tw'} + β_{w'}) ]

Computing this directly is very expensive. By exploiting properties of the Γ function, we can simplify this to:

    P(z_i = t | w, d, ...) ∝ (n_{td}^{−di} + α_t) · (n_{tw}^{−di} + β_w) / (n_t^{−di} + Σ_{w'} β_{w'})

where n_{td}^{−di}, n_{tw}^{−di}, and n_t^{−di} are the number of occurrences of a topic-document assignment,
topic-word assignment, or topic, ignoring the current word w_di and its topic assignment z_di.
A detailed derivation can be found in Appendix D of [Cha11].
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 23 / 25
Latent Dirichlet Allocation
Inference with Gibbs Sampling
Putting everything together:
1. Initialization:
   1.1 Choose prior parameters α and β.
   1.2 For every document and word, choose z_di randomly.
       (We try putting words in different topics randomly.)
   1.3 Initialize n_td, n_tw, and n_t.
2. For every Markov-Chain iteration j = 1...I:
   2.1 For every document d and word w_di:
       2.1.1 Remove the old z_di, w_di from n_td, n_tw, and n_t.
       2.1.2 Sample a new random topic z_di.
       2.1.3 Update n_td, n_tw, and n_t with the new z_di, re-add w_di.
   2.2 If j ≥ B (burn-in), and only for every Lth sample (decorrelation):
       2.2.1 Monte-Carlo update all θ_d, ϕ_k from z_di.
           (The more often we see a topic, the more relevant it is.)
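A minimal Python sketch of this procedure (an illustration only: symmetric priors, no burn-in handling, and it returns the raw counters rather than Monte-Carlo averaged θ and ϕ; the toy corpus is made up):

import numpy as np

def lda_collapsed_gibbs(docs, n_topics, n_words, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of word ids. Returns the count matrices."""
    rng = np.random.default_rng(seed)
    n_td = np.zeros((len(docs), n_topics))             # topic counts per document
    n_tw = np.zeros((n_topics, n_words))               # word counts per topic
    n_t = np.zeros(n_topics)                           # total word count per topic
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]   # 1.2: random topic per word
    for d, doc in enumerate(docs):                     # 1.3: initialize the counters
        for i, w in enumerate(doc):
            t = z[d][i]
            n_td[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    for _ in range(iters):                             # 2: Markov chain iterations
        for d, doc in enumerate(docs):                 # 2.1: every document and word
            for i, w in enumerate(doc):
                t = z[d][i]                            # 2.1.1: remove the old assignment
                n_td[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
                p = (n_td[d] + alpha) * (n_tw[:, w] + beta) / (n_t + n_words * beta)
                t = rng.choice(n_topics, p=p / p.sum())  # 2.1.2: sample a new topic
                z[d][i] = t                            # 2.1.3: re-add with the new topic
                n_td[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    return n_td, n_tw

# Tiny toy corpus of word ids (vocabulary size 4):
docs = [[0, 0, 1], [0, 1, 1], [2, 3, 3], [2, 2, 3]]
n_td, n_tw = lda_collapsed_gibbs(docs, n_topics=2, n_words=4)
print(n_tw)   # words {0,1} and {2,3} should concentrate in different topics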
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 24 / 25
Probabilistic Topic Modeling
More complicated models
Many variations have been proposed.
I We can vary the prior assumptions (to draw θ, ϕ).
E.g. Rethinking LDA: Why priors matter [WMM09]
But conjugate priors like Dirichlet-Multinomial are easier to compute.
I Also learn the number of topics, α, and β (may require labeled data).
I Hierarchical Dirichlet Processes [Teh+06]
I Pitman-Yor and Poisson Dirichlet Processes [PY97; SN10]
I Correlated Topic Models [BL05]
I Application to other domains (instead of text).
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling Introduction 5: 25 / 25
Evaluation of Topic Models
“Reading Tea Leaves” [Cha+09; LNB14]
Topic model evaluation is difficult:
“There is a disconnect between how topic models are evaluated and why we expect topic
models to be useful.” – David Blei [Ble12]
I Often evaluated with a secondary task (e.g., classification, IR) [Wal+09]
I By the ability to explain held out documents with existing clusters [Wal+09]
(A document is “well explained” if it has a high probability in the model)
I Manual inspection of the most important words in each topic
I Word intrusion task [Cha+09]
(Can a user identify a word that was artificially injected into the most important words?)
I Topic intrusion task [Cha+09]
(Can the user identify a topic that doesn’t apply to a test document?)
E. Schubert Advanced Topics in Text Mining 2017-04-17
Topic Modeling References 5: 26 / 25
References I
[Bis07] C. M. Bishop. Pattern recognition and machine learning, 5th Edition. Information science and statistics. Springer, 2007.
isbn: 9780387310732.
[BL05] D. M. Blei and J. D. Lafferty. “Correlated Topic Models”. In: Neural Information Processing Systems, NIPS. 2005,
pp. 147–154.
[Ble12] D. M. Blei. “Probabilistic topic models”. In: Commun. ACM 55.4 (2012), pp. 77–84.
[BNJ01] D. M. Blei, A. Y. Ng, and M. I. Jordan. “Latent Dirichlet Allocation”. In: Neural Information Processing Systems, NIPS.
2001, pp. 601–608.
[BNJ03] D. M. Blei, A. Y. Ng, and M. I. Jordan. “Latent Dirichlet Allocation”. In: J. Machine Learning Research 3 (2003),
pp. 993–1022.
[Cha+09] J. Chang, J. L. Boyd-Graber, S. Gerrish, C. Wang, and D. M. Blei. “Reading Tea Leaves: How Humans Interpret Topic
Models”. In: Neural Information Processing Systems, NIPS. 2009, pp. 288–296.
[Cha11] J. Chang. “Uncovering, Understanding, and Predicting Links”. PhD thesis. Princeton University, 2011.
[Dee+90] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. “Indexing by Latent Semantic
Analysis”. In: JASIS 41.6 (1990), pp. 391–407.
[FKV04] A. M. Frieze, R. Kannan, and S. Vempala. “Fast monte-carlo algorithms for finding low-rank approximations”. In: J. ACM 51.6 (2004), pp. 1025–1041.
References II
[Fur+88] G. W. Furnas, S. C. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum.
“Information Retrieval using a Singular Value Decomposition Model of Latent Semantic Structure”. In: ACM SIGIR.
1988, pp. 465–480.
[Gri02] T. L. Griffiths. Gibbs sampling in the generative model of latent dirichlet allocation. Tech. rep. Stanford University, 2002.
[GS04] T. L. Griffiths and M. Steyvers. “Finding scientific topics”. In: Proceedings of the National Academy of Sciences 101.suppl
1 (2004), pp. 5228–5235.
[Hof99a] T. Hofmann. “Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and
Categorization”. In: Neural Information Processing Systems, NIPS. 1999, pp. 914–920.
[Hof99b] T. Hofmann. “Probabilistic Latent Semantic Indexing”. In: ACM SIGIR. 1999, pp. 50–57.
[KF09] D. Koller and N. Friedman. Probabilistic Graphical Models - Principles and Techniques. MIT Press, 2009. isbn:
978-0-262-01319-2.
[Li+14] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. “Reducing the sampling complexity of topic models”. In: ACM SIGKDD.
2014, pp. 891–900.
[LNB14] J. H. Lau, D. Newman, and T. Baldwin. “Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and
Topic Model Quality”. In: European Chapter of the Association for Computational Linguistics, EACL. 2014, pp. 530–539.
[ML02] T. P. Minka and J. D. Lafferty. “Expectation-Propagation for the Generative Aspect Model”. In: UAI ’02. 2002,
pp. 352–359.
Topic Modeling References 5: 28 / 25
References III
[PSD00] J. K. Pritchard, M. Stephens, and P. Donnelly. “Inference of Population Structure Using Multilocus Genotype Data”. In:
Genetics 155.2 (2000), pp. 945–959. issn: 0016-6731.
[PY97] J. Pitman and M. Yor. “The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator”. In: Ann. Probab. 25.2 (Apr. 1997), pp. 855–900.
[RH10] P. Resnik and E. Hardisty. Gibbs sampling for the uninitiated. Tech. rep. CS-TR-4956. University of Maryland, 2010.
[SN10] I. Sato and H. Nakagawa. “Topic models with power-law using Pitman-Yor process”. In: ACM SIGKDD. 2010,
pp. 673–682.
[Teh+06] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. “Hierarchical Dirichlet Processes”. In: J. American Statistical Association 101.476 (2006), pp. 1566–1581.
[TNW06] Y. W. Teh, D. Newman, and M. Welling. “A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet
Allocation”. In: Neural Information Processing Systems, NIPS. 2006, pp. 1353–1360.
[Wal+09] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. M. Mimno. “Evaluation methods for topic models”. In: International Conference on Machine Learning, ICML. 2009, pp. 1105–1112.
[WMM09] H. M. Wallach, D. M. Mimno, and A. McCallum. “Rethinking LDA: Why Priors Matter”. In: Neural Information Processing Systems, NIPS. 2009, pp. 1973–1981.
[YMM09] L. Yao, D. M. Mimno, and A. McCallum. “Efficient methods for topic model inference on streaming document
collections”. In: ACM SIGKDD. 2009, pp. 937–946.
Topic Modeling References 5: 29 / 25
References IV
[ZM16] C. Zhai and S. Massung. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text
Mining. New York, NY, USA: Association for Computing Machinery and Morgan & Claypool, 2016. isbn:
978-1-97000-117-4.
Word and Document Embeddings Neural Models for Word Similarity 6: 11 / 13
Neural Models for Word Similarity: Optimizing skip-gram
We can optimize the weights using stochastic gradient descent and back-propagation.
The basic idea is to update the rows of Win and Wout with a learning rate η:
w_i^{(t+1)} = w_i^{(t)} − η Σ_j ε_j · h_j

where ε_j is the prediction error w.r.t. the jth target, and h_j is the corresponding vector on the other side of the network (the hidden activation when a row of Wout is updated, and the jth target’s output vector when a row of Win is updated).
Intuitively, in each iteration we
I make the “good” output vector(s) more similar to output we computed
I make the “bad” output vector(s) less similar to output we computed
use negative sampling: do not update all of them, only a sample (cf. the sketch below)
I make the input vector(s) more similar to the vector of the desired output
I make the input vector(s) less similar to the vector of the undesired output
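A minimal numpy sketch of one such update for a single (input word, context word) pair with negative sampling, in the sigmoid formulation of [Mik+13; GL14] (the names W_in, W_out and the caller-provided list of negative samples are assumptions for illustration, not the slides’ notation):

import numpy as np

def sgns_update(W_in, W_out, center, context, negatives, eta=0.025):
    # One skip-gram step with negative sampling for the pair (center, context).
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h = W_in[center]                              # input vector of the center word
    grad_in = np.zeros_like(h)
    for j, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        eps = sigmoid(W_out[j] @ h) - label       # prediction error for this target
        grad_in += eps * W_out[j]
        W_out[j] -= eta * eps * h                 # pull the "good" target toward h, push "bad" ones away
    W_in[center] -= eta * grad_in                 # move the input vector toward the desired outputs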
Word and Document Embeddings Neural Models for Word Similarity 6: 12 / 13
Neural Models for Word Similarity: Global Vectors for Word Representation [PSM14]
If we aggregate all word cooccurrences into a matrix X = (x_ij), the skip-gram objective:

L_skip-gram = − (1/|S|) Σ_{i∈S} Σ_{j=−c,...,−1,+1,...,c} log p(w_{i+j} | w_i)

becomes

L_skip-gram = − Σ_i Σ_j x_ji log p(w_j | w_i)

This is similar to the loss function of GloVe:

L_GloVe = − Σ_i Σ_j f(x_ji) (log x_ij − u_j^T · v_i)^2

where f(x_ji) weights each cooccurrence pair, and the squared term measures the divergence between the model u_j^T · v_i and the observed log count.
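A small sketch of this (bias-free, as written above) GloVe objective with one SGD sweep over the nonzero cooccurrence cells; the weighting function f follows [PSM14], but the names and the plain per-cell update are an illustrative simplification:

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    return np.minimum((x / x_max) ** alpha, 1.0)   # f(x): caps the influence of very frequent pairs

def glove_sweep(U, V, X, eta=0.05):
    # One SGD pass over all nonzero cells of the cooccurrence matrix X (bias terms omitted).
    for i, j in zip(*np.nonzero(X)):
        diff = U[j] @ V[i] - np.log(X[i, j])       # divergence between model and log count
        g = 2 * glove_weight(X[i, j]) * diff       # weighted gradient factor
        gu, gv = g * V[i], g * U[j]                # compute both gradients before updating
        U[j] -= eta * gu
        V[i] -= eta * gv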
Word and Document Embeddings Neural Models for Word Similarity 6: 13 / 13
Neural Models for Document Similarity: From word2vec to doc2vec [LM14]
The early approaches used the average word vector, but it did not work too well.
We can design the vector representation as we like.
Idea: also include the document.
Concatenate the word vector with a document indicator (0, . . . , 0, 1, 0, . . . , 0).
⇒ we also optimize a vector for each (training) document.
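For experiments, this paragraph-vector idea is implemented in the gensim library; a minimal usage sketch (the toy corpus and the parameter values are placeholders, and the attribute names assume gensim 4.x):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training document gets a tag, so the model also learns a vector for it.
corpus = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(["the cat sat on the mat",
                                    "dogs and cats are common pets",
                                    "topic models describe documents by topics"])]
model = Doc2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=40)

print(model.dv[0])                                   # learned vector of training document 0
print(model.infer_vector("a cat on a mat".split()))  # vector inferred for an unseen document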
Word and Document Embeddings References 6: 14 / 13
References I
[GL14] Y. Goldberg and O. Levy. “word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method”.
In: CoRR abs/1402.3722 (2014).
[Kus+15] M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger. “From Word Embeddings To Document Distances”. In:
International Conference on Machine Learning, ICML. 2015, pp. 957–966.
[LBB14] A. Lazaridou, E. Bruni, and M. Baroni. “Is this a wampimuk? Cross-modal mapping between distributional semantics
and the visual world”. In: Annual Meeting of the Association for Computational Linguistics, ACL. 2014, pp. 1403–1414.
[LG14] O. Levy and Y. Goldberg. “Neural Word Embedding as Implicit Matrix Factorization”. In: Neural Information Processing Systems, NIPS. 2014, pp. 2177–2185.
[LGD15] O. Levy, Y. Goldberg, and I. Dagan. “Improving Distributional Similarity with Lessons Learned from Word
Embeddings”. In: TACL 3 (2015), pp. 211–225.
[LM14] Q. V. Le and T. Mikolov. “Distributed Representations of Sentences and Documents”. In: International Conference on Machine Learning, ICML. 2014, pp. 1188–1196.
[Mik+13] T. Mikolov, K. Chen, G. Corrado, and J. Dean. “Efficient Estimation of Word Representations in Vector Space”. In: CoRR abs/1301.3781 (2013).
[MR01] S. Mcdonald and M. Ramscar. “Testing the distributional hypothesis: The influence of context on judgements of
semantic similarity”. In: Proceedings of the Cognitive Science Society. 23. 2001, pp. 611–617.
[PSM14] J. Pennington, R. Socher, and C. D. Manning. “GloVe: Global Vectors for Word Representation”. In: Empirical Methods in Natural Language Processing, EMNLP. 2014, pp. 1532–1543.
Summary & Conclusions Summary 7: 1 / 5
Summary: Representing Documents
We learned about different ways of representing documents as vectors:
I Bag of Words (BoW), with different weights (TF-IDF)
I Topic distributions (pLSI, LDA)
I Embeddings (doc2vec)
Challenges:
I Stop words & weighting, spelling errors
I Synonyms, homonyms, negation, sarcasm, irony
I Short documents
I High dimensionality
I Similarity computations
I Evaluation
Summary & Conclusions Summary 7: 2 / 5
Summary: Clusters & Topics
Clusters:
I Typically every document belongs to exactly one cluster
I Soft assignment variants exist (EM, Fuzzy c-means, . . . )
Summary & Conclusions References 7: 6 / 5
References I
[Bau16] C. Bauckhage. “k-Means Clustering via the Frank-Wolfe Algorithm”. In: Lernen, Wissen, Daten, Analysen (LWDA). 2016,
pp. 311–322.
[DLJ10] C. H. Q. Ding, T. Li, and M. I. Jordan. “Convex and Semi-Nonnegative Matrix Factorizations”. In: IEEE Trans. Pattern Anal. Mach. Intell. 32.1 (2010), pp. 45–55.
[LG14] O. Levy and Y. Goldberg. “Neural Word Embedding as Implicit Matrix Factorization”. In: Neural Information Processing Systems, NIPS. 2014, pp. 2177–2185.