
Topic Modeling using Latent Dirichlet Allocation

Mohit Kothari, Computer Science and Engineering, University of California, San Diego

[email protected]

Sonali Rahagude, Computer Science and Engineering, University of California, San Diego

[email protected]

Abstract

Latent Dirichlet Allocation (LDA) is a probabilistic, generative model designed to discover latent topics in text corpora. The idea behind LDA is to model documents as arising from multiple topics, where a topic is defined to be a distribution over a fixed vocabulary of terms. In this report, we train an LDA model on two datasets, namely Classic400 and BBC News, using the method of collapsed Gibbs sampling. We discuss issues related to Gibbs sampling, the definition of goodness-of-fit criteria, parameter tuning, and convergence, and analyze the experimental results. We test the effectiveness of LDA in modeling and discovering latent topics in the corpus using the variation of information (VI) distance measure.

1 Introduction

In recent years, the amount of data such as text and media available to us has increased exponentially, and people have continuously tried to extract useful information from it. For example, given a set of raw text documents, a good way to extract information is to find keywords that succinctly describe what each document is about. We can then discover different themes that span a given corpus of documents. Hence, the goal is to find short descriptions of documents that enable efficient processing of large collections while preserving the essential statistical structure of the documents.

Latent Dirichlet Allocation (LDA) [1] is the simplest topic model that specifically aims to find these short descriptions for members of a data corpus. LDA is an unsupervised, generative model that proposes a stochastic procedure for modeling the words in a given collection of documents. LDA was originally proposed in the context of text mining, but its applications span a variety of fields, including collaborative filtering, content-based image retrieval, and bioinformatics. Because words carry strong semantic information, documents that contain similar content will most likely use a similar set of words. As such, mining an entire corpus of text documents can expose sets of words that frequently co-occur within documents. These sets of words can be interpreted as topics, and they act as the building blocks of the short descriptions.

The report is structured as follows: Section 2 covers theoretical background; Section 3 describes the design choices we make and the goodness-of-fit criteria for a topic model; Section 4 discusses the results we obtain and comments on the behavior observed when changing the hyper-parameters K, α and β; Section 5 concludes with a summary.

2 Background

LDA is a probabilistic model designed to discover latent topics in text corpora. It is a three-level hierarchical Bayesian model [1] in which each document in a collection is modeled as a finite mixture over an underlying set of topics.


Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. These topic probabilities provide an explicit representation of a document.

2.1 Notation and terminology

Formally, we define the following terms [1]:

1. A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, ..., V}. For implementation purposes, the set of all words is represented as a V-dimensional vector.

2. A document m is a sequence of N_m words denoted by w = (w_1, w_2, ..., w_{N_m}), where w_n is the nth word in the sequence.

3. A corpus is a collection of M documents denoted by D = {w_1, w_2, ..., w_M}.

Variables in bold represent vectors, and the same notation is followed for the rest of the report.

We wish to train an LDA model on the training corpus that not only assigns high probability to the documents in the corpus, but also assigns high probability to other “similar” documents which are unseen during the training phase.

2.2 Simplification

The foremost goal of mining a text corpus is to find an apt representation of each document in the corpus. Intuitively, it makes sense to choose a model that not only captures the content but also retains the ordering of the words, i.e. the structure of the document. However, LDA is based on the “bag-of-words” assumption: the order of words in a document can be neglected [1]. In the language of probability theory, this is an assumption of exchangeability for the words in a document [2]. LDA also assumes that documents are exchangeable; the specific ordering of the documents in a corpus can also be neglected. This assumption reduces the complexity of the algorithm without compromising much on quality. Based on the distinct words that appear in the corpus, a global vocabulary list is built. To reduce the dimensionality of the vocabulary list, a fair amount of data pre-processing is usually done; common methods include stop-word removal, stemming, and synonym merging.
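As a concrete illustration of this pre-processing step (the datasets used later in this report arrive already pre-processed, so this is only a sketch with a toy corpus and a made-up stop-word list):

```matlab
% Toy sketch of bag-of-words pre-processing: lower-casing, stop-word
% removal, and construction of a document-term count matrix.
docs = {'the boundary layer of the wing', ...
        'patients with ventricular septal defect'};    % placeholder documents
stopwords = {'the', 'of', 'with'};                      % tiny example list

tokens = cellfun(@(d) strsplit(lower(d)), docs, 'UniformOutput', false);
tokens = cellfun(@(t) t(~ismember(t, stopwords)), tokens, 'UniformOutput', false);

vocab  = unique([tokens{:}]);                           % global vocabulary list
counts = zeros(numel(docs), numel(vocab));              % M x V document-term matrix
for d = 1:numel(docs)
    [~, idx] = ismember(tokens{d}, vocab);
    counts(d, :) = accumarray(idx', 1, [numel(vocab) 1])';
end
```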

2.3 Latent Dirichlet allocation (LDA)

Latent Dirichlet allocation is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Here, words are modeled as observed random variables, while topics are modeled as latent random variables. Once the generative procedure is established, we define its joint distribution and then use statistical inference to compute the probability distribution over the latent variables, conditioned on the observed variables.

2.4 Multinomial distribution

LDA uses the multinomial distribution to model the training set of documents [1, 3]. Once the parameters of this model are fixed, we can evaluate the probability that the model assigns to a test document. The distribution is given by

p(x; \theta) = \left( \frac{n!}{\prod_{j=1}^{V} x_j!} \right) \left( \prod_{j=1}^{V} \theta_j^{x_j} \right)    (1)

where the data x is a vector of non-negative integer counts, the parameter θ is a real-valued vector, and both vectors have the same length V; here n = \sum_{j=1}^{V} x_j is the total number of words. In equation (1), the first factor in parentheses is called a “multinomial coefficient”. It is the size of the equivalence class of x, that is, the number of different word sequences that yield the same counts. The second factor in parentheses is the probability of any individual member of the equivalence class of x.
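As a small numerical illustration (not taken from the report), equation (1) can be evaluated in log space with MATLAB's gammaln to avoid overflow for long documents; the count and parameter vectors below are made up:

```matlab
% Evaluate the multinomial likelihood of equation (1) in log space.
x     = [3 0 2 1];                 % word counts over a 4-term vocabulary
theta = [0.4 0.1 0.3 0.2];         % multinomial parameters, sum to 1
n     = sum(x);                    % document length

logCoeff = gammaln(n + 1) - sum(gammaln(x + 1));   % log multinomial coefficient
logProb  = logCoeff + sum(x .* log(theta));        % log p(x; theta)
fprintf('p(x; theta) = %.6f\n', exp(logProb));
```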


2.5 LDA generative process

The generative process for a document collection D under the LDA model is as follows [1, 4],

1. For k = 1...K:

(a) φ(k) ∼ Dirichlet(β)

2. For each document d ∈ D:

(a) θd ∼ Dirichlet(α)

(b) For each word wi ∈ d:

i. z_i ∼ Discrete(θ_d)
ii. w_i ∼ Discrete(φ^{(z_i)})

where K is the number of latent topics in the collection, φ^{(k)} is a discrete probability distribution over a fixed vocabulary that represents the kth topic, θ_d is a document-specific distribution over the available topics, z_i is the topic index for word w_i, and α and β are hyper-parameters for the symmetric Dirichlet distributions from which the discrete distributions are drawn.
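The generative process can be sketched directly in MATLAB. The corpus sizes below are made up, and the Dirichlet draws use the standard gamma-normalization trick (gamrnd requires the Statistics and Machine Learning Toolbox); this is an illustration of the process above, not the report's implementation:

```matlab
K = 3; V = 10; M = 5; Nd = 20; alpha = 0.5; beta = 0.01;   % toy sizes

phi = gamrnd(beta * ones(K, V), 1);          % step 1: one topic-word
phi = phi ./ sum(phi, 2);                    %         distribution per topic

words = zeros(M, Nd);                        % generated corpus (word indices)
for d = 1:M
    theta = gamrnd(alpha * ones(1, K), 1);   % step 2(a): document-topic distribution
    theta = theta / sum(theta);
    for n = 1:Nd
        z = find(rand <= cumsum(theta), 1);               % step 2(b)i: draw topic z
        words(d, n) = find(rand <= cumsum(phi(z, :)), 1); % step 2(b)ii: draw word
    end
end
```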

The generative process described above results in the following joint distribution,

p(\mathbf{w}, \mathbf{z}, \theta, \phi \mid \alpha, \beta) = \left[ \prod_{k=1}^{K} p(\phi^{(k)} \mid \beta) \right] \left[ \prod_{d=1}^{M} p(\theta_d \mid \alpha) \right] \left[ \prod_{d=1}^{M} \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \phi^{(z_{d,n})}) \right]    (2)

which can be read directly from the plate notation of LDA shown in Figure 1.

Figure 1: Graphical model representation of the LDA model used for this project. The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document.

The unobserved (latent) variables z, θ, and φ are of interest to us. Each θ_d is a low-dimensional representation of a document in “topic space”. Each z_i represents the topic that generates the word instance w_i, and φ represents a K × V matrix where φ_{j,i} = p(w_i | z_j) [5]. One of the most interesting aspects of LDA is that it can learn words that we would associate with certain topics in an unsupervised manner. This is expressed through the topic distributions φ.

As shown in equation (2), we use Dirichlet distributions for our priors. The Dirichlet distribution is a probability density function over the set of all multinomial parameter vectors, given by

p(\gamma \mid \alpha) = \frac{1}{D(\alpha)} \prod_{s=1}^{V} \gamma_s^{\alpha_s - 1}    (3)


where γ is any multinomial parameter vector of length V over which the Dirichlet distribution acts, such that \gamma_s > 0 \;\forall s and \sum_{s=1}^{V} \gamma_s = 1, and α is the parameter vector of the Dirichlet distribution itself, where

D(\alpha) = \int_{\gamma} \prod_{s=1}^{V} \gamma_s^{\alpha_s - 1} \, d\gamma = \frac{\prod_{s=1}^{V} \Gamma(\alpha_s)}{\Gamma\!\left( \sum_{s=1}^{V} \alpha_s \right)}    (4)

where Γ is the gamma function, which satisfies Γ(k) = (k − 1)! for integer k. The Dirichlet distribution is used because it is the conjugate prior of the multinomial distribution. This simplifies the probability expression in equation (5), making the training algorithm more efficient and making it easy to infer topics for new, unseen documents.
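To make the conjugacy explicit (a standard derivation, added here for completeness): combining a multinomial likelihood with count vector x and a Dirichlet(α) prior gives

p(\gamma \mid x, \alpha) \;\propto\; p(x \mid \gamma)\, p(\gamma \mid \alpha) \;\propto\; \prod_{s=1}^{V} \gamma_s^{x_s} \prod_{s=1}^{V} \gamma_s^{\alpha_s - 1} \;=\; \prod_{s=1}^{V} \gamma_s^{x_s + \alpha_s - 1},

so the posterior is again a Dirichlet distribution, with parameter α + x. This is what allows θ and φ to be integrated out analytically in the collapsed Gibbs sampler described next.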

2.6 Collapsed Gibbs sampling

The training algorithm that we use in this report is known as collapsed Gibbs sampling. Rather than inferring the θ and φ^{(k)} distributions directly, it infers the latent variable z for each word occurrence in each document, i.e. the topic from which that word comes. Because a word can appear at different positions in a document, each appearance of the word has its own z value; thus the same word can come from different topics.

Suppose we have a vector z for a document such that z = {z_1, ..., z_n}, with the conditional distribution p(z_i | z_1, ..., z_{i-1}, z_{i+1}, ..., z_n; w). Gibbs sampling uses the following algorithm to reach the true distribution of p(z_i | \mathbf{z}_{-i}; w), where \mathbf{z}_{-i} = {z_1, ..., z_{i-1}, z_{i+1}, ..., z_n}. The steps involved in Gibbs sampling are as follows:

1. Select an arbitrary initial guess for z = {z_1, ..., z_n}.
2. Draw z_1 according to p(z_1 | \mathbf{z}_{-1}; w), and likewise for z_2, z_3, etc., up to z_n.
3. Update z with the newly drawn values and repeat step (2).

If step (2) is repeated a very large number of times, the process converges to the actual distribution of the vector z for each w. Skipping the derivation, the final conditional probability is given by

p(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{\left( n^{(-i)}_{d,j} + \alpha_j \right) \left( n^{(-i)}_{j,w} + \beta_w \right)}{\sum_{k'} \left( n^{(-i)}_{d,k'} + \alpha_{k'} \right) \; \sum_{w'} \left( n^{(-i)}_{j,w'} + \beta_{w'} \right)}    (5)

where n_{d,k} is the number of word tokens in document d assigned to topic k, n_{k,w} is the number of times word w is assigned to topic k, and the superscript (−i) signifies leaving the ith token out of the counts. The pseudocode for the algorithm is presented in Appendix A.
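The per-token update of equation (5) can be sketched as the following MATLAB function (an illustrative sketch, not the report's code). It assumes the count matrices ndk (D × K), nkw (K × V) and the per-topic totals nk = sum(nkw, 2) are maintained by the outer loops of Algorithm 1, and that α and β are the symmetric scalar hyper-parameters; the document-side denominator of equation (5) is constant across topics and is therefore dropped from the proportionality:

```matlab
function [z, ndk, nkw, nk] = gibbsUpdateToken(d, w, i, z, ndk, nkw, nk, alpha, beta)
% One collapsed Gibbs update: z is the vector of current topic assignments
% (one entry per token in the corpus), i is the index of the token being
% updated, w its word index, and d its document index.
V = size(nkw, 2);

zOld = z(i);
ndk(d, zOld) = ndk(d, zOld) - 1;            % leave the i-th token out
nkw(zOld, w) = nkw(zOld, w) - 1;
nk(zOld)     = nk(zOld) - 1;

% Unnormalised conditional over topics, following equation (5); the sum
% over k' in the denominator does not depend on the topic and is omitted.
p = (ndk(d, :) + alpha) .* (nkw(:, w)' + beta) ./ (nk' + V * beta);
p = p / sum(p);

zNew = find(rand <= cumsum(p), 1);          % draw the new topic assignment
z(i) = zNew;

ndk(d, zNew) = ndk(d, zNew) + 1;            % put the token back into the counts
nkw(zNew, w) = nkw(zNew, w) + 1;
nk(zNew)     = nk(zNew) + 1;
end
```

After the final epoch, point estimates of the distributions can be recovered from the same counts, e.g. θ_{d,k} ∝ n_{d,k} + α and φ_{k,w} ∝ n_{k,w} + β (each normalized to sum to one).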

3 Design

We now describe the design choices we make for training the LDA model using collapsed Gibbs sampling.

3.1 Datasets

We use two datasets to experiment with our implementation of LDA. The Classic400 dataset [6] contains 400 documents over a total vocabulary of 6205 words. It is already pre-processed, and the documents are stored in the form of a MATLAB sparse matrix. The dataset also contains class labels for the documents in the corpus, and each document belongs to one of 3 distinct classes, C ∈ {1, 2, 3}.

The other dataset is derived from British Broadcasting Corporation (BBC) news articles (http://mlg.ucd.ie/datasets/bbc.html). This dataset consists of 2225 documents over a total vocabulary of 9635 words. The documents correspond to stories from 2004-2005 in five topical areas: business, entertainment, politics, sports, and technology.



As with the Classic400 dataset, it also provides the class labels of the documents, i.e. C ∈ {1, 2, 3, 4, 5}. The dataset is distributed in Matrix Market format (http://math.nist.gov/MatrixMarket/formats.html). Since MATLAB cannot load this format directly, some effort is spent converting it into the same MATLAB sparse-matrix format as the Classic400 dataset, so that our framework can handle it without any changes.

3.2 Pre-processing for BBC dataset

MATLAB does not provide built-in functions to load files in Matrix Market format, so we take the help of a third-party solution to load the BBC dataset. We modify the rdcood function written by R. Pozo (http://math.nist.gov/pozo), which converts data in Matrix Market format into sparse-matrix format. We use the same technique to load the word list and class labels of the BBC dataset.

3.3 Choice of hyper-parameters α, β

As shown previously in equation (5), the hyper-parameters α and β encode the prior belief about the document-topic and topic-term distributions, respectively. More concretely, the scalar α acts as a pseudocount of words belonging to each topic j in each document d. Intuitively, when α is bigger, it is easier for different positions in the same document to be assigned to different topics. On the other hand, β is the pseudocount of prior occurrences of each word in each topic. When β is bigger, it is easier for two appearances of the same word to be assigned to different topics.

As explained by David Blei [1], these hyper-parameters also have a smoothing effect on the distributions, and hence the model in the plate notation of Figure 1 is also known as smoothed LDA. Lowering their values reduces this smoothing effect and results in more decisive topic associations; thus both φ and θ become sparser.

Since the two have a joint effect and there is no algorithmic way of identifying which pair gives the best possible model, we run our experiments for the following combinations of α and β: α ∈ {1/K, 2/K, 5/K, 50/K} and β ∈ {1, 0.01, 0.0001} for the Classic400 dataset. However, for the BBC dataset, we run a smaller subset, α ∈ {2/K, 50/K} and β ∈ {1, 0.01}, because of the computational cost of these experiments. Another simplification that we apply is the assumption of symmetric Dirichlet distributions, i.e. we use the same value for every component of α and the same value for every component of β.

3.4 Choice of number of topics

There is always a dilemma in choosing the number of topics while training a topic model. As this is unsupervised learning, we do not know how many topics the underlying corpus contains, and detecting how many topics would be a good fit is something of a black art. For this project, both our datasets contain class labels for the documents in the corpus. So, to start with, we set the number of topics K equal to the number of classes |C|. But we do not stop there; we experiment with values K > |C| as well as K < |C|. It is interesting to see the results of these configurations, i.e. whether two or more topics can be generalized into a single topic or whether any underlying topic contains subtopics. In doing so, there is also a risk of either over-generalization or overly fine-grained categorization of the topics, i.e. adding noise to the topics.

For the Classic400 dataset, documents are already classified into 3 classes, so we run our experiments for K ∈ {3, 4, 5}. Running with 2 topics does not make much sense; intuitively it would lead to very high generalization. For the BBC News dataset, documents are already classified into 5 classes, and hence we run our experiments for K ∈ {4, 5, 6}. We also report some intuition in section 4.5 from training the model with K = 4 and K = 6 topics on the BBC dataset.

3.5 Principal component analysis

It is instructive to visualize how the model evolves as it is trained. One way to do so is to plot the per-document topic distribution θ after different numbers of epochs. In MATLAB, 3-D graphs can be plotted using scatter3 or plot3, so we can plot the per-document topic distribution directly for K = 3.



Since θ_d sums to 1 for a given document d, we can also plot the distribution for K = 4, as the number of degrees of freedom is 3.

But for a higher number of topics, i.e. K > 4, we cannot plot the distributions directly. Hence, we use Principal Component Analysis (PCA) [7] to perform dimensionality reduction on the per-document topic distribution, reducing it to its principal components. The idea behind using PCA is that the dataset tends to be distributed along a low-dimensional subspace; PCA finds these directions and reduces the number of dimensions without much loss of information. We use an online tutorial (http://nghiaho.com/?page_id=1030) to implement our own version of PCA. The steps involved are straightforward: translate the dataset so that its center is at the origin, calculate the covariance matrix, take its leading eigenvectors as the principal components, and plot the dataset projected onto them.
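A minimal sketch of these steps in MATLAB is shown below; the θ matrix here is a random placeholder standing in for the trained per-document topic distribution, and only the first three components are kept for plotting:

```matlab
theta = rand(400, 5); theta = theta ./ sum(theta, 2);   % placeholder M x K matrix

centered = theta - mean(theta, 1);      % translate so the center is at the origin
C = cov(centered);                      % K x K covariance matrix
[vecs, D] = eig(C);                     % eigen-decomposition of the covariance
[~, order] = sort(diag(D), 'descend');  % principal components first
proj = centered * vecs(:, order(1:3));  % reduced 3-D representation

scatter3(proj(:, 1), proj(:, 2), proj(:, 3), 10, 'filled');
xlabel('PCA(1)'); ylabel('PCA(2)'); zlabel('PCA(3)');
```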

3.6 Analysing topic models for goodness-of-fit

Topic models like LDA perform soft assignments of latent topics to the observed entities, i.e. the words and documents. While learning a model, it is often necessary to evaluate the quality of the model and its correctness in discovering the topics in the given corpus. This quality measure for the LDA model is its goodness-of-fit. We now describe three techniques that can be used to compute goodness-of-fit.

3.6.1 Clustering accuracy

The LDA model already provides a soft clustering of the documents, as well as of the terms in the corpus, by associating them with topics, and it is useful to measure the quality of this clustering. One way to evaluate goodness-of-fit for an LDA model is to perform a subjective inspection of the topic assignments of different documents in the corpus. A more concrete method is to use a classifier that predicts the class of a document based on its θ vector. If we know a priori which class the document belongs to, we can compare it with the class predicted by the classifier and check whether they agree. Essentially, we treat topic j as a class comprising the documents with the highest θ_j and compare it with the true class labels of the documents.
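A simple sketch of this check is given below. The θ matrix and label vector are random placeholders for the trained model output and the dataset labels, and mapping each topic to the majority class of its documents is one possible matching rule (the report does not specify which classifier or matching it has in mind):

```matlab
theta  = rand(400, 3); theta = theta ./ sum(theta, 2);   % placeholder M x K matrix
labels = randi(3, 400, 1);                               % placeholder class labels

[~, predicted] = max(theta, [], 2);          % hard topic assignment per document
mapped = zeros(size(predicted));
for k = 1:size(theta, 2)                     % map each topic to its majority class
    docs = (predicted == k);
    if any(docs)
        mapped(docs) = mode(labels(docs));
    end
end
accuracy = mean(mapped == labels);
```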

3.6.2 Variation of information distance

An alternative measure of goodness-of-fit is based on the soft clustering produced by LDA. Here, the goal is to treat the given set of class labels as a deterministic topic distribution for each document, and to compute a distance measure between this deterministic distribution and the LDA topic distributions. One such measure, described by Gregor Heinrich et al. [8, 9], is the variation of information distance, also known as the VI distance. Assume we have documents d_1, ..., d_M. A soft clustering C assigns to each document d_i a distribution p(c = r | d_i) for r = 1, ..., k. If we have a second clustering C', we have a new distribution p(c' = s | d_i) for s = 1, ..., k'. Note that the two clusterings may have different numbers of clusters (k and k' can be different).

If the clusterings are very similar, there will be pairs of clusters that often occur together. On the other hand, if the two clusterings are independent, the pair of clusters c = r and c' = s will appear with probability p(c = r) p(c' = s). Therefore, we can determine the Kullback-Leibler divergence between this “independent” distribution and the actual joint distribution p(c = r, c' = s). This is just the mutual information between the random variables induced by the clusterings [10, 8]. The mutual information is given by

I(C, C') = \sum_{r=1}^{k} \sum_{s=1}^{k'} p(c = r, c' = s) \log \frac{p(c = r, c' = s)}{p(c = r)\, p(c' = s)}    (6)



where the required probabilities are computed by averaging over the per-document cluster distributions:

p(c = r) = \frac{1}{M} \sum_{i=1}^{M} p(c = r \mid d_i), \qquad p(c = r, c' = s) = \frac{1}{M} \sum_{i=1}^{M} p(c = r, c' = s \mid d_i)    (7)

The mutual information between two random variables is 0 for independent variables. Further, I(C, C') \le \min\{H(C), H(C')\}, where H(C) = -\sum_{r=1}^{k} p(c = r) \log_2 p(c = r) is the entropy of C. The inequality becomes an equality, I(C, C') = H(C) = H(C'), if and only if the two clusterings are equal. Meila [9] uses these properties to define the variation of information distance measure,

D_{VI}(C, C') = H(C) + H(C') - 2\, I(C, C')    (8)

and shows that D_{VI}(C, C') is a true metric: it is always non-negative, it obeys the triangle inequality D_{VI}(C, C') + D_{VI}(C', E) \ge D_{VI}(C, E), and it becomes zero if and only if C = C'. Further, the VI distance depends only on the proportions of cluster associations with data items; it is invariant to the absolute number of data items.
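Equations (6)-(8) can be sketched as a short MATLAB function. Here P is an M × k matrix whose rows are p(c = r | d_i) (for the LDA clustering this is the θ matrix), and Q is an M × k' matrix of one-hot rows built from the class labels. Taking the per-document joint as the product p(c = r | d_i) p(c' = s | d_i) is an assumption about how equation (7) is instantiated here, and base-2 logarithms are used throughout for consistency with H(C):

```matlab
function dvi = viDistance(P, Q)
% VI distance between two soft clusterings given as M x k and M x k' matrices.
M     = size(P, 1);
pc    = mean(P, 1);                      % p(c = r), equation (7)
pcp   = mean(Q, 1);                      % p(c' = s)
joint = (P' * Q) / M;                    % p(c = r, c' = s)

Hc  = -sum(pc  .* log2(pc  + eps));      % entropies (eps guards log2(0))
Hcp = -sum(pcp .* log2(pcp + eps));
I   = sum(sum(joint .* log2((joint + eps) ./ (pc' * pcp + eps))));  % equation (6)

dvi = Hc + Hcp - 2 * I;                  % equation (8)
end
```

For example, viDistance(theta, full(sparse((1:M)', labels, 1))) would compare the LDA clustering against the one-hot encoding of an M × 1 vector of true class labels.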

3.6.3 Perplexity

The above-described methods can be adopted when the class labels of the documents are known a priori. In the absence of class labels, a common criterion for measuring the goodness-of-fit of an LDA model is the likelihood of held-out data under the trained model. Perplexity measures this likelihood: it is defined as the reciprocal of the geometric mean of the per-word likelihoods in the test corpus given the model \mathcal{M},

\mathrm{perplexity}(\mathcal{W} \mid \mathcal{M}) = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \mathcal{M})}{\sum_{d=1}^{M} N_d} \right)    (9)
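Given per-document log-likelihoods under a trained model, equation (9) is a one-liner; the values below are placeholders, since computing log p(w_d | M) itself is not covered in this report:

```matlab
logLik = -1500 * rand(400, 1);      % placeholder log p(w_d | M), one per document
Nd     = randi([50 200], 400, 1);   % placeholder document lengths
perplexity = exp(-sum(logLik) / sum(Nd));
```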

We are not going into the details of how to compute this likelihood. We choose the variation of information (VI) distance to measure goodness-of-fit, since we already have the true labels for both datasets, and the VI distance provides richer information about the topic model than clustering accuracy. We do not choose perplexity because our first dataset consists of only 400 documents, and creating a hold-out validation set would leave very few documents to train the LDA model on.

3.7 Convergence for Gibbs sampling and overfitting measure

Gibbs sampling, being an MCMC method, faces the difficulty of determining when the Markov chain has reached its stationary distribution. In practice, the convergence of some measure of model quality is monitored instead. Heinrich [10] proposes the use of perplexity and test-set likelihood for convergence monitoring of an LDA model.

In many practical cases, in addition to using perplexity and the likelihood of held-out data for this purpose, it is possible to perform intermediate convergence monitoring steps using the likelihood or perplexity of the training data. Because no additional sampling of held-out data topics has to be performed, this measurement is more efficient compared to using held-out data. As long as no overfitting occurs, the difference between both types of likelihood remains low, a fact that can be used to monitor overfitting as well.

Thus, we choose VI distance to evaluate goodness-of-fit and use a fixed number of epochs (1000) instead of a stopping condition for Gibbs sampling while training the LDA model.

4 Results

We now present all the major results of our experiments. All the experiments are performed in MATLAB.


4.1 Estimation of hyper-parameters α and β

Figures 2 and 3 show plots of the variation of information (VI) distance versus the number of epochs for the Classic400 and BBC News datasets, respectively. It is evident from the figures that the values α = 2/K and β = 1 give the smallest VI distance for both datasets. Here, K is the number of topics; hence for K = {3, 4, 5, 6}, we have α = {0.67, 0.5, 0.4, 0.33}.

We choose the values α = 2/K and β = 1 to report our results for the remainder of the experiments, since we do not want to overload the reader with a plethora of graphs of per-topic distributions.

[Figure: Variation of Information Distance (y-axis) against Iterations (x-axis) for the configurations [α, β, no. of topics] = [16.67, 0.01, 3], [0.4, 0.01, 5], [12.5, 1, 4], [12.5, 0.01, 4], [0.67, 1, 3], [0.4, 1, 5], [0.5, 1, 4], [0.67, 0.01, 3], [0.5, 0.01, 4].]

Figure 2: VI Distance for Classic400 dataset [α, β, No. of topics]

[Figure: Variation of Information Distance (y-axis) against Iterations (x-axis) for the configurations [α, β, no. of topics] = [12.5, 0.01, 4], [0.4, 0.01, 5], [0.5, 0.01, 4], [0.33, 1, 6], [0.5, 1, 4], [0.33, 0.01, 6], [10, 0.01, 5], [8.33, 0.01, 6], [0.4, 1, 5].]

Figure 3: VI Distance for BBC dataset [α, β, No. of topics]

We use a smaller number of epochs in these plots since the VI distance curves for different (α, β) pairs diverge from one another as the number of epochs grows; hence, it suffices to infer suitable values of α and β from 100 epochs for the Classic400 dataset and from 50 epochs for the BBC dataset.


4.2 Top 10 most probable words for topics

4.2.1 Classic400 dataset

Table 1 lists the 10 most probable words for the Classic400 dataset for K = 3. We have labeled the topics discovered with our own interpretation of the “natural” topics contained in the corpus, based on the words associated with them.

Topic 1       Topic 2        Topic 3
boundary      patients       system
layer         ventricular    research
wing          left           fatty
mach          cases          scientific
supersonic    nickel         retrieval
ratio         aortic         acids
wings         septal         science
velocity      visual         language
shock         defect         methods
effects       pulmonary      glucose
"Aerospace-Physics"  "Medical-Medicine"  "Research-Science"

Table 1: 10 most probable words per topic for the Classic400 dataset

4.2.2 BBC dataset

Table 2 lists the 10 most probable words for the topics discovered with K = 5 on the BBC dataset. We notice that some of the words are truncated; this is because the BBC dataset has been pre-processed with stemming.

Topic 1     Topic 2      Topic 3       Topic 4      Topic 5
game        govern       peopl         year         film
plai        peopl        game          compani      best
win         labour       technolog     market       award
player      parti        mobil         firm         year
england     elect        phone         bank         music
against     minist       servic        sale         star
first       blair        on            price        show
year        plan         user          share        on
world       tori         comput        growth       includ
"Sports"    "Politics"   "Technology"  "Business"   "Entertainment"

Table 2: 10 most probable words per topic for the BBC dataset

4.3 Sparsity in per-document topic distribution

4.3.1 Classic400 dataset

For the Classic400 dataset, we observe that the sparsity of the per-document topic distribution increases with the number of epochs. This is depicted in Figure 4: the per-document topic distribution is sparser after 1000 epochs of Gibbs sampling than after 10 or 100 epochs. From this result, we conclude that the LDA model converges to the true topic assignments as we increase the number of epochs of Gibbs sampling.


[Figure: 3-D scatter plots of the per-document topic distribution, with axes PCA(1), PCA(2) and PCA(3), after (a) 10 epochs, (b) 100 epochs and (c) 1000 epochs.]

Figure 4: Per-document topic distributions for different epochs - Classic400 dataset

4.3.2 BBC dataset

In the case of the BBC dataset, we observe a slightly different trend in the per-document topic distribution over the number of epochs. Figure 5 shows plots of the per-document topic distribution for different numbers of training epochs. We notice that although the sparsity of the distribution increases with the number of epochs, the distribution at the end of 1000 epochs is not as sparse as the one obtained on the Classic400 dataset. We thus infer that the sparsity of the per-document topic distribution is a characteristic of the dataset on which the LDA model is trained. In the Classic400 dataset, documents tend to contain words related to a single topic, while in the BBC dataset, documents tend to contain words that span a number of topics.

[Figure: 3-D scatter plots of the per-document topic distribution, with axes PCA(1), PCA(2) and PCA(3), after (a) 10 epochs, (b) 100 epochs and (c) 1000 epochs.]

Figure 5: Per-document topic distributions for different epochs - BBC dataset


4.4 Comparison of different numbers of topics

As described in section 3.4, we choose the number of topics for the Classic400 dataset as K ∈ {3, 4, 5} and for the BBC dataset as K ∈ {4, 5, 6}, and we run our LDA models for each of these K values. We observe that for the Classic400 dataset, K = 3 gives the most natural per-word topic distributions, where distinct topic definitions can be derived from the 10 most probable words belonging to each topic. However, for K = 4 and K = 5, we see some overlap in the 10 most probable words belonging to different topics; for these values, the model generates some noise in the per-word topic distribution. Hence, we conclude that the given corpus consists of 3 topics only. One reason for the model not giving distinct, interpretable topics for K = 4, 5 could be the small size of the corpus.

However, in the case of the BBC dataset, we observe some interesting results as the number of topics varies. These are described in the next section.

4.5 Number of topics in BBC dataset

By subjective evaluation, we can observe that the BBC dataset consists of five topics. This is also hinted at by the number of class labels provided with the dataset. Fixing the number of topics at K = 5 gives the per-word topic distribution shown in Table 4. We also train the LDA model with K = 4 and K = 6 topics; the 10 most probable words for each are listed in Tables 3 and 5.

Looking at Tables 3, 4 and 5, it is interesting to note that for the model trained with K = 4, the Entertainment and Sports topics have been combined into a single topic. Similarly, for the model trained with K = 6, the Entertainment topic from the K = 5 model has been further split into Movies and Music. We purposefully restate the 10 most probable words for K = 5 in Table 4 to improve readability.

Topic 1     Topic 2       Topic 3     Topic 4
year        peopl         govern      film
compani     game          labour      year
market      technolog     peopl       plai
firm        mobil         parti       best
bank        music         elect       game
sale        phone         minist      win
share       on            blair       first
price       servic        plan        on
growth      get           tori        award
"Business"  "Technology"  "Politics"  "Entertainment-Sports"

Table 3: 10 most probable words for the BBC dataset with K = 4

5 Conclusions

Given a corpus of documents, identifying the different themes in the corpus is a very interesting problem. In this report we look at a very simple model for identifying the topics inherent in a given corpus: latent Dirichlet allocation, a flexible generative probabilistic model for collections of discrete data. LDA is based on a simple exchangeability assumption for the words and documents. The results of training an LDA model on a given document corpus depend on the hyper-parameters α and β. We observe that with greater values of α, words in a given document tend to be assigned to different topics, and with greater values of β, different appearances of the same word can be assigned to different topics. We also notice that the sparsity of the per-document distribution is a characteristic of the dataset on which the LDA model is trained. Determining the number of topics for a given corpus can be a tricky issue: we observe that setting K to a number greater than the actual number of topics leads to overly fine-grained topics, while setting K below the actual number leads to generalization of some of the topics.


Topic 1     Topic 2      Topic 3       Topic 4      Topic 5
game        govern       peopl         year         film
plai        peopl        game          compani      best
win         labour       technolog     market       award
player      parti        mobil         firm         year
england     elect        phone         bank         music
against     minist       servic        sale         star
first       blair        on            price        show
year        plan         user          share        on
world       tori         comput        growth       includ
"Sports"    "Politics"   "Technology"  "Business"   "Entertainment"

Table 4: 10 most probable words for the BBC dataset with K = 5

Topic 1     Topic 2     Topic 3    Topic 4     Topic 5       Topic 6
govern      film        game       year        peopl         game
labour      award       plai       compani     technolog     music
peopl       best        win        market      mobil         year
parti       star        player     firm        phone         plai
elect       year        england    bank        servic        on
minist      show        against    sale        user          song
blair       actor       first      price       comput        band
plan        director    year       share       on            record
tori        includ      world      growth      firm          album
sai         nomin       time       economi     digit         top
"Politics"  "Movies"    "Sports"   "Business"  "Technology"  "Music"

Table 5: 10 most probable words for the BBC dataset with K = 6

A Appendix

Algorithm 1 Collapsed Gibbs sampling for LDA

Input: words w of documents d ∈ [1, D]
randomly initialize z and increment counters
for iteration i ∈ [1, epochs] do
    for document d ∈ [1, D] do
        for word position n ∈ [1, N_d] do
            topic ← z[d, n]
            decrement counters according to document d, topic and word
            for k ∈ [1, K] do
                calculate p(z = k | ·) using the Gibbs equation (5)
            end for
            newTopic ← sample from p(z | ·)
            z[d, n] ← newTopic
            increment counters according to document d, newTopic and word
        end for
    end for
end for

References

[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.


[2] David Aldous. Exchangeability and related topics. École d'Été de Probabilités de Saint-Flour XIII, 1983, pages 1–198, 1985.

[3] Charles Elkan. Text mining and topic models. University of California, San Diego, February 2014.

[4] Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 569–577. ACM, 2008.

[5] William M. Darling. A theoretical and practical implementation tutorial on topic modeling and Gibbs sampling. 2011.

[6] Charles Elkan. Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution. In Proceedings of the 23rd International Conference on Machine Learning, pages 289–296. ACM, 2006.

[7] Lindsay I. Smith. A tutorial on principal components analysis. Cornell University, USA, 51:52, 2002.

[8] Gregor Heinrich, Jorg Kindermann, Codrina Lauth, Gerhard Paaß, and Javier Sanchez-Monzon. Investigating word correlation at different scopes—a latent concept approach. In Workshop Lexical Ontology Learning at Int. Conf. Mach. Learning, 2005.

[9] Marina Meila. Comparing clusterings—an information based distance. Journal of Multivariate Analysis, 98(5):873–895, 2007.

[10] Gregor Heinrich. Parameter estimation for text analysis. Technical report, 2004.
