
Topic modeling experiments benchmark and simple evaluations

Transcript
Page 1: Topic modeling experiments  benchmark and simple evaluations

Topic modeling experiments

benchmark and simple evaluations

11/15, 11/16

Page 2: Topic modeling experiments  benchmark and simple evaluations

What to measure

• Empirical running time (1,000 sampling iterations); a benchmark-loop sketch follows this list
  * Both Dirichlet and Logistic Normal priors
  * Both a single prior and a mixture model
  * Both a small (K=100) and a large (K=1,000) number of topics
  * 4 dataset sizes (1,000, 2,000, 5,000, and 10,000 posts)

• Person/topic modeling quality
  * See the topic distribution of people
  * See what topics are extracted
  * Try both priors with K=50, 500, and 1,000 topics on the 10,000-post dataset
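
As a rough illustration, here is a minimal sketch of the benchmark loop this grid implies. `run_gibbs` and all names here are hypothetical placeholders, not the actual implementation:

```python
import itertools
import time

# Hypothetical grid matching the setup above.
PRIORS = ["dirichlet", "logistic_normal"]
MIXTURE = [False, True]            # single prior vs. 5-component mixture
NUM_TOPICS = [100, 1000]
NUM_POSTS = [1000, 2000, 5000, 10000]

def run_gibbs(prior, mixture, num_topics, num_posts, iterations=1000):
    """Placeholder: run `iterations` Gibbs sampling sweeps on the dataset."""
    ...

for prior, mixture, k, n in itertools.product(PRIORS, MIXTURE, NUM_TOPICS, NUM_POSTS):
    start = time.perf_counter()
    run_gibbs(prior, mixture, k, n, iterations=1000)
    elapsed = time.perf_counter() - start
    print(f"{prior:16s} mixture={mixture!s:5s} K={k:4d} posts={n:5d} -> {elapsed:.1f}s")
```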

Page 3: Topic modeling experiments  benchmark and simple evaluations

Running time comparison 1-1. (100 topics)

[Chart: running time (sec, 0-6,000) vs. number of posts (1,000-10,000) for four models: Dirichlet, Dirichlet Mixture, Logistic Normal, Logistic Normal Mixture]

Using 4 datasets of different sizes; the mixture models assume 5 mixture components for both the Dirichlet and the Logistic Normal priors

Page 4: Topic modeling experiments  benchmark and simple evaluations

Running time comparison 1-2. details (100 topics)

Page 5: Topic modeling experiments  benchmark and simple evaluations

Running time comparison 1-3. analysis (100 topics)

• The Dirichlet mixture is somewhat slower. This could be because we always have to keep track of the latest word-topic assignment history for each mixture component, and we have to copy a huge “z” matrix at every sampling iteration

• The Dirichlet mixture has a high initial #topics/word value. This is because we have to do an initial random topic assignment for each mixture component separately

• In contrast, there is almost no speed difference between a single logistic normal and the mixture of logistic normals. This is because we directly sample each person-specific topic distribution from the corresponding prior (see the sketch below)
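
A minimal sketch of why the mixture adds almost no cost here, assuming NumPy and hypothetical parameters `mu`, `sigma`: sampling a person's topic distribution is one Gaussian draw plus a softmax, regardless of how many mixture components exist.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_theta(c, mu, sigma):
    """Person-specific topic distribution drawn directly from logistic
    normal mixture component c: eta ~ N(mu[c], sigma[c]), theta = softmax(eta).
    The cost does not depend on the number of components."""
    eta = rng.multivariate_normal(mu[c], sigma[c])
    eta -= eta.max()                    # numerical stability for softmax
    theta = np.exp(eta)
    return theta / theta.sum()

# Hypothetical parameters: C=5 components, K=100 topics
C, K = 5, 100
mu = rng.normal(size=(C, K))
sigma = np.stack([0.1 * np.eye(K)] * C)
theta = sample_theta(2, mu, sigma)      # nonnegative, sums to 1
```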

Page 6: Topic modeling experiments  benchmark and simple evaluations

Running time comparison 2-1. (1000 topics)

[Chart placeholder: running time (sec) vs. number of posts]

Using 4 datasets of different sizes; the mixture models assume 5 mixture components for both the Dirichlet and the Logistic Normal priors

To be added tomorrow

Page 7: Topic modeling experiments  benchmark and simple evaluations

Running time comparison 2-2. details (1000 topics)

The complete set of running times will be available tomorrow

Page 8: Topic modeling experiments  benchmark and simple evaluations

Running time comparison 2-3. analysis (1000 topics)

• Running time for the single and the mixture Dirichlet models is almost exactly proportional to both the number of topics and the size of the dataset

• The Logistic Normal prior (both single and mixture) shows a modest additional growth with the number of topics: a 10x increase in topic count yields roughly a 14-15x increase in running time. With respect to dataset size, the growth is linear.

Page 9: Topic modeling experiments  benchmark and simple evaluations

Person/topic modeling quality

We have two ways of presenting the user analysis:

• 1. Showing a topic distribution heatmap for the set of people

• 2. Projecting each person's K-dimensional vector into 2 dimensions (x-y coordinates) using multidimensional scaling (a minimal sketch follows)
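
A minimal sketch of the MDS projection, assuming scikit-learn and a hypothetical person-topic matrix `theta`:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(50), size=104)   # hypothetical (people x topics)

# Plain Euclidean dissimilarity is the simplest choice for a sketch;
# a symmetric divergence between distributions would also be reasonable.
mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
coords = mds.fit_transform(theta)              # (104, 2) x-y coordinates per person
```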

Page 10: Topic modeling experiments  benchmark and simple evaluations

Multidimensional Scaling for 104 people with a Dirichlet prior

[MDS plots: 50 topics | 500 topics | 1000 topics]

Page 11: Topic modeling experiments  benchmark and simple evaluations

Multidimensional Scaling for 104 people with a mixture of Dirichlet priors

[MDS plots: 50 topics | 500 topics | 1000 topics]

Page 12: Topic modeling experiments  benchmark and simple evaluations

Multidimensional Scaling for 104 people with a Logistic Normal prior

[MDS plots: 50 topics | 500 topics | 1000 topics]

Page 13: Topic modeling experiments  benchmark and simple evaluations

Multidimensional Scaling for 104 people with a mixture of Logistic Normal priors

[MDS plots: 50 topics | 500 topics | 1000 topics]

Page 14: Topic modeling experiments  benchmark and simple evaluations

Topic distribution heatmap of 50 topics (Dirichlet). Users are sorted according to the k-means result with K=10 (see the sorting sketch below).

[Heatmaps: single prior | mixture of priors]
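
A minimal sketch of the row ordering described in the caption, assuming scikit-learn and a hypothetical `theta` matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(50), size=104)   # hypothetical (users x 50 topics)

# Cluster users into K=10 groups, then sort rows by cluster id so that
# similar users appear as contiguous bands in the heatmap.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(theta)
order = np.argsort(labels, kind="stable")
heatmap_rows = theta[order]                    # rows ready for plotting
```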

Page 15: Topic modeling experiments  benchmark and simple evaluations

Topic distribution heatmap of 50 topics (Logistic Normal). Users are sorted according to the k-means result with K=10.

[Heatmaps: single prior | mixture of priors]

Page 16: Topic modeling experiments  benchmark and simple evaluations

Analysis – Dirichlet

• Since we are using a large number of topics for a relatively small dataset (109 people), the Dirichlet prior tends to assign nearly identical weights to most topics, and because of the normalization method we are using, just a few topics tend to dominate each person's vector (the mixture model seems to alleviate this issue a little)

Page 17: Topic modeling experiments  benchmark and simple evaluations

Analysis – Logistic Normal

• Since we are using a large number of topics for a relatively small dataset (109 people), the Dirichlet prior tends to assign nearly identical weights to most topics, and because of the normalization method we are using, just a few topics tend to dominate each person's vector (the mixture model seems to alleviate this issue a little)

Page 18: Topic modeling experiments  benchmark and simple evaluations

Issues

• Person topic distribution estimation: with the (mixture of) Dirichlet we can easily get a set of document-topic distributions for a person, and we then need to estimate the person's overall topic distribution from them. Simply taking the product of the vectors suffers from numerical underflow. Right now we are using Rong's log-exponentiate ratio method, but this method can assign excessively high weights to topics that already have relatively high weights. Is there any alternative that might preserve the true topic distribution better? (A log-space sketch follows.)
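
For comparison, a minimal log-space sketch of the product-of-distributions estimate. This is a standard log-sum-exp normalization, not necessarily Rong's log-exponentiate ratio method, whose details are not given here:

```python
import numpy as np
from scipy.special import logsumexp

def person_topic_dist(doc_thetas, eps=1e-12):
    """Product of a person's document-topic distributions, computed in
    log space so small probabilities do not underflow.
    doc_thetas: (D, K) array of per-document topic distributions."""
    log_prod = np.log(doc_thetas + eps).sum(axis=0)   # unnormalized log product
    return np.exp(log_prod - logsumexp(log_prod))     # normalize to sum to 1
```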

Page 19: Topic modeling experiments  benchmark and simple evaluations

Extracted topics with Dirichlet

[Topic lists: Single Dirichlet with K=1000 | Dirichlet mixture with K=1000 | Single Dirichlet with K=500]

Page 20: Topic modeling experiments  benchmark and simple evaluations

Extracted topics with Logistic Normal

[Topic lists: Single LN with K=500 | Single LN with K=500 | LN mixture with K=1000]

Page 21: Topic modeling experiments  benchmark and simple evaluations

Topic quality analysis

• Both priors find some interesting and clean topics with K = 500 or K = 1,000

• Still subjective, but the logistic normal seems somewhat better than the Dirichlet (generally frequent words seem to appear in fewer topics)

• Also, the logistic normal seems to assign relatively lower tf-idf scores to common words than the Dirichlet does (one plausible reading of this score is sketched below)
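
The slides do not define the exact tf-idf variant used. A minimal sketch of one plausible reading, where tf is a word's probability inside a topic and idf down-weights words that are over-represented in many topics:

```python
import numpy as np

def topic_tfidf(phi):
    """tf-idf-style score for words within topics (an assumed formula,
    not necessarily the authors'). phi: (K, V) topic-word distributions."""
    K, V = phi.shape
    df = (phi > 1.0 / V).sum(axis=0)   # topics where the word is over-represented
    idf = np.log(K / (1.0 + df))
    return phi * idf                   # (K, V); common words score low
```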

Page 22: Topic modeling experiments  benchmark and simple evaluations

Next steps (in addition to parallelism)?

• Explore a way to improve theta estimation with Dirichlet?

• Build community-based data with demographic information?