Topic Modeling NATHAN MILLER

Feb 18, 2017

Transcript
Page 1: Topic Modeling

Topic Modeling
NATHAN MILLER

Page 2: Topic Modeling

Uses

Document Summarization

Machine Translation

Named Entity Recognition

Natural Language Understanding/ Generation

Optical Character Recognition

Part-of-speech Tagging

Sentiment Analysis

Topic Segmentation

Page 3: Topic Modeling

Text Mining Terms

Corpus – a defined grouping of similar documents
◦ All Fairy Tales

Document – user-defined body of text
◦ Cinderella

Term – a word in a document
◦ Slipper

Term: Slipper
Document: Cinderella
Corpus: All Fairy Tales

Page 4: Topic Modeling

Pre-Processing
CLEANING, TOKENIZING, STEMMING, TDM/DTM

Page 5: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.

“Well,” said the wolf, “then I'll huff and I'll puff and I'll blow your straw house in.”

Hansel left a trail of crumbs behind him to mark the way.

Corpus

Document

Page 6: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.


Tokenization

• The
• quick
• brown
• fox
• jumps
• over
• the
• lazy
• dog;
• then,
• Foxy
• Cow
• jumped
• over
• the
• moon

bag-o-words
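The tokenization above can be sketched in a few lines of Python. This is a minimal illustration, assuming plain whitespace splitting; that is why punctuation stays attached to tokens such as "dog;" and "then,", as on the slide. The variable names are hypothetical.

```python
from collections import Counter

text = ("The quick brown fox jumps over the lazy dog; "
        "then, Foxy Cow jumped over the moon.")

# Split on whitespace only; no cleaning has been applied yet,
# so punctuation remains part of the neighbouring token.
tokens = text.split()

# Bag of words: token -> count. Word order is discarded.
bag = Counter(tokens)

print(tokens)
print(bag.most_common(3))
```

Note that "The" and "the" are counted separately here; lowercasing would belong to the cleaning step.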

Page 7: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.


Tokenization

• The quick
• quick brown
• brown fox
• fox jumps
• jumps over
• over the
• the lazy
• lazy dog
• dog; then,
• then, Foxy
• Foxy Cow
• Cow jumped
• jumped over
• over the
• the moon.
• moon.

n-grams
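A sliding window over the token list is one simple way to produce n-grams like the pairs above. The helper name `ngrams` is hypothetical, and the sketch again assumes whitespace tokenization with punctuation left in place.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = ("The quick brown fox jumps over the lazy dog; "
        "then, Foxy Cow jumped over the moon.")
tokens = text.split()

bigrams = ngrams(tokens, 2)   # pairs of adjacent tokens
trigrams = ngrams(tokens, 3)  # triples of adjacent tokens

print(bigrams)
```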

Page 8: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.


Stop Word Removal

• quick
• brown
• fox
• jumps
• lazy
• dog
• Foxy
• Cow
• jumped
• moon

feature selection
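A rough sketch of stop word removal for the example sentence follows. The stop list `STOP_WORDS` is a hypothetical three-word list chosen only to reproduce the slide's result; real stop lists are much longer. Stripping punctuation is an assumed cleaning step.

```python
import string

# Hypothetical, deliberately tiny stop list for this one sentence.
STOP_WORDS = {"the", "over", "then"}

text = ("The quick brown fox jumps over the lazy dog; "
        "then, Foxy Cow jumped over the moon.")

# Cleaning: strip leading/trailing punctuation from each token.
tokens = [t.strip(string.punctuation) for t in text.split()]

# Keep a token only if its lowercase form is not a stop word.
kept = [t for t in tokens if t and t.lower() not in STOP_WORDS]

print(kept)
```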

Page 9: Topic Modeling

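TF-IDF: Term Frequency-Inverse Document Frequency

Term Frequency: tf(t,d)
raw frequency: Frequency of a term t in a document d.

Document Frequency: df(t,D)
Number of docs d (within the corpus D) in which a term t appears.

Inverse Document Frequency: idf(t,D)
$\mathrm{idf}(t, D) = \log \frac{N}{\mathrm{df}(t, D)}$, where N = |D| is the number of docs d within the corpus D.

$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$

Stop Word Removal
One method of determining stop words in a corpus: a term that appears in nearly every document has an idf close to 0.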
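The quantities on this slide can be computed directly from token lists. The sketch below assumes the common definition idf = log(N / document frequency), with raw counts for tf; the toy corpus and the function names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document id -> list of tokens.
corpus = {
    "doc1": ["the", "quick", "brown", "fox", "the", "lazy", "dog"],
    "doc2": ["the", "wolf", "huff", "puff", "the", "house"],
    "doc3": ["hansel", "left", "the", "trail"],
}

def tf(term, tokens):
    """Raw frequency of a term in one document."""
    return Counter(tokens)[term]

def idf(term, corpus):
    """log(N / df), where df is the number of documents containing the term."""
    df = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_id, corpus):
    return tf(term, corpus[doc_id]) * idf(term, corpus)

for term in ("the", "fox"):
    print(term, tf_idf(term, "doc1", corpus))
```

Here "the" appears in every document, so its idf (and therefore its TF-IDF) is 0, which is how the score can flag stop-word candidates.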

Page 10: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.


Stemming

• quick
• brown
• fox
• jump
• lazy
• dog
• Fox
• Cow
• jump
• moon
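As a very rough illustration of stemming, the sketch below strips a few common suffixes. It is a toy, not the Porter stemmer or any library's algorithm; the `stem` function and its suffix list are hypothetical. It maps "jumps" and "jumped" to "jump", but it leaves "Foxy" unchanged, so reducing "Foxy" to "Fox" as on the slide would need an extra rule or a lookup table.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "s")

def stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

words = ["quick", "brown", "fox", "jumps", "lazy",
         "dog", "Foxy", "Cow", "jumped", "moon"]
stems = [stem(w) for w in words]

print(stems)
```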

Page 11: Topic Modeling

The quick brown fox jumps over the lazy dog; then, Foxy Cow jumped over the moon.

DTM/TDM

Term     Doc 1    Doc 2    …    Doc n
quick    1        0        …    0
brown    1        0        …    0
fox      2        0        …    1
jump     2        1        …    0
lazy     1        0        …    0
dog      1        0        …    0
cow      1        1        …    1
…        …        …        …    …
moon     1        0        …    0

Document-Term Matrix/ Term-Document Matrix
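A count matrix like the one above can be built from stemmed token lists. In this sketch the first document reproduces the Doc 1 column of the table; the second document is hypothetical filler. The DTM has one row per document, and the TDM is its transpose.

```python
from collections import Counter

docs = [
    # Stemmed, lowercased tokens of the example sentence (Doc 1 in the table).
    ["quick", "brown", "fox", "jump", "lazy", "dog", "fox", "cow", "jump", "moon"],
    # Hypothetical second document.
    ["wolf", "huff", "puff", "blow", "straw", "house"],
]

vocab = sorted({term for doc in docs for term in doc})
counts = [Counter(doc) for doc in docs]

# Document-term matrix: rows are documents, columns are terms.
dtm = [[c[term] for term in vocab] for c in counts]

# Term-document matrix: rows are terms, columns are documents.
tdm = [list(row) for row in zip(*dtm)]

for term, row in zip(vocab, tdm):
    print(f"{term:<6} {row}")
```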

Page 12: Topic Modeling

Topic Modeling
K-MEANS, LATENT DIRICHLET ALLOCATION (LDA)

Page 13: Topic Modeling

see also https://en.wikipedia.org/wiki/K-means_clustering

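K-Means

Minimize this function:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

$k$: number of clusters

$x$: an observation; a term

$\mu_i$: mean (centroid) of a cluster

$\sum_{x \in S_i}$: every word in a cluster

$\sum_{i=1}^{k}$: iterates through k clusters

$\sum_{x \in S_i} \lVert x - \mu_i \rVert^2$ = Within-Cluster Sum of Squares: SST within a particular cluster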
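The function being minimized is the within-cluster sum of squares added up over all k clusters. The sketch below evaluates it for a fixed clustering, assuming observations are numeric tuples; the `wcss` helper and the sample points are hypothetical.

```python
def wcss(clusters):
    """Sum over clusters of squared distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centroid: coordinate-wise mean of the cluster's points.
        centroid = [sum(p[j] for p in points) / len(points) for j in range(dim)]
        total += sum(
            sum((p[j] - centroid[j]) ** 2 for j in range(dim)) for p in points
        )
    return total

# Two hypothetical clusters of 2-D observations.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],   # centroid (0, 1)
    [(5.0, 5.0), (7.0, 5.0)],   # centroid (6, 5)
]

print(wcss(clusters))
```

K-means searches for the assignment of observations to clusters that makes this value as small as possible.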

Page 14: Topic Modeling

Within-Cluster Sum of Squares

Page 15: Topic Modeling

K-means Learning

1. Randomly pick k points (“means”)

2. Assign each observation to nearest “mean”

3. Calculate mean (centroid) of each cluster

4. Repeat steps 2 and 3 until convergence
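The four steps can be sketched in plain Python as follows. This is a minimal illustration, assuming numeric tuples as observations, squared Euclidean distance, and "no assignment changed" as the convergence test; the function name, the seed, and the sample points are hypothetical.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    means = rng.sample(points, k)          # Step 1: pick k points as initial means
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each observation to the nearest mean.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, means[c])))
            for p in points
        ]
        if new_assignment == assignment:   # Step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 3: recompute the mean (centroid) of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep the old mean if a cluster is empty
                means[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return means, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, assignment = kmeans(points, k=2)

print(means)
print(assignment)
```

The result depends on the random starting points in general; on well-separated data like this sample, the two obvious groups are recovered.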

Page 17: Topic Modeling

See Topic Models by David Blei and John Lafferty: https://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

LDA: Latent Dirichlet Allocation

“Hidden” Topical Structure <= Posterior Distribution P(“hidden” variables | observed words)

1. Uncover the hidden topical patterns (topics)

2. Annotate the documents according to those topics

3. Use the annotations to organize, summarize and/or search the texts

Page 18: Topic Modeling

http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf

Page 19: Topic Modeling

LDA Learning

1. Iterate through all words in a document and randomly assign each word to a topic

2. Calculate A. P(topic|document) and B. P(word|topic)

3. Reassign each word to a topic using P(topic|document)*P(word|topic) = Probability that a topic “generated” a word
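The three steps can be sketched as a simplified Gibbs-style sampler. This is an illustration under stated assumptions, not a full LDA implementation: the smoothing constants `alpha` and `beta`, the number of sweeps, the function name, and the toy documents are all hypothetical, and each word is temporarily removed from the counts before it is reassigned.

```python
import random

def lda_sketch(docs, n_topics, n_sweeps=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]    # words in doc d assigned to topic t
    topic_word = [{} for _ in range(n_topics)]    # times word w is assigned to topic t
    topic_total = [0] * n_topics                  # total words assigned to topic t

    def update(d, w, t, delta):
        doc_topic[d][t] += delta
        topic_word[t][w] = topic_word[t].get(w, 0) + delta
        topic_total[t] += delta

    # Step 1: randomly assign every word occurrence to a topic.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            update(d, w, z[d][i], +1)

    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, z[d][i], -1)         # exclude the current word
                # Step 2: P(topic|document) * P(word|topic), up to a constant
                # (the document-length denominator is the same for every topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Step 3: reassign the word, sampling in proportion to the weights.
                z[d][i] = rng.choices(range(n_topics), weights=weights)[0]
                update(d, w, z[d][i], +1)
    return z, doc_topic, topic_word

docs = [
    ["fox", "dog", "fox", "moon"],
    ["wolf", "straw", "house", "wolf"],
    ["fox", "wolf", "dog", "house"],
]
z, doc_topic, topic_word = lda_sketch(docs, n_topics=2)
print(doc_topic)
```

Dividing each row of `doc_topic` by the document length gives a rough estimate of P(topic|document) for that document.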

Page 20: Topic Modeling


Page 21: Topic Modeling

Posterior Distribution


Page 22: Topic Modeling
Page 23: Topic Modeling

Text Mining

Document Summarization

Machine Translation

Named Entity Recognition

Natural Language Understanding/ Generation

Optical Character Recognition

Part-of-speech Tagging

Sentiment Analysis

Topic Segmentation
◦ K-means
◦ LDA