CSC 594 Topics in AI – Text Mining and Analytics, Fall 2015/16: 7. Topic Extraction

Posted on 03-Jan-2016

Text Topics

• Word association, represented by concept links, is useful for understanding the relationships between terms (as concepts).

• The same idea can be applied to understand how documents are associated with a topic.

Problems with “Term as Topic”

• Using a single term to define a topic is problematic.
  – Lack of expressive power
    • Can only represent simple topics
    • Cannot represent complicated topics
  – Incomplete vocabulary coverage
    • Cannot capture variations of vocabulary (e.g. related terms)
  – Word ambiguity
    • Many words have more than one meaning/sense.

Multiple Terms as Topic

• A solution is to use multiple terms to define a topic.
  – Topic = {word1, word2, ...}
  – A weight assigned to each term represents the importance/relevance of the term in the topic.
  – Every document in the corpus can be given a score that represents the strength of its association with a topic.
  – A document can contain zero, one, or many topics.
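The idea of a topic as a set of weighted terms, with a per-document association score, can be sketched as follows. This is only an illustration: the topic, its weights, the whitespace tokenizer, and the length-normalized scoring rule in `topic_score` are assumptions and do not come from the slides.

```python
from collections import Counter

# Hypothetical topic: term -> weight (illustrative values only).
topic = {"soccer": 0.6, "play": 0.3, "kids": 0.1}


def topic_score(text, topic):
    """Sum the topic weights of the tokens in `text`, divided by its length.

    The scoring rule is an assumption made for this sketch.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Counter returns 0 for terms that do not occur in the document.
    return sum(counts[term] * weight for term, weight in topic.items()) / len(tokens)


docs = ["kids love to play soccer", "i love ipad"]
scores = [topic_score(d, topic) for d in docs]
for d, s in zip(docs, scores):
    print(f"{s:.2f}  {d}")
```

A document that shares no terms with the topic scores zero, which matches the point that a document can contain zero, one, or many topics.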

Approach (1): Probabilistic Topic Mining

Coursera, Text Mining and Analytics, ChengXiang Zhai 5

Topic as Word Distribution

Probabilistic Topic Mining

Techniques for Probabilistic Topic Mining

• Several techniques have been used in probabilistic topic mining to extract topics.
  – Maximum Likelihood
  – Bayesian
  – Mixture Model (where parameters are typically estimated using the Expectation-Maximization (EM) algorithm)

Mixture Model for Topic Extraction (1)

Mixture Model for Topic Extraction (2)

Mixture Model as a Generative Model

Mixture of Two Unigram Language Models

Expectation-Maximization (EM) Algorithm
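As a rough illustration of EM for a mixture of two unigram language models, the sketch below assumes a known background word distribution and a fixed mixing probability, and estimates only the topic word distribution. The function name `em_topic`, the toy word counts, the background probabilities, and the value `p_bg=0.5` are all assumptions made for this example; the slide content itself is not reproduced here.

```python
from collections import Counter


def em_topic(counts, background, p_bg=0.5, iters=50):
    """Estimate a topic word distribution with EM.

    Each word occurrence is assumed to come from the background model with
    probability p_bg, and from the unknown topic model otherwise.
    """
    vocab = list(counts)
    topic = {w: 1.0 / len(vocab) for w in vocab}  # uniform start
    for _ in range(iters):
        # E-step: probability that each word was generated by the topic model.
        post = {}
        for w in vocab:
            from_topic = (1.0 - p_bg) * topic[w]
            from_bg = p_bg * background[w]
            post[w] = from_topic / (from_topic + from_bg)
        # M-step: re-estimate the topic model from the fractional counts.
        total = sum(counts[w] * post[w] for w in vocab)
        topic = {w: counts[w] * post[w] / total for w in vocab}
    return topic


# Toy document and a hypothetical background model dominated by "the".
counts = Counter("the text mining the text the data the".split())
background = {"the": 0.7, "text": 0.1, "mining": 0.1, "data": 0.1}

topic = em_topic(counts, background)
for word, prob in sorted(topic.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {prob:.3f}")
```

Although "the" is the most frequent word in the toy document, the background model explains most of its occurrences, so the estimated topic model places more probability on "text".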

Approach (2): Dimensionality Reduction for Topic Extraction

• Reduced dimensions can also be considered topics.
• Singular Value Decomposition (SVD) derives eigenvectors (SVD dimensions / principal components), which are treated as topics.

D1: “I love iPad.”
D2: “iPad is great for kids.”
D3: “Kids love to play soccer.”
D4: “I play soccer at OSU.”
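The four example documents can be run through an SVD to see what reduced dimensions look like in practice. The sketch below is a minimal illustration using NumPy; the tokenizer, the use of raw term counts without stop-word removal or weighting, and the choice of k = 2 dimensions are assumptions, not the procedure used in the course.

```python
import re

import numpy as np

docs = [
    "I love iPad.",
    "iPad is great for kids.",
    "Kids love to play soccer.",
    "I play soccer at OSU.",
]

# Simple lowercase tokenizer (an assumption; no stop-word removal).
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
index = {t: i for i, t in enumerate(vocab)}

# Term-document matrix of raw counts: rows are terms, columns are documents.
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokenized):
    for t in doc:
        A[index[t], j] += 1

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of SVD dimensions kept as "topics" (arbitrary choice)
term_coords = U[:, :k] * s[:k]    # coordinates of each term in SVD space
doc_coords = Vt[:k, :].T * s[:k]  # coordinates of each document in SVD space

for dim in range(k):
    top = np.argsort(-np.abs(U[:, dim]))[:4]
    print(f"dimension {dim}:", [vocab[i] for i in top])
```

Each row of `term_coords` gives one term's coordinates in the reduced space, and each row of `doc_coords` does the same for one document.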

Example: Topics extracted by SAS Enterprise Miner for the Yelp data

• Term topic weight – relevance of the term in the topic
  • Each term is assigned a weight for each topic.
  • Since each topic is an SVD dimension, the term topic weights for a term are the coordinates of the term in the SVD space.
  • The Term cutoff is used to determine whether a term belongs to a topic.

• Document topic weight – relevance of the document to the topic
  • Every document in the corpus is assigned a weight for each topic.
  • The document topic weight of a document for a topic is the normalized sum, over the terms in the document, of each term's TF*IDF weight multiplied by its term topic weight.
  • The Document cutoff is used to determine whether a document belongs to a topic.
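The weight and cutoff definitions above can be sketched numerically. Everything concrete in this example is an assumption: the TF*IDF values, the term topic weights, and both cutoff values are made up, the normalization (dividing by the length of the document's TF*IDF vector) is one plausible reading of "normalized", and comparing absolute weights against the cutoff is likewise a guess. It is not a description of what SAS Enterprise Miner actually computes.

```python
import numpy as np

terms = ["ipad", "kids", "soccer", "play"]

# Hypothetical TF*IDF weights: rows are documents, columns are terms.
tfidf = np.array([
    [1.2, 0.0, 0.0, 0.0],
    [0.9, 0.7, 0.0, 0.0],
    [0.0, 0.6, 0.8, 0.8],
])

# Hypothetical term topic weights: rows are terms, columns are topics.
term_topic = np.array([
    [0.9, 0.0],
    [0.3, 0.4],
    [0.0, 0.8],
    [0.0, 0.7],
])

term_cutoff = 0.25  # illustrative value
doc_cutoff = 0.5    # illustrative value

# Sum over terms of (TF*IDF weight) * (term topic weight), per document and topic.
raw = tfidf @ term_topic

# Assumed normalization: divide by the length of each document's TF*IDF vector.
doc_topic = raw / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Binary membership obtained by applying the cutoffs.
term_in_topic = np.abs(term_topic) >= term_cutoff
doc_in_topic = np.abs(doc_topic) >= doc_cutoff

print(np.round(doc_topic, 3))
print(doc_in_topic)
```

With these made-up numbers, the first two documents fall into the first topic and the third document into the second, while the term "kids" belongs to both topics.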

Interpretability of Extracted Topics

• A topic as a collection of weighted terms provides precise information about the topic.

• However, some analysts find binary topics, in which a term or document either belongs to a topic or does not, easier to understand.
