CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 7. Topic Extraction

Jan 03, 2016


CSC 594 Topics in AI – Text Mining and Analytics

Fall 2015/16

7. Topic Extraction

Text Topics

• Word association, represented by concept links, is useful in understanding the relationships between terms (as concepts).

• The same idea can be applied to understand the associations among documents that relate to a topic.

Problems with “Term as Topic”

• Using a single term to define a topic is problematic.
  – Lack of expressive power
    • Can only represent simple topics
    • Cannot represent complicated topics
  – Incomplete vocabulary coverage
    • Cannot capture variations of vocabulary (e.g. related terms)
  – Word ambiguity
    • Many words have more than one meaning/sense.

Multiple Terms as Topic

• A solution is to use multiple terms to define a topic.
  – Topic = {word1, word2, ...}
  – A weight assigned to each term represents the importance/relevance of the term in the topic.
  – Every document in the corpus can be given a score that represents the strength of its association with a topic.
  – A document can contain zero, one, or many topics.
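The idea of a topic as a set of weighted terms, with a per-document association score, can be sketched as follows. This is only an illustration: the topic name, the term weights, and the scoring rule (weighted term counts divided by document length) are assumptions, not something the slides specify.

```python
from collections import Counter

# Hypothetical topic: term -> weight (illustrative values, not from the slides).
sports_topic = {"soccer": 0.5, "play": 0.3, "kids": 0.2}


def topic_score(text, topic):
    """Sum the topic weights of the document's tokens, normalized by document length.

    The normalization is an assumed choice; other scoring rules are possible.
    """
    tokens = [t.strip(".,!?\"").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(weight * counts[term] for term, weight in topic.items()) / len(tokens)


docs = ["I love iPad.", "Kids love to play soccer."]
scores = [topic_score(d, sports_topic) for d in docs]
```

A document with none of the topic's terms scores zero, which matches the point that a document can contain zero, one, or many topics.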

Approach (1): Probabilistic Topic Mining

Coursera, Text Mining and Analytics, ChengXiang Zhai

Topic as Word Distribution

Coursera, Text Mining and Analytics, ChengXiang Zhai

Probabilistic Topic Mining

Coursera, Text Mining and Analytics, ChengXiang Zhai

Techniques for Probabilistic Topic Mining

• Several techniques have been used in probabilistic topic mining to extract topics.
  – Maximum Likelihood
  – Bayesian
  – Mixture Model (where parameters are typically estimated using the Expectation-Maximization (EM) algorithm)

Mixture Model for Topic Extraction (1)

Coursera, Text Mining and Analytics, ChengXiang Zhai

Mixture Model for Topic Extraction (2)

Coursera, Text Mining and Analytics, ChengXiang Zhai

Mixture Model as a Generative Model

Coursera, Text Mining and Analytics, ChengXiang Zhai

Mixture of Two Unigram Language Models

Coursera, Text Mining and Analytics, ChengXiang Zhai
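For reference, a mixture of two unigram language models is commonly written as below. The notation is an assumption and is not taken from the slide text: θ_d is the topic model, θ_B is the background model, c(w,d) is the count of word w in document d, and V is the vocabulary.

```latex
% Assumed notation (not from the slide text):
%   \theta_d : topic unigram model,  \theta_B : background unigram model
%   c(w,d)   : count of word w in document d,  V : vocabulary
\begin{align*}
p(w) &= p(\theta_d)\,p(w \mid \theta_d) + p(\theta_B)\,p(w \mid \theta_B),
  \qquad p(\theta_d) + p(\theta_B) = 1 \\
\log p(d) &= \sum_{w \in V} c(w,d)\,
  \log\bigl[\,p(\theta_d)\,p(w \mid \theta_d) + p(\theta_B)\,p(w \mid \theta_B)\,\bigr]
\end{align*}
```

Each word is generated by first choosing one of the two models and then drawing the word from that model's distribution, so the word probability is a weighted sum of the two.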


Expectation-Maximization (EM) Algorithm

Coursera, Text Mining and Analytics, ChengXiang Zhai
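A minimal sketch of EM for a two-component mixture of unigram models, assuming the background model and the mixing weight are fixed and only the topic word distribution is estimated. The function name, the toy counts, and the background probabilities are illustrative assumptions, not taken from the slides.

```python
def em_topic_model(counts, background, p_background=0.5, iterations=50):
    """Estimate p(w | topic) for one document under a two-component mixture.

    counts:      word -> count in the document
    background:  word -> p(w | background), assumed known and nonzero for every word
    """
    vocab = list(counts)
    p_topic = 1.0 - p_background
    # Start from a uniform topic distribution.
    topic = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(iterations):
        # E-step: probability that each word occurrence came from the topic model.
        resp = {}
        for w in vocab:
            from_topic = p_topic * topic[w]
            from_background = p_background * background[w]
            resp[w] = from_topic / (from_topic + from_background)
        # M-step: re-estimate the topic model from the counts allocated to it.
        total = sum(counts[w] * resp[w] for w in vocab)
        topic = {w: counts[w] * resp[w] / total for w in vocab}
    return topic


# Toy document and background model (hypothetical values).
counts = {"the": 4, "text": 2, "mining": 2}
background = {"the": 0.8, "text": 0.1, "mining": 0.1}
topic = em_topic_model(counts, background)
```

Because the background model already explains most occurrences of a common word such as "the", the estimated topic distribution shifts its probability toward the content words.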


Approach (2): Dimensionality Reduction for Topic Extraction

• Reduced dimensions can also be considered topics.
• Singular Value Decomposition (SVD) derives eigenvectors (SVD dimensions / principal components), which serve as topics.

D1: “I love iPad.”
D2: “iPad is great for kids.”
D3: “Kids love to play soccer.”
D4: “I play soccer at OSU.”
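The four example documents can be run through an SVD with NumPy as a rough illustration. The tokenization, the use of raw term counts rather than TF*IDF, and the choice of k = 2 dimensions are assumptions made for this sketch.

```python
import numpy as np

docs = [
    "I love iPad.",
    "iPad is great for kids.",
    "Kids love to play soccer.",
    "I play soccer at OSU.",
]


def tokenize(text):
    """Lowercase and strip trailing periods (a deliberately simple tokenizer)."""
    return [t.strip(".").lower() for t in text.split()]


# Build a term-by-document count matrix A (rows: terms, columns: documents).
vocab = sorted({t for d in docs for t in tokenize(d)})
index = {t: i for i, t in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for t in tokenize(d):
        A[index[t], j] += 1

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                     # number of dimensions kept as topics (assumed)
term_topic = U[:, :k]                     # coordinates of each term in the SVD space
doc_topic = (np.diag(s[:k]) @ Vt[:k]).T   # coordinates of each document in the SVD space
```

Each column of `term_topic` is one SVD dimension. Terms with large absolute coordinates in a column are the ones that characterize that dimension when it is read as a topic.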

Example: Topics extracted by SAS Enterprise Miner for the Yelp data

• Term topic weight – relevance of the term in the topic
  – Each term is assigned a weight corresponding to each topic.
  – Since each topic is an SVD dimension, the term topic weights for a term are the coordinates of the term in the SVD space.
  – The Term cutoff is used to determine whether a term belongs to a topic.

• Document topic weight – relevance of the document to the topic
  – Every document in the corpus is assigned a weight corresponding to each topic.
  – The document topic weight of a document towards a topic is the normalized sum of the TF*IDF weights for each term in the document multiplied by their term topic weights.
  – The Document cutoff is used to determine whether a document belongs to a topic.
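The weight and cutoff definitions above can be sketched as follows. The slides do not say how the sum is normalized or how the cutoffs are compared, so this sketch assumes division by the Euclidean norm of the document's TF*IDF vector and a comparison on absolute values. All numbers and names are hypothetical, and this is not SAS Enterprise Miner's actual implementation.

```python
import math


def document_topic_weight(tfidf, term_topic_weights):
    """Sum of (TF*IDF weight x term topic weight), divided by the document's TF*IDF norm.

    The choice of normalization is an assumption; the slides only say "normalized".
    """
    raw = sum(w * term_topic_weights.get(term, 0.0) for term, w in tfidf.items())
    norm = math.sqrt(sum(w * w for w in tfidf.values()))
    return raw / norm if norm > 0 else 0.0


# Hypothetical values for one topic and one document.
term_topic_weights = {"soccer": 0.5, "play": 0.25, "ipad": -0.1}
tfidf = {"soccer": 3.0, "play": 4.0}
term_cutoff = 0.2
doc_cutoff = 0.3

# A term belongs to the topic if its weight clears the term cutoff (assumed: absolute value).
topic_terms = sorted(t for t, w in term_topic_weights.items() if abs(w) >= term_cutoff)

# A document belongs to the topic if its weight clears the document cutoff.
weight = document_topic_weight(tfidf, term_topic_weights)
doc_in_topic = abs(weight) >= doc_cutoff
```

Applying the two cutoffs turns the continuous weights into yes/no membership for terms and documents.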

Interpretability of Extracted Topics

• A topic as a collection of weighted terms provides precise information about the topic.

• However, some analysts find binary topics, where a term or document either belongs to a topic or does not, easier to understand.