CSC 594 Topics in AI – Text Mining and Analytics, Fall 2015/16, 7. Topic Extraction
Jan 03, 2016
Text Topics
• Word association, represented by concept links, is useful for understanding the relationships between terms (as concepts).
• The same idea can be applied to understand the association between documents and a topic.
Problems with “Term as Topic”
• Using a single term to define a topic is problematic.
  – Lack of expressive power
    • Can only represent simple topics
    • Cannot represent complicated topics
  – Incomplete vocabulary coverage
    • Cannot capture variations of vocabulary (e.g. related terms)
  – Word ambiguity
    • Many words have more than one meaning/sense.
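The vocabulary coverage problem can be seen in a small sketch. The documents and the helper matches_term below are made up for illustration; they only show that a single term such as "sports" misses documents that are clearly about the same topic but use related words.

```python
# Hypothetical mini-corpus: the first three documents are about sports.
docs = [
    "the team won the soccer match",
    "basketball playoffs start tonight",
    "sports news and scores",
    "the new smartphone has a great camera",
]

def matches_term(doc, term):
    """Return True if the exact term appears as a token in the document."""
    return term in doc.lower().split()

# Using the single term "sports" as the topic finds only one of the
# three sports documents, because related terms are not covered.
single_term_hits = [matches_term(d, "sports") for d in docs]
```

Only the third document is matched, even though the first two are also about sports.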
Multiple Terms as Topic
• A solution is to use multiple terms to define a topic.
  – Topic = {word1, word2, ...}
  – A weight assigned to each term represents the importance/relevance of the term in the topic.
  – Every document in the corpus can be given a score that represents the strength of its association with a topic.
  – A document can contain zero, one, or many topics.
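A minimal sketch of a topic as weighted terms, and of scoring documents against it. The topic, its weights, and the scoring rule (weighted term counts divided by document length) are assumptions chosen for illustration, not a method prescribed by the slides.

```python
from collections import Counter

# Hypothetical topic: each term carries a weight for its relevance to the topic.
sports_topic = {"soccer": 0.5, "play": 0.3, "team": 0.2}

def topic_score(doc, topic):
    """Sum the topic weights of the document's tokens, normalized by length."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    weighted = sum(weight * counts[term] for term, weight in topic.items())
    return weighted / len(tokens)

docs = ["kids love to play soccer", "i love ipad"]
scores = [topic_score(d, sports_topic) for d in docs]
```

The first document gets a positive score because it contains "play" and "soccer"; the second scores zero, so it does not contain this topic.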
Approach (1): Probabilistic Topic Mining
Coursera, Text Mining and Analytics, ChengXiang Zhai
Topic as Word Distribution
Probabilistic Topic Mining
Techniques for Probabilistic Topic Mining
• Several techniques have been used in probabilistic topic mining to extract topics.
  – Maximum Likelihood
  – Bayesian
  – Mixture Model (where parameters are typically estimated using the Expectation-Maximization (EM) algorithm)
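As an illustration of the first technique, the maximum likelihood estimate of a unigram word distribution is just each word's relative frequency in the text. The function name ml_unigram and the whitespace tokenization are assumptions made for this sketch.

```python
from collections import Counter

def ml_unigram(text):
    """Maximum likelihood unigram model: p(w) = count(w) / total tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

theta = ml_unigram("text mining text analytics")
```

Here "text" accounts for half of the tokens, so it receives probability 0.5, and the probabilities sum to 1.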
Mixture Model for Topic Extraction (1)
Mixture Model for Topic Extraction (2)
Mixture Model as a Generative Model
Mixture of Two Unigram Language Models
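The slide figures for the mixture of two unigram language models are not reproduced in this text. As a sketch, the usual formulation can be written as follows, assuming a topic model \theta_d, a background model \theta_B, word counts c(w, d) in document d, and vocabulary V; these symbols are assumptions, not notation taken from the slides.

```latex
% Probability of a word under the two-component mixture
p(w) = p(\theta_d)\, p(w \mid \theta_d) + p(\theta_B)\, p(w \mid \theta_B),
\qquad p(\theta_d) + p(\theta_B) = 1

% Log-likelihood of a document d
\log p(d) = \sum_{w \in V} c(w, d)\,
  \log\bigl[\, p(\theta_d)\, p(w \mid \theta_d) + p(\theta_B)\, p(w \mid \theta_B) \,\bigr]
```

Each word is generated by first choosing one of the two models and then drawing the word from that model's distribution.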
Expectation-Maximization (EM) Algorithm
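A minimal sketch of EM for a mixture of a topic model and a background model. It assumes the background distribution and the mixing weight are known and fixed, and that only the topic word distribution is estimated; the function em_topic, the parameter lam, and the toy data are hypothetical.

```python
from collections import Counter

def em_topic(counts, background, lam=0.5, iters=50):
    """Estimate the topic word distribution p(w | theta_d) by EM.

    counts: word -> count in the document
    background: word -> p(w | theta_B), assumed known, fixed, and
        defined for every word in counts
    lam: p(theta_B), the assumed probability of picking the background model
    """
    vocab = list(counts)
    # Start from a uniform topic distribution.
    topic = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(iters):
        # E-step: probability that an occurrence of w came from the topic model.
        from_topic = {}
        for w in vocab:
            t = (1 - lam) * topic[w]
            b = lam * background[w]
            from_topic[w] = t / (t + b)
        # M-step: re-estimate the topic from the fractional counts.
        total = sum(counts[w] * from_topic[w] for w in vocab)
        topic = {w: counts[w] * from_topic[w] / total for w in vocab}
    return topic

counts = Counter("the text the mining the text".split())
background = {"the": 0.8, "text": 0.1, "mining": 0.1}
topic = em_topic(counts, background)
```

Although "the" is the most frequent word in the toy document, the background model explains most of its occurrences, so the estimated topic gives more probability to "text" than to "the".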
Approach (2): Dimensionality Reduction for Topic Extraction
• Reduced dimensions can also be considered topics.
• Singular Value Decomposition (SVD) derives eigenvectors (SVD dimensions/principal components), which are treated as topics.
D1: “I love iPad.”
D2: “iPad is great for kids.”
D3: “Kids love to play soccer.”
D4: “I play soccer at OSU.”
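A sketch of how SVD dimensions can be derived for the four example documents, using NumPy's numpy.linalg.svd on a raw term-by-document count matrix. The tokenization, the use of raw counts rather than weighted frequencies, and the choice of k = 2 dimensions are assumptions for illustration.

```python
import numpy as np

docs = [
    "I love iPad",
    "iPad is great for kids",
    "Kids love to play soccer",
    "I play soccer at OSU",
]
tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Term-by-document count matrix: rows are terms, columns are documents.
A = np.array(
    [[doc.count(term) for doc in tokenized] for term in vocab],
    dtype=float,
)

# Thin SVD: A = U @ diag(S) @ Vt, with singular values in descending order.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of SVD dimensions kept as topics (arbitrary choice here)
term_coords = U[:, :k] * S[:k]      # one row per term, one column per topic
doc_coords = Vt[:k, :].T * S[:k]    # one row per document, one column per topic
```

Each row of term_coords gives a term's coordinates in the reduced space, and each row of doc_coords does the same for a document; terms and documents with large coordinates on the same dimension are associated with that topic.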
Example: Topics extracted by SAS Enterprise Miner for the Yelp data
• Term topic weight – relevance of the term in the topic
  – Each term is assigned a weight corresponding to each topic.
  – Since each topic is an SVD dimension, the term topic weights for a term are the coordinates of the term in the SVD space.
  – The Term cutoff is used to determine whether a term belongs to a topic.
• Document topic weight – relevance of the document to the topic
  – Every document in the corpus is assigned a weight corresponding to each topic.
  – The document topic weight of a document towards a topic is the normalized sum of the TF*IDF weights for each term in the document multiplied by their term topic weights.
  – The Document cutoff is used to determine whether a document belongs to a topic.
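The weights and cutoffs described above can be sketched as follows. All numbers are made up, and two details are assumptions because the slides do not specify them: the sum is normalized by the Euclidean length of the document's TF*IDF vector, and the cutoffs are compared against absolute weights. This is not SAS Enterprise Miner's actual computation.

```python
import math

# Hypothetical term topic weights for one topic (coordinates on one SVD dimension).
term_topic_weights = {"soccer": 0.6, "play": 0.5, "kids": 0.3, "ipad": 0.05}
TERM_CUTOFF = 0.2       # assumed value
DOCUMENT_CUTOFF = 0.3   # assumed value

def document_topic_weight(tfidf, weights):
    """Normalized sum of TF*IDF weight times term topic weight over a document's terms."""
    raw = sum(value * weights.get(term, 0.0) for term, value in tfidf.items())
    # Assumed normalization: Euclidean length of the document's TF*IDF vector.
    norm = math.sqrt(sum(value * value for value in tfidf.values()))
    return raw / norm if norm else 0.0

# Hypothetical TF*IDF weights for two documents.
doc_tfidf = {
    "D3": {"kids": 0.7, "love": 0.7, "play": 0.7, "soccer": 0.7},
    "D1": {"love": 0.7, "ipad": 0.7},
}

# A term or document belongs to the topic if its weight reaches the cutoff.
topic_terms = {t for t, w in term_topic_weights.items() if abs(w) >= TERM_CUTOFF}
doc_weights = {d: document_topic_weight(v, term_topic_weights) for d, v in doc_tfidf.items()}
topic_docs = {d for d, w in doc_weights.items() if abs(w) >= DOCUMENT_CUTOFF}
```

With these values, "soccer", "play", and "kids" pass the term cutoff, and only D3 passes the document cutoff.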
Interpretability of Extracted Topics
• A topic as a collection of weighted terms provides precise information about the topic.
• However, some analysts find binary topics, in which a term or document either belongs to a topic or does not, easier to understand.