CSE217 INTRODUCTION TO DATA SCIENCE
Spring 2019, Marion Neumann
LECTURE 11: TOPIC MODELS
Contents in these slides may be subject to copyright. Some of these materials are derived from Michael Paul, Johns Hopkins University.
RECAP: TEXT DATA
• text does not come as numerical vectors → requires feature extraction
• typically we want to analyze multiple text documents → a corpus of documents
Example review (extracted terms: great, small, location, friends, …):
"Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise."
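The feature-extraction step above can be sketched with a minimal, standard-library-only bag-of-words: each document becomes a vector of word counts over a shared vocabulary. The review text is the slide's example; the tokenizer and vocabulary construction are illustrative choices, not from the slides.

```python
# Minimal bag-of-words sketch: documents -> count vectors over a vocabulary.
from collections import Counter
import re

docs = [
    "Same great flavor and friendly service as in the S 18th street location.",
    "This location is not as small but it's hard to talk to friends.",
    "Thankfully there is great outdoor seating to escape the noise.",
]

def tokenize(text):
    # crude illustrative tokenizer: lowercase, keep alphabetic runs
    return re.findall(r"[a-z']+", text.lower())

counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set(w for c in counts for w in c))
vectors = [[c[w] for w in vocab] for c in counts]  # documents x vocabulary

print(vocab[:5])
print(vectors[0][:5])
```

Each row of `vectors` is the numerical representation of one document; this is the kind of feature extraction the recap slide refers to.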
MAKING SENSE OF TEXT
→ Suppose you want to learn something about a corpus that’s too big to read…
• What topics are trending today on Twitter?
• What research topics are most active?
• What issues are considered by Congress?
• Are certain topics discussed more in certain languages on Wikipedia?
TOPIC MODELS
• Topic models can help you automatically discover patterns in a corpus → unsupervised learning
• Topic models automatically...
  • group topically-related words into topics
  • associate terms and documents with those topics
What is a topic? → a grouping of words that are likely to appear in the same context
WHAT ARE TOPIC MODELS?
1) associate words (terms) with topics, then
2) associate topics with documents
→ document summarization!
TOPIC MODELS → THE MODEL
• What are the topics and associations?
  → we need to learn them from the corpus
• we need a model first!
  → let’s model documents as a set of words being generated from a set of topics
  → a generative ML model
TOPIC MODELS → THE MODEL
• each topic assigns a probability to each possible word
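The generative story above can be sketched as a toy simulation: to generate each word of a document, first pick a topic from the document's topic mixture, then pick a word from that topic's word distribution. The two topics, their word probabilities, and the mixture below are made-up illustrative numbers, not from the slides.

```python
# Toy generative model: document = words sampled from a mixture of topics.
import random

random.seed(0)

# each topic is a probability distribution over words (illustrative numbers)
topics = {
    "food":     {"flavor": 0.5, "service": 0.3, "great": 0.2},
    "ambience": {"noise": 0.4, "seating": 0.4, "location": 0.2},
}
# this document's distribution over topics (illustrative)
theta = {"food": 0.7, "ambience": 0.3}

def sample(dist):
    # draw one key from a dict of {outcome: probability}
    return random.choices(list(dist), weights=dist.values())[0]

# generate an 8-word "document": topic first, then word from that topic
document = [sample(topics[sample(theta)]) for _ in range(8)]
print(document)
```

Running the process forward like this is easy; the learning task on the next slides is the reverse problem: given only the words, infer the topics and mixtures.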
TOPIC MODELS → LEARNING TASK
• Given: observed words in a corpus
• Task: learn what topic model has generated the data (corpus)
• this means we have to infer
  • the probability distribution over words associated with each topic,
  • the distribution over topics for each document, and
  • the topic responsible for generating each word.
TOPIC MODELS → LEARNING TASK
• Task: learn what topic model has generated the data (corpus)
  → rephrased: how likely is it that our corpus was generated by topic model A (where topic model A is defined by the parameters θ_d’s and β_k’s)?
this is likelihood maximization!
…we don’t even have the parameters…
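The "likelihood maximization" note can be made concrete. Writing θ_d for document d's distribution over topics and β_k for topic k's distribution over words (the exact indexing is an assumption here, since the subscripts were garbled in the slides), the standard mixed-membership likelihood of a corpus D is:

```latex
p(\mathcal{D} \mid \theta, \beta)
  = \prod_{d=1}^{D} \prod_{n=1}^{N_d}
    \sum_{k=1}^{K} \theta_{d,k}\,\beta_{k,\,w_{d,n}}
```

where w_{d,n} is the n-th observed word of document d, N_d is that document's length, and K is the number of topics. The inner sum marginalizes over the unknown topic assignment of each word, which is why the inference problem is hard.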
TOPIC MODELS → LEARNING TASK
• Solution:
  • Dirichlet prior → Latent Dirichlet Allocation (LDA)
  • iterative algorithm
we need some more probability and statistics knowledge to derive and understand this
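Even without deriving the algorithm, LDA can be used off the shelf. A hedged sketch with scikit-learn's LatentDirichletAllocation (which implements the Dirichlet-prior, iterative approach mentioned above); the tiny corpus and the choice of two topics are illustrative assumptions, not from the slides:

```python
# Fitting LDA on a toy corpus with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "great flavor friendly service",
    "flavor service food menu",
    "outdoor seating location noise",
    "location seating street noise",
]

X = CountVectorizer().fit_transform(docs)  # word-count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic mixtures (the theta's)
word_topics = lda.components_       # per-topic word weights (unnormalized beta's)

print(doc_topics.shape)    # (4 documents, 2 topics)
print(word_topics.shape)   # (2 topics, vocabulary size)
```

The fitted `doc_topics` rows and `word_topics` rows are exactly the two kinds of distributions the learning-task slide says we must infer.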
TOPIC MODELS – LINEAR ALGEBRA VIEW
• matrix factorization
• singular value decomposition of the word-document co-occurrence matrix
we need some more matrix algebra (linear algebra) knowledge to derive and understand this
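The linear-algebra view can be sketched with a truncated SVD of a small word-document count matrix (this factorization view is essentially latent semantic analysis). The toy matrix below is an illustrative assumption, not from the slides.

```python
# Truncated SVD of a word-document co-occurrence matrix.
import numpy as np

# rows = words, columns = documents (toy counts)
A = np.array([
    [2, 1, 0, 0],   # "flavor"
    [1, 2, 0, 0],   # "service"
    [0, 0, 2, 1],   # "seating"
    [0, 0, 1, 2],   # "noise"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                         # keep the top-k "topics"
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of A

print(np.round(A_k, 2))
```

The columns of `U` play the role of topics (word groupings), and the rows of `Vt` give each document's weights on those topics, mirroring the two associations from the earlier slides.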
SUMMARY & READING
• A lot of today’s data is free-form text data → huge corpora
• Topic models provide a way of summarizing and organizing text documents.
• Instead of hard-coding topics, we learn them from a given corpus. No labels → unsupervised learning
• Reading:
  • [DSFS] Ch9 Getting Data: Scraping the Web (pp. 108-110); Ch20 Natural Language Processing: Topic Modelling (pp. 247-252)
  • Your Easy Guide to LDA: https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
  • Topic Modeling and Latent Dirichlet Allocation (LDA) in Python: https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24