CSE217 INTRODUCTION TO DATA SCIENCE
Spring 2019
Marion Neumann
LECTURE 11: TOPIC MODELS
Contents in these slides may be subject to copyright. Some of these materials are derived from Michael Paul, Johns Hopkins University.

May 20, 2020


dariahiddleston



RECAP: TEXT DATA

• text does not come as numerical vectors
  • requires feature extraction
• typically we want to analyze multiple text documents → corpus of documents

Example review:

Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise.

Words highlighted on the slide: great, small, location, friends
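The feature-extraction step in this recap can be sketched as a simple bag-of-words count over the example review. This is only a minimal illustration: the helper name `bag_of_words`, the regex tokenizer, and the choice of the four highlighted words as the vocabulary are assumptions, not something the slides prescribe.

```python
import re
from collections import Counter

REVIEW = (
    "Same great flavor and friendly service as in the S 18th street location. "
    "This location is not as small but it's hard to talk to friends. "
    "Thankfully there is great outdoor seating to escape the noise."
)

# Assumed vocabulary: the four words highlighted on the slide.
VOCABULARY = ["great", "small", "location", "friends"]


def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word (hypothetical helper)."""
    # Lowercase, then keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vector = bag_of_words(REVIEW, VOCABULARY)
print(dict(zip(VOCABULARY, vector)))
```

The review becomes the numerical vector `[2, 1, 2, 1]`: "great" and "location" each occur twice, "small" and "friends" once. Note that "friendly" is a different token from "friends" under this tokenizer.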


MAKING SENSE OF TEXT

→ Suppose you want to learn something about a corpus that’s too big to read…

• What topics are trending today on Twitter?
• What research topics are most active?
• What issues are considered by Congress?
• Are certain topics discussed more in certain languages on Wikipedia?


TOPIC MODELS

• Topic models can help you automatically discover patterns in a corpus → unsupervised learning
• Topic models automatically...
  • group topically related words into topics
  • associate terms and documents with those topics

What is a topic? → a grouping of words that are likely to appear in the same context


WHAT ARE TOPIC MODELS?

1) associate words (terms) with topics, then
2) associate topics with documents
→ document summarization!


TOPIC MODELS → THE MODEL

• What are the topics and associations?
  → we need to learn them from the corpus
• we need a model first!
  → let’s model documents as a set of words being generated from a set of topics
  → generative ML model
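The generative view can be sketched in a few lines: to produce each word of a document, first pick a topic according to the document's topic mixture, then pick a word according to that topic's word distribution. The topics, words, and probabilities below are made up for illustration, and the function names are hypothetical.

```python
import random

# Hypothetical topics: each one is a probability distribution over words.
TOPICS = {
    "food": {"flavor": 0.5, "service": 0.3, "seating": 0.2},
    "place": {"location": 0.6, "street": 0.3, "noise": 0.1},
}


def sample_from(distribution, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(distribution)
    weights = [distribution[key] for key in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate_document(topic_mixture, n_words, rng):
    """Generate n_words words: pick a topic per word, then a word from it."""
    words = []
    for _ in range(n_words):
        topic = sample_from(topic_mixture, rng)
        words.append(sample_from(TOPICS[topic], rng))
    return words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = generate_document({"food": 0.8, "place": 0.2}, n_words=10, rng=rng)
print(doc)
```

Learning a topic model runs this story in reverse: only the generated words are observed, and the topics and mixtures have to be inferred.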


TOPIC MODELS → THE MODEL

• probability of each possible word:
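One common way to write this probability is as a mixture over topics. The notation here is an assumption rather than the slide's own: $\theta_{d,k}$ is the probability of topic $k$ in document $d$, $\beta_{k,w}$ is the probability of word $w$ under topic $k$, $z$ is the topic that generated the word, and $K$ is the number of topics.

```latex
p(w \mid d)
  = \sum_{k=1}^{K} p(z = k \mid d)\, p(w \mid z = k)
  = \sum_{k=1}^{K} \theta_{d,k}\, \beta_{k,w}
```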


TOPIC MODELS → LEARNING TASK

• Given: observed words in a corpus
• Task: learn what topic model has generated the data (corpus)
• this means we have to infer
  • the probability distribution over words associated with each topic,
  • the distribution over topics for each document, and
  • the topic responsible for generating each word.


TOPIC MODELS → LEARNING TASK

• Task: learn what topic model has generated the data (corpus)
  → rephrased: how likely is it that our corpus was generated by topic model A (where topic model A is defined by its parameters, the θ’s and β’s)?

this is likelihood maximization!

…we don’t even have the parameters…
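The likelihood question can be made concrete by scoring two candidate models on a toy corpus. This is a sketch under assumed notation: `theta[d][k]` is the weight of topic `k` in document `d`, and `beta[k][w]` is the probability of word `w` under topic `k`. The corpus and both parameter sets are invented for illustration.

```python
import math


def log_likelihood(corpus, theta, beta):
    """Sum of log p(w | d) over all words, with p(w | d) = sum_k theta[d][k] * beta[k][w]."""
    total = 0.0
    for d, document in enumerate(corpus):
        for word in document:
            p_word = sum(
                theta[d][k] * beta[k].get(word, 0.0) for k in range(len(beta))
            )
            total += math.log(p_word)
    return total


CORPUS = [["flavor", "service", "flavor"], ["location", "street", "location"]]

# Model A: each topic concentrates on the words of one document.
BETA_A = [{"flavor": 0.6, "service": 0.4}, {"location": 0.6, "street": 0.4}]
THETA_A = [[0.9, 0.1], [0.1, 0.9]]

# Model B: both topics are uniform over the whole vocabulary.
UNIFORM = {"flavor": 0.25, "service": 0.25, "location": 0.25, "street": 0.25}
BETA_B = [UNIFORM, UNIFORM]
THETA_B = [[0.5, 0.5], [0.5, 0.5]]

ll_a = log_likelihood(CORPUS, THETA_A, BETA_A)
ll_b = log_likelihood(CORPUS, THETA_B, BETA_B)
print(f"model A: {ll_a:.3f}, model B: {ll_b:.3f}")
```

Model A scores higher (is less negative) because its topics match how words co-occur in the documents. Likelihood maximization means searching for the parameters that score best, which is the hard part when no parameters are given up front.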


TOPIC MODELS → LEARNING TASK

• Solution:
  • Dirichlet prior → Latent Dirichlet Allocation
  • iterative algorithm

we need some more probability and statistics knowledge to derive and understand this
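The slides do not say which iterative algorithm is meant. One commonly used choice for Latent Dirichlet Allocation is collapsed Gibbs sampling, sketched below under that assumption. The function name `lda_gibbs`, the toy corpus, and the hyperparameter values `alpha` and `eta` (the pseudo-counts contributed by the Dirichlet priors) are all illustrative.

```python
import random


def lda_gibbs(corpus, n_topics, n_iters=200, alpha=0.1, eta=0.01, seed=0):
    """Collapsed Gibbs sampling sketch for LDA; returns the count tables."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in corpus for word in doc})
    n_vocab = len(vocab)

    # doc_topic[d][k]: number of words in document d assigned to topic k
    doc_topic = [[0] * n_topics for _ in corpus]
    # topic_word[k][w]: number of times word w is assigned to topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every word.
    assignments = []
    for d, doc in enumerate(corpus):
        z_doc = []
        for word in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(corpus):
            for i, word in enumerate(doc):
                # Remove this word's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1

                # Weight of topic j: (how much d likes j) * (how much j likes word).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][word] + eta)
                    / (topic_total[j] + n_vocab * eta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, assignments


CORPUS = [
    ["flavor", "service", "flavor", "seating"],
    ["location", "street", "location", "noise"],
    ["flavor", "service", "location"],
]
doc_topic, topic_word, assignments = lda_gibbs(CORPUS, n_topics=2)
print(doc_topic)
```

The three returned tables line up with the three quantities to infer: `topic_word` approximates the word distribution of each topic, `doc_topic` the topic distribution of each document, and `assignments` the topic responsible for each word.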


TOPIC MODELS – LINEAR ALGEBRA VIEW

• matrix factorization
• singular value decomposition of the word-document co-occurrence matrix

we need some more matrix algebra (linear algebra) knowledge to derive and understand this
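To give a flavor of the linear algebra view, the sketch below finds only the largest singular value and its two singular vectors of a tiny word-document count matrix. It uses power iteration as a plain-Python stand-in for a full singular value decomposition, which in practice would come from a library routine such as `numpy.linalg.svd`. The matrix, the word list, and the helper names are invented for illustration.

```python
import math

# Rows are words, columns are documents (made-up counts).
WORDS = ["flavor", "service", "location", "street"]
COUNTS = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def leading_singular_triple(matrix, n_iters=100):
    """Power iteration on A^T A: returns (sigma, u, v) for the largest singular value."""
    matrix_t = transpose(matrix)
    v = normalize([1.0] * len(matrix[0]))
    for _ in range(n_iters):
        v = normalize(mat_vec(matrix_t, mat_vec(matrix, v)))
    av = mat_vec(matrix, v)
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]
    return sigma, u, v


sigma, u, v = leading_singular_triple(COUNTS)
print(round(sigma, 3))
print({word: round(weight, 3) for word, weight in zip(WORDS, u)})
```

The word-side vector `u` puts nearly all of its weight on "flavor" and "service", and the document-side vector `v` on the first two documents. Read loosely, that is one "topic" together with the documents it is associated with.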


SUMMARY & READING

• A lot of today’s data is free-form text data → huge corpora
• Topic models provide a way of summarizing and organizing text documents.
• Instead of hard-coding topics, we learn them from a given corpus. No labels → unsupervised learning

• [DSFS]
  • Ch9 Getting Data: Scraping the Web (p108-110)
  • Ch20 Natural Language Processing: Topic Modelling (p247-252)
• Your Easy Guide to LDA
  https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
• Topic Modeling and Latent Dirichlet Allocation (LDA) in Python
  https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24