Dynamic Topic Modeling via Non-negative Matrix Factorization Derek Greene University College Dublin

Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Jan 23, 2018


Sebastian Ruder
Transcript
Page 1: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Dynamic Topic Modeling via Non-negative Matrix Factorization

Derek Greene University College Dublin

Page 2: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Overview

• Topic Modeling
  • Non-negative Matrix Factorization
  • Dynamic Topic Modeling

• Proposed Approach
  • Dynamic Topic Modeling via Non-negative Matrix Factorization

• Application
  • Topic Modeling European Parliamentary Speeches

September 2016 2

Page 3: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Topic Modeling


• Goal: Discover hidden thematic structure in a corpus of text (e.g. tweets, Facebook posts, news articles, political speeches).

• Unsupervised approach, no prior annotation required.

[Figure: pipeline from Input through Data Preparation and Topic Modeling Algorithm to Output (Topic 1, Topic 2, ..., Topic k)]

• Output of topic modeling is a set of k topics. Each topic has:
  1. A descriptor, based on highest-ranked terms for the topic.
  2. Membership weights for all documents relative to the topic.

Page 4: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Topic Modeling with NMF

• Non-negative Matrix Factorization (NMF): Family of linear algebra algorithms for identifying the latent structure in data represented as a non-negative matrix (Lee & Seung, 1999).

• NMF can be applied for topic modeling, where the input is a document-term matrix, typically TF-IDF normalized.
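As a rough illustration of the input described above, the following sketch builds a small TF-IDF weighted document-term matrix in plain Python. The toy documents, the function name `tfidf_matrix`, and the particular weighting (raw term count times log(n / df), followed by L2 row normalization) are assumptions made for illustration; the slides do not say which TF-IDF variant was used.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] is the TF-IDF weight
    of term j in document i. Rows are L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        matrix.append([w / norm if norm > 0 else 0.0 for w in row])
    return vocab, matrix

# Hypothetical toy corpus (not from the slides).
docs = [
    "health patient disease",
    "budget finance banking",
    "health budget research",
]
vocab, A = tfidf_matrix(docs)
```

Because every weight is a count times a non-negative logarithm, the resulting matrix is non-negative, which is what NMF requires of its input.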

• Input: Document-term matrix A; User-specified number of topics k.

• Output: Two k-dimensional factors W and H approximating A.

[Figure: NMF factorizes the input matrix A (n documents x m terms) into factor W (n documents x k topics) and factor H (k topics x m terms).]
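A minimal sketch of the factorization described above, using multiplicative update rules for the Frobenius objective, one common way to compute NMF. The slides do not state which solver was used, so the update rule, the function name `nmf`, the iteration count, and the random toy matrix are all assumptions for illustration.

```python
import numpy as np

def nmf(A, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate non-negative A (n x m) as W (n x k) @ H (k x m)
    using multiplicative updates for the Frobenius norm."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        # Update H, then W; eps avoids division by zero.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical 6 x 10 non-negative matrix standing in for a document-term matrix.
rng = np.random.default_rng(42)
A = rng.random((6, 10))
W, H = nmf(A, k=3)
error = np.linalg.norm(A - W @ H)  # Frobenius reconstruction error
```

Since the updates only multiply by non-negative ratios, W and H stay non-negative when initialized with non-negative values. The result depends on the random initialization, so different seeds can give different factors.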

Page 5: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Example: NMF Topic Modeling

• Apply standard NMF to document-term matrix A (6 rows x 10 columns) for k=3 topics…

[Figure: Document-term matrix A. Rows: document 1 to document 6. Columns: research, stem, education, disease, patient, health, budget, finance, banking, bonds.]

Page 6: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Example: NMF Topic Modeling

[Figure: Factor H, weights for terms. Rows: research, stem, education, disease, patient, health, budget, finance, banking, bonds. Columns: Topic 1, Topic 2, Topic 3.]

[Figure: Factor W, weights for documents. Rows: document 1 to document 6. Columns: Topic 1, Topic 2, Topic 3.]
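To show how the two factors are read, the sketch below ranks the terms in each topic of H to form a descriptor and picks the highest-weighted topic in W for each document. The numeric weights are invented for illustration (the values in the original figures are not recoverable), and the helper names `topic_descriptors` and `dominant_topics` are hypothetical.

```python
def topic_descriptors(H, terms, top=3):
    """For each topic (row of H), return the `top` highest-weighted terms."""
    descriptors = []
    for weights in H:
        ranked = sorted(range(len(terms)), key=lambda j: weights[j], reverse=True)
        descriptors.append([terms[j] for j in ranked[:top]])
    return descriptors

def dominant_topics(W):
    """For each document (row of W), return the index of its highest-weighted topic."""
    return [max(range(len(row)), key=lambda t: row[t]) for row in W]

terms = ["research", "stem", "education", "disease", "patient",
         "health", "budget", "finance", "banking", "bonds"]

# Hypothetical factor values, invented for illustration only.
H = [
    [0.9, 0.7, 0.8, 0.0, 0.0, 0.1, 0.1, 0.0, 0.0, 0.0],  # topic 1
    [0.1, 0.0, 0.0, 0.8, 0.9, 0.7, 0.0, 0.0, 0.0, 0.0],  # topic 2
    [0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.6, 0.9, 0.8, 0.7],  # topic 3
]
W = [
    [0.8, 0.1, 0.0],  # document 1
    [0.7, 0.0, 0.2],  # document 2
    [0.0, 0.9, 0.1],  # document 3
    [0.1, 0.8, 0.0],  # document 4
    [0.0, 0.0, 0.9],  # document 5
    [0.2, 0.0, 0.7],  # document 6
]

descriptors = topic_descriptors(H, terms)
# [['research', 'education', 'stem'], ['patient', 'disease', 'health'],
#  ['finance', 'banking', 'bonds']]
assignments = dominant_topics(W)
# [0, 0, 1, 1, 2, 2]
```

Taking only the single highest-weighted topic per document is a simplification; the weights in W can also be kept as soft memberships.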

Page 7: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Dynamic Topic Models

• Standard topic modeling approaches assume the order of documents does not matter. Not suitable for time-stamped data.

• Dynamic topic modeling: Approaches to track how language changes and topics evolve over time in a time-stamped corpus.

[Figure: "Inaugural address" example (D. Blei, 2012)]

Page 8: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Dynamic Topic Modeling via Non-negative Matrix Factorization

Page 9: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Proposed Approach

• Two-Level approach: Link together related topics found in different time windows to track topics over time.

Rank   Topic in Window 1   Topic in Window 2   Topic in Window 3
1      eurozone            greece              greece
2      greece              debt                russia
3      imf                 germany             debt
4      loan                reparations         eu
5      debt                eu                  loan

• Divide corpus into 𝜏 time windows of equal duration (e.g. days, weeks, months, quarters, or years).

• Level 1: Apply NMF topic modeling to documents in each window to produce window topics.

• Level 2: Apply another layer of NMF to all topics from Level 1 to find dynamic topics which span multiple time windows.
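The first step above, dividing a time-stamped corpus into windows, can be sketched as follows. Quarterly windows are assumed here as one of the durations the slide lists; the function names `quarter_label` and `split_into_windows` and the sample documents are hypothetical.

```python
from collections import defaultdict
from datetime import date

def quarter_label(d):
    """Map a date to a label such as '2003-Q1'."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

def split_into_windows(documents):
    """Group (date, text) pairs by quarter, returned in chronological order."""
    windows = defaultdict(list)
    for d, text in documents:
        windows[quarter_label(d)].append(text)
    # Labels sort chronologically as strings for four-digit years.
    return dict(sorted(windows.items()))

# Hypothetical time-stamped documents.
documents = [
    (date(2003, 2, 12), "speech about iraq"),
    (date(2003, 3, 20), "speech about war"),
    (date(2003, 4, 2), "speech about budget"),
    (date(2002, 12, 1), "speech about fisheries"),
]
windows = split_into_windows(documents)
```

Each value in `windows` is then the document set to which Level 1 NMF would be applied independently.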

Page 10: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Proposed Approach

• Key Idea for Level 2:
  • View the topic basis vectors (rows of factor H) found in each time window as “topic documents”.
  • Construct a new combined representation from these H factors. Similar to idea of “stacking” in supervised ensembles.
  • Apply NMF to this new representation.

[Figure: 𝜏 time window datasets produce 𝜏 NMF H factors (Factor H from Window 1, Window 2, Window 3, ..., Window 𝜏), which are combined into a topic-term matrix with n’ topic documents and m’ terms.]
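One way to build the combined representation is to align every window's H factor to a shared vocabulary and stack the rows. The sketch below assumes each window has its own term list, fills missing terms with zero, and keeps all weights; the slides do not say whether weights are truncated or rescaled, so treat these choices, the function name `stack_window_topics`, and the toy values as assumptions.

```python
def stack_window_topics(window_results):
    """Combine per-window H factors into one topic-term matrix.

    window_results: list of (terms, H) pairs, where H has one row per
    window topic and one column per entry in that window's `terms`.
    Returns (all_terms, labels, B) with one row of B per window topic.
    """
    all_terms = sorted({t for terms, _ in window_results for t in terms})
    col = {t: j for j, t in enumerate(all_terms)}
    labels, B = [], []
    for w, (terms, H) in enumerate(window_results, start=1):
        for i, topic in enumerate(H, start=1):
            row = [0.0] * len(all_terms)
            # Place each weight in the column of the shared vocabulary.
            for term, weight in zip(terms, topic):
                row[col[term]] = weight
            B.append(row)
            labels.append(f"Window{w}-{i:02d}")
    return all_terms, labels, B

# Hypothetical H factors for two windows with different vocabularies.
window_results = [
    (["health", "patient", "budget"], [[0.9, 0.8, 0.0], [0.0, 0.1, 0.7]]),
    (["health", "disease", "finance"], [[0.8, 0.6, 0.0], [0.0, 0.0, 0.9]]),
]
all_terms, labels, B = stack_window_topics(window_results)
```

Here B has n’ = 4 rows (topic documents) and m’ = 5 columns (the union of the window vocabularies).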

Page 11: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Example: Dynamic Topic Modeling

Topic-term matrix for 2 time window results, each with 3 topics.

[Figure: Topic-term matrix heatmap. Rows: Window1-01, Window1-02, Window1-03 (topics for time window 1) and Window2-01, Window2-02, Window2-03 (topics for time window 2). Columns: health, patient, disease, citizen, research, education, budget, finance, banking.]
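The Level 2 step on a matrix of this shape might look like the sketch below: NMF is applied to the stacked topic-term matrix, each window topic is assigned to the dynamic topic where it has the largest weight, and each dynamic topic is described by its top terms. The weights in `B` are invented (they are not read off the heatmap), and the multiplicative-update solver, k = 3, and the hard assignment rule are assumptions for illustration.

```python
import numpy as np

def nmf(B, k, n_iter=500, seed=0, eps=1e-9):
    """Multiplicative-update NMF: B (n' x m') ~ U (n' x k) @ V (k x m')."""
    rng = np.random.default_rng(seed)
    U = rng.random((B.shape[0], k))
    V = rng.random((k, B.shape[1]))
    for _ in range(n_iter):
        V *= (U.T @ B) / (U.T @ U @ V + eps)
        U *= (B @ V.T) / (U @ V @ V.T + eps)
    return U, V

terms = ["health", "patient", "disease", "citizen", "research",
         "education", "budget", "finance", "banking"]
labels = ["Window1-01", "Window1-02", "Window1-03",
          "Window2-01", "Window2-02", "Window2-03"]

# Hypothetical weights, invented for illustration only.
B = np.array([
    [0.9, 0.8, 0.7, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.2, 0.9, 0.8, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9, 0.8, 0.7],
    [0.8, 0.9, 0.6, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.8, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.9, 0.8],
])

U, V = nmf(B, k=3)
# Assign each window topic to the dynamic topic with the largest weight.
assignment = {label: int(np.argmax(U[i])) for i, label in enumerate(labels)}
# Describe each dynamic topic by its three highest-weighted terms.
descriptors = [[terms[j] for j in np.argsort(V[t])[::-1][:3]] for t in range(3)]
```

With data this cleanly separated, window topics that share terms across the two windows would be expected to land in the same dynamic topic, although the outcome depends on the random initialization.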

Page 12: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Application: European Parliament

Collaboration with Dr. James Cross, UCD School of Politics & International Relations

Page 13: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Exploring the European Parliament Agenda

• Directly elected parliamentary institution of the EU.

• 8th term began in July 2014.

• 751 Members of European Parliament (MEPs) from 28 member states.

• 12 plenary sessions per year are held in Strasbourg.

• During sessions, members may speak after being called by the President. Speaking time available to MEPs is strictly limited.

• MEPs use speeches to state their positions on policies, to explain votes, and to demonstrate to their electorates that they are representing their interests in Europe.

Page 14: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Data Collection

• In Autumn 2014 we collected ~400k records from EuroParl.

• Covers activities of MEPs in the European Parliament during terms 5-7 (1999-2014).

• Focus on records of speeches in plenary, which account for 54.3% of all EuroParl records.

http://europarl.europa.eu

Page 15: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Data Collection

• Original corpus contains 269,696 plenary speeches.

• Identified subset of 210,247 English language speeches, either native or translated.

• Divided these into 60 “time window” datasets. Each time window is a quarter from 1999-Q3 to 2014-Q2.

[Figure: Number of speeches per time window. x-axis: Time Window (Quarter Number); y-axis: Number of Speeches.]

Page 16: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Time Window Topic Modeling

• Applied NMF to document-term matrix for the speeches in each of the 60 time windows.

• Used an automated topic coherence approach to choose the number of topics k for each window (O’Callaghan et al., 2015).

➡ Output: 60 sets of time window topics.

Page 17: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Time Window Topic Modeling

Example Topic: 2003-Q1


Top 10 terms suggest that this topic relates to the Iraq war.

Top 10 speeches for this topic provide the context.

Page 18: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Dynamic Topic Modeling Results

• Applying dynamic topic modeling to the resulting topic-term matrix with parameter selection yields 57 dynamic topics, which show the varied nature of the European Parliament’s agenda…

Page 19: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Example: Climate Change

[Figure: Number of speeches over time for the climate change dynamic topic. x-axis: Year (2000 to 2014); y-axis: Number of Speeches (0 to 600). Annotations: Climate Change Package, Cancun, Copenhagen, Montreal.]

Page 20: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Example: Financial & Euro Crisis

[Figure: Number of speeches over time. x-axis: Year (2000 to 2014); y-axis: Number of Speeches (0 to 1200). Series: Financial crisis, Euro crisis. Marked points: A, B, C, D.]

Page 21: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Dynamic Topics by Politician

We associate each MEP with dynamic topics based on the number of the MEP's speeches that are associated with each dynamic topic's window topics.

[Figure: Top 10 most relevant dynamic topics for Pat Cox (Ireland).]
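The counting described on this slide could be sketched as below. It assumes each speech has already been linked to one window topic and each window topic to one dynamic topic; the speaker names, topic labels, and the function `topics_by_speaker` are hypothetical, and the slides do not specify how ties or multi-topic speeches are handled.

```python
from collections import Counter, defaultdict

def topics_by_speaker(speeches, window_to_dynamic):
    """Count, per speaker, the speeches falling under each dynamic topic.

    speeches: iterable of (speaker, window_topic_label) pairs.
    window_to_dynamic: maps a window topic label to a dynamic topic name.
    """
    counts = defaultdict(Counter)
    for speaker, window_topic in speeches:
        dynamic = window_to_dynamic.get(window_topic)
        if dynamic is not None:
            counts[speaker][dynamic] += 1
    return counts

# Hypothetical assignments, invented for illustration.
window_to_dynamic = {
    "2009-Q4-03": "climate change",
    "2010-Q4-07": "climate change",
    "2010-Q2-01": "euro crisis",
}
speeches = [
    ("MEP A", "2009-Q4-03"),
    ("MEP A", "2010-Q4-07"),
    ("MEP A", "2010-Q2-01"),
    ("MEP B", "2010-Q2-01"),
]
counts = topics_by_speaker(speeches, window_to_dynamic)
top_for_a = counts["MEP A"].most_common(10)  # up to 10 most relevant topics
# [('climate change', 2), ('euro crisis', 1)]
```

The same counts grouped by the speaker's member state instead of by speaker would give a per-country view.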

Page 22: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

Dynamic Topics by Country

[Figure: Dynamic topics by country. Panels: Ireland, Cyprus.]

Page 23: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

More Information

European Parliament Speeches - Topic Explorer
http://erdos.ucd.ie/europarl

Python Code and Documentation
https://github.com/derekgreene/dynamic-nmf

D. Greene, J. P. Cross. “Unveiling the Political Agenda of the European Parliament Plenary: A Topical Analysis”, in Proc. ACM Web Science’15, 2015.

[email protected] @derekgreene

D. Greene, J. P. Cross. “Exploring the political agenda of the European parliament using a dynamic topic modeling approach”, Political Analysis, 2017 (in press).

Page 24: Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)

References

• D. Blei, A. Y. Ng, M. Jordan. “Latent Dirichlet allocation”. Journal of Machine Learning Research, 3:993–1022, 2003.

• D. Blei. “Probabilistic topic models”. Communications of the ACM, 2012.

• D. D. Lee & H. S. Seung. “Learning the parts of objects by non-negative matrix factorization”. Nature, 401:788–91, 1999.

• D. O’Callaghan, D. Greene, J. Carthy & P. Cunningham. “An analysis of the coherence of descriptors in topic modeling”. Expert Systems with Applications (ESWA), 2015.

• W. X. Zhao et al. “Comparing twitter and traditional media using topic models”. Advances in Information Retrieval, 2011.

• J. Grimmer. “A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases”. Political Analysis, 18(1):1–35, 2010.