Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis Damian Trilling & Jeroen Jonkman [email protected] @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam WAPOR, Buenos Aires, 16–19 June 2015

Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Aug 06, 2015

Transcript
Page 1: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Damian Trilling & Jeroen Jonkman

[email protected] @damian0604

www.damiantrilling.net

Afdeling Communicatiewetenschap, Universiteit van Amsterdam

WAPOR, Buenos Aires, 16–19 June 2015

Page 2: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Overview Problems Sample implementation: INFRA Empirical example Conclusions

Automated framing analysis

Deductive

• simple: word lists and search strings

• advanced: supervised machine learning

Inductive

• word frequencies and co-occurrences

• visualizations

• principal component analysis

• cluster analysis

• latent Dirichlet allocation

• . . .

The inductive approach is the focus of our study.

Packing and Unpacking the Bag of Words Trilling & Jonkman
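
The simplest inductive technique listed on this slide, counting word frequencies and co-occurrences, can be sketched in a few lines of Python. This is a minimal illustration and not code from INFRA: the function name, the toy corpus, and the choice to count a word pair once per document are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def count_words_and_cooccurrences(documents):
    """Return (word_freq, pair_freq) for a list of tokenized documents.

    word_freq counts every token occurrence; pair_freq counts, for each
    unordered word pair, the number of documents that contain both words.
    """
    word_freq = Counter()
    pair_freq = Counter()
    for tokens in documents:
        word_freq.update(tokens)
        # Use the distinct words so that a pair counts once per document.
        for pair in combinations(sorted(set(tokens)), 2):
            pair_freq[pair] += 1
    return word_freq, pair_freq


# Toy corpus, invented for illustration only.
docs = [
    ["bank", "crisis", "bank"],
    ["bank", "profit"],
    ["crisis", "bank"],
]
word_freq, pair_freq = count_words_and_cooccurrences(docs)
```

Pairs are stored as alphabetically sorted tuples, so `pair_freq[("bank", "crisis")]` is the number of documents mentioning both words. Such a table is also the raw material for a co-occurrence network.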

Page 6: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Methodological issues

What constitutes a frame, and how does this translate into an operationalization?

• Is a frame fundamentally different from a (sub-)topic? (⇒ topic modeling)

• Do we expect each element to occur in one and only one frame? (⇒ PCA)

• Do we need to distinguish between actors, actions, . . ., or are all words taken into consideration equally?

• . . .

Page 10: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Practical issues

• no standard software (but: more and more R packages and Python modules)

• reliance on inaccessible, self-written, or proprietary software

• lack of knowledge in the field

• size of the datasets

Page 11: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

A catalogue of criteria

A toolkit for automated framing analysis should . . .

1 not depend on commercial software

2 run on all major operating systems

3 be scalable: usable on a laptop, but also on powerful servers to analyze millions of documents

4 be flexible and open: adaptable to one's own needs

5 have a powerful database engine in the background

Page 12: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Design

Sample implementation: INFRA

To meet these criteria, we wrote INFRA in Python, using the NoSQL database MongoDB. The toolkit will be made freely available, both as source code and via a web interface.

Page 13: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

[Figure: flowchart of the INFRA workflow, divided into a data management phase and an analysis phase. Boxes, in order of appearance: Data (e.g., LexisNexis articles); Import filter; NoSQL database; Cleaning and pre-processing filters; Cleaned NoSQL database; word frequencies and co-occurrences; log likelihood; visualizations; define details for analysis (e.g., important actors); dictionary filter/named entity recognition; latent Dirichlet allocation; principal component analysis; cluster analysis.]
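
The workflow figure names "log likelihood" without further detail. One common reading, assumed here, is the log-likelihood (G²) keyness statistic, which compares how often a word occurs in two corpora of different sizes. The sketch below is not taken from INFRA, and the function and parameter names are hypothetical.

```python
import math


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """G2 statistic for one word observed in two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of tokens in each corpus (must be > 0).
    """
    total = size_a + size_b
    # Expected frequencies if the word were equally likely in both corpora.
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    g2 = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_a > 0:
        g2 += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        g2 += freq_b * math.log(freq_b / expected_b)
    return 2 * g2
```

A word with the same relative frequency in both corpora scores 0. The larger the value, the more strongly the word distinguishes one corpus from the other.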

Page 14: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Central storage

The data management phase is handled on the server; analyses can be handled either on the server (SSH) or locally (INFRA).

[Figure: External data; MongoDB server; Computer 1, Computer 2, Computer 3, Computer 4]

Server: Linux VM with MongoDB server; clients: Python, INFRA, mongo client

Page 15: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Enjoying the advantages of BOW (bag of words) and overcoming its shortcomings

In the preprocessing phase

• all information is still available

• we can use custom regexp-based rules and filters, e.g.: if a text contains [list of synonyms of A] and [list of synonyms of B], replace [synonyms of A] with C

• extremely useful for unifying actors that are referred to in several ways

In the analysis phase

• work with a much faster dataset that contains only the necessary information

• no need to deal with misspellings and variations any more
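
The conditional rule described on this slide (if a text contains a synonym of A and a synonym of B, replace the synonyms of A with C) could look roughly as follows. This is a sketch under assumptions, not INFRA's actual rule engine: the function name, the example patterns, and the replacement token are invented for illustration.

```python
import re


def conditional_substitute(text, pattern_a, pattern_b, replacement):
    """Replace matches of pattern_a with replacement, but only if the
    text also contains at least one match of pattern_b."""
    if re.search(pattern_b, text, flags=re.IGNORECASE):
        return re.sub(pattern_a, replacement, text, flags=re.IGNORECASE)
    return text


# Hypothetical rule: treat "Shell" as the company only in texts that
# also mention oil or gas.
SYNONYMS_A = r"\b(?:royal dutch shell|shell)\b"
SYNONYMS_B = r"\b(?:oil|gas)\b"

recoded = conditional_substitute(
    "Shell reports higher oil production.",
    SYNONYMS_A, SYNONYMS_B, "royal_dutch_shell",
)
untouched = conditional_substitute(
    "She found a shell on the beach.",
    SYNONYMS_A, SYNONYMS_B, "royal_dutch_shell",
)
```

Here `recoded` starts with the unified token `royal_dutch_shell`, while `untouched` is returned unchanged because the condition pattern does not match.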

Page 18: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Towards a “best practice” of inductive framing analysis

In the data management phase

• spend much time on recoding relevant multi-word entities to avoid noise (of course, “Barack” and “Obama” occur together) and on recoding synonyms (how would you otherwise reliably estimate frequencies?)
⇒ especially important for questions like “how is actor X framed?”

• regular expressions instead of simple word lists!

• make an informed decision on how to harmonize the dataset (stopword removal, stemming (?), POS tagging (?))

And: share these procedures!
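
Recoding multi-word entities and synonyms with regular expressions, as recommended above, can be organized as an ordered table of rules that map every variant to one canonical token. The rules below are hypothetical examples written for this sketch and are not part of the authors' rule set.

```python
import re

# Hypothetical recoding rules: (compiled pattern, canonical token).
# Rules are applied in order, so more specific patterns come first.
RULES = [
    (re.compile(r"\b(?:president\s+)?barack\s+obama\b", re.IGNORECASE), "barack_obama"),
    (re.compile(r"\b(?:president|mr\.?)\s+obama\b", re.IGNORECASE), "barack_obama"),
]


def recode(text, rules=RULES):
    """Apply every (pattern, token) rule to the text, in order."""
    for pattern, token in rules:
        text = pattern.sub(token, text)
    return text


example = recode("President Barack Obama met Mr. Obama's advisers.")
```

After recoding, both mentions collapse into the single token `barack_obama`, so a later bag-of-words step counts one actor instead of two unrelated words. Keeping the rules in a plain list also makes them easy to share.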

Page 19: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Towards a “best practice” of inductive framing analysis

In the analysis phase

• background knowledge necessary (face validity)

• robustness: do slightly different parameters deliver similar results?

• dataset too small ⇒ sensitivity to atypical events (scandals etc.) ⇒ discovering a topic rather than a frame

• difference between statistical predictive power and meaningfulness
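
One simple way to approach the robustness question is to compare the top terms of the frames found in two runs with slightly different parameters. The sketch below uses Jaccard similarity for that comparison. This is an assumption chosen for illustration, not a procedure described in the talk, and the two runs are invented toy data.

```python
def jaccard(set_a, set_b):
    """Share of terms that two sets have in common (1.0 = identical)."""
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)


def best_match_overlap(run_a, run_b):
    """For each frame (set of top terms) in run_a, return the highest
    Jaccard similarity with any frame in run_b."""
    return [max(jaccard(a, b) for b in run_b) for a in run_a]


# Toy output of two hypothetical model runs.
run_1 = [{"bank", "crisis", "loan"}, {"profit", "share", "dividend"}]
run_2 = [{"profit", "share", "growth"}, {"bank", "crisis", "debt"}]
overlaps = best_match_overlap(run_1, run_2)
```

Values close to 1 suggest that a frame reappears almost unchanged in the other run. Low values flag frames that depend heavily on the parameter choice.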

Page 20: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Empirical example: Dutch business news

Page 21: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Steps

Preprocessing steps

1 Ingest and parse all possibly relevant articles (≈ 500 000)

2 Compose a list of ≈ 1 500 regular expressions that substitute synonyms and word combinations so that actors are coded correctly, allowing for conditional substitutions

3 Remove stopwords, punctuation, etc.

4 Determine part of speech; keep only nouns, adjectives, and adverbs
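
Step 3 of the preprocessing can be illustrated with the standard library alone. The stopword list below is a tiny invented sample and the function name is hypothetical; part-of-speech filtering (step 4) needs a tagger for Dutch and is left out of this sketch.

```python
import re

# Tiny illustrative Dutch stopword list; a real list would be much longer.
STOPWORDS = {"de", "het", "een", "en", "van", "in"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase the text, drop punctuation, and remove stopwords."""
    # \w+ keeps letters, digits, and underscores, so recoded actor
    # tokens such as "royal_dutch_shell" survive as single tokens.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


tokens = preprocess("De winst van het bedrijf stijgt, en snel!")
```

The resulting token list is what would be stored in the cleaned database and passed on to the analysis phase.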

Page 22: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Analysis steps

1 Determine relevant actors with frequency counts, filtering out all non-Dutch words (alternative: named entity recognition)

2 Conduct PCA, cluster analysis, and LDA; additionally, count the frequency of actor mentions

3 Fine-tune, repeat, and choose the final model

Page 23: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Output

Example: Attention over time

Overview of news attention: attention to 100 firms in company news and entropy (red line) from 2007 to 2013.
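
The slide does not say how the entropy line is computed. Assuming it is the Shannon entropy of the distribution of mentions across firms in a given period, a minimal sketch could look like this (the function name and the base-2 logarithm are choices made for the example):

```python
import math


def attention_entropy(mention_counts):
    """Shannon entropy (in bits) of a list of per-firm mention counts."""
    total = sum(mention_counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in mention_counts:
        if count > 0:  # firms without mentions contribute nothing
            share = count / total
            entropy -= share * math.log2(share)
    return entropy


evenly_spread = attention_entropy([10, 10, 10, 10])
concentrated = attention_entropy([40, 0, 0, 0])
```

Under this reading, high entropy means that attention is spread evenly over many firms, while an entropy near zero means that a single firm dominates the news in that period.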

Page 24: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Example: Topics

Results of a topic model

Page 25: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Example: Components

Results of a principal component analysis

Page 26: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Example: Co-occurrences

Network visualization of co-occurrences

Page 27: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Conclusions

• We developed a toolkit that integrates all recent methods used for automated inductive framing analysis

• It is free

• It works with large-scale datasets

• It can be used by a whole group together

Page 28: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Next steps

• Regarding the tool: graphical interface

• Regarding the method: systematic validation study; comparing different approaches and settings

Page 29: Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Questions?

[email protected] @damian0604

www.damiantrilling.net