Feb 28, 2019


Introduction

ITMS

Preprocessing Data

Data Visualization

Cluster Analysis

Topic Modeling

Google Book API

Future Directions

References

Interactive Visual Data Analysis Part Two

Interactive Text Mining Suite

Olga Scrivner

Indiana University

Workshop in Methods

Outline

1 Introduce a web application for text processing and mining

2 Learn about natural language processing techniques

3 Develop practical skills

Data Mining

“As our collective knowledge continues to be digitized and stored (...) it becomes more difficult to find and discover what we are looking for.” (Blei 2012)

New Ways of Exploring Data Collections

Word clouds (Vuillemot et al., 2009)

Visualization Methods

Social network graphs (Rydberg-Cox, 2011)

Visualization Methods

Tracking emotion and sentiment in fairy tales (Mohammad, 2012)

Topic Modeling

Discovering the underlying themes of a collection from Science magazine, 1990-2000 (Blei 2012)

Technological and Methodological Obstacles

Many tools require some programming skills (Mallet, Meta, R and Python libraries)

GUI tools are limited to certain formats and functions (Voyant, PaperMachine)

Lack of active control by users

Interactive Text Mining Suite

A user-friendly tool for quantitative analysis and visualization of unstructured data

Platform-independent

Interactive

ITMS Structure

1 File Uploads

Upload files (txt, pdf, rdf, and the Google Books API)

2 Data Preparation

Data preprocessing (stopwords, stemming, metadata)

3 Data Visualization

Word frequencies, Cluster analysis and topic modeling

Workshop Files

Download 3 text files

http://ssrc.indiana.edu/seminars/wim.shtml

NY Times articles (3 documents in a plain text format)

ITMS Web site:

http://www.interactivetextminingsuite.com

Upload File

Preprocessing Data

Before performing data analysis we should preprocess data.

Preprocessing Options

Select preprocessing options and click apply.

Stopwords

Stopwords (e.g. the, and): select Default for English
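
As a rough illustration of what stopword removal does to a text, here is a minimal Python sketch. It is not the ITMS implementation: the tiny `DEFAULT_STOPWORDS` set and the `tokenize` and `remove_stopwords` helpers are hypothetical names chosen for this example, and a real default English list is much longer.

```python
import re

# Illustrative subset only (an assumption); real stopword lists are far longer.
DEFAULT_STOPWORDS = {"the", "and", "a", "an", "of", "to", "in", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stopwords=DEFAULT_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = tokenize("The cat and the dog sat in the garden.")
filtered = remove_stopwords(tokens)
print(filtered)  # ['cat', 'dog', 'sat', 'garden']
```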

Manual Removal of Stopwords

If needed, remove any additional stopwords that you consider noise, e.g. paper, shows, etc.

Select apply
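
One possible way to decide which extra words count as noise is to look at the most frequent tokens first. The Python sketch below assumes a token list is already available; `top_terms` and `drop_terms` are hypothetical helper names, not ITMS functions.

```python
from collections import Counter

def top_terms(tokens, n=5):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)

def drop_terms(tokens, extra_stopwords):
    """Remove user-chosen noise words, ignoring case."""
    extra = {w.lower() for w in extra_stopwords}
    return [t for t in tokens if t.lower() not in extra]

tokens = ["paper", "shows", "topic", "model", "paper", "shows", "paper", "corpus"]
print(top_terms(tokens, 2))  # [('paper', 3), ('shows', 2)]

cleaned = drop_terms(tokens, ["paper", "shows"])
print(cleaned)  # ['topic', 'model', 'corpus']
```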

Stemming

To improve the analysis, you can stem all your tokens: for example, instead of worked, works, and working, you will have a single stem, work.
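
The idea can be sketched with a deliberately naive suffix stripper in Python. This is an assumption-laden toy, not the stemmer used by ITMS: established stemmers such as the Porter algorithm apply many more rules, and `naive_stem` is a hypothetical name.

```python
def naive_stem(token, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

stems = [naive_stem(t) for t in ["worked", "works", "working", "work"]]
print(stems)  # ['work', 'work', 'work', 'work']
```

A rule set this small will mis-stem many words (for instance, it leaves irregular forms untouched), which is why it should be read only as an illustration of the principle.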

Metadata Extraction

You can extract or upload metadata. You will need date stamp (year) information for chronological topic modeling.
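
As a sketch of what extracting a year might look like, the Python snippet below searches each document for the first four-digit year. The pattern, the year range (1500-2099), and the sample documents are assumptions made for illustration; how ITMS itself extracts metadata is not described in these slides.

```python
import re

# Assumed range: four-digit years from 1500 to 2099.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")

def extract_year(text):
    """Return the first four-digit year found in the text, or None."""
    match = YEAR_PATTERN.search(text)
    return int(match.group(1)) if match else None

# Hypothetical sample documents.
docs = {
    "doc1.txt": "Published March 3, 2015. Markets rallied today.",
    "doc2.txt": "No date appears in this text.",
}
years = {name: extract_year(text) for name, text in docs.items()}
print(years)  # {'doc1.txt': 2015, 'doc2.txt': None}
```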

Visualization

Word Cloud Representation
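
A word cloud is driven by word frequencies: more frequent words are drawn larger. The Python sketch below shows one simple way to map counts to font sizes with linear scaling. The function name `word_sizes`, its parameters, and the size range are assumptions for this example rather than ITMS settings.

```python
from collections import Counter

def word_sizes(tokens, max_words=3, min_size=10, max_size=40):
    """Map the most frequent words to font sizes by linear scaling."""
    counts = Counter(tokens).most_common(max_words)
    if not counts:
        return {}
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = highest - lowest
    sizes = {}
    for word, count in counts:
        # When all counts are equal, draw every word at the maximum size.
        scale = (count - lowest) / span if span else 1.0
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes

tokens = ["data"] * 5 + ["text"] * 3 + ["mining"]
sizes = word_sizes(tokens)
print(sizes)  # {'data': 40, 'text': 25, 'mining': 10}
```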

Customization

Cluster Analysis

You need to have at least three documents

Documents will be grouped based on their term similarity measures
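
To make "term similarity" concrete, the Python sketch below compares three toy documents with cosine similarity over term counts and finds the closest pair, which is the first merge an agglomerative clustering would make. Cosine similarity is one common choice and is an assumption here; the slides do not say which measure ITMS uses, and the documents are invented.

```python
import math
from collections import Counter
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Three hypothetical, already tokenized documents.
docs = {
    "doc1": "stock market trading prices".split(),
    "doc2": "market prices rise trading".split(),
    "doc3": "football match final score".split(),
}
vectors = {name: Counter(tokens) for name, tokens in docs.items()}

# Pairwise similarities between all documents.
similarities = {
    (x, y): cosine_similarity(vectors[x], vectors[y])
    for x, y in combinations(sorted(vectors), 2)
}
closest_pair = max(similarities, key=similarities.get)
print(closest_pair, similarities[closest_pair])  # ('doc1', 'doc2') 0.75
```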

Topic Modeling

LDA (Latent Dirichlet Allocation)

STM (Structural Topic Model)

Chronological topic visualization (LDA): requires metadata

Topic Modeling Tuning

Selection of topics (how many different themes)

Selection of words per theme (how many words per topic)

Selection of the number of iterations
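
The three tuning choices above can be seen as parameters of the sampler. The Python sketch below is a small collapsed Gibbs sampler for LDA, written only to show where the number of topics, words per topic, and iterations enter. It is not the code behind ITMS, which relies on R packages; the function name `lda_gibbs`, the `alpha` and `beta` defaults, and the toy documents are assumptions.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics=2, words_per_topic=3, iterations=50,
              alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return top words per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per doc/topic
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        current = []
        for word in doc:
            k = rng.randrange(num_topics)
            current.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(current)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return [
        [w for w, c in topic_word[t].most_common(words_per_topic) if c > 0]
        for t in range(num_topics)
    ]

docs = [
    "market stock prices market trading".split(),
    "match score football match team".split(),
    "stock trading prices rise".split(),
]
topics = lda_gibbs(docs, num_topics=2, words_per_topic=3, iterations=50)
for number, words in enumerate(topics, start=1):
    print(f"Topic {number}: {', '.join(words)}")
```

More topics give finer themes, more words per topic give longer summaries, and more iterations give the sampler more time to settle, at the cost of running time.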

Topic Model Selection

LDA Topic Model

STM Topic Model

Other Formats - Google Books

Before switching to other data formats, refresh your local browser.

Start with File Uploads and select Structured Data

Select your search terms and submit

Current limitation is 40 books
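
For readers curious about what such a search looks like outside the interface, the Python sketch below builds a request URL for the public Google Books volumes endpoint, whose `maxResults` parameter accepts at most 40 items per request. Whether the 40-book limit in ITMS comes from this cap is an assumption, and `build_query_url` is a hypothetical helper; the sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/books/v1/volumes"
MAX_RESULTS_LIMIT = 40  # the API caps maxResults at 40 per request

def build_query_url(search_terms, max_results=MAX_RESULTS_LIMIT):
    """Build a volumes search URL, capping the number of results at 40."""
    capped = min(max_results, MAX_RESULTS_LIMIT)
    params = {"q": search_terms, "maxResults": capped}
    return f"{BASE_URL}?{urlencode(params)}"

url = build_query_url("text mining", max_results=100)
print(url)
# https://www.googleapis.com/books/v1/volumes?q=text+mining&maxResults=40
```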

Future Options

Shiny Web Application is highly customizable

1 Part-of-speech tagging (tm package)

2 Network analysis (igraph package)

3 Named Entity Recognition (NLP package)

4 Twitter Streaming (twitteR package) - will require the user's Twitter set-up for streaming, but instructions for setting it up will be provided

Open for other suggestions and collaboration - [email protected]

Acknowledgements

I would like to thank WIM for providing this opportunity.

Contributors: Jefferson Davis, Irina Trapido, Jay Lee

References I

[1] Many open source R packages: tm, shiny, NLP, stringi, stringr, topicmodels, lda and many more

[2] Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University Press

[3] Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber & Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press

[4] Jockers, Matthew. 2014. Text Analysis with R for Students of Literature. Quantitative Methods in the Humanities and Social Sciences. Springer International Publishing, Cham

[5] Moretti, Franco. 2005. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso

[6] Oelke, Daniella, Dimitrios Kokkinakis, and Mats Malm. 2012. Advanced visual analytics methods for literature analysis. Proceedings of the 6th EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 35–44

Image credits: https://media.giphy.com/media/10zsjaH4g0GgmY/giphy.gif
