Page 1: Capitalizing on Machine Reading to Engage Bigger Data

Capitalizing on Machine Reading to Engage “Big(ger) Data”
FOUR USE CASES FROM HIGHER ED

Summer Institute on Distance Learning and Instructional Technology (SIDLIT 2016), August 4-5, 2016

Page 2: Capitalizing on Machine Reading to Engage Bigger Data

Overview

What are some ways to select, say, 200 research articles to “close read” from a set of 2,000 PDF articles* gleaned from library databases and Google Scholar? How can a researcher make sense of a trending issue in the flood of Tweets and retweets based on a particular hashtag (#) or keyword search, or of an especially lively Tweetstream from a particular social media account? People are dealing with ever more prodigious amounts of information, from any number of sources. Those who are savvy about using computers to aid their reading (through “distant reading” or “not-reading”) may find that they can cover much more ground. This presentation introduces the use of NVivo 11 Plus (matrix queries, word frequency counts, text searches and word trees, cluster analyses and dendrograms, topic modeling, and others) for multiple cases of distant reading to aid academic and research work.

*Note: In NVivo 11 Plus, .docx and .txt textual data seem easier to process than .pdf, so some file transcoding may be necessary for smoother processing.

Page 3: Capitalizing on Machine Reading to Engage Bigger Data

Presentation Topic Headings

What is Big(ger) Data?

What is Machine Reading?

Some Machine Reading Sequences

Four “Use Cases” from Higher Education
◦ Use Case #1: Topic Modeling and Article Histograms (latent structure in documents and document sets)
◦ Use Case #2: Engaging with Social Media (Microblogging and Social Network) Data
◦ Use Case #3: Exploring Manual and Machine Coding
◦ Use Case #4: Machine Reading for Sentiment Analysis

Page 4: Capitalizing on Machine Reading to Engage Bigger Data

What is Big(ger) Data?


Page 5: Capitalizing on Machine Reading to Engage Bigger Data

What is Big(ger) Data?

STRUCTURED DATA

Thousands of individual records extracted from a microblogging or social networking platform

…and others

UNSTRUCTURED DATA

Hundreds to thousands of scraped images from a content-sharing platform

Hundreds of videos

Complex mixed media (heterogeneous) sets of data

…and others

Page 6: Capitalizing on Machine Reading to Engage Bigger Data

Where Does the “Bigger Data” Come From?

Social media platforms (crowd-sourced encyclopedias, social networking sites, microblogging sites, blogging sites, image sharing sites, video sharing sites, email sites, collaborative work sites, and others)

Databases of published contents

World Wide Web (WWW) and Internet

Synthetic data from various software tools, sites, or other processes

…and others

Page 7: Capitalizing on Machine Reading to Engage Bigger Data

What is Machine Reading?


Page 8: Capitalizing on Machine Reading to Engage Bigger Data

What is Machine Reading?

Use of computers to decode text (which generally was created from natural language processes)
◦ Also sometimes called “distant reading” (Moretti, 2000) or “not-reading”

Capturing of quantitative measures of a text
◦ Reliable and consistent, reproducible / repeatable
◦ Patterned
◦ May be generalizable and transferable to other contexts

Page 9: Capitalizing on Machine Reading to Engage Bigger Data

What is Machine Reading? (cont.)

Some common types of machine-based text analytics include the following:
◦ Linguistic analysis (based on word counts of function words, punctuation, and other parts of written and spoken language)
◦ Style / stylometry
◦ Sentiment analysis
◦ Emotion analysis
◦ Cognitive analysis
◦ Social standing analysis
◦ Deception analysis, and others

Page 10: Capitalizing on Machine Reading to Engage Bigger Data

What is Machine Reading? (cont.)

◦ Word frequency counts (and resulting word clouds)
◦ Word searches (and resulting word trees)
◦ Word network analysis (and resulting network graphs)
◦ Word similarity clustering (and resulting 2D and 3D word clusters)
◦ Word proximity clustering (and resulting 2D and 3D word clusters)
◦ Topic modeling or theme/subtheme extraction through unsupervised machine learning

Page 11: Capitalizing on Machine Reading to Engage Bigger Data

A Brief History of Machine Reading

Technologies originated in the 1960s
◦ Assistive technology track (Haskins Laboratories, 1970s; Kurzweil Computer Products, 1975; and others)
◦ Natural language processing track (Bobrow, 1964; Weizenbaum, 1965; Schank, 1969; Woods, 1970; Winograd, 1971; Hendrix, 1982; and others)

Recently
◦ Applied to Web- and Internet-scale texts
◦ Popularized down to individual academic researcher-level applications

Page 12: Capitalizing on Machine Reading to Engage Bigger Data

Supervised and Unsupervised Machine Learning from Text / Text Corpora

SUPERVISED OR SEMI-SUPERVISED (WITH DIRECT HUMAN INPUTS)

Coding by existing pattern (the computer emulates human coding over part of a text set and codes the rest of the text set, based on the human coding examples)

XML coding and analysis of coded segments of text
◦ Often manual coding
◦ Sometimes machine-enhanced XML coding

UNSUPERVISED (WHOLLY BASED ON COMPUTER ALGORITHM)
◦ Sentiment, emotion, cognitive, and other types of analysis of text data
◦ Word frequency counts with stopwords lists (and resulting word clouds, treemaps); see the sketch below
◦ Word searches (and resulting word trees)
◦ Word network analysis (and resulting network graphs)
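To make the unsupervised word frequency count above concrete, here is a minimal Python sketch, assuming plain-text input; the stopwords set is a tiny illustrative stand-in for the much larger built-in lists that tools such as NVivo 11 Plus apply.

```python
# A minimal sketch of an unsupervised word frequency count with a stopwords
# list. The STOPWORDS set here is illustrative only; real tools ship far
# larger built-in lists.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"}

def word_frequencies(text: str) -> Counter:
    """Lowercase, strip punctuation, and count the non-stopword tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

if __name__ == "__main__":
    sample = "Machine reading decodes text; the machine counts the words in the text."
    for word, count in word_frequencies(sample).most_common(5):
        print(f"{word:>10}  {count}")
```

The resulting descending frequency table is the same data that feeds a word cloud or treemap rendering.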


Page 13: Capitalizing on Machine Reading to Engage Bigger Data

Supervised and Unsupervised Machine Learning from Text / Text Corpora (cont.)

SUPERVISED OR SEMI-SUPERVISED (WITH DIRECT HUMAN INPUTS)

Uses of human-labeled data

UNSUPERVISED (WHOLLY BASED ON COMPUTER ALGORITHM)
◦ Word similarity clustering (and resulting dendrogram visualizations, 2D and 3D cluster diagrams, ring lattice graphs)
◦ Word proximity clustering (and resulting dendrogram visualizations, 2D and 3D word cluster diagrams, ring lattice graphs)
◦ Clustering from factor analysis
◦ Topic modeling or theme/subtheme extraction through unsupervised machine learning (with intensity matrices, bar charts, hierarchy diagrams like treemaps and sunbursts)


Page 14: Capitalizing on Machine Reading to Engage Bigger Data

Supervised and Unsupervised Machine Learning from Text / Text Corpora (cont.)

All with extracted data tables (from which data visualizations are created)

All with extracted word lists (from which data visualizations are created)


Page 15: Capitalizing on Machine Reading to Engage Bigger Data

Features and Affordances of Machine Reading

FEATURES

Still some level of human interpretation needed

Clusters may be extracted automatically, but the human has to interpret what those extracted clusters represent (as with factor analysis, principal components analysis, and others)

AFFORDANCES

Speed

Scale

Reproducibility


Page 16: Capitalizing on Machine Reading to Engage Bigger Data

Under the Hood…

Bag of words (parsing: removal of punctuation, breaking writing apart into words) vs. structure- and context-preserving methods

Sliding “windows” for co-occurrence and proximity captures

Languages are fairly predictably structured, so statistical methods and counting can be applied based on those patterns

Matrices are a common tool to capture relationships between words and parts of a text and documents; all relational matrices may be re-visualized as network graphs and other types of data visualizations

Pre-coded word sets or dictionaries are often used for sentiment analysis, emotion analysis, and the extraction of psychometric features
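As one concrete illustration of the ideas above, the following Python sketch applies a bag-of-words parse and a sliding window to capture word co-occurrence; the window size and tokenization rules are illustrative assumptions, not the settings of any particular tool.

```python
# A sketch of sliding-window co-occurrence capture over a bag-of-words parse.
# The window size (3) and the tokenizer are illustrative choices.
import re
from collections import defaultdict

def cooccurrence(text: str, window: int = 3) -> dict:
    """Count how often each word pair co-occurs within the sliding window."""
    tokens = re.findall(r"[a-z']+", text.lower())   # punctuation removed
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:    # words inside the window
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return dict(counts)

if __name__ == "__main__":
    text = "machine reading scales up; machine reading augments human reading"
    for pair, n in sorted(cooccurrence(text).items(), key=lambda kv: -kv[1])[:5]:
        print(pair, n)
```

The resulting pair counts are a relational matrix in miniature, and, as noted above, any such matrix may be re-visualized as a network graph or other data visualization.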


Page 17: Capitalizing on Machine Reading to Engage Bigger Data

Under the Hood… (cont.)

Often a sequence of algorithmic procedures, and often a black box, except with the few software makers who focus on documenting their methods accurately


Page 18: Capitalizing on Machine Reading to Engage Bigger Data

Some Software and Capabilities

Linguistic Inquiry and Word Count (LIWC2015): application of psychometrics and other constructs across over 100 dimensions; two-plus decades of empirical, lab, and other research; dictionaries available in multiple languages; customized dictionaries may be applied

AutoMap and NetScenes: extraction of content networks, and others (http://www.casos.cs.cmu.edu/projects/automap/)

Coh-Metrix (http://cohmetrix.com/)

DICTION (http://www.dictionsoftware.com/)

Latent Semantic Analysis @ CU Boulder (http://lsa.colorado.edu/)

(The presenter has only tried the first two. MS Word has a small lexical feature that enables the extraction of readability statistics based on the Flesch Reading Ease test and the Flesch-Kincaid Grade Level test.)


Page 19: Capitalizing on Machine Reading to Engage Bigger Data

NVivo 11 Plus

Primary focus: qualitative research data analysis suite
◦ Digital data curation
◦ Manual coding
◦ Data queries
◦ Auto coding
◦ Data visualizations, and others

Some built-in data analytics tools that enable machine reading of texts

For the use cases, this presentation focuses only on this particular tool.

Page 20: Capitalizing on Machine Reading to Engage Bigger Data

Some Machine Reading Sequences


Page 21: Capitalizing on Machine Reading to Engage Bigger Data

A Conceptualization of a Recursive and Semi-Linear Sequence

Review of the Literature*

Research Design / Exploration and Discovery*

Multimedia Collection / Text Collection*

Close Human Reading

Data Cleaning

Text and Multimedia Data Curation

Data Runs with Machine Reading Software

Extractions of Data Tables and Text Sets

Data Visualizations

Write-up Presentation

*all potential start points


Page 22: Capitalizing on Machine Reading to Engage Bigger Data


Page 23: Capitalizing on Machine Reading to Engage Bigger Data

Power in Combinations and in Sequences

Many points-of-entry to machine reading

Ranges of tactics and strategies and capabilities based on available texts, various software tools, and researcher capabilities

Text processing is sensitive to sequence, so it is important to be very clear about what is happening at each processing phase (and how the data changes)
◦ Need clear documentation of what changes happened in each step

Page 24: Capitalizing on Machine Reading to Engage Bigger Data

Why Text?

May be human-collected data sets, sometimes computer program-scraped data sets, or a mix

Text of various types based on conventions

Text as a lowest common denominator in terms of multimedia objects (images, audio, video, and others)


Page 25: Capitalizing on Machine Reading to Engage Bigger Data

Text Curation and Data Cleaning

Selection of texts to include and those to exclude

Consistent and informative file naming protocols

Rendering of multimedia files to searchable text ones

Ensuring texts are machine readable
◦ Rendering of texts to searchability

De-duplication of files

Normalizing textual data, and others
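For the de-duplication step, here is a minimal sketch, assuming a folder of .txt files; it flags only exact byte-for-byte duplicates, so near-duplicates (say, the same article exported from two databases with different headers) would still need fuzzier review. The "corpus" folder name is a placeholder.

```python
# A minimal de-duplication sketch: group files in a corpus folder by a
# SHA-256 hash of their contents. "corpus" and the .txt filter are
# placeholder assumptions about how the text set is stored.
import hashlib
from pathlib import Path

def find_duplicates(folder: str) -> dict:
    """Map each content hash to the files that share it (duplicates only)."""
    seen: dict[str, list[str]] = {}
    for path in sorted(Path(folder).glob("*.txt")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        seen.setdefault(digest, []).append(path.name)
    return {h: names for h, names in seen.items() if len(names) > 1}

if __name__ == "__main__":
    for digest, names in find_duplicates("corpus").items():
        print(digest[:12], names)
```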


Page 26: Capitalizing on Machine Reading to Engage Bigger Data


dendrogram hierarchy showing similarity clustering

Note: Start with the leaves at the far right and read “up” the tree to branches and then the trunk, in order to understand clustering relationships, from the most granular to the broadest or coarsest.

Page 27: Capitalizing on Machine Reading to Engage Bigger Data


treemap diagram showing theme and subtheme extraction

Page 28: Capitalizing on Machine Reading to Engage Bigger Data


treemap diagram showing theme and subtheme extraction without color overlay

Page 29: Capitalizing on Machine Reading to Engage Bigger Data


word tree based on a text search for “data”

Page 30: Capitalizing on Machine Reading to Engage Bigger Data


2D cluster visualization of source ties based on word similarity

Page 31: Capitalizing on Machine Reading to Engage Bigger Data

Setup of Machine Reading Processes

There are a half-dozen widely available software programs that may be used for machine reading of texts. Those who use high-level computer languages also have packages that enable various types of text analysis.

It helps to know what the capabilities of the various software tools are.

It helps to know how to set the parameters for the various processes.

It helps to know how the respective data visualizations are to be interpreted, and to mitigate the fact that data visualizations are summary data. It is important to mitigate negative learning.

Page 32: Capitalizing on Machine Reading to Engage Bigger Data

Assertability

It helps to know what may be asserted from the respective machine reading processes, so the findings may be accurately represented.

Also, many researchers use multiple processes and outside understandings in order to contextualize the data from machine reading. Those insights should be properly couched as well.

Page 33: Capitalizing on Machine Reading to Engage Bigger Data

Four “Use Cases” from Higher Education

APPLICATION OF MACHINE READING FOR:
• Learning
• Awareness
• Research

Page 34: Capitalizing on Machine Reading to Engage Bigger Data

Overview of Use Cases

Use Case #1: Topic Modeling and Article Histograms (latent structure in documents and document sets)

Use Case #2: Engaging with Social Media (Microblogging and Social Network) Data

Use Case #3: Exploring Manual and Machine Coding

Use Case #4: Machine Reading for Sentiment Analysis


Page 35: Capitalizing on Machine Reading to Engage Bigger Data

Use Case #1: Topic Modeling and Article Histograms (latent structure in documents and document / text sets)

Using a computer to read a large number of articles or other contents to extract topics
◦ Can be a “knowledge-poor” approach in which no prior information about the domain is applied to the topic extraction (as in the case of NVivo 11 Plus)

Topic modeling may be used to identify which works should be read using human “close reading”
◦ Article histograms through theme / subtheme extraction (topic modeling)
◦ Understanding of topics as a finite feature set
◦ Classification of articles by their main named contents

Using a computer to extract themes and sub-themes to understand the general gist of an article or a text set

Can auto-code at three levels of granularity: sentence, paragraph, and cell (depending on the structure of the data)
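NVivo 11 Plus does not publicly document its theme-extraction algorithm, so as a stand-in, here is a hedged sketch of one widely used “knowledge-poor” topic modeling technique, latent Dirichlet allocation (LDA), via scikit-learn; the four-document corpus and the parameter values are toy assumptions.

```python
# A sketch of "knowledge-poor" topic extraction with LDA (one common
# technique; not necessarily what NVivo 11 Plus uses internally).
# Corpus and parameters are toy values.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "machine reading of large text corpora for research",
    "sentiment analysis of microblogging data streams",
    "topic modeling extracts latent themes from articles",
    "social network analysis of twitter communicators",
]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)                  # document-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {', '.join(top)}")
```

On a real article set, the document-topic weights from lda.transform(dtm) play a role analogous to the intensity matrix discussed in the work sequence below.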


Page 36: Capitalizing on Machine Reading to Engage Bigger Data

Use Case #1 Work Sequence

Collecting the documents as separate items in a text corpus

Cleaning the text sets

Running the theme and sub-theme extraction against the set (NVivo 11 Plus)

Selecting the appropriate themes and subthemes

Auto-coding at the more granular sentence level (vs. the paragraph level)

Exporting the text-document intensity matrix

Mapping the entire set as a line graph (Excel)

Mapping separate articles as article histograms (see the sketch below)
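The deck maps the exported intensity matrix in Excel; an equivalent sketch in Python follows. The file name intensity_matrix.csv and its layout (one row per article, one column per theme) are hypothetical placeholders for whatever the actual NVivo export contains.

```python
# Plotting one article's theme intensities as an "article histogram."
# "intensity_matrix.csv" and its row/column layout are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

matrix = pd.read_csv("intensity_matrix.csv", index_col=0)  # rows: articles, cols: themes

article = matrix.iloc[0]                     # one article's theme intensities
article.plot(kind="bar", title=f"Article histogram: {article.name}")
plt.ylabel("Coding references")
plt.tight_layout()
plt.savefig("article_histogram.png")
```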


Page 37: Capitalizing on Machine Reading to Engage Bigger Data

Assumptions of the Topic Modeling Approach

Word counts in respective documents are usually normalized to account for varying document lengths (so as not to overweight words appearing in longer documents)

Each article is represented as a list of identified “important” words
◦ Important words are likely semantic meaning-bearing terms
◦ Important words are sometimes “rare” ones
◦ Important words are not functional words (articles, pronouns, etc.), which usually appear on a “stopwords” list
◦ If the TF-IDF (term frequency-inverse document frequency) calculation is used (a worked sketch follows this list),
  ◦ words are upweighted if they occur a fair amount in a local document but rarely across the set globally, and
  ◦ words are downweighted if they occur a lot across documents (globally), in order to lower the influence of frequently appearing words relative to rarer-occurring ones
◦ Words directly linked to the identified themes are placed as sub-themes (in bigrams, trigrams, and so on)
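A worked sketch of the TF-IDF weighting just described, using the common tf × log(N/df) form; real tools differ in smoothing and normalization details.

```python
# A worked TF-IDF sketch in the common tf * log(N / df) form.
import math
from collections import Counter

docs = [
    "machine reading machine learning".split(),
    "close reading of articles".split(),
    "machine coding of articles".split(),
]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tf_idf(term: str, doc: list) -> float:
    tf = doc.count(term) / len(doc)            # normalized for document length
    idf = math.log(N / df[term])               # rare-in-set terms score higher
    return tf * idf

# "machine" appears in 2 of 3 docs; "learning" in only 1, so it is upweighted
print(round(tf_idf("machine", docs[0]), 3))    # ~0.203
print(round(tf_idf("learning", docs[0]), 3))   # ~0.275
```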


Page 38: Capitalizing on Machine Reading to Engage Bigger Data

Visualizations

A dataset of articles about “machine reading” from Springer, IEEE Xplore Digital Library, ACM, and Google Scholar

Kept at document size and stored in one folder


Page 39: Capitalizing on Machine Reading to Engage Bigger Data


autocoded nodes based on extracted themes and subthemes

Page 40: Capitalizing on Machine Reading to Engage Bigger Data


an intensity matrix of extracted themes from an article set

Page 41: Capitalizing on Machine Reading to Engage Bigger Data


a combined bar chart of articles and extracted top-level themes

Page 42: Capitalizing on Machine Reading to Engage Bigger Data


treemap diagram showing theme and subtheme extraction

Page 43: Capitalizing on Machine Reading to Engage Bigger Data


sunburst hierarchy diagram showing theme and subtheme extraction

Page 44: Capitalizing on Machine Reading to Engage Bigger Data


an article histogram created in Excel

Page 45: Capitalizing on Machine Reading to Engage Bigger Data


interactive word tree based on “machine reading”

Page 46: Capitalizing on Machine Reading to Engage Bigger Data

An Integrated File of All Articles

1,751 pp., more than a million words, from articles based on “machine reading” from the academic literature

Binder1 file

63 MB

Saved out as a .txt file from .pdf

Saved out as an .rtf file from .pdf

combining the article set of “machine reading” articles for corpus-based summaries

Page 47: Capitalizing on Machine Reading to Engage Bigger Data


synthesized set theme and subtheme extraction from text corpus

Page 48: Capitalizing on Machine Reading to Engage Bigger Data


frequency word cloud from synthesized article set

Page 49: Capitalizing on Machine Reading to Engage Bigger Data


treemap diagram from synthesized article set

Page 50: Capitalizing on Machine Reading to Engage Bigger Data


interactive word tree from “machine” seeded text search from synthesized article set

Page 51: Capitalizing on Machine Reading to Engage Bigger Data


treemap diagram showing theme and subtheme extraction from synthesized article set

Page 52: Capitalizing on Machine Reading to Engage Bigger Data


sunburst diagram showing theme and subtheme extraction from synthesized article set

Page 53: Capitalizing on Machine Reading to Engage Bigger Data


sub-sunburst reflecting “system” theme and linked subthemes from synthesized article set

Page 54: Capitalizing on Machine Reading to Engage Bigger Data


treemap diagram showing auto-extracted sentiments from synthesized article set

Page 55: Capitalizing on Machine Reading to Engage Bigger Data


bar chart showing autocoded sentiment in four classifications from synthesized article set

intensity matrix

Page 56: Capitalizing on Machine Reading to Engage Bigger Data

Use Case #2: Engaging with Social Media (Microblogging and Social Network) Data

Using the NCapture web browser add-on (on MS Internet Explorer or Google Chrome) to extract a Twitter Tweetstream

Using NCapture web browser add-on to extract a Facebook wall of postings

Ability to capture text messages, URLs, geographical coordinates, thumbnail images, and other data

Ability to run sentiment analysis over the collected data

Ability to run theme and sub-theme extraction over the collected data

Ability to map social networks (sociograms) based on interaction data


Page 57: Capitalizing on Machine Reading to Engage Bigger Data

Use Case #2 Work Sequence

Capture a microblogging text set (Tweetstream, #hashtag conversation, keyword conversations, or other data extractions) from Twitter’s API

Run a text frequency count to capture a gist of the focus of the target text set (NVivo 11 Plus)

Run a word or phrase or name search to capture a sense of the gist of the targeted word use (NVivo 11 Plus)

Process the data and extract social networks to identify main communicators in the #hashtag network or the keyword network
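NCapture delivers its extractions straight into NVivo; purely for illustration, the sketch below assumes instead a hypothetical CSV export with username and text columns, and shows two of the steps above in miniature: a hashtag frequency count (the gist of the conversation) and a mention edge list (the raw material of a sociogram).

```python
# Illustrative processing of a hypothetical tweet export ("tweetstream.csv"
# with "username" and "text" columns) -- not NCapture's actual file format.
import re
from collections import Counter
import pandas as pd

tweets = pd.read_csv("tweetstream.csv")

# Hashtag frequency count: a quick gist of the #hashtag conversation.
hashtags = Counter(
    tag.lower() for text in tweets["text"] for tag in re.findall(r"#\w+", str(text))
)
print(hashtags.most_common(10))

# Mention edges (who names whom): input for social network mapping.
edges = [
    (row.username, mention.lower())
    for row in tweets.itertuples(index=False)
    for mention in re.findall(r"@(\w+)", str(row.text))
]
print(edges[:10])
```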


Page 58: Capitalizing on Machine Reading to Engage Bigger Data

Use Case #2 Work Sequence (cont.)

Map the microblogging data to a geographical map

Capture the URLs from the data set for more analysis

Capture related imagery from the data set for more analysis

Look at the postings over time for more analysis

Conduct a theme and sub-theme analysis of the text set to capture meaning

Map the text set for sentiment to capture the general sentiment of the text set
◦ Examine the extracted textual data for deeper insights


Page 59: Capitalizing on Machine Reading to Engage Bigger Data


extracted Tweetstream dataset from Twitter using the NCapture add-on for NVivo 11 Plus

Page 60: Capitalizing on Machine Reading to Engage Bigger Data


dendrogram showing user names clustered by word similarity (suggestive of intercommunications and topical interests)

Page 61: Capitalizing on Machine Reading to Engage Bigger Data


3D cluster chart of Twitter users clustered by word similarity (zoomable, pannable, interactive)

Page 62: Capitalizing on Machine Reading to Engage Bigger Data


ring lattice graph / circle graph showing Twitter user accounts and clustering by word similarity from exchanged microblogged messaging

Page 63: Capitalizing on Machine Reading to Engage Bigger Data


mapped locations of accounts of Twitter communicators based on shared geolocational data

Page 64: Capitalizing on Machine Reading to Engage Bigger Data


interactive word tree from a text search for “asia”

Page 65: Capitalizing on Machine Reading to Engage Bigger Data


sociogram depicting a target Twitter user account’s out-degree ego neighborhood (1 deg.)

Page 66: Capitalizing on Machine Reading to Engage Bigger Data


data table showing descending word count from a word frequency count

Page 67: Capitalizing on Machine Reading to Engage Bigger Data


frequency word cloud from a Tweetstream with hashtag networks and (semantic) keywords

Page 68: Capitalizing on Machine Reading to Engage Bigger Data


treemap from frequency word count of the Tweetstream

Page 69: Capitalizing on Machine Reading to Engage Bigger Data


treemap from autocoded theme and subtheme extraction from the Tweetstream

Page 70: Capitalizing on Machine Reading to Engage Bigger Data


sunburst hierarchy chart showing extracted themes and subthemes from the Tweetstream

Page 71: Capitalizing on Machine Reading to Engage Bigger Data


sunburst hierarchy chart showing the “educator” theme and its related subthemes from Twitter Tweetstream

Page 72: Capitalizing on Machine Reading to Engage Bigger Data


bar chart of extracted themes (in alphabetical order) from the Twitter account Tweetstream; subthemes not depicted here

Page 73: Capitalizing on Machine Reading to Engage Bigger Data


least frequent word count to capture the “long tail” from the Tweetstream

often alphanumeric garble, misspellings, and a few nuggets of insight in the long tail

power law curve frequency distribution

Page 74: Capitalizing on Machine Reading to Engage Bigger Data

In the area marked by ellipses …


treemap from Excel 2016

Page 75: Capitalizing on Machine Reading to Engage Bigger Data

An Example of the Long Tail (from the @CDCGov Tweetstream)

frequency bar chart from Excel 2016 showing words and string data in the long tail, with frequency counts (0-30) on the y-axis

Page 76: Capitalizing on Machine Reading to Engage Bigger Data

In the Long Tail… (with a somewhat arbitrary raw count to indicate the start of the long tail)

NOT USEFUL

Misspelled words

Alphanumeric garble

POSSIBLY USEFUL

URLs (uniform resource locators) or web addresses

Rare topics

Concepts

Unusual terms

Names, proper nouns, named entities


Page 77: Capitalizing on Machine Reading to Engage Bigger Data


data table starting with frequency counts of one

Page 78: Capitalizing on Machine Reading to Engage Bigger Data


treemap of auto-extracted sentiment: neutral, positive, negative, and mixed

Page 79: Capitalizing on Machine Reading to Engage Bigger Data


bar chart of auto-extracted sentiments in four categories or a binary split: • very negative, moderately negative, moderately positive, and very positive (and “neutral”); • also as negative-positive polarity (and “neutral”)

Page 80: Capitalizing on Machine Reading to Engage Bigger Data


text set from one of the four sentiment categories

Page 81: Capitalizing on Machine Reading to Engage Bigger Data

Use Case #3: Exploring Manual and Machine Coding

Exploring human-coded (and / or auto-coded or machine-coded) nodes for pattern identification: interrelationships, similarity clustering, and others

Data queries to enable coding exploration: word frequency count, text search, and others

Matrix coding query to explore interrelationships between codes (nodes)


Page 82: Capitalizing on Machine Reading to Engage Bigger Data

Use Case #3 Work Sequence

Raw source data ingested

Manual (and / or automated) coding of that data

Data queries of the coding to observe patterns in the coding of that data
◦ Ability to set up text frequency counts based on various parameters: exact matches, stemmed words, synonyms, specializations, and generalizations

Text search with special character capabilities

Proximity searches for words occurring near a certain term


Page 83: Capitalizing on Machine Reading to Engage Bigger Data

Special Features for Text Searches
◦ Wildcard searches (?): any one character
◦ Wildcard searches (*): any characters
◦ AND
◦ OR
◦ NOT
◦ Required
◦ Prohibit
◦ Fuzzy
◦ Near… (proximity searches) / an extension of memory
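NVivo's search operators have their own syntax and semantics; the Python regular expressions below are only rough analogues of a few of them, for illustration.

```python
# Rough regex analogues of some NVivo-style search operators (illustration only).
import re

text = "She reads widely; distant reading and close reading are complementary."

print(re.findall(r"\bread.\b", text))   # "?" wildcard, any one character: ['reads']
print(re.findall(r"\bread\w*", text))   # "*" wildcard, any characters: reads, reading, reading
print(bool(re.search(r"distant", text) and re.search(r"close", text)))   # AND
print(bool(re.search(r"distant|skim", text)))                            # OR
print(bool(re.search(r"close", text) and not re.search(r"skim", text)))  # NOT

# Near... (proximity): "close" within about three words of "reading"
print(bool(re.search(r"\bclose\b(?:\W+\w+){0,2}\W+reading", text)))
```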


Page 84: Capitalizing on Machine Reading to Engage Bigger Data

Grouping

Enabling text searches, word frequency counts, and other queries from most specific to increasing gradations of generality (based on placement of the slider)

Exact matches

With stemmed words

With synonyms

With specializations

With generalizations
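NVivo's grouping slider relies on its own internal dictionaries; the sketch below only illustrates the first widening step, from exact matches to stemmed words, using NLTK's Porter stemmer.

```python
# Illustrating "exact match" vs. "with stemmed words" using NLTK's Porter
# stemmer; not NVivo's internal grouping mechanism.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
query = "coding"
candidates = ["code", "coded", "coding", "codes", "codex"]

exact = [w for w in candidates if w == query]
stemmed = [w for w in candidates if stemmer.stem(w) == stemmer.stem(query)]

print("exact:  ", exact)    # ['coding']
print("stemmed:", stemmed)  # the words sharing the stem 'code'
```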


Page 85: Capitalizing on Machine Reading to Engage Bigger Data


interactive word tree based on “player” seeding term from coded data (whether manual-coded or auto/machine-coded or combined)

Page 86: Capitalizing on Machine Reading to Engage Bigger Data


color-coded intensity matrix of auto-coded sentiment analysis of the selected code

Page 87: Capitalizing on Machine Reading to Engage Bigger Data


treemap diagram of theme and sub-theme extraction from text code

Page 88: Capitalizing on Machine Reading to Engage Bigger Data


sunburst diagram of auto-extracted themes and subthemes from coding references

Page 89: Capitalizing on Machine Reading to Engage Bigger Data


scatterplot of auto-extracted sentiment in coding as expressed in Excel 2016

Page 90: Capitalizing on Machine Reading to Engage Bigger Data

Use Case #4: Machine Reading for Sentiment Analysis

Ability to conduct a fast extraction of sentiment
◦ either as a polarity (positive-negative), with uncoded neutral text, or as
◦ a four-category set of sentiments (very negative, moderately negative, moderately positive, and very positive) and one of neutrality

Ability to explore the sentiment-coded text sets for textual contents
◦ Ability to query the coded text sets for additional word relationships and patterns

Ability to uncode or re-code autocoded text for sentiment to increase accuracy (through human oversight)


Page 91: Capitalizing on Machine Reading to Engage Bigger Data

Use Case #4: Machine Reading for Sentiment Analysis (cont.)

Can auto-code at three levels of granularity: sentence, paragraph, and cell (depending on the structure of the data)
◦ Sentences (granular) and paragraphs (coarser) are common in documents
◦ Cells are common in data tables, and many contain structured data but also phrases, URLs, and thumbnail imagery (from social media); data tables from online surveys may contain whole sentences and paragraphs


Page 92: Capitalizing on Machine Reading to Engage Bigger Data

Use Case #4 Work Sequence

Run sentiment analysis on the text set

Auto-code at the paragraph level (coarser than at the granular sentence level)

Review the data visualizations (intensity matrix, bar chart)

Explore the text in each sentiment sub-category (very negative, moderately negative, moderately positive, very positive)
◦ Run word frequency count
◦ Run selected text searches
◦ Run theme and sub-theme extraction
◦ Create additional related data visualizations
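NVivo 11 Plus scores sentiment against its own built-in dictionary, which is not public; the toy sketch below shows the general dictionary-based approach, with an invented word list and invented cutoffs binned into the same four categories plus neutral.

```python
# A toy dictionary-based sentiment scorer binned into four categories plus
# neutral. LEXICON weights and the cutoffs are invented for illustration.
LEXICON = {"love": 2, "good": 1, "poor": -1, "hate": -2}

def sentiment(text: str) -> str:
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    if score <= -2:
        return "very negative"
    if score < 0:
        return "moderately negative"
    if score == 0:
        return "neutral"
    if score < 2:
        return "moderately positive"
    return "very positive"

for line in ["I love this good course", "poor design", "plain statement"]:
    print(f"{line!r}: {sentiment(line)}")
```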


Page 93: Capitalizing on Machine Reading to Engage Bigger Data


intensity matrix of auto-extracted sentiment from raw article sources

Page 94: Capitalizing on Machine Reading to Engage Bigger Data


bar chart of sentiment from comparative articles in a text set

Page 95: Capitalizing on Machine Reading to Engage Bigger Data


auto-coded nodes based on sentiment dictionary in either a four-category mode (top) or a binary two-category mode (bottom)

Page 96: Capitalizing on Machine Reading to Engage Bigger Data


access to exportable coded text sets in each of the sentiment categories

Page 97: Capitalizing on Machine Reading to Engage Bigger Data


access to exportable coded text sets in each of the sentiment categories, with ability to re-code and un-code

Page 98: Capitalizing on Machine Reading to Engage Bigger Data


treemap of comparative sentiment coding per document

Page 99: Capitalizing on Machine Reading to Engage Bigger Data


sunburst graph showing auto-coded sentiment extraction across comparative documents

Page 100: Capitalizing on Machine Reading to Engage Bigger Data

Value-added Aspects to Machine Reading

Machine reading augments human capabilities for knowing. It decodes (“reads”) beyond a surface-level understanding of words in various contexts.
◦ It enables efficient access to latent insights.
◦ There are ways to chain processes and compare / contrast information in a way that illuminates new insights.

NVivo 11 Plus enables ways to save “macros” of the various data queries and autocoding sequences, to ease re-runs of those sequences. This may be helpful with continuing data collection, in order to update query results after the acquisition of new data.

There is value in comparing the outcomes from one software program to another…and in seeing what happens with different settings for the different machine reading approaches.
◦ Researchers may export data tables for additional analytics and data visualizations outside of the software.

Page 101: Capitalizing on Machine Reading to Engage Bigger Data

Demos (if time allows)


Page 102: Capitalizing on Machine Reading to Engage Bigger Data

Questions? Comments?


Page 103: Capitalizing on Machine Reading to Engage Bigger Data

Conclusion and Contact

Dr. Shalin Hai-Jew
◦ iTAC, Kansas State University
◦ 212 Hale / Farrell Library
◦ [email protected]
◦ 785-532-5262

No ties: The presenter has no formal tie to QSR International, the maker of NVivo 11 Plus, nor to Microsoft, the maker of Excel.

About the data visualizations: Some of the data extractions and processes used only a few dozen source items, in part because of the need for coherent data visualizations, and also to enable processing of the data on a local Windows machine.
◦ However, in a server-hosted context, text sets closer to big(ger) data may be run using NVivo 11 Plus.

A sampling: The four “use cases” presented here are by no means comprehensive. These offer a taste of some of the possibilities.
