Coding the Twitter Sphere: Humans and Machines Learning Together

Dr. Stuart Shulman @stuartwshulman [email protected]

Coding the Twitter Sphere - DIMACS dimacs.rutgers.edu/.../04.Shulman_CodingtheTwittersphere.pdf

May 28, 2020

Acknowledgements

The National Science Foundation

Mark J. Hoy

Conflict of Interest Disclosure

I am the sole manager of Texifter

We sell DiscoverText licenses

We sell Gnip data licenses

A Master Metaphor: Sifter

An Open Source Kernel

Three Primary Tasks in CAT

Classification of Text

A 2,500-year-old problem

Plato argued it would be frustrating

It still is…

Grimmer & Stewart, “Text as Data,” Political Analysis (2013)

Volume is a problem for scholars

Coders are expensive

Groups struggle to accurately label text at scale

Validation of both humans and machines is “essential”

Some models are easier to validate than others

All models are wrong

Automated models enhance/amplify, but don’t replace humans

There is no one right way to do this

“Validate, validate, validate”

“What should be avoided then, is the blind use of any method without a validation step.”

(Patent Pending)

Three Important Books

One Particularly Important Idea

Five Pillars of Text Analytics

Search

Filter

Code

Cluster

Classify

You can execute all five using DT

Pillar #1: Search

Search for Negative Cases

Defined Search (Multi-term)

Pillar #2: Filters

Another Common Filter

Pillar #3: Human Coding

Keystroke Coding is Fast

Coding Off a List is Faster

Data Cleaning is Fundamental

Pillar #4: Clustering

Latent Dirichlet Allocation (LDA) Topic Models

LDA on the Christie Data

Topic 1: christie, sandy, christies, funds, relief, feds, investigating, daily, gov, feminized

Topic 2: with, daniel, didnt, after, murder, time, agatha, death, former, mayor

Topic 3: bridge, about, traffic, more, scandal, chris, nj, some, just, says

Topic 4: like, gop, bridgegate, what, 2016, know, now, will, bully, dont

Topic 5: obama, benghazi, impeachment, dem, have, probe, lawmaker, floats, possibility, gwb

Topic 6: jersey, over, stages, still, aides, grief, bogus, hes, news, subpoenas

Topic 7: rove, closures, karl, york, while, federal, party, tea, governor, president

Topic 8: irs, political, been, show, republicans, media, get, laws, word, scandals
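
Topic word lists like the ones above come from fitting LDA to tokenized text. The slides do not say which implementation produced them, so what follows is only a minimal sketch of one standard fitting method, collapsed Gibbs sampling, in plain Python. The toy corpus, the function names `lda_gibbs` and `top_words`, and the hyperparameters `alpha` and `beta` are illustrative assumptions, not details from the talk.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        assign.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word, doc_topic

def top_words(topic_word, n=3):
    """Most frequent words per topic, skipping words with a zero count."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]

# Toy corpus (made up for illustration).
docs = [
    "christie bridge traffic scandal".split(),
    "bridge traffic closures nj".split(),
    "agatha christie murder mystery".split(),
    "murder mystery novel death".split(),
]
topic_word, doc_topic = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words(topic_word), start=1):
    print(f"Topic {k}: {', '.join(words)}")
```

On a real collection the sampler needs many more documents and iterations, and the resulting topics still need human inspection, as the mixed topics above suggest.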

Pillar #5: Machine-Learning

Create a Dataset to Code

Any archive or bucket

Use the random sampling tool

Standard: All coders get all items

Triage: Coders get next uncoded item
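
The difference between the two assignment modes can be sketched in a few lines. This is an illustration only, not DiscoverText's implementation: the function names, the coder names, and the archive contents are all hypothetical.

```python
import random
from collections import deque

def sample_items(archive, k, seed=0):
    """Draw a simple random sample of k items without replacement."""
    return random.Random(seed).sample(archive, k)

def assign_standard(items, coders):
    """Standard: every coder receives every item."""
    return {coder: list(items) for coder in coders}

def assign_triage(items, requests):
    """Triage: each request from a coder is served the next uncoded item."""
    queue = deque(items)
    assignments = {}
    for coder in requests:
        if not queue:
            break
        assignments.setdefault(coder, []).append(queue.popleft())
    return assignments

# Hypothetical archive of 1,000 items and two coders.
archive = [f"tweet-{i}" for i in range(1000)]
dataset = sample_items(archive, 6)
standard = assign_standard(dataset, ["ann", "bob"])
triage = assign_triage(dataset, ["ann", "bob", "ann", "ann", "bob", "bob"])
```

Standard mode yields overlapping codings, which is what a reliability comparison needs. Triage mode splits the work, so each item is coded once and the dataset is finished sooner.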

Select from Three Coding Styles

Default: Mutually Exclusive Codes

Option 1: Non-Mutually Exclusive Codes

Option 2: User-Defined Codes (Grounded Theory)

Assign Peers to Code a Dataset

How many coders?

How many items need to be coded?

How many test or training sets?

There are no cookbook answers

Look at Inter-Rater Reliability

Highly reliable coding (easy tasks)

Unreliable coding (interesting tasks)

If humans can’t, neither can machines

Some tasks better suited for machines
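
The slides do not name a reliability statistic. One common choice for two coders is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch, assuming two coders labeled the same items in the same order (the labels here are made up):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same code by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each coder's own code frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)

coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
coder_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]
print(cohens_kappa(coder_a, coder_b))  # 6 of 8 agree, chance agreement is 0.5
```

A kappa near 1 suggests an easy, reliable task. A kappa near 0 means the coders agree about as often as chance would predict, which is a warning sign before training a classifier on their labels.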

Adjudication: The Secret Sauce

Expert review or consensus process

Invalidate false positives

Identify strong and weak coders

Exclude false positives from training sets
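
As a rough illustration of these steps, suppose each coding carries an adjudicator's valid/invalid decision. The record layout, function names, and sample data below are assumptions made for the sketch, not the tool's actual data model.

```python
from collections import defaultdict

# Each record: (item_id, coder, code, validated_by_adjudicator)
codings = [
    ("t1", "ann", "relevant", True),
    ("t1", "bob", "relevant", True),
    ("t2", "ann", "relevant", True),
    ("t2", "bob", "irrelevant", False),
    ("t3", "ann", "irrelevant", True),
    ("t3", "bob", "relevant", False),
]

def training_set(codings):
    """Keep only validated codings, as unique (item, code) pairs."""
    return sorted({(item, code) for item, _, code, valid in codings if valid})

def coder_validity(codings):
    """Share of each coder's codings that survived adjudication."""
    totals, valid = defaultdict(int), defaultdict(int)
    for _, coder, _, ok in codings:
        totals[coder] += 1
        valid[coder] += ok  # True counts as 1, False as 0
    return {coder: valid[coder] / totals[coder] for coder in totals}

print(training_set(codings))
print(coder_validity(codings))
```

The validity rate per coder is one simple way to spot strong and weak coders, and the filtered pairs are what would be passed on to train a classifier.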

Use Classification Scores as Filters

Iteration plays a critical role

Train, classify, filter

Repeat until the model is trusted

Each round weeds out false positives
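
The train, classify, filter loop can be sketched with a tiny naive Bayes scorer. The slides do not say which classifier the tool uses, so the model, the 0.5 threshold, the `human_review` stand-in, and the sample texts are all assumptions made for illustration.

```python
import math
from collections import Counter

def train(labeled):
    """Count words and documents per class from (text, is_relevant) pairs."""
    words = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, is_relevant in labeled:
        words[is_relevant].update(text.lower().split())
        docs[is_relevant] += 1
    return words, docs

def score(model, text):
    """Estimated probability that text is relevant (add-one smoothing)."""
    words, docs = model
    vocab_size = len(set(words[True]) | set(words[False]))
    totals = {label: sum(words[label].values()) for label in (True, False)}
    log_odds = math.log((docs[True] + 1) / (docs[False] + 1))
    for w in text.lower().split():
        p_rel = (words[True][w] + 1) / (totals[True] + vocab_size + 1)
        p_irr = (words[False][w] + 1) / (totals[False] + vocab_size + 1)
        log_odds += math.log(p_rel / p_irr)
    return 1 / (1 + math.exp(-log_odds))

def human_review(text):
    """Stand-in for a human coder; a real project would ask a person."""
    return "bridge" in text.lower()

labeled = [("bridge lanes closed", True), ("agatha christie novel", False)]
unlabeled = [
    "christie bridge scandal",
    "christie murder mystery novel",
    "traffic on the bridge",
    "agatha christie death",
]
THRESHOLD = 0.5

for _ in range(3):
    model = train(labeled)                                   # train
    flagged = [t for t in unlabeled if score(model, t) >= THRESHOLD]  # classify, filter
    # Flagged items go to review; rejected ones become negative examples.
    labeled += [(t, human_review(t)) for t in flagged]
    unlabeled = [t for t in unlabeled if t not in flagged]
```

Raising the threshold sends fewer, higher-confidence items to review in each round. Lowering it catches more borderline items at the cost of more false positives for the reviewers to reject.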

Classifier Histograms: More Filtering

http://sifter.texifter.com

Thanks for Listening

Dr. Stuart Shulman @[email protected] discovertext.com sifter.texifter.com
