Drew Conway: A Social Scientist's Perspective on Data Science

A social scientist’s perspectives on data science

Drew Conway

NYC Data Science Meetup

March 5, 2013

http://www.flickr.com/photos/uiowa/8047195100/

Hacking Skills

Obtain Munge

I hold the following truths to be self-evident...

1. Data come from many sources

2. Data come in many form(at)s

A .zip file of PDFs ≠ data

‣Data scientists must know where to get data and how to obtain it

‣Work with big text files

$ head publicvotes-20101018_votes.dump

‣Work with APIs

$ curl "http://search.twitter.com/search.json?q=@drewconway" > drewconway.json

Real data are messy

‣Even curated data: duplicates, missing values, date formats

‣Combine data from multiple sources/formats

‣Tools
• *NIX tools: sed, awk, grep
• Scripting languages: Perl, Python and R

$ cat ufo_awesome.tsv | grep probe | wc -l
131
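The kinds of mess listed above (duplicates, missing values, inconsistent date formats) can be illustrated with a minimal Python sketch. The date layouts, the two-column record shape, and the sample rows are all made up for illustration; a real data set would need its own rules.

```python
from datetime import datetime

# Hypothetical date layouts; a real data set would need its own list.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")

def parse_date(raw):
    """Return an ISO date string, or None if the value is missing or unparseable."""
    raw = (raw or "").strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean(rows):
    """Normalize dates, mark empty text as missing, and drop exact duplicates."""
    seen, out = set(), []
    for date, text in rows:
        record = (parse_date(date), text.strip() or None)
        if record not in seen:
            seen.add(record)
            out.append(record)
    return out

# Made-up rows standing in for a messy file.
rows = [
    ("1995-10-09", "bright light"),
    ("10/09/1995", "bright light"),  # same record, different date format
    ("19951010", ""),                # missing description
    ("unknown", "probe"),            # unparseable date
]
cleaned = clean(rows)
for record in cleaned:
    print(record)
```

Normalizing before de-duplicating matters here: the first two rows only become recognizable as duplicates once their dates share one format.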

Hacking Skills

While 80% of the effort is spent here, this is perhaps the most straightforward skill to teach

Heavily tool-focused; borrow from CS/EE curricula

‣Comfort working at the command-line, with text editors

‣A language for every season!

Conveying findings in creative and compelling ways

Math & Stats Knowledge

If: Better data beats better math

Then: What methods should be taught?

How do you find structure in new data?

‣Scatter plots
‣Density plots

Data exploration that scales

‣Reduce dimensionality
‣PCA, SVD, MDS
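As one illustration of reducing dimensionality, the sketch below estimates the first principal component of a small two-dimensional data set. It uses power iteration on the sample covariance matrix in place of a full PCA or SVD routine, purely to stay dependency-free; the data points are invented.

```python
import math

def first_principal_component(points, iterations=200):
    """Estimate the top principal axis of 2-D points by power iteration."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)

    v = (1.0, 0.5)  # arbitrary starting vector
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return v

# Toy data lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
axis = first_principal_component(points)
print(axis)
```

Projecting each centered point onto `axis` would give a one-dimensional summary that keeps most of the variance in this toy set.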

Methods must match data

‣Text
‣Geospatial
‣Web-scale

What is the ‘best’ model?

‣Most predictive
‣Most parsimonious
‣Cross-validation
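The trade-off between the most predictive and the most parsimonious model can be made concrete with k-fold cross-validation. The sketch below compares a constant-mean model against a least-squares line on invented data; the models, the fold scheme, and the data are assumptions chosen only to keep the example short.

```python
def fit_mean(xs, ys):
    """Parsimonious model: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """Least-squares line y = a + b * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def cv_mse(fit, xs, ys, k=5):
    """Mean squared error on held-out points, averaged over k folds."""
    total, count = 0.0, 0
    for fold in range(k):
        # Fold `fold` holds out every k-th observation.
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - model(xs[i])) ** 2 for i in test)
        count += len(test)
    return total / count

# Made-up, roughly linear data with a small deterministic wobble.
xs = list(range(20))
ys = [2.0 * x + 1.0 + (0.5 if x % 2 else -0.5) for x in xs]

mse_mean = cv_mse(fit_mean, xs, ys)
mse_line = cv_mse(fit_line, xs, ys)
print(mse_mean, mse_line)
```

Because each error is measured on points the model never saw, the comparison rewards extra parameters only when they actually improve prediction.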

Explore Model


Math & Stats Knowledge

Universities are good at methods training...

...but what methods fit into Data Science?

Things data scientists like...
‣Illustrating the current state of the world
‣Predicting future observations
‣Classifying/ranking observations

Things social scientists like...
‣Testable theoretical models
‣Natural experiments
‣Causality

1. When applicable
2. Right tool / right job
3. Open black boxes
4. Learn limitations

Substantive Expertise

Data Science, as a discipline, is fundamentally about human behavior

Inquire Interpret

Focus on questions / not tech

‣What new questions can be asked from web-scale data?

‣Tools are a means to an end

Social science has questions

‣Markets
‣Organization
‣Decision making

How do we know when the results we get make sense, if ever?

http://www.flickr.com/photos/cawley/3242403224/

Case Study: Methods for Collecting Large-Scale Non-Expert Text Coding



Median Voter Theorem

Theorem: In a majority-rule system, the outcome preferred by the median voter will prevail

http://thomasmoreinstitute.wordpress.com/2010/04/28/the-uk-election-and-the-curse-of-the-median-voter/

Assumption: The political/ideological preferences of voters can be projected onto a single numeric dimension
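A small worked example may help make the theorem and its one-dimensional assumption concrete. In the Python sketch below, each voter has an ideal point on a single numeric scale and votes for whichever of two positions is closer. The voter positions and challenger positions are invented for illustration.

```python
from statistics import median

def beats(voters, a, b):
    """Return True if position a wins a pairwise majority vote against b."""
    votes_a = sum(1 for v in voters if abs(v - a) < abs(v - b))
    votes_b = sum(1 for v in voters if abs(v - b) < abs(v - a))
    return votes_a > votes_b

# Made-up ideal points on a single left-right dimension.
voters = [-3.0, -1.5, -0.5, 0.2, 0.4, 1.0, 2.5]
median_position = median(voters)

# The median voter's position should defeat every alternative.
challengers = [-2.0, -1.0, 0.0, 0.3, 1.5]
results = [beats(voters, median_position, c) for c in challengers]
print(median_position, results)
```

Any challenger to the left of the median loses every voter at or to the right of it, and vice versa, which is already a majority.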



Median Voter Theorem

http://voteview.com/blog/?p=564

How do we calculate these numbers?



We make it up...

http://www.flickr.com/photos/estherlairlandesa/4649566079/

But, we have to!



http://en.wikipedia.org/wiki/File:Obama_Health_Care_Speech_to_Joint_Session_of_Congress.jpg

http://www.flickr.com/photos/becca02/6727193557/

A tale of two disciplines

Physics: Build instrument, Measure

Political Science: Observe action, Infer





One thing we have a lot of: text

Politicians
‣Speeches
‣Constituent communication

Parties
‣Platform / manifestos
‣Position statements

Countries
‣Diplomatic cables
‣Military declarations

Expert Coding!

How expert coding (typically) works

http://en.wikipedia.org/wiki/Official_Monster_Raving_Loony_Party

Expert Code Book

1. Health & Safety: We propose to ban Self Responsibility on the grounds that it may be dangerous to your health.

2. M.P’s Expenses: We propose that instead of a second home allowance M.P’s will have a caravan which will be parked outside the Houses of Parliament. This will make it easier as flipping a caravan is easier than flipping homes

3. Eurofit: The European Constitution which will be sorted out by going for a long Walk. “As everyone knows that walking is good for the constitution”

Manifesto

Party                 Year  Score
Monster Raving Loony  2010  -2

DATA!



What’s wrong with experts?

They’re slow

They’re biased

They’re expensive

They’re wrong

Can we use non-experts to code political manifestos?

How can we measure the quality/validity of non-expert codings?

Use Mechanical Turk to code many manifesto fragments.

Experimental approach

Expert codings

Texts: 18 “big 3” British party manifestos 1987-2010

Experts: 5 advanced poli. sci. graduate students + 2 tenured faculty

Coding: deliberately simple schema

Baseline data

Three experiments

No Qualification: Anyone in

Low-Threshold: 4/6 Correct

High-Threshold: 5/6 Correct

MT codings

Experimental design

Hypothesis: Stronger filter on Turkers leads to better coding

Filter: Use MT qualification test as gatekeeper
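The gatekeeper idea can be sketched as a simple threshold on a six-question qualification test. This is an illustrative Python sketch, not the project's actual code: it assumes that "4/6 Correct" and "5/6 Correct" mean at least that many correct answers, and the worker names and scores are hypothetical.

```python
# Thresholds from the three experiments; None means no test is required.
# Assumption: "4/6 Correct" means at least 4 correct answers out of 6.
THRESHOLDS = {"No Qualification": None, "Low-Threshold": 4, "High-Threshold": 5}

def admitted(experiment, correct_answers):
    """Return True if a worker with this many correct answers may take HITs."""
    minimum = THRESHOLDS[experiment]
    return minimum is None or correct_answers >= minimum

# Hypothetical workers and their qualification-test scores (out of 6).
scores = {"worker_a": 3, "worker_b": 4, "worker_c": 6}

pools = {
    name: sorted(w for w, s in scores.items() if admitted(name, s))
    for name in THRESHOLDS
}
print(pools)
```

A stricter threshold shrinks the pool of eligible coders, which is the trade-off the hypothesis puts to the test.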

How do we think about coding a manifesto fragment?

Example text coding HIT from the experiment

How do we implement this (aka, the glue)?

Expert codings

[{ "text_unit_id": ..., "sentence_text": ..., ... }, ...]

Random sample, as JSON

EC2

S3

MT

Dynamically generate HITs

MT codings

Push HITs + retrieve results

Statistical analysis of results
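The first step of the pipeline above (a random sample of expert-coded text units, serialized as JSON) might look roughly like the following Python sketch. Only the field names `text_unit_id` and `sentence_text` come from the slide; the function name, the placeholder sentences, the sample size, and the seed are assumptions.

```python
import json
import random

def sample_as_json(units, n, seed=0):
    """Draw n text units at random and serialize them as a JSON array."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return json.dumps(rng.sample(units, n), indent=2)

# Placeholder text units standing in for the expert-coded sentences.
units = [
    {"text_unit_id": i, "sentence_text": f"Placeholder sentence {i}."}
    for i in range(100)
]

payload = sample_as_json(units, 5)
print(payload)
```

A payload of this shape could then be stored and read back when HITs are generated dynamically.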

Scholarship, FTW!

https://github.com/drewconway/mturk_coder_quality



What’s good about MT non-experts?

They’re fast

They’re biased?

They’re cheap

They’re wrong?

The last crowd-sourced coding job was for 600 sentences and got 4,300 sentences coded in about 20 hours (about 3.6 sentences per minute)

•We pay about $0.02 / sentence
•Typical manifesto (in British set) has 1,000 sentences
•Whole manifesto coded for $20
•By comparison, the CMP pays expert coders about €150 per manifesto, call it €0.15 or $0.20 / sentence: 10x more per sentence
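The cost and throughput figures can be checked with a few lines of arithmetic. The numbers below are taken from the slides; the $0.20 figure is the slide's own rounded conversion of €0.15 per sentence, not an exchange rate computed here.

```python
MT_USD_PER_SENTENCE = 0.02
SENTENCES_PER_MANIFESTO = 1000
EXPERT_EUR_PER_MANIFESTO = 150.0
EXPERT_USD_PER_SENTENCE = 0.20  # the slide's rounded conversion of EUR 0.15

# Cost of coding a whole manifesto on Mechanical Turk.
mt_usd_per_manifesto = MT_USD_PER_SENTENCE * SENTENCES_PER_MANIFESTO

# Expert cost per sentence, in euros.
expert_eur_per_sentence = EXPERT_EUR_PER_MANIFESTO / SENTENCES_PER_MANIFESTO

# How many times more an expert-coded sentence costs.
cost_ratio = EXPERT_USD_PER_SENTENCE / MT_USD_PER_SENTENCE

# Throughput of the last job: 4,300 sentences in about 20 hours.
sentences_per_minute = 4300 / (20 * 60)

print(round(mt_usd_per_manifesto, 2), round(expert_eur_per_sentence, 2))
print(round(cost_ratio, 1), round(sentences_per_minute, 1))
```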

Results / Kappa Statistic

Experiment      Sentences  # MT Coders  % Agreement  k*    Std. Error  z
No Qual.        1,315      89           0.65         0.47  0.13        22.6
Low-Threshold   1,393      56           0.7          0.54  0.12        26.7
High-Threshold  1,250      23           0.62         0.41  0.13        18.3

* A k value between 0.4-0.6 is considered “moderate” agreement

Agreement by experiment
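For readers unfamiliar with the kappa statistic, the sketch below computes Cohen's kappa for two coders. The slides do not say which kappa variant was used, so Cohen's two-rater form is an assumption, and the two label sequences are invented rather than taken from the experiment.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two coders, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both coders pick the same category
    # if each labels independently at their own marginal rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented codings of ten sentences by an expert and a non-expert.
expert = ["Economic", "Economic", "Social", "Social", "Neither",
          "Neither", "Economic", "Social", "Neither", "Economic"]
turker = ["Economic", "Economic", "Social", "Social", "Economic",
          "Social", "Economic", "Social", "Neither", "Neither"]

kappa = cohens_kappa(expert, turker)
print(round(kappa, 3))
```

Here raw agreement is 0.70 but chance agreement is 0.34, so kappa comes out lower than the raw figure, which is why the table reports both.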

Experiment      Expert Coding  MT % Agreement
No Qual.        Economic       0.77
No Qual.        Social         0.92
No Qual.        Neither        0.22
Low-Threshold   Economic       0.87
Low-Threshold   Social         0.98
Low-Threshold   Neither        0.2
High-Threshold  Economic       0.77
High-Threshold  Social         0.91
High-Threshold  Neither        0.09

Agreement by expert-coding

Results of initial MT experiments

Results / Kappa Statistic

Experiment  Sentences  # MT Coders  % Agreement  k*    Std. Error  z
Econ-only   942        15           0.62         0.23  0.1         4.28
Soc-only    955        32           0.6          0.17  0.09        0.95

* A k value between 0.4-0.6 is considered “moderate” agreement

Experiment     Expert Coding  MT % Agreement
Economic-only  Economic       0.92
Economic-only  Neither        0.28
Social-only    Social         0.97
Social-only    Neither        0.19

Non-experts have a very hard time with a “null” coding!

Separating Social and Economic Sentences

Joint work with...

Michael Laver, NYU

Kenneth Benoit, LSE

Slava Mikhaylov, UCL

Paper: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2260437

Presentation: http://bit.ly/nonexperts



Project Florida

Coder performance stability

[Figure panels: No Qualification, Low-threshold, High-threshold]

Performance becomes very stable after approximately 20 HITs

Party shifts: economic

Party shifts: social
