A social scientist’s perspectives on data science, Drew Conway, NYC Data Science Meetup, March 5, 2013, http://www.flickr.com/photos/uiowa/8047195100/

Drew Conway: A Social Scientist's Perspective on Data Science

Nov 12, 2014

mortardata

At the NYC Data Science meetup on March 5, 2014, Drew Conway (head of data at Project Florida and co-author of the book Machine Learning for Hackers) spoke about his own research using the tools of data science to tackle problems in political science.
Transcript
Page 1: Drew Conway: A Social Scientist's Perspective on Data Science

A social scientist’s perspectives on data science

Drew Conway

NYC Data Science Meetup

March 5, 2013

http://www.flickr.com/photos/uiowa/8047195100/

Page 2: Drew Conway: A Social Scientist's Perspective on Data Science
Page 3: Drew Conway: A Social Scientist's Perspective on Data Science

Hacking Skills

Obtain Munge

I hold the following truths to be self-evident...

1.Data come from many sources

2.Data come in many form(at)s

A .zip file of PDFs ≠ data

‣ Data scientists must know where to get data and how to obtain it

‣Work with big text files

$ head publicvotes-20101018_votes.dump

‣Work with APIs

$ curl "http://search.twitter.com/search.json?q=@drewconway" > drewconway.json

Real data are messy

‣ Even curated data: duplicates, missing values, date formats

‣Combine data from multiple sources/formats

‣ Tools
  • *NIX tools: sed, awk, grep
  • Scripting languages: Perl, Python and R

$ cat ufo_awesome.tsv | grep probe | wc -l
131
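
The munging problems named on this slide (duplicates, missing values, inconsistent date formats) can be sketched in a few lines of Python. The records, field names, and list of accepted date formats below are invented for illustration; they are not taken from the data sets used in the talk.

```python
from datetime import datetime

# Hypothetical raw records: an exact duplicate, a duplicate hidden behind a
# different date format, and one missing value.
RAW_ROWS = [
    {"sighted": "2010-10-18", "state": "NY", "shape": "disk"},
    {"sighted": "10/18/2010", "state": "NY", "shape": "disk"},
    {"sighted": "18 Oct 2010", "state": "", "shape": "light"},
    {"sighted": "2010-10-18", "state": "NY", "shape": "disk"},
]

# Date formats we assume may appear; extend the tuple as new ones turn up.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def parse_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def munge(rows):
    """Normalize dates, mark empty fields as None, and drop duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        record = {
            "sighted": parse_date(row["sighted"]),
            "state": row["state"].strip() or None,
            "shape": row["shape"].strip() or None,
        }
        # Sort by field name only, so None values are never compared.
        key = tuple(sorted(record.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


CLEANED = munge(RAW_ROWS)
```

Normalizing before de-duplicating matters here: the second row only becomes recognizable as a duplicate of the first once both dates are in the same format.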

Page 4: Drew Conway: A Social Scientist's Perspective on Data Science

Hacking Skills

While 80% of effort is spent here, perhaps most straightforward to teach

Heavily tool focused, borrow from CS/EE curriculums

‣Comfort working at the command-line, with text editors

‣A language for every season!

Conveying findings in creative and compelling ways

Page 5: Drew Conway: A Social Scientist's Perspective on Data Science

Math & Stats

Knowledge

If: Better data beats better math
Then: What methods should be taught?

How do you find structure in new data?
‣ Scatter plots
‣ Density plots

Data exploration that scales
‣ Reduce dimensionality
‣ PCA, SVD, MDS

Methods must match data
‣ Text
‣ Geospatial
‣ Web-scale

What is the ‘best’ model?
‣ Most predictive
‣ Most parsimonious
‣ Cross-validation

Explore Model
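
One way to make the "most predictive vs. most parsimonious" question concrete is k-fold cross-validation. The sketch below is only an illustration under stated assumptions: the toy data, the two candidate models (a constant mean and a one-variable least-squares line), and the interleaved fold assignment are all choices made here, not methods described in the talk.

```python
def fit_mean(xs, ys):
    """Parsimonious baseline: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cross_val_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error over k interleaved folds."""
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Toy data: a linear trend plus a small deterministic wiggle.
XS = [float(i) for i in range(20)]
YS = [2.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(XS)]

MSE_MEAN = cross_val_mse(fit_mean, XS, YS)
MSE_LINE = cross_val_mse(fit_line, XS, YS)
```

Because every error is measured on points the model did not see during fitting, the comparison rewards predictive accuracy rather than fit to the training data.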

Page 6: Drew Conway: A Social Scientist's Perspective on Data Science


Math & Stats

Knowledge

Universities are good at methods training...
...but what methods fit into Data Science?

Things data scientists like...
‣ Illustrating the current state of the world
‣ Predicting future observations
‣ Classifying/ranking observations

Things social scientists like...
‣ Testable theoretical models
‣ Natural experiments
‣ Causality

1. When applicable
2. Right tool / right job
3. Open black boxes
4. Learn limitations

Page 7: Drew Conway: A Social Scientist's Perspective on Data Science

Substantive

Expertise

Data Science, as a discipline, is fundamentally about human behavior

Inquire Interpret

Focus on questions / not tech

‣What new questions can be asked from web-scale data?

‣Tools are a means to an end

Social science has questions

‣ Markets
‣ Organization
‣ Decision making

How do we know when the results we get make sense, if ever?

Page 8: Drew Conway: A Social Scientist's Perspective on Data Science

http://www.flickr.com/photos/cawley/3242403224/

Case Study: Methods for Collecting Large-Scale Non-Expert Text Coding

Page 9: Drew Conway: A Social Scientist's Perspective on Data Science

Median Voter Theorem

Theorem: In a majority rules system, the preference of the median voter will succeed

http://thomasmoreinstitute.wordpress.com/2010/04/28/the-uk-election-and-the-curse-of-the-median-voter/

Assumption: The political/ideological preferences of voters can be projected onto a single numeric dimension
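
The theorem and its one-dimensional assumption can be checked on a toy electorate. In this sketch the voter ideal points and challenger positions are made up, and each voter is assumed to support whichever position is closer to their own ideal point (a simple form of single-peaked preferences); none of this comes from the talk itself.

```python
from statistics import median


def votes_for(a, b, ideal_points):
    """Count voters strictly closer to position a than to position b."""
    return sum(1 for v in ideal_points if abs(v - a) < abs(v - b))


def beats(a, b, ideal_points):
    """True if position a wins a pairwise majority vote against b."""
    return votes_for(a, b, ideal_points) > votes_for(b, a, ideal_points)


# Hypothetical voter ideal points on a single left-right dimension.
VOTERS = [-2.0, -1.5, -0.5, 0.0, 0.3, 1.0, 2.5]
MEDIAN_POSITION = median(VOTERS)

# Challenger positions to pit against the median voter's preference.
CHALLENGERS = [-2.0, -1.0, -0.25, 0.5, 1.5, 2.5]
MEDIAN_WINS_ALL = all(beats(MEDIAN_POSITION, c, VOTERS) for c in CHALLENGERS)
```

The whole exercise depends on having a number for each voter or party, which is exactly the measurement problem the following slides take up.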

Page 10: Drew Conway: A Social Scientist's Perspective on Data Science

Median Voter Theorem

http://voteview.com/blog/?p=564

How do we calculate these numbers?

Page 11: Drew Conway: A Social Scientist's Perspective on Data Science

We make it up...

http://www.flickr.com/photos/estherlairlandesa/4649566079/

But, we have to!

Page 12: Drew Conway: A Social Scientist's Perspective on Data Science

http://en.wikipedia.org/wiki/File:Obama_Health_Care_Speech_to_Joint_Session_of_Congress.jpg

http://www.flickr.com/photos/becca02/6727193557/

A tale of two disciplines

Physics: Build instrument → Measure

Political Science: Observe action → Infer

Page 13: Drew Conway: A Social Scientist's Perspective on Data Science

One thing we have a lot of: text

Politicians
‣ Speeches
‣ Constituent communication

Parties
‣ Platform / manifestos
‣ Position statements

Countries
‣ Diplomatic cables
‣ Military declarations

Expert Coding!

Page 14: Drew Conway: A Social Scientist's Perspective on Data Science

How expert coding (typically) works

http://en.wikipedia.org/wiki/Official_Monster_Raving_Loony_Party

Expert Code Book

1. Health & Safety: We propose to ban Self Responsibilty on the grounds that it may be dangerous to your health.

2. M.P’s Expenses: We propose that instead of a second home allowance M.P’s will have a caravan which will be parked outside the Houses of Parliament. This will make it easier as flipping a caravan is easier than flipping homes

3. Eurofit: The European Constitution which will be sorted out by going for a long Walk. “As everyone knows that walking is good for the constitution”

Manifesto

Party                 Year  Score
Monster Raving Loony  2010  -2

DATA!

Page 15: Drew Conway: A Social Scientist's Perspective on Data Science

What’s wrong with experts?

They’re slow

They’re biased

They’re expensive

They’re wrong

Page 16: Drew Conway: A Social Scientist's Perspective on Data Science

Can we use non-experts to code political manifestos?

How can we measure the quality/validity of non-expert codings?

Use Mechanical Turk to code many manifesto fragments.

Page 17: Drew Conway: A Social Scientist's Perspective on Data Science

Experimental approach

Expert codings

Texts: 18 “big 3” British party manifestos 1987-2010

Experts: 5 advanced poli. sci. graduate students + 2 tenured faculty

Coding: deliberately simple schema

Baseline data

MT codings: three experiments

No Qualification: Anyone in
Low-Threshold: 4/6 Correct
High-Threshold: 5/6 Correct

Experimental design

Hypothesis: Stronger filter on Turkers leads to better coding

Filter: Use MT qualification test as gatekeeper
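
The gatekeeping rule for the three experiments can be written down as a small filter. This sketch assumes that "4/6 Correct" and "5/6 Correct" mean at least that many correct answers on a six-question qualification test; the worker IDs and scores are hypothetical, and no Mechanical Turk API call is shown.

```python
# Minimum number of correct answers (out of 6) needed to enter each experiment.
# Reading "4/6" and "5/6" as "at least" is an assumption of this sketch.
MIN_CORRECT = {
    "No Qualification": 0,
    "Low-Threshold": 4,
    "High-Threshold": 5,
}


def admitted(experiment, scores):
    """Return the sorted IDs of workers whose score meets the experiment's bar."""
    bar = MIN_CORRECT[experiment]
    return sorted(worker for worker, score in scores.items() if score >= bar)


# Hypothetical qualification-test scores.
SCORES = {"worker_a": 6, "worker_b": 5, "worker_c": 4, "worker_d": 2}

ADMITTED = {name: admitted(name, SCORES) for name in MIN_CORRECT}
```

Raising the bar shrinks the pool of available coders, which is the trade-off the hypothesis sets against coding quality.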

Page 18: Drew Conway: A Social Scientist's Perspective on Data Science

How do we think about coding a manifesto fragment?

Page 19: Drew Conway: A Social Scientist's Perspective on Data Science

Example text coding HIT from the experiment

Page 20: Drew Conway: A Social Scientist's Perspective on Data Science

How do we implement this (aka, the glue)?

Expert codings

[{ 'text_unit_id': ..., 'sentence_text': ..., .... }, ...]

Random sample, as JSON

EC2

S3

MT

Dynamically generate HITs

MT codings

Push HITs + retrieve results

Statistical analysis of results

Scholarship, FTW!

https://github.com/drewconway/mturk_coder_quality
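
The "random sample, as JSON" step of this pipeline might look roughly like the sketch below. Only the two keys named on the slide ('text_unit_id' and 'sentence_text') are used; the placeholder sentences, the function name, and the fixed seed are assumptions made here, and the code in the linked repository may differ. The EC2, S3, and Mechanical Turk steps are deliberately left out.

```python
import json
import random

# Hypothetical expert-coded sentences; the remaining fields elided on the
# slide are deliberately not guessed at.
SENTENCES = [
    {"text_unit_id": i, "sentence_text": f"Placeholder manifesto sentence {i}."}
    for i in range(1, 101)
]


def sample_as_json(sentences, n, seed=None):
    """Draw n sentences without replacement and serialize them as JSON."""
    rng = random.Random(seed)
    return json.dumps(rng.sample(sentences, n), indent=2)


# A fixed seed makes the sample reproducible across runs.
PAYLOAD = sample_as_json(SENTENCES, n=5, seed=42)
```

A JSON payload of this shape is easy to hand to whatever service generates the HITs, and the seed lets the same sample be regenerated for later analysis.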

Page 21: Drew Conway: A Social Scientist's Perspective on Data Science

What’s good about MT non-experts?

They’re fast

They’re biased?

They’re cheap

They’re wrong?

The last crowd-sourced coding job was for 600 sentences and got 4,300 sentences coded in about 20 hours (about 3.6 sentences per minute)

• We pay about $0.02 / sentence
• Typical manifesto (in British set) has 1,000 sentences
• Whole manifesto coded for $20
• By comparison, the CMP pays expert coders about €150 per manifesto, call it €0.15 or $0.20 per sentence: 10x more per sentence
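
The speed and cost figures on this slide are simple arithmetic, reproduced here as a check. All inputs are the slide's own numbers; the $0.20 value is the slide's rounded dollar equivalent of €0.15, not a conversion computed here.

```python
# Throughput: 4,300 sentences coded in about 20 hours.
SENTENCES_CODED = 4300
HOURS = 20
SENTENCES_PER_MINUTE = SENTENCES_CODED / (HOURS * 60)

# Crowd cost: about $0.02 per sentence, 1,000 sentences per manifesto.
MT_USD_PER_SENTENCE = 0.02
SENTENCES_PER_MANIFESTO = 1000
MT_USD_PER_MANIFESTO = MT_USD_PER_SENTENCE * SENTENCES_PER_MANIFESTO

# Expert cost: about EUR 150 per manifesto, i.e. EUR 0.15 per sentence,
# which the slide rounds to roughly $0.20 per sentence.
EXPERT_EUR_PER_MANIFESTO = 150
EXPERT_EUR_PER_SENTENCE = EXPERT_EUR_PER_MANIFESTO / SENTENCES_PER_MANIFESTO
EXPERT_USD_PER_SENTENCE = 0.20  # slide's rounded figure

COST_RATIO = EXPERT_USD_PER_SENTENCE / MT_USD_PER_SENTENCE
```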

Page 22: Drew Conway: A Social Scientist's Perspective on Data Science

Agreement by experiment

                Results                              Kappa Statistic
Experiment      Sentences  # MT Coders  % Agreement  k*    Std. Error  z
No Qual.        1,315      89           0.65         0.47  0.13        22.6
Low-Threshold   1,393      56           0.7          0.54  0.12        26.7
High-Threshold  1,250      23           0.62         0.41  0.13        18.3

* A k value between 0.4-0.6 is considered “moderate” agreement

Agreement by expert-coding

Experiment      Expert Coding  MT % Agreement
No Qual.        Economic       0.77
                Social         0.92
                Neither        0.22
Low-Threshold   Economic       0.87
                Social         0.98
                Neither        0.2
High-Threshold  Economic       0.77
                Social         0.91
                Neither        0.09

Results of initial MT experiments
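
For readers unfamiliar with the statistics in these tables, the sketch below computes raw percent agreement, a kappa statistic, and agreement broken down by the expert's code. The slides do not say which kappa variant was used, so Cohen's kappa for two label sequences is an assumption here, and the ten labels are invented to keep the arithmetic easy to follow.

```python
from collections import Counter


def percent_agreement(expert, crowd):
    """Share of items on which the two codings assign the same label."""
    return sum(e == c for e, c in zip(expert, crowd)) / len(expert)


def cohens_kappa(expert, crowd):
    """Agreement corrected for the agreement expected by chance."""
    n = len(expert)
    observed = percent_agreement(expert, crowd)
    expert_counts = Counter(expert)
    crowd_counts = Counter(crowd)
    expected = sum(
        expert_counts[label] * crowd_counts[label] for label in expert_counts
    ) / (n * n)
    return (observed - expected) / (1 - expected)


def agreement_by_expert_code(expert, crowd):
    """Percent agreement computed separately for each expert-assigned label."""
    totals = Counter(expert)
    hits = Counter(e for e, c in zip(expert, crowd) if e == c)
    return {label: hits[label] / totals[label] for label in totals}


# Invented codings for ten sentences.
EXPERT = ["Economic"] * 4 + ["Social"] * 3 + ["Neither"] * 3
CROWD = ["Economic", "Economic", "Economic", "Social",
         "Social", "Social", "Social",
         "Economic", "Social", "Neither"]

AGREEMENT = percent_agreement(EXPERT, CROWD)
KAPPA = cohens_kappa(EXPERT, CROWD)
BY_CODE = agreement_by_expert_code(EXPERT, CROWD)
```

In this toy data the crowd labels match on 7 of 10 sentences, chance agreement is 0.34, and kappa is 0.36 / 0.66 (about 0.55), with most of the disagreement concentrated in the "Neither" category.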

Page 23: Drew Conway: A Social Scientist's Perspective on Data Science

            Results                              Kappa Statistic
Experiment  Sentences  # MT Coders  % Agreement  k*    Std. Error  z
Econ-only   942        15           0.62         0.23  0.1         4.28
Soc-only    955        32           0.6          0.17  0.09        0.95

* A k value between 0.4-0.6 is considered “moderate” agreement

Experiment     Expert Coding  MT % Agreement
Economic-only  Economic       0.92
               Neither        0.28
Social-only    Social         0.97
               Neither        0.19

Non-experts have a very hard time with a “null” coding!

Separating Social and Economic Sentences

Page 24: Drew Conway: A Social Scientist's Perspective on Data Science

Joint work with...

Michael Laver, NYU

Kenneth Benoit, LSE

Slava Mikhaylov, UCL

Paper: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2260437

Presentation: http://bit.ly/nonexperts

Page 25: Drew Conway: A Social Scientist's Perspective on Data Science

Project Florida

Page 26: Drew Conway: A Social Scientist's Perspective on Data Science

Coder performance stability

[Figure panels: No Qualification, Low-threshold, High-threshold]

Performance becomes very stable after approximately 20 HITs
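
The slide does not say how coder performance was tracked, so the following is only one plausible reading: a running agreement rate per coder, updated after each completed HIT. The function name and the sequence of outcomes are hypothetical.

```python
def running_agreement(outcomes):
    """Cumulative share of HITs coded in agreement with experts, after each HIT."""
    rates = []
    correct = 0
    for completed, agreed in enumerate(outcomes, start=1):
        if agreed:
            correct += 1
        rates.append(correct / completed)
    return rates


# Hypothetical coder: True means the HIT matched the expert coding.
OUTCOMES = [True, False, True, True]
RATES = running_agreement(OUTCOMES)
```

Early values of such a curve swing widely because each new HIT carries a large weight; the swings shrink as the number of completed HITs grows, which is the kind of stabilization the slide describes.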

Page 27: Drew Conway: A Social Scientist's Perspective on Data Science

Party shifts: economic

Page 28: Drew Conway: A Social Scientist's Perspective on Data Science

Party shifts: social