An Automated Snowball Census of the Political Web Abe Gong University of Michigan JITP 2011

Nov 29, 2014


Abe Gong

Working abstract: This paper solves a persistent methodological problem for social scientists studying the political web: representative sampling. Virtually all existing studies of the political web are based on incomplete samples and therefore lack generalizability. In this paper, I combine methods from computer science and sampling theory to conduct an automated snowball census of the political web and construct an all-but-complete index of English-language political websites. I check the robustness of this index, use it to generate descriptive statistics for the entire political web, and demonstrate that studies based on ad hoc sampling strategies are likely to be biased in important ways. In future research, this bias can be eliminated by using the index as a sampling universe. In addition, the methods and open-source software presented here can be used to create similar sampling frames for other online content domains.
Transcript
Page 1: An Automated Snowball Census of the Political Web - JITP 2011

An Automated Snowball Census of the Political Web

Abe Gong
University of Michigan

JITP 2011

Page 5: An Automated Snowball Census of the Political Web - JITP 2011

Motivation

The blogosphere is one of the best sources of political data in all history.

Understanding political bloggers can help us understand political participation more broadly.

In order to compare “the average blogger” to “the average citizen,” we need a representative sample of bloggers.

Page 6: An Automated Snowball Census of the Political Web - JITP 2011

Wanted: A sampling frame for all political bloggers

Page 7: An Automated Snowball Census of the Political Web - JITP 2011

Challenges: scale and sparseness

No complete index of blogs exists, let alone political blogs

• 250 million web sites
• 40 new sites created every minute
• Only 3 in 1,000 sites are political

Page 8: An Automated Snowball Census of the Political Web - JITP 2011

Previous research

Sample Types
• Convenience

Examples
• Johnson and Kaye, 2004
• Leskovec, Backstrom and Kleinberg, 2009

Big Data, but no attempt at representativeness

Page 9: An Automated Snowball Census of the Political Web - JITP 2011

Previous research

Sample Types
• Convenience
• Prominence

Examples
• McKenna and Pole, 2008
• Wallsten, 2008

Good data, but only includes popular sites.

Page 10: An Automated Snowball Census of the Political Web - JITP 2011

Previous research

Sample Types
• Convenience
• Prominence
• Snowball

Examples
• Hindman, Tsioutsiouliklis, and Johnson, 2003
• Karpf, 2008

Sample properties unclear

Page 11: An Automated Snowball Census of the Political Web - JITP 2011

Previous research

Sample Types
• Convenience
• Prominence
• Snowball
• Over-sample

Examples
• Lenhart and Fox, 2006
• Schlozman, Verba, and Brady, 2010
• Lawrence, Sides, and Farrell, 2010
• Karen's US-IMPACT study

Representative sample, but linking to Big Data is hard

Page 12: An Automated Snowball Census of the Political Web - JITP 2011

Methodology

1. Start from a seed batch of political sites.

2. Download and classify each site in the batch.

3. For political sites, harvest outbound hyperlinks and add unvisited links to the next batch.

4. Repeat from step 2 until no new links are found.
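The four steps above can be sketched as a batch-by-batch crawl. This is only an illustration, not SnowCrawl's actual code: the function name `snowball_census`, the three callbacks, and the toy web `TOY_WEB` with its `.example` hosts are all assumptions made up for this sketch.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical page representation: (page text, outbound links).
Page = Tuple[str, List[str]]


def snowball_census(
    seeds: Iterable[str],
    fetch: Callable[[str], Page],
    is_political: Callable[[Page], bool],
    extract_links: Callable[[Page], Iterable[str]],
) -> Set[str]:
    """Return the political sites reachable from the seeds through political sites."""
    visited: Set[str] = set()
    political: Set[str] = set()
    batch: Set[str] = set(seeds)                # step 1: seed batch
    while batch:                                # step 4: stop when no new links are found
        next_batch: Set[str] = set()
        for site in batch:
            visited.add(site)
            page = fetch(site)                  # step 2: download ...
            if not is_political(page):          # ... and classify
                continue
            political.add(site)
            for link in extract_links(page):    # step 3: harvest outbound links
                if link not in visited and link not in batch:
                    next_batch.add(link)
        batch = next_batch
    return political


# Toy web (invented for illustration): site -> (text, outbound links).
TOY_WEB: Dict[str, Page] = {
    "a.example": ("election senate vote", ["b.example", "c.example"]),
    "b.example": ("congress policy vote", ["a.example", "d.example"]),
    "c.example": ("recipes cooking", ["e.example"]),
    "d.example": ("campaign vote", []),
    "e.example": ("election vote", []),  # political, but linked only from c
}

found = snowball_census(
    seeds=["a.example"],
    fetch=lambda site: TOY_WEB[site],
    is_political=lambda page: "vote" in page[0],   # stand-in for the text classifier
    extract_links=lambda page: page[1],
)
print(sorted(found))
```

In this toy web the crawl never reaches `e.example`, because links are harvested only from sites classified as political. A political site linked only from non-political sites stays outside the census.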

Page 13: An Automated Snowball Census of the Political Web - JITP 2011

Toy Example

Page 14: An Automated Snowball Census of the Political Web - JITP 2011

Toy Example

Page 15: An Automated Snowball Census of the Political Web - JITP 2011

Toy Example

Page 16: An Automated Snowball Census of the Political Web - JITP 2011

Toy Example

Page 17: An Automated Snowball Census of the Political Web - JITP 2011

Bag-of-words logit regression

Prob(political) ≈ logit⁻¹(α + βX)

X = Vector of word counts

α = Bias term

β = Word weights

1. Hand-code a training sample (n=2,000)

2. Calibrate the computer

3. Hand-code a testing sample (n=200)

4. Evaluate the classifier
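A minimal sketch of the train-then-evaluate procedure above, using only the standard library. It assumes plain stochastic gradient ascent for the fitting step, which the slides do not specify. The function names, the learning rate, and the six-document training set are invented for illustration and are far smaller than the hand-coded samples described here.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def word_counts(text: str) -> Counter:
    """X: the bag-of-words vector, stored sparsely as word -> count."""
    return Counter(text.lower().split())


def predict(alpha: float, beta: Dict[str, float], x: Counter) -> float:
    """Prob(political) as the inverse logit of alpha + beta . X."""
    z = alpha + sum(beta.get(word, 0.0) * count for word, count in x.items())
    return 1.0 / (1.0 + math.exp(-z))


def fit(docs: List[Tuple[str, int]], epochs: int = 200,
        lr: float = 0.1) -> Tuple[float, Dict[str, float]]:
    """Estimate alpha (bias) and beta (word weights) by stochastic gradient ascent."""
    alpha: float = 0.0
    beta: Dict[str, float] = {}
    for _ in range(epochs):
        for text, label in docs:
            x = word_counts(text)
            error = label - predict(alpha, beta, x)   # gradient of the log-likelihood
            alpha += lr * error
            for word, count in x.items():
                beta[word] = beta.get(word, 0.0) + lr * error * count
    return alpha, beta


# Invented hand-coded training sample: 1 = political, 0 = not political.
train = [
    ("senate vote election campaign", 1),
    ("congress passes tax policy bill", 1),
    ("president election debate vote", 1),
    ("chocolate cake recipe butter", 0),
    ("football game score tonight", 0),
    ("new phone review battery camera", 0),
]
alpha, beta = fit(train)

# Evaluation on held-out text: classify as political when the probability exceeds 0.5.
p_political = predict(alpha, beta, word_counts("election vote tonight"))
p_other = predict(alpha, beta, word_counts("cake recipe review"))
```

Words that never appeared in training get weight zero, so an unseen document is scored only on the vocabulary the classifier has already learned.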

Page 18: An Automated Snowball Census of the Political Web - JITP 2011

Text Classifier Word Cloud

Page 19: An Automated Snowball Census of the Political Web - JITP 2011

Classifier reliability

Human-human: 80.9%

Human-computer: 81.0%

Krippendorff's Alpha: .733
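For readers unfamiliar with these statistics, here is a small sketch of how percent agreement and Krippendorff's alpha can be computed for two coders assigning nominal labels with no missing data. The slides do not say exactly how the figures above were calculated, so this is the textbook two-coder formula, not necessarily the author's procedure. The function names and label lists are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Share of units on which the two coders assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def krippendorff_alpha_nominal(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Alpha = 1 - D_o / D_e for two coders, nominal labels, no missing values."""
    n_units = len(a)
    n_values = 2 * n_units                      # every unit contributes two labels
    observed = sum(x != y for x, y in zip(a, b)) / n_units
    totals = Counter(a) + Counter(b)            # label frequencies pooled over coders
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n_values * (n_values - 1))
    return 1.0 - observed / expected            # undefined if only one label occurs


# Invented codes for ten sites: 1 = political, 0 = not political.
coder_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
coder_b = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

agreement = percent_agreement(coder_a, coder_b)
alpha_k = krippendorff_alpha_nominal(coder_a, coder_b)
```

Alpha is lower than raw agreement because it discounts the agreement two coders would reach by chance given how often each label is used.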

Page 20: An Automated Snowball Census of the Political Web - JITP 2011

Census Results

Implemented in Python: SnowCrawl

Executes in less than 24 hours

1.8 million sites crawled

800,000 political

42% blogs

http://code.google.com/p/snowcrawl

Page 21: An Automated Snowball Census of the Political Web - JITP 2011

Comparison by strata

                      Top 500    Top 5,000   Census
Organization
  Owned by orgs       66.1***    53.1        44.4
  Multiple authors    75.2*      66.7        62.2
  M-updates/day       43.4***    19.4***     6.1
Design
  Advertising         67.3**     57.1        51.2
  Blogroll            57.5*      66.3***     45.1
  Video               48.7***    35.7***     18.3

Page 22: An Automated Snowball Census of the Political Web - JITP 2011

Comparison by strata

                              Top 500    Top 5,000   Census
Polls and public opinion      70.8***    65.3*       52.4
Elections and campaigns       50.4       45.9        51.2
Legislation and law-making    43.4       41.8        43.9
Implementation of policy      38.1       39.8        30.5
Decisions by courts           34.5***    24.5        17.1
Political figures             46.0***    39.8**      24.4
Political parties             38.9***    32.7*       20.7
Philosophical discussion      26.5       29.6        25.6
State and local government    36.3*      38.8**      24.4
Foreign policy                42.5***    38.8***     15.9
International relations       31.9**     33.7**      18.3

Page 23: An Automated Snowball Census of the Political Web - JITP 2011

Where next?

● Survey of bloggers
● Poststratification weighting
● Network analysis
● Content analysis of blogs
● Blog post panel
● Sentiment analysis/Survey imputation
● Re-implement in Hadoop

Page 24: An Automated Snowball Census of the Political Web - JITP 2011

Where next ...?

ANES

GSS

Roxy...?

?

?

?

Page 25: An Automated Snowball Census of the Political Web - JITP 2011

Conclusions

1. Combinations of tools are much more powerful than individual tools – share ideas across disciplines.

2. Sampling matters! With a little extra effort, we can sample populations on the web.

3. Complementary data is the key for the compSocSci research agenda.

Page 27: An Automated Snowball Census of the Political Web - JITP 2011

Conclusions

1. Combinations of tools are much more powerful than individual tools – share ideas across disciplines.

2. Sampling matters! With a little extra effort, we can sample populations on the web.

3. Complementary, horizontal, and offline data is key for the compSocSci research agenda.

Page 28: An Automated Snowball Census of the Political Web - JITP 2011

Thank you!

Questions? Comments?

Abe Gong

Public policy, political science, complex systems

University of Michigan

[email protected]

lowlywonk.blogspot.com

www-personal.umich.edu/~agong
