An Automated Snowball Census of the Political Web - JITP 2011

Post on 29-Nov-2014


DESCRIPTION

Working abstract: This paper solves a persistent methodological problem for social scientists studying the political web: representative sampling. Virtually all existing studies of the political web are based on incomplete samples and therefore lack generalizability. In this paper, I combine methods from computer science and sampling theory to conduct an automated snowball census of the political web and construct an all-but-complete index of English political websites. I check the robustness of this index, use it to generate descriptive statistics for the entire political web, and demonstrate that studies based on ad hoc sampling strategies are likely to be biased in important ways. In future research, this bias can be eliminated by using this index as a sampling universe. In addition, the methods and open-source software presented here can be used to create similar sampling frames for other online content domains.

Transcript

An Automated Snowball Census of the Political Web

Abe Gong
University of Michigan

JITP 2011

Motivation

The blogosphere is one of the best sources of political data in all history.

Understanding political bloggers can help us understand political participation more broadly.

In order to compare “the average blogger” to “the average citizen,” we need a representative sample of bloggers.

Wanted: A sampling frame for all political bloggers

Challenges: scale and sparseness

No complete index of blogs exists, let alone political blogs

• 250 million web sites
• 40 new sites created every minute
• Only 3 in 1,000 sites are political

Previous research

Sample Types
• Convenience

Examples
● Johnson and Kaye, 2004
● Leskovec, Backstrom and Kleinberg, 2009

Big Data, but no attempt at representativeness

Previous research

Sample Types
• Convenience
• Prominence

Examples
• McKenna and Pole, 2008
• Wallsten, 2008

Good data, but only includes popular sites.

Previous research

Sample Types
• Convenience
• Prominence
• Snowball

Examples
• Hindman, Tsioutsiouliklis, and Johnson, 2003
• Karpf, 2008

Sample properties unclear

Previous research

Sample Types
• Convenience
• Prominence
• Snowball
• Over-sample

Examples
• Lenhart and Fox, 2006
• Schlozman, Verba, and Brady, 2010
• Lawrence, Sides, and Farrell, 2010
• Karen's US-IMPACT study

Representative sample, but linking to Big Data is hard

Methodology

1. Start from a seed batch of political sites.

2. Download and classify each site in the batch.

3. For political sites, harvest outbound hyperlinks and add unvisited links to the next batch.

4. Repeat from step 2 until no new links are found.
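The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the SnowCrawl code: the names `snowball_census`, `fetch`, and `is_political`, the `TOY_WEB` dictionary, and the keyword list are all hypothetical stand-ins. A real crawl would download pages over HTTP and use a trained text classifier where this sketch uses an in-memory dictionary and a keyword check.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical toy "web": site -> (page text, outbound links).
TOY_WEB = {
    "a.example": ("election senate vote", ["b.example", "c.example"]),
    "b.example": ("recipes and cooking", ["d.example"]),
    "c.example": ("congress policy vote", ["a.example", "e.example"]),
    "d.example": ("campaign election", []),
    "e.example": ("senate campaign", []),
}

# Hypothetical keyword list standing in for a real classifier.
POLITICAL_WORDS = {"election", "senate", "vote", "congress", "policy", "campaign"}


def fetch(site: str) -> Tuple[str, List[str]]:
    """Stand-in for downloading a site; unknown sites are empty."""
    return TOY_WEB.get(site, ("", []))


def is_political(text: str) -> bool:
    """Stand-in classifier: True if any political keyword appears."""
    return any(word in POLITICAL_WORDS for word in text.lower().split())


def snowball_census(
    seeds: Iterable[str],
    fetch: Callable[[str], Tuple[str, List[str]]],
    is_political: Callable[[str], bool],
) -> Tuple[Set[str], Set[str]]:
    """Return (political sites, all visited sites)."""
    visited: Set[str] = set()
    political: Set[str] = set()
    batch = set(seeds)                      # step 1: seed batch
    while batch:                            # step 4: stop when no new links
        visited |= batch
        next_batch: Set[str] = set()
        for site in batch:                  # step 2: download and classify
            text, links = fetch(site)
            if is_political(text):          # step 3: harvest links from
                political.add(site)         # political sites only
                next_batch.update(l for l in links if l not in visited)
        batch = next_batch
    return political, visited


political, visited = snowball_census(["a.example"], fetch, is_political)
print(sorted(political))
print(sorted(visited))
```

In this toy web, `d.example` is political but is linked only from a non-political site, so the crawl never reaches it. That is the kind of coverage gap a robustness check on the index would need to probe.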

Toy Example

Bag-of-words logit regression

Prob(political) ≈ logit⁻¹(α + βX)

X = Vector of word counts

α = Bias term

β = Word weights
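As a rough sketch of how such a model scores a page, the snippet below counts words, forms α plus the weighted sum of counts, and passes the result through the logistic (inverse-logit) function to get a probability. The function names and the values of `ALPHA` and `BETA` are made up for illustration; in practice the weights would be estimated from the hand-coded training sample.

```python
import math
from collections import Counter
from typing import Dict

# Hypothetical parameters, not estimates from the study.
ALPHA = -1.0
BETA: Dict[str, float] = {"election": 1.5, "senate": 1.2, "recipe": -0.8}


def word_counts(text: str) -> Counter:
    """X: bag-of-words vector, stored sparsely as word -> count."""
    return Counter(text.lower().split())


def prob_political(text: str, alpha: float, beta: Dict[str, float]) -> float:
    """Logistic function applied to alpha + beta . X."""
    x = word_counts(text)
    # Words without a weight contribute nothing to the score.
    z = alpha + sum(beta.get(word, 0.0) * count for word, count in x.items())
    return 1.0 / (1.0 + math.exp(-z))


print(prob_political("election election senate", ALPHA, BETA))
print(prob_political("recipe", ALPHA, BETA))
```

A site would then be labeled political when this probability exceeds some threshold, for example 0.5; the threshold is also an assumption here.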

1. Hand-code a training sample (n=2,000)

2. Calibrate the computer

3. Hand-code a testing sample (n=200)

4. Evaluate the classifier

Text Classifier Word Cloud

Classifier reliability

Human-human: 80.9%

Human-computer: 81.0%

Krippendorff's Alpha: .733
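For readers who want to compute the same kind of reliability figures on their own codings, here is a small sketch of percent agreement and Krippendorff's alpha. It assumes nominal labels, exactly two coders, and no missing values; the slides do not state which variant of alpha was used, so treat this as one reasonable reading. The function names and example labels are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Share of units on which the two coders gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty codings of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def krippendorff_alpha_nominal(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Krippendorff's alpha for nominal data, two coders, no missing values."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty codings of equal length")
    # Coincidence matrix: each unit contributes both ordered pairs.
    coincidences: Counter = Counter()
    for x, y in zip(a, b):
        coincidences[(x, y)] += 1
        coincidences[(y, x)] += 1
    n = 2 * len(a)  # total number of coded values
    totals: Counter = Counter()
    for (c, _), count in coincidences.items():
        totals[c] += count
    observed = sum(cnt for (c, k), cnt in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c, k in permutations(totals, 2))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    # alpha = 1 - D_o / D_e, with D_o = observed / n
    # and D_e = expected / (n * (n - 1)).
    return 1.0 - (n - 1) * observed / expected


# Hypothetical labels: 1 = political, 0 = not political.
coder_1 = [1, 0, 1, 0]
coder_2 = [1, 0, 0, 1]
print(percent_agreement(coder_1, coder_2))
print(krippendorff_alpha_nominal(coder_1, coder_2))
```

Unlike raw agreement, alpha corrects for the agreement expected by chance given how often each label is used, which is why the two figures can differ substantially.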

Census Results

Implemented in Python: SnowCrawl

Executes in less than 24 hours

1.8 million sites crawled

800,000 political

42% blogs

http://code.google.com/p/snowcrawl

Comparison by strata

Measure              Top 500    Top 5,000   Census

Organization
  Owned by orgs      66.1***    53.1        44.4
  Multiple authors   75.2*      66.7        62.2
  M-updates/day      43.4***    19.4***     6.1

Design
  Advertising        67.3**     57.1        51.2
  Blogroll           57.5*      66.3***     45.1
  Video              48.7***    35.7***     18.3

Comparison by strata

Topic                        Top 500    Top 5,000   Census
Polls and public opinion     70.8***    65.3*       52.4
Elections and campaigns      50.4       45.9        51.2
Legislation and law-making   43.4       41.8        43.9
Implementation of policy     38.1       39.8        30.5
Decisions by courts          34.5***    24.5        17.1
Political figures            46.0***    39.8**      24.4
Political parties            38.9***    32.7*       20.7
Philosophical discussion     26.5       29.6        25.6
State and local government   36.3*      38.8**      24.4
Foreign policy               42.5***    38.8***     15.9
International relations      31.9**     33.7**      18.3

Where next?

● Survey of bloggers
● Poststratification weighting
● Network analysis
● Content analysis of blogs
● Blog post panel
● Sentiment analysis/Survey imputation
● Re-implement in Hadoop

Where next ...?

ANES

GSS

Roxy...?

?

?

?

Conclusions

1. Combinations of tools are much more powerful than individual tools – share ideas across disciplines.

2. Sampling matters! With a little extra effort, we can sample populations on the web.

3. Complementary, horizontal, and offline data is key for the compSocSci research agenda.

Thank you!

Questions? Comments?

Abe Gong

Public policy, political science, complex systems

University of Michigan

agong@umich.edu

lowlywonk.blogspot.com

www-personal.umich.edu/~agong
