An Automated Snowball Census of the Political Web - JITP 2011
Post on 29-Nov-2014
An Automated Snowball Census of the Political Web
Abe Gong
University of Michigan
JITP 2011
Motivation
The blogosphere is one of the best sources of political data in all history.
Understanding political bloggers can help us understand political participation more broadly.
In order to compare “the average blogger” to “the average citizen,” we need a representative sample of bloggers.
Wanted: A sampling frame for all political bloggers
Challenges: scale and sparseness
No complete index of blogs exists, let alone political blogs
• 250 million web sites
• 40 new sites created every minute
• Only 3 in 1,000 sites are political
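Taking the three figures above at face value, a little arithmetic shows why neither a full index nor simple random sampling is practical. This is only a back-of-the-envelope sketch; the variable names are illustrative and the numbers are the slide's own.

```python
# Figures quoted on the slide (taken at face value).
TOTAL_SITES = 250_000_000
POLITICAL_PER_THOUSAND = 3
NEW_SITES_PER_MINUTE = 40

# Rough size of the target population of political sites.
expected_political_sites = TOTAL_SITES * POLITICAL_PER_THOUSAND // 1000

# Sparseness: random sites one would have to inspect, on average,
# to find a single political site.
sites_inspected_per_hit = 1000 / POLITICAL_PER_THOUSAND

# Scale: how quickly any fixed index goes stale.
new_sites_per_day = NEW_SITES_PER_MINUTE * 60 * 24

print(expected_political_sites, round(sites_inspected_per_hit), new_sites_per_day)
```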
Previous research
Sample Types
• Convenience
Examples
● Johnson and Kaye, 2004
● Leskovec, Backstrom and Kleinberg, 2009
Big Data, but no attempt at representativeness
Previous research
Sample Types
• Convenience
• Prominence
Examples
• McKenna and Pole, 2008
• Wallsten, 2008
Good data, but only includes popular sites.
Previous research
Sample Types
• Convenience
• Prominence
• Snowball
Examples
• Hindman, Tsioutsiouliklis, and Johnson, 2003
• Karpf, 2008
Sample properties unclear
Previous research
Sample Types
• Convenience
• Prominence
• Snowball
• Over-sample
Examples
• Lenhart and Fox, 2006
• Schlozman, Verba, and Brady, 2010
• Lawrence, Sides, and Farrell, 2010
• Karen's US-IMPACT study
Representative sample, but linking to Big Data is hard
Methodology
1. Start from a seed batch of political sites.
2. Download and classify each site in the batch.
3. For political sites, harvest outbound hyperlinks and add unvisited links to the next batch.
4. Repeat from step 2 until no new links are found.
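The four steps above can be sketched as a batch-wise loop. This is a minimal illustration, not SnowCrawl's actual code: the function name `snowball_census`, the callbacks `fetch`, `is_political`, and `extract_links`, and the five-site `TOY_WEB` are all hypothetical stand-ins so that the example runs without network access.

```python
def snowball_census(seeds, fetch, is_political, extract_links):
    """Crawl outward from seed sites, following links only from political sites."""
    visited, political = set(), set()
    batch = set(seeds)                      # step 1: seed batch
    while batch:                            # step 4: stop when no new links remain
        visited |= batch
        next_batch = set()
        for site in sorted(batch):
            page = fetch(site)              # step 2: download ...
            if page is None or not is_political(page):
                continue                    # ... and classify
            political.add(site)
            for link in extract_links(page):    # step 3: harvest outbound links
                if link not in visited:
                    next_batch.add(link)
        batch = next_batch
    return political, visited


# Hypothetical toy web: site -> (page text, outbound links).
TOY_WEB = {
    "a.example": ("election senate vote", ["b.example", "c.example"]),
    "b.example": ("congress policy vote", ["a.example", "d.example"]),
    "c.example": ("recipes cooking", ["e.example"]),
    "d.example": ("campaign election", []),
    "e.example": ("senate election", []),
}

political, visited = snowball_census(
    seeds=["a.example"],
    fetch=TOY_WEB.get,
    is_political=lambda page: any(w in page[0].split() for w in ("election", "vote")),
    extract_links=lambda page: page[1],
)
print(sorted(political), sorted(visited))
```

In this toy web, `e.example` is political but is linked only from the non-political `c.example`, so the crawl never reaches it. That is the main coverage assumption of a snowball design: political sites must be reachable through chains of other political sites.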
Toy Example
Bag-of-words logit regression
Prob(political) ≈ logit⁻¹(α + βX)
X = Vector of word counts
α = Bias term
β = Word weights
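As a rough illustration of the model above, the score α + βX is turned into a probability with the logistic (inverse-logit) function. The weights below are invented for the example and are not the fitted values from the talk; `prob_political`, `ALPHA`, and `BETA` are hypothetical names.

```python
import math
from collections import Counter


def prob_political(text, alpha, beta):
    """Return the modeled probability that a page is political."""
    counts = Counter(text.lower().split())          # X: vector of word counts
    score = alpha + sum(beta.get(word, 0.0) * n for word, n in counts.items())
    return 1.0 / (1.0 + math.exp(-score))           # inverse logit


# Made-up parameters, for illustration only.
ALPHA = -1.0                                         # bias term
BETA = {"election": 1.5, "senate": 1.0, "recipe": -2.0}   # word weights

p = prob_political("Senate election election", ALPHA, BETA)
print(round(p, 3))
```

Words that do not appear in `BETA` contribute nothing to the score, so a page with no known words falls back to the bias term alone.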
1. Hand-code a training sample (n=2,000)
2. Calibrate the computer
3. Hand-code a testing sample (n=200)
4. Evaluate the classifier
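The train-then-test workflow above might look like the following in miniature. This is a sketch under stated assumptions: the talk does not say how the classifier was fitted, so plain batch gradient ascent on the log-likelihood is used here, and the six training and four testing documents are invented stand-ins for the hand-coded samples (n=2,000 and n=200).

```python
import math
from collections import Counter, defaultdict


def featurize(text):
    return Counter(text.lower().split())


def predict(x, alpha, beta):
    score = alpha + sum(beta.get(w, 0.0) * n for w, n in x.items())
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, epochs=300, lr=0.1):
    """Fit alpha (bias) and beta (word weights) by batch gradient ascent."""
    data = [(featurize(text), label) for text, label in examples]
    alpha, beta = 0.0, {}
    for _ in range(epochs):
        grad_alpha, grad_beta = 0.0, defaultdict(float)
        for x, y in data:
            err = y - predict(x, alpha, beta)
            grad_alpha += err
            for w, n in x.items():
                grad_beta[w] += err * n
        alpha += lr * grad_alpha
        for w, g in grad_beta.items():
            beta[w] = beta.get(w, 0.0) + lr * g
    return alpha, beta


def accuracy(examples, alpha, beta):
    hits = sum((predict(featurize(t), alpha, beta) > 0.5) == bool(y) for t, y in examples)
    return hits / len(examples)


# Hypothetical hand-coded samples (1 = political, 0 = not political).
TRAIN = [
    ("election vote senate", 1), ("campaign election congress", 1),
    ("senate vote policy", 1), ("recipe dinner cooking", 0),
    ("football recipe game", 0), ("cooking dinner movie", 0),
]
TEST = [
    ("election campaign vote", 1), ("movie dinner recipe", 0),
    ("senate policy congress", 1), ("game football cooking", 0),
]

alpha, beta = train(TRAIN)                 # step 2: calibrate
test_accuracy = accuracy(TEST, alpha, beta)  # step 4: evaluate on held-out data
print(test_accuracy)
```

Keeping the testing sample separate from the training sample matters because accuracy measured on the training documents would overstate how well the classifier handles unseen sites.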
Text Classifier Word Cloud
Classifier reliability
Human-human: 80.9%
Human-computer: 81.0%
Krippendorff's Alpha: .733
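For readers unfamiliar with these measures, the sketch below computes simple percent agreement and Krippendorff's alpha for two coders assigning nominal labels with no missing values. The labels are invented, and the slide does not say which pair of coders its alpha refers to, so this illustrates the statistics rather than reproducing the reported numbers.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Alpha for two coders, nominal labels, no missing data.

    Assumes at least two distinct labels occur; otherwise the expected
    disagreement is zero and alpha is undefined.
    """
    n = 2 * len(coder_a)                              # total number of codings
    totals = Counter(coder_a) + Counter(coder_b)      # how often each label was used
    # Observed disagreement: each mismatched unit counts in both orders.
    d_observed = 2 * sum(a != b for a, b in zip(coder_a, coder_b)) / n
    # Expected disagreement if labels were paired at random.
    d_expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    return 1.0 - d_observed / d_expected


# Hypothetical codings of four sites (1 = political, 0 = not).
human = [1, 1, 0, 0]
computer = [1, 1, 0, 1]

agreement = percent_agreement(human, computer)
alpha_k = krippendorff_alpha_nominal(human, computer)
print(agreement, round(alpha_k, 3))
```

Alpha is lower than raw agreement because it discounts the matches two coders would produce by chance given how often each label is used.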
Census Results
Implemented in Python: SnowCrawl
Executes in less than 24 hours
1.8 million sites crawled
800,000 political
42% blogs
http://code.google.com/p/snowcrawl
Comparison by strata
                      Top 500    Top 5,000   Census
Organization
  Owned by orgs       66.1***    53.1        44.4
  Multiple authors    75.2*      66.7        62.2
  M-updates/day       43.4***    19.4***     6.1
Design
  Advertising         67.3**     57.1        51.2
  Blogroll            57.5*      66.3***     45.1
  Video               48.7***    35.7***     18.3
Comparison by strata
                              Top 500    Top 5,000   Census
Polls and public opinion      70.8***    65.3*       52.4
Elections and campaigns       50.4       45.9        51.2
Legislation and law-making    43.4       41.8        43.9
Implementation of policy      38.1       39.8        30.5
Decisions by courts           34.5***    24.5        17.1
Political figures             46.0***    39.8**      24.4
Political parties             38.9***    32.7*       20.7
Philosophical discussion      26.5       29.6        25.6
State and local government    36.3*      38.8**      24.4
Foreign policy                42.5***    38.8***     15.9
International relations       31.9**     33.7**      18.3
Where next?
● Survey of bloggers
● Poststratification weighting
● Network analysis
● Content analysis of blogs
● Blog post panel
● Sentiment analysis/Survey imputation
● Re-implement in Hadoop
Where next ...?
ANES
GSS
Roxy...?
?
?
?
Conclusions
1. Combinations of tools are much more powerful than individual tools – share ideas across disciplines.
2. Sampling matters! With a little extra effort, we can sample populations on the web.
3. Complementary, horizontal, and offline data is key for the compSocSci research agenda.
http://code.google.com/p/snowcrawl
Thank you!
Questions? Comments?
Abe Gong
Public policy, political science, complex systems
University of Michigan
agong@umich.edu
lowlywonk.blogspot.com
www-personal.umich.edu/~agong