When visibility means power Tracking political races from the Internet Stephane Gauvin Université Laval ECIG: October 2009
What can we learn from search engine count data?
Webometrics
The 2004 experience
The 2009 experience (USA)
The 2009 experience (Canada)
Building a measurement scale
Meaning and sentiment
Webometrics – painful beginnings
1996: WordOfNet measures visibility
• Closely aligned with SEO
• Highly unreliable – disappears in 2001
2003: Factiva’s visibility index
• Venture owned by Dow Jones & Reuters
• Tracks media mentions of Democrats
2004 Democratic convention
2004 consensus
Web-based metrics are unreliable. See:
Bar-Ilan, 2001, 2008
Björneborn & Ingwersen, 2001
Clarke & Willett, 1997
Cothey, 2004
Ingwersen & Björneborn, 2005
Lawrence & Giles, 1999
Mettrop & Nieuwenhuysen, 2001
Oppenheim, Morris, McKnight, & Lowley, 2000
Shafi & Rather, 2005
Snyder & Rosenbaum, 1999
Vaughan & Thelwall, 2004
That was before the social web
• 2004: 1M blogs / 2009: 200M blogs
• 2004: 4G URLs / 2009: 1T URLs
Measurement issues
Latency
• Unlike stars in the sky, visibility doesn’t reveal itself – it must be harvested
• Document-centric approach (start from URL seed(s) and follow links)
• Amounts to convenience sampling – bias is shown in Vaughan & Thelwall, 2004
• Concept-centric approach (rely on extensive generic crawling/indexing, i.e. Google)
Domain definition
• Narrow (Senator John McCain)
• Wide (McCain)
• Variants (typos, nicknames)
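The narrow/wide/variant distinction amounts to building several query strings per candidate. A minimal sketch (the function name, parameters, and layout are mine, not from the talk's actual scripts):

```python
def query_variants(name, narrow_prefix=None, nicknames=()):
    """Build the narrow / wide / variant query set for one candidate.

    narrow_prefix: a disambiguating title, e.g. "Senator" (illustrative).
    nicknames: known aliases and common typos (illustrative).
    """
    wide = name.split()[-1]  # wide domain: surname alone, e.g. "McCain"
    narrow = f'"{narrow_prefix} {name}"' if narrow_prefix else f'"{name}"'
    return {"narrow": narrow, "wide": wide, "variants": list(nicknames)}
```

For example, `query_variants("John McCain", narrow_prefix="Senator")` yields the quoted narrow query and the bare surname as the wide one.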
Variance
• Ex: Yahoo! doesn’t agree with Google (next slide)
Partitions
• Digital space is not homogeneous: news, blogs, social, images, videos, www
Raw scores all over the place
Building a measurement scale
1. Identify independent instruments
2. Harvest data
3. Weed out using Cronbach’s alpha
Independent instruments
Harvest data
• Every day, a script mimics a user (not an API using a sub-index)
• Up to 6 trials if the engine fails to return a result (dropped connection, busy, etc.)
• Machine-parsed to extract count data
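The daily retry-and-parse loop can be sketched as follows. The `fetch` callable stands in for the scripted browser request, and the results-count regex is an assumption about the page markup, not the talk's actual parser:

```python
import re
import time

def harvest_count(fetch, query, max_trials=6, delay=2.0):
    """Try up to max_trials times to fetch a results page and
    extract the engine's hit-count figure; return None on failure."""
    pattern = re.compile(r"([\d,]+)\s+results")  # assumed markup
    for _ in range(max_trials):
        try:
            html = fetch(query)
        except OSError:  # dropped connection, busy server, etc.
            time.sleep(delay)
            continue
        match = pattern.search(html)
        if match:
            return int(match.group(1).replace(",", ""))
    return None  # engine failed on all trials
```

Injecting `fetch` keeps the sketch testable without hitting a live engine.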
• Compute visibility shares to alleviate extreme outliers (engines may return counts several orders of magnitude larger than they should be, which makes correlations unreliable; visibility shares always fall in the 0..1 interval)
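Visibility shares simply normalize each candidate's raw count by the total across candidates for that engine and day, so an engine-wide inflation of counts cancels out. A minimal sketch (names are mine):

```python
def visibility_shares(counts):
    """Map {candidate: raw count} to {candidate: share of total},
    with every share in the 0..1 interval."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: c / total for name, c in counts.items()}
```

For example, counts of 3 and 1 become shares of 0.75 and 0.25 regardless of whether the engine reported them as 3/1 or 3 million/1 million.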
High reliability
Google often low
2009 US presidential
2008: Harper vs Dion
Blogs as early signal?
Visibility vs opinion polls
Correlations between signals
Absolute values above .18 are significant at p < 0.05
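Under the usual t-test for a Pearson correlation, the p < 0.05 cutoff depends only on the number of observations; a .18 threshold is consistent with roughly 120 daily data points. A large-sample sketch (normal approximation, and my own assumption about which test underlies the slide's figure):

```python
import math

def critical_r(n, z=1.96):
    """Approximate two-tailed p < .05 cutoff for |r|
    with n paired observations (large-sample z approximation)."""
    df = n - 2
    return z / math.sqrt(z * z + df)
```

`critical_r(120)` is about 0.178, close to the .18 threshold quoted above; shorter series need larger correlations to reach significance.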
Summary
Web metrics are highly reliable
They appear to be valid indicators
• French presidential
• US presidential
• Canadian elections
But:
• Anecdotal (only 2–3 instances)
• In the political realm (what about brands or social themes?)
• Questionable (what about sentiment?)
Sentiment
Mere visibility works because it embodies sentiment, i.e. a rotten politician will soon become “invisible”
Changes in visibility may convey sentiment (steady gains signal positive, an explosion signals negative)
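That heuristic can be made explicit: walk the daily share series and label each day-over-day move. A sketch of the rule as stated on the slide (the threshold and function name are my assumptions):

```python
def classify_trend(shares, explosion_factor=3.0):
    """Label day-over-day moves in a visibility-share series.

    Slide's heuristic: steady gains read as positive; a sudden
    explosion flags a likely negative news storm worth inspecting.
    """
    signals = []
    for prev, cur in zip(shares, shares[1:]):
        if prev > 0 and cur / prev >= explosion_factor:
            signals.append("explosion")  # probe for scandal / news storm
        elif cur > prev:
            signals.append("gain")       # steady gain reads as positive
        elif cur < prev:
            signals.append("loss")
        else:
            signals.append("flat")
    return signals
```

The explosion label is deliberately only a trigger for human inspection, since a spike can also follow good news.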
Sentiment analysis is difficult for several reasons:
• Volume makes human analysis impractical
• Complexity makes machine analysis difficult
• Conceptually, it is not clear what is good or bad (pro-life?)
Next
Apply to other concepts
Investigate metric properties
Consider simple sentiment analysis (SA)
• Goal is to call turning points (ex: Gore gets Nobel, Spitzer gets prostitute)
• When there is a news storm, sentiment is usually obvious, making SA pointless
• But some events are ambiguous (ex: Sarkozy–Bruni)
• And other events are unsentimental (ex: H1N1)