When visibility means power Tracking political races from the Internet Stephane Gauvin Université Laval ECIG: October 2009
What can we learn from search engine count data?
Webometrics
The 2004 experience
The 2009 experience (USA)
The 2009 experience (Canada)
Building a measurement scale
Meaning and sentiment
Webometrics – painful beginnings
1996: WordOfNet measures visibility
• Closely aligned with SEO
• Highly unreliable – disappears in 2001
2003: Factiva’s visibility index
• Venture owned by Dow Jones & Reuters
• Tracks media mentions of Democrats
2004 Democratic convention
2004 consensus
Web-based metrics are unreliable. See:
Bar-Ilan, 2001, 2008
Björneborn & Ingwersen, 2001
Clarke & Willett, 1997
Cothey, 2004
Ingwersen & Björneborn, 2005
Lawrence & Giles, 1999
Mettrop & Nieuwenhuysen, 2001
Oppenheim, Morris, McKnight, & Lowley, 2000
Shafi & Rather, 2005
Snyder & Rosenbaum, 1999
Vaughan & Thelwall, 2004
That was before the social web
• 2004: 1M blogs / 2009: 200M blogs
• 2004: 4G URLs / 2009: 1T URLs
Measurement issues
Latency
• Unlike stars in the sky, visibility doesn’t reveal itself – it must be harvested
• Document-centric approach (start from URL seed(s) and follow links)
• Amounts to convenience sampling – bias is shown in Vaughan & Thelwall, 2004
• Concept-centric approach (rely on extensive generic crawling/indexing, i.e. Google)
Domain definition
• Narrow (Senator John McCain)
• Wide (McCain)
• Variants (typos, nicknames)
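The narrow/wide/variant distinction amounts to building several query strings per candidate. A minimal sketch (the function name, parameters, and layout are mine, not from the talk's actual scripts):

```python
def query_variants(name, narrow_prefix=None, nicknames=()):
    """Build the narrow / wide / variant query set for one candidate.

    narrow_prefix: a disambiguating title, e.g. "Senator" (illustrative).
    nicknames: known aliases and common typos (illustrative).
    """
    wide = name.split()[-1]  # wide domain: surname alone, e.g. "McCain"
    narrow = f'"{narrow_prefix} {name}"' if narrow_prefix else f'"{name}"'
    return {"narrow": narrow, "wide": wide, "variants": list(nicknames)}
```

For example, `query_variants("John McCain", narrow_prefix="Senator")` yields the quoted narrow query and the bare surname as the wide one.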
Variance
• Ex: Yahoo! doesn’t agree with Google (next slide)
Partitions
• Digital space is not homogeneous: news, blogs, social, images, videos, www
Raw scores all over the place
Building a measurement scale
1. Identify independent instruments
2. Harvest data
3. Weed out using Cronbach’s alpha
Independent instruments
Harvest data
• Every day, a script mimics a user (not an API using a sub-index)
• Up to 6 trials if the engine fails to return a result (dropped connection, busy, etc.)
• Machine-parsed to extract count data
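The daily retry-and-parse loop can be sketched as follows. The `fetch` callable stands in for the scripted browser request, and the results-count regex is an assumption about the page markup, not the talk's actual parser:

```python
import re
import time

def harvest_count(fetch, query, max_trials=6, delay=2.0):
    """Try up to max_trials times to fetch a results page and
    extract the engine's hit-count figure; return None on failure."""
    pattern = re.compile(r"([\d,]+)\s+results")  # assumed markup
    for _ in range(max_trials):
        try:
            html = fetch(query)
        except OSError:  # dropped connection, busy server, etc.
            time.sleep(delay)
            continue
        match = pattern.search(html)
        if match:
            return int(match.group(1).replace(",", ""))
    return None  # engine failed on all trials
```

Injecting `fetch` keeps the sketch testable without hitting a live engine.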
• Compute visibility shares to alleviate extreme outliers (engines may return counts several orders of magnitude larger than they should be, which makes correlations unreliable; visibility shares always fall in the 0..1 interval)
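Visibility shares simply normalize each candidate's raw count by the total across candidates for that engine and day, so an engine-wide inflation of counts cancels out. A minimal sketch (names are mine):

```python
def visibility_shares(counts):
    """Map {candidate: raw count} to {candidate: share of total},
    with every share in the 0..1 interval."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: c / total for name, c in counts.items()}
```

For example, counts of 3 and 1 become shares of 0.75 and 0.25 regardless of whether the engine reported them as 3/1 or 3 million/1 million.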
High reliability
Google often low
2009 US presidential
2008: Harper vs Dion
Blogs as early signal?
Visibility vs opinion polls
Correlations between signals
Absolute values above .18 are significant at p < 0.05
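Under the usual t-test for a Pearson correlation, the p < 0.05 cutoff depends only on the number of observations; a .18 threshold is consistent with roughly 120 daily data points. A large-sample sketch (normal approximation, and my own assumption about which test underlies the slide's figure):

```python
import math

def critical_r(n, z=1.96):
    """Approximate two-tailed p < .05 cutoff for |r|
    with n paired observations (large-sample z approximation)."""
    df = n - 2
    return z / math.sqrt(z * z + df)
```

`critical_r(120)` is about 0.178, close to the .18 threshold quoted above; shorter series need larger correlations to reach significance.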
Summary
Web metrics are highly reliable
They appear to be valid indicators
• French presidential
• US presidential
• Canadian elections
But:
• Anecdotal (only 2–3 instances)
• In the political realm (what about brands or social themes?)
• Questionable (what about sentiment?)
Sentiment
Mere visibility works because it embodies sentiment, i.e. a rotten politician will soon become “invisible”
Changes in visibility may convey sentiment (steady gains signal positive, an explosion signals negative)
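That heuristic can be made explicit: walk the daily share series and label each day-over-day move. A sketch of the rule as stated on the slide (the threshold and function name are my assumptions):

```python
def classify_trend(shares, explosion_factor=3.0):
    """Label day-over-day moves in a visibility-share series.

    Slide's heuristic: steady gains read as positive; a sudden
    explosion flags a likely negative news storm worth inspecting.
    """
    signals = []
    for prev, cur in zip(shares, shares[1:]):
        if prev > 0 and cur / prev >= explosion_factor:
            signals.append("explosion")  # probe for scandal / news storm
        elif cur > prev:
            signals.append("gain")       # steady gain reads as positive
        elif cur < prev:
            signals.append("loss")
        else:
            signals.append("flat")
    return signals
```

The explosion label is deliberately only a trigger for human inspection, since a spike can also follow good news.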
Sentiment analysis is difficult for several reasons:
• Volume makes human analysis impractical
• Complexity makes machine analysis difficult
• Conceptually, it is not clear what is good or bad (pro-life?)
Next
Apply to other concepts
Investigate metric properties
Consider simple sentiment analysis (SA)
• Goal is to call turning points (ex: Gore gets Nobel, Spitzer gets prostitute)
• When there is a news storm, sentiment is usually obvious, making SA pointless
• But some events are ambiguous (ex: Sarkozy–Bruni)
• And other events are unsentimental (ex: H1N1)