TwitterStand: News in Tweets. Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler, Michael D. Lieberman, Jon Sperling. Department of Computer Science, University of Maryland. Group Members: Enkh-Amgalan Baatarjav, Jedsada Chartree, Thiraphat Meesumrarn

Mar 29, 2015


TwitterStand: News in Tweets

Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler, Michael D. Lieberman, Jon Sperling

Department of Computer Science, University of Maryland

Group Members
• Enkh-Amgalan Baatarjav
• Jedsada Chartree
• Thiraphat Meesumrarn


Outline

Introduction to Twitter

Problem statement

Contributions

Key concepts

Methodology

Assumptions

Questions

References


Introduction: Twitter
• Three actors
  - User
  - Followers
  - Friend
• Relationship
  - Unidirectional
  - Bidirectional
• Multi-interface
  - Website, SMS, applications, IM, etc.


Introduction to Twitter


Search Engines


Twitter Services: API
• Twitter API
  - Functions to obtain user-specific information
• Twitter dataset samples through public feeds
  - Public timeline
  - Spritzer
  - GardenHose: sparse sampling of all feeds
  - BirdDog: tweets written by up to 200,000 users


Introduction: Statistics
[Figure: U.S. unique visitors (thousands) trend. Source: comScore Media Metrix]


Introduction: Statistics
• 21% of Twitter accounts are empty placeholders
• 94% of Twitter accounts have fewer than 100 followers
• 10% of Twitter users create 86% of all activity
• 49.6% of Twitter users are inactive (1 tweet in last 7 days)
• 55% of Twitter users use a third-party application


Introduction: Statistics


Problem Statement
• Conventional systems:
  - News aggregators: Google News, Bing News, and Yahoo! News
  - Content providers: newspapers, television stations, news blogs
• Vast amount of information being generated by Twitter users
  - 2008 Southern California earthquake
  - Iranian election
• Separating news from junk


Contributions
• Mobilizing millions of Twitter users to be the eyes and ears of the world
• Geographic proximity plays an important role
• TwitterStand
  - Identifying current news
  - Clustering similar tweets into news stories
  - Ranking news based on importance
  - Geotagging news topics


Key Concepts
• Separating news from noise
• Clustering tweets
• Mapping the clusters to geographic locations


Example: Twitter Vs Aggregator


Benefits of Twitter
• Social networking website
  - Community and structure
• Meta-data information
  - Description, source location, friends, etc.
• Very open community
  - Diverse community with varied interests
  - Broadcasting less popular viewpoints
• Capturing breaking news
  - Very little lag time between event and tweet


Challenges of Twitter
• Determining whether a tweet is news or not
  - Most tweets are not news
• Very high throughput
  - Needs to be fast and resilient to noise
• Brevity of the tweets
  - Lacking conveyed information: time critical
• Credibility issues


Key Strategies
1. Utilizing online algorithms
   - The stream of tweets arrives at a furious rate
2. Extracting useful information from noise
   - Noise, spelling and grammatical errors, abbreviations, etc.
3. Keeping up with Twitter evolution
4. Finding the core group of users who tweet about news
   - Manually identifying the core group is better than mining the social network (SN) structure
   - Finding the most common set of followers among them
5. Obtaining user-generated news content
   - Videos, photographs, unconventional news, biased toward entertainment, politics and tech


Architecture of TwitterStand


Architecture: Input
• Seeders
  - 2,000 handpicked users that are known to publish news: newspapers, television stations, reporters, bloggers, etc.
• GardenHose
  - Sampling of all tweets: very noisy feeds from diverse topics
• BirdDog
  - Feeds from up to 200,000 users, identified by “friend finder”
• Artifacts
  - Links to external resources, retained only from the seeders' feed
• Track
  - Automatically generates a pool of search keys to scour the stream of tweets for potential news tweets of interest


Separating the Chaff
• Classify incoming tweets as either junk or news
  - Except for tweets from seeders
• Goal
  - Not to get rid of noise completely
  - Discard as many tweets as possible without losing many news tweets
• Training a naïve Bayes classifier with a corpus of tweets marked as either junk or news


Cont.
• The probability that a tweet t is junk or news is computed using Bayes' theorem
• Assumption of independence among the words in t
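
The formula from this slide is not present in the transcript. A standard naïve Bayes formulation that is consistent with the independence assumption and with the D < 0 decision rule on the following slide might look like the sketch below. The symbols N (news), J (junk), w_i (the i-th word of t), k (the number of words in t), and the exact form of D are assumptions, not text recovered from the slide.

```latex
\begin{align}
P(N \mid t) &= \frac{P(N)\,P(t \mid N)}{P(t)}, &
P(J \mid t) &= \frac{P(J)\,P(t \mid J)}{P(t)} \\
P(t \mid C) &= \prod_{i=1}^{k} P(w_i \mid C), & C &\in \{N, J\} \\
D &= \log\frac{P(J \mid t)}{P(N \mid t)}
   = \log\frac{P(J)}{P(N)} + \sum_{i=1}^{k} \log\frac{P(w_i \mid J)}{P(w_i \mid N)}
\end{align}
```

Under this reading, D < 0 means the news class is the more probable one, and P(t) cancels out of the ratio.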


Cont.
• If D < 0, the tweet is classified as news; otherwise it is junk
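
As a rough illustration of the decision rule, the following sketch trains word counts on a tiny labeled corpus and classifies a tweet by the sign of a log-odds score. It assumes D is the log of P(junk | t) over P(news | t), whitespace tokenization, and add-one smoothing; the function names and the toy corpus are hypothetical and not taken from the slides.

```python
import math
from collections import Counter

def train(labeled_tweets):
    """Count words per class from (text, label) pairs; label is 'news' or 'junk'."""
    word_counts = {"news": Counter(), "junk": Counter()}
    doc_counts = Counter()
    for text, label in labeled_tweets:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts

def log_odds_junk(text, word_counts, doc_counts):
    """D = log P(junk|t) - log P(news|t), with add-one smoothing (an assumption)."""
    vocab = set(word_counts["news"]) | set(word_counts["junk"])
    totals = {c: sum(word_counts[c].values()) for c in word_counts}
    d = math.log(doc_counts["junk"] / doc_counts["news"])
    for w in text.lower().split():
        p_junk = (word_counts["junk"][w] + 1) / (totals["junk"] + len(vocab))
        p_news = (word_counts["news"][w] + 1) / (totals["news"] + len(vocab))
        d += math.log(p_junk / p_news)
    return d

def classify(text, word_counts, doc_counts):
    """Apply the slide's rule: D < 0 means news, otherwise junk."""
    return "news" if log_odds_junk(text, word_counts, doc_counts) < 0 else "junk"

# Hypothetical toy corpus, for illustration only.
corpus = [
    ("earthquake strikes southern california", "news"),
    ("election results announced today", "news"),
    ("eating lunch with friends lol", "junk"),
    ("so bored at home lol", "junk"),
]
wc, dc = train(corpus)
print(classify("earthquake near california today", wc, dc))
print(classify("lol bored eating lunch", wc, dc))
```

Working in log space avoids numeric underflow from multiplying many small word probabilities.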


Cont.
• How to ensure that tweets related to news are not classified as junk?
• The corpus is made up of two components
  - Static
    - Large collection of news tweets marked as news
    - Large collection of tweets marked as junk
  - Dynamic
    - Periodically obtained from the clustering module
    - Names of people, hashtags
• News tweets:
  - Static: helps to identify news tweets on topics that have not been encountered previously
  - Dynamic: helps to identify news tweets about current events


Online Clustering
• Goal: automatically group news tweets into sets of tweets (clusters)
• Topic detection: each cluster contains tweets pertaining to a specific topic
• Challenges
  - Topics are not predefined
  - No training set
  - Online clustering


Cont.
• Leader-follower clustering
• Features: able to cluster on both content and time
• Algorithm details
  - Active cluster list
  - Feature vectors: tweets’ terms (TF-IDF)
  - Time centroid
• Inactive cluster: time centroid older than 3 days
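
The per-cluster state named on this slide could be represented roughly as follows. This is a minimal sketch that assumes both centroids are plain means over the member tweets, that timestamps are in seconds, and that term weights (for example TF-IDF values) are computed elsewhere; the `Cluster` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field

THREE_DAYS = 3 * 24 * 3600  # seconds

@dataclass
class Cluster:
    term_sums: dict = field(default_factory=dict)  # summed term weights of members
    time_sum: float = 0.0                          # summed member timestamps
    size: int = 0

    def add(self, weights, timestamp):
        """Fold one tweet's {term: weight} vector and timestamp into the cluster."""
        for term, w in weights.items():
            self.term_sums[term] = self.term_sums.get(term, 0.0) + w
        self.time_sum += timestamp
        self.size += 1

    def term_centroid(self):
        """Mean term-weight vector of the member tweets."""
        return {t: s / self.size for t, s in self.term_sums.items()}

    def time_centroid(self):
        """Mean timestamp of the member tweets."""
        return self.time_sum / self.size

    def is_active(self, now):
        """A cluster stays active while its time centroid is at most 3 days old."""
        return now - self.time_centroid() <= THREE_DAYS

c = Cluster()
c.add({"earthquake": 2.0, "california": 1.0}, 0.0)
c.add({"earthquake": 1.0}, 3600.0)
print(c.term_centroid(), c.time_centroid())
```

Keeping running sums instead of the member tweets lets both centroids be updated in constant time per term as tweets stream in.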


Cont.
• Cosine similarity measure
  - Feature vectors TFVt, TFVc
• Pre-specified constant ε
  - If the distance from the tweet to its closest cluster is > ε, start a new cluster
• To account for the temporal dimension
  - Apply a Gaussian attenuator
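
One way to picture a leader-follower step with a time-attenuated cosine measure is sketched below. It assumes sparse vectors stored as dictionaries, a distance defined as 1 minus the attenuated similarity, and a Gaussian width `sigma`; these choices, the function names, and the fact that the centroid is not updated after a join are simplifications for illustration and are not taken from the slides.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def attenuated_distance(tweet_vec, tweet_time, centroid_vec, centroid_time, sigma):
    """1 - cosine similarity damped by a Gaussian of the time gap (assumed form)."""
    gap = tweet_time - centroid_time
    damping = math.exp(-(gap ** 2) / (2 * sigma ** 2))
    return 1.0 - cosine(tweet_vec, centroid_vec) * damping

def assign(tweet_vec, tweet_time, clusters, eps, sigma):
    """Join the nearest cluster, or start a new one if its distance exceeds eps.

    clusters is a list of (centroid_vec, time_centroid) pairs; returns the index used.
    """
    best, best_dist = None, float("inf")
    for i, (vec, time) in enumerate(clusters):
        d = attenuated_distance(tweet_vec, tweet_time, vec, time, sigma)
        if d < best_dist:
            best, best_dist = i, d
    if best is None or best_dist > eps:
        clusters.append((dict(tweet_vec), tweet_time))
        return len(clusters) - 1
    return best

DAY = 86400.0
clusters = [({"earthquake": 1.0, "california": 1.0}, 0.0)]
print(assign({"earthquake": 1.0}, 0.0, clusters, eps=0.5, sigma=DAY))
print(assign({"election": 1.0}, 0.0, clusters, eps=0.5, sigma=DAY))
print(assign({"earthquake": 1.0}, 10 * DAY, clusters, eps=0.5, sigma=DAY))
```

The third call shows the effect of the attenuation: the tweet matches the earthquake cluster on content, but the ten-day gap pushes the distance above ε, so a new cluster is started.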


Cont.: Optimization
• Inverted index of cluster centroids
  - Reduces the number of distance computations
  - For each feature f, the index stores pointers to all clusters containing f
  - A distance is computed iff at least one feature is common between a tweet and a cluster
• Maintaining a list of active clusters
  - Centroids are less than three days old
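
The inverted index described here can be sketched as a mapping from each feature to the set of cluster ids whose centroid contains it. The class and method names below are hypothetical; the point is only that a tweet needs distance computations against the returned candidates rather than against every active cluster.

```python
from collections import defaultdict

class CentroidIndex:
    """Maps each feature to the ids of clusters whose centroid contains it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, cluster_id, features):
        for f in features:
            self.postings[f].add(cluster_id)

    def remove(self, cluster_id, features):
        # Intended for clusters that drop off the active list.
        for f in features:
            self.postings[f].discard(cluster_id)

    def candidates(self, tweet_features):
        """Ids of clusters sharing at least one feature with the tweet."""
        found = set()
        for f in tweet_features:
            found |= self.postings.get(f, set())
        return found

index = CentroidIndex()
index.add(1, {"earthquake", "california"})
index.add(2, {"election", "iran"})
print(index.candidates({"earthquake", "iran"}))
print(index.candidates({"lunch"}))
```

A tweet with no feature in common with any centroid gets an empty candidate set, so no distance is computed for it at all.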


Additional Tweaks: Dealing with Noise
• Very noisy medium
• Seeding good-quality clusters
  - Only seeders are allowed to start a new cluster
  - Unreliable feeds are allowed only to add to an existing cluster
• Drawback
  - Seeders mostly consist of conventional news sources
• Solution
  - Relaxing the rule: any tweet can form a cluster, which is inactive if, after k tweets have been added to it, none of the k tweets is from a seeder
  - The cluster status changes to active when a seeder tweet is added to the cluster


Tweak: Fragmentation
• Several different clusters on a single topic
  - Frequently occurs with online clustering algorithms
  - Tweets are distributed across tens or hundreds of duplicate clusters
• Solution
  - Periodically checking for duplicate clusters among active clusters
  - Master cluster: the one with the older time centroid
  - Slave cluster: the one with the younger time centroid
  - Any new tweets belonging to the slave cluster are added to the master cluster
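
The master and slave bookkeeping might be sketched as a redirect table from each slave cluster to its master. The sketch assumes the duplicate pairs have already been detected by some other step (the slides do not say how), and the function names and cluster ids are hypothetical.

```python
def find_master(cluster_id, redirect):
    """Follow slave -> master links until reaching a cluster with no master."""
    while cluster_id in redirect:
        cluster_id = redirect[cluster_id]
    return cluster_id

def mark_duplicates(time_centroids, duplicate_pairs, redirect):
    """For each duplicate pair, the cluster with the older time centroid becomes master."""
    for a, b in duplicate_pairs:
        a, b = find_master(a, redirect), find_master(b, redirect)
        if a == b:
            continue  # already merged under the same master
        master, slave = (a, b) if time_centroids[a] <= time_centroids[b] else (b, a)
        redirect[slave] = master

# Hypothetical clusters: smaller time centroid means older.
time_centroids = {"c1": 100.0, "c2": 250.0, "c3": 400.0}
redirect = {}
mark_duplicates(time_centroids, [("c2", "c1"), ("c3", "c2")], redirect)
print(find_master("c3", redirect))
```

A new tweet whose nearest cluster is a slave would be added to `find_master(...)` of that cluster instead, which is how tweets end up in the master.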


Tweak: Weight upper bounds
• Dynamic corpus: newly added features have high TF-IDF values
  - Relatively unimportant, misspelled words, etc.
• Problem: spurious clusters
  - Clustering based on an unimportant feature
• Solution
  - For a tweet to be added to a cluster, the tweet and the cluster should share k common features (k > 1)
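
The shared-feature requirement amounts to a set-intersection check before a tweet may join a cluster. A minimal sketch, with a hypothetical function name and k = 2 chosen only as an example value satisfying k > 1:

```python
def shares_enough_features(tweet_features, cluster_features, k=2):
    """True when the tweet and the cluster have at least k features in common."""
    return len(set(tweet_features) & set(cluster_features)) >= k

cluster = {"earthquake", "california", "magnitude"}
print(shares_enough_features({"earthquake", "california", "today"}, cluster))
print(shares_enough_features({"earthquake", "lol"}, cluster))
```

With this check, a single high-weight but unimportant feature (such as a misspelling) is not enough on its own to pull a tweet into a cluster.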


Tweak: Phrases
• Features containing two or more terms: phrases
• Problem
  - Treating a phrase as separate features results in lost meaning: “San Francisco”
  - Treating a phrase as a single feature results in a large TF-IDF score
• Solution
  - Distinguishing two kinds of relationships between words in the phrase by determining the volume of occurrences of t1 close to t2
  - Finding a dominant word: “Barack” “Obama” => “Obama”
  - Merging words into a single feature: “San” “Francisco” => “San Francisco”
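
The slide does not give the rule for telling the two relationships apart, so the following is only one possible reading. It assumes that "close" means adjacent, that a pair is merged when each word almost always appears with the other, and that a word is dominant when its partner rarely appears without it. The threshold, the function name, and the sample tweets are all hypothetical.

```python
from collections import Counter

def phrase_decisions(tweets, threshold=0.8):
    """Label adjacent word pairs as 'merge' or as having a dominant word (assumed rule)."""
    word_counts, pair_counts = Counter(), Counter()
    for text in tweets:
        words = text.lower().split()
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))
    decisions = {}
    for (t1, t2), n in pair_counts.items():
        frac1 = n / word_counts[t1]  # share of t1 occurrences followed by t2
        frac2 = n / word_counts[t2]  # share of t2 occurrences preceded by t1
        if frac1 >= threshold and frac2 >= threshold:
            decisions[(t1, t2)] = f"merge: {t1} {t2}"
        elif frac1 >= threshold:
            decisions[(t1, t2)] = f"dominant: {t2}"
        elif frac2 >= threshold:
            decisions[(t1, t2)] = f"dominant: {t1}"
    return decisions

sample = [
    "barack obama speaks",
    "obama signs bill",
    "obama visits",
    "san francisco fog",
    "san francisco giants win",
]
result = phrase_decisions(sample)
print(result[("barack", "obama")])
print(result[("san", "francisco")])
```

On a corpus this small the rule also labels incidental pairs, so in practice a minimum pair count would likely be needed as well.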


Topic Geographic Focus
• Associate each cluster of tweets with a set of geographic locations
• Tweet content: geotagging
  1. Toponym recognition: finding all textual references to geographic locations
  2. Toponym resolution: determining the correct location for each recognized toponym out of all possible interpretations
• Source location of the user
  - Meta-data contains the user’s location
  - Containment or prominence heuristic
• Computing topic focus
  - Ranking geographic locations by frequency
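
Ranking locations by frequency can be sketched with a simple counter over the locations attached to a cluster's tweets. The sketch assumes toponym recognition and resolution have already produced a list of resolved location names per tweet; the function name and sample data are hypothetical.

```python
from collections import Counter

def topic_focus(tweet_locations, top_n=3):
    """Rank the locations attached to a cluster's tweets by how often they occur."""
    counts = Counter(loc for locs in tweet_locations for loc in locs)
    return counts.most_common(top_n)

cluster_locations = [
    ["Los Angeles", "California"],
    ["Los Angeles"],
    ["Tehran"],
    ["Los Angeles", "California"],
]
print(topic_focus(cluster_locations, top_n=2))
```

The highest-ranked locations would then serve as the cluster's geographic focus.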


User Interface Issues NewsStand


Topic Hashtags

- Reducing ε value
- Proactively searching for more tweets belonging to a particular topic


Conclusion
• General technique to extract concepts from noise
• Adaptable to different environments
• Generating a dynamic corpus with an online algorithm
• Pinpointing news clusters to geographic locations
• User interface for displaying news
• Harbinger of a futuristic technology that can capture and transmit the sum total of all human experiences of the moment


Assumptions
• Noise
  - Tweets that do not belong to the news domain
• Tweets from seeders are considered to be reliable news
• To apply the naïve Bayes classifier, the words in a tweet are assumed to be independent


Questions & Answers

References
• Sankaranarayanan, J., et al., “TwitterStand: News in Tweets”, Proc. ACM GIS ’09, Seattle, WA, USA
• Rohit Bhargava, “Influential Marketing Blog”, http://rohitbhargava.typepad.com/weblog/2009/07/10-stunning-and-useful-stats-about-twitter.html
• “In-depth study of Twitter: How much we tweet, and when”, http://royal.pingdom.com/2009/11/13/in-depth-study-of-twitter-how-much-we-tweet-and-when/