TwitterStand: News in Tweets
Jagan Sankaranarayanan, Hanan Samet, Benjamin E. Teitler, Michael D. Lieberman, Jon Sperling
Department of Computer Science, University of Maryland
Group Members
• Enkh-Amgalan Baatarjav
• Jedsada Chartree
• Thiraphat Meesumrarn
Twitter dataset samples through public feeds
• Public timeline
• Spritzer
• GardenHose: sparse sampling of all feeds
• BirdDog: tweets written by up to 200,000 users
Introduction: Statistics
U.S. Unique Visitor (000) Trend (Source: comScore Media Metrix)
Introduction: Statistics
• 21% of Twitter accounts are empty placeholders
• 94% of Twitter accounts have fewer than 100 followers
• 10% of Twitter users create 86% of all activity
• 49.6% of Twitter users are inactive (1 tweet in last 7 days)
• 55% of Twitter users use a third-party application
Problem Statement
Conventional systems:
• News aggregators: Google News, Bing News, and Yahoo! News
• Content providers: newspapers, television stations, news blogs
Vast amount of information being generated by Twitter users
• 2008 Southern California earthquake
• Iranian election
Separating News from Junk
Contributions
• Mobilizing millions of Twitter users to be the eyes and ears of the world
• Geographic proximity plays an important role
TwitterStand
• Identifying current news
• Clustering similar tweets into news stories
• Ranking news based on importance
• Geo-tagging news topics
Cont.: Optimization
Inverted index of cluster centroids
• Reduces the number of distance computations
• For each feature f, the index stores pointers to all clusters containing f
• A distance is computed iff at least one feature is common between a tweet and a cluster
Maintaining a list of active clusters
• Centroids are less than three days old
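The inverted index and the active-cluster list can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `ClusterIndex`, its methods, and the use of plain feature sets are assumptions made for the example. Only the three-day age limit and the "at least one common feature" rule come from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class ClusterIndex:
    """Inverted index: feature -> ids of active clusters containing it (hypothetical names)."""

    def __init__(self, max_age=timedelta(days=3)):
        self.max_age = max_age
        self.postings = defaultdict(set)  # feature -> cluster ids
        self.clusters = {}                # cluster id -> (features, centroid time)

    def add_cluster(self, cluster_id, features, centroid_time):
        self.clusters[cluster_id] = (set(features), centroid_time)
        for f in features:
            self.postings[f].add(cluster_id)

    def expire(self, now):
        """Drop clusters whose centroid is older than max_age."""
        stale = [cid for cid, (_, t) in self.clusters.items() if now - t > self.max_age]
        for cid in stale:
            features, _ = self.clusters.pop(cid)
            for f in features:
                self.postings[f].discard(cid)

    def candidates(self, tweet_features):
        """Return only the clusters that share at least one feature with the tweet."""
        found = set()
        for f in tweet_features:
            found |= self.postings.get(f, set())
        return found


now = datetime(2010, 1, 10)  # arbitrary timestamp for the example
index = ClusterIndex()
index.add_cluster("c1", {"earthquake", "california"}, now - timedelta(days=1))
index.add_cluster("c2", {"election", "iran"}, now - timedelta(days=5))
index.expire(now)  # c2 is older than three days and is dropped
print(index.candidates({"earthquake", "iran"}))
```

Distances then need to be computed only against the clusters returned by `candidates`, rather than against every cluster.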
Additional Tweaks: Dealing with Noise
Very noisy medium
Seeding good-quality clusters
• Only seeders are allowed to start a new cluster
• Unreliable feeds are allowed to add to an existing cluster
Drawback
• Seeders mostly consist of conventional news sources
Solution
• Relax the rule: any tweet can form a cluster, but the cluster is inactive if, after k tweets have been added to it, none of the k tweets is from a seeder
• The cluster's status changes to active when a seeder tweet is added to it
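One reading of the relaxed seeder rule is sketched below. The `Cluster` class, the function names, and the account names are hypothetical. The sketch assumes that a cluster started by a non-seeder simply stays inactive until the first seeder tweet arrives; the slide does not spell out the details.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    tweets: list = field(default_factory=list)
    active: bool = False  # assumption: inactive until a seeder tweet is added


def add_tweet(cluster, author, text, seeders):
    """Add a tweet to the cluster; a tweet from a seeder activates it."""
    cluster.tweets.append((author, text))
    if author in seeders:
        cluster.active = True


def start_cluster(author, text, seeders):
    """Relaxed rule: any tweet may start a cluster, active only if the author is a seeder."""
    cluster = Cluster()
    add_tweet(cluster, author, text, seeders)
    return cluster


seeders = {"example_news_feed"}  # hypothetical seeder account
cluster = start_cluster("some_user", "felt an earthquake just now", seeders)
print(cluster.active)  # False
add_tweet(cluster, "example_news_feed", "Earthquake reported", seeders)
print(cluster.active)  # True
```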
Tweak: Fragmentation
Several different clusters on a single topic
• Frequently occurs with online clustering algorithms
• Tweets are distributed across tens or hundreds of duplicate clusters
Solution
• Periodically check for duplicate clusters among the active clusters
• Master cluster: the one with the older time centroid
• Slave cluster: the one with the younger time centroid
• Any new tweet that belongs to the slave cluster is added to the master cluster
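The master/slave redirection can be sketched as below. The slide does not say how duplicates are detected, so the cosine similarity test and the `threshold` value are assumptions, as are the function names. Time centroids are represented as plain numbers for brevity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def find_duplicates(clusters, threshold=0.8):
    """clusters: id -> (centroid vector, time centroid). Returns slave id -> master id."""
    ids = sorted(clusters, key=lambda cid: clusters[cid][1])  # oldest first
    redirect = {}
    for i, older in enumerate(ids):
        if older in redirect:  # already a slave of an even older cluster
            continue
        for younger in ids[i + 1:]:
            if younger in redirect:
                continue
            if cosine(clusters[older][0], clusters[younger][0]) >= threshold:
                redirect[younger] = older  # older cluster becomes the master
    return redirect


def target_cluster(cluster_id, redirect):
    """Cluster that should actually receive a tweet assigned to cluster_id."""
    return redirect.get(cluster_id, cluster_id)


clusters = {
    "a": ({"quake": 1.0, "la": 1.0}, 1),
    "b": ({"quake": 1.0, "la": 0.9}, 2),
    "c": ({"iran": 1.0}, 3),
}
redirect = find_duplicates(clusters)
print(target_cluster("b", redirect))  # a
print(target_cluster("c", redirect))  # c
```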
Tweak: Weight upper bounds
Dynamic corpus: newly added features have high TF-IDF values
• Relatively unimportant words, misspelled words, etc.
Problem: spurious clusters
• Clustering based on an unimportant feature
Solution
• For a tweet to be added to a cluster, the tweet and the cluster should share k common features (k > 1)
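The shared-feature requirement amounts to a simple check before a tweet is admitted to a cluster. In this sketch the function name is hypothetical and the default `k=2` is an assumption; the slide only requires k > 1.

```python
def shares_enough_features(tweet_features, cluster_features, k=2):
    """True if the tweet and the cluster have at least k features in common."""
    return len(set(tweet_features) & set(cluster_features)) >= k


cluster = {"earthquake", "california", "magnitude"}
print(shares_enough_features({"earthquake", "california"}, cluster))  # True
print(shares_enough_features({"earthquake", "lunch"}, cluster))       # False
```

With k > 1, a single rare or misspelled word with an inflated TF-IDF weight cannot pull a tweet into a cluster on its own.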
Tweak: Phrases
Features containing two or more terms: phrases
Problem
• Treating a phrase as separate features loses its meaning: “San Francisco”
• Treating a phrase as a single feature results in a large TF-IDF score
Solution
• Distinguish two kinds of relationships between the words in a phrase by determining how often t1 occurs close to t2
• Finding a dominant word: “Barack” “Obama” => “Obama”
• Merging words into a single feature: “San” “Francisco” => “San Francisco”
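A rough sketch of how the two relationships might be told apart from co-occurrence counts is given below. The decision rule is an assumption made for illustration, not the authors' method: "close to" is simplified to "immediately followed by", and the function name and `threshold` are hypothetical. If both words almost always appear together, they are merged; if only one of them depends on the other, the word that also appears on its own is taken as dominant.

```python
from collections import Counter


def phrase_feature(t1, t2, tweets, threshold=0.8):
    """tweets: list of token lists. Returns the feature to use for the pair, or None."""
    count = Counter()
    pair = 0
    for tokens in tweets:
        count.update(tokens)
        pair += sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (t1, t2))
    if pair == 0:
        return None
    t1_bound = pair / count[t1]  # share of t1 occurrences followed by t2
    t2_bound = pair / count[t2]  # share of t2 occurrences preceded by t1
    if t1_bound >= threshold and t2_bound >= threshold:
        return f"{t1} {t2}"      # merge into a single feature
    if t1_bound >= threshold:
        return t2                # t2 also occurs alone: t2 is dominant
    if t2_bound >= threshold:
        return t1                # t1 also occurs alone: t1 is dominant
    return None                  # no strong relationship; keep separate features


tweets = [
    ["San", "Francisco", "quake"],
    ["San", "Francisco", "traffic"],
    ["Barack", "Obama", "speaks"],
    ["Obama", "wins"],
    ["Obama", "visits"],
]
print(phrase_feature("San", "Francisco", tweets))  # San Francisco
print(phrase_feature("Barack", "Obama", tweets))   # Obama
```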
Topic Geographic Focus
Associate each cluster of tweets with a set of geographic locations
Tweet content: geotagging
1. Toponym recognition: finding all instances of textual references to geographic locations
2. Toponym resolution: determining the correct location for each recognized toponym out of all possible interpretations
Source location of the user
• Metadata contains the user’s location
• Containment or prominence heuristic
Computing Topic Focus
• Ranking geographic locations by frequency
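Ranking locations by frequency can be sketched with a counter. The function name `topic_focus` and the `top_n` parameter are hypothetical, and the input is assumed to be the list of already-resolved locations gathered from a cluster's tweets (from tweet content and from user source locations).

```python
from collections import Counter


def topic_focus(locations, top_n=3):
    """Rank the locations associated with a cluster by how often they occur."""
    return Counter(locations).most_common(top_n)


cluster_locations = ["Los Angeles", "Los Angeles", "Tehran", "Los Angeles", "Tehran", "Paris"]
print(topic_focus(cluster_locations, top_n=2))  # [('Los Angeles', 3), ('Tehran', 2)]
```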