Twitris

Browsing real-time data by space, time and theme

http://twitris.knoesis.org

Motivation, Goals

Mumbai Terror Attack 2008

Citizen sensor observations (flickr, twitter, blogs..)

No matter where you looked, tapping into a cultural perception was impossible

We wanted to know what people in India were saying vs. those in Pakistan or the U.S.A

Spatio-Temporal-Thematic Slices of Real-time Data

Around NEWS-WORTHY EVENTS

Using space and time as cues for extracting social perceptions (behind signals)

Summarizing hundreds and thousands of real-time observations

The Health Care Reform Debate in the U.S


Temporal navigation

Spatial Markers

Zooming in on Florida

n-gram Summaries

Zooming in on Washington

n-gram Summaries

Browsing Real-time Data in Context

twitris: socially influenced browsing

Ashu, Raghava, Wenbo, Pramod, Vinh, Karthik, Meena, Amit, and Ajith

kno.e.sis center, Wright State University

Opinion on Iran Election from the US talks about oil economies, blogging

Opinion on Iran Election from Iran talks about theocracy, oppression, demonstration

Spatial perspective

Capture changing perceptions, issues of interest every day; legalize illegal immigrants in the healthcare context on September 18.

Temporal perspective

Capture changing perceptions, issues of interest every day; Nobel is no more the news for Obama! captured October 12.

Find resources related to social perceptions

News and Wikipedia articles to put extracted descriptors in context

Twitris aggregates social perceptions from Twitter using a spatial, temporal and thematic approach. Twitris captures what was said, when it was said and where it was said. Fetch resources from the Web to explore perceptions further. Browse the Web for issues that matter to people, using people's perceptions as the fulcrum.

What does twitris do?

✓ Exploit spatio, temporal semantics for thematic aggregation

✓ Analyze the anatomy of a tweet: "RT @m33na come back and checkl new events on twitris #twitris". RT: retweet, or a repost of a tweet; # (hashtags): user-generated metadata; @: refers to other users

✓ Data from diverse sources (Twitter, news services, Wikipedia, and other Web resources)

✓ End user application

Some statistics from Twitris (unit: tweets)

Healthcare (Aug 19 - Oct 20): 721 K (US only)

Obama (Oct 8 - 20): 312 K (US only)

H1N1 (Oct 5 - 20): 232 K (US only)

Iran Election (June 5 - Oct 20): 2.8 M (worldwide)

[Architecture diagram: Twitris]

Data Collection: event crawlers (event-1 ... event-k ... event-n) over Twitter Search; Author Location Lookup; Geocode Lookup; Data Dumper; Shared Memory; Twitris DB

Data Processing: TFIDF based descriptor extraction; spatio, temporal, thematic descriptor extraction; extracting storylines around descriptors

Presentation: Concept Cloud, news and related articles; Google News widget and DBpedia widget (context + selected term)

- Parallel crawling to scale
- Data processing pipeline to streamline Twitter, geocode services and data analytics, and to handle heterogeneity
- Live resource aggregation
- Near real time: processing up to a day before
- Spatio-temporally weighted text analytics

twitris internals in less than 140 characters

Culled out user observations correlated well with mainstream media (news, blogs)

The fourth estate perspective

Caveats and Future Work

1. Handle Twitter constructs such as hashtags, retweets, mentions and replies better
2. Different viz widgets, such as time series to show changing perceptions from a place for an event, and demographic-based visualizations
3. Sentiment analysis
4. Robust computing approaches (Cloud, Hadoop)
5. FB Connect for sharing and personalization

Check us out at: http://twitris.knoesis.org

Follow us @7w17r15

Become a FB Fan and share Twitris with everyone

A tetris like approach to twitter to gather aggregated social signals is defined as

SOYLENT GREEN and the HEALTH CARE REFORM






Core of Twitris

n-gram summaries: spatio-temporal-thematic event descriptors

Architecture

Step 1: Gathering event-relevant tweets

Because tweets are not pre-categorized


Topical Tweets

Gathering event-specific tweets: Iran Election

1. Pick trending hashtags from Twitter: #iranelection, #iran ..
2. Use Google Insights to expand the hashtag list
3. Issue a Twitter Search (API) query every 30 seconds for every hashtag and keyword (1500 tweets per query)
4. Obtain other hashtags in crawled tweets, and check for topic drifts
5. Repeat from Step 3 and babysit!
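The crawl-and-expand loop in the steps above can be sketched roughly as follows. This is a minimal illustration, not the Twitris crawler: `fake_search` is a hypothetical stand-in for the Twitter Search API, the `min_count` threshold is an assumption, and the 30-second timer is omitted.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def fake_search(tag):
    """Hypothetical stand-in for a Twitter Search API call for one hashtag."""
    corpus = [
        "Protests in Tehran #iranelection #neda",
        "Votes being recounted #iranelection #iran",
        "#iran #neda video spreading fast",
        "New phone released today #gadgets #iran",
    ]
    return [t for t in corpus if tag in t.lower().split()]

def crawl_round(search, hashtags):
    """Step 3: query every tracked hashtag once and pool the distinct tweets."""
    tweets = []
    for tag in sorted(hashtags):
        tweets.extend(search(tag))
    return list(dict.fromkeys(tweets))  # de-duplicate, keep first-seen order

def expand_hashtags(tweets, known, min_count=2):
    """Step 4: hashtags seen at least min_count times that are not yet tracked."""
    counts = Counter(tag.lower() for t in tweets for tag in HASHTAG.findall(t))
    return {tag for tag, c in counts.items() if c >= min_count and tag not in known}

seeds = {"#iranelection", "#iran"}
tweets = crawl_round(fake_search, seeds)
candidates = expand_hashtags(tweets, seeds)
```

Here `#neda` becomes a candidate for the next round, while the one-off `#gadgets` stays below the threshold. Candidates would still need a manual look for topic drift before being added, which is the "babysit" part of Step 5.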


Step 2: Spatial, temporal metadata of tweets

2 System Overview

Twitris is currently designed to:
- Collect user-posted tweets pertaining to an event from Twitter
- Process obtained tweets to extract key descriptors and surrounding discussions
- Present extracted summaries to users

The duration and intervals of data collection and processing are configured based on the event being analyzed. Figure 1 illustrates the various steps and services involved in data collection, analysis and visualization.

Fig. 1: Data Collection, analysis and visualizing in Twitris

Gathering Topically Relevant Data

The process of obtaining citizen observations from Twitter deserves some explanation since Twitter does not explicitly categorize user messages into topics. However, there is a search API (footnote 4) to extract tweets. A recent trend in Twitter has been the community-driven convention of adding additional context and metadata to tweets via hashtags, that can also be used to retrieve relevant tweets. Hashtags are similar to tags on Flickr, except they are added inline to a tweet. They are created simply by prefixing a word with a hash symbol; for example, users would tag a tweet about Madonna using the hashtag #madonna.

Our strategy for obtaining posts relevant to an event uses a set of seed keywords, their corresponding hashtags and the Twitter search API. Seed keywords are obtained via a semi-automatic process using Google Insights for Search (footnote 5), a free service from Google that provides top searched and trending keywords across specific regions, categories, time frames and properties. The intuition is that keywords with high search volumes indicate a greater level of social interest and are therefore more likely to be used by posters on Twitter.

We start with a search term that is highly pertinent to an event and get the top X keywords during a time period from Google Insights. For the G20 summit event, for example, one could use the keyword g20 to obtain seed keywords. These keywords are manually verified for sufficient coverage of posts using the Twitter Search API, placed in set K̂, and used to kick-start the data collection process. Past this step, the system automatically collects data every few hours. The list of keywords K̂ is also continually updated using two heuristics:
1. The first uses Google Insights to periodically obtain new keywords using keywords in K̂ as the starting query.
2. The second uses the corpus of tweets collected so far to detect popular key-

4 http://search.twitter.com/search.json
5 http://www.google.com/insights/search/

Geo-Coordinates of Tweets

Location a tweet originates from

Location it mentions

Approximation: Poster location on Twitter profile

Location: Dayton, OH (Google geocoder service, GeoDB)

Location: “best place in the world” (fail!)
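The profile-location approximation above can be sketched as a lookup that either resolves to coordinates or fails. This is only an illustration: `GAZETTEER` is a hypothetical two-entry table standing in for a geocoding service or GeoDB, and `resolve_location` is an assumed name.

```python
from typing import Optional, Tuple

# Hypothetical mini-gazetteer; a real system would call a geocoding service.
GAZETTEER = {
    "dayton, oh": (39.7589, -84.1916),
    "mumbai": (19.0760, 72.8777),
}

def resolve_location(profile_location: str) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) for a free-text profile location, or None if unknown."""
    key = " ".join(profile_location.lower().split())  # normalize case and spacing
    return GAZETTEER.get(key)

ok = resolve_location("Dayton,  OH")
fail = resolve_location("best place in the world")
```

Tweets whose profile location does not resolve (the "fail!" case) have no usable spatial metadata and would have to be skipped or handled separately.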


Step 3: Spatio-temporal clusters







Spatio-Temporal Clusters of Tweets

Long-running, world-wide events (Iran Election Protest)

clusters by country and week?

Short, world-wide events (Olympics)

clusters by country and day?

Long-running, evolving, local events (Health Care Reform Debate)

clusters by state and day?

Because every event is different.. and we want to preserve social perceptions that generated this data!

Tunable parameters

Tweets in a Spatio-Temporal Cluster

Spatio-temporal bias dictates the granularity at which tweets are processed

Mumbai Terror Attack

Cluster1: Tweets from India, 08/1/08

Cluster2: Tweets from Pakistan, 08/1/08

Cluster n: Tweets from USA, 08/13/08
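Grouping tweets into spatio-temporal clusters with tunable granularity might look like the sketch below. The record fields (`country`, `date`, `text`) and the function names are assumptions for illustration; only "day" and "week" buckets are handled.

```python
from collections import defaultdict
from datetime import date

def time_bucket(d: date, granularity: str) -> str:
    """Map a date to a day or ISO-week label."""
    if granularity == "day":
        return d.isoformat()
    if granularity == "week":
        iso = d.isocalendar()  # (ISO year, ISO week, ISO weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    raise ValueError(f"unsupported granularity: {granularity}")

def cluster(tweets, space_key, granularity):
    """Group tweet texts by (spatial unit, time bucket)."""
    clusters = defaultdict(list)
    for t in tweets:
        key = (t[space_key], time_bucket(t["date"], granularity))
        clusters[key].append(t["text"])
    return dict(clusters)

sample = [
    {"country": "Iran", "date": date(2009, 6, 15), "text": "tweet a"},
    {"country": "Iran", "date": date(2009, 6, 17), "text": "tweet b"},
    {"country": "USA", "date": date(2009, 6, 17), "text": "tweet c"},
]
by_day = cluster(sample, "country", "day")
by_week = cluster(sample, "country", "week")
```

The same tweets fall into three clusters by country and day but only two by country and week, which is the kind of per-event tuning the slides describe.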


Step 4: Thematic descriptors in spatio-temporal cluster







Thematic Descriptors

An event descriptor is an n-gram

1,2 and 3 grams

n-gram descriptors

“President Obama in trying to regain control of the health-care debate will likely shift his pitch in September”

1-grams: President, Obama, in, trying, to, regain, ...

2-grams: “President Obama”, “Obama in”, “in trying”, “trying to”...

3-grams: “President Obama in”, “Obama in trying”; “in trying to”...
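Extracting the 1-, 2- and 3-grams listed above takes only a few lines. The sketch below assumes plain whitespace tokenization, which the slides do not specify.

```python
def ngrams(text, n):
    """Return all contiguous n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "President Obama in trying to regain control"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
```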


A descriptor is an n-gram weighted by:

“President”, “President Obama”, “President Obama in”



Thematic Importance

redundancy: statistically discriminatory in nature

variability: contextually important




Spatial Importance (local vs. global popularity)

Temporal Importance (always popular vs. currently trending)


Thematic Importance of an n-gram

Exploiting Redundancy

tfidf of n-gram (Lucene Index)

amplify by fraction of nouns in the n-gram (Stanford Natural Language Parser)

amplify by fraction of non-stop words (‘going to try’)
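One way to read the two "amplify" bullets is as multiplicative boosts on the TFIDF score. The slides do not give the exact formula, so the `(1 + fraction)` form below is purely an assumption, as are the tiny `STOP_WORDS` list and the hand-supplied noun set (a real system would take nouns from a parser or tagger).

```python
# Assumed illustrative stop-word list; not the list used by Twitris.
STOP_WORDS = {"a", "an", "the", "to", "in", "of", "is"}

def amplified_score(ngram, tfidf, nouns):
    """Boost a TFIDF score by the noun and non-stop-word fractions (assumed form)."""
    tokens = ngram.lower().split()
    noun_fraction = sum(t in nouns for t in tokens) / len(tokens)
    nonstop_fraction = sum(t not in STOP_WORDS for t in tokens) / len(tokens)
    return tfidf * (1 + noun_fraction) * (1 + nonstop_fraction)

nouns = {"president", "obama"}
strong = amplified_score("President Obama", 1.0, nouns)
weak = amplified_score("going to try", 1.0, nouns)
```

With equal TFIDF, the all-noun descriptor ends up well above "going to try", which contains no nouns and one stop word.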


Thematic Importance of an n-gram

Exploiting Variability

Big three/Big 3; Ford, GM, Chrysler, General Motors..

Contextually relevant words boost statistical importance

Focus word (fw) : “big three”

Associated words (awi) : co-occurring in spatio-temporal set of tweets


Thematic importance of focus word:

focus word (fw): Big Three

tfidf of fw tfidf of awi

associated word (awi): Ford

association strength of fw and awi


Thematic Importance of an n-gram

Contextual Relevance

Association Strength of fw and awi

depends on contexts


chrysler, GM, big 3

focus, model, release..

strengthen the score of the descriptor. However, we also need to pay attention to changing viewpoints in citizen observations that may result in descriptors occurring in completely different contexts. If the usage of ‘Ford’ is not in the context of the ‘Big Three’, i.e. discussions around Ford surround its new ‘Ford Focus’ model, its presence should not affect ‘Big Three’s’ importance.

Contextually Enhanced Thematic Score: Here, we describe how the thematic score of an extracted descriptor, ‘Big 3’ in the above example, is amplified as a function of the importance of its strong associations (‘General Motors’, ‘Ford’ and ‘Chrysler’) and the association strengths between the descriptor and the associations. For the sake of brevity, let us call the ngram_i descriptor whose thematic score we are interested in affecting the focus word fw, and its strong associations C_fw = {aw_1, aw_2, ...}. The thematic score of the focus word is then enhanced as:

\[
\mathrm{fw}(th) = \mathrm{fw}(tfidf) + \sum_{i} \mathrm{assocstr}(\mathrm{fw}, \mathrm{aw}_i) \cdot \mathrm{aw}_i(tfidf) \tag{1}
\]

where fw(tfidf) and aw_i(tfidf) are the TFIDF scores of the focus and associated word as per Step 3 in the previous section; assocstr(fw, aw_i) is the association strength between the focus word and the associated word. Here we describe how we find strong associations for a focus word and compute assocstr scores. Our algorithm begins by first gathering all possible associations for fw and places them in C_fw. We define associations, or the context of a word, as thematically strong descriptors (in the top 5 n-grams of an observation) that co-occur with the focus word in the given spatio-temporal corpus. The goal is to amplify the score of the focus word only with the strongly associated words in C_fw. One way to measure strength of associations is to use word co-occurrence frequencies in language [9]. Borrowing from past success in this area, we measure the association strength between the focus word and the associated words, assocstr(fw, aw_i), using the notion of point-wise mutual information in terms of co-occurrence statistics. We measure assocstr scores as a function of the point-wise mutual information between the focus word and the context of aw_i. This is done to ensure that the association strengths are determined in the contexts that the descriptors occur in. Let us call the contexts for aw_i C_{aw_i} = {c^{aw}_1, c^{aw}_2, ...}, where the c^{aw}_k are thematically strong descriptors that collocate with aw_i. assocstr(fw, aw_i) is computed as:

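\[
\mathrm{assocstr}(\mathrm{fw}, \mathrm{aw}_i) = \frac{\sum_{k} \mathrm{pmi}(\mathrm{fw}, c^{aw}_k)}{|C_{aw_i}|}, \quad \forall c^{aw}_k \in C_{aw_i}
\]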

where the point-wise mutual information between fw and c^{aw}_k (the context of aw_i) is calculated as:

\[
\mathrm{pmi}(\mathrm{fw}, c^{aw}_k) = \log \frac{p(\mathrm{fw}, c^{aw}_k)}{p(\mathrm{fw})\, p(c^{aw}_k)} = \log \frac{p(c^{aw}_k \mid \mathrm{fw})}{p(c^{aw}_k)} \tag{2}
\]

where

\[
p(\mathrm{fw}) = \frac{n(\mathrm{fw})}{N}; \qquad p(c^{aw}_k \mid \mathrm{fw}) = \frac{n(c^{aw}_k, \mathrm{fw})}{n(\mathrm{fw})};
\]

n(fw) is the frequency of the focus word; n(c^{aw}_k, fw) is the co-occurrence count of words c^{aw}_k and fw; and N is the number of tokens. All statistics are computed with respect to the corpus defined by the spatio-temporal setting. As we can see, this score is not symmetric, and if the context of aw_i is poorly associated with fw, assocstr(fw, aw_i) is a low score.

At the end of evaluating all associations in C_fw, we pick those descriptors whose association scores are greater than the average association scores of the

Contexts of associated word awi : ‘Ford’


Pointwise Mutual Information

Certain descriptors will always dominate discussions

“Terrorism” in Mumbai Terror Attack Tweets

“Healthcare” in Health Care reform debate

Allow recent (possibly interesting) ones to surface

Temporal Importance of a Descriptor

Fig. 2: (a) Extracted descriptors sorted by TFIDF vs. spatio-temporal-thematic scores; (b) top 15 extracted descriptors in the US for the Mumbai attack event across 5 days

focus word and all associations in C_fw. The thematic weights of these associations along with their strengths are plugged into Eqn 1 to compute the enhanced thematic score ngram_i(th) of the n-gram descriptor.
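The contextually enhanced thematic score (Eqn 1, with association strengths from averaged PMI as in Eqn 2) can be sketched as below. Several simplifications are assumptions of this sketch, not the paper's method: counts are taken over observations (each a set of descriptors) rather than tokens, pairs that never co-occur get a PMI of 0, and the above-average filtering of associations is left out.

```python
import math

def pmi(fw, c, docs):
    """log p(c | fw) / p(c), with probabilities estimated over observations."""
    n_fw = sum(fw in d for d in docs)
    n_c = sum(c in d for d in docs)
    n_joint = sum(fw in d and c in d for d in docs)
    if n_fw == 0 or n_c == 0 or n_joint == 0:
        return 0.0  # assumption: no evidence means no association
    return math.log((n_joint / n_fw) / (n_c / len(docs)))

def assocstr(fw, contexts_of_aw, docs):
    """Average PMI between fw and the context descriptors of one associated word."""
    if not contexts_of_aw:
        return 0.0
    return sum(pmi(fw, c, docs) for c in contexts_of_aw) / len(contexts_of_aw)

def thematic_score(fw, tfidf, contexts, docs):
    """Eqn 1: fw(tfidf) plus association-weighted TFIDF of associated words."""
    boost = sum(assocstr(fw, ctx, docs) * tfidf[aw] for aw, ctx in contexts.items())
    return tfidf[fw] + boost

docs = [
    {"big three", "ford", "chrysler", "gm"},
    {"big three", "chrysler", "gm"},
    {"ford", "focus", "model"},
    {"ford", "focus", "release"},
]
tfidf = {"big three": 1.0, "ford": 0.5}
in_context = thematic_score("big three", tfidf, {"ford": ["chrysler", "gm"]}, docs)
off_context = thematic_score("big three", tfidf, {"ford": ["focus", "model"]}, docs)
```

When Ford's context is ‘chrysler, gm’ the focus word ‘big three’ is boosted; when Ford's context is ‘focus, model’ the score stays at the plain TFIDF value, mirroring the ‘Ford Focus’ example.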

B. Temporal Importance of an event descriptor: While the thematic scores are good indicators of what is important in a spatio-temporal setting, certain descriptors tend to dominate discussions. In order to allow for less popular, possibly interesting descriptors to surface, we discount the thematic score of a descriptor depending on how popular it has been in the recent past. The temporal discount score for an n-gram, a tuneable factor depending on the nature of the event, is calculated over a period of time as:

\[
\mathrm{ngram}_i(te) = \mathrm{temporalbias} \cdot \sum_{d=1}^{D} \frac{\mathrm{ngram}_i(th)_d}{d}
\]

where ngram_i(th)_d is the enhanced thematic score of the descriptor on day d, and D is the duration for which we wish to apply the dampening factor, for example, the recent week. However, this temporal discount might not be relevant for all applications. For this reason, we also apply a temporalbias weight ranging from 0 to 1: a weight closer to 1 gives more importance, while a weight closer to 0 gives lesser importance, to past activity.
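The temporal discount can be written as a short function. The sketch assumes that d = 1 is the most recent past day, so that older days are dampened more; the text does not state the direction of the index explicitly.

```python
def temporal_discount(past_scores, temporal_bias):
    """temporal_bias * sum over d of (thematic score on past day d) / d.

    past_scores[0] is assumed to be day d = 1 (the most recent past day).
    """
    if not 0.0 <= temporal_bias <= 1.0:
        raise ValueError("temporal_bias must be in [0, 1]")
    return temporal_bias * sum(s / d for d, s in enumerate(past_scores, start=1))

full = temporal_discount([4.0, 2.0, 3.0], 1.0)   # 4/1 + 2/2 + 3/3
half = temporal_discount([4.0, 2.0, 3.0], 0.5)
none = temporal_discount([4.0, 2.0, 3.0], 0.0)
```

A bias of 0 switches the discount off entirely, so a descriptor's past popularity no longer counts against it.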

C. Spatial Importance of an event descriptor: We also discount the importance of a descriptor based on its occurrence in other spatio-temporal sets. The intuition is that descriptors that occur all over the world on a given day are not as interesting compared to those that occur only in the spatio-temporal set of interest. We define the spatial discount score for an n-gram as a fraction of spatial sets or partitions (e.g. countries) that had activity surrounding this descriptor.

\[
\mathrm{ngram}_i(sp) = \frac{k}{|\text{spatio-temporal sets}|} \cdot (1 - \mathrm{spatialbias})
\]

0-1 bias: less to more importance to recent n-grams

Local descriptors are more interesting compared to global ones

Spatial discount

Spatial Importance of a Descriptor


fraction of spatio-temporal clusters n-gram occurred in

closer to 0 = global importance

STT Score of an n-gram

Spatio-temporal-thematic score of a descriptor

= thematic score - spatio-temporal discounts

where k is the number of spatio-temporal sets the n-gram occurred in. Similar to the temporal bias, we also introduce a spatialbias that gives importance to local vs. global activity for the descriptor on a scale of 0 to 1. A weight closer to 1 does not give importance to the global spatial discount, while a weight closer to 0 gives a lot of importance to the global presence of the descriptor.

Depending on the event of interest, both these discounting factors can also vary for different spatio-temporal sets. For example, when processing tweets from India for the Mumbai attack, setting the spatialbias to 1 eliminates the influence of global social signals. While processing tweets from the US, one might want a stronger global bias given that the event did not originate there. Both these parameters are set before we begin the processing of observations.

Finally, the spatial and temporal effects are discounted from the final score, making the final spatio-temporal-thematic (STT) weight of the n-gram:

\[
w_i = \mathrm{ngram}_i(th) - \mathrm{ngram}_i(te) - \mathrm{ngram}_i(sp) \tag{3}
\]
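The spatial discount and the final STT weight (Eqn 3) combine as in the sketch below. Function and parameter names are assumptions chosen for illustration.

```python
def spatial_discount(k, num_sets, spatial_bias):
    """Fraction of spatio-temporal sets containing the n-gram, scaled by (1 - bias)."""
    if num_sets <= 0:
        raise ValueError("num_sets must be positive")
    return (k / num_sets) * (1.0 - spatial_bias)

def stt_weight(thematic, temporal, spatial):
    """Eqn 3: thematic score minus the temporal and spatial discounts."""
    return thematic - temporal - spatial

global_view = spatial_discount(3, 4, 0.0)   # global presence fully counted
local_view = spatial_discount(3, 4, 1.0)    # global presence ignored
w = stt_weight(5.0, 1.5, global_view)
```

With a spatial bias of 1 the discount vanishes, which corresponds to the India example: global social signals no longer lower the weight of a locally observed descriptor.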

Figure 2(a) illustrates the effect of our enhanced STT weights for extracted event descriptors pertaining to the Mumbai terror attack event, in the US on a particular day. We used a temporal bias of 1, suggesting that past activity was important, and a spatial bias of 0, giving importance to the global presence of the descriptor. As we see, descriptors generic to other spatial and temporal settings (e.g., mumbai and mumbai attacks) get weighted lower, allowing the more interesting ones to surface higher.

Figure 2(b) shows the top 15 extracted descriptors in the US across five days (days that had at least three citizen observations). As we see, the descriptors extracted by our system offer a good indication of what is being talked about on those days. In an ongoing user study, we are showing users tweets on any given day and investigating how useful descriptors extracted by our system are compared to those generated using the TFIDF baseline. Results of the same will be made available at [1].

3. Discussions around Event Descriptors

While it is useful to know what entities people are talking about, there might be different storylines surrounding these entities that could offer an insight into the social perceptions of an event. The goal here is to thematically group discussions surrounding event descriptors, while also allowing users to observe how these discussions change over time and space. We take a simple clustering approach to this problem, forming k clusters, each representing a viewpoint or storyline within a spatio-temporal setting. While this is similar in spirit to clustering of documents to reveal storylines as presented in [5], we use a mutual information based approach.

Let us call the n-gram of interest the focus word fw. The steps involved in identifying storylines surrounding fw are the following (see Figure 3):
1. As in our previous algorithm, we find all associations for a focus word; C_fw = {aw_1, aw_2, ...}, i.e. thematically strong descriptors that collocate with the fw in the given spatio-temporal corpus.
2. In order to pick cues for complementary viewpoints, we pick n associations

higher-order n-grams picked over lower-order n-grams (if same scores)

Top X Descriptor Tag Cloud

Tag size proportional to enhanced STT score
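Selecting the top X descriptors with the higher-order tie-break and sizing the tags can be sketched as follows. The `max_px` font size and the linear scaling against the top score are assumptions; the slides only say that tag size is proportional to the score.

```python
def top_descriptors(scores, x):
    """Top x n-grams by STT score; on ties, longer n-grams come first."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], -len(kv[0].split())))
    return ranked[:x]

def tag_sizes(top, max_px=36.0):
    """Font size proportional to score, with the top score mapped to max_px."""
    if not top or top[0][1] <= 0:
        return {}
    largest = top[0][1]
    return {ngram: max_px * score / largest for ngram, score in top}

scores = {"obama": 2.0, "president obama": 2.0, "health care": 1.0}
top2 = top_descriptors(scores, 2)
sizes = tag_sizes(top_descriptors(scores, 3))
```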
