Twitris Browsing real-time data by space, time and theme http://twitris.knoesis.org
May 21, 2015
Motivation, Goals
Mumbai Terror Attack 2008
Citizen sensor observations (flickr, twitter, blogs..)
No matter where you looked, tapping into a cultural perception was impossible
We wanted to know what people in India were saying vs. those in Pakistan or the U.S.A
Spatio-Temporal-Thematic Slices of Real-time Data
Around NEWS-WORTHY EVENTS
Using space and time as cues for extracting social perceptions (behind signals)
Summarizing hundreds and thousands of real-time observations
The Health Care Reform Debate in the U.S
Temporal navigation
Spatial Markers
Zooming in on Florida
n-gram Summaries
Zooming in on Washington
n-gram Summaries
Browsing Real-time Data in Context
twitris: socially influenced browsing
Ashu, Raghava, Wenbo, Pramod, Vinh, Karthik, Meena, Amit, and Ajith
kno.e.sis center, Wright State University
Opinion on Iran Election from the US talks about oil economies, blogging
Opinion on Iran Election from Iran talks about theocracy, oppression, demonstration
Spatial perspective
Capture changing perceptions, issues of interest every day; legalize illegal immigrants in the healthcare context on September 18.
Temporal perspective
Capture changing perceptions, issues of interest every day; Nobel is no more the news for Obama! captured October 12.
Find resources related to social perceptions
News and Wikipedia articles to put extracted descriptors in context
Twitris aggregates social perceptions from Twitter using a spatial, temporal and thematic approach. Twitris captures what was said, when it was said and where it was said. Fetch resources from the Web to explore perceptions further. Browse the Web for issues that matter to people, using people's perceptions as the fulcrum.
What does twitris do?
✓ Exploit spatial, temporal semantics for thematic aggregation
✓ Analyze the anatomy of a tweet: "RT @m33na come back and checkl new events on twitris #twitris". RT: retweet, or a repost of a tweet; # (hashtags): user-generated metadata; @: refers to other users
✓ Data from diverse sources (Twitter, news services, Wikipedia, and other Web resources)
✓ End user application
A few statistics from Twitris (unit: tweets)
Healthcare (Aug 19 - Oct 20): 721 K (US only)
Obama (Oct 8 - 20): 312 K (US only)
H1N1 (Oct 5 - 20): 232 K (US only)
Iran Election (June 5 - Oct 20): 2.8 M (worldwide)
[Figure: Twitris architecture. Components shown: Data Collection (parallel crawlers event-1 ... event-k ... event-n over Twitter Search, Author Location Lookup, Geocode Lookup, Data Dumpers, Shared Memory, Twitris DB); Data Processing (TFIDF based descriptor extraction; Spatio, Temporal, Thematic descriptor extraction; extracting storylines around descriptors); front end (Concept Cloud, News and related articles; Google News widget and DBpedia widget, each fed with Context + Selected Term).]
twitris internals in less than 140 characters
Parallel crawling to scale
Data processing pipeline to streamline Twitter, geocode services, data analytics, to handle heterogeneity
Live resource aggregation
Near real time: processing up to a day before
Spatio-temporally weighted text analytics
Culled out user observations correlated well with mainstream media (news, blogs)
The fourth estate perspective
Caveats and Future Work
1. Handle Twitter constructs such as hashtags, retweets, mentions and replies better
2. Different viz widgets, such as time series to show changing perceptions from a place for an event, and demographic-based visualizations
3. Sentiment analysis
4. Robust computing approaches (Cloud, Hadoop)
5. FB Connect for sharing and personalization
Check us out at: http://twitris.knoesis.org
Follow us @7w17r15
Become a FB Fan and share Twitris with everyone
A Tetris-like approach to Twitter to gather aggregated social signals is defined as
SOYLENT GREEN and the HEALTH CARE REFORM
Core of Twitris
n-gram summaries: spatio-temporal-thematic event descriptors
Architecture
Step 1: Gathering event-relevant tweets
Because tweets are not pre-categorized
Skip if I run out of time ..
Topical Tweets
Gathering event-specific tweets: Iran Election
1. Pick trending hashtags from Twitter - #iranelection; #iran ..
2. Google Insights to expand hashtag list
3. Issue a Twitter Search (API) every 30 seconds for every hashtag, keyword
   1500 tweets per query
4. Obtain other hashtags in crawled tweets
   Check for topic drifts
5. Repeat from Step 3 and babysit!
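Steps 3 and 4 above can be sketched roughly as follows. This is only an illustration: `fake_search` is a hypothetical stand-in for the Twitter Search call, the canned tweets are made up, and the `min_count` threshold is an assumed way of proposing new hashtags. Candidates are returned for review rather than added automatically, which is where the "check for topic drifts" and "babysit" parts come in.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def harvest_hashtags(tweets, known, min_count=2):
    """Return hashtags seen at least min_count times that are not tracked yet."""
    counts = Counter(tag.lower() for text in tweets for tag in HASHTAG.findall(text))
    return {tag for tag, n in counts.items() if n >= min_count and tag not in known}

def crawl_round(search, queries, min_count=2):
    """One pass of steps 3-4: query every term, then propose new hashtags."""
    tweets = []
    for query in sorted(queries):
        tweets.extend(search(query))
    candidates = harvest_hashtags(tweets, queries, min_count)
    return tweets, candidates

def fake_search(query):
    # Hypothetical canned results standing in for a real search API call.
    canned = {
        "#iranelection": ["Protests in Tehran #iranelection #neda",
                          "Vote count disputed #iranelection #gr88"],
        "#iran": ["#iran #neda rally today", "weather in #iran"],
    }
    return canned.get(query, [])

tweets, candidates = crawl_round(fake_search, {"#iranelection", "#iran"})
print(len(tweets), candidates)
```

A real crawler would sleep between rounds (the slide mentions 30 seconds) and a person would vet `candidates` before adding them to the query set.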
Architecture
Step 1: Gathering event-relevant tweets
Step 2: Spatial, Temporal metadata of tweets
2 System Overview
Twitris is currently designed to
- Collect user posted tweets pertaining to an event from Twitter
- Process obtained tweets to extract key descriptors and surrounding discussions
- Present extracted summaries to users
The duration and intervals of data collection and processing are configured based on the event being analyzed. Figure 1 illustrates the various steps and services involved in data collection, analysis and visualization.
Fig. 1: Data Collection, analysis and visualizing in Twitris
Gathering Topically Relevant Data
The process of obtaining citizen observations from Twitter deserves some explanation since Twitter does not explicitly categorize user messages into topics. However, there is a search API (footnote 4) to extract tweets. A recent trend in Twitter has been the community-driven convention of adding additional context and metadata to tweets via hashtags, which can also be used to retrieve relevant tweets. Hashtags are similar to tags on Flickr, except they are added inline to a tweet. They are created simply by prefixing a word with a hash symbol; for example, users would tag a tweet about Madonna using the hashtag #madonna.
Our strategy for obtaining posts relevant to an event uses a set of seed keywords, their corresponding hashtags and the Twitter search API. Seed keywords are obtained via a semi-automatic process using Google Insights for Search (footnote 5), a free service from Google that provides top searched and trending keywords across specific regions, categories, time frames and properties. The intuition is that keywords with high search volumes indicate a greater level of social interest and are therefore more likely to be used by posters on Twitter.
We start with a search term that is highly pertinent to an event and get top X keywords during a time period from Google Insights. For the g20 summit event for example, one could use the keyword g20 to obtain seed keywords. These keywords are manually verified for sufficient coverage for posts using the Twitter Search API, placed in set K̂, and used to kick-start the data collection process. Past this step, the system automatically collects data every few hours. The list of keywords K̂ is also continually updated using two heuristics:
1. The first uses Google Insights to periodically obtain new keywords using keywords in K̂ as the starting query.
2. The second uses the corpus of tweets collected so far to detect popular key-
Footnote 4: http://search.twitter.com/search.json
Footnote 5: http://www.google.com/insights/search/
Geo-Coordinates of Tweets
Location a tweet originates from
Location it mentions
Approximation: Poster location on Twitter profile
Location: Dayton, OH (Google geocoder service, GeoDB)
Location: “best place in the world” (fail!)
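The profile-location approximation can be illustrated with a small lookup. This is a sketch only: `GAZETTEER` is a hypothetical two-entry table standing in for a geocoder service, and its coordinates are approximate values added for illustration. Free-text locations that do not resolve simply return `None`.

```python
# Hypothetical stand-in for a geocoder; coordinates are approximate.
GAZETTEER = {
    "dayton, oh": (39.7589, -84.1916),
    "mumbai": (19.0760, 72.8777),
}

def resolve_location(profile_location):
    """Map a free-text profile location to (lat, lon), or None if unknown."""
    if not profile_location:
        return None
    key = " ".join(profile_location.lower().split())  # normalise case and spacing
    return GAZETTEER.get(key)

print(resolve_location("Dayton, OH"))
print(resolve_location("best place in the world"))
```

Tweets whose author location cannot be resolved would have to be dropped or handled separately; the slides do not say which.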
Architecture
Step 1: Gathering event-relevant tweets
Step 2: Spatial, Temporal metadata of tweets
Step 3: Spatio-temporal clusters
Spatio-Temporal Clusters of Tweets
Long-running, world-wide events (Iran Election Protest)
clusters by country and week?
Short, world-wide events (Olympics)
clusters by country and day?
Long-running, evolving, local events (Health Care Reform Debate)
clusters by state and day?
Because every event is different.. and we want to preserve social perceptions that generated this data!
Tunable parameters
Tweets in a Spatio-Temporal Cluster
Spatio-temporal bias dictates granularity of processing tweets
Mumbai Terror Attack
Cluster1: Tweets from India, 08/1/08
Cluster2: Tweets from Pakistan, 08/1/08
Cluster n: Tweets from USA, 08/13/08
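A minimal sketch of tunable spatio-temporal clustering follows. The tweet dictionaries, their field names and the sample dates are assumptions made for illustration; the only idea taken from the slides is that the spatial unit (country, state) and the temporal unit (day, week) are parameters chosen per event.

```python
from collections import defaultdict
from datetime import date

def cluster_key(tweet, spatial="country", temporal="day"):
    """Build the (place, time-bucket) key for one tweet."""
    place = tweet[spatial]
    d = tweet["date"]
    if temporal == "week":
        iso = d.isocalendar()
        when = (iso[0], iso[1])  # (ISO year, ISO week number)
    else:
        when = d
    return (place, when)

def cluster(tweets, spatial="country", temporal="day"):
    """Group tweet texts into spatio-temporal clusters."""
    groups = defaultdict(list)
    for t in tweets:
        groups[cluster_key(t, spatial, temporal)].append(t["text"])
    return dict(groups)

# Made-up sample data.
sample = [
    {"country": "India", "date": date(2008, 11, 27), "text": "a"},
    {"country": "India", "date": date(2008, 11, 28), "text": "b"},
    {"country": "USA", "date": date(2008, 11, 27), "text": "c"},
]
by_day = cluster(sample, temporal="day")
by_week = cluster(sample, temporal="week")
print(len(by_day), len(by_week))
```

Switching `temporal` from "day" to "week" merges the two India clusters, which is the kind of granularity change the questions on the slide are about.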
Architecture
Step 1: Gathering event-relevant tweets
Step 2: Spatial, Temporal metadata of tweets
Step 3: Spatio-temporal clusters
Step 4: Thematic Descriptors in spatio-temporal cluster
Thematic Descriptors
An event descriptor is an n-gram
1-, 2- and 3-grams
n-gram descriptors
“President Obama in trying to regain control of the health-care debate will likely shift his pitch in September”
1-grams: President, Obama, in, trying, to, regain, ...
2-grams: “President Obama”, “Obama in”, “in trying”, “trying to”...
3-grams: “President Obama in”, “Obama in trying”; “in trying to”...
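The n-gram listing above can be reproduced with a few lines. This sketch assumes plain whitespace tokenisation, which is a simplification; the slides do not say how tweets are tokenised.

```python
def ngrams(text, n):
    """Return all contiguous n-grams of whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "President Obama in trying to regain control"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams[:3])
print(trigrams[:3])
```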
Thematic Descriptors
A descriptor is an n-gram weighted by:
Thematic Importance
redundancy: statistically discriminatory in nature
variability: contextually important
Spatial Importance (local vs. global popularity)
Temporal Importance (always popular vs. currently trending)
Examples: “President”, “President Obama”, “President Obama in”
Thematic Importance of an n-gram
Exploiting Redundancy
tfidf of n-gram (Lucene Index)
amplify by fraction of nouns in the n-gram (Stanford Natural Language Parser)
amplify by fraction of non-stop words (‘going to try’)
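One possible reading of the redundancy score is sketched below. The slides say only that tfidf is "amplified" by the two fractions; multiplying by (1 + fraction) is an assumption made here, and the noun set and stop-word list are tiny hypothetical stand-ins for a parser and a real stop list.

```python
import math

def tfidf(term_count, total_terms, docs_with_term, total_docs):
    """Plain tf-idf: relative frequency times log inverse document frequency."""
    return (term_count / total_terms) * math.log(total_docs / docs_with_term)

def amplified_score(ngram, base_tfidf, nouns, stop_words):
    """Amplify tf-idf by the fraction of nouns and of non-stop words (assumed form)."""
    tokens = ngram.lower().split()
    noun_frac = sum(t in nouns for t in tokens) / len(tokens)
    content_frac = sum(t not in stop_words for t in tokens) / len(tokens)
    return base_tfidf * (1 + noun_frac) * (1 + content_frac)

NOUNS = {"president", "obama"}     # hypothetical POS-tagger output
STOP_WORDS = {"in", "to", "the"}   # hypothetical stop list

two = amplified_score("President Obama", 1.0, NOUNS, STOP_WORDS)
three = amplified_score("President Obama in", 1.0, NOUNS, STOP_WORDS)
print(two, three)
```

With equal base scores, “President Obama” ends up above “President Obama in” because the trailing stop word dilutes both fractions.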
Thematic Importance of an n-gram
Exploiting Variability
Big three/Big 3; Ford, GM, Chrysler, General Motors..
Contextually relevant words boost statistical importance
Focus word (fw): “big three”
Associated words (awi): co-occurring in spatio-temporal set of tweets
Thematic importance of focus word = tfidf of fw + sum over awi of (association strength of fw and awi × tfidf of awi)
focus word (fw): Big Three
associated word (awi): Ford
Thematic Importance of an n-gram
Contextual Relevance
Association strength of fw and awi depends on contexts
chrysler, GM, big 3
focus, model, release..
strengthen the score of the descriptor. However, we also need to pay attention to changing viewpoints in citizen observations that may result in descriptors occurring in completely different contexts. If the usage of ‘Ford’ is not in the context of the ‘Big Three’, i.e. discussions around Ford surround its new ‘Ford Focus’ model, its presence should not affect ‘Big Three’s’ importance.
Contextually Enhanced Thematic Score: Here, we describe how the thematic score of an extracted descriptor, ‘Big 3’ in the above example, is amplified as a function of the importance of its strong associations - ‘General Motors’, ‘Ford’ and ‘Chrysler’ and the association strengths between the descriptor and the associations. For sake of brevity, let us call the $ngram_i$ descriptor whose thematic score we are interested in affecting the focus word $fw$ and its strong associations $C_{fw} = \{aw_1, aw_2, \ldots\}$. The thematic score of the focus word is then enhanced as:
$$fw(th) = fw(tfidf) + \sum_{i} assocstr(fw, aw_i) \cdot aw_i(tfidf) \qquad (1)$$
where $fw(tfidf)$ and $aw_i(tfidf)$ are the TFIDF scores of the focus and associated word as per Step 3 in the previous section; $assocstr(fw, aw_i)$ is the association strength between the focus word and the associated word. Here we describe how we find strong associations for a focus word and compute $assocstr$ scores. Our algorithm begins by first gathering all possible associations for $fw$ and places them in $C_{fw}$. We define associations or the context of a word as thematically strong descriptors (in the top 5 n-grams of an observation) that co-occur with the focus word in the given spatio-temporal corpus. The goal is to amplify the score of the focus word only with the strongly associated words in $C_{fw}$. One way to measure strength of associations is to use word co-occurrence frequencies in language [9]. Borrowing from past success in this area, we measure the association strength between the focus word and the associated words $assocstr(fw, aw_i)$ using the notion of point-wise mutual information in terms of co-occurrence statistics. We measure $assocstr$ scores as a function of the point-wise mutual information between the focus word and the context of $aw_i$. This is done to ensure that the association strengths are determined in the contexts that the descriptors occur in. Let us call the contexts for $aw_i$ $C_{aw_i} = \{caw_1, caw_2, \ldots\}$, where the $caw_k$ are thematically strong descriptors that collocate with $aw_i$. $assocstr(fw, aw_i)$ is computed as:
$$assocstr(fw, aw_i) = \frac{\sum_{k} pmi(fw, caw_k)}{|C_{aw_i}|}, \quad \forall\, caw_k \in C_{aw_i}$$
where the point-wise mutual information between $fw$ and $caw_k$ (the context of $aw_i$) is calculated as:
$$pmi(fw, caw_k) = \log \frac{p(fw, caw_k)}{p(fw)\, p(caw_k)} = \log \frac{p(caw_k \mid fw)}{p(caw_k)} \qquad (2)$$
where $p(fw) = \frac{n(fw)}{N}$; $p(caw_k \mid fw) = \frac{n(caw_k, fw)}{n(fw)}$; $n(fw)$ is the frequency of the focus word; $n(caw_k, fw)$ is the co-occurrence count of words $caw_k$ and $fw$; and $N$ is the number of tokens. All statistics are computed with respect to the corpus defined by the spatio-temporal setting. As we can see, this score is not symmetric and if the context of $aw_i$ is poorly associated with $fw$, $assocstr(fw, aw_i)$ is a low score.
At the end of evaluating all associations in $C_{fw}$, we pick those descriptors whose association scores are greater than the average association scores of the
Contexts of associated word awi : ‘Ford’
Pointwise Mutual Information
Certain descriptors will always dominate discussions
“Terrorism” in Mumbai Terror Attack Tweets
“Healthcare” in Health Care reform debate
Allow recent (possibly interesting) ones to surface
Temporal Importance of a Descriptor
Fig. 2: (a) Extracted descriptors sorted by TFIDF vs. spatio-temporal-thematic scores (b) Top 15 extracted descriptors in the US for Mumbai attack event across 5 days
focus word and all associations in $C_{fw}$. The thematic weights of these associations along with their strengths are plugged into Eqn 1 to compute the enhanced thematic score $ngram_i(th)$ of the n-gram descriptor.
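The PMI-based association strength and the enhanced thematic score of Eqn 1 can be sketched as below. All counts and tfidf values are made up for illustration, the function and variable names are hypothetical, and the sketch assumes every co-occurrence count is positive (a zero count would make the logarithm undefined). The step that keeps only above-average associations is left out.

```python
import math

def pmi(n_joint, n_fw, n_c, n_tokens):
    """log( p(c | fw) / p(c) ), using raw counts from one spatio-temporal corpus."""
    p_c_given_fw = n_joint / n_fw
    p_c = n_c / n_tokens
    return math.log(p_c_given_fw / p_c)

def assoc_strength(fw, contexts, count, cooc, n_tokens):
    """Average PMI between fw and each context word of an associated word."""
    total = sum(pmi(cooc[(c, fw)], count[fw], count[c], n_tokens) for c in contexts)
    return total / len(contexts)

def enhanced_thematic(fw_tfidf, associations):
    """Eqn 1: associations is a list of (assocstr, aw_tfidf) pairs."""
    return fw_tfidf + sum(strength * aw_tfidf for strength, aw_tfidf in associations)

# Made-up counts: contexts of the associated word 'ford' are 'chrysler' and 'gm'.
count = {"big three": 10, "chrysler": 20, "gm": 50}
cooc = {("chrysler", "big three"): 5, ("gm", "big three"): 5}
strength = assoc_strength("big three", ["chrysler", "gm"], count, cooc, n_tokens=1000)
score = enhanced_thematic(0.4, [(strength, 0.1)])
print(round(strength, 3), round(score, 3))
```

If the contexts of 'ford' were instead words that rarely co-occur with 'big three', the PMI terms would shrink and 'ford' would add little to the focus word's score.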
B. Temporal Importance of an event descriptor: While the thematic scores are good indicators of what is important in a spatio-temporal setting, certain descriptors tend to dominate discussions. In order to allow for less popular, possibly interesting descriptors to surface, we discount the thematic score of a descriptor depending on how popular it has been in the recent past. The temporal discount score for an n-gram, a tuneable factor depending on the nature of the event, is calculated over a period of time as:
$$ngram_i(te) = temporal_{bias} \cdot \sum_{d=1}^{D} \frac{ngram_i(th)_d}{d}$$
where $ngram_i(th)_d$ is the enhanced thematic score of the descriptor on day $d$, and $D$ is the duration for which we wish to apply the dampening factor, for example, the recent week. However, this temporal discount might not be relevant for all applications. For this reason, we also apply a $temporal_{bias}$ weight ranging from 0 to 1 - a weight closer to 1 gives more importance, while a weight closer to 0 gives lesser importance to past activity.
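A small sketch of the temporal discount follows. It assumes that day index 1 is the most recent past day, so older days are divided by a larger number; the text does not state the ordering explicitly. The scores are made up.

```python
def temporal_discount(past_scores, temporal_bias):
    """temporal_bias * sum over d of (thematic score on day d) / d, d = 1..D."""
    return temporal_bias * sum(s / d for d, s in enumerate(past_scores, start=1))

# Made-up thematic scores for the last three days (most recent first, by assumption).
history = [0.6, 0.4, 0.3]
discount = temporal_discount(history, temporal_bias=0.5)
print(round(discount, 3))
```

Setting `temporal_bias` to 0 switches the discount off entirely, which matches the remark that it may not be relevant for every application.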
C. Spatial Importance of an event descriptor: We also discount the importance of a descriptor based on its occurrence in other spatio-temporal sets. The intuition is that descriptors that occur all over the world on a given day are not as interesting compared to those that occur only in the spatio-temporal set of interest. We define the spatial discount score for an n-gram as a fraction of spatial sets or partitions (e.g. countries) that had activity surrounding this descriptor.
$$ngram_i(sp) = \frac{k}{|\text{spatio-temporal sets}|} \cdot (1 - spatial_{bias})$$
0-1 bias: less to more importance to recent n-grams
Local descriptors are more interesting compared to global ones
Spatial discount
Spatial Importance of a Descriptor
fraction of spatio-temporal clusters n-gram occurred in
closer to 0 = global importance
STT Score of an n-gram
Spatio-temporal-thematic score of a descriptor
= thematic score - spatio-temporal discounts
where k = number of spatio-temporal sets the n-gram occurred in. Similar to the temporal bias, we also introduce a spatialbias that gives importance to local vs. global activity for the descriptor on a scale of 0 to 1. A weight closer to 1 does not give importance to the global spatial discount, while a weight closer to 0 gives a lot of importance to the global presence of the descriptor.
Depending on the event of interest, both these discounting factors can also vary for different spatio-temporal sets. For example, when processing tweets from India for the Mumbai attack, setting the spatialbias to 1 eliminates the influence of global social signals. While processing tweets from the US, one might want a stronger global bias given that the event did not originate there. Both these parameters are set before we begin the processing of observations.
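The spatial discount and the effect of the spatialbias setting can be sketched as follows. This is an illustrative sketch only: the function name `spatial_discount` and its argument names are hypothetical, and the counts in the example are invented.

```python
def spatial_discount(sets_with_ngram, total_sets, spatial_bias=0.0):
    """Spatial discount ngram_i(sp) for one descriptor.

    sets_with_ngram is k, the number of spatio-temporal sets the
    n-gram occurred in; total_sets is the number of sets considered.
    """
    if total_sets <= 0:
        raise ValueError("total_sets must be positive")
    if not 0 <= sets_with_ngram <= total_sets:
        raise ValueError("sets_with_ngram must lie in [0, total_sets]")
    if not 0.0 <= spatial_bias <= 1.0:
        raise ValueError("spatial_bias must lie in [0, 1]")
    # Fraction of sets with activity, damped by (1 - spatial_bias).
    return (sets_with_ngram / total_sets) * (1.0 - spatial_bias)


# Invented counts: the descriptor appears in 3 of 12 spatial sets.
global_view = spatial_discount(3, 12, spatial_bias=0.0)  # full discount
local_view = spatial_discount(3, 12, spatial_bias=1.0)   # no discount
```

A spatialbias of 1 drives the discount to zero regardless of k, matching the India example above, where global signals are ignored.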
Finally, the spatial and temporal effects are discounted from the final score, making the final spatio-temporal-thematic (STT) weight of the n-gram as
$w_i = ngram_i(th) - ngram_i(te) - ngram_i(sp)$ (3)
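As a small worked sketch of Eqn 3, the snippet below combines the three components and ranks descriptors by the resulting weight. The helper names `stt_weight` and `rank_by_stt`, the second descriptor name, and all numeric values are invented for illustration; they are not results from the system.

```python
def stt_weight(thematic, temporal, spatial):
    """STT weight w_i = ngram_i(th) - ngram_i(te) - ngram_i(sp)."""
    return thematic - temporal - spatial


def rank_by_stt(components):
    """Rank descriptors by STT weight, highest first.

    components maps a descriptor to a tuple of
    (thematic score, temporal discount, spatial discount).
    """
    weights = {name: stt_weight(*parts) for name, parts in components.items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Invented numbers: a globally popular descriptor with a high thematic
# score versus a hypothetical local one with smaller discounts.
example = {
    "mumbai": (5.0, 3.0, 1.0),
    "hypothetical local term": (4.0, 0.5, 0.25),
}
ranking = rank_by_stt(example)
```

Under these assumed values the local term ends up with the higher weight even though its raw thematic score is lower, which is the behaviour the discounts are meant to produce.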
Figure 2(a) illustrates the effect of our enhanced STT weights for extracted event descriptors pertaining to the Mumbai terror attack event, in the US on a particular day. We used a temporal bias of 1 suggesting that past activity was important and a spatial bias of 0 giving importance to the global presence of the descriptor. As we see, descriptors generic to other spatial and temporal settings (e.g., mumbai and mumbai attacks) get weighted lower, allowing the more interesting ones to surface higher.
Figure 2(b) shows the top 15 extracted descriptors in the US across five days (days that had at least three citizen observations). As we see, the descriptors extracted by our system offer a good indication of what is being talked about on those days. In an ongoing user study, we are showing users tweets on any given day and investigating how useful descriptors extracted by our system are compared to those generated using the TFIDF baseline. Results of the same will be made available at [1].
3. Discussions around Event Descriptors
While it is useful to know what entities people are talking about, there might be different storylines surrounding these entities that could offer an insight into the social perceptions of an event. The goal here is to thematically group discussions surrounding event descriptors, while also allowing users to observe how these discussions change over time and space. We take a simple clustering approach to this problem, forming k clusters, each representing a viewpoint or storyline within a spatio-temporal setting. While this is similar in spirit to clustering of documents to reveal storylines as presented in [5], we use a mutual information based approach.
Let us call the n-gram of interest the focus word fw. The steps involved in identifying storylines surrounding fw are the following (see Figure 3):
1. As in our previous algorithm, we find all associations for a focus word, $C_{fw} = \{aw_1, aw_2, ...\}$, i.e. thematically strong descriptors that collocate with the fw in the given spatio-temporal corpus.
2. In order to pick cues for complementary viewpoints, we pick n associations
higher-order n-grams picked over lower-order n-grams (if same scores)
Top X Descriptor Tag Cloud
Tag size proportional to enhanced STT score
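The two tag cloud rules above (tie-breaking in favour of higher-order n-grams, and tag size proportional to the STT score) can be sketched as below. This is an assumed reading, not the Twitris code: the function names, the point sizes, and the use of whitespace token count as the n-gram order are all hypothetical choices made for the example.

```python
def top_descriptors(scores, x):
    """Return the top x (descriptor, score) pairs.

    Sorted by score, highest first; on equal scores the descriptor
    with more tokens (the higher-order n-gram) is placed first.
    Assumption: n-gram order = number of whitespace-separated tokens.
    """
    ranked = sorted(
        scores.items(),
        key=lambda item: (-item[1], -len(item[0].split())),
    )
    return ranked[:x]


def tag_sizes(ranked, max_pt=36.0, min_pt=10.0):
    """Map each descriptor to a font size proportional to its score.

    max_pt and min_pt are arbitrary illustrative values; sizes are
    floored at min_pt so that low-scoring tags stay legible.
    """
    if not ranked:
        return {}
    top_score = max(score for _, score in ranked)
    if top_score <= 0:
        return {name: min_pt for name, _ in ranked}
    return {
        name: max(min_pt, max_pt * score / top_score)
        for name, score in ranked
    }


# Invented scores; "mumbai attacks" wins the tie against "mumbai".
scores = {"mumbai": 2.0, "mumbai attacks": 2.0, "taj": 1.0}
cloud = tag_sizes(top_descriptors(scores, 3))
```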