Discovering Hot Topics using Twitter Streaming Data “Social Topics Detection and Geographic Clustering”
Hwi-Gang Kim, Seongjoo Lee, and Sunghyon Kyeong†
Mathematical Analytics Team, National Institute for Mathematical Sciences
2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013)
Niagara Falls, Canada, August 25-28, 2013
†: corresponding author
Outline
• Introduction
• Dataset
• Analysis Methods and Results
• Conclusion
Introduction
Role of SNSs
• Informing breaking news (Twitter journalism)
• Expressing one’s feelings and emotions
• Communication tool in daily life
• Research tools for studying
  - social behaviors,
  - human communication,
  - detection of a flu epidemic,
  - and text mining
In this study
• Twitter streaming API and MongoDB were used for data collection.
• We proposed a measure for the social hot topic detection of the day.
• Geographic communities were detected for the weather-related keywords and visualized using Google Fusion Tables.
Related Works
• Mei et al. (2006) proposed probabilistic latent semantic indexing (PLSI) to discover a spatiotemporal theme pattern on weblogs.
• Wang et al. (2007) proposed the location aware topic model (LATM) to incorporate the relationship between locations and words.
• Yin et al. (2011) proposed Latent Geographical Topic Analysis (LGTA), a novel location-text joint model.
• In general, the EM algorithm takes a huge amount of computing time, and the previous studies did not directly classify locations by topic.
EM: expectation maximization
Dataset
Data collection
• Geo-tagged public statuses tweeted in the United States.
• A total of ~19 million geo-tagged Twitter statuses were obtained from March 23 to April 1, 2013.
• This period includes events such as a spring snowfall, the same-sex marriage issue before the US court, a World Cup qualifier match between the US and Mexico, basketball games, and Easter.
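The collection step above keeps only geo-tagged statuses posted in the United States. The following is a minimal sketch of such a filter, not the authors' actual pipeline. It assumes each status arrives as one JSON object per line with a GeoJSON-style `coordinates` field (longitude first), and it uses a rough bounding box for the contiguous US. The box values, the function names, and the sample records are all illustrative assumptions.

```python
import json

# Rough bounding box for the contiguous United States. This is an assumption
# for illustration; the slides do not state the exact filter that was used.
US_BBOX = {"west": -125.0, "south": 24.0, "east": -66.0, "north": 50.0}


def is_geotagged_in_us(status, bbox=US_BBOX):
    """Return True if a status dict carries a point inside the bounding box."""
    geo = status.get("coordinates")
    if not geo or geo.get("type") != "Point":
        return False
    lon, lat = geo["coordinates"]  # GeoJSON order: longitude, then latitude
    return (bbox["west"] <= lon <= bbox["east"]
            and bbox["south"] <= lat <= bbox["north"])


def filter_stream(lines):
    """Yield parsed statuses from JSON lines, keeping only geo-tagged US ones."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        status = json.loads(line)
        if is_geotagged_in_us(status):
            yield status


# Hypothetical sample records standing in for the live stream.
sample = [
    '{"text": "snow again", "coordinates": {"type": "Point", "coordinates": [-90.2, 38.6]}}',
    '{"text": "no location", "coordinates": null}',
    '',
    '{"text": "hello from Seoul", "coordinates": {"type": "Point", "coordinates": [126.98, 37.57]}}',
]
kept = list(filter_stream(sample))
```

In a real deployment the kept statuses would then be written to the database; that step is omitted here because it depends on the cluster setup.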
Twitter streaming data in US
MongoDB Sharding
[Figure: MongoDB sharded cluster. A client application connects through a mongos router to three shards (Shard1, Shard2, Shard3), each a replica set of three mongod instances, with three config servers (C1, C2, C3), each running mongod.]
Analysis Methods and Results
Word frequency
$$wf_{\omega} = \sum_{t \in T} \sum_{s \in S} f_{\omega t s}$$
where $f_{\omega t s}$ is the frequency function for a word ($\omega$) in a US state ($s$) at time ($t$).
The most frequently tweeted words are not social topics but emotional words expressing one’s feelings (e.g., lol, like, and love).
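The word frequency defined above is a double sum over days and states. A small sketch of that aggregation follows, assuming the tweets have already been reduced to (day, state, text) records. The records, the whitespace tokenizer, and the variable names are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter

# Hypothetical records: (day, US state, tweet text).
records = [
    ("2013-03-24", "MO", "lol so much snow"),
    ("2013-03-24", "PA", "snow day lol"),
    ("2013-03-25", "MO", "love this lol"),
    ("2013-03-31", "PA", "happy easter"),
]

# f[(word, day, state)]: count of a word in one state on one day.
f = Counter()
for day, state, text in records:
    for word in text.lower().split():  # naive whitespace tokenizer (assumption)
        f[(word, day, state)] += 1

# wf[word]: the same counts summed over every day and every state.
wf = Counter()
for (word, day, state), count in f.items():
    wf[word] += count

top = wf.most_common(2)
```

Even in this toy data the top-ranked word is an emotional filler ("lol") rather than an event keyword, which is the effect the slide describes.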
Top 5 words and Easter
Distribution of Word Freq.
[Figure: distribution of word frequency. Horizontal axis: log10(word frequency); vertical axis: log10(counts). Labeled words: lol, like, love, Easter.]
※ scale-free distribution
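One way to inspect such a distribution is to bin words by the logarithm of their frequency and take the logarithm of each bin's size. The sketch below does this with the standard library only. The bin width of 0.5, the function name, and the word counts are assumptions chosen for illustration, not values from the study.

```python
import math
from collections import Counter


def log10_histogram(word_counts, bin_width=0.5):
    """Bin words by log10(frequency); return (bin lower edge, log10(bin size))."""
    bins = Counter()
    for count in word_counts.values():
        if count > 0:  # log10 is undefined for zero counts
            bins[math.floor(math.log10(count) / bin_width)] += 1
    return [(b * bin_width, math.log10(n)) for b, n in sorted(bins.items())]


# Hypothetical word counts: a few very frequent words, many rare ones.
word_counts = {
    "lol": 2000, "like": 900, "easter": 40, "snow": 30,
    "qualifier": 3, "pathway": 2, "sharding": 1,
}
hist = log10_histogram(word_counts)
```

For a scale-free distribution the resulting points fall roughly on a straight line with negative slope; seven words are of course far too few to show that.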
The ratio of word frequency $R_{\omega t}$: a measure of social topics
Ratio of Word Freq.
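$$R_{\omega t} = \frac{F_{\omega t} - F_{\omega, t-1}}{F_{\omega t} + F_{\omega, t-1}}, \qquad F_{\omega t} = \sum_{s \in S} f_{\omega t s}$$
$F_{\omega t}$ is the time series function for a word ($\omega$), integrated over the spatial index ($s$). $R_{\omega t}$ is the ratio of word frequency used to measure social topics.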
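The ratio compares a word's count on one day with its count on the previous day, normalized by their sum, so for non-negative counts it always lies between -1 and 1. A minimal sketch follows, assuming the daily counts are stored in a dictionary keyed by (word, day). The counts are made up, and returning 0.0 for a word absent on both days is a choice made here, not something stated in the slides.

```python
def daily_ratio(F, word, day, prev_day):
    """Compute (F_t - F_{t-1}) / (F_t + F_{t-1}) for one word and one day."""
    now = F.get((word, day), 0)
    before = F.get((word, prev_day), 0)
    total = now + before
    if total == 0:
        return 0.0  # assumption: a word unseen on both days counts as "no change"
    return (now - before) / total


# Hypothetical daily counts, already summed over all states.
F = {
    ("snow", "2013-03-24"): 900, ("snow", "2013-03-25"): 300,
    ("lol", "2013-03-24"): 10000, ("lol", "2013-03-25"): 10100,
}
r_snow = daily_ratio(F, "snow", "2013-03-25", "2013-03-24")  # (300 - 900) / 1200
r_lol = daily_ratio(F, "lol", "2013-03-25", "2013-03-24")    # 100 / 20100
```

With these numbers the event keyword swings strongly (ratio -0.5) while the high-volume filler word barely moves (ratio near 0.005), which is why the normalization helps separate topics from everyday chatter.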
• According to US newspapers, there was heavy snowfall in about six states from the Midwest to the Eastern states, from Missouri to Pennsylvania, on March 24, 2013.
• The snowfall stopped on March 25. Interestingly, $R_{\omega t}$ decreased dramatically for the word set H1 on March 26.
• Topic words during the weekend include entertainment words such as movie and party, but these are also used steadily during the week, albeit less frequently.
Topic - US Law, H4
• On March 26, the hot topic was the same-sex marriage issue before the US court, and we can see the corresponding rapid increase on that day.
Conclusion
• The ratio of word frequency properly detected social hot topics of the day by identifying increasing or decreasing frequency of keywords in Twitter messages,
• while suppressing non-topic keywords such as frequently tweeted emotional words (e.g., lol, like, and love).
• The social topic detection method may be applied on a different time scale, e.g., hourly, monthly, or yearly.
• The geographic clustering based on a social topic appropriately reflected not only the pathway of the spring storm but also the properties of US geography.