Discovering Hot Topics using Twitter Streaming Data “Social Topics Detection and Geographic Clustering” Hwi-Gang Kim, Seongjoo Lee, and Sunghyon Kyeong† Mathematical Analytics Team, National Institute for Mathematical Sciences 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining ASONAM 2013 Niagara Falls, Canada, August 25-28, 2013 †: corresponding author

Discovering Hot Topics using Twitter Streaming Data

Apr 21, 2017


Social Media

Sunghyon Kyeong
Transcript
Page 1: Discovering Hot Topics using Twitter Streaming Data

Discovering Hot Topics using Twitter Streaming Data “Social Topics Detection and Geographic Clustering”

Hwi-Gang Kim, Seongjoo Lee, and Sunghyon Kyeong†

Mathematical Analytics Team, National Institute for Mathematical Sciences

2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining ASONAM 2013

Niagara Falls, Canada, August 25-28, 2013 †: corresponding author

Page 2: Discovering Hot Topics using Twitter Streaming Data

Outline

• Introduction

• Dataset

• Analysis Methods and Results

• Conclusion

Page 3: Discovering Hot Topics using Twitter Streaming Data

Introduction

Page 4: Discovering Hot Topics using Twitter Streaming Data

Role of SNSs

• Reporting breaking news (Twitter journalism)

• Expressing one’s feelings and emotions

• Communication tool in daily life

• Research tools for studying social behaviors, human communication, detection of a flu epidemic, and text mining

Page 5: Discovering Hot Topics using Twitter Streaming Data

In this study

• The Twitter streaming API and MongoDB were used for data collection.

• We proposed a measure for detecting the social hot topics of the day.

• Geographic communities were detected for the weather-related keywords and visualized using Google Fusion Tables.

Page 6: Discovering Hot Topics using Twitter Streaming Data

Related Works

• Mei et al. (2006) used probabilistic latent semantic indexing (PLSI) to discover spatiotemporal theme patterns on weblogs.

• Wang et al. (2007) proposed the location aware topic model (LATM) to incorporate the relationship between locations and words.

• Yin et al. (2011) proposed Latent Geographical Topic Analysis (LGTA), a novel location-text joint model.

• In general, the EM algorithm takes a huge amount of computing time, and the previous studies did not directly classify locations by topics.

EM: expectation maximization

Page 7: Discovering Hot Topics using Twitter Streaming Data

Dataset

Page 8: Discovering Hot Topics using Twitter Streaming Data

Data collection

• Geo-tagged public statuses tweeted in the United States.

• A total of ~19 million geo-tagged Twitter statuses were obtained from March 23 to April 1, 2013.

• This period includes events such as a spring snowfall, the same-sex marriage cases before the US court, the World Cup qualifier match between the US and Mexico, basketball games, and Easter.

[Figure: Twitter streaming data in US]
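The slides do not show the collection code, so the following is only a minimal sketch of the filtering step, assuming each status arrives as one JSON object per line with a GeoJSON-style `coordinates` field (longitude first). The bounding box values and the function names `is_geotagged_in_box` and `filter_geotagged` are illustrative assumptions, not the authors' implementation, and the MongoDB insert is left out.

```python
import json

# Assumed rough bounding box for the contiguous United States (illustrative values only).
US_BBOX = {"west": -125.0, "south": 24.0, "east": -66.0, "north": 50.0}


def is_geotagged_in_box(status, bbox=US_BBOX):
    """Return True if a status dict carries a point coordinate inside bbox."""
    point = status.get("coordinates")
    if not point or point.get("type") != "Point":
        return False
    lon, lat = point["coordinates"]  # GeoJSON order: longitude, then latitude
    return (bbox["west"] <= lon <= bbox["east"]
            and bbox["south"] <= lat <= bbox["north"])


def filter_geotagged(lines, bbox=US_BBOX):
    """Parse JSON lines and keep only statuses geo-tagged inside bbox."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the input
        status = json.loads(line)
        if is_geotagged_in_box(status, bbox):
            kept.append(status)
    return kept


# Hypothetical sample input: one status in Chicago, one without a location, one outside the box.
sample_lines = [
    '{"text": "snow again", "coordinates": {"type": "Point", "coordinates": [-87.63, 41.88]}}',
    '{"text": "no location", "coordinates": null}',
    '{"text": "far away", "coordinates": {"type": "Point", "coordinates": [126.98, 37.57]}}',
    '',
]
us_statuses = filter_geotagged(sample_lines)
print([s["text"] for s in us_statuses])
```

A bounding box is a coarse filter; assigning each status to a US state, as the later analysis requires, would need an additional lookup that this sketch does not attempt.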

Page 9: Discovering Hot Topics using Twitter Streaming Data

MongoDB Sharding

[Figure: MongoDB sharded cluster. A client application connects through a mongos router to three shards (Shard1, Shard2, Shard3), each a replica set of three mongod instances; three config servers (C1, C2, C3) each run a mongod.]

Page 10: Discovering Hot Topics using Twitter Streaming Data

Analysis Methods and Results

Page 11: Discovering Hot Topics using Twitter Streaming Data

Word frequency

$$ wf^{\omega} = \sum_{t \in T} \sum_{s \in S} f^{\omega}_{ts} $$

$f^{\omega}_{ts}$: frequency function for a word ($\omega$) in a US state ($s$) at time ($t$).

The most frequently tweeted words are not social topics but emotional words expressing one’s feelings.

[Figure: Top 5 words and Easter]
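As a minimal sketch of the double sum over time and state, assume the counts $f^{\omega}_{ts}$ are available as (day, state, word, count) tuples. The record layout, the toy numbers, and the function name `total_word_frequency` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy records: each tuple (day, state, word, count) stands in for f^w_ts.
records = [
    ("2013-03-30", "NY", "lol", 5),
    ("2013-03-30", "CA", "lol", 7),
    ("2013-03-31", "NY", "lol", 6),
    ("2013-03-30", "NY", "easter", 1),
    ("2013-03-31", "NY", "easter", 9),
    ("2013-03-31", "CA", "easter", 4),
]


def total_word_frequency(records):
    """Sum the counts of each word over all days and all states."""
    wf = defaultdict(int)
    for _day, _state, word, count in records:
        wf[word] += count
    return dict(wf)


wf = total_word_frequency(records)
# Rank words from most to least frequent.
ranked = sorted(wf.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

In this toy data "lol" outranks "easter" in total frequency even though "easter" changes far more from one day to the next, which is the limitation the slide points out.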

Page 12: Discovering Hot Topics using Twitter Streaming Data

Distribution of Word Freq.

[Figure: distribution of word frequency, with axes log10(word frequency) and log10(Counts); the words lol, like, love, and Easter are marked]

※ scale-free distribution
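One rough way to look at such a distribution is to count how many words fall into each decade of total frequency. The sketch below does only that binning, on hypothetical numbers; it does not test whether the distribution is scale-free, and the names `decade` and `words_per_decade` are illustrative.

```python
from collections import Counter

# Hypothetical total frequencies of 15 distinct words (positive integers).
word_totals = [1200, 150, 120, 12, 11, 15, 18, 1, 1, 2, 3, 4, 5, 7, 9]


def decade(freq):
    """Return floor(log10(freq)) for a positive integer, using its digit count."""
    return len(str(freq)) - 1


def words_per_decade(totals):
    """Count how many words have a total in each decade [10^k, 10^(k+1))."""
    counts = Counter(decade(f) for f in totals if f > 0)
    return dict(sorted(counts.items()))


histogram = words_per_decade(word_totals)
print(histogram)
```

Plotting the logarithm of these counts against the decade index would give a coarse version of the log-log plot on the slide.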

Page 13: Discovering Hot Topics using Twitter Streaming Data

A measure of social topics

$R^{\omega}_t$

The ratio of word frequency

Page 14: Discovering Hot Topics using Twitter Streaming Data

Ratio of Word Freq.

The definition of a ratio of word frequency to measure social topics:

$$ R^{\omega}_t = \frac{F^{\omega}_t - F^{\omega}_{t-1}}{F^{\omega}_t + F^{\omega}_{t-1}} $$

$$ F^{\omega}_t = \sum_{s \in S} f^{\omega}_{ts} $$

$F^{\omega}_t$: the time series function for a word ($\omega$) integrated over the spatial index ($s$).

[Figure: daily time series from Mar/24 to Apr/1 for Easter, lol, like, and love; vertical axis from -1.0 to 1.0]
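The ratio can be sketched in a few lines. The code below assumes the same hypothetical (day, state, word, count) records as input and uses made-up counts; the choice to report 0.0 when a word is absent on both days is an assumption, since the slides do not say how a zero denominator was handled.

```python
def daily_totals(records, word):
    """F^w_t: sum the per-state counts of one word for each day."""
    totals = {}
    for day, _state, w, count in records:
        if w == word:
            totals[day] = totals.get(day, 0) + count
    return totals


def frequency_ratio(totals, days):
    """R^w_t = (F_t - F_{t-1}) / (F_t + F_{t-1}) for each pair of consecutive days."""
    ratios = {}
    for prev, cur in zip(days, days[1:]):
        f_prev, f_cur = totals.get(prev, 0), totals.get(cur, 0)
        denom = f_cur + f_prev
        # Assumption: report 0.0 when the word does not occur on either day.
        ratios[cur] = (f_cur - f_prev) / denom if denom else 0.0
    return ratios


# Hypothetical counts, not taken from the dataset.
days = ["03-29", "03-30", "03-31"]
records = [
    ("03-29", "NY", "easter", 10),
    ("03-30", "NY", "easter", 10), ("03-30", "CA", "easter", 20),
    ("03-31", "NY", "easter", 60), ("03-31", "CA", "easter", 30),
    ("03-29", "NY", "lol", 100),
    ("03-30", "NY", "lol", 60), ("03-30", "CA", "lol", 50),
    ("03-31", "NY", "lol", 99),
]

r_easter = frequency_ratio(daily_totals(records, "easter"), days)
r_lol = frequency_ratio(daily_totals(records, "lol"), days)
print(r_easter)
print(r_lol)
```

Because the difference is divided by the sum, the ratio stays within [-1, 1] for non-negative counts: tripling from 10 to 30 gives 0.5, while a frequent word that moves from 100 to 110 gives less than 0.05.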

Page 15: Discovering Hot Topics using Twitter Streaming Data

Social Topics by $R^{\omega}_t$

Topics: Top words in terms of frequency

Weather: H1={weather, snow, winter, cold, sick}

Daily life: H2={class, school, gym, lunch, job, jobs, tweetmyjobs}

Weekend: H3={bar, party, drinking, beer, movies, drunk, club}

US law: H4={gay, marriage}

Sports 1: H5={soccer, usa, mexico}

Sports 2: H6={basketball, chicago, bulls, lebron, miami, heat, kevin, leg, injury, michigan}

TV show: H7={thewalkingdead, walking, dead}

Easter: H8={easter, church, blessed, bunny, jesus, happy, happyeaster, basket, candy, egg, eggs, god, lord}

April Fools’ Day: H9={april, joke, fool}

Emotions: H10={lol, like, love, shit, fuck, haha, oh, ass}

Page 16: Discovering Hot Topics using Twitter Streaming Data

Topic - Weather, H1

• According to US newspapers, there was heavy snowfall in about six Midwestern and Eastern states, from Missouri to Pennsylvania, on March 24, 2013.

• The snowfall stopped on March 25. Interestingly, $R^{\omega}_t$ decreased dramatically for the word set H1 on March 26.

[Figure: $R^{\omega}_t$ from Mar/24 to Apr/1 for weather, snow, winter, cold, and sick; vertical axis from -0.6 to 0.6]

Page 17: Discovering Hot Topics using Twitter Streaming Data

Topic - Weekend, H3

[Figure: daily time series from Mar/24 to Apr/1 for bar, party, drinking, beer, movies, drunk, and club; vertical axis from -0.4 to 0.4]

• Topic words during the weekend include entertainment words such as movies and party, but these are also used steadily during the week, albeit less frequently.

Page 18: Discovering Hot Topics using Twitter Streaming Data

Topic - US Law, H4

• On March 26, the hot topic was the same-sex marriage issue before the US court, and we can see the corresponding rapid increase on March 26.

[Figure: daily time series from Mar/24 to Apr/1 for gay and marriage; vertical axis from -0.8 to 0.8]

Page 19: Discovering Hot Topics using Twitter Streaming Data

Topic - Sports, H5

• As the US and Mexico played a World Cup qualifying match in Mexico on March 26, we found that $R^{\omega}_t$ for the topic ‘Sports 1’ peaked on March.

[Figure: $R^{\omega}_t$ from Mar/24 to Apr/1 for soccer, usa, and mexico; vertical axis from -0.8 to 0.8]

Page 20: Discovering Hot Topics using Twitter Streaming Data

Topic - Easter, H8

• On March 31, we can see that $R^{\omega}_t$ increases for words about Easter such as easter, happy, bunny, egg(s), god, and jesus.

• This is expected, as Easter is one of the most celebrated Christian festivals in the US.

[Figure: $R^{\omega}_t$ from Mar/24 to Apr/1 for easter, blessed, bunny, jesus, happy, happyeaster, basket, candy, egg, eggs, god, and lord; vertical axis from -1.0 to 1.0]

Page 21: Discovering Hot Topics using Twitter Streaming Data

Topic - Emotions, H10

• $R^{\omega}_t$ for emotional words showed only a small fluctuation ($|R^{\omega}_t| < 0.1$) even though these words rank higher in word frequency.

• This result suggests that the frequency of expressions of feelings and emotions is relatively constant over time.

[Figure: $R^{\omega}_t$ from Mar/24 to Apr/1 for lol, like, love, shit, fuck, haha, oh, and ass; vertical axis from -0.1 to 0.1]

Page 22: Discovering Hot Topics using Twitter Streaming Data

Geographic Clustering

• For each set of hot topic words $H_k$, we computed the spatiotemporal matrix for the k-th hot topic as follows:

$$ \Phi^{k}_{ts} = \sum_{\omega \in H_k} f^{\omega}_{ts} $$

• Then we obtained the adjacency matrix from Pearson’s correlation coefficient between US states:

$$ A^{k}_{ij} = \mathrm{Corr}(\Phi^{k}_{\bullet i}, \Phi^{k}_{\bullet j}) $$

• Modularity (Q) was computed from the weighted graph using a Louvain community detection algorithm, which maximizes Q:

$$ Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{s_i s_j}{2m} \right] \delta(C_i, C_j) $$
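The correlation and modularity steps can be sketched in plain Python. The code below assumes the per-state daily topic counts (the columns of the spatiotemporal matrix) are already summed over the topic words, uses made-up numbers for four states, and evaluates Q for a partition that is supplied by hand; it does not implement the Louvain search itself. Setting negative correlations and the diagonal to zero is an assumption of this sketch, since the slides do not say how those entries were treated.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_adjacency(series):
    """Build a weighted adjacency matrix from state -> daily topic counts."""
    states = sorted(series)
    adj = []
    for a in states:
        row = []
        for b in states:
            r = 0.0 if a == b else pearson(series[a], series[b])
            row.append(max(r, 0.0))  # assumption: drop negative correlations
        adj.append(row)
    return states, adj


def modularity(adj, communities):
    """Q = 1/(2m) * sum_ij [A_ij - s_i s_j / (2m)] * delta(C_i, C_j)."""
    strength = [sum(row) for row in adj]  # s_i: total edge weight at node i
    two_m = sum(strength)
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(len(adj)):
        for j in range(len(adj)):
            if communities[i] == communities[j]:
                q += adj[i][j] - strength[i] * strength[j] / two_m
    return q / two_m


# Hypothetical daily counts of weather words in four states.
series = {
    "MO": [9, 7, 2, 1],
    "PA": [8, 6, 3, 1],
    "CA": [1, 2, 6, 8],
    "FL": [2, 1, 7, 9],
}
states, adj = correlation_adjacency(series)
q_split = modularity(adj, [0, 0, 1, 1])   # {CA, FL} and {MO, PA}
q_single = modularity(adj, [0, 0, 0, 0])  # all states in one community
print(states, round(q_split, 3), round(q_single, 3))
```

In this toy case the two-community partition scores close to 0.5 while the single community scores 0, so a modularity-maximizing search would prefer the split.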

Page 23: Discovering Hot Topics using Twitter Streaming Data

Graph Theory

[Figure: example graph with nodes A, B, C, and D]

Page 24: Discovering Hot Topics using Twitter Streaming Data

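Types of Graph

1. What is degree?

2. Betweenness centrality?

3. Global/local network efficiency?

4. Modular structure

[Figure: three example graphs on nodes 1 to 6: an undirected binary graph, a directed binary graph, and a directed weighted graph]

Adjacency matrix:

$$ A_{ij} = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix} $$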
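To make the first question concrete, degree can be read off an adjacency matrix as a row or column sum. The sketch below uses the matrix exactly as it appears in the transcript; the function names are illustrative. As transcribed, the entry for (2, 5) is 1 while the entry for (5, 2) is 0, so out-degree and in-degree differ for those two nodes. Whether that reflects one of the directed examples or a transcription slip is not clear from the slide.

```python
# Adjacency matrix as transcribed from the slide (nodes numbered 1 to 6).
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0],
]


def out_degrees(adj):
    """Row sums: number of edges leaving each node."""
    return [sum(row) for row in adj]


def in_degrees(adj):
    """Column sums: number of edges entering each node."""
    return [sum(col) for col in zip(*adj)]


def asymmetric_pairs(adj):
    """List 1-based pairs (i, j), i < j, where A[i][j] differs from A[j][i]."""
    n = len(adj)
    return [(i + 1, j + 1)
            for i in range(n) for j in range(i + 1, n)
            if adj[i][j] != adj[j][i]]


print(out_degrees(A))
print(in_degrees(A))
print(asymmetric_pairs(A))
```

For an undirected binary graph the matrix is symmetric and the two degree lists coincide, so an empty list from `asymmetric_pairs` is a quick consistency check.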

Page 25: Discovering Hot Topics using Twitter Streaming Data

Network Analysis Ex.

[Figure: two example networks: a co-authorship network formed by author lists, and a semantic network formed by free association]

Steyvers, Cognitive Science 29 (2005) 41–78; Newman, PNAS 101 (2004) 5200–5205

Page 26: Discovering Hot Topics using Twitter Streaming Data

Geographic Clustering

[Figure: two panels, Geographic Clustering and Adjacency Matrix]

Page 27: Discovering Hot Topics using Twitter Streaming Data

Conclusion

• The ratio of word frequency properly detected the social hot topics of the day by identifying increasing or decreasing frequency of keywords in Twitter messages, while suppressing non-topic keywords such as frequently tweeted emotional words (e.g., lol, like, and love).

• The social topic detection method may be applied on a different time scale, e.g., hourly, monthly, or yearly.

• The geographic clustering based on a social topic appropriately reflected not only the pathway of the spring storm but also the properties of US geography.

Page 28: Discovering Hot Topics using Twitter Streaming Data

Thank you for your attention