Page 1
Computational Framework for Generating Visual Summaries of
Topical Clusters in Twitter Streams*
Authors: Miray Kas, Bongwon Suh
Presenter: Sebastian Alfers - HTW Berlin
Semantic Modeling
* http://link.springer.com/chapter/10.1007%2F978-3-319-02993-1_9
Page 2
Visual Summaries of Twitter Streams
http://flowingdata.com/wp-content/uploads/2010/02/treemap-revised1.gif
http://www.infobarrel.com/media/image/54054.jpg
Page 3
Step 1: get & pre-process data
• Keywords -> Stream Tweets -> Preprocessing / Cleaning
Step 2: construct graph & clustering
• Construct Graph -> Clustering
Step 3: extract keywords & summarize
• Select Relevant Clusters -> Extract Topical Keywords -> Visual Cluster Summary
Page 4
Input: Keywords
• initial set of keywords
• similar to Twitter Search
Page 6
Step 1: Stream Tweets
• HTTP-based API
- JSON, REST
Page 7
• OAuth + HTTP
• here: Java library with Scala and Play Framework
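A minimal sketch of the keyword filter on a tweet stream, written in Python rather than the Scala / Play setup used in the talk. It assumes the stream delivers one JSON object per line with a "text" field; the function name matching_texts and the sample data are hypothetical, and the OAuth handshake and the HTTP connection are left out entirely.

```python
import json


def matching_texts(json_lines, keywords):
    """Yield the text of every tweet that mentions at least one keyword."""
    wanted = [k.lower() for k in keywords]
    for line in json_lines:
        line = line.strip()
        if not line:  # skip blank lines in the stream
            continue
        tweet = json.loads(line)
        text = tweet.get("text", "")  # assumption: tweet text lives in "text"
        if any(k in text.lower() for k in wanted):
            yield text


# Hypothetical sample standing in for the live stream.
sample = [
    '{"text": "Reactive streams in Scala"}',
    '',
    '{"text": "Lunch time"}',
]
print(list(matching_texts(sample, ["scala", "reactive"])))
```

Because the function only needs an iterable of lines, it can be pointed at a file, a test list or a streaming HTTP response without changes.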
Page 8
Step 1: Preprocessing
• transform Tweets
- easy-to-analyze / clean format
• Process of cleaning:
1. lowercase
2. remove URLs, user mentions and stop words
• like @user, "a" or "123"
3. remove special characters (#, .)
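The cleaning steps above can be sketched as follows. This is an illustration under assumptions, not the presented implementation: the stop word list is a tiny placeholder, pure numbers are dropped like stop words (following the "123" example), hyphens are kept so that a token such as akka-http survives, and stop words are filtered last so that "#scala" is already reduced to "scala" when the filter runs.

```python
import re

# Assumption: a tiny illustrative stop word list, not the one used in the talk.
STOP_WORDS = {"a", "an", "the", "and", "or", "is", "in", "of", "to", "with"}


def clean_tweet(text):
    """Return the cleaned tokens of one tweet."""
    text = text.lower()                          # 1. lowercase
    text = re.sub(r"https?://\S+", " ", text)    # 2. remove URLs
    text = re.sub(r"@\w+", " ", text)            # 2. remove user mentions
    text = re.sub(r"[^a-z0-9\s-]", " ", text)    # 3. remove special characters
    tokens = [t.strip("-") for t in text.split()]
    # 2. remove stop words and pure numbers such as "123"
    return [t for t in tokens if t and t not in STOP_WORDS and not t.isdigit()]


print(clean_tweet("Reactive Streams with #Scala @user 123 http://t.co/abc akka-http!"))
```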
Page 9
Step 1: Preprocessing
• Example keywords, all reduced to "scala":
- SCALA
- Scala
- scala
- #scala
• LingPipe library*
- remove tense and plurals
* http://alias-i.com/lingpipe/
Page 10
Step 1: Preprocessing
• Example Tweets
- new york time reactive programming tool scala scale techrepublic
- akka-http based reactive stream scala scaladay
Page 12
Step 2: Graph
• Word Co-Occurrence Graph
- Word = Node (Unigrams)
- Tweet = Link between Nodes
• Example
- tweet: akka-http based reactive stream scala scaladay
- nodes: akka-http, based, reactive, stream, scala, scaladay
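A small sketch of how one cleaned tweet could be turned into nodes and links. It assumes that every pair of distinct words in the same tweet is linked, which is one reading of "Tweet = Link between Nodes"; the function name tweet_links is hypothetical.

```python
from itertools import combinations


def tweet_links(tokens):
    """Return the nodes of one tweet and a link for every pair of nodes."""
    nodes = sorted(set(tokens))            # each distinct word is one node
    links = list(combinations(nodes, 2))   # assumption: all pairs are linked
    return nodes, links


nodes, links = tweet_links("akka-http based reactive stream scala scaladay".split())
print(nodes)
print(len(links))  # 6 nodes give 15 pairwise links
```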
Page 19
Step 2: Graph
• Co-Occurrence Graph
- connect nodes (words) within and between tweets
- add strength (weight) and cost (distance)
• Words that co-occur more frequently
- increase the strength
- decrease the cost
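The weighting can be sketched like this. The strength of a link is taken to be the number of tweets in which both words appear, and the cost is assumed to be 1 / strength. That inverse relation is an assumption: it reproduces the cost values 0.5 and 1 shown in the clustering figures, but the slides do not state the formula.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(tweets):
    """Return (strength, cost) per word pair over a list of token lists."""
    strength = Counter()
    for tokens in tweets:
        # sorted() gives every pair a canonical (alphabetical) order
        for pair in combinations(sorted(set(tokens)), 2):
            strength[pair] += 1
    # Assumption: cost (distance) is the inverse of the strength.
    cost = {pair: 1.0 / s for pair, s in strength.items()}
    return strength, cost


tweets = [
    ["reactive", "scala", "based"],
    ["reactive", "scala", "stream"],
]
strength, cost = cooccurrence_weights(tweets)
print(strength[("reactive", "scala")], cost[("reactive", "scala")])
print(strength[("based", "reactive")], cost[("based", "reactive")])
```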
Page 20
Step 2: Graph
• Summary
[Figure: word graphs are added (+) to form (=) a combined graph; node labels include reactive, scala, based, stream, programming, uses, …]
Page 21
Step 2: Clustering
• Here: "complete link (max) clustering" algorithm*
- hierarchical clustering algorithm that forms clusters by merging subgroups
• Group words from Tweets
- words that frequently appear together on a topic
- cluster = topic
* http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html
Page 22
Step 2: Clustering
• Here: "complete link (max) clustering" algorithm
• each node starts as an individual cluster
• close clusters are successively merged together
- close = measured by the highest cost (distance) between any two words of the two clusters
• Clusters = Nodes = Words in tweet
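A compact sketch of complete-link clustering on a handful of words. It is a plain textbook version, not the implementation from the talk: the cost table is made up for illustration, word pairs without a link are assumed to be infinitely far apart, and the function name complete_link is hypothetical.

```python
from itertools import combinations


def complete_link(words, cost, missing=float("inf")):
    """Return the merge history [(distance, cluster_a, cluster_b), ...]."""
    def dist(a, b):
        # complete link: distance of two clusters = largest pairwise cost
        return max(cost.get(frozenset((x, y)), missing) for x in a for y in b)

    clusters = [frozenset([w]) for w in words]  # every word starts alone
    merges = []
    while len(clusters) > 1:
        # merge the two clusters that are closest under the complete-link distance
        a, b = min(combinations(clusters, 2), key=lambda p: dist(*p))
        merges.append((dist(a, b), a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical costs between four words.
cost = {
    frozenset(("reactive", "scala")): 0.5,
    frozenset(("reactive", "based")): 1.0,
    frozenset(("scala", "based")): 1.0,
    frozenset(("reactive", "stream")): 1.0,
    frozenset(("scala", "stream")): 1.0,
    frozenset(("based", "stream")): 2.0,
}
merges = complete_link(["reactive", "scala", "based", "stream"], cost)
for distance, a, b in merges:
    print(distance, sorted(a), "+", sorted(b))
```

The merge history is exactly the information a dendrogram draws: each entry is one join, and its distance is the height of that join.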
Page 23
Step 2: Clustering
[Figure: Graph Representation next to Cluster Representation for the words reactive, scala, based, stream, …; link labels: cost = distance = 0.5 and cost = distance = 1]
Page 24
Step 2: Clustering
[Figure, built up over several slides: clustering example with merge levels distance = 0.5, distance = 1, distance = 1 and distance = 2]
Page 30
Step 2: Clustering
• Final step: Dendrogram
- tree diagram
- represents the arrangement of hierarchical clusters
• why?
- easy to apply threshold metrics
Page 31
Step 2: Clustering
• Final step: Dendrogram
- closer to the root = lower similarity
[Figure: dendrogram with root; "reactive" and "scala" form the first cluster]
Page 33
Step 2: Clustering
• Final step: Dendrogram
- closer to the root = lower similarity
[Figure: dendrogram with root; leaves: reactive, scala, new, york, programming, …, akka-http, based, stream, scaladay; thresholds are marked on the tree]
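Applying a threshold to a dendrogram can be sketched as replaying only the merges whose distance does not exceed the threshold. The merge list below is a hypothetical example in the form (distance, cluster_a, cluster_b), and cut_dendrogram is an illustrative name, not a function from the talk.

```python
def cut_dendrogram(words, merges, threshold):
    """Return the clusters that remain when merges above `threshold` are ignored."""
    cluster_of = {w: frozenset([w]) for w in words}
    for distance, a, b in sorted(merges, key=lambda m: m[0]):
        if distance > threshold:
            break  # everything closer to the root stays unmerged
        merged = a | b
        for w in merged:
            cluster_of[w] = merged
    return set(cluster_of.values())


# Hypothetical merge history of a small dendrogram.
merges = [
    (0.5, frozenset({"reactive"}), frozenset({"scala"})),
    (1.0, frozenset({"based"}), frozenset({"reactive", "scala"})),
    (2.0, frozenset({"stream"}), frozenset({"based", "reactive", "scala"})),
]
words = ["reactive", "scala", "based", "stream"]
clusters = cut_dendrogram(words, merges, threshold=1.0)
print([sorted(c) for c in clusters])
```

A lower threshold keeps more, smaller clusters; a threshold at or above the root distance returns a single cluster.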
Page 35
Step 3: Extract topical keywords
[Pipeline: Preprocessing / Cleaning -> Construct Graph -> Extract Topical Keywords]
Page 36
Step 3: Extract topical keywords
• keywords
- express a topic
- frequently used
- summarize the tweets' content
• Questions
- "What are the relevant keywords?"
- "In what clusters do they appear?"
Page 37
Step 3: Extract topical keywords
• How?
- "topical tweets" vs. "general tweets"
• frequent in topical tweets
- search keywords "reactive scala"
• not frequent in general tweets
- general Twitter stream (all tweets)
Page 39
Step 3: Extract topical keywords
• Strength of a word
- is a word relevant for that topical cluster?
[Figure: 2x2 matrix of word frequency in Topical Tweets (low / high) against word frequency in General Tweets (low / high); ✔ marks the cell that is relevant for the topic / cluster]
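One way to turn the matrix into a number is to divide a word's share of topical tweets by its share of general tweets. The ratio and the add-one smoothing below are assumptions made for illustration; the slides do not give the actual strength formula, and topical_strength is a hypothetical name.

```python
from collections import Counter


def topical_strength(topical_tweets, general_tweets):
    """Score each word: high in topical tweets and low in general tweets wins."""
    # count in how many tweets a word appears (not how often per tweet)
    topical = Counter(w for t in topical_tweets for w in set(t))
    general = Counter(w for t in general_tweets for w in set(t))
    n_topical, n_general = len(topical_tweets), len(general_tweets)
    return {
        # assumption: add-one smoothing keeps the ratio finite for unseen words
        w: (count / n_topical) / ((general[w] + 1) / (n_general + 1))
        for w, count in topical.items()
    }


# Hypothetical cleaned tweets; topical_tweets must not be empty.
topical_tweets = [["reactive", "scala"], ["scala", "new"]]
general_tweets = [["new", "day"], ["new", "york"], ["lunch"]]
scores = topical_strength(topical_tweets, general_tweets)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy data "new" scores lowest because it is common in the general stream, which is the low-relevance case in the matrix.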
Page 40
Step 3: Extract topical keywords
• Result
- topical strength for each keyword
- sort them by relevancy
- select top 20 keywords
• choose clusters that contain these words
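The selection step can be sketched as a sort followed by a filter. The strength values and clusters are made-up inputs, and select_clusters is a hypothetical helper; only the "top 20" default comes from the slide.

```python
def select_clusters(strength, clusters, top_n=20):
    """Pick the top_n keywords by strength and the clusters that contain them."""
    ranked = sorted(strength, key=strength.get, reverse=True)
    keywords = ranked[:top_n]
    chosen = [c for c in clusters if any(k in c for k in keywords)]
    return keywords, chosen


# Hypothetical inputs for illustration.
strength = {"scala": 4.0, "reactive": 2.0, "new": 0.7, "lunch": 0.1}
clusters = [{"reactive", "scala"}, {"new", "york"}, {"lunch", "time"}]
keywords, chosen = select_clusters(strength, clusters, top_n=2)
print(keywords)
print(chosen)
```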
Page 41
Final Step
• Combine clusters and keywords
• create visual summary
Page 42
Final Step
• Keyword1
• Keyword2
• Keyword3
• Keyword4
• …
high relevancy
low relevancy
Page 44
Final Step
• Treemap Visualisation
- color = cluster
- area of word = frequency of word
Page 45
Final Step
• Wordcloud Visualisation
- color = cluster
- size of word = frequency of word
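Both visualisations map word frequency to a visual size. A rough sketch, assuming a treemap area proportional to the frequency and a word cloud font size that grows linearly with it; the linear scaling, the font range and the name layout_sizes are illustrative assumptions, not details from the talk.

```python
def layout_sizes(frequencies, total_area=1.0, min_font=10, max_font=48):
    """Return (treemap areas, word cloud font sizes) per word."""
    total = sum(frequencies.values())
    top = max(frequencies.values())
    # treemap: each word gets a share of the area equal to its share of the counts
    areas = {w: total_area * f / total for w, f in frequencies.items()}
    # word cloud: assumption of a linear scale up to max_font for the top word
    fonts = {w: min_font + (max_font - min_font) * f / top
             for w, f in frequencies.items()}
    return areas, fonts


# Hypothetical word frequencies.
freqs = {"scala": 6, "reactive": 3, "stream": 1}
areas, fonts = layout_sizes(freqs)
print(areas)
print(fonts)
```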
Page 46
Final Notes
• 4. Million Topical Tweets
• 15 Days
• User Study
- Treemap vs. Word Cloud
Page 47
Thank You!
• Discussion
- Losing precision while cleaning tweets
- Losing meaning when removing stop words like "not" (negation)
- Unigram vs. Multigram?
- ?