Page 1
Dynamic Multi-Relational Chinese Restaurant Process for Analyzing Influences on Users in Social Media*
Indrajit Bhattacharya
Research Scientist, IBM Research, Bangalore
*Collaboration with Himabindu Lakkaraju & Chiranjib Bhattacharyya
Workshop on Social Computing, IIT Kharagpur, Oct 5-6, 2012
Page 2
Social Media Analysis: Motivation
Microblogs: Twitter, Facebook, MySpace
Understanding and analyzing topics & trends
Influences on users
Variety of stakeholders
Business
Government
Social scientists
Page 3
Social Media Analysis: Challenges
Network and Influences on Users
User personality: Personal preferences, global and geographic trends, social circle in the network [Yang WSDM 11]
Dynamic nature
Topics & user personalities evolve over time
Volume of data
Existing approaches fall short
Page 4
Social Media Analysis: State of the Art
Content Analysis
Ramage ICWSM 2010, Hong SOMA 2010
Variants of LDA
Inferring User Interests
Ahmed KDD 2011, Wen KDD 2010
Individual features such as user activity or network
Patterns in Temporal Evolution
Yang et al. WSDM 2011
Page 5
Bayesian Non-parametric Models
Choosing the number of components in a mixture model
Particularly severe problem for large data volumes, such as social media data
Bayesian solution
Infinite-dimensional prior
Allows the number of mixture components to grow with data size
Existing models cannot capture the richness of social media data
Existing algorithms are often not scalable
Page 6
Talk Outline
Background: Chinese Restaurant Processes
CRP with multiple relationships: (RelCRP, MRelCRP)
Dynamic MRelCRP
Multi-threaded Online Inference Algorithm
Experimental Results
Page 8
Dirichlet Process (Informal)
Page 9
Dirichlet Process: Properties
Page 10
Chinese Restaurant Process (CRP)
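A minimal sketch of the standard CRP seating rule, added for illustration: the customer arriving after i others joins existing table k with probability n_k / (i + alpha) and opens a new table with probability alpha / (i + alpha). The function name `sample_crp` and the fixed seed are assumptions of this sketch, not part of the talk.

```python
import random

def sample_crp(num_customers, alpha, seed=0):
    """Seat customers one at a time; return (table of each customer, table sizes)."""
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers at table k
    assignments = []
    for _ in range(num_customers):
        # Existing table k has weight n_k; a new table has weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[k] += 1        # join an existing table
        assignments.append(k)
    return assignments, table_sizes

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        _, sizes = sample_crp(n, alpha=1.0)
        print(n, "customers ->", len(sizes), "tables")
```

The number of occupied tables is not fixed in advance and tends to grow slowly as more customers arrive, which is the property that makes the CRP usable as a prior for an infinite mixture model.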
Page 11
Talk Outline
Background: Chinese Restaurant Processes
CRP with multiple relationships: (RelCRP, MRelCRP)
Dynamic MRelCRP
Parallelized Online Inference Algorithm
Experimental Results
Page 12
Relational Chinese Restaurant Process (RelCRP)
Page 13
Relational Chinese Restaurant Process (RelCRP)
Page 14
Influence of World-wide Factors
Page 15
Influence of World-wide Factors
Page 16
Influence of Personal Preferences
Page 17
Influence of Personal Preferences
Page 18
Influence of Friend Network
Page 19
Influence of Friend Network
Page 20
Influence of Geography
[Figure: geographic regions India, China, UK]
Page 21
Influence of Geography
Page 22
Aggregating Influences
RelCRP is exchangeable like the CRP
Useful as a prior for infinite mixture model
RelCRP captures influence of one relation on posts
Influences act simultaneously on any user
Aggregated influence pattern is user specific
Different users affected differently by same combination of world-wide and geographic factors
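One way to picture user-specific aggregation, as an illustrative sketch only: treat each relation (world-wide, personal, friends, geography) as giving its own CRP-style predictive distribution over topics, and mix them with per-user weights. The mixing form, the data layout, and the names `mixed_topic_probs`, `relation_counts` and `user_weights` are assumptions of this sketch; the slides do not give the model's actual definition.

```python
def mixed_topic_probs(relation_counts, user_weights, alpha):
    """
    relation_counts: {relation: {topic: count}}, topic usage among the posts
        related to the current post through that relation (assumed layout).
    user_weights: {relation: weight}, non-negative, summing to 1 for this user.
    Returns {topic: probability}; the key None stands for "start a new topic".
    """
    probs = {None: 0.0}
    for relation, weight in user_weights.items():
        counts = relation_counts.get(relation, {})
        total = sum(counts.values()) + alpha
        for topic, n in counts.items():
            # CRP-style share of an existing topic under this relation.
            probs[topic] = probs.get(topic, 0.0) + weight * n / total
        # CRP-style share of a brand-new topic under this relation.
        probs[None] += weight * alpha / total
    return probs

if __name__ == "__main__":
    counts = {"world": {"music": 3, "sports": 1}, "friends": {"sports": 4}}
    follower = {"world": 0.2, "friends": 0.8}      # mostly follows friends
    trend_watcher = {"world": 0.9, "friends": 0.1}  # mostly follows global trends
    print(mixed_topic_probs(counts, follower, alpha=1.0))
    print(mixed_topic_probs(counts, trend_watcher, alpha=1.0))
```

With identical counts, the two hypothetical users get different topic probabilities purely because their weights differ, which is the sense in which the same combination of factors affects different users differently.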
Page 23
Multi Relational CRP
Page 24
Talk Outline
Background: Chinese Restaurant Processes
CRP with multiple relationships: (RelCRP, MRelCRP)
Dynamic MRelCRP
Multi-threaded Online Inference Algorithm
Experimental Results
Page 25
Evolving Patterns in Social Media
Number of Topics
Topics die and new ones are born
User Personalities
Susceptibility to influence by world-wide, geographic and friends’ preferences
Existing Topic Distributions
Words go out of fashion, new ones enter vocabulary
Topic Characters:
Popularity of a topic changes world-wide, in users’ preferences, sub-networks and geographies
Page 26
Dynamic MultiRelCRP
Page 27
User Personality Trends
Page 28
Evolving Topic Distributions
Page 29
Topic Character Trends
Page 30
Talk Outline
Background: Chinese Restaurant Processes
CRP with multiple relationships: (RelCRP, MRelCRP)
Dynamic MRelCRP
Multi-threaded Online Inference Algorithm
Experimental Results
Page 31
Inference and Estimation Tasks
Page 32
Online Algorithm
Traditional iterative framework does not scale for social media data
Sequential Monte Carlo methods [Canini AIStats ’09] that rejuvenate some old labels are also infeasible
Online sampling [Banerjee SDM ’07] does not revisit old labels at all; uses an initial batch phase
Adapted here for the non-parametric setting
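A rough sketch of single-pass online sampling in a non-parametric setting: each post is labeled once, in arrival order, and the label is never revisited. It assumes a plain CRP prior and a symmetric Dirichlet-multinomial word model; the names `online_assign` and `log_predictive` and the parameters `beta` and `vocab_size` are illustrative assumptions. It omits the initial batch phase and the relational and dynamic parts of the model.

```python
import math
import random
from collections import Counter

def log_predictive(words, topic_words, topic_total, beta, vocab_size):
    """Log probability of a post's words under one topic (Dirichlet-multinomial)."""
    seen = Counter()
    logp = 0.0
    for j, w in enumerate(words):
        num = topic_words.get(w, 0) + seen[w] + beta
        den = topic_total + j + beta * vocab_size
        logp += math.log(num / den)
        seen[w] += 1
    return logp

def online_assign(posts, alpha=1.0, beta=0.1, vocab_size=1000, seed=0):
    """Label each post once; return (labels, per-topic statistics)."""
    rng = random.Random(seed)
    topics = []   # each topic: {"posts": int, "words": Counter, "total": int}
    labels = []
    for words in posts:
        # Prior weight (post count or alpha) times likelihood, in log space.
        logw = [math.log(t["posts"])
                + log_predictive(words, t["words"], t["total"], beta, vocab_size)
                for t in topics]
        logw.append(math.log(alpha)
                    + log_predictive(words, {}, 0, beta, vocab_size))
        top = max(logw)
        weights = [math.exp(x - top) for x in logw]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(topics):
            topics.append({"posts": 0, "words": Counter(), "total": 0})
        topics[k]["posts"] += 1
        topics[k]["words"].update(words)
        topics[k]["total"] += len(words)
        labels.append(k)   # fixed from here on; old labels are never resampled
    return labels, topics
```

Because each post costs one pass over the current topics, total work grows with the stream length times the number of topics rather than with repeated sweeps over all data.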
Page 33
Multi-threaded Implementation
Sequential online implementation does not scale
Iterative Gibbs sampling algorithms parallelized for hierarchical Bayesian models [Asuncion NIPS 08, Smola VLDB 10]
Our algorithm is parallel, online and non-parametric
Explicit consolidation by master thread at the end of each iteration
Only new topics consolidated
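A toy sketch of the worker/master pattern described above: workers label shards against a read-only snapshot of the topics, and the master consolidates at the end of the iteration. Counts for existing topics simply add up; only topics created during the iteration need consolidation. The overlap-based labeling step and the "same most frequent word" merge rule are hypothetical stand-ins for the real sampling and consolidation logic, which the slides do not specify.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def process_shard(shard, topics):
    """Worker: label posts against a read-only snapshot of the global topics."""
    deltas = {}       # existing topic id -> Counter of words added by this worker
    new_topics = []   # word Counters for topics this worker created locally
    for words in shard:
        if not words:
            continue
        # Stand-in for the sampling step: existing topic sharing the most words.
        best, best_overlap = None, 0
        for tid, counts in topics.items():
            overlap = sum(1 for w in words if w in counts)
            if overlap > best_overlap:
                best, best_overlap = tid, overlap
        if best is not None:
            deltas.setdefault(best, Counter()).update(words)
            continue
        for local in new_topics:
            if any(w in local for w in words):
                local.update(words)
                break
        else:
            new_topics.append(Counter(words))
    return deltas, new_topics

def run_iteration(posts, topics, num_threads=4):
    """Master: fan out shards, then consolidate once all workers are done."""
    shards = [posts[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(lambda shard: process_shard(shard, topics), shards))
    next_id = max(topics, default=-1) + 1
    created = {}   # most frequent word -> id of a topic created this iteration
    for deltas, new_topics in results:
        for tid, counts in deltas.items():
            topics[tid].update(counts)             # existing topics: counts add up
        for candidate in new_topics:
            key = candidate.most_common(1)[0][0]   # hypothetical merge rule
            if key not in created:
                created[key] = next_id
                topics[next_id] = Counter()
                next_id += 1
            topics[created[key]].update(candidate)
    return topics
```

In CPython the global interpreter lock limits the speed-up threads can give for CPU-bound work, so this only illustrates the control flow; the talk reports a Java implementation.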
Page 34
Talk Outline
Background: Chinese Restaurant Processes
CRP with multiple relationships: (RelCRP, MRelCRP)
Dynamic MRelCRP
Multi-threaded Online Inference Algorithm
Experimental Results
Page 35
Datasets and Baselines
Twitter: 360 million tweets (Jun-Dec 2009)
Facebook: 300,000 posts (public profiles, 3 months)
Latent Dirichlet Allocation (LDA)
[Hong SOMA 2010]
Labeled LDA (L-LDA)
Hashtags as topics [Ramage ICWSM 2010]
Timeline
Dynamic non-parametric topic model [Ahmed UAI 2010]
Page 36
1 Model Goodness
Perplexity: Ability to generalize to unseen data
Both network and dynamics are important for modeling social media data
Perplexity
Model      Twitter   Facebook
DMRelCRP   1188.29   1562.34
Timeline   1582.86   1802.9
L-LDA      1982.76   -
LDA        2932.06   3602
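For reference, a small sketch of how perplexity is conventionally computed: the exponential of the negative mean log-probability that the model assigns to held-out tokens, so lower values mean better generalization. The function name and the input format (one natural-log probability per held-out token) are assumptions of this sketch; how the talk's models produce those probabilities is not shown.

```python
import math

def perplexity(token_log_probs):
    """token_log_probs: natural-log probability the model gives each held-out token."""
    if not token_log_probs:
        raise ValueError("need at least one held-out token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

if __name__ == "__main__":
    # A model that spreads probability uniformly over a 1000-word vocabulary.
    uniform = [math.log(1.0 / 1000)] * 50
    print(perplexity(uniform))
```

A uniform guess over V words has perplexity V, which gives a scale for reading the table: a perplexity near 1188 is roughly as uncertain as a uniform choice among about 1188 words per token.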
Page 37
2 Quality of Discovered Topics
Label assigned to each post indicating category
Distribution over words indicating semantics
A. Clustering posts using topic labels
B. Prediction using topic labels
Predicting post authorship & user commenting activity
C. Major event detection
Page 38
2A Post Clustering using Topics
Use hashtags as gold standard (for Twitter)
16K posts with hashtags such as #NIPS2009, #ICML2009, #bollywood
DMRelCRP close to L-LDA without using hashtags
DMRelCRP produces ‘finer-grained’ clusters
Clustering accuracy (Twitter)
Model      nMI    R-Index   F1
DMRelCRP   0.93   0.88      0.86
Timeline   0.81   0.72      0.73
L-LDA      1      1         1
LDA        0.55   0.52      0.48
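A sketch of how the three scores can be computed from predicted topic labels and gold hashtags. It assumes "R-Index" is the Rand index, that F1 is computed over pairs of posts, and that nMI is normalized by the geometric mean of the two entropies; the slides do not state which variants were used. The pairwise loop is quadratic in the number of posts, so it is for illustration rather than for 16K posts.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(counts, n):
    return -sum(c / n * math.log(c / n) for c in counts.values())

def nmi(pred, gold):
    """Mutual information normalized by sqrt(H(pred) * H(gold))."""
    n = len(pred)
    p_counts, g_counts = Counter(pred), Counter(gold)
    joint = Counter(zip(pred, gold))
    mi = sum(c / n * math.log(c * n / (p_counts[p] * g_counts[g]))
             for (p, g), c in joint.items())
    h_p, h_g = entropy(p_counts, n), entropy(g_counts, n)
    if h_p == 0 or h_g == 0:
        return 1.0 if h_p == h_g else 0.0
    return mi / math.sqrt(h_p * h_g)

def pair_scores(pred, gold):
    """Return (Rand index, pairwise F1) over all pairs of posts."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    rand = (tp + tn) / total if total else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return rand, f1

if __name__ == "__main__":
    gold = ["#NIPS2009", "#NIPS2009", "#bollywood", "#bollywood"]
    print(nmi([0, 0, 1, 1], gold), pair_scores([0, 0, 1, 1], gold))
    print(nmi([0, 1, 2, 2], gold), pair_scores([0, 1, 2, 2], gold))
```

The second call splits one hashtag into two clusters; the scores drop even though no posts from different hashtags are mixed, which is how finer-grained clusters are penalized against a hashtag gold standard. L-LDA scores 1 because it uses the hashtags themselves as topics.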
Page 39
2B Prediction Using Topics
Authorship: Given post and user, predict if author
Commenting activity: Given post and (non-author) user, predict if user comments on that post
DMRelCRP topics lead to more accurate prediction
           Authorship           Commenting
Model      Twitter   Facebook   Twitter   Facebook
DMRelCRP   0.793     0.734      0.683     0.648
Timeline   0.718     0.669      0.582     0.579
L-LDA      0.521     0.432      0.429     0.482
LDA        0.647     -          0.542     -
Page 40
2C Major Event Detection
Page 41
2C Major Event Detection
Page 42
3 Analysis of Influences
Page 43
3A Global Personality Trends
Page 44
3A Global Personality Trends
[Figure annotations: Michael Jackson’s death, FIFA WC, Google Wave]
Page 45
3A Global Personality Trends
Page 46
3B Geo-specific Personality Trends
Personality trends very similar in UK and US
Geographic influences high at different epochs
Page 47
3B Geo-specific Personality Trends
India: World-wide and geographic influences weaker
China: World-wide influence weak, geographic influence strong; stable pattern
Page 48
3C Topic Character Trends
Page 49
3C Topic Character Trends
Page 50
3C Topic Character Trends
Page 51
Scaling with Data Size
Java-based multi-threaded framework; 7 threads
8-core machine, 32 GB RAM
Scales largely because of multi-threading
Page 52
Summary
First attempt at studying user influences in social media data
New non-parametric model that captures multiple relationships and temporal evolution
Multi-threaded online Gibbs sampling algorithm
Extensive evaluation on large real datasets
Topics lead to better clustering and prediction
Insights on user influence patterns