Analyzing the Dynamics of Communication in Online Social Networks Munmun De Choudhury School of Computing, Informatics & Decision Systems Engineering Arizona State University
Committee Members
• Dr Hari Sundaram, Chair
• Dr K. Selçuk Candan
• Dr Huan Liu
• Dr Duncan Watts, Yahoo! Research
• Dr Doree Seligmann, Avaya Labs
April 15, 2010, Ph.D. Proposal Defense
Goal: Understanding the dynamics and impact of our online social interactions
Information
Media
Network
Background and Motivation | Contributions | Completed Work | Proposed Work | Conclusions
Modern Social Interactional Modes
Slashdot
Engadget
Flickr
LiveJournal
Digg
YouTube
Blogger
MetaFilter
Reddit
MySpace
Orkut
Some Social Media Statistics
YouTube: 139M users; US$200M [Forbes].
Flickr: 3.6B images; 50M users.
Facebook: 350M+ active users; 1B pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week.
Twitter: 75M users; 600 tweets per second.
MySpace: 110M monthly active users; 14B comments on the site.
Digg: 3M unique users; $40M.
Engadget: 1,887,887 monthly visitors.
Huffington Post: 8.9M visitors.
LiveJournal: 19,128,882 accounts.
What are the impacts of such large-scale online social interaction?
Viral Marketing, Advertising Campaigns
What has been the buzz about the release of the iPad?
Collaboration, “Wisdom of the Crowds”
What have been the political sentiments regarding the 2008 US Presidential election candidates?
Crisis management w.r.t. real-time events
Which is the best news source to read about the recent California / Mexico / Arizona earthquake?
What are the challenges?
• Scale
• Growth
• Accessibility
• Complexity
• Relevance
• Utility
• Tools
Facebook features ~400 million users
300-600 tweets per second!
Twitter developer API’s rate limit: 20,000 requests only!
About a quarter of all tweets are spam (http://mashable.com/)
Time complexity of breadth first search on a social graph: O(|E|+|V|), |V| and |E| in the order of millions…
Are the extracted groups in a community meaningful?
How do we compare across interactions on Facebook and those on LinkedIn?
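To make the O(|E|+|V|) cost of breadth-first search noted above concrete, here is a minimal Python sketch on a toy graph. The graph, the user names, and the `bfs_hops` helper are hypothetical illustrations and not part of the presented work.

```python
from collections import deque


def bfs_hops(graph, source):
    """Return the hop distance from `source` to every reachable user.

    Each vertex is enqueued at most once and each adjacency list is
    scanned at most once, which gives the O(|V| + |E|) bound.
    """
    hops = {source: 0}
    queue = deque([source])
    while queue:
        user = queue.popleft()
        for neighbor in graph.get(user, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[user] + 1
                queue.append(neighbor)
    return hops


# Hypothetical toy social graph stored as adjacency lists.
toy_graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
    "erin": ["alice"],
}
print(bfs_hops(toy_graph, "alice"))
```

Even at linear cost, a graph with millions of vertices and edges makes a single traversal expensive, which is the scale challenge the slide points to.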
Contributions
Communication is the process by which participating individuals create and share information with one another in order to reach a mutual understanding [Berger and Calabrese, 1975].
• Semiotic tradition: addresses interpretation and generation of meaning as a consequence of the communication process [Craig, 1999]. Focus: information or concept.
• Cybernetic tradition: interprets the model of communication as a source that produces a message, which is passed along a “channel” to a receiver that interprets the message [von Foerster, 1995]. Focus: media or channel.
• Socio-psychological / Socio-cultural tradition: considers communication to be the interaction among individuals that results in the production and reproduction of social order [Littlejohn, 2001]. Focus: social engagement or network.
Research Questions
1) How do we analyze the diffusion of information or concepts among the communicating individuals?
2) How does the social engagement give rise to emergent collective behavior and evolution of groups and networks?
3) How do we characterize the evolving properties of the media artifacts via which communication takes place?
Thesis Overview
Information Network Media
Overview of Our Work
Research area | Characterization | Models | Observational Studies
Information Diffusion | Topical Communication Flow | Diffusion of User Actions | Impact of Homophily on Diffusion
Evolution of Social Communication Networks | Community Dynamics and External Phenomena | Prototypical Communication Groups | Multiplex Ties: Evolution and Empirical Thresholds on Ego-network Size
Rich Media Communication Patterns | Interestingness of Conversations | Connecting Media to Community | Models of User Behavior on Rich Media
Summary of Publications
Research area | Characterization | Models | Observational Studies
Information Diffusion | 1. WI 2007 (17%); 2. HT 2008 (24%); 3. under review at ACM Tweb | 1. SocialCom 2009 (9%); 2. journal submission for ACM Tweb | 1. ICWSM 2010 (18%); 2. under review at KDD 2010; 3. journal submission TBD
Evolution of Social Communication Networks | 1. HT 2008 (24%); 2. CIKM 2008 (16%) | 1. HT 2009; 2. ACM TOIS | 1. Proposed work (target: WWW 2011, ICWSM 2011)
Rich Media Communication Patterns | 1. WWW 2009 (11%); 2. journal submission for ACM TOMCCAP | 1. ICME 2009 (22%) | 1. Proposed work (target: KDD 2011)
Completed Work
Rich Media Communication Patterns
Related publications at: WWW 2009, ICME 2009
Why do people repeatedly come back to the same YouTube video?
They are returning to do more than watch the same video
We think it is the conversations around the video they find interesting
What is a Conversation?
• A conversation associated with an online social media object is a temporally ordered sequence of comments posted by individuals.
Comment
Participant
Timestamp of comment
Media Object
Media Attributes
Conversation
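The definition above (a temporally ordered sequence of comments attached to a media object) can be sketched as a small data structure. The class names, field names, and sample values below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Comment:
    participant: str      # who posted the comment
    timestamp: datetime   # when it was posted
    text: str


@dataclass
class Conversation:
    media_object_id: str                      # e.g. a video identifier
    comments: List[Comment] = field(default_factory=list)

    def add(self, comment: Comment) -> None:
        """Insert a comment and keep the sequence temporally ordered."""
        self.comments.append(comment)
        self.comments.sort(key=lambda c: c.timestamp)

    def participants(self) -> set:
        """Return the set of distinct participants in the conversation."""
        return {c.participant for c in self.comments}


# Hypothetical sample data.
conv = Conversation("video-123")
conv.add(Comment("user_b", datetime(2008, 7, 1, 12, 5), "I disagree."))
conv.add(Comment("user_a", datetime(2008, 7, 1, 12, 0), "Great speech."))
print([c.participant for c in conv.comments])
```

Media attributes (title, tags, category) could be added as further fields on `Conversation`; they are left out here to keep the sketch short.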
Themes and participants make conversations interesting
An example of an interesting conversation …
Our Contributions
• Goal:
– What makes a conversation interesting enough to prompt a user to participate in the discussion on a posted video?
• Approach:
– Detect conversational themes.
– Determine interestingness of participants and interestingness of conversations based on a random walk model.
– Measure the consequences of a conversation.
• Excellent results on a dataset from YouTube.
Related Work
Analysis of Media Properties
• Visualization of evolution of tags [Dubinko et al. 2006]
• YouTube traffic characterization [Gill et al. 2007]
• Analysis of user-generated media [Cha et al. 2007]
• Tagging conventions and strategies (YouTube) [Geisler et al. 2007]
• Social dynamics in media sharing [Halvey et al. 2007]
• Representative views of landmarks [Kennedy et al. 2008]
Analysis of Communication in Social Media
• Predictivity of online chatter [Gruhl et al. 2005]
• Analysis of weblog comments [Mishne 2006]
• Correlation of communication & stocks [Choudhury et al. 2007]
• Predicting sales using blog communication [Liu et al. 2007]
• Discussion threads in Slashdot [Kaltenbrunner et al. 2007, Gomez 2008]
• Video interactions in YouTube [Benevenuto et al. 2008]
• Tendency of topic discussions [Zhou et al. 2008]
Theme Extraction in Social Media
• Spatio-temporal theme mining [Mei et al. 2005, 2006]
• Mining multi-faceted overviews of topics [Ling et al. 2008]
• Video topic discovery [Liu et al. 2008]
Illustrations: [Cha et al. 2007], [Mei et al. 2007]
Conversational Themes
• Conversational themes are sets of salient topics associated with conversations at different points in time.
[Figure: conversation i as bags of words (chunks λi,t) over time slices t1, t2, …, tQ, shown with the theme models θj]
Theme Model
• Temporal Regularization
– A word w in the chunk can be attributed either to the textual context λi,t or to the time slice t
– Smoothness of theme models over time

$$L(C) = \sum_{\lambda_{i,t} \in C} \; \sum_{w \in \lambda_{i,t}} n(w, \lambda_{i,t}) \cdot \log \sum_{j=1}^{K} p(w, \theta_j \mid \lambda_{i,t}, t) \cdot \exp\big(-d_T(j)\big)$$

The conditioning on λi,t is the chunk context, the conditioning on t is the time context, and the exponential term provides temporal smoothing.
[Figure: theme distribution and theme strength plotted over time]
Theme Model (Contd.)
• Co-participation based Regularization
– If several participants comment on a pair of chunks, their theme distributions are likely to be close to each other.

$$R(C) = \sum_{c_i, c_m \in C} \omega_{i,m} \cdot \Big(1 - \sum_{j=1}^{K} \big(f(\theta_j \mid c_i) - f(\theta_j \mid c_m)\big)^2\Big)^2$$

$$O(C) = (1 - \varsigma) \cdot L(C) + \varsigma \cdot R(C)$$

[Figure: conversations ci and cm with theme functions f(θj|ci) and f(θj|cm), linked by the weight ωi,m]
Interestingness of Conversations
Interestingness of conversations at time slice q combines two impacts from the previous slice, weighted by ψ and 1−ψ:
• Impact due to participants (weight ψ): interestingness of participants × relationship between participants and conversations
• Impact due to themes (weight 1−ψ): conversational theme distribution × theme strength × interestingness of conversations

$$I_C^{(q)} = \psi \cdot P_C^{(q-1)\,t} \cdot I_P^{(q-1)} + (1-\psi) \cdot \mathrm{diag}\big(C_T^{(q-1)} \cdot T_S^{(q-1)}\big) \cdot I_C^{(q-1)}$$
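As a rough illustration of this update, the sketch below reads the slide as: new conversation interestingness = ψ × (participant-conversation relationship, transposed, times participant interestingness) + (1−ψ) × (theme distribution times theme strength, applied per conversation to the previous interestingness). That reading, the function name, and all numeric values are assumptions made for the example.

```python
def update_conversation_interestingness(psi, P_C, I_P, C_T, T_S, I_C_prev):
    """One illustrative update step for conversation interestingness.

    P_C      : participants x conversations relationship matrix
    I_P      : interestingness of each participant (previous slice)
    C_T      : conversations x themes distribution
    T_S      : strength of each theme
    I_C_prev : interestingness of each conversation (previous slice)
    """
    n_part, n_conv, n_theme = len(P_C), len(I_C_prev), len(T_S)
    updated = []
    for c in range(n_conv):
        # Impact due to participants: column c of P_C dotted with I_P.
        participant_impact = sum(P_C[p][c] * I_P[p] for p in range(n_part))
        # Impact due to themes: row c of C_T dotted with T_S.
        theme_weight = sum(C_T[c][k] * T_S[k] for k in range(n_theme))
        updated.append(psi * participant_impact
                       + (1 - psi) * theme_weight * I_C_prev[c])
    return updated


# Hypothetical toy values: 2 participants, 2 conversations, 2 themes.
result = update_conversation_interestingness(
    psi=0.5,
    P_C=[[1, 0], [1, 1]],
    I_P=[0.2, 0.8],
    C_T=[[1.0, 0.0], [0.5, 0.5]],
    T_S=[0.6, 0.4],
    I_C_prev=[0.5, 0.5],
)
print(result)
```

Setting ψ closer to 1 makes the score depend mostly on who participates; setting it closer to 0 makes it depend mostly on the themes discussed.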
Interestingness of Participants
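Interestingness of a participant combines past communication and preference, balanced by the weights β and 1−β:

$$I_P^{(q)} = (1-\beta) \cdot A^{(q-1)} + \beta \cdot P_T^{(q-1)} \cdot T_S^{(q-1)},$$

$$\text{where } A^{(q-1)} = \alpha_1 \cdot P_L^{(q-1)} \cdot I_P^{(q-1)} + \alpha_2 \cdot P_F^{(q-1)} \cdot I_P^{(q-1)} + \alpha_3 \cdot P_C^{(q-1)} \cdot I_C^{(q-1)}.$$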
Interestingness of participants and conversations mutually reinforce each other
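The idea of mutual reinforcement can be pictured with a deliberately simplified, HITS-style iteration. This is not the model from the talk (it ignores themes, time slices, and the weighting parameters); the membership matrix and the function name are hypothetical.

```python
def mutual_reinforcement(membership, iterations=50):
    """Alternate between scoring conversations and participants.

    membership[p][c] is 1 if participant p commented in conversation c.
    A conversation scores highly when its participants do, and a
    participant scores highly when their conversations do.
    """
    n_part, n_conv = len(membership), len(membership[0])
    part = [1.0] * n_part
    conv = [1.0] * n_conv
    for _ in range(iterations):
        conv = [sum(membership[p][c] * part[p] for p in range(n_part))
                for c in range(n_conv)]
        total = sum(conv)
        conv = [x / total for x in conv]      # normalize to sum to 1
        part = [sum(membership[p][c] * conv[c] for c in range(n_conv))
                for p in range(n_part)]
        total = sum(part)
        part = [x / total for x in part]      # normalize to sum to 1
    return part, conv


# Hypothetical data: participant 0 is in conversations 0 and 1,
# participant 1 only in conversation 0, participant 2 only in 2.
membership = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
part_scores, conv_scores = mutual_reinforcement(membership)
print(part_scores, conv_scores)
```

In this toy case the most connected participant and the most shared conversation end up with the largest scores, which is the qualitative behavior the slide describes.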
Joint Optimization of Interestingness
• A joint optimization framework, which maximizes the two interestingness measures for optimal X = (α1, α2, α3, ψ) and also incorporates temporal smoothness:

$$g(X) = \rho \cdot \lVert I_P(X) \rVert^2 + (1-\rho) \cdot \lVert I_C(X) \rVert^2 + \exp(-d_P) + \exp(-d_C)$$

The four terms are, in order: interestingness of participants, interestingness of conversations, regularization of participants’ interestingness, and regularization of conversations’ interestingness.
Three consequence metrics of interestingness
• Participant activity
• Participant cohesiveness
• Thematic interestingness
[Figure: three panels, each showing one metric between t and t+δ]
YouTube Dataset
• ‘News & Politics’ category on YouTube: rich communication on highly dynamic events.
– 132,348 videos
– ~9M unique participants
– ~89M comments
– 15 weeks from June 20, 2008 to September 26, 2008
Analysis of Interestingness of Participants
Interestingness of participants is less affected by number of comments during significant external events
Analysis of Interestingness of Conversations
Mean interestingness of conversations increases during periods with several external events; however, certain highly interesting conversations occur in different weeks irrespective of events.
Evaluation using Consequences
• Interestingness is computed using five techniques:
– our method with temporal smoothing (I1),
– our method without temporal smoothing (I2), and
– the three baseline methods:
  • B1 (comment frequency),
  • B2 (novelty of participation),
  • B3 (co-participation based PageRank).
Summary
• Summary
– Why do people repeatedly come back to the same YouTube video?
– Developed a probabilistic framework that characterizes “interestingness” of conversations
– Our method can explain future consequences
• Future Work
– User subjectivity
– Personalized recommendations
Information Diffusion
Related publications at: WI 2007, HT 2008, SocialCom 2009, ICWSM 2010, KDD 2010 (under review), ACM Tweb (under review)
XKCD: Seismic Waves
XKCD: Seismic Waves #iPad
People from the south-west of US?
Apple gadget lovers?
What is homophily?
• The tendency of individuals to bond more with ones who are “similar” to them, compared to ones who are “dissimilar” [McPherson et al. 2001].
• Observed sociological consequence: similarity breeds connection.
Information roles
Location
Activity behavior
• “Tweets”: shared content of up to 140 characters.
– RT (the re-tweet feature), hashtags (e.g. #iranelection), bit.ly encoded URLs
• Follower / Following relationship.
• “Trending topics”, e.g. #musicmonday, #formulaone.
• Diffusion via (1) RT feature, (2) shared URL (e.g. bit.ly, tinyurl), (3) same hashtag
(1) RT based diffusion
(2) URL based diffusion
(3) hashtag based diffusion
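The three diffusion channels listed above can be detected from raw tweet text with simple pattern matching. The sketch below is only an illustration: the regular expressions, the function name, and the example tweet are assumptions, not the extraction rules used in the study.

```python
import re

# Simplified, hypothetical patterns for the three diffusion channels.
RT_PATTERN = re.compile(r"\bRT\s+@(\w+)")     # (1) re-tweet of a user
URL_PATTERN = re.compile(r"https?://\S+")     # (2) shared URL
HASHTAG_PATTERN = re.compile(r"#(\w+)")       # (3) hashtag


def diffusion_signals(tweet):
    """Return the RT sources, URLs, and hashtags found in a tweet."""
    return {
        "retweet_of": RT_PATTERN.findall(tweet),
        "urls": URL_PATTERN.findall(tweet),
        "hashtags": [tag.lower() for tag in HASHTAG_PATTERN.findall(tweet)],
    }


example = "RT @alice: results are in http://bit.ly/abc123 #iranelection"
print(diffusion_signals(example))
```

Two tweets would then be linked in a diffusion process when one re-tweets the other's author, or when both share the same URL or hashtag and their authors are connected in the social graph.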
Diffusion characteristics vary across different user attributes along which the diffusion process takes place
Our Contributions
• Goal:
– How does attribute homophily among users affect the diffusion of information on online social media?
• Approach:
– Representation of the diffusion process via a structure called “diffusion series”
– A DBN-based probabilistic model that can predict diffusion characteristics on an attribute graph
• Excellent results on a large dataset from Twitter, indicating that homophily affects diffusion, but that the effect depends on the topic and the metric of measurement.
Related Work
Studies on Homophily
• Inter-connectedness between homogeneous composition of groups and emergent homophily [McPherson et al. 2001]
Diffusion Dynamics
• Topic diffusion in the Blogosphere [Gruhl et al. 2004]
• Diffusion of “gestures” on Second Life [Bakshy et al. 2009]
• Contagion via Facebook “News Feed” [Sun et al. 2009]
Homophily & Social Contagion
• Homophily and contagion over romantic and sexual networks [Bearman et al. 2004]
• Compliance and conformity in networks [Cialdini et al. 2004]
Data Model
Diffusion Series
[Figure: a social graph, the corresponding attribute social graph, and a diffusion series]
Characterizing Diffusion
• User-based (volume, participation, dissemination)
• Topology-based (reach, spread, cascade instances, collection size)
• Time-based (rate)
Method
• Inputs at tN: the attribute social graph at tN and the social actions of users until tN
• Predict the users likely to perform the social action at tN+1
• Construct diffusion series at tN+1
• Predict diffusion characteristics at tN+1
Does a particular chosen attribute predict the diffusion characteristics well? That is, does homophily on the particular attribute impact the diffusion process?
Prediction Framework
• Dynamic Bayesian network based representation of user’s social actions (i.e. “tweeting” behavior in this work)
– User context (e.g. past history of activity, degree of activity of ego-network, popularity of topic), i.e. Fi,N; states, i.e. Si,N; and observed action, i.e. Oi,N
• Two latent states: “vulnerable or socially aware” (Si=1) and “indifferent or socially unaware” (Si=0)
• Goal: estimate the probability of observed social action at a future time tN+1, based on the latent states Si,N and the contextual attributes Fi,N
Estimation
$$P(O_{i,N+1} \mid O_{i,N}, F_{i,N}) = \sum_{S_{i,N+1}} P(O_{i,N+1} \mid S_{i,N+1}, O_{i,N}, F_{i,N}) \cdot P(S_{i,N+1} \mid O_{i,N}, F_{i,N})$$

$$= \sum_{S_{i,N+1}} P(O_{i,N+1} \mid S_{i,N+1}) \cdot P(S_{i,N+1} \mid S_{i,N}, F_{i,N}),$$

where Oi,N = action of user i at time slice tN, Fi,N = context of user i at time slice tN, and Si,N = state of user i at time slice tN.
• Estimate user context Fi,N.
• Estimate the probability of user state given context and past state: multinomial density of states over the contextual attributes with a Dirichlet prior.
• Estimate the probability of user action given the state: a Hidden Markov Model where the actions are the emissions; use Viterbi to predict the likely sequence.
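The Viterbi step mentioned above can be sketched for a plain two-state HMM with the states “indifferent” (S=0) and “vulnerable” (S=1) and the observed actions as emissions. This sketch omits the contextual attributes Fi,N that the full model conditions on, and every probability below is a made-up value used only for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence (log-space Viterbi)."""
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for reaching s at this step.
            prev, score = max(
                ((r, best[-1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            scores[s] = score + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical parameters (not estimated from any data).
states = ("indifferent", "vulnerable")
start_p = {"indifferent": 0.6, "vulnerable": 0.4}
trans_p = {
    "indifferent": {"indifferent": 0.7, "vulnerable": 0.3},
    "vulnerable": {"indifferent": 0.2, "vulnerable": 0.8},
}
emit_p = {
    "indifferent": {"silent": 0.9, "tweet": 0.1},
    "vulnerable": {"silent": 0.3, "tweet": 0.7},
}
path = viterbi(["silent", "tweet", "tweet"], states, start_p, trans_p, emit_p)
print(path)
```

In the full model the transition probabilities would depend on the user context at each time slice rather than being fixed as they are here.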
Predicting User States
Using Bayes rule and the first-order Markov property,

$$P(S_{i,N+1} \mid S_{i,N}, F_{i,N}) \propto P(F_{i,N} \mid S_{i,N}) \cdot P(S_{i,N+1} \mid S_{i,N}),$$

where the first factor is modeled as a multinomial and the second carries the Dirichlet prior.

Parameter estimation using MAP:

$$L(S_{i,N+1} \mid S_{i,N}, F_{i,N}) = \log P(F_{i,N} \mid S_{i,N}) + \log P(S_{i,N+1} \mid S_{i,N})$$

$$= \log \mathrm{multinom}(F_{i,N}; \theta_{i,N}) + \log \mathrm{Dirichlet}(\theta_{i,N+1}; \alpha_{i,N+1}),$$

with the multinomial and Dirichlet log-densities expanded in their standard forms, and where $B(\alpha_{i,N+1})$ is a beta-function with the parameter $\alpha_{i,N+1}$.
How do we evaluate?
Distortion Measurement
• Saturation measurement, i.e. the ability of a given attribute social graph to predict user-based, topological and temporal diffusion characteristics at a future time slice, given using the Kolmogorov-Smirnov (KS) statistic:

$$D = 1 - KS\big(\hat{E}^{M}_{N+1},\, E^{M}_{N+1}\big),$$

where $\hat{E}^{M}_{N+1}$ is the predicted value of the M-th diffusion characteristic at $t_{N+1}$.

• Utility measurement, i.e. the ability of a given attribute social graph to correlate with external temporal variables like user search behavior and news items featured online (http://news.google.com/), given as:

$$D = 1 - KS\big(E^{D}_{N+1},\, E^{S}_{N+1}\big),$$

where $E^{S}_{N+1}$ is the search volume, and $E^{D}_{N+1} = |l_m(\hat{S}_{N+1})| / Q_D$, where $|l_m(\hat{S}_{N+1})|$ is the number of nodes at slot $l_m$ in the collection $\hat{S}_{N+1}$, and $Q_D = \sum_m |l_m(\hat{S}_{N+1})|$. Similarly,

$$D = 1 - KS\big(E^{D}_{N+1},\, E^{N}_{N+1}\big),$$

where $E^{N}_{N+1}$ is the news volume.
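A score of the form 1 − KS(predicted, actual) can be computed with a few lines of Python. The sketch below uses the generic two-sample KS statistic (the largest gap between two empirical CDFs); how the study binned or normalized its diffusion characteristics is not shown on the slides, so the inputs and helper names here are assumptions.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of sample values that are <= x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


def one_minus_ks(predicted, actual):
    """Score of the form 1 - KS(predicted, actual), in [0, 1]."""
    return 1.0 - ks_statistic(predicted, actual)


# Hypothetical values of one diffusion characteristic.
predicted = [1, 2, 3, 4]
actual = [3, 4, 5, 6]
print(one_minus_ks(predicted, actual))
```

Identical samples give a KS statistic of 0, and samples with no overlap give 1, so the score spans the full range from 1 down to 0.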
Experimental Studies
Experimental Setup
• ~465K users, ~836K edges (“follower” / “following” relationships) and 29.5M tweets.
• 125 randomly chosen “trending topics” from Twitter, between Oct and Nov 2009.
• Trending topic – theme association based on OpenCalais (http://www.opencalais.com/).
Theme | Trending topics
Politics | Obama, Senate, Afghanistan, Tehran, Healthcare
Entertainment_Culture | Beyonce, Eagles, Michael Jackson, #britney3premiere
Sports | Chargers, Cliff Lee, Dodgers, Formula One, New York Yankees
Technology_Internet | Android 2, Bing, Google Wave, Windows 7, #Firefox5
Social Issues | Swine Flu, Unemployment, #BeatCancer, #Stoptheviolence
• Dataset released for non-commercial research purposes:
http://www.public.asu.edu/~mdechoud/temp/released-data/
Temporal Analysis
[Figure: four temporal prediction panels, annotated “Best: IRO”, “Best: CCR”, “Best: ACT”, and “Best: ACT”]
Topical Analysis
A: Business-Finance, B: Politics, C: Entertainment-Culture, D: Sports, E: Technology-Internet, F: Human Interest, G: Social Issues, H: Hospitality-Recreation.
• For a given theme, the attribute that yields the best prediction depends on:
– Effect of external events
– Habitual properties of users
Baseline Models
• GenModel: Generative model
• Cascade: Information cascades
• LinRegress: Linear regression
• DegAct: Degree of activity
• Random: Random sample
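[Figure: illustrations of the GenModel and Cascade baselines]
Linear combination of the contextual attributes (LinRegress):

$$P(O_{i,N+1}) = \alpha_1 \cdot F_{i,N;1} + \alpha_2 \cdot F_{i,N;2} + \dots + \alpha_k \cdot F_{i,N;k}$$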
Comparative Study
Summary
• Contributions:
– Quantified impact of homophily on the diffusion process in online social media.
– Proposed and evaluated a computational model that predicts diffusion characteristics over time given an attribute social graph.
– Excellent results on a large dataset from Twitter.
• Future directions:
– What is the relationship between diffusion process and social contagion?
– Can we develop computational models that maximize predicted diffusion characteristics given several homophily attributes?
Proposed Work
Evolution of Social Communication Networks
Related publications at: HT 2008, CIKM 2008, HT 2009, ACM TOIS
Online communication networks are multi-modal and dynamic
The “active” network is limited to only a few individuals and changes over time
Network evolves based on microscopic interactions at the ego-centric level
Proposed Goal
• Evolution of multiplex networks:
– What are appropriate representations of multiplex networks based on multi-relational interactions on online social media?
• Ego-network size thresholds:
– Are there empirical thresholds on the sizes of ego-networks corresponding to a given modality?
– Do dependencies exist among the thresholds across these different modalities?
– Can we model these inter-dependencies among the thresholds of multiple modalities to develop a generalized model of evolution of ego-networks?
Who can benefit from this research?
A news reporter, a political analyst, a company, the average Joe…
Prior Work
Multiplex Networks
1. Multiplexity as a fundamental aspect of social relations [Dacin et al., 1999, Robins et al., 2005]
2. Consideration of the different types of resources transferred through linkages [Lomi and Pattison, 2006, Robins et al., 2007]
3. Structural embeddedness [Granovetter 1973, Uzzi, 1997, Gulati, 1999, Gulati and Gargiulo, 1999]
Thresholds of Ego-networks
1. Cognitive limits on the number of friends an individual can maintain [Dunbar, 1992, Dunbar, 1993]
2. Facebook study on applicability of Dunbar’s number (http://www.insidefacebook.com/2009/02/27/facebooks-in-house-sociologist-shares-stats-on-users-social-behavior/)
Social Network Evolution
1. Preferential attachment [Barabasi and Albert, 1999, Abello et al., 2002, Cooper and Frieze, 2003]
2. Copying model [Kleinberg et al., 1999, Kumar et al., 2000]
3. Diameter and network densification [Leskovec et al., 2005, Leskovec et al., 2008]
Research Steps
Stacked representation of a multiplex network at time t given as:

$$G(V, \{E_1, E_2, \dots, E_n\}; t),$$

a directed social graph at a certain time t, where there are n sets of multiplex ties $\{E_1, E_2, \dots, E_n\}$ between users in V. $e_{ij}^{(k)} \in E_k$ implies that $u_i$ communicates with $u_j$ on modality k at time t.

• Research challenges:
– predict the ego-network of each user $u_i$ in V at t+1 for modality k, given data until time t.
– derive the network G’(V, {E1, E2, …, En}; t+1) for all n modalities at t+1.
– determine the threshold $d^{(k)}_{t+1}$, corresponding to all users $u_i$ in V, that bounds the size of the ego-networks of users for modality k at t+1.
– determine the conditional relationship between $d^{(k)}_{t+1}$ and $d^{(l)}_{t+1}$, where k, l ≤ n
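One straightforward way to hold a multiplex network with one directed edge set per modality, and to read off ego-network sizes for a modality, is sketched below. The class, its method names, and the modality labels are hypothetical; taking the maximum observed size as an empirical bound is only one possible reading of the threshold, not the proposed method.

```python
from collections import defaultdict


class MultiplexGraph:
    """Directed graph at one time slice with one edge set per modality."""

    def __init__(self):
        # modality k -> set of directed ties (u_i, u_j)
        self.edges = defaultdict(set)

    def add_tie(self, modality, u_i, u_j):
        """Record that u_i communicates with u_j on the given modality."""
        self.edges[modality].add((u_i, u_j))

    def ego_network(self, user, modality):
        """Users that `user` communicates with on the given modality."""
        return {v for (u, v) in self.edges[modality] if u == user}

    def ego_sizes(self, modality):
        """Ego-network size of every user active on the given modality."""
        sizes = defaultdict(int)
        for u, _ in self.edges[modality]:
            sizes[u] += 1          # ties are distinct pairs, so this counts neighbors
        return dict(sizes)


# Hypothetical ties on two modalities.
g = MultiplexGraph()
g.add_tie("comment", "u1", "u2")
g.add_tie("comment", "u1", "u3")
g.add_tie("comment", "u2", "u3")
g.add_tie("tag", "u1", "u2")

sizes = g.ego_sizes("comment")
print(sizes, max(sizes.values()))
```

Comparing such per-modality size distributions across time slices would be a natural first look at whether a bound exists and whether bounds on different modalities move together.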
Discussion
• Evaluation strategies
– Compare with observed multiplex social ties at a future time slice
– Ethnographic studies of evaluation of thresholds of ego-network sizes
• Potential datasets for evaluation
– Digg dataset: features multiple interactional modalities (“digging”, commenting and replying)
– Flickr dataset: diverse rich media interactions (photo upload, commenting, tagging, “favoriting”, group / photo pool subscription)
– Others: YouTube, Twitter (?)
Tentative Timeline
1. Proposed Work I: Evolution of ego-networks over multiplex ties: 2010/08 – 2011/02
• Background and lit survey: 2010/08
• Problem formulation: 2010/08 – 2010/09
• Data collection: 2010/09
• Experimental studies: 2010/10
• Conference submission: 2010/11
• Journal submission: 2010/12 – 2011/02
2. Proposed Work II: Models of user behavior on rich media: 2010/11 – 2011/04
• Background and lit survey: 2010/11
• Problem formulation: 2010/11 – 2010/12
• Data collection: 2010/12
• Experimental studies: 2011/01
• Conference submission: 2011/02
• Journal submission: 2011/03 – 2011/04
3. Thesis writing: 2011/02 – 2011/05
4. Dissertation defense: 2011/05
Conclusions
Information Network Media
Conclusions
• Goal:
– Analyzing the dynamics and impact of online social interactions
• Completed Work:
– Characterization of rich media artifacts via the interestingness property
  • [YouTube dataset] Interesting conversations on rich social media sites show consequential impact on sociological behavior
– Information diffusion and its relationship to attribute homophily
  • [Twitter dataset] Diffusion characteristics vary across topics and are affected by homophily
• Proposed Work:
– Evolution of ego-networks over multiplex social ties
Open Issues & Future Directions
• Social data sampling i.e. representativeness of observations and models
• Modeling sociological behavior at smaller topological / temporal scales i.e. in the “long tail”
• Application of proposed methods to complex real-time communication contexts, e.g. in enterprise business processes, crisis mitigation teams, financial domains
• Ethnographic and longitudinal studies to incorporate user feedback on the design and analysis of social communication systems