Studying Social Behavior in Social Media
Huan Liu
Joint Work with Lei Tang, Ali Abbasi, and Nitin Agarwal
Arizona State University
http://dmml.asu.edu/
June 21, 2010, Behavior Informatics 2010 at PAKDD, Hyderabad, India
[email protected] Arizona State University Data Mining and Machine Learning Lab 1 Studying Social Behavior in Social Media
Social Media
• Social networking
• Blogs
• Wikis
• Forums
• Content sharing
Traditional Media
• Broadcast media: one-to-many
• Communication media: one-to-one
Social Media: Many-to-Many
• Everyone can be a media outlet
• Communication barriers disappear
• Characteristics
  – User-generated content
  – Rich user interactions
  – Collaborative environment
  – Wisdom of the crowd
  – Long tail
Studying Social Behavior
• A new laboratory to study human behavior on an unprecedented scale
• Some issues of interest to us
  – Collective behavior
  – Influence
  – Implication
Behaviors and Prediction
• Collective Behavior and Social Dimensions
  – People are connected in various ways, to disparate groups
  – Knowing how one is connected to others can help predict behavior, but heterogeneous connections pose a challenge
• Influence Modeling in Communities
  – Who are the most influential ones in a community?
  – How to evaluate results without ground truth?
• Online social behaviors and their implications in the physical world
Collective Behavior
• One’s behavior is affected by his neighbors
• Can we predict one’s behavior based on those of his neighbors?
• Yes, there are successful examples:
  – Thresholding models (e.g., Thomas Schelling’s models of segregation)
  – Collective inference
• Problem statement and the state of the art
Behavior Prediction
• Given:
  – Social network connectivity information
  – Some users with known preferences
    • Whether or not they click on an ad
    • Whether or not they are interested in certain topics
    • Political views
    • Like/dislike of a product
• Output:
  – Preferences of other users within the network
[Figure: example network whose nodes are labeled +, -, or ? (unknown)]
State of the Art
• Markov Assumption
  – The label of one node depends on those of its neighbors
• Training
  – Build a relational model based on labels (and attributes) of neighbors
• Prediction: collective inference
  – Predict the label of one node while fixing the labels of its neighbors
  – Iterate until convergence
  – Typically requires multiple scans of the network
  – Equivalent to label propagation
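The collective inference loop above can be sketched in a few lines. This is only a minimal illustration, assuming a very simple relational model in which an unlabeled node's score is the mean of its neighbors' scores; the function name `collective_inference`, the toy network, and the known labels are all hypothetical and not taken from the talk.

```python
def collective_inference(neighbors, known, max_iter=100, tol=1e-6):
    """Iteratively estimate scores for unlabeled nodes.

    neighbors: dict mapping node -> list of neighboring nodes (undirected).
    known: dict mapping node -> +1.0 or -1.0 for nodes with known labels.
    Returns a dict mapping every node to a score in [-1, 1].
    """
    # Known labels stay fixed; unknown nodes start from a neutral score.
    scores = {node: known.get(node, 0.0) for node in neighbors}
    for _ in range(max_iter):  # each pass is one scan of the network
        largest_change = 0.0
        for node, nbrs in neighbors.items():
            if node in known or not nbrs:
                continue
            # Assumed relational model: mean of the neighbors' current scores.
            new_score = sum(scores[n] for n in nbrs) / len(nbrs)
            largest_change = max(largest_change, abs(new_score - scores[node]))
            scores[node] = new_score
        if largest_change < tol:  # stop once the scores have converged
            break
    return scores


# Hypothetical toy network: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
neighbors = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
known = {0: 1.0, 1: 1.0, 5: -1.0}
scores = collective_inference(neighbors, known)
predicted = {n: ("+" if s > 0 else "-") for n, s in scores.items() if n not in known}
print(predicted)
```

Note that every edge contributes equally to the mean, which is exactly the behavior the following slides identify as a limitation.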
[Figure: three snapshots of label propagation; nodes marked ? are progressively assigned + or - labels]
Limitations of Collective Inference
• Connections in a social network are heterogeneous
• Different relations can be correlated with preferences in varying degrees
• Relation information in social media is not always available
• Direct application of collective inference to social media treats all connections equivalently
• Need to differentiate heterogeneous relations
[Figure: one actor's connections grouped by relation: ASU, High School Friends, Fudan University]
Two New Challenges
• Without relation-type information, is it possible to differentiate relations based on network connectivity?
• If relations can be differentiated, how can we determine whether a relation can help behavior prediction?
Social Dimensions
Relation information is unknown.
1) How to extract the social dimensions?
2) Which affiliations are relevant for preference prediction?
Actor     ASU   Fudan University   High School   Yahoo! Inc.
Lei       1     1                  1             0
Actor1    1     0                  0             1
Actor2    0     1                  0             0
……        ……    ……                 ……            ……
One actor can be involved in multiple affiliations
Extraction of Social Dimensions
• People associated with the same affiliation tend to connect to each other more frequently, thus forming a community
• Most existing methods find non-overlapping communities
• One user can be associated with multiple affiliations
• A soft clustering method should be adopted
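Modularity Optimization
• Modularity compares the within-group interactions with the expected number of random connections in the group
• In a network with $m$ edges, for two nodes with degrees $d_i$ and $d_j$, the expected number of random connections between them is $d_i d_j / 2m$
• The interaction utility in a group $C$, where $A_{ij}$ is the adjacency matrix entry for nodes $i$ and $j$: $\sum_{i \in C,\, j \in C} \left( A_{ij} - d_i d_j / 2m \right)$
• To partition a network into multiple groups, we maximize $\frac{1}{2m} \sum_{C} \sum_{i \in C,\, j \in C} \left( A_{ij} - d_i d_j / 2m \right)$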
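As a worked illustration, the sketch below evaluates modularity for a given hard partition, assuming the standard definition in which the expected number of connections between nodes i and j is d_i d_j / 2m. The function name `modularity` and the toy edge list are hypothetical.

```python
def modularity(edges, groups):
    """Modularity Q of a partition of an undirected, unweighted network.

    edges: list of (u, v) pairs, each undirected edge listed once.
    groups: list of sets of nodes, one set per group.
    """
    m = len(edges)
    degree = {}
    adjacent = set()
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        adjacent.add((u, v))
        adjacent.add((v, u))
    total = 0.0
    for group in groups:
        for i in group:
            for j in group:
                a_ij = 1.0 if (i, j) in adjacent else 0.0
                # Observed connection minus the expected random connections.
                total += a_ij - degree.get(i, 0) * degree.get(j, 0) / (2.0 * m)
    return total / (2.0 * m)


# Hypothetical toy network: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_two_groups = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
q_one_group = modularity(edges, [{0, 1, 2, 3, 4, 5}])
print(q_two_groups, q_one_group)
```

Splitting the network along the two triangles gives Q = 5/14, while placing every node in one group gives Q = 0, so the maximization prefers the split.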
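Modularity Matrix
• Modularity formulated in matrix form: $Q = \frac{1}{2m} \mathrm{Tr}(S^T B S)$, where $B = A - \frac{d d^T}{2m}$ is the modularity matrix, $A$ is the adjacency matrix, $d$ is the vector of node degrees, and $S$ is the group-membership indicator matrix
• For soft clustering, relax $S$ to be continuous
• Solution: top eigenvectors of the modularity matrix $B$
[Figure: toy network with nodes numbered 1 to 9]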
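The eigenvector step can be illustrated with a small, dependency-free sketch. It builds the modularity matrix B = A - d d^T / 2m and approximates only its leading eigenvector by shifted power iteration; the power iteration is a stand-in chosen here for self-containment, not the solver used in the talk, and the function names and toy network are hypothetical. In practice a sparse eigensolver would return several top eigenvectors at once.

```python
def modularity_matrix(edges, n):
    """Dense modularity matrix B = A - d d^T / (2m) for nodes 0..n-1."""
    m = len(edges)
    adj = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    deg = [sum(row) for row in adj]
    return [[adj[i][j] - deg[i] * deg[j] / (2.0 * m) for j in range(n)]
            for i in range(n)]


def leading_eigenvector(B, iterations=500):
    """Approximate the eigenvector of the largest eigenvalue of symmetric B."""
    n = len(B)
    # Shift by a Gershgorin bound so that B + shift*I has no negative
    # eigenvalues; power iteration then converges to the top eigenvector of B.
    shift = max(sum(abs(x) for x in row) for row in B)
    vec = [float(i + 1) for i in range(n)]  # fixed, non-symmetric start vector
    for _ in range(iterations):
        nxt = [sum(B[i][j] * vec[j] for j in range(n)) + shift * vec[i]
               for i in range(n)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical toy network: triangles 0-1-2 and 3-4-5 joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
B = modularity_matrix(edges, 6)
dimension = leading_eigenvector(B)
print([round(x, 3) for x in dimension])
```

The entries of the resulting vector are continuous membership values; on this toy network their signs separate the two triangles, and the overall sign of an eigenvector is arbitrary.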
SocioDim: Framework based on Social Dimensions
• Training:
  – Extract social dimensions to represent potential affiliations
    • Soft clustering (modularity maximization, mixture of block models)
  – Build a classifier to select the discriminative dimensions
    • Support vector machines, decision trees, logistic regression
• Prediction:
  – Predict preferences based on one actor’s latent social dimensions
  – No collective inference is necessary
[Figure: SocioDim workflow, with stages labeled Extract Potential Affiliations, Social Dimensions, Training Classifier, Preferences, Prediction, and Predicted Preferences]
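The classification half of the framework can be sketched as follows, assuming the social dimensions have already been extracted. Logistic regression is one of the classifiers the slide lists; the plain gradient-ascent trainer, the function names, and the two-dimensional toy data are hypothetical and only meant to show that prediction needs the actor's own dimensions, not the neighbors' labels.

```python
import math


def train_logistic(features, labels, epochs=500, lr=0.5):
    """Fit logistic regression by batch gradient ascent; returns (weights, bias)."""
    dims = len(features[0])
    weights, bias = [0.0] * dims, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dims, 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = y - 1.0 / (1.0 + math.exp(-z))
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w + lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias += lr * grad_b / len(features)
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the actor is predicted to have the preference, else 0."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z > 0 else 0


# Hypothetical social dimensions (two affiliations per actor) and preferences.
labeled_dims = [[1, 0], [1, 1], [0, 1], [0, 1]]
labeled_prefs = [1, 1, 0, 0]
unlabeled_dims = {"actor_e": [1, 0], "actor_f": [0, 1]}

weights, bias = train_logistic(labeled_dims, labeled_prefs)
predictions = {name: predict(weights, bias, dims)
               for name, dims in unlabeled_dims.items()}
print(weights, predictions)
```

In this toy data the first affiliation is the one correlated with the preference, and the learned weights reflect that: the first weight is positive and the second negative, which is the sense in which the classifier selects discriminative dimensions.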
SocioDim vs. Collective Inference
Summary
• Collective behavior
• Heterogeneous connections
• Social dimensions
• Better prediction
• Further work
  – Scalability
  – Group profiling
  – Evolution
Identifying Influential Bloggers
• Given the exponential growth of blog posts, one way to cope is to find the influentials and then use them as clues to find relevant and interesting blogs.
Influential Sites vs. Influential Bloggers
• Short Head blogs
  – Influential sites
  – Search engines
  – Information diffusion [Gruhl et al. 2004; Kempe et al. 2003; Richardson and Domingos 2002; Java et al. 2006]
• Long Tail blogs [Anderson 2006]
  – Inordinately many
  – Less popular
  – Niche interests
• Extremely challenging to study all these blogs
• A solution: finding the influentials as representatives
• Practical benefits: product pre-release, customer feedback, targeted advertising
[Figure: long-tail curve of blogs; axis label: blog popularity]
Real and Virtual World
Real World
Domain Expert
Friends
Virtual World
Online Community
Influential Bloggers
• Inspired by the analogy between real-world and blog communities, we ask:
  – Who are the influentials in the Blogosphere?
  – Can we find them?
Active Bloggers = Influential Bloggers?
• Active bloggers may not be influential
• Influential bloggers may not be active
Searching for the Influentials
• Active bloggers
  – Easy to define
  – Often listed at a blog site
  – Are they necessarily influential?
• How to define an influential blogger
  – Influential bloggers have influential posts
  – Subjective
  – Collectable statistics
  – How to use these statistics
Intuitive Properties
• Social Gestures (statistics)
  – Recognition: citations (incoming links)
    An influential blog post is recognized by many. The more influential the referring posts are, the more influential the referred post becomes.
  – Activity Generation: volume of discussion (comments)
    The amount of discussion initiated by a blog post can be measured by the comments it receives. A large number of comments indicates that the blog post affects many readers enough that they care to write comments, hence it is influential.
  – Novelty: referring to other posts (outgoing links)
    Novel ideas exert more influence. A large number of outlinks suggests that the blog post refers to several other blog posts, hence it is less novel.
  – Eloquence: “goodness” of a blog post (length)
    An influential post is often eloquent. Given the informal nature of the Blogosphere, there is no incentive for a blogger to write a lengthy piece that bores the readers. Hence, a long post often suggests some necessity of doing so.
• Influence Score = f(Social Gestures)
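One possible shape for f is sketched below. The slide does not specify f, so the linear form, the function name `influence_score`, the weights, and the example posts are all assumptions made for illustration; in particular, this sketch counts inlinks equally and ignores the point above that more influential referring posts should contribute more.

```python
def influence_score(inlinks, comments, outlinks, length,
                    w_in=1.0, w_com=1.0, w_out=1.0, w_len=0.001):
    """Hypothetical linear combination of the four social gestures.

    Inlinks (recognition), comments (activity generation), and length
    (eloquence) raise the score; outlinks (less novelty) lower it.
    """
    return w_in * inlinks + w_com * comments - w_out * outlinks + w_len * length


# Hypothetical posts: (inlinks, comments, outlinks, length in words).
posts = {
    "post_a": (10, 5, 1, 800),
    "post_b": (2, 1, 8, 300),
}
scores = {name: influence_score(*gestures) for name, gestures in posts.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

The weights are the tunable parameters; changing them one at a time is one way to probe how much each gesture matters.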
Active & Influential Bloggers
• Active and influential bloggers
• Inactive but influential bloggers
• Active but non-influential bloggers
• We don’t consider inactive and non-influential bloggers, because they seldom submit blog posts and do not influence others.
Temporal Patterns
• Long-term influentials
• Average-term influentials
• Transient influentials
• Burgeoning influentials
Verification of the Model
• Challenges
  – No training and testing data, i.e., absence of ground truth
  – Enough experiments, or not?
  – If not, what’s missing?
• How to validate whether the model finds the influentials
• Validation must be independent of the model building
• We use another Web 2.0 website, Digg
Digg - Power of Web 2.0
• “Digg is all about user powered content. Everything is submitted and voted on by the Digg community. Share, discover, bookmark, and promote stuff that’s important to you!”
• The higher the Digg score for a blog post is, the more it is liked
• A blog post that is not liked will not be submitted and thus will not appear in Digg
Findings with Digg
• The Digg top 100 blog posts were obtained through the Digg Web API
• The top 5 influential and top 5 active bloggers were picked to construct 4 categories
• For each of the 4 categories of bloggers, we collect the top 20 blog posts from our model and compare them with the Digg top 100
• Distribution of the Digg top 100 and TUAW’s 535 blog posts
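The comparison against Digg can be read as a simple overlap count. The sketch below assumes that "compare" means counting how many of the model's top-k posts also appear in the reference list; the function name `overlap_at_k` and the post identifiers are hypothetical.

```python
def overlap_at_k(model_ranking, reference_ranking, k=20):
    """Count how many of the model's top-k posts appear in the reference list."""
    reference = set(reference_ranking)
    return sum(1 for post in model_ranking[:k] if post in reference)


# Hypothetical identifiers standing in for blog posts.
model_top = ["p1", "p2", "p3", "p4"]
digg_top = ["p2", "p9", "p4"]
print(overlap_at_k(model_top, digg_top, k=3))
print(overlap_at_k(model_top, digg_top, k=4))
```

A higher overlap for one category of bloggers than another would suggest that the model's notion of influence agrees more closely with what Digg users liked for that category.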
Relative Importance of Parameters
• Observe how much our model aligns with Digg
• Compare the top 20 blog posts from our model with those from Digg
• Considered a six-month period
• Considered all configurations to study the relative importance of each parameter
• Recognition (inlinks) > Activity Generation (comments) > Novelty (outlinks) > Eloquence (blog post length)
Summary
• Some people are more influential than others
• Finding them can be helpful in many ways
  – Indexes to the dynamic blogosphere
  – Opinion polling
  – Product trial
  – Targeted advertising
• Model validation is a challenge
  – Using an independent social media site is one option
• Future work
  – Content analysis
  – Topic-specific influence analysis
  – Evolutionary influence
Comparative Study of Social Behaviors
• Different methodologies
  – Online vs. offline
• Different challenges
• How to compare the two
• Could we use one to infer the other?
How to find people’s behavior
• Interviews and questionnaires
  – Ask people about their connections, actions, …
• Obtaining from online data (snooping)
  – You are what you connect with, what you write, and how you behave
Comparative Advantages
Offline Methods
• Many companies do this regularly
• Full control from design to sampling
• High accuracy
Online Methods
• Fast and cheap
• Huge amounts of data
• Large scale (people and topics)
• Usually publicly accessible
• Can be located with internet tools (crawlers, search engines)
Comparative Disadvantages
Offline Methods
• Dangerous (sometimes)
• Intrusive
• Time consuming
• Expensive
• Must conduct separate polls for each survey
Online Methods
• Huge amounts of data
• Lots of junk data
• Anonymous users
• There is no control of what to observe
Comparative Study - Indirect Method
• Sample: Health Care Reform
Blog Trends
News Timeline
Search Timeline
Comparative Study - Direct Method
Additional Information
• Book: Modeling and Data Mining in Blogosphere (2009)
• Book: Community Detection and Mining (2010)
• KDD08 Tutorial
• IEEESocialCom09
• SBP Conference Series
• SBP08, SBP09, & SBP10 Proceedings
Key References
• L. Tang and H. Liu. Toward Predicting Collective Behavior via Social Dimension Extraction. IEEE Intelligent Systems, July/August 2010.
• L. Tang and H. Liu. Relational Learning via Latent Social Dimensions. KDD’09, pages 817–826, 2009.
• S.A. Macskassy and F. Provost. Classification in Networked Data: A Toolkit and a Univariate Case Study. J. Machine Learning Research, vol. 8, no. 5, pages 935–983, 2007.
• N. Agarwal, H. Liu, L. Tang, and P.S. Yu. Identifying the Influential Bloggers in a Community. WSDM’08, pages 207–218, 2008.
• N. Agarwal and H. Liu. Modeling and Data Mining in Blogosphere. Morgan & Claypool, July 2009.
• dmml.asu.edu or via Huan Liu’s url
Acknowledgments: Projects are sponsored in part by AFOSR and ONR. Our gratitude goes to the members of DMML and our collaborators.