
Studying Social Behavior in Social Media

Huan Liu

Joint work with Lei Tang, Ali Abbasi, and Nitin Agarwal

Data Mining and Machine Learning Lab, Arizona State University
http://dmml.asu.edu/

June 21, 2010, Behavior Informatics 2010 at PAKDD, Hyderabad, India

Social Media

•  Social Networking
•  Blogs
•  Wikis
•  Forums
•  Content Sharing

Traditional Media

•  Broadcast Media: One-to-Many
•  Communication Media: One-to-One

Social Media: Many-to-Many

•  Everyone can be a media outlet
•  Communication barriers are disappearing
•  Characteristics
   –  User-generated content
   –  Rich user interactions
   –  Collaborative environment
   –  Wisdom of the crowd
   –  Long tail

Studying Social Behavior

•  A new laboratory to study human behavior on an unprecedented scale
•  Some issues of interest to us: Collective Behavior, Influence, Implication

Behaviors and Prediction

•  Collective Behavior and Social Dimensions
   –  People are connected in various ways, to disparate groups
   –  Knowing how one is connected to others can help predict behavior, but heterogeneous connections pose a challenge
•  Influence Modeling in Communities
   –  Who are the most influential ones in a community?
   –  How to evaluate results without ground truth?
•  Online social behaviors and their implications in the physical world


Collective Behavior

•  One's behavior is affected by his neighbors
•  Can we predict one's behavior based on those of his neighbors?
•  Yes, there are successful examples:
   –  Thresholding model (e.g., Thomas Schelling's models of segregation)
   –  Collective inference
•  Problem statement and the state of the art
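As a rough illustration of the thresholding idea, the sketch below uses a simple fractional-threshold rule: a node adopts a behavior once the share of its neighbors that have adopted reaches a threshold. The toy graph, the seed set, and the threshold value are made up for illustration, and this is not Schelling's segregation model itself, only the same neighbor-driven mechanism.

```python
def run_threshold_model(neighbors, seeds, threshold=0.5):
    """Spread a behavior until no node changes.

    neighbors: dict mapping node -> list of adjacent nodes (hypothetical toy graph)
    seeds: nodes that have adopted the behavior at the start
    threshold: fraction of adopting neighbors needed for a node to adopt
    """
    adopted = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, nbrs in neighbors.items():
            if node in adopted or not nbrs:
                continue
            share = sum(1 for n in nbrs if n in adopted) / len(nbrs)
            if share >= threshold:
                adopted.add(node)
                changed = True
    return adopted


# Hypothetical five-node network.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
final_adopters = run_threshold_model(toy_graph, seeds={"a", "b"}, threshold=0.5)
```

With a threshold of 0.5 the behavior spreads from the two seeds to the whole toy network; with a threshold of 1.0 and a single seed it does not spread at all.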

Behavior Prediction

•  Given:
   –  Social network connectivity information
   –  Some users with known preferences
      •  Whether or not they click on an ad
      •  Whether or not they are interested in certain topics
      •  Political views
      •  Like/dislike of a product
•  Output:
   –  Preferences of other users within the network

[Figure: example network in which some nodes have known preferences (+ or -) and the others are unknown (?)]

State-of-the-Art

•  Markov Assumption
   –  The label of one node depends on the labels of its neighbors
•  Training
   –  Build a relational model based on the labels (and attributes) of neighbors
•  Prediction: collective inference
   –  Predict the label of one node while fixing the labels of its neighbors
   –  Iterate until convergence
   –  Typically requires multiple scans of the network
   –  Equivalent to label propagation
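The iterate-until-convergence step can be sketched as a plain neighbor-averaging loop. This is a minimal illustration under stated assumptions, not the relational model used in the talk: labels are binary, each unlabeled node's score is the unweighted mean of its neighbors' scores, and the function name, toy graph, and node names are hypothetical.

```python
def collective_inference(neighbors, known, max_iters=100, tol=1e-6):
    """Estimate P(label = +) for unlabeled nodes by iterating neighbor averages.

    neighbors: dict node -> list of adjacent nodes (toy graph, assumed undirected)
    known: dict node -> 1.0 for '+' or 0.0 for '-' (the labeled nodes)
    """
    # Start every unlabeled node at the share of '+' among labeled nodes.
    prior = sum(known.values()) / len(known)
    scores = {n: known.get(n, prior) for n in neighbors}
    for _ in range(max_iters):
        largest_change = 0.0
        updated = dict(scores)
        for node, nbrs in neighbors.items():
            if node in known or not nbrs:
                continue  # labeled nodes stay fixed
            updated[node] = sum(scores[n] for n in nbrs) / len(nbrs)
            largest_change = max(largest_change, abs(updated[node] - scores[node]))
        scores = updated
        if largest_change < tol:
            break  # one more scan would change nothing noticeable
    return scores


# Hypothetical network: two '+' users, one '-' user, two unknown users.
toy_neighbors = {
    "p1": ["u1"],
    "p2": ["u1"],
    "u1": ["p1", "p2", "u2"],
    "u2": ["u1", "n1"],
    "n1": ["u2"],
}
known_labels = {"p1": 1.0, "p2": 1.0, "n1": 0.0}
scores = collective_inference(toy_neighbors, known_labels)
```

Each pass over the network is one scan; the loop stops once no score moves by more than `tol`, which is why several scans are typically needed.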

[Figure: collective inference on the example network; unknown labels (?) are filled in with + or - over successive iterations]

Limitations of Collective Inference

•  Connections in a social network are heterogeneous
•  Different relations can be correlated with preferences in varying degrees
•  Relation information in social media is not always available
•  Direct application of collective inference to social media treats all connections equivalently
•  Need to differentiate heterogeneous relations

[Figure: one actor's connections grouped as ASU, Fudan University, and High School Friends]

Two New Challenges

•  Without relation-type information, is it possible to differentiate relations based on network connectivity?
•  If relations can be differentiated, how can we determine whether a relation can help behavior prediction?

Social Dimensions

          ASU   Fudan University   High School   Yahoo! Inc.
Lei        1           1                1             0
Actor1     1           0                0             1
Actor2     0           1                0             0
……        ……          ……               ……            ……

•  One actor can be involved in multiple affiliations
•  Relation information is unknown.
   1)  How to extract the social dimensions?
   2)  Which affiliations are relevant for preference prediction?

[Figure: overlapping affiliations ASU, Fudan, and High School]

Extraction of Social Dimensions

•  People associated with the same affiliation tend to connect to each other more frequently, thus forming a community
•  Most existing methods find non-overlapping communities
•  One user can be associated with multiple affiliations
•  A soft clustering method should be adopted

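Modularity Optimization

•  Modularity compares the within-group interactions with the expected number of random connections in the group
•  In a network with $m$ edges, for two nodes with degrees $d_i$ and $d_j$, the expected number of random connections between them is
   $$\frac{d_i d_j}{2m}$$
•  The interaction utility in a group $C$, with $A$ the adjacency matrix:
   $$\sum_{i \in C,\, j \in C} \left( A_{ij} - \frac{d_i d_j}{2m} \right)$$
•  To partition a network into multiple groups, we maximize
   $$\max \; \frac{1}{2m} \sum_{C} \sum_{i \in C,\, j \in C} \left( A_{ij} - \frac{d_i d_j}{2m} \right)$$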
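To make the objective concrete, the sketch below evaluates the standard modularity score of a given hard partition. The function name and the toy graph (two triangles joined by one edge) are assumptions for illustration; the slides themselves target soft clustering, which this example does not cover.

```python
def modularity(edges, community_of):
    """Return Q = (1/2m) * sum over same-community pairs (A_ij - d_i*d_j/2m).

    edges: list of undirected edges (u, v) without self-loops or duplicates
    community_of: dict node -> community id (a hard, non-overlapping partition)
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    edge_set = {frozenset(e) for e in edges}
    q = 0.0
    for i in degree:
        for j in degree:
            if community_of[i] != community_of[j]:
                continue
            a_ij = 1.0 if i != j and frozenset((i, j)) in edge_set else 0.0
            q += a_ij - degree[i] * degree[j] / (2.0 * m)
    return q / (2.0 * m)


# Hypothetical network: two triangles joined by the edge (3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "left", 2: "left", 3: "left", 4: "right", 5: "right", 6: "right"}
one_group = {node: "all" for node in two_groups}
q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
```

Splitting the toy network into its two triangles scores higher than lumping every node into one group, which is the behavior the maximization relies on.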

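Modularity Matrix

•  Modularity formulated in matrix form:
   $$Q = \frac{1}{2m} \mathrm{Tr}\left( S^{T} B S \right), \quad \text{where } B = A - \frac{d\, d^{T}}{2m}$$
   Here $S$ is the group indicator matrix, $A$ the adjacency matrix, and $d$ the vector of node degrees.
•  For soft clustering, relax $S$ to be continuous
•  Solution: the top eigenvectors of the modularity matrix $B$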
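The spectral step can be sketched without any linear algebra library by running power iteration on a shifted copy of the modularity matrix B = A - d d^T / (2m). This is only an illustration under assumptions: the helper names and the toy graph are hypothetical, and only the single leading eigenvector is extracted, whereas the approach on the slide uses several top eigenvectors (which would need deflation or a library eigensolver).

```python
import math


def modularity_matrix(adj):
    """B = A - d d^T / (2m) for a symmetric 0/1 adjacency matrix (list of lists)."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = float(sum(degree))
    return [[adj[i][j] - degree[i] * degree[j] / two_m for j in range(n)]
            for i in range(n)]


def leading_eigenvector(matrix, iters=500):
    """Power iteration on (matrix + shift * I).

    The shift is the largest absolute row sum, so every eigenvalue of the
    shifted matrix is non-negative and the iteration settles on the
    eigenvector of the largest eigenvalue of the original matrix.
    """
    n = len(matrix)
    shift = max(sum(abs(x) for x in row) for row in matrix)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-uniform start
    for _ in range(iters):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) + shift * vec[i]
               for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical network: triangle 0-1-2 and triangle 3-4-5 joined by edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n_nodes = 6
adj = [[0] * n_nodes for _ in range(n_nodes)]
for u, v in toy_edges:
    adj[u][v] = adj[v][u] = 1

dimension = leading_eigenvector(modularity_matrix(adj))
```

The entries of `dimension` act as one continuous social dimension: nodes in the same triangle get entries of the same sign, and the magnitudes give a soft degree of association rather than a hard membership.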

[Figure: example network with nodes 1-9]

SocioDim: Framework based on Social Dimensions

•  Training:
   –  Extract social dimensions to represent potential affiliations
      •  Soft clustering (modularity maximization, mixture of block models)
   –  Build a classifier to select the discriminative dimensions
      •  Support vector machines, decision trees, logistic regression
•  Prediction:
   –  Predict preferences based on one actor's latent social dimensions
   –  No collective inference is necessary

[Figure: SocioDim pipeline. Extract potential affiliations to obtain social dimensions; train a classifier on the social dimensions and known preferences; predict the remaining preferences]
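The train-then-predict flow can be sketched as follows, assuming the social dimensions have already been extracted. Everything concrete here is an assumption for illustration: the binary affiliation rows, the preference labels, the choice of logistic regression fitted by plain gradient descent, and the function names. The slides list several possible classifiers and do not prescribe this one.

```python
import math


def train_logistic(features, labels, epochs=2000, lr=0.5):
    """Fit weights and a bias by batch gradient descent on the log-loss."""
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the linear score is positive, else 0."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z > 0 else 0


# Hypothetical social dimensions: columns are four latent affiliations.
train_dims = [
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
]
# Hypothetical known preferences; in this toy data they follow the first affiliation.
train_prefs = [1, 1, 0, 0, 1, 0]

weights, bias = train_logistic(train_dims, train_prefs)
pred_member = predict(weights, bias, [1, 1, 0, 0])
pred_outsider = predict(weights, bias, [0, 0, 0, 1])
```

Prediction needs only the actor's own row of social dimensions, so no iteration over the network is required at this stage. The learned weights also indicate which affiliations are discriminative for the preference.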

SocioDim vs. Collective Inference

Summary

•  Collective behavior
•  Heterogeneous connections
•  Social dimensions
•  Better prediction
•  Further work
   –  Scalability
   –  Group profiling
   –  Evolution


Identifying Influential Bloggers

•  Given the exponential growth of blog posts, one way to cope is to find the influentials and then use them as clues to find relevant and interesting blogs.

Influential Sites vs. Influential Bloggers

•  Short Head blogs
   –  Influential sites
   –  Search engines
   –  Information diffusion [Gruhl et al. 2004; Kempe et al. 2003; Richardson and Domingos 2002; Java et al. 2006]
•  Long Tail blogs [Anderson 2006]
   –  Inordinately many
   –  Less popular
   –  Niche interests
•  Extremely challenging to study all these blogs
•  A solution: finding the influentials as representatives
•  Practical benefits: product pre-release, customer feedback, targeted advertising

[Figure: blog popularity curve]

Real and Virtual World

•  Real World: Domain Expert, Friends
•  Virtual World: Online Community

Influential Bloggers

•  Inspired by the analogy between real-world and blog communities, we answer:
   –  Who are the influentials in the Blogosphere?
   –  Can we find them?

Active Bloggers = Influential Bloggers?

•  Active bloggers may not be influential
•  Influential bloggers may not be active

Searching for the Influentials

•  Active bloggers
   –  Easy to define
   –  Often listed at a blog site
   –  Are they necessarily influential?
•  How to define an influential blogger
   –  Influential bloggers have influential posts
   –  Subjective
   –  Collectable statistics
   –  How to use these statistics

Intuitive Properties

•  Social Gestures (statistics)
   –  Recognition: citations (incoming links)
      •  An influential blog post is recognized by many. The more influential the referring posts are, the more influential the referred post becomes.
   –  Activity Generation: volume of discussion (comments)
      •  The amount of discussion initiated by a blog post can be measured by the comments it receives. A large number of comments indicates that the blog post affects many readers enough that they care to write comments, hence it is influential.
   –  Novelty: referring to other posts (outgoing links)
      •  Novel ideas exert more influence. A large number of outlinks suggests that the blog post refers to several other blog posts, hence it is less novel.
   –  Eloquence: "goodness" of a blog post (length)
      •  An influential is often eloquent. Given the informal nature of the Blogosphere, there is no incentive for a blogger to write a lengthy piece that bores the readers. Hence, a long post often suggests some necessity of doing so.
•  Influence Score = f(Social Gestures)
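One possible shape for f is sketched below, purely as an assumption: inlinks and comments raise the score, outlinks lower it, and a damped length factor stands in for eloquence. The function name, the weights, the logarithmic damping, and the sample posts are all hypothetical. The sketch also uses raw link counts, so it ignores the point above that inlinks from more influential posts should count for more.

```python
import math


def influence_score(inlinks, comments, outlinks, length,
                    w_in=1.0, w_com=1.0, w_out=1.0):
    """Toy f(Social Gestures) with hypothetical weights.

    inlinks:  number of incoming links (recognition)
    comments: number of comments (activity generation)
    outlinks: number of outgoing links (more outlinks means less novelty)
    length:   post length in words (eloquence proxy)
    """
    flow = w_in * inlinks - w_out * outlinks
    eloquence = math.log(1 + length)  # damped so very long posts do not dominate
    return eloquence * (w_com * comments + flow)


# Two hypothetical posts of equal length.
score_a = influence_score(inlinks=10, comments=25, outlinks=2, length=800)
score_b = influence_score(inlinks=1, comments=3, outlinks=12, length=800)
```

Varying `w_in`, `w_com`, and `w_out` is one way to study how much each gesture matters when the resulting ranking is compared with an external reference.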

Active & Influential Bloggers

•  Active and Influential Bloggers
•  Inactive but Influential Bloggers
•  Active but Non-influential Bloggers
•  We don't consider "Inactive and Non-influential Bloggers", because they seldom submit blog posts and do not influence others.

Temporal Patterns

•  Long-term Influentials
•  Average-term Influentials
•  Transient Influentials
•  Burgeoning Influentials

Verification of the Model

•  Challenges
   –  No training and testing data, i.e., absence of ground truth
   –  Enough experiments? Or not?
   –  If not, what's missing?
•  How to validate whether the model finds the influentials?
•  The validation must be independent of the model building
•  We use another Web 2.0 website, Digg
•  "Digg is all about user powered content. Everything is submitted and voted on by the Digg community. Share, discover, bookmark, and promote stuff that's important to you!"
•  The higher the Digg score of a blog post, the more it is liked.
•  A blog post that is not liked will not be submitted and thus will not appear in Digg

Digg - Power of Web 2.0

Findings with Digg

•  Digg records: the top 100 blog posts, obtained through the Digg Web API.
•  The top 5 influential and top 5 active bloggers were picked to construct 4 categories
•  For each of the 4 categories of bloggers, we collect the top 20 blog posts from our model and compare them with the Digg top 100.
•  Distribution of the Digg top 100 and TUAW's 535 blog posts
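The comparison step amounts to counting how many of the model's top posts also appear in the reference list. The sketch below assumes posts are identified by simple string ids and that both rankings are already available as lists; the function name and the sample ids are hypothetical.

```python
def top_k_overlap(model_ranking, reference_ranking, k=20):
    """Count how many of the model's top-k posts appear in the reference list.

    model_ranking: post ids ordered by the model's influence score, best first
    reference_ranking: post ids in the independent reference list (e.g. a top 100)
    """
    model_top = set(model_ranking[:k])
    return len(model_top & set(reference_ranking))


# Hypothetical ids: the model ranks p0..p29, the reference list holds p10..p109.
model_posts = ["p%d" % i for i in range(30)]
reference_posts = ["p%d" % i for i in range(10, 110)]
overlap = top_k_overlap(model_posts, reference_posts, k=20)
```

Running the same count for each blogger category gives a simple, model-independent indication of how well a category's posts align with the reference site.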

Relative Importance of Parameters

•  Observe how much our model aligns with Digg.
•  Compare the top 20 blog posts from our model and from Digg.
•  Considered six months
•  Considered all configurations to study the relative importance of each parameter.
•  Recognition (Inlinks) > Activity Generation (Comments) > Novelty (Outlinks) > Eloquence (Blog post length)

Summary

•  Some people are more influential than others
•  Finding them can be helpful in many ways
   –  Indexes to the dynamic blogosphere
   –  Opinion polling
   –  Product trial
   –  Targeted advertising
•  Model validation is a challenge; using an independent social media site is one option
•  Future work
   –  Content analysis
   –  Topic-specific influence analysis
   –  Evolutionary influence


Comparative Study of Social Behaviors

•  Different methodologies
   –  Online vs. offline
•  Different challenges
•  How to compare the two?
•  Could we use one to infer the other?

How to find people's behavior

•  Interviews and Questionnaires
   –  Ask them about their connections, actions, …
•  Obtaining from online data (Snooping)
   –  You are what you connect with, what you write, and how you behave

Comparative Advantages

•  Offline Methods
   –  Lots of companies do this regularly
   –  Full control from design to sampling
   –  High accuracy
•  Online Methods
   –  Fast and cheap
   –  Huge amounts of data
   –  Large scale (people and topics)
   –  Usually publicly accessible
   –  Can be located with internet tools (crawlers, search engines)

Comparative Disadvantages

•  Offline Methods
   –  Dangerous (sometimes)
   –  Intrusive
   –  Time consuming
   –  Expensive
   –  Must conduct separate polls for each survey
•  Online Methods
   –  Huge amounts of data
   –  Lots of junk data
   –  Anonymous users
   –  There is no control of what to observe

Comparative Study - Indirect Method

•  Sample: Health Care Reform

[Figure: Blog Trends, News Timeline, and Search Timeline for the sample topic]

Comparative Study - Direct Method


Additional Information

•  Book: Modeling and Data Mining in Blogosphere (2009)
•  Book: Community Detection and Mining (2010)
•  KDD08 Tutorial
•  IEEESocialCom09
•  SBP Conference Series
•  SBP08, SBP09, & SBP10 Proceedings


Key References

•  L. Tang and H. Liu. Toward Predicting Collective Behavior via Social Dimension Extraction. IEEE Intelligent Systems, July/August 2010.
•  L. Tang and H. Liu. Relational Learning via Latent Social Dimensions. KDD'09, pages 817–826, 2009.
•  S.A. Macskassy and F. Provost. Classification in Networked Data: A Toolkit and a Univariate Case Study. J. Machine Learning Research, vol. 8, no. 5, pages 935–983, 2007.
•  N. Agarwal, H. Liu, L. Tang, and P.S. Yu. Identifying the Influential Bloggers in a Community. WSDM'08, pages 207–218, 2008.
•  N. Agarwal and H. Liu. Modeling and Data Mining in Blogosphere. Morgan & Claypool, July 2009.
•  dmml.asu.edu or via Huan Liu's url

Acknowledgments: Projects are in part sponsored by AFOSR and ONR. Our gratitude goes to the members of DMML and our collaborators.
