weigend_STATS252_ItamarRosenn_EricSun_2009.05.04.pptx - Slide 1

Post on 17-May-2015


Data Science at Facebook

Itamar Rosenn, Eric Sun

5/4/09

Facebook Data

▪ Social Graph

▪ 200M+ active users

▪ 100M+ users come to site each day

▪ several hundred thousand new users join each day

▪ hundreds of dimensions per user (numerical, categorical, text)

▪ average user has over 120 friends

▪ friendships on Facebook span many different types of relationships

▪ Social Behavior

▪ Actions: users interact with hundreds of thousands of applications, on and off the site

▪ Interactions: users interact directly with each other via over 100 distinct types of events

▪ Social Content

▪ Photos, Status Updates, Platform Application Content, Events, Posts, Videos, Notes, etc...

Managing Data at Scale

▪ Solution: Hadoop + Hive

▪ HDFS / Hadoop (MapReduce in Java)

▪ MetaStore (metadata management)

▪ HiveQL (SQL-like query language on top of Hadoop + MetaStore)

▪ Data Scale

▪ More than 1PB raw capacity in largest HDFS / Hadoop cluster

▪ Over 2TB uncompressed data collected each day

▪ Dozens of TB worth of data read / written each day via Hadoop + Hive
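As a rough illustration of what a SQL-like layer over MapReduce does, the sketch below runs a grouped count as explicit map, shuffle, and reduce steps in plain Python. It is not Facebook's code: the `(user_id, event_type)` row layout and all function names are assumptions made for this example. In HiveQL the same aggregation would be a single `SELECT event_type, COUNT(*) ... GROUP BY event_type` over a hypothetical events table.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical log row: (user_id, event_type). Not an actual Facebook schema.
Row = tuple[str, str]


def map_phase(rows: Iterable[Row]) -> Iterator[tuple[str, int]]:
    """Emit one (event_type, 1) pair per input row."""
    for _user_id, event_type in rows:
        yield event_type, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, as the framework does between map and reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_event_type(rows: Iterable[Row]) -> dict[str, int]:
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    sample = [("u1", "wall_post"), ("u2", "message"), ("u1", "wall_post")]
    print(count_by_event_type(sample))
```

On a real cluster the map and reduce steps run in parallel across machines; the single-process version only shows the data flow.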

Data Science - What We Do

Behavioral Analysis

▪ Product Health Metrics

▪ Launch Evaluations

▪ Growth Modeling

▪ User Churn Modeling

▪ Production Incentives

▪ Content Diffusion

Data-Driven Systems

▪ Ad CTR Prediction

▪ PYMK

▪ Search Ranking

▪ Highlights

Data Infrastructure

▪ Hive

▪ Hadoop

Data Science – Who We Are

Dennis Decoste Roddy Lindsay Alex Smith

Thomas Lento

Venky Iyer

Ravi Grover Cameron Marlow

Lee Byron Itamar Rosenn

Danny Ferrante

James Mayfield

Maintained Relationships on Facebook

▪ Question: is Facebook increasing the size of people’s personal networks?

▪ Task:

▪ the types of relationships people maintain on the site

▪ the relative size of these groups

Types of Relationships

People you know

▪ Facebook friends = people you’ve met at some point in life

▪ Researchers have estimated this number to be somewhere between 300 and 3,000. (Gladwell, Killworth)

Communication network

▪ Individuals with whom you communicate on a regular basis

▪ Includes your core support network, which may be as low as 3 people

▪ Kossinets and Watts observed communication network size of 10-20

Maintained relationships

▪ Social technologies like Newsfeed or RSS readers allow you to keep up with the things that people you know are doing

▪ This information consumption is a form of relationship management, as it can lead to direct communication in the future

Measuring Network Size on Facebook

Examine the relationships of a random user sample over 30 days on the site. We defined networks in 4 ways:

All friends

▪ The largest representation of a person’s network is the set of people they have verified as friends.

Reciprocal communication

▪ The number of friends with whom the user had reciprocal exchanges via messages, wall posts, or comments. This provides a measure of the user’s core network.

One-way communication

▪ The number of friends to whom the user has reached out via messages, wall posts, or comments.

Maintained relationships

▪ The number of friends whose Newsfeed stories the user has clicked on, or whose profiles the user has visited at least twice
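The four definitions above can be sketched as set operations over one user's 30-day activity. This is a minimal illustration, not the authors' pipeline: the `UserActivity` fields are hypothetical, and the sketch assumes that "at least twice" applies to profile visits only, which the slide leaves ambiguous.

```python
from dataclasses import dataclass, field


@dataclass
class UserActivity:
    """Hypothetical 30-day activity summary for one sampled user."""
    friends: set[str]
    # People the user reached out to via messages, wall posts, or comments.
    sent_to: set[str] = field(default_factory=set)
    # People who reached out to the user via the same channels.
    received_from: set[str] = field(default_factory=set)
    # People whose News Feed stories the user clicked on.
    feed_clicks: set[str] = field(default_factory=set)
    # Profile visit counts keyed by the visited person.
    profile_visits: dict[str, int] = field(default_factory=dict)


def network_sizes(a: UserActivity) -> dict[str, int]:
    one_way = a.sent_to & a.friends
    reciprocal = one_way & a.received_from
    # Assumption: the "at least twice" threshold applies to profile visits only.
    visited_twice = {f for f, n in a.profile_visits.items() if n >= 2}
    maintained = (a.feed_clicks | visited_twice) & a.friends
    return {
        "all_friends": len(a.friends),
        "reciprocal": len(reciprocal),
        "one_way": len(one_way),
        "maintained": len(maintained),
    }
```

Intersecting every set with `friends` keeps each network a subset of the friend list, matching the slide's wording "the number of friends ...".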

Findings

▪ As a function of the # of friends a user has, she passively engages with 2 to 2.5 times as many people as those with whom she directly communicates

Systemic Effects

▪ The stark contrast between these networks shows the effect of technologies like Newsfeed.

Content Production among New Users

▪ Mission: Give people the power to share and make the world more open and connected.

▪ Question: What mechanisms lead Facebook newcomers to share content on the site?

Content Production

In new users’ first two weeks:

▪ 45% upload a photo

▪ 41% use a 3rd-party app

▪ 30% send a private message

▪ 27% compose a status update

▪ 22% write on a friend’s wall

Production Incentives Hypotheses

▪ H1: Newcomers who receive more feedback on their initial content will go on to contribute more content.

▪ H2: Newcomers whose initial content receives greater distribution will go on to produce more content.

▪ H3: Social learning: Newcomers whose friends share more content will go on to produce more content themselves.

▪ H4: Singling out: Newcomers who are singled out in content that their friends produce will go on to produce more content themselves.

Method

Quantitative

▪ Selected two cohorts: Nov. 5, 2007 (N = 347,403) and Mar. 3, 2008 (N = 254,603)

▪ Observed activity in their first two weeks

▪ Predicted how many photos they would upload between third and fifteenth week on Facebook

Qualitative

▪ 40-minute semi-structured interviews with seven new users

▪ Recorded audio/video and screen

▪ Asked about typical uses of Facebook, content production, social norms, privacy

Features

Independent Variables

H1. Feedback

▪ Comments received

H2. Distribution

▪ # of times content was viewed in Newsfeed

▪ # of friends who viewed content in Newsfeed

H3. Social Learning

▪ Number of friends’ photos seen

H4. Singling Out

▪ Number of times tagged

Controls

▪ Age

▪ Gender

▪ Number of friends

▪ Total pages viewed

▪ Initial engagement with photos:

▪ # of photos uploaded

▪ # of photos viewed

▪ Photo tags created

▪ Photo comments written

Results

Model 1 – Early Uploaders (intercept 1.2)

Controls                     Coefficient   % change from int.
Age (in years)               -0.01         -1.0%     ***
Male (0/1)                    0.48         +39.3%    ***
Female (0/1)                  1.21         +131.2%   ***
Pages viewed +                0.24         +18.4%    ***
Photo pages viewed +          2.80         +597.4%   ***
Photo comments made           0.15         +11.2%    ***
Photo tags created            0.10         +6.9%     ***
Photos uploaded               0.30         +22.8%    ***

Independent Vars             Coefficient   % change from int.
Comments received (0/1)       0.09         +6.2%     ***
Photo views received          0.04         +2.6%     ***
Photo stories seen            0.09         +6.1%     ***
Photo tags received (0/1)     0.03         +2.1%     (ns)

Model 2 – Everyone (intercept 1.9)

Controls                     Coefficient   % change from int.
Age (in years)               -0.01         -0.7%     ***
Male (0/1)                    0.84         +79.6%    ***
Female (0/1)                  1.43         +169.8%   ***
Pages viewed +               -0.02         -1.6%     ***
Photo pages viewed +          2.35         +408.3%   ***
Photo comments made           0.24         +17.7%    ***
Photo tags created            0.17         +12.6%    ***
Early-uploader (0/1)          0.39         +30.6%    ***

Independent Vars                                  Coefficient   % change from int.
Photo stories seen X early-uploader                0.15         +10.7%    ***
Photo stories seen X non-early-uploader            0.03         +2.2%     ***
Photo tags received X early-uploader (0/1)        -0.05         -3.6%     (ns)
Photo tags received X non-early-uploader (0/1)     0.10         +7.2%     ***
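The slides do not say how the "% change from int." column was derived. As an observation only, the reported figures agree to within rounding of the two-decimal coefficients with 2^coefficient − 1, which is what a log2-transformed response would produce. The sketch below checks that reading against a few rows copied from the tables. The base-2 interpretation is an inference, not something stated by the authors.

```python
def pct_change(coefficient: float, base: float = 2.0) -> float:
    """Percent change implied by a coefficient under an assumed log-base response."""
    return (base ** coefficient - 1.0) * 100.0


# (coefficient, reported % change) pairs copied from the results tables.
reported = [
    (0.48, 39.3),    # Model 1: Male
    (1.21, 131.2),   # Model 1: Female
    (0.30, 22.8),    # Model 1: Photos uploaded
    (2.80, 597.4),   # Model 1: Photo pages viewed
    (0.10, 7.2),     # Model 2: Photo tags received X non-early-uploader
]

if __name__ == "__main__":
    for coef, pct in reported:
        print(f"coef {coef:+.2f}: implied {pct_change(coef):+.1f}%, slide {pct:+.1f}%")
```

Small gaps between implied and reported values are expected, because the slide coefficients are rounded to two decimals.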

Summary of Results

Hypothesis            Early-uploaders   Non-early-uploaders
H1. Feedback          Support           N/A
H2. Distribution      Modest Support    N/A
H3. Social learning   Support           Support
H4. Singling out      No Support        Support

▪ We learn from our friends. If our friends engage with photos, we do too. Social learning is the main lever for content production.

▪ For new users already uploading photos, feedback is associated with increased content production, and distribution is marginally important.

Modeling Contagion Through Newsfeed

▪ How do ideas spread through a social network?

▪ Use Facebook Pages to model diffusion patterns

▪ Compare results with existing models of diffusion

▪ Show how Facebook advertising campaigns may be more successful than off-Facebook advertising campaigns due to Facebook’s inter-connectedness and diffusion properties.

▪ Note: Research based on “old” Facebook (pre-March 2009)

▪ Still relevant: first empirical analysis of large-scale collisions of short chains

Theory of the Influentials

▪ Old Theory: it’s all about the “influentials” (Malcolm Gladwell, etc.)

▪ Idea: reach a tiny group of Influential people, and you’ll reach everyone else through them for free

▪ $1+ billion/year spent on word-of-mouth campaigns targeting Influentials; amount is growing 36% per year (MarketingVOX)

Contagion Theory

▪ Duncan Watts: Anyone can be an influencer.

▪ Ideas don’t spread via influentials. Instead, ideas spread like viruses: either you’re susceptible, or you’re not

▪ Success depends not on how persuasive the early adopter(s) are, but whether everyone else is easily persuaded.

How Do Ideas Spread on Facebook?

▪ News Feed allows for efficient diffusion of ideas

▪ Facebook’s Pages product is one of the most viral features of the site.

▪ People may see multiple friends fan a Page in a single Feed story, so a node in the graph can have multiple parents

[Figure: Chain of Length 1. Alice fans a Page. Bob sees Alice’s action on his News Feed and fans the Page as well. Charlie sees Alice’s action on his News Feed and fans the Page as well.]

Large-Scale Result: Large Connected Trees of Diffusion

▪ Diffusion chain for Stripy, a cartoon popular in Bosnia (blue) & Slovenia (yellow). Croatia (green) has yet to find its connecting bridge.

Large Connected Clusters

▪ Often, the vast majority of fans can be connected into one cluster; sometimes over 90% of the fans for one particular Page can be connected.

▪ Example: On 8/21/08, 71,090 of 96,922 fans of the Nastia Liukin Page (73.3%) were in one connected cluster.

▪ For Pages created after 7/1/08, the median Page had 69.48% of its Fans in one connected cluster as of 8/19/08.

How Do These Large Clusters Come About?

• Are these large clusters started by “one guy”?

▪ No: across all Pages of meaningful size (>1000 Fans), 14.8% of the Fans in the biggest cluster were “start points.”

▪ The variability in this percentage becomes very small as # fans increases

▪ The average node in the biggest cluster is connected to 2.899 others.

• Large clusters are formed when many long chains of diffusion merge together.
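One way to compute the two quantities discussed above, the share of fans in the largest connected cluster and the share of that cluster made up of start points, is a union-find pass over the diffusion edges. This is a sketch under stated assumptions: an edge `(parent, child)` means the child fanned the Page after seeing the parent's action, a "start point" is taken to be a fan with no incoming edge, and every user in `edges` also appears in `fans`. None of these names or definitions come from the slides.

```python
from collections import Counter
from typing import Iterable


def largest_cluster_stats(
    fans: Iterable[str], edges: Iterable[tuple[str, str]]
) -> dict[str, float]:
    """Return the largest cluster's share of fans and its start-point share.

    Assumes `fans` is non-empty and contains every user named in `edges`.
    """
    parent = {f: f for f in fans}

    def find(x: str) -> str:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    has_incoming = set()
    for src, dst in edges:
        parent[find(src)] = find(dst)  # merge the two clusters
        has_incoming.add(dst)

    sizes = Counter(find(f) for f in parent)
    root, size = sizes.most_common(1)[0]
    members = [f for f in parent if find(f) == root]
    # Assumed definition: start points fanned without a recorded parent.
    starts = [f for f in members if f not in has_incoming]
    return {
        "share_in_largest": size / len(parent),
        "start_point_share": len(starts) / size,
    }
```

For example, with fans a through e and edges a→b, a→c, d→c, the chains started by a and d merge at c into one cluster holding 4 of the 5 fans, and half of that cluster's members are start points.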

Diffusion Chains on Facebook vs. Real Life

• The connected nature of Facebook (combined with easy methods of communication) makes long diffusion chains possible.

▪ In word-of-mouth studies of information propagation, most people hear of an idea from 1 person and pass it on to 1 other person

▪ Only 38% of paths involve at least four individuals (Brown & Reingen 1987)

▪ On Facebook, 86.4% of paths of Page diffusion involve at least 4 individuals

How are Long Diffusion Chains Created?

• Goal: test whether the Influentials theory or the Contagion theory is more applicable to Facebook

▪ Attempt to predict size of diffusion chains that a particular user will create using characteristics of the user and/or the Page.

▪ If size can be predicted, we can then identify the most influential users.
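The quantity being predicted, the longest chain a user starts, can be sketched as a longest-path computation over the diffusion graph. This assumes edges `(parent, child)` point forward in time, so the graph has no cycles, and it counts length in hops, so Alice → Bob is a chain of length 1. The function name and edge format are illustrative, not taken from the slides.

```python
from collections import defaultdict
from typing import Iterable


def max_chain_lengths(edges: Iterable[tuple[str, str]]) -> dict[str, int]:
    """Longest downstream chain, in hops, starting from each user.

    Assumes the diffusion graph is acyclic (edges follow time order).
    Uses recursion, so very long chains may need an iterative rewrite.
    """
    children: dict[str, list[str]] = defaultdict(list)
    nodes: set[str] = set()
    for parent, child in edges:
        children[parent].append(child)
        nodes.update((parent, child))

    memo: dict[str, int] = {}

    def depth(user: str) -> int:
        if user not in memo:
            # A user with no followers has depth 0: 1 + (-1).
            memo[user] = 1 + max(
                (depth(c) for c in children.get(user, ())), default=-1
            )
        return memo[user]

    return {user: depth(user) for user in nodes}
```

Because a node can have multiple parents, memoizing each user's depth avoids recomputing shared downstream chains.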

Data

▪ Data consists of all the associations (actor follower) for a representative selection of Pages.

▪ Pages were at least 40 days old and had at least 5,000 fans

Prediction Model

Response: max_chain_length

Predictors:

▪ gender

▪ log age

▪ log Facebook_age

▪ log feed_exposure (# friends who saw News Feed story)

▪ log friend_count

▪ log activity_count (wall posts + messages sent + photos added)

▪ log popularity (controls for News Feed exposure via Coefficient)

Method: zero-inflated negative binomial regression
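For readers unfamiliar with the method named above, the sketch below writes out the textbook zero-inflated negative binomial probability mass function: with probability `pi` the outcome is a structural zero, otherwise it is drawn from a negative binomial with mean `mu` and dispersion `alpha`. The parameter names and the mean/dispersion parameterization are choices made for this example. The slides do not give the authors' exact specification or software.

```python
import math


def nb_pmf(y: int, mu: float, alpha: float) -> float:
    """Negative binomial PMF with mean mu > 0 and variance mu + alpha * mu**2."""
    r = 1.0 / alpha            # size parameter
    p = r / (r + mu)           # per-trial success probability
    log_pmf = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(p) + y * math.log1p(-p)
    )
    return math.exp(log_pmf)


def zinb_pmf(y: int, mu: float, alpha: float, pi: float) -> float:
    """Zero-inflated NB: extra point mass pi at zero, NB otherwise."""
    base = (1.0 - pi) * nb_pmf(y, mu, alpha)
    return pi + base if y == 0 else base
```

Zero inflation suits a response like max_chain_length when many users start no chain at all, so zeros are more frequent than a plain negative binomial would predict.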

Results

• Only consistent coefficient is on feed_exposure (# friends who saw News Feed story).

▪ Coefficient hovers around 1: if News Feed publishes a user’s action to 1% more people, we expect a 1% longer max_chain

• Implies that friend_count is not realistically meaningful.

▪ After controlling for distribution and popularity, neither demographic characteristics nor number of Facebook friends seems to play an important role in the prediction of maximum diffusion chain length.
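The "1% more people, 1% longer chain" reading follows from the log-log form: if the expected chain length is proportional to feed_exposure raised to the coefficient, then scaling exposure by (1 + x) scales the expectation by (1 + x)^coefficient. The sketch assumes a log link on the count component and the log-transformed predictor listed in the model. The function name is illustrative.

```python
def relative_change(exposure_increase: float, beta: float) -> float:
    """Relative change in expected max chain length for a relative rise in exposure.

    Assumes E[max_chain_length] is proportional to feed_exposure ** beta.
    """
    return (1.0 + exposure_increase) ** beta - 1.0


if __name__ == "__main__":
    # With a coefficient of 1, a 1% rise in exposure implies a 1% longer chain.
    print(f"{relative_change(0.01, 1.0):.4f}")
    # A smaller coefficient would imply a less than proportional effect.
    print(f"{relative_change(0.01, 0.5):.4f}")
```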

Conclusions

• Facebook News Feed enables long-lasting chains of diffusion that may reach many more people than real-life diffusion chains.

• The Facebook network is very connected: ideas with good receptiveness will attract wide, long connected clusters.

• Long chains are not a function of Facebook age, activity, users’ demographics, or even # of friends: chain length is related only to exposure.

Contact

www.facebook.com/data

itamar@facebook.com

esun@facebook.com

(c) 2007 Facebook, Inc. or its licensors.  "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0
