Data Science at Facebook Itamar Rosenn Eric Sun 5/4/09

weigend_STATS252_ItamarRosenn_EricSun_2009.05.04.pptx - Slide 1


Data Science at Facebook

Itamar Rosenn, Eric Sun

5/4/09


Facebook Data

▪ Social Graph

▪ 200M+ active users

▪ 100M+ users come to site each day

▪ several hundred thousand new users join each day

▪ hundreds of dimensions per user (numerical, categorical, text)

▪ average user has over 120 friends

▪ friendships on Facebook span many different types of relationships

▪ Social Behavior

▪ Actions: users interact with hundreds of thousands of applications, on and off the site

▪ Interactions: users interact directly with each other via over 100 distinct types of events

▪ Social Content

▪ Photos, Status Updates, Platform Application Content, Events, Posts, Videos, Notes, etc...


Managing Data at Scale

▪ Solution: Hadoop + Hive

▪ HDFS / Hadoop (MapReduce in Java)

▪ MetaStore (metadata management)

▪ HiveQL (SQL-like query language on top of Hadoop + MetaStore)

▪ Data Scale

▪ More than 1PB raw capacity in largest HDFS / Hadoop cluster

▪ Over 2TB uncompressed data collected each day

▪ Dozens of TB worth of data read / written each day via Hadoop + Hive


Data Science - What We Do

Behavioral Analysis

▪ Product Health Metrics

▪ Launch Evaluations

▪ Growth Modeling

▪ User Churn Modeling

▪ Production Incentives

▪ Content Diffusion

Data-Driven Systems

▪ Ad CTR Prediction

▪ PYMK

▪ Search Ranking

▪ Highlights

Data Infrastructure

▪ Hive

▪ Hadoop


Data Science – Who We Are

Dennis Decoste Roddy Lindsay Alex Smith

Thomas Lento

Venky Iyer

Ravi Grover Cameron Marlow

Lee Byron Itamar Rosenn

Danny Ferrante

James Mayfield


Maintained Relationships on Facebook

▪ Question: is Facebook increasing the size of people’s personal networks?

▪ Task:

▪ the types of relationships people maintain on the site

▪ the relative size of these groups


Types of Relationships

People you know

▪ Facebook friends = people you’ve met at some point in life

▪ Researchers have estimated this number to be somewhere between 300 and 3,000. (Gladwell, Killworth)

Communication network

▪ Individuals with whom you communicate on a regular basis

▪ Includes your core support network, which may be as low as 3 people

▪ Kossinets and Watts observed communication network size of 10-20

Maintained relationships

▪ Social technologies like Newsfeed or RSS readers allow you to keep up with the things that people you know are doing

▪ This information consumption is a form of relationship management, as it can lead to direct communication in the future


Measuring Network Size on Facebook

Examine the relationships of a random user sample over 30 days on the site. We defined networks in 4 ways:

All friends

▪ The largest representation of a person’s network is the set of people they have verified as friends.

Reciprocal communication

▪ The number of friends with whom the user had reciprocal exchanges via messages, wall posts, or comments. This provides a measure of the user’s core network.

One-way communication

▪ The number of friends to whom the user has reached out via messages, wall posts, or comments.

Maintained relationships

▪ The number of friends whose Newsfeed stories the user has clicked on, or whose profiles the user has visited at least twice
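The four definitions above can be sketched as set operations over one user's activity in the observation window. This is a minimal illustration, not the authors' pipeline: the function name `network_sizes`, the input shapes, and the toy data are all hypothetical, and the sketch reads "at least twice" as applying only to profile visits.

```python
from collections import defaultdict

def network_sizes(user, friends, messages, feed_clicks, profile_visits):
    """Return the four network sizes for one user (hypothetical input shapes).

    friends:        set of friend ids
    messages:       list of (sender, recipient) pairs covering messages,
                    wall posts, and comments
    feed_clicks:    list of (viewer, friend) pairs for clicked feed stories
    profile_visits: list of (viewer, friend) pairs, one entry per visit
    """
    # Friends the user reached out to, and friends who reached out to the user.
    sent_to = {r for s, r in messages if s == user and r in friends}
    received_from = {s for s, r in messages if r == user and s in friends}

    # Count profile visits per friend; two or more visits qualify.
    visits = defaultdict(int)
    for viewer, friend in profile_visits:
        if viewer == user and friend in friends:
            visits[friend] += 1
    clicked = {f for v, f in feed_clicks if v == user and f in friends}
    maintained = clicked | {f for f, n in visits.items() if n >= 2}

    return {
        "all_friends": len(friends),
        "reciprocal": len(sent_to & received_from),
        "one_way": len(sent_to),
        "maintained": len(maintained),
    }

# Toy data, invented for illustration only.
sizes = network_sizes(
    user="a",
    friends={"b", "c", "d", "e"},
    messages=[("a", "b"), ("b", "a"), ("a", "c")],
    feed_clicks=[("a", "d")],
    profile_visits=[("a", "e"), ("a", "e"), ("a", "c")],
)
print(sizes)
```

In this toy case the reciprocal network is a subset of the one-way network, which matches the way the slide orders the definitions from core to broader networks.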


Findings

▪ Across the range of friend counts, a user passively engages with 2 to 2.5 times as many people as she communicates with directly


Systemic Effects

▪ The stark contrast between these networks shows the effect of technologies like Newsfeed.


Content Production among New Users

▪ Mission: Give people the power to share and make the world more open and connected.

▪ Question: What mechanisms lead Facebook newcomers to share content on the site?


Content Production

In new users’ first two weeks:

▪ 45% upload a photo

▪ 41% use a 3rd-party app

▪ 30% send a private message

▪ 27% compose a status update

▪ 22% write on a friend’s wall


Production Incentives Hypotheses

▪ H1: Newcomers who receive more feedback on their initial content will go on to contribute more content.

▪ H2: Newcomers whose initial content receives greater distribution will go on to produce more content.

▪ H3: Social learning: Newcomers whose friends share more content will go on to produce more content themselves.

▪ H4: Singling out: Newcomers who are singled out in content that their friends produce will go on to produce more content themselves.


Method

Quantitative

▪ Selected two cohorts: Nov. 5, 2007 (N = 347,403) and Mar. 3, 2008 (N = 254,603)

▪ Observed activity in their first two weeks

▪ Predicted how many photos they would upload between their third and fifteenth weeks on Facebook

Qualitative

▪ 40-minute semi-structured interviews with seven new users

▪ Recorded audio/video and screen

▪ Asked about typical uses of Facebook, content production, social norms, privacy


Features

Independent Variables

H1. Feedback

▪ Comments received

H2. Distribution

▪ # of times content was viewed in Newsfeed

▪ # of friends who viewed content in Newsfeed

H3. Social Learning

▪ Number of friends’ photos seen

H4. Singling Out

▪ Number of times tagged

Controls

▪ Age

▪ Gender

▪ Number of friends

▪ Total pages viewed

▪ Initial engagement with photos:

▪ # of photos uploaded

▪ # of photos viewed

▪ Photo tags created

▪ Photo comments written


Results

Model 1 – Early Uploaders (Intercept 1.2)

Controls                     Coefficient   % change from int.
Age (in years)               -0.01         -1.0% ***
Male (0/1)                   0.48          +39.3% ***
Female (0/1)                 1.21          +131.2% ***
Pages viewed +               0.24          +18.4% ***
Photo pages viewed +         2.80          +597.4% ***
Photo comments made          0.15          +11.2% ***
Photo tags created           0.10          +6.9% ***
Photos uploaded              0.30          +22.8% ***

Independent Vars             Coefficient   % change from int.
Comments received (0/1)      0.09          +6.2% ***
Photo views received         0.04          +2.6% ***
Photo stories seen           0.09          +6.1% ***
Photo tags received (0/1)    0.03          +2.1% (ns)

Model 2 - Everyone (Intercept 1.9)

Controls                     Coefficient   % change from int.
Age (in years)               -0.01         -0.7% ***
Male (0/1)                   0.84          +79.6% ***
Female (0/1)                 1.43          +169.8% ***
Pages viewed +               -0.02         -1.6% ***
Photo pages viewed +         2.35          +408.3% ***
Photo comments made          0.24          +17.7% ***
Photo tags created           0.17          +12.6% ***
Early-uploader (0/1)         0.39          +30.6% ***

Independent Vars                                  Coefficient   % change from int.
Photo stories seen X early-uploader               0.15          +10.7% ***
Photo stories seen X non-early-uploader           0.03          +2.2% ***
Photo tags received X early-uploader (0/1)        -0.05         -3.6% (ns)
Photo tags received X non-early-uploader (0/1)    0.10          +7.2% ***
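The slide does not say how "% change from int." is derived from each coefficient. The reported pairs look consistent with (2^β − 1) × 100, which would suggest a log2-transformed response; that is an inference from the numbers, not something the slide states. A short sketch to check the reading against a few reported rows (the function name `pct_change_log2` is hypothetical, and small gaps are presumably due to the coefficients being rounded to two decimals):

```python
def pct_change_log2(beta):
    """Percent change implied by beta, ASSUMING a log2-transformed response."""
    return (2.0 ** beta - 1.0) * 100.0

# (coefficient, percent change as reported on the slide)
reported = [
    (0.48, 39.3),    # Male, Model 1
    (1.21, 131.2),   # Female, Model 1
    (0.30, 22.8),    # Photos uploaded, Model 1
    (0.84, 79.6),    # Male, Model 2
    (-0.05, -3.6),   # Photo tags received X early-uploader, Model 2
]

for beta, slide_pct in reported:
    implied = pct_change_log2(beta)
    print(f"beta={beta:+.2f}  implied={implied:+7.1f}%  slide={slide_pct:+7.1f}%")
```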


Summary of Results

Hypothesis             Early-uploaders   Non-early-uploaders
H1. Feedback           Support           N/A
H2. Distribution       Modest Support    N/A
H3. Social learning    Support           Support
H4. Singling out       No Support        Support

▪ We learn from our friends. If our friends engage with photos, we do too. Social learning is the main lever for content production.

▪ For new users already uploading photos, feedback is associated with increased content production, and distribution is marginally important.


Modeling Contagion Through Newsfeed

▪ How do ideas spread through a social network?

▪ Use Facebook Pages to model diffusion patterns

▪ Compare results with existing models of diffusion

▪ Show how Facebook advertising campaigns may be more successful than off-Facebook advertising campaigns due to Facebook’s inter-connectedness and diffusion properties.

▪ Note: Research based on “old” Facebook (pre-March 2009)

▪ Still relevant: first empirical analysis of large-scale collisions of short chains


Theory of the Influentials

▪ Old Theory: it’s all about the “influentials” (Malcolm Gladwell, etc.)

▪ Idea: reach a tiny group of Influential people, and you’ll reach everyone else through them for free

▪ $1+ billion/year spent on word-of-mouth campaigns targeting Influentials; amount is growing 36% per year (MarketingVOX)


Contagion Theory

▪ Duncan Watts: Anyone can be an influencer.

▪ Ideas don’t spread via influentials. Instead, ideas spread like viruses: either you’re susceptible, or you’re not

▪ Success depends not on how persuasive the early adopter(s) are, but whether everyone else is easily persuaded.


How Do Ideas Spread on Facebook?

▪ News Feed allows for efficient diffusion of ideas

▪ Facebook’s Pages product is one of the most viral features of the site.

▪ People may see multiple friends fan a Page in a single Feed story, so a node in the graph can have multiple parents

[Diagram: Chain of Length 1. Alice fans a Page. Bob sees Alice’s action on his News Feed; Bob fans the Page as well. Charlie sees Alice’s action on his News Feed; Charlie fans the Page as well.]


Large-Scale Result: Large Connected Trees of Diffusion

▪ Diffusion chain for Stripy, a cartoon popular in Bosnia (blue) & Slovenia (yellow). Croatia (green) has yet to find its connecting bridge.


Large Connected Clusters

▪ Often, the vast majority of fans can be connected into one cluster; sometimes over 90% of the fans for one particular Page can be connected.

▪ Example: On 8/21/08, 71,090 of 96,922 fans of the Nastia Liukin Page (73.3%) were in one connected cluster.

▪ For Pages created after 7/1/08, the median Page had 69.48% of its Fans in one connected cluster as of 8/19/08.


How Do These Large Clusters Come About?

• Are these large clusters started by “one guy”?

▪ No: across all Pages of meaningful size (>1000 Fans), 14.8% of the Fans in the biggest cluster were “start points.”

▪ The variability in this percentage becomes very small as # fans increases

▪ The average node in the biggest cluster is connected to 2.899 others.

• Large clusters are formed when many long chains of diffusion merge together.
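The two statistics on this slide, the share of fans in the biggest connected cluster and the share of that cluster made up of "start points", can be sketched with a breadth-first search over the fan graph. This is an illustration under stated assumptions, not the authors' code: `largest_cluster_stats` is a hypothetical name, edges are assumed to be (actor, follower) pairs whose endpoints are all fans, and a start point is taken to be a fan who never appears as a follower.

```python
from collections import defaultdict, deque

def largest_cluster_stats(fans, edges):
    """fans: non-empty collection of fan ids; edges: list of (actor, follower)."""
    fans = set(fans)
    neighbors = defaultdict(set)   # undirected adjacency for connectivity
    has_parent = set()             # fans who followed someone else's action
    for actor, follower in edges:
        neighbors[actor].add(follower)
        neighbors[follower].add(actor)
        has_parent.add(follower)

    seen, largest = set(), set()
    for fan in fans:
        if fan in seen:
            continue
        # Breadth-first search collects the connected cluster containing `fan`.
        component, queue = {fan}, deque([fan])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(largest):
            largest = component

    start_points = largest - has_parent
    return {
        "cluster_share": len(largest) / len(fans),
        "start_point_share": len(start_points) / len(largest),
    }

# Toy Page, invented for illustration: chains starting at "a" and "d" merge at "c".
stats = largest_cluster_stats(
    fans={"a", "b", "c", "d", "e", "f"},
    edges=[("a", "b"), ("b", "c"), ("d", "c"), ("e", "f")],
)
print(stats)
```

In the toy data, the biggest cluster has two start points out of four members because two independent chains merged at a fan with multiple parents, which is the mechanism the slide describes.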


Diffusion Chains on Facebook vs. Real Life

• The connected nature of Facebook (combined with easy methods of communication) makes long diffusion chains possible.

▪ In word-of-mouth studies of information propagation, most people hear of an idea from 1 person and pass it on to 1 other person

▪ Only 38% of paths involve at least four individuals (Brown & Reingen 1987)

▪ On Facebook, 86.4% of paths of Page diffusion involve at least 4 individuals


How are Long Diffusion Chains Created?

• Goal: test whether the Influentials theory or the Contagion theory is more applicable to Facebook

▪ Attempt to predict size of diffusion chains that a particular user will create using characteristics of the user and/or the Page.

▪ If size can be predicted, we can then identify the most influential users.


Data

▪ Data consists of all the associations (actor → follower) for a representative selection of Pages.

▪ Pages were at least 40 days old and had at least 5,000 fans


Prediction Model

Response: max_chain_length

Predictors:

▪ gender

▪ log age

▪ log Facebook_age

▪ log feed_exposure (# friends who saw News Feed story)

▪ log friend_count

▪ log activity_count (wall posts + messages sent + photos added)

▪ log popularity (controls for News Feed exposure via Coefficient)

Method: zero-inflated negative binomial regression
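The slides do not define how max_chain_length is computed. One plausible reading, sketched below, is the longest downstream path (counted in edges) that starts at a user in the (actor, follower) graph, so that Alice in the "Chain of Length 1" example scores 1. The function name `max_chain_lengths` and the toy edges are hypothetical, and the sketch assumes the graph is acyclic, which holds if followers always fan a Page after the actors they followed.

```python
from collections import defaultdict

def max_chain_lengths(edges):
    """Longest downstream chain per user, ASSUMING an acyclic (actor, follower) graph."""
    children = defaultdict(list)
    nodes = set()
    for actor, follower in edges:
        children[actor].append(follower)
        nodes.update((actor, follower))

    memo = {}

    def longest_from(node):
        # A user with no followers has chain length 0; otherwise one step
        # plus the longest chain of any follower. Recursion is fine for a
        # sketch; very long chains would need an iterative traversal.
        if node not in memo:
            memo[node] = max(
                (1 + longest_from(child) for child in children.get(node, ())),
                default=0,
            )
        return memo[node]

    return {node: longest_from(node) for node in nodes}

# Toy diffusion graph, invented for illustration.
lengths = max_chain_lengths(
    [("alice", "bob"), ("alice", "charlie"), ("bob", "dana")]
)
print(lengths)
```

Under this reading, every fan who triggers no followers gets a 0, so the response would contain many zeros, which is in keeping with the choice of a zero-inflated count model.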


Results

• Only consistent coefficient is on feed_exposure (# friends who saw News Feed story).

▪ Coefficient hovers around 1: if News Feed publishes a user’s action to 1% more people, we expect a 1% longer max_chain_length

• Implies that friend_count is not realistically meaningful.

▪ After controlling for distribution and popularity, neither demographic characteristics nor number of Facebook friends seems to play an important role in the prediction of maximum diffusion chain length.
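The "1% more people, 1% longer chain" statement follows from how a coefficient on a log-scale predictor behaves in a count model with a log link: if log E[y] = ... + β · log(x), then E[y] is proportional to x^β. The sketch below assumes that link applies to the count component of the model, which the slides do not spell out; `expected_chain_change` is a hypothetical helper name.

```python
def expected_chain_change(beta, exposure_increase):
    """Relative change in expected max_chain_length when feed_exposure grows
    by `exposure_increase` (0.01 means 1%), ASSUMING log E[y] is linear in
    log(feed_exposure) with slope beta."""
    return (1.0 + exposure_increase) ** beta - 1.0

# With beta near 1, a 1% rise in exposure gives roughly a 1% longer chain.
print(f"{expected_chain_change(1.0, 0.01):.4%}")
# A smaller hypothetical coefficient would give a less than proportional change.
print(f"{expected_chain_change(0.5, 0.01):.4%}")
```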


Conclusions

• Facebook News Feed enables long-lasting chains of diffusion that may reach many more people than real-life diffusion chains.

• The Facebook network is very connected: ideas with good receptiveness will attract wide, long connected clusters.

• Long chains are not a function of Facebook age, activity, users’ demographics, or even # of friends: they are related only to exposure.


(c) 2007 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.