Statistical Models for Massive Web Data. Deepak Agarwal, LinkedIn, USA; Director, Applied Relevance Science (ARS). CATS Big Data Panel, October 11, 2012. Hosted by National Academy of Sciences, Washington D.C., USA
Nov 17, 2014

Talk at Committee on Theoretical and Applied Statistics (CATS) hosted by National Academy of Sciences
Transcript
Page 1: Statistical Models for Massive Web Data

Statistical Models for Massive Web Data

Deepak Agarwal, LinkedIn, USA

Director, Applied Relevance Science (ARS)

CATS Big Data Panel, October 11, 2012

Hosted by National Academy of Sciences

Washington D.C., USA

Page 2: Statistical Models for Massive Web Data

NRC BIG DATA PANEL, AGARWAL, 2012

Disclaimer

The opinions expressed here are mine and in no way represent the official position of LinkedIn

The case studies presented today are from work done while I was at Yahoo!

Page 3: Statistical Models for Massive Web Data

Big Data Applications in Business

Big Data: competitive advantage, innovation, reduced uncertainty in decision making

High-frequency data: large number of heterogeneous transactions per unit time

Web visits, financial trading, credit card transactions, telephone calls, packet flows in an IP network,…

I will focus on statistical modeling for one such data source

– User visits to websites

Page 4: Statistical Models for Massive Web Data

Example 1: Yahoo! front page

Recommend content links (out of 30-40, editorially programmed)

4 slots exposed, F1 has maximum exposure

Routes traffic to other Y! properties

[Figure: Today module on the Yahoo! front page, showing slots F1, F2, F3, F4 and the NEWS module]

Page 5: Statistical Models for Massive Web Data

LinkedIn News

Page 6: Statistical Models for Massive Web Data

LinkedIn Ads

Page 7: Statistical Models for Massive Web Data

Data Generation

[Figure: data-generation flow with components: http request, NEWS, ADS, User information, Ranking Service, Model Updates]

Page 8: Statistical Models for Massive Web Data

DATA

User i visits, with (user, context) covariates x_it (profile information, device id, first-degree connections, browse information, …)

Select item j, with item covariates x_j (keywords, content categories, ...)

(i, j): response y_ij (click/no-click)

Page 9: Statistical Models for Massive Web Data

Statistical Problem

Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest

Examples of utility functions

– Click-rates (CTR)

– Share-rates (CTR * [Share|Click])

– Revenue per page-view = CTR * bid (more complex due to second-price auction)

CTR is a fundamental measure that opens the door to a more principled approach to rank items

Converge rapidly to maximum-utility items

– Sequential decision-making process

– Models: help cope with data sparseness (curse of dimensionality)

Page 10: Statistical Models for Massive Web Data

Illustrate with Y! front Page Application

Simplify: Maximize CTR on first slot (F1)

Article pool

– Editorially selected for high quality and brand image

– Few articles in the pool, but the article pool is dynamic

We want to provide personalized recommendations

– Users with many prior visits see recommendations “tailored” to their taste, others see the best for the “group” they belong to

Page 11: Statistical Models for Massive Web Data

Types of user covariates

Demographics, geo

– Not useful in front-page application

Browse behavior: activity on Y! network (x_it)

– Previous visits to property, search, ad views, clicks, ...

– This is useful for the front-page application

Latent user factors based on previous clicks on the module (u_it)

– Useful for active module users, obtained via factor models

Page 12: Statistical Models for Massive Web Data

Approach: Online + Offline

Offline computation

– Intensive computations done infrequently (once a day/week) to update parameters that are less time-sensitive

Online computation

– Lightweight computations done frequently (once every 5-10 minutes) to update parameters that are time-sensitive

– Adaptive experiments (explore-exploit) also done online

Page 13: Statistical Models for Massive Web Data

Online computation: per-item online logistic regression

For item j, the state-space model is

Item coefficients are updated online via a Kalman filter (discounting approach of West and Harrison)

– Item covariates are used to initialize coefficients at epoch zero
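The equation on this slide did not survive in the transcript. The sketch below is one form consistent with the symbols used elsewhere in the talk (user covariates x_it, latent user factors u_it, item coefficients (δ, β)). The random-walk evolution, the innovation term ω_jt, and the discount factor ρ are assumptions in the style of West and Harrison's dynamic models, not a reproduction of the slide.

```latex
% Observation equation for item j (assumed form)
y_{ijt} \sim \mathrm{Bernoulli}(p_{ijt}), \qquad
\operatorname{logit}(p_{ijt}) = x_{it}'\,\delta_{jt} + u_{it}'\,\beta_{jt}

% State evolution: random walk on the item coefficients (assumed)
(\delta_{jt}, \beta_{jt}) = (\delta_{j,t-1}, \beta_{j,t-1}) + \omega_{jt}, \qquad
\omega_{jt} \sim N(0, W_{jt})

% Discounting: instead of specifying W_{jt}, inflate the previous posterior
% covariance C_{j,t-1} by a discount factor \rho \in (0, 1] to get the prior R_{jt}
R_{jt} = C_{j,t-1} / \rho
```

A smaller ρ forgets old data faster, which suits items whose popularity changes quickly.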

Page 14: Statistical Models for Massive Web Data

Closer Look at online model

Different components of

Page 15: Statistical Models for Massive Web Data

Online Adaptive Experimentation (Explore/Exploit)

Three schemes (all work reasonably well for the front page application)

– Epsilon-greedy: show the article with maximum posterior mean, except that with a small probability epsilon an article is chosen at random.

– Upper confidence bound (UCB): show the article with maximum score, where score = posterior mean + k × posterior std.

– Thompson sampling: draw a sample (δ, β) from the posterior to compute article CTR and show the article with maximum CTR.
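The three schemes can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: it assumes an independent Beta(1 + clicks, 1 + non-clicks) posterior on each article's CTR instead of the logistic-model posterior over (δ, β), and the function names are hypothetical.

```python
import math
import random


def beta_posterior(clicks, views):
    """Posterior mean and std of each article's CTR under a Beta(1, 1) prior.

    Assumption: a Beta-Bernoulli model per article stands in for the
    talk's per-item logistic regression posterior.
    """
    means, stds = [], []
    for c, v in zip(clicks, views):
        a, b = 1 + c, 1 + v - c
        means.append(a / (a + b))
        stds.append(math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1))))
    return means, stds


def argmax(scores):
    """Index of the largest score (first one on ties)."""
    return max(range(len(scores)), key=lambda j: scores[j])


def epsilon_greedy(post_mean, epsilon, rng):
    """Random article with probability epsilon, else max posterior mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(post_mean))
    return argmax(post_mean)


def ucb(post_mean, post_std, k):
    """Article with the largest posterior mean + k * posterior std."""
    return argmax([m + k * s for m, s in zip(post_mean, post_std)])


def thompson(clicks, views, rng):
    """Draw one CTR per article from its posterior, show the largest draw."""
    draws = [rng.betavariate(1 + c, 1 + v - c) for c, v in zip(clicks, views)]
    return argmax(draws)
```

With `clicks = [0, 30]` and `views = [2, 100]`, the barely shown first article has the lower posterior mean but the much larger posterior std, so `ucb` with a large enough `k` prefers it while `k = 0` does not; that is the exploration effect all three schemes aim for.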

Page 16: Statistical Models for Massive Web Data

Offline computation

Computing user latent factors and item coefficient prior

– This is computed offline once a day using retrospective (user, item) interaction data for the last X days (X = 30 in our case)

– Computations are done on Hadoop

Page 17: Statistical Models for Massive Web Data

Offline: Regression based Latent Factor Model

Annotations on the model equation: regression weight matrix; user/item-specific correction terms (learnt from data)

Page 18: Statistical Models for Massive Web Data

Role of shrinkage (consider Gaussian for simplicity)

For new user/article, factor estimates based on covariates

$u_{new} = G\, x_{new}^{user}, \quad v_{new} = D\, x_{new}^{item}$

For old user, factor estimates

Linear combination of prior regression function and user feedback on items
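The "linear combination" for an existing user can be written out explicitly in the Gaussian case. The sketch below assumes a response y_ij ~ N(u_i' v_j, σ²), a prior u_i ~ N(G x_i, σ_u² I), and item factors v_j held fixed; σ², σ_u², and the index set J_i (items user i has responded to) are notation introduced here for illustration.

```latex
E[u_i \mid \text{rest}] =
\Big( \tfrac{1}{\sigma_u^2} I + \tfrac{1}{\sigma^2} \sum_{j \in J_i} v_j v_j' \Big)^{-1}
\Big( \tfrac{1}{\sigma_u^2}\, G x_i + \tfrac{1}{\sigma^2} \sum_{j \in J_i} y_{ij}\, v_j \Big)
```

With no feedback (J_i empty) this reduces to the regression prediction G x_i, which is the new-user case. As feedback accumulates, the second term in each bracket dominates and the estimate moves toward the least-squares fit of the user's own responses.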

Page 19: Statistical Models for Massive Web Data

Estimating the Regression function via EM

Maximize

$\int \prod_{ij} f(\mathrm{Data}_{ij}, u_i, v_j) \, \prod_i g(u_i, G) \, \prod_j g(v_j, D) \, \prod_i du_i \, \prod_j dv_j$

Integral cannot be computed in closed form, approximated by Monte Carlo using Gibbs Sampling

For logistic, we use adaptive rejection sampling (ARS; Gilks and Wild) to sample the latent factors within the Gibbs sampler

Page 20: Statistical Models for Massive Web Data

Scaling to large data on Hadoop

Randomly partition by users in the Map

Run separate model on each partition

– Care taken to initialize each partition model with same values, constraints on factors ensure identifiability within each partition

Create ensembles by using different user partitions, average across ensembles to obtain estimates of user factors and regression functions

– Estimates of user factors in ensembles are uncorrelated, so averaging reduces variance
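The partition-and-average idea can be illustrated with a toy, single-machine sketch. It is not the Hadoop job from the talk: a scalar least-squares slope stands in for the per-partition latent factor model, and every name here (`fit_slope`, `one_run`, `ensemble`) is hypothetical.

```python
import random
import statistics


def fit_slope(rows):
    """Least-squares slope through the origin for a list of (x, y) pairs."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / sxx


def one_run(data, n_partitions, rng):
    """Randomly partition users, fit one model per partition.

    Returns a dict mapping each user to the estimate produced by the
    partition that user landed in.
    """
    users = list(data)
    rng.shuffle(users)
    per_user = {}
    for p in range(n_partitions):
        part = users[p::n_partitions]  # every n-th user after shuffling
        g_hat = fit_slope([data[u] for u in part])
        for u in part:
            per_user[u] = g_hat
    return per_user


def ensemble(data, n_partitions, n_runs, seed=0):
    """Repeat with different random partitions and average per user."""
    rng = random.Random(seed)
    runs = [one_run(data, n_partitions, rng) for _ in range(n_runs)]
    return {u: statistics.fmean([r[u] for r in runs]) for u in data}
```

Here `data` maps a user id to one `(x, y)` observation. Each run gives every user an estimate from a different random partition, and averaging those estimates across runs is the variance-reduction step.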

Page 21: Statistical Models for Massive Web Data

Data Example

1B events, 8M users, 6K articles

Offline training produced user factor u_i

Baseline: online logistic without u_i

– Covariate-only online logistic model

Overall click lift: 9.7%

– Heavy users (> 10 clicks last month): 26%

– Cold users (not seen in the past): 3%

Page 22: Statistical Models for Massive Web Data

Click-lift for heavy users

[Figure: CTR lift relative to the covariate-only logistic model]

Page 23: Statistical Models for Massive Web Data

Computational Advertising: Matching ads to opportunities

Advertisers

Ad Network

Ads

Page

Pick best ads

User

Publisher

Examples: Yahoo, Google, MSN, ad exchanges (network of “networks”), …

Opportunity

Page 24: Statistical Models for Massive Web Data

Ad exchange (RightMedia) [Agarwal et al., KDD 10]

Advertisers participate in different ways

– CPM (pay per ad-view)

– CPC (pay per click)

– CPA (pay per conversion)

To conduct an auction, normalize across pricing types

– Compute eCPM (expected CPM)

Click-based: eCPM = click-rate * CPC

Conversion-based: eCPM = conv-rate * CPA
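The normalization can be sketched as a small function. This is an illustration under stated assumptions, not RightMedia's logic: the dict keys (`pricing`, `bid`, `click_rate`, `conv_rate`) are hypothetical, rates are taken to be per ad-view, and a factor of 1000 is applied to click- and conversion-based ads so they are on the same per-thousand-views scale as a CPM bid (the slide's formulas omit this constant).

```python
def ecpm(ad):
    """Expected revenue per 1000 ad-views for one ad.

    Assumptions: `bid` is the advertiser's price in its own pricing type,
    and `click_rate` / `conv_rate` are predicted probabilities per ad-view.
    """
    pricing = ad["pricing"]
    if pricing == "CPM":
        return ad["bid"]  # already quoted per 1000 views
    if pricing == "CPC":
        return 1000.0 * ad["click_rate"] * ad["bid"]
    if pricing == "CPA":
        return 1000.0 * ad["conv_rate"] * ad["bid"]
    raise ValueError(f"unknown pricing type: {pricing!r}")


def rank_ads(ads):
    """Order ads of mixed pricing types by eCPM, highest first."""
    return sorted(ads, key=ecpm, reverse=True)
```

For example, a CPC ad bidding 0.50 with a predicted click-rate of 0.01 has eCPM 5.0 and outranks a CPM ad bidding 2.0. The quality of the ranking therefore rests on the click-rate and conversion-rate estimates, which is where the statistical models come in.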

Similar strategy of computing offline and online components

– Process 90B records for each model fit

– Model has hundreds of millions of parameters

– Model fully deployed on RightMedia today

Page 25: Statistical Models for Massive Web Data

Summary

Estimating interactions in high-dimensional sparse data important in web applications

Scaling such models to Big Data is a challenging statistical problem

Combining offline + online modeling with explore/exploit a good practical strategy

Page 26: Statistical Models for Massive Web Data

Some Challenges

Very high-dimensional modeling with very large and noisy data

– Few categorical variables with large number of levels interacting with each other to produce response

– Scalability

Designing sequential experiments

– Multi-armed bandits are back in a big way

Data fusion

– From multiple and disparate sources

Availability of data, and the ability to run experiments, for researchers