Statistical Models for Massive Web Data

Deepak Agarwal, LinkedIn, USA

Director, Applied Relevance Science (ARS)

CATS Big Data Panel, October 11, 2012

Hosted by National Academy of Sciences

Washington D.C., USA

NRC BIG DATA PANEL, AGARWAL, 2012

Disclaimer

The opinions expressed here are mine and in no way represent the official position of LinkedIn

The case studies presented today are work done while I was at Yahoo!


Big Data Applications in Business

Big Data: competitive advantage, innovation, reduced uncertainty in decision making

High Frequency Data

– Large number of heterogeneous transactions per unit time

– Web visits, financial trading, credit card transactions, telephone calls, packet flows in an IP network, …

I will focus on statistical modeling for one such data source

– User visits to websites


Example 1: Yahoo! front page

Recommend content links (out of 30-40, editorially programmed)

4 slots exposed, F1 has maximum exposure

Routes traffic to other Y! properties

[Figure: the Today module on the Yahoo! front page, with slots F1, F2, F3, F4 and the NEWS section labeled]


LinkedIn News


LinkedIn Ads


Data Generation

[Diagram components: http request, user information, Ranking Service (serving NEWS and ADS), Model Updates]


DATA

User i visits in some CONTEXT, with (user, context) covariates $x_{it}$

(profile information, device id, first degree connections, browse information, …)

Select item j, with item covariates $x_j$

(keywords, content categories, ...)

(i, j): response $y_{ij}$ (click/no-click)


Statistical Problem

Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest

Examples of utility functions

– Click-rates (CTR)

– Share-rates (CTR * [Share|Click])

– Revenue per page-view = CTR * bid (more complex due to second price auction)

CTR is a fundamental measure that opens the door to a more principled approach to rank items

Converge rapidly to maximum utility items

– Sequential decision making process

– Models: help cope with data sparseness (curse of dimensionality)


Illustrate with Y! front Page Application

Simplify: Maximize CTR on first slot (F1)

Article Pool

– Editorially selected for high quality and brand image

– Few articles in the pool, but the article pool is dynamic

We want to provide personalized recommendations

– Users with many prior visits see recommendations “tailored” to their taste, others see the best for the “group” they belong to


Types of user covariates

Demographics, geo:

– Not useful in front-page application

Browse behavior: activity on Y! network ($x_{it}$)

– Previous visits to property, search, ad views, clicks, ...

– This is useful for the front-page application

Latent user factors based on previous clicks on the module ($u_{it}$)

– Useful for active module users, obtained via factor models


Approach: Online + Offline

Offline computation

– Intensive computations done infrequently (once a day/week) to update parameters that are less time-sensitive

Online computation

– Lightweight computations done frequently (once every 5-10 minutes) to update parameters that are time-sensitive

– Adaptive experiments (explore-exploit) also done online


Online computation: per-item online logistic regression

For item j, the state-space model is

Item coefficients are updated online via a Kalman filter (discounting approach of West and Harrison)

– Item covariates are used to initialize coefficients at epoch zero
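The per-item online update can be illustrated with a small sketch. The state-space equation itself did not survive on the slide, so the code below assumes a generic dynamic logistic regression: the item coefficients carry a Gaussian approximation (mean and covariance), the covariance is inflated by a discount factor at the start of each epoch (in the spirit of the West and Harrison discounting approach), and each click/no-click observation triggers an extended-Kalman-style update. The class name `OnlineItemLogistic`, the prior variance and the discount value are hypothetical choices for illustration, not the deployed model.

```python
import math


def sigmoid(z):
    # Clamp the score to avoid overflow in exp for extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


class OnlineItemLogistic:
    """Hypothetical per-item model: Gaussian approximation N(mean, cov)
    to the posterior of the item's logistic regression coefficients."""

    def __init__(self, prior_mean, prior_var=1.0, discount=0.95):
        d = len(prior_mean)
        # In the talk, the prior mean would come from item covariates (epoch zero).
        self.mean = list(prior_mean)
        self.cov = [[prior_var if r == c else 0.0 for c in range(d)]
                    for r in range(d)]
        self.discount = discount  # assumed value; 1.0 means no forgetting

    def start_epoch(self):
        # Discounting step: inflate the covariance so older data count less.
        self.cov = [[v / self.discount for v in row] for row in self.cov]

    def predict(self, x):
        # Predicted click probability at the current posterior mean.
        return sigmoid(sum(m * xi for m, xi in zip(self.mean, x)))

    def update(self, x, y):
        # One observation: x = covariate vector, y = 1.0 (click) or 0.0 (no click).
        d = len(x)
        p = self.predict(x)
        w = max(p * (1.0 - p), 1e-6)          # local curvature of the likelihood
        cx = [sum(self.cov[r][k] * x[k] for k in range(d)) for r in range(d)]
        s = sum(x[k] * cx[k] for k in range(d))
        denom = 1.0 + w * s
        # Linearized (extended Kalman) mean and covariance updates.
        self.mean = [self.mean[r] + cx[r] * (y - p) / denom for r in range(d)]
        self.cov = [[self.cov[r][c] - w * cx[r] * cx[c] / denom
                     for c in range(d)] for r in range(d)]
```

A typical loop would call `start_epoch()` once per 5-10 minute batch and `update(x, y)` for every logged impression in that batch.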


Closer Look at online model

Different components of


Online Adaptive Experimentation (Explore/Exploit)

Three schemes (all work reasonably well for the front page application)

– epsilon-greedy: Show the article with maximum posterior mean, except that with a small probability epsilon an article is chosen at random.

– Upper confidence bound (UCB): Show the article with maximum score, where score = post-mean + k * post-std

– Thompson sampling: Draw a sample (δ,β) from posterior to compute article CTR and show article with maximum CTR
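The three schemes above can be sketched in a few lines. This is a simplified illustration, not the production logic: it assumes each article's CTR posterior is summarized by a mean and a standard deviation, and the Thompson sampling function draws the CTR directly from a Gaussian approximation instead of drawing the coefficients (δ, β) and computing the CTR from them. All function names are hypothetical.

```python
import random


def epsilon_greedy(post_mean, epsilon, rng):
    # With probability epsilon explore a random article, otherwise exploit.
    if rng.random() < epsilon:
        return rng.randrange(len(post_mean))
    return max(range(len(post_mean)), key=lambda j: post_mean[j])


def ucb(post_mean, post_std, k):
    # Score = posterior mean + k * posterior standard deviation.
    return max(range(len(post_mean)),
               key=lambda j: post_mean[j] + k * post_std[j])


def thompson(post_mean, post_std, rng):
    # Draw one CTR per article from its (assumed Gaussian) posterior,
    # then show the article with the largest draw.
    draws = [rng.gauss(m, s) for m, s in zip(post_mean, post_std)]
    return max(range(len(draws)), key=lambda j: draws[j])


# Example: article 0 has the higher mean, article 1 is far more uncertain.
rng = random.Random(42)
means, stds = [0.10, 0.08], [0.001, 0.05]
greedy_choice = epsilon_greedy(means, 0.0, rng)   # pure exploitation
ucb_choice = ucb(means, stds, k=2.0)              # favors the uncertain article
ts_choice = thompson(means, stds, rng)            # randomized choice
```

With `epsilon = 0` the first scheme reduces to showing the article with maximum posterior mean; UCB and Thompson sampling both give uncertain articles a chance, deterministically and by randomization respectively.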


Offline computation

Computing user latent factors and item coefficient prior

– This is computed offline once a day using retrospective (user,item) interaction data for last X days (X = 30 in our case)

– Computations are done on Hadoop


Offline: Regression based Latent Factor Model

Annotations on the model equation: regression weight matrix; user/item-specific correction terms (learnt from data)


Role of shrinkage (consider Gaussian for simplicity)

For new user/article, factor estimates based on covariates

\[ u_{new} = G\,x^{user}_{new}, \qquad v_{new} = D\,x^{item}_{new} \]

For old user, factor estimates

– Linear combination of prior regression function and user feedback on items
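To make the shrinkage idea concrete, here is a minimal sketch for a single scalar factor in the Gaussian case. It assumes a prior centered at the covariate-based regression prediction with variance `prior_var`, and feedback values observed with noise variance `noise_var`; both variances and the function name are hypothetical, and the real model works with vector factors and item interactions.

```python
def shrunk_factor(prior_mean, feedback, prior_var=1.0, noise_var=1.0):
    """Posterior mean of a scalar factor under a Gaussian prior and
    Gaussian noise: a weighted average of the prior regression
    prediction and the mean of the observed feedback."""
    n = len(feedback)
    if n == 0:
        # New user or article: fall back entirely on the covariates.
        return prior_mean
    # Weight on the data grows with the number of feedback observations.
    weight = n * prior_var / (n * prior_var + noise_var)
    data_mean = sum(feedback) / n
    return weight * data_mean + (1.0 - weight) * prior_mean


new_user = shrunk_factor(0.4, [])               # equals the prior prediction
light_user = shrunk_factor(0.0, [1.0])          # halfway between prior and data
heavy_user = shrunk_factor(0.0, [1.0] * 1000)   # dominated by feedback
```

The weight on feedback approaches 1 as a user accumulates observations, which matches the slide's description of estimates for old users as a linear combination of the prior regression function and user feedback.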


Estimating the Regression function via EM

Maximize, with respect to $(g, G, D)$,

\[ \int f\big(\text{Data}, \{u_i\}, \{v_j\}; g, G, D\big) \prod_i du_i \prod_j dv_j \]

Integral cannot be computed in closed form, approximated by Monte Carlo using Gibbs Sampling

For logistic, we use ARS (adaptive rejection sampling; Gilks and Wild) to sample the latent factors within the Gibbs sampler


Scaling to large data on Hadoop

Randomly partition by users in the Map

Run separate model on each partition

– Care taken to initialize each partition model with the same values; constraints on factors ensure identifiability within each partition

Create ensembles by using different user partitions, average across ensembles to obtain estimates of user factors and regression functions

– Estimates of user factors in ensembles uncorrelated, averaging reduces variance
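The partition-and-average strategy can be sketched on a single machine as follows. This is only an illustration of the control flow: the per-partition "model" is a stand-in (a least-squares slope through the origin) rather than the latent factor model, there is no Hadoop here, and all function names are hypothetical.

```python
import random


def random_partition(user_ids, n_parts, rng):
    # Assign each user to one partition uniformly at random.
    parts = [[] for _ in range(n_parts)]
    for uid in user_ids:
        parts[rng.randrange(n_parts)].append(uid)
    return parts


def fit_partition(users, data):
    # Stand-in for the per-partition model fit: slope of y on x through the origin.
    sxy = sum(x * y for u in users for x, y in data[u])
    sxx = sum(x * x for u in users for x, _ in data[u])
    return sxy / sxx if sxx > 0 else 0.0


def ensemble_estimate(data, n_parts, n_runs, seed=0):
    # Each run uses a different random user partition; estimates are
    # averaged within a run and then across runs.
    rng = random.Random(seed)
    run_estimates = []
    for _ in range(n_runs):
        parts = random_partition(sorted(data), n_parts, rng)
        fits = [fit_partition(p, data) for p in parts if p]  # skip empty partitions
        run_estimates.append(sum(fits) / len(fits))
    return sum(run_estimates) / len(run_estimates)


# Toy data: 20 users, each with three (x, y) observations where y = 2x.
toy_data = {u: [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)] for u in range(20)}
slope = ensemble_estimate(toy_data, n_parts=4, n_runs=5, seed=1)
```

On noisy data, estimates from different random partitions are only weakly correlated, so averaging across runs reduces variance in the way the slide describes.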


Data Example

1B events, 8M users, 6K articles

Offline training produced user factor $u_i$

Baseline: Online logistic without $u_i$

– Covariate-only online logistic model

Overall click lift: 9.7%

– Heavy users (> 10 clicks last month): 26%

– Cold users (not seen in the past): 3%


Click-lift for heavy users

[Figure: CTR lift relative to the covariate-only logistic model]


Computational Advertising: Matching ads to opportunities

[Diagram: Advertisers supply Ads to an Ad Network; when a User visits a Publisher's Page (an Opportunity), the Ad Network picks the best ads. Examples of ad networks: Yahoo, Google, MSN, ad exchanges (network of “networks”), …]


Ad exchange (RightMedia) [Agarwal et al., KDD 10]

Advertisers participate in different ways

– CPM (pay by ad-view)

– CPC (pay per click)

– CPA (pay per conversion)

To conduct an auction, normalize across pricing types

– Compute eCPM (expected CPM)

– Click-based: eCPM = click-rate * CPC

– Conversion-based: eCPM = conv-rate * CPA

Similar strategy of computing offline and online components

– Process 90B records for each model fit

– Model has hundreds of millions of parameters

– Model fully deployed on RightMedia today
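The eCPM normalization on this slide can be written out as a short sketch. It follows the slide's formulas directly and assumes all bids are expressed per single event (per ad-view, per click, per conversion) and that rates are per ad-view; a per-thousand convention would multiply every value by the same constant. The dictionary keys, ad names and example numbers are hypothetical, and the sketch ignores second-price auction mechanics.

```python
def ecpm(pricing_type, bid, click_rate=0.0, conv_rate=0.0):
    """Expected value of one ad-view, used to compare ads across pricing types."""
    if pricing_type == "CPM":
        return bid                  # advertiser pays per ad-view
    if pricing_type == "CPC":
        return click_rate * bid     # eCPM = click-rate * CPC
    if pricing_type == "CPA":
        return conv_rate * bid      # eCPM = conv-rate * CPA
    raise ValueError("unknown pricing type: " + str(pricing_type))


def pick_best_ad(ads):
    # Rank candidate ads by eCPM and return the top one.
    return max(ads, key=lambda ad: ecpm(ad["type"], ad["bid"],
                                        ad.get("click_rate", 0.0),
                                        ad.get("conv_rate", 0.0)))


# Made-up candidates for one opportunity.
candidate_ads = [
    {"name": "a", "type": "CPM", "bid": 0.004},
    {"name": "b", "type": "CPC", "bid": 0.5, "click_rate": 0.01},
    {"name": "c", "type": "CPA", "bid": 20.0, "conv_rate": 0.0001},
]
best_ad = pick_best_ad(candidate_ads)
```

The quality of the ranking depends entirely on the estimated click and conversion rates, which is where the offline and online models described in the talk come in.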


Summary

Estimating interactions in high-dimensional sparse data important in web applications

Scaling such models to Big Data is a challenging statistical problem

Combining offline + online modeling with explore/exploit a good practical strategy


Some Challenges

Very high-dimensional modeling with very large and noisy data

– Few categorical variables with large number of levels interacting with each other to produce response

– Scalability

Designing sequential experiments

– Multi-armed bandits are back in a big way

Data fusion

– From multiple and disparate sources

Availability of data, and ability to run experiments, for researchers
