Statistical Models for Massive Web Data Deepak Agarwal, LinkedIn, USA Director, Applied Relevance Science (ARS) CATS Big Data Panel, October 11, 2012 Hosted by National Academy of Sciences Washington D.C., USA
Nov 17, 2014
NRC BIG DATA PANEL, AGARWAL, 2012
Disclaimer
The opinions expressed here are mine and in no way represent the official position of LinkedIn
The case studies presented today are based on work done while I was at Yahoo!
Big Data Applications in Business
Big Data: competitive advantage, innovation, reduced uncertainty in decision making
High-frequency data
– Large number of heterogeneous transactions per unit time
– Web visits, financial trading, credit card transactions, telephone calls, packet flows in an IP network, …
I will focus on statistical modeling for one such data source
– User visits to websites
Example 1: Yahoo! front page
Recommend content links (out of 30-40, editorially programmed)
4 slots exposed, F1 has maximum exposure
Routes traffic to other Y! properties
[Screenshot labels: Today module with slots F1, F2, F3, F4; NEWS]
LinkedIn News
LinkedIn Ads
Data Generation
[Diagram labels: http request, NEWS, ADS, User information, Ranking Service, Model Updates]
DATA
User $i$ visits in some context, with (user, context) covariates $x_{it}$ (profile information, device id, first-degree connections, browse information, …)
Select item $j$ with item covariates $x_j$ (keywords, content categories, ...)
$(i, j)$: response $y_{ij}$ (click/no-click)
Statistical Problem
Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest
Examples of utility functions
– Click-rates (CTR)
– Share-rates (CTR × Pr[Share | Click])
– Revenue per page-view = CTR × bid (more complex due to second-price auction)
CTR is a fundamental measure that opens the door to a more principled approach to rank items
Converge rapidly to maximum-utility items
– Sequential decision-making process
– Models: help cope with data sparseness (curse of dimensionality)
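As a rough illustration of ranking by a chosen utility, the sketch below scores a small pool under the three utilities listed above. The `Item` fields, the item names, and all numbers are hypothetical, and the revenue utility ignores the second-price auction adjustment.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    ctr: float                # estimated P(click) for this user visit
    share_given_click: float  # estimated P(share | click)
    bid: float                # bid per click (hypothetical units)


def click_utility(item):
    return item.ctr


def share_utility(item):
    # Share-rate = CTR x P(share | click)
    return item.ctr * item.share_given_click


def revenue_utility(item):
    # Revenue per page-view = CTR x bid (no second-price correction)
    return item.ctr * item.bid


def rank_items(pool, utility):
    """Return the admissible pool sorted from highest to lowest utility."""
    return sorted(pool, key=utility, reverse=True)


# Hypothetical pool; the numbers are made up for illustration only.
pool = [
    Item("a", ctr=0.10, share_given_click=0.05, bid=0.20),
    Item("b", ctr=0.04, share_given_click=0.50, bid=1.00),
    Item("c", ctr=0.07, share_given_click=0.10, bid=0.40),
]
by_click = [it.name for it in rank_items(pool, click_utility)]
by_share = [it.name for it in rank_items(pool, share_utility)]
by_revenue = [it.name for it in rank_items(pool, revenue_utility)]
```

The same CTR estimate feeds every utility, which is why the ranking can change with the objective while the underlying model stays the same.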
Illustrate with Y! front Page Application
Simplify: Maximize CTR on first slot (F1)
Article pool
– Editorially selected for high quality and brand image
– Few articles in the pool, but the article pool is dynamic
We want to provide personalized recommendations
– Users with many prior visits see recommendations “tailored” to their taste; others see the best for the “group” they belong to
Types of user covariates
Demographics, geo
– Not useful in front-page application
Browse behavior: activity on Y! network ($x_{it}$)
– Previous visits to property, search, ad views, clicks, ...
– This is useful for the front-page application
Latent user factors based on previous clicks on the module ($u_{it}$)
– Useful for active module users, obtained via factor models
Approach: Online + Offline
Offline computation
– Intensive computations done infrequently (once a day/week) to update parameters that are less time-sensitive
Online computation
– Lightweight computations done frequently (once every 5-10 minutes) to update parameters that are time-sensitive
– Adaptive experiments (explore-exploit) also done online
Online computation: per-item online logistic regression
For item j, the state-space model is
Item coefficients are updated online via a Kalman filter (discounting approach of West and Harrison)
– Item covariates are used to initialize coefficients at epoch zero
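The exact state-space model is not reproduced here. The following is a minimal sketch of a per-item online logistic regression with discounting, under loudly stated assumptions: a diagonal Gaussian approximation to the coefficient posterior, a single scalar `discount` that inflates the variance before each update, and a one-step Laplace-style update per observation. It is a simplified stand-in, not the speaker's Kalman filter, and the class and parameter names are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineItemLogistic:
    """Diagonal-Gaussian online logistic regression for one item (a sketch)."""

    def __init__(self, prior_mean, prior_var, discount=0.99):
        # prior_mean/prior_var would come from the offline model at epoch zero.
        self.mean = list(prior_mean)
        self.var = list(prior_var)
        self.discount = discount  # in (0, 1]; smaller forgets old data faster

    def predict(self, x):
        return sigmoid(sum(m * xi for m, xi in zip(self.mean, x)))

    def update(self, x, y):
        # Discounting step: inflate the variance so recent data counts more.
        self.var = [v / self.discount for v in self.var]
        p = self.predict(x)
        w = p * (1.0 - p)
        # Precision grows by the curvature of the logistic log-likelihood.
        self.var = [1.0 / (1.0 / v + w * xi * xi) for v, xi in zip(self.var, x)]
        # Mean moves along the gradient, scaled by the updated variance.
        self.mean = [m + v * xi * (y - p)
                     for m, v, xi in zip(self.mean, self.var, x)]


# Usage with made-up covariates: repeated clicks push the predicted CTR up.
model = OnlineItemLogistic(prior_mean=[0.0, 0.0], prior_var=[1.0, 1.0], discount=0.95)
x = [1.0, 0.5]
for _ in range(20):
    model.update(x, 1)
```

A full Kalman-style update would carry a dense covariance matrix; the diagonal form is used here only to keep the sketch short.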
Closer Look at online model
Different components of
Online Adaptive Experimentation (Explore/Exploit)
Three schemes (all work reasonably well for the front page application)
– epsilon-greedy: Show the article with maximum posterior mean, except that with a small probability epsilon an article is chosen at random
– Upper confidence bound (UCB): Show the article with maximum score, where score = posterior mean + k × posterior std
– Thompson sampling: Draw a sample (δ, β) from the posterior to compute article CTR and show the article with maximum CTR
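The three schemes can be sketched in a few lines each. This is an illustration under assumptions: each article's posterior is summarized by a mean and a standard deviation of its CTR, and the Thompson variant draws the CTR directly from a Gaussian with those moments rather than drawing (δ, β) and recomputing the CTR. Function names and numbers are hypothetical.

```python
import random


def argmax(values):
    return max(range(len(values)), key=lambda j: values[j])


def epsilon_greedy(post_mean, epsilon, rng):
    # With probability epsilon explore uniformly; otherwise exploit the best mean.
    if rng.random() < epsilon:
        return rng.randrange(len(post_mean))
    return argmax(post_mean)


def ucb(post_mean, post_std, k):
    # Score = posterior mean + k * posterior std.
    return argmax([m + k * s for m, s in zip(post_mean, post_std)])


def thompson(post_mean, post_std, rng):
    # Simplification: sample each article's CTR from a Gaussian summary.
    draws = [rng.gauss(m, s) for m, s in zip(post_mean, post_std)]
    return argmax(draws)


# Made-up posterior summaries for three articles.
rng = random.Random(0)
means = [0.02, 0.05, 0.03]
stds = [0.03, 0.001, 0.002]
choice_eg = epsilon_greedy(means, 0.1, rng)
choice_ucb = ucb(means, stds, 2.0)
choice_ts = thompson(means, stds, rng)
```

With these numbers UCB favors the first article, whose mean is lowest but whose uncertainty is largest, which is the exploration behavior the score is designed to produce.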
Offline computation
Computing user latent factors and item coefficient prior
– This is computed offline once a day using retrospective (user, item) interaction data for the last X days (X = 30 in our case)
– Computations are done on Hadoop
Offline: Regression based Latent Factor Model
– regression weight matrix
– user/item-specific correction terms (learnt from data)
Role of shrinkage (consider Gaussian for simplicity)
For new user/article, factor estimates are based on covariates
$u_{new} = G x_{new}^{user}, \quad v_{new} = D x_{new}^{item}$
For old user, factor estimates
– Linear combination of prior regression function and user feedback on items
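To make the shrinkage concrete, here is a minimal one-dimensional sketch. It assumes a Gaussian response $y_{ij} = u_i v_j + \text{noise}$ with known item factors and noise variance, and a Gaussian prior on $u_i$ centered at the regression prediction $G x_i$. These modeling choices and all names are assumptions for illustration, not the full model.

```python
def posterior_user_factor(prior_mean, prior_var, item_factors, responses, noise_var):
    """Posterior mean and variance of a scalar user factor u.

    prior_mean plays the role of the regression prediction G x for this user;
    each response y is modeled as u * v + Gaussian noise (an assumption).
    """
    precision = 1.0 / prior_var
    weighted = prior_mean / prior_var
    for v, y in zip(item_factors, responses):
        precision += v * v / noise_var
        weighted += v * y / noise_var
    return weighted / precision, 1.0 / precision


# New user: no feedback, so the estimate is the regression prior.
new_mean, new_var = posterior_user_factor(0.5, 1.0, [], [], 1.0)

# Old user: many responses pull the estimate toward what the data say.
old_mean, old_var = posterior_user_factor(0.5, 1.0, [1.0] * 1000, [2.0] * 1000, 1.0)
```

The posterior mean is a precision-weighted average of the prior regression value and the feedback, so a user with little history stays near $G x_i$ while a heavy user is driven mostly by observed responses.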
Estimating the Regression function via EM
Maximize
$$\int \prod_{ij} f(u_i, v_j, \mathrm{Data}) \prod_i g(u_i, G) \prod_j g(v_j, D) \prod_i du_i \prod_j dv_j$$
Integral cannot be computed in closed form; it is approximated by Monte Carlo using Gibbs sampling
For logistic, we use adaptive rejection sampling (ARS; Gilks and Wild) to sample the latent factors within the Gibbs sampler
Scaling to large data on Hadoop
Randomly partition by users in the Map step
Run a separate model on each partition
– Care taken to initialize each partition model with the same values; constraints on factors ensure identifiability within each partition
Create ensembles by using different user partitions; average across ensembles to obtain estimates of user factors and regression functions
– Estimates of user factors in ensembles are uncorrelated, so averaging reduces variance
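The partition-and-average idea can be sketched on a single machine. Everything here is an assumption for illustration: `fit_partition` is a toy stand-in (each user's click rate shrunk toward the partition's overall rate) rather than the latent factor model, and the random partitioning stands in for the Map step of a Hadoop job.

```python
import random
from collections import defaultdict


def partition_users(user_ids, n_partitions, seed):
    """Randomly assign each user to one partition (stand-in for the Map step)."""
    rng = random.Random(seed)
    parts = [[] for _ in range(n_partitions)]
    for u in sorted(user_ids):
        parts[rng.randrange(n_partitions)].append(u)
    return parts


def fit_partition(users, events, prior_strength=5.0):
    """Toy per-partition model: shrink user click rates toward the partition rate."""
    clicks = sum(sum(events[u]) for u in users)
    views = sum(len(events[u]) for u in users)
    part_rate = clicks / views if views else 0.0
    return {
        u: (sum(events[u]) + prior_strength * part_rate)
        / (len(events[u]) + prior_strength)
        for u in users
    }


def ensemble_estimates(events, n_partitions, n_runs):
    """Average per-user estimates over runs that use different random partitions."""
    totals = defaultdict(float)
    for run in range(n_runs):
        for users in partition_users(events.keys(), n_partitions, seed=run):
            for u, est in fit_partition(users, events).items():
                totals[u] += est
    return {u: t / n_runs for u, t in totals.items()}


# Made-up click logs: user -> list of click (1) / no-click (0) outcomes.
events = {"u1": [1, 0, 1], "u2": [0, 0], "u3": [1], "u4": [0, 0, 0, 1]}
estimates = ensemble_estimates(events, n_partitions=2, n_runs=10)
```

Each user lands in exactly one partition per run, so every user receives one estimate per run and the final value is a plain average over runs.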
Data Example
1B events, 8M users, 6K articles
Offline training produced user factor $u_i$
Baseline: Online logistic without $u_i$
– Covariate-only online logistic model
Overall click lift: 9.7%
Heavy users (> 10 clicks last month): 26%
Cold users (not seen in the past): 3%
Click-lift for heavy users
[Figure: CTR lift relative to the covariate-only logistic model]
Computational Advertising: Matching ads to opportunities
[Diagram labels: Advertisers, Ad Network, Ads, Page, Pick best ads, User, Publisher, Opportunity]
Examples: Yahoo, Google, MSN, ad exchanges (network of “networks”), …
Ad exchange (RightMedia) [Agarwal et al., KDD 10]
Advertisers participate in different ways
– CPM (pay per ad view)
– CPC (pay per click)
– CPA (pay per conversion)
To conduct an auction, normalize across pricing types
– Compute eCPM (expected CPM)
– Click-based: eCPM = click-rate × CPC
– Conversion-based: eCPM = conv-rate × CPA
Similar strategy of computing offline and online components
– Process 90B records for each model fit
– Model has hundreds of millions of parameters
– Model fully deployed on RightMedia today
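The normalization across pricing types can be sketched as follows. One assumption differs from the formulas as written: the click-based and conversion-based values are multiplied by 1000 so that every bid is on the same per-thousand-impressions scale as a CPM bid. The function name, the reading of `conv_rate` as conversions per impression, and all numbers are hypothetical.

```python
def ecpm(pricing_type, price, click_rate=0.0, conv_rate=0.0):
    """Expected revenue per 1000 impressions for one ad (a sketch).

    click_rate: estimated clicks per impression.
    conv_rate:  estimated conversions per impression (an assumption).
    """
    if pricing_type == "CPM":
        return price                        # already per 1000 impressions
    if pricing_type == "CPC":
        return 1000.0 * click_rate * price  # click-rate x CPC, per mille
    if pricing_type == "CPA":
        return 1000.0 * conv_rate * price   # conv-rate x CPA, per mille
    raise ValueError(f"unknown pricing type: {pricing_type}")


# Made-up bids: rank ads of different pricing types on one scale.
ads = [
    ("cpm_ad", ecpm("CPM", 2.0)),
    ("cpc_ad", ecpm("CPC", 0.5, click_rate=0.01)),
    ("cpa_ad", ecpm("CPA", 20.0, conv_rate=0.0002)),
]
ranked = [name for name, value in sorted(ads, key=lambda a: a[1], reverse=True)]
```

The quality of the ranking depends entirely on the click-rate and conversion-rate estimates, which is where the offline and online models come in.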
Summary
Estimating interactions in high-dimensional sparse data is important in web applications
Scaling such models to Big Data is a challenging statistical problem
Combining offline + online modeling with explore/exploit is a good practical strategy
Some Challenges
Very high-dimensional modeling with very large and noisy data
– Few categorical variables with a large number of levels interacting with each other to produce the response
– Scalability
Designing sequential experiments
– Multi-armed bandits are back in a big way
Data fusion
– From multiple and disparate sources
Availability of data, and the ability to run experiments, for researchers