Jan 20, 2015
Online Content Optimization using Hadoop
Nitin [email protected]
What is Yahoo? … Yahoo is a great company that is very, very strong in content for its users. … For instance, on our Today module on the front page, every 5 minutes we have 32,000 different variations of that module. So you don’t even know what I’m seeing; in fact, we serve a million different front-page modules a day, and that’s just through content optimization. And that’s just the beginning … customized because we know the things you’re interested in …
“Deliver the right CONTENT to the right USER at the right TIME”
What do we do ?
Keep Carol Bartz excited
Relevance at Yahoo!
Important
Editors
Popular
Personal / Social
People: 10s of items
Science: Millions of items
Ranking Problems
Most Popular: Most engaging overall, based on objective metrics
Related Items (Behavioral Affinity): People who did X, did Y
Deep Personalization: Most relevant to me, based on my deep interests
Real-time Dashboard
Business Optimization
Light Personalization: More relevant to me, based on my age, gender and property usage
Most Popular + Per-User History: Engaging overall, and aware of what I’ve already seen
Flow
Optimization Engine: Content feed with biz rules
Explore: ~1%
Exploit: ~99%
Real-time Feedback
Content Metadata
Dashboard
Optimized Module
Real-time Insights
Rules Engine
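The ~1% explore / ~99% exploit split in the flow above can be sketched as an epsilon-greedy-style traffic bucketer. All names here are hypothetical; the deck does not show the production bucketing logic, so this is only a minimal illustration of the idea:

```python
import hashlib

EXPLORE_PCT = 1  # ~1% of traffic explores; the rest exploits the ranked list

def _hash(s: str) -> int:
    """Stable hash so a given request always lands in the same bucket."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def bucket(request_id: str) -> str:
    """Deterministically assign a request to explore or exploit."""
    return "explore" if _hash(request_id) % 100 < EXPLORE_PCT else "exploit"

def serve(request_id: str, ranked_items: list, explore_pool: list):
    """Exploit: serve the model's top-ranked item.
    Explore: serve an item from the candidate pool to gather feedback
    on unproven content."""
    if bucket(request_id) == "explore":
        return explore_pool[_hash(request_id) % len(explore_pool)]
    return ranked_items[0]
```

The explore traffic is what feeds the real-time feedback loop: new stories get impressions they would never earn from the model alone.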
How it happens ?
At time ‘t’, user ‘u’ (user attributes: age, gender, location) interacted with content ‘id’ at position ‘o’, on property/site ‘p’, section ‘s’, module ‘m’, international ‘i’
User Events
Item Metadata
Modeling
ITEM Model
USER Model
Content ‘id’ has associated metadata ‘meta’; meta = {entity, keyword, geo, topic, category}
Feature Generation
Additional Content & User Feature Generation
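A minimal sketch of the event and metadata records above, and of a feature-generation join between them. The field names are hypothetical (the real schemas aren't given in the deck); the point is only that events joined with item metadata yield the feature rows used by modeling:

```python
from dataclasses import dataclass, field

@dataclass
class UserEvent:
    t: int          # event time 't'
    user: str       # user id 'u'
    age: int
    gender: str
    item: str       # content 'id'
    position: int   # position 'o'

@dataclass
class ItemMeta:
    item: str
    meta: dict = field(default_factory=dict)  # {entity, keyword, geo, topic, category}

def join_features(events, metadata):
    """Join user events with item metadata to emit feature rows."""
    by_item = {m.item: m.meta for m in metadata}
    for e in events:
        cats = by_item.get(e.item, {}).get("category", [])
        yield {"user": e.user, "item": e.item,
               "age_bucket": e.age // 10,      # coarse demographic feature
               "gender": e.gender,
               **{f"cat_{c}": 1.0 for c in cats}}
```

In production this join runs as a MR/PIG stage; here it is plain Python for clarity.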
ITEM model (item x user features):

Item | BASE |  M   |  F   | ATTR | CAT_Sports
id1  |  0.8 | +1.2 | -1.5 | -0.9 |  1.0
id2  | -0.9 | -0.9 | +2.6 | +0.3 |  1.0

USER model (user x content features; rows are sparse, blank = feature not stored):

User | BASE | M | F | ATTR | CAT_Sports
u1   |  0.8 | 1 |   |  1   |  0.2
u2   | -0.9 |   | 1 |      | -1.2
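Since items and users share a feature space (BASE, demographics, ATTR, CAT_*), a toy relevance score can be computed as the sparse dot product of the two vectors. This is an illustrative assumption, not the production scoring formula, and the column alignment of the sparse user rows is also assumed:

```python
def score(user_vec: dict, item_vec: dict) -> float:
    """Sparse dot product over the shared feature space."""
    return sum(w * item_vec.get(f, 0.0) for f, w in user_vec.items())

# example values taken from the model tables above
id1 = {"BASE": 0.8, "M": 1.2, "F": -1.5, "ATTR": -0.9, "CAT_Sports": 1.0}
u1 = {"BASE": 0.8, "M": 1.0, "ATTR": 1.0, "CAT_Sports": 0.2}
```

Sparse dicts keep both tables small: with tens of thousands of possible features, most users and items carry only a handful of non-zero weights.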
STORE: PNUTS
5 min latency
Ranking / B-Rules
Request
5 – 30 min latency
SLA: 50 ms – 200 ms
Models
USER x CONTENT FEATURES
USER MODEL : Tracks User interest in terms of Content Features
ITEM x USER FEATURES
ITEM MODEL : Tracks behavior of Item across user features
USER FEATURES x CONTENT FEATURES
PRIORS : Tracks interactions of user features with content features
USER x USER
CLUSTERING : Looks at User-User Affinity based on the feature vectors
ITEM x ITEM
CLUSTERING : Looks at Item-Item Affinity based on item feature vectors
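Both clustering models above (USER x USER and ITEM x ITEM) reduce to computing affinity between sparse feature vectors. A common choice for that affinity is cosine similarity; the deck does not name the exact measure, so treat this as a hedged sketch:

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_affinity(target: str, vectors: dict, k: int = 1) -> list:
    """Nearest neighbours of one item (or user) by cosine affinity."""
    others = [(cosine(vectors[target], v), i)
              for i, v in vectors.items() if i != target]
    return [i for _, i in sorted(others, reverse=True)[:k]]
```

At millions of items this all-pairs comparison is done in MR stages, not one process; the math per pair is the same.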
Scale
• A million events per second
• Hundreds of GB per run
• Millions of stories in pool
• Tens of Thousands of Features (Content and/or User)
Technology Stack
Analytics and Debugging
Modeling Framework
Global state provided by HBase
A collection of PIG UDFs
Flows for the modeling stages are assembled in PIG
OLR
Clustering
Affinity
Regression Models
Decompositions (Cholesky …)
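Of the model types listed, OLR (online logistic regression) is the simplest to sketch: each (features, click) event nudges the weights by one gradient step on the log loss. This is a generic OLR update in plain Python, not the deck's actual PIG UDF:

```python
import math

def olr_update(w: dict, features: dict, clicked: int, lr: float = 0.1) -> dict:
    """One online logistic regression step on a single (features, click) event."""
    z = sum(w.get(f, 0.0) * x for f, x in features.items())
    p = 1.0 / (1.0 + math.exp(-z))   # predicted click probability
    g = p - clicked                  # gradient of log loss w.r.t. z
    for f, x in features.items():
        w[f] = w.get(f, 0.0) - lr * g * x
    return w
```

Because each update touches only the features present in one event, the model state maps naturally onto per-key rows in a store like HBase.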
Configuration-based behavioral changes for stages of modeling
Type of features to generate
Type of joins to perform – User / Item / Feature
Input : DFS and/or HBase
Output: DFS and/or HBase
Standard pattern for updating serving stores <Source, Transformation, Sink>
E.g. <HBase Table, Function(Features), PNUTS Table>
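The <Source, Transformation, Sink> pattern can be sketched in a few lines. The HBase-like and PNUTS-like tables here are hypothetical stand-ins (plain lists), showing only the shape of the pattern, not the real client APIs:

```python
def run_stage(source, transform, sink):
    """<Source, Transformation, Sink>: read rows, transform each, write out."""
    for row in source:
        sink.append(transform(row))

# hypothetical stand-ins for an HBase table and a PNUTS serving table
hbase_table = [{"item": "id1", "BASE": 0.8, "view_count": 120}]
pnuts_table = []

# push only the serving-relevant columns to the serving store
run_stage(hbase_table,
          lambda r: {k: v for k, v in r.items() if k != "view_count"},
          pnuts_table)
```

Keeping every store update in this one shape is what makes new serving flows cheap to add: only the transform changes.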
HBase
ITEM Table
• Stores item-related features
• Stores the ITEM x USER FEATURES model
• Stores parameters about the item, like view count, click count, unique user count
• 10s of millions of items
• Updated every 5 minutes
USER Model
• Stores the USER x CONTENT FEATURES model for each individual user, keyed by a unique ID
• Stores summarized user history – essential for modeling in terms of item decay
• Millions of profiles
• Updated every 5 to 30 minutes
TERM Model
• Inverts the Item Table and stores statistics for the terms
• Used to find trending features and provide baselines for user features
• Millions of terms and hundreds of parameters tracked
• Updated every 5 minutes
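The TERM model's inversion step can be sketched as a small aggregation: flip item → terms into term → statistics. Field names are assumptions; the real table tracks hundreds of parameters per term, not the two shown here:

```python
from collections import defaultdict

def invert_item_table(item_table):
    """Invert item -> terms into term -> aggregate statistics."""
    stats = defaultdict(lambda: {"items": 0, "clicks": 0})
    for row in item_table:
        for term in row["terms"]:
            stats[term]["items"] += 1          # how many items carry this term
            stats[term]["clicks"] += row["clicks"]  # engagement attributed to it
    return dict(stats)
```

Terms whose click counts spike across many items are exactly the "trending features" the slide mentions.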
Grid Edge Services
Keeps MR jobs lean and mean
Provides the ability to deploy non-gridifyable solutions easily
Edge services have different scaling characteristics (e.g. memory, CPU)
Provide gateway for accessing external data sources in M/R
Map and/or Reduce step interact with Edge Services using standard client
Examples
Categorization
Geo Tagging
Feature Transformation
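A map step calling an edge service might look like the sketch below. The `categorize` function here is a local stand-in for what would really be a remote RPC to the categorization service; the record fields are hypothetical:

```python
def categorize(text: str) -> list:
    """Stand-in for a remote categorization edge service (real one is an RPC)."""
    return ["sports"] if "game" in text.lower() else ["news"]

def map_step(record: dict, service=categorize) -> dict:
    """A map task enriching each record via the edge-service client,
    keeping categorization logic (and its memory/CPU profile) out of the job."""
    return dict(record, categories=service(record["title"]))
```

Injecting the service as a parameter mirrors the "standard client" idea on the slide: the MR code stays the same whichever edge service sits behind it.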
Analytics and Debugging
Provides the ability to debug modeling issues in near-real time
Run complex queries for analysis
Easy-to-use interface
PMs, engineers and researchers use this cluster to get near-real-time insights
10s of modeling-monitoring and reporting queries every 5 minutes
We use HIVE
Data Flow
Learnings
PIG & HBase have been the best combination so far
Made it simple to build different kinds of science models
Point lookup using HBase has proven to be very useful
Modeling = Matrices
HBase provides a natural way to represent and access them
Edge Services
Have provided simplicity to whole stack
Management (Upgrades, Outage) has been easy
HIVE has provided us a great way to analyze the results
PIG was also considered
Questions?