Jan 20, 2015
Online Content Optimization using Hadoop
Nitin [email protected]
What is Yahoo? … Yahoo is a great company that is very, very strong in content for its users. … For instance, on our Today module on the front page, every 5 minutes we have 32,000 different variations of that module. So you don’t even know what I’m seeing; in fact, we serve a million different front-page modules a day, and that’s just through content optimization. And that’s just the beginning … customized because we know the things you’re interested in …
“Deliver the right CONTENT to the right USER at the right TIME”
What do we do ?
Keep Carol Bartz excited
Relevance at Yahoo!
Important
Editors
Popular
Personal / Social
People: 10s of items
Science: Millions of items
Ranking Problems
Most Popular: Most engaging overall, based on objective metrics
Related Items (Behavioral Affinity): People who did X, did Y
Deep Personalization: Most relevant to me, based on my deep interests
Real-time Dashboard
Business Optimization
Light Personalization: More relevant to me, based on my age, gender and property usage
Most Popular + Per-User History: Engaging overall, and aware of what I’ve already seen
Flow
Optimization Engine: Content feed with biz rules
Explore: ~1%
Exploit: ~99%
Real-time Feedback
Content Metadata
Dashboard
Optimized Module
Real-time Insights
Rules Engine
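The ~1% explore / ~99% exploit split in the flow above can be sketched as an epsilon-greedy-style traffic bucketer. All names here are hypothetical; the deck does not show the production bucketing logic, so this is only a minimal illustration of the idea:

```python
import hashlib

EXPLORE_PCT = 1  # ~1% of traffic explores; the rest exploits the ranked list

def _hash(s: str) -> int:
    """Stable hash so a given request always lands in the same bucket."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def bucket(request_id: str) -> str:
    """Deterministically assign a request to explore or exploit."""
    return "explore" if _hash(request_id) % 100 < EXPLORE_PCT else "exploit"

def serve(request_id: str, ranked_items: list, explore_pool: list):
    """Exploit: serve the model's top-ranked item.
    Explore: serve an item from the candidate pool to gather feedback
    on unproven content."""
    if bucket(request_id) == "explore":
        return explore_pool[_hash(request_id) % len(explore_pool)]
    return ranked_items[0]
```

The explore traffic is what feeds the real-time feedback loop: new stories get impressions they would never earn from the model alone.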
How it happens ?
At time ‘t’, user ‘u’ (user attributes: age, gender, location) interacted with content ‘id’ at position ‘o’, on property/site ‘p’, section ‘s’, module ‘m’, international ‘i’
User Events
Item Metadata
Modeling
ITEM Model
USER Model
Content ‘id’ has associated metadata ‘meta’; meta = {entity, keyword, geo, topic, category}
Feature Generation
Additional Content & User Feature Generation
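A minimal sketch of the event and metadata records above, and of a feature-generation join between them. The field names are hypothetical (the real schemas aren't given in the deck); the point is only that events joined with item metadata yield the feature rows used by modeling:

```python
from dataclasses import dataclass, field

@dataclass
class UserEvent:
    t: int          # event time 't'
    user: str       # user id 'u'
    age: int
    gender: str
    item: str       # content 'id'
    position: int   # position 'o'

@dataclass
class ItemMeta:
    item: str
    meta: dict = field(default_factory=dict)  # {entity, keyword, geo, topic, category}

def join_features(events, metadata):
    """Join user events with item metadata to emit feature rows."""
    by_item = {m.item: m.meta for m in metadata}
    for e in events:
        cats = by_item.get(e.item, {}).get("category", [])
        yield {"user": e.user, "item": e.item,
               "age_bucket": e.age // 10,      # coarse demographic feature
               "gender": e.gender,
               **{f"cat_{c}": 1.0 for c in cats}}
```

In production this join runs as a MR/PIG stage; here it is plain Python for clarity.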
ITEM model (item x user features):

Item | BASE |  M   |  F   | ATTR | CAT_Sports
id1  |  0.8 | +1.2 | -1.5 | -0.9 |  1.0
id2  | -0.9 | -0.9 | +2.6 | +0.3 |  1.0

USER model (user x content features; rows are sparse, blank = feature not stored):

User | BASE | M | F | ATTR | CAT_Sports
u1   |  0.8 | 1 |   |  1   |  0.2
u2   | -0.9 |   | 1 |      | -1.2
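Since items and users share a feature space (BASE, demographics, ATTR, CAT_*), a toy relevance score can be computed as the sparse dot product of the two vectors. This is an illustrative assumption, not the production scoring formula, and the column alignment of the sparse user rows is also assumed:

```python
def score(user_vec: dict, item_vec: dict) -> float:
    """Sparse dot product over the shared feature space."""
    return sum(w * item_vec.get(f, 0.0) for f, w in user_vec.items())

# example values taken from the model tables above
id1 = {"BASE": 0.8, "M": 1.2, "F": -1.5, "ATTR": -0.9, "CAT_Sports": 1.0}
u1 = {"BASE": 0.8, "M": 1.0, "ATTR": 1.0, "CAT_Sports": 0.2}
```

Sparse dicts keep both tables small: with tens of thousands of possible features, most users and items carry only a handful of non-zero weights.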
STORE: PNUTS
5 min latency
Ranking / B-Rules
Request
5 – 30 min latency
SLA: 50 ms – 200 ms
Models
USER x CONTENT FEATURES
USER MODEL : Tracks User interest in terms of Content Features
ITEM x USER FEATURES
ITEM MODEL : Tracks behavior of Item across user features
USER FEATURES x CONTENT FEATURES
PRIORS : Tracks interactions of user features with content features
USER x USER
CLUSTERING : Looks at User-User Affinity based on the feature vectors
ITEM x ITEM
CLUSTERING : Looks at Item-Item Affinity based on item feature vectors
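Both clustering models above (USER x USER and ITEM x ITEM) reduce to computing affinity between sparse feature vectors. A common choice for that affinity is cosine similarity; the deck does not name the exact measure, so treat this as a hedged sketch:

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_affinity(target: str, vectors: dict, k: int = 1) -> list:
    """Nearest neighbours of one item (or user) by cosine affinity."""
    others = [(cosine(vectors[target], v), i)
              for i, v in vectors.items() if i != target]
    return [i for _, i in sorted(others, reverse=True)[:k]]
```

At millions of items this all-pairs comparison is done in MR stages, not one process; the math per pair is the same.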
Scale
• A million events per second
• Hundreds of GB per run
• Millions of stories in pool
• Tens of Thousands of Features (Content and/or User)
Technology Stack
Analytics and Debugging
Modeling Framework
Global state provided by HBase
A collection of PIG UDFs
Flows for the modeling stages are assembled in PIG
OLR
Clustering
Affinity
Regression Models
Decompositions (Cholesky …)
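Of the model types listed, OLR (online logistic regression) is the simplest to sketch: each (features, click) event nudges the weights by one gradient step on the log loss. This is a generic OLR update in plain Python, not the deck's actual PIG UDF:

```python
import math

def olr_update(w: dict, features: dict, clicked: int, lr: float = 0.1) -> dict:
    """One online logistic regression step on a single (features, click) event."""
    z = sum(w.get(f, 0.0) * x for f, x in features.items())
    p = 1.0 / (1.0 + math.exp(-z))   # predicted click probability
    g = p - clicked                  # gradient of log loss w.r.t. z
    for f, x in features.items():
        w[f] = w.get(f, 0.0) - lr * g * x
    return w
```

Because each update touches only the features present in one event, the model state maps naturally onto per-key rows in a store like HBase.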
Configuration-based behavioral changes for stages of modeling
Type of features to generate
Type of joins to perform – User / Item / Feature
Input : DFS and/or HBase
Output: DFS and/or HBase
Standard pattern for updating serving stores <Source, Transformation, Sink>
E.g. <HBase Table, Function(Features), PNUTS Table>
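The <Source, Transformation, Sink> pattern can be sketched in a few lines. The HBase-like and PNUTS-like tables here are hypothetical stand-ins (plain lists), showing only the shape of the pattern, not the real client APIs:

```python
def run_stage(source, transform, sink):
    """<Source, Transformation, Sink>: read rows, transform each, write out."""
    for row in source:
        sink.append(transform(row))

# hypothetical stand-ins for an HBase table and a PNUTS serving table
hbase_table = [{"item": "id1", "BASE": 0.8, "view_count": 120}]
pnuts_table = []

# push only the serving-relevant columns to the serving store
run_stage(hbase_table,
          lambda r: {k: v for k, v in r.items() if k != "view_count"},
          pnuts_table)
```

Keeping every store update in this one shape is what makes new serving flows cheap to add: only the transform changes.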
HBase
ITEM Table
• Stores item-related features
• Stores the ITEM x USER FEATURES model
• Stores parameters about the item, like view count, click count, unique user count
• 10s of millions of items
• Updated every 5 minutes
USER Model
• Stores the USER x CONTENT FEATURES model for each individual user, keyed by a unique ID
• Stores summarized user history – essential for modeling in terms of item decay
• Millions of profiles
• Updated every 5 to 30 minutes
TERM Model
• Inverts the Item Table and stores statistics for the terms
• Used to find trending features and provide baselines for user features
• Millions of terms and hundreds of parameters tracked
• Updated every 5 minutes
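The TERM model's inversion step can be sketched as a small aggregation: flip item → terms into term → statistics. Field names are assumptions; the real table tracks hundreds of parameters per term, not the two shown here:

```python
from collections import defaultdict

def invert_item_table(item_table):
    """Invert item -> terms into term -> aggregate statistics."""
    stats = defaultdict(lambda: {"items": 0, "clicks": 0})
    for row in item_table:
        for term in row["terms"]:
            stats[term]["items"] += 1          # how many items carry this term
            stats[term]["clicks"] += row["clicks"]  # engagement attributed to it
    return dict(stats)
```

Terms whose click counts spike across many items are exactly the "trending features" the slide mentions.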
Grid Edge Services
Keeps MR jobs lean and mean
Provides the ability to deploy non-gridifyable solutions easily
Edge services have different scaling characteristics (e.g. memory, CPU)
Provide gateway for accessing external data sources in M/R
Map and/or Reduce step interact with Edge Services using standard client
Examples
Categorization
Geo Tagging
Feature Transformation
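A map step calling an edge service might look like the sketch below. The `categorize` function here is a local stand-in for what would really be a remote RPC to the categorization service; the record fields are hypothetical:

```python
def categorize(text: str) -> list:
    """Stand-in for a remote categorization edge service (real one is an RPC)."""
    return ["sports"] if "game" in text.lower() else ["news"]

def map_step(record: dict, service=categorize) -> dict:
    """A map task enriching each record via the edge-service client,
    keeping categorization logic (and its memory/CPU profile) out of the job."""
    return dict(record, categories=service(record["title"]))
```

Injecting the service as a parameter mirrors the "standard client" idea on the slide: the MR code stays the same whichever edge service sits behind it.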
Analytics and Debugging
Provides the ability to debug modeling issues in near-real time
Run complex queries for analysis
Easy-to-use interface
PMs, engineers and researchers use this cluster to get near-real-time insights
10s of modeling-monitoring and reporting queries every 5 minutes
We use HIVE
Data Flow
Learnings
PIG & HBase have been the best combination so far
Made it simple to build different kinds of science models
Point lookup using HBase has proven to be very useful
Modeling = Matrices
HBase provides a natural way to represent and access them
Edge Services
Have provided simplicity to whole stack
Management (Upgrades, Outage) has been easy
HIVE has provided us a great way to analyze the results
PIG was also considered
Questions?