Spark Technology Center
Oct / 27 / 16
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch
Jean-François Puget, Nick Pentreath
About
• @JFPuget
• Distinguished Engineer, IBM Machine Learning & Optimization
• @MLnick
• Principal Engineer, IBM Spark Technology Center
• Apache Spark PMC
Agenda
• Recommender systems & the machine learning workflow
• Data modelling for recommender systems
• Why Spark & Elasticsearch?
• Spark ML for collaborative filtering
• Deploying & scoring recommender models
• Demo
Recommender Systems & the ML Workflow
Recommender Systems
Overview
The Machine Learning Workflow
Perception
Data → ??? → Machine Learning → ??? → $$$
The Machine Learning Workflow
Reality
Data
• Historical
• Streaming
Ingest
Data Processing
• Feature transformation & engineering
Model Training
• Model selection & evaluation
Deploy
• Pipelines, not just models
• Versioning
Live System
• Predict given new data
• Monitoring & live evaluation
Feedback Loop
Spark DataFrames
Spark ML
Various ???
Stream (Kafka)
Missing piece!
The Machine Learning Workflow
Recommender Version
Data
Ingest
Data Processing
• Aggregation
• Handle implicit data
Model Training
• ALS
• Ranking-style evaluation
Deploy
• Model size & complexity
Live System
• User & item recommendations
• Monitoring, filters
Feedback => another Event Type
Spark DataFrames
Spark ML
Elasticsearch
• User & Item Metadata
• Events
Elasticsearch
Stream (Kafka)
Data Modeling for Recommender Systems
User and Item Metadata
Data model
User and Item Metadata
System Requirements
Filtering & Grouping
Business Rules
Anatomy of a User Event
User Interactions
Implicit preference data
• Page view
• eCommerce: cart, purchase
• Media: preview, watch, listen
Intent data
• Search query
Explicit preference data
• Rating
• Review
Social network interactions
• Like
• Share
• Follow
Anatomy of a User Event
Data model
Anatomy of a User Event
How to handle implicit feedback?
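One common way to handle implicit feedback is to aggregate raw events into a single weight per user and item before training. The following is a minimal sketch of that aggregation. The field names (user_id, item_id, event_type) and the per-event weights are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical weights per event type; the values are illustrative assumptions.
EVENT_WEIGHTS = {"view": 1.0, "cart": 3.0, "purchase": 5.0}


def aggregate_events(events):
    """Sum weighted events into one implicit preference value per (user, item)."""
    totals = defaultdict(float)
    for event in events:
        weight = EVENT_WEIGHTS.get(event["event_type"])
        if weight is None:
            continue  # event types without a weight carry no preference signal here
        totals[(event["user_id"], event["item_id"])] += weight
    return dict(totals)


# Hypothetical events, shaped like small JSON documents.
events = [
    {"user_id": 1, "item_id": 10, "event_type": "view"},
    {"user_id": 1, "item_id": 10, "event_type": "purchase"},
    {"user_id": 2, "item_id": 10, "event_type": "view"},
    {"user_id": 2, "item_id": 11, "event_type": "search"},
]
ratings = aggregate_events(events)
print(ratings)
```

The resulting (user, item, weight) triples have the shape that a collaborative filtering model trained on implicit preferences expects as input.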
Why Spark & Elasticsearch?
Why Spark?
DataFrames
• Events & metadata are “lightly structured” data
• Suited to DataFrames
• Pluggable external data source support
Spark ML
• Spark ML pipelines
• Scalable ALS algorithm, supporting implicit feedback & NMF
• Cross-validation
• Custom transformers & algorithms
Why Elasticsearch?
Storage
• Native JSON
• Scalable
• Good support for time-series / event data
• Kibana for data visualisation
• Integration with Spark DataFrames
Scoring
• Full-text search
• Filtering
• Aggregations (grouping)
• Search ~== recommendation (more later)
Spark ML for Collaborative Filtering
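Collaborative Filtering
Matrix Factorization
[Figure: a sparse ratings matrix (observed ratings between 1 and 5) and its two factor matrices]
First factor matrix:
−1.1  3.2  4.3
 0.2  1.4  3.1
 2.5  0.3  2.3
 4.3 −2.4  0.5
 3.6  0.3  1.2
Second factor matrix:
 0.2  1.7  2.3
 1.9  0.4  0.8
 1.5 −1.2  0.3
−0.4  2.1  0.6
 2.7  0.8  1.4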
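In symbols, matrix factorization can be sketched as follows. The notation is assumed here and does not come from the slides: R is the ratings matrix with m users and n items, X and Y are the user and item factor matrices of rank k, and x_u and y_i are their rows.

```latex
\[
R \approx X Y^{\top}, \qquad
X \in \mathbb{R}^{m \times k}, \quad Y \in \mathbb{R}^{n \times k}
\]
\[
\hat{r}_{ui} = x_u^{\top} y_i = \sum_{f=1}^{k} x_{uf}\, y_{if}
\]
```

Each predicted entry is the dot product of one user factor vector and one item factor vector, which is why the sparse ratings matrix can be filled in from two much smaller dense matrices.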
Collaborative Filtering
Prediction
[Figure: the sparse ratings matrix and two factor matrices from the Matrix Factorization slide, repeated to illustrate prediction]
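A minimal sketch of prediction with factor vectors: the predicted preference is the dot product of a user vector and an item vector, and recommendations are the items with the highest scores. The vectors, item ids, and function names below are hypothetical illustrations, not values from the slides.

```python
def predict(user_vector, item_vector):
    """Predicted preference: dot product of the two factor vectors."""
    return sum(u * v for u, v in zip(user_vector, item_vector))


def recommend(user_vector, item_vectors, n):
    """Return the n item ids with the highest predicted preference."""
    ranked = sorted(
        item_vectors,
        key=lambda item_id: predict(user_vector, item_vectors[item_id]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical rank-3 factors for one user and three items.
user = [1.0, 0.5, 2.0]
items = {
    "a": [0.5, 2.0, 1.0],
    "b": [1.0, 0.0, 0.5],
    "c": [2.0, 1.0, 1.0],
}
score_a = predict(user, items["a"])
top_two = recommend(user, items, 2)
print(score_a, top_two)
```

Scoring every item this way is a brute-force scan. The later slides move the same dot product into the search engine so that filtering and ranking happen in one place.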
Alternating Least Squares
Loading Data
Alternating Least Squares
Implicit Preference Data
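For reference, the implicit-feedback formulation from the paper listed in the references ("Collaborative Filtering for Implicit Feedback Datasets") can be sketched as below. Here r_ui is the raw interaction strength, p_ui the binary preference, c_ui the confidence, and x_u and y_i the user and item factor vectors; the symbols are the paper's conventions, not the slides'. In Spark ML's ALS this mode is enabled with the implicitPrefs parameter, and alpha corresponds to α. Spark's implementation may differ in details such as how the regularization term is scaled.

```latex
\[
p_{ui} =
\begin{cases}
1 & \text{if } r_{ui} > 0 \\
0 & \text{otherwise}
\end{cases}
\qquad
c_{ui} = 1 + \alpha\, r_{ui}
\]
\[
\min_{x, y} \; \sum_{u, i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
+ \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
\]
```

Unlike the explicit case, the sum runs over all user and item pairs: a missing interaction is treated as a preference of zero held with low confidence, not as an unknown value.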
Deploying & Scoring Recommendation Models
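Prelude: Search
Full-text Search & Similarity
[Figure: the query “cat videos” passes through analysis into a term vector, is scored by similarity against the document term vectors, and the ranked results are sorted. Stage labels: Analysis, Term vectors, Similarity, Scoring, Ranking, Sort results]
Query term vector:
0 1 ⋯ 1 0 ⋯
Document term vectors:
0 0 ⋯ 0 1 ⋯
0 1 ⋯ 1 1 ⋯
1 1 ⋯ 0 0 ⋯
1 0 ⋯ 0 1 ⋯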
Recommendation
Can we use the same machinery?
[Figure: the search pipeline (Analysis, Term vectors, Similarity, Scoring, Ranking, Sort results) with a user (or item) vector in place of the query; the step that turns the vector into a query is marked “?”]
User (or item) vector:
1.2 ⋯ −0.2 0.3
Dot product & cosine similarity… the same as we need for recommendations!
Elasticsearch Term Vectors
Delimited Payload Filter
Raw vector
1.2 ⋯ −0.2 0.3
Term vector with payloads
0|1.2 ⋯ 3|-0.2 4|0.3
Custom analyzer
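A minimal sketch of the encoding shown above: each vector element becomes an "index|value" token, and a custom analyzer keeps the tokens whole and stores the value as a payload. The helper names and the analyzer and filter names (payload_analyzer, vector_payloads) are hypothetical. The settings assume the delimited_payload_filter token filter type with its delimiter and encoding options, as named in Elasticsearch releases from the time of the talk; later releases rename the filter, so check the documentation for the version in use.

```python
import json


def to_payload_string(vector, delimiter="|"):
    """Encode a dense vector as space-separated 'index|value' tokens."""
    return " ".join(f"{i}{delimiter}{v}" for i, v in enumerate(vector))


def from_payload_string(text, delimiter="|"):
    """Inverse of to_payload_string, returning {index: value}."""
    pairs = (token.split(delimiter) for token in text.split())
    return {int(i): float(v) for i, v in pairs}


# Hypothetical index settings. The whitespace tokenizer keeps each
# 'index|value' token intact; the payload filter stores the part after
# the delimiter as a float payload on the term.
INDEX_SETTINGS = {
    "analysis": {
        "filter": {
            "vector_payloads": {
                "type": "delimited_payload_filter",
                "delimiter": "|",
                "encoding": "float",
            }
        },
        "analyzer": {
            "payload_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["vector_payloads"],
            }
        },
    }
}

item_factors = [1.2, 0.5, 0.0, -0.2, 0.3]  # hypothetical factor vector
encoded = to_payload_string(item_factors)
print(encoded)
print(json.dumps(INDEX_SETTINGS, indent=2))
```

Using the vector index as the term means the scoring side can look up the payload for position i directly, without parsing the stored string again.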
Elasticsearch Scoring
Custom scoring function
• Native script (Java), compiled for speed
• Scoring function computes dot product by:
  • For each document vector index (“term”), retrieve payload
  • score += payload * query(i)
• Normalize with query vector norm and document vector norm for cosine similarity (“similar items”)
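The arithmetic of the scoring function can be sketched in a few lines. This is a plain Python illustration of the dot product and cosine steps described above, not the Java plugin's code; the function names and the sample vectors are hypothetical.

```python
import math


def parse_payloads(text, delimiter="|"):
    """Read 'index|value' tokens into {index: payload}."""
    pairs = (token.split(delimiter) for token in text.split())
    return {int(i): float(v) for i, v in pairs}


def dot_product(doc_payloads, query):
    """For each document vector index, multiply its payload by query[i]."""
    score = 0.0
    for i, payload in doc_payloads.items():
        if i < len(query):
            score += payload * query[i]
    return score


def cosine_similarity(doc_payloads, query):
    """Dot product normalized by the document and query vector norms."""
    doc_norm = math.sqrt(sum(p * p for p in doc_payloads.values()))
    query_norm = math.sqrt(sum(q * q for q in query))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # avoid division by zero for empty or all-zero vectors
    return dot_product(doc_payloads, query) / (doc_norm * query_norm)


doc = parse_payloads("0|1.0 1|2.0 2|2.0")  # hypothetical stored item vector
query = [2.0, 1.0, 2.0]                    # hypothetical user (or item) vector
dot = dot_product(doc, query)
cos = cosine_similarity(doc, query)
print(dot, cos)
```

The unnormalized dot product suits user-to-item recommendations, while the normalized form gives the "similar items" ranking mentioned above.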
Recommendation
Can we use the same machinery?
[Figure: the search pipeline with factor vectors. The delimited payload filter covers Analysis and Term vectors; the custom scoring function covers Scoring and Ranking; results are then sorted]
User (or item) vector:
1.2 ⋯ −0.2 0.3
Stored factor vectors:
−1.1  1.3 ⋯  0.4
 1.2 −0.2 ⋯  0.3
 0.5  0.7 ⋯ −1.3
 0.9  1.4 ⋯ −0.8
Elasticsearch Scoring
We get search engine functionality for free!
Alternating Least Squares
Deploying to Elasticsearch
Monitoring & Feedback
Demo
References
• Elasticsearch
• Elasticsearch Spark Integration
• Spark ML ALS for Collaborative Filtering
• Collaborative Filtering for Implicit Feedback Datasets
• Elasticsearch Term Vectors & Payloads
• Delimited Payload Filter
• Vector Scoring Plugin
• Kibana
Thanks!
https://github.com/MLnick/sseu16-meetup
https://github.com/MLnick/elasticsearch-vector-scoring