Playlist Recommendations @ Spotify
Nikhil Tibrewal
@nikhil_tibrewal
Who am I?
Nikhil Tibrewal (Nick-hill)
● Data Engineer on the Lambda squad (Spotify’s primary ML team)
● Graduated from Carnegie Mellon University in Dec 2013
● B.Sc. in Computer Science with an additional major in Economics
● Part of the Spotify band for ~1.5 years
● Worked on a range of projects, primarily Playlist Recommendations
Spotify in numbers
● Started in 2006, 58 markets
● 75M+ active users, 20M+ paying
● 30M+ songs, 20K new per day
● 1.5+ billion playlists
● 1 TB of logs per day
Recommendations so far on Spotify
● Discover tab
● Radio
● Related Artists
● Discover Weekly
● Playlist recs on “Now” Strip
[Figure: example for Ellie Goulding]
“Now” Strip
Human curated playlist
Recommended playlist
But… how are playlist recs generated?
Quick Overview!
● Recommend only human curated playlists (1000+)
  ○ Well-designed cover images
  ○ Thorough descriptions
  ○ Title reflects content
[Figure: examples labeled “Good” and “Bad”]
Quick Overview!
● Recommendations pipeline: Candidate Generation
  ○ Generate N-dimensional track vectors from collaborative filtering
  ○ Vectorize playlists:
    ■ Playlist vector derived from the track vectors in the playlist
  ○ Use Annoy to store playlist vectors in N-dimensional space
    ■ ANNOY (Approximate Nearest Neighbors Oh Yeah), created at Spotify
    ■ https://github.com/spotify/annoy
  ○ Vectorize user taste as well:
    ■ User vector derived from user listening history
  ○ User and playlist vectors are in the same space!
  ○ Query the Annoy tree for the playlists nearest to the user
annoyTree.getNearest(seedVector, K)
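The candidate generation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides say playlist and user vectors are "derived from" track vectors without giving the formula, so the element-wise mean used here is an assumption, and the exact brute-force cosine search is a stand-in for the approximate lookup Annoy would perform. All track names, playlist names, and vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (assumed derivation)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_playlists(seed_vector, playlist_vectors, k):
    """Exact top-k by cosine similarity (brute-force stand-in for Annoy)."""
    ranked = sorted(
        playlist_vectors,
        key=lambda pid: cosine(seed_vector, playlist_vectors[pid]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 2-dimensional track vectors from collaborative filtering.
track_vectors = {
    "rock_a": [1.0, 0.0],
    "rock_b": [0.8, 0.2],
    "jazz_a": [0.0, 1.0],
    "jazz_b": [0.2, 0.8],
}
playlists = {
    "rock_classics": ["rock_a", "rock_b"],
    "late_night_jazz": ["jazz_a", "jazz_b"],
    "mixed_bag": ["rock_a", "jazz_a"],
}

# Playlists and users end up in the same space as the tracks.
playlist_vectors = {
    pid: mean_vector([track_vectors[t] for t in tracks])
    for pid, tracks in playlists.items()
}
listening_history = ["rock_a", "rock_b", "rock_a"]
user_vector = mean_vector([track_vectors[t] for t in listening_history])

recs = nearest_playlists(user_vector, playlist_vectors, k=2)
```

Because users and playlists share one vector space, the same query works with any seed vector, which is what makes a single nearest-neighbor index reusable across users.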
Quick Overview!
● Recommendations pipeline: Ranking Model
  ○ Use genre information, demographics data, and playlist popularity data to further rank recommendations
    ■ John: 21, USA, likes rock
    ■ Should get rock playlist recs that are popular in the USA and amongst 21-year-olds
  ○ Apply post-processing steps to shuffle and add variety, avoiding repetitions
90% of DAUs (daily active users) have recs!
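A rough sketch of what such a ranking step might look like, using the John example from the slide. The slides do not describe the actual model, so everything here is assumed for illustration: the linear score, the weights, the age buckets, and the field names are all hypothetical. Post-processing is reduced to a simple "drop what was shown recently" filter, and shuffling is left out so the result stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    playlist_id: str
    similarity: float  # score from candidate generation
    genre: str
    # Hypothetical: popularity in [0, 1] keyed by (country, age_bucket).
    popularity: dict = field(default_factory=dict)


def score(candidate, user, w_sim=0.5, w_genre=0.3, w_pop=0.2):
    """Weighted sum of similarity, genre match, and segment popularity."""
    genre_match = 1.0 if candidate.genre in user["genres"] else 0.0
    segment = (user["country"], user["age_bucket"])
    segment_popularity = candidate.popularity.get(segment, 0.0)
    return (w_sim * candidate.similarity
            + w_genre * genre_match
            + w_pop * segment_popularity)


def rank(candidates, user, recently_shown=frozenset(), n=10):
    """Drop recently shown playlists, then return the top-n ids by score."""
    fresh = [c for c in candidates if c.playlist_id not in recently_shown]
    fresh.sort(key=lambda c: score(c, user), reverse=True)
    return [c.playlist_id for c in fresh[:n]]


john = {"country": "US", "age_bucket": "18-24", "genres": {"rock"}}
candidates = [
    Candidate("jazz_us_hits", 0.90, "jazz", {("US", "18-24"): 0.9}),
    Candidate("rock_se_hits", 0.85, "rock", {("SE", "18-24"): 0.9}),
    Candidate("rock_us_hits", 0.80, "rock", {("US", "18-24"): 0.9}),
    Candidate("rock_seen", 0.95, "rock", {("US", "18-24"): 1.0}),
]
ranked = rank(candidates, john, recently_shown={"rock_seen"}, n=3)
```

In this toy data the rock playlist that is popular in John's segment overtakes candidates with higher raw similarity, which is the effect the slide describes.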
Quick Overview!
● Infrastructure
  ○ Luigi to manage workflows (also built at Spotify)
  ○ Entire pipeline written in Scalding
  ○ Hadoop cluster with 1200+ nodes to run jobs
  ○ Cassandra (~a dozen nodes for playlist recs)
  ○ Java backend microservices serving recs
Quick Overview!
"Scalding is comprised of a DSL (domain-specific language) that makes MapReduce computations look like Scala’s collection API and is a wrapper for Cascading to make it easy to define jobs, test and data sources on an HDFS" (http://cascading.io/customer/twitter/)
Scalding w.r.t. Playlist Recs
● Used Python back in the day
  ○ Inputs and outputs were tab-separated
  ○ Complexity UP => difficulty of maintenance UP
  ○ Hard to write tests
● Scalding provided compile-time error checks
  ○ Catch errors early
  ○ Define schemas (e.g. Avro)
● Can use Parquet + Avro for input/output
  ○ Easy to write and read data
  ○ Records with a lot of fields!
  ○ Lesson: Parquet hurts performance w/ fat columns (nested data structs)
Scalding w.r.t. Playlist Recs
● Data quality
  ○ Hadoop counter wrappers in extended Scalding library code
  ○ Verify that counters are within reasonable ranges
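The counter check can be illustrated with a small Python sketch. It does not reproduce the extended Scalding library mentioned above; the function name, counter names, and bounds are all hypothetical, and the counters are assumed to have already been read from a finished job into a plain dictionary.

```python
def check_counters(counters, expected_ranges):
    """Return a list of violations; an empty list means every counter is sane."""
    problems = []
    for name, (low, high) in expected_ranges.items():
        value = counters.get(name)
        if value is None:
            problems.append(f"{name}: counter missing")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems


# Hypothetical counters collected from a finished job.
job_counters = {"playlists_read": 1200, "users_scored": 0}
expected = {
    "playlists_read": (1000, 5000),
    "users_scored": (1, 10**9),
    "vectors_written": (1, 10**9),
}
violations = check_counters(job_counters, expected)
```

A workflow could fail the run when the list is non-empty, so that bad output never reaches the serving layer.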
Scalding w.r.t. Playlist Recs
● Pipeline tolerance
  ○ Job failures are normal, and annoying with big jobs
  ○ Scalding checkpoints
  ○ Lesson: a checkpoint is itself a map-reduce job and has the same caveats
  ○ Still very helpful!
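The idea behind checkpointing can be shown with a minimal Python sketch. This is not Scalding's checkpoint API; it only illustrates the pattern of persisting an intermediate result so that a rerun after a failure skips the expensive step. The helper name, file name, and data are hypothetical.

```python
import json
import os
import tempfile


def checkpointed(path, compute):
    """Return the saved result at `path` if present, else compute and save it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(result, f)
    # Rename into place so a crash mid-write does not leave a partial checkpoint.
    os.replace(tmp_path, path)
    return result


calls = []


def expensive_step():
    """Stand-in for a long-running pipeline stage; records each invocation."""
    calls.append(1)
    return {"rock_classics": [0.9, 0.1]}


with tempfile.TemporaryDirectory() as workdir:
    checkpoint_path = os.path.join(workdir, "playlist_vectors.json")
    first = checkpointed(checkpoint_path, expensive_step)
    second = checkpointed(checkpoint_path, expensive_step)  # simulated rerun
```

Writing the checkpoint is extra work in its own right (here a serialization and a disk write), which mirrors the lesson above that a checkpoint carries the same costs and caveats as any other job.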
Scalding w.r.t. Playlist Recs
● Job runtimes
  ○ Common solutions: more reducers and code optimizations
  ○ Speculative execution for larger jobs
  ○ Caveat: can take up unnecessary resources
Scalding w.r.t. Playlist Recs
● Memory issues
  ○ Used Sparkey indices in Python (developed at Spotify, now open source)
    ■ “Simple constant key/value storage lib for read-heavy systems with infrequent large bulk inserts”
    ■ Replicated to all mappers
  ○ Complex jobs in Scalding => higher memory config for jobs with Sparkey
  ○ Lesson: replace the Sparkey lookup with a join, saving memory at the cost of MAYBE a little more time
bigPipe.join(exSparkeyPipe)
https://github.com/spotify/sparkey
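The memory-versus-time trade-off can be illustrated by contrasting the two join styles in plain Python. This is a conceptual sketch only: it uses neither Sparkey nor Scalding, a dictionary stands in for the in-memory index, the side table is assumed to have unique keys, and all names and data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def lookup_join(big_records, side_table):
    """Lookup style: the whole side table is held in memory by every worker."""
    return [
        (key, value, side_table[key])
        for key, value in big_records
        if key in side_table
    ]


def sort_merge_join(big_records, side_records):
    """Join style: tag rows, sort by key, then merge one key group at a time.

    In a MapReduce job the shuffle performs this sort out of core; here it
    happens in memory purely to show the logic.
    """
    tagged = [(k, 0, v) for k, v in side_records]
    tagged += [(k, 1, v) for k, v in big_records]
    tagged.sort(key=itemgetter(0, 1))  # side row sorts first within each key
    joined = []
    for key, group in groupby(tagged, key=itemgetter(0)):
        side_value, has_side = None, False
        for _, tag, value in group:
            if tag == 0:
                side_value, has_side = value, True
            elif has_side:
                joined.append((key, value, side_value))
    return joined


# Hypothetical data: (track_id, user_id) plays and a track -> genre table.
plays = [("t2", "user_a"), ("t1", "user_b"), ("t3", "user_a")]
track_genre = {"t1": "rock", "t2": "jazz"}

via_lookup = lookup_join(plays, track_genre)
via_join = sort_merge_join(plays, list(track_genre.items()))
```

Both approaches produce the same rows, only in a different order. The lookup needs the side table resident in memory on every worker, while the join pays for a sort and shuffle instead.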
Scalding w.r.t. Playlist Recs
● Driven
  ○ “A sophisticated tool that collects telemetry data from running Scalding / Cascading jobs on a cluster and presenting them in an intriguing User Interface.”
  ○ http://cascading.io/
Scalding w.r.t. Playlist Recs
● Other awesome benefits
  ○ Active community + big players
  ○ Data pipeline flows naturally follow the functional paradigm: you are essentially writing Scala code
Scalding w.r.t. Playlist Recs
Productivity without sacrificing performance!
Status: Completed
Spotify is hiring!
Nikhil Tibrewal
@nikhil_tibrewal