June 5, 2013
Wouter de Bie
Team Lead, Data Infrastructure
Hadoop at Spotify
Wednesday, June 5, 13
Witaj Polsko, Spotify już gra! (Hello Poland, Spotify is now playing!)
Agenda
• Why data?
• Why Hadoop?
• Use cases
• Infrastructure overview
• Map/Reduce in different languages
• Scheduling
• Scaling Hadoop
• Lessons learned
Spotify? Spotify!
Some context
• Spotify started in 2006
• Now 850+ employees, 250+ engineers
• 26 million monthly active users
• 20+ million tracks available
• 12 data engineers building a platform for easy access to data
Why data?
We play music, right?
Listening behavior in Sweden
[Chart: plays per hour as a percentage of weekly plays (0.0% to 1.6%), Monday 00:00 through Sunday 22:00, broken down by Swedish age groups: 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60+]
Listening behavior in Spain
[Chart: plays per hour as a percentage of weekly plays (0.0% to 1.4%), Monday 00:00 through Sunday 22:00, broken down by Spanish age groups: 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60+]
Impact of Hurricane Sandy (29 October 2012)
Impact of Hurricane Sandy (30 October 2012)
Why data?
To get more insights. The age-group example above could be used for ad targeting, for instance.
Why Hadoop?
Millions of daily active users == LOTS OF DATA
• Too much data to fit into an RDBMS (or too costly)
• Hadoop provides good value for money
• Runs on "commodity" hardware
• Scales pretty well
Use Cases
We're a data-driven company, so data is used almost everywhere:
• Reporting
• Business analytics
• Operational analytics
• In-client features
Reporting
• Reporting to labels, licensors, partners and advertisers from day 1
• In 2008 an RDBMS would have worked for a while, but labels wanted very granular data, so we went with Hadoop instead
Business Analytics
• Analyzing growth, user behavior, sign-up funnels, etc.
• Company KPIs
• A/B testing
• NPS analysis
• Segmentation analysis
Operational metrics
• Root cause analysis
• Latency analysis
• Better capacity planning (servers, people, bandwidth)
Product features
• Radio
• Top lists
• Recommendations (better than external parties, because of the amount of data)
Some geeky numbers
• 600 GB of compressed data from users per day
• 150 GB of data from services per day
• 4 TB of data generated in Hadoop each day
• 190-node Hadoop cluster (12-core CPUs, 32 GB RAM, 24 TB disk space)
• Soon 690 nodes (12-core CPUs, 64 GB RAM, 48 TB disk space)
• 4 PB of storage capacity (soon 28 PB)
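The capacity figures follow from node count times per-node disk. A back-of-the-envelope sketch (the 3x replication factor is the general HDFS default, not a quoted Spotify setting, and the slide's PB figures presumably deduct some overhead):

```python
# Back-of-the-envelope check of the cluster figures quoted above.
raw_tb = 190 * 24            # current: 190 nodes x 24 TB each = 4560 TB (~4.5 PB raw)
planned_raw_tb = 690 * 48    # planned: 690 nodes x 48 TB each = 33120 TB (~33 PB raw)

# HDFS replicates each block 3x by default, so unique-data capacity
# is roughly a third of raw disk (before any operational reserve).
unique_pb = planned_raw_tb / 3 / 1024
print(raw_tb, planned_raw_tb, round(unique_pb, 1))
```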
Our data infrastructure
Spotify's data infrastructure
[Diagram: backend services, HDFS, Map/Reduce jobs, Hive, Luigi scheduler, operational databases, analytical databases, reporting, dashboards, product features]
Map/Reduce languages
• Python with Hadoop Streaming
  Pros: fast development, many Spotify libraries available
  Cons: slower than Java, no access to the Hadoop API
• Java
  Pros: fast, access to the Hadoop API
  Cons: verbose language, not many Spotify libraries available
• Pig
  Pros: very small scripts, faster than streaming
  Cons: yet another language to learn, not many Spotify libraries available
• Hive
  Pros: SQL-like syntax (easy for non-programmers) and a relational data model
  Cons: more moving parts (not well suited for a whole pipeline)
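A Hadoop Streaming job is just two programs that read stdin and write tab-separated key/value lines; Hadoop sorts the mapper output by key before the reducer sees it. A minimal sketch, using a hypothetical play-count job (the log format and script name are illustrative, not Spotify's actual code):

```python
import sys
from itertools import groupby

def mapper(lines):
    # Each input line is assumed to look like "user_id<TAB>track_id";
    # emit (track_id, 1) for every playback.
    for line in lines:
        user_id, track_id = line.rstrip("\n").split("\t")
        yield track_id, 1

def reducer(pairs):
    # Hadoop delivers mapper output sorted by key, so equal keys are
    # adjacent and groupby can sum the total plays per track.
    for track_id, group in groupby(pairs, key=lambda kv: kv[0]):
        yield track_id, sum(count for _, count in group)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Invoked by Streaming as e.g.:
    #   hadoop jar hadoop-streaming.jar -mapper "playcount.py map" \
    #       -reducer "playcount.py reduce" -input ... -output ...
    if sys.argv[1] == "map":
        for key, value in mapper(sys.stdin):
            print(f"{key}\t{value}")
    else:
        pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for key, total in reducer((k, int(v)) for k, v in pairs):
            print(f"{key}\t{total}")
```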
Some technical details
Technology we use in combination with Hadoop
• Java, Pig, Hive and Python for writing MapReduce 1.0 jobs
• Apache Kafka for log collection (in the process of replacing a batched transfer system)
• Apache Cassandra (no HBase ;))
• PostgreSQL
• Sqoop for database import/export
• Avro as storage format
Scheduling
We wrote and open-sourced our own scheduler: Luigi
• Nothing suitable was out there
• https://github.com/spotify/luigi
• Written in Python
• A generic scheduler and dependency system that supports Python M/R, Pig and Hive
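Luigi's core model can be sketched in a few lines: a task declares what it requires and whether its output already exists, and the scheduler runs tasks in dependency order, skipping completed ones. This is a toy illustration of that model, not Luigi's real API:

```python
class Task:
    """Toy version of a Luigi-style task."""
    def requires(self):
        return []            # upstream tasks this one depends on
    def complete(self):
        return False         # in Luigi: does the output target already exist?
    def run(self):
        raise NotImplementedError

def build(task, done=None):
    # Depth-first: run all requirements before the task itself,
    # skipping anything already complete, so re-runs are idempotent.
    done = done if done is not None else set()
    if task in done or task.complete():
        return
    for dep in task.requires():
        build(dep, done)
    task.run()
    done.add(task)
```

In the real Luigi, `complete()` checks a target (an HDFS path, a database row) rather than a flag, which is what lets a failed pipeline be re-run from where it stopped.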
Scaling Hadoop at Spotify
Our journey
• Started with a small (scrap metal) cluster of 37 servers
• Moved to Amazon Elastic MapReduce (EMR) and S3 to scale quickly
• Built an in-house cluster of 60 nodes because of EMR costs
• Capacity planning every 6 months; grown to 190 nodes today
• Just ordered 500 more nodes
• Put in place a data-retention policy and a data archive
Lessons learned
4+ years of Hadoop taught us
• Hadoop has brought us very far; we would never be able to handle the current volume with a "cheap" RDBMS
• "Commodity hardware" doesn't mean cheap hardware
• Hadoop isn't a silver bullet
• Hadoop is a complex system that needs love and care
• You will have to extend Hadoop (and ecosystem components) to tailor it to your needs
• Hadoop is fun!
Questions?
One more thing...
What's your biggest annoyance when running the following?

    hadoop fs -ls /
It's slow...
Why?
• JVM startup time
• Loading a lot of JAR files (Hadoop, logging, other stuff you don't need)
• Problematic for us, since we call Hadoop from Python
Welcome Snakebite!
A pure Python HDFS client
• Both a client library and a command-line tool
• Uses Protocol Buffers to communicate with the NameNode
• Much faster than command-line hadoop
• Tab completion included!
• Available at http://github.com/spotify/snakebite
How much faster?
wouter@foo:~$ time for i in {1..10}; do hadoop fs -ls /; done

real    0m14.464s
user    0m21.761s
sys     0m1.148s

wouter@foo:~$ time for i in {1..10}; do snakebite ls /; done

real    0m1.639s
user    0m1.072s
sys     0m0.160s
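From the wall-clock totals above (10 invocations each), the per-call cost and the speedup work out to:

```python
hadoop_total = 14.464     # seconds for 10 runs of `hadoop fs -ls /`
snakebite_total = 1.639   # seconds for 10 runs of `snakebite ls /`

per_call_hadoop = hadoop_total / 10        # ~1.45 s per listing
per_call_snakebite = snakebite_total / 10  # ~0.16 s per listing
speedup = hadoop_total / snakebite_total   # ~8.8x faster
print(round(per_call_hadoop, 2), round(per_call_snakebite, 2), round(speedup, 1))
```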
Want to join the band?
Check out http://www.spotify.com/jobs or @Spotifyjobs for more information.
Come talk to Adam or me after the meetup!
Or mail: [email protected], or Twitter: @xinit