June 5, 2013
Wouter de Bie
Team Lead, Data Infrastructure
Hadoop at Spotify
Wednesday, June 5, 13
Witaj Polsko, Spotify już gra! (Hello Poland, Spotify is now playing!)
Agenda
• Why data?
• Why Hadoop?
• Use cases
• Infrastructure overview
• Map/Reduce in different languages
• Scheduling
• Scaling Hadoop
• Lessons learned
Spotify? Spotify!
Some context
• Spotify started in 2006
• Now 850+ employees, 250+ engineers
• 26 million monthly active users
• 20+ million tracks available
• 12 data engineers building a platform for easy access to data
Why data?
We play music, right?
Listening behavior in Sweden
[Chart: plays per hour as a percentage of weekly plays (0.0% to 1.6%), Monday 00:00 through Sunday 22:00, broken down by Swedish age groups: 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60+]
Listening behavior in Spain
[Chart: plays per hour as a percentage of weekly plays (0.0% to 1.4%), Monday 00:00 through Sunday 22:00, broken down by Spanish age groups: 0-17, 18-22, 23-27, 28-34, 35-44, 45-59, 60+]
Impact of Hurricane Sandy (29 October 2012)
Impact of Hurricane Sandy (30 October 2012)
Why data?
To get more insights. The age-group example above could be used for ad targeting, for instance.
Why Hadoop?
Millions of daily active users == LOTS OF DATA
• Too much data to fit into an RDBMS (or too costly)
• Hadoop provides good value for money
• Runs on "commodity" hardware
• Scales pretty well
Use Cases
We're a data-driven company, so data is used almost everywhere:
• Reporting
• Business analytics
• Operational analytics
• In-client features
Reporting
• Reporting to labels, licensors, partners and advertisers from day 1
• In 2008 an RDBMS would have worked for a while, but labels wanted very granular data, so we went with Hadoop instead
Business Analytics
• Analyzing growth, user behavior, sign-up funnels, etc.
• Company KPIs
• A/B testing
• NPS analysis
• Segmentation analysis
Operational metrics
• Root cause analysis
• Latency analysis
• Better capacity planning (servers, people, bandwidth)
Product features
• Radio
• Top lists
• Recommendations (better than external parties, because of the amount of data)
Some geeky numbers
• 600 GB of compressed data from users per day
• 150 GB of data from services per day
• 4 TB of data generated in Hadoop each day
• 190-node Hadoop cluster (12-core CPUs, 32 GB RAM, 24 TB disk space)
• Soon 690 nodes (12-core CPUs, 64 GB RAM, 48 TB disk space)
• 4 PB of storage capacity (soon 28 PB)
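The capacity figures follow from node count times per-node disk. A back-of-the-envelope sketch (the 3x replication factor is the general HDFS default, not a quoted Spotify setting, and the slide's PB figures presumably deduct some overhead):

```python
# Back-of-the-envelope check of the cluster figures quoted above.
raw_tb = 190 * 24            # current: 190 nodes x 24 TB each = 4560 TB (~4.5 PB raw)
planned_raw_tb = 690 * 48    # planned: 690 nodes x 48 TB each = 33120 TB (~33 PB raw)

# HDFS replicates each block 3x by default, so unique-data capacity
# is roughly a third of raw disk (before any operational reserve).
unique_pb = planned_raw_tb / 3 / 1024
print(raw_tb, planned_raw_tb, round(unique_pb, 1))
```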
Our data infrastructure
Spotify's data infrastructure
[Diagram: backend services, HDFS, Map/Reduce jobs, Hive, Luigi scheduler, operational databases, analytical databases, reporting, dashboards, product features]
Map/Reduce languages
• Python with Hadoop Streaming
  Pros: fast development, many Spotify libraries available
  Cons: slower than Java, no access to the Hadoop API
• Java
  Pros: fast, access to the Hadoop API
  Cons: verbose language, not many Spotify libraries available
• Pig
  Pros: very small scripts, faster than streaming
  Cons: yet another language to learn, not many Spotify libraries available
• Hive
  Pros: SQL-like syntax (easy for non-programmers) and a relational data model
  Cons: more moving parts (not well suited for a whole pipeline)
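A Hadoop Streaming job is just two programs that read stdin and write tab-separated key/value lines; Hadoop sorts the mapper output by key before the reducer sees it. A minimal sketch, using a hypothetical play-count job (the log format and script name are illustrative, not Spotify's actual code):

```python
import sys
from itertools import groupby

def mapper(lines):
    # Each input line is assumed to look like "user_id<TAB>track_id";
    # emit (track_id, 1) for every playback.
    for line in lines:
        user_id, track_id = line.rstrip("\n").split("\t")
        yield track_id, 1

def reducer(pairs):
    # Hadoop delivers mapper output sorted by key, so equal keys are
    # adjacent and groupby can sum the total plays per track.
    for track_id, group in groupby(pairs, key=lambda kv: kv[0]):
        yield track_id, sum(count for _, count in group)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Invoked by Streaming as e.g.:
    #   hadoop jar hadoop-streaming.jar -mapper "playcount.py map" \
    #       -reducer "playcount.py reduce" -input ... -output ...
    if sys.argv[1] == "map":
        for key, value in mapper(sys.stdin):
            print(f"{key}\t{value}")
    else:
        pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for key, total in reducer((k, int(v)) for k, v in pairs):
            print(f"{key}\t{total}")
```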
Some technical details
Technology we use in combination with Hadoop
• Java, Pig, Hive and Python for writing MapReduce 1.0 jobs
• Apache Kafka for log collection (in the process of replacing a batched transfer system)
• Apache Cassandra (no HBase ;))
• PostgreSQL
• Sqoop for database import/export
• Avro as storage format
Scheduling
We wrote and open-sourced our own scheduler: Luigi
• Nothing suitable was out there
• https://github.com/spotify/luigi
• Written in Python
• A generic scheduler and dependency system that supports Python M/R, Pig and Hive
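Luigi's core model can be sketched in a few lines: a task declares what it requires and whether its output already exists, and the scheduler runs tasks in dependency order, skipping completed ones. This is a toy illustration of that model, not Luigi's real API:

```python
class Task:
    """Toy version of a Luigi-style task."""
    def requires(self):
        return []            # upstream tasks this one depends on
    def complete(self):
        return False         # in Luigi: does the output target already exist?
    def run(self):
        raise NotImplementedError

def build(task, done=None):
    # Depth-first: run all requirements before the task itself,
    # skipping anything already complete, so re-runs are idempotent.
    done = done if done is not None else set()
    if task in done or task.complete():
        return
    for dep in task.requires():
        build(dep, done)
    task.run()
    done.add(task)
```

In the real Luigi, `complete()` checks a target (an HDFS path, a database row) rather than a flag, which is what lets a failed pipeline be re-run from where it stopped.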
Scaling Hadoop at Spotify
Our journey
• Started with a small (scrap metal) cluster of 37 servers
• Moved to Amazon Elastic MapReduce (EMR) and S3 to scale quickly
• Built an in-house cluster of 60 nodes because of EMR costs
• Capacity planning every 6 months; grown to 190 nodes today
• Just ordered 500 more nodes
• Put in place a data-retention policy and a data archive
Lessons learned
4+ years of Hadoop taught us
• Hadoop has brought us very far; we would never be able to handle the current volume with a "cheap" RDBMS
• "Commodity hardware" doesn't mean cheap hardware
• Hadoop isn't a silver bullet
• Hadoop is a complex system that needs love and care
• You will have to extend Hadoop (and ecosystem components) to tailor it to your needs
• Hadoop is fun!
Questions?
One more thing...
What's your biggest annoyance when running the following?

    hadoop fs -ls /
It's slow...
Why?
• JVM startup time
• Loading a lot of JAR files (Hadoop, logging, other stuff you don't need)
• Problematic for us, since we call Hadoop from Python
Welcome Snakebite!
A pure Python HDFS client
• Both a client library and a command-line tool
• Uses Protocol Buffers to communicate with the NameNode
• Much faster than command-line hadoop
• Tab completion included!
• Available at http://github.com/spotify/snakebite
How much faster?
wouter@foo:~$ time for i in {1..10}; do hadoop fs -ls /; done

real    0m14.464s
user    0m21.761s
sys     0m1.148s

wouter@foo:~$ time for i in {1..10}; do snakebite ls /; done

real    0m1.639s
user    0m1.072s
sys     0m0.160s
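From the wall-clock totals above (10 invocations each), the per-call cost and the speedup work out to:

```python
hadoop_total = 14.464     # seconds for 10 runs of `hadoop fs -ls /`
snakebite_total = 1.639   # seconds for 10 runs of `snakebite ls /`

per_call_hadoop = hadoop_total / 10        # ~1.45 s per listing
per_call_snakebite = snakebite_total / 10  # ~0.16 s per listing
speedup = hadoop_total / snakebite_total   # ~8.8x faster
print(round(per_call_hadoop, 2), round(per_call_snakebite, 2), round(speedup, 1))
```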
Want to join the band?
Check out http://www.spotify.com/jobs or @Spotifyjobs for more information.
Come talk to Adam or me after the meetup!
Or mail: [email protected], or Twitter: @xinit