The Evolution of Hadoop at Spotify Through Failures and Pain Josh Baer ([email protected]) Rafal Wojdyla ([email protected]) 1 Note: Our views are our own and don't necessarily represent those of Spotify.
Page 1: The Evolution of Hadoop at Spotify - Through Failures and Pain

The Evolution of Hadoop at Spotify Through Failures and Pain

Josh Baer ( [email protected]) Rafal Wojdyla ([email protected])

1

Note: Our views are our own and don't necessarily represent those of Spotify.

Page 2: The Evolution of Hadoop at Spotify - Through Failures and Pain

2

Overview

• Growing Pains (2009-2012)
• Gaining Focus (2013-2014)
• The Future (2015+)

Page 3: The Evolution of Hadoop at Spotify - Through Failures and Pain

3

Our First Major Hadoop Bug

Page 4: The Evolution of Hadoop at Spotify - Through Failures and Pain

4

Cluster 1.0

Page 5: The Evolution of Hadoop at Spotify - Through Failures and Pain

What is Spotify?

• Music Streaming Service
• Browse and Discover Millions of Songs, Artists and Albums
• Launched in October 2008
• December 2014:
  • 60 Million Monthly Users
  • 15 Million Paid Subscribers

5

Page 6: The Evolution of Hadoop at Spotify - Through Failures and Pain

What is Spotify?

• Data Infrastructure:
  • 1300 Hadoop Nodes
  • 42 PB Storage
  • 20 TB data ingested via Kafka/day
  • 200 TB generated by Hadoop/day

6

Page 7: The Evolution of Hadoop at Spotify - Through Failures and Pain

7

select artist_id, count(1) from user_activities where play_seconds > 30 group by artist_id;
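The query above counts, per artist, the plays that lasted longer than 30 seconds. A minimal sketch of the same aggregation follows, run against an in-memory SQLite table rather than a cluster. Only the table and column names come from the slide; the sample rows and the use of SQLite are assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample rows: (artist_id, play_seconds). Not real Spotify data.
SAMPLE_PLAYS = [
    ("artist_a", 45),
    ("artist_a", 12),   # under 30 seconds, so not counted
    ("artist_a", 200),
    ("artist_b", 31),
    ("artist_c", 5),    # artist_c has no qualifying plays at all
]

# The query text is the one shown on the slide.
QUERY = """
    select artist_id, count(1)
    from user_activities
    where play_seconds > 30
    group by artist_id
"""


def plays_per_artist(rows):
    """Run the slide's query on an in-memory SQLite table and return a dict."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table user_activities (artist_id text, play_seconds integer)"
        )
        conn.executemany("insert into user_activities values (?, ?)", rows)
        return dict(conn.execute(QUERY).fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(plays_per_artist(SAMPLE_PLAYS))
```

Artists whose plays are all 30 seconds or shorter drop out of the result entirely, because the filter runs before the grouping.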

Page 9: The Evolution of Hadoop at Spotify - Through Failures and Pain

7

Page 10: The Evolution of Hadoop at Spotify - Through Failures and Pain

0 * * * *     spotify-core       hadoop jar hourly_import.jar
15 * * * *    spotify-core       hadoop jar hourly_listeners.jar
30 * * * *    spotify-analytics  hadoop jar user_funnel_hourly.jar
* 1 * * *     spotify-core       hadoop jar daily_aggregate.jar
* 2 * * *     spotify-core       hadoop jar calculate_royalties.jar
*/2 22 * * *  spotify-radio      hadoop jar generate_radio.jar

8

Page 12: The Evolution of Hadoop at Spotify - Through Failures and Pain

9

Handles the ‘plumbing’ for Hadoop jobs

https://github.com/spotify/luigi
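
Much of that 'plumbing' is dependency handling: rather than hoping a cron slot fires after its upstream job has finished, each job declares what it needs and is skipped once its output exists. The following is a stdlib-only sketch of that idea and is not Luigi's actual API. The `Task` class and `build` function are hypothetical, and the dependencies between the job names (borrowed from the crontab slide) are assumed for the example.

```python
class Task:
    """A unit of work that knows its upstream dependencies (hypothetical class)."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = tuple(requires)
        self.done = False  # stands in for "output already exists"

    def run(self):
        # A real job would launch Hadoop here; this sketch only marks completion.
        self.done = True


def build(task, log=None):
    """Run `task` after its dependencies, skipping anything already done.

    Returns the names of the tasks that actually ran, in execution order.
    There is no cycle detection in this sketch.
    """
    log = [] if log is None else log
    if task.done:
        return log
    for upstream in task.requires:
        build(upstream, log)
    task.run()
    log.append(task.name)
    return log


if __name__ == "__main__":
    # Assumed dependency chain, for illustration only.
    hourly_import = Task("hourly_import")
    hourly_listeners = Task("hourly_listeners", requires=[hourly_import])
    daily_aggregate = Task("daily_aggregate", requires=[hourly_listeners])
    print(build(daily_aggregate))
    print(build(daily_aggregate))  # second call finds everything done
```

Because completed tasks are skipped, re-running the same target after a partial failure only repeats the work that is still missing.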

Page 13: The Evolution of Hadoop at Spotify - Through Failures and Pain

10

Page 14: The Evolution of Hadoop at Spotify - Through Failures and Pain

10

Page 15: The Evolution of Hadoop at Spotify - Through Failures and Pain

11

Page 16: The Evolution of Hadoop at Spotify - Through Failures and Pain

To the Cloud!

11

Page 18: The Evolution of Hadoop at Spotify - Through Failures and Pain

12

# sudo addgroup hadoop
# sudo adduser --ingroup hadoop hdfs
# sudo adduser --ingroup hadoop yarn
# cp /tmp/configs/*.xml /etc/hadoop/conf/
# apt-get update
…
[hdfs@sj-hadoop-b20 ~] $ apt-get install hadoop-hdfs-datanode
…
[yarn@sj-hadoop-b20 ~] $ apt-get install hadoop-yarn-nodemanager

Page 20: The Evolution of Hadoop at Spotify - Through Failures and Pain

13

Automated Config Management

(via Puppet)

Page 21: The Evolution of Hadoop at Spotify - Through Failures and Pain

14

[data-sci@sj-edge-a1 ~] $ hdfs dfs -ls /data
Found 3 items
drwxr-xr-x   - hdfs hadoop          0 2015-01-01 12:00 lake
drwxr-xr-x   - hdfs hadoop          0 2015-01-01 12:00 pond
drwxr-xr-x   - hdfs hadoop          0 2015-01-01 12:00 ocean
[data-sci@sj-edge-a1 ~] $ hdfs dfs -ls /data/lake
Found 1 items
drwxr-xr-x   - hdfs hadoop    1321451 2015-01-01 12:00 boats.txt
[data-sci@sj-edge-a1 ~] $ hdfs dfs -cat /data/lake/boats.txt
…

Page 23: The Evolution of Hadoop at Spotify - Through Failures and Pain

15

Page 24: The Evolution of Hadoop at Spotify - Through Failures and Pain

$ time for i in {1..100}; do hadoop fs -ls / > /dev/null; done
real    3m32.014s
user    6m15.891s
sys     0m18.821s

$ time for i in {1..100}; do snakebite ls / > /dev/null; done
real    0m34.760s
user    0m29.962s
sys     0m4.512s
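
To make the comparison concrete, the wall-clock figures above can be turned into a per-listing cost and a speedup factor. This is only arithmetic on the numbers shown on the slide; the helper name `to_seconds` is made up for the example.

```python
import re


def to_seconds(stamp):
    """Convert a `time` stamp such as '3m32.014s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Wall-clock ("real") times for 100 listings, copied from the slide.
HADOOP_REAL = to_seconds("3m32.014s")
SNAKEBITE_REAL = to_seconds("0m34.760s")
RUNS = 100

if __name__ == "__main__":
    print(f"hadoop fs:  {HADOOP_REAL / RUNS:.2f} s per listing")
    print(f"snakebite:  {SNAKEBITE_REAL / RUNS:.2f} s per listing")
    print(f"speedup:    {HADOOP_REAL / SNAKEBITE_REAL:.1f}x")
```

On these numbers a single listing costs about 2.1 s with `hadoop fs` and about 0.35 s with snakebite, roughly a 6x difference in wall-clock time.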

16

Page 25: The Evolution of Hadoop at Spotify - Through Failures and Pain

17

Gaining Focus (2013-2014)

Page 26: The Evolution of Hadoop at Spotify - Through Failures and Pain

18

Forming a team

• In 2013, expanded to 200 nodes
• Hadoop critical
• Needed a team totally focused on it
• Created a ‘squad’ with two missions:
  • Migrate to a new distribution with Yarn
  • Make Hadoop reliable

Page 27: The Evolution of Hadoop at Spotify - Through Failures and Pain

19

Page 28: The Evolution of Hadoop at Spotify - Through Failures and Pain

19

Hadoop ownerless

Page 29: The Evolution of Hadoop at Spotify - Through Failures and Pain

19

Hadoop ownerless

Squad

Page 30: The Evolution of Hadoop at Spotify - Through Failures and Pain

19

Hadoop ownerless Upgrades

Squad

Page 31: The Evolution of Hadoop at Spotify - Through Failures and Pain

19

Hadoop ownerless Upgrades Getting there

Squad

Page 32: The Evolution of Hadoop at Spotify - Through Failures and Pain

20

Alerting

• Alert on service-level problems (e.g. no jobs running)
• Keep your alarm channel clean. Beware of alert fatigue.
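
A service-level check asks whether the cluster is doing useful work, not whether one particular daemon is up. A minimal sketch of the "no jobs running" case follows; the function name, the one-hour threshold, and the idea of feeding it the last job completion time are all assumptions, not a description of the monitoring actually used.

```python
from datetime import datetime, timedelta


def should_alert(last_job_finished, now, max_quiet=timedelta(hours=1)):
    """Return True when no job has finished within `max_quiet` of `now`.

    `last_job_finished` is None when no job has ever completed.
    """
    if last_job_finished is None:
        return True
    return now - last_job_finished > max_quiet


if __name__ == "__main__":
    now = datetime(2015, 1, 1, 12, 0)
    print(should_alert(datetime(2015, 1, 1, 11, 30), now))  # quiet for 30 min
    print(should_alert(datetime(2015, 1, 1, 9, 0), now))    # quiet for 3 hours
```

One coarse signal like this pages far less often than per-node checks, which helps keep the alarm channel clean.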

Page 33: The Evolution of Hadoop at Spotify - Through Failures and Pain

21

Uhh ohh….. I think I made a mistake

Page 34: The Evolution of Hadoop at Spotify - Through Failures and Pain

[data-sci@sj-edge-a1 ~] $ snakebite rm -R /team/disco/ CF/test-10/

22

Page 35: The Evolution of Hadoop at Spotify - Through Failures and Pain

Goodbye Data (1PB)

[data-sci@sj-edge-a1 ~] $ snakebite rm -R /team/disco/ CF/test-10/
OK: Deleted /team/disco

22

Page 36: The Evolution of Hadoop at Spotify - Through Failures and Pain

23

Lessons Learned

• “Sit on your hands before you type” - Wouter de Bie
• Users will always want to retain data!
• Remove superusers from ‘edgenodes’
• Moving to trash = client-side implementation
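
In the incident above, a stray space split one path into two arguments, `/team/disco/` and `CF/test-10/`, and the first one was deleted recursively. Since trash handling lives in the client, a client-side guard is one place to catch this. The sketch below is a hypothetical pre-flight check and is not part of snakebite or Hadoop; the function name and the minimum-depth policy are assumptions.

```python
import posixpath

MIN_DEPTH = 3  # assumed policy: never recursively delete above /a/b/c


def check_delete_targets(paths, min_depth=MIN_DEPTH):
    """Return the normalized targets, or raise ValueError if any looks unsafe."""
    checked = []
    for raw in paths:
        if not raw.startswith("/"):
            # A relative argument often means a stray space split one path.
            raise ValueError(f"refusing relative path: {raw!r}")
        path = posixpath.normpath(raw)
        depth = len([part for part in path.split("/") if part])
        if depth < min_depth:
            raise ValueError(f"refusing shallow path: {path!r} (depth {depth})")
        checked.append(path)
    return checked


if __name__ == "__main__":
    print(check_delete_targets(["/team/disco/CF/test-10/"]))
    try:
        check_delete_targets(["/team/disco/", "CF/test-10/"])
    except ValueError as err:
        print(err)
```

With this policy the intended path passes, while the mistyped command is rejected before anything reaches the cluster.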

Page 37: The Evolution of Hadoop at Spotify - Through Failures and Pain

24

The Wild Wild West

Page 38: The Evolution of Hadoop at Spotify - Through Failures and Pain

25

Pre-Production Cluster

• Same hardware profile as production cluster
• Similar configuration
• Staging environment
• Reliable

Page 39: The Evolution of Hadoop at Spotify - Through Failures and Pain

26

Automated Testing

Page 40: The Evolution of Hadoop at Spotify - Through Failures and Pain

27

Page 41: The Evolution of Hadoop at Spotify - Through Failures and Pain

28

Moving Data

Page 42: The Evolution of Hadoop at Spotify - Through Failures and Pain

29

Apache Falcon

• Features:
  • Data discovery
  • Lineage
  • Lifecycle management
  • More
• We use it for data movement
• Uses Oozie behind the scenes

Page 43: The Evolution of Hadoop at Spotify - Through Failures and Pain

30

Improving Performance

• Most of our jobs were Hadoop Streaming jobs written in Python
• Lots of failures, slow performance
• Had to find a better way…

Page 44: The Evolution of Hadoop at Spotify - Through Failures and Pain

31

Improving Performance

• Investigated several frameworks
• Selected Crunch:
  • Real types!
  • Higher level API
  • Easier to test
  • Better performance #JVM_FTW

*Dave Whiting’s analysis of systems: http://thewit.ch/scalding_crunchy_pig

Page 45: The Evolution of Hadoop at Spotify - Through Failures and Pain

32

Page 46: The Evolution of Hadoop at Spotify - Through Failures and Pain

33

Let’s Review

Page 47: The Evolution of Hadoop at Spotify - Through Failures and Pain

34

The Future (2015+)

Page 48: The Evolution of Hadoop at Spotify - Through Failures and Pain

35

Note: Spotify user counts are based on publicly released numbers only

Page 49: The Evolution of Hadoop at Spotify - Through Failures and Pain

36

Explosive Growth

• Increased Spotify Users
• Increased use cases
• Increased Engineers

Page 50: The Evolution of Hadoop at Spotify - Through Failures and Pain

37

http://everynoise.com/engenremap.html

Page 51: The Evolution of Hadoop at Spotify - Through Failures and Pain

38

Scaling machines: easy
Scaling people: hard

Page 52: The Evolution of Hadoop at Spotify - Through Failures and Pain

39

User Feedback: Automate IT!

Page 53: The Evolution of Hadoop at Spotify - Through Failures and Pain

40

Data Management

Page 54: The Evolution of Hadoop at Spotify - Through Failures and Pain

41

Raynor

• Data-discovery tool
• Luigi Integration
• Find and browse datasets
• View schemas
• Trace lineage
• Open-source plans? :-(

Page 55: The Evolution of Hadoop at Spotify - Through Failures and Pain

42

Two Takeaways

• Automate Everything
  • More time to play FIFA build cool tools
• Listen to your users
  • Fail fast, don’t be afraid to scrap work

Page 56: The Evolution of Hadoop at Spotify - Through Failures and Pain

43

Join the Band!
Engineers wanted in NYC & Stockholm

http://spotify.com/jobs