Page 1: Productive data engineer

How to Be a Productive Data Engineer

Rafal Wojdyla - [email protected]

Note: My views are my own and don't necessarily represent those of Spotify.

Page 2: Productive data engineer

• Operations

• Development

• Organization

• Culture

Page 3: Productive data engineer

What is Spotify?

For everyone:

• Streaming Service

• Launched in October 2008

• 60 Million Monthly Users

• 15 Million Paid Subscribers

+ and for me:

• 1.3K-node Hadoop cluster

Page 4: Productive data engineer

Automation

Page 5: Productive data engineer

ME

ADAM

Page 6: Productive data engineer

Apache Ambari

Cloudera Manager

Page 7: Productive data engineer

+ Puppet

Page 8: Productive data engineer

Not Invented Here

Page 9: Productive data engineer

Never Invented Here

Page 10: Productive data engineer

Wild Wild West

Page 11: Productive data engineer

Apache Bigtop

Page 12: Productive data engineer

Enable log aggregation

Page 13: Productive data engineer

To enable log aggregation

yarn.log-aggregation-enable = true
yarn.log-aggregation.retain-seconds = ?

Page 14: Productive data engineer

+ <property>
+   <name>yarn.log-aggregation-enable</name>
+   <value>true</value>
+ </property>
+
+ <property>
+   <name>yarn.log-aggregation.retain-seconds</name>
+   <value>315569260</value>
+   <!--retention: 10 years-->
+ </property>
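
The retention value in the diff above works out to ten mean tropical years (365.2422 days each) expressed in seconds. A minimal Python sketch of that arithmetic, and of rendering the two properties, might look like the following. The helper names `retention_seconds` and `yarn_property` are hypothetical; only the two property names and the 10-year value come from the slide.

```python
import xml.etree.ElementTree as ET

SECONDS_PER_DAY = 86_400
# Assumption: a mean tropical year, which reproduces the slide's value.
DAYS_PER_YEAR = 365.2422


def retention_seconds(years: float) -> int:
    """Convert a retention period in years to whole seconds (truncated)."""
    return int(years * DAYS_PER_YEAR * SECONDS_PER_DAY)


def yarn_property(name: str, value: str) -> str:
    """Render one <property> element in the yarn-site.xml layout."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    print(yarn_property("yarn.log-aggregation-enable", "true"))
    print(yarn_property("yarn.log-aggregation.retain-seconds",
                        str(retention_seconds(10))))
```

Computing the value instead of typing it by hand keeps the comment ("10 years") and the number from drifting apart.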

Page 15: Productive data engineer

Heap Memory used is 97%

Page 16: Productive data engineer

Hellelephant

Page 17: Productive data engineer

Custom logs

• Profiling

• Garbage collection
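
One possible way to get per-task garbage collection logs is to pass GC logging flags through the task JVM options. The sketch below assumes a Java 7/8 era HotSpot JVM (where `-XX:+PrintGCDetails` and `-Xloggc` apply) and Hadoop's `@taskid@` placeholder in the child JVM options. The helper name, heap size, and log directory are hypothetical and not taken from the talk.

```python
def gc_logging_opts(heap_mb: int, log_dir: str = "/tmp") -> str:
    """Build a JVM options string that enables GC logging for a task JVM.

    Assumption: Hadoop replaces @taskid@ with the current task ID,
    so every task writes its own GC log file.
    """
    flags = [
        f"-Xmx{heap_mb}m",            # task heap size
        "-verbose:gc",                # basic GC events
        "-XX:+PrintGCDetails",        # per-generation detail
        "-XX:+PrintGCDateStamps",     # wall-clock timestamps
        f"-Xloggc:{log_dir}/@taskid@.gc",
    ]
    return " ".join(flags)


if __name__ == "__main__":
    # The result could be supplied as the value of a task JVM option
    # such as mapreduce.map.java.opts.
    print(gc_logging_opts(1024))
```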

Page 18: Productive data engineer

Right tool for the job

Page 19: Productive data engineer
Page 20: Productive data engineer
Page 21: Productive data engineer

Right abstraction for the job

Page 22: Productive data engineer

Scaling machines is easy, scaling people is hard

Page 23: Productive data engineer

• Map split size

• Number of reducers

• HDFS data retention

• User feedback (ongoing)

Automation
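
As an illustration of what automating the first two items could look like, here is a minimal Python sketch that derives a reducer count and a map split size from the input size. The heuristics, default values (1 GiB per reducer, 128 MiB blocks, a cap of 999 reducers), and function names are assumptions made for this example, not the rules used at Spotify.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating point."""
    return (a + b - 1) // b


def suggest_reducers(input_bytes: int,
                     bytes_per_reducer: int = 1 << 30,
                     max_reducers: int = 999) -> int:
    """Aim for roughly bytes_per_reducer of input per reducer, within bounds."""
    if input_bytes <= 0:
        return 1
    return max(1, min(max_reducers, ceil_div(input_bytes, bytes_per_reducer)))


def suggest_split_size(input_bytes: int,
                       target_maps: int,
                       block_size: int = 128 << 20) -> int:
    """Round each map's share of the input up to a whole number of blocks."""
    per_map = ceil_div(input_bytes, target_maps)
    blocks = max(1, ceil_div(per_map, block_size))
    return blocks * block_size


if __name__ == "__main__":
    ten_gib = 10 * (1 << 30)
    print(suggest_reducers(ten_gib))        # reducers for 10 GiB of input
    print(suggest_split_size(ten_gib, 20))  # split size in bytes for ~20 maps
```

Deriving these values from the data means users do not have to tune them by hand for every job.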

Page 24: Productive data engineer

Organization

Page 25: Productive data engineer
Page 26: Productive data engineer

Ownerless

Page 27: Productive data engineer

Ownerless Squad

Page 28: Productive data engineer

Ownerless

Squad Upgrades

Page 29: Productive data engineer

Ownerless

Squad Upgrades Getting there

Page 30: Productive data engineer

Culture

Page 31: Productive data engineer

Experiment

Fail Fast

Embrace Failure

Page 32: Productive data engineer

Spark

But we have tried!

Non grata

Page 33: Productive data engineer

Spark

spark.storage.memoryFraction (0.6)
spark.shuffle.memoryFraction (0.2)

In shuffle-heavy algorithms, reduce the cache (storage) fraction in favour of the shuffle fraction.
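
A minimal sketch of that trade-off, written as plain Python so it runs without Spark installed: it builds the two settings and renders them as `spark-submit --conf` arguments. The 0.4/0.4 split and the rule of keeping the combined share at or below the default total of 0.8 are illustrative assumptions, and the function names are hypothetical.

```python
def shuffle_heavy_conf(storage: float = 0.4, shuffle: float = 0.4) -> dict:
    """Return Spark 1.x memory settings that shift heap from cache to shuffle."""
    if not (0 < storage < 1 and 0 < shuffle < 1):
        raise ValueError("fractions must be between 0 and 1")
    # Assumption: keep the combined share within the default 0.6 + 0.2,
    # so the heap left for user code does not shrink.
    if storage + shuffle > 0.8 + 1e-9:
        raise ValueError("combined fraction exceeds the default total of 0.8")
    return {
        "spark.storage.memoryFraction": str(storage),
        "spark.shuffle.memoryFraction": str(shuffle),
    }


def to_submit_args(conf: dict) -> list:
    """Render settings as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(shuffle_heavy_conf())))
```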

Page 34: Productive data engineer

Spark

spark.executor.heartbeatInterval (10K)
spark.core.connection.ack.wait.timeout (60)

Increase both in case of long GC pauses.
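
One hypothetical way to pick larger values is to scale them from the longest GC pause observed in the executor logs. The sketch below assumes the Spark 1.x units (milliseconds for the heartbeat interval, seconds for the ack timeout) and treats the values on the slide as defaults. The safety factor and the function name are assumptions made for illustration.

```python
import math

DEFAULT_HEARTBEAT_MS = 10_000   # "10K" on the slide
DEFAULT_ACK_TIMEOUT_S = 60      # "60" on the slide


def gc_tolerant_conf(longest_gc_pause_s: float,
                     safety_factor: float = 2.0) -> dict:
    """Suggest settings that outlast the longest observed GC pause.

    Values never drop below the defaults; they only grow when the
    pause (times the safety factor) exceeds them.
    """
    budget_s = longest_gc_pause_s * safety_factor
    heartbeat_ms = max(DEFAULT_HEARTBEAT_MS, int(budget_s * 1000))
    ack_timeout_s = max(DEFAULT_ACK_TIMEOUT_S, math.ceil(budget_s))
    return {
        "spark.executor.heartbeatInterval": str(heartbeat_ms),
        "spark.core.connection.ack.wait.timeout": str(ack_timeout_s),
    }


if __name__ == "__main__":
    # A 45-second pause with the default factor of 2 gives a 90-second budget.
    print(gc_tolerant_conf(45))
```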

Page 35: Productive data engineer

Learnings

• Operations: Automation

• Development: Abstraction

• Organization: Team

• Culture: Experiment

Page 36: Productive data engineer

Join the band

Engineers wanted in NYC & Stockholm

http://spotify.com/jobs