Productive data engineer

How to Be a Productive Data Engineer

Rafal Wojdyla - [email protected]: My views are my own and don't necessarily represent those of Spotify.

• Operations

• Development

• Organization

• Culture

What is Spotify?

For everyone:

• Streaming Service

• Launched in October 2008

• 60 Million Monthly Users

• 15 Million Paid Subscribers

+ and for me:

• 1.3K-node Hadoop cluster

Automation

ME

ADAM

Apache Ambari

Cloudera Manager

+ Puppet

Not Invented Here

Never Invented Here

Wild Wild West

Apache Bigtop

Enable log aggregation

To enable log aggregation

yarn.log-aggregation-enable = true
yarn.log-aggregation.retain-seconds = ?

+ <property>
+   <name>yarn.log-aggregation-enable</name>
+   <value>true</value>
+ </property>
+
+ <property>
+   <name>yarn.log-aggregation.retain-seconds</name>
+   <value>315569260</value>
+
+ </property>
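The two properties above can be sketched in code as follows. This is a minimal illustration, not the deck's own tooling: the helper names `render_properties` and `retention_in_days` are hypothetical, and only the property names and values come from the slide. The conversion shows that 315569260 seconds is about 3652 days, roughly ten years.

```python
import xml.etree.ElementTree as ET

# Property names and values are taken from the slide.
# The helper functions below are hypothetical, for illustration only.
LOG_AGGREGATION_PROPS = {
    "yarn.log-aggregation-enable": "true",
    "yarn.log-aggregation.retain-seconds": "315569260",
}


def render_properties(props):
    """Render a dict as Hadoop-style <property> elements, one string each."""
    rendered = []
    for name, value in props.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        rendered.append(ET.tostring(prop, encoding="unicode"))
    return rendered


def retention_in_days(seconds):
    """Convert a retention period in seconds to days."""
    return seconds / 86400


if __name__ == "__main__":
    for element in render_properties(LOG_AGGREGATION_PROPS):
        print(element)
    seconds = int(LOG_AGGREGATION_PROPS["yarn.log-aggregation.retain-seconds"])
    print(f"retention: {retention_in_days(seconds):.1f} days")
```

Printing the retention in days next to the raw value makes it easier to see how long aggregated logs will be kept before committing to a number.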

Heap Memory used is 97%

Hellelephant

Custom logs

• Profiling

• Garbage collection

Right tool for the job

Right abstraction for the job

Scaling machines is easy, scaling people is hard

• Map split size

• Number of reducers

• HDFS data retention

• User feedback (ongoing)
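As one illustration of taking a knob like the number of reducers out of users' hands, the sketch below derives it from input size. This is a common sizing heuristic and an assumption, not the method described in the talk; `estimate_reducers`, the 1 GiB per-reducer target, and the cap of 999 are all hypothetical choices.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=1 << 30, max_reducers=999):
    """Pick a reducer count so each reducer handles about bytes_per_reducer.

    Hypothetical heuristic: the default target (1 GiB) and the cap (999)
    are illustrative values, not recommendations from the talk.
    """
    if input_bytes < 0 or bytes_per_reducer <= 0 or max_reducers < 1:
        raise ValueError("sizes must be non-negative and limits positive")
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # Always run at least one reducer, never more than the cap.
    return max(1, min(wanted, max_reducers))


if __name__ == "__main__":
    # Example: a 250 GiB input.
    print(estimate_reducers(250 * (1 << 30)))
```

The result could then be passed to a job through the Hadoop property `mapreduce.job.reduces`, so that individual users do not have to guess a value.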

Automation

Organization

Ownerless

Ownerless Squad

Ownerless

Squad Upgrades

Ownerless

Squad Upgrades Getting there

Culture

Experiment

Fail Fast

Embrace Failure

Spark

But we have tried!

Non grata

Spark

spark.storage.memoryFraction (default: 0.6)
spark.shuffle.memoryFraction (default: 0.2)

In shuffle-heavy algorithms, reduce the cache fraction in favour of the shuffle fraction.
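A minimal sketch of that trade-off, assuming the settings are passed to `spark-submit` as `--conf` arguments. The helper names and the size of the shift (0.2) are hypothetical; only the two property names and their starting values come from the slide. Both properties belong to Spark's legacy memory manager, so this applies to the Spark 1.x versions the talk refers to.

```python
def shuffle_heavy_fractions(storage=0.6, shuffle=0.2, shift=0.2):
    """Move `shift` of the heap budget from the cache to the shuffle.

    Defaults mirror the values on the slide; `shift` is an illustrative
    assumption, not a tuned recommendation.
    """
    if not 0 <= shift <= storage:
        raise ValueError("shift must be between 0 and the storage fraction")
    return {
        "spark.storage.memoryFraction": round(storage - shift, 2),
        "spark.shuffle.memoryFraction": round(shuffle + shift, 2),
    }


def to_submit_args(conf):
    """Format settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(shuffle_heavy_fractions())))
```

Keeping the sum of the two fractions unchanged means the shuffle gains exactly what the cache gives up, so the rest of the heap is left alone.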

Spark

spark.executor.heartbeatInterval (10K)
spark.core.connection.ack.wait.timeout (60)

Increase both in case of long GC pauses.
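The sketch below shows one way to raise both timeouts together and render them in the whitespace-separated `key value` form used by `spark-defaults.conf`. It assumes the slide's "10K" means 10000 milliseconds and "60" means 60 seconds; the helper names and the scaling factor of 3 are hypothetical.

```python
# Starting values as listed on the slide. Units are an assumption:
# heartbeat interval in milliseconds, ack wait timeout in seconds.
DEFAULTS = {
    "spark.executor.heartbeatInterval": 10000,
    "spark.core.connection.ack.wait.timeout": 60,
}


def scale_timeouts(defaults, factor):
    """Multiply every timeout by `factor` (hypothetical helper)."""
    if factor < 1:
        raise ValueError("factor must be at least 1 to increase the timeouts")
    return {key: int(value * factor) for key, value in defaults.items()}


def to_defaults_conf(conf):
    """Render settings as `key value` lines, one per property."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    # A factor of 3 is purely illustrative; size it to the observed GC pauses.
    print(to_defaults_conf(scale_timeouts(DEFAULTS, 3)))
```

Scaling both values by the same factor keeps their relationship intact while giving executors more slack during long pauses.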

Learnings

• Operations: Automation

• Development: Abstraction

• Organization: Team

• Culture: Experiment

Join the band

Engineers wanted in NYC & Stockholm

http://spotify.com/jobs
