How to Be a Productive Data Engineer
Rafal Wojdyla - [email protected]. My views are my own and don't necessarily represent those of Spotify.
Speaker’s cut version – includes raw notes
Hi – my name is Rafal and I'm an engineer at Spotify. In this
presentation I will talk about how to be a productive data
engineer. I will combine the knowledge of multiple productive
engineers at Spotify and touch on different areas of your
daily work life. I will use real world examples, failures and
success stories – but mostly failures. So if you are, or want to
be, a data engineer, hopefully after this presentation every
single one of you will learn something new. I hope this
learning will improve your productivity, bring new features to
your infrastructure or maybe spark a discussion inside your
team.
• Operations
• Development
• Organization
• Culture
We will go through the lessons of a productive data engineer and cover 4
different areas – operations, development, organization and culture.
We will kind of work our way up from low level admin tips and
spectacular disasters; after hardcore operations we will talk about
development on Hadoop – what to avoid and how Spotify is overcoming
the huge problem of legacy Hadoop tools. After the development part we will
take a look at how organization structure can affect your productivity
and how one can tackle this problem. We will finish with the ubiquitous
culture – how culture can help in being productive. So in a way we
start with the low level scope of your productivity – how the cluster itself
can affect you, how operators can help you out – to later talk about
your development decisions, the structure of the company, and finish with how
the environment can influence your work. There will be time for
questions at the end so please keep them till then.
What is Spotify?
For everyone:
• Streaming Service
• Launched in October 2008
• 60 Million Monthly Users
• 15 Million Paid Subscribers
+ and for me:
• 1.3K nodes Hadoop cluster
Before we go deep into the presentation let's first talk about what Spotify
is. Spotify is a streaming service, launched in 2008 in beautiful
Stockholm, Sweden. The current public numbers are that we have 60M
monthly users and 15M subscribers. And what's unique about the Spotify
service is that it can play a perfect song for every single moment, and
some of this is powered through Hadoop which makes it even cooler!
For me Spotify is also 1.3K Hadoop nodes – which is like a baby for
a team of 4 people. A baby that is sometimes very frustrating, shit
happens all the time and you have to wake up in the middle of the
night and clean it up, but it's our baby and we love it. Without further
ado let's move to the core of the topic and start with operations. If
there's one lesson that comes from operating Hadoop clusters from a
handful of nodes in the corner of the office to 1300 nodes – it is
AUTOMATION.
Automation
Automation is crucial – especially when talking about Hadoop. Hadoop is
a huge beast to manage, there are loads of moving parts, loads of new
stuff coming in, and there's always a reason for Hadoop clusters to go
down. As if Hadoop were not enough, there's always something that your
company will push on the poor operators – whether it's a new Linux distro,
or maybe there's a bug in libc and you need to restart all the
daemons, and so on and so on.
ME
ADAM
You want to be proactive and do as little as possible – without
automation even coffee won't help you. You want to be Adam – be
happy, work on new features, enhance Hadoop – bring joy to
Hadoop users. You don't want to be the poor operator on the left,
focused primarily on putting out fires, exhausted. By the way, this is a
picture of Adam and me after a 40-hour Hadoop upgrade from
Hadoop 1 to 2 in 2013.
Apache Ambari
Cloudera Manager
So how do you reach good enough automation of your cluster – let's take Spotify as an example first –
Spotify started with Hadoop in 2009, very early, then there were a couple of tiny expansions, a
short episode of Hadoop on EMR, and we went back to on premise with a shiny new 60 nodes –
at that point we had to make a decision on how to manage Hadoop – and because back then
CM was limited and Ambari didn't exist, and because Spotify loves Puppet, we decided to use
Puppet for this use case. It was a rather big effort and took time, during which we had to drop
some work, put out fires and work on Puppet, but it was a great investment. Today, after a few
iterations, we like our Puppet – as an example – the most recent ongoing expansion is rather easy –
we name the machines using the proper naming convention and Puppet kicks in, installs all the
services and configuration, and keeps the machines in a normalized state – a very very important piece
of our infrastructure. But wait – the slide says something about Ambari and CM – yes – because if
we were to set up a cluster today, we would most likely evaluate at least these two solutions.
Like I said, Spotify basically didn't have a choice and we settled on Puppet, and we are happy
about it right now, but there's huge leverage you can gain out of using these tools, loads of
features that you get out of the box that we had to implement ourselves. So if you are
considering building a Hadoop cluster – make sure to give these tools a good try, they may not
solve all your issues and use cases but for sure will bring loads of value, and in time you will get
even more features just from the community – which is great, and is something that we are
missing.
+ Puppet
That said – even if you decide to use Ambari or CM – most likely
you will still need some kind of configuration management tool –
whether it's Puppet, or Chef, or Salt, or whatever is your favorite –
you will need one; there will always be some extra library that you
need to install and configure, or some user to create, and so on.
There's another interesting outcome of us building our own Puppet
infrastructure – we know exactly how our Hadoop is configured –
every single piece of it – which comes in handy in case of
troubleshooting. In this case we touched a little bit on the problem of 3rd party
solutions vs. implementing our own tailored solutions. How many of
you are aware of the NIH problem?
Not Invented Here
I will argue that there are a number of cases and teams where this
problem occurs at Spotify. The NIH problem, in a nutshell, is when you
undervalue 3rd party solutions and convince others to implement your
own solutions – in most cases this is a huge problem. The lesson
that we have learned is that you need to give external tools a try,
experiment, but don't expect something to solve all your problems –
preferably define acceptance metrics prior to evaluating the tools.
Never Invented Here
But what is actually very interesting in the data area is a kind of
sibling problem of NIH – NeIH – a problem described, I
believe, by Michael O. Church – it's kind of the opposite approach; it's
when you overvalue 3rd party solutions and end up in a messy
place of glue implementation madness. There are loads of great tools
in the big data area – not all of them work well with each other, not all of
them do well what they are meant to be doing. I urge you to be critical;
sometimes implementing your own tool, or postponing a new,
shiny framework in your infrastructure, may be a good thing to do – but
it has to be a data driven decision that brings value. Think about these
two problems, and ask yourself: are there examples of such solutions
at your company?
Wild Wild West
To illustrate this I will tell you a real story – a story of great failure, and success at the end. We had an
external consultant at Spotify – and his goal was to certify our cluster – basically 4 days of looking at
different corners of our infrastructure. The first two days went really smoothly, we went through our configuration,
the state of the cluster and so on, and he could not find a way to improve our cluster easily – which made us feel
proud, because you know, we have this world class, talented Hadoop expert over and he can't find a way to
improve our cluster, right? But oh boy, was that a big mistake – on day number 3, we are sitting in a
room, the whole team and the consultant, and due to miscommunication and misconfiguration our standby NN and
RM go down – but that is still fine, because the RM starts in a minute or two and the standby can start in the background – but
unfortunately during the troubleshooting we killed our active NN by mistake – at this point basically the
whole infrastructure was down – at our scale that means about 2 hours of downtime. It was bad! But wait
for day number 4 – the next day we are sitting in the room, again the whole team and the consultant, but also our
managers, and we listen to the consultant saying that our testing and deployment procedures are like the Wild Wild
West and we act like cowboys – it was hard to listen to, but he was right and we knew it. The next thing we did
was to go to a room with the team and come up with something to solve this issue; we came up with something
that may be obvious – a preproduction cluster – a cluster made out of the same machine profile and almost
identical configuration, which we would use for testing. But how to test was the real question. We went into
research mode and started reading and watching presentations – we were especially impressed by a tool
called HIT by Yahoo, so we contacted the creators; unfortunately there was no plan to open source it – but
they gave us a nice tip – look at Apache Bigtop.
Apache Bigtop
Apache Bigtop primarily facilitates building, testing and deployment of a Hadoop
distribution – but you can also use it in a slightly different way – you can point
Bigtop at your preproduction cluster and use its smoke tests to test the
infrastructure. So our current flow of testing and deployment is to first deploy
to the preproduction cluster, run the Bigtop tests and get instant feedback about the
change; if the feedback is fine we deploy to production, if not – there's something
wrong with the change and we know that before it is deployed to production.
One finding from using Bigtop is that it's actually very easy to extend, so we
were able to add smoke tests for our own tools like snakebite and Luigi, but
also, and this is very important, we run some production workloads as part of the
smoke tests – which actually makes us feel sure about the change.
So in the case of Apache Bigtop the problem was testing of the Hadoop infrastructure
– even though Bigtop is not perfect for this, it provides loads of value just out of the
box and thus it's a great example of preventing the NIH problem.
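For flavour, here is a minimal sketch of the kind of extra smoke test one can run against a preproduction cluster before promoting a change. Bigtop's own tests are JVM-based, so this is not how Bigtop itself does it – the jar location and commands below are just illustrative of the idea.

# smoke_preprod.py - toy smoke test run against the preproduction cluster
# before a change is promoted; a standalone illustration, not Bigtop's harness
import subprocess
import uuid

def run(cmd):
    print('running: %s' % ' '.join(cmd))
    return subprocess.call(cmd)

def test_hdfs_round_trip():
    # create and remove a scratch directory to prove HDFS and the NN are healthy
    path = '/tmp/smoke-%s' % uuid.uuid4()
    assert run(['hadoop', 'fs', '-mkdir', path]) == 0
    assert run(['hadoop', 'fs', '-rm', '-r', path]) == 0

def test_mapreduce_example_job():
    # the stock "pi" example ships with Hadoop and exercises YARN end to end;
    # the jar location varies per distribution
    jar = '/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar'
    assert run(['hadoop', 'jar', jar, 'pi', '2', '10']) == 0

if __name__ == '__main__':
    test_hdfs_round_trip()
    test_mapreduce_example_job()
    print('preproduction smoke tests passed')

The point is less the specific checks and more the flow: deploy to preproduction, run the checks, and only then promote the change.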
Enable log aggregation
As an operator there are many ways to help yourself and also
delegate some of the work to the developers themselves – one great
feature of Hadoop 2, disabled by default, is log aggregation – how
many of you have log aggregation enabled on your cluster? In a
nutshell, this feature will aggregate YARN logs from the workers and store
them on HDFS for inspection – very useful for troubleshooting. Most
of you probably know how to enable it, right?
To enable log aggregation
yarn.log-aggregation-enable = true
yarn.log-aggregation.retain-seconds = ?
It's dead simple. But there's one question – how long should we keep
the logs for? We thought about it for a while, talked with HW a little
bit, and since we have a huge cluster, why not store them for a long time –
maybe we will need these logs for some analytics etc.
+ <property>
+   <name>yarn.log-aggregation-enable</name>
+   <value>true</value>
+ </property>
+
+ <property>
+   <name>yarn.log-aggregation.retain-seconds</name>
+   <value>315569260</value>
+   <!-- retention: 10 years -->
+ </property>
This was our initial change to the configuration – 10 years. Does anyone
know what bad things can happen if you do that?
Heap Memory used is 97%
If you run enough jobs/tasks, after some time you will see something
like this on your NN – and when you see something like this on your
NN, you end up with a hellelephant!
Hellelephant
It's a situation when your Hadoop cluster spectacularly goes down.
What happens is that log aggregation will in time create many files; a
very important consequence of many files on HDFS is a growing heap
size, and then you get an out of memory error on the NN. The lesson is that it's
a good idea to alert on heap usage for your master daemons, but also to
understand your configuration and its consequences, keep yourself
up to date on changes in the configuration, and read the code behind Hadoop
configuration keys and how they are connected to each other
– not all configuration parameters are documented.
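Hadoop daemons expose their JVM metrics over the /jmx HTTP endpoint, so a heap check can be as small as the sketch below – the host, port and threshold are placeholders and this is obviously a toy, not a replacement for a proper monitoring system.

# nn_heap_check.py - poll the NameNode JMX endpoint and warn when heap
# usage crosses a threshold; host, port and threshold are placeholders
import json
import urllib2

JMX_URL = 'http://namenode.example.net:50070/jmx?qry=java.lang:type=Memory'
THRESHOLD = 0.90

beans = json.load(urllib2.urlopen(JMX_URL))['beans']
heap = beans[0]['HeapMemoryUsage']
usage = float(heap['used']) / heap['max']

print('NameNode heap usage: %.1f%%' % (usage * 100))
if usage > THRESHOLD:
    # hook this into your alerting system of choice
    print('WARNING: heap above %.0f%% - a hellelephant may be near' % (THRESHOLD * 100))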
Custom logs
• Profiling
• Garbage collection
Right tool for the job
There are a couple of interesting lessons about productive development – the first,
arguably the most important, is to pick the right tool for the job – what is the currently
most important value to bring. Let's talk about Spotify – in 2009 Spotify started with
Hadoop streaming as the supported framework for MR development – Hadoop
streaming basically enables you to implement MR jobs in languages other
than Java – for many years it was THE framework – because Spotify loved
Python and it enabled us to iterate faster and thus provide knowledge for our
business. Time was passing and our Hadoop cluster was growing – in time we
needed something different, something better when it comes to performance
but also maturity. After a long evaluation – and I encourage you to watch the
presentations by David Whiting about different frameworks – we decided
to use Apache Crunch as the supported framework for batch MR. Why? A couple
of reasons – first, ease of testing and type safety.
David’s presentations:
* https://www.youtube.com/watch?v=XKY_0s7pESQ
* https://www.youtube.com/watch?v=-WOiZ2w7xtI
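For context, a Hadoop streaming job of the kind Spotify wrote in Python for years is just a pair of scripts reading stdin and writing stdout – the tab-separated field layout below is made up, but it shows why such jobs are quick to write yet hard to type-check and unit test.

# mapper.py - emits (first_field, 1) for every input line; the tab-separated
# layout here is purely illustrative
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    if fields and fields[0]:
        print('%s\t1' % fields[0])

# reducer.py - streaming delivers lines sorted by key, so we accumulate a
# count per key and flush it whenever the key changes
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t')
    if key != current_key and current_key is not None:
        print('%s\t%d' % (current_key, count))
        count = 0
    current_key = key
    count += int(value)
if current_key is not None:
    print('%s\t%d' % (current_key, count))

Nothing here is type-checked, and testing means faking stdin – exactly the pain that pushed us towards Crunch.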
This graph shows the number of successful and failed jobs divided by
framework over 6 months – and these are production jobs – as you can
see, the two most popular frameworks are Hadoop streaming and Crunch
– but the difference between failed and successful jobs is crucial.
Crunch jobs behave much better and have better testing. Type safety
helps to discover problems at compile time, and the testing framework that
comes with Crunch, which we were able to enhance with the Hadoop
minicluster, helps users to easily test their jobs – it basically makes
testing easy, something that we missed for our Hadoop streaming
jobs. But performance is another thing.
On this graph we can see the map throughput for Apache Crunch and
Hadoop streaming – these are production workloads over 6 months –
there's a huge difference. Crunch turns out to be on average 8 times
faster. What is more interesting is that we actually see higher
utilization of our cluster the more Crunch jobs there are on the cluster –
which makes us super happy.
Right abstraction for the job
Another thing that Crunch provides is great abstraction – and that is
another thing that a productive developer needs to keep in mind – pick
the right abstraction for the job. In the case of Crunch we can start
thinking in terms of high level operations like filter, groupBy, joins and
so on instead of the old map/reduce legacy. This makes implementations
more intuitive and simply pleasant – and thus makes the developer
experience much better. The interesting thing that we have observed
is that a higher abstraction may remove some of the opportunities for
optimization, so it's not as easy to implement the best performing
job – but on the other hand it reduces the problem of premature
optimization, and on average it performs really well – there are very few
people at Spotify who actually know how to optimize pure Java MR jobs or
Hadoop streaming jobs – but the average performance that we
get from Crunch turns out to be really good, as you could see on the
performance graph.
Scaling machines is easy, scaling
people is hard
We do have loads of nodes – and we have scaling machines nailed
down; Crunch scales very well. But there's a big problem that we
currently have: scaling people. How do you scale support and best
practices? We constantly see problems with code repetition, HDFS
mess, lack of data management, YARN resource contention – all this
brings our productivity down. There's not enough time to go through
all of them, but some of these problems we are trying to tackle with
nothing other than our beloved automation. Let's see some examples:
Automation
• Map split size
• Number of reducers
• HDFS data retention
• User feedback (ongoing)
We automate the map split size calculation, and thus the number of map tasks, but also the number of
reducers, and therefore the number and size of output files – all this is done by estimation from
historical data using our workflow manager Luigi – which I encourage you to take a look
at! Luigi github: https://github.com/spotify/luigi
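To make the idea concrete, here is a minimal sketch (not our production code) of how a reducer count can be derived rather than hard-coded, assuming luigi.contrib.hadoop's JobTask and a made-up helper for the historical input size – all paths and numbers are illustrative.

import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs

TARGET_BYTES_PER_REDUCER = 1024 ** 3   # aim for roughly 1 GB per reducer (illustrative)

def historical_input_size(date):
    # stand-in: a real implementation would look at previous runs of the pipeline
    return 50 * 1024 ** 3

def estimate_reducers(input_bytes):
    return max(1, input_bytes // TARGET_BYTES_PER_REDUCER)

class PlayLogs(luigi.ExternalTask):
    date = luigi.DateParameter()

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/logs/plays/%s' % self.date)

class DailyPlayCounts(luigi.contrib.hadoop.JobTask):
    date = luigi.DateParameter()

    def requires(self):
        return PlayLogs(self.date)

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/prod/play_counts/%s' % self.date)

    @property
    def n_reduce_tasks(self):
        # estimated instead of hard-coded; JobTask passes this to the MR job
        return estimate_reducers(historical_input_size(self.date))

    def mapper(self, line):
        track_id = line.split('\t')[0]
        yield track_id, 1

    def reducer(self, key, values):
        yield key, sum(values)

Map split size can be steered in a similar spirit, for example by deriving mapreduce.input.fileinputformat.split.minsize from the same historical data instead of leaving it to defaults.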
We are about to finish the second iteration of our HDFS retention policy, which will
automatically remove data, therefore reducing HDFS usage and in the long term hopefully
reducing the HDFS legacy mess.
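As an illustration of what such a policy boils down to (a toy sketch, not our actual retention service – the root path, host and 90-day cutoff are made up), snakebite makes the HDFS side of it a few lines of Python.

# retention_sketch.py - remove directories older than a cutoff under one root;
# root path, namenode host and retention period are illustrative
import time
from snakebite.client import Client

RETENTION_DAYS = 90
ROOT = '/prod/intermediate'

client = Client('namenode.example.net', 8020)
cutoff_ms = (time.time() - RETENTION_DAYS * 86400) * 1000

expired = [entry['path']
           for entry in client.ls([ROOT])
           if entry['file_type'] == 'd' and entry['modification_time'] < cutoff_ms]

if expired:
    # snakebite's delete() returns a generator - consume it so the deletes run
    for result in client.delete(expired, recurse=True):
        print(result)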
Another ongoing effort is the second iteration of automatic user feedback – we already
expose a database with aggregated information about all MR jobs that our users can
query to learn how their jobs are performing – but we also plan another, very simple
iteration, focused on Crunch, that right after a workflow pipeline is done
will provide the user with instant feedback – memory usage, garbage collection and so
on – very simple tweaks users can apply to improve their jobs. For example, if a user
gives a pipeline 8 GB of memory for each task and, after going through the counters, we
see that the tasks are actually using only 3 GB at most, instant feedback to reduce memory
could improve the multitenancy of your cluster and thus improve productivity.
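The raw material for that feedback is already in the job counters. As a rough sketch – the host and job id are placeholders, only one counter is inspected, and it assumes the standard MapReduce history server REST payload – the data can be pulled like this:

# feedback_sketch.py - pull one memory counter for a finished job from the
# MapReduce history server REST API; host and job id are placeholders
import json
import urllib2

HISTORY = 'http://historyserver.example.net:19888/ws/v1/history/mapreduce'
JOB_ID = 'job_1430000000000_0042'

url = '%s/jobs/%s/counters' % (HISTORY, JOB_ID)
groups = json.load(urllib2.urlopen(url))['jobCounters']['counterGroup']

task_group = next(g for g in groups
                  if g['counterGroupName'] == 'org.apache.hadoop.mapreduce.TaskCounter')
physical = next(c for c in task_group['counter']
                if c['name'] == 'PHYSICAL_MEMORY_BYTES')

# totalCounterValue is the sum over all tasks; dividing by the task count gives
# a rough per-task figure to compare against the requested container size
print('total physical memory used by tasks: %.1f GB'
      % (physical['totalCounterValue'] / float(1024 ** 3)))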
Organization
With that, let's talk about organization structure – how it can improve your
performance – but before that, let's take a look at a graph.
This graph shows Hadoop availability by quarter at Spotify; higher is
of course better – OK – so let's see what happened here:
Ownerless
In the first part the Hadoop cluster was ownerless; it was best effort
support by a team of people that mostly didn't even want to do
Hadoop operations, therefore multiple days of downtime happened
and the infrastructure was in bad shape, denormalized – overall a terrible
state to be in. But there was a light at the end of the tunnel – in Q3
we decided to create a squad – 3 people focused solely on the
Hadoop infrastructure.
Ownerless | Squad
There was instant feedback right after the squad was created – users
were happy and the infrastructure was getting into shape; one of the first
decisions we made was to move to YARN in Q4.
Ownerless | Squad | Upgrades
In Q4 and the beginning of '14 we again saw a drop in availability, mostly
due to the huge upgrade – and its consequences thereafter. The
upgrade itself took a whole weekend, and after the upgrade we saw
many issues and fires that we had to put out; during this time we
were mostly reactive, but also working on polishing our Puppet
manifests. The whole situation stabilized after most of the fires were gone and
Puppet was in good shape.
Ownerless | Squad | Upgrades | Getting there
Our goal is to keep Hadoop at 3 nines of availability – and we have been
getting there since Q2 2014; the Hadoop squad is receiving constant
feedback from users and it's common to hear that availability has
drastically improved, which improved productivity and the overall
experience – which is great and makes us want to work even harder
to achieve better results.
Culture
With that, let's now talk about what surrounds us – the culture – I
strongly believe the culture at Spotify has a huge influence on
productivity – there are three main pillars of this culture.
Experiment
Fail Fast
Embrace Failure
Experiment, fail fast and embrace failure. We love to experiment and
we have time to experiment, whether it's the company wide hack week or
R&D days – if one wishes to experiment there's time to do that, and
with loads of curious people at Spotify there's always something
going on. The most successful data-related experiments are Luigi – a
Hadoop workflow manager – and snakebite – a pure Python HDFS client –
I encourage you to take a look at them. Fail fast – take small steps –
don't be afraid to admit failure, keep it as part of the learning process, to
the point of embracing it. Talk about your failures, share them publicly,
for example through presentations both internally and externally – it
will make experimentation, and thus innovation, flow much more smoothly. To
back this up with an example, let's talk about one ongoing experiment.
Luigi github: https://github.com/spotify/luigi
Snakebite github: https://github.com/spotify/snakebite
Spark
But we have tried!
Non grata
Spark is pretty much an ongoing experiment that we come back to
every now and then, but officially it's not welcome on the production
cluster due to immaturity and poor multitenancy support – that said,
the most recent releases (>1.3) are very promising and we are
constantly playing with it and have high hopes for it, especially for the
most recent dynamic resource allocation feature. There's not much
time left, but I would like to share with you two important
lessons from our evaluation of a heavy Spark job.
Spark
spark.storage.memoryFraction (0.6)
spark.shuffle.memoryFraction (0.2)
In shuffle-heavy algorithms, reduce the cache fraction in favour of shuffle.
The first hint is about memory settings – there are two important settings
that can improve the stability of your heavy Spark jobs – the memory
available for caching (spark.storage.memoryFraction) and the memory
available for shuffle (spark.shuffle.memoryFraction). The default settings
are 0.6 and 0.2, leaving 0.2 for the runtime. In our case we had a heavy
machine learning job that was doing almost a terabyte of shuffle – but
very little (proportionally) caching – initially we had issues with the shuffle
step, but reducing the storage memory and leaving extra memory for
shuffle and the runtime improved stability.
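In PySpark terms the override looks roughly like this (Spark 1.x legacy memory manager; the exact values are illustrative, not a recommendation):

# shuffle-heavy job: shrink the cache fraction, grow the shuffle fraction
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('shuffle-heavy-job')
        .set('spark.storage.memoryFraction', '0.2')    # default 0.6
        .set('spark.shuffle.memoryFraction', '0.6'))   # default 0.2

sc = SparkContext(conf=conf)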
Spark
spark.executor.heartbeatInterval (10K)
spark.core.connection.ack.wait.timeout (60)
Increase in case of long GC pauses.
Another issue that we hit was long GC pauses – executors
would disappear, which in turn triggers recomputation and, in the end,
potentially application failure. After tweaking the heartbeat interval and the
ack.wait timeout we saw an improvement in stability, and even though GC
pauses still occurred they were less harmful.
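The same kind of override applies here – a sketch with illustrative values (the heartbeat interval is in milliseconds and the ack timeout in seconds in Spark 1.x):

# tolerate longer GC pauses before executors are declared lost
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('gc-heavy-job')
        .set('spark.executor.heartbeatInterval', '60000')        # default 10000 (ms)
        .set('spark.core.connection.ack.wait.timeout', '600'))   # default 60 (s)

sc = SparkContext(conf=conf)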
Learnings
• Operations – Automation
• Development – Abstraction
• Organization – Team
• Culture – Experiment
Join the band
Engineers wanted in NYC & Stockholm
http://spotify.com/jobs