Building a Culture Where Software Projects Get Done
Greg Brockman, CTO at Stripe, @thegdb

Jul 20, 2015

Transcript
  • Building a Culture Where Software Projects Get Done (Greg Brockman, CTO at Stripe, @thegdb)

  • We don't know how to build software

  • Engineering timelines will slip

    (Slide graphic: the expected timeline vs. the likely one, which lands somewhere between "disappointing" and "insane".)

  • System complexity never decreases

  • Rewrites will always fail (that doesn't stop people from trying, though)

  • You are not special

  • Choose wisely how you're spending your time

  • Roll your own solution to your hardest problems, not your easiest ones

  • Balance creation versus maintenance

  • 5.times { print "Automate" }

  • Once a bug is triggered, it will keep biting you on a short timeline, no matter how unlikely it seems

  • Invest in technology to support your rate of change

  • Tests aren't for your benefit

  • Create a technology monoculture

  • You will have technical debt, and that's good (image credit: Philippe Kruchten)

  • Pick a few standards

  • Have checks and balances against yourself

  • Minimize distance to the first production use

    Time to shard everything: 3 months (projected)
    Time to shard internal collection: 1 week

  • Have assumption questioners

  • Bus factor: not just for bus accidents

  • Use forcing functions (cautiously)

  • Have a good launch process in place

  • Have good post-hoc processes in place

  • Make collaboration great

  • Find communication sidechannels

  • Documentation should not be a primary source

  • Meetings: useful but costly

  • Have design dictators

  • Have lots of remotes or no remotes

  • Greg Brockman, [email protected], @thegdb

    Today I'm going to be talking about the single aspect of software engineering that has basically remained stagnant for the past thirty years: how to get software projects done.

    Most of you probably know the classic allegories in this space: the mythical man-month, the second-system effect, and so on. But even though these stories have been around for a long time, the same issues still plague us today. And they're growing ever more important as software continues to eat the world [1]. Just take a look at the recent healthcare.gov fiasco: the difficulties of building software are now making national headlines.

    [1] http://online.wsj.com/news/articles/SB10001424053111903480904576512250915629460

    The first step to recovery is admitting we have a problem. I think that we as a community need to acknowledge something: we still don't know how to build software. Whether it's bugs, schedule delays, or feature creep, these are all staples of software engineering.

    Think about how reliable traditional engineering is, and compare that to what we see in software (cf. [1]). You don't cross a bridge worrying that it's about to fall down. But when you use software, you see it crashing all the time.

    There's certainly debate about just why this is true. Is it just that software engineering is young, and we need more time to figure it out? I think it's deeper than that: software engineering is inherently more complex than any other engineering discipline. Modules have a much broader interaction surface, and more modules interact with each other, meaning that the resulting systems are orders of magnitude more complex than anything else we build.

    [1] http://www.codinghorror.com/blog/2005/05/bridges-software-engineering-and-god.html

    Software timelines are one thing that we just don't know how to deliver on yet. My usual rule of thumb is that you should take your expected timeline and keep tripling it until it feels like it'd be insane to take that long; the real timeline is usually somewhere between disappointing and insane. (It's remarkable how well this works. I think the problem is that most estimates are incredibly optimistic, and fail to factor in the amount of random interruption that will happen along the way.)

    When was the last time you saw one of these? If you can't see, it's a negative diffstat. It probably wouldn't look very familiar anyway, because it's rare that anyone actually produces one. Most engineering cultures measure productivity as producing new code to do new things, and view improving old code as just overhead; it's very rare that anyone feels good about taking the time to make old code do old things better.

    That's kind of surprising, because one thing we do know is that successfully building software is all about constraining complexity, and the simpler you can keep your code, the more you can get done in the future.

    But it's also easy to approach the drive for simplicity from the wrong direction; this is basically the second-system effect. Let me give you an example. Back in college, I was tasked with running technology for the Harvard-MIT Math Tournament. I'd inherited a lot of code: there was a Java app for entering results, a Perl script to generate rankings, a Python script to turn those rankings into results emails, and a website that ran on server-side includes.

    Looking at this, I declared all existing code legacy, and resolved to rewrite everything in a single unifying Ruby on Rails project. I started writing, and whenever I thought of some potential piece of functionality, I'd go ahead and add it. Support for multiple tournament years in the application? Of course. Adding a CMS for all the static pages? How could I resist? Over time, I noticed that the application was becoming very complex. There was so much functionality, grown out of a soup of modules without well-defined abstraction boundaries, that it was becoming hard to trace the origin of any one behavior. And in response I did something surprising: I rejoiced! Surely this meant I was getting a lot done: the sloccount kept going up, and what value can a project have if it's not complex?

    Needless to say, when I presented my application to my co-maintainers, no one could figure out what was going on. With the old system, even though there were a lot of tools, you could figure out how to change any one component by just looking at that component in isolation. With my Monorail, any changes required understanding the entire system, in all of its complexity. Even though the Monorail basically worked, and from the outside perhaps it looked like a simpler system since it unified all these tools, we had to throw it out: the increased complexity was not worth the corresponding functionality gains.

    You should look at code as shackles: every additional feature you pack in is something you'll have to maintain, something you'll have to reimplement if you ever decide to switch stacks, and something that will potentially interact with any new features you add. Even worse than the features you meant to add, you'll probably end up with a bunch of emergent behaviors that are an accident of how you happened to stitch your system together, and you'll find that new code starts relying on those behaviors, making it impossible to understand just part of the system in order to make future changes.

    Given how bad we are at writing software, we need to take every opportunity we can to constrain complexity. But tempting as it is, you can't solve complexity by throwing more complexity at it; you need to figure out how to incrementally improve your existing solution. And we as a community have yet to figure out how to set up our cultures so that this happens.

    You can look at that occurrence as simply the follies of a novice programmer. But the crazy thing is, there are many stories just like it from industry, even from some of the best programmers out there. You've probably heard the classic stories, like how the massive Netscape rewrite basically killed the company, but what's surprising is that this still happens today.

    To illustrate: one of my friends' companies was written in a scripting language, and over time the team increasingly felt like they couldn't get enough performance out of it. After a few attempts to improve performance, they decided to take the nuclear option: it was time to rewrite in Scala. The plan was to have a few people spend two months rewriting all of the core abstractions in Scala. After that was done, there'd be a feature freeze, and everyone would spend the next two weeks porting all the application code to Scala, at which point they could just switch over to Scala entirely.

    I bet you all see where this is heading. The first sign of danger was that the abstraction porting took longer than expected, but after six months, they were ready to go. The delay was an accumulation of little things, ranging from getting the toolchain running in their stack to building out tooling to have Scala serve shadow web requests in order to check its correctness. And then the feature freeze began.

    For the next three months, everyone in the company was full-time on porting over from their scripting language. It turned out that there was a lot of complexity in their application logic, and porting over and testing each page took a lot longer than the expected two weeks. At some point, they realized that they couldn't afford to keep their existing site stagnant, and so they put some people back to work on the old site. But now they had to deal with diverging code, which further slowed the port.

    Finally, realizing that the rewrite was doomed, they redoubled their efforts in figuring out how to scale their existing language. Ultimately, they did find a solution: it turned out that by cleverly breaking out parallel rendering, they were able to get the performance they needed.

    The thing I find fascinating about this story is that it has nothing to do with the company having bad engineers; in fact, they employed some of the best engineers I know. And it isn't like this is a newly discovered failure mode: Joel on Software has an article from 2000 talking about the dangers of rewrites [1].

    So why do we keep doing monolithic rewrites? I have a lot of hypotheses, but to some extent, the underlying reasoning doesn't matter. What's important is to make sure every member of your engineering team is aware that this is something everyone tries, and everyone gets wrong, and that if you try to rewrite your site from scratch you will fail, and possibly kill the company in the process.

    Engineering problems usually have multiple solutions, and once we've found one we usually just give up on searching for more. Note that once they'd constrained their solution space to things that don't involve a massive rewrite, it wasn't actually hard for them to find a solution.

    [1] http://www.joelonsoftware.com/articles/fog0000000069.html

    Perhaps the most important point to instill in your culture is the realization that you are not special. You're not immune to any of the failure modes that people run into, and the stories I'm telling today will probably mirror the ones that you'll be telling in the future.

    If you think you're immune, that probably just means you haven't been around long enough to see how it's going to break down and bite you yet. There are people who are just as smart as you who have been thinking about the same problems for a very long time.

    The best way to get things done is to approach every project cognizant of how these things usually go wrong, and to be constantly looking for the warning signs that you're about to mess it up. In practice, you'll still make mistakes. But this way, you'll make fewer of them.

    The most common reason that people fail to get things done is that they spend their time working on the wrong things. Shaping your culture so that people work on what's important is tough; it's so easy to get sidetracked. But it's also the only way to be successful.

    Whenever your company has a new requirement, you have a choice. Do you integrate an off-the-shelf solution, or do you build one yourself? As an engineer, it's always tempting to just roll your own; NIH, or "not invented here" (http://en.wikipedia.org/wiki/Not_invented_here), is the name given to this proclivity. As a manager, it's tempting to just use someone else's solution. So how do you decide when to build and when to buy?

    The danger with building isn't actually that any of these problems are hard. It's easy enough to get an MVP of an applicant tracking system up and running. The thing that always goes wrong is the maintenance, and the laundry list of a hundred features (individual user accounts, email integration, daily summaries, reminders, etc.) that would make the product just a little bit better. Presumably your engineering effort will always be incrementally better spent working on other problems, and so these ones just won't get done. The main thing you pay for when you buy is someone to work on the long tail of features, not the core functionality.

    Most cultures lose sight of this tradeoff, and NIH all of their easy problems. This consigns them to doing maintenance on all of these applications. In contrast, many people's first instinct is to outsource their hardest problems, because, after all, they are hard, and it seems better to just let someone else deal with them.

    One case where we ran into this tension was with sharding. For a long time, we'd incrementally scaled our databases. Whenever a database cluster would become overloaded, we'd split out collections into new physical databases. At some point, we knew this would break down, and we wanted to stay ahead of the curve.

    We correspondingly decided it was time to implement a sharding layer. Sharding is a very hard problem (you're combining distributed systems with your production-critical data), and we would love to just be able to outsource that problem to someone else. We use a lot of MongoDB (we chose it primarily for its automated failover capabilities), which comes with a sharding scheme. We started out by deploying its sharding against our log archive cluster. Over the next few months, we had a number of operational issues. When those came up, the two things we'd do were read the source or go to MongoDB support and ask them for help (or possibly both). Support's turnaround was quite good, but it added a lot of latency to issues that would have been critical had they been for production data. As well, we soon realized that outsourcing the code writing hadn't actually allowed us to outsource the code understanding; it mostly meant we didn't have control over the sharding layer or the ability to patch it if things went wrong.

    So we soon realized that sharding was something we just needed to write ourselves. It was sufficiently core that we couldn't trust someone else to manage it, and we could take advantage of application-level invariants to better tune a lot of the behavior for our use case.

    That being said, there are many hard problems which you should not try to solve yourself. Only pick the problems that are core to your business, and where there's some reason to believe that the solution for you is significantly easier than the general case. For something like scaling your log pipeline, everyone has exactly the same problems as you: don't write your own; use something off the shelf, even if it's missing that one little feature you want.

    (As an aside, there's actually a great paper on this subject, End-to-End Arguments in System Design [1], which I highly recommend for anyone trying to decide what functionality they need to implement themselves.)

    [1] http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf

    Perhaps one of the hardest things about building a rapidly growing company is balancing building new things versus maintaining the old ones. Fixing existing things is always the most urgent: if your system is down, or a user can't log in, or your database has been corrupted, you do just need to drop what you're doing and go deal with it. Being good at maintenance is how you avoid losing.

    However, the really important stuff, the things that will make you win, are the new and innovative things in your development pipeline. Nothing stalls a project more than an engineer being pulled into a bunch of firefighting, or answering customer questions, or helping diagnose some weird behavior. One structure that we've adopted for handling these issues is what we call build/run rotation. Each week, one person on each engineering team is on run. Their job is to serve as a buffer for all of the operational concerns. When urgent things come up requiring that team's attention, it's the runner's job to intercept them and handle them appropriately. Sometimes they'll have no choice but to escalate, but those times should be few and far between. With their free cycles, the runner should work on polish, trying to make small, quick wins which are easily interrupted. This leaves the builders free to focus on the most important projects to move the team forward.

    One nice property of this structure is that everyone on the team has an understanding of the main operational issues affecting them. This can help focus their build time on what's important.

    Along these lines, you should bake into your culture a desire to squash manual tasks. Otherwise, those tasks will grow in number until they're all that you do. Some manual tasks won't be worth automating, but those should be few and far between. Invest in a framework for task automation so that it's easy to add and maintain automations; once it's in place, you should see people just start using it without further prompting.
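
    As a purely hypothetical sketch (the talk shows no code, and this is not the interface of any real Stripe tool), the entry point of such a framework might look something like this:

      # Hypothetical task-automation registry; all names are invented for
      # illustration. The point is that adding an automation takes a few lines,
      # so manual runbooks tend to get folded in over time.
      class Automations
        @registry = {}

        class << self
          attr_reader :registry

          # Register a named automation.
          def register(name, &block)
            registry[name] = block
          end

          # Run an automation by name, logging the invocation.
          def run(name, **args)
            puts "running #{name} with #{args.inspect}"
            registry.fetch(name).call(**args)
          end
        end
      end

      Automations.register("rebuild_db_cluster") do |cluster:|
        # ...steps that used to live in a runbook and be executed by hand...
        puts "rebuilding #{cluster}"
      end

      Automations.run("rebuild_db_cluster", cluster: "logs-archive-1")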

    At Stripe, our systems team has a project called Golem, which they use to automate processes, such as database cluster rebuilds, that used to require a lot of human time. In all of these, your goal should be to maximize your people's efficiency and leverage.

    Imagine you have a service outage due to a memory leak. You track the leak down to a very rare corner case in your code. Do you immediately fix it, or do you put it on your list of things to deal with in the future?

    It's very tempting to say that since it's a rare condition, you can just ignore it. However, I've noticed that, empirically, if you've hit an issue once, you'll almost always keep hitting it again and again on a surprisingly short timescale.

    I think what's going on there is that managing to trigger the issue once is actually a pretty strong indicator that you're now at a point where it's likely to be triggered again. Sometimes there's a second factor that you don't understand; for example, perhaps some other constraint in your system ends up making the corner case far more probable, or maybe a new customer started sending you a bunch of data in an unexpected form. Sometimes it's just that you've crossed some volume threshold without noticing. But I'd strongly recommend a policy of fixing production bugs as soon as they're discovered; anything else will cause a surprising amount of repeat breakage.

    Writing code is hard, but writing code that you're sure actually works is much harder. The single best thing you can do to make it possible to get projects done in your codebase is to provide good ways to be sure that changes are correct. No code should be considered complete or shippable until it has a good way of ensuring it'll be modifiable in the future.

    The primary mechanism to accomplish this is tests. One common misconception is that tests are there to ensure that what is currently written works. That's not really true: you can probably convince yourself of correctness just by playing around with the software by hand, and it's usually a lot faster than writing a test.

    Instead, tests are a statement to future maintainers (including yourself six months from now) about what contracts your code needs to maintain. If it's not tested, then it's undefined behavior, and you should assume future refactors will either come along and break it, or just won't happen because you've written a block of gnarly, untested code that everyone will be afraid to touch.
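
    For instance, here's a minimal, hypothetical sketch (using minitest; NameSplitter is a made-up class standing in for the name-splitting example mentioned just below) of a test that states the contract future maintainers must preserve, rather than pinning down implementation details:

      # Hypothetical example: the class and behavior are invented to illustrate
      # testing the contract ("split a name into first and last") rather than
      # incidental details like return types or array lengths.
      require "minitest/autorun"

      class NameSplitter
        def self.split(full_name)
          first, last = full_name.strip.split(" ", 2)
          { first: first, last: last }
        end
      end

      class NameSplitterTest < Minitest::Test
        def test_splits_a_simple_name_into_first_and_last
          assert_equal({ first: "Grace", last: "Hopper" },
                       NameSplitter.split("Grace Hopper"))
        end

        def test_keeps_everything_after_the_first_word_as_the_last_name
          assert_equal({ first: "Ada", last: "King Lovelace" },
                       NameSplitter.split("Ada King Lovelace"))
        end
      end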

    Choosing what tests to write is a bit of an art. Tests should be explicit about what they're testing, and if they break, it should be obvious what no longer works. And importantly, they should really be statements about the functionality you care about, rather than implementation details. I've seen TDD adherents iterate by writing an empty function, asserting it returns an array, changing the function to return an empty array, adding a new test that it returns an array of length two, and going on from there. Most of the resulting tests aren't useful: all you really care about is that your function, say, splits a name into first and last, and you should keep your tests at that high-level behavior.

    A less obvious investment to make is staying on a unified stack. As your company grows, it can be tempting to introduce new technologies (and you'll probably notice people pushing for them). There are certainly times you should let this happen; it's unlikely that your main stack will be able to solve all problems. However, keeping as much of a technology monoculture as possible means that when you fix a performance problem in one service, all of your other services get the benefit of that fix. It can be painful, because this often means you shouldn't go seeking the right tool for the job; instead, you should look for the right way of solving the problem within your existing technology stack, no matter how kludgy.

    A good illustration of this is that, from the beginning of Stripe's history, we were written entirely in Ruby. I wrote a scary amount of Ruby systems code; there wasn't really library support for some of the stuff I was doing, but sticking to Ruby meant that I could reuse our deploy, test, and dependency-management code. As we built more and more infrastructure, we spun out our own in-house framework, which we currently use to underlie all our services. That means we can make a single code change to, say, our logging, and suddenly have all of our services reflect the change. There have been many times when we've been tempted to break this culture, and some times when we have. When someone wants to introduce something like Redis, and we can find a way to hack together the functionality we want using our existing systems, we'll stick to the monoculture. But when someone wants to introduce Hadoop, where it's clear that building even a mildly plausible Ruby alternative is infeasible, we'll introduce the new technology.

    In general, if the functionality is something that's important to the business, then you really have no choice but to accept whatever stack it comes with. Just keep in mind that it'll be a burden you'll bear forever.

    Technical debt is something that always comes up in these sorts of talks. I think the best explanation of technical debt I've seen is this image. Nice, visible things are what we call features. Nice, invisible things are architecture. Bad but visible things are bugs. The bad and invisible things are what we call technical debt. It bogs us down and slows the rate of change.

    Most people feel bogged down in technical debt, and start asking, "How can I change my culture so we stop adding technical debt?" That's really the wrong question, though. Technical debt, if managed properly, is actually a good thing. It's just like real debt: it lets you move more quickly in the short term, but you'll have to pay it back in the future. If you can't pay it back, you've lost. But by spreading out the load of polishing your system, you can get way more done.

    So the real question is how to change your culture to better manage technical debt, and to make sure you're paying it down at a good rate. Unfortunately, there's no silver bullet here, but a good rule of thumb is "don't do work you'll later have to undo"; it's applicable in probably 75% of cases.

    In any case, you should make your new debt explicit. You probably won't do anything about it, but at least you know it's there, rather than it being discovered the next time someone tries to make a change. Once you've identified what your debt is, whether it's just bolting a new function onto an existing class, or partially integrating some external system, or whatever it may be, you can be more strategic about its accumulation.

    Given those statements, you need to decide what properties all of your software must absolutely have. You should assume that anything on that list will slow down your immediate iteration cycle. But you should also assume that anything not on that list will end up being sacrificed in a permanent way in at least one project.

    I think good testing should always be on that list; it's the one lifeline you have for pulling yourself out of the technical-debt mire in the future.

    Security is a pretty important configuration item to pick. Do you let people embed secrets into your code? What kinds of third-party services do you allow, and what data are you ok with giving them? These are not easy questions, and the right answers vary from culture to culture.

    Quality standards are also important. You should assume the things you write will be there forever. My first project at Stripe was something called password-vault. It was a system for storing shared passwords, such as logins for third-party services. It was pretty horrendous code, but I figured I just needed to get something out, and I'd get around to fixing it up in the next few weeks. Three years later, that code remains in use. So you need to decide up front what level of quality is acceptable, and just not let anything ship below the bar.

    There are a number of other standards you might choose. Monitoring coverage? Do all services need to be nicely packaged, or is it ok to manually configure the servers they run on?

    The choices you make here say a lot about who you really are as a company and culture. One dissatisfying thing about standards is that you only get a few of them: have too many, and you won't be able to get anything done. But if you don't have them, then your codebase and systems are going to fly out of control and become completely unmanageable.

    Because humans are so bad at building software, it's important not to lean on just your intuitions and assumptions. You should make sure you're using techniques that will help you get things done while continually reevaluating your priors and adapting to the situation at hand.

    In general, you should try to get something, anything, out into production as incrementally as possible. Trade off features, but don't trade off implementation quality. Once it's in production, you don't have to worry about your branch growing stale: people who are making changes will also account for your system. As well, this helps guide what problems you actually need to solve (perhaps the thing you thought would be the bottleneck is actually fine, and maybe the lack of a web interface is a bigger problem than you were expecting). And once your system is out, the problems all become nicely incremental.

    In many ways, your ability to get projects done really is just a reflection of the time it takes to get something from zero into production, with some adjustment for iteration-cycle length. Most people think about the latter, but don't take the former into account at all. Iterating is way easier than getting something built from scratch: it's a lot clearer what problems you actually need to solve.

    Our sharding project is a good example of this. We'd sat down and come up with a design, and implemented the core code in a few weeks. However, we figured it'd take at least three months to get comfortable enough to roll it out to production. There were, after all, a lot of moving parts, including a shard-splitting tool we hadn't started on, and getting things wrong could result in missing data or incorrect queries. And of course, since we thought it'd take three months, it'd probably take more like nine.

    This was starting to sound like a massive project. We thought for a while about whether there was a better approach. Finally, someone suggested: what if we just roll it out to some non-critical internal collection? In that case, we could punt on all the fanciness.
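
    As a hedged sketch of what that kind of opt-in can look like (hypothetical code, not Stripe's actual sharding layer): route only an explicitly allowlisted, low-risk collection through the new path, and leave everything else on the existing databases.

      # Illustrative only: the new sharding layer sees real production traffic
      # for one non-critical collection, while everything critical keeps using
      # the old, well-understood path.
      module ShardedStore
        def self.collection(name)
          "sharded:#{name}"
        end
      end

      module LegacyStore
        def self.collection(name)
          "legacy:#{name}"
        end
      end

      SHARDED_COLLECTIONS = ["internal_metrics"].freeze

      def collection_for(name)
        if SHARDED_COLLECTIONS.include?(name)
          ShardedStore.collection(name)  # new code path, live for one collection
        else
          LegacyStore.collection(name)   # unchanged path for everything else
        end
      end

      puts collection_for("internal_metrics")  # => sharded:internal_metrics
      puts collection_for("charges")           # => legacy:charges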

    We rolled out sharding for that one collection later that week. Suddenly we had a clear set of priorities and things that needed tuning, and our worries shifted from abstract to concrete, which meant we could fix them. This changed sharding from vaporware into something that was used in production, somewhere.

    Getting sharding fully into production ended up hinging on conversations between the engineers driving the project and other people on the team who were less closely involved, but had the right background.

    The next milestone for sharding was rolling it out to critical data. Our databases were growing increasingly loaded, and if we didn't have sharding soon, we'd have to come up with a drastic stopgap.

    At this point, we'd become pretty comfortable with our sharding implementation. The main blocking point was building out a shard splitter. There was no way to make the shard splitter itself more incremental.

    At one point, an engineer working on sharding walked a counterpart removed from the project through all the details. The counterpart asked a bunch of questions about each component, just trying to fully understand what was going on. Eventually, he asked, "So the hard problem here is splitting shards, right? But why can't we just start putting new users onto a new shard?"

    And then the solution was clear: in fact, we could just punt on the shard splitter altogether. Sharding only new data would leave us no worse off than before, and would mean our databases wouldn't catch any further fire.
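
    A rough sketch of the idea (hypothetical, not the actual implementation): existing records keep whatever shard they already live on, and only newly created records get assigned to the newest shard, so no splitter is needed at all.

      # Illustrative only: avoid a shard splitter by leaving existing records
      # where they are and assigning only new records to the newest shard.
      SHARDS = ["shard0"]  # the original, increasingly loaded shard

      def add_shard!(name)
        SHARDS << name
      end

      def shard_for(record)
        # Existing records already carry the shard they were created on.
        return record[:shard] if record[:shard]

        # New records go to the newest shard; nothing ever has to move.
        record[:shard] = SHARDS.last
      end

      add_shard!("shard1")
      existing_user = { id: "usr_1", shard: "shard0" }
      new_user      = { id: "usr_2" }
      puts shard_for(existing_user)  # => shard0
      puts shard_for(new_user)       # => shard1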

    So suddenly we had a plan, which allowed us to reap immediate benefit from a project that would otherwise have taken a very long time to complete, and we didn't have to implement a stopgap.

    It's interesting to step back and ask: what actually happened here? It's not that the counterpart was a better engineer. Instead, I think the issue is that, when you're building a system, you need to make thousands of tiny decisions and judgement calls. Probabilistically, some percentage of these will be wrong. And so sitting down with someone who isn't steeped in the details, but has enough background to question your assumptions, will invariably be useful and help you discover something you wouldn't otherwise. If you're familiar with rubber ducking, I think of this as basically rubber ducking++.

    Many people talk about the idea of a bus factor, or the minimum number of people who could be removed from the project (graphically portrayed as being hit by a bus) before no one is left who is familiar with the code. Usually the focus of the bus factor is redundancy: you want to make sure that if your main programmer leaves the company, or wants to work on something else, the project doesn't suddenly stagnate because no one knows how it works. In reality, though, there's a more important benefit to maintaining a high bus factor: it just leads to better decisions and code.

    For some systems, it's not enough to sit down and talk design every so often; the assumption questioner really needs to be writing code alongside the primary author. One example here is Monster, a system for durable event processing that I wrote early in Stripe's history. It's the core backbone of our systems, and has grown from thousands to tens of millions of events per day.

    The way we build systems at Stripe is to roll them out with the simplest possible implementation, and improve from there. We'd started out by running all consumers in a single process, which round-robined among them. Over time, we noticed that low-priority but high-volume consumers could starve out high-priority consumers, and we ended up splitting consumers into groups according to their priority. We continued doing this sort of incremental scaling for quite some time.
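
    A minimal sketch of that kind of split (hypothetical; Monster's real scheduler isn't shown in this talk): instead of one loop round-robining over every consumer, each priority group gets its own round-robin loop, so a flood of low-priority events can't starve the high-priority consumers.

      # Illustrative only: one worker loop per priority group, round-robining
      # within the group, so low-priority volume can't starve high-priority work.
      CONSUMER_GROUPS = {
        high: ["charge.created", "payout.failed"],
        low:  ["analytics.pageview", "log.rotated"],
      }

      def run_group(priority, consumers, rounds)
        rounds.times do
          consumers.each do |consumer|
            # In a real system this would pop an event off the consumer's queue.
            puts "[#{priority}] processing one event for #{consumer}"
          end
        end
      end

      # One thread per priority group; each group makes progress independently.
      threads = CONSUMER_GROUPS.map do |priority, consumers|
        Thread.new { run_group(priority, consumers, 2) }
      end
      threads.each(&:join)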

    At some point, we decided the time had come to figure out how to scale Monster for the next order of magnitude of growth. All of the consumer-scheduling logic had been baked into Monster as an afterthought, and it felt like we should clearly just find a piece of software that already did that rather than roll our own. We were very cognizant of the second-system effect (that is, the tendency for system redesigns to end up with feature bloat), and chose to add only the bare minimum of required new features, such as a sharding scheme to parallelize individual consumers.

    After looking around, we settled on Storm as the closest thing to what we wanted. This would require rewriting Monster in Java. We didn't even entertain the notion of rewriting our consumers; instead, we immediately jumped to writing a multilang connector so that all of our consumers would remain written in Ruby. The design was carefully incremental, and allowed us to switch over just one event type to get something into production as early as possible. It seemed like we'd successfully applied all of our principles, and the project should be safe from massive delay.

    We kicked off the project about this time last year, with the belief that it'd be fully done by the beginning of 2013. As with sharding, one engineer went off and implemented the new design.

    However, we were hit with a bunch of implementation delays: our initial design for the new queuing layer turned out not to be performant; some of our consumers turned out to rely on being run in a certain order and had to be updated; and there were a bunch of other small things we hadn't accounted for. This meant we weren't running the first events through for an extra two months, almost twice the intended project length.

    Even though there was another engineer closely following the design, we realized that they weren't able to effectively question assumptions: since they weren't actually familiar enough with the code, it was very hard for them to grok what the actual implementation issues were, or to help point out which problems could be worked around. That meant we weren't able to get any of the usual benefits of an assumption questioner. Had there been someone else writing code, we would have been able to have much better conversations about it, ultimately ending up with a much better project and probably getting it done much sooner.

    During Stripe's second Capture the Flag, a security competition we ran last September, I really wanted a launch that didn't involve us working right up to the deadline, in contrast to the first. So we internally agreed on a soft launch deadline and a hard deadline a week later, and we left plenty of time to launch by the soft deadline. But at the soft deadline, we found ourselves with a bunch of work left to do. We then redoubled our efforts, and finished with just a few minutes to spare before the hard deadline.

    I find this soft/hard technique works well: you'll probably never make the soft deadline, but it at least gives you a checkpoint at which to decide what you need to do to make the hard one.

    If you're not familiar with Parkinson's Law [1], applied to software it's the statement that work expands to fill the time allocated to it. It's surprisingly true in practice. If you're not careful, your project will expand indefinitely in timeline and scope, and you need to put in stopgaps to counteract that.

    The only successful way I've seen of fighting Parkinson's Law is a forcing function. Maybe it's setting yourself a hard deadline and making sure you have some incentive to get it done by then (and that you can't easily push it back); or maybe it's hiring someone you don't quite have the infrastructure to support, so that you're forced to actually invest in building out the right tooling for them.

    For what it's worth, I used to not believe in forcing functions. To some extent, using a forcing function is an indication that you were unable to properly prioritize on your own, and so you should just get better at prioritizing rather than having to shell out to an external agent. However, in reality, I think prioritization is a 10x harder problem than anyone gives it credit for, and setting an external forcing function is really just a hack for shifting the prioritization burden to the universe.

    Now, you do have to be careful. You need to make sure that, whatever forcing function you set for yourself, you don't feel boxed into shipping an inferior product. Setting external-facing deadlines is generally a bad idea, whether with customers or with the media. It can be painful, since you'd love to tell a user that a feature will ship by the end of the month, but what do you do in the 50% of cases where the feature's been delayed by the fact that we don't really know how to do software? Perhaps the worst position an engineer can be in is feeling forced into a deadline that he or she didn't choose or agree to; at Stripe, all deadlines, together with how we expose them externally, are set by the engineers working on a project.

    [1] http://en.wikipedia.org/wiki/Parkinson's_law

    OK, so you've finally gotten your project ready to ship. How do you actually get it out there?

    At Stripe, "PM" is a verb, not a person. The primary engineer on a project PMs the launch. It's their responsibility to make sure that all of the concerns of getting the thing shipped are taken care of. They don't necessarily have to do everything themselves, but they should make sure it all gets done.

    As your company grows, there will be an increasing amount to think about surrounding a launch. Is anyone thinking about monitoring? Tracking? Performance? How this will affect existing users? It can be hard for any one person to think of every possible concern and ensure it's addressed. So you need to make sure there's a clear and predictable process for how things get launched, which everyone knows about and can participate in.

    The way we do this is that, a few days to a few weeks prior to launch (depending on the project), whoever's PMing sends out a "pre-shipped" email. This contains the relevant details of the launch: what's going out, what the goal is, and any other needed context. This is everyone else's chance to ask questions, or to batten down the hatches and prepare the systems they own for launch. You know you've failed if someone ever finds out about a launch at the same time the public does.

    It should be very clear what approval is needed in order to launch something. Depending on your organization, the answer might be "none; just go ahead and launch anything." You probably want a small set of people who are trusted arbiters of product quality, and who are the one source of approval needed to get something out. These days, we have a product-signoff list; you just need the approval of one of the people on that list in order to complete your launch.

    [1] http://www.quora.com/Stripe-company/Does-Stripe-have-product-managers-or-do-engineers-manage-the-products-themselves

    Sometimes, things will go wrong. Whether you misjudged how people would react to a new feature, or the site went down during routine maintenance, or you forgot to monitor some service and it silently fell over in the middle of the night, operational breakages are an expected part of the business of software. How you react to them is a key part of your culture.

    First of all, you should have a good postmortem culture in place. When things go wrong, it's an opportunity for you to build up expertise in how to do them right in the future. If you're a rapidly growing company, then this one mistake pales in comparison to repeating it 6-12 months from now, so the investment is well worthwhile.

    Our postmortems are pretty simple: we describe the effects, the root cause, and how we'll avoid the issue in the future. When writing postmortems, it's very important to do two things. First, postmortems shouldn't be about finger-pointing [1]. Even your best engineers will make mistakes (and honestly, it's probably the case that your best engineers will make the most mistakes, since they're getting the most done). Postmortems should be about figuring out what actually happened, and how to make sure it doesn't happen again. Second, it's important to avoid platitudes or things you won't actually change. It's very easy to say "we should improve the code here" or "we should have better test coverage," but without specific, actionable recommendations, nothing is going to change.

    [1] http://codeascraft.com/2012/05/22/blameless-postmortems/

    If there's one thing your engineering culture needs to do well, it's collaboration. If you're doing it right, your organization is a collection of individually capable nodes, and then the biggest challenge you have is coordination among those nodes. If collaboration is broken, then everything else I just talked about doesn't matter, and you're pretty much doomed.

    At Stripe, we look for low-effort ways to make information accessible within the company. Usually we take communication that is already happening and, when it makes sense, shift it to a standardized public forum. That makes it way easier for others to stay in the loop, without requiring much overhead from the people generating the communications.

    One example is what we call email transparency [1]. The idea is that you should CC a mailing list on all emails you send, down to the person-to-person emails that you'd think no one else would be interested in. With the right list infrastructure, this allows people to passively subscribe to the feed of everything going on in the company, while only requiring marginal effort on the part of the people sending the email. (Of course, you have to be careful about how far you take this, as emails that are personal or personnel-related should in fact be kept private. We leave that judgement call up to the discretion of the author.)

    Another primitive we use internally is the SRFC, or Stripe Request For Comments. These are effectively just design documents, but the interesting bit is that they live in a standardized place where anyone can comment inline. We write SRFCs for everything from new systems to conference-room naming schemes to hiring strategies. All of these documents would likely get written at any other company, but simply by providing a well-known forum for them, we end up with a lot of collaboration we wouldn't otherwise get.

    We will sometimes do more active things, such as status emails, but these other techniques make the active ones much less burdensome.

    Documentation is an interesting communication channel. It's something that everyone thinks of as a must-have, but I think most people write the wrong kind of documentation, or view it in the wrong light.

    In-depth code documentation has an overwhelming tendency to go stale. There are no tests for it, and nothing breaks when you change the code but not the docs, so the natural tendency is for documentation to fall out of date with the code. The best documentation serves as a pointer: it gives someone new to the system enough high-level context and concepts to know where to get started, and it's very unlikely to go stale.

    Like it or not, understanding what's actually going on is basically always going to require reading the code, and you should write your documentation with that in mind.

    Meetings are one communication channel that gets a bad rap. I think that's not because meetings are inherently flawed, but because people's usage of them tends to be flawed. There are two kinds of useful meetings.

    The first is "let's kick around a bunch of ideas for us to later go off and think about." The second is "let's take these concrete proposals we've been discussing elsewhere and make a decision." In both of these cases, it's useful to get a lock on everyone's time and make everyone focus simultaneously. But if you try to mix and match the two modes, you'll notice nothing gets done.

    Also, you should be cognizant that meetings have a high fixed cost: you basically can't write any code for the 30 minutes leading up to one, since you know you're just going to be interrupted, and you'll also spend the 30 minutes after the meeting ramping back up into the zone. So use meetings as a primitive in your culture, but make sure you're using them properly.

    One of the biggest challenges with having lots of great people is figuring out what to do when there's a disagreement. Especially when it comes to architecture, there are often several plausible alternatives, and people tend to come down vigorously on one side or the other. At some point, it's better to just make some decision than to continue debating, and you need to make sure there's a clear way that that happens in your organization.

    The way we do this at Stripe evolved from an early debate around our API's design. Stripe's first API was effectively JSON-RPC. We didn't use any of the features of HTTP, such as status codes, URLs, or headers. When we were eight people, one of the recent hires spoke up about this, saying that it was a bad idea and we should switch over to a much more RESTful interface. This kicked off three weeks of debate, with half the company on each side of the argument. To make matters worse, within any given side each person had their own sub-opinion. We correspondingly ended up debating the sub-opinions as much as we debated the REST vs. non-REST question, which was clearly just a poor use of time.

    Finally, the engineer who had started the debate went and implemented his proposal on a branch. We took a look, and everyone agreed it felt at least as good as what we already had. We realized we'd just lost three weeks on this conversation, and since this was a net win, we took the plunge and rolled his change out to production.

    We realized a few things as a result of this. First of all, you just can't have a technical debate with more than four people. Usually there are two or maybe three major opinions; you should make sure a representative of each is in the room and let them hash it out, but it's just far too inefficient to add incremental people.

    But perhaps more fundamentally, we realized that you do just need someone who can make the final call on what we're doing. You want it to be someone who feels a lot of ownership over the domain, and who has great judgement that people respect, but other than that it doesn't even really matter who it is.

    We split our projects into different components, and assigned someone as the owner of each of them. The owner is given final say over everything affecting their domain, and they are correspondingly also responsible for its overall quality. It's not always a glamorous, call-the-shots role: if there's ugly maintenance to do, they don't necessarily need to do it themselves, but they do need to make sure it gets done.

    We made that engineer the owner of the API, and we never had to sit around paralyzed over a change for three weeks again.

    One question that many organizations face is: should we hire remote engineers? It's certainly very tempting to do so: there are many great engineers who don't happen to live within commuting distance of your office, and if you can hire them, it seems like a great way to expand your team.

    Getting things done as a remote engineer is largely an exercise in gathering information. You're coming from way behind relative to your local counterparts. Every time there's an IRL conversation, local engineers have some chance of walking by and hearing it. As a remote engineer, you're just excluded. The only way you can know what's going on, and consequently what's important to be working on, is through what ends up in email, code, or IRC.

    Often, these communication mechanisms are less convenient than IRL conversation. (That's the cost you incur for being able to hire these great engineers outside commuting range.) Correspondingly, people just won't shift their communication without a forcing function; one remote person complaining about being left out of the loop just won't be enough. The only way to ensure that the shift happens is to have a real team of remote engineers. So if you're thinking of going down the remote-engineer route, it can work, but you need to make sure the teams remote engineers work on are distributed enough that the pains of being remote actually get addressed.