StartOps: Growing an ops team from 1 founder

- Lot of knowledge online but it usually assumes you have a team, lots of time and money
- That is the goal but it doesn't start like that, so I'm going to talk about the stages to achieve it
- Tips and tools to help along the way
- Use my own company and gratuitous photos of Japan to illustrate the point
Bootstrapped startups don't have the luxury of a full team of ops engineers available to respond to issues 24/7, so how can you survive on your own? This talk tells the story of running your infrastructure as a single founder through to growing that into a team of on-call engineers. It includes some interesting war stories as well as tips and suggestions for how to run ops at a startup.
Presented at DevOpsDays London 2013 by David Mytton.
Transcript
StartOps: Growing an ops team from 1 founder
- Lot of knowledge online but it usually assumes you have a team, lots of time and money
- That is the goal but it doesn't start like that, so I'm going to talk about the stages to achieve that
- Tips and tools to help along the way
- Use my own company and gratuitous photos of Japan to illustrate the point
David Mytton
Woop Japan!
Bootstrapping sometimes means leaving things to the last minute.
Photo: dannychoo.com
- First tip
- Limited resources, people, time
April 2009
- Quick development
- Experience with PHP + MySQL
- Slicehost was cheap
- Problems with MySQL so moved to MongoDB
- Depending on the level of support you buy
- Expensive
- There are ways to work around that; getting involved with projects
Outsourcing
- Engineers are terrible at valuing their own time
- "Why pay for something I can build/install/configure myself?"
- Can pay a trusted company/individual to do things
- Lots of little things that need doing
- Examples
Service access list
- List of services employees have access to
- Revoking credentials
- Adding new users
- Password management
PCI certification
- Paperwork / checklist
CDN research
- Paperwork / checklist
Is it time consuming?
Boring?
Measurable improvement?
2010 - 2011
And then there were 3
- Added a new engineer at the end of 2009 and the team stayed at 3 until the start of 2011
- With more than 1 person you start having to think properly
Dealing with humans
- As much as we'd like an API to life, managing human issues becomes important for scaling
Automate as much as possible
- You want to remove humans from as much as possible
- Prevents mistakes, makes things easier and faster
- Keeps a log of what happened
- Ideally you only ever want to manually do something once
- Even with just 1 person, setting up config management is a minimum
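The core idea behind config management can be sketched in a few lines: declare the desired state once, converge to it, and make re-running always safe. This is only an illustration (the file path and config contents are made up); in practice you'd reach for a real tool such as Puppet or Chef.

```python
# Minimal sketch of idempotent configuration: declare the desired state
# once, converge to it, and re-running is always a safe no-op.
# The path and contents below are illustrative, not a real app config.
import os

DESIRED_FILES = {
    "/tmp/demo-app.conf": "port = 8080\nworkers = 4\n",
}

def converge(files):
    """Ensure each file exists with the desired content; return what changed."""
    changed = []
    for path, content in files.items():
        current = None
        if os.path.exists(path):
            with open(path) as f:
                current = f.read()
        if current != content:
            with open(path, "w") as f:
                f.write(content)
            changed.append(path)
    return changed

first_run = converge(DESIRED_FILES)   # writes the file if needed
second_run = converge(DESIRED_FILES)  # no-op: already in the desired state
```

The payoff is the second run: because the script describes state rather than actions, running it again (by hand, from cron, on a rebuilt server) never makes things worse.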
Siloed information
- Small team so usually 1 person responsible for a lot of code
- Not reasonable to have to ask that person every time there's a problem with that bit
Up to date docs
- Every component should be fully documented
- Consider appliance manuals with the troubleshooting tables they have at the back
- Table of potential failures and how to deal with them
- Vendor contact information
- Team contact information
- Have someone responsible for keeping them up to date
Checklists
- Stolen from the Checklist Manifesto / airline industry
- Any manual steps, however trivial, should be checklisted
- Failover, backup recovery, incident handling
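A checklist can even be machine-readable, so a run only counts as complete when every step has been acknowledged in order. A small sketch, with an invented failover checklist (these are not the actual procedure):

```python
# Sketch of a checklist runner: every manual step, however trivial, is
# listed, and each must be acknowledged in order. Steps are illustrative.
FAILOVER_CHECKLIST = [
    "Confirm the primary is really down (check from a second location)",
    "Promote the hot spare",
    "Point the load balancer at the new primary",
    "Verify application health checks pass",
    "Post an update to the status page",
]

def run_checklist(steps, confirm):
    """Walk the steps in order; `confirm(step)` returns True when acknowledged."""
    completed = []
    for step in steps:
        if not confirm(step):
            return completed, False  # stop at the first unconfirmed step
        completed.append(step)
    return completed, True

done, ok = run_checklist(FAILOVER_CHECKLIST, confirm=lambda step: True)
```

In an incident, `confirm` would be a human pressing a key; recording which step a run stopped at is exactly the information you want in the post-incident writeup.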
Force scripting
- Takes a bit of extra time but the ROI is massive
- Disallow direct access to things e.g. database queries
- Better to push a button and get a guaranteed result than risk mistakes
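One way to think about forced scripting: direct access is replaced by a small set of named, pre-reviewed actions. The action names and queries below are invented for illustration:

```python
# Sketch of forced scripting: engineers run named, pre-reviewed actions
# instead of typing ad-hoc queries. Action names and SQL are illustrative.
ALLOWED_ACTIONS = {
    "count_users": "SELECT COUNT(*) FROM users",
    "recent_signups": "SELECT email FROM users ORDER BY created DESC LIMIT 10",
}

def run_action(name, execute):
    """Look up a whitelisted action and run it via `execute`; reject the rest."""
    if name not in ALLOWED_ACTIONS:
        raise ValueError(f"action {name!r} is not whitelisted")
    return execute(ALLOWED_ACTIONS[name])

# `execute` would normally talk to the database; here it just echoes the SQL
result = run_action("count_users", execute=lambda sql: f"ran: {sql}")
```

Anything not on the list fails loudly, which is the point: the guaranteed-result button exists, and the risky free-form path does not.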
2012 - 2013
Growing to 12
- 12, 11 of which are technical
- Now have the luxury of being able to spread things out
- Proper on-call schedule
On-call
- Sharing out the responsibility
- Determining level of response: 24/7 real monitoring or first responder
- 24/7 real monitoring for HA environments, real people at a screen at all times
- First responder: people at the end of a phone
On-call 1) Ops engineer
- During working hours our dedicated ops engineers take the first level
- Avoids interrupting product engineers for initial fire fighting
2) All engineers
- Out of hours we rotate every engineer, product and ops
- Rotation every 7 days on a Tuesday
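A weekly Tuesday rotation like this is simple to compute. A sketch, with invented engineer names and an assumed anchor date (any Tuesday works):

```python
# Sketch of a weekly on-call rotation that switches every 7 days.
# Engineer names and the anchor Tuesday are illustrative assumptions.
from datetime import date

ENGINEERS = ["alice", "bob", "carol", "dave"]
ROTATION_START = date(2013, 3, 5)  # a Tuesday: handover day for the rotation

def on_call(today):
    """Return who is on call on `today`, rotating weekly from the anchor."""
    weeks_elapsed = (today - ROTATION_START).days // 7
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]
```

Floor-dividing the elapsed days by 7 means everyone between one Tuesday and the next maps to the same engineer, and the modulo wraps the rotation around the team.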
3) Ops engineer
- Always have a secondary
- This is always an ops engineer
- The thinking is that if the issue needs to be escalated then it's likely a bigger problem that needs additional systems expertise
4) Others
- Next month we're launching a major new product into beta
- Support from design / frontend engineering
- Have to press a button to get them involved
Off-call
- Responders to an incident get the next 24 hours off-call
- Social issues to deal with
On-call CEO
- I receive push notifications + e-mails for all outages
Uptime reporting
- Weekly internal report on G+
- Gives visibility to entire company about any incidents
- Allows us to discuss incidents to get to that 100% uptime
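The headline uptime number in such a report is simple arithmetic over the incident log: downtime minutes divided by the minutes in the reporting window. A sketch with made-up incident durations:

```python
# Sketch of the weekly uptime figure: total downtime minutes over the
# reporting window. The incident durations below are made up.
WEEK_MINUTES = 7 * 24 * 60  # 10,080 minutes in a weekly window

def uptime_percent(incident_minutes):
    """Percentage of the week the service was up, given downtime per incident."""
    downtime = sum(incident_minutes)
    return 100.0 * (WEEK_MINUTES - downtime) / WEEK_MINUTES

weekly = uptime_percent([12, 3])  # two incidents: 12 and 3 minutes of downtime
```

Even 15 minutes of downtime in a week drops you below 99.9%, which is why the report frames every incident as a step towards (or away from) that 100% goal.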
Social issues
- How quickly can you get to a computer?
- Are they out drinking on a Friday?
- What happens if someone is ill?
- What if there's a sudden emergency: accident? family emergency?
- Do they have enough phone battery?
- Can you hear the ringtone?
Backup responder
- Backup responder
- Time out the initial responder
- Escalate difficult problems
- Essentially human redundancy: phone provider, geographic area, internet connectivity
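The timeout-and-escalate behaviour boils down to one pass over the escalation order. A sketch (names and the 5-minute timeout are illustrative; a paging service would drive this off real acknowledgements):

```python
# Sketch of escalation: page responders in order and stop at the first
# one who acknowledges within the timeout. All data here is illustrative.
def choose_responder(ack_minutes, timeout=5):
    """ack_minutes: ordered list of (name, minutes-to-ack or None if no answer).
    Return the first responder who acknowledges within `timeout`."""
    for name, ack in ack_minutes:
        if ack is not None and ack <= timeout:
            return name
    return None  # nobody answered: time for a bigger alarm

# Primary never answers; the secondary (an ops engineer) acks in 2 minutes
who = choose_responder([("primary", None), ("secondary_ops", 2)])
```

This is the human-redundancy point in code form: as long as the responders differ by phone provider, location, and connectivity, a single failure can't take out the whole chain.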
Expected
Dealing with outages
- Outages are going to happen, especially at the beginning
- Redundancy costs money
- What matters is how you deal with them
Externally
Communication
- Telling people what is happening
- Frequently
- Dependent on audience - we can go into more detail because our customers are techies
- GitHub do a good job of providing incident writeups but don't give a good idea of what is happening right now
- Generally Amazon and Heroku are good and go into more detail
Communication
Internally
- Open Skype conferences between the responders
- Usually mostly silence or the sound of the keyboard, but simulates being in the situation room
- Faster than typing
Really test your vendors
- Shows up flaws in vendor support processes
- Frustrating when waiting on someone else
- You want as much information as possible
- Major outage? Everyone will be calling them
Simulations
- Try to avoid unnecessary problems
- Do servers come back up from boot?
- Can hot spares handle the load?
- Test failover: databases, HA firewalls
- Regularly reboot servers
- Wargames can happen at another stage: startups are usually too focused on building things first
You want your own team
- The only ones who care the most
- Know the most
- Can fix things fastest