StartOps: Growing an ops team from 1 founder

- Lot of knowledge online but it usually assumes you have a team, lots of time and money
- That is the goal but it doesn't start like that, so I'm going to talk about the stages to achieve it
- Tips and tools to help along the way
- Use my own company and gratuitous photos of Japan to illustrate the point
Bootstrapped startups don't have the luxury of a full team of ops engineers available to respond to issues 24/7, so how can you survive on your own? This talk tells the story of running your infrastructure as a single founder through to growing that into a team of on-call engineers. It includes some interesting war stories as well as tips and suggestions for how to run ops at a startup.
Presented at DevOpsDays London 2013 by David Mytton.
Transcript
StartOps: Growing an ops team from 1 founder
- Lot of knowledge online but it usually assumes you have a team, lots of time and money
- That is the goal but it doesn't start like that, so I'm going to talk about the stages to achieve that
- Tips and tools to help along the way
- Use my own company and gratuitous photos of Japan to illustrate the point
David Mytton
Woop Japan!
Bootstrapping sometimes means leaving things to the last minute.
Photo: dannychoo.com
- First tip
- Limited resources, people, time
April 2009
- Quick development
- Experience with PHP + MySQL
- Slicehost was cheap
- Problems with MySQL so moved to MongoDB
- Depending on the level of support you buy
- Expensive
- There are ways to work around that; getting involved with projects
Outsourcing
- Engineers are terrible at valuing their own time
- "Why pay for something I can build/install/configure myself?"
- Can pay a trusted company/individual to do things
- Lots of little things that need doing
- Examples
Service access list
- List of services employees have access to
- Revoking credentials
- Adding new users
- Password management
PCI certification
- Paperwork / checklist
CDN research
- Paperwork / checklist
Is it time consuming?
Boring?
Measurable improvement?
2010 - 2011
And then there were 3
- Added a new engineer at the end of 2009 and the team stayed at 3 until the start of 2011
- With more than 1 person you start having to think properly
Dealing with humans
- As much as we'd like an API to life, managing human issues becomes important for scaling
Automate as much as possible
- You want to remove humans from as much as possible
- Prevents mistakes, makes things easier and faster
- Keeps a log of what happened
- Ideally you only ever want to manually do something once
- Even with just 1 person, setting up config management is a minimum
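The core idea behind config management can be sketched in a few lines: declare the desired state once, converge to it, and make re-running always safe. This is only an illustration (the file path and config contents are made up); in practice you'd reach for a real tool such as Puppet or Chef.

```python
# Minimal sketch of idempotent configuration: declare the desired state
# once, converge to it, and re-running is always a safe no-op.
# The path and contents below are illustrative, not a real app config.
import os

DESIRED_FILES = {
    "/tmp/demo-app.conf": "port = 8080\nworkers = 4\n",
}

def converge(files):
    """Ensure each file exists with the desired content; return what changed."""
    changed = []
    for path, content in files.items():
        current = None
        if os.path.exists(path):
            with open(path) as f:
                current = f.read()
        if current != content:
            with open(path, "w") as f:
                f.write(content)
            changed.append(path)
    return changed

first_run = converge(DESIRED_FILES)   # writes the file if needed
second_run = converge(DESIRED_FILES)  # no-op: already in the desired state
```

The payoff is the second run: because the script describes state rather than actions, running it again (by hand, from cron, on a rebuilt server) never makes things worse.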
Siloed information
- Small team so usually 1 person responsible for a lot of code
- Not reasonable to have to ask that person every time there's a problem with that bit
Up to date docs
- Every component should be fully documented
- Consider appliance manuals with the troubleshooting tables they have at the back
- Table of potential failures and how to deal with them
- Vendor contact information
- Team contact information
- Have someone responsible for keeping them up to date
Checklists
- Stolen from the Checklist Manifesto / airline industry
- Any manual steps, however trivial, should be checklisted
- Failover, backup recovery, incident handling
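A checklist can even be machine-readable, so a run only counts as complete when every step has been acknowledged in order. A small sketch, with an invented failover checklist (these are not the actual procedure):

```python
# Sketch of a checklist runner: every manual step, however trivial, is
# listed, and each must be acknowledged in order. Steps are illustrative.
FAILOVER_CHECKLIST = [
    "Confirm the primary is really down (check from a second location)",
    "Promote the hot spare",
    "Point the load balancer at the new primary",
    "Verify application health checks pass",
    "Post an update to the status page",
]

def run_checklist(steps, confirm):
    """Walk the steps in order; `confirm(step)` returns True when acknowledged."""
    completed = []
    for step in steps:
        if not confirm(step):
            return completed, False  # stop at the first unconfirmed step
        completed.append(step)
    return completed, True

done, ok = run_checklist(FAILOVER_CHECKLIST, confirm=lambda step: True)
```

In an incident, `confirm` would be a human pressing a key; recording which step a run stopped at is exactly the information you want in the post-incident writeup.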
Force scripting
- Takes a bit of extra time but the ROI is massive
- Disallow direct access to things e.g. database queries
- Better to push a button and get a guaranteed result than risk mistakes
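One way to think about forced scripting: direct access is replaced by a small set of named, pre-reviewed actions. The action names and queries below are invented for illustration:

```python
# Sketch of forced scripting: engineers run named, pre-reviewed actions
# instead of typing ad-hoc queries. Action names and SQL are illustrative.
ALLOWED_ACTIONS = {
    "count_users": "SELECT COUNT(*) FROM users",
    "recent_signups": "SELECT email FROM users ORDER BY created DESC LIMIT 10",
}

def run_action(name, execute):
    """Look up a whitelisted action and run it via `execute`; reject the rest."""
    if name not in ALLOWED_ACTIONS:
        raise ValueError(f"action {name!r} is not whitelisted")
    return execute(ALLOWED_ACTIONS[name])

# `execute` would normally talk to the database; here it just echoes the SQL
result = run_action("count_users", execute=lambda sql: f"ran: {sql}")
```

Anything not on the list fails loudly, which is the point: the guaranteed-result button exists, and the risky free-form path does not.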
2012 - 2013
Growing to 12
- 12, 11 of which are technical
- Now have the luxury of being able to spread things out
- Proper on-call schedule
On-call
- Sharing out the responsibility
- Determining level of response: 24/7 real monitoring or first responder
- 24/7 real monitoring for HA environments, real people at a screen at all times
- First responder: people at the end of a phone
On-call 1) Ops engineer
- During working hours our dedicated ops engineers take the first level
- Avoids interrupting product engineers for initial fire fighting
2) All engineers
- Out of hours we rotate every engineer, product and ops
- Rotation every 7 days on a Tuesday
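A weekly Tuesday rotation like this is simple to compute. A sketch, with invented engineer names and an assumed anchor date (any Tuesday works):

```python
# Sketch of a weekly on-call rotation that switches every 7 days.
# Engineer names and the anchor Tuesday are illustrative assumptions.
from datetime import date

ENGINEERS = ["alice", "bob", "carol", "dave"]
ROTATION_START = date(2013, 3, 5)  # a Tuesday: handover day for the rotation

def on_call(today):
    """Return who is on call on `today`, rotating weekly from the anchor."""
    weeks_elapsed = (today - ROTATION_START).days // 7
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]
```

Floor-dividing the elapsed days by 7 means everyone between one Tuesday and the next maps to the same engineer, and the modulo wraps the rotation around the team.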
3) Ops engineer
- Always have a secondary
- This is always an ops engineer
- The thinking is that if the issue needs to be escalated then it's likely a bigger problem that needs additional systems expertise
4) Others
- Next month we're launching a major new product into beta
- Support from design / frontend engineering
- Have to press a button to get them involved
Off-call
- Responders to an incident get the next 24 hours off-call
- Social issues to deal with
On-call CEO
- I receive push notifications + e-mails for all outages
Uptime reporting
- Weekly internal report on G+
- Gives visibility to entire company about any incidents
- Allows us to discuss incidents to get to that 100% uptime
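The headline uptime number in such a report is simple arithmetic over the incident log: downtime minutes divided by the minutes in the reporting window. A sketch with made-up incident durations:

```python
# Sketch of the weekly uptime figure: total downtime minutes over the
# reporting window. The incident durations below are made up.
WEEK_MINUTES = 7 * 24 * 60  # 10,080 minutes in a weekly window

def uptime_percent(incident_minutes):
    """Percentage of the week the service was up, given downtime per incident."""
    downtime = sum(incident_minutes)
    return 100.0 * (WEEK_MINUTES - downtime) / WEEK_MINUTES

weekly = uptime_percent([12, 3])  # two incidents: 12 and 3 minutes of downtime
```

Even 15 minutes of downtime in a week drops you below 99.9%, which is why the report frames every incident as a step towards (or away from) that 100% goal.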
Social issues
- How quickly can you get to a computer?
- Are they out drinking on a Friday?
- What happens if someone is ill?
- What if there's a sudden emergency: accident? family emergency?
- Do they have enough phone battery?
- Can you hear the ringtone?
Backup responder
- Backup responder
- Time out the initial responder
- Escalate difficult problems
- Essentially human redundancy: phone provider, geographic area, internet connectivity
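The timeout-and-escalate behaviour boils down to one pass over the escalation order. A sketch (names and the 5-minute timeout are illustrative; a paging service would drive this off real acknowledgements):

```python
# Sketch of escalation: page responders in order and stop at the first
# one who acknowledges within the timeout. All data here is illustrative.
def choose_responder(ack_minutes, timeout=5):
    """ack_minutes: ordered list of (name, minutes-to-ack or None if no answer).
    Return the first responder who acknowledges within `timeout`."""
    for name, ack in ack_minutes:
        if ack is not None and ack <= timeout:
            return name
    return None  # nobody answered: time for a bigger alarm

# Primary never answers; the secondary (an ops engineer) acks in 2 minutes
who = choose_responder([("primary", None), ("secondary_ops", 2)])
```

This is the human-redundancy point in code form: as long as the responders differ by phone provider, location, and connectivity, a single failure can't take out the whole chain.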
Expected
Dealing with outages
- Outages are going to happen, especially at the beginning
- Redundancy costs money
- What matters is how you deal with them
Externally
Communication
- Telling people what is happening
- Frequently
- Dependent on audience - we can go into more detail because our customers are techies
- GitHub do a good job of providing incident writeups but don't give a good idea of what is happening right now
- Generally Amazon and Heroku are good and go into more detail
Communication
Internally
- Open Skype conferences between the responders
- Usually mostly silence or the sound of the keyboard, but simulates being in the situation room
- Faster than typing
Really test your vendors
- Shows up flaws in vendor support processes
- Frustrating when waiting on someone else
- You want as much information as possible
- Major outage? Everyone will be calling them
Simulations
- Try to avoid unnecessary problems
- Do servers come back up from boot?
- Can hot spares handle the load?
- Test failover: databases, HA firewalls
- Regularly reboot servers
- Wargames can happen at another stage: startups are usually too focused on building things first
You want your own team
- The only ones who care the most
- Know the most
- Can fix things fastest