How It All Goes Down: Your next service outage will be during @ctosummit at 1:50pm.
Daniel Doubrovkine [email protected] @dblockdotorg
Jun 27, 2015
Do you have a Syrian domain?
Somewhere here is your .sy TLD authority.
Are you using MongoDB?
Do you use a PAAS?
Is there any logic involved?
1. Restart Unicorns, no improvement.
2. Restart a server, no improvement.
3. Found an EC2 maintenance note buried in email.
4. Patch responsible for 5x slower response times!
5. Do a DB query from a production console.
6. Identify a pattern (slow, then fast).
7. Look in MongoDB logs.
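Steps 5 and 6 (run a query by hand, then spot the slow-then-fast pattern) can be sketched in a few lines. This is an illustration only, not what was run during the incident: `sample_latencies`, `slow_then_fast`, and the `factor` threshold are hypothetical names, and `query` stands in for whatever database call you would issue from a production console.

```python
import statistics
import time
from typing import Callable, List


def sample_latencies(query: Callable[[], object], runs: int = 10) -> List[float]:
    """Run the same query several times and record wall-clock seconds per run."""
    timings = []
    for _ in range(runs):
        started = time.perf_counter()
        query()  # hypothetical: any zero-argument callable that hits the database
        timings.append(time.perf_counter() - started)
    return timings


def slow_then_fast(timings: List[float], factor: float = 3.0) -> bool:
    """True when the first run took at least `factor` times the median of the rest."""
    if len(timings) < 2:
        return False
    baseline = statistics.median(timings[1:])
    return timings[0] >= factor * baseline
```

A first call that is several times slower than the ones that follow suggests something is being re-established on the first call (a connection, a cache), which is a hint to go read the database logs next.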
Are there humans involved?
Are you using a beta version of a driver?
Post-Mortem
Subject
Summary
We have experienced 3 separate outages over the last 72 hours that have affected api.artsy.net, which is the backbone of all our applications, including Admin, CMS, artsy.net, m.artsy.net, etc.
The first incident was a period of slower response times, 9/28 2AM-10AM EST.
The second and third incidents were two outages, 9/29 11PM-12AM EST and 9/30 4AM-6:30AM EST.
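To put numbers on the three windows above, here is a small arithmetic sketch. The slides do not state the year, so `YEAR` is a placeholder, and the labels "degraded" and "outage" are my own tags for the incidents.

```python
from datetime import datetime, timedelta

YEAR = 2000  # placeholder only; the slides do not give the year

# (kind, start, end) in EST, taken from the summary above.
incidents = [
    ("degraded", datetime(YEAR, 9, 28, 2, 0), datetime(YEAR, 9, 28, 10, 0)),
    ("outage", datetime(YEAR, 9, 29, 23, 0), datetime(YEAR, 9, 30, 0, 0)),
    ("outage", datetime(YEAR, 9, 30, 4, 0), datetime(YEAR, 9, 30, 6, 30)),
]


def total_duration(kind: str) -> timedelta:
    """Sum the length of every incident window of the given kind."""
    return sum((end - start for k, start, end in incidents if k == kind), timedelta())


print(total_duration("degraded"))  # 8:00:00
print(total_duration("outage"))    # 3:30:00
```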
Cause
1. Trigger. Servers rebooted.
2. Unexpected behavior. Reboots should have been handled just fine by software.
3. Cause. Human error + bug in the driver.
4. Consequence. All front-ends down.
Resolution
1. Ticket. Woke up a contractor.
2. Manual intervention. Kick servers.
Post-Mortem
1. Human Error Preventable. The dismissal of the alert was wrong.
2. Failure to Plan. Reboot schedule needed a human to monitor it.
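One way to keep a reboot schedule from hiding in email is to poll for scheduled events and surface them somewhere people look. The sketch below is an assumption about how that could be done, not what Artsy did: it uses boto3's `describe_instance_status` call, and the helper names `fetch_instance_status` and `pending_maintenance` are hypothetical. It ignores pagination, so treat it as a starting point.

```python
from typing import Any, Dict, List, Tuple


def fetch_instance_status() -> Dict[str, Any]:
    """Ask EC2 for the status of all instances (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so pending_maintenance works without it

    return boto3.client("ec2").describe_instance_status(IncludeAllInstances=True)


def pending_maintenance(response: Dict[str, Any]) -> List[Tuple[str, str, Any]]:
    """Return (instance id, event code, not-before time) for events still to come."""
    pending = []
    for status in response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            # EC2 prefixes the description of finished events with "[Completed]".
            if event.get("Description", "").startswith("[Completed]"):
                continue
            pending.append((status["InstanceId"], event["Code"], event.get("NotBefore")))
    return pending
```

Calling `pending_maintenance(fetch_instance_status())` from a scheduled job and posting a non-empty result to the team channel would turn the buried maintenance note into an alert that a human has to acknowledge.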
Outage History