How it All Goes Down

Post on 27-Jun-2015

Category:

Technology

DESCRIPTION

"How it All Goes Down" presented at CTO Summit, NYC, November 3rd, 2014.

Transcript

How It All Goes Down

Your next service outage will be during @ctosummit at 1:50pm.

Daniel Doubrovkine db@artsy.net

@dblockdotorg

Do you have a Syrian domain?

Somewhere here is your .sy TLD authority.

Are you using MongoDB?

Do you use a PAAS?

Is there any logic involved?

1. Restart Unicorns, no improvement.
2. Restart a server, no improvement.
3. Found an EC2 maintenance note buried in email.
4. Patch responsible for 5x slower response times!
5. Do a DB query from a production console.
6. Identify a pattern (slow, then fast).
7. Look in MongoDB logs.
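Steps 5 and 6 above (run a query by hand, then look for a slow/fast pattern) could be sketched roughly as follows. This is an illustrative sketch, not what was run during the incident: `time_calls`, `label_latencies`, the `factor` threshold and `fake_query` are all hypothetical names, and `fake_query` merely sleeps instead of talking to a real database.

```python
import time


def time_calls(query, runs=10):
    """Call `query` `runs` times and return each call's latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        query()
        latencies.append(time.perf_counter() - start)
    return latencies


def label_latencies(latencies, factor=3.0):
    """Label a call 'slow' if it took more than `factor` times the fastest call.

    `factor` is an arbitrary illustrative threshold. Expects a non-empty list.
    """
    baseline = min(latencies)
    return ["slow" if t > factor * baseline else "fast" for t in latencies]


if __name__ == "__main__":
    calls = {"count": 0}

    def fake_query():
        # Hypothetical stand-in for a real database query:
        # every other call is artificially slow.
        calls["count"] += 1
        time.sleep(0.05 if calls["count"] % 2 else 0.005)

    print(label_latencies(time_calls(fake_query, runs=6)))
```

Printing the labels in call order makes an alternating or "first call slow" pattern easier to spot than eyeballing raw timings, which is the kind of hint that points toward the database logs in step 7.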

Are there humans involved?

Are you using a beta version of a driver?

Post-Mortem

Subject

Summary

We have experienced 3 separate outages over the last 72 hours that have affected api.artsy.net, which is the backbone of all our applications, including Admin, CMS, artsy.net, m.artsy.net, etc.

The first incident was a period of slower response times on 9/28, 2AM-10AM EST.

The second and third incidents were two outages, 9/29 11PM-12AM EST and 9/30 4AM-6:30AM EST.

Cause

1. Trigger. Servers rebooted.

2. Unexpected behavior. Reboots should have been handled just fine by software.

3. Cause. Human error + bug in the driver.

4. Consequence. All front-ends down.

Resolution

1. Ticket. Woke up a contractor.

2. Manual intervention. Kick servers.

Post-Mortem

1. Human error, preventable. The dismissal of the alert was wrong.

2. Failure to plan. The reboot schedule needed a human to monitor it.

Outage History

Thanks!

Daniel Doubrovkine db@artsy.net

@dblockdotorg
