BigDoor's Jeff Malek Gluecon Presentation

Jul 25, 2015

Technology

Carrie Peters
Transcript
Page 1: BigDoor's Jeff Malek Gluecon Presentation

@JPMALEK

04/14/2023 1

Retrospective from a startup built in the cloud: top 3 big lessons from the AWS outage on 04.21.2011, plus 4,369 other smaller ones

Page 2: BigDoor's Jeff Malek Gluecon Presentation

What a country : entrepreneurial resiliency

Page 3: BigDoor's Jeff Malek Gluecon Presentation

“robust systems: highly fault-tolerant, on or off grid. e.g.: our culture wrt entrepreneurs, AWS, the BD API”

(true story)

Page 4: BigDoor's Jeff Malek Gluecon Presentation

Boom

Page 5: BigDoor's Jeff Malek Gluecon Presentation

good to be home!

Go Buffs

Page 6: BigDoor's Jeff Malek Gluecon Presentation

me: previous startup

teams in 3 countries

highly transactional system

MS tech: IIS/MS SQL Server

co-located, leased/owned hardware

0% in cloud

$75M/yearly rev

Page 7: BigDoor's Jeff Malek Gluecon Presentation

me: current startup

systems 100% on AWS

99% free/open-source software

standing on the shoulders of giants

Page 8: BigDoor's Jeff Malek Gluecon Presentation

fault tolerance: 3 to 47 important failearnings

and 4,369 less important ones

Page 9: BigDoor's Jeff Malek Gluecon Presentation

in the context of our startup, of course

YMMV depending on velocity

Page 10: BigDoor's Jeff Malek Gluecon Presentation

Ruger

Page 11: BigDoor's Jeff Malek Gluecon Presentation

The Ruger Fault Equivalency

time = money

fault tolerance = time²  - risk tolerance

Also known as:

'Fast, good and cheap: pick two'

Page 12: BigDoor's Jeff Malek Gluecon Presentation

system design philosophy:

leverage proven, open-source tech in the cloud

to build a scalable, reliable, secure operational foundation

quickly

Page 13: BigDoor's Jeff Malek Gluecon Presentation

So how do you achieve the right level of fault tolerance in the cloud?

3 tenets

Page 14: BigDoor's Jeff Malek Gluecon Presentation

Tenet #1: Scripted Repeatability

Tenet #2: SPOF Elimination

Tenet #3: Clear-Cut Communication

Page 15: BigDoor's Jeff Malek Gluecon Presentation

who here has used AWS?

Page 16: BigDoor's Jeff Malek Gluecon Presentation

Tenet #1

prepare a fault-tolerant foundation with scripted repeatability

aka automation

Page 17: BigDoor's Jeff Malek Gluecon Presentation

from the start: script the non-interactive install of your tools and OS

custom AMI

Debian: great package management

based on Eric Hammond’s work: http://alestic.com/
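
A minimal sketch of what "non-interactive install" can look like on Debian, assuming the tooling is driven from Python. The function name and the package list are hypothetical and not taken from the talk; the only Debian-specific facts relied on are the `DEBIAN_FRONTEND=noninteractive` environment variable and the `--yes` / `--quiet` flags of `apt-get`. The sketch builds the command rather than running it, so it is safe to try anywhere.

```python
import os
import shlex


def noninteractive_install_command(packages):
    """Return (argv, env) for an unattended apt-get install of `packages`."""
    if not packages:
        raise ValueError("at least one package is required")
    env = dict(os.environ)
    # Tell debconf never to prompt; packages fall back to their defaults.
    env["DEBIAN_FRONTEND"] = "noninteractive"
    # De-duplicate and sort so repeated runs produce the same command line.
    argv = ["apt-get", "install", "--yes", "--quiet"] + sorted(set(packages))
    return argv, env


if __name__ == "__main__":
    # Hypothetical package list, for illustration only.
    argv, env = noninteractive_install_command(["nginx", "mysql-server"])
    # Print instead of executing; pass argv and env to subprocess.run to apply.
    print("DEBIAN_FRONTEND=" + env["DEBIAN_FRONTEND"], shlex.join(argv))
```

Keeping the package list in one place like this is one way to make an image build repeatable: the same script can be re-run when the AMI is rebuilt.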

Page 18: BigDoor's Jeff Malek Gluecon Presentation

which will allow you to script the setup/tear-down of your stack
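
One possible shape for scripted setup/tear-down, sketched under assumptions: `FakeCloud` is a hypothetical stand-in that only records calls (it is not the AWS API), and the role names are invented. The point illustrated is that bring-up order and reverse-order tear-down live in one place, and that tear-down runs even when something fails part-way.

```python
from contextlib import contextmanager


class FakeCloud:
    """Hypothetical stand-in for a cloud API; it only records calls."""

    def __init__(self):
        self.running = []
        self.log = []

    def launch(self, role):
        self.running.append(role)
        self.log.append("launch " + role)

    def terminate(self, role):
        self.running.remove(role)
        self.log.append("terminate " + role)


@contextmanager
def stack(cloud, roles):
    """Bring roles up in order; always tear them down in reverse order."""
    started = []
    try:
        for role in roles:
            cloud.launch(role)
            started.append(role)
        yield started
    finally:
        # Reverse order so dependents go away before what they depend on.
        for role in reversed(started):
            cloud.terminate(role)


if __name__ == "__main__":
    cloud = FakeCloud()
    with stack(cloud, ["database", "app-server", "load-balancer"]):
        print("running:", cloud.running)
    print("after teardown:", cloud.running)
```

Swapping `FakeCloud` for a real client would keep the same control flow; a throwaway stack for a test run is then just a `with` block.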

Page 19: BigDoor's Jeff Malek Gluecon Presentation

which will allow you to script system tests

integrity (3-4K tests)

performance (30-40K tests)

load, capacity (2-4M requests)

Page 20: BigDoor's Jeff Malek Gluecon Presentation

A/B system test results : MySQL Percona Upgrade
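
A rough sketch of how two system-test runs (for example before and after a database upgrade) might be compared. Everything here is an assumption for illustration: the function names, the choice of median and nearest-rank 95th percentile as summary statistics, and the sample latencies, which are synthetic placeholders and not results from the talk.

```python
import math
from statistics import median


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an int in 1..100."""
    ordered = sorted(samples)
    # Integer product first, so exact ranks are not pushed up by float error.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare_runs(baseline_ms, candidate_ms):
    """Summarise two latency samples and the relative change in median."""
    base_med = median(baseline_ms)
    cand_med = median(candidate_ms)
    return {
        "baseline_median": base_med,
        "candidate_median": cand_med,
        "baseline_p95": percentile(baseline_ms, 95),
        "candidate_p95": percentile(candidate_ms, 95),
        # Negative means the candidate is faster than the baseline.
        "median_change": (cand_med - base_med) / base_med,
    }


if __name__ == "__main__":
    # Synthetic placeholder latencies in milliseconds, not measured data.
    summary = compare_runs([10, 20, 30, 40], [5, 10, 15, 20])
    for key, value in summary.items():
        print(key, value)
```

Running the same scripted test suite against both stacks and feeding the timings into a comparison like this is one way to make an upgrade decision repeatable.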

Page 21: BigDoor's Jeff Malek Gluecon Presentation

That’s how 1 person set up and managed a network comprised of 90+/- server instances for 1.5 years, while serving various other roles, without having to leave their chair

try that with real hardware

Page 22: BigDoor's Jeff Malek Gluecon Presentation

Tenet #2

SPOF Elimination

We don’t need no stinkin single points of failure.

Page 23: BigDoor's Jeff Malek Gluecon Presentation

SPOF Examples:

Cloud Provider

Region

Zone

Load Balancer

App Server

Database

Fred

Page 24: BigDoor's Jeff Malek Gluecon Presentation

Cloud Provider fail-over?

e.g. AWS –> Rackspace

Page 25: BigDoor's Jeff Malek Gluecon Presentation

Region fail-over?

e.g. us-east -> us-west within AWS

Nah.

Page 26: BigDoor's Jeff Malek Gluecon Presentation

Zone fail-over?

Yes.

US-WEST: zones A, B, C, D

US-EAST: zones A, B, C, D

Page 27: BigDoor's Jeff Malek Gluecon Presentation

Zone fail-over best practices: are you using auto-scaling?

no: distribute server instances evenly between 2 or more zones

yes: trigger scaling on network I/O or custom metrics
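
For the "no auto-scaling" case, even distribution can be as simple as a round-robin assignment. This is a sketch under assumptions: `assign_zones` is a hypothetical helper, and the instance and zone names are illustrative. It only computes a placement plan; launching instances is left out.

```python
from collections import Counter
from itertools import cycle


def assign_zones(instance_ids, zones):
    """Round-robin instances across zones so counts differ by at most one."""
    if len(zones) < 2:
        raise ValueError("use 2 or more zones to survive a zone outage")
    # cycle() repeats the zone list for as long as there are instances.
    return dict(zip(instance_ids, cycle(zones)))


if __name__ == "__main__":
    # Illustrative names only.
    instances = ["app-%d" % n for n in range(1, 6)]
    plan = assign_zones(instances, ["us-east-1a", "us-east-1b"])
    for instance, zone in plan.items():
        print(instance, "->", zone)
    print(Counter(plan.values()))
```

With a plan like this, losing any single zone removes at most about half of the instances in a two-zone layout, which is the property the slide is after.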

Page 28: BigDoor's Jeff Malek Gluecon Presentation

Load-balancer (ELB), app server, database fail-over?

Yes.

Page 29: BigDoor's Jeff Malek Gluecon Presentation

So it’s actually all about reduction of the right SPOFs for your business context

Just adding the ability to fail-over and have backups within a region is huge!

Probably enough for most.

What about Fred?

Page 30: BigDoor's Jeff Malek Gluecon Presentation

Tenet #3

Clear-Cut Communication

transparency is soooo 2010

Page 31: BigDoor's Jeff Malek Gluecon Presentation

During an outage, communicating the right things at the right time:

hard.

But not that hard.

Page 32: BigDoor's Jeff Malek Gluecon Presentation

Three Tenets Revisited

Tenet #1: Scripted Repeatability

Tenet #2: SPOF Elimination

Tenet #3: Clear-Cut Communication

Page 33: BigDoor's Jeff Malek Gluecon Presentation

Notes