On Failure and Resilience

Post on 08-May-2015

Mike Brittain
Director of Engineering, Etsy

@mikebrittain

Presented at 37signals on Aug 21, 2012

“Software Infrastructure”
“Framework” code, caching, ORM, file storage tier, developer tools, CI/deployment, site performance, front-end architecture.

Managing failures and building resilience into systems, applications, process, and people.

Photo: http://www.etsy.com/shop/TheOldTimeJunkShop

$61 M in goods sold in the marketplace
2.9 M items sold
1.2 B page views

http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/

Architecture
Linux, Apache, MySQL, PHP, Postgres, Solr, Gearman, Memcache, Chef, Hadoop, EC2/S3/EMR

30+ Logical data stores (23 shards + more functionally partitioned)

Search and storage tiers as “services”

150 Engineers + Designers + Product
(this was 20 in Feb 2010)

credit: martin_heigan (flickr)

Buyers, sellers, support, developer api, i18n, core infrastructure, storage, payments, security, fraud detection, big data and BI, email delivery, corp IT, operations, developer tools, continuous integration and testing, site performance, search, advertising, seller economics, mobile web, iOS.

Zero Release Managers

There Will Be Fail

Credit: wilkee.deviantart.com

We cannot comprehend all of the ways in which the individual parts of a complex system will interact. We cannot know all of the states and scenarios.

We cannot prevent failures.

Yet, we can mitigate them.

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.

“Uptime” is not binary.

Functionally Partitioned

[Diagram: separate databases for Convos, AsyncTasks, Ads, and Auth]

Master-Master Replication

[Diagram: Ads, Auth, AsyncTasks, and Convos each run as a pair of masters, with numbered callouts 1-5]

Sharded Tables

[Diagram: shard1 through shard4, each running as a pair of servers, with numbered callouts 1-5]

~4% of listing data is stored on any one shard

Outage is limited to ~4% of data set
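The point of the sharding slides, that losing one shard only takes out the slice of data living on it, can be sketched in a few lines. This is an illustration only: the modulo mapping and the function names `shard_for` and `affected_fraction` are assumptions made for the example, not the lookup scheme used in the talk.

```php
<?php
// Hypothetical mapping: assign an owner id to one of $shardCount shards
// by simple modulo. Shards are numbered from 1, as in the diagram.
function shard_for($ownerId, $shardCount) {
    return ($ownerId % $shardCount) + 1;
}

// Fraction of the given ids that become unreachable when the shards
// listed in $downShards are offline.
function affected_fraction(array $ownerIds, array $downShards, $shardCount) {
    if (count($ownerIds) === 0) {
        return 0.0;
    }
    $affected = 0;
    foreach ($ownerIds as $id) {
        if (in_array(shard_for($id, $shardCount), $downShards, true)) {
            $affected++;
        }
    }
    return $affected / count($ownerIds);
}

// With 4 shards and one of them down, a quarter of these ids are affected.
$fraction = affected_fraction(range(1, 100), array(1), 4);
```

The more shards the data is spread across, the smaller the slice any single failure can touch.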

“Uptime” is not binary.

Uptime of the application is the responsibility of our Operations team.

Uptime of the application is the responsibility of our Operations, Engineering, Product, and Design teams.


If you are committing code, you are operating the site.

Branching in Code

“All existing revision control systems were built by people who build installed software”

Always Ship Trunk
Paul Hammond
Velocity Conf 2010

Config Flags

Enable and disable features quickly.
Features for staff or for beta groups.
Percentage ramp-up of users or requests.
A/B “experiments.”

$cfg['new_search'] = array('enabled' => 'on');
$cfg['sign_in'] = array('enabled' => 'on');
$cfg['checkout'] = array('enabled' => 'on');
$cfg['homepage'] = array('enabled' => 'on');

$cfg['new_search'] = array('enabled' => 'on');

// Meanwhile...

// Check the 'enabled' value; a non-empty array is always truthy.
if ($cfg['new_search']['enabled'] === 'on') {
    # New hotness
    $results = do_solr();
} else {
    # old and boring
    $results = do_grep();
}
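The slides list staff-only features and percentage ramp-ups but show only the on/off case. A minimal sketch of how one flag check might cover all three follows. The `'rampup'` value, the `'percent'` and `'staff'` keys, and the `feature_enabled` helper are hypothetical extensions of the `'enabled'` key shown above, not the actual configuration format.

```php
<?php
// Hypothetical flag definitions, extending the 'enabled' key from the slides.
$cfg = array();
$cfg['checkout']   = array('enabled' => 'on');
$cfg['new_search'] = array('enabled' => 'rampup', 'percent' => 10, 'staff' => true);

// Returns true when the named feature should be shown to this user.
function feature_enabled(array $cfg, $name, $userId, $isStaff = false) {
    if (!isset($cfg[$name]['enabled'])) {
        return false; // unknown flags default to off
    }
    $flag = $cfg[$name];
    if ($flag['enabled'] === 'on') {
        return true;
    }
    if ($flag['enabled'] === 'rampup') {
        if ($isStaff && !empty($flag['staff'])) {
            return true; // staff see the feature before everyone else
        }
        // Hash the feature name and user id into a stable bucket 0-99,
        // so a user stays in or out as the percentage grows.
        $hash = crc32($name . ':' . $userId);
        $bucket = (($hash % 100) + 100) % 100;
        $percent = isset($flag['percent']) ? $flag['percent'] : 0;
        return $bucket < $percent;
    }
    return false;
}

$showNewSearch = feature_enabled($cfg, 'new_search', 12345);
```

Hashing the feature name together with the user id keeps the same users from always landing in every early ramp-up group.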

But...

“Doesn’t that mean you have conditionals all over your code?”

Yes.

“Does anyone ever clean those up?”

Sometimes.

“That sounds like it sucks.”

Really?

“Wait a minute... all of the counter arguments are in Comic Sans. WTF?!?”

Oh, you noticed? ;)

00:00 Site down for maintenance
+01:47 Site up, disabled login and registration
+06:40 Site up, some seller tools disabled
+07:41 All features restored

DB Server Maintenance, June 16, 2012
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/

“Uptime” is not binary.

Features are launched by flipping a config flag, not by deploying hundreds of lines of code.

“If Engineering at Etsy has a religion, it’s the Church of Graphs.”

Ian Malpass, Code as Craft
http://etsy.me/ePkoZB
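The slides do not show how a metric gets from application code to a graph. As one illustration, assuming a StatsD-style collector listening on UDP, a counter can be emitted in a few lines. The bucket name, host, and port below are placeholders, and the helper names `statsd_counter_line` and `statsd_increment` are invented for this sketch.

```php
<?php
// Build one counter line in the StatsD text format "<bucket>:<value>|c",
// with an optional "|@<rate>" suffix when the counter is sampled.
function statsd_counter_line($bucket, $delta = 1, $sampleRate = 1.0) {
    $line = sprintf('%s:%d|c', $bucket, $delta);
    if ($sampleRate < 1.0) {
        $line .= '|@' . $sampleRate;
    }
    return $line;
}

// Fire-and-forget over UDP so a dead metrics host never blocks a page.
// Host and port are assumptions, not values from the talk.
function statsd_increment($bucket, $host = '127.0.0.1', $port = 8125) {
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr);
    if ($socket === false) {
        return false; // metrics are best-effort; never fail the request
    }
    fwrite($socket, statsd_counter_line($bucket));
    fclose($socket);
    return true;
}
```

Calling `statsd_increment('logins.failed')` at the point where a login fails would be enough to start graphing that event.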

THIS IS HOW YOU RUN A COMPLEX SYSTEM

Photo: http://www.flickr.com/photos/flyforfun/2694158656/

Operator

Config flags

Metrics

Oh, you want to talk about how we collect metrics and make graphs?

http://www.slideshare.net/mikebrittain/metricsdriven-engineering

Resilient User Interfaces

Interfaces and user experiences that adapt to technical and architectural failure.

http://www.flickr.com/photos/caffeina/2144044776/

http://www.flickr.com/photos/17793901@N00/106331831/

/**
 * Creates a database connection.
 */
public function __construct($host, $user, $pass, $db) {
    parent::__construct($host, $user, $pass, $db);

    if (mysqli_connect_error()) {
        throw new DBConnection_Exception(
            sprintf("Error: %s, %s", mysqli_connect_errno(), mysqli_connect_error()));
    }
}

try {
    $conn = new DBConnection('viewsdb.host', 'db_read_user', 'ssssshh!', 'views_db');
} catch (DBConnection_Exception $e) {
    // TODO: Someone should figure out what to do if
    // we can't connect to the views db.
    throw $e;
}
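One way to resolve that TODO, sketched under assumptions: treat the views database as non-critical, catch the connection failure, and render the rest of the page with a small inline notice instead of a site-wide error. The `render_listing_stats` function and its `$fetchViews` callable are hypothetical stand-ins for the real query code.

```php
<?php
class DBConnection_Exception extends Exception {}

// $fetchViews is any callable returning a view count. It stands in
// (as an assumption) for the real query against the views db.
function render_listing_stats($fetchViews) {
    try {
        $views = call_user_func($fetchViews);
        return sprintf('%d views', $views);
    } catch (DBConnection_Exception $e) {
        // Log for operators, degrade gracefully for visitors.
        error_log('views db unavailable: ' . $e->getMessage());
        return 'View counts are temporarily unavailable.';
    }
}

// Normal case: the count renders.
$ok = render_listing_stats(function () {
    return 42;
});

// Failure case: the page still renders, minus one widget.
$degraded = render_listing_stats(function () {
    throw new DBConnection_Exception('connection refused');
});
```

Only the widget that depends on the failed service changes; navigation, listings, and checkout are untouched.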

Logo
Site navigation

Cute Picture

Generic, catch-all error messaging....

Every back-end service is an opportunity for failure.

[Diagram: numbered callouts 1-14]

Critical Path

#srsly?

" #$$ ms

Non-blocking Ajax

Google Docs

Google Calendar

GMail

“Oops, we aren’t able to access click metrics right

now, do not worry — your data is safe.”

Product design doesn’t stop at 100% availability.

[Diagram: Ops and Dev]

[Diagram: Ops, Dev, and Product]

Operability Reviews

What is changing about the architecture?
What kind of data access patterns are we using?
How much traffic, how many queries?
What metrics are we collecting?
Are there automated alerts? How do we know the thresholds are right?
How do we turn it off? ...and what happens when we do?

“What could possibly go wrong?”


“GameDay” Exercises

Tuesday, April 24, 12

Pedro

Surprise!!!
Turning off multi-language support improves our page generation times by up to 25%.

Homepage (95th perc.)
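The graph label refers to the 95th percentile of homepage generation time. For readers who have not computed one, here is a minimal sketch using the nearest-rank method. The talk does not say which percentile method its graphs use, so treat that choice, the function name, and the sample timings as assumptions.

```php
<?php
// Nearest-rank percentile: the smallest sample such that at least
// $p percent of all samples are less than or equal to it.
function percentile(array $samples, $p) {
    $n = count($samples);
    if ($n === 0) {
        throw new InvalidArgumentException('no samples');
    }
    sort($samples);
    $rank = (int) ceil($p * $n / 100);
    return $samples[max($rank, 1) - 1];
}

// Made-up page generation times in milliseconds.
$timingsMs = array(120, 95, 110, 480, 105, 98, 130, 101, 99, 115);
$p95 = percentile($timingsMs, 95);
```

Tracking a high percentile rather than the mean keeps the slowest page loads visible, which is where a change like disabling a feature tends to show up.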

(Blameless) Post-Mortems

How could this have gone better?

How quickly did we find out that something was wrong?
Did we communicate well to our visitors and each other?
Why did we have confidence that what we were doing was OK?
Did we have the right tools, did we use them properly?
Did we collect metrics, and could we find them?
Where did we make the wrong decisions?

What steps do we take to reduce the chance of this happening again in the future?

“... an engineer who thinks they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure.

This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future.”

http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/

John Allspaw
VP, Technical Operations, Etsy

We should try to learn not only what went wrong, but also what went right.

00:00 Site down for maintenance
+01:47 Site up, disabled login and registration
+06:40 Site up, some seller tools disabled
+07:41 All features restored

DB Server Maintenance, June 16, 2012
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/

Operational Mindset

Ops, Dev, Product

Business Priorities

Operational Mindset

Ops, Dev, Product

Introspection

Page views for error template
...or, how are we screwing our users?

Risk mitigation in a complex system

Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.

Thank you.

Mike Brittain

mike@etsy.com
@mikebrittain

PHOTO CREDITS

Flickr: roboppy
http://www.flickr.com/photos/51035735481@N01/163374138/

Flickr: jamesjyu
http://www.flickr.com/photos/32593095@N00/3465022/

Flickr: circulating
http://www.flickr.com/photos/26835318@N00/2318226026/
