Cloud Native and Epidemic Failures April 2014 Adrian Cockcroft @adrianco @BatteryVentures http://www.linkedin.com/in/adriancockcroft

Epidemic Failures

Jan 15, 2015


Slides originally written in April 2013 for a private conference and internal use at Netflix. Publishing now since Heartbleed is another example of an epidemic failure mode.
Transcript
Page 1: Epidemic Failures

Cloud Native and Epidemic Failures

April 2014

Adrian Cockcroft

@adrianco @BatteryVentures

http://www.linkedin.com/in/adriancockcroft

Page 2: Epidemic Failures

Cloud Native?

Epidemic Failures

Automated Diversity

Page 3: Epidemic Failures

Cloud Native

Construct a highly agile and highly available service from ephemeral and often broken components

Page 4: Epidemic Failures

Inspiration

Page 5: Epidemic Failures

Numquam ponenda est pluralitas sine necessitate

Plurality must never be posited without necessity

Occam’s Razor

Page 6: Epidemic Failures

Monoculture

Replicate “the best” as patterns

Reduce interaction complexity

Epidemic single point of failure

Page 7: Epidemic Failures

Pattern Failures

Infrastructure Pattern Failures

Software Stack Pattern Failures

Application Pattern Failures

Page 8: Epidemic Failures

Infrastructure Pattern Failures

• Device failures – bad batch of disks, PSUs, etc.
• CPU failures – cache corruption, math errors
• Datacenter failures – power, network, disaster
• Routing failures – DNS, Internet/ISP path

Page 9: Epidemic Failures

Software Stack Pattern Failures

• Time bombs – Counter wrap, memory leak
• Date bombs – Leap year, leap second, epoch
• Expiration – Certs timing out
• Trust revocation – Certificate Authority fails
• Security exploit – e.g. Heartbleed
• Language bugs – compile time
• Runtime bugs – JVM, Linux, Hypervisor
• Network bugs – routers, firewalls, protocols

Page 10: Epidemic Failures

Application Pattern Failures

• Time bombs – Counter wrap, memory leak
• Date bombs – Leap year, leap second, epoch
• Content bombs – Data dependent failure
• Configuration – wrong/bad syntax
• Versioning – incompatible mixes
• Cascading failures – error handling bugs etc.
• Cascading overload – excessive logging etc.

Page 11: Epidemic Failures

What to do?

Automated diversity management

Diversified automation

Efficient vs. Antifragile

Page 12: Epidemic Failures

Specific Ideas

• Automate running a mixture
  – Diversity as default for any service stack
  – No developer overhead, stay agile, low cost

• Support oldest and newest versions together
  – Automate running 50/50 mix CentOS/Ubuntu
  – Mix versions of JDK, Tomcat, etc.

• Vendor diversity
  – Multiple DNS vendors, cloud regions, costs more
  – Multiple cloud vendors? Much higher cost.

Page 13: Epidemic Failures

Generate Permutations

> epi <- data.frame(java=gl(2,1,8,c("java6","java7")),
+                   linux=gl(2,2,8,c("centos","ubuntu")),
+                   codeversion=gl(2,4,8,c("v34","v35")))
> epi
   java  linux codeversion
1 java6 centos         v34
2 java7 centos         v34
3 java6 ubuntu         v34
4 java7 ubuntu         v34
5 java6 centos         v35
6 java7 centos         v35
7 java6 ubuntu         v35
8 java7 ubuntu         v35
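
The same full-factorial table can be produced outside R. The following is a minimal Python sketch rather than the R used on the slide; it assumes the same three two-level factors, and the names `FACTORS` and `permutations_table` are illustrative, not from the original deck.

```python
from itertools import product

# Factor levels copied from the slide; the dict name is illustrative.
FACTORS = {
    "java": ["java6", "java7"],
    "linux": ["centos", "ubuntu"],
    "codeversion": ["v34", "v35"],
}


def permutations_table(factors):
    """Return every combination of factor levels as a list of dicts.

    The first factor varies fastest, matching the row order of the R output.
    """
    names = list(factors)
    rows = []
    # product() varies its LAST argument fastest, so pass the factors in
    # reverse and then restore the original column order in each row.
    for combo in product(*(factors[n] for n in reversed(names))):
        levels = dict(zip(reversed(names), combo))
        rows.append({n: levels[n] for n in names})
    return rows


if __name__ == "__main__":
    for i, row in enumerate(permutations_table(FACTORS), start=1):
        print(i, row["java"], row["linux"], row["codeversion"])
```

Adding a fourth factor (for example a Tomcat version) is a one-line change to the dict, at the cost of doubling the number of permutations to build and deploy.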

Page 14: Epidemic Failures

Deployment

• Builds
  – Manual to test, automate if it works
  – Modify build to generate permutation AMIs
  – Modify Asgard to auto-deploy permutations

• Data collection
  – Tag each instance with its permutation
  – Gather metrics by permutation per instance
  – Do R-based Design of Experiments analysis
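
The data collection steps above can be sketched in a few lines. This is a Python illustration under stated assumptions, not the Netflix tooling: it does not call Asgard or any cloud API, the instance IDs and error rates are made up, and `tag_instances` and `metrics_by_permutation` are hypothetical helper names.

```python
from collections import defaultdict
from itertools import cycle, product
from statistics import mean

# Permutations as (java, linux, codeversion) tuples, using the slide's levels.
PERMUTATIONS = list(
    product(["java6", "java7"], ["centos", "ubuntu"], ["v34", "v35"])
)


def tag_instances(instance_ids, permutations):
    """Assign permutations round-robin so each gets an equal share of instances."""
    return {iid: perm for iid, perm in zip(instance_ids, cycle(permutations))}


def metrics_by_permutation(tags, samples):
    """Group (instance_id, value) samples by the instance's permutation tag
    and return the mean value per permutation."""
    grouped = defaultdict(list)
    for iid, value in samples:
        grouped[tags[iid]].append(value)
    return {perm: mean(values) for perm, values in grouped.items()}


if __name__ == "__main__":
    ids = [f"i-{n:04d}" for n in range(16)]  # made-up instance IDs
    tags = tag_instances(ids, PERMUTATIONS)
    # Made-up error rates: instance n reports n / 1000.
    samples = [(iid, n / 1000) for n, iid in enumerate(ids)]
    for perm, avg in metrics_by_permutation(tags, samples).items():
        print(perm, round(avg, 4))
```

Keeping the permutation as a tag on the instance means the metrics pipeline needs no knowledge of how the instance was built; it only groups by the tag.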

Page 15: Epidemic Failures

Analysis

• As a function of permutations
  – Error rate
  – Response time
  – CPU Utilization

• Interactions
  – E.g. interaction between linux and java
  – Contrasts identify components with issues
  – Small changes with high statistical significance
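
To make the idea of contrasts concrete, here is a small Python sketch of main-effect and interaction estimates for a two-level factorial design. It is an illustration only: the slides describe an R-based analysis, the response times below are invented, and the names `effect` and `example_runs` are hypothetical. It computes effect sizes and does not attempt the significance testing mentioned above.

```python
from itertools import product
from statistics import mean

# Column positions of each factor in a coded run, e.g. (-1, +1, -1).
JAVA, LINUX, CODE = 0, 1, 2


def effect(runs, *factors):
    """Estimate a main effect (one factor) or an interaction (several factors).

    `runs` maps a tuple of coded levels (-1 or +1 per factor) to a response.
    The estimate is the mean response where the product of the chosen
    factors' levels is +1, minus the mean where that product is -1.
    """
    plus, minus = [], []
    for levels, response in runs.items():
        sign = 1
        for f in factors:
            sign *= levels[f]
        (plus if sign > 0 else minus).append(response)
    return mean(plus) - mean(minus)


def example_runs():
    """Invented response times (ms): baseline 100, +10 for the newer code
    version, and +20 only when the newer Java runs on the second Linux."""
    runs = {}
    for levels in product((-1, 1), repeat=3):
        response = 100.0
        if levels[CODE] == 1:
            response += 10
        if levels[JAVA] == 1 and levels[LINUX] == 1:
            response += 20
        runs[levels] = response
    return runs


if __name__ == "__main__":
    runs = example_runs()
    print("code version effect:", effect(runs, CODE))
    print("java x linux interaction:", effect(runs, JAVA, LINUX))
    print("java x code interaction:", effect(runs, JAVA, CODE))
```

In this invented data the Java and Linux main effects both look nonzero, but the Java x Linux interaction shows the slowdown belongs to the combination rather than to either component alone, which is the kind of signal a monoculture deployment could not reveal.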

Page 16: Epidemic Failures

GCS Total API Outage for ~1hr

Page 17: Epidemic Failures

Takeaway

Watch out for monocultures

A|B Testing – it’s not just for personalization

http://perfcap.blogspot.com

http://slideshare.net/adrianco – Netflix

http://slideshare.net/adriancockcroft - Battery

http://www.linkedin.com/in/adriancockcroft

@adrianco @BatteryVentures