DPC 2016 - 53 Minutes or Less - Architecting For Failure

Post on 16-Apr-2017

Category:

Technology

53 Minutes or Less - Architecting For Failure In The Cloud

Ben Andersen-Waine

53 Minutes?

99.99%

Availability (%)   Downtime per year   Downtime per month   Downtime per week
90                 36.5 days           72 hours             16.8 hours
99                 3.65 days           7.2 hours            1.68 hours
99.9               8.76 hours          43.8 min             10.1 min
99.99              52.56 min           4.38 min             1.01 min

Adapted From: https://en.wikipedia.org/wiki/High_availability
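The downtime budgets follow directly from the availability percentage. A minimal sketch, assuming a 365-day year and a month of one twelfth of a year; the function name `allowed_downtime` is illustrative and not from the talk. With these assumptions 99.99% works out to about 52.56 minutes per year, the "53 minutes" of the title. The 72 hour and 7.2 hour monthly figures correspond to a 30-day month instead.

```python
from datetime import timedelta

# Period lengths. Assumption: 365-day year, month = 1/12 of a year.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=365) / 12,
    "week": timedelta(days=7),
}


def allowed_downtime(availability_percent: float, period: str) -> timedelta:
    """Return the downtime budget for one period at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return PERIODS[period] * unavailable_fraction


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99):
        minutes = allowed_downtime(pct, "year").total_seconds() / 60
        print(f"{pct}% -> {minutes:.2f} minutes of downtime per year")
```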

Architecting For Failure?

Who are you?

1) You have some kind of web application / service

2) You are using an IaaS cloud provider

3) The service needs to be “highly available”

SAMPLE

http://example.com/more/info/README

High Level Content

Deeper Reading

Infrastructure

• Regions & Availability Zones

• Auto Scaling

• Multi Region

Regions And Availability Zones

“Each region is a separate geographic area. Each region has multiple, isolated locations known as Availability Zones. Amazon EC2 provides you the ability to place resources, such as instances, and data in multiple locations. Resources aren't replicated across regions unless you do so specifically.”

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html

http://aws.amazon.com/about-aws/global-infrastructure/

Auto Scaling

“Auto Scaling helps you maintain application availability and allows you to scale your Amazon EC2 capacity up or down automatically according to conditions you define.”

https://aws.amazon.com/autoscaling/

Auto Scaling

• Instance metrics (useful for containers)

• Load balancer health check (useful for web apps on EC2)
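The two triggers can be combined into a single decision, sketched here as a toy. It is not how AWS Auto Scaling is configured or implemented: the `Instance` type, the `plan` function, the thresholds and the proportional sizing rule are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    cpu_percent: float         # instance metric
    failed_health_checks: int  # consecutive load balancer check failures


def plan(instances, target_cpu=60.0, unhealthy_threshold=3,
         min_size=2, max_size=10):
    """Return (ids to replace, desired capacity) for one evaluation cycle."""
    unhealthy = [i.instance_id for i in instances
                 if i.failed_health_checks >= unhealthy_threshold]
    healthy = [i for i in instances
               if i.failed_health_checks < unhealthy_threshold]
    if not healthy:
        # Nothing is serving traffic: fall back to the minimum group size.
        return unhealthy, min_size
    avg_cpu = sum(i.cpu_percent for i in healthy) / len(healthy)
    # Hypothetical rule: size the group so average CPU lands near the target.
    desired = math.ceil(len(healthy) * avg_cpu / target_cpu)
    return unhealthy, max(min_size, min(max_size, desired))
```

Instances that keep failing the health check are replaced regardless of load, while the metric only changes how many instances the group should hold.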

Multi Region

Devops

One day I had this fantasy of starting a certification service for operations. The certification assessment would consist of a colleague and I turning up at the corporate data center and setting about critical production servers with a baseball bat, a chainsaw, and a water pistol. The assessment would be based on how long it would take for the operations team to get all the applications up and running again.

http://martinfowler.com/bliki/PhoenixServer.html

Immutable Infrastructure

Devops

• Environment Creation

• Releasing

• Secret Management

• Service Discovery

Environment Creation

• Vendor tools (AWS CloudFormation / Google Cloud Deployment Manager)

• Third-party tools (Terraform, Ansible)

Immutable Infrastructure

http://martinfowler.com/bliki/SnowflakeServer.html

Configuration changes are regularly needed to tweak the environment so that it runs efficiently and communicates properly with other systems. This requires some mix of command-line invocations, jumping between GUI screens, and editing text files.

The result is a unique snowflake - good for a ski resort, bad for a data center.

Releases: Build An Artifact

• Build A VM (AWS AMI / GCE image)

• Use Containers

Releases: Building A VM

Releases: Building A Container

Releases: Canaries

http://martinfowler.com/bliki/CanaryRelease.html
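A canary release sends a small share of traffic to the new version before rolling it out to everyone. One possible way to pick that share, sketched under the assumption that requests carry a stable user identifier; the `route` function and the hash-bucket scheme are illustrative and not taken from the talk.

```python
import hashlib


def route(user_id: str, canary_percent: int) -> str:
    """Send a stable share of users to the canary, the rest to stable."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the user to a bucket in 0..99; the same user always gets the
    # same bucket, so their experience does not flip between versions.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"
```

Raising `canary_percent` step by step widens the rollout, and setting it back to 0 withdraws the canary.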

Releases: Blue / Green Deploy

https://cloudnative.io/blog/2015/02/the-dos-and-donts-of-bluegreen-deployment/
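In a blue/green deploy the new release is brought up in the idle environment and traffic is switched only once it checks out. A minimal sketch, assuming a `Router` object that stands in for a load balancer or DNS record and a caller-supplied `smoke_test`; both names are hypothetical.

```python
from typing import Callable


class Router:
    """Tracks which of two environments currently receives traffic."""

    def __init__(self, live: str, idle: str):
        self.live = live
        self.idle = idle

    def deploy(self, smoke_test: Callable[[str], bool]) -> bool:
        """Switch traffic to the idle environment if it passes the test."""
        if not smoke_test(self.idle):
            return False  # the live environment is left untouched
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous environment."""
        self.live, self.idle = self.idle, self.live
```

Because the old environment keeps running after the switch, rolling back is the same swap in reverse.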

Service Discovery

https://www.nginx.com/blog/service-discovery-in-a-microservices-architecture/

Service Discovery

• https://github.com/coreos/etcd

• https://www.consul.io/

• https://zookeeper.apache.org/
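The core idea behind these tools can be sketched as a registry in which instances announce themselves with heartbeats and clients look up whoever is still alive. This is an in-memory toy, not the API of etcd, Consul or ZooKeeper; the `Registry` class, its method names and the TTL value are assumptions.

```python
import time


class Registry:
    """Toy service registry: entries expire unless they keep heartbeating."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        """Register an instance or refresh its registration."""
        self._entries[(service, address)] = self._clock()

    def lookup(self, service):
        """Return addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(
            address
            for (name, address), seen in self._entries.items()
            if name == service and now - seen <= self._ttl
        )
```

An instance that dies simply stops sending heartbeats and drops out of `lookup` once its TTL passes, so clients stop routing to it without any manual step.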

Secrets

• Use a secret keeper or vault

• Use environment variables
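For the environment variable route, a minimal sketch of reading a secret at startup and failing early when it is missing. The helper `require_secret` and the variable name `DB_PASSWORD` are hypothetical.

```python
import os


class MissingSecret(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def require_secret(name, environ=os.environ):
    """Return the secret stored in the named environment variable."""
    value = environ.get(name)
    if not value:
        # Report only the variable name, never a secret value.
        raise MissingSecret(f"environment variable {name} is not set")
    return value
```

Calling `require_secret("DB_PASSWORD")` during startup keeps the value out of the code base and the built artifact, and a misconfigured environment fails immediately rather than on the first request.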

Secrets

• https://www.vaultproject.io/

• https://square.github.io/keywhiz/

Software Development

General Best Practice

• Write tests (preferably first)

• Continuously integrate

• Write Documentation

Problem: Services Go Away

Circuit Breaking

http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

Circuit Breaking

Available solutions:

• https://github.com/Netflix/Hystrix

• https://github.com/ejsmont-artur/php-circuit-breaker
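To show the mechanism these libraries provide, here is a minimal circuit breaker sketch. It is not the Hystrix or php-circuit-breaker API; the class name, thresholds and injected `clock` are assumptions, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a dead service.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as the cue to serve a fallback.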

Problem: Spiky Workloads

Queue Based Load Levelling

https://msdn.microsoft.com/en-gb/library/dn589783.aspx
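The effect of putting a queue between producers and a service can be seen in a small simulation. This is a sketch only: `level_load`, the tick model and the fixed per-tick capacity are assumptions chosen for illustration.

```python
from collections import deque


def level_load(arrivals_per_tick, capacity_per_tick):
    """Return how many messages the service handles on each tick."""
    if capacity_per_tick <= 0:
        raise ValueError("capacity_per_tick must be positive")
    queue = deque()
    processed = []
    tick = 0
    while tick < len(arrivals_per_tick) or queue:
        if tick < len(arrivals_per_tick):
            # Producers enqueue as fast as they like.
            queue.extend(range(arrivals_per_tick[tick]))
        # The service drains at its own steady pace.
        done = min(capacity_per_tick, len(queue))
        for _ in range(done):
            queue.popleft()
        processed.append(done)
        tick += 1
    return processed
```

A burst of ten messages against a capacity of three, `level_load([10, 0, 0, 0], 3)`, is spread over four ticks, so the service never sees more than three messages at once. The price is latency for the messages at the back of the queue.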

Priority Queue

https://msdn.microsoft.com/en-gb/library/dn589794.aspx
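A priority queue lets urgent work overtake a backlog of routine work. A minimal in-process sketch built on the standard library `heapq` module; the `PriorityQueue` class and the convention that a lower number means higher priority are assumptions. A hosted message queue would replace this in a real deployment.

```python
import heapq
import itertools


class PriorityQueue:
    """Messages with a lower priority number are delivered first."""

    def __init__(self):
        self._heap = []
        # Tie-breaker keeps first-in, first-out order within a priority.
        self._counter = itertools.count()

    def put(self, message, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), message))

    def get(self):
        """Remove and return the most urgent message."""
        _priority, _order, message = heapq.heappop(self._heap)
        return message

    def __len__(self):
        return len(self._heap)
```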

Competing Consumers

https://msdn.microsoft.com/en-gb/library/dn568101.aspx
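With competing consumers, several workers pull from the same queue, so throughput scales with the worker count and one failed worker does not stop processing. A sketch using threads and the standard library `queue` module; the function `run_consumers` and the doubling "work" are placeholders. In the cloud the workers would be separate instances reading from a hosted queue.

```python
import queue
import threading


def run_consumers(messages, worker_count=3):
    """Process messages with several workers sharing one queue."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def consume(name):
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                break
            with lock:
                results.append((name, item * 2))  # placeholder work

    workers = [
        threading.Thread(target=consume, args=(f"worker-{n}",))
        for n in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    for message in messages:
        work.put(message)
    for _ in workers:
        work.put(None)  # one sentinel per worker
    for worker in workers:
        worker.join()
    return results
```

Each message is handled by exactly one worker, but which worker handles it and in what order is not deterministic.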

Monitoring / SLAs

SLA - Service Level Agreement

http://www.nkarten.com/handbook.pdf

Monitoring

Obligatory Meme

The Simian Army

http://techblog.netflix.com/2011/07/netflix-simian-army.html

https://github.com/Netflix/SimianArmy/
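Chaos Monkey, the best known member of the Simian Army, randomly terminates running instances to check that the service survives the loss. The idea can be sketched in a few lines. This is not the SimianArmy code or API: `pick_victim`, `run_drill` and the `terminate` callback are hypothetical, and a real drill would call the cloud provider's API.

```python
import random


def pick_victim(instance_ids, rng=None):
    """Choose one instance to terminate, or None if there are none."""
    rng = rng if rng is not None else random.Random()
    candidates = list(instance_ids)
    return rng.choice(candidates) if candidates else None


def run_drill(instance_ids, terminate, rng=None):
    """Terminate one randomly chosen instance and report which one."""
    victim = pick_victim(instance_ids, rng)
    if victim is not None:
        terminate(victim)
    return victim
```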

Final Thoughts

Questions

Feedback

https://joind.in/talk/41c42
