©Continuent 2012. Surviving An Amazon Outage Neil Armitage, Cluster implementation Engineer, Continuent Wednesday, 24 April 13
Transcript
Page 1: Surviving an Amazon Outage

©Continuent 2012.

Surviving An Amazon Outage

Neil Armitage, Cluster implementation Engineer, Continuent

Wednesday, 24 April 13

Page 2: Surviving an Amazon Outage


Overview

• Continuent’s external/internal infrastructure is built in AWS

• Review carried out in the Summer of 2012 after several AWS Outages

• Treated the review as a Customer engagement

• Further review in Autumn of 2012 leading to the Multi-Cloud deployment

Page 3: Surviving an Amazon Outage

What is AWS

Amazon Web Services is a collection of remote computing services (also called web services) that together make up a cloud computing platform.

The central services are EC2 (Compute) and S3 (Storage) Services.

Page 4: Surviving an Amazon Outage

AWS Regions

• Ireland (3 AZ)

• Sao Paulo (2 AZ)

• Northern Virginia (5 AZ)

• Oregon (3 AZ)

• California (3 AZ)

• Singapore (2 AZ)

• Tokyo (3 AZ)

• Sydney (2 AZ)

Page 5: Surviving an Amazon Outage

AWS Availability Zones

[Diagram: two Regions, each containing several Availability Zones; one Region is drawn with three Availability Zones, the other with two]

Page 6: Surviving an Amazon Outage

AWS Services

• Compute - EC2

• Network - Route 53 and Virtual Private Cloud (VPC)

• Content Delivery - CloudFront

• Storage - S3, Glacier, EBS

• Database - DynamoDB, RDS, Redshift, SimpleDB

• Deployment - CloudFormation, Elastic Beanstalk, OpsWorks

Page 7: Surviving an Amazon Outage

AWS Size*

• Between 100K and 500K physical servers

• 1.5 million public IP addresses

• S3 holds more than 2 trillion objects and serves 1.1 million requests per second

• 1/3 of daily internet users access a site running on AWS

• 1% of internet traffic goes through Amazon infrastructure

* Estimates based on various internet sources

Page 8: Surviving an Amazon Outage

Continuent Systems

• External facing website

• Jira/Confluence internal systems

• Subversion

• Jenkins build system

Page 9: Surviving an Amazon Outage

External Website

[Diagram: Internet, Elastic IP, Web Server and DB Server, all inside a single Availability Zone of one Region]

Page 10: Surviving an Amazon Outage

Jira/Confluence/Subversion

[Diagram: Internet, Elastic IP and an App Server with Jira, Confluence, SVN Server and MySQL, all inside a single Availability Zone of one Region]

Page 11: Surviving an Amazon Outage

AWS Problems Summer 2012

“Amazon Cloud Hit by Real Clouds, Downing Netflix, Instagram, Other Sites”

Severe storms caused power outages at AWS US-East data centers; generators failed, taking out 7% of EC2 instances.

http://www.pcworld.com/article/258627/amazon_cloud_hit_by_real_clouds_knocking_out_popular_sites_like_netflix_instagram.html

Page 12: Surviving an Amazon Outage

Migration Plan

• Move to a clustered Continuent Tungsten environment

• Ensure all components are replicated into at least one other AWS Region

• Limited downtime on Customer facing systems

• Minimal downtime on internal systems
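
The second goal of the plan (every component present in at least one other Region) can be checked mechanically. The following is only a minimal sketch: the inventory, the component names and the function name are hypothetical and not taken from Continuent's actual setup.

```python
from collections import defaultdict

# Hypothetical inventory of (component, region) pairs; names are illustrative only.
DEPLOYMENTS = [
    ("website-db", "us-east-1"),
    ("website-db", "us-west-1"),
    ("website-web", "us-east-1"),
    ("website-web", "us-west-1"),
    ("jira", "us-east-1"),
]


def single_region_components(deployments):
    """Return the sorted names of components that run in fewer than two regions."""
    regions = defaultdict(set)
    for component, region in deployments:
        regions[component].add(region)
    return sorted(name for name, seen in regions.items() if len(seen) < 2)


if __name__ == "__main__":
    # Lists every component that still lives in only one region.
    print(single_region_components(DEPLOYMENTS))
```

Running a check like this against a real inventory before and after each migration step would show which systems still depend on a single Region.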

Page 13: Surviving an Amazon Outage

[Diagram: Tungsten data service "nyc": two application hosts (App Logic plus Tungsten Connector) in front of three database nodes (one Master, two Slaves), each node with its own Replicator and Manager]

Page 14: Surviving an Amazon Outage

[Diagram: Tungsten data service "nyc": two application hosts (App Logic plus Tungsten Connector) in front of three database nodes (one Master, two Slaves), each node with its own Replicator and Manager; "Monitoring and control" is labelled twice]

Page 17: Surviving an Amazon Outage

Website Database Tier - Round 1

[Diagram: database tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), with Connectors and S3 Backups]

Page 18: Surviving an Amazon Outage

DB Failures - Failure in US-EAST-1C

[Diagram: database tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), with Connectors and S3 Backups, shown with Availability Zone US-EAST-1C failed]

Page 19: Surviving an Amazon Outage

DB Failures - Failure in US-EAST

[Diagram: database tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), with Connectors and S3 Backups, shown with the whole US-EAST Region failed]

Page 20: Surviving an Amazon Outage

DEMO

Page 21: Surviving an Amazon Outage

Website Web Tier - Round 1

[Diagram: web tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), reached from the Internet through an Elastic IP (EIP), with S3 Backups]

Page 22: Surviving an Amazon Outage

Web Failures - Failure in US-EAST-1C

[Diagram: web tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), reached from the Internet through an Elastic IP (EIP), with S3 Backups, shown with Availability Zone US-EAST-1C failed]

Page 23: Surviving an Amazon Outage

Web Failures - Failure in US-EAST

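[Diagram: web tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), reached from the Internet through an Elastic IP (EIP), with S3 Backups, shown with the whole US-EAST Region failed and a DNS Update step added]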
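
The slide shows a DNS Update as the response to losing the whole US-EAST Region. A minimal sketch of the decision behind such an update follows; the function name, the health-check callable and the addresses (taken from the documentation-only ranges of RFC 5737) are all assumptions for illustration, and the sketch does not touch any real DNS provider.

```python
def pick_dns_target(endpoints, is_healthy):
    """Return the address of the first healthy endpoint, in priority order.

    endpoints: list of (region, address) tuples, highest priority first.
    is_healthy: callable taking an address and returning True or False.
    Returns None when nothing is healthy, so the caller can leave DNS alone.
    """
    for _region, address in endpoints:
        if is_healthy(address):
            return address
    return None


# Hypothetical addresses from the documentation-only ranges (RFC 5737).
ENDPOINTS = [
    ("us-east-1", "192.0.2.10"),
    ("us-west-1", "198.51.100.20"),
]

if __name__ == "__main__":
    # Simulate the US-EAST web tier being unreachable.
    def east_is_down(address):
        return address != "192.0.2.10"

    print(pick_dns_target(ENDPOINTS, east_is_down))
```

Keeping the choice of target separate from the code that actually changes the record makes the failover rule easy to test without a live outage.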

Page 24: Surviving an Amazon Outage

Jira/Confluence/SVN - Round 1

[Diagram: Jira/Confluence/SVN in US-EAST-1 (Availability Zone 1C) and US-WEST-1 (Availability Zone 1C), reached from the Internet through an Elastic IP (EIP), with S3 Backups]

Page 25: Surviving an Amazon Outage

AWS Failures - Autumn 2012

“Amazon Web Services outage takes out popular websites again”

•EBS degraded performance

•Problems allocating new volumes

http://www.pcworld.com/article/2012852/amazon-web-services-outage-takes-out-popular-websites-again.html

Page 26: Surviving an Amazon Outage

Website Database Tier - Round 2

[Diagram: database tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), with S3 Backups, now extended to RackSpace]

Page 27: Surviving an Amazon Outage

Website Web Tier - Round 2

[Diagram: web tier across US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C), reached from the Internet through an Elastic IP (EIP), with S3 Backups, now extended to RackSpace]

Page 28: Surviving an Amazon Outage

Jira/Confluence/SVN - Round 2

[Diagram: Jira/Confluence/SVN in US-EAST-1 (Availability Zone 1C) and US-WEST-1 (Availability Zone 1C), reached from the Internet through an Elastic IP (EIP), with S3 Backups, now extended to RackSpace]

Page 29: Surviving an Amazon Outage

Best Practices

• RAID EBS Volumes (RAID1)

• Backups

    • xtrabackup (backed up into S3)

    • EBS Snapshot

ec2-consistent-snapshot \
  --mysql \
  --freeze-filesystem /vol \
  --region eu-west-1 \
  --description "$(hostname) RAID snapshot $(date +'%Y-%m-%d %H:%M:%S')" \
  vol-1f9a6446 vol-649a643d
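
For scheduled snapshots it can help to assemble the same command from a script. The sketch below only builds the argument list, using the flags that appear on the slide; the function name and its parameters are hypothetical, and nothing is executed.

```python
import shlex
import socket
from datetime import datetime


def snapshot_command(region, mount_point, volume_ids, now=None, host=None):
    """Build the ec2-consistent-snapshot argument list; does not run it."""
    now = now or datetime.now()
    host = host or socket.gethostname()
    description = "{} RAID snapshot {}".format(host, now.strftime("%Y-%m-%d %H:%M:%S"))
    return [
        "ec2-consistent-snapshot",
        "--mysql",
        "--freeze-filesystem", mount_point,
        "--region", region,
        "--description", description,
        *volume_ids,
    ]


if __name__ == "__main__":
    argv = snapshot_command("eu-west-1", "/vol", ["vol-1f9a6446", "vol-649a643d"])
    # shlex.join quotes the description so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) would execute it; building a list rather than a string avoids shell quoting problems in the description.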

Page 30: Surviving an Amazon Outage

Best Practices

• Monitoring

    • Nagios scripts converted to email alerts

    • New Relic
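
One way to turn a Nagios-style check into an email alert is sketched below. It relies on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function name, host, check name and addresses are hypothetical, and this is not the script Continuent used.

```python
from email.message import EmailMessage

# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_alert(host, check_name, exit_code, output, sender, recipient):
    """Return an EmailMessage for a non-OK check result, or None when OK."""
    state = STATES.get(exit_code, "UNKNOWN")
    if state == "OK":
        return None
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "[{}] {} on {}".format(state, check_name, host)
    message.set_content(output)
    return message


if __name__ == "__main__":
    alert = build_alert(
        "db1", "check_mysql", 2, "CRITICAL - replication stopped",
        "alerts@example.com", "ops@example.com",
    )
    print(alert["Subject"])
```

The resulting message can be handed to smtplib.SMTP(...).send_message(alert); keeping message construction separate from delivery lets the alert logic be tested without a mail server.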

Page 31: Surviving an Amazon Outage

Lessons Learnt

• EC2 Instances fail

• One of anything is never enough

• Don’t assume you can spin up more resources instantly

• Think multi-cloud, public/private

• Resources are disposable - throw away and rebuild if needed

Page 32: Surviving an Amazon Outage

Further Plans

• Realtime replication of web assets (glusterFS?)

• Introduce an Elastic Load Balancer in front of the US-EAST web servers to allow automatic web failover

• Migrate into a VPC

• Investigate Route 53 for DNS Failover

Page 33: Surviving an Amazon Outage

We are Recruiting

Come to our booth for more information

Page 34: Surviving an Amazon Outage

Continuent Website: http://www.continuent.com

Tungsten Replicator 2.0: http://code.google.com/p/tungsten-replicator

Our Blogs:
http://scale-out-blog.blogspot.com
http://datacharmer.blogspot.com
http://flyingclusters.blogspot.com

560 S. Winchester Blvd., Suite 500, San Jose, CA 95128
Tel +1 (866) 998-3642
Fax +1 (408) 668-1009
e-mail: [email protected]
