© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Andrew Shieh, SmugMug Operations shandrew @ smugmug.com November 15, 2013 SmugMug’s Zero Downtime Migration to AWS ARC312 Friday, November 15, 13
SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Jan 12, 2015

SmugMug spent six years split between its datacenters and AWS. Find out how and why SmugMug went 100% AWS, migrating 30 TB of databases, hundreds of frontends, load balancing, and caches, across the US in one night with zero downtime. We show you specific techniques and processes that made our large-scale migration a resounding success: moving massive MySQL databases, testing and sizing a new AWS infrastructure, automating AWS operations, managing the risks involved in wholesale infrastructure change, and architecting for reliability in multiple AWS Availability Zones. We talk about the performance, scalability, operational, and business benefits and challenges we've seen since moving 100% to AWS. Finally, we share secrets about our favorite AWS products.
Transcript
Page 1: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Andrew Shieh, SmugMug Operations
shandrew @ smugmug.com
November 15, 2013

SmugMug’s Zero Downtime Migration to AWS
ARC312

Friday, November 15, 13

Page 2: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

SmugMug—Who are we?

Page 3: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

The early days of SmugMug
• Gradual bootstrapped growth
• Multiple self-managed datacenter cages
• Too many servers of varying types
• Too many disks
• Tons of valuable skilled employee hours spent in cages

Page 4: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Data Center Fantasy

Page 5: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Data Center Reality

Page 6: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Data Center Reality

Page 7: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

SmugMug <3 AWS
• Early adopter of Amazon S3
• Over the years, moved rendering, upload, archiving, payments, permissions, email, and more compute to AWS
• Before mid-2012, no ultra-high performance I/O

Page 8: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

SmugMug Architecture ~2006

AWS: S3
SV: Web, DB, Image*

Page 9: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

SmugMug Architecture ~2011

AWS: S3, Image (upload, processing, render, video, …)
SV: Web, DB

Page 10: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

SmugMug Architecture - Transition

AWS: S3, Image*, Web
SV: Web, DB
DC: Replication DB, Direct Connect

Page 11: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

SmugMug Architecture Today

AWS: S3, Image*, Web, DBØ

Page 12: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

How did we get there?

Page 13: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Our database I/O evolution: Always cutting edge
• Started with MySQL on spinning disk RAID, max RAM
• Moved to ZFS SSD + SSD cache + spinning disks
• Moved to custom 24-SSD arrays

Page 14: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

hi1.4xlarge FTW
• Our custom, obscure hardware => difficult to resolve problems, difficult to upgrade
• hi1 overall DB I/O performance comparable to 8 x SSD RAID10
• < 3%/yr hi1 instance failure rate!
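
To put the failure-rate bullet in perspective, the following is a rough sketch of what an annual per-instance failure rate implies for a whole fleet. The 3% figure is the upper bound from the slide. The fleet size of 20 is purely hypothetical (the talk does not say how many hi1 instances were running), and failures are assumed to be independent.

```python
def expected_failures(fleet_size: int, annual_rate: float) -> float:
    """Expected number of instance failures per year across the fleet."""
    return fleet_size * annual_rate


def prob_at_least_one_failure(fleet_size: int, annual_rate: float) -> float:
    """Chance that at least one instance fails in a year, assuming independence."""
    return 1.0 - (1.0 - annual_rate) ** fleet_size


if __name__ == "__main__":
    fleet = 20    # hypothetical fleet size, not from the talk
    rate = 0.03   # upper bound stated on the slide
    print(f"expected failures/yr: {expected_failures(fleet, rate):.2f}")
    print(f"P(at least one failure): {prob_at_least_one_failure(fleet, rate):.2f}")
```

Even a low per-instance rate means a database fleet should plan for an occasional instance loss, which is why replicas and a practiced failover procedure matter when the data lives on ephemeral storage.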

Page 15: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Amazon VPC - also a big win
• Easy mapping of internal / external network security model to AWS

Page 16: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Zero downtime move?

Page 17: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Page 18: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Page 19: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Zero Downtime Move
• Flexibility of the AWS cloud makes a zero downtime move inexpensive. Pay for only what you use. Provision fast.
• Plan
• Test
• Plan and test again

Page 20: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Major changes post-move
• Database storage goes from SSD to hi1.4xlarge ephemeral
• Hardware load balancers become Elastic Load Balancing load balancers

Page 21: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Major changes post-move
• Database storage goes from SSD to hi1.4xlarge ephemeral
• Hardware load balancers become ELB
• haproxy layer 7 load/traffic directing goes from static to dynamic config
• Web servers autoscale for each cluster
• Membase to ElastiCache (later to Amazon EC2)
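
The move from static to dynamic haproxy config can be illustrated with a minimal sketch that renders a `backend` section from whatever web servers currently exist. The talk does not describe how SmugMug generated its config, so this is only one plausible shape: the backend name, server labels, and addresses are hypothetical, and in practice the server list would come from an inventory lookup (for example, the members of an autoscaled cluster) rather than a hard-coded list.

```python
from typing import Iterable, Tuple

Server = Tuple[str, str, int]  # (label, host, port)


def render_backend(name: str, servers: Iterable[Server]) -> str:
    """Render an haproxy `backend` section for the given servers."""
    lines = [f"backend {name}", "    balance roundrobin"]
    # Sort so that the same set of servers always yields the same file,
    # which makes it cheap to detect whether a reload is needed.
    for label, host, port in sorted(servers):
        lines.append(f"    server {label} {host}:{port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical cluster members; a real list would be discovered at run time.
    current = [("web-a2", "10.0.1.11", 80), ("web-a1", "10.0.1.10", 80)]
    print(render_backend("web_cluster_a", current))
```

A generator like this would typically be re-run whenever the cluster scales, with haproxy reloaded only when the rendered text changes.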

Page 22: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Zero Downtime Move Requirements
• Read-only site mode
• Traffic control - shadow load
• Cross country MySQL replication + sufficient bandwidth
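
As a back-of-the-envelope check on what "sufficient bandwidth" means, this sketch estimates how long an initial copy of the databases would take over a dedicated link. The 30 TB figure comes from the talk abstract. The link speeds and the 80% usable-throughput factor are assumptions chosen for illustration, not numbers from the talk.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to copy `size_tb` (decimal terabytes) over a link.

    `efficiency` is the assumed fraction of the nominal link speed that is
    actually usable for the copy.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = size_tb * 1e12 * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    for gbps in (1, 10):  # assumed link speeds
        print(f"{gbps} Gbps: {transfer_hours(30, gbps):.1f} hours for 30 TB")
```

Under these assumptions a 1 Gbps link needs roughly 83 hours for the initial 30 TB, so the bulk copy has to start days before the cutover, and the link must also keep up with ongoing replication traffic afterwards.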

Page 23: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Zero Downtime Move Requirements
• Read-only site mode
• Traffic control - shadow load
• Cross country MySQL replication + sufficient bandwidth
• Bot testing
• Read-only live site testing w/ QA
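
A read-only site mode can be sketched as a single switch that every write path checks while reads continue to be served. The talk does not show SmugMug's implementation; the class and method names below are hypothetical and an in-memory dict stands in for the database.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted while the site is read-only."""


class SiteMode:
    """Global switch; a real system might keep this in shared config."""

    def __init__(self) -> None:
        self.read_only = False


class PhotoStore:
    """Toy data store whose writes are gated by the site mode."""

    def __init__(self, mode: SiteMode) -> None:
        self._mode = mode
        self._data = {}

    def get(self, key: str):
        # Reads are always allowed, so visitors can keep browsing.
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        if self._mode.read_only:
            raise ReadOnlyError("site is in read-only mode")
        self._data[key] = value


if __name__ == "__main__":
    mode = SiteMode()
    store = PhotoStore(mode)
    store.put("photo-1", "sunset.jpg")
    mode.read_only = True
    print(store.get("photo-1"))
    try:
        store.put("photo-2", "beach.jpg")
    except ReadOnlyError as exc:
        print(f"write rejected: {exc}")
```

With writes rejected at one choke point, the databases stop changing, which is what makes it safe to reassign masters and to test the new site against live traffic before re-enabling writes.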

Page 24: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

More on moving
• Full scale read-write testing is difficult
• Be aware of AWS limits
• Talk to support for big growth
• Roll back plan - manage risky change

Page 25: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Flipping the switch to AWS
• “The biggest, scariest engineering change we've made in the company's history” - Don, SmugMug Chief Geek
• Go read-only (1 min)
• Pre-Scale up big
• MHA to reassign MySQL masters and their replication (30 min)
• Point DNS+CDN to Elastic Load Balancing (5-30 min)

Page 26: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Flipping the switch to AWS
• Test! (60 min)
• When Read-only is all good, go to read-write (5 min)
• Test! Inevitable bugs at this step (hours)
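
The timings on the two "Flipping the switch" slides can be added up to estimate how long the site stays read-only. The sketch below uses only the durations stated on the slides and assumes the steps run strictly one after another, which the talk does not confirm. "Pre-Scale up big" has no stated duration and the final round of testing ("hours") happens after writes are back on, so both are left out.

```python
from typing import List, Tuple

# (step, min_minutes, max_minutes), as stated on the slides.
READ_ONLY_WINDOW: List[Tuple[str, int, int]] = [
    ("Go read-only", 1, 1),
    ("Reassign MySQL masters and replication with MHA", 30, 30),
    ("Point DNS+CDN to Elastic Load Balancing", 5, 30),
    ("Test in read-only", 60, 60),
    ("Go read-write", 5, 5),
]


def window_minutes(steps: List[Tuple[str, int, int]]) -> Tuple[int, int]:
    """Best-case and worst-case total, assuming sequential steps."""
    best = sum(low for _, low, _ in steps)
    worst = sum(high for _, _, high in steps)
    return best, worst


if __name__ == "__main__":
    best, worst = window_minutes(READ_ONLY_WINDOW)
    print(f"read-only window: {best} to {worst} minutes")
```

Under that sequential assumption the read-only window comes to between 101 and 126 minutes, most of it spent on testing rather than on the switch itself.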

Page 27: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

MHA?
• Facebook, DeNA
• Helps to reliably reassign MySQL masters and replication, maintaining consistency

Page 28: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

MHA?
• Manual failover in MySQL 5.5 and earlier is painful, time-consuming
• Be careful with automation for rare events - it can bite

Page 29: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Problems?
• Completely redundant network links can fail
• Bugs related to IP address change
• ElastiCache performance
• NewRelic! Use it or a similar APM product

Page 30: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Results

Page 31: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Results

Page 32: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Results
• Data Center - performance fluctuated through the day
• AWS w/scaling - flat performance throughout the day - significant scalability limits removed
• Networking was a key improvement
• Success!

Page 33: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Lessons Learned
• We love AWS even more than before
• Automate everything
• Understand Amazon EBS, and understand underlying details of AWS services
• Unpredictable Ops schedules vs. large projects

Page 34: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Lessons Learned

Job #1: Making business happen

Page 35: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

We made more changes, because we could
• As long as we’re moving our infrastructure, why not rebuild most of it too?
• Linux, MySQL, package versions upgraded
• New monitoring tools
• NFS dependencies eliminated, moved to Amazon S3 or DynamoDB
• Code pushes managed by nice distributed tools utilizing Amazon S3 + internal torrent

Page 36: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

One last thing...
• Go Multi-availability-zone!
• Load balancers send traffic to multiple haproxy per AZ with AZ-specific web clusters, DB replicas
• Backed up w/ cross AZ
• Keep SPOFs in one AZ
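
The multi-AZ layout on this slide (AZ-specific web clusters, backed up by cross-AZ capacity) can be sketched as a routing preference: use healthy backends in the local AZ, and fall back to healthy backends elsewhere only when the local AZ has none. This is an illustrative reading of the slide, not SmugMug's actual routing logic; the backend names and AZ labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backend:
    name: str
    az: str
    healthy: bool = True


def candidates(local_az: str, backends: List[Backend]) -> List[Backend]:
    """Prefer healthy backends in the local AZ; otherwise go cross-AZ."""
    healthy = [b for b in backends if b.healthy]
    local = [b for b in healthy if b.az == local_az]
    return local if local else healthy


if __name__ == "__main__":
    fleet = [
        Backend("web-a1", "az-a"),
        Backend("web-a2", "az-a", healthy=False),
        Backend("web-b1", "az-b"),
    ]
    print([b.name for b in candidates("az-a", fleet)])
```

Keeping traffic inside one AZ in the normal case avoids cross-AZ latency, while the fallback keeps the site up if an AZ-local cluster becomes unavailable.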

Page 37: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

Questions?
Andrew Shieh, Sunnyvale, shandrew @ smugmug.com
@shandrew

http://www.smugmug.com/ http://pics.shieh.info/

Thank you!

Page 38: SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013

ARC312 - SmugMug’s Zero Downtime Migration to AWS

Thank You
