Five years of EC2 distilled | Grig Gheorghiu | Silicon Valley Cloud Computing Meetup, Feb. 19th 2013 | @griggheo | agiletesting.blogspot.com

Five Years of EC2 Distilled

Jan 15, 2015


Grig Gheorghiu

My experiences running large-scale system infrastructures in EC2 and in traditional data centers, with lessons learned the hard way.
Transcript
Page 1: Five Years of EC2 Distilled

Five years of EC2 distilled

Grig Gheorghiu

Silicon Valley Cloud Computing Meetup, Feb. 19th 2013

@griggheo | agiletesting.blogspot.com

Page 2: Five Years of EC2 Distilled

whoami

• Dir of Technology at Reliam (managed hosting)

• Sr Sys Architect at OpenX

• VP Technical Ops at Evite

• VP Technical Ops at Nasty Gal

Page 3: Five Years of EC2 Distilled

EC2 creds

• Started with personal m1.small instance in 2008

• Still around!

• UPTIME: 5:13:52 up 438 days, 23:33, 1 user, load average: 0.03, 0.09, 0.08

Page 4: Five Years of EC2 Distilled

EC2 at OpenX

• end of 2008

• 100s then 1000s of instances

• one of largest AWS customers at the time

• NAMING is very important

• terminated DB server by mistake

• in ideal world naming doesn’t matter
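A minimal sketch of what a naming convention buys you. The role-env-index scheme and the "refuse to touch prod DBs" guard are illustrative assumptions, not the deck's actual convention; the point from the slide is only that unambiguous names stop you terminating the wrong (e.g. DB) instance.

```python
# Hypothetical naming scheme: role-environment-index, e.g. "db-prod-03".
# The exact convention is an assumption made for illustration.

def instance_name(role: str, env: str, index: int) -> str:
    """Build an unambiguous instance name like 'db-prod-03'."""
    return f"{role}-{env}-{index:02d}"

def is_safe_to_terminate(name: str) -> bool:
    """Refuse to terminate anything whose name marks it as a prod DB."""
    role, env, _ = name.split("-")
    return not (role == "db" and env == "prod")
```

A guard like this in your termination script is cheap insurance; in an ideal world the scheduler knows roles, but names are what a tired operator actually reads.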

Page 5: Five Years of EC2 Distilled

EC2 at OpenX (cont.)

• Failures are very frequent at scale

• Forced to architect for failure and horizontal scaling

• Hard to scale at all layers at the same time (scaling the app server layer can overwhelm the DB layer; playing whack-a-mole)

• Elasticity: easier to scale out than scale back

Page 6: Five Years of EC2 Distilled

EC2 at OpenX (cont.)

• Automation and configuration management become critical

• Used little-known tool - ‘slack’

• Rolled own EC2 management tool in Python, wrapped around EC2 Java API

• Testing deployments is critical (one mistake can get propagated everywhere)
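One way to limit the blast radius the last bullet warns about is a canary stage before config management propagates a change everywhere. This is a generic sketch of that pattern, not the tool OpenX used; host names are illustrative.

```python
# Sketch: stage a rollout so one bad deploy is caught on a canary
# before it reaches the whole fleet.

def rollout_batches(hosts, canary_count=1):
    """Split hosts into a small canary batch followed by the rest."""
    canaries = hosts[:canary_count]
    rest = hosts[canary_count:]
    return [canaries, rest] if rest else [canaries]
```

Deploy to the first batch, verify health checks, then release the second; abort between batches and only the canaries are broken.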

Page 7: Five Years of EC2 Distilled

EC2 at OpenX (cont.)

• Hard to scale at the DB layer (MySQL)

• mysql-proxy for r/w split

• slaves behind HAProxy for reads

• HAProxy for LB, then ELB

• ELB melted initially, had to be gradually warmed up
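The r/w split above can be sketched in a few lines: writes go to the master, reads round-robin across the slave pool (conceptually what mysql-proxy plus HAProxy did). The endpoint names are placeholders, and real routing needs more care (transactions, replication lag) than this sketch shows.

```python
import itertools

class ReadWriteRouter:
    """Toy read/write splitter: writes to the master, reads
    round-robin across slaves (the mysql-proxy + HAProxy idea)."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def route(self, sql: str):
        # Naive classification: SELECTs are reads, everything else writes.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._slaves)
        return self.master
```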

Page 8: Five Years of EC2 Distilled

EC2 at Evite

• Sharded MySQL at DB layer; application very write-intensive

• Didn’t do proper capacity planning/dark launching; had to move quickly from data center to EC2 to scale horizontally

• Engaged Percona at the same time

Page 9: Five Years of EC2 Distilled

EC2 at Evite (cont.)

• Started with EBS volumes (separate for data, transaction logs, temp files)

• EBS horror stories

• CPU Wait up to 100%, instances AWOL

• I/O very inconsistent, unpredictable

• Striped EBS volumes in RAID0 helps with performance but not with reliability

Page 10: Five Years of EC2 Distilled

EC2 at Evite (cont.)

• EBS apocalypse in April 2011

• Hit us even with masters and slaves in diff. availability zones (but all in single region - mistake!)

• IMPORTANT: rebuilding redundancy into your system is HARD

• For DB servers, reloading data on new server is a lengthy process

Page 11: Five Years of EC2 Distilled

EC2 at Evite (cont.)

• General operation: very frequent failures (once a week); nightmare for pager duty

• Got very good at disaster recovery!

• Failover of master to slave

• Rebuilding of slave from master (xtrabackup)

• Local disks striped in RAID0 better than EBS
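A hedged sketch of the MySQL statements behind the failover drills above. Exact steps vary by setup (this is the pre-GTID, log-position era); treat the statement lists as an outline, not a runnable runbook.

```python
# Assumption: classic MySQL replication, one master, N slaves.

def promote_slave_statements():
    """SQL run on the slave being promoted to master."""
    return [
        "STOP SLAVE;",
        "RESET SLAVE;",              # discard old replication coordinates
        "SET GLOBAL read_only = 0;", # start accepting writes
    ]

def repoint_slave_statements(new_master_host):
    """SQL run on each surviving slave to follow the new master."""
    return [
        "STOP SLAVE;",
        f"CHANGE MASTER TO MASTER_HOST = '{new_master_host}';",
        "START SLAVE;",
    ]
```

Rebuilding a fresh slave from the new master is where xtrabackup comes in: take a hot backup, restore it on the new node, then point it at the master's binlog coordinates recorded in the backup.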

Page 12: Five Years of EC2 Distilled

EC2 at Evite (cont.)

• Ended up moving DB servers back to data center

• Bare metal (Dell C2100, 144 GB RAM, RAID10); 2 MySQL instances per server

• Lots of tuning help from Percona

• BUT: EC2 was great for capacity planning! (Zynga does the same)

Page 13: Five Years of EC2 Distilled

EC2 at Evite (cont.)

• Relational databases are not ready for the cloud (reliability, I/O performance)

• Still keep MySQL slaves in EC2 for DR

• Ryan Mack (Facebook): “We chose well-understood technologies so we could better predict capacity needs and rely on our existing monitoring and operational tool kits.”

Page 14: Five Years of EC2 Distilled

EC2 at Evite (cont.)

• Didn’t use provisioned IOPS for EBS

• Didn’t use VPC

• Great experience with Elastic Map Reduce, S3, Route 53 DNS

• Not so great experience with DynamoDB

• ELB OK but still need HAProxy behind it

Page 15: Five Years of EC2 Distilled

EC2 at Nasty Gal

• VPC - really good idea!

• Extension of data center infrastructure

• Currently using it for dev/staging + some internal backend production

• Challenging to set up VPN tunnels to various firewall vendors (Cisco, Fortinet) - not much debugging on VPC side

Page 16: Five Years of EC2 Distilled

Interacting with AWS

• AWS API (mostly Java based, but also Ruby and Python)

• Multi-cloud libraries: jclouds (Java), libcloud (Python), deltacloud (Ruby)

• Chef knife

• Vagrant EC2 provider

• Roll your own
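"Roll your own" and the multi-cloud libraries share one idea: a common driver interface so the rest of your tooling never imports a vendor SDK directly. A minimal sketch of that pattern, in the spirit of libcloud's drivers; the in-memory driver is a stand-in so the example runs without cloud credentials.

```python
class NodeDriver:
    """Minimal common interface, in the spirit of libcloud's drivers.
    Real drivers would wrap the EC2/GCE/etc. API calls."""

    def list_nodes(self):
        raise NotImplementedError

    def create_node(self, name):
        raise NotImplementedError


class InMemoryDriver(NodeDriver):
    """Stand-in driver: keeps nodes in a list instead of calling a cloud."""

    def __init__(self):
        self._nodes = []

    def list_nodes(self):
        return list(self._nodes)

    def create_node(self, name):
        self._nodes.append(name)
        return name
```

Swap `InMemoryDriver` for an EC2-backed one and the calling code doesn't change; that is the whole portability argument.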

Page 17: Five Years of EC2 Distilled

Proper infrastructure care and feeding

• Monitoring - alerting, logging, graphing

• It’s not in production if it’s not monitored and graphed

• Monitoring is for ops what testing is for dev

• Great way to learn a new infrastructure

• Dev and ops on pager

Page 18: Five Years of EC2 Distilled

Proper infrastructure care and feeding

• Going from #monitoringsucks to #monitoringlove and @monitorama

• Modern monitoring/graphing/logging tools

• Sensu, Graphite, Boundary, Server Density, New Relic, Papertrail, Pingdom, Dead Man’s Snitch
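Part of what makes Graphite pleasant is how little it asks of you: one metric per line in its plaintext protocol. The formatting sketch below is complete; actually shipping the line means writing it to a TCP socket to carbon, by default port 2003.

```python
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol:
    'metric.path value unix_timestamp\\n'."""
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{path} {value} {ts}\n"
```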

Page 19: Five Years of EC2 Distilled

Proper infrastructure care and feeding

• Dashboards!

• Mission Control page with graphs based on Graphite and Google Visualization API

• Correlate spikes and dips in graphs with errors (external and internal monitoring)

• Akamai HTTP 500 alerts correlated with Web server 500 errors and DB server I/O wait increase

Page 20: Five Years of EC2 Distilled

Proper infrastructure care and feeding

• HTTP 500 errors as a percentage of all HTTP requests across all app servers in the last 60 minutes
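The metric on this slide is just a ratio, but getting the window and the denominator right matters. A sketch, assuming per-status request counts aggregated across all app servers over the last 60 minutes:

```python
def error_rate(counts):
    """Percentage of HTTP 500 responses; `counts` maps status code ->
    request count over the window (e.g. the last 60 minutes)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no traffic: report 0 rather than divide by zero
    return 100.0 * counts.get(500, 0) / total
```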

Page 21: Five Years of EC2 Distilled

Proper infrastructure care and feeding

• Expect failures and recover quickly

• Capacity planning

• Dark launching

• Measure baselines

• Correlate external symptoms (HTTP 500) with metrics (CPU I/O Wait) then keep metrics under certain thresholds by adding resources

Page 22: Five Years of EC2 Distilled

Proper infrastructure care and feeding

• Automate, automate, automate! - Chef, Puppet, CFEngine, Jenkins, Capistrano, Fabric

• Chef - can be single source of truth for infrastructure

• Running chef-client continuously on nodes requires discipline

• Logging into a remote node is an anti-pattern (hard!)

Page 23: Five Years of EC2 Distilled

Proper infrastructure care and feeding

• Chef best practices

• Use knife - no snowflakes!

• Deploy new nodes, don’t do massive updates in place

• BUT! beware of OS monoculture

• kernel bug after 200+ days

• leapocalypse

Page 24: Five Years of EC2 Distilled

Is the cloud worth the hype?

• It’s a game changer, but it’s not magical; try before you buy! (benchmarks could surprise you)

• Cloud expert? Carry pager or STFU

• Forces you to think about failure recovery, horizontal scalability, automation

• Something to be said for abstracting away the physical network - the most obscure bugs are network-related (ARP caching, routing tables)

Page 25: Five Years of EC2 Distilled

So...when should I use the cloud?

• Great for dev/staging/testing

• Great for layers of infrastructure that contain many identical nodes and that are forgiving of node failures (web farms, Hadoop nodes, distributed databases)

• Not great for ‘snowflake’-type systems

• Not great for RDBMS (esp. write-intensive)

Page 26: Five Years of EC2 Distilled

If you still want to use the cloud

• Watch that monthly bill!

• Use multiple cloud vendors

• Design your infrastructure to scale horizontally and to be portable across cloud vendors

• Shared nothing

• No SAN, NAS

Page 27: Five Years of EC2 Distilled

If you still want to use the cloud

• Don’t get locked into vendor-proprietary services

• EC2, S3, Route 53, EMR are OK

• Data stores are not OK (DynamoDB)

• OpsWorks - debatable (based on Chef, but still locks you in)

• Wrap services in your own RESTful endpoints
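The wrapping idea in the last bullet, sketched as a vendor-neutral facade. In practice this facade sits behind your own RESTful endpoint; here a plain dict stands in for the backend so the example runs, where the real backend would be a DynamoDB table, S3 bucket, or anything else implementing the same two calls.

```python
class KeyValueStore:
    """Vendor-neutral facade: app code talks to this (in practice via
    your own RESTful endpoint), never to a vendor SDK directly."""

    def __init__(self, backend):
        self._backend = backend  # dict here; a cloud store in production

    def put(self, key, value):
        self._backend[key] = value

    def get(self, key, default=None):
        return self._backend.get(key, default)
```

Switching vendors then means reimplementing the backend behind the endpoint, not rewriting every caller.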

Page 28: Five Years of EC2 Distilled

Does EC2 have rivals?

• No (or at least not yet)

• Anybody use GCE?

• Other public clouds are either toys or smaller, with fewer features (no names named)

• Perception matters - not a contender unless featured on High Scalability blog

• APIs matter less (can use multi-cloud libs)

Page 29: Five Years of EC2 Distilled

Does EC2 have rivals?

• OpenStack, CloudStack, Eucalyptus all seem promising

• Good approach: private infrastructure (bare metal, private cloud) for performance/reliability + extension into public cloud for elasticity/agility (EC2 VPC, Rack Connect)

• How about PaaS?

• Personally: too hard to relinquish control