Jay Edwards & Ben Black
PalominoDB
{jay, ben}@palominodb.com
MySQL in AWS
Patterns
Agenda
1. Introduction
2. RDS, EC2/MySQL
3. Web console, CLI, API
4. Performance/Availability
5. Implementation choices
6. Managing DDL
7. Common failures
8. Cost
9. Questions
About us
Jay!
CTO, PDB, OFA, Twitter
Ben!
Sr. DBA, PDB, Garmin
Booth? Yes. Hiring? Yes.
Interactivity
Ask away; we've got time. Ben will be glad to
try to solve your problems.
AWS tutorial?
• "Click on the replica button and come back
in 30 minutes"
• "PIOPs <-> EBS. Uncheck that box and
come back in 2 hours"
RDS and EC2/MySQL
RDS benefits
Fully managed
• High Availability
• Replicas? *click*
• PIT recover? *click*
• *click, click, click*
RDS un-benefits
Fully managed
• No binlog access
• No SUPER
• No flexible topology
The more experienced a DBA you are, the
crankier you will be.
RDS improves!
Like all AWS properties, RDS keeps improving.
It's perfect for developers, proofs of concept,
one-offs, absorbing temporary load.
(Tungsten supports replication from MySQL
into RDS.)
EC2/MySQL
All the MySQL you've come to love & hate
Multi-Region via replication & WAN tunnel
Why RDS or EC2?
RDS
1. You can tolerate ~99% uptime (which many
people can)
2. You don't have lots of DBAs and need to
optimize for operational ease
EC2
1. Multi-region availability
2. Vertical scaling
Questions?
Any particular scenarios you want to ask us
about?
Web Console, CLI, API
Overview
No single interface is complete
• some operations aren't exposed through every
method (console, CLI, API)
Web Console
Most of the stuff you need for common
day-to-day maintenance
Sometimes:
• slow
• isn't working
• needs rage-clicking
CLI setup
RDS CLI
export AWS_RDS_HOME
export AWS_CREDENTIAL_FILE
(the credential file sets AWSAccessKeyId and AWSSecretKey)
CLI pain
It's written in Java right now*. The JVM
overhead makes it painfully slow for
large-scale automation.
* The future is the Redshift CLI (Python,
coherent interface)
CLI output
Verbose and clunky
DBINSTANCE,scp01-replica2,2010-05-22T01:53:47.372Z,db.m1.large,mysql,50,(nil),master,available,scp01-replica2.cdysvrlmdirl.us-east-1.rds.amazonaws.com,3306,us-east-1b,(nil),0,(nil),(nil),(nil),(nil),(nil),(nil),(nil),(nil),sun:05:00-sun:09:00,23:00-01:00,(nil),n,5.1.50,n,simcoprod01,general-public-license
SECGROUP,Name,Status
SECGROUP,default,active
PARAMGRP,Group Name,Apply Status
PARAMGRP,default.mysql5.1,in-synch
Combining the worst features of machine- and human-readable text
formats.
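One way to cope with this output in scripts is to split it by record type first. The sketch below is a minimal Python illustration, assuming each line is comma-separated, starts with a record-type tag, and uses (nil) for empty values, as in the sample above. The function name parse_rds_cli and the header_types parameter are hypothetical helpers, not part of the RDS CLI. The DBINSTANCE sample line is a truncated copy of the record shown above.

```python
import csv
import io

def parse_rds_cli(text, header_types=("SECGROUP", "PARAMGRP")):
    """Group RDS CLI output lines by their leading record-type tag.

    For types listed in header_types, the first row seen is treated as
    column names (as in the sample output); other types stay positional.
    "(nil)" fields become None.
    """
    records = {}
    headers = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        tag = row[0]
        fields = [None if f == "(nil)" else f for f in row[1:]]
        if tag in header_types and tag not in headers:
            headers[tag] = fields  # first row of this type names the columns
            continue
        if tag in headers:
            fields = dict(zip(headers[tag], fields))
        records.setdefault(tag, []).append(fields)
    return records

# Truncated copy of the sample output from the slide.
sample = """DBINSTANCE,scp01-replica2,2010-05-22T01:53:47.372Z,db.m1.large,mysql,50,(nil),master,available
SECGROUP,Name,Status
SECGROUP,default,active
PARAMGRP,Group Name,Apply Status
PARAMGRP,default.mysql5.1,in-synch
"""

parsed = parse_rds_cli(sample)
print(parsed["SECGROUP"])           # dicts keyed by the header row
print(parsed["DBINSTANCE"][0][:3])  # positional fields; no header row is printed
```

DBINSTANCE rows stay positional because the sample shows no header row for them; mapping positions to names would need the CLI documentation.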
API
Use Boto! (Mitch, its author, works for AWS.)
Apply immediately.
--apply-immediately
Check the box hiding at the bottom of the page.
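The same switch exists in the API. The sketch below uses boto3, the present-day successor to the Boto library named above (a substitution; the slides predate it), and its modify_db_instance call with the ApplyImmediately parameter. The helper names, the region default and the instance values are illustrative assumptions. The AWS call itself is left commented out because it needs credentials and changes a real instance.

```python
def build_modify_request(instance_id, instance_class, apply_immediately=True):
    """Build keyword arguments for an RDS ModifyDBInstance call."""
    return {
        "DBInstanceIdentifier": instance_id,
        "DBInstanceClass": instance_class,
        # Without this flag the change is queued for the maintenance window.
        "ApplyImmediately": apply_immediately,
    }

def modify_instance(request, region="us-east-1"):
    """Send the request with boto3 (imported lazily; needs AWS credentials)."""
    import boto3
    client = boto3.client("rds", region_name=region)
    return client.modify_db_instance(**request)

# Instance name and class reused from the sample CLI output; purely illustrative.
request = build_modify_request("scp01-replica2", "db.m1.large")
print(request)
# modify_instance(request)  # uncomment to actually call AWS
```

Keeping the request builder separate from the call makes it easy to log or review exactly what will be sent before anything changes.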
Availability
How many nines?
EC2 Region SLA
99.95% SLA
“Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.”
("Region unavailable" == "multiple AZs are toast")
Implies you've got to go multi-region
EC2 Region SLA
~99.2% Reality
The previous definition is very strict: 2 or more
AZs down; can't create instances; blah, blah.
Once or twice a year: multi-AZ degradation
(EBS, network, who knows)
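To put those percentages in operational terms, a small Python sketch converts availability into hours of downtime per year. It assumes a 365-day year. The three percentages are the ones quoted in these slides; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(availability_pct):
    """Hours of downtime per year implied by an availability percentage."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR

for label, pct in [
    ("EC2 region SLA", 99.95),
    ("EC2 region reality (approx.)", 99.2),
    ("RDS rule of thumb", 99.0),
]:
    print(f"{label}: {pct}% -> {downtime_hours_per_year(pct):.1f} hours/year")
```

Under these assumptions the SLA figure allows roughly 4.4 hours a year, while ~99.2% works out to about 70 hours and ~99% to about 88 hours.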
Multi-region
It's coming for RDS. Probably before the end of
the year.
Until then...
Always go Multi-AZ
Minimal downtime for most maintenance
Saves you from most master crashes
{Sometimes, often, frequently} destroys all
replicas
Multi-AZ binlogs
sync_binlog=1
innodb_support_xa=1
• used to DESTROY write throughput
• MySQL 5.6 brings drastic improvements
Questions?
Questions about designing for availability?
Implementation Choices
Instance sizing
Dynamicity == reduced cost
(Now, in general, $$ isn't why you go to the
cloud; it's operational efficiency & reduced
friction).
Have a spreadsheet and do capacity analyses
frequently.
Ephemeral SSDs
Really nice! 150,000* IOPs
Really bad! ~~POOF~~
• Excellent for replicas
• Requires operational excellence
* YMMV
Provisioned IOPs
Really, really nice!
• Drastically lower failure rate (order of
magnitude)
• Guaranteed throughput
Not so nice.
• Costs $$
Provisioned IOPs
Masters and replicas can be different.
You can convert PIOPs <-> EBS back and
forth.
Consider multi-AZ PIOPs master for the best in
durability.
VPCs
Go VPC from the beginning for production.
• Hard to convert
• Use ELBs for internal load-balancing
• Not sharing the 10.net with everybody
Cluster compute
Placement groups are available for CC
instances.
"Placement group" means "physically close
hardware".
Very low-latency, full-bisection 10GbE
Questions?
Questions about your particular setup?
Managing DDL
DDL
It's not possible to perform DDL on a slave and
then swap it with the master.
Slave promotion
Blocking DDL
DDL
Online schema changes
(log_bin_trust_function_creators)
No OS access
Be careful cleaning up if you ctrl-c
CALL mysql.rds_skip_repl_error;
Questions?
Questions about DDL?
Escape from RDS
mysql schema
--routines
users?
Dumping Users
mysql --host=olddatabasehost -BNe \
  "select concat('\'',user,'\'@\'',host,'\'') from mysql.user
   where user not like 'rds%' and user != 'master'" | \
while read uh; do
  mysql --host=olddatabasehost -BNe "show grants for $uh" | \
    sed 's/$/;/; s/\\\\/\\/g'
done > user_grants.sql
http://www.villescorner.com/2012/11/mysqldump-from-amazon-rds-headaches-of.html
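The sed step in the pipeline above does two things to each SHOW GRANTS line: it appends a semicolon and collapses doubled backslashes into single ones. These presumably appear because batch-mode output escapes backslashes. A minimal Python sketch of that same transformation follows. The function name and the sample grant are hypothetical.

```python
def finish_grant(line):
    """Mirror the sed step: append ';' and collapse doubled backslashes."""
    return line.rstrip("\n").replace("\\\\", "\\") + ";"

# Hypothetical grant line with a doubled backslash before the underscore.
raw = "GRANT SELECT ON `app\\\\_db`.* TO 'report'@'%'"
print(finish_grant(raw))
```

This is handy when the grants are collected from a script rather than a shell pipeline. The output is meant to be replayed on the new server with the mysql client.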
Common failures
(Should really be called zones and regions)
Operations is about
managing change and
mitigating risk.
Local failures
Database crashes
Human error
Localized EBS hang
How to mitigate?
Multi-AZ PIOPs master
Operational excellence
Throw away & rebuild replicas
Local failures redux
Local failures should be, at most, annoyances.
Runbooks*
Game days
Monitoring
* Process is a poor substitute for competence.
If you can't deal with
expected and desired
change, you'll never be
able to handle unexpected
and unwanted change.
Regional failures
A well-designed architecture will save you.
How quickly can your DNS flip?
How good is your replication?
Do you have a CDN?
Is your application going to run?
Not everybody can afford this.
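"How good is your replication?" can be turned into a check that runs before any failover decision. The Python sketch below judges one SHOW SLAVE STATUS row, passed in as a dict, using the standard Slave_IO_Running, Slave_SQL_Running and Seconds_Behind_Master columns. The function name and the 30-second threshold are assumptions for illustration. Fetching the row from the server is left out.

```python
def replica_is_healthy(status, max_lag_seconds=30):
    """Judge one SHOW SLAVE STATUS row (as a dict) against a lag threshold."""
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = status.get("Seconds_Behind_Master")
    # A NULL lag (None here) means replication is stopped or the lag is unknown.
    return lag is not None and int(lag) <= max_lag_seconds

good = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 2}
broken = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No", "Seconds_Behind_Master": None}
print(replica_is_healthy(good), replica_is_healthy(broken))
```

A cross-region replica that fails this check is not a safe promotion target, however fast the DNS can flip.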
Zones and Regions
A zone is analogous to a data center (for some
small number of buildings).
A region is a geographically dispersed
collection of zones that is distinct from any
other region.
Zones & Regions differ
Different instance types
Different features
Different provisioning capacity
OFA had ~40% of the US-East medium
instances at one point. We couldn't duplicate
that in US-West.
Questions?
Cost
Reserved instances
• Substantial savings (how often do you turn
off production databases?)
• Secondary market
• Must match AZ and instance size
• Discount coupon
Heavy utilization reserved instances are
charged the hourly rate 24x7, running or not
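Because the hourly rate is charged around the clock, whether a reservation pays off depends on how many hours the instance would otherwise run on demand. The Python sketch below works out the break-even point for a one-year term. All prices are made-up placeholders, not AWS prices, and the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # one-year term assumed

def reserved_cost(upfront, reserved_hourly):
    """Yearly cost of a heavy utilization reservation: billed every hour."""
    return upfront + reserved_hourly * HOURS_PER_YEAR

def break_even_hours(on_demand_hourly, upfront, reserved_hourly):
    """On-demand running hours per year at which the reservation costs the same."""
    return reserved_cost(upfront, reserved_hourly) / on_demand_hourly

# Hypothetical prices, for illustration only.
ON_DEMAND, UPFRONT, RESERVED = 0.40, 1000.0, 0.15

hours = break_even_hours(ON_DEMAND, UPFRONT, RESERVED)
print(f"break-even at {hours:.0f} h/year ({hours / HOURS_PER_YEAR:.0%} utilization)")
```

With these placeholder numbers the reservation wins once the instance runs about two thirds of the year, which a production database always does.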
Watch the $
Spreadsheet!
Inventory!
Load analysis!
Cloudability!
Dynamicity
The only thing you can't do is downsize
storage.
Change instance size? Check.
Turn PIOPs off? Check.
Delete replicas? Check.
Up to meet need. Down to meet budget.
Upgrading
Minor upgrades (can happen automatically during
the maintenance window; will reboot or fail over)
* Disable this
Upgrade from 5.5 to 5.6
1) Dump/load
2) Delta load
3) Switchover
Fin
Ask away!