© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sailesh Krishnamurthy Senior Engineering Manager Amazon Web Services Amazon Aurora: A Database for the Cloud
Rapidly growing global footprint
• Over 1 million active customers across 190 countries
• 800+ government agencies
• 3,000+ educational institutions
• 12 regions
• 33 availability zones
• 52 edge locations
Every day, AWS adds enough new server capacity to support Amazon.com when it was a $7 billion global enterprise.
How is the Cloud changing the World?
Moving from a world where we design for sharing scarce system resources to one where the central challenge is taking advantage of their abundance.
Designing Databases for the Cloud
• Scarcity to Abundance
• Monolithic to Service Oriented
• Single cluster to a Fleet of clusters
What is Amazon Aurora?
• MySQL-compatible relational database
• Performance and availability of commercial databases
• Simplicity and cost-effectiveness of open source databases
• Delivered as a managed service
Current DB Architectures are Monolithic
[Diagram: an Application on top of database stacks, each with SQL, Transactions, Caching, and Logging layers, backed by Storage]
Even when you scale it out, you’re still replicating the same stack
[Diagram: several copies of the same Application, SQL, Transactions, Caching, Logging, and Storage stack]
Aurora Architecture: Re-imagined for the Cloud
1. Moved the logging and storage layer into a multi-tenant, scale-out storage service optimized for OLTP database workloads
2. Leverage existing AWS services: Amazon EC2, Amazon VPC, Amazon DynamoDB, Amazon SWF, and Amazon S3
3. Maintain compatibility with MySQL: customers can migrate their MySQL applications as-is and use all MySQL tools
[Diagram: data plane with SQL, Transactions, and Caching layers over a Logging + Storage service; control plane with Amazon DynamoDB, Amazon SWF, and Amazon Route 53; Amazon S3; points 1 to 3 marked on the diagram]
Aurora Storage: A Service-Oriented Architecture
• Scale-out, multi-tenant, SSD storage
  • Seamless storage scalability
  • Up to 64 TB database size
  • Only pay for what you use
• Log-structured storage
  • Many small segments, each with their own redo logs
  • Redo logs used to generate data pages on demand
  • Eliminates chatter between database and storage
• Highly available/durable by default
  • 6-way replication across 3 AZs
  • 4 of 6 write quorum
  • Automatic fallback to 3 of 4 if an Availability Zone (AZ) is unavailable
  • 3 of 6 read quorum
  • Continuous backup to S3
[Diagram: storage nodes in AZ 1, AZ 2, and AZ 3, with backup to Amazon S3]
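The quorum sizes on this slide (6 copies, 4 of 6 to write, 3 of 6 to read) can be checked with a short sketch. It assumes the usual quorum-overlap rules, which the slide does not spell out: every read set must intersect every write set, and any two write sets must intersect. The function names are illustrative.

```python
from itertools import combinations

COPIES = 6        # 6-way replication across 3 AZs (from the slide)
WRITE_QUORUM = 4  # 4 of 6 write quorum
READ_QUORUM = 3   # 3 of 6 read quorum


def quorums_overlap(copies: int, read_q: int, write_q: int) -> bool:
    """Arithmetic form of the overlap rules (an assumption, not stated on the slide)."""
    return read_q + write_q > copies and 2 * write_q > copies


def brute_force_overlap(copies: int, read_q: int, write_q: int) -> bool:
    """Check the same property by enumerating every possible quorum."""
    nodes = range(copies)
    write_sets = [set(c) for c in combinations(nodes, write_q)]
    read_sets = [set(c) for c in combinations(nodes, read_q)]
    reads_see_writes = all(r & w for r in read_sets for w in write_sets)
    writes_collide = all(a & b for a in write_sets for b in write_sets)
    return reads_see_writes and writes_collide


if __name__ == "__main__":
    print(quorums_overlap(COPIES, READ_QUORUM, WRITE_QUORUM))
    print(brute_force_overlap(COPIES, READ_QUORUM, WRITE_QUORUM))
```

With 3 of 6 for both reads and writes the check fails, because a read of three copies could miss a write that landed on the other three.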
Aurora: Database/Storage Interaction
• New API model
  • Reads are block-based (read pages)
  • Writes are delta-based (write redo logs)
• Distributed quorum-based writes
  • Ordered log-stream in a single LSN space
  • Database writes log-stream to 6 nodes in 3 AZs
  • Transaction commit only after write quorum established
• Continuous state exchange protocol
  • Segments can have holes (lost log records)
  • Read at an LSN directed to the right segment
  • Storage segments know when to coalesce redo logs
[Diagram: database instance with SQL, Transactions, and Caching layers over storage nodes in AZ 1, AZ 2, and AZ 3, with backup to Amazon S3]
5x faster than RDS MySQL 5.6 & 5.7
MySQL SysBench results on R3.8XL: 32 cores / 244 GB RAM
[Charts: write performance and read performance]
Five times higher throughput than stock MySQL, based on industry standard benchmarks.
How did we achieve this?
DO LESS WORK
• Do fewer IOs
• Minimize network packets
• Cache prior results
• Offload the database engine
BE MORE EFFICIENT
• Process asynchronously
• Reduce latency path
• Use lock-free data structures
• Batch operations together
DATABASES ARE ALL ABOUT I/O
NETWORK-ATTACHED STORAGE IS ALL ABOUT PACKETS/SECOND
HIGH-THROUGHPUT PROCESSING DOES NOT ALLOW CONTEXT SWITCHES
IO Traffic in RDS MySQL
MYSQL WITH REPLICA
[Diagram: Primary Instance in AZ 1 and Replica Instance in AZ 2, each writing to Amazon Elastic Block Store (EBS) with an EBS mirror, plus Amazon S3; steps numbered 1 to 5]
TYPE OF WRITE: binlog, data, double-write, redo log, FRM files
IO FLOW
• Issue write to EBS; EBS issues to mirror, ack when both done
• Stage write to standby instance through DRBD
• Issue write to EBS on standby instance
OBSERVATIONS
• Steps 1, 3, 5 are sequential and synchronous
• This amplifies both latency and jitter
• Many types of writes for each user operation
• Have to write data blocks twice to avoid torn writes
PERFORMANCE
• 780K transactions
• 7,388K I/Os per million txns (excludes mirroring, standby)
• Average 7.4 I/Os per transaction
30-minute SysBench write-only workload, 100 GB dataset, RDS Multi-AZ, 30K PIOPS
IO Traffic in Aurora
AMAZON AURORA
[Diagram: Primary Instance in AZ 1 and Replica Instance in AZ 2, with storage across AZ 1, AZ 2, and AZ 3 and Amazon S3; async 4/6 quorum, distributed writes]
TYPE OF WRITE: binlog, data, double-write, redo log, FRM files
IO FLOW
• Boxcar redo log records: fully ordered by LSN
• Shuffle to appropriate segments: partially ordered
• Boxcar to storage nodes and issue writes
OBSERVATIONS
• Only write redo log records; all steps asynchronous
• No data block writes (checkpoint, cache replacement)
• 6X more log writes, but 9X less network traffic
• Tolerant of network and storage outlier latency
PERFORMANCE
• 27,378K transactions (35X MORE)
• 950K I/Os per 1M txns (6X amplification)
• Average 0.95 I/O per txn (7.7X LESS)
30-minute SysBench write-only workload, 100 GB dataset
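The headline ratios on the two IO Traffic slides can be re-derived from the quoted totals. The sketch below only repeats that arithmetic; the variable names are illustrative and the inputs are the figures printed on the slides.

```python
# Figures quoted on the RDS MySQL and Aurora "IO Traffic" slides
# (30-minute SysBench write-only workload, 100 GB dataset).
mysql_txns = 780_000
mysql_ios_per_million_txns = 7_388_000
aurora_txns = 27_378_000
aurora_ios_per_million_txns = 950_000

# I/Os per transaction for each system.
mysql_ios_per_txn = mysql_ios_per_million_txns / 1_000_000
aurora_ios_per_txn = aurora_ios_per_million_txns / 1_000_000

# How many more transactions Aurora completed, and how many fewer I/Os each needed.
throughput_ratio = aurora_txns / mysql_txns
io_ratio = mysql_ios_per_txn / aurora_ios_per_txn

print(f"MySQL:  {mysql_ios_per_txn:.2f} I/Os per txn")
print(f"Aurora: {aurora_ios_per_txn:.2f} I/Os per txn")
print(f"Throughput ratio: {throughput_ratio:.1f}x")
print(f"I/O reduction:    {io_ratio:.2f}x")
```

The throughput ratio comes out just above 35, matching the slide. The I/O reduction comes out between 7.7 and 7.8, consistent with the quoted 7.7X.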
IO Traffic in Aurora (Storage Node)
[Diagram: storage node with incoming queue, update queue, hot log, data blocks, and point-in-time snapshot; stages labeled ACK, sort/group, coalesce, GC, and scrub; log records arrive from the primary instance, with peer-to-peer gossip to peer storage nodes and backup to S3; steps numbered 1 to 8]
IO FLOW
① Receive record and add to in-memory queue
② Persist record and ACK
③ Organize records and identify gaps in log
④ Gossip with peers to fill in holes
⑤ Coalesce log records into new data block versions
⑥ Periodically stage log and new block versions to S3
⑦ Periodically garbage collect old versions
⑧ Periodically validate CRC codes on blocks
OBSERVATIONS
• All steps are asynchronous
• Only steps 1 and 2 are in foreground latency path
• Input queue is 46X less than MySQL (unamplified, per node)
• Favor latency-sensitive operations
• Use disk space to buffer against spikes in activity
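Steps ③ and ④ (identify gaps, then gossip with peers to fill holes) can be sketched as a toy model. The slide does not say how a node detects a gap, so this sketch assumes each record carries a backlink to the previous LSN of the same segment. The `Record` and `SegmentLog` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Record:
    lsn: int
    prev_lsn: Optional[int]  # assumed backlink to the previous record of this segment
    payload: str


class SegmentLog:
    """Toy model of one storage node's copy of a segment's redo log."""

    def __init__(self) -> None:
        self.records: Dict[int, Record] = {}

    def receive(self, rec: Record) -> None:
        self.records[rec.lsn] = rec

    def missing(self) -> Set[int]:
        """LSNs referenced by a backlink but not held locally (the holes)."""
        return {
            r.prev_lsn
            for r in self.records.values()
            if r.prev_lsn is not None and r.prev_lsn not in self.records
        }

    def gossip_with(self, peer: "SegmentLog") -> int:
        """Pull missing records from a peer until no more holes can be filled."""
        filled = 0
        wanted = self.missing()
        while wanted:
            found = [peer.records[lsn] for lsn in wanted if lsn in peer.records]
            if not found:
                break  # the peer cannot help with the remaining holes
            for rec in found:
                self.receive(rec)
                filled += 1
            wanted = self.missing()
        return filled


chain = [Record(10, None, "a"), Record(12, 10, "b"), Record(22, 12, "c"), Record(30, 22, "d")]
peer = SegmentLog()
for rec in chain:
    peer.receive(rec)

node = SegmentLog()
node.receive(chain[0])  # LSN 10 arrived
node.receive(chain[3])  # LSN 30 arrived; LSNs 12 and 22 were lost in transit
holes_before = node.missing()
filled = node.gossip_with(peer)
```

Only the newest hole is visible at first (LSN 22); filling it reveals the next one (LSN 12), which is why the loop recomputes the missing set after each round.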
IO Traffic in Aurora Read Replicas
MYSQL READ SCALING
• Logical: Ship SQL statements to Replica
• Write workload similar on both instances
• Independent storage
• Can result in data drift between Master and Replica
[Diagram: MySQL Master (30% read, 70% write) and MySQL Replica (30% new reads, 70% write) connected by single-threaded binlog apply, each with its own data volume]
AMAZON AURORA READ SCALING
• Physical: Ship redo from Master to Replica
• Replica shares storage. No writes performed
• Cached pages have redo applied
• Advance read view when all commits seen
[Diagram: Aurora Master (30% read, 70% write) and Aurora Replica (100% new reads) over shared Multi-AZ storage, with page cache update and selective log apply]
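The Aurora side of this comparison (redo applied only to cached pages, no writes to storage, read view advanced as commits are seen) can be illustrated with a small sketch. It is a simplified model under stated assumptions: the `ReplicaCache` class and its methods are hypothetical, and a page is reduced to a single value tagged with an LSN.

```python
from typing import Dict, Tuple


class ReplicaCache:
    """Toy model of a read replica's page cache fed by the master's redo stream."""

    def __init__(self) -> None:
        self.pages: Dict[int, Tuple[int, str]] = {}  # page_id -> (lsn, value)
        self.read_view_lsn = 0

    def load(self, page_id: int, lsn: int, value: str) -> None:
        """Bring a page into the cache, as a read from shared storage would."""
        self.pages[page_id] = (lsn, value)

    def apply_redo(self, page_id: int, lsn: int, value: str) -> bool:
        """Apply redo only if the page is cached; otherwise discard the record.

        Discarding is safe in this model because the replica never writes:
        the shared storage layer already holds the change.
        """
        if page_id not in self.pages:
            return False
        cached_lsn, _ = self.pages[page_id]
        if lsn > cached_lsn:
            self.pages[page_id] = (lsn, value)
        return True

    def advance_read_view(self, committed_lsn: int) -> None:
        """Move the read view forward once commits up to this LSN have been seen."""
        self.read_view_lsn = max(self.read_view_lsn, committed_lsn)


replica = ReplicaCache()
replica.load(page_id=1, lsn=5, value="a")
applied_cached = replica.apply_redo(page_id=1, lsn=8, value="b")
applied_uncached = replica.apply_redo(page_id=2, lsn=9, value="x")
replica.advance_read_view(9)
```

Contrast this with the MySQL column, where the replica re-executes the full write workload against its own storage.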
Adaptive Thread Pool
MYSQL THREAD MODEL
• Standard MySQL: one thread per connection
  • Doesn’t scale with connection count
• MySQL EE: connections assigned to thread group
  • Requires careful stall threshold tuning
AURORA THREAD MODEL
• Re-entrant connections multiplexed to active threads
• Kernel-space epoll() inserts into latch-free event queue
• Dynamically size thread pool
• Gracefully handles 5000+ concurrent client sessions on r3.8xl
[Diagram: client connections feeding epoll() and a latch-free task queue]
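The core idea on the Aurora side, many client sessions multiplexed onto a small set of worker threads through a shared task queue, can be shown with a minimal sketch. It is an illustration only: Python's `queue.Queue` uses locks rather than a latch-free structure, there is no epoll() here, the pool size is fixed rather than adaptive, and the `serve` function is hypothetical.

```python
import queue
import threading
from typing import List, Tuple


def serve(requests: List[Tuple[int, str]], workers: int = 8) -> List[Tuple[int, str]]:
    """Run requests from many sessions on a small, fixed number of threads."""
    tasks: "queue.Queue" = queue.Queue()
    results: List[Tuple[int, str]] = []
    results_lock = threading.Lock()

    def worker() -> None:
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this thread
                return
            session_id, statement = item
            # Stand-in for executing the statement on behalf of the session.
            with results_lock:
                results.append((session_id, statement.upper()))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for req in requests:
        tasks.put(req)
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return results


# 5,000 sessions served by 8 threads instead of 5,000 threads.
out = serve([(i, "select 1") for i in range(5000)], workers=8)
```

Results arrive in whatever order the workers finish, so callers that need ordering should sort by session id.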
Asynchronous Group Commits
TRADITIONAL APPROACH
• Maintain a buffer of log records to write out to disk
• Issue write when buffer full or time out waiting for writes
• First writer has latency penalty when write rate is low
AMAZON AURORA
• Request I/O with first write, fill buffer till write picked up
• Individual write durable when 4 of 6 storage nodes ACK
• Advance DB Durable point up to earliest pending ACK
[Diagram: transactions T1 to Tn issuing reads, writes, and commits over time; commit queue of pending commits (T1 to T8) in LSN order; LSN growth showing the durable LSN at the head node; group commit]
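The Aurora commit rule described here can be sketched as a small tracker: a write becomes durable once 4 of 6 storage nodes ACK it, the durable point advances only up to the earliest write still waiting for its quorum, and queued commits are released in LSN order once the durable point passes them. The `CommitTracker` class and node names are hypothetical; this is a sketch of the rule, not of Aurora's implementation.

```python
import heapq
from typing import Dict, List, Set, Tuple


class CommitTracker:
    def __init__(self, write_quorum: int = 4) -> None:
        self.write_quorum = write_quorum
        self.acks: Dict[int, Set[str]] = {}            # outstanding LSN -> nodes that ACKed
        self.commit_queue: List[Tuple[int, str]] = []  # min-heap of (commit LSN, txn)
        self.durable_lsn = 0

    def issue_write(self, lsn: int) -> None:
        self.acks[lsn] = set()

    def request_commit(self, txn: str, lsn: int) -> None:
        heapq.heappush(self.commit_queue, (lsn, txn))

    def on_ack(self, lsn: int, node: str) -> List[str]:
        """Record an ACK and return the transactions whose commits are now durable."""
        if lsn in self.acks:
            self.acks[lsn].add(node)
        # Advance the durable point, stopping at the earliest write without a quorum.
        for pending in sorted(self.acks):
            if len(self.acks[pending]) < self.write_quorum:
                break
            self.durable_lsn = pending
            del self.acks[pending]
        released = []
        while self.commit_queue and self.commit_queue[0][0] <= self.durable_lsn:
            released.append(heapq.heappop(self.commit_queue)[1])
        return released


tracker = CommitTracker()
tracker.issue_write(10)
tracker.issue_write(12)
tracker.request_commit("T1", 10)
tracker.request_commit("T2", 12)

# LSN 12 reaches its quorum first, but LSN 10 is still pending, so nothing commits.
early = [tracker.on_ack(12, n) for n in ("n1", "n2", "n3", "n4")]
partial = [tracker.on_ack(10, n) for n in ("n1", "n2", "n3")]
# The fourth ACK for LSN 10 lets the durable point jump to 12 and releases both commits.
group = tracker.on_ack(10, "n4")
```

No commit waits on a timer or a full buffer; commits wait only for the ACKs that make their LSN durable, and several can be released together.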
More Replicas
• Aurora cluster contains primary node and up to fifteen secondary nodes
• Failing database nodes are automatically detected and replaced
• Failing database processes are automatically detected and recycled
• Secondary nodes automatically promoted on persistent outage, no single point of failure
• Customer application may scale-out read traffic across secondary nodes
• Customer-specifiable fail-over order
• Read balancing across read replicas
[Diagram: primary node and secondary nodes spread across AZ 1, AZ 2, and AZ 3]
Storage Durability
• Storage volume automatically grows up to 64 TB
• Quorum system for read/write; latency tolerant
• Peer to peer gossip replication to fill in holes
• Continuous backup to S3 (built for 11 9s durability)
• Continuous monitoring of nodes and disks for repair
• 10 GB segments as unit of repair or hotspot rebalance
• Quorum membership changes do not stall writes
[Diagram: storage nodes in AZ 1, AZ 2, and AZ 3, with backup to Amazon S3]
Continuous Backup
[Diagram: timeline for Segment 1, Segment 2, and Segment 3 showing segment snapshots, log records, and the recovery point]
• Take periodic snapshot of each segment in parallel; stream the redo logs to Amazon S3
• Backup happens continuously without performance or availability impact
• At restore, retrieve the appropriate segment snapshots and log streams to storage nodes
• Apply log streams to segment snapshots in parallel and asynchronously
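The restore path in the last two bullets can be sketched as follows: each segment starts from its own snapshot and replays only the log records that fall after the snapshot and at or before the chosen recovery point, with segments processed in parallel. The data shapes (a snapshot as an LSN plus a page dictionary, log records as tuples) and function names are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Snapshot = Tuple[int, Dict[str, str]]  # (snapshot LSN, page_id -> value)
LogRecord = Tuple[int, str, str]       # (LSN, page_id, new value)


def restore_segment(snapshot: Snapshot, log: List[LogRecord], recovery_lsn: int) -> Dict[str, str]:
    """Replay one segment's log stream on top of its snapshot."""
    snapshot_lsn, pages = snapshot
    restored = dict(pages)
    for lsn, page_id, value in sorted(log):
        if snapshot_lsn < lsn <= recovery_lsn:
            restored[page_id] = value
    return restored


def restore_volume(
    snapshots: Dict[str, Snapshot],
    logs: Dict[str, List[LogRecord]],
    recovery_lsn: int,
) -> Dict[str, Dict[str, str]]:
    """Restore every segment independently and in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {
            seg: pool.submit(restore_segment, snapshots[seg], logs.get(seg, []), recovery_lsn)
            for seg in snapshots
        }
        return {seg: fut.result() for seg, fut in futures.items()}


snapshots = {"seg1": (10, {"p1": "a"}), "seg2": (20, {"p2": "x"})}
logs = {
    "seg1": [(15, "p1", "b"), (40, "p1", "c")],
    "seg2": [(18, "p2", "old"), (25, "p2", "y")],
}
volume = restore_volume(snapshots, logs, recovery_lsn=30)
```

Because segments are restored independently, snapshots taken at different times still converge on the same recovery point.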
Survivable Buffer Caches
• We moved the cache out of the database process
• Cache remains warm in the event of database restart
• Lets you resume fully loaded operations much faster
• Instant crash recovery + survivable cache = quick and easy recovery from DB failures
[Diagram: SQL, Transactions, and Caching layers, with Caching shown as a separate process]
Caching process is outside the DB process and remains warm across a database restart
Instant Crash Recovery
• Traditional Databases
  • Have to replay logs since the last checkpoint
  • Typically 5 minutes between checkpoints
  • Single-threaded in MySQL; requires a large number of disk accesses
  • Crash at T0 requires a re-application of the SQL in the redo log since last checkpoint
• Amazon Aurora
  • Underlying storage replays redo records on demand as part of a disk read
  • Parallel, distributed, asynchronous
  • No replay for startup
  • Crash at T0 will result in redo logs being applied to each segment on demand, in parallel, asynchronously
[Diagram: checkpointed data and redo log timelines up to the crash at T0]
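The Aurora behaviour described here, redo replayed on demand as part of a read rather than at startup, can be illustrated with a small sketch. The `LazySegment` class is hypothetical and reduces a page to a single value; it shows the idea only.

```python
from typing import Dict, List, Tuple


class LazySegment:
    """Toy segment that defers redo application until a page is read."""

    def __init__(self, base_pages: Dict[str, str]) -> None:
        self.pages = dict(base_pages)                        # page_id -> materialized value
        self.pending: Dict[str, List[Tuple[int, str]]] = {}  # page_id -> [(lsn, value)]

    def append_redo(self, lsn: int, page_id: str, value: str) -> None:
        """Accept a redo record without touching the page itself."""
        self.pending.setdefault(page_id, []).append((lsn, value))

    def read(self, page_id: str) -> str:
        """Apply any outstanding redo for this page in LSN order, then return it."""
        for _, value in sorted(self.pending.pop(page_id, [])):
            self.pages[page_id] = value
        return self.pages[page_id]


segment = LazySegment({"p1": "a", "p2": "x"})
segment.append_redo(7, "p1", "c")
segment.append_redo(5, "p1", "b")
# After a "crash" nothing is replayed up front; the first read of p1 does the work.
first_read = segment.read("p1")
untouched = segment.read("p2")
```

Pages that are never read after a restart never pay the replay cost, which is why startup does not have to wait on the log.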
Fast Fail-Over
MYSQL: 15–20 sec
[Timeline: DB failure, failure detection, DNS propagation, recovery, app running]
AURORA WITH MARIADB DRIVER: 3–20 sec
[Timeline: DB failure, failure detection, DNS propagation, recovery, app running]
Real-life data - fail-over time
“In RDS MySQL, it took minutes or sometimes tens of minutes to failover. It’s pretty awesome that you can failover/restart within less than a minute.”
Simulate failures using SQL
• To cause the failure of a component at the database node:
    ALTER SYSTEM CRASH [{INSTANCE | DISPATCHER | NODE}]
• To simulate the failure of disks:
    ALTER SYSTEM SIMULATE percent_failure DISK failure_type IN [DISK index | NODE index] FOR INTERVAL interval
• To simulate the failure of networking:
    ALTER SYSTEM SIMULATE percent_failure NETWORK failure_type [TO {ALL | read_replica | availability_zone}] FOR INTERVAL interval
Simplify Database Management
• Create a database in minutes
• Automated patching
• Push-button scale compute
• Continuous backups to Amazon S3
• Automatic failure detection and failover
[Logo: Amazon RDS]
Simplify Storage Management
• Read replicas are available as failover targets, with no data loss
• Instantly create user snapshots, with no performance impact
• Continuous, incremental backups to Amazon S3
• Automatic storage scaling up to 64 TB, with no performance or availability impact
• Automatic restriping, mirror repair, hot spot management, encryption
Simplify Data Security
• Encryption to secure data at rest
  • AES-256; hardware accelerated
  • All blocks on disk and in Amazon S3 are encrypted
  • Key management via AWS KMS
• SSL to secure data in transit
• Network isolation via Amazon VPC by default
• No direct access to nodes
• Supports industry standard security and data protection certifications
[Diagram: Application connected to SQL, Transactions, and Caching layers over Storage and Amazon S3]
Well established MySQL ecosystem
Business Intelligence | Data Integration | Query and Monitoring | SI and Consulting
“We ran our compatibility test suites against Amazon Aurora and everything just worked.” - Dan Jewett, Vice President of Product Management at Tableau
Source: Amazon
Monitor Aurora with Datadog
• Just add read-only AWS credentials and select the services you wish to monitor (e.g. RDS)
Simplify migration from RDS MySQL
1. Establish baseline
   a. RDS MySQL to Aurora DB snapshot migration
   b. MySQL dump/import
2. Catch-up changes
[Diagram: application users connected over the network to MySQL and Aurora]
Migration from EC2 & on-premise MySQL
• Data migration service
  • Logical data replication from on-premise or EC2
  • Code & schema conversion across engines
• S3 integration
  • Load partial datasets directly from / to S3
  • Ingest large database snapshots (>2 TB)
• Snowball integration
  • Ingest huge database snapshots (>10 TB)
  • Send us your data in a suitcase!
Migration from non-MySQL Databases
AWS Database Migration Service
• Move data to the same or different database engine
• Keep your apps running during the migration
• Start your first migration in 10 minutes or less
• Replicate within, to, or from Amazon EC2 or RDS
Beyond Benchmarks
• If only real world applications saw benchmark performance
• POSSIBLE DISTORTIONS
  • Real world requests contend with each other
  • Real world metadata rarely fits in data dictionary cache
  • Real world data rarely fits in buffer cache
  • Real world production databases need to run HA
Scaling User Connections
• SysBench OLTP Workload
• 250 tables
Connections   Amazon Aurora   RDS MySQL w/ 30K IOPS
50            40,000          10,000
500           71,000          21,000
5,000         110,000         13,000
UP TO 8x FASTER
Scaling Table Count
• SysBench write-only workload
• Measuring writes per second
• 1,000 connections
Tables   Amazon Aurora   MySQL I2.8XL local SSD   MySQL I2.8XL RAM disk   RDS MySQL w/ 30K IOPS (single AZ)
10       60,000          18,000                   22,000                  25,000
100      66,000          19,000                   24,000                  23,000
1,000    64,000          7,000                    18,000                  8,000
10,000   54,000          4,000                    8,000                   5,000
UP TO 11x FASTER
Scaling Data Set
SYSBENCH WRITE-ONLY
DB Size   Amazon Aurora   RDS MySQL w/ 30K IOPS
1 GB      107,000         8,400
10 GB     107,000         2,400
100 GB    101,000         1,500
1 TB      26,000          1,200
UP TO 67x FASTER
CLOUDHARMONY TPC-C
DB Size   Amazon Aurora   RDS MySQL w/ 30K IOPS
80 GB     12,582          585
800 GB    9,406           69
UP TO 136x FASTER
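The "up to" figures for the data set comparison follow directly from the table rows. The short sketch below recomputes the per-row ratios from the numbers on the slide; the variable names are illustrative.

```python
# (database size, Amazon Aurora, RDS MySQL w/ 30K IOPS) as listed on the slide
sysbench_rows = [
    ("1GB", 107_000, 8_400),
    ("10GB", 107_000, 2_400),
    ("100GB", 101_000, 1_500),
    ("1TB", 26_000, 1_200),
]
tpcc_rows = [
    ("80GB", 12_582, 585),
    ("800GB", 9_406, 69),
]


def speedups(rows):
    """Ratio of Aurora to RDS MySQL throughput for each database size."""
    return {size: aurora / mysql for size, aurora, mysql in rows}


sysbench = speedups(sysbench_rows)
tpcc = speedups(tpcc_rows)

for name, table in (("SysBench write-only", sysbench), ("CloudHarmony TPC-C", tpcc)):
    for size, ratio in table.items():
        print(f"{name:20s} {size:>6s}: {ratio:6.1f}x")
```

The largest SysBench ratio is the 100 GB row and the largest TPC-C ratio is the 800 GB row, which is where the 67x and 136x headlines come from.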
Scaling With Replicas
• SysBench write-only workload
• 250 tables
Updates per second   Amazon Aurora   RDS MySQL 30K IOPS (single AZ)
1,000                2.62 ms         0 s
2,000                3.42 ms         1 s
5,000                3.94 ms         60 s
10,000               5.38 ms         300 s
UP TO 500x LOWER LAG
Real-life data - read replica latency
“In RDS MySQL, we saw replica lag spike to almost 12 minutes which is almost absurd from an application’s perspective. The maximum read replica lag across 4 replicas never exceeded beyond 20 ms.”
Questions?
Thank you!
P.S. We’re hiring! Email me at: [email protected]
http://aws.amazon.com/rds/aurora