Databases Sargun Dhillon @Sargun
Jul 27, 2015
Internet Traffic vs. Penetration
[Chart: IP Traffic (PB/mo, axis 0 to 40,000) and Global Penetration (%, axis 0 to 100) by year, 2000 to 2012]
Biggest AWS Database
• vCPUs: 32
• Memory: 244 GB
• Storage: 3TB
• IOPS: 30,000
• Networking: 10 Gigabit
• Resiliency: Multi-AZ
• SLA: 99.95%
• Backend: PostgreSQL
“Average partition duration ranged from 6 minutes for software-related failures to more than 8.2 hours for
hardware-related failures (median 2.7 and 32 minutes; 95th percentile of 19.9 minutes and 3.7 days,
respectively).” -The Network is Reliable
WANs Fail
“Experience at Amazon has shown that data stores that provide
ACID guarantees tend to have poor availability.”
Dynamo: Amazon’s Highly Available Key-value Store
“A shared-data system can have at most two of the three following properties:
Consistency, Availability, and tolerance to network Partitions.”
-Dr. Eric Brewer
On Consistency
• ACID Consistency: Any transaction or operation will bring the database from one valid state to another
• CAP Consistency: All nodes see the same data at the same time (synchrony)
On Partition Tolerance
• The network will be allowed to lose arbitrarily many messages sent from one node to another.
• Database systems, in order to be useful, must communicate over the network
• Clients count
There is no such thing as a 100% reliable network:
Can’t choose CA
http://codahale.com/you-cant-sacrifice-partition-tolerance
“This is a specific form of weak consistency; the storage system
guarantees that if no new updates are made to the object,
eventually all accesses will return the last updated value.”
Definition of “Eventual Consistency” from “Eventually Consistent - Revisited” - Werner Vogels
Tunable CAP Controls
• R (Read Acks) tunable: Default Quorum
• W (Write Acks) tunable: Default Quorum
• PR (Primary Read Acks) tunable: Default 0
• PW (Primary Write Acks) tunable: Default 0
• N (replicas) tunable: Default 3
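A minimal sketch of the arithmetic behind these knobs, assuming "Quorum" means a simple majority of the N replicas. The helper names are hypothetical, and PR / PW are not modeled. It only checks whether a read set and a write set must share at least one replica; that overlap alone is not a full consistency guarantee.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas (assumed meaning of 'Quorum')."""
    return n // 2 + 1


def reads_overlap_writes(n: int, r: int, w: int) -> bool:
    """True when any R replicas and any W replicas must share a member."""
    return r + w > n


N = 3
R = W = quorum(N)  # 2 of 3 replicas

# With the defaults, every read set intersects every write set.
defaults_overlap = reads_overlap_writes(N, R, W)

# Dropping both to 1 favors availability: a read may miss the latest write.
fast_overlap = reads_overlap_writes(N, 1, 1)
```

Lowering R or W trades overlap for availability during a partition; raising them does the opposite.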
Vector Clocks
• Extension of Lamport Clocks
• Used to detect cause and effect in distributed systems
• Can determine concurrency of events, and causality violations
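A small sketch of how a vector clock can tell "happened before" from "concurrent". The dict-of-counters representation and the function names are illustrative assumptions, not any particular database's API.

```python
from typing import Dict

VClock = Dict[str, int]  # node name -> number of events seen from that node


def increment(clock: VClock, node: str) -> VClock:
    """Return a copy of clock with node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(a: VClock, b: VClock) -> VClock:
    """Element-wise maximum: the clock that has seen everything a and b saw."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def descends(a: VClock, b: VClock) -> bool:
    """True when a has seen every event that b has seen."""
    return all(a.get(k, 0) >= v for k, v in b.items())


def compare(a: VClock, b: VClock) -> str:
    """Classify a relative to b: equal, after, before, or concurrent."""
    a_ge, b_ge = descends(a, b), descends(b, a)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "after"
    if b_ge:
        return "before"
    return "concurrent"


base = increment({}, "x")
left = increment(base, "y")   # derived from base on node y
right = increment(base, "z")  # derived from base on node z, unaware of left
```

Here `left` and `right` both descend from `base` but neither descends from the other, so `compare(left, right)` reports them as concurrent, which is the signal that a conflict needs resolving.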
• CRDTs:
• Convergent Replicated Data Types
• Commutative Replicated Data Types
• Enable data structures to remain writable on both sides of a partition, and to replay updates after the partition heals
• Enable distributed computation across monotonic functions
• Two Types:
• CvRDTs
• CmRDTs
CRDTs
CmRDTs
• Op / method based CRDTs
• Size grows monotonically
• Uses version vectors to determine order of operations
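A hedged sketch of an op-based counter. It uses per-replica sequence numbers as a simplified stand-in for version vectors, and the class and field names are hypothetical. The `seen` set only ever grows, which illustrates the size concern above.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass(frozen=True)
class Op:
    replica: str  # replica that originated the operation
    seq: int      # per-replica sequence number (stand-in for a version vector)
    delta: int    # amount to add; may be negative


@dataclass
class OpCounter:
    name: str
    value: int = 0
    next_seq: int = 0
    seen: Set[Tuple[str, int]] = field(default_factory=set)

    def local(self, delta: int) -> Op:
        """Apply an update locally and return the op to ship to peers."""
        self.next_seq += 1
        op = Op(self.name, self.next_seq, delta)
        self.apply(op)
        return op

    def apply(self, op: Op) -> None:
        """Apply an op once; redelivered ops are ignored."""
        key = (op.replica, op.seq)
        if key in self.seen:
            return
        self.seen.add(key)
        self.value += op.delta


a, b = OpCounter("a"), OpCounter("b")
op_a = a.local(5)    # written on one side of a partition
op_b = b.local(-2)   # written on the other side
b.apply(op_a)        # replay after the partition heals
a.apply(op_b)
a.apply(op_b)        # duplicate delivery has no effect
```

Because addition commutes, the two replicas reach the same value regardless of delivery order; deduplication is what makes redelivery safe.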
CRDTs in the Wild
• Sets
• Observed-remove sets
• Grow-only sets
• Counters
• Grow-only counters
• PN-Counters
• Flags
• Maps
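To make the counter entries above concrete, here is a minimal state-based sketch: a grow-only counter keeps one slot per replica and merges by element-wise maximum, and a PN-Counter pairs two of them. Class and method names are assumptions for illustration.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: one non-decreasing slot per replica."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def increment(self, replica: str, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[replica] = self.counts.get(replica, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> "GCounter":
        merged = GCounter()
        for k in self.counts.keys() | other.counts.keys():
            merged.counts[k] = max(self.counts.get(k, 0), other.counts.get(k, 0))
        return merged


class PNCounter:
    """Counter supporting decrements: increments minus decrements."""

    def __init__(self) -> None:
        self.p = GCounter()  # total increments
        self.n = GCounter()  # total decrements

    def increment(self, replica: str, amount: int = 1) -> None:
        self.p.increment(replica, amount)

    def decrement(self, replica: str, amount: int = 1) -> None:
        self.n.increment(replica, amount)

    def value(self) -> int:
        return self.p.value() - self.n.value()

    def merge(self, other: "PNCounter") -> "PNCounter":
        merged = PNCounter()
        merged.p = self.p.merge(other.p)
        merged.n = self.n.merge(other.n)
        return merged


left, right = PNCounter(), PNCounter()
left.increment("left", 5)    # update on one side of a partition
right.decrement("right", 2)  # update on the other side
healed = left.merge(right)
```

Merge is commutative, associative, and idempotent, so replicas can exchange state in any order and any number of times.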
Data structures that are CRDTs
• Probabilistic, convergent data structures
• HyperLogLog
• Bloom filter
• Co-recursive folding functions
• Maximum-counter
• Running Average
• Operational Transform
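As an illustration of why some of these structures converge, the sketch below shows two merges that are commutative, associative, and idempotent: a maximum-counter merges with `max`, and a Bloom filter merges with bitwise OR. The filter size, hash scheme, and names are arbitrary assumptions.

```python
import hashlib
from typing import Iterator


def merge_max(a: int, b: int) -> int:
    """Maximum-counter merge: order and repetition do not matter."""
    return max(a, b)


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size: int = 256, hashes: int = 3, bits: int = 0) -> None:
        self.size = size
        self.hashes = hashes
        self.bits = bits

    def _positions(self, item: str) -> Iterator[int]:
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other: "BloomFilter") -> "BloomFilter":
        # Only meaningful when both filters share size and hash count.
        return BloomFilter(self.size, self.hashes, self.bits | other.bits)


left, right = BloomFilter(), BloomFilter()
left.add("ad-1")
right.add("ad-2")
merged = left.merge(right)
```

Neither structure supports removal in this form; they only grow, which is what makes the merge safe to repeat.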
CRDTs
• Incredibly powerful primitive
• Useful not only for in-database manipulation but also for client-database interaction
• You can compose them, and build your own
• Garbage collection is tricky
Operations Requiring Weak Consistency vs. Strong Consistency

Invariant            Operation    AP / CP
Specify unique ID    Any          CP
Generate unique ID   Any          AP
>                    INCREMENT    AP
>                    DECREMENT    CP
<                    INCREMENT    CP
<                    DECREMENT    AP
Secondary Index      Any          AP
Materialized View    Any          AP
AUTO_INCREMENT       INSERT       CP
Linearizability      CAS          CP
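A small worked example of the ">" rows, under the assumption that the invariant is "value > floor" and that two replicas apply updates independently during a partition. The function name and the numbers are made up for illustration.

```python
def holds(value: int, floor: int = 0) -> bool:
    """The invariant under test: value must stay strictly above floor."""
    return value > floor


initial = 2

# DECREMENT: each replica checks the invariant against its own local view.
local_a = initial - 1
local_b = initial - 1
a_ok = holds(local_a)  # each side believes its decrement is safe
b_ok = holds(local_b)

# After the partition heals, both decrements are applied to the shared value.
merged_after_decrements = initial - 1 - 1
decrements_safe = holds(merged_after_decrements)

# INCREMENT: concurrent increments can only move the value further above floor.
merged_after_increments = initial + 1 + 1
increments_safe = holds(merged_after_increments)
```

Both decrements pass their local check, yet the merged value breaks the invariant, so decrements against a lower bound need coordination (CP). Increments cannot break it, so they can stay AP.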
BASE not ACID
• Basically Available: There will be a response per request (failure, or success)
• Soft State: Any two reads against the system may yield different data (when measured against time)
• Eventually Consistent: The system will eventually become consistent when all failures have healed, and time goes to infinity
AWS Deployment
• 6 x i2.4xlarge
• 732GB of RAM
• 19TB of storage
• 960,000 IOPS
• 96 vCPUs
• 3 x Replication
• 10 Gigabit networking
• 99.9999999997% availability
Test Model
• 50 actors
• 5 ads, each with inventory between 1,000 and 1,200
• Each actor gets a random number of choices in [1, 3] per round
• Rounds continue until entire inventory is exhausted
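A rough simulation of this model, offered as a sketch only. The slides do not say how actors pick ads or how stale their view is, so two assumptions are made here: actors choose uniformly among ads that looked available at the start of the round, and that start-of-round snapshot stands in for an eventually consistent read. All names are hypothetical.

```python
import random
from typing import List


def simulate(seed: int = 0, actors: int = 50, ads: int = 5,
             max_picks: int = 3) -> List[List[int]]:
    """Return outstanding impressions per ad after each round."""
    rng = random.Random(seed)
    inventory = [rng.randint(1000, 1200) for _ in range(ads)]
    history = [list(inventory)]
    while any(left > 0 for left in inventory):
        # Assumption: every actor sees the same stale snapshot for the round.
        snapshot = list(inventory)
        available = [i for i, left in enumerate(snapshot) if left > 0]
        for _ in range(actors):
            for _ in range(rng.randint(1, max_picks)):
                ad = rng.choice(available)
                inventory[ad] -= 1  # may push the true count below zero
        history.append(list(inventory))
    return history


history = simulate()
final = history[-1]
rounds = len(history) - 1
```

Because actors act on a stale view, an ad can be served past its inventory within a round, so final outstanding impressions may end at or below zero. Under these assumptions the overshoot is bounded by the number of picks in one round.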
Test Model
[Chart: Outstanding Impressions (axis -300 to 1200) vs. Round Number (1 to 76), one series each for Ad 1 through Ad 5]
Garbage Collection
• Utilizes secondary indexes in a batch process to delete exhausted ads from user records
Ad Serving
• Requires batch generation of targets
• Requires external GC
• Allows for multi-datacenter operation