Dynamo: Amazon's Highly Available Key-value Store



Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels

Presentation by Jakub Bartodziej

Department of Mathematics, Computer Science and Mechanics, University of Warsaw

Distributed Systems, 2011


Outline

1 Introduction
2 Background
    System Assumptions and Requirements
    Service Level Agreements (SLA)
    Design Considerations
3 Related Work
4 System Architecture (core distributed systems techniques)
    System Interface
    Partitioning Algorithm
    Replication
    Data Versioning
    Execution of get() and put() operations
    Handling Failures: Hinted Handoff
    Handling Permanent Failures: Replica Synchronization
    Membership and Failure Detection
    Adding/Removing Storage Nodes
5 Implementation
6 Experiences & Lessons Learned
    Balancing Performance and Durability
    Ensuring Uniform Load Distribution
    Divergent Versions: When and How Many
    Client-driven or Server-driven Coordination
    Balancing Background vs. Foreground Tasks


Amazon

world-wide e-commerce platform

tens of millions of customers

service-oriented architecture

requirements:
  performance
  reliability
  efficiency
  scalability

a number of storage technologies, one of which is Dynamo

Dynamo
  underlying storage technology for a number of the core services in Amazon's e-commerce platform
  was able to scale to extreme peak loads efficiently without any downtime


services store their state in a database, traditionally relational databases

relational databases:
  excess functionality
  inefficient
  require expensive hardware and highly skilled personnel for operation
  consistency over availability

enter Dynamo:
  highly available
  key/value
  simple scale-out scheme
  each service runs its own instances


Query Model
  simple read and write operations
  access by primary key
  binary objects (usually < 1 MB)

ACID (Atomicity, Consistency, Isolation, Durability)
  causes poor availability
  availability over consistency
  no isolation guarantees
  only single-key updates

Efficiency
  latency requirements measured at the 99.9th percentile
  tradeoffs are in performance, cost efficiency, availability, and durability guarantees

Other Assumptions
  non-hostile environment (no authentication or authorization)
  scale up to hundreds of hosts


SLAs

guarantee that the application can deliver its functionality in a bounded time

a page request to one of the e-commerce sites typically requires the rendering engine to construct its response by sending requests to over 150 services

it is not uncommon for the call graph of an application to have more than one level

example: the service will provide a response within 300 ms for 99.9% of its requests for a peak client load of 500 requests per second (a compliance check for such an SLA is sketched below)

storage systems play an important role; Dynamo aims to:
  give services control over system properties
  let services make their own tradeoffs between functionality, performance and cost-effectiveness
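As a minimal illustration of the 99.9th-percentile measurement, the sketch below checks a window of recorded latencies against the 300 ms example target using the nearest-rank percentile rule. The sample data and function names are invented for illustration; only the target and the percentile come from the slide.

    import math

    def percentile(samples, p):
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
        ordered = sorted(samples)
        rank = math.ceil(p / 100.0 * len(ordered))
        return ordered[max(rank, 1) - 1]

    def meets_sla(latencies_ms, target_ms=300.0, p=99.9):
        # SLA holds if at least p% of requests completed within target_ms.
        return percentile(latencies_ms, p) <= target_ms

    window = [12.0] * 9990 + [450.0] * 10      # 0.1% of requests are slow
    print(meets_sla(window))                   # True: 99.9% finished within 300 ms

    window = [12.0] * 9989 + [450.0] * 11      # slightly more slow requests
    print(meets_sla(window))                   # False: the 99.9th percentile is now 450 ms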


Figure: Service-oriented architecture of Amazon's platform


when dealing with the possibility of network failures, strong consistency and high data availability cannot be achieved simultaneously

availability can be increased by using optimistic replication techniques (changes are allowed to propagate to replicas in the background)

when to resolve conflicts?
  Dynamo is designed to be "always writeable" (e.g. for the shopping cart)
  conflicts are resolved during reads

who resolves them?
  data store - "last write wins"
  application - complex logic

other key principles
  Incremental scalability - one host ("node") at a time
  Symmetry - nodes have the same responsibilities
  Decentralization - favor peer-to-peer control techniques
  Heterogeneity - differences in infrastructure, e.g. the capacity of the nodes


Work in peer-to-peer systems, distributed file systems and databases.

Dynamo has to be always writeable.

No need for hierarchical namespaces or a relational schema.

Multi-hop routing is unacceptable (Dynamo is a zero-hop DHT).


get(key) : (context, value)

put(key, context, value)

context:
  opaque to the caller
  encodes metadata such as the version of the object
  is stored along with the object, so that the system can verify its validity

MD5 hash on the key yields a 128-bit identifier (a toy version of the interface is sketched below)
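A single-process stand-in for this interface, assuming the signatures on the slide; the class name and storage layout are invented, and only get/put, the opaque context and the MD5 key hash come from the slide.

    import hashlib

    def key_id(key: str) -> int:
        # MD5 on the key yields a 128-bit identifier.
        return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

    class DynamoLikeStore:
        # In-memory stand-in; the real system spreads this map over many nodes.
        def __init__(self):
            self._data = {}   # key_id -> (context, value)

        def get(self, key):
            # Returns (context, value); context is opaque version metadata.
            return self._data.get(key_id(key), (None, None))

        def put(self, key, context, value):
            # context tells the system which version the client is updating.
            self._data[key_id(key)] = (context, value)

    store = DynamoLikeStore()
    store.put("cart:alice", context=None, value=b"...")
    context, value = store.get("cart:alice")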


consistent hashing
  the output range of the hash function is treated as a fixed circular space
  each node is assigned a random value (position on the ring)
  data items are assigned to nodes by hashing the key - an item is assigned to the next node clockwise
  each node becomes responsible for the region between it and its predecessor
  departure and arrival of a node only affect its immediate neighbors

challenges to basic consistent hashing
  the random position assignment of each node on the ring leads to non-uniform data and load distribution
  the basic algorithm is oblivious to the heterogeneity in the performance of nodes

solution: virtual nodes - each node gets multiple points in the ring (sketched below)
  if a node becomes unavailable, its load is evenly dispersed across the remaining nodes
  if a node becomes available again, it accepts a roughly equivalent amount of load from each of the other nodes
  the number of virtual nodes can be based on capacity - accounts for heterogeneity in the physical infrastructure
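A minimal sketch of consistent hashing with virtual nodes under the description above; HashRing and tokens_per_node are invented names, and MD5 is reused as the ring hash to match the interface slide.

    import bisect
    import hashlib

    def ring_hash(s: str) -> int:
        # Position on the fixed circular space (128-bit MD5 output range).
        return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

    class HashRing:
        def __init__(self, nodes, tokens_per_node=8):
            # Each physical node takes several positions ("virtual nodes") on the ring.
            self._ring = sorted(
                (ring_hash(f"{node}#{i}"), node)
                for node in nodes
                for i in range(tokens_per_node)
            )
            self._positions = [pos for pos, _ in self._ring]

        def successor(self, key: str):
            # An item is assigned to the first node clockwise from its key's position.
            i = bisect.bisect_right(self._positions, ring_hash(key)) % len(self._ring)
            return self._ring[i][1]

    ring = HashRing(["node-A", "node-B", "node-C"])
    print(ring.successor("cart:alice"))

Because each host owns many small ranges, removing one host reassigns only those ranges, spreading its load across the survivors rather than dumping it on a single neighbor.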


Figure: Partitioning and replication of keys in the Dynamo ring


each data item is replicated at N hosts; the "coordinator" node replicates the keys at its N − 1 clockwise successors

each node is responsible for N preceding ranges

a list of nodes responsible for storing a particular key is called the "preference list" (its construction is sketched below)
  every node can reconstruct the preference list (explained later)
  it contains more than N nodes to account for node failures
  it contains physical, as opposed to virtual, nodes
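A sketch of deriving a preference list from a ring with virtual nodes: walk clockwise from the key's position and collect N distinct physical nodes, so that multiple virtual nodes of the same host count only once. The ring layout and numbers are illustrative, not Dynamo's actual token assignment.

    import bisect

    def preference_list(ring, key_pos, n):
        # ring: sorted (token, physical_node) pairs on a circular token space.
        positions = [token for token, _ in ring]
        start = bisect.bisect_right(positions, key_pos)
        nodes, i = [], start
        while len(nodes) < n and i < start + len(ring):
            node = ring[i % len(ring)][1]
            if node not in nodes:       # skip further virtual nodes of the same host
                nodes.append(node)
            i += 1
        return nodes

    # Toy ring: hosts A, B, C with two tokens each on a 0..99 circle.
    ring = [(5, "A"), (20, "B"), (33, "C"), (51, "A"), (64, "C"), (80, "B")]
    print(preference_list(ring, key_pos=26, n=3))   # ['C', 'A', 'B']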


eventual consistency allows updates to be propagated to all replicas asynchronously; however, under certain failure scenarios, updates may not arrive at all replicas for an extended period of time

some applications in Amazon's platform can tolerate such inconsistencies (e.g. the shopping cart)

Dynamo treats the result of each modification as a new and immutable version of the data. The versions form a DAG.

in case of a causal relation, the data store can choose the most recent version (syntactic reconciliation)

in case of divergent branches, the client must collapse them in a put() operation (semantic reconciliation)

A typical example of a collapse operation is "merging" different versions of a customer's shopping cart (sketched below). Using this reconciliation mechanism, an "add to cart" operation is never lost. However, deleted items can resurface.
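A toy version of the semantic reconciliation described above, assuming a cart is stored as a set of item ids (an invented representation): merging takes the union, so every add survives, while a delete applied on only one branch is undone if another branch still carries the item.

    def merge_carts(versions):
        # Union of all divergent versions: no "add to cart" is ever lost.
        merged = set()
        for cart in versions:
            merged |= cart
        return merged

    branch_a = {"book", "lamp"}            # this replica saw "pen" deleted
    branch_b = {"book", "pen", "mug"}      # this one saw "mug" added, not the delete
    print(merge_carts([branch_a, branch_b]))   # 'pen' resurfaces in the merged cart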


Vector clocks

In Dynamo, when a client wishes to update an object, it must specify which version it is updating, by passing the context.

The context contains a vector clock, storing information about the object version.

A vector clock is effectively a list of (node, counter) pairs.

The coordinator node increments its own counter in the vector clock when handling a save request.

If the counters in the first object's clock are less than or equal to the counters of all the corresponding nodes in the second clock, then the first is an ancestor of the second and can be forgotten.

Otherwise, the two changes are considered to be in conflict and require reconciliation (this ancestry test is sketched below).

Clock truncation scheme: Along with each (node, counter) pair, Dynamo stores a timestamp that indicates the last time the node updated the data item. When the number of (node, counter) pairs in the vector clock reaches a threshold (say 10), the oldest pair is removed from the clock.
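A minimal sketch of the ancestry test above, with a clock represented as a dict from node id to counter and a missing entry read as zero; the function names are invented, and the node ids echo the Sx/Sy/Sz example from the paper's version-evolution figure.

    def descends(a: dict, b: dict) -> bool:
        # True if every counter in a is <= the matching counter in b:
        # a is then an ancestor of b and can be forgotten.
        return all(counter <= b.get(node, 0) for node, counter in a.items())

    def increment(clock: dict, node: str) -> dict:
        # The coordinator bumps its own counter when it handles a write.
        out = dict(clock)
        out[node] = out.get(node, 0) + 1
        return out

    v1 = increment({}, "Sx")        # {'Sx': 1}
    v2 = increment(v1, "Sx")        # {'Sx': 2}: descends from v1
    v3 = increment(v1, "Sy")        # {'Sx': 1, 'Sy': 1}
    v4 = increment(v1, "Sz")        # {'Sx': 1, 'Sz': 1}

    print(descends(v1, v2))                          # True: v1 can be forgotten
    print(descends(v3, v4) or descends(v4, v3))      # False: conflict, reconcile on read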


Figure: Version evolution of an object over time


Any storage node in Dynamo is eligible to receive client get and put operations for any key.

get and put operations are invoked over HTTP. A client can:
  route its request through a generic load balancer
  use a partition-aware client library that routes requests directly to the appropriate coordinator nodes

Typically, the coordinator is the first among the top N nodes in the preference list.

The operation is performed on the top N healthy nodes in the preference list, under a quorum-like consistency protocol: a read (write) is successful if at least R (W) nodes participate in it. Setting R + W > N yields a quorum-like system (sketched below).

Latency is dictated by the slowest of the replicas involved, so R and W are often configured to be less than N.

a put() operation updates the vector clock, saves the new version locally and sends it to the remaining N − 1 nodes

a get() operation queries all N nodes, waits for R responses, and performs syntactic reconciliation
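A schematic of the quorum rule, assuming synchronous replica calls and no failure handling; the Replica class, the method names and the N, R, W values are illustrative.

    N, R, W = 3, 2, 2    # R + W > N: read and write sets must overlap

    class Replica:
        def __init__(self):
            self._db = {}
        def store(self, key, version):
            self._db[key] = version
            return True                  # acknowledgement
        def load(self, key):
            return self._db.get(key)

    def coordinate_put(replicas, key, version):
        # Write succeeds once at least W of the N nodes acknowledge.
        acks = sum(1 for r in replicas[:N] if r.store(key, version))
        return acks >= W

    def coordinate_get(replicas, key):
        # Read succeeds once at least R of the N nodes respond;
        # divergent versions would then be reconciled.
        responses = [r.load(key) for r in replicas[:N]]
        responses = [v for v in responses if v is not None]
        return responses if len(responses) >= R else None

    replicas = [Replica() for _ in range(N)]
    coordinate_put(replicas, "k", ("context", "v1"))
    print(coordinate_get(replicas, "k"))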


a traditional (strict) quorum would sacrifice availability and durability under failures and network partitions

"sloppy quorum": all read and write operations are performed on the first N healthy nodes from the preference list

if a node is temporarily down or unreachable during a write operation, then a replica that would normally have lived on it will now be sent to the next healthy node after the top N in the preference list

The replica will have a hint in its metadata that suggests which node was the intended recipient of the replica. Nodes that receive hinted replicas keep them in a separate local database that is scanned periodically. Upon detecting that the original target has recovered, the node will attempt to deliver the replica to the original target. Once the transfer succeeds, the replica may be removed (the flow is sketched below).

Dynamo is configured such that each object is replicated across multiple data centers.
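A toy sketch of the flow above: a write meant for an unreachable node is parked on a stand-by node together with a hint naming the intended recipient, and a periodic scan delivers it once the target recovers. All class and method names are invented; the real system keeps hinted replicas in a separate local database.

    class Node:
        def __init__(self, name):
            self.name = name
            self.alive = True
            self.db = {}        # regular replicas
            self.hinted = []    # (intended_node, key, value), kept separately

        def write_with_handoff(self, key, value, intended):
            if intended.alive:
                intended.db[key] = value
            else:
                # Park the replica locally, hinting at the intended recipient.
                self.hinted.append((intended, key, value))

        def scan_hinted(self):
            # Periodic scan: deliver parked replicas whose target has recovered.
            remaining = []
            for intended, key, value in self.hinted:
                if intended.alive:
                    intended.db[key] = value    # deliver...
                else:
                    remaining.append((intended, key, value))
            self.hinted = remaining             # ...and drop delivered hints

    a, d = Node("A"), Node("D")
    a.alive = False
    d.write_with_handoff("k", "v", intended=a)   # A is down: D holds a hinted replica
    a.alive = True
    d.scan_hinted()                              # A recovered: replica handed off
    print(a.db)                                  # {'k': 'v'}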

Handling Permanent Failures: Replica Synchronization

Hinted handoff works best if the system membership churn is low and node failures are transient. To detect inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees.

Leaves are hashes of the values of individual keys.
Parent nodes higher in the tree are hashes of their respective children.
Each branch of the tree can be checked independently, without requiring nodes to download the entire tree or the entire data set.
Merkle trees thus reduce the amount of data that needs to be transferred while checking for inconsistencies among replicas.

Each node maintains a separate Merkle tree for each key range it hosts.
This allows nodes to compare whether the keys within a key range are up-to-date; by tree traversal the data can be synchronized efficiently.

Disadvantage: many key ranges change when a node joins or leaves the system, requiring the trees to be recalculated.
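
A minimal sketch of the idea, assuming each key range is pre-bucketed into a power-of-two number of leaf hashes; the bucketing and all names are hypothetical.

import java.util.*;

// Sketch of Merkle-tree anti-entropy over one key range.
public class MerkleSync {
    // Build a complete binary hash tree over the leaf hashes.
    // tree[1] is the root; children of node i are 2i and 2i+1.
    static int[] build(int[] leafHashes) {
        int n = leafHashes.length;                    // assumed a power of two
        int[] tree = new int[2 * n];
        System.arraycopy(leafHashes, 0, tree, n, n);
        for (int i = n - 1; i >= 1; i--)
            tree[i] = Objects.hash(tree[2 * i], tree[2 * i + 1]);
        return tree;
    }

    // Compare two trees, descending only into branches whose hashes differ;
    // collects the leaf buckets that actually need synchronization.
    static void diff(int[] a, int[] b, int node, int n, List<Integer> out) {
        if (a[node] == b[node]) return;               // identical branch: skip it
        if (node >= n) { out.add(node - n); return; } // differing leaf bucket
        diff(a, b, 2 * node, n, out);
        diff(a, b, 2 * node + 1, n, out);
    }

    public static void main(String[] args) {
        int[] replicaA = {11, 22, 33, 44};            // per-bucket hashes of a key range
        int[] replicaB = {11, 22, 99, 44};            // bucket 2 has diverged
        List<Integer> stale = new ArrayList<>();
        diff(build(replicaA), build(replicaB), 1, 4, stale);
        System.out.println(stale);                    // prints [2]
    }
}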

Membership and Failure Detection

Ring Membership

An explicit mechanism initiates the addition and removal of nodes.

Each node keeps membership information locally.

Membership changes form a history.

The administrator issues a membership change to a single node.

The nodes propagate the information using a gossip-based protocol.

As a result, each storage node is aware of the token ranges handled by its peers.

When a node starts for the first time, it chooses its set of tokens and participates in the gossip-based protocol.
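
A minimal sketch of one gossip exchange, assuming versioned membership entries where the newest change wins; Entry and the merge rule are hypothetical simplifications of Dynamo's persisted change history.

import java.util.*;

// Sketch of gossip-based membership propagation.
public class GossipMembership {
    static class Entry {
        final long version; final String status;
        Entry(long version, String status) { this.version = version; this.status = status; }
        public String toString() { return status + "@v" + version; }
    }

    // Merge a peer's view into ours: for each node, keep the newest change.
    static void merge(Map<String, Entry> mine, Map<String, Entry> theirs) {
        theirs.forEach((node, e) ->
            mine.merge(node, e, (x, y) -> x.version >= y.version ? x : y));
    }

    public static void main(String[] args) {
        Map<String, Entry> a = new HashMap<>(Map.of(
            "nodeA", new Entry(1, "joined"), "nodeB", new Entry(1, "joined")));
        Map<String, Entry> b = new HashMap<>(Map.of(
            "nodeB", new Entry(2, "removed"), "nodeC", new Entry(1, "joined")));
        merge(a, b);           // a random peer is contacted periodically
        System.out.println(a); // a now knows of B's removal and C's arrival
    }
}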


External Discovery

The gossip-based mechanism can lead to a logically partitioned ring.

To prevent this, some nodes play the role of seeds.

Seeds are discovered externally (e.g. via static configuration or a configuration service).

Every node eventually reconciles its membership with a seed, which allows the information to propagate even in a partitioned system.


Failure Detection

Failure detection in Dynamo is used to avoid attempts to communicate with unreachable peers during get() and put() operations and when transferring partitions and hinted replicas.

A purely local notion of failure detection is entirely sufficient:
node A quickly discovers that node B is unresponsive when B fails to respond to a message;
node A then uses alternate nodes to service requests that map to B's partitions;
A periodically retries B to check for the latter's recovery.
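
A minimal sketch of such a detector, assuming the messaging layer reports timeouts and responses; the API shown is hypothetical.

import java.util.*;

// Sketch of a purely local failure detector.
public class LocalFailureDetector {
    private final Set<String> suspected = new HashSet<>();

    // Called by the messaging layer after each request to a peer.
    void onResponse(String peer) { suspected.remove(peer); }  // peer is alive again
    void onTimeout(String peer)  { suspected.add(peer); }     // route around it

    // Used when choosing replicas: skip peers currently considered dead;
    // a background task periodically retries suspected peers (not shown).
    boolean isHealthy(String peer) { return !suspected.contains(peer); }

    public static void main(String[] args) {
        LocalFailureDetector fd = new LocalFailureDetector();
        fd.onTimeout("nodeB");                     // B failed to respond to a message
        System.out.println(fd.isHealthy("nodeB")); // false: use an alternate node
        fd.onResponse("nodeB");                    // a periodic retry succeeded
        System.out.println(fd.isHealthy("nodeB")); // true
    }
}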

Adding/Removing Storage Nodes

When a new node (say X) is added to the system, it gets assigned a number of tokens that are randomly scattered on the ring.

For every key range that is assigned to node X, there may be a number of nodes (less than or equal to N) that are currently in charge of handling keys that fall within its token range.

Due to the allocation of key ranges to X, some existing nodes no longer have to store some of their keys, and these nodes transfer those keys to X.

When a node is removed from the system, the reallocation of keys happens as the reverse process.

Operational experience has shown that this approach distributes the load of key redistribution uniformly across the storage nodes.

By adding a confirmation round between the source and the destination, it is made sure that the destination node does not receive duplicate transfers for a given key range.
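
A minimal sketch of the bookkeeping on a join, assuming a toy ring of integer tokens; the model is hypothetical and the actual key transfer and confirmation round are elided.

import java.util.*;

// Sketch of key-range reallocation when a new node X joins the ring.
public class NodeJoin {
    // The ring maps each token to the node owning the range that ends at it.
    static TreeMap<Integer, String> ring = new TreeMap<>();

    static String successor(int token) {           // clockwise owner of a token
        Map.Entry<Integer, String> e = ring.ceilingEntry(token);
        return e != null ? e.getValue() : ring.firstEntry().getValue();
    }

    public static void main(String[] args) {
        ring.put(100, "A"); ring.put(200, "B"); ring.put(300, "C");
        int newToken = 150;                         // X picks a random token
        String donor = successor(newToken);         // B currently owns (100,150]
        ring.put(newToken, "X");
        // The donor now transfers the keys in (100,150] to X; a node removal
        // would run the same reallocation in reverse.
        System.out.println("X takes range (100,150] from " + donor);
    }
}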

Implementation

Three main software components:
request coordination,
membership and failure detection,
a local persistence engine.

implemented in Java :)

Different storage engines: Berkeley Database (BDB) Transactional Data Store, BDB Java Edition, MySQL, and an in-memory buffer with a persistent backing store.
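
A minimal sketch of how such pluggability might look, assuming a hypothetical LocalStore interface; the real component boundaries are not published in this form.

import java.util.*;

// Sketch of a pluggable local persistence component; engines are
// interchangeable behind one interface and chosen per application's
// object size and access patterns.
interface LocalStore {
    void put(String key, byte[] value);
    byte[] get(String key);
}

// The simplest engine: an in-memory map (a stand-in for BDB or MySQL).
class InMemoryStore implements LocalStore {
    private final Map<String, byte[]> map = new HashMap<>();
    public void put(String key, byte[] value) { map.put(key, value); }
    public byte[] get(String key)             { return map.get(key); }
}

public class StorageEngineDemo {
    public static void main(String[] args) {
        LocalStore store = new InMemoryStore();
        store.put("k", "v".getBytes());
        System.out.println(new String(store.get("k"))); // prints v
    }
}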

Experiences & Lessons Learned

Main patterns in which Dynamo is used:
Business logic specific reconciliation - the client application performs its own reconciliation logic, e.g. the shopping cart.
Timestamp based reconciliation - Dynamo performs simple timestamp based (“last write wins”) reconciliation, e.g. for a customer's session information.
High performance read engine - services with a high read request rate and only a small number of updates; in this configuration, typically R is set to 1 and W to N, e.g. the product catalog and promotional items.

The common (N, R, W) configuration used by several instances of Dynamo is (3, 2, 2). These values are chosen to meet the necessary levels of performance, durability, consistency, and availability SLAs.

Balancing Performance and Durability

Dynamo provides the ability to trade off durability guarantees for performance.

In this optimization, each storage node maintains an object buffer in its main memory. Each write operation is stored in the buffer and gets periodically written to storage by a writer thread. Read operations first check whether the requested key is present in the buffer; if so, the object is read from the buffer instead of the storage engine.

To reduce the durability risk, the write operation is refined to have the coordinator choose one out of the N replicas to perform a “durable write”. Since the coordinator waits only for W responses, the performance of the write operation is not affected by the performance of the durable write performed by a single replica.
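
A minimal sketch of a buffered node, assuming a scheduled flusher thread stands in for the writer thread; the names, timings, and structure are hypothetical.

import java.util.concurrent.*;
import java.util.*;

// Sketch of buffered writes on one storage node: writes land in an
// in-memory buffer that a writer thread flushes periodically; reads
// check the buffer before the storage engine.
public class BufferedStore {
    private final Map<String, String> buffer = new ConcurrentHashMap<>();
    private final Map<String, String> durable = new ConcurrentHashMap<>();

    BufferedStore() {
        ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
        flusher.scheduleAtFixedRate(() ->          // the writer thread
            buffer.forEach((k, v) -> { durable.put(k, v); buffer.remove(k, v); }),
            100, 100, TimeUnit.MILLISECONDS);
    }

    // Fast path: acknowledge once the object is buffered. The coordinator
    // additionally designates one of the N replicas to write durably at once.
    void put(String key, String value, boolean durableWrite) {
        if (durableWrite) durable.put(key, value); // the designated durable replica
        else buffer.put(key, value);
    }

    String get(String key) {                       // buffer first, then storage
        String v = buffer.get(key);
        return v != null ? v : durable.get(key);
    }

    public static void main(String[] args) throws InterruptedException {
        BufferedStore node = new BufferedStore();
        node.put("k", "v", false);
        System.out.println(node.get("k"));         // served from the buffer
        Thread.sleep(300);                         // flushed by the writer thread
        System.out.println(node.get("k"));         // now served from storage
        System.exit(0);                            // stop the flusher thread
    }
}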


Figure: average and 99.9th percentile latencies for read and write requests during the peak request season of December 2006.


Figure: comparison of 99.9th percentile latencies for buffered vs. non-buffered writes over a period of 24 hours (1-hour ticks).

Ensuring Uniform Load Distribution

Figure: fraction of nodes that are out-of-balance and their corresponding request load (30-minute ticks).


Imbalance ratio decreases with increasing load.

Intuitively, this can be explained by the fact that under high loads a large number of popular keys are accessed, and due to the uniform distribution of keys the load is evenly distributed. During low loads (where load is 1/8th of the measured peak load), however, fewer popular keys are accessed, resulting in a higher load imbalance.


Partitioning Strategies

Strategy 1: T random tokens per node, partition by token value. Each node is assigned T tokens chosen uniformly at random from the hash space. The tokens of all nodes are ordered according to their values in the hash space; every two consecutive tokens define a range.

Strategy 2: T random tokens per node, equal-sized partitions. The hash space is divided into Q equally sized partitions/ranges and each node is assigned T random tokens. Q is usually set such that Q >> N and Q >> S*T, where S is the number of nodes in the system. A partition is placed on the first N unique nodes that are encountered while walking the consistent hashing ring clockwise from the end of the partition.

Strategy 3: Q/S tokens per node, equal-sized partitions. This strategy divides the hash space into Q equally sized partitions, and the placement of partitions is decoupled from the partitioning scheme. Each node is assigned Q/S tokens, where S is the number of nodes in the system. When a node leaves the system, its tokens are randomly distributed to the remaining nodes such that these properties are preserved; similarly, when a node joins the system it "steals" tokens from nodes in the system in a way that preserves these properties.
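
A minimal sketch of strategy 3's placement, assuming a toy round-robin assignment of the Q partitions to nodes; the real scheme steals and redistributes tokens randomly, which this sketch does not attempt.

import java.util.*;

// Sketch of strategy 3: Q fixed, equal-sized partitions with Q/S tokens
// per node; key-to-partition mapping is decoupled from placement.
public class EqualPartitions {
    public static void main(String[] args) {
        int Q = 12;                                  // number of fixed partitions
        List<String> nodes = List.of("A", "B", "C"); // S = 3, so Q/S = 4 each
        Map<Integer, String> owner = new TreeMap<>();
        for (int p = 0; p < Q; p++)                  // toy assignment of partitions
            owner.put(p, nodes.get(p % nodes.size()));

        // A key maps to one of the Q equal hash ranges; looking up the owner
        // is then a simple table lookup, independent of node membership.
        String key = "cart:42";
        int partition = Math.floorMod(key.hashCode(), Q);
        System.out.println(key + " -> partition " + partition
                           + " -> node " + owner.get(partition));
    }
}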

Figure: illustration of the three partitioning strategies.

Figure: comparison of the load distribution efficiency of the different strategies for a system with 30 nodes and N=3, with an equal amount of metadata maintained at each node.

Divergent Versions: When and How Many

The number of versions returned to the shopping cart service was profiled for a period of 24 hours. During this period, 99.94% of requests saw exactly one version; 0.00057% of requests saw 2 versions; 0.00047% of requests saw 3 versions; and 0.00009% of requests saw 4 versions. This shows that divergent versions are created rarely.

Experience shows that an increase in the number of divergent versions is caused not by failures but by an increase in the number of concurrent writers. Such increases are usually triggered by busy robots (automated client programs) and rarely by humans. This issue is not discussed in detail due to the sensitive nature of the story.

Client-driven or Server-driven Coordination

An alternative approach to request coordination is to move the state machine to the client nodes. In this scheme, client applications use a library to perform request coordination locally. A client periodically picks a random Dynamo node and downloads its current view of Dynamo membership state. Using this information the client can determine which set of nodes form the preference list for any given key. Read requests can then be coordinated at the client node, avoiding the extra network hop that is incurred if the request were assigned to a random Dynamo node by the load balancer. Writes are either forwarded to a node in the key's preference list or coordinated locally if Dynamo is using timestamp-based versioning.

An important advantage of the client-driven coordination approach is that a load balancer is no longer required to uniformly distribute client load. Fair load distribution is implicitly guaranteed by the near-uniform assignment of keys to the storage nodes.

The client-driven approach reduces 99.9th percentile latencies by at least 30 milliseconds and decreases the average by 3 to 4 milliseconds. The improvement comes from eliminating the overhead of the load balancer and the extra network hop that may be incurred when a request is assigned to a random node.
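
A minimal sketch of such a client library, assuming a stubbed membership RPC and a toy ring position derived from hashCode; all names are hypothetical.

import java.util.*;

// Sketch of client-driven coordination: the client caches a membership
// view and routes requests itself, skipping the load balancer's extra hop.
public class DynamoClient {
    private List<String> membership = List.of();
    private long lastRefresh = 0;

    void maybeRefresh() {                   // poll a random node every 10 s
        if (System.currentTimeMillis() - lastRefresh < 10_000 && !membership.isEmpty()) return;
        membership = fetchMembershipFromRandomNode();
        lastRefresh = System.currentTimeMillis();
    }

    // Preference list: the N nodes following the key's position on the ring.
    List<String> preferenceList(String key, int n) {
        maybeRefresh();
        int start = Math.floorMod(key.hashCode(), membership.size());
        List<String> prefs = new ArrayList<>();
        for (int i = 0; i < n; i++)
            prefs.add(membership.get((start + i) % membership.size()));
        return prefs;                       // coordinate the read against these
    }

    private List<String> fetchMembershipFromRandomNode() {
        return List.of("nodeA", "nodeB", "nodeC", "nodeD");  // stub for the RPC
    }

    public static void main(String[] args) {
        System.out.println(new DynamoClient().preferenceList("cart:42", 3));
    }
}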

Balancing background vs. foreground tasks

Each node performs different kinds of background tasks for replica synchronization and data handoff (due either to hinting or to adding/removing nodes) in addition to its normal foreground put/get operations.

It became necessary to ensure that background tasks run only when the regular critical operations are not affected significantly.

An admission controller constantly monitors the behavior of resource accesses while executing "foreground" put/get operations.

Based on this feedback loop, it decides how many time slices are made available to background tasks, thereby limiting the intrusiveness of the background activities.
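
A minimal sketch of such a feedback controller, assuming a single monitored signal (foreground 99.9th percentile latency) and a fixed slice budget; the thresholds, step sizes, and names are hypothetical.

// Sketch of a feedback-driven admission controller for background tasks.
public class AdmissionController {
    static final double TARGET_P999_MS = 50.0;  // assumed foreground latency target
    private int slices = 10;                    // time slices granted to background tasks

    // Fed by monitoring of resource accesses during foreground put/get work.
    void onForegroundSample(double p999Ms) {
        if (p999Ms > TARGET_P999_MS) slices = Math.max(0, slices - 2); // back off
        else slices = Math.min(10, slices + 1);                        // ramp up
    }

    boolean admitBackgroundTask() { return slices > 0; }

    public static void main(String[] args) {
        AdmissionController ac = new AdmissionController();
        ac.onForegroundSample(80.0);   // foreground degraded: throttle background work
        System.out.println(ac.admitBackgroundTask() + ", slices=" + ac.slices);
        for (int i = 0; i < 4; i++) ac.onForegroundSample(20.0);
        System.out.println(ac.admitBackgroundTask() + ", slices=" + ac.slices);
    }
}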
