2020-06-21

## Abstract

• K-V store.
• Highly available, scalable.
• Always on for reads and writes.
• Eventual consistency.

## 1 Introduction

• Amazon architecture
• Highly decentralized, loosely coupled SOA consisting of hundreds of services.
• Failures are the norm of daily operations.
• Dynamo uses techniques:
• Consistent hashing for partitioning and replication.
• Consistency is facilitated with object versioning.
• Quorum-like technique and a decentralized replica synchronization protocol for consistency among replicas during updates.
• A gossip based distributed failure detection and membership protocol.

## 2 Background

• Relational DBs are complex and inefficient for simple KV use cases.
• System assumptions and requirements
• Query model
• Simple reads and writes to a single data item.
• Values are binary objects.
• Doesn’t need ACID transactions.
• Efficiency
• SLAs measured at the 99.9th percentile.
• No internal hostile actors, thus no security-related requirements.
• SLAs
• Uses the 99.9th percentile rather than the average, for a better experience for all clients.
• Design considerations
• Uses optimistic replication techniques -> requires conflict resolution.
• Resolving during writes may reject some writes.
• (Accepted) Resolving during reads, so the system is always writable.
• Who resolves
• Data store: Choices are limited, like last-write-wins.
• Application: More flexible.
• Incremental scalability
• Symmetry
• All nodes should have the same responsibilities.
• Decentralization
• Favors decentralized peer-to-peer techniques over centralized control.
• Heterogeneity
• e.g., work distribution must be proportional to the capabilities of individual servers.

Dynamo differs from some other projects as follows.

• Always writable: no writes should be rejected.
• All nodes are trusted since it’s used internally.
• Doesn’t require hierarchical namespaces or complex schemas.
• Targets latency-sensitive applications, measured at the 99.9th percentile.
• Multi-hop routing is not acceptable.
• Dynamo is a zero-hop DHT: each node maintains enough routing information locally to route a request directly to the appropriate node.

## 4 System architecture

General considerations for a scalable and robust distributed system:

• Data persistence component
• Membership and failure detection
• Failure recovery
• Replica synchronization
• State transfer
• Concurrency and job scheduling
• Request marshalling
• Request routing
• System monitoring and alarming

Core techniques used by Dynamo:

| Problem | Technique | Advantage |
| --- | --- | --- |
| Partitioning | Consistent hashing | Incremental scalability |
| High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates |
| Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are unavailable |
| Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background |
| Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information |

### System interface

• get(key)
• put(key, context, object)
• context is some system metadata such as versions of objects.

### Partition algorithm

Dynamo uses consistent hashing. Each node is assigned a random value representing its position on a ring. For an example ring A -> B -> C -> D -> A, B is the coordinator node for data with keys in range (A, B].

This setting has two problems:

• The random position assignment leads to non-uniform data and load distribution.
• It is oblivious to the heterogeneity in the performance and capacity of nodes.

Therefore, Dynamo uses the “virtual node” concept: positions on the ring are virtual nodes, and a physical node can be responsible for one or more of them. The advantages are as follows:

• If a node becomes unavailable, its load is evenly dispersed across remaining nodes.
• If a node becomes available again or a new node is added, it accepts a roughly equivalent amount of load from each of the remaining nodes.
• The number of virtual nodes for a physical node can be decided considering the performance and capacity.
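The ring with virtual nodes can be sketched as follows. This is a minimal illustration, not Dynamo's implementation: the class name, token-per-node count, and use of MD5 as the position hash are my own choices.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string to a position on the ring (a 128-bit hash space)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

class Ring:
    def __init__(self, vnodes_per_node: int = 8):
        self.vnodes_per_node = vnodes_per_node
        self._tokens = []   # sorted virtual-node positions
        self._owner = {}    # token -> physical node name

    def add_node(self, node: str):
        # Each physical node owns several virtual nodes ("tokens") on the ring;
        # the count can be tuned to the node's capacity (heterogeneity).
        for i in range(self.vnodes_per_node):
            token = ring_hash(f"{node}#{i}")
            bisect.insort(self._tokens, token)
            self._owner[token] = node

    def coordinator(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's position.
        pos = ring_hash(key)
        idx = bisect.bisect_left(self._tokens, pos) % len(self._tokens)
        return self._owner[self._tokens[idx]]
```

Because each physical node is scattered across many positions, removing one node disperses its load over all the others instead of dumping it on a single successor.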

### Replication

Each data item is replicated across N nodes. Starting from the coordinator node, data is replicated to its N-1 successor nodes. For an example ring A -> B -> C -> D -> A and N=3, D is responsible for data in ranges (A, B], (B, C] and (C, D]. For a particular key, the list of nodes storing it is called its preference list. To ensure that data is replicated across N distinct physical nodes, the preference list is built by skipping ring positions that map to physical nodes already on the list.
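The preference-list construction can be sketched as below; the function name and token representation are illustrative assumptions, not Dynamo's API.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Illustrative position hash over a 128-bit ring space."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

def preference_list(key: str, tokens: list, n: int) -> list:
    """tokens: sorted list of (position, physical_node) for all virtual nodes.
    Walk clockwise from the key's position, skipping virtual nodes whose
    physical node was already collected, until n distinct nodes are found."""
    positions = [pos for pos, _ in tokens]
    start = bisect.bisect_left(positions, ring_hash(key))
    nodes = []
    for i in range(len(tokens)):
        _, node = tokens[(start + i) % len(tokens)]
        if node not in nodes:
            nodes.append(node)
            if len(nodes) == n:
                break
    return nodes
```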

### Data versioning

To ensure the system is always writable, conflicting writes are accepted and conflicts are resolved during reads, e.g., by merging different versions of a shopping cart. A vector clock is used to capture the causality between versions of an object; it is a list of (node, counter) pairs. For a put() operation, the client must supply the version it is updating in the context, which it obtained from an earlier read.

The size of a vector clock is usually small, since writes are normally handled by the N nodes in the preference list. During network partitions or multiple server failures, however, requests may be served by other nodes, growing the clock. Dynamo limits the size of vector clocks by associating a timestamp with each pair: when the number of pairs reaches a threshold, the oldest pair is discarded.
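A minimal vector-clock sketch (the node names Sx/Sy/Sz follow the paper's example; the dict representation and function names are my own):

```python
def increment(clock: dict, node: str) -> dict:
    """The coordinator bumps its own counter when handling a write."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a: dict, b: dict) -> bool:
    """True if clock a causally descends from (i.e., supersedes) clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

# Writes branching from a common ancestor via different coordinators are
# concurrent: neither clock descends from the other, so both versions are
# kept as siblings for the client to reconcile on a later read.
v0 = increment({}, "Sx")    # {"Sx": 1}
v1 = increment(v0, "Sy")    # {"Sx": 1, "Sy": 1}
v2 = increment(v0, "Sz")    # {"Sx": 1, "Sz": 1}
assert descends(v1, v0) and descends(v2, v0)
assert not descends(v1, v2) and not descends(v2, v1)
```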

### Execution of get() and put() operations

Any storage node in Dynamo is eligible to receive client get and put operations for any key. There are two strategies for a client to select a node to perform its operation:

• Route the request through a generic load balancer that selects a node based on load information.
• Clients don’t need to link any Dynamo-specific library.
• If the randomly chosen node is not in the key’s preference list, it forwards the request to one of the top N healthy nodes in the preference list.
• Use a partition-aware client library that routes the request directly to the coordinator node.
• Lower latency.

Dynamo uses a quorum-like consistency protocol with R + W > N.

• R is the minimum number of nodes that must participate in a successful read operation.
• W is the minimum number of nodes that must participate in a successful write operation.

Upon a read, the coordinator requests all existing versions of the data for that key from the N highest-ranked nodes in the preference list. Once it receives at least R responses, it reconciles the versions so that the final versions are causally unrelated, returns them to the client, and writes the reconciled versions back.
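The syntactic part of that reconciliation, dropping any version whose vector clock is dominated by another's so only causally unrelated siblings survive, can be sketched as follows (function names are my own; semantic merging of siblings is left to the application):

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a causally descends from (supersedes) clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(versions: list) -> list:
    """versions: list of (value, vector_clock) pairs collected from R replies.
    Drop every version strictly dominated by another; the survivors are
    causally unrelated siblings returned to the client."""
    out = []
    for i, (value, clock) in enumerate(versions):
        dominated = any(
            j != i and other != clock and descends(other, clock)
            for j, (_, other) in enumerate(versions)
        )
        if not dominated:
            out.append((value, clock))
    return out
```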

### Handling failures: Hinted handoff

To improve availability, Dynamo uses sloppy quorum: all reads and writes are performed on the first N healthy nodes in the preference list.

For an example ring A -> B -> C -> D -> A with N=3, if A is unavailable, the replica normally stored on A is temporarily sent to D along with a hint identifying A. Once A recovers, D transfers the replica back.

The preference list is constructed so that its storage nodes are spread across multiple data centers, in order to tolerate data-center-level failures.
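The "first N healthy nodes" selection with hints can be sketched like this. The function name and the (node, hinted_for) output shape are assumptions of mine; `pref_list` is the extended preference list (longer than N), and `healthy` is the set of currently reachable nodes.

```python
def sloppy_quorum_targets(pref_list: list, healthy: set, n: int) -> list:
    """Return [(storage_node, hinted_for)] for a write: the first n healthy
    nodes in the preference list. A node standing in for an unavailable
    intended node carries a hint naming that node, so it can hand the
    replica back once the intended node recovers."""
    intended = pref_list[:n]
    down = [x for x in intended if x not in healthy]
    targets = []
    for node in pref_list:
        if node not in healthy:
            continue
        hinted_for = down.pop(0) if (node not in intended and down) else None
        targets.append((node, hinted_for))
        if len(targets) == n:
            break
    return targets
```

On the example ring with A down, D ends up storing A's replica with a hint for A.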

### Handling permanent failures: replica synchronization

Dynamo uses an anti-entropy protocol with Merkle trees to keep replicas synchronized.

• Merkle tree
• A hash tree where leaves are hashes of the values of individual keys.
• Parent nodes higher in the tree are hashes of their respective children.
• It allows two trees to be compared quickly to find which keys are out of sync.
• In Dynamo
• Each node maintains a separate Merkle tree for each key range it hosts.
• A disadvantage is that the trees need to be recalculated when a node joins or leaves the system, since key ranges change.

### Membership and failure detection

• To add/remove a node to/from the ring:
• An admin connects to a node and issues the membership change.
• The change history is persisted.
• A gossip-based protocol propagates the change history between nodes.
• Nodes also exchange and reconcile their mappings of nodes to token sets (key ranges).
• A ring may be logically partitioned.
• e.g., an admin connects to node A to join A into the ring, then connects to node B to join B; A and B then don’t know about each other.
• To solve this, some nodes play the role of seeds, which are known to all nodes.
• Failure detection
• Goal: avoid attempts to communicate with unreachable nodes during operations like get(), put(), and transfers of partitions and hinted replicas.
• A local notion of failure is sufficient: A pings B; if B doesn’t respond, A considers B unreachable and uses B’s alternative nodes.
• A decentralized failure detection protocol (a gossip protocol enabling each node to learn about the arrival/departure of every other node) is unnecessary.
• Dynamo’s explicit node join/leave mechanism obviates the need for such a global view.

When a new node joins the ring, the following operations happen. Operational experience shows this distributes load uniformly across storage nodes.

1. It gets assigned a number of random tokens (key ranges) from the ring.
2. The nodes currently in charge of these key ranges will transfer data to the new node upon its confirmation.
• The explicit confirmation avoids duplicate transfers.

## 5 Implementation

Each Dynamo node contains 3 components:

• Local persistence component
• Different storage engines can be plugged in (Berkeley DB, MySQL, etc.) for different use cases.
• Request coordination
• Built on top of an event-driven messaging substrate where the message processing pipeline is split into stages, similar to the SEDA architecture.
• Each client request triggers the creation of a state machine on the node handling:
• identifying nodes for this key;
• sending requests;
• waiting for responses;
• potentially doing retries;
• processing and packaging responses.
• It may perform read repair to fix stale data.
• For write requests:
• Any node in the preference list can be the coordinator.
• If using timestamp-based reconciliation, any node in the system can do the job.
• This means writes for a key cannot be serialized at a single location.
• Since a write usually follows a read, the node that replied fastest to the previous read is chosen as the write coordinator.
• Better chance of read-your-writes consistency.
• Reduces variability in performance, improving the 99.9th percentile.
• Membership and failure detection

## 6 Experience and lessons learned

The main patterns in which Dynamo is used are as follows. The N, R, W values can be tuned to achieve the desired levels of performance, availability and durability. Typically N = 3; a common combination is (N, R, W) = (3, 2, 2).

• Timestamp based reconciliation
• Last write wins.
• R = 1; W = N
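The tuning constraint is just the quorum-overlap inequality; a trivial check makes the trade-off concrete (function name is mine):

```python
def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees every read quorum intersects every write
    quorum, so a read contacts at least one replica holding the latest
    acknowledged write."""
    return r + w > n

assert quorum_overlaps(3, 2, 2)       # the common Dynamo configuration
assert quorum_overlaps(3, 1, 3)       # R=1, W=N: a high-performance read engine
assert not quorum_overlaps(3, 1, 1)   # quorums may miss each other entirely
```

Lower R speeds up reads and lower W speeds up writes; the inequality is what keeps the two sides honest.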

### Balancing performance and durability

Write latencies are usually higher than read latencies since disk access is required. However, we can trade off durability guarantees for performance: write operations are buffered in memory and periodically flushed to storage. To reduce the durability risk of this approach, the coordinator chooses one of the N nodes to perform a “durable write”. Since the coordinator waits for only W responses, write performance is not impacted.

• Dynamo assumes that when significant skew in the access distribution occurs, there are enough keys at the popular end of the distribution, so load can still be spread uniformly.
• Experiments show that the imbalance ratio decreases with increased load.
• Under high load, a large number of popular keys end up distributed uniformly across nodes.

The evolution of partitioning schemes:

• Strategy 1: T random tokens per node and partition by token value
• Two consecutive tokens define a range.
• Cons:
• When a new node joins, it needs to “steal” tokens (key ranges) from other nodes.
• The affected nodes must scan their local persistent stores to hand over the right items, which is inefficient.
• When nodes join/leave, the key ranges of many nodes change and their Merkle trees need to be recalculated.
• There is no easy way to snapshot the entire key space.
• Strategy 2: T random tokens per node and equal-sized partitions
• Hash space is divided into Q equally sized partitions/ranges and each node is assigned T random tokens.
• Q >> N and Q >> S*T, where S is the number of nodes.
• Partitioning and partition placement are decoupled.
• This strategy was an intermediate state in the migration from Strategy 1 to Strategy 3.
• Strategy 3: Q/S tokens per node, equal-sized partitions
• Each node handles Q/S tokens.
• When nodes join/leave, tokens are re-arranged to satisfy the invariant.
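Strategy 3 can be sketched as a round-robin assignment of the Q fixed partitions (the function and the round-robin placement are illustrative assumptions; Dynamo rearranges tokens on membership changes to preserve the Q/S invariant):

```python
def assign_partitions(q: int, nodes: list) -> dict:
    """Divide the hash space into q equal, fixed partitions and give each
    of the s nodes roughly q/s of them. Partition boundaries never move;
    only their ownership does."""
    assignment = {node: [] for node in nodes}
    for p in range(q):
        assignment[nodes[p % len(nodes)]].append(p)
    return assignment
```

Because partition boundaries are fixed, each partition can live in its own file, which is what makes bootstrapping and archival cheap in the evaluation below.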

Evaluation:

• Load balancing efficiency: ratio of the average number of requests served per node to the maximum number served by the hottest node.
• Measured efficiency: Strategy 3 > Strategy 2 > Strategy 1.
• Strategy 3
• Pros:
• Size of membership information at each node is greatly reduced. (>1000X)
• Faster bootstrapping/recovery: Partition ranges are fixed and stored as files, which can be transferred easily.
• Easy archival: Entire dataset can be archived by archiving those partition files separately.
• Cons:
• Changing node membership requires coordination in order to preserve the invariant.
• (ST): Strategy 1 requires coordination for tokens too?

### Divergent versions

• Divergent versions happen in case of:
• failure scenarios;
• large number of concurrent writes to a single data item.
• Experiments show that divergent versions are rarely created.

### Client-driven or server-driven coordination

The state machine can be moved to the client side. Advantages:

• Load balancer is not needed.
• This also avoids the extra network hop for the load balancer.
• Clients poll any Dynamo node for membership updates every 10 seconds.

Note that Dynamo targets systems of a couple hundred nodes. Running with tens of thousands of nodes would make the routing table, which is O(n) in system size, hard to maintain. This may be overcome by introducing hierarchical extensions or DHT systems.