Public speaking course notes Read "Dynamo, Amazon’s Highly Available Key-value Store" Read "Bigtable, A Distributed Storage System for Structured Data" Read "Streaming Systems" 3, Watermarks Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

Read "Dynamo, Amazon’s Highly Available Key-value Store"

2020-06-21

Reference

Link to paper

Abstract

  • K-V store.
  • Highly available, scalable.
  • Always on for reads and writes.
  • Eventual consistency.

1 Introduction

  • Amazon architecture
    • Highly decentralized, loosely coupled SOA consisting hundreds of services.
    • Failures are the norm of daily operations.
  • Dynamo uses techniques:
    • Consistent hashing for partition and hashing.
    • Consistency is facilitated with object versioning.
    • Quorum-like technique and a decentralized replica synchronization protocol for consistency among replicas during updates.
    • A gossip based distributed failure detection and membership protocol.

2 Background

  • Relational DBs are complex and inefficient for simple KV use cases.
  • System assumptions and requirements
    • Query model
      • Simple reads and writes to a single data item.
      • Values are binary objects.
    • Doesn’t need ACID transactions.
    • Efficiency
      • SLAs on 99.9% percentile.
    • No internal hostiles thus no security related requirements.
  • SLAs
    • Uses 99.9% for better client experience.
  • Design considerations
    • Use optimistic replication techniques -> conflict resolution
      • When to resolve, read/write?
        • Resolving during writes will reject some writes.
        • (Accepted) Resolving during reads to have an always-writable system.
      • Who resolves
        • Data store: Choices are limited, like last-write-wins.
        • Application: More flexible.
    • Incremental scalability
    • Symmetry
      • All nodes should have the same responsibilities.
    • Decentralization
      • Favors decentralized peer-to-peer techniques over centralized control.
    • Heterogeneity
      • e.g. work distribution must be proportional to the individual server capabilities.

Dynamo differs from some other projects as follows.

  • Always writable as no writes should be rejected.
  • All nodes are trusted since it’s used internally.
  • Doesn’t require hierarchical namespaces or complex schemas.
  • For latency-sensitive applications, targeting at 99.9% percentile.
    • Multi-hop routing is not acceptable.
    • Dynamo is zero-hop DHT that each node maintains enough routing info locally to route a request to an appropriate node directly.

4 System architecture

General considerations for a scalable and robust distributed system:

  • Data persistent component
  • Load balancing
  • Membership and failure detection
  • Failure recovery
  • Replica synchronization
  • Overload handling
  • State transfer
  • Concurrency and job scheduling
  • Request marshalling
  • Request routing
  • System monitoring and alarming

Core techniques used by Dynamo:

Problem Technique Advantage
Partitioning Consistent hashing Incremental scalability
High Availability for writes Vector clocks with reconciliation during reads Version size is decoupled from update rates
Handling temporary failures Sloppy quorum and hinted handoff Provides high availability and durability guarantee when some of the replicas are unavailable
Recovering from permanent failures Anti-entropy using Merkle trees Synchronizes divergent replicas in the background
Membership and failure detection Gossip-based membership protocol and failure detection Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

System interface

  • get(key)
  • put(key, context, object)
    • context is some system metadata such as versions of objects.

Partition algorithm

Dynamo uses consistent hashing. On a ring, a node is assigned a random value representing the position on the ring. For an example ring A -> B -> C -> D -> A, B is the coordinator node for data with key in range (A, B].

This setting have two problems:

  • Random assignment leads to non-uniform data and load distribution.
  • Oblivious to the heterogeneity in the performance and capacity of nodes.

Therefore, Dynamo uses “virtual node” concept. Virtual nodes are nodes on the ring. A physical node can be responsible for one or more virtual nodes. Advantages are as follows:

  • If a node becomes unavailable, its load is evenly dispersed across remaining nodes.
  • If a node becomes available again or is added, it roughly accepts equivalent amount of load from the remaining ones.
  • The number of virtual nodes for a physical node can be decided considering the performance and capacity.

Replication

Each data is replicated across N nodes. Starting from the coordinator node, data are replicated to the N-1 successor nodes. For an example ring A -> B -> C -> D -> A and N=3, D is responsible for data in range (A, B], (B, C] and (C, D]. For a particular key, the list of nodes storing it is called preference list. In order to make sure data is replicated across N physical nodes, preference list is chosen by skipping virtual nodes on the same physical node.

Data versioning

To ensure the system is always writable, conflicting writes are accepted and the conflicts can be resolved during reads, e.g., merging different versions of a shopping cart. Vector clock is used to capture the causality of an object. It’s a list of (node, counter) pairs. For a put() operation, client must supply the version it’s updating in context, it could be obtained from an earlier read.

The size of a vector clock is usually limited since preference list contains only N nodes. In the cases of network partitions or multiple server failures, the request may be served by other nodes. Dynamo limits the size of vector clocks by associating timestamp to each pair. If the size reaches the threshold, discard the oldest ones.

Execution of get() and put() operations

Any storage node in Dynamo is eligible to receive client get and put operations for any key. Two strategies for a client to select the node performing operations:

  • Routes its request through a generic load balancer that will select a node based on load information.
    • Clients don’t need links to Dynamo library.
    • The random chosen node will forward the request to the top N healthy nodes in the preference list.
  • Uses a partition-aware client library that routes the request to the coordinator node directly.
    • Lower latency.

Dynamo uses consistency protocol R + W > N.

  • R is the minimum number of nodes fore a successful read.
  • W is the minimum number of nodes fore a successful write.

Upon a read, the coordinator requests all existing versions of data for that key from the N highest-ranked nodes in the preference list. Once receiving at least R responses, reconcile the versions so the final versions are casually unrelated. Then send the response to the client and write the final versions back.

Handling failures: Hinted handoff

To improve availability, Dynamo uses sloppy quorum: all reads and writes are performed on the first N healthy nodes in the preference list.

For an example ring A -> B -> C -> D -> A and N=3, if A is unavailable, D will temporarily be hinted to receive the corresponding replica . Once A is recovered, D will transfer the replica back.

The preference list in constructed in a way that its storage nodes are spread across multiple data centers, in order to tackle data-center-level failures.

Handling permanent failures: replica synchronization

Dynamo uses an anti-entropy protocol with Merkle trees to keep replica synchronized.

  • Merkle tree
    • A hash tree where leaves are hashes of the values of individual keys.
    • Parent nodes higher in the tree are hashes of their respective children.
    • It can quickly compare two trees to know which keys are out of sync.
  • In Dynamo
    • Each node maintain a separate Merkle tree for each key range.
    • A disadvantage is the tree needs to be recalculated if a node joins or leaves the system.
      • Will be addressed later.

Membership and failure detection

  • To add/remove a node to a ring,
    • An admin connects to a node and issue membership changes.
    • Change history is persisted.
    • A gossip-based protocol propagates change history between nodes.
      • Also exchange and reconcile mapping of local nodes and token sets (key ranges).
  • A ring may be logically partitioned.
    • e.g., connect node A to join A into the ring; connect node B to join B. In this case, A and B don’t know each other.
    • To solve this, some nodes play the role of seeds, which is known by all nodes.
  • Failure detection
    • To avoid attempt to communicate with unreachable nodes during operations like get(), put(), transferring partitions and hinted replicas.
    • Local notion is sufficient: A pings B; if B doesn’t respond, A thinks B is unreachable. A can use B’s alternative nodes.
    • Decentralized failure detection is unnecessary.
      • It uses a gossip protocol enabling each node to know arrival/departure of other nodes.
      • This obviates Dynamo’s node joining/leaving mechanism.

Adding/Removing storage nodes

When a new node joins the ring, following operations happen. Operational experience shows it distributes the load uniformly across storage nodes.

  1. It gets assigned a number of random tokens (key ranges) from the ring.
  2. The nodes currently in charge of these key ranges will transfer data to the new node upon its confirmation.
    • Avoids duplicate transfers.

5 Implementation

Each Dynamo node contains 3 components:

  • Local persistence component
    • Different storage engine can be plugged in: Berkeley DB, MySQL, etc. for different use cases.
  • Request coordination
    • built on top of an event-driven messaging substrate where message processing pipeline is split into stages similar to SEDA architecture.
    • Each client request triggers the creation of a state machine on the node handling:
      • identifying nodes for this key;
      • sending requests;
      • waiting for responses;
      • potentially doing retries;
      • processing and packaging responses.
    • For read requests:
      • may performing read repair to fix stale data.
    • For write requests:
      • Any node in the preference list can be the coordinator.
        • If using timestamp-based reconciliation, any node in the system can do the job.
      • This means writes for a key cannot be serialized at a single location.
      • Better load balancing.
      • Usually a write follows a read, thus use the node which replied fastest the the previous read, as write coordinator.
        • Better chances getting read-your-write consistency.
        • Reduce variability in the performance, improve 99.9% performance.
  • Membership and failure detection

6 Experience and lessons learned

Main patterns in which Dynamo uses are as follows. W, R, N values can be tuned to achieve desired level of performance, availability and durability. Typically, N = 3. Common combination is N = 3; R = 2; W = 2.

  • Business logic specific reconciliation
  • Timestamp based reconciliation
    • Last writes wins.
  • High performance read engine
    • R = 1; W = N

Balancing performance and durability

Write latencies are usually higher than reads since a disk accesses are required. However, we can trade-off durability guarantees for performance. Write operations can be buffered at memory, which gets periodically written to storage. To reduce durability risk of this approach, the coordinator can choose 1 out of N node performing “durable write”. As the coordinator only waits for W responses, the write performance won’t be impacted.

Ensuring uniform load distribution

  • Dynamo assumes that when significant skews of access distribution happen, there are enough keys for popular end of the distribution thus load can be distributed uniformly.
  • Experiments show that imbalance ratio decrease with increase load.
    • With high loads, a large number of popular keys are distributed uniformly.

The evolution of partitioning schemes:

  • Strategy 1: T random tokens per node and partition by token value
    • Two consecutive tokens define a range.
    • Cons:
      • When a new node joins, it needs to steal tokens from others.
        • Partitioning would be affected.
        • This needs scanning, which is inefficient.
      • When nodes join/leave, key ranges of many nodes change and Merkel tree need to be recalculated.
      • No easy way to snapshot entire key space.
  • Strategy 2: T random tokens per node and equal-sized partitions
    • Hash space is divided into Q equally sized partitions/ranges and each node is assigned T random tokens.
    • Q >> N; Q >> S*T, S is the number of nodes.
    • Partitioning and partitioning placement are decoupled.
    • This is an intermediate state migrating from 1 to 3.
  • Strategy 3: Q/S tokens per node, equal-sized partitions
    • Each node handles Q/S tokens.
    • When nodes join/leave, tokens are re-arranged to satisfy the invariant.

Evaluation:

  • Load balancing efficiency: ratio of average number of requests of all nodes to the maximum number of requests of the hottest node.
  • Experiment result of efficiency: 3 > 2 > 1.
  • Strategy 3
    • Pros:
      • Size of membership information at each node is greatly reduced. (>1000X)
      • Faster bootstrapping/recovery: Partition ranges are fixed and stored as files, which can be transferred easily.
      • Easy archival: Entire dataset can be archived by archiving those partition files separately.
    • Cons:
      • Changing node membership requires coordination in order to preserve the invariant.
        • (ST): Strategy 1 requires coordination for tokens too?

Divergent versions

  • Divergent versions happen in case of:
    • failure scenarios;
    • large number of concurrent writes to a single data item.
  • Experiments show that divergent versions are rarely created.

Client-driven or server-driven coordination

State machine can be move to client side. Advantages:

  • Load balancer is not needed.
    • This also avoids the extra network hop for the load balancer.
  • Clients poll any Dynamo node for membership updates every 10 seconds.

Balancing background tasks

  • Background tasks such as replica synchronization and data handoff should run only when the regular critical operations are not affected significantly.
  • An admission control mechanism and resource monitoring are integrated to schedule background tasks.

Discussion

Note that Dynamo works for a system of couple of hundreds of nodes. If running with tens of thousands of nodes, the routing table (O(n) to system size) is hard to maintain. This may be overcome by introducing hierarchical extensions or DHT systems.


Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2025. All rights reserved by melonskin. Powered by Jekyll.