Read "Dynamo, Amazon’s Highly Available Key-value Store" Read "Bigtable, A Distributed Storage System for Structured Data" Read "Streaming Systems" 3, Watermarks Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

Read "Designing Data-intensive applications" 5, Replication


All chapters

  1. Reliable, Scalable, and Maintainable Applications
  2. Data Models and Query Languages
  3. Storage and Retrieval
  4. Encoding and Evolution
  5. Replication
  6. Partitioning
  7. Transactions
  8. Distributed System Troubles
  9. Consistency and Consensus
  10. Batch Processing
  11. Stream Processing
  12. The Future of Data Systems


Keep copies of the same data on multiple machines connected via a network.

  • To keep data geographically close to the users.
    • Reduce latency
  • To make system still work even if its parts have failed.
    • Increase availability
  • Scale out number of machines to serve more read queries.
    • Increase read throughput

Leaders and followers

  • Replica: each node storing a copy of the database.
  • Each write to the database needs to be processed by every replica.
  • Leader-based replication
    • One replica is designated as the leader.
      • Clients send write requests to the leader.
    • Other replicas are followers.
      • Leader sends changes to all of its followers as part of a replication log or change stream.
    • Client can query any replica.

Synchronous versus Asynchronous replication

  • Synchronous replication
    • Process
      • Leader sends the change to follower 1.
      • Follower 1 confirms the change and send this message to leader.
      • Leader reports the change is done to user and makes the changes visible to user.
    • Advantage
      • Follower is guaranteed to have an up-to-date copy of the data as leader.
    • Disadvantage
      • Writes cannot be processed if the follower doesn’t respond.
        • All writes are blocked by one non-responding replica.
    • Practice
      • Enable only a few synchronous followers (usually only one, semi-synchronous).
  • Asynchronous replication
    • Leader sends the message but doesn’t wait for a response from the follower.
      • Advantage
        • Leader can continue processing writes, even if all followers have failed.

Setting up new followers

  • Purpose
    • To increase number of replicas
    • To replace failed nodes
  • Problem
    • Copy accurate data from leader to this node
  • Process
    • Take a consistent snapshot of the leader’s database at some point in time.
    • Copy the snapshot to the new follower node.
    • Follower connects to the leader and requests all data changes happened since the snapshot was taken. The point is marked in the leader’s replication log.
    • The new follower can now work just like other followers.

Handling node outages

Follower failure: Catch-up recovery

  • A follower knows what the last transaction processed is from its log.
  • Followers can just connect the leader and request later changes.

Leader failure: failover

  • One of the followers needs to be promoted to be the new leader.
  • Clients and other followers need to know this change.
  • Process
    • Determine that the leader has failed.
      • Timeout
    • Choose a new leader.
      • Elected or assigned by a previously elected controller node.
      • Good if the new leader has the most up-to-date data changes.
      • Consensus problem: getting all the nodes to agree on a new leader.
    • Reconfiguring the system to use the new leader.
      • Clients need to send the requests to the new leader.
      • Old leader may come back and must be a new follower.
  • Problems
    • Conflicts between new leader and un-replicated data in alive old leader.
      • May just discard those data.
    • Discarding data is dangerous if other systems coordinate with this database.
      • May reuse existing primary keys due to the lag between new leader and old leader.
    • Split brain: two nodes both believe they are leader.
      • Shut down one.
    • A proper setting for timeout.

Implementation of replication logs

  • Statement-based replication
    • Log, execute and send to followers every write request (INSERT, UPDATE, DELETE).
    • Problems: cause different effects
      • Non-deterministic function in statement (NOW(), RAND()).
      • If using auto-incrementing column or has a WHERE condition relying on database.
      • Statements with side effects (triggers, etc).
    • Not preferred.
  • Write-ahead log (WAL) shipping
    • LSM-trees and B-tree
    • Leader writes in-memory data in disk and sends it to followers.
    • Used in PostgreSQL and Oracle
    • Problem
      • Data log is pretty low level (for storage engine).
        • Details like which bytes were changed in which disk blocks.
        • Database softwares with different storage format may not run on the same leader and followers.
        • Require downtime while upgrading database software.
  • Logical(row-based) log replication
    • Use different log formats for replication and for the storage engine.
    • Decouple replication log from storage engine internals.
    • Contents
      • Insert: new values of all columns.
      • Delete: info to uniquely identify the row.
      • Update: unique row ID and new values of all columns.
    • If a transaction modifies multiple rows,
      • generate several log records,
      • followed by a record committing this transaction.
    • Backward compatible
      • Leader and followers can run different versions of database software or storage engine.
    • Logs can be easily parsed by external applications.
  • Trigger-based replication
    • Move replication up to the application layer.
      • Cases
        • Only replicate a subset of data.
        • Replicate from one kind of database to another.
        • Need conflict resolution logic.
    • A trigger
      • Register custom application code that is automatically executed when a data change happens.
      • Greater overheads than other replication methods.
      • But more flexible.

Problems with replication lag

  • Leader-based replication
    • A single node for writes
    • Many nodes for reads
      • Increase availability, scalability, latency (graphically distributed)
    • Good for read
  • Eventual consistency: Users may see different versions of data on different nodes (like leader and a follower), due to asynchronous writes. But it is just a temporary state.
  • Replication lag: The delay between a write happening on the leader and being reflected on a follower.
    • may be small or large.
    • A large lag may introduce a noticeable inconsistency in practice.

Reading your own writes

  • Case: Let a user submit some data and then view this piece of data.
    • Write goes to leader.
    • Read from some node (may be an inconsistent follower).
  • Read-after-write consistency (read-your-writes)
    • Guarantee the writer will see updates.
    • No promise about other users.
  • Implementation
    • When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower.
    • Track the time of the last update and, for one minute after the last update, make all reads from the leader.
    • The client can remember the timestamp of its most recent write - then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp.
      • Use logical timestamp.
      • Actual system lock, but the clock might be unreliable.
    • A request may need to be routed to the datacenter containing leader.
  • Cross-device read-after-write consistency
    • The same user may access the services from multiple devices.
    • More difficult to remember the timestamp of the user’s last update.
    • May need to route requests from all devices to the same datacenter.

Monotonic reads

  • It’s possible for a user to see things moving backward in time.
  • Monotonic reads prevent this anomaly.
    • If one user makes several reads in sequence, will not see time go backward.
    • Less than strong consistency
    • Stronger than eventual consistency
  • Achieving
    • Each user always makes their reads from the same replica.
      • chosen by hashing user ID.

Consistent prefix reads

  • Case: If some partitions are replicated slower than others, an observer may see the answer before they see the question.
  • Guarantee: consistent prefix reads
    • If a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
  • Solution
    • Make sure that any writes that are causally related to each other are written to the same partition.

Solutions for replication lag

  • When designing, think about what if the replication lag increases to several minutes or hours.
  • If it’s bad,
    • consider to provide stronger guarantees.
  • Providing guarantees on application is complex.
  • Can trust the databases to do the right thing.
    • Transactions
      • Single-node transactions have been abandoned by many distributed systems.
        • Too expensive in terms of performance and availability.

Multi-Leader replication

  • Multi-leader configuration, master-master or active/active replication
    • Allow more than one nodes to accept writes.
    • Each leader simultaneously acts as a follower to the other leaders.

Use cases for multi-leader replication

  • Multi-datacenter operation
    • Having a leader in each datacenter.
    • Within one datacenter, single-leader based.
    • Across datacenters, each leader replicates its changes to leaders in other datacenters.
    • Advantages
      • Writes can be processed in a local datacenter.
        • Inter-datacenter network delays are hidden from users.
      • Tolerance of datacenter outrages
      • Tolerance of network problems
        • Public internet is less reliable than local network within a datacenter.
    • Downsides
      • The same data may be concurrently modified in two different datacenters. Such conflicts must be resolved.
      • Several pitfalls with other database features
        • Auto-incrementing keys
        • Triggers
        • Integrity constraints
    • Practice
      • It is dangerous and should be avoided if possible.
    • Examples
      • Tungsten Replicator for MySQL
      • GoldenGate for Oracle
  • Clients with offline operation
    • The application needs to continue to work while it is disconnected from the internet.
    • Multiple devices act like local leaders with local databases, can work offline.
    • Tricky to get thing right
    • Examples: CouchDB
  • Collaborative editing
    • Examples: Google Docs
    • Changes are applied instantly on local replica and replicated asynchronously to server and other users.
    • Requires conflict resolution.

Handling write conflicts

  • Conflict detection is usually asynchronous.
    • Too late to ask user to resolve once detected.
    • Synchronous detection is expensive, can use single-leader replication directly.
  • Conflict avoidance
    • All writes for a particular record go through the same leader.
    • But we may want to change the designated leader for a record.
      • Conflicts may happen.
  • Converging toward a consistent state
    • All replicas must arrive at the same final value when all changes have been replicated.
    • Solutions
      • Pick the write with highest UUID or latest timestamp.
        • Data may lose.
      • Record the conflicts and prompt the user.
  • Custom conflict resolution logic
    • On write
      • Call the conflict handler as soon as database detects a conflict.
      • e.g., Bucardo
    • On read
      • Multiple versions of data are returned to the application.
      • Prompt the user or resolve automatically.
      • e.g., CouchDB
    • Works on a row, not a transaction.
    • Can be complicated and error-prone.

Multi-leader replication topologies

  • All-to-all
    • Problem: Writes may arrive in the wrong order.
      • Version vectors can be used to order events correctly.
  • Circular
    • Each node receives writes from one node and forward those writes plus itself writes to another node.
    • When a node receives a data change that is tagged, with its own identifier, that data change is ignored.
    • Topology can be reconfigured if one node failed.
    • e.g., default in MySQL
  • Star
    • One designated root node forwards writes to all of the other nodes.
    • Can be generated to a tree structure.
    • Topology can be reconfigured if one node failed.
      • May need to be done manually.

Leaderless replication

  • Any replica can directly accept writes from clients.
  • Applications: Dynamo, Riak, Cassandra, Voldemort
  • Dynamo-style
  • Client may directly send writes to several replicas or do it through a coordinator node.
    • The coordinator node does not enforce a particular ordering of writes.

Writing to the database when a node is down

  • No failover
  • Send writes to several nodes in parallel.
    • Successful if receiving ok messages from a majority of nodes accepting the write.
  • Read from several nodes.
    • Use version number to determine which value is newer.

When an unavailable node comes back online, it needs to catch up on the missed writes.

  • Read repair
    • When a client makes a read from several nodes in parallel, it can detect any stale responses.
    • Write the new value back to the stale nodes.
  • Anti-entropy process
    • In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another.
    • Without an anti-entropy process, values that are rarely read may be missing from some replicas and thus have reduced durability.

Quorums for reading and writing

  • Quorum reads and writes
    • n replicas
    • Every write must be confirmed by w nodes.
    • Every read must query at least r nodes.
    • w + r > n, we can expect to get and up-to-date value when reading.
  • Practice
    • n, w, r are configurable.
    • e.g.,
      • Make n odd.
      • w = r = (n+1)/2
  • We can tolerate some unavailable nodes while reading and writing.

Limitations of quorum consistency

  • Using smaller w and r values.
    • Risky to read stale value.
    • But lower latency and higher availability.
  • Even with w + r > n, edge cases where stale values are read.
    • A sloppy quorum is used.
    • Two writes occur concurrently; wrong merge could happen due to clock skew.
    • A write happens concurrently with a read. The most recent write exists at only a few nodes.
    • Failed writes are not rolled back on those succeeded nodes.
    • If a node with new value failed and was restored from a node with old value, the quorum condition is broken.
    • Unlucky with the timing.
  • Dynamo-style databases generally only guarantee eventual consistency.
  • Guarantees are not satisfied, and generally require transactions or consensus.
    • Reading your own writes
    • Monotonic reads
    • Consistent prefix reads

Monitoring staleness

  • Be aware of the health of the replication.
  • Easy for leader-based replication; measure replication lag by subtracting a record’s position between a follower and leader.
  • Not easy for leaderless.
    • Not a common practice yet.

Sloppy quorums and hinted handoff

  • A large number of nodes are unavailable; fewer than w or r nodes are reachable from the client. It can no longer reach a quorum.
  • Sloppy quorum: The database accepts writes anyway and write them to some reachable nodes but not among the normal n nodes.
    • Still require w and r nodes.
    • But not among the designated n “home” nodes.
  • Hinted handoff: Once the network interruption is fixed, any temporary writes are sent to appropriate “home” nodes.
  • Increase write availability
    • But cannot be sure read the latest value even when w + r > n.

Multi-datacenter operation:

  • Leaderless replication is suitable for multi-datacenter operation.
    • Tolerate conflicting concurrent writes, network interruptions, latency spikes.
  • Cassandra and Voldemort
    • n is number of replicas for this record in all datacenters.
    • Configure how many replicas for this record you want in each datacenter.
    • A write is sent to all replicas.
    • Client only waits for ACK from a quorum of nodes within its local datacenter.
      • Unaffected by delays on cross-datacenter link.
    • Inter-datacenter writes are usually asynchronous.
  • Riak
    • All communication between clients and database nodes are local to one datacenter.
    • n is number of replicas for this record in one datacenter.
    • Cross-datacenter replication between database clusters happens asynchronously in the background, in a style that is similar to multi-leader replication.

Detecting concurrent writes

Concurrent writes (Different clients write data with the same key to nodes) may occur and cause conflicts. Dynamo-style databases conflicts can also arise during read repair or hinted handoff. We need to guarantee eventual consistency.

Conflict resolve methods:

  • Last write wins LWW, (discarding concurrent writes)
    • Each replica only stores the most “recent” value.
    • “Older” values can be overwritten and discarded.
    • Attach timestamp.
    • The only conflict resolution supported in Cassandra.
    • Poor choice if the losing data is not acceptable.
    • Only safe way to use this:
      • Ensure a key is only written once such as UUID and treated as immutable.
  • The “happens-before” relationship
    • A happens before B while B builds upon A
    • 3 possible relationships between two operations A and B
      • A happens before B
      • B happens before A
      • A and B happen concurrently (don’t know each other)
        • Needs to resolve conflicts
    • Algorithm to detect if two writes are concurrent (p. 189).
      • Include version numbers from the read before this write.
  • Merge sibling concurrent values.
    • LWW
    • Take the union.
    • Mark data removed while merging.
  • Version vectors
    • A version number per replica and per key.
    • A replica increments its own version number while processing a write.
      • Also keeps track of the version number if has seen from other replicas.
      • In order to know which values to overwrite or keep as siblings.
    • The collection of version numbers from all the replicas is called a version vector.
    • Version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written.
    • Dotted version vector in Riak 2.0.

Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2020. All rights reserved by melonskin. Powered by Jekyll.