2018-10-08

## Replication

Keep copies of the same data on multiple machines connected via a network.

• To keep data geographically close to the users (reduce latency).
• To keep the system working even if some of its parts have failed (increase availability).
• To scale out the number of machines that can serve read queries (increase read throughput).

### Leaders and followers

• Replica: each node storing a copy of the database.
• Each write to the database needs to be processed by every replica.
• One replica is designated as the leader.
• Clients send write requests to the leader.
• Other replicas are followers.
• Leader sends changes to all of its followers as part of a replication log or change stream.
• Client can query any replica.

#### Synchronous versus Asynchronous replication

• Synchronous replication
• Process
• Leader sends the change to follower 1.
• Follower 1 confirms the change and sends a confirmation back to the leader.
• Leader reports success to the user and makes the change visible.
• The follower is guaranteed to have a copy of the data as up-to-date as the leader’s.
• Writes cannot be processed if the follower doesn’t respond.
• All writes are blocked by one non-responding replica.
• Practice
• Enable only a few synchronous followers (usually only one, semi-synchronous).
• Asynchronous replication
• Leader sends the message but doesn’t wait for a response from the follower.
• Leader can continue processing writes, even if all followers have failed.
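The distinction above can be sketched with a toy model (all names are invented for illustration): the leader waits for acknowledgements from synchronous followers, but fires and forgets to asynchronous ones.

```python
# Toy model of semi-synchronous replication: the leader waits for the
# synchronous followers to acknowledge, but not for the asynchronous ones.
# All names here are invented for illustration.

def process_write(change, sync_followers, async_followers):
    acks = [apply_fn(change) for apply_fn in sync_followers]  # wait for acks
    for apply_fn in async_followers:                          # fire and forget
        apply_fn(change)
    return all(acks)  # report success only if every sync follower confirmed

sync_store, async_store = [], []
ok = process_write(
    "x=1",
    sync_followers=[lambda c: sync_store.append(c) or True],
    async_followers=[lambda c: async_store.append(c)],
)
print(ok, sync_store, async_store)  # True ['x=1'] ['x=1']
```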

#### Setting up new followers

• Purpose
• To increase number of replicas
• To replace failed nodes
• Problem
• Copying an accurate, consistent copy of the leader’s data to the new node while the leader keeps processing writes (a plain file copy would be inconsistent).
• Process
• Take a consistent snapshot of the leader’s database at some point in time.
• Copy the snapshot to the new follower node.
• The follower connects to the leader and requests all data changes that have happened since the snapshot was taken. This requires the snapshot to be associated with an exact position in the leader’s replication log.
• The new follower can now work just like other followers.
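The four steps above can be sketched with a toy log of (position, write) pairs (the data is made up):

```python
# Toy follower bootstrap: copy a snapshot of the leader's state, then
# replay every record after the snapshot's position in the replication log.

leader_log = [(1, "w1"), (2, "w2"), (3, "w3"), (4, "w4")]  # (position, write)

snapshot_pos = 2
follower = dict(leader_log[:snapshot_pos])  # steps 1-2: copy the snapshot

# Step 3: request every change since the snapshot's log position.
for pos, write in leader_log:
    if pos > snapshot_pos:
        follower[pos] = write

# Step 4: the follower is caught up and can apply new changes as they arrive.
print(sorted(follower))  # [1, 2, 3, 4]
```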

#### Handling node outages

Follower failure: Catch-up recovery

• From its log, a follower knows the last transaction it processed.
• The follower can simply connect to the leader and request all changes since then.

Leader failure: Failover

• One of the followers needs to be promoted to be the new leader.
• Clients and other followers need to be reconfigured to use the new leader.
• Process
• Determine that the leader has failed.
• Usually via a timeout: a node that does not respond for some period is assumed dead.
• Choose a new leader.
• Elected or assigned by a previously elected controller node.
• Good if the new leader has the most up-to-date data changes.
• Consensus problem: getting all the nodes to agree on a new leader.
• Reconfiguring the system to use the new leader.
• Clients need to send the requests to the new leader.
• Old leader may come back and must be a new follower.
• Problems
• The new leader may not have received all writes from the old leader before it failed; when the old leader rejoins, those un-replicated writes conflict.
• The most common solution is to discard the old leader’s un-replicated writes.
• Discarding writes is dangerous if other systems outside the database coordinate with its contents.
• E.g., the new leader may reuse primary keys that the old leader already handed out, causing those other systems to return wrong data.
• Split brain: two nodes both believe they are leader.
• Shut down one.
• Choosing the right failover timeout is hard: too long means slow recovery, too short causes unnecessary failovers.

#### Implementation of replication logs

• Statement-based replication
• The leader logs every write statement (INSERT, UPDATE, DELETE) it executes and sends that statement log to its followers.
• Problems: the same statement may have different effects on different replicas.
• Non-deterministic functions in statements (NOW(), RAND()).
• Statements using an auto-incrementing column, or with a WHERE condition depending on existing data, must execute in exactly the same order on every replica.
• Statements with side effects (triggers, stored procedures, etc.) may be non-deterministic.
• Not preferred.
• Write-ahead log (WAL) shipping
• The WAL is an append-only sequence of all writes, used by both LSM-tree and B-tree storage engines.
• Besides writing the log to disk, the leader sends it to its followers over the network.
• Used in PostgreSQL and Oracle
• Problem
• Data log is pretty low level (for storage engine).
• Details like which bytes were changed in which disk blocks.
• Database software using a different storage format cannot replicate between leader and followers, so leader and followers typically must run the same version.
• Upgrading the database software may therefore require downtime.
• Logical (row-based) log replication
• Use different log formats for replication and for the storage engine.
• Decouple replication log from storage engine internals.
• Contents
• Insert: new values of all columns.
• Delete: info to uniquely identify the row.
• Update: unique row ID and new values of all columns.
• If a transaction modifies multiple rows,
• generate several log records,
• followed by a record committing this transaction.
• Backward compatible
• Leader and followers can run different versions of database software or storage engine.
• Logs can be easily parsed by external applications.
• Trigger-based replication
• Move replication up to the application layer.
• Cases
• Only replicate a subset of data.
• Replicate from one kind of database to another.
• Need conflict resolution logic.
• A trigger
• Register custom application code that is automatically executed when a data change happens.
• Greater overheads than other replication methods.
• But more flexible.
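For the logical (row-based) format described above, a transaction’s log records might look roughly like this (the field names are invented for the sketch and do not match any particular database’s format):

```python
# Illustrative shape of logical (row-based) log records; the field names
# are made up and do not match any particular database's format.

logical_log = [
    {"op": "insert", "table": "users", "row": {"id": 7, "name": "Ada"}},
    {"op": "update", "table": "users", "row_id": 7,
     "new_values": {"id": 7, "name": "Ada Lovelace"}},
    {"op": "delete", "table": "users", "row_id": 7},
    {"op": "commit", "txid": 42},  # record marking the transaction as committed
]

print([rec["op"] for rec in logical_log])
```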

### Problems with replication lag

• A single leader node handles all writes.
• Many follower nodes serve reads.
• This read-scaling architecture increases availability and read throughput, and reduces latency (followers can be geographically distributed).
• Works well for read-heavy workloads.
• Eventual consistency: Users may see different versions of data on different nodes (like leader and a follower), due to asynchronous writes. But it is just a temporary state.
• Replication lag: The delay between a write happening on the leader and being reflected on a follower.
• may be small or large.
• A large lag may introduce a noticeable inconsistency in practice.

#### Reading your own writes

• Case: a user submits some data and then views that same piece of data.
• Write goes to leader.
• Read from some node (may be an inconsistent follower).
• Read-after-write (read-your-writes) consistency: guarantee that the user always sees the updates they themselves submitted.
• No promises about other users’ updates.
• Implementation
• When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower.
• Track the time of the last update and, for one minute after the last update, make all reads from the leader.
• The client can remember the timestamp of its most recent write - then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp.
• The timestamp can be a logical timestamp (e.g., a log sequence number).
• Or the actual system clock, though physical clocks can be unreliable.
• A request may need to be routed to the datacenter containing leader.
• Cross-device read-after-write consistency
• The same user may access the services from multiple devices.
• More difficult to remember the timestamp of the user’s last update.
• May need to route requests from all devices to the same datacenter.
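The timestamp-tracking approach above can be sketched as follows (replica names and log positions are invented): a read is served by a follower only if it has replayed the log at least up to the user’s last write.

```python
# Read-your-writes routing sketch: serve a user's read from a replica only
# if that replica has replayed the log at least up to the timestamp of the
# user's last write; otherwise fall back to the leader.

def choose_replica(replicas, last_write_ts):
    """replicas: dict of replica name -> log position replayed so far."""
    for name, replayed_up_to in replicas.items():
        if replayed_up_to >= last_write_ts:
            return name
    return "leader"  # every follower is lagging behind the user's write

replicas = {"follower1": 10, "follower2": 17}
print(choose_replica(replicas, 15))  # follower2
print(choose_replica(replicas, 20))  # leader
```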

#### Monotonic reads

• It is possible for a user to see things moving backward in time: a read from a fresh replica followed by a read from a stale one.
• Monotonic reads prevent this anomaly.
• If one user makes several reads in sequence, will not see time go backward.
• Less than strong consistency
• Stronger than eventual consistency
• Achieving
• Each user always makes their reads from the same replica.
• chosen by hashing user ID.
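A minimal sketch of that routing rule (the hash of the user ID picks the replica; names are invented):

```python
# Monotonic-reads routing sketch: always send a given user's reads to the
# same replica, chosen by hashing the user ID. If that replica fails, the
# user must be rerouted (and monotonicity may briefly be violated).

import hashlib

def replica_for(user_id: str, replicas: list) -> str:
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

replicas = ["replica-a", "replica-b", "replica-c"]
# The same user always lands on the same replica:
assert replica_for("user-42", replicas) == replica_for("user-42", replicas)
print(replica_for("user-42", replicas))
```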

#### Consistent prefix reads

• Case: If some partitions are replicated slower than others, an observer may see the answer before they see the question.
• Guarantee: consistent prefix reads
• If a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
• Solution
• Make sure that any writes that are causally related to each other are written to the same partition.

#### Solutions for replication lag

• When designing, think about what if the replication lag increases to several minutes or hours.
• If that would be bad, design the system to provide stronger guarantees.
• Providing these guarantees in application code is complex and error-prone.
• It is better if applications can trust the database to do the right thing.
• Transactions
• Single-node transactions have been abandoned by many distributed databases.
• They are considered too expensive in terms of performance and availability.

### Multi-leader replication

• Multi-leader configuration, also called master-master or active/active replication.
• Allows more than one node to accept writes.
• Each leader simultaneously acts as a follower to the other leaders.

#### Use cases for multi-leader replication

• Multi-datacenter operation
• Having a leader in each datacenter.
• Within one datacenter, single-leader based.
• Across datacenters, each leader replicates its changes to leaders in other datacenters.
• Writes can be processed in a local datacenter.
• Inter-datacenter network delays are hidden from users.
• Tolerance of datacenter outages
• Tolerance of network problems
• Public internet is less reliable than local network within a datacenter.
• Downsides
• The same data may be concurrently modified in two different datacenters. Such conflicts must be resolved.
• Several pitfalls with other database features
• Auto-incrementing keys
• Triggers
• Integrity constraints
• Practice
• Multi-leader replication is often considered dangerous territory and should be avoided if possible.
• Examples
• Tungsten Replicator for MySQL
• GoldenGate for Oracle
• Clients with offline operation
• The application needs to continue to work while it is disconnected from the internet.
• Multiple devices act like local leaders with local databases, can work offline.
• Tricky to get right.
• Examples: CouchDB
• Collaborative editing
• Examples: Google Docs
• Changes are applied instantly on local replica and replicated asynchronously to server and other users.
• Requires conflict resolution.

#### Handling write conflicts

• Conflict detection is usually asynchronous.
• By the time a conflict is detected, it is too late to ask the user to resolve it.
• Synchronous detection is expensive and loses the main advantage of multi-leader replication; one might as well use single-leader replication.
• Conflict avoidance
• All writes for a particular record go through the same leader.
• But we may want to change the designated leader for a record.
• Conflicts may happen.
• Converging toward a consistent state
• All replicas must arrive at the same final value when all changes have been replicated.
• Solutions
• Pick the write with the highest UUID or the latest timestamp.
• Data may be lost.
• Record the conflicts and prompt the user.
• Custom conflict resolution logic
• On write
• Call the conflict handler as soon as database detects a conflict.
• e.g., Bucardo
• Multiple versions of data are returned to the application.
• Prompt the user or resolve automatically.
• e.g., CouchDB
• Conflict resolution applies to an individual row or document, not to a whole transaction.
• Can be complicated and error-prone.

#### Multi-leader replication topologies

• All-to-all
• Problem: Writes may arrive in the wrong order.
• Version vectors can be used to order events correctly.
• Circular
• Each node receives writes from one node and forwards those writes, plus its own writes, to another node.
• When a node receives a data change tagged with its own identifier, the change is ignored (it has already been processed).
• If one node fails, the topology must be reconfigured to route around it.
• e.g., default in MySQL
• Star
• One designated root node forwards writes to all of the other nodes.
• Can be generalized to a tree structure.
• If one node fails, the topology must be reconfigured to route around it.
• This may need to be done manually.

### Leaderless replication

• Any replica can directly accept writes from clients.
• Applications: Dynamo, Riak, Cassandra, Voldemort
• Dynamo-style
• Client may directly send writes to several replicas or do it through a coordinator node.
• The coordinator node does not enforce a particular ordering of writes.

#### Writing to the database when a node is down

• No failover
• Send writes to several nodes in parallel.
• The write is considered successful once a majority of the nodes acknowledge it.
• Read from several nodes.
• Use version number to determine which value is newer.
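A minimal sketch (not any specific database’s API) of picking the newest value among replies by version number:

```python
# Leaderless read sketch: query several replicas in parallel, then use a
# per-value version number to decide which reply is the newest.

def read_latest(replies):
    """replies: list of (version, value) pairs returned by the replicas."""
    return max(replies, key=lambda pair: pair[0])

replies = [(3, "banana"), (2, "apple"), (3, "banana")]  # one replica is stale
version, value = read_latest(replies)
print(version, value)  # 3 banana
```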

When an unavailable node comes back online, it needs to catch up on the missed writes.

Read repair and anti-entropy

• Read repair: when a client reads from several nodes in parallel, it can detect any stale responses.
• The client then writes the newer value back to the stale nodes.
• Anti-entropy process
• In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another.
• Without an anti-entropy process, values that are rarely read may be missing from some replicas and thus have reduced durability.

#### Quorums for reading and writing

• Quorum reads and writes
• n replicas
• Every write must be confirmed by w nodes.
• Every read must query at least r nodes.
• If w + r > n, we can expect to get an up-to-date value when reading.
• Practice
• n, w, r are configurable.
• e.g.,
• Make n odd.
• w = r = (n+1)/2
• We can tolerate some unavailable nodes while reading and writing.
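The overlap guarantee behind w + r > n can be checked exhaustively for a small n (the numbers below are just an example configuration):

```python
# For n = 5 with w = r = 3, every possible write quorum shares at least
# one node with every possible read quorum, since 3 + 3 > 5.

from itertools import combinations

n = 5
w = r = (n + 1) // 2  # 3

nodes = range(n)
overlap_always = all(
    set(write_set) & set(read_set)
    for write_set in combinations(nodes, w)
    for read_set in combinations(nodes, r)
)
print(overlap_always)  # True
```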

#### Limitations of quorum consistency

• Smaller w and r values (with w + r ≤ n) make stale reads more likely.
• But they give lower latency and higher availability.
• Even with w + r > n, there are edge cases where stale values are read:
• A sloppy quorum is used (writes and reads may not overlap).
• Two writes occur concurrently; if conflicts are resolved by timestamp (last write wins), writes can be lost due to clock skew.
• A write happens concurrently with a read; the write may be reflected on only some of the replicas that are read.
• A write that succeeded on fewer than w replicas is not rolled back on the replicas where it did succeed.
• A node carrying a new value fails and is restored from a replica carrying an old value, breaking the quorum condition.
• Or simply unlucky timing.
• Dynamo-style databases generally only guarantee eventual consistency.
• Stronger guarantees, such as reading your writes, monotonic reads, or consistent prefix reads, generally require transactions or consensus.

Monitoring staleness

• Be aware of the health of the replication.
• Easy for leader-based replication: measure replication lag by subtracting a follower’s current position in the replication log from the leader’s.
• Not easy for leaderless.
• Not a common practice yet.

#### Sloppy quorums and hinted handoff

• If a large number of nodes are unavailable, fewer than w or r nodes may be reachable from the client, and no quorum can be assembled.
• Sloppy quorum: the database accepts writes anyway and writes them to some nodes that are reachable.
• Still require w and r nodes.
• But not among the designated n “home” nodes.
• Hinted handoff: Once the network interruption is fixed, any temporary writes are sent to appropriate “home” nodes.
• Increase write availability
• But cannot be sure read the latest value even when w + r > n.

Multi-datacenter operation:

• Leaderless replication is suitable for multi-datacenter operation.
• Tolerate conflicting concurrent writes, network interruptions, latency spikes.
• Cassandra and Voldemort
• n is the number of replicas a record has across all datacenters.
• You can configure how many of those replicas live in each datacenter.
• A write is sent to all replicas.
• Client only waits for ACK from a quorum of nodes within its local datacenter.
• Unaffected by delays on cross-datacenter link.
• Inter-datacenter writes are usually asynchronous.
• Riak
• All communication between clients and database nodes are local to one datacenter.
• n is the number of replicas a record has within one datacenter.
• Cross-datacenter replication between database clusters happens asynchronously in the background, in a style that is similar to multi-leader replication.

#### Detecting concurrent writes

Concurrent writes (different clients writing to the same key on different nodes) can occur and cause conflicts. In Dynamo-style databases, conflicts can also arise during read repair or hinted handoff. We still need to guarantee eventual consistency.

Conflict resolve methods:

• Last write wins (LWW): discard all but the most “recent” concurrent write.
• Each replica only stores the most “recent” value.
• “Older” values can be overwritten and discarded.
• Attach timestamp.
• The only conflict resolution supported in Cassandra.
• A poor choice if losing data is not acceptable.
• The only safe way to use LWW:
• Ensure a key is written only once (e.g., use a UUID as the key) and treated as immutable thereafter.
• The “happens-before” relationship
• A happens before B if B knows about A, depends on A, or builds upon A.
• 3 possible relationships between two operations A and B
• A happens before B
• B happens before A
• A and B happen concurrently (neither knows about the other).
• Only concurrent operations need conflict resolution.
• An algorithm detects whether two writes are concurrent (p. 189).
• Each write includes the version number from the read that preceded it, telling the server which state the write is based on.
• Merging sibling (concurrently written) values
• LWW (prone to data loss).
• Take the union of the sibling values.
• Use a deletion marker (tombstone) so that data removed in one sibling stays removed after merging.
• Version vectors
• A version number per replica and per key.
• A replica increments its own version number when processing a write.
• It also keeps track of the version numbers it has seen from each of the other replicas.
• This information indicates which values to overwrite and which to keep as siblings.
• The collection of version numbers from all the replicas is called a version vector.
• Version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written.
• Dotted version vector in Riak 2.0.
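The comparisons above can be sketched by comparing two version vectors elementwise (the function name and replica IDs are invented for the sketch):

```python
# Version-vector comparison sketch: a version vector maps replica ID to a
# counter. One write happened before another iff its vector is <= the
# other's in every component; otherwise the writes are concurrent siblings.

def compare(a: dict, b: dict) -> str:
    """Return 'equal', 'before' (a happened before b), 'after', or 'concurrent'."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b may safely overwrite a
    if b_le_a:
        return "after"       # a may safely overwrite b
    return "concurrent"      # keep both as siblings and merge later

print(compare({"r1": 2, "r2": 1}, {"r1": 3, "r2": 1}))  # before
print(compare({"r1": 2}, {"r2": 1}))                    # concurrent
```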