All chapters
- Reliable, Scalable, and Maintainable Applications
- Data Models and Query Languages
- Storage and Retrieval
- Encoding and Evolution
- Replication
- Partitioning
- Transactions
- The Trouble with Distributed Systems
- Consistency and Consensus
- Batch Processing
- Stream Processing
- The Future of Data Systems
Replication
Keep copies of the same data on multiple machines connected via a network.
- To keep data geographically close to the users.
- Reduce latency
- To keep the system working even if some of its parts have failed.
- Increase availability
- To scale out the number of machines that can serve read queries.
- Increase read throughput
Leaders and followers
- Replica: each node storing a copy of the database.
- Each write to the database needs to be processed by every replica.
- Leader-based replication
- One replica is designated as the leader.
- Clients send write requests to the leader.
- Other replicas are followers.
- Leader sends changes to all of its followers as part of a replication log or change stream.
- Client can query any replica.
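A minimal sketch of this arrangement in Python (the `Leader` and `Follower` classes are hypothetical, not any real database's API): the leader applies each write locally, appends it to a replication log, and streams it to the followers, which apply changes in log order.

```python
# Minimal sketch of leader-based replication (hypothetical classes).

class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        # Followers apply changes in the same order as the leader's log.
        self.data[key] = value

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []  # replication log: an ordered list of changes
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))
        for follower in self.followers:  # stream the change to every follower
            follower.apply(key, value)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "alice")
assert all(f.data["user:1"] == "alice" for f in followers)  # reads can go to any replica
```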
Synchronous versus Asynchronous replication
- Synchronous replication
- Process
- Leader sends the change to follower 1.
- Follower 1 confirms the change and sends an acknowledgment back to the leader.
- The leader then reports success to the user and makes the change visible to other clients.
- Advantage
- The follower is guaranteed to have an up-to-date copy of the data, matching the leader.
- Disadvantage
- Writes cannot be processed if the follower doesn’t respond.
- All writes are blocked by one non-responding replica.
- Practice
- Make only one follower synchronous (semi-synchronous); the other followers are asynchronous.
- Asynchronous replication
- Leader sends the message but doesn’t wait for a response from the follower.
- Advantage
- Leader can continue processing writes, even if all followers have failed.
- Disadvantage
- If the leader fails and is not recoverable, any writes not yet replicated to followers are lost.
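A sketch contrasting the two modes (the classes and methods are hypothetical; a real system sends these messages over the network): the synchronous path blocks until the follower acknowledges, so one unresponsive follower blocks every write, while the asynchronous path reports success immediately.

```python
# Sketch: synchronous vs. asynchronous replication on the leader (hypothetical classes).

class Follower:
    def __init__(self, responsive=True):
        self.responsive = responsive
        self.data = {}

    def send(self, key, value):
        # Asynchronous path: deliver if possible, never block the leader.
        if self.responsive:
            self.data[key] = value

    def send_and_wait_for_ack(self, key, value):
        # Synchronous path: the leader blocks until this returns.
        if not self.responsive:
            raise TimeoutError("follower did not acknowledge")
        self.data[key] = value

def write_sync(leader_data, followers, key, value):
    leader_data[key] = value
    for f in followers:
        f.send_and_wait_for_ack(key, value)  # one dead follower blocks all writes
    return "ok"  # the follower is guaranteed to be up to date

def write_async(leader_data, followers, key, value):
    leader_data[key] = value
    for f in followers:
        f.send(key, value)  # fire and forget; the follower may lag
    return "ok"  # reported to the user before any follower confirms

leader = {}
ok_follower, dead_follower = Follower(), Follower(responsive=False)
write_async(leader, [ok_follower, dead_follower], "k", "v")  # succeeds anyway
try:
    write_sync(leader, [ok_follower, dead_follower], "k", "v")
except TimeoutError:
    pass  # blocked by the unresponsive follower
```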
Setting up new followers
- Purpose
- To increase number of replicas
- To replace failed nodes
- Problem
- Getting an accurate copy of the leader's data onto the new node while the leader keeps processing writes.
- Process
- Take a consistent snapshot of the leader’s database at some point in time.
- Copy the snapshot to the new follower node.
- The follower connects to the leader and requests all data changes that have happened since the snapshot was taken. (The snapshot must be associated with an exact position in the leader’s replication log.)
- The new follower can now work just like other followers.
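A toy sketch of steps 2-4, assuming the replication log is a simple list of key-value changes and `snapshot_position` marks the exact log position at which the snapshot was taken (both hypothetical simplifications):

```python
# Sketch of bootstrapping a new follower from a snapshot plus log catch-up.
import copy

def bootstrap_follower(leader_log, snapshot, snapshot_position):
    data = copy.deepcopy(snapshot)  # step 2: copy the snapshot
    # Step 3: request all changes since the snapshot position.
    for key, value in leader_log[snapshot_position:]:
        data[key] = value
    return data  # step 4: caught up; proceed like any other follower

# Snapshot taken when the log had 2 entries; catch up from position 2 onward.
log = [("a", 1), ("b", 2), ("a", 3)]
follower = bootstrap_follower(log, {"a": 1, "b": 2}, snapshot_position=2)
assert follower == {"a": 3, "b": 2}
```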
Handling node outages
Follower failure: Catch-up recovery
- From its log, a follower knows the last transaction it processed before the fault.
- The follower can then connect to the leader and request all changes since that point.
Leader failure: failover
- One of the followers needs to be promoted to be the new leader.
- Clients and other followers need to know this change.
- Process
- Determine that the leader has failed.
- Timeout
- Choose a new leader.
- Elected or assigned by a previously elected controller node.
- The best candidate is usually the replica with the most up-to-date data from the old leader.
- Consensus problem: getting all the nodes to agree on a new leader.
- Reconfiguring the system to use the new leader.
- Clients need to send the requests to the new leader.
- The old leader may come back; it must then become a follower of the new leader.
- Problems
- Conflicts between writes on the new leader and un-replicated writes still held by the old leader when it rejoins.
- Those un-replicated writes are commonly just discarded.
- Discarding writes is dangerous if other systems coordinate with the database’s contents.
- e.g., the lagging new leader may reuse primary keys that the old leader already assigned, corrupting references held by external systems.
- Split brain: two nodes both believe they are leader.
- Shut down one.
- Choosing a proper timeout is hard: too short triggers unnecessary failovers, too long means slow recovery (see the sketch after this list).
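A sketch of the timeout-based detection step (the heartbeat mechanism and the 30-second default are hypothetical illustrations, not prescribed values):

```python
# Sketch: presume the leader dead if no heartbeat arrives within a timeout.
import time

class FailureDetector:
    def __init__(self, timeout_s=30.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def leader_presumed_dead(self):
        # Too short a timeout triggers unnecessary failovers under load
        # spikes; too long a timeout means slow recovery.
        return time.monotonic() - self.last_heartbeat > self.timeout_s
```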
Implementation of replication logs
- Statement-based replication
- The leader logs every write statement it executes (INSERT, UPDATE, DELETE) and sends that statement log to its followers.
- Problems: the same statement may have different effects on different replicas
- Non-deterministic functions in the statement (`NOW()`, `RAND()`).
- Statements that use an auto-incrementing column, or that have a `WHERE` condition depending on existing data in the database; they must execute in exactly the same order on every replica.
- Statements with side effects (triggers, stored procedures, etc.).
- Not preferred.
- Write-ahead log (WAL) shipping
- Both LSM-trees and B-trees already maintain such an append-only log of writes.
- The leader writes every change to its on-disk log and also sends the log to its followers over the network.
- Used in PostgreSQL and Oracle
- Problem
- The log describes data at a very low level (tied to the storage engine).
- Details like which bytes were changed in which disk blocks.
- A leader and followers running database software with different storage formats (e.g., different versions) typically cannot replicate to each other.
- Upgrading the database software may therefore require downtime.
- Logical (row-based) log replication (a sketch of the record format follows this list)
- Use different log formats for replication and for the storage engine.
- Decouple replication log from storage engine internals.
- Contents
- Insert: new values of all columns.
- Delete: info to uniquely identify the row.
- Update: unique row ID and new values of all columns.
- If a transaction modifies multiple rows,
- generate several log records,
- followed by a record committing this transaction.
- Backward compatible
- Leader and followers can run different versions of database software or storage engine.
- Logs can be easily parsed by external applications.
- Trigger-based replication
- Move replication up to the application layer.
- Cases
- Only replicate a subset of data.
- Replicate from one kind of database to another.
- Need conflict resolution logic.
- A trigger
- Register custom application code that is automatically executed when a data change happens.
- Greater overhead than other replication methods, and more prone to bugs.
- But more flexible.
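A sketch of what logical (row-based) log records could look like, following the contents listed above; the field names are hypothetical, not any particular database's format.

```python
# Sketch of logical (row-based) replication log records.

def insert_record(table, columns):
    return {"op": "insert", "table": table, "columns": columns}

def delete_record(table, row_id):
    return {"op": "delete", "table": table, "row_id": row_id}  # enough to identify the row

def update_record(table, row_id, columns):
    return {"op": "update", "table": table, "row_id": row_id, "columns": columns}

# A transaction touching several rows becomes several records plus a commit marker.
txn = [
    insert_record("users", {"id": 7, "name": "alice"}),
    update_record("accounts", row_id=3, columns={"balance": 100}),
    {"op": "commit", "txn_id": 42},
]
```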
Problems with replication lag
- Leader-based replication
- A single node for writes
- Many nodes for reads
- Increases availability and scalability, and reduces latency (when geographically distributed).
- Works well for read-mostly workloads.
- Eventual consistency: Users may see different versions of data on different nodes (like leader and a follower), due to asynchronous writes. But it is just a temporary state.
- Replication lag: The delay between a write happening on the leader and being reflected on a follower.
- may be small or large.
- A large lag may introduce a noticeable inconsistency in practice.
Reading your own writes
- Case: Let a user submit some data and then view this piece of data.
- Write goes to leader.
- Read from some node (may be an inconsistent follower).
- Read-after-write consistency (read-your-writes)
- Guarantees that the user always sees their own submitted updates.
- No promise about other users.
- Implementation (a routing sketch follows this list)
- When reading something that the user may have modified, read it from the leader; otherwise, read it from a follower.
- Track the time of the last update and, for one minute after the last update, make all reads from the leader.
- The client can remember the timestamp of its most recent write - then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp.
- Use a logical timestamp (e.g., the log sequence number).
- Or the actual system clock, though clocks can be unreliable.
- If replicas are distributed across multiple datacenters, any request that must be served by the leader has to be routed to the datacenter containing the leader.
- Cross-device read-after-write consistency
- The same user may access the services from multiple devices.
- Remembering the timestamp of the user’s last update is harder, because that metadata must be shared across devices (centralized).
- May need to route requests from all devices to the same datacenter.
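A sketch combining two of the implementation ideas above: writes always go to the leader, and reads are routed to the leader for one minute after the user's last write (the `Router` class and its parameters are hypothetical).

```python
# Sketch of read-after-write routing (hypothetical class).
import time

class Router:
    def __init__(self, leader, followers, window_s=60.0):
        self.leader, self.followers = leader, followers
        self.window_s = window_s
        self.last_write = {}  # user_id -> time of that user's last write

    def write(self, user_id, key, value):
        self.last_write[user_id] = time.monotonic()
        self.leader[key] = value  # writes always go to the leader

    def read(self, user_id, key):
        since_write = time.monotonic() - self.last_write.get(user_id, float("-inf"))
        replica = self.leader if since_write < self.window_s else self.followers[0]
        return replica.get(key)

router = Router(leader={}, followers=[{}])
router.write("u1", "profile", "new bio")
assert router.read("u1", "profile") == "new bio"  # served by the leader
```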
Monotonic reads
- It’s possible for a user to see things moving backward in time.
- Monotonic reads prevent this anomaly.
- If one user makes several reads in sequence, they will not see time go backward.
- A weaker guarantee than strong consistency.
- But stronger than eventual consistency.
- Achieving
- Each user always makes their reads from the same replica.
- e.g., the replica can be chosen by hashing the user ID.
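A sketch of the replica-pinning idea (replica names are placeholders): hashing the user ID gives each user a stable replica, so successive reads cannot move backward in time unless that replica fails.

```python
# Sketch: pin each user's reads to one replica chosen by hashing the user ID.
import hashlib

def replica_for(user_id, replicas):
    digest = hashlib.sha256(user_id.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

replicas = ["replica-a", "replica-b", "replica-c"]
# The same user always reads from the same replica.
assert replica_for("user-42", replicas) == replica_for("user-42", replicas)
```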
Consistent prefix reads
- Case: If some partitions replicate more slowly than others, an observer may see the answer before they see the question.
- Guarantee: consistent prefix reads
- If a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
- Solution
- Make sure that any writes that are causally related to each other are written to the same partition.
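A sketch of the same-partition routing idea, assuming causally related writes share a key such as a conversation ID (a hypothetical choice of partition key):

```python
# Sketch: causally related writes go to one partition, which applies them in order.
import hashlib

def partition_for(conversation_id, num_partitions):
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# A question and its answer carry the same conversation ID, so they land on
# the same partition and no reader sees the answer before the question.
assert partition_for("conv-123", 8) == partition_for("conv-123", 8)
```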
Solutions for replication lag
- When designing a system, think about what happens if the replication lag increases to several minutes or hours.
- If it’s bad,
- consider providing stronger guarantees.
- Providing such guarantees in application code is complex and error-prone.
- It would be better to trust the database to do the right thing.
- Transactions
- Single-node transactions have existed for a long time, but many distributed systems have abandoned them,
- deeming them too expensive in terms of performance and availability.
Multi-Leader replication
- Multi-leader configuration, master-master or active/active replication
- Allows more than one node to accept writes.
- Each leader simultaneously acts as a follower to the other leaders.
Use cases for multi-leader replication
- Multi-datacenter operation
- Having a leader in each datacenter.
- Within one datacenter, single-leader based.
- Across datacenters, each leader replicates its changes to leaders in other datacenters.
- Advantages
- Writes can be processed in a local datacenter.
- Inter-datacenter network delays are hidden from users.
- Tolerance of datacenter outages
- Tolerance of network problems
- Public internet is less reliable than local network within a datacenter.
- Downsides
- The same data may be concurrently modified in two different datacenters. Such conflicts must be resolved.
- Several pitfalls with other database features
- Auto-incrementing keys
- Triggers
- Integrity constraints
- Practice
- It is dangerous and should be avoided if possible.
- Examples
- Tungsten Replicator for MySQL
- GoldenGate for Oracle
- Clients with offline operation
- The application needs to continue to work while it is disconnected from the internet.
- Every device acts like a local leader with its own local database, so it can keep working offline.
- Tricky to get right.
- Examples: CouchDB
- Collaborative editing
- Examples: Google Docs
- Changes are applied instantly on local replica and replicated asynchronously to server and other users.
- Requires conflict resolution.
Handling write conflicts
- Conflict detection is usually asynchronous.
- By the time a conflict is detected, it is usually too late to ask the user to resolve it.
- Synchronous conflict detection loses the main advantage of multi-leader replication; one might as well use single-leader replication.
- Conflict avoidance
- All writes for a particular record go through the same leader.
- But we may want to change the designated leader for a record.
- Conflicts may happen.
- Converging toward a consistent state
- All replicas must arrive at the same final value when all changes have been replicated.
- Solutions
- Pick the write with the highest unique ID or latest timestamp (last write wins; see the sketch after this list).
- Data may be lost.
- Record the conflicts and prompt the user.
- Custom conflict resolution logic
- On write
- Call the conflict handler as soon as database detects a conflict.
- e.g., Bucardo
- On read
- Multiple versions of data are returned to the application.
- Prompt the user or resolve automatically.
- e.g., CouchDB
- Conflict resolution applies to an individual row or document, not to a whole transaction.
- Can be complicated and error-prone.
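A sketch of the timestamp-based convergence rule (last write wins); the write tuples are hypothetical, and the comments note the data-loss hazard:

```python
# Sketch of last-write-wins conflict resolution.

def resolve(conflicting_writes):
    # Each write is (timestamp, unique_id, value); ties are broken by unique_id.
    return max(conflicting_writes)[2]

writes = [(1700000000.1, "w-a", "blue"), (1700000000.2, "w-b", "green")]
assert resolve(writes) == "green"  # the "blue" write is silently discarded
```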
Multi-leader replication topologies
- All-to-all
- Problem: Writes may arrive in the wrong order.
- Version vectors can be used to order events correctly.
- Problem: Writes may arrive in the wrong order.
- Circular
- Each node receives writes from one node and forwards those writes (plus its own writes) to another node.
- When a node receives a data change tagged with its own identifier, the change is ignored, because it has already been processed (see the sketch after this list).
- The topology can be reconfigured if a node fails.
- e.g., default in MySQL
- Star
- One designated root node forwards writes to all of the other nodes.
- Can be generalized to a tree structure.
- The topology can be reconfigured if a node fails.
- May need to be done manually.
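A sketch of the tagging rule used in circular topologies (functions and node IDs hypothetical): every change carries the identifier of the node where it originated, and is dropped once it comes back around to that node.

```python
# Sketch: ignore changes tagged with this node's own identifier.

data = {}

def apply_locally(node_id, key, value):
    data.setdefault(node_id, {})[key] = value

def on_receive(node_id, change, forward):
    origin, key, value = change
    if origin == node_id:
        return  # the write has gone full circle; already processed here
    apply_locally(node_id, key, value)
    forward(change)  # pass it along to the next node in the ring

forwarded = []
on_receive("n2", ("n1", "k", "v"), forward=forwarded.append)  # applied + forwarded
on_receive("n1", ("n1", "k", "v"), forward=forwarded.append)  # origin: ignored
assert forwarded == [("n1", "k", "v")] and "n1" not in data
```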
Leaderless replication
- Any replica can directly accept writes from clients.
- Applications: Dynamo, Riak, Cassandra, Voldemort
- Dynamo-style
- Client may directly send writes to several replicas or do it through a coordinator node.
- The coordinator node does not enforce a particular ordering of writes.
Writing to the database when a node is down
- No failover
- Send writes to several nodes in parallel.
- A write is considered successful once ok responses are received from enough of the nodes.
- Read from several nodes.
- Use version number to determine which value is newer.
When an unavailable node comes back online, it needs to catch up on the missed writes.
- Read repair
- When a client makes a read from several nodes in parallel, it can detect any stale responses.
- Write the new value back to the stale nodes (see the sketch after this list).
- Anti-entropy process
- In addition, some datastores have a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another.
- Without an anti-entropy process, values that are rarely read may be missing from some replicas and thus have reduced durability.
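A sketch of read repair, assuming each replica stores a `(version, value)` pair and higher versions are newer (a simplification; real systems use the schemes under “Detecting concurrent writes” below):

```python
# Sketch: read from several replicas, return the newest value,
# and write it back to any stale replicas.

def read_with_repair(replicas, key):
    responses = [(r.get(key, (0, None)), r) for r in replicas]
    (latest_version, latest_value), _ = max(responses, key=lambda x: x[0][0])
    for (version, _), replica in responses:
        if version < latest_version:
            replica[key] = (latest_version, latest_value)  # repair the stale copy
    return latest_value

r1 = {"k": (2, "new")}
r2 = {"k": (1, "old")}  # this replica missed the latest write
assert read_with_repair([r1, r2], "k") == "new"
assert r2["k"] == (2, "new")  # stale replica repaired as a side effect
```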
Quorums for reading and writing
- Quorum reads and writes
- `n` replicas.
- Every write must be confirmed by `w` nodes.
- Every read must query at least `r` nodes.
- If `w + r > n`, we can expect to get an up-to-date value when reading (see the check after this list).
- Practice
- `n`, `w`, `r` are configurable.
- e.g., make `n` odd and `w = r = (n + 1) / 2` (rounded up).
- We can tolerate some unavailable nodes while reading and writing.
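The quorum condition reduces to a one-line check; the example uses the common configuration described above.

```python
# Sketch: w + r > n guarantees the read set and write set overlap in at
# least one node, so some queried node has the latest value.

def quorum_overlaps(n, w, r):
    return w + r > n

n = 5
w = r = (n + 1) // 2  # common choice: n odd, w = r = (n + 1) / 2
assert quorum_overlaps(n, w, r)      # 3 + 3 > 5
assert not quorum_overlaps(5, 2, 2)  # smaller w and r risk stale reads
```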
Limitations of quorum consistency
- Using smaller `w` and `r` values:
- Risk of reading stale values.
- But lower latency and higher availability.
- Even with `w + r > n`, there are edge cases where stale values are read:
- A sloppy quorum is used.
- Two writes occur concurrently; a merge that picks a winner by timestamp can drop writes due to clock skew.
- A write happens concurrently with a read. The most recent write exists at only a few nodes.
- Writes that failed on some nodes are not rolled back on the nodes where they succeeded.
- If a node carrying a new value fails and is restored from a replica carrying an old value, the number of replicas storing the new value may fall below `w`.
- Unlucky with the timing.
- Dynamo-style databases generally only guarantee eventual consistency.
- The following guarantees are not satisfied; they generally require transactions or consensus:
- Reading your own writes
- Monotonic reads
- Consistent prefix reads
Monitoring staleness
- Be aware of the health of the replication.
- Easy for leader-based replication: measure replication lag by subtracting a follower’s position in the replication log from the leader’s current position.
- Not easy for leaderless.
- Not a common practice yet.
Sloppy quorums and hinted handoff
- If a large number of nodes are unavailable, fewer than `w` or `r` nodes may be reachable from the client, so it can no longer assemble a quorum.
- Sloppy quorum: the database accepts writes anyway, writing them to some reachable nodes that are not necessarily among the normal `n` nodes.
- Still requires `w` and `r` successful responses.
- But they may not be among the designated `n` “home” nodes.
- Hinted handoff: once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes (see the sketch below).
- Increase write availability
- But reads cannot be guaranteed to return the latest value, even when `w + r > n`.
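A sketch of a sloppy-quorum write with a handoff hint (all structures hypothetical): each unreachable home node gets a substitute, and the hint records where the value should eventually be handed off.

```python
# Sketch: write to substitutes when home nodes are unreachable, tagging each
# temporary write with a hint naming its real home.

stores = {}

def node_store(node, key, value, hint=None):
    stores.setdefault(node, []).append((key, value, hint))

def sloppy_write(key, value, home_nodes, reachable, substitutes):
    subs = list(substitutes)
    for node in home_nodes:
        if node in reachable:
            node_store(node, key, value)
        else:
            node_store(subs.pop(), key, value, hint=node)  # handed off later

sloppy_write("k", "v", home_nodes=["n1", "n2"], reachable={"n1"}, substitutes=["n9"])
assert stores["n9"] == [("k", "v", "n2")]  # once n2 recovers, this write moves to n2
```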
Multi-datacenter operation:
- Leaderless replication is suitable for multi-datacenter operation.
- Tolerate conflicting concurrent writes, network interruptions, latency spikes.
- Cassandra and Voldemort
- `n` is the number of replicas for a record across all datacenters.
- You can configure how many of those replicas you want in each datacenter.
- A write is sent to all replicas.
- Client only waits for ACK from a quorum of nodes within its local datacenter.
- Unaffected by delays on cross-datacenter link.
- Inter-datacenter writes are usually asynchronous.
- Riak
- All communication between clients and database nodes is local to one datacenter.
- `n` is the number of replicas for a record within one datacenter.
- Cross-datacenter replication between database clusters happens asynchronously in the background, in a style similar to multi-leader replication.
Detecting concurrent writes
Concurrent writes (different clients writing to the same key on different nodes) can cause conflicts. In Dynamo-style databases, conflicts can also arise during read repair or hinted handoff. To achieve eventual consistency, the replicas must converge toward the same value.
Conflict resolve methods:
- Last write wins (LWW): discard concurrent writes
- Each replica only stores the most “recent” value.
- “Older” values can be overwritten and discarded.
- Attach timestamp.
- The only conflict resolution supported in Cassandra.
- A poor choice if losing data is not acceptable.
- The only safe way to use it:
- Ensure each key is written only once (e.g., use a UUID as the key) and treat it as immutable afterwards.
- The “happens-before” relationship
- A happens before B if B knows about A, depends on A, or builds upon A.
- 3 possible relationships between two operations A and B
- A happens before B
- B happens before A
- A and B happen concurrently (neither knows about the other)
- Needs to resolve conflicts
- Algorithm to detect if two writes are concurrent (p. 189).
- Each write includes the version number from the read that preceded it.
- Merge sibling concurrent values.
- LWW
- Take the union.
- Use tombstones to mark removed data while merging (see the merge sketch at the end of this section).
- Version vectors
- A version number per replica and per key.
- A replica increments its own version number while processing a write.
- It also keeps track of the version numbers it has seen from each of the other replicas.
- In order to know which values to overwrite or keep as siblings.
- The collection of version numbers from all the replicas is called a version vector.
- Version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written.
- Dotted version vector in Riak 2.0.
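A sketch of the union-plus-tombstones merge mentioned above, in the style of a shopping-cart example (the data layout is a hypothetical simplification):

```python
# Sketch: merge concurrent siblings by union, with tombstones so that
# removed items are not resurrected by the merge.

def merge_siblings(siblings):
    # Each sibling is (items: set, tombstones: set).
    items, tombstones = set(), set()
    for sib_items, sib_tombstones in siblings:
        items |= sib_items
        tombstones |= sib_tombstones
    return items - tombstones, tombstones  # deletions survive the merge

# One sibling removed "milk" (tombstone); the other concurrently added "eggs".
cart_a = ({"bread"}, {"milk"})
cart_b = ({"bread", "milk", "eggs"}, set())
merged, _ = merge_siblings([cart_a, cart_b])
assert merged == {"bread", "eggs"}  # union of additions, minus the tombstoned item
```

And a sketch of comparing two version vectors to decide among the three possible relationships (replica IDs hypothetical): if one vector dominates the other there is a happens-before relationship; otherwise the writes are concurrent and must be kept as siblings or merged.

```python
# Sketch: compare version vectors (dicts of replica -> counter).

def compare(vv_a, vv_b):
    replicas = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(r, 0) <= vv_b.get(r, 0) for r in replicas)
    b_le_a = all(vv_b.get(r, 0) <= vv_a.get(r, 0) for r in replicas)
    if a_le_b and not b_le_a:
        return "A happens before B"  # B has seen everything A has
    if b_le_a and not a_le_b:
        return "B happens before A"
    return "equal" if a_le_b else "concurrent"  # neither dominates: siblings

assert compare({"r1": 1}, {"r1": 2}) == "A happens before B"
assert compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 1}) == "concurrent"
```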