The Trouble with Distributed Systems
- Networks
- Clocks and timing issues
Faults and partial failures
- Partial failure: In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine.
- Nondeterministic
- A system involving multiple nodes and a network may work, or it may fail in unpredictable ways.
- You may not even know whether an operation succeeded, because the time it takes for a message to travel across the network is also nondeterministic.
Cloud computing and supercomputing
Large-scale computing systems:
- High-performance computing
- Supercomputers with thousands of CPUs are typically used for computationally intensive scientific computing tasks.
- A job typically checkpoints the state of its computation to durable storage from time to time.
- If one node fails, a common solution is to simply stop the entire cluster workload (escalate partial failure to total failure). After the faulty node is repaired, the computation is restarted from the last checkpoint.
- Cloud computing
- Often associated with multi-tenant datacenters, commodity computers connected with an IP network (often Ethernet), elastic/on-demand resource allocation, and metered billing.
- Traditional enterprise datacenters
- Somewhere between these extremes.
How cloud computing differs from supercomputing
- Many internet-related applications are online, in the sense that they need to be able to serve users with low latency at any time.
- Making the service unavailable - for example, stopping the cluster for repair - is not acceptable.
- Nodes in cloud services are built from commodity machines, which can provide equivalent performance at lower cost due to economies of scale, but also have higher failure rates.
- Supercomputers are typically built from specialized hardware, where each node is quite reliable, and nodes communicate through shared memory and remote direct memory access (RDMA).
- Large datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth.
- Reasonable to assume something is always broken in a big system.
- Rolling upgrades are possible if the system can tolerate failed nodes.
- In a geographically distributed deployment, communication most likely goes over the internet, which is slow and unreliable compared to local networks.
Unreliable networks
- Shared-nothing systems
- A bunch of machines connected by a network.
- The network is the only way those machines can communicate.
- Each machine has its own memory and disk; one machine cannot access another machine’s memory or disk.
- Pros
- Comparatively cheap since it requires no special hardware.
- Make use of commoditized cloud computing services.
- High reliability through redundancy across multiple geographically distributed datacenters.
- Asynchronous packet networks
- One node can send a message to another node.
- No guarantees as to when it will arrive or whether it will arrive.
- Timeouts may be used to handle lost or delayed messages (see the sketch below).
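As a rough illustration, here is a minimal Java sketch of a request over an asynchronous network with a timeout. `sendRequest` is a hypothetical stand-in for a real remote call; returning a future that never completes simulates a lost request or response:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;

public class TimeoutDemo {
    // Hypothetical remote call. A future that never completes simulates a
    // request or response that was lost somewhere in the network.
    static CompletableFuture<String> sendRequest(String node, String msg) {
        return new CompletableFuture<>();
    }

    public static void main(String[] args) {
        CompletableFuture<String> reply =
                sendRequest("node-2", "ping").orTimeout(1, TimeUnit.SECONDS);
        try {
            System.out.println(reply.join());
        } catch (CompletionException e) {
            // A timeout does not tell us whether the request was processed;
            // it only tells us that no response arrived in time.
            System.out.println("no reply within 1s: node-2 is suspected down");
        }
    }
}
```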
Network faults in practice
- A variety of network faults occur surprisingly often, even in a datacenter operated by one company.
- Public cloud services are more prone to network faults than private ones.
- You do need to know how your software reacts to network problems and ensure that the system can recover from them.
- e.g., deliberately trigger network faults in testing, as Netflix's Chaos Monkey does
Detecting faults
- You may get some explicit feedback telling you that something is wrong.
- On the receiving machine, if no process is listening on the destination port, the OS will close or refuse TCP connections (with a RST or FIN packet).
- This doesn't help if the node crashed while it was handling the request.
- If a node process crashed but the OS is still running, a script can notify other nodes about the crash so that another node can take over quickly.
- e.g., HBase does this
- If you have access to the management interface of the network switches in your datacenter, you can query them to detect link failures at a hardware level.
- If a router knows that the IP address is unreachable, it will reply with an ICMP Destination Unreachable packet.
- Some limitations
- Even a TCP ACK only confirms that the packet reached the remote kernel; the application may have crashed before handling it, so an acknowledgment at the application level may be needed.
- Retry at the application level, with a timeout (see the sketch below).
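A minimal sketch of retrying with an application-level acknowledgment. `writeWithAck` is a hypothetical remote call, and the timeout and retry counts are arbitrary; because a timed-out write may still have been applied, retried requests must be idempotent:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RetryDemo {
    // Hypothetical call whose future completes only when the remote
    // application itself confirms it handled the request (a TCP ACK only
    // means the packet reached the remote kernel, not the application).
    static CompletableFuture<Void> writeWithAck(String node, byte[] data) {
        return new CompletableFuture<>(); // simulate a lost response
    }

    static void writeReliably(String node, byte[] data) throws Exception {
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                writeWithAck(node, data).get(2, TimeUnit.SECONDS);
                return; // application-level ACK received
            } catch (TimeoutException e) {
                // The write may or may not have been applied, so the
                // request must be idempotent (e.g., deduplicated by ID).
                System.out.println("attempt " + attempt + " timed out, retrying");
            }
        }
        throw new Exception(node + " unreachable after 3 attempts");
    }

    public static void main(String[] args) throws Exception {
        writeReliably("node-3", "hello".getBytes());
    }
}
```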
Timeouts and unbounded delays
- There are trade-offs between long and short timeouts.
- Prematurely declaring a node dead is problematic.
- If the node is actually alive, the same work may end up being performed twice.
- If the system is already under high load, transferring the presumed-dead node's load to other nodes makes the problem worse and can cause cascading failures.
- Networks have unbounded delays.
- Queueing delays
- On the network switch
- On the destination machine (if all CPU cores are busy, the OS queues the request)
- Due to virtual machine switching (while one VM is paused so another can run, incoming data is queued by the virtual machine monitor)
- TCP flow control
- TCP retransmission
- In public clouds and multi-tenant datacenters, a noisy neighbor can take up a lot of shared resources.
- Timeouts should be chosen experimentally, by measuring the distribution of network round-trip times.
- Timeout values can also be adjusted automatically over time.
- This can be done with a Phi Accrual failure detector (sketched below).
- Used in Akka and Cassandra; TCP retransmission timeouts are computed similarly.
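A simplified sketch of the idea behind a Phi Accrual failure detector. Instead of a binary alive/dead verdict, it reports a suspicion level phi that grows the longer a heartbeat is overdue. For brevity this sketch assumes exponentially distributed heartbeat intervals, giving phi(t) = t / (mean * ln 10); real implementations such as Akka's estimate a normal distribution over a sliding window:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PhiAccrualDetector {
    private final Deque<Long> intervals = new ArrayDeque<>(); // recent gaps, ms
    private long lastHeartbeat = System.currentTimeMillis();

    public synchronized void heartbeat() {
        long now = System.currentTimeMillis();
        intervals.addLast(now - lastHeartbeat);
        if (intervals.size() > 1000) intervals.removeFirst(); // sliding window
        lastHeartbeat = now;
    }

    // phi = -log10(P(a heartbeat still arrives after this much silence)).
    // With exponentially distributed intervals, P(T > t) = exp(-t / mean),
    // so phi(t) = t / (mean * ln 10).
    public synchronized double phi() {
        double mean = intervals.stream().mapToLong(Long::longValue)
                .average().orElse(1000.0); // assume 1s until we have data
        long elapsed = System.currentTimeMillis() - lastHeartbeat;
        return elapsed / (mean * Math.log(10));
    }

    public boolean isSuspected() {
        return phi() > 8.0; // suspicion threshold, tuned per deployment
    }
}
```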
Synchronous versus asynchronous networks
- Synchronous network
- Establishes a circuit for each connection, like a traditional telephone network.
- No queueing
- Reserved bandwidth
- The maximum latency is fixed.
- Good for transferring a fairly constant number of bits per second.
- Not good for bursty data transfers; need a good guess on the circuit bandwidth.
- Asynchronous network
- Packet-based, share the network bandwidth dynamically.
- e.g., TCP connection
- Hybrid approaches
- ATM supports both circuit switching and packet switching.
- InfiniBand: implements flow control at the link layer to reduce queueing in the network, but can still suffer delays due to link congestion.
- With quality of service (prioritization and scheduling of packets) and admission control (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or provide statistically bounded delay.
Unreliable clocks
- It takes time for a message to travel across the network, and the delay is unpredictable.
- Each machine has its own clock, which is not perfectly accurate.
- Synchronizing clocks
- Most common mechanism: the Network Time Protocol (NTP)
- Adjusts the clock according to the time reported by a group of servers, which in turn get their time from a more accurate source such as a GPS receiver.
Monotonic versus time-of-day clocks
Modern computers have at least two different kinds of clocks:
- Time-of-day clock
- Returns the current date and time according to some calendar (also known as wall-clock time).
- e.g., clock_gettime(CLOCK_REALTIME) on Linux, System.currentTimeMillis() in Java
- The number of seconds or milliseconds since the epoch: midnight UTC on January 1, 1970
- According to the Gregorian calendar, not counting leap seconds.
- Synchronized with NTP
- Problems
- If the local clock is too far ahead of the NTP server, it may be forcibly reset, appearing to jump back in time.
- This makes it unsuitable for measuring elapsed time.
- Monotonic clock
- Suitable for measuring duration.
- But the absolute value is meaningless.
- e.g., clock_gettime(CLOCK_MONOTONIC) on Linux, System.nanoTime() in Java
- Guaranteed to always move forward.
- NTP may adjust the frequency at which the monotonic clock moves forward (slewing it), but it cannot make the clock jump forward or backward.
- Well suited to measuring elapsed time in a distributed system (see the example below).
- Doesn’t assume any synchronization between different nodes.
- Not sensitive to slight inaccuracies of measurement.
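A small example contrasting the two Java clock APIs mentioned above. The time-of-day clock can be stepped by NTP between two readings, so differences between readings are unreliable as durations; the monotonic clock never jumps:

```java
public class ClockComparison {
    public static void main(String[] args) throws InterruptedException {
        long wallStart = System.currentTimeMillis(); // time-of-day clock
        long monoStart = System.nanoTime();          // monotonic clock

        Thread.sleep(100); // stand-in for some work

        // If NTP stepped the clock during the sleep, this could even be
        // negative; never use it to measure elapsed time.
        long wallElapsed = System.currentTimeMillis() - wallStart;

        // Guaranteed non-negative: the monotonic clock never jumps.
        long monoElapsed = System.nanoTime() - monoStart;

        System.out.println("wall-clock delta: " + wallElapsed + " ms (unreliable)");
        System.out.println("monotonic delta:  " + monoElapsed / 1_000_000 + " ms");
    }
}
```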
Clock synchronization and accuracy
- Monotonic clocks don’t need synchronization.
- But time-of-day clocks need to be set according to an NTP server or other external time source in order to be useful.
- Unfortunately, our methods for getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might hope - hardware clocks and NTP can be fickle beasts.
- Local hardware clocks are inaccurate.
- NTP synchronization can only be as good as the network delay.
- A node can be firewalled off from NTP servers.
- Leap seconds.
- Virtualized hardware clocks in virtual machines.
- Higher accuracy can be achieved with a GPS receiver or the Precision Time Protocol (PTP), plus careful deployment and monitoring.
Relying on synchronized clocks
It is dangerous to assume that clocks will work well. If software requires synchronized clocks, it is essential to monitor the clock offsets between all the machines; any node whose clock drifts too far from the others should be declared dead.
- Timestamps for ordering events
- Ordering events by timestamp may fail because clocks are not perfectly synchronized.
- This affects strategies like last write wins (LWW).
- Logical clocks (based on incrementing counters) are a safer alternative for ordering events.
- Clock readings have a confidence interval.
- Treat a clock reading as a range of time, not a point.
- Google's TrueTime API in Spanner explicitly reports the confidence interval.
- Synchronized clocks for global snapshots
- Snapshot isolation requires a monotonically increasing transaction ID.
- Spanner uses clock timestamps as transaction IDs.
- Safe as long as the confidence intervals of two timestamps don't overlap.
- Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction (see the sketch below).
- This ensures that any transaction that may read the data is at a sufficiently later time, so their confidence intervals don't overlap.
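A hypothetical sketch of Spanner-style commit wait, using milliseconds for simplicity. `TTInterval` and `trueTimeNow` are stand-ins for Google's TrueTime API, and the 7 ms uncertainty bound is just an illustrative figure; the point is that waiting out the uncertainty guarantees the commit timestamp is in the past on every node's clock before the transaction commits:

```java
public class CommitWait {
    // Stand-in for TrueTime's uncertainty interval [earliest, latest].
    record TTInterval(long earliestMillis, long latestMillis) {}

    static TTInterval trueTimeNow() {
        long now = System.currentTimeMillis();
        long epsilon = 7; // illustrative clock-uncertainty bound
        return new TTInterval(now - epsilon, now + epsilon);
    }

    static long commit() throws InterruptedException {
        // Pick the latest possible current time as the commit timestamp...
        long commitTimestamp = trueTimeNow().latestMillis();
        // ...then wait until that timestamp is definitely in the past,
        // i.e. earlier than the earliest possible current time. Any later
        // transaction's confidence interval can no longer overlap ours.
        while (trueTimeNow().earliestMillis() <= commitTimestamp) {
            Thread.sleep(1);
        }
        return commitTimestamp;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("committed at " + commit());
    }
}
```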
Process pauses
In a multi-node system, a single leader often holds a lease (a lock with a timeout) granted by the other nodes. The combination of timeouts and clocks causes some problems.
- Using a synchronized time-of-day clock
- The lease expiry time is set by another machine, but it is compared against the local clock on the leader; those clocks can be out of sync by more than a few seconds.
- Using the local monotonic clock
- What if the leader process is paused for a long time while holding the lease?
- A node in a distributed system must assume that its execution can be paused for a significant length of time at any point.
- Garbage collection, suspended virtual machines, context switches, synchronous disk I/O, swapping, etc. (see the sketch below)
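A sketch of the problematic lease-holding loop, adapted from the book's discussion; `Lease`, `getIncomingRequest`, and `process` are hypothetical stand-ins. The flaw is that the lease check and the request handling are not atomic:

```java
public class LeaseLoop {
    interface Lease {
        long expiryTimeMillis();
        boolean isValid();
        Lease renew();
    }

    // Hypothetical helpers standing in for real request handling.
    static Object getIncomingRequest() { return new Object(); }
    static void process(Object request) {}

    static void run(Lease lease) {
        while (true) {
            Object request = getIncomingRequest();
            // Renew if less than 10s remain. Note this compares an expiry
            // time set by another machine against the local clock.
            if (lease.expiryTimeMillis() - System.currentTimeMillis() < 10_000) {
                lease = lease.renew();
            }
            if (lease.isValid()) {
                // If the thread pauses here for, say, 15 seconds (a long GC,
                // a suspended VM), the request is processed even though the
                // lease has expired and another node may have taken over.
                process(request);
            }
        }
    }
}
```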
Response time guarantees:
- Hard real-time systems: In these systems, there is a specified deadline by which the software must respond; if it doesn’t meet the deadline, that may cause a failure of the entire system.
- Dynamic memory allocation may be restricted or disallowed.
- Garbage collection is not a good idea here.
- Very expensive to develop.
- Low throughput.
- Usually used in safety-critical devices.
- Garbage collection
- GC causes pauses, but we can reduce their impact.
- One way: treat GC pauses like brief planned outages of a node, and to let other nodes handle requests from clients while one node is collecting its garbage.
- The other: use the garbage collector only for short-lived objects (which are fast to collect) and to restart processes periodically, before they accumulate enough long-lived objects to require a full GC of long-lived objects.
Knowledge, truth and lies
System model: we state our assumptions about how the system may behave, and design the actual system so that it meets those assumptions.
The truth is defined by the majority
- A node cannot necessarily trust its own judgment of a situation.
- If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive.
- Some systems require there to be only one of something.
- e.g., a single leader for a partition, a single holder of a lock
- A node may falsely believe it is "the chosen one" even after a quorum has declared it dead.
- This causes problems if it keeps acting in that role.
- Fencing tokens on the server or database side prevent such a node from corrupting data.
- e.g., every time the lock server grants a lock or lease, it also returns a fencing token (a monotonically increasing ID).
- Every time a client sends a write request to the storage service, it must include its current fencing token; the storage service rejects any token lower than one it has already processed (see the sketch below).
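A minimal sketch of fencing-token enforcement on the storage side; all names here are illustrative. The storage service remembers the highest token it has processed for each key and rejects anything older, which neutralizes a paused "zombie" client that still believes it holds the lock:

```java
import java.util.HashMap;
import java.util.Map;

public class FencedStorage {
    private final Map<String, Long> highestToken = new HashMap<>();
    private final Map<String, byte[]> data = new HashMap<>();

    public synchronized void write(String key, byte[] value, long fencingToken) {
        long seen = highestToken.getOrDefault(key, -1L);
        if (fencingToken < seen) {
            // A client that held the lock before a long pause is trying to
            // write with a stale token: reject the write.
            throw new IllegalStateException(
                    "stale fencing token " + fencingToken + " < " + seen);
        }
        highestToken.put(key, fencingToken);
        data.put(key, value);
    }
}
```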
Byzantine faults
- Byzantine fault: distributed systems problems become much harder if there is a risk that nodes may "lie" (send arbitrary faulty or corrupted responses).
- A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network.
- We can assume there are no Byzantine faults if all nodes in the datacenter are controlled by your organization and radiation levels are low enough that memory corruption is not a major problem.
System model and reality
Timing assumptions:
- Synchronous model
- Bounded network delay, process pauses, clock error
- Not realistic
- Partially synchronous model
- Behaves like the synchronous model most of the time.
- But sometimes exceeds the bounds.
- Asynchronous model
- Not allowed to make any timing assumptions.
Node failure assumptions:
- Crash-stop faults
- A node can fail only by crashing.
- It stops responding and is then gone forever.
- Crash-recovery faults
- Nodes may crash at any moment, and perhaps start responding again after some unknown time.
- Stable storage is preserved across crashes.
- In-memory data can be lost.
- Byzantine (arbitrary) faults
- Nodes may do absolutely anything, including trying to trick and deceive other nodes.
For modeling real systems, the partially synchronous model with crash-recovery faults is generally the most useful model.
- Correctness of an algorithm
- To define what it means for an algorithm to be correct, we can describe its properties.
- An algorithm is correct in some system model if it always satisfies its properties in all situations that we assume may occur in that system model.
- Safety and liveness
- Two kinds of properties
- Safety: nothing bad happens.
- If a safety property is violated, we can point at a particular point in time at which it was broken. After a safety property has been violated, the violation cannot be undone — the damage is already done.
- Liveness: something good eventually happens.
- It may not hold at some point in time (for example, a node may have sent a request but not yet received a response), but there is always hope that it may be satisfied in the future (namely by receiving a response).
- Safety: nothing bad happens.
- Strategies
- Require that safety properties always hold.
- Can make caveats for liveness properties.
- e.g., a request needs to receive a response only if a majority of nodes have not crashed, and only if the network eventually recovers from an outage.
- Two kinds of properties
- System models are not perfect; real implementations can still fail due to unusual circumstances that the model did not anticipate.
- Theoretical analysis and empirical testing are equally important.