2018-10-25

## The Trouble with Distributed Systems

• Networks
• Clocks and timing issues

### Faults and partial failures

• Partial failure: In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine.
• Nondeterminism
• Individual nodes and the network may work or fail unpredictably.
• You may not even know whether something succeeded; the time it takes for a message to travel across a network is also nondeterministic.

#### Cloud computing and supercomputing

Large-scale computing systems:

• High-performance computing
• Supercomputers with thousands of CPUs are typically used for computationally intensive scientific computing tasks.
• A job typically checkpoints the state of its computation to durable storage from time to time.
• If one node fails, a common solution is to simply stop the entire cluster workload (escalate partial failure to total failure). After the faulty node is repaired, the computation is restarted from the last checkpoint.
• Cloud computing
• Often associated with multi-tenant datacenters, commodity computers connected with an IP network (often Ethernet), elastic/on-demand resource allocation, and metered billing.
• Traditional enterprise datacenters lie somewhere between these extremes.

Differences between cloud computing and supercomputing:

• Many internet-related applications are online, in the sense that they need to be able to serve users with low latency at any time.
• Making the service unavailable - for example, stopping the cluster for repair - is not acceptable.
• Nodes in cloud services are built from commodity machines, which can provide equivalent performance at lower cost due to economies of scale, but also have higher failure rates.
• Supercomputers are typically built from specialized hardware, where each node is quite reliable, and nodes communicate through shared memory and remote direct memory access (RDMA).
• Large datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth.
• Reasonable to assume something is always broken in a big system.
• If the system can tolerate failed nodes, rolling upgrades can be performed without taking down the whole service.
• In a geographically distributed deployment, communication most likely goes over the internet, which is slow and unreliable compared to local networks.

### Unreliable networks

• Shared-nothing systems
• A bunch of machines connected by a network.
• The network is the only way those machines can communicate.
• Each machine has its own memory and disk; one machine cannot access another machine’s memory or disk.
• Pros
• Comparatively cheap since it requires no special hardware.
• Make use of commoditized cloud computing services.
• High reliability through redundancy across multiple geographically distributed datacenters.
• Asynchronous packet networks
• One node can send a message to another node.
• No guarantees as to when it will arrive or whether it will arrive.
• May use timeouts to handle lost or delayed messages (see the sketch below).
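
The asynchronous-network model is easy to see in code. Below is a minimal sketch (my own, not from the book) using a UDP datagram: the send offers no delivery guarantee, and the only way the sender notices trouble is a timeout on the reply. The peer address, port, and timeout value are illustrative assumptions.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;

public class AsyncSendSketch {
    public static void main(String[] args) throws Exception {
        byte[] request = "ping".getBytes();
        try (DatagramSocket socket = new DatagramSocket()) {
            // Send a message; the network gives no guarantee that it arrives.
            socket.send(new DatagramPacket(request, request.length,
                    InetAddress.getByName("10.0.0.2"), 9999)); // illustrative peer

            // The only way to notice a lost request, a lost reply, or a slow
            // network is to give up after a timeout.
            socket.setSoTimeout(500); // 500 ms, chosen arbitrarily
            byte[] buf = new byte[1024];
            try {
                socket.receive(new DatagramPacket(buf, buf.length));
                System.out.println("got a reply");
            } catch (SocketTimeoutException e) {
                // We cannot tell whether the request or the reply was lost,
                // or whether the remote node is merely slow.
                System.out.println("timed out; remote state unknown");
            }
        }
    }
}
```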

#### Network faults in practice

• Network faults occur surprisingly often, even in controlled environments like datacenters.
• Public cloud services are more prone to network faults than private ones.
• Do need to know how your software reacts to network problems and ensure that the system can recover from them.
• Netflix Chaos Monkey

#### Detecting faults

• In some cases you get explicit feedback that something is wrong.
• On the receiver machine, if no process is listening on the destination port, the OS will close or refuse TCP connections (with an RST or FIN packet).
• This does not cover the case where the node's process crashes while handling the request.
• If the process crashed but the node's OS is still running, a script can notify other nodes about the crash so that another node can take over quickly.
• HBase does this.
• If you have access to the management interface of the network switches in your datacenter, you can query them to detect link failures at a hardware level.
• If a router knows that the IP address is unreachable, it may reply with an ICMP Destination Unreachable packet.
• Some limitations
• A TCP ACK only confirms delivery to the remote OS, not that the application processed the request; we may need an acknowledgment at the application level.
• Retry at the application level, with a timeout (see the sketch below).
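
Since a TCP ACK only confirms delivery to the remote OS, application-level acknowledgment plus bounded retries is the usual pattern. A minimal sketch, where `Transport.sendRequestAndAwaitAck` is a hypothetical call that returns only once the remote application has acknowledged processing:

```java
import java.util.concurrent.TimeoutException;

public final class RetryClient {
    private static final int MAX_ATTEMPTS = 3;
    private static final long TIMEOUT_MS = 1_000;

    /** Hypothetical transport: returns normally only when the remote
     *  application (not just its OS) has acknowledged the request. */
    interface Transport {
        void sendRequestAndAwaitAck(byte[] request, long timeoutMillis)
                throws TimeoutException;
    }

    static void sendWithRetries(Transport transport, byte[] request)
            throws TimeoutException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                transport.sendRequestAndAwaitAck(request, TIMEOUT_MS);
                return; // application-level ack received
            } catch (TimeoutException e) {
                // No ack: the request may or may not have been processed.
                // Retrying is only safe if the operation is idempotent.
            }
        }
        throw new TimeoutException(
                "no application-level ack after " + MAX_ATTEMPTS + " attempts");
    }
}
```

Note the fundamental ambiguity in the catch block: a timeout never tells us whether the request was processed, which is why retries must be idempotent (or deduplicated by the receiver).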

#### Timeouts and unbounded delays

• Trade-offs exist between long and short timeouts.
• Prematurely declaring a node dead is problematic.
• If the node is actually alive, the action it was performing may end up being executed twice.
• If the system is already under high load, declaring nodes dead prematurely can make the problem worse:
• The dead node's load is transferred to other nodes, which may overload them in turn.
• Networks have unbounded delays.
• Queueing delays
• At network switches
• At the destination machine
• Due to virtual machine switching (one VM is paused while another runs)
• TCP flow control
• TCP retransmission
• In public clouds and multi-tenant datacenters, a noisy neighbor can consume a lot of shared resources.
• Timeouts should be chosen experimentally, by measuring the distribution of network round-trip times.
• The timeout value can also be adjusted automatically over time.
• This can be done with a Phi Accrual failure detector (see the sketch below).
• Used in Akka and Cassandra; TCP retransmission timeouts adapt similarly.
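
A minimal sketch of the Phi Accrual idea (not Akka's or Cassandra's actual implementation), assuming heartbeat inter-arrival times are roughly normally distributed: instead of a fixed timeout, phi grows continuously with the time since the last heartbeat, and the caller picks a suspicion threshold.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of a Phi Accrual failure detector over heartbeat arrivals. */
public class PhiAccrualSketch {
    private final Deque<Long> intervals = new ArrayDeque<>(); // recent gaps (ms)
    private static final int WINDOW = 100;
    private long lastHeartbeatMillis = -1;

    /** Record a heartbeat and remember the gap since the previous one. */
    public void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > WINDOW) intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** phi = -log10(probability that a heartbeat still arrives after now). */
    public double phi(long nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToLong(Long::longValue).average().orElse(0);
        double var = intervals.stream()
                .mapToDouble(i -> (i - mean) * (i - mean)).average().orElse(1);
        double stddev = Math.max(Math.sqrt(var), 1e-3); // guard against zero
        double z = ((nowMillis - lastHeartbeatMillis) - mean) / stddev;
        double pLater = 0.5 * erfc(z / Math.sqrt(2)); // 1 - normal CDF
        return -Math.log10(Math.max(pLater, 1e-30));
    }

    // Abramowitz & Stegun approximation of the complementary error function.
    private static double erfc(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
        double y = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                + t * (-1.453152027 + t * 1.061405429)))) * Math.exp(-x * x);
        return x >= 0 ? y : 2 - y;
    }
}
```

A node is suspected once phi crosses a configurable threshold (Cassandra exposes this as phi_convict_threshold), trading detection speed against false positives.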

#### Synchronous versus Asynchronous networks

• Synchronous network
• Establishes a circuit for each connection, like a traditional phone network.
• No queueing
• Reserved bandwidth
• The maximum latency is fixed.
• Good for transferring a fairly constant number of bits per second.
• Not good for bursty data transfers; you would need a good guess of the bandwidth to reserve for the circuit.
• Asynchronous network
• Packet-based, share the network bandwidth dynamically.
• e.g., TCP connection
• Hybrid choice
• ATM supports both circuit switching and packet switching.
• InfiniBand: it is possible to emulate circuit switching on packet networks, or to provide statistically bounded delay.
• Implements flow control at the link layer to avoid queueing in the network.
• But it can still suffer delays due to link congestion.
• Queueing delays can also be reduced with quality of service (prioritization and scheduling of packets) and admission control (rate-limiting senders).

### Unreliable clocks

• It takes time for a message to travel and the delay is unpredictable.
• Each machine has its own clock, which is not perfectly accurate.
• Synchronize clocks
• Most common mechanism: the Network Time Protocol (NTP).
• Adjusts the clock according to the time reported by a group of servers, which in turn get their time from a more accurate source such as a GPS receiver.

#### Monotonic versus time-of-day clocks

Modern computers have at least two different kinds of clocks:

• Time-of-day clock
• Returns the current date and time according to some calendar (wall-clock time).
• clock_gettime(CLOCK_REALTIME) on Linux; System.currentTimeMillis() in Java.
• Number of seconds or milliseconds since the epoch: midnight UTC on January 1, 1970.
• According to the Gregorian calendar, not counting leap seconds.
• Synchronized with NTP
• Problem
• If the local clock is too far ahead of the NTP server, it may be forcibly reset, appearing to jump back in time.
• This makes it unsuitable for measuring elapsed time.
• Monotonic clock
• Suitable for measuring duration.
• But the absolute value is meaningless.
• clock_gettime(CLOCK_MONOTONIC) on Linux; System.nanoTime() in Java.
• Guaranteed to always move forward.
• NTP may adjust the frequency at which the monotonic clock moves forward (slewing), but it cannot make the clock jump forward or backward.
• Good for measuring elapsed time (e.g., timeouts) in a distributed system; see the sketch below.
• Doesn’t assume any synchronization between different nodes.
• Not sensitive to slight inaccuracies of measurement.
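
A small illustration (my own, not from the book) of why the two clocks serve different purposes: a duration measured with System.currentTimeMillis() can be wrong if NTP steps the clock mid-measurement, while System.nanoTime() is only meaningful as a difference but is safe for durations.

```java
public class ClockComparison {
    public static void main(String[] args) throws InterruptedException {
        // Time-of-day clock: meaningful absolute value, but NTP may step it,
        // so differences between two readings are unreliable.
        long wallStart = System.currentTimeMillis();

        // Monotonic clock: absolute value is meaningless (often time since
        // boot), but it never jumps, so differences are safe.
        long monoStart = System.nanoTime();

        Thread.sleep(100); // the work being timed

        long wallElapsedMs = System.currentTimeMillis() - wallStart; // may be wrong
        long monoElapsedMs = (System.nanoTime() - monoStart) / 1_000_000; // safe

        System.out.println("wall-clock estimate: " + wallElapsedMs + " ms");
        System.out.println("monotonic estimate:  " + monoElapsedMs + " ms");
    }
}
```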

#### Clock synchronization and accuracy

• Monotonic clocks don’t need synchronization.
• But time-of-day clocks need to be set according to an NTP server or other external time source in order to be useful.
• Unfortunately, our methods for getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might hope - hardware clocks and NTP can be fickle beasts.
• Local hardware clocks are inaccurate.
• NTP synchronization can only be as good as the network delay.
• A node can be firewalled off from NTP servers.
• Leap seconds.
• Virtualized hardware clocks in virtual machines.
• Higher accuracy can be achieved with a GPS receiver and the Precision Time Protocol (PTP).

#### Relying on synchronized clocks

Dangerous to assume the clocks will work well. If software requires synchronized clocks, it's essential to monitor the clock offsets between all the machines. Any node whose clock drifts too far from the others should be declared dead.

• Timestamps for ordering events
• May fail to order events correctly because clocks are not perfectly synchronized.
• Affects strategies like last write wins (LWW).
• May use logical clocks (based on incrementing counters) instead; see the sketch after this list.
• Clock readings have a confidence interval.
• Treat a clock reading as a range of time, not a single point.
• Google's TrueTime API in Spanner explicitly reports the confidence interval.
• Synchronized clocks for global snapshots
• Snapshot isolation requires a monotonically increasing transaction ID.
• Spanner uses timestamps as transaction IDs.
• This works as long as the confidence intervals of two timestamps don't overlap.
• Spanner waits for the length of the confidence interval before committing a read-write transaction.
• This ensures that any transaction that may read the data is at a sufficiently later time, so their intervals don't overlap.
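
A minimal sketch of one common incrementing-counter scheme, a Lamport clock (my own illustration; the notes above only mention logical clocks generally): it orders events by causality instead of wall-clock time, sidestepping clock skew entirely.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of a Lamport clock: a logical clock based on a counter. */
public class LamportClock {
    private final AtomicLong counter = new AtomicLong(0);

    /** Call for every local event; returns the event's timestamp. */
    public long tick() {
        return counter.incrementAndGet();
    }

    /** Stamp an outgoing message with a fresh logical time. */
    public long send() {
        return counter.incrementAndGet();
    }

    /** On receipt, advance the local clock past the sender's timestamp. */
    public long receive(long senderTimestamp) {
        return counter.updateAndGet(local -> Math.max(local, senderTimestamp) + 1);
    }
}
```

If event A causally precedes event B, A's timestamp is guaranteed to be smaller; concurrent events get an arbitrary but consistent order, avoiding the clock-skew problems of physical timestamps.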

#### Process pauses

For a multi-node system, a single leader obtains a lease (a lock with a timeout) from the other nodes. The timeout and the clock can cause problems:

• Using the synchronized (time-of-day) clock
• The lease's expiry time is set by a different machine but compared against the local clock on the leader; those clocks can be out of sync by more than a few seconds.
• Use local monotonic clock
• What if the leader process is paused between checking the lease and acting on it? (See the sketch after this list.)
• A node in a distributed system must assume that its execution can be paused for a significant length of time at any point.
• Causes: garbage collection, a suspended virtual machine, context switching, disk I/O, swapping, etc.
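
A sketch in the spirit of the book's lease-holding request loop; `Lease`, `Request`, `getIncomingRequest`, and `process` are hypothetical placeholders. The point is that the expiry check and the action are not atomic, so a pause between them is dangerous:

```java
/** Hypothetical lease handle and request types, for illustration only. */
interface Lease {
    long expiryTimeMillis();
    boolean isValid();
    Lease renew();
}

interface Request {}

abstract class LeaderLoop {
    abstract Request getIncomingRequest(); // hypothetical inbox
    abstract void process(Request request); // hypothetical handler

    void requestLoop(Lease lease) {
        while (true) {
            Request request = getIncomingRequest();
            // Ensure that the lease always has at least 10 seconds remaining.
            if (lease.expiryTimeMillis() - System.currentTimeMillis() < 10_000) {
                lease = lease.renew();
            }
            if (lease.isValid()) {
                // DANGER: if the thread pauses here (GC, VM suspension, etc.)
                // longer than the remaining lease time, the lease may have
                // expired and another node taken over before process() runs.
                process(request);
            }
        }
    }
}
```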

Response time guarantees:

• Hard real-time systems: In these systems, there is a specified deadline by which the software must respond; if it doesn’t meet the deadline, that may cause a failure of the entire system.
• Dynamic memory allocation may be restricted or disallowed.
• Garbage collection is not a good idea here.
• Very expensive to develop.
• Low throughput.
• Usually used in safety-critical devices.
• Garbage collection
• GC causes pauses, but we can reduce their impact.
• One way: treat GC pauses like brief planned outages of a node, and let other nodes handle requests from clients while one node is collecting its garbage.
• Another: use the garbage collector only for short-lived objects (which are fast to collect) and restart processes periodically, before they accumulate enough long-lived objects to require a full GC.

### Knowledge, truth and lies

System model: we state assumptions about the kinds of faults that may occur and design the actual system to work correctly under those assumptions.

#### The truth is defined by majority

• A node cannot necessarily trust its own judgment of a situation.
• If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive.
• Some systems require there to be only one of something.
• e.g., a single leader, the holder of a lock
• A node may falsely believe it is still the chosen one even after the quorum has declared it dead.
• This causes problems if it keeps acting on its assumed authority (e.g., writing to shared storage).
• Use fencing tokens on the server or database side to prevent such stale clients from corrupting data (see the sketch below).
• e.g., every time the lock server grants a lock or lease, it also returns a fencing token (a monotonically increasing ID).
• Every time a client sends a write request to the storage service, it must include its current fencing token.
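
A minimal sketch of the storage-service side of fencing (my own illustration): the service remembers the highest token it has seen and rejects writes carrying an older one.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of server-side fencing: reject writes with stale tokens. */
public class FencedStorage {
    private final AtomicLong highestTokenSeen = new AtomicLong(-1);

    /** Returns true if the write was accepted, false if fenced off. */
    public boolean write(long fencingToken, byte[] value) {
        // Atomically record the token if it is the newest seen so far.
        long previous = highestTokenSeen.getAndAccumulate(fencingToken, Math::max);
        if (fencingToken <= previous) {
            return false; // stale client: its lock or lease has been superseded
        }
        store(value);
        return true;
    }

    private void store(byte[] value) {
        // Persisting the value is out of scope for this sketch.
    }
}
```

Because tokens are monotonically increasing, a client that was paused while holding an old lease cannot overwrite data written under a newer one.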

#### Byzantine faults

• Byzantine fault: distributed systems problems become much harder if there is a risk that nodes may “lie” (send arbitrary faulty or corrupted responses).
• A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network.
• We can assume no Byzantine faults if all nodes in the datacenter are controlled by our organization and radiation levels are low enough that memory corruption is not a major problem.

#### System model and reality

Timing assumptions:

• Synchronous model
• Bounded network delay, bounded process pauses, and bounded clock error.
• Not realistic
• Partially synchronous model
• Behaves like the synchronous model most of the time.
• But sometimes exceeds the bounds on delay, pauses, or clock drift.
• Asynchronous model
• Not allowed to make any timing assumptions.

Node failure assumptions:

• Crash-stop faults
• A node can fail only by crashing.
• It stops responding and is then gone forever.
• Crash-recovery faults
• Nodes may crash at any moment, and perhaps start responding again after some unknown time.
• Stable storage is preserved across crashes.
• In-memory data can be lost.
• Byzantine (arbitrary) faults
• Nodes may do absolutely anything, including trying to trick and deceive other nodes.

For modeling real systems, the partially synchronous model with crash-recovery faults is generally the most useful model.

• Correctness of an algorithm
• To define what it means for an algorithm to be correct, we can describe its properties.
• An algorithm is correct in some system model if it always satisfies its properties in all situations that we assume may occur in that system model.
• Safety and liveness
• Two kinds of properties: safety means nothing bad happens; liveness means something good eventually happens (e.g., every request eventually receives a response).