2018-10-25

## The Trouble with Distributed Systems

• Networks
• Clocks and timing issues

### Faults and partial failures

• Partial failure: In a distributed system, there may well be some parts of the system that are broken in some unpredictable way, even though other parts of the system are working fine.
• Nondeterminism
• Individual nodes and the network may work or fail unpredictably.
• You may not even know whether something succeeded; the time it takes for a message to travel across a network is also nondeterministic.

#### Cloud computing and supercomputing

Large-scale computing systems:

• High-performance computing
• Supercomputers with thousands of CPUs are typically used for computationally intensive scientific computing tasks.
• A job typically checkpoints the state of its computation to durable storage from time to time.
• If one node fails, a common solution is to simply stop the entire cluster workload (escalate partial failure to total failure). After the faulty node is repaired, the computation is restarted from the last checkpoint.
• Cloud computing
• Often associated with multi-tenant datacenters, commodity computers connected with an IP network (often Ethernet), elastic/on-demand resource allocation, and metered billing.
• Traditional enterprise datacenters lie somewhere between these extremes.

Differences between cloud computing and supercomputing:

• Many internet-related applications are online, in the sense that they need to be able to serve users with low latency at any time.
• Making the service unavailable - for example, stopping the cluster for repair - is not acceptable.
• Nodes in cloud services are built from commodity machines, which can provide equivalent performance at lower cost due to economies of scale, but also have higher failure rates.
• Supercomputers are typically built from specialized hardware, where each node is quite reliable, and nodes communicate through shared memory and remote direct memory access (RDMA).
• Large datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth.
• Reasonable to assume something is always broken in a big system.
• If the system can tolerate failed nodes, rolling upgrades can be performed without taking down the whole service.
• In a geographically distributed deployment, communication most likely goes over the internet, which is slow and unreliable compared to local networks.

### Unreliable networks

• Shared-nothing systems
• A bunch of machines connected by a network.
• The network is the only way those machines can communicate.
• Each machine has its own memory and disk; one machine cannot access another machine’s memory or disk.
• Pros
• Comparatively cheap since it requires no special hardware.
• Make use of commoditized cloud computing services.
• High reliability through redundancy across multiple geographically distributed datacenters.
• Asynchronous packet networks
• One node can send a message to another node.
• No guarantees as to when it will arrive or whether it will arrive.
• May use timeouts to handle lost or delayed messages (see the sketch below).
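
The asynchronous-network model is easy to see in code. Below is a minimal sketch (my own, not from the book) using a UDP datagram: the send offers no delivery guarantee, and the only way the sender notices trouble is a timeout on the reply. The peer address, port, and timeout value are illustrative assumptions.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;

public class AsyncSendSketch {
    public static void main(String[] args) throws Exception {
        byte[] request = "ping".getBytes();
        try (DatagramSocket socket = new DatagramSocket()) {
            // Send a message; the network gives no guarantee that it arrives.
            socket.send(new DatagramPacket(request, request.length,
                    InetAddress.getByName("10.0.0.2"), 9999)); // illustrative peer

            // The only way to notice a lost request, a lost reply, or a slow
            // network is to give up after a timeout.
            socket.setSoTimeout(500); // 500 ms, chosen arbitrarily
            byte[] buf = new byte[1024];
            try {
                socket.receive(new DatagramPacket(buf, buf.length));
                System.out.println("got a reply");
            } catch (SocketTimeoutException e) {
                // We cannot tell whether the request or the reply was lost,
                // or whether the remote node is merely slow.
                System.out.println("timed out; remote state unknown");
            }
        }
    }
}
```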

#### Network faults in practice

• Network faults occur surprisingly often, even in controlled environments like datacenters.
• Public cloud services are more prone to network faults than private ones.
• Do need to know how your software reacts to network problems and ensure that the system can recover from them.
• Netflix Chaos Monkey

#### Detecting faults

• In some cases you get explicit feedback that something is wrong.
• On the receiver machine, if no process is listening on the destination port, the OS will close or refuse TCP connections (with an RST or FIN packet).
• This does not cover the case where the node's process crashes while handling the request.
• If the process crashed but the node's OS is still running, a script can notify other nodes about the crash so that another node can take over quickly.
• HBase does this.
• If you have access to the management interface of the network switches in your datacenter, you can query them to detect link failures at a hardware level.
• If a router knows that the IP address is unreachable, it may reply with an ICMP Destination Unreachable packet.
• Some limitations
• A TCP ACK only confirms delivery to the remote OS, not that the application processed the request; we may need an acknowledgment at the application level.
• Retry at the application level, with a timeout (see the sketch below).
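
Since a TCP ACK only confirms delivery to the remote OS, application-level acknowledgment plus bounded retries is the usual pattern. A minimal sketch, where `Transport.sendRequestAndAwaitAck` is a hypothetical call that returns only once the remote application has acknowledged processing:

```java
import java.util.concurrent.TimeoutException;

public final class RetryClient {
    private static final int MAX_ATTEMPTS = 3;
    private static final long TIMEOUT_MS = 1_000;

    /** Hypothetical transport: returns normally only when the remote
     *  application (not just its OS) has acknowledged the request. */
    interface Transport {
        void sendRequestAndAwaitAck(byte[] request, long timeoutMillis)
                throws TimeoutException;
    }

    static void sendWithRetries(Transport transport, byte[] request)
            throws TimeoutException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                transport.sendRequestAndAwaitAck(request, TIMEOUT_MS);
                return; // application-level ack received
            } catch (TimeoutException e) {
                // No ack: the request may or may not have been processed.
                // Retrying is only safe if the operation is idempotent.
            }
        }
        throw new TimeoutException(
                "no application-level ack after " + MAX_ATTEMPTS + " attempts");
    }
}
```

Note the fundamental ambiguity in the catch block: a timeout never tells us whether the request was processed, which is why retries must be idempotent (or deduplicated by the receiver).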

#### Timeouts and unbounded delays

• Trade-offs exist between long and short timeouts.
• Prematurely declaring a node dead is problematic.
• If the node is actually alive, the action it was performing may end up being executed twice.
• If the system is already under high load, declaring nodes dead prematurely can make the problem worse:
• The dead node's load is transferred to other nodes, which may overload them in turn.
• Networks have unbounded delays.
• Queueing delays
• At network switches
• At the destination machine
• Due to virtual machine switching (one VM is paused while another runs)
• TCP flow control
• TCP retransmission
• In public clouds and multi-tenant datacenters, a noisy neighbor can consume a lot of shared resources.
• Timeouts should be chosen experimentally, by measuring the distribution of network round-trip times.
• The timeout value can also be adjusted automatically over time.
• This can be done with a Phi Accrual failure detector (see the sketch below).
• Used in Akka and Cassandra; TCP retransmission timeouts adapt similarly.
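
A minimal sketch of the Phi Accrual idea (not Akka's or Cassandra's actual implementation), assuming heartbeat inter-arrival times are roughly normally distributed: instead of a fixed timeout, phi grows continuously with the time since the last heartbeat, and the caller picks a suspicion threshold.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of a Phi Accrual failure detector over heartbeat arrivals. */
public class PhiAccrualSketch {
    private final Deque<Long> intervals = new ArrayDeque<>(); // recent gaps (ms)
    private static final int WINDOW = 100;
    private long lastHeartbeatMillis = -1;

    /** Record a heartbeat and remember the gap since the previous one. */
    public void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > WINDOW) intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** phi = -log10(probability that a heartbeat still arrives after now). */
    public double phi(long nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToLong(Long::longValue).average().orElse(0);
        double var = intervals.stream()
                .mapToDouble(i -> (i - mean) * (i - mean)).average().orElse(1);
        double stddev = Math.max(Math.sqrt(var), 1e-3); // guard against zero
        double z = ((nowMillis - lastHeartbeatMillis) - mean) / stddev;
        double pLater = 0.5 * erfc(z / Math.sqrt(2)); // 1 - normal CDF
        return -Math.log10(Math.max(pLater, 1e-30));
    }

    // Abramowitz & Stegun approximation of the complementary error function.
    private static double erfc(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
        double y = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                + t * (-1.453152027 + t * 1.061405429)))) * Math.exp(-x * x);
        return x >= 0 ? y : 2 - y;
    }
}
```

A node is suspected once phi crosses a configurable threshold (Cassandra exposes this as phi_convict_threshold), trading detection speed against false positives.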

#### Synchronous versus Asynchronous networks

• Synchronous network
• Establishes a circuit for each connection, like a traditional phone network.
• No queueing
• Reserved bandwidth
• The maximum latency is fixed.
• Good for transferring a fairly constant number of bits per second.
• Not good for bursty data transfers; you would need a good guess of the bandwidth to reserve for the circuit.
• Asynchronous network
• Packet-based, share the network bandwidth dynamically.
• e.g., TCP connection
• Hybrid choice
• ATM supports both circuit switching and packet switching.
• InfiniBand: it is possible to emulate circuit switching on packet networks, or to provide statistically bounded delay.
• Implements flow control at the link layer to avoid queueing in the network.
• But it can still suffer delays due to link congestion.
• Queueing delays can also be reduced with quality of service (prioritization and scheduling of packets) and admission control (rate-limiting senders).

### Unreliable clocks

• It takes time for a message to travel and the delay is unpredictable.
• Each machine has its own clock, which is not perfectly accurate.
• Synchronize clocks
• Most common mechanism: the Network Time Protocol (NTP).
• Adjusts the clock according to the time reported by a group of servers, which in turn get their time from a more accurate source such as a GPS receiver.

#### Monotonic versus time-of-day clocks

Modern computers have at least two different kinds of clocks:

• Time-of-day clock
• Returns the current date and time according to some calendar (wall-clock time).
• clock_gettime(CLOCK_REALTIME) on Linux; System.currentTimeMillis() in Java.
• Number of seconds or milliseconds since the epoch: midnight UTC on January 1, 1970.
• According to the Gregorian calendar, not counting leap seconds.
• Synchronized with NTP
• Problem
• If the local clock is too far ahead of the NTP server, it may be forcibly reset, appearing to jump back in time.
• This makes it unsuitable for measuring elapsed time.
• Monotonic clock
• Suitable for measuring duration.
• But the absolute value is meaningless.
• clock_gettime(CLOCK_MONOTONIC) on Linux; System.nanoTime() in Java.
• Guaranteed to always move forward.
• NTP may adjust the frequency at which the monotonic clock moves forward (slewing), but it cannot make the clock jump forward or backward.
• Good for measuring elapsed time (e.g., timeouts) in a distributed system; see the sketch below.
• Doesn’t assume any synchronization between different nodes.
• Not sensitive to slight inaccuracies of measurement.
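
A small illustration (my own, not from the book) of why the two clocks serve different purposes: a duration measured with System.currentTimeMillis() can be wrong if NTP steps the clock mid-measurement, while System.nanoTime() is only meaningful as a difference but is safe for durations.

```java
public class ClockComparison {
    public static void main(String[] args) throws InterruptedException {
        // Time-of-day clock: meaningful absolute value, but NTP may step it,
        // so differences between two readings are unreliable.
        long wallStart = System.currentTimeMillis();

        // Monotonic clock: absolute value is meaningless (often time since
        // boot), but it never jumps, so differences are safe.
        long monoStart = System.nanoTime();

        Thread.sleep(100); // the work being timed

        long wallElapsedMs = System.currentTimeMillis() - wallStart; // may be wrong
        long monoElapsedMs = (System.nanoTime() - monoStart) / 1_000_000; // safe

        System.out.println("wall-clock estimate: " + wallElapsedMs + " ms");
        System.out.println("monotonic estimate:  " + monoElapsedMs + " ms");
    }
}
```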

#### Clock synchronization and accuracy

• Monotonic clocks don’t need synchronization.
• But time-of-day clocks need to be set according to an NTP server or other external time source in order to be useful.
• Unfortunately, our methods for getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might hope - hardware clocks and NTP can be fickle beasts.
• Local hardware clocks are inaccurate.
• NTP synchronization can only be as good as the network delay.
• A node can be firewalled off from NTP servers.
• Leap seconds.
• Virtualized hardware clocks in virtual machines.
• Higher accuracy can be achieved with a GPS receiver and the Precision Time Protocol (PTP).

#### Relying on synchronized clocks

Dangerous to assume the clocks will work well. If software requires synchronized clocks, it's essential to monitor the clock offsets between all the machines. Any node whose clock drifts too far from the others should be declared dead.

• Timestamps for ordering events
• May fail to order events correctly because clocks are not perfectly synchronized.
• Affects strategies like last write wins (LWW).
• May use logical clocks (based on incrementing counters) instead; see the sketch after this list.
• Clock readings have a confidence interval.
• Treat a clock reading as a range of time, not a single point.
• Google's TrueTime API in Spanner explicitly reports the confidence interval.
• Synchronized clocks for global snapshots
• Snapshot isolation requires a monotonically increasing transaction ID.
• Spanner uses timestamps as transaction IDs.
• This works as long as the confidence intervals of two timestamps don't overlap.
• Spanner waits for the length of the confidence interval before committing a read-write transaction.
• This ensures that any transaction that may read the data is at a sufficiently later time, so their intervals don't overlap.
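
A minimal sketch of one common incrementing-counter scheme, a Lamport clock (my own illustration; the notes above only mention logical clocks generally): it orders events by causality instead of wall-clock time, sidestepping clock skew entirely.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of a Lamport clock: a logical clock based on a counter. */
public class LamportClock {
    private final AtomicLong counter = new AtomicLong(0);

    /** Call for every local event; returns the event's timestamp. */
    public long tick() {
        return counter.incrementAndGet();
    }

    /** Stamp an outgoing message with a fresh logical time. */
    public long send() {
        return counter.incrementAndGet();
    }

    /** On receipt, advance the local clock past the sender's timestamp. */
    public long receive(long senderTimestamp) {
        return counter.updateAndGet(local -> Math.max(local, senderTimestamp) + 1);
    }
}
```

If event A causally precedes event B, A's timestamp is guaranteed to be smaller; concurrent events get an arbitrary but consistent order, avoiding the clock-skew problems of physical timestamps.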

#### Process pauses

For a multi-node system, a single leader obtains a lease (a lock with a timeout) from the other nodes. The timeout and the clock can cause problems:

• Using the synchronized (time-of-day) clock
• The lease's expiry time is set by a different machine but compared against the local clock on the leader; those clocks can be out of sync by more than a few seconds.
• Use local monotonic clock
• What if the leader process is paused between checking the lease and acting on it? (See the sketch after this list.)
• A node in a distributed system must assume that its execution can be paused for a significant length of time at any point.
• Causes: garbage collection, a suspended virtual machine, context switching, disk I/O, swapping, etc.
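
A sketch in the spirit of the book's lease-holding request loop; `Lease`, `Request`, `getIncomingRequest`, and `process` are hypothetical placeholders. The point is that the expiry check and the action are not atomic, so a pause between them is dangerous:

```java
/** Hypothetical lease handle and request types, for illustration only. */
interface Lease {
    long expiryTimeMillis();
    boolean isValid();
    Lease renew();
}

interface Request {}

abstract class LeaderLoop {
    abstract Request getIncomingRequest(); // hypothetical inbox
    abstract void process(Request request); // hypothetical handler

    void requestLoop(Lease lease) {
        while (true) {
            Request request = getIncomingRequest();
            // Ensure that the lease always has at least 10 seconds remaining.
            if (lease.expiryTimeMillis() - System.currentTimeMillis() < 10_000) {
                lease = lease.renew();
            }
            if (lease.isValid()) {
                // DANGER: if the thread pauses here (GC, VM suspension, etc.)
                // longer than the remaining lease time, the lease may have
                // expired and another node taken over before process() runs.
                process(request);
            }
        }
    }
}
```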

Response time guarantees:

• Hard real-time systems: In these systems, there is a specified deadline by which the software must respond; if it doesn’t meet the deadline, that may cause a failure of the entire system.
• Dynamic memory allocation may be restricted or disallowed.
• Garbage collection is not a good idea here.
• Very expensive to develop.
• Low throughput.
• Usually used in safety-critical devices.
• Garbage collection
• GC causes pauses, but we can reduce their impact.
• One way: treat GC pauses like brief planned outages of a node, and let other nodes handle requests from clients while one node is collecting its garbage.
• Another: use the garbage collector only for short-lived objects (which are fast to collect) and restart processes periodically, before they accumulate enough long-lived objects to require a full GC.

### Knowledge, truth and lies

System model: we state assumptions about the kinds of faults that may occur and design the actual system to work correctly under those assumptions.

#### The truth is defined by majority

• A node cannot necessarily trust its own judgment of a situation.
• If a quorum of nodes declares another node dead, then it must be considered dead, even if that node still very much feels alive.
• Some systems require there to be only one of something.
• e.g., a single leader, the holder of a lock
• A node may falsely believe it is still the chosen one even after the quorum has declared it dead.
• This causes problems if it keeps acting on its assumed authority (e.g., writing to shared storage).
• Use fencing tokens on the server or database side to prevent such stale clients from corrupting data (see the sketch below).
• e.g., every time the lock server grants a lock or lease, it also returns a fencing token (a monotonically increasing ID).
• Every time a client sends a write request to the storage service, it must include its current fencing token.
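
A minimal sketch of the storage-service side of fencing (my own illustration): the service remembers the highest token it has seen and rejects writes carrying an older one.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of server-side fencing: reject writes with stale tokens. */
public class FencedStorage {
    private final AtomicLong highestTokenSeen = new AtomicLong(-1);

    /** Returns true if the write was accepted, false if fenced off. */
    public boolean write(long fencingToken, byte[] value) {
        // Atomically record the token if it is the newest seen so far.
        long previous = highestTokenSeen.getAndAccumulate(fencingToken, Math::max);
        if (fencingToken <= previous) {
            return false; // stale client: its lock or lease has been superseded
        }
        store(value);
        return true;
    }

    private void store(byte[] value) {
        // Persisting the value is out of scope for this sketch.
    }
}
```

Because tokens are monotonically increasing, a client that was paused while holding an old lease cannot overwrite data written under a newer one.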

#### Byzantine faults

• Byzantine fault: distributed systems problems become much harder if there is a risk that nodes may “lie” (send arbitrary faulty or corrupted responses).
• A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network.
• We can assume no Byzantine faults if all nodes in the datacenter are controlled by our organization and radiation levels are low enough that memory corruption is not a major problem.

#### System model and reality

Timing assumptions:

• Synchronous model
• Bounded network delay, bounded process pauses, and bounded clock error.
• Not realistic
• Partially synchronous model
• Behaves like the synchronous model most of the time.
• But sometimes exceeds the bounds on delay, pauses, or clock drift.
• Asynchronous model
• Not allowed to make any timing assumptions.

Node failure assumptions:

• Crash-stop faults
• A node can fail only by crashing.
• It stops responding and is then gone forever.
• Crash-recovery faults
• Nodes may crash at any moment, and perhaps start responding again after some unknown time.
• Stable storage is preserved across crashes.
• In-memory data can be lost.
• Byzantine (arbitrary) faults
• Nodes may do absolutely anything, including trying to trick and deceive other nodes.

For modeling real systems, the partially synchronous model with crash-recovery faults is generally the most useful model.

• Correctness of an algorithm
• To define what it means for an algorithm to be correct, we can describe its properties.
• An algorithm is correct in some system model if it always satisfies its properties in all situations that we assume may occur in that system model.
• Safety and liveness
• Two kinds of properties: safety means nothing bad happens; liveness means something good eventually happens (e.g., every request eventually receives a response).