The Problem: Why Consensus is Hard
CAP Theorem: The Fundamental Trade-off
The CAP theorem (Brewer, 2000) states that a distributed system can guarantee at most two of three properties simultaneously: Consistency, Availability, and Partition Tolerance.
- Consistency: Every read receives the most recent write or an error. All nodes see the same data at the same time.
- Availability: Every request receives a (non-error) response — though it might not be the most recent write.
- Partition Tolerance: The system continues operating despite dropped or delayed messages between nodes.
In practice, network partitions are inevitable in any distributed system. Engineers therefore choose between CP systems (strong consistency, may refuse requests during partitions) and AP systems (always respond, may return stale data). Raft is a CP algorithm — it prioritises correctness over availability during network splits.
Why this matters: Systems like ZooKeeper, etcd, and Consul are CP. Systems like Cassandra and DynamoDB are AP. Understanding where your system sits on the CAP spectrum is essential for reasoning about failure modes.
FLP Impossibility: Why Perfect Consensus is Impossible
The Fischer-Lynch-Paterson (FLP) impossibility result (1985) proved that in a fully asynchronous system, no deterministic algorithm can guarantee consensus if even a single process can crash.
This might sound alarming, but the practical takeaway is: you cannot build a perfectly safe and perfectly live consensus system simultaneously. All practical consensus algorithms (Paxos, Raft, Zab) resolve this by relaxing the liveness guarantee — they may stall if too many nodes fail, but they will never return incorrect results.
FLP assumes a fully asynchronous model where you cannot distinguish a slow node from a dead one. Real systems use timeouts as a practical approximation to detect failures and make progress.
Byzantine vs Crash Failures
Raft handles crash-stop failures — nodes fail by stopping. It does not handle Byzantine failures, where nodes can behave arbitrarily (return wrong data, send conflicting messages to different peers, lie about their state).
| Failure Model | Example | Algorithm |
|---|---|---|
| Crash-stop | Server power loss | Raft, Paxos, Zab |
| Crash-recovery | Reboot after power loss | Multi-Paxos, Raft with log persistence |
| Byzantine | Compromised node, cosmic ray bit-flip | PBFT, HotStuff, Tendermint |
Byzantine Fault Tolerance (BFT) requires 3f+1 nodes to tolerate f traitors. Crash-fault tolerance requires only 2f+1 nodes to tolerate f failures. This cost difference is why Raft is used in data centres (trusted environment) while BFT is used in blockchains (zero-trust environment).
Quorum: The Mathematical Foundation
A quorum is the minimum number of nodes required to reach a decision. For a cluster of N nodes tolerating f failures, a quorum of ⌊N/2⌋ + 1 (a strict majority) ensures that:
- Any two quorums overlap — at least one shared node has seen both decisions.
- A failed minority (≤ f nodes) cannot prevent progress.
With 5 nodes (f=2), quorum = 3. This means the cluster can lose any 2 nodes and still make progress. Losing 3 nodes means we can no longer form a quorum and the system stalls (but never corrupts data).
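The overlap argument is just arithmetic; a quick sketch (hypothetical helper, not part of raft_node.py) makes it concrete:

```python
def quorum_size(n: int) -> int:
    """Smallest strict majority of an n-node cluster."""
    return n // 2 + 1

for n in (3, 5, 7):
    q = quorum_size(n)
    # Two quorums of size q drawn from n nodes must share at least
    # 2*q - n >= 1 nodes, so disjoint majorities are impossible.
    assert 2 * q - n >= 1
    print(f"{n} nodes: quorum = {q}, tolerates {n - q} failures")
```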
The Raft Algorithm: Core Concepts
Node Roles: Follower, Candidate, Leader
Every Raft node is in exactly one of three states at any time:
Follower — Passive. Responds to RPCs from leaders and candidates. If no communication is received within the election timeout, the follower promotes itself to candidate.
Candidate — Seeking election. Has incremented its term and is broadcasting RequestVote RPCs. Transitions to leader on winning a quorum of votes, or back to follower if a valid leader makes contact.
Leader — Active. Handles all client requests, replicates log entries to followers, and sends periodic heartbeats to prevent new elections. There is at most one leader per term.
Terms: Raft's Logical Clock
A term is a monotonically increasing integer. Each election begins a new term. Terms serve as a logical clock — they allow nodes to detect stale information (a leader from an old term) and ignore it.
Key invariants:
- A leader can only exist in one term.
- If a node receives an RPC with a term > its own, it immediately updates its term and reverts to follower.
- No node will vote for a candidate whose term is less than its own current term.
Terms and term transitions are how Raft ensures that old leaders (e.g., a node that was network-partitioned) cannot overwrite committed data when they reconnect.
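The second invariant is usually implemented as a single check that every RPC handler runs first. A minimal sketch, assuming a node with `current_term`, `voted_for`, and `role` fields (illustrative names, not necessarily those in raft_node.py):

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    current_term: int = 0
    voted_for: int | None = None
    role: str = "follower"

def observe_term(node: NodeState, rpc_term: int) -> None:
    # A message carrying a higher term means this node is stale: adopt
    # the term, clear the vote, and revert to follower regardless of
    # whether it was a leader, candidate, or follower.
    if rpc_term > node.current_term:
        node.current_term = rpc_term
        node.voted_for = None
        node.role = "follower"
```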
Election Timeout: Triggering Leader Election
The election timeout is a random duration (e.g., 150–300 ms) that each follower waits before starting an election. Randomisation is the key insight — it staggers elections so that in most cases, one node's timer fires first, collects votes from the still-sleeping others, and wins.
```python
# From raft_node.py
self._election_timeout = random.uniform(
    ELECTION_TIMEOUT_MIN,  # e.g. 1.5s
    ELECTION_TIMEOUT_MAX,  # e.g. 3.0s
)
```

The leader resets followers' timers by sending periodic heartbeats (empty AppendEntries RPCs). As long as the leader is alive and the network is healthy, followers never time out.
Vote Granting Rules
A node grants its vote to a candidate only when all of the following are true:
- Term is current: `candidate_term >= current_term`.
- One vote per term: The node hasn't already voted for a different candidate this term.
- Log is at least as fresh (§5.4): The candidate's log is at least as up-to-date as the voter's. Freshness is determined by comparing the last log entry's term first, then log length as a tiebreaker.
Rule 3 is critical for safety. It prevents a node with a stale log from becoming leader and overwriting entries that were already committed by a quorum.
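Putting the three rules together, the vote decision might look like the sketch below. It assumes the `VoteRequest` payload shown in the next section and a node holding `current_term`, `voted_for`, and a `log` of entries with a `term` field; this is a simplified illustration, not the exact raft_node.py logic.

```python
def should_grant_vote(node, req) -> bool:
    # Rule 1: never vote for a candidate from an older term.
    if req.candidate_term < node.current_term:
        return False
    # Rule 2: at most one vote per term (None means not voted yet).
    if node.voted_for not in (None, req.candidate_id):
        return False
    # Rule 3 (§5.4): candidate's log must be at least as up-to-date:
    # compare the last entry's term first, then log length.
    my_last_term = node.log[-1].term if node.log else 0
    my_last_index = len(node.log) - 1
    if req.last_log_term != my_last_term:
        return req.last_log_term > my_last_term
    return req.last_log_index >= my_last_index
```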
Leader Election in Detail
Starting an Election
When a follower's election timeout fires:
- It increments its `current_term`.
- It transitions to Candidate and votes for itself.
- It resets its own election timer (in case this election fails and it needs to retry).
- It broadcasts a `RequestVote` RPC to all other nodes.
```python
# RequestVote RPC payload
@dataclass
class VoteRequest:
    candidate_id: int     # Who is asking
    candidate_term: int   # The candidate's current term
    last_log_index: int   # Length of candidate's log - 1
    last_log_term: int    # Term of candidate's last log entry
```

Collecting Votes
Each recipient of a RequestVote RPC independently evaluates whether to grant its vote based on the three rules above. The candidate collects responses and counts grants.
A subtle but important rule: votes are durable. If a node crashes after voting and restarts, it must remember who it voted for (in production, this is persisted to disk). Without this, it could vote for two different candidates in the same term after a restart, potentially electing two leaders.
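A sketch of what that persistence might look like (the JSON file and fsync-then-rename scheme are illustrative; the actual raft_node.py storage layer may differ):

```python
import json
import os

def persist_hard_state(path: str, current_term: int, voted_for: int | None) -> None:
    # current_term and voted_for must reach stable storage *before* the
    # vote response is sent, so a crash-and-restart cannot double-vote.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"current_term": current_term, "voted_for": voted_for}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename keeps the old state on crash
```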
Winning the Election
Once a candidate collects votes from a strict majority (quorum), it:
- Transitions to Leader.
- Initialises `nextIndex` and `matchIndex` for each peer (used for log replication).
- Immediately sends an empty `AppendEntries` (heartbeat) to all followers to assert authority and prevent new elections.
The quorum requirement for winning ensures that at most one candidate can win per term — two quorums of 3 (out of 5 nodes) must overlap by at least 1 node, and that overlap node can have voted for only one candidate.
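A sketch of the transition, assuming a `peers` list, 0-based log indices, and a hypothetical `send_heartbeats()` helper (illustrative names, not necessarily the raft_node.py API):

```python
def become_leader(node, peers: list[int]) -> None:
    node.role = "leader"
    # Optimistically assume every follower's log already matches ours;
    # nextIndex is walked backwards on rejection (see log replication).
    node.next_index = {p: len(node.log) for p in peers}
    # matchIndex records the highest entry known to be replicated on
    # each peer; nothing is confirmed yet (-1 with 0-based indices).
    node.match_index = {p: -1 for p in peers}
    node.send_heartbeats()  # empty AppendEntries asserts authority
```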
Split Vote Resolution
It's possible for two candidates to simultaneously receive votes from non-overlapping sets of nodes, with neither reaching quorum. Raft resolves this via timeout randomisation:
- Both candidates' election timers expire at different random times in the next election.
- The one that fires first collects votes before the other becomes a candidate again.
The expected number of rounds to elect a leader is close to 1. The randomisation window should be large enough that a single message round-trip completes within it.
Log Replication
AppendEntries RPC
AppendEntries is the heartbeat AND the replication mechanism. The leader sends it to all followers:
- As a heartbeat (empty `entries`): Resets followers' election timers, advancing their knowledge of the current term and the leader's commit index.
- As a replication RPC (`entries` non-empty): Carries new log entries for the follower to append.
```python
@dataclass
class AppendEntriesRequest:
    leader_id: int
    term: int
    prev_log_index: int      # Index of log entry immediately before new ones
    prev_log_term: int       # Term of that entry — consistency check
    entries: list[LogEntry]  # New entries to store (empty = heartbeat)
    leader_commit: int       # Leader's current commit_index
```

The `prev_log_index` and `prev_log_term` fields implement the Log Matching Property: a follower only accepts new entries if its log matches the leader's log up to the `prev_log_index` entry. This inductively guarantees that two logs with the same last entry are identical up to that point.
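A follower-side sketch of that check, assuming the `AppendEntriesRequest` payload above and 0-based indices where `prev_log_index == -1` means the entries start at the head of the log (simplified; conflict hints for the back-off optimisation are omitted):

```python
def handle_append_entries(node, req) -> bool:
    # Reject leaders from stale terms outright.
    if req.term < node.current_term:
        return False
    # Log Matching check: our log must contain an entry at prev_log_index
    # with the matching term, otherwise the leader has to back off.
    if req.prev_log_index >= 0:
        if req.prev_log_index >= len(node.log):
            return False
        if node.log[req.prev_log_index].term != req.prev_log_term:
            return False
    # Drop conflicting suffixes, then append entries we don't have yet.
    for i, entry in enumerate(req.entries):
        idx = req.prev_log_index + 1 + i
        if idx < len(node.log) and node.log[idx].term != entry.term:
            del node.log[idx:]            # conflict: discard the suffix
        if idx >= len(node.log):
            node.log.append(entry)
    # Advance the local commit index, bounded by the last entry we hold.
    if req.leader_commit > node.commit_index:
        last_new = req.prev_log_index + len(req.entries)
        node.commit_index = min(req.leader_commit, last_new)
    return True
```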
Commit Index Advancement
An entry becomes committed once the leader has replicated it on a majority of nodes. Commitment is permanent — a committed entry will survive in the log even if the leader crashes.
Leader's commit logic:
- After receiving a successful `AppendEntriesResponse` from a follower, update `matchIndex[follower]`.
- Find the highest log index N such that `matchIndex[i] >= N` for a majority of nodes AND `log[N].term == currentTerm`.
- Advance `commitIndex` to N.
The condition `log[N].term == currentTerm` prevents a subtle bug (Raft paper §5.4.2): an entry from an earlier term can reach a majority and still be overwritten by a later leader, so a leader only commits entries from its own term by counting replicas; earlier entries then become committed indirectly.
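A sketch of that rule on the leader, reusing the `matchIndex` map from the election section and 0-based indices (illustrative, not the exact raft_node.py code):

```python
def advance_commit_index(node, cluster_size: int) -> None:
    quorum = cluster_size // 2 + 1
    # Walk from the newest entry back towards the current commit point.
    for n in range(len(node.log) - 1, node.commit_index, -1):
        # The leader itself plus every follower known to hold entry n.
        replicas = 1 + sum(1 for m in node.match_index.values() if m >= n)
        # Only entries from the leader's own term are committed by
        # counting replicas; earlier entries commit indirectly.
        if replicas >= quorum and node.log[n].term == node.current_term:
            node.commit_index = n
            break
```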
Log Matching Property
Raft's safety guarantee rests on the Log Matching Property:
If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index.
This follows inductively from the AppendEntries consistency check:
- Base case: The initial empty log is trivially consistent.
- Inductive step: A follower only appends an entry at index N if its log at index N-1 matches the leader's, so the two logs are identical through N.
Handling Inconsistent Logs
A new leader may find followers with shorter or divergent logs (from previous leaders that crashed mid-replication). The leader handles this by backing off nextIndex[follower] until the follower's log is consistent, then resending all missing entries.
In the optimisation described in the Raft paper (§5.3), the follower can return a conflictTerm and conflictIndex hint to allow the leader to skip multiple back-off steps at once, reducing the number of AppendEntries round-trips needed to synchronise a lagging follower.
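A sketch of the unoptimised back-off (one step per rejected RPC), assuming the `nextIndex`/`matchIndex` maps introduced earlier:

```python
def on_append_entries_response(node, follower: int, ok: bool, sent_up_to: int) -> None:
    if ok:
        # The follower now matches us up to the last entry we sent.
        node.match_index[follower] = sent_up_to
        node.next_index[follower] = sent_up_to + 1
    else:
        # Consistency check failed: retry with an earlier prev_log_index
        # on the next AppendEntries to this follower.
        node.next_index[follower] = max(0, node.next_index[follower] - 1)
```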
Safety and Liveness
The Leader Completeness Property
If a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms.
This is Raft's core safety property. It is enforced by the vote-granting rule: a candidate can only win an election if at least one voter in its quorum has the committed entry, meaning the candidate's log is at least as up-to-date as any committed entry.
Proof sketch:
- Entry E is committed at index I in term T (replicated on a quorum Q).
- Any future leader in term T' > T must win a quorum Q'.
- Q ∩ Q' is non-empty (both are majorities).
- The shared node voted for the new leader, so the new leader's log is at least as fresh as the shared node's, which contains E.
- Therefore, the new leader's log contains E.
Network Partitions and Split-Brain
When a network partition splits a cluster, Raft guarantees no split-brain (two leaders that both accept writes and diverge):
- The minority partition cannot elect a leader (lacks quorum).
- The majority partition can elect a leader and make progress.
Any leader stranded in the minority partition steps down upon reconnection, when it observes RPCs carrying a higher term.
This is why only committed entries are durable. If the client receives an acknowledgement only after the leader commits the entry (quorum-acknowledged), the write is safe. If the leader acknowledges after merely appending the entry locally (before reaching a quorum), the write may be lost in a partition.
Re-election After Partition Heals
When the network partition heals:
- Nodes on the minority side receive `AppendEntries` or `RequestVote` RPCs with a higher term.
- They immediately update their term and revert to follower.
- Their logs are overwritten to match the current leader's log.
- Any uncommitted entries in their log are discarded.
This convergence is safe because the discarded entries were never committed — no quorum acknowledged them, so no client was told they were durable.
Linearizability Guarantee
Raft provides linearizability: each operation appears to execute atomically at some point between its invocation and completion. This means:
- No stale reads (reads always reflect the latest committed write).
- No duplicate executions (even with retries after leader failure).
Achieving true linearizability in practice requires the leader to:
- Read-your-writes: confirm it is still the leader before responding (by getting a quorum on a heartbeat before serving reads).
- Idempotency: deduplicate client retries using unique request IDs in the log.
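One common way to get the idempotency half is to track the latest request ID per client inside the replicated state machine itself, so a retried command replays its cached result instead of executing twice. A sketch (hypothetical key-value machine, not part of raft_node.py):

```python
class DedupKVStateMachine:
    def __init__(self) -> None:
        self.data: dict[str, str] = {}
        # client_id -> (last sequence number applied, its result)
        self.last_applied: dict[str, tuple[int, str]] = {}

    def apply(self, client_id: str, seq: int, key: str, value: str) -> str:
        seen = self.last_applied.get(client_id)
        if seen is not None and seen[0] == seq:
            # A retry of an already-applied command: return the cached
            # result rather than executing the write a second time.
            return seen[1]
        self.data[key] = value
        self.last_applied[client_id] = (seq, "ok")
        return "ok"
```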
Real-World Applications
etcd: The Backbone of Kubernetes
etcd is the primary data store for Kubernetes cluster state (pod specs, service definitions, ConfigMaps, Secrets). It uses Raft as its consensus algorithm.
Key design choices:
- Key-value store with a simple HTTP/gRPC API and a watch mechanism.
- Strong consistency: all reads go through the leader (no stale reads).
- Watch API: clients long-poll for changes; etcd pushes updates when keys change. This powers Kubernetes controllers that react to cluster state changes.
- Typical cluster size: 3 or 5 nodes. 7 is the maximum recommended (performance degrades as quorum size grows).
When etcd is unavailable, Kubernetes cannot schedule new pods or update state, though already-running pods continue running (control plane is affected, data plane is not).
CockroachDB: Distributed SQL
CockroachDB runs one Raft group per range (a 64 MB key-space shard). A table with many rows has many Raft groups running concurrently, each independently electing leaders and replicating writes.
This design means:
- Writes to different rows can be parallelised across multiple Raft leaders.
- A Raft group is affected by partition only if the nodes hosting its replicas are split.
- Cross-range transactions use a two-phase commit protocol on top of Raft for atomicity.
Apache Kafka KRaft Mode
Kafka originally used ZooKeeper (which uses Zab, a Raft-like protocol) for its metadata and controller election. KRaft (Kafka Raft) replaces ZooKeeper with a built-in Raft implementation:
- Metadata log: All cluster metadata (topic configs, partition assignments) is stored in a Raft-replicated log.
- Controller election: A single controller is the Raft leader for the metadata partition.
- Benefit: Eliminates ZooKeeper as a separate operational concern, reduces latency, and supports millions of partitions (ZooKeeper was the scalability bottleneck).
KRaft became production-ready in Kafka 3.3, and ZooKeeper mode was removed entirely in Kafka 4.0.
FoundationDB
FoundationDB (acquired by Apple in 2015) uses a modified Paxos variant for its transaction system, but its architecture embodies many of the same principles as Raft.
Notable for:
- Simulation testing: FoundationDB's deterministic simulation framework can reproduce any concurrency bug by controlling all sources of non-determinism. This is considered the gold standard in distributed systems testing.
- The team's insight: simulation testing found more bugs than any other technique, including formal verification.