The Problem: Why Consensus is Hard
CAP Theorem: The Fundamental Trade-off
The CAP theorem (Brewer, 2000) states that a distributed system can guarantee at most two of three properties simultaneously: Consistency, Availability, and Partition Tolerance.
- Consistency: Every read receives the most recent write or an error. All nodes see the same data at the same time.
- Availability: Every request receives a (non-error) response — though it might not be the most recent write.
- Partition Tolerance: The system continues operating despite dropped or delayed messages between nodes.
In practice, network partitions are inevitable in any distributed system. Engineers therefore choose between CP systems (strong consistency, may refuse requests during partitions) and AP systems (always respond, may return stale data). Raft is a CP algorithm — it prioritises correctness over availability during network splits.
Why this matters: Systems like ZooKeeper, etcd, and Consul are CP. Systems like Cassandra and DynamoDB are AP. Understanding where your system sits on the CAP spectrum is essential for reasoning about failure modes.
FLP Impossibility: Why Perfect Consensus is Impossible
The Fischer-Lynch-Paterson (FLP) impossibility result (1985) proved that in a fully asynchronous system, no deterministic algorithm can guarantee consensus if even a single process can crash.
This might sound alarming, but the practical takeaway is: you cannot build a perfectly safe and perfectly live consensus system simultaneously. All practical consensus algorithms (Paxos, Raft, Zab) resolve this by relaxing the liveness guarantee — they may stall if too many nodes fail, but they will never return incorrect results.
FLP assumes a fully asynchronous model where you cannot distinguish a slow node from a dead one. Real systems use timeouts as a practical approximation to detect failures and make progress.
Byzantine vs Crash Failures
Raft handles crash-stop failures — nodes fail by stopping. It does not handle Byzantine failures, where nodes can behave arbitrarily (return wrong data, send conflicting messages to different peers, lie about their state).
| Failure Model | Example | Algorithm |
|---|---|---|
| Crash-stop | Server power loss | Raft, Paxos, Zab |
| Crash-recovery | Reboot after power loss | Multi-Paxos, Raft with log persistence |
| Byzantine | Compromised node, cosmic ray bit-flip | PBFT, HotStuff, Tendermint |
Byzantine Fault Tolerance (BFT) requires 3f+1 nodes to tolerate f traitors. Crash-fault tolerance requires only 2f+1 nodes to tolerate f failures. This cost difference is why Raft is used in data centres (trusted environment) while BFT is used in blockchains (zero-trust environment).
Quorum: The Mathematical Foundation
A quorum is the minimum number of nodes required to reach a decision. For a cluster of N nodes tolerating f failures, a quorum of ⌊N/2⌋ + 1 (a strict majority) ensures that:
- Any two quorums overlap — at least one shared node has seen both decisions.
- A failed minority (≤ f nodes) cannot prevent progress.
With 5 nodes (f=2), quorum = 3. This means the cluster can lose any 2 nodes and still make progress. Losing 3 nodes means we can no longer form a quorum and the system stalls (but never corrupts data).
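The overlap argument is just arithmetic; a quick sketch (hypothetical helper, not part of raft_node.py) makes it concrete:

```python
def quorum_size(n: int) -> int:
    """Smallest strict majority of an n-node cluster."""
    return n // 2 + 1

for n in (3, 5, 7):
    q = quorum_size(n)
    # Two quorums of size q drawn from n nodes must share at least
    # 2*q - n >= 1 nodes, so disjoint majorities are impossible.
    assert 2 * q - n >= 1
    print(f"{n} nodes: quorum = {q}, tolerates {n - q} failures")
```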
The Raft Algorithm: Core Concepts
Node Roles: Follower, Candidate, Leader
Every Raft node is in exactly one of three states at any time:
Follower — Passive. Responds to RPCs from leaders and candidates. If no communication is received within the election timeout, the follower promotes itself to candidate.
Candidate — Seeking election. Has incremented its term and is broadcasting RequestVote RPCs. Transitions to leader on winning a quorum of votes, or back to follower if a valid leader makes contact.
Leader — Active. Handles all client requests, replicates log entries to followers, and sends periodic heartbeats to prevent new elections. There is at most one leader per term.
Terms: Raft's Logical Clock
A term is a monotonically increasing integer. Each election begins a new term. Terms serve as a logical clock — they allow nodes to detect stale information (a leader from an old term) and ignore it.
Key invariants:
- A leader can only exist in one term.
- If a node receives an RPC with a term > its own, it immediately updates its term and reverts to follower.
- No node will vote for a candidate whose term is less than its own current term.
Terms and term transitions are how Raft ensures that old leaders (e.g., a node that was network-partitioned) cannot overwrite committed data when they reconnect.
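The second invariant is usually implemented as a single check that every RPC handler runs first. A minimal sketch, assuming a node with `current_term`, `voted_for`, and `role` fields (illustrative names, not necessarily those in raft_node.py):

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    current_term: int = 0
    voted_for: int | None = None
    role: str = "follower"

def observe_term(node: NodeState, rpc_term: int) -> None:
    # A message carrying a higher term means this node is stale: adopt
    # the term, clear the vote, and revert to follower regardless of
    # whether it was a leader, candidate, or follower.
    if rpc_term > node.current_term:
        node.current_term = rpc_term
        node.voted_for = None
        node.role = "follower"
```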
Election Timeout: Triggering Leader Election
The election timeout is a random duration (e.g., 150–300 ms) that each follower waits before starting an election. Randomisation is the key insight — it staggers elections so that in most cases, one node's timer fires first, collects votes from the still-sleeping others, and wins.
```python
# From raft_node.py
self._election_timeout = random.uniform(
    ELECTION_TIMEOUT_MIN,  # e.g. 1.5s
    ELECTION_TIMEOUT_MAX,  # e.g. 3.0s
)
```

The leader resets followers' timers by sending periodic heartbeats (empty AppendEntries RPCs). As long as the leader is alive and the network is healthy, followers never time out.
Vote Granting Rules
A node grants its vote to a candidate only when all of the following are true:
- Term is current: `candidate_term >= current_term`.
- One vote per term: The node hasn't already voted for a different candidate this term.
- Log is at least as fresh (§5.4): The candidate's log is at least as up-to-date as the voter's. Freshness is determined by comparing the last log entry's term first, then log length as a tiebreaker.
Rule 3 is critical for safety. It prevents a node with a stale log from becoming leader and overwriting entries that were already committed by a quorum.
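Putting the three rules together, the vote decision might look like the sketch below. It assumes the `VoteRequest` payload shown in the next section and a node holding `current_term`, `voted_for`, and a `log` of entries with a `term` field; this is a simplified illustration, not the exact raft_node.py logic.

```python
def should_grant_vote(node, req) -> bool:
    # Rule 1: never vote for a candidate from an older term.
    if req.candidate_term < node.current_term:
        return False
    # Rule 2: at most one vote per term (None means not voted yet).
    if node.voted_for not in (None, req.candidate_id):
        return False
    # Rule 3 (§5.4): candidate's log must be at least as up-to-date:
    # compare the last entry's term first, then log length.
    my_last_term = node.log[-1].term if node.log else 0
    my_last_index = len(node.log) - 1
    if req.last_log_term != my_last_term:
        return req.last_log_term > my_last_term
    return req.last_log_index >= my_last_index
```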
Leader Election in Detail
Starting an Election
When a follower's election timeout fires:
- It increments its `current_term`.
- It transitions to Candidate and votes for itself.
- It resets its own election timer (in case this election fails and it needs to retry).
- It broadcasts a `RequestVote` RPC to all other nodes.
```python
# RequestVote RPC payload
@dataclass
class VoteRequest:
    candidate_id: int     # Who is asking
    candidate_term: int   # The candidate's current term
    last_log_index: int   # Length of candidate's log - 1
    last_log_term: int    # Term of candidate's last log entry
```

Collecting Votes
Each recipient of a RequestVote RPC independently evaluates whether to grant its vote based on the three rules above. The candidate collects responses and counts grants.
A subtle but important rule: votes are durable. If a node crashes after voting and restarts, it must remember who it voted for (in production, this is persisted to disk). Without this, it could vote for two different candidates in the same term after a restart, potentially electing two leaders.
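A sketch of what that persistence might look like (the JSON file and fsync-then-rename scheme are illustrative; the actual raft_node.py storage layer may differ):

```python
import json
import os

def persist_hard_state(path: str, current_term: int, voted_for: int | None) -> None:
    # current_term and voted_for must reach stable storage *before* the
    # vote response is sent, so a crash-and-restart cannot double-vote.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"current_term": current_term, "voted_for": voted_for}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename keeps the old state on crash
```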
Winning the Election
Once a candidate collects votes from a strict majority (quorum), it:
- Transitions to Leader.
- Initialises `nextIndex` and `matchIndex` for each peer (used for log replication).
- Immediately sends an empty `AppendEntries` (heartbeat) to all followers to assert authority and prevent new elections.
The quorum requirement for winning ensures that at most one candidate can win per term — two quorums of 3 (out of 5 nodes) must overlap by at least 1 node, and that overlap node can have voted for only one candidate.
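A sketch of the transition, assuming a `peers` list, 0-based log indices, and a hypothetical `send_heartbeats()` helper (illustrative names, not necessarily the raft_node.py API):

```python
def become_leader(node, peers: list[int]) -> None:
    node.role = "leader"
    # Optimistically assume every follower's log already matches ours;
    # nextIndex is walked backwards on rejection (see log replication).
    node.next_index = {p: len(node.log) for p in peers}
    # matchIndex records the highest entry known to be replicated on
    # each peer; nothing is confirmed yet (-1 with 0-based indices).
    node.match_index = {p: -1 for p in peers}
    node.send_heartbeats()  # empty AppendEntries asserts authority
```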
Split Vote Resolution
It's possible for two candidates to simultaneously receive votes from non-overlapping sets of nodes, with neither reaching quorum. Raft resolves this via timeout randomisation:
- Both candidates' election timers expire at different random times in the next election.
- The one that fires first collects votes before the other becomes a candidate again.
The expected number of rounds to elect a leader is close to 1. The randomisation window should be large enough that a single message round-trip completes within it.
Log Replication
AppendEntries RPC
AppendEntries is the heartbeat AND the replication mechanism. The leader sends it to all followers:
- As a heartbeat (empty `entries`): Resets followers' election timers, advancing their knowledge of the current term and the leader's commit index.
- As a replication RPC (`entries` non-empty): Carries new log entries for the follower to append.
```python
@dataclass
class AppendEntriesRequest:
    leader_id: int
    term: int
    prev_log_index: int      # Index of log entry immediately before new ones
    prev_log_term: int       # Term of that entry — consistency check
    entries: list[LogEntry]  # New entries to store (empty = heartbeat)
    leader_commit: int       # Leader's current commit_index
```

The `prev_log_index` and `prev_log_term` fields implement the Log Matching Property: a follower only accepts new entries if its log matches the leader's log up to the `prev_log_index` entry. This inductively guarantees that two logs with the same last entry are identical up to that point.
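A follower-side sketch of that check, assuming the `AppendEntriesRequest` payload above and 0-based indices where `prev_log_index == -1` means the entries start at the head of the log (simplified; conflict hints for the back-off optimisation are omitted):

```python
def handle_append_entries(node, req) -> bool:
    # Reject leaders from stale terms outright.
    if req.term < node.current_term:
        return False
    # Log Matching check: our log must contain an entry at prev_log_index
    # with the matching term, otherwise the leader has to back off.
    if req.prev_log_index >= 0:
        if req.prev_log_index >= len(node.log):
            return False
        if node.log[req.prev_log_index].term != req.prev_log_term:
            return False
    # Drop conflicting suffixes, then append entries we don't have yet.
    for i, entry in enumerate(req.entries):
        idx = req.prev_log_index + 1 + i
        if idx < len(node.log) and node.log[idx].term != entry.term:
            del node.log[idx:]            # conflict: discard the suffix
        if idx >= len(node.log):
            node.log.append(entry)
    # Advance the local commit index, bounded by the last entry we hold.
    if req.leader_commit > node.commit_index:
        last_new = req.prev_log_index + len(req.entries)
        node.commit_index = min(req.leader_commit, last_new)
    return True
```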
Commit Index Advancement
An entry becomes committed once the leader has replicated it on a majority of nodes. Commitment is permanent — a committed entry will survive in the log even if the leader crashes.
Leader's commit logic:
- After receiving a successful `AppendEntriesResponse` from a follower, update `matchIndex[follower]`.
- Find the highest log index N such that `matchIndex[i] >= N` for a majority of nodes AND `log[N].term == currentTerm`.
- Advance `commitIndex` to N.
The condition `log[N].term == currentTerm` prevents a subtle bug (Raft paper §5.4.2): an entry from an earlier term can reach a majority and still be overwritten by a later leader, so a leader only commits entries from its own term by counting replicas; earlier entries then become committed indirectly.
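A sketch of that rule on the leader, reusing the `matchIndex` map from the election section and 0-based indices (illustrative, not the exact raft_node.py code):

```python
def advance_commit_index(node, cluster_size: int) -> None:
    quorum = cluster_size // 2 + 1
    # Walk from the newest entry back towards the current commit point.
    for n in range(len(node.log) - 1, node.commit_index, -1):
        # The leader itself plus every follower known to hold entry n.
        replicas = 1 + sum(1 for m in node.match_index.values() if m >= n)
        # Only entries from the leader's own term are committed by
        # counting replicas; earlier entries commit indirectly.
        if replicas >= quorum and node.log[n].term == node.current_term:
            node.commit_index = n
            break
```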
Log Matching Property
Raft's safety guarantee rests on the Log Matching Property:
If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index.
This follows inductively from the AppendEntries consistency check:
- Base case: The initial empty log is trivially consistent.
- Inductive step: A follower only appends an entry at index N if its log at index N-1 matches the leader's, so the two logs are identical through N.
Handling Inconsistent Logs
A new leader may find followers with shorter or divergent logs (from previous leaders that crashed mid-replication). The leader handles this by backing off nextIndex[follower] until the follower's log is consistent, then resending all missing entries.
In the optimisation described in the Raft paper (§5.3), the follower can return a conflictTerm and conflictIndex hint to allow the leader to skip multiple back-off steps at once, reducing the number of AppendEntries round-trips needed to synchronise a lagging follower.
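A sketch of the unoptimised back-off (one step per rejected RPC), assuming the `nextIndex`/`matchIndex` maps introduced earlier:

```python
def on_append_entries_response(node, follower: int, ok: bool, sent_up_to: int) -> None:
    if ok:
        # The follower now matches us up to the last entry we sent.
        node.match_index[follower] = sent_up_to
        node.next_index[follower] = sent_up_to + 1
    else:
        # Consistency check failed: retry with an earlier prev_log_index
        # on the next AppendEntries to this follower.
        node.next_index[follower] = max(0, node.next_index[follower] - 1)
```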
Safety and Liveness
The Leader Completeness Property
If a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms.
This is Raft's core safety property. It is enforced by the vote-granting rule: a candidate can only win an election if at least one voter in its quorum has the committed entry, meaning the candidate's log is at least as up-to-date as any committed entry.
Proof sketch:
- Entry E is committed at index I in term T (replicated on a quorum Q).
- Any future leader in term T' > T must win a quorum Q'.
- Q ∩ Q' is non-empty (both are majorities).
- The shared node voted for the new leader, so the new leader's log is at least as fresh as the shared node's, which contains E.
- Therefore, the new leader's log contains E.
Network Partitions and Split-Brain
When a network partition splits a cluster, Raft guarantees no split-brain (two leaders that both accept writes and diverge):
- The minority partition cannot elect a leader (lacks quorum).
- The majority partition can elect a leader and make progress.
Any leader stranded in the minority partition steps down upon reconnection, when it observes RPCs carrying a higher term.
This is why only committed entries are durable. If the client receives an acknowledgement only after the leader commits the entry (quorum-acknowledged), the write is safe. If the leader acknowledges after merely appending the entry locally (before reaching a quorum), the write may be lost in a partition.
Re-election After Partition Heals
When the network partition heals:
- Nodes on the minority side receive `AppendEntries` or `RequestVote` RPCs with a higher term.
- They immediately update their term and revert to follower.
- Their logs are overwritten to match the current leader's log.
- Any uncommitted entries in their log are discarded.
This convergence is safe because the discarded entries were never committed — no quorum acknowledged them, so no client was told they were durable.
Linearizability Guarantee
Raft provides linearizability: each operation appears to execute atomically at some point between its invocation and completion. This means:
- No stale reads (reads always reflect the latest committed write).
- No duplicate executions (even with retries after leader failure).
Achieving true linearizability in practice requires the leader to:
- Read-your-writes: confirm it is still the leader before responding (by getting a quorum on a heartbeat before serving reads).
- Idempotency: deduplicate client retries using unique request IDs in the log.
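One common way to get the idempotency half is to track the latest request ID per client inside the replicated state machine itself, so a retried command replays its cached result instead of executing twice. A sketch (hypothetical key-value machine, not part of raft_node.py):

```python
class DedupKVStateMachine:
    def __init__(self) -> None:
        self.data: dict[str, str] = {}
        # client_id -> (last sequence number applied, its result)
        self.last_applied: dict[str, tuple[int, str]] = {}

    def apply(self, client_id: str, seq: int, key: str, value: str) -> str:
        seen = self.last_applied.get(client_id)
        if seen is not None and seen[0] == seq:
            # A retry of an already-applied command: return the cached
            # result rather than executing the write a second time.
            return seen[1]
        self.data[key] = value
        self.last_applied[client_id] = (seq, "ok")
        return "ok"
```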
Real-World Applications
etcd: The Backbone of Kubernetes
etcd is the primary data store for Kubernetes cluster state (pod specs, service definitions, ConfigMaps, Secrets). It uses Raft as its consensus algorithm.
Key design choices:
- Key-value store with a simple HTTP/gRPC API and a watch mechanism.
- Strong consistency: all reads go through the leader (no stale reads).
- Watch API: clients long-poll for changes; etcd pushes updates when keys change. This powers Kubernetes controllers that react to cluster state changes.
- Typical cluster size: 3 or 5 nodes. 7 is the maximum recommended (performance degrades as quorum size grows).
When etcd is unavailable, Kubernetes cannot schedule new pods or update state, though already-running pods continue running (control plane is affected, data plane is not).
CockroachDB: Distributed SQL
CockroachDB runs one Raft group per range (a 64 MB key-space shard). A table with many rows has many Raft groups running concurrently, each independently electing leaders and replicating writes.
This design means:
- Writes to different rows can be parallelised across multiple Raft leaders.
- A Raft group is affected by partition only if the nodes hosting its replicas are split.
- Cross-range transactions use a two-phase commit protocol on top of Raft for atomicity.
Apache Kafka KRaft Mode
Kafka originally used ZooKeeper (which uses Zab, a Raft-like protocol) for its metadata and controller election. KRaft (Kafka Raft) replaces ZooKeeper with a built-in Raft implementation:
- Metadata log: All cluster metadata (topic configs, partition assignments) is stored in a Raft-replicated log.
- Controller election: A single controller is the Raft leader for the metadata partition.
- Benefit: Eliminates ZooKeeper as a separate operational concern, reduces latency, and supports millions of partitions (ZooKeeper was the scalability bottleneck).
KRaft became production-ready in Kafka 3.3, and ZooKeeper mode was removed entirely in Kafka 4.0.
FoundationDB
FoundationDB (acquired by Apple in 2015) uses a modified Paxos variant for its transaction system, but its architecture embodies many of the same principles as Raft.
Notable for:
- Simulation testing: FoundationDB's deterministic simulation framework can reproduce any concurrency bug by controlling all sources of non-determinism. This is considered the gold standard in distributed systems testing.
- The team's insight: simulation testing found more bugs than any other technique, including formal verification.