DISTRIBUTED SYSTEMS SERIES
Principal Systems Engineering Deep Dive

Building a Distributed Consensus Engine from Scratch: Raft Election and Replication Invariants

An in-depth architectural guide to distributed state machines, randomized election timeouts, the Log Matching Property, and implementing Raft in TypeScript.

Ebenezer Akinseinde, Software Developer & AI Automations Engineer
Published Jul 2026 · 32 min read · Web Engineering

01. The Distributed Consensus Challenge

In high-availability cloud infrastructure, databases cannot run on a single machine. To prevent data loss and support millions of concurrent users, we replicate state across multiple physical server nodes.

However, distributing state introduces the **Split-Brain Problem**. If a network partition occurs (e.g. cutting off Server 1 and Server 2 from Server 3), how do nodes agree on which writes are committed, in what order, and which node is the authoritative leader?

The Replicated State Machine Invariant:

Consensus algorithms let a cluster of machines act as a single coherent unit that survives the crash of a minority of its nodes. The system mandates that **if any node commits a log entry at index $I$, no other node in the cluster can ever commit a different entry at index $I$**.
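
As a rough illustration of this invariant, the sketch below inspects a set of node snapshots and reports an index at which two nodes have committed different entries. The `NodeSnapshot` shape and the `findSafetyViolation` helper are hypothetical names introduced for this example; a real cluster enforces the invariant through the protocol itself rather than by after-the-fact inspection.

```typescript
// Hypothetical snapshot of one node: its log and how far it has committed.
// Assumes commitIndex always points inside the log.
interface SnapshotEntry { term: number; command: string; }
interface NodeSnapshot { nodeId: number; log: SnapshotEntry[]; commitIndex: number; }

// Returns an index at which two nodes have committed different entries,
// or -1 if the snapshots are consistent with the invariant.
function findSafetyViolation(nodes: NodeSnapshot[]): number {
  for (let a = 0; a < nodes.length; a++) {
    for (let b = a + 1; b < nodes.length; b++) {
      // Only indices committed on BOTH nodes are constrained.
      const shared = Math.min(nodes[a].commitIndex, nodes[b].commitIndex);
      for (let i = 0; i <= shared; i++) {
        const x = nodes[a].log[i];
        const y = nodes[b].log[i];
        if (x.term !== y.term || x.command !== y.command) return i;
      }
    }
  }
  return -1;
}
```

Uncommitted entries are deliberately ignored: two nodes may hold different entries at the same index as long as at most one of them is ever committed.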

02. Paxos Complexity vs. Raft Usability

For decades, the standard distributed consensus algorithm was **Paxos**. While mathematically proven, Paxos is notoriously abstract, difficult to conceptualize, and highly complex to implement in real-world systems.

In 2014, Stanford researchers Diego Ongaro and John Ousterhout introduced **Raft**. Raft was designed with a single goal: **understandability**.

Raft achieves consensus by decomposing the problem into three cleanly separated sub-problems:

  • Leader Election: Electing a single coordinator node at the start of each term, for example when the current leader fails.
  • Log Replication: The leader receives writes from clients, copies them to followers, and marks an entry committed once a majority has stored it.
  • Safety: Enforcing strict invariants so that logs never diverge on committed entries and committed history is never overwritten.

03. The Raft Three-State Machine

At any given moment, a functioning Raft node resides in one of three clean states:

1. Follower

Completely passive. Follower nodes respond only to inbound RPC requests. They maintain an election timer that resets every time a heartbeat from the active leader arrives.

2. Candidate

Transitional state. If a follower's election timer expires, it increments its current term, transitions to Candidate, votes for itself, and broadcasts `RequestVote` RPCs to collect votes from a majority.

3. Leader

The active coordinator. Handles all client writes, manages replication indices, and periodically emits empty `AppendEntries` heartbeats to prevent followers from starting elections.
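
The three roles and the transitions between them can be sketched as a small pure function. The `RaftRole` and `RaftEvent` types and the event names are assumptions made for this illustration; they summarize the transitions described above plus the standard Raft rule that any node seeing a higher term reverts to Follower.

```typescript
type RaftRole = "Follower" | "Candidate" | "Leader";
type RaftEvent =
  | "ElectionTimeout"      // no heartbeat arrived before the timer fired
  | "WonMajority"          // candidate collected votes from a majority
  | "SawHigherTerm"        // an RPC carried a term newer than the local one
  | "HeardCurrentLeader";  // valid AppendEntries for the current term

// Returns the next role; events that do not apply leave the role unchanged.
function nextRole(role: RaftRole, event: RaftEvent): RaftRole {
  // A higher term always wins, whatever the current role.
  if (event === "SawHigherTerm") return "Follower";
  if (role === "Follower" && event === "ElectionTimeout") return "Candidate";
  if (role === "Candidate" && event === "WonMajority") return "Leader";
  if (role === "Candidate" && event === "HeardCurrentLeader") return "Follower";
  // Includes a Candidate timing out: it stays a Candidate and retries in a new term.
  return role;
}
```

Keeping the transition table free of side effects makes it easy to test separately from timers and networking.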

04. The AppendEntries Invariant Loop

To replicate state, Raft enforces the **Log Matching Property**:

"If two entries in different logs have the same index and term, they store the same command, and the logs are identical in all preceding entries."
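
To make the property concrete, here is a minimal sketch of a checker that compares two logs. The `MatchEntry` type and the `satisfiesLogMatching` function are hypothetical helpers for illustration only; Raft maintains the property inductively through the `AppendEntries` consistency check rather than by comparing whole logs.

```typescript
interface MatchEntry { term: number; command: string; }

// Returns true when the two logs satisfy the Log Matching Property:
// wherever they hold an entry with the same index and term, that entry
// and every preceding entry must be identical in both logs.
function satisfiesLogMatching(a: MatchEntry[], b: MatchEntry[]): boolean {
  const shared = Math.min(a.length, b.length);
  // Find the highest index at which both logs agree on the term.
  let lastAgree = -1;
  for (let i = 0; i < shared; i++) {
    if (a[i].term === b[i].term) lastAgree = i;
  }
  // Everything up to and including that index must match exactly.
  for (let i = 0; i <= lastAgree; i++) {
    if (a[i].term !== b[i].term || a[i].command !== b[i].command) return false;
  }
  return true;
}
```

Two logs may still differ after their last agreeing index; that divergent tail is exactly what the leader repairs during replication.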

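When a write (`SET x=10`) arrives at the leader, replication proceeds in two steps, append then commit. Despite the similar name, this is not the classic Two-Phase Commit protocol: Raft needs acknowledgments from only a majority, not from every participant.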

  1. Phase 1 (Append): The leader writes the entry to its local log and sends `AppendEntries` RPCs to all followers. Each follower checks that its log matches the leader's at the preceding index and term, then appends the entry to its own log.
  2. Phase 2 (Commit): Once the entry is stored on a majority of the cluster ($\lfloor N/2 \rfloor + 1$ servers, counting the leader itself), the leader marks it committed, applies it to its state machine, and returns success to the client. Followers learn the new commit index from the `leaderCommit` field of subsequent `AppendEntries` messages, including heartbeats.
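
The commit step can be sketched from the leader's side as follows. The function names and the `followerMatchIndex` array (the highest index known to be stored on each follower) are assumptions for this example. The extra check that the entry belongs to the leader's current term is not described above; it comes from the Raft paper's commitment rules and is included here as a labeled addition.

```typescript
// Highest log index that is stored on a majority of the cluster.
// followerMatchIndex holds one value per follower (hypothetical bookkeeping).
function majorityReplicatedIndex(leaderLastIndex: number, followerMatchIndex: number[]): number {
  // Count the leader's own log, then sort from highest to lowest.
  const all = followerMatchIndex.concat([leaderLastIndex]).sort((x, y) => y - x);
  const quorum = Math.floor(all.length / 2) + 1;
  // The quorum-th highest value is present on at least `quorum` servers.
  return all[quorum - 1];
}

// Returns the new commit index for the leader.
// logTerms[i] is the term of the leader's log entry at index i.
function advanceCommitIndex(
  commitIndex: number,
  currentTerm: number,
  logTerms: number[],
  followerMatchIndex: number[]
): number {
  const candidate = majorityReplicatedIndex(logTerms.length - 1, followerMatchIndex);
  // Assumption from the Raft paper: only entries from the leader's current
  // term are committed by counting replicas.
  if (candidate > commitIndex && logTerms[candidate] === currentTerm) return candidate;
  return commitIndex;
}
```

In a 3-node cluster, one follower acknowledgment plus the leader's own copy is already a majority, so a single slow follower does not delay commits.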

05. Live Raft Consensus Cluster Laboratory

This interactive dashboard simulates a **3-node Raft consensus cluster**. Node 1 starts as Leader and sends heartbeats. Click **"KILL LEADER"** to crash Node 1, then watch the Followers' election timers expire, one of them become a Candidate and request votes, and a new Leader emerge to restore stability.

[Interactive demo: Raft Consensus Engine, Distributed Cluster Failover & Replication Analyzer. Initial state: Term 1; Node 1 is Leader, Nodes 2 and 3 are Followers; each node reports a log length (LogL) of 1. Console output: "Cluster online. Node 1 elected Leader for Term 1."]

06. TypeScript Raft Node Engine Implementation

Below is a compact TypeScript class sketching the core `RaftConsensusEngine` logic: the `AppendEntries` payload and its handler, the election trigger, and the application of committed entries to the state machine:

raft-consensus-engine.ts
interface LogEntry {
  term: number;
  command: string;
}

interface AppendEntriesPayload {
  term: number;
  leaderId: number;
  prevLogIndex: number;
  prevLogTerm: number;
  entries: LogEntry[];
  leaderCommit: number;
}

export class RaftConsensusEngine {
  private nodeId: number;
  private currentTerm: number = 0;
  private votedFor: number | null = null;
  private log: LogEntry[] = [];
  
  private commitIndex: number = 0;
  private lastApplied: number = 0;
  
  private state: "Leader" | "Follower" | "Candidate" = "Follower";
  private leaderId: number | null = null;

  constructor(nodeId: number) {
    this.nodeId = nodeId;
    this.initializeLog();
  }

  private initializeLog() {
    // Index 0 holds a sentinel entry, so real entries start at index 1
    this.log.push({ term: 0, command: "SENTINEL" });
  }

  // 1. Inbound AppendEntries RPC handler (Heartbeats & Replication)
  public handleAppendEntries(payload: AppendEntriesPayload): { term: number; success: boolean } {
    // Rule 1: Reply false if term is outdated
    if (payload.term < this.currentTerm) {
      return { term: this.currentTerm, success: false };
    }

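    // A valid AppendEntries for the current or a newer term proves a leader exists.
    // Adopt a newer term (clearing the vote cast in the old one), then step down:
    // a Candidate in the same term must also yield to the established leader.
    if (payload.term > this.currentTerm) {
      this.currentTerm = payload.term;
      this.votedFor = null;
    }
    this.state = "Follower";

    this.leaderId = payload.leaderId;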

    // Rule 2: Reply false if log doesn't contain entry at prevLogIndex matching prevLogTerm
    if (this.log.length <= payload.prevLogIndex || this.log[payload.prevLogIndex].term !== payload.prevLogTerm) {
      return { term: this.currentTerm, success: false };
    }

    // Rule 3: Write new entries, overwrite if existing conflicts
    let entryIdx = payload.prevLogIndex + 1;
    for (const newEntry of payload.entries) {
      if (this.log[entryIdx]) {
        if (this.log[entryIdx].term !== newEntry.term) {
          // Conflict found! Prune log from this point
          this.log = this.log.slice(0, entryIdx);
          this.log.push(newEntry);
        }
      } else {
        this.log.push(newEntry);
      }
      entryIdx++;
    }

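    // Rule 4: Advance commitIndex to min(leaderCommit, index of last new entry).
    // Entries beyond the last new entry have not been verified against the leader.
    if (payload.leaderCommit > this.commitIndex) {
      const lastNewIndex = payload.prevLogIndex + payload.entries.length;
      // Math.max keeps commitIndex from moving backwards on a delayed RPC
      this.commitIndex = Math.max(this.commitIndex, Math.min(payload.leaderCommit, lastNewIndex));
      this.applyToStateMachine();
    }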

    return { term: this.currentTerm, success: true };
  }

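  // 2. Election Trigger: Broadcast RequestVote RPCs
  // broadcastVotes is assumed to resolve to true once a majority of the cluster
  // (counting this node's own vote) has granted its vote for the new term.
  public initiateElection(clusterSize: number, broadcastVotes: () => Promise<boolean>) {
    this.state = "Candidate";
    this.currentTerm += 1;
    this.votedFor = this.nodeId; // vote for self
    const electionTerm = this.currentTerm;
    const quorum = Math.floor(clusterSize / 2) + 1;

    console.log(`Node ${this.nodeId} starting election for Term ${electionTerm} (needs ${quorum} of ${clusterSize} votes)`);

    // Broadcast RequestVote payloads to peer nodes
    broadcastVotes()
      .then(granted => {
        // Discard a stale result: the node may have stepped down or moved to a newer term meanwhile
        if (granted && this.state === "Candidate" && this.currentTerm === electionTerm) {
          this.state = "Leader";
          this.leaderId = this.nodeId;
          console.log(`Node ${this.nodeId} elected Leader for Term ${this.currentTerm}`);
          this.startHeartbeatBroadcast();
        }
      })
      .catch(err => {
        // A failed broadcast leaves the node a Candidate; the next election timeout retries
        console.error(`Node ${this.nodeId} vote broadcast failed:`, err);
      });
  }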

  private startHeartbeatBroadcast() {
    // Placeholder: a full implementation would emit empty AppendEntries payloads on a
    // timer shorter than the election timeout, so followers do not start elections
    console.log(`Leader Node ${this.nodeId} broadcasting Term ${this.currentTerm} heartbeat.`);
  }

  private applyToStateMachine() {
    while (this.commitIndex > this.lastApplied) {
      this.lastApplied++;
      const committedCommand = this.log[this.lastApplied].command;
      console.log(`Applying committed index ${this.lastApplied} ("${committedCommand}") to Node ${this.nodeId} State Machine.`);
    }
  }
}
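
The class above shows the candidate's side of an election but not the voter's side. The following is a minimal sketch of a `RequestVote` handler, written as a standalone function so it does not alter the class. The `VoterState` and `RequestVotePayload` shapes and the function name are hypothetical; the rules (one vote per term, and only for a candidate whose log is at least as up to date) follow the Raft paper. It assumes the log always contains the index 0 sentinel, and it omits the step-down to Follower that a full node would also perform on seeing a higher term.

```typescript
interface VoteLogEntry { term: number; command: string; }
interface RequestVotePayload {
  term: number;
  candidateId: number;
  lastLogIndex: number;
  lastLogTerm: number;
}
interface VoterState { currentTerm: number; votedFor: number | null; log: VoteLogEntry[]; }

// Mutates `voter` in place and returns the RPC reply.
function handleRequestVote(
  voter: VoterState,
  req: RequestVotePayload
): { term: number; voteGranted: boolean } {
  // Reject candidates from an older term.
  if (req.term < voter.currentTerm) {
    return { term: voter.currentTerm, voteGranted: false };
  }
  // A newer term invalidates any vote cast in the previous term.
  if (req.term > voter.currentTerm) {
    voter.currentTerm = req.term;
    voter.votedFor = null;
  }
  const myLastIndex = voter.log.length - 1;
  const myLastTerm = voter.log[myLastIndex].term;
  // The candidate's log must be at least as up to date as the voter's:
  // a later last term wins; on equal terms the longer log wins.
  const upToDate =
    req.lastLogTerm > myLastTerm ||
    (req.lastLogTerm === myLastTerm && req.lastLogIndex >= myLastIndex);
  const canVote = voter.votedFor === null || voter.votedFor === req.candidateId;
  if (canVote && upToDate) {
    voter.votedFor = req.candidateId;
    return { term: voter.currentTerm, voteGranted: true };
  }
  return { term: voter.currentTerm, voteGranted: false };
}
```

The up-to-date check is what prevents a node that missed committed entries from winning an election and overwriting them.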

07. Production Network Partitions & Split Brains

In distributed systems, one of the hardest failures to handle is the **Network Partition**, which leads to split brain if both sides keep accepting writes.

The Split Quorum Invariant

Suppose you have a 5-node cluster. A network line cuts, splitting the cluster into two factions:

  • **Side A (Minority):** Node 1 and Node 2.
  • **Side B (Majority):** Node 3, Node 4, and Node 5.

If a client sends a write to Node 1 on Side A (assume Node 1 was the leader before the cut), Node 1 appends the entry and replicates it to Node 2. Because the partition blocks the other three nodes, Side A can gather only **2 of 5 acknowledgments**: Node 1's own copy plus Node 2's.

Because 2 is less than the required quorum ($\lfloor 5/2 \rfloor + 1 = 3$), Node 1 **cannot commit the log entry**. The entry stays uncommitted, and Node 1 returns an error or keeps the request pending.

Concurrently on Side B, the followers stop receiving heartbeats and an election triggers. Since Side B contains **3 online nodes**, it holds a valid quorum, elects a new leader (say Node 3) for a higher term, and continues committing writes.
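
The arithmetic for both sides of the partition can be checked with a short sketch. The helper names `quorumSize` and `hasQuorum` are introduced here for illustration only.

```typescript
// Smallest number of nodes that forms a strict majority of the cluster.
function quorumSize(clusterSize: number): number {
  return Math.floor(clusterSize / 2) + 1;
}

// True when one side of a partition can still elect a leader and commit.
function hasQuorum(reachableNodes: number, clusterSize: number): boolean {
  return reachableNodes >= quorumSize(clusterSize);
}

const sideA = hasQuorum(2, 5); // false: Nodes 1 and 2 cannot commit
const sideB = hasQuorum(3, 5); // true: Nodes 3, 4, and 5 can elect and commit
```

Because two disjoint groups can never both hold a strict majority, at most one side of any partition can make progress.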

When the partition heals, Node 1 sees Node 3's higher term number and demotes itself to follower. Node 1 and Node 2 then discard their uncommitted Side A entries and replace them with Side B's committed history, restoring a consistent log across the cluster.
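
The healing step can be traced with a side-effect-free version of the follower's append rule. The `reconcileLog` function, the `HealEntry` type, and the sample commands are hypothetical and chosen only to mirror the scenario above.

```typescript
interface HealEntry { term: number; command: string; }

// Returns the reconciled follower log, or null when the follower's entry at
// prevLogIndex does not match (the leader would then retry from an earlier index).
function reconcileLog(
  local: HealEntry[],
  prevLogIndex: number,
  prevLogTerm: number,
  entries: HealEntry[]
): HealEntry[] | null {
  if (prevLogIndex >= local.length || local[prevLogIndex].term !== prevLogTerm) {
    return null;
  }
  const result = local.slice();
  for (let k = 0; k < entries.length; k++) {
    const idx = prevLogIndex + 1 + k;
    if (idx < result.length && result[idx].term !== entries[k].term) {
      // Conflict: drop this entry and everything after it.
      result.length = idx;
    }
    if (idx >= result.length) result.push(entries[k]);
  }
  return result;
}

// Node 1 still holds an uncommitted Term 1 write from Side A at index 2.
const node1Log: HealEntry[] = [
  { term: 0, command: "SENTINEL" },
  { term: 1, command: "SET a=1" },
  { term: 1, command: "SET x=10" },
];
// Node 3, leader in Term 2, sends the entry it committed at index 2.
const healed = reconcileLog(node1Log, 1, 1, [{ term: 2, command: "SET y=5" }]);
```

After the call, `healed` holds `SET y=5` at index 2 and the uncommitted `SET x=10` is gone, while `node1Log` itself is left untouched.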

08. Engineering Takeaways

Raft underpins modern high-availability platforms such as etcd (the datastore behind Kubernetes), CockroachDB, and HashiCorp Consul.

  • Consensus is quorum-based: A cluster needs a strict majority ($\lfloor N/2 \rfloor + 1$ nodes) online and reachable to safely elect leaders or commit state changes.
  • Safety invariants protect history: Randomized election timers reduce split votes, while term numbers let nodes detect stale leaders and protect committed entries.
  • Raft is understandable: Decomposing consensus into isolated election, replication, and safety loops is what makes Raft easy to maintain.
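
Since the randomized election timer is mentioned but not shown in the engine code, here is a minimal sketch of one way to pick it. The function name and the injectable `random` parameter are assumptions for this example; the 150 to 300 ms range is the one suggested in the Raft paper, and real deployments tune it to their network latency.

```typescript
// Picks a timeout uniformly from [minMs, maxMs).
// The random source is injectable so tests can be deterministic.
function randomElectionTimeoutMs(
  minMs: number,
  maxMs: number,
  random: () => number = Math.random
): number {
  return minMs + Math.floor(random() * (maxMs - minMs));
}

// Each node draws its own value, so timers rarely expire at the same instant.
const timeoutMs = randomElectionTimeoutMs(150, 300);
```

A node would draw a fresh value every time it resets its timer, which keeps two nodes that happened to collide once from colliding repeatedly.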

Replication is the baseline of resilient databases. By implementing Raft's invariants correctly, you can build distributed systems that keep running through the failure of any minority of nodes.
