DATABASE INTERNALS SERIES
Read on Dev.to
Principal Systems Engineering Deep Dive

Engineering a High-Performance LSM-Tree Storage Engine: MemTables, SSTables, and Compaction

An architectural analysis of write-ahead logging, sparse index searches, bloom filters, and merged-sort compaction loops in high-throughput database systems.

EA
Ebenezer AkinseindeSoftware Developer & AI Automations Engineer
Published Jun 202629 min readWeb Engineering

01.Live LSM-Tree Laboratory Sandbox

Interact with the live storage engine simulator below. Write keys to fill the in-memory **MemTable** (limit: 4 keys). Watch it automatically flush sorted keys into **SSTable files** on the Level 0 disk, and click **"Run Compaction"** to merge them!

LSM-TREE ENGINE

Write-Path & Compaction simulator

Client PUT operations
Client GET Query
IN-MEMORY MEMTABLE (Sorted Tree):
MemTable empty. Active writes flush here.
Disk storage structures
LEVEL 0 (Append-Only flushes)Unconsolidated
sst_0_1Bloom: 0x3A
db_port5432
envproduction
LEVEL 1 (Compacted & sorted single partition)Consolidated
Level 1 empty. Run Compaction to merge Level 0 files.
LSM SYSTEM TRANSACTION LOGS:
>Storage Engine online. Initialized WAL and MemTable.

02.The LSM-Tree Design Architecture

Traditional relational databases like PostgreSQL use B-Trees. B-Trees perform random in-place updates, writing data directly to file locations on disk. While this allows fast read lookups, it makes write operations slow due to random disk access overhead.

To handle high write throughput, modern NoSQL storage engines (such as LevelDB, RocksDB, and Cassandra) use Log-Structured Merge Trees (LSM-Trees).

The core philosophy of the LSM-Tree is simple: **convert slow random disk writes into fast sequential appends**. It decomposes database operations into a memory buffer and multi-tiered, read-only disk structures:

  • MemTable: An in-memory, sorted data structure (usually a Red-Black Tree or SkipList) where all client writes are buffered.
  • Write-Ahead Log (WAL): An append-only log file on disk. Writes are appended to the WAL before being inserted into the MemTable, guaranteeing durability in case of power loss.
  • SSTables (Sorted String Tables): Read-only files on disk containing keys and values sorted alphabetically, divided into numeric layers (Level 0, Level 1, etc.).

03.Write-Ahead Logs & MemTables

When a write operation arrives, the database must write it to disk immediately to prevent data loss.

If we wrote sorted SSTables on every write, it would require rewriting the entire disk file—causing massive write amplification. Instead, the engine writes to the Write-Ahead Log (WAL). Because it is append-only, the write is done in a single sequential disk seek, making it incredibly fast.

Concurrently, the write is inserted into the MemTable. When the MemTable exceeds a configured size limit (e.g., 64MB in production), it transitions to a read-only state. A background worker then flushes its sorted contents into a new Level 0 SSTable file, clears the corresponding WAL, and creates a fresh, active MemTable.

04.SSTables and Sparse Indexing

An **SSTable** contains sorted key-value pairs stored sequentially. It is divided into data blocks, followed by an index block at the end of the file.

Instead of indexing every single key, the engine uses a Sparse Index (e.g. keeping only the first key of every 4KB block).

To look up a key, the engine searches the sparse index to find the boundary block that could contain the target key, loads that single block from disk, and scans it sequentially. This drastically reduces memory overhead, allowing the database to reference petabytes of data using only a few megabytes of index memory.

05.Bloom Filter Mechanics

If a key does not exist, looking it up requires checking every single SSTable file on disk. This is known as the **Disk Read Amplification Problem**.

To prevent unnecessary disk reads, each SSTable is accompanied by a Bloom Filter in memory. A Bloom Filter is a probabilistic bit-vector that checks key membership:

"A Bloom Filter guarantees that it will never return a false negative. If it returns false, the key is definitely not in the file. If it returns true, the key might be in the file, warranting a disk read."

By checking the Bloom Filter before reading the file, the engine skips up to 99% of unnecessary disk operations.

06.TypeScript LSM Storage Engine Implementation

Below is a clean, modular TypeScript interface mapping the core `LSMStorageEngine` class, showing WAL appends, sorted flushes, and sparse index evaluations:

lsm-storage-engine.ts
interface LogEntry {
  op: "PUT" | "DELETE";
  key: string;
  value: string;
}

interface SSTableIndex {
  key: string;
  offset: number;
}

export class LSMStorageEngine {
  private memTable: Map<string, string> = new Map();
  private walLog: LogEntry[] = [];
  private activeMemTableBytes: number = 0;
  private readonly MAX_MEMTABLE_BYTES = 64 * 1024 * 1024; // 64MB

  /**
   * Appends operation to Write-Ahead Log (WAL), 
   * updates the in-memory MemTable, and flushes to disk if full.
   */
  public put(key: string, value: string): void {
    const entry: LogEntry = { op: "PUT", key, value };
    
    // 1. Durability: Sequential append to WAL log
    this.appendWal(entry);

    // 2. High Throughput: In-memory write to MemTable
    this.memTable.set(key, value);
    this.activeMemTableBytes += key.length + value.length;

    // 3. Flush Check
    if (this.activeMemTableBytes >= this.MAX_MEMTABLE_BYTES) {
      this.flushMemTableToSSTable();
    }
  }

  public get(key: string): string | null {
    // 1. Check in-memory MemTable
    if (this.memTable.has(key)) {
      return this.memTable.get(key) || null;
    }

    // 2. Read from disk SSTables in order (newest to oldest)
    return this.searchSSTablesOnDisk(key);
  }

  private flushMemTableToSSTable(): void {
    // Extract keys and sort alphabetically
    const sortedKeys = Array.from(this.memTable.keys()).sort();
    const sstableData: { k: string; v: string }[] = [];
    const sparseIndex: SSTableIndex[] = [];

    let currentOffset = 0;
    sortedKeys.forEach((key, idx) => {
      const val = this.memTable.get(key)!;
      sstableData.push({ k: key, v: val });

      // Create sparse index entry every 4KB block boundaries
      if (idx % 16 === 0) {
        sparseIndex.push({ key, offset: currentOffset });
      }
      currentOffset += key.length + val.length;
    });

    // Write data, index blocks, and computed Bloom Filter to disk
    this.writeDiskFile({
      data: sstableData,
      index: sparseIndex,
      bloomFilter: this.generateBloomFilter(sortedKeys)
    });

    // Reset memory structures
    this.memTable.clear();
    this.walLog = [];
    this.activeMemTableBytes = 0;
  }

  private searchSSTablesOnDisk(key: string): string | null {
    const activeFiles = this.getSSTableDiskFiles();

    for (const file of activeFiles) {
      // 1. Check Bloom Filter
      if (!file.bloomFilter.test(key)) {
        continue; // Skip disk seek entirely
      }

      // 2. Scan sparse index to locate target block offset
      const blockOffset = this.binarySearchSparseIndex(file.index, key);
      const dataBlock = this.loadDiskBlock(file.path, blockOffset);

      // 3. Scan block sequentially for target match
      const result = dataBlock.find(item => item.k === key);
      if (result) return result.v;
    }

    return null;
  }

  private generateBloomFilter(keys: string[]): any {
    // Basic Bloom Filter bitmask initialization
    return {
      test: (k: string) => true // Simulated response
    };
  }

  private appendWal(entry: LogEntry) {
    this.walLog.push(entry);
  }
  private writeDiskFile(payload: any) {}
  private getSSTableDiskFiles(): any[] { return []; }
  private binarySearchSparseIndex(index: any[], key: string): number { return 0; }
  private loadDiskBlock(path: string, offset: number): any[] { return []; }
}

07.Production Compaction Challenges

Over time, writing many SSTables creates duplicate data and takes up disk space. To keep database performance stable, LSM engines run background Compaction loops.

Size-Tiered vs. Leveled Compaction

Compaction strategies balance three conflicting trade-offs: **Write Amplification**, **Read Amplification**, and **Space Amplification** (known as the RUM Conjecture).

  • Size-Tiered Compaction: Merges files of similar sizes together. This requires low write overhead but can lead to high space amplification.
  • Leveled Compaction: Organizes files into distinct capacity tiers (e.g. L1 is max 10MB, L2 is 100MB). Each level contains non-overlapping keys. This yields optimal read lookups and minimal space usage, but requires rewriting files multiple times, increasing write overhead.
EA

Ebenezer Akinseinde

I engineer highly reliable distributed consensus state machines, horizontal WebSocket relay networks, and performant AI vector databases. Let's build resilient systems together.