
HLD: Basic Message Queue (Kafka-lite) ​

L4 scoping note. Do NOT redesign Kafka. The interviewer wants a reliable pub/sub service with durable topics, consumer offsets, at-least-once delivery, and a DLQ -- for ~100K messages/sec. That's ambitious but achievable with a handful of brokers and clearly explained design choices. Focus on the core trade-offs (push vs pull, offset storage, rebalancing, durability levels). Avoid going deep into Raft / ZooKeeper internals; handwave the coordinator and move on.


Understanding the Problem ​

What is a Message Queue? ​

A service that lets producers publish messages to named topics and consumers subscribe to those topics. The queue decouples producers from consumers in time (consumer doesn't need to be online when producer publishes) and in space (producer doesn't know who consumers are). Think Kafka, RabbitMQ, Pub/Sub, SQS. The interview value: it forces careful thinking about durability, ordering, at-least-once delivery, and stateful consumer tracking.

Functional Requirements ​

Core (above the line):

  1. Publish -- producers send messages to topics.
  2. Subscribe & consume -- consumers read messages from topics in order.
  3. Acknowledge -- consumers confirm successful processing, which advances the consumer's offset.
  4. Dead Letter Queue (DLQ) -- messages that fail processing repeatedly are diverted to a DLQ for inspection.
  5. At-least-once delivery -- no message loss. Consumers must be prepared to see duplicates.

Below the line (out of scope):

  • Exactly-once semantics end-to-end (hard; requires idempotent consumers; mention but don't design)
  • Priority queues (every message equal)
  • Message scheduling ("deliver this at 3pm tomorrow")
  • Complex routing rules / topic exchanges (keep it topic-based, not RabbitMQ exchange-style)
  • Cross-region geo-replication (single-region is fine for L4)
  • Multi-tenant isolation at the broker level

Non-Functional Requirements ​

Core:

  1. Throughput -- 100K messages/sec write, comparable read. Messages average 1 KB.
  2. Durability -- no loss once acknowledged. Tolerate single-broker failure.
  3. Availability -- 99.9% for publish and consume.
  4. Ordered delivery within a partition -- messages with the same key land in the same partition and are delivered in order.

Below the line:

  • Sub-millisecond publish latency (tens of ms is fine)
  • Global ordering across all partitions (would require funneling every message through a single sequencer, which defeats the point of partitioning; don't claim it)

L4 sanity check: 100K msgs/sec * 1 KB = 100 MB/sec of writes. A single well-tuned broker handles 30-50 MB/sec sustained, so raw throughput needs 3-5 brokers; with replication factor 3 every byte is written three times, so budget roughly 9-15 broker instances. Disk adds up too: 100 MB/sec * 86,400 sec = ~8.6 TB/day per replica. Add consumer offset storage and metadata. Definitely bigger than a toy but not exotic.


The Set Up ​

Core Entities ​

  • Topic -- Named logical stream. Has N partitions and replication factor R.
  • Partition -- An ordered, append-only log of messages. The ordering guarantee is per-partition.
  • Message -- { key, value, timestamp, offset }. Offset is the position within a partition.
  • Broker -- A node hosting one or more partition replicas.
  • Consumer Group -- A set of consumer instances sharing the work of consuming a topic. Each partition is assigned to exactly one consumer in the group at a time.
  • Offset -- The position of the next-to-read message, tracked per (topic, partition, consumerGroup).

The API ​

Publish a message:

POST /api/topics/{topicName}/produce
Content-Type: application/json

{
  "key": "user_123",
  "value": "<base64-encoded bytes>",
  "headers": { "trace-id": "..." }
}

Response: 200 OK
{
  "topic": "orders",
  "partition": 3,
  "offset": 1024723,
  "timestamp": 1714500000
}

POST because we're creating a new resource (the message). The partition is chosen by hash(key) mod numPartitions, or round-robin when no key is given -- see the sketch below.
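
As a minimal sketch of that routing rule (the class and method names here are illustrative, not part of the API above):

import java.util.concurrent.atomic.AtomicInteger;

// Sketch: producer-side partition selection. Same key -> same partition -> per-key ordering.
class Partitioner {
    private final AtomicInteger counter = new AtomicInteger();

    int choosePartition(String key, int numPartitions) {
        if (key == null) {
            // No key: round-robin to spread load evenly.
            return Math.floorMod(counter.getAndIncrement(), numPartitions);
        }
        // floorMod keeps the result non-negative even if hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}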

Consume messages:

GET /api/topics/{topicName}/consume?group=my-group&maxMessages=100&timeoutMs=5000

Response: 200 OK
{
  "messages": [
    { "partition": 3, "offset": 1024723, "key": "user_123", "value": "...", "timestamp": ... },
    ...
  ]
}

Long-polling: if no messages, wait up to timeoutMs then return empty.
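
A rough sketch of how the broker side of this long-poll could work, assuming an illustrative PartitionLog interface: return immediately when data is available, otherwise wait up to timeoutMs for a new append.

import java.util.List;

// Sketch: broker-side long-polling for the consume endpoint.
class LongPollFetcher {
    private final Object appended = new Object();    // notifyAll() on every log append

    interface PartitionLog {
        long endOffset();                             // offset one past the last message
        List<byte[]> read(long fromOffset, int maxMessages);
    }

    List<byte[]> fetch(PartitionLog log, long fromOffset, int maxMessages, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (appended) {
            while (log.endOffset() <= fromOffset) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) return List.of(); // timed out: return an empty batch
                appended.wait(remaining);             // woken by the append path
            }
        }
        return log.read(fromOffset, maxMessages);
    }
}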

Acknowledge:

POST /api/topics/{topicName}/commit
{
  "group": "my-group",
  "offsets": [
    { "partition": 3, "offset": 1024823 }
  ]
}

Response: 200 OK

Advances the group's committed offset for each partition.

Create a topic (admin):

POST /api/admin/topics
{
  "name": "orders",
  "partitions": 32,
  "replicationFactor": 3,
  "retentionMs": 604800000  // 7 days
}

High-Level Design ​

                   +--- [Broker 1: partitions 0,3,6 leaders; 1,4,7 followers]
                   |
[Producer] ---+--> [Broker 2: partitions 1,4,7 leaders; 2,5,8 followers]
              |    |
              |    +--- [Broker 3: partitions 2,5,8 leaders; 0,3,6 followers]
              |
[Coordinator service] (metadata: topics, partition->leader mapping, ISR)
              ^
              |
[Consumer group manager] (rebalance, offset commits)
              ^
              |
[Consumer] ---+

Persistent storage: append-only log files on local disk, one per partition replica
Offset storage: special internal topic `__consumer_offsets` (compacted)

Flow 1: Publish ​

  1. Producer connects to any broker and asks "who is leader for partition P of topic T?" (metadata request).
  2. Producer sends message directly to the partition leader.
  3. Leader appends the message to its local log file. Assigns next offset (monotonic).
  4. Leader replicates to follower replicas. Waits for acks=all (all in-sync replicas ack) before returning success.
  5. Leader responds to producer with (partition, offset).
  6. Producer can also request acks=1 (leader only) for lower latency at cost of durability, or acks=0 (fire-and-forget).
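
A hedged sketch of the leader's side of this flow (Log and Replica are illustrative types, not a real broker API): append locally, then hold the response until the requested durability level is met.

import java.util.List;

// Sketch: partition leader's produce path. acks=0 is purely a client-side "don't wait" mode.
class PartitionLeader {
    enum Acks { LEADER, ALL }

    interface Log { long append(byte[] key, byte[] value); }   // returns the assigned offset
    interface Replica { void awaitReplicated(long offset) throws InterruptedException; }

    private final Log log;
    private final List<Replica> inSyncFollowers;

    PartitionLeader(Log log, List<Replica> inSyncFollowers) {
        this.log = log;
        this.inSyncFollowers = inSyncFollowers;
    }

    long produce(byte[] key, byte[] value, Acks acks) throws InterruptedException {
        long offset = log.append(key, value);          // next monotonic offset in this partition
        if (acks == Acks.ALL) {
            for (Replica follower : inSyncFollowers) {
                follower.awaitReplicated(offset);       // block until every in-sync replica has it
            }
        }
        return offset;                                  // acks=1 returns right after the local append
    }
}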

Flow 2: Consume ​

  1. Consumer joins a consumer group. Coordinator assigns partitions to consumers in the group.
  2. Consumer fetches from the leader of each assigned partition: "give me messages from partition 3 starting at offset 1024723."
  3. Broker streams messages from its log file, up to maxBytes or maxMessages.
  4. Consumer processes them.
  5. Consumer periodically commits offsets by writing to a special internal topic __consumer_offsets (or a managed service). Offset is the next offset to read, i.e., one past the last successfully processed.
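
A sketch of the consume-process-commit loop against the API above; QueueClient and its methods are stand-ins, not a real client library.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: at-least-once consumer loop. A crash before the commit re-delivers the batch.
class ConsumerLoop {
    interface QueueClient {
        List<Message> consume(String topic, String group, int maxMessages, long timeoutMs);
        void commit(String topic, String group, int partition, long nextOffset);
    }
    record Message(int partition, long offset, String key, byte[] value) {}

    void run(QueueClient client, String topic, String group) {
        while (true) {
            List<Message> batch = client.consume(topic, group, 100, 5000);  // long-poll up to 5s
            Map<Integer, Long> nextOffsets = new HashMap<>();
            for (Message msg : batch) {
                process(msg);                                               // must be idempotent
                nextOffsets.put(msg.partition(), msg.offset() + 1);         // next offset to read
            }
            // Commit once per batch, per partition.
            nextOffsets.forEach((partition, next) -> client.commit(topic, group, partition, next));
        }
    }

    void process(Message msg) { /* application logic */ }
}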

Flow 3: Consumer failure + rebalance ​

  1. Consumer A holds partition 3. Its heartbeat to the coordinator stops (crash).
  2. Coordinator triggers a rebalance after sessionTimeoutMs (e.g., 30s).
  3. Remaining consumers receive new assignments. Partition 3 now goes to Consumer B.
  4. Consumer B reads its committed offset from __consumer_offsets, seeks there, and resumes consuming.
  5. Any messages Consumer A read but did not commit are re-delivered to Consumer B. Hence "at-least-once."
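
To make "new assignments" concrete, a sketch of the simplest strategy the coordinator could use -- plain round-robin over partitions; real assignors (like Kafka's sticky ones) also try to minimize partition movement:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: round-robin partition assignment within a consumer group.
class RoundRobinAssignor {
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (String c : consumers) {
            assignment.put(c, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            String owner = consumers.get(p % consumers.size());  // exactly one owner per partition
            assignment.get(owner).add(p);
        }
        return assignment;
    }
}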

Flow 4: DLQ ​

  1. Consumer processes a message, fails. Retries N times (e.g., 3) with backoff.
  2. After N failures, consumer publishes the message to a DLQ topic (e.g., orders.dlq) and commits the offset on the main topic to skip it.
  3. Operators can inspect and replay DLQ messages.

Flow 5: Retention and log compaction ​

  • Time-based retention: delete log segments older than the topic's retention (e.g., 7 days).
  • Size-based retention: keep at most N GB per partition.
  • Compaction (for special topics like __consumer_offsets): keep only the latest value per key. Old values garbage-collected.
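
A toy sketch of what compaction achieves (a real compactor works segment by segment on disk; Entry is an illustrative record type):

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: log compaction keeps only the newest value per key.
class Compaction {
    record Entry(String key, byte[] value, long offset) {}

    static List<Entry> compact(List<Entry> log) {
        Map<String, Entry> latest = new HashMap<>();
        for (Entry e : log) {
            latest.put(e.key(), e);                               // later offsets overwrite earlier ones
        }
        return latest.values().stream()
                .sorted(Comparator.comparingLong(Entry::offset))  // preserve offset order
                .toList();
    }
}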

Potential Deep Dives ​

1) Topic / partition model and ordering ​

The guarantee ​

  • Per-partition ordering: messages within a partition are strictly ordered by offset.
  • No global ordering: across partitions, there is no order.
  • Key-based routing: messages with the same key go to the same partition, so same-key messages are ordered.

Why partitions at all? ​

  • Parallelism. A single partition is a bottleneck (sequential writes, one consumer at a time within a group). N partitions = N-way parallelism for both producers and consumers.
  • 32 partitions for a 100K msgs/sec topic gives ~3K msgs/sec/partition -- comfortable.

Choosing partition count ​

  • Too few: consumers can't scale out. Max consumers in group = partition count.
  • Too many: metadata overhead, more file handles, more replication traffic.
  • Rule of thumb: partitions = max(expected peak consumers in the group, total write throughput / comfortable per-partition throughput). For 100K msgs/sec * 1 KB = 100 MB/sec with ~10 MB/sec per partition, 10-32 partitions is right (worked calculation below).
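
The rule of thumb as a worked calculation (the consumer count here is an assumed example):

// Sketch: partition count = max(consumer parallelism needed, throughput-based minimum).
class PartitionCount {
    static int recommend(int expectedPeakConsumers, double totalMBps, double perPartitionMBps) {
        int forThroughput = (int) Math.ceil(totalMBps / perPartitionMBps);
        return Math.max(expectedPeakConsumers, forThroughput);
    }

    public static void main(String[] args) {
        // 100 MB/sec total, ~10 MB/sec comfortable per partition, assume up to 32 consumers.
        System.out.println(recommend(32, 100.0, 10.0));   // -> 32, matching the example above
    }
}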

2) Durability and replication ​

Bad Solution: Single-broker, async-flush ​

  • Write to memory, flush to disk every second. If broker crashes, lose up to 1 second of messages.
  • Challenges: the hardest part of a message queue is NOT losing messages. Don't start here.

Good Solution: Replicated, acks=all ​

  • Each partition has 3 replicas on different brokers. One is leader, two followers.
  • Producer sets acks=all: leader waits until all in-sync replicas have written to their logs before acknowledging.
  • A failed follower is removed from the ISR (In-Sync Replica) set. If the ISR drops below min.insync.replicas (e.g., 2), writes fail -- better to reject than to silently lose durability (sketched after this list).
  • On broker failure, a new leader is elected from the ISR.
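
A minimal sketch of that min.insync.replicas check on the write path (the exception type and names are illustrative):

// Sketch: refuse acks=all writes when too few replicas are in sync to guarantee durability.
class IsrGuard {
    static void ensureWritable(int inSyncReplicaCount, int minInSyncReplicas) {
        if (inSyncReplicaCount < minInSyncReplicas) {
            // Rejecting the publish is better than accepting a write only one machine holds.
            throw new IllegalStateException("not enough in-sync replicas: "
                    + inSyncReplicaCount + " < " + minInSyncReplicas);
        }
    }
}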

Great Solution: Replicated + fsync tuning ​

  • acks=all guarantees the message is in the OS page cache on all replicas, but not necessarily on disk. A process crash alone doesn't lose it (the page cache survives and the OS flushes it), but a simultaneous power loss or machine failure of all 3 replicas would.
  • Options:
    • flush.messages=1: fsync after every message. Correct but slow (fsync adds 5-10 ms, destroys throughput).
    • flush.messages=10000 or flush.ms=1000: periodic fsync. Fine in practice, because replication across 3 machines on different failure domains makes simultaneous loss unlikely.
  • Kafka defaults are the pragmatic sweet spot.

L4 note: Articulate the fsync vs replication trade-off and pick replication-without-fsync as the default. Interviewers want to hear you reason about the durability spectrum.

3) Consumer offsets: where are they stored? ​

Bad Solution: Client-side storage ​

  • Consumer remembers its own offset in a local file.
  • Challenges: if the consumer is replaced (new machine), the offset state is lost, and rebalancing breaks because a partition's new owner has no way to learn the old owner's position.

Good Solution: External store (Postgres / Redis) ​

  • Offset stored keyed by (group, topic, partition).
  • Challenges: external dependency; latency per commit; manageable but adds surface area.

Great Solution: Internal topic __consumer_offsets ​

  • A special compacted topic within the queue itself. Offset commits are just messages to this topic keyed by (group, topic, partition).
  • Compaction ensures only the latest offset per key is retained. Old offsets garbage-collected.
  • Readers of offsets use the same mechanisms as regular topic reads.
  • Self-hosted, self-replicated -- no external dependency.

This is what Kafka does. Worth mentioning in an L4 interview to show you know the playbook.

4) Consumer rebalancing ​

The problem ​

Consumers join and leave the group (auto-scaling, deployments, crashes). Partitions must be reassigned.

Good Solution: Stop-the-world rebalance ​

  • Coordinator detects group membership change.
  • All consumers stop consuming, send their committed offsets, receive new assignments, resume.
  • Simple but causes brief stalls (seconds). Ok if group membership changes infrequently.

Great Solution: Cooperative / incremental rebalance ​

  • Only the minimum partitions move. If Consumer A had partitions {0,1,2} and Consumer D joins, maybe only partition 2 moves to D. Others keep consuming uninterrupted.
  • Kafka 2.4+ supports this via "cooperative sticky" assignor.
  • More complex coordinator logic. Better uptime during scaling events.

For L4, mention stop-the-world as default, incremental as an optimization.

5) At-least-once vs exactly-once ​

At-least-once (what we build) ​

  • Message may be delivered multiple times. Why? Consumer crashes after processing but before committing offset -- next consumer re-processes.
  • Consumers must be idempotent: processing the same message twice is safe.
  • Typical idempotency pattern: a dedup table keyed by message ID, written with INSERT ... ON CONFLICT DO NOTHING (or your database's equivalent) so re-deliveries are detected and skipped -- see the sketch below.
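
A hedged sketch of that dedup pattern using JDBC with Postgres-style ON CONFLICT DO NOTHING; the table and column names are made up:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: dedup table for idempotent consumption.
// Assumes: CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY);
class DedupStore {
    // Returns true if this message has not been seen before. Ideally the insert and the
    // message's side effects run in the same database transaction.
    static boolean firstTimeSeen(Connection db, String messageId) throws SQLException {
        String sql = "INSERT INTO processed_messages (message_id) VALUES (?) ON CONFLICT DO NOTHING";
        try (PreparedStatement stmt = db.prepareStatement(sql)) {
            stmt.setString(1, messageId);
            return stmt.executeUpdate() == 1;   // 0 rows => duplicate delivery, skip processing
        }
    }
}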

Exactly-once ​

  • No duplicates at all. Much harder.
  • Producer-side: idempotent producer (sequence numbers + broker-side dedup within a single session).
  • Consumer-side: consumer must atomically commit offset AND write its side-effects. Usually requires a transactional store (write the result + the offset in one transaction). Or output to the same queue with transactional messaging.
  • Kafka supports this via its transactions API (initTransactions, beginTransaction, commitTransaction).
  • End-to-end exactly-once across different systems is very hard without idempotent consumers.

L4 recommendation: Design for at-least-once. Require idempotent consumers. Mention exactly-once as a possibility but don't design it.

6) Dead Letter Queue design ​

When to DLQ ​

  • Message repeatedly fails processing (permanent error in data, not a transient issue).
  • Message is a poison pill (corrupt, unparseable).

How ​

  • Consumer keeps a retry count in message headers (or tracks externally).
  • After maxRetries, the consumer publishes to the DLQ topic (same schema, original topic as a header) and commits the main offset to unblock the partition.
  • DLQ topic has its own retention (often longer, so humans have time to investigate).
  • Tooling to replay DLQ messages back to the main topic after fixing the bug.
Consumer-side retry / DLQ handling, sketched in Java:
void handleWithDLQ(Message msg) {
    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
        try {
            process(msg);
            consumer.commit(msg);
            return;
        } catch (TransientException e) {
            sleepWithBackoff(attempt);
        } catch (PermanentException e) {
            publishToDLQ(msg, e);
            consumer.commit(msg);
            return;
        }
    }
    // Max retries exceeded on transient errors -- treat as poison
    publishToDLQ(msg, new RetryExhaustedException());
    consumer.commit(msg);
}

7) Storage layout: the append-only log ​

  • Each partition is a directory. Inside: segment files (e.g., 1 GB each), named by base offset.
  • New messages append to the active segment. When it hits size or time threshold, close it and start a new one.
  • Sparse index per segment: every Nth message's (offset, file_position) is stored in a separate index file. Lookups binary-search the index, then scan forward from the nearest indexed entry -- see the sketch after this list.
  • Retention deletes whole segments, not individual messages. Fast.
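
A sketch of the sparse-index lookup within one segment; in-memory arrays stand in for the on-disk index file.

import java.util.Arrays;

// Sketch: sparse offset index for one segment. Every Nth message's (offset, filePosition) is
// indexed; a lookup binary-searches the index, then scans forward from that file position.
class SparseIndex {
    private final long[] indexedOffsets;   // sorted offsets of indexed messages
    private final long[] filePositions;    // byte position of each indexed message in the segment

    SparseIndex(long[] indexedOffsets, long[] filePositions) {
        this.indexedOffsets = indexedOffsets;
        this.filePositions = filePositions;
    }

    // File position to start scanning from when looking for targetOffset.
    long scanStartFor(long targetOffset) {
        int i = Arrays.binarySearch(indexedOffsets, targetOffset);
        if (i < 0) i = -i - 2;             // nearest indexed offset <= target
        return i < 0 ? 0 : filePositions[i];
    }
}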

8) Push vs pull ​

Pull (what we build, what Kafka does) ​

  • Consumer decides when to fetch. Long-polling for efficiency.
  • Pros: consumer controls its own pace. No overload on slow consumers. Easy rewind.
  • Cons: slight latency (poll interval).

Push ​

  • Broker pushes messages to registered consumers.
  • Pros: low latency.
  • Cons: slow consumers get overwhelmed. Broker must track consumer state.

Recommendation: Pull. Industry standard for this use case. RabbitMQ supports both; Kafka is pull-only; SQS is effectively pull (receive-message).

9) Throughput back-of-envelope ​

  • 100K msgs/sec * 1 KB = 100 MB/sec.
  • Replication factor 3 = 300 MB/sec across cluster.
  • A broker with 1 GbE NIC = 125 MB/sec max. With 10 GbE, 1.25 GB/sec.
  • 3-5 brokers with 10 GbE each handles this comfortably.
  • Disk: 100 MB/sec sustained is well within modern SSD (GB/sec range) and even spinning disk (150-200 MB/sec sequential).
  • Retention 7 days: 100 MB/sec * 7 days * 3 replicas = ~180 TB. Split across 5 brokers = ~36 TB each. Fits on a single machine with dense storage.

10) What NOT to build at L4 ​

  • Don't design ZooKeeper / Raft for the coordinator. Just say "coordinator state is managed by a consensus layer (ZooKeeper or KRaft)" and move on.
  • Don't propose a custom serialization format; Protobuf / Avro is standard.
  • Don't design multi-region active-active replication. Mention MirrorMaker-style async replication briefly if asked.
  • Don't implement exactly-once end-to-end unless specifically asked.

What is Expected at Each Level ​

L3 / Mid-level ​

  • Understand pub/sub as a pattern. Topics, producers, consumers.
  • Propose a basic append-only log per topic. Might miss partitions.
  • Know at-least-once exists; might struggle to explain how it's achieved.

L4 ​

  • Topic/partition model with key-based partitioning.
  • Replication for durability with leader/follower + ISR.
  • Pull-based consumers with long-polling.
  • Consumer groups, offset commits, rebalance on membership change.
  • At-least-once with idempotent consumers; exactly-once mentioned as hard.
  • DLQ pattern and retry logic.
  • Back-of-envelope on brokers, disk, and network.
  • Internal __consumer_offsets topic (bonus).

L5 / Senior ​

  • Incremental (cooperative) rebalancing.
  • Tiered storage: hot data on local SSD, cold segments offloaded to object storage.
  • Multi-region topology with async mirroring (MirrorMaker-style).
  • Producer idempotence + transactional messaging for EOS end-to-end.
  • Operational concerns: partition skew handling, broker capacity planning, lag monitoring, consumer health checks.
  • Schema registry integration and evolution.
  • Comparison with alternatives (Pulsar, Pub/Sub, Kinesis) and when to use each.
