
HLD: Basic Message Queue (Kafka-lite) ​

L4 scoping note. Do NOT redesign Kafka. The interviewer wants a reliable pub/sub service with durable topics, consumer offsets, at-least-once delivery, and a DLQ -- for ~100K messages/sec. That's ambitious but achievable with a handful of brokers and clearly explained design choices. Focus on the core trade-offs (push vs pull, offset storage, rebalancing, durability levels). Avoid going deep into Raft / ZooKeeper internals; handwave the coordinator and move on.


Understanding the Problem ​

What is a Message Queue? ​

A service that lets producers publish messages to named topics and consumers subscribe to those topics. The queue decouples producers from consumers in time (consumer doesn't need to be online when producer publishes) and in space (producer doesn't know who consumers are). Think Kafka, RabbitMQ, Pub/Sub, SQS. The interview value: it forces careful thinking about durability, ordering, at-least-once delivery, and stateful consumer tracking.

Functional Requirements ​

Core (above the line):

  1. Publish -- producers send messages to topics.
  2. Subscribe & consume -- consumers read messages from topics in order.
  3. Acknowledge -- consumers confirm successful processing, which advances the consumer's offset.
  4. Dead Letter Queue (DLQ) -- messages that fail processing repeatedly are diverted to a DLQ for inspection.
  5. At-least-once delivery -- no message loss. Consumers must be prepared to see duplicates.

Below the line (out of scope):

  • Exactly-once semantics end-to-end (hard; requires idempotent consumers; mention but don't design)
  • Priority queues (every message equal)
  • Message scheduling ("deliver this at 3pm tomorrow")
  • Complex routing rules / topic exchanges (keep it topic-based, not RabbitMQ exchange-style)
  • Cross-region geo-replication (single-region is fine for L4)
  • Multi-tenant isolation at the broker level

Non-Functional Requirements ​

Core:

  1. Throughput -- 100K messages/sec write, comparable read. Messages average 1 KB.
  2. Durability -- no loss once acknowledged. Tolerate single-broker failure.
  3. Availability -- 99.9% for publish and consume.
  4. Ordered delivery within a partition -- messages with the same key land in the same partition and are delivered in order.

Below the line:

  • Sub-millisecond publish latency (tens of ms is fine)
  • Global ordering across all partitions (would require funneling every message through a single sequencer, which defeats the point of partitioning; don't claim it)

L4 sanity check: 100K msgs/sec * 1 KB = 100 MB/sec of writes. A single well-tuned broker handles 30-50 MB/sec sustained, so raw throughput needs 3-5 brokers; with replication factor 3 every byte is written three times, so budget roughly 9-15 broker instances. Disk adds up too: 100 MB/sec * 86,400 sec = ~8.6 TB/day per replica. Add consumer offset storage and metadata. Definitely bigger than a toy but not exotic.


The Set Up ​

Core Entities ​

  • Topic -- Named logical stream. Has N partitions and replication factor R.
  • Partition -- An ordered, append-only log of messages. The ordering guarantee is per-partition.
  • Message -- { key, value, timestamp, offset }. Offset is the position within a partition.
  • Broker -- A node hosting one or more partition replicas.
  • Consumer Group -- A set of consumer instances sharing the work of consuming a topic. Each partition is assigned to exactly one consumer in the group at a time.
  • Offset -- The position of the next-to-read message, tracked per (topic, partition, consumerGroup).

The API ​

Publish a message:

POST /api/topics/{topicName}/produce
Content-Type: application/json

{
  "key": "user_123",
  "value": "<base64-encoded bytes>",
  "headers": { "trace-id": "..." }
}

Response: 200 OK
{
  "topic": "orders",
  "partition": 3,
  "offset": 1024723,
  "timestamp": 1714500000
}

POST because we're creating a new resource (the message). The partition is chosen by hash(key) mod numPartitions, or round-robin when no key is given -- see the sketch below.
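
As a minimal sketch of that routing rule (the class and method names here are illustrative, not part of the API above):

import java.util.concurrent.atomic.AtomicInteger;

// Sketch: producer-side partition selection. Same key -> same partition -> per-key ordering.
class Partitioner {
    private final AtomicInteger counter = new AtomicInteger();

    int choosePartition(String key, int numPartitions) {
        if (key == null) {
            // No key: round-robin to spread load evenly.
            return Math.floorMod(counter.getAndIncrement(), numPartitions);
        }
        // floorMod keeps the result non-negative even if hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}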

Consume messages:

GET /api/topics/{topicName}/consume?group=my-group&maxMessages=100&timeoutMs=5000

Response: 200 OK
{
  "messages": [
    { "partition": 3, "offset": 1024723, "key": "user_123", "value": "...", "timestamp": ... },
    ...
  ]
}

Long-polling: if no messages, wait up to timeoutMs then return empty.
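
A rough sketch of how the broker side of this long-poll could work, assuming an illustrative PartitionLog interface: return immediately when data is available, otherwise wait up to timeoutMs for a new append.

import java.util.List;

// Sketch: broker-side long-polling for the consume endpoint.
class LongPollFetcher {
    private final Object appended = new Object();    // notifyAll() on every log append

    interface PartitionLog {
        long endOffset();                             // offset one past the last message
        List<byte[]> read(long fromOffset, int maxMessages);
    }

    List<byte[]> fetch(PartitionLog log, long fromOffset, int maxMessages, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (appended) {
            while (log.endOffset() <= fromOffset) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) return List.of(); // timed out: return an empty batch
                appended.wait(remaining);             // woken by the append path
            }
        }
        return log.read(fromOffset, maxMessages);
    }
}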

Acknowledge:

POST /api/topics/{topicName}/commit
{
  "group": "my-group",
  "offsets": [
    { "partition": 3, "offset": 1024823 }
  ]
}

Response: 200 OK

Advances the group's committed offset for each partition.

Create a topic (admin):

POST /api/admin/topics
{
  "name": "orders",
  "partitions": 32,
  "replicationFactor": 3,
  "retentionMs": 604800000  // 7 days
}

High-Level Design ​

                   +--- [Broker 1: partitions 0,3,6 leaders; 1,4,7 followers]
                   |
[Producer] ---+--> [Broker 2: partitions 1,4,7 leaders; 2,5,8 followers]
              |    |
              |    +--- [Broker 3: partitions 2,5,8 leaders; 0,3,6 followers]
              |
[Coordinator service] (metadata: topics, partition->leader mapping, ISR)
              ^
              |
[Consumer group manager] (rebalance, offset commits)
              ^
              |
[Consumer] ---+

Persistent storage: append-only log files on local disk, one per partition replica
Offset storage: special internal topic `__consumer_offsets` (compacted)

Flow 1: Publish ​

  1. Producer connects to any broker and asks "who is leader for partition P of topic T?" (metadata request).
  2. Producer sends message directly to the partition leader.
  3. Leader appends the message to its local log file. Assigns next offset (monotonic).
  4. Leader replicates to follower replicas. Waits for acks=all (all in-sync replicas ack) before returning success.
  5. Leader responds to producer with (partition, offset).
  6. Producer can also request acks=1 (leader only) for lower latency at cost of durability, or acks=0 (fire-and-forget).
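
A hedged sketch of the leader's side of this flow (Log and Replica are illustrative types, not a real broker API): append locally, then hold the response until the requested durability level is met.

import java.util.List;

// Sketch: partition leader's produce path. acks=0 is purely a client-side "don't wait" mode.
class PartitionLeader {
    enum Acks { LEADER, ALL }

    interface Log { long append(byte[] key, byte[] value); }   // returns the assigned offset
    interface Replica { void awaitReplicated(long offset) throws InterruptedException; }

    private final Log log;
    private final List<Replica> inSyncFollowers;

    PartitionLeader(Log log, List<Replica> inSyncFollowers) {
        this.log = log;
        this.inSyncFollowers = inSyncFollowers;
    }

    long produce(byte[] key, byte[] value, Acks acks) throws InterruptedException {
        long offset = log.append(key, value);          // next monotonic offset in this partition
        if (acks == Acks.ALL) {
            for (Replica follower : inSyncFollowers) {
                follower.awaitReplicated(offset);       // block until every in-sync replica has it
            }
        }
        return offset;                                  // acks=1 returns right after the local append
    }
}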

Flow 2: Consume ​

  1. Consumer joins a consumer group. Coordinator assigns partitions to consumers in the group.
  2. Consumer fetches from the leader of each assigned partition: "give me messages from partition 3 starting at offset 1024723."
  3. Broker streams messages from its log file, up to maxBytes or maxMessages.
  4. Consumer processes them.
  5. Consumer periodically commits offsets by writing to a special internal topic __consumer_offsets (or a managed service). Offset is the next offset to read, i.e., one past the last successfully processed.
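
A sketch of the consume-process-commit loop against the API above; QueueClient and its methods are stand-ins, not a real client library.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: at-least-once consumer loop. A crash before the commit re-delivers the batch.
class ConsumerLoop {
    interface QueueClient {
        List<Message> consume(String topic, String group, int maxMessages, long timeoutMs);
        void commit(String topic, String group, int partition, long nextOffset);
    }
    record Message(int partition, long offset, String key, byte[] value) {}

    void run(QueueClient client, String topic, String group) {
        while (true) {
            List<Message> batch = client.consume(topic, group, 100, 5000);  // long-poll up to 5s
            Map<Integer, Long> nextOffsets = new HashMap<>();
            for (Message msg : batch) {
                process(msg);                                               // must be idempotent
                nextOffsets.put(msg.partition(), msg.offset() + 1);         // next offset to read
            }
            // Commit once per batch, per partition.
            nextOffsets.forEach((partition, next) -> client.commit(topic, group, partition, next));
        }
    }

    void process(Message msg) { /* application logic */ }
}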

Flow 3: Consumer failure + rebalance ​

  1. Consumer A holds partition 3. Its heartbeat to the coordinator stops (crash).
  2. Coordinator triggers a rebalance after sessionTimeoutMs (e.g., 30s).
  3. Remaining consumers receive new assignments. Partition 3 now goes to Consumer B.
  4. Consumer B reads its committed offset from __consumer_offsets, seeks there, and resumes consuming.
  5. Any messages Consumer A read but did not commit are re-delivered to Consumer B. Hence "at-least-once."
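
To make "new assignments" concrete, a sketch of the simplest strategy the coordinator could use -- plain round-robin over partitions; real assignors (like Kafka's sticky ones) also try to minimize partition movement:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: round-robin partition assignment within a consumer group.
class RoundRobinAssignor {
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (String c : consumers) {
            assignment.put(c, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            String owner = consumers.get(p % consumers.size());  // exactly one owner per partition
            assignment.get(owner).add(p);
        }
        return assignment;
    }
}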

Flow 4: DLQ ​

  1. Consumer processes a message, fails. Retries N times (e.g., 3) with backoff.
  2. After N failures, consumer publishes the message to a DLQ topic (e.g., orders.dlq) and commits the offset on the main topic to skip it.
  3. Operators can inspect and replay DLQ messages.

Flow 5: Retention and log compaction ​

  • Time-based retention: delete log segments older than the topic's retention (e.g., 7 days).
  • Size-based retention: keep at most N GB per partition.
  • Compaction (for special topics like __consumer_offsets): keep only the latest value per key. Old values garbage-collected.
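
A toy sketch of what compaction achieves (a real compactor works segment by segment on disk; Entry is an illustrative record type):

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: log compaction keeps only the newest value per key.
class Compaction {
    record Entry(String key, byte[] value, long offset) {}

    static List<Entry> compact(List<Entry> log) {
        Map<String, Entry> latest = new HashMap<>();
        for (Entry e : log) {
            latest.put(e.key(), e);                               // later offsets overwrite earlier ones
        }
        return latest.values().stream()
                .sorted(Comparator.comparingLong(Entry::offset))  // preserve offset order
                .toList();
    }
}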

Potential Deep Dives ​

1) Topic / partition model and ordering ​

The guarantee ​

  • Per-partition ordering: messages within a partition are strictly ordered by offset.
  • No global ordering: across partitions, there is no order.
  • Key-based routing: messages with the same key go to the same partition, so same-key messages are ordered.

Why partitions at all? ​

  • Parallelism. A single partition is a bottleneck (sequential writes, one consumer at a time within a group). N partitions = N-way parallelism for both producers and consumers.
  • 32 partitions for a 100K msgs/sec topic gives ~3K msgs/sec/partition -- comfortable.

Choosing partition count ​

  • Too few: consumers can't scale out. Max consumers in group = partition count.
  • Too many: metadata overhead, more file handles, more replication traffic.
  • Rule of thumb: partitions = max(expected peak consumers in the group, total write throughput / comfortable per-partition throughput). For 100K msgs/sec * 1 KB = 100 MB/sec with ~10 MB/sec per partition, 10-32 partitions is right (worked calculation below).
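
The rule of thumb as a worked calculation (the consumer count here is an assumed example):

// Sketch: partition count = max(consumer parallelism needed, throughput-based minimum).
class PartitionCount {
    static int recommend(int expectedPeakConsumers, double totalMBps, double perPartitionMBps) {
        int forThroughput = (int) Math.ceil(totalMBps / perPartitionMBps);
        return Math.max(expectedPeakConsumers, forThroughput);
    }

    public static void main(String[] args) {
        // 100 MB/sec total, ~10 MB/sec comfortable per partition, assume up to 32 consumers.
        System.out.println(recommend(32, 100.0, 10.0));   // -> 32, matching the example above
    }
}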

2) Durability and replication ​

Bad Solution: Single-broker, async-flush ​

  • Write to memory, flush to disk every second. If broker crashes, lose up to 1 second of messages.
  • Challenges: the hardest part of a message queue is NOT losing messages. Don't start here.

Good Solution: Replicated, acks=all ​

  • Each partition has 3 replicas on different brokers. One is leader, two followers.
  • Producer sets acks=all: leader waits until all in-sync replicas have written to their logs before acknowledging.
  • A failed follower is removed from the ISR (In-Sync Replica) set. If the ISR drops below min.insync.replicas (e.g., 2), writes fail -- better to reject than to silently lose durability (sketched after this list).
  • On broker failure, a new leader is elected from the ISR.
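
A minimal sketch of that min.insync.replicas check on the write path (the exception type and names are illustrative):

// Sketch: refuse acks=all writes when too few replicas are in sync to guarantee durability.
class IsrGuard {
    static void ensureWritable(int inSyncReplicaCount, int minInSyncReplicas) {
        if (inSyncReplicaCount < minInSyncReplicas) {
            // Rejecting the publish is better than accepting a write only one machine holds.
            throw new IllegalStateException("not enough in-sync replicas: "
                    + inSyncReplicaCount + " < " + minInSyncReplicas);
        }
    }
}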

Great Solution: Replicated + fsync tuning ​

  • acks=all guarantees the message is in the OS page cache on all replicas, but not necessarily on disk. A process crash alone doesn't lose it (the page cache survives and the OS flushes it), but a simultaneous power loss or machine failure of all 3 replicas would.
  • Options:
    • flush.messages=1: fsync after every message. Correct but slow (fsync adds 5-10 ms, destroys throughput).
    • flush.messages=10000 or flush.ms=1000: periodic fsync. Fine in practice, because replication across 3 machines on different failure domains makes simultaneous loss unlikely.
  • Kafka defaults are the pragmatic sweet spot.

L4 note: Articulate the fsync vs replication trade-off and pick replication-without-fsync as the default. Interviewers want to hear you reason about the durability spectrum.

3) Consumer offsets: where are they stored? ​

Bad Solution: Client-side storage ​

  • Consumer remembers its own offset in a local file.
  • Challenges: if the consumer is replaced (new machine), the offset state is lost, and rebalancing breaks because a partition's new owner has no way to learn the old owner's position.

Good Solution: External store (Postgres / Redis) ​

  • Offset stored keyed by (group, topic, partition).
  • Challenges: external dependency; latency per commit; manageable but adds surface area.

Great Solution: Internal topic __consumer_offsets ​

  • A special compacted topic within the queue itself. Offset commits are just messages to this topic keyed by (group, topic, partition).
  • Compaction ensures only the latest offset per key is retained. Old offsets garbage-collected.
  • Readers of offsets use the same mechanisms as regular topic reads.
  • Self-hosted, self-replicated -- no external dependency.

This is what Kafka does. Worth mentioning in an L4 interview to show you know the playbook.

4) Consumer rebalancing ​

The problem ​

Consumers join and leave the group (auto-scaling, deployments, crashes). Partitions must be reassigned.

Good Solution: Stop-the-world rebalance ​

  • Coordinator detects group membership change.
  • All consumers stop consuming, send their committed offsets, receive new assignments, resume.
  • Simple but causes brief stalls (seconds). Ok if group membership changes infrequently.

Great Solution: Cooperative / incremental rebalance ​

  • Only the minimum partitions move. If Consumer A had partitions {0,1,2} and Consumer D joins, maybe only partition 2 moves to D. Others keep consuming uninterrupted.
  • Kafka 2.4+ supports this via "cooperative sticky" assignor.
  • More complex coordinator logic. Better uptime during scaling events.

For L4, mention stop-the-world as default, incremental as an optimization.

5) At-least-once vs exactly-once ​

At-least-once (what we build) ​

  • Message may be delivered multiple times. Why? Consumer crashes after processing but before committing offset -- next consumer re-processes.
  • Consumers must be idempotent: processing the same message twice is safe.
  • Typical idempotency pattern: a dedup table keyed by message ID, written with INSERT ... ON CONFLICT DO NOTHING (or your database's equivalent) so re-deliveries are detected and skipped -- see the sketch below.
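
A hedged sketch of that dedup pattern using JDBC with Postgres-style ON CONFLICT DO NOTHING; the table and column names are made up:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: dedup table for idempotent consumption.
// Assumes: CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY);
class DedupStore {
    // Returns true if this message has not been seen before. Ideally the insert and the
    // message's side effects run in the same database transaction.
    static boolean firstTimeSeen(Connection db, String messageId) throws SQLException {
        String sql = "INSERT INTO processed_messages (message_id) VALUES (?) ON CONFLICT DO NOTHING";
        try (PreparedStatement stmt = db.prepareStatement(sql)) {
            stmt.setString(1, messageId);
            return stmt.executeUpdate() == 1;   // 0 rows => duplicate delivery, skip processing
        }
    }
}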

Exactly-once ​

  • No duplicates at all. Much harder.
  • Producer-side: idempotent producer (sequence numbers + broker-side dedup within a single session).
  • Consumer-side: consumer must atomically commit offset AND write its side-effects. Usually requires a transactional store (write the result + the offset in one transaction). Or output to the same queue with transactional messaging.
  • Kafka supports this via its transactions API (initTransactions, beginTransaction, commitTransaction).
  • End-to-end exactly-once across different systems is very hard without idempotent consumers.

L4 recommendation: Design for at-least-once. Require idempotent consumers. Mention exactly-once as a possibility but don't design it.

6) Dead Letter Queue design ​

When to DLQ ​

  • Message repeatedly fails processing (permanent error in data, not a transient issue).
  • Message is a poison pill (corrupt, unparseable).

How ​

  • Consumer keeps a retry count in message headers (or tracks externally).
  • After maxRetries, the consumer publishes to the DLQ topic (same schema, original topic as a header) and commits the main offset to unblock the partition.
  • DLQ topic has its own retention (often longer, so humans have time to investigate).
  • Tooling to replay DLQ messages back to the main topic after fixing the bug.
Consumer-side retry / DLQ handling, sketched in Java:
void handleWithDLQ(Message msg) {
    for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
        try {
            process(msg);
            consumer.commit(msg);
            return;
        } catch (TransientException e) {
            sleepWithBackoff(attempt);
        } catch (PermanentException e) {
            publishToDLQ(msg, e);
            consumer.commit(msg);
            return;
        }
    }
    // Max retries exceeded on transient errors -- treat as poison
    publishToDLQ(msg, new RetryExhaustedException());
    consumer.commit(msg);
}

7) Storage layout: the append-only log ​

  • Each partition is a directory. Inside: segment files (e.g., 1 GB each), named by base offset.
  • New messages append to the active segment. When it hits size or time threshold, close it and start a new one.
  • Sparse index per segment: every Nth message's (offset, file_position) is stored in a separate index file. Lookups binary-search the index, then scan forward from the nearest indexed entry -- see the sketch after this list.
  • Retention deletes whole segments, not individual messages. Fast.
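
A sketch of the sparse-index lookup within one segment; in-memory arrays stand in for the on-disk index file.

import java.util.Arrays;

// Sketch: sparse offset index for one segment. Every Nth message's (offset, filePosition) is
// indexed; a lookup binary-searches the index, then scans forward from that file position.
class SparseIndex {
    private final long[] indexedOffsets;   // sorted offsets of indexed messages
    private final long[] filePositions;    // byte position of each indexed message in the segment

    SparseIndex(long[] indexedOffsets, long[] filePositions) {
        this.indexedOffsets = indexedOffsets;
        this.filePositions = filePositions;
    }

    // File position to start scanning from when looking for targetOffset.
    long scanStartFor(long targetOffset) {
        int i = Arrays.binarySearch(indexedOffsets, targetOffset);
        if (i < 0) i = -i - 2;             // nearest indexed offset <= target
        return i < 0 ? 0 : filePositions[i];
    }
}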

8) Push vs pull ​

Pull (what we build, what Kafka does) ​

  • Consumer decides when to fetch. Long-polling for efficiency.
  • Pros: consumer controls its own pace. No overload on slow consumers. Easy rewind.
  • Cons: slight latency (poll interval).

Push ​

  • Broker pushes messages to registered consumers.
  • Pros: low latency.
  • Cons: slow consumers get overwhelmed. Broker must track consumer state.

Recommendation: Pull. Industry standard for this use case. RabbitMQ supports both; Kafka is pull-only; SQS is effectively pull (receive-message).

9) Throughput back-of-envelope ​

  • 100K msgs/sec * 1 KB = 100 MB/sec.
  • Replication factor 3 = 300 MB/sec across cluster.
  • A broker with 1 GbE NIC = 125 MB/sec max. With 10 GbE, 1.25 GB/sec.
  • 3-5 brokers with 10 GbE each handles this comfortably.
  • Disk: 100 MB/sec sustained is well within modern SSD (GB/sec range) and even spinning disk (150-200 MB/sec sequential).
  • Retention 7 days: 100 MB/sec * 7 days * 3 replicas = ~180 TB. Split across 5 brokers = ~36 TB each. Fits on a single machine with dense storage.

10) What NOT to build at L4 ​

  • Don't design ZooKeeper / Raft for the coordinator. Just say "coordinator state is managed by a consensus layer (ZooKeeper or KRaft)" and move on.
  • Don't propose a custom serialization format; Protobuf / Avro is standard.
  • Don't design multi-region active-active replication. Mention MirrorMaker-style async replication briefly if asked.
  • Don't implement exactly-once end-to-end unless specifically asked.

What is Expected at Each Level ​

L3 / Mid-level ​

  • Understand pub/sub as a pattern. Topics, producers, consumers.
  • Propose a basic append-only log per topic. Might miss partitions.
  • Know at-least-once exists; might struggle to explain how it's achieved.

L4 ​

  • Topic/partition model with key-based partitioning.
  • Replication for durability with leader/follower + ISR.
  • Pull-based consumers with long-polling.
  • Consumer groups, offset commits, rebalance on membership change.
  • At-least-once with idempotent consumers; exactly-once mentioned as hard.
  • DLQ pattern and retry logic.
  • Back-of-envelope on brokers, disk, and network.
  • Internal __consumer_offsets topic (bonus).

L5 / Senior ​

  • Incremental (cooperative) rebalancing.
  • Tiered storage: hot data on local SSD, cold segments offloaded to object storage.
  • Multi-region topology with async mirroring (MirrorMaker-style).
  • Producer idempotence + transactional messaging for EOS end-to-end.
  • Operational concerns: partition skew handling, broker capacity planning, lag monitoring, consumer health checks.
  • Schema registry integration and evolution.
  • Comparison with alternatives (Pulsar, Pub/Sub, Kinesis) and when to use each.
