HLD: Chat System (WhatsApp / Slack)
Understanding the Problem
What is a Chat System?
A chat system enables real-time messaging between users -- one-on-one conversations, group chats, and potentially channels. Think WhatsApp for mobile-first encrypted messaging, or Slack for team-based communication. The core challenge is delivering messages reliably and in-order at massive scale, while handling users who go offline and come back.
Functional Requirements
Core (above the line):
- One-on-one messaging -- send and receive text messages between two users in real time
- Group messaging -- send messages to a group of up to 500 members
- Online/offline presence -- show whether a user is currently online
- Message delivery guarantees -- messages must be delivered at least once, even if the recipient is offline
Below the line (mention but don't design):
- Read receipts (delivered, read)
- Typing indicators
- Media messages (images, video, files)
- Message search
- End-to-end encryption
- Push notifications
Non-Functional Requirements
- Low latency -- message delivery within 200ms for online users
- High availability -- 99.99% uptime (< 52 minutes downtime/year)
- Message ordering -- messages within a conversation appear in correct order
- Scale -- 500M daily active users (DAU), each sending ~40 messages/day = 20B messages/day (~230K messages/sec)
- Durability -- zero message loss; every sent message must eventually be delivered
The Set Up
Core Entities
| Entity | Description |
|---|---|
| User | id, username, status (online/offline), lastSeen |
| Message | id, conversationId, senderId, content, timestamp, status (sent/delivered/read) |
| Conversation | id, type (1:1 or group), participantIds, lastMessageAt |
| Participant | conversationId, userId, joinedAt, lastReadMessageId |
API Design
Send a message:

POST /api/messages
Authorization: Bearer <token>

Request:
{
  "conversationId": "conv_abc123",
  "content": "Hey, how are you?",
  "clientMessageId": "uuid-from-client"  // idempotency key
}

Response: 201 Created
{
  "messageId": "msg_xyz789",
  "conversationId": "conv_abc123",
  "timestamp": "2024-01-15T10:30:00Z",
  "status": "sent"
}

Fetch conversation history (paginated):
GET /api/conversations/{conversationId}/messages?cursor=msg_xyz700&limit=50
Authorization: Bearer <token>

Response: 200 OK
{
  "messages": [
    {
      "messageId": "msg_xyz701",
      "senderId": "user_123",
      "content": "Hello!",
      "timestamp": "2024-01-15T10:28:00Z"
    }
  ],
  "nextCursor": "msg_xyz650",
  "hasMore": true
}

Get conversations list:
GET /api/conversations?cursor=<timestamp>&limit=20
Authorization: Bearer <token>

Response: 200 OK
{
  "conversations": [
    {
      "conversationId": "conv_abc123",
      "type": "group",
      "name": "Engineering Team",
      "lastMessage": { "content": "Ship it!", "senderId": "user_456", "timestamp": "..." },
      "unreadCount": 3
    }
  ]
}

Create a group:
POST /api/conversations
{
  "type": "group",
  "name": "Project Alpha",
  "participantIds": ["user_1", "user_2", "user_3"]
}

High-Level Design
Flow 1: Sending a 1:1 Message (Both Users Online)
[User A Client] --WebSocket--> [Chat Server A] --> [Message Queue (Kafka)]
                                                            |
                                                    [Message Router]
                                                            |
                               [Chat Server B] --WebSocket--> [User B Client]
                                      |
                          [Message Store (Cassandra)]

Step-by-step:
1. User A types a message and the client sends it over an existing WebSocket connection to Chat Server A
2. Chat Server A validates the message (auth, rate limit, content length) and assigns a server-side `messageId` with a monotonically increasing timestamp (Snowflake ID)
3. Chat Server A writes the message to the Message Store (Cassandra) with status `sent`
4. Chat Server A publishes the message to a Kafka topic partitioned by `conversationId` -- this guarantees ordering within a conversation
5. The Message Router service consumes from Kafka. It looks up User B's connection in the Session Registry (Redis) to find which Chat Server holds User B's WebSocket
6. Message Router forwards the message to Chat Server B
7. Chat Server B pushes the message to User B over WebSocket
8. User B's client sends an ACK back. Chat Server B updates the message status to `delivered` in the Message Store
9. Chat Server A sends a delivery receipt back to User A
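Steps 2-4 can be sketched in a few lines of Python. The plain dicts stand in for Redis (idempotency keys) and Cassandra (the message table), and all names here are illustrative, not a real API:

```python
import itertools
import time

# In-memory stand-ins: `dedup` plays the role of Redis SETNX keys,
# `message_store` the role of the Cassandra message table.
dedup = {}          # clientMessageId -> messageId
message_store = {}  # messageId -> stored message
_seq = itertools.count()

def next_message_id():
    # Timestamp-prefixed ID; a real system would use a Snowflake generator.
    return f"msg_{int(time.time() * 1000)}_{next(_seq)}"

def send_message(conversation_id, sender_id, content, client_message_id):
    # Idempotency: a client retry with the same clientMessageId returns
    # the original message instead of writing a duplicate.
    if client_message_id in dedup:
        return message_store[dedup[client_message_id]]
    message_id = next_message_id()
    row = {
        "messageId": message_id,
        "conversationId": conversation_id,
        "senderId": sender_id,
        "content": content,
        "status": "sent",
    }
    message_store[message_id] = row       # step 3: persist
    dedup[client_message_id] = message_id
    # step 4 would publish `row` to the Kafka topic keyed by conversation_id
    return row

first = send_message("conv_abc123", "user_A", "Hey!", "client-uuid-1")
retry = send_message("conv_abc123", "user_A", "Hey!", "client-uuid-1")
assert retry["messageId"] == first["messageId"]  # duplicate suppressed
```

In production the dedup entry would be a Redis key written with SETNX and a TTL, so retries within the retry window are cheap to detect.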
Flow 2: Sending a Message to an Offline User
- Steps 1-4 are identical
- Message Router looks up User B in Session Registry -- User B is not connected
- Message Router writes the message to an Offline Message Queue (per-user sorted set in Redis, or a dedicated table in Cassandra)
- Message Router triggers a Push Notification via APNs/FCM
- When User B comes online, their client establishes a WebSocket connection and sends a `sync` request with their `lastReceivedMessageId`
- Chat Server fetches all messages after that ID from the Offline Message Queue and delivers them in order
- Once delivered, messages are removed from the offline queue
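A minimal sketch of the reconnect sync, with a plain list standing in for the per-user Redis sorted set (ZADD on enqueue, ZRANGEBYSCORE on sync); names are illustrative:

```python
# Per-user offline inbox; a plain list stands in for the Redis sorted set.
offline_inbox = {}  # userId -> [(seq, messageId)]

def enqueue_offline(user_id, seq, message_id):
    offline_inbox.setdefault(user_id, []).append((seq, message_id))

def sync(user_id, last_received_seq):
    """Reconnect handler: return pending messages after the client's cursor,
    in order. Here they are dropped immediately; a real server removes them
    only after the client ACKs delivery."""
    pending = sorted(offline_inbox.get(user_id, []))
    offline_inbox[user_id] = []
    return [mid for seq, mid in pending if seq > last_received_seq]

enqueue_offline("user_B", 12, "msg_12")
enqueue_offline("user_B", 11, "msg_11")
enqueue_offline("user_B", 13, "msg_13")
assert sync("user_B", 11) == ["msg_12", "msg_13"]  # ordered, after the cursor
assert sync("user_B", 11) == []                    # queue drained
```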
Flow 3: Group Messaging
1. User A sends a message to `conv_group_123` (200 members)
2. Chat Server writes the message to the Message Store and publishes it to a Kafka topic partitioned by `conversationId`
3. Fan-out Service consumes the message. It fetches the participant list for the group from cache (Redis) or the DB
4. For each participant, Fan-out Service checks the Session Registry:
   - Online: route the message to the appropriate Chat Server via internal RPC
   - Offline: write to their Offline Message Queue + send a push notification
5. Fan-out happens asynchronously -- the sender gets an ACK as soon as the message is persisted (step 2), not after all recipients receive it
Flow 4: Presence / Online Status
- When a user connects via WebSocket, Chat Server registers them in the Session Registry (Redis) with a TTL of 30 seconds
- Client sends a heartbeat every 10 seconds; Chat Server refreshes the TTL in Redis
- If heartbeat stops, the key expires after 30s and the user is considered offline
- Presence updates are published to a Presence Channel (Redis Pub/Sub or a separate Kafka topic)
- For 1:1 chats: the other user subscribes to their contact's presence channel
- For groups: we do NOT fan out presence to all group members in real time (too expensive). Instead, presence is fetched on-demand when a user opens a conversation
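The heartbeat/TTL logic can be sketched like this; a dict plus explicit timestamps stands in for a Redis key with SETEX/EXPIRE, and the names are assumptions:

```python
import time
from typing import Optional

TTL_SECONDS = 30
_registry = {}  # userId -> expiry timestamp; stands in for Redis SETEX

def heartbeat(user_id: str, now: Optional[float] = None) -> None:
    # Called on connect and on every ~10s client heartbeat: refresh the TTL.
    _registry[user_id] = (time.time() if now is None else now) + TTL_SECONDS

def is_online(user_id: str, now: Optional[float] = None) -> bool:
    expiry = _registry.get(user_id)
    now = time.time() if now is None else now
    return expiry is not None and now < expiry

heartbeat("user_A", now=1000.0)
assert is_online("user_A", now=1009.0)      # heartbeat keeps the key alive
assert not is_online("user_A", now=1031.0)  # no heartbeat for >30s -> offline
```

With real Redis, no explicit `is_online` timestamp check is needed: the key simply disappears when the TTL lapses, so presence is a GET.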
Potential Deep Dives
Deep Dive 1: WebSocket vs Long Polling vs SSE
Bad Solution -- HTTP Polling: The client polls GET /messages every 2 seconds. With 500M DAU, that is up to 250M requests/sec of mostly empty responses. Wasteful bandwidth, high latency (up to 2s), and massive server load.
Good Solution -- Long Polling: Client opens a request that the server holds open until a new message arrives (or timeout after 30s). Reduces empty responses, but each "connection" is really a hanging HTTP request. You still need to re-establish after each message or timeout. Connection management is complex. Used by early Facebook Messenger.
Great Solution -- WebSocket: Full-duplex persistent connection over a single TCP socket. After the HTTP upgrade handshake, both client and server can send data at any time. This is what we use.
Trade-offs:
- WebSocket requires sticky sessions or a session registry so we know which server holds each user's connection
- Load balancers must support long-lived connections -- either L4 (TCP) pass-through or an L7 proxy that handles the WebSocket upgrade
- Connection churn: if a user's connection drops, we need reconnection logic with exponential backoff
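Reconnection logic is typically exponential backoff with full jitter, so that millions of clients do not hammer the servers in lockstep after an outage. A sketch:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base*2^n)].
    The jitter spreads a reconnection storm out over the whole window."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

delays = [backoff_delay(n) for n in range(10)]
assert all(0 <= d <= 30.0 for d in delays)  # never exceeds the cap
```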
Back-of-envelope: 500M DAU, assume 30% online at peak = 150M concurrent WebSocket connections. At ~10KB per connection, that is ~1.5TB of connection state in aggregate. Note that memory is not the binding constraint per server (50K connections is only ~500MB); file descriptor limits, heartbeat CPU, and blast radius are what cap a box at a practical ~50K connections. 150M / 50K = ~3,000 Chat Servers needed.
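The same arithmetic as a quick script (the 50K-per-server figure is a practical OS/CPU limit, not a memory limit):

```python
dau = 500_000_000
peak_fraction = 0.30
bytes_per_conn = 10_000          # ~10KB of connection state
conns_per_server = 50_000        # practical FD/CPU limit per box

concurrent = int(dau * peak_fraction)
total_tb = concurrent * bytes_per_conn / 1e12
servers = concurrent // conns_per_server

print(concurrent)   # 150000000 concurrent connections
print(total_tb)     # 1.5 TB of aggregate connection state
print(servers)      # 3000 chat servers
```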
Deep Dive 2: Message Ordering
The Problem: In a distributed system, messages can arrive out of order. User A sends "Yes" then "I'll be there at 5" -- the recipient might see them reversed.
Bad Solution -- Server wall-clock timestamps: Clock skew between servers means timestamps are unreliable. Two messages from different servers could get identical or reversed timestamps.
Good Solution -- Snowflake IDs: Generate 64-bit IDs that encode: timestamp (41 bits) + datacenter (5 bits) + machine (5 bits) + sequence (12 bits). Monotonically increasing within a single machine. Since we partition Kafka by conversationId, all messages in a conversation go through the same partition and get ordered by Kafka offset.
Great Solution -- Per-conversation sequence numbers: Each conversation maintains an atomic counter. When a message is published, it gets the next sequence number for that conversation. We use Redis INCR on a key like seq:conv_abc123. Combined with Kafka partition ordering, this guarantees total order within a conversation.
Trade-off: The Redis INCR creates a single point of serialization per conversation, but that is fine -- a single conversation rarely exceeds a few messages per second.
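A sketch of a Snowflake-style generator showing the bit layout described above. The custom epoch and field order are assumptions; production generators also handle clock rollback and sequence exhaustion within a millisecond:

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # custom epoch (2020-01-01 UTC), an assumption

class Snowflake:
    """64-bit ID: 41-bit ms timestamp | 5-bit datacenter | 5-bit machine | 12-bit seq."""

    def __init__(self, datacenter: int, machine: int):
        assert 0 <= datacenter < 32 and 0 <= machine < 32
        self.dc, self.machine = datacenter, machine
        self.seq, self.last_ms = 0, -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # same ms: bump 12-bit sequence
            else:
                self.seq, self.last_ms = 0, now
            return (now << 22) | (self.dc << 17) | (self.machine << 12) | self.seq

gen = Snowflake(datacenter=1, machine=7)
a, b = gen.next_id(), gen.next_id()
assert a < b  # monotonically increasing on a single machine
```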
Deep Dive 3: Offline Message Delivery
Bad Solution -- Store in main DB, query on reconnect: User reconnects and we do SELECT * FROM messages WHERE conversationId IN (...) AND messageId > lastReceived. With hundreds of conversations, this is an expensive fan-in query at exactly the moment the server is handling a reconnection storm (e.g., after a network outage).
Good Solution -- Per-user offline inbox: Maintain a per-user sorted set in Redis: ZADD offline:user_B <timestamp> <messageId>. On reconnect, ZRANGEBYSCORE to get all pending messages. Fast O(log N + M) reads.
Great Solution -- Tiered offline storage:
- Hot (< 1 hour offline): Redis sorted set. Fast retrieval for users who briefly lose connection.
- Cold (> 1 hour offline): spill to a Cassandra table `offline_messages (userId, messageId, conversationId, content)` partitioned by `userId`. Redis has memory limits; we cannot store weeks of messages there.
- On reconnect, fetch from Redis first, then check Cassandra if the user has been offline for a long time.
Back-of-envelope: If 20% of users are offline at any time = 100M users. Average 10 pending messages each, ~500 bytes per message = 100M * 10 * 500B = 500 GB in the offline queue. Fits in a Redis cluster but is on the edge -- tiered approach is wise.
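A sketch of the hot/cold spill and the merged read on reconnect, with dicts standing in for Redis and Cassandra (names and the background-job shape are assumptions):

```python
# Two-tier offline inbox: hot (Redis-like, recent) and cold (Cassandra-like).
HOT_WINDOW_S = 3600
hot = {}   # userId -> [(ts, messageId)]
cold = {}  # userId -> [(ts, messageId)]

def enqueue(user_id, message_id, now):
    hot.setdefault(user_id, []).append((now, message_id))

def spill(user_id, now):
    """Background job: move entries older than the hot window to cold storage."""
    entries = hot.get(user_id, [])
    aged = [e for e in entries if now - e[0] > HOT_WINDOW_S]
    hot[user_id] = [e for e in entries if now - e[0] <= HOT_WINDOW_S]
    cold.setdefault(user_id, []).extend(aged)

def fetch_pending(user_id):
    """On reconnect: merge cold and hot tiers and return in timestamp order."""
    merged = sorted(cold.pop(user_id, []) + hot.pop(user_id, []))
    return [mid for _, mid in merged]

enqueue("user_B", "msg_old", now=0.0)
enqueue("user_B", "msg_new", now=5000.0)
spill("user_B", now=5000.0)           # msg_old (>1h old) moves to cold
assert fetch_pending("user_B") == ["msg_old", "msg_new"]
```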
Deep Dive 4: Read Receipts at Scale
The Problem: When User B reads a message, we want to notify User A. In a group of 200 people, every "read" event would generate 199 notifications -- that is an O(N^2) problem.
Bad Solution -- Fan out every read receipt immediately: 200-person group, 50 messages, 200 people read them = 50 * 200 * 199 = ~2M notifications for one group conversation. Does not scale.
Good Solution -- Batch read receipts: Instead of sending individual read receipts, send periodic batched updates: "User B has read up to messageId X in conversation Y." One message covers all messages up to that point. Update only if the "read up to" marker advances.
Great Solution -- Lazy pull model for groups:
- For 1:1 chats: send read receipts in real time (it is just one notification per read event)
- For groups: store `lastReadMessageId` per participant in the DB. When User A opens the group, fetch the read status of the latest messages on demand. No fan-out at all.
- Optionally, for the message sender: send a single aggregated receipt like "12 of 50 members have read your message" every 30 seconds.
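The lazy-pull read count is then a simple scan over per-participant cursors, computed only when someone opens the conversation. The sample data and names below are made up:

```python
# Per-participant read cursors for one group conversation, as stored in the
# Participant table: a monotonically increasing message sequence per user.
last_read = {"user_1": 50, "user_2": 47, "user_3": 50, "user_4": 12}

def read_count(message_seq, sender_id):
    """Lazy pull: how many other members have read up to this message?
    Computed on demand -- no per-read fan-out to the whole group."""
    return sum(1 for uid, cursor in last_read.items()
               if uid != sender_id and cursor >= message_seq)

assert read_count(50, sender_id="user_1") == 1   # only user_3 has reached 50
assert read_count(40, sender_id="user_1") == 2   # user_2 and user_3
```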
Deep Dive 5: Group Chat Fan-out Optimization
The Problem: A message sent to a 500-member group must be delivered to all 500 recipients.
Bad Solution -- Sequential delivery: Chat Server loops through 500 members one by one. If each takes 5ms, that is 2.5 seconds for one message.
Good Solution -- Parallel fan-out with thread pool: Fan-out Service uses a thread pool of 50 workers to deliver in parallel. 500 / 50 = 10 batches * 5ms = 50ms. Much better.
Great Solution -- Segmented fan-out:
- Separate online users from offline users (check the Session Registry in batch with `MGET`)
- For online users: publish to their Chat Server's internal queue in batch (group by Chat Server to reduce network calls)
- For offline users: batch-write to their offline queues + batch-send push notifications
- For very large groups (>500 members), treat them as "channels" with a pub/sub model -- users pull messages rather than having messages pushed to them
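The segmentation step can be sketched as one pass over the participant list; the in-memory session map stands in for a batched Redis `MGET`, and the names are illustrative:

```python
from collections import defaultdict

# Session Registry stand-in: userId -> chat server holding their WebSocket
# (None = offline).
sessions = {"u1": "server-A", "u2": "server-A", "u3": "server-B", "u4": None}

def fan_out(participants):
    """Segment recipients: online users grouped by chat server (one batched
    RPC per server), offline users batched for inbox writes + push."""
    by_server = defaultdict(list)
    offline = []
    for uid in participants:
        server = sessions.get(uid)
        (by_server[server] if server else offline).append(uid)
    return dict(by_server), offline

online, offline = fan_out(["u1", "u2", "u3", "u4"])
assert online == {"server-A": ["u1", "u2"], "server-B": ["u3"]}
assert offline == ["u4"]
```

Grouping by destination server turns up to N routing calls into one call per server, which is what makes large-group delivery cheap.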
Back-of-envelope for Kafka throughput: 230K messages/sec * average 10 recipients (mix of 1:1 and groups) = 2.3M delivery operations/sec. Kafka can handle this with ~50 partitions and a small consumer group.
Deep Dive 6: End-to-End Encryption Approach
The Signal Protocol (used by WhatsApp):
- Each user generates a public/private key pair on their device. Public key is uploaded to the server.
- To start a conversation, User A fetches User B's public key from the server.
- They perform a Diffie-Hellman key exchange to derive a shared secret.
- Each message is encrypted with AES-256 using a key derived from the shared secret via a ratchet mechanism (Double Ratchet Algorithm) -- each message gets a unique key.
- The server only sees ciphertext. It cannot read messages.
Implications for our design:
- Server cannot index or search message content
- Group encryption is harder: either use pairwise encryption (sender encrypts N times for N members) or a shared group key (Sender Keys protocol)
- Key storage is on-device only -- if the user loses their phone, messages are unrecoverable (unless they have a backup, which introduces its own encryption challenges)
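To make the exchange concrete, here is a toy finite-field Diffie-Hellman with a SHA-256 hash ratchet. This only illustrates the shape of the protocol -- real implementations use X25519 and the full Double Ratchet, and the small group here is NOT secure:

```python
import hashlib

# Toy group parameters -- illustrative only, not secure.
P = 2**127 - 1   # a Mersenne prime
G = 3

a_priv, b_priv = 0xA5A5A5, 0x5A5A5A   # each device's private key
a_pub = pow(G, a_priv, P)             # public keys are uploaded to the server
b_pub = pow(G, b_priv, P)

# Each side combines its own private key with the other's public key;
# both arrive at the same shared secret without it ever crossing the wire.
shared_a = pow(b_pub, a_priv, P)
shared_b = pow(a_pub, b_priv, P)
assert shared_a == shared_b

def ratchet(chain_key):
    """Hash-ratchet step: derive (message_key, next_chain_key). A stand-in
    for the KDF chain in the Double Ratchet -- every message is encrypted
    under a fresh key, so compromising one key does not expose the rest."""
    return (hashlib.sha256(chain_key + b"msg").digest(),
            hashlib.sha256(chain_key + b"chain").digest())

chain = shared_a.to_bytes(16, "big")
k1, chain = ratchet(chain)
k2, chain = ratchet(chain)
assert k1 != k2  # per-message keys differ
```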
What is Expected at Each Level
Mid-Level
- Identify that we need WebSockets for real-time messaging
- Design basic 1:1 messaging flow with a message store
- Know that we need a way to handle offline users
- Basic understanding of message ordering challenges
Senior
- Design the full system including group messaging, presence, and offline delivery
- Explain Kafka partitioning for message ordering
- Discuss trade-offs between different real-time communication protocols
- Design the session registry and connection management layer
- Back-of-envelope calculations for connection count and message throughput
- Address read receipts and delivery guarantees
Staff+
- Design the Double Ratchet encryption model and its implications
- Optimize group fan-out with segmented delivery strategies
- Design tiered offline storage (hot/cold)
- Handle edge cases: message deduplication (idempotency keys), exactly-once delivery semantics, reconnection storms after outages
- Multi-region deployment strategy for global low-latency messaging
- Discuss operational concerns: monitoring message delivery latency P99, dead letter queues for failed deliveries, graceful degradation under load