HLD: Chat System (WhatsApp / Slack)
Understanding the Problem
What is a Chat System?
A chat system enables real-time messaging between users -- one-on-one conversations, group chats, and potentially channels. Think WhatsApp for mobile-first encrypted messaging, or Slack for team-based communication. The core challenge is delivering messages reliably and in-order at massive scale, while handling users who go offline and come back.
Functional Requirements
Core (above the line):
- One-on-one messaging -- send and receive text messages between two users in real time
- Group messaging -- send messages to a group of up to 500 members
- Online/offline presence -- show whether a user is currently online
- Message delivery guarantees -- messages must be delivered at least once, even if the recipient is offline
Below the line (mention but don't design):
- Read receipts (delivered, read)
- Typing indicators
- Media messages (images, video, files)
- Message search
- End-to-end encryption
- Push notifications
Non-Functional Requirements
- Low latency -- message delivery within 200ms for online users
- High availability -- 99.99% uptime (< 52 minutes downtime/year)
- Message ordering -- messages within a conversation appear in correct order
- Scale -- 500M daily active users (DAU), each sending ~40 messages/day = 20B messages/day (~230K messages/sec)
- Durability -- zero message loss; every sent message must eventually be delivered
The Set Up
Core Entities
| Entity | Description |
|---|---|
| User | id, username, status (online/offline), lastSeen |
| Message | id, conversationId, senderId, content, timestamp, status (sent/delivered/read) |
| Conversation | id, type (1:1 or group), participantIds, lastMessageAt |
| Participant | conversationId, userId, joinedAt, lastReadMessageId |
API Design
Send a message:

POST /api/messages
Authorization: Bearer <token>

Request:
{
  "conversationId": "conv_abc123",
  "content": "Hey, how are you?",
  "clientMessageId": "uuid-from-client"  // idempotency key
}

Response: 201 Created
{
  "messageId": "msg_xyz789",
  "conversationId": "conv_abc123",
  "timestamp": "2024-01-15T10:30:00Z",
  "status": "sent"
}

Fetch conversation history (paginated):
GET /api/conversations/{conversationId}/messages?cursor=msg_xyz700&limit=50
Authorization: Bearer <token>

Response: 200 OK
{
  "messages": [
    {
      "messageId": "msg_xyz701",
      "senderId": "user_123",
      "content": "Hello!",
      "timestamp": "2024-01-15T10:28:00Z"
    }
  ],
  "nextCursor": "msg_xyz650",
  "hasMore": true
}

Get conversations list:
GET /api/conversations?cursor=<timestamp>&limit=20
Authorization: Bearer <token>

Response: 200 OK
{
  "conversations": [
    {
      "conversationId": "conv_abc123",
      "type": "group",
      "name": "Engineering Team",
      "lastMessage": { "content": "Ship it!", "senderId": "user_456", "timestamp": "..." },
      "unreadCount": 3
    }
  ]
}

Create a group:
POST /api/conversations
{
  "type": "group",
  "name": "Project Alpha",
  "participantIds": ["user_1", "user_2", "user_3"]
}

High-Level Design
Flow 1: Sending a 1:1 Message (Both Users Online)
[User A Client] --WebSocket--> [Chat Server A] --> [Message Queue (Kafka)]
                                                            |
                                                    [Message Router]
                                                            |
                               [Chat Server B] --WebSocket--> [User B Client]
                                      |
                          [Message Store (Cassandra)]

Step-by-step:
1. User A types a message and the client sends it over an existing WebSocket connection to Chat Server A
2. Chat Server A validates the message (auth, rate limit, content length) and assigns a server-side `messageId` with a monotonically increasing timestamp (Snowflake ID)
3. Chat Server A writes the message to the Message Store (Cassandra) with status `sent`
4. Chat Server A publishes the message to a Kafka topic partitioned by `conversationId` -- this guarantees ordering within a conversation
5. The Message Router service consumes from Kafka. It looks up User B's connection in the Session Registry (Redis) to find which Chat Server holds User B's WebSocket
6. Message Router forwards the message to Chat Server B
7. Chat Server B pushes the message to User B over WebSocket
8. User B's client sends an ACK back. Chat Server B updates the message status to `delivered` in the Message Store
9. Chat Server A sends a delivery receipt back to User A
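Steps 2-4 can be sketched in a few lines of Python. The plain dicts stand in for Redis (idempotency keys) and Cassandra (the message table), and all names here are illustrative, not a real API:

```python
import itertools
import time

# In-memory stand-ins: `dedup` plays the role of Redis SETNX keys,
# `message_store` the role of the Cassandra message table.
dedup = {}          # clientMessageId -> messageId
message_store = {}  # messageId -> stored message
_seq = itertools.count()

def next_message_id():
    # Timestamp-prefixed ID; a real system would use a Snowflake generator.
    return f"msg_{int(time.time() * 1000)}_{next(_seq)}"

def send_message(conversation_id, sender_id, content, client_message_id):
    # Idempotency: a client retry with the same clientMessageId returns
    # the original message instead of writing a duplicate.
    if client_message_id in dedup:
        return message_store[dedup[client_message_id]]
    message_id = next_message_id()
    row = {
        "messageId": message_id,
        "conversationId": conversation_id,
        "senderId": sender_id,
        "content": content,
        "status": "sent",
    }
    message_store[message_id] = row       # step 3: persist
    dedup[client_message_id] = message_id
    # step 4 would publish `row` to the Kafka topic keyed by conversation_id
    return row

first = send_message("conv_abc123", "user_A", "Hey!", "client-uuid-1")
retry = send_message("conv_abc123", "user_A", "Hey!", "client-uuid-1")
assert retry["messageId"] == first["messageId"]  # duplicate suppressed
```

In production the dedup entry would be a Redis key written with SETNX and a TTL, so retries within the retry window are cheap to detect.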
Flow 2: Sending a Message to an Offline User
- Steps 1-4 are identical
- Message Router looks up User B in Session Registry -- User B is not connected
- Message Router writes the message to an Offline Message Queue (per-user sorted set in Redis, or a dedicated table in Cassandra)
- Message Router triggers a Push Notification via APNs/FCM
- When User B comes online, their client establishes a WebSocket connection and sends a `sync` request with their `lastReceivedMessageId`
- Chat Server fetches all messages after that ID from the Offline Message Queue and delivers them in order
- Once delivered, messages are removed from the offline queue
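A minimal sketch of the reconnect sync, with a plain list standing in for the per-user Redis sorted set (ZADD on enqueue, ZRANGEBYSCORE on sync); names are illustrative:

```python
# Per-user offline inbox; a plain list stands in for the Redis sorted set.
offline_inbox = {}  # userId -> [(seq, messageId)]

def enqueue_offline(user_id, seq, message_id):
    offline_inbox.setdefault(user_id, []).append((seq, message_id))

def sync(user_id, last_received_seq):
    """Reconnect handler: return pending messages after the client's cursor,
    in order. Here they are dropped immediately; a real server removes them
    only after the client ACKs delivery."""
    pending = sorted(offline_inbox.get(user_id, []))
    offline_inbox[user_id] = []
    return [mid for seq, mid in pending if seq > last_received_seq]

enqueue_offline("user_B", 12, "msg_12")
enqueue_offline("user_B", 11, "msg_11")
enqueue_offline("user_B", 13, "msg_13")
assert sync("user_B", 11) == ["msg_12", "msg_13"]  # ordered, after the cursor
assert sync("user_B", 11) == []                    # queue drained
```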
Flow 3: Group Messaging
1. User A sends a message to `conv_group_123` (200 members)
2. Chat Server writes the message to the Message Store and publishes it to a Kafka topic partitioned by `conversationId`
3. Fan-out Service consumes the message. It fetches the participant list for the group from cache (Redis) or the DB
4. For each participant, Fan-out Service checks the Session Registry:
   - Online: route the message to the appropriate Chat Server via internal RPC
   - Offline: write to their Offline Message Queue + send a push notification
5. Fan-out happens asynchronously -- the sender gets an ACK as soon as the message is persisted (step 2), not after all recipients receive it
Flow 4: Presence / Online Status
- When a user connects via WebSocket, Chat Server registers them in the Session Registry (Redis) with a TTL of 30 seconds
- Client sends a heartbeat every 10 seconds; Chat Server refreshes the TTL in Redis
- If heartbeat stops, the key expires after 30s and the user is considered offline
- Presence updates are published to a Presence Channel (Redis Pub/Sub or a separate Kafka topic)
- For 1:1 chats: the other user subscribes to their contact's presence channel
- For groups: we do NOT fan out presence to all group members in real time (too expensive). Instead, presence is fetched on-demand when a user opens a conversation
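The heartbeat/TTL logic can be sketched like this; a dict plus explicit timestamps stands in for a Redis key with SETEX/EXPIRE, and the names are assumptions:

```python
import time
from typing import Optional

TTL_SECONDS = 30
_registry = {}  # userId -> expiry timestamp; stands in for Redis SETEX

def heartbeat(user_id: str, now: Optional[float] = None) -> None:
    # Called on connect and on every ~10s client heartbeat: refresh the TTL.
    _registry[user_id] = (time.time() if now is None else now) + TTL_SECONDS

def is_online(user_id: str, now: Optional[float] = None) -> bool:
    expiry = _registry.get(user_id)
    now = time.time() if now is None else now
    return expiry is not None and now < expiry

heartbeat("user_A", now=1000.0)
assert is_online("user_A", now=1009.0)      # heartbeat keeps the key alive
assert not is_online("user_A", now=1031.0)  # no heartbeat for >30s -> offline
```

With real Redis, no explicit `is_online` timestamp check is needed: the key simply disappears when the TTL lapses, so presence is a GET.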
Potential Deep Dives
Deep Dive 1: WebSocket vs Long Polling vs SSE
Bad Solution -- HTTP Polling: The client polls GET /messages every 2 seconds. With 500M DAU, that is up to 250M requests/sec of mostly empty responses. Wasteful bandwidth, high latency (up to 2s), and massive server load.
Good Solution -- Long Polling: Client opens a request that the server holds open until a new message arrives (or timeout after 30s). Reduces empty responses, but each "connection" is really a hanging HTTP request. You still need to re-establish after each message or timeout. Connection management is complex. Used by early Facebook Messenger.
Great Solution -- WebSocket: Full-duplex persistent connection over a single TCP socket. After the HTTP upgrade handshake, both client and server can send data at any time. This is what we use.
Trade-offs:
- WebSocket requires sticky sessions or a session registry so we know which server holds each user's connection
- Load balancers must support long-lived connections -- either L4 (TCP) pass-through or an L7 proxy that handles the WebSocket upgrade
- Connection churn: if a user's connection drops, we need reconnection logic with exponential backoff
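Reconnection logic is typically exponential backoff with full jitter, so that millions of clients do not hammer the servers in lockstep after an outage. A sketch:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base*2^n)].
    The jitter spreads a reconnection storm out over the whole window."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

delays = [backoff_delay(n) for n in range(10)]
assert all(0 <= d <= 30.0 for d in delays)  # never exceeds the cap
```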
Back-of-envelope: 500M DAU, assume 30% online at peak = 150M concurrent WebSocket connections. At ~10KB per connection, that is ~1.5TB of connection state in aggregate. Note that memory is not the binding constraint per server (50K connections is only ~500MB); file descriptor limits, heartbeat CPU, and blast radius are what cap a box at a practical ~50K connections. 150M / 50K = ~3,000 Chat Servers needed.
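The same arithmetic as a quick script (the 50K-per-server figure is a practical OS/CPU limit, not a memory limit):

```python
dau = 500_000_000
peak_fraction = 0.30
bytes_per_conn = 10_000          # ~10KB of connection state
conns_per_server = 50_000        # practical FD/CPU limit per box

concurrent = int(dau * peak_fraction)
total_tb = concurrent * bytes_per_conn / 1e12
servers = concurrent // conns_per_server

print(concurrent)   # 150000000 concurrent connections
print(total_tb)     # 1.5 TB of aggregate connection state
print(servers)      # 3000 chat servers
```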
Deep Dive 2: Message Ordering
The Problem: In a distributed system, messages can arrive out of order. User A sends "Yes" then "I'll be there at 5" -- the recipient might see them reversed.
Bad Solution -- Server wall-clock timestamps: Clock skew between servers means timestamps are unreliable. Two messages from different servers could get identical or reversed timestamps.
Good Solution -- Snowflake IDs: Generate 64-bit IDs that encode: timestamp (41 bits) + datacenter (5 bits) + machine (5 bits) + sequence (12 bits). Monotonically increasing within a single machine. Since we partition Kafka by conversationId, all messages in a conversation go through the same partition and get ordered by Kafka offset.
Great Solution -- Per-conversation sequence numbers: Each conversation maintains an atomic counter. When a message is published, it gets the next sequence number for that conversation. We use Redis INCR on a key like seq:conv_abc123. Combined with Kafka partition ordering, this guarantees total order within a conversation.
Trade-off: The Redis INCR creates a single point of serialization per conversation, but that is fine -- a single conversation rarely exceeds a few messages per second.
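A sketch of a Snowflake-style generator showing the bit layout described above. The custom epoch and field order are assumptions; production generators also handle clock rollback and sequence exhaustion within a millisecond:

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # custom epoch (2020-01-01 UTC), an assumption

class Snowflake:
    """64-bit ID: 41-bit ms timestamp | 5-bit datacenter | 5-bit machine | 12-bit seq."""

    def __init__(self, datacenter: int, machine: int):
        assert 0 <= datacenter < 32 and 0 <= machine < 32
        self.dc, self.machine = datacenter, machine
        self.seq, self.last_ms = 0, -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # same ms: bump 12-bit sequence
            else:
                self.seq, self.last_ms = 0, now
            return (now << 22) | (self.dc << 17) | (self.machine << 12) | self.seq

gen = Snowflake(datacenter=1, machine=7)
a, b = gen.next_id(), gen.next_id()
assert a < b  # monotonically increasing on a single machine
```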
Deep Dive 3: Offline Message Delivery
Bad Solution -- Store in main DB, query on reconnect: User reconnects and we do SELECT * FROM messages WHERE conversationId IN (...) AND messageId > lastReceived. With hundreds of conversations, this is an expensive fan-in query at exactly the moment the server is handling a reconnection storm (e.g., after a network outage).
Good Solution -- Per-user offline inbox: Maintain a per-user sorted set in Redis: ZADD offline:user_B <timestamp> <messageId>. On reconnect, ZRANGEBYSCORE to get all pending messages. Fast O(log N + M) reads.
Great Solution -- Tiered offline storage:
- Hot (< 1 hour offline): Redis sorted set. Fast retrieval for users who briefly lose connection.
- Cold (> 1 hour offline): spill to a Cassandra table `offline_messages (userId, messageId, conversationId, content)` partitioned by `userId`. Redis has memory limits; we cannot store weeks of messages there.
- On reconnect, fetch from Redis first, then check Cassandra if the user has been offline for a long time.
Back-of-envelope: If 20% of users are offline at any time = 100M users. Average 10 pending messages each, ~500 bytes per message = 100M * 10 * 500B = 500 GB in the offline queue. Fits in a Redis cluster but is on the edge -- tiered approach is wise.
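A sketch of the hot/cold spill and the merged read on reconnect, with dicts standing in for Redis and Cassandra (names and the background-job shape are assumptions):

```python
# Two-tier offline inbox: hot (Redis-like, recent) and cold (Cassandra-like).
HOT_WINDOW_S = 3600
hot = {}   # userId -> [(ts, messageId)]
cold = {}  # userId -> [(ts, messageId)]

def enqueue(user_id, message_id, now):
    hot.setdefault(user_id, []).append((now, message_id))

def spill(user_id, now):
    """Background job: move entries older than the hot window to cold storage."""
    entries = hot.get(user_id, [])
    aged = [e for e in entries if now - e[0] > HOT_WINDOW_S]
    hot[user_id] = [e for e in entries if now - e[0] <= HOT_WINDOW_S]
    cold.setdefault(user_id, []).extend(aged)

def fetch_pending(user_id):
    """On reconnect: merge cold and hot tiers and return in timestamp order."""
    merged = sorted(cold.pop(user_id, []) + hot.pop(user_id, []))
    return [mid for _, mid in merged]

enqueue("user_B", "msg_old", now=0.0)
enqueue("user_B", "msg_new", now=5000.0)
spill("user_B", now=5000.0)           # msg_old (>1h old) moves to cold
assert fetch_pending("user_B") == ["msg_old", "msg_new"]
```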
Deep Dive 4: Read Receipts at Scale
The Problem: When User B reads a message, we want to notify User A. In a group of 200 people, every "read" event would generate 199 notifications -- that is an O(N^2) problem.
Bad Solution -- Fan out every read receipt immediately: 200-person group, 50 messages, 200 people read them = 50 * 200 * 199 = ~2M notifications for one group conversation. Does not scale.
Good Solution -- Batch read receipts: Instead of sending individual read receipts, send periodic batched updates: "User B has read up to messageId X in conversation Y." One message covers all messages up to that point. Update only if the "read up to" marker advances.
Great Solution -- Lazy pull model for groups:
- For 1:1 chats: send read receipts in real time (it is just one notification per read event)
- For groups: store `lastReadMessageId` per participant in the DB. When User A opens the group, fetch the read status of the latest messages on demand. No fan-out at all.
- Optionally, for the message sender: send a single aggregated receipt like "12 of 50 members have read your message" every 30 seconds.
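The lazy-pull read count is then a simple scan over per-participant cursors, computed only when someone opens the conversation. The sample data and names below are made up:

```python
# Per-participant read cursors for one group conversation, as stored in the
# Participant table: a monotonically increasing message sequence per user.
last_read = {"user_1": 50, "user_2": 47, "user_3": 50, "user_4": 12}

def read_count(message_seq, sender_id):
    """Lazy pull: how many other members have read up to this message?
    Computed on demand -- no per-read fan-out to the whole group."""
    return sum(1 for uid, cursor in last_read.items()
               if uid != sender_id and cursor >= message_seq)

assert read_count(50, sender_id="user_1") == 1   # only user_3 has reached 50
assert read_count(40, sender_id="user_1") == 2   # user_2 and user_3
```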
Deep Dive 5: Group Chat Fan-out Optimization
The Problem: A message sent to a 500-member group must be delivered to all 500 recipients.
Bad Solution -- Sequential delivery: Chat Server loops through 500 members one by one. If each takes 5ms, that is 2.5 seconds for one message.
Good Solution -- Parallel fan-out with thread pool: Fan-out Service uses a thread pool of 50 workers to deliver in parallel. 500 / 50 = 10 batches * 5ms = 50ms. Much better.
Great Solution -- Segmented fan-out:
- Separate online users from offline users (check the Session Registry in batch with `MGET`)
- For online users: publish to their Chat Server's internal queue in batch (group by Chat Server to reduce network calls)
- For offline users: batch-write to their offline queues + batch-send push notifications
- For very large groups (>500 members), treat them as "channels" with a pub/sub model -- users pull messages rather than having messages pushed to them
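The segmentation step can be sketched as one pass over the participant list; the in-memory session map stands in for a batched Redis `MGET`, and the names are illustrative:

```python
from collections import defaultdict

# Session Registry stand-in: userId -> chat server holding their WebSocket
# (None = offline).
sessions = {"u1": "server-A", "u2": "server-A", "u3": "server-B", "u4": None}

def fan_out(participants):
    """Segment recipients: online users grouped by chat server (one batched
    RPC per server), offline users batched for inbox writes + push."""
    by_server = defaultdict(list)
    offline = []
    for uid in participants:
        server = sessions.get(uid)
        (by_server[server] if server else offline).append(uid)
    return dict(by_server), offline

online, offline = fan_out(["u1", "u2", "u3", "u4"])
assert online == {"server-A": ["u1", "u2"], "server-B": ["u3"]}
assert offline == ["u4"]
```

Grouping by destination server turns up to N routing calls into one call per server, which is what makes large-group delivery cheap.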
Back-of-envelope for Kafka throughput: 230K messages/sec * average 10 recipients (mix of 1:1 and groups) = 2.3M delivery operations/sec. Kafka can handle this with ~50 partitions and a small consumer group.
Deep Dive 6: End-to-End Encryption Approach
The Signal Protocol (used by WhatsApp):
- Each user generates a public/private key pair on their device. Public key is uploaded to the server.
- To start a conversation, User A fetches User B's public key from the server.
- They perform a Diffie-Hellman key exchange to derive a shared secret.
- Each message is encrypted with AES-256 using a key derived from the shared secret via a ratchet mechanism (Double Ratchet Algorithm) -- each message gets a unique key.
- The server only sees ciphertext. It cannot read messages.
Implications for our design:
- Server cannot index or search message content
- Group encryption is harder: either use pairwise encryption (sender encrypts N times for N members) or a shared group key (Sender Keys protocol)
- Key storage is on-device only -- if the user loses their phone, messages are unrecoverable (unless they have a backup, which introduces its own encryption challenges)
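To make the exchange concrete, here is a toy finite-field Diffie-Hellman with a SHA-256 hash ratchet. This only illustrates the shape of the protocol -- real implementations use X25519 and the full Double Ratchet, and the small group here is NOT secure:

```python
import hashlib

# Toy group parameters -- illustrative only, not secure.
P = 2**127 - 1   # a Mersenne prime
G = 3

a_priv, b_priv = 0xA5A5A5, 0x5A5A5A   # each device's private key
a_pub = pow(G, a_priv, P)             # public keys are uploaded to the server
b_pub = pow(G, b_priv, P)

# Each side combines its own private key with the other's public key;
# both arrive at the same shared secret without it ever crossing the wire.
shared_a = pow(b_pub, a_priv, P)
shared_b = pow(a_pub, b_priv, P)
assert shared_a == shared_b

def ratchet(chain_key):
    """Hash-ratchet step: derive (message_key, next_chain_key). A stand-in
    for the KDF chain in the Double Ratchet -- every message is encrypted
    under a fresh key, so compromising one key does not expose the rest."""
    return (hashlib.sha256(chain_key + b"msg").digest(),
            hashlib.sha256(chain_key + b"chain").digest())

chain = shared_a.to_bytes(16, "big")
k1, chain = ratchet(chain)
k2, chain = ratchet(chain)
assert k1 != k2  # per-message keys differ
```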
What is Expected at Each Level
Mid-Level
- Identify that we need WebSockets for real-time messaging
- Design basic 1:1 messaging flow with a message store
- Know that we need a way to handle offline users
- Basic understanding of message ordering challenges
Senior
- Design the full system including group messaging, presence, and offline delivery
- Explain Kafka partitioning for message ordering
- Discuss trade-offs between different real-time communication protocols
- Design the session registry and connection management layer
- Back-of-envelope calculations for connection count and message throughput
- Address read receipts and delivery guarantees
Staff+
- Design the Double Ratchet encryption model and its implications
- Optimize group fan-out with segmented delivery strategies
- Design tiered offline storage (hot/cold)
- Handle edge cases: message deduplication (idempotency keys), exactly-once delivery semantics, reconnection storms after outages
- Multi-region deployment strategy for global low-latency messaging
- Discuss operational concerns: monitoring message delivery latency P99, dead letter queues for failed deliveries, graceful degradation under load