
HLD: Google Docs (Collaborative Editor)

Frequently Asked at Salesforce SMTS - Common HLD problem in recent SMTS interviews (2024-2026).

Understanding the Problem

What is a Collaborative Editor?

A collaborative editor lets many users edit the same document concurrently, see each other's cursors in real time, and get version history with the ability to restore older states. Think Google Docs or Salesforce's own Quip. The core engineering challenges are distributed consistency (how do all clients converge on the same final state despite concurrent edits?) and presence at scale (how do you render 100k cursors without melting the network?). This problem tests CS fundamentals - distributed algorithms like Operational Transform or CRDTs - more than most system design questions.

Functional Requirements

Core (above the line):

  1. Real-time collaborative editing - multiple users editing a rich-text document concurrently, with local echo < 16 ms (feels instant).
  2. Presence - see who is in the doc and where each person's cursor is.
  3. Version history - named revisions, timeline view, restore to any point.
  4. Undo/redo, per user - a user's undo affects only their own edits, not others' concurrent work.
  5. Offline editing with later sync - client buffers ops while offline and reconciles on reconnect.
  6. Permissions - viewer / commenter / editor at the doc and folder level.

Below the line (out of scope):

  • In-doc chat, tasks, mentions - separate feature tied to notifications.
  • Native mobile offline editors - the backend is shared but building the mobile client is its own project.
  • Comment threading UI - the data model is simple but the UI is out of scope.
  • End-to-end encryption - breaks server-side search and OT; a separate design.

Non-Functional Requirements

Core:

  • Scale: 1B docs total, 100k concurrent editors across the system, 3 collaborators on average per active doc, peak 50 editors in a hot doc.
  • Latency: local echo < 16 ms; remote-op application p99 < 200 ms across regions.
  • Consistency: eventual convergence is mandatory - all clients must end in the same state. Strong read-your-writes for the author of any change.
  • Durability: no lost ops. Every keystroke must be persisted before it disappears from the client's buffer.
  • Multi-tenancy: docs are scoped to an org. Sharing outside the org requires explicit permission. Presence bubbles are visible only to users permitted to view the doc.

Below the line:

  • Byzantine fault tolerance - we assume the server is trusted.
  • CRDTs for full offline-first (we pick OT, justify below).

Capacity Estimation

  • 100k concurrent editors × ~1 op/s average keystroke rate = 100k ops/s at peak.
  • Average op payload on the wire: 50 bytes (insert-char or delete-range with metadata) → 5 MB/s steady network throughput on the op plane.
  • Snapshots every 1000 ops or 60 seconds: ~5 GB/day of snapshots across all docs.
  • WebSocket connections: 100k concurrent → roughly 10 edge nodes at 10k connections each.
  • Op log storage: 100k ops/s × 86,400 s × 100 bytes average with metadata = ~860 GB/day raw; retain 30 days hot = ~26 TB. (The arithmetic is worked through in the sketch below.)
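
The same arithmetic in code form (a sanity-check sketch; the constants are the assumptions listed above, not measurements):

typescript
// Back-of-envelope capacity math; the constants mirror the assumptions above.
const concurrentEditors = 100_000;      // peak concurrent editors system-wide
const opsPerEditorPerSec = 1;           // average keystroke rate
const wireBytesPerOp = 50;              // insert/delete + metadata on the wire
const storedBytesPerOp = 100;           // op + metadata at rest
const connectionsPerEdgeNode = 10_000;

const opsPerSec = concurrentEditors * opsPerEditorPerSec;              // 100,000 ops/s
const opPlaneMBps = (opsPerSec * wireBytesPerOp) / 1e6;                // 5 MB/s
const opLogGBPerDay = (opsPerSec * storedBytesPerOp * 86_400) / 1e9;   // ~864 GB/day
const opLogTBFor30Days = (opLogGBPerDay * 30) / 1_000;                 // ~26 TB hot
const edgeNodes = concurrentEditors / connectionsPerEdgeNode;          // ~10 WS edge nodes

console.log({ opsPerSec, opPlaneMBps, opLogGBPerDay, opLogTBFor30Days, edgeNodes });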

The Set Up

Core Entities

  • Document - docId, orgId, ownerUserId, title, createdAt.
  • Revision - monotonically increasing revId, snapshot of doc state at that rev.
  • Operation - opId, docId, authorUserId, baseRev (what rev the client had when it generated this op), payload (insert / delete / format), createdAt.
  • Presence - ephemeral: docId, userId, cursorPos, selectionRange, lastHeartbeat.
  • Permission - docId, principalId (user or group), role (viewer / commenter / editor).

The API

HTTP/REST for CRUD and auth. WebSocket for the live editing channel - we need full-duplex with sub-100 ms latency, and long polling would cost too much.

Doc lifecycle (REST):

POST /v1/orgs/{orgId}/docs
GET  /v1/orgs/{orgId}/docs/{docId}
GET  /v1/orgs/{orgId}/docs/{docId}/revisions
POST /v1/orgs/{orgId}/docs/{docId}/permissions

Live session (WebSocket):

WS  /v1/orgs/{orgId}/docs/{docId}/session

client β†’ server
{ "type": "op", "baseRev": 120, "opId": "...", "op": {...} }
{ "type": "presence", "cursor": { "line": 4, "ch": 12 } }

server β†’ client
{ "type": "ack", "opId": "...", "serverRev": 121 }
{ "type": "op", "serverRev": 122, "authorUserId": "u_...", "op": {...} }
{ "type": "presence", "snapshot": [...] }

On connect, the client sends its last known serverRev; the server streams down any ops from there forward, then switches to live mode.
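
A minimal client-side sketch of that handshake and the op send path, assuming the message shapes above. The "resume" message name, the rev bookkeeping, and the omitted apply/rebase helpers are illustrative, not a real SDK:

typescript
// Hypothetical client session: resync from the last known serverRev, then go live.
type ServerMsg =
  | { type: "ack"; opId: string; serverRev: number }
  | { type: "op"; serverRev: number; authorUserId: string; op: unknown }
  | { type: "presence"; snapshot: unknown[] };

class DocSession {
  private serverRev = 0;                                  // last rev applied locally
  private pending: { opId: string; baseRev: number; op: unknown }[] = [];

  constructor(private ws: WebSocket) {}

  connect(lastKnownRev: number) {
    this.serverRev = lastKnownRev;
    this.ws.onopen = () => {
      // Ask the server to replay ops with rev > lastKnownRev, then switch to live mode.
      // The "resume" message name is an assumption; the doc only specifies the behavior.
      this.ws.send(JSON.stringify({ type: "resume", fromRev: lastKnownRev }));
    };
    this.ws.onmessage = (e) => this.handle(JSON.parse(e.data) as ServerMsg);
  }

  sendOp(op: unknown) {
    const msg = { opId: crypto.randomUUID(), baseRev: this.serverRev, op };
    this.pending.push(msg);                               // keep until acked, for retransmit
    this.ws.send(JSON.stringify({ type: "op", ...msg }));
  }

  private handle(msg: ServerMsg) {
    if (msg.type === "ack") {
      this.serverRev = msg.serverRev;
      this.pending = this.pending.filter((p) => p.opId !== msg.opId);
    } else if (msg.type === "op") {
      this.serverRev = msg.serverRev;
      // Applying the remote op and rebasing this.pending is editor-specific; omitted here.
    }
  }
}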


High-Level Design

Architecture

                ┌──────────────────────────────┐
  WS clients ──▶│  Edge WS Fleet (sticky by    │
                │  docId → same node)          │
                └──────────────┬───────────────┘
                               ▼
                 ┌─────────────────────┐        ┌───────────────┐
                 │  Doc Session Svc    │◀──────▶│ Redis Cluster │
                 │  (OT engine per doc)│ pres.  │ (presence TTL)│
                 └──────────┬──────────┘        └───────────────┘
                            │ ops
                            ▼
                 ┌─────────────────────┐        ┌─────────────────┐
                 │ Op Log (Kafka /     │───────▶│ Snapshotter /   │
                 │ append-only DB)     │        │ Materializer    │
                 └──────────┬──────────┘        └────────┬────────┘
                            ▼                            ▼
                 ┌─────────────────────┐        ┌─────────────────┐
                 │ Doc Storage (Blob + │        │ Search Index    │
                 │ op log in Postgres) │        │ (ES per org)    │
                 └─────────────────────┘        └─────────────────┘

End-to-end flow: a keystroke

  1. User A types a character. The editor generates an op locally: { type: "insert", pos: 42, char: "x", baseRev: 120 }.
  2. Local echo: the client applies the op to its local state immediately - this is what makes it feel instant (< 16 ms).
  3. Client sends the op over the existing WebSocket to the Edge WS node assigned to this doc.
  4. The Edge WS Fleet routes the WebSocket to a specific Doc Session Service pod based on a consistent hash of docId. This pod is the single writer for this doc - ordering decisions happen here.
  5. Doc Session Service checks op.baseRev. If baseRev == currentServerRev, the op is already in order. If baseRev < currentServerRev, the engine runs the OT transform function to rewrite the op against all ops with rev > baseRev (see the ingest sketch after this list).
  6. The transformed op is assigned serverRev = currentServerRev + 1, appended to the Op Log (Kafka topic doc.ops.{shard}), and broadcast to all other clients connected to this doc.
  7. User A gets an { type: "ack", serverRev: 121 }. Other clients receive { type: "op", serverRev: 121, op: transformedOp } and apply it to their local state.
  8. Asynchronously, the Snapshotter consumes the op log and every 1000 ops (or 60s) writes a compact snapshot to S3 as orgs/{orgId}/docs/{docId}/snapshot-{rev}.json.zst.
  9. On doc load, clients fetch the most recent snapshot plus any ops after that snapshot's rev, and fold them to compute current state.
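
A sketch of steps 4-7 as the Doc Session Service sees them. The transform, op-log, and broadcast helpers are stubs standing in for the real OT engine and infrastructure clients:

typescript
// Single-writer ingest path for one doc (illustrative).
interface Op { type: "insert" | "delete" | "format"; pos: number; [k: string]: unknown }
interface ClientOp { opId: string; baseRev: number; op: Op }

class DocSessionEngine {
  private currentServerRev = 0;
  private ops: Op[] = [];                    // ops[i] has serverRev i + 1 (no compaction here)

  async ingest(clientOp: ClientOp): Promise<number> {
    let op = clientOp.op;
    // Rebase against every op the client had not yet seen (serverRev > baseRev).
    for (const newer of this.ops.slice(clientOp.baseRev)) {
      op = transform(op, newer);
    }
    const serverRev = ++this.currentServerRev;
    this.ops.push(op);
    await appendToOpLog(serverRev, op);      // durable append (Kafka topic / append-only table)
    broadcastToOthers(serverRev, op);        // fan out to the other clients in this doc
    return serverRev;                        // used in the { type: "ack" } reply
  }
}

// Stubs standing in for the real OT matrix and infrastructure clients.
function transform(a: Op, b: Op): Op { /* real matrix covers insert/delete/format pairs */ return a; }
async function appendToOpLog(rev: number, op: Op): Promise<void> { /* Kafka / Postgres append */ }
function broadcastToOthers(rev: number, op: Op): void { /* WebSocket fan-out via the edge */ }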

End-to-end flow: presence

  1. Client sends { type: "presence", cursor: {...} } on every cursor move (debounced to 10 Hz).
  2. Edge node writes to Redis: SET pres:{docId}:{userId} <cursor> EX 10.
  3. Presence broadcaster (separate worker or Redis pub/sub) aggregates presence snapshots every 100 ms and pushes a consolidated { type: "presence", snapshot: [{userId, cursor}, ...] } to all clients in the doc.
  4. If no heartbeat for 10 s, the TTL expires and the user drops off the presence list. (The write-and-aggregate loop is sketched below.)
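
A sketch of that write-and-aggregate loop. The PresenceKV interface stands in for Redis (SET with EX plus a key scan); wiring it to a real client is assumed:

typescript
// Presence plane sketch: TTL'd cursor writes plus a 100 ms aggregated broadcast.
interface PresenceKV {
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
  scanPrefix(prefix: string): Promise<Record<string, string>>;  // pres:{docId}:* -> cursor JSON
}

function onPresenceMessage(kv: PresenceKV, docId: string, userId: string, cursor: object) {
  // Heartbeat and cursor in one write; the key silently expires if the client goes away.
  return kv.set(`pres:${docId}:${userId}`, JSON.stringify(cursor), 10);
}

function startPresenceBroadcaster(kv: PresenceKV, docId: string, fanOut: (msg: object) => void) {
  return setInterval(async () => {
    const entries = await kv.scanPrefix(`pres:${docId}:`);
    const snapshot = Object.entries(entries).map(([key, cursor]) => ({
      userId: key.split(":")[2],
      cursor: JSON.parse(cursor),
    }));
    fanOut({ type: "presence", snapshot });   // one consolidated message per 100 ms tick
  }, 100);
}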

Data model

  • documents (Postgres, L1 shared DB / shared schema): (org_id, doc_id) PK, title, owner_user_id, created_at.
  • doc_ops (append-only): (org_id, doc_id, server_rev) PK, indexed by (doc_id, server_rev). Holds the canonical op log.
  • Snapshots in S3 as orgs/{orgId}/docs/{docId}/snapshot-{rev}.json.zst.
  • Presence in Redis: pres:{docId}:{userId} → cursor, EXPIRE 10s. Ephemeral; not persisted.
  • Permissions: (doc_id, principal_id, role) with index on principal_id for "what docs can this user see".

Multi-Tenancy Strategy

Isolation level: L1 shared DB + shared schema at the SQL layer, with org_id as the first column of every PK and every index. Partition documents and doc_ops by HASH(org_id) across 64 shards.

Tenant context flow:

  • Every WebSocket URL carries /orgs/{orgId}/docs/{docId}. The edge layer validates that the JWT's orgId claim matches the URL segment - prevents lateral movement if an attacker guesses a docId.
  • Routing: Edge WS → Doc Session uses consistent hashing on docId. The Doc Session pod reads orgId from the session context and stamps it on every log line, metric, and Kafka message.
  • Every DB query is scoped with WHERE org_id = :ctx.orgId by middleware (see the sketch below).
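
A sketch of the tenant-context guard described above. The claim names and the db stub are assumptions; the point is the two checks - the URL orgId must equal the JWT orgId, and every query is forced through an org-scoped helper:

typescript
// Tenant-context guard sketch (illustrative, not a real middleware stack).
interface TenantContext { orgId: string; userId: string }

function authorizeSession(urlOrgId: string, claims: { orgId: string; sub: string }): TenantContext {
  if (claims.orgId !== urlOrgId) {
    throw new Error("403: JWT org does not match URL org");  // blocks lateral movement
  }
  return { orgId: claims.orgId, userId: claims.sub };
}

const db = {  // stand-in for a real Postgres client
  async query(sql: string, params: unknown[]): Promise<unknown[]> { return []; },
};

// Convention: every statement's WHERE clause starts with org_id = $1, and the
// helper, not the caller, supplies the tenant id.
function scopedQuery(ctx: TenantContext, sql: string, params: unknown[] = []) {
  if (!/org_id\s*=\s*\$1/.test(sql)) throw new Error("query missing org_id filter");
  return db.query(sql, [ctx.orgId, ...params]);
}

// Usage: scopedQuery(ctx, "SELECT * FROM documents WHERE org_id = $1 AND doc_id = $2", [docId]);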

Noisy-neighbor mitigations:

  • Heavy-doc isolation: any doc with > 50 concurrent editors gets promoted to a dedicated session shard with its own op-rate budget and thread. One hot legal contract with 200 people in it cannot slow down a neighbor's small doc.
  • Per-org caps: max concurrent docs open per org (e.g., Essentials: 500; Enterprise: 50k). Prevents a scripted tenant from spawning millions of sessions to exhaust our edge fleet.
  • Per-org WebSocket connection quotas at the edge. Enforced via a Redis counter on the connection handshake.
  • Shuffle sharding of doc sessions across pods: each org maps to k=8 of n=500 session pods, so a runaway tenant affects at most 1.6% of capacity (see the sketch below).
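
A sketch of that shuffle-shard mapping, assuming a stable string hash (the hash shown is illustrative):

typescript
// Shuffle-shard sketch: deterministically map each org onto k of n session pods,
// so one runaway tenant touches at most k/n (8/500 = 1.6%) of the fleet.
function hash32(s: string): number {
  let h = 2166136261;                        // FNV-1a style; any stable hash works
  for (let i = 0; i < s.length; i++) h = Math.imul(h ^ s.charCodeAt(i), 16777619) >>> 0;
  return h;
}

function shuffleShard(orgId: string, totalPods = 500, podsPerOrg = 8): number[] {
  const pods: number[] = [];
  let seed = orgId;
  while (pods.length < podsPerOrg) {
    seed = `${seed}#`;                       // vary the input to draw more candidates
    const pod = hash32(seed) % totalPods;
    if (!pods.includes(pod)) pods.push(pod); // keep distinct pods only
  }
  return pods;
}

// A given doc then pins to one pod within its org's shard set.
function podForDoc(orgId: string, docId: string): number {
  const shard = shuffleShard(orgId);
  return shard[hash32(docId) % shard.length];
}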

Per-tenant observability:

  • Prometheus metrics labeled org_id: active sessions, ops/s, presence broadcast rate, op-log lag.
  • Per-org session-lifetime histogram to catch anomalies (e.g., org X suddenly has 10x longer-lived sessions - bot?).
  • Cost allocation: ops/s × pod hours attributed back to orgId.

Potential Deep Dives

1) OT vs CRDT - how do we guarantee convergence?

Bad Solution: Last-write-wins.

  • Approach: On conflict, take whichever op arrived at the server latest.
  • Challenges: Loses concurrent work. User A types "abc" at position 0 and User B types "xyz" at position 0 concurrently; LWW keeps only one op, and the other user's edit is silently discarded.

Good Solution: Operational Transform (OT).

  • Approach: Central coordinator per doc (the Doc Session Service). Ops are small ("insert char at index", "delete N chars at index", "format range with attr"). When two concurrent ops A and B meet at the server, the engine computes A' = T(A, B) and B' = T(B, A) such that applying A then B' produces the same state as applying B then A'. Clients rebase their buffered ops on any new server op. Wire size stays small (~50 bytes/op).
  • Challenges: Correctness depends on the transform function satisfying TP1 (transformation property 1: for any state, applying A then T(B, A) yields the same result as applying B then T(A, B)). TP1 for insert/delete pairs is nontrivial; the test matrix is {insert, delete, format} × {insert, delete, format} = 9 cases, and it is easy to introduce bugs. OT also requires a single writer per doc - scaling doc edits beyond one machine per doc means sharding.

Great Solution: OT with rigorous transform matrix + acknowledged CRDT tradeoff.

  • Approach:
    • We pick OT (matching Google Docs, Quip, ShareDB). Document the transform function formally for every pair in {insert, delete, format} × {insert, delete, format}. Unit test each matrix cell with property-based tests - generate random concurrent op pairs and assert convergence (see the test sketch after this list).
    • Single writer per doc via consistent hashing on docId → a specific Session pod. This serialization is fine - a single doc rarely exceeds a few ops/sec.
    • For multi-region: route users to the closest edge, but funnel writes to the doc's home region. Accept 100-200 ms cross-region latency for remote users; use local echo to mask it.
    • Acknowledge CRDT (Yjs, Automerge) as the alternative: wins for offline-first P2P, loses on memory because tombstones bloat over time. Reasonable for short-lived shared notes, bad for long-form documents that live for years.
  • Challenges: OT correctness is still the #1 source of bugs. Every new op type requires extending the transform matrix. If the single-writer pod crashes, clients reconnect to a new pod and must resync from the last snapshot - a brief visible hiccup.
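
A sketch of the property-based convergence test for the insert/insert cell of the matrix. The transform and apply functions here stand in for the real engine; a production suite would use a generator library and cover all nine op-pair cases:

typescript
// Property-based TP1 check for concurrent inserts: A then T(B, A) must equal B then T(A, B).
type Ins = { pos: number; chars: string; clientId: number };

const apply = (doc: string, op: Ins) => doc.slice(0, op.pos) + op.chars + doc.slice(op.pos);

function transform(a: Ins, b: Ins): Ins {
  // A keeps its position when it lands before B (or wins the clientId tiebreak).
  if (a.pos < b.pos || (a.pos === b.pos && a.clientId < b.clientId)) return a;
  return { ...a, pos: a.pos + b.chars.length };
}

function randomInsert(docLen: number, clientId: number): Ins {
  return { pos: Math.floor(Math.random() * (docLen + 1)), chars: "x" + clientId, clientId };
}

for (let i = 0; i < 10_000; i++) {
  const doc = "hello world";
  const a = randomInsert(doc.length, 1);
  const b = randomInsert(doc.length, 2);
  const left = apply(apply(doc, a), transform(b, a));
  const right = apply(apply(doc, b), transform(a, b));
  if (left !== right) throw new Error(`divergence: ${JSON.stringify({ a, b, left, right })}`);
}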

2) How do we handle presence at 100k concurrent editors?

Bad Solution: Broadcast every cursor move to every peer.

  • Approach: Client sends cursor position at 60 Hz; server fans out to every other connected user.
  • Challenges: O(N^2) messages per doc. A 50-person doc with 60 Hz cursor motion = 3000 incoming messages/sec × 49 broadcasts each = 147k messages/sec for ONE doc. The system melts.

Good Solution: Rate-limit + server-side aggregation.

  • Approach: Client debounces to 10 Hz. Server aggregates into a single "presence snapshot" message every 100 ms and fans that out to subscribers.
  • Challenges: A 200-person doc still generates (10 Hz inbound per client + 10 snapshot broadcasts/s to each client) × 200 users = 4000 messages/sec in aggregate. A presence storm can starve the op channel on the same WebSocket.

Great Solution: Separate presence plane with deltas + cap.

  • Approach:
    • Presence runs on its own channel - either a separate Redis pub/sub that the Edge WS layer joins, or a dedicated Kafka topic for presence. Keeps presence storms from blocking ops.
    • Delta presence: each aggregated broadcast contains only cursors that changed since the last broadcast, not full snapshots.
    • Cap viewers for live presence: for any doc with > 200 concurrent users, only the first 200 get live cursor presence. Everyone else sees a "+N others editing" indicator with just a count.
    • Presence TTL of 10s in Redis; heartbeat every 5s. Missed heartbeats drop users cleanly.
  • Challenges: Delta tracking requires per-user state on the server (see the sketch below). Caps are a UX tradeoff - some customers complain they cannot see all 500 colleagues. Mitigated by the "+N" indicator and a "show all" fallback that fetches on click.
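
A sketch of the delta computation and the live-cursor cap, assuming the presence service keeps the last-sent cursor per user:

typescript
// Delta-presence sketch: each 100 ms tick broadcasts only cursors that changed,
// and live cursors are capped at the first 200 users (the rest are a count).
type Cursor = { line: number; ch: number };

class PresenceDelta {
  private lastSent = new Map<string, string>();   // userId -> serialized cursor

  buildDelta(current: Map<string, Cursor>, liveCap = 200) {
    const changed: { userId: string; cursor: Cursor }[] = [];
    const live = [...current.entries()].slice(0, liveCap);
    for (const [userId, cursor] of live) {
      const serialized = JSON.stringify(cursor);
      if (this.lastSent.get(userId) !== serialized) {
        changed.push({ userId, cursor });
        this.lastSent.set(userId, serialized);
      }
    }
    // Users who disappeared (TTL expiry) are sent as explicit removals.
    const removed = [...this.lastSent.keys()].filter((u) => !current.has(u));
    removed.forEach((u) => this.lastSent.delete(u));
    const othersCount = Math.max(0, current.size - liveCap);
    return { type: "presence", changed, removed, othersCount };
  }
}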

3) How do we support offline editing and reconcile on reconnect?

Bad Solution: Server wins.

  • Approach: On reconnect, discard local buffer, replace with server state.
  • Challenges: Silently loses user work. Unacceptable for any real editor.

Good Solution: Client buffer with OT rebase.

  • Approach: Client buffers ops locally with their baseRev. On reconnect, client sends buffered ops in order. Server runs OT to transform them against the ops accepted while the client was offline, assigning new serverRev values.
  • Challenges: If the offline window is long and the ops the client's buffer was based on have been compacted away, OT cannot rebase; it typically fails silently or drops ops. Also: if two offline clients come back with divergent histories, you need a three-way merge that OT alone doesn't handle cleanly.

Great Solution: Bounded offline window + three-way merge fallback.

  • Approach:
    • Bound offline window to 7 days. Within that window, the op log retains all ops; OT rebase works normally.
    • Beyond that, log compaction removes ops the client was based on. Client is forced into a three-way merge: (clientState, serverState, commonAncestorSnapshot). This runs on the server using a diff algorithm and presents unresolvable conflicts as inline "resolve" UI markers.
    • Keep an audit trail of all conflicts per doc in doc_conflicts for post-hoc review.
    • The client stores ops in IndexedDB for durability across browser restarts. Each op carries a stable clientOpId so the server can dedupe retransmits (see the sketch below).
  • Challenges: Three-way merge is imperfect for rich text (formatting conflicts are ugly). Surface unresolved conflicts transparently rather than silently dropping. Storage cost for 7-day op retention is real - mitigated by snapshot compression.
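
A sketch of the client-side buffer and dedup key, assuming an IndexedDB-backed store behind the OpStore interface:

typescript
// Offline buffer sketch: ops persist locally with a stable clientOpId; on reconnect
// they are replayed in order and the server dedupes retransmits by clientOpId.
interface BufferedOp { clientOpId: string; baseRev: number; op: unknown }
interface OpStore {                                    // stand-in for an IndexedDB wrapper
  append(op: BufferedOp): Promise<void>;
  all(): Promise<BufferedOp[]>;
  remove(clientOpId: string): Promise<void>;
}

async function bufferLocalOp(store: OpStore, baseRev: number, op: unknown) {
  const buffered = { clientOpId: crypto.randomUUID(), baseRev, op };
  await store.append(buffered);                        // survives tab close / browser restart
  return buffered;
}

async function replayOnReconnect(store: OpStore, send: (op: BufferedOp) => Promise<void>) {
  for (const op of await store.all()) {                // in original order
    await send(op);                                    // server transforms via OT and acks
  }
}

async function onAck(store: OpStore, clientOpId: string) {
  await store.remove(clientOpId);                      // drop the local copy only once durable server-side
}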

4) How do we implement version history and undo?

Bad Solution: Checkpoint-based undo.

  • Approach: Undo reverts to the previous named revision.
  • Challenges: Loses everyone else's work since that checkpoint. Completely unusable in a collaborative editor - collaborator C's careful edits should not be undone by collaborator A hitting Ctrl+Z.

Good Solution: Per-user undo stack with inverse ops.

  • Approach: Each user maintains a local stack of their own ops. Ctrl+Z generates the inverse op (delete for insert and vice versa) and sends it through the normal OT pipeline. The OT engine transforms the inverse op against any subsequent ops before applying it.
  • Challenges: Inverse ops interact subtly with other users' concurrent work. E.g., user A inserts "foo", user B deletes "foo", user A hits undo - the undo of the insert is now a no-op. Correct behavior but confusing UX.

Great Solution: Named revisions + inverse-op undo + history as first-class data.

  • Approach:
    • Named revisions are user-facing tags pointing at specific serverRev values. Restoring a revision does NOT rewrite the op log - instead it emits a new op that diffs current state toward the target state. Full history and attribution preserved.
    • Per-user undo uses inverse ops through OT as in the Good solution.
    • History view streams ops from the op log and replays them visually ("see edits as they happened"). Optionally scrub with a timeline slider.
    • Retention: op log is hot for 30 days, then compacted against snapshots for long-term storage.
  • Challenges: Restore-by-diff produces large ops for big restores - 1 MB+ in a giant doc. Mitigate with a "replace-all" op type that snapshots the whole doc in one message. Undo correctness with 5+ concurrent users remains fiddly; document the expected behavior explicitly (inverse-op generation is sketched below).
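
A sketch of inverse-op undo, assuming delete ops record the text they removed so they can be inverted:

typescript
// Per-user undo sketch: the inverse op is computed locally, then sent through the
// same OT pipeline as any other op so it gets transformed against newer edits.
type EditOp =
  | { type: "insert"; pos: number; chars: string }
  | { type: "delete"; pos: number; chars: string };   // deletes record what they removed

function invert(op: EditOp): EditOp {
  return op.type === "insert"
    ? { type: "delete", pos: op.pos, chars: op.chars }
    : { type: "insert", pos: op.pos, chars: op.chars };
}

class UndoStack {
  private ownOps: EditOp[] = [];                      // only this user's ops, in order

  push(op: EditOp) { this.ownOps.push(op); }

  undo(send: (op: EditOp) => void) {
    const last = this.ownOps.pop();
    if (!last) return;
    // The session layer transforms this inverse against any concurrent server ops
    // before applying it, exactly like a normal local edit.
    send(invert(last));
  }
}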

5) How do we scale the single-writer-per-doc bottleneck?

Bad Solution: Distributed consensus per op.

  • Approach: Use Raft/Paxos to coordinate ops across multiple writers.
  • Challenges: Latency explodes - every op needs a quorum round trip. Incompatible with the < 200 ms p99 budget.

Good Solution: Consistent hash docId → pod.

  • Approach: Edge WS routes connections to the Session pod that owns the doc. All writes serialize through that pod.
  • Challenges: Hot doc with 200 editors concentrates load on one pod. Pod failure drops all 200 connections until failover (seconds).

Great Solution: Consistent hash + dedicated pods for hot docs + fast failover.

  • Approach:
    • Baseline: consistent hash docId → pod (see the routing sketch after this list).
    • Detect hot docs (> 50 concurrent editors) via a metrics sidecar. Promote to a dedicated pod with its own CPU/memory budget. Rebalance is seamless - clients reconnect via the edge, which reads the updated routing table.
    • Fast failover: op log is durable in Kafka. On pod crash, a replacement pod loads the latest snapshot + tails Kafka from the last rev. Clients reconnect and resume. 2-3s blip at worst.
    • Regional affinity: home region for each doc based on org's primary data residency; cross-region users route to the home region with sticky WebSockets.
  • Challenges: Hot-doc detection has lag - a doc can spike before promotion kicks in. Mitigate by proactively promoting any doc with an enterprise org owner. Failover still causes a noticeable hiccup for users; local echo masks most of it.
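
A routing sketch with a hot-doc override table. The hash and promotion trigger are illustrative; a production router would use a hash ring with virtual nodes and a replicated routing table:

typescript
// Doc-to-pod routing: hash-based baseline plus explicit overrides for promoted hot docs.
function hashDoc(docId: string): number {
  let h = 2166136261;                                  // FNV-1a style, illustrative only
  for (let i = 0; i < docId.length; i++) h = Math.imul(h ^ docId.charCodeAt(i), 16777619) >>> 0;
  return h;
}

class SessionRouter {
  // docId -> dedicated pod, populated when a doc is promoted (> 50 concurrent editors)
  private hotDocOverrides = new Map<string, string>();

  constructor(private pods: string[]) {}

  podFor(docId: string): string {
    return this.hotDocOverrides.get(docId) ?? this.pods[hashDoc(docId) % this.pods.length];
  }

  promote(docId: string, dedicatedPod: string) {
    this.hotDocOverrides.set(docId, dedicatedPod);     // edge re-reads this as clients reconnect
  }
}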

What is Expected at Each Level?

Mid-level (SMTS-junior)

Identify WebSockets as the transport. Describe basic OT at a high level (concurrent edits need rewriting). Snapshot-based storage. Basic idea that we need a session service per doc. Can be prompted on offline reconcile and presence scaling.

Senior (SMTS / LMTS)

OT correctness at the function level (describe transform matrix). Presence as a separate plane. Op log + snapshotting with compaction. Single-writer-per-doc sharding via consistent hash. Back-of-envelope for connection count and op rate.

Staff+ (PMTS)

Regional replication strategy for cross-DC low latency. Formal transform matrix with property-based testing approach. Compaction strategy with 7-day offline window. Backpressure from op log to clients. CRDT contrast. Cost model for edge WS fleet. Graceful degradation plan when the Doc Session pod is at capacity.


Salesforce-Specific Considerations

  • Direct analog: Quip (Salesforce's collaborative doc product) uses a very similar OT-style engine. If the interviewer is ex-Quip or current Quip team, lean into this.
  • CRM records contrast: in Salesforce CRM (Accounts, Opportunities), collaborative editing uses optimistic concurrency with LastModifiedDate rather than OT. Cheap for low-contention edits; worse UX for real-time. Be ready to contrast the two models.
  • Sharing model maps directly to our Permission entity: Organization → Role → Permission Set → Record Sharing Rule. Our doc permissions follow the same inheritance.
  • Governor-limit analog: cap the rate of ops per user per minute (e.g., 1000 ops/min per user) to prevent a script from spamming edits. Matches Salesforce's per-transaction and per-org limits.
  • Hyperforce data residency: doc's home region is determined by org residency setting. Cross-region clients accept WAN latency.
  • Platform Events: we can publish DocEdited__e events for other parts of the platform (workflow triggers, notifications) without coupling them to the editor.

Example snippet - OT transform for concurrent inserts

java
// Transform client op A against op B that the server has already applied,
// so that A can be applied after B while still preserving A's original intent.
public Op transformInsertAgainstInsert(Op a, Op b) {
    if (a.pos < b.pos || (a.pos == b.pos && a.tieBreakerLowerThan(b))) {
        return a;  // A lands before B (or wins the tie) - no shift needed
    }
    // B inserted at or before A's position, so A shifts right by B's length
    return new Op(a.type, a.pos + b.chars.length(), a.chars, a.clientId);
}
cpp
// Same rule as the Java version: shift A right when B inserted at or before A's position.
Op TransformInsertAgainstInsert(const Op& a, const Op& b) {
  if (a.pos < b.pos || (a.pos == b.pos && a.TieBreakerLowerThan(b))) {
    return a;
  }
  return Op{a.type, a.pos + static_cast<int>(b.chars.size()), a.chars, a.client_id};
}
typescript
// Same rule: A keeps its position when it lands before B (or wins the tie).
function transformInsertAgainstInsert(a: Op, b: Op): Op {
  if (a.pos < b.pos || (a.pos === b.pos && a.tieBreakerLowerThan(b))) {
    return a;
  }
  return { ...a, pos: a.pos + b.chars.length };
}
