HLD: Search Autocomplete Service
L4 scoping note. Autocomplete is deceptively deep. The interviewer wants you to reason about tries, ranking, and low-latency serving -- NOT to invent a new inverted index or a global personalization model. Keep the scope honest: 10K QPS, sub-50 ms p99, a few million unique head queries. Mention personalization as a hook for senior-level discussion; don't design a real-time ML pipeline unless they push you there.
Understanding the Problem
What is Autocomplete?
As a user types in a search box, show a ranked list of suggested completions. Google, Amazon, YouTube, and every decent app has one. The server receives prefixes ("tay", "tayl", "taylo", "taylor") and returns completions ("taylor swift", "taylor sheridan", "taylor swift eras tour"). The interview value: narrow problem, high performance bar, nice trie data structure discussion, room for a ranking deep dive.
Functional Requirements
Core (above the line):
- Prefix matching -- given a prefix string, return the top N (e.g., 10) completions.
- Ranking by popularity -- most popular matching completions ranked first.
- Recency boost -- recently trending queries rank higher (a query getting hot should move up within minutes, not days).
- Corpus update -- the set of possible completions updates as new searches happen (new queries get added, old ones decay).
Below the line (out of scope):
- Spell correction / fuzzy matching ("tyalor" -> "taylor"). Mention it; don't design it.
- Personalization (per-user history influences ranking). Acceptable to mention as an add-on.
- Voice-to-text autocomplete (different input path)
- Multi-language handling beyond treating strings as byte sequences
- Front-end details: debouncing, caching in the browser -- briefly mention, not focus
Non-Functional Requirements
Core:
- Latency -- p99 < 50 ms (end-to-end from user keystroke to suggestions rendered). Server-side budget: ~15 ms.
- Scale -- 10K QPS sustained. Peak 30K QPS.
- Freshness -- new popular queries show up within an hour or two.
- Availability -- 99.9%. Autocomplete failing is annoying but not catastrophic -- degrade gracefully (empty dropdown).
Below the line:
- Global sub-20ms latency from every region (nice, but a regional deployment is fine for L4 scope)
- Strong consistency on ranking updates (eventual is fine -- a query's popularity lagging by 10 min is OK)
L4 sanity check: with the lookup in memory, a single server can easily handle a few thousand QPS within the sub-50 ms budget, so 10K QPS means tens of servers, not thousands. Don't over-distribute.
The Set Up
Core Entities
| Entity | Description |
|---|---|
| Suggestion | A single completion: { text, score } |
| QueryLog | Raw user searches, used to compute popularity |
| Trie | In-memory index of all suggestion strings, with each node holding the top-K completions for its prefix |
The API
One read endpoint, one analytics sink (internal).
Get suggestions:
GET /api/autocomplete?q=tay&limit=10
Accept: application/json
Response: 200 OK
Cache-Control: public, max-age=60
{
  "query": "tay",
  "suggestions": [
    { "text": "taylor swift", "score": 987 },
    { "text": "taylor sheridan", "score": 654 },
    { "text": "taylor swift eras tour", "score": 512 },
    ...
  ]
}
Ingest search events (internal):
POST /internal/search-events (or directly via Kafka)
{
  "query": "taylor swift eras tour",
  "userId": "u_123",
  "timestamp": 1714500000,
  "country": "US"
}
High-Level Design
[Client] -> [CDN edge cache] -> [API gateway] -> [Autocomplete service]
                                                          |
                                                          v
                                                 [In-memory trie] (per-node)
                                                          ^
                                                          |
[Index builder job] <--- [Aggregated query stats] <--- [Kafka: search_events]
Flow 1: Serving a request
- User types tay. Client debounces (~150 ms) and sends GET /api/autocomplete?q=tay.
- Optional CDN hit for very common short prefixes. Cache for 60 seconds.
- API gateway routes to an Autocomplete Service node.
- Service looks up the prefix in its in-memory trie.
- At the trie node for tay, we stored the pre-computed top-10 completions. Return them.
- Response in < 15 ms server-side.
Flow 2: Logging a search
- User selects "taylor swift eras tour" (or types and submits).
- Search frontend emits a search_events Kafka message.
- A streaming aggregator bumps popularity[query] += 1 (possibly with time decay) -- a minimal sketch follows this list.
- Aggregated stats persist to a KV store (Redis or Bigtable) and serve as input to the index builder.
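A minimal sketch of the aggregator's core, assuming the Kafka consumer loop and the KV-store client are wired up elsewhere; the names (PopularityAggregator, onEvent, drain) are illustrative, and time decay is applied later by the ranking job (see deep dive 3):

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

class PopularityAggregator {
    // In-memory counts since the last flush; one instance per aggregator worker.
    private final ConcurrentHashMap<String, Double> counts = new ConcurrentHashMap<>();

    // Called for every search_events message.
    void onEvent(String query) {
        counts.merge(query.toLowerCase(Locale.ROOT).trim(), 1.0, Double::sum);
    }

    // Called periodically (e.g., every minute): take the accumulated deltas so the
    // caller can write them to Redis/Bigtable, then start counting from zero again.
    Map<String, Double> drain() {
        Map<String, Double> snapshot = new HashMap<>();
        for (String key : counts.keySet()) {
            Double value = counts.remove(key);            // atomic on ConcurrentHashMap
            if (value != null) snapshot.put(key, value);  // events arriving after the remove count toward the next drain
        }
        return snapshot;
    }
}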
Flow 3: Index build (offline, recurring)
- Every 15 minutes, an index builder reads the current popularity table.
- It builds a fresh trie: each node stores the top-K completions that pass through its prefix.
- Serializes the trie and publishes it to object storage (GCS / S3).
- Each Autocomplete Service node pulls the new trie and atomically swaps it in. No restart, no downtime.
Flow 4: Real-time boost (optional)
- For bursty trending (breaking news), a lightweight "fast path" in memory tracks recent queries via a count-min sketch (CMS) + min-heap and re-ranks prefixes accordingly.
- Combined score = 0.8 * historical_score + 0.2 * recent_boost (a minimal blending sketch follows this list).
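A minimal sketch of the blend, with a plain map standing in for the count-min sketch; the TrendingBooster name and the log normalization are illustrative choices, and in practice the two signals need to be scaled to comparable ranges before weighting:

import java.util.*;

class TrendingBooster {
    // Recent (last ~hour) counts. Not thread-safe; a real fast path would use a
    // concurrent sketch and evict old entries on a timer.
    private final Map<String, Long> recentCounts = new HashMap<>();

    void record(String query) {
        recentCounts.merge(query, 1L, Long::sum);
    }

    double blendedScore(String query, double historicalScore) {
        // log1p keeps a sudden burst from completely dominating the historical score.
        double recentBoost = Math.log1p(recentCounts.getOrDefault(query, 0L));
        return 0.8 * historicalScore + 0.2 * recentBoost; // weights mirror the formula above
    }
}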
Potential Deep Dives
1) Data structure: trie vs inverted index vs hash map
Bad Solution: Exact-match HashMap
- Approach: HashMap<String, List<Suggestion>>. Key by full prefix.
- Challenges: A query of length L generates L different prefix keys -- store "t", "ta", "tay", ... for every query. Memory explodes. And any prefix you didn't pre-generate keys for returns nothing.
Good Solution: Trie
- Approach: Standard prefix tree. Each node is a character, each leaf (or marked node) indicates end-of-word.
- Pros: Natural fit for prefix matching. O(L) lookup where L is prefix length (tiny).
- Cons: Traversing to find top-K at a prefix requires walking the entire subtree below it. Expensive at serving time if the subtree is large.
Great Solution: Trie with pre-computed top-K at each node
- Approach: At build time, for every trie node, walk its subtree, collect all completions, and store the top-K by score directly on the node. At query time, lookup is O(L) -- just traverse to the node and return its precomputed list.
- Memory: Each node stores up to K = 10 completions. With an average query length of ~8 characters, 2M unique head queries give at most 16M nodes (fewer in practice, since prefixes are shared). At a few hundred bytes per node (children map + top-K list), that's a few GB -- fits in memory on a single server.
- Trade-off: Build is slow (subtree scan at every node), but build is offline. Serving is fast.
import java.util.*;

class Suggestion {
    final String text;
    final double score;
    Suggestion(String text, double score) { this.text = text; this.score = score; }
}

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    List<Suggestion> topK = new ArrayList<>(); // precomputed top-K for this prefix
    Suggestion suggestion;                     // non-null iff this node ends a complete query
}

class Trie {
    TrieNode root = new TrieNode();

    void insert(String word, double score) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.suggestion = new Suggestion(word, score); // mark terminal and remember its score
    }

    // Run once after all inserts, e.g. precomputeTopK(root, 10): every node ends up
    // holding the top-K completions of its subtree.
    void precomputeTopK(TrieNode node, int k) {
        // Min-heap by score: once it holds more than k entries, evict the weakest.
        PriorityQueue<Suggestion> heap = new PriorityQueue<>(Comparator.comparingDouble((Suggestion s) -> s.score));
        collect(node, heap, k);
        node.topK = new ArrayList<>(heap);
        node.topK.sort((a, b) -> Double.compare(b.score, a.score)); // best first
        for (TrieNode child : node.children.values()) {
            precomputeTopK(child, k);
        }
    }

    // DFS: gather all terminal suggestions in this subtree, keeping only the k best.
    private void collect(TrieNode node, PriorityQueue<Suggestion> heap, int k) {
        if (node.suggestion != null) {
            heap.offer(node.suggestion);
            if (heap.size() > k) heap.poll();
        }
        for (TrieNode child : node.children.values()) {
            collect(child, heap, k);
        }
    }

    List<Suggestion> lookup(String prefix) {
        TrieNode node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return List.of();
        }
        return node.topK;
    }
}

The same node layout and lookup in C++:

#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct Suggestion { std::string text; double score; };

struct TrieNode {
    std::unordered_map<char, std::unique_ptr<TrieNode>> children;
    std::vector<Suggestion> topK;
};

// Returned by reference when the prefix is not in the trie.
static const std::vector<Suggestion> EMPTY;

const std::vector<Suggestion>& lookup(TrieNode* root, const std::string& prefix) {
    TrieNode* node = root;
    for (char c : prefix) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return EMPTY;
        node = it->second.get();
    }
    return node->topK;
}

And the same lookup in TypeScript:

interface Suggestion {
  text: string;
  score: number;
}

class TrieNode {
  children = new Map<string, TrieNode>();
  topK: Suggestion[] = [];
}

function lookup(root: TrieNode, prefix: string): Suggestion[] {
  let node: TrieNode | undefined = root;
  for (const c of prefix) {
    node = node.children.get(c);
    if (!node) return [];
  }
  return node.topK;
}
2) Keeping the trie fresh: build strategy
Bad Solution: Rebuild on every new query
- Approach: Every time a query happens, update the trie in place.
- Challenges: Concurrent modification during serving is a disaster. Locking every node destroys throughput. Also: ranking is a function of time-decayed counts over hundreds of millions of queries -- incremental updates don't easily capture that.
Good Solution: Periodic rebuild
- Approach: Every 15 minutes, an offline job:
  - Reads the latest popularity table (query -> score).
  - Builds a new trie from scratch.
  - Writes it to blob storage.
- Each Autocomplete Service node polls for a new trie, downloads it, and atomically swaps it in (double-buffer: keep the old trie alive while the new one loads, then swap a pointer) -- see the sketch below.
- Pros: No write contention during serves. Predictable memory. Simple.
- Cons: Freshness = 15 minutes. New queries can't appear in suggestions until next build.
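A minimal sketch of the pointer swap, reusing the Trie class from deep dive 1; TrieHolder is an illustrative name, and the download/deserialization of the new trie is omitted:

import java.util.concurrent.atomic.AtomicReference;

// Serving threads read the current trie lock-free; a background refresher swaps in
// a freshly loaded one. Readers see either the old or the new trie, never a half-built one.
class TrieHolder {
    private final AtomicReference<Trie> current = new AtomicReference<>(new Trie());

    Trie get() {              // hot path: called on every request, never blocks
        return current.get();
    }

    void swap(Trie fresh) {   // called only after the new trie is fully built
        current.set(fresh);   // the old trie becomes garbage once in-flight requests finish
    }
}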
Great Solution: Periodic rebuild + real-time delta
- Approach: Periodic rebuild is the "slow truth." Layer a small in-memory "delta trie" on top, holding queries observed since the last rebuild. Lookups check both and merge the results (sketched below).
- When a rebuild completes, the delta trie resets.
- Pros: Best of both: fresh within seconds for new trending queries, efficient for the 99% case.
- Cons: Slightly more code. Memory split across two structures.
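A minimal sketch of the merged read path, reusing Trie and Suggestion from deep dive 1; MergedLookup is an illustrative name:

import java.util.*;
import java.util.stream.Collectors;

class MergedLookup {
    private final Trie baseTrie;   // rebuilt every 15 minutes
    private final Trie deltaTrie;  // queries seen since that rebuild; reset on each rebuild

    MergedLookup(Trie baseTrie, Trie deltaTrie) {
        this.baseTrie = baseTrie;
        this.deltaTrie = deltaTrie;
    }

    List<Suggestion> lookup(String prefix, int k) {
        // Union both result lists; on duplicates keep the higher score; return the k best.
        Map<String, Suggestion> merged = new HashMap<>();
        for (Suggestion s : baseTrie.lookup(prefix)) merged.put(s.text, s);
        for (Suggestion s : deltaTrie.lookup(prefix)) {
            merged.merge(s.text, s, (a, b) -> a.score >= b.score ? a : b);
        }
        return merged.values().stream()
                .sorted((a, b) -> Double.compare(b.score, a.score))
                .limit(k)
                .collect(Collectors.toList());
    }
}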
For L4, "periodic rebuild every 15 min" is a complete answer. Mention the delta approach as a follow-up.
3) Ranking: how do we score a suggestion?
Bad Solution: Raw count
- Approach: score = count of query occurrences.
- Challenges: "weather" has been searched billions of times forever; a new trending query never catches up. Ranking is effectively frozen.
Good Solution: Time-decayed count
- Approach: score = sum over events of exp(-lambda * age). Recent events dominate, old events fade.
- With lambda = ln(2) / 7 days, a query's contribution halves every 7 days.
- In practice: maintain a daily score, new_score = old_score * decay_factor + today_count, computed once per day (worked example below).
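A worked example of the daily update (the numbers are made up):

class DecayExample {
    public static void main(String[] args) {
        double decayFactor = Math.exp(-Math.log(2) / 7.0); // 2^(-1/7), about 0.906 per day
        double oldScore = 1000.0;  // yesterday's decayed score
        long todayCount = 120;     // searches observed today
        double newScore = oldScore * decayFactor + todayCount;
        System.out.printf("decayFactor=%.3f newScore=%.1f%n", decayFactor, newScore); // ~1025.7
    }
}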
Great Solution: Multi-signal ranking
- Approach: Combine:
- Popularity (time-decayed count)
- Recency boost (queries trending in the last hour get a multiplier)
- Click-through rate from search results (queries that lead to successful outcomes rank higher)
- Length penalty (prefer shorter when tied)
- For Google scale, also: personalization (user's own history, locale, time of day)
- Implemented as a weighted sum, coefficients tuned offline.
final_score = 0.5 * popularity
            + 0.2 * recency_boost
            + 0.2 * ctr
            - 0.1 * length_penalty
            (+ personalization if enabled)
L4 note: Describe the multi-signal approach at the conceptual level. You don't need to derive weights. Mention that the ranking model can evolve from heuristic weights to a lightweight ML model (logistic regression, gradient-boosted tree) as the product matures.
4) Distributing the trie: do we shard?
The question
A single in-memory trie with 2M head queries is a few GB. Fits on one machine. Why shard?
Good Solution: Replicate, don't shard
- Approach: Each Autocomplete Service node holds a FULL copy of the trie. Horizontal scaling = more replicas. 10K QPS / (say) 2K QPS per node = 5 nodes. Plus redundancy = ~10.
- Pros: Simpler. No cross-node fan-out on a request. No consistency issues across shards.
- Cons: Every rebuild has to be fanned out to all replicas. Memory cost scales with node count.
Good Solution (if really needed): Shard by prefix
- Approach: Shard routing by first character (or first two). Route "tay*" to shard A, "ap*" to shard B.
- Cons: Skew -- some first letters are much more common. Every request touches exactly one shard (no fan-out), but a request for a 1-character prefix like "t" hits the busiest shard.
Recommendation for L4: replicate
At 2 GB memory per replica and 10 replicas, we're using 20 GB of RAM. Peanuts. Don't shard for scale until the trie can't fit on one machine.
5) Handling the hot-path latency budget
The server budget is ~15 ms. Where does it go?
- Network in (gateway -> service): 1-2 ms
- Trie lookup: microseconds (in-memory, O(L) where L < 20)
- Serialization: 1-2 ms
- Network out: 1-2 ms
- Headroom: 8-10 ms
Trie lookup is effectively free. The overhead is networking and framing. Optimizations:
- gRPC over HTTP/2 for intra-service calls.
- Keep connections warm (HTTP keepalive / gRPC persistent channels).
- Pre-allocate response buffers (avoid GC pressure in Java/Go).
- For very hot prefixes, a CDN edge cache can answer directly without hitting the origin. Cache-Control: public, max-age=60 is safe -- suggestions don't change that fast.
6) Cold start: new queries that no one has searched before
A user types a completely novel query. It's not in the trie. What do we do?
- Nothing: return an empty suggestion list. Client falls back to showing just the user's typed text.
- Fallback to spell-check: if "tayor" has no completions but "taylor" does, offer the corrected version. Implement via a BK-tree or Levenshtein automata (a minimal edit-distance sketch follows this list).
- Character n-grams: for rare prefixes, suggest based on n-gram neighbors. Much heavier, usually not needed.
L4: "empty result is fine; mention spell correction as a follow-up."
7) Client-side concerns (brief)
- Debouncing: don't fire a request on every keystroke. Wait 100-150 ms of no typing before firing.
- Request coalescing: if the user types faster than the server responds, cancel in-flight requests for stale prefixes.
- Client-side cache: cache recent suggestion responses in memory, keyed by prefix. A request for "tayl" can often be served locally if the user just typed "tay" and that response already included "tayl*" completions.
These belong to the frontend design, but knowing they exist shows awareness.
8) Personalization (optional, for senior discussion)
If asked: a separate personalization service maintains per-user history. For each incoming request, fetch the user's top queries, blend with the global suggestions using a re-ranking step. Cache per-user preferences in Redis, warm on login. Latency budget gets tighter -- may need co-located storage.
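If the interviewer pushes further, a minimal re-ranking step could look like the sketch below, reusing Suggestion from deep dive 1; the 1.3x boost and the PersonalizedReranker name are illustrative, not a tuned formula:

import java.util.*;
import java.util.stream.Collectors;

class PersonalizedReranker {
    // Boost global suggestions the user has searched recently, then re-sort and truncate.
    List<Suggestion> rerank(List<Suggestion> global, Set<String> userRecentQueries, int k) {
        return global.stream()
                .map(s -> userRecentQueries.contains(s.text)
                        ? new Suggestion(s.text, s.score * 1.3)
                        : s)
                .sorted((a, b) -> Double.compare(b.score, a.score))
                .limit(k)
                .collect(Collectors.toList());
    }
}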
What is Expected at Each Level
L3 / Mid-level
- Proposes a trie. Understands prefix lookup.
- Knows we need to rank by popularity.
- May miss the "precompute top-K at each node" optimization.
- Probably doesn't discuss replication strategy or delta tries.
L4
- Trie with per-node precomputed top-K -- the key optimization.
- Periodic rebuild pipeline + atomic swap-in of new trie.
- Multi-signal ranking (popularity + recency at minimum).
- Back-of-envelope: memory footprint of trie, QPS per replica, end-to-end latency breakdown.
- Replication vs sharding discussion; picks replication for this scale.
- Kafka-based event pipeline for popularity aggregation.
L5 / Senior
- Real-time delta trie for sub-minute freshness on trending terms.
- Personalization layer and its latency implications.
- Multi-region deployment with regional tries (local popularity differs -- "football" means different things in the UK vs the US).
- Spell correction and fuzzy matching integration.
- ML-driven ranking: feature engineering, online evaluation, A/B testing harness.
- Cost analysis: CDN bandwidth vs origin QPS vs ML inference cost.