08 — Concurrency: Scarcity and Resilience
Why This Is Senior Territory
The earlier concurrency files in this handbook — file 06 on correctness and file 07 on coordination — dealt with what threads do to memory. Locks, atomics, happens-before, producer/consumer. Those are the questions you can reason about on a whiteboard with a single JVM and two threads. They are necessary. They are not sufficient.
This file is about what happens when the whiteboard meets production. Your downstream is slow. Your queue is full. A retry storm from an unrelated deploy just multiplied your inbound load by 8. A connection pool leaked because somebody forgot try-with-resources. Your p99 goes from 50ms to 12 seconds and nobody can tell you why.
Scarcity and resilience are what separate "I have studied concurrency" from "I have been paged by concurrency." They are the topics the senior LLD candidate volunteers before the interviewer asks. You finish your class diagram, glance at the downstream HTTP client, and say: "I want to talk about timeouts, retries, and the circuit breaker around this dependency." That sentence alone is an SDE3 signal. Most candidates skip straight to happy-path code and wait for the probe.
The mental model
Correctness asks: "does my program compute the right answer if nothing fails?" Coordination asks: "does it compute the right answer when threads race?" Resilience asks: "does it stay up when the world is on fire?" All three matter. Only the third gets you paged at 2 a.m.
The core insight of this file is that every resource in your system is finite — connections, threads, memory, the downstream's capacity, your own CPU. Good LLD anticipates the moment a resource runs out and has a named policy for what happens next. Bad LLD discovers the answer in an outage postmortem.
Rate Limiting
Rate limiting is the first line of defense. It says: "no matter what you ask, I will not do more than N units of work per unit of time." It protects you from abusive clients, it protects downstream from your own overzealous clients, and it is a contract you can communicate cleanly to the caller via 429 Too Many Requests with a Retry-After header.
The four algorithms
Four algorithms dominate real systems. The first thing a senior candidate does is name them and pick one with a reason.
| Algorithm | How it works | Accuracy | Memory per key | Burst behavior |
|---|---|---|---|---|
| Fixed window | Counter resets at each window boundary (e.g. top of minute). Reject when counter > limit. | Low — allows 2× burst at window boundary | O(1) — counter + window start | Spiky. A client can fire 2× limit across the boundary. |
| Sliding window log | Keep timestamps of every request in a window. Reject when count in [now-W, now] exceeds limit. | Exact | O(N) per key where N = limit | Smooth, but memory-expensive. |
| Sliding window counter | Weighted sum of current + previous window counters. Cheap approximation of the log. | High — small error near boundary | O(1) — two counters | Smoother than fixed, cheaper than log. |
| Token bucket | Bucket holds up to B tokens, refilled at rate R tokens/sec. Each request consumes a token; reject (or queue) when empty. | Exact on average rate, bursty by design | O(1) — last refill time + token count | Allows bursts up to B, sustained rate R. Good for APIs that want to tolerate small spikes. |
| Leaky bucket | Bucket drains at fixed rate R. Each request adds water; reject when full. Output is perfectly smooth. | Exact | O(1) — current water level | Smooths out bursts — never emits faster than R. |
Token bucket vs leaky bucket is the single most-asked follow-up. The difference: token bucket controls the input side — it tolerates bursts up to bucket size and rejects when the bucket is empty. Leaky bucket controls the output side — it always emits at a smooth rate, queueing or dropping overflow. Most "rate limiter" products are token bucket. Most "traffic shapers" in networking are leaky bucket.
What to pick for the interview. Default to token bucket unless there's a specific reason otherwise. It's O(1), handles bursts (which is what real traffic looks like), and the math is trivial: tokens = min(B, tokens + (now - last_refill) * R).
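Here is a minimal single-process sketch of that refill math in Java. The class and method names are illustrative, not any particular library's API:

```java
// A minimal, single-JVM token bucket: capacity = B, refillRatePerSec = R from the formula above.
final class TokenBucket {
    private final double capacity;
    private final double refillRatePerSec;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double refillRatePerSec) {
        this.capacity = capacity;
        this.refillRatePerSec = refillRatePerSec;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Returns true if the request is admitted, false if it should be rejected (429).
    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        double elapsedSec = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSec * refillRatePerSec);  // the refill formula
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```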
The token bucket's state is just two numbers per key — last refill timestamp and current token count — so storing it in Redis for a distributed rate limiter is a single HGETALL + HMSET round trip, or a Lua script for atomicity. Memory per key is on the order of 32 bytes, meaning a million-user service fits its rate-limiter state in 32 MB of Redis. That's the reason it dominates real-world deployments: it scales further than the alternatives without introducing a storage layer of its own.
Where fixed window bites you. A fixed window of 100 req/min allows a client to send 100 requests at 11:59:59 and another 100 at 12:00:00 — 200 requests in two seconds while technically never violating the limit. Abusive clients deliberately synchronize to the window boundary. Sliding window counter is the cheap fix: weight the previous window's count by the fraction of it still in view. For example, if 40% of the current minute has elapsed and the previous minute saw 80 requests, the effective count for limit-checking purposes is 80 * 0.6 + current_window_count. This is O(1) memory with sub-1% error vs. the exact log.
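In code, the weighting is a one-liner. A sketch with illustrative names:

```java
// Sliding-window-counter estimate for one key.
// elapsedFraction: how much of the current window has already passed, 0.0 to 1.0.
static double effectiveCount(long previousWindowCount, long currentWindowCount, double elapsedFraction) {
    return previousWindowCount * (1.0 - elapsedFraction) + currentWindowCount;
}
// Example from above: elapsedFraction = 0.4, previousWindowCount = 80
// gives 80 * 0.6 + currentWindowCount; reject when this exceeds the limit.
```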
Per-caller vs global
The second thing to clarify: what's the key? A rate limiter with a single global counter protects the service but is trivially exploitable — one abusive client starves everyone else. You almost always want a per-caller limit. The "caller" depends on the system:
| Key | When to use | Gotcha |
|---|---|---|
| User ID | Authenticated APIs, per-user quotas | Works after auth. Pre-auth endpoints need a different key. |
| API key | B2B APIs, usage-based pricing | The API key is the unit of billing and throttling both. |
| Tenant / org ID | Multi-tenant SaaS | A single tenant with many users should share a pool. |
| IP address | Pre-auth endpoints (login, signup) | NAT and CGNAT mean many real users share one IP. Never the only key. |
| (IP, route) | Per-endpoint abuse control | Expensive endpoints (search, export) get tighter limits than cheap ones. |
A mature rate limiter usually composes several — a per-user limit and a per-IP limit and a global safety valve. The first one exceeded causes the reject.
Where to enforce
The third choice: at what layer? Rate limiting can live at the CDN, the API gateway, the service itself, or even inside a specific function. Senior take: push to the edge whenever possible.
| Location | Pros | Cons |
|---|---|---|
| CDN / edge (Cloudflare, Fastly) | Cheap — reject before your infra pays any cost. Lowest latency for rejected requests. | Coarse — usually IP-based, hard to key by user ID. |
| API gateway (Kong, Envoy, AWS API GW) | After auth — can key by user/API key. Still in front of services. One place to change policy. | Distributed counter (Redis) is now on the hot path. |
| Service layer | Full context — can key by any business concept (tenant tier, feature flag). | Every replica must agree on the counter → Redis / shared state. |
| Function / method | Precision — expensive internal operations (e.g. "generate report") can have their own limit independent of the API. | Scattered policy. Hard to audit. |
The standard production setup is edge for volumetric abuse, gateway for per-caller quotas, and service for business-specific guards. In an LLD interview, pick one for the sketch and mention the others exist.
The distributed counter problem. A rate limiter that lives in a single process can use an in-memory hash map. A rate limiter that serves a fleet of N gateway instances can't — each instance would track its own counter, and the effective limit would be N× the intended one. The two standard fixes are: (a) shared state in Redis with Lua for atomic read-modify-write, accepting a ~1ms latency tax per request, or (b) local counters per instance with each instance allocated limit / N — simpler, cheaper, but hot-spots when traffic is skewed. The Redis approach is the default; the local approach is what you escape to when Redis itself becomes the bottleneck.
Brief link forward
The full LLD — class diagram, concurrency-safe token bucket, distributed counter tradeoffs, sliding-window-counter math — lives in /lld-handbook/notes/17-problem-rate-limiter. Here we only need enough to name rate limiting when the follow-up comes.
Connection Pools
A connection pool is the single most common shared resource in a backend service. Ask "what's the pool size?" in any LLD interview that touches a database and you instantly sound senior.
Why pools exist
Opening a TCP connection to a database is expensive — TCP handshake, TLS handshake, authentication, session setup. For a request that runs a 2ms query, setup can be 50–200ms. The pool amortizes that cost: connections are created once, handed out to request threads, and returned to the pool. A warm connection is basically free.
Pools also serve two other purposes that come up in senior follow-ups:
- Backend protection. Postgres has max_connections. Exceeding it produces cascading failures across every service that talks to that instance. A pool bounds each service's share so no single bug drains the whole DB.
- Predictable resource use. A pool of 20 means you hold at most 20 connections' worth of memory and sockets. No surprises.
Pool sizing
The single most wrong thing a candidate will say is "make the pool as large as possible." That's a path to the thundering-herd pattern: 500 app threads all firing queries at a DB that can handle 50 concurrent. The DB collapses, every query times out, and the pool does not save you.
The classic starting point is Little's Law:
pool_size ≈ throughput (req/sec) × avg_service_time (sec)

If you serve 100 req/sec and each DB call takes 20ms, you need about 2 connections in steady state. Multiply by a headroom factor (2–3×) and you land around 4–6. That's far smaller than most candidates guess.
There is an even more counter-intuitive rule from the HikariCP docs: smaller pools are often faster. A pool of 10 will usually outperform a pool of 100 on the same DB because the DB does less context-switching and lock-contention internally. A well-tuned Oracle benchmark from years ago landed on connections ≈ ((core_count × 2) + effective_spindle_count).
| Pool size | Symptom | What you see |
|---|---|---|
| Too small | Queueing at the pool checkout | Request latency = query latency + pool-wait time. Pool-wait dominates at load. |
| Too large | Thundering herd on the DB | DB CPU pegged. Individual queries slow down. p99 balloons. Nothing in the app looks wrong. |
| Right-sized | Pool waits are rare; DB is not saturated | p99 ≈ query latency. Pool checkout time < 1ms. |
How you'd answer the "what pool size?" question. "I'd start with Little's Law — peak throughput times p95 service time — then multiply by 1.5 for headroom. Then I'd load test and tune. I'd rather err small: a smaller pool queues locally, which I can observe. A larger pool queues at the DB, which I can't."
Timeouts
Every pool needs four timeouts, not one. Missing any of them leads to a specific failure mode.
| Timeout | What it bounds | Failure mode if absent |
|---|---|---|
| Connection timeout (a.k.a. connect timeout) | How long to wait for the TCP/TLS handshake to complete. | If the DB is unreachable (network partition), threads hang forever on connect(). App pool exhausts. |
| Socket timeout (a.k.a. read timeout, query timeout) | How long to wait for a response after sending bytes. | If the DB accepts the connection but never replies (deadlock, full table scan, GC pause), threads hang forever. This is the most common pool-exhaustion cause. |
| Idle timeout | How long an idle connection sits in the pool before being closed. | Connections die silently on the wire (firewall, NAT gateway). Next use returns "connection reset." |
| Pool checkout timeout | How long a caller waits for a free connection from the pool. | Without this, callers pile up in the pool's internal queue. With it, you get a clean fail-fast at the pool boundary. |
All four or you will be debugging at 2 a.m. The defaults in most drivers are infinite or very generous. Read the docs of whatever pool you're using (HikariCP, pgbouncer, node-postgres) and set them explicitly.
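As an example of what "set them explicitly" looks like with HikariCP plus the Postgres JDBC driver (a sketch; the numbers are placeholders to tune, not recommendations):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;

static DataSource createDataSource() {
    HikariConfig config = new HikariConfig();
    config.setJdbcUrl("jdbc:postgresql://db.internal:5432/app");
    config.setMaximumPoolSize(10);        // Little's Law + headroom, then load test
    config.setConnectionTimeout(3_000);   // pool checkout timeout: fail fast after 3s waiting for a connection
    config.setIdleTimeout(60_000);        // close connections idle for 60s
    config.setMaxLifetime(1_800_000);     // recycle connections every 30 min
    // The TCP connect and socket (read) timeouts belong to the JDBC driver, not the pool:
    config.addDataSourceProperty("connectTimeout", "5");   // seconds, Postgres JDBC
    config.addDataSourceProperty("socketTimeout", "10");   // seconds, Postgres JDBC
    return new HikariDataSource(config);
}
```

Note the naming trap: HikariCP's connectionTimeout is the pool checkout timeout from the table above, while the actual TCP connect timeout is a driver property.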
Pool exhaustion — what to do
The pool has 10 connections. All 10 are in use. Request #11 arrives. What happens?
You, the designer, pick the policy. Senior-candidate move: name the policy explicitly.
| Policy | Behavior | When to pick |
|---|---|---|
| Queue (bounded) | New request waits in a FIFO queue up to pool_checkout_timeout. If it waits longer, it fails. | Default for most services. Smooths short spikes. |
| Fail-fast | Immediately reject with "pool exhausted" error. | Latency-critical paths where queueing is worse than failing. |
| Degrade | Serve a cached or partial response from a different source. | Read paths with a meaningful fallback. |
| Unbounded queue | Wait forever (the default for many pools). | Never. This is how you turn a downstream slowdown into a full-service hang. |
The bounded queue is the default answer. Unbounded is the wrong answer. If you accept the policy is "queue forever," you've built a system that cannot shed load — the only way requests stop is when the JVM OOMs.
The pgbouncer conversation. One follow-up you may get in database-heavy LLDs: "you have 200 app servers each with a pool of 20. That's 4000 connections to one Postgres. How?" The answer is a connection multiplexer like pgbouncer running in transaction mode in front of the DB. App servers open logical connections to pgbouncer; pgbouncer keeps a small pool (say, 50) of real connections to Postgres and reassigns them across app connections on each transaction boundary. You've decoupled the app's pool size from the DB's connection limit. The price: you lose per-connection state (prepared statements, session variables) because each transaction may land on a different physical connection. Senior candidates mention this when the numbers stop making sense.
Resource Contention
Not every shared resource is a connection. Sometimes it's a row in a database, a file lock, a cache entry. These are the hot resources — the ones where thousands of requests want the same thing simultaneously.
Hot locks
The canonical LLD problem: seat booking. One row in the seats table for seat 14A on flight UA123. A thousand users hit "reserve" at the same instant. One wins, nine hundred ninety-nine should get a clean "already taken" error. Only one row ever changes state.
Naive pessimistic locking (SELECT FOR UPDATE) works but serializes the thousand requests through a single lock. Under a burst, your p99 is the 1000th lock acquisition. Three better patterns:
| Strategy | How it works | When it shines |
|---|---|---|
| Partition the resource | Spread the contended thing across many keys. For rate limiting: shard by user_id % N. For inventory: replicate stock across warehouses and pick one randomly. | The resource is conceptually divisible. No natural partition → this doesn't apply. |
| Optimistic locking (CAS) | Read row with version number, compute update, write with WHERE version = :v. Loser retries. | Contention exists but is not extreme. Writes are cheap; rollback cost is low. |
| Queue-based serialization | All writers to resource X send to a single-threaded consumer (an actor, a partitioned Kafka topic keyed by resource). The consumer processes one at a time. | High contention on a single resource. Ordered processing is required. |
The LLD tell. A senior candidate says: "For seat booking, I'd use optimistic locking with the row version — a thousand contenders but only one winner per row, and I can retry the losers. If contention spikes to the point where retry storms happen, I'd move to a per-seat actor/queue." That's a two-step escalation path, named in one sentence.
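A sketch of the optimistic write in plain JDBC, assuming a seats table with status and version columns (names are illustrative):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Returns true if this caller won the seat. Zero rows updated means we lost the race:
// either the version moved or the seat is no longer AVAILABLE, so surface "already taken".
static boolean reserveSeat(Connection conn, long seatId, long expectedVersion, long userId) throws SQLException {
    String sql = "UPDATE seats SET status = 'RESERVED', reserved_by = ?, version = version + 1 "
               + "WHERE id = ? AND version = ? AND status = 'AVAILABLE'";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, userId);
        ps.setLong(2, seatId);
        ps.setLong(3, expectedVersion);
        return ps.executeUpdate() == 1;
    }
}
```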
CPU vs IO contention
These have different remedies and mixing them up is a senior-level mistake.
| Contention type | Symptom | Right cap |
|---|---|---|
| CPU-bound (compression, hashing, crypto, regex) | CPU pegged at 100%. Adding threads makes it worse (context switching). | Parallelism ≈ Runtime.getRuntime().availableProcessors(). More is pure overhead. |
| IO-bound (DB calls, HTTP calls, disk reads) | CPU idle, threads blocked on IO. Throughput is low despite plenty of headroom. | Parallelism much higher than cores — bounded only by the downstream's capacity and your memory for thread stacks. |
The classic error is using a single thread pool sized for CPU (say, 8 threads) for IO-bound work. You cap yourself at 8 in-flight requests no matter how much the downstream could handle. Symmetrically, running CPU-bound work on an unbounded IO pool turns your service into a context-switching hellscape.
The pragmatic move: separate pools. One CPU-bound pool sized to cores. One (or more) IO pools sized by target throughput × expected latency. We'll formalize this in the Bulkhead section.
The async middle ground
Modern runtimes (Node.js, Java with virtual threads, Go, Rust's Tokio) let you side-step most of this bookkeeping by decoupling concurrency from threads. You write code that looks synchronous, but under the hood the runtime multiplexes thousands of in-flight logical tasks across a small number of OS threads. A naive design that would need a 1000-thread pool in classic Java becomes a single-event-loop design in Node or a 1000-goroutine design in Go — the same logical concurrency, a fraction of the memory.
The tradeoff is not gone, it's moved. You no longer exhaust threads; you exhaust backpressure budget on the event loop. A CPU-bound task on the Node event loop stalls every other request on the same process — the symptom is "my HTTP server stops responding when I hash a password synchronously." Async runtimes demand that you keep CPU-bound work off the hot path — delegate to a worker thread pool, a separate process, or a queue. The lesson transfers: CPU work needs a cap; IO work can fan out; don't mix them on the same execution context.
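In Java terms, with virtual threads (21+), the separation looks like this. A sketch; httpClient and hashPassword are stand-ins for your own IO-bound and CPU-bound work:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ExecutionContexts {
    // IO-bound work: fan out widely; each blocking call gets a cheap virtual thread.
    static final ExecutorService IO = Executors.newVirtualThreadPerTaskExecutor();

    // CPU-bound work: cap parallelism at the core count; extra threads only add context switching.
    static final ExecutorService CPU =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
}

// Usage (illustrative):
//   ExecutionContexts.IO.submit(() -> httpClient.call(request));   // blocking IO belongs here
//   ExecutionContexts.CPU.submit(() -> hashPassword(raw));         // keep CPU work off the IO path
```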
Backpressure
What it is
Backpressure is the mechanism by which a slow consumer tells a fast producer to slow down. It is not an error. It is a contract. When a downstream component is at capacity, there are exactly three things it can do:
- Drop incoming work. (Load shedding.)
- Buffer it. (Queue. Works until the queue is full.)
- Backpressure the upstream — refuse to accept, or refuse to acknowledge, new work until it catches up.
A well-designed system uses all three at different layers, but the third is what keeps the system stable.
The plumbing analogy
Think of water flowing through pipes. A fast producer is a high-pressure pump. A slow consumer is a narrow pipe downstream. If there's no backpressure, the pump keeps pushing and pressure builds somewhere — usually the producer's own memory (queue grows unbounded) until the pipe bursts (OOM). Backpressure is the pump's pressure sensor that says "pipe is full, throttle me."
How you signal it
Different protocols have different vocabularies for the same idea.
| Layer | Backpressure signal | What it means |
|---|---|---|
| HTTP | 429 Too Many Requests, 503 Service Unavailable, Retry-After header | Don't retry before N seconds. Explicit rate limit hit. |
| TCP | Receive window (the number in the TCP header) shrinks | Sender must slow down; receiver's buffer is filling. |
| Kafka consumer | max.poll.records + slow poll() loop | Consumer controls its own rate by polling slower. |
| Reactive streams | request(n) — subscriber asks for N items | Pure demand-based — nothing flows until asked. |
| Bounded queue (in-proc) | put() blocks when queue is full | Producer thread blocks rather than enqueueing. |
| gRPC | Flow-control window on the stream | Similar to TCP but at the gRPC level. |
The unifying idea: the consumer's willingness to accept work is a first-class signal. The producer must respect it.
Designing for backpressure
Here is the rule that separates a senior engineer from the rest: every queue in your system is an opportunity for backpressure. Every unbounded queue is a hidden OOM.
Bounded queue behavior:
producer --put()--> [queue: capacity 1000] --take()--> consumer
When queue is full:
- BLOCK policy: producer.put() blocks until a slot frees → backpressure propagates upstream
- DROP policy: producer.put() returns false; producer decides what to do → explicit shedding
- REJECT policy: producer.put() throws; caller handles the rejection → explicit failure

Unbounded queue behavior:
producer --put()--> [queue: unbounded] --take()--> consumer
When consumer is slow:
- queue grows
- grows
- grows
- OOMKilled at 3am

You can see the problem. The unbounded queue hides the backpressure signal until there is no recovery — memory is gone, the JVM is dead.
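The three full-queue policies map directly onto java.util.concurrent.BlockingQueue. A sketch with an arbitrary capacity:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class BoundedHandoff {
    private final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1000);  // bounded by design

    // BLOCK: producer waits for a free slot, so backpressure propagates to whoever called us.
    void submitBlocking(Runnable task) throws InterruptedException {
        queue.put(task);
    }

    // DROP: offer() gives up after a bounded wait; the producer decides what to do with 'false'.
    boolean submitOrDrop(Runnable task) throws InterruptedException {
        return queue.offer(task, 50, TimeUnit.MILLISECONDS);
    }

    // REJECT: add() throws IllegalStateException when the queue is full; the caller handles it.
    void submitOrThrow(Runnable task) {
        queue.add(task);
    }
}
```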
The LLD follow-up. When an interviewer asks "what happens if the consumer is slow?" the strong answer is: "My queue is bounded at 1000. When it fills, producers block — that propagates backpressure to whatever fed them. If the producer is an HTTP handler, it returns 503 to the client. If the producer is another queue, that queue fills, and the chain pushes back to the ingress where we drop or rate-limit." You've drawn a picture of the whole pipeline pushing back cleanly.
Retries
When they help
Retries are the right answer to transient failures — the class of errors where the same request, re-sent shortly after, would succeed:
- Network blip, packet loss, TLS renegotiation hiccup.
- Leader election in the downstream — 50–500ms of "no leader" errors.
- Transient overload: downstream returned 503 but has capacity now.
- Eventually consistent read miss that would hit on a second try.
If the error is persistent — 404, authorization failure, validation error — retrying is worse than useless; it's latency amplification with no hope of success. Respect the 4xx vs 5xx distinction: retry 5xx and 429, never retry 4xx except 429 and 408.
When they amplify damage
Retries are also how a bad day becomes an outage. The scenario:
- Downstream has a bad deploy. 30% of requests fail with 503.
- Every upstream caller retries 3× per request.
- Load on the downstream = normal × (1 + 3 × 0.3) = 1.9× normal.
- Downstream, already broken, is now serving 1.9× load. Failure rate climbs to 60%.
- Retries mean load = 1 + 3 × 0.6 = 2.8×. Failure climbs to 90%.
- Downstream collapses completely. This is a retry storm.
The retry loop is a positive feedback loop on downstream load. Without damping, the math guarantees collapse.
Exponential backoff + jitter
Damping mechanism #1: space the retries exponentially.
retry 1: wait 100ms
retry 2: wait 200ms
retry 3: wait 400ms
retry 4: wait 800ms

This alone is necessary but not sufficient. The catastrophic failure mode: 1000 clients hit a failing downstream at the same instant. They all retry at t + 100ms. They all retry at t + 200ms. They all retry at t + 400ms. They synchronize on the failure — the retry storm has structure now.
Damping mechanism #2: jitter. Add randomness to each delay so clients decorrelate.
| Strategy | Formula | Properties |
|---|---|---|
| No jitter | base * 2^n | Synchronization bomb. Do not ship. |
| Equal jitter | base * 2^n / 2 + random(0, base * 2^n / 2) | Half fixed, half random. Smooths somewhat. |
| Full jitter | random(0, base * 2^n) | Purely random up to the cap. Best decorrelation in the AWS study. |
Jitter is non-negotiable. The three-line difference between "exponential backoff with jitter" and "exponential backoff" is the difference between a system that gracefully recovers and one that doesn't. The AWS Architecture Blog post "Exponential Backoff And Jitter" is required reading — their benchmark showed full jitter outperforming every other scheme.
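The full-jitter delay is a few lines. A sketch; baseMillis and capMillis are parameters you tune:

```java
import java.util.concurrent.ThreadLocalRandom;

// Full jitter: sleep a uniformly random duration between 0 and the exponential cap.
static long fullJitterDelayMillis(int attempt, long baseMillis, long capMillis) {
    long exponentialCap = Math.min(capMillis, baseMillis * (1L << attempt));  // base * 2^attempt, capped
    return ThreadLocalRandom.current().nextLong(0, exponentialCap + 1);       // random(0, cap), inclusive
}
```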
Retry budgets
Per-request backoff helps a single client. It does not help the fleet. If 10,000 clients each politely do 3 retries with jitter, the downstream still sees 30,000 extra requests during an outage. This is where the SRE concept of a retry budget comes in.
The rule: the total rate of retries across the fleet must not exceed X% (typically 10%) of the base request rate. If retries exceed the budget, stop retrying at all — the downstream is clearly broken and more retries won't help.
Implementation sketch: the client library keeps a local token budget (e.g. 10% of request count). Each retry consumes a token; each success mints a token. When the bucket is empty, retries are disabled until successes replenish it. Google's SRE book (Chapter 22) formalizes this; gRPC's built-in retry policy has a retry budget option for exactly this reason.
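One way to realize that 10% ratio in a client library, as a sketch (this is not gRPC's or the SRE book's exact implementation; the names and the thousandths bookkeeping are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// Fleet-friendly retry budget: a retry is only allowed while recent successes have minted tokens.
final class RetryBudget {
    private final double tokenRatio;   // e.g. 0.1 keeps retries near 10% of the success rate
    private final long maxTokens;      // cap so a quiet period can't bank unlimited retries
    private final AtomicLong milliTokens = new AtomicLong(0);   // stored in thousandths of a token

    RetryBudget(double tokenRatio, long maxTokens) {
        this.tokenRatio = tokenRatio;
        this.maxTokens = maxTokens;
    }

    void onSuccess() {
        // Each success mints a fraction of a token.
        milliTokens.getAndUpdate(t -> Math.min(maxTokens * 1000, t + (long) (tokenRatio * 1000)));
    }

    boolean tryAcquireRetry() {
        // A retry costs one whole token. Empty budget means: stop retrying, the downstream is hurting.
        return milliTokens.getAndUpdate(t -> t >= 1000 ? t - 1000 : t) >= 1000;
    }
}
```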
Idempotency requirement
Retries only work if the operation is idempotent. Retrying a non-idempotent POST /charge is a duplicate-charge bug with extra steps. Cross-reference file 06 on idempotency keys — the short version is:
- Reads (GET) are always idempotent. PUT and DELETE are idempotent by HTTP spec if implemented correctly.
- POST is not idempotent by default. To make it safe to retry, accept an Idempotency-Key header and deduplicate server-side.
- Writes that update by primary key + absolute value (SET balance = 100) are idempotent. Writes that update by delta (balance = balance + 10) are not.
Retries without idempotency is a bug generator. Saying this sentence out loud in an interview is worth real points.
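Server-side dedup for the Idempotency-Key case can be as small as a unique constraint. A sketch assuming Postgres and an idempotency_keys table (schema and names are illustrative; a real version also stores the original response so duplicates can be replayed):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Run inside the same transaction as the business write, so the key and the charge commit together.
static boolean firstTimeSeeing(Connection conn, String idempotencyKey) throws SQLException {
    String sql = "INSERT INTO idempotency_keys (idempotency_key, created_at) VALUES (?, now()) "
               + "ON CONFLICT (idempotency_key) DO NOTHING";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setString(1, idempotencyKey);
        return ps.executeUpdate() == 1;   // 0 means a duplicate: skip the charge, return the stored response
    }
}
```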
Circuit Breaker
The circuit breaker is named after the household electrical device: when current exceeds a threshold, the breaker trips, cutting the circuit. It resets manually (or in our case, after a cooldown). The LLD circuit breaker does the same for a downstream dependency — when failures exceed a threshold, the breaker "opens" and subsequent calls fail fast without hitting the downstream at all.
The purpose is twofold:
- Protect the caller from paying the latency cost of calling a dependency that will fail anyway.
- Protect the downstream from retry-storm amplification while it's recovering.
The three states
failures ≥ threshold
┌──────────────────────────┐
│ │
▼ │
┌────────────┐ ┌──────┴─────┐
│ CLOSED │ │ OPEN │
│ (normal) │ │ (fail fast)│
└─────┬──────┘ └──────┬─────┘
▲ │
│ probe success │ cooldown expired
│ ▼
│ ┌────────────┐
└────────────────────│ HALF-OPEN │
probe fail │ (testing) │
   → back to OPEN       └────────────┘

- Closed. Normal operation. All calls pass through. Failures are counted.
- Open. Fail fast. Calls return immediately with a "circuit open" error without touching the downstream. A cooldown timer runs (e.g. 30 seconds).
- Half-open. Cooldown expired. Let one (or a few) probe requests through. If they succeed, close the circuit — back to normal. If they fail, open it again for another cooldown.
The half-open state is what distinguishes a circuit breaker from a simple on/off switch. Without it, you'd need a human to reset the breaker. With it, the breaker heals automatically when the downstream recovers.
Failure criteria
What counts as "enough failures to open"? Two schemes.
| Criterion | How it works | When it's wrong |
|---|---|---|
| Threshold count | Open after N consecutive failures (e.g. 5). | At low traffic — 5 failures in a row out of 5 requests is very different from 5 out of 1000. |
| Rolling ratio | Open when failure rate in a rolling window exceeds X% with at least M requests. | Needs more state (sliding window of outcomes). Correct under varying load. |
Production implementations (Hystrix, resilience4j) use rolling ratio. The minimum-request threshold matters: if traffic is sparse, you need to accumulate enough samples to make the failure rate statistically meaningful.
Half-open probe strategy
When the cooldown expires, how do you decide to close again?
- Single probe. Let one request through. If success, close. Simplest.
- N-of-M probes. Require, say, 3 of the next 5 requests to succeed. More robust; avoids closing on a lucky hit.
- Gradual ramp (soft probe). Allow a small percentage of traffic through, increasing over time. Essentially a canary. Best for high-stakes dependencies.
The tradeoff is between fast recovery (single probe closes in one round-trip) and false recovery (single probe closes on a fluke and we open again in 200ms).
LLD use
In an LLD, the circuit breaker shows up as a wrapper / decorator around any downstream dependency.
Caller → CircuitBreaker(dependencyClient) → Downstream
interface DownstreamClient {
    Response call(Request r);
}

class CircuitBreaker implements DownstreamClient {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final DownstreamClient delegate;
    private State state = State.CLOSED;
    private RollingWindow failures;     // rolling window of recent call outcomes
    private Instant openedAt;           // when the breaker last tripped open

    CircuitBreaker(DownstreamClient delegate) {
        this.delegate = delegate;
    }

    @Override
    public Response call(Request r) {
        if (state == State.OPEN) {
            if (cooldownExpired()) transitionTo(State.HALF_OPEN);
            else throw new CircuitOpenException();
        }
        try {
            Response resp = delegate.call(r);
            onSuccess();
            return resp;
        } catch (Exception e) {
            onFailure();
            throw e;
        }
    }

    // ... state transitions, threshold checks, cooldown clock ...
}

The shape — an adapter that implements the same interface as the thing it wraps — is the key. The caller cannot tell whether the breaker is there or not. That's the point. Breakers are composable with retries, timeouts, bulkheads — you stack them.
Order of composition
When you stack timeout, retry, and breaker around a call, the order matters. The conventional ordering — from outermost to innermost — is:
Bulkhead → CircuitBreaker → Retry → Timeout → Downstream

Read it outside-in: the bulkhead caps the pool. Inside the pool, the circuit breaker decides whether to even attempt. If open, fail-fast without hitting retry. If closed, attempt — the retry wrapper invokes the timeout'd call, retries on transient failures, and each individual attempt is itself bounded by the timeout.
Swap the order and you get subtle bugs. Put timeout outside retry and the timeout covers all retry attempts combined — suddenly a retry that would otherwise wait 400ms becomes a hard failure because the outer timeout ran out. Put the circuit breaker inside retry and every retry attempt checks the breaker — now you can trip the breaker in the middle of a retry loop for a single request, which is rarely what you want.
Libraries like resilience4j make this composition explicit via a decorator chain. In a hand-rolled LLD, sketching the order on the whiteboard communicates that you've thought about it.
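A hand-rolled sketch of that nesting in plain Java. The bulkhead, circuitBreaker, withTimeout, and fullJitterDelayMillis helpers are hypothetical stand-ins for the pieces sketched elsewhere in this file:

```java
// Outside-in: bulkhead -> circuit breaker -> retry loop -> per-attempt timeout -> downstream.
Response callDownstream(Request r) throws Exception {
    return bulkhead.execute(() ->                        // caps concurrent calls to this dependency
        circuitBreaker.execute(() -> {                   // fails fast while the breaker is open
            Exception last = null;
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                try {
                    // Each attempt gets its own timeout; the retry loop wraps the timeout, not the reverse.
                    return withTimeout(attemptTimeout, () -> downstream.call(r));
                } catch (TransientException e) {
                    last = e;
                    if (attempt < maxRetries) {
                        Thread.sleep(fullJitterDelayMillis(attempt, baseMillis, capMillis));
                    }
                }
            }
            throw last;   // retries exhausted; this single failure is what the breaker counts
        })
    );
}
```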
Bulkhead
The name is nautical. A ship has bulkheads — watertight compartments — so that flooding in one section doesn't sink the ship. Software bulkheads do the same for resources.
The classic motivating scenario. Your service calls two dependencies: the primary DB and a search index. Both share one thread pool of 50 threads. Search gets slow (takes 5 seconds per call instead of 50ms). Within minutes, 49 of your 50 threads are blocked on search. DB calls that would take 2ms now wait in line behind search calls. Your whole service is down — even for requests that don't touch search — because one slow dependency ate the whole pool.
The fix: separate thread pools per downstream. Give search its own pool of, say, 10 threads. When search slows down, it fills its own pool and subsequent search calls get rejected or queued. The DB pool is untouched. Requests that only need the DB still work.
┌──────────────┐ ┌──────────────────────┐
│ Request │──┬──▶ │ DB pool (30 threads) │──▶ Primary DB
│ Handler │ │ └──────────────────────┘
└──────────────┘ │
│ ┌──────────────────────┐
└──▶ │ Search pool (10) │──▶ Search index
                └──────────────────────┘

Three sizing rules:
- Size each pool for its dependency's normal load, not worst case. If search normally takes 5 concurrent threads, give it 10, not 50.
- The sum of pool sizes should be less than the server's thread budget. Otherwise the bulkheads don't actually bulkhead; they're nominal.
- Each pool gets its own queue policy — fail-fast for search, bounded queue for the DB, whatever matches the dependency's tolerance.
In the LLD interview, saying "I'd put search behind a bulkhead — separate thread pool and checkout timeout — so a search outage doesn't eat DB capacity" is an immediate senior signal.
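A sketch of the two pools in plain java.util.concurrent; the sizes and queue choices are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class DependencyBulkheads {
    // DB bulkhead: 30 threads plus a short bounded queue to absorb brief spikes.
    final ExecutorService dbPool = new ThreadPoolExecutor(
            30, 30, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.AbortPolicy());       // queue full -> RejectedExecutionException

    // Search bulkhead: 10 threads, no queueing. When search slows down, callers fail fast
    // (catch the rejection and degrade or return 503) instead of piling up behind it.
    final ExecutorService searchPool = new ThreadPoolExecutor(
            10, 10, 0L, TimeUnit.MILLISECONDS,
            new SynchronousQueue<>(),
            new ThreadPoolExecutor.AbortPolicy());
}
```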
Timeouts — the Unsexy Foundation
I've mentioned timeouts in the pool section, the retry section, the circuit breaker section. They deserve their own moment because they are the single highest-leverage intervention in distributed systems resilience, and they are the most commonly forgotten.
Every RPC needs a timeout. Period.
Not "should have" — must. The default timeout in most HTTP libraries is infinite. Go's http.Client with a zero Timeout waits forever. Python's requests.get(url) without a timeout= waits forever. Java's URLConnection.setReadTimeout(0) waits forever. Every one of these is a footgun.
Why it matters:
- Without a timeout, a slow downstream converts into pool exhaustion. Thread or connection is held until the downstream responds — which may be never.
- Without a timeout, a retry is meaningless. You can't retry a call that hasn't returned.
- Without a timeout, a circuit breaker can't count failures — there is no failure, there is only "still waiting."
The interviewer's question. "What's your timeout on that HTTP call?" The bad answers: "the default," "I haven't decided," "30 seconds" (without a reason). The good answer has structure:
"The downstream's p99 is 200ms based on their SLA. I'd set timeout at 500ms — 2.5× p99. That gives headroom for the occasional slow request without waiting on pathological ones. If the caller itself has a 1-second SLA, I'd tighten it further so I have budget for a retry."
The pattern is: timeout = some multiple of downstream's p99, bounded by my own SLA budget. Senior candidates cite concrete numbers with concrete reasons.
Also: timeouts should shrink as calls go deeper. If the ingress has a 3-second SLA, the service it calls should have a 2-second timeout, the service it calls should have 1 second, and so on. Each layer needs time to retry, log, and respond. A downstream timeout equal to the ingress SLA means you can't even notice a failure before the client gives up.
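That budget arithmetic is easy to encode as deadline propagation. A sketch; in practice the deadline often travels in a request header or gRPC's built-in deadline:

```java
import java.time.Duration;
import java.time.Instant;

// Timeout for the next hop: a multiple of the downstream's tail latency, but never more than
// what remains of my own caller's budget after reserving time to retry, log, and respond.
static Duration nextHopTimeout(Instant myDeadline, Duration downstreamP99, Duration reserve) {
    Duration remaining = Duration.between(Instant.now(), myDeadline).minus(reserve);
    Duration fromTail = downstreamP99.multipliedBy(2);        // roughly the 2-2.5x p99 rule of thumb
    Duration timeout = remaining.compareTo(fromTail) < 0 ? remaining : fromTail;
    return timeout.isNegative() ? Duration.ZERO : timeout;    // no budget left: fail fast, don't call
}
```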
Load Shedding
Sometimes backpressure isn't enough. The system is already overwhelmed — queues are full, pools are exhausted, latencies are blown — and the only way to recover is to stop accepting work.
Load shedding is the deliberate, controlled dropping of requests. It is the last line of defense before total collapse. When it engages, users see errors. That's the point. The alternative is everyone sees errors, because the system is dead.
Shedding strategies
| Strategy | How it works | When to use |
|---|---|---|
| Fixed rate | Drop X% of incoming requests when overloaded. | Simple but blunt. All callers suffer equally. |
| Priority-based | Drop low-priority traffic first (e.g. background syncs, analytics); preserve high-priority (user-facing). | Requires explicit priority signaling, but preserves the UX where it matters most. |
| Cost-based | Drop expensive requests first (e.g. report generation) before cheap ones (cache lookup). | When operation costs vary widely. |
| Random with bias | Probability of dropping scales with load. | Smoother than binary drop/accept; avoids bimodal behavior. |
Where to shed
Shed as early as possible in the request path. By the time a request reaches the business logic, the expensive work (TLS, auth, DB connection checkout) is already done. Shedding at the LB or gateway, before those costs are paid, lets the system recover faster.
Priority headers are the senior move. Every request gets a priority class — critical, standard, background. Under load, drop background first, standard second, never critical. Checkout flow is critical. "Update personalized recommendations" is background. This is how large services stay up during incidents — they degrade, they don't die.
The CoDel pattern
A subtler shedding scheme, borrowed from networking, is Controlled Delay (CoDel): don't look at queue length, look at how long the oldest item has been waiting. If the oldest item has been in the queue longer than a target latency (say, 100ms) for a sustained window, start dropping the newest arrivals. The logic: if the head of the queue is already too old to serve meaningfully, admitting more work only makes the situation worse. It also has a nice UX property — items that have been waiting long enough to be useless time out explicitly rather than being served stale. Facebook described a production-grade variant of this idea (CoDel plus adaptive LIFO) in their "Fail at Scale" write-up, and it shows up in several modern gateways as "queue sojourn time" shedding.
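A sketch of the core admission check, assuming each queued item is stamped with its arrival time (the target and window values are illustrative):

```java
import java.time.Duration;
import java.time.Instant;

// CoDel-style shedding: look at how long the oldest queued item has been waiting, not at queue length.
final class SojournShedder {
    private final Duration target = Duration.ofMillis(100);   // acceptable head-of-line wait
    private final Duration window = Duration.ofMillis(500);   // how long it must stay bad before we shed
    private Instant overTargetSince = null;

    // oldestArrival: arrival time of the head of the queue, or null if the queue is empty.
    boolean admit(Instant oldestArrival) {
        boolean overTarget = oldestArrival != null
                && Duration.between(oldestArrival, Instant.now()).compareTo(target) > 0;
        if (!overTarget) {
            overTargetSince = null;     // head-of-line delay recovered; accept everything again
            return true;
        }
        if (overTargetSince == null) overTargetSince = Instant.now();
        // Shed new arrivals only once the delay has been over target for the whole window.
        return Duration.between(overTargetSince, Instant.now()).compareTo(window) < 0;
    }
}
```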
Failure Modes by Layer
Every resilient design maps its failures to layers and assigns a strategy per layer. Here's the canonical table. Memorize it.
| Layer | What fails | Signal | What to do | What to monitor |
|---|---|---|---|---|
| Transport | TCP connection refused, TLS handshake failure, DNS failure | Exception at connect() | Connection timeout, retry with backoff, health check | Connect error rate, DNS lookup latency |
| Downstream dependency | 5xx responses, high latency, timeouts | HTTP status, latency histogram | Retry (idempotent only), circuit breaker, bulkhead, fallback | Error rate by dependency, p99 latency, circuit state |
| Self (this service) | OOM, thread pool exhausted, GC pause, CPU starved | JVM/runtime signals, queue depth | Load shed, rate limit, scale horizontally | Heap use, GC pause duration, queue depth, thread count |
| Infra | Node failure, AZ failure, deploy failure, network partition | Health checks fail, K8s evicts | Multi-AZ, redundant replicas, automated failover, rollback | Node health, pod restart rate, replication lag |
The test of a senior design: for each layer, which strategy is in place, and which metric tells you it's working? "Retry the downstream" is a strategy. "And alert on circuit-open rate > 5%" is what makes it observable.
The Resilience Checklist
When you finish an LLD class diagram, walk this checklist. For any external call:
- [ ] Timeout. What's the number? Why that number? Is it less than the caller's SLA?
- [ ] Retry. Is the operation idempotent? If not, is there an idempotency key? How many retries? What's the backoff?
- [ ] Jitter. Full jitter, not fixed. Full jitter, not fixed. (Yes, I said it twice.)
- [ ] Circuit breaker. What's the failure threshold? Cooldown? Half-open probe strategy?
- [ ] Rate limit. Per-caller or global? What's the key? Where is it enforced?
- [ ] Backpressure. Are all my queues bounded? What happens when they fill?
- [ ] Bulkhead. Do I share a thread pool with a flaky dependency?
- [ ] Idempotency. Can this be retried safely? How do I deduplicate?
- [ ] Load shedding. Do I have a way to drop work under overload? By priority?
- [ ] Observable. Do I emit the metrics that would let me tune these settings?
If you walk this aloud as the interview wraps, you have just shown senior reflexes in 90 seconds. Most candidates never get here. This checklist is the distinguishing artifact.
Interview Pushback Responses
The follow-up probes at senior LLD interviews are predictable. Here are the sentence-level answers a senior candidate gives. These are not scripts — they are the shape of the answer. Practice them until they come out naturally.
"What happens if the downstream is slow?"
"Three-layer defense. I have a timeout — if it exceeds my timeout, the call fails fast. The failure feeds the circuit breaker, which after N failures in the window opens and fails subsequent calls without hitting the downstream. Callers waiting on those fast-fail responses free their threads quickly, so my pool isn't exhausted. If I have a cached or degraded fallback, the breaker routes there; otherwise I return a clean 503 to my client with
Retry-After."
"What happens if the downstream is down entirely?"
"Circuit breaker opens within the first few failures. From that point on, calls fail fast — cheap and local. A background task or the half-open probe tests the dependency periodically; when it recovers, the breaker closes and normal traffic resumes. I never take my service down because a dependency is down."
"What if the queue fills up?"
"The queue is bounded by design — never unbounded, because that's an OOM waiting to happen. When it fills, I have a named policy: for user-facing requests, I block the producer, which propagates backpressure to whoever fed me; for background work, I drop with a metric. Either way, the backpressure signal reaches the ingress, where I can rate-limit or load-shed before more work even enters the system."
"What if traffic spikes 10x?"
"Rate limiting at the gateway is the first line — legitimate abuse or runaway clients get cut off before they touch my service. Below that, bounded queues and bulkheads mean the spike fills my capacity but doesn't collapse it — I process at my designed rate, the excess waits briefly or rejects fast. If the spike is sustained, I load-shed by priority — background work dropped first, user-facing preserved. Horizontal scaling is the recovery — autoscaler sees the queue depth and adds replicas — but the design has to survive the minute before the scaler catches up."
"What if two of your three downstream dependencies are slow?"
"Each has its own bulkhead — separate thread pool, separate timeout, separate circuit breaker. Slowness in search doesn't eat capacity for the DB. If both search and the recommendations service are slow, their two pools fill, but the DB pool and the cache pool are untouched. Requests that only need DB + cache continue. Requests that need search return a partial response or fail gracefully, depending on how critical search is to the endpoint."
"You have a retry, you have a circuit breaker, and you have an idempotency key. Walk me through what happens on a 503."
"First call goes out with idempotency key K. Downstream returns 503. The retry policy says: wait
full_jitter(100ms), try again with the same K. Second call: downstream sees K, knows it hasn't processed it, attempts. If it succeeds, fine. If it fails again, I consume another retry budget token. If I hit max retries or the retry budget is depleted, I bubble the failure up. Meanwhile, the circuit breaker has been counting the 503s in its window; if the threshold trips, it opens and subsequent first-attempts fail fast without even making the network call. The idempotency key ensures that a retry after a timeout — where the downstream did receive and process the request but my side didn't see the response — doesn't double-process."
"Your rate limiter key is the user ID. What about abuse from unauthenticated traffic?"
"Layered keys. User ID for authenticated requests. IP address — or better,
(IP, route)— for unauthenticated endpoints. A global safety valve on total request rate. The first limit exceeded rejects the request. Pre-auth IP limits need to tolerate shared NAT, so they're set generously with alerting rather than tight with hard rejection — otherwise the coffee shop wifi users share one penalty box."
"Pick a number for this timeout."
"The downstream's SLA is p99 at 200ms. I'd set my timeout at 500ms — 2.5× their p99, which gives headroom for the tail without waiting on pathological slowness. My own caller gives me 1 second, so 500ms for the call leaves me 500ms for retry, logging, and response overhead. If the downstream's SLA were looser, I'd need to negotiate either a tighter SLA from them or a looser one to my own caller."
"How do you know these settings are right?"
"I don't, in advance. I pick reasonable starting points — timeout from downstream SLA, pool size from Little's Law, retry budget at 10% — and I instrument everything. Pool-wait time, circuit state transitions, retry count, 429 rate. Load test to find where each setting starts to matter. Tune in production with real traffic. The goal of the initial design isn't to be optimal, it's to be observable — I want the metrics that let me tune."
"Why don't you just scale horizontally and skip all this?"
"Scaling helps capacity, not resilience. Ten replicas of a service with an unbounded queue and no circuit breaker is ten OOMs in parallel instead of one. More instances doesn't prevent a retry storm — it invites one, because now the retry multiplier fans out across more callers. Horizontal scaling is a lever I pull in the failure-modes-by-layer table under 'self overloaded'; it does not replace bulkheads, circuit breakers, or timeouts. In fact, the more replicas you run, the more these patterns matter — a shared downstream is now being hammered by more clients, so polite client behavior is the only thing saving it."
"What if the circuit breaker is wrong — what if it opens but the downstream is actually fine?"
"That's a real failure mode, usually caused by threshold tuning or a minority-failure pattern (e.g. a bad shard of the downstream failing 20% of requests, which is below full failure but above the breaker threshold). Mitigations: tune the threshold based on observed baseline error rate, not an arbitrary number; use the rolling ratio with a minimum-sample requirement so we don't trip on two flukes; scope the breaker at a finer granularity — per-shard or per-instance, not per-service — so a bad shard only opens its own sub-breaker. And critically: alert on
circuit_open_rate. A breaker that's flapping open is itself a signal worth paging on."
What is Expected at Each Level
| Level | Expectations |
|---|---|
| Junior | Knows rate limiting and retries exist. Can name "timeout" when prompted. Does not distinguish idempotent from non-idempotent operations. Usually forgets to bound queues. |
| Mid-level | Picks token bucket by default. Exponential backoff with jitter. Bounded queues. Recognizes a retry storm when described. Can implement a circuit breaker when asked. |
| Senior | Volunteers the resilience checklist without prompting. Discusses retry budget and priority-based load shedding. Names timeouts with concrete numbers and reasons. Distinguishes CPU-bound from IO-bound pool sizing. Uses bulkheads between dependencies. Draws the backpressure chain end-to-end from ingress to downstream. Treats idempotency as a prerequisite for retry, not an afterthought. Orders timeout/retry/breaker composition correctly. |
Cross-References
- File 06: Concurrency — Correctness — locks, memory model, idempotency keys. Idempotency is the prerequisite for safe retries.
- File 07: Concurrency — Coordination — producer-consumer, queues, worker pools. The in-process backpressure patterns that this file generalizes to distributed systems.
- File 17: Rate Limiter LLD — full problem walkthrough with class diagram and distributed-counter tradeoffs.
- AWS Architecture Blog: Exponential Backoff And Jitter. The canonical reference on why full jitter wins.
- Google SRE book, Chapter 22: Addressing Cascading Failures. Retry budgets and load shedding in production.
- HikariCP docs on pool sizing — the small-pool-is-faster result is counter-intuitive until you read their write-up.