
17 – Problem: Rate Limiter

Understanding the Problem

A rate limiter decides, for each incoming request, whether the caller has stayed within an agreed-upon request budget. That sounds like a one-line if statement, but the interview almost never stops at "write the if." It asks you to articulate which algorithm to use, why that one, how to keep it correct under concurrency, and how to make it work when the system grows past a single box. The rate limiter is the canonical Strategy-pattern problem in the LLD canon – five legitimate algorithms sit behind one allow() method, each with distinct memory, accuracy, and burstiness properties.

The bar for a senior answer

The interview is not testing whether you can implement a token bucket from memory. It is testing whether you know when each algorithm fits, why you would pick one over another for a specific product, and whether you can hold a conversation about the concurrency and distributed trade-offs without hand-waving. Junior candidates implement one algorithm correctly. Senior candidates design a pluggable limiter where the algorithm is a strategy, defend a particular choice against the others, and can sketch the distributed version (Redis + Lua) and the fail-open/fail-closed product decision when that infrastructure goes down.

Rate limiting shows up in three distinct shapes, and the interviewer's framing tells you which one you are designing:

  1. In-process library – a Guava RateLimiter, a middleware for an Express server. Single JVM/process. No network hop. Correctness under thread contention is the interesting part.
  2. Centralized service – an API gateway or a sidecar that everyone in the fleet consults. Network hop per decision. Latency becomes the interesting part.
  3. Distributed coordinated – every edge node enforces a global budget (Cloudflare, Stripe, Twitter). State has to live somewhere shared (Redis, DynamoDB). Consistency vs latency trade-offs dominate.

We will design the in-process version first, then sketch the distributed extension. That mirrors how the interview actually flows – "start simple, then I'll ask about scale."


Clarifying Questions

You: Is this a library embedded in the caller's process, or a standalone service every request consults?

Interviewer: Library first. We can talk about the service version at the end.

You: Who is the unit of limiting – a user, an API key, a source IP, an endpoint, or a tuple of those?

Interviewer: Per-API-key for now. Design it so we can swap the key extractor.

You: Is there a single rate (say, 100 req/sec) or multiple tiers (100/sec AND 10,000/day)?

Interviewer: Assume a single rate for the core design. Multi-tier is a follow-up.

You: How important is burst tolerance? Can a caller use 60 seconds' worth of quota in one second if they have been quiet, or do we smooth strictly?

Interviewer: We want to allow some burst – it is a common API-ergonomics ask – but bounded.

You: Is a rejected request a hard 429, or do we also support a soft/shadow mode where we emit headers but still serve?

Interviewer: Start with hard-reject. Mention how you would add shadow mode.

You: Accuracy budget – is an off-by-one over a one-second window acceptable, or do we need exact counting?

Interviewer: Near-exact is fine. Do not burn memory on perfect accuracy.

You: Expected scale – how many distinct keys live in memory simultaneously?

Interviewer: Assume up to ~1M active keys on a single box, with long tail decaying.

You: Clock – can we trust System.currentTimeMillis(), or do we need an injectable clock for tests and NTP-skew tolerance?

Interviewer: Inject it. You will want that for tests anyway.

You: Failure mode – if the backing store (future Redis) is down, fail-open or fail-closed?

Interviewer: Good question. Treat it as a product-configurable flag and discuss the trade-off.


Functional Requirements

  1. allow(key: string, cost: number = 1): Result – decide whether a request from this key, consuming cost units, is permitted.
  2. The limiter is configured with a rule per key (or per tenant tier) describing rate and burst capacity.
  3. The algorithm is pluggable – callers can choose Token Bucket, Leaky Bucket, Fixed Window, Sliding Window Log, or Sliding Window Counter without changing the facade.
  4. The result carries enough metadata to build X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After HTTP headers.
  5. Support variable cost – a request for /search?expensive=true can consume 10 units while a cheap one consumes 1.
  6. Pluggable key extraction – by API key, by user ID, by IP, or by a composite of those.
  7. Eviction of idle per-key state so memory does not grow unboundedly.
  8. Observability – emit per-decision metrics (allowed/denied, remaining, latency).

Out of scope for the core design: Distributed enforcement, multi-tier limits, shadow mode, leaky-bucket-as-queue (we implement leaky bucket as a meter, not as an actual FIFO), adaptive/back-pressure limiting.

Non-Functional Requirements

| Concern | Target |
|---|---|
| Decision latency | Sub-microsecond on the cache-hot path. A rate limiter sits in front of every request – if it adds 1 ms, a 100-RPS server loses 100 ms/s to limiter overhead. |
| Concurrency | Correct under thousands of concurrent threads/fibers per key. No lost updates, no double-admits. |
| Memory footprint | ~40–80 bytes per active key (one AtomicLong + last-seen timestamp + pointer overhead). At 1M active keys, ~80 MB – acceptable. |
| Accuracy | Off-by-one over a 1-second window is fine. Off-by-2× (the fixed-window boundary bug) is not. |
| Fail safety | An in-process library cannot "fail" in the Redis sense, but in the distributed extension the fail-open vs fail-closed decision must be explicit. |
| Determinism for tests | The clock must be injectable so tests do not rely on wall-clock sleep(). |

Core Entities and Relationships

| Entity | Responsibility |
|---|---|
| RateLimiter | Facade. Owns the rule registry, the key extractor, the strategy, and the per-key state map. allow() is its only public verb. |
| IRateLimitStrategy | Interface – given a key's current state, a rule, the clock, and a cost, decides allow/deny and mutates state. The algorithm pluggability point. |
| TokenBucket | Strategy. Lazy-refill model; allows bursts up to capacity. Stripe's choice. |
| LeakyBucket (meter form) | Strategy. Fixed outflow rate; deny when the bucket is full. Smooths traffic. |
| FixedWindow | Strategy. Count requests in a [t, t+W) bucket, reset at the boundary. Cheap, prone to 2× burst at the boundary. |
| SlidingWindowLog | Strategy. Keeps a timestamp per request in the window; exact but O(N) memory per key. |
| SlidingWindowCounter | Strategy. Weighted average of previous + current fixed-window counts. Near-exact at fixed-window cost. |
| Rule | Value object: (capacity, refillRatePerSec) or (limit, windowMs) depending on strategy. Tier-scoped. |
| IClock | Abstraction over System.currentTimeMillis / performance.now for test injection. |
| IKeyExtractor | Pulls a string key out of a request. Swappable (header vs IP vs user ID vs composite). |
| KeyState | Per-key mutable state. Token Bucket stores (tokens, lastRefillMs). Fixed Window stores (count, windowStart). |
| Result | { allowed: boolean, remaining: number, retryAfterMs: number, resetAtMs: number }. |

Why Strategy is the right pattern here. All five algorithms share the same shape – "given this key and this cost, should I allow?" – and the same return contract. They differ only in internal state and math. That is Strategy's textbook fit: same interface, swappable guts. Inheritance would couple the algorithms; a big switch on algorithmType would bloat RateLimiter and require touching the facade every time we add an algorithm.

Why separate Rule from KeyState. A rule is configuration (the same rule applies to all keys in a tier). State is data (one record per key). Keeping them apart lets us reload rules at runtime without touching state, and evict state without touching rules.


Interfaces

```typescript
interface Result {
  readonly allowed: boolean;
  readonly remaining: number;   // tokens / requests left in current window
  readonly retryAfterMs: number; // 0 when allowed
  readonly resetAtMs: number;    // epoch ms when quota fully restored
}

interface IClock {
  nowMs(): number;
  nowNanos(): bigint; // for sub-ms precision where it matters
}

interface IKeyExtractor<Req> {
  extract(request: Req): string;
}

interface Rule {
  readonly capacity: number;        // token bucket capacity, or window limit
  readonly refillPerSec: number;    // tokens/sec (token/leaky bucket)
  readonly windowMs?: number;        // fixed/sliding window size
}

interface KeyState {
  // Opaque; each strategy defines its own concrete subtype.
}

interface IRateLimitStrategy {
  tryAcquire(state: KeyState, rule: Rule, cost: number, nowMs: number): Result;
  newState(nowMs: number): KeyState;
}

interface IRateLimiter {
  allow(key: string, cost?: number): Result;
}
```

IKeyExtractor is generic over Req so an Express middleware, a gRPC interceptor, and a raw TCP handler can each plug in appropriate extractors without leaking a concrete request type into RateLimiter.
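
For concreteness, here are two extractors sketched against a hypothetical HTTP request shape (the HttpRequest type below is illustrative, not part of the design above):

```typescript
// Hypothetical request shape for illustration only.
interface HttpRequest {
  headers: Record<string, string | undefined>;
  ip: string;
}

class ApiKeyExtractor implements IKeyExtractor<HttpRequest> {
  extract(request: HttpRequest): string {
    // Fall back to the source IP so unauthenticated callers are still limited.
    return request.headers["x-api-key"] ?? `ip:${request.ip}`;
  }
}

// Composite: joins several extractors into one key, e.g. "key123|ip:1.2.3.4".
class CompositeExtractor implements IKeyExtractor<HttpRequest> {
  constructor(private readonly parts: IKeyExtractor<HttpRequest>[]) {}
  extract(request: HttpRequest): string {
    return this.parts.map((p) => p.extract(request)).join("|");
  }
}
```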

IClock is non-negotiable for senior-level code. Any rate limiter that reads wall-clock time inline is untestable without Thread.sleep, which is slow and flaky. The five-minute cost of an injected clock pays back in every test from then on.


Class Diagram

```
                        +----------------------+
Request --[extract]---->|      RateLimiter     |
                        |----------------------|
                        | - strategy           |
                        | - keyExtractor       |
                        | - rules: Map         |
                        | - state: ConcurrentMap<key, KeyState>
                        | - clock              |
                        | - metrics            |
                        |----------------------|
                        | + allow(key, cost)   |
                        +----------+-----------+
                                   |
                                   | delegates to
                                   v
                        +----------------------+
                        |  IRateLimitStrategy  |<------ interface
                        +----------+-----------+
                                   ^
       +------------+---------+----+-------+--------------+
       |            |         |            |              |
 TokenBucket   LeakyBucket  FixedWindow  SlidingLog  SlidingCounter ...
  Strategy      Strategy     Strategy     Strategy      Strategy

+----------+     +-----------+     +---------+
| IClock   |     | IKey      |     |  Rule   |
| (inj.)   |     | Extractor |     | (tier)  |
+----------+     +-----------+     +---------+
```

Every strategy gets the same three inputs (state, rule, time) and returns the same Result. The facade knows nothing about bucket math.


Class Design

RateLimiter (facade)

State

  • strategy: IRateLimitStrategy – the active algorithm.
  • keyExtractor: IKeyExtractor<Req> – pulls the key out of a request (not used inside allow(key) directly, but the layer above uses it).
  • rules: Map<TierId, Rule> – tier → rule. For now, single tier.
  • state: ConcurrentMap<string, KeyState> – per-key mutable state.
  • clock: IClock – injected.
  • metrics: IMetricsEmitter – observer.

allow(key, cost) algorithm

  1. Look up (or lazily create) the KeyState for this key.
  2. Look up the Rule for the tier.
  3. Call strategy.tryAcquire(state, rule, cost, clock.nowMs()).
  4. Emit metrics.
  5. Return the Result.

Lazy state creation uses ConcurrentMap.computeIfAbsent (Java) or a double-checked idiom with a Mutex around the map in languages without it. The per-key mutations inside the strategy are separately guarded (see Concurrency Considerations).
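
A minimal sketch of that flow, assuming a single tier and a plain Map (fine for single-threaded Node; the concurrent-map concerns discussed under Concurrency Considerations apply on the JVM), with a bare callback standing in for IMetricsEmitter:

```typescript
// A sketch, not the full facade: single tier, key extraction and eviction
// omitted (the extractor runs in the layer above, per the note earlier).
class RateLimiter implements IRateLimiter {
  private readonly state = new Map<string, KeyState>();

  constructor(
    private readonly strategy: IRateLimitStrategy,
    private readonly rule: Rule,
    private readonly clock: IClock,
    private readonly metrics: (r: Result) => void = () => {},
  ) {}

  allow(key: string, cost = 1): Result {
    const nowMs = this.clock.nowMs();
    let s = this.state.get(key);
    if (s === undefined) {
      // Lazy state creation: in multi-threaded runtimes this is the
      // computeIfAbsent / double-checked spot discussed above.
      s = this.strategy.newState(nowMs);
      this.state.set(key, s);
    }
    const result = this.strategy.tryAcquire(s, this.rule, cost, nowMs);
    this.metrics(result); // observer hook; keep it cheap (see Follow-ups)
    return result;
  }
}
```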

TokenBucket – the canonical strategy

A bucket holds up to capacity tokens. Tokens refill continuously at refillPerSec. Each allow(cost) deducts cost tokens if available; otherwise deny.

Why it is the default for API gateways. Token bucket separates the steady rate from the burst capacity – you can say "100 RPS steady, but allow bursts up to 500" just by setting refillPerSec=100, capacity=500. Leaky bucket cannot do that cleanly. Fixed/sliding window can only smooth, not burst.

State: { tokens: number, lastRefillMs: number }.

Laziness. We do not run a refill timer per key – that is millions of timers. Instead we compute "how many tokens would have accumulated since lastRefillMs" at the moment of each allow(). This is the classic lazy-refill trick and it is O(1) regardless of how long the bucket has been idle.

LeakyBucket (meter)

Dual of token bucket. The bucket has capacity "slots." A request consumes one slot. Slots drain at leakRatePerSec. Deny when full.

Two variants exist: the meter (what we implement – a counter that drains) and the queue (an actual FIFO that serves requests at a fixed rate). The queue variant is a different system – it adds latency to smooth bursts. Make sure you know which the interviewer is asking about. We implement the meter.
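
A minimal meter-form sketch against the Rule shape above, reading refillPerSec as the leak rate – structurally the token bucket with the fill direction inverted:

```typescript
interface LeakyBucketState extends KeyState {
  level: number;       // current "water level" in request units
  lastDrainMs: number;
}

class LeakyBucketStrategy implements IRateLimitStrategy {
  newState(nowMs: number): LeakyBucketState {
    // A fresh bucket is empty -- the mirror of the token bucket starting full.
    return { level: 0, lastDrainMs: nowMs };
  }

  tryAcquire(state: KeyState, rule: Rule, cost: number, nowMs: number): Result {
    const s = state as LeakyBucketState;
    // Lazy drain: the mirror image of lazy refill.
    const elapsedMs = nowMs - s.lastDrainMs;
    if (elapsedMs > 0) {
      s.level = Math.max(0, s.level - (elapsedMs / 1000) * rule.refillPerSec);
      s.lastDrainMs = nowMs;
    }
    if (s.level + cost <= rule.capacity) {
      s.level += cost;
      return {
        allowed: true,
        remaining: Math.floor(rule.capacity - s.level),
        retryAfterMs: 0,
        resetAtMs: nowMs + Math.ceil((s.level / rule.refillPerSec) * 1000), // empty again
      };
    }
    // Deny: wait until enough has leaked out to fit `cost`.
    const overBy = s.level + cost - rule.capacity;
    const retryAfterMs = Math.ceil((overBy / rule.refillPerSec) * 1000);
    return {
      allowed: false,
      remaining: Math.floor(rule.capacity - s.level),
      retryAfterMs,
      resetAtMs: nowMs + retryAfterMs,
    };
  }
}
```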

FixedWindow

Partition time into fixed W-ms windows starting at epoch. Track (currentWindowStart, count). Increment on allow, reset on window boundary.
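
A sketch, assuming windowMs is set on the Rule; window starts are snapped to the epoch grid so all keys share boundaries:

```typescript
interface FixedWindowState extends KeyState {
  count: number;
  windowStartMs: number;
}

class FixedWindowStrategy implements IRateLimitStrategy {
  newState(_nowMs: number): FixedWindowState {
    return { count: 0, windowStartMs: 0 }; // aligned on first tryAcquire
  }

  tryAcquire(state: KeyState, rule: Rule, cost: number, nowMs: number): Result {
    const s = state as FixedWindowState;
    const windowMs = rule.windowMs!;
    // Snap the window start to the epoch grid.
    const windowStart = Math.floor(nowMs / windowMs) * windowMs;
    if (windowStart !== s.windowStartMs) {
      s.windowStartMs = windowStart; // crossed a boundary: hard reset
      s.count = 0;
    }
    const resetAtMs = windowStart + windowMs;
    if (s.count + cost <= rule.capacity) {
      s.count += cost;
      return { allowed: true, remaining: rule.capacity - s.count, retryAfterMs: 0, resetAtMs };
    }
    return {
      allowed: false,
      remaining: Math.max(0, rule.capacity - s.count),
      retryAfterMs: resetAtMs - nowMs, // quota returns at the next boundary
      resetAtMs,
    };
  }
}
```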

The boundary-burst problem. A caller who sends 100 requests at t=0.999s and another 100 at t=1.001s has sent 200 requests in 2 ms, but the fixed-window limiter (window=1s, limit=100) allows it because each half sits in a different window. In practice this means fixed window permits up to 2× the nominal rate in the worst case. This is the single most-cited fact about fixed window in interviews.

SlidingWindowLog

Keep a sorted list (or deque) of request timestamps in the last W ms for each key. On each allow, drop timestamps older than now - W, then check list.size() + cost <= limit.

Exact. No boundary burst. But memory grows linearly with request rate – a key at 1000 RPS holds 1000 longs per second. For large fleets this is the kill-switch.
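
A sketch anyway, since the interviewer may ask for it; the retry hint is approximate (it assumes the oldest entry aging out frees enough room):

```typescript
interface SlidingWindowLogState extends KeyState {
  timestamps: number[]; // ascending; use a real deque in production (shift() is O(n))
}

class SlidingWindowLogStrategy implements IRateLimitStrategy {
  newState(_nowMs: number): SlidingWindowLogState {
    return { timestamps: [] };
  }

  tryAcquire(state: KeyState, rule: Rule, cost: number, nowMs: number): Result {
    const s = state as SlidingWindowLogState;
    const windowMs = rule.windowMs!;
    // Expire entries that have slid out of the window.
    while (s.timestamps.length > 0 && s.timestamps[0] <= nowMs - windowMs) {
      s.timestamps.shift();
    }
    if (s.timestamps.length + cost <= rule.capacity) {
      for (let i = 0; i < cost; i++) s.timestamps.push(nowMs); // one entry per unit of cost
      return {
        allowed: true,
        remaining: rule.capacity - s.timestamps.length,
        retryAfterMs: 0,
        resetAtMs: s.timestamps[0] + windowMs,
      };
    }
    // Approximate retry hint: when the oldest entry ages out, a slot frees up.
    const oldest = s.timestamps[0] ?? nowMs;
    const retryAfterMs = Math.max(0, oldest + windowMs - nowMs);
    return {
      allowed: false,
      remaining: Math.max(0, rule.capacity - s.timestamps.length),
      retryAfterMs,
      resetAtMs: nowMs + retryAfterMs,
    };
  }
}
```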

SlidingWindowCounter

The compromise algorithm. Keep two fixed-window counters: previousCount, currentCount. On each request, estimate the current sliding-window total as:

estimate = previousCount * (1 - elapsedInCurrentWindow / W) + currentCount

This is a weighted linear interpolation – it assumes the previous window's traffic was evenly distributed. In practice the approximation is tight enough for SLA purposes (Cloudflare's blog post on this approach reports <0.003% error on production traffic).
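
Worked example: with W = 1000 ms, previousCount = 80, currentCount = 30, and 250 ms elapsed in the current window, estimate = 80 × (1 - 0.25) + 30 = 90 – so a limit of 100 still admits the next request, while a limit of 90 denies it.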

Fixed-window cost, near-sliding-window accuracy. This is the "secretly the best default" answer when the interviewer asks "which one would you actually ship?"

Clock abstraction

```typescript
class SystemClock implements IClock {
  nowMs(): number { return Date.now(); }
  nowNanos(): bigint { return process.hrtime.bigint(); }
}

class FakeClock implements IClock {
  private current = 0n;
  nowMs(): number { return Number(this.current / 1_000_000n); }
  nowNanos(): bigint { return this.current; }
  advance(ms: number): void { this.current += BigInt(ms) * 1_000_000n; }
}
```

Tests use FakeClock. Production uses SystemClock. The strategy never touches Date.now() directly.
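
What that buys, sketched as a deterministic test against the TokenBucketStrategy implemented under Key Methods below (plain console.assert standing in for a test framework):

```typescript
// Deterministic refill test: no sleep(), just clock.advance().
const clock = new FakeClock();
const strategy = new TokenBucketStrategy();
const rule: Rule = { capacity: 10, refillPerSec: 10 };
const state = strategy.newState(clock.nowMs());

for (let i = 0; i < 10; i++) {
  console.assert(strategy.tryAcquire(state, rule, 1, clock.nowMs()).allowed);
}
console.assert(!strategy.tryAcquire(state, rule, 1, clock.nowMs()).allowed); // bucket empty

clock.advance(500); // half a second accrues exactly 5 tokens
const r = strategy.tryAcquire(state, rule, 1, clock.nowMs());
console.assert(r.allowed && r.remaining === 4);
```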


Key Methods

Token Bucket – the canonical implementation

```typescript
interface TokenBucketState extends KeyState {
  tokens: number;        // fractional; we refill in real numbers
  lastRefillMs: number;
}

class TokenBucketStrategy implements IRateLimitStrategy {
  newState(nowMs: number): TokenBucketState {
    // A fresh key starts full – the friendliest policy. Debatable: some systems
    // start empty to force a "warm-up." Default full = better UX for new keys.
    return { tokens: Number.POSITIVE_INFINITY, lastRefillMs: nowMs };
  }

  tryAcquire(state: KeyState, rule: Rule, cost: number, nowMs: number): Result {
    const s = state as TokenBucketState;
    // Initialize on first real use – we do not know `capacity` in newState.
    if (!Number.isFinite(s.tokens)) s.tokens = rule.capacity;

    // 1. Lazy refill: how many tokens have accrued since lastRefillMs?
    const elapsedMs = nowMs - s.lastRefillMs;
    if (elapsedMs > 0) {
      const accrued = (elapsedMs / 1000) * rule.refillPerSec;
      s.tokens = Math.min(rule.capacity, s.tokens + accrued);
      s.lastRefillMs = nowMs;
    }

    // 2. Deduct if we can afford it.
    if (s.tokens >= cost) {
      s.tokens -= cost;
      return {
        allowed: true,
        remaining: Math.floor(s.tokens),
        retryAfterMs: 0,
        resetAtMs: nowMs + Math.ceil(((rule.capacity - s.tokens) / rule.refillPerSec) * 1000),
      };
    }

    // 3. Deny. Compute retry-after: how long until we accumulate `cost - tokens`?
    const deficit = cost - s.tokens;
    const retryAfterMs = Math.ceil((deficit / rule.refillPerSec) * 1000);
    return {
      allowed: false,
      remaining: Math.floor(s.tokens),
      retryAfterMs,
      resetAtMs: nowMs + retryAfterMs,
    };
  }
}
```

Why lazy refill instead of a timer. A fleet with 1M active keys cannot afford 1M refill timers – that is a scheduler meltdown. Lazy refill means we pay O(1) per request and zero cost per idle key. The only state that moves is the timestamp.

Why Math.min(capacity, ...). The bucket is bounded. Without the clamp, a key that idled for a week would return with tokens = refillPerSec * 604800 – effectively unlimited burst. Token bucket's point is bounded burst.

Why Math.floor(remaining). The internal tokens are fractional (we accrue 0.5 tokens for half a refill interval). The user-visible remaining count is an integer for header cleanliness.

Why per-state mutation, not return-new-state. This object is under a lock (or CAS, see concurrency). Mutation in place is what production JVM and single-threaded Node code actually do. A purely functional version would atomically swap in a fresh state object via CAS, which works but allocates more.

Sliding Window Counter

```typescript
interface SlidingWindowCounterState extends KeyState {
  prevCount: number;
  currCount: number;
  currWindowStartMs: number;
}

class SlidingWindowCounterStrategy implements IRateLimitStrategy {
  newState(nowMs: number): SlidingWindowCounterState {
    return { prevCount: 0, currCount: 0, currWindowStartMs: nowMs };
  }

  tryAcquire(state: KeyState, rule: Rule, cost: number, nowMs: number): Result {
    const s = state as SlidingWindowCounterState;
    const windowMs = rule.windowMs!;

    // 1. Roll windows forward if we have crossed boundaries.
    const windowsElapsed = Math.floor((nowMs - s.currWindowStartMs) / windowMs);
    if (windowsElapsed >= 2) {
      // Both windows are stale – reset.
      s.prevCount = 0;
      s.currCount = 0;
      s.currWindowStartMs = s.currWindowStartMs + windowsElapsed * windowMs;
    } else if (windowsElapsed === 1) {
      // Slide: current becomes previous, new current starts empty.
      s.prevCount = s.currCount;
      s.currCount = 0;
      s.currWindowStartMs += windowMs;
    }

    // 2. Estimate the sliding window total via linear interpolation of the
    //    previous window's contribution, weighted by how far we are into the
    //    current window.
    const elapsedInCurrent = nowMs - s.currWindowStartMs;
    const prevWeight = 1 - elapsedInCurrent / windowMs; // 1.0 → 0.0 across the window
    const estimate = s.prevCount * prevWeight + s.currCount;

    // 3. Admit if the estimated total plus cost would not exceed the limit.
    if (estimate + cost <= rule.capacity) {
      s.currCount += cost;
      return {
        allowed: true,
        remaining: Math.max(0, Math.floor(rule.capacity - estimate - cost)),
        retryAfterMs: 0,
        resetAtMs: s.currWindowStartMs + 2 * windowMs, // worst case for full drain
      };
    }

    // 4. Deny. retry-after = time for the previous-window contribution to
    //    decay enough that `estimate + cost <= capacity`.
    const overBy = estimate + cost - rule.capacity;
    const msPerPrevUnit = s.prevCount > 0 ? windowMs / s.prevCount : windowMs;
    const retryAfterMs = Math.min(windowMs, Math.ceil(overBy * msPerPrevUnit));
    return {
      allowed: false,
      remaining: Math.max(0, Math.floor(rule.capacity - estimate)),
      retryAfterMs,
      resetAtMs: nowMs + retryAfterMs,
    };
  }
}
```

Why the weighting is (1 - elapsedInCurrent / W). At the instant a window starts, the previous window's requests are "fully weighted" – they still fully count. As we advance through the current window, the previous window's contribution linearly decays to zero. By the end of the current window, the previous is entirely out of the sliding view.

Accuracy. Cloudflare's production numbers show the approximation error is ~0.003% of requests on real traffic – within noise. The assumption (previous window traffic was uniform) can fail for bursty callers, but the direction of the error is bounded: it can never admit more than limit + prevCount requests in a sliding window, which is only a problem at boundary conditions that the pure sliding-window-log catches.

Why windowsElapsed >= 2 is a separate branch. If an idle key returns after 3+ windows, we must not carry a stale prevCount forward by accident. The single-step slide only holds when the jump is exactly one window.

Distributed version – Redis + Lua sketch

When every node in a fleet must enforce one shared budget, you cannot keep state on each node. Put it in Redis. The subtlety: a decision involves reading state, computing, writing state. If two nodes do this concurrently for the same key, they race.

Redis's answer is server-side Lua scripts – atomic by construction because Redis runs them single-threaded.

```lua
-- token_bucket.lua
-- KEYS[1] = key name
-- ARGV[1] = capacity
-- ARGV[2] = refillPerSec
-- ARGV[3] = cost
-- ARGV[4] = nowMs
-- Returns: { allowed (0|1), remaining, retryAfterMs }

local key      = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill   = tonumber(ARGV[2])
local cost     = tonumber(ARGV[3])
local nowMs    = tonumber(ARGV[4])

local data = redis.call("HMGET", key, "tokens", "lastMs")
local tokens = tonumber(data[1])
local lastMs = tonumber(data[2])

if tokens == nil then
  tokens = capacity
  lastMs = nowMs
end

-- Lazy refill.
local elapsed = nowMs - lastMs
if elapsed > 0 then
  tokens = math.min(capacity, tokens + (elapsed / 1000) * refill)
  lastMs = nowMs
end

local allowed = 0
local retryAfterMs = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  local deficit = cost - tokens
  retryAfterMs = math.ceil((deficit / refill) * 1000)
end

-- Persist and set TTL so idle keys evict.
redis.call("HSET", key, "tokens", tokens, "lastMs", lastMs)
redis.call("PEXPIRE", key, math.ceil(capacity / refill * 1000) * 2)

return { allowed, math.floor(tokens), retryAfterMs }
```

Why Lua. EVAL/EVALSHA runs atomically on the Redis server. Without it, a GET; compute; SET round-trip from the client is racy. With it, correctness falls out of Redis's single-threaded execution model.

Why PEXPIRE on every write. This is the eviction strategy for Redis state. If a key has not been touched in 2× the time it would take to refill fully, nobody is rate-limiting it – let Redis reclaim the memory. Idle keys cost zero.

Latency. One Redis round-trip per decision – typically 0.2–0.5 ms on a well-placed cluster. Acceptable for an edge gateway. Not acceptable for a hot inner loop.
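
A sketch of the client side, assuming the ioredis library and that tokenBucketLua holds the script above; failOpen is the product flag from the requirements:

```typescript
import Redis from "ioredis";

// defineCommand registers the script once and transparently uses EVALSHA,
// falling back to EVAL when the script cache is cold.
const redis = new Redis();
redis.defineCommand("tokenBucket", { numberOfKeys: 1, lua: tokenBucketLua });

async function allowDistributed(
  key: string, rule: Rule, cost: number, nowMs: number, failOpen: boolean,
): Promise<boolean> {
  try {
    // defineCommand adds the method dynamically, hence the cast; real code
    // would declare it via module augmentation.
    const [allowed] = (await (redis as any).tokenBucket(
      key, rule.capacity, rule.refillPerSec, cost, nowMs,
    )) as [number, number, number];
    return allowed === 1;
  } catch {
    // Redis unreachable: this is the fail-open / fail-closed product flag.
    return failOpen;
  }
}
```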


Design Decisions & Tradeoffs

The algorithm comparison table – the centerpiece

| Algorithm | Memory per key | Accuracy | Burst handling | Implementation complexity | Distributed friendliness | When it wins |
|---|---|---|---|---|---|---|
| Fixed Window | O(1): (count, windowStart) | Poor – up to 2× burst at window boundary | None (implicit via limit) | Trivial – ~15 lines | Great – single atomic INCR in Redis | Low-stakes limits where simplicity dominates. Good enough for abuse prevention on toy APIs. |
| Sliding Window Log | O(N): one timestamp per request | Exact – no boundary error | None (implicit) | Moderate – sorted collection, expiry | Poor – Redis sorted set grows unbounded, ZRANGE per request | Compliance/audit limits where the exact count matters. Low-rate APIs (~10 RPS) where the log stays small. |
| Sliding Window Counter | O(1): two ints + timestamp | ~0.003% error (Cloudflare) – indistinguishable in practice | None (implicit) | Moderate – weighted interpolation math | Good – two INCRs and a read | The default recommendation. Fixed-window cost, sliding-window accuracy. Cloudflare uses this. |
| Token Bucket | O(1): (tokens, lastRefillMs) | Exact w.r.t. rate; allows bursts up to capacity | First-class – capacity and refillPerSec are independent knobs | Low – the lazy-refill trick is ~20 lines | Great – well-known Lua script | API gateways (Stripe, AWS). When burst tolerance is a product requirement. |
| Leaky Bucket (meter) | O(1): (level, lastDrainMs) | Exact w.r.t. rate; no burst allowance beyond bucket capacity | Smooths; rejects faster than token bucket under load | Low – structurally identical to token bucket with inverted fill direction | Great – same Lua skeleton | Traffic shaping in front of a slow downstream. When you want to protect the downstream, not indulge bursty callers. |

The senior-level articulation, in interview voice:

Fixed window is the cheapest but has the 2× boundary-burst problem – a caller who times their traffic around the window boundary can double the nominal rate. Sliding-window counter fixes that at essentially the same memory cost, which is why I would pick it as the default if the interview forced one choice. Token bucket is different – it's what you want when burst capacity is a product feature, not a bug. Stripe's API is the clearest example: they want to allow a batch of 100 card charges to come in as a burst and then enforce the steady rate afterward. Token bucket expresses that naturally because capacity and refillPerSec are independent knobs. Leaky bucket is token bucket's mirror – same structure, opposite philosophy: no burst, strict smoothing. You would use it to protect a slow downstream, not to serve a bursty caller.

Why sliding-window-log almost never wins a design debate. The only reason to accept O(N) per-key memory is exact accuracy, and exact accuracy matters in a startlingly narrow slice of real products. For SLA enforcement, compliance rate limits, or fraud thresholds where "how many times this hour" has to be legally defensible, yes. For everything else, 0.003% error is free.

Why we did not pick a simpler "counter + timer" design

The obvious first instinct – "just keep a counter and reset it every second with setInterval" – breaks in three ways:

  1. A timer per key does not scale. 1M keys = 1M timers. Neither Node's event loop nor a JVM ScheduledExecutor will tolerate that.
  2. Lazy refill is structurally cleaner. You compute elapsed time on demand, pay O(1) per request, and zero cost per idle key.
  3. Timers make testing miserable. You end up with Thread.sleep(1001) scattered through your test suite.

Say this out loud in the interview. It shows you know why the clever laziness exists, not just that it exists.


Patterns Used

| Pattern | Where | Why |
|---|---|---|
| Strategy | IRateLimitStrategy with five concrete algorithms | The five algorithms have identical shape but different math. Strategy swaps them without touching the facade. |
| Facade | RateLimiter | Hides the map lookups, rule resolution, metrics emission, and strategy dispatch behind one allow() method. |
| Dependency Injection | IClock, IKeyExtractor, IRateLimitStrategy, IMetricsEmitter all constructor-injected | Makes the limiter testable without any global state. |
| Observer | Metrics emitter fires onAllow / onDeny | Keeps instrumentation out of the hot path's core logic. |
| Decorator | A LayeredLimiter that chains per-IP → per-user → per-endpoint limiters, each wrapping the next | Each layer has its own rule; the request has to pass all of them. Decorator composition is cleaner than a hardcoded pipeline. |
| Singleton (cautiously) | A process-wide RateLimiter registry (one per configured strategy/rule combination) | Avoids allocating state maps per middleware instance, but singletons make testing harder; prefer explicit DI and use a singleton only at the composition root. |

Note on Singleton. The interviewer may push on this: "should RateLimiter be a singleton?" The senior answer is "only the instance held by the application should be singular. The class itself has no reason to enforce singleton semantics – enforce it in the composition root (the DI container, or main()) and keep the class testable." Classic Singleton (private constructor, static instance) is a smell in modern code.


Concurrency Considerations

This is the subsection where the interviewer separates a mid-level candidate from a senior. Rate limiters live on every hot path – they run under contention by definition.

In-process: the per-key CAS loop

Each key's state is a small object. The state map (global) is a ConcurrentHashMap (Java) or lock-striped map (other languages). Contention on the map itself is negligible because insertions happen once per key.

The real contention is on a single key's state under concurrent requests. Two approaches:

1. Synchronized / lock per key. Simple. The lock granularity is per-key, so two different keys don't contend. A ConcurrentHashMap<String, ReentrantLock> with computeIfAbsent for lazy allocation. Overhead: a lock acquire + release per request.

2. Lock-free CAS loop. Pack the state into a single AtomicLong (e.g., 32 bits of tokens + 32 bits of lastRefillMs % someEpoch), compute the next state, and compareAndSet it. Retry on failure.

```java
// Lock-free token bucket โ€” state packed into one long.
// high 32 bits = fractional tokens * 1000 (fixed-point)
// low 32 bits  = millis since epoch-offset
long current, next;
do {
  current = state.get();
  long tokens = decodeTokens(current);
  long lastMs = decodeLastMs(current);
  long nowMs  = clock.nowMs();

  long accrued = (long) ((nowMs - lastMs) * refillPerMs);
  long newTokens = Math.min(capacityEncoded, tokens + accrued);

  if (newTokens < costEncoded) {
    // Deny – do NOT update state. Another thread may also be denied concurrently
    // and no-op; first allow wins.
    return deny(...);
  }
  newTokens -= costEncoded;
  next = encode(newTokens, nowMs);
} while (!state.compareAndSet(current, next));

return allow(...);
```

Trade-off. CAS is faster under low contention, but under high contention (thousands of concurrent requests per key) retries burn CPU. Modern wisdom: lock per key is fine for almost everyone; CAS is the optimization you reach for when profiling shows contention. For an SDE3 interview, articulating both and knowing when each wins is the real signal.

In single-threaded JavaScript (Node worker, browser), concurrency is moot – allow() is atomic by the event-loop model. Mention it explicitly in a JS answer so the interviewer knows you understand the runtime. Do not over-engineer.

Distributed: Redis + Lua

Covered in the Key Methods section. The Lua script gives you atomic read-modify-write on Redis's single-threaded server. Two common mistakes:

  1. WATCH/MULTI/EXEC instead of Lua. This works but is much slower (optimistic concurrency with round-trips per retry). Lua is strictly better for this use case.
  2. Client-side computation with SETNX guards. Racy. Do not do this.

Clock skew in distributed systems

If the Lua script uses redis.call("TIME"), you have one authoritative clock – good. If it takes nowMs from the client, you have as many clocks as clients, and NTP skew of ~50 ms (common) corrupts refill math.

The senior answer: prefer Redis's own clock via redis.call("TIME") inside the Lua script. It costs nothing, and it eliminates the class of bugs where a client with a skewed clock gets free tokens (or is under-served).
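
One version caveat, worth stating if pressed: before Redis 5.0 made effect-based replication the default, a script that calls TIME and then writes had to invoke redis.replicate_commands() first, because verbatim script replication forbids writes after a non-deterministic command.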


Scale & Extensibility

Scaling the in-process limiter

| Scale lever | Mechanism |
|---|---|
| Shard the state map | Break ConcurrentMap<String, State> into N sub-maps keyed by hash(key) % N. Reduces lock contention on the map itself. Java's ConcurrentHashMap already does this internally, but in other runtimes you may need to do it yourself. |
| Evict idle keys | A background sweeper removes keys whose lastRefillMs is older than a TTL (e.g., 2 * capacity/refillPerSec). Cheap: one pass, mark-and-sweep. Alternatively, use an LRU-style LinkedHashMap with a max size. |
| Pre-allocate state | For known hot keys, pre-populate the map at startup to avoid the computeIfAbsent race on the first request. |
| Local-decision caching (distributed version) | Let each edge node hold a small slice of the budget (a "quota lease") from the central store and only round-trip to Redis when the local quota is exhausted. Cloudflare's edge limiter works this way. |

Distributed extension – evolving the design

Starting from the in-process library, the distributed extension is:

  1. Replace the ConcurrentMap with a Redis client.
  2. Replace the strategy's direct state mutation with a Lua script invocation (EVALSHA).
  3. Add a circuit breaker around the Redis call – if Redis is unreachable, fall back to a fail-open or fail-closed behavior (explicit product decision below).
  4. Add a decision cache: recent allow/deny decisions (say, last 100 ms for a key) cached locally to absorb short Redis outages and reduce round-trips.

The IRateLimitStrategy interface doesn't change. The strategy implementation is what swaps. That is exactly the payoff for choosing Strategy up front.

Multi-tier limits

A realistic API has 1000 req/sec AND 100,000 req/day. Model as two independent RateLimiter instances composed under a CompositeRateLimiter:

```typescript
class CompositeRateLimiter implements IRateLimiter {
  constructor(private readonly limiters: IRateLimiter[]) {}
  allow(key: string, cost = 1): Result {
    let tightestRemaining = Number.POSITIVE_INFINITY;
    let maxRetryAfter = 0;
    for (const l of this.limiters) {
      const r = l.allow(key, cost);
      if (!r.allowed) return r; // deny on first failure, no side effects on remaining limiters
      tightestRemaining = Math.min(tightestRemaining, r.remaining);
      maxRetryAfter = Math.max(maxRetryAfter, r.retryAfterMs);
    }
    return { allowed: true, remaining: tightestRemaining, retryAfterMs: 0, resetAtMs: 0 };
  }
}
```

Subtle bug. The naive version above calls allow() on every sub-limiter, which means each sub-limiter deducts its cost. If the second denies, the first has already deducted – we've double-charged the caller on the successful layers. The fix is either (a) two-phase: peek() on all, then commit() on all only if all peeks pass; or (b) a lightweight compensation – on deny, refund the earlier layers. (a) is cleaner. Bring it up; interviewers love catching the atomicity issue.
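
A sketch of option (a)'s shape – the peek/commit methods are hypothetical additions, not part of the IRateLimiter interface above, and under real concurrency a lock must span the whole peek-then-commit sequence:

```typescript
// Hypothetical two-phase interface for composable limiters.
interface ITwoPhaseLimiter {
  peek(key: string, cost: number): Result;  // read-only: would this pass?
  commit(key: string, cost: number): void;  // deduct; caller guarantees peek passed
}

class TwoPhaseComposite implements IRateLimiter {
  constructor(private readonly limiters: ITwoPhaseLimiter[]) {}

  allow(key: string, cost = 1): Result {
    for (const l of this.limiters) {
      const r = l.peek(key, cost);
      if (!r.allowed) return r;                           // nothing deducted anywhere yet
    }
    for (const l of this.limiters) l.commit(key, cost);   // all passed: deduct everywhere
    // Aggregation of remaining/reset across layers elided in this sketch.
    return { allowed: true, remaining: 0, retryAfterMs: 0, resetAtMs: 0 };
  }
}
```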

Per-endpoint vs per-tenant

Compose via the Decorator pattern:

```typescript
// A request passes iff all decorators allow it.
const limiter =
  new PerIpLimiter(
    new PerUserLimiter(
      new PerEndpointLimiter(baseLimiter)
    )
  );
```

Each layer has its own key extractor (IP, user, endpoint) and its own rule. The composition is the policy.

Warm cache vs cold start

The first request for a never-seen key pays an extra cost: map insertion, state allocation, metrics registration. This is bounded – it happens once per key – but on a cold-started service absorbing a traffic surge, cold starts can dominate.

Mitigation. Pre-warm the top-N keys at startup by sniffing the previous instance's state (via a Redis dump or a side-channel snapshot). For the other long tail, the cold-start cost is acceptable as long as it's not happening inside a lock.


Edge Cases

  1. Clock rewind / NTP step-back. If nowMs < lastRefillMs, elapsedMs goes negative and the naive refill math subtracts tokens the caller already earned. Guard with if (elapsedMs <= 0) skip the refill and leave lastRefillMs untouched (resetting it to the rewound nowMs would over-credit once the clock recovers), or use a monotonic clock (System.nanoTime() / performance.now()) for elapsed-time math and currentTimeMillis() only for reset-at headers.

  2. Sub-millisecond request rate. At 10K RPS, multiple requests share a millisecond timestamp. Token bucket is fine (fractional tokens handle it). Sliding-window log with ms-precision timestamps collides – use nanos, or accept that sub-ms collisions are bucketed together.

  3. Bootstrap state for a new key. Default to tokens = capacity (welcoming) or tokens = 0 (strict). Both are defensible. The welcoming default is better UX and matches what API gateways actually ship; document it.

  4. Memory eviction of idle state. Without a TTL, the state map leaks slowly as keys come and go. Two strategies: a passive LRU (size-bounded LinkedHashMap) or an active sweeper. Passive is simpler; active is more responsive. Pick one and say why.

  5. Cost > capacity. A request costs 100 units but capacity is 50. Two semantics:

    • Reject immediately – never admittable, so retryAfterMs = 0 with a distinct error code. This is the saner default: an impossible request should fail fast.
    • Queue until possible – only makes sense in the leaky-bucket-as-queue variant (which we did not implement). Mention it and move on.
  6. Stampede at the boundary. A deployment rolls out and 10K clients hit the limiter in the same millisecond. Fixed window is worst (they all count against the new window's budget). Token bucket is best (tokens are consumed in sequence; once exhausted, the rest deny). This is one more reason fixed-window is dangerous in practice.

  7. Retry storms on deny. If every denied caller retries immediately with no backoff, you get a retry stampede. The limiter's retryAfterMs in the response is the hint. If callers ignore it, they stay denied – that's correct behavior but operationally noisy. Beyond the limiter's responsibility, but worth mentioning: surface Retry-After in headers; document exponential backoff in the SDK.

  8. Negative cost or zero cost. Validate input. cost <= 0 is a caller bug – throw, don't silently allow.

  9. Key cardinality explosion. A buggy caller sends random keys (e.g., user ID forged per-request). The state map explodes. Defense: hard cap on map size with LRU eviction (sketched just below); alert on eviction rate.
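
A sketch of the hard cap from item 9 (and the passive LRU from item 4), leaning on the fact that a JavaScript Map iterates in insertion order:

```typescript
// Size-capped state map: delete-and-reinsert on access moves a key to the
// back of the iteration order, approximating LRU without extra structure.
class LruStateMap {
  private readonly map = new Map<string, KeyState>();

  constructor(
    private readonly maxSize: number,
    private readonly onEvict: (key: string) => void = () => {}, // feeds the eviction-rate alert
  ) {}

  getOrCreate(key: string, create: () => KeyState): KeyState {
    let s = this.map.get(key);
    if (s !== undefined) {
      this.map.delete(key); // refresh recency: move to the back
    } else {
      s = create();
      if (this.map.size >= this.maxSize) {
        const oldest = this.map.keys().next().value as string; // front = least recent
        this.map.delete(oldest);
        this.onEvict(oldest);
      }
    }
    this.map.set(key, s);
    return s;
  }
}
```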


Follow-up Questions

Interviewer: How would you evolve the in-process design to distributed?

Swap the ConcurrentMap for a Redis client, port each strategy's tryAcquire to a Lua script, keep the same IRateLimitStrategy interface. Add a circuit breaker around the Redis call and an explicit fail-open/fail-closed flag. Cache recent decisions locally to absorb short Redis hiccups and reduce round-trips.

Interviewer: 1M active users, 100K RPS aggregate. Can a single Redis handle it?

Rough math: at 100K RPS the limiter alone pushes 100K script invocations per second through Redis, which is on the order of what a single instance can sustain, and every decision still pays a ~200 µs round-trip. It doesn't fit comfortably. Options: (a) Redis Cluster – shard by key hash so each shard sees a fraction of the keys. (b) Edge-local "quota lease" – each node pulls a batch of tokens from Redis and serves locally until exhausted. (b) scales better, at the cost of weaker global precision (nodes may over-serve by up to one batch each).

Interviewer: How do you add a soft-limit / shadow mode?

Add a mode flag to the rule: enforce vs shadow. In shadow mode, allow() always returns allowed: true but the metrics emit "would-have-denied" with full context. Ship shadow mode first when rolling out a new limit so you can measure the blast radius before enforcement.
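
A sketch of shadow mode as a wrapper, so no strategy code changes – the onWouldDeny callback is an assumed metrics hook:

```typescript
// Shadow-mode wrapper: runs the real decision for metrics, always admits.
class ShadowLimiter implements IRateLimiter {
  constructor(
    private readonly inner: IRateLimiter,
    private readonly onWouldDeny: (key: string, r: Result) => void,
  ) {}

  allow(key: string, cost = 1): Result {
    const r = this.inner.allow(key, cost);
    if (!r.allowed) this.onWouldDeny(key, r); // "would-have-denied" with full context
    return { ...r, allowed: true, retryAfterMs: 0 };
  }
}
```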

Interviewer: Leaky bucket vs token bucket โ€” when do you pick which?

Token bucket when burst is a feature. Leaky bucket when burst is a bug. Stripe wants burst – token bucket. A narrow downstream service (say, a SOAP gateway to a legacy ERP) wants strict smoothing – leaky bucket. The fill-inverted structure is otherwise identical.

Interviewer: How do you test a CAS loop?

Two angles: (a) correctness – a single-threaded unit test where you mock AtomicLong.compareAndSet to return false a fixed number of times and verify the loop retries the right number of times. (b) concurrency – a stress test with N threads hammering the same key at a known rate; assert that the total allowed count matches capacity + refillPerSec * duration within epsilon. Bonus: run under TSan / JCStress.

Interviewer: Redis goes down. Fail-open or fail-closed?

This is a product decision, not an engineering one. Fail-open says "availability beats correctness" – we'd rather serve too many requests than 500 everyone. Choose this for user-facing read paths, content APIs, anything where denial is worse than mild overuse. Fail-closed says "correctness beats availability" – we'd rather refuse service than risk charging a card twice. Choose this for billing, compliance, security-critical limits. Configure per rule. The default for most APIs is fail-open with an alert on the circuit breaker so a human can flip the switch.

Interviewer: How would you add per-endpoint + per-user + per-IP limits?

Decorator chain. Each layer is its own RateLimiter with its own key extractor and rule. A CompositeRateLimiter wraps them with peek-then-commit semantics so we don't double-charge the earlier layers when a later one denies.

Interviewer: The metrics emission is on the hot path โ€” how do you keep it cheap?

Emit to a lock-free ring buffer (like LMAX Disruptor) consumed asynchronously by a metrics thread. On the hot path, the emit is one CAS. All aggregation, tag resolution, and export happens off-path.

Interviewer: A client argues 429s are breaking their integration. How do you give them a graceful window?

Add a "quota lease" mode: on the first deny, return allow for a short cushion (say, 100 ms or 10 requests) while setting a X-RateLimit-Warning header. If they keep abusing, drop to hard-deny. This is the same pattern as GitHub's API, and it converts angry Slack messages into SDK changelog entries.

Interviewer: How do you prevent a memory leak from unbounded key cardinality?

Hard-cap the state map with an LRU eviction policy. Alert if the eviction rate exceeds a threshold – that means legitimate keys are being evicted, which breaks the limiter's guarantee. The alert is as important as the cap.


SDE2 vs SDE3 – How the Bar Rises

| Dimension | SDE2 expectations | SDE3 expectations |
|---|---|---|
| Algorithm articulation | Can implement one algorithm (usually token bucket) correctly. | Articulates the trade-off table across all five. Defends a specific choice for a specific product. Knows the 2× boundary-burst fact for fixed window and the accuracy number for sliding-window counter. |
| Concurrency depth | Uses synchronized or a single lock correctly. | Can discuss lock striping, per-key locks vs CAS loops, when each wins under contention, and how to test both. Mentions the single-threaded JS runtime if relevant. |
| Distributed extensibility | Acknowledges "Redis could do this" without specifics. | Sketches the Lua script, explains why Lua beats WATCH/MULTI, discusses Redis's TIME vs client-provided now, proposes edge-local quota leases for scale. |
| Fail-open vs fail-closed | Doesn't bring it up unless prompted. | Raises it proactively as a product decision. Configurable per rule. Has opinions on defaults (fail-open for reads, fail-closed for billing). |
| Observability | Mentions metrics briefly. | Specifies which metrics (allow rate, deny rate, remaining-quota distribution p50/p99, state-map size, eviction rate, Redis latency). Proposes lock-free emission for the hot path. |
| Testability | Tests the happy path. | Injects the clock. Writes determinism tests. Stress-tests the CAS loop. Thinks about shadow-mode rollout before enforcement. |
| Edge cases | Handles the obvious (new key, empty budget). | Clock rewind, sub-ms rates, cost > capacity, key-cardinality DoS, retry storms, eviction of idle state, multi-tier composition atomicity. |
| Pattern usage | Picks Strategy correctly when prompted. | Justifies Strategy over inheritance/switch. Uses Decorator for layered limits. Treats Singleton as a smell. Discusses when DI beats static registration. |
| Scope discipline | Implements what's asked. | Flags scope boundaries explicitly ("the queue-form leaky bucket is a different system; I'm implementing the meter form"). Separates out-of-scope from out-of-time. |

The deepest signal the interviewer looks for: does the candidate know which knob to turn for which product, and can they defend that knob out loud? Anyone can implement a token bucket with a reference open. The senior tells are the unprompted mentions: "fixed window is up to 2×," "Stripe uses token bucket because burst is a feature," "fail-open is the right default for read paths but not billing." Those sentences are what move the hiring-bar needle.
