HLD: Rate Limiter for an API Gateway

L4 scoping note. A rate limiter sounds simple until you realize it's a distributed-systems problem in disguise. The interviewer wants you to pick ONE algorithm and implement it correctly across multiple gateway nodes -- not dazzle them with all four algorithms. They also want to hear explicit reasoning about fail-open vs fail-closed and what happens when your coordination layer (Redis) hiccups. Keep it grounded: 10M RPS global is big but tractable; don't invent a custom consensus protocol.


Understanding the Problem

What is an API Rate Limiter?

A rate limiter sits at the edge of your service and rejects requests that exceed configured thresholds -- per user, per endpoint, per API key, or any combination. It protects downstream services from abuse, ensures fair sharing, and enforces tiered pricing. Think AWS API Gateway quotas, GitHub's X-RateLimit-Remaining header, or Stripe's per-key limits. The interview value: it's a narrow problem with deep corners (algorithm choice, distributed state, atomicity, failure modes).

Functional Requirements

Core (above the line):

  1. Per-user limits -- each user has a default of 1000 RPS. Premium users get higher limits.
  2. Per-endpoint limits -- sensitive endpoints (e.g., /auth/login) have stricter caps (e.g., 10 RPS per IP).
  3. Return 429 on reject -- when a request is throttled, return HTTP 429 with a Retry-After header telling the client how long to back off.
  4. Dynamic configuration -- limits are configurable without redeploying the gateway.

Below the line (out of scope):

  • DDoS mitigation at the network layer (that's Cloudflare/Akamai territory, not application rate limiting)
  • Usage-based billing aggregation (related but a separate service)
  • Machine-learning-based adaptive limits (nice, but a follow-up)
  • Client-side rate limiting (we control the server side)

Non-Functional Requirements

Core:

  1. Low latency -- rate limit check must add < 5 ms p99. We cannot be the bottleneck on every request.
  2. High availability -- 99.99%. A broken rate limiter can take down every API behind it.
  3. Scale -- 10M RPS globally across the fleet, 1000 RPS default per user, millions of active users.
  4. Correctness -- under normal conditions, limits should be accurate within a small tolerance. Slight over/under-counting across windows is acceptable; letting a user do 100x their limit is not.

Below the line:

  • Exactly-once enforcement across the globe (too expensive; eventual accuracy within a second is fine)
  • Real-time analytics on limit usage (push to a separate analytics pipeline)

L4 sanity check: 10M RPS is large but distributed across many gateway nodes. Each node sees a slice. The only centralized piece is the counter store (Redis). One Redis cluster can do ~1M ops/sec; we'll need a small cluster, not magic.


The Set Up

Core Entities

  • RateLimitRule -- id, scope (user / IP / API key), endpointPattern, limit, windowSeconds
  • Counter -- per (scope, subject, endpoint); tracks current usage in the active window
  • Quota -- per user/tier: "1000 RPS default, 10000 RPS for premium"

The API

The rate limiter itself is mostly internal -- it sits inline on the request path. But there are two external surfaces:

Admin API (configure rules):

PUT /api/ratelimit/rules/{ruleId}
Content-Type: application/json

{
  "scope": "user",
  "endpointPattern": "/api/v1/search",
  "limit": 100,
  "windowSeconds": 60
}

Response: 200 OK

Data-plane (internal) -- the rate limit check:

Not a public REST endpoint, but conceptually:

checkLimit(userId, endpoint) -> { allowed: bool, remaining: int, resetAt: timestamp }
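
A minimal Java shape for this contract -- the type and method names here are illustrative, not an existing library API:

java
// Illustrative sketch: names are assumptions, not a real API.
public record LimitDecision(boolean allowed, long remaining, long resetAt) {}

public interface RateLimiter {
    // One decision per (userId, endpoint) pair, evaluated on every request.
    LimitDecision checkLimit(String userId, String endpoint);
}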

Response on limit exceeded (to the end user):

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 12
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714500000

{
  "error": "rate_limit_exceeded",
  "message": "Request rate exceeded. Try again in 12 seconds.",
  "retryAfter": 12
}

Retry-After can be seconds (integer) or an HTTP date. Use seconds; simpler.


High-Level Design

[Client] -> [Load Balancer] -> [API Gateway node] -> [Rate Limiter middleware] -> [Redis cluster]
                                        |                            |
                                        |                            +--- allow / deny decision
                                        v
                                   [Upstream service]

Flow 1: Request admitted

  1. Client sends request GET /api/v1/search with a bearer token.
  2. Gateway extracts the userId from the token and the endpoint path.
  3. Gateway calls the Rate Limiter middleware with (userId, endpoint).
  4. Middleware builds the counter key rl:user:123:/api/v1/search and executes a Lua script on Redis that atomically checks and increments (see the sketch after this flow).
  5. Redis returns allowed=true, remaining=899, resetAt=....
  6. Gateway forwards the request upstream, adds X-RateLimit-* headers to the response.
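
A sketch of steps 3-5, assuming a Jedis-style Redis client; TOKEN_BUCKET_SCRIPT stands for the Lua script in the algorithm deep dive below:

java
import java.util.List;
import redis.clients.jedis.Jedis;

public class RateLimitMiddleware {
    private static final String TOKEN_BUCKET_SCRIPT = "...";  // the Lua script from the deep dive
    private final Jedis jedis = new Jedis("localhost", 6379); // placeholder address

    // Returns true if the request may be forwarded upstream.
    public boolean check(String userId, String endpoint) {
        String key = "rl:user:" + userId + ":" + endpoint;    // e.g. rl:user:123:/api/v1/search
        @SuppressWarnings("unchecked")
        List<Long> reply = (List<Long>) jedis.eval(TOKEN_BUCKET_SCRIPT,
                List.of(key),                                  // KEYS[1] = bucket key
                List.of("1000", "1000",                        // capacity, refill rate (tokens/sec)
                        String.valueOf(System.currentTimeMillis()), "1"));
        return reply.get(0) == 1;                              // script returns { allowed, remaining, retryAfterMs }
    }
}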

Flow 2: Request throttled

  1. Steps 1-4 same.
  2. Redis returns allowed=false, retryAfter=12.
  3. Gateway short-circuits. Returns 429 with the Retry-After header. Upstream service is never called.

Flow 3: Rule change

  1. Operator calls the Admin API to change limit for a user tier from 1000 to 2000.
  2. Admin service writes to a config store (etcd / PostgreSQL with pub/sub, or a simple Redis hash).
  3. Gateway nodes subscribe to config changes and reload in-memory rule cache within seconds.
  4. No request interruption; rule changes are applied on the next request.

Flow 4: Redis failure

Covered in detail in the deep dive. Short version: the gateway has a configurable fail-open vs fail-closed policy. Default is fail-open for general API traffic, fail-closed for sensitive endpoints (/auth/*, /payment/*).


Potential Deep Dives

1) Algorithm choice: token bucket vs sliding window log vs sliding window counter

You MUST know all three and their trade-offs cold. The interviewer will push on this.

Token Bucket

  • How it works: Each user has a bucket with capacity C tokens. Tokens refill at rate R per second (up to C). Every request consumes 1 token. If bucket is empty, request is rejected.
  • Pros: Allows short bursts (up to C) while holding the long-run rate to R. Simple to implement. State is just two variables (tokens, lastRefillTime) -- see the sketch after this list.
  • Cons: Bursts cut both ways. Where a burst is undesirable (e.g., a login endpoint), use a smaller C or a stricter algorithm.
  • Use case: General API limits where bursts are OK.
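
To make "two state variables" concrete, here is a minimal single-process token bucket (the distributed version is the Lua script further down):

java
// Minimal single-process token bucket: two state variables plus refill math.
public class TokenBucket {
    private final double capacity;   // C: burst size
    private final double refillRate; // R: tokens per second
    private double tokens;
    private long lastRefillMs;

    public TokenBucket(double capacity, double refillRate) {
        this.capacity = capacity;
        this.refillRate = refillRate;
        this.tokens = capacity;      // start full
        this.lastRefillMs = System.currentTimeMillis();
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Refill proportionally to elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefillMs) / 1000.0 * refillRate);
        lastRefillMs = now;
        if (tokens < 1) return false; // bucket empty: reject
        tokens -= 1;
        return true;
    }
}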

Sliding Window Log

  • How it works: Keep a log of every request timestamp. On each request, evict timestamps older than windowSeconds and count the rest; allow if count < limit (see the sketch after this list).
  • Pros: Exact. No edge-case over/under-counting.
  • Cons: Memory is O(limit * users). For 1000 RPS * 1M users, that's 1B timestamps in memory. Expensive.
  • Use case: Low-limit, high-accuracy scenarios (e.g., 5 login attempts per hour per IP).
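
One common implementation keeps the log in a Redis sorted set with the timestamp as score. A sketch assuming a Jedis-style client; production code would wrap the calls in a Lua script so they run atomically:

java
import redis.clients.jedis.Jedis;

public class SlidingWindowLog {
    // Sorted-set member = unique request id, score = arrival time in ms.
    public boolean allow(Jedis jedis, String key, int limit, long windowMs) {
        long now = System.currentTimeMillis();
        jedis.zremrangeByScore(key, 0, now - windowMs);   // evict timestamps outside the window
        if (jedis.zcard(key) >= limit) return false;      // count what's left
        jedis.zadd(key, now, now + "-" + Math.random());  // record this request
        jedis.pexpire(key, windowMs);                     // idle keys clean themselves up
        return true;
    }
}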

Sliding Window Counter (approximate)

  • How it works: Two fixed windows (current and previous). For a request at time t in the current window, compute the weighted count: currentCount + prevCount * (1 - positionInCurrentWindow). Allow if < limit (arithmetic sketch after this list).
  • Pros: O(1) memory per key. Smooths over window boundaries.
  • Cons: Approximate. Can over-admit or under-admit by a few percent.
  • Use case: The sweet spot for most per-user API limits.
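
The weighted count is plain arithmetic over the two counters; a sketch:

java
// Sliding window counter check (sketch). prevCount = requests in the previous
// fixed window, currCount = requests so far in the current one.
static boolean allow(long prevCount, long currCount, long limit, long nowMs, long windowMs) {
    long windowStart = (nowMs / windowMs) * windowMs;             // start of the current window
    double position = (nowMs - windowStart) / (double) windowMs;  // 0.0 .. 1.0 into the window
    return currCount + prevCount * (1.0 - position) < limit;
}

For example, 30 seconds into a 60-second window with prevCount = 40 and currCount = 80, the weighted count is 80 + 40 * 0.5 = 100.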

Fixed Window Counter (the naive one -- often called out as "Bad")

  • Bad Solution: Count requests in each fixed 60-second window. Reset at the boundary.
  • Challenges: Boundary problem. A user can fire 1000 requests at 12:00:59 and another 1000 at 12:01:00 -- 2x the rate within one second. Unacceptable for any serious limit.

Recommendation for L4: Token Bucket

It's simple, efficient, and handles bursts gracefully. It maps cleanly to Redis with a Lua script. Sliding Window Counter is a good second choice if the interviewer pushes for "no bursts allowed."

lua
-- Token bucket rate limiter: atomic in Redis
-- KEYS[1] = bucket key (e.g., "rl:user:123")
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = now (unix ms)
-- ARGV[4] = requested tokens (usually 1)
-- Returns: { allowed, remaining, retryAfterMs }

local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])  -- per second
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'lastRefill')
local tokens = tonumber(bucket[1]) or capacity
local lastRefill = tonumber(bucket[2]) or now

-- Refill based on elapsed time
local elapsedMs = now - lastRefill
local refill = (elapsedMs / 1000) * refillRate
tokens = math.min(capacity, tokens + refill)

local allowed = 0
local retryAfterMs = 0
if tokens >= requested then
  tokens = tokens - requested
  allowed = 1
else
  retryAfterMs = math.ceil(((requested - tokens) / refillRate) * 1000)
end

redis.call('HMSET', KEYS[1], 'tokens', tokens, 'lastRefill', now)
redis.call('EXPIRE', KEYS[1], 3600)  -- stale buckets get cleaned up

return { allowed, math.floor(tokens), retryAfterMs }

Why a Lua script? Atomicity. Without it, you'd have a TOCTOU race -- two concurrent requests read the same token count and both decrement, allowing double the rate. Redis Lua scripts run atomically under a single-threaded execution model.
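
For contrast, here is the naive client-side read-then-write that the script replaces (Jedis-style calls; forwardUpstream() is a hypothetical stand-in for admitting the request):

java
// RACY -- two gateway nodes can both read tokens == 1, both write 0,
// and both admit a request: 2x the intended rate.
long tokens = Long.parseLong(jedis.hget(key, "tokens"));   // (1) read
if (tokens >= 1) {
    // ...a concurrent request on another node reads the same value here...
    jedis.hset(key, "tokens", String.valueOf(tokens - 1)); // (2) write
    forwardUpstream();                                     // stand-in for "admit"
}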

2) Distributed coordination -- consistent view across N gateway nodes

Bad Solution: Per-node in-memory counters

  • Approach: Each gateway node tracks its own counts locally. A user hitting 10 nodes can do 10x the limit.
  • Challenges: Completely breaks the limit contract. Don't do this.

Good Solution: Centralized Redis

  • Approach: Every gateway node calls Redis for every rate limit check. One shared source of truth.
  • Challenges: Redis is a critical dependency. Latency matters -- Redis RTT is typically < 1 ms in-region but adds up. 10M RPS means 10M Redis ops/sec -- beyond a single Redis node. Shard the Redis cluster by hash of the counter key.

Back-of-envelope: A single Redis node does ~100K ops/sec with Lua. For 10M RPS, we need ~100 Redis shards. More realistically, because of locality (most users hit the same gateway region), per-region Redis clusters of ~10 shards each suffice.

Great Solution: Two-tier -- local pre-filter + global Redis

  • Approach: Each gateway node keeps a local token bucket that refills from the global Redis bucket periodically. Most requests are admitted or rejected locally; only occasional "refill" calls hit Redis (see the sketch after this list).
  • Example: local bucket of 100 tokens. When empty, request 100 more tokens from Redis (which atomically decrements the global bucket). If the global bucket is empty, we reject.
  • Pros: Dramatically reduces Redis load. Reduces latency on the hot path.
  • Cons: Harder to reason about. If a gateway node crashes holding 100 tokens, those are briefly "lost." For strict limits, over-allocate a small buffer; for lax limits, accept it.
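
A sketch of the two-tier idea. BATCH, the inline refill script, and the Jedis-style client are assumptions for illustration; a real refill script would also apply the token-bucket refill math to the global bucket:

java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import redis.clients.jedis.Jedis;

public class TwoTierLimiter {
    private static final long BATCH = 100;  // tokens fetched per Redis round trip
    private final AtomicLong localTokens = new AtomicLong(0);
    private final Jedis jedis;
    private final String globalKey;

    public TwoTierLimiter(Jedis jedis, String globalKey) {
        this.jedis = jedis;
        this.globalKey = globalKey;
    }

    public boolean tryAcquire() {
        // Hot path: spend a local token without touching Redis.
        if (localTokens.getAndUpdate(t -> t > 0 ? t - 1 : 0) > 0) return true;
        // Local pool empty: atomically take up to BATCH tokens from the global bucket.
        Long granted = (Long) jedis.eval(
                "local t = tonumber(redis.call('HGET', KEYS[1], 'tokens') or '0') " +
                "local take = math.min(t, tonumber(ARGV[1])) " +
                "redis.call('HINCRBYFLOAT', KEYS[1], 'tokens', -take) " +
                "return take",
                List.of(globalKey), List.of(String.valueOf(BATCH)));
        if (granted == null || granted == 0) return false; // global bucket exhausted
        localTokens.addAndGet(granted - 1);                // bank the rest, spend one now
        return true;
    }
}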

L4 call: the Good Solution (centralized Redis with Lua) is the right answer for an L4 interview. Mention the tiered local cache as an "if we needed to push further, here's what I'd do" extension.

3) Fail-open vs fail-closed when Redis is down

This is where interviewers test your product thinking.

The Question

If Redis is unreachable, what does the gateway do with an incoming request?

  • Fail-open: Allow the request. Users don't notice the outage. Risk: a malicious client can blast the API while Redis is down.
  • Fail-closed: Reject with 429 or 503. Limits are always respected. Risk: a Redis outage cascades into a full API outage, which is typically much worse than temporarily lax rate limits.

A pragmatic policy differentiates by endpoint class:

  • General traffic: fail-open. Log the event so SRE knows.
  • Auth / payment / account-modification endpoints: fail-closed. Return 503 with Retry-After. Better to refuse login attempts than to let a credential-stuffing attack skate through during a Redis outage.
  • Circuit breaker: if Redis returns errors for N consecutive requests in M seconds, trip open and skip the Redis call entirely, applying the fail policy above; probe periodically to see if Redis has recovered. This prevents every request from waiting out a Redis timeout.

The decision logic, sketched:
java
boolean allowRequest(String userId, String endpoint) {
    if (redisCircuitBreaker.isOpen()) {
        // Circuit open: critical endpoints fail closed; everything else follows policy.
        if (isCriticalEndpoint(endpoint)) return false;
        return failMode == FAIL_OPEN;
    }
    try {
        var result = redis.eval(TOKEN_BUCKET_SCRIPT, ...);
        return result.allowed;
    } catch (RedisTimeoutException e) {
        redisCircuitBreaker.recordFailure();
        return !isCriticalEndpoint(endpoint) && failMode == FAIL_OPEN;
    }
}

Explicitly articulate these trade-offs in the interview -- L4 candidates who can reason about fail modes stand out.

4) Making rate limit changes dynamic

Bad Solution: Redeploy gateway with new config

Minutes of latency to change a limit. Unacceptable for ops.

Good Solution: Central config store with polling

Gateways poll a config service (/api/ratelimit/rules) every 10 seconds. Rule changes propagate within ~15 seconds. Simple, reliable.
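
A polling sketch; RuleCache and fetchRules() are illustrative names, and RateLimitRule mirrors the entity from the core-entities list:

java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class RuleCache {
    // Mirrors the RateLimitRule entity from the core-entities list.
    public record RateLimitRule(String scope, String endpointPattern, int limit, int windowSeconds) {}

    private final AtomicReference<Map<String, RateLimitRule>> rules = new AtomicReference<>(Map.of());

    public void start() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try {
                rules.set(fetchRules());  // full refresh from GET /api/ratelimit/rules
            } catch (Exception e) {
                // Config service unreachable: keep serving the last known rules.
            }
        }, 0, 10, TimeUnit.SECONDS);
    }

    public Map<String, RateLimitRule> current() { return rules.get(); }

    // Placeholder for the HTTP call to the config service.
    private Map<String, RateLimitRule> fetchRules() { return Map.of(); }
}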

Great Solution: Pub/sub config distribution

  • Admin writes to config DB, which publishes a rule_updated event on Redis pub/sub or Kafka.
  • Each gateway node subscribes and reloads its rule cache within milliseconds.
  • Fall back to periodic full refresh (every 5 minutes) in case a pub/sub message is missed.

For L4, "polling with a short interval" is perfectly fine. Call out pub/sub as an upgrade path.

5) Tiered limits (premium vs free) and multi-dimensional rules

A single request might be subject to multiple rules simultaneously:

  • User limit: 1000 RPS
  • Endpoint limit: 100 RPS on /api/v1/search
  • Global limit: 10M RPS total

The middleware checks each applicable rule. Reject if ANY rule is violated. Atomically increment all counters only if all checks pass (use a single Lua script that does multiple HMGET + HMSET under one atomic call).

lua
-- Multi-rule token bucket: increment every counter only if all rules allow.
-- KEYS    = counter keys, one per applicable rule
-- ARGV[1] = now (unix ms); then per rule i: ARGV[2i] = capacity, ARGV[2i+1] = refill rate (tokens/sec)

local now = tonumber(ARGV[1])
local newTokens = {}

-- Pass 1: evaluate every bucket; bail out on the first violation
for i = 1, #KEYS do
  local capacity = tonumber(ARGV[2 * i])
  local refillRate = tonumber(ARGV[2 * i + 1])
  local bucket = redis.call('HMGET', KEYS[i], 'tokens', 'lastRefill')
  local tokens = tonumber(bucket[1]) or capacity
  local lastRefill = tonumber(bucket[2]) or now
  tokens = math.min(capacity, tokens + ((now - lastRefill) / 1000) * refillRate)
  if tokens < 1 then
    local retryAfterMs = math.ceil(((1 - tokens) / refillRate) * 1000)
    return { 0, retryAfterMs }  -- reject; no counters were modified
  end
  newTokens[i] = tokens - 1
end

-- Pass 2: all rules passed, commit every decrement atomically
for i = 1, #KEYS do
  redis.call('HMSET', KEYS[i], 'tokens', newTokens[i], 'lastRefill', now)
  redis.call('EXPIRE', KEYS[i], 3600)
end
return { 1, 0 }

6) The 429 response and Retry-After

Standards matter. The 429 status comes from RFC 6585 and Retry-After from RFC 7231; the X-RateLimit-* headers are a widely used convention (popularized by GitHub and others) rather than part of those RFCs:

  • Retry-After: 12 -- retry after 12 seconds.
  • X-RateLimit-Limit -- the ceiling.
  • X-RateLimit-Remaining -- tokens left in the current window.
  • X-RateLimit-Reset -- unix timestamp when the bucket refills.

Good clients obey Retry-After and back off. Misbehaving clients ignore it -- for those, you may want a second tier of protection (e.g., IP bans after repeated violations).

7) What NOT to build at L4

Interviewers get nervous when L4 candidates overreach. Be disciplined:

  • Don't design CRDT-based rate limiting across regions (too exotic for this scope).
  • Don't propose a custom Raft cluster for the counter store -- just use Redis and accept the single-region blast radius.
  • Don't design real-time ML for adaptive limits unless the problem asks.
  • Don't design a full observability stack; mention "we emit metrics to Prometheus" and move on.

What is Expected at Each Level

L3 / Mid-level

  • Know that rate limiting exists and that it needs a shared counter store. Might propose fixed-window counters. Hits the 429 status code. Probably won't proactively address fail-open vs fail-closed or atomicity issues.

L4

  • Pick token bucket (or sliding window counter) and justify the choice against the others.
  • Write the Lua script -- or at least clearly describe why atomicity is required and how Redis gives it.
  • Discuss fail-open vs fail-closed with nuance (different policies per endpoint class).
  • Explicit 429 response with Retry-After.
  • Back-of-envelope on Redis shard count for 10M RPS.
  • Dynamic rule updates without redeploy.

L5 / Senior

  • Tiered local/global bucket to reduce Redis load on the hot path.
  • Multi-region consistency: mention cross-region replication lag and why it's usually fine for rate limiting (eventual consistency, small over-allocation).
  • Operational concerns: capacity planning for Redis, monitoring counter drift, alerting on circuit-breaker trips.
  • Abuse patterns beyond simple rate limiting: credential stuffing, velocity checks on account creation, automated bot detection.
  • Interaction with cost/billing systems: how do usage records flow to invoicing?
