# HLD: Rate Limiter for an API Gateway
L4 scoping note. A rate limiter sounds simple until you realize it's a distributed-systems problem in disguise. The interviewer wants you to pick ONE algorithm and implement it correctly across multiple gateway nodes -- not dazzle them with all four algorithms. They also want to hear explicit reasoning about fail-open vs fail-closed and what happens when your coordination layer (Redis) hiccups. Keep it grounded: 10M RPS global is big but tractable; don't invent a custom consensus protocol.
## Understanding the Problem
### What is an API Rate Limiter?
A rate limiter sits at the edge of your service and rejects requests that exceed configured thresholds -- per user, per endpoint, per API key, or any combination. It protects downstream services from abuse, ensures fair sharing, and enforces tiered pricing. Think AWS API Gateway quotas, GitHub's `X-RateLimit-Remaining` header, or Stripe's per-key limits. The interview value: it's a narrow problem with deep corners (algorithm choice, distributed state, atomicity, failure modes).
### Functional Requirements
Core (above the line):
- Per-user limits -- each user has a default of 1000 RPS. Premium users get higher limits.
- Per-endpoint limits -- sensitive endpoints (e.g., `/auth/login`) have stricter caps (e.g., 10 RPS per IP).
- Return 429 on reject -- when a request is throttled, return HTTP 429 with a `Retry-After` header telling the client how long to back off.
- Dynamic configuration -- limits are configurable without redeploying the gateway.
Below the line (out of scope):
- DDoS mitigation at the network layer (that's Cloudflare/Akamai territory, not application rate limiting)
- Usage-based billing aggregation (related but a separate service)
- Machine-learning-based adaptive limits (nice, but a follow-up)
- Client-side rate limiting (we control the server side)
### Non-Functional Requirements
Core:
- Low latency -- rate limit check must add < 5 ms p99. We cannot be the bottleneck on every request.
- High availability -- 99.99%. A broken rate limiter can take down every API behind it.
- Scale -- 10M RPS globally across the fleet, 1000 RPS default per user, millions of active users.
- Correctness -- under normal conditions, limits should be accurate within a small tolerance. Slight over/under-counting across windows is acceptable; letting a user do 100x their limit is not.
Below the line:
- Perfectly exact enforcement across the globe (too expensive; eventual accuracy within a second is fine)
- Real-time analytics on limit usage (push to a separate analytics pipeline)
L4 sanity check: 10M RPS is large but distributed across many gateway nodes. Each node sees a slice. The only centralized piece is the counter store (Redis). One Redis cluster can do ~1M ops/sec; we'll need a small cluster, not magic.
## The Set Up
### Core Entities
| Entity | Description |
|---|---|
| RateLimitRule | id, scope (user / IP / API key), endpointPattern, limit, windowSeconds |
| Counter | Per (scope, subject, endpoint) -- tracks current usage in the active window |
| Quota | Per user/tier: "1000 RPS default, 10000 RPS for premium" |
### The API
The rate limiter itself is mostly internal -- it sits inline on the request path. But there are two external surfaces:
Admin API (configure rules):
```http
PUT /api/ratelimit/rules/{ruleId}
Content-Type: application/json

{
  "scope": "user",
  "endpointPattern": "/api/v1/search",
  "limit": 100,
  "windowSeconds": 60
}
```
Response: `200 OK`

Data-plane (internal) -- the rate limit check:
Not a public REST endpoint, but conceptually:
```
checkLimit(userId, endpoint) -> { allowed: bool, remaining: int, resetAt: timestamp }
```
Response on limit exceeded (to the end user):
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 12
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714500000

{
  "error": "rate_limit_exceeded",
  "message": "Request rate exceeded. Try again in 12 seconds.",
  "retryAfter": 12
}
```
`Retry-After` can be seconds (integer) or an HTTP date. Use seconds; simpler.
## High-Level Design
```
[Client] -> [Load Balancer] -> [API Gateway node] -> [Rate Limiter middleware] -> [Redis cluster]
                                       |                         |
                                       |                         +--- allow / deny decision
                                       v
                              [Upstream service]
```

### Flow 1: Request admitted
1. Client sends request `GET /api/v1/search` with a bearer token.
2. Gateway extracts the `userId` from the token and the endpoint path.
3. Gateway calls the Rate Limiter middleware with `(userId, endpoint)`.
4. Middleware builds the counter key `rl:user:123:/api/v1/search` and executes a Lua script on Redis that atomically checks and increments.
5. Redis returns `allowed=true, remaining=899, resetAt=...`.
6. Gateway forwards the request upstream and adds `X-RateLimit-*` headers to the response.
### Flow 2: Request throttled
1. Steps 1-4 same.
2. Redis returns `allowed=false, retryAfter=12`.
3. Gateway short-circuits and returns 429 with the `Retry-After` header. The upstream service is never called.
### Flow 3: Rule change
1. Operator calls the Admin API to change `limit` for a user tier from 1000 to 2000.
2. Admin service writes to a config store (etcd / PostgreSQL with pub/sub, or a simple Redis hash).
3. Gateway nodes subscribe to config changes and reload the in-memory rule cache within seconds.
4. No request interruption; rule changes are applied on the next request.
### Flow 4: Redis failure
Covered in detail in the deep dive. Short version: the gateway has a configurable fail-open vs fail-closed policy. Default is fail-open for general API traffic, fail-closed for sensitive endpoints (`/auth/*`, `/payment/*`).
## Potential Deep Dives
### 1) Algorithm choice: token bucket vs sliding window log vs sliding window counter
You MUST know all three main algorithms (and why the naive fixed window fails) and their trade-offs cold. The interviewer will push on this.
#### Token Bucket
- How it works: Each user has a bucket with capacity `C` tokens. Tokens refill at rate `R` per second (up to `C`). Every request consumes 1 token. If the bucket is empty, the request is rejected (see the sketch below).
- Pros: Allows short bursts (up to `C`), smooth rate overall. Simple to implement. Memoryless beyond the two state variables `(tokens, lastRefillTime)`.
- Cons: Allows bursts. If a burst is undesirable (e.g., the login endpoint), use a smaller `C`.
- Use case: General API limits where bursts are OK.
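To make the two state variables concrete, here's a minimal single-node token bucket in Java -- a sketch only, since the real check must run atomically in Redis (the Lua script under the recommendation below); `synchronized` stands in for that atomicity on one node.

```java
// Minimal single-node token bucket: two state variables, refill on demand.
public final class TokenBucket {
    private final double capacity;   // C: max tokens (burst size)
    private final double refillRate; // R: tokens per second
    private double tokens;
    private long lastRefillMs;

    public TokenBucket(double capacity, double refillRate) {
        this.capacity = capacity;
        this.refillRate = refillRate;
        this.tokens = capacity;
        this.lastRefillMs = System.currentTimeMillis();
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Refill proportionally to elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + ((now - lastRefillMs) / 1000.0) * refillRate);
        lastRefillMs = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```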
#### Sliding Window Log
- How it works: Keep a log of every request timestamp. On each request, evict timestamps older than `windowSeconds` and count the rest. Allow if count < limit (see the sketch below).
- Pros: Exact. No edge-case over/under-counting.
- Cons: Memory is `O(limit * users)`. For 1000 RPS * 1M users, that's 1B timestamps in memory. Expensive.
- Use case: Low-limit, high-accuracy scenarios (e.g., 5 login attempts per hour per IP).
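A sketch of the log-based check on a single node, using an in-memory deque to show the algorithm; in Redis this is typically a sorted set per key, trimmed with `ZREMRANGEBYSCORE` before counting.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding window log: exact, but stores one timestamp per admitted request.
public final class SlidingWindowLog {
    private final int limit;
    private final long windowMs;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindowLog(int limit, long windowMs) {
        this.limit = limit;
        this.windowMs = windowMs;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Evict timestamps that have fallen out of the window.
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= now - windowMs) {
            timestamps.pollFirst();
        }
        if (timestamps.size() < limit) {
            timestamps.addLast(now);
            return true;
        }
        return false;
    }
}
```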
#### Sliding Window Counter (approximate)
- How it works: Two fixed windows (current and previous). For a request at time `t` in the current window, compute the weighted count: `currentCount + prevCount * (1 - positionInCurrentWindow)`. Allow if < limit (see the sketch below).
- Pros: O(1) memory per key. Smooths over window boundaries.
- Cons: Approximate. Can over-admit or under-admit by a few percent.
- Use case: The sweet spot for most per-user API limits.
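The weighting in isolation -- a sketch with illustrative parameter names. Worked example: 30 s into a 60 s window with `prevCount=800` and `currentCount=500`, the weighted count is 500 + 800 * 0.5 = 900, under a limit of 1000, so the request is admitted.

```java
public final class SlidingWindowCounter {
    // Approximate the sliding count by blending the previous fixed window's
    // count, weighted by how much of it still overlaps the sliding window.
    public static boolean allow(long nowMs, long windowMs, long windowStartMs,
                                long currentCount, long prevCount, long limit) {
        double position = (nowMs - windowStartMs) / (double) windowMs; // fraction of current window elapsed
        double weighted = currentCount + prevCount * (1.0 - position);
        return weighted < limit;
    }
}
```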
#### Fixed Window Counter (the naive one -- often called out as "Bad")
- Bad Solution: Count requests in each fixed 60-second window. Reset at the boundary.
- Challenges: Boundary problem. A user can fire 1000 requests at 12:00:59 and another 1000 at 12:01:00 -- 2x the rate within one second. Unacceptable for any serious limit.
#### Recommendation for L4: Token Bucket
It's simple, efficient, and handles bursts gracefully. It maps cleanly to Redis with a Lua script. Sliding Window Counter is a good second choice if the interviewer pushes for "no bursts allowed."
```lua
-- Token bucket rate limiter: atomic in Redis
-- KEYS[1] = bucket key (e.g., "rl:user:123")
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = now (unix ms)
-- ARGV[4] = requested tokens (usually 1)
-- Returns: { allowed, remaining, retryAfterMs }
local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2]) -- per second
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'lastRefill')
local tokens = tonumber(bucket[1]) or capacity
local lastRefill = tonumber(bucket[2]) or now

-- Refill based on elapsed time
local elapsedMs = now - lastRefill
local refill = (elapsedMs / 1000) * refillRate
tokens = math.min(capacity, tokens + refill)

local allowed = 0
local retryAfterMs = 0
if tokens >= requested then
  tokens = tokens - requested
  allowed = 1
else
  retryAfterMs = math.ceil(((requested - tokens) / refillRate) * 1000)
end

redis.call('HMSET', KEYS[1], 'tokens', tokens, 'lastRefill', now)
redis.call('EXPIRE', KEYS[1], 3600) -- stale buckets get cleaned up
return { allowed, math.floor(tokens), retryAfterMs }
```
Why a Lua script? Atomicity. Without it, you'd have a TOCTOU race -- two concurrent requests read the same token count and both decrement, allowing double the rate. Redis Lua scripts run atomically under a single-threaded execution model.
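Invoking the script from gateway middleware might look like this -- a sketch assuming the Jedis client; the key format and parameter plumbing are illustrative. In production you'd register the script once with `SCRIPT LOAD` and call it via `EVALSHA` so the body isn't resent on every request.

```java
import redis.clients.jedis.Jedis;
import java.util.List;

public final class RateLimitClient {
    private final Jedis jedis;
    private final String tokenBucketScript; // the Lua script above

    public RateLimitClient(Jedis jedis, String tokenBucketScript) {
        this.jedis = jedis;
        this.tokenBucketScript = tokenBucketScript;
    }

    // Returns true if the request is admitted. A Lua table of numbers comes
    // back from Jedis as a List<Long>: { allowed, remaining, retryAfterMs }.
    @SuppressWarnings("unchecked")
    public boolean check(String userId, long capacity, long refillRatePerSec) {
        List<Long> result = (List<Long>) jedis.eval(
            tokenBucketScript,
            List.of("rl:user:" + userId),                       // KEYS[1]
            List.of(String.valueOf(capacity),                   // ARGV[1]
                    String.valueOf(refillRatePerSec),           // ARGV[2]
                    String.valueOf(System.currentTimeMillis()), // ARGV[3]
                    "1"));                                      // ARGV[4]
        return result.get(0) == 1L;
    }
}
```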
### 2) Distributed coordination -- consistent view across N gateway nodes
#### Bad Solution: Per-node in-memory counters
- Approach: Each gateway node tracks its own counts locally. A user hitting 10 nodes can do 10x the limit.
- Challenges: Completely breaks the limit contract. Don't do this.
#### Good Solution: Centralized Redis
- Approach: Every gateway node calls Redis for every rate limit check. One shared source of truth.
- Challenges: Redis is a critical dependency. Latency matters -- Redis RTT is typically < 1 ms in-region but adds up. 10M RPS means 10M Redis ops/sec -- beyond a single Redis node. Shard the Redis cluster by hash of the counter key.
Back-of-envelope: A single Redis node does ~100K ops/sec with Lua. For 10M RPS, we need ~100 Redis shards. More realistically, because of locality (most users hit the same gateway region), per-region Redis clusters of ~10 shards each suffice.
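Shard selection is just a stable hash of the counter key. A sketch (Redis Cluster does this for you with CRC16-mod-16384 keyslots, but the idea is the same):

```java
// Route a counter key to one of N Redis shards by stable hashing.
static int shardFor(String counterKey, int numShards) {
    return Math.floorMod(counterKey.hashCode(), numShards);
}
```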
#### Great Solution: Two-tier -- local pre-filter + global Redis
- Approach: Each gateway node keeps a local token bucket that refills from the global Redis bucket periodically. Most requests are admitted/rejected locally; only occasional "refill" calls hit Redis.
- Example: local bucket of 100 tokens. When empty, request 100 more tokens from Redis (which atomically decrements the global bucket). If the global bucket is empty, we reject. (See the sketch after this list.)
- Pros: Dramatically reduces Redis load. Reduces latency on the hot path.
- Cons: Harder to reason about. If a gateway node crashes holding 100 tokens, those are briefly "lost." For strict limits, over-allocate a small buffer; for lax limits, accept it.
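A sketch of the token-lease idea -- `fetchLeaseFromRedis()` is a placeholder for a Lua script that atomically grants up to `BATCH` tokens from the global bucket:

```java
import java.util.concurrent.atomic.AtomicLong;

public final class TwoTierLimiter {
    private final AtomicLong localTokens = new AtomicLong(0);
    private static final long BATCH = 100; // tokens leased from Redis at a time

    public boolean tryAcquire(String key) {
        while (true) {
            long current = localTokens.get();
            if (current > 0) {
                if (localTokens.compareAndSet(current, current - 1)) return true;
                continue; // lost a race with another thread; retry
            }
            // Local lease exhausted: hit the global bucket (the rare, slow path).
            // Note: concurrent threads may over-lease slightly; acceptable here.
            long granted = fetchLeaseFromRedis(key, BATCH);
            if (granted <= 0) return false;     // global bucket empty -> reject
            localTokens.addAndGet(granted - 1); // consume one, bank the rest
            return true;
        }
    }

    // Placeholder: a Lua script would atomically decrement the global bucket
    // by up to `want` tokens and return how many were actually granted.
    private long fetchLeaseFromRedis(String key, long want) {
        return want;
    }
}
```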
L4 call: the Good Solution (centralized Redis with Lua) is the right answer for an L4 interview. Mention the tiered local cache as an "if we needed to push further, here's what I'd do."
### 3) Fail-open vs fail-closed when Redis is down
This is where interviewers test your product thinking.
#### The Question
If Redis is unreachable, what does the gateway do with an incoming request?
- Fail-open: Allow the request. Users don't notice the outage. Risk: a malicious client can blast the API while Redis is down.
- Fail-closed: Reject with 429 or 503. Rate limits are respected. Risk: a Redis outage cascades into a full API outage, which is typically much worse than temporarily lax rate limits.
#### Recommended Default: Fail-open, with exceptions
- General traffic: fail-open. Log the event so SRE knows.
- Auth / payment / account-modification endpoints: fail-closed. Return 503 with `Retry-After`. Better to refuse login attempts than to allow a credential-stuffing attack to skate through during a Redis outage.
- Circuit breaker: if Redis returns errors for N consecutive requests in M seconds, trip open, skip the Redis call entirely, and apply the fail policy instead. Retry periodically to see if Redis recovered. This prevents each request from waiting for a Redis timeout.
```java
boolean allowRequest(String userId, String endpoint) {
    if (redisCircuitBreaker.isOpen()) {
        // Redis is known-bad: apply the fail policy without waiting on a timeout.
        return failPolicyAllows(endpoint);
    }
    try {
        var result = redis.eval(TOKEN_BUCKET_SCRIPT, ...);
        return result.allowed;
    } catch (RedisTimeoutException e) {
        redisCircuitBreaker.recordFailure();
        return failPolicyAllows(endpoint);
    }
}

boolean failPolicyAllows(String endpoint) {
    if (isCriticalEndpoint(endpoint)) return false; // fail-closed for /auth/*, /payment/*
    return failMode == FAIL_OPEN;                   // fail-open for everything else
}
```
Explicitly articulate these trade-offs in the interview -- L4 candidates who can reason about fail modes stand out.
### 4) Making rate limit changes dynamic
#### Bad Solution: Redeploy gateway with new config
Minutes of latency to change a limit. Unacceptable for ops.
#### Good Solution: Central config store with polling
Gateways poll a config service (`/api/ratelimit/rules`) every 10 seconds, as sketched below. Rule changes propagate within ~15 seconds. Simple, reliable.
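A sketch of the polling loop, assuming Java 11's `HttpClient`; the config-service URL and the raw-JSON cache are placeholders for whatever rule representation the gateway actually uses:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public final class RuleCacheRefresher {
    private final AtomicReference<String> rulesJson = new AtomicReference<>("{}");
    private final HttpClient http = HttpClient.newHttpClient();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(this::refresh, 0, 10, TimeUnit.SECONDS);
    }

    private void refresh() {
        try {
            HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://config-service/api/ratelimit/rules")).GET().build();
            HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() == 200) {
                // Swap atomically; the next rate limit check reads the new rules.
                rulesJson.set(resp.body());
            }
        } catch (Exception e) {
            // Keep serving the last known-good rules if the config service is down.
        }
    }
}
```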
#### Great Solution: Pub/sub config distribution
- Admin writes to the config DB, which publishes a `rule_updated` event on Redis pub/sub or Kafka.
- Each gateway node subscribes and reloads its rule cache within milliseconds.
- Fall back to a periodic full refresh (every 5 minutes) in case a pub/sub message is missed.
For L4, "polling with a short interval" is perfectly fine. Call out pub/sub as an upgrade path.
### 5) Tiered limits (premium vs free) and multi-dimensional rules
A single request might be subject to multiple rules simultaneously:
- User limit: 1000 RPS
- Endpoint limit: 100 RPS on `/api/v1/search`
- Global limit: 10M RPS total
The middleware checks each applicable rule. Reject if ANY rule is violated. Atomically increment all counters only if all checks pass (use a single Lua script that does multiple HMGET + HMSET under one atomic call).
```lua
-- Multi-rule check: only increment all counters if all rules allow
-- KEYS[i] = counter key for rule i
-- ARGV[1] = now (unix ms); ARGV[2i] / ARGV[2i+1] = capacity / refill rate for rule i
local now = tonumber(ARGV[1])
local newTokens = {}
local maxRetryMs = 0
for i = 1, #KEYS do
  local capacity = tonumber(ARGV[2 * i])
  local rate = tonumber(ARGV[2 * i + 1])
  local bucket = redis.call('HMGET', KEYS[i], 'tokens', 'lastRefill')
  local tokens = tonumber(bucket[1]) or capacity
  local last = tonumber(bucket[2]) or now
  tokens = math.min(capacity, tokens + ((now - last) / 1000) * rate)
  if tokens < 1 then
    maxRetryMs = math.max(maxRetryMs, math.ceil(((1 - tokens) / rate) * 1000))
  end
  newTokens[i] = tokens - 1
end
if maxRetryMs > 0 then
  return { 0, maxRetryMs } -- reject; no increments applied
end
-- All allowed: now increment all
for i = 1, #KEYS do
  redis.call('HMSET', KEYS[i], 'tokens', newTokens[i], 'lastRefill', now)
end
return { 1, 0 }
```

### 6) The 429 response and Retry-After
Standards matter. HTTP 429 comes from RFC 6585 and `Retry-After` from RFC 7231; the `X-RateLimit-*` headers are a widely used convention rather than an RFC standard:
- `Retry-After: 12` -- retry after 12 seconds.
- `X-RateLimit-Limit` -- the ceiling.
- `X-RateLimit-Remaining` -- tokens left in the current window.
- `X-RateLimit-Reset` -- unix timestamp when the bucket refills.
Good clients obey `Retry-After` and back off. Misbehaving clients ignore it -- for those, you may want a second tier of protection (e.g., IP bans after repeated violations).
### 7) What NOT to build at L4
Interviewers get nervous when L4 candidates overreach. Be disciplined:
- Don't design CRDT-based rate limiting across regions (too exotic for this scope).
- Don't propose a custom Raft cluster for the counter store -- just use Redis and accept the single-region blast radius.
- Don't design real-time ML for adaptive limits unless the problem asks.
- Don't design a full observability stack; mention "we emit metrics to Prometheus" and move on.
## What is Expected at Each Level
### L3 / Mid-level
- Know that rate limiting exists and that it needs a shared counter store. Might propose fixed-window counters. Hits the 429 status code. Probably won't proactively address fail-open vs fail-closed or atomicity issues.
### L4
- Pick token bucket (or sliding window counter) and justify the choice against the others.
- Write the Lua script -- or at least clearly describe why atomicity is required and how Redis gives it.
- Discuss fail-open vs fail-closed with nuance (different policies per endpoint class).
- Explicit 429 response with `Retry-After`.
- Back-of-envelope on Redis shard count for 10M RPS.
- Dynamic rule updates without redeploy.
### L5 / Senior
- Tiered local/global bucket to reduce Redis load on the hot path.
- Multi-region consistency: mention cross-region replication lag and why it's usually fine for rate limiting (eventual consistency, small over-allocation).
- Operational concerns: capacity planning for Redis, monitoring counter drift, alerting on circuit-breaker trips.
- Abuse patterns beyond simple rate limiting: credential stuffing, velocity checks on account creation, automated bot detection.
- Interaction with cost/billing systems: how do usage records flow to invoicing?