HLD: Rate Limiter for an API Gateway

L4 scoping note. A rate limiter sounds simple until you realize it's a distributed-systems problem in disguise. The interviewer wants you to pick ONE algorithm and implement it correctly across multiple gateway nodes -- not dazzle them with all four algorithms. They also want to hear explicit reasoning about fail-open vs fail-closed and what happens when your coordination layer (Redis) hiccups. Keep it grounded: 10M RPS global is big but tractable; don't invent a custom consensus protocol.


Understanding the Problem

What is an API Rate Limiter?

A rate limiter sits at the edge of your service and rejects requests that exceed configured thresholds -- per user, per endpoint, per API key, or any combination. It protects downstream services from abuse, ensures fair sharing, and enforces tiered pricing. Think AWS API Gateway quotas, GitHub's X-RateLimit-Remaining header, or Stripe's per-key limits. The interview value: it's a narrow problem with deep corners (algorithm choice, distributed state, atomicity, failure modes).

Functional Requirements

Core (above the line):

  1. Per-user limits -- each user has a default of 1000 RPS. Premium users get higher limits.
  2. Per-endpoint limits -- sensitive endpoints (e.g., /auth/login) have stricter caps (e.g., 10 RPS per IP).
  3. Return 429 on reject -- when a request is throttled, return HTTP 429 with a Retry-After header telling the client how long to back off.
  4. Dynamic configuration -- limits are configurable without redeploying the gateway.

Below the line (out of scope):

  • DDoS mitigation at the network layer (that's Cloudflare/Akamai territory, not application rate limiting)
  • Usage-based billing aggregation (related but a separate service)
  • Machine-learning-based adaptive limits (nice, but a follow-up)
  • Client-side rate limiting (we control the server side)

Non-Functional Requirements

Core:

  1. Low latency -- rate limit check must add < 5 ms p99. We cannot be the bottleneck on every request.
  2. High availability -- 99.99%. A broken rate limiter can take down every API behind it.
  3. Scale -- 10M RPS globally across the fleet, 1000 RPS default per user, millions of active users.
  4. Correctness -- under normal conditions, limits should be accurate within a small tolerance. Slight over/under-counting across windows is acceptable; letting a user do 100x their limit is not.

Below the line:

  • Exactly-once enforcement across the globe (too expensive; eventual accuracy within a second is fine)
  • Real-time analytics on limit usage (push to a separate analytics pipeline)

L4 sanity check: 10M RPS is large but distributed across many gateway nodes. Each node sees a slice. The only centralized piece is the counter store (Redis). One Redis cluster can do ~1M ops/sec; we'll need a small cluster, not magic.


The Set Up

Core Entities

  • RateLimitRule -- id, scope (user / IP / API key), endpointPattern, limit, windowSeconds
  • Counter -- per (scope, subject, endpoint); tracks current usage in the active window
  • Quota -- per user/tier: "1000 RPS default, 10000 RPS for premium"

The API

The rate limiter itself is mostly internal -- it sits inline on the request path. But there are two external surfaces:

Admin API (configure rules):

PUT /api/ratelimit/rules/{ruleId}
Content-Type: application/json

{
  "scope": "user",
  "endpointPattern": "/api/v1/search",
  "limit": 100,
  "windowSeconds": 60
}

Response: 200 OK

Data-plane (internal) -- the rate limit check:

Not a public REST endpoint, but conceptually:

checkLimit(userId, endpoint) -> { allowed: bool, remaining: int, resetAt: timestamp }
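
A minimal Java shape for this contract -- the type and method names here are illustrative, not an existing library API:

java
// Illustrative sketch: names are assumptions, not a real API.
public record LimitDecision(boolean allowed, long remaining, long resetAt) {}

public interface RateLimiter {
    // One decision per (userId, endpoint) pair, evaluated on every request.
    LimitDecision checkLimit(String userId, String endpoint);
}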

Response on limit exceeded (to the end user):

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 12
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714500000

{
  "error": "rate_limit_exceeded",
  "message": "Request rate exceeded. Try again in 12 seconds.",
  "retryAfter": 12
}

Retry-After can be seconds (integer) or an HTTP date. Use seconds; simpler.


High-Level Design

[Client] -> [Load Balancer] -> [API Gateway node] -> [Rate Limiter middleware] -> [Redis cluster]
                                        |                            |
                                        |                            +--- allow / deny decision
                                        v
                                   [Upstream service]

Flow 1: Request admitted

  1. Client sends request GET /api/v1/search with a bearer token.
  2. Gateway extracts the userId from the token and the endpoint path.
  3. Gateway calls the Rate Limiter middleware with (userId, endpoint).
  4. Middleware builds the counter key rl:user:123:/api/v1/search and executes a Lua script on Redis that atomically checks and increments (see the sketch after this flow).
  5. Redis returns allowed=true, remaining=899, resetAt=....
  6. Gateway forwards the request upstream, adds X-RateLimit-* headers to the response.
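
A sketch of steps 3-5, assuming a Jedis-style Redis client; TOKEN_BUCKET_SCRIPT stands for the Lua script in the algorithm deep dive below:

java
import java.util.List;
import redis.clients.jedis.Jedis;

public class RateLimitMiddleware {
    private static final String TOKEN_BUCKET_SCRIPT = "...";  // the Lua script from the deep dive
    private final Jedis jedis = new Jedis("localhost", 6379); // placeholder address

    // Returns true if the request may be forwarded upstream.
    public boolean check(String userId, String endpoint) {
        String key = "rl:user:" + userId + ":" + endpoint;    // e.g. rl:user:123:/api/v1/search
        @SuppressWarnings("unchecked")
        List<Long> reply = (List<Long>) jedis.eval(TOKEN_BUCKET_SCRIPT,
                List.of(key),                                  // KEYS[1] = bucket key
                List.of("1000", "1000",                        // capacity, refill rate (tokens/sec)
                        String.valueOf(System.currentTimeMillis()), "1"));
        return reply.get(0) == 1;                              // script returns { allowed, remaining, retryAfterMs }
    }
}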

Flow 2: Request throttled

  1. Steps 1-4 same.
  2. Redis returns allowed=false, retryAfter=12.
  3. Gateway short-circuits. Returns 429 with the Retry-After header. Upstream service is never called.

Flow 3: Rule change

  1. Operator calls the Admin API to change limit for a user tier from 1000 to 2000.
  2. Admin service writes to a config store (etcd / PostgreSQL with pub/sub, or a simple Redis hash).
  3. Gateway nodes subscribe to config changes and reload in-memory rule cache within seconds.
  4. No request interruption; rule changes are applied on the next request.

Flow 4: Redis failure

Covered in detail in the deep dive. Short version: the gateway has a configurable fail-open vs fail-closed policy. Default is fail-open for general API traffic, fail-closed for sensitive endpoints (/auth/*, /payment/*).


Potential Deep Dives

1) Algorithm choice: token bucket vs sliding window log vs sliding window counter

You MUST know all three and their trade-offs cold. The interviewer will push on this.

Token Bucket

  • How it works: Each user has a bucket with capacity C tokens. Tokens refill at rate R per second (up to C). Every request consumes 1 token. If bucket is empty, request is rejected.
  • Pros: Allows short bursts (up to C) while holding the long-run rate to R. Simple to implement. State is just two variables (tokens, lastRefillTime) -- see the sketch after this list.
  • Cons: Bursts cut both ways. Where a burst is undesirable (e.g., a login endpoint), use a smaller C or a stricter algorithm.
  • Use case: General API limits where bursts are OK.
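
To make "two state variables" concrete, here is a minimal single-process token bucket (the distributed version is the Lua script further down):

java
// Minimal single-process token bucket: two state variables plus refill math.
public class TokenBucket {
    private final double capacity;   // C: burst size
    private final double refillRate; // R: tokens per second
    private double tokens;
    private long lastRefillMs;

    public TokenBucket(double capacity, double refillRate) {
        this.capacity = capacity;
        this.refillRate = refillRate;
        this.tokens = capacity;      // start full
        this.lastRefillMs = System.currentTimeMillis();
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Refill proportionally to elapsed time, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefillMs) / 1000.0 * refillRate);
        lastRefillMs = now;
        if (tokens < 1) return false; // bucket empty: reject
        tokens -= 1;
        return true;
    }
}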

Sliding Window Log

  • How it works: Keep a log of every request timestamp. On each request, evict timestamps older than windowSeconds and count the rest; allow if count < limit (see the sketch after this list).
  • Pros: Exact. No edge-case over/under-counting.
  • Cons: Memory is O(limit * users). For 1000 RPS * 1M users, that's 1B timestamps in memory. Expensive.
  • Use case: Low-limit, high-accuracy scenarios (e.g., 5 login attempts per hour per IP).
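
One common implementation keeps the log in a Redis sorted set with the timestamp as score. A sketch assuming a Jedis-style client; production code would wrap the calls in a Lua script so they run atomically:

java
import redis.clients.jedis.Jedis;

public class SlidingWindowLog {
    // Sorted-set member = unique request id, score = arrival time in ms.
    public boolean allow(Jedis jedis, String key, int limit, long windowMs) {
        long now = System.currentTimeMillis();
        jedis.zremrangeByScore(key, 0, now - windowMs);   // evict timestamps outside the window
        if (jedis.zcard(key) >= limit) return false;      // count what's left
        jedis.zadd(key, now, now + "-" + Math.random());  // record this request
        jedis.pexpire(key, windowMs);                     // idle keys clean themselves up
        return true;
    }
}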

Sliding Window Counter (approximate)

  • How it works: Two fixed windows (current and previous). For a request at time t in the current window, compute the weighted count: currentCount + prevCount * (1 - positionInCurrentWindow). Allow if < limit (arithmetic sketch after this list).
  • Pros: O(1) memory per key. Smooths over window boundaries.
  • Cons: Approximate. Can over-admit or under-admit by a few percent.
  • Use case: The sweet spot for most per-user API limits.
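
The weighted count is plain arithmetic over the two counters; a sketch:

java
// Sliding window counter check (sketch). prevCount = requests in the previous
// fixed window, currCount = requests so far in the current one.
static boolean allow(long prevCount, long currCount, long limit, long nowMs, long windowMs) {
    long windowStart = (nowMs / windowMs) * windowMs;             // start of the current window
    double position = (nowMs - windowStart) / (double) windowMs;  // 0.0 .. 1.0 into the window
    return currCount + prevCount * (1.0 - position) < limit;
}

For example, 30 seconds into a 60-second window with prevCount = 40 and currCount = 80, the weighted count is 80 + 40 * 0.5 = 100.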

Fixed Window Counter (the naive one -- often called out as "Bad")

  • Bad Solution: Count requests in each fixed 60-second window. Reset at the boundary.
  • Challenges: Boundary problem. A user can fire 1000 requests at 12:00:59 and another 1000 at 12:01:00 -- 2x the rate within one second. Unacceptable for any serious limit.

Recommendation for L4: Token Bucket

It's simple, efficient, and handles bursts gracefully. It maps cleanly to Redis with a Lua script. Sliding Window Counter is a good second choice if the interviewer pushes for "no bursts allowed."

lua
-- Token bucket rate limiter: atomic in Redis
-- KEYS[1] = bucket key (e.g., "rl:user:123")
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = now (unix ms)
-- ARGV[4] = requested tokens (usually 1)
-- Returns: { allowed, remaining, retryAfterMs }

local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])  -- per second
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'lastRefill')
local tokens = tonumber(bucket[1]) or capacity
local lastRefill = tonumber(bucket[2]) or now

-- Refill based on elapsed time
local elapsedMs = now - lastRefill
local refill = (elapsedMs / 1000) * refillRate
tokens = math.min(capacity, tokens + refill)

local allowed = 0
local retryAfterMs = 0
if tokens >= requested then
  tokens = tokens - requested
  allowed = 1
else
  retryAfterMs = math.ceil(((requested - tokens) / refillRate) * 1000)
end

redis.call('HMSET', KEYS[1], 'tokens', tokens, 'lastRefill', now)
redis.call('EXPIRE', KEYS[1], 3600)  -- stale buckets get cleaned up

return { allowed, math.floor(tokens), retryAfterMs }

Why a Lua script? Atomicity. Without it, you'd have a TOCTOU race -- two concurrent requests read the same token count and both decrement, allowing double the rate. Redis Lua scripts run atomically under a single-threaded execution model.
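
For contrast, here is the naive client-side read-then-write that the script replaces (Jedis-style calls; forwardUpstream() is a hypothetical stand-in for admitting the request):

java
// RACY -- two gateway nodes can both read tokens == 1, both write 0,
// and both admit a request: 2x the intended rate.
long tokens = Long.parseLong(jedis.hget(key, "tokens"));   // (1) read
if (tokens >= 1) {
    // ...a concurrent request on another node reads the same value here...
    jedis.hset(key, "tokens", String.valueOf(tokens - 1)); // (2) write
    forwardUpstream();                                     // stand-in for "admit"
}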

2) Distributed coordination -- consistent view across N gateway nodes

Bad Solution: Per-node in-memory counters

  • Approach: Each gateway node tracks its own counts locally. A user hitting 10 nodes can do 10x the limit.
  • Challenges: Completely breaks the limit contract. Don't do this.

Good Solution: Centralized Redis

  • Approach: Every gateway node calls Redis for every rate limit check. One shared source of truth.
  • Challenges: Redis is a critical dependency. Latency matters -- Redis RTT is typically < 1 ms in-region but adds up. 10M RPS means 10M Redis ops/sec -- beyond a single Redis node. Shard the Redis cluster by hash of the counter key.

Back-of-envelope: A single Redis node does ~100K ops/sec with Lua. For 10M RPS, we need ~100 Redis shards. More realistically, because of locality (most users hit the same gateway region), per-region Redis clusters of ~10 shards each suffice.

Great Solution: Two-tier -- local pre-filter + global Redis

  • Approach: Each gateway node keeps a local token bucket that refills from the global Redis bucket periodically. Most requests are admitted or rejected locally; only occasional "refill" calls hit Redis (see the sketch after this list).
  • Example: local bucket of 100 tokens. When empty, request 100 more tokens from Redis (which atomically decrements the global bucket). If the global bucket is empty, we reject.
  • Pros: Dramatically reduces Redis load. Reduces latency on the hot path.
  • Cons: Harder to reason about. If a gateway node crashes holding 100 tokens, those are briefly "lost." For strict limits, over-allocate a small buffer; for lax limits, accept it.
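
A sketch of the two-tier idea. BATCH, the inline refill script, and the Jedis-style client are assumptions for illustration; a real refill script would also apply the token-bucket refill math to the global bucket:

java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import redis.clients.jedis.Jedis;

public class TwoTierLimiter {
    private static final long BATCH = 100;  // tokens fetched per Redis round trip
    private final AtomicLong localTokens = new AtomicLong(0);
    private final Jedis jedis;
    private final String globalKey;

    public TwoTierLimiter(Jedis jedis, String globalKey) {
        this.jedis = jedis;
        this.globalKey = globalKey;
    }

    public boolean tryAcquire() {
        // Hot path: spend a local token without touching Redis.
        if (localTokens.getAndUpdate(t -> t > 0 ? t - 1 : 0) > 0) return true;
        // Local pool empty: atomically take up to BATCH tokens from the global bucket.
        Long granted = (Long) jedis.eval(
                "local t = tonumber(redis.call('HGET', KEYS[1], 'tokens') or '0') " +
                "local take = math.min(t, tonumber(ARGV[1])) " +
                "redis.call('HINCRBYFLOAT', KEYS[1], 'tokens', -take) " +
                "return take",
                List.of(globalKey), List.of(String.valueOf(BATCH)));
        if (granted == null || granted == 0) return false; // global bucket exhausted
        localTokens.addAndGet(granted - 1);                // bank the rest, spend one now
        return true;
    }
}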

L4 call: the Good Solution (centralized Redis with Lua) is the right answer for an L4 interview. Mention the tiered local cache as an "if we needed to push further, here's what I'd do" extension.

3) Fail-open vs fail-closed when Redis is down

This is where interviewers test your product thinking.

The Question

If Redis is unreachable, what does the gateway do with an incoming request?

  • Fail-open: Allow the request. Users don't notice the outage. Risk: a malicious client can blast the API while Redis is down.
  • Fail-closed: Reject with 429 or 503. Limits are always respected. Risk: a Redis outage cascades into a full API outage, which is typically much worse than temporarily lax rate limits.

A pragmatic policy differentiates by endpoint class:

  • General traffic: fail-open. Log the event so SRE knows.
  • Auth / payment / account-modification endpoints: fail-closed. Return 503 with Retry-After. Better to refuse login attempts than to let a credential-stuffing attack skate through during a Redis outage.
  • Circuit breaker: if Redis returns errors for N consecutive requests in M seconds, trip open and skip the Redis call entirely, applying the fail policy above; probe periodically to see if Redis has recovered. This prevents every request from waiting out a Redis timeout.

The decision logic, sketched:
java
boolean allowRequest(String userId, String endpoint) {
    if (redisCircuitBreaker.isOpen()) {
        // Circuit open: critical endpoints fail closed; everything else follows policy.
        if (isCriticalEndpoint(endpoint)) return false;
        return failMode == FAIL_OPEN;
    }
    try {
        var result = redis.eval(TOKEN_BUCKET_SCRIPT, ...);
        return result.allowed;
    } catch (RedisTimeoutException e) {
        redisCircuitBreaker.recordFailure();
        return !isCriticalEndpoint(endpoint) && failMode == FAIL_OPEN;
    }
}

Explicitly articulate these trade-offs in the interview -- L4 candidates who can reason about fail modes stand out.

4) Making rate limit changes dynamic

Bad Solution: Redeploy gateway with new config

Minutes of latency to change a limit. Unacceptable for ops.

Good Solution: Central config store with polling

Gateways poll a config service (/api/ratelimit/rules) every 10 seconds. Rule changes propagate within ~15 seconds. Simple, reliable.
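
A polling sketch; RuleCache and fetchRules() are illustrative names, and RateLimitRule mirrors the entity from the core-entities list:

java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class RuleCache {
    // Mirrors the RateLimitRule entity from the core-entities list.
    public record RateLimitRule(String scope, String endpointPattern, int limit, int windowSeconds) {}

    private final AtomicReference<Map<String, RateLimitRule>> rules = new AtomicReference<>(Map.of());

    public void start() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try {
                rules.set(fetchRules());  // full refresh from GET /api/ratelimit/rules
            } catch (Exception e) {
                // Config service unreachable: keep serving the last known rules.
            }
        }, 0, 10, TimeUnit.SECONDS);
    }

    public Map<String, RateLimitRule> current() { return rules.get(); }

    // Placeholder for the HTTP call to the config service.
    private Map<String, RateLimitRule> fetchRules() { return Map.of(); }
}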

Great Solution: Pub/sub config distribution

  • Admin writes to config DB, which publishes a rule_updated event on Redis pub/sub or Kafka.
  • Each gateway node subscribes and reloads its rule cache within milliseconds.
  • Fall back to periodic full refresh (every 5 minutes) in case a pub/sub message is missed.

For L4, "polling with a short interval" is perfectly fine. Call out pub/sub as an upgrade path.

5) Tiered limits (premium vs free) and multi-dimensional rules

A single request might be subject to multiple rules simultaneously:

  • User limit: 1000 RPS
  • Endpoint limit: 100 RPS on /api/v1/search
  • Global limit: 10M RPS total

The middleware checks each applicable rule. Reject if ANY rule is violated. Atomically increment all counters only if all checks pass (use a single Lua script that does multiple HMGET + HMSET under one atomic call).

lua
-- Multi-rule token bucket: increment every counter only if all rules allow.
-- KEYS    = counter keys, one per applicable rule
-- ARGV[1] = now (unix ms); then per rule i: ARGV[2i] = capacity, ARGV[2i+1] = refill rate (tokens/sec)

local now = tonumber(ARGV[1])
local newTokens = {}

-- Pass 1: evaluate every bucket; bail out on the first violation
for i = 1, #KEYS do
  local capacity = tonumber(ARGV[2 * i])
  local refillRate = tonumber(ARGV[2 * i + 1])
  local bucket = redis.call('HMGET', KEYS[i], 'tokens', 'lastRefill')
  local tokens = tonumber(bucket[1]) or capacity
  local lastRefill = tonumber(bucket[2]) or now
  tokens = math.min(capacity, tokens + ((now - lastRefill) / 1000) * refillRate)
  if tokens < 1 then
    local retryAfterMs = math.ceil(((1 - tokens) / refillRate) * 1000)
    return { 0, retryAfterMs }  -- reject; no counters were modified
  end
  newTokens[i] = tokens - 1
end

-- Pass 2: all rules passed, commit every decrement atomically
for i = 1, #KEYS do
  redis.call('HMSET', KEYS[i], 'tokens', newTokens[i], 'lastRefill', now)
  redis.call('EXPIRE', KEYS[i], 3600)
end
return { 1, 0 }

6) The 429 response and Retry-After

Standards matter. The 429 status comes from RFC 6585 and Retry-After from RFC 7231; the X-RateLimit-* headers are a widely used convention (popularized by GitHub and others) rather than part of those RFCs:

  • Retry-After: 12 -- retry after 12 seconds.
  • X-RateLimit-Limit -- the ceiling.
  • X-RateLimit-Remaining -- tokens left in the current window.
  • X-RateLimit-Reset -- unix timestamp when the bucket refills.

Good clients obey Retry-After and back off. Misbehaving clients ignore it -- for those, you may want a second tier of protection (e.g., IP bans after repeated violations).

7) What NOT to build at L4

Interviewers get nervous when L4 candidates overreach. Be disciplined:

  • Don't design CRDT-based rate limiting across regions (too exotic for this scope).
  • Don't propose a custom Raft cluster for the counter store -- just use Redis and accept the single-region blast radius.
  • Don't design real-time ML for adaptive limits unless the problem asks.
  • Don't design a full observability stack; mention "we emit metrics to Prometheus" and move on.

What is Expected at Each Level

L3 / Mid-level

  • Know that rate limiting exists and that it needs a shared counter store. Might propose fixed-window counters. Hits the 429 status code. Probably won't proactively address fail-open vs fail-closed or atomicity issues.

L4

  • Pick token bucket (or sliding window counter) and justify the choice against the others.
  • Write the Lua script -- or at least clearly describe why atomicity is required and how Redis gives it.
  • Discuss fail-open vs fail-closed with nuance (different policies per endpoint class).
  • Explicit 429 response with Retry-After.
  • Back-of-envelope on Redis shard count for 10M RPS.
  • Dynamic rule updates without redeploy.

L5 / Senior

  • Tiered local/global bucket to reduce Redis load on the hot path.
  • Multi-region consistency: mention cross-region replication lag and why it's usually fine for rate limiting (eventual consistency, small over-allocation).
  • Operational concerns: capacity planning for Redis, monitoring counter drift, alerting on circuit-breaker trips.
  • Abuse patterns beyond simple rate limiting: credential stuffing, velocity checks on account creation, automated bot detection.
  • Interaction with cost/billing systems: how do usage records flow to invoicing?
