HLD: Rate Limiter

Understanding the Problem

What is a Rate Limiter?

A rate limiter controls the rate of traffic a client can send to a service. It protects backend services from being overwhelmed by excessive requests -- whether from a misbehaving client, a DDoS attack, or an organic traffic spike. Think of it as a bouncer at a club: people get in, but only N per minute. At Amazon's scale, rate limiting is critical for every API -- from product search to checkout to third-party seller APIs.

Note: This document covers the high-level / distributed system design of a rate limiter. For the class-level (LLD) design -- including full OOP implementation of the algorithms -- see LLD Rate Limiter (file 15).

Functional Requirements

Core (above the line):

  1. Limit requests per client -- given a client identifier (user ID, API key, IP), allow at most N requests per time window
  2. Return clear feedback -- when rate limited, return HTTP 429 with Retry-After header
  3. Support multiple rules -- different rate limits for different endpoints (e.g., 100 req/min for search, 10 req/min for checkout)
  4. Configurable limits -- rate limit rules can be updated without redeployment

Below the line (mention but don't design):

  • Rate limit by geographic region
  • Adaptive rate limiting (auto-adjust based on backend health)
  • Rate limit exemptions (whitelisted clients)
  • Billing-tier-based limits

Non-Functional Requirements

  1. Low latency overhead -- rate limiting check must complete in < 1ms (it is on the critical path of every request)
  2. High availability -- if the rate limiter goes down, we should fail open (allow traffic) rather than block everything
  3. Accuracy -- slight inaccuracy (±5% over or under the limit) is acceptable, but we must never allow 10x the limit
  4. Scale -- 1M requests/sec across all services, 10M unique clients

The Set Up

Core Entities

| Entity           | Description                                                                      |
| ---------------- | -------------------------------------------------------------------------------- |
| RateLimitRule    | id, endpoint, clientTier, maxRequests, windowSeconds, algorithm                   |
| RateLimitCounter | clientId, endpoint, count, windowStart (or token count, depending on algorithm)   |
| Client           | clientId, apiKey, tier (free/pro/enterprise), whitelisted                         |

API Design

The rate limiter is not a user-facing API -- it is middleware. But it exposes internal APIs for configuration:

Check rate limit (internal, called by API gateway):

// This happens internally, not as a REST call, but conceptually:
RateLimitResult checkRateLimit(clientId, endpoint)

Returns:
{
  "allowed": true/false,
  "remaining": 87,        // requests remaining in current window
  "limit": 100,           // max requests allowed
  "retryAfterMs": 0       // 0 if allowed, otherwise ms until window resets
}

HTTP response when rate limited:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1705312860
Retry-After: 45
Content-Type: application/json

{
  "error": "RATE_LIMIT_EXCEEDED",
  "message": "Rate limit exceeded. Try again in 45 seconds."
}

HTTP response headers on normal requests:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1705312860

Update rate limit rules (admin API):

PUT /api/admin/rate-limit-rules/{ruleId}
Authorization: Bearer <admin-token>

Request:
{
  "endpoint": "/api/search",
  "clientTier": "free",
  "maxRequests": 100,
  "windowSeconds": 60,
  "algorithm": "sliding_window_counter"
}

High-Level Design

Architecture Overview

[Client] --> [Load Balancer] --> [API Gateway (Rate Limit Check)]
                                        |
                                  [Rate Limit Service]
                                        |
                                  [Redis Cluster (counters)]
                                        |
                                  [Rules Config Store (DynamoDB)]
                                        |
                                  (if allowed)
                                        v
                                  [Backend Service]

Flow 1: Request Rate Limit Check

Step-by-step:

  1. Client sends a request to GET /api/search?q=headphones
  2. Request hits the API Gateway (could be AWS API Gateway, Kong, or custom)
  3. API Gateway extracts the client identifier: apiKey from header, or fallback to IP address
  4. API Gateway calls the Rate Limit Service (co-located sidecar or separate microservice)
  5. Rate Limit Service:
     a. Loads the applicable rule for this endpoint + client tier from a local cache (rules are synced from DynamoDB every 30 seconds)
     b. Executes the rate limit algorithm against Redis. For token bucket (example):
        Key: ratelimit:{clientId}:{endpoint}
        Operation: atomic check-and-decrement via Lua script
     c. Returns allowed: true/false with remaining count
  6. If allowed: API Gateway forwards the request to the backend service
  7. If denied: API Gateway returns 429 with appropriate headers
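
A minimal sketch of steps 3-7 as gateway middleware, assuming a Flask-style gateway; rate_limit_service and its check method are illustrative stand-ins for the Rate Limit Service client, not a real library API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.before_request
def enforce_rate_limit():
    # Step 3: extract the client identifier (API key, falling back to IP)
    client_id = request.headers.get("X-Api-Key") or request.remote_addr
    # Steps 4-5: ask the Rate Limit Service (sidecar or microservice)
    result = rate_limit_service.check(client_id, request.path)
    if result.allowed:
        return None  # Step 6: fall through to the backend handler
    # Step 7: deny with 429 and the standard headers
    retry_after_s = result.retry_after_ms // 1000
    resp = jsonify(error="RATE_LIMIT_EXCEEDED",
                   message=f"Rate limit exceeded. Try again in {retry_after_s} seconds.")
    resp.status_code = 429
    resp.headers["X-RateLimit-Limit"] = str(result.limit)
    resp.headers["X-RateLimit-Remaining"] = "0"
    resp.headers["Retry-After"] = str(retry_after_s)
    return resp
```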

Flow 2: Rule Configuration Update

  1. Admin updates a rule via PUT /api/admin/rate-limit-rules/{ruleId}
  2. Rule is written to DynamoDB (durable config store)
  3. A DynamoDB Stream triggers a Lambda function
  4. Lambda publishes a RuleUpdated event to SNS
  5. All Rate Limit Service instances subscribe to the SNS topic
  6. Each instance updates its local rule cache within seconds
  7. New requests are evaluated against the updated rule

No restart or redeployment needed. Rules propagate in < 10 seconds.
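
A hedged sketch of steps 5-6 on the service side, assuming each instance consumes the SNS topic through its own SQS queue (a common SNS fan-out pattern); the queue URL and message shape are illustrative:

```python
import json
import threading

import boto3

class RuleCache:
    def __init__(self):
        self._rules = {}  # (endpoint, tier) -> rule dict
        self._lock = threading.Lock()

    def get(self, endpoint, tier):
        with self._lock:
            return self._rules.get((endpoint, tier))

    def apply_update(self, rule):
        with self._lock:
            self._rules[(rule["endpoint"], rule["clientTier"])] = rule

def poll_rule_updates(cache, queue_url):
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])      # SNS delivery envelope
            rule = json.loads(body["Message"])  # the RuleUpdated payload
            cache.apply_update(rule)            # step 6: refresh local cache
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
```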


Rate Limiting Algorithms

Algorithm Comparison Table

| Algorithm              | Accuracy                | Memory          | Burst Handling              | Complexity | Best For                                 |
| ---------------------- | ----------------------- | --------------- | --------------------------- | ---------- | ---------------------------------------- |
| Token Bucket           | High                    | O(1) per client | Allows controlled bursts    | Low        | General purpose, API gateways            |
| Fixed Window Counter   | Medium (boundary burst) | O(1) per client | 2x burst at window boundary | Very Low   | Simple use cases                         |
| Sliding Window Log     | Perfect                 | O(N) per client | No bursts allowed           | Medium     | When exact accuracy is critical          |
| Sliding Window Counter | High (~99.x%)           | O(1) per client | Minimal boundary burst      | Medium     | Best balance of accuracy and efficiency  |

Algorithm 1: Token Bucket

How it works:

  • Each client has a "bucket" that holds up to maxTokens tokens
  • Tokens are added at a fixed rate: refillRate tokens per second
  • Each request costs 1 token. If the bucket is empty, the request is rejected.
  • Tokens accumulate up to maxTokens, allowing short bursts

Example: maxTokens = 10, refillRate = 2/sec

  • At t=0: bucket has 10 tokens. Client sends 10 requests instantly -- all allowed. Bucket: 0.
  • At t=1: 2 tokens added. Client can send 2 more. Bucket: 0 after 2 requests.
  • At t=6: bucket refilled to 10 tokens (capped at max). Client can burst again.

Redis Implementation (Lua script):

```lua
-- Token Bucket Rate Limiter
-- KEYS[1] = ratelimit:{clientId}:{endpoint}
-- ARGV[1] = maxTokens, ARGV[2] = refillRate, ARGV[3] = now (unix timestamp ms)

local key = KEYS[1]
local maxTokens = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
local tokens = tonumber(bucket[1])
local lastRefill = tonumber(bucket[2])

-- Initialize if first request
if tokens == nil then
    tokens = maxTokens
    lastRefill = now
end

-- Calculate tokens to add since last refill
local elapsed = (now - lastRefill) / 1000  -- convert ms to seconds
local newTokens = math.floor(elapsed * refillRate)
tokens = math.min(maxTokens, tokens + newTokens)

-- Update last refill time only if tokens were added
if newTokens > 0 then
    lastRefill = now
end

-- Try to consume a token
local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end

-- Save state
redis.call('HMSET', key, 'tokens', tokens, 'lastRefill', lastRefill)
redis.call('EXPIRE', key, math.ceil(maxTokens / refillRate) + 10)  -- TTL for cleanup

return {allowed, tokens}  -- {1 or 0, remaining tokens}
```
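
For illustration, this is how the script might be invoked from application code with redis-py; TOKEN_BUCKET_LUA is assumed to hold the Lua source above (register_script loads it once and reuses it via EVALSHA):

```python
import time

import redis

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)  # Lua source above (assumed)

def check(client_id, endpoint, max_tokens=10, refill_rate=2):
    key = f"ratelimit:{client_id}:{endpoint}"
    now_ms = int(time.time() * 1000)
    allowed, remaining = token_bucket(keys=[key],
                                      args=[max_tokens, refill_rate, now_ms])
    return allowed == 1, remaining
```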

Pros: Allows controlled bursts (great for APIs where clients send requests in batches). Memory-efficient (O(1) per client -- just two values).

Cons: Burst size equals maxTokens, which might be too large for some use cases. Harder to reason about than a simple "N per minute" rule.


Algorithm 2: Fixed Window Counter

How it works:

  • Divide time into fixed windows (e.g., each minute: 10:00-10:01, 10:01-10:02)
  • Each client has a counter per window
  • If counter < limit, allow and increment. If counter >= limit, reject.

Example: limit = 100 per minute

  • Window 10:00-10:01: client sends 100 requests -- all allowed. Counter = 100.
  • 101st request in the same window: rejected.
  • Window 10:01-10:02: counter resets to 0. Client can send 100 more.

The boundary problem: At 10:00:59, client sends 100 requests (allowed). At 10:01:01, client sends 100 more (new window, allowed). That is 200 requests in 2 seconds -- 2x the intended rate.

Redis Implementation:

```lua
-- Fixed Window Counter
-- KEYS[1] = ratelimit:{clientId}:{endpoint}:{windowId}
-- ARGV[1] = limit, ARGV[2] = windowSeconds

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local windowSeconds = tonumber(ARGV[2])

local current = tonumber(redis.call('GET', key) or '0')

if current >= limit then
    return {0, 0}  -- rejected, 0 remaining
end

local newCount = redis.call('INCR', key)
if newCount == 1 then
    redis.call('EXPIRE', key, windowSeconds)
end

return {1, limit - newCount}  -- allowed, remaining
```

The windowId is computed client-side: math.floor(now / windowSeconds).
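
A short sketch of that client-side computation with redis-py, assuming FIXED_WINDOW_LUA holds the script above:

```python
import time

import redis

r = redis.Redis()
fixed_window = r.register_script(FIXED_WINDOW_LUA)  # Lua source above (assumed)

def check(client_id, endpoint, limit=100, window_seconds=60):
    window_id = int(time.time()) // window_seconds  # fixed window this request falls in
    key = f"ratelimit:{client_id}:{endpoint}:{window_id}"
    allowed, remaining = fixed_window(keys=[key], args=[limit, window_seconds])
    return allowed == 1, remaining
```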

Pros: Extremely simple. Low memory (one counter per client per window). Redis INCR is atomic, no Lua script strictly needed for the basic case.

Cons: Boundary burst problem (up to 2x limit). Not suitable when strict rate enforcement is needed.


Algorithm 3: Sliding Window Counter

How it works: A hybrid that combines the memory efficiency of fixed windows with the accuracy of sliding windows. It uses two adjacent fixed windows and weights them by how much of the current sliding window overlaps with each.

Formula:

weightedCount = (previousWindowCount * overlapPercentage) + currentWindowCount

Example: limit = 100 per minute, current time is 10:01:15 (15 seconds into the current window)

  • Previous window (10:00-10:01) had 84 requests
  • Current window (10:01-10:02) has 36 requests so far
  • Overlap of previous window with our sliding window: (60 - 15) / 60 = 75%
  • Weighted count = 84 * 0.75 + 36 = 63 + 36 = 99
  • Next request: 99 < 100, allowed
  • One more after that: 100 >= 100, rejected

Redis Implementation (Lua script):

```lua
-- Sliding Window Counter
-- KEYS[1] = ratelimit:{clientId}:{endpoint}
-- ARGV[1] = limit, ARGV[2] = windowSeconds, ARGV[3] = now (unix timestamp seconds)

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local windowSeconds = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Determine current and previous window IDs
local currentWindow = math.floor(now / windowSeconds)
local previousWindow = currentWindow - 1

-- Get counts for both windows
local prevKey = key .. ':' .. previousWindow
local currKey = key .. ':' .. currentWindow

local prevCount = tonumber(redis.call('GET', prevKey) or '0')
local currCount = tonumber(redis.call('GET', currKey) or '0')

-- Calculate the overlap of the previous window with the sliding window
local elapsedInCurrentWindow = now % windowSeconds
local overlapPercentage = (windowSeconds - elapsedInCurrentWindow) / windowSeconds

-- Weighted count
local weightedCount = math.floor(prevCount * overlapPercentage) + currCount

if weightedCount >= limit then
    return {0, 0, math.ceil((1 - elapsedInCurrentWindow / windowSeconds) * windowSeconds)}
    -- rejected, 0 remaining, retry-after seconds
end

-- Increment current window counter
redis.call('INCR', currKey)
redis.call('EXPIRE', currKey, windowSeconds * 2)  -- keep for 2 windows

local remaining = limit - weightedCount - 1
return {1, remaining, 0}  -- allowed, remaining, 0 retry-after
```

Pros: Near-perfect accuracy (Cloudflare tested this at 99.x% accuracy compared to a true sliding window). O(1) memory per client (just two counters). No boundary burst problem.

Cons: Slightly more complex than fixed window. The weighted count is an approximation -- in theory, a client could get a few extra requests through under specific timing conditions. In practice, this is negligible.

This is the recommended algorithm for most use cases. It gives the best balance of accuracy, memory efficiency, and simplicity.


Sliding Window Log (Mentioned for Completeness)

Stores the timestamp of every request in a sorted set. To check the rate:

ZREMRANGEBYSCORE ratelimit:{clientId} 0 (now - windowSeconds)
count = ZCARD ratelimit:{clientId}
if count < limit: ZADD ratelimit:{clientId} now now

Perfect accuracy but O(N) memory per client where N is the limit. For a limit of 1000/min with 10M clients, that is 10B entries. Not practical at scale.
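
For reference, a direct redis-py translation of those commands; note the check and the ZADD are separate round trips here, so a production version would wrap the same commands in a Lua script for atomicity:

```python
import time

import redis

r = redis.Redis()

def allow(client_id, limit=1000, window_seconds=60):
    key = f"ratelimit:{client_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # evict timestamps outside the window
    pipe.zcard(key)                                      # count requests still in the window
    _, count = pipe.execute()
    if count >= limit:
        return False
    r.zadd(key, {str(now): now})   # log this request's timestamp
    r.expire(key, window_seconds)  # let idle clients' logs expire
    return True
```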


Potential Deep Dives

Deep Dive 1: Distributed Rate Limiting with Redis

The Problem: We have multiple API Gateway instances. A client's requests might hit different instances. The rate limit counter must be shared across all instances.

Bad Solution -- Local counters per instance: If we have 10 API Gateway instances and each maintains its own counter, a client with a limit of 100/min could actually send 1000/min (100 per instance). Useless.

Good Solution -- Centralized Redis: All instances read/write to the same Redis key. The Lua scripts above are atomic -- Redis is single-threaded, so no race conditions within a single Redis node.

Potential issue: Redis latency. A Redis call takes ~0.5ms on the local network. For our 1ms budget, this is tight but workable.

Great Solution -- Redis Cluster with smart sharding:

  • Shard rate limit keys across a Redis Cluster (e.g., 6 primary shards, each with a replica)
  • The key ratelimit:{clientId}:{endpoint} hashes to a specific shard
  • All requests for the same client hit the same Redis shard -- no cross-shard coordination needed
  • Each primary handles ~200K ops/sec. 6 primaries = 1.2M ops/sec capacity. Our 1M requests/sec are covered.

For even lower latency: Deploy Redis replicas in the same availability zone as each API Gateway cluster. Read from local replica (eventual consistency -- counter might be off by 1-2 requests). Write to primary. The slight inaccuracy (±5%) is within our tolerance.

Deep Dive 2: Race Conditions in Counter Updates

The Problem: Two requests from the same client arrive at the exact same millisecond on two different API Gateway instances. Both read the counter as 99 (limit is 100). Both increment to 100. Both are allowed. But the true count is 101 -- we exceeded the limit.

Bad Solution -- Read-then-write in application code:

```python
count = int(redis.get(key) or 0)  # Both threads read 99
if count < limit:
    redis.incr(key)               # Both increment to 100
    allow()                       # Both requests pass
```

Classic TOCTOU (time-of-check, time-of-use) race condition.

Good Solution -- Atomic INCR with check:

```python
new_count = redis.incr(key)   # Atomic -- Redis serializes these
if new_count > limit:
    redis.decr(key)           # Undo the increment
    reject()
else:
    allow()
```

This is atomic for the increment, but the decrement creates a brief moment where the counter is over-limit. Also, if the process crashes between INCR and DECR, the counter is permanently off.

Great Solution -- Lua scripts (what we use): The entire check-and-update logic runs as a single atomic Lua script in Redis. No other Redis command can interleave. The Lua scripts shown in the algorithm section above handle this correctly.

Why Lua scripts are atomic: Redis is single-threaded. A Lua script executes as one uninterruptible operation. Even if 1000 requests arrive simultaneously, Redis processes them one by one through the Lua script. Zero race conditions.

Deep Dive 3: Rate Limiting at Different Layers

Rate limiting can happen at multiple points in the stack. Each has trade-offs:

Layer 1 -- CDN / Edge (CloudFront, Cloudflare):

  • Block obviously abusive traffic (DDoS) before it reaches your infrastructure
  • Simple rules: IP-based, geographic blocking, request rate per IP
  • Pros: stops bad traffic at the edge, protects your bandwidth
  • Cons: coarse-grained, cannot do per-user or per-API-key limiting

Layer 2 -- API Gateway (Kong, AWS API Gateway, Envoy):

  • Per-API-key, per-endpoint rate limiting
  • This is where our main rate limiter lives
  • Pros: centralized enforcement, supports complex rules
  • Cons: adds latency to every request (must be < 1ms)

Layer 3 -- Application (middleware in each service):

  • Service-specific rate limits (e.g., the checkout service limits add-to-cart to 5/sec per user)
  • Protects against specific abuse patterns
  • Pros: fine-grained, context-aware
  • Cons: each service implements its own limiter, inconsistent

Layer 4 -- Database (connection pool limits):

  • Not a traditional rate limiter, but connection pools and query timeouts protect the DB
  • Pros: last line of defense
  • Cons: not user-facing, results in 500 errors not 429

Recommended architecture: Use Layers 1 + 2 as the primary defense. Layer 3 for specific high-value endpoints. Layer 4 as a safety net.

Deep Dive 4: Handling Rate Limiter Failures

The Problem: What if Redis goes down? The rate limiter cannot check counters. Do we block all traffic (fail closed) or allow all traffic (fail open)?

Fail closed (block all):

  • Safest for the backend, but catastrophic for users. A Redis failure causes a complete service outage. This is worse than the problem we are trying to solve.

Fail open (allow all):

  • Users are not impacted. Backend might get a traffic spike, but it was handling traffic before the Redis failure. This is the correct default for most services.

Implementation:

```python
def check_rate_limit(client_id, endpoint):
    try:
        # redis-py EVAL signature: eval(script, numkeys, *keys_and_args)
        result = redis.eval(lua_script, 1, key, limit, window, now)
        return result[0] == 1  # first element of the reply: 1 = allowed
    except ConnectionError:  # redis.exceptions.ConnectionError
        # Fail open -- allow the request
        metrics.increment("rate_limiter.redis_failure")
        return True  # allowed
```

Mitigation for fail-open periods:

  1. Local in-memory fallback: Each API Gateway maintains a local token bucket as a backup (see the sketch after this list). Less accurate (per-instance, not global) but prevents complete loss of rate limiting.
  2. Circuit breaker on Redis: If Redis fails 5 times in 10 seconds, trip the circuit breaker. Stop trying Redis for 30 seconds (avoid wasting time on timeouts). Use local fallback during this period.
  3. Alerting: Trigger a PagerDuty alert immediately when the rate limiter is in fail-open mode.
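
A minimal sketch of mitigations 1 and 2 combined; local_buckets, circuit_breaker, run_redis_check, and metrics are illustrative placeholders, and the thresholds come from the list above:

```python
import time

class LocalTokenBucket:
    """Per-instance fallback bucket -- approximate, not globally accurate."""

    def __init__(self, max_tokens, refill_rate):
        self.max_tokens, self.refill_rate = max_tokens, refill_rate
        self.tokens, self.last_refill = float(max_tokens), time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.max_tokens,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def check_rate_limit(client_id, endpoint):
    bucket = local_buckets[(client_id, endpoint)]  # hypothetical per-instance state
    if circuit_breaker.is_open():        # tripped: skip Redis for 30s, use fallback
        return bucket.allow()
    try:
        allowed = run_redis_check(client_id, endpoint)  # the Lua script call
        circuit_breaker.record_success()
        return allowed
    except ConnectionError:              # Redis unreachable
        circuit_breaker.record_failure() # trips after 5 failures in 10 seconds
        metrics.increment("rate_limiter.redis_failure")
        return bucket.allow()            # fail open through the local bucket
```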

Deep Dive 5: Multi-Tier Rate Limiting

The Problem: Different clients need different limits. A free-tier user gets 100 req/min, a Pro user gets 1000 req/min, and an Enterprise user gets 10,000 req/min. How do we make this configurable and efficient?

Rule evaluation pipeline:

Request arrives with API key "ak_xyz"
    |
    v
1. Look up client tier from local cache: "ak_xyz" -> { tier: "pro", userId: "user_123" }
   (Cached from DynamoDB, refreshed every 60 seconds)
    |
    v
2. Look up applicable rules for (endpoint, tier):
   "/api/search" + "pro" -> { limit: 1000, window: 60, algorithm: "sliding_window_counter" }
   (Also from local cache)
    |
    v
3. Run the algorithm against Redis with the resolved parameters
    |
    v
4. Return allowed/denied with remaining count

Hierarchical limits: A single request might be checked against multiple rules:

  • Per-endpoint: 1000 req/min for /api/search
  • Global per-client: 5000 req/min across all endpoints
  • Per-IP: 10,000 req/min (catches abuse even with rotating API keys)

All checks must pass. If any limit is exceeded, the request is rejected. The API Gateway runs all three checks in parallel (three Redis calls) and takes the most restrictive result, as in the sketch below.
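
A sketch of the hierarchical check, assuming r is a redis-py client and sliding_window is the registered Lua script from earlier; the three evaluations are batched into one Redis round trip via a pipeline:

```python
def check_all(user_id, ip, endpoint, now):
    checks = [
        (f"ratelimit:{user_id}:{endpoint}", 1000),  # per-endpoint
        (f"ratelimit:{user_id}:global", 5000),      # global per-client
        (f"ratelimit:ip:{ip}", 10000),              # per-IP
    ]
    pipe = r.pipeline()
    for key, limit in checks:
        # redis-py lets a registered script queue onto a pipeline via client=
        sliding_window(keys=[key], args=[limit, 60, now], client=pipe)
    results = pipe.execute()  # [[allowed, remaining, retryAfter], ...]
    allowed = all(res[0] == 1 for res in results)
    retry_after = max(res[2] for res in results)  # most restrictive result wins
    return allowed, retry_after
```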

Back-of-envelope for Redis memory:

  • 10M unique clients * 3 rules each * ~100 bytes per counter = 3 GB. Trivial for Redis.

Deep Dive 6: Rate Limiting in a Microservices Architecture

Pattern: Sidecar rate limiter (Envoy + external rate limit service)

In a service mesh (Istio/Envoy), rate limiting is implemented as:

  1. Each service has an Envoy sidecar proxy
  2. Envoy is configured with rate limit descriptors:
    ```yaml
    rate_limits:
      - actions:
          - request_headers:
              header_name: "x-api-key"
              descriptor_key: "api_key"
          - request_headers:
              header_name: ":path"
              descriptor_key: "path"
    ```
  3. Envoy sends a gRPC request to the Rate Limit Service (Lyft's open-source ratelimit service)
  4. Rate Limit Service checks Redis and responds with ALLOW or DENY
  5. Envoy either forwards the request to the backend or returns 429

Advantages:

  • Rate limiting is decoupled from application code
  • Consistent enforcement across all services
  • Rules are managed centrally via config (not code deploys)
  • Envoy adds < 0.5ms overhead for the gRPC call

Rate limiting across services (cascading): When Service A calls Service B, Service B should have its own rate limits. But Service B should distinguish between "external user via API Gateway" and "internal service call." Internal calls typically get higher limits or are exempt.

Use a header like X-Internal-Call: true (signed with a shared secret) to identify internal traffic.
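
A sketch of that header scheme with HMAC-SHA256; the header names and the secret's source are illustrative, and a production version would also sign a timestamp to prevent replay:

```python
import hashlib
import hmac

SHARED_SECRET = b"..."  # fetched from a secrets manager, never hardcoded

def sign_internal_call(headers, service_name):
    headers["X-Internal-Call"] = service_name
    headers["X-Internal-Signature"] = hmac.new(
        SHARED_SECRET, service_name.encode(), hashlib.sha256).hexdigest()

def is_internal_call(headers):
    name = headers.get("X-Internal-Call", "")
    signature = headers.get("X-Internal-Signature", "")
    expected = hmac.new(SHARED_SECRET, name.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)  # constant-time compare
```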


What is Expected at Each Level

Mid-Level

  • Know at least two algorithms (token bucket + fixed window) and explain how they work
  • Design a basic rate limiter using Redis INCR
  • Understand why we need centralized storage (Redis) for distributed rate limiting
  • Know to return 429 with Retry-After header
  • Identify the race condition problem with read-then-write

Senior

  • Implement the sliding window counter algorithm with the weighted formula
  • Design the full architecture: API Gateway + Redis + configurable rules
  • Back-of-envelope calculations for Redis memory and throughput
  • Discuss fail-open vs fail-closed with concrete mitigation strategies
  • Handle multi-tier rate limiting (different limits for different client tiers)
  • Discuss rate limiting at different layers (edge, gateway, application)
  • Explain why Lua scripts solve the race condition problem

Staff+

  • Design the rule configuration pipeline (DynamoDB + Streams + SNS for real-time propagation)
  • Local fallback with circuit breaker for Redis failures
  • Service mesh integration (Envoy sidecar pattern)
  • Adaptive rate limiting: dynamically adjust limits based on backend health (increase limits when backend is healthy, decrease when it is struggling)
  • Cross-region rate limiting: how to maintain global rate limits when you have API Gateways in 5 regions
  • Cost analysis: Redis cluster sizing for different traffic patterns
  • Observability: dashboards for rate limit hit rate, false positive rate, latency impact
