HLD: Rate Limiter

Understanding the Problem

What is a Rate Limiter?

A rate limiter controls the rate of traffic a client can send to a service. It protects backend services from being overwhelmed by excessive requests -- whether from a misbehaving client, a DDoS attack, or an organic traffic spike. Think of it as a bouncer at a club: people get in, but only N per minute. At Amazon's scale, rate limiting is critical for every API -- from product search to checkout to third-party seller APIs.

Note: This document covers the high-level / distributed system design of a rate limiter. For the class-level (LLD) design -- including full OOP implementation of the algorithms -- see LLD Rate Limiter (file 15).

Functional Requirements

Core (above the line):

  1. Limit requests per client -- given a client identifier (user ID, API key, IP), allow at most N requests per time window
  2. Return clear feedback -- when rate limited, return HTTP 429 with Retry-After header
  3. Support multiple rules -- different rate limits for different endpoints (e.g., 100 req/min for search, 10 req/min for checkout)
  4. Configurable limits -- rate limit rules can be updated without redeployment

Below the line (mention but don't design):

  • Rate limit by geographic region
  • Adaptive rate limiting (auto-adjust based on backend health)
  • Rate limit exemptions (whitelisted clients)
  • Billing-tier-based limits

Non-Functional Requirements

  1. Low latency overhead -- rate limiting check must complete in < 1ms (it is on the critical path of every request)
  2. High availability -- if the rate limiter goes down, we should fail open (allow traffic) rather than block everything
  3. Accuracy -- slight inaccuracy (±5% over or under the limit) is acceptable, but we must never allow 10x the limit
  4. Scale -- 1M requests/sec across all services, 10M unique clients

The Set Up

Core Entities

| Entity           | Description                                                                      |
| ---------------- | -------------------------------------------------------------------------------- |
| RateLimitRule    | id, endpoint, clientTier, maxRequests, windowSeconds, algorithm                   |
| RateLimitCounter | clientId, endpoint, count, windowStart (or token count, depending on algorithm)   |
| Client           | clientId, apiKey, tier (free/pro/enterprise), whitelisted                         |

API Design

The rate limiter is not a user-facing API -- it is middleware. But it exposes internal APIs for configuration:

Check rate limit (internal, called by API gateway):

// This happens internally, not as a REST call, but conceptually:
RateLimitResult checkRateLimit(clientId, endpoint)

Returns:
{
  "allowed": true/false,
  "remaining": 87,        // requests remaining in current window
  "limit": 100,           // max requests allowed
  "retryAfterMs": 0       // 0 if allowed, otherwise ms until window resets
}

HTTP response when rate limited:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1705312860
Retry-After: 45
Content-Type: application/json

{
  "error": "RATE_LIMIT_EXCEEDED",
  "message": "Rate limit exceeded. Try again in 45 seconds."
}

HTTP response headers on normal requests:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1705312860

Update rate limit rules (admin API):

PUT /api/admin/rate-limit-rules/{ruleId}
Authorization: Bearer <admin-token>

Request:
{
  "endpoint": "/api/search",
  "clientTier": "free",
  "maxRequests": 100,
  "windowSeconds": 60,
  "algorithm": "sliding_window_counter"
}

High-Level Design

Architecture Overview

[Client] --> [Load Balancer] --> [API Gateway (Rate Limit Check)]
                                        |
                                  [Rate Limit Service]
                                        |
                                  [Redis Cluster (counters)]
                                        |
                                  [Rules Config Store (DynamoDB)]
                                        |
                                  (if allowed)
                                        v
                                  [Backend Service]

Flow 1: Request Rate Limit Check

Step-by-step:

  1. Client sends a request to GET /api/search?q=headphones
  2. Request hits the API Gateway (could be AWS API Gateway, Kong, or custom)
  3. API Gateway extracts the client identifier: apiKey from header, or fallback to IP address
  4. API Gateway calls the Rate Limit Service (co-located sidecar or separate microservice)
  5. Rate Limit Service:
     a. Loads the applicable rule for this endpoint + client tier from a local cache (rules are synced from DynamoDB every 30 seconds)
     b. Executes the rate limit algorithm against Redis. For token bucket (example):
        Key: ratelimit:{clientId}:{endpoint}
        Operation: atomic check-and-decrement via Lua script
     c. Returns allowed: true/false with remaining count
  6. If allowed: API Gateway forwards the request to the backend service
  7. If denied: API Gateway returns 429 with appropriate headers
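
A minimal sketch of steps 3-7 as gateway middleware, assuming a Flask-style gateway; rate_limit_service and its check method are illustrative stand-ins for the Rate Limit Service client, not a real library API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.before_request
def enforce_rate_limit():
    # Step 3: extract the client identifier (API key, falling back to IP)
    client_id = request.headers.get("X-Api-Key") or request.remote_addr
    # Steps 4-5: ask the Rate Limit Service (sidecar or microservice)
    result = rate_limit_service.check(client_id, request.path)
    if result.allowed:
        return None  # Step 6: fall through to the backend handler
    # Step 7: deny with 429 and the standard headers
    retry_after_s = result.retry_after_ms // 1000
    resp = jsonify(error="RATE_LIMIT_EXCEEDED",
                   message=f"Rate limit exceeded. Try again in {retry_after_s} seconds.")
    resp.status_code = 429
    resp.headers["X-RateLimit-Limit"] = str(result.limit)
    resp.headers["X-RateLimit-Remaining"] = "0"
    resp.headers["Retry-After"] = str(retry_after_s)
    return resp
```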

Flow 2: Rule Configuration Update

  1. Admin updates a rule via PUT /api/admin/rate-limit-rules/{ruleId}
  2. Rule is written to DynamoDB (durable config store)
  3. A DynamoDB Stream triggers a Lambda function
  4. Lambda publishes a RuleUpdated event to SNS
  5. All Rate Limit Service instances subscribe to the SNS topic
  6. Each instance updates its local rule cache within seconds
  7. New requests are evaluated against the updated rule

No restart or redeployment needed. Rules propagate in < 10 seconds.
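
A hedged sketch of steps 5-6 on the service side, assuming each instance consumes the SNS topic through its own SQS queue (a common SNS fan-out pattern); the queue URL and message shape are illustrative:

```python
import json
import threading

import boto3

class RuleCache:
    def __init__(self):
        self._rules = {}  # (endpoint, tier) -> rule dict
        self._lock = threading.Lock()

    def get(self, endpoint, tier):
        with self._lock:
            return self._rules.get((endpoint, tier))

    def apply_update(self, rule):
        with self._lock:
            self._rules[(rule["endpoint"], rule["clientTier"])] = rule

def poll_rule_updates(cache, queue_url):
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])      # SNS delivery envelope
            rule = json.loads(body["Message"])  # the RuleUpdated payload
            cache.apply_update(rule)            # step 6: refresh local cache
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
```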


Rate Limiting Algorithms

Algorithm Comparison Table

| Algorithm              | Accuracy                | Memory          | Burst Handling              | Complexity | Best For                                 |
| ---------------------- | ----------------------- | --------------- | --------------------------- | ---------- | ---------------------------------------- |
| Token Bucket           | High                    | O(1) per client | Allows controlled bursts    | Low        | General purpose, API gateways            |
| Fixed Window Counter   | Medium (boundary burst) | O(1) per client | 2x burst at window boundary | Very Low   | Simple use cases                         |
| Sliding Window Log     | Perfect                 | O(N) per client | No bursts allowed           | Medium     | When exact accuracy is critical          |
| Sliding Window Counter | High (~99.x%)           | O(1) per client | Minimal boundary burst      | Medium     | Best balance of accuracy and efficiency  |

Algorithm 1: Token Bucket

How it works:

  • Each client has a "bucket" that holds up to maxTokens tokens
  • Tokens are added at a fixed rate: refillRate tokens per second
  • Each request costs 1 token. If the bucket is empty, the request is rejected.
  • Tokens accumulate up to maxTokens, allowing short bursts

Example: maxTokens = 10, refillRate = 2/sec

  • At t=0: bucket has 10 tokens. Client sends 10 requests instantly -- all allowed. Bucket: 0.
  • At t=1: 2 tokens added. Client can send 2 more. Bucket: 0 after 2 requests.
  • At t=6: bucket refilled to 10 tokens (capped at max). Client can burst again.

Redis Implementation (Lua script):

```lua
-- Token Bucket Rate Limiter
-- KEYS[1] = ratelimit:{clientId}:{endpoint}
-- ARGV[1] = maxTokens, ARGV[2] = refillRate, ARGV[3] = now (unix timestamp ms)

local key = KEYS[1]
local maxTokens = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])  -- tokens per second
local now = tonumber(ARGV[3])

local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
local tokens = tonumber(bucket[1])
local lastRefill = tonumber(bucket[2])

-- Initialize if first request
if tokens == nil then
    tokens = maxTokens
    lastRefill = now
end

-- Calculate tokens to add since last refill
local elapsed = (now - lastRefill) / 1000  -- convert ms to seconds
local newTokens = math.floor(elapsed * refillRate)
tokens = math.min(maxTokens, tokens + newTokens)

-- Update last refill time only if tokens were added
if newTokens > 0 then
    lastRefill = now
end

-- Try to consume a token
local allowed = 0
if tokens >= 1 then
    tokens = tokens - 1
    allowed = 1
end

-- Save state
redis.call('HMSET', key, 'tokens', tokens, 'lastRefill', lastRefill)
redis.call('EXPIRE', key, math.ceil(maxTokens / refillRate) + 10)  -- TTL for cleanup

return {allowed, tokens}  -- {1 or 0, remaining tokens}
```
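
For illustration, this is how the script might be invoked from application code with redis-py; TOKEN_BUCKET_LUA is assumed to hold the Lua source above (register_script loads it once and reuses it via EVALSHA):

```python
import time

import redis

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)  # Lua source above (assumed)

def check(client_id, endpoint, max_tokens=10, refill_rate=2):
    key = f"ratelimit:{client_id}:{endpoint}"
    now_ms = int(time.time() * 1000)
    allowed, remaining = token_bucket(keys=[key],
                                      args=[max_tokens, refill_rate, now_ms])
    return allowed == 1, remaining
```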

Pros: Allows controlled bursts (great for APIs where clients send requests in batches). Memory-efficient (O(1) per client -- just two values).

Cons: Burst size equals maxTokens, which might be too large for some use cases. Harder to reason about than a simple "N per minute" rule.


Algorithm 2: Fixed Window Counter

How it works:

  • Divide time into fixed windows (e.g., each minute: 10:00-10:01, 10:01-10:02)
  • Each client has a counter per window
  • If counter < limit, allow and increment. If counter >= limit, reject.

Example: limit = 100 per minute

  • Window 10:00-10:01: client sends 100 requests -- all allowed. Counter = 100.
  • 101st request in the same window: rejected.
  • Window 10:01-10:02: counter resets to 0. Client can send 100 more.

The boundary problem: At 10:00:59, client sends 100 requests (allowed). At 10:01:01, client sends 100 more (new window, allowed). That is 200 requests in 2 seconds -- 2x the intended rate.

Redis Implementation:

```lua
-- Fixed Window Counter
-- KEYS[1] = ratelimit:{clientId}:{endpoint}:{windowId}
-- ARGV[1] = limit, ARGV[2] = windowSeconds

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local windowSeconds = tonumber(ARGV[2])

local current = tonumber(redis.call('GET', key) or '0')

if current >= limit then
    return {0, 0}  -- rejected, 0 remaining
end

local newCount = redis.call('INCR', key)
if newCount == 1 then
    redis.call('EXPIRE', key, windowSeconds)
end

return {1, limit - newCount}  -- allowed, remaining
```

The windowId is computed client-side: math.floor(now / windowSeconds).
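
A short sketch of that client-side computation with redis-py, assuming FIXED_WINDOW_LUA holds the script above:

```python
import time

import redis

r = redis.Redis()
fixed_window = r.register_script(FIXED_WINDOW_LUA)  # Lua source above (assumed)

def check(client_id, endpoint, limit=100, window_seconds=60):
    window_id = int(time.time()) // window_seconds  # fixed window this request falls in
    key = f"ratelimit:{client_id}:{endpoint}:{window_id}"
    allowed, remaining = fixed_window(keys=[key], args=[limit, window_seconds])
    return allowed == 1, remaining
```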

Pros: Extremely simple. Low memory (one counter per client per window). Redis INCR is atomic, no Lua script strictly needed for the basic case.

Cons: Boundary burst problem (up to 2x limit). Not suitable when strict rate enforcement is needed.


Algorithm 3: Sliding Window Counter

How it works: A hybrid that combines the memory efficiency of fixed windows with the accuracy of sliding windows. It uses two adjacent fixed windows and weights them by how much of the current sliding window overlaps with each.

Formula:

weightedCount = (previousWindowCount * overlapPercentage) + currentWindowCount

Example: limit = 100 per minute, current time is 10:01:15 (15 seconds into the current window)

  • Previous window (10:00-10:01) had 84 requests
  • Current window (10:01-10:02) has 36 requests so far
  • Overlap of previous window with our sliding window: (60 - 15) / 60 = 75%
  • Weighted count = 84 * 0.75 + 36 = 63 + 36 = 99
  • Next request: 99 < 100, allowed
  • One more after that: 100 >= 100, rejected

Redis Implementation (Lua script):

```lua
-- Sliding Window Counter
-- KEYS[1] = ratelimit:{clientId}:{endpoint}
-- ARGV[1] = limit, ARGV[2] = windowSeconds, ARGV[3] = now (unix timestamp seconds)

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local windowSeconds = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Determine current and previous window IDs
local currentWindow = math.floor(now / windowSeconds)
local previousWindow = currentWindow - 1

-- Get counts for both windows
local prevKey = key .. ':' .. previousWindow
local currKey = key .. ':' .. currentWindow

local prevCount = tonumber(redis.call('GET', prevKey) or '0')
local currCount = tonumber(redis.call('GET', currKey) or '0')

-- Calculate the overlap of the previous window with the sliding window
local elapsedInCurrentWindow = now % windowSeconds
local overlapPercentage = (windowSeconds - elapsedInCurrentWindow) / windowSeconds

-- Weighted count
local weightedCount = math.floor(prevCount * overlapPercentage) + currCount

if weightedCount >= limit then
    return {0, 0, math.ceil((1 - elapsedInCurrentWindow / windowSeconds) * windowSeconds)}
    -- rejected, 0 remaining, retry-after seconds
end

-- Increment current window counter
redis.call('INCR', currKey)
redis.call('EXPIRE', currKey, windowSeconds * 2)  -- keep for 2 windows

local remaining = limit - weightedCount - 1
return {1, remaining, 0}  -- allowed, remaining, 0 retry-after
```

Pros: Near-perfect accuracy (Cloudflare tested this at 99.x% accuracy compared to a true sliding window). O(1) memory per client (just two counters). No boundary burst problem.

Cons: Slightly more complex than fixed window. The weighted count is an approximation -- in theory, a client could get a few extra requests through under specific timing conditions. In practice, this is negligible.

This is the recommended algorithm for most use cases. It gives the best balance of accuracy, memory efficiency, and simplicity.


Sliding Window Log (Mentioned for Completeness)

Stores the timestamp of every request in a sorted set. To check the rate:

ZREMRANGEBYSCORE ratelimit:{clientId} 0 (now - windowSeconds)
count = ZCARD ratelimit:{clientId}
if count < limit: ZADD ratelimit:{clientId} now now

Perfect accuracy but O(N) memory per client where N is the limit. For a limit of 1000/min with 10M clients, that is 10B entries. Not practical at scale.
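
For reference, a direct redis-py translation of those commands; note the check and the ZADD are separate round trips here, so a production version would wrap the same commands in a Lua script for atomicity:

```python
import time

import redis

r = redis.Redis()

def allow(client_id, limit=1000, window_seconds=60):
    key = f"ratelimit:{client_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # evict timestamps outside the window
    pipe.zcard(key)                                      # count requests still in the window
    _, count = pipe.execute()
    if count >= limit:
        return False
    r.zadd(key, {str(now): now})   # log this request's timestamp
    r.expire(key, window_seconds)  # let idle clients' logs expire
    return True
```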


Potential Deep Dives

Deep Dive 1: Distributed Rate Limiting with Redis

The Problem: We have multiple API Gateway instances. A client's requests might hit different instances. The rate limit counter must be shared across all instances.

Bad Solution -- Local counters per instance: If we have 10 API Gateway instances and each maintains its own counter, a client with a limit of 100/min could actually send 1000/min (100 per instance). Useless.

Good Solution -- Centralized Redis: All instances read/write to the same Redis key. The Lua scripts above are atomic -- Redis is single-threaded, so no race conditions within a single Redis node.

Potential issue: Redis latency. A Redis call takes ~0.5ms on the local network. For our 1ms budget, this is tight but workable.

Great Solution -- Redis Cluster with smart sharding:

  • Shard rate limit keys across a Redis Cluster (e.g., 6 primary shards, each with a replica)
  • The key ratelimit:{clientId}:{endpoint} hashes to a specific shard
  • All requests for the same client hit the same Redis shard -- no cross-shard coordination needed
  • Each primary handles ~200K ops/sec. 6 primaries = 1.2M ops/sec capacity. Our 1M requests/sec are covered.

For even lower latency: Deploy Redis replicas in the same availability zone as each API Gateway cluster. Read from local replica (eventual consistency -- counter might be off by 1-2 requests). Write to primary. The slight inaccuracy (±5%) is within our tolerance.

Deep Dive 2: Race Conditions in Counter Updates

The Problem: Two requests from the same client arrive at the exact same millisecond on two different API Gateway instances. Both read the counter as 99 (limit is 100). Both increment to 100. Both are allowed. But the true count is 101 -- we exceeded the limit.

Bad Solution -- Read-then-write in application code:

```python
count = int(redis.get(key) or 0)  # Both threads read 99
if count < limit:
    redis.incr(key)               # Both increment to 100
    allow()                       # Both requests pass
```

Classic TOCTOU (time-of-check, time-of-use) race condition.

Good Solution -- Atomic INCR with check:

```python
new_count = redis.incr(key)   # Atomic -- Redis serializes these
if new_count > limit:
    redis.decr(key)           # Undo the increment
    reject()
else:
    allow()
```

This is atomic for the increment, but the decrement creates a brief moment where the counter is over-limit. Also, if the process crashes between INCR and DECR, the counter is permanently off.

Great Solution -- Lua scripts (what we use): The entire check-and-update logic runs as a single atomic Lua script in Redis. No other Redis command can interleave. The Lua scripts shown in the algorithm section above handle this correctly.

Why Lua scripts are atomic: Redis is single-threaded. A Lua script executes as one uninterruptible operation. Even if 1000 requests arrive simultaneously, Redis processes them one by one through the Lua script. Zero race conditions.

Deep Dive 3: Rate Limiting at Different Layers

Rate limiting can happen at multiple points in the stack. Each has trade-offs:

Layer 1 -- CDN / Edge (CloudFront, Cloudflare):

  • Block obviously abusive traffic (DDoS) before it reaches your infrastructure
  • Simple rules: IP-based, geographic blocking, request rate per IP
  • Pros: stops bad traffic at the edge, protects your bandwidth
  • Cons: coarse-grained, cannot do per-user or per-API-key limiting

Layer 2 -- API Gateway (Kong, AWS API Gateway, Envoy):

  • Per-API-key, per-endpoint rate limiting
  • This is where our main rate limiter lives
  • Pros: centralized enforcement, supports complex rules
  • Cons: adds latency to every request (must be < 1ms)

Layer 3 -- Application (middleware in each service):

  • Service-specific rate limits (e.g., the checkout service limits add-to-cart to 5/sec per user)
  • Protects against specific abuse patterns
  • Pros: fine-grained, context-aware
  • Cons: each service implements its own limiter, inconsistent

Layer 4 -- Database (connection pool limits):

  • Not a traditional rate limiter, but connection pools and query timeouts protect the DB
  • Pros: last line of defense
  • Cons: not user-facing, results in 500 errors not 429

Recommended architecture: Use Layers 1 + 2 as the primary defense. Layer 3 for specific high-value endpoints. Layer 4 as a safety net.

Deep Dive 4: Handling Rate Limiter Failures

The Problem: What if Redis goes down? The rate limiter cannot check counters. Do we block all traffic (fail closed) or allow all traffic (fail open)?

Fail closed (block all):

  • Safest for the backend, but catastrophic for users. A Redis failure causes a complete service outage. This is worse than the problem we are trying to solve.

Fail open (allow all):

  • Users are not impacted. Backend might get a traffic spike, but it was handling traffic before the Redis failure. This is the correct default for most services.

Implementation:

```python
def check_rate_limit(client_id, endpoint):
    try:
        # redis-py EVAL signature: eval(script, numkeys, *keys_and_args)
        result = redis.eval(lua_script, 1, key, limit, window, now)
        return result[0] == 1  # first element of the reply: 1 = allowed
    except ConnectionError:  # redis.exceptions.ConnectionError
        # Fail open -- allow the request
        metrics.increment("rate_limiter.redis_failure")
        return True  # allowed
```

Mitigation for fail-open periods:

  1. Local in-memory fallback: Each API Gateway maintains a local token bucket as a backup (see the sketch after this list). Less accurate (per-instance, not global) but prevents complete loss of rate limiting.
  2. Circuit breaker on Redis: If Redis fails 5 times in 10 seconds, trip the circuit breaker. Stop trying Redis for 30 seconds (avoid wasting time on timeouts). Use local fallback during this period.
  3. Alerting: Trigger a PagerDuty alert immediately when the rate limiter is in fail-open mode.
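
A minimal sketch of mitigations 1 and 2 combined; local_buckets, circuit_breaker, run_redis_check, and metrics are illustrative placeholders, and the thresholds come from the list above:

```python
import time

class LocalTokenBucket:
    """Per-instance fallback bucket -- approximate, not globally accurate."""

    def __init__(self, max_tokens, refill_rate):
        self.max_tokens, self.refill_rate = max_tokens, refill_rate
        self.tokens, self.last_refill = float(max_tokens), time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.max_tokens,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def check_rate_limit(client_id, endpoint):
    bucket = local_buckets[(client_id, endpoint)]  # hypothetical per-instance state
    if circuit_breaker.is_open():        # tripped: skip Redis for 30s, use fallback
        return bucket.allow()
    try:
        allowed = run_redis_check(client_id, endpoint)  # the Lua script call
        circuit_breaker.record_success()
        return allowed
    except ConnectionError:              # Redis unreachable
        circuit_breaker.record_failure() # trips after 5 failures in 10 seconds
        metrics.increment("rate_limiter.redis_failure")
        return bucket.allow()            # fail open through the local bucket
```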

Deep Dive 5: Multi-Tier Rate Limiting

The Problem: Different clients need different limits. A free-tier user gets 100 req/min, a Pro user gets 1000 req/min, and an Enterprise user gets 10,000 req/min. How do we make this configurable and efficient?

Rule evaluation pipeline:

Request arrives with API key "ak_xyz"
    |
    v
1. Look up client tier from local cache: "ak_xyz" -> { tier: "pro", userId: "user_123" }
   (Cached from DynamoDB, refreshed every 60 seconds)
    |
    v
2. Look up applicable rules for (endpoint, tier):
   "/api/search" + "pro" -> { limit: 1000, window: 60, algorithm: "sliding_window_counter" }
   (Also from local cache)
    |
    v
3. Run the algorithm against Redis with the resolved parameters
    |
    v
4. Return allowed/denied with remaining count

Hierarchical limits: A single request might be checked against multiple rules:

  • Per-endpoint: 1000 req/min for /api/search
  • Global per-client: 5000 req/min across all endpoints
  • Per-IP: 10,000 req/min (catches abuse even with rotating API keys)

All checks must pass. If any limit is exceeded, the request is rejected. The API Gateway runs all three checks in parallel (three Redis calls) and takes the most restrictive result, as in the sketch below.
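
A sketch of the hierarchical check, assuming r is a redis-py client and sliding_window is the registered Lua script from earlier; the three evaluations are batched into one Redis round trip via a pipeline:

```python
def check_all(user_id, ip, endpoint, now):
    checks = [
        (f"ratelimit:{user_id}:{endpoint}", 1000),  # per-endpoint
        (f"ratelimit:{user_id}:global", 5000),      # global per-client
        (f"ratelimit:ip:{ip}", 10000),              # per-IP
    ]
    pipe = r.pipeline()
    for key, limit in checks:
        # redis-py lets a registered script queue onto a pipeline via client=
        sliding_window(keys=[key], args=[limit, 60, now], client=pipe)
    results = pipe.execute()  # [[allowed, remaining, retryAfter], ...]
    allowed = all(res[0] == 1 for res in results)
    retry_after = max(res[2] for res in results)  # most restrictive result wins
    return allowed, retry_after
```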

Back-of-envelope for Redis memory:

  • 10M unique clients * 3 rules each * ~100 bytes per counter = 3 GB. Trivial for Redis.

Deep Dive 6: Rate Limiting in a Microservices Architecture

Pattern: Sidecar rate limiter (Envoy + external rate limit service)

In a service mesh (Istio/Envoy), rate limiting is implemented as:

  1. Each service has an Envoy sidecar proxy
  2. Envoy is configured with rate limit descriptors:
    ```yaml
    rate_limits:
      - actions:
          - request_headers:
              header_name: "x-api-key"
              descriptor_key: "api_key"
          - request_headers:
              header_name: ":path"
              descriptor_key: "path"
    ```
  3. Envoy sends a gRPC request to the Rate Limit Service (Lyft's open-source ratelimit service)
  4. Rate Limit Service checks Redis and responds with ALLOW or DENY
  5. Envoy either forwards the request to the backend or returns 429

Advantages:

  • Rate limiting is decoupled from application code
  • Consistent enforcement across all services
  • Rules are managed centrally via config (not code deploys)
  • Envoy adds < 0.5ms overhead for the gRPC call

Rate limiting across services (cascading): When Service A calls Service B, Service B should have its own rate limits. But Service B should distinguish between "external user via API Gateway" and "internal service call." Internal calls typically get higher limits or are exempt.

Use a header like X-Internal-Call: true (signed with a shared secret) to identify internal traffic.
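
A sketch of that header scheme with HMAC-SHA256; the header names and the secret's source are illustrative, and a production version would also sign a timestamp to prevent replay:

```python
import hashlib
import hmac

SHARED_SECRET = b"..."  # fetched from a secrets manager, never hardcoded

def sign_internal_call(headers, service_name):
    headers["X-Internal-Call"] = service_name
    headers["X-Internal-Signature"] = hmac.new(
        SHARED_SECRET, service_name.encode(), hashlib.sha256).hexdigest()

def is_internal_call(headers):
    name = headers.get("X-Internal-Call", "")
    signature = headers.get("X-Internal-Signature", "")
    expected = hmac.new(SHARED_SECRET, name.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)  # constant-time compare
```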


What is Expected at Each Level

Mid-Level

  • Know at least two algorithms (token bucket + fixed window) and explain how they work
  • Design a basic rate limiter using Redis INCR
  • Understand why we need centralized storage (Redis) for distributed rate limiting
  • Know to return 429 with Retry-After header
  • Identify the race condition problem with read-then-write

Senior

  • Implement the sliding window counter algorithm with the weighted formula
  • Design the full architecture: API Gateway + Redis + configurable rules
  • Back-of-envelope calculations for Redis memory and throughput
  • Discuss fail-open vs fail-closed with concrete mitigation strategies
  • Handle multi-tier rate limiting (different limits for different client tiers)
  • Discuss rate limiting at different layers (edge, gateway, application)
  • Explain why Lua scripts solve the race condition problem

Staff+

  • Design the rule configuration pipeline (DynamoDB + Streams + SNS for real-time propagation)
  • Local fallback with circuit breaker for Redis failures
  • Service mesh integration (Envoy sidecar pattern)
  • Adaptive rate limiting: dynamically adjust limits based on backend health (increase limits when backend is healthy, decrease when it is struggling)
  • Cross-region rate limiting: how to maintain global rate limits when you have API Gateways in 5 regions
  • Cost analysis: Redis cluster sizing for different traffic patterns
  • Observability: dashboards for rate limit hit rate, false positive rate, latency impact
