HLD: Multi-Tenant Rate Limiter
Understanding the Problem
What is a Multi-Tenant Rate Limiter?
A rate limiter enforces per-tenant, per-endpoint request budgets across a distributed API fleet. Essentials customers get 15k/day, Enterprise gets 1M/day, internal services get no cap. The service must make a decision in under 2 ms on every request, stay accurate to within 5% of the advertised limit, and keep working when Redis has a blip. This is Salesforce's governor-limit philosophy applied at the API layer: cap the worst case rather than optimize the best. It's a common HLD problem on its own and a frequent deep dive inside bigger designs (notifications, dashboards).
Functional Requirements
Core (above the line):
- Evaluate a request against its applicable limits — `(orgId, endpoint)`, `(orgId, user)`, `(orgId, global)` — and return allowed/denied in < 2 ms.
- Return `Retry-After` on denial so clients know when to retry.
- Configurable policies per org tier — Essentials: 15k/day, Enterprise: 1M/day, Unlimited: soft caps only.
- Burst-friendly — permit short spikes (client has a backlog to drain).
- Observability — per-tenant, per-endpoint consumption dashboards.
Below the line (out of scope):
- Global concurrency caps (tracked by an admission service, not the rate limiter).
- IP-based DDoS protection — edge CDN / WAF handles that.
- Per-user geographic rules — belongs in geo-fencing policy layer.
Non-Functional Requirements
Core:
- Scale: 1M API RPS across all tenants at peak.
- Latency: rate-limit check p99 < 2 ms.
- Accuracy: within 5% of the advertised limit for soft limits; strict caps for billing / compliance hard limits.
- Availability: survives Redis failure. Fail-mode is explicit (open or closed) per policy.
- Multi-tenancy: keys strictly namespaced by `orgId`. Per-org observability.
Below the line:
- Real-time cross-shard consistency — sub-millisecond accuracy across all shards is out of scope (eventual consistency within 1 s is fine).
- Dynamic reconfiguration under 100 ms (propagation within 1 s is acceptable).
Capacity Estimation
- 1M RPS means 1M Redis GET/INCR ops/s minimum. With Lua atomicity and a 12-node Redis cluster, each shard handles ~85k ops/s — well within Redis single-threaded capacity (~200k ops/s).
- Memory: each limit key is ~100 bytes. 10M active `(org, endpoint, window)` combinations = ~1 GB per replica. Fits comfortably in a standard Redis node.
- Policy storage: 100k orgs × 100 endpoint patterns × 10 overrides = 100M policy rows as a worst-case bound. In practice most orgs ride the tier defaults, so the table stays far smaller — and it is cached in-process either way.
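The same arithmetic as a runnable back-of-envelope check (TypeScript; the numbers are copied from the estimates above, nothing new):

```typescript
// Back-of-envelope check of the capacity estimates.
const fleetRps = 1_000_000;
const redisShards = 12;
const perShardOps = fleetRps / redisShards;   // ≈ 83k ops/s per shard (~85k cited)
const counterGB = (10_000_000 * 100) / 1e9;   // 10M keys × 100 B ≈ 1 GB
const policyRows = 100_000 * 100 * 10;        // 100M rows, worst case
console.log({ perShardOps, counterGB, policyRows });
```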
The Set Up
Core Entities
- Policy — `(policyId, orgTier, endpointPattern, limit, window, algorithm, failMode)`. Defines "this tier, this endpoint, this limit."
- CounterKey (derived at check time) — `rl:{algo}:{orgId}:{endpoint}:{window}`.
- TenantOverride — per-org tweaks to the default policy. E.g., "Org X gets 10x the default for `POST /records`."
- PolicyDecision — `(allowed, remaining, retryAfterMs, policyId)` returned to the caller.
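A minimal sketch of these entities as TypeScript types — field names mirror the tuples above; the tier and algorithm unions are illustrative, not exhaustive:

```typescript
// Illustrative types for the core entities; names mirror the list above.
type Algorithm = "token_bucket" | "sliding_window_counter" | "fixed_window";
type FailMode = "open" | "closed";

interface Policy {
  policyId: string;
  orgTier: "Essentials" | "Enterprise" | "Unlimited";
  endpointPattern: string;  // e.g., "POST /records"
  limit: number;            // requests per window
  window: string;           // e.g., "1m", "1h", "24h"
  algorithm: Algorithm;
  failMode: FailMode;
}

interface PolicyDecision {
  allowed: boolean;
  remaining: number;
  retryAfterMs: number;
  policyId: string;
}
```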
The API
Two planes: data plane runs on every request; control plane manages policies.
Data plane (in-request, sync):

```
CheckLimit(orgId, userId, endpoint) → { allowed: bool, retryAfterMs, remaining }
```

Implemented as a gRPC service in a local sidecar, as an Envoy external-authz filter, or as a Kong plugin.
Control plane:
```
PUT /v1/ratelimit/policies
Body: { "orgTier": "Enterprise", "endpointPattern": "POST /records", "limit": 1000, "window": "1m", "algorithm": "token_bucket", "failMode": "open" }

GET /v1/orgs/{orgId}/ratelimit/usage
→ time-series of remaining budget per endpoint

POST /v1/orgs/{orgId}/ratelimit/overrides
Body: { "endpointPattern": "...", "limit": 5000 }
```

High-Level Design
Architecture
```
Request ──▶ Envoy/Kong ──▶ RateLimit Filter ──▶ Redis Cluster (Lua)
   │            │                │
   │            │                ▼
   │            │         Policy Cache (in-proc)
   │            ▼
   │    Local token bucket (fallback)
   ▼
App Service
```

End-to-end flow: a request arrives
- Request hits Envoy / Kong at the edge with `Authorization: Bearer <jwt>`.
- RateLimit Filter (Envoy external authz or Kong plugin) extracts `orgId`, `userId`, `endpoint`.
- Filter consults the Policy Cache (in-process, 1 s TTL) to find the applicable policy for this `(orgTier, endpoint)` plus any tenant overrides.
- Filter executes a Lua script on Redis Cluster to atomically check and update the token bucket.
- Redis returns `(allowed, remaining)`. If allowed, the request continues to the app; if denied, the filter returns 429 with `Retry-After` and `X-RateLimit-*` headers.
- Every denial is emitted to the Kafka topic `ratelimit.denials` for dashboards and anomaly alerting.
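Condensed into code, the check path looks roughly like this sketch (TypeScript with ioredis-style `eval`; `policyCache`, `TOKEN_BUCKET_LUA`, and the `ratePerSec` / `burstCapacity` field names are assumptions standing in for the pieces described above):

```typescript
import Redis from "ioredis";

// Sketch of the in-request check path. TOKEN_BUCKET_LUA is the script shown
// in deep dive 1; policyCache is the 1 s-TTL in-process cache. The Redis-down
// case is handled in deep dive 2 and omitted here.
declare const policyCache: {
  lookup(orgId: string, tier: string, endpoint: string):
    { policyId: string; ratePerSec: number; burstCapacity: number };
};
declare const TOKEN_BUCKET_LUA: string;

const redis = new Redis.Cluster([{ host: "redis", port: 6379 }]);

export async function checkLimit(orgId: string, tier: string, endpoint: string) {
  const policy = policyCache.lookup(orgId, tier, endpoint);
  const [allowed, remaining] = (await redis.eval(
    TOKEN_BUCKET_LUA,
    2,
    `rl:tok:{${orgId}}:${endpoint}`, // hash tag keeps both keys on one slot
    `rl:ts:{${orgId}}:${endpoint}`,
    String(policy.ratePerSec),
    String(policy.burstCapacity),
    String(Date.now()),
    "1", // cost of this request
  )) as [number, number];
  return { allowed: allowed === 1, remaining, policyId: policy.policyId };
}
```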
End-to-end flow: policy update
- Admin changes a policy via `PUT /v1/ratelimit/policies`.
- Policy Service writes to Postgres.
- CDC (Debezium) publishes the change to the `policy.updates` Kafka topic.
- Every RateLimit Filter instance subscribes to `policy.updates` and invalidates its in-process cache for affected keys.
- Propagation latency: < 1 second across the fleet.
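A sketch of the subscriber side using kafkajs. The `{ orgId }` message shape is an illustrative simplification — a real Debezium envelope carries before/after row images — and each instance uses a unique consumer group so every filter sees every update:

```typescript
import { Kafka } from "kafkajs";

// Each filter instance runs this consumer and evicts affected cache entries.
// A unique groupId per instance makes the topic behave as a broadcast.
const kafka = new Kafka({ clientId: "ratelimit-filter", brokers: ["kafka:9092"] });
const consumer = kafka.consumer({ groupId: `filter-${process.pid}` });

export async function subscribePolicyUpdates(
  cache: { invalidate(orgId: string): void },
) {
  await consumer.connect();
  await consumer.subscribe({ topic: "policy.updates", fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const change = JSON.parse(message.value!.toString());
      cache.invalidate(change.orgId); // drop stale entries; next lookup re-resolves
    },
  });
}
```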
Data model
- Policies: Postgres `policies` table. Mirrored to Redis via CDC for the rare case of cache-cold startup.
- Counters: Redis, ephemeral. TTL equals the window (e.g., 1 minute, 1 hour, 24 hours).
- Denial audit: Kafka `ratelimit.denials` topic for dashboards; retained 7 days.
- Usage metrics: Prometheus time-series labeled `org_id`, `endpoint`, `tier`.
Multi-Tenancy Strategy
Isolation level: L1 on the policy store (shared Postgres), L1 on Redis with strict key prefixing. Redis Cluster assigns slots via the {orgId} hash tag — this keeps a tenant's keys colocated on one slot, enabling multi-key Lua scripts without cross-slot errors.
Tenant context flow:
- `orgId` extracted from JWT at the edge.
- All Redis keys use the format `rl:tok:{orgId}:{endpoint}` — the curly braces tell Redis Cluster to hash only on `orgId`, keeping all of a tenant's rate-limit counters in the same shard.
- Every Prometheus metric is labeled with `org_id`, `endpoint`, `tier`.
- Denials published to Kafka are keyed by `orgId` for per-org stream processing.
Noisy-neighbor mitigations:
- Shuffle sharding of Redis nodes: each org is mapped to `k=3` of `n=12` Redis shards. A hot tenant's traffic only touches 25% of Redis capacity (see the sketch after this list).
- Meta rate limit: the rate limiter itself applies a cap on how often a single tenant can hit `/checkLimit`. Defense in depth against a bug in the policy cache.
- Per-tenant bulkhead on the RateLimit Filter: bounded goroutine / thread count per org so one tenant's Redis slowness cannot starve others' threads.
- Priority tiers for Redis access: Enterprise tenants get a dedicated Redis read/write pipeline; Essentials tenants share a common pool.
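A sketch of the shuffle-shard assignment — the hash-seeded shuffle is one deterministic way to pick `k` of `n` shards per org; the org ID below is illustrative:

```typescript
import { createHash } from "node:crypto";

// Deterministically map an org to k of n Redis shards (shuffle sharding).
// A hot tenant can only saturate its own k shards, never the whole cluster.
function shardsForOrg(orgId: string, n = 12, k = 3): number[] {
  // Fisher-Yates over [0..n-1], seeded by a hash of orgId; take the first k.
  const seed = createHash("sha256").update(orgId).digest();
  const ids = Array.from({ length: n }, (_, i) => i);
  for (let i = n - 1; i > 0; i--) {
    const j = seed[i % seed.length] % (i + 1);
    [ids[i], ids[j]] = [ids[j], ids[i]];
  }
  return ids.slice(0, k);
}

// Per request, pick one of the org's shards at random.
const mine = shardsForOrg("00Dxx0000001gPF");
const shard = mine[Math.floor(Math.random() * mine.length)];
```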
Per-tenant observability:
- Prometheus dashboards filter by `org_id`.
- Per-org billing: count of allowed vs denied per endpoint.
- Alerts when an org consistently approaches its limit — a signal to suggest an upgrade and reach out proactively.
Potential Deep Dives
1) Which algorithm — fixed window, sliding window log, sliding window counter, token bucket?
Walk through the comparison explicitly. Interviewers expect this table.
| Algorithm | Accuracy | Memory | Burst-friendly | Notes |
|---|---|---|---|---|
| Fixed window | Worst (boundary bursts) | O(1) | No | Simple counter; double-load at window boundaries. |
| Sliding window log | Best | O(N) per key | Yes | Store every request timestamp; expensive at 1M RPS. |
| Sliding window counter | Good | O(1) | Yes | Interpolates between two fixed windows. Reasonable default. |
| Token bucket | Good | O(1) | Yes | Chosen — matches "rate + burst" intuition. |
| Leaky bucket | Good (smoothed) | O(1) | No | Shapes traffic; strict throughput. |
Pick token bucket for APIs. Mention sliding-window-counter as an alternative if the interviewer wants something else.
The Lua implementation is the load-bearing piece — atomic check-and-update on Redis. The same script, invoked from Java, C++, and TypeScript:
Java:

```java
private static final String LUA = """
    local tokens_key = KEYS[1]
    local ts_key = KEYS[2]
    local rate = tonumber(ARGV[1])     -- tokens/sec
    local capacity = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])      -- ms
    local cost = tonumber(ARGV[4])
    local tokens = tonumber(redis.call('GET', tokens_key) or capacity)
    local last = tonumber(redis.call('GET', ts_key) or now)
    local delta = math.max(0, now - last) / 1000.0
    tokens = math.min(capacity, tokens + delta * rate)
    local allowed = tokens >= cost and 1 or 0
    if allowed == 1 then tokens = tokens - cost end
    redis.call('SET', tokens_key, tokens, 'PX', 3600000)
    redis.call('SET', ts_key, now, 'PX', 3600000)
    return { allowed, tokens }
    """;

List<String> keys = List.of(
    "rl:tok:{" + org + "}:" + ep,
    "rl:ts:{" + org + "}:" + ep);
// Jedis-style eval takes string args and returns the Lua array as a List.
List<Object> r = (List<Object>) redis.eval(LUA, keys,
    List.of(String.valueOf(rate), String.valueOf(cap),
        String.valueOf(now), "1"));
```

C++:

```cpp
static const char* kLua = R"LUA(
local tokens = tonumber(redis.call('GET', KEYS[1]) or ARGV[2])
local last = tonumber(redis.call('GET', KEYS[2]) or ARGV[3])
local delta = math.max(0, tonumber(ARGV[3]) - last) / 1000.0
tokens = math.min(tonumber(ARGV[2]), tokens + delta * tonumber(ARGV[1]))
local ok = tokens >= tonumber(ARGV[4]) and 1 or 0
if ok == 1 then tokens = tokens - tonumber(ARGV[4]) end
redis.call('SET', KEYS[1], tokens, 'PX', 3600000)
redis.call('SET', KEYS[2], ARGV[3], 'PX', 3600000)
return { ok, tokens }
)LUA";

auto reply = redis.Eval(kLua,
    {"rl:tok:{" + org + "}:" + ep, "rl:ts:{" + org + "}:" + ep},
    {std::to_string(rate), std::to_string(cap),
     std::to_string(now_ms), std::to_string(cost)});
```

TypeScript:

```typescript
const lua = `
local tokens = tonumber(redis.call('GET', KEYS[1]) or ARGV[2])
local last = tonumber(redis.call('GET', KEYS[2]) or ARGV[3])
local delta = math.max(0, tonumber(ARGV[3]) - last) / 1000.0
tokens = math.min(tonumber(ARGV[2]), tokens + delta * tonumber(ARGV[1]))
local ok = tokens >= tonumber(ARGV[4]) and 1 or 0
if ok == 1 then tokens = tokens - tonumber(ARGV[4]) end
redis.call('SET', KEYS[1], tokens, 'PX', 3600000)
redis.call('SET', KEYS[2], ARGV[3], 'PX', 3600000)
return { ok, tokens }
`;

const [allowed, remaining] = await redis.eval(
  lua, 2,
  `rl:tok:{${orgId}}:${endpoint}`,
  `rl:ts:{${orgId}}:${endpoint}`,
  String(rate), String(capacity), String(Date.now()), String(cost)
);
```

The curly braces around `${orgId}` in the key are Redis Cluster hash tags — Redis hashes only the content between `{}`, so both keys for a given org land on the same slot, making multi-key Lua scripts safe.
2) Redis failure — fail open or closed?
Bad Solution: Fail open globally.
- Approach: If Redis is unreachable, allow all requests.
- Challenges: You've just turned off your rate limiter during the moment you need it most — cache failures often coincide with load spikes. Downstream services get DDoSed.
Good Solution: Fail closed globally.
- Approach: If Redis is unreachable, deny all requests.
- Challenges: You've turned your cache outage into an app outage. A 30-second Redis blip becomes a 30-second API outage. Customers complain.
Great Solution: Per-policy fail mode + local fallback bucket.
- Approach:
  - Each policy specifies its fail mode: `fail_open` for soft limits, `fail_closed` for hard limits (billing, compliance, per-feature quotas).
  - When Redis is reachable: use the authoritative cluster count.
  - When Redis is unreachable: fall back to a local token bucket on each gateway node. Capacity is `globalLimit / nodeCount × 0.7` — conservative to avoid global overshoot.
  - Emit a SEV alert to operators when running in fallback mode; track duration.
  - When Redis returns, reconcile: read recent denials and allowances from fallback logs, update authoritative counters via CDC. Brief over-issuance during the switchover is documented and acceptable.
- Challenges: Local fallback is approximate — it over-issues by up to `1 / node_count` in the worst case. Hard limits with `fail_closed` still cause 429s during a Redis outage — customers see errors. The tradeoff is explicit: "we'd rather refuse a few billing-metered calls than overcharge." A sketch of the fallback path follows.
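A sketch of that fallback path, assuming an in-process `localBucket` sized per the formula above and a `metrics` client standing in for real telemetry:

```typescript
// Sketch: consult Redis first; on error, honor the policy's fail mode.
// localBucket is an in-process bucket sized to (globalLimit / nodeCount) * 0.7,
// refilled on a timer (not shown). `metrics` stands in for your telemetry client.
declare const metrics: { fallbackMode: { inc(): void } };

export async function checkWithFallback(
  policy: { failMode: "open" | "closed" },
  redisCheck: () => Promise<{ allowed: boolean; remaining: number }>,
  localBucket: { tryTake(n: number): boolean },
) {
  try {
    return await redisCheck(); // authoritative path
  } catch {
    metrics.fallbackMode.inc(); // sustained fallback fires the SEV alert
    if (policy.failMode === "closed") {
      return { allowed: false, remaining: 0 }; // hard limits: deny during outage
    }
    // Soft limits: approximate locally; worst-case over-issuance <= 1/nodeCount.
    return { allowed: localBucket.tryTake(1), remaining: -1 };
  }
}
```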
3) Distributed counter accuracy — how accurate can we be at 1M RPS?
Bad Solution: Per-node local counters, summed periodically.
- Approach: Each gateway node keeps its own counter; a background process sums them.
- Challenges: At 100 nodes with a 0.1 s sync interval, the global counter can drift by nodes × interval × per-node RPS — e.g., 100 × 0.1 s × 1k RPS/node ≈ 10k requests of over-allowance. Over-counting (and premature denials) is also possible if nodes double-report.
Good Solution: Single centralized Redis counter per key.
- Approach: Every node hits the same Redis key. Accurate within the time of a single Lua script.
- Challenges: For a mega-tenant with 100k RPS on one endpoint, that's 100k ops/s hitting one Redis key. Single-key throughput ceiling (~100k ops/s for INCR, lower with Lua) is breached.
Great Solution: Counter sharding for high-QPS tenants with periodic reconciliation.
- Approach:
  - Normal tenants use a single counter per `(orgId, endpoint, window)`.
  - High-QPS tenants (> 10k RPS on a single endpoint) get sharded counters: `rl:tok:{orgId}:ep:0..N-1`. The client picks a shard at random.
  - Each shard enforces `limit / N` independently. Total admitted stays within the overall limit at steady state.
  - A reconciler process runs every 1 s, sums the shards, and redistributes unused capacity. E.g., if shard 0 has burned 80% and shard 3 only 20%, the reconciler rebalances on the next tick.
  - Trade a small over-issuance (a few percent at window boundaries) for linear scalability (see the sketch below).
- Challenges: The reconciler adds complexity and a second source of truth. Shard imbalance within a window can cause premature 429s for some users while capacity is free elsewhere. Mitigate by starting with few shards (N=2) and ramping only when needed.
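One subtlety in the key scheme above: with a plain `{orgId}` hash tag, every shard hashes to the same Redis Cluster slot, so all the shards would sit on one node. A sketch (an assumption, not from the scheme above) that moves the shard index inside the tag so shards actually spread across nodes — token and timestamp keys for one shard still share a tag, so multi-key Lua keeps working:

```typescript
// Sketch: route a high-QPS tenant's check to one of N sharded counters.
// The shard index lives *inside* the hash tag so different shards can land on
// different cluster slots (and therefore different nodes).
function shardedKeys(orgId: string, endpoint: string, shards: number) {
  const shard = Math.floor(Math.random() * shards);
  const tag = `${orgId}:${endpoint}:${shard}`;
  return {
    tokensKey: `rl:tok:{${tag}}`, // both keys share the tag, so the
    tsKey: `rl:ts:{${tag}}`,      // two-key Lua script stays single-slot
  };
}
```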
4) Policy evolution and tenant overrides
Bad Solution: Hard-coded limits in YAML files.
- Approach: Policies checked into config; deploy to change.
- Challenges: Policy changes take 30+ minutes to deploy across a fleet. Cannot react to an incident ("spike from Org X, raise their limit immediately"). No runtime override for enterprise customers.
Good Solution: Policies in a DB, fetched on each request.
- Approach: Policy service reads Postgres on every check.
- Challenges: Adds DB round-trip to every rate-limit check — kills the 2ms latency budget.
Great Solution: Per-org override hierarchy + CDC-propagated cache + shadow mode for rollouts.
- Approach:
  - Hierarchy: the effective policy for `(org, endpoint)` is: tier default → org override → endpoint-specific override. Most-specific wins.
  - Cached in-process (1 s TTL); invalidated via CDC from the `policy.updates` topic. Propagation latency < 1 s.
  - Audit every policy change — meta-audit — so we can trace "who raised Org X's limit from 1k to 10k and when."
  - Shadow mode for new policies: deploy in log-only mode first ("would have denied") for N hours. Review the denials in a dashboard. If it looks right, flip to enforce mode with one config change (see the sketch after this list).
  - Blue-green policy deploys: old and new policies are evaluated in parallel for a transition window; serving is switched by a feature flag.
- Challenges: Hierarchy resolution has to be fast (it's in-process and cached — fine). Shadow mode requires a parallel execution path, which increases Redis load during the shadow window. Auditing adds write load to the audit system.
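A sketch of the shadow-mode evaluation — `EvalPolicy` and the logging are illustrative; only the live policy's verdict is ever enforced:

```typescript
// Sketch: run a candidate policy in shadow next to the live one. Only the
// live result is enforced; shadow denials are logged for review.
type Mode = "enforce" | "shadow";
interface EvalPolicy { policyId: string; mode: Mode }

export async function evaluateWithShadow(
  live: EvalPolicy,
  candidate: EvalPolicy,
  check: (p: EvalPolicy) => Promise<{ allowed: boolean }>,
) {
  const liveResult = await check(live); // always decides the request
  if (candidate.mode === "shadow") {
    const shadow = await check(candidate); // the extra Redis load noted above
    if (!shadow.allowed && liveResult.allowed) {
      console.log("shadow-would-deny", { policyId: candidate.policyId });
    }
  }
  return liveResult;
}
```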
5) Observability — telling a customer "why did I get 429'd?"
Bad Solution: No per-denial context.
- Approach: Return 429 with no body.
- Challenges: Customer calls support. Support calls engineering. Everyone spends hours pawing through logs.
Good Solution: Response headers.
- Approach: Return `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`, `Retry-After` (see the response sketch below).
- Challenges: The customer knows the limit but not why. Doesn't help debug unexpected spikes.
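A sketch of the denial response, assuming an Express-style gateway; the header names are the conventional `X-RateLimit-*` set listed above:

```typescript
import express from "express";

// Sketch: what a denial looks like on the wire.
function sendDenial(
  res: express.Response,
  d: { limit: number; resetEpochSec: number; retryAfterMs: number },
) {
  res
    .status(429)
    .set({
      "X-RateLimit-Limit": String(d.limit),
      "X-RateLimit-Remaining": "0",
      "X-RateLimit-Reset": String(d.resetEpochSec),
      "Retry-After": String(Math.ceil(d.retryAfterMs / 1000)), // seconds
    })
    .json({ error: "rate_limited", retryAfterMs: d.retryAfterMs });
}
```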
Great Solution: Self-service usage dashboard + denial stream + policy visibility.
- Approach:
  - Response headers give per-request context.
  - `GET /v1/orgs/{orgId}/ratelimit/usage` returns a time-series of consumption per endpoint. Customers can plot "we hit 80% at 14:32, why?" themselves.
  - Denial stream published to the Kafka `ratelimit.denials` topic with per-denial metadata: `orgId`, `endpoint`, `userId`, `policyId`, `timestamp`. Customers can subscribe via webhook for real-time alerting.
  - Policy visibility API: `GET /v1/orgs/{orgId}/ratelimit/policies` returns the effective policies for this org — tier default + overrides. Removes "why is my limit different from the docs?" confusion.
  - Self-service: a customer portal shows current usage, historical trends, and which endpoints are approaching limits.
- Challenges: The dashboard requires significant product work. The denial stream adds Kafka load. Policy visibility requires explaining the override hierarchy to customers — complexity some would rather not see.
What is Expected at Each Level?
Mid-level (SMTS-junior)
Single-node token bucket. Basic Redis backend for shared counters. Tenant-keyed counters. Awareness that fail mode matters.
Senior (SMTS / LMTS)
Lua atomicity on Redis for correct concurrent updates. Per-policy fail mode. Policy cache with CDC invalidation. Basic per-org observability.
Staff+ (PMTS)
Multi-layer fallback (local bucket + Redis). Counter sharding for hot tenants. Shadow mode for policy rollouts. Cost model for Redis cluster and gateway fleet. Clear distinction between global concurrency (admission service) and rate (this service). Self-service usage dashboards.
Salesforce-Specific Considerations
- Direct analog: Salesforce API Request Limits (per-org per-24h), Concurrent Request Limit, and governor limits more broadly. The design patterns here are what Salesforce runs in production.
- Fairness philosophy: Salesforce's governor limits cap the worst case rather than optimize the best. Apply the same lens: we would rather refuse a paid Enterprise customer's burst than let them monopolize the platform and degrade smaller tenants.
- Event Relay and Pub/Sub API subscriptions are also rate-limited per-org — same token bucket patterns apply.
- Per-transaction governor limits (SOQL queries, DML statements, CPU time) are in-process rate limits inside Apex. Our API-level rate limiter is the distributed analog.
- Hyperforce regional isolation: rate limits are per-region. Cross-region calls are counted once per region, not globally.
- Platform Events metering: each org has a daily platform event allocation. Our `ratelimit.denials` topic mirrors Salesforce's event allocation alerts.
- Shield Event Monitoring: every rate-limit denial is itself an audit event that Shield customers can ingest.
Example snippet — in-process policy cache
Java:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

public class PolicyCache {
    private final Cache<String, Policy> cache =
        Caffeine.newBuilder().expireAfterWrite(Duration.ofSeconds(1)).build();
    private final PolicyRepo policyRepo; // backed by Postgres + CDC mirror

    public PolicyCache(PolicyRepo policyRepo) {
        this.policyRepo = policyRepo;
    }

    public Policy lookup(String orgId, String tier, String endpoint) {
        String key = orgId + ":" + tier + ":" + endpoint;
        return cache.get(key, k -> resolveHierarchy(orgId, tier, endpoint));
    }

    private Policy resolveHierarchy(String orgId, String tier, String endpoint) {
        // Most-specific wins: endpoint override > org override > tier default.
        Policy p = policyRepo.findEndpointOverride(orgId, endpoint);
        if (p != null) return p;
        p = policyRepo.findOrgOverride(orgId);
        if (p != null) return p;
        return policyRepo.findTierDefault(tier, endpoint);
    }

    public void invalidate(String orgId) {
        cache.asMap().keySet().removeIf(k -> k.startsWith(orgId + ":"));
    }
}
```

C++:

```cpp
Policy PolicyCache::Lookup(const std::string& org_id,
                           const std::string& tier,
                           const std::string& endpoint) {
  const std::string key = org_id + ":" + tier + ":" + endpoint;
  return cache_.GetOrCompute(key, [&] {
    // Most-specific wins: endpoint override > org override > tier default.
    if (auto p = repo_.FindEndpointOverride(org_id, endpoint)) return *p;
    if (auto p = repo_.FindOrgOverride(org_id)) return *p;
    return repo_.FindTierDefault(tier, endpoint);
  });
}

void PolicyCache::Invalidate(const std::string& org_id) {
  cache_.RemoveIf([&](const std::string& k) {
    return k.rfind(org_id + ":", 0) == 0;  // prefix match on "orgId:"
  });
}
```

TypeScript:

```typescript
class PolicyCache {
  private cache = new LRU<string, Policy>({ ttl: 1000, max: 10_000 });

  constructor(private repo: PolicyRepo) {}

  lookup(orgId: string, tier: string, endpoint: string): Policy {
    const key = `${orgId}:${tier}:${endpoint}`;
    const hit = this.cache.get(key);
    if (hit) return hit;
    // Most-specific wins: endpoint override > org override > tier default.
    const resolved =
      this.repo.findEndpointOverride(orgId, endpoint) ??
      this.repo.findOrgOverride(orgId) ??
      this.repo.findTierDefault(tier, endpoint);
    this.cache.set(key, resolved);
    return resolved;
  }

  invalidate(orgId: string) {
    for (const k of this.cache.keys()) {
      if (k.startsWith(`${orgId}:`)) this.cache.delete(k);
    }
  }
}
```