
HLD: Multi-Tenant Rate Limiter ​

Understanding the Problem ​

What is a Multi-Tenant Rate Limiter? ​

A rate limiter enforces per-tenant, per-endpoint request budgets across a distributed API fleet. Essentials customers get 15k/day, Enterprise gets 1M/day, internal services get no cap. The service must make a decision in under 2 ms on every request, stay accurate to within 5% of the advertised limit, and keep working when Redis has a blip. This is Salesforce's governor-limit philosophy applied at the API layer: cap the worst case rather than optimize the best. It's a common HLD problem on its own and a frequent deep dive inside bigger designs (notifications, dashboards).

Functional Requirements ​

Core (above the line):

  1. Evaluate a request against its applicable limits — (orgId, endpoint), (orgId, user), (orgId, global) — and return allowed/denied in < 2 ms.
  2. Return Retry-After on denial so clients know when to retry.
  3. Configurable policies per org tier — Essentials: 15k/day, Enterprise: 1M/day, Unlimited: soft caps only.
  4. Burst-friendly — permit short spikes (client has a backlog to drain).
  5. Observability — per-tenant, per-endpoint consumption dashboards.

Below the line (out of scope):

  • Global concurrency caps (tracked by an admission service, not the rate limiter).
  • IP-based DDoS protection — edge CDN / WAF handles that.
  • Per-user geographic rules — belongs in geo-fencing policy layer.

Non-Functional Requirements ​

Core:

  • Scale: 1M API RPS across all tenants at peak.
  • Latency: rate-limit check p99 < 2 ms.
  • Accuracy: within 5% of the advertised limit for soft limits; strict caps for billing / compliance hard limits.
  • Availability: survives Redis failure. Fail-mode is explicit (open or closed) per policy.
  • Multi-tenancy: keys strictly namespaced by orgId. Per-org observability.

Below the line:

  • Sub-millisecond accuracy across all shards (eventual consistency within 1s is fine).
  • Dynamic reconfiguration under 100 ms (1s is acceptable).

Capacity Estimation ​

  • 1M RPS means at least 1M rate-limit checks/s against Redis, each executed as a single atomic Lua script (a few GETs/SETs bundled together). On a 12-node Redis cluster that is ~83k checks/s per shard, well within a single node's single-threaded capacity (~200k ops/s).
  • Memory: each limit key is ~100 bytes. 10M active (org, endpoint, window) combinations = ~1 GB per replica. Fits comfortably in a standard Redis node.
  • Policy storage: 100k orgs × 100 endpoint patterns × 10 overrides is a worst case of ~100M policy rows; in practice only explicit overrides are materialized, so the table is far smaller. Either way it lives comfortably in Postgres and is cached in-process.
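
For completeness, the same numbers as a tiny runnable sketch; the node count, per-key size, and active-key count are just the estimates above.

java
// Worked capacity math from the estimates above.
public class CapacityEstimate {
    public static void main(String[] args) {
        long peakChecksPerSec = 1_000_000;   // 1M API RPS, one rate-limit check each
        int redisShards = 12;
        long checksPerShard = peakChecksPerSec / redisShards;        // ~83k checks/s per shard

        long activeKeys = 10_000_000;        // active (org, endpoint, window) combinations
        long bytesPerKey = 100;
        double counterMemoryGb = activeKeys * bytesPerKey / 1e9;     // ~1 GB per replica

        System.out.printf("checks/shard=%d, counter memory=%.1f GB%n",
                          checksPerShard, counterMemoryGb);
    }
}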

The Set Up ​

Core Entities ​

  • Policy — (policyId, orgTier, endpointPattern, limit, window, algorithm, failMode). Defines "this tier, this endpoint, this limit."
  • CounterKey (derived at check time) — rl:<algo>:{orgId}:<endpoint>:<window>. Only orgId keeps literal curly braces in the real key; they act as the Redis Cluster hash tag.
  • TenantOverride — per-org tweaks to the default policy. E.g., "Org X gets 10x the default for POST /records."
  • PolicyDecision — (allowed, remaining, retryAfterMs, policyId) returned to the caller.

The API ​

Two planes: data plane runs on every request; control plane manages policies.

Data plane (in-request, sync):

CheckLimit(orgId, userId, endpoint) → { allowed: bool, retryAfterMs, remaining }

Implemented as a gRPC service from a local sidecar, or as an Envoy external-authz filter, or as a Kong plugin.
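
As one concrete shape for this call surface, a minimal Java sketch; the type and method names are illustrative, and the real interface would be generated from the gRPC contract.

java
// Data-plane client interface; PolicyDecision mirrors the entity defined above.
public interface RateLimitClient {
    record PolicyDecision(boolean allowed, long remaining, long retryAfterMs, String policyId) {}

    // Synchronous in-request check; must complete within the ~2 ms p99 budget.
    PolicyDecision checkLimit(String orgId, String userId, String endpoint);
}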

Control plane:

PUT /v1/ratelimit/policies
Body: { "orgTier": "Enterprise", "endpointPattern": "POST /records", "limit": 1000, "window": "1m", "algorithm": "token_bucket", "failMode": "open" }

GET /v1/orgs/{orgId}/ratelimit/usage
  → time-series of remaining budget per endpoint

POST /v1/orgs/{orgId}/ratelimit/overrides
  Body: { "endpointPattern": "...", "limit": 5000 }

High-Level Design ​

Architecture ​

 Request ──▶ Envoy/Kong ──▶ RateLimit Filter ──▶ Redis Cluster (Lua)
                │                │                   │
                │                │                   ▼
                │                │             Policy Cache (in-proc)
                │                ▼
                │         Local token bucket (fallback)
                ▼
            App Service

End-to-end flow: a request arrives ​

  1. Request hits Envoy / Kong at the edge with Authorization: Bearer <jwt>.
  2. RateLimit Filter (Envoy external authz or Kong plugin) extracts orgId, userId, endpoint.
  3. Filter consults the Policy Cache (in-process, 1s TTL) to find the applicable policy for this (orgTier, endpoint) plus any tenant overrides.
  4. Filter executes a Lua script on Redis Cluster to atomically check and update the token bucket.
  5. Redis returns (allowed, remaining). If allowed, request continues to the app; if denied, filter returns 429 with Retry-After and X-RateLimit-* headers.
  6. Every denial is emitted to a Kafka topic ratelimit.denials for dashboards and anomaly alerting.
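
A minimal sketch of the denial response from step 5; the header names are the conventional X-RateLimit-* set mentioned above, and the effective limit is assumed to come from the resolved policy.

java
import java.time.Instant;
import java.util.Map;

// Headers attached to the HTTP 429 the filter returns on denial.
final class DenialHeaders {
    static Map<String, String> build(long limit, long remaining, long retryAfterMs) {
        long retryAfterSec = Math.max(1, retryAfterMs / 1000);
        return Map.of(
            "Retry-After",           String.valueOf(retryAfterSec),
            "X-RateLimit-Limit",     String.valueOf(limit),
            "X-RateLimit-Remaining", String.valueOf(remaining),
            "X-RateLimit-Reset",     String.valueOf(Instant.now().getEpochSecond() + retryAfterSec));
    }
}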

End-to-end flow: policy update ​

  1. Admin changes a policy via PUT /v1/ratelimit/policies.
  2. Policy Service writes to Postgres.
  3. CDC (Debezium) publishes the change to policy.updates Kafka topic.
  4. Every RateLimit Filter instance subscribes to policy.updates and invalidates its in-process cache for affected keys.
  5. Propagation latency: < 1 second across the fleet.
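
A minimal sketch of step 4, assuming the messages on policy.updates are keyed by the affected orgId and using the PolicyCache shown at the end of this write-up.

java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Each RateLimit Filter instance runs this loop and drops cached policies for changed orgs.
final class PolicyUpdateListener {
    void run(KafkaConsumer<String, String> consumer, PolicyCache cache) {
        consumer.subscribe(List.of("policy.updates"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
            records.forEach(record -> cache.invalidate(record.key()));   // key assumed to be orgId
        }
    }
}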

Data model ​

  • Policies: Postgres policies table. Mirrored to Redis via CDC for the rare case of cache-cold startup.
  • Counters: Redis, ephemeral. TTL equals the window (e.g., 1 minute, 1 hour, 24 hours).
  • Denial audit: Kafka ratelimit.denials topic for dashboards; retained 7 days.
  • Usage metrics: Prometheus time-series labeled org_id, endpoint, tier.

Multi-Tenancy Strategy ​

Isolation level: L1 on the policy store (shared Postgres), L1 on Redis with strict key prefixing. Redis Cluster assigns slots via the {orgId} hash tag — this keeps a tenant's keys colocated on one slot, enabling multi-key Lua scripts without cross-slot errors.

Tenant context flow:

  • orgId extracted from JWT at the edge.
  • All Redis keys use the format rl:tok:{orgId}:<endpoint> — the curly braces around orgId are a Redis Cluster hash tag, so the cluster hashes only on orgId and keeps all of a tenant's rate-limit counters in the same slot.
  • Every Prometheus metric is labeled with org_id, endpoint, tier.
  • Denials published to Kafka are keyed by orgId for per-org stream processing.

Noisy-neighbor mitigations:

  • Shuffle sharding of Redis nodes: each org is mapped to k=3 of n=12 Redis shards (a mapping sketch follows this list). A hot tenant's traffic only touches 25% of Redis capacity.
  • Meta rate limit: the rate limiter itself applies a cap on how often a single tenant can hit /checkLimit. Defense in depth against a bug in the policy cache.
  • Per-tenant bulkhead on the RateLimit Filter: bounded goroutine / thread count per org so one tenant's Redis slowness cannot starve others' threads.
  • Priority tiers for Redis access: Enterprise tenants get a dedicated Redis read/write pipeline; Essentials tenants share a common pool.
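
A sketch of the shuffle-shard mapping from the first bullet above; the seeding scheme is illustrative, the point is a stable per-org choice of k shards out of n.

java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class ShuffleSharding {
    // Deterministically assign an org to k of n Redis shards; a hot tenant then
    // touches only k/n of the cluster (k=3, n=12 gives 25%).
    static List<Integer> shardsFor(String orgId, int n, int k) {
        List<Integer> shards = new ArrayList<>();
        for (int i = 0; i < n; i++) shards.add(i);
        Collections.shuffle(shards, new Random(orgId.hashCode()));   // stable seed per org
        return shards.subList(0, k);
    }
}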

Per-tenant observability:

  • Prometheus dashboards filter by org_id.
  • Per-org billing: count of allowed vs denied per endpoint.
  • Alerts when an org consistently approaches its limit: a signal that they may need to upgrade, prompting proactive outreach.

Potential Deep Dives ​

1) Which algorithm — fixed window, sliding window log, sliding window counter, token bucket? ​

Walk through the comparison explicitly. Interviewers expect this table.

| Algorithm | Accuracy | Memory | Burst-friendly | Notes |
| --- | --- | --- | --- | --- |
| Fixed window | Worst (boundary bursts) | O(1) | No | Simple counter; double-load at window boundaries. |
| Sliding window log | Best | O(N) per key | Yes | Store every request timestamp; expensive at 1M RPS. |
| Sliding window counter | Good | O(1) | Yes | Interpolates between two fixed windows. Reasonable default. |
| Token bucket | Good | O(1) | Yes | Chosen — matches "rate + burst" intuition. |
| Leaky bucket | Good (smoothed) | O(1) | No | Shapes traffic; strict throughput. |

Pick token bucket for APIs. Mention sliding-window-counter as an alternative if the interviewer wants something else.

The Lua implementation is the load-bearing piece — atomic check-and-update on Redis:

java
private static final String LUA = """
  local tokens_key  = KEYS[1]
  local ts_key      = KEYS[2]
  local rate        = tonumber(ARGV[1])   -- tokens/sec
  local capacity    = tonumber(ARGV[2])
  local now         = tonumber(ARGV[3])   -- ms
  local cost        = tonumber(ARGV[4])
  local tokens      = tonumber(redis.call('GET', tokens_key) or capacity)
  local last        = tonumber(redis.call('GET', ts_key) or now)
  local delta       = math.max(0, now - last) / 1000.0
  tokens            = math.min(capacity, tokens + delta * rate)
  local allowed     = tokens >= cost and 1 or 0
  if allowed == 1 then tokens = tokens - cost end
  redis.call('SET', tokens_key, tokens, 'PX', 3600000)
  redis.call('SET', ts_key,    now,    'PX', 3600000)
  return { allowed, tokens }
""";
List<String> keys = List.of(
    "rl:tok:{" + org + "}:" + ep,
    "rl:ts:{"  + org + "}:" + ep);
// ARGV must be passed as strings; most Java clients return the Lua table as a List.
List<String> args = List.of(
    String.valueOf(rate), String.valueOf(cap), String.valueOf(now), "1");
List<Long> r = (List<Long>) redis.eval(LUA, keys, args);   // { allowed, remainingTokens }
cpp
static const char* kLua = R"LUA(
  local tokens = tonumber(redis.call('GET', KEYS[1]) or ARGV[2])
  local last   = tonumber(redis.call('GET', KEYS[2]) or ARGV[3])
  local delta  = math.max(0, tonumber(ARGV[3]) - last) / 1000.0
  tokens       = math.min(tonumber(ARGV[2]), tokens + delta * tonumber(ARGV[1]))
  local ok     = tokens >= tonumber(ARGV[4]) and 1 or 0
  if ok == 1 then tokens = tokens - tonumber(ARGV[4]) end
  redis.call('SET', KEYS[1], tokens, 'PX', 3600000)
  redis.call('SET', KEYS[2], ARGV[3], 'PX', 3600000)
  return { ok, tokens }
)LUA";
auto reply = redis.Eval(kLua,
    {"rl:tok:{" + org + "}:" + ep, "rl:ts:{" + org + "}:" + ep},
    {std::to_string(rate), std::to_string(cap),
     std::to_string(now_ms), std::to_string(cost)});
typescript
const lua = `
  local tokens = tonumber(redis.call('GET', KEYS[1]) or ARGV[2])
  local last   = tonumber(redis.call('GET', KEYS[2]) or ARGV[3])
  local delta  = math.max(0, tonumber(ARGV[3]) - last) / 1000.0
  tokens       = math.min(tonumber(ARGV[2]), tokens + delta * tonumber(ARGV[1]))
  local ok     = tokens >= tonumber(ARGV[4]) and 1 or 0
  if ok == 1 then tokens = tokens - tonumber(ARGV[4]) end
  redis.call('SET', KEYS[1], tokens, 'PX', 3600000)
  redis.call('SET', KEYS[2], ARGV[3], 'PX', 3600000)
  return { ok, tokens }
`;
const [allowed, remaining] = await redis.eval(
  lua, 2,
  `rl:tok:{${orgId}}:${endpoint}`,
  `rl:ts:{${orgId}}:${endpoint}`,
  String(rate), String(capacity), String(Date.now()), String(cost)
);

The curly braces around ${orgId} in the key are Redis Cluster hash tags — Redis hashes only the content between {}, so both keys for a given org land on the same slot, making multi-key Lua scripts safe.

2) Redis failure — fail open or closed? ​

Bad Solution: Fail open globally.

  • Approach: If Redis is unreachable, allow all requests.
  • Challenges: You've just turned off your rate limiter during the moment you need it most — cache failures often coincide with load spikes. Downstream services get DDoSed.

Good Solution: Fail closed globally.

  • Approach: If Redis is unreachable, deny all requests.
  • Challenges: You've turned your cache outage into an app outage. A 30-second Redis blip becomes a 30-second API outage. Customers complain.

Great Solution: Per-policy fail mode + local fallback bucket.

  • Approach:
    • Each policy specifies its fail mode: fail_open for soft limits, fail_closed for hard limits (billing, compliance, per-feature quotas).
    • When Redis is reachable: use the authoritative cluster count.
    • When Redis is unreachable: fall back to a local token bucket on each gateway node (sketched after this list). Capacity is globalLimit / nodeCount × 0.7 — conservative to avoid global overshoot.
    • Emit a SEV alert to operators when running in fallback mode; track duration.
    • When Redis returns, reconcile: read recent denials and allowances from fallback logs, update authoritative counters via CDC. Brief over-issuance during the switchover is documented and acceptable.
  • Challenges: Local fallback is approximate — over-issues by up to 1 / node_count in the worst case. Hard limits with fail_closed still cause 429s during Redis outage — customers see errors. Tradeoff is explicit: "we'd rather refuse a few billing-metered calls than overcharge."
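
A minimal sketch of the fallback path described above; the per-policy fail mode and the nodeCount × 0.7 scaling come from this deep dive, while the class and method names are illustrative.

java
// Per-node bucket used only while Redis is unreachable; capacity and rate are scaled
// to globalLimit / nodeCount * 0.7 to stay conservative, as described above.
final class LocalFallbackBucket {
    private final double ratePerSec;
    private final double capacity;
    private double tokens;
    private long lastRefillMs = System.currentTimeMillis();

    LocalFallbackBucket(double globalRatePerSec, double globalCapacity, int nodeCount) {
        this.ratePerSec = globalRatePerSec / nodeCount * 0.7;
        this.capacity   = globalCapacity   / nodeCount * 0.7;
        this.tokens     = this.capacity;
    }

    synchronized boolean tryAcquire(int cost) {
        long now = System.currentTimeMillis();
        tokens = Math.min(capacity, tokens + (now - lastRefillMs) / 1000.0 * ratePerSec);
        lastRefillMs = now;
        if (tokens < cost) return false;
        tokens -= cost;
        return true;
    }
}

// On a Redis error: soft limits (failMode = open) consult the local bucket;
// hard limits (failMode = closed) deny outright and surface a 429.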

3) Distributed counter accuracy — how accurate can we be at 1M RPS? ​

Bad Solution: Per-node local counters, summed periodically.

  • Approach: Each gateway node keeps its own counter; a background process sums them.
  • Challenges: The global view can drift by roughly nodeCount × syncInterval × per-node RPS. At 100 nodes, a 0.1 s sync interval, and ~1k RPS per node, that is ~10k requests of over-allowance. Double-reporting (e.g., on retries) skews the other way, inflating counts and denying requests early.

Good Solution: Single centralized Redis counter per key.

  • Approach: Every node hits the same Redis key. Accurate within the time of a single Lua script.
  • Challenges: For a mega-tenant with 100k RPS on one endpoint, that's 100k ops/s hitting one Redis key. Single-key throughput ceiling (~100k ops/s for INCR, lower with Lua) is breached.

Great Solution: Counter sharding for high-QPS tenants with periodic reconciliation.

  • Approach:
    • Normal tenants use a single counter per (orgId, endpoint, window).
    • High-QPS tenants (> 10k RPS on a single endpoint) have sharded counters: rl:tok:{orgId}:ep:0..N-1. The client picks a shard at random (key construction sketched after this list).
    • Each shard enforces limit / N independently. Total admitted stays within the overall limit at steady state.
    • A reconciler process runs every 1 s, sums the shards, and redistributes unused capacity. E.g., if shard 0 has burned 80% and shard 3 only 20%, reconciler rebalances on the next tick.
    • Trade a small over-issuance (a few percent at window boundaries) for linear scalability.
  • Challenges: Reconciler adds complexity and a second source of truth. Shard imbalance within a window can cause premature 429 for some users while capacity is free elsewhere. Mitigate by starting with few shards (N=2) and ramping only when needed.
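
A sketch of the shard-key construction and per-shard limit from the approach above; the record and helper are illustrative, and the Lua script from deep dive 1 then runs against these keys with its limits scaled down.

java
import java.util.concurrent.ThreadLocalRandom;

// For a high-QPS tenant, spread the counter across N shards and enforce limit/N on each.
final class CounterSharding {
    record ShardKeys(String tokensKey, String tsKey, long perShardLimit) {}

    static ShardKeys pick(String orgId, String endpoint, long limit, int shardCount) {
        int shard = ThreadLocalRandom.current().nextInt(shardCount);   // random shard per request
        String suffix = "{" + orgId + "}:" + endpoint + ":" + shard;   // {orgId} hash tag keeps keys colocated
        return new ShardKeys("rl:tok:" + suffix, "rl:ts:" + suffix, Math.max(1, limit / shardCount));
    }
}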

4) Policy evolution and tenant overrides ​

Bad Solution: Hard-coded limits in YAML files.

  • Approach: Policies checked into config; deploy to change.
  • Challenges: Policy changes take 30+ minutes to deploy across a fleet. Cannot react to an incident ("spike from Org X, raise their limit immediately"). No runtime override for enterprise customers.

Good Solution: Policies in a DB, fetched on each request.

  • Approach: Policy service reads Postgres on every check.
  • Challenges: Adds DB round-trip to every rate-limit check — kills the 2ms latency budget.

Great Solution: Per-org override hierarchy + CDC-propagated cache + shadow mode for rollouts.

  • Approach:
    • Hierarchy: effective policy for (org, endpoint) is: tier default → org override → endpoint-specific override. Most-specific wins.
    • Cached in-process (1 s TTL); invalidated via CDC from policy.updates topic. Propagation latency < 1 s.
    • Audit every policy change — meta-audit — so we can trace "who raised Org X's limit from 1k to 10k and when."
    • Shadow mode for new policies: deploy in log-only mode first ("would have denied") for N hours (see the sketch after this list). Review the denials in a dashboard. If it looks right, flip to enforce mode with one config change.
    • Blue-green policy deploys: old and new policies are evaluated in parallel for a transition window; a feature flag controls which one is enforced.
  • Challenges: Hierarchy resolution has to be fast (it's in-process and cached — fine). Shadow mode requires a parallel execution path — increases Redis load during the shadow window. Auditing adds write load to the audit system.
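
A sketch of the shadow-mode evaluation from the approach above; the Check interface, class names, and logging are hypothetical stand-ins for the filter's real wiring.

java
// Evaluate the candidate policy alongside the active one; only the enforced policy's
// verdict affects the request, the other is logged as a "would have denied".
final class ShadowEvaluator {
    interface Check { boolean allowed(String orgId, String endpoint, String policyId); }

    private final Check check;   // wraps the Redis/Lua token-bucket check
    ShadowEvaluator(Check check) { this.check = check; }

    boolean evaluate(String orgId, String endpoint,
                     String activePolicyId, String candidatePolicyId, boolean candidateEnforced) {
        boolean active    = check.allowed(orgId, endpoint, activePolicyId);
        boolean candidate = check.allowed(orgId, endpoint, candidatePolicyId);   // extra Redis load while shadowing
        if (active && !candidate) {
            System.out.printf("shadow_denial org=%s endpoint=%s policy=%s%n",
                              orgId, endpoint, candidatePolicyId);               // reviewed in a dashboard
        }
        return candidateEnforced ? candidate : active;
    }
}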

5) Observability — telling a customer "why did I get 429'd?" ​

Bad Solution: No per-denial context.

  • Approach: Return 429 with no body.
  • Challenges: Customer calls support. Support calls engineering. Everyone spends hours pawing through logs.

Good Solution: Response headers.

  • Approach: Return X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After.
  • Challenges: Customer knows the limit but not why. Doesn't help debug unexpected spikes.

Great Solution: Self-service usage dashboard + denial stream + policy visibility.

  • Approach:
    • Response headers give per-request context.
    • GET /v1/orgs/{orgId}/ratelimit/usage returns time-series of consumption per endpoint. Customers can plot "we hit 80% at 14:32, why?" themselves.
    • Denial stream published to Kafka ratelimit.denials with per-denial metadata: orgId, endpoint, userId, policyId, timestamp. Customers can subscribe via webhook for real-time alerting (publish sketch after this list).
    • Policy visibility API: GET /v1/orgs/{orgId}/ratelimit/policies returns the effective policies for this org — tier default + overrides. Removes "why is my limit different from the docs?" confusion.
    • Self-service: a customer portal shows current usage, historical trends, and which endpoints are approaching limits.
  • Challenges: Dashboard requires significant product work. Denial stream adds Kafka load. Policy visibility requires explaining the override hierarchy to customers — some complexity they'd rather not see.
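
A minimal sketch of publishing one event per denial to ratelimit.denials with the metadata listed above; the JSON shape and producer wiring are illustrative.

java
import java.time.Instant;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// One event per denial, keyed by orgId so downstream consumers can process per org.
final class DenialPublisher {
    private final KafkaProducer<String, String> producer;
    DenialPublisher(KafkaProducer<String, String> producer) { this.producer = producer; }

    void publish(String orgId, String endpoint, String userId, String policyId) {
        String json = String.format(
            "{\"orgId\":\"%s\",\"endpoint\":\"%s\",\"userId\":\"%s\",\"policyId\":\"%s\",\"ts\":\"%s\"}",
            orgId, endpoint, userId, policyId, Instant.now());
        producer.send(new ProducerRecord<>("ratelimit.denials", orgId, json));
    }
}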

What is Expected at Each Level? ​

Mid-level (SMTS-junior) ​

Single-node token bucket. Basic Redis backend for shared counters. Tenant-keyed counters. Awareness that fail mode matters.

Senior (SMTS / LMTS) ​

Lua atomicity on Redis for correct concurrent updates. Per-policy fail mode. Policy cache with CDC invalidation. Basic per-org observability.

Staff+ (PMTS) ​

Multi-layer fallback (local bucket + Redis). Counter sharding for hot tenants. Shadow mode for policy rollouts. Cost model for Redis cluster and gateway fleet. Clear distinction between global concurrency (admission service) and rate (this service). Self-service usage dashboards.


Salesforce-Specific Considerations ​

  • Direct analog: Salesforce API Request Limits (per-org per-24h), Concurrent Request Limit, and governor limits more broadly. The design patterns here are what Salesforce runs in production.
  • Fairness philosophy: Salesforce's governor limits cap the worst case rather than optimize the best. Apply the same lens: we would rather refuse a paid Enterprise customer's burst than let them monopolize the platform and degrade smaller tenants.
  • Event Relay and Pub/Sub API subscriptions are also rate-limited per-org — same token bucket patterns apply.
  • Per-transaction governor limits (SOQL queries, DML statements, CPU time) are in-process rate limits inside Apex. Our API-level rate limiter is the distributed analog.
  • Hyperforce regional isolation: rate limits are per-region. Cross-region calls are counted once per region, not globally.
  • Platform Events metering: each org has a daily platform event allocation. Our ratelimit.denials topic mirrors Salesforce's event allocation alerts.
  • Shield Event Monitoring: every rate-limit denial is itself an audit event that Shield customers can ingest.

Example snippet — in-process policy cache ​

java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

public class PolicyCache {
    private final PolicyRepo policyRepo;   // backing policy store (assumed injected)
    private final Cache<String, Policy> cache =
        Caffeine.newBuilder().expireAfterWrite(Duration.ofSeconds(1)).build();

    public PolicyCache(PolicyRepo policyRepo) { this.policyRepo = policyRepo; }

    public Policy lookup(String orgId, String tier, String endpoint) {
        String key = orgId + ":" + tier + ":" + endpoint;
        return cache.get(key, k -> resolveHierarchy(orgId, tier, endpoint));
    }

    private Policy resolveHierarchy(String orgId, String tier, String endpoint) {
        // Most-specific wins: endpoint override > org override > tier default.
        Policy p = policyRepo.findEndpointOverride(orgId, endpoint);
        if (p != null) return p;
        p = policyRepo.findOrgOverride(orgId);
        if (p != null) return p;
        return policyRepo.findTierDefault(tier, endpoint);
    }

    public void invalidate(String orgId) {
        cache.asMap().keySet().removeIf(k -> k.startsWith(orgId + ":"));
    }
}
cpp
Policy PolicyCache::Lookup(const std::string& org_id,
                           const std::string& tier,
                           const std::string& endpoint) {
  const std::string key = org_id + ":" + tier + ":" + endpoint;
  return cache_.GetOrCompute(key, [&] {
    if (auto p = repo_.FindEndpointOverride(org_id, endpoint)) return *p;
    if (auto p = repo_.FindOrgOverride(org_id)) return *p;
    return repo_.FindTierDefault(tier, endpoint);
  });
}

void PolicyCache::Invalidate(const std::string& org_id) {
  cache_.RemoveIf([&](const std::string& k) {
    return k.rfind(org_id + ":", 0) == 0;
  });
}
typescript
class PolicyCache {
  constructor(private repo: PolicyRepo) {}   // backing policy store (assumed injected)
  private cache = new LRU<string, Policy>({ ttl: 1000, max: 10_000 });

  lookup(orgId: string, tier: string, endpoint: string): Policy {
    const key = `${orgId}:${tier}:${endpoint}`;
    const hit = this.cache.get(key);
    if (hit) return hit;
    const resolved =
      this.repo.findEndpointOverride(orgId, endpoint) ??
      this.repo.findOrgOverride(orgId) ??
      this.repo.findTierDefault(tier, endpoint);
    this.cache.set(key, resolved);
    return resolved;
  }

  invalidate(orgId: string) {
    for (const k of this.cache.keys()) {
      if (k.startsWith(`${orgId}:`)) this.cache.delete(k);
    }
  }
}
