
HLD: Multi-Tenant Rate Limiter ​

Understanding the Problem ​

What is a Multi-Tenant Rate Limiter? ​

A rate limiter enforces per-tenant, per-endpoint request budgets across a distributed API fleet. Essentials customers get 15k/day, Enterprise gets 1M/day, internal services get no cap. The service must make a decision in under 2 ms on every request, stay accurate to within 5% of the advertised limit, and keep working when Redis has a blip. This is Salesforce's governor-limit philosophy applied at the API layer: cap the worst case rather than optimize the best. It's a common HLD problem on its own and a frequent deep dive inside bigger designs (notifications, dashboards).

Functional Requirements ​

Core (above the line):

  1. Evaluate a request against its applicable limits — (orgId, endpoint), (orgId, user), (orgId, global) — and return allowed/denied in < 2 ms.
  2. Return Retry-After on denial so clients know when to retry.
  3. Configurable policies per org tier — Essentials: 15k/day, Enterprise: 1M/day, Unlimited: soft caps only.
  4. Burst-friendly — permit short spikes (client has a backlog to drain).
  5. Observability — per-tenant, per-endpoint consumption dashboards.

Below the line (out of scope):

  • Global concurrency caps (tracked by an admission service, not the rate limiter).
  • IP-based DDoS protection — edge CDN / WAF handles that.
  • Per-user geographic rules — belongs in geo-fencing policy layer.

Non-Functional Requirements ​

Core:

  • Scale: 1M API RPS across all tenants at peak.
  • Latency: rate-limit check p99 < 2 ms.
  • Accuracy: within 5% of the advertised limit for soft limits; strict caps for billing / compliance hard limits.
  • Availability: survives Redis failure. Fail-mode is explicit (open or closed) per policy.
  • Multi-tenancy: keys strictly namespaced by orgId. Per-org observability.

Below the line:

  • Sub-millisecond accuracy across all shards (eventual consistency within 1s is fine).
  • Dynamic reconfiguration under 100 ms (1s is acceptable).

Capacity Estimation ​

  • 1M RPS means at least 1M rate-limit checks/s against Redis, each executed as a single atomic Lua script (a few GETs/SETs bundled together). On a 12-node Redis cluster that is ~83k checks/s per shard, well within a single node's single-threaded capacity (~200k ops/s).
  • Memory: each limit key is ~100 bytes. 10M active (org, endpoint, window) combinations = ~1 GB per replica. Fits comfortably in a standard Redis node.
  • Policy storage: 100k orgs × 100 endpoint patterns × 10 overrides is a worst case of ~100M policy rows; in practice only explicit overrides are materialized, so the table is far smaller. Either way it lives comfortably in Postgres and is cached in-process.
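
For completeness, the same numbers as a tiny runnable sketch; the node count, per-key size, and active-key count are just the estimates above.

java
// Worked capacity math from the estimates above.
public class CapacityEstimate {
    public static void main(String[] args) {
        long peakChecksPerSec = 1_000_000;   // 1M API RPS, one rate-limit check each
        int redisShards = 12;
        long checksPerShard = peakChecksPerSec / redisShards;        // ~83k checks/s per shard

        long activeKeys = 10_000_000;        // active (org, endpoint, window) combinations
        long bytesPerKey = 100;
        double counterMemoryGb = activeKeys * bytesPerKey / 1e9;     // ~1 GB per replica

        System.out.printf("checks/shard=%d, counter memory=%.1f GB%n",
                          checksPerShard, counterMemoryGb);
    }
}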

The Set Up ​

Core Entities ​

  • Policy — (policyId, orgTier, endpointPattern, limit, window, algorithm, failMode). Defines "this tier, this endpoint, this limit."
  • CounterKey (derived at check time) — rl:<algo>:{orgId}:<endpoint>:<window>. Only orgId keeps literal curly braces in the real key; they act as the Redis Cluster hash tag.
  • TenantOverride — per-org tweaks to the default policy. E.g., "Org X gets 10x the default for POST /records."
  • PolicyDecision — (allowed, remaining, retryAfterMs, policyId) returned to the caller.

The API ​

Two planes: data plane runs on every request; control plane manages policies.

Data plane (in-request, sync):

CheckLimit(orgId, userId, endpoint) → { allowed: bool, retryAfterMs, remaining }

Implemented as a gRPC service from a local sidecar, or as an Envoy external-authz filter, or as a Kong plugin.
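
As one concrete shape for this call surface, a minimal Java sketch; the type and method names are illustrative, and the real interface would be generated from the gRPC contract.

java
// Data-plane client interface; PolicyDecision mirrors the entity defined above.
public interface RateLimitClient {
    record PolicyDecision(boolean allowed, long remaining, long retryAfterMs, String policyId) {}

    // Synchronous in-request check; must complete within the ~2 ms p99 budget.
    PolicyDecision checkLimit(String orgId, String userId, String endpoint);
}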

Control plane:

PUT /v1/ratelimit/policies
Body: { "orgTier": "Enterprise", "endpointPattern": "POST /records", "limit": 1000, "window": "1m", "algorithm": "token_bucket", "failMode": "open" }

GET /v1/orgs/{orgId}/ratelimit/usage
  → time-series of remaining budget per endpoint

POST /v1/orgs/{orgId}/ratelimit/overrides
  Body: { "endpointPattern": "...", "limit": 5000 }

High-Level Design ​

Architecture ​

 Request ──▶ Envoy/Kong ──▶ RateLimit Filter ──▶ Redis Cluster (Lua)
                │                │                   │
                │                │                   ▼
                │                │             Policy Cache (in-proc)
                │                ▼
                │         Local token bucket (fallback)
                ▼
            App Service

End-to-end flow: a request arrives ​

  1. Request hits Envoy / Kong at the edge with Authorization: Bearer <jwt>.
  2. RateLimit Filter (Envoy external authz or Kong plugin) extracts orgId, userId, endpoint.
  3. Filter consults the Policy Cache (in-process, 1s TTL) to find the applicable policy for this (orgTier, endpoint) plus any tenant overrides.
  4. Filter executes a Lua script on Redis Cluster to atomically check and update the token bucket.
  5. Redis returns (allowed, remaining). If allowed, request continues to the app; if denied, filter returns 429 with Retry-After and X-RateLimit-* headers.
  6. Every denial is emitted to a Kafka topic ratelimit.denials for dashboards and anomaly alerting.
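
A minimal sketch of the denial response from step 5; the header names are the conventional X-RateLimit-* set mentioned above, and the effective limit is assumed to come from the resolved policy.

java
import java.time.Instant;
import java.util.Map;

// Headers attached to the HTTP 429 the filter returns on denial.
final class DenialHeaders {
    static Map<String, String> build(long limit, long remaining, long retryAfterMs) {
        long retryAfterSec = Math.max(1, retryAfterMs / 1000);
        return Map.of(
            "Retry-After",           String.valueOf(retryAfterSec),
            "X-RateLimit-Limit",     String.valueOf(limit),
            "X-RateLimit-Remaining", String.valueOf(remaining),
            "X-RateLimit-Reset",     String.valueOf(Instant.now().getEpochSecond() + retryAfterSec));
    }
}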

End-to-end flow: policy update ​

  1. Admin changes a policy via PUT /v1/ratelimit/policies.
  2. Policy Service writes to Postgres.
  3. CDC (Debezium) publishes the change to policy.updates Kafka topic.
  4. Every RateLimit Filter instance subscribes to policy.updates and invalidates its in-process cache for affected keys.
  5. Propagation latency: < 1 second across the fleet.
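
A minimal sketch of step 4, assuming the messages on policy.updates are keyed by the affected orgId and using the PolicyCache shown at the end of this write-up.

java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Each RateLimit Filter instance runs this loop and drops cached policies for changed orgs.
final class PolicyUpdateListener {
    void run(KafkaConsumer<String, String> consumer, PolicyCache cache) {
        consumer.subscribe(List.of("policy.updates"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
            records.forEach(record -> cache.invalidate(record.key()));   // key assumed to be orgId
        }
    }
}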

Data model ​

  • Policies: Postgres policies table. Mirrored to Redis via CDC for the rare case of cache-cold startup.
  • Counters: Redis, ephemeral. TTL equals the window (e.g., 1 minute, 1 hour, 24 hours).
  • Denial audit: Kafka ratelimit.denials topic for dashboards; retained 7 days.
  • Usage metrics: Prometheus time-series labeled org_id, endpoint, tier.

Multi-Tenancy Strategy ​

Isolation level: L1 on the policy store (shared Postgres), L1 on Redis with strict key prefixing. Redis Cluster assigns slots via the {orgId} hash tag — this keeps a tenant's keys colocated on one slot, enabling multi-key Lua scripts without cross-slot errors.

Tenant context flow:

  • orgId extracted from JWT at the edge.
  • All Redis keys use the format rl:tok:{orgId}:<endpoint> — the curly braces around orgId are a Redis Cluster hash tag, so the cluster hashes only on orgId and keeps all of a tenant's rate-limit counters in the same slot.
  • Every Prometheus metric is labeled with org_id, endpoint, tier.
  • Denials published to Kafka are keyed by orgId for per-org stream processing.

Noisy-neighbor mitigations:

  • Shuffle sharding of Redis nodes: each org is mapped to k=3 of n=12 Redis shards (a mapping sketch follows this list). A hot tenant's traffic only touches 25% of Redis capacity.
  • Meta rate limit: the rate limiter itself applies a cap on how often a single tenant can hit /checkLimit. Defense in depth against a bug in the policy cache.
  • Per-tenant bulkhead on the RateLimit Filter: bounded goroutine / thread count per org so one tenant's Redis slowness cannot starve others' threads.
  • Priority tiers for Redis access: Enterprise tenants get a dedicated Redis read/write pipeline; Essentials tenants share a common pool.
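
A sketch of the shuffle-shard mapping from the first bullet above; the seeding scheme is illustrative, the point is a stable per-org choice of k shards out of n.

java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class ShuffleSharding {
    // Deterministically assign an org to k of n Redis shards; a hot tenant then
    // touches only k/n of the cluster (k=3, n=12 gives 25%).
    static List<Integer> shardsFor(String orgId, int n, int k) {
        List<Integer> shards = new ArrayList<>();
        for (int i = 0; i < n; i++) shards.add(i);
        Collections.shuffle(shards, new Random(orgId.hashCode()));   // stable seed per org
        return shards.subList(0, k);
    }
}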

Per-tenant observability:

  • Prometheus dashboards filter by org_id.
  • Per-org billing: count of allowed vs denied per endpoint.
  • Alerts when an org consistently approaches its limit: a signal that they may need to upgrade, prompting proactive outreach.

Potential Deep Dives ​

1) Which algorithm — fixed window, sliding window log, sliding window counter, token bucket? ​

Walk through the comparison explicitly. Interviewers expect this table.

| Algorithm | Accuracy | Memory | Burst-friendly | Notes |
| --- | --- | --- | --- | --- |
| Fixed window | Worst (boundary bursts) | O(1) | No | Simple counter; double-load at window boundaries. |
| Sliding window log | Best | O(N) per key | Yes | Store every request timestamp; expensive at 1M RPS. |
| Sliding window counter | Good | O(1) | Yes | Interpolates between two fixed windows. Reasonable default. |
| Token bucket | Good | O(1) | Yes | Chosen — matches "rate + burst" intuition. |
| Leaky bucket | Good (smoothed) | O(1) | No | Shapes traffic; strict throughput. |

Pick token bucket for APIs. Mention sliding-window-counter as an alternative if the interviewer wants something else.

The Lua implementation is the load-bearing piece — atomic check-and-update on Redis:

java
private static final String LUA = """
  local tokens_key  = KEYS[1]
  local ts_key      = KEYS[2]
  local rate        = tonumber(ARGV[1])   -- tokens/sec
  local capacity    = tonumber(ARGV[2])
  local now         = tonumber(ARGV[3])   -- ms
  local cost        = tonumber(ARGV[4])
  local tokens      = tonumber(redis.call('GET', tokens_key) or capacity)
  local last        = tonumber(redis.call('GET', ts_key) or now)
  local delta       = math.max(0, now - last) / 1000.0
  tokens            = math.min(capacity, tokens + delta * rate)
  local allowed     = tokens >= cost and 1 or 0
  if allowed == 1 then tokens = tokens - cost end
  redis.call('SET', tokens_key, tokens, 'PX', 3600000)
  redis.call('SET', ts_key,    now,    'PX', 3600000)
  return { allowed, tokens }
""";
List<String> keys = List.of(
    "rl:tok:{" + org + "}:" + ep,
    "rl:ts:{"  + org + "}:" + ep);
// ARGV must be passed as strings; most Java clients return the Lua table as a List.
List<String> args = List.of(
    String.valueOf(rate), String.valueOf(cap), String.valueOf(now), "1");
List<Long> r = (List<Long>) redis.eval(LUA, keys, args);   // { allowed, remainingTokens }
cpp
static const char* kLua = R"LUA(
  local tokens = tonumber(redis.call('GET', KEYS[1]) or ARGV[2])
  local last   = tonumber(redis.call('GET', KEYS[2]) or ARGV[3])
  local delta  = math.max(0, tonumber(ARGV[3]) - last) / 1000.0
  tokens       = math.min(tonumber(ARGV[2]), tokens + delta * tonumber(ARGV[1]))
  local ok     = tokens >= tonumber(ARGV[4]) and 1 or 0
  if ok == 1 then tokens = tokens - tonumber(ARGV[4]) end
  redis.call('SET', KEYS[1], tokens, 'PX', 3600000)
  redis.call('SET', KEYS[2], ARGV[3], 'PX', 3600000)
  return { ok, tokens }
)LUA";
auto reply = redis.Eval(kLua,
    {"rl:tok:{" + org + "}:" + ep, "rl:ts:{" + org + "}:" + ep},
    {std::to_string(rate), std::to_string(cap),
     std::to_string(now_ms), std::to_string(cost)});
typescript
const lua = `
  local tokens = tonumber(redis.call('GET', KEYS[1]) or ARGV[2])
  local last   = tonumber(redis.call('GET', KEYS[2]) or ARGV[3])
  local delta  = math.max(0, tonumber(ARGV[3]) - last) / 1000.0
  tokens       = math.min(tonumber(ARGV[2]), tokens + delta * tonumber(ARGV[1]))
  local ok     = tokens >= tonumber(ARGV[4]) and 1 or 0
  if ok == 1 then tokens = tokens - tonumber(ARGV[4]) end
  redis.call('SET', KEYS[1], tokens, 'PX', 3600000)
  redis.call('SET', KEYS[2], ARGV[3], 'PX', 3600000)
  return { ok, tokens }
`;
const [allowed, remaining] = await redis.eval(
  lua, 2,
  `rl:tok:{${orgId}}:${endpoint}`,
  `rl:ts:{${orgId}}:${endpoint}`,
  String(rate), String(capacity), String(Date.now()), String(cost)
);

The curly braces around ${orgId} in the key are Redis Cluster hash tags — Redis hashes only the content between {}, so both keys for a given org land on the same slot, making multi-key Lua scripts safe.

2) Redis failure — fail open or closed? ​

Bad Solution: Fail open globally.

  • Approach: If Redis is unreachable, allow all requests.
  • Challenges: You've just turned off your rate limiter during the moment you need it most — cache failures often coincide with load spikes. Downstream services get DDoSed.

Good Solution: Fail closed globally.

  • Approach: If Redis is unreachable, deny all requests.
  • Challenges: You've turned your cache outage into an app outage. A 30-second Redis blip becomes a 30-second API outage. Customers complain.

Great Solution: Per-policy fail mode + local fallback bucket.

  • Approach:
    • Each policy specifies its fail mode: fail_open for soft limits, fail_closed for hard limits (billing, compliance, per-feature quotas).
    • When Redis is reachable: use the authoritative cluster count.
    • When Redis is unreachable: fall back to a local token bucket on each gateway node (sketched after this list). Capacity is globalLimit / nodeCount × 0.7 — conservative to avoid global overshoot.
    • Emit a SEV alert to operators when running in fallback mode; track duration.
    • When Redis returns, reconcile: read recent denials and allowances from fallback logs, update authoritative counters via CDC. Brief over-issuance during the switchover is documented and acceptable.
  • Challenges: Local fallback is approximate — over-issues by up to 1 / node_count in the worst case. Hard limits with fail_closed still cause 429s during Redis outage — customers see errors. Tradeoff is explicit: "we'd rather refuse a few billing-metered calls than overcharge."
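
A minimal sketch of the fallback path described above; the per-policy fail mode and the nodeCount × 0.7 scaling come from this deep dive, while the class and method names are illustrative.

java
// Per-node bucket used only while Redis is unreachable; capacity and rate are scaled
// to globalLimit / nodeCount * 0.7 to stay conservative, as described above.
final class LocalFallbackBucket {
    private final double ratePerSec;
    private final double capacity;
    private double tokens;
    private long lastRefillMs = System.currentTimeMillis();

    LocalFallbackBucket(double globalRatePerSec, double globalCapacity, int nodeCount) {
        this.ratePerSec = globalRatePerSec / nodeCount * 0.7;
        this.capacity   = globalCapacity   / nodeCount * 0.7;
        this.tokens     = this.capacity;
    }

    synchronized boolean tryAcquire(int cost) {
        long now = System.currentTimeMillis();
        tokens = Math.min(capacity, tokens + (now - lastRefillMs) / 1000.0 * ratePerSec);
        lastRefillMs = now;
        if (tokens < cost) return false;
        tokens -= cost;
        return true;
    }
}

// On a Redis error: soft limits (failMode = open) consult the local bucket;
// hard limits (failMode = closed) deny outright and surface a 429.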

3) Distributed counter accuracy — how accurate can we be at 1M RPS? ​

Bad Solution: Per-node local counters, summed periodically.

  • Approach: Each gateway node keeps its own counter; a background process sums them.
  • Challenges: The global view can drift by roughly nodeCount × syncInterval × per-node RPS. At 100 nodes, a 0.1 s sync interval, and ~1k RPS per node, that is ~10k requests of over-allowance. Double-reporting (e.g., on retries) skews the other way, inflating counts and denying requests early.

Good Solution: Single centralized Redis counter per key.

  • Approach: Every node hits the same Redis key. Accurate within the time of a single Lua script.
  • Challenges: For a mega-tenant with 100k RPS on one endpoint, that's 100k ops/s hitting one Redis key. Single-key throughput ceiling (~100k ops/s for INCR, lower with Lua) is breached.

Great Solution: Counter sharding for high-QPS tenants with periodic reconciliation.

  • Approach:
    • Normal tenants use a single counter per (orgId, endpoint, window).
    • High-QPS tenants (> 10k RPS on a single endpoint) have sharded counters: rl:tok:{orgId}:ep:0..N-1. The client picks a shard at random (key construction sketched after this list).
    • Each shard enforces limit / N independently. Total admitted stays within the overall limit at steady state.
    • A reconciler process runs every 1 s, sums the shards, and redistributes unused capacity. E.g., if shard 0 has burned 80% and shard 3 only 20%, reconciler rebalances on the next tick.
    • Trade a small over-issuance (a few percent at window boundaries) for linear scalability.
  • Challenges: Reconciler adds complexity and a second source of truth. Shard imbalance within a window can cause premature 429 for some users while capacity is free elsewhere. Mitigate by starting with few shards (N=2) and ramping only when needed.
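
A sketch of the shard-key construction and per-shard limit from the approach above; the record and helper are illustrative, and the Lua script from deep dive 1 then runs against these keys with its limits scaled down.

java
import java.util.concurrent.ThreadLocalRandom;

// For a high-QPS tenant, spread the counter across N shards and enforce limit/N on each.
final class CounterSharding {
    record ShardKeys(String tokensKey, String tsKey, long perShardLimit) {}

    static ShardKeys pick(String orgId, String endpoint, long limit, int shardCount) {
        int shard = ThreadLocalRandom.current().nextInt(shardCount);   // random shard per request
        String suffix = "{" + orgId + "}:" + endpoint + ":" + shard;   // {orgId} hash tag keeps keys colocated
        return new ShardKeys("rl:tok:" + suffix, "rl:ts:" + suffix, Math.max(1, limit / shardCount));
    }
}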

4) Policy evolution and tenant overrides ​

Bad Solution: Hard-coded limits in YAML files.

  • Approach: Policies checked into config; deploy to change.
  • Challenges: Policy changes take 30+ minutes to deploy across a fleet. Cannot react to an incident ("spike from Org X, raise their limit immediately"). No runtime override for enterprise customers.

Good Solution: Policies in a DB, fetched on each request.

  • Approach: Policy service reads Postgres on every check.
  • Challenges: Adds DB round-trip to every rate-limit check — kills the 2ms latency budget.

Great Solution: Per-org override hierarchy + CDC-propagated cache + shadow mode for rollouts.

  • Approach:
    • Hierarchy: effective policy for (org, endpoint) is: tier default → org override → endpoint-specific override. Most-specific wins.
    • Cached in-process (1 s TTL); invalidated via CDC from policy.updates topic. Propagation latency < 1 s.
    • Audit every policy change — meta-audit — so we can trace "who raised Org X's limit from 1k to 10k and when."
    • Shadow mode for new policies: deploy in log-only mode first ("would have denied") for N hours (see the sketch after this list). Review the denials in a dashboard. If it looks right, flip to enforce mode with one config change.
    • Blue-green policy deploys: old and new policies are evaluated in parallel for a transition window; a feature flag controls which one is enforced.
  • Challenges: Hierarchy resolution has to be fast (it's in-process and cached — fine). Shadow mode requires a parallel execution path — increases Redis load during the shadow window. Auditing adds write load to the audit system.
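
A sketch of the shadow-mode evaluation from the approach above; the Check interface, class names, and logging are hypothetical stand-ins for the filter's real wiring.

java
// Evaluate the candidate policy alongside the active one; only the enforced policy's
// verdict affects the request, the other is logged as a "would have denied".
final class ShadowEvaluator {
    interface Check { boolean allowed(String orgId, String endpoint, String policyId); }

    private final Check check;   // wraps the Redis/Lua token-bucket check
    ShadowEvaluator(Check check) { this.check = check; }

    boolean evaluate(String orgId, String endpoint,
                     String activePolicyId, String candidatePolicyId, boolean candidateEnforced) {
        boolean active    = check.allowed(orgId, endpoint, activePolicyId);
        boolean candidate = check.allowed(orgId, endpoint, candidatePolicyId);   // extra Redis load while shadowing
        if (active && !candidate) {
            System.out.printf("shadow_denial org=%s endpoint=%s policy=%s%n",
                              orgId, endpoint, candidatePolicyId);               // reviewed in a dashboard
        }
        return candidateEnforced ? candidate : active;
    }
}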

5) Observability — telling a customer "why did I get 429'd?" ​

Bad Solution: No per-denial context.

  • Approach: Return 429 with no body.
  • Challenges: Customer calls support. Support calls engineering. Everyone spends hours pawing through logs.

Good Solution: Response headers.

  • Approach: Return X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After.
  • Challenges: Customer knows the limit but not why. Doesn't help debug unexpected spikes.

Great Solution: Self-service usage dashboard + denial stream + policy visibility.

  • Approach:
    • Response headers give per-request context.
    • GET /v1/orgs/{orgId}/ratelimit/usage returns time-series of consumption per endpoint. Customers can plot "we hit 80% at 14:32, why?" themselves.
    • Denial stream published to Kafka ratelimit.denials with per-denial metadata: orgId, endpoint, userId, policyId, timestamp. Customers can subscribe via webhook for real-time alerting (publish sketch after this list).
    • Policy visibility API: GET /v1/orgs/{orgId}/ratelimit/policies returns the effective policies for this org — tier default + overrides. Removes "why is my limit different from the docs?" confusion.
    • Self-service: a customer portal shows current usage, historical trends, and which endpoints are approaching limits.
  • Challenges: Dashboard requires significant product work. Denial stream adds Kafka load. Policy visibility requires explaining the override hierarchy to customers — some complexity they'd rather not see.
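
A minimal sketch of publishing one event per denial to ratelimit.denials with the metadata listed above; the JSON shape and producer wiring are illustrative.

java
import java.time.Instant;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// One event per denial, keyed by orgId so downstream consumers can process per org.
final class DenialPublisher {
    private final KafkaProducer<String, String> producer;
    DenialPublisher(KafkaProducer<String, String> producer) { this.producer = producer; }

    void publish(String orgId, String endpoint, String userId, String policyId) {
        String json = String.format(
            "{\"orgId\":\"%s\",\"endpoint\":\"%s\",\"userId\":\"%s\",\"policyId\":\"%s\",\"ts\":\"%s\"}",
            orgId, endpoint, userId, policyId, Instant.now());
        producer.send(new ProducerRecord<>("ratelimit.denials", orgId, json));
    }
}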

What is Expected at Each Level? ​

Mid-level (SMTS-junior) ​

Single-node token bucket. Basic Redis backend for shared counters. Tenant-keyed counters. Awareness that fail mode matters.

Senior (SMTS / LMTS) ​

Lua atomicity on Redis for correct concurrent updates. Per-policy fail mode. Policy cache with CDC invalidation. Basic per-org observability.

Staff+ (PMTS) ​

Multi-layer fallback (local bucket + Redis). Counter sharding for hot tenants. Shadow mode for policy rollouts. Cost model for Redis cluster and gateway fleet. Clear distinction between global concurrency (admission service) and rate (this service). Self-service usage dashboards.


Salesforce-Specific Considerations ​

  • Direct analog: Salesforce API Request Limits (per-org per-24h), Concurrent Request Limit, and governor limits more broadly. The design patterns here are what Salesforce runs in production.
  • Fairness philosophy: Salesforce's governor limits cap the worst case rather than optimize the best. Apply the same lens: we would rather refuse a paid Enterprise customer's burst than let them monopolize the platform and degrade smaller tenants.
  • Event Relay and Pub/Sub API subscriptions are also rate-limited per-org — same token bucket patterns apply.
  • Per-transaction governor limits (SOQL queries, DML statements, CPU time) are in-process rate limits inside Apex. Our API-level rate limiter is the distributed analog.
  • Hyperforce regional isolation: rate limits are per-region. Cross-region calls are counted once per region, not globally.
  • Platform Events metering: each org has a daily platform event allocation. Our ratelimit.denials topic mirrors Salesforce's event allocation alerts.
  • Shield Event Monitoring: every rate-limit denial is itself an audit event that Shield customers can ingest.

Example snippet — in-process policy cache ​

java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

public class PolicyCache {
    private final PolicyRepo policyRepo;   // backing policy store (assumed injected)
    private final Cache<String, Policy> cache =
        Caffeine.newBuilder().expireAfterWrite(Duration.ofSeconds(1)).build();

    public PolicyCache(PolicyRepo policyRepo) { this.policyRepo = policyRepo; }

    public Policy lookup(String orgId, String tier, String endpoint) {
        String key = orgId + ":" + tier + ":" + endpoint;
        return cache.get(key, k -> resolveHierarchy(orgId, tier, endpoint));
    }

    private Policy resolveHierarchy(String orgId, String tier, String endpoint) {
        // Most-specific wins: endpoint override > org override > tier default.
        Policy p = policyRepo.findEndpointOverride(orgId, endpoint);
        if (p != null) return p;
        p = policyRepo.findOrgOverride(orgId);
        if (p != null) return p;
        return policyRepo.findTierDefault(tier, endpoint);
    }

    public void invalidate(String orgId) {
        cache.asMap().keySet().removeIf(k -> k.startsWith(orgId + ":"));
    }
}
cpp
Policy PolicyCache::Lookup(const std::string& org_id,
                           const std::string& tier,
                           const std::string& endpoint) {
  const std::string key = org_id + ":" + tier + ":" + endpoint;
  return cache_.GetOrCompute(key, [&] {
    if (auto p = repo_.FindEndpointOverride(org_id, endpoint)) return *p;
    if (auto p = repo_.FindOrgOverride(org_id)) return *p;
    return repo_.FindTierDefault(tier, endpoint);
  });
}

void PolicyCache::Invalidate(const std::string& org_id) {
  cache_.RemoveIf([&](const std::string& k) {
    return k.rfind(org_id + ":", 0) == 0;
  });
}
typescript
class PolicyCache {
  constructor(private repo: PolicyRepo) {}   // backing policy store (assumed injected)
  private cache = new LRU<string, Policy>({ ttl: 1000, max: 10_000 });

  lookup(orgId: string, tier: string, endpoint: string): Policy {
    const key = `${orgId}:${tier}:${endpoint}`;
    const hit = this.cache.get(key);
    if (hit) return hit;
    const resolved =
      this.repo.findEndpointOverride(orgId, endpoint) ??
      this.repo.findOrgOverride(orgId) ??
      this.repo.findTierDefault(tier, endpoint);
    this.cache.set(key, resolved);
    return resolved;
  }

  invalidate(orgId: string) {
    for (const k of this.cache.keys()) {
      if (k.startsWith(`${orgId}:`)) this.cache.delete(k);
    }
  }
}
