HLD: Multi-Channel Notification Service
Frequently asked at Salesforce: a common HLD problem in recent SMTS interviews (2024-2026).
Understanding the Problem
What is a Notification Service?
A notification service is a platform API that any internal team (CRM, Marketing Cloud, Service Cloud) or customer org can call to deliver messages to end users across multiple channels: email, SMS, mobile push, and in-app. It sits between upstream producers that generate events (a password reset, a quote approval, a marketing blast) and downstream providers like SES, Twilio, APNs, and FCM. At Salesforce, the analog is the Notifications API and Notification Builder, and the design test is whether you can build something that gracefully handles a 10M-recipient marketing burst from one org without delaying another org's 2-recipient password-reset email.
Functional Requirements
Core (above the line):
- Multi-channel delivery: send notifications over email, SMS, push (APNs/FCM), and in-app (WebSocket / long-poll) from a single API.
- Per-user preferences: users opt in or out per channel, per category, per org ("don't email me marketing, but SMS me for security alerts").
- Templates: reusable templates with localization and per-org branding.
- Scheduled + immediate sends: fire at time T or enqueue now.
- Idempotency: producers can retry without generating duplicates.
- Delivery status tracking: queued / sent / delivered / bounced / opened, exposed via an API and webhooks.
- Per-org rate limits: governor-style caps so a bulk tenant cannot monopolize the pipeline.
Below the line (out of scope):
- A/B testing of templates: belongs in the marketing product.
- Rich marketing segmentation: Marketing Cloud owns this; we are a transport.
- In-editor WYSIWYG composition: UI team.
- Click-tracking link rewriting: a separate subsystem; we only pass links through.
Non-Functional Requirements
Core:
- Scale: 500M notifications/day at peak, averaging ~5.8k/s, with 30-50k/s bursts during marketing events.
- Latency: p99 enqueue < 50 ms; user-visible delivery < 60 s for transactional, < 10 min for bulk.
- Durability: zero silent drops. At-least-once delivery at the platform; the channel handles final-mile dedup via message IDs.
- Availability: 99.95% (that's ~4.38 h of allowed downtime per year).
- Multi-tenancy: every request scoped to an org. One org's blast must not delay another org's security email.
- Compliance: GDPR deletion, CAN-SPAM unsubscribe, per-region data residency.
Below the line:
- Exactly-once (we do at-least-once + idempotent receivers).
- Sub-second bulk delivery (bulk gets a loose 10-min SLA).
Capacity Estimation
Do this math out loud; interviewers grade you on concrete numbers:
- 500M/day divided by 86,400 seconds gives ~5.8k/s average. A 10x peak multiplier puts us at ~60k/s.
- The average payload is 2 KB (template variables + metadata), so peak bandwidth is ~120 MB/s and raw ingress is ~1 TB/day.
- Hot metadata for 90 days with 5-10x amplification (indexes, replicas, audit) is ~0.5-1 PB across regions.
- Worker fleet: if each worker does 200 sends/s, we need ~300 workers at peak; 2x headroom gives ~600.
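The same arithmetic as a runnable sanity check (constants are exactly the assumptions stated above; the class name is illustrative):
```java
// Back-of-envelope capacity model; constants mirror the estimates above.
public final class CapacityModel {
    public static void main(String[] args) {
        long perDay = 500_000_000L;                      // notifications/day
        double avgPerSec = perDay / 86_400.0;            // ~5.8k/s
        double peakPerSec = avgPerSec * 10;              // 10x burst -> ~60k/s
        double peakMBps = peakPerSec * 2_048 / 1e6;      // 2 KB payloads -> ~120 MB/s
        double ingressTBday = perDay * 2_048 / 1e12;     // ~1 TB/day raw
        double hotPB = ingressTBday * 90 * 10 / 1_000;   // 90d hot, 10x amplification
        long workers = (long) Math.ceil(peakPerSec / 200) * 2;  // 200 sends/s/worker, 2x headroom
        System.out.printf("avg=%.0f/s peak=%.0f/s bw=%.0f MB/s ingress=%.2f TB/day hot=%.2f PB workers=%d%n",
                avgPerSec, peakPerSec, peakMBps, ingressTBday, hotPB, workers);
    }
}
```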
The Set Up
Core Entities
- Organization: orgId, tier (Essentials/Professional/Enterprise), region, quotas.
- User: userId, orgId, channel endpoints (email address, phone, device tokens).
- NotificationRequest: the top-level enqueue: requestId, orgId, category, templateId, recipients[], channels[], priority, scheduledAt, idempotencyKey.
- Notification: one fan-out row per recipient per channel, each with its own status.
- UserPreference: (userId, orgId, category, channel) with an opt_in flag.
- Template: (templateId, orgId, channel, locale) and a rendered body template.
- DeliveryEvent: append-only audit of state transitions.
The API
We use REST with tenant-scoped URLs and an idempotency header on every mutating call.
Enqueue a notification:
POST /v1/orgs/{orgId}/notifications
Headers: X-Idempotency-Key: <uuid>, Authorization: Bearer <token>
Body:
```
{
  "category": "security.password_reset",
  "priority": "transactional",   // transactional | bulk
  "templateId": "tpl_pwreset_v3",
  "locale": "en-US",
  "recipients": [
    { "userId": "005...", "channels": ["email", "sms"] }
  ],
  "variables": { "resetLink": "https://..." },
  "scheduledAt": null
}
```
Response 202:
```
{ "requestId": "req_01HX...", "accepted": 1, "rejected": 0 }
```
We return 202 Accepted because the actual delivery is async; the caller should not wait on SES.
Query status, preferences, quotas:
GET /v1/orgs/{orgId}/notifications/{requestId}
GET /v1/orgs/{orgId}/notifications/{requestId}/events
PUT /v1/orgs/{orgId}/users/{userId}/preferences
GET /v1/orgs/{orgId}/quotas
High-Level Design
Architecture
```
┌────────┐   ┌─────────────┐   ┌───────────────┐   ┌────────────┐
│ Caller │──▶│ API Gateway │──▶│ Ingestion     │──▶│ Kafka:     │
│ (Apex, │   │ (authz,     │   │ Service       │   │ notif.in   │
│  apps) │   │ per-org     │   │ (validate,    │   │ (partition │
└────────┘   │ rate lim.)  │   │ dedupe,       │   │ by orgId)  │
             └─────────────┘   │ preferences)  │   └─────┬──────┘
                               └───────┬───────┘         │
                                       │                 ▼
                                       ▼         ┌──────────────────────┐
                                ┌──────────────┐ │ Channel Dispatchers  │
                                │ Preferences  │ │ ┌──────┬─────┬────┐  │
                                │ Service      │ │ │email │ sms │push│  │
                                │ + Cache      │ │ └──────┴─────┴────┘  │
                                └──────────────┘ └─────┬─────────┬──────┘
                                                       │         │
                                   ┌───────────────────┘         ▼
                                   ▼                    ┌──────────────┐
                            ┌─────────────┐             │ SES / Twilio │
                            │ Retry /     │             │ APNs / FCM   │
                            │ DLQ topic   │             └──────────────┘
                            └─────────────┘
```
End-to-end flow (transactional send)
- Caller does POST /v1/orgs/{orgId}/notifications with an X-Idempotency-Key.
- API Gateway authenticates the JWT, checks the per-org token bucket in Redis, and rejects with 429 if over budget.
- Ingestion Service checks notif:idem:{orgId}:{key} in Redis: if seen before, it returns the prior requestId (idempotent replay). Otherwise it validates the template, looks up preferences for each recipient, and filters out opted-out channels.
- Ingestion writes a row into the notifications table in Postgres, fans out per-recipient rows into notification_items, and in the same transaction writes a record to an outbox table.
- Debezium (CDC on the outbox) produces a message to Kafka topic notif.in, partitioned by hash(orgId, shardSalt). This is the transactional outbox pattern: no produce-without-commit.
- Channel Dispatcher consumer groups (one per channel) read from notif.in, render the template with per-org branding, and call the provider (SES, Twilio, APNs, FCM) with a provider-level idempotency key derived from notification_item.id.
- On success, the dispatcher emits a delivery.events record with status SENT. On failure, it writes to notif.retry with exponential backoff (1s, 5s, 30s, 5m, 30m).
- Provider webhooks (SES bounce, Twilio DLR) hit a webhook ingest that flips notification_item.status to DELIVERED / BOUNCED and emits another delivery event.
Data model
- notifications (L1: shared-DB, shared-schema Postgres): (org_id, request_id) PK, status, priority, created_at. Partition by HASH(org_id) into 64 shards.
- notification_items, one per recipient per channel: (org_id, request_id, user_id, channel) PK, status, provider_message_id.
- user_preferences: (org_id, user_id, category, channel) PK, opt_in bool, updated_at.
- delivery_events: Cassandra (time-series friendly) keyed by (org_id, request_id) with a 90-day hot TTL; after that, archive to S3 Parquet.
Indexing rule: every index is prefixed by org_id, e.g. idx_notif_org_status on (org_id, status, created_at DESC). This keeps each tenant's working set contiguous and avoids cross-tenant page-cache pollution (a DDL sketch follows).
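A minimal DDL sketch of the two hot tables under these rules (a sketch only: column types and the NotifSchemaMigration harness are assumptions; the 64 hash partitions would be created separately with MODULUS 64):
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical migration: creates the two hot tables with org_id-first keys.
public final class NotifSchemaMigration {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(System.getenv("PG_URL"));
             Statement s = c.createStatement()) {
            s.execute("""
                CREATE TABLE IF NOT EXISTS notifications (
                  org_id     TEXT NOT NULL,
                  request_id TEXT NOT NULL,
                  status     TEXT NOT NULL,
                  priority   TEXT NOT NULL,
                  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
                  PRIMARY KEY (org_id, request_id)
                ) PARTITION BY HASH (org_id)""");
            s.execute("""
                CREATE TABLE IF NOT EXISTS notification_items (
                  org_id              TEXT NOT NULL,
                  request_id          TEXT NOT NULL,
                  user_id             TEXT NOT NULL,
                  channel             TEXT NOT NULL,
                  status              TEXT NOT NULL,
                  provider_message_id TEXT,
                  PRIMARY KEY (org_id, request_id, user_id, channel)
                ) PARTITION BY HASH (org_id)""");
            // Every secondary index is org_id-prefixed, per the indexing rule above.
            s.execute("CREATE INDEX IF NOT EXISTS idx_notif_org_status " +
                      "ON notifications (org_id, status, created_at DESC)");
        }
    }
}
```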
Multi-Tenancy Strategy
This is the most heavily graded part of the design at Salesforce. State it explicitly.
Isolation level: L1, shared DB + shared schema. At 500M/day, L3 (DB per tenant) would explode cost and ops burden. Cassandra uses org_id as the partition key for events.
Tenant context flow:
- orgId is extracted from the caller's JWT at the API gateway.
- A tenantContext object is stamped into the request and propagated via:
  - Internal RPC headers (X-Org-Id).
  - Kafka message keys (for ordering and partition affinity).
  - Log MDC (orgId=005...) for per-org log filtering.
  - Metric labels (org_id Prometheus tag) for per-org dashboards.
- Every DB query is wrapped by middleware that injects WHERE org_id = :ctx.orgId, so you cannot issue an un-scoped query (see the sketch below).
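One minimal way to enforce that rule (a sketch assuming a thread-local tenant context and raw JDBC; a production version would rewrite SQL in the ORM layer or lean on Postgres row-level security, and all names here are illustrative):
```java
// Rejects any statement that is not org-scoped, and binds the tenant itself.
public final class TenantScopedQueries {
    private static final ThreadLocal<String> ORG = new ThreadLocal<>();

    public static void setTenant(String orgId) { ORG.set(orgId); }

    /** Refuses un-scoped SQL; binds org_id from the tenant context as bind #1. */
    public static java.sql.PreparedStatement prepare(java.sql.Connection c, String sql)
            throws java.sql.SQLException {
        if (!sql.toLowerCase().contains("org_id = ?")) {
            throw new IllegalArgumentException("un-scoped query rejected: " + sql);
        }
        java.sql.PreparedStatement ps = c.prepareStatement(sql);
        ps.setString(1, ORG.get());   // convention: org_id is always the first bind
        return ps;
    }
}
```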
Noisy-neighbor mitigations:
- Per-org token bucket at the edge: 10k/min for Enterprise, 1k/min for Professional, 100/min for Essentials (see the sketch after this list).
- Split topics by priority: notif.transactional and notif.bulk. Transactional consumers get 70% of worker capacity, so transactional traffic never queues behind a bulk blast.
- Kafka partition stickiness: Marketing Cloud-tier orgs get a dedicated partition group to keep their 10M-row blasts off shared partitions.
- Per-org bulkhead thread pools in each dispatcher, with bounded queues per org; overflow re-queues with delay rather than dropping.
- Shuffle sharding on the dispatcher layer: each org is assigned to k=4 of n=100 dispatcher cells, so one runaway org affects 4% of capacity, not 100%.
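A minimal in-process sketch of the per-org token bucket (the real limiter lives in Redis so all gateway pods share state; OrgRateLimiter and the per-tier wiring are illustrative):
```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Token bucket keyed by orgId: capacity = burst, refill = sustained rate.
final class OrgRateLimiter {
    private record Bucket(double tokens, long lastRefillNanos) {}
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();
    private final double capacity;      // max burst, in tokens
    private final double refillPerSec;  // sustained rate

    OrgRateLimiter(double perMinute) {
        this.capacity = perMinute;
        this.refillPerSec = perMinute / 60.0;
    }

    /** Returns true if the org may proceed; false maps to HTTP 429. */
    synchronized boolean tryAcquire(String orgId) {
        long now = System.nanoTime();
        Bucket b = buckets.getOrDefault(orgId, new Bucket(capacity, now));
        double refilled = Math.min(capacity,
                b.tokens() + (now - b.lastRefillNanos()) / 1e9 * refillPerSec);
        boolean ok = refilled >= 1.0;
        buckets.put(orgId, new Bucket(ok ? refilled - 1.0 : refilled, now));
        return ok;
    }
}

// Usage: one limiter per tier, e.g. new OrgRateLimiter(10_000) for Enterprise.
```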
Per-tenant observability:
- Every metric carries an org_id label; dashboards are filterable by org.
- Per-org SLO dashboards show enqueue p99, delivery latency, and bounce rate.
- Cost allocation: we can answer "how much does Org X cost us per month" by summing labeled metrics.
Potential Deep Dives
1) How do we guarantee fan-out correctness under retries and idempotency?
Bad Solution: Trust the caller to retry correctly.
- Approach: Caller retries on any 5xx. We write a new row each time.
- Challenges: Producer retries on timeout generate duplicates. Consumer retries on channel errors generate more duplicates. Users receive 5 password-reset emails. Unacceptable.
Good Solution: Edge-level idempotency key.
- Approach: Every POST carries an X-Idempotency-Key. Store it in Redis with SETNX and a 24-hour TTL. On replay, return the stored requestId.
- Challenges: Still possible to double-send at the dispatcher layer if a consumer crashes after the provider call but before committing its offset. Redis memory pressure at 500M keys/day (~11 GB just for idempotency).
Great Solution: Two-level dedup, request-level + item-level.
- Approach:
  - Request-level: SETNX notif:idem:{orgId}:{key} with a 24h TTL. First writer wins; replays return the cached requestId.
  - Item-level: each notification_item has a stable message_id computed as sha256(requestId, userId, channel). Where the provider supports an idempotency token we pass this ID along; where it does not, the same ID drives our own last-line dedup via the conditional update below (see the hashing sketch after this list).
  - State-machine transitions are persisted via conditional updates in Cassandra: UPDATE delivery_events SET status = 'SENT' WHERE item_id = ? IF status = 'QUEUED'. Concurrent retries converge.
- Challenges: The memory cost of keeping 24h of idempotency keys is real. Mitigate by sharding Redis by orgId hash and TTL-compacting.
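A sketch of the stable item-level ID (the newline separator is an assumption; any unambiguous encoding works):
```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Same request + recipient + channel always hashes to the same message_id,
// so retries are recognizable at every layer downstream.
final class MessageIds {
    static String messageId(String requestId, String userId, String channel) {
        try {
            MessageDigest d = MessageDigest.getInstance("SHA-256");
            // '\n' separators prevent ambiguous concatenations like ("ab","c") vs ("a","bc").
            d.update((requestId + "\n" + userId + "\n" + channel)
                    .getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(d.digest());
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-256 is always available", e);
        }
    }
}
```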
2) How do we keep fairness across orgs during a burst?
Bad Solution: FIFO Kafka topic.
- Approach: One big topic; producers push in order; consumers drain FIFO.
- Challenges: A 10M-recipient blast from Org X blocks Org Y's 2-recipient transactional email behind it. Head-of-line blocking kills the transactional SLA.
Good Solution: Priority topics.
- Approach: Two topics, notif.transactional and notif.bulk, with separate consumer groups. Transactional gets 70% of workers. Within each topic, partition by orgId.
- Challenges: Inside notif.bulk, Org X's 10M blast still blocks Org Y's 1M blast if they hash to the same partition. Not fully fair.
Great Solution: Weighted fair queuing with shuffle sharding.
- Approach:
  - A scheduler inside each consumer pulls min(perOrgBudget, globalBudget) messages per tick.
  - Implemented as N virtual per-org queues per partition with round-robin driven by a per-org virtual clock (WFQ, weighted fair queuing).
  - Shuffle sharding at the consumer level: each org is mapped to k=4 of n=100 consumer pods. One mega-tenant only saturates 4 pods, leaving 96 pods free (see the sketch after this list).
  - Enterprise-tier orgs get weight 10, Essentials get weight 1, so paid tiers do get more throughput, but no tier can starve another.
- Challenges: WFQ adds scheduling overhead (~5-10% throughput cost). Virtual clocks drift under clock skew; handle by resetting periodically. Shuffle sharding reduces consumer cache locality.
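A sketch of the deterministic shuffle-shard mapping (the WFQ scheduler is omitted for brevity; seeding the shuffle with orgId.hashCode() is an assumption, and a production version would use a stronger stable hash):
```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Each org gets a stable k-of-n subset of consumer pods, derived from a
// seeded shuffle of the pod list. Two orgs rarely share many pods, so a
// runaway org only saturates its own k pods.
final class ShuffleShard {
    static List<Integer> podsFor(String orgId, int n, int k) {
        List<Integer> pods = new ArrayList<>(n);
        for (int i = 0; i < n; i++) pods.add(i);
        // Seeding with the orgId makes the assignment stable across restarts.
        Collections.shuffle(pods, new Random(orgId.hashCode()));
        return pods.subList(0, k);
    }

    public static void main(String[] args) {
        // k=4 of n=100, matching the numbers above.
        System.out.println("orgA -> " + podsFor("00Da000000000A", 100, 4));
        System.out.println("orgB -> " + podsFor("00Db000000000B", 100, 4));
    }
}
```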
3) How do we prevent retry storms from DDoSing downstream providers?
Bad Solution: Retry immediately on failure.
- Approach: Dispatcher catches the error and re-sends the message.
- Challenges: When SES has a 5-minute outage, every message queued in that window retries instantly when SES recovers. The thundering herd DDoSes SES harder than the original traffic.
Good Solution: Exponential backoff with jitter + DLQ.
- Approach: Retry at 1s, 5s, 30s, 5m, 30m with ±20% jitter. After 5 retries, move to a DLQ for manual investigation.
- Challenges: Still doesn't protect the downstream: all orgs' retries pile up at the same intervals. The DLQ grows silently without operator alerts.
Great Solution: Per-provider circuit breakers + rate-limited re-drive.
- Approach:
  - Circuit breakers keyed by (provider, region), e.g. (SES, us-east-1). Open when the error rate exceeds 20% over 30s. While open, the dispatcher fast-fails to the retry topic without calling the provider at all (a minimal breaker sketch follows this list).
  - A half-open state lets 10% of traffic through; if that succeeds, close; else stay open.
  - Rate-limited re-drive tool: when a circuit closes, we do not flood retries. An operator tool drains the retry topic at 10% of the normal rate for 5 minutes, then ramps up.
  - The DLQ has a dedicated operator UI with replay-by-orgId, replay-by-category, and a hard cap of N replays/min.
  - Alerting: SEV on DLQ depth, circuit-open duration, and per-tenant provider error rate.
- Challenges: Circuit state is per-pod; global coordination needs Redis or a shared state store. Half-open traffic decisions are fiddly; some orgs complain their messages sit in retry while others send fine.
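A minimal per-pod breaker sketch with the thresholds above (keying by (provider, region), sharing state via Redis, and metrics are all left out; the window logic is deliberately simplified):
```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Opens above 20% errors in a 30s window; half-open admits ~10% to probe.
final class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }
    private volatile State state = State.CLOSED;
    private final AtomicLong calls = new AtomicLong(), errors = new AtomicLong();
    private volatile long windowStart = System.currentTimeMillis(), openedAt;

    /** Call before hitting the provider; false means fast-fail to the retry topic. */
    boolean allowRequest() {
        rollWindow();
        return switch (state) {
            case CLOSED -> true;
            case OPEN -> {
                if (System.currentTimeMillis() - openedAt > 30_000) state = State.HALF_OPEN;
                yield state == State.HALF_OPEN && trickle();
            }
            case HALF_OPEN -> trickle();  // admit ~10% of calls to probe recovery
        };
    }

    /** Call after each provider response. */
    void record(boolean success) {
        calls.incrementAndGet();
        if (!success) errors.incrementAndGet();
        if (state == State.HALF_OPEN && success) state = State.CLOSED;
        long c = calls.get();
        if (c >= 20 && errors.get() * 100 / c > 20) {  // >20% error rate, min sample size
            state = State.OPEN;
            openedAt = System.currentTimeMillis();
        }
    }

    private boolean trickle() { return ThreadLocalRandom.current().nextInt(10) == 0; }

    private void rollWindow() {  // reset counters every 30s window
        long now = System.currentTimeMillis();
        if (now - windowStart > 30_000) { windowStart = now; calls.set(0); errors.set(0); }
    }
}
```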
4) How do we guarantee delivery (outbox pattern)?
Bad Solution: Fire-and-forget to SES.
- Approach: The API handler calls ses.sendEmail() and returns 202.
- Challenges: If the DB write succeeds but the send fails, we have a row stuck in QUEUED forever. If the send succeeds but the DB write fails, we emailed a user with no audit trail. The classic dual-write problem.
Good Solution: Write DB first, produce second.
- Approach: Persist notification_item with status QUEUED, then produce to Kafka. The dispatcher flips it to SENT on a provider 2xx, FAILED on 4xx/5xx. Webhooks update to DELIVERED / BOUNCED.
- Challenges: Still subject to dual-write. A crash between the DB write and the Kafka produce leaves rows in QUEUED with no message in Kafka.
Great Solution: Transactional outbox with CDC.
- Approach:
  - In the same DB transaction as the notification_item write, insert a row into an outbox table: (id, aggregate, payload, created_at).
  - Debezium tails the Postgres WAL, emits outbox rows as Kafka messages, and deletes them after ack (or a housekeeping job does). The DB transaction is atomic: either both the business row and the outbox row commit, or neither does.
  - The Kafka consumer is the dispatcher, idempotent because of the two-level dedup already described.
  - Combined with the per-item message_id passed to providers, we get end-to-end at-least-once with idempotent receivers (a sketch of the transactional write follows this list).
- Challenges: Adds Debezium as an ops dependency. Outbox-table growth needs cleanup. Kafka-side order may differ from DB commit order, but partitioning by aggregate key preserves per-aggregate order.
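A sketch of the atomic business-row + outbox write (table and column names follow the data model above; gen_random_uuid() assumes Postgres 13+ or pgcrypto, and the JSON payload encoding is an assumption):
```java
import java.sql.Connection;
import java.sql.PreparedStatement;

// Both inserts commit together or not at all; Debezium only ever sees
// committed WAL, so there is no produce-without-commit.
final class OutboxWriter {
    void enqueueItem(Connection c, String orgId, String requestId,
                     String userId, String channel, String payloadJson) throws Exception {
        c.setAutoCommit(false);
        try (PreparedStatement item = c.prepareStatement(
                 "INSERT INTO notification_items (org_id, request_id, user_id, channel, status) " +
                 "VALUES (?, ?, ?, ?, 'QUEUED')");
             PreparedStatement outbox = c.prepareStatement(
                 "INSERT INTO outbox (id, aggregate, payload, created_at) " +
                 "VALUES (gen_random_uuid(), ?, ?::jsonb, now())")) {
            item.setString(1, orgId); item.setString(2, requestId);
            item.setString(3, userId); item.setString(4, channel);
            item.executeUpdate();
            outbox.setString(1, orgId + ":" + requestId);  // aggregate key = Kafka partition key
            outbox.setString(2, payloadJson);
            outbox.executeUpdate();
            c.commit();  // atomic: business row + outbox row, or neither
        } catch (Exception e) {
            c.rollback();
            throw e;
        }
    }
}
```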
5) How do we handle per-user preferences at scale?
Bad Solution: Query the preferences DB on every send.
- Approach: Dispatcher calls SELECT opt_in FROM user_preferences WHERE ....
- Challenges: At 60k/s, that's 60k preference lookups/s. The DB becomes the bottleneck.
Good Solution: Redis read-through cache, invalidated on update.
- Approach: Cache key pref:{orgId}:{userId}:{category}:{channel} holds the opt_in bool with a 5-minute TTL. A cache miss fetches from the DB and populates. Preference updates invalidate the key.
- Challenges: There is a cache-inconsistency window when a user opts out right before a marketing blast; they might still receive one more email. Acceptable per CAN-SPAM (documented).
Great Solution: CDC-warmed cache + bloom-filter optimization.
- Approach:
  - Preferences live in Postgres; Debezium publishes changes to a pref.changes Kafka topic.
  - A cache-warmer consumer updates Redis on every change. Write-behind keeps the cache warm without TTL-based thrash.
  - For high-throughput bulk sends, batch the preference lookups with MGET and short-circuit via a per-org opt-out bloom filter loaded at send start: if a user is not in the bloom filter, they are definitely opted in, so skip the Redis GET entirely (a sketch follows this list).
  - The preferences service exposes a gRPC API with p99 < 5 ms, backed by the CDC-warmed Redis.
- Challenges: Bloom-filter false positives cause unnecessary Redis lookups (fine). CDC lag means an opt-out takes a few seconds to propagate.
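A sketch of the opt-out short-circuit, assuming Guava's BloomFilter as the dependency (sizing and the key encoding are illustrative):
```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// The filter holds only opted-out (user, category, channel) keys, so a
// negative answer is definitive: the user is opted in, skip Redis entirely.
final class OptOutFilter {
    private final BloomFilter<String> optedOut = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            1_000_000,   // expected opt-outs for this org
            0.01);       // 1% false positives -> 1% wasted Redis GETs

    /** Loaded once at send start from the user_preferences table. */
    void markOptedOut(String userId, String category, String channel) {
        optedOut.put(userId + "|" + category + "|" + channel);
    }

    /** True means "maybe opted out, confirm via Redis MGET"; false means send. */
    boolean needsRedisCheck(String userId, String category, String channel) {
        return optedOut.mightContain(userId + "|" + category + "|" + channel);
    }
}
```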
What is Expected at Each Level?
Mid-level (SMTS-junior)
You should nail the core architecture: API gateway, ingestion service, Kafka topic, per-channel dispatchers, preferences service, provider APIs. You should know that we need idempotency and at-least-once delivery. You can miss fairness scheduling and circuit breakers, but the interviewer should only have to prompt once.
Senior (SMTS / LMTS)
Drive priority topics, per-org rate limits, idempotency at multiple layers, a DLQ with operator tooling, and observability broken out per org. Articulate the outbox pattern when asked about dual-write. Give concrete back-of-envelope numbers without hand-waving.
Staff+ (PMTS)
Proactively introduce weighted fair queuing, shuffle sharding, circuit breakers keyed by (provider, region), crypto-shredding for GDPR, and regional routing for data residency. Show a capacity model with headroom math. Discuss cost optimization (which tier pays for dedicated partitions vs. shared). Propose a rollout plan with feature flags and per-org canaries.
Salesforce-Specific Considerations
- Product analog: Salesforce Notifications API, Notification Builder, Marketing Cloud Transactional Messaging.
- Governor-limit analog: "Email invocations per transaction (10) / per day (5,000 per user license)" maps directly to our per-org rate limit. Mention this explicitly; the interviewer is watching for it.
- Platform Events parallel: the transactional path mirrors Salesforce Platform Events via CometD / Pub/Sub API. Consider reusing the org's event allocation as a capacity check.
- Async Apex parallel: a @future / Queueable caller fires-and-forgets and polls for status. Our 202-then-poll API mirrors this. A senior interviewer will connect these dots if you do not.
- Shield Event Monitoring consumes our delivery.events stream for compliance auditing.
- Hyperforce regional routing: orgId → region → regional notification cluster. Cross-region traffic only for explicit multi-region orgs.
Example snippets: idempotent enqueue
Java:
```java
public EnqueueResult enqueue(String orgId, NotificationRequest req, String idemKey) {
    String redisKey = "notif:idem:" + orgId + ":" + idemKey;
    Boolean firstWrite = redis.setIfAbsent(redisKey, req.getRequestId(),
                                           Duration.ofHours(24));
    if (Boolean.FALSE.equals(firstWrite)) {
        String existing = redis.get(redisKey);
        return EnqueueResult.duplicate(existing);
    }
    outbox.write(orgId, req); // same txn as the DB row
    return EnqueueResult.accepted(req.getRequestId());
}
```
C++:
```cpp
EnqueueResult Enqueue(const std::string& orgId,
                      const NotificationRequest& req,
                      const std::string& idemKey) {
  const std::string redisKey = "notif:idem:" + orgId + ":" + idemKey;
  if (!redis_.SetNX(redisKey, req.request_id(), std::chrono::hours(24))) {
    return EnqueueResult::Duplicate(redis_.Get(redisKey));
  }
  outbox_.Write(orgId, req);  // transactional outbox
  return EnqueueResult::Accepted(req.request_id());
}
```
TypeScript:
```ts
async function enqueue(orgId: string, req: NotificationRequest, idemKey: string) {
  const redisKey = `notif:idem:${orgId}:${idemKey}`;
  const firstWrite = await redis.set(redisKey, req.requestId, "EX", 86400, "NX");
  if (firstWrite !== "OK") {
    const existing = await redis.get(redisKey);
    return { status: "duplicate", requestId: existing };
  }
  await outbox.write(orgId, req); // same DB txn
  return { status: "accepted", requestId: req.requestId };
}
```