HLD: Multi-Channel Notification Service
Frequently asked at Salesforce: a common HLD problem in recent SMTS interviews (2024-2026).
Understanding the Problem
What is a Notification Service?
A notification service is a platform API that any internal team (CRM, Marketing Cloud, Service Cloud) or customer org can call to deliver messages to end users across multiple channels: email, SMS, mobile push, and in-app. It sits between upstream producers that generate events (a password reset, a quote approval, a marketing blast) and downstream providers like SES, Twilio, APNs, and FCM. At Salesforce, the analog is the Notifications API and Notification Builder, and the design test is whether you can build something that gracefully handles a 10M-recipient marketing burst from one org without delaying another org's 2-recipient password-reset email.
Functional Requirements
Core (above the line):
- Multi-channel delivery: send notifications over email, SMS, push (APNs/FCM), and in-app (WebSocket / long-poll) from a single API.
- Per-user preferences: users opt in or out per channel, per category, per org ("don't email me marketing, but SMS me for security alerts").
- Templates: reusable templates with localization and per-org branding.
- Scheduled + immediate sends: fire at time T or enqueue now.
- Idempotency: producers can retry without generating duplicates.
- Delivery status tracking: queued / sent / delivered / bounced / opened, exposed via an API and webhooks.
- Per-org rate limits: governor-style caps so a bulk tenant cannot monopolize the pipeline.
Below the line (out of scope):
- A/B testing of templates: belongs in the marketing product.
- Rich marketing segmentation: Marketing Cloud owns this; we are a transport.
- In-editor WYSIWYG composition: UI team.
- Click-tracking link rewriting: a separate subsystem; we only pass links through.
Non-Functional Requirements
Core:
- Scale: 500M notifications/day at peak, averaging ~5.8k/s, with 30-50k/s bursts during marketing events.
- Latency: p99 enqueue < 50 ms; user-visible delivery < 60 s for transactional, < 10 min for bulk.
- Durability: zero silent drops. At-least-once delivery at the platform; the channel handles final-mile dedup via message IDs.
- Availability: 99.95% (that's ~4.38 h of allowed downtime per year).
- Multi-tenancy: every request scoped to an org. One org's blast must not delay another org's security email.
- Compliance: GDPR deletion, CAN-SPAM unsubscribe, per-region data residency.
Below the line:
- Exactly-once (we do at-least-once + idempotent receivers).
- Sub-second bulk delivery (bulk gets a loose 10-min SLA).
Capacity Estimation
Do this math out loud; interviewers grade you on concrete numbers:
- 500M/day divided by 86,400 seconds gives ~5.8k/s average. A 10x peak multiplier puts us at ~60k/s.
- The average payload is 2 KB (template variables + metadata), so peak bandwidth is ~120 MB/s and raw ingress is ~1 TB/day.
- Hot metadata for 90 days with 5-10x amplification (indexes, replicas, audit) is ~0.5-1 PB across regions.
- Worker fleet: if each worker does 200 sends/s, we need ~300 workers at peak; 2x headroom gives ~600.
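The same arithmetic as a runnable sanity check (constants are exactly the assumptions stated above; the class name is illustrative):
```java
// Back-of-envelope capacity model; constants mirror the estimates above.
public final class CapacityModel {
    public static void main(String[] args) {
        long perDay = 500_000_000L;                      // notifications/day
        double avgPerSec = perDay / 86_400.0;            // ~5.8k/s
        double peakPerSec = avgPerSec * 10;              // 10x burst -> ~60k/s
        double peakMBps = peakPerSec * 2_048 / 1e6;      // 2 KB payloads -> ~120 MB/s
        double ingressTBday = perDay * 2_048 / 1e12;     // ~1 TB/day raw
        double hotPB = ingressTBday * 90 * 10 / 1_000;   // 90d hot, 10x amplification
        long workers = (long) Math.ceil(peakPerSec / 200) * 2;  // 200 sends/s/worker, 2x headroom
        System.out.printf("avg=%.0f/s peak=%.0f/s bw=%.0f MB/s ingress=%.2f TB/day hot=%.2f PB workers=%d%n",
                avgPerSec, peakPerSec, peakMBps, ingressTBday, hotPB, workers);
    }
}
```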
The Set Up
Core Entities
- Organization: orgId, tier (Essentials/Professional/Enterprise), region, quotas.
- User: userId, orgId, channel endpoints (email address, phone, device tokens).
- NotificationRequest: the top-level enqueue: requestId, orgId, category, templateId, recipients[], channels[], priority, scheduledAt, idempotencyKey.
- Notification: one fan-out row per recipient per channel, each with its own status.
- UserPreference: (userId, orgId, category, channel) with an opt_in flag.
- Template: (templateId, orgId, channel, locale) and a rendered body template.
- DeliveryEvent: append-only audit of state transitions.
The API
We use REST with tenant-scoped URLs and an idempotency header on every mutating call.
Enqueue a notification:
POST /v1/orgs/{orgId}/notifications
Headers: X-Idempotency-Key: <uuid>, Authorization: Bearer <token>
Body:
```
{
  "category": "security.password_reset",
  "priority": "transactional",   // transactional | bulk
  "templateId": "tpl_pwreset_v3",
  "locale": "en-US",
  "recipients": [
    { "userId": "005...", "channels": ["email", "sms"] }
  ],
  "variables": { "resetLink": "https://..." },
  "scheduledAt": null
}
```
Response 202:
```
{ "requestId": "req_01HX...", "accepted": 1, "rejected": 0 }
```
We return 202 Accepted because the actual delivery is async; the caller should not wait on SES.
Query status, preferences, quotas:
GET /v1/orgs/{orgId}/notifications/{requestId}
GET /v1/orgs/{orgId}/notifications/{requestId}/events
PUT /v1/orgs/{orgId}/users/{userId}/preferences
GET /v1/orgs/{orgId}/quotas
High-Level Design
Architecture
```
┌────────┐   ┌─────────────┐   ┌───────────────┐   ┌────────────┐
│ Caller │──▶│ API Gateway │──▶│ Ingestion     │──▶│ Kafka:     │
│ (Apex, │   │ (authz,     │   │ Service       │   │ notif.in   │
│  apps) │   │ per-org     │   │ (validate,    │   │ (partition │
└────────┘   │ rate lim.)  │   │ dedupe,       │   │ by orgId)  │
             └─────────────┘   │ preferences)  │   └─────┬──────┘
                               └───────┬───────┘         │
                                       │                 ▼
                                       ▼         ┌──────────────────────┐
                                ┌──────────────┐ │ Channel Dispatchers  │
                                │ Preferences  │ │ ┌──────┬─────┬────┐  │
                                │ Service      │ │ │email │ sms │push│  │
                                │ + Cache      │ │ └──────┴─────┴────┘  │
                                └──────────────┘ └─────┬─────────┬──────┘
                                                       │         │
                                   ┌───────────────────┘         ▼
                                   ▼                    ┌──────────────┐
                            ┌─────────────┐             │ SES / Twilio │
                            │ Retry /     │             │ APNs / FCM   │
                            │ DLQ topic   │             └──────────────┘
                            └─────────────┘
```
End-to-end flow (transactional send)
- Caller does POST /v1/orgs/{orgId}/notifications with an X-Idempotency-Key.
- API Gateway authenticates the JWT, checks the per-org token bucket in Redis, and rejects with 429 if over budget.
- Ingestion Service checks notif:idem:{orgId}:{key} in Redis: if seen before, it returns the prior requestId (idempotent replay). Otherwise it validates the template, looks up preferences for each recipient, and filters out opted-out channels.
- Ingestion writes a row into the notifications table in Postgres, fans out per-recipient rows into notification_items, and in the same transaction writes a record to an outbox table.
- Debezium (CDC on the outbox) produces a message to Kafka topic notif.in, partitioned by hash(orgId, shardSalt). This is the transactional outbox pattern: no produce-without-commit.
- Channel Dispatcher consumer groups (one per channel) read from notif.in, render the template with per-org branding, and call the provider (SES, Twilio, APNs, FCM) with a provider-level idempotency key derived from notification_item.id.
- On success, the dispatcher emits a delivery.events record with status SENT. On failure, it writes to notif.retry with exponential backoff (1s, 5s, 30s, 5m, 30m).
- Provider webhooks (SES bounce, Twilio DLR) hit a webhook ingest that flips notification_item.status to DELIVERED / BOUNCED and emits another delivery event.
Data model
- notifications (L1: shared-DB, shared-schema Postgres): (org_id, request_id) PK, status, priority, created_at. Partition by HASH(org_id) into 64 shards.
- notification_items, one per recipient per channel: (org_id, request_id, user_id, channel) PK, status, provider_message_id.
- user_preferences: (org_id, user_id, category, channel) PK, opt_in bool, updated_at.
- delivery_events: Cassandra (time-series friendly) keyed by (org_id, request_id) with a 90-day hot TTL; after that, archive to S3 Parquet.
Indexing rule: every index is prefixed by org_id, e.g. idx_notif_org_status on (org_id, status, created_at DESC). This keeps each tenant's working set contiguous and avoids cross-tenant page-cache pollution (a DDL sketch follows).
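A minimal DDL sketch of the two hot tables under these rules (a sketch only: column types and the NotifSchemaMigration harness are assumptions; the 64 hash partitions would be created separately with MODULUS 64):
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical migration: creates the two hot tables with org_id-first keys.
public final class NotifSchemaMigration {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(System.getenv("PG_URL"));
             Statement s = c.createStatement()) {
            s.execute("""
                CREATE TABLE IF NOT EXISTS notifications (
                  org_id     TEXT NOT NULL,
                  request_id TEXT NOT NULL,
                  status     TEXT NOT NULL,
                  priority   TEXT NOT NULL,
                  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
                  PRIMARY KEY (org_id, request_id)
                ) PARTITION BY HASH (org_id)""");
            s.execute("""
                CREATE TABLE IF NOT EXISTS notification_items (
                  org_id              TEXT NOT NULL,
                  request_id          TEXT NOT NULL,
                  user_id             TEXT NOT NULL,
                  channel             TEXT NOT NULL,
                  status              TEXT NOT NULL,
                  provider_message_id TEXT,
                  PRIMARY KEY (org_id, request_id, user_id, channel)
                ) PARTITION BY HASH (org_id)""");
            // Every secondary index is org_id-prefixed, per the indexing rule above.
            s.execute("CREATE INDEX IF NOT EXISTS idx_notif_org_status " +
                      "ON notifications (org_id, status, created_at DESC)");
        }
    }
}
```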
Multi-Tenancy Strategy
This is the most heavily graded part of the design at Salesforce. State it explicitly.
Isolation level: L1, shared DB + shared schema. At 500M/day, L3 (DB per tenant) would explode cost and ops burden. Cassandra uses org_id as the partition key for events.
Tenant context flow:
- orgId is extracted from the caller's JWT at the API gateway.
- A tenantContext object is stamped into the request and propagated via:
  - Internal RPC headers (X-Org-Id).
  - Kafka message keys (for ordering and partition affinity).
  - Log MDC (orgId=005...) for per-org log filtering.
  - Metric labels (org_id Prometheus tag) for per-org dashboards.
- Every DB query is wrapped by middleware that injects WHERE org_id = :ctx.orgId, so you cannot issue an un-scoped query (see the sketch below).
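One minimal way to enforce that rule (a sketch assuming a thread-local tenant context and raw JDBC; a production version would rewrite SQL in the ORM layer or lean on Postgres row-level security, and all names here are illustrative):
```java
// Rejects any statement that is not org-scoped, and binds the tenant itself.
public final class TenantScopedQueries {
    private static final ThreadLocal<String> ORG = new ThreadLocal<>();

    public static void setTenant(String orgId) { ORG.set(orgId); }

    /** Refuses un-scoped SQL; binds org_id from the tenant context as bind #1. */
    public static java.sql.PreparedStatement prepare(java.sql.Connection c, String sql)
            throws java.sql.SQLException {
        if (!sql.toLowerCase().contains("org_id = ?")) {
            throw new IllegalArgumentException("un-scoped query rejected: " + sql);
        }
        java.sql.PreparedStatement ps = c.prepareStatement(sql);
        ps.setString(1, ORG.get());   // convention: org_id is always the first bind
        return ps;
    }
}
```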
Noisy-neighbor mitigations:
- Per-org token bucket at the edge: 10k/min for Enterprise, 1k/min for Professional, 100/min for Essentials (see the sketch after this list).
- Split topics by priority: notif.transactional and notif.bulk. Transactional consumers get 70% of worker capacity, so transactional traffic never queues behind a bulk blast.
- Kafka partition stickiness: Marketing Cloud-tier orgs get a dedicated partition group to keep their 10M-row blasts off shared partitions.
- Per-org bulkhead thread pools in each dispatcher, with bounded queues per org; overflow re-queues with delay rather than dropping.
- Shuffle sharding on the dispatcher layer: each org is assigned to k=4 of n=100 dispatcher cells, so one runaway org affects 4% of capacity, not 100%.
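A minimal in-process sketch of the per-org token bucket (the real limiter lives in Redis so all gateway pods share state; OrgRateLimiter and the per-tier wiring are illustrative):
```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Token bucket keyed by orgId: capacity = burst, refill = sustained rate.
final class OrgRateLimiter {
    private record Bucket(double tokens, long lastRefillNanos) {}
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();
    private final double capacity;      // max burst, in tokens
    private final double refillPerSec;  // sustained rate

    OrgRateLimiter(double perMinute) {
        this.capacity = perMinute;
        this.refillPerSec = perMinute / 60.0;
    }

    /** Returns true if the org may proceed; false maps to HTTP 429. */
    synchronized boolean tryAcquire(String orgId) {
        long now = System.nanoTime();
        Bucket b = buckets.getOrDefault(orgId, new Bucket(capacity, now));
        double refilled = Math.min(capacity,
                b.tokens() + (now - b.lastRefillNanos()) / 1e9 * refillPerSec);
        boolean ok = refilled >= 1.0;
        buckets.put(orgId, new Bucket(ok ? refilled - 1.0 : refilled, now));
        return ok;
    }
}

// Usage: one limiter per tier, e.g. new OrgRateLimiter(10_000) for Enterprise.
```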
Per-tenant observability:
- Every metric carries an org_id label; dashboards are filterable by org.
- Per-org SLO dashboards show enqueue p99, delivery latency, and bounce rate.
- Cost allocation: we can answer "how much does Org X cost us per month" by summing labeled metrics.
Potential Deep Dives
1) How do we guarantee fan-out correctness under retries and idempotency?
Bad Solution: Trust the caller to retry correctly.
- Approach: Caller retries on any 5xx. We write a new row each time.
- Challenges: Producer retries on timeout generate duplicates. Consumer retries on channel errors generate more duplicates. Users receive 5 password-reset emails. Unacceptable.
Good Solution: Edge-level idempotency key.
- Approach: Every POST carries an X-Idempotency-Key. Store it in Redis with SETNX and a 24-hour TTL. On replay, return the stored requestId.
- Challenges: Still possible to double-send at the dispatcher layer if a consumer crashes after the provider call but before committing its offset. Redis memory pressure at 500M keys/day (~11 GB just for idempotency).
Great Solution: Two-level dedup, request-level + item-level.
- Approach:
  - Request-level: SETNX notif:idem:{orgId}:{key} with a 24h TTL. First writer wins; replays return the cached requestId.
  - Item-level: each notification_item has a stable message_id computed as sha256(requestId, userId, channel). Where the provider supports an idempotency token we pass this ID along; where it does not, the same ID drives our own last-line dedup via the conditional update below (see the hashing sketch after this list).
  - State-machine transitions are persisted via conditional updates in Cassandra: UPDATE delivery_events SET status = 'SENT' WHERE item_id = ? IF status = 'QUEUED'. Concurrent retries converge.
- Challenges: The memory cost of keeping 24h of idempotency keys is real. Mitigate by sharding Redis by orgId hash and TTL-compacting.
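A sketch of the stable item-level ID (the newline separator is an assumption; any unambiguous encoding works):
```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Same request + recipient + channel always hashes to the same message_id,
// so retries are recognizable at every layer downstream.
final class MessageIds {
    static String messageId(String requestId, String userId, String channel) {
        try {
            MessageDigest d = MessageDigest.getInstance("SHA-256");
            // '\n' separators prevent ambiguous concatenations like ("ab","c") vs ("a","bc").
            d.update((requestId + "\n" + userId + "\n" + channel)
                    .getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(d.digest());
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-256 is always available", e);
        }
    }
}
```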
2) How do we keep fairness across orgs during a burst?
Bad Solution: FIFO Kafka topic.
- Approach: One big topic; producers push in order; consumers drain FIFO.
- Challenges: A 10M-recipient blast from Org X blocks Org Y's 2-recipient transactional email behind it. Head-of-line blocking kills the transactional SLA.
Good Solution: Priority topics.
- Approach: Two topics, notif.transactional and notif.bulk, with separate consumer groups. Transactional gets 70% of workers. Within each topic, partition by orgId.
- Challenges: Inside notif.bulk, Org X's 10M blast still blocks Org Y's 1M blast if they hash to the same partition. Not fully fair.
Great Solution: Weighted fair queuing with shuffle sharding.
- Approach:
  - A scheduler inside each consumer pulls min(perOrgBudget, globalBudget) messages per tick.
  - Implemented as N virtual per-org queues per partition with round-robin driven by a per-org virtual clock (WFQ, weighted fair queuing).
  - Shuffle sharding at the consumer level: each org is mapped to k=4 of n=100 consumer pods. One mega-tenant only saturates 4 pods, leaving 96 pods free (see the sketch after this list).
  - Enterprise-tier orgs get weight 10, Essentials get weight 1, so paid tiers do get more throughput, but no tier can starve another.
- Challenges: WFQ adds scheduling overhead (~5-10% throughput cost). Virtual clocks drift under clock skew; handle by resetting periodically. Shuffle sharding reduces consumer cache locality.
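A sketch of the deterministic shuffle-shard mapping (the WFQ scheduler is omitted for brevity; seeding the shuffle with orgId.hashCode() is an assumption, and a production version would use a stronger stable hash):
```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Each org gets a stable k-of-n subset of consumer pods, derived from a
// seeded shuffle of the pod list. Two orgs rarely share many pods, so a
// runaway org only saturates its own k pods.
final class ShuffleShard {
    static List<Integer> podsFor(String orgId, int n, int k) {
        List<Integer> pods = new ArrayList<>(n);
        for (int i = 0; i < n; i++) pods.add(i);
        // Seeding with the orgId makes the assignment stable across restarts.
        Collections.shuffle(pods, new Random(orgId.hashCode()));
        return pods.subList(0, k);
    }

    public static void main(String[] args) {
        // k=4 of n=100, matching the numbers above.
        System.out.println("orgA -> " + podsFor("00Da000000000A", 100, 4));
        System.out.println("orgB -> " + podsFor("00Db000000000B", 100, 4));
    }
}
```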
3) How do we prevent retry storms from DDoSing downstream providers?
Bad Solution: Retry immediately on failure.
- Approach: Dispatcher catches the error and re-sends the message.
- Challenges: When SES has a 5-minute outage, every message queued in that window retries instantly when SES recovers. The thundering herd DDoSes SES harder than the original traffic.
Good Solution: Exponential backoff with jitter + DLQ.
- Approach: Retry at 1s, 5s, 30s, 5m, 30m with ±20% jitter. After 5 retries, move to a DLQ for manual investigation.
- Challenges: Still doesn't protect the downstream: all orgs' retries pile up at the same intervals. The DLQ grows silently without operator alerts.
Great Solution: Per-provider circuit breakers + rate-limited re-drive.
- Approach:
  - Circuit breakers keyed by (provider, region), e.g. (SES, us-east-1). Open when the error rate exceeds 20% over 30s. While open, the dispatcher fast-fails to the retry topic without calling the provider at all (a minimal breaker sketch follows this list).
  - A half-open state lets 10% of traffic through; if that succeeds, close; else stay open.
  - Rate-limited re-drive tool: when a circuit closes, we do not flood retries. An operator tool drains the retry topic at 10% of the normal rate for 5 minutes, then ramps up.
  - The DLQ has a dedicated operator UI with replay-by-orgId, replay-by-category, and a hard cap of N replays/min.
  - Alerting: SEV on DLQ depth, circuit-open duration, and per-tenant provider error rate.
- Challenges: Circuit state is per-pod; global coordination needs Redis or a shared state store. Half-open traffic decisions are fiddly; some orgs complain their messages sit in retry while others send fine.
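A minimal per-pod breaker sketch with the thresholds above (keying by (provider, region), sharing state via Redis, and metrics are all left out; the window logic is deliberately simplified):
```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Opens above 20% errors in a 30s window; half-open admits ~10% to probe.
final class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }
    private volatile State state = State.CLOSED;
    private final AtomicLong calls = new AtomicLong(), errors = new AtomicLong();
    private volatile long windowStart = System.currentTimeMillis(), openedAt;

    /** Call before hitting the provider; false means fast-fail to the retry topic. */
    boolean allowRequest() {
        rollWindow();
        return switch (state) {
            case CLOSED -> true;
            case OPEN -> {
                if (System.currentTimeMillis() - openedAt > 30_000) state = State.HALF_OPEN;
                yield state == State.HALF_OPEN && trickle();
            }
            case HALF_OPEN -> trickle();  // admit ~10% of calls to probe recovery
        };
    }

    /** Call after each provider response. */
    void record(boolean success) {
        calls.incrementAndGet();
        if (!success) errors.incrementAndGet();
        if (state == State.HALF_OPEN && success) state = State.CLOSED;
        long c = calls.get();
        if (c >= 20 && errors.get() * 100 / c > 20) {  // >20% error rate, min sample size
            state = State.OPEN;
            openedAt = System.currentTimeMillis();
        }
    }

    private boolean trickle() { return ThreadLocalRandom.current().nextInt(10) == 0; }

    private void rollWindow() {  // reset counters every 30s window
        long now = System.currentTimeMillis();
        if (now - windowStart > 30_000) { windowStart = now; calls.set(0); errors.set(0); }
    }
}
```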
4) How do we guarantee delivery (outbox pattern)?
Bad Solution: Fire-and-forget to SES.
- Approach: The API handler calls ses.sendEmail() and returns 202.
- Challenges: If the DB write succeeds but the send fails, we have a row stuck in QUEUED forever. If the send succeeds but the DB write fails, we emailed a user with no audit trail. The classic dual-write problem.
Good Solution: Write DB first, produce second.
- Approach: Persist notification_item with status QUEUED, then produce to Kafka. The dispatcher flips it to SENT on a provider 2xx, FAILED on 4xx/5xx. Webhooks update to DELIVERED / BOUNCED.
- Challenges: Still subject to dual-write. A crash between the DB write and the Kafka produce leaves rows in QUEUED with no message in Kafka.
Great Solution: Transactional outbox with CDC.
- Approach:
  - In the same DB transaction as the notification_item write, insert a row into an outbox table: (id, aggregate, payload, created_at).
  - Debezium tails the Postgres WAL, emits outbox rows as Kafka messages, and deletes them after ack (or a housekeeping job does). The DB transaction is atomic: either both the business row and the outbox row commit, or neither does.
  - The Kafka consumer is the dispatcher, idempotent because of the two-level dedup already described.
  - Combined with the per-item message_id passed to providers, we get end-to-end at-least-once with idempotent receivers (a sketch of the transactional write follows this list).
- Challenges: Adds Debezium as an ops dependency. Outbox-table growth needs cleanup. Kafka-side order may differ from DB commit order, but partitioning by aggregate key preserves per-aggregate order.
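A sketch of the atomic business-row + outbox write (table and column names follow the data model above; gen_random_uuid() assumes Postgres 13+ or pgcrypto, and the JSON payload encoding is an assumption):
```java
import java.sql.Connection;
import java.sql.PreparedStatement;

// Both inserts commit together or not at all; Debezium only ever sees
// committed WAL, so there is no produce-without-commit.
final class OutboxWriter {
    void enqueueItem(Connection c, String orgId, String requestId,
                     String userId, String channel, String payloadJson) throws Exception {
        c.setAutoCommit(false);
        try (PreparedStatement item = c.prepareStatement(
                 "INSERT INTO notification_items (org_id, request_id, user_id, channel, status) " +
                 "VALUES (?, ?, ?, ?, 'QUEUED')");
             PreparedStatement outbox = c.prepareStatement(
                 "INSERT INTO outbox (id, aggregate, payload, created_at) " +
                 "VALUES (gen_random_uuid(), ?, ?::jsonb, now())")) {
            item.setString(1, orgId); item.setString(2, requestId);
            item.setString(3, userId); item.setString(4, channel);
            item.executeUpdate();
            outbox.setString(1, orgId + ":" + requestId);  // aggregate key = Kafka partition key
            outbox.setString(2, payloadJson);
            outbox.executeUpdate();
            c.commit();  // atomic: business row + outbox row, or neither
        } catch (Exception e) {
            c.rollback();
            throw e;
        }
    }
}
```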
5) How do we handle per-user preferences at scale?
Bad Solution: Query the preferences DB on every send.
- Approach: Dispatcher calls SELECT opt_in FROM user_preferences WHERE ....
- Challenges: At 60k/s, that's 60k preference lookups/s. The DB becomes the bottleneck.
Good Solution: Redis read-through cache, invalidated on update.
- Approach: Cache key pref:{orgId}:{userId}:{category}:{channel} holds the opt_in bool with a 5-minute TTL. A cache miss fetches from the DB and populates. Preference updates invalidate the key.
- Challenges: There is a cache-inconsistency window when a user opts out right before a marketing blast; they might still receive one more email. Acceptable per CAN-SPAM (documented).
Great Solution: CDC-warmed cache + bloom-filter optimization.
- Approach:
  - Preferences live in Postgres; Debezium publishes changes to a pref.changes Kafka topic.
  - A cache-warmer consumer updates Redis on every change. Write-behind keeps the cache warm without TTL-based thrash.
  - For high-throughput bulk sends, batch the preference lookups with MGET and short-circuit via a per-org opt-out bloom filter loaded at send start: if a user is not in the bloom filter, they are definitely opted in, so skip the Redis GET entirely (a sketch follows this list).
  - The preferences service exposes a gRPC API with p99 < 5 ms, backed by the CDC-warmed Redis.
- Challenges: Bloom-filter false positives cause unnecessary Redis lookups (fine). CDC lag means an opt-out takes a few seconds to propagate.
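A sketch of the opt-out short-circuit, assuming Guava's BloomFilter as the dependency (sizing and the key encoding are illustrative):
```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// The filter holds only opted-out (user, category, channel) keys, so a
// negative answer is definitive: the user is opted in, skip Redis entirely.
final class OptOutFilter {
    private final BloomFilter<String> optedOut = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            1_000_000,   // expected opt-outs for this org
            0.01);       // 1% false positives -> 1% wasted Redis GETs

    /** Loaded once at send start from the user_preferences table. */
    void markOptedOut(String userId, String category, String channel) {
        optedOut.put(userId + "|" + category + "|" + channel);
    }

    /** True means "maybe opted out, confirm via Redis MGET"; false means send. */
    boolean needsRedisCheck(String userId, String category, String channel) {
        return optedOut.mightContain(userId + "|" + category + "|" + channel);
    }
}
```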
What is Expected at Each Level?
Mid-level (SMTS-junior)
You should nail the core architecture: API gateway, ingestion service, Kafka topic, per-channel dispatchers, preferences service, provider APIs. You should know that we need idempotency and at-least-once delivery. You can miss fairness scheduling and circuit breakers, but the interviewer should only have to prompt once.
Senior (SMTS / LMTS)
Drive priority topics, per-org rate limits, idempotency at multiple layers, a DLQ with operator tooling, and observability broken out per org. Articulate the outbox pattern when asked about dual-write. Give concrete back-of-envelope numbers without hand-waving.
Staff+ (PMTS)
Proactively introduce weighted fair queuing, shuffle sharding, circuit breakers keyed by (provider, region), crypto-shredding for GDPR, and regional routing for data residency. Show a capacity model with headroom math. Discuss cost optimization (which tier pays for dedicated partitions vs. shared). Propose a rollout plan with feature flags and per-org canaries.
Salesforce-Specific Considerations
- Product analog: Salesforce Notifications API, Notification Builder, Marketing Cloud Transactional Messaging.
- Governor-limit analog: "Email invocations per transaction (10) / per day (5,000 per user license)" maps directly to our per-org rate limit. Mention this explicitly; the interviewer is watching for it.
- Platform Events parallel: the transactional path mirrors Salesforce Platform Events via CometD / Pub/Sub API. Consider reusing the org's event allocation as a capacity check.
- Async Apex parallel: a @future / Queueable caller fires-and-forgets and polls for status. Our 202-then-poll API mirrors this. A senior interviewer will connect these dots if you do not.
- Shield Event Monitoring consumes our delivery.events stream for compliance auditing.
- Hyperforce regional routing: orgId → region → regional notification cluster. Cross-region traffic only for explicit multi-region orgs.
Example snippets: idempotent enqueue
Java:
```java
public EnqueueResult enqueue(String orgId, NotificationRequest req, String idemKey) {
    String redisKey = "notif:idem:" + orgId + ":" + idemKey;
    Boolean firstWrite = redis.setIfAbsent(redisKey, req.getRequestId(),
                                           Duration.ofHours(24));
    if (Boolean.FALSE.equals(firstWrite)) {
        String existing = redis.get(redisKey);
        return EnqueueResult.duplicate(existing);
    }
    outbox.write(orgId, req); // same txn as the DB row
    return EnqueueResult.accepted(req.getRequestId());
}
```
C++:
```cpp
EnqueueResult Enqueue(const std::string& orgId,
                      const NotificationRequest& req,
                      const std::string& idemKey) {
  const std::string redisKey = "notif:idem:" + orgId + ":" + idemKey;
  if (!redis_.SetNX(redisKey, req.request_id(), std::chrono::hours(24))) {
    return EnqueueResult::Duplicate(redis_.Get(redisKey));
  }
  outbox_.Write(orgId, req);  // transactional outbox
  return EnqueueResult::Accepted(req.request_id());
}
```
TypeScript:
```ts
async function enqueue(orgId: string, req: NotificationRequest, idemKey: string) {
  const redisKey = `notif:idem:${orgId}:${idemKey}`;
  const firstWrite = await redis.set(redisKey, req.requestId, "EX", 86400, "NX");
  if (firstWrite !== "OK") {
    const existing = await redis.get(redisKey);
    return { status: "duplicate", requestId: existing };
  }
  await outbox.write(orgId, req); // same DB txn
  return { status: "accepted", requestId: req.requestId };
}
```