19 — Problem: Notification Service
Understanding the Problem
A notification service is not about sending an email. That part is a three-line SDK call. The interview depth — and the reason this problem shows up at senior level — lives entirely in what happens around that SDK call: one business event fans out into N channel-specific deliveries, each of which must respect user preferences, quiet hours, per-channel rate limits, retries on transient failures, and deduplication against replays of the same event. Strip those concerns away and the problem is trivial. Keep them and the problem becomes an exercise in distributed-systems delivery semantics stuffed into an LLD interview.
The interview depth is 100% in the delivery semantics
If you spend twenty minutes designing a beautiful EmailChannel class hierarchy and never say the words "at-least-once," "idempotency key," or "dead-letter queue," you will get downleveled. Notification services fail in production because of how they handle the unhappy path — provider outages, duplicate deliveries, stuck retries, misconfigured templates blasting a million users. Every senior question about this system is a question about delivery guarantees, failure modes, and the rate limiter in front of each provider. Design the pipeline; the channel adapters are almost an afterthought.
A concrete mental model: an e-commerce app emits an order.shipped event. The notification service subscribes, looks up the user, figures out "send this as email and push (not SMS — user turned that off)," renders the order_shipped template for each channel in the user's locale, enqueues one DeliveryJob per channel, and a worker pool picks those up, hits SendGrid / FCM, records the result, and retries the ones that failed with a transient error. Everything in that sentence — subscription, preference resolution, fan-out, templating, enqueueing, worker pools, retries, result tracking — is a separate class with a clean seam, and every seam has a failure mode you need to reason about.
Clarifying Questions
You: What channels need to be supported — and is the list closed or open?
Interviewer: Four for now: email, SMS, push (mobile), and in-app. But adding a fifth (say, Slack or WhatsApp) should not require changing core code.
You: Are templates code-defined, stored in a database, or both? Do they need to be localized?
Interviewer: Stored in a database, versioned, with variable interpolation (e.g. {{user.firstName}}). Multi-language — each template has variants per locale.
You: Per-user preferences — what's the shape? Per-channel on/off? Per-category (marketing vs transactional)?
Interviewer: Per-channel AND per-category. A user might accept transactional email but block marketing email while still allowing marketing push. Also quiet hours — a window per day in the user's timezone during which non-critical notifications must defer.
You: What delivery guarantee do we owe the caller? At-most-once, at-least-once, exactly-once?
Interviewer: At-least-once end-to-end, with idempotency keys so the same event never produces duplicate deliveries even if the upstream replays it. Exactly-once is a lie at the provider boundary — you can only approximate it.
You: Scheduled and recurring notifications?
Interviewer: Scheduled yes — "send this at 9am tomorrow." Recurring / cron-style is out of scope.
You: Priority tiers — do critical notifications (2FA codes, fraud alerts) jump the queue?
Interviewer: Yes. Critical bypasses quiet hours and marketing-preference filters. And critical notifications must meet a tight p99 — seconds, not minutes.
You: Attachments? Rich content? Inline images in email?
Interviewer: Out of scope for the core design. Assume text + interpolated variables; mention the extension point.
You: Who's the caller? Internal services via an API, or do we subscribe to a shared event bus?
Interviewer: Both. Some services POST to our HTTP API; others publish to Kafka and we consume. Design the entry point as an interface, not either one concretely.
You: What observability do we need to expose — just success/failure counts, or per-channel provider latencies, retry counts, template render failures?
Interviewer: All of the above. Assume every decision in the pipeline emits a metric and a structured log.
You: Rate limits — per channel (SES has sending quotas), per user (don't spam someone), per tenant (multi-tenant SaaS)?
Interviewer: All three, but focus on per-channel (provider-imposed) and per-user (product-imposed). Per-tenant is the same machinery with a different key.
Functional Requirements
- Accept a NotificationEvent from an HTTP API or a Kafka topic with a stable schema: { eventId, userId, type, data, priority?, scheduledAt? }.
- Resolve the set of channels for this user × event-type using UserPreference (per-channel + per-category toggles).
- Render a localized Template for each (channel, locale) pair with variable interpolation and missing-variable detection.
- Enqueue one DeliveryJob per channel and dispatch through a worker pool that calls the channel-specific provider.
- Respect per-channel rate limits and per-user rate limits, deferring (not dropping) when limited.
- Retry transient failures with exponential backoff + jitter; route permanent failures to a dead-letter queue.
- Deduplicate on an idempotency key — the same logical event processed twice must produce one delivery per channel, not two.
- Honor quiet hours for non-critical notifications; defer delivery to the next allowed window in the user's timezone.
- Support scheduled sends: accept scheduledAt, persist, and deliver at or after that time.
- Expose observability: per-channel success/failure/latency, retry counts, queue depths, DLQ size.
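For concreteness, here is what an order.shipped event under this schema might look like — the type mirrors the requirement's schema, and all field values (IDs, carrier, URL) are illustrative:

```typescript
type Priority = "critical" | "transactional" | "marketing";

interface NotificationEvent {
  eventId: string;                  // upstream-stable: the idempotency anchor
  userId: string;
  type: string;                     // e.g. "order.shipped"
  data: Record<string, unknown>;    // template variables
  priority?: Priority;              // omitted => treated as "transactional"
  scheduledAt?: number;             // epoch ms; omitted means "now"
}

// Illustrative payload for the order.shipped example used throughout.
const event: NotificationEvent = {
  eventId: "evt_8f2a91",
  userId: "user_42",
  type: "order.shipped",
  data: { orderId: "ord_1001", carrier: "UPS", trackingUrl: "https://example.com/t/123" },
  priority: "transactional",
};
```

The eventId comes from the upstream producer, not this service — that is what makes it usable as a dedup anchor across retries of the same POST.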
Non-Functional Requirements
- Throughput: millions of notifications per hour aggregate across channels. The pipeline must scale horizontally — no singleton bottlenecks on the hot path.
- Latency:
- Critical (p99): < 5 seconds from event arrival to provider accept.
- Transactional (p99): < 30 seconds.
- Marketing / batch: minutes is fine.
- Durability: zero silent loss. An accepted event must either be delivered, DLQ'd with a reason, or still in flight. There is no "we lost it." This means every handoff between stages is backed by a durable queue (Kafka, SQS, or similar), not an in-process ArrayBlockingQueue.
- Availability: a single provider outage (SendGrid 503s) must not degrade other channels. Channel workers are isolated; one failing channel cannot block another.
- Observability: per-channel success rate, delivery latency histogram, retry rate, DLQ ingress rate, template render failure count. Every metric tagged by channel, template_id, priority.
- Correctness: at-least-once + idempotency = effective exactly-once at the deduplication boundary. At the provider boundary, we accept that duplicates are possible and design accordingly (idempotency-key headers where providers support them).
- Out of scope: UI for composing notifications, A/B test statistical analysis, GDPR purge pipelines, bounce handling beyond routing to DLQ, consent management.
Core Entities and Relationships
| Entity | Responsibility |
|---|---|
| NotificationEvent | Immutable input. { eventId, userId, type, data, priority, scheduledAt?, locale? }. eventId is the upstream idempotency anchor. |
| Notification | Post-resolution record. One event fans out to multiple Notifications — one per channel. { notificationId, eventId, userId, channel, templateId, renderedPayload, priority, dedupKey, createdAt }. |
| Channel | Enum or string: email, sms, push, in_app. |
| IChannel | Interface — the seam where a provider (SES, Twilio, FCM) plugs in. send(notification): Promise<SendResult>. |
| Template | { templateId, version, channel, locale, subject?, body, requiredVars[] }. Rendering is separate. |
| UserPreference | { userId, locale, channels: Record<Channel, boolean>, categories: Record<Category, ChannelMask>, quietHours: { start, end, timezone } }. |
| DeliveryJob | A Notification + retry metadata: { notification, attempt, nextAttemptAt, lastError? }. Lives on a queue. |
| RetryPolicy | Strategy: given attempt count + error class, returns { retry: boolean, nextDelay: Duration }. |
| RateLimiter | Per-channel; token-bucket or leaky-bucket. tryAcquire(key): boolean or acquire(key): Promise<void>. |
| DedupStore | { setIfAbsent(key, ttl): boolean } — Redis SET NX EX or equivalent. |
| EventBus | Source of events. subscribe(handler). |
| DLQ | Dead-letter queue for permanent failures. |
| Pipeline | Ordered chain of Middleware stages between event and delivery. |
| NotificationTracker | Records every state transition (accepted, rendered, enqueued, attempted, delivered, failed, dlq) for observability and audit. |
Relationships: NotificationEvent fans out to N Notifications (one per channel). Each Notification becomes a DeliveryJob on a per-channel queue. A worker pulls the job, calls the IChannel impl, records the result, and either acks or re-enqueues via the RetryPolicy. RateLimiter and DedupStore are consulted by middleware stages before the final send. The Pipeline is the wiring that ties them together.
Interfaces
type Channel = "email" | "sms" | "push" | "in_app";
type Priority = "critical" | "transactional" | "marketing";
type Category = "security" | "transactional" | "marketing" | "social";
interface NotificationEvent {
readonly eventId: string; // upstream-provided, stable
readonly userId: string;
readonly type: string; // e.g. "order.shipped"
readonly data: Record<string, unknown>;
readonly priority: Priority;
readonly scheduledAt?: number; // epoch ms; undefined means "now"
readonly locale?: string; // override; else resolved from user
}
interface Notification {
readonly notificationId: string;
readonly eventId: string;
readonly userId: string;
readonly channel: Channel;
readonly templateId: string;
readonly rendered: RenderedPayload;
readonly priority: Priority;
readonly dedupKey: string;
readonly createdAt: number;
}
interface RenderedPayload {
readonly subject?: string;
readonly body: string;
readonly metadata: Record<string, string>;
}
interface SendResult {
readonly status: "delivered" | "transient_failure" | "permanent_failure";
readonly providerMessageId?: string;
readonly errorCode?: string;
readonly errorMessage?: string;
readonly latencyMs: number;
}
interface IChannel {
readonly channel: Channel;
send(notification: Notification): Promise<SendResult>;
}
interface ITemplateRenderer {
render(templateId: string, channel: Channel, locale: string, data: Record<string, unknown>): RenderedPayload;
}
interface IUserPreferenceStore {
get(userId: string): Promise<UserPreference>;
}
interface IRetryPolicy {
nextAttempt(attempt: number, error: SendResult): { retry: boolean; delayMs: number };
}
interface IEventBus {
subscribe(handler: (event: NotificationEvent) => Promise<void>): void;
}
interface IDedupStore {
// Returns true if the key was newly set (first time seen).
// Returns false if the key already exists (duplicate).
setIfAbsent(key: string, ttlMs: number): Promise<boolean>;
}
interface IRateLimiter {
tryAcquire(key: string, tokens?: number): Promise<boolean>;
}
interface Middleware {
// Returns true to continue the pipeline, false to stop (deferred / dropped / dedup'd).
handle(ctx: PipelineContext, next: () => Promise<void>): Promise<void>;
}
Class Diagram
+-------------------+
| IEventBus |
| (Kafka / HTTP API)|
+---------+---------+
|
v
+--------------------------------+
| NotificationService (facade) |
+---------------+----------------+
|
v
+--------------------------------------------------+
| Pipeline |
| |
| [IdempotencyMw] -> [PreferenceMw] |
| | | |
| v v |
| [QuietHoursMw] -> [FanOutMw] -> [RenderMw] |
| | |
| v |
| [DedupMw] -> [RateLimitMw] |
| | |
| v |
| [EnqueueMw] |
+--------------------------------------------------+
|
+----------------------+-----------------------+
| | |
v v v
+---------------+ +---------------+ +---------------+
| email queue | | sms queue | | push queue |
+-------+-------+ +-------+-------+ +-------+-------+
| | |
v v v
+---------------+ +---------------+ +---------------+
| EmailWorker * | | SmsWorker * | | PushWorker * |
+-------+-------+ +-------+-------+ +-------+-------+
| | |
v v v
+---------------+ +---------------+ +---------------+
| EmailChannel | | SmsChannel | | PushChannel |
| (SES impl) | | (Twilio) | | (FCM/APNS) |
+-------+-------+ +-------+-------+ +-------+-------+
| | |
+--------+-----------+-----------+------------+
| |
v v
+-----------+ +-----------+
| DLQ | | Tracker |
+-----------+ +-----------+
Supporting stores (consulted by middleware & channels):
- DedupStore (Redis SET NX EX)
- RateLimiter (token bucket per channel / per user)
- UserPreferenceStore
- TemplateStore + TemplateRenderer
- RetryPolicy
Class Design
NotificationService — the facade
Owns the pipeline, the event-bus subscription, the worker pools, and the tracker. Its public API is tiny — send(event) and lifecycle hooks (start, shutdown). Callers don't see middleware, queues, or retry policies; those are wired up in the constructor.
NotificationEvent vs Notification
The split matters. NotificationEvent is the caller's contract — one logical thing happened. Notification is the internal unit after fan-out — one thing to deliver to one channel. This distinction is the first thing a reviewer looks for because it forces you to name the fan-out step. If you call both of them "Notification" you've already conflated the before-and-after of the pipeline.
IChannel implementations
Strategy pattern, one per provider. Each one is a thin adapter around an SDK:
- EmailChannel — wraps @aws-sdk/client-ses or @sendgrid/mail. Converts a Notification into the provider's request shape. Maps provider errors into SendResult with a correct status (transient_failure on 5xx / throttle / network, permanent_failure on 4xx like invalid_email, delivered on 2xx).
- SmsChannel — Twilio. Same shape. Carrier-specific failure mapping (e.g. landline is permanent, queued is in-flight-not-delivered).
- PushChannel — FCM for Android, APNS for iOS. Device tokens are per-user-per-device; failures like NotRegistered (user uninstalled) are permanent and should trigger device-token cleanup.
- InAppChannel — writes to an in_app_notifications table and pokes a websocket gateway. No external provider; failure modes are DB or WS.
The IChannel interface exposes none of this provider detail. The worker doesn't know what channel it's using beyond the enum value.
Pipeline — the middleware chain
The pipeline is a Chain of Responsibility. Each stage either continues (next()) or short-circuits. In order:
- IdempotencyMw — computes a primary dedup key from eventId and calls DedupStore.setIfAbsent. If the event was already accepted, short-circuits with status: duplicate_event. This catches upstream replays (Kafka redelivery, retried HTTP POST).
- PreferenceMw — loads UserPreference, filters the channel list per the user's per-category settings. If the user has blocked all channels for this category, short-circuits with status: user_opted_out.
- QuietHoursMw — for non-critical events, if the user's current local time is inside quiet hours, reschedules to the next allowed boundary and short-circuits with status: deferred. Critical bypasses this.
- FanOutMw — expands one event into one Notification per surviving channel. Everything downstream runs per-notification.
- RenderMw — calls ITemplateRenderer for each notification. Render failures (missing variable, template not found) are permanent; they go to DLQ with reason: render_failed and do not proceed.
- DedupMw — second-level dedup, now per-channel. Key = hash(userId + eventType + channel + timeBucket). Catches the case where the same logical notification is re-emitted within a coalescing window. TTL is typically the length of the time bucket plus a margin.
- RateLimitMw — consults per-user and per-channel rate limiters. On hit, re-enqueues the notification with a small delay (not drop — drop would be correctness loss). Critical notifications have a reserved capacity pool.
- EnqueueMw — pushes the notification onto the per-channel durable queue. This is the end of the producer-side pipeline. The worker pool is the consumer side.
Each stage is a separate class implementing Middleware. Adding a new stage (e.g. A/B template selection) = one class, one wiring line in the builder. The core service code does not change.
Workers
One worker pool per channel. Workers pull DeliveryJobs, call the channel's send(), consult IRetryPolicy on failure, and either ack or re-enqueue with a delay. A separate retry pool per channel prevents retries from starving fresh events.
Deduplication keys — how they're shaped
The dedup key design is load-bearing. Two keys, two TTLs, two purposes:
- Event-level key: evt:{eventId}. TTL ~24h. Set once per accepted event. Catches upstream replays.
- Notification-level key: ntf:{userId}:{eventType}:{channel}:{timeBucket} where timeBucket = floor(now / windowMs). TTL of the window plus a margin (DedupMw below uses 2× the window). Catches "same notification emitted twice in a coalescing window" — e.g. two services both trying to notify the user about the same shipment.
The difference matters: event-level dedup protects against infrastructure faults (Kafka redelivery). Notification-level dedup protects against product faults (two features emitting the same user-visible notification).
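The two derivations are small enough to sketch side by side (window length and TTLs are the illustrative values from above; passing nowMs in rather than calling Date.now() inside keeps the notification key deterministic and testable):

```typescript
const WINDOW_MS = 60_000;                  // coalescing window for notification-level dedup
const EVENT_TTL_MS = 24 * 60 * 60 * 1000;  // ~24h: outlives any plausible upstream replay

// Event-level key: one per accepted event. Catches infrastructure replays.
function eventDedupKey(eventId: string): string {
  return `evt:${eventId}`;
}

// Notification-level key: per user x type x channel x time bucket.
// Two producers emitting the same user-visible notification inside one
// window collapse to the same key, so only the first one is delivered.
function notificationDedupKey(
  userId: string,
  eventType: string,
  channel: string,
  nowMs: number,
): string {
  const bucket = Math.floor(nowMs / WINDOW_MS);
  return `ntf:${userId}:${eventType}:${channel}:${bucket}`;
}
```

Pair the event key with EVENT_TTL_MS and the notification key with roughly WINDOW_MS * 2 — the margin covers a replay that straddles the bucket boundary.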
Key Methods
// ---- Context carried through the pipeline ----
interface PipelineContext {
readonly event: NotificationEvent;
channels: Channel[]; // narrowed by PreferenceMw
preferences?: UserPreference; // hydrated by PreferenceMw
notifications: Notification[]; // populated by FanOutMw, RenderMw
status: PipelineStatus; // "accepted" | "duplicate_event" | "user_opted_out" | "deferred" | "rendered" | "enqueued" | "dropped"
reason?: string;
}
type PipelineStatus =
| "accepted"
| "duplicate_event"
| "user_opted_out"
| "deferred"
| "rendered"
| "enqueued"
| "dropped";
// ---- Facade ----
class NotificationService {
constructor(
private readonly pipeline: Pipeline,
private readonly bus: IEventBus,
private readonly workers: Map<Channel, WorkerPool>,
private readonly tracker: NotificationTracker,
) {}
start(): void {
this.bus.subscribe((event) => this.send(event));
for (const pool of this.workers.values()) pool.start();
}
async send(event: NotificationEvent): Promise<void> {
const ctx: PipelineContext = {
event,
channels: [],
notifications: [],
status: "accepted",
};
try {
await this.pipeline.execute(ctx);
this.tracker.record(event.eventId, ctx.status, ctx.reason);
} catch (err) {
this.tracker.record(event.eventId, "dropped", (err as Error).message);
// Bubble up so the caller / bus can decide to redeliver.
// Idempotency middleware ensures the redelivered event is caught.
throw err;
}
}
async shutdown(): Promise<void> {
for (const pool of this.workers.values()) await pool.drain();
}
}
// ---- Pipeline execution ----
class Pipeline {
constructor(private readonly stages: Middleware[]) {}
async execute(ctx: PipelineContext): Promise<void> {
let i = -1;
const dispatch = async (): Promise<void> => {
i++;
if (i >= this.stages.length) return;
await this.stages[i].handle(ctx, dispatch);
};
await dispatch();
}
}
// ---- Middleware implementations ----
class IdempotencyMw implements Middleware {
constructor(private readonly dedup: IDedupStore, private readonly ttlMs = 24 * 60 * 60 * 1000) {}
async handle(ctx: PipelineContext, next: () => Promise<void>): Promise<void> {
const key = `evt:${ctx.event.eventId}`;
const fresh = await this.dedup.setIfAbsent(key, this.ttlMs);
if (!fresh) {
ctx.status = "duplicate_event";
return; // short-circuit
}
await next();
}
}
class PreferenceMw implements Middleware {
constructor(private readonly prefs: IUserPreferenceStore) {}
async handle(ctx: PipelineContext, next: () => Promise<void>): Promise<void> {
const pref = await this.prefs.get(ctx.event.userId);
ctx.preferences = pref;
const category = categoryForType(ctx.event.type);
const mask = pref.categories[category] ?? pref.channels;
ctx.channels = (Object.keys(mask) as Channel[]).filter((c) => {
if (ctx.event.priority === "critical") return true; // bypass for critical
return mask[c] === true && pref.channels[c] === true;
});
if (ctx.channels.length === 0) {
ctx.status = "user_opted_out";
return;
}
await next();
}
}
class QuietHoursMw implements Middleware {
constructor(private readonly scheduler: Scheduler) {}
async handle(ctx: PipelineContext, next: () => Promise<void>): Promise<void> {
if (ctx.event.priority === "critical") return next();
const qh = ctx.preferences!.quietHours;
if (!qh) return next();
const nowInTz = currentTimeInTz(qh.timezone);
if (isInQuietHours(nowInTz, qh)) {
const wakeAt = nextAllowedWindow(nowInTz, qh);
await this.scheduler.schedule(ctx.event, wakeAt);
ctx.status = "deferred";
return;
}
await next();
}
}
class FanOutMw implements Middleware {
async handle(ctx: PipelineContext, next: () => Promise<void>): Promise<void> {
ctx.notifications = ctx.channels.map((channel) => ({
notificationId: uuid(),
eventId: ctx.event.eventId,
userId: ctx.event.userId,
channel,
templateId: templateIdFor(ctx.event.type, channel),
rendered: { body: "", metadata: {} }, // placeholder; filled by RenderMw
priority: ctx.event.priority,
dedupKey: dedupKeyFor(ctx.event, channel),
createdAt: Date.now(),
}));
await next();
}
}
class RenderMw implements Middleware {
constructor(private readonly renderer: ITemplateRenderer, private readonly dlq: DLQ) {}
async handle(ctx: PipelineContext, next: () => Promise<void>): Promise<void> {
const rendered: Notification[] = [];
for (const n of ctx.notifications) {
try {
const payload = this.renderer.render(
n.templateId,
n.channel,
ctx.event.locale ?? ctx.preferences!.locale ?? "en", // event override wins
ctx.event.data,
);
rendered.push({ ...n, rendered: payload });
} catch (err) {
this.dlq.push({
notification: n,
attempt: 0,
reason: "render_failed",
error: (err as Error).message,
});
}
}
ctx.notifications = rendered;
ctx.status = "rendered";
if (rendered.length === 0) return; // all failed
await next();
}
}
class DedupMw implements Middleware {
constructor(private readonly dedup: IDedupStore, private readonly windowMs = 60_000) {}
async handle(ctx: PipelineContext, next: () => Promise<void>): Promise<void> {
const deduped: Notification[] = [];
for (const n of ctx.notifications) {
const fresh = await this.dedup.setIfAbsent(`ntf:${n.dedupKey}`, this.windowMs * 2);
if (fresh) deduped.push(n);
}
ctx.notifications = deduped;
if (deduped.length === 0) {
ctx.status = "duplicate_event";
return;
}
await next();
}
}
class RateLimitMw implements Middleware {
constructor(
private readonly perChannel: Map<Channel, IRateLimiter>,
private readonly perUser: IRateLimiter,
private readonly requeue: (n: Notification, delayMs: number) => void,
) {}
async handle(ctx: PipelineContext, next: () => Promise<void>): Promise<void> {
const allowed: Notification[] = [];
for (const n of ctx.notifications) {
const channelOk = await this.perChannel.get(n.channel)!.tryAcquire(n.channel);
const userOk = await this.perUser.tryAcquire(n.userId);
if (channelOk && userOk) {
allowed.push(n);
} else {
// Re-queue, do not drop. Backoff 1-5s with jitter.
this.requeue(n, 1000 + Math.floor(Math.random() * 4000));
}
}
ctx.notifications = allowed;
if (allowed.length === 0) return;
await next();
}
}
class EnqueueMw implements Middleware {
constructor(private readonly queues: Map<Channel, DeliveryQueue>) {}
async handle(ctx: PipelineContext, next: () => Promise<void>): Promise<void> {
for (const n of ctx.notifications) {
await this.queues.get(n.channel)!.enqueue({ notification: n, attempt: 0 });
}
ctx.status = "enqueued";
await next();
}
}
// ---- Worker ----
class ChannelWorker {
constructor(
private readonly queue: DeliveryQueue,
private readonly channel: IChannel,
private readonly retry: IRetryPolicy,
private readonly dlq: DLQ,
private readonly tracker: NotificationTracker,
) {}
async run(): Promise<void> {
while (this.running) {
const job = await this.queue.dequeue(); // assumed blocking / long-poll dequeue
if (!job) continue;
const result = await this.channel.send(job.notification);
this.tracker.recordSend(job.notification, result);
if (result.status === "delivered") continue;
const decision = this.retry.nextAttempt(job.attempt, result);
if (!decision.retry) {
this.dlq.push({ notification: job.notification, attempt: job.attempt, reason: result.errorCode ?? "unknown", error: result.errorMessage ?? "" });
continue;
}
await this.queue.enqueue(
{ notification: job.notification, attempt: job.attempt + 1 },
decision.delayMs,
);
}
}
private running = true;
stop(): void { this.running = false; }
}
// ---- Retry policy ----
class ExponentialBackoffRetry implements IRetryPolicy {
constructor(private readonly maxAttempts = 5, private readonly baseMs = 1000) {}
nextAttempt(attempt: number, error: SendResult): { retry: boolean; delayMs: number } {
if (error.status === "permanent_failure") return { retry: false, delayMs: 0 };
if (attempt + 1 >= this.maxAttempts) return { retry: false, delayMs: 0 };
const exp = Math.min(this.baseMs * 2 ** attempt, 60_000);
const jitter = Math.floor(Math.random() * exp * 0.3);
return { retry: true, delayMs: exp + jitter };
}
}
// ---- Dedup key derivation ----
function dedupKeyFor(event: NotificationEvent, channel: Channel): string {
// Deterministic key: same inputs -> same key.
// Time bucket (1 min) collapses retries within the window.
const bucket = Math.floor(Date.now() / 60_000);
return `${event.userId}:${event.type}:${channel}:${bucket}`;
}
The worker loop above is the heart of the delivery machinery. Notice what it does not do: it doesn't know about preferences, rate limits, or fan-out. By the time a job reaches the worker, every filtering decision has already been made. The worker's only responsibility is "send, interpret result, ack or retry." That separation is what makes the channel implementations swappable and the workers unit-testable with a mocked IChannel.
Design Decisions & Tradeoffs
Synchronous send vs async queue
The choice isn't really a choice. A synchronous send — "HTTP request in, hit SendGrid, return" — collapses under two kinds of load: bursts (10k events in a second from a marketing batch) and provider slowness (SendGrid averaging 2s today instead of 200ms). Neither is rare. A synchronous design couples your availability to every provider's availability and your throughput to every provider's latency. Async queueing decouples them: producers enqueue at line rate, workers consume at provider rate, and backpressure shows up as queue depth (a metric you can alarm on) instead of 502s to callers.
The cost is observability complexity — the caller no longer knows from their response whether the email was delivered, only that it was accepted. That's the right tradeoff for a notification service: deliveries happen at the end of a chain of decisions, and the caller never wanted synchronous status anyway.
Pull vs push worker dispatch
Two models:
- Pull: workers poll a queue. Scales horizontally, each worker controls its own rate, natural backpressure — if workers are slow, queue grows, metrics fire.
- Push: a dispatcher hands jobs to workers. Requires worker registration, load balancing, health checks. More moving parts.
Pull wins for this problem. Every production notification system I've seen uses a durable queue (SQS, Kafka, Redis Streams) and pull-based workers. Push becomes interesting only when you need strict per-worker routing (e.g. stateful sessions), which this system doesn't have.
At-most-once vs at-least-once + dedup
At-most-once means "if we lose a message, we lose it." That's unacceptable for a 2FA code. So we commit to at-least-once — which means the same notification may be processed more than once — and pair it with deduplication at two levels:
- Event-level dedup (key = eventId) catches upstream replays: the same event arriving twice from Kafka after a rebalance, or a client retrying an HTTP POST with the same idempotency key.
- Notification-level dedup (key = userId + type + channel + timeBucket) catches semantic duplicates: two services both emitting the same user-visible notification within a short window.
At the provider boundary, we can't achieve exactly-once without the provider's cooperation. Where a provider's send API accepts an idempotency key, we pass ours through. Where it doesn't, we accept that a retry after a network-cut-during-response may produce two real deliveries and document that risk.
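The "at-least-once + dedup = effective exactly-once" claim is easiest to see with a toy in-memory stand-in for the IDedupStore (single-process only; the real store is Redis SET NX so the guarantee holds across hosts). InMemoryDedupStore and accept are illustrative names:

```typescript
interface IDedupStore {
  setIfAbsent(key: string, ttlMs: number): Promise<boolean>;
}

// Single-process stand-in for Redis SET NX PX. Not for production:
// no cross-host sharing, and expiry is only checked lazily on access.
class InMemoryDedupStore implements IDedupStore {
  private entries = new Map<string, number>(); // key -> expiresAt (epoch ms)

  async setIfAbsent(key: string, ttlMs: number): Promise<boolean> {
    const now = Date.now();
    const expiresAt = this.entries.get(key);
    if (expiresAt !== undefined && expiresAt > now) return false; // duplicate
    this.entries.set(key, now + ttlMs);
    return true; // first time seen (or previous entry expired)
  }
}

// What IdempotencyMw does, reduced to its essence.
async function accept(
  store: IDedupStore,
  eventId: string,
): Promise<"accepted" | "duplicate_event"> {
  const fresh = await store.setIfAbsent(`evt:${eventId}`, 24 * 60 * 60 * 1000);
  return fresh ? "accepted" : "duplicate_event";
}
```

Replaying the same eventId yields one "accepted" and then "duplicate_event" forever after (until the TTL lapses), which is exactly the property the upstream replay relies on.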
Fan-out location — producer-side vs consumer-side
Two shapes:
- Producer-side fan-out: NotificationService.send() fans out one event into N per-channel notifications up front. Each is enqueued on the corresponding channel queue.
- Consumer-side fan-out: enqueue the original event on a single queue; each channel worker independently decides whether this event applies to it.
Producer-side wins for three reasons. First, preference resolution happens once, not N times. Second, render happens once per channel before enqueue, so the queue carries the rendered payload — failures during render are surfaced immediately rather than hours later after retry delays. Third, per-channel queues give you per-channel queue-depth metrics, which is what you actually want to alarm on.
Consumer-side is tempting if you expect N to be large and sparse (many channels, few applicable per event), but for this problem N ≤ 4.
Template engine choice
Three plausible options:
- Handlebars/Mustache: lightweight, safe (no code execution), good enough for variable interpolation and simple conditionals. The right choice for 95% of template needs.
- Liquid / Jinja: more power (loops, filters). Useful if marketing wants to write templates that iterate over a list of order items inline.
- Code (TypeScript functions returning strings): fastest, most flexible, but every template change is a deploy.
Default to Handlebars. Expose an ITemplateRenderer interface so Jinja or code-based rendering can plug in later. Never use eval.
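A Mustache-style interpolator with missing-variable detection is small enough to sketch. This is a hand-rolled illustration of what the renderer does with {{user.firstName}}-style variables, not a replacement for Handlebars proper (no conditionals, no escaping):

```typescript
// Replaces {{path.to.var}} with values from data; throws on a missing
// variable so RenderMw can treat it as a permanent failure and DLQ it.
function interpolate(template: string, data: Record<string, unknown>): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (_match, path: string) => {
    const value = path
      .split(".")
      .reduce<unknown>(
        (obj, key) => (obj as Record<string, unknown> | undefined)?.[key],
        data,
      );
    if (value === undefined || value === null) {
      throw new Error(`render_failed: missing variable "${path}"`);
    }
    return String(value);
  });
}
```

Throwing on a missing variable (rather than rendering an empty string) is the important design choice: a silently blank "Hi , your order shipped" is worse than a DLQ'd notification with reason: render_failed.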
Per-priority queue partitioning
Critical notifications (2FA, fraud alerts) cannot share a queue with marketing batch sends — a 100k-user marketing blast would starve a 2FA code behind it. Options:
- Separate queues per priority: critical / transactional / marketing, each with its own worker pool and capacity. Simple, explicit, observable.
- Single queue with priority field: requires a queue implementation that supports priority dequeue. Redis sorted sets work; SQS doesn't natively.
Default to separate queues. The cost is three queues per channel instead of one; the benefit is predictable p99 on critical.
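With separate queues, routing becomes a pure function of (channel, priority). A sketch — the dotted queue-naming convention is an assumption, not a standard:

```typescript
type Channel = "email" | "sms" | "push" | "in_app";
type Priority = "critical" | "transactional" | "marketing";

// One queue per (channel, priority): 4 channels x 3 tiers = 12 queues.
// Each tier gets its own worker pool, so a 100k-user marketing blast
// physically cannot sit in front of a 2FA code.
function queueFor(channel: Channel, priority: Priority): string {
  return `notifications.${channel}.${priority}`;
}
```

Because the routing is deterministic, dashboards and alerts can be templated over the same 12 names — per-tier queue depth is exactly the metric that tells you whether the critical p99 is at risk.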
Patterns Used
| Pattern | Where |
|---|---|
| Strategy | IChannel implementations (EmailChannel, SmsChannel, PushChannel, InAppChannel). Also IRetryPolicy. |
| Chain of Responsibility | The Pipeline middleware chain. Each stage decides to continue or short-circuit. |
| Observer / Publish-Subscribe | IEventBus. The service subscribes to events from multiple producers without coupling. |
| Builder | NotificationServiceBuilder wires up middleware chain, channels, workers, and queues — constructor DI would be unreadable with this many collaborators. |
| Template Method | Base ChannelWorker defines the loop (dequeue → send → handle result); subclasses or injected policies customize retry and error mapping. |
| Facade | NotificationService hides the pipeline, queues, workers, and trackers behind send(event). |
| Adapter | Each IChannel impl adapts a vendor SDK (SES, Twilio, FCM) to our internal Notification / SendResult shape. |
| Dead-Letter Queue | DLQ class captures permanent failures out-of-band so the main flow isn't blocked by poison pills. |
Concurrency Considerations
Worker pool per channel
Each channel has its own worker pool with its own size. SES can handle 10k TPS; Twilio typically hundreds per second; FCM is in between. Sizing the pools independently keeps one provider's limits from back-pressuring another.
Pool size is capped by provider rate limit × target utilization, not by CPU. A single EmailWorker is mostly I/O-waiting on SES; dozens of workers fit on one host.
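The sizing rule is Little's law: in-flight requests ≈ throughput × latency, so workers needed is that concurrency divided by a target utilization. A sketch with illustrative numbers (poolSize is a hypothetical helper):

```typescript
// workers ≈ (requests/sec x seconds per request) / utilization.
// I/O-bound senders make this the binding constraint, not CPU.
function poolSize(
  targetTps: number,
  avgProviderLatencyMs: number,
  utilization = 0.7,
): number {
  const concurrent = targetTps * (avgProviderLatencyMs / 1000);
  return Math.ceil(concurrent / utilization);
}

// e.g. 500 SMS/sec at 400ms provider latency, 70% utilization:
// 500 * 0.4 = 200 in flight -> ceil(200 / 0.7) = 286 workers.
```

The 70% headroom is what absorbs provider latency spikes without the queue growing; run at 100% and any slowdown immediately becomes backlog.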
Per-channel rate limiters (see file 08)
A token-bucket limiter sits in RateLimitMw and again inside each worker as a belt-and-suspenders check (workers might retry from DLQ paths that skipped the pipeline). The limiter is backed by Redis when the service runs on multiple hosts — in-process limiters let each host send at the full limit, which collectively exceeds the provider's quota.
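A single-host token bucket matching the IRateLimiter shape looks like this (sketch; a multi-host deployment moves this state into Redis, typically behind a Lua script so refill and acquire are atomic). The injected clock is there purely for testability:

```typescript
class TokenBucketLimiter {
  private buckets = new Map<string, { tokens: number; lastRefillMs: number }>();

  constructor(
    private readonly capacity: number,      // burst size
    private readonly refillPerSec: number,  // sustained rate
    private readonly now: () => number = Date.now,
  ) {}

  async tryAcquire(key: string, tokens = 1): Promise<boolean> {
    const t = this.now();
    const b = this.buckets.get(key) ?? { tokens: this.capacity, lastRefillMs: t };
    // Refill proportionally to elapsed time, capped at capacity.
    const elapsedSec = (t - b.lastRefillMs) / 1000;
    b.tokens = Math.min(this.capacity, b.tokens + elapsedSec * this.refillPerSec);
    b.lastRefillMs = t;
    this.buckets.set(key, b);
    if (b.tokens < tokens) return false; // limited -> caller defers, not drops
    b.tokens -= tokens;
    return true;
  }
}
```

The per-key Map is what lets one instance serve both roles from the design: key by channel for provider quotas, key by userId for anti-spam limits.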
Bounded queues upstream
The per-channel queues are durable (Kafka/SQS), but the in-process transfer between the pipeline and the queue publisher goes through a bounded in-memory queue. Bounded is important — unbounded lets a slow downstream accumulate millions of items in heap until OOM. When the bound is hit, send() blocks, which backpressures the event-bus consumer, which eventually slows Kafka consumption — the backpressure chain bottoms out at the upstream producer.
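The bounded hand-off can be sketched like this (illustrative; a real implementation would have `send()` await free space rather than poll a boolean):

```typescript
// Bounded in-memory queue between the pipeline and the queue publisher.
class BoundedQueue<T> {
  private items: T[] = [];
  constructor(private readonly capacity: number) {}

  // Returns false when the bound is hit. The caller treats that as
  // backpressure and waits, instead of buffering unboundedly in heap.
  tryOffer(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  poll(): T | undefined {
    return this.items.shift();
  }

  get size(): number {
    return this.items.length;
  }
}
```

The `false` return is where the backpressure chain from the paragraph above begins: pipeline waits, event-bus consumer slows, Kafka consumption lags, the producer feels it.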
Dedup across workers
The DedupStore has to be shared across every worker and every pipeline instance. Redis SET NX EX is the canonical primitive. The TTL has to be longer than the maximum retry horizon (otherwise a retry arriving after the dedup entry expired would look like a fresh notification). For a retry policy that tops out at 5 minutes, use a 1-hour TTL and don't overthink it.
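An in-memory stand-in for that dedup primitive, mirroring `SET key NX EX ttl` semantics (Redis replaces this map in production; the class and method names are illustrative):

```typescript
// Dedup store: claim() returns true only for the first caller within the TTL,
// the same contract as Redis `SET key <val> NX EX <ttl>`.
class DedupStore {
  private expiryMs = new Map<string, number>();

  constructor(private readonly ttlMs: number) {}

  claim(key: string, nowMs: number = Date.now()): boolean {
    const expiry = this.expiryMs.get(key);
    if (expiry !== undefined && expiry > nowMs) return false; // duplicate
    this.expiryMs.set(key, nowMs + this.ttlMs);               // NX + EX in one step
    return true;
  }
}
```

Note how the TTL rule from the text falls out of the test below: a retry 5 minutes in lands well inside the 1-hour window and correctly dedupes.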
Ordering within a user
If two notifications arrive for the same user in quick succession, ordering matters for some use cases (a "shipped" followed by "delivered" — the user shouldn't see them reversed). Two mechanisms together:
- Shard by userId: the upstream event bus uses `userId` as the partition key. Within a partition, order is preserved.
- Per-user sequencing inside the worker: if strict ordering is needed within a channel, a worker holds a per-user lock while delivering. Usually unnecessary because provider latencies dominate and users aren't watching both notifications to the millisecond.
Retry pool separate from first-attempt pool
If retries share a pool with new events, a storm of retries during a provider blip will starve new events — the pool fills with backoff-waiting retries and nothing fresh gets through. Split: a primary pool consumes from the main queue, a retry pool consumes from a retry queue. Jobs move between queues via the retry scheduler. During an outage, the retry pool saturates but the primary pool keeps flowing; critical notifications aren't blocked behind yesterday's marketing retries.
Scale & Extensibility
Sharding by user_id
The event bus partitions by userId. This gives two properties: ordered delivery within a user (see above) and natural horizontal scale — double the partitions, double the consumers, double the throughput.
Priority-tier queues
Separate queue per (channel, priority). Critical-email and marketing-email are different queues with different worker pools. Critical gets 80% of the pool capacity and a separate rate-limiter bucket with reserved headroom.
Adding a new channel
A new channel = one new IChannel implementation + one wiring line in the builder. Core pipeline code does not change. The Channel enum widens, preferences default to "on" (or "off" depending on policy), templates for that channel are added to the template store. Onboarding a Slack channel or a WhatsApp channel takes a day, not a sprint.
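What that one-day onboarding looks like in code, assuming illustrative shapes for `IChannel`, `Notification`, and `SendResult` (the text names the types but not their fields), and a synchronous provider call for brevity:

```typescript
// Illustrative shapes — the real adapter would be async.
interface Notification { userId: string; title: string; body: string; }
interface SendResult { ok: boolean; permanent?: boolean; providerCode?: string; }

interface IChannel {
  readonly name: string;
  send(n: Notification): SendResult;
}

class SlackChannel implements IChannel {
  readonly name = "slack";
  // postToSlack stands in for the vendor SDK call; returns an HTTP status.
  constructor(private readonly postToSlack: (text: string) => number) {}

  send(n: Notification): SendResult {
    const status = this.postToSlack(`*${n.title}*\n${n.body}`);
    if (status === 200) return { ok: true };
    // Provider error-code mapping is the only channel-specific logic:
    // 4xx (e.g. revoked webhook) is permanent -> DLQ; 5xx is transient -> retry.
    const permanent = status >= 400 && status < 500;
    return { ok: false, permanent, providerCode: String(status) };
  }
}

// The "one wiring line in the builder" from the text would look like:
// builder.addChannel(new SlackChannel(slackClient.post));
```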
A/B template testing
Extend RenderMw. Given a notification, consult an IExperimentService for which template variant to render. The experiment assignment is stable per (userId, experimentId) so the same user always sees the same variant. Record the variant in the notification metadata so downstream attribution can tie conversions back to variants.
Digest / batching
Some notifications are better delivered as daily digests than immediately. Add a DigestMw early in the pipeline: for events tagged as digest-eligible, instead of proceeding, append to a per-user digest bucket in Redis with a TTL matching the digest window. A separate scheduled job flushes buckets, renders a summary template, and emits a synthetic "digest ready" event that flows through the normal pipeline.
Push vs poll for in-app
In-app notifications are slightly different: the user's client needs to know about them in real time if online, and at fetch time if offline. The InAppChannel writes to the database (poll path) and publishes to a WebSocket gateway (push path). The client's fetch-on-open always works; the push is a latency optimization.
Webhook channel for integrations
Customers sometimes want notifications delivered to their own system via webhook. Implement as a new IChannel: the "provider" is an HTTP POST to a customer-configured URL with HMAC signing. Retry and DLQ reuse the existing pipeline. The only twist is per-customer rate limits to avoid hammering a customer's endpoint during a burst.
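The HMAC signing twist can be sketched with Node's built-in `crypto` module (the payload shape is an assumption; the customer would be given the shared secret and this verification recipe):

```typescript
import { createHmac, timingSafeEqual } from "crypto";

// Sender side: sign the raw body so the customer can authenticate the webhook.
function signWebhookPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Customer side: recompute and compare in constant time to avoid timing leaks.
function verifyWebhookPayload(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(signWebhookPayload(secret, body), "hex");
  const given = Buffer.from(signature, "hex");
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```

Signing the raw body (not a re-serialized object) matters: any whitespace or key-order difference between signing and verifying breaks the signature.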
Scaling to 100M users
At 100M users the bottlenecks move:
- `UserPreferenceStore` reads dominate — cache hot users in a local LRU with a 1-minute TTL. Invalidate on preference writes via the same bus.
- Redis dedup store needs sharding. Partition by `hash(eventId)` across a Redis cluster.
- Template rendering can become CPU-bound for complex templates — consider a dedicated render worker pool fronted by its own queue.
- Observability cardinality explodes; emit pre-aggregated per-(channel, priority, template_id) counters rather than per-notification tracer events.
Edge Cases
- User is in quiet hours → `QuietHoursMw` defers to the next allowed window. Critical bypasses entirely. If the deferred window crosses a timezone change (DST), recompute the window on wake, not on defer.
- Provider 5xx (transient) → `ExponentialBackoffRetry` re-queues with jitter. Max attempts 5; at attempt 6 it hits DLQ with `reason: exhausted_retries`.
- Provider 4xx permanent (invalid email, unregistered device token) → immediate DLQ with `reason: permanent_failure`. A side process consumes DLQ for device-token cleanup and suppression-list updates.
- Duplicate event arrives (Kafka rebalance, client retry) → `IdempotencyMw` catches via the `eventId` dedup key. No fan-out, no work.
- Two services emit the same logical notification → `DedupMw`'s notification-level key catches the second one within the time bucket.
- User unsubscribes mid-flight — event already fanned out and enqueued before the preference update? The channel worker re-checks preferences on dequeue as a final guard. Stale events drop on re-check.
- Rate limit hit → `RateLimitMw` re-queues with a small backoff, does NOT drop. Dropping would cause silent loss.
- Template rendering failure (missing variable, template not found) → the individual notification goes to DLQ with `reason: render_failed` and emits an alert; sibling notifications on other channels continue.
- Scheduled notification for a deleted user → at delivery time, `PreferenceMw` returns `user_not_found` and the notification is dropped with tracking. No partial rendering, no provider call.
- Clock skew across hosts — quiet-hours windows are computed in the user's timezone on whichever host processes the event. Sub-second skew is tolerable; use NTP and don't depend on ms precision.
- DLQ itself fills — the DLQ is a queue with finite retention. Alarms on DLQ ingress rate catch issues before retention expires. A separate process drains DLQ to long-term storage for audit.
- Poison pill (an event that crashes the pipeline deterministically) — middleware wraps each notification in try/catch, so one bad notification takes itself out instead of the batch. The bad notification goes to DLQ with the exception as the reason.
- Quiet-hours loop: user has quiet hours 00:00–23:59 (effectively always quiet). A naive defer would ping-pong forever. Cap defer count per notification (say 3); after that, drop or escalate to a fallback channel.
- Template version rollback: a bad template ships and users get garbage. Mitigation: templates are versioned and rollout is gated (canary to 1%, then 10%, then 100%). Emergency rollback flips the active version pointer so new renders use the prior version. In-flight deliveries carrying the bad rendered payload can't be recalled, but they can be de-scoped by adding a suppression rule for that template version in the final render-time guard.
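The transient-failure retries above can use full-jitter exponential backoff. A sketch with illustrative constants, capped at the 5-minute retry horizon mentioned earlier (randomness injected so the math is testable):

```typescript
// Full-jitter exponential backoff: delay is uniform in [0, min(cap, base * 2^(n-1))).
function backoffMs(
  attempt: number,                     // 1-based attempt number
  baseMs = 1_000,
  capMs = 5 * 60_000,                  // 5-minute retry horizon
  random: () => number = Math.random,  // injectable for tests
): number {
  const exp = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(random() * exp);
}
```

Full jitter (rather than a fixed ±10%) is what actually de-synchronizes a retry storm after a provider blip: every job picks an independent point in the window instead of clustering around the same instant.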
Follow-up Questions
**How do you achieve exactly-once delivery?** You can't, strictly, without the provider cooperating. Express it as at-least-once + idempotency keys (ours in Redis, and forwarded to the provider where supported). Call out where duplicates remain possible (network cut during the provider response) and what product consequence that has.

**How do you handle a full SendGrid outage?** A provider-specific circuit breaker opens on an elevated error rate; the email queue drains into a "parked" state; a fallback provider (a secondary SES account, or a different provider like Mailgun) takes traffic via a feature-flag swap of the `EmailChannel` implementation. Retries accumulate but don't explode because the breaker short-circuits the provider call after the first few errors.

**How do you A/B test templates?** `RenderMw` consults an experiment service; assignment is stable per (userId, experimentId). The variant ID is recorded on the notification and propagated to analytics so conversions can be attributed.

**How do you implement a daily digest?** A `DigestMw` appends digest-eligible notifications to a per-user Redis list with a TTL that matches the digest window. A scheduled job flushes expired buckets, renders a digest template, and injects a synthetic event back into the pipeline.

**How do you scale to 100M users?** Shard everything by userId: event bus partitions, preference-store shards, dedup-store Redis cluster. Cache hot preferences in-process. Separate worker pools per priority. Pre-aggregate observability. Move template rendering to a dedicated pool if it becomes CPU-bound.

**How do quiet hours work across timezones?** The preference record stores the user's timezone explicitly (not a "UTC offset"). Middleware converts `now` to that timezone via a TZ library (Luxon/date-fns-tz). DST transitions are handled by the library; we test the two transition days per year explicitly.

**How do you roll back a bad template?** Templates are versioned. The active version is a pointer (DB row or config key) with a canary rollout (1% → 10% → 100%). Rollback = flip the pointer to the prior version. In-flight notifications carrying the bad rendered body are unrecoverable unless you add a suppression step at the channel worker that checks version liveness before sending (expensive; usually not worth it).
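The quiet-hours timezone conversion described above can even be sketched without Luxon, using the built-in `Intl` API (the function shape and the half-open window semantics are assumptions):

```typescript
// Quiet-hours check in the user's stored IANA timezone.
// Window is [startHour, endHour) in the user's local time.
function isQuietHour(
  utcNow: Date,
  timeZone: string,     // e.g. "America/New_York", stored on the preference record
  startHour: number,    // e.g. 22
  endHour: number,      // e.g. 7
): boolean {
  // Intl handles DST for us — the same UTC instant maps to the right local hour.
  const localHour = Number(
    new Intl.DateTimeFormat("en-US", { timeZone, hour: "numeric", hourCycle: "h23" })
      .format(utcNow),
  );
  return startHour <= endHour
    ? localHour >= startHour && localHour < endHour   // same-day window
    : localHour >= startHour || localHour < endHour;  // window crosses midnight
}
```

The midnight-crossing branch is the one people forget: "22:00 to 07:00" is two half-windows, not one comparison.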
**How do you observe per-channel success rate?** Every `SendResult` goes into `NotificationTracker`, which publishes counters tagged by `channel`, `template_id`, `priority`, `status`. Dashboards compute success rate = delivered / (delivered + permanent_failure) over a rolling 5-minute window. Alarms fire when success rate drops by > X% for Y minutes.

**How do you onboard a new channel in one day?** Implement `IChannel`, wire it into the builder, add a `Channel` enum value, add templates. Existing middleware, queue, rate-limit, retry, and dedup infrastructure already applies. The only novelty is provider error-code mapping.

**How do you guarantee ordering of two notifications for the same user?** Partition the event bus by userId; within a partition the bus delivers in order. If a single channel has multiple workers, either use a per-user mutex (expensive) or accept that provider latencies dominate and strict ordering is a product choice that deserves explicit scoping.

**What happens if the dedup store (Redis) is down?** Graceful degradation: `IdempotencyMw` fails open on a 500ms timeout — accept the notification without dedup, emit a `dedup_unavailable` metric, and rely on downstream idempotency where available. Failing closed (rejecting all notifications when Redis is down) is a worse outcome.

**How do you stop runaway send loops?** Every notification has a hard cap on total attempts across all queues (e.g. 10). After that, DLQ regardless of error class. A separate loop-detector alarms when the same `eventId` appears in retry queues beyond its max attempt count.

**Can you guarantee "send within 5 seconds for critical"?** At p99 yes, at p100 no — a provider can always be slow. Budget: 100ms pipeline + 500ms queue wait + 2s provider = 2.6s target. A separate critical-only queue with pre-warmed workers and no marketing contention makes this achievable. Alarms fire at p95 > 3s.

**How do you handle GDPR delete?** On user delete, purge preferences immediately. Notifications already accepted but not yet sent are harder — they're in durable queues with rendered payloads that may contain PII. A delete-user consumer on the queues filters the user out on dequeue. Notifications already sent are audit-retained per compliance, not "deleted from history."
SDE2 vs SDE3 — How the Bar Rises
| Dimension | SDE2 | SDE3 |
|---|---|---|
| Delivery semantics articulation | Says "we'll retry on failure." | Names at-least-once explicitly, pairs it with idempotency keys at two levels (event and notification), identifies the exact boundary where exactly-once becomes a lie (the provider). |
| Retry / dedup depth | Exponential backoff. Maybe jitter. | Exponential + jitter + max-attempts + permanent-vs-transient classification + separate retry pool + DLQ routing + dedup TTL longer than max retry horizon. Reasons about each. |
| Channel extensibility | IChannel interface. | IChannel plus: channel-specific retry policies, provider error-code mapping tables, onboarding playbook (template, preference default, rate-limit baseline, test plan), reusable pipeline so a new channel is one day of work. |
| Observability | "We'd log failures." | Per-channel success rate, p99 latency by priority, queue-depth dashboards, DLQ ingress alarms, structured spans across pipeline stages, pre-aggregated counters to control cardinality at scale. |
| Rate limiting per channel | A rate limiter exists. | Token bucket with per-channel and per-user keys, backed by Redis for cross-host sharing, with reserved critical-tier capacity, and a re-queue (not drop) policy on limit hit. |
| Failure modes | "We'd retry on errors." | Enumerates: provider 5xx, provider 4xx, network timeout, render failure, preference flip mid-flight, rate-limit hit, DLQ fill, Redis outage, template rollback, clock skew, quiet-hours loop, poison pill. Has a planned response for each. |
| Rollout & rollback strategy | Not mentioned. | Template versions with canary rollout, feature-flagged channel implementations, circuit breakers per provider, suppression lists for emergency stop, DLQ drain tooling, replay-from-DLQ for post-incident recovery. |
| Concurrency reasoning | Worker pool per channel. | Worker pool per (channel, priority), bounded in-process queues for backpressure, separate retry pool so retries don't starve new events, shared dedup store across workers, per-user sharding for in-order delivery, explicit reasoning about where ordering matters and where it doesn't. |
| Scope discipline | Tries to cover everything. | Cuts recurring schedules, rich content, consent management, bounce-handling UX with "out of scope, here's the extension point." Scope is a senior signal. |