
HLD: Uber Notification Service ​

Understanding the Problem ​

What is the Notification Service? ​

A single system that delivers push notifications, SMS, email, and in-app messages triggered by product events (trip state changes, receipts, promos). At Uber's scale, it processes hundreds of millions of notifications per day, spanning critical transactional messages ("your driver is here") and marketing blasts ("try Uber Teen!"). It must respect per-user preferences and quiet hours, deduplicate aggressively, survive provider outages, and never let marketing traffic starve transactional delivery.

Functional Requirements ​

Core (above the line):

  1. Accept notification requests from any internal service.
  2. Deliver via user-preferred channel (push > SMS > email fallback).
  3. Enforce user preferences (opt-outs, quiet hours, marketing caps).
  4. Retries with exponential backoff; dead-letter queue for poison messages.
  5. Effectively exactly-once delivery from the user's perspective (idempotency keys).

Below the line (out of scope):

  • Rich media composition — separate templating service.
  • Content localization — i18n service.
  • Deliverability analytics dashboard — offline pipeline.

Non-Functional Requirements ​

Core:

  1. Scale — 300M notifications/day peak → ~10K/sec steady, 100K/sec burst.
  2. Latency — transactional (trip events): p99 < 2s end-to-end (producer → device). Marketing: within minutes.
  3. Delivery guarantees — at-least-once to provider, dedup at device level.
  4. Availability — 99.99% for transactional; 99.9% for marketing.

Below the line:

  • Deep deliverability analytics — offline.
  • Rich media previews — handled by an asset service, we forward URLs only.

Capacity Estimation ​

Quick-reference numbers:

  • Peak QPS: 100K/sec input; about 3× fan-out to channels → 300K/sec channel-send rate.
  • Storage: 300M/day × 500 B = 150 GB/day of send logs; 90 days hot = 14 TB. Compressed with Cassandra LZ4 → ~5 TB on disk.
  • Provider fan-out: APNs/FCM accept 5–10K RPS per HTTP/2 connection; need connection pools of ~20 per region.
  • Idempotency cache: 10M active keys in Redis at any time. At 64 B per key → 640 MB footprint.
  • Router compute: 100K/sec × 5 ms = ~500 core-seconds/sec = 500 cores peak. Provision 800 cores across consumer groups for headroom and isolation.
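
These estimates reduce to a few constants; a minimal Go sketch for sanity-checking the arithmetic in an interview (the constants mirror the assumptions above, nothing here is measured):

go
package main

import "fmt"

func main() {
    const (
        notifsPerDay   = 300_000_000
        bytesPerRecord = 500
        hotDays        = 90
        peakQPS        = 100_000
        routerMillis   = 5 // Router compute per notification
    )

    storagePerDayGB := float64(notifsPerDay*bytesPerRecord) / 1e9
    hotTB := storagePerDayGB * hotDays / 1e3
    routerCores := peakQPS * routerMillis / 1000

    fmt.Printf("send logs: %.0f GB/day, %.1f TB hot (90d, pre-compression)\n", storagePerDayGB, hotTB)
    fmt.Printf("router compute at peak: %d cores\n", routerCores)
}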

The Set Up ​

Core Entities ​

  • Notification: notifId, userId, channel, template, payload, priority, idempotencyKey, state
  • UserPreference: userId, channelPrefs, quietHoursTz, marketingOptIn, maxPerDay
  • DeliveryAttempt: notifId, attempt, provider, result, ts

API Design ​

Producer services POST a notification request with an idempotency key. Internal services get back a notification ID for tracking.

typescript
POST /v1/notifications
Headers: Idempotency-Key
Body: {
  userId: string;
  template: string;                 // "trip.driver_arrived"
  payload: Record<string, any>;
  priority: "critical" | "transactional" | "marketing";
  channels?: Channel[];             // optional override
}
Response 202: { notifId: string; state: "ACCEPTED" }

GET /v1/notifications/:id
Response: { state, attempts, deliveredAt? }
protobuf
service Notifier {
  rpc Send(SendReq) returns (SendResp);
  rpc GetStatus(StatusReq) returns (NotificationStatus);
}

message SendReq {
  string idempotency_key = 1;
  string user_id = 2;
  string template = 3;
  google.protobuf.Struct payload = 4;
  Priority priority = 5;
  repeated Channel channels = 6;
}
cpp
#include <chrono>
#include <grpcpp/grpcpp.h>
#include "notifier.grpc.pb.h"   // generated from the service definition above

auto client = Notifier::NewStub(channel);

SendReq req;
req.set_idempotency_key(uuid);
req.set_user_id(userId);
// "template" is a C++ keyword, so protoc appends an underscore to the accessor.
req.set_template_("trip.driver_arrived");
req.set_priority(Priority::TRANSACTIONAL);

SendResp resp;
grpc::ClientContext ctx;
ctx.set_deadline(std::chrono::system_clock::now() + std::chrono::milliseconds(200));
grpc::Status status = client->Send(&ctx, req, &resp);
if (!status.ok()) {
  // Safe to retry with the same idempotency key; the server dedups.
}

High-Level Design ​

Producers (Trip, Payments, Marketing)
           |
           v
    +---------------+     +------------+
    | Ingress API   |---->| Dedup Cache|
    | (write-ahead) |     | (Redis,    |
    +---------------+     | key->exp)  |
           |              +------------+
           v
    +---------------+
    | Kafka topics  | priority=critical,transactional,marketing
    | (3 priorities)|
    +---------------+
           |
           v
    +---------------+    +-------------+
    | Router svc    |--->| Prefs cache |
    | (channel pick |<---| (Redis)     |
    |  per user)    |    +-------------+
    +---------------+
           |
           v
     +--------------------+-----+-------+--------+
     | Push svc (APNs/FCM)| SMS | Email | In-app |
     +--------------------+-----+-------+--------+
            |      |        |       |
            v      v        v       v
    +---------------+
    | DLQ + retry   |
    | scheduler     |
    +---------------+

End-to-End Flow ​

  1. A producer (e.g., TripService) calls POST /v1/notifications with an idempotency key.
  2. Ingress API performs write-ahead dedup against Redis (SET idempKey 1 NX EX 86400). On hit, return the existing notifId.
  3. Ingress publishes to the Kafka topic matching the priority (notif.critical, notif.transactional, notif.marketing).
  4. Router consumers per priority pull, check user preferences (Redis cache of UserPreference), pick the best channel, apply rate limits, and produce a per-channel message to push.send, sms.send, email.send, or inapp.send.
  5. Channel workers hand off to provider SDKs. Success updates status in the Notification store; failure routes to the retry/DLQ topic.
  6. Dead-letter analyzer pages on-call when poison rate exceeds thresholds.
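
A minimal Go sketch of steps 2–3, assuming go-redis and kafka-go clients; the key format, the simplification of storing notifId directly as the dedup value, and all names are illustrative, not Uber's actual implementation:

go
package ingress

import (
    "context"
    "time"

    "github.com/redis/go-redis/v9"
    "github.com/segmentio/kafka-go"
)

// topicFor maps request priority onto its isolated Kafka topic.
func topicFor(priority string) string {
    return map[string]string{
        "critical":      "notif.critical",
        "transactional": "notif.transactional",
        "marketing":     "notif.marketing",
    }[priority]
}

// Accept performs the write-ahead dedup (step 2) and publishes to the priority
// topic (step 3). It returns the notifId the caller should track; on a
// duplicate request that is the notifId stored by the first attempt.
func Accept(ctx context.Context, rdb *redis.Client, w *kafka.Writer,
    userID, idempKey, notifID, priority string, payload []byte) (string, error) {

    // Store notifId as the dedup value so retries can be answered directly.
    fresh, err := rdb.SetNX(ctx, "idemp:"+userID+":"+idempKey, notifID, 24*time.Hour).Result()
    if err != nil {
        return "", err
    }
    if !fresh {
        return rdb.Get(ctx, "idemp:"+userID+":"+idempKey).Result()
    }

    err = w.WriteMessages(ctx, kafka.Message{
        Topic: topicFor(priority),
        Key:   []byte(userID), // partition by user so a user's sends stay ordered
        Value: payload,
    })
    return notifID, err
}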

Why These Components ​

  • Priority-separated Kafka topics give strict isolation between marketing blasts and "your driver is here".
  • Redis dedup on ingress is the first line of defense against duplicate sends from buggy retry loops.
  • Router is the policy engine: all per-user rules live here and are easy to test.
  • DLQ + retry scheduler isolates poison messages from healthy traffic.

Data Model Detail ​

  • Notification record (Cassandra): Partition key (userId, dayBucket), clustering key notifId. Columns: template, channel, state, attempts, payload, createdAt, deliveredAt. TTL 90 days. Partition bucketing prevents per-user wide rows.
  • Idempotency (Redis): Key idemp:{userId}:{key}, value notifId, TTL 24h.
  • UserPreference (Redis + Cassandra): Redis for hot-path reads, Cassandra for persistence. Invalidation via pub/sub.
  • Kafka topics: notif.critical (12 partitions, RF=3), notif.transactional (48 partitions), notif.marketing (192 partitions). Marketing is more partitioned to absorb blast bursts.
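
One plausible shape for the notification table, sketched as CQL issued through gocql; column names and types are illustrative and simply mirror the record described above:

go
package store

import "github.com/gocql/gocql"

// Composite partition key (user_id, day_bucket) keeps per-user partitions
// bounded, notif_id clusters rows within a day, and the table-level TTL
// ages data out after 90 days.
const createNotifications = `
CREATE TABLE IF NOT EXISTS notifications (
  user_id      text,
  day_bucket   int,
  notif_id     timeuuid,
  template     text,
  channel      text,
  state        text,
  attempts     int,
  payload      blob,
  created_at   timestamp,
  delivered_at timestamp,
  PRIMARY KEY ((user_id, day_bucket), notif_id)
) WITH default_time_to_live = 7776000  -- 90 days
  AND compression = {'class': 'LZ4Compressor'};`

func Migrate(session *gocql.Session) error {
    return session.Query(createNotifications).Exec()
}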

Capacity Walkthrough ​

  • Peak 100K/sec → 250 ingress producer instances at ~400/sec each, well within batch-friendly Kafka rates.
  • Router: 50 consumers on transactional topic × 2K events/sec/consumer = 100K/sec.
  • Provider fan-out: APNs/FCM accept 5–10K RPS per HTTP/2 connection; 20 parallel connections per region → 100–200K RPS capacity.
  • Cassandra writes: 300M notifications/day / 86400 s = ~3.5K writes/sec average, 30K peak. Small cluster (6 nodes).

Potential Deep Dives ​

1) How do we guarantee idempotency under retries? ​

Mobile networks and unreliable producers will retry the same request. You must never double-notify.

Bad Solution — No dedup ​

  • Approach: Client resends on timeout; server creates two notifications.
  • Challenges: User gets two pings. Cannot be taken back.

Good Solution — Server-side content hash ​

  • Approach: Server hashes (userId, template, payloadHash) with 24h dedup.
  • Challenges: Collapses legitimate resends of identical content within the 24h window (two identical promo pings days apart are fine, but two within 24h would be wrongly deduplicated).

Great Solution — Client-provided idempotency key + Lua atomic dedup ​

  • Approach:
    • Producer supplies Idempotency-Key. Ingress runs a Lua script doing SET key 1 NX EX 86400 plus an inner HSET that stores the returned notifId. Retries return the stored notifId directly.
    • At provider layer, each device message carries a collapse key so APNs/FCM deduplicate at delivery.
  • Challenges: Redis must be HA — if the primary fails, failover should preserve the dedup state (use AOF + replica reads under circuit breaker). Key cardinality scales with traffic; the 24h TTL keeps it bounded.
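
A hedged Go sketch of the atomic dedup with go-redis; for brevity the script stores the notifId as the key's value instead of the separate HSET described above, which gives the same retry behavior:

go
package dedup

import (
    "context"
    "time"

    "github.com/redis/go-redis/v9"
)

// If the key is new, store notifId with a 24h TTL and report "new".
// If it already exists, return the notifId recorded by the first request.
var dedupScript = redis.NewScript(`
local existing = redis.call('GET', KEYS[1])
if existing then
  return {0, existing}
end
redis.call('SET', KEYS[1], ARGV[1], 'EX', ARGV[2])
return {1, ARGV[1]}
`)

// Reserve returns the canonical notifId and whether this request was the first.
// Retries with the same idempotency key always see the first attempt's notifId.
func Reserve(ctx context.Context, rdb *redis.Client, userID, key, notifID string) (string, bool, error) {
    res, err := dedupScript.Run(ctx, rdb,
        []string{"idemp:" + userID + ":" + key},
        notifID, int(24*time.Hour/time.Second)).Result()
    if err != nil {
        return "", false, err
    }
    vals := res.([]interface{})
    return vals[1].(string), vals[0].(int64) == 1, nil
}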

1.5) How do we ensure a notification actually reaches the user? ​

A "delivered" status from the provider only means the provider received it, not the user.

Good Solution — Provider callbacks ​

  • Approach: Consume APNs feedback / FCM delivery reports. Mark as DELIVERED when received.
  • Challenges: APNs feedback is only for failures (invalid tokens). No reliable per-message delivery acknowledgment from APNs.

Great Solution — App-side open/view receipts + inference model ​

  • Approach:
    • Client SDK reports notification_opened / notification_viewed when the push is surfaced.
    • Server correlates via the collapse key / notifId.
    • For users who never open, infer reach via next app-session timing; feed into a per-user delivery propensity model.
  • Challenges: Privacy — some jurisdictions require explicit consent for open tracking. Configurable per region.

2) How do we keep priorities isolated? ​

During marketing blasts, transactional "your driver is here" cannot be delayed behind 5M Taco Tuesday messages.

Bad Solution — Single topic, shared consumer pool ​

  • Approach: One Kafka topic, shared consumer group.
  • Challenges: During marketing blasts, transactional sits behind millions of marketing sends.

Good Solution — Priority topics with weighted consumer pool ​

  • Approach: Separate topics per priority. Higher priority gets more consumers.
  • Challenges: Works until a priority's traffic spikes and burns the whole pool.

Great Solution — Per-priority topics with independent consumer groups and SLO-based autoscaling ​

  • Approach:
    • Critical: dedicated consumers, isolated producer quota, strict latency SLO.
    • Marketing: lower consumer count, eligible to be throttled.
    • Rate limit at per-user layer: transactional bypasses limits; marketing uses a token bucket (Redis + Lua) per user (5/day default).
    • Autoscaler reads per-topic consumer lag and SLO, not CPU.
  • Challenges: Producer abuse (e.g., a team labels everything "critical"). Enforce a quota per producer team with automated capacity reviews.
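
A rough Go sketch of the lag/SLO-driven scaling decision; the minimum-consumer floor and the +50% headroom rule are illustrative, not tuned values:

go
package autoscale

import "time"

// TopicSLO captures the per-priority targets the autoscaler reads.
type TopicSLO struct {
    MaxLagSeconds float64       // how far behind consumers may fall
    TargetP99     time.Duration // end-to-end latency target
}

// DesiredConsumers scales on observed consumer lag and latency, not CPU.
// lagMessages is the current consumer-group lag; ratePerConsumer is msgs/sec
// a single consumer sustains.
func DesiredConsumers(current int, lagMessages, ratePerConsumer float64,
    observedP99 time.Duration, slo TopicSLO) int {

    // Consumers needed to burn down the backlog within the lag budget.
    needed := int(lagMessages/(ratePerConsumer*slo.MaxLagSeconds)) + 1

    // If latency breaches the SLO, add headroom even when lag looks fine.
    if observedP99 > slo.TargetP99 {
        needed = max(needed, current+current/2) // +50%
    }
    return max(needed, 1)
}

func max(a, b int) int {
    if a > b {
        return a
    }
    return b
}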

3) What are the retry and DLQ semantics?

Transient failures (network blip, provider hiccup) should be retried; permanent failures should move out of the way.

Bad Solution — In-process retry loop ​

  • Approach: Worker retries in memory.
  • Challenges: Lost on worker crash; blocks queue for retry duration.

Good Solution — Retry topic with exponential backoff ​

  • Approach: Failed messages go to retry topic; a scheduler consumer retries with exponential backoff.
  • Challenges: Retry topic grows without bound during a provider-wide outage.

Great Solution — Tiered retry with jitter, DLQ, and circuit breakers ​

  • Approach:
    • Attempt 1 inline; attempts 2–4 scheduled at 30s, 2m, 10m with 50% jitter.
    • After 5 attempts, move to DLQ. DLQ consumer has a circuit breaker per provider — if APNs error rate > 10% over 1 min, pause and alert; auto-resume on recovery.
    • Poison messages (malformed templates, invalid user IDs) go to a quarantine topic with a human workflow.
  • Challenges: Time-sensitive notifications may expire before retry completes. Mark notifications with expiresAt; skip retry if expired and record drop.
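
The ladder reduces to a small scheduling helper; a Go sketch with the same tiers, 50% jitter, and expiry check (attempt numbering and function names are illustrative):

go
package retry

import (
    "math/rand"
    "time"
)

// Backoff tiers for the scheduled attempts, matching the schedule above.
var tiers = []time.Duration{30 * time.Second, 2 * time.Minute, 10 * time.Minute}

// NextAttempt returns when to retry, or ok=false when the message should go
// to the DLQ (past the last tier) or be dropped (already expired).
func NextAttempt(attempt int, now, expiresAt time.Time) (at time.Time, ok bool) {
    if attempt < 2 || attempt > len(tiers)+1 {
        return time.Time{}, false // attempt 1 is inline; beyond the last tier -> DLQ
    }
    base := tiers[attempt-2]
    // 50% jitter: uniform in [0.5*base, 1.5*base) to avoid retry synchronization.
    jittered := time.Duration(float64(base) * (0.5 + rand.Float64()))
    at = now.Add(jittered)
    if !expiresAt.IsZero() && at.After(expiresAt) {
        return time.Time{}, false // time-sensitive: record the drop instead
    }
    return at, true
}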

4) How do we fan out a trip event into multiple notifications? ​

A trip.completed event triggers a push, an email receipt, and an in-app rating prompt. You do not want a caller to enumerate these.

Bad Solution — Caller sends multiple requests ​

  • Approach: TripService calls Send three times.
  • Challenges: Tightly couples TripService to notification templates. Any schema change needs every producer to update.

Good Solution — Template expansion in Router ​

  • Approach: Producer publishes with a single template key; Router consults a template registry and expands to per-channel messages.
  • Challenges: Registry becomes critical infrastructure. Must be versioned.

Great Solution — Event-driven template expansion with at-most-once per channel ​

  • Approach:
    • For trip.completed, Router consults template registry: [push:receipt_summary, email:receipt_detail, inapp:rate_trip].
    • Each channel-send is a separate Kafka message with its own idempotency key, so retries on one channel don't duplicate another.
    • Templates live in a versioned registry with staged rollouts.
  • Challenges: Schema drift between template and caller payload. Enforce schema validation at the template registry with breaking-change gating.
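
A minimal Go sketch of the expansion step, with the trip.completed registry entry hard-coded for illustration; real entries would come from the versioned, schema-validated registry:

go
package router

import "fmt"

// ChannelSend is one per-channel message produced by the Router.
type ChannelSend struct {
    Channel        string // "push", "email", "inapp"
    Template       string // e.g. "receipt_summary"
    IdempotencyKey string // derived, so one channel's retry can't duplicate another
}

// registry mirrors the trip.completed example above.
var registry = map[string][]ChannelSend{
    "trip.completed": {
        {Channel: "push", Template: "receipt_summary"},
        {Channel: "email", Template: "receipt_detail"},
        {Channel: "inapp", Template: "rate_trip"},
    },
}

// Expand turns one event into its per-channel sends, each with its own key
// (e.g. "trip-completed-{tripId}:push").
func Expand(event, eventIdempKey string) []ChannelSend {
    var out []ChannelSend
    for _, entry := range registry[event] {
        entry.IdempotencyKey = fmt.Sprintf("%s:%s", eventIdempKey, entry.Channel)
        out = append(out, entry)
    }
    return out
}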

5) How do we enforce quiet hours and per-user rate limits? ​

Hammering a user at 3am burns trust and invites regulatory pain.

Good Solution — Static checks in Router ​

  • Approach: Router consults UserPreference. Reject marketing outside quiet hours.
  • Challenges: Static rules miss edge cases (e.g., a user in transit across time zones; critical notifications that must override quiet hours).

Great Solution — Policy engine with priority override + per-user token bucket ​

  • Approach:
    • Policy engine evaluates (priority, quietHours, userPrefs, rateBuckets).
    • Critical notifications override quiet hours; transactional respect soft preference; marketing respect hard preference.
    • Per-user token bucket in Redis (5 tokens per 24h for marketing by default, unlimited for transactional). Lua enforces the atomic decrement.
  • Challenges: Time-zone inference; use user's last-known location's time zone with a fallback to profile setting.
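
A Go sketch of the evaluation order; quiet-hour handling is simplified, the token-bucket call is left as a hook (the atomic decrement lives in Redis/Lua), and all names are illustrative:

go
package policy

import "time"

type Priority int

const (
    Marketing Priority = iota
    Transactional
    Critical
)

type UserPrefs struct {
    QuietStart, QuietEnd int            // local hours, e.g. 22 and 8
    Location             *time.Location // last-known location's TZ, profile TZ as fallback
    MarketingOptIn       bool
}

// Allow applies the order described above: critical overrides quiet hours,
// transactional respects them softly (deferred, not dropped), marketing
// respects them hard and must also win a token from the per-user daily bucket.
func Allow(p Priority, prefs UserPrefs, now time.Time,
    takeMarketingToken func() bool) (send, deferred bool) {

    if p == Critical {
        return true, false
    }
    hour := now.In(prefs.Location).Hour()
    var quiet bool
    if prefs.QuietStart > prefs.QuietEnd { // window crosses midnight, e.g. 22 -> 8
        quiet = hour >= prefs.QuietStart || hour < prefs.QuietEnd
    } else {
        quiet = hour >= prefs.QuietStart && hour < prefs.QuietEnd
    }
    if p == Transactional {
        return !quiet, quiet // hold until quiet hours end
    }
    // Marketing: hard preference plus the Redis+Lua bucket (5/day default).
    if !prefs.MarketingOptIn || quiet {
        return false, false
    }
    return takeMarketingToken(), false
}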

5.5) How do we batch without starving low-volume users? ​

Batching to APNs or SMTP saves cost, but a single user's notification shouldn't wait for a batch to fill.

Good Solution — Time-bounded batches ​

  • Approach: Each channel worker flushes batches when size > 100 or time > 500ms.
  • Challenges: 500ms added latency on low-volume paths.

Great Solution — Per-priority batch settings with dynamic tuning ​

  • Approach:
    • Critical: batch size 10, time 20ms — tight latency.
    • Transactional: batch size 50, time 100ms — balanced.
    • Marketing: batch size 500, time 1s — throughput-optimized.
    • Control plane tunes per-priority based on observed p99 vs SLO.
  • Challenges: Requires careful observability; per-priority rollout.
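
A size-or-time batcher implementing the settings above, sketched in Go; the defaults mirror the per-priority values, and a real worker would also bound in-flight batches:

go
package batcher

import "time"

// Per-priority settings from the list above; a control plane would tune these.
type Config struct {
    MaxSize int
    MaxWait time.Duration
}

var Defaults = map[string]Config{
    "critical":      {MaxSize: 10, MaxWait: 20 * time.Millisecond},
    "transactional": {MaxSize: 50, MaxWait: 100 * time.Millisecond},
    "marketing":     {MaxSize: 500, MaxWait: time.Second},
}

// Run accumulates messages and flushes when the batch is full or the oldest
// message has waited MaxWait, so a lone notification never waits long.
func Run(in <-chan []byte, flush func(batch [][]byte), cfg Config) {
    var batch [][]byte
    timer := time.NewTimer(cfg.MaxWait)
    if !timer.Stop() {
        <-timer.C
    }
    for {
        select {
        case msg, open := <-in:
            if !open {
                if len(batch) > 0 {
                    flush(batch)
                }
                return
            }
            if len(batch) == 0 {
                timer.Reset(cfg.MaxWait) // clock starts at the first message
            }
            batch = append(batch, msg)
            if len(batch) >= cfg.MaxSize {
                flush(batch)
                batch = nil
                if !timer.Stop() {
                    <-timer.C
                }
            }
        case <-timer.C:
            if len(batch) > 0 {
                flush(batch)
                batch = nil
            }
        }
    }
}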

6) How do we survive provider outages? ​

APNs or FCM going down is a when-not-if event.

Great Solution — Provider-aware circuit breakers + channel failover for critical ​

  • Approach: Each provider has its own circuit breaker metrics in Router. If APNs error rate breaches 10% for 1 minute, route critical pushes to SMS fallback; non-critical gets queued for later. On APNs recovery, drain queued work at a rate that respects the post-outage capacity.
  • Challenges: SMS costs 50× more than push. Budget and rate-limit SMS fallback so one long APNs outage doesn't blow the quarterly budget.
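
A per-provider breaker matching the behavior described above, sketched in Go; the minimum sample size and cool-down are illustrative additions:

go
package breaker

import (
    "sync"
    "time"
)

type State int

const (
    Closed   State = iota // normal routing
    Open                  // provider unhealthy: critical -> SMS fallback, rest queued
    HalfOpen              // probing after the cool-down
)

type Breaker struct {
    mu             sync.Mutex
    state          State
    windowStart    time.Time
    total, failed  int
    openedAt       time.Time
    ErrorThreshold float64       // e.g. 0.10
    Window         time.Duration // e.g. time.Minute
    CoolDown       time.Duration // e.g. 30 * time.Second
}

// Record feeds one send result in and returns the resulting state.
func (b *Breaker) Record(now time.Time, err error) State {
    b.mu.Lock()
    defer b.mu.Unlock()

    if b.state == Open && now.Sub(b.openedAt) >= b.CoolDown {
        b.state = HalfOpen // allow a trickle of probes
    }
    if now.Sub(b.windowStart) >= b.Window {
        b.windowStart, b.total, b.failed = now, 0, 0 // roll the error window
    }
    b.total++
    if err != nil {
        b.failed++
    }
    switch {
    case b.state == HalfOpen && err == nil:
        b.state = Closed // probe succeeded: resume normal routing
    case b.state == HalfOpen && err != nil:
        b.state, b.openedAt = Open, now // probe failed: stay open
    case b.total >= 20 && float64(b.failed)/float64(b.total) > b.ErrorThreshold:
        if b.state != Open { // minimum sample size before tripping
            b.state, b.openedAt = Open, now
        }
    }
    return b.state
}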

Rapid-Fire Q&A Anticipations ​

  • "How do you test APNs outages?" Chaos experiments inject synthetic 5xx from a stubbed APNs client; verify circuit breaker trips and fallback path kicks in.
  • "What's the typical end-to-end latency breakdown?" Producer → Kafka: ~20ms. Kafka lag: ~50ms. Router compute: ~5ms. Channel worker → APNs: ~100ms. Device receipt: ~500ms–2s.
  • "How do you handle unsubscribed or invalid tokens?" APNs feedback service returns invalid tokens; background job cleans them from the UserPreference store.
  • "What about DNS-level failures to APNs?" Cached A records with short TTL; Layer-7 fallback to TCP connect timeout. Degrade gracefully.
  • "What's the escalation if a critical push doesn't deliver?" Retry ladder with decreasing channel cost (push → in-app → SMS → email). After final failure, log to ops dashboard.

Alternatives Considered ​

  • Twilio / SendGrid as platform vs in-house: Third-party platforms simplify ops but lock you in. At Uber scale, in-house with abstractions over multiple providers is standard.
  • Kafka vs SQS: SQS is simpler; Kafka gives per-partition ordering and higher throughput. At 100K/sec, Kafka.
  • Per-user consumer vs shared consumer: Per-user doesn't scale; shared consumer with partition-by-userId is standard.
  • Real-time SMS fallback vs scheduled: Real-time is costly; scheduled batches save cost but risk delivery delay. Critical paths need real-time.
  • Single template store vs per-service: Centralized template store enforces consistent rules (quiet hours, localization) and simplifies audits.

Frequently Asked Follow-ups ​

  • "How do you support localization?" — Template registry has per-locale variants. Router picks locale from user profile; i18n service handles string interpolation.
  • "A marketing manager wants to send a one-time blast to 50M users. How?" — Bulk-ingest endpoint that accepts user ID lists + template. Bulk producer writes to the marketing topic at throttled rate (say 50K/sec). Router handles it like any other marketing event.
  • "What about PII in notifications?" — Payload encrypted at rest; logs redact known PII fields; audit trail for who sent what.
  • "How do you measure delivery?" — Provider callbacks (APNs feedback service, email webhooks) update delivery status. We also trust "opened" events from the app.
  • "Cross-region?" — Each region has its own Kafka + Router. A global user-preference service replicates async. Producer publishes in any region; Router in the user's home region picks up for delivery.

Visual Aids to Draw ​

  • Priority-partitioned topology: three Kafka topics with different consumer pool sizes.
  • Template fan-out: single trip event → 3 channel-specific messages with separate idempotency keys.
  • Circuit breaker state machine: closed → half-open → open transitions on error thresholds.
  • DLQ + retry ladder showing delays and jitter.
  • Provider abstraction layer with Twilio, SendGrid, APNs, FCM as pluggable backends.

What's Expected at Each Level ​

Mid-level (L4) ​

  • Queue-based architecture, retry with DLQ, provider abstraction.
  • Idempotency at ingress.
  • Typically misses priority isolation and the provider-failover story.

Senior (L5 / L5A) ​

  • Priority isolation, idempotency at ingress and provider, per-user rate limiting.
  • Fan-out template expansion with per-channel idempotency.
  • Back-of-envelope on Kafka partitioning and Redis dedup cache size.

Staff+ (L6) ​

  • Cross-region independence, provider failover (APNs outage → SMS for critical).
  • End-to-end observability of delivery rate per template per region.
  • Budget controls on fallback channels.
  • Deprecation story: how a template is versioned, rolled out, and removed without breaking in-flight notifications.

Common Pitfalls ​

  • Single Kafka topic for all priorities. Marketing will starve transactional sooner or later.
  • Treating provider APIs as reliable. Always assume APNs/FCM will go down for an hour at some point.
  • Missing idempotency. Retries will happen; without dedup you ship double notifications.
  • Ignoring quiet hours. Some jurisdictions have legal quiet-hour rules. Don't skip this.
  • Unbounded retry. Without a DLQ + circuit breakers, one bad template can drain the retry pool.

Walkthrough: Interview Dialogue Example ​

Interviewer: "Trip completes. Walk through all the notifications that fire."

You should answer:

  1. TripService publishes trip.completed event with tripId and userId.
  2. Notification Ingress (subscribed to trip.events Kafka) converts to notification request with idempotency key trip-completed-{tripId}.
  3. Redis dedup passes (first time). Ingress publishes to notif.transactional topic.
  4. Router consumer pulls, expands via template registry into three per-channel messages: push:receipt_summary, email:receipt_detail, inapp:rate_trip.
  5. Each per-channel message has its own idempotency key like trip-completed-{tripId}:push. Produced to the appropriate channel topic.
  6. Channel workers: Push worker batches HTTP/2 to APNs (~50ms); Email worker hands off to SMTP relay (~500ms); In-app writes to user's inbox table.
  7. All three writes hit Cassandra; state transitions to SENT (push acknowledged by APNs; DELIVERED waits for the app-side receipt) or QUEUED (email/in-app).

Total visible to user: push within 2s; email within 30s; in-app immediately on next app open.

What If They Pivot Mid-Interview? ​

  • "Support scheduled notifications." — Cadence workflow with workflow.Sleep(duration) delivers reliably even across weeks. Great fit.
  • "Design per-user digest emails." — Nightly Flink job aggregates events per user, produces a summary payload, pushes to Notification Service with template=daily_digest.
  • "Add A/B testing." — Experimentation service assigns users to variants. Template registry resolves template:v1 vs template:v2 at Router time.
  • "How do we measure ROI of a marketing campaign?" — Join notification send logs with downstream conversion events in the data warehouse. Attribution logic lives in analytics, not the send path.

Reliability and Observability ​

  • SLO: 99.99% for transactional p99 < 2s end-to-end; 99.9% for marketing (minutes tolerance).
  • Failure modes:
    • APNs/FCM outage → provider circuit breaker opens; critical pushes fall back to SMS; non-critical queued.
    • Kafka priority-critical lag → alert immediately; drain marketing consumers to free resources.
    • Redis dedup failure → failover to DB unique constraint as idempotency backstop; slower but correct.
  • Deployment: Template rollouts are versioned; template registry enforces backward-compatible schema changes. Canary new templates to 1% of users first.
  • Monitoring: Per-provider error rate, per-template CTR, per-priority end-to-end latency. Alert when critical path p99 > 5s.
  • Runbook: If SMS spend spikes beyond budget in a region, automatically disable SMS fallback and alert on-call.

Uber-Specific Notes ​

  • Uber's notification platform uses Kafka extensively. Trip events are published to a canonical trip.events topic consumed by notifications, analytics, and fraud.
  • Cadence workflows often orchestrate multi-step notification flows (e.g., "remind driver to go online in 30 min" — scheduled activity with a durable timer).
  • M3 dashboards show delivery_rate_by_template_by_region. Jaeger traces sampled 0.01% of notifications for debugging.
  • Call out that Uber's quiet-hours enforcement has legal weight in some markets (Brazil, France) — mention it briefly.
  • Interviewers sometimes probe on push vs in-app consistency: a rider on an active trip sees an in-app banner AND a push; both should reference the same event. Use the same notifId as the in-app message key.
  • For a stretch staff+ answer, sketch the delivery-optimization loop: if a user opens a push within 30s, we've learned their preferred channel; learn and adapt per-user with a contextual bandit.

Scaling Milestones ​

  • Startup (1M users): Single Rails/Go service → APNs/FCM directly. No retries.
  • Growing (10M users): Queue-backed (RabbitMQ or SQS) + worker pool + basic retry.
  • Big (100M users): Kafka, Router, template registry, priority topics, idempotency.
  • Global (600M users): Per-priority autoscaling, provider failover, contextual delivery model, multi-region Router.

Summary Checklist ​

  • [ ] Idempotency with client-provided key + Lua dedup + DB unique constraint.
  • [ ] Priority-separated Kafka topics.
  • [ ] Template registry + per-channel fan-out.
  • [ ] DLQ + tiered retry + circuit breakers.
  • [ ] Per-user quiet hours + rate limits.
  • [ ] Provider failover (APNs → SMS for critical).
  • [ ] Budget controls on fallback channels.
  • [ ] Observability: per-template delivery rate, end-to-end latency.

Key Numbers to Memorize ​

  • Notifications/day (peak): 300M
  • Peak QPS: 100K
  • Steady QPS: 10K
  • Storage/day: 150 GB
  • Hot retention: 90 days (~14 TB)
  • Transactional p99: < 2s end-to-end
  • Marketing delivery SLA: within minutes
  • Idempotency TTL: 24h
  • APNs/FCM RPS per connection: 5–10K
  • Default marketing cap per user: 5/day

One-Liner You Should Remember ​

"Priority-separated Kafka topics, Router expands templates into per-channel sends each with idempotency keys, Redis dedup at ingress + DB unique constraint as ground truth, circuit breakers per provider, DLQ with tiered retry. 100K peak/sec, transactional p99 < 2s, 300M/day."
