HLD: Uber Payments / Wallet ​

Understanding the Problem ​

What is the Payments / Wallet Service? ​

The payments service debits the rider's payment method, credits the driver's earnings, handles refunds, applies promo credits, and must stay consistent across service failures. It is one of the most serious correctness problems at Uber because money is involved: no double-charges, no lost credits, no negative balances, full auditability, and regulatory compliance. Interviewers use this problem to see if you can reason about sagas, idempotency, double-entry ledgers, and external PSP (Payment Service Provider) semantics.

Functional Requirements ​

Core (above the line):

  1. Authorize a payment at trip start (card hold); capture at trip end.
  2. Debit rider wallet / promo credits before external card charge.
  3. Credit driver earnings with scheduled payout.
  4. Handle refunds (full and partial).
  5. Idempotent APIs (never double-charge on client retry).

Below the line (out of scope):

  • KYC / onboarding — compliance service.
  • Tax computation — tax service.
  • Fraud scoring — fraud service (we consume a decision).
  • Chargebacks — downstream disputes workflow.

Non-Functional Requirements ​

Core:

  1. Scale — 25M trips/day → 25M capture events/day, peak 1K/sec.
  2. Latency — authorize at trip request p99 < 500ms; capture at trip end within 30s.
  3. Consistency — strong consistency on the ledger. Zero tolerance for lost/duplicate entries.
  4. Availability — 99.99% for auth; 99.999% durability.
  5. Compliance — PCI-DSS, SOX audit trails, GDPR.

Below the line:

  • Deep analytics — offline via data warehouse.
  • Cross-border remittance — partner integrations handle.

Capacity Estimation ​

Show these numbers on the board:

  • Ledger entries: 25M trips × ~4 entries (auth, capture, driver credit, platform fee) = 100M/day × 500 B = 50 GB/day = 18 TB/year.
  • Idempotency cache: 10M active keys in Redis at any time. At 128 B per key (includes cached response) → 1.3 GB footprint.
  • PSP throughput: Stripe caps at low thousands of ops/sec per merchant account; we pool across multiple merchant accounts (one per subsidiary/region).
  • Workflow volume: 25M PaymentWorkflows/day, ~300/sec at peak. Each workflow history: ~10–20 events, ~5 KB persisted.
  • Reconciliation files: 25M rows × 100 B = 2.5 GB/day CSV from PSPs; nightly batch job completes in < 1h.
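
These estimates are quick to sanity-check. A back-of-envelope sketch using the numbers from the bullets above:

```python
TRIPS_PER_DAY = 25_000_000
ENTRIES_PER_TRIP = 4              # auth, capture, driver credit, platform fee
ENTRY_BYTES = 500

entries_per_day = TRIPS_PER_DAY * ENTRIES_PER_TRIP        # 100M entries/day
ledger_gb_per_day = entries_per_day * ENTRY_BYTES / 1e9   # 50 GB/day
ledger_tb_per_year = ledger_gb_per_day * 365 / 1e3        # ~18 TB/year

idem_cache_gb = 10_000_000 * 128 / 1e9                    # ~1.3 GB in Redis
avg_workflow_starts_per_sec = TRIPS_PER_DAY / 86_400      # ~290/sec average

assert entries_per_day == 100_000_000
assert round(ledger_tb_per_year, 2) == 18.25
```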

The Set Up ​

Core Entities ​

  • Account: accountId, ownerType (USER/DRIVER/PLATFORM), currency, balanceCached
  • LedgerEntry: entryId, debitAccount, creditAccount, amount, currency, txnId, ts
  • Transaction: txnId, type (AUTH/CAPTURE/REFUND/PROMO), state, idempotencyKey, entries[]
  • PaymentMethod: pmId, type (CARD/WALLET/UPI), tokenRef (Stripe/Adyen)

API Design ​

Every mutation carries an idempotency key. Three main endpoints mirror the PSP lifecycle.

```typescript
POST /v1/payments/authorize
Headers: Idempotency-Key
Body: { tripId; riderId; amount; currency; paymentMethodId }
Response 200: { txnId; state: "AUTHORIZED"; expiresAt }

POST /v1/payments/capture
Headers: Idempotency-Key
Body: { txnId; finalAmount }
Response 200: { state: "CAPTURED"; ledgerEntryIds: string[] }

POST /v1/payments/refund
Headers: Idempotency-Key
Body: { originalTxnId; amount; reason }
Response 200: { refundTxnId; state: "REFUNDED" }
```
```protobuf
service Payments {
  rpc Authorize(AuthReq) returns (AuthResp);
  rpc Capture(CaptureReq) returns (CaptureResp);
  rpc Refund(RefundReq) returns (RefundResp);
  rpc GetBalance(BalanceReq) returns (Balance);
}

message AuthReq {
  string idempotency_key = 1;
  string trip_id = 2;
  string rider_id = 3;
  Money amount = 4;
  string payment_method_id = 5;
}
```
```cpp
Transaction txn;
txn.id = genUUID();
txn.idempotency_key = key;
txn.entries = {
  LedgerEntry{rider_account, pending_settlement, amount},
  LedgerEntry{pending_settlement, driver_account, driver_share},
  LedgerEntry{pending_settlement, platform_account, platform_fee},
};

// Single atomic DB transaction — debits and credits balance to zero
db.transactional([&](auto& tx) {
  for (auto& e : txn.entries) tx.insert("ledger", e);
  tx.insert("transactions", txn);
});
```

High-Level Design ​

```
API Gateway
    |
    v
+----------------+     +-------------------+
| Payments API   |---->| Idempotency store |
| (stateless)    |     | (Redis, NX)       |
+----------------+     +-------------------+
    |
    v
+---------------------+           +-------------------+
| Cadence workflow    |---------->| Fraud svc (sync)  |
| PaymentWorkflow     |           +-------------------+
+---------------------+
    |             \
    |              v
    |        +-------------------+
    |        | PSP adapters      |
    |        | (Stripe, Adyen,   |
    |        |  Razorpay, UPI)   |
    |        +-------------------+
    v
+---------------------+
| Ledger Service      |
| (double-entry,      |
|  Schemaless-backed) |
+---------------------+
    |
    v
+---------------------+
| Event stream Kafka  |
| (ledger.events)     |
+---------------------+
    |
    v
+---------------------+
| Consumers: reports, |
| analytics, balance  |
| projection, alerts  |
+---------------------+
```

End-to-End Flow for a Trip Payment ​

  1. Trip request → Payments Authorize → Cadence PaymentWorkflow starts keyed by tripId.
  2. Workflow calls PSP to create an auth hold (idempotent via idempotency key). Records pending ledger entry.
  3. Trip completes → Payments Capture → workflow captures auth, writes final ledger entries (debit rider, credit driver, credit platform).
  4. Ledger writes are a single atomic Schemaless transaction per txnId.
  5. ledger.events Kafka topic broadcasts entries to analytics, balance projection (Redis), and user-facing receipts.

Why These Components ​

  • Cadence holds durable workflow state. A crash at step 3 does not lose the pending auth; workflow resumes.
  • Schemaless is the authoritative ledger store — append-only, sharded, SOX-auditable.
  • Redis idempotency is the fast path; DB unique constraint is the ground truth.
  • Kafka broadcasts ledger events to consumers without coupling them to the write path.

Data Model Detail ​

  • Transactions (Schemaless): Partition by txnId. Columns: type, state, idempotencyKey UNIQUE, createdAt, pspRef, parentTxnId. Secondary index on (userId, createdAt) for history.
  • LedgerEntries (Schemaless): Partition by txnId. Columns: entryId, debitAccount, creditAccount, amount, currency, ts. Sum of debits equals sum of credits per txn — enforced at write time.
  • Accounts (Schemaless): Partition by accountId. ownerType, currency, balanceCached. Balance is a projection; source-of-truth is the ledger sum.
  • Idempotency (Redis): Key idemp:pay:{userId}:{key}, value: full serialized response, TTL 24h.

Capacity Walkthrough ​

  • 1K/sec peak auth + 1K/sec capture + scattered refunds = ~3K/sec DB writes, well within Schemaless capacity.
  • Ledger reads for reports: served from a Kafka-fed projection in Redis; hot balances have < 10ms p99.
  • Cadence workflow volume: 25M/day trip workflows + 1M refund/other = ~300/sec active workflows. A 20-shard Cadence cluster handles this easily.
  • PSP calls: ~1K auth/sec across Stripe (primary) + Adyen (secondary); well under per-account quotas.

Potential Deep Dives ​

1) How do we span Payments + Rider Wallet + Driver Earnings atomically? ​

One trip creates entries across multiple logical accounts that must balance exactly.

Bad Solution — XA / 2PC across services ​

  • Approach: Distributed transaction coordinator locks all participants.
  • Challenges: Coordinator is a SPOF; locks held during network calls; throughput collapses under contention. Unworkable at Uber scale.

Good Solution — Saga with compensating actions ​

  • Approach: CreditPromo -> ChargeCard -> CreditDriver. If any step fails, run the compensating actions in reverse order.
  • Challenges: Hand-rolled sagas are fragile. Compensation ordering is hard to get right.

Great Solution — Cadence workflow as orchestrator + double-entry ledger as consistency boundary ​

  • Approach:
    • The ledger is the only atomic unit. Writing 4 entries for one txn is a single DB transaction.
    • Cadence holds durable workflow state. On any activity failure it retries or runs a registered compensation.
    • Workflow code is deterministic and resumable, so a crashed coordinator picks up exactly where it left off.
    • External side effects (PSP charges) are activity calls with their own idempotency keys — a retried activity re-sends the same idempotency key to Stripe; Stripe dedups. We never double-charge.
  • Challenges: Cadence ops overhead (history retention, shard rebalancing). Keep workflow histories bounded; use continue-as-new for long-running payouts.

2) How do we make APIs truly idempotent? ​

Retries from mobile networks, PSP reconnects, and client app backgrounding all compound the risk here.

Bad Solution — Primary key collisions ​

  • Approach: Rely on DB unique constraints for inserts.
  • Challenges: Works for inserts but not for multi-step flows touching external PSPs.

Good Solution — Client idempotency key in Redis ​

  • Approach: Client supplies Idempotency-Key; server caches in Redis with 24h TTL. Second request returns cached response.
  • Challenges: Fails if Redis is unavailable; race between cache write and response commit.

Great Solution — Two-tier idempotency: Redis for speed + DB unique constraint as ground truth ​

  • Approach:
    • SET NX idempKey in Redis.
    • On Redis miss, DB lookup idempotency_key UNIQUE resolves the race.
    • Response body is stored with the key so retries return byte-for-byte the same payload (clients may compare responses across retries, so they must match).
    • Idempotency keys scoped per (userId, endpoint) to avoid cross-user collisions.
  • Challenges: DB write amplification — keep the idempotency table lean (TTL or nightly purge older than 7 days). Ensure PSP idempotency key is derived from our txnId so a PSP retry also dedups.
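
A minimal sketch of the two-tier check, with a dict standing in for Redis and SQLite's UNIQUE constraint as the ground truth (all names and the response shape are illustrative):

```python
import sqlite3

redis = {}                                # stand-in for Redis cache
db = sqlite3.connect(":memory:")          # stand-in for the ledger DB
db.execute("CREATE TABLE txns (txn_id TEXT, idem_key TEXT UNIQUE, response TEXT)")

def charge(user_id, idem_key, amount):
    scoped = f"idemp:pay:{user_id}:{idem_key}"       # scoped per user
    if scoped in redis:                               # fast path: cached response
        return redis[scoped]
    try:
        response = f'{{"txnId":"t-{idem_key}","state":"CAPTURED","amount":{amount}}}'
        db.execute("INSERT INTO txns VALUES (?, ?, ?)",
                   (f"t-{idem_key}", scoped, response))
        db.commit()
    except sqlite3.IntegrityError:
        # Lost the race (or Redis was down): the UNIQUE constraint resolves it.
        row = db.execute("SELECT response FROM txns WHERE idem_key = ?",
                         (scoped,)).fetchone()
        response = row[0]
    redis[scoped] = response                          # repopulate the cache
    return response

first = charge("u1", "k1", 2000)
retry = charge("u1", "k1", 2000)   # client retry: same payload, no second charge
assert first == retry
```

Even after a cache flush, the retry falls through to the DB and returns the original response without inserting a second row.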

2.5) How do we handle async PSP webhooks? ​

Many PSPs respond async — they accept a charge, then confirm settlement hours later via webhook.

Good Solution — Dedicated webhook receiver + event table ​

  • Approach: Webhook receiver validates signature, writes to psp_events table, and kicks off a processor.
  • Challenges: Duplicate webhooks from PSP retries can reprocess; easy to get wrong.

Great Solution — Idempotent webhook dedup + workflow signal ​

  • Approach:
    • Webhook body carries a PSP event ID. Dedup table with UNIQUE(psp_event_id).
    • On new event, look up the Cadence workflow for that txnId and signal it. Workflow transitions state based on the event.
    • Workflow is idempotent: re-sent signals are no-ops.
  • Challenges: Webhooks can arrive before or after our local state catches up; workflow must handle either order.
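
The dedup-then-signal step can be sketched as follows, with SQLite standing in for the dedup table and a list for workflow signals (names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE psp_events (psp_event_id TEXT PRIMARY KEY, txn_id TEXT)")
signals = []   # stand-in for Cadence workflow signals

def handle_webhook(psp_event_id, txn_id):
    # The UNIQUE primary key makes duplicate deliveries no-ops.
    cur = db.execute("INSERT OR IGNORE INTO psp_events VALUES (?, ?)",
                     (psp_event_id, txn_id))
    db.commit()
    if cur.rowcount == 1:          # first delivery only
        signals.append(txn_id)     # would signal the workflow here

handle_webhook("evt_1", "txn_42")
handle_webhook("evt_1", "txn_42")  # PSP retry: deduped, no second signal
assert signals == ["txn_42"]
```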

3) How do refunds and partial captures stay correct? ​

Concurrent refund calls are a real risk in ops tools.

Bad Solution — Allow any refund amount ​

  • Approach: Just write a negative charge entry.
  • Challenges: Over-refund risk — concurrent refund requests each subtract independently.

Good Solution — Track refundedSoFar on the original transaction ​

  • Approach: Reject if new refund amount > (captured − refundedSoFar).
  • Challenges: Read-modify-write race allows two concurrent refunds to both pass the check.

Great Solution — Refund as its own txn with parent pointer + SELECT FOR UPDATE ​

  • Approach:
    • Use a DB row-lock (SELECT ... FOR UPDATE) on the parent transaction when computing available-to-refund.
    • Concurrent refunds serialize on the row lock.
    • For eventual-consistency multi-region environments, use a version counter with compare-and-set and retry on conflict.
    • Write the refund as a new txn with entries reversing the original direction; ledger balance remains auditable.
  • Challenges: Row locks create hotspots on high-frequency refund parents (e.g., bulk refund job). Batch refunds with explicit coordination.
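
Since SQLite has no SELECT ... FOR UPDATE, this sketch shows the compare-and-set variant mentioned above: a version counter with retry on conflict (amounts in cents, names illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE txns
              (txn_id TEXT PRIMARY KEY, captured INT, refunded INT, version INT)""")
db.execute("INSERT INTO txns VALUES ('t1', 2000, 0, 0)")
db.commit()

def refund(txn_id, amount):
    # CAS on a version counter; SELECT FOR UPDATE would play the same role
    # on a database that supports row locks.
    for _ in range(5):                                  # retry on conflict
        cap, ref, ver = db.execute(
            "SELECT captured, refunded, version FROM txns WHERE txn_id = ?",
            (txn_id,)).fetchone()
        if amount > cap - ref:
            return False                                # over-refund rejected
        cur = db.execute(
            "UPDATE txns SET refunded = refunded + ?, version = version + 1 "
            "WHERE txn_id = ? AND version = ?", (amount, txn_id, ver))
        db.commit()
        if cur.rowcount == 1:
            return True                                 # CAS succeeded
    raise RuntimeError("too much contention")

assert refund("t1", 1500) is True
assert refund("t1", 600) is False    # only 500 left to refund
assert refund("t1", 500) is True
```

Two concurrent refunds read the same version; only one UPDATE matches, and the loser re-reads and re-checks the available amount.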

4) How do promo codes apply safely under outages? ​

A promo service outage should not block trips.

Bad Solution — Synchronous promo call in charge path ​

  • Approach: Call promo service at charge time.
  • Challenges: Promo outage blocks all charges.

Good Solution — Prefetch promo at trip request time ​

  • Approach: Apply prefetched promo as a ledger entry; charge card for remainder.
  • Challenges: Prefetch becomes stale if user modifies promos mid-trip.

Great Solution — Promo as a separate account with graceful degradation ​

  • Approach:
    • Promo is a separate account with its own ledger. Workflow debits promo first (up to trip amount), charges card for remainder.
    • Promo expiry and fraud limits enforced at debit time with SELECT FOR UPDATE on the promo account row.
    • If promo service is down, trip proceeds with full card charge. Background reconciliation retroactively credits the promo when the service recovers.
  • Challenges: Reconciliation skew can irritate users if they see full charge for several minutes. Surface "promo pending" on the receipt.

5) How do we prevent negative balances on the rider wallet? ​

If a rider's wallet balance drops below zero, we've given away money.

Good Solution — Check balance before debit ​

  • Approach: Read balance, compare, debit.
  • Challenges: TOCTOU — two concurrent debits can both pass the check.

Great Solution — Conditional debit with SELECT FOR UPDATE or atomic decrement ​

  • Approach:
    • UPDATE account SET balance = balance - :amount WHERE accountId = :id AND balance >= :amount.
    • Returns affected-row count; if 0, insufficient funds. Single atomic DB statement; no reads.
  • Challenges: Contention on high-activity wallets. Combine with Redis-based optimistic reservation for hot wallets (e.g., corporate accounts) and periodic reconciliation with the DB.
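
A sketch of the conditional debit, using SQLite to show the affected-row-count check (amounts in cents, names illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (account_id TEXT PRIMARY KEY, balance INT)")
db.execute("INSERT INTO accounts VALUES ('rider:123', 1000)")
db.commit()

def debit(account_id, amount):
    # Single atomic statement: the WHERE clause IS the balance check, so two
    # concurrent debits cannot both pass on a stale read.
    cur = db.execute(
        "UPDATE accounts SET balance = balance - ? "
        "WHERE account_id = ? AND balance >= ?", (amount, account_id, amount))
    db.commit()
    return cur.rowcount == 1       # 0 rows affected => insufficient funds

assert debit("rider:123", 700) is True
assert debit("rider:123", 700) is False   # would go negative: rejected
assert db.execute("SELECT balance FROM accounts").fetchone()[0] == 300
```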

5.5) How do we handle dual-write hazards between DB and Kafka? ​

After writing ledger entries to the DB, we publish to Kafka for projections. If the DB commit succeeds but Kafka publish fails, we have drift.

Good Solution — Fire-and-forget Kafka publish ​

  • Approach: Commit DB, then publish to Kafka. Log errors.
  • Challenges: Silent drift under Kafka outages; projections fall behind.

Great Solution — Transactional outbox ​

  • Approach:
    • Write ledger entries + outbox entry in the same DB transaction.
    • A separate OutboxRelay service polls the outbox and publishes to Kafka with at-least-once semantics; marks outbox entries as dispatched.
    • Consumers deduplicate by entryId.
  • Challenges: Small DB overhead (outbox table churn). Offset by simplicity and correctness. Purge dispatched entries nightly.
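
A sketch of the outbox pattern, with SQLite and an in-memory list standing in for Kafka (names illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE ledger (entry_id TEXT PRIMARY KEY, amount INT);
CREATE TABLE outbox (entry_id TEXT PRIMARY KEY, dispatched INT DEFAULT 0);
""")
published = []    # stand-in for the Kafka topic

def write_ledger(entry_id, amount):
    # Ledger entry and outbox row commit in the SAME DB transaction, so there
    # is no window where one exists without the other.
    with db:
        db.execute("INSERT INTO ledger VALUES (?, ?)", (entry_id, amount))
        db.execute("INSERT INTO outbox (entry_id) VALUES (?)", (entry_id,))

def relay_once():
    # OutboxRelay: publish undispatched rows, then mark them (at-least-once;
    # consumers deduplicate by entry_id).
    rows = db.execute("SELECT entry_id FROM outbox WHERE dispatched = 0").fetchall()
    for (entry_id,) in rows:
        published.append(entry_id)    # kafka.produce("ledger.events", ...)
        db.execute("UPDATE outbox SET dispatched = 1 WHERE entry_id = ?", (entry_id,))
    db.commit()

write_ledger("e1", 1700)
write_ledger("e2", 300)
relay_once()
assert published == ["e1", "e2"]
relay_once()                       # nothing left to dispatch
assert published == ["e1", "e2"]
```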

6) How do we reconcile with PSPs nightly? ​

Money doesn't just disappear because a microservice lost a row. Reconciliation is a hard SOX requirement.

Great Solution — Daily settlement file match + discrepancy workflow ​

  • Approach:
    • Each PSP drops a settlement file (typically CSV) nightly. A batch job joins it against our ledger by PSP transaction reference and our txnId.
    • Discrepancies (missing on our side, missing on theirs, amount mismatch) trigger a Cadence discrepancy workflow that pages a finance operator and suspends new charges on the affected account until resolved.
  • Challenges: Time zone mismatches and late-settling transactions — allow a 48h reconciliation window before alerting.
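
The join itself is simple; a sketch with the stdlib csv module and made-up settlement data:

```python
import csv, io

# Our ledger keyed by PSP transaction reference (illustrative data, cents).
ledger = {"psp_a1": 2000, "psp_a2": 1500}

# Simulated nightly settlement file from the PSP.
settlement_csv = "psp_ref,amount\npsp_a1,2000\npsp_a3,900\n"

seen = set()
discrepancies = []
for row in csv.DictReader(io.StringIO(settlement_csv)):
    ref, amt = row["psp_ref"], int(row["amount"])
    seen.add(ref)
    if ref not in ledger:
        discrepancies.append(("missing_on_our_side", ref))
    elif ledger[ref] != amt:
        discrepancies.append(("amount_mismatch", ref))
for ref in ledger.keys() - seen:
    # May simply be late-settling: alert only after the 48h window.
    discrepancies.append(("missing_at_psp", ref))

assert ("missing_on_our_side", "psp_a3") in discrepancies
assert ("missing_at_psp", "psp_a2") in discrepancies
```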

Rapid-Fire Q&A Anticipations ​

  • "What if the DB commits but Cadence doesn't get the activity result?" Idempotency on the activity: retrying re-reads the same DB row, sees the work is already done, returns success. Cadence workflow continues.
  • "Is ledger balance strongly consistent?" Per txn: yes, atomic DB transaction. Aggregate balance projections (Redis): eventually consistent, fine for hot-path reads; authoritative balance always from the ledger.
  • "How do you sign off on a new feature touching payments?" Code review + security review + SRE on-call signoff + explicit test plan covering money flows. No exception.
  • "What's the blast radius of a single corrupt payment?" Bounded to one trip; workflow compensation rewinds side effects. Nightly reconciliation catches anything missed.
  • "How do you test chaos scenarios for PSP failures?" Fault injection in staging; synthetic load with random 5% 5xx from PSP stub.

Alternatives Considered ​

  • 2PC / XA vs Saga: 2PC is simpler on paper but collapses under scale. Saga + Cadence is the Uber answer.
  • Event sourcing vs ledger: Event sourcing is tempting for payments. Uber uses an append-only ledger (functionally similar) but avoids the full event-sourcing pattern because projections get unwieldy.
  • Single PSP vs multi-PSP: Multi-PSP reduces risk of single-vendor outage and enables regional/local PSPs (UPI in India, WeChat Pay in China).
  • In-memory ledger cache vs DB-backed: Balance projections live in Redis but are rebuilt from the ledger, not authoritative.
  • Cadence vs Temporal vs in-house: Uber historically built Cadence, contributed to Temporal fork. Either works as long as you explain the durable-workflow pattern.

Frequently Asked Follow-ups ​

  • "What if the rider cancels mid-ride?" — RefundActivity in workflow releases auth hold or captures a cancellation fee and refunds the rest.
  • "How do you handle chargebacks?" — Separate dispute workflow consumes PSP chargeback notifications; ledger reverses entries via a compensating refund.
  • "How do you detect double charges from a single trip?" — Per-trip unique constraint on (tripId, txnType) in transactions table; ledger reconciliation job flags anomalies.
  • "What about driver payouts?" — Daily scheduled Cadence workflow aggregates driver credits and triggers payout to bank account via ACH/SEPA/UPI.
  • "Cross-currency?" — Every account has a currency; cross-currency trips involve an FX ledger entry with exchange rate recorded at capture time. Rate comes from a treasury service.

Visual Aids to Draw ​

  • T-account double-entry diagram with rider/driver/platform accounts and entries for auth, capture, promo.
  • Cadence workflow state machine for PaymentWorkflow: AUTH → PENDING_CAPTURE → CAPTURED → REFUND? → COMPLETE.
  • Idempotency check flow: Redis SET NX → DB unique constraint → PSP idempotency key.
  • Reconciliation pipeline joining PSP settlement CSV with ledger.
  • Multi-PSP failover showing primary + secondary with traffic split.

What's Expected at Each Level ​

Mid-level (L4) ​

  • Double-entry ledger basics, idempotency keys, separate auth/capture.
  • Understands why direct PSP calls need retries.

Senior (L5 / L5A) ​

  • Saga vs 2PC discussion with clear tradeoff articulation.
  • Refund correctness with row locks or CAS.
  • Explicit invariants (balance never negative for rider wallet; every ledger entry has a matching counter-entry).

Staff+ (L6) ​

  • Multi-currency handling and FX boundaries.
  • Reconciliation with PSPs (nightly settlement file match) and regulatory partitioning per country.
  • Availability vs integrity tradeoffs during PSP outages — when do we queue and when do we fail the trip?
  • Tenant/region isolation for compliance (India data must stay in India).

Common Pitfalls ​

  • Using 2PC. Shows lack of scale awareness. Saga + durable workflow is the expected answer.
  • Missing idempotency at the PSP layer. It's not enough to dedup at your API; the PSP call must carry an idempotency key derived from your txnId.
  • No separate auth/capture. Taxi rides are variable-fare; you must auth at start and capture at end.
  • Ignoring refunds. Be ready for partial refunds with concurrent callers.
  • Forgetting the ledger invariant. Sum(debits) = sum(credits) per txn. Interviewers love to probe this.

Walkthrough: Interview Dialogue Example ​

Interviewer: "Trip happens, rider pays $20, $3 is promo credit, platform takes 25%. Walk through ledger entries."

You should answer:

Final ledger entries at capture (all in one DB transaction):

  1. Debit rider:123 card $17; credit clearing $17.
  2. Debit rider:123 promo_account $3; credit clearing $3.
  3. Debit clearing $15 ($20 - 25%); credit driver:456 earnings $15.
  4. Debit clearing $5; credit platform_revenue $5.

Sum of debits per txn = $40. Sum of credits per txn = $40. Clearing account nets to zero for this txn. Rider net out $20 (card + promo), driver net in $15, platform net in $5.
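
The two invariants in this walkthrough can be checked mechanically. A sketch with the four entries above (amounts in cents, account names illustrative):

```python
# The four capture entries, as (debit_account, credit_account, amount).
entries = [
    ("rider:123/card",  "clearing",         1700),
    ("rider:123/promo", "clearing",          300),
    ("clearing",        "driver:456",       1500),
    ("clearing",        "platform_revenue",  500),
]

# Invariant 1: each entry debits and credits the same amount, so total
# debits equal total credits for the txn ($40.00 per side).
assert sum(a for _, _, a in entries) == 4000

# Invariant 2: the clearing account nets to zero for this txn.
net = {}
for debit, credit, amount in entries:
    net[debit] = net.get(debit, 0) - amount
    net[credit] = net.get(credit, 0) + amount
assert net["clearing"] == 0
assert net["driver:456"] == 1500 and net["platform_revenue"] == 500
```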

Cadence PaymentWorkflow orchestrated:

  • AuthorizeActivity at trip start (Stripe $25 hold, idempotency key = txnId).
  • CaptureActivity at trip end (Stripe capture $20 of the $25, idempotency key = txnId-capture).
  • WriteLedgerActivity single DB transaction with the 4 entries above.

If CaptureActivity fails mid-flight, Cadence retries with the same idempotency key — Stripe deduplicates, and the workflow proceeds. If WriteLedgerActivity fails, we've captured money but not written the ledger; Cadence retries until success or triggers a compensation workflow (RefundActivity) and alerts finance.

What If They Pivot Mid-Interview? ​

  • "Design driver payouts." — Daily Cadence workflow aggregates trip credits into a payout; hands off to a bank-transfer adapter (ACH / SEPA / UPI). Idempotency via payout period key.
  • "Support split fares (multiple riders)." — Each rider has their own auth and capture; one trip has multiple txnIds that share a tripId. Ledger entries sum to the total fare.
  • "What about prepaid wallets (corporate accounts)?" — Wallet is just another account type; debits use SELECT FOR UPDATE + conditional update pattern from deep dive 5.
  • "How do you handle GDPR right-to-erasure?" — Cannot delete ledger entries (SOX requires retention). Instead, tokenize PII columns; delete the user's PII-lookup record on request. Ledger remains intact with anonymized identifiers.

Reliability and Observability ​

  • SLO: 99.99% auth availability; 99.999% ledger durability. p99 auth < 500ms; capture within 30s of trip end.
  • Failure modes:
    • PSP outage → circuit breaker per PSP; retry via secondary PSP. Workflow continues; user experience: slight latency bump.
    • Ledger DB failover → Cadence resumes when DB is back; no lost state because each txn is a single DB transaction.
    • Redis idempotency down → DB unique constraint still enforces correctness, at the cost of an extra lookup per request.
  • Deployment: Gate any ledger schema change behind a migration Cadence workflow that dual-writes for a week before cutover.
  • Monitoring: auth_success_rate, auth_latency_p99, capture_drift_seconds, unreconciled_txn_count per PSP and per region. Alert when unreconciled exceeds 100 in a day.
  • Runbook: On PSP-wide outage, switch to secondary PSP in the region; if all PSPs fail, queue charges in a staging table for later replay (explicitly notify users that capture is delayed).

Uber-Specific Notes ​

  • Uber uses Cadence heavily for payments orchestration — the multi-step, long-running nature maps perfectly.
  • Schemaless is the system of record; ledger lives there.
  • PSP diversity: Stripe, Adyen, Braintree, plus country-specific (Paytm in India, UPI, WeChat/Alipay in China). Mention one or two.
  • M3 tracks auth_success_rate, capture_latency_p99, refund_error_rate per PSP. Jaeger traces every capture for audit drilldowns.
  • For a staff+ stretch, mention regulatory isolation: ledgers in India must stay on Indian infrastructure (RBI data localization); Uber runs region-specific Cadence clusters and Schemaless shards.
  • When the interviewer probes double-entry principles, be ready to draw the T-account: every entry has a debit and a credit; the sum of debits equals the sum of credits for any transaction.
  • Close with: "Our SLO is 99.99% on auth, five 9s on ledger durability. On PSP outage we degrade gracefully to queue-and-retry for non-critical charges; for hot-path auth we fail fast and the trip falls back to a secondary PSP."

Scaling Milestones ​

  • Seed market: One Stripe account, Postgres ledger table, simple auth/capture, no promos.
  • Early growth: Idempotency added; refunds supported; migration to double-entry ledger for auditability.
  • Scale-up: Cadence for workflow orchestration; Kafka event stream; multi-PSP with failover; promo as separate account.
  • Global: Regional Cadence clusters; country-specific PSPs; data-residency compliance; multi-currency; GDPR support.

Summary Checklist ​

  • [ ] Double-entry ledger in Schemaless.
  • [ ] Cadence workflow for multi-step payment lifecycle.
  • [ ] Idempotency at API, Redis, DB unique constraint, and PSP layer.
  • [ ] Saga pattern for cross-service consistency.
  • [ ] Refund correctness with row locks / CAS.
  • [ ] Promo as separate account with graceful degradation.
  • [ ] PSP reconciliation job nightly.
  • [ ] Multi-PSP failover for regional resilience.
  • [ ] Compliance: PCI-DSS, SOX, GDPR, data localization.

Key Numbers to Memorize ​

  • Trips/day (peak): 25M
  • Auth QPS (peak): 1K
  • Capture latency: within 30s of trip end
  • Ledger entries/day: 100M
  • Ledger storage/year: 18 TB
  • Auth p99: < 500ms
  • Ledger durability: five 9s
  • Auth availability: 99.99%
  • Idempotency TTL: 24h (Redis), permanent (DB)

One-Liner You Should Remember ​

"Double-entry ledger on Schemaless; Cadence orchestrates the payment lifecycle with durable retries and compensations; idempotency is enforced at every tier; PSP diversity mitigates outages; nightly reconciliation catches anomalies. 1K auth/sec peak, 100M ledger entries/day, zero tolerance for double-charges."
