# HLD: Uber Surge Pricing
Frequently Asked at Uber — Uber's signature system design problem; reported in nearly every L5A loop.
## Understanding the Problem

### What is Surge Pricing?
Surge raises the price multiplier in a geographic area when demand exceeds supply. Its two goals are (1) dampen excess demand by pricing out less urgent riders, and (2) incentivize drivers to relocate into hot areas. The system must compute multipliers in near real time (sub-30s update latency), remain consistent for any quoted price window, and broadcast a heatmap back to drivers. This is Uber's canonical streaming-systems problem and frequently comes after (or paired with) the ride matching question.
### Functional Requirements

Core (above the line):

- Compute a surge multiplier per `(geo-cell, product-type)` every 10–30 seconds.
- Serve the multiplier to the pricing service at < 50ms p99 lookup.
- Show a surge heatmap to drivers so they can reposition.
- Lock a quoted multiplier for the duration of a rider's price guarantee window (typically 90–120s).
Below the line (out of scope):
- Base fare calculation — separate pricing service.
- Promotional discounts — promo service.
- Long-term driver incentives — separate earnings pipeline.
- Fraud-surfacing of fake demand — separate abuse-detection service.
### Non-Functional Requirements

Core:

- Scale — ~1M active H3 res-7 cells globally; 5–10 product types → 5–10M `(cell, product)` pairs.
- Update latency — demand-to-price < 30s p99.
- Read QPS — every ride request and every driver heatmap refresh → ~50K reads/sec globally.
- Consistency — a quoted multiplier must not silently increase during a rider's price-lock window. Eventual consistency is fine for heatmap refreshes.
- Availability — 99.99%. If surge is unavailable, fall back to 1.0x (never block a trip because surge is down).
Below the line:
- Long-term trend forecasting — offline ML pipeline.
- Cross-market arbitrage detection — offline fraud work.
### Capacity Estimation

Write these on the board before drawing anything (a quick arithmetic sanity check follows the list):
- Stream throughput: ~500K events/sec (requests, dispatches, cancellations, app opens) × 300 B = 150 MB/s Kafka ingress.
- Hot state: 10M `(cell, product)` entries × 128 B = 1.3 GB — fits comfortably in a Redis cluster with headroom.
- Audit history: 1M updates/min × 60 × 24 = 1.4B rows/day. Store 30 days hot (42B rows) in Cassandra for reconciliation. With replication factor 3 and compression, this is ~30 TB raw → ~10 TB on disk.
- Read capacity: 50K reads/sec (every quote + every heatmap refresh). Redis Cluster with 3 primaries handles this at sub-ms p99.
- Flink state: 10M keyed windows × ~256 B = 2.5 GB; backed by RocksDB on local SSD for spill-over and S3 checkpoints.
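For interview practice it helps to verify the envelope math mechanically. A compile-time sanity check in C++ — the constants mirror the bullets above; the names are illustrative:

```cpp
#include <cstdint>

// Compile-time check of the back-of-envelope numbers above.
constexpr int64_t kEventsPerSec = 500'000;
constexpr int64_t kEventBytes   = 300;
static_assert(kEventsPerSec * kEventBytes == 150'000'000);  // 150 MB/s ingress

constexpr int64_t kHotEntries = 10'000'000;  // (cell, product) pairs
constexpr int64_t kEntryBytes = 128;
static_assert(kHotEntries * kEntryBytes == 1'280'000'000);  // ~1.3 GB hot state

constexpr int64_t kAuditRowsPerDay = 1'000'000LL * 60 * 24;
static_assert(kAuditRowsPerDay == 1'440'000'000);  // ~1.4B audit rows/day
```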
## The Set Up

### Core Entities
| Entity | Description |
|---|---|
| SurgeSignal | eventType (REQUEST/DISPATCH/CANCEL/SEARCH/DRIVER_IDLE), cellId, productType, ts |
| CellState | cellId, productType, demandCount, supplyCount, multiplier, updatedAt |
| QuotedSurge | quoteId, cellId, multiplier, expiresAt, riderId |
### API Design

Pricing service reads the current multiplier synchronously, then locks a quote for the rider's session. Drivers pull a heatmap tile set for their viewport.

```
// Internal: pricing service fetches current multiplier
GET /v1/surge?cellId=8928308280fffff&product=UberX
Response: { multiplier: 1.4, validUntil: "2026-04-19T12:00:30Z" }

// Internal: lock a quote for a rider request
POST /v1/surge/quote
Body: { cellId, product, riderId }
Response: { quoteId, multiplier, expiresAt, signedToken }

// Driver-facing: heatmap tiles
GET /v1/surge/heatmap?bbox=...&res=8
Response: { cells: [{ cellId, multiplier, color }, ...] }
```

The same surface as internal gRPC:

```protobuf
import "google/protobuf/timestamp.proto";

service Surge {
  rpc GetMultiplier(GetMultiplierReq) returns (Multiplier);
  rpc LockQuote(LockQuoteReq) returns (Quote);
  rpc GetHeatmap(HeatmapReq) returns (HeatmapResp);
}

// GetMultiplierReq, Multiplier, HeatmapReq, HeatmapResp elided.
message LockQuoteReq {
  string cell_id = 1;
  string product = 2;
  string rider_id = 3;
}

message Quote {
  string quote_id = 1;
  double multiplier = 2;
  google.protobuf.Timestamp expires_at = 3;
  bytes signed_token = 4; // HMAC over (cell, mult, expiry)
}
```
The per-(cell, product) window aggregator, sketched in C++ (the enum and signal struct are filled in so the snippet is self-contained):

```cpp
#include <algorithm>  // std::clamp
#include <cmath>      // std::sqrt

enum EventType { REQUEST, DISPATCH, CANCEL, SEARCH, DRIVER_IDLE };

struct SurgeSignal {
    EventType eventType;
    // cellId, productType, ts omitted for brevity
};

// Per-(cell, product) window aggregator
struct CellAggregator {
    int demand = 0;
    int supply = 0;

    void onEvent(const SurgeSignal& s) {
        if (s.eventType == REQUEST) demand++;      // rider demand signal
        if (s.eventType == DRIVER_IDLE) supply++;  // idle-driver supply signal
    }

    double finalize() const {
        if (supply == 0) return 5.0;  // no supply at all: emit the cap
        double ratio = static_cast<double>(demand) / supply;
        return std::clamp(std::sqrt(ratio), 1.0, 5.0);
    }
};
```

## High-Level Design

```
Driver/Rider events
|
v
+--------------+ +----------+ +----------------+
| Event API |--->| Kafka |--->| Flink/Samza |
| (ingress) | | (500K/s) | | aggregation |
+--------------+ +----------+ +----------------+
|
v
+----------------+
| Surge Compute |
| (demand/supply)|
| -> multiplier |
+----------------+
|
+-------------------+------------------+
v v
+----------------+ +------------------+
| Redis Cluster |<------------------| Write-through |
| (hot state, | Quoted | publisher |
| cell -> mult) | multipliers +------------------+
+----------------+ |
^ v
| +------------------+
+------------------+ | Cassandra |
| Pricing Service | | (audit/history) |
+------------------+                  +------------------+
```

### End-to-End Flow
- Signal ingest. A rider opens the app or requests a ride; a driver's app emits idle/on-trip events. The client SDK sends these to a thin Event API keyed by `cellId`. The Event API authenticates, applies per-device rate limits, and produces to Kafka.
- Kafka topology. The `surge-signals` topic has 192 partitions, RF=3, partitioned by `cellId` so all events for a cell land on the same Flink task. Retention 24h for replay.
- Flink aggregation. The Flink job consumes with parallelism 96 (two partitions per task, leaving headroom to scale out). It runs a 10-second tumbling window per `(cellId, product)` and keeps rolling demand (requests) and supply (idle drivers) counts. State backend: RocksDB on local SSD, checkpointed to S3 every 60s. (A single-process sketch of this keyed windowing follows the list.)
- Multiplier compute. The Surge Compute operator runs `multiplier = f(demand, supply, trend)`. Simple form: `max(1.0, sqrt(demand / supply))`, capped at 5.0. The advanced form layers in trend features and caps on rate-of-change.
- Write-through. New multipliers are written through to Redis (hot-path reads) via a pipelined producer. The same updates are appended to the Cassandra `surge_audit` table for reconciliation (partition `(cellId, dayHour)`, clustering `ts`).
- Hot-path reads. Pricing Service reads from Redis on every rider quote (sub-ms p99). `LockQuote` issues a signed HMAC token with a 120s TTL. The token includes `(cellId, product, multiplier, expiresAt)`; the HMAC secret is rotated daily and served via a secrets service.
- Driver heatmap. Drivers see a downsampled view at H3 res 7 or 8, served from a CDN-cached snapshot refreshed every 30s. For hot cells (multiplier > 2.0 or rate-of-change > 0.3), deltas are pushed via the existing driver WebSocket.
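The keyed tumbling window is the heart of the pipeline. Here is a minimal single-process sketch of its semantics, reusing `CellAggregator` from the set-up; `KeyedWindows` and the timer wiring are illustrative stand-ins for the Flink job, not Flink APIs:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Single-process stand-in for the Flink job: key events by (cellId, product),
// aggregate for 10 seconds, then emit and reset every window (tumbling).
struct KeyedWindows {
    std::map<std::pair<std::string, std::string>, CellAggregator> windows;

    void onEvent(const std::string& cellId, const std::string& product,
                 const SurgeSignal& s) {
        windows[{cellId, product}].onEvent(s);
    }

    // Invoked by a 10s timer; emits (key, multiplier) to the write-through step.
    void fireWindow() {
        for (auto& [key, agg] : windows) {
            std::cout << key.first << "/" << key.second
                      << " -> " << agg.finalize() << "\n";
            agg = CellAggregator{};  // tumbling: state resets each window
        }
    }
};
```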
### Why This Shape
- Streaming (Flink/Kafka) is the only way to keep 30s freshness at 500K events/sec.
- Redis holds just 1.3 GB of hot state, so a small cluster suffices.
- Cassandra gives us cheap appends for audit and dispute resolution.
- Signed tokens move trust to the client so Pricing Service doesn't need a global lock to enforce price guarantees.
### Data Model Detail

- CellState (Redis): Key `surge:{cellId}:{product}`. Hash with fields `demand`, `supply`, `multiplier`, `updatedAt`, `version`. TTL 300s — if no events arrive for 5 minutes, the cell is considered quiet and the multiplier defaults to 1.0. (Example commands below.)
- QuotedSurge (Redis): Key `quote:{quoteId}`. Hash with `cellId`, `multiplier`, `expiresAt`, `riderId`. TTL equal to the lock window (120s).
- Audit (Cassandra): Partition key `(cellId, dayHour)`, clustering key `ts`. Stores every multiplier update for reconciliation with the trip ledger.
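Illustrative `redis-cli` commands for the two hot-path keys — the cell ID reuses the API example, the quote ID and values are made up, timestamps are epoch seconds:

```
HSET surge:8928308280fffff:UberX demand 42 supply 30 multiplier 1.4 updatedAt 1767225600 version 7
EXPIRE surge:8928308280fffff:UberX 300

HSET quote:q-7f3a cellId 8928308280fffff multiplier 1.4 expiresAt 1767225720 riderId r-19
EXPIRE quote:q-7f3a 120
```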
### Capacity Walkthrough

- 500K events/sec × 300 B = ~150 MB/s Kafka ingress. With RF=3, that's ~450 MB/s to disk, or ~40 TB/day replicated. Retention 24h hot.
- Flink: 20 TaskManagers × 4 slots each × 5K events/sec/slot = 400K events/sec per region at comfortable margins.
- Redis: 10M cell keys × 128 B = 1.3 GB. A 3-node Redis Cluster handles the read load (50K reads/sec).
- Cassandra audit: ~1.4B updates/day × 256 B ≈ 360 GB/day; 30 days ≈ 11 TB unreplicated → ~30 TB at RF=3, ~10 TB on disk after compression (matching the estimate above).
## Potential Deep Dives

### 1) How do we handle backpressure on demand spikes (NYE, concerts, storms)?
Demand spikes are exactly when you cannot afford to fall behind on surge — both riders and drivers need accurate signals fast.
#### Bad Solution — Unbounded Flink consumer
- Approach: Let Flink consume from Kafka with default parallelism.
- Challenges: Consumer lag grows; surge updates fall behind 5–10 minutes precisely when surge matters most. Riders pay old low multipliers; drivers chase phantom demand.
#### Good Solution — Auto-scale on lag
- Approach: Scale Flink parallelism on consumer lag via a control loop.
- Challenges: Scale-up takes 60–120s in Kubernetes, too slow for a NYE spike where demand doubles in 5 minutes.
#### Great Solution — Pre-provisioned headroom + adaptive sampling
- Approach:
  - Keep Flink provisioned at 2–3× baseline during predictable peaks (Friday 6pm, NYE, large events).
  - For unpredicted spikes, apply watermark-driven downsampling: when a cell's event rate exceeds 10K/s (always-surge territory anyway), sample 1:10 on ingest — direction matters more than exact counts for surge. (A sampler sketch follows this list.)
  - Fall back to a 30s window (from 10s) when consumer lag exceeds 5s — trading freshness for throughput.
- Numbers: Held p99 update latency at < 45s in a retrospective replay of a real 5× demand spike. A 2× headroom buys you ~5 minutes of leeway before emergency sampling kicks in.
- Challenges: Sampling biases counts during weird events (e.g., a parade blocking supply without matching demand). Instrument the sampler so you can disable it per-cell from the control plane.
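A minimal sketch of the ingest-side sampler. The deterministic hash keeps replays consistent; the struct name, threshold plumbing, and rate measurement are assumptions for illustration:

```cpp
#include <functional>
#include <string>

// Keep every event below the rate threshold; above it, keep 1-in-10,
// chosen by hashing the event id so Kafka replays sample identically.
struct AdaptiveSampler {
    double rateThreshold = 10'000.0;  // events/sec: always-surge territory
    int keepOneIn = 10;

    bool shouldKeep(const std::string& eventId, double cellEventsPerSec) const {
        if (cellEventsPerSec <= rateThreshold) return true;  // no sampling
        return std::hash<std::string>{}(eventId) % keepOneIn == 0;
    }

    // Counts built from sampled events must be scaled back up downstream.
    double upscale(double sampledCount, double cellEventsPerSec) const {
        if (cellEventsPerSec <= rateThreshold) return sampledCount;
        return sampledCount * keepOneIn;
    }
};
```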
### 2) How do we guarantee price-lock integrity across billing?
A rider sees 1.4x at request time and then pays 2.1x at trip end — that is a PR disaster and a CX nightmare.
#### Bad Solution — Re-read surge at trip completion
- Approach: Pricing service re-reads Redis at billing time.
- Challenges: Rider sees 1.4x at request, pays 2.1x because surge climbed mid-trip. Violates the price guarantee.
#### Good Solution — Persist multiplier on the trip record

- Approach: Pricing writes the quoted multiplier into the trip record at `REQUESTED` time. Billing uses the trip record, ignoring live surge.
- Challenges: A rider can game the system by locking a 1.0x quote, waiting 5 minutes, then requesting — the lock persists even though supply/demand shifted.
#### Great Solution — Signed quote tokens with short TTL + immutable ledger

- Approach:
  - `LockQuote` returns an HMAC-signed token `{cellId, multiplier, expiresAt}` with a 120s TTL.
  - The trip request must include the token. Pricing verifies the signature and TTL before accepting the quote.
  - Token expiry forces a re-quote — you cannot sit on a 1.0x quote for 10 minutes.
  - The ledger entry at trip creation has an immutable `surge_multiplier_applied` column. A nightly reconciliation job checks the Cassandra surge log against the trip ledger for anomalies.
- Challenges: Clock skew on the verifier can leak a few extra seconds of validity — bound the TTL conservatively (115s server-side for a 120s client-advertised window). A token-verification sketch follows.
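A minimal sketch of quote-token signing and verification, assuming OpenSSL (`HMAC`, `EVP_sha256`, `CRYPTO_memcmp`); the pipe-delimited payload framing and function names are illustrative, not Uber's actual token format:

```cpp
#include <openssl/crypto.h>  // CRYPTO_memcmp
#include <openssl/evp.h>     // EVP_sha256, EVP_MAX_MD_SIZE
#include <openssl/hmac.h>    // HMAC
#include <cstdint>
#include <string>

// Illustrative payload framing: "cellId|product|multiplier|expiresAtEpoch".
std::string signQuote(const std::string& payload, const std::string& secret) {
    unsigned char mac[EVP_MAX_MD_SIZE];
    unsigned int macLen = 0;
    HMAC(EVP_sha256(),
         secret.data(), static_cast<int>(secret.size()),
         reinterpret_cast<const unsigned char*>(payload.data()), payload.size(),
         mac, &macLen);
    return std::string(reinterpret_cast<char*>(mac), macLen);
}

// Reject expired tokens first, then compare MACs in constant time.
bool verifyQuote(const std::string& payload, const std::string& presentedMac,
                 const std::string& secret, int64_t nowEpoch,
                 int64_t expiresAtEpoch) {
    if (nowEpoch >= expiresAtEpoch) return false;  // expired: force a re-quote
    std::string expected = signQuote(payload, secret);
    return expected.size() == presentedMac.size() &&
           CRYPTO_memcmp(expected.data(), presentedMac.data(),
                         expected.size()) == 0;
}
```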
### 3) How do we handle cross-region consistency for riders traveling between markets?
A rider who uses Uber in NYC and LA shouldn't see phantom surge because cell definitions differ or regions miscoordinate.
#### Good Solution — Region-local surge only
- Approach: Each region computes surge independently; the rider sees whatever region they're in.
- Challenges: For intercity rides (e.g., NYC to NJ), surge changes at the regional boundary, which can look weird.
#### Great Solution — Region-local with trip-sticky quote

- Approach:
  - Each region is independent; the rider's quote locks to the region where the trip began.
  - For trips that cross regions, the quote remains fixed for the duration.
  - Regional handoff happens automatically at the Pricing Service via a region-id encoded in the quote token.
- Challenges: Regional H3 boundaries need to be pre-registered; new markets require a provisioning step.
### 4) How do we avoid gaming the signal?
Bad actors inflate demand signals to trigger surge in empty zones, luring drivers there.
#### Bad Solution — Count any app open as demand
- Approach: Trust every client event as a demand signal.
- Challenges: Over-aggressive marketing push notifications inflate demand; bad actors scripting app opens on emulators spike surge; drivers flock to empty zones.
#### Good Solution — Dwell-time filter
- Approach: Only count requests that made it past a minimum dwell time (e.g., fare estimate shown for > 3s).
- Challenges: Filters casual opens but scripting to mimic dwell is easy.
#### Great Solution — Weighted multi-signal model with caps
- Approach:
  - Features: requests, fare-estimate views, search-without-book, cancellation rate, recent completion rate, device reputation.
  - Logistic regression or GBDT scores "true demand propensity" per event.
  - Rate-limit per-device contribution: a single device can contribute at most 1 demand signal per 60s window per cell (see the sketch after this list).
  - Cap the instantaneous multiplier change rate to ±0.2 per 30s window to prevent oscillation and make exploits less profitable.
  - Flag cells exceeding anomaly thresholds for offline review.
- Challenges: Requires a shared feature store with device reputation; adds deployment complexity. Reviewer bandwidth is limited — tune thresholds to avoid alert fatigue.
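A minimal sketch of the per-device contribution cap. The `DeviceCap` struct and its in-memory map are illustrative; production would use a TTL'd shared store so the map cannot grow without bound:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// At most one demand signal per device per cell per 60s window.
struct DeviceCap {
    int64_t windowSec = 60;
    // "deviceId|cellId" -> epoch seconds of the last accepted signal.
    // Illustrative only: entries need eviction (e.g., Redis key with TTL).
    std::unordered_map<std::string, int64_t> lastAccepted;

    bool accept(const std::string& deviceId, const std::string& cellId,
                int64_t nowEpoch) {
        std::string key = deviceId + "|" + cellId;
        auto it = lastAccepted.find(key);
        if (it != lastAccepted.end() && nowEpoch - it->second < windowSec)
            return false;  // already contributed in this window
        lastAccepted[key] = nowEpoch;
        return true;
    }
};
```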
### 5) How do we propagate updates to drivers' heatmaps without smashing the backend?
Drivers watch a heatmap tile set that refreshes every few seconds.
#### Bad Solution — Direct queries per driver
- Approach: Each driver's app queries the backend for tiles at 0.5 Hz.
- Challenges: 6M drivers × 0.5 Hz = 3M QPS on the heatmap endpoint. Not scalable.
#### Good Solution — CDN-cached static tiles
- Approach: Generate tiles at H3 res 7–8 every 30s and serve from a CDN with 30s cache.
- Challenges: Works, but cache invalidation lag means drivers see slightly stale data.
#### Great Solution — Pre-generated tiles + WebSocket delta push for hot cells
- Approach:
  - Snapshot tiles are generated every 30s per H3 res-7 region and pushed to the CDN.
  - For hot cells (multiplier changing > 0.3), push deltas over the existing driver WebSocket. The driver app merges deltas into the cached snapshot.
  - Cold cells keep the 30s CDN cycle.
- Challenges: The delta channel can be noisy in surge storms; throttle to at most 1 delta per cell per 10s. Also requires a client-side merge engine (sketched below).
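A sketch of the client-side merge engine, assuming the app holds the last snapshot in memory. Names are illustrative, and the per-cell throttle could equally live server-side:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Driver app keeps the last CDN snapshot and overlays WebSocket deltas,
// dropping any delta that arrives within 10s of the previous one per cell.
struct HeatmapCache {
    std::unordered_map<std::string, double> snapshot;    // cellId -> multiplier
    std::unordered_map<std::string, int64_t> lastDelta;  // cellId -> epoch secs

    void loadSnapshot(std::unordered_map<std::string, double> tiles) {
        snapshot = std::move(tiles);
        lastDelta.clear();  // a fresh snapshot supersedes pending deltas
    }

    // Returns false when the delta is dropped by the 10s per-cell throttle.
    bool applyDelta(const std::string& cellId, double multiplier, int64_t now) {
        auto it = lastDelta.find(cellId);
        if (it != lastDelta.end() && now - it->second < 10) return false;
        snapshot[cellId] = multiplier;
        lastDelta[cellId] = now;
        return true;
    }
};
```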
### 6) How do we handle cold starts when a new market launches?
A freshly launched city has little data; the formula can't compute reliably.
#### Good Solution — Default 1.0x for new cells
- Approach: Until a cell has 100+ events, multiplier stays at 1.0x.
- Challenges: Misses real demand spikes in genuinely busy new markets.
#### Great Solution — Tiered thresholds with bootstrap from similar markets
- Approach:
  - Bootstrap the new city's supply/demand curve from a similar market (by population density + market type).
  - Gradually shift weight from bootstrap to live data as the sample count grows (see the sketch below).
  - Apply a confidence interval to the multiplier; only activate surge when confidence is high.
- Challenges: Choosing the bootstrap market is an ML problem in itself; keep a simple 1.0x fallback.
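One way to realize the bootstrap-to-live handoff is a confidence weight that grows with sample count. A sketch — the `halfLife` parameter and functional form are assumptions, not a documented Uber formula:

```cpp
#include <algorithm>

// Weight shifts from the donor market's multiplier to the live estimate as
// the cell accumulates samples; halfLife is the sample count at which the
// live data earns 50% of the weight.
double blendedMultiplier(double bootstrapMult, double liveMult,
                         long sampleCount, double halfLife = 100.0) {
    double wLive = static_cast<double>(sampleCount) / (sampleCount + halfLife);
    double m = wLive * liveMult + (1.0 - wLive) * bootstrapMult;
    return std::clamp(m, 1.0, 5.0);  // same global caps as the main formula
}
```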
### 7) How do we keep the control loop from oscillating?
High surge → drivers reposition → supply rises → surge drops → drivers leave → surge rises again. Classic feedback-loop instability.
#### Good Solution — EMA smoothing
- Approach: Smooth the multiplier with an exponentially weighted moving average over the last 3 windows.
- Challenges: Lags real changes by ~60s, which may be acceptable for stability.
#### Great Solution — Rate-limited updates with hysteresis

- Approach: Apply asymmetric step limits — surge may rise at most 0.2 per 30s window but fall up to 0.3, so it decays faster than it climbs. The hysteresis prevents drivers from thrashing. Combine with a supply-in-flight term that counts drivers en route to the hot cell as future supply (see the sketch below).
- Challenges: Tuning is market-specific; instrument with per-city dashboards and allow per-market overrides.
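A sketch of the asymmetric step limit with the supply-in-flight term; the function name and signature are illustrative:

```cpp
#include <algorithm>
#include <cmath>

// Target comes from the usual ratio formula (with en-route drivers counted
// as future supply), but the published value may rise at most 0.2 and fall
// at most 0.3 per 30s window.
double nextMultiplier(double current, int demand, int supply, int enRoute) {
    int effectiveSupply = supply + enRoute;  // supply-in-flight term
    double raw = (effectiveSupply == 0)
                     ? 5.0
                     : std::sqrt(static_cast<double>(demand) / effectiveSupply);
    double target = std::clamp(raw, 1.0, 5.0);
    double stepped = std::clamp(target, current - 0.3, current + 0.2);
    return std::clamp(stepped, 1.0, 5.0);  // never leave the global band
}
```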
## Rapid-Fire Q&A Anticipations
- "What's the minimum data point count before you compute a multiplier?" 100 signals in a 10-minute window; below that, default to 1.0x.
- "What if demand is 0 but a lot of app opens?" App-open alone shouldn't drive surge; only filtered, dwell-time-qualified events count.
- "What's the feedback time from a multiplier change to observed supply response?" Drivers reposition in ~5–15 min; the system should expect this lag and avoid over-reacting in the interim.
- "How do you test a new surge formula offline?" Historical Kafka replay — run the new job against last week's stream; compare multiplier trajectories and implied trip conversions.
- "How do you prevent surge from causing negative rider experience?" Cap at 5x, show clear disclosure in the app, offer "notify me when surge drops" feature.
## Alternatives Considered
- Flink vs Spark Streaming: Flink has true streaming semantics with low-latency windows; Spark is micro-batch. Surge wants seconds-level freshness → Flink.
- Flink vs Samza: Both exist at Uber. Samza integrates tightly with Kafka but has a smaller ecosystem. Flink is the net-new choice.
- Pricing live reads from Flink state vs Redis: You can serve queries directly from Flink queryable state, but Redis gives simpler SRE story and better isolation from the streaming job.
- Per-cell actor (Akka, Orleans) vs Flink aggregation: Actor-per-cell is an alternative architecture; Uber chose streaming aggregation for easier reasoning about windows and backpressure.
- No caps (pure market-based) vs capped multiplier: Pure market pricing gives tighter supply-demand matching but terrible optics and a steady supply of PR horror stories. A 5x cap is the pragmatic choice.
## Frequently Asked Follow-ups

- "What if Flink is down?" — Graceful degradation: Pricing Service serves the last-known multiplier from Redis with a `stale=true` flag. If staleness exceeds 5 minutes, default to 1.0x and page on-call.
- "How does the driver app know surge changed?" — Heatmap snapshot every 30s; hot-cell deltas pushed via WebSocket. Drivers never poll hard.
- "What if two regions disagree?" — Each region is independent; the rider is served locally. No cross-region coordination on the hot path.
- "Does surge apply to Pool / shared rides?" — Separate multiplier calculation because demand/supply dynamics differ. Same system, different partition.
- "How do you A/B-test a new surge formula?" — Shadow-compute new formula on the same stream; compare with the live one. Gate rollout by region on a feature flag.
## Visual Aids to Draw
When sketching on the whiteboard, lean into these diagrams:
- Hexagonal grid with current multipliers color-coded (green 1.0x, yellow 1.5x, red 3x+). Shows H3 tiling.
- Pipeline diagram with boxes for Event API → Kafka → Flink → Redis/Cassandra. Label Kafka partitions and Flink parallelism.
- Sequence diagram for LockQuote showing token issuance and validation.
- Feedback loop diagram illustrating the oscillation problem: supply↑ → surge↓ → supply leaves → surge↑.
- Timeline showing lag during a spike: demand jump at t0, first multiplier update at t+10s, full reaction at t+30s.
## What's Expected at Each Level

### Mid-level (L4)
- Streaming pipeline with Kafka + a windowed aggregator.
- Redis for hot state. Understands eventual consistency.
- Misses the price-lock guarantee and the anti-gaming angle.
### Senior (L5 / L5A)
- Explicit price-lock window with quote tokens or trip-record persistence.
- Backpressure strategy with quantified tradeoffs.
- Discusses the feedback loop (high surge → driver repositioning → supply up → surge down) and oscillation mitigation.
- Back-of-envelope on Kafka throughput and Redis sizing.
### Staff+ (L6)
- Multi-region independence per market, with region-local Flink clusters and no cross-region coordination on the hot path.
- Anti-gaming measures with quantitative tradeoffs and feedback loops.
- Cross-functional callouts to pricing, earnings comms, and legal (surge transparency regulations in some jurisdictions).
- Degradation strategy: if Surge Compute is down, fall back to 1.0x; if Redis is partitioned, serve the last-known multiplier from a local cache with `staleness_ms` in the response.
## Common Pitfalls
- Conflating surge with dynamic pricing in general. Surge is the multiplier on top of a base fare. Don't redesign the whole pricing engine.
- Ignoring the price-lock window. This is the most common Uber follow-up. Have an answer ready.
- Unbounded multiplier. Surge caps at ~5.0 in production. Know why (rider perception + adversarial control).
- Hot-cell blowup. Not all H3 cells are equal; airports blow up. Plan for cell-splitting or pre-aggregation.
- Forgetting the feedback loop. High surge → drivers relocate → supply rises → surge falls. If you don't mention this, expect a probe about oscillation.
## Walkthrough: Interview Dialogue Example
Interviewer: "Walk me through one rider's journey from request to capture, showing surge integration."
You should answer:
- Rider opens the app. The app sends `GET /v1/surge?cellId=...` and shows the rider "1.4x surge in your area".
- Rider taps Request. The app calls `POST /v1/surge/quote` to lock a quote. The server returns a 120s signed token with multiplier 1.4.
- The app includes the token in `POST /v1/trips`. Pricing Service verifies the signature + TTL.
- The trip proceeds. The trip record persists `surge_multiplier_applied = 1.4`.
- The trip completes. Payments `Capture` reads the multiplier from the trip record (not live surge) and posts the final ledger entries.
- A nightly reconciliation job verifies Cassandra surge-audit entries against the trip ledger.
This guarantees no silent multiplier drift from quote to capture.
## What If They Pivot Mid-Interview?
- "Apply surge to Uber Eats." — Same infra: restaurants replace drivers as "supply", orders replace trip requests. Tune per-vertical because dynamics differ.
- "Show surge as a prediction (not reactive)." — Add a prediction model (e.g., Prophet or LSTM) that forecasts demand 10–30 min ahead. Drivers see "expected surge" as a lead indicator.
- "Dynamic pricing for non-ride verticals (e.g., freight)." — Same streaming pipeline with different aggregation window and formula. Longer windows (minutes) match freight dynamics.
- "Is surge legal everywhere?" — No. Some jurisdictions cap it during emergencies (storms, natural disasters). Have a
regulatory_capfield per region that overrides compute.
## Reliability and Observability
- SLO: 99.99% read availability (multiplier lookups), 30s p99 demand-to-price.
- Failure modes:
  - Flink TaskManager loss → the JobManager reassigns the failed subtasks; with at-least-once semantics, a few duplicate events rarely skew surge because a handful of extra counts in a 10s window barely moves the demand/supply ratio.
  - Kafka broker loss → replication factor 3 covers it; acks=1 on the hot path risks losing a few in-flight events on leader failover, which is acceptable for surge signals.
  - Redis primary loss → replica promotion; during the blip, Pricing Service serves the last-known multiplier with an explicit `staleness_ms` field.
- Deployment: Canary new multiplier formulas at 1% traffic; evaluate accuracy against shadow stream before full rollout.
- Monitoring: Per-cell `lag_seconds`, `multiplier_current`, `multiplier_rate_of_change`. Alert when any cell exceeds a 2x change in 60s (oscillation signal).
- Runbook: If Flink lag > 60s in a city, switch that region to degraded mode (serve 1.0x) and page the surge on-call.
## Uber-Specific Notes
- Uber's surge engine has run on both Samza (older) and Flink (modern). Mention either; emphasize Flink for net-new designs.
- H3 resolution 7 (~5 km² avg) is roughly Uber's city-grid granularity; res 8–9 for denser zones.
- Cadence orchestrates long-running earnings/incentive workflows (e.g., "driver who completes 20 trips in a hot zone gets a $50 bonus"), not the hot surge path.
- Memorize one real anchor: NYE in NYC can hit 5× normal demand within minutes; surge caps at 5.0x to avoid adversarial perception.
- M3 tracks surge cell-level time series; Jaeger samples quote paths for debugging mismatches between quoted and captured multipliers.
- For a stretch Staff+ answer, mention that a small ML model (gradient-boosted tree) augments the ratio-based formula with features like time of day, weather, and event calendar — trained offline, refreshed weekly, feature-served via a low-latency feature store.
## Scaling Milestones
- 1 city, manual multiplier: Ops team manually sets multipliers via an admin UI; daily review. Hardcoded 1.0x most of the time.
- 10 cities, batch compute: Nightly MapReduce computes yesterday's supply/demand trends; multiplier updated every hour.
- 100 cities, near real-time: Kafka + Flink with 1-minute windows; Redis for reads; price-lock via stored trip field.
- Global: Signed quote tokens; anti-gaming ML; adaptive backpressure; multi-region independence; oscillation caps.
The journey shows why streaming and signed tokens are essential at scale, not premature optimizations.
## Summary Checklist
- [ ] Scope + out-of-scope called out (pricing, promo are separate).
- [ ] Sub-30s freshness SLO.
- [ ] Kafka → Flink → Redis pipeline with partition keys.
- [ ] H3 granularity (res 7–9 per zone).
- [ ] Price-lock via signed tokens with short TTL.
- [ ] Anti-gaming multi-signal model.
- [ ] Oscillation mitigation (rate limits, hysteresis).
- [ ] Failure story: Flink lag, Redis failover, degraded mode.
- [ ] Multi-region independence.
## Key Numbers to Memorize
| Metric | Value |
|---|---|
| Signal events/sec | 500K |
| Hot-state cells (global) | 10M |
| Hot-state footprint | 1.3 GB |
| Read QPS (multiplier lookups) | 50K |
| Update latency p99 | < 30s |
| Read p99 | < 50ms |
| Price-lock window | 120s |
| Surge cap | 5.0x |
| Rate-of-change cap | +0.2 / −0.3 per 30s |
## One-Liner You Should Remember
"Kafka → Flink with 10s tumbling windows per H3 cell, Redis hot state for 50K reads/sec, signed quote tokens for price-lock integrity, 5x cap to avoid optical chaos. 500K signal events/sec, 30s demand-to-price p99, no cross-region coordination."