
08 - HLD & System Design (Food Delivery) ​

Cross-Reference ​

For foundational system design concepts (CAP theorem, SQL vs NoSQL, indexing, sharding, caching, message queues, load balancing, ACID), see paytm-prep/notes/04-hld-system-design.md. This file focuses on food-delivery-specific system designs relevant to Temple (ex-Zomato founding team).


Quick Reference (scan in 5 min) ​

| System | Key Components | Key Patterns | Scale Challenges |
| --- | --- | --- | --- |
| Notification System | Kafka, Orchestrator, Channel Workers (push/SMS/email), Retry Queue, DLQ | Fan-out per channel, exponential backoff, per-user rate limiting | Millions of concurrent sends during promos, priority ordering, delivery guarantees |
| Real-Time Updates | WebSocket Gateway, Redis Pub/Sub, Location Service, GPS ingestion pipeline | Geohash bucketing, sticky sessions, heartbeat + reconnect + polling fallback | High-frequency GPS writes, fan-out to watchers, horizontal WebSocket scaling |
| Food Delivery (Full) | User/Restaurant/Order/Delivery/Payment/Search/Notification services, API Gateway | Database-per-service, Saga for orders, CQRS for search, queue-based load leveling | Peak hour auto-scaling, geo-partitioned orders, ETA estimation accuracy |

Design 1: Notification System at Scale ​

Requirements ​

  • Multi-channel: push notifications, SMS, and email
  • Volume: millions of users; promotional blasts during peak hours (lunch/dinner)
  • Priority handling: order updates (high) vs promotional (low)
  • Reliability: retry on transient failure, dead-letter for permanent failure
  • User respect: rate limiting to prevent notification fatigue

Architecture ​

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Order Serviceβ”‚   β”‚ Promo Serviceβ”‚   β”‚Delivery Svc  β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                  β”‚                   β”‚
       β–Ό                  β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Kafka Cluster                      β”‚
β”‚  (partitioned by user_id for per-user ordering)      β”‚
β”‚                                                      β”‚
β”‚  Topics:  notification.high  β”‚  notification.low     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           Notification Orchestrator                  β”‚
β”‚                                                      β”‚
β”‚  1. Read event from Kafka                            β”‚
β”‚  2. Resolve user preferences (opt-ins, channels)     β”‚
β”‚  3. Check rate limit (Redis counter per user)        β”‚
β”‚  4. Determine channels + priority                    β”‚
β”‚  5. Dispatch to channel-specific queues              β”‚
β””β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β”‚              β”‚              β”‚
    β–Ό              β–Ό              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Push   β”‚   β”‚   SMS   β”‚   β”‚  Email   β”‚
β”‚ Workers  β”‚   β”‚ Workers β”‚   β”‚ Workers  β”‚
β”‚(FCM/APNs)β”‚   β”‚(Twilio) β”‚   β”‚ (SES/SG) β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
     β”‚             β”‚              β”‚
     β”‚        on failure          β”‚
     β–Ό             β–Ό              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Retry Queue (Kafka)                     β”‚
β”‚         Exponential backoff: 1s β†’ 2s β†’ 4s β†’ 8s      β”‚
β”‚         Max retries: 3 (push), 2 (SMS), 3 (email)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚ after max retries
                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          Dead Letter Queue (DLQ)                     β”‚
β”‚   Permanent failures logged for manual review        β”‚
β”‚   Alert on DLQ depth > threshold                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
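The retry schedule in the diagram can be sketched as a small helper. The jitter term is an assumption (not stated above) that spreads retries out so a provider outage does not produce synchronized retry storms:

```ts
// Backoff delay for attempt n (1-based): 1s, 2s, 4s, 8s (capped), with
// "equal jitter": half the delay is fixed, half is random.
function retryDelayMs(attempt: number, baseMs = 1000, capMs = 8000): number {
  const exp = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return exp / 2 + Math.random() * (exp / 2);
}
```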

Notification Log (Audit Database) ​

Every notification attempt is logged for audit, debugging, and analytics.

Table: notification_log
─────────────────────────────────────────────────
id            UUID  PK
user_id       UUID  FK β†’ users       (indexed)
event_type    VARCHAR(50)            -- 'order_update', 'promo', 'delivery_status'
channel       VARCHAR(20)            -- 'push', 'sms', 'email'
priority      VARCHAR(10)            -- 'high', 'low'
status        VARCHAR(20)            -- 'sent', 'failed', 'dlq'
payload       JSONB                  -- full notification content
attempt_count INT
created_at    TIMESTAMP              (indexed)
updated_at    TIMESTAMP

Key Design Decisions ​

Why Kafka (not RabbitMQ)?

  • Partitioning by user_id guarantees per-user message ordering. A user always sees "order confirmed" before "order picked up."
  • High throughput for promotional blasts (millions of messages during dinner push).
  • Consumer groups allow independent scaling of the orchestrator.
  • Replay capability: if a bug in the orchestrator misprocesses events, rewind the offset and reprocess.

Why separate worker pools per channel?

  • Different latency SLAs: push is expected in < 1 second, email can tolerate 30 seconds.
  • Different failure modes: FCM may rate-limit you, Twilio may have regional outages, SES has sending quotas.
  • Independent scaling: during a promo blast, email workers scale 10x while push workers stay steady.
  • Isolating failures: an SMS provider outage does not back-pressure push delivery.

Why separate Kafka topics for priority?

  • High-priority consumers (order updates) get dedicated resources and are never starved by a promo flood.
  • Low-priority consumers can be throttled or paused during peak order load.

Rate Limiting per User ​

```ts
// Redis-based sliding window rate limiter per user per channel
import Redis from "ioredis"; // assumes an ioredis client

const redis = new Redis();

const HOURLY_LIMITS: Record<string, number> = {
  push: 10,  // max 10 push notifications per hour
  sms: 3,    // max 3 SMS per hour (cost + annoyance)
  email: 5,  // max 5 emails per hour
};

async function canSendNotification(
  userId: string,
  channel: "push" | "sms" | "email"
): Promise<boolean> {
  const key = `ratelimit:notif:${channel}:${userId}`;
  const now = Date.now();
  const windowMs = 3600_000; // 1-hour window

  // Redis sorted set: score = timestamp, member = unique event id
  const pipeline = redis.pipeline();
  pipeline.zremrangebyscore(key, 0, now - windowMs); // prune entries outside the window
  pipeline.zcard(key);                               // count sends in the window
  const results = await pipeline.exec();

  const count = results![1][1] as number;
  if (count >= HOURLY_LIMITS[channel]) return false;

  // Record this send so subsequent checks count it against the window
  await redis
    .pipeline()
    .zadd(key, now, `${now}:${Math.random()}`)
    .pexpire(key, windowMs) // idle keys expire on their own
    .exec();
  return true;
}
```

Design 2: Real-Time Updates System ​

Use Case ​

Live order tracking on the customer app: the map shows the delivery driver's location updating every few seconds, just like Zomato live tracking. Also used for:

  • "Your order is being prepared" status updates
  • Estimated time of arrival countdown
  • Driver en-route path visualization

Architecture ​

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Driver App  β”‚          β”‚     Customer App         β”‚
β”‚ (GPS every  β”‚          β”‚  (shows live map)        β”‚
β”‚  3-5 sec)   β”‚          β”‚                          β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–²β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                            β”‚
       β”‚  HTTP POST /location       β”‚  WebSocket (wss://)
       β”‚                            β”‚
       β–Ό                            β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ API Gateway  β”‚          β”‚  WebSocket Gateway      β”‚
β”‚ (auth, rate  β”‚          β”‚  (sticky sessions via   β”‚
β”‚  limit)      β”‚          β”‚   IP hash or conn ID)   β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β–²β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                            β”‚
       β–Ό                            β”‚  subscribe to
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Location    │──write──▢│  Redis                 β”‚
β”‚  Service     β”‚          β”‚                        β”‚
β”‚              │──publish▢│  Pub/Sub channels:     β”‚
β”‚              β”‚          β”‚  location:{order_id}   β”‚
β”‚              β”‚          β”‚                        β”‚
β”‚              β”‚          β”‚  Key-value store:      β”‚
β”‚              β”‚          β”‚  driver:{driver_id} β†’  β”‚
β”‚              β”‚          β”‚  {lat, lng, ts,        β”‚
β”‚              β”‚          β”‚   heading}, TTL: 60s   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Fan-out flow:
1. Driver app POSTs GPS coordinates every 3-5 seconds
2. Location Service writes to Redis (key: driver:{id}, TTL 60s)
3. Location Service publishes to Redis Pub/Sub channel location:{order_id}
4. WebSocket Gateway instances subscribe to relevant channels
5. Gateway pushes update to connected customer via WebSocket
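Steps 2-3 can be sketched as a single ingestion function. `RedisLike` is a hypothetical minimal interface standing in for a real client such as ioredis (whose `set(key, value, "EX", ttl)` and `publish` calls match these signatures):

```ts
// Minimal interface for the two Redis operations used below (hypothetical;
// in practice an ioredis client satisfies these calls).
interface RedisLike {
  set(key: string, value: string, mode: "EX", ttlSec: number): Promise<unknown>;
  publish(channel: string, message: string): Promise<number>;
}

interface GpsPing {
  driverId: string;
  orderId: string;
  lat: number;
  lng: number;
  heading: number;
  ts: number;
}

// Step 2: overwrite the latest position with a 60s TTL so stale drivers expire.
// Step 3: publish to the order's channel for WebSocket fan-out.
async function ingestLocation(redis: RedisLike, ping: GpsPing): Promise<void> {
  const payload = JSON.stringify(ping);
  await redis.set(`driver:${ping.driverId}`, payload, "EX", 60);
  await redis.publish(`location:${ping.orderId}`, payload);
}
```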

Scaling WebSockets Across Multiple Instances ​

The challenge: a customer's WebSocket connects to Instance A, but the location update arrives at Instance B.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  WS Gateway  β”‚    β”‚  WS Gateway  β”‚    β”‚  WS Gateway  β”‚
β”‚  Instance A  β”‚    β”‚  Instance B  β”‚    β”‚  Instance C  β”‚
β”‚  (1000 conns)β”‚    β”‚  (1200 conns)β”‚    β”‚  (950 conns) β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                   β”‚                   β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”
                    β”‚  Redis       β”‚
                    β”‚  Pub/Sub     β”‚
                    β”‚              β”‚
                     β”‚ Each instanceβ”‚
                     β”‚ subscribes toβ”‚
                     β”‚ the channels β”‚
                     β”‚ of its own   β”‚
                     β”‚ connections  β”‚
                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Solution: each WS Gateway instance subscribes to the Redis Pub/Sub
channels for the orders whose customers are connected to it. When a
location update is published, every subscribing instance receives it
and sends a WebSocket frame on the connections it holds; an instance
with no watcher for that order never sees the message.

Connection Management ​

```ts
// Server-side WebSocket connection lifecycle (assumes the Node `ws` package)
interface TrackingConnection {
  orderId: string;
  userId: string;
  ws: WebSocket;
  lastHeartbeat: number;
}

const connections = new Map<string, TrackingConnection>();

// True if any remaining connection on this instance still watches the order
function hasOtherSubscribers(orderId: string): boolean {
  for (const conn of connections.values()) {
    if (conn.orderId === orderId) return true;
  }
  return false;
}

function handleConnection(ws: WebSocket, orderId: string, userId: string) {
  const conn: TrackingConnection = {
    orderId,
    userId,
    ws,
    lastHeartbeat: Date.now(),
  };
  connections.set(userId, conn);

  // Subscribe this instance to the order's location channel
  redisSubscriber.subscribe(`location:${orderId}`);

  // Heartbeat: client must send ping every 30s
  ws.on("message", (msg) => {
    if (msg.toString() === "ping") { // `ws` delivers a Buffer, not a string
      conn.lastHeartbeat = Date.now();
      ws.send("pong");
    }
  });

  ws.on("close", () => {
    connections.delete(userId);
    // Unsubscribe if no other connections care about this order
    if (!hasOtherSubscribers(orderId)) {
      redisSubscriber.unsubscribe(`location:${orderId}`);
    }
  });
}

// Stale connection reaper β€” runs every 60s
setInterval(() => {
  const now = Date.now();
  for (const [userId, conn] of connections) {
    if (now - conn.lastHeartbeat > 90_000) {
      // No heartbeat in 90s β†’ consider the connection dead
      conn.ws.terminate();
      connections.delete(userId);
    }
  }
}, 60_000);
```

Client-Side Reconnection with Fallback ​

```ts
class OrderTracker {
  private ws: WebSocket | null = null;
  private reconnectAttempts = 0;
  private maxReconnectAttempts = 5;
  private pollingInterval: ReturnType<typeof setInterval> | null = null;
  private heartbeatInterval: ReturnType<typeof setInterval> | null = null;

  connect(orderId: string) {
    const url = `wss://tracking.temple.app/ws/orders/${orderId}`;
    this.ws = new WebSocket(url);

    this.ws.onopen = () => {
      this.reconnectAttempts = 0;
      this.stopPolling();
      this.startHeartbeat();
    };

    this.ws.onmessage = (event) => {
      const update = JSON.parse(event.data);
      this.onLocationUpdate(update); // update map marker
    };

    this.ws.onclose = () => {
      if (this.reconnectAttempts < this.maxReconnectAttempts) {
        // Exponential backoff: 1s, 2s, 4s, 8s, 16s
        const delay = Math.pow(2, this.reconnectAttempts) * 1000;
        this.reconnectAttempts++;
        setTimeout(() => this.connect(orderId), delay);
      } else {
        // Fallback to HTTP polling every 5s
        this.startPolling(orderId);
      }
    };
  }

  private startPolling(orderId: string) {
    this.pollingInterval = setInterval(async () => {
      const res = await fetch(`/api/orders/${orderId}/location`);
      const update = await res.json();
      this.onLocationUpdate(update);
    }, 5_000);
  }

  private stopPolling() {
    if (this.pollingInterval) {
      clearInterval(this.pollingInterval);
      this.pollingInterval = null;
    }
  }

  private startHeartbeat() {
    // Clear any previous timer so reconnects don't leak intervals
    if (this.heartbeatInterval) clearInterval(this.heartbeatInterval);
    this.heartbeatInterval = setInterval(() => {
      if (this.ws?.readyState === WebSocket.OPEN) {
        this.ws.send("ping");
      }
    }, 30_000);
  }

  private onLocationUpdate(update: {
    lat: number;
    lng: number;
    heading: number;
    eta: number;
  }) {
    // Render on map β€” implementation depends on map library
  }
}
```

Key Metrics ​

| Metric | Target | Why It Matters |
| --- | --- | --- |
| GPS update frequency | Every 3-5 seconds | Smooth map animation without excessive bandwidth |
| WebSocket message latency | < 200ms end-to-end | User perceives real-time movement |
| Fan-out ratio | 1 driver update β†’ 1-3 watchers | Low for delivery (usually 1 customer + maybe 1 support agent) |
| Connection density per instance | ~10,000 concurrent WS | Memory-bound; each connection holds minimal state |
| Reconnection success rate | > 99% within 3 attempts | Users should rarely fall back to polling |
| Redis Pub/Sub message throughput | ~100K messages/sec | Handles all active deliveries in a city during peak |

Design 3: Food Delivery System (Full) ​

This is the comprehensive end-to-end design. In an interview, you would not draw all of this β€” you would focus on 2-3 services and their interactions. But knowing the full picture lets you zoom into any part confidently.

Microservices Architecture ​

                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                            β”‚   Customer App   β”‚
                            β”‚   (React Native) β”‚
                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                     β”‚
                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                            β”‚   API Gateway    β”‚
                            β”‚  (Kong / Nginx)  β”‚
                            β”‚                  β”‚
                            β”‚  β€’ Auth (JWT)    β”‚
                            β”‚  β€’ Rate limiting β”‚
                            β”‚  β€’ Routing       β”‚
                            β”‚  β€’ Request ID    β”‚
                            β””β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”˜
                               β”‚  β”‚  β”‚  β”‚  β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚  β”‚  └─────────────────────┐
         β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  └─────────┐              β”‚
         β–Ό              β–Ό            β–Ό            β–Ό              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ User Service β”‚ β”‚ Restaurant β”‚ β”‚  Order   β”‚ β”‚ Delivery β”‚ β”‚   Payment    β”‚
β”‚              β”‚ β”‚  Service   β”‚ β”‚ Service  β”‚ β”‚ Service  β”‚ β”‚   Service    β”‚
β”‚ β€’ Register   β”‚ β”‚            β”‚ β”‚          β”‚ β”‚          β”‚ β”‚              β”‚
β”‚ β€’ Login      β”‚ β”‚ β€’ Menu CRUDβ”‚ β”‚ β€’ Place  β”‚ β”‚ β€’ Assign β”‚ β”‚ β€’ Charge     β”‚
β”‚ β€’ Profile    β”‚ β”‚ β€’ Hours    β”‚ β”‚ β€’ Status β”‚ β”‚ β€’ Track  β”‚ β”‚ β€’ Refund     β”‚
β”‚ β€’ Addresses  β”‚ β”‚ β€’ Ratings  β”‚ β”‚ β€’ Cancel β”‚ β”‚ β€’ ETA    β”‚ β”‚ β€’ Wallet     β”‚
β”‚              β”‚ β”‚ β€’ Avail.   β”‚ β”‚ β€’ Historyβ”‚ β”‚ β€’ Route  β”‚ β”‚ β€’ Idempotencyβ”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                β”‚            β”‚            β”‚              β”‚
       β”‚   PostgreSQL   β”‚ PostgreSQL β”‚ PostgreSQL β”‚    Redis     β”‚ PostgreSQL
       β”‚   (users)      β”‚ (menus,    β”‚ (orders)   β”‚  (driver     β”‚ (transactions,
       β”‚                β”‚  restaurants)β”‚           β”‚   locations) β”‚  ledger)
       β”‚                β”‚            β”‚            β”‚              β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚            β”‚            β”‚
                        β–Ό            β–Ό            β–Ό
               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
               β”‚             Kafka Event Bus                  β”‚
               β”‚                                              β”‚
               β”‚  Topics:                                     β”‚
               β”‚  β€’ order.created    β€’ payment.completed      β”‚
               β”‚  β€’ order.confirmed  β€’ delivery.assigned      β”‚
               β”‚  β€’ order.cancelled  β€’ delivery.picked_up     β”‚
               β”‚  β€’ order.delivered  β€’ driver.location        β”‚
               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                  β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β–Ό             β–Ό             β–Ό
            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚ Search     β”‚ β”‚Notificationβ”‚ β”‚  Analytics   β”‚
            β”‚ Service    β”‚ β”‚ Service    β”‚ β”‚  Service     β”‚
            β”‚            β”‚ β”‚            β”‚ β”‚              β”‚
            β”‚ Elastic-   β”‚ β”‚ Push/SMS/  β”‚ β”‚ Clickhouse / β”‚
            β”‚ search     β”‚ β”‚ Email      β”‚ β”‚ Data Lake    β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Database Choices ​

| Service | Database | Why |
| --- | --- | --- |
| User Service | PostgreSQL | ACID for profile/address data, relational joins for preferences |
| Restaurant Service | PostgreSQL + Elasticsearch | PostgreSQL for source-of-truth menu/restaurant data; Elasticsearch synced via Kafka for full-text search + geo queries |
| Order Service | PostgreSQL (partitioned by city) | ACID for order state machine, partitioning isolates city-level failures |
| Delivery Service | Redis + PostgreSQL | Redis for real-time driver locations (key-value with TTL); PostgreSQL for assignment history and driver profiles |
| Payment Service | PostgreSQL | ACID is non-negotiable for money; append-only ledger pattern |
| Search Service | Elasticsearch | Geo-distance queries, fuzzy text matching, faceted filters (cuisine, rating, price) |
| Notification Service | Kafka + PostgreSQL | Kafka for reliable delivery pipeline; PostgreSQL for notification log/audit |
| Analytics | ClickHouse or BigQuery | Columnar storage for fast aggregations across millions of orders |

Key Flow 1: Order Placement ​

Customer App                    Backend Services
─────────────                   ────────────────

1. Search for restaurants
   GET /search?q=biryani&lat=..&lng=..
                                β†’ Search Service (Elasticsearch geo query)
                                ← Restaurant list with menus

2. Add items to cart (client-side state)

3. Place order
   POST /orders
   {restaurantId, items[], addressId, paymentMethod}
                                β†’ Order Service
                                  β€’ Validate items + prices with Restaurant Service
                                  β€’ Calculate total (subtotal + tax + delivery fee)
                                  β€’ Create order record (status: PENDING_PAYMENT)
                                  β€’ Publish: order.created β†’ Kafka

4. Process payment
                                β†’ Payment Service (triggered by order.created)
                                  β€’ Idempotency check (idempotency_key = order_id)
                                  β€’ Charge via payment gateway (Razorpay/Stripe)
                                  β€’ On success: publish payment.completed β†’ Kafka
                                  β€’ On failure: publish payment.failed β†’ Kafka

5. Confirm order
                                β†’ Order Service (triggered by payment.completed)
                                  β€’ Update order status: CONFIRMED
                                  β€’ Publish: order.confirmed β†’ Kafka

6. Notify restaurant
                                β†’ Notification Service (triggered by order.confirmed)
                                  β€’ Push notification to restaurant tablet app
                                  β€’ Restaurant accepts β†’ status: PREPARING

7. Assign delivery driver
                                β†’ Delivery Service (triggered by order.confirmed)
                                  β€’ Run driver assignment algorithm
                                  β€’ Notify driver via push
                                  β€’ Driver accepts β†’ publish: delivery.assigned
                                  β€’ Update order status: DRIVER_ASSIGNED

8. Pickup & Delivery
                                β†’ Delivery Service
                                  β€’ Driver reaches restaurant β†’ PICKED_UP
                                  β€’ GPS tracking starts (see Design 2)
                                  β€’ Driver reaches customer β†’ DELIVERED
                                  β€’ Publish: order.delivered β†’ Kafka

9. Post-delivery
                                β†’ Notification Service
                                  β€’ Send "Rate your order" push to customer
                                β†’ Analytics Service
                                  β€’ Log delivery time, distance, rating
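The idempotency check in step 4 can be sketched as follows. The in-memory map is an illustration only; in production the same effect comes from a unique constraint on an `idempotency_key` column in PostgreSQL:

```ts
// idempotencyKey (here: the order id) β†’ payment id already created for it
const processedPayments = new Map<string, string>();

// Runs `charge` at most once per order, even if order.created events are
// redelivered by Kafka (at-least-once delivery).
async function chargeOnce(
  orderId: string,
  charge: () => Promise<string>
): Promise<string> {
  const existing = processedPayments.get(orderId);
  if (existing) return existing; // duplicate event β†’ prior result, no double charge
  const paymentId = await charge();
  processedPayments.set(orderId, paymentId);
  return paymentId;
}
```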

Key Flow 2: Driver Assignment Algorithm ​

The goal is to find the best available driver when an order is confirmed. "Best" balances proximity, current load, and fairness.

```ts
interface Driver {
  id: string;
  lat: number;
  lng: number;
  activeOrders: number;        // currently carrying 0, 1, or 2 orders
  maxConcurrentOrders: number; // typically 2
  rating: number;              // 1-5 average
  lastAssignedAt: number;      // timestamp β€” for fairness
}

interface Restaurant {
  id: string;
  lat: number;
  lng: number;
}

interface ScoredDriver {
  driver: Driver;
  score: number;
}

function assignDriver(
  restaurant: Restaurant,
  candidateDrivers: Driver[]
): Driver | null {
  // Step 1: Filter β€” only drivers with capacity
  const available = candidateDrivers.filter(
    (d) => d.activeOrders < d.maxConcurrentOrders
  );

  if (available.length === 0) return null;

  // Step 2: Score each driver
  const scored: ScoredDriver[] = available.map((driver) => {
    const distance = haversineDistance(
      driver.lat, driver.lng,
      restaurant.lat, restaurant.lng
    );

    // Weights (tuned based on business metrics)
    const distanceScore = Math.max(0, 1 - distance / 10); // 0-1, closer is better, 10km max
    const loadScore = 1 - driver.activeOrders / driver.maxConcurrentOrders; // prefer less loaded
    const ratingScore = driver.rating / 5; // prefer higher rated

    // Fairness: drivers idle longer since their last assignment score
    // higher. x / (1 + x) rises from 0 toward 1 as idle time grows.
    const minutesIdle = (Date.now() - driver.lastAssignedAt) / 60_000;
    const waitScore = minutesIdle / (1 + minutesIdle);

    const score =
      0.4 * distanceScore + // proximity matters most
      0.25 * loadScore +    // don't overload drivers
      0.15 * ratingScore +  // quality of delivery
      0.2 * waitScore;      // fairness to idle drivers

    return { driver, score };
  });

  // Step 3: Pick the highest score
  scored.sort((a, b) => b.score - a.score);
  return scored[0].driver;
}

function haversineDistance(
  lat1: number, lng1: number,
  lat2: number, lng2: number
): number {
  const R = 6371; // Earth's radius in km
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) *
    Math.sin(dLng / 2) ** 2;
  return R * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
}

function toRad(deg: number): number {
  return (deg * Math.PI) / 180;
}
```

Finding nearby drivers efficiently:

Rather than scoring every driver in the city, use Redis geospatial queries to find candidates within a radius.

```ts
// Store driver locations in Redis using GEOADD (note: longitude first)
await redis.geoadd("drivers:active", driverLng, driverLat, driverId);

// Find drivers within 5km of the restaurant
// (GEORADIUS; on Redis >= 6.2, GEOSEARCH is the preferred equivalent)
const nearbyDriverIds = await redis.georadius(
  "drivers:active",
  restaurantLng,
  restaurantLat,
  5,          // radius
  "km",       // unit
  "ASC",      // sort by distance ascending
  "COUNT", 20 // limit to 20 candidates
);

// Fetch full driver objects, then run the scoring algorithm
```

Scaling Considerations ​

Search Service: CQRS with Elasticsearch ​

PostgreSQL stays the source of truth for restaurant and menu data; changes stream into Elasticsearch (via CDC or Kafka) so search reads scale independently of writes.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    CDC / Kafka     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ PostgreSQL   β”‚ ──────────────────▢│  Elasticsearch    β”‚
β”‚ (source of   β”‚                    β”‚                   β”‚
β”‚  truth)      β”‚                    β”‚  β€’ Geo-distance   β”‚
β”‚              β”‚                    β”‚    queries        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚  β€’ Full-text      β”‚
                                    β”‚    search         β”‚
                                    β”‚  β€’ Faceted filtersβ”‚
                                    β”‚   (cuisine, price,β”‚
                                    β”‚    rating, veg)   β”‚
                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
An example query: "biryani" within 5km, rating β‰₯ 4, currently open.

```ts
const query = {
  bool: {
    must: [
      { match: { menu_items: "biryani" } },
      { range: { rating: { gte: 4.0 } } },
      { term: { is_open: true } },
    ],
    filter: {
      geo_distance: {
        distance: "5km",
        location: { lat: 12.9716, lon: 77.5946 }, // user's location
      },
    },
  },
};

// Sort by a blend of relevance, distance, and rating
const sort = [
  "_score", // text relevance
  {
    _geo_distance: {
      location: { lat: 12.9716, lon: 77.5946 },
      order: "asc",
      unit: "km",
    },
  },
  { rating: { order: "desc" } },
];
```

Order Service: Partitioning by City/Region ​

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Order Service                         β”‚
β”‚                                                         β”‚
β”‚  Routing logic: order.city_id β†’ shard                   β”‚
β”‚                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚ PG Shard 1  β”‚ β”‚ PG Shard 2  β”‚ β”‚ PG Shard 3  β”‚      β”‚
β”‚  β”‚ Delhi NCR   β”‚ β”‚ Mumbai +    β”‚ β”‚ Bangalore + β”‚      β”‚
β”‚  β”‚             β”‚ β”‚ Pune        β”‚ β”‚ Hyderabad   β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                                         β”‚
β”‚  Benefits:                                              β”‚
β”‚  β€’ City-level failure isolation                         β”‚
β”‚  β€’ Independent scaling (Mumbai shard gets more replicas)β”‚
β”‚  β€’ Cross-shard queries rare (users order in one city)   β”‚
β”‚  β€’ Compliance: data stays in region if needed           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why city-based and not hash-based?

  • Users order from restaurants in their city. Cross-city queries are near-zero.
  • City-level isolation means a Mumbai database outage does not affect Bangalore orders.
  • Operational: you can scale up the Mumbai shard independently during IPL cricket season (order spike in Mumbai).

Peak Hour Handling ​

Normal load:  ~1,000 orders/min
Peak (lunch): ~10,000 orders/min (10x spike)
Flash sale:   ~50,000 orders/min (50x spike, temporary)

Strategy:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Queue-Based Load Leveling               β”‚
β”‚                                                      β”‚
β”‚  Spike      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”     Steady      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  traffic ──▢│ Kafka  │──── drain ─────▢│  Order   β”‚β”‚
β”‚  (10K/min)  β”‚ Queue  β”‚    (controlled)  β”‚ Workers  β”‚β”‚
β”‚             β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”‚                                                      β”‚
β”‚  β€’ Kafka absorbs the burst                           β”‚
β”‚  β€’ Workers process at a sustainable rate             β”‚
β”‚  β€’ Auto-scaling adds more workers within 2-3 min     β”‚
β”‚  β€’ Backpressure: if queue depth > threshold,         β”‚
β”‚    show "high demand" message on app                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Auto-Scaling Triggers:
─────────────────────
β€’ Kafka consumer lag > 5,000 messages β†’ scale up order workers
β€’ CPU > 70% on order service pods β†’ scale up pods
β€’ Active WebSocket connections > 8,000/instance β†’ scale up WS gateway
β€’ Scale down after 10 min of low utilization (avoid flapping)
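The triggers above can be sketched as a single decision function polled by the autoscaler. Thresholds come from the list; the metric names and the polling model are assumptions:

```ts
interface Metrics {
  kafkaConsumerLag: number;         // unprocessed messages across order topics
  avgCpuPercent: number;            // averaged over order-service pods
  wsConnectionsPerInstance: number; // active WebSocket connections per gateway
  lowUtilizationMinutes: number;    // consecutive minutes below all thresholds
}

type ScaleDecision = "scale-up" | "scale-down" | "hold";

function decide(m: Metrics): ScaleDecision {
  if (
    m.kafkaConsumerLag > 5_000 ||
    m.avgCpuPercent > 70 ||
    m.wsConnectionsPerInstance > 8_000
  ) {
    return "scale-up";
  }
  // Require 10 min of sustained low load before shrinking (avoids flapping).
  if (m.lowUtilizationMinutes >= 10) return "scale-down";
  return "hold";
}
```

Note the asymmetry: scale-up fires immediately on any one trigger, while scale-down waits for sustained low utilization, which is what prevents flapping.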

ETA Estimation ​

ETA is one of the most visible features. If the app says 30 minutes and food arrives in 50, trust is broken.

ETA = Restaurant Prep Time + Driver Pickup Time + Delivery Travel Time + Buffer

Where:
─────
Restaurant Prep Time:
  β€’ Base: restaurant's average prep time (tracked per restaurant)
  β€’ Adjusted by: current order queue depth at that restaurant
  β€’ Example: base 20 min, 8 orders in queue β†’ 20 + (8 * 2) = 36 min

Driver Pickup Time:
  β€’ Distance from driver to restaurant (Google Maps / OSRM)
  β€’ Adjusted by: current traffic conditions (Google Traffic API)
  β€’ Example: 3km, light traffic β†’ 8 min

Delivery Travel Time:
  β€’ Distance from restaurant to customer
  β€’ Adjusted by: traffic, time of day, historical delivery times on this route
  β€’ Example: 5km, moderate traffic β†’ 15 min

Buffer:
  β€’ Static buffer: +3 min (accounts for parking, stairs, finding address)
  β€’ Dynamic: increased during rain or peak hours

Total ETA: 36 + 8 + 15 + 3 = 62 min
```ts
interface ETAComponents {
  restaurantPrepMin: number;
  driverToRestaurantMin: number;
  restaurantToCustomerMin: number;
  bufferMin: number;
}

async function estimateETA(
  restaurantId: string,
  driverLocation: { lat: number; lng: number },
  customerLocation: { lat: number; lng: number }
): Promise<{ totalMin: number; breakdown: ETAComponents }> {
  // 1. Restaurant prep time
  const restaurant = await restaurantService.get(restaurantId);
  const queueDepth = await orderService.getActiveOrderCount(restaurantId);
  const restaurantPrepMin =
    restaurant.avgPrepTimeMin + queueDepth * 2; // 2 min per queued order

  // 2. Driver to restaurant
  const driverToRestaurant = await mapsService.getETA(
    driverLocation,
    { lat: restaurant.lat, lng: restaurant.lng }
  );
  const driverToRestaurantMin = driverToRestaurant.durationMin;

  // 3. Restaurant to customer
  const restaurantToCustomer = await mapsService.getETA(
    { lat: restaurant.lat, lng: restaurant.lng },
    customerLocation
  );
  const restaurantToCustomerMin = restaurantToCustomer.durationMin;

  // 4. Buffer β€” increases during rain or peak
  const isRaining = await weatherService.isRaining(customerLocation);
  const isPeakHour = isPeak(new Date());
  let bufferMin = 3;
  if (isRaining) bufferMin += 5;
  if (isPeakHour) bufferMin += 3;

  const totalMin =
    restaurantPrepMin +
    driverToRestaurantMin +
    restaurantToCustomerMin +
    bufferMin;

  return {
    totalMin: Math.ceil(totalMin),
    breakdown: {
      restaurantPrepMin,
      driverToRestaurantMin,
      restaurantToCustomerMin,
      bufferMin,
    },
  };
}

function isPeak(now: Date): boolean {
  const hour = now.getHours();
  return (hour >= 12 && hour <= 14) || (hour >= 19 && hour <= 22);
}
```

Improving ETA accuracy over time:

  • Track actual vs predicted ETA for every order.
  • Feed into an ML model (features: restaurant, time of day, weather, traffic, order size).
  • Use the ML model's output as the ETA instead of the formula, once accuracy exceeds the heuristic.

Interview Tips ​

  1. Start with requirements, not architecture. Ask clarifying questions: "How many users? What channels? What's the latency SLA?" This shows maturity.
  2. Draw the happy path first, then failure handling. Interviewers want to see you think about retries, DLQs, idempotency, and circuit breakers.
  3. Justify every database choice. Do not say "I'd use MongoDB because it's fast." Say "I'd use PostgreSQL for orders because state transitions require ACID guarantees."
  4. Know your numbers. Kafka throughput (~1M messages/sec per broker), Redis latency (~1ms), WebSocket connection limits (~10K per instance), Elasticsearch query latency (~10-50ms).
  5. Mention observability. Distributed tracing (Jaeger), metrics (Prometheus + Grafana), centralized logging (ELK). This signals production experience.
  6. For Temple specifically: Their engineering blog and Zomato's tech blog cover exactly these topics. Name-dropping specifics like "Zomato uses Kafka for order events and Redis for driver locations" shows you did your homework.
