
08 - HLD & System Design (Food Delivery) ​

Cross-Reference ​

For foundational system design concepts (CAP theorem, SQL vs NoSQL, indexing, sharding, caching, message queues, load balancing, ACID), see paytm-prep/notes/04-hld-system-design.md. This file focuses on food-delivery-specific system designs relevant to Temple (ex-Zomato founding team).


Quick Reference (scan in 5 min) ​

| System | Key Components | Key Patterns | Scale Challenges |
| --- | --- | --- | --- |
| Notification System | Kafka, Orchestrator, Channel Workers (push/SMS/email), Retry Queue, DLQ | Fan-out per channel, exponential backoff, per-user rate limiting | Millions of concurrent sends during promos, priority ordering, delivery guarantees |
| Real-Time Updates | WebSocket Gateway, Redis Pub/Sub, Location Service, GPS ingestion pipeline | Geohash bucketing, sticky sessions, heartbeat + reconnect + polling fallback | High-frequency GPS writes, fan-out to watchers, horizontal WebSocket scaling |
| Food Delivery (Full) | User/Restaurant/Order/Delivery/Payment/Search/Notification services, API Gateway | Database-per-service, Saga for orders, CQRS for search, queue-based load leveling | Peak hour auto-scaling, geo-partitioned orders, ETA estimation accuracy |

Design 1: Notification System at Scale ​

Requirements ​

  • Multi-channel: push notifications, SMS, and email
  • Volume: millions of users; promotional blasts during peak hours (lunch/dinner)
  • Priority handling: order updates (high) vs promotional (low)
  • Reliability: retry on transient failure, dead-letter for permanent failure
  • User respect: rate limiting to prevent notification fatigue

Architecture ​

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Order Serviceβ”‚   β”‚ Promo Serviceβ”‚   β”‚Delivery Svc  β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                  β”‚                   β”‚
       β–Ό                  β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Kafka Cluster                      β”‚
β”‚  (partitioned by user_id for per-user ordering)      β”‚
β”‚                                                      β”‚
β”‚  Topics:  notification.high  β”‚  notification.low     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           Notification Orchestrator                  β”‚
β”‚                                                      β”‚
β”‚  1. Read event from Kafka                            β”‚
β”‚  2. Resolve user preferences (opt-ins, channels)     β”‚
β”‚  3. Check rate limit (Redis counter per user)        β”‚
β”‚  4. Determine channels + priority                    β”‚
β”‚  5. Dispatch to channel-specific queues              β”‚
β””β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β”‚              β”‚              β”‚
    β–Ό              β–Ό              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Push   β”‚   β”‚   SMS   β”‚   β”‚  Email   β”‚
β”‚ Workers  β”‚   β”‚ Workers β”‚   β”‚ Workers  β”‚
β”‚(FCM/APNs)β”‚   β”‚(Twilio) β”‚   β”‚ (SES/SG) β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
     β”‚             β”‚              β”‚
     β”‚        on failure          β”‚
     β–Ό             β–Ό              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Retry Queue (Kafka)                     β”‚
β”‚         Exponential backoff: 1s β†’ 2s β†’ 4s β†’ 8s      β”‚
β”‚         Max retries: 3 (push), 2 (SMS), 3 (email)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚ after max retries
                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          Dead Letter Queue (DLQ)                     β”‚
β”‚   Permanent failures logged for manual review        β”‚
β”‚   Alert on DLQ depth > threshold                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
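The retry schedule in the diagram can be sketched as a small helper. The jitter term is an assumption (not stated above) that spreads retries out so a provider outage does not produce synchronized retry storms:

```ts
// Backoff delay for attempt n (1-based): 1s, 2s, 4s, 8s (capped), with
// "equal jitter": half the delay is fixed, half is random.
function retryDelayMs(attempt: number, baseMs = 1000, capMs = 8000): number {
  const exp = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return exp / 2 + Math.random() * (exp / 2);
}
```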

Notification Log (Audit Database) ​

Every notification attempt is logged for audit, debugging, and analytics.

Table: notification_log
─────────────────────────────────────────────────
id            UUID  PK
user_id       UUID  FK β†’ users       (indexed)
event_type    VARCHAR(50)            -- 'order_update', 'promo', 'delivery_status'
channel       VARCHAR(20)            -- 'push', 'sms', 'email'
priority      VARCHAR(10)            -- 'high', 'low'
status        VARCHAR(20)            -- 'sent', 'failed', 'dlq'
payload       JSONB                  -- full notification content
attempt_count INT
created_at    TIMESTAMP              (indexed)
updated_at    TIMESTAMP

Key Design Decisions ​

Why Kafka (not RabbitMQ)?

  • Partitioning by user_id guarantees per-user message ordering. A user always sees "order confirmed" before "order picked up."
  • High throughput for promotional blasts (millions of messages during dinner push).
  • Consumer groups allow independent scaling of the orchestrator.
  • Replay capability: if a bug in the orchestrator misprocesses events, rewind the offset and reprocess.

Why separate worker pools per channel?

  • Different latency SLAs: push is expected in < 1 second, email can tolerate 30 seconds.
  • Different failure modes: FCM may rate-limit you, Twilio may have regional outages, SES has sending quotas.
  • Independent scaling: during a promo blast, email workers scale 10x while push workers stay steady.
  • Isolating failures: an SMS provider outage does not back-pressure push delivery.

Why separate Kafka topics for priority?

  • High-priority consumers (order updates) get dedicated resources and are never starved by a promo flood.
  • Low-priority consumers can be throttled or paused during peak order load.

Rate Limiting per User ​

```ts
// Redis-based sliding window rate limiter per user per channel
import Redis from "ioredis"; // assumes an ioredis client

const redis = new Redis();

const HOURLY_LIMITS: Record<string, number> = {
  push: 10,  // max 10 push notifications per hour
  sms: 3,    // max 3 SMS per hour (cost + annoyance)
  email: 5,  // max 5 emails per hour
};

async function canSendNotification(
  userId: string,
  channel: "push" | "sms" | "email"
): Promise<boolean> {
  const key = `ratelimit:notif:${channel}:${userId}`;
  const now = Date.now();
  const windowMs = 3600_000; // 1-hour window

  // Redis sorted set: score = timestamp, member = unique event id
  const pipeline = redis.pipeline();
  pipeline.zremrangebyscore(key, 0, now - windowMs); // prune entries outside the window
  pipeline.zcard(key);                               // count sends in the window
  const results = await pipeline.exec();

  const count = results![1][1] as number;
  if (count >= HOURLY_LIMITS[channel]) return false;

  // Record this send so subsequent checks count it against the window
  await redis
    .pipeline()
    .zadd(key, now, `${now}:${Math.random()}`)
    .pexpire(key, windowMs) // idle keys expire on their own
    .exec();
  return true;
}
```

Design 2: Real-Time Updates System ​

Use Case ​

Live order tracking on the customer app: the map shows the delivery driver's location updating every few seconds, just like Zomato live tracking. Also used for:

  • "Your order is being prepared" status updates
  • Estimated time of arrival countdown
  • Driver en-route path visualization

Architecture ​

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Driver App  β”‚          β”‚     Customer App         β”‚
β”‚ (GPS every  β”‚          β”‚  (shows live map)        β”‚
β”‚  3-5 sec)   β”‚          β”‚                          β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–²β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                            β”‚
       β”‚  HTTP POST /location       β”‚  WebSocket (wss://)
       β”‚                            β”‚
       β–Ό                            β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ API Gateway  β”‚          β”‚  WebSocket Gateway      β”‚
β”‚ (auth, rate  β”‚          β”‚  (sticky sessions via   β”‚
β”‚  limit)      β”‚          β”‚   IP hash or conn ID)   β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β–²β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                            β”‚
       β–Ό                            β”‚  subscribe to
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Location    │──write──▢│  Redis                 β”‚
β”‚  Service     β”‚          β”‚                        β”‚
β”‚              │──publish▢│  Pub/Sub channels:     β”‚
β”‚              β”‚          β”‚  location:{order_id}   β”‚
β”‚              β”‚          β”‚                        β”‚
β”‚              β”‚          β”‚  Key-value store:      β”‚
β”‚              β”‚          β”‚  driver:{driver_id} β†’  β”‚
β”‚              β”‚          β”‚  {lat, lng, ts,        β”‚
β”‚              β”‚          β”‚   heading}, TTL: 60s   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Fan-out flow:
1. Driver app POSTs GPS coordinates every 3-5 seconds
2. Location Service writes to Redis (key: driver:{id}, TTL 60s)
3. Location Service publishes to Redis Pub/Sub channel location:{order_id}
4. WebSocket Gateway instances subscribe to relevant channels
5. Gateway pushes update to connected customer via WebSocket
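Steps 2-3 can be sketched as a single ingestion function. `RedisLike` is a hypothetical minimal interface standing in for a real client such as ioredis (whose `set(key, value, "EX", ttl)` and `publish` calls match these signatures):

```ts
// Minimal interface for the two Redis operations used below (hypothetical;
// in practice an ioredis client satisfies these calls).
interface RedisLike {
  set(key: string, value: string, mode: "EX", ttlSec: number): Promise<unknown>;
  publish(channel: string, message: string): Promise<number>;
}

interface GpsPing {
  driverId: string;
  orderId: string;
  lat: number;
  lng: number;
  heading: number;
  ts: number;
}

// Step 2: overwrite the latest position with a 60s TTL so stale drivers expire.
// Step 3: publish to the order's channel for WebSocket fan-out.
async function ingestLocation(redis: RedisLike, ping: GpsPing): Promise<void> {
  const payload = JSON.stringify(ping);
  await redis.set(`driver:${ping.driverId}`, payload, "EX", 60);
  await redis.publish(`location:${ping.orderId}`, payload);
}
```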

Scaling WebSockets Across Multiple Instances ​

The challenge: a customer's WebSocket connects to Instance A, but the location update arrives at Instance B.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  WS Gateway  β”‚    β”‚  WS Gateway  β”‚    β”‚  WS Gateway  β”‚
β”‚  Instance A  β”‚    β”‚  Instance B  β”‚    β”‚  Instance C  β”‚
β”‚  (1000 conns)β”‚    β”‚  (1200 conns)β”‚    β”‚  (950 conns) β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                   β”‚                   β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”
                    β”‚  Redis       β”‚
                    β”‚  Pub/Sub     β”‚
                    β”‚              β”‚
                     β”‚ Each instanceβ”‚
                     β”‚ subscribes toβ”‚
                     β”‚ the channels β”‚
                     β”‚ of its own   β”‚
                     β”‚ connections  β”‚
                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Solution: each WS Gateway instance subscribes to the Redis Pub/Sub
channels for the orders whose customers are connected to it. When a
location update is published, every subscribing instance receives it
and sends a WebSocket frame on the connections it holds; an instance
with no watcher for that order never sees the message.

Connection Management ​

```ts
// Server-side WebSocket connection lifecycle (assumes the Node `ws` package)
interface TrackingConnection {
  orderId: string;
  userId: string;
  ws: WebSocket;
  lastHeartbeat: number;
}

const connections = new Map<string, TrackingConnection>();

// True if any remaining connection on this instance still watches the order
function hasOtherSubscribers(orderId: string): boolean {
  for (const conn of connections.values()) {
    if (conn.orderId === orderId) return true;
  }
  return false;
}

function handleConnection(ws: WebSocket, orderId: string, userId: string) {
  const conn: TrackingConnection = {
    orderId,
    userId,
    ws,
    lastHeartbeat: Date.now(),
  };
  connections.set(userId, conn);

  // Subscribe this instance to the order's location channel
  redisSubscriber.subscribe(`location:${orderId}`);

  // Heartbeat: client must send ping every 30s
  ws.on("message", (msg) => {
    if (msg.toString() === "ping") { // `ws` delivers a Buffer, not a string
      conn.lastHeartbeat = Date.now();
      ws.send("pong");
    }
  });

  ws.on("close", () => {
    connections.delete(userId);
    // Unsubscribe if no other connections care about this order
    if (!hasOtherSubscribers(orderId)) {
      redisSubscriber.unsubscribe(`location:${orderId}`);
    }
  });
}

// Stale connection reaper β€” runs every 60s
setInterval(() => {
  const now = Date.now();
  for (const [userId, conn] of connections) {
    if (now - conn.lastHeartbeat > 90_000) {
      // No heartbeat in 90s β†’ consider the connection dead
      conn.ws.terminate();
      connections.delete(userId);
    }
  }
}, 60_000);
```

Client-Side Reconnection with Fallback ​

```ts
class OrderTracker {
  private ws: WebSocket | null = null;
  private reconnectAttempts = 0;
  private maxReconnectAttempts = 5;
  private pollingInterval: ReturnType<typeof setInterval> | null = null;
  private heartbeatInterval: ReturnType<typeof setInterval> | null = null;

  connect(orderId: string) {
    const url = `wss://tracking.temple.app/ws/orders/${orderId}`;
    this.ws = new WebSocket(url);

    this.ws.onopen = () => {
      this.reconnectAttempts = 0;
      this.stopPolling();
      this.startHeartbeat();
    };

    this.ws.onmessage = (event) => {
      const update = JSON.parse(event.data);
      this.onLocationUpdate(update); // update map marker
    };

    this.ws.onclose = () => {
      if (this.reconnectAttempts < this.maxReconnectAttempts) {
        // Exponential backoff: 1s, 2s, 4s, 8s, 16s
        const delay = Math.pow(2, this.reconnectAttempts) * 1000;
        this.reconnectAttempts++;
        setTimeout(() => this.connect(orderId), delay);
      } else {
        // Fallback to HTTP polling every 5s
        this.startPolling(orderId);
      }
    };
  }

  private startPolling(orderId: string) {
    this.pollingInterval = setInterval(async () => {
      const res = await fetch(`/api/orders/${orderId}/location`);
      const update = await res.json();
      this.onLocationUpdate(update);
    }, 5_000);
  }

  private stopPolling() {
    if (this.pollingInterval) {
      clearInterval(this.pollingInterval);
      this.pollingInterval = null;
    }
  }

  private startHeartbeat() {
    // Clear any previous timer so reconnects don't leak intervals
    if (this.heartbeatInterval) clearInterval(this.heartbeatInterval);
    this.heartbeatInterval = setInterval(() => {
      if (this.ws?.readyState === WebSocket.OPEN) {
        this.ws.send("ping");
      }
    }, 30_000);
  }

  private onLocationUpdate(update: {
    lat: number;
    lng: number;
    heading: number;
    eta: number;
  }) {
    // Render on map β€” implementation depends on map library
  }
}
```

Key Metrics ​

| Metric | Target | Why It Matters |
| --- | --- | --- |
| GPS update frequency | Every 3-5 seconds | Smooth map animation without excessive bandwidth |
| WebSocket message latency | < 200ms end-to-end | User perceives real-time movement |
| Fan-out ratio | 1 driver update β†’ 1-3 watchers | Low for delivery (usually 1 customer + maybe 1 support agent) |
| Connection density per instance | ~10,000 concurrent WS | Memory-bound; each connection holds minimal state |
| Reconnection success rate | > 99% within 3 attempts | Users should rarely fall back to polling |
| Redis Pub/Sub message throughput | ~100K messages/sec | Handles all active deliveries in a city during peak |

Design 3: Food Delivery System (Full) ​

This is the comprehensive end-to-end design. In an interview, you would not draw all of this β€” you would focus on 2-3 services and their interactions. But knowing the full picture lets you zoom into any part confidently.

Microservices Architecture ​

                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                            β”‚   Customer App   β”‚
                            β”‚   (React Native) β”‚
                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                     β”‚
                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                            β”‚   API Gateway    β”‚
                            β”‚  (Kong / Nginx)  β”‚
                            β”‚                  β”‚
                            β”‚  β€’ Auth (JWT)    β”‚
                            β”‚  β€’ Rate limiting β”‚
                            β”‚  β€’ Routing       β”‚
                            β”‚  β€’ Request ID    β”‚
                            β””β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”¬β”€β”€β”€β”˜
                               β”‚  β”‚  β”‚  β”‚  β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚  β”‚  └─────────────────────┐
         β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  └─────────┐              β”‚
         β–Ό              β–Ό            β–Ό            β–Ό              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ User Service β”‚ β”‚ Restaurant β”‚ β”‚  Order   β”‚ β”‚ Delivery β”‚ β”‚   Payment    β”‚
β”‚              β”‚ β”‚  Service   β”‚ β”‚ Service  β”‚ β”‚ Service  β”‚ β”‚   Service    β”‚
β”‚ β€’ Register   β”‚ β”‚            β”‚ β”‚          β”‚ β”‚          β”‚ β”‚              β”‚
β”‚ β€’ Login      β”‚ β”‚ β€’ Menu CRUDβ”‚ β”‚ β€’ Place  β”‚ β”‚ β€’ Assign β”‚ β”‚ β€’ Charge     β”‚
β”‚ β€’ Profile    β”‚ β”‚ β€’ Hours    β”‚ β”‚ β€’ Status β”‚ β”‚ β€’ Track  β”‚ β”‚ β€’ Refund     β”‚
β”‚ β€’ Addresses  β”‚ β”‚ β€’ Ratings  β”‚ β”‚ β€’ Cancel β”‚ β”‚ β€’ ETA    β”‚ β”‚ β€’ Wallet     β”‚
β”‚              β”‚ β”‚ β€’ Avail.   β”‚ β”‚ β€’ Historyβ”‚ β”‚ β€’ Route  β”‚ β”‚ β€’ Idempotencyβ”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                β”‚            β”‚            β”‚              β”‚
       β”‚   PostgreSQL   β”‚ PostgreSQL β”‚ PostgreSQL β”‚    Redis     β”‚ PostgreSQL
       β”‚   (users)      β”‚ (menus,    β”‚ (orders)   β”‚  (driver     β”‚ (transactions,
       β”‚                β”‚  restaurants)β”‚           β”‚   locations) β”‚  ledger)
       β”‚                β”‚            β”‚            β”‚              β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚            β”‚            β”‚
                        β–Ό            β–Ό            β–Ό
               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
               β”‚             Kafka Event Bus                  β”‚
               β”‚                                              β”‚
               β”‚  Topics:                                     β”‚
               β”‚  β€’ order.created    β€’ payment.completed      β”‚
               β”‚  β€’ order.confirmed  β€’ delivery.assigned      β”‚
               β”‚  β€’ order.cancelled  β€’ delivery.picked_up     β”‚
               β”‚  β€’ order.delivered  β€’ driver.location        β”‚
               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                  β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β–Ό             β–Ό             β–Ό
            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚ Search     β”‚ β”‚Notificationβ”‚ β”‚  Analytics   β”‚
            β”‚ Service    β”‚ β”‚ Service    β”‚ β”‚  Service     β”‚
            β”‚            β”‚ β”‚            β”‚ β”‚              β”‚
            β”‚ Elastic-   β”‚ β”‚ Push/SMS/  β”‚ β”‚ Clickhouse / β”‚
            β”‚ search     β”‚ β”‚ Email      β”‚ β”‚ Data Lake    β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Database Choices ​

| Service | Database | Why |
| --- | --- | --- |
| User Service | PostgreSQL | ACID for profile/address data, relational joins for preferences |
| Restaurant Service | PostgreSQL + Elasticsearch | PostgreSQL for source-of-truth menu/restaurant data; Elasticsearch synced via Kafka for full-text search + geo queries |
| Order Service | PostgreSQL (partitioned by city) | ACID for order state machine, partitioning isolates city-level failures |
| Delivery Service | Redis + PostgreSQL | Redis for real-time driver locations (key-value with TTL); PostgreSQL for assignment history and driver profiles |
| Payment Service | PostgreSQL | ACID is non-negotiable for money; append-only ledger pattern |
| Search Service | Elasticsearch | Geo-distance queries, fuzzy text matching, faceted filters (cuisine, rating, price) |
| Notification Service | Kafka + PostgreSQL | Kafka for reliable delivery pipeline; PostgreSQL for notification log/audit |
| Analytics | ClickHouse or BigQuery | Columnar storage for fast aggregations across millions of orders |

Key Flow 1: Order Placement ​

Customer App                    Backend Services
─────────────                   ────────────────

1. Search for restaurants
   GET /search?q=biryani&lat=..&lng=..
                                β†’ Search Service (Elasticsearch geo query)
                                ← Restaurant list with menus

2. Add items to cart (client-side state)

3. Place order
   POST /orders
   {restaurantId, items[], addressId, paymentMethod}
                                β†’ Order Service
                                  β€’ Validate items + prices with Restaurant Service
                                  β€’ Calculate total (subtotal + tax + delivery fee)
                                  β€’ Create order record (status: PENDING_PAYMENT)
                                  β€’ Publish: order.created β†’ Kafka

4. Process payment
                                β†’ Payment Service (triggered by order.created)
                                  β€’ Idempotency check (idempotency_key = order_id)
                                  β€’ Charge via payment gateway (Razorpay/Stripe)
                                  β€’ On success: publish payment.completed β†’ Kafka
                                  β€’ On failure: publish payment.failed β†’ Kafka

5. Confirm order
                                β†’ Order Service (triggered by payment.completed)
                                  β€’ Update order status: CONFIRMED
                                  β€’ Publish: order.confirmed β†’ Kafka

6. Notify restaurant
                                β†’ Notification Service (triggered by order.confirmed)
                                  β€’ Push notification to restaurant tablet app
                                  β€’ Restaurant accepts β†’ status: PREPARING

7. Assign delivery driver
                                β†’ Delivery Service (triggered by order.confirmed)
                                  β€’ Run driver assignment algorithm
                                  β€’ Notify driver via push
                                  β€’ Driver accepts β†’ publish: delivery.assigned
                                  β€’ Update order status: DRIVER_ASSIGNED

8. Pickup & Delivery
                                β†’ Delivery Service
                                  β€’ Driver reaches restaurant β†’ PICKED_UP
                                  β€’ GPS tracking starts (see Design 2)
                                  β€’ Driver reaches customer β†’ DELIVERED
                                  β€’ Publish: order.delivered β†’ Kafka

9. Post-delivery
                                β†’ Notification Service
                                  β€’ Send "Rate your order" push to customer
                                β†’ Analytics Service
                                  β€’ Log delivery time, distance, rating
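The idempotency check in step 4 can be sketched as follows. The in-memory map is an illustration only; in production the same effect comes from a unique constraint on an `idempotency_key` column in PostgreSQL:

```ts
// idempotencyKey (here: the order id) β†’ payment id already created for it
const processedPayments = new Map<string, string>();

// Runs `charge` at most once per order, even if order.created events are
// redelivered by Kafka (at-least-once delivery).
async function chargeOnce(
  orderId: string,
  charge: () => Promise<string>
): Promise<string> {
  const existing = processedPayments.get(orderId);
  if (existing) return existing; // duplicate event β†’ prior result, no double charge
  const paymentId = await charge();
  processedPayments.set(orderId, paymentId);
  return paymentId;
}
```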

Key Flow 2: Driver Assignment Algorithm ​

The goal is to find the best available driver when an order is confirmed. "Best" balances proximity, current load, and fairness.

```ts
interface Driver {
  id: string;
  lat: number;
  lng: number;
  activeOrders: number;        // currently carrying 0, 1, or 2 orders
  maxConcurrentOrders: number; // typically 2
  rating: number;              // 1-5 average
  lastAssignedAt: number;      // timestamp β€” for fairness
}

interface Restaurant {
  id: string;
  lat: number;
  lng: number;
}

interface ScoredDriver {
  driver: Driver;
  score: number;
}

function assignDriver(
  restaurant: Restaurant,
  candidateDrivers: Driver[]
): Driver | null {
  // Step 1: Filter β€” only drivers with capacity
  const available = candidateDrivers.filter(
    (d) => d.activeOrders < d.maxConcurrentOrders
  );

  if (available.length === 0) return null;

  // Step 2: Score each driver
  const scored: ScoredDriver[] = available.map((driver) => {
    const distance = haversineDistance(
      driver.lat, driver.lng,
      restaurant.lat, restaurant.lng
    );

    // Weights (tuned based on business metrics)
    const distanceScore = Math.max(0, 1 - distance / 10); // 0-1, closer is better, 10km max
    const loadScore = 1 - driver.activeOrders / driver.maxConcurrentOrders; // prefer less loaded
    const ratingScore = driver.rating / 5; // prefer higher rated

    // Fairness: drivers idle longer since their last assignment score
    // higher. x / (1 + x) rises from 0 toward 1 as idle time grows.
    const minutesIdle = (Date.now() - driver.lastAssignedAt) / 60_000;
    const waitScore = minutesIdle / (1 + minutesIdle);

    const score =
      0.4 * distanceScore + // proximity matters most
      0.25 * loadScore +    // don't overload drivers
      0.15 * ratingScore +  // quality of delivery
      0.2 * waitScore;      // fairness to idle drivers

    return { driver, score };
  });

  // Step 3: Pick the highest score
  scored.sort((a, b) => b.score - a.score);
  return scored[0].driver;
}

function haversineDistance(
  lat1: number, lng1: number,
  lat2: number, lng2: number
): number {
  const R = 6371; // Earth's radius in km
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) *
    Math.sin(dLng / 2) ** 2;
  return R * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
}

function toRad(deg: number): number {
  return (deg * Math.PI) / 180;
}
```

Finding nearby drivers efficiently:

Rather than scoring every driver in the city, use Redis geospatial queries to find candidates within a radius.

```ts
// Store driver locations in Redis using GEOADD (note: longitude first)
await redis.geoadd("drivers:active", driverLng, driverLat, driverId);

// Find drivers within 5km of the restaurant
// (GEORADIUS; on Redis >= 6.2, GEOSEARCH is the preferred equivalent)
const nearbyDriverIds = await redis.georadius(
  "drivers:active",
  restaurantLng,
  restaurantLat,
  5,          // radius
  "km",       // unit
  "ASC",      // sort by distance ascending
  "COUNT", 20 // limit to 20 candidates
);

// Fetch full driver objects, then run the scoring algorithm
```

Scaling Considerations ​

Search Service: CQRS with Elasticsearch ​

PostgreSQL stays the source of truth for restaurant and menu data; changes stream into Elasticsearch (via CDC or Kafka) so search reads scale independently of writes.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    CDC / Kafka     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ PostgreSQL   β”‚ ──────────────────▢│  Elasticsearch    β”‚
β”‚ (source of   β”‚                    β”‚                   β”‚
β”‚  truth)      β”‚                    β”‚  β€’ Geo-distance   β”‚
β”‚              β”‚                    β”‚    queries        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚  β€’ Full-text      β”‚
                                    β”‚    search         β”‚
                                    β”‚  β€’ Faceted filtersβ”‚
                                    β”‚   (cuisine, price,β”‚
                                    β”‚    rating, veg)   β”‚
                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
An example query: "biryani" within 5km, rating β‰₯ 4, currently open.

```ts
const query = {
  bool: {
    must: [
      { match: { menu_items: "biryani" } },
      { range: { rating: { gte: 4.0 } } },
      { term: { is_open: true } },
    ],
    filter: {
      geo_distance: {
        distance: "5km",
        location: { lat: 12.9716, lon: 77.5946 }, // user's location
      },
    },
  },
};

// Sort by a blend of relevance, distance, and rating
const sort = [
  "_score", // text relevance
  {
    _geo_distance: {
      location: { lat: 12.9716, lon: 77.5946 },
      order: "asc",
      unit: "km",
    },
  },
  { rating: { order: "desc" } },
];
```

Order Service: Partitioning by City/Region ​

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Order Service                         β”‚
β”‚                                                         β”‚
β”‚  Routing logic: order.city_id β†’ shard                   β”‚
β”‚                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚  β”‚ PG Shard 1  β”‚ β”‚ PG Shard 2  β”‚ β”‚ PG Shard 3  β”‚      β”‚
β”‚  β”‚ Delhi NCR   β”‚ β”‚ Mumbai +    β”‚ β”‚ Bangalore + β”‚      β”‚
β”‚  β”‚             β”‚ β”‚ Pune        β”‚ β”‚ Hyderabad   β”‚      β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                                         β”‚
β”‚  Benefits:                                              β”‚
β”‚  β€’ City-level failure isolation                         β”‚
β”‚  β€’ Independent scaling (Mumbai shard gets more replicas)β”‚
β”‚  β€’ Cross-shard queries rare (users order in one city)   β”‚
β”‚  β€’ Compliance: data stays in region if needed           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why city-based and not hash-based?

  • Users order from restaurants in their city. Cross-city queries are near-zero.
  • City-level isolation means a Mumbai database outage does not affect Bangalore orders.
  • Operational: you can scale up the Mumbai shard independently during IPL cricket season (order spike in Mumbai).

Peak Hour Handling ​

Normal load:  ~1,000 orders/min
Peak (lunch): ~10,000 orders/min (10x spike)
Flash sale:   ~50,000 orders/min (50x spike, temporary)

Strategy:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Queue-Based Load Leveling               β”‚
β”‚                                                      β”‚
β”‚  Spike      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”     Steady      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  traffic ──▢│ Kafka  │──── drain ─────▢│  Order   β”‚β”‚
β”‚  (10K/min)  β”‚ Queue  β”‚    (controlled)  β”‚ Workers  β”‚β”‚
β”‚             β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”‚                                                      β”‚
β”‚  β€’ Kafka absorbs the burst                           β”‚
β”‚  β€’ Workers process at a sustainable rate             β”‚
β”‚  β€’ Auto-scaling adds more workers within 2-3 min     β”‚
β”‚  β€’ Backpressure: if queue depth > threshold,         β”‚
β”‚    show "high demand" message on app                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Auto-Scaling Triggers:
─────────────────────
β€’ Kafka consumer lag > 5,000 messages β†’ scale up order workers
β€’ CPU > 70% on order service pods β†’ scale up pods
β€’ Active WebSocket connections > 8,000/instance β†’ scale up WS gateway
β€’ Scale down after 10 min of low utilization (avoid flapping)
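The triggers above can be sketched as a single decision function polled by the autoscaler. Thresholds come from the list; the metric names and the polling model are assumptions:

```ts
interface Metrics {
  kafkaConsumerLag: number;         // unprocessed messages across order topics
  avgCpuPercent: number;            // averaged over order-service pods
  wsConnectionsPerInstance: number; // active WebSocket connections per gateway
  lowUtilizationMinutes: number;    // consecutive minutes below all thresholds
}

type ScaleDecision = "scale-up" | "scale-down" | "hold";

function decide(m: Metrics): ScaleDecision {
  if (
    m.kafkaConsumerLag > 5_000 ||
    m.avgCpuPercent > 70 ||
    m.wsConnectionsPerInstance > 8_000
  ) {
    return "scale-up";
  }
  // Require 10 min of sustained low load before shrinking (avoids flapping).
  if (m.lowUtilizationMinutes >= 10) return "scale-down";
  return "hold";
}
```

Note the asymmetry: scale-up fires immediately on any one trigger, while scale-down waits for sustained low utilization, which is what prevents flapping.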

ETA Estimation ​

ETA is one of the most visible features. If the app says 30 minutes and food arrives in 50, trust is broken.

ETA = Restaurant Prep Time + Driver Pickup Time + Delivery Travel Time + Buffer

Where:
─────
Restaurant Prep Time:
  β€’ Base: restaurant's average prep time (tracked per restaurant)
  β€’ Adjusted by: current order queue depth at that restaurant
  β€’ Example: base 20 min, 8 orders in queue β†’ 20 + (8 * 2) = 36 min

Driver Pickup Time:
  β€’ Distance from driver to restaurant (Google Maps / OSRM)
  β€’ Adjusted by: current traffic conditions (Google Traffic API)
  β€’ Example: 3km, light traffic β†’ 8 min

Delivery Travel Time:
  β€’ Distance from restaurant to customer
  β€’ Adjusted by: traffic, time of day, historical delivery times on this route
  β€’ Example: 5km, moderate traffic β†’ 15 min

Buffer:
  β€’ Static buffer: +3 min (accounts for parking, stairs, finding address)
  β€’ Dynamic: increased during rain or peak hours

Total ETA: 36 + 8 + 15 + 3 = 62 min
```ts
interface ETAComponents {
  restaurantPrepMin: number;
  driverToRestaurantMin: number;
  restaurantToCustomerMin: number;
  bufferMin: number;
}

async function estimateETA(
  restaurantId: string,
  driverLocation: { lat: number; lng: number },
  customerLocation: { lat: number; lng: number }
): Promise<{ totalMin: number; breakdown: ETAComponents }> {
  // 1. Restaurant prep time
  const restaurant = await restaurantService.get(restaurantId);
  const queueDepth = await orderService.getActiveOrderCount(restaurantId);
  const restaurantPrepMin =
    restaurant.avgPrepTimeMin + queueDepth * 2; // 2 min per queued order

  // 2. Driver to restaurant
  const driverToRestaurant = await mapsService.getETA(
    driverLocation,
    { lat: restaurant.lat, lng: restaurant.lng }
  );
  const driverToRestaurantMin = driverToRestaurant.durationMin;

  // 3. Restaurant to customer
  const restaurantToCustomer = await mapsService.getETA(
    { lat: restaurant.lat, lng: restaurant.lng },
    customerLocation
  );
  const restaurantToCustomerMin = restaurantToCustomer.durationMin;

  // 4. Buffer β€” increases during rain or peak
  const isRaining = await weatherService.isRaining(customerLocation);
  const isPeakHour = isPeak(new Date());
  let bufferMin = 3;
  if (isRaining) bufferMin += 5;
  if (isPeakHour) bufferMin += 3;

  const totalMin =
    restaurantPrepMin +
    driverToRestaurantMin +
    restaurantToCustomerMin +
    bufferMin;

  return {
    totalMin: Math.ceil(totalMin),
    breakdown: {
      restaurantPrepMin,
      driverToRestaurantMin,
      restaurantToCustomerMin,
      bufferMin,
    },
  };
}

function isPeak(now: Date): boolean {
  const hour = now.getHours();
  return (hour >= 12 && hour <= 14) || (hour >= 19 && hour <= 22);
}
```

Improving ETA accuracy over time:

  • Track actual vs predicted ETA for every order.
  • Feed into an ML model (features: restaurant, time of day, weather, traffic, order size).
  • Use the ML model's output as the ETA instead of the formula, once accuracy exceeds the heuristic.

Interview Tips ​

  1. Start with requirements, not architecture. Ask clarifying questions: "How many users? What channels? What's the latency SLA?" This shows maturity.
  2. Draw the happy path first, then failure handling. Interviewers want to see you think about retries, DLQs, idempotency, and circuit breakers.
  3. Justify every database choice. Do not say "I'd use MongoDB because it's fast." Say "I'd use PostgreSQL for orders because state transitions require ACID guarantees."
  4. Know your numbers. Kafka throughput (~1M messages/sec per broker), Redis latency (~1ms), WebSocket connection limits (~10K per instance), Elasticsearch query latency (~10-50ms).
  5. Mention observability. Distributed tracing (Jaeger), metrics (Prometheus + Grafana), centralized logging (ELK). This signals production experience.
  6. For Temple specifically: Their engineering blog and Zomato's tech blog cover exactly these topics. Name-dropping specifics like "Zomato uses Kafka for order events and Redis for driver locations" shows you did your homework.
