
HLD: Vending Machine Leasing Company Platform

Understanding the Problem

What is a Vending Machine Leasing Platform?

A platform that manages a fleet of vending machines deployed at client locations (offices, hospitals, airports, gyms). The company owns the machines, leases them to businesses, and handles inventory management, restocking, maintenance, and payment processing. The engineering challenges are unique: we are dealing with thousands of IoT devices that need real-time monitoring, inventory tracking across distributed physical machines, and route optimization for restocking crews -- all while processing payments at each machine.

Functional Requirements

Core (above the line):

  1. Machine management -- register machines, assign to client locations, track machine status (active/maintenance/offline)
  2. Inventory tracking -- real-time inventory levels per machine per product slot, low-stock alerts
  3. Lease management -- create/manage lease contracts with clients (terms, pricing, revenue share)
  4. Payment processing -- handle purchases at machines (card, mobile pay), track revenue per machine
  5. Monitoring dashboard -- real-time view of fleet health, inventory levels, revenue, and alerts

Below the line (mention but don't design):

  • Customer-facing mobile app (find nearby machines, see product availability)
  • Dynamic pricing based on demand
  • Product recommendation engine
  • Maintenance technician dispatch
  • Accounting and invoicing for lease payments

Non-Functional Requirements

  1. Reliability -- machines must process payments even with intermittent internet connectivity (offline-first)
  2. Real-time monitoring -- machine health and inventory updates within 30 seconds
  3. Scale -- 50,000 machines, each with 30 product slots, reporting telemetry every 30 seconds = ~100,000 telemetry events/minute (~1,700 events/sec)
  4. Data durability -- zero lost transactions; every purchase must be recorded for revenue tracking

The Set Up

Core Entities

Entity | Description
Machine | id, serialNumber, model, locationId, status (active/maintenance/offline), lastHeartbeat, firmwareVersion
Location | id, clientId, address, geoCoordinates, locationType (office/hospital/airport)
Lease | id, clientId, machineId, startDate, endDate, monthlyRate, revenueSharePercent, status (active/expired/terminated)
ProductSlot | machineId, slotNumber, productId, currentQuantity, maxCapacity, price
Transaction | id, machineId, productId, amount, paymentMethod, timestamp, synced (boolean)
TelemetryEvent | machineId, timestamp, type (heartbeat/sale/error/restock), payload (JSON)

API Design

Register a new machine:

POST /api/machines
Authorization: Bearer <admin-token>

Request:
{
  "serialNumber": "VM-2024-A1234",
  "model": "SnackMaster 3000",
  "locationId": "loc_airport_jfk_t4",
  "slots": [
    { "slotNumber": 1, "productId": "prod_cola", "maxCapacity": 15, "price": 2.50 },
    { "slotNumber": 2, "productId": "prod_water", "maxCapacity": 20, "price": 1.75 }
  ]
}

Response: 201 Created
{
  "machineId": "mach_xyz789",
  "status": "active",
  "registeredAt": "2024-01-15T10:00:00Z"
}

Get machine status and inventory:

GET /api/machines/{machineId}
Authorization: Bearer <token>

Response: 200 OK
{
  "machineId": "mach_xyz789",
  "status": "active",
  "lastHeartbeat": "2024-01-15T10:29:30Z",
  "temperature": 3.2,
  "location": { "name": "JFK Airport Terminal 4", "address": "..." },
  "inventory": [
    { "slotNumber": 1, "product": "Cola", "quantity": 3, "maxCapacity": 15, "price": 2.50 },
    { "slotNumber": 2, "product": "Water", "quantity": 18, "maxCapacity": 20, "price": 1.75 }
  ],
  "alerts": [
    { "type": "LOW_STOCK", "slotNumber": 1, "quantity": 3, "threshold": 5 }
  ]
}

Create a lease contract:

POST /api/leases
Authorization: Bearer <admin-token>

Request:
{
  "clientId": "client_jfk_airport",
  "machineId": "mach_xyz789",
  "startDate": "2024-02-01",
  "endDate": "2025-02-01",
  "monthlyRate": 500.00,
  "revenueSharePercent": 15,
  "terms": "Standard 12-month lease with auto-renewal"
}

Response: 201 Created
{
  "leaseId": "lease_abc123",
  "status": "active"
}

Get fleet dashboard:

GET /api/dashboard/fleet?region=northeast
Authorization: Bearer <admin-token>

Response: 200 OK
{
  "totalMachines": 5000,
  "activeCount": 4850,
  "offlineCount": 50,
  "maintenanceCount": 100,
  "alertCount": 230,
  "todayRevenue": 145000.00,
  "lowStockMachines": 180,
  "topPerforming": [
    { "machineId": "mach_xyz789", "location": "JFK T4", "todayRevenue": 890.00 }
  ]
}

Machine telemetry endpoint (called by the machine itself):

POST /api/telemetry
X-Machine-Id: mach_xyz789
X-Machine-Token: <hardware-token>

Request:
{
  "events": [
    {
      "type": "heartbeat",
      "timestamp": "2024-01-15T10:30:00Z",
      "payload": {
        "temperature": 3.2,
        "doorStatus": "closed",
        "paymentModuleStatus": "online",
        "networkSignal": -65
      }
    },
    {
      "type": "sale",
      "timestamp": "2024-01-15T10:28:15Z",
      "payload": {
        "slotNumber": 1,
        "productId": "prod_cola",
        "amount": 2.50,
        "paymentMethod": "contactless",
        "transactionId": "txn_local_12345"
      }
    }
  ]
}

Response: 200 OK
{
  "ack": true,
  "commands": [
    { "type": "UPDATE_PRICE", "slotNumber": 1, "newPrice": 2.75 }
  ]
}

High-Level Design

Architecture Overview

[Vending Machines (IoT)] --MQTT/HTTPS--> [IoT Gateway (AWS IoT Core)]
                                               |
                                         [Message Broker (Kafka)]
                                               |
                          ┌────────────────────┼────────────────────┐
                          |                    |                    |
                    [Telemetry              [Transaction         [Alert
                     Processor]              Processor]           Engine]
                          |                    |                    |
                    [TimescaleDB]        [PostgreSQL]          [Redis + SNS]
                          |                    |                    |
                          └────────────────────┼────────────────────┘
                                               |
                                         [API Service]
                                               |
                                         [Dashboard (React)]

Flow 1: Machine Telemetry Ingestion

Step-by-step:

  1. Each vending machine runs embedded software that collects telemetry data every 30 seconds
  2. Machine publishes telemetry via MQTT to AWS IoT Core (lightweight pub/sub protocol designed for IoT devices with limited bandwidth)
    • MQTT topic: machines/{machineId}/telemetry
    • If internet is down, machine buffers events locally (up to 24 hours of events in local flash storage)
  3. AWS IoT Core has a rule that forwards all telemetry messages to a Kafka topic (machine-telemetry)
  4. Telemetry Processor (Kafka consumer group, 20 consumers) processes events:
    • Heartbeats: update lastHeartbeat in Redis (for real-time dashboard) and batch-write to TimescaleDB (time-series database for historical data)
    • Sales: route to the Transaction Processor (separate Kafka topic machine-transactions)
    • Errors: route to the Alert Engine
  5. TimescaleDB stores telemetry with automatic partitioning by time (hourly chunks) and retention policies (raw data for 90 days, 1-hour aggregates for 2 years)
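The routing in step 4 can be sketched as a small dispatch function. This is a sketch, not the actual processor: the `machine-alerts` topic name and the `timescaledb-batch` destination label are assumptions for illustration.

```python
# Sketch of the Telemetry Processor's routing step (step 4 above).
# The event types mirror the TelemetryEvent entity; destination names
# other than machine-transactions are assumed.

def route_telemetry(event: dict) -> str:
    """Return the destination for a telemetry event based on its type."""
    routes = {
        "heartbeat": "timescaledb-batch",    # batched write + Redis lastHeartbeat update
        "sale": "machine-transactions",      # Kafka topic for the Transaction Processor
        "error": "machine-alerts",           # consumed by the Alert Engine
        "restock": "timescaledb-batch",
    }
    try:
        return routes[event["type"]]
    except KeyError:
        return "dead-letter"                 # unknown types go to a DLQ for inspection
```

A dead-letter destination for unrecognized event types keeps a bad firmware rollout from silently dropping data.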

Back-of-envelope:

  • 50K machines * 1 heartbeat/30s = 1,667 events/sec
  • Add sales (~5 sales/machine/hour average = 70 events/sec) and other events
  • Total: ~2,000 events/sec. Kafka handles this trivially (single partition could do it, but we use 20 partitions for parallelism).
  • Storage: each event ~500 bytes. 2,000/sec * 86,400 sec/day * 500 bytes = 86 GB/day in TimescaleDB. With compression (10x for time-series), ~8.6 GB/day. 90-day retention = ~780 GB. Manageable.

Flow 2: Purchase Transaction Processing

  1. Customer taps their card on the vending machine
  2. Machine's embedded payment module processes the payment locally via the card network (Visa/Mastercard)
  3. Machine records the transaction locally with a unique localTransactionId
  4. Machine publishes a sale telemetry event to IoT Core
  5. Transaction Processor consumes from the machine-transactions Kafka topic:
    a. Deduplication check: look up localTransactionId in a Redis set (prevents double-counting if the machine retransmits)
    b. Write to PostgreSQL transactions table
    c. Update inventory: UPDATE product_slots SET current_quantity = current_quantity - 1 WHERE machine_id = ? AND slot_number = ?
    d. Update inventory in Redis cache: DECRBY inventory:mach_xyz789:slot_1 1
    e. Check if quantity fell below threshold: if so, publish a LOW_STOCK alert
  6. Revenue is attributed to the machine and the lease (for revenue share calculation)

Offline resilience: The critical insight is that the payment happens locally on the machine via the card terminal. The machine does NOT need internet to process a sale. It just needs internet to report the sale back to our platform. If the machine is offline for 4 hours, it accumulates sales locally and batch-syncs when connectivity is restored.
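Steps 5a and 5e above hinge on idempotency: a machine that times out waiting for an ack will retransmit, and the same sale must not decrement inventory twice. A minimal sketch, with an in-memory set and dict standing in for the Redis set and hash (the LOW_STOCK threshold of 5 is taken from the API example; the function shape is an assumption):

```python
# Sketch of the dedup + inventory steps (5a, 5c-e). In production the
# seen-ids set and inventory map live in Redis, not process memory.

LOW_STOCK_THRESHOLD = 5

def process_sale(event: dict, seen_ids: set, inventory: dict) -> list:
    """Apply a sale event idempotently; return any alerts raised."""
    txn_id = event["transactionId"]
    if txn_id in seen_ids:            # machine retransmitted after a lost ack
        return []
    seen_ids.add(txn_id)

    slot = (event["machineId"], event["slotNumber"])
    inventory[slot] = max(inventory.get(slot, 0) - 1, 0)

    if inventory[slot] < LOW_STOCK_THRESHOLD:
        return [{"type": "LOW_STOCK", "slot": slot, "quantity": inventory[slot]}]
    return []
```

Because the machine-generated localTransactionId is the dedup key, a 4-hour offline batch replays cleanly even if some events were already delivered before the outage.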

Flow 3: Lease Management

  1. Admin creates a lease via POST /api/leases
  2. Lease Service validates:
    • Machine exists and is not currently under another active lease
    • Client exists and is in good standing
    • Dates are valid (start < end, minimum 3-month term)
  3. Lease is written to PostgreSQL
  4. A recurring billing schedule is created in the Billing Service:
    • Monthly invoice generated on the 1st of each month
    • Invoice amount = monthlyRate + (machine revenue * revenueSharePercent / 100)
    • Sent via email and available in the admin dashboard
  5. Lease lifecycle events are tracked:
    • LEASE_CREATED, LEASE_ACTIVATED, LEASE_RENEWED, LEASE_EXPIRED, LEASE_TERMINATED
  6. 30 days before expiry, an automated reminder is sent to the client for renewal

Flow 4: Monitoring Dashboard (Real-Time)

  1. Admin opens the fleet dashboard in the browser
  2. React app calls GET /api/dashboard/fleet for initial data load
  3. App opens an SSE connection to GET /api/dashboard/stream
  4. Dashboard Service subscribes to Redis Pub/Sub channels:
    • machine:status -- machine online/offline changes
    • machine:alerts -- new alerts (low stock, errors, temperature warnings)
    • machine:transactions -- real-time revenue ticker
  5. When a telemetry event updates a machine's status, the Telemetry Processor publishes to the Redis channel
  6. Dashboard Service pushes the update to all connected admin clients via SSE
  7. React app updates the dashboard in real-time (machine status dots change color, revenue ticker increments, alerts appear)
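The push in step 6 uses standard text/event-stream framing. A minimal sketch of how the Dashboard Service would serialize one update (the event names mirror the Redis channels above; the function itself is illustrative, not a specific framework's API):

```python
# Sketch of SSE framing for a single dashboard update (step 6).
import json

def sse_message(event: str, data: dict) -> str:
    """Frame a dashboard update as a Server-Sent Events message."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

The browser's EventSource API parses this framing natively, which is why SSE is a lighter fit than WebSockets for a one-directional dashboard feed.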

Potential Deep Dives

Deep Dive 1: Inventory Management Across Distributed Machines

The Problem: 50K machines, each with 30 slots. That is 1.5M product slots to track. Inventory changes on every sale, and we need to know when to restock.

Bad Solution -- Poll every machine for inventory: Query each machine's inventory on demand. With 50K machines, this is slow and unreliable (machines might be offline). Also, the machine does not have a query API -- it pushes data to us.

Good Solution -- Event-driven inventory tracking: Inventory is updated on two events:

  1. Sale: decrement quantity (from telemetry event processing)
  2. Restock: increment quantity (from restocking crew scanning a barcode)

Maintain current inventory in Redis for fast reads:

HSET inventory:mach_xyz789 slot_1 3 slot_2 18 slot_3 0 ...

Dashboard reads from Redis. PostgreSQL is the source of truth (updated async via Kafka).
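The two inventory-changing events can be sketched as one apply function over an in-memory map standing in for the Redis hash above. Note one design choice worth naming: a restock sets the level to the crew's counted quantity rather than incrementing, which self-corrects any drift from missed sale events (the `countedQuantity` field name is an assumption):

```python
# Sketch of event-driven inventory: sales decrement, restocks overwrite
# with the physically counted quantity (self-correcting against drift).

def apply_inventory_event(inventory: dict, event: dict) -> int:
    """Apply a sale or restock event; return the slot's new level."""
    key = f"slot_{event['slotNumber']}"
    if event["type"] == "sale":
        inventory[key] = max(inventory.get(key, 0) - 1, 0)
    elif event["type"] == "restock":
        inventory[key] = event["countedQuantity"]   # crew scans the actual count
    return inventory[key]
```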

Great Solution -- Predictive inventory with demand forecasting:

Beyond tracking current levels, predict when each machine will run out of each product:

python
def predict_stockout(machine_id, slot_number):
    # Get sales history for this slot (last 30 days)
    daily_sales = timescaledb.query("""
        SELECT date_trunc('day', timestamp) as day, COUNT(*) as sales
        FROM transactions
        WHERE machine_id = ? AND slot_number = ?
          AND timestamp > NOW() - INTERVAL '30 days'
        GROUP BY day
    """, machine_id, slot_number)

    avg_daily_sales = (
        sum(d.sales for d in daily_sales) / len(daily_sales) if daily_sales else 0
    )
    # Redis returns strings; cast before doing arithmetic
    current_qty = int(redis.hget(f"inventory:{machine_id}", f"slot_{slot_number}"))

    days_until_stockout = current_qty / avg_daily_sales if avg_daily_sales > 0 else float('inf')
    return days_until_stockout

Factor in day-of-week patterns (Monday sales != Saturday sales at an office location), seasonal trends, and location-specific demand.

Use this prediction to generate proactive restocking orders before items run out, rather than reacting to low-stock alerts.
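The day-of-week adjustment can be sketched by averaging per weekday instead of computing one flat daily mean. This assumes `history` is a list of (date, sales) pairs like the TimescaleDB query above returns:

```python
# Sketch of the day-of-week adjustment: average sales per weekday
# (0 = Monday ... 6 = Sunday) rather than one flat daily mean.
from collections import defaultdict
from datetime import date

def weekday_avg_sales(history):
    """Return average daily sales keyed by weekday."""
    totals, counts = defaultdict(float), defaultdict(int)
    for day, sales in history:
        totals[day.weekday()] += sales
        counts[day.weekday()] += 1
    return {wd: totals[wd] / counts[wd] for wd in totals}
```

Dividing current quantity by the average for the upcoming weekdays (instead of a single 30-day mean) keeps an office-lobby machine from being flagged for a weekend restock it does not need.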

Deep Dive 2: Real-Time Monitoring (IoT Telemetry)

The Problem: 50K machines sending heartbeats every 30 seconds. We need to detect machine failures within 2 minutes (missed 4 consecutive heartbeats).

Bad Solution -- Cron job checking last heartbeat:

sql
SELECT machine_id FROM machines
WHERE last_heartbeat < NOW() - INTERVAL '2 minutes' AND status = 'active';

Running this every 30 seconds on 50K rows is workable but creates a polling loop. Detection latency varies (0-30 seconds depending on when the cron runs).

Good Solution -- Redis TTL-based health monitoring: On every heartbeat, set a Redis key with a TTL:

SET machine:heartbeat:mach_xyz789 "alive" EX 120  -- 2-minute TTL

When the key expires (no heartbeat received within 2 minutes), Redis triggers a keyspace notification:

SUBSCRIBE __keyevent@0__:expired

The Alert Engine subscribes to these notifications and marks the machine as offline.

Trade-off: Redis keyspace notifications are best-effort (can be lost if Redis is under heavy load). For critical monitoring, combine with a periodic sweep.
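The periodic sweep mentioned in the trade-off is cheap to sketch: scan last-heartbeat timestamps and flag anything silent past the timeout (an in-memory dict stands in for the Redis lastHeartbeat store here):

```python
# Sketch of the backstop sweep: catch machines whose expiry notification
# was lost. last_heartbeats maps machine id -> last heartbeat datetime.
from datetime import datetime, timedelta

def find_offline(last_heartbeats: dict, now: datetime,
                 timeout: timedelta = timedelta(minutes=2)) -> list:
    """Return machine ids whose last heartbeat is older than `timeout`."""
    return [mid for mid, ts in last_heartbeats.items() if now - ts > timeout]
```

Running this every minute bounds the worst-case detection latency even when a keyspace notification is dropped.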

Great Solution -- Streaming anomaly detection:

Beyond simple heartbeat monitoring, analyze telemetry for anomalies:

  1. Temperature drift: If temperature rises above 8°C (for refrigerated machines), alert before products spoil
  2. Payment module errors: If a machine reports 5 failed card reads in an hour, it likely needs maintenance
  3. Sales pattern anomaly: If a machine that normally sells 50 items/day sells 0, something is wrong (door might be stuck)

Use Apache Flink or Kafka Streams for real-time stream processing:

java
// Kafka Streams pseudo-code
telemetryStream
    .filter((key, event) -> "heartbeat".equals(event.type))
    .groupByKey()  // by machineId
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .aggregate(
        HealthMetrics::new,
        (key, event, metrics) -> metrics.update(event),
        Materialized.as("machine-health-store")
    )
    .toStream()    // KTable -> KStream before writing out
    .filter((key, metrics) -> metrics.temperature > 8.0 || metrics.errorCount > 5)
    .to("machine-alerts");

Deep Dive 3: Lease Contract Management

The Problem: Leases have complex lifecycle: creation, activation, renewal, amendment, termination, and expiration. Revenue share percentages mean we need accurate revenue tracking per machine per billing period.

Lease state machine:

DRAFT -> ACTIVE -> [RENEWED | EXPIRED | TERMINATED]
                      |
                      v
                   ACTIVE (new term)
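The diagram above can be made executable as an explicit transition table, which is how the Lease Service would reject illegal moves (e.g. reactivating a terminated lease). A sketch, with function and table names assumed:

```python
# Sketch of the lease state machine as a transition table. RENEWED loops
# back to ACTIVE for the new term, matching the diagram above.

LEASE_TRANSITIONS = {
    "DRAFT": {"ACTIVE"},
    "ACTIVE": {"RENEWED", "EXPIRED", "TERMINATED"},
    "RENEWED": {"ACTIVE"},          # new term begins
    "EXPIRED": set(),               # terminal
    "TERMINATED": set(),            # terminal
}

def transition(current: str, target: str) -> str:
    """Validate and apply a lease state change."""
    if target not in LEASE_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal lease transition {current} -> {target}")
    return target
```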

Revenue share calculation:

python
def calculate_monthly_invoice(lease, billing_period):
    # Fixed monthly rate
    base_rate = lease.monthly_rate

    # Revenue share: get total machine revenue for the billing period
    machine_revenue = db.query("""
        SELECT COALESCE(SUM(amount), 0) FROM transactions
        WHERE machine_id = ? AND timestamp BETWEEN ? AND ?
    """, lease.machine_id, billing_period.start, billing_period.end)

    revenue_share = machine_revenue * (lease.revenue_share_percent / 100)

    total_invoice = base_rate + revenue_share

    return {
        "leaseId": lease.id,
        "period": billing_period,
        "baseRate": base_rate,
        "machineRevenue": machine_revenue,
        "revenueSharePercent": lease.revenue_share_percent,
        "revenueShareAmount": revenue_share,
        "totalDue": total_invoice
    }

Auto-renewal logic:

python
def check_lease_renewals():
    # Run daily
    expiring_leases = db.query("""
        SELECT * FROM leases
        WHERE end_date BETWEEN NOW() AND NOW() + INTERVAL '30 days'
          AND status = 'active'
          AND auto_renew = true
    """)

    for lease in expiring_leases:
        days_left = (lease.end_date - today()).days

        if days_left == 30:
            send_renewal_notice(lease)  # 30-day notice to client

        if days_left == 0:
            new_lease = renew_lease(lease)  # Create new lease with same terms
            publish_event("LEASE_RENEWED", new_lease)

Deep Dive 4: Restocking Optimization

The Problem: We have 200 restocking crews covering 50K machines across a metropolitan area. Each crew can visit ~20 machines per shift. How do we decide which machines to restock and in what order?

Bad Solution -- Restock on low-stock alert: Wait until a machine hits the low-stock threshold, then dispatch a crew. This is reactive -- products may already be out of stock before the crew arrives. Also, crews end up criss-crossing the city visiting scattered machines.

Good Solution -- Daily route planning: Every evening, generate the next day's routes:

  1. Identify all machines that will need restocking within 48 hours (using demand prediction from Deep Dive 1)
  2. Group machines geographically using k-means clustering
  3. For each cluster, assign a crew and optimize the visit order using a Traveling Salesman Problem (TSP) approximation (nearest-neighbor heuristic or Google OR-Tools)
  4. Generate a route sheet for each crew with estimated times and product quantities to load
python
def plan_daily_routes(date, region):
    # Step 1: Identify machines needing restock
    machines_to_restock = []
    for machine in get_machines(region):
        for slot in machine.slots:
            days_until_empty = predict_stockout(machine.id, slot.number)
            if days_until_empty <= 2:
                machines_to_restock.append(machine)
                break

    # Step 2: Cluster geographically
    clusters = kmeans_cluster(
        points=[(m.location.lat, m.location.lng) for m in machines_to_restock],
        n_clusters=len(available_crews)
    )

    # Step 3: Optimize route per cluster
    routes = []
    for cluster in clusters:
        route = solve_tsp(cluster.machines, start=warehouse_location)
        routes.append(route)

    return routes
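The solve_tsp call above is left abstract. A nearest-neighbor sketch follows, with distances simplified to Euclidean on (lat, lng) pairs; real routing would use road distances from a maps API:

```python
# Nearest-neighbor TSP heuristic (step 3 above): always visit the
# closest unvisited machine next. Not optimal, but fast and usually
# within ~25% of optimal for route sheets like these.
import math

def solve_tsp_nearest_neighbor(points: list, start: tuple) -> list:
    """Return a greedy visit order over `points`, beginning at `start`."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    route, current, remaining = [], start, list(points)
    while remaining:
        nxt = min(remaining, key=lambda p: dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route
```

For tighter routes, the same interface can be swapped for Google OR-Tools' routing solver without changing the planner.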

Great Solution -- Dynamic routing with real-time adjustments: Start with the planned route but adjust throughout the day:

  • If a machine sells out unexpectedly (event from Alert Engine), re-prioritize it
  • If a crew finishes early, assign them nearby machines from another crew's list
  • If traffic makes a route suboptimal, recalculate using real-time traffic data (Google Maps API)

Push route updates to the crew's mobile app in real-time.

Deep Dive 5: Payment Processing and Offline Resilience

The Problem: Vending machines are at remote locations. Internet goes down. The machine must still accept payments.

Architecture:

Each machine has:

  1. A payment terminal (card reader with EMV chip) that communicates directly with the card network (Visa/Mastercard) via a cellular modem
  2. An embedded controller (Linux-based) that manages inventory, telemetry, and local transaction logging
  3. Local storage (32GB flash) for buffering transactions and telemetry

Online flow:

  1. Customer taps card
  2. Payment terminal sends authorization request to card network (via cellular modem)
  3. Card network approves (or declines)
  4. Machine dispenses product
  5. Transaction recorded locally AND sent to our platform via IoT Core
  6. Our platform processes the transaction (update inventory, revenue tracking)

Offline flow (internet down):

  1. Customer taps card
  2. Payment terminal uses offline authorization (Store and Forward): accepts the transaction locally with floor limits ($25 max per transaction)
  3. Machine dispenses product
  4. Transaction stored in local flash storage with a synced = false flag
  5. Machine retries sync every 5 minutes
  6. When internet is restored, all buffered transactions are batch-sent to IoT Core
  7. Our platform processes them with timestamps from the original transaction time (not the sync time)

Risk mitigation for offline transactions:

  • Floor limit of $25 prevents large fraudulent charges
  • Maximum 50 offline transactions before the machine stops accepting new ones
  • When back online, any declined offline authorizations are flagged for review
  • Revenue reconciliation runs daily to match machine-reported transactions with payment processor settlements
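The first two mitigations reduce to a single acceptance check on the machine's controller. A sketch (the constants come from the list above; the function shape is assumed):

```python
# Sketch of the offline-acceptance policy: $25 floor limit and a cap of
# 50 unsynced transactions before the machine stops accepting payments.

FLOOR_LIMIT = 25.00
MAX_OFFLINE_TXNS = 50

def can_accept_offline(amount: float, pending_txn_count: int) -> bool:
    """Decide whether to store-and-forward a payment while offline."""
    return amount <= FLOOR_LIMIT and pending_txn_count < MAX_OFFLINE_TXNS
```

The cap bounds worst-case exposure to $1,250 per machine per outage, which frames the business conversation about how strict these limits should be.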

Deep Dive 6: Machine Health Monitoring and Predictive Maintenance

The Problem: Machine breakdowns cause lost revenue and poor customer experience. We want to predict failures before they happen.

Telemetry signals that predict failures:

Signal | Normal Range | Warning | Critical
Temperature | 2-5°C | 5-8°C | > 8°C
Door open events/day | 0 | 1-3 | > 3 (seal issue)
Card reader error rate | < 1% | 1-5% | > 5%
Motor current draw | 0.5-1.2A | 1.2-1.8A (friction) | > 1.8A (imminent failure)
Network signal (dBm) | > -75 | -75 to -90 | < -90

Predictive maintenance pipeline:

  1. Feature extraction: For each machine, compute rolling averages (1-hour, 24-hour, 7-day) for all telemetry signals
  2. Anomaly detection: Use a simple z-score model: if a signal is > 3 standard deviations from its 30-day mean, flag it
  3. Failure prediction model: Train on historical data (features from 7 days before each known failure). Predict probability of failure in the next 7 days.
  4. Maintenance scheduling: Machines with > 70% failure probability are added to the next maintenance schedule
python
def check_machine_health(machine_id):
    # Get recent telemetry
    recent = timescaledb.query("""
        SELECT AVG((payload->>'temperature')::numeric) as avg_temp,
               AVG((payload->>'motor_current')::numeric) as avg_current,
               COUNT(*) FILTER (WHERE payload->>'card_error' = 'true') as card_errors,
               COUNT(*) as total_events
        FROM telemetry_events
        WHERE machine_id = ? AND timestamp > NOW() - INTERVAL '24 hours'
          AND type = 'heartbeat'
    """, machine_id)  -- ->> returns text, so cast before aggregating

    health_score = 100
    alerts = []

    if recent.avg_temp > 8:
        health_score -= 30
        alerts.append({"type": "HIGH_TEMP", "value": recent.avg_temp})

    if recent.avg_current > 1.5:
        health_score -= 25
        alerts.append({"type": "HIGH_MOTOR_CURRENT", "value": recent.avg_current})

    card_error_rate = recent.card_errors / max(recent.total_events, 1)
    if card_error_rate > 0.05:
        health_score -= 20
        alerts.append({"type": "CARD_READER_ERRORS", "rate": card_error_rate})

    return {"machineId": machine_id, "healthScore": health_score, "alerts": alerts}

Outcome: Proactive maintenance reduces machine downtime from an average of 8 hours (reactive) to 2 hours (scheduled maintenance before failure). This translates to ~$15 per machine per month in recovered revenue.


What is Expected at Each Level

Mid-Level

  • Design basic machine registration and inventory tracking with a relational database
  • Understand the need for telemetry ingestion from IoT devices
  • Basic lease CRUD operations
  • Know that payments happen at the machine and need to be reported back

Senior

  • Design the IoT telemetry pipeline (MQTT + Kafka + TimescaleDB)
  • Offline-first payment architecture with store-and-forward
  • Real-time dashboard with SSE and Redis Pub/Sub
  • Inventory tracking with low-stock alerts (event-driven)
  • Lease lifecycle management with revenue share calculations
  • Back-of-envelope for telemetry throughput and storage

Staff+

  • Predictive inventory management with demand forecasting
  • Restocking route optimization (TSP, clustering, dynamic re-routing)
  • Machine health prediction using telemetry anomaly detection
  • Streaming analytics with Flink/Kafka Streams for real-time anomaly detection
  • Multi-region fleet management (machines across different time zones and regulations)
  • Edge computing considerations: what processing should happen on the machine vs in the cloud?
  • Cost optimization: cellular data costs for 50K machines, choosing between 4G and NB-IoT
  • Security: machine authentication (hardware tokens), preventing telemetry spoofing, PCI compliance for payment data
