
HLD: Vending Machine Leasing Company Platform

Understanding the Problem

What is a Vending Machine Leasing Platform?

A platform that manages a fleet of vending machines deployed at client locations (offices, hospitals, airports, gyms). The company owns the machines, leases them to businesses, and handles inventory management, restocking, maintenance, and payment processing. The engineering challenges are unique: we are dealing with thousands of IoT devices that need real-time monitoring, inventory tracking across distributed physical machines, and route optimization for restocking crews -- all while processing payments at each machine.

Functional Requirements

Core (above the line):

  1. Machine management -- register machines, assign to client locations, track machine status (active/maintenance/offline)
  2. Inventory tracking -- real-time inventory levels per machine per product slot, low-stock alerts
  3. Lease management -- create/manage lease contracts with clients (terms, pricing, revenue share)
  4. Payment processing -- handle purchases at machines (card, mobile pay), track revenue per machine
  5. Monitoring dashboard -- real-time view of fleet health, inventory levels, revenue, and alerts

Below the line (mention but don't design):

  • Customer-facing mobile app (find nearby machines, see product availability)
  • Dynamic pricing based on demand
  • Product recommendation engine
  • Maintenance technician dispatch
  • Accounting and invoicing for lease payments

Non-Functional Requirements

  1. Reliability -- machines must process payments even with intermittent internet connectivity (offline-first)
  2. Real-time monitoring -- machine health and inventory updates within 30 seconds
  3. Scale -- 50,000 machines, each with 30 product slots, reporting telemetry every 30 seconds = ~100,000 telemetry events/minute (~1,700 events/sec)
  4. Data durability -- zero lost transactions; every purchase must be recorded for revenue tracking

The Set Up

Core Entities

Entity | Description
Machine | id, serialNumber, model, locationId, status (active/maintenance/offline), lastHeartbeat, firmwareVersion
Location | id, clientId, address, geoCoordinates, locationType (office/hospital/airport)
Lease | id, clientId, machineId, startDate, endDate, monthlyRate, revenueSharePercent, status (active/expired/terminated)
ProductSlot | machineId, slotNumber, productId, currentQuantity, maxCapacity, price
Transaction | id, machineId, productId, amount, paymentMethod, timestamp, synced (boolean)
TelemetryEvent | machineId, timestamp, type (heartbeat/sale/error/restock), payload (JSON)

API Design

Register a new machine:

POST /api/machines
Authorization: Bearer <admin-token>

Request:
{
  "serialNumber": "VM-2024-A1234",
  "model": "SnackMaster 3000",
  "locationId": "loc_airport_jfk_t4",
  "slots": [
    { "slotNumber": 1, "productId": "prod_cola", "maxCapacity": 15, "price": 2.50 },
    { "slotNumber": 2, "productId": "prod_water", "maxCapacity": 20, "price": 1.75 }
  ]
}

Response: 201 Created
{
  "machineId": "mach_xyz789",
  "status": "active",
  "registeredAt": "2024-01-15T10:00:00Z"
}

Get machine status and inventory:

GET /api/machines/{machineId}
Authorization: Bearer <token>

Response: 200 OK
{
  "machineId": "mach_xyz789",
  "status": "active",
  "lastHeartbeat": "2024-01-15T10:29:30Z",
  "temperature": 3.2,
  "location": { "name": "JFK Airport Terminal 4", "address": "..." },
  "inventory": [
    { "slotNumber": 1, "product": "Cola", "quantity": 3, "maxCapacity": 15, "price": 2.50 },
    { "slotNumber": 2, "product": "Water", "quantity": 18, "maxCapacity": 20, "price": 1.75 }
  ],
  "alerts": [
    { "type": "LOW_STOCK", "slotNumber": 1, "quantity": 3, "threshold": 5 }
  ]
}

Create a lease contract:

POST /api/leases
Authorization: Bearer <admin-token>

Request:
{
  "clientId": "client_jfk_airport",
  "machineId": "mach_xyz789",
  "startDate": "2024-02-01",
  "endDate": "2025-02-01",
  "monthlyRate": 500.00,
  "revenueSharePercent": 15,
  "terms": "Standard 12-month lease with auto-renewal"
}

Response: 201 Created
{
  "leaseId": "lease_abc123",
  "status": "active"
}

Get fleet dashboard:

GET /api/dashboard/fleet?region=northeast
Authorization: Bearer <admin-token>

Response: 200 OK
{
  "totalMachines": 5000,
  "activeCount": 4850,
  "offlineCount": 50,
  "maintenanceCount": 100,
  "alertCount": 230,
  "todayRevenue": 145000.00,
  "lowStockMachines": 180,
  "topPerforming": [
    { "machineId": "mach_xyz789", "location": "JFK T4", "todayRevenue": 890.00 }
  ]
}

Machine telemetry endpoint (called by the machine itself):

POST /api/telemetry
X-Machine-Id: mach_xyz789
X-Machine-Token: <hardware-token>

Request:
{
  "events": [
    {
      "type": "heartbeat",
      "timestamp": "2024-01-15T10:30:00Z",
      "payload": {
        "temperature": 3.2,
        "doorStatus": "closed",
        "paymentModuleStatus": "online",
        "networkSignal": -65
      }
    },
    {
      "type": "sale",
      "timestamp": "2024-01-15T10:28:15Z",
      "payload": {
        "slotNumber": 1,
        "productId": "prod_cola",
        "amount": 2.50,
        "paymentMethod": "contactless",
        "transactionId": "txn_local_12345"
      }
    }
  ]
}

Response: 200 OK
{
  "ack": true,
  "commands": [
    { "type": "UPDATE_PRICE", "slotNumber": 1, "newPrice": 2.75 }
  ]
}

High-Level Design

Architecture Overview

[Vending Machines (IoT)] --MQTT/HTTPS--> [IoT Gateway (AWS IoT Core)]
                                               |
                                         [Message Broker (Kafka)]
                                               |
                          ┌────────────────────┼────────────────────┐
                          |                    |                    |
                    [Telemetry              [Transaction         [Alert
                     Processor]              Processor]           Engine]
                          |                    |                    |
                    [TimescaleDB]        [PostgreSQL]          [Redis + SNS]
                          |                    |                    |
                          └────────────────────┼────────────────────┘
                                               |
                                         [API Service]
                                               |
                                         [Dashboard (React)]

Flow 1: Machine Telemetry Ingestion

Step-by-step:

  1. Each vending machine runs embedded software that collects telemetry data every 30 seconds
  2. Machine publishes telemetry via MQTT to AWS IoT Core (lightweight pub/sub protocol designed for IoT devices with limited bandwidth)
    • MQTT topic: machines/{machineId}/telemetry
    • If internet is down, machine buffers events locally (up to 24 hours of events in local flash storage)
  3. AWS IoT Core has a rule that forwards all telemetry messages to a Kafka topic (machine-telemetry)
  4. Telemetry Processor (Kafka consumer group, 20 consumers) processes events:
    • Heartbeats: update lastHeartbeat in Redis (for real-time dashboard) and batch-write to TimescaleDB (time-series database for historical data)
    • Sales: route to the Transaction Processor (separate Kafka topic machine-transactions)
    • Errors: route to the Alert Engine
  5. TimescaleDB stores telemetry with automatic partitioning by time (hourly chunks) and retention policies (raw data for 90 days, 1-hour aggregates for 2 years)
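The routing in step 4 can be sketched as a small dispatch function. This is a sketch, not the actual processor: the `machine-alerts` topic name and the `timescaledb-batch` destination label are assumptions for illustration.

```python
# Sketch of the Telemetry Processor's routing step (step 4 above).
# The event types mirror the TelemetryEvent entity; destination names
# other than machine-transactions are assumed.

def route_telemetry(event: dict) -> str:
    """Return the destination for a telemetry event based on its type."""
    routes = {
        "heartbeat": "timescaledb-batch",    # batched write + Redis lastHeartbeat update
        "sale": "machine-transactions",      # Kafka topic for the Transaction Processor
        "error": "machine-alerts",           # consumed by the Alert Engine
        "restock": "timescaledb-batch",
    }
    try:
        return routes[event["type"]]
    except KeyError:
        return "dead-letter"                 # unknown types go to a DLQ for inspection
```

A dead-letter destination for unrecognized event types keeps a bad firmware rollout from silently dropping data.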

Back-of-envelope:

  • 50K machines * 1 heartbeat/30s = 1,667 events/sec
  • Add sales (~5 sales/machine/hour average = 70 events/sec) and other events
  • Total: ~2,000 events/sec. Kafka handles this trivially (single partition could do it, but we use 20 partitions for parallelism).
  • Storage: each event ~500 bytes. 2,000/sec * 86,400 sec/day * 500 bytes = 86 GB/day in TimescaleDB. With compression (10x for time-series), ~8.6 GB/day. 90-day retention = ~780 GB. Manageable.

Flow 2: Purchase Transaction Processing

  1. Customer taps their card on the vending machine
  2. Machine's embedded payment module processes the payment locally via the card network (Visa/Mastercard)
  3. Machine records the transaction locally with a unique localTransactionId
  4. Machine publishes a sale telemetry event to IoT Core
  5. Transaction Processor consumes from the machine-transactions Kafka topic:
    a. Deduplication check: look up localTransactionId in a Redis set (prevents double-counting if the machine retransmits)
    b. Write to PostgreSQL transactions table
    c. Update inventory: UPDATE product_slots SET current_quantity = current_quantity - 1 WHERE machine_id = ? AND slot_number = ?
    d. Update inventory in Redis cache: DECRBY inventory:mach_xyz789:slot_1 1
    e. Check if quantity fell below threshold: if so, publish a LOW_STOCK alert
  6. Revenue is attributed to the machine and the lease (for revenue share calculation)

Offline resilience: The critical insight is that the payment happens locally on the machine via the card terminal. The machine does NOT need internet to process a sale. It just needs internet to report the sale back to our platform. If the machine is offline for 4 hours, it accumulates sales locally and batch-syncs when connectivity is restored.
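Steps 5a and 5e above hinge on idempotency: a machine that times out waiting for an ack will retransmit, and the same sale must not decrement inventory twice. A minimal sketch, with an in-memory set and dict standing in for the Redis set and hash (the LOW_STOCK threshold of 5 is taken from the API example; the function shape is an assumption):

```python
# Sketch of the dedup + inventory steps (5a, 5c-e). In production the
# seen-ids set and inventory map live in Redis, not process memory.

LOW_STOCK_THRESHOLD = 5

def process_sale(event: dict, seen_ids: set, inventory: dict) -> list:
    """Apply a sale event idempotently; return any alerts raised."""
    txn_id = event["transactionId"]
    if txn_id in seen_ids:            # machine retransmitted after a lost ack
        return []
    seen_ids.add(txn_id)

    slot = (event["machineId"], event["slotNumber"])
    inventory[slot] = max(inventory.get(slot, 0) - 1, 0)

    if inventory[slot] < LOW_STOCK_THRESHOLD:
        return [{"type": "LOW_STOCK", "slot": slot, "quantity": inventory[slot]}]
    return []
```

Because the machine-generated localTransactionId is the dedup key, a 4-hour offline batch replays cleanly even if some events were already delivered before the outage.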

Flow 3: Lease Management

  1. Admin creates a lease via POST /api/leases
  2. Lease Service validates:
    • Machine exists and is not currently under another active lease
    • Client exists and is in good standing
    • Dates are valid (start < end, minimum 3-month term)
  3. Lease is written to PostgreSQL
  4. A recurring billing schedule is created in the Billing Service:
    • Monthly invoice generated on the 1st of each month
    • Invoice amount = monthlyRate + (machine revenue * revenueSharePercent / 100)
    • Sent via email and available in the admin dashboard
  5. Lease lifecycle events are tracked:
    • LEASE_CREATED, LEASE_ACTIVATED, LEASE_RENEWED, LEASE_EXPIRED, LEASE_TERMINATED
  6. 30 days before expiry, an automated reminder is sent to the client for renewal

Flow 4: Monitoring Dashboard (Real-Time)

  1. Admin opens the fleet dashboard in the browser
  2. React app calls GET /api/dashboard/fleet for initial data load
  3. App opens an SSE connection to GET /api/dashboard/stream
  4. Dashboard Service subscribes to Redis Pub/Sub channels:
    • machine:status -- machine online/offline changes
    • machine:alerts -- new alerts (low stock, errors, temperature warnings)
    • machine:transactions -- real-time revenue ticker
  5. When a telemetry event updates a machine's status, the Telemetry Processor publishes to the Redis channel
  6. Dashboard Service pushes the update to all connected admin clients via SSE
  7. React app updates the dashboard in real-time (machine status dots change color, revenue ticker increments, alerts appear)
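The push in step 6 uses standard text/event-stream framing. A minimal sketch of how the Dashboard Service would serialize one update (the event names mirror the Redis channels above; the function itself is illustrative, not a specific framework's API):

```python
# Sketch of SSE framing for a single dashboard update (step 6).
import json

def sse_message(event: str, data: dict) -> str:
    """Frame a dashboard update as a Server-Sent Events message."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

The browser's EventSource API parses this framing natively, which is why SSE is a lighter fit than WebSockets for a one-directional dashboard feed.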

Potential Deep Dives

Deep Dive 1: Inventory Management Across Distributed Machines

The Problem: 50K machines, each with 30 slots. That is 1.5M product slots to track. Inventory changes on every sale, and we need to know when to restock.

Bad Solution -- Poll every machine for inventory: Query each machine's inventory on demand. With 50K machines, this is slow and unreliable (machines might be offline). Also, the machine does not have a query API -- it pushes data to us.

Good Solution -- Event-driven inventory tracking: Inventory is updated on two events:

  1. Sale: decrement quantity (from telemetry event processing)
  2. Restock: increment quantity (from restocking crew scanning a barcode)

Maintain current inventory in Redis for fast reads:

HSET inventory:mach_xyz789 slot_1 3 slot_2 18 slot_3 0 ...

Dashboard reads from Redis. PostgreSQL is the source of truth (updated async via Kafka).
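The two inventory-changing events can be sketched as one apply function over an in-memory map standing in for the Redis hash above. Note one design choice worth naming: a restock sets the level to the crew's counted quantity rather than incrementing, which self-corrects any drift from missed sale events (the `countedQuantity` field name is an assumption):

```python
# Sketch of event-driven inventory: sales decrement, restocks overwrite
# with the physically counted quantity (self-correcting against drift).

def apply_inventory_event(inventory: dict, event: dict) -> int:
    """Apply a sale or restock event; return the slot's new level."""
    key = f"slot_{event['slotNumber']}"
    if event["type"] == "sale":
        inventory[key] = max(inventory.get(key, 0) - 1, 0)
    elif event["type"] == "restock":
        inventory[key] = event["countedQuantity"]   # crew scans the actual count
    return inventory[key]
```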

Great Solution -- Predictive inventory with demand forecasting:

Beyond tracking current levels, predict when each machine will run out of each product:

python
def predict_stockout(machine_id, slot_number):
    # Get sales history for this slot (last 30 days)
    daily_sales = timescaledb.query("""
        SELECT date_trunc('day', timestamp) as day, COUNT(*) as sales
        FROM transactions
        WHERE machine_id = ? AND slot_number = ?
          AND timestamp > NOW() - INTERVAL '30 days'
        GROUP BY day
    """, machine_id, slot_number)

    avg_daily_sales = (
        sum(d.sales for d in daily_sales) / len(daily_sales) if daily_sales else 0
    )
    # Redis returns strings; cast before doing arithmetic
    current_qty = int(redis.hget(f"inventory:{machine_id}", f"slot_{slot_number}"))

    days_until_stockout = current_qty / avg_daily_sales if avg_daily_sales > 0 else float('inf')
    return days_until_stockout

Factor in day-of-week patterns (Monday sales != Saturday sales at an office location), seasonal trends, and location-specific demand.

Use this prediction to generate proactive restocking orders before items run out, rather than reacting to low-stock alerts.
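The day-of-week adjustment can be sketched by averaging per weekday instead of computing one flat daily mean. This assumes `history` is a list of (date, sales) pairs like the TimescaleDB query above returns:

```python
# Sketch of the day-of-week adjustment: average sales per weekday
# (0 = Monday ... 6 = Sunday) rather than one flat daily mean.
from collections import defaultdict
from datetime import date

def weekday_avg_sales(history):
    """Return average daily sales keyed by weekday."""
    totals, counts = defaultdict(float), defaultdict(int)
    for day, sales in history:
        totals[day.weekday()] += sales
        counts[day.weekday()] += 1
    return {wd: totals[wd] / counts[wd] for wd in totals}
```

Dividing current quantity by the average for the upcoming weekdays (instead of a single 30-day mean) keeps an office-lobby machine from being flagged for a weekend restock it does not need.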

Deep Dive 2: Real-Time Monitoring (IoT Telemetry)

The Problem: 50K machines sending heartbeats every 30 seconds. We need to detect machine failures within 2 minutes (missed 4 consecutive heartbeats).

Bad Solution -- Cron job checking last heartbeat:

sql
SELECT machine_id FROM machines
WHERE last_heartbeat < NOW() - INTERVAL '2 minutes' AND status = 'active';

Running this every 30 seconds on 50K rows is workable but creates a polling loop. Detection latency varies (0-30 seconds depending on when the cron runs).

Good Solution -- Redis TTL-based health monitoring: On every heartbeat, set a Redis key with a TTL:

SET machine:heartbeat:mach_xyz789 "alive" EX 120  -- 2-minute TTL

When the key expires (no heartbeat received within 2 minutes), Redis triggers a keyspace notification:

SUBSCRIBE __keyevent@0__:expired

The Alert Engine subscribes to these notifications and marks the machine as offline.

Trade-off: Redis keyspace notifications are best-effort (can be lost if Redis is under heavy load). For critical monitoring, combine with a periodic sweep.
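The periodic sweep mentioned in the trade-off is cheap to sketch: scan last-heartbeat timestamps and flag anything silent past the timeout (an in-memory dict stands in for the Redis lastHeartbeat store here):

```python
# Sketch of the backstop sweep: catch machines whose expiry notification
# was lost. last_heartbeats maps machine id -> last heartbeat datetime.
from datetime import datetime, timedelta

def find_offline(last_heartbeats: dict, now: datetime,
                 timeout: timedelta = timedelta(minutes=2)) -> list:
    """Return machine ids whose last heartbeat is older than `timeout`."""
    return [mid for mid, ts in last_heartbeats.items() if now - ts > timeout]
```

Running this every minute bounds the worst-case detection latency even when a keyspace notification is dropped.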

Great Solution -- Streaming anomaly detection:

Beyond simple heartbeat monitoring, analyze telemetry for anomalies:

  1. Temperature drift: If temperature rises above 8°C (for refrigerated machines), alert before products spoil
  2. Payment module errors: If a machine reports 5 failed card reads in an hour, it likely needs maintenance
  3. Sales pattern anomaly: If a machine that normally sells 50 items/day sells 0, something is wrong (door might be stuck)

Use Apache Flink or Kafka Streams for real-time stream processing:

java
// Kafka Streams pseudo-code
telemetryStream
    .filter((key, event) -> "heartbeat".equals(event.type))
    .groupByKey()  // by machineId
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .aggregate(
        HealthMetrics::new,
        (key, event, metrics) -> metrics.update(event),
        Materialized.as("machine-health-store")
    )
    .toStream()    // KTable -> KStream before writing out
    .filter((key, metrics) -> metrics.temperature > 8.0 || metrics.errorCount > 5)
    .to("machine-alerts");

Deep Dive 3: Lease Contract Management

The Problem: Leases have complex lifecycle: creation, activation, renewal, amendment, termination, and expiration. Revenue share percentages mean we need accurate revenue tracking per machine per billing period.

Lease state machine:

DRAFT -> ACTIVE -> [RENEWED | EXPIRED | TERMINATED]
                      |
                      v
                   ACTIVE (new term)
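The diagram above can be made executable as an explicit transition table, which is how the Lease Service would reject illegal moves (e.g. reactivating a terminated lease). A sketch, with function and table names assumed:

```python
# Sketch of the lease state machine as a transition table. RENEWED loops
# back to ACTIVE for the new term, matching the diagram above.

LEASE_TRANSITIONS = {
    "DRAFT": {"ACTIVE"},
    "ACTIVE": {"RENEWED", "EXPIRED", "TERMINATED"},
    "RENEWED": {"ACTIVE"},          # new term begins
    "EXPIRED": set(),               # terminal
    "TERMINATED": set(),            # terminal
}

def transition(current: str, target: str) -> str:
    """Validate and apply a lease state change."""
    if target not in LEASE_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal lease transition {current} -> {target}")
    return target
```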

Revenue share calculation:

python
def calculate_monthly_invoice(lease, billing_period):
    # Fixed monthly rate
    base_rate = lease.monthly_rate

    # Revenue share: get total machine revenue for the billing period
    machine_revenue = db.query("""
        SELECT COALESCE(SUM(amount), 0) FROM transactions
        WHERE machine_id = ? AND timestamp BETWEEN ? AND ?
    """, lease.machine_id, billing_period.start, billing_period.end)

    revenue_share = machine_revenue * (lease.revenue_share_percent / 100)

    total_invoice = base_rate + revenue_share

    return {
        "leaseId": lease.id,
        "period": billing_period,
        "baseRate": base_rate,
        "machineRevenue": machine_revenue,
        "revenueSharePercent": lease.revenue_share_percent,
        "revenueShareAmount": revenue_share,
        "totalDue": total_invoice
    }

Auto-renewal logic:

python
def check_lease_renewals():
    # Run daily
    expiring_leases = db.query("""
        SELECT * FROM leases
        WHERE end_date BETWEEN NOW() AND NOW() + INTERVAL '30 days'
          AND status = 'active'
          AND auto_renew = true
    """)

    for lease in expiring_leases:
        days_left = (lease.end_date - today()).days

        if days_left == 30:
            send_renewal_notice(lease)  # 30-day notice to client

        if days_left == 0:
            new_lease = renew_lease(lease)  # Create new lease with same terms
            publish_event("LEASE_RENEWED", new_lease)

Deep Dive 4: Restocking Optimization

The Problem: We have 200 restocking crews covering 50K machines across a metropolitan area. Each crew can visit ~20 machines per shift. How do we decide which machines to restock and in what order?

Bad Solution -- Restock on low-stock alert: Wait until a machine hits the low-stock threshold, then dispatch a crew. This is reactive -- products may already be out of stock before the crew arrives. Also, crews end up criss-crossing the city visiting scattered machines.

Good Solution -- Daily route planning: Every evening, generate the next day's routes:

  1. Identify all machines that will need restocking within 48 hours (using demand prediction from Deep Dive 1)
  2. Group machines geographically using k-means clustering
  3. For each cluster, assign a crew and optimize the visit order using a Traveling Salesman Problem (TSP) approximation (nearest-neighbor heuristic or Google OR-Tools)
  4. Generate a route sheet for each crew with estimated times and product quantities to load
python
def plan_daily_routes(date, region):
    # Step 1: Identify machines needing restock
    machines_to_restock = []
    for machine in get_machines(region):
        for slot in machine.slots:
            days_until_empty = predict_stockout(machine.id, slot.number)
            if days_until_empty <= 2:
                machines_to_restock.append(machine)
                break

    # Step 2: Cluster geographically
    clusters = kmeans_cluster(
        points=[(m.location.lat, m.location.lng) for m in machines_to_restock],
        n_clusters=len(available_crews)
    )

    # Step 3: Optimize route per cluster
    routes = []
    for cluster in clusters:
        route = solve_tsp(cluster.machines, start=warehouse_location)
        routes.append(route)

    return routes
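The solve_tsp call above is left abstract. A nearest-neighbor sketch follows, with distances simplified to Euclidean on (lat, lng) pairs; real routing would use road distances from a maps API:

```python
# Nearest-neighbor TSP heuristic (step 3 above): always visit the
# closest unvisited machine next. Not optimal, but fast and usually
# within ~25% of optimal for route sheets like these.
import math

def solve_tsp_nearest_neighbor(points: list, start: tuple) -> list:
    """Return a greedy visit order over `points`, beginning at `start`."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    route, current, remaining = [], start, list(points)
    while remaining:
        nxt = min(remaining, key=lambda p: dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route
```

For tighter routes, the same interface can be swapped for Google OR-Tools' routing solver without changing the planner.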

Great Solution -- Dynamic routing with real-time adjustments: Start with the planned route but adjust throughout the day:

  • If a machine sells out unexpectedly (event from Alert Engine), re-prioritize it
  • If a crew finishes early, assign them nearby machines from another crew's list
  • If traffic makes a route suboptimal, recalculate using real-time traffic data (Google Maps API)

Push route updates to the crew's mobile app in real-time.

Deep Dive 5: Payment Processing and Offline Resilience

The Problem: Vending machines are at remote locations. Internet goes down. The machine must still accept payments.

Architecture:

Each machine has:

  1. A payment terminal (card reader with EMV chip) that communicates directly with the card network (Visa/Mastercard) via a cellular modem
  2. An embedded controller (Linux-based) that manages inventory, telemetry, and local transaction logging
  3. Local storage (32GB flash) for buffering transactions and telemetry

Online flow:

  1. Customer taps card
  2. Payment terminal sends authorization request to card network (via cellular modem)
  3. Card network approves (or declines)
  4. Machine dispenses product
  5. Transaction recorded locally AND sent to our platform via IoT Core
  6. Our platform processes the transaction (update inventory, revenue tracking)

Offline flow (internet down):

  1. Customer taps card
  2. Payment terminal uses offline authorization (Store and Forward): accepts the transaction locally with floor limits ($25 max per transaction)
  3. Machine dispenses product
  4. Transaction stored in local flash storage with a synced = false flag
  5. Machine retries sync every 5 minutes
  6. When internet is restored, all buffered transactions are batch-sent to IoT Core
  7. Our platform processes them with timestamps from the original transaction time (not the sync time)

Risk mitigation for offline transactions:

  • Floor limit of $25 prevents large fraudulent charges
  • Maximum 50 offline transactions before the machine stops accepting new ones
  • When back online, any declined offline authorizations are flagged for review
  • Revenue reconciliation runs daily to match machine-reported transactions with payment processor settlements
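The first two mitigations reduce to a single acceptance check on the machine's controller. A sketch (the constants come from the list above; the function shape is assumed):

```python
# Sketch of the offline-acceptance policy: $25 floor limit and a cap of
# 50 unsynced transactions before the machine stops accepting payments.

FLOOR_LIMIT = 25.00
MAX_OFFLINE_TXNS = 50

def can_accept_offline(amount: float, pending_txn_count: int) -> bool:
    """Decide whether to store-and-forward a payment while offline."""
    return amount <= FLOOR_LIMIT and pending_txn_count < MAX_OFFLINE_TXNS
```

The cap bounds worst-case exposure to $1,250 per machine per outage, which frames the business conversation about how strict these limits should be.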

Deep Dive 6: Machine Health Monitoring and Predictive Maintenance

The Problem: Machine breakdowns cause lost revenue and poor customer experience. We want to predict failures before they happen.

Telemetry signals that predict failures:

Signal | Normal Range | Warning | Critical
Temperature | 2-5°C | 5-8°C | > 8°C
Door open events/day | 0 | 1-3 | > 3 (seal issue)
Card reader error rate | < 1% | 1-5% | > 5%
Motor current draw | 0.5-1.2A | 1.2-1.8A (friction) | > 1.8A (imminent failure)
Network signal (dBm) | > -75 | -75 to -90 | < -90

Predictive maintenance pipeline:

  1. Feature extraction: For each machine, compute rolling averages (1-hour, 24-hour, 7-day) for all telemetry signals
  2. Anomaly detection: Use a simple z-score model: if a signal is > 3 standard deviations from its 30-day mean, flag it
  3. Failure prediction model: Train on historical data (features from 7 days before each known failure). Predict probability of failure in the next 7 days.
  4. Maintenance scheduling: Machines with > 70% failure probability are added to the next maintenance schedule
python
def check_machine_health(machine_id):
    # Get recent telemetry
    recent = timescaledb.query("""
        SELECT AVG((payload->>'temperature')::numeric) as avg_temp,
               AVG((payload->>'motor_current')::numeric) as avg_current,
               COUNT(*) FILTER (WHERE payload->>'card_error' = 'true') as card_errors,
               COUNT(*) as total_events
        FROM telemetry_events
        WHERE machine_id = ? AND timestamp > NOW() - INTERVAL '24 hours'
          AND type = 'heartbeat'
    """, machine_id)  -- ->> returns text, so cast before aggregating

    health_score = 100
    alerts = []

    if recent.avg_temp > 8:
        health_score -= 30
        alerts.append({"type": "HIGH_TEMP", "value": recent.avg_temp})

    if recent.avg_current > 1.5:
        health_score -= 25
        alerts.append({"type": "HIGH_MOTOR_CURRENT", "value": recent.avg_current})

    card_error_rate = recent.card_errors / max(recent.total_events, 1)
    if card_error_rate > 0.05:
        health_score -= 20
        alerts.append({"type": "CARD_READER_ERRORS", "rate": card_error_rate})

    return {"machineId": machine_id, "healthScore": health_score, "alerts": alerts}

Outcome: Proactive maintenance reduces machine downtime from an average of 8 hours (reactive) to 2 hours (scheduled maintenance before failure). This translates to ~$15 per machine per month in recovered revenue.


What is Expected at Each Level

Mid-Level

  • Design basic machine registration and inventory tracking with a relational database
  • Understand the need for telemetry ingestion from IoT devices
  • Basic lease CRUD operations
  • Know that payments happen at the machine and need to be reported back

Senior

  • Design the IoT telemetry pipeline (MQTT + Kafka + TimescaleDB)
  • Offline-first payment architecture with store-and-forward
  • Real-time dashboard with SSE and Redis Pub/Sub
  • Inventory tracking with low-stock alerts (event-driven)
  • Lease lifecycle management with revenue share calculations
  • Back-of-envelope for telemetry throughput and storage

Staff+

  • Predictive inventory management with demand forecasting
  • Restocking route optimization (TSP, clustering, dynamic re-routing)
  • Machine health prediction using telemetry anomaly detection
  • Streaming analytics with Flink/Kafka Streams for real-time anomaly detection
  • Multi-region fleet management (machines across different time zones and regulations)
  • Edge computing considerations: what processing should happen on the machine vs in the cloud?
  • Cost optimization: cellular data costs for 50K machines, choosing between 4G and NB-IoT
  • Security: machine authentication (hardware tokens), preventing telemetry spoofing, PCI compliance for payment data
