HLD: URL Shortener (Bit.ly style) ​

L4 scoping note. This is a classic "backend service for thousands to millions of users" problem. The interviewer does NOT want a blueprint for planet-scale anycast CDN plus multi-region Spanner. They want a service that reliably shortens URLs, redirects them quickly, and captures basic analytics. Keep the scope honest -- don't start by sharding across 50 regions.


Understanding the Problem ​

What is a URL Shortener? ​

A URL shortener takes a long URL like https://www.example.com/articles/2024/deep-dive-on-distributed-systems?ref=newsletter and produces a short alias like https://sho.rt/aB3xY7. When a user visits the short URL, the service redirects them to the original. Bit.ly, TinyURL, and goo.gl (RIP) are examples. The interview value is high because it exposes your thinking around ID generation, caching, read-heavy workloads, and analytics -- all in one bounded problem.

Functional Requirements ​

Core (above the line):

  1. Shorten a URL -- given a long URL, produce a short code. Optionally accept a custom alias.
  2. Redirect -- given a short code, redirect the user to the original long URL.
  3. Expiration -- URLs can be created with an optional expiration timestamp; after expiry, the redirect returns 404.
  4. Analytics -- track click count and a basic geo breakdown (country-level) per short code.

Below the line (out of scope):

  • User accounts, auth, multi-tenant dashboards (this is a backend service, not a product surface)
  • Preview pages / safety scanning for malicious URLs (real Bit.ly does this; skip it)
  • A/B testing and deep-link routing to mobile apps
  • Editing / re-pointing existing short codes (adds consistency complexity for marginal value)

Non-Functional Requirements ​

Core:

  1. Low-latency redirects -- p99 < 100 ms for the redirect path. This is the hot path and what users feel.
  2. High availability -- 99.95% on redirects. Shortening can tolerate more downtime; redirects cannot.
  3. Scale -- 100M short URLs total, 10M new URLs/month, 1B redirects/month (~400 QPS average, ~2000 QPS peak). Read-to-write ratio roughly 100:1.
  4. Uniqueness + Collision Safety -- no two long URLs should accidentally map to the same code, and a code must never point to the wrong URL.
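
The scale numbers above reduce to quick arithmetic worth doing out loud. A sketch (the 5x peak factor is an assumption to match the stated ~2000 QPS peak):

```typescript
// Back-of-envelope for the redirect path, from the requirements above.
const redirectsPerMonth = 1_000_000_000;
const secondsPerMonth = 30 * 24 * 3600;             // ~2.6M seconds

const avgQps = redirectsPerMonth / secondsPerMonth; // ~386, call it ~400
const peakQps = avgQps * 5;                         // assumed 5x peak factor -> ~2000

const newUrlsPerMonth = 10_000_000;
const readWriteRatio = redirectsPerMonth / newUrlsPerMonth; // 100:1
```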

Below the line:

  • Global sub-20ms latency from every continent (out of scope unless problem says so)
  • Strong consistency on analytics (eventual is fine; a click count lagging by 30s is acceptable)

L4 sanity check: 2000 QPS peak is moderate traffic. A single well-tuned PostgreSQL instance with proper indexes plus a Redis cache in front can absorb this. Resist the urge to reach for Spanner or Bigtable from the start.


The Set Up ​

Core Entities ​

Entity             | Description
-------------------|------------
ShortURL           | The mapping itself: shortCode, longUrl, createdAt, expiresAt, ownerId (optional)
ClickEvent         | A single redirect event: shortCode, timestamp, country, userAgent, referrer
AnalyticsAggregate | Pre-computed per-code rollups: shortCode, totalClicks, clicksByCountry

The API ​

Create a short URL:

POST /api/urls
Content-Type: application/json

Request:
{
  "longUrl": "https://example.com/some/very/long/path",
  "customAlias": "my-link",          // optional
  "expiresAt": "2026-12-31T00:00:00Z" // optional
}

Response: 201 Created
{
  "shortCode": "aB3xY7",
  "shortUrl": "https://sho.rt/aB3xY7",
  "longUrl": "https://example.com/some/very/long/path",
  "expiresAt": "2026-12-31T00:00:00Z"
}

POST because creating a resource. Return 201, not 200. If the custom alias is taken, return 409 Conflict.

Redirect:

GET /{shortCode}

Response: 302 Found
Location: https://example.com/some/very/long/path
Cache-Control: private, max-age=0

302 (not 301) -- explained below. 404 if the code is unknown or expired.

Get analytics:

GET /api/urls/{shortCode}/analytics

Response: 200 OK
{
  "shortCode": "aB3xY7",
  "totalClicks": 14203,
  "clicksByCountry": { "US": 8012, "IN": 3200, "DE": 900, "...": 2091 },
  "createdAt": "2026-01-15T10:00:00Z"
}

High-Level Design ​

Flow 1: Creating a Short URL ​

[Client] -> [API Gateway] -> [URL Service] -> [ID Generator]
                                  |
                                  v
                             [PostgreSQL: urls table]
                                  |
                                  v
                             [Redis cache: SET shortCode -> longUrl]
  1. Client POSTs the long URL to /api/urls.
  2. URL Service validates the input (is it a real URL, is the scheme allowed, is the custom alias legal, is it taken).
  3. If no custom alias, URL Service asks the ID Generator for a new short code. (See deep dive on Base62 vs counter.)
  4. URL Service writes (shortCode, longUrl, createdAt, expiresAt) to PostgreSQL. The shortCode column has a UNIQUE constraint -- this is our collision guard.
  5. URL Service writes through to Redis: SET url:aB3xY7 {longUrl, expiresAt} with a TTL matching expiresAt (or 30 days default).
  6. Return the short URL to the client.
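
Step 2's validation factors out into a small pure function. A sketch, where the scheme allowlist and alias policy are assumptions rather than part of the spec (alias uniqueness is still the DB constraint's job):

```typescript
// Hypothetical input validation for POST /api/urls (step 2 above).
const ALLOWED_SCHEMES = new Set(["http:", "https:"]);
const ALIAS_RE = /^[0-9A-Za-z_-]{4,32}$/; // assumed alias policy

// Returns null when valid, or a human-readable error message.
function validateCreate(longUrl: string, customAlias?: string): string | null {
  let parsed: URL | null = null;
  try {
    parsed = new URL(longUrl); // throws on malformed input
  } catch {
    // fall through with parsed === null
  }
  if (parsed === null) return "longUrl is not a valid URL";
  if (!ALLOWED_SCHEMES.has(parsed.protocol)) {
    return `scheme ${parsed.protocol} not allowed`;
  }
  if (customAlias !== undefined && !ALIAS_RE.test(customAlias)) {
    return "customAlias must be 4-32 chars of [0-9A-Za-z_-]";
  }
  return null; // valid; alias uniqueness is enforced by the UNIQUE constraint
}
```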

Flow 2: Redirect (the hot path) ​

[Browser] --GET /aB3xY7--> [CDN edge] --miss--> [API Gateway] -> [Redirect Service]
                                                                        |
                                                                        v
                                                                   [Redis]
                                                                        |
                                                            miss ------ hit
                                                              |          |
                                                              v          v
                                                        [PostgreSQL]  respond 302
                                                              |
                                                              v
                                                        respond 302 + async click log -> [Kafka]
  1. Browser issues GET /aB3xY7.
  2. Optional CDN layer. For popular links, the CDN serves the 302 directly. For L4 you can mention CDN as a possible optimization; you are not required to design it.
  3. Redirect Service looks up aB3xY7 in Redis. Cache hit: ~90%+ for active links.
  4. On cache miss, fall back to PostgreSQL SELECT long_url, expires_at FROM urls WHERE short_code = ?. Populate Redis.
  5. If expired or not found: return 404.
  6. Otherwise return 302 with the long URL in Location.
  7. Asynchronously publish a click event to Kafka. This must not block the redirect response.
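
Steps 3-6 are a classic read-through lookup. A sketch with the cache and database abstracted as async lookups so the control flow is explicit; `resolve` and the lookup signatures are illustrative, not a prescribed API:

```typescript
// Read-through resolution for the redirect hot path (steps 3-6 above).
interface UrlRecord { longUrl: string; expiresAt?: number /* epoch ms */ }
type Lookup = (code: string) => Promise<UrlRecord | undefined>;

async function resolve(
  code: string,
  cacheGet: Lookup,
  cacheSet: (code: string, rec: UrlRecord) => Promise<void>,
  dbGet: Lookup,
  now: number = Date.now(),
): Promise<{ status: 302; location: string } | { status: 404 }> {
  let rec = await cacheGet(code);
  if (rec === undefined) {
    rec = await dbGet(code);                            // cache miss -> DB
    if (rec !== undefined) await cacheSet(code, rec);   // populate the cache
  }
  if (rec === undefined) return { status: 404 };        // unknown code
  if (rec.expiresAt !== undefined && rec.expiresAt <= now) {
    return { status: 404 };                             // expired
  }
  return { status: 302, location: rec.longUrl };
}
```

Step 7 (publishing the click event) happens after the response is produced, fire-and-forget, so it never appears in this function's latency.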

Flow 3: Analytics ​

  1. Kafka topic click_events receives every click.
  2. A small consumer (call it AnalyticsAggregator) reads the stream and updates per-code counters. Two options:
    • Write individual events to a columnar store (ClickHouse / BigQuery) and compute aggregates on query.
    • Maintain rolling aggregates in Redis (HINCRBY count:aB3xY7 US 1) and flush to PostgreSQL every minute.
  3. For L4 traffic (1B clicks/month ~= 400/sec avg), Redis + periodic flush is completely sufficient.

Flow 4: Expiration Cleanup ​

  1. A daily cron job DELETE FROM urls WHERE expires_at < NOW() - INTERVAL '30 days'.
  2. Redis entries naturally expire via TTL.
  3. Don't over-think this; expired redirects return 404 from the cache miss path even before cleanup runs.

Potential Deep Dives ​

1) How do we generate unique short codes? ​

This is THE deep dive for this problem. You should be able to compare three approaches from memory.

Bad Solution: Random generation with collision retry ​

  • Approach: Generate 6 random Base62 characters. Check DB. If it exists, retry.
  • Challenges: As the table fills up, collision rate climbs. At 100M URLs out of 62^6 = 56.8B slots, collision rate is ~0.2% -- tolerable. But every create needs a round-trip to check. Under concurrency, two writers can both check, both see "free," and both write. You need DB-level uniqueness constraints and retry-on-conflict logic. Works, but wasteful.
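
The ~0.2% figure above falls straight out of the slot arithmetic:

```typescript
// Chance that one freshly generated 6-char code hits an occupied slot.
const slots = 62 ** 6;           // ~5.68e10 possible codes
const used = 100_000_000;        // 100M existing URLs
const pCollision = used / slots; // ~0.0018, i.e. roughly 0.2%
```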

Good Solution: Hash the long URL ​

  • Approach: derive the code from a hash of the URL -- e.g. Base62-encode the first few bytes of md5(longUrl) into 6 characters. Deterministic -- the same URL always produces the same code, which is a nice property ("submit the same link twice, get the same short URL").
  • Challenges: Collisions between different long URLs hashing to the same prefix. You still need the UNIQUE constraint and a fallback (e.g., append a salt and re-hash). The "same URL, same code" property fights against custom aliases and per-user scoping.
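
A minimal sketch of the hash approach, assuming Node's crypto module; `hashCode` is a hypothetical helper, and the collision fallback (salt-and-rehash) is omitted:

```typescript
import { createHash } from "node:crypto";

const ALPHABET =
  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// Derive a fixed-length 6-char code from the first 8 bytes of md5(longUrl).
function hashCode(longUrl: string): string {
  const digest = createHash("md5").update(longUrl).digest();
  let n = digest.readBigUInt64BE(0); // first 64 bits of the hash
  let out = "";
  for (let i = 0; i < 6; i++) {      // always emit exactly 6 characters
    out = ALPHABET[Number(n % 62n)] + out;
    n /= 62n;
  }
  return out;
}
```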

Great Solution: Counter-based with Base62 encoding ​

  • Approach: A monotonically increasing 64-bit counter. Encode as Base62 -> 6-7 character codes. No collisions by construction.
    • Single DB autoincrement doesn't scale for writes. Use a batch allocation pattern: each URL Service instance reserves a block of 10,000 IDs from a central counter table, then issues them locally without contention. When the block is exhausted, reserve another.
    • Alternative: Snowflake-style IDs (timestamp + machine + sequence) then Base62-encode.
java
// Base62 encoding
private static final String ALPHABET =
  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

public static String encode(long id) {
    StringBuilder sb = new StringBuilder();
    while (id > 0) {
        sb.append(ALPHABET.charAt((int)(id % 62)));
        id /= 62;
    }
    return sb.reverse().toString();
}

// Block allocator: each instance reserves a block of IDs from a central
// counter table, then issues them locally without contention.
class IdAllocator {
    private long next = 0;
    private long blockEnd = 0; // exclusive end of the reserved block

    // synchronized already guards the state, so plain longs suffice
    synchronized long nextId() {
        if (next >= blockEnd) {
            // reserveBlockFromDb atomically advances the central counter,
            // e.g. UPDATE id_blocks SET next = next + 10000 RETURNING next
            long start = reserveBlockFromDb(10_000);
            next = start;
            blockEnd = start + 10_000;
        }
        return next++;
    }
}
cpp
#include <algorithm> // std::reverse
#include <cstdint>
#include <string>

static const std::string ALPHABET =
  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

std::string encode(uint64_t id) {
    std::string out;
    while (id > 0) {
        out.push_back(ALPHABET[id % 62]);
        id /= 62;
    }
    std::reverse(out.begin(), out.end());
    return out;
}
typescript
const ALPHABET =
  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

export function encode(id: bigint): string {
  let s = "";
  while (id > 0n) {
    s = ALPHABET[Number(id % 62n)] + s;
    id /= 62n;
  }
  return s;
}
  • Challenges: Sequential codes leak volume (competitor can guess your growth). If that matters, XOR the counter with a fixed secret before encoding, or use a Feistel network to pseudo-randomize while preserving 1:1. For L4, a note that "we can obfuscate if needed" is enough.
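
The XOR obfuscation mentioned above is a one-liner: because x ^ secret ^ secret == x, it is a 1:1 mapping and the same function decodes. The SECRET value here is an arbitrary example, not a recommendation:

```typescript
const SECRET = 0x5deece66dn;   // arbitrary fixed secret (example value only)
const MASK = (1n << 64n) - 1n; // keep results within 64 bits

// Bijective obfuscation of the counter: applying it twice returns the input,
// so the same function serves as both encode and decode.
function obfuscate(id: bigint): bigint {
  return (id ^ SECRET) & MASK;
}
```

The obfuscated value is then Base62-encoded as usual, so sequential counter values no longer produce visibly sequential codes.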

Trade-off triangle: predictability vs. simplicity vs. throughput. Counter = simplest and fastest but predictable. Random = unpredictable but collision-prone. Hash = deterministic but has collision edge cases.

2) How do we handle the read-heavy workload? ​

Bad Solution: Hit PostgreSQL on every redirect ​

At 2000 QPS peak, a single Postgres instance with a B-tree index on short_code can handle it -- each lookup is a few ms. It works until you hit a traffic spike or a hot link gets embedded in a viral tweet. One URL doing 50K QPS will melt a connection pool.

Good Solution: Redis cache in front of PostgreSQL ​

  • LRU eviction, 5-10 GB of Redis memory holds millions of hot mappings (each entry ~200 bytes).
  • 90%+ hit rate on typical URL traffic (skewed power-law distribution).
  • Redis sub-millisecond, Postgres fallback single-digit ms.
  • Write-through: on create, populate both. On miss, read-through from Postgres and populate.

Back-of-envelope: 100M URLs * 200 bytes = 20 GB. We don't need all of them in cache. The top 10% by traffic covers 90%+ of reads -- that's 2 GB. Comfortably fits on one Redis node.

Great Solution: Tiered caching with CDN for hot URLs ​

  • CDN edge cache for extremely popular links (the kind that trend on Reddit). Set Cache-Control: public, max-age=300 for non-expiring URLs. The 302 + Location header is tiny (< 1 KB) and cacheable.
  • Redis for the warm tier.
  • PostgreSQL for the cold tier (anything not touched in the last 30 days).
  • Invalidation concern: if a URL expires or is deleted, CDN entries linger. Mitigate with short max-age (5 min) and accept the staleness window.

L4 note: For most interviews, Redis + Postgres is plenty. Mention CDN as an optional optimization for hot links; don't design the CDN invalidation protocol unless asked.

3) Database choice: PostgreSQL vs KV store? ​

Good Solution: PostgreSQL ​

  • Schema: urls(short_code PK, long_url TEXT, created_at, expires_at, owner_id) + click_events + analytics_aggregates.
  • Indexes: primary key on short_code, optional secondary on owner_id for "my links" queries.
  • Why Postgres: ACID, simple SQL, mature tooling, easy to operate. At 100M rows, a B-tree index on short_code is only 3-4 levels deep and its upper levels stay in buffer cache, so a lookup costs a handful of page reads. Fast.
  • Sharding (if needed at higher scale): shard by hash(short_code) mod N. Not required for L4 scale.

Good Alternative: DynamoDB / Cassandra / Bigtable ​

  • Pure KV access pattern maps perfectly to a KV store. Partition key = short_code.
  • Pros: trivial horizontal scaling, high availability, no schema migrations.
  • Cons: aggregation queries for analytics are awkward (you'd push to a separate analytics store anyway). More ops overhead if you're not already on AWS/GCP.

Recommendation for L4: PostgreSQL is the safer choice. It's "boring technology that works." Only pick a KV store if the interviewer explicitly steers you that way or the scale jumps to tens of billions of URLs.

4) 301 vs 302 redirects ​

A surprisingly important detail interviewers love.

  • 301 Permanent: browser caches the redirect aggressively -- future hits go straight from browser cache to the long URL, skipping your server. Great for load. Terrible for analytics -- you never see the click.
  • 302 Found: browser asks your server every time. You capture every click. Slightly higher load.

Recommendation: 302. We want the click analytics. The extra load is bounded and we already have caching layers. Explicitly call this out to the interviewer -- they will be listening for it.

5) Analytics write path without blocking redirects ​

Bad Solution: Synchronous DB write on every click ​

Every redirect incurs a DB write. You've turned a read-heavy service into a write-heavy one. At 2000 QPS peak, you're doing 2000 writes/sec for analytics. Wasteful and it adds latency to the hot path.

Good Solution: In-memory counter + periodic flush ​

Each Redirect Service instance keeps an in-memory Map<shortCode, count>. Every 10 seconds, flush to Redis with HINCRBY. Every minute, Redis aggregator flushes to Postgres.

  • Pro: near-zero latency impact on redirect.
  • Con: if an instance crashes between flushes, you lose ~10 seconds of counts. Acceptable for analytics.
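
The in-memory counter is a few lines. A sketch -- `ClickCounter` and the `flushFn` signature are illustrative; in practice a 10-second timer calls flush and `flushFn` issues the HINCRBY:

```typescript
// Per-instance click counter with periodic flush (the Good Solution above).
class ClickCounter {
  private counts = new Map<string, number>();

  // Called on the hot path; pure in-memory, no I/O.
  record(shortCode: string): void {
    this.counts.set(shortCode, (this.counts.get(shortCode) ?? 0) + 1);
  }

  // Called every ~10s by a timer. Swaps the map first so clicks recorded
  // during the flush land in the next batch instead of being lost.
  async flush(flushFn: (code: string, n: number) => Promise<void>): Promise<void> {
    const batch = this.counts;
    this.counts = new Map();
    for (const [code, n] of batch) await flushFn(code, n);
  }
}
```

The crash window is exactly the swapped-but-unflushed batch, which matches the stated ~10 seconds of acceptable loss.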

Great Solution: Kafka + streaming aggregator ​

  • Redirect Service publishes a click event to Kafka (topic: clicks, key: shortCode) in the background (fire-and-forget with bounded in-memory buffer, drop on overflow).
  • A Kafka Streams / Flink job consumes, aggregates per-code per-minute, and writes to ClickHouse for rich analytics and to Postgres/Redis for the "total clicks" counter.
  • Geo enrichment happens in the stream (IP -> country via MaxMind GeoIP).

L4 scoping: mention Kafka. You don't need to pick Flink vs Kafka Streams vs Beam; just note "a stream processor consumes and aggregates." At 400 clicks/sec average, Kafka is overkill from a throughput standpoint, but it gives clean decoupling -- a worthy architectural choice.

lua
-- Redis Lua: atomic increment with geo breakdown
local key = KEYS[1]
local country = ARGV[1]
redis.call('HINCRBY', key, 'total', 1)
redis.call('HINCRBY', key, 'country:' .. country, 1)
redis.call('EXPIRE', key, 86400 * 90)  -- retain 90 days
return 1

6) Reliability and failure modes ​

  • Postgres down: Redirect Service serves from Redis. Shortening returns 503 (write path requires the DB). Acceptable.
  • Redis down: Redirect Service falls through to Postgres directly. Latency climbs but service stays up. Put a circuit breaker so we don't hammer Redis while it's recovering.
  • Kafka down: Click events are dropped (or buffered briefly). Redirects still work. Analytics lags. This is the right trade-off -- analytics is eventually consistent.
  • Regional outage: If you're in one region, users from other continents just see higher latency, not outages. Multi-region is explicit scope expansion; call it out but don't design it unless asked.

7) Custom aliases ​

  • Need a uniqueness check on write -- Postgres UNIQUE constraint does it.
  • Need to guard against malicious aliases (reserved words, profanity, impersonation of popular brands). Maintain a denylist.
  • Custom aliases skip the ID generator entirely. Just INSERT INTO urls (short_code, long_url) VALUES (?, ?) and let the UNIQUE constraint reject duplicates.
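
The alias guard can be a pure function run before the INSERT. A sketch -- the denylist entries and length rules are assumptions, and real systems add fuzzy matching for brand impersonation, which is out of scope here:

```typescript
// Hypothetical custom-alias guard (denylist entries are example values).
const DENYLIST = new Set(["admin", "api", "login", "paypal"]);

function aliasAllowed(alias: string): boolean {
  const normalized = alias.toLowerCase();    // case-insensitive matching
  if (DENYLIST.has(normalized)) return false;
  // assumed policy: 4-32 chars of [0-9a-z_-]; fuzzy brand matching omitted
  return /^[0-9a-z_-]{4,32}$/.test(normalized);
}
```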

What is Expected at Each Level ​

L3 / Mid-level ​

  • Come up with the create/redirect API. Propose random or hash-based short codes. Identify that we need a DB and probably a cache. Basic Redis-in-front-of-Postgres architecture. Might miss the 301-vs-302 nuance or the async analytics path.

L4 ​

  • Everything above, plus:
    • Counter-based ID generation with block allocation -- and can explain why it's better than random.
    • Explicit 302 choice with justification.
    • Asynchronous analytics via Kafka (or at least "decouple it from the hot path").
    • Back-of-envelope: cache size, QPS, row count, index performance.
    • Clear reasoning about Postgres being sufficient, with awareness of when you'd move to a KV store.
    • Reliability behavior when Redis / Kafka fails.

L5 / Senior ​

  • Discuss multi-region design, cross-region replication lag for the write path, eventual consistency on custom alias uniqueness.
  • Design the analytics pipeline end-to-end including schema evolution (Avro/Proto in Kafka).
  • Anticipate hot-key issues at the cache layer and discuss local-cache fallback.
  • Operational concerns: p99 monitoring, SLO burn rate alerts, capacity planning, backup strategy for the metadata DB.
  • Cost thinking: CDN bandwidth vs cache node cost vs DB read QPS.
