HLD: URL Shortener (Bit.ly) ​

Understanding the Problem ​

What is a URL Shortener? ​

A URL shortener takes a long URL (like https://example.com/products/electronics/headphones?sort=price&filter=wireless&page=3) and creates a short alias (like https://short.ly/a3Xk9) that redirects to the original. It seems simple on the surface, but at scale the challenges are interesting: generating globally unique short codes without collisions, handling massive redirect traffic with minimal latency, and tracking analytics on every click.

Functional Requirements ​

Core (above the line):

  1. Shorten a URL -- given a long URL, generate a unique short URL
  2. Redirect -- when a user visits the short URL, redirect them to the original long URL
  3. Custom aliases -- users can optionally specify a custom short code (e.g., short.ly/my-brand)
  4. Analytics -- track click count, referrer, geo, device for each short URL

Below the line (mention but don't design):

  • User accounts and URL management dashboard
  • URL expiration and deletion
  • QR code generation
  • Link previews
  • Bulk URL shortening API

Non-Functional Requirements ​

  1. Low redirect latency -- redirects must complete in < 50ms (P99), since every millisecond adds friction
  2. High availability -- 99.99% uptime for redirects (write path can tolerate slightly lower availability)
  3. Scale -- 100M new URLs/month, 10B redirects/month (~3,800 redirects/sec average, 20K/sec peak)
  4. Durability -- once a short URL is created, the mapping must never be lost
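
A quick back-of-envelope behind the scale numbers in requirement 3 (a sketch; the 30-day month and the roughly 5x peak factor are assumptions consistent with the figures above):

python
SECONDS_PER_MONTH = 30 * 24 * 3600                    # ~2.6M seconds

writes_per_sec = 100_000_000 / SECONDS_PER_MONTH      # ~39 URL creations/sec
reads_per_sec = 10_000_000_000 / SECONDS_PER_MONTH    # ~3,858 redirects/sec
peak_reads_per_sec = 5 * reads_per_sec                # ~19K/sec, close to the 20K peak above

print(round(writes_per_sec), round(reads_per_sec), round(peak_reads_per_sec))
# ~39 writes/sec vs ~3,858 reads/sec -- reads outnumber writes roughly 100:1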

The Set Up ​

Core Entities ​

  • URL -- shortCode, longUrl, userId (optional), createdAt, expiresAt (optional), clickCount
  • Click -- shortCode, timestamp, referrer, userAgent, ipAddress, country, device
  • User -- id, email, apiKey, tier (free/pro)

API Design ​

Shorten a URL:

POST /api/urls
Authorization: Bearer <token>   (optional; omitted for anonymous requests)

Request:
{
  "longUrl": "https://example.com/very/long/path?with=params",
  "customAlias": "my-link",   // optional
  "expiresAt": "2025-01-01T00:00:00Z"  // optional
}

Response: 201 Created
{
  "shortCode": "my-link",
  "shortUrl": "https://short.ly/my-link",
  "longUrl": "https://example.com/very/long/path?with=params",
  "createdAt": "2024-01-15T10:30:00Z",
  "expiresAt": "2025-01-01T00:00:00Z"
}

Redirect (the hot path):

GET /{shortCode}
(e.g., GET /a3Xk9)

Response: 301 Moved Permanently (or 302 Found)
Location: https://example.com/very/long/path?with=params

Get URL analytics:

GET /api/urls/{shortCode}/analytics?period=7d

Response: 200 OK
{
  "shortCode": "a3Xk9",
  "totalClicks": 14523,
  "clicksByDay": [
    { "date": "2024-01-15", "clicks": 2301 },
    { "date": "2024-01-14", "clicks": 1987 }
  ],
  "topCountries": [
    { "country": "US", "clicks": 8200 },
    { "country": "UK", "clicks": 2100 }
  ],
  "topReferrers": [
    { "referrer": "twitter.com", "clicks": 5400 }
  ]
}

High-Level Design ​

Flow 1: Creating a Short URL ​

[Client] --> [API Gateway] --> [URL Service] --> [Short Code Generator]
                                    |                    |
                                    v                    v
                              [PostgreSQL]         [Redis Counter / ZooKeeper]
                                    |
                                    v
                              [Redis Cache: shortCode -> longUrl]

Step-by-step:

  1. Client submits a long URL via POST /api/urls
  2. API Gateway authenticates (if token provided), applies rate limiting (100 URLs/hour for free tier)
  3. URL Service validates the long URL (well-formed, not on a blocklist, reachable via HEAD request)
  4. If a custom alias is provided:
    • Check Redis and PostgreSQL for a collision
    • If it is available, use it as the short code
    • If it is taken, return 409 Conflict
  5. If no custom alias, Short Code Generator generates a unique short code (see Deep Dive 1)
  6. Write the mapping to PostgreSQL: INSERT INTO urls (short_code, long_url, user_id, created_at)
  7. Write to Redis cache: SET url:a3Xk9 "https://example.com/..." EX 86400 (24-hour TTL, refreshed on access)
  8. Return the short URL to the client
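
A minimal sketch of steps 3-8, assuming hypothetical db, cache, and codegen clients (helper names like validate_long_url are illustrative, not a specific library):

python
def shorten(long_url, user_id=None, custom_alias=None):
    validate_long_url(long_url)                      # well-formed, not on the blocklist

    if custom_alias:
        short_code = custom_alias                    # availability check: see Deep Dive 4
    else:
        short_code = codegen.next_code()             # unique code: see Deep Dive 1

    # Durable write first, then cache (cache-aside on the read path)
    db.insert("INSERT INTO urls (short_code, long_url, user_id, created_at) "
              "VALUES (?, ?, ?, NOW())", short_code, long_url, user_id)
    cache.set(f"url:{short_code}", long_url, ex=86400)   # 24-hour TTL, refreshed on access

    return f"https://short.ly/{short_code}"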

Flow 2: Redirecting (The Hot Path) ​

[Client] --> [CDN/Edge] --> [Redirect Service] --> [Redis Cache]
                                                        |
                                                   (cache miss)
                                                        v
                                                  [PostgreSQL]
                                                        |
                                                  [Kafka: click_events]
                                                        v
                                                  [Click Analytics Pipeline]

Step-by-step:

  1. User clicks https://short.ly/a3Xk9. DNS resolves to our CDN (CloudFront)
  2. The CDN checks its edge cache. If the URL was recently accessed and we served a 301, the redirect is returned directly from the edge -- no origin hit.
  3. On CDN cache miss: request goes to Redirect Service
  4. Redirect Service looks up url:a3Xk9 in Redis
  5. If Redis hit: return 301/302 redirect with the Location header
  6. If Redis miss: query PostgreSQL, populate Redis, return redirect
  7. Asynchronously (non-blocking): publish a click event to Kafka with metadata (timestamp, IP, user-agent, referrer)
  8. Click Analytics Pipeline (Kafka consumer) processes click events:
    • Geo-lookup from IP (MaxMind database)
    • Device/browser parsing from user-agent
    • Write to ClickHouse (columnar analytics DB) for fast aggregation queries
    • Increment clickCount in Redis: INCR clicks:a3Xk9
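
A sketch of the hot path (steps 4-7), again with hypothetical cache, db, and kafka clients; the Kafka publish is fire-and-forget so it never blocks the redirect:

python
def handle_redirect(short_code, request):
    long_url = cache.get(f"url:{short_code}")         # Redis first (~1-2ms)
    if long_url is None:
        long_url = db.get_long_url(short_code)        # fall back to PostgreSQL (~10-15ms)
        if long_url is None:
            return not_found()                        # 404 -- unknown short code
        cache.set(f"url:{short_code}", long_url, ex=86400)

    # Fire-and-forget click event; the analytics pipeline consumes it asynchronously
    kafka.publish("click_events", {
        "short_code": short_code,
        "timestamp": now_iso(),
        "ip": request.ip,
        "user_agent": request.user_agent,
        "referrer": request.referrer,
    })
    return redirect_response(302, long_url)           # or 301 -- see Deep Dive 2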

Flow 3: Analytics Query ​

  1. Client calls GET /api/urls/{shortCode}/analytics?period=7d
  2. Analytics Service queries ClickHouse:
    sql
    SELECT toDate(timestamp) as date, count() as clicks
    FROM clicks
    WHERE short_code = 'a3Xk9' AND timestamp > now() - INTERVAL 7 DAY
    GROUP BY date ORDER BY date
  3. For top-level metrics (total clicks), read from Redis counter for speed
  4. Return aggregated analytics
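
A sketch of the analytics endpoint, assuming a hypothetical ClickHouse client and the Redis counter written by the click pipeline:

python
def get_analytics(short_code, period_days=7):
    total_clicks = int(redis.get(f"clicks:{short_code}") or 0)   # real-time counter from Flow 2

    clicks_by_day = clickhouse.query(
        "SELECT toDate(timestamp) AS date, count() AS clicks "
        "FROM clicks "
        "WHERE short_code = %(code)s AND timestamp > now() - INTERVAL %(days)s DAY "
        "GROUP BY date ORDER BY date",
        {"code": short_code, "days": period_days},
    )
    return {"shortCode": short_code, "totalClicks": total_clicks, "clicksByDay": clicks_by_day}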

Potential Deep Dives ​

Deep Dive 1: URL Encoding / Short Code Generation Strategy ​

The Problem: We need to generate short, unique codes for 100M URLs/month. Over 5 years, that is 6 billion URLs. The code should be short (7 characters) and never collide.

Capacity check: Base62 (a-z, A-Z, 0-9) with 7 characters = 62^7 = 3.5 trillion possible codes. 6 billion URLs is only 0.17% of the space. Plenty of room.


Bad Solution -- MD5 hash truncation:

shortCode = base62(MD5(longUrl))[0:7]

MD5 gives 128 bits; we take the first ~42 bits (7 Base62 chars). Birthday problem: with 6B URLs and a 3.5T code space, the expected number of collisions is approximately n^2 / (2 * space) = (6*10^9)^2 / (2 * 3.5*10^12) ≈ 5 million. Collisions are guaranteed -- not just probable -- and we would need a retry loop to detect and resolve them, adding latency and complexity.

Also: same long URL always generates the same short code, which is a feature (dedup) but also a limitation (different users can't have different short codes for the same URL).
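
A quick sanity check on the birthday-problem estimate (expected collisions ≈ n^2 / (2 * space)):

python
n = 6 * 10**9            # URLs created over 5 years
space = 62**7            # ~3.5 trillion possible 7-character codes

expected_collisions = n**2 / (2 * space)
print(f"{expected_collisions:,.0f}")   # ~5,100,000 expected collisions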


Good Solution -- Counter-based with Base62 encoding: Use a global auto-incrementing counter. Each new URL gets the next number, encoded in Base62.

counter = 1000000  ->  base62(1000000) = "4c92"
counter = 1000001  ->  base62(1000001) = "4c93"

No collisions ever. But:

  • Sequential codes are predictable (users can enumerate: short.ly/4c92, short.ly/4c93)
  • Single counter is a bottleneck in a distributed system

Distributed counter approach: Use ZooKeeper or a counter service to pre-allocate ranges:

  • Server 1 gets range [1M, 2M)
  • Server 2 gets range [2M, 3M)
  • Each server increments locally, no coordination needed
  • When a server exhausts its range, it requests a new one
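
A sketch of a per-server generator under this scheme; allocate_range() (backed by ZooKeeper or a counter table) and the Base62 alphabet ordering are assumptions:

python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))          # base62_encode(1000000) == "4c92"

class RangeCodeGenerator:
    def __init__(self):
        # e.g. returns (1_000_000, 2_000_000) -- no coordination until the range runs out
        self.next_id, self.range_end = allocate_range()

    def next_code(self) -> str:
        if self.next_id >= self.range_end:               # range exhausted, fetch a new one
            self.next_id, self.range_end = allocate_range()
        code = base62_encode(self.next_id)
        self.next_id += 1
        return code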

Great Solution -- Snowflake-style ID with Base62: Generate a 64-bit unique ID:

| timestamp (41 bits) | datacenter (5 bits) | server (5 bits) | sequence (12 bits) |

Then encode in Base62. The ID is unique, non-sequential (hard to guess), and generated locally.

7 Base62 characters hold ~42 bits (62^7 ≈ 2^41.7). Truncating to the lower 42 bits of the Snowflake ID keeps the sequence, server, and datacenter bits plus only the low bits of the timestamp -- still unique within a millisecond per server, but the truncated timestamp eventually wraps, so a uniqueness check on insert (or encoding the full 64-bit ID as ~11 characters) is still needed.

To prevent predictability, we can also apply a bijective shuffle:

python
def encode(raw_id: int) -> str:
    # XOR with a fixed secret mask (a service-wide constant) to shuffle the output.
    # XOR is bijective, so distinct IDs still map to distinct codes.
    shuffled = raw_id ^ SECRET_MASK
    return base62_encode(shuffled)

Our choice: Counter-based with range allocation (simplest, no collisions, well-understood). Use the bijective shuffle for non-predictability.

Deep Dive 2: 301 vs 302 Redirects ​

301 Moved Permanently:

  • Browser caches the redirect. Next time the user clicks the short URL, the browser goes directly to the long URL without hitting our server.
  • Pros: Faster for repeat visits, reduces load on our servers
  • Cons: We lose analytics for repeat visits. If a user clicks the same short URL 10 times, we only see the first click.
  • CDN also caches 301s aggressively.

302 Found (Temporary Redirect):

  • Browser does NOT cache the redirect. Every click hits our server.
  • Pros: Full analytics on every click. We can also change the destination URL later.
  • Cons: Higher server load, slightly slower for users (extra hop every time)

Our decision:

  • Default: 302 for URLs where the creator has analytics enabled (most URLs). Analytics is a core feature.
  • Option: 301 for high-volume URLs where the creator does not care about analytics (saves server cost).
  • At the CDN level: use a short cache TTL (e.g., 5 minutes) even for 302s. This reduces origin load while still capturing most unique clicks.

Back-of-envelope impact:

  • 10B redirects/month with 302 = 3,800 QPS on our Redirect Service
  • With 301 + CDN caching (assume 80% cache hit rate): only 760 QPS hit origin. 5x reduction.
  • For a cost-sensitive service, 301 with CDN caching makes sense if analytics is secondary.
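
The same arithmetic for a few assumed CDN hit rates:

python
avg_redirects_per_sec = 3_800

for hit_rate in (0.0, 0.5, 0.8, 0.95):
    origin_qps = avg_redirects_per_sec * (1 - hit_rate)
    print(f"CDN hit rate {hit_rate:.0%}: ~{origin_qps:,.0f} QPS at the origin")
# 80% hit rate -> ~760 QPS, the 5x reduction quoted above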

Deep Dive 3: Analytics Tracking at Scale ​

The Problem: 10B clicks/month = ~3,800 events/sec. Each event has ~500 bytes of metadata. We need fast writes and fast aggregation queries.

Bad Solution -- Write to PostgreSQL synchronously: Insert into a clicks table on every redirect. This adds 5-10ms of synchronous work to every redirect (a large slice of the 50ms latency budget), and a single PostgreSQL instance with indexes strains under a sustained 3,800 inserts/sec.

Good Solution -- Async writes via Kafka:

  1. Redirect Service publishes a click event to Kafka (fire-and-forget, < 1ms)
  2. Kafka consumer batch-writes to the analytics store
  3. Zero impact on redirect latency

Great Solution -- Kafka + ClickHouse pipeline:

Why ClickHouse? It is a columnar database optimized for:

  • High-speed inserts (millions of rows/sec in batches)
  • Fast aggregation queries (GROUP BY country, GROUP BY day)
  • Efficient compression for time-series data

Pipeline:

Redirect Service --> Kafka (click_events topic, 20 partitions)
                        |
                  Kafka Consumer Group (10 consumers)
                        |
                  Batch insert to ClickHouse (every 5 seconds or 10K events)

ClickHouse schema:

sql
CREATE TABLE clicks (
    short_code String,
    timestamp DateTime,
    country LowCardinality(String),
    referrer String,
    device LowCardinality(String),
    browser LowCardinality(String),
    os LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (short_code, timestamp);

Storage calculation: 10B clicks/month * 500 bytes * 12 months = 60 TB/year uncompressed. ClickHouse typically achieves 10x compression on this type of data = 6 TB/year. Very manageable.

For real-time counters (total clicks displayed on the dashboard): use Redis INCR on redirect. ClickHouse is for detailed analytics queries (top countries, click trends over time).
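
A sketch of the consumer's batching loop; kafka_consumer, clickhouse, redis, and the enrichment helpers (geo_lookup, parse_user_agent) are assumed clients, and the 5-second / 10K-event thresholds mirror the pipeline above:

python
import time

BATCH_SIZE = 10_000
FLUSH_INTERVAL_SEC = 5

def consume_clicks():
    batch, last_flush = [], time.monotonic()
    for event in kafka_consumer.poll_forever("click_events"):
        row = (
            event["short_code"],
            event["timestamp"],
            geo_lookup(event["ip"]),                   # country, via the MaxMind DB
            event.get("referrer", ""),
            *parse_user_agent(event["user_agent"]),    # (device, browser, os)
        )
        batch.append(row)
        redis.incr(f"clicks:{event['short_code']}")    # real-time total-clicks counter

        if len(batch) >= BATCH_SIZE or time.monotonic() - last_flush >= FLUSH_INTERVAL_SEC:
            clickhouse.insert("clicks", batch)         # one bulk insert per flush
            batch, last_flush = [], time.monotonic()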

Deep Dive 4: Custom Aliases and Collision Handling ​

The Problem: Users want custom aliases like short.ly/my-brand. How do we handle:

  1. Checking availability without race conditions
  2. Preventing abuse (squatting on popular names)
  3. Conflicts with auto-generated codes

Solution:

Namespace separation: Restrict auto-generated codes to a fixed shape (always exactly 7 Base62 characters). Custom aliases can be 3-30 characters, alphanumeric plus hyphens. To guarantee the two namespaces never overlap, reject (or force a hyphen into) any custom alias that matches the auto-generated shape, as shown in the sketch below.
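
A sketch of the format check that keeps the two namespaces apart; the regexes and ValidationError are illustrative assumptions:

python
import re

ALIAS_RE = re.compile(r"^[A-Za-z0-9-]{3,30}$")       # allowed custom-alias shape
AUTO_CODE_RE = re.compile(r"^[A-Za-z0-9]{7}$")       # shape reserved for generated codes

def validate_custom_alias(alias: str) -> None:
    if not ALIAS_RE.match(alias):
        raise ValidationError("Aliases are 3-30 characters: letters, digits, hyphens")
    if AUTO_CODE_RE.match(alias):
        # Looks exactly like an auto-generated code; require a hyphen or a
        # different length so the two namespaces can never collide.
        raise ValidationError("This format is reserved for generated codes")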

Availability check with locking:

python
def create_custom_alias(alias, long_url, user_id):
    # Acquire distributed lock
    lock = redis.lock(f"alias_lock:{alias}", timeout=5)
    if not lock.acquire(blocking=True, blocking_timeout=2):
        raise ConflictError("Try again")

    try:
        # Check if alias exists
        existing = db.query("SELECT 1 FROM urls WHERE short_code = ?", alias)
        if existing:
            raise ConflictError("Alias already taken")

        # Reserve it -- a UNIQUE constraint on short_code is the backstop
        # if the lock ever expires mid-request
        db.insert("INSERT INTO urls (short_code, long_url, user_id) VALUES (?, ?, ?)",
                   alias, long_url, user_id)
        redis.set(f"url:{alias}", long_url)
    finally:
        lock.release()

Abuse prevention:

  • Free tier: max 5 custom aliases per account
  • Profanity filter on aliases
  • Block trademarked terms (optional, legal requirement in some jurisdictions)
  • If an alias is unused (zero clicks for 12 months), allow reclaiming

Deep Dive 5: URL Expiration ​

The Problem: Some URLs should expire after a set time. How do we handle cleanup without scanning the entire table?

Bad Solution -- Cron job scanning all URLs:

sql
DELETE FROM urls WHERE expires_at < NOW() AND expires_at IS NOT NULL;

With billions of rows, this full-table scan is extremely expensive and blocks other operations.

Good Solution -- Lazy expiration (check on read): When a redirect request comes in, check if the URL has expired:

python
def redirect(short_code):
    url = cache.get(short_code) or db.get(short_code)
    if url is None:
        return not_found()                         # 404 -- unknown short code
    if url.expires_at and url.expires_at < now():
        return gone()                              # 410 Gone -- URL has expired
    return redirect_to(url.long_url)

This handles the user-facing behavior, but expired URLs still consume storage.

Great Solution -- Lazy expiration + background cleanup:

  1. Lazy expiration on read (as above) for immediate user experience
  2. Redis TTL on cached entries: set the Redis TTL to match the expiration time. Redis automatically evicts expired entries.
  3. Background cleanup: Partition the URLs table by created_at month. Run a monthly job that drops old partitions where all URLs have expired. This is an O(1) partition drop, not a row-by-row delete.

For the cleanup job:

sql
-- Find partitions in which every URL has already expired
-- (assumes per-partition metadata tracking max(expires_at))
SELECT partition_name
FROM url_partitions
WHERE max_expires_at < NOW();

-- Drop each fully expired partition: an O(1) metadata operation, not a row-by-row delete
ALTER TABLE urls DROP PARTITION p202301;

Deep Dive 6: High Availability for Redirects ​

The Problem: Redirects are the hot path. If the redirect service goes down, millions of users see errors. How do we achieve 99.99% availability?

Multi-layer redundancy:

  1. CDN layer (CloudFront): Cache redirects at edge locations globally. Even if our origin is down, cached redirects continue to work. For popular URLs (80% of traffic), CDN handles it.

  2. DNS failover: Use Route 53 with health checks. If the primary region is down, DNS fails over to a secondary region within 60 seconds.

  3. Multi-region deployment:

    • Primary region: us-east-1 (handles writes and reads)
    • Secondary region: eu-west-1 (read replica of PostgreSQL, Redis replica)
    • Both regions can serve redirects independently
    • Writes (URL creation) go to primary only -- this is acceptable since writes are 100x less frequent than reads
  4. Redis cluster with replication:

    • 3 Redis shards, each with 2 replicas
    • If a primary shard fails, a replica is promoted automatically
    • Total capacity: 3 shards * 200K ops/sec = 600K ops/sec (we need ~20K at peak)
  5. Graceful degradation: If Redis is down, fall back to PostgreSQL read replicas. Latency increases from 2ms to 15ms, but the service stays up.

Back-of-envelope for Redis sizing:

  • 6B total URLs over 5 years
  • Average entry: 7-byte key + 200-byte URL + overhead = ~300 bytes
  • Only cache active URLs (accessed in last 30 days). Assume 20% are active = 1.2B entries
  • 1.2B * 300 bytes = 360 GB. Spread across 3 shards = 120 GB per shard. That just fits a large Redis instance (r6g.4xlarge with 128 GB RAM), though with little headroom -- in practice you would size up or add a fourth shard.
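
The sizing math as a quick script (the 300-byte entry and 20% active fraction are the assumptions stated above):

python
total_urls = 6 * 10**9          # URLs over 5 years
active_fraction = 0.20          # accessed in the last 30 days
bytes_per_entry = 300           # key + URL + overhead
shards = 3

cache_bytes = total_urls * active_fraction * bytes_per_entry
print(f"total ~{cache_bytes / 10**9:,.0f} GB, ~{cache_bytes / shards / 10**9:,.0f} GB per shard")
# total ~360 GB, ~120 GB per shard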

What is Expected at Each Level ​

Mid-Level ​

  • Design the basic shorten and redirect flow with a database
  • Understand Base62 encoding for short code generation
  • Identify that redirect latency is critical and propose caching
  • Know the difference between 301 and 302 redirects

Senior ​

  • Design the counter-based short code generation with range allocation
  • Back-of-envelope for code space capacity (Base62, 7 chars, collision probability)
  • Redis caching layer for redirects with cache-aside pattern
  • Async analytics via Kafka
  • Handle custom aliases with collision detection
  • Discuss trade-offs of 301 vs 302 for analytics vs performance
  • URL expiration with lazy evaluation

Staff+ ​

  • Multi-region deployment for redirect high availability
  • CDN integration for edge caching of redirects
  • ClickHouse (or similar OLAP) for analytics at scale with storage calculations
  • Hot-spot detection for viral URLs (what if one short URL gets 1M clicks/minute?)
  • Abuse prevention: URL spam, phishing link detection, rate limiting per IP
  • Data retention policy and partition-based cleanup
  • Cost optimization: estimating infrastructure costs for 10B redirects/month
