HLD: URL Shortener (Bit.ly)
Understanding the Problem
What is a URL Shortener?
A URL shortener takes a long URL (like https://example.com/products/electronics/headphones?sort=price&filter=wireless&page=3) and creates a short alias (like https://short.ly/a3Xk9) that redirects to the original. It seems simple on the surface, but at scale the challenges are interesting: generating globally unique short codes without collisions, handling massive redirect traffic with minimal latency, and tracking analytics on every click.
Functional Requirements
Core (above the line):
- Shorten a URL -- given a long URL, generate a unique short URL
- Redirect -- when a user visits the short URL, redirect them to the original long URL
- Custom aliases -- users can optionally specify a custom short code (e.g., short.ly/my-brand)
- Analytics -- track click count, referrer, geo, device for each short URL
Below the line (mention but don't design):
- User accounts and URL management dashboard
- URL expiration and deletion
- QR code generation
- Link previews
- Bulk URL shortening API
Non-Functional Requirements
- Low redirect latency -- redirects must complete in < 50ms (P99), since every millisecond adds friction
- High availability -- 99.99% uptime for redirects (write path can tolerate slightly lower availability)
- Scale -- 100M new URLs/month, 10B redirects/month (~3,800 redirects/sec average, 20K/sec peak)
- Durability -- once a short URL is created, the mapping must never be lost
The Set Up
Core Entities
| Entity | Description |
|---|---|
| URL | shortCode, longUrl, userId (optional), createdAt, expiresAt (optional), clickCount |
| Click | shortCode, timestamp, referrer, userAgent, ipAddress, country, device |
| User | id, email, apiKey, tier (free/pro) |
API Design
Shorten a URL:
```
POST /api/urls
Authorization: Bearer <token> (optional for anonymous)
```

Request:

```json
{
  "longUrl": "https://example.com/very/long/path?with=params",
  "customAlias": "my-link",             // optional
  "expiresAt": "2025-01-01T00:00:00Z"   // optional
}
```

Response: 201 Created

```json
{
  "shortCode": "my-link",
  "shortUrl": "https://short.ly/my-link",
  "longUrl": "https://example.com/very/long/path?with=params",
  "createdAt": "2024-01-15T10:30:00Z",
  "expiresAt": "2025-01-01T00:00:00Z"
}
```

Redirect (the hot path):

```
GET /{shortCode}
(e.g., GET /a3Xk9)

Response: 301 Moved Permanently (or 302 Found)
Location: https://example.com/very/long/path?with=params
```

Get URL analytics:
```
GET /api/urls/{shortCode}/analytics?period=7d
```

Response: 200 OK

```json
{
  "shortCode": "a3Xk9",
  "totalClicks": 14523,
  "clicksByDay": [
    { "date": "2024-01-15", "clicks": 2301 },
    { "date": "2024-01-14", "clicks": 1987 }
  ],
  "topCountries": [
    { "country": "US", "clicks": 8200 },
    { "country": "UK", "clicks": 2100 }
  ],
  "topReferrers": [
    { "referrer": "twitter.com", "clicks": 5400 }
  ]
}
```

High-Level Design
Flow 1: Creating a Short URL
```
[Client] --> [API Gateway] --> [URL Service] --> [Short Code Generator]
                                     |                      |
                                     v                      v
                               [PostgreSQL]      [Redis Counter / ZooKeeper]
                                     |
                                     v
                    [Redis Cache: shortCode -> longUrl]
```

Step-by-step:
- Client submits a long URL via `POST /api/urls`
- API Gateway authenticates (if a token is provided), applies rate limiting (100 URLs/hour for free tier)
- URL Service validates the long URL (well-formed, not on a blocklist, reachable via HEAD request)
- If a custom alias is provided: (a) check Redis and PostgreSQL for a collision; (b) if available, use it as the short code; (c) if taken, return 409 Conflict
- If no custom alias, Short Code Generator generates a unique short code (see Deep Dive 1)
- Write the mapping to PostgreSQL: `INSERT INTO urls (short_code, long_url, user_id, created_at)`
- Write to the Redis cache: `SET url:a3Xk9 "https://example.com/..." EX 86400` (24-hour TTL, refreshed on access)
- Return the short URL to the client
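To make the write path concrete, here is a minimal sketch of the URL Service's create handler following the steps above. `db`, `cache`, `generate_short_code`, `ConflictError`, and the blocklist are assumed stand-ins for the real components, not a specific library API.

```python
import re

BLOCKLIST = {"malware.example"}   # assumed placeholder blocklist

def create_short_url(long_url, user_id=None, custom_alias=None):
    """Sketch of Flow 1: validate, pick a code, persist, cache."""
    if not re.match(r"^https?://", long_url) or any(b in long_url for b in BLOCKLIST):
        raise ValueError("invalid or blocked URL")

    if custom_alias:
        # Collision check; the unique constraint on short_code is the real guard
        if db.exists("SELECT 1 FROM urls WHERE short_code = %s", (custom_alias,)):
            raise ConflictError("alias already taken")       # surfaces as 409
        short_code = custom_alias
    else:
        short_code = generate_short_code()                    # see Deep Dive 1

    db.execute(
        "INSERT INTO urls (short_code, long_url, user_id, created_at) "
        "VALUES (%s, %s, %s, NOW())",
        (short_code, long_url, user_id))
    cache.set(f"url:{short_code}", long_url, ex=86400)        # 24h TTL, refreshed on access
    return f"https://short.ly/{short_code}"
```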
Flow 2: Redirecting (The Hot Path)
```
[Client] --> [CDN/Edge] --> [Redirect Service] --> [Redis Cache]
                                   |
                             (cache miss)
                                   v
                             [PostgreSQL]

[Redirect Service] --(async)--> [Kafka: click_events] --> [Click Analytics Pipeline]
```

Step-by-step:
- User clicks `https://short.ly/a3Xk9`. DNS resolves to our CDN (CloudFront)
- CDN checks its edge cache. If the URL was recently accessed and we used a 301, the CDN has the redirect cached and serves it directly from the edge -- no origin hit
- On a CDN cache miss, the request goes to the Redirect Service
- Redirect Service looks up `url:a3Xk9` in Redis
- If Redis hit: return a 301/302 redirect with the `Location` header
- If Redis miss: query PostgreSQL, populate Redis, return the redirect
- Asynchronously (non-blocking): publish a click event to Kafka with metadata (timestamp, IP, user-agent, referrer)
- Click Analytics Pipeline (Kafka consumer) processes click events:
  - Geo-lookup from IP (MaxMind database)
  - Device/browser parsing from the user-agent
  - Write to ClickHouse (columnar analytics DB) for fast aggregation queries
  - Increment clickCount in Redis: `INCR clicks:a3Xk9`
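A sketch of the hot path as code -- cache-aside lookup plus a fire-and-forget click event. `cache`, `db`, `kafka_producer`, and the response helpers are assumed stand-ins rather than a specific client library.

```python
import json
import time

def handle_redirect(short_code, request):
    """Sketch of Flow 2: Redis first, PostgreSQL on miss, async click event."""
    long_url = cache.get(f"url:{short_code}")
    if long_url is None:
        long_url = db.fetch_one(
            "SELECT long_url FROM urls WHERE short_code = %s", (short_code,))
        if long_url is None:
            return not_found_404()
        cache.set(f"url:{short_code}", long_url, ex=86400)    # repopulate the cache

    # Fire-and-forget: never block the redirect on analytics
    kafka_producer.send("click_events", json.dumps({
        "short_code": short_code,
        "ts": int(time.time() * 1000),
        "ip": request.ip,
        "user_agent": request.user_agent,
        "referrer": request.referrer,
    }).encode())

    return redirect_response(long_url)    # 301 vs 302: see Deep Dive 2
```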
Flow 3: Analytics Query
- Client calls `GET /api/urls/{shortCode}/analytics?period=7d`
- Analytics Service queries ClickHouse:

```sql
SELECT toDate(timestamp) AS date, count() AS clicks
FROM clicks
WHERE short_code = 'a3Xk9'
  AND timestamp > now() - INTERVAL 7 DAY
GROUP BY date
ORDER BY date
```

- For top-level metrics (total clicks), read from the Redis counter for speed
- Return the aggregated analytics
Potential Deep Dives
Deep Dive 1: URL Encoding / Short Code Generation Strategy
The Problem: We need to generate short, unique codes for 100M URLs/month. Over 5 years, that is 6 billion URLs. The code should be short (7 characters) and never collide.
Capacity check: Base62 (a-z, A-Z, 0-9) with 7 characters = 62^7 = 3.5 trillion possible codes. 6 billion URLs is only 0.17% of the space. Plenty of room.
Bad Solution -- MD5 hash truncation:
```
shortCode = base62(MD5(longUrl))[0:7]
```

MD5 gives 128 bits; we keep roughly the first 42 bits (7 Base62 chars). Birthday problem: with 6B URLs in a 3.5T code space, the expected number of collisions is approximately n^2 / (2 * space) = (6*10^9)^2 / (2 * 3.5*10^12) ≈ 5.1 million. Collisions are guaranteed -- not just probable. We would need a retry loop to check for collisions, adding latency and complexity.
Also: same long URL always generates the same short code, which is a feature (dedup) but also a limitation (different users can't have different short codes for the same URL).
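A quick sanity check of these numbers (the birthday-bound figure is the expected number of colliding pairs, assuming truncated hashes behave like random 42-bit values):

```python
codes = 62 ** 7                   # possible 7-char Base62 codes
urls = 6_000_000_000              # 100M/month over 5 years

print(codes)                      # 3521614606208  (~3.5 trillion)
print(urls / codes)               # ~0.0017 -> ~0.17% of the space used
print(urls ** 2 / (2 * codes))    # ~5.1 million expected collisions for random codes
```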
Good Solution -- Counter-based with Base62 encoding: Use a global auto-incrementing counter. Each new URL gets the next number, encoded in Base62.
```
counter = 1000000 -> base62(1000000) = "4c92"
counter = 1000001 -> base62(1000001) = "4c93"
```

No collisions ever. But:
- Sequential codes are predictable (users can enumerate short.ly/4c92, short.ly/4c93, ...)
- A single counter is a bottleneck in a distributed system
Distributed counter approach: Use ZooKeeper or a counter service to pre-allocate ranges:
- Server 1 gets range [1M, 2M)
- Server 2 gets range [2M, 3M)
- Each server increments locally, no coordination needed
- When a server exhausts its range, it requests a new one
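A minimal sketch of the counter-to-code path: Base62 encoding plus a per-server range allocator. The `allocate_range` coordination call (ZooKeeper or a counter service) is left as an assumed stub.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

class RangeCounter:
    """Each server pre-allocates a block of IDs and increments locally."""
    def __init__(self, allocate_range):
        self._allocate = allocate_range        # e.g. backed by ZooKeeper
        self._next, self._end = allocate_range()

    def next_code(self) -> str:
        if self._next >= self._end:            # range exhausted -> request a new one
            self._next, self._end = self._allocate()
        code, self._next = self._next, self._next + 1
        return base62_encode(code)
```

With this alphabet (0-9, a-z, A-Z), `base62_encode(1000000)` returns "4c92", matching the example above.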
Great Solution -- Snowflake-style ID with Base62: Generate a 64-bit unique ID:
```
| timestamp (41 bits) | datacenter (5 bits) | server (5 bits) | sequence (12 bits) |
```

Then encode in Base62. The ID is unique, non-sequential (hard to guess), and generated locally.
7 Base62 characters ≈ 42 bits, so we keep only the lower 42 bits of the Snowflake ID: the sequence, server, and datacenter bits plus the low ~20 bits of the timestamp. Codes are unique per server per millisecond, but because the truncated timestamp wraps roughly every 17 minutes, a unique constraint on short_code is still a sensible safety net.
To prevent predictability, we can also apply a bijective shuffle:
```python
def encode(id):
    # XOR with a secret to shuffle the output
    # SECRET_MASK is a fixed, secret 42-bit constant; base62_encode as above
    shuffled = id ^ SECRET_MASK
    return base62_encode(shuffled)
```

Our choice: Counter-based with range allocation (simplest, no collisions, well-understood). Use the bijective shuffle for non-predictability.
Deep Dive 2: 301 vs 302 Redirects
301 Moved Permanently:
- Browser caches the redirect. Next time the user clicks the short URL, the browser goes directly to the long URL without hitting our server.
- Pros: Faster for repeat visits, reduces load on our servers
- Cons: We lose analytics for repeat visits. If a user clicks the same short URL 10 times, we only see the first click.
- CDN also caches 301s aggressively.
302 Found (Temporary Redirect):
- Browser does NOT cache the redirect. Every click hits our server.
- Pros: Full analytics on every click. We can also change the destination URL later.
- Cons: Higher server load, slightly slower for users (extra hop every time)
Our decision:
- Default: 302 for URLs where the creator has analytics enabled (most URLs). Analytics is a core feature.
- Option: 301 for high-volume URLs where the creator does not care about analytics (saves server cost).
- At the CDN level: use a short cache TTL (e.g., 5 minutes) even for 302s. This reduces origin load while still capturing most unique clicks.
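One way to express this decision in code -- a minimal sketch assuming a Flask-style handler; `resolve` is the assumed cache-aside lookup from Flow 2, and the 300-second max-age mirrors the short CDN TTL mentioned above:

```python
from flask import Flask, redirect

app = Flask(__name__)

@app.route("/<short_code>")
def follow(short_code):
    long_url = resolve(short_code)          # assumed helper: Redis, then PostgreSQL
    resp = redirect(long_url, code=302)     # 302 keeps every click visible for analytics
    # Short edge/browser cache: cuts origin load without hiding most unique clicks
    resp.headers["Cache-Control"] = "public, max-age=300"
    return resp
```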
Back-of-envelope impact:
- 10B redirects/month with 302 = 3,800 QPS on our Redirect Service
- With 301 + CDN caching (assume 80% cache hit rate): only 760 QPS hit origin. 5x reduction.
- For a cost-sensitive service, 301 with CDN caching makes sense if analytics is secondary.
Deep Dive 3: Analytics Tracking at Scale
The Problem: 10B clicks/month = ~3,800 events/sec. Each event has ~500 bytes of metadata. We need fast writes and fast aggregation queries.
Bad Solution -- Write to PostgreSQL synchronously: Insert into a clicks table on every redirect. This adds 5-10ms to every redirect (a significant chunk of the 50ms P99 budget), and a heavily indexed, ever-growing clicks table taking ~3,800 inserts/sec competes with the latency-critical URL lookups on the same database.
Good Solution -- Async writes via Kafka:
- Redirect Service publishes a click event to Kafka (fire-and-forget, < 1ms)
- Kafka consumer batch-writes to the analytics store
- Zero impact on redirect latency
Great Solution -- Kafka + ClickHouse pipeline:
Why ClickHouse? It is a columnar database optimized for:
- High-speed inserts (millions of rows/sec in batches)
- Fast aggregation queries (GROUP BY country, GROUP BY day)
- Efficient compression for time-series data
Pipeline:
```
Redirect Service --> Kafka (click_events topic, 20 partitions)
                          |
                          v
              Kafka Consumer Group (10 consumers)
                          |
                          v
        Batch insert to ClickHouse (every 5 seconds or 10K events)
```

ClickHouse schema:
```sql
CREATE TABLE clicks (
    short_code String,
    timestamp DateTime,
    country LowCardinality(String),
    referrer String,
    device LowCardinality(String),
    browser LowCardinality(String),
    os LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (short_code, timestamp);
```

Storage calculation: 10B clicks/month * 500 bytes * 12 months = 60 TB/year uncompressed. ClickHouse typically achieves ~10x compression on this type of data, so roughly 6 TB/year. Very manageable.
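A sketch of the consumer side of this pipeline, assuming kafka-python for consumption; `clickhouse_insert` is a placeholder for whatever ClickHouse client performs the batch INSERT.

```python
import json
from kafka import KafkaConsumer   # kafka-python

BATCH_SIZE = 10_000

consumer = KafkaConsumer(
    "click_events",
    bootstrap_servers="kafka:9092",
    group_id="click-analytics",
    enable_auto_commit=False)

batch = []
for message in consumer:
    event = json.loads(message.value)
    # enrich here: geo-lookup from IP, device/browser from user-agent (omitted)
    batch.append(event)
    if len(batch) >= BATCH_SIZE:               # a time-based flush would also apply
        clickhouse_insert("clicks", batch)     # placeholder batch INSERT into ClickHouse
        consumer.commit()                      # commit offsets only after a successful write
        batch = []
```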
For real-time counters (total clicks displayed on the dashboard): use Redis INCR on redirect. ClickHouse is for detailed analytics queries (top countries, click trends over time).
Deep Dive 4: Custom Aliases and Collision Handling
The Problem: Users want custom aliases like short.ly/my-brand. How do we handle:
- Checking availability without race conditions
- Preventing abuse (squatting on popular names)
- Conflicts with auto-generated codes
Solution:
Namespace separation: Reserve auto-generated codes to a specific format (e.g., always 7 characters from a restricted alphabet). Custom aliases can be 3-30 characters, alphanumeric plus hyphens. The two namespaces never overlap because auto-generated codes use a deterministic length and character set.
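A small sketch of the namespace rule -- the exact regexes are assumptions, but they capture the idea that a custom alias can never look like a fixed-length auto-generated code:

```python
import re

AUTO_CODE_RE = re.compile(r"^[0-9a-zA-Z]{7}$")         # auto-generated: exactly 7 Base62 chars
CUSTOM_ALIAS_RE = re.compile(r"^[0-9a-zA-Z-]{3,30}$")  # custom: 3-30 chars, alphanumeric + hyphens

def validate_custom_alias(alias: str) -> bool:
    # Reject anything that could collide with the auto-generated namespace
    if AUTO_CODE_RE.match(alias):
        return False
    return bool(CUSTOM_ALIAS_RE.match(alias))
```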
Availability check with locking:
```python
def create_custom_alias(alias, long_url, user_id):
    # Acquire a distributed lock on the alias
    lock = redis.lock(f"alias_lock:{alias}", timeout=5)
    if not lock.acquire(blocking=True, blocking_timeout=2):
        raise ConflictError("Try again")
    try:
        # Check if the alias already exists
        existing = db.query("SELECT 1 FROM urls WHERE short_code = ?", alias)
        if existing:
            raise ConflictError("Alias already taken")
        # Reserve it
        db.insert("INSERT INTO urls (short_code, long_url, user_id) VALUES (?, ?, ?)",
                  alias, long_url, user_id)
        redis.set(f"url:{alias}", long_url)
    finally:
        lock.release()
```

Abuse prevention:
- Free tier: max 5 custom aliases per account
- Profanity filter on aliases
- Block trademarked terms (optional, legal requirement in some jurisdictions)
- If an alias is unused (zero clicks for 12 months), allow reclaiming
Deep Dive 5: URL Expiration
The Problem: Some URLs should expire after a set time. How do we handle cleanup without scanning the entire table?
Bad Solution -- Cron job scanning all URLs:
```sql
DELETE FROM urls WHERE expires_at < NOW() AND expires_at IS NOT NULL;
```

With billions of rows, this full-table scan is extremely expensive and blocks other operations.
Good Solution -- Lazy expiration (check on read): When a redirect request comes in, check if the URL has expired:
```python
def redirect(short_code):
    url = cache.get(short_code) or db.get(short_code)
    if url.expires_at and url.expires_at < now():
        return gone_410()  # 410 Gone -- the URL has expired
    return redirect_to(url.long_url)
```

This handles the user-facing behavior, but expired URLs still consume storage.
Great Solution -- Lazy expiration + background cleanup:
- Lazy expiration on read (as above) for immediate user experience
- Redis TTL on cached entries: set the Redis TTL to match the expiration time. Redis automatically evicts expired entries.
- Background cleanup: Partition the urls table by `created_at` month. Run a monthly job that drops old partitions where all URLs have expired. This is an O(1) partition drop, not a row-by-row delete.
For the cleanup job:
```sql
-- Partition-level check: find partitions where every URL has expired
SELECT partition_name, max(expires_at) AS max_expires_at
FROM url_partitions
GROUP BY partition_name
HAVING max(expires_at) < NOW();

-- Drop entire partitions where everything has expired
ALTER TABLE urls DROP PARTITION p202301;
```

Deep Dive 6: High Availability for Redirects
The Problem: Redirects are the hot path. If the redirect service goes down, millions of users see errors. How do we achieve 99.99% availability?
Multi-layer redundancy:
CDN layer (CloudFront): Cache redirects at edge locations globally. Even if our origin is down, cached redirects continue to work. For popular URLs (80% of traffic), CDN handles it.
DNS failover: Use Route 53 with health checks. If the primary region is down, DNS fails over to a secondary region within 60 seconds.
Multi-region deployment:
- Primary region: us-east-1 (handles writes and reads)
- Secondary region: eu-west-1 (read replica of PostgreSQL, Redis replica)
- Both regions can serve redirects independently
- Writes (URL creation) go to primary only -- this is acceptable since writes are 100x less frequent than reads
Redis cluster with replication:
- 3 Redis shards, each with 2 replicas
- If a primary shard fails, a replica is promoted automatically
- Total capacity: 3 shards * 200K ops/sec = 600K ops/sec (we need ~20K at peak)
Graceful degradation: If Redis is down, fall back to PostgreSQL read replicas. Latency increases from 2ms to 15ms, but the service stays up.
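Graceful degradation can be as simple as wrapping the cache lookup, as in this sketch; `redis_client` and `replica_db` are assumed stand-ins for the real clients.

```python
def lookup_long_url(short_code):
    """Serve from Redis when healthy; fall back to a PostgreSQL read replica if Redis errors out."""
    try:
        long_url = redis_client.get(f"url:{short_code}")
        if long_url is not None:
            return long_url
    except ConnectionError:
        pass    # Redis down or unreachable: degrade to the replica (~15ms instead of ~2ms)
    return replica_db.fetch_one(
        "SELECT long_url FROM urls WHERE short_code = %s", (short_code,))
```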
Back-of-envelope for Redis sizing:
- 6B total URLs over 5 years
- Average entry: 7-byte key + 200-byte URL + overhead = ~300 bytes
- Only cache active URLs (accessed in last 30 days). Assume 20% are active = 1.2B entries
- 1.2B * 300 bytes = 360 GB. Spread across 3 shards = 120 GB per shard. Fits in a large Redis instance (r6g.4xlarge with 128GB RAM).
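The sizing arithmetic from the bullets above, spelled out:

```python
total_urls = 6_000_000_000        # 5 years at 100M/month
active = total_urls * 0.20        # assume 20% accessed in the last 30 days -> 1.2B entries
bytes_per_entry = 300             # 7-byte key + ~200-byte URL + Redis overhead

total_gb = active * bytes_per_entry / 1e9    # ~360 GB
per_shard_gb = total_gb / 3                  # ~120 GB per shard
```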
What is Expected at Each Level
Mid-Level
- Design the basic shorten and redirect flow with a database
- Understand Base62 encoding for short code generation
- Identify that redirect latency is critical and propose caching
- Know the difference between 301 and 302 redirects
Senior
- Design the counter-based short code generation with range allocation
- Back-of-envelope for code space capacity (Base62, 7 chars, collision probability)
- Redis caching layer for redirects with cache-aside pattern
- Async analytics via Kafka
- Handle custom aliases with collision detection
- Discuss trade-offs of 301 vs 302 for analytics vs performance
- URL expiration with lazy evaluation
Staff+
- Multi-region deployment for redirect high availability
- CDN integration for edge caching of redirects
- ClickHouse (or similar OLAP) for analytics at scale with storage calculations
- Hot-spot detection for viral URLs (what if one short URL gets 1M clicks/minute?)
- Abuse prevention: URL spam, phishing link detection, rate limiting per IP
- Data retention policy and partition-based cleanup
- Cost optimization: estimating infrastructure costs for 10B redirects/month