HLD: URL Shortener (Bit.ly)
Understanding the Problem
What is a URL Shortener?
A URL shortener takes a long URL (like https://example.com/products/electronics/headphones?sort=price&filter=wireless&page=3) and creates a short alias (like https://short.ly/a3Xk9) that redirects to the original. It seems simple on the surface, but at scale the challenges are interesting: generating globally unique short codes without collisions, handling massive redirect traffic with minimal latency, and tracking analytics on every click.
Functional Requirements
Core (above the line):
- Shorten a URL -- given a long URL, generate a unique short URL
- Redirect -- when a user visits the short URL, redirect them to the original long URL
- Custom aliases -- users can optionally specify a custom short code (e.g., short.ly/my-brand)
- Analytics -- track click count, referrer, geo, device for each short URL
Below the line (mention but don't design):
- User accounts and URL management dashboard
- URL expiration and deletion
- QR code generation
- Link previews
- Bulk URL shortening API
Non-Functional Requirements
- Low redirect latency -- redirects must complete in < 50ms (P99), since every millisecond adds friction
- High availability -- 99.99% uptime for redirects (write path can tolerate slightly lower availability)
- Scale -- 100M new URLs/month, 10B redirects/month (~3,800 redirects/sec average, 20K/sec peak)
- Durability -- once a short URL is created, the mapping must never be lost
The Set Up
Core Entities
| Entity | Description |
|---|---|
| URL | shortCode, longUrl, userId (optional), createdAt, expiresAt (optional), clickCount |
| Click | shortCode, timestamp, referrer, userAgent, ipAddress, country, device |
| User | id, email, apiKey, tier (free/pro) |
API Design
Shorten a URL:
```
POST /api/urls
Authorization: Bearer <token> (optional for anonymous)
```

Request:

```json
{
  "longUrl": "https://example.com/very/long/path?with=params",
  "customAlias": "my-link",             // optional
  "expiresAt": "2025-01-01T00:00:00Z"   // optional
}
```

Response: 201 Created

```json
{
  "shortCode": "my-link",
  "shortUrl": "https://short.ly/my-link",
  "longUrl": "https://example.com/very/long/path?with=params",
  "createdAt": "2024-01-15T10:30:00Z",
  "expiresAt": "2025-01-01T00:00:00Z"
}
```

Redirect (the hot path):

```
GET /{shortCode}
(e.g., GET /a3Xk9)

Response: 301 Moved Permanently (or 302 Found)
Location: https://example.com/very/long/path?with=params
```

Get URL analytics:
```
GET /api/urls/{shortCode}/analytics?period=7d
```

Response: 200 OK

```json
{
  "shortCode": "a3Xk9",
  "totalClicks": 14523,
  "clicksByDay": [
    { "date": "2024-01-15", "clicks": 2301 },
    { "date": "2024-01-14", "clicks": 1987 }
  ],
  "topCountries": [
    { "country": "US", "clicks": 8200 },
    { "country": "UK", "clicks": 2100 }
  ],
  "topReferrers": [
    { "referrer": "twitter.com", "clicks": 5400 }
  ]
}
```

High-Level Design
Flow 1: Creating a Short URL
```
[Client] --> [API Gateway] --> [URL Service] --> [Short Code Generator]
                                     |                      |
                                     v                      v
                               [PostgreSQL]      [Redis Counter / ZooKeeper]
                                     |
                                     v
                    [Redis Cache: shortCode -> longUrl]
```

Step-by-step:
- Client submits a long URL via `POST /api/urls`
- API Gateway authenticates (if a token is provided), applies rate limiting (100 URLs/hour for free tier)
- URL Service validates the long URL (well-formed, not on a blocklist, reachable via HEAD request)
- If a custom alias is provided: (a) check Redis and PostgreSQL for a collision; (b) if available, use it as the short code; (c) if taken, return 409 Conflict
- If no custom alias, Short Code Generator generates a unique short code (see Deep Dive 1)
- Write the mapping to PostgreSQL: `INSERT INTO urls (short_code, long_url, user_id, created_at)`
- Write to the Redis cache: `SET url:a3Xk9 "https://example.com/..." EX 86400` (24-hour TTL, refreshed on access)
- Return the short URL to the client
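To make the write path concrete, here is a minimal sketch of the URL Service's create handler following the steps above. `db`, `cache`, `generate_short_code`, `ConflictError`, and the blocklist are assumed stand-ins for the real components, not a specific library API.

```python
import re

BLOCKLIST = {"malware.example"}   # assumed placeholder blocklist

def create_short_url(long_url, user_id=None, custom_alias=None):
    """Sketch of Flow 1: validate, pick a code, persist, cache."""
    if not re.match(r"^https?://", long_url) or any(b in long_url for b in BLOCKLIST):
        raise ValueError("invalid or blocked URL")

    if custom_alias:
        # Collision check; the unique constraint on short_code is the real guard
        if db.exists("SELECT 1 FROM urls WHERE short_code = %s", (custom_alias,)):
            raise ConflictError("alias already taken")       # surfaces as 409
        short_code = custom_alias
    else:
        short_code = generate_short_code()                    # see Deep Dive 1

    db.execute(
        "INSERT INTO urls (short_code, long_url, user_id, created_at) "
        "VALUES (%s, %s, %s, NOW())",
        (short_code, long_url, user_id))
    cache.set(f"url:{short_code}", long_url, ex=86400)        # 24h TTL, refreshed on access
    return f"https://short.ly/{short_code}"
```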
Flow 2: Redirecting (The Hot Path)
```
[Client] --> [CDN/Edge] --> [Redirect Service] --> [Redis Cache]
                                   |
                             (cache miss)
                                   v
                             [PostgreSQL]

[Redirect Service] --(async)--> [Kafka: click_events] --> [Click Analytics Pipeline]
```

Step-by-step:
- User clicks `https://short.ly/a3Xk9`. DNS resolves to our CDN (CloudFront)
- CDN checks its edge cache. If the URL was recently accessed and we used a 301, the CDN has the redirect cached and serves it directly from the edge -- no origin hit
- On a CDN cache miss, the request goes to the Redirect Service
- Redirect Service looks up `url:a3Xk9` in Redis
- If Redis hit: return a 301/302 redirect with the `Location` header
- If Redis miss: query PostgreSQL, populate Redis, return the redirect
- Asynchronously (non-blocking): publish a click event to Kafka with metadata (timestamp, IP, user-agent, referrer)
- Click Analytics Pipeline (Kafka consumer) processes click events:
  - Geo-lookup from IP (MaxMind database)
  - Device/browser parsing from the user-agent
  - Write to ClickHouse (columnar analytics DB) for fast aggregation queries
  - Increment clickCount in Redis: `INCR clicks:a3Xk9`
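A sketch of the hot path as code -- cache-aside lookup plus a fire-and-forget click event. `cache`, `db`, `kafka_producer`, and the response helpers are assumed stand-ins rather than a specific client library.

```python
import json
import time

def handle_redirect(short_code, request):
    """Sketch of Flow 2: Redis first, PostgreSQL on miss, async click event."""
    long_url = cache.get(f"url:{short_code}")
    if long_url is None:
        long_url = db.fetch_one(
            "SELECT long_url FROM urls WHERE short_code = %s", (short_code,))
        if long_url is None:
            return not_found_404()
        cache.set(f"url:{short_code}", long_url, ex=86400)    # repopulate the cache

    # Fire-and-forget: never block the redirect on analytics
    kafka_producer.send("click_events", json.dumps({
        "short_code": short_code,
        "ts": int(time.time() * 1000),
        "ip": request.ip,
        "user_agent": request.user_agent,
        "referrer": request.referrer,
    }).encode())

    return redirect_response(long_url)    # 301 vs 302: see Deep Dive 2
```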
Flow 3: Analytics Query
- Client calls `GET /api/urls/{shortCode}/analytics?period=7d`
- Analytics Service queries ClickHouse:

```sql
SELECT toDate(timestamp) AS date, count() AS clicks
FROM clicks
WHERE short_code = 'a3Xk9'
  AND timestamp > now() - INTERVAL 7 DAY
GROUP BY date
ORDER BY date
```

- For top-level metrics (total clicks), read from the Redis counter for speed
- Return the aggregated analytics
Potential Deep Dives
Deep Dive 1: URL Encoding / Short Code Generation Strategy
The Problem: We need to generate short, unique codes for 100M URLs/month. Over 5 years, that is 6 billion URLs. The code should be short (7 characters) and never collide.
Capacity check: Base62 (a-z, A-Z, 0-9) with 7 characters = 62^7 = 3.5 trillion possible codes. 6 billion URLs is only 0.17% of the space. Plenty of room.
Bad Solution -- MD5 hash truncation:
```
shortCode = base62(MD5(longUrl))[0:7]
```

MD5 gives 128 bits; we keep roughly the first 42 bits (7 Base62 chars). Birthday problem: with 6B URLs in a 3.5T code space, the expected number of collisions is approximately n^2 / (2 * space) = (6*10^9)^2 / (2 * 3.5*10^12) ≈ 5.1 million. Collisions are guaranteed -- not just probable. We would need a retry loop to check for collisions, adding latency and complexity.
Also: same long URL always generates the same short code, which is a feature (dedup) but also a limitation (different users can't have different short codes for the same URL).
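A quick sanity check of these numbers (the birthday-bound figure is the expected number of colliding pairs, assuming truncated hashes behave like random 42-bit values):

```python
codes = 62 ** 7                   # possible 7-char Base62 codes
urls = 6_000_000_000              # 100M/month over 5 years

print(codes)                      # 3521614606208  (~3.5 trillion)
print(urls / codes)               # ~0.0017 -> ~0.17% of the space used
print(urls ** 2 / (2 * codes))    # ~5.1 million expected collisions for random codes
```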
Good Solution -- Counter-based with Base62 encoding: Use a global auto-incrementing counter. Each new URL gets the next number, encoded in Base62.
```
counter = 1000000 -> base62(1000000) = "4c92"
counter = 1000001 -> base62(1000001) = "4c93"
```

No collisions ever. But:
- Sequential codes are predictable (users can enumerate short.ly/4c92, short.ly/4c93, ...)
- A single counter is a bottleneck in a distributed system
Distributed counter approach: Use ZooKeeper or a counter service to pre-allocate ranges:
- Server 1 gets range [1M, 2M)
- Server 2 gets range [2M, 3M)
- Each server increments locally, no coordination needed
- When a server exhausts its range, it requests a new one
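A minimal sketch of the counter-to-code path: Base62 encoding plus a per-server range allocator. The `allocate_range` coordination call (ZooKeeper or a counter service) is left as an assumed stub.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

class RangeCounter:
    """Each server pre-allocates a block of IDs and increments locally."""
    def __init__(self, allocate_range):
        self._allocate = allocate_range        # e.g. backed by ZooKeeper
        self._next, self._end = allocate_range()

    def next_code(self) -> str:
        if self._next >= self._end:            # range exhausted -> request a new one
            self._next, self._end = self._allocate()
        code, self._next = self._next, self._next + 1
        return base62_encode(code)
```

With this alphabet (0-9, a-z, A-Z), `base62_encode(1000000)` returns "4c92", matching the example above.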
Great Solution -- Snowflake-style ID with Base62: Generate a 64-bit unique ID:
```
| timestamp (41 bits) | datacenter (5 bits) | server (5 bits) | sequence (12 bits) |
```

Then encode in Base62. The ID is unique, non-sequential (hard to guess), and generated locally.
7 Base62 characters ≈ 42 bits, so we keep only the lower 42 bits of the Snowflake ID: the sequence, server, and datacenter bits plus the low ~20 bits of the timestamp. Codes are unique per server per millisecond, but because the truncated timestamp wraps roughly every 17 minutes, a unique constraint on short_code is still a sensible safety net.
To prevent predictability, we can also apply a bijective shuffle:
```python
def encode(id):
    # XOR with a secret to shuffle the output
    # SECRET_MASK is a fixed, secret 42-bit constant; base62_encode as above
    shuffled = id ^ SECRET_MASK
    return base62_encode(shuffled)
```

Our choice: Counter-based with range allocation (simplest, no collisions, well-understood). Use the bijective shuffle for non-predictability.
Deep Dive 2: 301 vs 302 Redirects
301 Moved Permanently:
- Browser caches the redirect. Next time the user clicks the short URL, the browser goes directly to the long URL without hitting our server.
- Pros: Faster for repeat visits, reduces load on our servers
- Cons: We lose analytics for repeat visits. If a user clicks the same short URL 10 times, we only see the first click.
- CDN also caches 301s aggressively.
302 Found (Temporary Redirect):
- Browser does NOT cache the redirect. Every click hits our server.
- Pros: Full analytics on every click. We can also change the destination URL later.
- Cons: Higher server load, slightly slower for users (extra hop every time)
Our decision:
- Default: 302 for URLs where the creator has analytics enabled (most URLs). Analytics is a core feature.
- Option: 301 for high-volume URLs where the creator does not care about analytics (saves server cost).
- At the CDN level: use a short cache TTL (e.g., 5 minutes) even for 302s. This reduces origin load while still capturing most unique clicks.
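One way to express this decision in code -- a minimal sketch assuming a Flask-style handler; `resolve` is the assumed cache-aside lookup from Flow 2, and the 300-second max-age mirrors the short CDN TTL mentioned above:

```python
from flask import Flask, redirect

app = Flask(__name__)

@app.route("/<short_code>")
def follow(short_code):
    long_url = resolve(short_code)          # assumed helper: Redis, then PostgreSQL
    resp = redirect(long_url, code=302)     # 302 keeps every click visible for analytics
    # Short edge/browser cache: cuts origin load without hiding most unique clicks
    resp.headers["Cache-Control"] = "public, max-age=300"
    return resp
```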
Back-of-envelope impact:
- 10B redirects/month with 302 = 3,800 QPS on our Redirect Service
- With 301 + CDN caching (assume 80% cache hit rate): only 760 QPS hit origin. 5x reduction.
- For a cost-sensitive service, 301 with CDN caching makes sense if analytics is secondary.
Deep Dive 3: Analytics Tracking at Scale
The Problem: 10B clicks/month = ~3,800 events/sec. Each event has ~500 bytes of metadata. We need fast writes and fast aggregation queries.
Bad Solution -- Write to PostgreSQL synchronously: Insert into a clicks table on every redirect. This adds 5-10ms to every redirect (a significant chunk of the 50ms P99 budget), and a heavily indexed, ever-growing clicks table taking ~3,800 inserts/sec competes with the latency-critical URL lookups on the same database.
Good Solution -- Async writes via Kafka:
- Redirect Service publishes a click event to Kafka (fire-and-forget, < 1ms)
- Kafka consumer batch-writes to the analytics store
- Zero impact on redirect latency
Great Solution -- Kafka + ClickHouse pipeline:
Why ClickHouse? It is a columnar database optimized for:
- High-speed inserts (millions of rows/sec in batches)
- Fast aggregation queries (GROUP BY country, GROUP BY day)
- Efficient compression for time-series data
Pipeline:
```
Redirect Service --> Kafka (click_events topic, 20 partitions)
                          |
                          v
              Kafka Consumer Group (10 consumers)
                          |
                          v
        Batch insert to ClickHouse (every 5 seconds or 10K events)
```

ClickHouse schema:
```sql
CREATE TABLE clicks (
    short_code String,
    timestamp DateTime,
    country LowCardinality(String),
    referrer String,
    device LowCardinality(String),
    browser LowCardinality(String),
    os LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (short_code, timestamp);
```

Storage calculation: 10B clicks/month * 500 bytes * 12 months = 60 TB/year uncompressed. ClickHouse typically achieves ~10x compression on this type of data, so roughly 6 TB/year. Very manageable.
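A sketch of the consumer side of this pipeline, assuming kafka-python for consumption; `clickhouse_insert` is a placeholder for whatever ClickHouse client performs the batch INSERT.

```python
import json
from kafka import KafkaConsumer   # kafka-python

BATCH_SIZE = 10_000

consumer = KafkaConsumer(
    "click_events",
    bootstrap_servers="kafka:9092",
    group_id="click-analytics",
    enable_auto_commit=False)

batch = []
for message in consumer:
    event = json.loads(message.value)
    # enrich here: geo-lookup from IP, device/browser from user-agent (omitted)
    batch.append(event)
    if len(batch) >= BATCH_SIZE:               # a time-based flush would also apply
        clickhouse_insert("clicks", batch)     # placeholder batch INSERT into ClickHouse
        consumer.commit()                      # commit offsets only after a successful write
        batch = []
```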
For real-time counters (total clicks displayed on the dashboard): use Redis INCR on redirect. ClickHouse is for detailed analytics queries (top countries, click trends over time).
Deep Dive 4: Custom Aliases and Collision Handling
The Problem: Users want custom aliases like short.ly/my-brand. How do we handle:
- Checking availability without race conditions
- Preventing abuse (squatting on popular names)
- Conflicts with auto-generated codes
Solution:
Namespace separation: Reserve auto-generated codes to a specific format (e.g., always 7 characters from a restricted alphabet). Custom aliases can be 3-30 characters, alphanumeric plus hyphens. The two namespaces never overlap because auto-generated codes use a deterministic length and character set.
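A small sketch of the namespace rule -- the exact regexes are assumptions, but they capture the idea that a custom alias can never look like a fixed-length auto-generated code:

```python
import re

AUTO_CODE_RE = re.compile(r"^[0-9a-zA-Z]{7}$")         # auto-generated: exactly 7 Base62 chars
CUSTOM_ALIAS_RE = re.compile(r"^[0-9a-zA-Z-]{3,30}$")  # custom: 3-30 chars, alphanumeric + hyphens

def validate_custom_alias(alias: str) -> bool:
    # Reject anything that could collide with the auto-generated namespace
    if AUTO_CODE_RE.match(alias):
        return False
    return bool(CUSTOM_ALIAS_RE.match(alias))
```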
Availability check with locking:
```python
def create_custom_alias(alias, long_url, user_id):
    # Acquire a distributed lock on the alias
    lock = redis.lock(f"alias_lock:{alias}", timeout=5)
    if not lock.acquire(blocking=True, blocking_timeout=2):
        raise ConflictError("Try again")
    try:
        # Check if the alias already exists
        existing = db.query("SELECT 1 FROM urls WHERE short_code = ?", alias)
        if existing:
            raise ConflictError("Alias already taken")
        # Reserve it
        db.insert("INSERT INTO urls (short_code, long_url, user_id) VALUES (?, ?, ?)",
                  alias, long_url, user_id)
        redis.set(f"url:{alias}", long_url)
    finally:
        lock.release()
```

Abuse prevention:
- Free tier: max 5 custom aliases per account
- Profanity filter on aliases
- Block trademarked terms (optional, legal requirement in some jurisdictions)
- If an alias is unused (zero clicks for 12 months), allow reclaiming
Deep Dive 5: URL Expiration
The Problem: Some URLs should expire after a set time. How do we handle cleanup without scanning the entire table?
Bad Solution -- Cron job scanning all URLs:
```sql
DELETE FROM urls WHERE expires_at < NOW() AND expires_at IS NOT NULL;
```

With billions of rows, this full-table scan is extremely expensive and blocks other operations.
Good Solution -- Lazy expiration (check on read): When a redirect request comes in, check if the URL has expired:
```python
def redirect(short_code):
    url = cache.get(short_code) or db.get(short_code)
    if url.expires_at and url.expires_at < now():
        return gone_410()  # 410 Gone -- the URL has expired
    return redirect_to(url.long_url)
```

This handles the user-facing behavior, but expired URLs still consume storage.
Great Solution -- Lazy expiration + background cleanup:
- Lazy expiration on read (as above) for immediate user experience
- Redis TTL on cached entries: set the Redis TTL to match the expiration time. Redis automatically evicts expired entries.
- Background cleanup: Partition the urls table by `created_at` month. Run a monthly job that drops old partitions where all URLs have expired. This is an O(1) partition drop, not a row-by-row delete.
For the cleanup job:
```sql
-- Partition-level check: find partitions where every URL has expired
SELECT partition_name, max(expires_at) AS max_expires_at
FROM url_partitions
GROUP BY partition_name
HAVING max(expires_at) < NOW();

-- Drop entire partitions where everything has expired
ALTER TABLE urls DROP PARTITION p202301;
```

Deep Dive 6: High Availability for Redirects
The Problem: Redirects are the hot path. If the redirect service goes down, millions of users see errors. How do we achieve 99.99% availability?
Multi-layer redundancy:
CDN layer (CloudFront): Cache redirects at edge locations globally. Even if our origin is down, cached redirects continue to work. For popular URLs (80% of traffic), CDN handles it.
DNS failover: Use Route 53 with health checks. If the primary region is down, DNS fails over to a secondary region within 60 seconds.
Multi-region deployment:
- Primary region: us-east-1 (handles writes and reads)
- Secondary region: eu-west-1 (read replica of PostgreSQL, Redis replica)
- Both regions can serve redirects independently
- Writes (URL creation) go to primary only -- this is acceptable since writes are 100x less frequent than reads
Redis cluster with replication:
- 3 Redis shards, each with 2 replicas
- If a primary shard fails, a replica is promoted automatically
- Total capacity: 3 shards * 200K ops/sec = 600K ops/sec (we need ~20K at peak)
Graceful degradation: If Redis is down, fall back to PostgreSQL read replicas. Latency increases from 2ms to 15ms, but the service stays up.
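Graceful degradation can be as simple as wrapping the cache lookup, as in this sketch; `redis_client` and `replica_db` are assumed stand-ins for the real clients.

```python
def lookup_long_url(short_code):
    """Serve from Redis when healthy; fall back to a PostgreSQL read replica if Redis errors out."""
    try:
        long_url = redis_client.get(f"url:{short_code}")
        if long_url is not None:
            return long_url
    except ConnectionError:
        pass    # Redis down or unreachable: degrade to the replica (~15ms instead of ~2ms)
    return replica_db.fetch_one(
        "SELECT long_url FROM urls WHERE short_code = %s", (short_code,))
```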
Back-of-envelope for Redis sizing:
- 6B total URLs over 5 years
- Average entry: 7-byte key + 200-byte URL + overhead = ~300 bytes
- Only cache active URLs (accessed in last 30 days). Assume 20% are active = 1.2B entries
- 1.2B * 300 bytes = 360 GB. Spread across 3 shards = 120 GB per shard. Fits in a large Redis instance (r6g.4xlarge with 128GB RAM).
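The sizing arithmetic from the bullets above, spelled out:

```python
total_urls = 6_000_000_000        # 5 years at 100M/month
active = total_urls * 0.20        # assume 20% accessed in the last 30 days -> 1.2B entries
bytes_per_entry = 300             # 7-byte key + ~200-byte URL + Redis overhead

total_gb = active * bytes_per_entry / 1e9    # ~360 GB
per_shard_gb = total_gb / 3                  # ~120 GB per shard
```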
What is Expected at Each Level
Mid-Level
- Design the basic shorten and redirect flow with a database
- Understand Base62 encoding for short code generation
- Identify that redirect latency is critical and propose caching
- Know the difference between 301 and 302 redirects
Senior
- Design the counter-based short code generation with range allocation
- Back-of-envelope for code space capacity (Base62, 7 chars, collision probability)
- Redis caching layer for redirects with cache-aside pattern
- Async analytics via Kafka
- Handle custom aliases with collision detection
- Discuss trade-offs of 301 vs 302 for analytics vs performance
- URL expiration with lazy evaluation
Staff+
- Multi-region deployment for redirect high availability
- CDN integration for edge caching of redirects
- ClickHouse (or similar OLAP) for analytics at scale with storage calculations
- Hot-spot detection for viral URLs (what if one short URL gets 1M clicks/minute?)
- Abuse prevention: URL spam, phishing link detection, rate limiting per IP
- Data retention policy and partition-based cleanup
- Cost optimization: estimating infrastructure costs for 10B redirects/month