HLD: Change Data Capture (CDC) Pipeline
Understanding the Problem
What is a CDC Pipeline?
A Change Data Capture pipeline streams every database change (inserts, updates, deletes) from the OLTP system out to downstream consumers. Analytics wants the data in Snowflake, search wants it in Elasticsearch, mobile clients want sync events, and webhooks want to notify customer integrations. The trick is doing all this without changing application code, preserving ordering per entity, and staying durable through failures. At Salesforce, this is exactly how Change Data Capture and Platform Events work at the platform level: if you know CDC, you know how to integrate with Salesforce.
Functional Requirements
Core (above the line):
- Capture row-level changes from Postgres (and optionally MySQL): inserts, updates, deletes, schema snapshots.
- Publish to Kafka topics keyed and partitioned for per-tenant, per-entity ordering.
- Multiple consumer families: search indexer, analytics loader (Snowflake), webhook dispatcher, mobile sync. Each has independent offsets and lag.
- Schema evolution without breaking consumers: producers add nullable fields; consumers continue reading older records.
- Replay from an offset or a timestamp, for backfills and incident recovery.
Below the line (out of scope):
- Non-DB sources (file systems, SaaS APIs): a separate connector framework handles those.
- Two-way sync: we are one-way only, DB → consumers.
- Exactly-once delivery to downstream: we settle for at-least-once plus idempotent consumers. Exactly-once across heterogeneous sinks is a research problem.
Non-Functional Requirements
Core:
- Scale: 100k writes/s steady across all orgs, peak 300k/s. 10 consumer families.
- Latency: end-to-end median < 2 s; P99 < 10 s.
- Durability: zero data loss. Every committed DB change is published at least once.
- Multi-tenancy: ordering preserved per (orgId, entity). No tenant can starve another.
Below the line:
- Cross-region replication consistency guarantees (we do async replication; treat each region independently).
- Full-text indexing of blob columns (push to a separate indexer).
Capacity Estimation
- 100k events/s × average 1 KB per event (key, columns, metadata) = 100 MB/s. Peak 300k/s = 300 MB/s.
- Daily volume: 100 MB/s × 86,400 s = ~8.6 TB/day raw.
- Kafka retention 7 days × 3x replication = ~180 TB hot on brokers.
- Partitions: 1024 partitions per high-volume topic gives each partition ~100 events/s on average, well within per-partition broker throughput.
- Consumer lag budget: 10 s worst case × 300k/s = 3M events of queue buffer.
The Set Up
Core Entities
- SourceTable: a table configured for capture. Has tableName, schema, captureMode (all columns vs selected).
- ChangeEvent: the unit of output:
  - lsn: log sequence number from the Postgres WAL; monotonic per shard.
  - orgId: the partition key's primary component.
  - entityId: the primary key of the row (or composite).
  - op: one of CREATE | UPDATE | DELETE | SNAPSHOT.
  - before: previous row state (for UPDATE / DELETE).
  - after: new row state (for CREATE / UPDATE).
  - schemaVersion: Avro schema version.
  - timestamp: commit time from Postgres.
- Consumer: consumerId, subscribed topics, current offset per partition, lag metrics.
The API
Most of the API is the Kafka client protocol; consumers use standard Kafka clients. The pipeline exposes operator APIs for configuration and observability.
Operator / control plane:
POST /v1/cdc/connectors
{ "sourceUrl": "...", "database": "crm_shard_12", "tables": [...] }
GET /v1/cdc/connectors/{id}/status
POST /v1/cdc/connectors/{id}/pause
POST /v1/cdc/connectors/{id}/resume
POST /v1/cdc/replays
{ "topic": "cdc.account", "orgId": "005...", "from": "2026-04-01T00:00Z" }
GET /v1/cdc/replays/{id}
GET /v1/cdc/consumers/{id}/lag
GET /v1/cdc/schemas/{topic}/versions
Consumer data plane: standard Kafka consumer protocol (kafka-clients, librdkafka, etc.) subscribing to topics like cdc.account, cdc.opportunity.
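A minimal consumer sketch against the data plane, assuming the Java kafka-clients library. The broker address, group name, and String deserializers are placeholders; a production consumer would use the Avro deserializer wired to the Schema Registry.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CdcAccountTail {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");     // placeholder broker
        props.put("group.id", "search-indexer");          // one group per consumer family
        props.put("enable.auto.commit", "false");         // commit only after the sink write succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("cdc.account"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Process the event, then commit: at-least-once by design.
                    System.out.printf("part=%d offset=%d key=%s%n", r.partition(), r.offset(), r.key());
                }
                consumer.commitSync();
            }
        }
    }
}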
High-Level Design
Architecture

┌────────────┐  WAL   ┌─────────────┐    ┌────────────┐    ┌───────────────┐
│  Postgres  │───────▶│  Debezium   │───▶│   Kafka    │───▶│  Consumers:   │
│ (per-shard)│        │  Connector  │    │  cluster   │    │ - ES indexer  │
└────────────┘        │ (1 per DB)  │    │            │    │ - Snowflake   │
                      └──────┬──────┘    │ topic per  │    │ - Webhooks    │
                             │           │  entity    │    │ - Mobile sync │
                             ▼           └────────────┘    └───────────────┘
                      ┌─────────────┐
                      │   Schema    │
                      │  Registry   │
                      │   (Avro)    │
                      └─────────────┘

End-to-end flow: a record is updated:
- The application performs an UPDATE records SET ... on shard crm_shard_12. Postgres writes to its WAL (write-ahead log) and acks the client.
- The Debezium connector for shard 12 is running logical replication: a replication slot on Postgres streams WAL changes as they commit.
- Debezium decodes the WAL entry into a structured event: {op: "UPDATE", before: {...}, after: {...}, lsn: ..., source: {db: "crm_shard_12", table: "records"}}.
- Debezium consults the Schema Registry (Confluent Schema Registry or similar) to serialize the event with the current Avro schema for this entity. If the schema changed, it registers a new version.
- Debezium produces to Kafka topic cdc.records with partition key hash(orgId, entityId). This is the crucial multi-tenant partitioning decision.
- Kafka replicates to 3 brokers (acks=all) for durability. Debezium commits the LSN only after the Kafka ack.
- Consumer group 1 (ES indexer) reads, transforms the event into an ES bulk upsert, and writes. On failure, it retries with backoff.
- Consumer group 2 (Snowflake loader) batches events, writes them to S3 as Parquet, then triggers a Snowpipe copy.
- Consumer group 3 (webhook dispatcher) looks up customer webhook URLs and POSTs the event payload with an HMAC signature.
- Consumer group 4 (mobile sync) routes events to Redis per-user queues for mobile pull.
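For concreteness, a connector registration sketch for the flow above, using Debezium's Postgres connector config keys (2.x style) posted to the Kafka Connect REST API. Hostnames, credentials, and the slot name are placeholders; note Debezium's default topic naming would be cdc.public.records, so a topic-routing transform would remap it to cdc.records.

POST /connectors
{
  "name": "cdc-crm-shard-12",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "crm-shard-12.internal",
    "database.port": "5432",
    "database.user": "cdc",
    "database.password": "...",
    "database.dbname": "crm_shard_12",
    "slot.name": "cdc_shard_12",
    "table.include.list": "public.records",
    "topic.prefix": "cdc",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}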
Data model
Control-plane tables:
- cdc_connectors: connector configs, status, last LSN.
- cdc_offsets: tracks each Debezium connector's LSN per shard for restart recovery.
- cdc_dead_letter: stores unparseable events with reason + original bytes for diagnosis.
- cdc_replay_jobs: tracks manual reprocessing runs with state (queued / running / done).
- cdc_schemas: mirrors Schema Registry for offline inspection.
Each downstream consumer maintains its own offset in Kafka's offset store.
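A minimal DDL sketch for two of these tables, assuming Postgres; column names beyond the descriptions above are illustrative.

CREATE TABLE cdc_offsets (
  connector_id text        NOT NULL,
  shard        text        NOT NULL,
  last_lsn     pg_lsn      NOT NULL,   -- resume point after connector restart
  updated_at   timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (connector_id, shard)
);

CREATE TABLE cdc_replay_jobs (
  id         bigserial PRIMARY KEY,
  topic      text        NOT NULL,
  org_id     text,                     -- NULL = replay all orgs
  from_ts    timestamptz NOT NULL,
  state      text        NOT NULL CHECK (state IN ('queued', 'running', 'done')),
  created_at timestamptz NOT NULL DEFAULT now()
);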
Multi-Tenancy Strategy
Isolation level: the source OLTP is L1 (shared DB, shared schema) for most shards, with mega-tenants on L3-style dedicated shards. The CDC layer inherits this: one Debezium connector per DB shard, which means mega-tenants naturally get their own connector plus their own slice of Kafka partitions.
Tenant context flow:
- orgId is stamped on every event: in the event envelope, as part of the Kafka message key, and as a Kafka header.
- The partition key is hash(orgId, entityId). This preserves per-entity ordering (critical) while distributing load.
- Consumers receive the full event including orgId; they key all downstream state by (orgId, entityId).
Noisy-neighbor mitigations:
- Sticky dedicated partitions for mega-tenants. A Fortune-500 tenant with a 10M-row backfill should not drag other orgs' events behind it. Detection is via lag-per-tenant metrics; promoted tenants get a reserved partition range.
- Per-org topic quotas at the Kafka broker level (bytes/sec, messages/sec). Enforced via Kafka's built-in quota mechanism, keyed by an orgId-derived client ID (see the sketch after this list).
- Consumer-side fair scheduling. Consumers pull round-robin per org, not strict FIFO per partition. Prevents a burst from one org backing up others in the same partition.
- Hot-key sub-partition splitting. If a single (orgId, entityId) pair dominates (e.g., a shared calendar record with 100k updates/min), rehash onto a dedicated partition to avoid saturating a single consumer.
- Backfill throttling. Schema changes trigger a snapshot/backfill for some consumers. Backfill runs through a separate cdc.backfill topic with its own consumer group at a throttled rate.
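How the broker-level quota might be set with Kafka's stock tooling. The client ID here is an illustrative orgId-derived name, and the byte rates are placeholders.

kafka-configs.sh --bootstrap-server kafka:9092 --alter \
  --entity-type clients --entity-name cdc-producer-org-00Dxx0000001 \
  --add-config 'producer_byte_rate=1048576,consumer_byte_rate=2097152'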
Per-tenant observability:
- Kafka metrics tagged with org_id via client IDs.
- Per-org lag dashboards: if Org X's events are 5 minutes behind while everyone else is 2 s behind, alert.
- Per-org replay rate caps so an operator replaying Org X's history cannot flood shared consumers.
Potential Deep Dives
1) What capture mechanism do we use?
Bad Solution: Polling updated_at.
- Approach: Every N seconds, SELECT * FROM records WHERE updated_at > :last_poll_time.
- Challenges:
  - Misses deletes: rows removed won't appear in future polls.
  - Clock skew: if server clocks drift, events are missed or double-counted.
  - Adds read load to OLTP proportional to polling frequency.
  - Can't detect "field X changed from Y to Z", only current state.
Good Solution: Application-level outbox.
- Approach: Every write does the DB row plus an insert into an outbox table in the same transaction. A poller reads from the outbox and publishes to Kafka, deleting outbox rows after the ack. (A sketch follows below.)
- Challenges: Requires modifying every write path in the app; invasive. Doubles write cost. Poller lag adds to end-to-end latency.
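A minimal sketch of the transactional outbox write, assuming Postgres; the outbox table shape and the example UPDATE are illustrative.

BEGIN;
UPDATE records SET stage = 'Closed Won'
 WHERE org_id = '00Dxx0000001' AND id = 'rec_123';
INSERT INTO outbox (org_id, entity_id, op, payload, created_at)
VALUES ('00Dxx0000001', 'rec_123', 'UPDATE',
        jsonb_build_object('stage', 'Closed Won'), now());
COMMIT;
-- The poller repeatedly does SELECT ... FROM outbox ORDER BY id LIMIT 500,
-- publishes the rows to Kafka, and deletes them once the produce is acked.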
Great Solution: Logical replication via Debezium.
- Approach:
  - Debezium runs as a Kafka Connect worker or standalone. It uses Postgres logical replication (via the pgoutput plugin) to stream WAL changes with zero app changes.
  - Captures inserts, updates, and deletes uniformly (a delete appears in the WAL like any other change; Debezium emits a delete event, typically followed by a Kafka tombstone).
  - Events are ordered by LSN, which is monotonic per shard.
  - Each connector checkpoints its LSN in Kafka so a restart resumes exactly where it left off.
  - Mention Maxwell for MySQL as the equivalent: binlog-based CDC.
- Challenges:
  - Logical replication slots are a Postgres resource: max_replication_slots defaults to 10 on a standard instance. Plan accordingly.
  - Slot lag can block Postgres WAL cleanup: a stuck connector can run Postgres out of disk. Monitor aggressively (see the query below).
  - Schema changes require care: Debezium emits schema change events; consumers must handle them.
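One way to watch slot lag directly on Postgres (10+), using only stock catalog views:

SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS retained_wal
  FROM pg_replication_slots
 WHERE slot_type = 'logical';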
2) How do we partition Kafka topics?
Bad Solution: One partition per topic.
- Approach: cdc.records has 1 partition.
- Challenges: Single-threaded consumption; throughput capped at ~1 MB/s per consumer. No horizontal scaling.
Good Solution: hash(orgId) partitioning.
- Approach: 1024 partitions, keyed by orgId. Per-org ordering is preserved; load spreads across brokers.
- Challenges: A mega-tenant's traffic all goes to one partition; that partition becomes a hot spot. Also, a small org's events can queue behind a mega-tenant's burst on the same partition (partition sharing).
Great Solution: hash(orgId, entityId) + sub-partition splitting for hot tenants + sticky assignment.
- Approach:
  - The primary partition key is hash(orgId, entityId). This preserves ordering per entity (which is what consumers actually need: "did I see the CREATE before the UPDATE for record X?") while spreading one tenant's events across many partitions.
  - A hot-key detector monitors the per-(orgId, entityId) event rate. When a single entity crosses a threshold (e.g., > 1000 events/min), rehash that specific key onto a dedicated partition range. Consumers use a lookup to handle hot keys correctly. (See the partitioner sketch after this list.)
  - Sticky partition assignment keeps the same consumer handling the same partition across rebalances, preserving consumer-side state caches.
  - Document the ordering contract explicitly: "Events for a given (orgId, entityId) are delivered in commit order. No ordering guarantee across different entities."
- Challenges: Hot-key splitting requires cooperation from consumers; they must know to look in the "hot shards" namespace. Sticky assignment conflicts with rebalancing during scaling events. The ordering contract is weaker than per-org total order; consumers that need cross-entity ordering (rare) must implement buffering and reorder.
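A sketch of the producer-side partitioner, assuming the Java kafka-clients Partitioner interface; the reserved-range size and the hot-key feed are illustrative.

import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class TenantEntityPartitioner implements Partitioner {
    private static final int HOT_RANGE = 32;         // partitions reserved at the tail (assumption)
    private volatile Set<String> hotKeys = Set.of(); // refreshed by the hot-key detector

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int total = cluster.partitionsForTopic(topic).size();
        int shared = total - HOT_RANGE;
        int hash = Utils.toPositive(Utils.murmur2(keyBytes)); // key = "orgId:entityId"
        if (hotKeys.contains((String) key)) {
            // Hot entities land in the reserved tail range so a single
            // (orgId, entityId) cannot saturate a shared partition.
            return shared + (hash % HOT_RANGE);
        }
        return hash % shared;
    }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}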
3) How do we evolve schemas without breaking consumers?
Bad Solution: Change column type and hope for the best.
- Approach: Do the DDL; consumers that can't parse the new schema crash.
- Challenges: All downstream breaks simultaneously. Ugly ops incident.
Good Solution: Avro Schema Registry with backward compatibility.
- Approach:
  - Every schema change goes through Schema Registry with a compatibility check (BACKWARD, FORWARD, FULL).
  - Backward compatibility: new schemas can only add nullable fields or fields with defaults. Never remove, never re-type.
  - The producer registers the new version; consumers keep reading with older schemas (Avro's resolver fills missing fields with defaults). A sketch of a compatible evolution follows below.
- Challenges:
  - Breaking changes (rename, re-type) can't happen under backward compat. Forces workarounds like "add new column, deprecate old."
  - Consumers that use dynamic parsing (e.g., ES reindexing) still need updates when the schema meaning changes.
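For concreteness, a compatible Avro evolution; the record and field names are illustrative.

v1:
{
  "type": "record",
  "name": "AccountChange",
  "namespace": "cdc",
  "fields": [
    {"name": "orgId", "type": "string"},
    {"name": "entityId", "type": "string"},
    {"name": "op", "type": "string"}
  ]
}

v2 adds one nullable field with a default; v1 readers ignore it, and v2 readers fill it with null on old records:
{
  "type": "record",
  "name": "AccountChange",
  "namespace": "cdc",
  "fields": [
    {"name": "orgId", "type": "string"},
    {"name": "entityId", "type": "string"},
    {"name": "op", "type": "string"},
    {"name": "accountTier", "type": ["null", "string"], "default": null}
  ]
}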
Great Solution: Dual-write transitions for breaking changes.
- Approach:
  - For truly breaking changes (e.g., renaming a column), emit both old and new schemas to two topics or two versions for a transition period (N days).
  - Consumers migrate one by one: deploy new code that reads the new schema, verify, then the producer drops the old schema.
  - Every event carries schemaVersion in its envelope so consumers can branch on it (see the snippet below).
  - Automated compatibility checks run in CI before DDL merges, preventing accidental breakages.
  - The deprecation process has a wall-clock deadline; consumers not migrated by the deadline are blocked from releasing.
- Challenges: Dual-write doubles Kafka throughput during the transition. Requires coordination across many consumer teams; a political problem as much as a technical one. Some consumers (customer webhooks) can't be coordinated; for those, maintain old schemas indefinitely.
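How a consumer might branch on the envelope version during the transition. The accessor and field names are assumptions, mirroring the ChangeEvent used in the closing example and treating after as a field map.

// Old events (schemaVersion < 4) use "region"; the rename to
// "billingRegion" ships in v4. The else-branch is deleted after the deadline.
Object region = (e.schemaVersion() >= 4)
    ? e.after().get("billingRegion")
    : e.after().get("region");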
4) How do we handle replay and back-pressure?
Bad Solution: Replay to the main topic.
- Approach: Operator replays events by re-producing them to the live topic.
- Challenges: Live consumers re-process events they've already handled. ES gets flooded with updates; Snowflake loader blows its daily quota; webhook customers get duplicate callbacks. Chaos.
Good Solution: Separate replay topic + idempotent consumers.
- Approach:
  - Replay events go to cdc.replay.{originalTopic} with throttled consumers.
  - Each consumer decides whether to process replays: some do (Snowflake, for audit reconstruction), some don't (webhooks, usually).
  - Consumers implement idempotent writes: PUT by (orgId, entityId, lsn), so duplicate events from replay are safe.
- Challenges: Doubles infrastructure (replay consumers run in parallel). Replay throughput is capped by the slowest idempotent write path.
Great Solution: Tier-controlled replay with per-tenant rate caps + back-pressure signals.
- Approach:
  - Dedicated replay consumer groups throttle the replay rate per tenant (e.g., Org X's replay at 1000 events/s max, regardless of source rate).
  - Downstream back-pressure: if the ES consumer's lag exceeds N seconds, it emits a back-pressure signal via a Kafka header or a separate control topic. The replay producer throttles.
  - A cursor-based replay API lets operators replay in chunks: "replay 1 hour, verify, continue" (example below).
  - Replays emit progress metrics per org; an operator dashboard shows replay progress without anyone needing to tail logs.
  - Idempotency is still the backbone: every consumer uses (orgId, entityId, lsn) as a dedup key.
- Challenges: Back-pressure signal requires consumer cooperation; not all consumers implement it. Per-tenant replay caps can make replays slow for mega-tenants (hours or days). Mitigate by offering "dedicated replay lanes" for enterprise customers.
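What a chunked, rate-capped replay request might look like against the POST /v1/cdc/replays endpoint from the API section; the to and maxEventsPerSecond fields are illustrative extensions, not part of the API as specified above.

POST /v1/cdc/replays
{
  "topic": "cdc.account",
  "orgId": "005...",
  "from": "2026-04-01T00:00Z",
  "to": "2026-04-01T01:00Z",
  "maxEventsPerSecond": 1000
}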
5) How do we handle consumer slowness and lag?
Bad Solution: Let lag grow unboundedly.
- Approach: Kafka retains events; consumers catch up eventually.
- Challenges: If the ES indexer is down for 4 hours, 4 h × 3,600 s/h × 300k/s ≈ 4.3B events pile up. When it recovers, it can't catch up at its normal rate; lag keeps growing; downstream staleness is observable for hours.
Good Solution: Alerts + manual intervention.
- Approach: Alert on consumer lag > threshold. Operator scales consumer group or investigates.
- Challenges: Slow human response; blast radius large by the time anyone notices.
Great Solution: Auto-scaling consumers + partition rebalancing + per-tenant circuit breakers.
- Approach:
  - Consumer groups auto-scale on lag (lag > threshold for N seconds → add a pod; lag < lower threshold for M seconds → remove a pod).
  - Kafka partition assignment rebalances smoothly via incremental cooperative rebalancing (KIP-429).
  - Per-tenant circuit breaker: if processing events for a specific orgId fails repeatedly (say the downstream ES tenant index is broken), the consumer temporarily skips those events and routes them to a tenant-specific DLQ. Other tenants' events keep flowing. (A sketch follows below.)
  - DLQ events are retried by a separate consumer at a throttled rate once the underlying problem is fixed.
  - Per-tenant lag metrics let us say "Org X is lagging because of their own config issue", distinguishing tenant-caused from system-caused lag.
- Challenges: Auto-scaling has cost; budget it. Per-tenant DLQ routing complicates consumer logic. Requires mature observability to distinguish tenant vs system issues.
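A sketch of the per-tenant breaker, assuming Java 16+; the failure threshold, cooldown, and the DLQ wiring shown in the trailing comment are illustrative.

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TenantCircuitBreaker {
    private record State(int failures, Instant openUntil) {}

    private final Map<String, State> states = new ConcurrentHashMap<>();
    private final int threshold = 5;                        // consecutive failures before opening
    private final Duration cooldown = Duration.ofMinutes(5);

    /** True while this org's events should be routed to its DLQ. */
    public boolean isOpen(String orgId) {
        State s = states.get(orgId);
        return s != null && s.openUntil() != null && Instant.now().isBefore(s.openUntil());
    }

    public void recordFailure(String orgId) {
        states.merge(orgId, new State(1, null), (old, unused) -> {
            int f = old.failures() + 1;
            return new State(f, f >= threshold ? Instant.now().plus(cooldown) : null);
        });
    }

    public void recordSuccess(String orgId) {
        states.remove(orgId);
    }
}

// In the consumer loop:
//   if (breaker.isOpen(e.orgId())) { sendToTenantDlq(e); continue; }
//   try { index(e); breaker.recordSuccess(e.orgId()); }
//   catch (Exception ex) { breaker.recordFailure(e.orgId()); sendToTenantDlq(e); }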
What is Expected at Each Level?
Mid-level (SMTS-junior)
WAL-based capture via Debezium. Kafka fan-out to multiple consumers. At-least-once delivery. Basic awareness of ordering guarantees.
Senior (SMTS / LMTS)
Schema Registry with backward compatibility. Partition strategy for per-entity ordering. Replay tooling with idempotent consumers. Concrete capacity numbers (events/s, MB/s, partitions).
Staff+ (PMTS)
Hot-key detection and sub-partition splitting. Cross-region replication with conflict resolution. Dual-write transitions for breaking schema changes. Capacity planning with headroom math. Consumer SLOs with per-tenant granularity. Per-tenant DLQ + replay isolation.
Salesforce-Specific Considerations
- Direct analog: Salesforce Change Data Capture (CDC) feature and Platform Events. The Pub/Sub API is the customer-facing streaming surface.
- Per-object replication filters: Salesforce lets customers opt in / out of CDC per object (e.g., "stream Account changes, skip Task changes"). Our connector config matches: cdc_connectors.tables is a whitelist.
- Governor analog: CDC events per day are metered per org; Essentials gets 100k/day, Enterprise gets unlimited. Our design meters similarly, with per-org quotas enforced at the Kafka broker level via client quotas.
- Big Object + Field History: the "archive to cold storage" pattern matches Salesforce's Field History Archive. Our long-term retention for audit use cases follows the same tiered model.
- Platform Events: Salesforce's internal transport for low-latency pub/sub between Apex, Flow, and external systems. Our Kafka layer plays the same role β a loosely-coupled broker between producers and consumers.
- External Services integration: customer-facing webhooks map to "External Services" registered in Salesforce. Same HMAC signing pattern.
- Hyperforce regional isolation: each region has its own CDC cluster; cross-region replication is opt-in per org.
Example snippet: idempotent consumer write
The same idempotent write, shown in Java, C++, and TypeScript.
Java (elasticsearch-java client):

import java.io.IOException;
import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.VersionType;

public class IdempotentEsIndexer {
    private final ElasticsearchClient esClient;

    public IdempotentEsIndexer(ElasticsearchClient esClient) {
        this.esClient = esClient;
    }

    public void onEvent(ChangeEvent e) throws IOException {
        // Dedup key combines the partition key and the strictly monotonic LSN.
        String docId = e.orgId() + ":" + e.entityId();
        long version = e.lsn(); // Elasticsearch external version
        esClient.index(i -> i
            .index("records-" + e.orgId().substring(0, 3))
            .id(docId)
            .version(version)
            .versionType(VersionType.External)
            .document(e.after()));
        // If two retries race, ES rejects the older-version write; safe.
    }
}

C++ (illustrative ES client types):

void EsIndexer::OnEvent(const ChangeEvent& e) {
  const std::string doc_id = e.org_id() + ":" + e.entity_id();
  IndexRequest req;
  req.index = "records-" + e.org_id().substr(0, 3);
  req.id = doc_id;
  req.version = e.lsn();                    // external version, monotonic per entity
  req.versionType = VersionType::kExternal;
  req.body = e.after();
  es_client_.Index(req);                    // OLDER_VERSION_CONFLICT treated as success
}

TypeScript (@elastic/elasticsearch):

async function onEvent(e: ChangeEvent) {
  const docId = `${e.orgId}:${e.entityId}`;
  try {
    await es.index({
      index: `records-${e.orgId.slice(0, 3)}`,
      id: docId,
      version: e.lsn,
      version_type: "external",
      document: e.after,
    });
  } catch (err: any) {
    if (err.meta?.body?.error?.type !== "version_conflict_engine_exception") {
      throw err;
    }
    // An older event arrived after a newer one; safe to drop.
  }
}