HLD: Audit Log System
Understanding the Problem
What is an Audit Log System?
An audit log system records every user action in the platform — logins, record views, permission changes, admin operations — with full context, and stores them immutably for compliance. Regulated customers (SOX, HIPAA, GDPR, FedRAMP) need these logs retained for 7+ years, and auditors need to query them by user, resource, tenant, and time range. The design tests how you balance ingest throughput, long-term storage cost, tamper-evidence, and hot-vs-cold query latency. Salesforce's Shield Event Monitoring and Field Audit Trail are the direct analogs — Shield retains for up to 10 years.
Functional Requirements
Core (above the line):
- Log every action — capture who, what, when, where, `tenantId`, `sourceIp`, result, before/after state. Uniform schema across producers.
- Immutable append-only — records cannot be modified or deleted (except via the compliance pathway below).
- Tamper-evident — any modification to a record is detectable. Use hash chaining + periodic notarization.
- Query by `(orgId, userId, resourceId, timeRange)` with millisecond-range filtering.
- Export — customers can export their audit trail for SOX reviews, GDPR requests, or self-hosted SIEM ingest.
- Hot / warm / cold tiering — recent logs fast-queryable; old logs cheap to store, slow to retrieve.
Below the line (out of scope):
- Real-time anomaly detection — that's a SIEM, built on top of this.
- Full-text search inside arbitrary payload fields — expensive; optional premium tier.
- ML-based user-behavior analytics — separate system.
Non-Functional Requirements
Core:
- Scale: 100k events/s steady, 1M events/s peak. 2.5B events/day. ~1.8 PB/year raw (~1 PB/year after compression).
- Latency: ingest p99 < 100 ms from the producer's point of view. Recent queries (< 7 days) p95 < 2 s. Warm queries (7 days to 1 year) p95 < 30 s. Cold queries run as async restore jobs against Glacier (12-hour retrieval SLA).
- Durability: 11 nines of durability (S3 level). Never lose an event. Lost audit events are a compliance violation.
- Integrity: tamper-evident. Detect any modification to historic records.
- Multi-tenancy: per-org data isolation; compliance-grade export per org.
Below the line:
- Real-time streaming to downstream dashboards (separate system).
- Full-text search on cold tier (premium tier only).
Capacity Estimation
- 2.5B events/day × 2 KB avg = 5 TB/day raw ingest (~1.8 PB/year). With 3x replication, 15 TB/day on disk in the hot path.
- Columnar compression (~2x) brings warm/cold storage to roughly 1 PB/year; 7 years retention ≈ 7 PB total. Cold storage at $0.004/GB-month = ~$28k/month for 7 PB. Affordable.
- Hot tier (7 days) at 35 TB, plus ~30% index overhead = ~46 TB in Elasticsearch.
- Warm tier (1 year) at 1 PB Parquet on S3. Queried via Trino/Athena.
- Cold tier (1-7 years) at 6 PB in S3 Glacier Deep Archive. Retrieval: 12-hour SLA; customers know in advance.
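A quick sanity-check of the arithmetic above; the ~2x columnar-compression ratio is an assumption used to reconcile 5 TB/day of raw ingest with ~1 PB/year of warm/cold storage:

```java
// Back-of-envelope check of the capacity numbers (all values approximate).
public class CapacityCheck {
    public static void main(String[] args) {
        double eventsPerDay = 2.5e9, avgBytes = 2048;
        double rawPerDayTB = eventsPerDay * avgBytes / 1e12;   // ~5 TB/day raw
        double rawPerYearPB = rawPerDayTB * 365 / 1000;        // ~1.8 PB/year raw
        double storedPerYearPB = rawPerYearPB / 2;             // ~1 PB/year at ~2x columnar compression
        double coldPB = storedPerYearPB * 7;                   // ~6.5 PB over the retention window
        double coldMonthlyUsd = coldPB * 1e6 * 0.004;          // GB-months at $0.004 ≈ $26k/mo (text rounds to 7 PB → $28k)
        System.out.printf("raw %.1f TB/day, %.2f PB/yr; stored %.2f PB/yr; cold %.1f PB ≈ $%.0fk/mo%n",
                rawPerDayTB, rawPerYearPB, storedPerYearPB, coldPB, coldMonthlyUsd / 1000);
    }
}
```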
The Set Up
Core Entities
- AuditEvent — the main record:
  - `eventId` (UUID).
  - `orgId`, `actorUserId`, `actorType` (user / service / admin).
  - `action` (e.g., `record.update`, `login.success`, `permission.grant`).
  - `resourceType`, `resourceId` — what was acted on.
  - `timestamp` (server-authoritative).
  - `sourceIp`, `userAgent`, `sessionId`.
  - `result` (success / failure, plus error code).
  - `payload` — before/after diff for mutations, parameters for reads.
  - `payloadHash` — sha256 of the canonical payload.
  - `prevHash` — hash of the previous event in this tenant's chain.
- TenantRetentionPolicy — `(orgId, hotDays, warmDays, coldYears, exportPrefs)`. Overrides default retention per org.
- ChainCheckpoint — periodic `(orgId, endEventId, chainHead, notarizedAt)` published to an independent ledger.
The API
Ingest (internal producers only):
```
POST /v1/audit/events
Headers: X-Producer-Id, X-Idempotency-Key
Body: { "orgId", "actorUserId", "action", ... }
```

Query (per-org):

```
GET /v1/orgs/{orgId}/audit/events
  ?user={userId}&resource={resourceId}&action={...}
  &from=2026-01-01T00:00:00Z&to=2026-04-01T00:00:00Z
  &cursor={opaque}&limit=100
```

Export:

```
POST /v1/orgs/{orgId}/audit/exports
{ "from": "...", "to": "...", "format": "jsonl | parquet", "destination": "s3://..." }
→ { exportJobId, status: "queued" }

GET /v1/orgs/{orgId}/audit/exports/{jobId}
→ { status: "queued | running | ready", downloadUrl? }
```

Verify integrity:

```
POST /v1/orgs/{orgId}/audit/verify
{ "from": "...", "to": "..." }
→ { ok: true, eventsVerified: 125000, chainHead: "..." }
```

High-Level Design
Architecture
```
Producers (Apex, services, gateway, admin tools)
        │
        ▼
┌──────────────┐     ┌────────────┐
│ Audit Edge   │──▶  │ Kafka      │──┬──▶ Hash-chain Writer ──▶ S3 WORM (cold)
│ (validate,   │     │ audit.raw  │  │
│ enrich, sign)│     └────────────┘  ├──▶ ES Cluster (hot, 7d)
└──────────────┘                     │
                                     └──▶ Parquet Lake (warm, 1y)

Query API ──▶ Tier Router
               ├──▶ ES (recent)
               ├──▶ Athena / Trino on Parquet (warm)
               └──▶ S3 restore job (cold)
```
End-to-end flow: a user updates a record (producer side)
- Application's record-update handler completes its DB transaction.
- Handler constructs an `AuditEvent` with full context: actor, target, before/after, IP, session.
- Handler writes the event to a local NVMe write-ahead buffer via the Audit Edge SDK and returns to the caller in < 1 ms (a buffer sketch follows this list).
- The SDK batches buffered writes and ships them to the Audit Edge service over gRPC (async from the producer's point of view).
- Audit Edge validates the envelope, enriches it with server-side fields (geolocation from IP, org tier tag), signs it with a service-level HMAC, and produces to the Kafka topic `audit.raw`, partitioned by `orgId`.
- Audit Edge returns 202 to the SDK; the SDK drops the event from its local buffer.
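A minimal sketch of the SDK-side buffer from step 3, assuming a newline-delimited JSON WAL file on local NVMe; class and method names are illustrative, not a real SDK API:

```java
// Sketch of the producer-side WAL buffer (names hypothetical): append the event
// to a local file and fsync before acking, so a process crash cannot lose it.
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LocalWalBuffer {
    private final FileChannel wal;

    public LocalWalBuffer(Path path) throws IOException {
        this.wal = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Called on the hot path; returns only after the event is durable on local disk.
    public synchronized void append(String eventJson) throws IOException {
        wal.write(StandardCharsets.UTF_8.encode(eventJson + "\n"));
        wal.force(false); // fsync the data; a crashed gateway replays this file on restart
    }
}
```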
End-to-end flow: downstream persistence
- Hash-chain Writer consumes `audit.raw`. For each event it:
  - Fetches the previous chain head for the `orgId` from a fast store (Redis cached + DynamoDB durable).
  - Computes `hash = sha256(payload || prevHash)`.
  - Sets `prevHash` and `payloadHash` on the event.
  - Updates the chain head atomically (conditional update: "only if the head still matches the expected value").
- The writer persists the final event to:
  - The Elasticsearch cluster (hot tier, 7 days rolling).
  - Parquet files on S3 (warm tier, partitioned by `orgId/yyyy/mm/dd/`).
  - The S3 Object Lock bucket (cold tier, WORM mode, 7-year retention).
- Every 1M events or every hour (whichever comes first), the writer publishes a chain checkpoint `(orgId, endEventId, chainHead, notarizedAt)` to an independent tamper-evident ledger — a public blockchain, a permissioned ledger like Hyperledger Fabric, a different cloud provider's WORM bucket, or a notarization service. The trigger is sketched below.
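A sketch of that checkpoint trigger; `LedgerClient` is a hypothetical stand-in for whichever external ledger is chosen:

```java
// Sketch of checkpoint publication (names hypothetical): flush a checkpoint when
// 1M events have been chained or an hour has elapsed, whichever comes first.
import java.time.Duration;
import java.time.Instant;

public class CheckpointPublisher {
    private static final long EVENT_THRESHOLD = 1_000_000;
    private static final Duration TIME_THRESHOLD = Duration.ofHours(1);

    private long eventsSinceCheckpoint = 0;
    private Instant lastCheckpoint = Instant.now();
    private final LedgerClient ledger; // assumed client for the external ledger

    public CheckpointPublisher(LedgerClient ledger) { this.ledger = ledger; }

    public void onChained(String orgId, String eventId, String chainHead) {
        eventsSinceCheckpoint++;
        Instant now = Instant.now();
        if (eventsSinceCheckpoint >= EVENT_THRESHOLD
                || Duration.between(lastCheckpoint, now).compareTo(TIME_THRESHOLD) >= 0) {
            ledger.notarize(orgId, eventId, chainHead, now); // external, tamper-evident
            eventsSinceCheckpoint = 0;
            lastCheckpoint = now;
        }
    }

    interface LedgerClient {
        void notarize(String orgId, String endEventId, String chainHead, Instant at);
    }
}
```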
End-to-end flow: a query
- Query API receives `GET /audit/events?from=2025-01-01&to=2026-04-01&user=...`.
- Tier Router inspects the time range:
  - Last 7 days → Elasticsearch.
  - 7 days to 1 year → Trino on Parquet.
  - Over 1 year → S3 restore job; return a 202 with a `jobId`.
- For a mixed-range query, the router splits it into per-tier subqueries, executes them in parallel, and merges results with a cursor-based paginator (see deep dive 2).
- Results include each event's `eventId` + `payloadHash` so the caller can verify integrity.
Data model
- Hot tier (0-7 days): Elasticsearch. One index per day per org group (shuffle-sharded). Mapping: keyword fields for `orgId`, `userId`, `resourceId`, `action`; `@timestamp` as date. Rollover daily.
- Warm tier (7d-1y): Parquet on S3, partitioned as `s3://audit-warm/{orgId}/{yyyy}/{mm}/{dd}/`. Queryable via Trino with per-org access policies.
- Cold tier (1-7y): S3 Glacier Deep Archive. Retrieval SLA 12 hours; the customer is told in advance.
- Chain checkpoint store: DynamoDB for low-latency head reads, plus independent external notarization.

Hash chain formula: `hash_n = sha256(event_n_canonical || hash_{n-1})`. `hash_0` is a per-org genesis hash seeded at org creation.
Multi-Tenancy Strategy
Isolation level: L1 for the ingest path and hot tier indexes (shared infrastructure, orgId partitioning), with per-org WORM buckets for cold storage to give compliance officers a clean story ("your audit data is in a bucket only you can access"). Warm Parquet uses per-org S3 prefixes with per-org IAM policies.
Tenant context flow:
- `orgId` is stamped on every event by the producer and re-verified by the Audit Edge service against the producer's service identity.
- Every ES index, S3 prefix, and Parquet partition is prefixed by `orgId`. Shuffle-sharded ES index groups prevent any single noisy org from degrading the whole cluster.
- Chain checkpoints are per-org: each org has its own chain. No cross-tenant hash references.
Noisy-neighbor mitigations:
- Per-org ingest rate caps at the Audit Edge service (a token-bucket sketch follows this list). Enterprise orgs with high throughput negotiate explicit quotas.
- Shuffle sharding of ES indexes: 100 index groups; each org maps to `k=4` of `n=100`, so a chatty tenant touches only 4% of ES capacity.
- Per-org retention policies override defaults. A financial-services org might keep 10 years; a small SaaS customer might keep 1 year. Policy is applied at compaction time.
- Per-org export quotas. Large exports run async with throttled I/O so one org's 1-TB export doesn't saturate egress bandwidth.
- Meta-audit: operator / admin access to any tenant's audit data is itself audited into a separate, operator-visible log.
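A sketch of the per-org ingest cap as a lazily refilled token bucket; the 1k events/s default is illustrative, and real quotas would come from the org's negotiated tier:

```java
// Sketch of the per-org ingest cap (a plain token bucket; numbers illustrative).
// Refill is computed lazily from elapsed time, so no background thread is needed.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class OrgRateLimiter {
    private static final double DEFAULT_EVENTS_PER_SEC = 1_000; // negotiated up for enterprise orgs

    private static final class Bucket {
        double tokens = DEFAULT_EVENTS_PER_SEC;
        long lastRefillNanos = System.nanoTime();
    }

    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    /** Returns false when the org is over its cap; the edge then answers 503 + Retry-After. */
    public boolean tryAcquire(String orgId) {
        Bucket b = buckets.computeIfAbsent(orgId, id -> new Bucket());
        synchronized (b) {
            long now = System.nanoTime();
            double refill = (now - b.lastRefillNanos) / 1e9 * DEFAULT_EVENTS_PER_SEC;
            b.tokens = Math.min(DEFAULT_EVENTS_PER_SEC, b.tokens + refill);
            b.lastRefillNanos = now;
            if (b.tokens < 1) return false;
            b.tokens -= 1;
            return true;
        }
    }
}
```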
Per-tenant observability:
- Per-org ingest rate, error rate, chain-verification latency.
- Per-org dashboards for compliance teams showing retention policy, last export, chain status (green / lag detected / integrity fault).
Potential Deep Dives
1) How do we guarantee ingest durability?
Bad Solution: Synchronous write to ES.
- Approach: Producer calls ES directly on the hot path; waits for ack.
- Challenges: ES outage means every producer blocks or starts dropping events. Lost audit events = compliance violation. Also, ES is not designed for 1M events/s sustained — it's a search engine, not an ingest pipeline.
Good Solution: Fire-and-forget to Kafka.
- Approach: Producer produces to Kafka; async consumer writes to ES. At-least-once delivery via Kafka acks=all.
- Challenges: What happens during a Kafka outage? Producer has to buffer somewhere; if it buffers in memory and process crashes, events are lost.
Great Solution: Outbox on producer + local WAL buffer + Kafka + operator-visible back-pressure.
- Approach:
  - Producer outbox: for mutations that already have a DB transaction, write the audit event to an outbox table in the same transaction; a separate shipper drains the outbox to Kafka. Events survive even if the app crashes right after the DB write (a minimal outbox sketch follows this list).
  - Local NVMe WAL buffer at the Audit Edge gateway as a secondary durability layer. Events are `fsync`'d to disk before the gateway returns 2xx to the producer. A crashed gateway recovers events from disk on restart.
  - If the buffer fills (gateway overwhelmed during a Kafka outage), the gateway returns 503 with `Retry-After`. Producers back off. Observable: operators see disk-full alerts before data is lost.
  - Kafka `acks=all`, min ISR = 2. No produce is considered complete until two brokers have fsync'd.
- Challenges: The outbox pattern adds complexity to every producer. The WAL buffer means gateway nodes need NVMe — more expensive pods. Producers must handle 503s gracefully, which not every legacy caller does.
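A minimal outbox sketch, assuming a relational store; the `audit_outbox` table, column names, and Postgres-flavored SQL are hypothetical:

```java
// Outbox sketch (SQL and names hypothetical): the audit event commits or
// rolls back atomically with the business write, in the same transaction.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class RecordUpdateHandler {
    public void updateRecord(Connection conn, String recordId, String newValue,
                             String auditEventJson) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement update = conn.prepareStatement(
                 "UPDATE records SET value = ? WHERE id = ?");
             PreparedStatement outbox = conn.prepareStatement(
                 "INSERT INTO audit_outbox (event_json, created_at) VALUES (?, now())")) {
            update.setString(1, newValue);
            update.setString(2, recordId);
            update.executeUpdate();
            outbox.setString(1, auditEventJson);
            outbox.executeUpdate();
            conn.commit(); // both rows or neither; a shipper drains audit_outbox to Kafka
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```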
2) How do we query across 7 years cost-effectively?
Bad Solution: Keep everything in Elasticsearch.
- Approach: 7 PB in ES. Index everything.
- Challenges: At $0.10/GB-month for managed ES, that's ~$700k/month just for storage, plus compute. Completely infeasible.
Good Solution: Tiered storage with a static router.
- Approach: ES for hot (7d), Parquet for warm (1y), Glacier for cold (7y). Tier Router picks a single tier based on time range.
- Challenges: Range queries spanning tiers (e.g., "last 2 years") either force a query against the widest tier (slow) or force the client to split the range manually.
Great Solution: Smart query planner with parallel tier execution.
- Approach:
  - Parse the time range and split it into per-tier subqueries (a splitter sketch follows this list). E.g., "last 2 years" = `[hot: last 7d, warm: 7d-1y, cold: 1y-2y]`.
  - Execute the subqueries in parallel.
  - Merge results with a cursor-based paginator; sort by timestamp descending across tiers.
  - Detect all-cold queries upfront and return a `202 Accepted` with a `jobId` — cold-tier retrieval takes hours and we don't want the HTTP connection to time out.
  - The export flow writes Parquet output directly to the customer's S3 bucket (cross-account) when possible, avoiding egress costs.
- Challenges: Cold-tier restores cost real money ($0.02/GB retrieval) — rate-limit and bill for them. Trino on Parquet needs per-org predicate pushdown to avoid scanning every org's data; mitigate via Parquet partitioning on `orgId`.
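A sketch of the range splitter from the first bullet. The 7-day and 365-day boundaries follow the tiering above; class and method names are illustrative:

```java
// Sketch of the tier split (boundaries illustrative): cut [from, to) at the
// hot/warm and warm/cold boundaries, producing at most one subquery per tier.
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class TierRouter {
    public enum Tier { HOT, WARM, COLD }

    public record SubQuery(Tier tier, Instant from, Instant to) {}

    public List<SubQuery> split(Instant from, Instant to, Instant now) {
        Instant hotStart = now.minus(Duration.ofDays(7));
        Instant warmStart = now.minus(Duration.ofDays(365));
        List<SubQuery> out = new ArrayList<>();
        addIfNonEmpty(out, Tier.COLD, from, min(to, warmStart));
        addIfNonEmpty(out, Tier.WARM, max(from, warmStart), min(to, hotStart));
        addIfNonEmpty(out, Tier.HOT, max(from, hotStart), to);
        return out; // subqueries run in parallel, merged by timestamp descending
    }

    private static void addIfNonEmpty(List<SubQuery> out, Tier t, Instant a, Instant b) {
        if (a.isBefore(b)) out.add(new SubQuery(t, a, b));
    }
    private static Instant min(Instant a, Instant b) { return a.isBefore(b) ? a : b; }
    private static Instant max(Instant a, Instant b) { return a.isAfter(b) ? a : b; }
}
```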
3) How do we provide tamper-evidence?
Bad Solution: Trust the filesystem.
- Approach: Rely on S3 to not lose data.
- Challenges: S3 won't lose data, but an insider with write access could modify records. Compliance requires detection even of privileged modifications.
Good Solution: Per-event sha256.
- Approach: Each event stores a hash of its payload. Tampering is detected by recomputing.
- Challenges: An attacker who modifies the event can also recompute and store the new hash. No chain of trust.
Great Solution: Hash chain + periodic independent notarization + S3 Object Lock.
- Approach:
  - Hash chain: each event's `hash_n = sha256(event_n_canonical || hash_{n-1})`. A single modification anywhere in the chain invalidates every subsequent hash.
  - The chain head is updated atomically per event via a DynamoDB conditional write.
  - Periodic chain-head notarization: every hour (or every 1M events), publish `(orgId, chainHead, timestamp)` to an independent ledger — could be:
    - A public blockchain (Bitcoin OP_RETURN, Ethereum) or a permissioned ledger (Hyperledger Fabric).
    - A different cloud provider's WORM bucket.
    - A dedicated notarization service.
  - S3 Object Lock in compliance mode on the cold-tier bucket. Once an object is written, it cannot be deleted or overwritten — not even by the root account. Retention is set to 7 years.
  - A verification job periodically rehashes ranges of events and compares against the stored chain hashes (sketched after this list). It alerts on any mismatch.
- Challenges: Chain-head updates are a serialization point — contention when one org's throughput is high. Mitigate by sharding the chain per-org and per-day. Notarizing to a public blockchain incurs transaction fees; batch checkpoints via Merkle trees. Recovery after an integrity fault is hard — you know the data was tampered with, but not what the original was. That's forensics.
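A sketch of the verification job under stated assumptions: events are replayed in chain order, and each exposes its canonical payload and stored chain hash (the `AuditEvent` interface here is hypothetical):

```java
// Sketch of the verification job: replay the chain over a range of events
// and flag the first mismatch between recomputed and stored hashes.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;

public class ChainVerifier {
    /** @param startHash chain hash of the event just before the range (or the org's genesis hash). */
    public int verify(List<AuditEvent> eventsInOrder, String startHash) throws Exception {
        String head = startHash;
        int verified = 0;
        for (AuditEvent e : eventsInOrder) {
            String payloadHash = sha256(e.canonicalPayload());
            String expected = sha256(payloadHash + head);
            if (!expected.equals(e.chainHash())) {
                throw new IllegalStateException("integrity fault at event " + e.eventId());
            }
            head = expected; // advance the chain
            verified++;
        }
        return verified;
    }

    private static String sha256(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(d);
    }

    interface AuditEvent {
        String eventId();
        String canonicalPayload();
        String chainHash();
    }
}
```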
4) How do we handle GDPR "right to be forgotten" vs immutability?
Bad Solution: Delete rows on GDPR request.
- Approach: Literally delete the events containing PII.
- Challenges: Breaks hash chain. SOX says "thou shalt not delete audit records." GDPR says "thou shalt delete PII on request." Irreconcilable if we take both literally.
Good Solution: Tombstone + redaction.
- Approach: Replace PII fields with `REDACTED`, keep the envelope. Recompute a parallel chain. The original is stored under an encryption key that can be destroyed.
- Challenges: The parallel chain is messy, and PII can leak into unstructured `payload` fields that were never marked.
Great Solution: Crypto-shredding with per-user keys.
- Approach:
- Every event's PII is encrypted at ingest with a per-user DEK (data encryption key). Non-PII stays plaintext.
- The per-user DEK is encrypted with a per-user KEK (key encryption key) stored in a KMS.
- "Right to be forgotten" = destroy the user's KEK. Every DEK is then unrecoverable, and so is the PII.
- The event envelope (timestamp, action, resource type, `orgId`) stays intact. The chain is unbroken. The PII becomes cryptographically inaccessible (a sketch follows this list).
- Document the GDPR / SOX interaction with legal: "the event is still present for SOX, but the PII content is unrecoverable per GDPR."
- Challenges: Key management is hard. KMS quotas on key operations. Auditors may ask "can you show me the data" and the answer is "no, because we destroyed the key per user request" — that's the correct answer but requires legal backing. Some jurisdictions don't accept crypto-shredding as "deletion" — check with counsel.
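A self-contained sketch of the crypto-shredding idea, assuming AES-GCM for the per-user data key; the in-memory map is a stand-in for a real KMS/HSM-backed key hierarchy (DEK wrapped by KEK), and `shred` models KEK destruction:

```java
// Sketch of crypto-shredding (key store hypothetical): PII is sealed with a
// per-user AES key; deleting that key renders every sealed record unreadable.
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class PiiVault {
    // Stand-in for a KMS: in production the per-user key lives behind HSM-backed KMS APIs.
    private final Map<String, SecretKey> userKeys = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    public byte[] encryptPii(String userId, String pii) throws Exception {
        SecretKey key = userKeys.computeIfAbsent(userId, id -> newKey());
        byte[] iv = new byte[12];
        random.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ct = cipher.doFinal(pii.getBytes(StandardCharsets.UTF_8));
        byte[] out = new byte[iv.length + ct.length]; // store IV alongside ciphertext
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ct, 0, out, iv.length, ct.length);
        return out;
    }

    /** "Right to be forgotten": after this, every record sealed for userId is unrecoverable. */
    public void shred(String userId) {
        userKeys.remove(userId);
    }

    private SecretKey newKey() {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(256);
            return kg.generateKey();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```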
5) How do we scale the hot tier for high-cardinality queries?
Bad Solution: One giant ES index.
- Approach: All orgs' last 7 days in a single index.
- Challenges: Large indexes slow down queries on small orgs. Tenant isolation is weak. A noisy org's query load can DoS everyone.
Good Solution: One index per org per day.
- Approach: `audit-{orgId}-{date}` indexes.
- Challenges: 100k orgs × 7 days = 700k indexes. ES cluster state blows up. Each index has overhead. Many orgs have almost no events, wasting capacity.
Great Solution: Shuffle-sharded index groups + per-org aliases.
- Approach:
  - 100 index groups — `audit-group-{0..99}-{date}`. Each group holds events from many orgs.
  - Each org is mapped to `k=4` of `n=100` groups via shuffle sharding (a mapping sketch follows this list). Writes go to a random one of those 4 per event.
  - Queries use per-org aliases that filter by `orgId` and fan out to the 4 groups — 4x query fan-out, but bounded.
  - Group-level rollover (daily): keep 7 days of groups hot, roll older groups to warm.
  - Mega-tenants (top 50 by volume) get dedicated indexes per day. They bypass the shuffle-shard scheme because their volume would otherwise dominate.
- Challenges: The 4x fan-out increases p95 but keeps it manageable. Re-balancing the shuffle shards (adding index groups) requires careful migration. Dedicated mega-tenant indexes need separate operational attention.
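A sketch of the deterministic org→group mapping with the `k=4`, `n=100` parameters above; seeding a PRNG with the `orgId` keeps the mapping stable across nodes without a lookup table:

```java
// Sketch of the org→groups mapping (k=4 of n=100, per the text). String.hashCode
// is spec-defined, so the seeded shuffle yields the same 4 groups on every node.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffleShardMapper {
    private static final int N_GROUPS = 100;
    private static final int K_PER_ORG = 4;

    /** Returns the 4 index groups this org writes to, stable for a given orgId. */
    public static List<Integer> groupsFor(String orgId) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < N_GROUPS; i++) all.add(i);
        Collections.shuffle(all, new Random(orgId.hashCode())); // deterministic per org
        return all.subList(0, K_PER_ORG);
    }

    public static String indexFor(String orgId, String date) {
        List<Integer> groups = groupsFor(orgId);
        int pick = groups.get(new Random().nextInt(K_PER_ORG)); // random one of the 4 per write
        return "audit-group-" + pick + "-" + date;
    }
}
```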
What is Expected at Each Level?
Mid-level (SMTS-junior)
Append-only table per org. Kafka pipeline for ingest. Basic tenant scoping. Can be prompted for tiering and hash chain.
Senior (SMTS / LMTS)
Hot/warm/cold tiering with router. Hash chain for tamper evidence. Query planner. Export flow. Per-org retention policies.
Staff+ (PMTS)
Crypto-shredding for GDPR compliance. WORM storage with Object Lock. Independent notarization. Cost model per tier with concrete numbers. Meta-audit of admin access. Shuffle-sharded ES groups with dedicated mega-tenant indexes. Verification job with integrity-fault alerting.
Salesforce-Specific Considerations
- Direct analog: Salesforce Shield Event Monitoring + Field Audit Trail. Field Audit Trail retains history for up to 10 years per customer.
- Platform Events as ingest channel: internal producers publish to a Platform Event topic; the audit service subscribes. Keeps coupling loose between producer and audit service.
- Per-org retention policies map directly to Shield customer configurations. Financial-services customers often ask for 10-year retention with monthly export.
- Shield Encryption: Salesforce Shield supports per-field encryption with customer-managed keys. The crypto-shredding pattern extends this: each customer holds their own KEK via BYOK (Bring Your Own Key), and destruction is a customer-initiated operation.
- Hyperforce data residency: audit data for EU orgs lives in EU regions; cross-region replication is forbidden unless the customer explicitly enables it.
- Governor-style: per-org event limits — Essentials gets 10k/day audit events, Enterprise gets unlimited. Over-limit events are rate-shed at the Audit Edge with SEV-2 alerting for the customer to investigate.
- SOX / FedRAMP / HIPAA: the specific retention, immutability, and access-control guarantees above are what these standards require. Be ready to enumerate them: "SOX wants 7-year retention, HIPAA wants encryption at rest and access logs, FedRAMP wants FIPS 140-2 crypto modules."
Example snippets — hash-chain writer

Java:

```java
public class HashChainWriter {
    public void onEvent(AuditEvent e) {
        String head = ddb.getHead(e.orgId());         // latest chain head for this org
        String payloadHash = sha256(canonicalize(e));
        String newHash = sha256(payloadHash + head);  // hash_n = sha256(payloadHash_n || hash_{n-1})
        e.setPayloadHash(payloadHash);
        e.setPrevHash(head);
        e.setChainHash(newHash);
        // Conditional update: only commit if the head hasn't moved under us.
        boolean ok = ddb.compareAndSwapHead(e.orgId(), head, newHash, e.eventId());
        if (!ok) throw new RetryException("concurrent chain update");
        esSink.write(e);      // hot tier
        parquetSink.write(e); // warm tier
        wormSink.write(e);    // cold tier (Object Lock)
    }
}
```

C++:

```cpp
void HashChainWriter::OnEvent(AuditEvent& e) {
  const std::string head = ddb_.GetHead(e.org_id());  // latest chain head for this org
  const std::string payload_hash = Sha256(Canonicalize(e));
  const std::string new_hash = Sha256(payload_hash + head);
  e.set_payload_hash(payload_hash);
  e.set_prev_hash(head);
  e.set_chain_hash(new_hash);
  // Conditional update: only commit if the head hasn't moved under us.
  if (!ddb_.Cas(e.org_id(), head, new_hash, e.event_id())) {
    throw RetryException("concurrent chain update");
  }
  es_sink_.Write(e);       // hot tier
  parquet_sink_.Write(e);  // warm tier
  worm_sink_.Write(e);     // cold tier (Object Lock)
}
```

TypeScript:

```typescript
async function writeChained(e: AuditEvent) {
  const head = await ddb.getHead(e.orgId); // latest chain head for this org
  const payloadHash = sha256(canonicalize(e));
  const newHash = sha256(payloadHash + head);
  e.payloadHash = payloadHash;
  e.prevHash = head;
  e.chainHash = newHash;
  // Conditional update: only commit if the head hasn't moved under us.
  const ok = await ddb.cas(e.orgId, head, newHash, e.eventId);
  if (!ok) throw new RetryError("concurrent chain update");
  await Promise.all([
    esSink.write(e),      // hot tier
    parquetSink.write(e), // warm tier
    wormSink.write(e),    // cold tier (Object Lock)
  ]);
}
```