Backend Fundamentals - Salesforce SMTS
Broad-surface backend recall targeted at Salesforce SMTS Backend interviews (~3 YoE). The fundamentals show up in R3/R4/R5 follow-ups: they probe multi-tenancy, consistency, and concurrency. Answer the core question first, then volunteer one Salesforce-angle hook (tenant isolation, governor limits, bulkification) to signal you understand the platform.
Quick reference cheat sheet
Rapid one-liners. If the interviewer throws these as a lightning round, you should produce the one-liner in under 10 seconds.
| Concept | One-line recall |
|---|---|
| ACID | Atomicity, Consistency, Isolation, Durability: all-or-nothing, valid state, no interference, survives crash |
| Read Committed | Prevents dirty reads; Postgres default; non-repeatable reads still possible |
| Repeatable Read | Same row reads are stable; phantoms possible in SQL standard (MySQL InnoDB blocks phantoms via gap locks) |
| Serializable | Full isolation; behaves as if transactions ran one at a time |
| B-tree index | Ordered tree, O(log n) lookup, supports range and equality |
| Hash index | Equality only, O(1), no range scans |
| Covering index | Index contains all queried columns, avoids heap lookup |
| Selectivity | Fraction of unique values; low selectivity = index is useless |
| Sharding | Partition data across nodes by key (range, hash, directory) |
| Consistent hashing | Keys and nodes on a ring, minimal rebalance on node churn |
| Replication lag | Time between write on primary and visibility on replica |
| 2PC | Prepare + commit across participants, blocks on coordinator failure |
| Saga | Long-running tx as sequence of local tx + compensations |
| Outbox | Write event to same DB tx, separate process publishes |
| CDC | Stream DB changes (logical WAL) to downstream consumers |
| Cache-aside | App reads cache, on miss reads DB and backfills |
| Write-through | Writes go through cache to DB synchronously |
| LRU | Evict least recently used |
| Cache stampede | Many requests hit DB when hot key expires |
| Mutex | Exclusive lock |
| Semaphore | N permits, gate concurrency |
| CAS | Compare-and-swap, lock-free primitive |
| Deadlock | 4 conditions: mutex, hold-and-wait, no preemption, circular wait |
| Kafka partition | Unit of parallelism and ordering |
| At-least-once | Retries may produce duplicates; needs idempotency |
| Exactly-once (Kafka) | Transactional producer + idempotent consumer + read-committed |
| DLQ | Dead letter queue for poisoned messages |
| HTTP/2 | Multiplexed streams over one TCP, HPACK headers |
| gRPC | HTTP/2 + protobuf, supports bidi streaming |
| CAP | Under partition, pick Consistency or Availability |
| PACELC | Else (no partition), pick Latency or Consistency |
| Raft | Leader + log replication + majority quorum |
| Lamport ts | Scalar logical clock, total order |
| Vector clock | Per-node counter, detects concurrent events |
| Redlock | Redis multi-node distributed lock (controversial) |
| Circuit breaker | Closed → open on failures → half-open to probe |
| Bulkhead | Isolate resources so one tenant can't drown others |
| Jitter | Random offset on retry to avoid herds |
| JWT | base64url(header).base64url(payload).signature |
| OIDC | OAuth 2.0 + ID token for identity |
| mTLS | Both client and server present certs |
| RBAC | Role-based access control, permissions attached to roles |
| ABAC | Attribute-based; policy over subject/resource/env attributes |
| Row-level security | DB-enforced filter on tenant_id/owner |
| CQRS | Split command and query models |
| Event sourcing | State = fold(events), append-only log |
How to deploy this in Salesforce interviews
At SMTS level the interviewer cares less about textbook definitions and more about how you reason under multi-tenant constraints. Use this mental model when answering:
- Default to tenant isolation. Every data structure, cache key, queue name, thread pool, and log line should carry a `tenantId` (`orgId` in Salesforce lingo). If your answer doesn't mention it, the interviewer will push until you do.
- Fairness beats peak throughput. Governor limits exist because one noisy tenant should never starve the other 150k orgs on the same pod. When you pick a pattern, articulate how it bounds the blast radius.
- Consistency is a product decision, not just a DB flag. Salesforce records must be consistent within an org (CP), but search indexes and reports can be eventually consistent (AP). State which side you're on.
- Bulkify by default. Don't loop a network call per record; batch. When you propose a solution, describe the batch boundary (200 records, 10MB, 30s window, whichever comes first).
- Always describe failure modes. For every happy path, mention retry behavior, idempotency key, DLQ, and how you'd detect it in observability.
Interview followup pattern you'll see: "OK, that works for 1 tenant. What breaks at 10k tenants? What breaks at 150k?" Always have the next-order answer ready.
Section 1 - Databases
Databases are the heaviest fundamentals topic for Salesforce because the platform is fundamentally a database-as-a-service with a programmable layer. Expect isolation, indexing, and multi-tenant schema design to eat 20+ minutes in a 60-minute loop.
ACID
Definition. Atomicity (tx is all-or-nothing), Consistency (tx moves DB from valid state to valid state respecting constraints), Isolation (concurrent tx don't interfere per the chosen level), Durability (committed data survives crashes, typically via WAL fsync).
When ACID. Financial ledgers, inventory debits, anything where a half-applied change corrupts the business (a Salesforce Opportunity and its OpportunityLineItems must commit together; orphan line items mean a bug report from a Fortune 500 customer).
When not ACID. Analytical pipelines, activity streams, audit logs that only append. Eventual consistency is fine and cheaper.
Interview followup. "What does Consistency in ACID mean vs Consistency in CAP?" They're different: ACID-C is about invariants (unique constraints, FK, check constraints), CAP-C is about replica agreement.
Salesforce angle. A Salesforce transaction inside Apex is ACID within an org ā inserts, triggers, and workflows run in a single DB transaction and roll back together.
Isolation levels
Ordered weakest to strongest:
| Level | Dirty read | Non-repeatable read | Phantom read | Typical engine |
|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | SQL Server (if set) |
| Read Committed | Blocked | Possible | Possible | Postgres default, Oracle default |
| Repeatable Read | Blocked | Blocked | Possible (standard); Blocked in InnoDB via gap locks | MySQL InnoDB default |
| Serializable | Blocked | Blocked | Blocked | Rare default; Postgres SSI |
- Dirty read: T1 reads T2's uncommitted write.
- Non-repeatable read: T1 reads a row, T2 updates it and commits, T1 reads again and sees different data.
- Phantom read: T1 runs `SELECT ... WHERE age > 30`, T2 inserts a matching row and commits, T1 reruns and sees a new row.
Postgres SSI (Serializable Snapshot Isolation) detects conflicts at commit and aborts one tx; requires retry logic.
Interview followup. "Your code does `SELECT balance`, then `UPDATE balance = balance - 100`. Is Read Committed safe?" No: classic lost update. Use `SELECT ... FOR UPDATE`, an atomic `UPDATE balance = balance - 100 WHERE balance >= 100`, or bump isolation to Repeatable Read.
Salesforce angle. Salesforce's DB layer is Oracle; most workloads run at Read Committed with row-level locks (FOR UPDATE) for things like approval processes or sequence generation.
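The lost-update fix has the same compare-and-act shape in application code. Here is a minimal in-memory analog of the atomic conditional `UPDATE` (the class and amounts are illustrative, not platform code): the check and the debit happen in one atomic step, so two concurrent debits cannot both observe the old balance.

```java
import java.util.concurrent.atomic.AtomicLong;

// Analog of `UPDATE balance = balance - 100 WHERE balance >= 100`:
// check and debit are one atomic step, so there is no lost update.
class Account {
    private final AtomicLong balance;

    Account(long initial) { balance = new AtomicLong(initial); }

    /** Returns true if the debit applied, false if funds were insufficient. */
    boolean tryDebit(long amount) {
        long before = balance.getAndUpdate(b -> b >= amount ? b - amount : b);
        return before >= amount;
    }

    long balance() { return balance.get(); }
}
```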
Indexes
B-tree is the default. Ordered, supports equality, range (BETWEEN, <, >), prefix matches (LIKE 'abc%'), and ORDER BY. O(log n) lookup.
Hash is equality-only, O(1). No range scans. Postgres has hash indexes; most engines default to B-tree anyway.
Covering index includes all columns needed by a query, so the engine never visits the heap. Postgres INCLUDE clause, MySQL secondary indexes already carry PK so they can be covering by accident.
Composite `(a, b, c)` supports `WHERE a = ?`, `WHERE a = ? AND b = ?`, and `WHERE a = ? AND b = ? AND c = ?`. It does not support `WHERE b = ?` alone (leftmost prefix rule).
Partial indexes cover a subset: `CREATE INDEX ... WHERE deleted = false`. Cheaper to build and smaller.
When indexes hurt. Write-heavy tables (every insert updates every index). Low-selectivity columns (a boolean, or a status with 3 values: a scan is faster). Huge wide indexes on small tables.
Selectivity. `unique_values / total_rows`. An index on `is_active` with a 50/50 split is useless; the planner will seq scan. An index on `email` with near-1.0 selectivity is ideal.
Interview followup. "You have a composite index `(org_id, created_at)`. Which queries use it?" Anything filtering on `org_id` alone, or on `org_id` + `created_at`. Not `created_at` alone.
Salesforce angle. Multi-tenant tables start every index with org_id. A query without a leading org_id predicate will scan across tenants and is usually rejected at code review.
Query optimization

- EXPLAIN / EXPLAIN ANALYZE. Read bottom-up. Watch for `Seq Scan` on large tables, `Nested Loop` with high row counts (should be a Hash Join), and huge `rows` estimates that are off by 10x (stale statistics: run `ANALYZE`).
- Index hints. MySQL `USE INDEX`, Oracle `/*+ INDEX(...) */`. Last resort; usually means the stats are wrong.
- Query rewriting. Turn correlated subqueries into joins. Replace `OR` with `UNION ALL` when each branch is selective. Push predicates down before joins.
- N+1. Loading a parent, then looping a query per child. Fix with a join, an `IN (...)` batch, or the dataloader pattern.
Interview followup. "The query was fast yesterday and slow today, same data size. What happened?" Plan flip due to stale stats, parameter sniffing, or a new index changing planner choices.
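The N+1 fix can be shown concretely: one batched fetch for all parents, grouped in memory, instead of one query per parent. The types and the in-memory table stand-in below are hypothetical sketches of a real DAO.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical dataloader sketch: the list stands in for the order_line
// table; one batched lookup replaces a per-parent query loop.
record OrderLine(long orderId, String sku) {}

class LineBatchLoader {
    private final List<OrderLine> table;

    LineBatchLoader(List<OrderLine> table) { this.table = table; }

    // Equivalent of `SELECT * FROM order_line WHERE order_id IN (...)`
    // followed by a client-side group-by.
    Map<Long, List<OrderLine>> loadFor(Set<Long> orderIds) {
        return table.stream()
            .filter(l -> orderIds.contains(l.orderId()))
            .collect(Collectors.groupingBy(OrderLine::orderId));
    }
}
```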
Normalization and when to denormalize

- 1NF: atomic columns, no repeating groups.
- 2NF: 1NF + no partial dependency on a composite key.
- 3NF: 2NF + no transitive dependency.
- BCNF: stricter 3NF.
Denormalize when. Read-heavy, reporting, dashboards, hot paths where the join cost dominates. Duplicate data with a refresh job or CDC pipeline, and document the source of truth.
Interview followup. "How do you keep a denormalized column in sync?" Triggers (fragile), CDC into materialized view, or rebuild-on-write in application code guarded by a single writer.
Salesforce angle. Reports and list views use denormalized summary fields (rollup summary, formula fields). The platform maintains them via background jobs.
SQL vs NoSQL decision tree
Pick SQL when:
- You need multi-row transactions.
- Schema is stable and you value constraints.
- Ad-hoc analytics with joins.
- Regulatory audit trails.
Pick NoSQL when:
- Schema is genuinely flexible (event payloads, product catalogs with vendor-specific attrs).
- Scale of writes exceeds a single primary and sharding SQL is operationally painful.
- You need a specific data model (graph for relationships, time-series for metrics).
- Latency budgets demand in-memory (Redis).
Most Salesforce backend services are SQL first; Redis for cache, Kafka for events, document stores only where justified.
NoSQL types
- Document (MongoDB, DocumentDB). JSON-ish documents, flexible schema, indexes on fields. Good for content-heavy data.
- Key-value (Redis, DynamoDB KV mode). Single key → value. Fastest. Caches, session stores, rate-limit counters.
- Columnar (Cassandra, HBase, ScyllaDB). Wide rows keyed by partition; great for time-series and write-heavy workloads. Tunable consistency.
- Graph (Neo4j, Neptune). Nodes and edges with traversal queries. Fraud detection, social graphs.
- Time-series (InfluxDB, TimescaleDB). Optimized for append + time-range reads with retention policies. Metrics, IoT.
Interview followup. "When would you choose Cassandra over MongoDB?" Wide-column workloads with heavy writes, multi-DC active-active, tunable consistency (QUORUM, LOCAL_QUORUM). MongoDB is better for flexible schemas and secondary indexes.
Sharding
Splitting a logical table across physical nodes.
- Range sharding. Shard 1: a-m, Shard 2: n-z. Simple range queries, but hotspots (e.g., if the key is a timestamp, all writes hit the latest shard).
- Hash sharding. `shard = hash(key) % N`. Even distribution. Range queries must scatter.
- Directory sharding. Lookup table from key → shard. Flexible, can rebalance per-key, but the directory is a SPOF and a hot read.
- Consistent hashing. Keys and nodes hashed onto a ring; a key lives on the next clockwise node. Adding/removing a node only moves 1/N of the keys. Use virtual nodes (vnodes, typically 100-256 per physical node) to even out distribution and make rebalancing granular.
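The ring described above can be sketched with a sorted map. The vnode count and hash function here are illustrative (a production ring would use murmur3 or xxHash rather than `String.hashCode`):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent-hash ring with virtual nodes: each physical node is hashed
// onto the ring many times; a key is owned by the first vnode clockwise
// from its hash. Adding a node only steals ~1/N of the keys.
class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int vnodes;

    HashRing(int vnodes) { this.vnodes = vnodes; }

    void addNode(String node) {
        for (int i = 0; i < vnodes; i++) ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < vnodes; i++) ring.remove(hash(node + "#" + i));
    }

    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        // Wrap around the ring if we fell past the last vnode.
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        // Illustrative only; use murmur3/xxHash in practice.
        int h = s.hashCode();
        return h ^ (h >>> 16);
    }
}
```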
Hotspot mitigation. Salt the key (`{tenantId}:{randomBucket0-15}:{id}`), add a secondary prefix, time-bucket, or route power-user tenants to dedicated shards.
Interview followup. "A tenant is 100x the next largest. What do you do?" Isolate them: move to their own shard (pod in Salesforce terms), or use a different storage tier for their large objects.
Salesforce angle. Orgs are assigned to pods; large orgs may get dedicated pods. Data within an org stays on one pod.
Replication
- Master-slave (primary-replica). One writer, N readers. Simple, read scaling, async lag.
- Master-master (multi-primary). Writes to any node; needs conflict resolution (last-write-wins, CRDTs, app-level merge). Operationally hard.
- Sync replication. Writer waits for replica ack. Zero data loss on failover, higher write latency. Postgres synchronous commit, Oracle Data Guard SYNC.
- Async replication. Writer returns immediately, replicas catch up. Risk of data loss on primary crash.
Read replicas serve stale data (lag from ms to minutes under load). Don't read your own writes from a replica ā pin the read to primary after a write, or use a session token to verify the replica has caught up (MySQL GTID, Postgres LSN).
Interview followup. "How do you handle a replication lag spike?" Alert on lag metric; throttle write-heavy jobs; fall back to primary for critical reads; investigate long transactions blocking apply.
Multi-tenant DB patterns
Three canonical choices:
- Shared DB + shared schema (pool model). Every table has `tenant_id`. Every query has `WHERE tenant_id = :t`. Cheapest, scales to millions of tenants. Risk: a missing predicate leaks data across tenants. Mitigate with row-level security (Postgres RLS) or a mandatory repository layer that injects `tenant_id`.
- Shared DB + separate schema (bridge model). One DB, one schema per tenant. Schema migrations run per tenant (slow at scale). Moderate isolation; noisy neighbors still share the buffer pool.
- DB per tenant (silo model). Best isolation, easy backup/restore per tenant, easy compliance. Expensive at >1k tenants; migrations become orchestration problems.
Interview followup. "Customer demands their data in EU-only, others stay in US." DB-per-tenant or at least shard-per-region. The pool model can't satisfy data-residency by itself.
Salesforce angle. Historically Salesforce uses a shared schema on Oracle with org_id on every row, plus a sophisticated metadata layer. It's the textbook pool model, proven at 150k+ orgs per pod.
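The "mandatory repository layer" mitigation for the pool model can be sketched as a store whose only read API takes a tenant ID, so a missing tenant predicate is a compile error rather than a data leak. This is an in-memory stand-in; a real version would wrap SQL and append `WHERE tenant_id = :t`.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Pool-model guardrail: every read is tenant-scoped by construction.
class TenantScopedStore {
    record Row(String tenantId, String id, String payload) {}

    private final Map<String, Row> rows = new ConcurrentHashMap<>();

    void insert(Row r) { rows.put(r.tenantId() + "/" + r.id(), r); }

    // The only read path; there is no findAll() without a tenant.
    List<Row> findAll(String tenantId) {
        return rows.values().stream()
            .filter(r -> r.tenantId().equals(tenantId))
            .collect(Collectors.toList());
    }
}
```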
Partitioning

- Horizontal (aka sharding when across nodes). Rows split by a key. Postgres 10+ declarative partitioning: `PARTITION BY RANGE (created_at)`, `LIST (region)`, or `HASH (tenant_id)`.
- Vertical. Split wide tables into narrow ones by column usage (hot columns vs blob columns).
Tenant-aware partitioning. Hash-partition on org_id so tenant data co-locates on one partition. Aids partition pruning on every query that filters by org_id.
Interview followup. "Why partition if you have indexes?" Partition pruning skips whole partitions (smaller index to walk), it aids maintenance (dropping a partition = dropping a table, no VACUUM), and it supports tiered storage (old partitions → slow disk).
Distributed transactions

- 2PC (two-phase commit). Prepare phase: each participant writes a prepare record and votes yes/no. Commit phase: the coordinator broadcasts the decision. Blocks if the coordinator dies between prepare and commit; participants hold locks until they recover the decision.
- 3PC. Adds a pre-commit phase to reduce blocking. Rare in practice; network assumptions don't hold.
- Sagas. Sequence of local transactions; each step has a compensating transaction. Orchestration: a central coordinator drives the steps (easier to reason about, but the coordinator is a SPOF unless itself HA). Choreography: services publish/subscribe to events (scales, but the flow is scattered across services and hard to debug).
- Outbox pattern. Write domain change and event row to the same DB transaction. A poller or CDC process reads the outbox and publishes to Kafka. Guarantees at-least-once publish atomically with state change.
- TCC (Try-Confirm-Cancel). Participants expose try/confirm/cancel APIs. Try reserves resources, confirm commits, cancel undoes. More explicit than saga; business logic must support reservation.
Interview followup. "2PC vs Saga?" 2PC needs all participants online, supports same-transaction semantics, blocks under coordinator failure. Sagas give up atomicity for availability; you write compensations and accept intermediate visibility.
Salesforce angle. Cross-org or cross-service flows use saga-style orchestration with an idempotency key; Platform Events + Outbox publishes changes to subscribers.
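A toy version of the outbox mechanics, with a synchronized block standing in for the DB transaction and a drain method standing in for the poller/CDC process (all names and the event format are illustrative):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy outbox: the state change and the event row "commit" together under
// one lock (standing in for one DB transaction); a separate drain step
// publishes pending events at-least-once.
class OutboxStore {
    private final Map<String, String> state = new HashMap<>();    // domain table
    private final Deque<String> outbox = new ArrayDeque<>();      // outbox table

    synchronized void update(String key, String value) {
        state.put(key, value);                       // domain write
        outbox.add("changed:" + key + "=" + value);  // event row, same "tx"
    }

    /** The poller: drain pending events to the broker (at-least-once). */
    synchronized List<String> drain() {
        List<String> batch = new ArrayList<>(outbox);
        outbox.clear();
        return batch;
    }

    synchronized String get(String key) { return state.get(key); }
}
```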
CDC (Change Data Capture)
Stream row-level changes out of a DB.
- Debezium. Reads the WAL/binlog via logical replication slots; publishes to Kafka.
- Maxwell's Daemon. MySQL-only; simpler.
- Postgres logical replication. Built-in, subscribers can be other Postgres instances or third-party sinks.
Why CDC. Keep caches, search indexes (Elasticsearch), analytics DBs, and downstream services in sync without dual-writes. Replaces trigger-based event publishing.
Interview followup. "How do you handle schema changes under CDC?" Use a schema registry (Avro + Confluent), make additive changes, handle the consumer side with schema evolution.
Section 2 - Caching
Caching is where you buy performance with complexity. Every cache pattern introduces a consistency window; the question is whether you can tolerate it.
Cache patterns

- Cache-aside (lazy loading). App reads cache → miss → app reads DB → backfill cache. Writes invalidate or update the cache. Most common. The cache only holds what's been requested.
- Read-through. Cache client transparently loads from DB on miss. App sees a single API. Requires a cache layer that can reach DB (e.g., library or proxy).
- Write-through. Write goes through cache synchronously to DB. Cache is always consistent with DB. Slightly higher write latency.
- Write-behind (write-back). Write to cache, async flush to DB. Fast writes, risk of data loss on cache node death.
- Refresh-ahead. Proactively refresh hot keys before TTL expiry. Avoids miss spikes; wasted work on keys that aren't actually requested again.
When NOT to cache. Strongly consistent financial reads, write-heavy workloads (cache churn > benefit), data that's cheap to compute (hit the DB).
Interview followup. "Cache-aside vs write-through for user profile?" Cache-aside is simpler and fits the read-heavy pattern. Write-through if profile updates must be immediately visible across services.
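Cache-aside in miniature: the maps stand in for Redis and the DB, and a real implementation would add TTLs and error handling. Note the write path invalidates rather than updates the cached copy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Cache-aside: read cache first, on miss read the source and backfill;
// writes go to the source and invalidate the cached copy.
class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Map<K, V> db;   // stand-in for the system of record

    CacheAside(Map<K, V> db) { this.db = db; }

    V read(K key) {
        V hit = cache.get(key);
        if (hit != null) return hit;
        V loaded = db.get(key);                      // miss path: hit the DB
        if (loaded != null) cache.put(key, loaded);  // backfill
        return loaded;
    }

    void write(K key, V value) {
        db.put(key, value);   // write DB first...
        cache.remove(key);    // ...then invalidate (not update) the cache
    }
}
```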
Eviction policies

- LRU (Least Recently Used). Evict the item untouched the longest. Redis approximates it via sampling (`maxmemory-policy allkeys-lru`).
- LFU (Least Frequently Used). Evict the least accessed overall. Better for Zipfian traffic where some keys are permanently hot. Redis `allkeys-lfu`.
- FIFO. Evict oldest inserted. Simple, rarely ideal.
- TTL. Time-based; every key has a deadline. Combine with LRU.
- Random. Evict a random key. Surprisingly okay when you have many equal-weight keys and want cheap eviction.
Interview followup. "You see cache hit rate drop after a deploy. What do you check?" Key format change (prefix bumped), TTL too short, memory pressure triggering eviction, cold cache right after deploy (warm it).
Consistency
Cache-aside with TTL trades freshness for simplicity. If you need strong consistency:
- Write-through (cache+DB synchronously).
- Invalidate-on-write (delete the cache key on write; but what if the cache delete succeeds and the DB rolls back? Order matters: write the DB first, then invalidate. If invalidation fails, a short TTL bounds the damage).
- Double-delete pattern: delete before write, write DB, delete again after short delay (defeats racing readers backfilling stale data).
Event-driven invalidation. CDC stream publishes to a topic; cache consumers invalidate. Decouples producers from caches.
Interview followup. "Two readers miss simultaneously; both read the DB and write the cache. Which wins?" The last writer, but both values are the same so it's fine. The real risk is a stale reader (from before a write) racing a fresh reader to backfill.
Thundering herd / cache stampede

Hot key expires → 10k requests miss simultaneously → 10k DB reads. Mitigations:
- Jittered TTL. `base + rand(0, jitter)`. Spreads expiration.
- Request coalescing / single-flight. The first miss triggers the load; other requests wait on the same promise. Go's `singleflight`, Java `CompletableFuture` reuse, Guava `LoadingCache`.
- Probabilistic early expiration (XFetch). Each reader has a small chance to refresh before TTL; the probability rises as expiry approaches. Avoids any cliff.
- Stale-while-revalidate. Return the stale value, refresh async. Good for read-heavy, eventually consistent data.
- Bloom filter gate. Avoids cache penetration (misses for keys that don't exist in the DB either).
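The jittered-TTL mitigation is a one-liner worth having ready. A sketch, assuming TTLs measured in seconds:

```java
import java.util.concurrent.ThreadLocalRandom;

// Spread expirations so a cohort of keys written together doesn't expire
// together: ttl = base + rand(0, jitter), inclusive.
final class Ttl {
    private Ttl() {}

    static long jittered(long baseSeconds, long jitterSeconds) {
        return baseSeconds + ThreadLocalRandom.current().nextLong(jitterSeconds + 1);
    }
}
```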
Cache invalidation strategies

"There are only two hard things in computer science: cache invalidation and naming things." (Phil Karlton)
- TTL. Set-and-forget; bounded staleness.
- Explicit invalidation on write. App code deletes keys after state change.
- Tag-based invalidation. Associate keys with tags; invalidate by tag. Varnish, Rails cache tags.
- Event-driven (CDC or pub-sub). Most robust for multi-service systems; decouples writers and readers.
Distributed caches
- Redis Cluster. Sharded (16384 hash slots), single-threaded per node (fast, but a long command blocks the slot), supports data structures (lists, sorted sets, streams, hashes), Lua scripting for atomic multi-op, persistence (RDB, AOF), pub/sub.
- Memcached. Simpler, multi-threaded, strings only, no persistence, no replication built-in. Good for plain K/V cache at very high throughput.
When Redis. You need data structures, atomic ops, pub/sub, or persistence. When Memcached. Pure ephemeral K/V, extreme throughput, minimal ops overhead.
Tenant-aware caching

- Per-tenant key namespace. `cache:v1:{tenantId}:user:{userId}`. Never omit the tenantId; easy to grep in incidents.
- Per-tenant quota. Track per-tenant memory (e.g., sample keys with `MEMORY USAGE`), or run dedicated Redis instances per shard. Avoids one tenant monopolizing cache memory.
- Eviction fairness. LRU across tenants is unfair to quiet tenants with occasional important reads. Consider separate cache databases per tier (premium vs standard).
- Cache key versioning. A `v1` prefix lets you deploy new serialization without invalidation storms: roll to `v2` in code and the old keys expire naturally.
Interview followup. "A large tenant fills 90% of the cache. How do you protect other tenants?" Namespace + quota tracking, separate Redis DB per shard/tier, or evict by tenant memory fairness. Salesforce style: per-pod caches so tenants can't cross pod boundaries.
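A small key-builder makes the namespace convention hard to skip at call sites. The class name is illustrative; the format follows the `cache:v1:{tenantId}:...` scheme above.

```java
import java.util.Objects;

// Centralizes the cache key format so no call site can omit the tenant.
final class CacheKeys {
    private static final String VERSION = "v1"; // bump to roll serialization

    private CacheKeys() {}

    static String user(String tenantId, String userId) {
        Objects.requireNonNull(tenantId, "tenantId is mandatory");
        return String.join(":", "cache", VERSION, tenantId, "user", userId);
    }
}
```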
Section 3 - Concurrency
Concurrency is where SMTS interviews actually stress-test you. Expect live code, expect follow-ups on the JMM, expect deadlock scenarios.
Primitives

- Mutex. Exclusive lock, one holder. `synchronized` block, `ReentrantLock`.
- Semaphore. N permits. Bounds concurrency (e.g., max 10 DB connections). `java.util.concurrent.Semaphore`.
- Read-write lock. Many readers OR one writer. Best when reads dominate. `ReentrantReadWriteLock`.
- Condition variable. Wait for a predicate while holding a lock. `Object.wait/notify`, `Condition.await/signal`.
- Monitor. Lock + condition bundled per object. Java's intrinsic lock on every Object.
- Barrier / Latch. `CountDownLatch` (one-shot), `CyclicBarrier` (reusable, all threads meet before proceeding), `Phaser`.
Java-specific toolbox

- `synchronized`: intrinsic, reentrant, always releases on exception.
- `ReentrantLock`: explicit, supports `tryLock(timeout)`, fair mode, interruptible wait.
- `ReentrantReadWriteLock`: read-mostly shared state; write lock exclusive, read lock shared.
- `StampedLock`: optimistic reads without blocking writers; validate the stamp before use.
- `ConcurrentHashMap`: striped locks historically, now CAS-based buckets. Use `computeIfAbsent` for atomic memoization.
- `BlockingQueue`: producer-consumer handoff; `ArrayBlockingQueue`, `LinkedBlockingQueue`, `SynchronousQueue` (zero-capacity handoff).
- `CompletableFuture`: async composition, `thenApply`, `thenCompose`, `allOf`, `anyOf`, custom executor.
- `ForkJoinPool`: work-stealing, for recursive divide-and-conquer. Backs parallel streams.
- `ThreadLocal`: per-thread slot; beware memory leaks in thread pools (always `remove()` in finally).
- `AtomicInteger`, `AtomicReference`, `LongAdder`: CAS-based lock-free counters. `LongAdder` scales better under contention than `AtomicLong`.
- `VarHandle` (Java 9+): low-level CAS, fences.
- Virtual threads (Java 21): lightweight threads scheduled on carrier threads; blocking I/O no longer costs a platform thread. Perfect for multi-tenant request-per-thread models.
Race conditions

- TOCTOU (Time-Of-Check-To-Time-Of-Use). Check a condition, then act on it; another thread changes state in between. The classic `if (!map.containsKey(k)) map.put(k, v)`: use `putIfAbsent` or `computeIfAbsent`.
- Check-then-act. Subset of TOCTOU.
- Compound actions. Increment, read-modify-write. Must be atomic or locked.
Deadlock
Four necessary conditions (Coffman):
- Mutual exclusion.
- Hold and wait.
- No preemption.
- Circular wait.
Break any one to prevent deadlock:
- Lock ordering. Always acquire locks in a global canonical order (e.g., by object hashcode or ID).
- Timeouts. `tryLock(timeout)`: if you can't get the lock in time, back off and retry.
- Detection. Wait-for graph + periodic cycle detection; abort one tx. DBs do this natively.
- Prevention via single big lock. Simplest, worst throughput.
Interview followup. "Two services, each calls the other and they deadlock." Same principles: canonical call ordering or timeouts and compensating actions.
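Lock ordering in code: acquire by a canonical key (account IDs here, purely illustrative) so that `transfer(a, b)` and `transfer(b, a)` acquire the same locks in the same order and can never deadlock.

```java
import java.util.concurrent.locks.ReentrantLock;

class BankAccount {
    final long id;
    final ReentrantLock lock = new ReentrantLock();
    long balance;

    BankAccount(long id, long balance) { this.id = id; this.balance = balance; }
}

class Transfers {
    // Global lock order: always lock the account with the smaller ID first,
    // regardless of transfer direction, breaking the circular-wait condition.
    static void transfer(BankAccount from, BankAccount to, long amount) {
        BankAccount first = from.id < to.id ? from : to;
        BankAccount second = first == from ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally { second.lock.unlock(); }
        } finally { first.lock.unlock(); }
    }
}
```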
Livelock and starvation
- Livelock. Threads actively change state but no one makes progress (two polite people dodging in a hallway). Fix with random backoff or priority.
- Starvation. A thread never gets the resource (unfair lock, high-priority threads monopolizing). Fix with fair locks or priority inheritance.
Actor model
State is encapsulated in an actor; communication only via asynchronous messages. Actors process messages one at a time, so no shared mutable state. Great for multi-tenant: one actor per tenant/session. Frameworks: Akka (Scala/Java), Erlang/OTP, Microsoft Orleans.
When. Naturally concurrent, state-ful entities (one connection, one device, one org). When not. Transactional cross-entity updates (you'll end up doing distributed coordination, same hard problems).
Java Memory Model

- Happens-before. A partial ordering on actions. Writes before a `volatile` write are visible after a `volatile` read of the same variable. A monitor release happens-before the next acquire of that monitor. A thread's start happens-before the first action of that thread.
- `volatile`. Visibility (no caching in registers) + prevents reordering across the volatile access. Not atomicity for compound actions.
- `synchronized`. Mutual exclusion + a full happens-before edge on entry/exit.
- `final` fields. Safely published after the constructor returns (with a small caveat: don't leak `this` from the constructor).

Interview followup. "Is `volatile int counter; counter++` safe?" No: `counter++` is read-modify-write, not atomic. Use `AtomicInteger`.
Structured concurrency (Java 21 preview)
StructuredTaskScope bounds the lifetime of spawned virtual threads to the scope. Errors propagate; cancellation fans out. Replaces ad-hoc Future juggling. Cleaner for fan-out reads (fetch user + org + permissions in parallel with a deadline).
Key Java concurrency idiom - counter

```java
// Atomic counter (lock-free, scales under contention)
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

class Counter {
    private final AtomicInteger count = new AtomicInteger();
    public int increment() {
        return count.incrementAndGet(); // CAS under the hood
    }
}

// LongAdder scales better than AtomicLong under heavy contention
class HighContentionCounter {
    private final LongAdder count = new LongAdder();
    public void increment() { count.increment(); }
    public long value() { return count.sum(); }
}

// synchronized alternative (coarse lock, simpler, slower under contention)
class SyncCounter {
    private int count = 0;
    public synchronized int increment() { return ++count; }
}
```

```cpp
// std::atomic is the C++ equivalent; memory order matters
#include <atomic>
#include <mutex>

class Counter {
    std::atomic<int> count{0};
public:
    int increment() {
        return count.fetch_add(1, std::memory_order_relaxed) + 1;
    }
};

// Mutex alternative
class MutexCounter {
    int count = 0;
    std::mutex m;
public:
    int increment() {
        std::lock_guard<std::mutex> lk(m);
        return ++count;
    }
};
```

```typescript
// Node is single-threaded for JS; no data races on primitives.
// Concurrency problems show up as logical races across async boundaries.
// Here's a "limit N concurrent" pool - analog of Java's Semaphore.
async function pMap<T, R>(
  items: T[],
  fn: (item: T) => Promise<R>,
  concurrency = 10,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let cursor = 0;
  async function worker() {
    while (true) {
      const i = cursor++;
      if (i >= items.length) return;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}

// For true shared-memory concurrency in Node: SharedArrayBuffer + Atomics
const sab = new SharedArrayBuffer(4);
const view = new Int32Array(sab);
Atomics.add(view, 0, 1); // atomic increment across worker threads
```

Read-write scenarios
```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class Cache<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    public V get(K key) {
        lock.readLock().lock();
        try { return map.get(key); }
        finally { lock.readLock().unlock(); }
    }
    public void put(K key, V value) {
        lock.writeLock().lock();
        try { map.put(key, value); }
        finally { lock.writeLock().unlock(); }
    }
}
// For read-heavy, prefer ConcurrentHashMap. RWLock is only a win if writes
// are rare and reads are long.
```

```cpp
#include <shared_mutex>
#include <unordered_map>

template <typename K, typename V>
class Cache {
    std::unordered_map<K, V> map;
    mutable std::shared_mutex m;
public:
    V get(const K& k) const {
        std::shared_lock lk(m); // reader lock
        return map.at(k);
    }
    void put(const K& k, V v) {
        std::unique_lock lk(m); // writer lock
        map[k] = std::move(v);
    }
};
```

```typescript
// Node: no data races on a JS Map in a single event loop.
// Logical races across async boundaries still require coordination.
class AsyncCache<K, V> {
  private map = new Map<K, V>();
  private inflight = new Map<K, Promise<V>>();
  async get(key: K, loader: (k: K) => Promise<V>): Promise<V> {
    const cached = this.map.get(key);
    if (cached !== undefined) return cached;
    // single-flight: coalesce concurrent misses
    const existing = this.inflight.get(key);
    if (existing) return existing;
    const p = loader(key).then((v) => {
      this.map.set(key, v);
      this.inflight.delete(key);
      return v;
    });
    this.inflight.set(key, p);
    return p;
  }
}
```

Section 4 - Messaging and queues
SMTS interviews love async: "how do you make this not block the request?" Know Kafka and RabbitMQ well enough to defend your choice.
Kafka
- Topic. Logical stream of events.
- Partition. Ordered, append-only log; the unit of parallelism and ordering.
- Offset. Position in partition. Consumers track their own offset.
- Consumer group. Set of consumers collaborating on a topic; each partition is owned by exactly one consumer in the group at a time.
- Retention. Time-based (e.g., 7 days) or size-based. Messages stay on disk, consumers replay by seeking.
- Log compaction. Keeps latest value per key. Useful as a database changelog.
- Exactly-once semantics (EOS). Transactional producer (writes to multiple partitions are atomic) + idempotent producer (dedupes on producer ID + sequence) +
isolation.level=read_committedon consumer + consumer commits offsets transactionally. EOS only within Kafka; side effects outside Kafka still need idempotency. - Rebalancing. When a consumer joins/leaves, partitions reshuffle. Cooperative sticky assignor minimizes churn.
Ordering. Per partition only. If you need per-entity order, hash by entity key to a partition.
Pitfalls. Too many partitions ā metadata overhead, long rebalances. Too few ā limits consumer parallelism. Partition count is hard to change (new partitions change hashing ā order breaks).
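To make "hash by entity key" concrete, here is a toy partitioner in Java. It is not Kafka's real default (which hashes the serialized key bytes with murmur2); `KeyPartitioner.partitionFor` is a hypothetical stand-in that shows why the same key always lands on the same partition, and why changing the partition count remaps keys and breaks existing ordering:

```java
public class KeyPartitioner {
    // Toy stand-in for Kafka's default partitioner; the real one
    // hashes the serialized key bytes with murmur2.
    static int partitionFor(String entityKey, int numPartitions) {
        // Mask the sign bit so the modulo result is non-negative.
        return (entityKey.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 12;
        // Same customer key -> same partition -> per-customer ordering holds.
        System.out.println(partitionFor("customer-42", partitions)
                == partitionFor("customer-42", partitions)); // prints true
        // Adding partitions changes numPartitions, which remaps keys:
        // that's why growing the partition count breaks per-key order.
    }
}
```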
RabbitMQ
- Exchange types.
- Direct. Routing key equality.
- Topic. Wildcard routing (`orders.*.us`).
- Fanout. Broadcast to all bound queues.
- Headers. Match on headers instead of routing key.
- Queue. Ordered FIFO buffer.
- Binding. Rule connecting exchange to queue.
- Acks. Consumer acks after processing; broker redelivers on channel close without ack.
- Prefetch (QoS). Limit unacked messages per consumer to prevent one slow consumer hoarding.
Kafka vs RabbitMQ
| Aspect | Kafka | RabbitMQ |
|---|---|---|
| Model | Distributed log | Smart broker, dumb consumer |
| Throughput | Very high (100k+ msg/s/broker) | High (tens of k msg/s) |
| Ordering | Per partition | Per queue |
| Retention | Days/weeks; replay | Until ack |
| Consumer model | Pull, offset-based | Push (or pull), ack-based |
| Routing | Client-side (by key) | Server-side (exchanges) |
| Best for | Event streaming, log pipelines, CDC, analytics | Task queues, RPC patterns, complex routing |
Interview followup. "Why pick Kafka for an order event stream?" Replay for new consumers, retention, high throughput, partition ordering per customer.
Delivery semantics
- At-most-once. Fire and forget; may lose.
- At-least-once. Retry until ack; may duplicate. Default assumption; design consumers to be idempotent.
- Exactly-once. Hard in distributed systems. Kafka EOS within Kafka; for external side effects, achieve "effectively-once" via idempotency keys.
Idempotency
- Idempotency key. Client-generated UUID in header. Server stores processed keys with result; retry returns stored result.
- Unique constraint. DB unique index on business key; second insert errors cleanly.
- Dedup table. `(key, expires_at)`; background cleanup. Sized for your retry window.
- Natural idempotency. `UPDATE ... SET state = 'shipped' WHERE id = ? AND state = 'paid'`; re-running is a no-op.
Interview followup. "How long do you keep idempotency keys?" Longer than the max retry window (e.g., 7 days). Size the table; partition by day; drop old partitions.
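A minimal in-memory sketch of the idempotency-key option: the server stores processed keys with their result, and a retry returns the stored result instead of re-running the operation. `IdempotentHandler` is a hypothetical name; a real store would be a DB or Redis with an `expires_at` and eviction sized to the retry window:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class IdempotentHandler {
    // key -> stored result; a real store would also keep expires_at
    // and evict keys older than the retry window.
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    String handle(String idempotencyKey, Supplier<String> operation) {
        // First call runs the operation; retries return the stored result.
        return processed.computeIfAbsent(idempotencyKey, k -> operation.get());
    }

    public static void main(String[] args) {
        IdempotentHandler h = new IdempotentHandler();
        int[] charges = {0};
        Supplier<String> chargeCard = () -> { charges[0]++; return "receipt-1"; };
        h.handle("req-uuid-1", chargeCard);
        h.handle("req-uuid-1", chargeCard); // retry: no second charge
        System.out.println(charges[0]);     // prints 1
    }
}
```

`computeIfAbsent` gives atomic check-and-run per key on a single node; across nodes you need the DB unique constraint or a shared store.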
DLQ and retry
- DLQ (dead letter queue). Messages that fail after N retries go here for manual inspection.
- Exponential backoff with jitter. `delay = min(cap, base * 2^attempt) + rand(0, base)`. Jitter avoids retry herds.
- Retry budgets. Cap retries per unit time to avoid amplifying outages.
- Poison message. Same message fails repeatedly; route to DLQ early.
- Replay tooling. DLQ consumer with manual dispatch; do not auto-replay without operator approval.
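The backoff-with-jitter formula above, sketched in Java. `delayMillis` is a hypothetical helper (`base` and `cap` in milliseconds):

```java
import java.util.concurrent.ThreadLocalRandom;

public class Backoff {
    // delay = min(cap, base * 2^attempt) + rand(0, base), per the formula above.
    static long delayMillis(int attempt, long baseMs, long capMs) {
        long exp = baseMs * (1L << Math.min(attempt, 30)); // clamp shift to avoid overflow
        long delay = Math.min(capMs, exp);
        return delay + ThreadLocalRandom.current().nextLong(baseMs);
    }

    public static void main(String[] args) {
        // base 100ms, cap 10s: 100, 200, 400, 800, ... then flat at the cap,
        // each with up to 100ms of jitter.
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.printf("attempt %d -> %d ms%n",
                    attempt, delayMillis(attempt, 100, 10_000));
        }
    }
}
```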
Order guarantees
- Kafka: per-partition only. Hash by entity key.
- RabbitMQ: per-queue only, and only with a single consumer (or the `single active consumer` flag). Multiple consumers of one queue parallelize but lose order.
Salesforce angle. Platform Events are Kafka-backed; per-org events get ordered within their partition assignment. Bulk API uses batch jobs (BlockingQueue analog) to serialize work per org.
Section 5: Networking
Network fundamentals come up in R3/R4 when the interviewer probes "what happens between services."
HTTP methods
- GET. Safe, idempotent, cacheable. Never mutates.
- POST. Create or process. Not idempotent by default.
- PUT. Replace the resource at the URL. Idempotent.
- PATCH. Partial update. Not necessarily idempotent (depends on the body).
- DELETE. Remove. Idempotent.
- HEAD, OPTIONS. Metadata and CORS preflight.
Status codes
- 2xx success: 200 OK, 201 Created (Location header), 202 Accepted, 204 No Content.
- 3xx redirection: 301 permanent, 302/307 temporary, 304 Not Modified.
- 4xx client: 400 Bad Request, 401 Unauthorized (missing auth), 403 Forbidden (auth but denied), 404, 409 Conflict, 422 Unprocessable, 429 Too Many Requests.
- 5xx server: 500, 502 Bad Gateway, 503 Service Unavailable (with Retry-After), 504 Gateway Timeout.
HTTP/1.1 vs HTTP/2 vs HTTP/3
- HTTP/1.1. Text-based, one request at a time per TCP connection (pipelining is broken in practice), head-of-line blocking. Keep-alive reuses connections.
- HTTP/2. Binary framing, multiplexed streams over one TCP (many parallel requests, no HOL at HTTP layer). HPACK header compression. Server push (mostly deprecated).
- HTTP/3. Over QUIC (UDP-based). Solves TCP HOL (packet loss doesn't block other streams). Faster handshake (0-RTT with session resumption). Good over lossy networks (mobile).
REST vs GraphQL vs gRPC
- REST. Resource-oriented, HTTP verbs, JSON. Simple, cacheable, ubiquitous. Over-fetching/under-fetching on nested resources.
- GraphQL. Client selects exactly the fields it needs. One endpoint, complex caching, n+1 risk on the server (use dataloader). Schema-first.
- gRPC. HTTP/2 + protobuf, typed contracts, unary + server/client/bidi streaming, code-gen. Best for service-to-service where you control both ends. Not browser-native without grpc-web.
When each.
- REST for public APIs and simple CRUD.
- GraphQL for aggregating many sources under a single flexible client (BFF).
- gRPC for internal microservices, low latency, streaming.
WebSocket
- Handshake. HTTP/1.1 request with `Upgrade: websocket` + `Sec-WebSocket-Key`. Server responds 101 Switching Protocols; the connection becomes full-duplex.
- Ping/pong. Keep-alive and liveness; close if no pong within the timeout.
- Backpressure. TCP provides backpressure, but application must handle queue buildup (close or drop).
- Scaling. Sticky routing or a shared pub/sub (Redis) for fanout across instances.
gRPC
- Protobuf. Binary schema, forward/backward compatible if you follow rules (never change field numbers; make new fields optional).
- Unary. Request/response.
- Server streaming. One request, stream of responses.
- Client streaming. Stream of requests, one response.
- Bidi streaming. Both sides stream.
- Interceptors. Cross-cutting concerns (auth, tracing, retries). The analog of HTTP middleware.
- Deadlines. Every call must carry a deadline; propagate through fan-outs.
DNS
- Resolution. Recursive resolver → root → TLD → authoritative NS → answer.
- TTL. Caching duration. Short TTL (30-60s) enables fast failover; long TTL (hours) reduces load.
- Geo-DNS. Returns different IPs per client geography.
- Round-robin DNS. Multiple A records, client picks. Crude load balancing; no health checks. Prefer a real LB.
TLS
- Handshake (TLS 1.3). ClientHello (cipher suites + key share) → ServerHello + cert + Finished → client Finished. One RTT. 0-RTT resumption with a prior session.
- Cert validation. Chain to trusted root; check name, expiry, revocation (OCSP or CRL). SNI lets one IP host many certs.
- mTLS. Both sides present certs. Used for service-to-service trust inside a mesh; replaces or complements token auth.
Load balancers
- L4 (TCP/UDP). Routes by IP+port. Fast, protocol-agnostic. No HTTP awareness, no per-request routing.
- L7 (HTTP). Routes by path, host, header, cookie. Can do TLS termination, header rewrites, WAF. Modern LBs (Envoy, NGINX, HAProxy, ALB) are L7.
Algorithms.
- Round-robin. Simple; ignores load.
- Least connections. Sends to the LB-tracked least-busy backend. Good for long connections.
- Consistent hashing. Same key ā same backend. Cache affinity, sticky sessions without cookies.
- Power of two choices. Pick 2 random backends, send to less-loaded. Nearly optimal, cheap.
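A sketch of power of two choices, assuming the balancer tracks active connections per backend. The names here are illustrative, not from any LB library; the point is that comparing just two random backends keeps load nearly as balanced as scanning all of them:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicIntegerArray;

public class PowerOfTwoChoices {
    final AtomicIntegerArray activeConns;

    PowerOfTwoChoices(int backends) { activeConns = new AtomicIntegerArray(backends); }

    // Pick two random backends, route to the one with fewer active connections.
    int pick() {
        ThreadLocalRandom r = ThreadLocalRandom.current();
        int a = r.nextInt(activeConns.length());
        int b = r.nextInt(activeConns.length());
        int chosen = activeConns.get(a) <= activeConns.get(b) ? a : b;
        activeConns.incrementAndGet(chosen);
        return chosen;
    }

    void release(int backend) { activeConns.decrementAndGet(backend); }

    public static void main(String[] args) {
        PowerOfTwoChoices lb = new PowerOfTwoChoices(10);
        for (int i = 0; i < 10_000; i++) lb.pick(); // long-lived conns, never released
        int max = 0, min = Integer.MAX_VALUE;
        for (int i = 0; i < 10; i++) {
            max = Math.max(max, lb.activeConns.get(i));
            min = Math.min(min, lb.activeConns.get(i));
        }
        // Spread stays tiny; plain random assignment drifts much further apart.
        System.out.println("spread " + min + ".." + max);
    }
}
```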
Section 6: Distributed systems primitives
At SMTS, the interviewer expects you to name patterns and describe tradeoffs in the same sentence. Practice that pairing.
CAP theorem
Under a network partition, pick Consistency or Availability; you can't have both. Without a partition you can have both; CAP only bites during failure.
- CP. Rejects requests that would violate consistency (e.g., primary unreachable → reads/writes fail). Examples: Spanner, ZooKeeper, HBase.
- AP. Answers, possibly stale. Examples: Cassandra (tunable), Dynamo, DNS.
Salesforce angle. Writes to a Salesforce org are CP (single primary per pod). Read replicas for reports and search can be AP; you'll see slightly stale data briefly after a write.
PACELC
Extends CAP: if Partition, choose A or C; Else, choose Latency or Consistency. Spanner is CP/EC (strong even during normal ops, paying latency). Dynamo is AP/EL (cheap latency, eventual normally). PACELC forces you to reason about the normal case, not just the partition.
Consistency models
Strongest to weakest:
- Strict/Linearizable. Operations appear to happen in real-time order. Single system image.
- Sequential. All nodes see the same order, not necessarily real-time.
- Causal. Preserves happens-before relationships; concurrent writes may reorder.
- Read-your-writes. A client sees its own writes.
- Monotonic-read. Successive reads never go backwards.
- Eventual. Converges if no new writes.
Interview followup. "Session consistency?" Combines read-your-writes + monotonic reads within a session. Achievable with a session token (LSN, GTID) forwarded to the DB.
Consensus
- Paxos. Classic; hard to implement. Rarely used directly; Multi-Paxos for log replication.
- Raft. Understandable alternative. Leader election via randomized timeouts and votes; log replication with majority quorum; commit when majority persists.
- Leader election. Node becomes candidate after election timeout, increments term, requests votes. Wins on majority.
- Log replication. Leader appends entries, replicates to followers, commits when majority ack. Followers overwrite on conflict.
- Safety. Only leaders with up-to-date logs can win elections.
- Used in etcd, Consul, CockroachDB, TiKV.
Logical time
- Lamport timestamp. Counter per node; increment on each local event and send, and on receive set `local = max(local, msg_ts) + 1`. Total order by `(ts, node_id)`. Cannot detect concurrency.
- Vector clock. `[c1, c2, ..., cN]`, one counter per node. Can determine whether A happens-before B, B happens-before A, or they are concurrent.
- Hybrid Logical Clock (HLC). Physical time + logical counter. Close to real time but preserves causality. Used by CockroachDB, YugabyteDB.
When. Distributed DBs and event systems needing causal order without a central sequencer.
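A minimal Lamport clock sketch in Java following the tick/receive rules above (class and method names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

public class LamportClock {
    private final AtomicLong time = new AtomicLong();

    // Local event or send: tick and attach the timestamp to the message.
    long tick() { return time.incrementAndGet(); }

    // Receive: jump past the sender's timestamp, then tick.
    long receive(long msgTs) {
        return time.updateAndGet(local -> Math.max(local, msgTs) + 1);
    }

    public static void main(String[] args) {
        LamportClock a = new LamportClock();
        LamportClock b = new LamportClock();
        long sendTs = a.tick();          // A at 1; message stamped 1
        long recvTs = b.receive(sendTs); // B jumps to max(0, 1) + 1 = 2
        System.out.println(sendTs + " " + recvTs); // prints "1 2"
        // recvTs > sendTs, so the receive is ordered after the send,
        // even though B's physical clock was never consulted.
    }
}
```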
Leader election
- ZooKeeper. Ephemeral sequential znodes; smallest sequence becomes leader. Watchers on predecessor for failover.
- etcd. Lease + atomic compare-and-swap on a key.
- Redis Sentinel. Quorum-based election among sentinels; failover the Redis primary.
- Raft-based. Built-in, used in etcd, Consul.
Distributed locks
- Redis Redlock. Lock across N Redis nodes; require a majority. Martin Kleppmann criticized its timing assumptions; safer with fencing tokens (a monotonically increasing token passed to the resource, which rejects stale tokens).
- ZooKeeper-based. Ephemeral sequential znode; smallest holds the lock. Automatic release on session expiry. Stronger than Redlock.
- DB-based. `SELECT ... FOR UPDATE` on a sentinel row. Easy; bounded by DB contention.
Interview followup. "Why fencing tokens?" A lock holder can be paused (GC, stop-the-world), lock expires, another acquires, original wakes and writes. Fencing token lets the downstream reject the stale write.
Distributed rate limiting
- Redis + Lua. Atomic multi-key ops. Token bucket: store tokens and last refill; Lua script refills and decrements atomically.
- Token bucket. Smooth rate, allows bursts up to bucket size.
- Sliding window log. Store request timestamps; count last N seconds. Precise, memory heavy.
- Sliding window counter. Two counters (current/previous window); interpolate. Good balance.
- Leaky bucket. Constant drain rate; queued requests. Smooths bursts.
Interview followup. "Per-user or global?" Usually both: per-tenant quota + global safety. Salesforce governor limits are per-tenant and per-transaction.
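A single-node token bucket sketch with time injected for determinism. In the Redis + Lua variant described above, this same refill-and-decrement logic lives inside the Lua script so it executes atomically; the class here is a hypothetical illustration:

```java
public class TokenBucket {
    private final double capacity, refillPerSec;
    private double tokens;
    private long lastNanos;

    TokenBucket(double capacity, double refillPerSec, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity; // start full: allows an initial burst
        this.lastNanos = nowNanos;
    }

    // Refill based on elapsed time, then try to take one token.
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSec = (nowNanos - lastNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
        lastNanos = nowNanos;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }

    public static void main(String[] args) {
        // 5-token burst, 1 token/sec refill; time passed in, not read from the clock.
        TokenBucket tb = new TokenBucket(5, 1, 0);
        int allowed = 0;
        for (int i = 0; i < 10; i++) if (tb.tryAcquire(0)) allowed++;
        System.out.println(allowed);                       // prints 5 (burst only)
        System.out.println(tb.tryAcquire(2_000_000_000L)); // prints true (2s of refill)
    }
}
```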
Circuit breaker
States: Closed (normal), Open (fail fast after failure threshold), Half-open (after cooldown, allow a trial request).
Parameters: failure rate threshold, minimum requests, open duration, half-open trial count. Libraries: Resilience4j (Java), Hystrix (legacy), Polly (.NET).
When. Calling a dependency that may go down; protect yourself from slow failures blowing your thread pool.
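A minimal sketch of the three-state machine. It uses a consecutive-failure threshold for simplicity; production libraries like Resilience4j use sliding-window failure rates with a minimum request count. Time is injected so the transitions are testable:

```java
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAtMs;
    private final int failureThreshold;
    private final long openDurationMs;

    CircuitBreaker(int failureThreshold, long openDurationMs) {
        this.failureThreshold = failureThreshold;
        this.openDurationMs = openDurationMs;
    }

    // Gate a call before making it.
    synchronized boolean allowRequest(long nowMs) {
        if (state == State.OPEN) {
            if (nowMs - openedAtMs >= openDurationMs) {
                state = State.HALF_OPEN; // cooldown elapsed: allow one trial
                return true;
            }
            return false; // fail fast, protect the thread pool
        }
        return true;
    }

    synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void onFailure(long nowMs) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // trip (or re-trip after a failed trial)
            openedAtMs = nowMs;
            consecutiveFailures = 0;
        }
    }

    public static void main(String[] args) {
        CircuitBreaker cb = new CircuitBreaker(3, 1000);
        for (int i = 0; i < 3; i++) cb.onFailure(0); // trip at 3 failures
        System.out.println(cb.allowRequest(10));     // prints false (open)
        System.out.println(cb.allowRequest(1500));   // prints true (half-open trial)
    }
}
```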
Bulkhead pattern
Isolate resources so one failure doesn't drown the system. Per-tenant or per-dependency thread pools, connection pools, or queue quotas. If tenant A's provider is slow, tenant B's requests still flow.
Salesforce angle. Governor limits are effectively a bulkhead: CPU time, heap, and query counts are capped per transaction per tenant. One customer's bad loop can't starve another.
Retries
- Exponential backoff. `delay = base * 2^attempt`, capped.
- Jitter. Add randomness; full jitter (`rand(0, delay)`) is often best.
- Retry budget. Don't retry forever; cap retries as a fraction of normal traffic.
- Idempotency required. Every retry assumes an idempotent target; otherwise you double-charge.
Section 7: Observability
"If it's not measured, it's broken in production." Expect to be asked how you'd debug a latency spike.
Logs, metrics, traces
- Logs. Discrete events with context. Expensive per-event; index selectively. Structured (JSON) so you can query.
- Metrics. Aggregates over time; cheap and queryable. Counters (always increase), gauges (point-in-time), histograms (distribution), summaries (client-side percentiles).
- Traces. A single request's journey across services. Invaluable for latency attribution.
Use logs for cause, metrics for trend, traces for flow.
Structured logging
Emit JSON with stable fields: `ts`, `level`, `service`, `tenantId`, `traceId`, `userId`, `message`. Makes grep trivial and feeds log analytics.
Correlation ID. A request ID propagated through all downstream calls (in HTTP headers, gRPC metadata, Kafka headers). Stitches logs across services.
Distributed tracing
- W3C Trace Context. `traceparent: 00-{trace-id}-{span-id}-{flags}`. Standardized header; OpenTelemetry emits it by default.
- Spans. One per operation; parent/child links form the tree. Attributes, events, status.
- Sampling. Head-based (decide at entry) or tail-based (sample slow/error traces). Full sampling on errors, low sampling on the happy path.
Prometheus + Grafana
- Counter. Monotonic; rate with `rate()`.
- Gauge. Goes up and down; memory, queue depth.
- Histogram. Bucketed counts; compute percentiles with `histogram_quantile`.
- Summary. Client-side quantiles; cheaper to read but can't be aggregated across instances.
Golden signals (Google SRE). Latency, traffic, errors, saturation.
SLI / SLO / SLA
- SLI (Indicator). What you measure, e.g., "fraction of HTTP 2xx responses."
- SLO (Objective). Internal target: "99.9% of requests succeed over 30 days."
- SLA (Agreement). External contract with consequences: "99.9% or we refund."
Error budget. 1 - SLO. If SLO is 99.9%, you can be down 43.2 min/month. Spend the budget on velocity (risky deploys) when healthy; freeze when exhausted.
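The error-budget arithmetic, worked in Java (`budgetMinutes` is a hypothetical helper, not a library call):

```java
public class ErrorBudget {
    // Allowed downtime = (1 - SLO) * window.
    static double budgetMinutes(double slo, double windowDays) {
        return (1.0 - slo) * windowDays * 24 * 60;
    }

    public static void main(String[] args) {
        // 99.9% over 30 days: 0.001 * 43,200 min = 43.2 min, matching the text.
        System.out.println(budgetMinutes(0.999, 30));
        // Each extra nine divides the budget by 10: 99.99% leaves ~4.3 min/month,
        // so detection and rollback must be correspondingly faster.
        System.out.println(budgetMinutes(0.9999, 30));
    }
}
```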
Section 8: Security
SaaS is security-critical. Salesforce enforces tenant isolation at every layer; expect probing questions.
AuthN vs AuthZ
- Authentication. Who are you? Credentials → identity.
- Authorization. What can you do? Identity → permissions.
Separate layers; blur them at your peril.
JWT
Structure: `base64url(header).base64url(payload).base64url(signature)`. Claims: `iss`, `sub`, `aud`, `exp`, `iat`, `nbf`, `jti`, plus custom.
- HS256. Symmetric HMAC with shared secret. Simpler, fine for same-service.
- RS256 / ES256. Asymmetric. Public key verifies; private key signs. Required when clients should verify without holding signing key.
Revocation. JWT is stateless, which makes revocation hard. Options:
- Short-lived access tokens (5-15 min) + refresh tokens.
- `jti` denylist with TTL = token lifetime.
- Token introspection endpoint (stateful; negates the statelessness win).
Interview followup. "Access vs refresh tokens?" Access: short-lived, sent on every request. Refresh: long-lived, sent only to auth server to mint a new access token. Refresh rotation (one-time-use) catches theft.
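A minimal HS256 sign/verify sketch using only the JDK (`javax.crypto.Mac`). This is to show the mechanics; real services should use a vetted JWT library (jjwt, Nimbus) and a constant-time signature comparison:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class Hs256Jwt {
    static String b64url(byte[] b) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(b);
    }

    // token = b64url(header) . b64url(payload) . b64url(HMAC-SHA256(header.payload))
    static String sign(String payloadJson, byte[] secret) throws Exception {
        String header = b64url("{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = b64url(payloadJson.getBytes(StandardCharsets.UTF_8));
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        byte[] sig = mac.doFinal((header + "." + payload).getBytes(StandardCharsets.UTF_8));
        return header + "." + payload + "." + b64url(sig);
    }

    static boolean verify(String token, byte[] secret) throws Exception {
        int lastDot = token.lastIndexOf('.');
        String signingInput = token.substring(0, lastDot);
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        String expected = b64url(mac.doFinal(signingInput.getBytes(StandardCharsets.UTF_8)));
        // Production code must compare signatures in constant time.
        return expected.equals(token.substring(lastDot + 1));
    }

    public static void main(String[] args) throws Exception {
        byte[] secret = "shared-secret".getBytes(StandardCharsets.UTF_8);
        String token = sign("{\"sub\":\"user-1\",\"exp\":1700000000}", secret);
        System.out.println(verify(token, secret));              // prints true
        System.out.println(verify(token, "wrong".getBytes(StandardCharsets.UTF_8))); // prints false
    }
}
```

Note this only checks the signature; a real verifier must also validate `exp`, `aud`, and `iss` claims after decoding.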
OAuth 2.0 and OIDC
- OAuth 2.0. Delegated authorization.
- OIDC. Identity layer on top; adds the `id_token` (a JWT with user info).
Flows:
- Authorization code + PKCE. For public clients (SPAs, mobile). PKCE prevents code interception by tying the code to a verifier only the caller knows.
- Client credentials. Machine-to-machine.
- Device code. TVs, CLI.
- Avoid Implicit (deprecated) and Resource owner password (don't collect passwords).
Tokens. `id_token` (identity, JWT), `access_token` (authz, opaque or JWT), `refresh_token` (mints new access tokens).
mTLS
Both sides present X.509 certs. Used for service mesh (Istio, Linkerd, Consul Connect) ā every hop authenticated by cert. Rotate certs frequently; SPIFFE/SPIRE automates.
Secrets management
- HashiCorp Vault. Central secret store; dynamic secrets (short-lived DB creds), transit encryption, PKI.
- AWS KMS / Secrets Manager. Managed; tight IAM integration; automatic rotation.
- Rotation. Automated; services fetch on startup or subscribe to updates. Avoid env vars for long-lived secrets.
RBAC vs ABAC
- RBAC. Role → permissions; user → roles. Simple; scales to hundreds of roles.
- ABAC. Policy evaluates attributes: subject (role, department), resource (owner, classification), action, environment (time, IP). Expressive; harder to audit. Policies in Rego (OPA) or XACML.
Salesforce angle. Profiles and permission sets are RBAC. Record-level security (owner, role hierarchy, sharing rules) is ABAC-ish: it's policy over row attributes.
Row-level security
Every read/write must carry `tenantId`. Enforcement options:
- App layer. The repository always injects `WHERE tenant_id = :t`. Easy to bypass if someone writes raw SQL.
- DB RLS. Postgres policies enforce at the DB; app bugs can't bypass them. The connection runs `SET app.current_tenant = :t`; the policy filters on it.
- Per-tenant schemas/DBs. Physical isolation.
Interview followup. "How do you prevent a missing tenant_id predicate from leaking data?" DB-level RLS + lint rules that reject raw SQL outside the repository layer + pen tests with a cross-tenant query.
OWASP Top 10 highlights
- Injection (SQL/NoSQL/OS). Always parameterize. Never string-concatenate query fragments from input. ORM protects if used correctly.
- XSS. Escape output by context (HTML, attribute, JS, URL). Content Security Policy as defense-in-depth.
- CSRF. State-changing requests need anti-CSRF token or SameSite cookies. JSON APIs with no cookies are less vulnerable but check origin.
- SSRF. The server fetches a URL the attacker controls; block the metadata endpoint (169.254.169.254) and private IP ranges. Allowlist destinations.
- Insecure deserialization. Don't deserialize untrusted data. Whitelist types.
- Secrets in logs. Never. Redact cards, tokens, passwords at the logger.
Section 9: Scaling patterns
Horizontal vs vertical
- Vertical (scale-up). Bigger box. Simple, limited by hardware, single failure domain.
- Horizontal (scale-out). More boxes. Requires statelessness or a coordination layer. Linear scaling if done right.
Default horizontal for stateless services; vertical for single-node DBs until you must shard.
Auto-scaling
- Reactive. Scale on CPU, request rate, queue depth. Lags spikes; overshoots.
- Predictive. ML on historical load; scale ahead of known patterns (9am Monday).
- Scheduled. Scale up on a schedule for known events (Black Friday).
Cooldowns to prevent flapping; scale up fast, scale down slow.
CDN
Edge caching at PoPs close to users. Caches by URL + Vary headers. Purge strategies:
- Soft purge. Mark stale; revalidate on next request. Cheap.
- Hard purge. Remove now. Expensive; use for incidents.
- Tag-based purge. Surrogate-Key or Cache-Tag headers; purge all assets with a tag.
Cache key. URL + headers in Vary (e.g., `Vary: Accept-Language`). Beware `Cookie` in Vary: essentially uncacheable.
Database read replicas
- Easy read scaling; offload reports and search.
- Replication lag. ms to minutes. Writes + immediate reads must go to primary.
- Read-your-writes. Session stickiness, LSN tokens, or read from primary for a bounded window after write.
Hot partition detection
Symptom: one partition CPU/IO near 100%, others idle. Fix:
- Re-shard. Rare and expensive.
- Salt hot keys. Add a random prefix to split one logical key into N.
- Dedicated shard. Move the big tenant to its own partition.
- Caching layer. Absorb reads before they hit the hot partition.
Detect via per-partition metrics, latency P99 by partition, query plans with partition pruning.
Section 10: Architectural patterns
Architecture questions escalate in R4/R5. Know these names and their failure modes.
CQRS (Command Query Responsibility Segregation)
Split write model (commands, normalized, transactional) from read model (queries, denormalized, fast).
When. Read/write workloads diverge dramatically; reporting needs shapes the transactional model can't serve cheaply.
Pitfalls. Eventual consistency between command and query sides; operational complexity (two models to maintain); stale reads visible to users (design UX for it).
Salesforce angle. List views and reports query a denormalized projection; writes hit the normalized transactional tables. CDC/materialized views bridge the two.
Event sourcing
State = replay of immutable events. The log is the source of truth; current state is a projection.
Benefits. Full audit, time-travel, rebuild projections, natural fit with CQRS.
Pitfalls. Schema evolution (event versioning is permanent), snapshots to avoid replaying millions of events, harder to query current state directly.
When NOT. Simple CRUD with no audit demands. The complexity tax isn't worth it.
Saga
Long-running workflow as a sequence of local transactions with compensating actions.
- Orchestration. A central saga coordinator invokes each step and compensates on failure. Pros: flow is explicit, easy to debug. Cons: coordinator is central (needs HA).
- Choreography. Services publish events, peers react. Pros: loose coupling. Cons: the flow is implicit and debugging is a nightmare at scale.
Always design compensating transactions carefully: they're rarely the exact inverse (you can't "uncharge" a credit card silently; you issue a refund).
Outbox pattern
Within the DB transaction that changes state, also insert a row into an outbox table. A separate process (polling query or CDC on the outbox table) publishes to the message bus and deletes/marks the row. Guarantees exactly the events that correspond to committed state changes get published ā no dual-write race.
When. Any time a service must publish events on state changes. Standard pattern for microservices with a transactional core.
Strangler fig
Gradually replace a legacy system by routing specific endpoints to the new system behind a facade. Old system shrinks over time and eventually gets strangled (removed).
When. Large legacy migrations where big-bang rewrites are too risky.
Backends for Frontends (BFF)
One backend per client type (web, iOS, Android). Each BFF aggregates downstream services and shapes responses for its client.
When. Clients differ in data needs, screen sizes, latency budgets. Avoids the monolith API that pleases no one.
Pitfalls. Duplicated logic across BFFs; discipline to keep business logic in domain services, not BFFs.
Section 11: Salesforce-specific patterns
These patterns are the ones interviewers use to tell platform-aware SMTSes from generic backend engineers. Even if you haven't written Apex, speak the vocabulary.
Multi-tenancy at every layer
`tenantId` (`org_id`) flows from the request through every layer:
- Request. The auth layer extracts `org_id` from the JWT/session and attaches it to the request context.
- Service. Every domain method takes an `OrgContext` or reads from a `ThreadLocal`. Never a bare `id`.
- DB. Every query filters on `org_id`. Enforced via the repository layer + DB RLS as defense-in-depth.
- Cache. Every key is prefixed with `org_id`. No global keys except true app config.
- Async jobs. The job payload includes `org_id`; executors re-establish `OrgContext` before running.
- Logs and metrics. Every structured log has `org_id`. Metrics are tagged with `org_id` where cardinality permits (or bucketed by tier).
Interview followup. "Show me the code path where `org_id` could leak." A missing predicate on raw JDBC, a shared cache key, a cron job that forgets to restore context. Your lint rules and RLS must assume engineers will make this mistake.
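A sketch of the `ThreadLocal`-based `OrgContext` idea, with a wrapper that re-establishes context for async jobs. The class is a hypothetical illustration (not Salesforce code); the key choices are failing closed when context is missing and clearing it so pooled threads can't leak one tenant's context into the next request:

```java
public class OrgContext {
    private static final ThreadLocal<String> CURRENT_ORG = new ThreadLocal<>();

    static void set(String orgId) { CURRENT_ORG.set(orgId); }

    static String require() {
        String orgId = CURRENT_ORG.get();
        // Fail closed: code that forgot to establish context blows up
        // instead of silently querying without a tenant filter.
        if (orgId == null) throw new IllegalStateException("no org context");
        return orgId;
    }

    static void clear() { CURRENT_ORG.remove(); } // pooled threads must not leak context

    // Async-job wrapper: re-establish context from the job payload.
    static Runnable scoped(String orgId, Runnable job) {
        return () -> {
            set(orgId);
            try { job.run(); } finally { clear(); }
        };
    }

    public static void main(String[] args) {
        Runnable job = scoped("00Dxx0000001", () ->
                System.out.println("running for " + require()));
        job.run(); // prints "running for 00Dxx0000001"
        try {
            require();
        } catch (IllegalStateException e) {
            System.out.println("context cleared"); // context gone after the job
        }
    }
}
```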
Governor limits philosophy
Salesforce caps per-transaction resources (CPU ms, heap, SOQL queries, DML rows, callouts). The philosophy: fairness over raw performance. A runaway transaction should be killed before it noisy-neighbors the pod.
In your own service design: budget per request (timeouts on every downstream, max rows processed per call, max memory). Surface the limit clearly (429 with Retry-After; error telling the caller what limit they hit).
Async job patterns
Apex patterns and Java analogs:
| Apex | Java analog | Use |
|---|---|---|
| @future | CompletableFuture.runAsync / ExecutorService | Fire-and-forget async |
| Batch Apex | Chunked ExecutorService loop, or Spring Batch | Process millions of records in chunks of 200 |
| Queueable | BlockingQueue + worker pool | Enqueue work with chained follow-ups |
| Scheduled Apex | Quartz, ScheduledExecutorService | Cron-style |
| Platform Events | Kafka topic + consumer group | Pub/sub across services and triggers |
Interview followup. "Design a Salesforce-style batch job in pure Java." ExecutorService with a bounded queue, chunk iteration of 200 IDs, per-chunk transaction, progress persisted to a job_state table, idempotent on retry. Governor-limit analogs: per-chunk CPU budget, per-job row cap, per-tenant quota.
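A minimal, single-threaded sketch of that design: chunks of 200, progress persisted after each chunk so a retry resumes instead of reprocessing from scratch. The in-memory `jobState` map is a stand-in for the `job_state` table; a real version adds a bounded worker pool, a per-chunk transaction, and idempotency keys per (job, chunk):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BatchJob {
    static final int CHUNK_SIZE = 200;
    // Stand-in for a job_state table: jobId -> index of the next chunk to run.
    static final Map<String, Integer> jobState = new ConcurrentHashMap<>();

    interface ChunkProcessor { void process(List<Integer> ids); }

    // Resumable: on retry, restart from the last persisted chunk index.
    static void run(String jobId, List<Integer> ids, ChunkProcessor processor) {
        int chunk = jobState.getOrDefault(jobId, 0);
        while (chunk * CHUNK_SIZE < ids.size()) {
            int from = chunk * CHUNK_SIZE;
            int to = Math.min(from + CHUNK_SIZE, ids.size());
            processor.process(ids.subList(from, to)); // one transaction per chunk
            jobState.put(jobId, ++chunk);             // persist progress after commit
        }
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 1_050; i++) ids.add(i);
        int[] chunks = {0};
        run("sync-job-1", ids, c -> chunks[0]++);
        System.out.println(chunks[0]); // prints 6 (5 full chunks of 200 + one of 50)
    }
}
```

Governor-limit analogs bolt on here: a CPU budget check inside the chunk loop, a per-job row cap before the loop, and a per-tenant quota before the job is enqueued.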
Bulkification
Process records in batches. The cardinal sin is a per-record network/DB call inside a loop.
- Batch size: 200 is the Salesforce default; pick yours based on payload size and downstream limits.
- Fail atomically per chunk, not per record, if you can (easier retry).
- For idempotency, include batch ID + record ID in the idempotency key.
Example. 50k records to sync to an external system. Bad: 50k HTTP calls. Good: chunks of 200, one bulk API call per chunk, DLQ for chunks that fail after retries.
Platform Events analog
Event bus semantics:
- At-least-once delivery.
- Per-org ordering.
- Retention (Salesforce: hours to days depending on type).
- Subscribers replay from an offset.
In your own Java service: Kafka topic per event type, key by org_id for ordering, Outbox pattern for publishing, idempotent consumers keyed by event ID.
Final checklist for the interview
Before each system-design or fundamentals follow-up, mentally walk this list:
- Tenant isolation. Where does `org_id` go? Where could it leak?
- Consistency stance. CP or AP? Strong or eventual? Why is that acceptable?
- Concurrency model. Threads, actors, async? What's the contention hotspot?
- Failure modes. What happens when this dependency is slow, down, or partitioned? Circuit breaker? DLQ? Retry budget?
- Idempotency. What's the idempotency key? How long do you remember it?
- Observability. What do you log, meter, and trace? What does the alarm look like?
- Scale story. How does it behave at 1x, 100x, 10000x tenants?
- Security. AuthN, AuthZ, row-level security. Where's the blast radius?
- Bulkification. Am I doing one-at-a-time where I should batch?
- Governor-style limits. What protects other tenants from this tenant's worst day?
If you can hit these ten in every answer, you'll sound like an SMTS Salesforce actually hires.