12 - ML Architectures


1. CNNs (Convolutional Neural Networks)

What: Specialized architecture for processing grid-like data (images, audio spectrograms). Uses sliding filters (kernels) that detect local patterns like edges, textures, and shapes.

Input Image (28×28)
         ↓
┌──────────────────┐
│ Conv Layer 1     │  32 filters (3×3) → detect edges
│ + ReLU + Pool    │  Output: 14×14×32
└────────┬─────────┘
         ↓
┌──────────────────┐
│ Conv Layer 2     │  64 filters (3×3) → detect shapes
│ + ReLU + Pool    │  Output: 7×7×64
└────────┬─────────┘
         ↓
┌──────────────────┐
│ Flatten          │  7×7×64 = 3136
│ + Fully Connected│  → 128 → 10 (classes)
└──────────────────┘

Convolution operation:

Input patch:     Kernel (3×3):     Output:
1  0  1          1  0  1
0  1  0    ×     0  1  0     =    5 (sum of element-wise products)
1  0  1          1  0  1

Slide kernel across entire image → feature map
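This single step can be sanity-checked in PyTorch; `F.conv2d` performs exactly this sliding multiply-and-sum (strictly speaking cross-correlation, which is what the example shows):

```python
import torch
import torch.nn.functional as F

# The 3×3 patch and kernel from the example above
patch = torch.tensor([[1., 0., 1.],
                      [0., 1., 0.],
                      [1., 0., 1.]])
kernel = patch.clone()  # the example uses an identical kernel

# One convolution step: element-wise product, then sum
print((patch * kernel).sum().item())  # → 5.0

# Sliding the kernel over a whole image produces a feature map
image = torch.rand(1, 1, 28, 28)                     # (batch, channels, H, W)
feature_map = F.conv2d(image, kernel.view(1, 1, 3, 3))
print(feature_map.shape)                             # → torch.Size([1, 1, 26, 26])
```

Without padding, a 3×3 kernel over a 28×28 image yields a 26×26 feature map (28 − 3 + 1 = 26), which is why the layers in the network below use `padding=1` to preserve spatial size.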

Key concepts:

  • Stride: How many pixels the kernel moves per step
  • Padding: Adding zeros around the edges to preserve spatial dimensions
  • Pooling: Downsampling (max pool, average pool) to reduce size and add translation invariance
  • Feature hierarchy: Early layers detect low-level features (edges), deeper layers detect high-level features (faces, objects)
```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # (3, 224, 224) → (32, 224, 224)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # → (32, 112, 112)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # → (64, 56, 56)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 56 * 56, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)
```

Notable architectures: AlexNet (2012) → VGG → ResNet (residual connections) → EfficientNet → Vision Transformer (ViT).


2. RNNs and LSTMs

What: Architectures for sequential data. Process one element at a time, maintaining a hidden state that carries information from previous steps.

Basic RNN:

         h₀        h₁        h₂        h₃
          ↓         ↓         ↓         ↓
x₀ → [RNN Cell] → [RNN Cell] → [RNN Cell] → [RNN Cell] → output
      "The"        "cat"       "sat"        "on"

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)

Problem: Vanishing gradients: after many steps, early inputs have negligible influence, so the network can't learn long-range dependencies.
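A minimal sketch of the recurrence above (the dimensions are illustrative assumptions):

```python
import torch

# h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b), with made-up sizes
input_dim, hidden_dim = 8, 16
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1
W_xh = torch.randn(hidden_dim, input_dim) * 0.1
b = torch.zeros(hidden_dim)

def rnn_step(h_prev, x_t):
    return torch.tanh(W_hh @ h_prev + W_xh @ x_t + b)

# Process a 5-step sequence, carrying the hidden state forward
h = torch.zeros(hidden_dim)
for x_t in torch.randn(5, input_dim):
    h = rnn_step(h, x_t)

print(h.shape)  # torch.Size([16])
```

Because the same `W_hh` is multiplied in at every step, backpropagated gradients pick up one such factor per step, which is what makes them vanish (or explode) over long sequences.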

LSTM (Long Short-Term Memory):

Solves vanishing gradients with a gating mechanism:

┌───────────────────────────────────────┐
│              LSTM Cell                │
│                                       │
│  Cell state (C) ───────────────→ C'   │  ← "highway" for long-range info
│        ↑         ↑          ↑         │
│   ┌────┴───┐ ┌───┴─────┐ ┌──┴───┐     │
│   │ Forget │ │ Input   │ │Output│     │
│   │ Gate   │ │ Gate    │ │ Gate │     │
│   │ (σ)    │ │ (σ×tanh)│ │ (σ)  │     │
│   └────┬───┘ └───┬─────┘ └──┬───┘     │
│        └─────────┴──────────┘         │
│              h_{t-1}, x_t             │
└───────────────────────────────────────┘
| Gate   | Purpose                        |
| ------ | ------------------------------ |
| Forget | What to remove from cell state |
| Input  | What new information to add    |
| Output | What to output as hidden state |
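In PyTorch all three gates come bundled in `nn.LSTM`; a quick usage sketch (sizes are illustrative):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(2, 10, 8)     # (batch, seq_len, input_size)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([2, 10, 16]): hidden state at every step
print(h_n.shape)     # torch.Size([1, 2, 16]):  final hidden state
print(c_n.shape)     # torch.Size([1, 2, 16]):  final cell state (the "highway")
```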

Why LSTMs are mostly replaced by Transformers:

  • Sequential processing (can't parallelize)
  • Still struggle with very long sequences (>1000 tokens)
  • Transformers process all positions simultaneously

Still used for: Time series, audio processing, small-scale sequential tasks.


3. Attention Mechanism Evolution

What: The progression from basic attention to the Transformer architecture that powers modern AI.

Timeline:
2014: Bahdanau Attention    → First attention for seq2seq (encoder-decoder)
2015: Luong Attention       → Simplified scoring functions
2017: Self-Attention        → "Attention Is All You Need" (Transformer)
2018: BERT                  → Bidirectional Transformer encoder
2018: GPT                   → Autoregressive Transformer decoder
2020: GPT-3                 → Scaling breakthrough (in-context learning)
2022: ChatGPT               → RLHF + instruction following
2023+: GPT-4, Claude, LLaMA → Multimodal, open-source, reasoning

Bahdanau (additive) attention:

score(h_i, s_j) = v^T · tanh(W₁h_i + W₂s_j)

Used in encoder-decoder models. Decoder attends to encoder hidden states.

Luong (multiplicative) attention:

score(h_i, s_j) = h_i^T · W · s_j    (general)
score(h_i, s_j) = h_i^T · s_j        (dot product)

Self-attention (Transformer):

score(q_i, k_j) = (q_i · k_j) / √d_k

Every position attends to every other position in the same sequence.

Key innovation: Self-attention has O(1) path length between any two positions (vs O(n) for RNNs), enabling direct information flow across the entire sequence.
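The scaled dot-product formula maps almost directly to code; a single-head sketch (sizes are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention over a sequence x of shape (seq, d)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (seq, seq): every pair of positions
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V

d = 8
x = torch.randn(4, d)  # 4 tokens, model dim 8
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([4, 8])
```

The (seq, seq) score matrix is the O(1) path: any output position mixes in any input position through a single matrix product, with no step-by-step recurrence.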


4. Diffusion Models

What: Generative models that learn to denoise. Start from pure noise and iteratively remove noise to generate data (images, audio, video).

Forward process (training): Add noise gradually
Image → Slightly noisy → More noisy → ... → Pure noise

Reverse process (generation): Remove noise gradually
Pure noise → Less noisy → ... → Slightly noisy → Clean image

┌─────┐   noise   ┌─────┐   noise   ┌─────┐   noise   ┌─────┐
│ x₀  │ ────────→ │ x₁  │ ────────→ │ x₂  │ ────────→ │ x_T │
│image│           │     │           │     │           │noise│
└─────┘           └─────┘           └─────┘           └─────┘
                     ↑                 ↑                 ↑
   denoise ←─────────┘    denoise ←────┘    denoise ←────┘
   (learned)              (learned)         (learned)

How training works:

  1. Take a clean image
  2. Add random noise (at random timestep t)
  3. Train a neural network to predict the noise
  4. Loss = MSE between predicted noise and actual noise
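The four steps condense into one DDPM-style training step; the linear noise schedule and the model interface here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # noise schedule β_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative ᾱ_t

def training_step(model, x0):
    """One DDPM training step: noise a clean batch, predict the noise."""
    t = torch.randint(0, T, (x0.size(0),))          # 2. random timestep per image
    noise = torch.randn_like(x0)                    #    random Gaussian noise ε
    a = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise    #    noised image x_t
    pred = model(x_t, t)                            # 3. network predicts ε
    return F.mse_loss(pred, noise)                  # 4. MSE(predicted, actual)

# Stand-in "model" that always predicts zeros, just to run the step
loss = training_step(lambda x_t, t: torch.zeros_like(x_t), torch.randn(4, 3, 32, 32))
print(loss.item())  # ≈ 1.0: MSE of zeros against unit-variance noise
```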

Key models:

  • DDPM: Original diffusion paper
  • Stable Diffusion: Diffusion in latent space (fast, high quality)
  • DALL-E 2/3: OpenAI's text-to-image
  • Midjourney: High-quality image generation
  • Sora: Video generation (OpenAI)

Latent diffusion (Stable Diffusion):

Instead of diffusing in pixel space (512×512×3 ≈ 786K dims):
  Image → VAE encoder → Latent (64×64×4 ≈ 16K dims) → Diffuse here
  Generated latent → VAE decoder → Image

~48x fewer dimensions → much faster

5. GANs (Generative Adversarial Networks)

What: Two networks competing against each other: a Generator (creates fake data) and a Discriminator (distinguishes real from fake).

┌──────────────┐         ┌───────────────────┐
│  Generator   │  fake   │   Discriminator   │
│              │────────→│                   │
│  noise → img │  data   │  "Is this real    │──→ real / fake
│              │         │   or fake?"       │
└──────────────┘         │                   │
              real data →│  ← also sees      │
                         │    real data      │
                         └───────────────────┘

Generator wants to fool Discriminator.
Discriminator wants to catch Generator.
Both improve through competition.

Training:

```python
# Simplified GAN training loop (assumes generator, discriminator, their
# optimizers, dataloader, and noise_dim are defined; the discriminator
# ends in a sigmoid, outputting P(real))
import torch
import torch.nn.functional as F

for real_batch in dataloader:
    # 1. Train Discriminator: push D(real) → 1 and D(fake) → 0
    fake_batch = generator(torch.randn(real_batch.size(0), noise_dim))
    d_real = discriminator(real_batch)
    d_fake = discriminator(fake_batch.detach())   # detach: don't update G here
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # 2. Train Generator: push D(fake) → 1 (fool the discriminator)
    g_out = discriminator(generator(torch.randn(real_batch.size(0), noise_dim)))
    g_loss = F.binary_cross_entropy(g_out, torch.ones_like(g_out))
    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()
```

Notable GAN variants:

  • StyleGAN: High-quality face generation (NVIDIA)
  • Pix2Pix: Image-to-image translation
  • CycleGAN: Unpaired image translation

GANs vs Diffusion Models:

  • GANs: Faster generation (single forward pass), but harder to train (mode collapse, instability)
  • Diffusion: More stable training, higher quality, but slower generation (many denoising steps)
  • Diffusion models have largely replaced GANs for image generation

6. Training vs Inference Trade-offs

| Aspect     | Training                               | Inference                             |
| ---------- | -------------------------------------- | ------------------------------------- |
| Compute    | Very high (days/weeks on GPU clusters) | Lower (single GPU or CPU)             |
| Memory     | High (gradients + optimizer states)    | Lower (only model weights + KV cache) |
| Batch size | Large (32-2048)                        | Often 1 (single request)              |
| Precision  | FP32 or mixed (FP16/BF16)              | INT8/INT4 quantization common         |
| Latency    | Doesn't matter (offline)               | Critical (user-facing)                |
| Throughput | Samples per second                     | Tokens per second                     |

Inference optimization techniques:

| Technique            | How                                          | Speedup                          |
| -------------------- | -------------------------------------------- | -------------------------------- |
| Quantization         | Reduce precision (FP32 → INT8/INT4)          | 2-4x, smaller model              |
| KV cache             | Cache key/value tensors from previous tokens | Avoids recomputation             |
| Batching             | Process multiple requests together           | Higher throughput                |
| Speculative decoding | Small model drafts, large model verifies     | 2-3x faster generation           |
| Flash Attention      | Memory-efficient attention kernel            | 2-4x faster attention            |
| Model pruning        | Remove unimportant weights                   | Smaller model                    |
| Distillation         | Train small model to mimic large model       | Much faster, slight quality loss |

Full model:    70B params, FP16 → 140 GB VRAM
Quantized:     70B params, INT4 → 35 GB VRAM
Distilled:     7B params, FP16  → 14 GB VRAM
Distilled+Q4:  7B params, INT4  → 3.5 GB VRAM  ← runs on laptop!
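The arithmetic behind these numbers is just parameter count × bytes per parameter (activations and the KV cache add more on top):

```python
def model_memory_gb(n_params, bits_per_param):
    """Approximate weight memory: parameters × (bits / 8) bytes."""
    return n_params * bits_per_param / 8 / 1e9

print(model_memory_gb(70e9, 16))  # FP16 70B → 140.0 GB
print(model_memory_gb(70e9, 4))   # INT4 70B → 35.0 GB
print(model_memory_gb(7e9, 4))    # INT4 7B  → 3.5 GB
```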

Frontend interview preparation reference.