12 - ML Architectures


1. CNNs (Convolutional Neural Networks)

What: Specialized architecture for processing grid-like data (images, audio spectrograms). Uses sliding filters (kernels) that detect local patterns like edges, textures, and shapes.

Input Image (28×28)
         ↓
┌──────────────────┐
│ Conv Layer 1     │  32 filters (3×3) → detect edges
│ + ReLU + Pool    │  Output: 14×14×32
└────────┬─────────┘
         ↓
┌──────────────────┐
│ Conv Layer 2     │  64 filters (3×3) → detect shapes
│ + ReLU + Pool    │  Output: 7×7×64
└────────┬─────────┘
         ↓
┌──────────────────┐
│ Flatten          │  7×7×64 = 3136
│ + Fully Connected│  → 128 → 10 (classes)
└──────────────────┘

Convolution operation:

Input patch:     Kernel (3×3):     Output:
1  0  1          1  0  1
0  1  0    ×     0  1  0     =    5 (sum of element-wise products)
1  0  1          1  0  1

Slide kernel across entire image → feature map
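This single step can be sanity-checked in PyTorch; `F.conv2d` performs exactly this sliding multiply-and-sum (strictly speaking cross-correlation, which is what the example shows):

```python
import torch
import torch.nn.functional as F

# The 3×3 patch and kernel from the example above
patch = torch.tensor([[1., 0., 1.],
                      [0., 1., 0.],
                      [1., 0., 1.]])
kernel = patch.clone()  # the example uses an identical kernel

# One convolution step: element-wise product, then sum
print((patch * kernel).sum().item())  # → 5.0

# Sliding the kernel over a whole image produces a feature map
image = torch.rand(1, 1, 28, 28)                     # (batch, channels, H, W)
feature_map = F.conv2d(image, kernel.view(1, 1, 3, 3))
print(feature_map.shape)                             # → torch.Size([1, 1, 26, 26])
```

Without padding, a 3×3 kernel over a 28×28 image yields a 26×26 feature map (28 − 3 + 1 = 26), which is why the layers in the network below use `padding=1` to preserve spatial size.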

Key concepts:

  • Stride: How many pixels the kernel moves per step
  • Padding: Adding zeros around the edges to preserve spatial dimensions
  • Pooling: Downsampling (max pool, average pool) to reduce size and add translation invariance
  • Feature hierarchy: Early layers detect low-level features (edges), deeper layers detect high-level features (faces, objects)
```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # (3, 224, 224) → (32, 224, 224)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # → (32, 112, 112)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # → (64, 56, 56)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 56 * 56, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)
```

Notable architectures: AlexNet (2012) → VGG → ResNet (residual connections) → EfficientNet → Vision Transformer (ViT).


2. RNNs and LSTMs

What: Architectures for sequential data. Process one element at a time, maintaining a hidden state that carries information from previous steps.

Basic RNN:

         h₀        h₁        h₂        h₃
          ↓         ↓         ↓         ↓
x₀ → [RNN Cell] → [RNN Cell] → [RNN Cell] → [RNN Cell] → output
      "The"        "cat"       "sat"        "on"

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)

Problem: Vanishing gradients: after many steps, early inputs have negligible influence, so the network can't learn long-range dependencies.
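A minimal sketch of the recurrence above (the dimensions are illustrative assumptions):

```python
import torch

# h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b), with made-up sizes
input_dim, hidden_dim = 8, 16
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1
W_xh = torch.randn(hidden_dim, input_dim) * 0.1
b = torch.zeros(hidden_dim)

def rnn_step(h_prev, x_t):
    return torch.tanh(W_hh @ h_prev + W_xh @ x_t + b)

# Process a 5-step sequence, carrying the hidden state forward
h = torch.zeros(hidden_dim)
for x_t in torch.randn(5, input_dim):
    h = rnn_step(h, x_t)

print(h.shape)  # torch.Size([16])
```

Because the same `W_hh` is multiplied in at every step, backpropagated gradients pick up one such factor per step, which is what makes them vanish (or explode) over long sequences.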

LSTM (Long Short-Term Memory):

Solves vanishing gradients with a gating mechanism:

┌───────────────────────────────────────┐
│              LSTM Cell                │
│                                       │
│  Cell state (C) ───────────────→ C'   │  ← "highway" for long-range info
│        ↑         ↑          ↑         │
│   ┌────┴───┐ ┌───┴─────┐ ┌──┴───┐     │
│   │ Forget │ │ Input   │ │Output│     │
│   │ Gate   │ │ Gate    │ │ Gate │     │
│   │ (σ)    │ │ (σ×tanh)│ │ (σ)  │     │
│   └────┬───┘ └───┬─────┘ └──┬───┘     │
│        └─────────┴──────────┘         │
│              h_{t-1}, x_t             │
└───────────────────────────────────────┘
| Gate   | Purpose                        |
| ------ | ------------------------------ |
| Forget | What to remove from cell state |
| Input  | What new information to add    |
| Output | What to output as hidden state |
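In PyTorch all three gates come bundled in `nn.LSTM`; a quick usage sketch (sizes are illustrative):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(2, 10, 8)     # (batch, seq_len, input_size)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([2, 10, 16]): hidden state at every step
print(h_n.shape)     # torch.Size([1, 2, 16]):  final hidden state
print(c_n.shape)     # torch.Size([1, 2, 16]):  final cell state (the "highway")
```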

Why LSTMs are mostly replaced by Transformers:

  • Sequential processing (can't parallelize)
  • Still struggle with very long sequences (>1000 tokens)
  • Transformers process all positions simultaneously

Still used for: Time series, audio processing, small-scale sequential tasks.


3. Attention Mechanism Evolution

What: The progression from basic attention to the Transformer architecture that powers modern AI.

Timeline:
2014: Bahdanau Attention    → First attention for seq2seq (encoder-decoder)
2015: Luong Attention       → Simplified scoring functions
2017: Self-Attention        → "Attention Is All You Need" (Transformer)
2018: BERT                  → Bidirectional Transformer encoder
2018: GPT                   → Autoregressive Transformer decoder
2020: GPT-3                 → Scaling breakthrough (in-context learning)
2022: ChatGPT               → RLHF + instruction following
2023+: GPT-4, Claude, LLaMA → Multimodal, open-source, reasoning

Bahdanau (additive) attention:

score(h_i, s_j) = v^T · tanh(W₁h_i + W₂s_j)

Used in encoder-decoder models. Decoder attends to encoder hidden states.

Luong (multiplicative) attention:

score(h_i, s_j) = h_i^T · W · s_j    (general)
score(h_i, s_j) = h_i^T · s_j        (dot product)

Self-attention (Transformer):

score(q_i, k_j) = (q_i · k_j) / √d_k

Every position attends to every other position in the same sequence.

Key innovation: Self-attention has O(1) path length between any two positions (vs O(n) for RNNs), enabling direct information flow across the entire sequence.
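The scaled dot-product formula maps almost directly to code; a single-head sketch (sizes are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention over a sequence x of shape (seq, d)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (seq, seq): every pair of positions
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V

d = 8
x = torch.randn(4, d)  # 4 tokens, model dim 8
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([4, 8])
```

The (seq, seq) score matrix is the O(1) path: any output position mixes in any input position through a single matrix product, with no step-by-step recurrence.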


4. Diffusion Models

What: Generative models that learn to denoise. Start from pure noise and iteratively remove noise to generate data (images, audio, video).

Forward process (training): Add noise gradually
Image → Slightly noisy → More noisy → ... → Pure noise

Reverse process (generation): Remove noise gradually
Pure noise → Less noisy → ... → Slightly noisy → Clean image

┌─────┐   noise   ┌─────┐   noise   ┌─────┐   noise   ┌─────┐
│ x₀  │ ────────→ │ x₁  │ ────────→ │ x₂  │ ────────→ │ x_T │
│image│           │     │           │     │           │noise│
└─────┘           └─────┘           └─────┘           └─────┘
                     ↑                 ↑                 ↑
   denoise ←─────────┘    denoise ←────┘    denoise ←────┘
   (learned)              (learned)         (learned)

How training works:

  1. Take a clean image
  2. Add random noise (at random timestep t)
  3. Train a neural network to predict the noise
  4. Loss = MSE between predicted noise and actual noise
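The four steps condense into one DDPM-style training step; the linear noise schedule and the model interface here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # noise schedule β_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative ᾱ_t

def training_step(model, x0):
    """One DDPM training step: noise a clean batch, predict the noise."""
    t = torch.randint(0, T, (x0.size(0),))          # 2. random timestep per image
    noise = torch.randn_like(x0)                    #    random Gaussian noise ε
    a = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise    #    noised image x_t
    pred = model(x_t, t)                            # 3. network predicts ε
    return F.mse_loss(pred, noise)                  # 4. MSE(predicted, actual)

# Stand-in "model" that always predicts zeros, just to run the step
loss = training_step(lambda x_t, t: torch.zeros_like(x_t), torch.randn(4, 3, 32, 32))
print(loss.item())  # ≈ 1.0: MSE of zeros against unit-variance noise
```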

Key models:

  • DDPM: Original diffusion paper
  • Stable Diffusion: Diffusion in latent space (fast, high quality)
  • DALL-E 2/3: OpenAI's text-to-image
  • Midjourney: High-quality image generation
  • Sora: Video generation (OpenAI)

Latent diffusion (Stable Diffusion):

Instead of diffusing in pixel space (512×512×3 ≈ 786K dims):
  Image → VAE encoder → Latent (64×64×4 ≈ 16K dims) → Diffuse here
  Generated latent → VAE decoder → Image

~48x fewer dimensions → much faster

5. GANs (Generative Adversarial Networks)

What: Two networks competing against each other: a Generator (creates fake data) and a Discriminator (distinguishes real from fake).

┌──────────────┐         ┌───────────────────┐
│  Generator   │  fake   │   Discriminator   │
│              │────────→│                   │
│  noise → img │  data   │  "Is this real    │──→ real / fake
│              │         │   or fake?"       │
└──────────────┘         │                   │
              real data →│  ← also sees      │
                         │    real data      │
                         └───────────────────┘

Generator wants to fool Discriminator.
Discriminator wants to catch Generator.
Both improve through competition.

Training:

```python
# Simplified GAN training loop (assumes generator, discriminator, their
# optimizers, dataloader, and noise_dim are defined; the discriminator
# ends in a sigmoid, outputting P(real))
import torch
import torch.nn.functional as F

for real_batch in dataloader:
    # 1. Train Discriminator: push D(real) → 1 and D(fake) → 0
    fake_batch = generator(torch.randn(real_batch.size(0), noise_dim))
    d_real = discriminator(real_batch)
    d_fake = discriminator(fake_batch.detach())   # detach: don't update G here
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # 2. Train Generator: push D(fake) → 1 (fool the discriminator)
    g_out = discriminator(generator(torch.randn(real_batch.size(0), noise_dim)))
    g_loss = F.binary_cross_entropy(g_out, torch.ones_like(g_out))
    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()
```

Notable GAN variants:

  • StyleGAN: High-quality face generation (NVIDIA)
  • Pix2Pix: Image-to-image translation
  • CycleGAN: Unpaired image translation

GANs vs Diffusion Models:

  • GANs: Faster generation (single forward pass), but harder to train (mode collapse, instability)
  • Diffusion: More stable training, higher quality, but slower generation (many denoising steps)
  • Diffusion models have largely replaced GANs for image generation

6. Training vs Inference Trade-offs

| Aspect     | Training                               | Inference                             |
| ---------- | -------------------------------------- | ------------------------------------- |
| Compute    | Very high (days/weeks on GPU clusters) | Lower (single GPU or CPU)             |
| Memory     | High (gradients + optimizer states)    | Lower (only model weights + KV cache) |
| Batch size | Large (32-2048)                        | Often 1 (single request)              |
| Precision  | FP32 or mixed (FP16/BF16)              | INT8/INT4 quantization common         |
| Latency    | Doesn't matter (offline)               | Critical (user-facing)                |
| Throughput | Samples per second                     | Tokens per second                     |

Inference optimization techniques:

| Technique            | How                                          | Speedup                          |
| -------------------- | -------------------------------------------- | -------------------------------- |
| Quantization         | Reduce precision (FP32 → INT8/INT4)          | 2-4x, smaller model              |
| KV cache             | Cache key/value tensors from previous tokens | Avoids recomputation             |
| Batching             | Process multiple requests together           | Higher throughput                |
| Speculative decoding | Small model drafts, large model verifies     | 2-3x faster generation           |
| Flash Attention      | Memory-efficient attention kernel            | 2-4x faster attention            |
| Model pruning        | Remove unimportant weights                   | Smaller model                    |
| Distillation         | Train small model to mimic large model       | Much faster, slight quality loss |

Full model:    70B params, FP16 → 140 GB VRAM
Quantized:     70B params, INT4 → 35 GB VRAM
Distilled:     7B params, FP16  → 14 GB VRAM
Distilled+Q4:  7B params, INT4  → 3.5 GB VRAM  ← runs on laptop!
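The arithmetic behind these numbers is just parameter count × bytes per parameter (activations and the KV cache add more on top):

```python
def model_memory_gb(n_params, bits_per_param):
    """Approximate weight memory: parameters × (bits / 8) bytes."""
    return n_params * bits_per_param / 8 / 1e9

print(model_memory_gb(70e9, 16))  # FP16 70B → 140.0 GB
print(model_memory_gb(70e9, 4))   # INT4 70B → 35.0 GB
print(model_memory_gb(7e9, 4))    # INT4 7B  → 3.5 GB
```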

Frontend interview preparation reference.