12 - ML Architectures
1. CNNs (Convolutional Neural Networks)
What: Specialized architecture for processing grid-like data (images, audio spectrograms). Uses sliding filters (kernels) that detect local patterns like edges, textures, and shapes.
Input Image (28×28)
        │
┌───────────────────┐
│   Conv Layer 1    │  32 filters (3×3) → detect edges
│   + ReLU + Pool   │  Output: 14×14×32
└─────────┬─────────┘
          │
┌───────────────────┐
│   Conv Layer 2    │  64 filters (3×3) → detect shapes
│   + ReLU + Pool   │  Output: 7×7×64
└─────────┬─────────┘
          │
┌───────────────────┐
│ Flatten           │  7×7×64 = 3136
│ + Fully Connected │  → 128 → 10 (classes)
└───────────────────┘

Convolution operation:
Input patch:      Kernel (3×3):     Output:
1 0 1             1 0 1
0 1 0      ×      0 1 0      =      5   (sum of element-wise products)
1 0 1             1 0 1

Slide the kernel across the entire image → feature map.

Key concepts:
- Stride: How many pixels the kernel moves per step
- Padding: Adding zeros around the edges to preserve spatial dimensions
- Pooling: Downsampling (max pool, average pool) to reduce size and add translation invariance
- Feature hierarchy: Early layers detect low-level features (edges), deeper layers detect high-level features (faces, objects)
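The 3×3 example above can be reproduced with PyTorch's functional conv2d (a minimal sketch; the patch and kernel values are the ones shown above):

import torch
import torch.nn.functional as F

# The 3x3 patch and kernel from the example, shaped (batch, channels, H, W)
patch  = torch.tensor([[1., 0., 1.], [0., 1., 0.], [1., 0., 1.]]).reshape(1, 1, 3, 3)
kernel = torch.tensor([[1., 0., 1.], [0., 1., 0.], [1., 0., 1.]]).reshape(1, 1, 3, 3)

out = F.conv2d(patch, kernel)   # no padding: a single output value
print(out)                      # tensor([[[[5.]]]]) -- sum of element-wise products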
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # (3, 224, 224) → (32, 224, 224)
            nn.ReLU(),
            nn.MaxPool2d(2),                             # → (32, 112, 112)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # → (64, 56, 56)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 56 * 56, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

Notable architectures: AlexNet (2012) → VGG → ResNet (residual connections) → EfficientNet → Vision Transformer (ViT).
2. RNNs and LSTMs
What: Architectures for sequential data. Process one element at a time, maintaining a hidden state that carries information from previous steps.
Basic RNN:
      h₁            h₂            h₃            h₄
      ↑             ↑             ↑             ↑
x₁ → [RNN Cell] → [RNN Cell] → [RNN Cell] → [RNN Cell] → output
    "The"          "cat"         "sat"         "on"

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)

Problem: Vanishing gradients → after many steps, early inputs have negligible influence. The network can't learn long-range dependencies.
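The recurrence maps directly to code. A minimal sketch of a vanilla RNN cell, with weight names matching the formula (sizes are illustrative, not from the text):

import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_xh = nn.Linear(input_size, hidden_size, bias=False)
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=True)  # carries b

    def forward(self, x_t, h_prev):
        # h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b)
        return torch.tanh(self.W_hh(h_prev) + self.W_xh(x_t))

cell = VanillaRNNCell(input_size=8, hidden_size=16)
h = torch.zeros(1, 16)                  # initial hidden state
for x_t in torch.randn(4, 1, 8):        # 4 time steps ("The cat sat on")
    h = cell(x_t, h)                    # hidden state carries context forward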
LSTM (Long Short-Term Memory):
Solves vanishing gradients with a gating mechanism:
┌────────────────────────────────────────┐
│                LSTM Cell               │
│                                        │
│  Cell state (C) ──────────────→ C'     │  ← "highway" for long-range info
│      │             │           │       │
│  ┌───┴────┐   ┌────┴────┐  ┌───┴───┐   │
│  │ Forget │   │  Input  │  │Output │   │
│  │  Gate  │   │  Gate   │  │ Gate  │   │
│  │  (σ)   │   │ (σ×tanh)│  │  (σ)  │   │
│  └───┬────┘   └────┬────┘  └───┬───┘   │
│      └─────────────┴───────────┘       │
│             h_{t-1}, x_t               │
└────────────────────────────────────────┘

| Gate | Purpose |
|---|---|
| Forget | What to remove from cell state |
| Input | What new information to add |
| Output | What to output as hidden state |
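In practice the gates are rarely written by hand; PyTorch's nn.LSTM bundles them. A minimal sketch showing the tensor shapes involved (sizes are illustrative):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(2, 50, 8)          # (batch, sequence length, input size)
output, (h_n, c_n) = lstm(x)

print(output.shape)                # (2, 50, 16) -- hidden state at every step
print(h_n.shape, c_n.shape)        # (1, 2, 16)  -- final hidden and cell states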
Why LSTMs are mostly replaced by Transformers:
- Sequential processing (can't parallelize)
- Still struggle with very long sequences (>1000 tokens)
- Transformers process all positions simultaneously
Still used for: Time series, audio processing, small-scale sequential tasks.
3. Attention Mechanism Evolution
What: The progression from basic attention to the Transformer architecture that powers modern AI.
Timeline:
2014: Bahdanau Attention → First attention for seq2seq (encoder-decoder)
2015: Luong Attention → Simplified scoring functions
2017: Self-Attention → "Attention Is All You Need" (Transformer)
2018: BERT → Bidirectional Transformer encoder
2018: GPT → Autoregressive Transformer decoder
2020: GPT-3 → Scaling breakthrough (in-context learning)
2022: ChatGPT → RLHF + instruction following
2023+: GPT-4, Claude, LLaMA → Multimodal, open-source, reasoning

Bahdanau (additive) attention:
score(h_i, s_j) = v^T · tanh(W₁h_i + W₂s_j)

Used in encoder-decoder models. The decoder attends to the encoder hidden states.

Luong (multiplicative) attention:

score(h_i, s_j) = h_i^T · W · s_j   (general)
score(h_i, s_j) = h_i^T · s_j       (dot product)

Self-attention (Transformer):

score(q_i, k_j) = (q_i · k_j) / √d_k

Every position attends to every other position in the same sequence.
Key innovation: Self-attention has O(1) path length between any two positions (vs O(n) for RNNs), enabling direct information flow across the entire sequence.
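A minimal sketch of scaled dot-product self-attention as described above (single head, no masking; shapes and weights are illustrative):

import math
import torch

def self_attention(x, W_q, W_k, W_v):
    # x: (sequence length, d_model); projections give queries, keys, values
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / math.sqrt(d_k)          # (seq, seq): every position vs every other
    weights = torch.softmax(scores, dim=-1)    # attention distribution per query
    return weights @ V                         # weighted sum of values

d_model = 16
x = torch.randn(5, d_model)                    # a 5-token sequence
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)         # (5, 16)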
4. Diffusion Models
What: Generative models that learn to denoise. Start from pure noise and iteratively remove noise to generate data (images, audio, video).
Forward process (training): Add noise gradually
Image → Slightly noisy → More noisy → ... → Pure noise

Reverse process (generation): Remove noise gradually
Pure noise → Less noisy → ... → Slightly noisy → Clean image

┌──────┐  noise   ┌──────┐  noise   ┌──────┐  noise   ┌──────┐
│  x₀  │ ───────→ │  x₁  │ ───────→ │  x₂  │ ──···──→ │  x_T │
│ image│          │      │          │      │          │ noise│
└──────┘          └──────┘          └──────┘          └──────┘
    ↑                 │                  │                 │
    └──── denoise ────┴───── denoise ────┴───── denoise ───┘
         (learned)          (learned)          (learned)

How training works:
- Take a clean image
- Add random noise (at random timestep t)
- Train a neural network to predict the noise
- Loss = MSE between predicted noise and actual noise (as in the sketch after this list)
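A minimal sketch of one such training step in the DDPM style, assuming a hypothetical noise-prediction network model(x_t, t) and a simple linear noise schedule:

import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)    # cumulative alpha-bar values

def training_step(model, x0):
    """One denoising training step on a batch of clean images x0 (model is assumed)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                     # random timestep per sample
    noise = torch.randn_like(x0)                      # the noise the model must predict
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # noised image at step t
    pred_noise = model(x_t, t)                        # network predicts the added noise
    return F.mse_loss(pred_noise, noise)              # MSE between predicted and actual noise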
Key models:
- DDPM (2020): Established the modern denoising-diffusion formulation
- Stable Diffusion: Diffusion in latent space (fast, high quality)
- DALL-E 2/3: OpenAI's text-to-image
- Midjourney: High-quality image generation
- Sora: Video generation (OpenAI)
Latent diffusion (Stable Diffusion):
Instead of diffusing in pixel space (512×512×3 = 786K dims):
Image → VAE encoder → Latent (64×64×4 = 16K dims) → Diffuse here
Generated latent → VAE decoder → Image
~48x fewer dimensions → much faster

5. GANs (Generative Adversarial Networks)
What: Two networks competing against each other: a Generator (creates fake data) and a Discriminator (distinguishes real from fake).
┌───────────────┐             ┌────────────────────┐
│   Generator   │ ──────────→ │   Discriminator    │
│               │   fake      │                    │
│  noise → img  │   data      │  "Is this real     │ ───→ real / fake
│               │             │   or fake?"        │
└───────────────┘             │                    │
                              │  also sees         │
         real data ─────────→ │  real data         │
                              └────────────────────┘

The Generator wants to fool the Discriminator.
The Discriminator wants to catch the Generator.
Both improve through competition.

Training:
# Simplified GAN training loop
# (generator, discriminator, their optimizers, dataloader and latent_dim assumed defined;
#  discriminator is assumed to output probabilities in (0, 1))
import torch

for real_batch in dataloader:
    noise = torch.randn(real_batch.size(0), latent_dim)

    # 1. Train Discriminator: push D(real) toward 1 and D(fake) toward 0
    d_optimizer.zero_grad()
    fake_batch = generator(noise).detach()          # don't backprop into the generator here
    d_loss = -(torch.log(discriminator(real_batch)) +
               torch.log(1 - discriminator(fake_batch))).mean()
    d_loss.backward()
    d_optimizer.step()

    # 2. Train Generator: push D(fake) toward 1 (fool the discriminator)
    g_optimizer.zero_grad()
    fake_batch = generator(noise)
    g_loss = -torch.log(discriminator(fake_batch)).mean()
    g_loss.backward()
    g_optimizer.step()

Notable GAN variants:
- StyleGAN: High-quality face generation (NVIDIA)
- Pix2Pix: Image-to-image translation
- CycleGAN: Unpaired image translation
GANs vs Diffusion Models:
- GANs: Faster generation (single forward pass), but harder to train (mode collapse, instability)
- Diffusion: More stable training, higher quality, but slower generation (many denoising steps)
- Diffusion models have largely replaced GANs for image generation
6. Training vs Inference Trade-offs
| Aspect | Training | Inference |
|---|---|---|
| Compute | Very high (days/weeks on GPU clusters) | Lower (single GPU or CPU) |
| Memory | High (gradients + optimizer states) | Lower (only model weights + KV cache) |
| Batch size | Large (32-2048) | Often 1 (single request) |
| Precision | FP32 or mixed (FP16/BF16) | INT8/INT4 quantization common |
| Latency | Doesn't matter (offline) | Critical (user-facing) |
| Throughput | Samples per second | Tokens per second |
Inference optimization techniques:
| Technique | How | Speedup |
|---|---|---|
| Quantization | Reduce precision (FP32 → INT8/INT4) | 2-4x, smaller model |
| KV cache | Cache key/value tensors from previous tokens (sketched below) | Avoid recomputation |
| Batching | Process multiple requests together | Higher throughput |
| Speculative decoding | Small model drafts, large model verifies | 2-3x faster generation |
| Flash Attention | Memory-efficient attention kernel | 2-4x faster attention |
| Model pruning | Remove unimportant weights | Smaller model |
| Distillation | Train small model to mimic large model | Much faster, slight quality loss |
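To make the KV-cache row concrete, here is a minimal single-head decoding-step sketch (real implementations cache keys and values per layer and per head; names and shapes here are illustrative):

import math
import torch

def decode_step(x_t, W_q, W_k, W_v, cache):
    """One autoregressive decoding step reusing cached keys/values from earlier tokens."""
    q = x_t @ W_q                                    # query for the new token only
    cache["k"] = torch.cat([cache["k"], x_t @ W_k])  # append this token's key ...
    cache["v"] = torch.cat([cache["v"], x_t @ W_v])  # ... and value; older ones are reused
    scores = q @ cache["k"].T / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ cache["v"]

d = 16
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
for _ in range(5):                                   # generate 5 tokens
    out = decode_step(torch.randn(1, d), W_q, W_k, W_v, cache)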
Full model:   70B params, FP16 → 140 GB VRAM
Quantized:    70B params, INT4 →  35 GB VRAM
Distilled:     7B params, FP16 →  14 GB VRAM
Distilled+Q4:  7B params, INT4 → 3.5 GB VRAM → runs on a laptop!
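These figures follow a simple rule of thumb: weight memory ≈ parameter count × bytes per parameter (activations and the KV cache add more on top). A minimal sketch:

BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate VRAM needed just to hold the weights, in GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

print(weight_memory_gb(70e9, "FP16"))   # ~140.0 GB
print(weight_memory_gb(70e9, "INT4"))   # ~35.0 GB
print(weight_memory_gb(7e9, "INT4"))    # ~3.5 GB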