11 - ML Fundamentals
1. Neural Network Basics
What: A neural network is a function that maps inputs to outputs through layers of connected "neurons" (linear transformations + non-linear activations).
Input Layer    Hidden Layer 1    Hidden Layer 2    Output Layer
    (3)             (4)               (4)              (2)

    x1              h1                h1
    x2      →       h2       →        h2       →        y1
    x3              h3                h3                 y2
                    h4                h4

(every neuron in one layer is connected to every neuron in the next)

Each neuron computes:
output = activation(W · input + b)

    W          = weight matrix (learned)
    b          = bias vector (learned)
    activation = non-linear function (ReLU, sigmoid, etc.)

Common activations:
| Function | Formula | Range | Used For |
|---|---|---|---|
| ReLU | max(0, x) | [0, inf) | Hidden layers (default) |
| Sigmoid | 1/(1+e^-x) | (0, 1) | Binary classification output |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | (-1, 1) | Hidden layers (older) |
| Softmax | e^(x_i) / Σ e^(x_j) | (0, 1), sums to 1 | Multi-class output |
| GELU | x · Φ(x) | (-0.17, inf) | Transformers |
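To make the neuron formula and the table above concrete, here is a small sketch (the input values are arbitrary) that applies one linear layer and a few activations:

import torch
import torch.nn as nn

x = torch.tensor([0.5, -1.0, 2.0])     # one sample with 3 input features
layer = nn.Linear(3, 4)                # W is 4x3, b has 4 entries (both learned)
z = layer(x)                           # z = W · x + b (pre-activation)

print(torch.relu(z))                   # negatives clipped to 0
print(torch.sigmoid(z))                # squashed into (0, 1)
print(torch.softmax(z, dim=0))         # non-negative and sums to 1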
import torch
import torch.nn as nn
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256),   # input → hidden
            nn.ReLU(),
            nn.Linear(256, 128),   # hidden → hidden
            nn.ReLU(),
            nn.Linear(128, 10),    # hidden → output
        )

    def forward(self, x):
        return self.layers(x)      # 10 logits (one per class)

2. Backpropagation
What: The algorithm for training neural networks. Computes how much each weight contributed to the error, then adjusts weights to reduce error.
Forward pass → compute loss → backward pass → update weights:

Forward pass:
    Input → Layer 1 → Layer 2 → ... → Output → Loss

Backward pass (chain rule):
    ∂Loss/∂W₃ = ∂Loss/∂output × ∂output/∂W₃
    ∂Loss/∂W₂ = ∂Loss/∂output × ∂output/∂h₂ × ∂h₂/∂W₂
    ∂Loss/∂W₁ = ∂Loss/∂output × ∂output/∂h₂ × ∂h₂/∂h₁ × ∂h₁/∂W₁

Update:
    W = W - learning_rate × ∂Loss/∂W

# PyTorch handles this automatically
model = SimpleNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
# Training step
predictions = model(input_batch) # forward pass
loss = loss_fn(predictions, labels) # compute loss
loss.backward() # backward pass (compute gradients)
optimizer.step() # update weights
optimizer.zero_grad()                  # reset gradients for next step

Vanishing gradient problem: In deep networks, gradients can shrink exponentially as they propagate backward through many layers, making early layers learn very slowly. Solutions: ReLU activation, residual connections, batch normalization.
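One of those fixes, the residual connection, can be sketched as a block that adds its input back onto its output, giving gradients a shortcut path around the layers (a minimal sketch, not tied to any particular architecture):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = self.fc2(h)
        return torch.relu(x + h)   # the "+ x" gives gradients a direct path backward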
3. Loss Functions
What: A function that measures how far the model's predictions are from the true labels. The goal of training is to minimize this.
| Loss | Formula | Used For |
|---|---|---|
| MSE | mean((y - ŷ)²) | Regression |
| Cross-Entropy | -Σ y·log(ŷ) | Classification |
| Binary Cross-Entropy | -[y·log(ŷ) + (1-y)·log(1-ŷ)] | Binary classification |
| Huber | MSE when small, MAE when large | Robust regression |
| Contrastive | Pull similar pairs close, push dissimilar apart | Embedding learning |
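The classification example below is the common case for LLMs; for regression, MSE and Huber from the table can be sketched like this (the tensors are made-up values):

import torch
import torch.nn as nn

preds   = torch.tensor([2.5, 0.0, 2.1])
targets = torch.tensor([3.0, -0.5, 2.0])

mse   = nn.MSELoss()(preds, targets)     # mean squared error
huber = nn.HuberLoss()(preds, targets)   # quadratic for small errors, linear for large ones
print(mse.item(), huber.item())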
# Classification (most common for LLMs)
loss = nn.CrossEntropyLoss()
# Combines LogSoftmax + NLLLoss
# Input: raw logits (batch, num_classes)
# Target: class indices (batch,)
logits = model(input)                  # (32, 10) → 32 samples, 10 classes
labels = torch.tensor([3, 7, 1, ...])  # (32,)    → true class for each sample
loss_value = loss(logits, labels)      # scalar

4. Optimizers
What: Algorithms that update model weights based on gradients. Different optimizers use different strategies for step size and direction.
| Optimizer | Key Idea | When to Use |
|---|---|---|
| SGD | Basic gradient descent with optional momentum | Simple problems, well-tuned LR |
| Adam | Adaptive learning rate per parameter | Default choice for most tasks |
| AdamW | Adam + decoupled weight decay | Transformers (standard) |
| Adafactor | Memory-efficient Adam variant | Very large models |
# Adam - the default choice
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# AdamW - standard for transformers
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

Learning rate schedule: Start with a higher LR and decrease it over training.
# Cosine annealing - smoothly decrease LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Warmup + decay - standard for transformer training
# Linear warmup for first N steps, then cosine decay

Learning Rate
    │      /\
    │     /  \
    │    /    \
    │   /      \_____________
    │  /
    └───────────────────────────── Steps
      warmup   peak    decay

5. Overfitting and Regularization
What: Overfitting = model memorizes training data but fails on new data. Regularization = techniques to prevent this.
Loss
    │\
    │ \\
    │  \ \            Validation loss
    │   \  \______          _________
    │    \        \________/   ← overfitting starts here
    │     \
    │      \____________________________  Training loss
    │
    └──────────────────┬──────────────── Epochs
                       │
              optimal stopping point

Regularization techniques:
| Technique | How | Effect |
|---|---|---|
| Dropout | Randomly zero out neurons during training | Prevents co-adaptation |
| Weight decay (L2) | Add \|W\|² penalty to loss | Keeps weights small |
| L1 regularization | Add \|W\| penalty to loss | Encourages sparsity |
| Early stopping | Stop training when validation loss increases | Prevents over-training |
| Data augmentation | Transform training data (flip, rotate, etc.) | More diverse training |
| Batch normalization | Normalize layer inputs | Stabilizes training |
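Dropout is shown in the model below; early stopping from the same table can be sketched as a simple check on validation loss (the patience value and the train_one_epoch/evaluate/save_checkpoint helpers are placeholders, matching the pseudocode style used in section 7):

best_val_loss = float("inf")
patience, bad_epochs = 5, 0                     # illustrative patience

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)        # placeholder helper
    val_loss = evaluate(model, val_loader)      # placeholder helper

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        save_checkpoint(model)                  # keep the best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                               # validation loss stopped improving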
class RegularizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Dropout(0.3),      # 30% dropout
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.layers(x)

6. Batch Normalization
What: Normalizes the input to each layer to have zero mean and unit variance. Dramatically stabilizes and accelerates training.
Without BatchNorm:
    Layer output values can shift wildly during training
    → "internal covariate shift"
    → requires small learning rates
    → slow convergence

With BatchNorm:
    Normalize each mini-batch:  x̂ = (x - μ_batch) / σ_batch
    Then scale and shift:       y = γx̂ + β   (γ, β are learned)
    → Stable distributions
    → Can use larger learning rates
    → Faster convergence

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),   # normalize after conv
            nn.ReLU(),
        )

    def forward(self, x):
        return self.block(x)

Layer Normalization (used in Transformers instead of BatchNorm):
- BatchNorm: Normalize across the batch dimension
- LayerNorm: Normalize across the feature dimension (per sample)
- LayerNorm works with variable-length sequences and doesn't depend on batch size
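A small sketch contrasting the two normalization axes (the batch and feature sizes are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(32, 64)            # (batch=32, features=64)

bn = nn.BatchNorm1d(64)            # statistics per feature, computed across the 32 samples
ln = nn.LayerNorm(64)              # statistics per sample, computed across the 64 features

print(bn(x).shape, ln(x).shape)    # both keep the (32, 64) shape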
# Transformer uses LayerNorm
self.norm = nn.LayerNorm(d_model) # normalize across hidden dimension
output = self.norm(attention_output + residual)   # residual + normalize

7. Training Pipeline Overview
1. DATA
Raw data → Clean → Split (train/val/test) → DataLoader (batches)
2. MODEL
Define architecture → Initialize weights → Move to GPU
3. TRAINING LOOP
for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        predictions = model(inputs)          # forward
        loss = loss_fn(predictions, labels)
        loss.backward()                      # gradients
        optimizer.step()                     # update weights
        optimizer.zero_grad()

    # Evaluate on validation set
    val_loss = evaluate(model, val_loader)
    scheduler.step()                         # update learning rate
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint(model)               # save best model
4. EVALUATION
Load best checkpoint → Evaluate on test set → Report metrics

Key hyperparameters:
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Learning rate | 1e-5 to 1e-2 | Most important: too high diverges, too low stalls |
| Batch size | 16-512 | Larger = smoother gradients, more memory |
| Epochs | 3-100+ | Task dependent |
| Dropout | 0.1-0.5 | Higher = more regularization |
| Weight decay | 1e-4 to 0.1 | Higher = smaller weights |
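Putting the pipeline together, here is a compact sketch of a full loop on synthetic data (the dataset, model sizes, and hyperparameters are all illustrative, reusing the ideas from the sections above):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 1000 samples, 784 features, 10 classes
X = torch.randn(1000, 784)
y = torch.randint(0, 10, (1000,))
train_ds, val_ds = TensorDataset(X[:800], y[:800]), TensorDataset(X[800:], y[800:])
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=64)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
loss_fn = nn.CrossEntropyLoss()

best_val_loss = float("inf")
for epoch in range(10):
    model.train()
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        loss = loss_fn(model(inputs), labels)    # forward + loss
        loss.backward()                          # gradients
        optimizer.step()                         # update weights
        optimizer.zero_grad()

    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(i.to(device)), l.to(device)).item()
                       for i, l in val_loader) / len(val_loader)
    scheduler.step()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_model.pt")   # checkpoint the best model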