11 - ML Fundamentals
1. Neural Network Basics
What: A neural network is a function that maps inputs to outputs through layers of connected "neurons" (linear transformations + non-linear activations).
Input Layer    Hidden Layer 1    Hidden Layer 2    Output Layer
    (3)             (4)               (4)              (2)

    x1              h1                h1
    x2      →       h2       →        h2       →        y1
    x3              h3                h3                 y2
                    h4                h4

(every neuron in one layer is connected to every neuron in the next)

Each neuron computes:
output = activation(W · input + b)

    W          = weight matrix (learned)
    b          = bias vector (learned)
    activation = non-linear function (ReLU, sigmoid, etc.)

Common activations:
| Function | Formula | Range | Used For |
|---|---|---|---|
| ReLU | max(0, x) | [0, inf) | Hidden layers (default) |
| Sigmoid | 1/(1+e^-x) | (0, 1) | Binary classification output |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | (-1, 1) | Hidden layers (older) |
| Softmax | e^(x_i) / Σ e^(x_j) | (0, 1), sums to 1 | Multi-class output |
| GELU | x · Φ(x) | (-0.17, inf) | Transformers |
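To make the neuron formula and the table above concrete, here is a small sketch (the input values are arbitrary) that applies one linear layer and a few activations:

import torch
import torch.nn as nn

x = torch.tensor([0.5, -1.0, 2.0])     # one sample with 3 input features
layer = nn.Linear(3, 4)                # W is 4x3, b has 4 entries (both learned)
z = layer(x)                           # z = W · x + b (pre-activation)

print(torch.relu(z))                   # negatives clipped to 0
print(torch.sigmoid(z))                # squashed into (0, 1)
print(torch.softmax(z, dim=0))         # non-negative and sums to 1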
import torch
import torch.nn as nn
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256),   # input → hidden
            nn.ReLU(),
            nn.Linear(256, 128),   # hidden → hidden
            nn.ReLU(),
            nn.Linear(128, 10),    # hidden → output
        )

    def forward(self, x):
        return self.layers(x)      # 10 logits (one per class)

2. Backpropagation
What: The algorithm for training neural networks. Computes how much each weight contributed to the error, then adjusts weights to reduce error.
Forward pass → compute loss → backward pass → update weights:

Forward pass:
    Input → Layer 1 → Layer 2 → ... → Output → Loss

Backward pass (chain rule):
    ∂Loss/∂W₃ = ∂Loss/∂output × ∂output/∂W₃
    ∂Loss/∂W₂ = ∂Loss/∂output × ∂output/∂h₂ × ∂h₂/∂W₂
    ∂Loss/∂W₁ = ∂Loss/∂output × ∂output/∂h₂ × ∂h₂/∂h₁ × ∂h₁/∂W₁

Update:
    W = W - learning_rate × ∂Loss/∂W

# PyTorch handles this automatically
model = SimpleNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
# Training step
predictions = model(input_batch) # forward pass
loss = loss_fn(predictions, labels) # compute loss
loss.backward() # backward pass (compute gradients)
optimizer.step() # update weights
optimizer.zero_grad()                  # reset gradients for next step

Vanishing gradient problem: In deep networks, gradients can shrink exponentially as they propagate backward through many layers, making early layers learn very slowly. Solutions: ReLU activation, residual connections, batch normalization.
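One of those fixes, the residual connection, can be sketched as a block that adds its input back onto its output, giving gradients a shortcut path around the layers (a minimal sketch, not tied to any particular architecture):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = self.fc2(h)
        return torch.relu(x + h)   # the "+ x" gives gradients a direct path backward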
3. Loss Functions
What: A function that measures how far the model's predictions are from the true labels. The goal of training is to minimize this.
| Loss | Formula | Used For |
|---|---|---|
| MSE | mean((y - ŷ)²) | Regression |
| Cross-Entropy | -Σ y·log(ŷ) | Classification |
| Binary Cross-Entropy | -[y·log(ŷ) + (1-y)·log(1-ŷ)] | Binary classification |
| Huber | MSE when small, MAE when large | Robust regression |
| Contrastive | Pull similar pairs close, push dissimilar apart | Embedding learning |
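The classification example below is the common case for LLMs; for regression, MSE and Huber from the table can be sketched like this (the tensors are made-up values):

import torch
import torch.nn as nn

preds   = torch.tensor([2.5, 0.0, 2.1])
targets = torch.tensor([3.0, -0.5, 2.0])

mse   = nn.MSELoss()(preds, targets)     # mean squared error
huber = nn.HuberLoss()(preds, targets)   # quadratic for small errors, linear for large ones
print(mse.item(), huber.item())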
# Classification (most common for LLMs)
loss = nn.CrossEntropyLoss()
# Combines LogSoftmax + NLLLoss
# Input: raw logits (batch, num_classes)
# Target: class indices (batch,)
logits = model(input)                  # (32, 10) → 32 samples, 10 classes
labels = torch.tensor([3, 7, 1, ...])  # (32,)    → true class for each sample
loss_value = loss(logits, labels)      # scalar

4. Optimizers
What: Algorithms that update model weights based on gradients. Different optimizers use different strategies for step size and direction.
| Optimizer | Key Idea | When to Use |
|---|---|---|
| SGD | Basic gradient descent with optional momentum | Simple problems, well-tuned LR |
| Adam | Adaptive learning rate per parameter | Default choice for most tasks |
| AdamW | Adam + decoupled weight decay | Transformers (standard) |
| Adafactor | Memory-efficient Adam variant | Very large models |
# Adam - the default choice
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# AdamW - standard for transformers
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

Learning rate schedule: Start with a higher LR and decrease it over training.
# Cosine annealing - smoothly decrease LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Warmup + decay - standard for transformer training
# Linear warmup for first N steps, then cosine decay

Learning Rate
    │      /\
    │     /  \
    │    /    \
    │   /      \_____________
    │  /
    └───────────────────────────── Steps
      warmup   peak    decay

5. Overfitting and Regularization
What: Overfitting = model memorizes training data but fails on new data. Regularization = techniques to prevent this.
Loss
    │\
    │ \\
    │  \ \            Validation loss
    │   \  \______          _________
    │    \        \________/   ← overfitting starts here
    │     \
    │      \____________________________  Training loss
    │
    └──────────────────┬──────────────── Epochs
                       │
              optimal stopping point

Regularization techniques:
| Technique | How | Effect |
|---|---|---|
| Dropout | Randomly zero out neurons during training | Prevents co-adaptation |
| Weight decay (L2) | Add \|W\|² penalty to loss | Keeps weights small |
| L1 regularization | Add \|W\| penalty to loss | Encourages sparsity |
| Early stopping | Stop training when validation loss increases | Prevents over-training |
| Data augmentation | Transform training data (flip, rotate, etc.) | More diverse training |
| Batch normalization | Normalize layer inputs | Stabilizes training |
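Dropout is shown in the model below; early stopping from the same table can be sketched as a simple check on validation loss (the patience value and the train_one_epoch/evaluate/save_checkpoint helpers are placeholders, matching the pseudocode style used in section 7):

best_val_loss = float("inf")
patience, bad_epochs = 5, 0                     # illustrative patience

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)        # placeholder helper
    val_loss = evaluate(model, val_loader)      # placeholder helper

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        save_checkpoint(model)                  # keep the best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                               # validation loss stopped improving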
class RegularizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Dropout(0.3),      # 30% dropout
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.layers(x)

6. Batch Normalization
What: Normalizes the input to each layer to have zero mean and unit variance. Dramatically stabilizes and accelerates training.
Without BatchNorm:
    Layer output values can shift wildly during training
    → "internal covariate shift"
    → requires small learning rates
    → slow convergence

With BatchNorm:
    Normalize each mini-batch:  x̂ = (x - μ_batch) / σ_batch
    Then scale and shift:       y = γx̂ + β   (γ, β are learned)
    → Stable distributions
    → Can use larger learning rates
    → Faster convergence

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),   # normalize after conv
            nn.ReLU(),
        )

    def forward(self, x):
        return self.block(x)

Layer Normalization (used in Transformers instead of BatchNorm):
- BatchNorm: Normalize across the batch dimension
- LayerNorm: Normalize across the feature dimension (per sample)
- LayerNorm works with variable-length sequences and doesn't depend on batch size
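A small sketch contrasting the two normalization axes (the batch and feature sizes are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(32, 64)            # (batch=32, features=64)

bn = nn.BatchNorm1d(64)            # statistics per feature, computed across the 32 samples
ln = nn.LayerNorm(64)              # statistics per sample, computed across the 64 features

print(bn(x).shape, ln(x).shape)    # both keep the (32, 64) shape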
# Transformer uses LayerNorm
self.norm = nn.LayerNorm(d_model) # normalize across hidden dimension
output = self.norm(attention_output + residual)   # residual + normalize

7. Training Pipeline Overview
1. DATA
Raw data → Clean → Split (train/val/test) → DataLoader (batches)
2. MODEL
Define architecture → Initialize weights → Move to GPU
3. TRAINING LOOP
for epoch in range(num_epochs):
    for inputs, labels in train_loader:
        predictions = model(inputs)          # forward
        loss = loss_fn(predictions, labels)
        loss.backward()                      # gradients
        optimizer.step()                     # update weights
        optimizer.zero_grad()

    # Evaluate on validation set
    val_loss = evaluate(model, val_loader)
    scheduler.step()                         # update learning rate
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint(model)               # save best model
4. EVALUATION
Load best checkpoint → Evaluate on test set → Report metrics

Key hyperparameters:
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| Learning rate | 1e-5 to 1e-2 | Most important: too high diverges, too low stalls |
| Batch size | 16-512 | Larger = smoother gradients, more memory |
| Epochs | 3-100+ | Task dependent |
| Dropout | 0.1-0.5 | Higher = more regularization |
| Weight decay | 1e-4 to 0.1 | Higher = smaller weights |
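Putting the pipeline together, here is a compact sketch of a full loop on synthetic data (the dataset, model sizes, and hyperparameters are all illustrative, reusing the ideas from the sections above):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 1000 samples, 784 features, 10 classes
X = torch.randn(1000, 784)
y = torch.randint(0, 10, (1000,))
train_ds, val_ds = TensorDataset(X[:800], y[:800]), TensorDataset(X[800:], y[800:])
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=64)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
loss_fn = nn.CrossEntropyLoss()

best_val_loss = float("inf")
for epoch in range(10):
    model.train()
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        loss = loss_fn(model(inputs), labels)    # forward + loss
        loss.backward()                          # gradients
        optimizer.step()                         # update weights
        optimizer.zero_grad()

    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(i.to(device)), l.to(device)).item()
                       for i, l in val_loader) / len(val_loader)
    scheduler.step()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_model.pt")   # checkpoint the best model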