
11 - ML Fundamentals


1. Neural Network Basics

What: A neural network is a function that maps inputs to outputs through layers of connected "neurons" (linear transformations + non-linear activations).

Input Layer      Hidden Layer 1    Hidden Layer 2    Output Layer
   (3)               (4)               (4)              (2)

  x₁ ─────┐    ┌── h₁ ──┐    ┌── h₁ ──┐    ┌── y₁
          ├──→│         ├──→│         ├──→│
  x₂ ─────┤    ├── h₂ ──┤    ├── h₂ ──┤    ├── y₂
          ├──→│         ├──→│         ├──→│
  x₃ ─────┘    ├── h₃ ──┤    ├── h₃ ──┤    └────
               └── h₄ ──┘    └── h₄ ──┘

Each neuron computes:

output = activation(W · input + b)

W = weight matrix (learned)
b = bias vector (learned)
activation = non-linear function (ReLU, sigmoid, etc.)
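
A minimal sketch of one layer's computation, with W, b, and x chosen randomly purely for illustration:

python
import torch

W = torch.randn(4, 3)              # weight matrix: 4 neurons, 3 inputs (as in the diagram)
b = torch.randn(4)                 # bias vector: one entry per neuron
x = torch.tensor([1.0, 2.0, 3.0])  # input

h = torch.relu(W @ x + b)          # activation(W · input + b)
print(h.shape)                     # torch.Size([4]), one output per neuron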

Common activations:

| Function | Formula | Range | Used For |
| --- | --- | --- | --- |
| ReLU | max(0, x) | [0, ∞) | Hidden layers (default) |
| Sigmoid | 1/(1+e^-x) | (0, 1) | Binary classification output |
| Tanh | (e^x - e^-x)/(e^x + e^-x) | (-1, 1) | Hidden layers (older) |
| Softmax | e^(x_i) / Σ e^(x_j) | (0, 1), sums to 1 | Multi-class output |
| GELU | x · Φ(x) | (-0.17, ∞) | Transformers |
python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256),   # input → hidden
            nn.ReLU(),
            nn.Linear(256, 128),   # hidden → hidden
            nn.ReLU(),
            nn.Linear(128, 10),    # hidden → output
        )

    def forward(self, x):
        return self.layers(x)     # 10 logits (one per class)
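
A quick usage sketch; the 784-dimensional input assumes flattened 28×28 images, an illustrative choice not stated above:

python
model = SimpleNet()
batch = torch.randn(32, 784)            # 32 flattened 28×28 images
logits = model(batch)                   # shape: (32, 10)
probs = torch.softmax(logits, dim=1)    # rows sum to 1 across the 10 classes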

2. Backpropagation

What: The algorithm for training neural networks. Computes how much each weight contributed to the error, then adjusts weights to reduce error.

Forward pass → compute loss → backward pass → update weights:

Forward pass:
  Input → Layer 1 → Layer 2 → ... → Output → Loss

Backward pass (chain rule):
  ∂Loss/∂W₃ = ∂Loss/∂output × ∂output/∂W₃
  ∂Loss/∂W₂ = ∂Loss/∂output × ∂output/∂h₂ × ∂h₂/∂W₂
  ∂Loss/∂W₁ = ∂Loss/∂output × ∂output/∂h₂ × ∂h₂/∂h₁ × ∂h₁/∂W₁

Update:
  W = W - learning_rate × ∂Loss/∂W
python
# PyTorch handles this automatically
model = SimpleNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

# Training step
predictions = model(input_batch)         # forward pass
loss = loss_fn(predictions, labels)      # compute loss
loss.backward()                          # backward pass (compute gradients)
optimizer.step()                         # update weights
optimizer.zero_grad()                    # reset gradients for next step

Vanishing gradient problem: In deep networks, gradients can shrink exponentially as they propagate backward through many layers, making early layers learn very slowly. Solutions: ReLU activation, residual connections, batch normalization.
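
As one illustration of residual connections, a minimal sketch of a residual block (the class name and sizes are illustrative):

python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        # x + f(x): the identity path passes gradients straight through,
        # so early layers still receive a usable signal in deep stacks
        return x + self.fc2(torch.relu(self.fc1(x)))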


3. Loss Functions

What: A function that measures how far the model's predictions are from the true labels. The goal of training is to minimize this.

| Loss | Formula | Used For |
| --- | --- | --- |
| MSE | mean((y - ŷ)²) | Regression |
| Cross-Entropy | -Σ y·log(ŷ) | Classification |
| Binary Cross-Entropy | -[y·log(ŷ) + (1-y)·log(1-ŷ)] | Binary classification |
| Huber | MSE when small, MAE when large | Robust regression |
| Contrastive | Pull similar pairs close, push dissimilar apart | Embedding learning |
python
# Classification (most common for LLMs)
loss = nn.CrossEntropyLoss()
# Combines LogSoftmax + NLLLoss
# Input: raw logits (batch, num_classes)
# Target: class indices (batch,)

logits = model(input)                    # (32, 10) — 32 samples, 10 classes
labels = torch.tensor([3, 7, 1, ...])   # (32,) — true class for each sample
loss_value = loss(logits, labels)        # scalar
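
For the regression and binary rows of the table, a hedged sketch (shapes and values are illustrative):

python
import torch
import torch.nn as nn

# Regression: MSE
mse = nn.MSELoss()
pred, target = torch.randn(32, 1), torch.randn(32, 1)
print(mse(pred, target))                    # scalar loss

# Binary classification: BCEWithLogitsLoss fuses sigmoid + BCE for stability
bce = nn.BCEWithLogitsLoss()
logit = torch.randn(32)                     # one raw logit per sample
label = torch.randint(0, 2, (32,)).float()  # 0/1 targets as floats
print(bce(logit, label))                    # scalar loss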

4. Optimizers

What: Algorithms that update model weights based on gradients. Different optimizers use different strategies for step size and direction.

| Optimizer | Key Idea | When to Use |
| --- | --- | --- |
| SGD | Basic gradient descent with optional momentum | Simple problems, well-tuned LR |
| Adam | Adaptive learning rate per parameter | Default choice for most tasks |
| AdamW | Adam + decoupled weight decay | Transformers (standard) |
| Adafactor | Memory-efficient Adam variant | Very large models |
python
# Adam — the default choice
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# AdamW — standard for transformers
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

Learning rate schedule: Start with higher LR, decrease over training.

python
# Cosine annealing โ€” smoothly decrease LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Warmup + decay โ€” standard for transformer training
# Linear warmup for first N steps, then cosine decay
Learning Rate
  │
  │    /╲
  │   /  ╲
  │  /    ╲
  │ /      ╲────────────
  │/                    ╲
  └─────────────────────────→ Steps
  warmup    peak    decay
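
One way to express warmup + cosine decay is LambdaLR with a custom multiplier; a minimal sketch with illustrative step counts, where scheduler.step() is called once per optimizer step rather than per epoch:

python
import math
import torch

warmup_steps, total_steps = 100, 1000   # illustrative values

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # linear warmup: 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay: 1 to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)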

5. Overfitting and Regularization

What: Overfitting = model memorizes training data but fails on new data. Regularization = techniques to prevent this.

Loss
  │
  │ ╲ ╲
  │  ╲ ╲         Validation loss
  │   ╲ ╲──────╱  ← overfitting starts here
  │    ╲
  │     ╲─────────────  Training loss
  │
  └──────────────────────→ Epochs
           ↑
    optimal stopping point

Regularization techniques:

| Technique | How | Effect |
| --- | --- | --- |
| Dropout | Randomly zero out neurons during training | Prevents co-adaptation |
| Weight decay (L2) | Add ‖W‖² penalty to loss | Keeps weights small |
| L1 regularization | Add \|W\| penalty to loss | Encourages sparsity |
| Early stopping | Stop training when validation loss increases | Prevents over-training |
| Data augmentation | Transform training data (flip, rotate, etc.) | More diverse training |
| Batch normalization | Normalize layer inputs | Stabilizes training |
python
class RegularizedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Dropout(0.3),      # 30% dropout
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.layers(x)
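
Early stopping from the table above, as a minimal sketch; train_one_epoch and evaluate are hypothetical helpers, and the patience value is illustrative:

python
best_val_loss, patience, bad_epochs = float("inf"), 3, 0

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)     # hypothetical helper
    val_loss = evaluate(model, val_loader)   # hypothetical helper
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # no improvement for `patience` epochs
            break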

6. Batch Normalization

What: Normalizes the input to each layer to have zero mean and unit variance. Dramatically stabilizes and accelerates training.

Without BatchNorm:
  Layer output values can shift wildly during training
  → "internal covariate shift"
  → requires small learning rates
  → slow convergence

With BatchNorm:
  Normalize each mini-batch: x̂ = (x - μ_batch) / σ_batch
  Then scale and shift: y = γx̂ + β  (γ, β are learned)
  → Stable distributions
  → Can use larger learning rates
  → Faster convergence
python
class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),    # normalize after conv
            nn.ReLU(),
        )

Layer Normalization (used in Transformers instead of BatchNorm):

  • BatchNorm: Normalize across the batch dimension
  • LayerNorm: Normalize across the feature dimension (per sample)
  • LayerNorm works with variable-length sequences and doesn't depend on batch size
python
# Transformer uses LayerNorm
self.norm = nn.LayerNorm(d_model)  # normalize across hidden dimension
output = self.norm(attention_output + residual)  # residual + normalize
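
A small sketch contrasting the two on the same tensor (shapes are illustrative):

python
import torch
import torch.nn as nn

x = torch.randn(32, 64)        # (batch, features)

bn = nn.BatchNorm1d(64)        # statistics per feature, across the batch
ln = nn.LayerNorm(64)          # statistics per sample, across features

print(bn(x).mean(dim=0))       # ~0 for every feature column
print(ln(x).mean(dim=1))       # ~0 for every sample row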

7. Training Pipeline Overview

1. DATA
   Raw data → Clean → Split (train/val/test) → DataLoader (batches)

2. MODEL
   Define architecture → Initialize weights → Move to GPU

3. TRAINING LOOP
   for epoch in range(num_epochs):
       for inputs, labels in train_loader:
           predictions = model(inputs)     # forward
           loss = loss_fn(predictions, labels)
           loss.backward()                 # gradients
           optimizer.step()                # update weights
           optimizer.zero_grad()

       # Evaluate on validation set
       val_loss = evaluate(model, val_loader)
       scheduler.step()                    # update learning rate
       if val_loss < best_val_loss:
           save_checkpoint(model)          # save best model

4. EVALUATION
   Load best checkpoint → Evaluate on test set → Report metrics
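
The evaluate helper in step 3 is not defined above; one plausible minimal shape, assuming loss_fn from earlier and loaders that yield (inputs, labels) pairs:

python
import torch

def evaluate(model, loader):
    model.eval()                          # disable dropout, use running BN stats
    total, n = 0.0, 0
    with torch.no_grad():                 # gradients are not needed for evaluation
        for inputs, labels in loader:
            total += loss_fn(model(inputs), labels).item()
            n += 1
    model.train()                         # restore training mode
    return total / max(1, n)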

Key hyperparameters:

| Hyperparameter | Typical Range | Impact |
| --- | --- | --- |
| Learning rate | 1e-5 to 1e-2 | Most important — too high diverges, too low stalls |
| Batch size | 16-512 | Larger = smoother gradients, more memory |
| Epochs | 3-100+ | Task dependent |
| Dropout | 0.1-0.5 | Higher = more regularization |
| Weight decay | 1e-4 to 0.1 | Higher = smaller weights |
