May 18, 2026

The most complete, interview-focused ML/AI reference on GitHub.

Built for cracking interviews at Google DeepMind, OpenAI, Meta AI, Amazon, and Microsoft.

📌 What's Inside

Section	Content
🧠 ML Fundamentals	Bias-variance, overfitting, regularization, loss functions
🔢 Math & Statistics	Linear algebra, probability, calculus for ML
🤖 Deep Learning	CNNs, RNNs, transformers, attention, training tricks
🗣️ NLP	BERT, GPT, RAG, embeddings, tokenization
👁️ Computer Vision	YOLO, ResNet, image augmentation, segmentation
🏗️ ML System Design	Recommendation systems, search, fraud detection
🐍 Python & Libraries	NumPy, Pandas, scikit-learn, PyTorch one-liners
🧩 Coding Patterns	Data preprocessing, model eval, cross-validation
💼 Behavioral	STAR answers, research discussion, project walkthrough

🧠 ML Fundamentals

Q: Explain the bias-variance tradeoff.

Bias = error from wrong assumptions (underfitting — model too simple). Variance = error from sensitivity to training data fluctuations (overfitting — model too complex).

Total Error = Bias² + Variance + Irreducible Noise

	High Bias	High Variance
Training error	High	Low
Test error	High	High
Fix	More features, complex model	Regularization, more data, dropout

Interview tip: Draw the U-shaped test error curve. Explain that the goal is to find the sweet spot.
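The decomposition can be checked numerically. The sketch below is only an illustration under stated assumptions: the true function `sin(2πx)`, the noise level, the query point `x0`, and the two deliberately extreme models (`global_mean`, `nearest_neighbour`) are all hypothetical choices, not part of the original answer. It estimates bias² and variance at one point over many resampled training sets and compares their sum (plus noise) with the measured test error.

```python
import math
import random

def f(x):
    """True function (an assumption for this demo)."""
    return math.sin(2 * math.pi * x)

def simulate(predict, x0=0.25, n_points=20, sigma=0.3, n_trials=5000, seed=0):
    """Estimate bias^2, variance and test MSE of `predict` at a single point x0."""
    rng = random.Random(seed)
    xs = [i / (n_points - 1) for i in range(n_points)]
    preds, sq_errors = [], []
    for _ in range(n_trials):
        ys = [f(x) + rng.gauss(0, sigma) for x in xs]   # fresh noisy training set
        y_hat = predict(xs, ys, x0)
        y_new = f(x0) + rng.gauss(0, sigma)             # fresh noisy test label
        preds.append(y_hat)
        sq_errors.append((y_hat - y_new) ** 2)
    mean_pred = sum(preds) / n_trials
    bias_sq = (mean_pred - f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_trials
    mse = sum(sq_errors) / n_trials
    return bias_sq, variance, mse

def global_mean(xs, ys, x0):
    """Too simple: ignores x entirely (high bias, low variance)."""
    return sum(ys) / len(ys)

def nearest_neighbour(xs, ys, x0):
    """Too flexible: copies the single closest noisy label (low bias, high variance)."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

SIGMA = 0.3
for name, model in [("global mean", global_mean), ("1-NN", nearest_neighbour)]:
    b2, var, mse = simulate(model, sigma=SIGMA)
    print(f"{name:12s} bias^2={b2:.3f} var={var:.3f} noise={SIGMA**2:.3f} mse={mse:.3f}")
```

In both rows the measured MSE should sit close to bias² + variance + noise; which term dominates is what differs between the two models.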

Q: What is regularization? Compare L1 vs L2.

Regularization adds a penalty to the loss function to prevent overfitting.

	L1 (Lasso)	L2 (Ridge)
Penalty	λΣ|wᵢ|	λΣwᵢ²
Effect	Produces sparse weights (zeros out features)	Shrinks weights toward zero, keeps all
Use when	Feature selection needed	All features relevant
Gradient	λ·sign(wᵢ), constant magnitude	2λwᵢ, proportional to the weight

from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1)   # L1
ridge = Ridge(alpha=1.0)   # L2
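To see why L1 produces exact zeros while L2 only shrinks, it helps to look at a single weight in isolation. The sketch below assumes the simplest possible setting (one weight, a quadratic data term centred on the unregularized value `w`), where both penalties have closed-form solutions. The function names and the example weights are illustrative, not scikit-learn internals.

```python
def soft_threshold(w, lam):
    """Minimizer of 0.5*(v - w)**2 + lam*|v|: shrink by lam, clip at zero."""
    if abs(w) <= lam:
        return 0.0
    return w - lam if w > 0 else w + lam

def ridge_shrink(w, lam):
    """Minimizer of 0.5*(v - w)**2 + lam*v**2: rescale by 1 / (1 + 2*lam)."""
    return w / (1.0 + 2.0 * lam)

ols_weights = [3.0, -0.5, 0.2, -2.0]   # hypothetical unregularized weights
lam = 1.0

l1 = [soft_threshold(w, lam) for w in ols_weights]
l2 = [ridge_shrink(w, lam) for w in ols_weights]
print("L1:", l1)   # small weights become exactly 0.0
print("L2:", l2)   # every weight is scaled down, none reaches 0
```

The L1 solution subtracts a fixed amount from every weight, so anything smaller than λ is zeroed out; the L2 solution multiplies by a constant factor, so a nonzero weight stays nonzero.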

Q: How do you handle imbalanced datasets?

Resampling:

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample minority class
X_res, y_res = SMOTE().fit_resample(X, y)

# Undersample majority class
X_res, y_res = RandomUnderSampler().fit_resample(X, y)

Class weights:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(class_weight='balanced')
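As a rough guide to what `class_weight='balanced'` does, scikit-learn documents the heuristic as `n_samples / (n_classes * count_per_class)`. The helper below reproduces that formula in plain Python; the function name and the 90/10 label split are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """weight[c] = n_samples / (n_classes * count[c]), the 'balanced' heuristic."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

y = [0] * 90 + [1] * 10          # hypothetical 90/10 label split
weights = balanced_class_weights(y)
print(weights)
```

With these weights each class contributes the same total weight to the loss (90 × 0.556 ≈ 10 × 5.0), which is the point of the heuristic.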

Metrics: Avoid accuracy. Use:

Precision, Recall, F1-score
ROC-AUC
PR-AUC (better for severe imbalance)

Q: Explain cross-validation. Why use k-fold?

K-fold CV splits data into k subsets. Train on k-1, test on 1. Repeat k times. Average the scores.

from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")

Why k-fold? Reduces variance in evaluation vs a single train/test split. Stratified k-fold preserves class distribution in each fold.
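If asked to implement the split by hand, something like the following is usually enough. It is a minimal, non-stratified sketch; `k_fold_indices` is a hypothetical helper, not scikit-learn's implementation.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=3)):
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```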

Q: What is gradient descent? Compare SGD, Mini-batch, Adam.

Gradient descent minimizes loss by updating parameters in the direction of steepest descent:

θ = θ - α · ∇L(θ)

	Batch GD	SGD	Mini-batch
Update per	Full dataset	1 sample	Batch (32-256)
Speed	Slow	Fast	Fast
Noise	Low	High	Medium
Memory	High	Low	Medium

Adam (Adaptive Moment Estimation) — most popular optimizer:

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

Adam combines momentum + RMSProp. Adapts learning rate per parameter.
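The "momentum + RMSProp" description maps onto two running averages. The scalar sketch below follows the standard Adam update with bias correction; the toy loss `(θ - 3)²`, the learning rate, and the step count are arbitrary choices for the demo, and `adam_minimize` is not a PyTorch function.

```python
import math

def adam_minimize(grad, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Scalar Adam: momentum (m) plus RMSProp-style scaling (v), with bias correction."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # second moment: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # correct the bias from zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy loss L(theta) = (theta - 3)^2, so grad L = 2 * (theta - 3)
theta = adam_minimize(lambda th: 2 * (th - 3), theta=0.0)
print(f"theta = {theta:.4f}")
```

Dividing by `sqrt(v_hat)` is what makes the step size per-parameter: a parameter with consistently large gradients takes proportionally smaller steps.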

🔢 Math & Statistics

Q: What is the dot product and why does it matter in ML?

a · b = Σ aᵢbᵢ = |a||b|cos(θ)

In ML, the dot product measures similarity (cosine similarity between embeddings) and is the core of every linear layer:

output = X @ W.T + b  # Matrix multiplication = stacked dot products

Q: Explain PCA intuitively and mathematically.

PCA finds directions (principal components) of maximum variance in data.

Steps:

Center data: X_centered = X - mean(X)
Compute covariance matrix: C = (X_centered.T @ X_centered) / (n-1)
Eigendecompose: C = V Λ Vᵀ
Project: X_pca = X_centered @ V[:, :k]

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")
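The four steps listed above translate almost line for line into NumPy. This is a sketch for intuition, assuming a small dense matrix; the random data and the `pca` helper are illustrative, and library implementations typically use SVD instead of an explicit covariance matrix.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                       # 1. center
    C = (X_centered.T @ X_centered) / (X.shape[0] - 1)    # 2. covariance
    eigvals, eigvecs = np.linalg.eigh(C)                  # 3. eigendecompose (C is symmetric)
    order = np.argsort(eigvals)[::-1]                     # eigh returns ascending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    X_pca = X_centered @ eigvecs[:, :k]                   # 4. project
    return X_pca, eigvals

rng = np.random.default_rng(0)
# Hypothetical data: 3 features with very different spreads
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])
X_pca, eigvals = pca(X, k=2)
print(X_pca.shape)
print(eigvals / eigvals.sum())   # fraction of variance per component
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalue ratios are the "variance explained".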

Q: What is the difference between MLE and MAP?

MLE (Maximum Likelihood Estimation):

θ_MLE = argmax P(data | θ)

Finds parameters that make observed data most probable. No prior assumption.

MAP (Maximum A Posteriori):

θ_MAP = argmax P(θ | data) = argmax P(data | θ) · P(θ)

Incorporates prior belief about θ. MAP with Gaussian prior = L2 regularization. MAP with Laplace prior = L1 regularization.
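A coin-flip example makes the difference concrete. The sketch assumes a Bernoulli likelihood with a Beta(a, b) prior, for which both estimates have closed forms; the counts and prior parameters are hypothetical.

```python
def bernoulli_mle(heads, n):
    """argmax of P(data | theta) for n coin flips."""
    return heads / n

def bernoulli_map(heads, n, a, b):
    """Mode of the Beta(heads + a, n - heads + b) posterior.

    Valid when heads + a > 1 and n - heads + b > 1.
    """
    return (heads + a - 1) / (n + a + b - 2)

heads, n = 7, 10   # hypothetical observations
print(bernoulli_mle(heads, n))                 # data only
print(bernoulli_map(heads, n, a=2, b=2))       # pulled toward the prior mean 0.5
print(bernoulli_map(700, 1000, a=2, b=2))      # with more data the prior matters less
```

A flat Beta(1, 1) prior gives back the MLE, and the prior's pull fades as `n` grows, which mirrors how a regularization penalty matters less with more data.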

🤖 Deep Learning

Q: How does backpropagation work?

Backprop computes gradients of the loss with respect to all parameters using the chain rule.

Forward: x → [L1] → h → [L2] → ŷ → loss
Backward: ∂loss/∂W₂ → ∂loss/∂h → ∂loss/∂W₁

loss = criterion(output, target)
loss.backward()       # Compute all gradients
optimizer.step()      # Update weights: W = W - lr * W.grad
optimizer.zero_grad() # Clear for next batch

Chain rule: ∂L/∂W₁ = (∂L/∂ŷ) · (∂ŷ/∂h) · (∂h/∂W₁)
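The chain rule above can be traced by hand on a network small enough to fit in a few lines. The sketch assumes a scalar two-layer network with a tanh hidden unit and squared-error loss (none of which is specified in the original), and checks the analytic gradients against finite differences.

```python
import math

def forward_backward(x, y, w1, w2):
    """One scalar 'network': h = tanh(w1*x), y_hat = w2*h, loss = (y_hat - y)^2."""
    # Forward pass (keep intermediates for the backward pass)
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    # Backward pass: apply the chain rule from the loss back to each weight
    dL_dyhat = 2 * (y_hat - y)
    dL_dw2 = dL_dyhat * h                 # dy_hat/dw2 = h
    dL_dh = dL_dyhat * w2                 # dy_hat/dh  = w2
    dL_dw1 = dL_dh * (1 - h ** 2) * x     # dh/dw1 = (1 - tanh^2) * x
    return loss, dL_dw1, dL_dw2

def numeric_grad(fn, w, eps=1e-6):
    """Central finite difference, used only to sanity-check the analytic gradient."""
    return (fn(w + eps) - fn(w - eps)) / (2 * eps)

x, y, w1, w2 = 0.5, 1.0, 0.8, -1.2   # hypothetical values
loss, g1, g2 = forward_backward(x, y, w1, w2)
n1 = numeric_grad(lambda w: forward_backward(x, y, w, w2)[0], w1)
n2 = numeric_grad(lambda w: forward_backward(x, y, w1, w)[0], w2)
print(f"dL/dw1: analytic={g1:.6f} numeric={n1:.6f}")
print(f"dL/dw2: analytic={g2:.6f} numeric={n2:.6f}")
```

Note that `dL_dh` is computed once and reused for `dL_dw1`; reusing upstream gradients this way is what makes backprop linear in the number of layers.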

Q: Explain the vanishing gradient problem and solutions.

In deep networks, gradients shrink exponentially as they propagate backward through many sigmoid/tanh layers → early layers learn very slowly.

Solutions:

Fix	How
ReLU activation	Gradient = 1 for positive inputs (no squashing)
Batch Normalization	Normalizes activations, keeps gradients stable
Residual connections	Gradient flows directly via skip connections
LSTM/GRU	Gating mechanisms preserve long-range gradients
Gradient clipping	Caps the gradient norm; this targets exploding rather than vanishing gradients: `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)`
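The exponential shrinkage is easy to reproduce. The sketch below assumes a toy stack of scalar layers with all weights fixed at 1, so the gradient is just the product of activation derivatives; `gradient_through_stack` is a hypothetical helper for this demo.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_through_stack(x, depth, activation):
    """d(output)/d(input) for `depth` scalar layers a = act(a), all weights fixed at 1."""
    a, grad = x, 1.0
    for _ in range(depth):
        if activation == "sigmoid":
            a = sigmoid(a)
            grad *= a * (1.0 - a)          # sigmoid'(z) = s(z) * (1 - s(z)) <= 0.25
        elif activation == "relu":
            grad *= 1.0 if a > 0 else 0.0  # relu'(z) is 1 for positive inputs
            a = max(a, 0.0)
        else:
            raise ValueError(f"unknown activation: {activation}")
    return grad

for depth in (1, 5, 10, 20):
    s = gradient_through_stack(0.5, depth, "sigmoid")
    r = gradient_through_stack(0.5, depth, "relu")
    print(f"depth={depth:2d} sigmoid={s:.2e} relu={r:.1f}")
```

Because the sigmoid derivative never exceeds 0.25, ten layers already bound the gradient below 0.25¹⁰ ≈ 1e-6, while the ReLU path passes it through unchanged for positive inputs.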

Q: What is attention mechanism / self-attention?

Attention lets the model focus on relevant parts of the input for each output position.

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

Q (Query): what we're looking for
K (Key): what each position offers
V (Value): actual content to aggregate

# Scaled dot-product attention
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ V

Self-attention: Q, K, and V are all computed from the same sequence, each through its own learned linear projection. This lets each token attend to every other token.

Q: Compare CNN vs RNN vs Transformer for sequence tasks.

	CNN	RNN/LSTM	Transformer
Parallelizable	✅ Yes	❌ Sequential	✅ Yes
Long-range deps	❌ Fixed window	⚠️ Struggles	✅ Global attention
Memory	Low	Medium	High (O(n²))
Best for	Local patterns, CV	Short sequences	NLP, long sequences
Speed	Fast	Slow	Fast (parallelized)

Q: What is batch normalization? Why does it help?

BatchNorm normalizes activations within each mini-batch to have zero mean and unit variance, then applies learnable scale/shift:

# PyTorch
self.bn = nn.BatchNorm2d(num_channels)

# What it computes:
# μ = mean(x), σ² = var(x)
# x_norm = (x - μ) / √(σ² + ε)
# output = γ * x_norm + β  (γ, β are learned)

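Benefits: Allows higher learning rates, reduces sensitivity to weight initialization, and acts as a mild regularizer. It was originally motivated as reducing internal covariate shift, though later work attributes much of the benefit to a smoother optimization landscape.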
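The commented formula can be run directly. This NumPy sketch covers training mode only for a `(batch, features)` input; at inference, frameworks use running averages of the mean and variance instead of batch statistics. The input values and the `batch_norm_train` name are illustrative.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for x of shape (batch, features)."""
    mu = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                      # biased variance, as used for normalization
    x_norm = (x - mu) / np.sqrt(var + eps)
    return gamma * x_norm + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=4.0, size=(64, 3))   # hypothetical activations
out = batch_norm_train(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```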

🗣️ NLP

Q: How does BERT work? What makes it different from GPT?

BERT (Bidirectional Encoder Representations from Transformers):

Encoder-only transformer
Bidirectional: sees both left and right context simultaneously
Pre-trained with: Masked Language Model (MLM) + Next Sentence Prediction (NSP)
Best for: classification, NER, Q&A (understanding tasks)

GPT (Generative Pre-trained Transformer):

Decoder-only transformer
Unidirectional (causal): only sees left context
Pre-trained with: Next token prediction
Best for: text generation, summarization, chat

BERT: [CLS] The [MASK] sat on the mat [SEP] → predicts "cat"
GPT:  The cat sat on → predicts "the"

Q: What is RAG (Retrieval-Augmented Generation)?

RAG combines a retriever (finds relevant documents) with a generator (LLM) to answer questions grounded in external knowledge.

Query → [Embed] → Vector DB search → Top-k docs
                                          ↓
                   Prompt: "Using these docs: {docs}\nAnswer: {query}"
                                          ↓
                                    LLM generates answer

from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

Why RAG? Overcomes LLM knowledge cutoff, reduces hallucination, keeps knowledge updatable without retraining.
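The retrieval step can be demonstrated without any external service. The sketch below swaps the embedding model for a toy bag-of-words vector and the vector DB for a sorted list, so it shows only the shape of the pipeline (embed, rank by cosine similarity, build the prompt); the corpus and all function names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, docs, k=2):
    context = "\n".join(retrieve(query, docs, k))
    return f"Using these docs:\n{context}\nAnswer: {query}"

docs = [  # hypothetical corpus
    "attention lets each token attend to every other token",
    "batch normalization normalizes activations within a mini-batch",
    "smote oversamples the minority class",
]
print(build_prompt("how does attention work", docs, k=1))
```

The resulting prompt string is what would be sent to the LLM; the generation step itself is omitted here.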

Q: Explain word embeddings. Word2Vec vs GloVe vs FastText.

Embeddings map words to dense vectors where similar words are close.

	Word2Vec	GloVe	FastText
Method	Neural (CBOW/Skip-gram)	Matrix factorization on co-occurrence	Word2Vec + subword n-grams
OOV handling	❌ No	❌ No	✅ Yes (via n-grams)
Morphology	❌ No	❌ No	✅ Yes
Best for	General NLP	General NLP	Morphologically rich languages

from gensim.models import Word2Vec, FastText
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
vector = model.wv['python']  # shape: (100,)
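FastText's OOV handling comes from representing a word as the sum of its character n-gram vectors. The sketch below only generates the n-grams (it does not train or look up vectors); the 3 to 6 range mirrors FastText's usual defaults, and `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>', the subword units FastText sums over."""
    token = f"<{word}>"                       # boundary markers distinguish prefixes/suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
shared = set(char_ngrams("tokenization")) & set(char_ngrams("tokenizer"))
print(sorted(shared)[:5])
```

An unseen word such as "tokenizer" still shares many n-grams with "tokenization", so it gets a meaningful vector even though it was never in the vocabulary.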

👁️ Computer Vision

Q: How does YOLO work? What makes YOLOv8 better?

YOLO (You Only Look Once) divides image into S×S grid. Each cell predicts B bounding boxes + class probabilities in a single forward pass.

Input image (640×640)
    ↓
Backbone (feature extraction)
    ↓
Neck (FPN/PAN: multi-scale features)
    ↓
Head (predict boxes + classes for 3 scales)
    ↓
NMS (remove overlapping boxes)
    ↓
Final detections

YOLOv8 improvements over v5:

Anchor-free detection (no pre-defined anchors)
Decoupled head (separate classification and regression)
C2f module replaces C3 (better gradient flow)
New loss: Distribution Focal Loss for bounding box regression
~35% fewer parameters than YOLOv5 at same accuracy

from ultralytics import YOLO
model = YOLO('yolov8n.pt')
results = model('image.jpg')
results[0].boxes  # bboxes, confidences, classes
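The NMS step in the pipeline is a common follow-up question. Below is a minimal greedy version in plain Python, assuming `(x1, y1, x2, y2)` boxes and a single class; it is a sketch of the general technique, not the Ultralytics implementation, and the example detections are made up.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]  # hypothetical detections
scores = [0.90, 0.75, 0.80]
print(nms(boxes, scores))   # indices of the boxes that survive
```

Here the second box overlaps the first with IoU ≈ 0.82 and is suppressed, while the distant third box is kept.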

Q: What is transfer learning? When to freeze layers?

Transfer learning: use a model trained on large dataset (ImageNet) as starting point for your task.

Strategies:

Scenario	Approach
Small dataset, similar domain	Freeze backbone, train head only
Small dataset, different domain	Freeze early layers, fine-tune later layers
Large dataset, any domain	Fine-tune entire network

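# PyTorch: freeze backbone
import torch.nn as nn
import torchvision

# `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13)
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False  # Freeze all

# Replace and unfreeze head (num_classes is defined by the caller)
model.fc = nn.Linear(model.fc.in_features, num_classes)  # New layer has requires_grad=True, so only fc trains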

🏗️ ML System Design

Q: Design a real-time fraud detection system.

Transaction → [Feature Engineering] → [ML Model] → Decision
                     ↑                              ↓
              Feature Store                    Risk Score
              (user history,              → Block / Flag / Pass
               merchant profile,
               velocity features)

Key components:

Features: amount, merchant category, location delta, time of day, velocity (5 txns in 1 min?), device fingerprint
Model: XGBoost / LightGBM (low latency), backed by deep learning for complex patterns
Threshold: p(fraud) > 0.7 → block, 0.3-0.7 → MFA challenge, < 0.3 → pass
Latency: < 100ms P99 via feature precomputation + model serving (TorchServe/TFServing)
Feedback loop: labeled outcomes → retrain weekly
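Two of the components above are simple enough to sketch: the threshold policy and a velocity feature. The code assumes the thresholds from the list (0.7 and 0.3) and an in-memory sliding window per user; a production system would keep this state in a feature store. `decide` and `VelocityCounter` are hypothetical names.

```python
from collections import deque

def decide(p_fraud, block_at=0.7, challenge_at=0.3):
    """Map a fraud probability to an action using the thresholds listed above."""
    if p_fraud > block_at:
        return "block"
    if p_fraud >= challenge_at:
        return "mfa_challenge"
    return "pass"

class VelocityCounter:
    """Counts one user's transactions inside a sliding time window (seconds)."""
    def __init__(self, window_s=60):
        self.window_s = window_s
        self.timestamps = deque()

    def add(self, ts):
        self.timestamps.append(ts)
        # Evict events that have fallen out of the window
        while self.timestamps and self.timestamps[0] <= ts - self.window_s:
            self.timestamps.popleft()
        return len(self.timestamps)

velocity = VelocityCounter(window_s=60)
counts = [velocity.add(ts) for ts in (0, 10, 20, 30, 40, 200)]  # hypothetical timestamps
print(counts)
print([decide(p) for p in (0.05, 0.45, 0.92)])
```

The deque keeps each update O(1) amortized, which matters when the feature has to be computed inside a sub-100 ms budget.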

Metrics: Precision (false positives = bad UX), Recall (false negatives = fraud loss), F1, AUC

Q: Design a recommendation system (YouTube/Netflix style).

Two-stage architecture:

100M items → [Retrieval (fast)] → 1000 candidates
                                      ↓
                              [Ranking (accurate)]
                                      ↓
                               Top 50 shown to user

Retrieval: Matrix factorization / two-tower neural network

# Two-tower: embed user and item separately, dot product for score
user_embed = user_tower(user_features)   # (batch, 128)
item_embed = item_tower(item_features)   # (batch, 128)
scores = (user_embed * item_embed).sum(-1)

Ranking: Wide & Deep / DIN / Transformer on (user, item, context) features

Metrics: Click-through rate, Watch time, NDCG, Coverage, Diversity
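Of the metrics listed, NDCG is the one most often asked to be written out. The sketch below uses the linear-gain form (relevance divided by `log2(rank + 1)`); some references use `2^rel - 1` as the gain instead, so treat the choice as an assumption. The relevance grades are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance at rank i is divided by log2(i + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(relevances, k):
    """DCG of the shown order divided by the DCG of the ideal (sorted) order."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

ranked = [3, 0, 2, 1]   # hypothetical graded relevance of items in the order shown
print(round(ndcg_at_k(ranked, k=4), 4))
print(ndcg_at_k([3, 2, 1, 0], k=4))   # already ideally ordered
```

A perfectly ordered list scores 1.0, and pushing a relevant item down the list lowers the score more the higher it should have ranked.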

🐍 Python & Libraries

Essential NumPy one-liners

import numpy as np

# Shape manipulation
x = np.random.randn(100, 3)
x.reshape(50, 6)           # Reshape
x.T                         # Transpose
x[:, np.newaxis]            # Add dimension
np.squeeze(x)               # Remove size-1 dims

# Math
np.dot(A, B)                # Matrix multiply (2D)
A @ B                       # Same, cleaner syntax
np.linalg.norm(x)           # L2 norm
np.linalg.eig(A)            # Eigendecomposition
np.linalg.svd(A)            # SVD

# Stats
np.mean(x, axis=0)          # Column means
np.std(x, ddof=1)           # Sample std
np.percentile(x, 75)        # 75th percentile
np.corrcoef(x[:, 0], x[:, 1])  # Correlation

# Boolean ops
np.where(x > 0, x, 0)      # ReLU!
x[x > 0]                    # Boolean indexing
np.any(x > 5), np.all(x > 0)

Essential Pandas one-liners

import numpy as np
import pandas as pd

# Load
df = pd.read_csv('data.csv')
df.info()               # Shape, dtypes, nulls
df.describe()           # Stats summary
df.head(), df.tail()

# Missing values
df.isnull().sum()                                   # Null count per col
df = df.fillna(df.mean(numeric_only=True))          # Fill numeric cols with their mean
df = df.dropna(subset=['target'])                   # Drop rows with null target (returns a copy)
df['col'] = df['col'].fillna(df['col'].mode()[0])   # Fill with mode

# Feature engineering
df['log_price'] = np.log1p(df['price'])
df['age_group'] = pd.cut(df['age'], bins=[0,18,35,60,100], labels=['teen','young','mid','senior'])
pd.get_dummies(df['category'], prefix='cat', drop_first=True)  # One-hot encode

# Aggregation
df.groupby('city')['salary'].agg(['mean','median','count'])
df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')

Scikit-learn full pipeline template

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV

num_features = ['age', 'salary', 'experience']
cat_features = ['city', 'department']

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

full_pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=100))
])

# Cross-validate
scores = cross_val_score(full_pipeline, X_train, y_train, cv=5, scoring='roc_auc')

# Grid search
param_grid = {'model__n_estimators': [100, 200], 'model__max_depth': [3, 5]}
gs = GridSearchCV(full_pipeline, param_grid, cv=5, n_jobs=-1, scoring='roc_auc')
gs.fit(X_train, y_train)

PyTorch training loop template

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        output = model(X)
        loss = criterion(output, y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

@torch.no_grad()
def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss, correct = 0, 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        output = model(X)
        total_loss += criterion(output, y).item()
        correct += (output.argmax(1) == y).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)

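# Training loop with early stopping
# Assumes model, train_loader, val_loader, optimizer, criterion, and device are already defined,
# and that scheduler is a torch.optim.lr_scheduler.ReduceLROnPlateau (it steps on a metric).
best_val_loss, patience, wait = float('inf'), 5, 0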
for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer, criterion, device)
    val_loss, val_acc = evaluate(model, val_loader, criterion, device)
    scheduler.step(val_loss)
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pt')
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            print(f'Early stopping at epoch {epoch}')
            break
    
    print(f'Epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f} acc={val_acc:.4f}')

🧩 Coding Patterns

Custom Dataset class (PyTorch)

import torch
from torch.utils.data import Dataset
from PIL import Image
import torchvision.transforms as T

class ImageDataset(Dataset):
    def __init__(self, df, img_dir, transform=None):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        self.transform = transform or T.Compose([
            T.Resize((224, 224)),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        img = Image.open(f"{self.img_dir}/{self.df.loc[idx, 'filename']}").convert('RGB')
        label = self.df.loc[idx, 'label']
        return self.transform(img), torch.tensor(label, dtype=torch.long)

Evaluation metrics from scratch

import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    precision = tp / (tp + fp + 1e-8)
    recall    = tp / (tp + fn + 1e-8)
    f1        = 2 * precision * recall / (precision + recall + 1e-8)
    return precision, recall, f1

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def iou(box1, box2):
    x1 = max(box1[0], box2[0]); y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2]); y2 = min(box1[3], box2[3])
    intersection = max(0, x2-x1) * max(0, y2-y1)
    union = (box1[2]-box1[0])*(box1[3]-box1[1]) + (box2[2]-box2[0])*(box2[3]-box2[1]) - intersection
    return intersection / (union + 1e-8)

💼 Behavioral

Project walkthrough template (STAR format)

Situation: "At [company/university], we faced [problem] — [metric showing scale]."

Task: "My role was to [responsibility] — specifically [what you owned]."

Action:

"First I [explored/analyzed] the data and found [insight]."
"I chose [model/approach] because [reason over alternatives]."
"Key challenge was [X] — I solved it by [Y]."

Result: "[Metric improvement] — e.g., reduced inference time by 40%, improved F1 from 0.72 to 0.89."

Tip: Always quantify. "Improved accuracy" < "Improved F1 from 0.72 to 0.89 on a 100K sample test set."

Questions to ask the interviewer

"What does the ML infrastructure look like — on-prem, cloud, internal tooling?"
"How do you handle model monitoring and drift detection in production?"
"What's the typical iteration cycle from idea to model in production?"
"What's the biggest unsolved ML challenge on the team right now?"
"How does the team balance research vs engineering vs product priorities?"

🗺️ Roadmap

LLM fine-tuning section (LoRA, QLoRA, RLHF)
MLOps questions (MLflow, DVC, Kubeflow, feature stores)
Reinforcement Learning fundamentals
Time series deep dive
More system design case studies

🤝 Contributing

Found an error? Have a great question/answer pair? PRs are very welcome.

git checkout -b add/new-question
# Add your Q&A in the relevant section
git commit -m 'Add: <topic> question on <concept>'
git push origin add/new-question