Replace and unfreeze head

May 18, 2026

The most complete, interview-focused ML/AI reference on GitHub.

Cracking interviews at Google DeepMind, OpenAI, Meta AI, Amazon, Microsoft.


📌 What's Inside

| Section | Content |
|---|---|
| 🧠 ML Fundamentals | Bias-variance, overfitting, regularization, loss functions |
| 🔢 Math & Statistics | Linear algebra, probability, calculus for ML |
| 🤖 Deep Learning | CNNs, RNNs, transformers, attention, training tricks |
| 🗣️ NLP | BERT, GPT, RAG, embeddings, tokenization |
| 👁️ Computer Vision | YOLO, ResNet, image augmentation, segmentation |
| 🏗️ ML System Design | Recommendation systems, search, fraud detection |
| 🐍 Python & Libraries | NumPy, Pandas, scikit-learn, PyTorch one-liners |
| 🧩 Coding Patterns | Data preprocessing, model eval, cross-validation |
| 💼 Behavioral | STAR answers, research discussion, project walkthrough |

🧠 ML Fundamentals

Q: Explain the bias-variance tradeoff.

Bias = error from wrong assumptions (underfitting: model too simple). Variance = error from sensitivity to training data fluctuations (overfitting: model too complex).

Total Error = Bias² + Variance + Irreducible Noise

| | High Bias | High Variance |
|---|---|---|
| Training error | High | Low |
| Test error | High | High |
| Fix | More features, more complex model | Regularization, more data, dropout |

Interview tip: Draw the U-shaped curve of test error against model complexity. Explain that the goal is the complexity at the bottom of the U, where the sum of bias² and variance is smallest.
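To make the decomposition concrete, here is a minimal sketch that computes bias² and variance at a single test point. The prediction lists are hypothetical numbers standing in for the same model class retrained on different training sets, and noise is assumed to be zero.

```python
from statistics import mean

def bias_variance(predictions, true_value):
    """Decompose the expected squared error at one test point."""
    avg_pred = mean(predictions)
    bias_sq = (avg_pred - true_value) ** 2
    variance = mean((p - avg_pred) ** 2 for p in predictions)
    expected_sq_error = mean((p - true_value) ** 2 for p in predictions)
    return bias_sq, variance, expected_sq_error

# Hypothetical predictions from four retrainings of each model
simple_model = [2.0, 2.0, 2.0, 2.0]    # consistently off: high bias, no variance
complex_model = [1.0, 5.0, 2.0, 4.0]   # right on average, unstable: high variance

for name, preds in [("simple", simple_model), ("complex", complex_model)]:
    b2, var, err = bias_variance(preds, true_value=3.0)
    print(f"{name}: bias^2={b2:.2f} variance={var:.2f} error={err:.2f}")
```

In both cases the error equals bias² + variance, because this toy setup has no irreducible noise term.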

Q: What is regularization? Compare L1 vs L2.

Regularization adds a penalty to the loss function to prevent overfitting.

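| | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | λ Σ \|wᵢ\| | λ Σ wᵢ² |
| Effect | Produces sparse weights (zeros out features) | Shrinks weights toward zero, keeps all |
| Use when | Feature selection needed | All features relevant |
| Gradient of penalty | λ · sign(w) (subgradient at w = 0) | 2λw |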
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1)   # L1
ridge = Ridge(alpha=1.0)   # L2
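
Why L1 produces exact zeros while L2 only shrinks can be seen in one dimension. The sketch below is an illustration, not how scikit-learn solves the problem: it uses the closed-form minimizers of a single-weight objective, where `z` is the hypothetical unregularized weight.

```python
def l1_solution(z, lam):
    """argmin over w of 0.5*(w - z)**2 + lam*|w|  (soft-thresholding)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def l2_solution(z, lam):
    """argmin over w of 0.5*(w - z)**2 + lam*w**2  (proportional shrinkage)."""
    return z / (1.0 + 2.0 * lam)

unregularized = [3.0, 0.5, -0.2, -2.0]
lam = 1.0
l1_weights = [l1_solution(z, lam) for z in unregularized]
l2_weights = [l2_solution(z, lam) for z in unregularized]
print(l1_weights)  # small weights become exactly 0
print(l2_weights)  # every weight shrinks, none reaches 0
```
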
Q: How do you handle imbalanced datasets?

Resampling:

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample minority class
X_res, y_res = SMOTE().fit_resample(X, y)

# Undersample majority class
X_res, y_res = RandomUnderSampler().fit_resample(X, y)

Class weights:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(class_weight='balanced')
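
The `'balanced'` option weights each class by `n_samples / (n_classes * count)`, the formula given in the scikit-learn documentation. A minimal pure-Python sketch of that formula, with a hypothetical 90/10 label split:

```python
from collections import Counter

def balanced_class_weights(y):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

y = [0] * 90 + [1] * 10          # hypothetical labels
weights = balanced_class_weights(y)
print(weights)                   # the minority class gets the larger weight
```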

Metrics: Avoid accuracy. Use:

  • Precision, Recall, F1-score
  • ROC-AUC
  • PR-AUC (better for severe imbalance)
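
A short illustration of why accuracy misleads here, assuming a hypothetical dataset with 1% positives and a classifier that always predicts the negative class:

```python
y_true = [1] * 10 + [0] * 990      # 1% positive class (hypothetical)
y_pred = [0] * 1000                # always predicts "negative"

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy looks excellent even though the model never finds a single positive example.
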
Q: Explain cross-validation. Why use k-fold?

K-fold CV splits data into k subsets. Train on k-1, test on 1. Repeat k times. Average the scores.

from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")

Why k-fold? Reduces variance in evaluation vs a single train/test split. Stratified k-fold preserves class distribution in each fold.
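
The splitting logic itself is small. Below is a minimal sketch of plain (unshuffled, unstratified) k-fold index generation; `k_fold_indices` is a hypothetical helper, not the scikit-learn implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print(f"train={train_idx} test={test_idx}")
```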

Q: What is gradient descent? Compare SGD, Mini-batch, Adam.

Gradient descent minimizes loss by updating parameters in the direction of steepest descent:

θ = θ - α · ∇L(θ)

| | Batch GD | SGD | Mini-batch |
|---|---|---|---|
| Update per | Full dataset | 1 sample | Batch (32-256) |
| Speed (per update) | Slow | Fast | Fast |
| Noise | Low | High | Medium |
| Memory | High | Low | Medium |

Adam (Adaptive Moment Estimation) is a widely used default optimizer:

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

Adam combines momentum + RMSProp. Adapts learning rate per parameter.
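
For intuition, here is a from-scratch sketch of the Adam update on a single scalar parameter. It follows the standard update equations with the same default betas as above; the objective `f(x) = (x - 3)^2`, the learning rate, and the step count are assumptions chosen for illustration.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (RMSProp-style)
        m_hat = m / (1 - beta1 ** t)             # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x_min = adam_minimize(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 3))
```

Dividing by `sqrt(v_hat)` is what makes the step size adapt per parameter: parameters with large recent gradients take smaller steps.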


🔢 Math & Statistics

Q: What is the dot product and why does it matter in ML?
a · b = Σ aᵢbᵢ = |a||b|cos(θ)

In ML: measures similarity (cosine similarity in embeddings), is the core of every linear layer:

output = X @ W.T + b  # Matrix multiplication = stacked dot products
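
To see the "stacked dot products" claim directly, the sketch below reimplements `X @ W.T + b` with plain lists. The shapes and values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear(X, W, b):
    """Pure-Python equivalent of X @ W.T + b for lists of lists."""
    return [[dot(x, w) + b_j for w, b_j in zip(W, b)] for x in X]

X = [[1.0, 2.0], [3.0, 4.0]]                # 2 samples, 2 input features
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 output units, one weight row each
b = [0.0, 0.0, 1.0]
out = linear(X, W, b)
print(out)  # each entry is one sample's dot product with one weight row, plus bias
```
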
Q: Explain PCA intuitively and mathematically.

PCA finds directions (principal components) of maximum variance in data.

Steps:

  1. Center data: X_centered = X - mean(X)
  2. Compute covariance matrix: C = (X_centered.T @ X_centered) / (n-1)
  3. Eigendecompose: C = V Λ Vᵀ (columns of V sorted by decreasing eigenvalue)
  4. Project onto the top k components: X_pca = X_centered @ V[:, :k]
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")
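
The four steps can also be sketched without any library. The version below swaps in power iteration for the full eigendecomposition, so it recovers only the first principal component (k = 1); the data points are hypothetical and lie on the line y = 2x.

```python
import math

def first_principal_component(X, iters=100):
    """Top eigenvector of the covariance matrix via power iteration."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]      # step 1: center
    C = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)]
         for i in range(d)]                                        # step 2: covariance
    v = [1.0] * d                                                  # assumed start vector
    for _ in range(iters):                                         # step 3: dominant eigenvector
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    projected = [sum(r[j] * v[j] for j in range(d)) for r in Xc]   # step 4: project
    return v, projected

X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component, scores = first_principal_component(X)
print([round(c, 3) for c in component])  # direction of the line y = 2x, unit length
```

Power iteration fails if the start vector is orthogonal to the top eigenvector or the data has zero variance; production code should use `np.linalg.eigh` or SVD instead.
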
Q: What is the difference between MLE and MAP?

MLE (Maximum Likelihood Estimation):

θ_MLE = argmax P(data | θ)

Finds parameters that make observed data most probable. No prior assumption.

MAP (Maximum A Posteriori):

θ_MAP = argmax P(θ | data) = argmax P(data | θ) · P(θ)

Incorporates prior belief about θ. With a Gaussian prior on θ, MAP is equivalent to L2 regularization; with a Laplace prior, it is equivalent to L1 regularization.
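
A coin-flip example makes the difference visible. This sketch assumes a Bernoulli likelihood with a Beta(a, b) prior, which is a different prior from the Gaussian and Laplace cases above but has a simple closed-form posterior mode.

```python
def mle_bernoulli(heads, flips):
    """Maximizes P(data | theta): the observed frequency."""
    return heads / flips

def map_bernoulli(heads, flips, a=2.0, b=2.0):
    """Posterior mode under a Beta(a, b) prior (requires a, b > 1)."""
    return (heads + a - 1) / (flips + a + b - 2)

# Three heads in three flips (hypothetical data)
print(mle_bernoulli(3, 3))   # MLE trusts the data alone
print(map_bernoulli(3, 3))   # the prior pulls the estimate toward 0.5
```

With more data the two estimates converge, because the likelihood dominates the prior.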


🤖 Deep Learning

Q: How does backpropagation work?

Backprop computes gradients of the loss with respect to all parameters using the chain rule.

Forward: x → [L1] → h → [L2] → ŷ → loss
Backward: ∂loss/∂W₂ → ∂loss/∂h → ∂loss/∂W₁
loss = criterion(output, target)
loss.backward()       # Compute all gradients
optimizer.step()      # Update weights: W = W - lr * W.grad
optimizer.zero_grad() # Clear for next batch

Chain rule: ∂L/∂W₁ = (∂L/∂ŷ) · (∂ŷ/∂h) · (∂h/∂W₁)
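
The same chain rule can be worked by hand on a tiny network. This sketch assumes scalar weights, no activation functions, and a squared-error loss, and checks the analytic gradient against a finite difference.

```python
def forward_backward(x, y, w1, w2):
    # Forward: x -> [L1] -> h -> [L2] -> y_hat -> loss
    h = w1 * x
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    # Backward: apply the chain rule from the loss toward the input
    dloss_dyhat = 2 * (y_hat - y)
    dloss_dw2 = dloss_dyhat * h            # d(y_hat)/d(w2) = h
    dloss_dh = dloss_dyhat * w2            # d(y_hat)/d(h)  = w2
    dloss_dw1 = dloss_dh * x               # d(h)/d(w1)     = x
    return loss, dloss_dw1, dloss_dw2

x, y, w1, w2 = 2.0, 1.0, 0.5, 3.0
loss, g1, g2 = forward_backward(x, y, w1, w2)

# Sanity check against a central finite difference
eps = 1e-6
numeric_g1 = (forward_backward(x, y, w1 + eps, w2)[0]
              - forward_backward(x, y, w1 - eps, w2)[0]) / (2 * eps)
print(g1, numeric_g1)
```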

Q: Explain the vanishing gradient problem and solutions.

In deep networks, gradients shrink exponentially as they propagate backward through many sigmoid/tanh layers, so early layers learn very slowly.

Solutions:

| Fix | How |
|---|---|
| ReLU activation | Gradient = 1 for positive inputs (no squashing) |
| Batch Normalization | Normalizes activations, keeps gradients stable |
| Residual connections | Gradient flows directly via skip connections |
| LSTM/GRU | Gating mechanisms preserve long-range gradients |
| Gradient clipping | torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) (guards against the opposite problem, exploding gradients) |
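
The exponential shrinkage is easy to quantify. This sketch multiplies only the activation derivatives across layers and ignores the weight matrices, which is a simplifying assumption; the depth of 20 is arbitrary.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

depth = 20
sigmoid_chain = sigmoid_grad(0.0) ** depth   # best case for sigmoid
relu_chain = relu_grad(1.0) ** depth         # active ReLU units
print(f"sigmoid: {sigmoid_chain:.2e}, relu: {relu_chain:.1f}")
```

Even in the best case the sigmoid factor is 0.25 per layer, so 20 layers scale the gradient by roughly 1e-12, while active ReLU units pass it through unchanged.
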
Q: What is attention mechanism / self-attention?

Attention lets the model focus on relevant parts of the input for each output position.

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
  • Q (Query): what we're looking for
  • K (Key): what each position offers
  • V (Value): actual content to aggregate
# Scaled dot-product attention
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ V

Self-attention: Q, K, and V are all computed from the same sequence (via learned linear projections), so each token can attend to every other token.

Q: Compare CNN vs RNN vs Transformer for sequence tasks.
| | CNN | RNN/LSTM | Transformer |
|---|---|---|---|
| Parallelizable | ✅ Yes | ❌ Sequential | ✅ Yes |
| Long-range deps | ❌ Fixed window | ⚠️ Struggles | ✅ Global attention |
| Memory | Low | Medium | High (O(n²)) |
| Best for | Local patterns, CV | Short sequences | NLP, long sequences |
| Speed | Fast | Slow | Fast (parallelized) |
Q: What is batch normalization? Why does it help?

BatchNorm normalizes activations within each mini-batch to have zero mean and unit variance, then applies learnable scale/shift:

# PyTorch
self.bn = nn.BatchNorm2d(num_channels)

# What it computes:
# μ = mean(x), σ² = var(x)
# x_norm = (x - μ) / √(σ² + ε)
# output = γ * x_norm + β  (γ, β are learned)

Benefits: Reduces internal covariate shift, acts as regularization, allows higher learning rates, reduces sensitivity to weight initialization.
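
The training-time computation can be sketched from scratch for a single feature. This assumes a plain Python list as the mini-batch and omits the running statistics that `nn.BatchNorm2d` keeps for inference.

```python
import math

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)     # biased variance, as used in training
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in x]

batch = [2.0, 4.0, 6.0, 8.0]     # hypothetical activations
out = batch_norm(batch)
print([round(v, 3) for v in out])
```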


🗣️ NLP

Q: How does BERT work? What makes it different from GPT?

BERT (Bidirectional Encoder Representations from Transformers):

  • Encoder-only transformer
  • Bidirectional: sees both left and right context simultaneously
  • Pre-trained with: Masked Language Model (MLM) + Next Sentence Prediction (NSP)
  • Best for: classification, NER, Q&A (understanding tasks)

GPT (Generative Pre-trained Transformer):

  • Decoder-only transformer
  • Unidirectional (causal): only sees left context
  • Pre-trained with: Next token prediction
  • Best for: text generation, summarization, chat
BERT: [CLS] The [MASK] sat on the mat [SEP] → predicts "cat"
GPT:  The cat sat on → predicts "the"
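
The architectural difference comes down to the attention mask. Below is a minimal sketch of GPT-style causal masking on raw attention scores; a BERT-style encoder would apply softmax over each full row instead. The uniform scores are hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]                    # mask out the future
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]       # numerically stable softmax
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

scores = [[0.0, 0.0, 0.0]] * 3          # uniform raw scores for 3 tokens
causal = causal_attention_weights(scores)
for row in causal:
    print([round(w, 2) for w in row])   # lower-triangular pattern
```
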
Q: What is RAG (Retrieval-Augmented Generation)?

RAG combines a retriever (finds relevant documents) with a generator (LLM) to answer questions grounded in external knowledge.

Query → [Embed] → Vector DB search → Top-k docs
                                         ↓
            Prompt: "Using these docs: {docs}\nAnswer: {query}"
                                         ↓
                               LLM generates answer
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

Why RAG? Overcomes LLM knowledge cutoff, reduces hallucination, keeps knowledge updatable without retraining.
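
The retrieve-then-prompt flow can be sketched without any external service. Everything here is a stand-in: `embed` is a toy bag-of-words counter rather than a learned embedding model, the documents are hypothetical, and the final LLM call is left out.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, docs):
    return "Using these docs: " + " | ".join(docs) + "\nAnswer: " + query

docs = [
    "adam combines momentum and rmsprop",
    "yolo predicts boxes in a single forward pass",
    "pca finds directions of maximum variance",
]
query = "what does adam combine"
top_docs = retrieve(query, docs, k=1)
print(build_prompt(query, top_docs))
```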

Q: Explain word embeddings. Word2Vec vs GloVe vs FastText.

Embeddings map words to dense vectors where similar words are close.

| | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Method | Neural (CBOW/Skip-gram) | Matrix factorization on co-occurrence | Word2Vec + subword n-grams |
| OOV handling | ❌ No | ❌ No | ✅ Yes (via n-grams) |
| Morphology | ❌ No | ❌ No | ✅ Yes |
| Best for | General NLP | General NLP | Morphologically rich languages |
from gensim.models import Word2Vec, FastText
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
vector = model.wv['python']  # shape: (100,)
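
FastText handles out-of-vocabulary words by summing vectors of character n-grams. Here is a sketch of the n-gram extraction only, with a single n for brevity (FastText itself uses a range of n and also keeps the whole word); `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, in the style FastText uses."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

grams = char_ngrams("where")
print(grams)
```

An unseen word such as "wherever" shares several of these n-grams, which is how it still gets a sensible vector.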

👁️ Computer Vision

Q: How does YOLO work? What makes YOLOv8 better?

YOLO (You Only Look Once) divides the image into an S×S grid. Each cell predicts B bounding boxes + class probabilities in a single forward pass.

Input image (640 × 640)
        ↓
Backbone (feature extraction)
        ↓
Neck (FPN/PAN: multi-scale features)
        ↓
Head (predict boxes + classes for 3 scales)
        ↓
NMS (remove overlapping boxes)
        ↓
Final detections

YOLOv8 improvements over v5:

  • Anchor-free detection (no pre-defined anchors)
  • Decoupled head (separate classification and regression)
  • C2f module replaces C3 (better gradient flow)
  • New loss: Distribution Focal Loss for bounding box regression
  • Higher reported COCO mAP than the same-size YOLOv5 variants (parameter counts differ by model size and are not uniformly lower)
from ultralytics import YOLO
model = YOLO('yolov8n.pt')
results = model('image.jpg')
results[0].boxes  # bboxes, confidences, classes
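
The NMS step in the pipeline is a common follow-up question. Below is a minimal class-agnostic greedy NMS sketch in plain Python; the boxes, scores, and the 0.5 threshold are hypothetical, and library implementations are vectorized and usually applied per class.

```python
def iou(a, b):
    """Boxes are (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of surviving boxes
```
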
Q: What is transfer learning? When to freeze layers?

Transfer learning: use a model trained on large dataset (ImageNet) as starting point for your task.

Strategies:

ScenarioApproach
Small dataset, similar domainFreeze backbone, train head only
Small dataset, different domainFreeze early layers, fine-tune later layers
Large dataset, any domainFine-tune entire network
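# PyTorch: freeze backbone
import torch.nn as nn
import torchvision

# `weights=` replaces the deprecated `pretrained=True` (torchvision >= 0.13)
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False  # Freeze all

# Replace the head; a newly created layer has requires_grad=True by default
# (num_classes is the number of target classes in your task)
model.fc = nn.Linear(2048, num_classes)  # Only fc trains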

🏗️ ML System Design

Q: Design a real-time fraud detection system.
Transaction → [Feature Engineering] → [ML Model] → Decision
                     ↑                                ↓
              Feature Store                      Risk Score
              (user history,                → Block / Flag / Pass
               merchant profile,
               velocity features)

Key components:

  1. Features: amount, merchant category, location delta, time of day, velocity (5 txns in 1 min?), device fingerprint
  2. Model: XGBoost / LightGBM (low latency), backed by deep learning for complex patterns
  3. Threshold: p(fraud) > 0.7 → block, 0.3-0.7 → MFA challenge, < 0.3 → pass
  4. Latency: < 100ms P99 via feature precomputation + model serving (TorchServe/TFServing)
  5. Feedback loop: labeled outcomes → retrain weekly

Metrics: Precision (false positives = bad UX), Recall (false negatives = fraud loss), F1, AUC
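
Two of the components above are simple enough to sketch directly: the threshold policy and a sliding-window velocity feature. The class and function names are hypothetical, timestamps are assumed to arrive in non-decreasing order (in seconds), and a production system would keep this state in a feature store rather than in process memory.

```python
from collections import deque

def decide(p_fraud, block_at=0.7, challenge_at=0.3):
    """Map a fraud probability to an action using the thresholds above."""
    if p_fraud > block_at:
        return "block"
    if p_fraud >= challenge_at:
        return "mfa_challenge"
    return "pass"

class VelocityCounter:
    """Counts one user's transactions inside a sliding time window."""
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.timestamps = deque()

    def add(self, ts):
        self.timestamps.append(ts)
        # Evict transactions that fell out of the window
        while self.timestamps and ts - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps)

counter = VelocityCounter()
counts = [counter.add(ts) for ts in (0, 10, 20, 30, 40, 120)]
print(counts, decide(0.85), decide(0.5), decide(0.1))
```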

Q: Design a recommendation system (YouTube/Netflix style).

Two-stage architecture:

100M items → [Retrieval (fast)] → 1000 candidates
                                         ↓
                               [Ranking (accurate)]
                                         ↓
                               Top 50 shown to user

Retrieval: Matrix factorization / two-tower neural network

# Two-tower: embed user and item separately, dot product for score
user_embed = user_tower(user_features)   # (batch, 128)
item_embed = item_tower(item_features)   # (batch, 128)
scores = (user_embed * item_embed).sum(-1)

Ranking: Wide & Deep / DIN / Transformer on (user, item, context) features

Metrics: Click-through rate, Watch time, NDCG, Coverage, Diversity
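
Of these metrics, NDCG is the one interviewers most often ask candidates to implement. Here is a minimal sketch using the linear-gain variant (some definitions use `2**rel - 1` as the gain instead); the relevance grades are hypothetical.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain: later ranks contribute less."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """relevances: graded relevance of items in the order they were shown."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

best = ndcg([3, 2, 1, 0], k=4)    # already in ideal order
worst = ndcg([0, 1, 2, 3], k=4)   # reversed order scores lower
print(best, round(worst, 3))
```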


🐍 Python & Libraries

Essential NumPy one-liners
import numpy as np

# Shape manipulation
x = np.random.randn(100, 3)
x.reshape(50, 6)           # Reshape
x.T                         # Transpose
x[:, np.newaxis]            # Add dimension
np.squeeze(x)               # Remove size-1 dims

# Math
np.dot(A, B)                # Matrix multiply (2D)
A @ B                       # Same, cleaner syntax
np.linalg.norm(x)           # L2 norm
np.linalg.eig(A)            # Eigendecomposition
np.linalg.svd(A)            # SVD

# Stats
np.mean(x, axis=0)          # Column means
np.std(x, ddof=1)           # Sample std
np.percentile(x, 75)        # 75th percentile
np.corrcoef(x[:, 0], x[:, 1])  # Correlation

# Boolean ops
np.where(x > 0, x, 0)      # ReLU!
x[x > 0]                    # Boolean indexing
np.any(x > 5), np.all(x > 0)
Essential Pandas one-liners
import pandas as pd

# Load
df = pd.read_csv('data.csv')
df.info()               # Shape, dtypes, nulls
df.describe()           # Stats summary
df.head(), df.tail()

# Missing values
df.isnull().sum()                       # Null count per col
df = df.fillna(df.mean(numeric_only=True))          # Fill numeric cols with their means
df = df.dropna(subset=['target'])                   # Drop rows with null target
df['col'] = df['col'].fillna(df['col'].mode()[0])   # Fill with mode

# Feature engineering
df['log_price'] = np.log1p(df['price'])
df['age_group'] = pd.cut(df['age'], bins=[0,18,35,60,100], labels=['teen','young','mid','senior'])
pd.get_dummies(df['category'], prefix='cat', drop_first=True)  # One-hot encode

# Aggregation
df.groupby('city')['salary'].agg(['mean','median','count'])
df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')
Scikit-learn full pipeline template
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV

num_features = ['age', 'salary', 'experience']
cat_features = ['city', 'department']

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])

full_pipeline = Pipeline([
    ('prep', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=100))
])

# Cross-validate
scores = cross_val_score(full_pipeline, X_train, y_train, cv=5, scoring='roc_auc')

# Grid search
param_grid = {'model__n_estimators': [100, 200], 'model__max_depth': [3, 5]}
gs = GridSearchCV(full_pipeline, param_grid, cv=5, n_jobs=-1, scoring='roc_auc')
gs.fit(X_train, y_train)
PyTorch training loop template
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        output = model(X)
        loss = criterion(output, y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

@torch.no_grad()
def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss, correct = 0, 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        output = model(X)
        total_loss += criterion(output, y).item()
        correct += (output.argmax(1) == y).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)

# Training loop with early stopping
best_val_loss, patience, wait = float('inf'), 5, 0
for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer, criterion, device)
    val_loss, val_acc = evaluate(model, val_loader, criterion, device)
    scheduler.step(val_loss)
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_model.pt')
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            print(f'Early stopping at epoch {epoch}')
            break
    
    print(f'Epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f} acc={val_acc:.4f}')

🧩 Coding Patterns

Custom Dataset class (PyTorch)
import torch
from torch.utils.data import Dataset
from PIL import Image
import torchvision.transforms as T

class ImageDataset(Dataset):
    def __init__(self, df, img_dir, transform=None):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        self.transform = transform or T.Compose([
            T.Resize((224, 224)),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        img = Image.open(f"{self.img_dir}/{self.df.loc[idx, 'filename']}").convert('RGB')
        label = self.df.loc[idx, 'label']
        return self.transform(img), torch.tensor(label, dtype=torch.long)
Evaluation metrics from scratch
import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    precision = tp / (tp + fp + 1e-8)
    recall    = tp / (tp + fn + 1e-8)
    f1        = 2 * precision * recall / (precision + recall + 1e-8)
    return precision, recall, f1

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def iou(box1, box2):
    x1 = max(box1[0], box2[0]); y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2]); y2 = min(box1[3], box2[3])
    intersection = max(0, x2-x1) * max(0, y2-y1)
    union = (box1[2]-box1[0])*(box1[3]-box1[1]) + (box2[2]-box2[0])*(box2[3]-box2[1]) - intersection
    return intersection / (union + 1e-8)

💼 Behavioral

Project walkthrough template (STAR format)

Situation: "At [company/university], we faced [problem]: [metric showing scale]."

Task: "My role was to [responsibility], specifically [what you owned]."

Action:

  1. "First I [explored/analyzed] the data and found [insight]."
  2. "I chose [model/approach] because [reason over alternatives]."
  3. "Key challenge was [X]; I solved it by [Y]."

Result: "[Metric improvement], e.g., reduced inference time by 40% or improved F1 from 0.72 to 0.89."

Tip: Always quantify. "Improved accuracy" < "Improved F1 from 0.72 to 0.89 on a 100K sample test set."

Questions to ask the interviewer
  1. "What does the ML infrastructure look like: on-prem, cloud, internal tooling?"
  2. "How do you handle model monitoring and drift detection in production?"
  3. "What's the typical iteration cycle from idea to model in production?"
  4. "What's the biggest unsolved ML challenge on the team right now?"
  5. "How does the team balance research vs engineering vs product priorities?"

🗺️ Roadmap

  • LLM fine-tuning section (LoRA, QLoRA, RLHF)
  • MLOps questions (MLflow, DVC, Kubeflow, feature stores)
  • Reinforcement Learning fundamentals
  • Time series deep dive
  • More system design case studies

🀝 Contributing

Found an error? Have a great question/answer pair? PRs are very welcome.

git checkout -b add/new-question
# Add your Q&A in the relevant section
git commit -m 'Add: <topic> question on <concept>'
git push origin add/new-question

⭐ If this helped you land an offer, please star the repo!

MIT © Mohd Aasim Ansari