Whisper Fine-tuning on Modal
November 23, 2025
Fine-tune OpenAI's Whisper speech recognition models using Modal's serverless GPU infrastructure.
Overview
This script provides Modal-based training infrastructure for five Whisper model variants (tiny, base, small, medium, large-v3-turbo). Each variant runs as an isolated Modal app with separate caching volumes. Training data is loaded from Hugging Face Hub datasets, and trained models are pushed back to Hugging Face Hub repositories.
Requirements
- Modal account with CLI configured
- Hugging Face account with write-enabled API token
- Python 3.12 (used in container image)
Initial Setup
Install and authenticate Modal CLI:
pip install modal
modal token new
Store Hugging Face token as Modal secret:
modal secret create huggingface-token HF_TOKEN=your_token_here
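Modal exposes the keys of an attached secret to the container as environment variables, so the training code can read the token from HF_TOKEN. A minimal sketch of that lookup, assuming the secret above is attached to the function; the helper name read_hf_token is hypothetical and not part of the script:

```python
import os


def read_hf_token(env=None):
    """Return the Hugging Face token from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; attach the 'huggingface-token' Modal secret"
        )
    return token
```

Failing early with an explicit message tends to be easier to debug than a push error at the end of a training run.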
Dataset Format
The training script expects a Hugging Face dataset with the following structure:
Parquet Format
Required columns:
- audio: Audio files (WAV format, 16kHz sample rate)
- text or sentence: Transcription text
Example dataset structure:
your-dataset/
├── data/
│ └── train-00000-of-00001.parquet
├── audio/
│ ├── sample1.wav
│ ├── sample2.wav
│ └── ...
└── README.md
Optional columns:
- duration_seconds, sample_rate, metadata fields
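One possible way to produce a dataset in this layout is sketched below. It assumes the Hugging Face datasets library is installed and that you already have (audio path, transcript) pairs; the helper names build_columns and push_dataset are hypothetical and do not come from the training script.

```python
def build_columns(pairs):
    """Turn (audio_path, transcript) pairs into 'audio' and 'text' columns."""
    audio, text = [], []
    for path, transcript in pairs:
        transcript = transcript.strip()
        if not transcript:
            continue  # skip clips that have no transcription
        audio.append(path)
        text.append(transcript)
    return {"audio": audio, "text": text}


def push_dataset(columns, repo_id):
    """Upload the columns to the Hub (imports `datasets` lazily)."""
    from datasets import Audio, Dataset

    # Casting to Audio makes the Hub store decoded-on-read audio at 16 kHz.
    ds = Dataset.from_dict(columns).cast_column("audio", Audio(sampling_rate=16000))
    ds.push_to_hub(repo_id)  # a plain Dataset is uploaded as the "train" split


columns = build_columns(
    [("audio/sample1.wav", "hello world"), ("audio/sample2.wav", "  ")]
)
```

Calling push_dataset(columns, "your-username/your-whisper-dataset") would then upload the result, given a logged-in Hugging Face session.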
Technical Notes
- Audio is automatically resampled to 16kHz if different
- Text column must be named text or sentence
- Dataset must have a train split
- Evaluation split is auto-generated (10% of the training data, giving a 90/10 train/eval split)
- Minimum 10 samples required (100+ recommended for effective fine-tuning)
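The rules above can be checked before starting a run. The sketch below is an illustration only: the function names are hypothetical, and the rounding used for the 10% evaluation split is an assumption (the script's exact behaviour may differ by a sample).

```python
def pick_text_column(column_names):
    """Return the transcription column, preferring 'text' over 'sentence'."""
    for name in ("text", "sentence"):
        if name in column_names:
            return name
    raise ValueError("dataset needs a 'text' or 'sentence' column")


def check_dataset(splits, column_names, num_rows, min_samples=10):
    """Validate the split, columns, and size; return the text column name."""
    if "train" not in splits:
        raise ValueError("dataset must have a 'train' split")
    if "audio" not in column_names:
        raise ValueError("dataset needs an 'audio' column")
    if num_rows < min_samples:
        raise ValueError(f"need at least {min_samples} samples, got {num_rows}")
    return pick_text_column(column_names)


def split_sizes(num_rows, eval_percent=10):
    """Sizes of a 90/10 split, rounding the eval share down but keeping >= 1."""
    n_eval = max(1, num_rows * eval_percent // 100)
    return num_rows - n_eval, n_eval
```

With the 10-sample minimum this leaves a single evaluation sample, which is one reason 100+ samples are recommended.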
Configuration
Edit modal_finetune_opensource.py:
- Set dataset name (line 18):
DATASET_NAME = "your-username/your-whisper-dataset"
- Set target repository names for each model (lines 66-90):
LARGE_V3_TURBO = ModelConfig(
    # ...
    default_repo="your-username/whisper-large-v3-turbo-finetuned",
)
Deploy to Modal:
modal deploy modal_finetune_opensource.py
Usage
Run training for a specific model variant:
modal run modal_finetune_opensource.py::main_large # large-v3-turbo
modal run modal_finetune_opensource.py::main_base # base
modal run modal_finetune_opensource.py::main_tiny # tiny
modal run modal_finetune_opensource.py::main_small # small
modal run modal_finetune_opensource.py::main_medium # medium
Parameter Overrides
Override default training parameters:
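from modal_finetune_opensource import train_base

# .remote() needs a running Modal app: call this from a function decorated
# with @app.local_entrypoint() and start it with modal run.
train_base.remote(
    repo_name="your-username/custom-repo-name",
    max_steps=500,
    learning_rate=5e-6,
    num_train_epochs=5,  # only used when max_steps=None
    train_batch_size=16,
    gradient_accumulation_steps=1,
)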
Parameters
| Parameter | Default | Description |
|---|---|---|
| repo_name | From config | HF repo to push model to |
| max_steps | 250 | Maximum training steps (None for epoch-based) |
| learning_rate | 1e-5 | Optimizer learning rate |
| num_train_epochs | 3 | Number of epochs (when max_steps=None) |
| train_batch_size | 8 | Per-device batch size for training |
| eval_batch_size | 8 | Per-device batch size for evaluation |
| gradient_accumulation_steps | 2 | Steps to accumulate gradients |
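These parameters interact: the batch the optimizer actually sees is the per-device batch size multiplied by the accumulation steps. The sketch below works through the defaults from the table. The mapping of None to -1 reflects how the Hugging Face Trainer expresses "no step limit"; whether the script does this conversion itself is an assumption, and all function names are illustrative.

```python
def effective_batch_size(train_batch_size=8, gradient_accumulation_steps=2, num_gpus=1):
    """Samples contributing to each optimizer update."""
    return train_batch_size * gradient_accumulation_steps * num_gpus


def samples_seen(max_steps=250, **batch_kwargs):
    """Total samples consumed over a step-limited run."""
    return max_steps * effective_batch_size(**batch_kwargs)


def to_trainer_max_steps(max_steps):
    """Trainer uses -1 for epoch-based training; a positive value overrides epochs."""
    return -1 if max_steps is None else max_steps


print(effective_batch_size())  # 8 * 2 = 16
print(samples_seen())          # 250 * 16 = 4000
```

With the defaults, 250 steps consume 4,000 samples, so a dataset of around 100 clips is passed over many times.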
Model Variants
| Model | Parameters | GPU Memory | Relative Inference Speed |
|---|---|---|---|
| tiny | 39M | ~2GB | Fastest |
| base | 74M | ~3GB | Very fast |
| small | 244M | ~6GB | Fast |
| medium | 769M | ~12GB | Moderate |
| large-v3-turbo | 809M | ~14GB | Slower |
Training Pipeline
- Dataset loaded from Hugging Face Hub
- Audio resampled to 16kHz (if necessary)
- 90/10 train/eval split created automatically
- Audio converted to mel spectrograms via feature extractor
- Seq2Seq training with evaluation every 50 steps
- Checkpoints saved every 100 steps to Modal volume
- Final model pushed to Hugging Face Hub
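For the mel spectrogram step, the input shape can be worked out from Whisper's published preprocessing constants: audio is padded or trimmed to 30 seconds at 16kHz and framed with a hop of 160 samples, with 80 mel bins for tiny through medium and 128 for large-v3-turbo. These constants come from the Whisper reference implementation rather than from this script, and mel_shape is a hypothetical helper.

```python
SAMPLE_RATE = 16_000   # Hz, matches the dataset requirement
CHUNK_SECONDS = 30     # Whisper pads or trims every clip to this length
HOP_LENGTH = 160       # samples between frames (10 ms at 16 kHz)


def mel_shape(model_name):
    """Return (n_mels, n_frames) of the encoder input for one audio clip."""
    n_mels = 128 if model_name.startswith("large-v3") else 80
    n_frames = SAMPLE_RATE * CHUNK_SECONDS // HOP_LENGTH
    return n_mels, n_frames


print(mel_shape("base"))            # (80, 3000)
print(mel_shape("large-v3-turbo"))  # (128, 3000)
```

Because every clip is padded to the same length, short clips cost as much GPU memory per sample as long ones.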
Monitoring
- TensorBoard logs saved during training
- Evaluation runs every 50 steps
- Checkpoints saved every 100 steps
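To see which steps trigger evaluation and checkpointing in a default run, the intervals above can be expanded as follows. This is a sketch of the arithmetic only; the function name is illustrative, and the Trainer may perform additional work at the very end of training.

```python
def scheduled_steps(max_steps=250, eval_every=50, save_every=100):
    """List the steps at which evaluation and checkpointing fire."""
    evals = list(range(eval_every, max_steps + 1, eval_every))
    saves = list(range(save_every, max_steps + 1, save_every))
    return evals, saves


evals, saves = scheduled_steps()
print(evals)  # [50, 100, 150, 200, 250]
print(saves)  # [100, 200]
```

With 250 steps the last scheduled checkpoint is at step 200, so a run interrupted after that point would resume from there.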
Resource Usage
Approximate Modal costs (A100-40GB GPU):
- GPU rate: ~$1.10/hour
- Training duration (250 steps): 30-90 minutes
- Cost per run: $0.50-$2.00
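The per-run figure follows from the hourly rate and the duration. A small sketch of that arithmetic, using the rate quoted above as the default (GPU prices change, so treat the rate as an input rather than a constant):

```python
def run_cost(minutes, rate_per_hour=1.10):
    """Approximate GPU cost in dollars for a run of the given length."""
    return round(rate_per_hour * minutes / 60, 2)


print(run_cost(30))  # 0.55
print(run_cost(90))  # 1.65
```

Both values fall inside the quoted $0.50-$2.00 range, which leaves some headroom for image build and model upload time.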
Troubleshooting
Dataset not found
- Verify dataset exists on Hugging Face Hub
- Check DATASET_NAME matches your dataset ID
- Ensure dataset is public or token has read access
Out of memory errors
- Reduce train_batch_size (try 4 or lower)
- Increase gradient_accumulation_steps to compensate
- Use a smaller model variant
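When lowering the batch size, the accumulation steps can be chosen so that the effective batch size stays the same. A minimal sketch, with a hypothetical helper name:

```python
def compensate(old_batch, old_accum, new_batch):
    """Return the accumulation steps that keep old_batch * old_accum unchanged."""
    effective = old_batch * old_accum
    if new_batch <= 0 or effective % new_batch != 0:
        raise ValueError(
            f"batch size {new_batch} does not divide effective batch {effective}"
        )
    return effective // new_batch


print(compensate(8, 2, 4))  # 4, since 4 * 4 == 8 * 2
print(compensate(8, 2, 2))  # 8
```

For example, the defaults (8 and 2) become train_batch_size=4 with gradient_accumulation_steps=4, which keeps 16 samples per optimizer update while roughly halving activation memory.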
Authentication failures
- Verify Modal secret: modal secret list
- Check HF token has write permissions
- Re-create secret if necessary
Slow training
- Expected for medium/large-v3-turbo models
- Use tiny/base for faster iteration
- Optimize dataset format (parquet preferred)
Advanced Configuration
Custom Dependencies
Modify build_image() to add packages:
def build_image() -> modal.Image:
    img = (
        modal.Image.debian_slim(python_version="3.12")
        .apt_install("git", "ffmpeg")
        .pip_install(
            # ... existing packages
            "your-custom-package",
        )
    )
    return img
GPU Selection
Change the GPU type in the function decorator of the relevant app (base_app in this example):
@base_app.function(
    gpu="T4",  # Options: T4, A10G, A100-40GB, A100-80GB
    # ...
)
Training Arguments
Modify Seq2SeqTrainingArguments in _train() for control over:
- Warmup steps
- Learning rate scheduling
- Evaluation strategy
- Logging frequency
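As a starting point, the settings listed above map onto Seq2SeqTrainingArguments fields roughly as sketched below. The step intervals and defaults mirror the values documented in this README; the output path, warmup steps, scheduler type, and logging interval are illustrative assumptions, not values taken from _train().

```python
def training_kwargs(
    max_steps=250,
    learning_rate=1e-5,
    train_batch_size=8,
    eval_batch_size=8,
    gradient_accumulation_steps=2,
):
    """Build keyword arguments intended for transformers.Seq2SeqTrainingArguments."""
    return {
        "output_dir": "/checkpoints",   # assumed mount path of the Modal volume
        "max_steps": max_steps,
        "learning_rate": learning_rate,
        "per_device_train_batch_size": train_batch_size,
        "per_device_eval_batch_size": eval_batch_size,
        "gradient_accumulation_steps": gradient_accumulation_steps,
        "warmup_steps": 25,             # assumed: 10% of the default 250 steps
        "lr_scheduler_type": "linear",  # assumed scheduler
        "eval_strategy": "steps",       # named evaluation_strategy in older transformers
        "eval_steps": 50,
        "save_steps": 100,
        "logging_steps": 25,            # assumed logging interval
        "report_to": ["tensorboard"],
    }


kwargs = training_kwargs(learning_rate=5e-6)
```

Inside the container these would be passed as Seq2SeqTrainingArguments(**kwargs); keeping them in a plain dict makes the values easy to inspect or log before training starts.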
License
MIT
References
- OpenAI Whisper: https://github.com/openai/whisper
- Modal documentation: https://modal.com/docs
- Hugging Face datasets: https://huggingface.co/docs/datasets