Whisper Fine-tuning on Modal

November 23, 2025

Fine-tune OpenAI's Whisper speech recognition models using Modal's serverless GPU infrastructure.

Overview

The script, modal_finetune_opensource.py, provides Modal-based training infrastructure for five Whisper model variants (tiny, base, small, medium, large-v3-turbo). Each variant runs as an isolated Modal app with its own caching volume. Training data is loaded from a Hugging Face Hub dataset, and each trained model is pushed to a Hugging Face Hub repository.

Requirements

  • Modal account with CLI configured
  • Hugging Face account with write-enabled API token
  • Python 3.12 (used in container image)

Initial Setup

Install and authenticate Modal CLI:

pip install modal
modal token new

Store Hugging Face token as Modal secret:

modal secret create huggingface-token HF_TOKEN=your_token_here
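
Modal injects each key of an attached secret into the container as an environment variable, so the token stored above is readable as HF_TOKEN at runtime. The following is a minimal sketch of a defensive lookup that fails early with a clear message. The helper name get_hf_token is hypothetical and is not taken from the script.

```python
import os
from typing import Mapping


def get_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Hugging Face token from the environment, or fail early.

    Hypothetical helper: the training script may read the token differently.
    """
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; attach the 'huggingface-token' Modal secret "
            "to the function or export HF_TOKEN locally."
        )
    return token
```

Calling such a check at the start of a training function surfaces a missing secret before any GPU time is spent.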

Dataset Format

The training script expects a Hugging Face dataset with the following structure:

Parquet Format

Required columns:

  • audio: Audio files (WAV format, 16kHz sample rate)
  • text or sentence: Transcription text

Example dataset structure:

your-dataset/
├── data/
│   └── train-00000-of-00001.parquet
├── audio/
│   ├── sample1.wav
│   ├── sample2.wav
│   └── ...
└── README.md

Optional columns:

  • duration_seconds, sample_rate, metadata fields

Technical Notes

  • Audio is automatically resampled to 16kHz if different
  • Text column must be named text or sentence
  • Dataset must have a train split
  • Evaluation split is auto-generated by holding out 10% of the training data (90/10 train/eval split)
  • Minimum 10 samples required (100+ recommended for effective fine-tuning)
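
The requirements above can be checked before launching a run. This is a sketch only: check_dataset is a hypothetical helper, it prefers text over sentence when both columns exist, and it rounds the 10% evaluation share down. None of these choices are confirmed by the script.

```python
def check_dataset(column_names, num_rows, eval_percent=10, min_rows=10):
    """Validate dataset shape and estimate the train/eval split sizes.

    Returns (text_column, train_rows, eval_rows).
    Assumptions: 'text' wins over 'sentence' if both are present, and the
    evaluation share is rounded down (the script's rounding may differ).
    """
    columns = set(column_names)
    if "audio" not in columns:
        raise ValueError("dataset is missing the required 'audio' column")

    text_column = next((c for c in ("text", "sentence") if c in columns), None)
    if text_column is None:
        raise ValueError("dataset needs a 'text' or 'sentence' column")

    if num_rows < min_rows:
        raise ValueError(f"need at least {min_rows} samples, got {num_rows}")

    # Integer arithmetic avoids floating-point surprises in the split size.
    eval_rows = max(1, num_rows * eval_percent // 100)
    return text_column, num_rows - eval_rows, eval_rows
```

For a 100-row dataset with audio and sentence columns, this reports the sentence column with 90 training rows and 10 evaluation rows.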

Configuration

Edit modal_finetune_opensource.py:

  1. Set dataset name (line 18):

    DATASET_NAME = "your-username/your-whisper-dataset"
    
  2. Set target repository names for each model (lines 66-90):

    LARGE_V3_TURBO = ModelConfig(
        # ...
        default_repo="your-username/whisper-large-v3-turbo-finetuned",
    )
    

Deploy to Modal:

modal deploy modal_finetune_opensource.py

Usage

Run training for a specific model variant:

modal run modal_finetune_opensource.py::main_large      # large-v3-turbo
modal run modal_finetune_opensource.py::main_base       # base
modal run modal_finetune_opensource.py::main_tiny       # tiny
modal run modal_finetune_opensource.py::main_small      # small
modal run modal_finetune_opensource.py::main_medium     # medium

Parameter Overrides

Override default training parameters:

from modal_finetune_opensource import train_base

train_base.remote(
    repo_name="your-username/custom-repo-name",
    max_steps=500,                  # step-based run
    learning_rate=5e-6,
    num_train_epochs=5,             # only used when max_steps=None
    train_batch_size=16,
    gradient_accumulation_steps=1,  # effective batch size: 16 x 1 = 16
)

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| repo_name | From config | HF repo to push model to |
| max_steps | 250 | Maximum training steps (None for epoch-based) |
| learning_rate | 1e-5 | Optimizer learning rate |
| num_train_epochs | 3 | Number of epochs (when max_steps=None) |
| train_batch_size | 8 | Per-device batch size for training |
| eval_batch_size | 8 | Per-device batch size for evaluation |
| gradient_accumulation_steps | 2 | Steps to accumulate gradients |
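
To judge whether a step-based run (max_steps) or an epoch-based run (max_steps=None) suits a dataset, it helps to convert between the two. The sketch below uses the defaults from the table and assumes a single GPU. The function names are hypothetical, and the trainer's own rounding may differ slightly.

```python
import math


def steps_per_epoch(train_rows, train_batch_size=8,
                    gradient_accumulation_steps=2, num_gpus=1):
    """Approximate optimizer steps needed to see every training row once."""
    effective_batch = train_batch_size * gradient_accumulation_steps * num_gpus
    return math.ceil(train_rows / effective_batch)


def epochs_covered(max_steps, train_rows, **batch_kwargs):
    """Approximate number of passes over the data made in max_steps steps."""
    return max_steps / steps_per_epoch(train_rows, **batch_kwargs)
```

With the defaults, one optimizer step consumes 8 x 2 = 16 samples, so 250 steps cover 4,000 samples: one full pass over a 4,000-row training split, or several passes over a smaller one.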

Model Variants

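| Model | Parameters | GPU Memory | Relative Inference Speed |
| --- | --- | --- | --- |
| tiny | 39M | ~2GB | Fastest |
| base | 74M | ~3GB | Very fast |
| small | 244M | ~6GB | Fast |
| medium | 769M | ~12GB | Moderate |
| large-v3-turbo | 809M | ~14GB | Slower |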

Training Pipeline

  1. Dataset loaded from Hugging Face Hub
  2. Audio resampled to 16kHz (if necessary)
  3. 90/10 train/eval split created automatically
  4. Audio converted to mel spectrograms via feature extractor
  5. Seq2Seq training with evaluation every 50 steps
  6. Checkpoints saved every 100 steps to Modal volume
  7. Final model pushed to Hugging Face Hub

Monitoring

  • TensorBoard logs saved during training
  • Evaluation runs every 50 steps
  • Checkpoints saved every 100 steps
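
The evaluation and checkpoint intervals determine which steps produce metrics and restorable state. A small sketch, assuming both events fire on exact multiples of their interval (the function name schedule is hypothetical):

```python
def schedule(max_steps=250, eval_every=50, save_every=100):
    """List the steps at which evaluation and checkpointing would occur."""
    eval_steps = list(range(eval_every, max_steps + 1, eval_every))
    save_steps = list(range(save_every, max_steps + 1, save_every))
    return eval_steps, save_steps
```

Under that assumption, a default 250-step run evaluates five times (steps 50 through 250) but writes periodic checkpoints only at steps 100 and 200, so the last 50 steps are preserved only by the final push to the Hub.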

Resource Usage

Approximate Modal costs (A100-40GB GPU):

  • GPU rate: ~$1.10/hour
  • Training duration (250 steps): 30-90 minutes
  • Cost per run: $0.50-$2.00
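
The per-run figure follows from the hourly rate and the training duration. A minimal sketch of the arithmetic, using the approximate rate quoted above as the default (the function name run_cost is hypothetical, and the rate is not a verified price):

```python
def run_cost(minutes, hourly_rate=1.10):
    """Estimate GPU cost in dollars for a run of the given length."""
    return round(hourly_rate * minutes / 60, 2)
```

At the quoted rate, a 30-minute run comes to about $0.55 and a 90-minute run to about $1.65, which sits inside the $0.50-$2.00 range listed above.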

Troubleshooting

Dataset not found

  • Verify dataset exists on Hugging Face Hub
  • Check DATASET_NAME matches your dataset ID
  • Ensure dataset is public or token has read access

Out of memory errors

  • Reduce train_batch_size (try 4 or lower)
  • Increase gradient_accumulation_steps to compensate
  • Use a smaller model variant
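
Lowering train_batch_size changes the effective batch size unless gradient_accumulation_steps is raised to match. A sketch of the compensation, with the hypothetical helper name compensate:

```python
import math


def compensate(old_batch, old_accum, new_batch):
    """Return the accumulation steps that keep the effective batch size
    at least as large as before, given a smaller per-device batch."""
    target = old_batch * old_accum
    return math.ceil(target / new_batch)
```

Starting from the defaults (8 x 2 = 16), dropping to train_batch_size=4 calls for gradient_accumulation_steps=4, and dropping to 2 calls for 8.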

Authentication failures

  • Verify Modal secret: modal secret list
  • Check HF token has write permissions
  • Re-create secret if necessary

Slow training

  • Expected for medium/large-v3-turbo models
  • Use tiny/base for faster iteration
  • Optimize dataset format (parquet preferred)

Advanced Configuration

Custom Dependencies

Modify build_image() to add packages:

def build_image() -> modal.Image:
    img = (
        modal.Image.debian_slim(python_version="3.12")
        .apt_install("git", "ffmpeg")
        .pip_install(
            # ... existing packages
            "your-custom-package",
        )
    )
    return img

GPU Selection

Change the GPU type in the function decorator of the variant's app (the base variant is shown here):

@base_app.function(
    gpu="T4",  # Options: T4, A10G, A100-40GB, A100-80GB
    # ...
)

Training Arguments

Modify Seq2SeqTrainingArguments in _train() for control over:

  • Warmup steps
  • Learning rate scheduling
  • Evaluation strategy
  • Logging frequency

License

MIT

References