FVD (Fréchet Video Distance) Evaluation Tool
April 8, 2026 · View on GitHub
Overview
This tool computes the Fréchet Video Distance (FVD) between two sets of videos using a pre-trained I3D model (Kinetics-400, RGB stream). FVD is a distribution-level metric that measures how similar two collections of videos are — lower values indicate more similar distributions (0 = identical).
Primary Use Cases:
- Model Optimization Validation — Verify that quantized/pruned video generation models maintain output quality
- Precision Analysis — Compare BF16 vs INT8 vs INT4 generated video outputs
- Framework Comparison — Evaluate outputs across different inference backends
Key Components
| Script | Purpose |
|---|---|
compute_fvd.py | Main script — loads videos, extracts I3D features, computes FVD |
i3d_model.py | I3D (Inception-v1 Inflated 3D) model architecture and weight loading |
I3D Model Details
- Architecture: Inception-v1 inflated to 3D (Carreira & Zisserman, CVPR 2017)
- Weights:
rgb_imagenet.ptfrom pytorch-i3d (~49 MB, auto-downloaded on first run) - Feature dimension: 1024 (from the final average pooling layer)
- Input: 16-frame clips, center-cropped to 224×224, normalized to [-1, 1]
Installation
1. Create and Activate a Virtual Environment (Recommended)
python -m venv fvd_env
source fvd_env/bin/activate # Linux/macOS
# .\fvd_env\Scripts\Activate.ps1 # Windows PowerShell
2. Install Requirements
pip install -r requirements.txt
Note: For GPU acceleration, install PyTorch with CUDA support:
pip install torch --index-url https://download.pytorch.org/whl/cu129
Usage Examples
Quick Start
Compare two directories of videos:
python compute_fvd.py \
--ref-dir /path/to/reference/videos \
--gen-dir /path/to/generated/videos
The I3D weights (~49 MB) are downloaded automatically on first run and cached in ~/.cache/fvd/rgb_imagenet.pt.
Save Results to JSON
python compute_fvd.py \
--ref-dir /path/to/reference/videos \
--gen-dir /path/to/generated/videos \
--output results.json
Use a Locally Downloaded I3D Checkpoint
python compute_fvd.py \
--ref-dir /path/to/reference/videos \
--gen-dir /path/to/generated/videos \
--weights ./rgb_imagenet.pt
Increase Sample Count with Multiple Clips per Video
python compute_fvd.py \
--ref-dir /path/to/reference/videos \
--gen-dir /path/to/generated/videos \
--clips-per-video 4 \
--output results.json
Specify Device and Batch Size
python compute_fvd.py \
--ref-dir ./real --gen-dir ./fake \
--device cuda \
--batch-size 16
Explicit PCA Dimension
python compute_fvd.py \
--ref-dir ./real --gen-dir ./fake \
--pca-dim 64 \
--output results.json
Configuration Parameters
Required Parameters
| Parameter | Description |
|---|---|
--ref-dir | Directory containing reference (real) videos |
--gen-dir | Directory containing generated videos |
Optional Parameters
| Parameter | Description | Default |
|---|---|---|
--weights | Path to I3D weights file | Auto-downloaded rgb_imagenet.pt |
--device | Torch device (cuda, cpu, cuda:0) | Auto-detected |
--clip-length | Number of frames per clip | 16 |
--clips-per-video | Number of clips sampled per video | 1 |
--batch-size | Batch size for I3D inference | 8 |
--pca-dim | PCA dimension for features (0 to disable, auto-selected when clips < 1024) | Auto |
--output | Path to save JSON results | None (prints to console) |
Supported Video Formats
.mp4, .avi, .mov, .mkv, .webm, .flv, .m4v
Videos are discovered recursively under the specified directories.
Expected Output
Console Output
2025-01-15 10:30:00 | INFO | Device: cuda
2025-01-15 10:30:02 | INFO | I3D model loaded from rgb_imagenet.pt (1024-dim features)
2025-01-15 10:30:02 | INFO | Reference videos: 100
2025-01-15 10:30:02 | INFO | Generated videos: 100
Loading ref: 100%|██████████| 100/100 [00:15<00:00, 6.5video/s]
Loading gen: 100%|██████████| 100/100 [00:14<00:00, 6.8video/s]
2025-01-15 10:30:32 | INFO | Total clips — ref: 100, gen: 100
Extracting ref features: 100%|██████████| 13/13 [00:08<00:00, 1.5it/s]
Extracting gen features: 100%|██████████| 13/13 [00:07<00:00, 1.6it/s]
2025-01-15 10:30:48 | INFO | FVD = 12.3456
JSON Output
{
"fvd": 12.3456,
"ref_dir": "/path/to/reference/videos",
"gen_dir": "/path/to/generated/videos",
"num_ref_clips": 100,
"num_gen_clips": 100,
"clip_length": 16,
"clips_per_video": 1,
"feature_dim": 1024,
"pca_dim": null,
"model": "I3D (Kinetics-400, 1024-dim pool)"
}
Benchmark Results
LTX-2.3 Video Generation — PTQ vs QAD (BF16 Reference)
FVD scores comparing PTQ-quantized and QAD-quantized LTX-2.3 video generation outputs against BF16 baseline, evaluated across VBench dimensions. Lower is better.
| Category | FVD: PTQ vs BF16 ↓ | FVD: QAD vs BF16 ↓ |
|---|---|---|
| Temporal Flickering | 31.92 | 21.97 |
| Subject Dynamic Motion | 23.44 | 16.28 |
| Multiple Objects | 35.35 | 22.47 |
| Human Action | 30.08 | 21.82 |
| Object Class | 51.51 | 26.86 |
| Color | 36.52 | 25.09 |
| Spatial Relationship | 25.07 | 18.41 |
| Scene Background | 64.92 | 35.69 |
| Appearance Style | 31.08 | 20.82 |
| Temporal Style | 23.61 | 15.85 |
| Overall Consistency | 25.03 | 18.85 |
| Average | 34.41 | 22.19 |
Takeaways:
- QAD consistently outperforms PTQ across all 11 VBench dimensions, with an average FVD of 22.19 vs 34.41 (35% lower).
- The largest gap is on Scene Background (64.92 vs 35.69) and Object Class (51.51 vs 26.86), indicating PTQ degrades spatial detail fidelity more than QAD.
- Both methods perform best on Temporal Style and Subject Dynamic Motion, suggesting temporal dynamics are more robust to quantization.
Key Insights
- Lower is better: FVD = 0 means identical distributions
- Sample count matters: FVD estimates are noisy below ~256 clips; 2048+ clips recommended for publishable results. Use
--clips-per-videoto increase sample count. - PCA auto-selection: When the number of clips is less than the feature dimension (1024), PCA is automatically applied to avoid rank-deficient covariance matrices
Troubleshooting
CUDA Out of Memory
Solutions:
- Reduce batch size:
--batch-size 2 - Use CPU:
--device cpu - Close other GPU applications
No Videos Found
Ensure your video files have a supported extension (.mp4, .avi, etc.) and are located in or under the specified directory. The script searches recursively.
Noisy / Unstable FVD Values
If FVD values vary significantly between runs, you likely have too few clips. Increase the sample count:
python compute_fvd.py --ref-dir ./real --gen-dir ./fake --clips-per-video 8
References
- Unterthiner et al., "FVD: A New Metric for Video Generation", 2019
- Carreira & Zisserman, "Quo Vadis, Action Recognition?", CVPR 2017
- I3D PyTorch weights: piergiaj/pytorch-i3d