M3SVD: Multi-Modal Multi-Scene Video Dataset
March 4, 2026
Notes on Provided Formats (Images vs. Videos)
Baidu Netdisk Images: https://pan.baidu.com/s/1g8jixAr39n06JWPwrBE6lQ?pwd=M2VD
Baidu Netdisk Videos: https://pan.baidu.com/s/1z_kMLxYejPvt_17SNGlOTA?pwd=M2VD
GoogleDrive Videos: https://drive.google.com/file/d/1bRoNhQBzWtj0y8CMGdXQvjbQCGAVddPX/view?usp=sharing
- Images are released in the native per-frame format that is directly used by the VideoFusion project (i.e., frame sequences under each clip folder), so you can plug them into the training/testing pipeline without any extra conversion.
- Videos are additionally provided by packing each frame sequence into a single video file (e.g., .mp4). This reduces the file count and avoids storage and hosting limitations, since many platforms struggle with extremely large numbers of small image files.
Convert Videos to Frame Sequences (Recommended)
If you download the Videos version and need per-frame image sequences (the format directly used in VideoFusion), please use:
This script converts each .mp4 clip into an ordered frame sequence and restores the dataset layout for training/testing.
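The conversion script itself is not reproduced in this text, so the following is only a stand-in sketch of the same idea, not the official tool. It assumes `ffmpeg` is installed and on the `PATH`, that the videos are stored as `<videos_root>/<folder>/<clip>.mp4`, and that frames should be written as zero-padded PNG files starting at index 0. The function names, the directory layout, and the `%06d.png` naming pattern are all hypothetical; check the official script for the names VideoFusion actually expects.

```python
import subprocess
from pathlib import Path


def frame_dir_for(video_path: Path, videos_root: Path, frames_root: Path) -> Path:
    """Map <videos_root>/<folder>/<clip>.mp4 to <frames_root>/<folder>/<clip>/."""
    relative = video_path.relative_to(videos_root)
    return frames_root / relative.with_suffix("")


def build_ffmpeg_command(video_path: Path, out_dir: Path,
                         pattern: str = "%06d.png") -> list[str]:
    """Build an ffmpeg call that dumps every frame as a numbered image.

    The naming pattern is an assumption, not the dataset's documented format.
    """
    return [
        "ffmpeg",
        "-i", str(video_path),      # input clip
        "-start_number", "0",       # first frame is written as 000000.png
        str(out_dir / pattern),     # numbered output frames
    ]


def convert_all(videos_root: Path, frames_root: Path) -> None:
    """Convert every .mp4 under videos_root into a frame folder under frames_root."""
    for video_path in sorted(videos_root.rglob("*.mp4")):
        out_dir = frame_dir_for(video_path, videos_root, frames_root)
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(build_ffmpeg_command(video_path, out_dir), check=True)
```

Calling `convert_all(Path("M3SVD_videos"), Path("M3SVD_frames"))` (both paths are placeholders) would mirror the video folder tree as one frame folder per clip.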
Folder Meaning
We provide both high-quality (clean/enhanced) data and degraded data for infrared and visible modalities:
- infrared_Enhance: High-quality infrared (IR) frames (clean/enhanced version).
- visible_Enhance: High-quality visible (VI) frames (clean/enhanced version).
- infrared_noise: Degraded infrared (IR) frames with stripe noise (a typical IR sensor degradation).
- visible_Blur: Degraded visible (VI) frames with blur (e.g., motion/defocus blur).
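As a quick sanity check after downloading or converting, the four folders above can be compared clip by clip. The sketch below assumes a layout of `<root>/<folder>/<clip>/<frame files>` and assumes that every clip appears in all four folders with the same number of frames; neither assumption is stated explicitly in this README, and the function names are hypothetical.

```python
from pathlib import Path

# Folder names as listed in this README.
MODALITY_FOLDERS = (
    "infrared_Enhance",
    "visible_Enhance",
    "infrared_noise",
    "visible_Blur",
)


def count_frames(clip_dir: Path) -> int:
    """Count the files directly inside a clip folder (0 if the folder is missing)."""
    if not clip_dir.is_dir():
        return 0
    return sum(1 for p in clip_dir.iterdir() if p.is_file())


def find_mismatched_clips(root: Path) -> dict[str, dict[str, int]]:
    """Return {clip: {folder: frame_count}} for clips whose counts differ across folders."""
    clips: set[str] = set()
    for folder in MODALITY_FOLDERS:
        folder_path = root / folder
        if folder_path.is_dir():
            clips.update(p.name for p in folder_path.iterdir() if p.is_dir())

    mismatched: dict[str, dict[str, int]] = {}
    for clip in sorted(clips):
        counts = {f: count_frames(root / f / clip) for f in MODALITY_FOLDERS}
        if len(set(counts.values())) != 1:
            mismatched[clip] = counts
    return mismatched
```

An empty result means every clip has the same number of frames in all four folders; any entry in the result points to a clip that is incomplete or missing in at least one folder.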
News
- [2026] Our paper “VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration” has been accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026. [Paper] [Code]
- [2025] The M3SVD dataset is officially released.
Introduction
M3SVD (Multi-Modal Multi-Scene Video Dataset) is a large-scale infrared-visible (IR-VI) video dataset designed for:
- Multi-modal video fusion
- Low-light / degraded video restoration
- Spatio-temporal modeling research
Scenario Schematic
Visualization of representative scenarios in M3SVD. The dataset contains 220 temporally synchronized infrared-visible (IR-VI) video pairs with 153,797 aligned frames in total, captured at a resolution of 640×480 and 30 FPS.
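To put these figures in perspective, the short calculation below derives the total footage length and the mean clip length. It assumes that the 153,797 figure counts aligned IR-VI frame pairs (frames per modality) rather than the sum over both modalities; the README does not say which, so treat the results as rough estimates.

```python
NUM_PAIRS = 220          # video pairs, from the README
TOTAL_FRAMES = 153_797   # assumed to count aligned frame pairs
FPS = 30                 # capture frame rate, from the README

total_seconds = TOTAL_FRAMES / FPS
total_minutes = total_seconds / 60
mean_frames_per_pair = TOTAL_FRAMES / NUM_PAIRS
mean_seconds_per_pair = mean_frames_per_pair / FPS

print(f"total footage per modality: {total_minutes:.1f} min")
print(f"mean clip length: {mean_frames_per_pair:.0f} frames "
      f"({mean_seconds_per_pair:.1f} s)")
```

Under that assumption, the dataset holds about 85 minutes of footage per modality, with an average clip of roughly 699 frames, or about 23 seconds.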
Data Processing Workflow
Dataset Comparison (vs. prior works)
Video Demo
Example sequences (GIF previews):




Dataset Availability
- Current release: Test split
- Full dataset access: please contact linfeng0419@gmail.com.
We are open to academic collaboration and research usage.
Related Work Using M3SVD
VideoFusion (CVPR 2026)
Spatio-temporal collaborative network for multi-modal video fusion and restoration. [Paper] [Code]
Citation
If you use M3SVD in your research, please cite:
@inproceedings{Tang2026VideoFusion,
title = {VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration},
author = {Tang, Linfeng and Wang, Yeda and Gong, Meiqi and Li, Zizhuo and Deng, Yuxin and Yi, Xunpeng and Li, Chunyu and Zhang, Hao and Xu, Han and Ma, Jiayi},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}