M3SVD: Multi-Modal Multi-Scene Video Dataset

March 4, 2026

GoogleDrive Videos: https://drive.google.com/file/d/1bRoNhQBzWtj0y8CMGdXQvjbQCGAVddPX/view?usp=sharing

Images are released in the native per-frame format used directly by the VideoFusion project (frame sequences under each clip folder), so they plug into the training/testing pipeline without any conversion.
Videos pack each frame sequence into a single file (e.g., .mp4) to reduce the file count and avoid storage and hosting limits; many platforms struggle with very large numbers of small image files.

🔁 Convert Videos to Frame Sequences (Recommended)

If you download the Videos version and need per-frame image sequences (the format directly used in VideoFusion), please use:

video2img.py: https://github.com/Linfeng-Tang/M3SVD/blob/main/video2img.py

This script converts each .mp4 clip into an ordered frame sequence and restores the dataset layout for training/testing.
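For readers who want to see what such a conversion involves, here is a minimal sketch of the idea. It is not the contents of video2img.py: the function names, the zero-padded `000000.png` naming, and the choice to mirror the video folder structure are assumptions made for illustration, and it relies on OpenCV (`opencv-python`). Prefer the official script for restoring the exact dataset layout.

```python
from pathlib import Path


def frame_path(out_dir: Path, index: int, ext: str = ".png") -> Path:
    """Zero-padded file name so lexicographic order equals temporal order.

    The six-digit naming scheme is an assumption, not the dataset's own.
    """
    return out_dir / f"{index:06d}{ext}"


def video_to_frames(video: Path, out_root: Path, ext: str = ".png") -> int:
    """Write every frame of `video` to out_root/<video stem>/ and return the count."""
    # Imported lazily so frame_path() stays usable without OpenCV installed.
    import cv2

    out_dir = out_root / video.stem
    out_dir.mkdir(parents=True, exist_ok=True)

    cap = cv2.VideoCapture(str(video))
    if not cap.isOpened():
        raise IOError(f"cannot open {video}")

    count = 0
    try:
        while True:
            ok, frame = cap.read()
            if not ok:  # end of stream or decode failure
                break
            cv2.imwrite(str(frame_path(out_dir, count, ext)), frame)
            count += 1
    finally:
        cap.release()
    return count


def convert_tree(video_root: Path, out_root: Path) -> None:
    """Convert every .mp4 below video_root, mirroring its sub-folder structure."""
    for video in sorted(video_root.rglob("*.mp4")):
        target = out_root / video.parent.relative_to(video_root)
        n = video_to_frames(video, target)
        print(f"{video} -> {n} frames")
```

A call such as `convert_tree(Path("M3SVD_videos"), Path("M3SVD_frames"))` would then process a whole download; both directory names here are placeholders.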

🧾 Folder Meaning

We provide both high-quality (clean/enhanced) data and degraded data for infrared and visible modalities:

infrared_Enhance: High-quality infrared (IR) frames (clean/enhanced version).
visible_Enhance: High-quality visible (VI) frames (clean/enhanced version).
infrared_noise: Degraded infrared (IR) frames with stripe noise (a typical IR sensor degradation).
visible_Blur: Degraded visible (VI) frames with blur (e.g., motion/defocus blur).
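The four folders can be paired per clip when building a data loader. The sketch below assumes that the folders sit side by side under one root, that each contains one sub-folder per clip, and that corresponding frames share the same file name across folders. These layout details and the helper name `paired_frames` are illustrative assumptions, so check them against the extracted data.

```python
from pathlib import Path

# Folder name -> (modality, quality), following the list above.
FOLDERS = {
    "infrared_Enhance": ("infrared", "clean"),
    "visible_Enhance": ("visible", "clean"),
    "infrared_noise": ("infrared", "degraded"),
    "visible_Blur": ("visible", "degraded"),
}


def paired_frames(root: Path, clip: str,
                  ir_dir: str = "infrared_noise",
                  vi_dir: str = "visible_Blur"):
    """Return sorted (ir_path, vi_path) pairs for one clip, matched by file name.

    Assumes root/<folder>/<clip>/<frame file> with identical names across folders.
    Frames present in only one of the two folders are skipped.
    """
    ir = {p.name: p for p in (root / ir_dir / clip).iterdir() if p.is_file()}
    vi = {p.name: p for p in (root / vi_dir / clip).iterdir() if p.is_file()}
    common = sorted(ir.keys() & vi.keys())
    return [(ir[name], vi[name]) for name in common]
```

Passing `ir_dir="infrared_Enhance"` and `vi_dir="visible_Enhance"` yields the clean pairs for the same clip, which is convenient when degraded inputs and high-quality targets are needed together.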

✨ News

[2026] Our paper “VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration” has been accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026! [Paper] [Code]
[2025] The M3SVD dataset is officially released.

📖 Introduction

M3SVD (Multi-Modal Multi-Scene Video Dataset) is a large-scale infrared-visible (IR-VI) video dataset designed for:

🔥 Multi-modal video fusion
🌙 Low-light / degraded video restoration
📹 Spatio-temporal modeling research

🎥 Scenario Schematic

Visualization of representative scenarios in M3SVD. The dataset contains 220 temporally synchronized infrared-visible (IR-VI) video pairs with 153,797 aligned frames in total, captured at a resolution of 640×480 and 30 FPS.
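A quick back-of-the-envelope calculation puts these figures in perspective. It assumes the 153,797 count refers to aligned frame pairs (one infrared and one visible frame per time step); if it instead counts frames across both modalities, halve the results.

```python
NUM_PAIRS = 220          # IR-VI video pairs
TOTAL_FRAMES = 153_797   # aligned frames (assumed to mean frame pairs)
FPS = 30
WIDTH, HEIGHT = 640, 480

avg_frames_per_pair = TOTAL_FRAMES / NUM_PAIRS
total_minutes = TOTAL_FRAMES / FPS / 60
avg_clip_seconds = avg_frames_per_pair / FPS
pixels_per_frame = WIDTH * HEIGHT

print(f"average frames per pair: {avg_frames_per_pair:.1f}")  # 699.1
print(f"total footage: {total_minutes:.1f} min")              # 85.4 min
print(f"average clip length: {avg_clip_seconds:.1f} s")       # 23.3 s
print(f"pixels per frame: {pixels_per_frame}")                # 307200
```

Under that assumption, the dataset holds roughly 85 minutes of synchronized footage per modality, with clips averaging a little over 23 seconds.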

🏗 Data Processing Workflow

Dataset Comparison (vs. prior works)

🎞 Video Demo

Example sequences (GIF previews):

Demo GIF

📦 Dataset Availability

Current release: Test split
Full dataset access: Please contact linfeng0419@gmail.com

We welcome academic collaboration and research use.

VideoFusion (CVPR 2026)

Spatio-temporal collaborative network for multi-modal video fusion and restoration. [Paper] [Code]

📝 Citation

If you use M3SVD in your research, please cite:

@inproceedings{Tang2026VideoFusion,
  title     = {VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration},
  author    = {Tang, Linfeng and Wang, Yeda and Gong, Meiqi and Li, Zizhuo and Deng, Yuxin and Yi, Xunpeng and Li, Chunyu and Zhang, Hao and Xu, Han and Ma, Jiayi},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}