M3SVD: Multi-Modal Multi-Scene Video Dataset

March 4, 2026


📦 Notes on Provided Formats (Images vs. Videos)

Baidu Netdisk Images: https://pan.baidu.com/s/1g8jixAr39n06JWPwrBE6lQ?pwd=M2VD

Baidu Netdisk Videos: https://pan.baidu.com/s/1z_kMLxYejPvt_17SNGlOTA?pwd=M2VD

GoogleDrive Videos: https://drive.google.com/file/d/1bRoNhQBzWtj0y8CMGdXQvjbQCGAVddPX/view?usp=sharing

  • Images are released in the native per-frame format that is directly used by the VideoFusion project (i.e., frame sequences under each clip folder), so you can plug them into the training/testing pipeline without any extra conversion.
  • Videos are additionally provided by packing each frame sequence into a single video file (e.g., .mp4) to reduce file count and avoid storage / hosting limitations (many platforms struggle with extremely large numbers of small image files).

If you download the Videos version and need per-frame image sequences (the format used directly by VideoFusion), use the accompanying conversion script. It converts each .mp4 clip into an ordered frame sequence and restores the dataset layout for training/testing.
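The conversion script itself is not reproduced here. As a rough sketch of what such a conversion could look like, the snippet below walks a directory of .mp4 clips and writes each one out as a numbered frame sequence using OpenCV. The function names, the zero-padded `000000.png` naming scheme, and the rule that `clip.mp4` becomes a folder named `clip` are all assumptions made for illustration; check them against the layout expected by the VideoFusion code before relying on the output.

```python
from pathlib import Path


def clip_output_dir(video_path: Path, videos_root: Path, frames_root: Path) -> Path:
    """Map <videos_root>/a/b/clip.mp4 to <frames_root>/a/b/clip (assumed layout)."""
    relative = video_path.relative_to(videos_root)
    return frames_root / relative.with_suffix("")


def frame_name(index: int, digits: int = 6, ext: str = ".png") -> str:
    """Zero-padded name so that lexicographic order equals temporal order."""
    return f"{index:0{digits}d}{ext}"


def extract_frames(video_path: Path, out_dir: Path) -> int:
    """Decode one clip into numbered frames and return how many were written."""
    import cv2  # opencv-python; imported lazily so the helpers above work without it

    out_dir.mkdir(parents=True, exist_ok=True)
    capture = cv2.VideoCapture(str(video_path))
    if not capture.isOpened():
        raise IOError(f"cannot open {video_path}")
    count = 0
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # end of stream (or a decode failure)
                break
            cv2.imwrite(str(out_dir / frame_name(count)), frame)
            count += 1
    finally:
        capture.release()
    return count


def convert_dataset(videos_root: Path, frames_root: Path) -> None:
    """Convert every .mp4 under videos_root, mirroring the folder structure."""
    for video_path in sorted(videos_root.rglob("*.mp4")):
        out_dir = clip_output_dir(video_path, videos_root, frames_root)
        written = extract_frames(video_path, out_dir)
        print(f"{video_path} -> {out_dir} ({written} frames)")
```

Calling `convert_dataset(Path("M3SVD_videos"), Path("M3SVD_frames"))` (both paths are placeholders) would process every clip in turn. If the clips were encoded with a lossy codec, the decoded frames may differ slightly from those in the Images release.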

🧾 Folder Meaning

We provide both high-quality (clean/enhanced) data and degraded data for infrared and visible modalities:

  • infrared_Enhance: High-quality infrared (IR) frames (clean/enhanced version).
  • visible_Enhance: High-quality visible (VI) frames (clean/enhanced version).
  • infrared_noise: Degraded infrared (IR) frames with stripe noise (a typical IR sensor degradation).
  • visible_Blur: Degraded visible (VI) frames with blur (e.g., motion/defocus blur).
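To show how the four folders relate to each other, here is a minimal sketch that collects the frames available in all of them and groups them into samples (degraded inputs plus their clean counterparts). It assumes that the four folders sit under one dataset root and share identical relative paths for corresponding frames (for example `clip_a/000000.png`). That layout, the role labels, and the function names are assumptions for illustration, not something the dataset description confirms.

```python
from pathlib import Path

# Folder names come from the list above; the (modality, quality) labels are illustrative.
FOLDERS = {
    ("infrared", "clean"): "infrared_Enhance",
    ("infrared", "degraded"): "infrared_noise",
    ("visible", "clean"): "visible_Enhance",
    ("visible", "degraded"): "visible_Blur",
}

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp"}


def list_frames(root, modality, quality):
    """Sorted frame paths of one folder, relative to that folder."""
    folder = Path(root) / FOLDERS[(modality, quality)]
    return sorted(
        p.relative_to(folder)
        for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    )


def paired_samples(root):
    """One dict per frame that exists in all four folders (assumed shared naming)."""
    common = None
    for key in FOLDERS:
        frames = set(list_frames(root, *key))
        common = frames if common is None else common & frames
    root = Path(root)
    return [
        {
            f"{modality}_{quality}": root / folder / rel
            for (modality, quality), folder in FOLDERS.items()
        }
        for rel in sorted(common)
    ]
```

Each returned dict has the keys `infrared_clean`, `infrared_degraded`, `visible_clean`, and `visible_degraded`. Frames missing from any of the four folders are skipped rather than reported, which keeps the sketch short but may hide an incomplete download.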

✨ News

  • [2026] Our paper “VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration” has been accepted by the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026! [Paper] [Code]

  • [2025] The M3SVD dataset is officially released.


📖 Introduction

M3SVD (Multi-Modal Multi-Scene Video Dataset) is a large-scale infrared-visible (IR-VI) video dataset designed for:

  • 🔥 Multi-modal video fusion
  • 🌙 Low-light / degraded video restoration
  • 📹 Spatio-temporal modeling research

🎥 Scenario Schematic

Visualization of representative scenarios in M3SVD. The dataset contains 220 temporally synchronized infrared-visible (IR-VI) video pairs with 153,797 aligned frames in total, captured at a resolution of 640×480 and 30 FPS.
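For a sense of scale, the headline numbers can be turned into per-clip averages with a few lines of arithmetic. This sketch assumes that the 153,797 figure counts aligned frame pairs (each time step counted once, not once per modality); if it counts IR and VI frames separately, the averages below should be halved.

```python
NUM_PAIRS = 220          # IR-VI video pairs
TOTAL_FRAMES = 153_797   # aligned frames (assumed: counted once per pair)
FPS = 30

avg_frames_per_clip = TOTAL_FRAMES / NUM_PAIRS
avg_clip_seconds = avg_frames_per_clip / FPS
total_minutes = TOTAL_FRAMES / FPS / 60

print(f"average frames per clip: {avg_frames_per_clip:.1f}")
print(f"average clip length:     {avg_clip_seconds:.1f} s")
print(f"total footage:           {total_minutes:.1f} min per modality")
```

Under that assumption, a typical clip is roughly 700 frames long, or a little over 23 seconds at 30 FPS.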


🏗 Data Processing Workflow

Dataset Comparison (vs. prior works)


🎞 Video Demo

Example sequences (GIF previews):

[Four demo GIFs]


📦 Dataset Availability

We are open to academic collaboration and research usage.


🔗 Related Work Using M3SVD

VideoFusion (CVPR 2026)

Spatio-temporal collaborative network for multi-modal video fusion and restoration. [Paper] [Code]


📝 Citation

If you use M3SVD in your research, please cite:

@inproceedings{Tang2026VideoFusion,
  title     = {VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration},
  author    = {Tang, Linfeng and Wang, Yeda and Gong, Meiqi and Li, Zizhuo and Deng, Yuxin and Yi, Xunpeng and Li, Chunyu and Zhang, Hao and Xu, Han and Ma, Jiayi},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}