HD-VILA-100M Dataset

March 27, 2026

What is HD-VILA-100M?

HD-VILA-100M is a large-scale, high-resolution, and diversified video-language dataset designed to facilitate multimodal representation learning.

Examples of video clips and ASR-generated transcriptions in the proposed HD-VILA-100M dataset.

Data statistics

The dataset contains 3.3 million high-quality videos in total, distributed in a balanced way across 15 categories.

The distribution of categories in the HD-VILA-100M dataset.

The details of our dataset are presented in the table below.

| Dataset | Domain | #Video clips | #Sentence | Avg len (sec) | Sent len | Duration (h) | Resolution |
|---|---|---|---|---|---|---|---|
| HD-VILA-100M | open | 100M | 100M | 13.4 | 32.5 | 371.5K | 720p |
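
As a quick sanity check on the figures above, the total duration can be estimated from the clip count and the average clip length. This is only a rough sketch: the variable names are illustrative, and the small gap to the reported 371.5K hours is presumably due to rounding in the published averages.

```python
# Figures taken from the statistics table.
num_clips = 100_000_000        # "100M" video clips
avg_clip_seconds = 13.4        # average clip length in seconds
reported_hours = 371_500       # "371.5K" hours

# Estimated total duration in hours (about 372.2K).
estimated_hours = num_clips * avg_clip_seconds / 3600

# Relative gap between the estimate and the reported duration.
relative_gap = abs(estimated_hours - reported_hours) / reported_hours
print(f"estimated: {estimated_hours:,.0f} h, relative gap: {relative_gap:.2%}")
```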

Download

You can download all the URLs through this link and the metadata here (updated 3/27/2026). We also provide all the timestamps needed to divide the videos into clips. The format of the data is:

{   
    'video_id':'QMi8x8o55Ns',
    'url': 'https://www.youtube.com/watch?v=QMi8x8o55Ns',
    'clip': [
                {'clip_id': 'QMi8x8o55Ns.1.mp4', 'span': ['00:00:17.759', '00:00:23.279']}
                ...
                {'clip_id': 'QMi8x8o55Ns.16.mp4', 'span': ['00:04:52.140', '00:05:03.350']}
            ],
}
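
A record in the format above can be turned into per-clip durations with a few lines of Python. The following is a minimal sketch, not part of the released tooling: `parse_timestamp` and `clip_durations` are hypothetical helper names, and the example record contains only the two clips shown above (the elided entries are left out).

```python
def parse_timestamp(ts: str) -> float:
    """Convert an 'HH:MM:SS.mmm' timestamp into seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def clip_durations(record: dict) -> dict:
    """Map each clip_id to its duration in seconds, rounded to milliseconds."""
    durations = {}
    for clip in record["clip"]:
        start, end = clip["span"]
        durations[clip["clip_id"]] = round(
            parse_timestamp(end) - parse_timestamp(start), 3
        )
    return durations


# Example record using only the two clips shown in the format description.
record = {
    "video_id": "QMi8x8o55Ns",
    "url": "https://www.youtube.com/watch?v=QMi8x8o55Ns",
    "clip": [
        {"clip_id": "QMi8x8o55Ns.1.mp4", "span": ["00:00:17.759", "00:00:23.279"]},
        {"clip_id": "QMi8x8o55Ns.16.mp4", "span": ["00:04:52.140", "00:05:03.350"]},
    ],
}

durations = clip_durations(record)
for clip_id, seconds in durations.items():
    print(clip_id, seconds)
```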

You can download the raw videos from YouTube and use src/cut_videos.py to cut the videos into clips.
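
For readers who want to see what cutting a single clip might involve, here is a rough sketch that builds an ffmpeg command from one `span`. It is not the logic of src/cut_videos.py, which may work differently. It assumes ffmpeg is installed and that the raw video has already been downloaded; `build_cut_command`, the source path, the output directory, and the choice to re-encode with libx264/aac are all assumptions made for illustration.

```python
from pathlib import Path


def build_cut_command(source_path: str, clip_id: str, span: list,
                      out_dir: str = "clips") -> list:
    """Build an ffmpeg argument list that cuts one clip from a source video.

    -ss and -to are placed after -i so they act as output options and are
    interpreted as positions in the source timeline; re-encoding (rather
    than stream copy) is chosen here so cuts need not fall on keyframes.
    """
    start, end = span
    output_path = str(Path(out_dir) / clip_id)
    return [
        "ffmpeg", "-y",
        "-i", source_path,
        "-ss", start,
        "-to", end,
        "-c:v", "libx264",
        "-c:a", "aac",
        output_path,
    ]


# Hypothetical local path for the downloaded raw video.
cmd = build_cut_command(
    "raw_videos/QMi8x8o55Ns.mp4",
    "QMi8x8o55Ns.1.mp4",
    ["00:00:17.759", "00:00:23.279"],
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the output directory exists.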

License

The license of the collected dataset is here.

Citing HD-VILA

If you find this dataset useful for your research, please consider citing our paper. :blush:

@inproceedings{xue2022hdvila,
    title={Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions},
    author={Xue, Hongwei and Hang, Tiankai and Zeng, Yanhong and Sun, Yuchong and Liu, Bei and Yang, Huan and Fu, Jianlong and Guo, Baining},
    booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2022}
}

Contact Information

For further requests about the dataset or problems using it, please contact Bei Liu (bei.liu@microsoft.com).