VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

May 29, 2026

VidLaDA overview

arXiv · Hugging Face · Accepted by ICML 2026

🔥 News

  • [2026.05] 🎉 Accepted by ICML 2026.
  • [2026.01] 📄 We released the paper on arXiv.

Overview

VidLaDA is a video large language model based on diffusion language modeling. It uses bidirectional attention for global spatiotemporal aggregation and parallel token decoding for efficient video understanding.

🛠️ Requirements and Installation

Basic dependencies:

  • Python 3.10
  • PyTorch 2.5.1
  • Transformers 4.43.4
  • Triton 3.1.0
  • CUDA-compatible GPU

Create and activate a conda environment:

conda create -n vidlada python=3.10 -y
conda activate vidlada

Clone the repository:

git clone https://github.com/ziHoHe/VidLaDA
cd VidLaDA

Install PyTorch. We recommend PyTorch 2.5.1:

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1

Install VidLaDA and the evaluation package:

bash init_env.sh
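
After installation, one way to confirm that the environment matches the versions listed under "Basic dependencies" is a small check script. The sketch below is not part of the repository: the helper names are illustrative, and it assumes `python` resolves to the interpreter of the active conda environment. A package that cannot be imported is reported as `missing`.

```shell
#!/usr/bin/env bash
# Sketch only: compares installed package versions with the ones this README lists.

# Succeed when the installed version equals the expected one.
# A local build suffix such as "+cu121" is ignored.
version_matches() {
  local installed="${1%%+*}"
  local expected="$2"
  [ "$installed" = "$expected" ]
}

# Print a module's __version__, or "missing" if it cannot be imported.
installed_version() {
  python -c "import $1; print($1.__version__)" 2>/dev/null || echo "missing"
}

# Report whether one module matches its expected version.
check_package() {
  local module="$1"
  local expected="$2"
  local found
  found="$(installed_version "$module")"
  if version_matches "$found" "$expected"; then
    echo "ok       $module $found"
  else
    echo "mismatch $module: found $found, expected $expected"
  fi
}

check_package torch 2.5.1
check_package transformers 4.43.4
check_package triton 3.1.0
```

A `mismatch` line does not necessarily mean the setup is broken; it only flags a difference from the versions listed above.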

🤗 Model Zoo

| Model | Checkpoint |
| --- | --- |
| VidLaDA-8B | ziHoHe/VidLaDA-8B |

🏋️ Training

We provide a training entry point based on LLaVA-Video-178K.

LLaVA-Video-178K training example

Before launching exp.sh, update the local model and dataset paths for your machine:

| Variable | Description |
| --- | --- |
| LLM_VERSION | Base checkpoint. The default is GSAI-ML/LLaDA-V. |
| VISION_MODEL_VERSION | Vision tower model name. The default is google/siglip2-so400m-patch14-384. |
| VISION_TOWER_CACHE_DIR | Optional local root for cached vision tower weights. The loader first checks ${VISION_TOWER_CACHE_DIR}/${VISION_MODEL_VERSION} and falls back to the Hugging Face model name if the local path is unavailable. |
| VIDEO_FOLDER | Root directory for LLaVA-Video-178K videos. Download from lmms-lab/LLaVA-Video-178K. |
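
The lookup order described for VISION_TOWER_CACHE_DIR can be sketched in shell as follows. This is an illustration, not the repository's loader: the function name `resolve_vision_tower` is hypothetical, and it assumes "unavailable" means that the directory does not exist or that the variable is unset.

```shell
#!/usr/bin/env bash
# Sketch of the documented lookup order; not the repository's loader code.

# Print the local cached path if that directory exists,
# otherwise print the Hugging Face model name unchanged.
resolve_vision_tower() {
  local cache_dir="$1"
  local model_name="$2"
  local candidate="${cache_dir}/${model_name}"
  if [ -n "$cache_dir" ] && [ -d "$candidate" ]; then
    echo "$candidate"
  else
    echo "$model_name"
  fi
}

VISION_MODEL_VERSION="google/siglip2-so400m-patch14-384"
resolve_vision_tower "${VISION_TOWER_CACHE_DIR:-}" "$VISION_MODEL_VERSION"
```

With VISION_TOWER_CACHE_DIR unset, the call prints the model name, which corresponds to loading from the Hugging Face Hub.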

Then replace all absolute json_path entries in train/scripts/data/exp.yaml with your downloaded LLaVA-Video-178K annotation paths.

cd train
bash scripts/video/train/exp.sh

The training script infers the number of visible GPUs with nvidia-smi and launches distributed training with torchrun. For multi-node training, pass the number of nodes after the script name and set the standard launch variables on each node:

MASTER_ADDR=<master-node-ip> MASTER_PORT=29199 RANK=<node-rank> \
bash scripts/video/train/exp.sh <num_nodes>

For example, use bash scripts/video/train/exp.sh 2 for a 2-node run, with RANK=0 on the master node and RANK=1 on the second node.
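
The launch behavior described above can be approximated as follows. This is a sketch rather than the contents of exp.sh: the helper names, the single-node fallbacks, and the `train.py` entry point are assumptions made for illustration. `nvidia-smi --list-gpus` and the `torchrun` flags shown are standard options of those tools.

```shell
#!/usr/bin/env bash
# Sketch only: approximates the launch behavior described in this README.
# Function names, defaults, and the entry point are illustrative assumptions.

# Count visible GPUs; print 0 when nvidia-smi is not installed.
count_gpus() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --list-gpus | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# Print the torchrun command for <num_nodes> nodes with <gpus_per_node> GPUs each.
# RANK, MASTER_ADDR, and MASTER_PORT fall back to single-node values when unset.
build_launch_command() {
  local num_nodes="${1:-1}"
  local gpus_per_node="$2"
  local entry_point="${3:-train.py}"  # hypothetical script name
  echo "torchrun --nnodes=${num_nodes} --nproc_per_node=${gpus_per_node} --node_rank=${RANK:-0} --master_addr=${MASTER_ADDR:-127.0.0.1} --master_port=${MASTER_PORT:-29199} ${entry_point}"
}

# Dry run: show the single-node command instead of executing it.
build_launch_command 1 "$(count_gpus)"
```

Printing the command first makes it easy to confirm on each node that the node rank and master address are what you expect before starting a long run.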

📊 Evaluation

Before running evaluation, download the benchmark datasets into your HF_HOME directory. The task loaders use HF_HOME as the base directory for cached video files. In each YAML file listed below, set the dataset_path field to your local annotation or dataset root.

Evaluation datasets:

| Task | Download | Path to update |
| --- | --- | --- |
| video_mmmu | lmms-lab/VideoMMMU | eval/lmms-eval/lmms_eval/tasks/videommmu/_default_template_yaml |
| longvideobench_val_v | longvideobench/LongVideoBench | eval/lmms-eval/lmms_eval/tasks/longvideobench/longvideobench_val_v.yaml |
| lvbench | lmms-lab/LVBench | eval/lmms-eval/lmms_eval/tasks/lvbench/lvbench.yaml |
| egoschema | lmms-lab/egoschema | eval/lmms-eval/lmms_eval/tasks/egoschema/_default_template_yaml |
| mvbench | OpenGVLab/MVBench | eval/lmms-eval/lmms_eval/tasks/mvbench/_default_template_yaml |
| mlvu_dev | sy1998/MLVU_dev | eval/lmms-eval/lmms_eval/tasks/mlvu/mlvu_dev.yaml |
| mlvu_test | sy1998/MLVU_Test | eval/lmms-eval/lmms_eval/tasks/mlvu/mlvu_test.yaml |
| videomme | lmms-lab/Video-MME | eval/lmms-eval/lmms_eval/tasks/videomme/videomme.yaml |
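
Editing dataset_path in several YAML files by hand is error-prone, so a small helper may be convenient. The sketch below assumes that dataset_path is a top-level key on its own line. The function name `set_dataset_path`, the demo file contents, and the path /data/benchmarks/Video-MME are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: point a task YAML at a local dataset root.
# Assumes "dataset_path:" is a top-level key on its own line.

set_dataset_path() {
  local yaml_file="$1"
  local new_root="$2"
  local tmp
  tmp="$(mktemp)"
  # "|" is the sed delimiter because paths contain "/".
  # new_root must not contain "|", "&", or backslashes.
  sed "s|^dataset_path:.*|dataset_path: ${new_root}|" "$yaml_file" > "$tmp"
  # Copy the result back so the original file keeps its permissions.
  cat "$tmp" > "$yaml_file"
  rm -f "$tmp"
}

# Demonstration on a throwaway file rather than a real task config.
demo_yaml="$(mktemp)"
printf 'dataset_path: lmms-lab/Video-MME\ntask: videomme\n' > "$demo_yaml"
set_dataset_path "$demo_yaml" "/data/benchmarks/Video-MME"
cat "$demo_yaml"
```

For a real config, pass one of the paths from the table as the first argument and your local dataset root as the second. Review the resulting file afterwards, since other fields may also need local paths.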

We provide evaluation code based on lmms-eval. The main VidLaDA model wrapper is:

  • llava_llada_video

The evaluation scripts are local launchers; edit cache paths and model defaults if needed.

Standard video evaluation:

cd train
bash scripts/video/eval/eval_video.sh

📌 Citation

@article{he2026vidlada,
  title={VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding},
  author={He, Zhihao and Chen, Tieyuan and Wang, Kangyu and Qin, Ziran and Shao, Yang and Gan, Chaofan and Li, Shijie and Wu, Zuxuan and Lin, Weiyao},
  journal={arXiv preprint arXiv:2601.17868},
  year={2026}
}

🙏 Acknowledgments

This codebase builds on LLaDA-V, LLaVA-NeXT, and lmms-eval. We thank the authors and maintainers for their open-source contributions.