VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

May 29, 2026

VidLaDA overview

🔥 News

[2026.05] 🎉 Accepted by ICML 2026.
[2026.01] 📄 We released the paper on arXiv.

VidLaDA is a video large language model based on diffusion language modeling. It uses bidirectional attention for global spatiotemporal aggregation and parallel token decoding for efficient video understanding.

🛠️ Requirements and Installation

Basic dependencies:

Python 3.10
PyTorch 2.5.1
Transformers 4.43.4
Triton 3.1.0
CUDA-compatible GPU

Create and activate a conda environment:

conda create -n vidlada python=3.10 -y
conda activate vidlada

Clone the repository:

git clone https://github.com/ziHoHe/VidLaDA
cd VidLaDA

Install PyTorch. We recommend PyTorch 2.5.1:

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1

Install VidLaDA and the evaluation package:

bash init_env.sh
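
Once the steps above have finished, the installed versions can be compared against the basic dependencies listed earlier. The helper below is only a sketch and is not part of the repository: `check_pkg` is a hypothetical name, and it assumes `python3` on `PATH` is the interpreter of the active `vidlada` environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of VidLaDA): compare an installed
# distribution against the version pinned in the dependency list.
check_pkg() {
  local pkg="$1" want="$2" have
  # importlib.metadata ships with Python 3.8+. A missing package (or a
  # missing interpreter) makes the command fail, which is reported as "missing".
  have="$(python3 -c "from importlib.metadata import version; print(version('$pkg'))" 2>/dev/null)" || have="missing"
  case "$have" in
    # Accept an exact match or a local build suffix such as 2.5.1+cu124.
    "$want"|"$want"+*) echo "ok       $pkg $have" ;;
    missing)           echo "missing  $pkg (want $want)" ;;
    *)                 echo "mismatch $pkg $have (want $want)" ;;
  esac
}

check_pkg torch 2.5.1
check_pkg transformers 4.43.4
check_pkg triton 3.1.0
```

A `mismatch` line is not necessarily an error, but it marks a version other than the ones listed under the basic dependencies.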

🤗 Model Zoo

Model	Checkpoint
VidLaDA-8B	`ziHoHe/VidLaDA-8B`

🏋️ Training

We provide a training entry point based on LLaVA-Video-178K.

LLaVA-Video-178K training example

Before launching `scripts/video/train/exp.sh`, set the following variables to the model and dataset paths on your machine:

Variable	Description
`LLM_VERSION`	Base checkpoint. The default is `GSAI-ML/LLaDA-V`.
`VISION_MODEL_VERSION`	Vision tower model name. The default is `google/siglip2-so400m-patch14-384`.
`VISION_TOWER_CACHE_DIR`	Optional local root for cached vision tower weights. The loader first checks `${VISION_TOWER_CACHE_DIR}/${VISION_MODEL_VERSION}` and falls back to the Hugging Face model name if the local path is unavailable.
`VIDEO_FOLDER`	Root directory for LLaVA-Video-178K videos. Download from `lmms-lab/LLaVA-Video-178K`.
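
The lookup order described for `VISION_TOWER_CACHE_DIR` can be illustrated with a small shell function. This only mirrors the behavior stated in the table; it is not the repository's loader, and `resolve_vision_tower` is a hypothetical name. It also assumes that "unavailable" means the local directory does not exist.

```shell
#!/usr/bin/env bash
# Sketch of the lookup order described for VISION_TOWER_CACHE_DIR.
# resolve_vision_tower is a hypothetical name, not the repository's loader.
resolve_vision_tower() {
  local cache_dir="$1" model="$2"
  if [ -n "$cache_dir" ] && [ -d "$cache_dir/$model" ]; then
    # A local copy exists under the cache root: use it.
    printf '%s\n' "$cache_dir/$model"
  else
    # No cache root set, or no local copy: fall back to the Hub model name.
    printf '%s\n' "$model"
  fi
}

resolve_vision_tower "${VISION_TOWER_CACHE_DIR:-}" \
  "${VISION_MODEL_VERSION:-google/siglip2-so400m-patch14-384}"
```

Running it with the same variables as `exp.sh` prints the path or model name that this lookup order would select.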

Then replace all absolute `json_path` entries in `train/scripts/data/exp.yaml` with the paths to your downloaded LLaVA-Video-178K annotations.
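
If the annotation files keep their relative layout and only the root directory differs, the replacement can be scripted. The sketch below assumes each entry sits on its own line in the form `json_path: /abs/path/file.json`; that layout, the helper name `rewrite_json_paths`, and both root directories in the usage line are assumptions, so inspect the output before overwriting the original file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a YAML file with the leading directory of every
# json_path value swapped. Assumes one "json_path: /abs/path" entry per line.
rewrite_json_paths() {
  local old_root="$1" new_root="$2" yaml="$3"
  # "|" is the sed delimiter so the slashes in the paths need no escaping.
  # old_root is interpreted as a regular expression, so keep it a plain prefix.
  sed -E "s|^([[:space:]]*-?[[:space:]]*json_path:[[:space:]]*)${old_root}|\\1${new_root}|" "$yaml"
}

# Usage (both roots are placeholders): write to a new file, review it,
# then move it over the original.
# rewrite_json_paths /old/annotation/root /your/annotation/root \
#   train/scripts/data/exp.yaml > exp.new.yaml
```

Lines that do not start with a `json_path` key pass through unchanged.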

cd train
bash scripts/video/train/exp.sh

The training script infers the number of visible GPUs with `nvidia-smi` and launches distributed training with `torchrun`. For multi-node training, pass the number of nodes after the script name and set the standard launch variables on each node:

MASTER_ADDR=<master-node-ip> MASTER_PORT=29199 RANK=<node-rank> \
bash scripts/video/train/exp.sh <num_nodes>

For example, run `bash scripts/video/train/exp.sh 2` for a 2-node run, with `RANK=0` on the master node and `RANK=1` on the second node.
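
The launch behavior described above can be sketched as follows. This is not the content of `exp.sh`: the function names are hypothetical, `<training entry point>` is a placeholder for whatever script and arguments `exp.sh` actually passes, and the fallback values for `RANK` and `MASTER_ADDR` are assumptions for a single-node run. The command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch of the launch logic described above; not the contents of exp.sh.

# Count visible GPUs: `nvidia-smi -L` prints one "GPU n: ..." line per device.
count_gpus() {
  nvidia-smi -L 2>/dev/null | grep -c '^GPU '
}

# Print (rather than run) a torchrun command for the given node and GPU counts.
# "<training entry point>" is a placeholder; the real script and its arguments
# are defined in exp.sh.
build_launch_cmd() {
  local nnodes="${1:-1}" nproc="${2:-1}"
  echo "torchrun --nnodes=${nnodes} --nproc_per_node=${nproc}" \
       "--node_rank=${RANK:-0}" \
       "--master_addr=${MASTER_ADDR:-127.0.0.1}" \
       "--master_port=${MASTER_PORT:-29199}" \
       "<training entry point>"
}

build_launch_cmd "${1:-1}" "$(count_gpus)"
```

On a multi-node run, every node uses the same `--nnodes`, `--master_addr`, and `--master_port`; only `--node_rank` differs, which is why `RANK` has to be set per node.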

📊 Evaluation

Before running evaluation, download the benchmark datasets under your `HF_HOME` directory. The task loaders use `HF_HOME` as the base directory for cached video files. Then, in each YAML file listed below, update the `dataset_path` field to point to your local annotation/dataset root.

Evaluation datasets:

Task	Download	Path to update
`video_mmmu`	`lmms-lab/VideoMMMU`	`eval/lmms-eval/lmms_eval/tasks/videommmu/_default_template_yaml`
`longvideobench_val_v`	`longvideobench/LongVideoBench`	`eval/lmms-eval/lmms_eval/tasks/longvideobench/longvideobench_val_v.yaml`
`lvbench`	`lmms-lab/LVBench`	`eval/lmms-eval/lmms_eval/tasks/lvbench/lvbench.yaml`
`egoschema`	`lmms-lab/egoschema`	`eval/lmms-eval/lmms_eval/tasks/egoschema/_default_template_yaml`
`mvbench`	`OpenGVLab/MVBench`	`eval/lmms-eval/lmms_eval/tasks/mvbench/_default_template_yaml`
`mlvu_dev`	`sy1998/MLVU_dev`	`eval/lmms-eval/lmms_eval/tasks/mlvu/mlvu_dev.yaml`
`mlvu_test`	`sy1998/MLVU_Test`	`eval/lmms-eval/lmms_eval/tasks/mlvu/mlvu_test.yaml`
`videomme`	`lmms-lab/Video-MME`	`eval/lmms-eval/lmms_eval/tasks/videomme/videomme.yaml`
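
One way to fetch the repositories in the Download column is the `huggingface-cli download` command from `huggingface_hub`, sketched below. The function name is hypothetical, and the commands are printed rather than executed so they can be reviewed first. By default the CLI stores files under `$HF_HOME/hub`; whether each task loader expects that layout or a different subdirectory is not stated here, so check it against the task's YAML file.

```shell
#!/usr/bin/env bash
# Dataset repositories from the Download column of the table above.
EVAL_DATASETS="lmms-lab/VideoMMMU longvideobench/LongVideoBench lmms-lab/LVBench lmms-lab/egoschema OpenGVLab/MVBench sy1998/MLVU_dev sy1998/MLVU_Test lmms-lab/Video-MME"

# Print one download command per dataset (hypothetical helper name).
# Pipe the output to `sh` to actually run the downloads.
print_download_cmds() {
  local repo
  for repo in $EVAL_DATASETS; do
    echo "huggingface-cli download ${repo} --repo-type dataset"
  done
}

print_download_cmds
```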

We provide evaluation code based on lmms-eval. The main VidLaDA model wrapper is:

llava_llada_video

The evaluation scripts are launchers written for a local setup; edit the cache paths and model defaults in them to match your environment.

Standard video evaluation:

cd train
bash scripts/video/eval/eval_video.sh
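
To run a single task instead of the full script, the generic lmms-eval command line can be used with the `llava_llada_video` wrapper. The sketch below prints such a command rather than running it. The `pretrained=` model argument is an assumption carried over from other lmms-eval model wrappers and may not match what `llava_llada_video` accepts, the function name is hypothetical, and the output directory is a placeholder; treat `eval_video.sh` as the reference for the actual arguments.

```shell
#!/usr/bin/env bash
# Print (rather than run) an lmms-eval command for one task.
# Assumption: the wrapper takes a `pretrained=` model argument, as several
# other lmms-eval wrappers do; check eval_video.sh for the real arguments.
build_eval_cmd() {
  local task="${1:-videomme}" ckpt="${2:-ziHoHe/VidLaDA-8B}"
  echo "python3 -m lmms_eval" \
       "--model llava_llada_video" \
       "--model_args pretrained=${ckpt}" \
       "--tasks ${task}" \
       "--batch_size 1" \
       "--log_samples" \
       "--output_path ./logs/"
}

build_eval_cmd videomme
```

The task name is any entry from the Task column of the evaluation datasets table, for example `mvbench` or `mlvu_dev`.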

📌 Citation

@article{he2026vidlada,
  title={VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding},
  author={He, Zhihao and Chen, Tieyuan and Wang, Kangyu and Qin, Ziran and Shao, Yang and Gan, Chaofan and Li, Shijie and Wu, Zuxuan and Lin, Weiyao},
  journal={arXiv preprint arXiv:2601.17868},
  year={2026}
}

🙏 Acknowledgments

This codebase builds on LLaDA-V, LLaVA-NeXT, and lmms-eval. We thank the authors and maintainers for their open-source contributions.