PixFoundation-2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
May 15, 2026
Official implementation of my work PixFoundation 2.0. The motion-centric adaptation code will be released soon.
Motion-Centric Benchmark (MoCentric-Bench)

Benchmarking Pixel-level video MLLMs for visual grounding (RefVOS)

Multi-modal large language models (MLLMs) have shown impressive generalization across tasks using images and text modalities. While their extension to video has enabled tasks such as video question answering and video captioning, their pixel-level visual grounding abilities are less studied. In this work, we raise the pertinent question of whether motion is used in pixel-level visual grounding and whether video MLLMs can segment objects based on natural language expressions describing their motion patterns. We identify the shortcomings in the current benchmarks, where we show that a single frame can often suffice for capturing the referring expression without any temporal reasoning. To address this, we introduce novel motion-centric probing, particularly designed for the visual grounding task, to study video MLLMs' ability to identify true motion from a fake one and their ability to grasp the ordering of motion. Consequently, we introduce MoCentric-Bench, a motion-centric benchmark designed to evaluate video MLLMs based on their ability to capture the interaction between motion and language, rather than relying primarily on static cues. We further establish strong single-image baselines that either match or surpass prior methods. Finally, we explore a simple motion-centric adaptation that provides state-of-the-art performance on our MoCentric-Bench.
Installation
- Clone the repository recursively to include the submodules
git clone --recursive https://github.com/MSiam/PixFoundation-2.0
- Create conda environment
conda create --name pixfoundation2 python=3.10
conda activate pixfoundation2
- Install the base requirements for evaluations
pip install -r requirements.txt
- Set up detectron2, which is used for some utilities (optional)
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
- Follow the installation steps in the README of each model you are evaluating and, if necessary, create a separate conda environment for it.
- Refer to scripts/run_X.sh for the script of each model X. Modify the conda environment in it if needed, or use the common pixfoundation2 environment.
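If the repository was cloned without --recursive, the submodule directories will be empty. The following is a minimal sketch for spotting that case using standard git commands. The helper name filter_uninitialized is hypothetical and not part of this repository; it relies on `git submodule status` prefixing uninitialized entries with "-".

```shell
# Hypothetical helper (not part of this repository).
# Reads the output of `git submodule status` on stdin and prints the
# paths of submodules that are not initialized (lines starting with "-").
filter_uninitialized() {
  awk '/^-/ {print $2}'
}
```

From the repository root, run `git submodule status --recursive | filter_uninitialized`. If it prints any paths, `git submodule update --init --recursive` fetches the missing submodules.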
Synthetic Dataset Setup
- Download MeVIS.
- Generate the multi-video layouts (layouts l2 and l3)
python data/create_benchmark_mevis.py --config-file configs/mevis.yaml --dataset_root DATA_ROOT --frames_sel_file data/mevis_keyframes.csv --output_dir OUT_DIR
python data/create_benchmark_mevis.py --config-file configs/mevis.yaml --dataset_root DATA_ROOT --frames_sel_file data/mevis_keyframes.csv --output_dir OUT_DIR --left_flag
python data/create_benchmark_mevis.py --config-file configs/mevis.yaml --dataset_root DATA_ROOT --output_dir OUT_DIR --reverse_flag
python data/create_benchmark_mevis.py --config-file configs/mevis.yaml --dataset_root DATA_ROOT --output_dir OUT_DIR --reverse_flag --left_flag
- Move the reverse expressions under data/meta_expressions_reverse_filtered.json to the valid_u subset path.
- The final MeVIS directory is as follows:
|--- MeVIS
    |--- train
        |--- JPEGImages
        |--- mask_dict.json
        |--- meta_expressions.json
    |--- valid_u
        |--- JPEGImages
        |--- mask_dict.json
        |--- meta_expressions.json
        |--- meta_expressions_reverse_filtered.json
    |--- valid_u_mocentric_tile_single
        |--- JPEGImages
    |--- valid_u_mocentric_tile_single_left
        |--- JPEGImages
    |--- valid_u_mocentric_tile_reverse
        |--- JPEGImages
    |--- valid_u_mocentric_tile_reverse_left
        |--- JPEGImages
- Visualize the synthetic dataset with the segmentation masks
python datasets_/test_loaders.py --config-file configs/mevis.yaml --dataset_root DATA_ROOT --dataset_split mevis_val_mocentric_tile_single --dataset_mask_path DATA_ROOT/valid_u/mask_dict.json --dataset_exp_path DATA_ROOT/valid_u/EXPRS_JSON --out_dir OUT_DIR --save_vis
- You can follow a similar procedure for Molmo2Track.
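The four MeVIS generation commands above differ only in their trailing flags, so they can be wrapped in a small script. The following is a sketch under stated assumptions: DATA_ROOT and OUT_DIR are the same placeholders used above, DRY_RUN is a hypothetical switch introduced only for this example, and the flag order differs slightly from the commands above, which assumes the script parses its flags independently of order.

```shell
#!/bin/sh
# Sketch: print (or run) the four benchmark generation commands.
# DRY_RUN is a hypothetical variable for this example: 1 prints the
# commands, any other value executes them.
set -eu

DATA_ROOT="${DATA_ROOT:-DATA_ROOT}"   # placeholder, as in the commands above
OUT_DIR="${OUT_DIR:-OUT_DIR}"         # placeholder, as in the commands above
DRY_RUN="${DRY_RUN:-1}"

generate() {
  # Prepend the arguments shared by all variants to the per-variant flags.
  set -- python data/create_benchmark_mevis.py \
    --config-file configs/mevis.yaml \
    --dataset_root "$DATA_ROOT" \
    --output_dir "$OUT_DIR" "$@"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

generate_all() {
  # The first two variants use the keyframe selection file.
  generate --frames_sel_file data/mevis_keyframes.csv
  generate --frames_sel_file data/mevis_keyframes.csv --left_flag
  # The reverse variants are run without it, as in the commands above.
  generate --reverse_flag
  generate --reverse_flag --left_flag
}

generate_all
```

With the default DRY_RUN=1 the script only prints the four commands, which makes it easy to check the paths before launching the actual generation with DRY_RUN=0.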
Evaluation
- For simplicity, I added the modified Sa2VA loader used in my benchmarking. Modify it as needed and use the following commit SHA
git checkout c94a50776e61515d72c5fe1839d3676e27082237
cp datasets_/sa2va_refVOS.py Sa2VA/projects/llava_sam2/evaluation/dataset/refVOS.py
- Run the common bash script to launch the benchmarking
cd scripts
bash run_all.sh
Acknowledgements
These repositories were used as part of my work and the benchmarking effort
References
@article{siam2025pixfoundation,
title={PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?},
author={Siam, Mennatullah},
journal={arXiv preprint arXiv:2509.02807},
year={2025}
}