VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM
March 10, 2026
Quick Start
Installation
conda create -n vptrack python==3.10
conda activate vptrack
cd ms-swift
conda install -c conda-forge pyarrow sentencepiece
pip install -e .
pip install "sglang[all]" -U
pip install "vllm>=0.5.1" "transformers<4.55" "trl<0.21" -U
pip install "lmdeploy>=0.5" -U
pip install autoawq -U --no-deps
pip install auto_gptq optimum bitsandbytes "gradio<5.33" -U
pip install git+https://github.com/modelscope/ms-swift.git
pip install timm -U
pip install "deepspeed" -U
pip install flash-attn==2.7.4.post1 --no-build-isolation
conda install av -c conda-forge
pip install qwen_vl_utils qwen_omni_utils decord librosa icecream soundfile -U
pip install liger_kernel nvitop pre-commit math_verify py-spy -U
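Some of the commands above pin version ranges (for example "transformers<4.55" and "trl<0.21"). After installation, those bounds can be checked with a small helper. This is only a sketch under stated assumptions: version_lt is a hypothetical name that does not come from this repository or from ms-swift, and it relies on a sort that supports -V (as in GNU coreutils).

```shell
# version_lt A B: succeed when version A sorts strictly before version B.
# Hypothetical helper (not part of VPTracker or ms-swift); assumes `sort -V`.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example usage inside the vptrack environment:
#   v="$(python -c 'import transformers; print(transformers.__version__)')"
#   version_lt "$v" 4.55 || echo "transformers $v does not satisfy <4.55"
```

The comparison is purely textual version ordering, so it is a quick sanity check rather than a replacement for pip's own dependency resolution.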
Data Preparation
data
├── tnl2k
│   ├── test
│   │   ├── advSamp_Baseball_game_002-Done
│   │   └── ...
│   └── train
│       ├── Arrow_Video_ZZ04_done
│       └── ...
└── tnllt
    ├── JE_Assian_ship_v01
    └── ...
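Before running the preparation script, it may help to confirm that the directories shown in the tree above exist. The sketch below assumes the data directory sits in the current working directory. check_layout is a hypothetical helper that is not part of the repository, and it only checks the three top-level dataset folders, not the individual sequences.

```shell
# check_layout [ROOT]: report any dataset directories missing under ROOT.
# Hypothetical helper; ROOT defaults to "data" (an assumption about where
# the datasets are placed relative to the current directory).
check_layout() {
    local root="${1:-data}"
    local missing=0
    local sub
    for sub in tnl2k/test tnl2k/train tnllt; do
        if [ ! -d "$root/$sub" ]; then
            echo "missing: $root/$sub"
            missing=1
        fi
    done
    # Return 0 when every directory exists, 1 otherwise.
    return "$missing"
}
```

A call such as check_layout data prints one line per missing directory and returns a non-zero status if anything is absent, so it can guard a later command with &&.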
Data Preparation
bash data_preparation.sh
Model Training
bash train.sh
Model Testing
bash infer.sh
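The preparation, training, and testing scripts can also be chained so that a failure in one step stops the rest. The following is a minimal sketch, assuming the three scripts are run from the repository root in the order listed in this README. run_steps is a hypothetical wrapper, not something the repository provides.

```shell
# run_steps SCRIPT...: run each script with bash, in order, and stop at the
# first one that exits with a non-zero status. Hypothetical wrapper.
run_steps() {
    local script
    for script in "$@"; do
        echo "running: $script"
        bash "$script" || { echo "failed: $script" >&2; return 1; }
    done
}

# Example usage:
#   run_steps data_preparation.sh train.sh infer.sh
```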
Checkpoints
You can download the checkpoints from Hugging Face: VPTracker
Visualization
Acknowledgments
This code is built on top of ms-swift.
Contact
Email: jcwang@stu.ecnu.edu.cn. Discussions of any kind are welcome!
Citation
If our work is useful for your research, please consider citing:
@misc{wang2025vptrackerglobalvisionlanguagetracking,
  title={VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM},
  author={Jingchao Wang and Kaiwen Zhou and Zhijian Wu and Kunhua Ji and Dingjiang Huang and Yefeng Zheng},
  year={2025},
  eprint={2512.22799},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.22799},
}