MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Tracking

May 28, 2026

CVPR 2026  |  Paper  |  Project Page

Highlights

  • Template-free tracking: localizes objects using only natural language — no bounding box or visual template required at initialization
  • MVLM confidence: fuses VL correlation margins, encoder head predictions, and temporal memory into a single reliable score
  • Memory-gated tracking: dynamically switches between compact ROI search (high confidence) and global re-localization (low confidence)
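The gating idea in the last two highlights can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the top-1/top-2 margin definition, the fusion weights, and the 0.5 threshold are hypothetical and are not taken from the MVLM code or paper.

```python
from typing import List, Tuple


def correlation_margin(scores: List[float]) -> float:
    """Gap between the best and second-best correlation score (assumed definition)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to compute a margin")
    top1, top2 = sorted(scores, reverse=True)[:2]
    return top1 - top2


def fuse_confidence(
    margin: float,
    head_score: float,
    memory_score: float,
    weights: Tuple[float, float, float] = (0.4, 0.4, 0.2),  # illustrative weights
) -> float:
    """Weighted average of the three cues, clamped to [0, 1]."""
    w_margin, w_head, w_memory = weights
    fused = w_margin * margin + w_head * head_score + w_memory * memory_score
    fused /= w_margin + w_head + w_memory
    return max(0.0, min(1.0, fused))


def choose_search_mode(confidence: float, threshold: float = 0.5) -> str:
    """Compact ROI search when confident, global re-localization otherwise."""
    return "roi" if confidence >= threshold else "global"
```

A per-frame loop would call `fuse_confidence` on the current cues and pass the result to `choose_search_mode` to decide how large a region to search in the next frame.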

News

  • [2026/05] Code and models are released.
  • [2026/02] MVLM accepted to CVPR 2026.

Main Results

Tracking-by-Language (no bounding box)

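| Method | TNL2K PRE | TNL2K AUC | LaSOT PRE | LaSOT AUC | OTB99 PRE | OTB99 AUC | MGIT PRE | MGIT AUC |
|---|---|---|---|---|---|---|---|---|
| JointNLT (CVPR'23) | 55.0 | 54.6 | 59.3 | 56.9 | 77.6 | 59.2 | 43.8 | 59.2 |
| UVLTrack-B (AAAI'24) | 57.2 | 55.7 | 61.0 | 57.2 | 79.1 | 60.1 | 44.6 | 56.1 |
| MambaVLT (CVPR'25) | 58.9 | 58.4 | 57.2 | 55.8 | 79.2 | 58.9 | 50.3 | 64.6 |
| MVLM (Ours) | 60.9 | 57.8 | 65.5 | 60.7 | 84.3 | 60.7 | 55.5 | 63.5 |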

Tracking-by-BBox + Language

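| Method | TNL2K PRE | TNL2K AUC | LaSOT PRE | LaSOT AUC | OTB99 PRE | OTB99 AUC | MGIT PRE | MGIT AUC |
|---|---|---|---|---|---|---|---|---|
| SUTrack-B224 (AAAI'25) | 67.9 | 65.0 | 80.5 | 73.2 | 93.4 | 70.8 | - | - |
| MambaVLT (CVPR'25) | 69.9 | 66.5 | 71.0 | 66.6 | 94.4 | 72.2 | 58.9 | 69.9 |
| DUTrack-256 (CVPR'25) | 70.6 | 64.9 | 81.1 | 73.0 | 93.9 | 70.9 | - | - |
| MVLM (Ours) | 71.4 | 66.3 | 79.3 | 72.0 | 92.3 | 69.7 | 66.3 | 71.7 |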

Raw tracking results are available for download: Google Drive

Installation

# Clone the repository
git clone https://github.com/inha-vllab/MVLM.git
cd MVLM

# Create conda environment
conda create -n mvlm python=3.8 -y
conda activate mvlm

# Install dependencies
pip install -r requirements.txt

# Configure local dataset and output paths
python tracking/create_default_local_file.py --workspace_dir . --data_dir <path_to_data_root> --save_dir ./output

Demo Web UI Frontend (optional)

The Demo Web UI frontend requires Node.js (v18+) to build:

cd tracking/web/frontend
npm install
npm run build
cd ../../..

Model

Download the pretrained backbone and place it at pretrained/itpn/:

| Backbone | File | Download |
|---|---|---|
| FastITPN-Base | fast_itpn_base_clipl_e1600.pt | Download |

Download MVLM checkpoints:

| Model | Config | Backbone | Download |
|---|---|---|---|
| MVLM | mvlm_TF | FastITPN-B | Google Drive |

Data Preparation

Download the following datasets: LaSOT, VastTrack, TNL2K, OTB99-Lang, and MGIT.

Organize them in the following directory layout:

/path/to/data/
├── lasot/
│   ├── airplane/
│   │   ├── airplane-1/
│   │   │   ├── img/
│   │   │   ├── groundtruth.txt
│   │   │   └── nlp.txt
│   │   └── ...
│   ├── ...
│   ├── training_set.txt
│   └── testing_set.txt

├── vasttrack/
│   └── train/
│       ├── Aardwolf/
│       │   ├── Aardwolf-10/
│       │   │   ├── imgs/
│       │   │   ├── Groundtruth.txt
│       │   │   └── nlp.txt
│       └── ...

├── tnl2k/
│   ├── train/
│   │   ├── Arrow_Video_ZZ04_done/
│   │   │   ├── imgs/
│   │   │   ├── groundtruth.txt
│   │   │   └── language.txt
│   │   └── ...
│   └── test/
│       ├── Assian_video_Z03_done/
│       │   ├── imgs/
│       │   ├── groundtruth.txt
│       │   └── language.txt
│       └── ...

├── otb99_lang/
│   ├── OTB_videos/
│   │   ├── Basketball/
│   │   │   ├── img/
│   │   │   └── groundtruth_rect.txt
│   │   └── ...
│   └── OTB_query_test/
│       ├── Biker.txt
│       └── ...

└── mgit/
    ├── attribute/
    │   ├── groundtruth/
    │   │   ├── 001.txt
    │   │   └── ...
    │   └── ...
    ├── data/
    │   └── test/
    │       ├── 001/
    │       │   ├── frame_001/
    │       │   │   ├── 000000.jpg
    │       │   │   └── ...
    │       └── ...
    └── mgit_nlp/
        ├── 001.xlsx
        └── ...
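Before training or evaluation, it can help to confirm that the data root matches this layout. The following is a minimal sketch, not part of the repository: the helper name `find_missing_entries` and the `EXPECTED_LAYOUT` table are hypothetical, and only the top-level entries shown in the tree above are checked.

```python
from pathlib import Path
from typing import Dict, List

# Top-level entries copied from the directory layout above.
EXPECTED_LAYOUT: Dict[str, List[str]] = {
    "lasot": ["training_set.txt", "testing_set.txt"],
    "vasttrack": ["train"],
    "tnl2k": ["train", "test"],
    "otb99_lang": ["OTB_videos", "OTB_query_test"],
    "mgit": ["attribute", "data", "mgit_nlp"],
}


def find_missing_entries(data_root: str) -> List[str]:
    """Return the relative paths from EXPECTED_LAYOUT that are absent under data_root."""
    root = Path(data_root)
    missing: List[str] = []
    for dataset, children in EXPECTED_LAYOUT.items():
        for child in children:
            relative = Path(dataset) / child
            if not (root / relative).exists():
                missing.append(relative.as_posix())
    return missing
```

Calling `find_missing_entries("/path/to/data")` returns an empty list when every listed entry exists. Datasets that are not needed for a given run can simply be ignored in the output.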

Training

# Single GPU
python tracking/train.py \
  --script mvlm \
  --config mvlm_TF \
  --save_dir ./output --mode single

# Multi-GPU (4 GPUs, torch.distributed.launch)
python tracking/train.py \
  --script mvlm \
  --config mvlm_TF \
  --save_dir ./output --mode multiple --nproc_per_node 4

# Multi-GPU (4 GPUs, torchrun)
python tracking/train.py \
  --script mvlm \
  --config mvlm_TF \
  --save_dir ./output --mode multiple --nproc_per_node 4 --launcher torchrun

Config files are located in experiments/mvlm/. The text encoder (CLIP) is frozen by default during training. Checkpoints are saved without frozen CLIP weights to reduce file size.
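Saving a checkpoint without the frozen encoder amounts to filtering the state dict by key prefix. The sketch below shows the idea on a plain dictionary; the function name and the `text_encoder.` prefix are assumptions for illustration, and the key names used in the actual checkpoints may differ.

```python
from typing import Any, Dict


def strip_frozen_weights(
    state_dict: Dict[str, Any],
    frozen_prefix: str = "text_encoder.",  # hypothetical prefix for the frozen CLIP module
) -> Dict[str, Any]:
    """Return a copy of state_dict without the entries under frozen_prefix."""
    return {
        key: value
        for key, value in state_dict.items()
        if not key.startswith(frozen_prefix)
    }
```

A checkpoint reduced this way no longer contains every parameter of the model, so loading it with PyTorch's `load_state_dict` would need `strict=False`, with the frozen encoder initialized from its own pretrained weights.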

Evaluation

Step 1: Generate Tracking Results

Three execution modes are available via --mode:

| Mode | Launch command | Description |
|---|---|---|
| single | python | Sequential on 1 GPU (default) |
| dist | torchrun | Distributed across GPUs via torch.distributed |
| mp | python | Parallel via multiprocessing.Pool |

# Single GPU (default)
python tracking/test.py mvlm mvlm_TF \
  --dataset_name lasot --weight_path ./models/MVLM_TF.pth.tar --exp_id test_run1

# Multi-GPU with torchrun (dist)
torchrun --nproc_per_node <nproc_per_node> tracking/test.py mvlm mvlm_TF \
  --dataset_name lasot --weight_path ./models/MVLM_TF.pth.tar --num_gpus <num_gpus> --mode dist --exp_id test_run1

# Multi-GPU with multiprocessing (mp)
python tracking/test.py mvlm mvlm_TF \
  --dataset_name lasot --weight_path ./models/MVLM_TF.pth.tar --num_gpus <num_gpus> --threads <num_threads> --mode mp --exp_id test_run1

Supported --dataset_name values: tnl2k, lasot, otb99_lang, mgit.
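To evaluate on every supported dataset in one pass, the single-GPU command can be generated per dataset. This is a small convenience sketch, not a script shipped with the repository: `build_test_command` is a hypothetical helper, and it uses only the arguments shown in the single-GPU example above.

```python
import shlex
from typing import List

# Values listed as supported for --dataset_name.
DATASETS = ["tnl2k", "lasot", "otb99_lang", "mgit"]


def build_test_command(dataset: str, weight_path: str, exp_id: str) -> List[str]:
    """Build the single-GPU tracking/test.py command for one dataset."""
    if dataset not in DATASETS:
        raise ValueError(f"unsupported dataset: {dataset!r}")
    return [
        "python", "tracking/test.py", "mvlm", "mvlm_TF",
        "--dataset_name", dataset,
        "--weight_path", weight_path,
        "--exp_id", exp_id,
    ]


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run(cmd, check=True) to execute.
    for name in DATASETS:
        print(shlex.join(build_test_command(name, "./models/MVLM_TF.pth.tar", "test_run1")))
```

Using the same `exp_id` for all datasets keeps the results together under one experiment directory, which is the value the analysis step expects.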

Step 2: Compute Metrics (Precision / AUC)

After the tracking results are generated, compute precision (PRE), normalized precision (NPR), and AUC:

# --exp_id must match the value used in test.py (results are stored under results_dir/{tracker_name}/{exp_id}/)
python tracking/analysis_results.py \
  --tracker_name mvlm \
  --tracker_param mvlm_TF \
  --exp_id test_run1 \
  --dataset lasot
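For readers unfamiliar with the metrics, the sketch below shows the common OTB-style definitions of precision and success AUC for boxes in (x, y, w, h) format. It is an illustration under assumptions, not the code in analysis_results.py: the 20-pixel threshold, the 21 overlap thresholds, the strict comparison, and the box format are conventional choices that the repository may handle differently, and normalized precision is omitted.

```python
import math
from typing import List, Sequence

Box = Sequence[float]  # (x, y, w, h); assumed annotation format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between box centers, in pixels."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))


def precision(preds: List[Box], gts: List[Box], threshold: float = 20.0) -> float:
    """Fraction of frames whose center error is within the pixel threshold."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    hits = sum(center_error(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


def success_auc(preds: List[Box], gts: List[Box], steps: int = 21) -> float:
    """Mean success rate over evenly spaced IoU thresholds in [0, 1]."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)
```

Both functions operate on a single sequence. Dataset-level scores are usually obtained by averaging the per-sequence values.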

MGIT submission: To evaluate performance on the MGIT test set, submit the raw tracking results to the official VideoCube platform.

Demo

CLI Demo

# Template-free tracking (language only)
python tracking/demo.py \
  --config mvlm_TF \
  --checkpoint ./models/MVLM_TF.pth.tar \
  --video <path_to_video> \
  --text "the man in white shirt" \
  --skip-selection

# Stream results to web browser
python tracking/demo.py \
  --config mvlm_TF \
  --checkpoint ./models/MVLM_TF.pth.tar \
  --video <path_to_video> \
  --text "the man in white shirt" \
  --skip-selection --no_display --stream_port 8080
# Then open http://localhost:8080 in your browser

# Force CPU inference (no GPU required)
python tracking/demo.py \
  --config mvlm_TF \
  --checkpoint ./models/MVLM_TF.pth.tar \
  --video <path_to_video> \
  --text "the man in white shirt" \
  --skip-selection --device cpu

--skip-selection enables fully template-free mode. Omit it to select the initial bounding box interactively.

Web UI

The Demo Web UI provides a browser-based interface for interactive tracking — configure model, video, and text description through the GUI without any CLI parameters.

Note: The frontend must be built before first use. See Demo Web UI Frontend in the Installation section.

# Start the server (opens at http://localhost:8080)
python tracking/web/api.py

# Force CPU inference
python tracking/web/api.py --device cpu

Workflow:

  1. Model tab — Select config (mvlm_TF or mvlm_BBOX) and checkpoint, then click Load
  2. Video tab — Enter video path (or upload/URL/webcam), then click Load Video
  3. First frame preview appears — optionally drag to select initial ROI
  4. Enter target description and click Start Tracking
  5. Control tab — Pause/resume, switch target mid-tracking

Citation

@inproceedings{park2026mvlm,
  title={MVLM: Template-Free Tracking via Vision--Language Margin Confidence and Memory-Gated Tracking},
  author={Park, Dae-Hyeon and Baek, Mina and Ha, Jeong-Hun and Park, Chan-Seop and Ganiev, Jamshidjon and Bae, Seung-Hwan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month={June},
  year={2026},
  pages={35156-35165}
}

Acknowledgements

This codebase builds upon SUTrack, FastITPN, and OpenAI CLIP. We thank the authors for their excellent work.