MVLM: Template-Free Tracking via Vision–Language Margin Confidence and Memory-Gated Tracking

May 28, 2026 · View on GitHub

CVPR 2026 | Paper | Project Page

Highlights

Template-free tracking: localizes objects using only natural language — no bounding box or visual template required at initialization
MVLM confidence: fuses VL correlation margins, encoder head predictions, and temporal memory into a single reliable score
Memory-gated tracking: dynamically switches between compact ROI search (high confidence) and global re-localization (low confidence)

News

[2026/05] Code and models are released.
[2026/02] MVLM accepted to CVPR 2026.

Main Results

Tracking-by-Language (no bounding box)

Method	TNL2K PRE	TNL2K AUC	LaSOT PRE	LaSOT AUC	OTB99 PRE	OTB99 AUC	MGIT PRE	MGIT AUC
JointNLT (CVPR'23)	55.0	54.6	59.3	56.9	77.6	59.2	43.8	59.2
UVLTrack-B (AAAI'24)	57.2	55.7	61.0	57.2	79.1	60.1	44.6	56.1
MambaVLT (CVPR'25)	58.9	58.4	57.2	55.8	79.2	58.9	50.3	64.6
MVLM (Ours)	60.9	57.8	65.5	60.7	84.3	60.7	55.5	63.5

Tracking-by-BBox + Language

Method	TNL2K PRE	TNL2K AUC	LaSOT PRE	LaSOT AUC	OTB99 PRE	OTB99 AUC	MGIT PRE	MGIT AUC
SUTrack‑B224 (AAAI'25)	67.9	65.0	80.5	73.2	93.4	70.8	-	-
MambaVLT (CVPR'25)	69.9	66.5	71.0	66.6	94.4	72.2	58.9	69.9
DUTrack-256 (CVPR'25)	70.6	64.9	81.1	73.0	93.9	70.9	-	-
MVLM (Ours)	71.4	66.3	79.3	72.0	92.3	69.7	66.3	71.7

Raw tracking results are available for download: Google Drive

Installation

# Clone the repository
git clone https://github.com/inha-vllab/MVLM.git
cd MVLM

# Create conda environment
conda create -n mvlm python=3.8 -y
conda activate mvlm

# Install dependencies
pip install -r requirements.txt

# Configure local dataset and output paths
python tracking/create_default_local_file.py --workspace_dir . --data_dir <path_to_data_root> --save_dir ./output

Demo Web UI Frontend (optional)

The Demo Web UI frontend requires Node.js (v18+) to build:

cd tracking/web/frontend
npm install
npm run build
cd ../../..

Model

Download the pretrained backbone and place it at pretrained/itpn/:

Backbone	File	Download
FastITPN-Base	`fast_itpn_base_clipl_e1600.pt`	Download

Download MVLM checkpoints:

Model	Config	Backbone	Download
MVLM	`mvlm_TF`	FastITPN-B	Google Drive

Data Preparation

Download and organize the following datasets:

LaSOT — with language annotations (nlp.txt)
VastTrack
TNL2K
OTB99-Lang
MGIT

Expected directory layout:

/path/to/data/
├── lasot/
│   ├── airplane/
│   │   ├── airplane-1/
│   │   │   ├── img/
│   │   │   ├── groundtruth.txt
│   │   │   └── nlp.txt
│   │   └── ...
│   ├── ...
│   ├── training_set.txt
│   └── testing_set.txt
│
├── vasttrack/
│   └── train/
│       ├── Aardwolf/
│       │   ├── Aardwolf-10/
│       │   │   ├── imgs/
│       │   │   ├── Groundtruth.txt
│       │   │   └── nlp.txt
│       └── ...
│
├── tnl2k/
│   ├── train/
│   │   ├── Arrow_Video_ZZ04_done/
│   │   │   ├── imgs/
│   │   │   ├── groundtruth.txt
│   │   │   └── language.txt
│   │   └── ...
│   └── test/
│       ├── Assian_video_Z03_done/
│       │   ├── imgs/
│       │   ├── groundtruth.txt
│       │   └── language.txt
│       └── ...
│
├── otb99_lang/
│   ├── OTB_videos/
│   │   ├── Basketball/
│   │   │   ├── img/
│   │   │   └── groundtruth_rect.txt
│   │   └── ...
│   └── OTB_query_test/
│       ├── Biker.txt
│       └── ...
│
└── mgit/
    ├── attribute/
    │   ├── groundtruth/
    │   │   ├── 001.txt
    │   │   └── ...
    │   └── ...
    ├── data/
    │   └── test/
    │       ├── 001/
    │       │   ├── frame_001/
    │       │   │   ├── 000000.jpg
    │       │   │   └── ...
    │       └── ...
    └── mgit_nlp/
        ├── 001.xlsx
        └── ...

Training

# Single GPU
python tracking/train.py \
  --script mvlm \
  --config mvlm_TF \
  --save_dir ./output --mode single

# Multi-GPU (4 GPUs, torch.distributed.launch)
python tracking/train.py \
  --script mvlm \
  --config mvlm_TF \
  --save_dir ./output --mode multiple --nproc_per_node 4

# Multi-GPU (4 GPUs, torchrun)
python tracking/train.py \
  --script mvlm \
  --config mvlm_TF \
  --save_dir ./output --mode multiple --nproc_per_node 4 --launcher torchrun

Config files are located in experiments/mvlm/. The text encoder (CLIP) is frozen by default during training. Checkpoints are saved without frozen CLIP weights to reduce file size.

Evaluation

Step 1: Generate Tracking Results

Three execution modes are available via --mode:

Mode	Launch command	Description
`single`	`python`	Sequential on 1 GPU (default)
`dist`	`torchrun`	Distributed across GPUs via `torch.distributed`
`mp`	`python`	Parallel via `multiprocessing.Pool`

# Single GPU (default)
python tracking/test.py mvlm mvlm_TF \
  --dataset_name lasot --weight_path ./models/MVLM_TF.pth.tar --exp_id test_run1

# Multi-GPU with torchrun (dist)
torchrun --nproc_per_node <nproc_per_node> tracking/test.py mvlm mvlm_TF \
  --dataset_name lasot --weight_path ./models/MVLM_TF.pth.tar --num_gpus <num_gpus> --mode dist --exp_id test_run1

# Multi-GPU with multiprocessing (mp)
python tracking/test.py mvlm mvlm_TF \
  --dataset_name lasot --weight_path ./models/MVLM_TF.pth.tar --num_gpus <num_gpus> --threads <num_threads> --mode mp --exp_id test_run1

Supported --dataset_name values: tnl2k, lasot, otb99_lang, mgit.

Step 2: Compute Metrics (Precision / AUC)

After tracking results are generated, compute PRE, NPR, AUC:

# --exp_id must match the value used in test.py (results are stored under results_dir/{tracker_name}/{exp_id}/)
python tracking/analysis_results.py \
  --tracker_name mvlm \
  --tracker_param mvlm_TF \
  --exp_id test_run1 \
  --dataset lasot

MGIT submission: To evaluate MGIT test set performance, submit raw tracking results to VideoCube Official Platform

Demo

CLI Demo

# Template-free tracking (language only)
python tracking/demo.py \
  --config mvlm_TF \
  --checkpoint ./models/MVLM_TF.pth.tar \
  --video <path_to_video> \
  --text "the man in white shirt" \
  --skip-selection

# Stream results to web browser
python tracking/demo.py \
  --config mvlm_TF \
  --checkpoint ./models/MVLM_TF.pth.tar \
  --video <path_to_video> \
  --text "the man in white shirt" \
  --skip-selection --no_display --stream_port 8080
# Then open http://localhost:8080 in your browser

# Force CPU inference (no GPU required)
python tracking/demo.py \
  --config mvlm_TF \
  --checkpoint ./models/MVLM_TF.pth.tar \
  --video <path_to_video> \
  --text "the man in white shirt" \
  --skip-selection --device cpu

--skip-selection enables fully template-free mode. Omit it to select the initial bounding box interactively.

Web UI

The Demo Web UI provides a browser-based interface for interactive tracking — configure model, video, and text description through the GUI without any CLI parameters.

Note: The frontend must be built before first use. See Demo Web UI Frontend in the Installation section.

# Start the server (opens at http://localhost:8080)
python tracking/web/api.py

# Force CPU inference
python tracking/web/api.py --device cpu

Workflow:

Model tab — Select config (mvlm_TF or mvlm_BBOX) and checkpoint, then click Load
Video tab — Enter video path (or upload/URL/webcam), then click Load Video
First frame preview appears — optionally drag to select initial ROI
Enter target description and click Start Tracking
Control tab — Pause/resume, switch target mid-tracking

Citation

@inproceedings{park2026mvlm,
  title={MVLM: Template-Free Tracking via Vision--Language Margin Confidence and Memory-Gated Tracking},
  author={Park, Dae-Hyeon and Baek, Mina and Ha, Jeong-Hun and Park, Chan-Seop and Ganiev, Jamshidjon and Bae, Seung-Hwan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month={June},
  year={2026},
  pages={35156-35165}
}

Acknowledgements

This codebase builds upon SUTrack, FastITPN, and OpenAI CLIP. We thank the authors for their excellent work.