README.md

June 18, 2026

This repository contains the official implementation of RAGTrack, the first language-aware RGBT tracking framework powered by Retrieval-Augmented Generation (RAG). We construct four RGB-T-L benchmarks by adding textual descriptions to GTOT, RGBT210, RGBT234, and LasHeR through MLLM-based annotation pipelines. We propose a framework consisting of a Multi-modal Transformer Encoder (MTE), Adaptive Token Fusion (ATF), and a Context-aware Reasoning Module (CRM). The repository includes training and evaluation code, models, and results.

🔥 Motivation

RAGTrack Motivation
Figure 1. (a) Existing RGBT trackers suffer from inadequate appearance modeling, search redundancy, and a modality gap. (b) RAGTrack introduces linguistic reasoning, dynamic token selection, and adaptive channel exchange for robust tracking.

🏗️ Framework

RAGTrack Pipeline
Figure 2. Overall framework of RAGTrack. MTE performs unified visual-language modeling, ATF dynamically selects target-relevant tokens and enables adaptive channel exchange, and CRM retrieves relevant contexts for context-aware reasoning.
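
To make the retrieval step in CRM more concrete, the following is a minimal sketch of generic top-k retrieval by cosine similarity. It is not the CRM implementation from this repository: the memory layout (a list of key/payload pairs), the function names `cosine` and `retrieve_top_k`, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, memory, k=2):
    """Return the k (payload, score) pairs whose keys are most similar to the query."""
    scored = [(cosine(query, key), idx) for idx, (key, _) in enumerate(memory)]
    # Highest score first; ties are broken by insertion order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(memory[idx][1], score) for score, idx in scored[:k]]


# Hypothetical memory bank: (key embedding, stored context) pairs.
memory = [
    ([1.0, 0.0, 0.0], "frame 10: target fully visible"),
    ([0.0, 1.0, 0.0], "frame 42: target partially occluded"),
    ([0.9, 0.1, 0.0], "frame 57: target under low illumination"),
]

for payload, score in retrieve_top_k([1.0, 0.05, 0.0], memory, k=2):
    print(f"{score:.3f}  {payload}")
```

In a real tracker the keys would be learned features rather than hand-written vectors, and the retrieved contexts would feed a reasoning module rather than a print statement.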

🔬 Adaptive Token Fusion (ATF)

Adaptive Token Fusion
Figure 3. Details of ATF. Dynamic token selection leverages text-guided attention scores to retain target-relevant tokens, while adaptive channel exchange bridges heterogeneous modality gaps.
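
The two ideas in the caption above can be sketched in a few lines of plain Python. This is an illustrative approximation under stated assumptions, not the ATF code shipped in this repository: scores are taken as a softmax over token-text dot products, and the per-channel gates and the `threshold` parameter are hypothetical stand-ins for whatever criterion the actual module learns.

```python
import math


def select_tokens(tokens, text_embedding, keep):
    """Keep the `keep` tokens with the highest text-guided scores, in original order."""
    logits = [sum(t * q for t, q in zip(tok, text_embedding)) for tok in tokens]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    kept = sorted(ranked[:keep])
    return kept, [tokens[i] for i in kept]


def exchange_channels(rgb_token, tir_token, rgb_gate, tir_gate, threshold=0.5):
    """Replace a modality's channel with the other modality's channel
    wherever that modality's own gate falls below the threshold (assumed rule)."""
    new_rgb = [t if g < threshold else r
               for r, t, g in zip(rgb_token, tir_token, rgb_gate)]
    new_tir = [r if g < threshold else t
               for r, t, g in zip(rgb_token, tir_token, tir_gate)]
    return new_rgb, new_tir


# Toy search-region tokens and a toy text embedding.
tokens = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2], [1.5, 1.5]]
kept, selected = select_tokens(tokens, [1.0, 1.0], keep=2)
print("kept token indices:", kept)

new_rgb, new_tir = exchange_channels(
    [1, 2, 3], [10, 20, 30], rgb_gate=[0.9, 0.1, 0.8], tir_gate=[0.2, 0.7, 0.9]
)
print("RGB after exchange:", new_rgb)
print("TIR after exchange:", new_tir)
```

Dropping low-scoring tokens shrinks the sequence that later layers must process, while the exchange step lets each modality borrow channels where its own signal is judged weak.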

⚙️ Installation

1. Clone the repository and create the conda environment:

git clone https://github.com/IdolLab/RAGTrack.git
cd RAGTrack
conda create -n RAGTrack python=3.10
conda activate RAGTrack

2. Download auxiliary models:

CLIP: Baidu Drive (pwd: tea6) / Google Drive
Qwen-VL: Baidu Drive (pwd: 61h4) / Google Drive

3. Install dependencies:

pip install -r requirements.txt

📁 Data Preparation

Download the following datasets and place them under ./data/:

RGB-T images: GTOT, RGBT210, RGBT234, LasHeR
Textual annotations: Baidu Drive (pwd: nayy) / Google Drive

The expected directory structure is as follows:

RAGTrack/
└── data/
    ├── GTOT/
    │   ├── BlackCar/
    │   │   ├── i/
    │   │   ├── v/
    │   │   ├── groundTruth_i.txt
    │   │   ├── groundTruth_v.txt
    │   │   ├── visible_description.txt
    │   │   └── class.txt
    │   └── ...
    ├── RGBT210/
    │   ├── afterrain/
    │   │   ├── infrared/
    │   │   ├── visible/
    │   │   ├── init.txt
    │   │   ├── visible_description.txt
    │   │   └── class.txt
    │   └── ...
    ├── RGBT234/
    │   └── ...  (same structure as RGBT210)
    └── LasHeR/
        ├── train/
        │   ├── 1boygo/
        │   │   ├── infrared/
        │   │   ├── visible/
        │   │   ├── init.txt
        │   │   └── visible_description.txt
        │   └── ...
        └── test/
            ├── 1blackteacher/
            │   ├── infrared/
            │   ├── visible/
            │   ├── init.txt
            │   ├── visible_description.txt
            │   └── class.txt
            └── ...
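
Before training, it can help to confirm that the downloaded data matches the layout above. The script below is a small sketch that is not part of this repository: `EXPECTED` and `find_missing` are hypothetical names, and the required entries are copied from the directory tree shown above (RGBT234 is assumed to mirror RGBT210, as noted there).

```python
from pathlib import Path

# Required entries per sequence folder, taken from the directory tree above.
RGBT_STYLE = ["infrared", "visible", "init.txt", "visible_description.txt"]
EXPECTED = {
    "GTOT": ["i", "v", "groundTruth_i.txt", "groundTruth_v.txt",
             "visible_description.txt", "class.txt"],
    "RGBT210": RGBT_STYLE + ["class.txt"],
    "RGBT234": RGBT_STYLE + ["class.txt"],
    "LasHeR/train": RGBT_STYLE,
    "LasHeR/test": RGBT_STYLE + ["class.txt"],
}


def find_missing(data_dir):
    """Return {relative sequence path: [missing entries]} for incomplete sequences."""
    root = Path(data_dir)
    problems = {}
    for subset, required in EXPECTED.items():
        subset_dir = root / subset
        if not subset_dir.is_dir():
            problems[subset] = ["<subset directory missing>"]
            continue
        for seq in sorted(p for p in subset_dir.iterdir() if p.is_dir()):
            missing = [name for name in required if not (seq / name).exists()]
            if missing:
                problems[str(seq.relative_to(root))] = missing
    return problems


if __name__ == "__main__":
    for sequence, missing in find_missing("./data").items():
        print(f"{sequence}: missing {', '.join(missing)}")
```

An empty report means every sequence folder contains the entries listed in the tree; the check does not inspect file contents.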

🔧 Setup & Configuration

Run the following command to initialize local paths:

python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir ./output

Alternatively, manually edit the path configs:

./lib/train/admin/local.py      # paths for training
./lib/test/evaluation/local.py  # paths for testing

🏋️ Training

1. Download the pretrained backbone:

Download the pretrained model from Baidu Drive (pwd: 3ure) or Google Drive and place it under ./pretrained/.

2. Launch training:

bash train.sh

💡 You can switch between different model variants by modifying the arguments in train.sh.

🚀 Testing

Benchmark Evaluation

1. Download the model:

Download the model from Baidu Drive (pwd: 3ure) or Google Drive and place it under ./output/.

2. Launch testing:

Modify <DATASET_PATH> and <SAVE_PATH> in ./RGBT_workspace/test_rgbt_mgpus.py, then run:

bash test.sh

Evaluation Toolkit

To evaluate results on GTOT, RGBT210, RGBT234, and LasHeR, please use the official Evaluation Toolkit.

🗂️ Benchmark Construction

We construct four RGB-T-L benchmarks by introducing high-quality textual descriptions into GTOT, RGBT210, RGBT234, and LasHeR. The annotation pipeline consists of three stages:

1. Initial Description Generation

Download Qwen2.5-VL (pwd: 8swe) and generate initial descriptions:

cd Qwen2.5-VL
python generate_description.py

2. Automatic Refinement

Download Qwen3 (pwd: nqqc) and correct the generated descriptions:

cd Qwen3
python correct_description.py

3. Human Review & Verification

We open-source an annotation review tool to facilitate efficient human verification:

🔍 AnnotationCheck

💡 The final annotations are reviewed by our annotation team to correct remaining hallucinations, grammatical errors, and mixed-language content, ensuring high-quality semantic labels.
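
As an illustration of how mixed-language content could be surfaced before human review, here is a minimal sketch. It is not the AnnotationCheck tool and not part of the pipeline above: `flag_for_review` is a hypothetical helper, and treating any CJK, kana, or Hangul character in an English description as "mixed-language" is an assumption.

```python
import re

# CJK Unified Ideographs (incl. Extension A), Japanese kana, and Hangul syllables.
NON_ENGLISH = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def flag_for_review(descriptions):
    """Return (index, reason) pairs for descriptions that likely need human attention."""
    flagged = []
    for idx, text in enumerate(descriptions):
        stripped = text.strip()
        if not stripped:
            flagged.append((idx, "empty"))
        elif NON_ENGLISH.search(stripped):
            flagged.append((idx, "mixed-language"))
    return flagged


samples = [
    "A black car driving on the road.",
    "A 行人 walking on the street.",
    "   ",
]
for idx, reason in flag_for_review(samples):
    print(f"description {idx}: {reason}")
```

A filter like this only catches surface-level issues; hallucinated content and subtle grammatical errors still require a human reader.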

Annotation Statistics

Dataset	Split	Sequences	Descriptions
LasHeR	Train	979	514,081
LasHeR / GTOT / RGBT210 / RGBT234	Test	739	739

Annotation Team

We sincerely thank our annotation team for their dedicated efforts:


Lixin Wang （王利鑫）	Lihong Huang （黄立宏）	Zhongtian Li （李中天）	Zijing Gong （巩子靖）	Yilin Zhang （张翼麟）
Yangyang Liu （刘杨杨）	Yuhang Zhang （张宇航）	Chaoyuan Liu （刘超远）	Jiani Qiu （邱嘉妮）	Qing'en Zhu （祝庆恩）
Junji Wang （汪俊佶）	Xi Zhang （张熙）	Ningxin Hu （胡宁馨）	Ruoshui Qu （曲若水）	Huiyu Luo （罗珲宇）
Jian Shi （史健）	Yue Xiong （熊悦）	Shuyan Tian （田书颜）	Xuanyu Zhang （张暄雨）	Enhui Wang （王恩惠）
Qiwei Yang （杨骐玮）	Kuanxin Shen （沈宽心）	Yakun Huo （霍亚坤）	Haojing Zhou （周皓靖）	Deyu Hong （洪德宇）
Zi Wang （王子）	Xiaowen Wu （吴晓文）	Longquan Shang （尚龙泉）	Tao He （何涛）	Jinxu Zhao （赵金旭）
Yongfeng Lv （吕泳锋）	Weicheng Yi （易韦丞）	Bowen Liu （刘博文）	Xingyu Huang （黄星宇）	Minghe Chen （陈明赫）
Zixin Wu （吴子鑫）	Jiaqi Long （龙佳琪）	Sijia Cui （崔思佳）	Liyong Liu （刘礼勇）

📖 Citation

@inproceedings{li2026ragtrack,
  title={RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation},
  author={Li, Hao and Wang, Yuhao and Hao, Wenning and Zhang, Pingping and Wang, Dong and Lu, Huchuan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={28179--28189},
  year={2026}
}

Star ⭐ this repo if you like our work!