[TCSVT2025] AVLTrack: Dynamic Sparse Learning for Aerial Vision-Language Tracking

December 1, 2025

This is the official PyTorch implementation of the 2025 IEEE Transactions on Circuits and Systems for Video Technology paper:

AVLTrack: Dynamic Sparse Learning for Aerial Vision-Language Tracking
(accepted by IEEE Transactions on Circuits and Systems for Video Technology, DOI: 10.1109/TCSVT.2025.3549953)


The paper can be downloaded from IEEE Xplore.

The models and raw results can be downloaded from [GitHub] and [BaiduYun].

The tracking demos are available on Bilibili.

Proposed modules

Requirements

Results

AUC on WebUAV-3M

| Trackers | Source | Initialize | AUC | Pre. | NPre. | cAUC | mAcc | Param. (M) | FPS |
|---|---|---|---|---|---|---|---|---|---|
| AVLTrack | Ours | NL+BB | 55.0 | 70.0 | 58.8 | 54.1 | 55.7 | 23.944 | 80.5@GPU |
| PRL-Track | IROS'24 | BB | 46.3 | 62.3 | 50.8 | 45.2 | 46.7 | 13.377 | 174.9@GPU |
| TDA-Track | IROS'24 | BB | 46.4 | 62.5 | 51.8 | 45.3 | 46.8 | 5.661 | 134.7@GPU |
| SmallTrack | TGRS'23 | BB | 50.1 | 66.2 | 55.2 | 49.2 | 50.7 | 29.394 | 72.5@GPU |
| Aba-ViTrack | ICCV'23 | BB | 53.5 | 68.3 | 57.0 | 52.6 | 54.2 | 7.979 | 86.6@GPU |

Pre. on DTB70-NLP

| Trackers | Source | Initialize | Pre. on DTB70-NLP |
|---|---|---|---|
| AVLTrack | Ours | NL+BB | 86.3 |
| MixFormerV2 | NeurIPS'23 | BB | 84.1 |
| SeqTrack-B384 | CVPR'23 | BB | 85.9 |
| MixFormer-CvT | TPAMI'24 | BB | 82.7 |
| SmallTrack | TGRS'23 | BB | 85.8 |
| Aba-ViTrack | ICCV'23 | BB | 85.9 |
| TDA-Track | IROS'24 | BB | 80.2 |
| DCPT | ICRA'24 | BB | 84.0 |
| AVTrack | ICML'24 | BB | 84.3 |
| LiteTrack | ICRA'24 | BB | 82.5 |

AUC on UAV20L-NLP

| Trackers | Source | Initialize | AUC on UAV20L-NLP |
|---|---|---|---|
| AVLTrack | Ours | NL+BB | 63.9 |
| E.T.Track | WACV'23 | BB | 60.0 |
| TaMOs-Swin-B | WACV'24 | BB | 61.2 |
| ACM-BAN | TPAMI'24 | BB | 56.0 |
| HiT-Small | ICCV'23 | BB | 63.0 |
| Aba-ViTrack | ICCV'23 | BB | 63.5 |
| PRL-Track | IROS'24 | BB | 52.0 |
| TDA-Track | IROS'24 | BB | 50.6 |
| SAM-DA-Base | ICARM'24 | BB | 55.9 |
| CGDenoiser | IROS'24 | BB | 54.7 |
| QRDT | TIM'24 | BB | 55.8 |
| DaDiff-GAT | IROS'24 | BB | 57.3 |

Note that the pretrained models above were trained on an Ubuntu 18.04 server with multiple NVIDIA A100 GPUs. For WebUAV-3M, we recommend using the official evaluation toolkit.

Set project paths

Run the following command to set the paths for this project:

python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir ./output

After running this command, you can also modify the paths by editing these two files:

lib/train/admin/local.py  # paths about training
lib/test/evaluation/local.py  # paths about testing
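As an illustration of what the generated training-side file may contain, the sketch below follows the `EnvironmentSettings` convention used by OSTrack-style codebases; the exact attribute names and paths are produced by `create_default_local_file.py`, so treat these as placeholders and check them against your generated `local.py`.

```python
import os

# Hypothetical sketch of lib/train/admin/local.py after running
# create_default_local_file.py; attribute names follow the OSTrack
# convention and may differ in the generated file.
class EnvironmentSettings:
    def __init__(self):
        self.workspace_dir = './output'   # checkpoints and logs are written here
        self.data_dir = './data'          # root directory of all tracking datasets
        self.lasot_dir = os.path.join(self.data_dir, 'lasot')
        self.got10k_dir = os.path.join(self.data_dir, 'got10k', 'train')
        self.coco_dir = os.path.join(self.data_dir, 'coco')
        self.trackingnet_dir = os.path.join(self.data_dir, 'trackingnet')
```

If a dataset lives outside `./data`, point its attribute at the actual location instead of editing the dataset loaders.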

Data Preparation

Training datasets download

Put the tracking datasets in ./data. It should look like this:

${PROJECT_ROOT}
 -- data
     -- lasot
         |-- airplane
         |-- basketball
         |-- bear
         ...
     -- got10k
         |-- test
         |-- train
         |-- val
     -- coco
         |-- annotations
         |-- images
     -- trackingnet
         |-- TRAIN_0
         |-- TRAIN_1
         ...
         |-- TRAIN_11
         |-- TEST
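A quick way to catch path mistakes before launching training is to verify the layout above programmatically. The helper below is not part of the repo; it simply checks that each required sub-directory exists under the data root.

```python
import os

# Hypothetical helper (not part of the repo): verify the ./data layout
# described above and list any missing dataset folders.
REQUIRED = [
    'lasot',
    'got10k/train', 'got10k/val', 'got10k/test',
    'coco/annotations', 'coco/images',
    'trackingnet/TRAIN_0', 'trackingnet/TEST',
]

def missing_datasets(data_root='./data'):
    """Return the required sub-directories that are absent under data_root."""
    return [d for d in REQUIRED if not os.path.isdir(os.path.join(data_root, d))]
```

Running `missing_datasets()` from the project root should return an empty list once all training datasets are in place.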

Training

  • Download the pre-trained [MAE ViT-Base weights] and put them in $PROJECT_ROOT$/AVLTrack/lib/models/pretrained_models.
  • Download the bert-base-uncased pre-trained weights from [Baidu Yun], [Gitcode], or [Hugging Face] and put them in $PROJECT_ROOT$/AVLTrack/lib/models/pretrained_models/bert-base-uncased.

1. Training with a single GPU.

cd /$PROJECT_ROOT$/AVLTrack
CUDA_VISIBLE_DEVICES=0 python tracking/train.py --script ostrack --config abavit_patch16_224_ep300 --save_dir ./output --mode single --nproc_per_node 1

2. Training with multiple GPUs.

cd /$PROJECT_ROOT$/AVLTrack
CUDA_VISIBLE_DEVICES=0,1 python tracking/train.py --script ostrack --config abavit_patch16_224_ep300 --save_dir ./output --mode multiple --nproc_per_node 2

Before training, please make sure the data path in local.py is correct.

Evaluation

Download the AVLTrack model (extraction code: 43xv) and place it in $PROJECT_ROOT$/AVLTrack/output/checkpoints/train/.

python tracking/test.py --tracker_name ostrack --tracker_param abavit_patch16_224_ep297 --dataset webuav3m --threads 2 --num_gpus 2
python tracking/analysis_results.py

Before evaluation, please make sure the data path in local.py is correct.

Test FLOPs and Speed

python tracking/profile_model.py --script ostrack --config levit_256_32x4_ep300
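The FPS figures in the tables above come down to averaging per-frame wall-clock latency. A minimal, framework-agnostic sketch of that measurement is shown below; `run_once` is a stand-in for one tracker inference step, and for a real GPU model you would also call `torch.cuda.synchronize()` before reading the clock so that asynchronous kernels are included in the timing.

```python
import time

# Minimal FPS measurement sketch: average latency over repeated calls.
# `run_once` is a placeholder for a single tracker forward pass.
def measure_fps(run_once, warmup=10, iters=100):
    for _ in range(warmup):           # warm up caches / lazy initialization
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return iters / elapsed            # frames per second
```

`profile_model.py` additionally reports FLOPs, which require model-specific instrumentation and are not covered by this sketch.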

UAV vision-language tracking datasets: DTB70-NLP, UAV20L-NLP, UAVDT-NLP, and VisDrone2019-SOT-test-dev-NLP

Since WebUAV-3M is the only existing UAV vision-language tracking dataset, we additionally construct the vision-language tracking datasets DTB70-NLP, UAV20L-NLP, UAVDT-NLP, and VisDrone2019-SOT-test-dev-NLP based on the vision-only aerial datasets DTB70, UAV20L, UAVDT, and VisDrone2019-SOT-test-dev. Note that DTB70-NLP and UAV20L-NLP do not overlap with the training set; therefore, to better evaluate model generalization, they are used only for testing.

| Dataset | #Video | #Total frame | #Mean frame | #Mean language | Download |
|---|---|---|---|---|---|
| WebUAV-3M | 780 | 3.3 M | 710 | 14.4 | Language and Image |
| DTB70-NLP | 70 | 15.8 K | 225 | 12.6 | Language and Image |
| UAV20L-NLP | 20 | 58.6 K | 2934 | 11.5 | Language and Image |

Acknowledgement

The code is based on OSTrack, All-in-One, and A-ViT.

We would like to express our sincere thanks to the contributors.

Citation

If you find this work useful for your research, please cite the following papers:

@ARTICLE{10220112,
  author={Xue, Yuanliang and Zhong, Bineng and Jin, Guodong and Shen, Tao and Tan, Lining and Li, Ning and Zheng, Yaozong},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={AVLTrack: Dynamic Sparse Learning for Aerial Vision-Language Tracking},
  year={2025},
  doi={10.1109/TCSVT.2025.3549953}}

If you have any questions about this work, please contact me at xyl_507@outlook.com.