Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation

October 9, 2023

This repository is the official implementation of Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation (ICCV 2023 Oral).

Cross-modal alignment is a key challenge for Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or a single sub-instruction to the corresponding trajectory. However, the equally critical problem of achieving fine-grained alignment at the entity level is seldom considered. To address this problem, we propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To achieve the adaptive pre-training paradigm, we first introduce grounded entity-landmark human annotations into the Room-to-Room (R2R) dataset, named GEL-R2R. Additionally, we adopt three grounded entity-landmark adaptive pre-training objectives: 1) entity phrase prediction, 2) landmark bounding box prediction, and 3) entity-landmark semantic alignment, which explicitly supervise the learning of fine-grained cross-modal alignment between entity phrases and environment landmarks. Finally, we validate our model on two downstream benchmarks: VLN with descriptive instructions (R2R) and dialogue instructions (CVDN). Comprehensive experiments show that our GELA model achieves state-of-the-art results on both tasks, demonstrating its effectiveness and generalizability.
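To make the landmark bounding box objective a little more concrete, the sketch below scores a predicted box against an annotated one with intersection-over-union (IoU), a standard overlap measure for box prediction. The corner format `(x1, y1, x2, y2)` and the name `box_iou` are assumptions made for illustration; they are not taken from this repository, and the loss actually used in the pre-training code may differ.

```python
from typing import Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Overlap rectangle; width and height clamp to 0 when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0
```

A value of 1.0 means the predicted and annotated landmark boxes coincide, and 0.0 means they do not overlap at all.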

[Figure: GELA framework]

Requirements

  1. Install the Matterport3D simulator: follow instructions here.
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
  2. Install requirements:
conda create --name VLN-GELA python=3.8.5
conda activate VLN-GELA
pip install -r requirements.txt
  3. Download the datasets from Baidu Netdisk, including processed annotations, features and pre-trained models for the R2R and CVDN datasets. Put the data in the datasets directory.

  4. Download the GEL-R2R dataset from Baidu Netdisk. Put the data in the datasets/R2R/annotations/GELR2R directory.
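Before launching a long pre-training run, it can help to confirm that the steps above took effect. The following is a minimal sketch, not part of the repository: it checks that the two directories named above exist and that the simulator's `MatterSim` Python module can be found on the import path. The helper names `missing_dirs` and `simulator_on_path` are hypothetical, and the script assumes it is run from the repository root.

```python
import importlib.util
from pathlib import Path
from typing import List

# Directories named in the setup steps, relative to the repository root.
REQUIRED_DIRS = [
    "datasets",
    "datasets/R2R/annotations/GELR2R",
]


def missing_dirs(repo_root: str = ".") -> List[str]:
    """Return the required directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in REQUIRED_DIRS if not (root / d).is_dir()]


def simulator_on_path() -> bool:
    """Report whether the MatterSim module can be located.

    find_spec only searches the import path; it does not import the module.
    """
    return importlib.util.find_spec("MatterSim") is not None


if __name__ == "__main__":
    for d in missing_dirs():
        print(f"missing directory: {d}")
    if not simulator_on_path():
        print("MatterSim not found; check that PYTHONPATH includes "
              "Matterport3DSimulator/build")
```

If the script prints nothing, both directories are present and the simulator build directory is visible to Python. It does not verify the contents of the downloaded archives.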

Adaptive Pre-training

Grounded entity-landmark adaptive pre-training:

bash ada_pretrain_src/pretrain_r2r.sh

Fine-tuning & Evaluation

cd finetune_src
bash scripts/run_r2r.sh # or run_cvdn.sh for CVDN
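The pre-training and fine-tuning commands can also be chained from one Python entry point, so that a failed stage stops the pipeline. This is a sketch under stated assumptions rather than a script shipped with the repository: the stage names and the `run_stage` helper are hypothetical, the commands are copied from this README, and `run_cvdn.sh` is assumed to sit next to `run_r2r.sh` in `finetune_src/scripts`.

```python
import subprocess
from typing import Dict, List, Tuple

# (working directory, command) per stage, relative to the repository root.
# The location of run_cvdn.sh alongside run_r2r.sh is an assumption.
STAGES: Dict[str, Tuple[str, List[str]]] = {
    "pretrain": (".", ["bash", "ada_pretrain_src/pretrain_r2r.sh"]),
    "finetune_r2r": ("finetune_src", ["bash", "scripts/run_r2r.sh"]),
    "finetune_cvdn": ("finetune_src", ["bash", "scripts/run_cvdn.sh"]),
}


def run_stage(name: str, dry_run: bool = False) -> List[str]:
    """Run one stage from the repository root and return the command used.

    With dry_run=True the command is returned without being executed.
    """
    cwd, cmd = STAGES[name]
    if not dry_run:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed stage is not silently followed by the next one.
        subprocess.run(cmd, cwd=cwd, check=True)
    return cmd
```

For example, calling `run_stage("pretrain")` followed by `run_stage("finetune_r2r")` from the repository root runs the two stages in order; passing `dry_run=True` only returns the command, which is useful for checking paths first.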

Citation

@InProceedings{Cui_2023_ICCV,
    author    = {Cui, Yibo and Xie, Liang and Zhang, Yakun and Zhang, Meishan and Yan, Ye and Yin, Erwei},
    title     = {Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {12043-12053}
}

Acknowledgments

Our code is based on VLN-HAMT, EnvEdit and MDETR. Thanks to their authors for the great work!