
# A Multilevel Attention Network with Sub-Instructions for Continuous Vision-and-Language Navigation

## Setup

1. Use Anaconda to create a Python 3.6 environment:

   ```bash
   conda create -n vlnce python=3.6
   conda activate vlnce
   ```

2. Install Habitat-Sim 0.1.7:

   ```bash
   conda install -c aihabitat -c conda-forge habitat-sim=0.1.7 headless
   ```

3. Install Habitat-Lab 0.1.7:

   ```bash
   git clone --branch v0.1.7 git@github.com:facebookresearch/habitat-lab.git
   cd habitat-lab
   # installs both habitat and habitat_baselines
   python -m pip install -r requirements.txt
   python -m pip install -r habitat_baselines/rl/requirements.txt
   python -m pip install -r habitat_baselines/rl/ddppo/requirements.txt
   python setup.py develop --all
   ```

Habitat v0.2.1 is also supported now!
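
A quick way to confirm that both Habitat packages were installed correctly (a minimal sanity check, not part of the original instructions):

```bash
# Both imports should succeed inside the vlnce environment
python -c "import habitat_sim; import habitat; print('Habitat OK')"
```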

4. Clone this repository and install Python requirements:

   ```bash
   git clone https://github.com/RavenKiller/MLA.git
   cd MLA
   pip install -r requirements.txt
   ```

5. Download the Matterport3D scenes:

   ```bash
   # requires running with python 2.7
   python download_mp.py --task habitat -o data/scene_datasets/mp3d/
   ```

   Extract them such that they have the form `data/scene_datasets/mp3d/{scene}/{scene}.glb`. There should be 90 scenes.

6. Download the preprocessed episodes R2R_VLNCE_FSASub from here. Extract it into `data/datasets/`.

7. Download the depth encoder gibson-2plus-resnet50.pth from here. Extract the contents to `data/ddppo-models/{model}.pth`. A sketch for checking the resulting data layout follows this list.
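
Once everything is in place, the data layout can be checked with a few shell commands (a minimal sketch; the exact episode file names under `data/datasets/` depend on how the R2R_VLNCE_FSASub archive is organized, so only the directory is listed):

```bash
# Matterport3D scenes: expect 90 .glb files
ls data/scene_datasets/mp3d/*/*.glb | wc -l

# Extracted episodes and the pretrained depth encoder
ls data/datasets/
ls data/ddppo-models/gibson-2plus-resnet50.pth
```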

## Train, evaluate and test

`run.py` is the program entry point. Run it like this:

```bash
python run.py \
  --exp-config {config} \
  --run-type {type}
```

`{config}` should be replaced by a config file path; `{type}` should be `train`, `eval`, or `inference`, which train, evaluate, or test models, respectively.

Our config files are stored in `mlanet/config/mla`:

| File | Meaning |
| --- | --- |
| mla.yaml | Train model |
| mla_da.yaml | Train model with DAgger |
| mla_aug.yaml | Train model with EnvDrop augmentation |
| mla_da_aug_tune.yaml | Fine-tune model with DAgger |
| mla_ppo.yaml | Fine-tune model with PPO |
| mla_ablate.yaml | Ablation study |
| eval_single.yaml | Evaluate and visualize a single path |
| mla_real.yaml | Real-world open-loop test |
| mla_alkaid.yaml | Real-world closed-loop test on the Alkaid robot |
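
For example, the training strategy described under Performance (EnvDrop augmentation followed by fine-tuning) could be run as below. This is a sketch assuming the config paths are given relative to the repository root and that checkpoint hand-off between stages is set inside the YAML files:

```bash
# Train with EnvDrop augmentation
python run.py --exp-config mlanet/config/mla/mla_aug.yaml --run-type train

# Fine-tune with DAgger
python run.py --exp-config mlanet/config/mla/mla_da_aug_tune.yaml --run-type train

# Evaluate on the validation splits
python run.py --exp-config mlanet/config/mla/mla_da_aug_tune.yaml --run-type eval
```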

## Performance

The best model on validation sets is trained with EnvDrop augmentation and then fine-tuned with DAgger and PPO. We use the same strategy to train the model submitted to the test leaderboard, but on all available data (train, val_seen and val_unseen).

| Split | TL | NE | OSR | SR | SPL |
| --- | --- | --- | --- | --- | --- |
| Test | 7.42 | 6.78 | 0.39 | 0.34 | 0.32 |
| Val Unseen | 7.21 | 6.30 | 0.42 | 0.38 | 0.35 |
| Val Seen | 8.10 | 5.83 | 0.50 | 0.44 | 0.42 |

TL: trajectory length (m), NE: navigation error (m), OSR: oracle success rate, SR: success rate, SPL: success weighted by path length.

## Qualitative examples

  • Val unseen, episode 7. Go straight past the pool. Walk between the bar and chairs. Stop when you get to the corner of the bar. That's where you will wait. (success)

https://github.com/user-attachments/assets/9c458100-7276-4213-8a9f-e929e5166cb9

  • Val unseen, episode 90. Walk into the dining area and make a right when you get to the end of the table. Walk down the hall and stand in front of the door of the dining room at the end of the hall. (failure)

https://github.com/user-attachments/assets/8332e1a4-6375-49f2-8fe6-be7e934c8a37

  • Val unseen, episode 1124. Walk up the stairs and go left into the bedroom. Turn left into the bathroom. (success)

https://github.com/user-attachments/assets/dce7affc-ee1d-4437-b5cb-f1f3af0ab578

  • Val unseen, episode 1584. Go left around the wooden barrier and stop once you reach the wooden barrier on the opposite corner. (failure)

https://github.com/user-attachments/assets/2e0b80bc-505d-47b4-9ff1-07c47787b881

  • Val seen, episode 12. move forward in front of the television. turn left and exit the room. go down hallway and step into the bedroom on the left. (success)

https://github.com/user-attachments/assets/a9d51795-1952-4be9-b8f0-d608044cb16f

  • Val seen, episode 369. Leave the playroom and walk straight ahead. Walk to the balcony across from the balcony. Stop in front of the balcony. (failure)

https://github.com/user-attachments/assets/19f36430-4281-4167-9fc6-8d0871b598bb

## Real-world application

alkaid_robot

Alkaid is a self-developed interactive service robot. Here are some parameters:

  • Camera: 720P resolution, 90° max FOV
  • Screen: 1080P, touch screen
  • Microphone: 4-microphone circular array, 61 dB SNR
  • Speaker: 2 stereo units, 150 Hz to 20 kHz output
  • Chassis: 2-wheel differential drive, 0.5 m/s max speed, 1.2 rad/s max angular speed

The model is evaluated on the collected VLNCE@TJ validation set (13 examples; extraction code: evop). Demonstrations (click to watch the full video):

Watch the video

## Checkpoints

[best model]

## Citation

```bibtex
@article{he2025multilevel,
  title = {A Multilevel Attention Network with Sub-Instructions for Continuous Vision-and-Language Navigation},
  author = {He, Zongtao and Wang, Liuyi and Li, Shu and Yan, Qingqing and Liu, Chengju and Chen, Qijun},
  year = {2025},
  month = apr,
  journal = {Applied Intelligence},
  volume = {55},
  number = {7},
  pages = {657},
  issn = {1573-7497},
  doi = {10.1007/s10489-025-06544-9}
}
```