README.md
April 24, 2025 · View on GitHub
A Multilevel Attention Network with Sub-Instructions for Continuous Vision-and-Language Navigation
Setup
- Use anaconda to create a Python 3.6 environment:
conda create -n vlnce python3.6
conda activate vlnce
- Install Habitat-Sim 0.1.7:
conda install -c aihabitat -c conda-forge habitat-sim=0.1.7 headless
- Install Habitat-Lab 0.1.7:
git clone --branch v0.1.7 git@github.com:facebookresearch/habitat-lab.git
cd habitat-lab
# installs both habitat and habitat_baselines
python -m pip install -r requirements.txt
python -m pip install -r habitat_baselines/rl/requirements.txt
python -m pip install -r habitat_baselines/rl/ddppo/requirements.txt
python setup.py develop --all
Habitat v0.2.1 is also supported now!
- Clone this repository and install python requirements:
git clone https://github.com/RavenKiller/MLA.git
cd MLA
pip install -r requirements.txt
- Download Matterport3D sences:
- Get the official
download_mp.pyfrom Matterport3D project webpage - Download scene data for Habitat
# requires running with python 2.7 python download_mp.py --task habitat -o data/scene_datasets/mp3d/- Extract such that it has the form
data/scene_datasets/mp3d/{scene}/{scene}.glb. There should be 90 scenes.
- Get the official
- Download preprocessed episodes R2R_VLNCE_FSASub from here. Extrach it into
data/datasets/. - Download the depth encoder
gibson-2plus-resnet50.pthfrom here. Extract the contents todata/ddppo-models/{model}.pth.
Train, evaluate and test
run.py is the program entrance. You can run it like:
python run.py \
--exp-config {config} \
--run-type {type}
{config} should be replaced by a config file path; {type} should be train, eval or inference, meaning train models, evaluate models and test models.
Our config files is stored in mlanet/config/mla:
| File | Meaning |
|---|---|
mla.yaml | Train model |
mla_da.yaml | Train model with DAgger |
mla_aug.yaml | Train model with EnvDrop augmentation |
mla_da_aug_tune.yaml | Fine-tune model with DAgger |
mla_ppo.yaml | Fine-tune model with PPO |
mla_ablate.yaml | Ablation study |
eval_single.yaml | Evaluate and visualize a single path |
mla_real.yaml | Real-world open-loop test |
mla_alkaid.yaml | Real-world close-loop test on the Alkaid robot |
Performance
The best model on validation sets is trained with EnvDrop augmentation and then fine-tuned with DAgger and PPO. We use the same strategy to train the model submitted to the test leaderboard, but on all available data (train, val_seen and val_unseen).
| Split | TL | NE | OSR | SR | SPL |
|---|---|---|---|---|---|
| Test | 7.42 | 6.78 | 0.39 | 0.34 | 0.32 |
| Val Unseen | 7.21 | 6.30 | 0.42 | 0.38 | 0.35 |
| Val Seen | 8.10 | 5.83 | 0.50 | 0.44 | 0.42 |
Qualitative examples
- Val unseen, episode 7. Go straight past the pool. Walk between the bar and chairs. Stop when you get to the corner of the bar. That's where you will wait. (success)
https://github.com/user-attachments/assets/9c458100-7276-4213-8a9f-e929e5166cb9
- Val unseen, episode 90. Walk into the dining area and make a right when you get to the end of the table. Walk down the hall and stand in front of the door of the dining room at the end of the hall. (failure)
https://github.com/user-attachments/assets/8332e1a4-6375-49f2-8fe6-be7e934c8a37
- Val unseen, episode 1124. Walk up the stairs and go left into the bedroom. Turn left into the bathroom. (success)
https://github.com/user-attachments/assets/dce7affc-ee1d-4437-b5cb-f1f3af0ab578
- Val unseen, episode 1584. Go left around the wooden barrier and stop once you reach the wooden barrier on the opposite corner. (failure)
https://github.com/user-attachments/assets/2e0b80bc-505d-47b4-9ff1-07c47787b881
- Val seen, episode 12. move forward in front of the television. turn left and exit the room. go down hallway and step into the bedroom on the left. (success)
https://github.com/user-attachments/assets/a9d51795-1952-4be9-b8f0-d608044cb16f
- Val seen, episode 369. Leave the playroom and walk straight ahead. Walk to the balcony across from the balcony. Stop in front of the balcony. (failure)
https://github.com/user-attachments/assets/19f36430-4281-4167-9fc6-8d0871b598bb
Real-world application
Alkaid is a self-developed interactive service robot. Here are some parameters:
- Camera: 720P resolution, 90° max FOV
- Screen: 1080P, touch screen
- Microphone: 4-microphone circular array, 61dB SNR
- Speaker: 2 stereo units, 150Hz-20kHz output
- Chassis: 2-wheel differential drive, 0.5m/s max speed, 1.2rad/s max angular speed
The model is evaluated on collected VLNCE@TJ validation set (13 examples, extraction code: evop). Demonstrations (click to watch the full video):
Checkpoints
Citation
@article{he2025multilevel,
title = {A Multilevel Attention Network with Sub-Instructions for Continuous Vision-and-Language Navigation},
author = {He, Zongtao and Wang, Liuyi and Li, Shu and Yan, Qingqing and Liu, Chengju and Chen, Qijun},
year = {2025},
month = apr,
journal = {Applied Intelligence},
volume = {55},
number = {7},
pages = {657},
issn = {1573-7497},
doi = {10.1007/s10489-025-06544-9}
}