SiMO: Single-Modal-Operable Multimodal Collaborative Perception
March 18, 2026
Official PyTorch Implementation
This repository contains the official implementation of the paper "Single-Modal-Operable Multimodal Collaborative Perception" (ICLR 2026).
Update: Pretrained checkpoints are now available! Download them from our Hugging Face repository.
Abstract
Multimodal collaborative perception promises robust 3D object detection by fusing complementary sensor data from multiple connected vehicles. However, existing methods suffer from catastrophic performance degradation when one modality becomes unavailable during deployment, a common scenario in real-world autonomous driving. SiMO addresses this critical limitation through two key innovations:
- LAMMA (Length-Adaptive Multi-Modal Fusion): A novel fusion module that adaptively handles variable numbers of input modalities, operating like a parallel circuit rather than series fusion.
- PAFR Training Strategy: A four-stage training paradigm (Pretrain-Align-Fuse-Random Drop) that prevents modality competition and enables seamless single-modal operation.
| Modality | AP@30 | AP@50 | AP@70 |
|---|---|---|---|
| LiDAR + Camera | 98.30 | 97.94 | 94.64 |
| LiDAR-only | 97.32 | 97.07 | 94.06 |
| Camera-only | 80.81 | 69.63 | 44.82 |
Key Result: SiMO achieves state-of-the-art performance on OPV2V-H with graceful degradation when modalities fail.
Key Features
- Single-Modal Operability: First multimodal collaborative perception framework that maintains functional performance with any subset of modalities
- Adaptive Fusion: LAMMA module dynamically adjusts to available modalities without architecture changes
- No Modality Competition: PAFR training prevents feature suppression between modalities
- Drop-in Replacement: Compatible with existing fusion frameworks like HEAL's Pyramid Fusion
- Multi-Dataset Support: Evaluated on OPV2V-H, V2XSet, and DAIR-V2X-C
Architecture Overview
LAMMA (Length-Adaptive Multi-Modal Fusion)
LAMMA is the core fusion module that enables SiMO's single-modal operability:
Input: Camera Features (B, N, C, H, W) + LiDAR Features (B, N, C, H, W)
↓
[Positional Encoding]
↓
[Feature Projection] → Downsampling (2x)
↓
[Modality-Aware Masking] → Single-mode or Random Drop
↓
[Cross-Attention] × 2 (Camera branch + LiDAR branch)
↓
[Parallel Fusion] → Sum of attended features
↓
[Feature Recovery] → Upsampling (2x)
↓
Output: Fused Features + Single-Modal Features
Key Design Principles:
- Parallel Processing: Unlike sequential fusion, LAMMA processes modalities in parallel and sums their contributions
- Adaptive Masking: During training, random modality dropout forces the network to learn robust single-modal representations
- Cross-Attention: Each modality attends to the concatenated features of all available modalities
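The three principles above can be illustrated with a small, dependency-free sketch. It is not the code in `lamma.py`: learned projections, positional encoding, multi-head attention, and the 2x down/upsampling are all omitted, and the names `cross_attend` and `parallel_fuse` are hypothetical. It only shows the "parallel circuit" idea: each available modality attends to the concatenated tokens of all available modalities, and the branch outputs are summed, so a missing modality simply removes one branch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: every query token attends to all context tokens."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context))
                    for d in range(len(q))])
    return out


def parallel_fuse(features):
    """Fuse a dict {modality: tokens or None}; None marks an unavailable modality.

    Assumption: all available modalities share the same token count and
    feature dimension (for example, the same BEV grid after alignment).
    """
    available = [f for f in features.values() if f is not None]
    if not available:
        raise ValueError("at least one modality is required")
    if len({len(f) for f in available}) != 1:
        raise ValueError("all modalities must have the same number of tokens")
    # Context is the concatenation of tokens from every available modality.
    context = [tok for f in available for tok in f]
    # One attention branch per available modality; branch outputs are summed.
    branches = [cross_attend(f, context) for f in available]
    n_tokens, dim = len(branches[0]), len(branches[0][0])
    return [[sum(b[i][d] for b in branches) for d in range(dim)]
            for i in range(n_tokens)]


fused = parallel_fuse({"lidar": [[1.0, 0.0], [0.0, 1.0]],
                       "camera": [[0.5, 0.5], [1.0, 1.0]]})
lidar_only = parallel_fuse({"lidar": [[1.0, 0.0], [0.0, 1.0]], "camera": None})
```

Both calls return a list with the same token count and feature size, which is the property that lets the downstream backbone stay unchanged when a modality is missing.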
Integration with Pyramid Fusion
SiMO works seamlessly with HEAL's Pyramid Fusion framework:
Stage 1: Single-Modal Encoders (PointPillar for LiDAR, Lift-Splat-Shoot for Camera)
↓
Stage 2: Single-Modal Backbones (ResNet-based BEV feature extraction)
↓
Stage 3: Modality Alignment (ConvNeXt-based feature alignment)
↓
Stage 4: LAMMA Fusion (Adaptive multimodal fusion)
↓
Stage 5: Pyramid Fusion Backbone (Multi-scale collaborative aggregation)
↓
Stage 6: Detection Head (Anchor-based 3D object detection)
PAFR Training Strategy
The PAFR (Pretrain-Align-Fuse-Random Drop) strategy consists of four stages:
Stage 1: Pretrain (P)
Goal: Train single-modal feature extractors independently
# Pretrain LiDAR branch
python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/LiDAROnly/lidar_pyramid.yaml
# Pretrain Camera branch
python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/CameraOnly/camera_pyramid.yaml
Configuration: Set freeze: true for all pretrained components in subsequent stages.
Stage 2: Align (A)
Goal: Align multi-modal features to a common representation space using ConvNeXt
Key Configuration:
aligner_args:
  core_method: convnext
  freeze: true
  spatial_align: false
  args:
    num_of_blocks: 3
    dim: 64
Training: Train with both modalities, allowing the aligner to learn cross-modal feature correspondence.
Stage 3: Fuse (F)
Goal: Train LAMMA fusion module with full multimodal inputs
Key Configuration:
mm_fusion_method: 'lamma3'
lamma:
  freeze: false
  feature_stride: 2
  feat_dim: 64
  dim: 128
  heads: 2
  single_mode: false
  random_drop: false
Important: Keep random_drop: false and single_mode: false during this stage.
Stage 4: Random Drop (RD)
Goal: Fine-tune with random modality dropout to enable single-modal operation
Key Configuration:
lamma:
  random_drop: true
  lidar_drop_ratio: 0.5  # 50% chance to drop LiDAR when dropping
Training: With 50% probability, randomly drop one modality during training. This forces the network to maintain functional performance with either modality alone.
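A minimal sketch of how such a sampling rule could look is given below. It assumes one reading of the description above: a drop event happens with probability 0.5, and `lidar_drop_ratio` then decides which modality is removed. The function name and the `drop_prob` parameter are hypothetical and are not taken from the repository.

```python
import random


def sample_available_modalities(rng, drop_prob=0.5, lidar_drop_ratio=0.5):
    """Return the set of modalities kept for one training sample.

    Assumed semantics (not confirmed by the source code):
      - with probability `drop_prob`, exactly one modality is dropped;
      - given a drop, LiDAR is the dropped one with probability
        `lidar_drop_ratio`, otherwise the camera is dropped.
    At least one modality is always kept.
    """
    if rng.random() >= drop_prob:
        return {"lidar", "camera"}   # no drop for this sample
    if rng.random() < lidar_drop_ratio:
        return {"camera"}            # LiDAR dropped
    return {"lidar"}                 # camera dropped


rng = random.Random(0)
kept = [sample_available_modalities(rng) for _ in range(8)]
```

Under these assumptions, roughly half of the samples keep both modalities and about a quarter each are LiDAR-only and camera-only.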
Installation
This project is implemented based on HEAL and adopts the same environment setup. Please refer to the HEAL repository for detailed installation instructions and troubleshooting.
Prerequisites
- Python >= 3.8
- PyTorch >= 1.12.0
- CUDA >= 11.3
- spconv >= 2.0
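Before installing, a quick check like the following can show which prerequisites are already present. This is only a convenience sketch: the helper name `check_prerequisites` is hypothetical, and it tests whether the packages are importable, not whether their versions meet the minimums listed above.

```python
import importlib.util
import sys


def check_prerequisites(packages=("torch", "spconv")):
    """Report whether Python is new enough and whether each package is importable."""
    report = {"python>=3.8": sys.version_info >= (3, 8)}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(f"{item}: {'ok' if ok else 'MISSING'}")
```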
Step 1: Clone Repository
git clone https://github.com/dempsey-wen/SiMO.git
cd SiMO
Step 2: Install Dependencies
# Install PyTorch (adjust CUDA version as needed)
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
# Install spconv (for LiDAR feature extraction)
pip install spconv-cu113
# Install other requirements
pip install -r requirements.txt
Key Dependencies:
- easydict~=1.9
- opencv-python-headless~=4.5.1.48
- timm
- einops
- shapely==2.0.0
- efficientnet_pytorch==0.7.0
Step 3: Install OpenCOOD
pip install -e .
Step 4: Compile CUDA Extensions
cd opencood/pcdet_utils/pointnet2
python setup.py install
cd ../iou3d_nms
python setup.py install
cd ../../..
Data Preparation
Supported Datasets
SiMO supports the following collaborative perception datasets:
| Dataset | Scenarios | Modalities | Download |
|---|---|---|---|
| OPV2V-H | Highway, Urban | LiDAR, Camera | Link |
| V2XSet | Highway, Urban | LiDAR, Camera | Link |
| DAIR-V2X-C | Real-world | LiDAR, Camera | Link |
Directory Structure
data/
├── OPV2V/
│   ├── train/
│   ├── validate/
│   └── test/
├── V2XSet/
│   ├── train/
│   ├── validate/
│   └── test/
└── DAIR-V2X/
    └── ...
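To confirm that a local copy matches the layout above, a small check such as this one may help. The function name `missing_dataset_dirs` is hypothetical, and DAIR-V2X is left out because its subdirectories are not listed in the tree.

```python
from pathlib import Path

# Splits shown in the directory tree above; DAIR-V2X is omitted because
# its contents are elided there.
EXPECTED_SPLITS = {
    "OPV2V": ("train", "validate", "test"),
    "V2XSet": ("train", "validate", "test"),
}


def missing_dataset_dirs(root="data"):
    """Return the expected split directories that do not exist under `root`."""
    root = Path(root)
    missing = []
    for dataset, splits in EXPECTED_SPLITS.items():
        for split in splits:
            path = root / dataset / split
            if not path.is_dir():
                missing.append(str(path))
    return missing


if __name__ == "__main__":
    for path in missing_dataset_dirs("data"):
        print(f"missing: {path}")
```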
Training Commands
Complete PAFR Pipeline
Step 1: Pretrain Single-Modal Branches
# LiDAR-only pretraining (20 epochs)
python opencood/tools/train.py \
--hypes_yaml opencood/hypes_yaml/opv2v/LiDAROnly/lidar_pyramid.yaml
# Camera-only pretraining (50 epochs)
python opencood/tools/train.py \
--hypes_yaml opencood/hypes_yaml/opv2v/CameraOnly/camera_pyramid.yaml
Output: Model checkpoints saved to saved_models/opv2v_lidar_pyramid/ and saved_models/opv2v_camera_pyramid/
Step 2: Align Multi-Modal Features
This stage trains the modality aligners to align LiDAR and camera features to a common representation space.
Key Configuration: The aligner trains independently for each modality with frozen encoders and backbones.
2.1 Train LiDAR Aligner
Modify config to set single_modality: lidar and freeze camera aligner:
model:
  args:
    single_modality: lidar
    lidar_aligner:
      freeze: false
    camera_aligner:
      freeze: true  # Freeze camera aligner
    lidar_encoder:
      freeze: true  # Freeze pretrained LiDAR encoder
    lidar_backbone:
      freeze: true  # Freeze pretrained LiDAR backbone
    camera_encoder:
      freeze: true  # Freeze pretrained camera encoder
    camera_backbone:
      freeze: true  # Freeze pretrained camera backbone
Then run training:
python opencood/tools/train.py \
--hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml
2.2 Train Camera Aligner
Modify config to set single_modality: camera and freeze LiDAR aligner:
model:
  args:
    single_modality: camera
    lidar_aligner:
      freeze: true  # Freeze LiDAR aligner
    camera_aligner:
      freeze: false
    # Keep encoders and backbones frozen
Then run training:
python opencood/tools/train.py \
--hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml
Output: Checkpoints saved with trained aligners. Load these checkpoints for the next stage.
Step 3: Train LAMMA Fusion
This stage trains the LAMMA fusion module with both aligners frozen.
Key Configuration: Set single_modality: false to enable full multimodal fusion.
model:
  args:
    single_modality: false  # Enable full multimodal fusion
    lidar_aligner:
      freeze: true  # Freeze both aligners
    camera_aligner:
      freeze: true
    lamma:
      random_drop: false  # Disable random drop in this stage
      single_mode: false
Set model_dir to load the pretrained aligner checkpoints from the Align stage.
python opencood/tools/train.py \
--hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml \
--model_dir saved_models/opv2v_lidarcamera_aligned/
Step 4: Random Drop Fine-tuning
This final stage enables random modality dropout during training to ensure robust single-modal operation.
Key Configuration: Set lamma.random_drop: true to enable random dropout.
model:
  args:
    single_modality: false  # Still enable full multimodal fusion
    lamma:
      random_drop: true  # Enable random modality dropout
      lidar_drop_ratio: 0.5  # 50% probability to drop LiDAR when dropping
      single_mode: false
Then resume training with the Fusion stage checkpoint:
python opencood/tools/train.py \
--hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml \
--model_dir saved_models/opv2v_lidarcamera_lamma3_fused/
Training Notes:
- During training, with 50% probability, one modality is randomly dropped
- This forces the network to maintain functional performance with either modality alone
- The final checkpoint will have robust single-modal operability
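The per-stage settings described in Steps 2 to 4 can be summarized as a small lookup, sketched below. The helper `pafr_overrides` and the stage names are hypothetical, and the `model.args` nesting follows the snippets above rather than a verified config schema. Encoders and backbones are kept frozen in every stage, following the note that pretrained components stay frozen after pretraining.

```python
import copy

# Pretrained single-modal components stay frozen after the Pretrain stage.
FROZEN_PRETRAINED = {
    name: {"freeze": True}
    for name in ("lidar_encoder", "lidar_backbone",
                 "camera_encoder", "camera_backbone")
}


def pafr_overrides(stage):
    """Return the config overrides for one post-pretraining PAFR stage.

    `stage` is one of: "align_lidar", "align_camera", "fuse", "random_drop".
    """
    args = copy.deepcopy(FROZEN_PRETRAINED)
    if stage in ("align_lidar", "align_camera"):
        target = stage.split("_")[1]  # "lidar" or "camera"
        args["single_modality"] = target
        # Only the aligner of the selected modality is trainable.
        args["lidar_aligner"] = {"freeze": target != "lidar"}
        args["camera_aligner"] = {"freeze": target != "camera"}
    elif stage in ("fuse", "random_drop"):
        args["single_modality"] = False
        args["lidar_aligner"] = {"freeze": True}
        args["camera_aligner"] = {"freeze": True}
        args["lamma"] = {"random_drop": stage == "random_drop",
                         "single_mode": False}
        if stage == "random_drop":
            args["lamma"]["lidar_drop_ratio"] = 0.5
    else:
        raise ValueError(f"unknown stage: {stage!r}")
    return {"model": {"args": args}}


overrides = pafr_overrides("random_drop")
```

Keeping the differences between stages in one place makes it easier to see that only the aligner freeze flags and the `lamma.random_drop` switch change from one stage to the next.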
Testing Commands
Multimodal Testing (LiDAR + Camera)
python opencood/tools/inference.py \
--model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
--fusion_method intermediate
Single-Modal Testing
LiDAR-Only Inference
Modify the config to set single_modality: lidar:
model:
  args:
    single_modality: lidar
Then run inference:
python opencood/tools/inference.py \
--model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
--fusion_method intermediate
Camera-Only Inference
Modify the config to set single_modality: camera:
model:
  args:
    single_modality: camera
Then run inference:
python opencood/tools/inference.py \
--model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
--fusion_method intermediate
Evaluation with Different Ranges
python opencood/tools/inference.py \
--model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
--fusion_method intermediate \
--range 51.2,51.2
Save Visualization
python opencood/tools/inference.py \
--model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
--fusion_method intermediate \
--save_vis_interval 10
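When several evaluations are scripted, the commands above can be assembled programmatically. The sketch below uses only the flags shown in this section; the helper name `build_inference_cmd` is hypothetical, and it builds the argument list without running anything.

```python
import shlex


def build_inference_cmd(model_dir, eval_range=None, save_vis_interval=None):
    """Build the argv list for opencood/tools/inference.py from the flags shown above."""
    cmd = ["python", "opencood/tools/inference.py",
           "--model_dir", model_dir,
           "--fusion_method", "intermediate"]
    if eval_range is not None:
        # The evaluation range is passed as a comma-separated pair, e.g. 51.2,51.2.
        cmd += ["--range", ",".join(str(v) for v in eval_range)]
    if save_vis_interval is not None:
        cmd += ["--save_vis_interval", str(save_vis_interval)]
    return cmd


cmd = build_inference_cmd(
    "saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/",
    eval_range=(51.2, 51.2),
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.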
Benchmark Results
OPV2V-H Test Set
SiMO-PF (Pyramid Fusion + LAMMA)
| Method | Modality | AP@30 | AP@50 | AP@70 | Modality Drop? |
|---|---|---|---|---|---|
| SiMO-PF | LiDAR + Camera | 98.30 | 97.94 | 94.64 | No |
| SiMO-PF | LiDAR only | 97.32 | 97.07 | 94.06 | Yes |
| SiMO-PF | Camera only | 80.81 | 69.63 | 44.82 | Yes |
Key Observations:
- SiMO maintains >97% AP@50 even when operating with LiDAR alone
- Camera-only performance holds up at the loose IoU threshold (AP@30 = 80.81) but drops sharply at stricter ones (AP@70 = 44.82)
- Graceful degradation pattern enables safe fallback strategies
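The size of the degradation can be read directly off the SiMO-PF table. The short script below, a convenience sketch with a hypothetical helper name, computes the drop in AP points of each single-modal setting relative to LiDAR + Camera.

```python
# AP values copied from the SiMO-PF table above (OPV2V-H test set).
RESULTS = {
    "LiDAR + Camera": {"AP@30": 98.30, "AP@50": 97.94, "AP@70": 94.64},
    "LiDAR only": {"AP@30": 97.32, "AP@50": 97.07, "AP@70": 94.06},
    "Camera only": {"AP@30": 80.81, "AP@50": 69.63, "AP@70": 44.82},
}


def drop_vs_multimodal(results, baseline="LiDAR + Camera"):
    """Return the AP loss (in points) of each setting relative to the baseline."""
    base = results[baseline]
    return {
        setting: {metric: round(base[metric] - value, 2)
                  for metric, value in scores.items()}
        for setting, scores in results.items()
        if setting != baseline
    }


for setting, drops in drop_vs_multimodal(RESULTS).items():
    print(setting, drops)
```

LiDAR-only loses 0.98, 0.87, and 0.58 points at AP@30, AP@50, and AP@70, while camera-only loses 17.49, 28.31, and 49.82 points at the same thresholds.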
Comparison with Baselines
| Method | LiDAR+Camera AP@50 | LiDAR-Only AP@50 | Camera-Only AP@50 |
|---|---|---|---|
| BM2CP (Zhao et al., 2023) | 91.45 | 91.31 | 0.00 |
| BEVFusion (Liu et al., 2023) | 94.21 | 91.99 | 0.00 |
| UniBEV (Wang et al., 2024a) | 91.71 | 91.73 | 0.00 |
| AttFusion (Xu et al., 2022c) | - | 95.09 | 52.91 |
| HEAL (Lu et al., 2024) | - | 98.00 | 60.48 |
| SiMO (AttFusion w/ RD) | 94.98 | 94.02 | 49.69 |
| SiMO (Pyramid Fusion w/ RD) (Ours) | 97.94 | 97.07 | 69.63 |
V2XSet Test Set
| Method | LiDAR+Camera AP@50 | LiDAR-Only AP@50 | Camera-Only AP@50 |
|---|---|---|---|
| SiMO-PF | 92.66 | 90.44 | 56.42 |
DAIR-V2X-C Test Set
| Method | LiDAR+Camera AP@50 | LiDAR-Only AP@50 | Camera-Only AP@50 |
|---|---|---|---|
| SiMO-PF | 51.82 | 52.33 | 2.24 |
Model Zoo
Pretrained models are available on Hugging Face.
| Model | Dataset | Config | Checkpoint |
|---|---|---|---|
| SiMO-PF | OPV2V-H | Config | HF |
| SiMO-AttFuse | OPV2V-H | Config | HF |
Download Models from Hugging Face
# Install huggingface-hub
pip install huggingface-hub
# Download all checkpoints
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='DempseyWen/SiMO', repo_type='model')"
# Or download specific model
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='DempseyWen/SiMO', filename='path/to/checkpoint.pth')"
The downloaded checkpoints will be saved to ~/.cache/huggingface/hub/. You can also manually download from Hugging Face.
Project Structure
SiMO/
├── opencood/
│   ├── models/
│   │   ├── fuse_modules/
│   │   │   ├── lamma.py              # LAMMA implementation
│   │   │   └── pyramid_fuse.py       # Pyramid Fusion
│   │   └── heter_pyramid_collab.py   # Main model
│   ├── tools/
│   │   ├── train.py                  # Training script
│   │   ├── train_ddp.py              # Distributed training
│   │   └── inference.py              # Testing script
│   ├── hypes_yaml/
│   │   └── opv2v/
│   │       ├── LiDAROnly/            # Single-modal configs
│   │       ├── CameraOnly/
│   │       └── MoreModality/         # Multimodal configs
│   └── data_utils/
│       └── datasets/                 # Dataset loaders
├── requirements.txt
├── setup.py
└── README.md
Citation
If you find this work useful for your research, please cite:
@inproceedings{wen2026simo,
title={Single-Modal-Operable Multimodal Collaborative Perception},
author={Wen, Dempsey and Lu, Yifan and others},
booktitle={International Conference on Learning Representations (ICLR)},
year={2026}
}
If you use the OpenCOOD framework, please also cite:
@inproceedings{xu2022opencood,
title={OpenCOOD: An Open Cooperative Perception Framework for Autonomous Driving},
author={Xu, Runsheng and Lu, Yifan and others},
booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
year={2023}
}
License
This project is licensed under the MIT License. See LICENSE for details.
The code is based on OpenCOOD and HEAL.
Acknowledgements
We thank the authors of OpenCOOD and HEAL for their excellent open-source frameworks. This work builds upon their contributions to collaborative perception research.
Contact
For questions or issues, please open an issue on GitHub or contact the authors.