SiMO: Single-Modal-Operable Multimodal Collaborative Perception

March 18, 2026

Official PyTorch Implementation

This repository contains the official implementation of the paper "Single-Modal-Operable Multimodal Collaborative Perception" (ICLR 2026).

🎉 Update: Pretrained checkpoints are now available! Download them from our Hugging Face repository.

Abstract

Multimodal collaborative perception promises robust 3D object detection by fusing complementary sensor data from multiple connected vehicles. However, existing methods suffer from catastrophic performance degradation when one modality becomes unavailable during deployment, a common scenario in real-world autonomous driving. SiMO addresses this critical limitation through two key innovations:

LAMMA (Length-Adaptive Multi-Modal Fusion): A fusion module that handles a variable number of input modalities. Modalities are combined like the branches of a parallel circuit rather than components in series, so losing one branch does not interrupt the output.
PAFR Training Strategy: A four-stage training paradigm (Pretrain-Align-Fuse-Random Drop) that prevents modality competition and enables seamless single-modal operation.

Modality	AP@30	AP@50	AP@70
LiDAR + Camera	98.30	97.94	94.64
LiDAR-only	97.32	97.07	94.06
Camera-only	80.81	69.63	44.82

Key Result: SiMO achieves state-of-the-art performance on OPV2V-H with graceful degradation when modalities fail.

Key Features

Single-Modal Operability: First multimodal collaborative perception framework that maintains functional performance with any subset of modalities
Adaptive Fusion: LAMMA module dynamically adjusts to available modalities without architecture changes
No Modality Competition: PAFR training prevents feature suppression between modalities
Drop-in Replacement: Compatible with existing fusion frameworks like HEAL's Pyramid Fusion
Multi-Dataset Support: Evaluated on OPV2V-H, V2XSet, and DAIR-V2X-C

Architecture Overview

LAMMA is the core fusion module that enables SiMO's single-modal operability:

Input: Camera Features (B, N, C, H, W) + LiDAR Features (B, N, C, H, W)
         ↓
[Positional Encoding]
         ↓
[Feature Projection] → Downsampling (2x)
         ↓
[Modality-Aware Masking] ← Single-mode or Random Drop
         ↓
[Cross-Attention] × 2 (Camera branch + LiDAR branch)
         ↓
[Parallel Fusion] → Sum of attended features
         ↓
[Feature Recovery] → Upsampling (2x)
         ↓
Output: Fused Features + Single-Modal Features

Key Design Principles:

Parallel Processing: Unlike sequential fusion, LAMMA processes modalities in parallel and sums their contributions
Adaptive Masking: During training, random modality dropout forces the network to learn robust single-modal representations
Cross-Attention: Each modality attends to the concatenated features of all available modalities
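The parallel-circuit idea can be illustrated with a toy, pure-Python sketch. It is not the code in lamma.py: each modality's feature map is reduced to a single vector, there are no learned projections or positional encodings, and the names attend and parallel_fuse are hypothetical. Each available modality queries the set of all available features, and the branch outputs are summed, so a missing modality removes one branch instead of breaking the computation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a variable-length set."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def parallel_fuse(features):
    """Fuse a dict of modality -> feature vector (None marks a dropped modality)."""
    available = [f for f in features.values() if f is not None]
    if not available:
        raise ValueError("at least one modality is required")
    fused = [0.0] * len(available[0])
    for feat in available:
        # Each branch attends to the features of all available modalities.
        branch = attend(feat, available, available)
        # Parallel fusion: branch outputs are summed, not chained.
        fused = [a + b for a, b in zip(fused, branch)]
    return fused


both = parallel_fuse({"lidar": [1.0, 0.0], "camera": [0.0, 1.0]})
lidar_only = parallel_fuse({"lidar": [1.0, 0.0], "camera": None})
print(both, lidar_only)
```

With a single modality the attention weight is 1, so the toy output equals that modality's own feature. The real module differs in every detail; the sketch only shows why the input length can vary.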

Integration with Pyramid Fusion

SiMO works seamlessly with HEAL's Pyramid Fusion framework:

Stage 1: Single-Modal Encoders (PointPillar for LiDAR, Lift-Splat-Shoot for Camera)
         ↓
Stage 2: Single-Modal Backbones (ResNet-based BEV feature extraction)
         ↓
Stage 3: Modality Alignment (ConvNeXt-based feature alignment)
         ↓
Stage 4: LAMMA Fusion (Adaptive multimodal fusion)
         ↓
Stage 5: Pyramid Fusion Backbone (Multi-scale collaborative aggregation)
         ↓
Stage 6: Detection Head (Anchor-based 3D object detection)

PAFR Training Strategy

The PAFR (Pretrain-Align-Fuse-Random Drop) strategy consists of four stages:

Stage 1: Pretrain (P)

Goal: Train single-modal feature extractors independently

# Pretrain LiDAR branch
python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/LiDAROnly/lidar_pyramid.yaml

# Pretrain Camera branch
python opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/opv2v/CameraOnly/camera_pyramid.yaml

Configuration: Set freeze: true for all pretrained components in subsequent stages.

Stage 2: Align (A)

Goal: Align multi-modal features to a common representation space using ConvNeXt

Key Configuration:

aligner_args:
  core_method: convnext
  freeze: true
  spatial_align: false
  args:
    num_of_blocks: 3
    dim: 64

Training: Train the aligner of each modality with the pretrained encoders and backbones frozen (see Training Commands), so that LiDAR and camera features map to a common representation space.

Stage 3: Fuse (F)

Goal: Train LAMMA fusion module with full multimodal inputs

Key Configuration:

mm_fusion_method: 'lamma3'
lamma:
  freeze: false
  feature_stride: 2
  feat_dim: 64
  dim: 128
  heads: 2
  single_mode: false
  random_drop: false

Important: Keep random_drop: false and single_mode: false during this stage.

Stage 4: Random Drop (RD)

Goal: Fine-tune with random modality dropout to enable single-modal operation

Key Configuration:

lamma:
  random_drop: true
  lidar_drop_ratio: 0.5  # 50% chance to drop LiDAR when dropping

Training: With 50% probability, randomly drop one modality during training. This forces the network to maintain functional performance with either modality alone.
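One possible reading of the dropout rule, as a minimal sketch: a step drops a modality with some probability, and lidar_drop_ratio then decides which one is dropped. The function name sample_modality_mask and the parameter drop_prob are assumptions made for illustration; only lidar_drop_ratio appears in the config above, and the repository may implement the sampling differently.

```python
import random


def sample_modality_mask(rng, drop_prob=0.5, lidar_drop_ratio=0.5):
    """Return (use_lidar, use_camera) for one training step.

    drop_prob is a hypothetical name for the chance that a step drops a
    modality at all; lidar_drop_ratio is the chance that the dropped
    modality is LiDAR rather than camera.
    """
    if rng.random() >= drop_prob:
        return True, True      # keep both modalities
    if rng.random() < lidar_drop_ratio:
        return False, True     # drop LiDAR, keep camera
    return True, False         # drop camera, keep LiDAR


rng = random.Random(0)  # seeded so the run is repeatable
counts = {"both": 0, "camera_only": 0, "lidar_only": 0}
for _ in range(10_000):
    use_lidar, use_camera = sample_modality_mask(rng)
    if use_lidar and use_camera:
        counts["both"] += 1
    elif use_camera:
        counts["camera_only"] += 1
    else:
        counts["lidar_only"] += 1
print(counts)
```

Under this reading, at least one modality is always kept, so every training step still produces a valid detection target.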

Installation

This project builds on HEAL and uses the same environment setup. Please refer to the HEAL repository for detailed installation instructions and troubleshooting.

Prerequisites

Python >= 3.8
PyTorch >= 1.12.0
CUDA >= 11.3
spconv >= 2.0
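A quick way to check the Python and PyTorch minimums before installing the rest is sketched below. The helper names are hypothetical and the script is not part of the repository. It does not check CUDA or spconv, since the installed spconv distribution name depends on the CUDA build (for example spconv-cu113).

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn a string such as '1.12.1+cu113' into (1, 12, 1)."""
    core = text.split("+")[0]
    parts = []
    for piece in core.split("."):
        if not piece.isdigit():
            break  # stop at pre-release suffixes such as 'dev' or 'a0'
        parts.append(int(piece))
    return tuple(parts)


def meets(found, minimum):
    """Compare version tuples after padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    found = found + (0,) * (width - len(found))
    minimum = minimum + (0,) * (width - len(minimum))
    return found >= minimum


def check_environment():
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 is required")
    try:
        torch_version = parse_version(metadata.version("torch"))
        if not meets(torch_version, (1, 12, 0)):
            problems.append("PyTorch >= 1.12.0 is required")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```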

Step 1: Clone Repository

git clone https://github.com/dempsey-wen/SiMO.git
cd SiMO

Step 2: Install Dependencies

# Install PyTorch (adjust CUDA version as needed)
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

# Install spconv (for LiDAR feature extraction)
pip install spconv-cu113

# Install other requirements
pip install -r requirements.txt

Key Dependencies:

easydict~=1.9
opencv-python-headless~=4.5.1.48
timm
einops
shapely==2.0.0
efficientnet_pytorch==0.7.0

Step 3: Install OpenCOOD

pip install -e .

Step 4: Compile CUDA Extensions

cd opencood/pcdet_utils/pointnet2
python setup.py install
cd ../iou3d_nms
python setup.py install
cd ../../..

Data Preparation

Supported Datasets

SiMO supports the following collaborative perception datasets:

Dataset	Scenarios	Modalities	Download
OPV2V-H	Highway, Urban	LiDAR, Camera	Link
V2XSet	Highway, Urban	LiDAR, Camera	Link
DAIR-V2X-C	Real-world	LiDAR, Camera	Link

Directory Structure

data/
├── OPV2V/
│   ├── train/
│   ├── validate/
│   └── test/
├── V2XSet/
│   ├── train/
│   ├── validate/
│   └── test/
└── DAIR-V2X/
    └── ...
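Before training, it can help to confirm that the split folders are where the tree above puts them. The following is a small sketch using only the standard library. The function name missing_split_dirs is hypothetical, and DAIR-V2X is left out because its layout is not spelled out above.

```python
from pathlib import Path

# Split folders taken from the directory tree above.
EXPECTED_SPLITS = {
    "OPV2V": ("train", "validate", "test"),
    "V2XSet": ("train", "validate", "test"),
}


def missing_split_dirs(data_root):
    """Return the expected split directories that do not exist under data_root."""
    root = Path(data_root)
    missing = []
    for dataset, splits in EXPECTED_SPLITS.items():
        for split in splits:
            path = root / dataset / split
            if not path.is_dir():
                missing.append(path)
    return missing


if __name__ == "__main__":
    for path in missing_split_dirs("data"):
        print("missing:", path)
```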

Training Commands

Complete PAFR Pipeline

Step 1: Pretrain Single-Modal Branches

# LiDAR-only pretraining (20 epochs)
python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/LiDAROnly/lidar_pyramid.yaml

# Camera-only pretraining (50 epochs)
python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/CameraOnly/camera_pyramid.yaml

Output: Model checkpoints are saved to saved_models/opv2v_lidar_pyramid/ and saved_models/opv2v_camera_pyramid/

Step 2: Train Aligners

This stage trains the modality aligners to map LiDAR and camera features into a common representation space.

Key Configuration: The aligner trains independently for each modality with frozen encoders and backbones.

2.1 Train LiDAR Aligner

Modify the config to set single_modality: lidar and freeze the camera aligner:

model:
  args:
    single_modality: lidar
    lidar_aligner:
      freeze: false
    camera_aligner:
      freeze: true  # Freeze camera aligner
    lidar_encoder:
      freeze: true   # Freeze pretrained LiDAR encoder
    lidar_backbone:
      freeze: true   # Freeze pretrained LiDAR backbone
    camera_encoder:
      freeze: true   # Freeze pretrained camera encoder
    camera_backbone:
      freeze: true   # Freeze pretrained camera backbone

Then run training:

python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml

2.2 Train Camera Aligner

Modify the config to set single_modality: camera and freeze the LiDAR aligner:

model:
  args:
    single_modality: camera
    lidar_aligner:
      freeze: true   # Freeze LiDAR aligner
    camera_aligner:
      freeze: false
    # Keep encoders and backbones frozen

Then run training:

python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml

Output: Checkpoints containing the trained aligners are saved. Load these checkpoints in the next stage.
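The two config edits in Steps 2.1 and 2.2 differ only in which aligner is unfrozen, so they can be generated instead of edited by hand. The sketch below assumes the YAML file has already been loaded into a nested dict. The helpers deep_merge and aligner_stage_override are hypothetical, and the base dict contains only the keys shown in the snippets above, whereas a real config has many more.

```python
import copy


def deep_merge(base, override):
    """Return a copy of base with override merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def aligner_stage_override(modality):
    """Overrides for Step 2.1 (modality='lidar') or Step 2.2 (modality='camera')."""
    if modality not in ("lidar", "camera"):
        raise ValueError("modality must be 'lidar' or 'camera'")
    return {
        "model": {
            "args": {
                "single_modality": modality,
                # Only the aligner of the selected modality is trainable.
                "lidar_aligner": {"freeze": modality != "lidar"},
                "camera_aligner": {"freeze": modality != "camera"},
            }
        }
    }


# Illustrative base config: pretrained encoders and backbones stay frozen.
base = {
    "model": {
        "args": {
            "lidar_encoder": {"freeze": True},
            "lidar_backbone": {"freeze": True},
            "camera_encoder": {"freeze": True},
            "camera_backbone": {"freeze": True},
        }
    }
}

lidar_stage = deep_merge(base, aligner_stage_override("lidar"))
camera_stage = deep_merge(base, aligner_stage_override("camera"))
print(lidar_stage["model"]["args"]["lidar_aligner"])
print(camera_stage["model"]["args"]["lidar_aligner"])
```

Because deep_merge copies its input, the base dict is left untouched and can be reused for later stages.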

Step 3: Train LAMMA Fusion

This stage trains the LAMMA fusion module with both aligners frozen.

Key Configuration: Set single_modality: false to enable full multimodal fusion.

model:
  args:
    single_modality: false   # Enable full multimodal fusion
    lidar_aligner:
      freeze: true    # Freeze both aligners
    camera_aligner:
      freeze: true
    lamma:
      random_drop: false  # Disable random drop in this stage
      single_mode: false

Set model_dir to load the pretrained aligner checkpoints from the Align stage.

python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml \
    --model_dir saved_models/opv2v_lidarcamera_aligned/

Step 4: Random Drop Fine-tuning

This final stage enables random modality dropout during training to ensure robust single-modal operation.

Key Configuration: Set lamma.random_drop: true to enable random dropout.

model:
  args:
    single_modality: false   # Still enable full multimodal fusion
    lamma:
      random_drop: true       # Enable random modality dropout
      lidar_drop_ratio: 0.5  # 50% probability to drop LiDAR when dropping
      single_mode: false

Then resume training with the Fusion stage checkpoint:

python opencood/tools/train.py \
    --hypes_yaml opencood/hypes_yaml/opv2v/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_fused/

Training Notes:

During training, with 50% probability, one modality is randomly dropped
This forces the network to maintain functional performance with either modality alone
The final checkpoint will have robust single-modal operability
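The six training runs of this section can be collected in one place, for example to print them as a checklist. The sketch below only builds and prints the command lines listed above and does not launch anything. The config edits described for each step still have to be made before the corresponding run, and the stage labels and the train_command helper are hypothetical.

```python
import shlex

TRAIN_SCRIPT = "opencood/tools/train.py"
YAML_ROOT = "opencood/hypes_yaml/opv2v"
FUSION_YAML = f"{YAML_ROOT}/MoreModality/lidar_camera_lamma3_pyramid_fusion.yaml"


def train_command(hypes_yaml, model_dir=None):
    """Build the argv list for one training run."""
    cmd = ["python", TRAIN_SCRIPT, "--hypes_yaml", hypes_yaml]
    if model_dir is not None:
        cmd += ["--model_dir", model_dir]
    return cmd


# Stage labels are illustrative; paths are the ones used in this section.
PIPELINE = [
    ("pretrain-lidar", train_command(f"{YAML_ROOT}/LiDAROnly/lidar_pyramid.yaml")),
    ("pretrain-camera", train_command(f"{YAML_ROOT}/CameraOnly/camera_pyramid.yaml")),
    ("align-lidar", train_command(FUSION_YAML)),
    ("align-camera", train_command(FUSION_YAML)),
    ("fuse", train_command(FUSION_YAML, "saved_models/opv2v_lidarcamera_aligned/")),
    ("random-drop", train_command(FUSION_YAML, "saved_models/opv2v_lidarcamera_lamma3_fused/")),
]

for name, cmd in PIPELINE:
    print(f"{name}: {shlex.join(cmd)}")
```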

Testing Commands

Multimodal Testing (LiDAR + Camera)

python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate

LiDAR-Only Inference

Modify the config to set single_modality: lidar:

model:
  args:
    single_modality: lidar

Then run inference:

python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate

Camera-Only Inference

Modify the config to set single_modality: camera:

model:
  args:
    single_modality: camera

Then run inference:

python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate

Evaluation with Different Ranges

python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate \
    --range 51.2,51.2

Save Visualization

python opencood/tools/inference.py \
    --model_dir saved_models/opv2v_lidarcamera_lamma3_pyramid_fusion/ \
    --fusion_method intermediate \
    --save_vis_interval 10

Results

OPV2V-H Test Set

Method	Modality	AP@30	AP@50	AP@70	Modality Drop?
SiMO-PF	LiDAR + Camera	98.30	97.94	94.64	No
SiMO-PF	LiDAR only	97.32	97.07	94.06	Yes
SiMO-PF	Camera only	80.81	69.63	44.82	Yes

Key Observations:

SiMO retains an AP@50 above 97 (97.07) when operating with LiDAR alone, 0.87 points below the multimodal result
Camera-only performance remains usable at the loose AP@30 threshold (80.81) but falls to 44.82 at AP@70
This graceful degradation lets a system fall back to whichever modality remains available
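The size of the degradation can be read directly off the SiMO-PF rows above. The short sketch below computes the AP lost relative to the LiDAR + Camera setting. The numbers are copied from the table, and the function name drop_from_multimodal is hypothetical.

```python
# SiMO-PF scores copied from the table above.
RESULTS = {
    "LiDAR + Camera": {"AP@30": 98.30, "AP@50": 97.94, "AP@70": 94.64},
    "LiDAR only": {"AP@30": 97.32, "AP@50": 97.07, "AP@70": 94.06},
    "Camera only": {"AP@30": 80.81, "AP@50": 69.63, "AP@70": 44.82},
}


def drop_from_multimodal(results, reference="LiDAR + Camera"):
    """AP points lost by each single-modal setting relative to the reference."""
    ref = results[reference]
    return {
        modality: {
            metric: round(ref[metric] - value, 2)
            for metric, value in scores.items()
        }
        for modality, scores in results.items()
        if modality != reference
    }


for modality, drops in drop_from_multimodal(RESULTS).items():
    print(modality, drops)
```

The LiDAR-only setting loses less than one AP point at every threshold, while the camera-only loss grows from 17.49 points at AP@30 to 49.82 points at AP@70.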

Comparison with Baselines

Method	LiDAR+Camera AP@50	LiDAR-Only AP@50	Camera-Only AP@50
BM2CP (Zhao et al., 2023)	91.45	91.31	0.00
BEVFusion (Liu et al., 2023)	94.21	91.99	0.00
UniBEV (Wang et al., 2024a)	91.71	91.73	0.00
AttFusion (Xu et al., 2022c)	-	95.09	52.91
HEAL (Lu et al., 2024)	-	98.00	60.48
SiMO (AttFusion w/ RD)	94.98	94.02	49.69
SiMO (Pyramid Fusion w/ RD) (Ours)	97.94	97.07	69.63

V2XSet Test Set

Method	LiDAR+Camera AP@50	LiDAR-Only AP@50	Camera-Only AP@50
SiMO-PF	92.66	90.44	56.42

DAIR-V2X-C Test Set

Method	LiDAR+Camera AP@50	LiDAR-Only AP@50	Camera-Only AP@50
SiMO-PF	51.82	52.33	2.24

Model Zoo

Pretrained models are available on Hugging Face.

Model	Dataset	Config	Checkpoint
SiMO-PF	OPV2V-H	Config	🤗 HF
SiMO-AttFuse	OPV2V-H	Config	🤗 HF

Download Models from Hugging Face

# Install huggingface-hub
pip install huggingface-hub

# Download all checkpoints
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='DempseyWen/SiMO', repo_type='model')"

# Or download specific model
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='DempseyWen/SiMO', filename='path/to/checkpoint.pth')"

The downloaded checkpoints will be saved to `~/.cache/huggingface/hub/`. You can also manually download from Hugging Face.

Project Structure

SiMO/
├── opencood/
│   ├── models/
│   │   ├── fuse_modules/
│   │   │   ├── lamma.py              # LAMMA implementation
│   │   │   └── pyramid_fuse.py       # Pyramid Fusion
│   │   └── heter_pyramid_collab.py   # Main model
│   ├── tools/
│   │   ├── train.py                  # Training script
│   │   ├── train_ddp.py              # Distributed training
│   │   └── inference.py              # Testing script
│   ├── hypes_yaml/
│   │   └── opv2v/
│   │       ├── LiDAROnly/            # Single-modal configs
│   │       ├── CameraOnly/
│   │       └── MoreModality/         # Multimodal configs
│   └── data_utils/
│       └── datasets/                 # Dataset loaders
├── requirements.txt
├── setup.py
└── README.md

Citation

If you find this work useful for your research, please cite:

@inproceedings{wen2026simo,
  title={Single-Modal-Operable Multimodal Collaborative Perception},
  author={Wen, Dempsey and Lu, Yifan and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}

If you use the OpenCOOD framework, please also cite:

@inproceedings{xu2022opencood,
  title={OpenCOOD: An Open Cooperative Perception Framework for Autonomous Driving},
  author={Xu, Runsheng and Lu, Yifan and others},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2023}
}
