DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction

July 1, 2025

License: CC BY-SA 4.0

NeurIPS 2025 Submission

👥 Authors

Zhiyi Hou1,2,3,*, Enhui Ma1,3,*, Fang Li2,*, Zhiyi Lai2, Kalok Ho2, Zhanqian Wu2,
Lijun Zhou2, Long Chen2, Chitian Sun2, Haiyang Sun2,†, Bing Wang2,
Guang Chen2, Hangjun Ye2, Kaicheng Yu1,✉

1Westlake University, 2Xiaomi EV, 3Zhejiang University
*Equal contribution. †Project leader. ✉Corresponding author.

Method Overview


🔍 Abstract

Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle's future motion remains a major challenge due to uncertainties in dynamic environments and limitations in data coverage.

In this work, we introduce:

  1. DriveMRP-10K: A synthetic dataset of high-risk driving motions, built from nuPlan using BEV-based simulation to model risks arising from the ego vehicle, other agents, and the environment
  2. DriveMRP-Agent: A VLM-agnostic framework that incorporates projection-based visual prompting to bridge numerical coordinates and images

By fine-tuning with DriveMRP-10K, our framework significantly improves motion risk prediction performance, with accident recognition accuracy soaring from 27.13% to 88.03%. When tested via zero-shot evaluation on real-world high-risk motion data, DriveMRP-Agent boosts accuracy from 29.42% to 68.50%.

🧠 Method Overview

πŸ—‚οΈ 1. DriveMRP-10K Dataset

Dataset Generation

  • Synthetic high-risk motion data generated via BEV-based simulation
  • Models risks from three aspects:
    • Ego-vehicle maneuvers
    • Other vehicle interactions
    • Environmental constraints
  • Includes:
    • Trajectory generation
    • Human-in-the-loop labeling
    • GPT-4o captions
  • 10K multimodal samples for VLM training

🤖 2. DriveMRP-Agent Framework

Framework Architecture

  • VLM-agnostic architecture; the reported DriveMRP-Agent results use Qwen2.5-VL-7B as the base model
  • Key components:
    • Projection-based visual prompting: Bridges numerical coordinates and images
    • Multi-context integration: Combines BEV and front-view contexts
    • Chain-of-thought reasoning: For motion risk prediction
  • Processes:
    1. Global context injection
    2. Ego-vehicle perspective alignment
    3. Trajectory projection

πŸ—ƒοΈ Dataset Structure

DriveMRP-10K/
├── train/                  # Training samples (8,000 scenarios)
│   ├── scenario_001/
│   │   ├── bev.png         # BEV representation
│   │   ├── front_view.png  # Ego-vehicle perspective
│   │   ├── trajectory.json # Motion trajectory data
│   │   └── caption.txt     # GPT-4o generated description
│   └── ...
├── val/                    # Validation samples (1,000 scenarios)
├── test/                   # Test samples (1,000 scenarios)
└── metadata.json           # Dataset metadata and statistics
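A loader for this layout might look like the following sketch. It relies only on the file names shown in the tree; the `Scenario` class and the `load_scenario` and `iter_split` helpers are hypothetical names introduced here, and because the schema of `trajectory.json` is not documented in this README, the file is kept as generic parsed JSON.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Iterator


@dataclass
class Scenario:
    name: str
    bev_path: Path          # path to bev.png (not opened here)
    front_view_path: Path   # path to front_view.png (not opened here)
    trajectory: Any         # parsed trajectory.json; schema is not assumed
    caption: str            # contents of caption.txt


def load_scenario(scenario_dir: Path) -> Scenario:
    """Read one scenario directory laid out as in the tree above."""
    scenario_dir = Path(scenario_dir)
    with open(scenario_dir / "trajectory.json", encoding="utf-8") as f:
        trajectory = json.load(f)
    caption = (scenario_dir / "caption.txt").read_text(encoding="utf-8").strip()
    return Scenario(
        name=scenario_dir.name,
        bev_path=scenario_dir / "bev.png",
        front_view_path=scenario_dir / "front_view.png",
        trajectory=trajectory,
        caption=caption,
    )


def iter_split(root: Path, split: str) -> Iterator[Scenario]:
    """Yield every scenario directory under root/<split>, in sorted order."""
    for path in sorted((Path(root) / split).iterdir()):
        if path.is_dir():
            yield load_scenario(path)
```

For example, `next(iter_split("DriveMRP-10K", "train"))` would return the first training scenario, provided the dataset has been downloaded to that path.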

Dataset Statistics:

| Split | Scenarios | Risk Categories |
|-------|-----------|-----------------|
| Train | 8,000 | 4 |
| Val | 1,000 | 4 |
| Test | 1,000 | 4 |

Risk Categories:

  1. Collision risk 🚗💥
  2. Emergency acceleration 🚀
  3. Emergency braking ✋
  4. Illegal lane change ↔️

📊 Results

1. Performance on Synthetic Dataset (DriveMRP-10K)

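| Method | ROUGE-1-F1 | ROUGE-2-F1 | ROUGE-L-F1 | BERTScore | Accuracy | Recall | F1-score |
|--------|------------|------------|------------|-----------|----------|--------|----------|
| EM-VLM4AD-Base | 14.88 | 1.38 | 11.09 | 45.70 | - | - | - |
| Llava-1.5-7B | 42.67 | 11.44 | 27.23 | 65.18 | 22.34 | 1.72 | 0.85 |
| InternVL2-8B | 51.15 | 16.84 | 31.11 | 69.66 | 18.35 | 3.20 | 2.98 |
| InternVL2.5-8B | 49.89 | 15.07 | 29.21 | 68.70 | 26.86 | 9.58 | 4.79 |
| Llama3.2-vision-11B | 23.50 | 7.07 | 15.48 | 57.10 | 11.32 | 1.12 | 0.83 |
| Qwen2.5-VL-7B-Instruct | 48.54 | 15.99 | 30.72 | 68.83 | 27.13 | 13.76 | 6.66 |
| DriveMRP-Agent (Ours) | 69.08 | 42.23 | 52.93 | 81.25 | 88.03 | 89.44 | 89.12 |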
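The Accuracy, Recall, and F1-score columns are classification metrics over predicted risk categories. This README does not state how Recall and F1 are averaged across categories, so the sketch below assumes macro averaging over the labels present in the ground truth. The function name `classification_metrics` and the label strings in the usage note are hypothetical.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, float]:
    """Return accuracy plus macro-averaged recall and F1 (as fractions)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    recalls, f1s = [], []
    for label in sorted(set(y_true)):  # macro average over ground-truth labels
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1s.append(f1)

    return {
        "accuracy": accuracy,
        "recall": sum(recalls) / len(recalls),
        "f1": sum(f1s) / len(f1s),
    }
```

Calling `classification_metrics(["collision", "braking"], ["collision", "collision"])` returns fractions in the range 0 to 1; multiply by 100 to compare with the percentages in the table.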

2. Zero-Shot Performance on Real-World Dataset

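| Method | ROUGE-1-F1 | ROUGE-2-F1 | ROUGE-L-F1 | BERTScore | Accuracy | Recall | F1-score |
|--------|------------|------------|------------|-----------|----------|--------|----------|
| InternVL2-8B | 52.42 | 18.19 | 32.44 | 70.72 | 22.75 | 13.65 | 9.55 |
| InternVL2.5-8B | 55.14 | 20.58 | 34.45 | 71.87 | 24.28 | 12.18 | 8.34 |
| Qwen2.5-VL-7B-Instruct | 34.36 | 18.58 | 24.83 | 66.50 | 29.42 | 22.06 | 13.61 |
| DriveMRP-Agent (Ours) | 62.74 | 30.82 | 42.35 | 76.69 | 68.50 | 51.37 | 56.18 |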

3. Performance Gains with DriveMRP-10K Fine-tuning

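| Method | ROUGE-1-F1 | ROUGE-2-F1 | ROUGE-L-F1 | BERTScore | Accuracy | Recall | F1-score |
|--------|------------|------------|------------|-----------|----------|--------|----------|
| Llava-1.5-7B | 42.67 | 11.44 | 27.23 | 65.18 | 22.34 | 1.72 | 0.85 |
| + DriveMRP-10K | 63.22 | 34.66 | 45.57 | 77.52 | 59.04 | 24.11 | 25.99 |
| Llama3.2-vision-11B | 23.50 | 7.07 | 15.48 | 57.10 | 11.32 | 1.12 | 0.83 |
| + DriveMRP-10K | 52.43 | 33.63 | 36.47 | 70.65 | 56.05 | 22.04 | 23.03 |
| Qwen2.5-VL-7B-Instruct | 48.54 | 15.99 | 30.72 | 68.83 | 27.13 | 13.76 | 6.66 |
| + DriveMRP-10K | 69.08 | 42.23 | 52.93 | 81.25 | 88.03 | 89.44 | 89.12 |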
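To make the fine-tuning gains explicit, the accuracy values from the table above can be differenced per base model. The snippet below is only a convenience sketch: the numbers are copied from the table, and the names `ACCURACY` and `accuracy_gains` are introduced here for illustration.

```python
from typing import Dict, Tuple

# Accuracy (%) before and after DriveMRP-10K fine-tuning, copied from the table.
ACCURACY: Dict[str, Tuple[float, float]] = {
    "Llava-1.5-7B": (22.34, 59.04),
    "Llama3.2-vision-11B": (11.32, 56.05),
    "Qwen2.5-VL-7B-Instruct": (27.13, 88.03),
}


def accuracy_gains(table: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return the absolute accuracy gain in percentage points per model."""
    return {
        model: round(after - before, 2)
        for model, (before, after) in table.items()
    }


if __name__ == "__main__":
    for model, gain in accuracy_gains(ACCURACY).items():
        print(f"{model}: +{gain:.2f} points")
```

All three base models improve by more than 36 accuracy points, with the largest gain (60.90 points) on Qwen2.5-VL-7B-Instruct.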

Qualitative Results

Case 1: Illegal Lane Change Risk

Illegal Lane Change

  • Ground truth: Illegal lane change
  • DriveMRP correctly identifies the risk, while the baselines misclassify the scenario as "no risk"

Case 2: Abnormal Deceleration Risk

Abnormal Deceleration

  • Ground truth: Abnormal deceleration
  • DriveMRP detects risk from trajectory color changes

Case 3: Collision Risk

Collision Risk

  • Ground truth: Collision risk
  • DriveMRP identifies threat from trajectory proximity to obstacles

Risk Scenario Videos

| Scenario | Video |
|----------|-------|
| Emergency Acceleration | acc-1.mp4 |
| Emergency Braking | dec-1.mp4 |
| Collision | col.mp4 |
| Illegal Lane Change | change_lane.mp4 |

📝 Citation

@inproceedings{hou2025drivemrp,
  title     = {DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction},
  author    = {Hou, Zhiyi and Ma, Enhui and Li, Fang and Lai, Zhiyi and Ho, Kalok and Wu, Zhanqian and Zhou, Lijun and Chen, Long and Sun, Chitian and Sun, Haiyang and Wang, Bing and Chen, Guang and Ye, Hangjun and Yu, Kaicheng},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025},
  note      = {Equal contribution between the first three authors. Haiyang Sun is the project leader.},
  url       = {https://openreview.net/forum?id=anonymous_id}
}

📜 License

This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

🙏 Acknowledgements

  • This project page template was adapted from the Academic Project Page Template
  • Built upon the Qwen vision-language models
  • Dataset generated using the nuPlan dataset
  • Research supported by Zhejiang University, Westlake University, and Xiaomi EV