DeepSight

May 27, 2026 Β· View on GitHub

DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

πŸŽ‰πŸŽ‰ICML 2026 πŸŽ‰πŸŽ‰

image

DeepSight achieves leading performance on most of metrics compared with E2E methods.

πŸ“– Abstract

End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird’s-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We present a novel, efficient, and effective approach that achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark.

πŸš€ Pipeline

image

The pipeline of our method, a holistic training and inference framework for closed-loop driving. It consists of two main modules: (a) Long-term driving-world model, for aligning DINOv3 features extracted from future multi-frame RGB images in the BEV space during training. (b) An adaptive CoT module for integrating external knowledge to enhance reasoning and decision-making in long-tail cases

πŸ–ΌοΈ Visualization

image

Qualitative results of DeepSight on the Bench2Drive closed-loop evaluation set.

DeepSight

DeepSight is an autonomous driving perception and reasoning framework built on top of LLaMA-Factory, with full-pipeline customization for data processing, model training, and inference evaluation in autonomous driving scenarios.

Quick Start

Environment Setup

# Clone the repository
git clone https://github.com/hotdogcheesewhite/DeepSight.git
cd DeepSight

# Create virtual environment
conda create -n deepsight python=3.10 -y && conda activate deepsight

# Install PyTorch (recommended 2.6.0 or other compatible version)
# torch == 2.6.0

# Install dependencies
pip install -r requirements.txt
# pip install -e .

Overview

This project extends LLaMA-Factory with the following features:

  • BEV (Bird's-Eye-View) data processing and visualization pipeline
  • VLM (Vision-Language Model) training data construction
  • DINOv3 feature extraction and BEV Query supervision
  • Open-loop and closed-loop evaluation based on Bench2Drive

Directory Structure

deepsight/
β”œβ”€β”€ configs/                                  # Training configuration files (YAML)
β”œβ”€β”€ data/                                     # Dataset processing utilities
β”œβ”€β”€ src/                                      # Core source code
β”‚   β”œβ”€β”€ tools/                                # Data & evaluation utility scripts
β”‚   β”‚   β”œβ”€β”€ crop_bev_for_bench2drive.py       # BEV image cropping
β”‚   β”‚   β”œβ”€β”€ visual_for_bev.py                 # BEV visualization
β”‚   β”‚   β”œβ”€β”€ create_date_set.py                # VLM training data construction
β”‚   β”‚   β”œβ”€β”€ eval_and_visual.py                # Inference visualization & open-loop eval
β”‚   β”‚   └── merge_model_weight.py             # Model weight merging (for vLLM)
β”‚   β”œβ”€β”€ transformers/src/transformers/        # Modified transformers
β”‚   β”‚   └── models/qwen2_5_vl/modeling_*.py    # Qwen2.5-VL model (with DINOv3)
β”‚   β”œβ”€β”€ llamafactory/data/ad_collator.py      # Data collator (removes token CE loss)
β”‚   β”œβ”€β”€ infer_for_debug.py                    # Original transformers inference
β”‚   └── infer_with_vllm.py                    # vLLM inference
β”œβ”€β”€ bench2drive/                              # Bench2Drive evaluation framework
β”‚   └── leaderboard/scripts/
β”‚       └── run_evaluation_qwen.sh            # Closed-loop evaluation script
β”œβ”€β”€ nebula.sh                                 # Nebula cluster training script
└── requirements.txt                          # Training environment dependencies

1. Data Preparation

1.1 Create BEV Maps

Script: src/tools/crop_bev_for_bench2drive.py

Crops BEV (Bird's-Eye-View) images from Bench2Drive data. Each BEV includes 5 fixed-resolution future motion images for vehicle trajectory prediction.

Notes:

  • Highly sensitive to weather conditions β€” can be mitigated by lowering the BEV height
  • Perspective distortion from nearby tall buildings

Visualization check: src/tools/visual_for_bev.py

Focus on verifying data quality in turning/cornering scenes.

1.2 Create VLM Training Data

Script: bench2drive/dataprocess/targetpointgen.py

Converts raw data into conversational format for training. Input requires annotation files.

1.3 (Optional) Manually Construct CoT Annotation Content

Script: src/tools/create_date_set_target_need_to_cot.py

Replace textprompt with desired prompts to generate data that needs Qwen-3VL annotation.

1.4 (Optional) Call API to Generate Annotation Data

Script: bench2drive/dataprocess/jsonopenai.py

Update the OpenAI API key and call the Qwen3VL model to perform annotation.

---

2. Model Training

Training Entry Point

  1. Add the corresponding dataset information in deepsight/data/dataset_info.json. Refer to the LLaMA-Factory official documentation to organize the dataset and modify the path to the previously generated JSONL file.

  2. Execute the following command:

bash nebula.sh

The training entry point is src/train.py, with hyperparameters defined in YAML configuration files under the configs/ directory.

Loss Design

src/llamafactory/data/ad_collator.py

Designs the training method.

---

3. Model Inference

3.1 Inference with Original Transformers

Script: src/infer_for_debug.py

Uses the modified original transformers for inference (includes DINOv3 and other modules).

3.2 Inference with vLLM

Script: src/infer_with_vllm.py

vLLM uses an internal implementation of transformers that does not include DINOv3 or other custom modules. Model weights must be merged before inference:

---

4. Closed-Loop Evaluation

Step 1: Install CARLA

Note: CARLA can only be used by non-root users.

Install CARLA (0.9.16 has Python 3.10 support, and Python 3.10 is required for large models):

wget https://carla-releases.s3.us-east-005.backblazeb2.com/Linux/CARLA_0.9.16.tar.gz
tar -xvf CARLA_0.9.16.tar.gz
wget https://carla-releases.s3.us-east-005.backblazeb2.com/Linux/AdditionalMaps_0.9.16.tar.gz
bash ImportAssets.sh

After extraction, execute:

/mnt/nas-data-1/zhanglingjun.zlj1/carla/carla0916/ImportAssets.sh

This will extract the additional maps.

Then create a Python 3.10 CARLA environment:

conda activate /mnt/nas-data-1/zhanglingjun.zlj_env/envs/carla/
pip install carla-0.9.16-cp310-cp310-manylinux_2_31_x86_64.whl

Start CARLA:

./CarlaUE4.sh -RenderOffScreen -nosound -fps=10 -carla-rpc-port=2000

-RenderOffScreen indicates headless mode (no GUI).

Check if it's working:

The key is that vulkaninfo | grep "GPU id" can detect the physical machine. If it can detect, then CARLA startup should be fine.

If it cannot start normally, consider:

cd /etc/vulkan

Check if there are any JSON files under icd.d/.

Under icd.d/:

sudo touch nvidia_icd.json

Write:

{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libEGL_nvidia.so.0",
        "api_version" : "1.3.277"
    }
}

Under implicit_layer.d/:

sudo touch nvidia_layers.json

Write:

{
    "file_format_version" : "1.0.0",
    "layer": {
        "name": "VK_LAYER_NV_optimus",
        "type": "INSTANCE",
        "library_path": "libEGL_nvidia.so.0",
        "api_version" : "1.3.277",
        "implementation_version" : "1",
        "description" : "NVIDIA Optimus layer",
        "functions": {
            "vkGetInstanceProcAddr": "vk_optimusGetInstanceProcAddr",
            "vkGetDeviceProcAddr": "vk_optimusGetDeviceProcAddr"
        },
        "enable_environment": {
            "__NV_PRIME_RENDER_OFFLOAD": "1"
        },
        "disable_environment": {
            "DISABLE_LAYER_NV_OPTIMUS_1": ""
        }
    }
}
./CarlaUE4.sh -RenderOffScreen -nosound -fps=10 -carla-rpc-port=2000 -graphicsadapter=5

Step 2: Install Environment

cd bench2drive
conda create -n b2d python=3.10 -y && conda activate b2d

Choose torch based on different CUDA versions (vLLM 0.8.0 requires torch 2.6.0):

export PATH=YOUR_GCC_PATH/bin:$PATH
export CUDA_HOME=YOUR_CUDA_PATH/

Basically, nvcc --version should display normally, and gcc and g++ --version should display normally.

cd to bench2drive zoo folder:

pip install ninja packaging
pip install -v -e .

In the environment:

numba==0.61.2  # In order to speed up
numpy==1.26.4  # In order to adapt numba

You need to modify the content in bench2drive's requirements.

Step 3: Install QwenVL Inference Environment

A reference inference environment example: deepsight/example.txt

Script: bench2drive/leaderboard/scripts/run_evaluation_qwen.sh

Runs the Bench2Drive evaluation leaderboard pipeline with the trained Qwen model. Requires the separate inference environment installed in bench2drive/.

---

Key Components

ComponentDescription
nebula.shTraining job submission script for Nebula cluster
requirements.txtTraining environment Python dependencies
configs/YAML configuration files for different training runs
bench2drive/Bench2Drive evaluation framework (separate inference environment)
src/tools/Data processing, visualization, and evaluation utility scripts
src/transformers/Modified transformers (with DINOv3 integration)
src/llamafactory/data/ad_collator.pyAutonomous driving data collator

License

This project is distributed under the terms described in the LICENSE file.

Weights

ResourceLink
Model WeightsDeepSight on ModelScope DeepSight on HF
Datasetb2d_deepsight on ModelScope

πŸ—“οΈ TODO

  • Release DeepSight reasoning code
  • Release whole DeepSight code
  • Release checkpoints

πŸ™ Acknowledgement

Our work is primarily based on the following codebases:FSDrive, LLaMA-Factory, MoVQGAN, GPT-Driver, Agent-Driver. We are sincerely grateful for their work.