June 7, 2026

SII GAIR

ICML 2026 arXiv GitHub Hugging Face Hugging Face

daVinci-Dev: Agent-native Mid-training for Software Engineering

News

  • 2026-05: 🎉 daVinci-Dev was accepted as an oral presentation at ICML 2026.
  • 2026-01: daVinci-Dev paper, code and dataset were released!

Overview

daVinci-Dev is a family of large language models trained for agentic software engineering.

This repo provides:

  • The paper PDF: daVinci-Dev.pdf
  • A high-performance data processing pipeline under pipeline/ that calls the GitHub API to construct contextually-native PR trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$

Model Zoo

We will open-source model checkpoints on Hugging Face:

| Model | Description | Link |
| --- | --- | --- |
| daVinci-Dev-72B | Final model (agent-native mid-training + env native SFT) | https://huggingface.co/GAIR/daVinci-Dev-72B |
| daVinci-Dev-32B | Final model (agent-native mid-training + env native SFT) | https://huggingface.co/GAIR/daVinci-Dev-32B |
| daVinci-Dev-72B-MT | MT checkpoint (after agent-native mid-training, before SFT) | https://huggingface.co/GAIR/daVinci-Dev-72B-MT |
| daVinci-Dev-32B-MT | MT checkpoint (after agent-native mid-training, before SFT) | https://huggingface.co/GAIR/daVinci-Dev-32B-MT |
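
As a rough sketch of how a checkpoint from the table might be loaded, the snippet below maps the listed repository IDs to a size and stage and then calls the generic Hugging Face `transformers` loaders. The repository IDs come from the table; the helper names `repo_id` and `load_checkpoint`, the `"final"`/`"mt"` stage labels, and the assumption that the checkpoints load through `AutoModelForCausalLM` are illustrative and not confirmed by this README.

```python
# Repository IDs are copied from the Model Zoo table.
# The (size, stage) keys are hypothetical labels used only in this sketch.
MODEL_REPOS = {
    ("72B", "final"): "GAIR/daVinci-Dev-72B",
    ("32B", "final"): "GAIR/daVinci-Dev-32B",
    ("72B", "mt"): "GAIR/daVinci-Dev-72B-MT",
    ("32B", "mt"): "GAIR/daVinci-Dev-32B-MT",
}


def repo_id(size: str = "32B", stage: str = "final") -> str:
    """Return the Hugging Face repo ID for a checkpoint listed in the table."""
    try:
        return MODEL_REPOS[(size, stage)]
    except KeyError:
        raise ValueError(f"unknown checkpoint: size={size!r}, stage={stage!r}") from None


def load_checkpoint(size: str = "32B", stage: str = "final"):
    """Load tokenizer and model.

    Assumption: the checkpoints use the standard causal-LM interface.
    Requires `transformers`, `torch`, and `accelerate` (for device_map="auto").
    """
    # Imported lazily so repo_id() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = repo_id(size, stage)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype="auto", device_map="auto"
    )
    return tokenizer, model
```

Calling `load_checkpoint("32B", "mt")` would download the 32B mid-training checkpoint, which needs substantial disk space and GPU memory.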

Datasets

Datasets are released through Hugging Face:

| Dataset | Description | Link |
| --- | --- | --- |
| daVinci-Dev | Agent-native trajectories used in our training recipe (as permitted) | https://huggingface.co/datasets/GAIR/daVinci-Dev |

High-level composition (see the paper for details):

  • Contextually-native trajectories $\mathcal{D}^{\text{ctx}}_{\text{py}}$ (PR-derived, Python variant)
  • Environmentally-native trajectories $\mathcal{D}^{\text{env}}_{\text{pass}}$ (executable rollouts, test-passing subset)
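
One way to inspect a few records without downloading the whole dataset is to stream it with the Hugging Face `datasets` library. This is a sketch under assumptions: the dataset ID comes from the table above, but the split name `"train"`, the helper names `peek` and `stream_dataset`, and the absence of multiple configurations are guesses. If the dataset defines several configurations, a configuration name must also be passed to `load_dataset`.

```python
from itertools import islice
from typing import Any, Iterable, List

DATASET_ID = "GAIR/daVinci-Dev"  # from the Datasets table


def peek(records: Iterable[Any], n: int = 3) -> List[Any]:
    """Return the first n items of any iterable without consuming the rest eagerly."""
    return list(islice(records, n))


def stream_dataset(split: str = "train"):
    """Open the dataset in streaming mode.

    Assumption: a split called "train" exists; check the dataset card.
    """
    # Imported lazily so peek() works without the datasets library installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)
```

For example, `peek(stream_dataset(), 1)` would fetch a single record so its fields can be examined before committing to a full download.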

Pipeline

The directory pipeline/ contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$.

| Pipeline | Description | Link |
| --- | --- | --- |
| daVinci-Dev Pipeline | a high-performance pipeline used to build $\mathcal{D}^{\text{ctx}}_{\text{py}}$ | pipeline/ |
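
To give a feel for what "calling the GitHub API" for a PR involves, here is a minimal standard-library sketch that requests a pull request's metadata, changed files, and commits through the public GitHub REST API. It is not the code in pipeline/: the function names `pr_endpoints`, `build_request`, and `fetch_pr` are hypothetical, and the choice of these three endpoints is an assumption about what a structured PR representation might need.

```python
import json
import urllib.request
from typing import Dict, Optional

API_ROOT = "https://api.github.com"


def pr_endpoints(owner: str, repo: str, number: int) -> Dict[str, str]:
    """Build the REST URLs for one pull request (metadata, files, commits)."""
    base = f"{API_ROOT}/repos/{owner}/{repo}/pulls/{number}"
    return {"pull": base, "files": f"{base}/files", "commits": f"{base}/commits"}


def build_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Create a request with the headers GitHub recommends for its REST API."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        # Authenticated requests get a much higher rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


def fetch_pr(owner: str, repo: str, number: int, token: Optional[str] = None) -> dict:
    """Fetch the three resources for one PR (performs network calls).

    Only the first page of files and commits is returned; a real pipeline
    would follow pagination and handle rate limiting.
    """
    result = {}
    for key, url in pr_endpoints(owner, repo, number).items():
        with urllib.request.urlopen(build_request(url, token), timeout=30) as resp:
            result[key] = json.load(resp)
    return result
```

A high-throughput pipeline would typically add concurrency, retries, and caching on top of requests like these.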

Utilities for Environmentally-native Trajectories

The directory env_traj_utils/ provides utilities for converting environmentally-native trajectories ($\mathcal{D}^{\text{env}}$) to LLM-trainable formats:

| Script | Description |
| --- | --- |
| convert_trajectories.py | Convert SWE-agent trajectories to XML function calling format |
| tokenize_trajectories.py | Tokenize trajectories and filter by length |

See the env_traj_utils README for quickstart instructions on converting env-native.jsonl to formats compatible with training frameworks like SLIME.
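
The "tokenize and filter by length" step can be illustrated with a small, self-contained sketch. It does not reproduce tokenize_trajectories.py: the record schema (a `"messages"` list of dicts with a `"content"` field), the function names, and the whitespace tokenizer are all assumptions made for illustration. In practice the tokenizer of the target model would be passed in instead.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer; swap in the target model's tokenizer in practice."""
    return text.split()


def trajectory_text(record: Dict) -> str:
    """Join message contents. Assumes a hypothetical {"messages": [...]} schema."""
    return "\n".join(m["content"] for m in record["messages"])


def filter_by_length(
    records: Iterable[Dict],
    max_tokens: int,
    tokenize: Callable[[str], List] = whitespace_tokenize,
) -> Iterator[Dict]:
    """Yield only the records whose token count is at most max_tokens."""
    for record in records:
        if len(tokenize(trajectory_text(record))) <= max_tokens:
            yield record


def read_jsonl(path: str) -> Iterator[Dict]:
    """Read one JSON object per non-empty line, e.g. from env-native.jsonl."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

With these helpers, `list(filter_by_length(read_jsonl("env-native.jsonl"), max_tokens=32768))` would keep the trajectories that fit a 32k-token budget, where the budget value is itself only an example.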

License

This project is a mixed release:

  • PR-derived subset: the open-source release includes only PRs from repositories detected as having a permissive license.
  • Executable rollout subset: derived from SWE-rebench, licensed under CC-BY-4.0.
  • daVinci-Dev models: released under the Qwen license. Users should verify the licensing status of any generated code before using it in production.
  • daVinci-Dev pipeline: released under the Apache-2.0 license.

Downstream users are responsible for ensuring their usage complies with the licenses of the underlying sources.

Citation

If you use this work, please cite the daVinci-Dev paper.

@misc{zeng2026davincidevagentnativemidtrainingsoftware,
      title={daVinci-Dev: Agent-native Mid-training for Software Engineering},
      author={Ji Zeng and Dayuan Fu and Tiantian Mi and Yumin Zhuang and Yaxing Huang and Xuefeng Li and Lyumanshan Ye and Muhang Xie and Qishuo Hua and Zhen Huang and Mohan Jiang and Hanning Wang and Jifan Lin and Yang Xiao and Jie Sun and Yunze Wu and Pengfei Liu},
      year={2026},
      eprint={2601.18418},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2601.18418},
}