June 7, 2026
daVinci-Dev: Agent-native Mid-training for Software Engineering
News
- 2026-05: daVinci-Dev was accepted as an oral presentation at ICML 2026.
- 2026-01: The daVinci-Dev paper, code, and dataset were released!
Overview
daVinci-Dev is a family of large language models trained for agentic software engineering.
This repo provides:
- The paper PDF: daVinci-Dev.pdf
- A high-performance data processing pipeline under pipeline/ that calls the GitHub API to construct contextually-native PR trajectories
Model Zoo
We will open-source model checkpoints on Hugging Face:
| Model | Description | Link |
|---|---|---|
| daVinci-Dev-72B | Final model (agent-native mid-training + env native SFT) | https://huggingface.co/GAIR/daVinci-Dev-72B |
| daVinci-Dev-32B | Final model (agent-native mid-training + env native SFT) | https://huggingface.co/GAIR/daVinci-Dev-32B |
| daVinci-Dev-72B-MT | MT checkpoint (after agent-native mid-training, before SFT) | https://huggingface.co/GAIR/daVinci-Dev-72B-MT |
| daVinci-Dev-32B-MT | MT checkpoint (after agent-native mid-training, before SFT) | https://huggingface.co/GAIR/daVinci-Dev-32B-MT |
Datasets
Datasets are released through Hugging Face:
| Dataset | Description | Link |
|---|---|---|
| daVinci-Dev | Agent-native trajectories used in our training recipe (as permitted) | https://huggingface.co/datasets/GAIR/daVinci-Dev |
High-level composition (see the paper for details):
- Contextually-native trajectories (PR-derived, Python variant)
- Environmentally-native trajectories (executable rollouts, test-passing subset)
Pipeline
The directory pipeline/ contains a high-performance pipeline that calls the GitHub API and constructs the structured PR representation used to build the contextually-native trajectories.
| Pipeline | Description | Link |
|---|---|---|
| daVinci-Dev Pipeline | A high-performance pipeline used to build the contextually-native trajectories | pipeline/ |
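
As a rough illustration of the kind of GitHub API call such a pipeline relies on, the sketch below builds a request for a single pull request through the public REST endpoint `GET /repos/{owner}/{repo}/pulls/{number}` and reduces the response to a few fields. It is not the code in pipeline/: the function names `build_pr_request`, `summarize_pr`, and `fetch_pr`, and the choice of fields kept, are assumptions made for this example only.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_pr_request(owner, repo, number, token=None):
    """Build (but do not send) a request for one pull request.

    Hypothetical helper for illustration; not part of pipeline/.
    """
    url = f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{number}"
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "pr-fetch-sketch",  # GitHub expects a User-Agent header
    }
    if token:
        # A personal access token raises the rate limit; it is optional here.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


def summarize_pr(payload):
    """Keep a few fields of the PR payload (an assumed, minimal selection)."""
    return {
        "title": payload.get("title"),
        "body": payload.get("body") or "",
        "base_sha": (payload.get("base") or {}).get("sha"),
        "merged": bool(payload.get("merged_at")),
    }


def fetch_pr(owner, repo, number, token=None):
    """Send the request and return the summarized PR (needs network access)."""
    request = build_pr_request(owner, repo, number, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return summarize_pr(json.load(response))
```

Keeping request construction separate from the network call makes the URL and header logic easy to check offline. A real pipeline would also need pagination, rate-limit handling, and retrieval of diffs and review comments, none of which is shown here.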
Utilities for Environmentally-native Trajectories
The directory env_traj_utils/ provides utilities for converting environmentally-native trajectories to LLM-trainable formats:
| Script | Description |
|---|---|
| convert_trajectories.py | Convert SWE-agent trajectories to XML function calling format |
| tokenize_trajectories.py | Tokenize trajectories and filter by length |
See the env_traj_utils README for quickstart instructions on converting env-native.jsonl to formats compatible with training frameworks like SLIME.
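
To make the "filter by length" step concrete, here is a minimal sketch of length-based filtering over JSONL records. It does not reproduce tokenize_trajectories.py: the record layout (a `messages` list of dicts with a `content` field) is an assumed schema, and the whitespace tokenizer is a stand-in for whatever model tokenizer the real script uses.

```python
import json


def whitespace_tokenize(text):
    """Stand-in tokenizer (assumption): splits on whitespace only."""
    return text.split()


def trajectory_length(record, tokenize=whitespace_tokenize):
    """Total token count over all messages of one trajectory.

    Assumes a hypothetical schema: {"messages": [{"content": "..."}, ...]}.
    """
    return sum(
        len(tokenize(message.get("content") or ""))
        for message in record.get("messages", [])
    )


def filter_by_length(lines, max_tokens, tokenize=whitespace_tokenize):
    """Yield parsed JSONL records whose token count is at most max_tokens."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if trajectory_length(record, tokenize) <= max_tokens:
            yield record
```

Because `filter_by_length` accepts any iterable of lines, an open file handle can be passed directly, for example `filter_by_length(open("env-native.jsonl"), max_tokens=32768)`; the threshold here is an arbitrary illustrative value. Swapping in a real tokenizer only requires passing a different `tokenize` callable.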
License
This project is a mixed release:
- PR-derived subset: the open-source release only includes PRs from repositories detected as having a permissive license.
- Executable rollout subset: derived from SWE-rebench, licensed under CC-BY-4.0.
- daVinci-Dev models: released under the Qwen license. Users should verify the licensing status of any generated code before using it in production.
- daVinci-Dev pipeline: released under the Apache-2.0 license.
Downstream users are responsible for ensuring their usage complies with the licenses of the underlying sources.
Citation
If you use this work, please cite the daVinci-Dev paper.
@misc{zeng2026davincidevagentnativemidtrainingsoftware,
title={daVinci-Dev: Agent-native Mid-training for Software Engineering},
author={Ji Zeng and Dayuan Fu and Tiantian Mi and Yumin Zhuang and Yaxing Huang and Xuefeng Li and Lyumanshan Ye and Muhang Xie and Qishuo Hua and Zhen Huang and Mohan Jiang and Hanning Wang and Jifan Lin and Yang Xiao and Jie Sun and Yunze Wu and Pengfei Liu},
year={2026},
eprint={2601.18418},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2601.18418},
}