👉🏻 OmniGIRL 👈🏻

June 13, 2025

🌐 Website • 🤗 Hugging Face • 🐋 Env Docker Image • 📃 arXiv Paper • 📓 ISSTA 2025

✨ Key Features

🚀 Convenient, Standardized Evaluation Environment

Provides pre-built Docker images, which simplify environment setup and keep evaluation runs consistent and reproducible.

🕸 Extensive Programming Language Coverage

Supports Python, Java, JavaScript, and TypeScript, enabling evaluation across four major programming language ecosystems.

🗂️ Rich Multimodal Input Data

Integrates diverse modalities (text, web content, and images), so evaluated models must understand and use information from all of these sources to resolve issues.

⚒ Automatic Environment Setup & Dataset Construction Tool

We introduce SWE-Factory, an automatic issue-resolution benchmark construction pipeline based on a multi-agent framework. For more information and the full source code, visit: SWE-Factory.

📦 Environment Setup

To get started, run the bash script below to set up the environment:

bash setup.sh

🚀 Running Evaluations

After setting up the environment, complete the following steps to run an evaluation:

Prepare a prediction file: a JSONL file of patches, where each item contains:
- model_name_or_path: model name
- instance_id: task instance ID
- model_patch: content of the predicted patch
Example:
```
{
    "model_name_or_path": "agentless-v1",
    "instance_id": "prettier__prettier-12260",
    "model_patch": "diff --git ...."
}
```
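A prediction file in this format can be produced with a few lines of Python. The following is a minimal sketch, not part of the OmniGIRL harness: the helper names `write_predictions` and `read_predictions` and the file name `predictions.jsonl` are assumptions for illustration. It uses the keys from the example above and writes each record on a single line, as the JSONL format requires (the example above is expanded across lines only for readability).

```python
import json
from pathlib import Path

# Keys taken from the example record above.
REQUIRED_KEYS = ("model_name_or_path", "instance_id", "model_patch")


def write_predictions(predictions, path):
    """Write one JSON object per line (JSONL); return the number of records written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for record in predictions:
            missing = [key for key in REQUIRED_KEYS if key not in record]
            if missing:
                raise ValueError(f"prediction is missing keys: {missing}")
            # json.dumps escapes newlines inside the patch, so each record stays on one line.
            f.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_predictions(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    example = [
        {
            "model_name_or_path": "agentless-v1",
            "instance_id": "prettier__prettier-12260",
            "model_patch": "diff --git ....",
        }
    ]
    # "predictions.jsonl" is a hypothetical output path.
    write_predictions(example, "predictions.jsonl")
```

Reading the file back with `read_predictions` before starting an evaluation is a cheap way to catch malformed lines early.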

Change to the omnigirl/harness directory, then run the evaluation with the following command:

# required
cd omnigirl/harness

python run_evaluation.py --predictions_path <path of your prediction results> \
                         --max_workers <number of workers> \
                         --run_id <unique id number of this evaluation>
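If you launch evaluations from a Python script, the command above can be assembled as an argument list. This is a minimal sketch that assumes only the three flags shown above; the helper name `build_eval_command` and the sample values are illustrative and not part of OmniGIRL.

```python
import shlex


def build_eval_command(predictions_path, max_workers, run_id):
    """Return the run_evaluation.py invocation as an argument list."""
    return [
        "python",
        "run_evaluation.py",
        "--predictions_path", str(predictions_path),
        "--max_workers", str(max_workers),
        "--run_id", str(run_id),
    ]


if __name__ == "__main__":
    # Sample values only; substitute your own prediction file, worker count, and run id.
    cmd = build_eval_command("predictions.jsonl", 4, "demo-run")
    print(shlex.join(cmd))
    # To launch it, pass the list to subprocess.run with cwd="omnigirl/harness".
```

Passing the arguments as a list avoids shell-quoting problems when the prediction path contains spaces.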

By default, evaluation results are written to omnigirl/harness/reports.
For a detailed evaluation tutorial, refer to the omnigirl/harness directory.
We recommend running evaluations on machines with the amd64 architecture, consistent with the evaluation environment used in the paper.
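To check the host architecture before a run, a small Python helper is enough. This is a sketch rather than part of the harness, and the function name `is_amd64` is hypothetical. It relies on `platform.machine()`, which typically reports amd64 hosts as `x86_64` on Linux and macOS and as `AMD64` on Windows.

```python
import platform


def is_amd64(machine=None):
    """Return True if the given (or current) machine string denotes amd64/x86_64."""
    name = machine if machine is not None else platform.machine()
    return name.lower() in {"x86_64", "amd64"}


if __name__ == "__main__":
    if not is_amd64():
        print(f"Warning: host architecture is {platform.machine()!r}, not amd64.")
```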

📖 Citation

If you find OmniGIRL useful for your research and applications, feel free to give us a star ⭐ or cite us using:

@inproceedings{guo2025omnigirl,
  title={OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution},
  author={Guo, Lianghong and Tao, Wei and Jiang, Runhan and Wang, Yanlin and Chen, Jiachi and Liu, Xilin and Ma, Yuchi and Mao, Mingzhi and Zhang, Hongyu and Zheng, Zibin},
  booktitle={Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis},
  year={2025},
  publisher={{ACM}},
}

🙏 Acknowledgements

We build on prior work — SWE-bench, Agentless, and AutoCodeRover — which laid the groundwork for this study.
We thank the EvalPlus leaderboard team for releasing the elegant page template that inspired this site.
Finally, we are grateful to the open-source developer community for their invaluable contributions.