OmniGIRL
June 13, 2025
Website • Hugging Face • Env Docker Image • arXiv Paper · ISSTA 2025
Key Features
- Convenient, Standardized Evaluation Environment
  Provides pre-built Docker images, which simplify environment setup and keep evaluation runs consistent and reproducible.
- Extensive Programming Language Coverage
  Supports Python, Java, JavaScript, and TypeScript, enabling evaluation across these four major programming language ecosystems.
- Rich Multimodal Input Data
  Integrates diverse modalities (text, web content, and images), so evaluated models must understand and use information from all of these sources to resolve issues.
- Automatic Environment Setup & Dataset Construction Tool
  We introduce SWE-Factory, an automatic issue-resolution benchmark construction pipeline based on a multi-agent framework. For more information and the full source code, visit SWE-Factory.
Environment Setup
To get started, run the bash script below to set up the environment:
bash setup.sh
Running Evaluations
After setting up the environment, complete the following steps to run an evaluation:
- Prepare the prediction file: a JSONL file of predicted patches, where each item contains:
    model_name_or_path: Model name
    instance_id: Task instance ID
    prediction_patch: Prediction patch content
  Example:
    { "model_name_or_path": "agentless-v1", "instance_id": "prettier__prettier-12260", "model_patch": "diff --git ...." }
- Move to omnigirl/harness, then run the evaluation with the following command:
    # required
    cd omnigirl/harness
    python run_evaluation.py --predictions_path <path of your prediction results> \
        --max_workers <number of workers> \
        --run_id <unique id number of this evaluation>
- By default, evaluation results are generated in omnigirl/harness/reports.
- For a detailed tutorial on evaluation, refer to the omnigirl/harness directory.
- We recommend running the evaluation on machines with the amd64 architecture, consistent with the evaluation environment in the paper.
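As a rough illustration of the prediction-file step, the sketch below writes a list of records to a JSONL file and reads it back. The helper names `write_predictions` and `read_predictions` are hypothetical and not part of the OmniGIRL harness. The field list above names the patch key `prediction_patch` while the example record uses `model_patch`, so the sketch takes the patch key as a parameter instead of assuming one.

```python
import json
import tempfile
from pathlib import Path

# Keys shared by the field list and the example record in this README.
BASE_KEYS = ("model_name_or_path", "instance_id")


def write_predictions(records, path, patch_key="model_patch"):
    """Write records to `path` as JSONL (one JSON object per line).

    `patch_key` is an assumption: the README example uses `model_patch`,
    while its field list says `prediction_patch`.
    """
    required = BASE_KEYS + (patch_key,)
    with open(path, "w", encoding="utf-8") as fh:
        for i, rec in enumerate(records):
            missing = [k for k in required if k not in rec]
            if missing:
                raise ValueError(f"record {i} is missing keys: {missing}")
            fh.write(json.dumps(rec) + "\n")
    return path


def read_predictions(path):
    """Read a JSONL predictions file back into a list of dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    # Placeholder record mirroring the README example; the patch body is elided.
    demo = [
        {
            "model_name_or_path": "agentless-v1",
            "instance_id": "prettier__prettier-12260",
            "model_patch": "diff --git ....",
        }
    ]
    out = Path(tempfile.mkdtemp()) / "predictions.jsonl"
    write_predictions(demo, out)
    print(read_predictions(out))
```

The resulting file path is what would be passed to `--predictions_path` in the command above.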
Citation
If you find OmniGIRL useful for your research and applications, feel free to give us a star ⭐ or cite us using:
@inproceedings{guo2025omnigirl,
title={OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution},
author={Guo, Lianghong and Tao, Wei and Jiang, Runhan and Wang, Yanlin and Chen, Jiachi and Liu, Xilin and Ma, Yuchi and Mao, Mingzhi and Zhang, Hongyu and Zheng, Zibin},
booktitle={Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis},
year={2025},
publisher={{ACM}},
}
Acknowledgements
- We build on prior work (SWE-bench, Agentless, and AutoCodeRover), which laid the groundwork for this study.
- We thank the EvalPlus leaderboard team for releasing the elegant page template that inspired this site.
- Finally, we are grateful to the open-source developer community for their invaluable contributions.