👉🏻 OmniGIRL 👈🏻

June 13, 2025

🌐 Website • 🤗 Hugging Face • 🐋 Env Docker Image • 📃 arXiv Paper • 📓 ISSTA 2025

✨ Key Features

  • 🚀 Convenient, Standardized Evaluation Environment

    Provides pre-built Docker images, which simplify environment setup and keep evaluation runs consistent and reproducible.

  • 🕸 Extensive Programming Language Coverage

    Supports Python, Java, JavaScript, and TypeScript, enabling evaluation across these four major programming language ecosystems.

  • 🗂️ Rich Multimodal Input Data

    Integrates diverse modalities (text, web content, and images), so evaluated models must understand and combine information from all of these sources to resolve issues.

  • ⚒ Automatic Environment Setup & Dataset Construction Tool

    We introduce SWE-Factory, an automatic issue-resolution benchmark construction pipeline based on a multi-agent framework. For more information and the full source code, visit: SWE-Factory.


📦 Environment Setup

To get started, run the bash script below to set up the environment:

bash setup.sh

🚀 Running Evaluations

After setting up the environment, follow these steps to run an evaluation:

  1. Prepare a prediction file: a JSONL file of patches, where each item contains:

    • model_name_or_path: Model name
    • instance_id: Task instance ID
    • model_patch: Predicted patch content

    Example:

    {
        "model_name_or_path": "agentless-v1",
        "instance_id": "prettier__prettier-12260",
        "model_patch": "diff --git ...."
    }
    
  2. Move to omnigirl/harness, then run the evaluation with the following command:

    # required
    cd omnigirl/harness
    
    python run_evaluation.py --predictions_path <path of your prediction results> \
                             --max_workers <number of workers> \
                             --run_id <unique id number of this evaluation>
    
  3. By default, your evaluation results will be generated in omnigirl/harness/reports.

  4. For a detailed evaluation tutorial, refer to the omnigirl/harness directory.

  5. We recommend running the evaluation on machines with amd64 architecture, consistent with the evaluation environment used in the paper.
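
As a minimal sketch of step 1, the Python script below writes prediction records to a JSONL file, one JSON object per line, and reads them back. The key names follow the example record above. The helper names `write_predictions` and `read_predictions` and the output path `predictions.jsonl` are hypothetical and are not part of the OmniGIRL harness.

```python
import json
from pathlib import Path

# Key names are taken from the example record in step 1. This is an assumption:
# check the harness documentation if your version expects different keys.
REQUIRED_KEYS = ("model_name_or_path", "instance_id", "model_patch")


def write_predictions(records, path):
    """Write one JSON object per line (JSONL), checking required keys first."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            missing = [key for key in REQUIRED_KEYS if key not in record]
            if missing:
                raise ValueError(
                    f"record {record.get('instance_id')!r} is missing {missing}"
                )
            # json.dumps escapes newlines inside the patch text,
            # so every record stays on a single line.
            fh.write(json.dumps(record) + "\n")
    return path


def read_predictions(path):
    """Load a JSONL predictions file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    example = [
        {
            "model_name_or_path": "agentless-v1",
            "instance_id": "prettier__prettier-12260",
            "model_patch": "diff --git ....",
        }
    ]
    out = write_predictions(example, "predictions.jsonl")
    print(f"wrote {len(read_predictions(out))} prediction(s) to {out}")
```

The resulting file can then be passed to run_evaluation.py through --predictions_path.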

📖 Citation

If you find OmniGIRL useful for your research and applications, feel free to give us a star ⭐ or cite us using:

@inproceedings{guo2025omnigirl,
  title={OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution},
  author={Guo, Lianghong and Tao, Wei and Jiang, Runhan and Wang, Yanlin and Chen, Jiachi and Liu, Xilin and Ma, Yuchi and Mao, Mingzhi and Zhang, Hongyu and Zheng, Zibin},
  booktitle={Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis},
  year={2025},
  publisher={{ACM}},
}

🙏 Acknowledgements

  • We build on prior work (SWE-bench, Agentless, and AutoCodeRover), which laid the groundwork for this study.
  • We thank the EvalPlus leaderboard team for releasing the elegant page template that inspired this site.
  • Finally, we are grateful to the open-source developer community for their invaluable contributions.