December 4, 2025

Kwaipilot

🧠 SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Current evaluations of LLMs for software engineering are limited by a narrow range of task categories, a Python-centric bias, and insufficient alignment with real-world development workflows.
To bridge these gaps, SWE-Compass establishes a high-coverage, multi-dimensional, and production-aligned evaluation framework:

  • ✨ Covers 8 software engineering task types, 8 programming scenarios, and 10 programming languages
  • ✨ Contains 2000 high-quality instances sourced from real GitHub pull requests
  • ✨ Supports multi-dimensional performance comparison across task types, languages, and scenarios

By integrating heterogeneous code tasks with real engineering practices, SWE-Compass provides a reproducible, rigorous, and production-oriented benchmark for diagnosing and improving the software engineering capabilities of large language models.


✨ Key Features

  • ⚙️ Automated Docker-based evaluation environment
  • 📦 Multi-project, multi-task, multi-language
  • 🤖 Supports execution and evaluation of model-generated patches
  • 📊 Multi-dimensional performance metrics: task type, scenario, language
  • 🌟 Optional integration with an LLM judge for code understanding tasks
  • 🔄 Highly reproducible, designed for research and production applications

📦 1. Environment Setup

1.1 Install Docker

Refer to the official documentation:
https://docs.docker.com/engine/install/

1.2 Install Python 3.11 and Dependencies

Enter the project directory and run:

cd swe-compass
pip install -e .
pip install -r requirements.txt
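
Before pulling images, it can help to confirm that the interpreter version and the Docker CLI are what the steps above expect. The following is a minimal sketch and not part of SWE-Compass: `check_environment` is a hypothetical helper built only on the Python standard library. It checks that a `docker` executable is on `PATH`, not that the Docker daemon is running.

```python
import shutil
import sys


def check_environment(version=None, docker_path=None):
    """Return a list of problems found; an empty list means none were detected.

    Hypothetical helper for illustration. Both arguments exist only so the
    check can be exercised without touching the real system.
    """
    # Default to the running interpreter and to a PATH lookup for docker.
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    docker_path = shutil.which("docker") if docker_path is None else docker_path

    problems = []
    if version != (3, 11):
        problems.append(f"Python 3.11 expected, found {version[0]}.{version[1]}")
    if not docker_path:
        problems.append("docker executable not found on PATH")
    return problems


# Usage:
#   for problem in check_environment():
#       print("WARNING:", problem)
```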

🐳 2. Download Required Docker Images and Supplementary Data

Enter the project directory and run:

cd swe-compass
bash pull_docker.sh
python download_all_data.py

The scripts automatically download the evaluation environment from Docker Hub.


📄 3. Prepare Prediction Data

You need to prepare a JSON file that maps each instance_id to its corresponding patch and metadata.

Example format (see swe-compass/data/example.json):

{
  "<instance_id>": {
    "model_name_or_path": "<your_model_name>",
    "instance_id": "<instance_id>",
    "model_patch": "<your_model_patch>"
  }
}

Each prediction entry requires only three fields: model_name_or_path, instance_id, and model_patch.
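
One way to produce a file in this shape is sketched below. The helper names (`build_predictions`, `find_malformed`, `write_predictions`) are hypothetical and not part of SWE-Compass; the sketch assumes you already hold a mapping from instance IDs to patch text.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"model_name_or_path", "instance_id", "model_patch"}


def build_predictions(patches, model_name):
    """Turn {instance_id: patch_text} into the three-field format shown above."""
    return {
        instance_id: {
            "model_name_or_path": model_name,
            "instance_id": instance_id,
            "model_patch": patch,
        }
        for instance_id, patch in patches.items()
    }


def find_malformed(predictions):
    """Return the keys whose entry lacks a field or whose instance_id differs from the key."""
    bad = []
    for key, entry in predictions.items():
        if not REQUIRED_FIELDS.issubset(entry) or entry["instance_id"] != key:
            bad.append(key)
    return bad


def write_predictions(predictions, path):
    """Serialize the predictions mapping as a UTF-8 JSON file."""
    Path(path).write_text(json.dumps(predictions, indent=2), encoding="utf-8")


# Usage ("<instance_id>" and "<your_model_patch>" are placeholders, as in the example format):
#   preds = build_predictions({"<instance_id>": "<your_model_patch>"}, "<your_model_name>")
#   assert not find_malformed(preds)
#   write_predictions(preds, "my_predictions.json")
```

Running `find_malformed` before an evaluation catches entries whose key and `instance_id` field disagree, which is easy to introduce when merging outputs from several runs.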


▶️ 4. Run Evaluation

4.1 Basic Command

cd swe-compass
python validation.py \
  --dataset_name ./data/swecompass_all_2000.jsonl \
  --predictions_path <your_predictions.json> \
  --max_workers <num_workers> \
  --run_id <run_id> \
  --model_name <judge_model_name> \
  --api_key <judge_api_key> \
  --base_url <judge_model_base_url> \
  --proxy <proxy_address>

4.2 Example

python validation.py \
  --dataset_name ./data/swecompass_all_2000.jsonl \
  --predictions_path ./data/example.json \
  --max_workers 10 \
  --run_id test \
  --model_name deepseek_v3 \
  --api_key xxx \
  --base_url xxx \
  --proxy http ... 
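
The same invocation can be assembled from Python, for instance to read the judge API key from the environment instead of typing it on the command line. This is a sketch under stated assumptions, not part of SWE-Compass: `build_command` and the `JUDGE_API_KEY` variable name are hypothetical, and only the flags listed in the basic command are used.

```python
import subprocess
import sys


def build_command(predictions_path, run_id, judge_model, api_key, base_url, proxy,
                  dataset="./data/swecompass_all_2000.jsonl", max_workers=10):
    """Build the argument list for validation.py using the flags from the basic command."""
    return [
        sys.executable, "validation.py",
        "--dataset_name", dataset,
        "--predictions_path", predictions_path,
        "--max_workers", str(max_workers),  # argv entries must be strings
        "--run_id", run_id,
        "--model_name", judge_model,
        "--api_key", api_key,
        "--base_url", base_url,
        "--proxy", proxy,
    ]


# Usage (run from the directory that contains swe-compass/):
#   import os
#   cmd = build_command("./data/example.json", "test", "deepseek_v3",
#                       os.environ["JUDGE_API_KEY"],  # hypothetical variable name
#                       "<judge_model_base_url>", "<proxy_address>")
#   subprocess.run(cmd, cwd="swe-compass", check=True)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` raises an exception if the evaluation exits with a non-zero status.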

📊 5. Evaluation Outputs


5.1 Work Logs Directory

swe-compass/output/work/<run_id>/

Contains execution traces and logs for each instance.


5.2 Evaluation Results Directory

swe-compass/output/result/<run_id>/

Contains two files:

| File | Content |
| --- | --- |
| raw_data.jsonl | Raw evaluation results for each instance |
| result.json | Aggregated scores by task, language, and scenario |
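
Both files can be loaded with the standard library for further analysis. The sketch below assumes only what is stated above: `raw_data.jsonl` holds one JSON object per line and `result.json` is a single JSON document. The field names inside them are not documented here, so the hypothetical `load_results` helper returns the parsed data without interpreting it.

```python
import json
from pathlib import Path


def load_results(run_id, root="swe-compass/output/result"):
    """Load the per-instance records and the aggregated summary for one run."""
    result_dir = Path(root) / run_id

    # JSON Lines: one record per non-empty line.
    with open(result_dir / "raw_data.jsonl", encoding="utf-8") as fh:
        raw = [json.loads(line) for line in fh if line.strip()]

    with open(result_dir / "result.json", encoding="utf-8") as fh:
        summary = json.load(fh)

    return raw, summary


# Usage:
#   raw, summary = load_results("test")
#   print(len(raw), "instance records")
#   print(json.dumps(summary, indent=2))
```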

⚙️ 6. Common Arguments

| Argument | Description |
| --- | --- |
| --dataset_name | Path to dataset |
| --predictions_path | Model predictions JSON file |
| --max_workers | Number of worker processes |
| --run_id | Unique identifier for this run |
| --model_name | Judge LLM model name |
| --api_key | Judge LLM API key |
| --base_url | Judge LLM API URL |
| --proxy | Proxy address |

🤝 7. Contributions

We welcome contributions from the research community in NLP, Machine Learning, and Software Engineering.
Researchers are encouraged to submit issues or pull requests that extend, evaluate, or refine the benchmark.

For collaboration or inquiries, please contact:

We appreciate constructive engagement and look forward to further improvements driven by the community.

📄 8. Citation

@article{xu2025SWECompass,
  title={SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
  author={Xu, Jingxuan and Deng, Ken and Li, Weihao and Yu, Songwei and others},
  journal={arXiv preprint arXiv:2511.05459},
  year={2025}
}