SWE-QA

April 7, 2026

SWE-QA is a benchmark for repository-level code question answering. This repository hosts the benchmark data (question–answer pairs tied to pinned commits) and code to construct the benchmark, clone evaluation repositories, and run baselines and agents.

It covers the original SWE-QA v1 release (12 popular Python projects such as Django and Flask) together with the complementary SWE-QA v2 that adds conan, streamlink, and reflex.

πŸ‘Our paper "SWE-QA: Can Language Models Answer Repository-level Code Questions?" has been accepted to ACL 2026 Findings.

📖 Paper

For more details about the methodology and results, please refer to the paper:

  • Paper: "SWE-QA: Can Language Models Answer Repository-level Code Questions?" (arXiv:2509.14635)

📊 Dataset

The benchmark dataset is available on Hugging Face.

Benchmark Construction Workflow

The following diagram illustrates the workflow for constructing the SWE-QA benchmark:

[Figure: Benchmark Construction Workflow]

Benchmark Example

The following example shows the structure and format of questions in the benchmark:

[Figure: Benchmark Example]

📁 Repository Structure

SWE-QA-Bench/                  # Repository root
├── Benchmark/                 # Released benchmark (JSONL per project)
│   └── *.jsonl                # e.g. astropy.jsonl, django.jsonl, ...
├── Benchmark construction/    # Build and score the benchmark
│   ├── issue_analyzer/        # GitHub issue to question drafts
│   ├── qa_generator/
│   ├── repo_parser/
│   ├── score/                 # e.g. llm-as-a-judge.py
│   └── models/
├── Experiment/
│   ├── ErrorAnalysis/         # e.g. error_analysis.jsonl
│   └── Script/                # Eval methods and agent runners
│       ├── llm_direct/
│       ├── rag_function_chunk/
│       ├── rag_sliding_window/
│       ├── SWE-agent_QA/
│       ├── OpenHands_QA/
│       └── Cursor-Agent_QA/
├── assets/                    # README figures
├── clone_repos.sh
├── repo_commit.txt            # URLs + commits for clone_repos.sh
├── pyproject.toml             # Dependencies (uv)
├── uv.lock
├── Dockerfile
├── LICENSE
└── README.md

After running ./clone_repos.sh, evaluated repositories are checked out under datas/repos/ (not committed to git).
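To get a first look at the released data, the per-project files in Benchmark/ can be read with the standard library alone. The sketch below is illustrative and not part of the repository: the helper name `load_benchmark` is hypothetical, and it assumes only that each non-empty line is a JSON object. It prints the field names it finds rather than assuming a schema.

```python
import json
from pathlib import Path


def load_benchmark(benchmark_dir="Benchmark"):
    """Map each project name (file stem) to its list of parsed JSONL records."""
    data = {}
    for path in sorted(Path(benchmark_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            # One JSON object per non-empty line (assumption about the files).
            data[path.stem] = [json.loads(line) for line in handle if line.strip()]
    return data


if __name__ == "__main__":
    for project, records in load_benchmark().items():
        # Show the keys of the first record instead of hard-coding field names.
        keys = sorted(records[0]) if records else []
        print(f"{project}: {len(records)} records, fields={keys}")
```

Run it from the repository root so that the relative Benchmark/ path resolves.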

🚀 Environment Setup

Prerequisites

  • Python 3.12
  • uv package manager
  • OpenAI API access (required for all evaluation methods)
  • Voyage AI API access (required for RAG-based methods)
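Before launching an evaluation, it can help to confirm that credentials are visible to the process. The following is a minimal sketch, not the repository's own configuration logic. It assumes the variable names `OPENAI_API_KEY` and `VOYAGE_API_KEY`, which are the defaults read by the official OpenAI and Voyage AI Python clients. The scripts here may expect different names or a config file, so check them before relying on this.

```python
import os

# Assumed variable names: the defaults read by the official OpenAI and
# Voyage AI Python clients. The scripts in this repository may expect others.
REQUIRED_KEYS = {
    "OPENAI_API_KEY": "all evaluation methods",
    "VOYAGE_API_KEY": "RAG-based methods",
}


def missing_keys(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    for name in missing_keys():
        print(f"{name} is not set (needed for {REQUIRED_KEYS[name]})")
```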

Installation

Install dependencies:

uv sync

To run the evaluation methods, also install the baseline extra:

uv sync --extra baseline

SWE Repository Prerequisites:

# Use the provided script to clone all repositories at specific commits
./clone_repos.sh
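Once the script has finished, one way to confirm the checkouts is to list the HEAD commit of each repository under datas/repos/ and compare the hashes by hand with repo_commit.txt. This is a hedged sketch rather than a tool shipped with the repository: the function name `checked_out_commits` is hypothetical, and it assumes each clone is a direct subdirectory of datas/repos/ containing its own .git entry.

```python
import subprocess
from pathlib import Path


def checked_out_commits(repos_dir="datas/repos"):
    """Return {directory name: HEAD commit hash} for each git checkout found."""
    commits = {}
    root = Path(repos_dir)
    if not root.is_dir():
        return commits
    for repo in sorted(p for p in root.iterdir() if p.is_dir()):
        # Skip plain directories so git does not fall back to a parent repository.
        if not (repo / ".git").exists():
            continue
        result = subprocess.run(
            ["git", "-C", str(repo), "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            commits[repo.name] = result.stdout.strip()
    return commits


if __name__ == "__main__":
    for name, commit in checked_out_commits().items():
        print(f"{name}: {commit}")
```

The comparison with repo_commit.txt is left manual because the file's exact line format is not described here.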

References

If you use SWE-QA in your work, please cite:

@article{peng2025swe,
  title={{SWE-QA}: Can Language Models Answer Repository-level Code Questions?},
  author={Peng, Weihan and Shi, Yuling and Wang, Yuhang and Zhang, Xinyun and Shen, Beijun and Gu, Xiaodong},
  journal={arXiv preprint arXiv:2509.14635},
  year={2025}
}

For a curated list of papers and resources on repository-level code generation, issue resolution, and related topics (including repo-level code QA), see Awesome Repository-Level Code Generation.