SWE-Skills-Bench

April 6, 2026

A benchmark dataset for evaluating whether injected skill documents improve agent performance on real-world software engineering tasks.

What is this dataset?

SWE-Skills-Bench contains 49 real-world software engineering tasks paired with curated skill documents. Each task tests whether providing an agent with domain-specific knowledge (a "skill") measurably improves its ability to complete the task.

The dataset is designed to answer: Does giving an agent a skill document actually help?


Task List

All 49 tasks

Skill ID | Name
--- | ---
add-uint-support | Add UInt Support
fix | React Code Fix & Linter
tdd-workflow | TDD Workflow
security-review | Security Review
springboot-tdd | Spring Boot TDD
add-admin-api-endpoint | Ghost Admin API Endpoint Creator
mcp-builder | MCP Server Builder
python-resilience | Python Resilience Patterns
xlsx | Excel & Spreadsheet Automation
turborepo | Turborepo Monorepo Build System
github-actions-templates | GitHub Actions Templates
analytics-events | Metabase Frontend Analytics Events
prometheus-configuration | Prometheus Configuration
python-anti-patterns | Python Anti-Pattern Review
implementing-jsc-classes-zig | Bun Zig-JS Class Generator
add-malli-schemas | Metabase Malli Schema Architect
clojure-write | Clojure Development & REPL Workflow
django-patterns | Django Architecture Patterns
python-background-jobs | Python Background Jobs
python-configuration | Python Configuration Management
creating-financial-models | Financial Modeling Suite
prompt-engineering-patterns | Prompt Engineering Patterns
risk-metrics-calculation | Risk Metrics Calculation
vector-index-tuning | Vector Index Tuning
rag-implementation | RAG Implementation Framework
spark-optimization | Spark Optimization
similarity-search-patterns | Similarity Search Patterns
llm-evaluation | LLM Evaluation
analyze-ci | CI Failure Analyzer
python-packaging | Python Packaging & Distribution
gitops-workflow | GitOps Workflow for Kubernetes
linkerd-patterns | Linkerd Service Mesh Patterns
changelog-automation | Changelog Automation
k8s-manifest-generator | Kubernetes Manifest Generator
nx-workspace-patterns | Nx Workspace Patterns
bazel-build-optimization | Bazel Build Optimization
istio-traffic-management | Istio Traffic Management
bash-defensive-patterns | Bash Defensive Patterns
gitlab-ci-patterns | GitLab CI Patterns
implementing-agent-modes | PostHog Agent Mode Architect
python-observability | Python Observability Patterns
distributed-tracing | Distributed Tracing & Observability
service-mesh-observability | Service Mesh Observability
slo-implementation | SLO Implementation Framework
python-performance-optimization | Python Performance Optimizer
grafana-dashboards | Grafana Dashboards
dbt-transformation-patterns | dbt Transformation Patterns
langsmith-fetch | LangSmith Fetch
v3-performance-optimization | V3 Performance Optimization

How to Use This Dataset

There are two ways to use SWE-Skills-Bench depending on your goal.


Option A: Load via HuggingFace (quick access)

Install the datasets library if you haven't already:

pip install datasets

Then load the dataset in Python:

from datasets import load_dataset

ds = load_dataset("GeniusHTX/SWE-Skills-Bench", split="train")
print(ds)
# Dataset({features: ['skill_id', 'name', 'description', 'type',
#                     'task_prompt', 'skill_document', 'test_code',
#                     'repo_url', 'repo_commit', 'docker_image'],
#          num_rows: 49})
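
The two experimental conditions differ only in whether the skill document reaches the agent. The sketch below shows one way to assemble a prompt for each condition from a single record. It assumes `task_prompt` and `skill_document` are plain strings (the field names come from the feature list above). The helper `build_prompt`, the section headers it emits, and the sample record are hypothetical and not part of the framework.

```python
from typing import Mapping


def build_prompt(record: Mapping[str, str], use_skill: bool) -> str:
    """Return the prompt for one task, optionally prefixed with its skill document.

    Hypothetical helper: the real framework may inject skills differently.
    """
    task = record["task_prompt"].strip()
    if not use_skill:
        return task
    skill = record["skill_document"].strip()
    return f"# Skill\n{skill}\n\n# Task\n{task}"


# Stand-in record with made-up content; a real one would be ds[0].
example = {
    "skill_id": "tdd-workflow",
    "task_prompt": "Add a failing test, then make it pass.",
    "skill_document": "Write the test before the implementation.",
}

with_skill = build_prompt(example, use_skill=True)
without_skill = build_prompt(example, use_skill=False)
```

Keeping the task text identical across both conditions means any difference in results can be attributed to the skill document rather than to prompt wording.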

Option B: Run the Full Evaluation Framework

The built-in evaluation framework automates the full pipeline: Docker container setup, agent execution inside the container, test evaluation, and report generation.

Prerequisites

  • Python 3.8+
  • Docker (running locally)
  • Claude Code CLI (the claude command is available inside the container image, so no local install is required)
  • An Anthropic API key

Step 1: Install and configure

# Create and activate a Python environment
conda create -n SWE-Skills-Bench python=3.10 -y
conda activate SWE-Skills-Bench
pip install -r requirements.txt

# Set credentials (copy the example file and fill in your values)
cp .env.example .env

Edit .env with your API credentials:

# Official Anthropic API key — get one at https://console.anthropic.com/
ANTHROPIC_AUTH_TOKEN=your-anthropic-api-key

# If using a third-party proxy, set the proxy URL here; leave empty for direct API access
ANTHROPIC_BASE_URL=

Step 2: Configure the model

The framework runs the claude CLI inside each Docker container.

To change the model, edit .env before running; the framework copies it into the container automatically:

ANTHROPIC_DEFAULT_SONNET_MODEL=claude-sonnet-4-6
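
Before launching a long run, it can help to confirm that .env is filled in. The following is a minimal, stdlib-only sketch of such a check. It assumes the file uses plain KEY=VALUE lines and that only `ANTHROPIC_AUTH_TOKEN` is strictly required; both are assumptions, and `parse_env` and `missing_keys` are hypothetical helpers, not part of the framework.

```python
from typing import Dict, List

# Assumption: only the auth token is strictly required; the base URL may be empty.
REQUIRED_KEYS = ("ANTHROPIC_AUTH_TOKEN",)


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Sample content mirroring the keys shown in this README.
sample = """
# Official Anthropic API key
ANTHROPIC_AUTH_TOKEN=your-anthropic-api-key
ANTHROPIC_BASE_URL=
ANTHROPIC_DEFAULT_SONNET_MODEL=claude-sonnet-4-6
"""

env = parse_env(sample)
problems = missing_keys(env)
```

To check a real file, pass the result of `pathlib.Path(".env").read_text()` to `parse_env` instead of the sample string.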

Step 3: Validate the setup

# Confirm Docker is running
docker info

# Validate benchmark configuration and list available skills
python main.py validate --config config/benchmark_config.yaml
python main.py list-skills --config config/benchmark_config.yaml

Step 4: Evaluate all 49 skills

# Run the agent on all tasks
python run_all_skills.py --use-skill      # experiment group
python run_all_skills.py --no-use-skill   # control group

# Evaluate all tasks
python run_all_skills_eval.py --use-skill --use-agent
python run_all_skills_eval.py --no-use-skill --use-agent

Useful batch flags: --dry-run (preview commands without executing), --resume (skip already-completed tasks), --only a,b (run specific tasks), --skip a,b (exclude specific tasks).
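
The interaction of these flags can be pictured as successive filters over the task list. The sketch below illustrates that idea only; it is not the framework's implementation. The function `select_tasks` is hypothetical, and the precedence it applies (`--only` first, then `--skip`, then `--resume`) is an assumption.

```python
from typing import Iterable, List, Optional, Set


def select_tasks(
    all_tasks: Iterable[str],
    only: Optional[str] = None,
    skip: Optional[str] = None,
    completed: Optional[Set[str]] = None,
) -> List[str]:
    """Filter task IDs the way --only, --skip and --resume are described.

    Hypothetical sketch; the precedence between flags is an assumption.
    """
    only_set = set(only.split(",")) if only else None
    skip_set = set(skip.split(",")) if skip else set()
    done = completed or set()
    selected = []
    for task in all_tasks:
        if only_set is not None and task not in only_set:
            continue  # --only a,b: keep just the named tasks
        if task in skip_set or task in done:
            continue  # --skip a,b and --resume: drop excluded or finished tasks
        selected.append(task)
    return selected


tasks = ["fix", "xlsx", "turborepo", "tdd-workflow"]
chosen = select_tasks(
    tasks, only="fix,xlsx,turborepo", skip="xlsx", completed={"fix"}
)
```

Here `chosen` contains only `turborepo`: `tdd-workflow` is not named in `only`, `xlsx` is skipped, and `fix` is treated as already completed.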

Step 5: Generate summary metrics

Once evaluation reports exist in reports/eval/, run the post-processing scripts:

# Unit test pass rate: skill vs. no-skill, per task and average delta
python scripts/compare_pass_rate.py --all

# Which tests failed in each group
python scripts/extract_failed_tests.py

# Token usage and wall-clock duration per task
python scripts/analyze_tokens.py

Output is written to:

Directory | Contents
--- | ---
reports/compare/ | Pass rate comparison table (skill vs. no-skill, delta)
reports/failed_test/ | Per-task failed test lists and overlap analysis
reports/token_and_duration/ | Token counts and duration by task and group
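
The headline metric is the per-task difference in unit test pass rate between the two groups, plus its average. The sketch below works through that arithmetic under stated assumptions: the input shape (passed and total counts per task) and the names `pass_rate` and `compare` are hypothetical, the numbers are made up, and the actual scripts may compute or weight results differently.

```python
from typing import Dict, Tuple

# (passed, total) unit-test counts per task; the numbers are made up for illustration.
Counts = Dict[str, Tuple[int, int]]


def pass_rate(passed: int, total: int) -> float:
    """Fraction of tests passed; 0.0 when a task has no tests."""
    return passed / total if total else 0.0


def compare(skill: Counts, no_skill: Counts) -> Dict[str, float]:
    """Per-task delta (skill minus no-skill) for tasks present in both groups."""
    return {
        task: pass_rate(*skill[task]) - pass_rate(*no_skill[task])
        for task in skill
        if task in no_skill
    }


skill_counts = {"fix": (8, 10), "xlsx": (5, 10)}
no_skill_counts = {"fix": (6, 10), "xlsx": (5, 10)}

deltas = compare(skill_counts, no_skill_counts)
average_delta = sum(deltas.values()) / len(deltas)
```

A positive delta means the skill group passed a larger share of tests on that task; averaging the deltas across tasks weights every task equally regardless of how many tests it has.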

License

MIT. See LICENSE for details.