
June 14, 2026


EvoSkill: Automated Skill Discovery for Coding Agents

Built by Sentient Labs. License: Apache 2.0.

Turn general-purpose AI agents into state-of-the-art specialists using a benchmark and EvoSkill, a toolkit for automatically creating and improving AI skills. It is compatible with Claude Code, Codex CLI, OpenCode, OpenHands, Goose, Harbor, and more.

EvoSkill extends the feedback-driven idea of GEPA from single-file optimization to complete agent evolution. Where GEPA revises one prompt in place, EvoSkill proposes multiple skill and prompt mutations jointly, evaluates the new variants on held-out data, and produces an entirely new agent program at each iteration.

[Figure: EvoSkill architecture]

Install in seconds, then run evoskill init and evoskill run to supercharge any coding agent with AI-created skills and prompts automatically. Depending on the agent, you are free to use any model provider of your choice (OpenRouter, Anthropic, OpenAI, Fireworks, and more) and any model you want (Claude, GLM, Minimax, Kimi, GPT, Gemini, Qwen, and others).

Also join us on Discord to discuss your experience, share suggestions, or show off your work!

🤖 Supported agents

| Agent | Support | Notes |
| --- | --- | --- |
| Claude Code | ✅ | |
| OpenCode | ✅ | CLI v1.4.0+ required (structured output support) |
| OpenHands | ✅ | No native structured output; uses fallback JSON extraction |
| Goose | ✅ | CLI v1.25.0+ required (skill discovery via summon extension) |
| Codex CLI | ✅ | Skill discovery via .agents/skills/ symlink |
| Harbor | ✅ | Containerized task benchmarks with built-in verifiers |

🎨 Features

| Capability | Status | Explanation |
| --- | --- | --- |
| Evolution with a benchmark | ✅ | Skills can be improved against your own or academic benchmarks. |
| Cross-agent transferability | ✅ | Skills are packaged as reusable folders with instructions, metadata, and helper scripts, compatible with many coding agents. |
| Cross-model transferability | ✅ | As demonstrated in EvoSkill, skills evolved with a fixed LLM can transfer their performance gains to other LLMs. |
| Cross-task transferability | ✅ | Generated skills can be generic enough to transfer across tasks; for instance, a SealQA skill improves BrowseComp performance (as shown in EvoSkill). |
| Evolution without a benchmark | 🛠️ | An open research direction where benchmarks are generated on the fly (e.g., Hermes-Agent self-evolution). |
| Continuous evolution | 🛠️ | Integrating the ability to improve skills from regular usage. |


Installation

One command (recommended):

# Clone + install everything (Python deps, uv, optional agent CLIs)
curl -fsSL https://raw.githubusercontent.com/sentient-agi/EvoSkill/main/install.sh | bash

# Or, if you already cloned the repo:
./install.sh

# Install Python deps + all agent harness CLIs (macOS/Homebrew)
./install.sh --all-agents

# Install Python deps + specific agent CLIs
./install.sh --agents claude,opencode

The installer handles Python 3.12+, uv, and uv sync automatically. Harbor is included in the Python package. Agent CLIs are optional; install only the harness you plan to use.

Manual install:

Requirements:

  • Python 3.12+
  • uv (recommended) or pip

# Using uv (recommended)
uv sync

# Or using pip
pip install -e .

Agent CLI (install whichever harness you plan to use):

brew install --cask claude-code    # Claude Code
brew install opencode              # OpenCode (v1.4.0+)
brew install --cask codex          # Codex CLI
brew install block-goose-cli       # Goose (v1.25.0+)

Harbor is included in the Python install above (uv sync / pip install -e .). Run this only if you need the Harbor CLI standalone:

pip install harbor                 # Harbor (containerized benchmarks)

Common auth setup:

# Anthropic (Claude Code harness)
export ANTHROPIC_API_KEY=your-key-here

# OpenAI (Codex harness)
export OPENAI_API_KEY=your-key-here

# OpenRouter (OpenCode / Goose / OpenHands harnesses)
export OPENROUTER_API_KEY=your-key-here

# Fireworks AI (OpenCode / OpenHands harnesses, LLM scorer)
export FIREWORKS_API_KEY=your-key-here

OpenRouter-backed evolution runs also accept LLM_API_KEY, but OPENROUTER_API_KEY is the preferred env var.
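
The preference order described above can be sketched in a few lines of Python. This is only an illustration of "try OPENROUTER_API_KEY first, then LLM_API_KEY"; the `resolve_api_key` helper and the `OPENROUTER_CANDIDATES` list are hypothetical names, not part of EvoSkill.

```python
import os

# Preference order described above: OPENROUTER_API_KEY first, LLM_API_KEY as fallback.
OPENROUTER_CANDIDATES = ["OPENROUTER_API_KEY", "LLM_API_KEY"]


def resolve_api_key(candidates, env=None):
    """Return (name, value) for the first candidate that is set and non-empty, else None."""
    env = os.environ if env is None else env
    for name in candidates:
        value = env.get(name)
        if value:
            return name, value
    return None
```

Calling `resolve_api_key(OPENROUTER_CANDIDATES)` before a long run is a cheap way to confirm that a key is actually exported in the current shell.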


Quickstart

1. Initialize a project

Run evoskill init inside any git repository:

CSV dataset (question/answer pairs):

$ evoskill init

  EvoSkill - Project Setup

  Which agent runtime? › claude
  Dataset source? › CSV
  Absolute path to dataset CSV? › /path/to/questions.csv
  Question/input column name? › question
  Answer column name? › answer
  Category column name? ›
  Additional data directories? ›
  How do you want to run EvoSkill? › Local

Harbor dataset (containerized benchmark tasks):

$ evoskill init

  EvoSkill - Project Setup

  Which agent runtime? › claude
  Dataset source? › Harbor
  Choose a Harbor dataset: › swe-bench/swe-bench-verified
  Where to store this dataset? › .evoskill/harbor/datasets/swe-bench-verified
  How do you want to run EvoSkill? › Local

This creates .evoskill/config.toml and .evoskill/task.md.

  • Dataset source: CSV (static question/answer pairs) or Harbor (containerized tasks with built-in verifiers).
  • Data dirs: (CSV only) absolute paths to directories the agent needs. Comma-separated if multiple.
  • Execution mode: Local (direct), Docker (containerized, supports remote via DOCKER_HOST), or Daytona (managed cloud sandbox).

2. Describe your task

Edit .evoskill/task.md to describe what the agent should do:

# Task

Answer questions about quarterly financial reports.
Return only the numeric answer with units.

## Examples
- "What was revenue in Q3?" → "$4.2B"

---

# Constraints
- Always include units in the answer
- Do not explain your reasoning, just return the answer

3. Run the loop

evoskill run

EvoSkill uses the execution mode you chose during evoskill init (Local, Docker, or Daytona). You can override it with the --docker or --remote flag.

EvoSkill prints a live progress table:

  Iter  Accuracy  Δ          Skills  Frontier  Status
  1     42.0%     -          0       [1]       baseline
  2     51.3%     +9.3%      1       [1, 2]    ★ new best
  3     49.7%     -1.6%      1       [1, 2]    discarded
  ...

4. Evaluate and inspect

evoskill eval          # score the best program on the validation set
evoskill skills        # list all discovered skills
evoskill diff          # see what changed vs baseline
evoskill logs          # view past run history

5. Use the best program

After the loop finishes, the best program lives on a git branch:

git branch | grep program/     # list all program branches
git checkout program/iter-skill-3   # switch to the best one

From there you can inspect what the loop discovered:

cat .claude/program.yaml       # system prompt, tools, score
ls .claude/skills/             # all learned skills

Copy .claude/program.yaml and .claude/skills/ into your deployment to use the evolved agent configuration.

Harbor Integration

Harbor is a framework for evaluating AI agents against containerized benchmark tasks. EvoSkill integrates with Harbor as an alternative to CSV-based datasets, using Harbor's built-in verifiers as the scoring mechanism.

How it works

Instead of running agents against static CSV questions, Harbor mode:

  1. Loads tasks from a downloaded Harbor dataset (each task has its own Dockerfile, test harness, and verifier)
  2. Runs harbor run for each task, spawning a sandboxed container where the coding agent solves the task
  3. Reads the verifier reward from the container output (0.0 to 1.0)
  4. Feeds results back into EvoSkill's self-improvement loop to evolve better skills
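
As a rough illustration of step 3 feeding step 4, per-task verifier rewards can be collapsed into one score for a program. How EvoSkill actually aggregates rewards is not stated in this document; the plain mean below, and the `mean_reward` name, are assumptions.

```python
def mean_reward(rewards):
    """Average verifier rewards, each expected in [0.0, 1.0]; an empty batch scores 0.0."""
    if not rewards:
        return 0.0
    for reward in rewards:
        # Harbor verifiers report a reward between 0.0 and 1.0 (see step 3 above).
        if not 0.0 <= reward <= 1.0:
            raise ValueError(f"reward out of range: {reward}")
    return sum(rewards) / len(rewards)
```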

Setup

pip install harbor    # install the Harbor CLI

Run evoskill init and select Harbor as the dataset source. Init will show available datasets from the Harbor Hub and auto-download your selection.

Configuration

When Harbor is selected during init, the following config is auto-generated:

[dataset]
source = "harbor"
harbor_tasks_root = ".evoskill/harbor/datasets/swe-bench-verified"
train_ratio = 0.18
val_ratio = 0.12

[harbor]
enabled = true
inner_agent = "claude-code"      # auto-derived from harness.name
inner_model = "anthropic/claude-sonnet-4-6"  # auto-derived from harness.model
env = "docker"                   # "docker" for local, "daytona" for remote
n_concurrent = 1
timeout_multiplier = 1.0
container_skills_path = "/skills"

[scorer]
type = "harbor"

The inner_agent and inner_model are automatically derived from your harness selection. The env is derived from your execution mode (docker for local/Docker, daytona for Daytona).

Filtering tasks

You can filter which tasks are included using glob patterns:

[dataset]
harbor_include = ["swe-bench/*"]     # only include matching tasks
harbor_exclude = ["swe-bench/hard*"] # exclude matching tasks
harbor_difficulty = ["easy", "medium"]  # filter by difficulty metadata
harbor_limit = 50                    # max number of tasks
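
To make the interaction of the four filters concrete, here is a minimal sketch in Python. The order of application (include, then exclude, then difficulty, then limit) and the task dictionary shape with `name` and `difficulty` keys are assumptions for illustration; EvoSkill's own implementation may differ.

```python
from fnmatch import fnmatchcase


def filter_tasks(tasks, include=None, exclude=None, difficulty=None, limit=None):
    """Apply include/exclude globs, a difficulty filter, and a cap, in that order.

    `tasks` is a list of dicts with "name" and "difficulty" keys (hypothetical shape).
    """
    selected = []
    for task in tasks:
        name = task["name"]
        if include and not any(fnmatchcase(name, pattern) for pattern in include):
            continue  # not matched by any include pattern
        if exclude and any(fnmatchcase(name, pattern) for pattern in exclude):
            continue  # matched by an exclude pattern
        if difficulty and task.get("difficulty") not in difficulty:
            continue  # difficulty metadata not in the allowed set
        selected.append(task)
    return selected if limit is None else selected[:limit]
```

With the settings shown above, a task named swe-bench/hard-1 passes the include pattern but is dropped by the exclude pattern.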

Execution modes

| Mode | How Harbor runs tasks | Notes |
| --- | --- | --- |
| Local | harbor run -e docker | Requires Docker installed locally |
| Docker | harbor run -e docker | Harbor tasks dir mounted as volume |
| Daytona | harbor run -e daytona | Harbor uses Daytona API to create task sandboxes. DAYTONA_API_KEY is forwarded automatically. |
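
The mapping from execution mode to Harbor environment can be written as a small lookup. This sketch only reproduces the table; the `harbor_run_prefix` helper is a hypothetical name, and any `harbor run` arguments beyond `-e` are deliberately left out because they are not described here.

```python
# Mapping copied from the execution modes table above.
HARBOR_ENV_BY_MODE = {"local": "docker", "docker": "docker", "daytona": "daytona"}


def harbor_run_prefix(mode):
    """Return the start of the `harbor run` command for an execution mode."""
    try:
        env = HARBOR_ENV_BY_MODE[mode.lower()]
    except KeyError:
        raise ValueError(f"unknown execution mode: {mode!r}") from None
    # Only the -e flag appears in this document; other arguments are omitted.
    return ["harbor", "run", "-e", env]
```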

CLI Reference

| Command | Description |
| --- | --- |
| evoskill init | Initialize a new project (creates .evoskill/) |
| evoskill run | Run the self-improvement loop |
| evoskill run --docker | Run in a Docker container |
| evoskill run --remote | Run on a Daytona sandbox |
| evoskill eval | Evaluate the best program on the validation set |
| evoskill skills | List all skills discovered so far |
| evoskill diff | Diff baseline vs best, or between two iterations |
| evoskill logs | Show recent run history |
| evoskill reset | Delete all program branches and start fresh |
| evoskill remote status | Check progress of a remote run |
| evoskill remote logs | View logs from a remote run |
| evoskill remote download | Pull results from a completed remote run |
| evoskill remote stop | Stop and clean up a remote run |

evoskill run

evoskill run [--continue] [--verbose] [--quiet] [--config PATH] [--docker] [--remote] [--rebuild]

| Flag | Description |
| --- | --- |
| --continue | Resume from the existing frontier instead of starting fresh |
| --verbose | Show per-sample pass/fail results |
| --quiet | Show the progress table only, suppress proposer output |
| --config PATH | Load a specific config TOML file instead of .evoskill/config.toml |
| --docker | Run inside a Docker container (builds image from Dockerfile if needed) |
| --remote | Run on a Daytona sandbox (requires [remote] config) |
| --rebuild | Force rebuild the Docker image before running |

evoskill eval also accepts --config PATH.

evoskill diff

evoskill diff              # baseline → current best
evoskill diff 3 7          # iteration 3 vs iteration 7

The diff is scoped to the .claude/ directory; it shows changes to skills and the system prompt, not your source code.

evoskill logs

evoskill logs              # last 5 runs (default)
evoskill logs --last 10    # last 10 runs

evoskill reset

evoskill reset             # prompts for confirmation

Deletes all program/* branches, frontier/* tags, the loop checkpoint, and feedback history. Your source code, config.toml, task.md, and any skills in .claude/skills/ are left untouched.

Configuration Reference

evoskill init creates .evoskill/config.toml. All fields are optional; defaults are shown below. Relative dataset and data directory paths are resolved from the project root, meaning the directory containing .evoskill.

[harness]
name = "claude"        # "claude", "opencode", "codex", "goose", or "openhands"
model = "sonnet"       # Claude alias, Codex model name, or provider/model for OpenCode/Goose/OpenHands
data_dirs = ["/absolute/path/to/data_dir"]  # extra directories the agent can read

[evolution]
mode = "skill_only"          # "skill_only" or "prompt_only"
iterations = 20
frontier_size = 3
concurrency = 4
no_improvement_limit = 5

[dataset]
path = "data/questions.csv"  # relative to project root, or an absolute path
question_column = "question"
ground_truth_column = "ground_truth"
category_column = ""         # optional, for stratified sampling
train_ratio = 0.18
val_ratio = 0.12

[scorer]
type = "multi_tolerance"     # see scorer types below

Alternate configs can live next to the default config:

.evoskill/config.toml
.evoskill/config.openrouter.toml

Run with an explicit config:

evoskill eval --config .evoskill/config.openrouter.toml
evoskill run --config .evoskill/config.openrouter.toml

Common evolution model setups:

Anthropic:

[harness]
name = "claude"
model = "claude-sonnet-4-6"

OpenAI:

[harness]
name = "codex"
model = "gpt-5"

OpenRouter:

[harness]
name = "opencode"
model = "openrouter/openai/gpt-5-mini"

Fireworks AI:

[harness]
name = "openhands"   # or "opencode"
model = "fireworks_ai/accounts/fireworks/models/llama-v3p1-70b-instruct"

Notes:

  • claude is Anthropic-only.
  • codex uses bare OpenAI model names such as gpt-5, o3, or gpt-5.1-codex-mini.
  • opencode, goose, and openhands are multi-provider harnesses and can also use Claude and OpenAI models.
  • opencode, goose, and openhands accept provider/model strings such as anthropic/claude-sonnet-4-6, openai/gpt-5, or openrouter/openai/gpt-5-mini.
  • Fireworks AI uses FIREWORKS_API_KEY. OpenHands (litellm) expects the fireworks_ai/ prefix while OpenCode (models.dev) expects fireworks-ai/. Goose has no built-in Fireworks provider; use a manual OpenAI-compatible configuration.

Scorer types

| Type | Description |
| --- | --- |
| multi_tolerance | Flexible string matching: exact, numeric tolerance, list overlap (default) |
| exact | Case-insensitive exact string match |
| llm | LLM-as-judge grading with a custom rubric |
| script | Shell script scorer; receives {predicted} and {expected} as variables |
| harbor | Reads reward from Harbor's built-in verifier (auto-set when dataset source is Harbor) |
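
To give a feel for what "exact, numeric tolerance, list overlap" could mean in practice, here is a simplified sketch. It is not the multi_tolerance implementation: the 1% relative tolerance, the comma-separated list format, and the Jaccard overlap measure are all assumptions chosen for illustration.

```python
def _as_number(text):
    """Return text parsed as a float, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def tolerant_score(predicted, expected, rel_tol=0.01):
    """Score in [0.0, 1.0]: exact match, then numeric tolerance, then list overlap."""
    p, e = predicted.strip().lower(), expected.strip().lower()
    if p == e:
        return 1.0
    p_num, e_num = _as_number(p), _as_number(e)
    if p_num is not None and e_num is not None:
        # Relative tolerance around the expected value (assumed to be 1%).
        return 1.0 if abs(p_num - e_num) <= rel_tol * abs(e_num) else 0.0
    if "," in e:
        # Treat both answers as comma-separated sets and use Jaccard overlap.
        p_items = {item.strip() for item in p.split(",") if item.strip()}
        e_items = {item.strip() for item in e.split(",") if item.strip()}
        if not e_items:
            return 0.0
        return len(p_items & e_items) / len(p_items | e_items)
    return 0.0
```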

LLM scorer options:

[scorer]
type = "llm"
rubric = "Award 1.0 if the answer is numerically correct within 5%, 0.0 otherwise."
model = "claude-sonnet-4-6"   # defaults to claude-sonnet-4-6
provider = "anthropic"        # "anthropic", "openai", "google", "openrouter", or "fireworks"

For OpenRouter-backed scoring, set provider = "openrouter" and use an OpenRouter model ID such as openai/gpt-5-mini or google/gemini-2.5-flash. Authentication uses OPENROUTER_API_KEY and falls back to LLM_API_KEY if needed.

For Fireworks-backed scoring, set provider = "fireworks" and use a Fireworks model ID such as accounts/fireworks/models/llama-v3p1-70b-instruct. Authentication uses FIREWORKS_API_KEY.

Script scorer options:

[scorer]
type = "script"
command = "python score.py --predicted {predicted} --expected {expected}"

Remote Execution

EvoSkill runs can take hours. Use Docker or Daytona to run on remote hardware and free up your machine.

Docker (BYOC)

Build the image from the included Dockerfile:

docker build -t evoskill .
evoskill run --docker

To run on a remote server, point Docker to it:

export DOCKER_HOST=ssh://user@your-server
evoskill run --docker

Monitor and stop:

docker compose -f .evoskill/docker-compose.yml logs -f
docker compose -f .evoskill/docker-compose.yml down

Daytona (Managed)

Install the Daytona SDK and set your API key:

pip install daytona
export DAYTONA_API_KEY=your-daytona-key

Build and push your image (Daytona runs x86 sandboxes, so cross-compile if you're on Apple Silicon):

# On Apple Silicon (ARM): cross-compile for x86
docker buildx build --platform linux/amd64 -t your-registry/evoskill:latest --push .

# On x86 Linux: standard build
docker build -t evoskill .
docker tag evoskill your-registry/evoskill:latest
docker push your-registry/evoskill:latest

Set in .evoskill/config.toml:

[remote]
target = "daytona"

[remote.daytona]
image = "your-registry/evoskill:latest"
cpu = 4          # max 4 vCPUs per sandbox
memory = 8       # max 8 GB per sandbox
disk = 10        # max 10 GB per sandbox
timeout = 0      # 0 = no auto-stop, or minutes until auto-stop

The DAYTONA_API_KEY can also be set as api_key under [remote.daytona], but the env var is preferred to avoid committing secrets.
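
The per-sandbox maximums noted in the comments above are easy to check before launching. A minimal sketch, assuming the limits are exactly those in the example (4 vCPUs, 8 GB memory, 10 GB disk); `check_daytona_resources` is a hypothetical helper.

```python
# Limits taken from the comments in the [remote.daytona] example above.
DAYTONA_LIMITS = {"cpu": 4, "memory": 8, "disk": 10}


def check_daytona_resources(settings):
    """Return a list of human-readable problems; an empty list means the values fit."""
    problems = []
    for key, maximum in DAYTONA_LIMITS.items():
        value = settings.get(key)
        if value is None:
            continue  # unset fields are skipped
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{key} must be a positive integer, got {value!r}")
        elif value > maximum:
            problems.append(f"{key}={value} exceeds the per-sandbox maximum of {maximum}")
    return problems
```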

Then:

evoskill run --remote           # launch
evoskill remote status          # check progress
evoskill remote logs -f         # stream live output
evoskill remote logs            # view last output
evoskill remote download        # pull results when done
evoskill remote stop            # cancel and clean up

How It Works


The self-improvement loop follows five stages:

  1. Base Agent: Attempts benchmark questions using the current best program (system prompt + skills).
  2. Proposer: Analyzes failure cases and proposes targeted skill or prompt changes to address them.
  3. Generator: Creates the proposed changes by writing new skill files or rewriting the system prompt.
  4. Evaluator: Scores the new program variant on a held-out validation set to measure improvement.
  5. Frontier: Tracks the top-N performing programs as git branches; the best survive to the next iteration.

This cycle repeats for a configurable number of iterations, automatically converging on stronger agent configurations.
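
The shape of this loop can be sketched as a toy in Python. It is a simplification under stated assumptions: `evaluate` and `propose` are caller-supplied stand-ins for the Evaluator and the Proposer/Generator pair, the parent is always the current best program, and the early stop after `no_improvement_limit` iterations without a new best is a guess at what that setting means.

```python
def evolve(evaluate, propose, base_program, iterations=20, frontier_size=3,
           no_improvement_limit=5):
    """Toy evolution loop that keeps the top-N programs by validation score."""
    frontier = [(evaluate(base_program), base_program)]
    stale = 0
    for _ in range(iterations):
        best_score, parent = max(frontier, key=lambda item: item[0])
        child = propose(parent)          # Proposer + Generator
        child_score = evaluate(child)    # Evaluator
        frontier.append((child_score, child))
        # Frontier: sort by score only and drop everything past the top N.
        frontier.sort(key=lambda item: item[0], reverse=True)
        del frontier[frontier_size:]
        if child_score > best_score:
            stale = 0
        else:
            stale += 1
            if stale >= no_improvement_limit:
                break
    return frontier
```

For example, `evolve(lambda p: min(p, 3) / 10, lambda p: p + 1, 0)` treats integers as programs and stops once the score plateaus.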

Git Branches

EvoSkill uses your repo's git history to version every program it creates. During a run it automatically creates and switches between branches; you don't need to do anything. After a run your branch layout will look like:

main                      ← your code, untouched
program/base              ← initial baseline agent
program/iter-skill-1      ← after iteration 1
program/iter-skill-2      ← after iteration 2
...

Frontier members are marked with frontier/* tags. EvoSkill only ever writes to branches prefixed program/, so there is no risk of it touching your working branch.
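
If you want to list the program branches from a script, the plain output of `git branch` is enough. A small sketch; `parse_program_branches` is a hypothetical helper that relies only on the program/ prefix described above.

```python
def parse_program_branches(git_branch_output):
    """Extract program/* branch names from plain `git branch` output."""
    names = []
    for line in git_branch_output.splitlines():
        # `git branch` marks the current branch with "* " and indents the others.
        name = line.lstrip("*+ ").strip()
        if name.startswith("program/"):
            names.append(name)
    return names
```

The input can come from `subprocess.run(["git", "branch"], capture_output=True, text=True).stdout`.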

When the Loop Gets Stuck

If accuracy stops improving, try the following:

  1. Check the feedback log: .claude/feedback_history.md records what the proposer tried each iteration and why it succeeded or failed.
  2. Resume instead of restarting: evoskill run --continue picks up from the last frontier rather than discarding progress.
  3. Reset and start fresh: evoskill reset clears all branches and lets you start over with a revised task.md.

Python API

For programmatic usage, EvoSkill exposes a high-level Python API.

EvoSkill

import asyncio

from src.api import EvoSkill

evo = EvoSkill(
    task="sealqa",
    model="sonnet",
    mode="skill_only",
    max_iterations=20,
    frontier_size=3,
    concurrency=4,
    train_ratio=0.18,
    val_ratio=0.12,
    continue_mode=False,
)


# run() must be awaited, so call it from inside a coroutine
async def main():
    return await evo.run()


result = asyncio.run(main())

# Synchronous usage
result = EvoSkill(task="base").run_sync()

EvalRunner

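import asyncio

from src.api import EvalRunner


# run() must be awaited, so call it from inside a coroutine
async def main():
    return await EvalRunner(
        task="sealqa",
        model="sonnet",
        max_concurrent=8,
    ).run()


summary = asyncio.run(main())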

Citation

If you use EvoSkill in your research, please cite the original paper:

@misc{alzubi2026evoskillautomatedskilldiscovery,
      title={EvoSkill: Automated Skill Discovery for Multi-Agent Systems}, 
      author={Salaheddin Alzubi and Noah Provenzano and Jaydon Bingham and Weiyuan Chen and Tu Vu},
      year={2026},
      eprint={2603.02766},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.02766}, 
}

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.