AweAgent

June 5, 2026

Make Agent Research Systematic.

A unified, composable framework to build, evaluate, and train agents.

Python 3.11+ · License: Apache-2.0

Agent research is fragmented across domains and stages: each task type tends to come with its own agent stack — code, search, terminal — and moving to RL usually requires rebuilding the rollout pipeline. Nothing carries over. AweAgent brings these pieces into one composable framework for building, evaluating, and training agents.

AweAgent's core capabilities:

  • Unified across task types — search, code, and terminal agents run on the same execution core, with task-specific behavior composed through reusable interfaces instead of separate stacks.
  • Composable agent harnesses — an agent is split into a step(ctx) -> action policy, the loop that runs it, and a context bus that carries state and dependencies; build new agents by recomposing these parts instead of forking the engine.
  • Protocol-centered extensibility — LLM backends, tools, runtime sandboxes, agent scaffolds, tool backends, and evaluators are exposed through small protocols and entry-point registries; register a new component instead of patching the core engine.
  • Evaluation & trajectories as first-class data — every run emits a structured result plus the full trajectory; code tasks can be evaluated in isolated Docker runtimes, and the experimental training path can collect token-level rollout data (loss mask · logprobs · weight versions).

📰 News

  • [2026-06-04] 🎉 Added DeepSearch & IterResearch scaffolds + BrowseComp support.
  • [2026-05-10] 🎉 Added NL2Repo and SWE-bench Pro task support.
  • [2026-03-16] 🎉 Added unified LLM backends (openai/azure/response/ark/anthropic/sglang) with multi-provider reasoning support (docs).
  • [2026-03-15] 🎉 Added Terminus-2 scaffold with Terminal-Bench 2.0 support.
  • [2026-03-01] 🎉 Initial release: SearchSWE scaffold with BeyondSWE & ScaleSWE support.

🧩 Scaffolds

Reference agents shipped in-tree, all on the shared core.

| Scaffold | Type | Highlight | Resources |
| --- | --- | --- | --- |
| OpenHands-style | coding | CodeAct-XML coding agent, behavior-compatible with OpenHands (search off) | code |
| SearchSWE | coding | SWE coding agent with web search & fetch; fixes repo issues, pulls in external docs | code |
| DeepSearch | deep search | Base web-research QA agent; retry-until-answerable loop policy | code |
| IterResearch | deep search | Deep search + interaction scaling for long, multi-step research | code |
| Terminus-2 | terminal | tmux terminal agent driven by raw JSON keystrokes, on the standard loop | code |

OpenHands-style and SearchSWE are the same scaffold (search_swe), one enable_search flag apart — listed separately because they behave differently.

📋 Benchmarks

| Benchmark | Description | Scaffold | Evaluation | Resources |
| --- | --- | --- | --- | --- |
| BeyondSWE | Doc2Repo · CrossRepo · DepMigrate · DomainFix | SearchSWE / OpenHands | isolated Docker patch test | data · guide |
| ScaleSWE | large-scale SWE-bench-style data | SearchSWE / OpenHands | isolated Docker patch test | data · guide |
| SWE-bench-Pro | extended SWE-bench code tasks | SearchSWE / OpenHands | isolated Docker patch test | guide |
| NL2Repo | build a repo from a natural-language spec | SearchSWE / OpenHands | isolated Docker (artifact + golden tests) | guide |
| Terminal-Bench 2.0 | terminal tasks in containers | Terminus-2 | same-container reward | repo · guide |
| BrowseComp | web-search QA | DeepSearch / IterResearch | LLM-as-Judge | guide |

🗺️ Roadmap

Long-term goal: practical, general-purpose agents optimized with reinforcement learning. Shipped so far — the four scaffolds and six benchmarks above. Next:

  • Multi-agent — multi-agent collaboration and orchestration on the shared core
  • RL training — reinforcement-learning rollouts via Slime with an SGLang rollout engine (experimental today)

🚀 Installation

Requires Python 3.11+ and Docker (for sandboxed execution and isolated evaluation).

curl -LsSf https://astral.sh/uv/install.sh | sh

git clone https://github.com/AweAI-Team/AweAgent.git && cd AweAgent
uv venv --python 3.11 && source .venv/bin/activate
uv pip install -e .

Or, using pip instead of uv:

git clone https://github.com/AweAI-Team/AweAgent.git && cd AweAgent
python -m venv .venv && source .venv/bin/activate
pip install -e .

A single pip install -e . is enough to run every scaffold and benchmark, and every LLM backend except Volcengine Ark, out of the box. Optional extras: ".[ark]" (Volcengine Ark backend) · ".[dev]" (pytest · ruff · mypy).

Why editable (-e)? You're installing from source and will likely tweak agents, tools, or configs; -e makes changes take effect without reinstalling. To verify that everything is registered, run awe-agent info.

▶️ Running a Benchmark

Download data

Datasets download through one script, run from the repo root. Data lands under datasets/<task>/ — the path each task config defaults to — so afterward you can run with no extra env vars.

bash datasets/download.sh beyond_swe              # one task
bash datasets/download.sh all                     # everything wired
FORCE=true bash datasets/download.sh beyond_swe   # re-download

Wired today: BeyondSWE · BrowseComp · Terminal-Bench 2.0. See datasets/ for HF token / mirror options and per-task notes; other datasets are covered in each benchmark's guide.

Run

# point at your LLM
export OPENAI_API_KEY="sk-..."

# sanity-check what's registered (backends, runtimes, agents, tools)
awe-agent info

# list instances — no Docker needed
python recipes/beyond_swe/run.py --data-file datasets/beyond_swe/beyond_swe.jsonl --mode dry-run

# batch run
python recipes/beyond_swe/run.py --data-file datasets/beyond_swe/beyond_swe.jsonl --mode batch

See each benchmark's guide for full setup, CLI arguments, and output format.

🏗️ Architecture

[Figure: AweAgent architecture]

Four layers driven by a shared core — the figure maps 1:1 onto the modules below.

Module descriptions

  • TaskRunner (core/task) — the batch engine: loads a Task, provisions its runtime, drives each instance through the loop, routes the result to an evaluator, and writes structured output (concurrency · retries · per-instance isolation).
  • AgentContext (core/agent) — the shared bus: all rollout state (messages, trajectory, stats) plus every injected dependency (LLM, tools, tool-call format, runtime) and an optional training field. The single seam between the agent and the outside world.
  • AgentLoop (core/agent) — the rollout engine: runs the step loop, branches only on the kind of action (finish · message · tool call), dispatches tools by name, and records the trajectory + RL tokens — agnostic to whether it's driving a search, code, or terminal agent.
  • Agent scaffold (scaffold/) — the policy: a near-stateless step(ctx) → action. Built-ins: SearchSWE · DeepSearch · Terminus-2 · IterResearch.
  • Interaction layer (core/llm · core/tool · core/runtime) — the pluggable dependencies the loop injects: LLM backends, tools, tool-call formats, and runtime sandboxes.
  • Evaluation & Data (core/eval · tasks/ · integrations/) — turns a finished run into a score (isolated Docker patch test · LLM-as-judge · in-container reward) or token-level RL rollout data (Slime bridge).
  • Config / Registry (core/config · plugins/) — layered YAML config + entry-point registries that wire every part by name.

⚙️ Configuration

Configs are YAML files with environment-variable substitution (${VAR}, ${VAR:-default}) and !include support.

LLM Backends

| Backend | Config File | Required Env Vars |
| --- | --- | --- |
| OpenAI | configs/llm/openai.yaml | OPENAI_API_KEY |
| OpenAI (Responses) | configs/llm/openai_response.yaml | OPENAI_API_KEY |
| Azure OpenAI | configs/llm/azure.yaml | AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT |
| Anthropic | configs/llm/anthropic.yaml | ANTHROPIC_API_KEY |
| Volcengine Ark | configs/llm/ark.yaml | ARK_API_KEY, ARK_MODEL_ID |
| SGLang | configs/llm/sglang.yaml | (self-hosted endpoint) |

Environment Variables

Copy .env.example to .env and fill in your values:

cp .env.example .env

Sections: LLM Backend (pick one: API key + endpoint), Task Data (DATA_FILE), and Search Tools (optional: SERPAPI_API_KEY and JINA_API_KEY, needed only for search mode). See each benchmark's guide (linked in the Benchmarks table) for the full list.

🤝 Contributing

Issues and PRs are welcome. AweAgent is built to be extended — adding an LLM backend, tool, runtime, agent scaffold, or evaluator means implementing a small protocol and registering one entry point, with no changes to the core engine. For development tooling, install with pip install -e ".[dev]" (pytest · ruff · mypy).

📜 Citation

If AweAgent is useful in your work, please consider citing it and giving the repo a ⭐.

@misc{aweagent2026,
  title        = {AweAgent: A Unified, Composable Framework to Build, Evaluate, and Train Agents},
  author       = {AweAI Team},
  year         = {2026},
  howpublished = {\url{https://github.com/AweAI-Team/AweAgent}}
}

📄 License

Released under the Apache-2.0 License.

📨 Contact

Questions or feedback? Open an issue or email gx.chen.chn@gmail.com.