AweAgent
June 5, 2026
Make Agent Research Systematic.
A unified, composable framework to build, evaluate, and train agents.
Agent research is fragmented across domains and stages: each task type tends to come with its own agent stack — code, search, terminal — and moving to RL usually requires rebuilding the rollout pipeline. Nothing carries over. AweAgent brings these pieces into one composable framework for building, evaluating, and training agents.
AweAgent's core capabilities:
- Unified across task types: search, code, and terminal agents run on the same execution core, with task-specific behavior composed through reusable interfaces instead of separate stacks.
- Composable agent harnesses: an agent is split into a step(ctx) -> action policy, the loop that runs it, and a context bus that carries state and dependencies; build new agents by recomposing these parts instead of forking the engine.
- Protocol-centered extensibility: LLM backends, tools, runtime sandboxes, agent scaffolds, tool backends, and evaluators are exposed through small protocols and entry-point registries; register a new component instead of patching the core engine.
- Evaluation & trajectories as first-class data: every run emits a structured result plus the full trajectory; code tasks can be evaluated in isolated Docker runtimes, and the experimental training path can collect token-level rollout data (loss mask · logprobs · weight versions).
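To make the token-level rollout data in the last bullet more concrete, here is a minimal sketch of how a loss mask, logprobs, and a weight version might sit alongside the tokens of one rollout. The `RolloutRecord` class and all of its field names are hypothetical illustrations, not AweAgent's actual schema.

```python
from dataclasses import dataclass


@dataclass
class RolloutRecord:
    """Hypothetical container; field names are illustrative, not AweAgent's schema."""

    token_ids: list[int]     # every token in the rollout, prompt and tool output included
    loss_mask: list[int]     # 1 = sampled by the policy (trainable), 0 = everything else
    logprobs: list[float]    # sampling-time logprob per token (ignored where mask is 0)
    weight_version: int      # which policy checkpoint produced the sampled tokens

    def __post_init__(self) -> None:
        # The three per-token lists must stay aligned index by index.
        if not (len(self.token_ids) == len(self.loss_mask) == len(self.logprobs)):
            raise ValueError("token_ids, loss_mask and logprobs must be aligned")

    def trainable_logprob_sum(self) -> float:
        """Sum logprobs over policy-generated tokens only."""
        return sum(lp for lp, m in zip(self.logprobs, self.loss_mask) if m)


record = RolloutRecord(
    token_ids=[11, 12, 13, 14, 15],
    loss_mask=[0, 0, 1, 1, 0],
    logprobs=[0.0, 0.0, -0.5, -0.25, 0.0],
    weight_version=3,
)
print(record.trainable_logprob_sum())  # -0.75
```

The mask is what lets a trainer skip prompt and tool-output tokens, while the weight version records which checkpoint the sampled tokens came from.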
:newspaper: News
- [2026-06-04] 🎉 Added DeepSearch & IterResearch scaffolds + BrowseComp support.
- [2026-05-10] 🎉 Added NL2Repo and SWE-bench Pro task support.
- [2026-03-16] 🎉 Added unified LLM backends (openai/azure/response/ark/anthropic/sglang) with multi-provider reasoning support (docs).
- [2026-03-15] 🎉 Added Terminus-2 scaffold with Terminal-Bench 2.0 support.
- [2026-03-01] 🎉 Initial release with SearchSWE scaffold with BeyondSWE & ScaleSWE.
:jigsaw: Scaffolds
Reference agents shipped in-tree, all on the shared core.
| Scaffold | Type | Highlight | Resources |
|---|---|---|---|
| OpenHands-style | coding | CodeAct-XML coding agent, behavior-compatible with OpenHands (search off) | code |
| SearchSWE | coding | SWE coding agent with web search & fetch — fixes repo issues, pulls in external docs | code |
| DeepSearch | deep search | Base web-research QA agent; retry-until-answerable loop policy | code |
| IterResearch | deep search | Deep search + interaction scaling for long, multi-step research | code |
| Terminus-2 | terminal | tmux terminal agent driven by raw JSON keystrokes, on the standard loop | code |
OpenHands-style and SearchSWE are the same scaffold (search_swe), one enable_search flag apart — listed separately because they behave differently.
:clipboard: Benchmarks
| Benchmark | Description | Scaffold | Evaluation | Resources |
|---|---|---|---|---|
| BeyondSWE | Doc2Repo · CrossRepo · DepMigrate · DomainFix | SearchSWE / OpenHands | isolated Docker patch test | data · guide |
| ScaleSWE | large-scale SWE-bench-style data | SearchSWE / OpenHands | isolated Docker patch test | data · guide |
| SWE-bench-Pro | extended SWE-bench code tasks | SearchSWE / OpenHands | isolated Docker patch test | guide |
| NL2Repo | build a repo from a natural-language spec | SearchSWE / OpenHands | isolated Docker (artifact + golden tests) | guide |
| Terminal-Bench 2.0 | terminal tasks in containers | Terminus-2 | same-container reward | repo · guide |
| BrowseComp | web-search QA | DeepSearch / IterResearch | LLM-as-Judge | guide |
:world_map: Roadmap
Long-term goal: practical, general-purpose agents optimized with reinforcement learning. Shipped so far — the four scaffolds and six benchmarks above. Next:
- Multi-agent — multi-agent collaboration and orchestration on the shared core
- RL training — reinforcement-learning rollouts via Slime with an SGLang rollout engine (experimental today)
:rocket: Installation
Requires Python 3.11+ and Docker (for sandboxed execution and isolated evaluation).
uv (recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/AweAI-Team/AweAgent.git && cd AweAgent
uv venv --python 3.11 && source .venv/bin/activate
uv pip install -e .
pip
git clone https://github.com/AweAI-Team/AweAgent.git && cd AweAgent
python -m venv .venv && source .venv/bin/activate
pip install -e .
A single pip install -e . is enough to run every scaffold and benchmark, with all LLM backends except Volcengine Ark, out of the box. Optional extras: ".[ark]" (Volcengine Ark backend) · ".[dev]" (pytest · ruff · mypy).
Why editable (-e)? You're installing from source and will likely tweak agents, tools, or configs; -e makes changes take effect without reinstalling. Verify everything is registered with awe-agent info.
:arrow_forward: Running a Benchmark
Download data
Datasets download through one script, run from the repo root. Data lands under datasets/<task>/ — the path each task config defaults to — so afterward you can run with no extra env vars.
bash datasets/download.sh beyond_swe # one task
bash datasets/download.sh all # everything wired
FORCE=true bash datasets/download.sh beyond_swe # re-download
Wired today: BeyondSWE · BrowseComp · Terminal-Bench 2.0. See datasets/ for HF token / mirror options and per-task notes; other datasets are covered in each benchmark's guide.
Run
# point at your LLM
export OPENAI_API_KEY="sk-..."
# sanity-check what's registered (backends, runtimes, agents, tools)
awe-agent info
# list instances — no Docker needed
python recipes/beyond_swe/run.py --data-file datasets/beyond_swe/beyond_swe.jsonl --mode dry-run
# batch run
python recipes/beyond_swe/run.py --data-file datasets/beyond_swe/beyond_swe.jsonl --mode batch
See each benchmark's guide for full setup, CLI arguments, and output format.
:building_construction: Architecture
Four layers driven by a shared core — the figure maps 1:1 onto the modules below.
Module descriptions
- TaskRunner (core/task), the batch engine: loads a Task, provisions its runtime, drives each instance through the loop, routes the result to an evaluator, and writes structured output (concurrency · retries · per-instance isolation).
- AgentContext (core/agent), the shared bus: all rollout state (messages, trajectory, stats) plus every injected dependency (LLM, tools, tool-call format, runtime) and an optional training field. The single seam between the agent and the outside world.
- AgentLoop (core/agent), the rollout engine: runs the step loop, branches only on the kind of action (finish · message · tool call), dispatches tools by name, and records the trajectory + RL tokens. It is agnostic to whether it's driving a search, code, or terminal agent.
- Agent scaffold (scaffold/), the policy: a near-stateless step(ctx) → action. Built-ins: SearchSWE · DeepSearch · Terminus-2 · IterResearch.
- Interaction layer (core/llm · core/tool · core/runtime), the pluggable dependencies the loop injects: LLM backends, tools, tool-call formats, and runtime sandboxes.
- Evaluation & Data (core/eval · tasks/ · integrations/): turns a finished run into a score (isolated Docker patch test · LLM-as-judge · in-container reward) or token-level RL rollout data (Slime bridge).
- Config / Registry (core/config · plugins/): layered YAML config + entry-point registries that wire every part by name.
:gear: Configuration
Configs are YAML files with environment-variable substitution (${VAR}, ${VAR:-default}) and !include support.
LLM Backends
| Backend | Config File | Required Env Vars |
|---|---|---|
| OpenAI | configs/llm/openai.yaml | OPENAI_API_KEY |
| OpenAI (Responses) | configs/llm/openai_response.yaml | OPENAI_API_KEY |
| Azure OpenAI | configs/llm/azure.yaml | AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT |
| Anthropic | configs/llm/anthropic.yaml | ANTHROPIC_API_KEY |
| Volcengine Ark | configs/llm/ark.yaml | ARK_API_KEY, ARK_MODEL_ID |
| SGLang | configs/llm/sglang.yaml | (self-hosted endpoint) |
Environment Variables
Copy .env.example to .env and fill in your values:
cp .env.example .env
Sections: LLM Backend (pick one — API key + endpoint), Task Data (DATA_FILE), and Search Tools (optional — SERPAPI_API_KEY, JINA_API_KEY, only for search mode). See each benchmark's recipe guide (linked in the table above) for the full list.
:handshake: Contributing
Issues and PRs are welcome. AweAgent is built to be extended — adding an LLM backend, tool, runtime, agent scaffold, or evaluator means implementing a small protocol and registering one entry point, with no changes to the core engine. For development tooling, install with pip install -e ".[dev]" (pytest · ruff · mypy).
:scroll: Citation
If AweAgent is useful in your work, please consider citing it and giving the repo a ⭐.
@misc{aweagent2026,
title = {AweAgent: A Unified, Composable Framework to Build, Evaluate, and Train Agents},
author = {AweAI Team},
year = {2026},
howpublished = {\url{https://github.com/AweAI-Team/AweAgent}}
}
📄 License
Released under the Apache-2.0 License.
📨 Contact
Questions or feedback? Open an issue or email gx.chen.chn@gmail.com.