Chapter 7: RL Training and Trajectory Generation
April 13, 2026 · View on GitHub
What Problem Does This Solve?
Modern LLM fine-tuning — especially via reinforcement learning from human or environment feedback — requires high-quality behavioral trajectories: recordings of what an agent did, step by step, including reasoning, tool calls, and outcomes. These trajectories are expensive to generate synthetically and hard to collect at scale.
Hermes solves this by turning every production interaction into a potential training example. trajectory.py records a complete trace of each agent loop iteration — the prompt, the reasoning, every tool call, and the final response — in the Atropos RL format that NousResearch uses for fine-tuning. If you use Hermes daily, you're continuously generating training data for the very models that power it.
The Closed Learning Loop
flowchart LR
subgraph Usage["Daily Hermes Usage"]
TUI[TUI / Gateway\nUser Interactions]
CRON[Cron Jobs\nAutomated Tasks]
BENCH[Benchmark\nEnvironments]
end
subgraph Recording["trajectory.py"]
TRAJ[Trajectory Recorder\nrecords per-turn traces]
ATROP[Atropos Formatter\nconverts to RL format]
end
subgraph Storage["~/.hermes/trajectories/"]
TJSONL[traj_*.jsonl\nAtropos format]
end
subgraph Training["RL Training Pipeline"]
FILTER[Quality Filter\nreward scoring]
ATROPOS[Atropos\nRL Framework]
FINETUNE[Fine-tuned Model\nnext version]
end
TUI --> TRAJ
CRON --> TRAJ
BENCH --> TRAJ
TRAJ --> ATROP
ATROP --> TJSONL
TJSONL --> FILTER
FILTER --> ATROPOS
ATROPOS --> FINETUNE
FINETUNE -->|improved model| TUI
trajectory.py — The Recorder
trajectory.py is attached to the agent's core loop as an observer. It records a structured trace of every turn without affecting the agent's behavior.
What Gets Recorded
# hermes_cli/agent/trajectory.py (data structures)
@dataclass
class TurnTrace:
"""A single turn in an agent trajectory."""
# Context
session_id: str
turn_index: int
timestamp: float
model: str
provider: str
# Input
prompt_tokens: int
system_prompt_hash: str # For deduplication; not the full prompt
user_message: str
conversation_history_length: int
# Agent reasoning (if chain-of-thought is enabled)
reasoning: str | None
# Tool calls (may be multiple per turn)
tool_calls: list[ToolCall]
# Output
assistant_response: str
completion_tokens: int
# Outcome signals (filled in post-turn)
user_feedback: str | None # explicit feedback if user gave it
task_completed: bool | None # set by environment for benchmark tasks
reward: float | None # set by reward model or environment
@dataclass
class ToolCall:
tool_name: str
arguments: dict
result: str | None
error: str | None
duration_ms: float
success: bool
Recording a Trajectory
# hermes_cli/agent/trajectory.py (recording flow)
class TrajectoryRecorder:
def __init__(self, config: Config):
self.enabled = config.trajectory.enabled
self.output_dir = Path(config.trajectory.output_dir)
self.current_trajectory: list[TurnTrace] = []
def record_turn(
self,
user_message: str,
reasoning: str | None,
tool_calls: list[ToolCall],
assistant_response: str,
model_info: ModelInfo,
token_counts: TokenCounts
) -> TurnTrace:
"""Record a single turn. Called after each agent response."""
if not self.enabled:
return None
trace = TurnTrace(
session_id=self.session_id,
turn_index=len(self.current_trajectory),
timestamp=time.time(),
model=model_info.model,
provider=model_info.provider,
prompt_tokens=token_counts.prompt,
system_prompt_hash=hash_system_prompt(self.current_system_prompt),
user_message=user_message,
conversation_history_length=len(self.history),
reasoning=reasoning,
tool_calls=tool_calls,
assistant_response=assistant_response,
completion_tokens=token_counts.completion,
)
self.current_trajectory.append(trace)
return trace
def finalize(self, session_outcome: SessionOutcome):
"""Write the complete trajectory to disk at session end."""
trajectory = Trajectory(
session_id=self.session_id,
turns=self.current_trajectory,
outcome=session_outcome,
format_version="atropos-v1"
)
output_path = self.output_dir / f"traj_{self.session_id}.jsonl"
with open(output_path, "w") as f:
for turn in trajectory.turns:
f.write(json.dumps(asdict(turn)) + "\n")
Atropos Format
Atropos is NousResearch's RL training framework. The trajectory format it consumes is a JSONL file where each line is a turn trace:
{"session_id": "sess_abc123", "turn_index": 0, "model": "gpt-4o", "user_message": "Can you help me debug this Python function?", "reasoning": "The user has a Python debugging question. I should ask to see the code.", "tool_calls": [], "assistant_response": "I'd be happy to help debug your Python function. Could you share the code?", "prompt_tokens": 1847, "completion_tokens": 23, "reward": null}
{"session_id": "sess_abc123", "turn_index": 1, "model": "gpt-4o", "user_message": "def process(df):\n return df.groupby('a').sum()", "reasoning": "Simple groupby operation. The issue might be NaN handling or column types.", "tool_calls": [{"tool_name": "shell_exec", "arguments": {"command": "python3 -c \"import pandas as pd; df = pd.DataFrame({'a': [1,1,2], 'b': [None, 2, 3]}); print(df.groupby('a').sum())\"}"}, "result": " b\na \n1 2.0\n2 3.0", "success": true, "duration_ms": 234}], "assistant_response": "The function looks correct for basic aggregation. However, note that NaN values are silently dropped by groupby().sum()...", "prompt_tokens": 2103, "completion_tokens": 187, "reward": 1.0}
Trajectory Configuration
# ~/.hermes/config.yaml
trajectory:
enabled: true
output_dir: "~/.hermes/trajectories"
# What to record
record_reasoning: true # Include chain-of-thought if available
record_tool_calls: true # Include all tool call arguments and results
record_system_prompt: false # Exclude for privacy (hash only)
# Quality filtering
min_turn_count: 2 # Skip single-turn sessions
require_tool_calls: false # Include even non-tool-using sessions
# Reward signals
reward_model: null # Path to local reward model, or null for human feedback only
# Upload
auto_upload: false # Upload to NousResearch if true
upload_endpoint: "https://training.nousresearch.com/trajectories"
upload_api_key: "nk-..."
Benchmark Environments
Hermes ships with four benchmark environments designed to generate high-quality training trajectories for specific skill domains.
Overview
| Environment | Location | Tests | Domain |
|---|---|---|---|
| hermes_swe_env | environments/hermes_swe_env/ | Software engineering tasks | Code editing, bug fixing, PR review |
| tblite | environments/tblite/ | Terminal-based tasks | Shell scripting, file manipulation, system admin |
| terminalbench_2 | environments/terminalbench_2/ | Terminal reasoning | Complex multi-step terminal workflows |
| yc_bench | environments/yc_bench/ | Business/startup tasks | Research, analysis, document generation |
hermes_swe_env — Software Engineering Benchmark
Based on SWE-bench methodology, hermes_swe_env presents the agent with real-world software engineering tasks:
# hermes_cli/environments/hermes_swe_env/__init__.py (structure)
class HermesSWEEnv:
"""
Software engineering benchmark environment.
Each task is a GitHub issue + repository snapshot.
The agent must produce a patch that resolves the issue.
Success is measured by automated test suite pass rate.
"""
async def run_task(self, task: SWETask) -> TaskResult:
"""
Set up a Docker container with the task's repository,
present the issue to the agent, and evaluate the result.
"""
container = await self._setup_container(task.repo_snapshot)
prompt = f"""
You are working on the following GitHub issue:
Repository: {task.repo}
Issue #{task.issue_number}: {task.issue_title}
{task.issue_body}
Please resolve this issue by editing the relevant files.
"""
result = await self.agent.run(
prompt=prompt,
backend="docker",
container=container,
max_iterations=20
)
test_pass_rate = await self._run_tests(container)
return TaskResult(
task_id=task.id,
success=test_pass_rate > 0.9,
test_pass_rate=test_pass_rate,
patch=await self._extract_patch(container),
trajectory=result.trajectory
)
tblite — Terminal Benchmark Lite
A collection of terminal-focused tasks ranging from simple file operations to complex shell scripting challenges:
# hermes_cli/environments/tblite/__init__.py (structure)
TASK_CATEGORIES = {
"file_ops": [
"Find all Python files modified in the last 24 hours",
"Create a directory structure for a new Python package",
"Extract specific lines from multiple log files",
],
"shell_scripting": [
"Write a bash script to monitor disk usage and alert when > 90%",
"Parse a CSV file and output statistics",
"Create a backup script with rotation",
],
"system_admin": [
"Set up a cron job to run a Python script daily",
"Configure environment variables for a Python project",
"Debug a failing systemd service",
]
}
terminalbench_2 — Advanced Terminal Reasoning
terminalbench_2 focuses on multi-step terminal workflows that require planning and state management:
# hermes_cli/environments/terminalbench_2/__init__.py (structure)
class TerminalBench2:
"""
Advanced terminal benchmark with longer-horizon tasks.
Evaluates ability to maintain state across many steps,
recover from errors, and use terminal tools efficiently.
"""
pass
yc_bench — Business Task Benchmark
Evaluates the agent's ability to perform business and startup-related tasks:
# hermes_cli/environments/yc_bench/__init__.py (structure)
TASK_TYPES = [
"market_research", # Research a market and produce a report
"competitor_analysis", # Analyze competitors and create comparison matrix
"technical_spec", # Write a technical specification document
"financial_model", # Build a simple financial model in a spreadsheet
"user_interview_analysis", # Analyze interview transcripts for themes
]
Tool-Call Parsers for Multi-Model RL
One of Hermes's most technically sophisticated features is its ability to generate RL training data from multiple model families. Different models use different tool-call formats, and trajectory.py includes parsers for each:
# hermes_cli/agent/trajectory.py (tool call parsers)
class ToolCallParser:
"""
Parse tool calls from different model families into
a unified ToolCall format for trajectory recording.
"""
@staticmethod
def parse(response: str, model_family: str) -> list[ToolCall]:
parser = {
"hermes": ToolCallParser._parse_hermes, # Hermes function calling
"deepseek": ToolCallParser._parse_deepseek, # DeepSeek tool use
"qwen": ToolCallParser._parse_qwen, # Qwen tool calls
"glm": ToolCallParser._parse_glm, # GLM function calls
"llama": ToolCallParser._parse_llama, # Llama tool use
"kimi": ToolCallParser._parse_kimi, # Kimi (Moonshot) tools
"mistral": ToolCallParser._parse_mistral, # Mistral tool calls
}.get(model_family, ToolCallParser._parse_openai)
return parser(response)
@staticmethod
def _parse_hermes(response: str) -> list[ToolCall]:
"""Parse Hermes function calling format."""
# Hermes uses XML-like tags: <tool_call>...</tool_call>
calls = []
for match in re.finditer(r'<tool_call>(.*?)</tool_call>', response, re.DOTALL):
try:
call_data = json.loads(match.group(1))
calls.append(ToolCall(
tool_name=call_data["name"],
arguments=call_data.get("arguments", {})
))
except json.JSONDecodeError:
pass
return calls
@staticmethod
def _parse_deepseek(response: str) -> list[ToolCall]:
"""Parse DeepSeek tool use format."""
# DeepSeek uses a different JSON structure
...
Model Family Support Matrix
| Model Family | Tool Format | Reasoning Format | Notes |
|---|---|---|---|
| Hermes (NousResearch) | XML tags: <tool_call> | <reasoning> | Native format |
| DeepSeek | JSON in <tool_call> | <think> | R1-style reasoning |
| Qwen | OpenAI-compatible JSON | Optional <think> | Qwen2.5 family |
| GLM | Function call JSON | Not exposed | GLM-4 family |
| Llama | OpenAI-compatible | Optional chain | Llama 3.x family |
| Kimi (Moonshot) | OpenAI-compatible | <think> | k1.5 family |
| Mistral | OpenAI-compatible | Not exposed | Mistral/Mixtral |
| OpenAI (fallback) | Standard function calling | Not exposed | GPT-4o family |
Running the Data Generation Pipeline
Generate Trajectories from Benchmarks
# Run hermes_swe_env benchmark and generate trajectories
hermes bench run hermes_swe_env \
--model "gpt-4o" \
--tasks 50 \
--output ~/.hermes/trajectories/swe_bench_run_1/
# Run tblite benchmark
hermes bench run tblite \
--model "meta-llama/Llama-3.3-70b-Instruct-Turbo" \
--tasks 100 \
--backend docker \
--concurrency 5 \
--output ~/.hermes/trajectories/tblite_run_1/
Filter and Score Trajectories
# Score trajectories with a reward model
hermes traj score \
--input ~/.hermes/trajectories/swe_bench_run_1/ \
--reward-model ~/models/reward_model.ckpt \
--output ~/.hermes/trajectories/scored/
# Filter to high-quality trajectories
hermes traj filter \
--input ~/.hermes/trajectories/scored/ \
--min-reward 0.7 \
--min-turns 3 \
--output ~/.hermes/trajectories/filtered/
# Convert to Atropos training format
hermes traj export \
--input ~/.hermes/trajectories/filtered/ \
--format atropos-v1 \
--output ~/training_data/hermes_trajectories.jsonl
Upload to NousResearch
# Upload high-quality trajectories to contribute to model training
hermes traj upload \
--input ~/.hermes/trajectories/filtered/ \
--endpoint https://training.nousresearch.com/trajectories \
--api-key $NOUSRESEARCH_API_KEY
Data Generation Pipeline Architecture
sequenceDiagram
participant Env as Benchmark Environment
participant Agent as Hermes Agent
participant Traj as trajectory.py
participant FS as ~/.hermes/trajectories/
participant Atropos as Atropos RL
Env->>Agent: present task
loop Agent loop (max_iterations)
Agent->>Agent: build prompt
Agent->>Agent: call LLM
Agent->>Agent: parse tool calls
Agent->>Env: execute tool calls
Env-->>Agent: tool results
Traj->>Traj: record TurnTrace
end
Env->>Traj: session_outcome (pass/fail + reward)
Traj->>Traj: finalize trajectory
Traj->>FS: write traj_*.jsonl
Note over FS: Quality filtering step
FS->>Atropos: high-quality trajectories
Atropos->>Atropos: RL training update
Atropos-->>Agent: improved policy (next model version)
Reward Signals
Trajectories become useful for RL training only when they have reward signals. Hermes supports three reward sources:
| Reward Source | When Available | Quality |
|---|---|---|
| Environment feedback | Benchmark runs (automated test pass/fail) | High — ground truth |
| User explicit feedback | User rates response with 👍/👎 in TUI | High — human judgment |
| Reward model | Configured local or API reward model | Medium — depends on model quality |
| Implicit signal | Session length, skill creation events, memory writes | Low — correlational |
For production use, the most valuable trajectories come from benchmark runs where success is objectively measurable. Interactive session trajectories are valuable when users provide explicit feedback.
Chapter Summary
| Concept | Key Takeaway |
|---|---|
| trajectory.py | Silent observer on agent loop; records every turn in Atropos format |
| Atropos format | JSONL; one line per turn; includes reasoning, tool calls, outcomes, rewards |
| Closed loop | Daily usage → trajectories → Atropos training → improved models → daily usage |
| hermes_swe_env | SWE-bench-style software engineering tasks; Docker-isolated; evaluated by tests |
| tblite | Terminal task benchmark; shell scripting, file ops, system admin |
| terminalbench_2 | Long-horizon terminal reasoning tasks |
| yc_bench | Business task benchmark; research, analysis, document generation |
| Tool-call parsers | Unified parser for 7+ model families; enables multi-model RL training |
| Reward signals | Environment feedback (best), user feedback, reward model, implicit signals |
| Upload workflow | Filter → score → export → upload to NousResearch training endpoint |