Chapter 7: RL Training and Trajectory Generation

April 13, 2026

What Problem Does This Solve?

Modern LLM fine-tuning — especially via reinforcement learning from human or environment feedback — requires high-quality behavioral trajectories: recordings of what an agent did, step by step, including reasoning, tool calls, and outcomes. These trajectories are expensive to generate synthetically and hard to collect at scale.

Hermes addresses this by treating every production interaction as a potential training example. trajectory.py records a trace of each agent loop iteration (the user message, the reasoning, every tool call, and the final response) in the Atropos RL format that NousResearch uses for fine-tuning. With recording enabled, daily use of Hermes continuously produces candidate training data for the models that power it.


The Closed Learning Loop

flowchart LR
    subgraph Usage["Daily Hermes Usage"]
        TUI[TUI / Gateway\nUser Interactions]
        CRON[Cron Jobs\nAutomated Tasks]
        BENCH[Benchmark\nEnvironments]
    end

    subgraph Recording["trajectory.py"]
        TRAJ[Trajectory Recorder\nrecords per-turn traces]
        ATROP[Atropos Formatter\nconverts to RL format]
    end

    subgraph Storage["~/.hermes/trajectories/"]
        TJSONL[traj_*.jsonl\nAtropos format]
    end

    subgraph Training["RL Training Pipeline"]
        FILTER[Quality Filter\nreward scoring]
        ATROPOS[Atropos\nRL Framework]
        FINETUNE[Fine-tuned Model\nnext version]
    end

    TUI --> TRAJ
    CRON --> TRAJ
    BENCH --> TRAJ
    TRAJ --> ATROP
    ATROP --> TJSONL
    TJSONL --> FILTER
    FILTER --> ATROPOS
    ATROPOS --> FINETUNE
    FINETUNE -->|improved model| TUI

trajectory.py — The Recorder

trajectory.py is attached to the agent's core loop as an observer. It records a structured trace of every turn without affecting the agent's behavior.

What Gets Recorded

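# hermes_cli/agent/trajectory.py (data structures)

from __future__ import annotations  # lets TurnTrace reference ToolCall before it is defined

from dataclasses import dataclass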

@dataclass
class TurnTrace:
    """A single turn in an agent trajectory."""
    
    # Context
    session_id: str
    turn_index: int
    timestamp: float
    model: str
    provider: str
    
    # Input
    prompt_tokens: int
    system_prompt_hash: str    # For deduplication; not the full prompt
    user_message: str
    conversation_history_length: int
    
    # Agent reasoning (if chain-of-thought is enabled)
    reasoning: str | None
    
    # Tool calls (may be multiple per turn)
    tool_calls: list[ToolCall]
    
    # Output
    assistant_response: str
    completion_tokens: int
    
    # Outcome signals (filled in post-turn, so they default to None)
    user_feedback: str | None = None    # explicit feedback if user gave it
    task_completed: bool | None = None  # set by environment for benchmark tasks
    reward: float | None = None         # set by reward model or environment


@dataclass
class ToolCall:
    tool_name: str
    arguments: dict
    result: str | None
    error: str | None
    duration_ms: float
    success: bool
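
To make the shape of a recorded turn concrete, here is a minimal sketch of how such dataclasses could be serialized to a single JSONL line. The classes are trimmed-down stand-ins for the ones above, and both the default values and the `to_jsonl_line` helper are assumptions made for illustration, not part of trajectory.py.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    tool_name: str
    arguments: dict
    result: Optional[str] = None
    error: Optional[str] = None
    duration_ms: float = 0.0
    success: bool = False


@dataclass
class TurnTrace:
    # Illustrative subset of the full TurnTrace fields
    session_id: str
    turn_index: int
    user_message: str
    assistant_response: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    reward: Optional[float] = None


def to_jsonl_line(trace: TurnTrace) -> str:
    """Serialize one turn as a single JSON line (hypothetical helper).

    asdict() recurses into nested dataclasses, so each ToolCall becomes a dict.
    """
    return json.dumps(asdict(trace), ensure_ascii=False)


trace = TurnTrace(
    session_id="sess_demo",
    turn_index=0,
    user_message="List the files here",
    assistant_response="Here are the files.",
    tool_calls=[ToolCall("shell_exec", {"command": "ls"}, result="a.txt", success=True)],
)
line = to_jsonl_line(trace)
restored = json.loads(line)
print(restored["tool_calls"][0]["tool_name"])
```

Because json.dumps escapes newlines inside strings, multi-line tool output still fits on one line of the file.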

Recording a Trajectory

# hermes_cli/agent/trajectory.py (recording flow)

import json
import time
from dataclasses import asdict
from pathlib import Path

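class TrajectoryRecorder:
    def __init__(self, config: Config, session_id: str):
        self.enabled = config.trajectory.enabled
        self.output_dir = Path(config.trajectory.output_dir)
        self.current_trajectory: list[TurnTrace] = []
        # record_turn() reads the three attributes below. This excerpt does not
        # show where they are set, so passing session_id in and letting the
        # agent loop keep the other two up to date is an assumption.
        self.session_id = session_id
        self.current_system_prompt: str = ""
        self.history: list = []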

    def record_turn(
        self,
        user_message: str,
        reasoning: str | None,
        tool_calls: list[ToolCall],
        assistant_response: str,
        model_info: ModelInfo,
        token_counts: TokenCounts
    ) -> TurnTrace | None:
        """Record a single turn. Called after each agent response.

        Returns None when recording is disabled.
        """
        if not self.enabled:
            return None
        
        trace = TurnTrace(
            session_id=self.session_id,
            turn_index=len(self.current_trajectory),
            timestamp=time.time(),
            model=model_info.model,
            provider=model_info.provider,
            prompt_tokens=token_counts.prompt,
            system_prompt_hash=hash_system_prompt(self.current_system_prompt),
            user_message=user_message,
            conversation_history_length=len(self.history),
            reasoning=reasoning,
            tool_calls=tool_calls,
            assistant_response=assistant_response,
            completion_tokens=token_counts.completion,
        )
        
        self.current_trajectory.append(trace)
        return trace
    
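    def finalize(self, session_outcome: SessionOutcome):
        """Write the complete trajectory to disk at session end."""
        if not self.enabled:
            return

        trajectory = Trajectory(
            session_id=self.session_id,
            turns=self.current_trajectory,
            outcome=session_outcome,
            format_version="atropos-v1"
        )

        # Resolve "~" in the configured path and make sure the directory exists
        output_dir = self.output_dir.expanduser()
        output_dir.mkdir(parents=True, exist_ok=True)
        output_path = output_dir / f"traj_{self.session_id}.jsonl"
        with open(output_path, "w", encoding="utf-8") as f:
            # One JSON object per line, one line per turn
            for turn in trajectory.turns:
                f.write(json.dumps(asdict(turn)) + "\n")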
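
Since finalize() writes one JSON object per line, a trajectory file can be read back with the standard library alone. The sketch below assumes only that layout. `load_trajectory` is a hypothetical helper, and the demo writes a throwaway file rather than touching ~/.hermes/trajectories/.

```python
import json
import tempfile
from pathlib import Path


def load_trajectory(path: Path) -> list[dict]:
    """Read one traj_*.jsonl file and return its turns ordered by turn_index."""
    turns = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                turns.append(json.loads(line))
    return sorted(turns, key=lambda t: t["turn_index"])


# Demo with a temporary file that mimics the layout finalize() produces
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "traj_sess_demo.jsonl"
    path.write_text(
        json.dumps({"session_id": "sess_demo", "turn_index": 1, "reward": 1.0}) + "\n"
        + json.dumps({"session_id": "sess_demo", "turn_index": 0, "reward": None}) + "\n",
        encoding="utf-8",
    )
    turns = load_trajectory(path)

print([t["turn_index"] for t in turns])
```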

Atropos Format

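Atropos is NousResearch's RL training framework. The trajectory format it consumes is a JSONL file where each line is a turn trace. The example lines here are abbreviated and omit some TurnTrace fields, such as timestamp and provider: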

{"session_id": "sess_abc123", "turn_index": 0, "model": "gpt-4o", "user_message": "Can you help me debug this Python function?", "reasoning": "The user has a Python debugging question. I should ask to see the code.", "tool_calls": [], "assistant_response": "I'd be happy to help debug your Python function. Could you share the code?", "prompt_tokens": 1847, "completion_tokens": 23, "reward": null}
{"session_id": "sess_abc123", "turn_index": 1, "model": "gpt-4o", "user_message": "def process(df):\n    return df.groupby('a').sum()", "reasoning": "Simple groupby operation. The issue might be NaN handling or column types.", "tool_calls": [{"tool_name": "shell_exec", "arguments": {"command": "python3 -c \"import pandas as pd; df = pd.DataFrame({'a': [1,1,2], 'b': [None, 2, 3]}); print(df.groupby('a').sum())\""}, "result": "     b\na     \n1  2.0\n2  3.0", "success": true, "duration_ms": 234}], "assistant_response": "The function looks correct for basic aggregation. However, note that NaN values are silently dropped by groupby().sum()...", "prompt_tokens": 2103, "completion_tokens": 187, "reward": 1.0}
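
Before feeding files like this into a training pipeline, it can help to check each line for structural problems. The following is a rough sketch of such a check. The set of required keys is an assumption read off the example lines above, not an official Atropos schema, and `validate_turn` is a hypothetical helper.

```python
import json

# Assumed required keys, taken from the example lines above (not an official schema)
REQUIRED_KEYS = {"session_id", "turn_index", "user_message", "tool_calls", "assistant_response"}


def validate_turn(line: str) -> list[str]:
    """Return the problems found in one JSONL line; an empty list means it looks fine."""
    try:
        turn = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(turn, dict):
        return ["line is not a JSON object"]

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(turn))]
    reward = turn.get("reward")
    if reward is not None and not isinstance(reward, (int, float)):
        problems.append("reward must be a number or null")
    return problems


good = ('{"session_id": "s", "turn_index": 0, "user_message": "hi", '
        '"tool_calls": [], "assistant_response": "hello", "reward": null}')
bad = '{"session_id": "s", "turn_index": 0}'

print(validate_turn(good))
print(validate_turn(bad))
```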

Trajectory Configuration

# ~/.hermes/config.yaml

trajectory:
  enabled: true
  output_dir: "~/.hermes/trajectories"
  
  # What to record
  record_reasoning: true      # Include chain-of-thought if available
  record_tool_calls: true     # Include all tool call arguments and results
  record_system_prompt: false # Exclude for privacy (hash only)
  
  # Quality filtering
  min_turn_count: 2           # Skip single-turn sessions
  require_tool_calls: false   # Include even non-tool-using sessions
  
  # Reward signals
  reward_model: null          # Path to local reward model, or null for human feedback only
  
  # Upload
  auto_upload: false          # Upload to NousResearch if true
  upload_endpoint: "https://training.nousresearch.com/trajectories"
  upload_api_key: "nk-..."
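
The two quality-filtering options can be read as a simple per-session predicate. Here is a minimal sketch of that reading, assuming a session is a list of turn dicts shaped like the JSONL lines above. `keep_session` and `DEFAULTS` are hypothetical names, and the exact semantics Hermes applies may differ.

```python
# Defaults mirror the example configuration above
DEFAULTS = {"min_turn_count": 2, "require_tool_calls": False}


def keep_session(turns: list[dict], trajectory_cfg: dict) -> bool:
    """Decide whether a session passes the configured quality filter."""
    cfg = {**DEFAULTS, **trajectory_cfg}  # user settings override defaults

    # min_turn_count: 2 skips single-turn sessions
    if len(turns) < cfg["min_turn_count"]:
        return False

    # require_tool_calls: keep only sessions where at least one turn used a tool
    if cfg["require_tool_calls"] and not any(t.get("tool_calls") for t in turns):
        return False

    return True


chat_only = [{"tool_calls": []}, {"tool_calls": []}]
with_tool = [{"tool_calls": []}, {"tool_calls": [{"tool_name": "shell_exec"}]}]

print(keep_session(chat_only, {}))
print(keep_session(chat_only, {"require_tool_calls": True}))
```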

Benchmark Environments

Hermes ships with four benchmark environments designed to generate high-quality training trajectories for specific skill domains.

Overview

| Environment | Location | Tests | Domain |
| --- | --- | --- | --- |
| hermes_swe_env | environments/hermes_swe_env/ | Software engineering tasks | Code editing, bug fixing, PR review |
| tblite | environments/tblite/ | Terminal-based tasks | Shell scripting, file manipulation, system admin |
| terminalbench_2 | environments/terminalbench_2/ | Terminal reasoning | Complex multi-step terminal workflows |
| yc_bench | environments/yc_bench/ | Business/startup tasks | Research, analysis, document generation |

hermes_swe_env — Software Engineering Benchmark

Based on SWE-bench methodology, hermes_swe_env presents the agent with real-world software engineering tasks:

# hermes_cli/environments/hermes_swe_env/__init__.py (structure)

class HermesSWEEnv:
    """
    Software engineering benchmark environment.
    
    Each task is a GitHub issue + repository snapshot.
    The agent must produce a patch that resolves the issue.
    Success is measured by automated test suite pass rate.
    """
    
    async def run_task(self, task: SWETask) -> TaskResult:
        """
        Set up a Docker container with the task's repository,
        present the issue to the agent, and evaluate the result.
        """
        container = await self._setup_container(task.repo_snapshot)
        
        prompt = f"""
        You are working on the following GitHub issue:
        
        Repository: {task.repo}
        Issue #{task.issue_number}: {task.issue_title}
        
        {task.issue_body}
        
        Please resolve this issue by editing the relevant files.
        """
        
        result = await self.agent.run(
            prompt=prompt,
            backend="docker",
            container=container,
            max_iterations=20
        )
        
        test_pass_rate = await self._run_tests(container)
        
        return TaskResult(
            task_id=task.id,
            success=test_pass_rate > 0.9,
            test_pass_rate=test_pass_rate,
            patch=await self._extract_patch(container),
            trajectory=result.trajectory
        )
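
The success criterion in run_task is a strict comparison, which matters at the boundary. The sketch below illustrates this with a hypothetical `pass_rate` helper. How `_run_tests` actually computes the rate is not shown in the excerpt, so the per-test boolean mapping is an assumption.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tests that passed; an empty suite counts as 0.0."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


def is_success(rate: float, threshold: float = 0.9) -> bool:
    """Same strict comparison as run_task: the rate must exceed the threshold."""
    return rate > threshold


# Ten tests, one failure: the rate is exactly 0.9, which does not exceed 0.9
nine_of_ten = {f"test_{i}": i != 0 for i in range(10)}
# Twenty tests, one failure: 0.95 clears the threshold
nineteen_of_twenty = {f"test_{i}": i != 0 for i in range(20)}

print(pass_rate(nine_of_ten), is_success(pass_rate(nine_of_ten)))
print(pass_rate(nineteen_of_twenty), is_success(pass_rate(nineteen_of_twenty)))
```

Under this reading, a task with ten tests counts as solved only when all ten pass.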

tblite — Terminal Benchmark Lite

A collection of terminal-focused tasks ranging from simple file operations to complex shell scripting challenges:

# hermes_cli/environments/tblite/__init__.py (structure)

TASK_CATEGORIES = {
    "file_ops": [
        "Find all Python files modified in the last 24 hours",
        "Create a directory structure for a new Python package",
        "Extract specific lines from multiple log files",
    ],
    "shell_scripting": [
        "Write a bash script to monitor disk usage and alert when > 90%",
        "Parse a CSV file and output statistics",
        "Create a backup script with rotation",
    ],
    "system_admin": [
        "Set up a cron job to run a Python script daily",
        "Configure environment variables for a Python project",
        "Debug a failing systemd service",
    ]
}

terminalbench_2 — Advanced Terminal Reasoning

terminalbench_2 focuses on multi-step terminal workflows that require planning and state management:

# hermes_cli/environments/terminalbench_2/__init__.py (structure)

class TerminalBench2:
    """
    Advanced terminal benchmark with longer-horizon tasks.
    Evaluates ability to maintain state across many steps,
    recover from errors, and use terminal tools efficiently.
    """
    pass

yc_bench — Business Task Benchmark

Evaluates the agent's ability to perform business and startup-related tasks:

# hermes_cli/environments/yc_bench/__init__.py (structure)

TASK_TYPES = [
    "market_research",       # Research a market and produce a report
    "competitor_analysis",   # Analyze competitors and create comparison matrix
    "technical_spec",        # Write a technical specification document
    "financial_model",       # Build a simple financial model in a spreadsheet
    "user_interview_analysis", # Analyze interview transcripts for themes
]

Tool-Call Parsers for Multi-Model RL

Hermes can generate RL training data from multiple model families. Different models emit tool calls in different formats, so trajectory.py includes a parser per supported family and falls back to the OpenAI format for anything else:

# hermes_cli/agent/trajectory.py (tool call parsers)

import json
import re

class ToolCallParser:
    """
    Parse tool calls from different model families into
    a unified ToolCall format for trajectory recording.
    """
    
    @staticmethod
    def parse(response: str, model_family: str) -> list[ToolCall]:
        parser = {
            "hermes":   ToolCallParser._parse_hermes,    # Hermes function calling
            "deepseek": ToolCallParser._parse_deepseek,  # DeepSeek tool use
            "qwen":     ToolCallParser._parse_qwen,      # Qwen tool calls
            "glm":      ToolCallParser._parse_glm,       # GLM function calls
            "llama":    ToolCallParser._parse_llama,     # Llama tool use
            "kimi":     ToolCallParser._parse_kimi,      # Kimi (Moonshot) tools
            "mistral":  ToolCallParser._parse_mistral,   # Mistral tool calls
        }.get(model_family, ToolCallParser._parse_openai)
        
        return parser(response)
    
    @staticmethod
    def _parse_hermes(response: str) -> list[ToolCall]:
        """Parse Hermes function calling format."""
        # Hermes uses XML-like tags: <tool_call>...</tool_call>
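        calls = []
        for match in re.finditer(r'<tool_call>(.*?)</tool_call>', response, re.DOTALL):
            try:
                call_data = json.loads(match.group(1))
                calls.append(ToolCall(
                    tool_name=call_data["name"],
                    arguments=call_data.get("arguments", {}),
                    # Execution fields are placeholders until the tool has actually run
                    result=None,
                    error=None,
                    duration_ms=0.0,
                    success=False,
                ))
            except (json.JSONDecodeError, KeyError, TypeError):
                # Skip malformed blocks: invalid JSON, a missing "name", or a non-object payload
                continue
        return calls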
    
    @staticmethod
    def _parse_deepseek(response: str) -> list[ToolCall]:
        """Parse DeepSeek tool use format."""
        # DeepSeek uses a different JSON structure
        ...
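
To see how the tag-based approach behaves on real text, here is a standalone sketch of the same idea that returns plain dicts instead of ToolCall objects. `extract_tool_calls` is a hypothetical name and the sample response is made up for illustration.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(response: str) -> list[dict]:
    """Collect every well-formed <tool_call> block; malformed blocks are skipped."""
    calls = []
    for match in TOOL_CALL_RE.finditer(response):
        try:
            data = json.loads(match.group(1))
            calls.append({
                "tool_name": data["name"],
                "arguments": data.get("arguments", {}),
            })
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # invalid JSON, missing "name", or payload is not an object
    return calls


response = (
    "Let me check the directory.\n"
    '<tool_call>\n{"name": "shell_exec", "arguments": {"command": "ls"}}\n</tool_call>\n'
    "<tool_call>not json</tool_call>"
)
calls = extract_tool_calls(response)
print(calls)
```

The non-greedy `(.*?)` with re.DOTALL keeps each match confined to one block even when a response contains several tool calls spread over multiple lines.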

Model Family Support Matrix

| Model Family | Tool Format | Reasoning Format | Notes |
| --- | --- | --- | --- |
| Hermes (NousResearch) | XML tags: <tool_call> | <reasoning> | Native format |
| DeepSeek | JSON in <tool_call> | <think> | R1-style reasoning |
| Qwen | OpenAI-compatible JSON | Optional <think> | Qwen2.5 family |
| GLM | Function call JSON | Not exposed | GLM-4 family |
| Llama | OpenAI-compatible | Optional chain | Llama 3.x family |
| Kimi (Moonshot) | OpenAI-compatible | <think> | k1.5 family |
| Mistral | OpenAI-compatible | Not exposed | Mistral/Mixtral |
| OpenAI (fallback) | Standard function calling | Not exposed | GPT-4o family |
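
The reasoning column suggests a small lookup from model family to tag name. The following sketch shows one way the reasoning text could be separated from the visible response under that assumption. `REASONING_TAGS` and `split_reasoning` are hypothetical, and families listed as "Not exposed" or "Optional chain" are treated as having no tag.

```python
import re
from typing import Optional

# Tag names follow the support matrix above; families not listed expose no reasoning tag
REASONING_TAGS = {
    "hermes": "reasoning",
    "deepseek": "think",
    "qwen": "think",
    "kimi": "think",
}


def split_reasoning(response: str, model_family: str) -> tuple[Optional[str], str]:
    """Return (reasoning, visible_text); reasoning is None when no tag is found."""
    tag = REASONING_TAGS.get(model_family)
    if tag is None:
        return None, response.strip()

    pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
    match = pattern.search(response)
    if match is None:
        return None, response.strip()

    reasoning = match.group(1).strip()
    visible = pattern.sub("", response, count=1).strip()  # drop only the first block
    return reasoning, visible


print(split_reasoning("<think>Check NaN handling.</think>\nThe function is fine.", "deepseek"))
print(split_reasoning("The function is fine.", "mistral"))
```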

Running the Data Generation Pipeline

Generate Trajectories from Benchmarks

# Run hermes_swe_env benchmark and generate trajectories
hermes bench run hermes_swe_env \
  --model "gpt-4o" \
  --tasks 50 \
  --output ~/.hermes/trajectories/swe_bench_run_1/

# Run tblite benchmark
hermes bench run tblite \
  --model "meta-llama/Llama-3.3-70b-Instruct-Turbo" \
  --tasks 100 \
  --backend docker \
  --concurrency 5 \
  --output ~/.hermes/trajectories/tblite_run_1/

Filter and Score Trajectories

# Score trajectories with a reward model
hermes traj score \
  --input ~/.hermes/trajectories/swe_bench_run_1/ \
  --reward-model ~/models/reward_model.ckpt \
  --output ~/.hermes/trajectories/scored/

# Filter to high-quality trajectories
hermes traj filter \
  --input ~/.hermes/trajectories/scored/ \
  --min-reward 0.7 \
  --min-turns 3 \
  --output ~/.hermes/trajectories/filtered/

# Convert to Atropos training format
hermes traj export \
  --input ~/.hermes/trajectories/filtered/ \
  --format atropos-v1 \
  --output ~/training_data/hermes_trajectories.jsonl
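
The filter step's thresholds can be pictured as a predicate over a list of turns. The sketch below is one possible reading of `--min-reward 0.7 --min-turns 3`, not the CLI's actual implementation. In particular, treating the trajectory reward as the mean of the non-null per-turn rewards, and making both thresholds inclusive, are assumptions.

```python
from statistics import mean
from typing import Optional


def trajectory_reward(turns: list[dict]) -> Optional[float]:
    """Mean of the per-turn rewards that are set (assumed aggregation rule)."""
    rewards = [t["reward"] for t in turns if t.get("reward") is not None]
    return mean(rewards) if rewards else None


def passes_filter(turns: list[dict], min_reward: float = 0.7, min_turns: int = 3) -> bool:
    """Keep trajectories that are long enough and score at least min_reward."""
    if len(turns) < min_turns:
        return False
    reward = trajectory_reward(turns)
    return reward is not None and reward >= min_reward


scored = [{"reward": None}, {"reward": 1.0}, {"reward": 0.5}]   # mean of 1.0 and 0.5 is 0.75
too_short = [{"reward": 1.0}, {"reward": 1.0}]
unscored = [{"reward": None}, {"reward": None}, {"reward": None}]

print(passes_filter(scored), passes_filter(too_short), passes_filter(unscored))
```

Unscored trajectories are rejected here on the grounds that a reward threshold cannot be checked without a reward.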

Upload to NousResearch

# Upload high-quality trajectories to contribute to model training
hermes traj upload \
  --input ~/.hermes/trajectories/filtered/ \
  --endpoint https://training.nousresearch.com/trajectories \
  --api-key "$NOUSRESEARCH_API_KEY"

Data Generation Pipeline Architecture

sequenceDiagram
    participant Env as Benchmark Environment
    participant Agent as Hermes Agent
    participant Traj as trajectory.py
    participant FS as ~/.hermes/trajectories/
    participant Atropos as Atropos RL

    Env->>Agent: present task
    
    loop Agent loop (max_iterations)
        Agent->>Agent: build prompt
        Agent->>Agent: call LLM
        Agent->>Agent: parse tool calls
        Agent->>Env: execute tool calls
        Env-->>Agent: tool results
        Agent->>Traj: record TurnTrace
    end
    
    Env->>Traj: session_outcome (pass/fail + reward)
    Traj->>Traj: finalize trajectory
    Traj->>FS: write traj_*.jsonl
    
    Note over FS: Quality filtering step
    FS->>Atropos: high-quality trajectories
    Atropos->>Atropos: RL training update
    Atropos-->>Agent: improved policy (next model version)

Reward Signals

Trajectories become useful for RL training only when they have reward signals. Hermes supports four reward sources:

| Reward Source | When Available | Quality |
| --- | --- | --- |
| Environment feedback | Benchmark runs (automated test pass/fail) | High (ground truth) |
| User explicit feedback | User rates response with 👍/👎 in TUI | High (human judgment) |
| Reward model | Configured local or API reward model | Medium (depends on model quality) |
| Implicit signal | Session length, skill creation events, memory writes | Low (correlational) |

For production use, the most valuable trajectories come from benchmark runs where success is objectively measurable. Interactive session trajectories are valuable when users provide explicit feedback.
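
When several signals are available for the same turn, something has to decide which one becomes the reward. The sketch below shows one plausible resolution order that follows the quality ranking described here. The priority order, the mapping of 👍/👎 to 1.0/0.0, and the `resolve_reward` helper itself are all assumptions for illustration. Implicit signals are left out because the text does not say how they would be scored.

```python
from typing import Optional


def resolve_reward(
    task_completed: Optional[bool],
    user_feedback: Optional[str],
    model_score: Optional[float],
) -> tuple[Optional[float], str]:
    """Pick the highest-quality signal available and report where it came from."""
    if task_completed is not None:            # environment feedback: ground truth
        return (1.0 if task_completed else 0.0), "environment"
    if user_feedback in ("👍", "👎"):          # explicit human judgment
        return (1.0 if user_feedback == "👍" else 0.0), "user"
    if model_score is not None:               # reward model estimate
        return float(model_score), "reward_model"
    return None, "none"                       # no usable signal for this turn


print(resolve_reward(True, "👎", 0.2))
print(resolve_reward(None, "👎", 0.9))
print(resolve_reward(None, None, 0.4))
```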


Chapter Summary

| Concept | Key Takeaway |
| --- | --- |
| trajectory.py | Silent observer on agent loop; records every turn in Atropos format |
| Atropos format | JSONL; one line per turn; includes reasoning, tool calls, outcomes, rewards |
| Closed loop | Daily usage → trajectories → Atropos training → improved models → daily usage |
| hermes_swe_env | SWE-bench-style software engineering tasks; Docker-isolated; evaluated by tests |
| tblite | Terminal task benchmark; shell scripting, file ops, system admin |
| terminalbench_2 | Long-horizon terminal reasoning tasks |
| yc_bench | Business task benchmark; research, analysis, document generation |
| Tool-call parsers | Unified parser for 7+ model families; enables multi-model RL training |
| Reward signals | Environment feedback (best), user feedback, reward model, implicit signals |
| Upload workflow | Score → filter → export → upload to NousResearch training endpoint |