Chapter 4: User Mode: Deep Research System

April 13, 2026 · View on GitHub

What Problem Does This Solve?

General-purpose research tasks don't fit neatly into a single tool. A question like "What is the latest Python performance benchmark and how does it compare to the 2023 results?" requires:

Web search and browsing to find current benchmarks
Document reading to parse PDFs or DOCX files
Code execution to run statistical comparisons
File writing to save the final report

A single agent trying to do all of this becomes confused about which tool to use when. AutoAgent solves this with a triage + specialist architecture: SystemTriageAgent handles routing, and four specialist agents each do one thing well.

The Agent Graph

flowchart TD
    U[User Request] --> STA[SystemTriageAgent\nOrchestrator]

    STA -->|web browsing / search| WSA[WebSurferAgent\nPlaywright + Screenshots]
    STA -->|file reading / parsing| FSA[FileSurferAgent\nMarkdownBrowser]
    STA -->|code / computation| PA[ProgrammingAgent\nDockerEnv]

    WSA -->|transfer_back| STA
    FSA -->|transfer_back| STA
    PA -->|transfer_back| STA

    STA -->|task complete| CR[case_resolved]
    STA -->|task failed| CNR[case_not_resolved]

    CR --> End([Return Response])
    CNR --> End

Each specialist agent has access only to the tools it needs. This keeps tool schemas small and reduces LLM confusion about which tool to call.

SystemTriageAgent (`system_triage_agent.py`)

Role

SystemTriageAgent is the entry point for all User Mode interactions. It:

Analyzes the user's request
Decides which specialist(s) are needed
Transfers control via transfer_to_X() functions
Receives results back and synthesizes a final answer
Calls case_resolved when the task is complete

Transfer Functions

The transfer functions are injected into SystemTriageAgent's function list at initialization:

# autoagent/system_triage_agent.py

from autoagent.types import Agent, Result

def transfer_to_websurfer(context_variables: dict) -> Result:
    """Transfer to WebSurferAgent for web browsing and search tasks.
    
    Use when: the task requires browsing websites, searching the web,
    or extracting information from online sources.
    """
    return Result(
        value="Transferring to WebSurferAgent for web research",
        agent=websurfer_agent,
    )

def transfer_to_filesurfer(context_variables: dict) -> Result:
    """Transfer to FileSurferAgent for reading local files and documents.
    
    Use when: the task involves reading PDFs, DOCX, or other local files.
    """
    return Result(
        value="Transferring to FileSurferAgent for document reading",
        agent=filesurfer_agent,
    )

def transfer_to_programming(context_variables: dict) -> Result:
    """Transfer to ProgrammingAgent for code execution and data analysis.
    
    Use when: the task requires writing or running Python code.
    """
    return Result(
        value="Transferring to ProgrammingAgent",
        agent=programming_agent,
    )

def case_resolved(context_variables: dict, summary: str) -> Result:
    """Signal that the task has been successfully completed."""
    return Result(value=f"CASE_RESOLVED: {summary}")

def case_not_resolved(context_variables: dict, reason: str) -> Result:
    """Signal that the task could not be completed."""
    return Result(value=f"CASE_NOT_RESOLVED: {reason}")

system_triage_agent = Agent(
    name="SystemTriageAgent",
    model="gpt-4o",
    instructions="""You are a research coordinator. Analyze user requests and
    route them to the appropriate specialist. After the specialist completes
    their work, synthesize the results and call case_resolved with a summary.
    
    Always route to specialists rather than attempting the task directly.
    """,
    functions=[
        transfer_to_websurfer,
        transfer_to_filesurfer,
        transfer_to_programming,
        case_resolved,
        case_not_resolved,
    ],
)

Handoff Flow in Detail

When SystemTriageAgent calls transfer_to_websurfer(), the MetaChain run loop detects the Result.agent field and switches the active agent:

sequenceDiagram
    participant MC as MetaChain
    participant STA as SystemTriageAgent
    participant WSA as WebSurferAgent

    MC->>STA: LLM call: "Research Python benchmarks"
    STA-->>MC: tool_call: transfer_to_websurfer()
    MC->>MC: handle_tool_calls() → Result(agent=websurfer_agent)
    MC->>MC: active_agent = websurfer_agent
    MC->>WSA: LLM call with same conversation history
    WSA-->>MC: tool_calls: browse_web(), scroll_page(), etc.
    MC->>MC: Execute browser tools, append results
    WSA-->>MC: tool_call: transfer_back_to_triage()
    MC->>MC: active_agent = system_triage_agent
    MC->>STA: LLM call with accumulated results
    STA-->>MC: tool_call: case_resolved(summary=...)
    MC->>MC: Terminate loop

The key insight: the conversation history persists across handoffs. When control returns to SystemTriageAgent, it sees all the messages from WebSurferAgent's work and can synthesize them.

WebSurferAgent (`websurfer_agent.py`)

Capabilities

WebSurferAgent controls the BrowserEnv (Playwright) to navigate websites and extract information:

# autoagent/websurfer_agent.py (tool functions)

def browse_web(url: str, context_variables: dict) -> str:
    """Navigate to a URL and return page content with screenshot reference."""
    web_env: BrowserEnv = context_variables["web_env"]
    obs = web_env.navigate(url)
    return f"URL: {obs.url}\n\nContent:\n{obs.content[:4000]}"

def search_web(query: str, context_variables: dict) -> str:
    """Search the web using the browser."""
    web_env: BrowserEnv = context_variables["web_env"]
    search_url = f"https://www.google.com/search?q={quote(query)}"
    obs = web_env.navigate(search_url)
    return obs.content[:4000]

def scroll_down(context_variables: dict) -> str:
    """Scroll down on the current page."""
    web_env: BrowserEnv = context_variables["web_env"]
    web_env.page.keyboard.press("PageDown")
    obs = web_env._get_observation()
    return obs.content[:2000]

def click_element(selector: str, context_variables: dict) -> str:
    """Click an element on the current page."""
    web_env: BrowserEnv = context_variables["web_env"]
    obs = web_env.click(selector)
    return obs.content[:2000]

def transfer_back_to_triage(context_variables: dict, summary: str) -> Result:
    """Return to SystemTriageAgent with research results."""
    return Result(
        value=f"WebSurfer completed: {summary}",
        agent=system_triage_agent,
    )

Multimodal Screenshot Loop

For visual navigation tasks, WebSurferAgent uses GPT-4V-style message construction:

# autoagent/websurfer_agent.py

def get_visual_observation(context_variables: dict) -> list[dict]:
    """Return current page screenshot as a multimodal message part."""
    web_env: BrowserEnv = context_variables["web_env"]
    obs = web_env._get_observation()

    # Encode screenshot as base64 for vision models
    screenshot_b64 = base64.b64encode(obs.screenshot).decode()

    return [
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{screenshot_b64}",
                "detail": "high",
            }
        },
        {
            "type": "text",
            "text": f"Current URL: {obs.url}\n\nPage content summary:\n{obs.content[:1000]}"
        }
    ]

This allows WebSurferAgent to navigate pages that require visual understanding (CAPTCHA-free sites, pages with complex layouts, image-heavy content).

FileSurferAgent (`filesurfer_agent.py`)

Capabilities

FileSurferAgent uses RequestsMarkdownBrowser for document reading and file operations:

# autoagent/filesurfer_agent.py (tool functions)

def read_file(file_path: str, context_variables: dict) -> str:
    """Read a file from the workspace, converting to Markdown."""
    file_env: RequestsMarkdownBrowser = context_variables["file_env"]
    return file_env.visit_page(file_path)

def page_down_file(context_variables: dict) -> str:
    """Scroll to the next page of the current document."""
    file_env: RequestsMarkdownBrowser = context_variables["file_env"]
    return file_env.page_down()

def list_workspace_files(context_variables: dict) -> str:
    """List all files in the workspace directory."""
    workspace = context_variables.get("workspace", "./workspace")
    files = []
    for path in Path(workspace).rglob("*"):
        if path.is_file():
            files.append(str(path.relative_to(workspace)))
    return "\n".join(files)

def write_file(file_path: str, content: str, context_variables: dict) -> str:
    """Write content to a file in the workspace."""
    workspace = context_variables.get("workspace", "./workspace")
    full_path = Path(workspace) / file_path
    full_path.parent.mkdir(parents=True, exist_ok=True)
    full_path.write_text(content)
    return f"Written to {full_path}"

File Upload Workflow

Users can upload files for analysis via the workspace directory:

# Copy a file into the workspace before starting the session
cp my_research_paper.pdf workspace/

# Then in AutoAgent:
# AutoAgent> Summarize the PDF in workspace/my_research_paper.pdf

FileSurferAgent uses RequestsMarkdownBrowser._convert_pdf() to extract text and then processes it page by page within the LLM's context window.

ProgrammingAgent (`programming_agent.py`)

Capabilities

ProgrammingAgent writes and executes Python code in the Docker sandbox:

# autoagent/programming_agent.py (tool functions)

def execute_python(code: str, context_variables: dict) -> str:
    """Execute Python code in the Docker sandbox.
    
    The sandbox maintains state between calls — variables and imports
    persist within a session.
    """
    code_env: DockerEnv = context_variables["code_env"]
    stdout, stderr, result = code_env.execute_code(code)

    output = ""
    if stdout:
        output += f"STDOUT:\n{stdout}"
    if stderr:
        output += f"\nSTDERR:\n{stderr}"
    if result:
        output += f"\nRESULT: {result}"

    return output or "Code executed successfully (no output)"

def install_package(package: str, context_variables: dict) -> str:
    """Install a Python package in the Docker sandbox."""
    code_env: DockerEnv = context_variables["code_env"]
    install_code = f"import subprocess; subprocess.run(['pip', 'install', '{package}'], capture_output=True)"
    stdout, stderr, _ = code_env.execute_code(install_code)
    return f"Installed {package}"

def list_workspace_contents(context_variables: dict) -> str:
    """List files in the mounted workspace directory."""
    code_env: DockerEnv = context_variables["code_env"]
    stdout, _, _ = code_env.execute_code("import os; print(os.listdir('/workspace'))")
    return stdout

When code fails, ProgrammingAgent retries with error context in the conversation history:

[ProgrammingAgent] Writing code to parse CSV...
[Tool: execute_python] 
  STDERR: ImportError: No module named 'pandas'

[ProgrammingAgent] Need to install pandas first
[Tool: install_package] package=pandas
[Tool: execute_python] (retry with same code)
  STDOUT: Parsed 1000 rows successfully

This happens naturally through the conversation history — no special retry logic is needed in the agent code itself.

Direct Agent Routing with @mention

The @AgentName syntax in inner.py allows bypassing SystemTriageAgent:

# autoagent/inner.py (simplified)

def parse_user_input(message: str, registered_agents: dict) -> tuple[Agent, str]:
    """Check if the message starts with @AgentName and route directly."""
    if message.startswith("@"):
        parts = message.split(" ", 1)
        agent_name = parts[0][1:]  # Strip the @
        actual_message = parts[1] if len(parts) > 1 else ""

        if agent_name in registered_agents:
            return registered_agents[agent_name], actual_message

    # Default: route through SystemTriageAgent
    return system_triage_agent, message

Examples:

# Route directly to WebSurferAgent
AutoAgent> @WebSurferAgent find the latest PyPI release of litellm

# Route directly to ProgrammingAgent
AutoAgent> @ProgrammingAgent run this code: import sys; print(sys.version)

# Route directly to a custom registered agent
AutoAgent> @SalesAgent recommend a product for a \$50 budget in electronics

GAIA Benchmark Performance

The academic paper (arxiv:2502.05957) evaluates AutoAgent on the GAIA benchmark, which tests general AI assistants on real-world tasks requiring multi-step reasoning across web, file, and code capabilities:

GAIA Level	Task Type	AutoAgent Performance
Level 1	Simple factual lookups	~85%
Level 2	Multi-step reasoning with tools	~67%
Level 3	Complex multi-source synthesis	~40%

GAIA Level 1 tasks are single-step (e.g., "What is the capital of France?"). Level 3 tasks require chaining 5-10 tool calls across multiple sources with complex reasoning.

The benchmark is run via evaluation/gaia/run_infer.py — Chapter 8 covers the evaluation infrastructure in detail.

Summary

Component	File	Role
`SystemTriageAgent`	`system_triage_agent.py`	Orchestrator: routes to specialists, synthesizes results
`WebSurferAgent`	`websurfer_agent.py`	Web browsing via Playwright + multimodal screenshots
`FileSurferAgent`	`filesurfer_agent.py`	Document reading via MarkdownBrowser + file writing
`ProgrammingAgent`	`programming_agent.py`	Python code execution via DockerEnv
`transfer_to_X()`	All agent files	Agent handoff via `Result(agent=next_agent)`
`case_resolved`	`system_triage_agent.py`	Task completion signal
`case_not_resolved`	`system_triage_agent.py`	Task failure signal
`@mention` routing	`inner.py`	Bypass triage, route directly to named agent
GAIA benchmark	`evaluation/gaia/`	Multi-level task evaluation (Levels 1-3)

Continue to Chapter 5: Agent Editor: From NL to Deployed Agents to learn how the 4-phase pipeline generates, tests, and registers new agents from natural language descriptions.