Moatless Tools

June 25, 2025

Moatless Tools is a hobby project where I experiment with some ideas I have about how LLMs can be used to edit code in large existing codebases. I believe that rather than relying on an agent to reason its way to a solution, it is crucial to build good tools to insert the right context into the prompt and handle the response.

For the implementation used in the paper SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement, please see moatless-tree-search.

SWE-Bench

I use the SWE-bench benchmark as a way to verify my ideas.

  • Claude 4 Sonnet - 70.8% solve rate, $0.63 per instance.

Try it out

Run in Docker

  1. Clone the repository:

    git clone https://github.com/aorwall/moatless-tools.git
    cd moatless-tools
    
  2. Set up environment variables:

    cp .env.example .env
    

    Edit the .env file to set your API keys and other configuration options, including the required MOATLESS_DIR variable:

    MOATLESS_DIR=/path/to/your/moatless/data
    

    Note: MOATLESS_DIR specifies the directory where Moatless will store configuration files and trajectory data. This directory will be mounted as a volume in the Docker containers.

  3. Start the services:

    make run
    
  4. Access the UI at http://localhost

Install from PyPI

# Install base package only
pip install moatless

# Install with Kubernetes runner support
pip install "moatless[kubernetes]"

Install from source

Clone the repository and install using uv:

# Clone the repository
git clone https://github.com/aorwall/moatless-tools.git
cd moatless-tools

# Install using uv
uv sync

Code Examples

Basic agent flow

import asyncio

from moatless.actions import Respond
from moatless.agent import ActionAgent
from moatless.completion.tool_call import ToolCallCompletionModel

completion_model = ToolCallCompletionModel(
    model="gpt-4.1-mini",
    temperature=0.0,
    model_api_key=""
)

agent = ActionAgent(
    completion_model=completion_model,
    system_prompt="You are a helpful assistant that can answer questions.",
    actions=[
        Respond()
    ]
)

# run_simple is a coroutine; in a script it needs an event loop
# (in a notebook you can simply `await agent.run_simple("Hello")`)
async def main():
    observation = await agent.run_simple("Hello")
    print(observation.message)

asyncio.run(main())

Code inspector agent

[See notebook](notebooks/code_inspector_agent.ipynb)

Run SWE-Bench evaluations

Before running the evaluation, you'll need:

  1. At least one LLM provider API key (e.g., OpenAI, Anthropic, etc.)
  2. A Voyage AI API key from voyageai.com to use the pre-embedded vector stores for SWE-Bench instances.
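Both keys can go straight into your `.env` file. The exact variable names are listed in `.env.example`, so treat the entries below as an illustrative sketch only:

```shell
# Illustrative .env entries; check .env.example for the exact variable names
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
VOYAGE_API_KEY=pa-...
MOATLESS_DIR=/path/to/your/moatless/data
```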

Verify Setup

Before running the full evaluation, you can verify your setup by running a single SWE-Bench instance:

uv run python scripts/docker_run.py --flow swebench_tools --model-id gpt-4o-mini-2024-07-18 --instance-id django__django-11099 --evaluation-name testing_setup

The script will run the model against a sample SWE-Bench instance.

Results are saved in .moatless/projects/testing_setup.
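To inspect the output afterwards, a small helper like the one below can list everything the run wrote. The directory path comes from the docs above; the helper itself (`list_results`) is just an illustrative sketch, not part of Moatless:

```python
from pathlib import Path

def list_results(results_dir: str) -> list[str]:
    """Return the relative paths of all files under an evaluation's results directory."""
    root = Path(results_dir)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())

# e.g. list_results(".moatless/projects/testing_setup")
```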

Run evaluation

Use the evaluation script to run evaluations in Docker containers:

python3 scripts/run_evaluation.py --model gpt-4o-mini-2024-07-18 --dataset-split [dataset_split] --evaluation-name [evaluation_name]

Required arguments:

  • --dataset-split: Dataset split to use
  • --evaluation-name: Name of the evaluation

Optional arguments:

  • --model: Model to use for evaluation (default: gpt-4o-mini-2024-07-18, equivalent to --model-id)
  • --model-id: Model configuration ID to use (replaces entire model configuration)
  • --litellm-model-name: LiteLLM model name to override (keeps other model settings)
  • --flow: Flow to use for evaluation (defaults to "simple_coding")
  • --num-parallel-jobs: Number of parallel jobs (default: 1)
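Putting the arguments together, a minimal invocation only needs the two required flags and falls back to the default model and flow (the evaluation name below is just an example):

```shell
python3 scripts/run_evaluation.py \
  --dataset-split verified_mini \
  --evaluation-name my_first_eval
```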

Flows

Available flows that can be specified with the --flow argument:

| Flow ID | Format | Best Suited For | Default Model |
| --- | --- | --- | --- |
| swebench_tools | Function calling | Models with native function calling support | gpt-4o-mini-2024-07-18 |
| swebench_tools_and_reasoning | Function calling | Reasoning models with native function calling support | claude-sonnet-4-20250514 |
| swebench_react | ReACT | Open source models without native function calling support | openrouter/mistralai/devstral-small |
| swebench_react_reasoning | ReACT | Reasoning models without function calling support | openrouter/deepseek/deepseek-r1-0528 |

Model Configuration

Both evaluation scripts support flexible model configuration through the following options:

Model Options

  • --model-id: Specify a complete model configuration ID. This replaces the entire completion model configuration including temperature, max tokens, and all other settings.
  • --litellm-model-name: Override only the LiteLLM model name while keeping all other completion model settings (temperature, max tokens, etc.) from the flow or model configuration.

Verified Models

Default model configurations are provided for verified models, i.e. models that have been tested and confirmed to work with the Verified Mini subset of the SWE-Bench dataset. Other models may work but have not been extensively tested.

When specifying just the --model-id argument, the following configurations are used:

| Model | Response Format | Message History | Thoughts in Action | Verified Mini |
| --- | --- | --- | --- | --- |
| claude-3-5-sonnet-20241022 | tool_call | messages | no | 46% |
| claude-3-5-haiku-20241022 | tool_call | messages | no | 28% |
| gpt-4o-2024-11-20 | tool_call | messages | yes | 32% |
| gpt-4o-mini-2024-07-18 | tool_call | messages | yes | 16% |
| o1-mini-2024-09-12 | react | react | no (disabled thoughts) | 28% |
| deepseek/deepseek-chat | react | react | no | 36% |
| deepseek/deepseek-reasoner | react | react | no (disabled thoughts) | 50% |
| gemini/gemini-2.0-flash-exp | react | react | no | 38% |
| openrouter/meta-llama/llama-3.1-70b-instruct | react | react | no | - |
| openrouter/meta-llama/llama-3.1-405b-instruct | react | react | no | 28% |
| openrouter/qwen/qwen-2.5-coder-32b-instruct | react | react | no | 32% |

Dataset splits

Available dataset splits that can be specified with the --dataset-split argument:

| Split Name | Description | Instance Count |
| --- | --- | --- |
| lite | All instances from the lite dataset | 300 |
| verified | All instances from the verified dataset | 500 |
| verified_mini | MariusHobbhahn/swe-bench-verified-mini, a subset of SWE-Bench Verified | 50 |
| lite_and_verified_solvable | Instances that exist in both the lite and verified datasets and have at least one solved submission to SWE-Bench | 84 |

Example usage

# Run evaluation with Claude 3.5 Sonnet using the complete model configuration
python3 scripts/run_evaluation.py \
  --model-id claude-3-5-sonnet-20241022 \
  --flow swebench_tools_and_reasoning \
  --dataset-split verified_mini \
  --evaluation-name claude_sonnet_verified_mini \
  --num-parallel-jobs 5

# Run evaluation overriding just the model name while keeping the flow's model settings
python3 scripts/run_evaluation.py \
  --litellm-model-name openrouter/qwen/qwen-2.5-coder-32b-instruct \
  --flow swebench_react \
  --dataset-split verified_mini \
  --evaluation-name qwen_coder_verified_mini \
  --num-parallel-jobs 5