Code Mode Benchmark

October 9, 2025

"LLMs are better at writing code to call tools than at calling tools directly." — Cloudflare Code Mode Research

A benchmark comparing Code Mode (code generation) with traditional function calling for LLM tool use. In these runs, Code Mode executed about 60% faster, used 68% fewer tokens, and needed 88% fewer API round trips, with equal validation accuracy.

🎯 Key Results

Metric	Regular Agent	Code Mode	Improvement
Average Latency	11.88s	4.71s	60.4% faster ⚡
API Round Trips	8.0 iterations	1.0 iteration	87.5% reduction 🔄
Token Usage	144,250 tokens	45,741 tokens	68.3% savings 💰
Success Rate	6/8 (75%)	7/8 (88%)	+12.5 percentage points ✅
Validation Accuracy	100%	100%	Equal accuracy
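
The improvement column can be recomputed from the raw measurements in the table. The short check below does that; the function and variable names are illustrative and are not part of the repository.

```python
def reduction_pct(baseline: float, new: float) -> float:
    """Percentage reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline * 100


# Values copied from the Key Results table.
latency = reduction_pct(11.88, 4.71)
round_trips = reduction_pct(8.0, 1.0)
tokens = reduction_pct(144_250, 45_741)

# Success rates are compared in percentage points, not as a relative change.
success_gap_points = (7 / 8 - 6 / 8) * 100

print(f"latency: {latency:.1f}% faster")                # 60.4
print(f"round trips: {round_trips:.1f}% fewer")         # 87.5
print(f"tokens: {tokens:.1f}% fewer")                   # 68.3
print(f"success rate: +{success_gap_points:.1f} points")  # 12.5
```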

Annual Cost Savings: $9,536/year at 1,000 scenarios/day (Claude Haiku pricing)

📊 View Full Results | 📈 Raw Data Tables

🚀 Quick Start

Prerequisites

Python 3.11+
Anthropic API key (for Claude)
Google API key (for Gemini, optional)

Installation

# Clone the repository
git clone <repository-url>
cd codemode_benchmark

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your API keys

Run the Benchmark

# Run full benchmark with Claude
make run

# Run with Gemini
python benchmark.py --model gemini

# Run specific scenario
python benchmark.py --scenario 1

# Run limited scenarios
python benchmark.py --limit 3

📁 Repository Structure

codemode_benchmark/
├── README.md                 # This file
├── benchmark.py             # Main benchmark runner
├── requirements.txt         # Python dependencies
├── Makefile                 # Convenient commands
│
├── agents/                  # Agent implementations
│   ├── __init__.py
│   ├── codemode_agent.py           # Code Mode (code generation)
│   ├── regular_agent.py            # Traditional function calling
│   ├── gemini_codemode_agent.py    # Gemini Code Mode
│   └── gemini_regular_agent.py     # Gemini function calling
│
├── tools/                   # Tool definitions
│   ├── __init__.py
│   ├── business_tools.py           # Accounting/invoicing tools
│   ├── accounting_tools.py         # Core accounting logic
│   └── example_tools.py            # Simple example tools
│
├── sandbox/                 # Secure code execution
│   ├── __init__.py
│   └── executor.py                 # RestrictedPython sandbox
│
├── tests/                   # Test files
│   ├── test_api.py
│   ├── test_scenarios.py           # Scenario definitions
│   └── ...
│
├── debug/                   # Debug scripts (development)
│   └── debug_*.py
│
├── docs/                    # Documentation
│   ├── BENCHMARK_SUMMARY.md        # Comprehensive analysis
│   ├── RESULTS_DATA.md             # Raw data tables
│   ├── QUICKSTART.md               # Quick start guide
│   ├── TOOLS.md                    # Tool API documentation
│   ├── CHANGELOG.md                # Version history
│   └── GEMINI.md                   # Gemini-specific notes
│
└── results/                 # Benchmark results
    ├── benchmark_results_claude.json
    ├── benchmark_results_gemini.json
    ├── results.log
    └── results-gemini.log

🔬 What is Code Mode?

Traditional Function Calling (Regular Agent)

User Query → LLM → Tool Call #1 → Execute → Result
          ↓
       LLM processes result → Tool Call #2 → Execute → Result
          ↓
       [Repeat 5-16 times...]
          ↓
       Final Response

Problems:

Multiple API round trips
A full model inference pass between consecutive tool calls
Context grows with each iteration
High latency and token costs
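
To make the "context grows with each iteration" point concrete, here is a rough model of cumulative input tokens when every round trip re-sends the whole conversation. The token sizes are made-up illustration values, not measurements from this benchmark.

```python
def cumulative_input_tokens(base_prompt: int, per_call: int, iterations: int) -> int:
    """Total input tokens when each iteration re-sends all prior context.

    base_prompt: tokens in the system prompt, tool schemas, and user query
    per_call:    tokens added to the context by each tool call and its result
    """
    total = 0
    context = base_prompt
    for _ in range(iterations):
        total += context     # the model re-reads everything so far
        context += per_call  # the tool call and its result are appended
    return total


# Hypothetical sizes: a 2,000-token prompt and 300 tokens per call/result pair.
eight_round_trips = cumulative_input_tokens(2_000, 300, 8)
single_pass = cumulative_input_tokens(2_000, 300, 1)

print(eight_round_trips)  # 24400
print(single_pass)        # 2000
```

Under this model the input cost grows quadratically with the number of round trips, which is why long tool-calling loops get expensive quickly.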

Code Mode

User Query → LLM generates complete code → Executes all tools → Final Response

Advantages:

Single code generation pass
Batch multiple operations
No context re-processing
Natural programming constructs (loops, variables, conditionals)

Example:

Regular Agent sees this as 3 separate tool calls:

{"name": "create_transaction", "input": {"amount": 2500, ...}}
{"name": "create_transaction", "input": {"amount": 150, ...}}
{"name": "get_financial_summary", "input": {}}

Code Mode generates efficient code:

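import json

# Batch the expenses in a loop instead of spending one round trip per call
expenses = [
    ("rent", 2500, "Monthly rent"),
    ("utilities", 150, "Electricity")
]
for category, amount, desc in expenses:
    tools.create_transaction("expense", category, amount, desc)

# Tools return JSON strings, so parse before reading fields
summary = json.loads(tools.get_financial_summary())
result = f"Total: ${summary['summary']['total_expenses']}"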

🎯 Test Scenarios

The benchmark includes 8 realistic business scenarios:

Monthly Expense Recording - Record 4 expenses and generate summary
Client Invoicing Workflow - Create 2 invoices, update status, summarize
Payment Processing - Create invoice, process partial payments
Mixed Income/Expense Tracking - 7 transactions with financial analysis
Multi-Account Management - Complex transfers between 3 accounts
Quarter-End Analysis - Simulate 3 months of business activity
Complex Multi-Client Invoicing - 3 invoices with partial payments (16 operations)
Budget Tracking - 14 categorized expenses with analysis

Each scenario includes automated validation to ensure correctness.
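
As a rough idea of what automated validation can look like, the sketch below pairs a scenario with a check on the final ledger state. The `Scenario` class, the validator, and the ledger contents are all hypothetical; the real definitions live in tests/test_scenarios.py and may be structured differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Ledger = List[Dict[str, Any]]


@dataclass
class Scenario:
    """Hypothetical scenario record: a prompt plus a check on the final state."""
    name: str
    prompt: str
    validate: Callable[[Ledger], bool]


def four_expenses_totalling(expected_total: int) -> Callable[[Ledger], bool]:
    """Build a validator that passes only for exactly four expenses with the given total."""
    def check(transactions: Ledger) -> bool:
        expenses = [t for t in transactions if t["type"] == "expense"]
        return len(expenses) == 4 and sum(t["amount"] for t in expenses) == expected_total
    return check


scenario = Scenario(
    name="Monthly Expense Recording",
    prompt="Record 4 expenses and generate a summary",
    validate=four_expenses_totalling(2900),
)

# Made-up ledger state as an agent run might leave it.
ledger = [
    {"type": "expense", "category": "rent", "amount": 2500},
    {"type": "expense", "category": "utilities", "amount": 150},
    {"type": "expense", "category": "internet", "amount": 200},
    {"type": "expense", "category": "supplies", "amount": 50},
]
passed = scenario.validate(ledger)
print(passed)  # True
```

Checking the resulting state rather than the model's text output lets the same validator score both agents.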

🛠️ Implementation Details

Code Mode Architecture

from typing import Any, Dict

class CodeModeAgent:
    def run(self, user_message: str) -> Dict[str, Any]:
        # 1. Send message with tools API documentation
        response = self.client.messages.create(
            model=self.model,  # model ID; the attribute name is assumed here
            max_tokens=4096,   # required by the Messages API; value is illustrative
            system=self._create_system_prompt(),  # Contains tools API
            messages=[{"role": "user", "content": user_message}]
        )

        # 2. Extract generated code
        code = extract_code_from_response(response)

        # 3. Execute in sandbox
        result = self.executor.execute(code)

        return result
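
The code-extraction step is not shown in this README. One plausible way to do it, sketched here on plain text rather than on an API response object, is to pull out the first fenced code block and fall back to the whole reply. The function name `extract_code_from_text` and the regular expression are assumptions for illustration, not the repository's implementation.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this example readable
CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code_from_text(text: str) -> str:
    """Return the first fenced code block in `text`, or the stripped text if none is fenced."""
    match = CODE_BLOCK.search(text)
    return (match.group(1) if match else text).strip()


reply = f"Here is the code:\n{FENCE}python\nresult = 1 + 1\n{FENCE}\nDone."
print(extract_code_from_text(reply))  # result = 1 + 1
```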

Tools API with TypedDict

from typing import TypedDict, Literal

# TransactionDict is assumed to be another TypedDict describing one transaction (not shown here)
class TransactionResponse(TypedDict):
    status: Literal["success"]
    transaction: TransactionDict
    new_balance: float

def create_transaction(
    transaction_type: Literal["income", "expense", "transfer"],
    category: str,
    amount: float,
    description: str,
    account: str = "checking"
) -> str:
    """
    Create a new transaction.

    Returns: JSON string with TransactionResponse structure

    Example:
        result = tools.create_transaction("expense", "rent", 2500.0, "Monthly rent")
        data = json.loads(result)
        print(data["new_balance"])  # 7500.0
    """
    # Implementation...
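
To try the docstring example without the benchmark's own tools module, a small in-memory stand-in is enough. The class below is hypothetical: the name `InMemoryTools`, the transaction fields, and the 10,000 opening balance (chosen so the docstring example yields 7500.0) are assumptions, and transfers are left out.

```python
import json
from typing import Any, Dict, List, Literal


class InMemoryTools:
    """Hypothetical stand-in for the benchmark's tools object."""

    def __init__(self, opening_balance: float = 10_000.0) -> None:
        self.balances: Dict[str, float] = {"checking": opening_balance}
        self.transactions: List[Dict[str, Any]] = []

    def create_transaction(
        self,
        transaction_type: Literal["income", "expense"],
        category: str,
        amount: float,
        description: str,
        account: str = "checking",
    ) -> str:
        # Income raises the balance and expenses lower it.
        sign = 1 if transaction_type == "income" else -1
        self.balances[account] = self.balances.get(account, 0.0) + sign * amount
        transaction = {
            "id": len(self.transactions) + 1,
            "type": transaction_type,
            "category": category,
            "amount": amount,
            "description": description,
            "account": account,
        }
        self.transactions.append(transaction)
        # Mirror the documented contract: a JSON string, not a dict.
        return json.dumps({
            "status": "success",
            "transaction": transaction,
            "new_balance": self.balances[account],
        })


tools = InMemoryTools()
data = json.loads(tools.create_transaction("expense", "rent", 2500.0, "Monthly rent"))
print(data["new_balance"])  # 7500.0
```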

Security with RestrictedPython

Code execution uses RestrictedPython for sandboxing:

No filesystem access
No network access
No dangerous imports
Controlled builtins
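
The "controlled builtins" idea can be illustrated with the standard library alone. Note that the sketch below uses plain `exec` with an allow-list, not RestrictedPython, and it is not a real security boundary: RestrictedPython goes further by compiling code through its own restricting compiler and guard functions. The function name and the allow-list are illustrative choices.

```python
from typing import Any, Dict

# Allow-list of builtins exposed to generated code (an illustrative choice).
ALLOWED_BUILTINS = {
    "len": len, "range": range, "sum": sum, "min": min, "max": max,
    "str": str, "int": int, "float": float, "round": round,
}


def run_with_controlled_builtins(code: str, tools: Any) -> Dict[str, Any]:
    """Execute `code` with only allow-listed builtins and a `tools` object in scope."""
    namespace: Dict[str, Any] = {"__builtins__": dict(ALLOWED_BUILTINS), "tools": tools}
    try:
        exec(code, namespace)
    except Exception as exc:  # report failures instead of crashing the caller
        return {"success": False, "error": f"{type(exc).__name__}: {exc}"}
    # By convention in this sketch, generated code leaves its answer in `result`.
    return {"success": True, "result": namespace.get("result")}


ok = run_with_controlled_builtins("result = sum(range(4))", tools=None)
blocked_import = run_with_controlled_builtins("import os", tools=None)
blocked_open = run_with_controlled_builtins("open('/etc/passwd')", tools=None)

print(ok)                         # {'success': True, 'result': 6}
print(blocked_import["success"])  # False
print(blocked_open["success"])    # False
```

Because `__import__` and `open` are absent from the namespace, the import and the file access both fail before doing anything.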

📊 Performance Breakdown

By Scenario Complexity

Complexity	Scenarios	Avg Speedup	Avg Token Savings
High (10+ ops)	2	79.2%	36,389 tokens
Medium (5-9 ops)	3	47.5%	8,774 tokens
Low (3-4 ops)	1	45.3%	6,209 tokens

Key Insight: Code Mode's advantage grows with scenario complexity, but even the simplest scenarios ran about 45% faster.

Cost Analysis at Scale

Daily Volume	Regular Annual	Code Mode Annual	Annual Savings
100	$252	$77	$175
1,000	$2,519	$766	$1,753
10,000	$25,185	$7,665	$17,520
100,000	$251,850	$76,650	$175,200

(Based on Claude Haiku pricing: $0.25/1M input, $1.25/1M output)
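
The projection behind a table like this is per-run cost times daily volume times 365. The sketch below uses the Haiku prices quoted above; the per-run token split is a made-up example, since this README does not break tokens down into input and output.

```python
INPUT_PRICE_PER_MTOK = 0.25   # USD per 1M input tokens (from the pricing note)
OUTPUT_PRICE_PER_MTOK = 1.25  # USD per 1M output tokens (from the pricing note)


def cost_per_run(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one scenario run."""
    return (
        input_tokens * INPUT_PRICE_PER_MTOK + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


def annual_cost(input_tokens: int, output_tokens: int, runs_per_day: int, days: int = 365) -> float:
    """Projected yearly cost in USD at a constant daily volume."""
    return cost_per_run(input_tokens, output_tokens) * runs_per_day * days


# Hypothetical run: 20,000 input tokens and 1,000 output tokens.
example = annual_cost(20_000, 1_000, runs_per_day=1_000)
print(f"${example:,.2f}")  # $2,281.25
```

The cost is linear in daily volume, which is why each row of the table is ten times the previous one.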

🤖 Supported Models

Claude (Anthropic)

Model: Claude 3 Haiku
Performance: 60.4% faster, 68.3% fewer tokens
Best For: Cost-sensitive production workloads
Status: ✅ Fully tested (8/8 scenarios)

Gemini (Google)

Model: Gemini 2.0 Flash Experimental
Performance: 15.1% faster, 70.6% fewer iterations
Best For: Low-latency requirements
Status: ✅ Partially tested (2/8 scenarios)
Note: Faster baseline but more verbose code generation

🧪 Running Tests

# Run all tests
make test

# Run specific test file
python -m pytest tests/test_scenarios.py

# Test Code Mode agent directly
python agents/codemode_agent.py

# Test Regular Agent directly
python agents/regular_agent.py

# Test sandbox execution
python sandbox/executor.py

📚 Documentation

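Benchmark Summary (docs/BENCHMARK_SUMMARY.md) - Comprehensive analysis with insights
Results Data (docs/RESULTS_DATA.md) - Raw performance tables
Quick Start Guide (docs/QUICKSTART.md) - Step-by-step setup
Tools Documentation (docs/TOOLS.md) - Available tools and API
Changelog (docs/CHANGELOG.md) - Version history
Gemini Notes (docs/GEMINI.md) - Gemini-specific information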

💡 Key Learnings

Why Code Mode Wins

Batching Advantage
- Single code block replaces multiple API calls
- No neural network processing between operations
- Example: 16 iterations → 1 iteration (Scenario 7)
Cognitive Efficiency
- LLMs have extensive training on code generation
- Natural programming constructs (loops, variables, conditionals)
- TypedDict provides clear type contracts
Computational Efficiency
- No context re-processing between tool calls
- Direct code execution in sandbox
- Reduced token overhead

When to Use Code Mode

✅ Multi-step workflows - Greatest benefit with many operations
✅ Complex business logic - Invoicing, accounting, data processing
✅ Batch operations - Similar actions on multiple items
✅ Cost-sensitive workloads - Production at scale
✅ Latency-critical applications - User-facing systems

Best Practices

Use TypedDict for response types - Provides clear structure to LLM
Include examples in docstrings - Shows correct usage patterns
Batch similar operations - Leverage loops in code
Validate results - Automated checks ensure correctness
Handle errors gracefully - Try-except in generated code
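
The last practice, try-except in generated code, might look like the following. `FlakyTools` is a hypothetical tools object that raises on bad input; whether the benchmark's real tools raise exceptions or return error JSON is not stated here, so treat the failure mode as an assumption.

```python
import json


class FlakyTools:
    """Hypothetical tools object whose calls fail for non-positive amounts."""

    def create_transaction(self, transaction_type, category, amount, description):
        if amount <= 0:
            raise ValueError(f"amount must be positive, got {amount}")
        return json.dumps({"status": "success", "category": category, "amount": amount})


tools = FlakyTools()
expenses = [("rent", 2500, "Monthly rent"), ("utilities", -150, "Bad entry")]

recorded, errors = [], []
for category, amount, desc in expenses:
    try:
        response = tools.create_transaction("expense", category, amount, desc)
        recorded.append(json.loads(response))
    except ValueError as exc:  # also covers json.JSONDecodeError, a ValueError subclass
        errors.append(f"{category}: {exc}")

# One failed call does not abort the batch; the failure is reported instead.
result = f"Recorded {len(recorded)} of {len(expenses)} expenses; errors: {errors}"
print(result)
```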

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Run tests (make test)
Commit (git commit -m 'Add amazing feature')
Push (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

MIT License - See LICENSE file for details

🙏 Acknowledgments

Inspired by Cloudflare's Code Mode research
Built on Anthropic's Building Effective Agents framework
Uses RestrictedPython for secure code execution

📞 Contact

For questions or feedback, please open an issue on GitHub.

Benchmark Date: January 2025
Models Tested: Claude 3 Haiku, Gemini 2.0 Flash Experimental
Test Scenarios: 8 realistic business workflows
Result: Code Mode is 60% faster, uses 68% fewer tokens, with equal accuracy