Using Gemini with the Benchmark

October 9, 2025

This guide explains how to use Google's Gemini model with the Code Mode benchmark.

Setup

1. Get a Google API Key

Go to Google AI Studio
Sign in with your Google account
Click "Create API Key"
Copy your API key

2. Configure the API Key

Add your Gemini API key to .env:

GOOGLE_API_KEY=your_google_api_key_here

You can keep both the Anthropic (Claude) and Gemini keys in the same .env file:

ANTHROPIC_API_KEY=sk-ant-xxxxx
GOOGLE_API_KEY=your_google_api_key_here

Running with Gemini

Command Line

# Run full benchmark with Gemini
python benchmark.py --model gemini

# Run quick test (2 scenarios) with Gemini
python benchmark.py --model gemini --limit 2

# Run specific scenario with Gemini
python benchmark.py --model gemini --scenario 3

Makefile

# Run full benchmark with Gemini
make run-gemini

# Run quick test with Gemini
make run-gemini-quick

Comparing Models

You can run the benchmark with both models to compare:

# Run with Claude (default)
python benchmark.py --limit 2
# Results saved to: benchmark_results_claude.json

# Run with Gemini
python benchmark.py --model gemini --limit 2
# Results saved to: benchmark_results_gemini.json

# Compare the JSON files
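One way to do the comparison is a short script that walks both result files and lines up every numeric field they share. The result schema is not documented in this guide, so this is a minimal sketch that assumes only that both files are valid JSON. The helper names `numeric_leaves` and `compare_results` are hypothetical and not part of the benchmark.

```python
import json
from pathlib import Path


def numeric_leaves(node, prefix=""):
    """Yield (path, value) for every int/float found in nested JSON data."""
    if isinstance(node, bool):
        return  # bool is a subclass of int; skip true/false flags
    if isinstance(node, (int, float)):
        yield prefix, node
    elif isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from numeric_leaves(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from numeric_leaves(value, f"{prefix}[{index}]")


def compare_results(claude_data, gemini_data):
    """Return {path: (claude_value, gemini_value)} for paths found in both."""
    claude = dict(numeric_leaves(claude_data))
    gemini = dict(numeric_leaves(gemini_data))
    shared = sorted(claude.keys() & gemini.keys())
    return {path: (claude[path], gemini[path]) for path in shared}


if __name__ == "__main__":
    # File names as reported by the benchmark runs above.
    claude_path = Path("benchmark_results_claude.json")
    gemini_path = Path("benchmark_results_gemini.json")
    if claude_path.exists() and gemini_path.exists():
        rows = compare_results(
            json.loads(claude_path.read_text()),
            json.loads(gemini_path.read_text()),
        )
        for path, (claude_value, gemini_value) in rows.items():
            print(f"{path:<50} claude={claude_value:<12} gemini={gemini_value}")
    else:
        print("Run the benchmark with both models first.")
```

Because the script only matches paths that exist in both files, fields that one model's run does not report are silently skipped.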

Implementation Details

Gemini Regular Agent

Uses Gemini's native function calling:

Converts Anthropic tool schemas to Gemini format
Uses genai.GenerativeModel with tools parameter
Handles function calls via function_call parts
Returns function responses using FunctionResponse proto

Location: agents/gemini_regular_agent.py
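The schema conversion step can be pictured roughly as follows. This is an illustrative sketch, not the code in agents/gemini_regular_agent.py: the function names and the `ALLOWED_KEYS` whitelist are assumptions. It relies only on the documented shapes of the two formats, where an Anthropic tool carries `name`, `description`, and `input_schema`, and a Gemini function declaration carries `name`, `description`, and `parameters` with upper-case type names.

```python
# Assumed whitelist of schema fields to forward; everything else is dropped.
ALLOWED_KEYS = {"type", "description", "enum", "properties", "required", "items"}


def to_gemini_schema(schema):
    """Recursively copy a JSON Schema dict, keeping whitelisted keys only."""
    result = {}
    for key, value in schema.items():
        if key not in ALLOWED_KEYS:
            continue  # e.g. "additionalProperties", "$schema", "default"
        if key == "type":
            # Gemini type names are upper case (STRING, OBJECT, ...).
            # Union types such as ["string", "null"] are not handled here.
            result["type"] = str(value).upper()
        elif key == "properties":
            result["properties"] = {
                name: to_gemini_schema(sub) for name, sub in value.items()
            }
        elif key == "items":
            result["items"] = to_gemini_schema(value)
        else:
            result[key] = value
    return result


def anthropic_tool_to_gemini(tool):
    """Map one Anthropic tool definition to a Gemini function declaration dict."""
    input_schema = tool.get("input_schema", {"type": "object", "properties": {}})
    return {
        "name": tool["name"],
        "description": tool.get("description", ""),
        "parameters": to_gemini_schema(input_schema),
    }
```

Dropping unknown keys rather than passing them through is a deliberately conservative choice for a sketch like this; the real converter may treat them differently.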

Gemini Code Mode Agent

Uses Gemini to generate Python code:

Same system prompt as Claude version
Generates code in ```python blocks
Executes in the same sandbox (RestrictedPython)
Uses the same tools API

Location: agents/gemini_codemode_agent.py
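Pulling the generated code out of a model reply before it reaches the sandbox might look like the sketch below. The regular expression and the `extract_python_blocks` name are assumptions for illustration; the actual agent may parse replies differently.

```python
import re

# Build the fence marker programmatically so this example can itself be fenced.
FENCE = "`" * 3
CODE_BLOCK_RE = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_blocks(response_text):
    """Return the body of every python-tagged fenced block, in order."""
    return [body.strip() for body in CODE_BLOCK_RE.findall(response_text)]
```

A reply with no python-tagged block yields an empty list, which the caller can treat as "the model answered in prose" rather than as an error.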

Testing Individual Agents

# Activate the virtual environment
cd codemode_benchmark
source venv/bin/activate

# Test Gemini Regular Agent
python agents/gemini_regular_agent.py

# Test Gemini Code Mode Agent
python agents/gemini_codemode_agent.py

# Test agent factory with both models
python agents/agent_factory.py

Differences from Claude

API Differences

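Schema format: Gemini uses a different parameter schema format from Anthropic's tool schemas
Function calling: Gemini returns function calls as message parts and expects proto-based function responses
Token counting: The two APIs count tokens with different tokenizers, so token totals are not strictly comparable
Context window: The models have different context length limits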

Expected Behavior

Both models should:

Complete all scenarios successfully
Pass validation checks
Generate the correct final state

Performance may differ in:

Token usage
Execution time
Number of iterations
Code quality (for Code Mode)

Troubleshooting

"GOOGLE_API_KEY not configured"

Check that the .env file exists
Verify that the key is correct
Make sure there are no extra spaces or quotes around the value
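The checklist above can be automated with a small script. This is a sketch using a hand-rolled parser; `find_env_problems` is a hypothetical helper, and whether spaces or quotes actually break loading depends on how the benchmark reads .env, which this guide does not show.

```python
from pathlib import Path


def find_env_problems(env_text, key="GOOGLE_API_KEY"):
    """Return a list of human-readable problems for `key` in .env-style text."""
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not assignments
        name, value = line.split("=", 1)
        if name.strip() != key:
            continue
        problems = []
        if name != name.strip() or value != value.strip():
            problems.append("spaces around '='")
        value = value.strip()
        if value[:1] in {'"', "'"} or value[-1:] in {'"', "'"}:
            problems.append("value is wrapped in quotes")
        if not value.strip("\"'"):
            problems.append("value is empty")
        return problems
    return [f"{key} is not set"]


if __name__ == "__main__":
    env_file = Path(".env")
    if not env_file.exists():
        print(".env file not found")
    else:
        issues = find_env_problems(env_file.read_text())
        print("\n".join(issues) if issues else "GOOGLE_API_KEY looks fine")
```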

"Gemini API quota exceeded"

The Gemini free tier is rate limited
Wait a minute between runs
Or upgrade to a paid tier
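Instead of waiting manually, a caller could wrap the API call in a retry with exponential backoff. The sketch below is generic: this guide does not say which exception the Gemini SDK raises on a quota error, so the caller supplies an `is_quota_error` predicate. `call_with_backoff` is a hypothetical helper, not part of the benchmark.

```python
import time


def call_with_backoff(call, is_quota_error, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep):
    """Run call(); on a quota error wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not a quota error, or the final failure.
            if not is_quota_error(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.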

"Model not found" error

The code uses gemini-1.5-pro-latest
Ensure your API key has access to this model
You can change the model name in the agent files if needed

Schema conversion issues

If a tool doesn't work, check the schema conversion in _convert_schemas_to_gemini
Gemini may have stricter requirements for some parameter types

Model Configuration

To change the Gemini model version, edit:

For Regular Agent (agents/gemini_regular_agent.py):

self.model_name = "gemini-1.5-pro-latest"  # Change here

For Code Mode Agent (agents/gemini_codemode_agent.py):

self.model_name = "gemini-1.5-pro-latest"  # Change here

Available models (availability changes over time, so check Google's current model list):

gemini-1.5-pro-latest
gemini-1.5-flash-latest
gemini-1.0-pro-latest

Notes

Gemini's function calling API is different from Claude's
Schema conversion happens automatically
Code Mode uses the same sandbox for both models
Results are saved to separate files by model name

Adding More Models

To add support for other models (e.g., GPT-4), follow this pattern:

Create agents/yourmodel_regular_agent.py
Create agents/yourmodel_codemode_agent.py

Add to agents/agent_factory.py:

"yourmodel": {
    "name": "Your Model Name",
    "api_key_env": "YOUR_API_KEY",
    "regular": YourModelRegularAgent,
    "codemode": YourModelCodeModeAgent,
}

Update benchmark.py to handle the new API key
Update documentation
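To make the pattern concrete, the registry entry shown above could be consumed by a factory function along these lines. This is a sketch under stated assumptions: `MODEL_REGISTRY`, `create_agent`, and the single-argument agent constructors are hypothetical, and the real agents/agent_factory.py may be organized differently.

```python
import os


class YourModelRegularAgent:
    """Placeholder for the class in agents/yourmodel_regular_agent.py."""

    def __init__(self, api_key):
        self.api_key = api_key


class YourModelCodeModeAgent(YourModelRegularAgent):
    """Placeholder for the class in agents/yourmodel_codemode_agent.py."""


# Hypothetical registry name; the entry mirrors the snippet above.
MODEL_REGISTRY = {
    "yourmodel": {
        "name": "Your Model Name",
        "api_key_env": "YOUR_API_KEY",
        "regular": YourModelRegularAgent,
        "codemode": YourModelCodeModeAgent,
    },
}


def create_agent(model, mode, env=None):
    """Look up the model entry, read its API key, and build the agent."""
    env = os.environ if env is None else env
    if model not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model: {model}")
    if mode not in ("regular", "codemode"):
        raise ValueError(f"Unknown mode: {mode}")
    entry = MODEL_REGISTRY[model]
    api_key = env.get(entry["api_key_env"])
    if not api_key:
        raise RuntimeError(f"{entry['api_key_env']} not configured")
    return entry[mode](api_key)
```

Keeping the API key variable name in the registry entry means the key lookup needs no per-provider branching.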

The agent factory makes it easy to support multiple LLM providers!