VT Code Benchmarks

February 13, 2026

This directory contains benchmark results and documentation for evaluating VT Code's code generation capabilities.

Overview

VT Code is evaluated on industry-standard benchmarks to measure:

  • Code Generation Quality: Correctness and functionality of generated code
  • Performance: Response latency and throughput
  • Cost Efficiency: Token usage and API costs across providers

HumanEval Benchmark

HumanEval is a benchmark for evaluating code generation models on 164 hand-written programming problems. Each problem includes:

  • Function signature and docstring
  • Unit tests to verify correctness
  • Pass@1 metric (percentage of problems solved on first attempt)
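The problems are distributed through the `datasets` package installed under Prerequisites below. A minimal sketch of inspecting one problem, assuming the public `openai_humaneval` dataset on the Hugging Face Hub:

```python
# Minimal sketch: load the HumanEval problems with the `datasets` package
# (the same dependency installed under Prerequisites).
from datasets import load_dataset

problems = load_dataset("openai_humaneval", split="test")  # 164 problems
task = problems[0]

print(task["task_id"])      # e.g. "HumanEval/0"
print(task["prompt"])       # function signature + docstring sent to the model
print(task["entry_point"])  # name of the function the tests call
print(task["test"])         # unit tests that verify the completion
```

With a single sample per task, pass@1 is simply the fraction of problems whose generated solution passes all tests (e.g. 155/164 ≈ 94.5% in the results below).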

Latest Results (October 2025)

MAJOR ACHIEVEMENT: gpt-5-nano achieves frontier-tier performance (94.5% pass@1)

Comparison Chart

Two models benchmarked:

| Model | Provider | Pass@1 | Passed | Failed | Latency (P50) | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-5-nano | OpenAI | 94.5% | 155/164 | 9/164 | 10.4s | ~$0.10-0.30/1M tokens |
| gemini-3-flash-preview | Google | 61.6% | 101/164 | 63/164 | 0.97s | $0.00 (free tier) |

Configuration: temperature=0.0, seed=42, timeout=120s

Key Findings

gpt-5-nano:

  • Frontier-tier performance (94.5%)
  • TOP 5 globally
  • Very affordable (~$0.10-0.30/1M tokens)
  • 10-50x cheaper than premium competitors
  • 10.4s median latency

gemini-3-flash-preview:

  • ~10x faster (0.97s vs 10.4s median latency)
  • Completely FREE (Google free tier)
  • Good for development (61.6%)
  • Perfect for rapid iteration
  • Ideal for high-volume testing

Strategic Choice:

  • Use gpt-5-nano for production validation and critical tasks
  • Use gemini-3-flash-preview for development and prototyping

See GPT5_NANO_VS_GEMINI.md for a detailed comparison.

Note: Token counts are not currently reported by vtcode. gemini-3-flash-preview falls under Google's free tier, so its actual cost is $0.00.

Comparison with Other Models

| Model | Pass@1 | Latency (P50) | Cost (est.) |
| --- | --- | --- | --- |
| gpt-5-nano | 94.5% | 10.4s | ~$0.10-0.30/1M tokens |
| gemini-3-flash-preview | 61.6% | 0.97s | $0.00 |

More results coming soon.

Methodology

  1. Dataset: Complete HumanEval dataset (164 problems)
  2. Prompt Format: Raw code-only format optimized for Gemini
  3. Evaluation: Automated test execution with Python unittest (a simplified sketch follows this list)
  4. Reproducibility: Fixed seed (42) for deterministic sampling
  5. Rate Limiting: 500ms sleep between tasks to respect API limits
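The sketch below illustrates step 3 in isolation. It is a simplified stand-in for the actual harness in scripts/ and the Makefile: the model's completion is appended to the problem's prompt and test code, the combined program is executed in a subprocess with the configured timeout, and a clean exit counts as a pass. The `check(...)` call follows the HumanEval convention that each problem's test block defines a `check(candidate)` function.

```python
# Simplified illustration only; the real evaluation lives in scripts/ and the Makefile.
import subprocess
import sys
import tempfile

def run_task(prompt: str, completion: str, test_code: str, entry_point: str,
             timeout_s: int = 120) -> bool:
    # Full program: signature + docstring, model completion, then the unit tests.
    program = f"{prompt}{completion}\n\n{test_code}\ncheck({entry_point})\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(program)
        path = handle.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return result.returncode == 0  # any failing assert or exception exits non-zero
    except subprocess.TimeoutExpired:
        return False
```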

Running Benchmarks

Prerequisites

# Install Python dependencies
pip install datasets

# Ensure vtcode is built
cargo build --release

Basic Usage

# Run full benchmark (164 tasks)
make bench-humaneval PROVIDER=gemini MODEL='gemini-3-flash-preview'

# Run subset for quick testing
make bench-humaneval PROVIDER=gemini MODEL='gemini-3-flash-preview' N_HE=10

# Run with custom parameters
make bench-humaneval \
  PROVIDER=openai \
  MODEL='gpt-5' \
  N_HE=50 \
  SEED=42 \
  SLEEP_MS=500 \
  RETRY_MAX=3

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| PROVIDER | gemini | LLM provider (gemini, openai, anthropic, etc.) |
| MODEL | gemini-3-flash-preview | Model identifier |
| N_HE | 164 | Number of tasks to run (max 164) |
| SEED | 1337 | Random seed for reproducibility |
| USE_TOOLS | 0 | Enable tool usage (0=disabled, 1=enabled) |
| TEMP | 0.0 | Temperature for sampling |
| MAX_OUT | 1024 | Maximum output tokens |
| TIMEOUT_S | 120 | Timeout per task in seconds |
| SLEEP_MS | 0 | Sleep between tasks (ms) |
| RETRY_MAX | 2 | Maximum retry attempts |
| BACKOFF_MS | 500 | Backoff delay for retries (ms) |
| INPUT_PRICE | 0.0 | Cost per 1k input tokens (USD) |
| OUTPUT_PRICE | 0.0 | Cost per 1k output tokens (USD) |
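INPUT_PRICE and OUTPUT_PRICE feed a straightforward estimate of the form below. This is a sketch of the intended calculation, not vtcode's exact code, and since token counts are not yet reported (see Known Issues) the estimate currently comes out to $0.00.

```python
# Sketch of the cost estimate implied by INPUT_PRICE / OUTPUT_PRICE (USD per 1k tokens).
def estimated_cost_usd(input_tokens: int, output_tokens: int,
                       input_price: float, output_price: float) -> float:
    return (input_tokens / 1000) * input_price + (output_tokens / 1000) * output_price

# Hypothetical numbers for illustration: 500k input + 100k output tokens
# at $0.05 / $0.40 per 1k tokens -> $65.00
print(estimated_cost_usd(500_000, 100_000, 0.05, 0.40))
```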

Visualization

Generate charts and summaries from results:

# Generate ASCII chart and markdown summary
python3 scripts/render_benchmark_chart.py reports/HE_*.json

# View latest results
cat reports/HE_*_summary.md

Results Archive

All benchmark results are stored in the reports/ directory with the naming convention:

HE_YYYYMMDD-HHMMSS_<model>_tools-<0|1>_N<count>.json

Each report includes:

  • Metadata (model, provider, configuration)
  • Summary statistics (pass@1, latency, cost)
  • Individual task results (passed/failed, errors, timing)
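Because the timestamp in the filename sorts lexicographically, the most recent report can be picked up with a simple glob. A minimal sketch (the exact JSON schema is defined by the benchmark script and not reproduced here):

```python
# Sketch: locate and load the most recent HumanEval report.
import glob
import json

reports = sorted(glob.glob("reports/HE_*.json"))
latest = reports[-1]  # timestamped names sort chronologically

with open(latest) as f:
    report = json.load(f)

print(f"Loaded {latest} with {len(report)} top-level fields")
```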

Known Issues

  1. Token Counting: vtcode doesn't currently report token usage from the LLM API
  2. Stderr Pollution: Fixed in v0.30.4 - .env loading message no longer pollutes output
  3. CLI Flags: --temperature and --max-output-tokens not supported by ask command

Future Work

  • Add support for more benchmarks (MBPP, CodeContests)
  • Multi-model comparison dashboard
  • Token usage tracking and reporting
  • Cost optimization analysis
  • Performance profiling and optimization

Contributing

To add new benchmarks or improve existing ones:

  1. Add benchmark script to scripts/
  2. Document methodology in this directory
  3. Update Makefile with new targets
  4. Submit PR with results and analysis

References