olmo-eval
June 11, 2026 · View on GitHub
Overview
This project provides a unified workbench for evaluating language models throughout the model development loop.
Features:
- Registry of benchmark tasks and composable suites, with named variants for few-shot settings, formatting, and scoring (e.g. humaneval:3shot:bpb).
- Support for inference via vLLM, LiteLLM for commercial APIs, and a mock provider for dry runs and debugging.
- Harness abstraction that separates execution policy from task definition, so any task can be run baseline or tool-augmented without modification.
- Multi-turn agentic evaluation with tool calling, scaffolds, and sandboxed environments via Docker, Podman, or Modal.
- LLM-as-judge scoring with auxiliary providers, including locally served judge models.
- Aggregate and instance-level prediction storage.
- Inspection tooling for viewing instances, formatted prompts, token arrays, and model responses.
Quick Start
This project uses uv with a checked-in uv.lock
for reproducible builds. To get started, sync the repo with uv, browse the
available tasks and suites, and preview a run with the built-in mock provider.
Run Your First Eval
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install Python 3.12 if your machine does not already have it
uv python install 3.12
# Install dependencies + the package (editable) from the lockfile.
# The default groups (`dev` + `vllm`) are installed automatically, which
# pulls in storage, beaker, hf, and the vLLM inference provider. vLLM
# deps are marked Linux-only via PEP 508 markers, so this works on macOS
# too — no extra flags needed.
uv sync --frozen
# Install pre-commit hooks
make setup
# To update the lockfile after changing pyproject.toml
uv lock
# Add an optional extra on top of the defaults (e.g. agents, litellm)
uv sync --frozen --extra agents
# `openhands` conflicts with vllm — opt out of the vllm group when using it
uv sync --frozen --no-group vllm --extra openhands
# Browse a few suites
uv run olmo-eval suite inspect mmlu
uv run olmo-eval suite inspect gpqa
uv run olmo-eval suite inspect olmobase:code
# Preview a run without loading a model
uv run olmo-eval run -m mock -t gsm8k --dry-run
# Preview another run with a different task spec
uv run olmo-eval run -m mock -t humaneval:3shot:bpb --dry-run
Key Concepts
The evaluation framework is built around these core abstractions:
| Abstraction | Description |
|---|---|
| Task | Benchmark specification defining dataset slice, request construction, and scoring logic |
| Suite | Benchmark collection that composes tasks and/or nested suites and defines result aggregation |
| Harness | Execution runtime around the inference provider, tools, scaffolds, and runtime behavior |
| Formatter | Prompt renderer from an instance and few-shot context to an LM request |
| Scorer | Per-example evaluator from model output to raw score or judgment |
| Metric | Dataset-level aggregator over per-example scores |
Tasks
Tasks define how to load data, format prompts, and score outputs. Register with @register:
from olmo_eval.evals.tasks.common import Task, register
from olmo_eval.data import DataSource
@register("my_task")
class MyTask(Task):
# DataSource specifies path, subset (optional), and split
data_source = DataSource(path="cais/mmlu", subset="abstract_algebra", split="test")
...
Variants can also act as named evaluation presets (for example, few-shot settings):
from olmo_eval.evals.tasks.common import register_variant
register_variant("my_task", "3shot", num_fewshot=3, fewshot_seed=42)
# Built-in example: uv run olmo-eval run -m llama3.1-8b -t humaneval:3shot:bpb
Runtime Dependencies allow tasks to specify packages installed at job startup:
@register("code_eval")
class CodeEvalTask(Task):
data_source = DataSource(path="my-org/code-dataset", split="test")
dependencies = ["code-sandbox==1.0", "git+https://github.com/user/repo@v2.0"]
...
Suites
Suites group multiple tasks for batch evaluation:
from olmo_eval.evals.suites import Suite, register
register(Suite(
name="my_suite",
tasks=("task_a:3shot", "task_b:3shot", "task_c:3shot"),
))
Aggregation
Suites support different strategies for combining task results:
| Strategy | Description |
|---|---|
AVERAGE | Simple average of all task scores (default) |
AVERAGE_OF_AVERAGES | Average over child suite averages (equal weight per child) |
DISPLAY_ONLY | Display child results without computing suite average |
NONE | No aggregation - just collect individual task results |
Average of Averages Example:
from olmo_eval.evals.suites import Suite, AggregationStrategy, register
# Nested suite with 3 tasks
multilingual_code = Suite(
name="multilingual_code",
tasks=("mbpp_python", "mbpp_java", "mbpp_rust"),
aggregation=AggregationStrategy.AVERAGE,
)
# Parent suite using average of averages
register(Suite(
name="code_eval",
tasks=(
"humaneval", # Single task (score: 0.80)
multilingual_code, # Nested suite with 3 tasks (scores: 0.40, 0.50, 0.60)
),
aggregation=AggregationStrategy.AVERAGE_OF_AVERAGES,
))
# Results:
# - humaneval: 0.80
# - multilingual_code average: (0.40 + 0.50 + 0.60) / 3 = 0.50
#
# AVERAGE_OF_AVERAGES: (0.80 + 0.50) / 2 = 0.65
# vs AVERAGE: (0.80 + 0.40 + 0.50 + 0.60) / 4 = 0.575
Note: Currently AVERAGE_OF_AVERAGES gives each child equal weight regardless of how many tasks it contains. Custom weighting may be supported in the future.
Formatters
Formatters convert instances into LM requests. See olmo_eval.common.formatters for available options.
from olmo_eval.common.formatters import MultipleChoiceFormatter, ChatFormatter
# Multiple choice with logprob scoring
formatter = MultipleChoiceFormatter(template="Q: {question}\n\nA:")
# Chat-based formatting
formatter = ChatFormatter(system_prompt="You are a helpful assistant.")
Scorers
Scorers compute a score for each instance/output pair. See olmo_eval.common.scorers for available options.
from olmo_eval.common.scorers import ExactMatchScorer, MultipleChoiceScorer
# Exact string match
scorer = ExactMatchScorer()
# Multiple choice comparison
scorer = MultipleChoiceScorer()
Metrics
Metrics aggregate scores across responses. See olmo_eval.common.metrics for available options.
from olmo_eval.common.metrics import AccuracyMetric, F1Metric
from olmo_eval.common.scorers import ExactMatchScorer, F1Scorer
# Mean accuracy
metric = AccuracyMetric(scorer=ExactMatchScorer)
# Mean F1 score
metric = F1Metric(scorer=F1Scorer)
Model Presets
Pre-configured model settings in olmo_eval/common/constants/models.py:
from olmo_eval.common.constants import get_model_presets
# Returns dict of preset name -> ModelConfig
presets = get_model_presets()
# {
# "llama3.1-8b": ModelConfig(model="meta-llama/Meta-Llama-3.1-8B"),
# "olmo-2-7b": ModelConfig(model="allenai/OLMo-2-1124-7B"),
# ...
# }
Harness
A Harness is the runtime orchestration layer for an evaluation run. It combines the primary inference provider with execution policy such as system prompts, tools, auxiliary providers, sandboxing, metrics collection, and an optional scaffold for multi-turn control. This lets the same task run in plain, tool-using, or scaffolded modes without changing the task definition.
Key concept: Any task can be run with or without tools—that's determined by the Harness configuration, not the task definition. This allows comparing baseline vs tool-augmented performance on the same task.
Using Harness via CLI
# Run task without tools or a scaffold (baseline)
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge
# Run task with search tools via harness preset
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness dr_tulu
# Use a custom harness config file
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness-config ./my_harness.yaml
HarnessConfig
Configuration for a harness:
from olmo_eval.harness import HarnessConfig, ProviderConfig, get_harness_preset
from olmo_eval.harness.tools.search import (
semantic_scholar_search,
serper_web_search,
serper_fetch_page,
)
# Get a preset
config = get_harness_preset("dr_tulu")
# Or create custom config with tools
config = HarnessConfig(
name="my_harness",
provider=ProviderConfig(model="gpt-4o", kind="litellm"),
tools=(semantic_scholar_search, serper_web_search, serper_fetch_page),
system_prompt="You are a helpful assistant with search tools.",
max_turns=10,
max_concurrency=8,
scaffold="openai_agents",
required_secrets=("S2_API_KEY", "SERPER_API_KEY"),
)
| Field | Type | Default | Description |
|---|---|---|---|
name | str | Required | Harness identifier |
provider | ProviderConfig | ProviderConfig() | Model provider configuration |
tools | tuple[Tool | str, ...] | () | Tool instances or registered tool names |
system_prompt | str | None | None | System prompt to inject |
tool_choice | str | "auto" | Tool selection mode (auto, none, required) |
scaffold | str | None | None | Execution scaffold (e.g., openai_agents) |
max_turns | int | None | None | Max turns for multi-turn execution |
max_concurrency | int | None | None | Concurrent executions |
scoring_concurrency | int | None | None | Max concurrent scoring operations |
sandboxes | tuple[SandboxConfig, ...] | () | Sandbox configurations for isolated tool execution |
scaffold_kwargs | dict[str, Any] | {} | Scaffold-specific options (e.g., enable_compaction) |
metrics | MetricsConfig | None | None | Inference metrics collection config |
batching | BatchConfig | None | None | Batching strategy configuration |
required_secrets | tuple[str, ...] | () | Required environment variables |
Scaffolds
Scaffolds define how the Harness executes multi-turn requests with tool calling. A scaffold handles the agentic loop: calling the model, executing tools, and feeding results back.
# List available scaffolds
uv run olmo-eval scaffolds
When to use a scaffold:
- For multi-turn execution with
harness.run(), you must specify a scaffold - For single-turn generation with
harness.generate(), no scaffold is needed
# Multi-turn execution requires a scaffold
config = HarnessConfig(
name="my_agent",
provider=ProviderConfig(model="gpt-4o", kind="litellm"),
tools=(semantic_scholar_search, serper_web_search),
scaffold="openai_agents", # Required for run()
)
harness = Harness(config)
result = await harness.run(request) # Uses the scaffold
# Single-turn generation works without a scaffold
config = HarnessConfig(
name="simple",
provider=ProviderConfig(model="gpt-4o", kind="litellm"),
)
harness = Harness(config)
outputs = harness.generate(requests) # No scaffold needed
Inference Metrics
Harness configurations can include MetricsConfig to collect inference performance metrics during evaluation:
from olmo_eval.harness import HarnessConfig, ProviderConfig
from olmo_eval.inference.metrics import MetricsConfig
config = HarnessConfig(
name="with_metrics",
provider=ProviderConfig(model="llama3.1-8b", kind="vllm_server"),
metrics=MetricsConfig(
enabled=True,
reporters=("file", "db"), # Save to file and database
collect_vllm_server=True, # Poll vLLM server /metrics endpoint
),
)
Visualizing Metrics:
# Plot metrics from database (requires at least one filter)
uv run olmo-eval metrics plot -G my-benchmark-group
uv run olmo-eval metrics plot -m OLMo-3 --metric throughput
# Show statistics table without interactive plots
uv run olmo-eval metrics plot -e experiment_123 --stats-only
When using the db reporter, metrics are stored in a PostgreSQL database (default name: olmo_eval_metrics). You must configure your own database connection using the OLMO_EVAL_DB_* environment variables (see Database Configuration).
Auxiliary Providers and Local Judge Models
Some tasks or custom scorers use LLM-as-judge scoring, where a separate model evaluates responses. The auxiliary_providers configuration lets you specify additional inference providers for scoring or judging. Harness overrides must come immediately after --harness, while task overrides like limit=... must come after -t.
Local example with uv run olmo-eval run:
uv run olmo-eval run \
--harness default \
-o provider.max_model_len=16384 \
-o provider.num_instances=1 \
-o 'metrics.reporters=[file]' \
-o 'metrics.collect_gpu=true' \
-o 'provider.kwargs.timeout=300' \
-o auxiliary_providers.judge.kind=vllm_server \
-o auxiliary_providers.judge.model=Qwen/Qwen3-8B \
-o auxiliary_providers.judge.num_instances=1 \
-o scoring_concurrency=4 \
-m Qwen/Qwen3-8B \
-t simpleqa:judge \
-o limit=10
Key configuration options:
| Option | Description |
|---|---|
auxiliary_providers.judge.kind | Provider type: vllm_server, litellm, etc. |
auxiliary_providers.judge.model | Model to use for judging |
auxiliary_providers.judge.num_instances | Number of parallel vLLM instances |
auxiliary_providers.judge.base_url | URL for external servers (when not spawning locally) |
scoring_concurrency | Number of concurrent scoring requests |
Defining Tools
Tools combine schema (for the LLM) and implementation (for execution) in a single definition:
from olmo_eval.harness import tool, registered_tool
# Option 1: @tool decorator (local use)
@tool(description="Search the web for information")
async def web_search(query: str) -> str:
"""Search implementation."""
return await search_api(query)
# Option 2: @registered_tool decorator (global registry, for cross-process use)
@registered_tool(description="Fetch a webpage")
async def fetch_page(url: str) -> str:
"""Fetch implementation."""
return await fetch_url(url)
Tools are automatically registered when using @registered_tool, making them available by name in HarnessConfig.
Custom Harness Config File
Create a YAML file for custom harness configurations:
# my_harness.yaml
name: custom_search
tool_names:
- semantic_scholar_snippet_search
- serper_google_webpage_search
system_prompt: |
You are a research assistant with web search capabilities.
Use search tools to find accurate information before answering.
max_turns: 15
max_concurrency: 4
required_secrets:
- S2_API_KEY
- SERPER_API_KEY
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness-config my_harness.yaml
Programmatic Usage
from olmo_eval.harness import Harness, HarnessConfig, ProviderConfig, get_harness_preset
from olmo_eval.harness.tools.search import (
semantic_scholar_search,
serper_web_search,
)
# Create harness with preset and provider override
config = get_harness_preset("dr_tulu").with_provider(
ProviderConfig(model="meta-llama/Llama-3.1-8B-Instruct", kind="vllm")
)
harness = Harness(config)
# Or create from scratch
config = HarnessConfig(
name="my_harness",
provider=ProviderConfig(model="gpt-4o", kind="litellm"),
tools=(semantic_scholar_search, serper_web_search),
system_prompt="You are a helpful assistant.",
scaffold="openai_agents",
)
harness = Harness(config)
# Multi-turn execution with tool calling
result = await harness.run(request, sampling_params)
print(result.trajectory) # Shows all turns including tool calls
print(result.final_output) # Final model response
Adding New Tasks
This section explains how to create new evaluation tasks.
Quick Start: Minimal Task Example
"""Example: Minimal task implementation."""
from collections.abc import Iterator
from typing import Any
from olmo_eval.common.types import Instance, LMOutput, LMRequest, RequestType
from olmo_eval.data import DataLoader, DataSource
from olmo_eval.evals.tasks.common import Task, register
@register("my_task")
class MyTask(Task):
"""My task implementation."""
# DataSource arguments:
# path: HuggingFace dataset path (e.g., "cais/mmlu")
# subset: Dataset subset/config (e.g., "abstract_algebra")
# split: Dataset split (e.g., "test", "validation")
data_source = DataSource(path="cais/mmlu", subset="abstract_algebra", split="test")
@property
def instances(self) -> Iterator[Instance]:
"""Load and yield instances from the dataset."""
if self._instances_cache is None:
self._instances_cache = []
loader = DataLoader()
source = self.config.get_data_source()
for doc in loader.load(source):
self._instances_cache.append(self.process_doc(doc))
yield from self._instances_cache
def process_doc(self, doc: dict[str, Any]) -> Instance:
"""Convert a dataset document to an Instance."""
return Instance(
question=doc["question"],
gold_answer=doc["answer"],
choices=tuple(doc["choices"]), # For MC tasks
metadata={"id": doc["id"]},
)
def format_request(self, instance: Instance) -> LMRequest:
"""Format instance for the language model."""
if self.config.formatter is not None:
return self.config.formatter.format(instance, self.get_fewshot())
# Fallback formatting
return LMRequest(request_type=RequestType.COMPLETION, prompt=instance.question)
def extract_answer(self, output: LMOutput) -> str | None:
"""Extract the answer from model output."""
return output.text.strip()
Task Class Overview
| Method | Required | Purpose |
|---|---|---|
instances | Yes | Property that yields Instance objects from the dataset |
process_doc(doc) | Yes | Converts a raw document dict into an Instance |
format_request(instance) | Yes | Converts an Instance into an LMRequest for the model |
extract_answer(output) | Yes | Extracts the answer string from LMOutput |
_build_fewshot() | No | Override to customize few-shot example loading |
score_responses(...) | No | Override to customize scoring logic |
compute_metrics(...) | No | Override to customize metric computation |
TaskConfig Reference
| Field | Type | Default | Description |
|---|---|---|---|
name | str | Required | Task identifier used in CLI |
data_source | DataSource | str | None | Dataset source (HuggingFace, S3, GCS, local, or URI string) |
fewshot_source | DataSource | str | None | Optional separate source for few-shot examples |
formatter | Formatter | None | Request formatter |
metrics | tuple[Metric, ...] | () | Evaluation metrics (scorers are inferred from metrics) |
num_fewshot | int | 0 | Number of few-shot examples |
fewshot_seed | int | 42 | Random seed for few-shot selection |
seed | int | 42 | General random seed for task |
limit | int | None | None | Max instances to evaluate |
split | Split | Split.TEST | Dataset split to use |
primary_metric | MetricName | Metric | None | None | Primary metric for ranking (defaults to single metric if only one) |
sampling_params | SamplingParams | None | None | Default sampling parameters for this task |
dependencies | list[str] | None | None | Runtime packages to install (e.g., ["pkg==1.0"]) |
Data Sources
Tasks can load data from multiple sources using DataSource:
from olmo_eval.data import DataSource
# HuggingFace datasets - specify path, subset, and split
DataSource(path="cais/mmlu", subset="abstract_algebra", split="test")
# Using URI string (alternative syntax)
DataSource.from_uri("hf://cais/mmlu?subset=abstract_algebra&split=test")
# Without subset (for datasets that don't have subsets)
DataSource(path="openai_humaneval", split="test")
# With specific data files and revision
DataSource(path="my-org/dataset", data_files="data/test.jsonl", revision="v1.0")
# Local JSONL files
DataSource(path="/path/to/dataset.jsonl")
# S3
DataSource(path="s3://my-bucket/datasets/data.jsonl")
# GCS
DataSource(path="gs://my-bucket/datasets/data.parquet")
DataSource Fields:
| Field | Type | Default | Description |
|---|---|---|---|
path | str | Required | Dataset path (HuggingFace repo, S3/GCS URI, or local path) |
subset | str | None | None | Dataset subset/config name |
split | str | "test" | Dataset split |
data_files | str | None | None | Specific data files to load |
revision | str | None | None | Dataset revision/version |
Common Patterns
Multiple Choice Tasks:
formatter=MultipleChoiceFormatter(template="Question: {question}\n\nAnswer:")
metrics=(AccuracyMetric(scorer=MultipleChoiceScorer),)
Generation Tasks (exact match):
formatter=CompletionFormatter(template="{question}")
metrics=(AccuracyMetric(scorer=ExactMatchScorer),)
Tasks with Multiple Subsets (like MMLU with 57 subjects):
# Base class with shared logic
class MMLUTask(Task):
...
# Register each subset - the subset is specified in DataSource
@register("mmlu_anatomy")
class MMLUAnatomy(MMLUTask):
data_source = DataSource(path="cais/mmlu", subset="anatomy", split="test")
@register("mmlu_physics")
class MMLUPhysics(MMLUTask):
data_source = DataSource(path="cais/mmlu", subset="high_school_physics", split="test")
Adding Variants
Variants modify how a task is formatted/scored (e.g., :mc, :bpb):
from olmo_eval.evals.tasks.common import register_variant
# Register after task is defined
register_variant("my_task", "bpb", formatter=PPLFormatter(), metrics=(BPBMetricByteAvg(scorer=BitsPerByteScorer),))
Variants can also encode configuration presets (e.g., :3shot, :zero):
from olmo_eval.evals.tasks.common import register_variant
register_variant("my_task", "3shot", num_fewshot=3, fewshot_seed=1234)
register_variant("my_task", "zero", num_fewshot=0)
Usage: uv run olmo-eval run -m llama3.1-8b -t humaneval:3shot:bpb
Tool-Augmented Evaluation
olmo-eval supports evaluating models with tool use through the Harness abstraction. This enables comparing baseline model performance against tool-augmented performance on the same tasks.
The Harness is the preferred way to add tools to evaluations. It separates tool configuration from task definition, allowing any task to be run with or without tools:
# Baseline evaluation (no tools)
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge
# Same task with search tools
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness dr_tulu
See the Harness section above for full documentation on:
- Creating custom harness configurations
- Defining tools with the
@tooldecorator - Programmatic usage
Querying Results
Evaluation results can be stored in PostgreSQL and queried via the CLI.
Basic Queries
# Query by experiment ID
uv run olmo-eval results query --experiment exp_001
# Query by model
uv run olmo-eval results query --model llama3.1-8b
# Query by task (shows comparison matrix)
uv run olmo-eval results query --task mmlu --task gsm8k
# Query by experiment group
uv run olmo-eval results query -G my-benchmark-group --format json
# Combine filters
uv run olmo-eval results query --model llama3.1-8b --task mmlu --format json
Instance-Level Predictions
Include --instances to retrieve instance-level predictions:
# Get instances for an experiment
uv run olmo-eval results query --experiment exp_001 --task mmlu --instances --format json
# Paginate through large result sets using keyset pagination
uv run olmo-eval results query --task mmlu --instances --limit 1000 --format json
# Get next page using last_id from previous response
uv run olmo-eval results query --task mmlu --instances --limit 1000 --after-id 1000 --format json
JSON output includes pagination metadata:
{
"experiments": [...],
"pagination": {
"last_id": 12345,
"has_more": true
}
}
Output Formats
| Format | Flag | Description |
|---|---|---|
| Table | --format table | Rich terminal tables (default) |
| JSON | --format json | Structured JSON with pagination metadata |
| CSV | --format csv | CSV output to stdout |
Database Configuration
AI2 Users (Recommended)
Set these two environment variables to connect to the shared database:
export OLMO_EVAL_DB_HOST="<database-host>"
export OLMO_EVAL_DB_SECRET_ARN="arn:aws:secretsmanager:us-west-2:..."
The password is automatically fetched from AWS Secrets Manager on first connection.
This requires AWS credentials configured (via ~/.aws/credentials or environment variables).
All Environment Variables
| Variable | Default | Description |
|---|---|---|
OLMO_EVAL_DB_HOST | localhost | Database host |
OLMO_EVAL_DB_PORT | 5432 | Database port |
OLMO_EVAL_DB_NAME | olmo_eval | Database name |
OLMO_EVAL_DB_USER | postgres | Database user |
OLMO_EVAL_DB_PASSWORD | - | Database password (use this OR OLMO_EVAL_DB_SECRET_ARN) |
OLMO_EVAL_DB_SECRET_ARN | - | AWS Secrets Manager ARN for password (fetched on auth failure) |
Advanced Usage
Multi-GPU and Tool-Augmented Evaluation
# Basic evaluation
uv run olmo-eval run -m llama3.1-8b -t mmlu -t gsm8k -t arc_easy
# Large models with multi-GPU tensor parallelism
uv run olmo-eval run -m llama3.1-70b -t mmlu --num-gpus 4
# Refresh Hugging Face cache before loading a remote model
uv run olmo-eval run -m allenai/OLMo-2-1124-7B -t mmlu --force-download-model
# Tool-augmented evaluation with harness
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness dr_tulu
Debugging and Inspection
olmo-eval provides tools for inspecting tasks, requests, and responses at various stages of evaluation.
Task Inspection (uv run olmo-eval task inspect)
Inspect task instances without running evaluation:
# View raw instance data
uv run olmo-eval task inspect arc_easy
# View multiple instances
uv run olmo-eval task inspect arc_easy -n 5 --skip 10
# View the LM request that will be sent to the model
uv run olmo-eval task inspect arc_easy:mc --request
# View formatted prompt with chat template applied
uv run olmo-eval task inspect humaneval -T meta-llama/Llama-3.1-8B-Instruct --formatted
# View tokenized representation
uv run olmo-eval task inspect humaneval -T meta-llama/Llama-3.1-8B-Instruct --tokens
# Export as JSON for programmatic use
uv run olmo-eval task inspect arc_easy --json
| Option | Description |
|---|---|
-n, --count | Number of instances to display |
-s, --skip | Number of instances to skip |
--instance | Show instance details (default if no other flags) |
--request | Show the LM request |
-T, --tokenizer | Tokenizer for formatting/tokenization |
--formatted | Show prompt after template applied (requires -T) |
--tokens | Show token array (requires -T) |
--max-tokens | Max tokens to display (0 for no limit) |
--max-chars | Max chars for formatted prompt (0 for no limit) |
--max-string-length | Max chars for instance field values (0 for no limit) |
--json | Output as JSON |
Runtime Inspection Flags
Inspect data during evaluation runs with uv run olmo-eval run:
# Enable all inspection flags at once
uv run olmo-eval run -m llama3.1-8b -t mmlu --inspect
# Or use individual flags for specific inspection
uv run olmo-eval run -m llama3.1-8b -t mmlu --inspect-instance --inspect-request
# Inspect the response after model generation
uv run olmo-eval run -m llama3.1-8b -t mmlu --inspect-response
# Combine multiple inspection flags
uv run olmo-eval run -m llama3.1-8b -t mmlu \
--inspect-instance \
--inspect-request \
--inspect-response
| Flag | Description |
|---|---|
--inspect | Enable all inspection flags below |
--inspect-instance | Print the first instance of each task before running |
--inspect-request | Print the first LM request before model generation |
--inspect-formatted | Show formatted prompt (after chat template applied) |
--inspect-tokens | Show token array before evaluation |
--inspect-response | Print the first response after model generation |
Mock Provider for Testing
Use the mock provider to test inspection tools without loading a real model:
# Quick inspection without vLLM or PyTorch
uv run olmo-eval run -m mock -t humaneval:3shot:bpb --inspect-request
# Dry run with mock to preview configuration
uv run olmo-eval run -m mock -t mmlu --dry-run
External Evals
External evals are standalone evaluations that run outside the normal task pipeline.
Use them when a benchmark already comes with its own harness, verifier, or environment
and does not fit cleanly into the usual task formatter/scorer flow. They are a good fit
for agent-style benchmarks like terminal_bench_2, tau2_bench, and asta_bench
that need sandbox orchestration, benchmark-specific setup, or end-to-end execution
against an external repo or runner.
Defining an External Eval
from typing import Any
from olmo_eval.evals.external import SandboxedExternalEval, ExternalEvalResult, register_external_eval
class MyBenchmarkExternalEval(SandboxedExternalEval):
"""My benchmark evaluation."""
name = "my_benchmark"
description = "Evaluates model on my benchmark"
timeout_seconds = 3600
required_secrets = ("MY_API_KEY",)
@property
def sandbox_image(self) -> str:
return "my-benchmark:latest"
@property
def working_dir(self) -> str:
return "/workspace"
@property
def setup_command(self) -> tuple[str, ...]:
return ("pip install -r requirements.txt",)
@property
def arguments(self) -> dict[str, tuple[str, Any | None]]:
# Returns dict of arg_name -> (description, default_value)
return {"subset": ("Which subset to evaluate", "default")}
async def execute(self, provider, args, output_dir, container_runtime):
# Run benchmark in sandbox
result = await self.run_in_sandbox(provider, args, output_dir)
return ExternalEvalResult(
name=self.name,
metrics={"accuracy": result.score},
success=True,
)
# Register the eval
register_external_eval(MyBenchmarkExternalEval())
Running External Evals
# List available external evals
uv run olmo-eval external-evals
# Run a built-in external eval
uv run olmo-eval run-external -e tau2_bench --model llama3.1-8b -a domain=airline -a num_tasks=1
ExternalEvalResult
External evals return structured results:
| Field | Type | Description |
|---|---|---|
name | str | Eval identifier |
metrics | dict[str, float] | Evaluation metrics |
metadata | dict | Additional metadata |
success | bool | Whether the eval completed successfully |
error | str | None | Error message if failed |
duration_seconds | float | Execution time |
raw_output | str | None | Raw stdout/stderr from the evaluation |
predictions | list | Instance-level predictions |
Sandboxes
Sandboxes provide isolated execution environments for code execution, tool use, and external evals.
Configuration
from olmo_eval.harness.sandbox import SandboxConfig, SandboxMode, Capability
config = SandboxConfig(
image="python:3.12-slim",
mode=SandboxMode.DOCKER,
command_timeout=30.0,
startup_timeout=60.0,
instances=4, # Run 4 parallel executors
working_dir="/workspace",
environment=(("MY_VAR", "value"),),
volumes=(("/host/path", "/container/path"),),
capabilities=Capability.BASH | Capability.PYTHON, # Union of frozensets
)
SandboxConfig Fields
| Field | Type | Default | Description |
|---|---|---|---|
image | str | Required | Container image |
mode | SandboxMode | DOCKER | LOCAL, DOCKER, or MODAL |
container_runtime | str | "podman" | "docker" or "podman" |
command_timeout | float | 30.0 | Timeout per command (seconds) |
startup_timeout | float | 60.0 | Container startup timeout |
instances | int | 1 | Number of parallel executors |
working_dir | str | "/workspace" | Working directory in container |
environment | tuple | () | Environment variables |
volumes | tuple | () | Volume mounts (host, container) |
capabilities | frozenset[str] | Capability.DEFAULT | Capabilities like Capability.BASH, Capability.PYTHON |
remove_container | bool | True | Remove container after use |
docker_args | tuple[str, ...] | () | Additional Docker/Podman arguments |
log_dir | str | None | None | Directory for container logs |
exec_shell | tuple[str, ...] | None | None | Custom shell for command execution |
enable_diagnostics | bool | True | Run background diagnostics monitor |
Using SandboxManager
The SandboxManager manages multiple executors with capability-based routing:
from olmo_eval.harness.sandbox import SandboxConfig, SandboxManager, SandboxMode, Capability
configs = [
SandboxConfig(image="python:3.12", mode=SandboxMode.DOCKER, capabilities=Capability.PYTHON, instances=2),
SandboxConfig(image="ubuntu:22.04", mode=SandboxMode.DOCKER, capabilities=Capability.BASH),
]
manager = SandboxManager(configs, owner="my-scorer")
await manager.start()
# Execute with specific capability - routes to matching executor
result = await manager.execute_with_capabilities(
"print('hello')",
Capability.PYTHON
)
# Round-robin across matching executors
results = await asyncio.gather(*[
manager.execute_with_capabilities(cmd, Capability.PYTHON)
for cmd in commands
])
await manager.stop()
Sandbox Modes
| Mode | Description |
|---|---|
LOCAL | Run commands locally (development only) |
DOCKER | Run in Docker/Podman containers |
MODAL | Run on Modal cloud platform |
Launching on Beaker
olmo-eval includes built-in support for launching evaluation jobs on Beaker.
Installation
The beaker extra is included in the default dev group, so a plain
uv sync --frozen is enough. If you previously opted out of the default
groups, re-enable it with:
uv sync --frozen --extra beaker
CLI Usage
Launch an evaluation job:
# Basic evaluation
uv run olmo-eval beaker launch -n "eval-llama3-mmlu" -m llama3.1-8b -t mmlu \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
# Multiple tasks
uv run olmo-eval beaker launch -n "eval-llama3-suite" \
-m llama3.1-8b \
-t mmlu -t gsm8k -t hellaswag \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
# Large model with multiple GPUs
uv run olmo-eval beaker launch \
--name "eval-70b-full" \
--model meta-llama/Llama-3.1-70B-Instruct \
--task mmlu --task gsm8k --task arc_easy \
--cluster h100 \
--workspace "ai2/olmo-eval-debug" \
--budget "ai2/oe-base" \
--gpus 4 \
--timeout 48h
# Preview the Beaker spec without launching
uv run olmo-eval beaker launch -n "test" -m llama3.1-8b -t arc_easy \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base" \
--dry-run
# With a harness preset for tool-augmented evaluation
uv run olmo-eval beaker launch -n "eval-with-tools" \
-m llama3.1-8b \
-t simpleqa:judge \
--harness dr_tulu \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
# With inspection flags for debugging
uv run olmo-eval beaker launch -n "debug-eval" \
-m llama3.1-8b \
-t mmlu -o limit=10 \
--inspect-request \
--inspect-response \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
# Run external evaluations
uv run olmo-eval beaker launch -n "external-eval" \
-E tau2_bench \
-m llama3.1-8b \
-A domain=airline \
-A num_tasks=1 \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
Advanced Usage
Local Judge Models
For tasks or custom scorers that use a named auxiliary judge provider, you can run a local judge model alongside the main model. Put harness overrides immediately after --harness, then put task overrides after -t.
uv run olmo-eval beaker launch \
--harness default \
-o provider.max_model_len=16384 \
-o provider.num_instances=1 \
-o 'metrics.reporters=[file]' \
-o 'metrics.collect_gpu=true' \
-o 'provider.kwargs.timeout=300' \
-o auxiliary_providers.judge.kind=vllm_server \
-o auxiliary_providers.judge.model=Qwen/Qwen3-8B \
-o auxiliary_providers.judge.num_instances=1 \
-o scoring_concurrency=4 \
-m Qwen/Qwen3-8B \
-t "simpleqa:judge@urgent" \
-o limit=10 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base" \
--cluster h100 \
--inspect \
--group olmo-eval-local-judge-2 -y
Per-Task Priorities
Tasks can include an optional @priority suffix to set different priorities per task.
Tasks with different priorities will be launched as separate Beaker experiments:
# Mixed priorities - creates separate experiments per priority level
uv run olmo-eval beaker launch -n "eval-suite" -m llama3.1-8b \
-t "mmlu@high" \
-t "gsm8k@normal" \
-t "arc_easy@low" \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
# Creates 3 experiments:
# eval-suite-high: runs mmlu at high priority
# eval-suite-normal: runs gsm8k at normal priority
# eval-suite-low: runs arc_easy at low priority
# With task variants (@ comes after the task spec)
uv run olmo-eval beaker launch -n "eval" -m llama3.1-8b -t "arc_easy:mc@high" \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
# Tasks without @priority use the config file priority (default: normal)
Experiment Groups
Groups logically organize experiments for management and result retrieval:
# Launch with grouping
uv run olmo-eval beaker launch -n "benchmark-v1" --group "benchmark-2024" \
-m llama3.1-8b -m olmo-2-7b \
-t mmlu -t gsm8k -t hellaswag \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
# Creates experiment and adds it to "benchmark-2024" group
# Check group status and results
uv run olmo-eval beaker group info benchmark-2024
# Show detailed task info
uv run olmo-eval beaker group info benchmark-2024 --verbose
# Wait for completion and export as CSV
uv run olmo-eval beaker group info benchmark-2024 --wait --format csv > results.csv
# Export as JSON
uv run olmo-eval beaker group info benchmark-2024 --format json
# Watch experiment logs
uv run olmo-eval beaker watch -e <experiment-id>
# Cancel all experiments in a group
uv run olmo-eval beaker group cancel benchmark-2024
# List groups in a workspace
uv run olmo-eval beaker group list -w <workspace>
Inference Provider Configuration
Docker images do NOT include inference providers (vllm, transformers, litellm) by default. Each model must resolve to a provider configuration, either from a built-in model preset or from harness overrides.
Via config file (recommended):
name: eval-mixed-providers
models:
- llama3.1-8b
- gpt-4o
tasks:
- mmlu
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
CLI Options
| Option | Short | Default | Description |
|---|---|---|---|
--config | -f | none | YAML config file (CLI args override config values) |
--name | -n | auto | Experiment name (auto-generated from model/tasks if not provided) |
--model | -m | required | Model name or HuggingFace path (can specify multiple) |
--task | -t | required | Task name with optional @priority suffix (can specify multiple) |
--harness | -H | none | Harness preset name |
--override | -o | none | Override for preceding -t or -H (can specify multiple) |
--cluster | -c | required | Cluster alias (h100, a100, aus) or full name |
--gpus | -G | auto | Number of GPUs (defaults to 1 for GPU providers, 0 otherwise) |
--max-gpus-per-node | 8 | Maximum GPUs per node (tasks split if exceeded) | |
--priority | -p | normal | Job priority (low, normal, high, urgent) |
--preemptible | true | Allow preemption | |
--timeout | -T | 24h | Job timeout (e.g., 24h, 30m) |
--retries | -r | none | Number of retries on failure |
--workspace | -w | required | Beaker workspace |
--budget | -B | required | Beaker budget |
--image | -I | default | Custom Beaker image |
--group | -g | auto | Add experiments to Beaker group(s) (auto-generated if not specified) |
--external-eval | -E | none | External evaluation name(s) to run instead of tasks |
--eval-arg | -A | none | Arguments for external evals (key=value) |
--provider-kwarg | -K | none | Provider kwargs for external evals (key=value) |
--force-download-model | false | Refresh Hugging Face model/tokenizer cache before loading | |
--uv-cache-dir | default | UV cache directory for package downloads | |
--dry-run | -d | false | Print spec without launching |
--yes | -y | false | Skip confirmation prompt |
--follow/--no-follow | true | Follow logs after launch | |
--secret-env | none | Map Beaker secret to env var (SECRET:VAR) | |
--aws-credentials | auto | Inject AWS credentials (auto-detected from s3:// paths) | |
--gcp-credentials | auto | Inject GCP credentials (auto-detected from gs:// model paths) | |
--store | false | Persist results to configured database |
Per-Task Overrides
Use the -o/--override flag to apply configuration overrides to the preceding -t:
# Task overrides (apply to the preceding -t)
uv run olmo-eval beaker launch -n "eval" \
-m llama3.1-8b \
-t mmlu -o limit=100 -o num_fewshot=5 \
-t gsm8k -o limit=50 \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
The -o flag uses OmegaConf dotlist syntax, supporting:
| Type | Syntax | Example |
|---|---|---|
| String | key=value | -o formatter.template="Q: {q}" |
| Number | key=123 | -o limit=100 |
| Boolean | key=true | -o preemptible=false |
| Nested | a.b.c=val | -o scorer.normalize=true |
| List | key=[a,b] | -o 'dependencies=[pkg1, pkg2]' |
Note: Quote complex values to prevent shell interpretation:
# Good - single quotes protect the value
-o 'extra_config={key: value, nested: {a: 1}}'
Secret Environment Overrides
By default, Beaker secrets are mapped using the pattern {username}_{ENV_VAR} (e.g., ai2-tylerm_OPENAI_API_KEY).
Use --secret-env to override this with a custom Beaker secret name:
# Use a team-shared secret instead of your personal secret
uv run olmo-eval beaker launch -n "eval" -m gpt-4o -t mmlu \
--secret-env team-openai-key:OPENAI_API_KEY \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
# Multiple secret overrides
uv run olmo-eval beaker launch -n "eval" -m gpt-4o -t simpleqa:judge \
--harness dr_tulu \
--secret-env team-openai-key:OPENAI_API_KEY \
--secret-env shared-serper-key:SERPER_API_KEY \
--secret-env shared-s2-key:S2_API_KEY \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
Format: BEAKER_SECRET_NAME:ENV_VAR_NAME
This is useful for:
- Using team-shared API keys instead of personal secrets
- Testing with different credential sets
- Sharing jobs that use organization-level secrets
YAML Configuration
For complex or reusable configurations, use YAML config files with the --config/-f option.
CLI arguments override values from the config file.
Basic config file (eval_config.yaml):
name: eval-llama3-core
models:
- llama3.1-8b
tasks:
- mmlu
- gsm8k
- hellaswag
- arc_challenge
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 1
priority: normal
timeout: 24h
Usage:
# Run from config file
uv run olmo-eval beaker launch -f eval_config.yaml --dry-run
# Override specific values
uv run olmo-eval beaker launch -f eval_config.yaml --gpus 4
# Add additional models via CLI
uv run olmo-eval beaker launch -f eval_config.yaml -m olmo-2-7b
Multi-model comparison config:
name: eval-model-comparison
models:
- llama3.1-8b
- olmo-2-7b
- mistral-7b
tasks:
- mmlu
- gsm8k
- hellaswag
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 1
Per-task priorities in config (examples/configs/prioritized_tasks.yaml):
Use @priority suffix on tasks to run different tasks at different priority levels.
Tasks with different priorities create separate Beaker experiments:
name: eval-prioritized
models:
- llama3.1-8b
- olmo-2-7b
tasks:
# High priority - run first
- mmlu@high
- gsm8k@high
# Normal priority
- hellaswag@normal
- arc_challenge@normal
# Low priority - run when resources available
- winogrande@low
- arc_easy@low
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 1
timeout: 24h
This creates 3 experiments (one per priority level, with both models in each):
eval-prioritized-high: models=[llama3.1-8b, olmo-2-7b], tasks=[mmlu, gsm8k]
eval-prioritized-normal: models=[llama3.1-8b, olmo-2-7b], tasks=[hellaswag, arc_challenge]
eval-prioritized-low: models=[llama3.1-8b, olmo-2-7b], tasks=[winogrande, arc_easy]
Large model config:
name: eval-70b-full
models:
- meta-llama/Llama-3.1-70B-Instruct
tasks:
- mmlu
- gsm8k
- hellaswag
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 4
priority: high
preemptible: false
timeout: 48h
retries: 2
description: "Full evaluation suite for Llama 70B"
Config file fields:
| Field | Type | Required | Description |
|---|---|---|---|
name | string | yes | Experiment name |
models | list | yes | List of model names or presets |
tasks | list | yes | List of task specs (with optional @priority) |
cluster | string | yes | Cluster alias or full name |
gpus | int | no | Default GPUs per model instance (auto-detected based on provider) |
max_gpus_per_node | int | no | Max GPUs per node, splits tasks if exceeded (default: 8) |
priority | string | no | Default priority (default: normal) |
preemptible | bool | no | Allow preemption (default: true) |
timeout | string | no | Job timeout (default: 24h) |
retries | int | no | Retry count on failure |
workspace | string | yes | Beaker workspace |
budget | string | yes | Beaker budget |
beaker_image | string | no | Container image to use (config-only) |
description | string | no | Optional Beaker description |
groups | list | no | Beaker groups to add experiments to |
See examples/beaker/configs/ for more configuration examples.
Cluster Aliases
# List available cluster aliases
uv run olmo-eval beaker clusters
Programmatic API
from olmo_eval.launch import BeakerJobConfig, BeakerLauncher
config = BeakerJobConfig(
name="eval-llama3-mmlu",
command=["uv", "run", "olmo-eval", "run", "-m", "llama3.1-8b", "-t", "mmlu"],
cluster="h100",
num_gpus=1,
)
launcher = BeakerLauncher()
experiment = launcher.launch(config)
print(f"Launched: {launcher.beaker.experiment.url(experiment)}")
Docker Image Management
Docker images provide the runtime environment (Python, PyTorch, CUDA) but do NOT include:
- Source code - Gantry mounts your git repository at runtime
- Inference providers - Installed at job startup from each model's resolved provider config
This approach allows you to:
- Use any git commit without rebuilding images
- Keep images small and cacheable
Building Images
Images are tagged with CUDA and PyTorch versions: cu{version}-trc{version}-{arch}
# Build with defaults
./scripts/build_image.sh
# Specific CUDA + PyTorch version
./scripts/build_image.sh --cuda-version 12.8.1 --torch-version 2.9.0
# Production build
./scripts/build_image.sh --platform linux/amd64
# See supported CUDA+PyTorch pairs
./scripts/build_image.sh --help
Supported CUDA versions: 12.6.1, 12.8.0, 12.8.1, 12.9.1
PyTorch version: Configurable via --torch-version
Configuration: See scripts/build_config.sh
Pushing Images
# Push most recent build
./scripts/beaker/push_beaker_image.sh
# Preview without pushing
./scripts/beaker/push_beaker_image.sh --dry-run
The script auto-detects the image name from the tag (e.g., olmo-eval-cu128-trc291-amd64)
What's in the Image
The image contains:
- Python 3.12 (via uv)
- PyTorch with CUDA support
- System dependencies (git, uv, ca-certificates)
The image does NOT contain:
- olmo-eval source code (provided by gantry at runtime)
- olmo-eval dependencies like click, datasets, rich, etc. (installed at job startup)
- Storage backends like boto3, psycopg (installed at job startup if needed)
- Inference providers like vllm, transformers, litellm (installed at job startup)
Installing Inference Providers at Runtime
Inference providers are NOT baked into images. They are installed at job startup from the resolved provider configuration for each model:
# In config file
models:
- llama3.1-8b
- gpt-4o
# Or force the provider kind via a harness override
uv run olmo-eval beaker launch -n "eval" \
--harness default \
-o provider.kind=vllm_server \
-m llama3.1-8b \
-t mmlu \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
# Manual installation inside container
uv pip install -e '.[vllm]' # includes vllm[runai]
Task-Specific Dependencies
Tasks can declare runtime dependencies that are installed at job startup (see Tasks). Dependencies are automatically merged, deduplicated, and installed after the inference provider.
You can also add or override dependencies via the CLI:
# Add dependencies to a task via -o flag
uv run olmo-eval beaker launch -n "eval" -m llama3.1-8b \
-t humaneval:3shot:bpb -o 'dependencies=["code-sandbox==1.0", "git+https://github.com/user/repo@v2.0"]' \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
# Dependencies from multiple tasks are merged
uv run olmo-eval beaker launch -n "eval" -m llama3.1-8b \
-t humaneval:3shot:bpb -o 'dependencies=["pkg1"]' \
-t mbpp:3shot:bpb -o 'dependencies=["pkg2"]' \
--cluster h100 \
-w "ai2/olmo-eval-debug" \
-B "ai2/oe-base"
Development
This repo uses uv with a checked-in uv.lock for reproducible installs.
The default dependency groups (dev + vllm) are installed automatically,
which covers storage, beaker, hf, and the vLLM inference provider.
# Install dependencies from the lockfile
uv sync --frozen
# Install pre-commit hooks
make setup
# Run linter / formatter
make lint
make fix # auto-fix
# Run tests (and type checks)
make test
make verify
# Update the lockfile after editing pyproject.toml
uv lock
CI runs uv sync --frozen and uv run --frozen ..., so any change to
pyproject.toml must be accompanied by a refreshed uv.lock.