olmo-eval

June 11, 2026

Overview

This project provides a unified workbench for evaluating language models throughout the model development loop.

Features:

  • Registry of benchmark tasks and composable suites, with named variants for few-shot settings, formatting, and scoring (e.g. humaneval:3shot:bpb).
  • Inference through vLLM, through LiteLLM for commercial APIs, or through a mock provider for dry runs and debugging.
  • Harness abstraction that separates execution policy from task definition, so any task can be run baseline or tool-augmented without modification.
  • Multi-turn agentic evaluation with tool calling, scaffolds, and sandboxed environments via Docker, Podman, or Modal.
  • LLM-as-judge scoring with auxiliary providers, including locally served judge models.
  • Aggregate and instance-level prediction storage.
  • Inspection tooling for viewing instances, formatted prompts, token arrays, and model responses.

Quick Start

This project uses uv with a checked-in uv.lock for reproducible builds. To get started, sync the repo with uv, browse the available tasks and suites, and preview a run with the built-in mock provider.

Run Your First Eval

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Python 3.12 if your machine does not already have it
uv python install 3.12

# Install dependencies + the package (editable) from the lockfile.
# The default groups (`dev` + `vllm`) are installed automatically, which
# pulls in storage, beaker, hf, and the vLLM inference provider. vLLM
# deps are marked Linux-only via PEP 508 markers, so this works on macOS
# too — no extra flags needed.
uv sync --frozen

# Install pre-commit hooks
make setup

# To update the lockfile after changing pyproject.toml
uv lock

# Add an optional extra on top of the defaults (e.g. agents, litellm)
uv sync --frozen --extra agents

# `openhands` conflicts with vllm — opt out of the vllm group when using it
uv sync --frozen --no-group vllm --extra openhands

# Browse a few suites
uv run olmo-eval suite inspect mmlu
uv run olmo-eval suite inspect gpqa
uv run olmo-eval suite inspect olmobase:code

# Preview a run without loading a model
uv run olmo-eval run -m mock -t gsm8k --dry-run

# Preview another run with a different task spec
uv run olmo-eval run -m mock -t humaneval:3shot:bpb --dry-run

Key Concepts

The evaluation framework is built around these core abstractions:

| Abstraction | Description |
| --- | --- |
| Task | Benchmark specification defining dataset slice, request construction, and scoring logic |
| Suite | Benchmark collection that composes tasks and/or nested suites and defines result aggregation |
| Harness | Execution runtime around the inference provider, tools, scaffolds, and runtime behavior |
| Formatter | Prompt renderer from an instance and few-shot context to an LM request |
| Scorer | Per-example evaluator from model output to raw score or judgment |
| Metric | Dataset-level aggregator over per-example scores |

Tasks

Tasks define how to load data, format prompts, and score outputs. Register with @register:

from olmo_eval.evals.tasks.common import Task, register
from olmo_eval.data import DataSource

@register("my_task")
class MyTask(Task):
    # DataSource specifies path, subset (optional), and split
    data_source = DataSource(path="cais/mmlu", subset="abstract_algebra", split="test")
    ...

Variants can also act as named evaluation presets (for example, few-shot settings):

from olmo_eval.evals.tasks.common import register_variant

register_variant("my_task", "3shot", num_fewshot=3, fewshot_seed=42)
# Built-in example: uv run olmo-eval run -m llama3.1-8b -t humaneval:3shot:bpb
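
To illustrate why a variant pins both num_fewshot and fewshot_seed, here is a minimal standalone sketch of seeded few-shot selection. The helper pick_fewshot and the example pool are hypothetical and are not part of olmo-eval; the real selection logic may differ.

```python
import random


def pick_fewshot(pool: list[str], num_fewshot: int, fewshot_seed: int) -> list[str]:
    """Hypothetical helper: draw few-shot examples reproducibly."""
    # A private Random instance keeps the draw independent of global state.
    rng = random.Random(fewshot_seed)
    return rng.sample(pool, num_fewshot)


pool = [f"example-{i}" for i in range(20)]
first = pick_fewshot(pool, num_fewshot=3, fewshot_seed=42)
second = pick_fewshot(pool, num_fewshot=3, fewshot_seed=42)
print(first == second)  # the same seed gives the same examples
```

Fixing the seed in a named variant means two runs of the same preset see the same few-shot context, so score differences come from the model rather than from the prompt.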

Runtime Dependencies allow tasks to specify packages installed at job startup:

@register("code_eval")
class CodeEvalTask(Task):
    data_source = DataSource(path="my-org/code-dataset", split="test")
    dependencies = ["code-sandbox==1.0", "git+https://github.com/user/repo@v2.0"]
    ...

Suites

Suites group multiple tasks for batch evaluation:

from olmo_eval.evals.suites import Suite, register

register(Suite(
    name="my_suite",
    tasks=("task_a:3shot", "task_b:3shot", "task_c:3shot"),
))

Aggregation

Suites support different strategies for combining task results:

| Strategy | Description |
| --- | --- |
| AVERAGE | Simple average of all task scores (default) |
| AVERAGE_OF_AVERAGES | Average over child suite averages (equal weight per child) |
| DISPLAY_ONLY | Display child results without computing suite average |
| NONE | No aggregation - just collect individual task results |

Average of Averages Example:

from olmo_eval.evals.suites import Suite, AggregationStrategy, register

# Nested suite with 3 tasks
multilingual_code = Suite(
    name="multilingual_code",
    tasks=("mbpp_python", "mbpp_java", "mbpp_rust"),
    aggregation=AggregationStrategy.AVERAGE,
)

# Parent suite using average of averages
register(Suite(
    name="code_eval",
    tasks=(
        "humaneval",        # Single task (score: 0.80)
        multilingual_code,  # Nested suite with 3 tasks (scores: 0.40, 0.50, 0.60)
    ),
    aggregation=AggregationStrategy.AVERAGE_OF_AVERAGES,
))

# Results:
# - humaneval: 0.80
# - multilingual_code average: (0.40 + 0.50 + 0.60) / 3 = 0.50
#
# AVERAGE_OF_AVERAGES: (0.80 + 0.50) / 2 = 0.65
# vs AVERAGE:          (0.80 + 0.40 + 0.50 + 0.60) / 4 = 0.575

Note: Currently AVERAGE_OF_AVERAGES gives each child equal weight regardless of how many tasks it contains. Custom weighting may be supported in the future.
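
The arithmetic above can be checked with a small standalone sketch. It models a suite as a plain nested list of scores, which is an assumption made for illustration only and not how olmo-eval represents suites internally.

```python
from statistics import mean

# Hypothetical stand-in for a suite tree: a task is a float score,
# a nested suite is a list of children.
humaneval = 0.80
multilingual_code = [0.40, 0.50, 0.60]
code_eval = [humaneval, multilingual_code]


def flatten(node):
    """Yield every leaf task score under a node."""
    if isinstance(node, list):
        for child in node:
            yield from flatten(child)
    else:
        yield node


def average(node) -> float:
    """AVERAGE: every leaf task counts equally."""
    return mean(flatten(node))


def average_of_averages(node: list) -> float:
    """AVERAGE_OF_AVERAGES: every direct child counts equally."""
    return mean(average(child) for child in node)


flat = average(code_eval)
nested = average_of_averages(code_eval)
print(round(flat, 3), round(nested, 3))
```

The two strategies diverge whenever children differ in size: here the single humaneval task carries half the weight under AVERAGE_OF_AVERAGES but only a quarter under AVERAGE.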

Formatters

Formatters convert instances into LM requests. See olmo_eval.common.formatters for available options.

from olmo_eval.common.formatters import MultipleChoiceFormatter, ChatFormatter

# Multiple choice with logprob scoring
formatter = MultipleChoiceFormatter(template="Q: {question}\n\nA:")

# Chat-based formatting
formatter = ChatFormatter(system_prompt="You are a helpful assistant.")

Scorers

Scorers compute a score for each instance/output pair. See olmo_eval.common.scorers for available options.

from olmo_eval.common.scorers import ExactMatchScorer, MultipleChoiceScorer

# Exact string match
scorer = ExactMatchScorer()

# Multiple choice comparison
scorer = MultipleChoiceScorer()

Metrics

Metrics aggregate scores across responses. See olmo_eval.common.metrics for available options.

from olmo_eval.common.metrics import AccuracyMetric, F1Metric
from olmo_eval.common.scorers import ExactMatchScorer, F1Scorer

# Mean accuracy
metric = AccuracyMetric(scorer=ExactMatchScorer)

# Mean F1 score
metric = F1Metric(scorer=F1Scorer)
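
The split between scorers (per example) and metrics (per dataset) can be pictured with a standalone sketch. The functions exact_match and accuracy below are hypothetical illustrations, not the olmo_eval.common classes.

```python
def exact_match(prediction: str, gold: str) -> float:
    """Scorer sketch: one score per (output, reference) pair."""
    return 1.0 if prediction.strip() == gold.strip() else 0.0


def accuracy(scores: list[float]) -> float:
    """Metric sketch: aggregate per-example scores over the dataset."""
    return sum(scores) / len(scores) if scores else 0.0


# Made-up (prediction, gold) pairs for illustration.
pairs = [("4", "4"), (" Paris", "Paris"), ("blue", "red")]
scores = [exact_match(pred, gold) for pred, gold in pairs]
print(scores, accuracy(scores))
```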

Model Presets

Pre-configured model settings in olmo_eval/common/constants/models.py:

from olmo_eval.common.constants import get_model_presets

# Returns dict of preset name -> ModelConfig
presets = get_model_presets()
# {
#     "llama3.1-8b": ModelConfig(model="meta-llama/Meta-Llama-3.1-8B"),
#     "olmo-2-7b": ModelConfig(model="allenai/OLMo-2-1124-7B"),
#     ...
# }

Harness

A Harness is the runtime orchestration layer for an evaluation run. It combines the primary inference provider with execution policy such as system prompts, tools, auxiliary providers, sandboxing, metrics collection, and an optional scaffold for multi-turn control. This lets the same task run in plain, tool-using, or scaffolded modes without changing the task definition.

Key concept: Any task can be run with or without tools—that's determined by the Harness configuration, not the task definition. This allows comparing baseline vs tool-augmented performance on the same task.

Using Harness via CLI

# Run task without tools or a scaffold (baseline)
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge

# Run task with search tools via harness preset
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness dr_tulu

# Use a custom harness config file
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness-config ./my_harness.yaml

HarnessConfig

Configuration for a harness:

from olmo_eval.harness import HarnessConfig, ProviderConfig, get_harness_preset
from olmo_eval.harness.tools.search import (
    semantic_scholar_search,
    serper_web_search,
    serper_fetch_page,
)

# Get a preset
config = get_harness_preset("dr_tulu")

# Or create custom config with tools
config = HarnessConfig(
    name="my_harness",
    provider=ProviderConfig(model="gpt-4o", kind="litellm"),
    tools=(semantic_scholar_search, serper_web_search, serper_fetch_page),
    system_prompt="You are a helpful assistant with search tools.",
    max_turns=10,
    max_concurrency=8,
    scaffold="openai_agents",
    required_secrets=("S2_API_KEY", "SERPER_API_KEY"),
)

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | Required | Harness identifier |
| provider | ProviderConfig | ProviderConfig() | Model provider configuration |
| tools | tuple[Tool \| str, ...] | () | Tool instances or registered tool names |
| system_prompt | str \| None | None | System prompt to inject |
| tool_choice | str | "auto" | Tool selection mode (auto, none, required) |
| scaffold | str \| None | None | Execution scaffold (e.g., openai_agents) |
| max_turns | int \| None | None | Max turns for multi-turn execution |
| max_concurrency | int \| None | None | Concurrent executions |
| scoring_concurrency | int \| None | None | Max concurrent scoring operations |
| sandboxes | tuple[SandboxConfig, ...] | () | Sandbox configurations for isolated tool execution |
| scaffold_kwargs | dict[str, Any] | {} | Scaffold-specific options (e.g., enable_compaction) |
| metrics | MetricsConfig \| None | None | Inference metrics collection config |
| batching | BatchConfig \| None | None | Batching strategy configuration |
| required_secrets | tuple[str, ...] | () | Required environment variables |

Scaffolds

Scaffolds define how the Harness executes multi-turn requests with tool calling. A scaffold handles the agentic loop: calling the model, executing tools, and feeding results back.

# List available scaffolds
uv run olmo-eval scaffolds

When to use a scaffold:

  • For multi-turn execution with harness.run(), you must specify a scaffold
  • For single-turn generation with harness.generate(), no scaffold is needed

# Multi-turn execution requires a scaffold
config = HarnessConfig(
    name="my_agent",
    provider=ProviderConfig(model="gpt-4o", kind="litellm"),
    tools=(semantic_scholar_search, serper_web_search),
    scaffold="openai_agents",  # Required for run()
)
harness = Harness(config)
result = await harness.run(request)  # Uses the scaffold

# Single-turn generation works without a scaffold
config = HarnessConfig(
    name="simple",
    provider=ProviderConfig(model="gpt-4o", kind="litellm"),
)
harness = Harness(config)
outputs = harness.generate(requests)  # No scaffold needed

Inference Metrics

Harness configurations can include MetricsConfig to collect inference performance metrics during evaluation:

from olmo_eval.harness import HarnessConfig, ProviderConfig
from olmo_eval.inference.metrics import MetricsConfig

config = HarnessConfig(
    name="with_metrics",
    provider=ProviderConfig(model="llama3.1-8b", kind="vllm_server"),
    metrics=MetricsConfig(
        enabled=True,
        reporters=("file", "db"),  # Save to file and database
        collect_vllm_server=True,  # Poll vLLM server /metrics endpoint
    ),
)

Visualizing Metrics:

# Plot metrics from database (requires at least one filter)
uv run olmo-eval metrics plot -G my-benchmark-group
uv run olmo-eval metrics plot -m OLMo-3 --metric throughput

# Show statistics table without interactive plots
uv run olmo-eval metrics plot -e experiment_123 --stats-only

When using the db reporter, metrics are stored in a PostgreSQL database (default name: olmo_eval_metrics). You must configure your own database connection using the OLMO_EVAL_DB_* environment variables (see Database Configuration).

Auxiliary Providers and Local Judge Models

Some tasks or custom scorers use LLM-as-judge scoring, where a separate model evaluates responses. The auxiliary_providers configuration lets you specify additional inference providers for scoring or judging. Harness overrides must come immediately after --harness, while task overrides like limit=... must come after -t.

Local example with uv run olmo-eval run:

uv run olmo-eval run \
    --harness default \
    -o provider.max_model_len=16384 \
    -o provider.num_instances=1 \
    -o 'metrics.reporters=[file]' \
    -o 'metrics.collect_gpu=true' \
    -o 'provider.kwargs.timeout=300' \
    -o auxiliary_providers.judge.kind=vllm_server \
    -o auxiliary_providers.judge.model=Qwen/Qwen3-8B \
    -o auxiliary_providers.judge.num_instances=1 \
    -o scoring_concurrency=4 \
    -m Qwen/Qwen3-8B \
    -t simpleqa:judge \
    -o limit=10

Key configuration options:

| Option | Description |
| --- | --- |
| auxiliary_providers.judge.kind | Provider type: vllm_server, litellm, etc. |
| auxiliary_providers.judge.model | Model to use for judging |
| auxiliary_providers.judge.num_instances | Number of parallel vLLM instances |
| auxiliary_providers.judge.base_url | URL for external servers (when not spawning locally) |
| scoring_concurrency | Number of concurrent scoring requests |
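
To make the dotted -o keys easier to read, the sketch below shows one way such overrides could map onto a nested configuration. The function apply_override is hypothetical, and it leaves values as raw strings; how olmo-eval actually parses and type-converts overrides is not shown in this document.

```python
def apply_override(config: dict, override: str) -> None:
    """Hypothetical helper: set a nested key from a 'a.b.c=value' string."""
    key, _, raw_value = override.partition("=")
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(part, {})
    node[leaf] = raw_value  # value kept as a string in this sketch


config: dict = {}
for item in (
    "auxiliary_providers.judge.kind=vllm_server",
    "auxiliary_providers.judge.model=Qwen/Qwen3-8B",
    "scoring_concurrency=4",
):
    apply_override(config, item)
print(config)
```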

Defining Tools

Tools combine schema (for the LLM) and implementation (for execution) in a single definition:

from olmo_eval.harness import tool, registered_tool

# Option 1: @tool decorator (local use)
@tool(description="Search the web for information")
async def web_search(query: str) -> str:
    """Search implementation."""
    return await search_api(query)

# Option 2: @registered_tool decorator (global registry, for cross-process use)
@registered_tool(description="Fetch a webpage")
async def fetch_page(url: str) -> str:
    """Fetch implementation."""
    return await fetch_url(url)

Tools are automatically registered when using @registered_tool, making them available by name in HarnessConfig.
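
As a rough picture of how a decorator can bundle schema and implementation, here is a standalone sketch. The names sketch_tool and TOOL_REGISTRY are hypothetical, and the single register flag stands in for the @tool / @registered_tool distinction; the real decorators in olmo_eval.harness may build their schemas differently.

```python
import inspect

# Hypothetical global registry: tool name -> callable.
TOOL_REGISTRY: dict = {}


def sketch_tool(description: str, register: bool = False):
    """Attach an LLM-facing schema to a function, optionally registering it."""
    def wrap(fn):
        # Derive parameter names and type names from the signature.
        params = {
            name: getattr(param.annotation, "__name__", str(param.annotation))
            for name, param in inspect.signature(fn).parameters.items()
        }
        fn.schema = {"name": fn.__name__, "description": description, "parameters": params}
        if register:
            TOOL_REGISTRY[fn.__name__] = fn
        return fn
    return wrap


@sketch_tool(description="Fetch a webpage", register=True)
def fetch_page(url: str) -> str:
    return f"<contents of {url}>"


print(fetch_page.schema)
print(TOOL_REGISTRY["fetch_page"]("https://example.com"))
```

Keeping the schema on the function itself means the description the model sees cannot drift from the code that runs, which is the benefit of the single-definition approach described above.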

Custom Harness Config File

Create a YAML file for custom harness configurations:

# my_harness.yaml
name: custom_search
tool_names:
  - semantic_scholar_snippet_search
  - serper_google_webpage_search
system_prompt: |
  You are a research assistant with web search capabilities.
  Use search tools to find accurate information before answering.
max_turns: 15
max_concurrency: 4
required_secrets:
  - S2_API_KEY
  - SERPER_API_KEY

uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness-config my_harness.yaml

Programmatic Usage

from olmo_eval.harness import Harness, HarnessConfig, ProviderConfig, get_harness_preset
from olmo_eval.harness.tools.search import (
    semantic_scholar_search,
    serper_web_search,
)

# Create harness with preset and provider override
config = get_harness_preset("dr_tulu").with_provider(
    ProviderConfig(model="meta-llama/Llama-3.1-8B-Instruct", kind="vllm")
)
harness = Harness(config)

# Or create from scratch
config = HarnessConfig(
    name="my_harness",
    provider=ProviderConfig(model="gpt-4o", kind="litellm"),
    tools=(semantic_scholar_search, serper_web_search),
    system_prompt="You are a helpful assistant.",
    scaffold="openai_agents",
)
harness = Harness(config)

# Multi-turn execution with tool calling
result = await harness.run(request, sampling_params)
print(result.trajectory)  # Shows all turns including tool calls
print(result.final_output)  # Final model response

Adding New Tasks

This section explains how to create new evaluation tasks.

Quick Start: Minimal Task Example

"""Example: Minimal task implementation."""
from collections.abc import Iterator
from typing import Any

from olmo_eval.common.types import Instance, LMOutput, LMRequest, RequestType
from olmo_eval.data import DataLoader, DataSource
from olmo_eval.evals.tasks.common import Task, register


@register("my_task")
class MyTask(Task):
    """My task implementation."""

    # DataSource arguments:
    #   path: HuggingFace dataset path (e.g., "cais/mmlu")
    #   subset: Dataset subset/config (e.g., "abstract_algebra")
    #   split: Dataset split (e.g., "test", "validation")
    data_source = DataSource(path="cais/mmlu", subset="abstract_algebra", split="test")

    @property
    def instances(self) -> Iterator[Instance]:
        """Load and yield instances from the dataset."""
        if self._instances_cache is None:
            self._instances_cache = []
            loader = DataLoader()
            source = self.config.get_data_source()
            for doc in loader.load(source):
                self._instances_cache.append(self.process_doc(doc))
        yield from self._instances_cache

    def process_doc(self, doc: dict[str, Any]) -> Instance:
        """Convert a dataset document to an Instance."""
        return Instance(
            question=doc["question"],
            gold_answer=doc["answer"],
            choices=tuple(doc["choices"]),  # For MC tasks
            metadata={"id": doc["id"]},
        )

    def format_request(self, instance: Instance) -> LMRequest:
        """Format instance for the language model."""
        if self.config.formatter is not None:
            return self.config.formatter.format(instance, self.get_fewshot())
        # Fallback formatting
        return LMRequest(request_type=RequestType.COMPLETION, prompt=instance.question)

    def extract_answer(self, output: LMOutput) -> str | None:
        """Extract the answer from model output."""
        return output.text.strip()

Task Class Overview

| Method | Required | Purpose |
| --- | --- | --- |
| instances | Yes | Property that yields Instance objects from the dataset |
| process_doc(doc) | Yes | Converts a raw document dict into an Instance |
| format_request(instance) | Yes | Converts an Instance into an LMRequest for the model |
| extract_answer(output) | Yes | Extracts the answer string from LMOutput |
| _build_fewshot() | No | Override to customize few-shot example loading |
| score_responses(...) | No | Override to customize scoring logic |
| compute_metrics(...) | No | Override to customize metric computation |

TaskConfig Reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | Required | Task identifier used in CLI |
| data_source | DataSource \| str | None | Dataset source (HuggingFace, S3, GCS, local, or URI string) |
| fewshot_source | DataSource \| str | None | Optional separate source for few-shot examples |
| formatter | Formatter | None | Request formatter |
| metrics | tuple[Metric, ...] | () | Evaluation metrics (scorers are inferred from metrics) |
| num_fewshot | int | 0 | Number of few-shot examples |
| fewshot_seed | int | 42 | Random seed for few-shot selection |
| seed | int | 42 | General random seed for task |
| limit | int \| None | None | Max instances to evaluate |
| split | Split | Split.TEST | Dataset split to use |
| primary_metric | MetricName \| Metric \| None | None | Primary metric for ranking (defaults to single metric if only one) |
| sampling_params | SamplingParams \| None | None | Default sampling parameters for this task |
| dependencies | list[str] \| None | None | Runtime packages to install (e.g., ["pkg==1.0"]) |

Data Sources

Tasks can load data from multiple sources using DataSource:

from olmo_eval.data import DataSource

# HuggingFace datasets - specify path, subset, and split
DataSource(path="cais/mmlu", subset="abstract_algebra", split="test")

# Using URI string (alternative syntax)
DataSource.from_uri("hf://cais/mmlu?subset=abstract_algebra&split=test")

# Without subset (for datasets that don't have subsets)
DataSource(path="openai_humaneval", split="test")

# With specific data files and revision
DataSource(path="my-org/dataset", data_files="data/test.jsonl", revision="v1.0")

# Local JSONL files
DataSource(path="/path/to/dataset.jsonl")

# S3
DataSource(path="s3://my-bucket/datasets/data.jsonl")

# GCS
DataSource(path="gs://my-bucket/datasets/data.parquet")

DataSource Fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| path | str | Required | Dataset path (HuggingFace repo, S3/GCS URI, or local path) |
| subset | str \| None | None | Dataset subset/config name |
| split | str | "test" | Dataset split |
| data_files | str \| None | None | Specific data files to load |
| revision | str \| None | None | Dataset revision/version |
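
For readers curious how the hf:// URI form relates to the fields above, the following standalone sketch decomposes such a URI with the standard library. The function parse_hf_uri is hypothetical; DataSource.from_uri may handle more schemes and options than this shows.

```python
from urllib.parse import parse_qs, urlparse


def parse_hf_uri(uri: str) -> dict:
    """Hypothetical parser: split an hf:// URI into path, subset, and split."""
    parts = urlparse(uri)
    if parts.scheme != "hf":
        raise ValueError(f"expected an hf:// URI, got {uri!r}")
    # parse_qs returns lists of values; keep the first value of each key.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "path": parts.netloc + parts.path,       # "cais" + "/mmlu"
        "subset": query.get("subset"),
        "split": query.get("split", "test"),     # same default as the table above
    }


fields = parse_hf_uri("hf://cais/mmlu?subset=abstract_algebra&split=test")
print(fields)
```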

Common Patterns

Multiple Choice Tasks:

formatter=MultipleChoiceFormatter(template="Question: {question}\n\nAnswer:")
metrics=(AccuracyMetric(scorer=MultipleChoiceScorer),)

Generation Tasks (exact match):

formatter=CompletionFormatter(template="{question}")
metrics=(AccuracyMetric(scorer=ExactMatchScorer),)

Tasks with Multiple Subsets (like MMLU with 57 subjects):

# Base class with shared logic
class MMLUTask(Task):
    ...

# Register each subset - the subset is specified in DataSource
@register("mmlu_anatomy")
class MMLUAnatomy(MMLUTask):
    data_source = DataSource(path="cais/mmlu", subset="anatomy", split="test")

@register("mmlu_physics")
class MMLUPhysics(MMLUTask):
    data_source = DataSource(path="cais/mmlu", subset="high_school_physics", split="test")

Adding Variants

Variants modify how a task is formatted/scored (e.g., :mc, :bpb):

from olmo_eval.evals.tasks.common import register_variant

# Register after task is defined
register_variant("my_task", "bpb", formatter=PPLFormatter(), metrics=(BPBMetricByteAvg(scorer=BitsPerByteScorer),))

Variants can also encode configuration presets (e.g., :3shot, :zero):

from olmo_eval.evals.tasks.common import register_variant

register_variant("my_task", "3shot", num_fewshot=3, fewshot_seed=1234)
register_variant("my_task", "zero", num_fewshot=0)

Usage: uv run olmo-eval run -m llama3.1-8b -t humaneval:3shot:bpb
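
A task spec such as humaneval:3shot:bpb reads as a task name followed by variants. The sketch below shows one plausible way to split a spec and layer variant settings in order; parse_task_spec, resolve, and the VARIANTS table are hypothetical, and the real resolution logic (including how suite names containing colons are handled) is not described here.

```python
# Hypothetical variant table: variant name -> config overrides.
VARIANTS = {
    "3shot": {"num_fewshot": 3},
    "zero": {"num_fewshot": 0},
    "bpb": {"formatter": "ppl", "metric": "bits_per_byte"},
}


def parse_task_spec(spec: str) -> tuple[str, list[str]]:
    """Split 'task:variant1:variant2' into the task name and its variants."""
    name, *variants = spec.split(":")
    if not name:
        raise ValueError(f"missing task name in spec {spec!r}")
    return name, variants


def resolve(spec: str) -> dict:
    """Layer variant overrides left to right on top of the task name."""
    name, variants = parse_task_spec(spec)
    config = {"task": name}
    for variant in variants:
        config.update(VARIANTS[variant])
    return config


print(resolve("humaneval:3shot:bpb"))
```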

Tool-Augmented Evaluation

olmo-eval supports evaluating models with tool use through the Harness abstraction. This enables comparing baseline model performance against tool-augmented performance on the same tasks.

The Harness is the preferred way to add tools to evaluations. It separates tool configuration from task definition, allowing any task to be run with or without tools:

# Baseline evaluation (no tools)
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge

# Same task with search tools
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness dr_tulu

See the Harness section above for full documentation on:

  • Creating custom harness configurations
  • Defining tools with the @tool decorator
  • Programmatic usage

Querying Results

Evaluation results can be stored in PostgreSQL and queried via the CLI.

Basic Queries

# Query by experiment ID
uv run olmo-eval results query --experiment exp_001

# Query by model
uv run olmo-eval results query --model llama3.1-8b

# Query by task (shows comparison matrix)
uv run olmo-eval results query --task mmlu --task gsm8k

# Query by experiment group
uv run olmo-eval results query -G my-benchmark-group --format json

# Combine filters
uv run olmo-eval results query --model llama3.1-8b --task mmlu --format json

Instance-Level Predictions

Include --instances to retrieve instance-level predictions:

# Get instances for an experiment
uv run olmo-eval results query --experiment exp_001 --task mmlu --instances --format json

# Paginate through large result sets using keyset pagination
uv run olmo-eval results query --task mmlu --instances --limit 1000 --format json

# Get next page using last_id from previous response
uv run olmo-eval results query --task mmlu --instances --limit 1000 --after-id 1000 --format json

JSON output includes pagination metadata:

{
  "experiments": [...],
  "pagination": {
    "last_id": 12345,
    "has_more": true
  }
}
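
A client can walk all pages by feeding last_id back as the next --after-id until has_more is false. The standalone sketch below simulates that loop against a fake in-memory backend; fetch_page, ROWS, and the "instances" key are hypothetical stand-ins for invoking the CLI and reading its JSON.

```python
# Fake result set standing in for rows stored in the database.
ROWS = [{"id": i, "prediction": f"p{i}"} for i in range(1, 8)]


def fetch_page(after_id: int, limit: int) -> dict:
    """Fake backend: rows with id > after_id, plus pagination metadata."""
    page = [row for row in ROWS if row["id"] > after_id][:limit]
    last_id = page[-1]["id"] if page else after_id
    has_more = any(row["id"] > last_id for row in ROWS)
    return {"instances": page, "pagination": {"last_id": last_id, "has_more": has_more}}


def fetch_all(limit: int) -> list[dict]:
    """Keyset pagination: resume each request after the last id seen."""
    after_id, collected = 0, []
    while True:
        response = fetch_page(after_id, limit)
        collected.extend(response["instances"])
        if not response["pagination"]["has_more"]:
            return collected
        after_id = response["pagination"]["last_id"]


everything = fetch_all(limit=3)
print(len(everything))
```

Unlike offset-based paging, resuming from the last seen id stays stable if new rows are appended while a client is still iterating.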

Output Formats

| Format | Flag | Description |
| --- | --- | --- |
| Table | --format table | Rich terminal tables (default) |
| JSON | --format json | Structured JSON with pagination metadata |
| CSV | --format csv | CSV output to stdout |

Database Configuration

Set these two environment variables to connect to the shared database:

export OLMO_EVAL_DB_HOST="<database-host>"
export OLMO_EVAL_DB_SECRET_ARN="arn:aws:secretsmanager:us-west-2:..."

The password is automatically fetched from AWS Secrets Manager on first connection. This requires AWS credentials configured (via ~/.aws/credentials or environment variables).

All Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| OLMO_EVAL_DB_HOST | localhost | Database host |
| OLMO_EVAL_DB_PORT | 5432 | Database port |
| OLMO_EVAL_DB_NAME | olmo_eval | Database name |
| OLMO_EVAL_DB_USER | postgres | Database user |
| OLMO_EVAL_DB_PASSWORD | - | Database password (use this OR OLMO_EVAL_DB_SECRET_ARN) |
| OLMO_EVAL_DB_SECRET_ARN | - | AWS Secrets Manager ARN for password (fetched on auth failure) |

Advanced Usage

Multi-GPU and Tool-Augmented Evaluation

# Basic evaluation
uv run olmo-eval run -m llama3.1-8b -t mmlu -t gsm8k -t arc_easy

# Large models with multi-GPU tensor parallelism
uv run olmo-eval run -m llama3.1-70b -t mmlu --num-gpus 4

# Refresh Hugging Face cache before loading a remote model
uv run olmo-eval run -m allenai/OLMo-2-1124-7B -t mmlu --force-download-model

# Tool-augmented evaluation with harness
uv run olmo-eval run -m llama3.1-8b -t simpleqa:judge --harness dr_tulu

Debugging and Inspection

olmo-eval provides tools for inspecting tasks, requests, and responses at various stages of evaluation.

Task Inspection (uv run olmo-eval task inspect)

Inspect task instances without running evaluation:

# View raw instance data
uv run olmo-eval task inspect arc_easy

# View multiple instances
uv run olmo-eval task inspect arc_easy -n 5 --skip 10

# View the LM request that will be sent to the model
uv run olmo-eval task inspect arc_easy:mc --request

# View formatted prompt with chat template applied
uv run olmo-eval task inspect humaneval -T meta-llama/Llama-3.1-8B-Instruct --formatted

# View tokenized representation
uv run olmo-eval task inspect humaneval -T meta-llama/Llama-3.1-8B-Instruct --tokens

# Export as JSON for programmatic use
uv run olmo-eval task inspect arc_easy --json

| Option | Description |
| --- | --- |
| -n, --count | Number of instances to display |
| -s, --skip | Number of instances to skip |
| --instance | Show instance details (default if no other flags) |
| --request | Show the LM request |
| -T, --tokenizer | Tokenizer for formatting/tokenization |
| --formatted | Show prompt after template applied (requires -T) |
| --tokens | Show token array (requires -T) |
| --max-tokens | Max tokens to display (0 for no limit) |
| --max-chars | Max chars for formatted prompt (0 for no limit) |
| --max-string-length | Max chars for instance field values (0 for no limit) |
| --json | Output as JSON |

Runtime Inspection Flags

Inspect data during evaluation runs with uv run olmo-eval run:

# Enable all inspection flags at once
uv run olmo-eval run -m llama3.1-8b -t mmlu --inspect

# Or use individual flags for specific inspection
uv run olmo-eval run -m llama3.1-8b -t mmlu --inspect-instance --inspect-request

# Inspect the response after model generation
uv run olmo-eval run -m llama3.1-8b -t mmlu --inspect-response

# Combine multiple inspection flags
uv run olmo-eval run -m llama3.1-8b -t mmlu \
    --inspect-instance \
    --inspect-request \
    --inspect-response

| Flag | Description |
| --- | --- |
| --inspect | Enable all inspection flags below |
| --inspect-instance | Print the first instance of each task before running |
| --inspect-request | Print the first LM request before model generation |
| --inspect-formatted | Show formatted prompt (after chat template applied) |
| --inspect-tokens | Show token array before evaluation |
| --inspect-response | Print the first response after model generation |

Mock Provider for Testing

Use the mock provider to test inspection tools without loading a real model:

# Quick inspection without vLLM or PyTorch
uv run olmo-eval run -m mock -t humaneval:3shot:bpb --inspect-request

# Dry run with mock to preview configuration
uv run olmo-eval run -m mock -t mmlu --dry-run

External Evals

External evals are standalone evaluations that run outside the normal task pipeline. Use them when a benchmark already comes with its own harness, verifier, or environment and does not fit cleanly into the usual task formatter/scorer flow. They are a good fit for agent-style benchmarks like terminal_bench_2, tau2_bench, and asta_bench that need sandbox orchestration, benchmark-specific setup, or end-to-end execution against an external repo or runner.

Defining an External Eval

from typing import Any

from olmo_eval.evals.external import SandboxedExternalEval, ExternalEvalResult, register_external_eval

class MyBenchmarkExternalEval(SandboxedExternalEval):
    """My benchmark evaluation."""

    name = "my_benchmark"
    description = "Evaluates model on my benchmark"
    timeout_seconds = 3600
    required_secrets = ("MY_API_KEY",)

    @property
    def sandbox_image(self) -> str:
        return "my-benchmark:latest"

    @property
    def working_dir(self) -> str:
        return "/workspace"

    @property
    def setup_command(self) -> tuple[str, ...]:
        return ("pip install -r requirements.txt",)

    @property
    def arguments(self) -> dict[str, tuple[str, Any | None]]:
        # Returns dict of arg_name -> (description, default_value)
        return {"subset": ("Which subset to evaluate", "default")}

    async def execute(self, provider, args, output_dir, container_runtime):
        # Run benchmark in sandbox
        result = await self.run_in_sandbox(provider, args, output_dir)
        return ExternalEvalResult(
            name=self.name,
            metrics={"accuracy": result.score},
            success=True,
        )

# Register the eval
register_external_eval(MyBenchmarkExternalEval())

Running External Evals

# List available external evals
uv run olmo-eval external-evals

# Run a built-in external eval
uv run olmo-eval run-external -e tau2_bench --model llama3.1-8b -a domain=airline -a num_tasks=1

ExternalEvalResult

External evals return structured results:

| Field | Type | Description |
| --- | --- | --- |
| name | str | Eval identifier |
| metrics | dict[str, float] | Evaluation metrics |
| metadata | dict | Additional metadata |
| success | bool | Whether the eval completed successfully |
| error | str \| None | Error message if failed |
| duration_seconds | float | Execution time |
| raw_output | str \| None | Raw stdout/stderr from the evaluation |
| predictions | list | Instance-level predictions |

Sandboxes

Sandboxes provide isolated execution environments for code execution, tool use, and external evals.

Configuration

from olmo_eval.harness.sandbox import SandboxConfig, SandboxMode, Capability

config = SandboxConfig(
    image="python:3.12-slim",
    mode=SandboxMode.DOCKER,
    command_timeout=30.0,
    startup_timeout=60.0,
    instances=4,  # Run 4 parallel executors
    working_dir="/workspace",
    environment=(("MY_VAR", "value"),),
    volumes=(("/host/path", "/container/path"),),
    capabilities=Capability.BASH | Capability.PYTHON,  # Union of frozensets
)

SandboxConfig Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| image | str | Required | Container image |
| mode | SandboxMode | DOCKER | LOCAL, DOCKER, or MODAL |
| container_runtime | str | "podman" | "docker" or "podman" |
| command_timeout | float | 30.0 | Timeout per command (seconds) |
| startup_timeout | float | 60.0 | Container startup timeout |
| instances | int | 1 | Number of parallel executors |
| working_dir | str | "/workspace" | Working directory in container |
| environment | tuple | () | Environment variables |
| volumes | tuple | () | Volume mounts (host, container) |
| capabilities | frozenset[str] | Capability.DEFAULT | Capabilities like Capability.BASH, Capability.PYTHON |
| remove_container | bool | True | Remove container after use |
| docker_args | tuple[str, ...] | () | Additional Docker/Podman arguments |
| log_dir | str \| None | None | Directory for container logs |
| exec_shell | tuple[str, ...] \| None | None | Custom shell for command execution |
| enable_diagnostics | bool | True | Run background diagnostics monitor |

Using SandboxManager

The SandboxManager manages multiple executors with capability-based routing:

import asyncio

from olmo_eval.harness.sandbox import SandboxConfig, SandboxManager, SandboxMode, Capability

configs = [
    SandboxConfig(image="python:3.12", mode=SandboxMode.DOCKER, capabilities=Capability.PYTHON, instances=2),
    SandboxConfig(image="ubuntu:22.04", mode=SandboxMode.DOCKER, capabilities=Capability.BASH),
]

manager = SandboxManager(configs, owner="my-scorer")
await manager.start()

# Execute with specific capability - routes to matching executor
result = await manager.execute_with_capabilities(
    "print('hello')",
    Capability.PYTHON
)

# Round-robin across matching executors (`commands` is your own list of code strings)
results = await asyncio.gather(*[
    manager.execute_with_capabilities(cmd, Capability.PYTHON)
    for cmd in commands
])

await manager.stop()
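
Capability-based routing with round-robin can be pictured with a short standalone sketch. The Router class, executor names, and string capabilities below are hypothetical; SandboxManager's actual scheduling may differ.

```python
from itertools import cycle


class Router:
    """Hypothetical router: pick executors whose capabilities cover the request."""

    def __init__(self, executors: list[tuple[str, frozenset[str]]]):
        self._executors = executors
        self._cycles: dict[frozenset[str], cycle] = {}

    def pick(self, required) -> str:
        required = frozenset(required)
        if required not in self._cycles:
            # An executor matches when it offers every required capability.
            matches = [name for name, caps in self._executors if required <= caps]
            if not matches:
                raise LookupError(f"no executor provides {sorted(required)}")
            self._cycles[required] = cycle(matches)
        # Round-robin over the matching executors.
        return next(self._cycles[required])


router = Router([
    ("python-0", frozenset({"python"})),
    ("python-1", frozenset({"python"})),
    ("bash-0", frozenset({"bash"})),
])
picks = [router.pick({"python"}) for _ in range(3)]
print(picks, router.pick({"bash"}))
```

In this sketch, the two python executors correspond to instances=2 in the first SandboxConfig above, so consecutive Python commands alternate between them.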

Sandbox Modes

| Mode | Description |
| --- | --- |
| LOCAL | Run commands locally (development only) |
| DOCKER | Run in Docker/Podman containers |
| MODAL | Run on Modal cloud platform |

Launching on Beaker

olmo-eval includes built-in support for launching evaluation jobs on Beaker.

Installation

The beaker extra is included in the default dev group, so a plain uv sync --frozen is enough. If you previously opted out of the default groups, re-enable it with:

uv sync --frozen --extra beaker

CLI Usage

Launch an evaluation job:

# Basic evaluation
uv run olmo-eval beaker launch -n "eval-llama3-mmlu" -m llama3.1-8b -t mmlu \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Multiple tasks
uv run olmo-eval beaker launch -n "eval-llama3-suite" \
    -m llama3.1-8b \
    -t mmlu -t gsm8k -t hellaswag \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Large model with multiple GPUs
uv run olmo-eval beaker launch \
    --name "eval-70b-full" \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --task mmlu --task gsm8k --task arc_easy \
    --cluster h100 \
    --workspace "ai2/olmo-eval-debug" \
    --budget "ai2/oe-base" \
    --gpus 4 \
    --timeout 48h

# Preview the Beaker spec without launching
uv run olmo-eval beaker launch -n "test" -m llama3.1-8b -t arc_easy \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base" \
    --dry-run

# With a harness preset for tool-augmented evaluation
uv run olmo-eval beaker launch -n "eval-with-tools" \
    -m llama3.1-8b \
    -t simpleqa:judge \
    --harness dr_tulu \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# With inspection flags for debugging
uv run olmo-eval beaker launch -n "debug-eval" \
    -m llama3.1-8b \
    -t mmlu -o limit=10 \
    --inspect-request \
    --inspect-response \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Run external evaluations
uv run olmo-eval beaker launch -n "external-eval" \
    -E tau2_bench \
    -m llama3.1-8b \
    -A domain=airline \
    -A num_tasks=1 \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

Advanced Usage

Local Judge Models

For tasks or custom scorers that use a named auxiliary judge provider, you can run a local judge model alongside the main model. Put harness overrides immediately after --harness, then put task overrides after -t.

uv run olmo-eval beaker launch \
    --harness default \
    -o provider.max_model_len=16384 \
    -o provider.num_instances=1 \
    -o 'metrics.reporters=[file]' \
    -o 'metrics.collect_gpu=true' \
    -o 'provider.kwargs.timeout=300' \
    -o auxiliary_providers.judge.kind=vllm_server \
    -o auxiliary_providers.judge.model=Qwen/Qwen3-8B \
    -o auxiliary_providers.judge.num_instances=1 \
    -o scoring_concurrency=4 \
    -m Qwen/Qwen3-8B \
    -t "simpleqa:judge@urgent" \
    -o limit=10 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base" \
    --cluster h100 \
    --inspect \
    --group olmo-eval-local-judge-2 -y

Per-Task Priorities

Tasks can include an optional @priority suffix to set different priorities per task. Tasks with different priorities will be launched as separate Beaker experiments:

# Mixed priorities - creates separate experiments per priority level
uv run olmo-eval beaker launch -n "eval-suite" -m llama3.1-8b \
    -t "mmlu@high" \
    -t "gsm8k@normal" \
    -t "arc_easy@low" \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Creates 3 experiments:
#   eval-suite-high:   runs mmlu at high priority
#   eval-suite-normal: runs gsm8k at normal priority
#   eval-suite-low:    runs arc_easy at low priority

# With task variants (@ comes after the task spec)
uv run olmo-eval beaker launch -n "eval" -m llama3.1-8b -t "arc_easy:mc@high" \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Tasks without @priority use the config file priority (default: normal)
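
The task spec syntax used above (task name, optional colon-separated variants, optional @priority suffix) can be sketched as a small parser. TaskSpec and parse_task_spec are hypothetical names chosen for this illustration, and the validation against the four documented priority levels is an assumption about how strict the real CLI is.

```python
from typing import NamedTuple

PRIORITIES = ("low", "normal", "high", "urgent")  # levels documented for --priority


class TaskSpec(NamedTuple):
    """Hypothetical parsed form of a task spec such as 'arc_easy:mc@high'."""
    task: str
    variants: tuple[str, ...]
    priority: str


def parse_task_spec(spec: str, default_priority: str = "normal") -> TaskSpec:
    # The @priority suffix comes last, so split it off from the right first.
    body, sep, priority = spec.rpartition("@")
    if not sep:
        body, priority = spec, default_priority
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority {priority!r} in {spec!r}")
    # What remains is the task name followed by optional ':'-separated variants.
    task, *variants = body.split(":")
    return TaskSpec(task, tuple(variants), priority)


print(parse_task_spec("arc_easy:mc@high"))
print(parse_task_spec("humaneval:3shot:bpb"))
```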

Experiment Groups

Groups logically organize experiments for management and result retrieval:

# Launch with grouping
uv run olmo-eval beaker launch -n "benchmark-v1" --group "benchmark-2024" \
    -m llama3.1-8b -m olmo-2-7b \
    -t mmlu -t gsm8k -t hellaswag \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Creates experiment and adds it to "benchmark-2024" group

# Check group status and results
uv run olmo-eval beaker group info benchmark-2024

# Show detailed task info
uv run olmo-eval beaker group info benchmark-2024 --verbose

# Wait for completion and export as CSV
uv run olmo-eval beaker group info benchmark-2024 --wait --format csv > results.csv

# Export as JSON
uv run olmo-eval beaker group info benchmark-2024 --format json

# Watch experiment logs
uv run olmo-eval beaker watch -e <experiment-id>

# Cancel all experiments in a group
uv run olmo-eval beaker group cancel benchmark-2024

# List groups in a workspace
uv run olmo-eval beaker group list -w <workspace>

Inference Provider Configuration

Docker images do NOT include inference providers (vllm, transformers, litellm) by default. Each model must resolve to a provider configuration, either from a built-in model preset or from harness overrides.

Via config file (recommended):

name: eval-mixed-providers
models:
  - llama3.1-8b
  - gpt-4o
tasks:
  - mmlu
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base

CLI Options

| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --config | -f | none | YAML config file (CLI args override config values) |
| --name | -n | auto | Experiment name (auto-generated from model/tasks if not provided) |
| --model | -m | required | Model name or HuggingFace path (can specify multiple) |
| --task | -t | required | Task name with optional @priority suffix (can specify multiple) |
| --harness | -H | none | Harness preset name |
| --override | -o | none | Override for preceding -t or -H (can specify multiple) |
| --cluster | -c | required | Cluster alias (h100, a100, aus) or full name |
| --gpus | -G | auto | Number of GPUs (defaults to 1 for GPU providers, 0 otherwise) |
| --max-gpus-per-node | | 8 | Maximum GPUs per node (tasks split if exceeded) |
| --priority | -p | normal | Job priority (low, normal, high, urgent) |
| --preemptible | | true | Allow preemption |
| --timeout | -T | 24h | Job timeout (e.g., 24h, 30m) |
| --retries | -r | none | Number of retries on failure |
| --workspace | -w | required | Beaker workspace |
| --budget | -B | required | Beaker budget |
| --image | -I | default | Custom Beaker image |
| --group | -g | auto | Add experiments to Beaker group(s) (auto-generated if not specified) |
| --external-eval | -E | none | External evaluation name(s) to run instead of tasks |
| --eval-arg | -A | none | Arguments for external evals (key=value) |
| --provider-kwarg | -K | none | Provider kwargs for external evals (key=value) |
| --force-download-model | | false | Refresh Hugging Face model/tokenizer cache before loading |
| --uv-cache-dir | | default | UV cache directory for package downloads |
| --dry-run | -d | false | Print spec without launching |
| --yes | -y | false | Skip confirmation prompt |
| --follow/--no-follow | | true | Follow logs after launch |
| --secret-env | | none | Map Beaker secret to env var (SECRET:VAR) |
| --aws-credentials | | auto | Inject AWS credentials (auto-detected from s3:// paths) |
| --gcp-credentials | | auto | Inject GCP credentials (auto-detected from gs:// model paths) |
| --store | | false | Persist results to configured database |
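
The --timeout option takes compact duration strings such as 24h or 30m. As a rough illustration of how such values map to seconds, here is a minimal sketch; timeout_to_seconds is a hypothetical helper, and only the h and m units appear in the documented examples (the s and d units are assumptions).

```python
import re

# 'h' and 'm' appear in the documented examples; 's' and 'd' are assumptions.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def timeout_to_seconds(value: str) -> int:
    """Convert a duration such as '24h' or '30m' to a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"expected a value like '24h' or '30m', got {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


print(timeout_to_seconds("24h"))  # 86400
print(timeout_to_seconds("30m"))  # 1800
```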

Per-Task Overrides

Use the -o/--override flag to apply configuration overrides to the preceding -t:

# Task overrides (apply to the preceding -t)
uv run olmo-eval beaker launch -n "eval" \
    -m llama3.1-8b \
    -t mmlu -o limit=100 -o num_fewshot=5 \
    -t gsm8k -o limit=50 \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"
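
Because -o is positional, each override belongs to whichever -t (or -H/--harness) flag came before it. The sketch below shows one way that binding could work; bind_overrides is a hypothetical function written for this illustration, not the CLI's actual parser, and it assumes that repeating the same task merges its overrides.

```python
def bind_overrides(argv: list[str]) -> tuple[list[str], dict[str, list[str]]]:
    """Attach each -o value to the most recent -t (task) or -H/--harness flag."""
    harness_overrides: list[str] = []
    task_overrides: dict[str, list[str]] = {}
    target: list[str] | None = None  # list that the next -o appends to

    tokens = iter(argv)
    for token in tokens:
        if token in ("-t", "--task"):
            target = task_overrides.setdefault(next(tokens), [])
        elif token in ("-H", "--harness"):
            next(tokens)  # skip the preset name
            target = harness_overrides
        elif token in ("-o", "--override"):
            if target is None:
                raise ValueError("-o must follow a -t or --harness flag")
            target.append(next(tokens))
    return harness_overrides, task_overrides


argv = [
    "--harness", "default", "-o", "provider.kind=vllm_server",
    "-m", "llama3.1-8b",
    "-t", "mmlu", "-o", "limit=100", "-o", "num_fewshot=5",
    "-t", "gsm8k", "-o", "limit=50",
]
harness, tasks = bind_overrides(argv)
print(harness)
print(tasks)
```

Here the harness receives provider.kind=vllm_server, mmlu receives limit=100 and num_fewshot=5, and gsm8k receives only limit=50.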

The -o flag uses OmegaConf dotlist syntax, supporting:

| Type | Syntax | Example |
| --- | --- | --- |
| String | key=value | -o formatter.template="Q: {q}" |
| Number | key=123 | -o limit=100 |
| Boolean | key=true | -o preemptible=false |
| Nested | a.b.c=val | -o scorer.normalize=true |
| List | key=[a,b] | -o 'dependencies=[pkg1, pkg2]' |

Note: Quote complex values to prevent shell interpretation:

# Good - single quotes protect the value
-o 'extra_config={key: value, nested: {a: 1}}'
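
To illustrate how dotted keys turn into nested configuration, here is a simplified stand-in written with the standard library only. It is not OmegaConf and does not reproduce its parsing rules: it handles scalar values only (no lists or inline dicts), and dotlist_to_dict and _parse_scalar are hypothetical names.

```python
def _parse_scalar(text: str):
    """Interpret true/false and numbers; everything else stays a string."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def dotlist_to_dict(overrides: list[str]) -> dict:
    """Build a nested dict from 'a.b.c=value' strings (scalars only)."""
    config: dict = {}
    for item in overrides:
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = _parse_scalar(value)
    return config


print(dotlist_to_dict(["limit=100", "scorer.normalize=true", "formatter.template=Q: {q}"]))
```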

Secret Environment Overrides

By default, Beaker secrets are mapped using the pattern {username}_{ENV_VAR} (e.g., ai2-tylerm_OPENAI_API_KEY). Use --secret-env to override this with a custom Beaker secret name:

# Use a team-shared secret instead of your personal secret
uv run olmo-eval beaker launch -n "eval" -m gpt-4o -t mmlu \
    --secret-env team-openai-key:OPENAI_API_KEY \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Multiple secret overrides
uv run olmo-eval beaker launch -n "eval" -m gpt-4o -t simpleqa:judge \
    --harness dr_tulu \
    --secret-env team-openai-key:OPENAI_API_KEY \
    --secret-env shared-serper-key:SERPER_API_KEY \
    --secret-env shared-s2-key:S2_API_KEY \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

Format: BEAKER_SECRET_NAME:ENV_VAR_NAME

This is useful for:

  • Using team-shared API keys instead of personal secrets
  • Testing with different credential sets
  • Sharing jobs that use organization-level secrets
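
The mapping rule above (default {username}_{ENV_VAR}, overridden by SECRET:VAR pairs) can be sketched as follows. resolve_secrets is a hypothetical helper written for this illustration; how the launcher actually decides which environment variables a job needs is not shown here.

```python
def resolve_secrets(env_vars: list[str], username: str, secret_env: list[str]) -> dict[str, str]:
    """Return {ENV_VAR: beaker_secret_name} for each required variable."""
    overrides: dict[str, str] = {}
    for mapping in secret_env:
        # Format: BEAKER_SECRET_NAME:ENV_VAR_NAME
        secret, sep, env_var = mapping.partition(":")
        if not sep or not secret or not env_var:
            raise ValueError(f"expected SECRET:VAR, got {mapping!r}")
        overrides[env_var] = secret
    # Fall back to the default {username}_{ENV_VAR} pattern when no override is given.
    return {var: overrides.get(var, f"{username}_{var}") for var in env_vars}


resolved = resolve_secrets(
    ["OPENAI_API_KEY", "SERPER_API_KEY"],
    username="ai2-tylerm",
    secret_env=["team-openai-key:OPENAI_API_KEY"],
)
print(resolved)
```

In this example OPENAI_API_KEY is read from the shared team-openai-key secret, while SERPER_API_KEY still falls back to the personal ai2-tylerm_SERPER_API_KEY secret.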

YAML Configuration

For complex or reusable configurations, use YAML config files with the --config/-f option. CLI arguments override values from the config file.

Basic config file (eval_config.yaml):

name: eval-llama3-core
models:
  - llama3.1-8b
tasks:
  - mmlu
  - gsm8k
  - hellaswag
  - arc_challenge

cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 1
priority: normal
timeout: 24h

Usage:

# Run from config file
uv run olmo-eval beaker launch -f eval_config.yaml --dry-run

# Override specific values
uv run olmo-eval beaker launch -f eval_config.yaml --gpus 4

# Add additional models via CLI
uv run olmo-eval beaker launch -f eval_config.yaml -m olmo-2-7b
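
The precedence between the config file and the command line can be pictured with a short sketch. merge_cli_into_config is a hypothetical function, and its semantics are assumptions read off the comments above: scalar options given on the CLI replace the config value, while -m adds models to the configured list rather than replacing it.

```python
def merge_cli_into_config(config: dict, cli: dict) -> dict:
    """Overlay CLI values on config values (assumed semantics, see lead-in)."""
    merged = dict(config)
    for key, value in cli.items():
        if value is None:
            continue  # flag not given on the command line
        if key == "models":
            # Assumption: -m adds to the configured models, without duplicates.
            merged["models"] = list(dict.fromkeys([*config.get("models", []), *value]))
        else:
            merged[key] = value
    return merged


config = {"name": "eval-llama3-core", "models": ["llama3.1-8b"], "gpus": 1, "timeout": "24h"}
merged = merge_cli_into_config(config, {"gpus": 4, "models": ["olmo-2-7b"], "timeout": None})
print(merged)
```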

Multi-model comparison config:

name: eval-model-comparison
models:
  - llama3.1-8b
  - olmo-2-7b
  - mistral-7b
tasks:
  - mmlu
  - gsm8k
  - hellaswag
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 1

Per-task priorities in config (examples/configs/prioritized_tasks.yaml):

Add an @priority suffix to a task to run it at a specific priority level. Tasks are grouped by priority, and each priority level becomes its own Beaker experiment:

name: eval-prioritized
models:
  - llama3.1-8b
  - olmo-2-7b
tasks:
  # High priority - run first
  - mmlu@high
  - gsm8k@high
  # Normal priority
  - hellaswag@normal
  - arc_challenge@normal
  # Low priority - run when resources available
  - winogrande@low
  - arc_easy@low
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 1
timeout: 24h

This creates 3 experiments (one per priority level, with both models in each):

eval-prioritized-high:   models=[llama3.1-8b, olmo-2-7b], tasks=[mmlu, gsm8k]
eval-prioritized-normal: models=[llama3.1-8b, olmo-2-7b], tasks=[hellaswag, arc_challenge]
eval-prioritized-low:    models=[llama3.1-8b, olmo-2-7b], tasks=[winogrande, arc_easy]
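
The grouping shown above can be reproduced with a few lines of Python. This is a sketch under stated assumptions, not the launcher's code: plan_experiments is a hypothetical name, experiments are named {name}-{priority} as in the listing, and tasks without a suffix fall back to a default priority.

```python
from collections import defaultdict


def plan_experiments(name: str, models: list[str], tasks: list[str],
                     default_priority: str = "normal") -> dict[str, dict]:
    """Group task specs by @priority and name one experiment per level."""
    by_priority: dict[str, list[str]] = defaultdict(list)
    for spec in tasks:
        task, sep, priority = spec.rpartition("@")
        if not sep:  # no @priority suffix: use the default
            task, priority = spec, default_priority
        by_priority[priority].append(task)
    # Every experiment runs all models; only the task list differs.
    return {
        f"{name}-{priority}": {"models": list(models), "tasks": grouped}
        for priority, grouped in by_priority.items()
    }


plan = plan_experiments(
    "eval-prioritized",
    ["llama3.1-8b", "olmo-2-7b"],
    ["mmlu@high", "gsm8k@high", "hellaswag@normal",
     "arc_challenge@normal", "winogrande@low", "arc_easy@low"],
)
for experiment, spec in plan.items():
    print(experiment, spec)
```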

Large model config:

name: eval-70b-full
models:
  - meta-llama/Llama-3.1-70B-Instruct
tasks:
  - mmlu
  - gsm8k
  - hellaswag
cluster: h100
workspace: ai2/olmo-eval-debug
budget: ai2/oe-base
gpus: 4
priority: high
preemptible: false
timeout: 48h
retries: 2
description: "Full evaluation suite for Llama 70B"

Config file fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | yes | Experiment name |
| models | list | yes | List of model names or presets |
| tasks | list | yes | List of task specs (with optional @priority) |
| cluster | string | yes | Cluster alias or full name |
| gpus | int | no | Default GPUs per model instance (auto-detected based on provider) |
| max_gpus_per_node | int | no | Max GPUs per node, splits tasks if exceeded (default: 8) |
| priority | string | no | Default priority (default: normal) |
| preemptible | bool | no | Allow preemption (default: true) |
| timeout | string | no | Job timeout (default: 24h) |
| retries | int | no | Retry count on failure |
| workspace | string | yes | Beaker workspace |
| budget | string | yes | Beaker budget |
| beaker_image | string | no | Container image to use (config-only) |
| description | string | no | Optional Beaker description |
| groups | list | no | Beaker groups to add experiments to |

See examples/beaker/configs/ for more configuration examples.

Cluster Aliases

# List available cluster aliases
uv run olmo-eval beaker clusters

Programmatic API

from olmo_eval.launch import BeakerJobConfig, BeakerLauncher

config = BeakerJobConfig(
    name="eval-llama3-mmlu",
    command=["uv", "run", "olmo-eval", "run", "-m", "llama3.1-8b", "-t", "mmlu"],
    cluster="h100",
    num_gpus=1,
)

launcher = BeakerLauncher()
experiment = launcher.launch(config)
print(f"Launched: {launcher.beaker.experiment.url(experiment)}")

Docker Image Management

Docker images provide the runtime environment (Python, PyTorch, CUDA) but do NOT include:

  • Source code - Gantry mounts your git repository at runtime
  • Inference providers - Installed at job startup from each model's resolved provider config

This approach allows you to:

  • Use any git commit without rebuilding images
  • Keep images small and cacheable

Building Images

Images are tagged with CUDA and PyTorch versions: cu{version}-trc{version}-{arch}

# Build with defaults
./scripts/build_image.sh

# Specific CUDA + PyTorch version
./scripts/build_image.sh --cuda-version 12.8.1 --torch-version 2.9.0

# Production build
./scripts/build_image.sh --platform linux/amd64

# See supported CUDA+PyTorch pairs
./scripts/build_image.sh --help

  • Supported CUDA versions: 12.6.1, 12.8.0, 12.8.1, 12.9.1
  • PyTorch version: Configurable via --torch-version
  • Configuration: See scripts/build_config.sh

Pushing Images

# Push most recent build
./scripts/beaker/push_beaker_image.sh

# Preview without pushing
./scripts/beaker/push_beaker_image.sh --dry-run

The script auto-detects the image name from the tag of the most recent build (e.g., olmo-eval-cu128-trc291-amd64).
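
For scripts that need to inspect such names, the cu{version}-trc{version}-{arch} layout can be split apart with a regular expression. The sketch below is illustrative only: parse_image_tag and TAG_PATTERN are hypothetical, and the assumption that both version fields are digit-only is based on the single example name in this README.

```python
import re

# Pattern for "<prefix>-cu<cuda>-trc<torch>-<arch>"; the digit-only version
# fields are an assumption based on the single example in this README.
TAG_PATTERN = re.compile(
    r"^(?P<prefix>.+)-cu(?P<cuda>\d+)-trc(?P<torch>\d+)-(?P<arch>[a-z0-9]+)$"
)


def parse_image_tag(tag: str) -> dict[str, str]:
    """Split an image name into its prefix, CUDA, PyTorch, and arch parts."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised image tag: {tag!r}")
    return match.groupdict()


print(parse_image_tag("olmo-eval-cu128-trc291-amd64"))
```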

What's in the Image

The image contains:

  • Python 3.12 (via uv)
  • PyTorch with CUDA support
  • System dependencies (git, uv, ca-certificates)

The image does NOT contain:

  • olmo-eval source code (provided by Gantry at runtime)
  • olmo-eval dependencies like click, datasets, rich, etc. (installed at job startup)
  • Storage backends like boto3, psycopg (installed at job startup if needed)
  • Inference providers like vllm, transformers, litellm (installed at job startup)

Installing Inference Providers at Runtime

Inference providers are NOT baked into images. They are installed at job startup from the resolved provider configuration for each model:

# In the config file (YAML)
models:
  - llama3.1-8b
  - gpt-4o

# Or, on the command line, force the provider kind via a harness override
uv run olmo-eval beaker launch -n "eval" \
    --harness default \
    -o provider.kind=vllm_server \
    -m llama3.1-8b \
    -t mmlu \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Manual installation inside container
uv pip install -e '.[vllm]'  # includes vllm[runai]

Task-Specific Dependencies

Tasks can declare runtime dependencies that are installed at job startup (see Tasks). Dependencies are automatically merged, deduplicated, and installed after the inference provider.

You can also add or override dependencies via the CLI:

# Add dependencies to a task via -o flag
uv run olmo-eval beaker launch -n "eval" -m llama3.1-8b \
    -t humaneval:3shot:bpb -o 'dependencies=["code-sandbox==1.0", "git+https://github.com/user/repo@v2.0"]' \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"

# Dependencies from multiple tasks are merged
uv run olmo-eval beaker launch -n "eval" -m llama3.1-8b \
    -t humaneval:3shot:bpb -o 'dependencies=["pkg1"]' \
    -t mbpp:3shot:bpb -o 'dependencies=["pkg2"]' \
    --cluster h100 \
    -w "ai2/olmo-eval-debug" \
    -B "ai2/oe-base"
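
The merge-and-deduplicate step can be sketched as an order-preserving union of the per-task lists. merge_dependencies is a hypothetical helper; it only removes exact duplicate strings, whereas the real launcher may normalize package names or reconcile conflicting version pins.

```python
def merge_dependencies(per_task: dict[str, list[str]]) -> list[str]:
    """Flatten per-task dependency lists, dropping exact duplicates in order."""
    merged: dict[str, None] = {}  # dict keys keep first-seen order
    for deps in per_task.values():
        for dep in deps:
            merged.setdefault(dep.strip(), None)
    return list(merged)


deps = merge_dependencies({
    "humaneval:3shot:bpb": ["pkg1", "code-sandbox==1.0"],
    "mbpp:3shot:bpb": ["pkg2", "code-sandbox==1.0"],
})
print(deps)
```

The shared code-sandbox==1.0 requirement is installed once, and the remaining packages keep the order in which their tasks were listed.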

Development

This repo uses uv with a checked-in uv.lock for reproducible installs. The default dependency groups (dev + vllm) are installed automatically, which covers storage, beaker, hf, and the vLLM inference provider.

# Install dependencies from the lockfile
uv sync --frozen

# Install pre-commit hooks
make setup

# Run linter / formatter
make lint
make fix    # auto-fix

# Run tests (and type checks)
make test
make verify

# Update the lockfile after editing pyproject.toml
uv lock

CI runs uv sync --frozen and uv run --frozen ..., so any change to pyproject.toml must be accompanied by a refreshed uv.lock.