Evaluation Framework

April 8, 2026

Three layers for evaluating models trained with Tinker.

1. Evaluators — Inline Training Eval

Lightweight evaluation hooks called every N training steps; each returns a dict[str, float] of metrics.

from tinker_cookbook.eval import SamplingClientEvaluator

class MyEval(SamplingClientEvaluator):
    async def __call__(self, sampling_client):
        # generate, grade, return metrics
        return {"eval/accuracy": 0.85}

Pass evaluators to your training loop via evaluator_builders.
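
Builders are zero-argument callables (the same shape as the lambda builders shown later for BenchmarkEvaluator), so the loop can construct evaluators lazily once a sampling client exists. A minimal sketch:

def build_my_eval() -> SamplingClientEvaluator:
    # The training loop calls each builder once, then invokes the
    # resulting evaluator every N steps and logs its metrics.
    return MyEval()

evaluator_builders = [build_my_eval]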

2. Benchmarks — Standalone Evaluation

Full benchmark framework reusing the RL Env abstraction. Each benchmark creates Env instances; the runner handles concurrency, trajectory storage, and aggregation.

Run benchmarks

from tinker_cookbook.eval.benchmarks import run_benchmark, run_benchmarks

# Single benchmark
result = await run_benchmark("gsm8k", sampling_client, renderer)
print(f"GSM8K: {result.score:.1%}")  # GSM8K: 78.3%

# When save_dir is set, the runner automatically resumes from previously completed examples.

# Multiple benchmarks (parallel by default)
results = await run_benchmarks(
    ["gsm8k", "mmlu_pro", "ifeval"],
    sampling_client, renderer,
    BenchmarkConfig(save_dir="evals/step500", max_examples=200),
)

Available benchmarks

Stable benchmarks — verified against published scores:

| Benchmark | Type | Grading | Prerequisites |
|---|---|---|---|
| gsm8k | Single-turn | Programmatic (numeric) | |
| math500 | Single-turn | Programmatic (numeric) | |
| aime_2025 | Single-turn | Programmatic (numeric) | |
| aime_2026 | Single-turn | Programmatic (numeric) | |
| mmlu_pro | Single-turn | Programmatic (MCQA) | |
| mmlu_redux | Single-turn | Programmatic (MCQA) | |
| gpqa | Single-turn | Programmatic (MCQA) | HF auth (gated) |
| ifeval | Single-turn | Programmatic (IF constraints) | |
| mbpp | Single-turn | Code execution | Modal |
| ceval | Single-turn | Programmatic (MCQA, Chinese) | |
| supergpqa | Single-turn | Programmatic (MCQA, 4-10 options) | |
| ifbench | Single-turn | IF constraints (58 types) | Requires ifbench package. 67.3% on Qwen3.5-35B-A3B (official 70.2%) |

Experimental benchmarks (_-prefixed modules) — functional but need further validation:

| Benchmark | Type | Grading | Status |
|---|---|---|---|
| hmmt_feb_2025 | Single-turn | LaTeX answer (sympy) | Sympy grading, requires antlr4 |
| hmmt_nov_2025 | Single-turn | LaTeX answer (sympy) | Sympy grading, requires antlr4 |
| arena_hard | Single-turn | LLM-as-judge | Works with self-judge, needs cross-model judge |
| longbench | Single-turn | Programmatic | Limited by 65K context window |
| livecodebench | Single-turn | Code execution (Modal) | 47.4% on Qwen3.5-35B-A3B (needs 1800s timeout) |
| bfcl | Single-turn | Function call AST | Ground truth format mismatch |
| terminal_bench | Multi-turn | Sandbox + tests (Modal) | 27.7% on Qwen3.5-35B-A3B (ctx overflow on 65K model) |
| swe_bench | Multi-turn | Sandbox + pytest (Modal) | 0%: 65K context too small for multi-turn repo exploration |
| tau2_bench | Multi-turn | Tool dispatch + user sim | 30% (needs separate user simulator model) |

Prerequisites:

Install all eval dependencies at once:

uv pip install 'tinker-cookbook[eval]'

Or install only what you need per benchmark:

uv pip install 'tinker-cookbook[eval-math500]'        # math-verify, pylatexenc, sympy
uv pip install 'tinker-cookbook[eval-hmmt]'            # antlr4 for sympy LaTeX parsing
uv pip install 'tinker-cookbook[eval-mbpp]'            # Modal sandbox
uv pip install 'tinker-cookbook[eval-livecodebench]'   # Modal sandbox
uv pip install 'tinker-cookbook[eval-terminal-bench]'  # Modal sandbox
uv pip install 'tinker-cookbook[eval-swe-bench]'       # Modal sandbox
uv pip install 'tinker-cookbook[eval-ifbench]'         # nltk, emoji, syllapy, langdetect

Additional setup:

  • IFBench: Also requires uv pip install 'ifbench @ git+https://github.com/allenai/IFBench.git' (not on PyPI). The benchmark raises ImportError without it.
  • HF auth (gated): Set HF_TOKEN or run huggingface-cli login for gated datasets (GPQA).
  • Modal auth: Run modal token new for sandbox benchmarks (MBPP, LiveCodeBench, Terminal Bench, SWE-bench).
  • judge_sampling_client: Benchmarks using LLM-as-judge or user simulation require a separate Tinker sampling client for the judge model. Pass via BenchmarkConfig(judge_sampling_client=..., judge_renderer=...).
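
For example, a cross-model judge setup might look like the sketch below. How the second sampling client is created (the create_sampling_client call and judge model name here) depends on your Tinker setup and is illustrative:

# Illustrative: a second sampling client serves as the judge model.
judge_client = service_client.create_sampling_client(base_model="my-judge-model")
config = BenchmarkConfig(
    judge_sampling_client=judge_client,
    judge_renderer=judge_renderer,  # renderer matching the judge model
    save_dir="evals/arena_hard",
)
result = await run_benchmark("arena_hard", sampling_client, renderer, config)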

Browse results

from tinker_cookbook.eval.benchmarks import load_result, load_trajectories, print_trajectory

# Load aggregated score
result = load_result("evals/step500", "gsm8k")
print(f"{result.name}: {result.score:.1%} ({result.num_correct}/{result.num_examples})")

# Browse incorrect examples
wrong = load_trajectories("evals/step500", "gsm8k", incorrect_only=True)
for t in wrong[:5]:
    print(f"Expected: {t.logs['expected']}, Got: {t.logs['extracted']}")
    print_trajectory(t)

Pass@k evaluation

When num_samples > 1, the runner evaluates each example multiple times and computes unbiased pass@k estimates, using the estimator from the Codex paper (Chen et al., 2021):

config = BenchmarkConfig(num_samples=10, save_dir="evals/pass_at_k")
result = await run_benchmark("mbpp", sampling_client, renderer, config)
print(result.pass_at_k)  # {1: 0.45, 5: 0.72, 10: 0.85}
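
For n samples per example with c correct, the estimator is pass@k = 1 - C(n-c, k) / C(n, k): the probability that a random size-k subset contains at least one correct sample. A self-contained sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k (Chen et al., 2021): probability that at least
    # one of k samples drawn without replacement from n is correct.
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_k(10, 4, 1)  # 0.4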

Use benchmarks as inline training evaluators

BenchmarkEvaluator bridges any benchmark into the SamplingClientEvaluator interface:

from tinker_cookbook.eval.benchmark_evaluator import BenchmarkEvaluator

evaluator_builders = [
    lambda: BenchmarkEvaluator("gsm8k", renderer, max_examples=100),
    lambda: BenchmarkEvaluator("ifeval", renderer, max_examples=50),
]

Add a new benchmark

  1. Create tinker_cookbook/eval/benchmarks/my_benchmark.py
  2. Implement a MessageEnv (recommended) — the renderer handles thinking-token stripping and prompt building automatically:
from tinker_cookbook.eval.benchmarks._common import build_messages
from tinker_cookbook.renderers import get_text_content
from tinker_cookbook.renderers.base import Message
from tinker_cookbook.rl.message_env import MessageEnv, MessageStepResult

class MyMessageEnv(MessageEnv):
    def __init__(self, question: str, expected: str, example_id: str = ""):
        self.question = question
        self.expected = expected
        self.example_id = example_id

    async def initial_observation(self) -> list[Message]:
        return build_messages(self.question)

    async def step(self, message: Message) -> MessageStepResult:
        response = get_text_content(message)  # thinking already stripped
        correct = self.expected in response
        return MessageStepResult(
            reward=1.0 if correct else 0.0,
            episode_done=True,
            next_messages=[],
            metrics={"correct": float(correct)},
            logs={"example_id": self.example_id, "expected": self.expected},
        )
  3. Implement a BenchmarkBuilder that creates envs and wraps them with EnvFromMessageEnv:
from datasets import load_dataset

from tinker_cookbook.eval.benchmarks import register
from tinker_cookbook.eval.benchmarks._common import make_example_id
from tinker_cookbook.eval.benchmarks._types import BenchmarkBuilder, BenchmarkConfig
from tinker_cookbook.rl.message_env import EnvFromMessageEnv

class MyBenchmarkBuilder(BenchmarkBuilder):
    name = "my_benchmark"

    def make_envs(self, renderer, config: BenchmarkConfig):
        ds = load_dataset("my/dataset", split="test")
        if config.max_examples is not None:
            ds = ds.select(range(min(config.max_examples, len(ds))))
        envs = []
        for row in ds:
            # Deterministic content hash for stable cross-run comparison.
            msg_env = MyMessageEnv(
                row["question"],
                row["answer"],
                example_id=make_example_id("my_benchmark", row["question"]),
            )
            envs.append(EnvFromMessageEnv(
                renderer=renderer,
                message_env=msg_env,
                failed_parse_reward=0.0,
                context_overflow_reward=0.0,
            ))
        return envs

register(MyBenchmarkBuilder())

Key points:

  • MessageEnv + EnvFromMessageEnv: Thinking-token stripping and context overflow handling are automatic. Your step() receives a clean message with thinking already removed.
  • example_id: Set self.example_id on your MessageEnv for stable cross-run comparison and resumability. Use make_example_id(prefix, text) for a deterministic content hash. EnvFromMessageEnv forwards it automatically. Without it, the runner falls back to positional index (fragile).
  • failed_parse_reward=0.0, context_overflow_reward=0.0: Truncated or unparseable responses score 0 and are tracked in BenchmarkResult.num_truncated.
  • Sandbox benchmarks: Use SandboxMixin from _common.py and set requires_sandbox = True on the builder. See mbpp.py for an example.
  • Multi-turn benchmarks: Set multi_turn = True on the builder (uses agent_concurrency instead of concurrency). See _terminal_bench.py for an example.
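
Once the new module is imported (so register() runs at import time), the benchmark is available through the same entry points as the built-ins:

from tinker_cookbook.eval.benchmarks import run_benchmark

import tinker_cookbook.eval.benchmarks.my_benchmark  # noqa: F401 (runs register)

result = await run_benchmark("my_benchmark", sampling_client, renderer)
print(f"my_benchmark: {result.score:.1%}")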

3. EvalStore — Cross-Checkpoint Comparison

Persistent, file-based storage for tracking evaluation across checkpoints. Matches examples by example_id to identify regressions and improvements.

from tinker_cookbook.stores.eval_store import EvalStore
from tinker_cookbook.eval.benchmarks import run_benchmarks, BenchmarkConfig

store = EvalStore("~/experiments/evals")

# Run evals for a checkpoint
run_id = store.create_run(
    model_name="nvidia/...",
    checkpoint_name="sft_step500",
    benchmarks=["gsm8k", "ifeval"],
)
await run_benchmarks(
    ["gsm8k", "ifeval"], sampling_client, renderer,
    BenchmarkConfig(save_dir=store.run_dir(run_id)),
)
store.finalize_run(run_id)

# Query results
result = store.read_result(run_id, "gsm8k")
print(f"GSM8K: {result.score:.1%}")

# List all runs
for run in store.list_runs():
    print(f"{run.run_id}: {run.scores}")

Storage layout

eval_store/
  runs.jsonl                          # Append-only index
  runs/
    sft_step500_20260327_143022/
      metadata.json                   # Model, checkpoint, config, scores
      gsm8k/
        result.json                   # Aggregated BenchmarkResult
        trajectories.jsonl            # Per-example StoredTrajectory
      ifeval/
        result.json
        trajectories.jsonl

Configuration

BenchmarkConfig controls runtime behavior:

| Parameter | Default | Description |
|---|---|---|
| max_examples | None (all) | Limit number of examples |
| concurrency | 64 | Max concurrent rollouts (single-turn) |
| agent_concurrency | 8 | Max concurrent rollouts (multi-turn) |
| timeout_seconds | 300 | Per-example timeout |
| max_tokens | 32768 | Max generation tokens |
| temperature | 0.6 | Sampling temperature |
| num_samples | 1 | Samples per example for pass@k evaluation |
| save_dir | None | Directory for saving trajectories/results |
| judge_sampling_client | None | Sampling client for LLM-as-judge benchmarks |

Important: scores are setup-dependent

Benchmark scores are highly sensitive to evaluation settings. Small changes in max_tokens, temperature, system_prompt, or timeout_seconds can shift scores by 10–30%. Always document your exact configuration when reporting results.

Common pitfalls with thinking models:

  • max_tokens truncation: Thinking models generate long reasoning chains that may fill max_tokens before producing an answer. For LiveCodeBench v6, 78/91 wrong answers were truncated at 32K tokens — increasing max_tokens to 64K would likely recover most of them.
  • Timeouts: Thinking models need 1800s+ for code benchmarks. LiveCodeBench went from 20% (600s) to 47.4% (1800s) on Qwen3.5-35B-A3B.
  • Context overflow: Multi-turn benchmarks (terminal_bench, swe_bench) can exceed the model's context window as conversations grow. The 65K context window of Qwen3.5-35B-A3B is insufficient for SWE-bench.
  • System prompt: GSM8K improved from 84.7% to 95.6% by instructing the model to use \boxed{}.

Treat these scores as reference points for a specific configuration, not definitive model capabilities. The framework's primary value is consistent, reproducible evaluation — not producing leaderboard numbers.

Verification

Reference scores on Qwen3.5-35B-A3B with max_tokens=32768, temperature=0.6. Official scores from the model card (which may use different settings).

Stable benchmarks:

| Benchmark | Our Score | Official | Match? | Settings |
|---|---|---|---|---|
| MMLU-Pro | 85.2%* | 85.3 | Match | 32K tokens |
| MMLU-Redux | 93.5% | 93.3 | Match | 32K tokens |
| GPQA Diamond | 89.7%* | 84.2 | Above* | 32K tokens |
| IFEval | 92.5%* | 91.9 | Match | 32K tokens |
| GSM8K | 95.6%* | | | system_prompt=\boxed{}, 32K tokens |
| MATH-500 | 97.9%* | | | system_prompt=\boxed{}, 32K tokens |
| MBPP | 84.4%* | | | Modal sandbox, 32K tokens |
| AIME 2026 (pass@4) | 90.0% | 93.33 | Close | system_prompt=\boxed{}, 32K tokens |

* Excluding context overflow — the thinking model's reasoning chain exceeds context on some examples. These are scored as failures (reward=0).

Experimental benchmarks (Modal sandbox):

| Benchmark | Our Score | Official | Notes |
|---|---|---|---|
| LiveCodeBench v6 | 47.4% (175 ex) | 74.6 | 78/91 wrong due to 32K truncation; excl. truncated: 86.5% |
| Terminal Bench 2 | 27.7% (112 ex) | 40.5 | 24 ctx overflow + 14 timeout on 65K model |
| SWE-bench Verified | 0% (500 ex) | 69.2 | 65K context too small: all ctx overflow |
| TAU2-Bench | 30.0% (50 ex) | 81.2 | Same-model user sim limits score; official uses GPT-4.1 |

Verified scores

Reference scores using BenchmarkConfig.for_model() with recommended settings.
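
The exact for_model signature is not shown here; the pattern is roughly the following sketch (arguments illustrative):

# Illustrative: for_model fills in recommended settings (max_tokens,
# timeouts, context handling) for a given base model.
config = BenchmarkConfig.for_model("Qwen/Qwen3.5-35B-A3B")
results = await run_benchmarks(["gsm8k", "ifeval"], sampling_client, renderer, config)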

GPT-OSS-120B (128K context)

Model: openai/gpt-oss-120b:peft:131072. Renderer: gpt_oss_high_reasoning. Official scores from the GPT-OSS technical report.

| Benchmark | Raw | Completed | Official | Match? |
|---|---|---|---|---|
| GPQA Diamond | 80.8% | 80.8% | 80.1% | Match |
| MMLU-Pro | 80.6% | 80.6% | 90.0% (MMLU, different) | Different benchmark |
| GSM8K | 95.9% | 95.9% | | |
| MATH-500 | 95.4% | 95.4% | | |
| IFEval | 91.7% | 91.7% | | |
| AIME 2025 | 76.7% | 76.7% | 92.5% (with tools) | No tools in our eval |
| Terminal Bench | 28.6% | 28.6% | | Not in paper |
| SWE-bench Verified | 2.2% | 2.3% | 62.4% | Agent scaffold gap (see below) |

Zero truncation across all benchmarks — 128K context is sufficient for all prompts. Raw and Completed scores are identical (no max_tokens truncation issues).

SWE-bench gap: The official eval uses a specialized agent scaffold with file editing tools. Our harness provides only a bash tool — the model reads code but rarely generates sed edits. Improving the tool scaffold (e.g., adding a str_replace_editor) is expected to close most of this gap. See mini-swe-agent for a reference bash-only implementation that achieves 74%+ with frontier models.

Qwen3.5-35B-A3B (64K context)

Model: Qwen/Qwen3.5-35B-A3B. Renderer: qwen3_5. Official scores from the Qwen3.5-35B-A3B model card.

| Benchmark | Raw | Completed | Official | Match? |
|---|---|---|---|---|
| MMLU-Redux | 89.2% | 93.8% | 93.3 | Match |
| GPQA Diamond | 70.7% | 89.7% | 84.2 | Above |
| IFEval | 86.9% | 92.5% | 91.9 | Match |
| C-Eval | 89.2% | 90.1% | 90.2 | Match |
| SuperGPQA | ~59% | ~67% | 63.4 | Match |
| MATH-500 | 92.0% | 97.9% | | |
| GSM8K | 81.7% | 88.0% | | |
| MBPP | 84.4% | 87.1% | | |
| IFBench | 67.3% | | 70.2 | Match |
| AIME 2026 pass@4 | 96.7% | | 93.33 | Above |

"Completed" excludes truncated examples (model hit max_tokens before answering). For thinking models, score_completed is the right comparison against published scores.

Testing

pytest tinker_cookbook/eval/benchmarks/benchmark_test.py