Evaluation Framework
April 8, 2026
Three layers for evaluating models trained with Tinker.
1. Evaluators — Inline Training Eval
Lightweight interfaces called every N training steps; each returns a dict[str, float] of metrics.
from tinker_cookbook.eval import SamplingClientEvaluator
class MyEval(SamplingClientEvaluator):
    async def __call__(self, sampling_client):
        # generate, grade, return metrics
        return {"eval/accuracy": 0.85}
Pass evaluators to your training loop via evaluator_builders.
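A minimal sketch of the wiring, assuming a training recipe whose config accepts the evaluator_builders field named above (the exact Config class varies by recipe):

# evaluator_builders is a list of zero-argument callables, each returning a
# SamplingClientEvaluator; the training loop runs them every N steps.
evaluator_builders = [
    lambda: MyEval(),
    # add more builders as needed; each is constructed lazily at eval time
]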
2. Benchmarks — Standalone Evaluation
Full benchmark framework reusing the RL Env abstraction. Each benchmark creates Env instances; the runner handles concurrency, trajectory storage, and aggregation.
Run benchmarks
from tinker_cookbook.eval.benchmarks import run_benchmark, run_benchmarks, BenchmarkConfig

# Single benchmark
result = await run_benchmark("gsm8k", sampling_client, renderer)
print(f"GSM8K: {result.score:.1%}")  # GSM8K: 78.3%

# Multiple benchmarks (parallel by default). When save_dir is set, the runner
# automatically resumes from previously completed examples.
results = await run_benchmarks(
    ["gsm8k", "mmlu_pro", "ifeval"],
    sampling_client, renderer,
    BenchmarkConfig(save_dir="evals/step500", max_examples=200),
)
Available benchmarks
Stable benchmarks — verified against published scores:
| Benchmark | Type | Grading | Prerequisites |
|---|---|---|---|
| gsm8k | Single-turn | Programmatic (numeric) | — |
| math500 | Single-turn | Programmatic (numeric) | — |
| aime_2025 | Single-turn | Programmatic (numeric) | — |
| aime_2026 | Single-turn | Programmatic (numeric) | — |
| mmlu_pro | Single-turn | Programmatic (MCQA) | — |
| mmlu_redux | Single-turn | Programmatic (MCQA) | — |
| gpqa | Single-turn | Programmatic (MCQA) | HF auth (gated) |
| ifeval | Single-turn | Programmatic (IF constraints) | — |
| mbpp | Single-turn | Code execution | Modal |
| ceval | Single-turn | Programmatic (MCQA, Chinese) | — |
| supergpqa | Single-turn | Programmatic (MCQA, 4-10 options) | — |
| ifbench | Single-turn | Programmatic (IF constraints, 58 types) | ifbench package (not on PyPI) |
Experimental benchmarks (_-prefixed modules) — functional but need further validation:
| Benchmark | Type | Grading | Status |
|---|---|---|---|
| hmmt_feb_2025 | Single-turn | LaTeX answer (sympy) | Sympy grading, requires antlr4 |
| hmmt_nov_2025 | Single-turn | LaTeX answer (sympy) | Sympy grading, requires antlr4 |
| arena_hard | Single-turn | LLM-as-judge | Works with self-judge, needs cross-model judge |
| longbench | Single-turn | Programmatic | Limited by 65K context window |
| livecodebench | Single-turn | Code execution (Modal) | 47.4% on Qwen3.5-35B-A3B (needs 1800s timeout) |
| bfcl | Single-turn | Function call AST | Ground truth format mismatch |
| terminal_bench | Multi-turn | Sandbox + tests (Modal) | 27.7% on Qwen3.5-35B-A3B (ctx overflow on 65K model) |
| swe_bench | Multi-turn | Sandbox + pytest (Modal) | 0% — 65K context too small for multi-turn repo exploration |
| tau2_bench | Multi-turn | Tool dispatch + user sim | 30% (needs separate user simulator model) |
Prerequisites:
Install all eval dependencies at once:
uv pip install 'tinker-cookbook[eval]'
Or install only what you need per benchmark:
uv pip install 'tinker-cookbook[eval-math500]' # math-verify, pylatexenc, sympy
uv pip install 'tinker-cookbook[eval-hmmt]' # antlr4 for sympy LaTeX parsing
uv pip install 'tinker-cookbook[eval-mbpp]' # Modal sandbox
uv pip install 'tinker-cookbook[eval-livecodebench]' # Modal sandbox
uv pip install 'tinker-cookbook[eval-terminal-bench]' # Modal sandbox
uv pip install 'tinker-cookbook[eval-swe-bench]' # Modal sandbox
uv pip install 'tinker-cookbook[eval-ifbench]' # nltk, emoji, syllapy, langdetect
Additional setup:
- IFBench: Also requires uv pip install 'ifbench @ git+https://github.com/allenai/IFBench.git' (not on PyPI). The benchmark raises ImportError without it.
- HF auth (gated): Set HF_TOKEN or run huggingface-cli login for gated datasets (GPQA).
- Modal auth: Run modal token new for sandbox benchmarks (MBPP, LiveCodeBench, Terminal Bench, SWE-bench).
- judge_sampling_client: Benchmarks using LLM-as-judge or user simulation require a separate Tinker sampling client for the judge model. Pass via BenchmarkConfig(judge_sampling_client=..., judge_renderer=...).
Browse results
from tinker_cookbook.eval.benchmarks import load_result, load_trajectories, print_trajectory
# Load aggregated score
result = load_result("evals/step500", "gsm8k")
print(f"{result.name}: {result.score:.1%} ({result.num_correct}/{result.num_examples})")
# Browse incorrect examples
wrong = load_trajectories("evals/step500", "gsm8k", incorrect_only=True)
for t in wrong[:5]:
    print(f"Expected: {t.logs['expected']}, Got: {t.logs['extracted']}")
    print_trajectory(t)
Pass@k evaluation
When num_samples > 1, the runner evaluates each example multiple times and computes unbiased pass@k estimates (per the Codex paper):
config = BenchmarkConfig(num_samples=10, save_dir="evals/pass_at_k")
result = await run_benchmark("mbpp", sampling_client, renderer, config)
print(result.pass_at_k) # {1: 0.45, 5: 0.72, 10: 0.85}
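For reference, the estimator is pass@k = 1 - C(n-c, k)/C(n, k), where n is the number of samples per example and c the number that pass. A minimal standalone sketch (not the runner's internal code):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples drawn without replacement
    # from the n generated samples is among the c correct ones.
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 4, 5))  # 0.976...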
Use benchmarks as inline training evaluators
BenchmarkEvaluator bridges any benchmark into the SamplingClientEvaluator interface:
from tinker_cookbook.eval.benchmark_evaluator import BenchmarkEvaluator
evaluator_builders = [
    lambda: BenchmarkEvaluator("gsm8k", renderer, max_examples=100),
    lambda: BenchmarkEvaluator("ifeval", renderer, max_examples=50),
]
Add a new benchmark
- Create tinker_cookbook/eval/benchmarks/my_benchmark.py
- Implement a MessageEnv (recommended) — the renderer handles thinking-token stripping and prompt building automatically:
from tinker_cookbook.eval.benchmarks._common import build_messages, make_example_id
from tinker_cookbook.renderers import get_text_content
from tinker_cookbook.renderers.base import Message
from tinker_cookbook.rl.message_env import MessageEnv, MessageStepResult
class MyMessageEnv(MessageEnv):
    def __init__(self, question: str, expected: str, example_id: str = ""):
        self.question = question
        self.expected = expected
        self.example_id = example_id

    async def initial_observation(self) -> list[Message]:
        return build_messages(self.question)

    async def step(self, message: Message) -> MessageStepResult:
        response = get_text_content(message)  # thinking already stripped
        correct = self.expected in response
        return MessageStepResult(
            reward=1.0 if correct else 0.0,
            episode_done=True,
            next_messages=[],
            metrics={"correct": float(correct)},
            logs={"example_id": self.example_id, "expected": self.expected},
        )
- Implement a BenchmarkBuilder that creates envs and wraps them with EnvFromMessageEnv:
from datasets import load_dataset

from tinker_cookbook.eval.benchmarks._types import BenchmarkBuilder, BenchmarkConfig
from tinker_cookbook.eval.benchmarks import register
from tinker_cookbook.rl.message_env import EnvFromMessageEnv

class MyBenchmarkBuilder(BenchmarkBuilder):
    name = "my_benchmark"

    def make_envs(self, renderer, config):
        ds = load_dataset("my/dataset", split="test")
        if config.max_examples is not None:
            ds = ds.select(range(min(config.max_examples, len(ds))))
        envs = []
        for row in ds:
            msg_env = MyMessageEnv(row["question"], row["answer"])
            envs.append(EnvFromMessageEnv(
                renderer=renderer,
                message_env=msg_env,
                failed_parse_reward=0.0,
                context_overflow_reward=0.0,
            ))
        return envs

register(MyBenchmarkBuilder())
Key points:
- MessageEnv + EnvFromMessageEnv: Thinking-token stripping and context overflow handling are automatic. Your step() receives a clean message with thinking already removed.
- example_id: Set self.example_id on your MessageEnv for stable cross-run comparison and resumability. Use make_example_id(prefix, text) for a deterministic content hash. EnvFromMessageEnv forwards it automatically. Without it, the runner falls back to positional index (fragile).
- failed_parse_reward=0.0, context_overflow_reward=0.0: Truncated or unparseable responses score 0 and are tracked in BenchmarkResult.num_truncated.
- Sandbox benchmarks: Use SandboxMixin from _common.py and set requires_sandbox = True on the builder. See mbpp.py for an example.
- Multi-turn benchmarks: Set multi_turn = True on the builder (uses agent_concurrency instead of concurrency). See _terminal_bench.py for an example.
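Once the new module is imported so that register() executes, the benchmark is runnable by name like any built-in:

result = await run_benchmark("my_benchmark", sampling_client, renderer)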
3. EvalStore — Cross-Checkpoint Comparison
Persistent, file-based storage for tracking evaluation across checkpoints. Matches examples by example_id to identify regressions and improvements.
from tinker_cookbook.stores.eval_store import EvalStore
from tinker_cookbook.eval.benchmarks import run_benchmarks, BenchmarkConfig
store = EvalStore("~/experiments/evals")
# Run evals for a checkpoint
run_id = store.create_run(
    model_name="nvidia/...",
    checkpoint_name="sft_step500",
    benchmarks=["gsm8k", "ifeval"],
)
await run_benchmarks(
    ["gsm8k", "ifeval"], sampling_client, renderer,
    BenchmarkConfig(save_dir=store.run_dir(run_id)),
)
store.finalize_run(run_id)

# Query results
result = store.read_result(run_id, "gsm8k")
print(f"GSM8K: {result.score:.1%}")

# List all runs
for run in store.list_runs():
    print(f"{run.run_id}: {run.scores}")
Storage layout
eval_store/
  runs.jsonl                      # Append-only index
  runs/
    sft_step500_20260327_143022/
      metadata.json               # Model, checkpoint, config, scores
      gsm8k/
        result.json               # Aggregated BenchmarkResult
        trajectories.jsonl        # Per-example StoredTrajectory
      ifeval/
        result.json
        trajectories.jsonl
Configuration
BenchmarkConfig controls runtime behavior:
| Parameter | Default | Description |
|---|---|---|
| max_examples | None (all) | Limit number of examples |
| concurrency | 64 | Max concurrent rollouts (single-turn) |
| agent_concurrency | 8 | Max concurrent rollouts (multi-turn) |
| timeout_seconds | 300 | Per-example timeout (seconds) |
| max_tokens | 32768 | Max generation tokens |
| temperature | 0.6 | Sampling temperature |
| num_samples | 1 | Samples per example for pass@k evaluation |
| save_dir | None | Directory for saving trajectories/results |
| judge_sampling_client | None | Sampling client for LLM-as-judge benchmarks |
Important: scores are setup-dependent
Benchmark scores are highly sensitive to evaluation settings. Small changes in max_tokens, temperature, system_prompt, or timeout_seconds can shift scores by 10–30%. Always document your exact configuration when reporting results.
Common pitfalls with thinking models:
- max_tokens truncation: Thinking models generate long reasoning chains that may fill max_tokens before producing an answer. For LiveCodeBench v6, 78/91 wrong answers were truncated at 32K tokens — increasing max_tokens to 64K would likely recover most of them.
- Timeouts: Thinking models need 1800s+ for code benchmarks. LiveCodeBench went from 20% (600s) to 47.4% (1800s) on Qwen3.5-35B-A3B.
- Context overflow: Multi-turn benchmarks (terminal_bench, swe_bench) can exceed the model's context window as conversations grow. The 65K context window of Qwen3.5-35B-A3B is insufficient for SWE-bench.
- System prompt: GSM8K improved from 84.7% to 95.6% by instructing the model to use \boxed{}.
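A sketch of reproducing the \boxed{} setup, assuming BenchmarkConfig accepts a system_prompt field (implied by the settings list above; verify against your version):

config = BenchmarkConfig(
    system_prompt="Put your final answer in \\boxed{}.",  # assumed field name
    max_tokens=32768,
    temperature=0.6,
)
result = await run_benchmark("gsm8k", sampling_client, renderer, config)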
Treat these scores as reference points for a specific configuration, not definitive model capabilities. The framework's primary value is consistent, reproducible evaluation — not producing leaderboard numbers.
Verification
Reference scores on Qwen3.5-35B-A3B with max_tokens=32768, temperature=0.6.
Official scores from the model card (which may use different settings).
Stable benchmarks:
| Benchmark | Our Score | Official | Match? | Settings |
|---|---|---|---|---|
| MMLU-Pro | 85.2%* | 85.3 | Match | 32K tokens |
| MMLU-Redux | 93.5% | 93.3 | Match | 32K tokens |
| GPQA Diamond | 89.7%* | 84.2 | Above* | 32K tokens |
| IFEval | 92.5%* | 91.9 | Match | 32K tokens |
| GSM8K | 95.6%* | — | — | system_prompt=\boxed{}, 32K tokens |
| MATH-500 | 97.9%* | — | — | system_prompt=\boxed{}, 32K tokens |
| MBPP | 84.4%* | — | — | Modal sandbox, 32K tokens |
| AIME 2026 (pass@4) | 90.0% | 93.33 | Close | system_prompt=\boxed{}, 32K tokens |
* Excluding context overflow — the thinking model's reasoning chain exceeds context on some examples. These are scored as failures (reward=0).
Experimental benchmarks (Modal sandbox):
| Benchmark | Our Score | Official | Notes |
|---|---|---|---|
| LiveCodeBench v6 | 47.4% (175 ex) | 74.6 | 78/91 wrong due to 32K truncation; excl. truncated: 86.5% |
| Terminal Bench 2 | 27.7% (112 ex) | 40.5 | 24 ctx overflow + 14 timeout on 65K model |
| SWE-bench Verified | 0% (500 ex) | 69.2 | 65K context too small — all ctx overflow |
| TAU2-Bench | 30.0% (50 ex) | 81.2 | Same-model user sim limits score; official uses GPT-4.1 |
Verified scores
Reference scores using BenchmarkConfig.for_model() with recommended settings.
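For example (the argument to for_model() is assumed from the description above; check the signature in your version):

config = BenchmarkConfig.for_model("Qwen/Qwen3.5-35B-A3B")  # assumed argument
results = await run_benchmarks(["gsm8k", "ifeval"], sampling_client, renderer, config)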
GPT-OSS-120B (128K context)
Model: openai/gpt-oss-120b:peft:131072. Renderer: gpt_oss_high_reasoning.
Official scores from the GPT-OSS technical report.
| Benchmark | Raw | Completed | Official | Match? |
|---|---|---|---|---|
| GPQA Diamond | 80.8% | 80.8% | 80.1% | Match |
| MMLU-Pro | 80.6% | 80.6% | 90.0% (MMLU, different) | Different benchmark |
| GSM8K | 95.9% | 95.9% | — | — |
| MATH-500 | 95.4% | 95.4% | — | — |
| IFEval | 91.7% | 91.7% | — | — |
| AIME 2025 | 76.7% | 76.7% | 92.5% (with tools) | No tools in our eval |
| Terminal Bench | 28.6% | 28.6% | — | Not in paper |
| SWE-bench Verified | 2.2% | 2.3% | 62.4% | Agent scaffold gap (see below) |
Zero truncation across all benchmarks — 128K context is sufficient for all prompts.
Raw and Completed scores are essentially identical (no max_tokens truncation issues).
SWE-bench gap: The official eval uses a specialized agent scaffold with file editing tools. Our harness provides only a bash tool — the model reads code but rarely generates sed edits. Improving the tool scaffold (e.g., adding a str_replace_editor) is expected to close most of this gap. See mini-swe-agent for a reference bash-only implementation that achieves 74%+ with frontier models.
Qwen3.5-35B-A3B (64K context)
Model: Qwen/Qwen3.5-35B-A3B. Renderer: qwen3_5.
Official scores from the Qwen3.5-35B-A3B model card.
| Benchmark | Raw | Completed | Official | Match? |
|---|---|---|---|---|
| MMLU-Redux | 89.2% | 93.8% | 93.3 | Match |
| GPQA Diamond | 70.7% | 89.7% | 84.2 | Above |
| IFEval | 86.9% | 92.5% | 91.9 | Match |
| C-Eval | 89.2% | 90.1% | 90.2 | Match |
| SuperGPQA | ~59% | ~67% | 63.4 | Match |
| MATH-500 | 92.0% | 97.9% | — | — |
| GSM8K | 81.7% | 88.0% | — | — |
| MBPP | 84.4% | 87.1% | — | — |
| IFBench | 67.3% | — | 70.2 | Match |
| AIME 2026 pass@4 | — | 96.7% | 93.33 | Above |
"Completed" excludes truncated examples (model hit max_tokens before answering).
For thinking models, score_completed is the right comparison against published scores.
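In terms of the BenchmarkResult fields used earlier, the relationship is (a sketch; field names are from this document, with score_completed assumed to be the stored name for the Completed column):

score = num_correct / num_examples                              # "Raw"
score_completed = num_correct / (num_examples - num_truncated)  # "Completed"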
Testing
pytest tinker_cookbook/eval/benchmarks/benchmark_test.py