Eval Snapshot Workflow

May 27, 2026

This project keeps publishable benchmark numbers in a local, gitignored file, so the README metrics can be updated without committing private baseline files.

Files

Local metrics (gitignored): ${EVAL_BASELINES_DIR:-~/.cache/origin-eval}/readme_metrics.json
Tracked template: docs/eval/readme_metrics.example.json
README updater: scripts/update-readme-eval.py
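
The local metrics path above uses the shell's `${VAR:-default}` form, which falls back to the default when the variable is unset or empty. As a minimal sketch of how a script could resolve the same path in Python (the helper name `local_metrics_path` and the `env` parameter are hypothetical, and this is not necessarily how `scripts/update-readme-eval.py` does it):

```python
import os
from pathlib import Path

# Default directory taken from the path shown in the Files list.
DEFAULT_DIR = "~/.cache/origin-eval"


def local_metrics_path(env=None):
    """Mirror ${EVAL_BASELINES_DIR:-~/.cache/origin-eval}/readme_metrics.json.

    `env` is a hypothetical parameter that lets callers pass a mapping
    instead of the real process environment (handy for testing).
    """
    env = os.environ if env is None else env
    # The shell's ":-" form falls back when the variable is unset OR empty,
    # so an empty string is treated the same as a missing variable.
    base = env.get("EVAL_BASELINES_DIR") or DEFAULT_DIR
    return Path(base).expanduser() / "readme_metrics.json"
```

Calling `local_metrics_path()` with no arguments reads the real environment; passing a dict makes the fallback behaviour easy to check without touching it.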

Update flow

1. Run benchmark(s) locally and record headline metrics.
2. Update ${EVAL_BASELINES_DIR:-~/.cache/origin-eval}/readme_metrics.json.
3. Regenerate README snapshot:

   python3 scripts/update-readme-eval.py

4. Commit the README and script/docs changes (the local metrics JSON stays untracked).
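
On a fresh checkout the local metrics file does not exist yet. One way to start, assuming the tracked template is meant to be copied and then edited by hand, is sketched below. The helper name `seed_local_metrics` is hypothetical and is not part of the repository's scripts.

```python
import shutil
from pathlib import Path


def seed_local_metrics(template, target):
    """Copy the tracked template to the local path unless a local file exists.

    Returns True if a new file was created, False if `target` was left alone.
    """
    template, target = Path(template), Path(target)
    if target.exists():
        # Never overwrite locally recorded numbers.
        return False
    # Create the cache directory (and any parents) on first use.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, target)
    return True
```

For example, `seed_local_metrics("docs/eval/readme_metrics.example.json", Path("~/.cache/origin-eval/readme_metrics.json").expanduser())` would create the local file once and leave it untouched on later calls.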

Notes

LongMemEval and LoCoMo use Recall@5, MRR, and NDCG@10 as headline fields.
Current README numbers are retrieval-only, single-run local snapshots unless a reproducibility pass is explicitly documented.
Name the retrieval mode once in surrounding prose when all rows use the same mode.
Keep notes in the metrics JSON for maintainer-facing caveats and run metadata; the root README does not render them.
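
For readers unfamiliar with the headline fields, the sketch below shows the textbook binary-relevance definitions of Recall@k, reciprocal rank (averaged over queries to give MRR), and NDCG@k. It assumes each ranked list contains unique item IDs. Benchmark harnesses sometimes use variants (for example, counting a query as a hit if any relevant item is retrieved), so treat this as an illustration rather than the exact computation behind the README numbers.

```python
import math


def recall_at_k(ranked, relevant, k=5):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved.

    MRR is the mean of this value over all queries.
    """
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    # The ideal ranking places every relevant item (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

With `ranked = ["d3", "d1", "d7", "d2", "d9"]` and `relevant = {"d1", "d2"}`, both relevant items fall inside the top 5, so Recall@5 is 1.0, and the first relevant item sits at rank 2, so the reciprocal rank is 0.5.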
