Benchmark Setup
May 5, 2026
This document covers the benchmark-native asset path for pre-release source reproduction.
It is separate from the local clip corpus because benchmark assets have different licensing, larger size, and stricter comparability requirements.
Reproduction Target
On this M3 Air, benchmark work defaults to generalized reproduction, not strict rerun:
- same method family
- same benchmark structure
- local hardware-aware subset or chunking when needed
- explicit caveats about model precision, frame count, and subset size
Do not silently blur that into "full replication."
The frozen imported targets are in docs/claim-register.md.
Install The Benchmark Helpers
uv sync --group dev --group research --group vlm --group benchmark
Local Layout
Benchmark assets live under ignored paths:
data/benchmarks/
├── tomato/
│   ├── downloads/
│   ├── hf/
│   └── videos/
├── mvbench/
│   ├── downloads/
│   ├── hf/
│   └── video/
└── videomme/
    ├── downloads/
    ├── hf/
    └── videos/
Nothing under data/benchmarks/ is committed to git.
Fetch Commands
Fetch TOMATO QA tables from Hugging Face:
uv run python scripts/fetch_benchmarks.py --dataset tomato --mode metadata
Fetch the TOMATO video bundle from the official Google Drive file:
uv run python scripts/fetch_benchmarks.py --dataset tomato --mode assets
If the official Drive bundle is quota-blocked or the transfer is corrupted, the
fetch script falls back to the public ellisbrown/TOMATO Hugging Face mirror,
which hosts video_shard_000.tar.zst through video_shard_005.tar.zst. The
selected asset source is recorded in data/benchmarks/tomato/SOURCE.json.
If the Drive path is already known-bad on the current machine, skip straight to the mirror:
uv run python scripts/fetch_benchmarks.py --dataset tomato --mode assets --tomato-video-source mirror
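The fallback behaviour described above can be pictured with a minimal sketch. This is not the code in scripts/fetch_benchmarks.py: the function name, the fetcher callables, and the JSON field names are all hypothetical, and the real SOURCE.json schema may differ. It only illustrates the pattern of trying sources in order and recording which one was used.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def fetch_with_fallback(
    fetchers: List[Tuple[str, Callable[[Path], None]]],
    dest: Path,
    record_path: Path,
) -> str:
    """Try each named source in order and record the one that succeeded."""
    errors: Dict[str, str] = {}
    for name, fetch in fetchers:
        try:
            fetch(dest)
        except Exception as exc:  # e.g. quota block or corrupted transfer
            errors[name] = repr(exc)
            continue
        # Hypothetical record fields; the real SOURCE.json may use other keys.
        record = {"selected_source": name, "failed_sources": errors}
        record_path.write_text(json.dumps(record, indent=2))
        return name
    # Fail loudly instead of leaving a half-populated asset directory.
    raise RuntimeError(f"all sources failed: {errors}")
```

Passing only the mirror fetcher in the list would correspond to skipping the Drive path entirely.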
Fetch MVBench task JSON and NTU reference list from Hugging Face:
uv run python scripts/fetch_benchmarks.py --dataset mvbench --mode metadata
Fetch the MVBench hosted video bundles from Hugging Face:
uv run python scripts/fetch_benchmarks.py --dataset mvbench --mode assets
The default MVBench asset profile is predecessor18, which matches the
predecessor-style 18-task slice and avoids downloading hosted bundles that
are not needed for the first generalized reproduction pass.
In practice that hosted predecessor-style slice still needs perception.zip
for several of the saved 18 tasks. The default profile therefore includes it
even though the earliest draft of this repo's fetch list did not.
To fetch every Hugging Face-hosted MVBench archive instead:
uv run python scripts/fetch_benchmarks.py --dataset mvbench --mode assets --mvbench-profile all
Fetch the TOMATO and MVBench stacks together (--dataset both covers TOMATO and MVBench only):
uv run python scripts/fetch_benchmarks.py --dataset both --mode all
Fetch VideoMME metadata:
uv run python scripts/fetch_benchmarks.py --dataset videomme --mode metadata
VideoMME videos are intentionally fetched as checked manifest subsets rather than as the full 101 GB corpus. Use:
uv run python scripts/fetch_videomme_subset.py \
--manifest research/benchmark_manifests/videomme_dev_v1.toml \
--manifest research/benchmark_manifests/videomme_holdout_v1.toml \
--cache-dir data/benchmarks/videomme/downloads/hf_cache
See docs/videomme-download-handoff.md for the complete VideoMME acquisition and verification flow.
Dry-run without downloading anything:
uv run python scripts/fetch_benchmarks.py --dataset both --mode all --dry-run
Benchmark Runner
The benchmark-native Track A runner is:
uv run python scripts/run_benchmark_track_a.py run --benchmark tomato
Useful control modes:
- --cache-mode default: run the normal same-position cached-feature substitution path
- --cache-mode identity: route unchanged dense features back through the benchmark runner to verify cache-path transparency on the exact benchmark code path
Useful diagnosis options:
- --refresh-interval <k>: force a dense refresh every k frames while keeping the cached-feature path active between refreshes
- --manifest <path>: run an explicit frozen slice instead of the historical first N per group selection path
- --feature-cache-dir <path>: store or reuse dense vision features for repeated Track A planner sweeps
- --no-feature-replay: disable dense feature replay and force dense recomputation
- --allow-dirty: bypass the default clean-tree guard for debugging only; reportable benchmark artifacts should come from clean commits
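To make the refresh-interval idea concrete, here is a small sketch of a per-frame schedule. It is an illustration under stated assumptions, not the runner's implementation: the function name is hypothetical, and it assumes frame 0 is always dense and that refreshes land on frame indices that are multiples of k.

```python
from typing import List, Optional


def refresh_schedule(frame_count: int, refresh_interval: Optional[int]) -> List[bool]:
    """Return True for frames that get a dense pass, False for cached frames."""
    if frame_count < 0:
        raise ValueError("frame_count must be non-negative")
    if refresh_interval is not None and refresh_interval < 1:
        raise ValueError("refresh_interval must be >= 1")
    schedule = []
    for index in range(frame_count):
        # Assumption: the first frame is dense because nothing is cached yet.
        is_dense = index == 0
        # Assumption: refreshes fall on multiples of the interval.
        if refresh_interval is not None and index % refresh_interval == 0:
            is_dense = True
        schedule.append(is_dense)
    return schedule
```

Under these assumptions, an 8-frame clip with an interval of 4 is dense on frames 0 and 4 and uses the cached-feature path on the other six.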
Current reuse-accounting rule on the benchmark runner:
- reuse_ratio_mean is the pad-masked active-region reuse ratio
- reuse_ratio_mean_raw is also recorded for descriptive comparison
- identity-mode controls report reuse as null because the planner is bypassed
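The difference between a pad-masked ratio and a raw ratio can be shown with a toy computation. This is one plausible reading, not the runner's actual formula: the function name and the per-position boolean inputs are hypothetical, and it assumes padded positions may be flagged as reused, which would inflate the raw number.

```python
from typing import Optional, Sequence, Tuple


def reuse_ratios(
    reused: Sequence[bool], is_pad: Sequence[bool]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (pad_masked_ratio, raw_ratio) for one item."""
    if len(reused) != len(is_pad):
        raise ValueError("reused and is_pad must have the same length")
    total = len(reused)
    # Raw ratio: every position counts, including padding.
    raw = sum(reused) / total if total else None
    # Masked ratio: only positions in the active (non-pad) region count.
    active = [flag for flag, pad in zip(reused, is_pad) if not pad]
    masked = sum(active) / len(active) if active else None
    return masked, raw
```

For example, with four positions where the last two are padding and three are flagged as reused, the raw ratio is 0.75 while the masked ratio is 0.5.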
Current replay rule on the benchmark runner:
- replay is a Track A experiment accelerator only
- cache hits are recorded per item as feature_cache_hit
- replay does not justify speedup or compression language
Recommended first TOMATO smoke on this machine:
uv run python scripts/run_benchmark_track_a.py run \
--benchmark tomato \
--per-group 1 \
--chunk-size 1 \
--frame-count 8 \
--max-tokens 32 \
--output-path results/tomato_smoke.jsonl \
--summary-path results/tomato_smoke_summary.json
Recommended first generalized TOMATO subset after the smoke passes:
uv run python scripts/run_benchmark_track_a.py run \
--benchmark tomato \
--manifest research/benchmark_manifests/tomato_dev_v1.toml \
--chunk-size 1 \
--frame-count 8 \
--max-tokens 32 \
--output-path results/tomato_subset.jsonl \
--summary-path results/tomato_subset_summary.json
For long semantic runs, the runner also supports cooperative stop and summary checkpointing:
uv run python scripts/run_benchmark_track_a.py run \
--benchmark tomato \
--manifest research/benchmark_manifests/tomato_dev_v1.toml \
--chunk-size 1 \
--stop-file /tmp/vlmaxxing-stop \
--summary-path results/tomato_subset_summary.json
Then request clean termination with:
touch /tmp/vlmaxxing-stop
The runner stops at the next chunk boundary and rewrites the summary JSON.
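A cooperative stop of this kind can be sketched as a loop that checks for the stop file only between chunks. The sketch below is an assumption-laden illustration, not the code in scripts/run_benchmark_track_a.py: the function names and the summary fields are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def _write_summary(summary_path: Path, completed: int, stopped_early: bool) -> Dict:
    """Rewrite the summary JSON (hypothetical fields) and return its contents."""
    summary = {"completed_chunks": completed, "stopped_early": stopped_early}
    summary_path.write_text(json.dumps(summary, indent=2))
    return summary


def run_chunks(
    chunks: Iterable,
    process: Callable[[object], None],
    stop_file: Path,
    summary_path: Path,
) -> Dict:
    """Process chunks in order, honouring a stop file at chunk boundaries."""
    completed = 0
    stopped_early = False
    for chunk in chunks:
        # Check only between chunks, so a chunk is never interrupted midway.
        if stop_file.exists():
            stopped_early = True
            break
        process(chunk)
        completed += 1
        # Checkpoint after every chunk so a crash still leaves a usable summary.
        _write_summary(summary_path, completed, stopped_early=False)
    return _write_summary(summary_path, completed, stopped_early)
```

Checking at the boundary means a stop request costs at most one chunk of extra work, which is one reason a small --chunk-size pairs well with --stop-file.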
Benchmark slice policy now lives under research/benchmark_manifests/.
Use *_dev_v1.toml during planner search and keep *_holdout_v1.toml frozen
until the next policy choice is ready for evaluation.
Replay methodology and invalidation rules live in docs/methodology/feature-replay.md.
TOMATO Notes
- official code repo: yale-nlp/TOMATO
- official QA tables: yale-nlp/TOMATO on Hugging Face
- official video bundle: Google Drive file linked from the TOMATO repo README
- QA table license from the dataset card: CC BY-SA 4.0
This repo uses the official Hugging Face tables plus the official Google Drive
video bundle when available. If the Drive bundle is unavailable, it falls back
to the public ellisbrown/TOMATO shard mirror and records that choice in the
local source record so the acquisition path remains auditable.
Expected final layout:
data/benchmarks/tomato/
├── hf/
│   └── data/*.parquet
└── videos/
    ├── human/
    ├── object/
    └── simulated/
MVBench Notes
- official dataset repo: OpenGVLab/MVBench on Hugging Face
- task JSON is hosted directly on Hugging Face
- most video bundles are hosted directly on Hugging Face as zip archives
- 320 NTU RGB+D videos remain manual because of upstream license restrictions
Current implication:
- this repo can automate the Hugging Face-hosted portion of MVBench
- full 20-task coverage may still require the NTU manual download
- the imported predecessor run only saved an 18-task local slice, so a local generalized reproduction can still be meaningful before NTU is complete
- the default fetch profile mirrors that predecessor-style hosted subset: FunQA_test, Moments_in_Time_Raw, clevrer, data0613, perception, scene_qa, ssv2_video, sta, star, and vlnqa
Expected final layout:
data/benchmarks/mvbench/
├── hf/
│   ├── json/*.json
│   └── video/MVBench_videos_ntu.txt
└── video/
    ├── clevrer/
    ├── star/
    ├── ssv2_video/
    └── ...
What The Fetch Script Guarantees
- fail-loud if the expected local layout is not created
- preserve downloaded archives under downloads/
- record the upstream repo or Drive source in a checked local JSON note
- keep benchmark data out of git
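The fail-loud layout guarantee can be pictured as a post-fetch check that raises as soon as an expected directory is missing. This is a minimal sketch with a hypothetical function name; the actual checks in the fetch script may cover files as well as directories.

```python
from pathlib import Path
from typing import List


def require_layout(root: Path, expected_dirs: List[str]) -> None:
    """Raise if any expected subdirectory is missing under root."""
    missing = [rel for rel in expected_dirs if not (root / rel).is_dir()]
    if missing:
        # Fail loudly: a partial layout should never look like a success.
        raise FileNotFoundError(
            f"expected layout incomplete under {root}: missing {missing}"
        )
```

For TOMATO, such a check might be called with root set to data/benchmarks/tomato and the expected list containing hf and videos.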
What The Fetch Script Does Not Guarantee
- that the full benchmark is tractable on this machine
- that every MVBench task is complete without NTU
- that the pre-release source frame count or model precision can be matched
Those caveats belong in the experiment note and docs/reproduction-status.md.