longctx

May 6, 2026 · View on GitHub

Live results from mass-validation runs. Rows update as new data lands.

MRCR v2 8-needle, Qwen2.5-32B-Instruct, single AMD MI300X

binchar rangenpipelinetop-Kchunk sizeavg scoreprefix passrun date
8K16K–32K82RAG8message-level0.822100%2026-05-06
32K64K–128K98RAG8message-level0.69797%2026-05-06
64K128K–256K95RAG8message-level0.64198%2026-05-06
64K128K–256K95chunked85000.67098%2026-05-06
1M2M–5M30RAG8message-level0.440100%2026-05-06
1M2M–5M30chunked85000.40997%2026-05-06

The 1M bin scores reflect an in-progress score-narrowing campaign. Several retrieval levers are being characterized (top-K sweeps, BM25 hybrid, position-aware retrieval, oracle gen-ceiling). Numbers will be updated as runs complete.

Reproducing

pip install longctx[eval]
longctx-eval --bin 8k --n 30 --model qwen25-32b \
    --server http://localhost:5050/v1/chat/completions \
    --data-dir /path/to/mrcr/v2

Or for the full multi-bin curve:

longctx-bench --data-dir /path/to/mrcr/v2 --model qwen25-32b \
    --bins 8k 32k 64k --n 80 --include-chunked

Generator notes

  • Qwen2.5-32B-Instruct (vanilla, 32K native context window, served via vLLM --max-model-len 32768): the headline numbers above. longctx feeds the model only the retrieved top-K, so the 32K window is sufficient regardless of haystack size.
  • Qwen2.5-14B-Instruct-1M (1M native context): also tested. Scores in the same ballpark at small bins; the 1M-context model is not required for longctx since the model never sees the full haystack.
  • Other generators (Mistral-7B-Instruct-v0.3, Qwen3-8B) need the bundled longctx.templates.MISTRAL_VERBATIM_TEMPLATE / QWEN3_NO_THINK_TEMPLATE to produce verbatim-prefix outputs.

Methodology

Mass-validation runs use n ≥ 80 for the headline cells (8K, 32K, 64K). Single-run scores at n ≤ 30 have ±0.05 swing across adjacent runs; trust the n ≥ 80 numbers for any cross-cell comparison. The 1M bin is currently characterized at n=30; mass-val will land once the score-narrowing campaign converges on a recipe.