AVP Benchmarks

March 25, 2026 · View on GitHub

+14.1pp on code generation vs text (p=0.004) · 14-78% fewer tokens · 1.2-4x faster – 7 benchmarks, 5 models, 2 families. Cross-model rosetta across 4 model pairs.


Accuracy

Same-model latent transfer matches or improves accuracy on structured tasks. Tested on NVIDIA A100, n=100-500 per benchmark.

Code Generation

DirectLatent (AVP)Text
HumanEval (Qwen 7B, n=164)58.5%67.1%53.0%
HumanEval (Llama 3B, n=164)50.6%54.3%44.5%

Latent vs text (Qwen 7B): p=0.004. Validated across 4 seeds at T=0.01 (70.0%±0.3% latent vs 57.6%±0.3% text). Replicated on Llama 3B.

Math Reasoning

DirectLatent (AVP)Text
GSM8K (Qwen 7B, n=200)91.0%90.5%87.0%
GSM8K (Llama 3B, n=200)74.5%76.0%79.0%

Bug Fixing

DirectLatent (AVP)Text
DebugBench (Qwen 7B, n=100)50.0%51.0%49.0%
DebugBench (Llama 3B, n=100)31.0%30.0%31.0%

Comprehension

DirectLatent (AVP)Text
HotpotQA (Qwen 7B, n=200)51.5%52.5%50.5%

All modes within noise. Latent's advantage here is purely efficiency.


Efficiency

Token savings are structural – pre-computed KV-cache replaces re-processed text. Savings hold across every benchmark, every model.

AgentsBenchmarkToken SavingsSpeedup
2GSM8K, DebugBench46-56%1.5-3x
2HumanEval14%1.2x
3Fan-out56-60%1.5x
4GSM8K chain73-78%2-4x

HumanEval token savings are lower because prompts are short (~182 tokens avg) and the latent reviewer generates longer, more complete code solutions (+53% more output tokens).

Text prompts grow O(n²) with agent count. Latent stays O(n).


Cross-Model (Rosetta Stone) – Experimental

Experimental. Cross-model projection requires cross_model=True. Accuracy varies by task type – works well on structured tasks (math, code), degrades on comprehension and bug fixing.

Different models communicate via vocabulary-mediated projection. Zero training – uses existing embedding matrices.

Rosetta Accuracy

Source → TargetGSM8K (n=200)HumanEval (n=164)DebugBench (n=100)
Qwen 7B → Qwen 3B82.5%66.5%
Qwen 7B → Llama 3B77.0%47.0%34.0%
Llama 3B → Qwen 7B90.0%79.3%45.0%
Qwen 7B → Qwen 1.5B58.5%42.1%26.0%

Target model solo baselines: Qwen 7B = 91.0% / 58.5% / 50.0%. Qwen 3B = 82.5% / 61.0%. Llama 3B = 76.0% / 50.6% / 31.0%. Qwen 1.5B = 62.0%.

Accuracy is bounded by the target model's own capability. Advisory quality gate included for prompts >300 tokens where projection degrades.

Rosetta vs Text Cross-Model

DirectionBenchmarkRosettaTextDelta
Qwen 7B → Qwen 3BGSM8K82.5%88.5%text +6.0pp
Qwen 7B → Qwen 3BHumanEval66.5%62.2%rosetta +4.3pp
Qwen 7B → Llama 3BGSM8K77.0%86.5%text +9.5pp
Llama 3B → Qwen 7BGSM8K90.0%82.0%rosetta +8.0pp
Qwen 7B → Llama 3BHumanEval47.0%57.9%text +10.9pp
Llama 3B → Qwen 7BHumanEval79.3%61.6%rosetta +17.7pp
Qwen 7B → Llama 3BDebugBench34.0%44.0%text +10.0pp
Llama 3B → Qwen 7BDebugBench45.0%40.0%rosetta +5.0pp

Direction matters: rosetta beats text when the stronger model is the solver. Text wins when the weaker model is the solver. On code generation (HumanEval), rosetta wins in both directions.


Competition Math

DirectLatent (AVP)Text
MATH (Qwen 7B, n=500)67.8%66.8%66.6%

All three modes are statistically identical (p=1.0 latent vs text). Earlier runs at 512 max tokens showed a false text advantage due to solver truncation – with proper token budget (2048), the gap disappears.


Configuration

ParameterValue
Latent steps20 (validated: 10 ≈ 20 > 40 > 80)
Temperature0.7
Max new tokens512 (MATH: 2048)
Seed42
HardwareNVIDIA A100 80GB

Limitations

  • Self-hosted only – requires direct KV-cache access (cloud APIs don't expose this)
  • Single-embedding bottleneck – cross-model transfers one vector; fails on long comprehension tasks
  • Multi-hop coherence – KV-cache across 4+ sequential hops loses signal
  • 1B-7B models tested – larger models may behave differently

Reproduce

pip install avp[hf] datasets

python -m benchmarks.humaneval.run_humaneval \
    --model_name Qwen/Qwen2.5-7B-Instruct \
    --mode all --max-samples 50 --latent-steps 20 --seed 42

All benchmark code: benchmarks/. Llama models require HF access and HF_TOKEN.