AVP Benchmarks
March 25, 2026
+14.1pp on code generation vs text (p=0.004) · 14-78% fewer tokens · 1.2-4x faster – 7 benchmarks, 5 models, 2 families. Cross-model rosetta across 4 model pairs.
Accuracy
Same-model latent transfer matches or improves accuracy on structured tasks. Tested on NVIDIA A100, n=100-500 per benchmark.
Code Generation
| Benchmark | Direct | Latent (AVP) | Text |
|---|---|---|---|
| HumanEval (Qwen 7B, n=164) | 58.5% | 67.1% | 53.0% |
| HumanEval (Llama 3B, n=164) | 50.6% | 54.3% | 44.5% |
Latent vs text (Qwen 7B): p=0.004. Validated across 4 seeds at T=0.01 (70.0%±0.3% latent vs 57.6%±0.3% text). Replicated on Llama 3B.
Math Reasoning
| Benchmark | Direct | Latent (AVP) | Text |
|---|---|---|---|
| GSM8K (Qwen 7B, n=200) | 91.0% | 90.5% | 87.0% |
| GSM8K (Llama 3B, n=200) | 74.5% | 76.0% | 79.0% |
Bug Fixing
| Benchmark | Direct | Latent (AVP) | Text |
|---|---|---|---|
| DebugBench (Qwen 7B, n=100) | 50.0% | 51.0% | 49.0% |
| DebugBench (Llama 3B, n=100) | 31.0% | 30.0% | 31.0% |
Comprehension
| Benchmark | Direct | Latent (AVP) | Text |
|---|---|---|---|
| HotpotQA (Qwen 7B, n=200) | 51.5% | 52.5% | 50.5% |
All modes within noise. Latent's advantage here is purely efficiency.
Efficiency
Token savings are structural – pre-computed KV-cache replaces re-processed text. Savings hold across every benchmark, every model.
| Agents | Benchmark | Token Savings | Speedup |
|---|---|---|---|
| 2 | GSM8K, DebugBench | 46-56% | 1.5-3x |
| 2 | HumanEval | 14% | 1.2x |
| 3 | Fan-out | 56-60% | 1.5x |
| 4 | GSM8K chain | 73-78% | 2-4x |
HumanEval token savings are lower for two reasons: prompts are short (~182 tokens avg), and the latent reviewer generates longer, more complete code solutions (53% more output tokens).
Text prompts grow O(n²) with agent count. Latent stays O(n).
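The scaling claim can be illustrated with a toy token count. The sketch below assumes a sequential chain: in text mode each agent re-reads the prompt plus every earlier agent's output, while in latent mode earlier context arrives as KV-cache and each later agent only encodes a small, fixed handoff. The function names and all token counts (`prompt_tokens=200`, `output_tokens=150`, `handoff_tokens=50`) are hypothetical and are not taken from the AVP benchmarks or its API.

```python
def text_chain_tokens(n_agents, prompt_tokens, output_tokens):
    """Input tokens encoded when every agent re-reads all prior text."""
    total = 0
    for k in range(n_agents):
        # Agent k re-processes the prompt plus the k outputs produced before it.
        total += prompt_tokens + k * output_tokens
    return total  # n*prompt + output*n*(n-1)/2, i.e. quadratic in n


def latent_chain_tokens(n_agents, prompt_tokens, handoff_tokens):
    """Input tokens encoded when prior context is reused from the KV-cache."""
    # The prompt is encoded once; each later agent adds a fixed-size handoff.
    return prompt_tokens + (n_agents - 1) * handoff_tokens  # linear in n


def savings(n_agents, prompt_tokens=200, output_tokens=150, handoff_tokens=50):
    """Fraction of input tokens avoided by the latent chain (toy model)."""
    text = text_chain_tokens(n_agents, prompt_tokens, output_tokens)
    latent = latent_chain_tokens(n_agents, prompt_tokens, handoff_tokens)
    return 1 - latent / text


for n in (2, 3, 4):
    print(n, text_chain_tokens(n, 200, 150), latent_chain_tokens(n, 200, 50))
```

Under this toy model the gap widens as agents are added, which is the qualitative pattern in the table above. The printed numbers are illustrative only and should not be read as a reproduction of the measured savings.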
Cross-Model (Rosetta Stone) – Experimental
Experimental. Cross-model projection requires cross_model=True. Accuracy varies by task type: it works well on structured tasks (math, code) and degrades on comprehension and bug fixing.
Different models communicate via vocabulary-mediated projection. Zero training – uses existing embedding matrices.
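The source does not spell out the projection itself. One plausible reading of "vocabulary-mediated" is sketched below: score the source model's hidden state against its own output embeddings, turn the scores into a distribution over tokens that both vocabularies share, then blend the target model's input embeddings with those weights. This is an assumption for illustration, not AVP's implementation; `project_via_vocab`, `shared_tokens`, and the toy matrices are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project_via_vocab(hidden, src_unembed, tgt_embed, shared_tokens):
    """Map a source-model hidden state into the target model's embedding space.

    hidden:        source hidden state, length d_src
    src_unembed:   source output-embedding rows, one per source token id
    tgt_embed:     target input-embedding rows, one per target token id
    shared_tokens: (source_id, target_id) pairs for tokens in both vocabularies
    """
    # Score the hidden state against each shared token in the source vocabulary.
    logits = [
        sum(h * w for h, w in zip(hidden, src_unembed[src_id]))
        for src_id, _ in shared_tokens
    ]
    weights = softmax(logits)

    # Blend the matching target embeddings using those weights.
    d_tgt = len(tgt_embed[0])
    projected = [0.0] * d_tgt
    for weight, (_, tgt_id) in zip(weights, shared_tokens):
        for j in range(d_tgt):
            projected[j] += weight * tgt_embed[tgt_id][j]
    return projected


# Toy data (hypothetical): 3 source tokens with d_src=2, 3 target tokens with d_tgt=3.
src_unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tgt_embed = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
shared_tokens = [(0, 1), (1, 2)]  # source id -> target id for shared tokens

projected = project_via_vocab([10.0, 0.0], src_unembed, tgt_embed, shared_tokens)
print(projected)
```

No weights are trained in this sketch: only the two existing embedding matrices and a token-overlap table are used, which matches the "zero training" description. The source and target hidden sizes do not need to agree, because the shared vocabulary acts as the bridge.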
Rosetta Accuracy
| Source → Target | GSM8K (n=200) | HumanEval (n=164) | DebugBench (n=100) |
|---|---|---|---|
| Qwen 7B → Qwen 3B | 82.5% | 66.5% | – |
| Qwen 7B → Llama 3B | 77.0% | 47.0% | 34.0% |
| Llama 3B → Qwen 7B | 90.0% | 79.3% | 45.0% |
| Qwen 7B → Qwen 1.5B | 58.5% | 42.1% | 26.0% |
Target model solo baselines, in column order (GSM8K / HumanEval / DebugBench, where measured): Qwen 7B = 91.0% / 58.5% / 50.0%. Qwen 3B = 82.5% / 61.0%. Llama 3B = 76.0% / 50.6% / 31.0%. Qwen 1.5B = 62.0%.
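Accuracy largely tracks the target model's own capability, though it is not a strict ceiling: Llama 3B → Qwen 7B reaches 79.3% on HumanEval against a 58.5% solo baseline. An advisory quality gate is included for prompts longer than 300 tokens, where projection degrades.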
Rosetta vs Text Cross-Model
| Direction | Benchmark | Rosetta | Text | Delta |
|---|---|---|---|---|
| Qwen 7B → Qwen 3B | GSM8K | 82.5% | 88.5% | text +6.0pp |
| Qwen 7B → Qwen 3B | HumanEval | 66.5% | 62.2% | rosetta +4.3pp |
| Qwen 7B → Llama 3B | GSM8K | 77.0% | 86.5% | text +9.5pp |
| Llama 3B → Qwen 7B | GSM8K | 90.0% | 82.0% | rosetta +8.0pp |
| Qwen 7B → Llama 3B | HumanEval | 47.0% | 57.9% | text +10.9pp |
| Llama 3B → Qwen 7B | HumanEval | 79.3% | 61.6% | rosetta +17.7pp |
| Qwen 7B → Llama 3B | DebugBench | 34.0% | 44.0% | text +10.0pp |
| Llama 3B → Qwen 7B | DebugBench | 45.0% | 40.0% | rosetta +5.0pp |
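Direction matters between Qwen 7B and Llama 3B: rosetta beats text on all three benchmarks when the stronger model (Qwen 7B) is the solver, and text wins on all three when the weaker model (Llama 3B) is the solver. Within the Qwen family (7B → 3B) the result depends on the task: rosetta wins on HumanEval (+4.3pp) and text wins on GSM8K (+6.0pp).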
Competition Math
| Benchmark | Direct | Latent (AVP) | Text |
|---|---|---|---|
| MATH (Qwen 7B, n=500) | 67.8% | 66.8% | 66.6% |
All three modes are statistically indistinguishable (p=1.0 latent vs text). Earlier runs at 512 max tokens showed a false text advantage caused by solver truncation. With a sufficient token budget (2048), the gap disappears.
Configuration
| Parameter | Value |
|---|---|
| Latent steps | 20 (validated: 10 ≈ 20 > 40 > 80) |
| Temperature | 0.7 |
| Max new tokens | 512 (MATH: 2048) |
| Seed | 42 |
| Hardware | NVIDIA A100 80GB |
Limitations
- Self-hosted only – requires direct KV-cache access (cloud APIs don't expose this)
- Single-embedding bottleneck – cross-model transfers one vector; fails on long comprehension tasks
- Multi-hop coherence – KV-cache across 4+ sequential hops loses signal
- 1B-7B models tested – larger models may behave differently
Reproduce
pip install avp[hf] datasets
python -m benchmarks.humaneval.run_humaneval \
--model_name Qwen/Qwen2.5-7B-Instruct \
--mode all --max-samples 50 --latent-steps 20 --seed 42
All benchmark code is in benchmarks/. Llama models require approved access on Hugging Face and an HF_TOKEN set in the environment.