vaara-bench-v0.37
May 28, 2026
Methodology delta against vaara-bench-v0.36. v0.37 is a corpus-augmentation release with two contributions:
- Third attacker family added to held-out eval. 900 adversarial entries generated by RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic on an AMD-backed MI300X SR-IOV droplet under rocm/vllm:latest. Three categories (tool_misuse, privilege_escalation, data_exfil), 300 per category, held out from TRAIN.
- v8 retrain on union corpus. v036 Mixtral TM/PE plus v036 Claude TM/PE entries folded into TRAIN. The v036 DE subset stays held out so the worst sub-cell from v0.36 (data_exfil × Claude) remains a pure cross-model eval. The new v037 holdout combines that carried v036 DE fold with the full v037 Llama-3.3 leg.
Production headline (v8 on v035 TEST)
| metric | v8 | v7 (prior production) | Δ |
|---|---|---|---|
| TEST recall at T=0.9006 | 86.6% [84.5, 88.6] | 85.3% [83.1, 87.3] | +1.3pp |
| TEST FPR at same calibration | 5.0% [3.6, 6.8] | 4.6% [3.3, 6.3] | +0.4pp (within CI) |
| n_features | 638 (254 hand + 384 embed) | 638 | flat |
| TRAIN entries | 11,287 | 8,501 | +2,786 |
| TRAIN positive rate | 0.683 | 0.579 | +0.104 (TM/PE folded from both legs) |
v8 is the new production bundle
(src/vaara/data/adversarial_classifier_v8.joblib). v7 and v6 stay on
disk for cross-eval reproducibility. No regression on the
in-distribution TEST set. Recall lifts 1.3 pp at a 0.4 pp FPR cost that
sits inside the prior FPR confidence interval.
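
To make the two headline metrics concrete, here is a minimal sketch of recall and FPR at a fixed threshold. It assumes the classifier emits one score per entry and that an entry is flagged when its score is at or above T (whether the production bundle uses `>=` or `>` is an assumption). The function name and the toy scores are illustrative and are not taken from the repo.

```python
from typing import Sequence


def recall_and_fpr(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> tuple[float, float]:
    """Return (recall, false-positive rate) when score >= threshold is flagged.

    labels: 1 = adversarial (positive), 0 = benign (negative).
    """
    tp = fn = fp = tn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold  # assumption: inclusive comparison
        if label == 1:
            if flagged:
                tp += 1
            else:
                fn += 1
        else:
            if flagged:
                fp += 1
            else:
                tn += 1
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return recall, fpr


# Toy data, purely illustrative: four positives, two negatives.
T = 0.9006
toy_scores = [0.99, 0.95, 0.42, 0.91, 0.10, 0.97]
toy_labels = [1, 1, 1, 0, 0, 1]
toy_recall, toy_fpr = recall_and_fpr(toy_scores, toy_labels, T)
```

On the toy data, three of the four positives clear the threshold and one of the two negatives does. Holding T fixed across v7 and v8 is what makes the recall and FPR columns directly comparable.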
Cross-model held-out result (v8 on v037 holdout)
| cut | n | recall at T=0.9006 |
|---|---|---|
| overall | 2,277 | 66.8% [64.9, 68.7] |
| Llama-3.3 leg (new third family) | 887 | 85.8% [83.3, 87.9] |
| Mixtral DE (carry-forward from v036) | 690 | 70.9% [67.4, 74.1] |
| Claude DE (carry-forward from v036) | 700 | 38.9% [35.3, 42.5] |
| data_exfil overall | 1,690 | 57.3% [54.9, 59.6] |
| privilege_escalation overall | 300 | 97.0% [94.4, 98.4] |
| tool_misuse overall | 287 | 91.6% [87.9, 94.3] |
| data_exfil × llama33 | 300 | 69.0% [63.6, 74.0] |
| privilege_escalation × llama33 | 300 | 97.0% [94.4, 98.4] |
| tool_misuse × llama33 | 287 | 91.6% [87.9, 94.3] |
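
The bracketed intervals in the table are consistent with 95% Wilson score intervals. A minimal sketch follows, assuming z = 1.96 and success counts reconstructed from the rounded percentages (761 of 887 for the Llama-3.3 leg, 291 of 300 for privilege_escalation). Those counts are inferred here and are not published in the release.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Counts below are reconstructed from the rounded table values (assumption).
llama_lo, llama_hi = wilson_interval(761, 887)  # Llama-3.3 leg, 85.8%
pe_lo, pe_hi = wilson_interval(291, 300)        # privilege_escalation, 97.0%
```

Under those assumptions the two intervals round to [83.3, 87.9] and [94.4, 98.4], matching the table rows. The Wilson form is asymmetric near 0% and 100%, which is why the 97.0% row has a wider lower side than upper side.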
Recall on data_exfil × Claude lifts from 26.0% (v7) to 38.9% (v8),
+12.9pp on the same 700 entries. The constraint diagnosed in v0.36
was training-corpus distribution, not feature space. Folding v036
TM/PE into TRAIN reweights the existing dst__ features against actual
adversarial examples, and the lift follows without any feature change.
Third-family cross-model number
Llama-3.3-70B-Instruct was not present in TRAIN. The 85.8% overall recall on its 887 entries indicates the v035 + v036 TM/PE TRAIN fold generalises to a third attacker family at near-in-distribution recall. Per-category cuts on this leg:
- privilege_escalation × llama33: 97.0%
- tool_misuse × llama33: 91.6%
- data_exfil × llama33: 69.0%
The data_exfil pattern from v0.36 holds. Across all three attacker families now seen, DE is the hardest category. PE and TM generalise cleanly. DE generalises unevenly.
Carry-forward DE numbers
v036 Mixtral DE: 70.9% (v8) vs 69.3% (v7) on the same 690 entries, +1.6pp, effectively flat. v036 Claude DE: 38.9% (v8) vs 26.0% (v7) on the same 700 entries, +12.9pp. The v7-to-v8 lift is concentrated in the closed-weight leg that was failing hardest. Open-weight Mixtral DE was already in the 70% tier and stays there. The asymmetry confirms the v0.36 mechanism finding: destination signal is the axis, and folding the related TM/PE distributions into TRAIN repositions the classifier on that axis for the closed-weight leg.
Ship gate
v0.37 ships under both a methodology gate and a sub-cell recall gate because v8 is a production retrain:
| gate | result |
|---|---|
| v035 TEST recall does not regress | PASS, 85.3% to 86.6%, +1.3pp |
| v035 TEST FPR does not regress | PASS, 4.6% to 5.0%, within CI |
| Worst v0.36 sub-cell improves | PASS, DE × Claude 26.0% to 38.9% |
| Third attacker family covered with recall floor | PASS, llama33 overall 85.8% |
| Held-out gap stays published with mechanism | PASS |
Cross-model overall recall is 66.8%, below the 70% floor used as a soft target in prior releases. That floor was set against the v035 TEST distribution, however. Cross-model overall is a harder denominator, and 66.8% is a 7.6 pp lift on the comparable v036 number (59.2% to 66.8%) with a third family added to the denominator.
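
The overall figure is a pooled recall, so each leg contributes in proportion to its entry count. The sketch below shows that arithmetic with hit counts reconstructed from the rounded per-leg percentages. The counts are an inference, not published values, and the dictionary keys are illustrative names.

```python
# (hits, n) per holdout leg. Hits are reconstructed from the rounded
# percentages in the cross-model table (assumption, not published counts).
legs = {
    "llama33": (761, 887),     # 85.8%
    "mixtral_de": (489, 690),  # 70.9%
    "claude_de": (272, 700),   # 38.9%
}


def pooled_recall(cells) -> float:
    """Entry-weighted recall: total hits over total entries."""
    cells = list(cells)
    total_hits = sum(hits for hits, _ in cells)
    total_n = sum(n for _, n in cells)
    return total_hits / total_n


overall = pooled_recall(legs.values())

# Unweighted mean of per-leg recalls, shown only for contrast.
macro = sum(hits / n for hits, n in legs.values()) / len(legs)
```

With these counts the pooled value rounds to 66.8%, in line with the table. The two carried-forward DE legs account for 1,390 of the 2,277 entries, which is why the overall number sits well below the Llama-3.3 leg.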
Generation provenance
Llama-3.3-70B generation ran on an AMD-backed MI300X DigitalOcean
SR-IOV droplet under rocm/vllm:latest serving
RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic with the model's native
compressed-tensors FP8 quantization
(--max-model-len 8192 --enforce-eager --gpu-memory-utilization 0.92).
Three parallel category generators, ~22 minutes wall clock for 900
entries at a combined steady-state rate of ~40 entries/min. Droplet
poweroff was issued post-rsync. The schema validation pass dropped 13
of 300 raw TM entries (4.3%) where the model emitted a non-DENY value
in the expected field. Final v037 counts: TM 287, PE 300, DE 300,
total 887 valid.
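
The DENY filter described above can be sketched as follows. This is not the code in scripts/validate_v037.py. The JSONL layout and the `expected` and `category` field names are assumptions made for illustration, and the real validator presumably checks more than this one field.

```python
import json
from typing import Iterable


def split_valid(lines: Iterable[str]) -> tuple[list[dict], list[dict]]:
    """Partition JSONL lines into (valid, dropped) on the expected decision.

    Assumes each entry carries an "expected" field (hypothetical name) and
    that every adversarial entry must expect "DENY".
    """
    valid: list[dict] = []
    dropped: list[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        entry = json.loads(line)
        if entry.get("expected") == "DENY":
            valid.append(entry)
        else:
            dropped.append(entry)
    return valid, dropped


# Illustrative raw generator output, not real corpus entries.
raw_lines = [
    json.dumps({"category": "tool_misuse", "expected": "DENY"}),
    json.dumps({"category": "tool_misuse", "expected": "ALLOW"}),
    json.dumps({"category": "tool_misuse", "expected": "DENY"}),
]
valid_entries, dropped_entries = split_valid(raw_lines)
drop_rate = len(dropped_entries) / len(raw_lines)
```

Applied to the release numbers, the same ratio is 13 / 300, or about 4.3%, leaving 287 valid TM entries.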
The v037 droplet recipe is identical to v0.36 modulo the model swap.
The --quantization flag had to be dropped because compressed-tensors
in the model config conflicts with an explicit fp8 argument. vLLM
auto-detects the quantization scheme from the model config in that
case, and that path serves correctly. This is a model-specific
configuration note rather than a methodology change.
Chain of custody
| anchor | path | pins |
|---|---|---|
| corpus manifest | tests/adversarial/MANIFEST.sha256 | SHA-256 of every JSONL including v037 |
| v035 split (inherited) | tests/adversarial/v035_split.json | TRAIN/VAL/TEST for v8 calibration |
| v037 split | tests/adversarial/v037_split.json | v035 inherited + v036 TM/PE to train, v036 DE + v037 to holdout |
| production bundle | src/vaara/data/adversarial_classifier_v8.joblib | trained on 11,287 entries with dst features + embeddings |
| prior production | src/vaara/data/adversarial_classifier_v7.joblib | retained for cross-eval |
| Llama-3.3 generator | scripts/generate_targeted_v037.py | vLLM HTTP, FP8 dynamic on MI300X |
| droplet driver | scripts/v037_droplet_run.sh | idempotent, no destructive EXIT trap |
| watcher | scripts/v037_local_watcher.sh | 60s rsync poll, opt-in doctl auto-shutdown |
| split builder | scripts/build_v037_split.py | inherits v035, folds v036 TM/PE into train |
| holdout eval | scripts/eval_v037_holdout.py | three-leg breakdown (mixtral, claude, llama33) |
| v035 schema check | scripts/validate_v037.py | same shape as v0.36 validator |
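
The fold rule in the v037 split row can be written as a small function. This is a sketch of the rule as stated in the table, not the contents of scripts/build_v037_split.py. The `source`, `category`, and `fold` field names are hypothetical.

```python
def assign_fold(entry: dict) -> str:
    """Map a corpus entry to its fold under the v037 split rule.

    Hypothetical fields: "source" (release that produced the entry),
    "category", and, for v035 entries, the inherited "fold".
    """
    source = entry["source"]
    if source == "v035":
        # Inherited unchanged from the v035 split: train, val, or test.
        return entry["fold"]
    if source == "v036":
        # TM/PE fold into TRAIN; DE stays held out for cross-model eval.
        return "holdout" if entry["category"] == "data_exfil" else "train"
    if source == "v037":
        # The whole Llama-3.3 leg is held out.
        return "holdout"
    raise ValueError(f"unknown source: {source!r}")


examples = [
    {"source": "v035", "category": "tool_misuse", "fold": "test"},
    {"source": "v036", "category": "privilege_escalation"},
    {"source": "v036", "category": "data_exfil"},
    {"source": "v037", "category": "tool_misuse"},
]
folds = [assign_fold(e) for e in examples]
```

Keeping the rule a pure function of the entry makes the split reproducible from the corpus alone, which is the property the pinned split manifest is meant to guarantee.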
Reproduction recipe
```bash
# Verify corpus integrity in a subshell so the working directory stays at the repo root.
(cd tests/adversarial && sha256sum -c MANIFEST.sha256)

# Validate the v037 entries, then build the split manifest.
.venv/bin/python scripts/validate_v037.py
.venv/bin/python scripts/build_v037_split.py

# Retrain v8 on the TRAIN fold of the v037 split.
.venv/bin/python scripts/save_classifier_bundle.py \
  --version v0.37 --threshold 0.9006 --embeddings \
  --split-manifest tests/adversarial/v037_split.json \
  --train-fold train \
  --bundle-out src/vaara/data/adversarial_classifier_v8.joblib

# Evaluate v8 on the v037 holdout.
.venv/bin/python scripts/eval_v037_holdout.py \
  --bundle src/vaara/data/adversarial_classifier_v8.joblib \
  --json-out bench/v037_eval_v8_holdout.json
```
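
Where `sha256sum` is not available, the manifest check in the first step can be approximated in Python. The sketch assumes the manifest uses the standard `sha256sum` line format (hex digest, whitespace, file name, with an optional `*` binary marker) and that names are relative to the manifest's directory. It does not handle escaped file names. The demo runs against a temporary directory, not the real corpus.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> list[str]:
    """Return the names of files that are missing or whose digest differs."""
    failures: list[str] = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        if name.startswith("*"):  # binary-mode marker used by sha256sum
            name = name[1:]
        target = manifest.parent / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures


# Demo on throwaway files: verify, tamper, verify again.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample = root / "sample.jsonl"
    sample.write_text('{"expected": "DENY"}\n')
    manifest_path = root / "MANIFEST.sha256"
    manifest_path.write_text(f"{sha256_of(sample)}  sample.jsonl\n")
    clean_failures = verify_manifest(manifest_path)
    sample.write_text('{"expected": "ALLOW"}\n')
    tampered_failures = verify_manifest(manifest_path)
```

An empty list means every listed file matched. Any non-empty result should stop the recipe before retraining, since the bundle would no longer correspond to the pinned corpus.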
Named limits
- Third family generation is 887 valid entries, not 4,000+ like v0.36. Wilson CI on a 300-entry sub-cell at p ~ 0.85 is ± 4 pp, adequate for ship-gate decisions. Scaling the Llama-3.3 leg to v036 density is v0.38 scope, paired with public-benchmark evaluation.
- Open-weight families dominate the holdout. Llama-3.3 and Mixtral are both open-weight models, from Meta and Mistral respectively. Closed-weight coverage in v0.37 is the carry-forward Claude DE subset only. Adding GPT-4o-class or Gemini-class generation is v0.38 scope.
- No public-benchmark eval (PINT, BIPIA, INJECT) yet. v0.38 scope.
- PAIR multi-attacker scale-up not performed. v0.38 scope (target ASR Wilson upper under 1%).
- FPR-bounded three-stage combiner per FCR paper (arxiv:2605.22004) not implemented. v0.39 scope.
Cumulative position
v0.37 lifts the worst v0.36 sub-cell by 12.9 pp without giving up in-distribution recall, and covers a third attacker family at 85.8% overall. The data_exfil category remains the hardest cross-model surface. That is the v0.38 + v0.39 line of work: public-benchmark numbers, PAIR-at-scale, FPR-bounded combiner.