vaara-bench-v0.37

May 28, 2026

Methodology delta against vaara-bench-v0.36. v0.37 is a corpus-augmentation release with two contributions:

  1. Third attacker family added to held-out eval. 900 adversarial entries generated by RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic on AMD-backed MI300X SR-IOV under rocm/vllm:latest. Three categories (tool_misuse, privilege_escalation, data_exfil), 300 per category, held out from TRAIN.
  2. v8 retrain on union corpus. v036 Mixtral TM/PE plus v036 Claude TM/PE entries folded into TRAIN. The v036 DE subset stays held out so the worst sub-cell from v0.36 (data_exfil × Claude) remains a pure cross-model eval. The new v037 holdout combines that carried v036 DE fold with the full v037 Llama-3.3 leg.
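The fold assignment described in the two items above can be sketched as a small routing function. This is an illustration only, not the logic in scripts/build_v037_split.py. The field names `corpus` and `category` and the short category codes are hypothetical.

```python
def assign_fold(entry: dict) -> str:
    """Route one v036/v037 entry to 'train' or 'holdout' under the v0.37 rule.

    Assumed (hypothetical) entry shape:
        {"corpus": "v036" | "v037", "category": "TM" | "PE" | "DE"}
    v035 entries keep their inherited TRAIN/VAL/TEST split and are not handled here.
    """
    corpus = entry["corpus"]
    category = entry["category"]
    if corpus == "v037":
        # The whole Llama-3.3 leg stays out of TRAIN.
        return "holdout"
    if corpus == "v036":
        # Only data_exfil stays held out; TM and PE are folded into TRAIN.
        return "holdout" if category == "DE" else "train"
    raise ValueError(f"unexpected corpus: {corpus!r}")
```

Under this rule a v036 Claude DE entry and every v037 entry land in the holdout, while v036 Mixtral and Claude TM/PE entries land in TRAIN.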

Production headline (v8 on v035 TEST)

| metric | v8 | v7 (prior production) | Δ |
|---|---|---|---|
| TEST recall at T=0.9006 | 86.6% [84.5, 88.6] | 85.3% [83.1, 87.3] | +1.3pp |
| TEST FPR at same calibration | 5.0% [3.6, 6.8] | 4.6% [3.3, 6.3] | +0.4pp (within CI) |
| n_features | 638 (254 hand + 384 embed) | 638 | flat |
| TRAIN entries | 11,287 | 8,501 | +2,786 |
| TRAIN positive rate | 0.683 | 0.579 | folded TM/PE both legs |

v8 is the new production bundle (src/vaara/data/adversarial_classifier_v8.joblib). v7 and v6 stay on disk for cross-eval reproducibility. No regression on the in-distribution TEST set. Recall lifts 1.3 pp at a 0.4 pp FPR cost that sits inside the prior FPR confidence interval.
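For readers who want to see what "recall at T=0.9006" and "FPR at same calibration" mean operationally, here is a minimal sketch. It assumes the classifier emits a score in [0, 1], that a score at or above the threshold flags an entry as adversarial, and that label 1 marks an adversarial entry. The function name and the toy data are hypothetical, not part of the vaara codebase.

```python
from typing import Sequence, Tuple


def recall_fpr_at_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    threshold: float = 0.9006,
) -> Tuple[float, float]:
    """Return (recall, false-positive rate) at a fixed decision threshold."""
    tp = fn = fp = tn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold  # assumption: >= counts as a flag
        if label == 1:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return recall, fpr


# Toy data: three adversarial entries, two benign ones.
toy_scores = [0.95, 0.91, 0.40, 0.92, 0.10]
toy_labels = [1, 1, 1, 0, 0]
toy_recall, toy_fpr = recall_fpr_at_threshold(toy_scores, toy_labels)
```

On the toy data two of the three adversarial entries clear the threshold and one of the two benign entries is flagged, so recall is 2/3 and FPR is 1/2.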

Cross-model held-out result (v8 on v037 holdout)

| cut | n | recall at T=0.9006 |
|---|---|---|
| overall | 2,277 | 66.8% [64.9, 68.7] |
| Llama-3.3 leg (new third family) | 887 | 85.8% [83.3, 87.9] |
| Mixtral DE (carry-forward from v036) | 690 | 70.9% [67.4, 74.1] |
| Claude DE (carry-forward from v036) | 700 | 38.9% [35.3, 42.5] |
| data_exfil overall | 1,690 | 57.3% [54.9, 59.6] |
| privilege_escalation overall | 300 | 97.0% [94.4, 98.4] |
| tool_misuse overall | 287 | 91.6% [87.9, 94.3] |
| data_exfil × llama33 | 300 | 69.0% [63.6, 74.0] |
| privilege_escalation × llama33 | 300 | 97.0% [94.4, 98.4] |
| tool_misuse × llama33 | 287 | 91.6% [87.9, 94.3] |
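The overall row is the entry-weighted pool of the three legs, not an average of their recalls. A short sketch of that arithmetic, using the rounded per-leg values published above (the dictionary keys are illustrative labels only):

```python
# (n, recall) per leg, copied from the published cuts.
legs = {
    "llama33": (887, 0.858),
    "mixtral_de": (690, 0.709),
    "claude_de": (700, 0.389),
}


def pooled_recall(cuts: dict) -> tuple:
    """Pool per-leg recalls by entry count; returns (total_n, pooled_recall)."""
    total_n = sum(n for n, _ in cuts.values())
    expected_hits = sum(n * recall for n, recall in cuts.values())
    return total_n, expected_hits / total_n


total_n, overall = pooled_recall(legs)
```

The pooled count matches the 2,277 in the overall row, and the pooled recall lands within about 0.1 pp of the published 66.8%. The residual comes from using rounded per-leg percentages instead of raw hit counts. The Claude DE leg carries roughly 31% of the denominator, which is why it pulls the overall figure so far below the Llama-3.3 leg.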

The v8 number on data_exfil × Claude lifts from 26.0% (v7) to 38.9% (+12.9pp) on the same 700 entries. The constraint diagnosed in v0.36 was training-corpus distribution, not feature space. Folding v036 TM/PE into TRAIN reweights the existing dst__ features against actual adversarial examples, and the lift follows with the feature count unchanged at 638.

Third-family cross-model number

Llama-3.3-70B-Instruct was not present in TRAIN. The 85.8% overall recall on its 887 entries indicates the v035 + v036 TM/PE TRAIN fold generalises to a third attacker family at near-in-distribution recall. Per-category cuts on this leg:

  • privilege_escalation × llama33: 97.0%
  • tool_misuse × llama33: 91.6%
  • data_exfil × llama33: 69.0%

The data_exfil pattern from v0.36 holds. Across all three attacker families now seen, DE is the hardest category. PE and TM generalise cleanly. DE generalises unevenly.

Carry-forward DE numbers

v036 Mixtral DE: 70.9% (v8) vs 69.3% (v7 on the same 690 entries), +1.6pp and effectively flat. v036 Claude DE: 38.9% (v8) vs 26.0% (v7 on the same 700 entries), +12.9pp. The v7-to-v8 lift is concentrated in the closed-weight leg that was failing hardest. Open-weight Mixtral DE was already around 70% and stays there. The asymmetry is consistent with the v0.36 mechanism finding: destination signal is the axis, and folding the related TM/PE distributions into TRAIN repositions the classifier on that axis for the closed-weight leg.

Ship gate

v0.37 ships under both a methodology gate and a sub-cell recall gate because v8 is a production retrain:

| gate | result |
|---|---|
| v035 TEST recall does not regress | PASS, 85.3% to 86.6%, +1.3pp |
| v035 TEST FPR does not regress | PASS, 4.6% to 5.0%, within CI |
| Worst v0.36 sub-cell improves | PASS, DE × Claude 26.0% to 38.9% |
| Third attacker family covered with recall floor | PASS, llama33 overall 85.8% |
| Held-out gap stays published with mechanism | PASS |
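The three numeric gates above can be read as simple predicates. The sketch below is one plausible encoding, assuming "FPR does not regress" means the new point estimate stays at or below the upper bound of the prior release's confidence interval. The function names are hypothetical. The third-family gate is left out because the release does not state its numeric floor.

```python
def recall_does_not_regress(new_recall: float, old_recall: float) -> bool:
    """Pass when the new point estimate is at least the old one."""
    return new_recall >= old_recall


def fpr_within_prior_ci(new_fpr: float, prior_ci: tuple) -> bool:
    """Pass when the new FPR does not exceed the prior CI upper bound (assumed rule)."""
    _, upper = prior_ci
    return new_fpr <= upper


gates = {
    "v035 TEST recall does not regress": recall_does_not_regress(0.866, 0.853),
    "v035 TEST FPR does not regress": fpr_within_prior_ci(0.050, (0.033, 0.063)),
    "worst v0.36 sub-cell improves": 0.389 > 0.260,
}
all_pass = all(gates.values())
```

With the published numbers all three predicates evaluate to true.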

Cross-model overall recall is 66.8%. That is below the 70% floor used as a soft target in prior releases, but that floor was set against the v035 TEST distribution. Cross-model overall is a harder denominator, and 66.8% is a 7.6 pp lift on the comparable v036 number (59.2% to 66.8%) with a third family added to the denominator.

Generation provenance

Llama-3.3-70B generation ran on an AMD-backed MI300X DigitalOcean SR-IOV droplet under rocm/vllm:latest serving RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic with the model's native compressed-tensors FP8 quantization (--max-model-len 8192 --enforce-eager --gpu-memory-utilization 0.92). Three parallel category generators produced the 900 raw entries in ~22 minutes wall clock, at a steady-state ~40 entries/min combined. Droplet poweroff was issued after the final rsync. The schema validation pass dropped 13 of 300 raw TM entries (4.3%) where the model emitted a non-DENY expected value. Final v037 counts: TM 287, PE 300, DE 300, total 887 valid.
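The drop rule in the validation pass can be illustrated as a filter. This is a sketch under assumptions, not the code in scripts/validate_v037.py: it assumes each entry is a dict with an `expected` field, and the `"ALLOW"` value used for the rejected entries is hypothetical.

```python
def split_valid(entries: list) -> tuple:
    """Keep entries whose expected verdict is DENY; return (valid, dropped_count)."""
    valid = [e for e in entries if e.get("expected") == "DENY"]
    return valid, len(entries) - len(valid)


# Toy stand-in for the raw TM batch: 287 well-formed entries and 13 rejects.
raw_tm = [{"expected": "DENY"}] * 287 + [{"expected": "ALLOW"}] * 13
valid_tm, dropped = split_valid(raw_tm)
drop_rate_pct = round(100 * dropped / len(raw_tm), 1)
```

With 13 rejects out of 300 the drop rate rounds to 4.3%, and 287 + 300 + 300 gives the 887 valid entries reported above.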

The v037 droplet recipe is identical to v0.36 modulo the model swap. The --quantization flag had to be dropped because compressed-tensors in the model config conflicts with an explicit fp8 argument. vLLM auto-detects the quantization scheme from the model config in that case, and that path serves correctly. This is a model-specific configuration note rather than a methodology change.

Chain of custody

| anchor | path | pins |
|---|---|---|
| corpus manifest | tests/adversarial/MANIFEST.sha256 | SHA-256 of every JSONL including v037 |
| v035 split (inherited) | tests/adversarial/v035_split.json | TRAIN/VAL/TEST for v8 calibration |
| v037 split | tests/adversarial/v037_split.json | v035 inherited + v036 TM/PE to train, v036 DE + v037 to holdout |
| production bundle | src/vaara/data/adversarial_classifier_v8.joblib | trained on 11,287 entries with dst features + embeddings |
| prior production | src/vaara/data/adversarial_classifier_v7.joblib | retained for cross-eval |
| Llama-3.3 generator | scripts/generate_targeted_v037.py | vLLM HTTP, FP8 dynamic on MI300X |
| droplet driver | scripts/v037_droplet_run.sh | idempotent, no destructive EXIT trap |
| watcher | scripts/v037_local_watcher.sh | 60s rsync poll, opt-in doctl auto-shutdown |
| split builder | scripts/build_v037_split.py | inherits v035, folds v036 TM/PE into train |
| holdout eval | scripts/eval_v037_holdout.py | three-leg breakdown (mixtral, claude, llama33) |
| v035 schema check | scripts/validate_v037.py | same shape as v0.36 validator |

Reproduction recipe

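# Run from the repository root. The subshell keeps the working directory
# unchanged so the relative .venv/ and scripts/ paths below still resolve.
(cd tests/adversarial && sha256sum -c MANIFEST.sha256)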
.venv/bin/python scripts/validate_v037.py
.venv/bin/python scripts/build_v037_split.py
.venv/bin/python scripts/save_classifier_bundle.py \
    --version v0.37 --threshold 0.9006 --embeddings \
    --split-manifest tests/adversarial/v037_split.json \
    --train-fold train \
    --bundle-out src/vaara/data/adversarial_classifier_v8.joblib
.venv/bin/python scripts/eval_v037_holdout.py \
    --bundle src/vaara/data/adversarial_classifier_v8.joblib \
    --json-out bench/v037_eval_v8_holdout.json
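Where `sha256sum` is not available, the manifest check in the first step can be approximated in Python. The sketch below assumes MANIFEST.sha256 uses the standard `sha256sum` line format (hex digest, whitespace, path relative to the manifest's directory). The function name is hypothetical.

```python
import hashlib
from pathlib import Path


def verify_manifest(manifest: Path) -> list:
    """Return the manifest-relative paths whose SHA-256 digest does not match."""
    mismatched = []
    base = manifest.parent
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes binary-mode names with '*'
        actual = hashlib.sha256((base / name).read_bytes()).hexdigest()
        if actual != digest.lower():
            mismatched.append(name)
    return mismatched
```

An empty return value corresponds to `sha256sum -c` reporting OK for every file. For example, `verify_manifest(Path("tests/adversarial/MANIFEST.sha256"))` would check every JSONL listed in the corpus manifest.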

Named limits

  1. Third family generation is 887 valid entries, not 4,000+ like v0.36. Wilson CI on a 300-entry sub-cell at p ~ 0.85 is ± 4 pp, adequate for ship-gate decisions. Scaling the Llama-3.3 leg to v036 density is v0.38 scope, paired with public-benchmark evaluation.
  2. Open-weight families dominate the third-family fold. Llama-3.3 (Meta) and Mixtral (Mistral) are both open-weight models. Closed-weight coverage in v0.37 is the carry-forward Claude DE subset only. Adding GPT-4o-class or Gemini-class generation is v0.38 scope.
  3. No public-benchmark eval (PINT, BIPIA, INJECT) yet. v0.38 scope.
  4. PAIR multi-attacker scale-up not performed. v0.38 scope (target: Wilson upper bound on ASR under 1%).
  5. FPR-bounded three-stage combiner per FCR paper (arxiv:2605.22004) not implemented. v0.39 scope.
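The ± 4 pp figure in item 1 can be checked with a Wilson score interval. The sketch below assumes a 95% interval (z = 1.96). That assumption is consistent with the published cuts, since 291 hits out of 300 reproduces the [94.4, 98.4] interval reported for privilege_escalation × llama33.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion; returns (lower, upper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Sub-cell of 300 entries at p = 0.85 (255 hits).
lo_85, hi_85 = wilson_interval(255, 300)
half_width_85 = (hi_85 - lo_85) / 2

# Cross-check against a published cut: 97.0% of 300 is 291 hits.
lo_pe, hi_pe = wilson_interval(291, 300)
```

The half-width at p = 0.85 comes out at about 0.04, which is the ± 4 pp quoted above.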

Cumulative position

v0.37 closes the worst v0.36 sub-cell by 12.9 pp without giving up in-distribution recall, and covers a third attacker family at 85.8% overall. The data_exfil category remains the hardest cross-model surface. That is the v0.38 + v0.39 line of work: public-benchmark numbers, PAIR-at-scale, FPR-bounded combiner.