vaara-bench-v0.37

May 28, 2026

Methodology delta against vaara-bench-v0.36. v0.37 is a corpus-augmentation release with two contributions:

Third attacker family added to held-out eval. 900 raw adversarial entries (887 after schema validation) generated by RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic on an AMD MI300X SR-IOV droplet under rocm/vllm:latest. Three categories (tool_misuse, privilege_escalation, data_exfil), 300 raw entries per category, all held out from TRAIN.
v8 retrain on union corpus. v036 Mixtral TM/PE plus v036 Claude TM/PE entries folded into TRAIN. The v036 DE subset stays held out so the worst sub-cell from v0.36 (data_exfil × Claude) remains a pure cross-model eval. The new v037 holdout combines that carried v036 DE fold with the full v037 Llama-3.3 leg.
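
The fold rules in the two items above can be sketched as a single assignment function. This is a minimal illustration, not the logic of scripts/build_v037_split.py: the record fields `corpus` and `category`, the corpus labels, and the function name `assign_fold` are all hypothetical.

```python
from typing import Optional

# Hypothetical record shape: each entry names the corpus it came from and its
# category. Field names and labels are illustrative, not the repository schema.
TRAIN_CATEGORIES_V036 = {"tool_misuse", "privilege_escalation"}


def assign_fold(entry: dict, inherited_fold: Optional[str] = None) -> str:
    """Return the v037 fold for one entry under the rules described above."""
    corpus = entry["corpus"]
    if corpus == "v035":
        # v035 entries keep the fold the inherited split already gave them.
        if inherited_fold is None:
            raise ValueError("v035 entries need their inherited fold")
        return inherited_fold
    if corpus == "v036":
        # TM/PE from both v036 legs fold into TRAIN; DE stays held out.
        return "train" if entry["category"] in TRAIN_CATEGORIES_V036 else "holdout"
    if corpus == "v037":
        # The whole Llama-3.3 leg is held out.
        return "holdout"
    raise ValueError(f"unknown corpus: {corpus!r}")
```

Keeping the rule keyed on corpus and category only, with no per-model branch for v036, mirrors the text: both the Mixtral and Claude TM/PE entries move to TRAIN, and both DE subsets stay out.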

Production headline (v8 on v035 TEST)

metric	v8	v7 (prior production)	Δ
TEST recall at T=0.9006	86.6% [84.5, 88.6]	85.3% [83.1, 87.3]	+1.3pp
TEST FPR at same calibration	5.0% [3.6, 6.8]	4.6% [3.3, 6.3]	+0.4pp (within CI)
n_features	638 (254 hand + 384 embed)	638	flat
TRAIN entries	11,287	8,501	+2,786
TRAIN positive rate	0.683	0.579	folded TM/PE both legs

v8 is the new production bundle (src/vaara/data/adversarial_classifier_v8.joblib). v7 and v6 stay on disk for cross-eval reproducibility. No regression on the in-distribution TEST set. Recall lifts 1.3 pp at a 0.4 pp FPR cost that sits inside the prior FPR confidence interval.
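
For readers who want to recompute the headline metrics from raw classifier scores, the following is a minimal sketch of recall and FPR at a fixed threshold. It assumes labels of 1 for adversarial and 0 for benign, and that a score at or above the threshold flags an entry; whether the production bundle uses `>=` or `>` is not stated here, and `recall_and_fpr` is a hypothetical helper name.

```python
from typing import Sequence, Tuple

THRESHOLD = 0.9006  # calibration threshold quoted in this release


def recall_and_fpr(scores: Sequence[float], labels: Sequence[int],
                   threshold: float = THRESHOLD) -> Tuple[float, float]:
    """Recall and false-positive rate when score >= threshold flags an entry.

    labels: 1 = adversarial, 0 = benign (an assumed encoding).
    """
    tp = fp = pos = neg = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if label == 1:
            pos += 1
            tp += flagged
        else:
            neg += 1
            fp += flagged
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative entry")
    return tp / pos, fp / neg


# Toy data: three adversarial entries (two flagged), two benign (one flagged).
recall, fpr = recall_and_fpr([0.95, 0.50, 0.91, 0.92, 0.10], [1, 1, 1, 0, 0])
```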

Cross-model held-out result (v8 on v037 holdout)

cut	n	recall at T=0.9006
overall	2,277	66.8% [64.9, 68.7]
Llama-3.3 leg (new third family)	887	85.8% [83.3, 87.9]
Mixtral DE (carry-forward from v036)	690	70.9% [67.4, 74.1]
Claude DE (carry-forward from v036)	700	38.9% [35.3, 42.5]
data_exfil overall	1,690	57.3% [54.9, 59.6]
privilege_escalation overall	300	97.0% [94.4, 98.4]
tool_misuse overall	287	91.6% [87.9, 94.3]
data_exfil × llama33	300	69.0% [63.6, 74.0]
privilege_escalation × llama33	300	97.0% [94.4, 98.4]
tool_misuse × llama33	287	91.6% [87.9, 94.3]
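
The bracketed intervals in the table above appear consistent with 95% Wilson score intervals. A minimal sketch for checking one cell follows; the hit count is reconstructed from the rounded percentage (97.0% of 300 is 291), which is an assumption rather than a published count.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# privilege_escalation x llama33: 97.0% of 300 corresponds to 291 hits.
low, high = wilson_interval(291, 300)
print(f"{100 * low:.1f} {100 * high:.1f}")  # prints: 94.4 98.4
```

The same function applied to a 300-entry cell at p = 0.85 (255 hits) gives a half-width of about 4 pp.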

Recall on data_exfil × Claude lifts from 26.0% (v7) to 38.9% (v8), +12.9pp, on the same 700 entries. The constraint diagnosed in v0.36 was training-corpus distribution, not feature space. Folding v036 TM/PE into TRAIN reweights the existing dst__ features against actual adversarial examples, and the lift follows from that reweighting alone.

Third-family cross-model number

Llama-3.3-70B-Instruct was not present in TRAIN. The 85.8% overall recall on its 887 entries indicates the v035 + v036 TM/PE TRAIN fold generalises to a third attacker family at near-in-distribution recall. Per-category cuts on this leg:

privilege_escalation × llama33: 97.0%
tool_misuse × llama33: 91.6%
data_exfil × llama33: 69.0%

The data_exfil pattern from v0.36 holds. Across all three attacker families now seen, DE is the hardest category. PE and TM generalise cleanly. DE generalises unevenly.

Carry-forward DE numbers

v036 Mixtral DE: 70.9% (v8) vs 69.3% (v7 on the same 690 entries), +1.6pp and flat within the CI. v036 Claude DE: 38.9% (v8) vs 26.0% (v7 on the same 700 entries), +12.9pp. The v7 to v8 lift is concentrated in the closed-weight leg that was failing hardest. Open-weight Mixtral DE was already at roughly 70% and stays there. The asymmetry confirms the v0.36 mechanism finding (destination signal is the axis, and folding the related TM/PE distributions into TRAIN repositions the classifier on that axis for the closed-weight leg).

Ship gate

v0.37 ships under both a methodology gate and a sub-cell recall gate because v8 is a production retrain:

gate	result
v035 TEST recall does not regress	PASS, 85.3% to 86.6%, +1.3pp
v035 TEST FPR does not regress	PASS, 4.6% to 5.0%, within CI
Worst v0.36 sub-cell improves	PASS, DE × Claude 26.0% to 38.9%
Third attacker family covered with recall floor	PASS, llama33 overall 85.8%
Held-out gap stays published with mechanism	PASS
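
The three numeric gates in the table above can be expressed as simple predicates. This is a sketch under stated assumptions: the function names are hypothetical, "within CI" is read as "the new FPR sits inside the prior FPR interval", and the recall-floor and mechanism gates are left out because their thresholds are not given in this release.

```python
from typing import Tuple

Interval = Tuple[float, float]


def recall_gate(prior: float, new: float) -> bool:
    """Pass when recall did not drop."""
    return new >= prior


def fpr_gate(prior: float, prior_ci: Interval, new: float) -> bool:
    """Pass when FPR did not rise, or rose but stays inside the prior CI."""
    low, high = prior_ci
    return new <= prior or low <= new <= high


def subcell_gate(prior: float, new: float) -> bool:
    """Pass when the worst prior sub-cell strictly improved."""
    return new > prior


# Values in percent, copied from the tables in this release.
results = {
    "v035 TEST recall": recall_gate(85.3, 86.6),
    "v035 TEST FPR": fpr_gate(4.6, (3.3, 6.3), 5.0),
    "DE x Claude": subcell_gate(26.0, 38.9),
}
print(results)
```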

Cross-model overall recall is 66.8%, below the 70% floor used as a soft target in prior releases, but that floor was set against the v035 TEST distribution. Cross-model overall is a harder denominator, and 66.8% is a 7.6 pp lift on the comparable v036 number (59.2% to 66.8%) with a third family added to the denominator.
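
To show how the overall figure decomposes, the sketch below rebuilds it from the three holdout legs. Hit counts are reconstructed from the rounded per-leg percentages, so they are estimates rather than published counts, and the leg names are illustrative.

```python
# (n, recall in percent) per holdout leg, copied from the cross-model table.
legs = {
    "llama33": (887, 85.8),
    "mixtral_de": (690, 70.9),
    "claude_de": (700, 38.9),
}

# Hit counts are reconstructed from rounded percentages, so they are estimates.
hits = {name: round(n * pct / 100) for name, (n, pct) in legs.items()}
total_n = sum(n for n, _ in legs.values())
overall = 100 * sum(hits.values()) / total_n

print(total_n, round(overall, 1))  # prints: 2277 66.8
```

The Claude DE leg is roughly 31% of the denominator at 38.9% recall, which is what holds the overall number under 70%.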

Generation provenance

Llama-3.3-70B generation ran on an AMD-backed MI300X DigitalOcean SR-IOV droplet under rocm/vllm:latest serving RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic with the model's native compressed-tensors FP8 quantization (--max-model-len 8192 --enforce-eager --gpu-memory-utilization 0.92). Three parallel category generators, ~22 minutes wall clock for 900 entries at steady-state ~40 entries/min combined. Droplet poweroff issued post-rsync. The schema validation pass dropped 13 of 300 raw TM entries (4.3%) where the model emitted a non-DENY expected value. Final v037 counts: TM 287, PE 300, DE 300, total 887 valid.
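
The drop rule described above can be sketched as a one-field filter. The `expected` key and the `DENY` literal are inferred from the prose, the `ALLOW` value in the usage lines is purely illustrative, and scripts/validate_v037.py may well check more than this.

```python
from typing import Iterable, List, Tuple


def split_valid(entries: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Separate entries whose expected decision is DENY from the rest.

    The "expected" key and the "DENY" literal are assumptions taken from the
    release text; the real validator may enforce additional schema rules.
    """
    valid, dropped = [], []
    for entry in entries:
        (valid if entry.get("expected") == "DENY" else dropped).append(entry)
    return valid, dropped


# Illustrative raw batch shaped like the TM leg: 300 raw, 13 non-DENY.
raw = [{"expected": "DENY"}] * 287 + [{"expected": "ALLOW"}] * 13
valid, dropped = split_valid(raw)
```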

The v037 droplet recipe is identical to v0.36 modulo the model swap. The --quantization flag had to be dropped because compressed-tensors in the model config conflicts with an explicit fp8 argument. vLLM auto-detects the quantization scheme from the model config in that case, and that path serves correctly. This is a model-specific configuration note rather than a methodology change.

Chain of custody

anchor	path	pins
corpus manifest	`tests/adversarial/MANIFEST.sha256`	SHA-256 of every JSONL including v037
v035 split (inherited)	`tests/adversarial/v035_split.json`	TRAIN/VAL/TEST for v8 calibration
v037 split	`tests/adversarial/v037_split.json`	v035 inherited + v036 TM/PE to train, v036 DE + v037 to holdout
production bundle	`src/vaara/data/adversarial_classifier_v8.joblib`	trained on 11,287 entries with dst features + embeddings
prior production	`src/vaara/data/adversarial_classifier_v7.joblib`	retained for cross-eval
Llama-3.3 generator	`scripts/generate_targeted_v037.py`	vLLM HTTP, FP8 dynamic on MI300X
droplet driver	`scripts/v037_droplet_run.sh`	idempotent, no destructive EXIT trap
watcher	`scripts/v037_local_watcher.sh`	60s rsync poll, opt-in doctl auto-shutdown
split builder	`scripts/build_v037_split.py`	inherits v035, folds v036 TM/PE into train
holdout eval	`scripts/eval_v037_holdout.py`	three-leg breakdown (mixtral, claude, llama33)
v037 schema check	`scripts/validate_v037.py`	same shape as v0.36 validator
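
Where `sha256sum` is not available, the manifest anchor can be checked with a few lines of Python. This is a minimal sketch that assumes the manifest uses the plain `sha256sum` line format (hex digest, whitespace, relative path) with paths resolved against the manifest's own directory; `verify_manifest` is a hypothetical helper, not a script in the repository.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> List[str]:
    """Return the relative paths whose digest does not match the manifest.

    Assumes lines of the form "<hex digest>  <relative path>", with paths
    resolved against the directory that holds the manifest.
    """
    mismatches = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # drop the binary-mode marker if present
        target = manifest.parent / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            mismatches.append(name)
    return mismatches
```

An empty return value means every listed file matched; a missing file is reported the same way as a digest mismatch.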

Reproduction recipe

(cd tests/adversarial && sha256sum -c MANIFEST.sha256)
.venv/bin/python scripts/validate_v037.py
.venv/bin/python scripts/build_v037_split.py
.venv/bin/python scripts/save_classifier_bundle.py \
    --version v0.37 --threshold 0.9006 --embeddings \
    --split-manifest tests/adversarial/v037_split.json \
    --train-fold train \
    --bundle-out src/vaara/data/adversarial_classifier_v8.joblib
.venv/bin/python scripts/eval_v037_holdout.py \
    --bundle src/vaara/data/adversarial_classifier_v8.joblib \
    --json-out bench/v037_eval_v8_holdout.json

Named limits

Third family generation is 887 valid entries, not 4,000+ like v0.36. Wilson CI on a 300-entry sub-cell at p ~ 0.85 is ± 4 pp, adequate for ship-gate decisions. Scaling the Llama-3.3 leg to v036 density is v0.38 scope, paired with public-benchmark evaluation.
Open-weight families dominate the v037 holdout. Llama-3.3 (Meta) and Mixtral (Mistral AI) are both open-weight models. Closed-weight coverage in v0.37 is the carry-forward Claude DE subset only. Adding GPT-4o-class or Gemini-class generation is v0.38 scope.
No public-benchmark eval (PINT, BIPIA, INJECT) yet. v0.38 scope.
PAIR multi-attacker scale-up not performed. v0.38 scope (target ASR Wilson upper under 1%).
FPR-bounded three-stage combiner per FCR paper (arxiv:2605.22004) not implemented. v0.39 scope.

Cumulative position

v0.37 closes the worst v0.36 sub-cell by 12.9 pp without giving up in-distribution recall, and covers a third attacker family at 85.8% overall. The data_exfil category remains the hardest cross-model surface. That is the v0.38 + v0.39 line of work: public-benchmark numbers, PAIR-at-scale, FPR-bounded combiner.