RAXE Performance Benchmarks
March 30, 2026 ยท View on GitHub
Overview
RAXE uses a two-layer detection architecture:
- L1 (Rule-Based): 515+ compiled regex rules. No external dependencies. Sub-millisecond to low-single-digit millisecond latency.
- L2 (ML-Based): Gemma-based ONNX classifier for semantic threat detection. Included with
pip install raxe.
All benchmarks measure scan latency only (excludes one-time initialization).
Methodology
- Timing:
time.perf_counter()(monotonic, nanosecond resolution) - Iterations: 100 timed iterations per test (configurable via
--iterations) - Warmup: 5 iterations discarded before measurement (configurable via
--warmup) - Percentiles: P50 (median), P95, P99 computed from sorted latency array
- Dry run: Benchmarks use
dry_run=Trueto exclude telemetry and history overhead - Memory: RSS via
resource.getrusage()(peak resident set size)
Running Benchmarks
# Full benchmark suite (L1 + L2 if available)
python scripts/run_benchmarks.py
# L1 only (skips ML model loading)
python scripts/run_benchmarks.py --l1-only
# More iterations for stable results
python scripts/run_benchmarks.py --iterations 500
# JSON output only
python scripts/run_benchmarks.py --output json
# Save JSON to file
python scripts/run_benchmarks.py --json-file results.json
# All options
python scripts/run_benchmarks.py --iterations 500 --warmup 10 --output both --json-file results.json
Input Sizes Tested
| Label | Characters | Description |
|---|---|---|
tiny_10 | ~10 | Single short sentence |
short_100 | ~100 | Typical user prompt |
medium_500 | ~500 | Detailed question |
long_1000 | ~1000 | Multi-paragraph prompt |
xlarge_5000 | ~5000 | Very long input |
threat | ~62 | Known prompt injection |
Sample Results
Results below are from a single run and will vary by hardware.
Run python scripts/run_benchmarks.py to get numbers for your environment.
L1 Rule-Based Detection
| Input Size | P50 (ms) | P95 (ms) | P99 (ms) |
|---|---|---|---|
| tiny_10 | run benchmark | run benchmark | run benchmark |
| short_100 | run benchmark | run benchmark | run benchmark |
| medium_500 | run benchmark | run benchmark | run benchmark |
| long_1000 | run benchmark | run benchmark | run benchmark |
| xlarge_5000 | run benchmark | run benchmark | run benchmark |
| threat | run benchmark | run benchmark | run benchmark |
L1 + L2 Combined Detection
| Input Size | P50 (ms) | P95 (ms) | P99 (ms) |
|---|---|---|---|
| tiny_10 | run benchmark | run benchmark | run benchmark |
| short_100 | run benchmark | run benchmark | run benchmark |
| medium_500 | run benchmark | run benchmark | run benchmark |
| long_1000 | run benchmark | run benchmark | run benchmark |
| xlarge_5000 | run benchmark | run benchmark | run benchmark |
| threat | run benchmark | run benchmark | run benchmark |
Memory Footprint
| Measurement | RSS (MB) |
|---|---|
| Before init (Python baseline) | run benchmark |
| After L1-only init | run benchmark |
| After L1+L2 init | run benchmark |
Performance Targets
| Metric | Target | Notes |
|---|---|---|
| L1 P95 | < 5 ms | Rules-only, typical input |
| L1+L2 P95 | < 10 ms | Combined, with stub L2 |
| L1+L2 P95 (full ML) | < 50 ms | With ONNX model loaded |
| Init time | < 500 ms | One-time startup cost |
| Throughput | > 1000 scans/sec | L1-only sustained |
CI Integration
The benchmark.yml GitHub Actions workflow runs benchmarks on every push
and PR. Results are tracked at the
performance dashboard.
Performance regressions > 20% trigger a PR comment alert.
Interpreting Results
- P50 (median): Typical latency most users experience.
- P95: Latency at the 95th percentile. Use this for SLA targets.
- P99: Tail latency. High P99 relative to P95 indicates occasional spikes.
- Stdev: Standard deviation. Low stdev means consistent performance.
Factors that affect results:
- CPU speed and architecture (Apple Silicon vs x86)
- System load during benchmark
- Python version (3.11+ has faster regex)
- Number of rules loaded (more rules = higher L1 latency)
- L2 model type (stub vs ONNX INT8)