Benchmarks

April 30, 2026

Regression Test Suite

The project includes 98 regression tests covering 5 verdict families.

Run the full suite:

python -m pytest tests/ -q

Verdict Family Coverage

| Family | Description | Test Count |
| --- | --- | --- |
| supported | Claim clearly backed by sources | 25 |
| likely_supported | More supporting than conflicting evidence | 20 |
| contested | Significant supporting and conflicting evidence | 18 |
| likely_false | Conflicting evidence outweighs supporting | 17 |
| insufficient_evidence | Not enough information | 18 |
| Total | | 98 |
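
The per-family counts can be sanity-checked against the stated total. The following is a minimal sketch, not part of the project: the names `FAMILY_COUNTS`, `EXPECTED_TOTAL`, and `check_coverage` are hypothetical, and the numbers are copied by hand from the coverage table rather than read from the test suite.

```python
# Per-family test counts, copied by hand from the coverage table.
# These names are illustrative and do not exist in the project.
FAMILY_COUNTS = {
    "supported": 25,
    "likely_supported": 20,
    "contested": 18,
    "likely_false": 17,
    "insufficient_evidence": 18,
}
EXPECTED_TOTAL = 98


def check_coverage(counts, expected_total):
    """Return the summed count, raising if it disagrees with the documented total."""
    total = sum(counts.values())
    if total != expected_total:
        raise ValueError(
            f"family counts sum to {total}, documented total is {expected_total}"
        )
    return total


if __name__ == "__main__":
    print(check_coverage(FAMILY_COUNTS, EXPECTED_TOTAL))
```

A check like this only confirms that the table is internally consistent; it says nothing about how many tests pytest actually collects.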

Running Benchmarks

# Full regression suite
python -m pytest tests/ -q

# Verbose output
python -m pytest tests/ -v

# Specific test module
python -m pytest tests/test_verify_claim.py -v
python -m pytest tests/test_core.py -v

Calibration Note

Confidence thresholds (support >= 1.35 for 'supported', etc.) are heuristic and have not been calibrated against a gold-standard dataset. Confidence levels (HIGH/MEDIUM/LOW) reflect relative signal strength, not probabilistic accuracy. See docs/trust-model.md for the semantic oracle boundary.
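
To make the distinction concrete, here is a minimal sketch of what a heuristic cutoff does. Only the value 1.35 comes from the note above; the function name `meets_supported_threshold` and the parameter `support_score` are assumptions made for illustration and may not match the project's code, and the other verdict thresholds are not reproduced here.

```python
# The only threshold stated in this document: support >= 1.35 for 'supported'.
SUPPORTED_THRESHOLD = 1.35


def meets_supported_threshold(support_score, threshold=SUPPORTED_THRESHOLD):
    """Return True when a support score reaches the heuristic cutoff.

    Hypothetical helper: it compares a score to a fixed cutoff and nothing more.
    The score is a relative signal, not a probability, so passing the cutoff
    does not imply any particular accuracy rate.
    """
    return support_score >= threshold


if __name__ == "__main__":
    # Two scores on either side of the cutoff.
    for score in (1.34, 1.35):
        print(score, meets_supported_threshold(score))
```

Because the cutoff is uncalibrated, a score just above 1.35 and one far above it receive the same verdict, and neither maps to a known error rate.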