Dimensions

March 20, 2026

SecLens computes 35 dimensions from evaluation results, organized into 7 categories. These dimensions form the basis for role-specific scoring — each role weights them differently based on what matters for their decision context.
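
As a rough illustration of role-specific weighting, the sketch below combines dimension values into one score per role using a weighted average. The function name `role_score`, the role profiles, the weights, and the assumption that every dimension is already normalized to [0, 1] with higher meaning better are all hypothetical; the actual SecLens scoring formula is not described in this section.

```python
def role_score(dimensions: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of dimension values (assumed normalized to [0, 1])."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(dimensions[d] * w for d, w in weights.items()) / total_weight


# Hypothetical normalized dimension values for one model.
dims = {"D1": 0.5, "D3": 0.8, "D18": 0.9}

# Hypothetical role profiles; the weights are invented for illustration.
detection_focused = {"D1": 3.0, "D3": 1.0}
cost_focused = {"D1": 1.0, "D18": 3.0}

print(round(role_score(dims, detection_focused), 3))  # 0.575
print(round(role_score(dims, cost_focused), 3))       # 0.8
```

The same model scores differently under the two profiles because each profile emphasizes different dimensions.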

Category A: Detection (D1–D8)

Core vulnerability detection metrics.

| ID | Dimension | What It Measures |
| --- | --- | --- |
| D1 | MCC | Overall classification quality (Matthews correlation coefficient), accounting for class imbalance. The single most balanced metric: it penalizes models that always say "vulnerable" or "safe." |
| D2 | Detection Rate | Of all real vulnerabilities, what percentage were detected? A model with a 60% detection rate misses 4 in 10 vulnerabilities. |
| D3 | Precision | Of all code flagged as vulnerable, what percentage was actually vulnerable? Low precision = high false positive noise. |
| D4 | F1 | Harmonic mean of detection rate and precision. Punishes models that sacrifice one for the other. |
| D5 | True Negative Rate | Of all safe code, what percentage was correctly cleared? Measures how quiet the model stays on clean code. |
| D6 | CWE Accuracy | Among detected vulnerabilities, correct CWE identification rate. Finding a vulnerability isn't enough; you need to know what kind. |
| D7 | Mean Location IoU | Average overlap (intersection over union) between the reported and the actual vulnerability location. Higher IoU = model points to the right code. |
| D8 | Actionable Finding Rate | Percentage of vulnerabilities reported with correct verdict AND correct CWE AND correct location: complete findings that need zero human triage. |
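
The confusion-matrix dimensions above (D1 to D5) follow their standard definitions, which can be sketched as below. The function names and the example counts are made up for illustration, and `line_iou` assumes locations are compared as sets of line numbers, which may differ from how SecLens measures location overlap.

```python
import math


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    recall = _ratio(tp, tp + fn)        # D2: detection rate
    precision = _ratio(tp, tp + fp)     # D3
    f1 = _ratio(2 * precision * recall, precision + recall)  # D4
    tnr = _ratio(tn, tn + fp)           # D5: true negative rate
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, mcc_denominator)          # D1
    return {"mcc": mcc, "detection_rate": recall, "precision": precision,
            "f1": f1, "tnr": tnr}


def line_iou(predicted: set[int], actual: set[int]) -> float:
    """Intersection over union of two sets of line numbers (assumed unit)."""
    union = predicted | actual
    return len(predicted & actual) / len(union) if union else 0.0


# Hypothetical counts: 100 vulnerable and 100 safe samples.
balanced = detection_metrics(tp=60, fp=20, tn=80, fn=40)
# A model that flags everything as vulnerable.
always_vulnerable = detection_metrics(tp=100, fp=100, tn=0, fn=0)

print(round(balanced["mcc"], 3))             # 0.408
print(always_vulnerable["detection_rate"])   # 1.0
print(always_vulnerable["mcc"])              # 0.0
print(line_iou({10, 11, 12}, {11, 12, 13, 14}))  # 0.4
```

The always-vulnerable model reaches a perfect detection rate yet an MCC of 0, which is the behavior D1 is meant to expose.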

Category B: Coverage & Consistency (D9–D13)

How reliably the model works across different vulnerability types, languages, and tool outputs.

| ID | Dimension | What It Measures |
| --- | --- | --- |
| D9 | CWE Coverage Breadth | Percentage of vulnerability categories with at least one correct detection. A specialist vs generalist indicator. |
| D10 | Worst Category Floor | Detection rate of the model's weakest vulnerability category. Average accuracy hides blind spots; this exposes them. |
| D11 | Cross-Language Consistency | How consistent performance is across programming languages. Low variance = predictable behavior. |
| D12 | Worst Language Floor | Detection rate of the model's weakest language. Critical for teams using that specific stack. |
| D13 | SAST FP Filtering | Accuracy on SAST false positive tasks. Can the model correctly dismiss findings from traditional static analysis tools? |
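
A minimal sketch of how the breadth and floor dimensions could be derived from per-group detection rates. The record fields (`cwe`, `language`, `detected`) and the sample data are hypothetical, and using the population standard deviation as the spread behind D11 is an assumption; this section does not state the exact consistency formula.

```python
from collections import defaultdict
from statistics import pstdev


def detection_rate_by(results: list[dict], key: str) -> dict[str, float]:
    """Detection rate per group, over vulnerable tasks only."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        hits[r[key]] += int(r["detected"])
    return {group: hits[group] / totals[group] for group in totals}


def coverage_breadth(rates: dict[str, float]) -> float:
    """Share of groups with at least one correct detection (D9)."""
    return sum(1 for v in rates.values() if v > 0) / len(rates)


def worst_floor(rates: dict[str, float]) -> float:
    """Detection rate of the weakest group (D10 and D12)."""
    return min(rates.values())


def spread(rates: dict[str, float]) -> float:
    """Population standard deviation across groups (assumed proxy for D11)."""
    return pstdev(list(rates.values()))


# Hypothetical results for six vulnerable tasks.
results = [
    {"cwe": "CWE-79", "language": "python", "detected": True},
    {"cwe": "CWE-79", "language": "java", "detected": True},
    {"cwe": "CWE-89", "language": "python", "detected": True},
    {"cwe": "CWE-89", "language": "java", "detected": False},
    {"cwe": "CWE-22", "language": "python", "detected": False},
    {"cwe": "CWE-22", "language": "java", "detected": False},
]

by_cwe = detection_rate_by(results, "cwe")
by_language = detection_rate_by(results, "language")

print(round(coverage_breadth(by_cwe), 3))  # 0.667
print(worst_floor(by_cwe))                 # 0.0
print(round(worst_floor(by_language), 3))  # 0.333
print(round(spread(by_language), 3))       # 0.167
```

The overall detection rate here is 50%, but the category floor of 0.0 shows that one CWE is never detected.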

Category C: Reasoning & Evidence (D14–D17)

Does the model explain its findings with supporting evidence?

| ID | Dimension | What It Measures |
| --- | --- | --- |
| D14 | Evidence Completeness | Percentage of responses with a complete evidence chain: source (input entry), sink (dangerous operation), and data flow path. |
| D15 | Reasoning Presence | Percentage of responses that include a written explanation. Verdicts without reasoning are black boxes. |
| D16 | Reasoning + Correct Verdict | Among responses with reasoning, how often is the verdict correct? Low scores suggest the model confabulates. |
| D17 | FP Reasoning Quality | Among false positives, what percentage include reasoning? An explained wrong answer is at least reviewable. |
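
The four reasoning dimensions differ mainly in their denominators, which the sketch below makes explicit. The response fields (`verdict`, `is_vulnerable`, `reasoning`, `source`, `sink`, `flow`) are hypothetical names chosen for illustration, not the SecLens response schema.

```python
def _share(subset: list, population: list) -> float:
    """Fraction of the population in the subset; 0.0 for an empty population."""
    return len(subset) / len(population) if population else 0.0


def reasoning_metrics(responses: list[dict]) -> dict[str, float]:
    has_reasoning = [r for r in responses if r["reasoning"].strip()]
    complete = [r for r in responses
                if all(r.get(k) for k in ("source", "sink", "flow"))]
    # A verdict is correct when "vulnerable" matches the ground truth.
    correct_with_reasoning = [
        r for r in has_reasoning
        if (r["verdict"] == "vulnerable") == r["is_vulnerable"]
    ]
    false_positives = [r for r in responses
                       if r["verdict"] == "vulnerable" and not r["is_vulnerable"]]
    explained_fps = [r for r in false_positives if r["reasoning"].strip()]
    return {
        "evidence_completeness": _share(complete, responses),           # D14
        "reasoning_presence": _share(has_reasoning, responses),         # D15
        "reasoning_correct": _share(correct_with_reasoning, has_reasoning),  # D16
        "fp_reasoning": _share(explained_fps, false_positives),         # D17
    }


# Hypothetical responses.
responses = [
    {"verdict": "vulnerable", "is_vulnerable": True, "reasoning": "tainted input",
     "source": "request.args", "sink": "cursor.execute", "flow": "args -> query"},
    {"verdict": "vulnerable", "is_vulnerable": False, "reasoning": "looks unsafe",
     "source": "request.args", "sink": None, "flow": None},
    {"verdict": "safe", "is_vulnerable": False, "reasoning": "",
     "source": None, "sink": None, "flow": None},
    {"verdict": "vulnerable", "is_vulnerable": False, "reasoning": "",
     "source": None, "sink": None, "flow": None},
]

metrics = reasoning_metrics(responses)
print(metrics)
# {'evidence_completeness': 0.25, 'reasoning_presence': 0.5,
#  'reasoning_correct': 0.5, 'fp_reasoning': 0.5}
```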

Category D: Operational Efficiency (D18–D23)

What does it cost and how fast is it?

| ID | Dimension | What It Measures |
| --- | --- | --- |
| D18 | Cost per Task | Average API cost per evaluation. Directly determines financial viability at scale. |
| D19 | Cost per True Positive | Average cost to find one real vulnerability. Combines cost efficiency with detection effectiveness. |
| D20 | MCC per Dollar | Detection quality per unit of cost. The ultimate efficiency metric. |
| D21 | Wall Time per Task | Average elapsed time. Determines whether the tool fits in real-time pipelines or batch jobs. |
| D22 | Throughput | Tasks per minute. Scale readiness metric. |
| D23 | Tokens per Task | Average token consumption. Model-agnostic cost proxy independent of pricing. |
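
The efficiency dimensions are simple ratios over run totals, sketched below. The parameter names and the example figures are hypothetical. Two points are assumptions rather than documented behavior: MCC is divided by total run cost, and throughput is derived from summed wall time as if tasks ran sequentially.

```python
import math


def efficiency_metrics(total_cost_usd: float, total_wall_seconds: float,
                       total_tokens: int, n_tasks: int,
                       true_positives: int, mcc: float) -> dict[str, float]:
    return {
        "cost_per_task": total_cost_usd / n_tasks,                     # D18
        # No true positives means no finite cost per finding.
        "cost_per_tp": (total_cost_usd / true_positives
                        if true_positives else math.inf),              # D19
        "mcc_per_dollar": mcc / total_cost_usd,                        # D20 (assumed: total cost)
        "wall_time_per_task": total_wall_seconds / n_tasks,            # D21
        "throughput_per_min": n_tasks / (total_wall_seconds / 60),     # D22 (assumed: sequential)
        "tokens_per_task": total_tokens / n_tasks,                     # D23
    }


# Hypothetical run: 200 tasks, $12 total, 50 minutes, 1.5M tokens.
run = efficiency_metrics(total_cost_usd=12.0, total_wall_seconds=3000.0,
                         total_tokens=1_500_000, n_tasks=200,
                         true_positives=60, mcc=0.4)

print(round(run["cost_per_task"], 3))   # 0.06
print(round(run["cost_per_tp"], 3))     # 0.2
print(run["wall_time_per_task"])        # 15.0
print(run["throughput_per_min"])        # 4.0
print(run["tokens_per_task"])           # 7500.0
```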

Category E: Tool-Use & Navigation (D24–D27)

How effectively the model uses tools to investigate code. Layer 2 only.

| ID | Dimension | What It Measures |
| --- | --- | --- |
| D24 | Tool Calls per Task | Investigation intensity. Too few = not exploring enough. Too many = flailing. |
| D25 | Turns per Task | Conversation length. Fewer turns = faster convergence. |
| D26 | Navigation Efficiency | Percentage of tasks resolved with 5 or fewer tool calls. Measures focused investigation. |
| D27 | Tool Effectiveness | Verdict accuracy among tasks where tools were used. Do tools actually help? |
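
A minimal sketch of the tool-use dimensions, assuming each task record carries hypothetical `tool_calls`, `turns`, and `correct` fields. Whether a task with zero tool calls counts toward the "5 or fewer" bucket in D26 is not specified here; this sketch counts it.

```python
from statistics import mean


def tool_use_metrics(tasks: list[dict], max_calls: int = 5) -> dict[str, float]:
    used_tools = [t for t in tasks if t["tool_calls"] > 0]
    focused = [t for t in tasks if t["tool_calls"] <= max_calls]
    return {
        "tool_calls_per_task": mean(t["tool_calls"] for t in tasks),   # D24
        "turns_per_task": mean(t["turns"] for t in tasks),             # D25
        "navigation_efficiency": len(focused) / len(tasks),            # D26
        # D27: accuracy restricted to tasks where at least one tool was called.
        "tool_effectiveness": (sum(t["correct"] for t in used_tools) / len(used_tools)
                               if used_tools else 0.0),
    }


# Hypothetical task records.
tasks = [
    {"tool_calls": 0, "turns": 1, "correct": False},
    {"tool_calls": 3, "turns": 4, "correct": True},
    {"tool_calls": 5, "turns": 6, "correct": True},
    {"tool_calls": 12, "turns": 14, "correct": False},
]

tool_metrics = tool_use_metrics(tasks)
print(tool_metrics["tool_calls_per_task"])              # 5
print(tool_metrics["turns_per_task"])                   # 6.25
print(tool_metrics["navigation_efficiency"])            # 0.75
print(round(tool_metrics["tool_effectiveness"], 3))     # 0.667
```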

Category F: Risk & Severity (D28–D30)

Does the model prioritize high-severity vulnerabilities?

| ID | Dimension | What It Measures |
| --- | --- | --- |
| D28 | Severity-Weighted Detection Rate | Detection rate where missing a critical vulnerability costs 4x more than missing a low-severity one. Uses advisory-reported severity. |
| D29 | Critical Miss Rate | Despite the name, this is the detection rate specifically on critical and high-severity vulnerabilities, so higher is better. Zero-tolerance metric. |
| D30 | Severity Coverage | Percentage of severity levels with at least one correct detection. Does the model work across the severity spectrum? |
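
The sketch below shows how severity weighting changes a plain detection rate. Only the 4:1 ratio between critical and low comes from the table above; the weights for high and medium, the field names, and the choice to measure coverage against the severity levels present in the data are assumptions.

```python
# Critical = 4x low is from the text; high and medium weights are assumed.
SEVERITY_WEIGHTS = {"critical": 4.0, "high": 3.0, "medium": 2.0, "low": 1.0}


def severity_weighted_detection_rate(findings: list[dict],
                                     weights: dict[str, float]) -> float:
    """Weighted share of vulnerabilities detected (D28)."""
    total = sum(weights[f["severity"]] for f in findings)
    detected = sum(weights[f["severity"]] for f in findings if f["detected"])
    return detected / total if total else 0.0


def high_severity_detection_rate(findings: list[dict]) -> float:
    """Detection rate on critical and high severity only (D29 as described)."""
    severe = [f for f in findings if f["severity"] in ("critical", "high")]
    return sum(f["detected"] for f in severe) / len(severe) if severe else 0.0


def severity_coverage(findings: list[dict]) -> float:
    """Share of severity levels present with at least one detection (D30)."""
    levels = {f["severity"] for f in findings}
    covered = {f["severity"] for f in findings if f["detected"]}
    return len(covered) / len(levels) if levels else 0.0


# Hypothetical vulnerable tasks: the only miss is the critical one.
findings = [
    {"severity": "critical", "detected": False},
    {"severity": "high", "detected": True},
    {"severity": "medium", "detected": True},
    {"severity": "low", "detected": True},
]

print(severity_weighted_detection_rate(findings, SEVERITY_WEIGHTS))  # 0.6
print(high_severity_detection_rate(findings))                        # 0.5
print(severity_coverage(findings))                                   # 0.75
```

The unweighted detection rate for this data is 0.75; weighting drops it to 0.6 because the single miss is the critical finding.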

Category G: Robustness (D31–D35)

Does the model work reliably without crashing or producing unusable output?

| ID | Dimension | What It Measures |
| --- | --- | --- |
| D31 | Parse Success Rate | Percentage of responses that are fully parseable structured JSON. Fundamental for pipeline integration. |
| D32 | Format Compliance | Among responses that produced some output, what percentage was well-formed? Isolates instruction-following from infrastructure failures. |
| D33 | Error Rate | Despite the name, this is the percentage of tasks completed without errors (API failures, timeouts, etc.). Higher = better. |
| D34 | Autonomous Completion Rate | Percentage of tasks that completed without error AND produced parseable output. The strictest reliability metric. |
| D35 | Graceful Degradation | Does accuracy drop proportionally with task difficulty, or does it fall off a cliff? Predictable behavior is essential for deployment trust. |
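
The first four robustness dimensions can be sketched from per-task run records as follows. The `error` and `output` fields are hypothetical, and treating "parseable structured JSON" as "decodes to a JSON object" is an assumption about what SecLens requires.

```python
import json


def is_structured_json(text) -> bool:
    """True when text decodes to a JSON object (assumed meaning of 'parseable')."""
    try:
        return isinstance(json.loads(text), dict)
    except (TypeError, ValueError):  # None input or malformed JSON
        return False


def robustness_metrics(runs: list[dict]) -> dict[str, float]:
    n = len(runs)
    with_output = [r for r in runs if r["output"]]
    parsed = [r for r in runs if is_structured_json(r["output"])]
    no_error = [r for r in runs if r["error"] is None]
    autonomous = [r for r in no_error if is_structured_json(r["output"])]
    return {
        "parse_success_rate": len(parsed) / n,                       # D31
        # D32 excludes runs with no output at all (e.g. timeouts).
        "format_compliance": (len(parsed) / len(with_output)
                              if with_output else 0.0),
        "error_free_rate": len(no_error) / n,                        # D33
        "autonomous_completion_rate": len(autonomous) / n,           # D34
    }


# Hypothetical run records.
runs = [
    {"error": None, "output": '{"verdict": "safe"}'},
    {"error": None, "output": "not json"},
    {"error": "timeout", "output": None},
    {"error": None, "output": '{"verdict": "vulnerable"}'},
]

robustness = robustness_metrics(runs)
print(robustness["parse_success_rate"])              # 0.5
print(round(robustness["format_compliance"], 3))     # 0.667
print(robustness["error_free_rate"])                 # 0.75
print(robustness["autonomous_completion_rate"])      # 0.5
```

The timeout lowers D31 and D33 but not D32, which is how format compliance separates instruction-following from infrastructure failures.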