Usage Guide

March 20, 2026

Configuration

Set your LLM provider API keys as environment variables. Add them to your shell profile (~/.zshrc or ~/.bashrc) for persistence:

# Add to ~/.zshrc or ~/.bashrc
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AIza..."
export OPENROUTER_API_KEY="sk-or-..."

Then reload your shell:

source ~/.zshrc   # or source ~/.bashrc

Or set them for a single session:

export ANTHROPIC_API_KEY="sk-ant-..."
seclens run -m "anthropic/claude-sonnet-4-20250514" -d dataset.jsonl

Supported Providers

| Provider | Environment Variable | Notes |
| --- | --- | --- |
| Anthropic | ANTHROPIC_API_KEY | Claude models |
| OpenAI | OPENAI_API_KEY | GPT models |
| Google Gemini | GOOGLE_API_KEY | Gemini models |
| OpenRouter | OPENROUTER_API_KEY | Multi-provider gateway |
| Ollama | (none) | Local models, no key needed |
| LiteLLM | LITELLM_API_KEY | Key for upstream provider |
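
A pre-flight check can catch a missing key before a run starts. The sketch below is not part of seclens; it assumes the provider is the part of the model identifier before the first slash, and the `openrouter` and `litellm` prefixes are guesses based on the provider names listed above.

```python
import os
from typing import Mapping, Optional

# Provider prefix -> environment variable, copied from the provider list above.
# Ollama is omitted because it needs no key. The "openrouter" and "litellm"
# prefixes are assumptions, not confirmed seclens identifiers.
PROVIDER_ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
    "litellm": "LITELLM_API_KEY",
}


def missing_key(model: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the unset variable name for `model`, or None if nothing is missing."""
    env = os.environ if env is None else env
    provider = model.split("/", 1)[0]
    var = PROVIDER_ENV_VARS.get(provider)
    if var is None:  # e.g. ollama, or a provider this sketch does not know
        return None
    return None if env.get(var) else var
```

Calling `missing_key("anthropic/claude-sonnet-4-20250514")` before `seclens run` reports which variable to export, if any.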

Running Evaluations

Basic Run

# Layer 2 (tool-use, default)
seclens run -m "anthropic/claude-sonnet-4-20250514" -d dataset.jsonl

# Layer 1 (code-in-prompt)
seclens run -m "openai/gpt-4.1" -d dataset.jsonl --layer code-in-prompt

# With HuggingFace dataset
seclens run -m "google/gemini-2.5-flash" -d enginesec/SecLens:test

Run Options

| Flag | Default | Description |
| --- | --- | --- |
| -m, --model | required | Model identifier (e.g., anthropic/claude-sonnet-4-20250514) |
| -d, --dataset | required | Dataset path (local JSONL or HuggingFace repo:split) |
| -l, --layer | tool-use | Evaluation layer (code-in-prompt or tool-use) |
| --mode | guided | Evaluation mode (guided with category hint, open without) |
| -p, --prompt | base | Prompt preset (base, minimal, security_expert) or custom YAML |
| -w, --workers | 5 | Parallel evaluation workers |
| --max-cost | unlimited | Budget cap in USD |
| --max-turns | 200 | Max LLM turns per task (Layer 2) |
| --seed | 42 | Random seed for reproducibility |
| --resume | off | Resume from existing output file |
| --retry-failed | | Path to results file; re-evaluates failed/missing tasks |
| --debug | off | Save full message chains to debug JSONL |
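
When runs are launched from a script, the flags above can be assembled into an argument list. This is a minimal sketch rather than an official wrapper: `build_run_command` is a hypothetical helper, and it assumes `--debug` is a plain on/off switch, as its "off" default suggests.

```python
import shlex


def build_run_command(model, dataset, layer="tool-use", mode="guided",
                      prompt="base", workers=5, max_cost=None, debug=False):
    """Assemble an argv list for `seclens run` from the documented flags."""
    cmd = ["seclens", "run", "-m", model, "-d", dataset,
           "--layer", layer, "--mode", mode, "--prompt", prompt,
           "--workers", str(workers)]
    if max_cost is not None:  # omit the flag to keep the budget unlimited
        cmd += ["--max-cost", str(max_cost)]
    if debug:  # assumed to be a boolean switch
        cmd.append("--debug")
    return cmd


cmd = build_run_command("openai/gpt-4.1", "dataset.jsonl",
                        layer="code-in-prompt", workers=2, max_cost=5.0)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The list can be passed to `subprocess.run(cmd, check=True)` to start the run.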

Output Files

Each run produces:

out/
  results_model_tu_guided_base_20260320_143022.jsonl   # Per-task results
  report_model_tu_guided_base_20260320_143022.json     # Pre-computed model report
  debug_results_model_tu_guided_base_20260320.jsonl    # Debug chains (if --debug)
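
In the listing above, the results and report files share everything except the prefix and extension. Assuming that naming pattern holds for every run (it is inferred from this one example), a hypothetical helper can locate the report that belongs to a results file:

```python
from pathlib import Path


def report_path_for(results_path):
    """Map results_<stem>.jsonl to report_<stem>.json in the same directory."""
    p = Path(results_path)
    if not (p.name.startswith("results_") and p.suffix == ".jsonl"):
        raise ValueError(f"not a results file: {p.name}")
    # Swap the prefix first, then the extension.
    return p.with_name("report_" + p.name[len("results_"):]).with_suffix(".json")
```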

Retrying Failed Tasks

If tasks fail due to API errors, timeouts, or context overflow:

seclens run -m "model" -d dataset.jsonl --retry-failed out/results_model.jsonl

This identifies failed, corrupt, and missing tasks, re-evaluates only those, and replaces the old entries in place.
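
The selection step can be pictured as a set difference between the tasks in the dataset and the tasks with a usable result. The sketch below illustrates that idea only; it is not the seclens implementation, and the `task_id` and `error` field names are hypothetical.

```python
import json


def tasks_to_retry(result_lines, expected_ids):
    """Return the IDs of tasks that are failed, corrupt, or missing.

    Assumed schema (hypothetical): each line is a JSON object with a
    "task_id" string and an "error" field that is null on success.
    """
    ok = set()
    for line in result_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # corrupt line: its task never reaches `ok`
        if not isinstance(record, dict) or "task_id" not in record:
            continue  # unusable record, treated as corrupt
        if record.get("error") is None:
            ok.add(record["task_id"])
    # Anything expected but not cleanly completed needs another attempt.
    return set(expected_ids) - ok
```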

Viewing Results

Summary (Aggregate Metrics)

seclens summary -r out/report_model.json

Shows the leaderboard score, MCC, CWE accuracy, location accuracy, cost metrics, and per-category and per-language breakdowns.
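
Assuming MCC here stands for the Matthews correlation coefficient over binary vulnerable / not-vulnerable verdicts, it can be computed from confusion-matrix counts as sketched below. How seclens derives those counts is not described in this guide, so the function is illustrative only.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The value ranges from -1 (every verdict wrong) through 0 (no better than chance) to +1 (every verdict right), which makes it more informative than plain accuracy on imbalanced datasets.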

Role Report

# Single role
seclens report -r out/report_model.json --role ciso

# All five roles
seclens report -r out/report_model.json --all-roles

Shows decision score, grade, dimension category breakdown, per-vulnerability-category performance, per-language performance, and a natural-language recommendation.

Cross-Model Comparison

# Through one role's lens
seclens compare -r model_a.jsonl -r model_b.jsonl --role ciso

# All roles matrix
seclens compare -r model_a.jsonl -r model_b.jsonl --all-roles

JSON Output

All commands support -o output.json for programmatic consumption:

seclens report -r results.jsonl --role ciso -o ciso_report.json
seclens report -r results.jsonl --all-roles -o all_roles.json

Prompt Presets

Three built-in presets control how the model is instructed:

| Preset | Description | Use Case |
| --- | --- | --- |
| base | Structured baseline with output format instructions | Default for leaderboard runs |
| minimal | Bare-bones prompt with minimal guidance | Testing raw capability |
| security_expert | Security audit methodology with anti-pattern guidance | Testing with expert framing |

In guided mode, the system prompt includes a category hint (e.g., "Focus on SQL injection vulnerabilities"). In open mode, no hint is provided.
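
The difference between the two modes can be sketched as a single conditional. `build_system_prompt` is a hypothetical function written for illustration; the hint wording follows the example in the paragraph above, and the real prompt assembly in seclens may differ.

```python
def build_system_prompt(preset_text, mode, category=None):
    """Append a category hint in guided mode; return the preset as-is in open mode."""
    if mode == "guided":
        if not category:
            raise ValueError("guided mode needs a category")
        return f"{preset_text}\n\nFocus on {category} vulnerabilities."
    if mode == "open":
        return preset_text
    raise ValueError(f"unknown mode: {mode!r}")
```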

Helper Scripts

Migrate Old Results

python scripts/migrate_results.py out/              # batch
python scripts/migrate_results.py out/ --dry-run     # preview

Converts old results files to the current schema: numeric layer values become named layers, and paired_with and category are backfilled on post-patch tasks.

Batch Generate Model Reports

python scripts/generate_model_reports.py out/        # generates missing reports only

Tips

  • Large repos (moodle, tensorflow): reduce workers (-w 2) to avoid disk space issues from concurrent clones
  • Small /tmp: set TMPDIR=/path/to/larger/disk before running
  • Ollama: no API key needed, runs locally. Use ollama/model:tag format
  • Cost control: use --max-cost 5.0 to cap spending per run
  • Reproducibility: the --seed flag (default 42) makes bootstrap confidence intervals (CIs) deterministic
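
To see why a fixed seed makes bootstrap CIs repeatable, consider a percentile bootstrap over per-task scores. This is a generic sketch, not the seclens code: the resampling scheme, `n_resamples`, and `alpha` are assumptions, and only the default seed of 42 comes from the options listed earlier.

```python
import random


def bootstrap_ci(values, seed=42, n_resamples=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # private generator: same seed, same resamples
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Two calls with the same data and seed draw identical resamples and so return identical bounds; changing the seed shifts the bounds slightly without changing the underlying scores.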