CircleGuardBench - A full-fledged benchmark for evaluating protection capabilities of AI models

May 11, 2025 ยท View on GitHub

CircleGuardBench Logo

CircleGuardBench - A full-fledged benchmark for evaluating protection capabilities of AI models

CircleGuardBench is the first-of-its-kind benchmark for evaluating the protection capabilities of large language model (LLM) guard systems. It tests how well guard models block harmful content, resist jailbreaks, avoid false positives, and operate efficiently in real-time environments on a taxonomy close to real-world data. Created by White Circle

What It Measures

  • Harmful content detection across 17 critical risk categories
  • Jailbreak resistance using adversarial prompt variations
  • False positive rate on safe, neutral inputs
  • Runtime performance under realistic constraints
  • Integral score combining accuracy and speed for real-world readiness

Key Features

  • Standardized evaluation of multiple LLMs and guard models
  • Support for major inference engines (openai_api, vllm, sglang, transformers)
  • Custom taxonomy aligned with real-world abuse cases and moderation APIs (OpenAI, Google, etc.)
  • Integral scoring system that penalizes unsafe outputs and slow response times
  • Leaderboard generation with per-category and macro-averaged scores

Why It Matters

Most benchmarks focus on accuracy alone, ignoring critical production requirements like speed and jailbreak resistance. CircleGuardBench is the first benchmark to integrate accuracy, attack robustness, and latency into a single, practical evaluation framework โ€” helping safety teams choose guard models that actually work in production.

Installation

# Clone the repository
git clone https://github.com/whitecircle-ai/circle-guard-bench.git
cd circle-guard-bench

# Installation using Poetry (basic installation)
poetry install

# Installation with additional inference engines
poetry install --extras "vllm sglang transformers"

# Or only with vllm
poetry install --extras "vllm"

# Or only with sglang
poetry install --extras "sglang"

# Or only with transformers
poetry install --extras "transformers"

# Installation using pip (or uv)
pip install -e .

# Installation with additional inference engines using pip
pip install -e ".[vllm,sglang,transformers]"

Quick Start

# Run evaluation on a specific model
guardbench run "gpt-4o-mini (CoT)"

# Run evaluation on all configured models
guardbench run --all

# View the leaderboard
guardbench leaderboard

# Display available models
guardbench models

# Show dataset information
guardbench dataset_info

Check out full example on colab: Open In Colab

Configuration

Models Configuration

Models are configured in configs/models.json. Example:

[
  {
    "name": "gpt-4o-mini (CoT)",
    "type": "llm_regexp",
    "eval_on": "all",
    "inference_engine": "openai_api",
    "params": {
      "api_model_name": "openai/gpt-4o-mini",
      "endpoint": "https://openrouter.ai/api/v1/"
    },
    "max_concurrency": 20,
    "use_cot": true
  }
]

Prompt Templates

Prompt templates are stored in the prompts directory as Jinja2 templates.

Check these templates for example of prompting with reasoning:

  • cot_prompt_eval_regexp.jinja
  • cot_answer_eval_regexp.jinja

Supported Inference Engines

CircleGuardBench supports several inference engines:

  • openai_api: For OpenAI models and compatible APIs (requires the openai package)
  • vllm: For local model execution using vLLM (requires the optional dependency vllm)
  • sglang: For running models locally using SGLang (requires the optional dependency sglang)
  • transformers: For inferenceing classifiers, other models (like instruct llms) are not supported. (requires the optional dependency transformers)

To use the vllm or sglang engines, you need to install the corresponding optional dependencies.

Dataset Format

CircleGuardBench uses a standardized dataset format. Below are the required columns and their descriptions:

Column NameDescriptionRequired
harm_categoryCategory of potential harm associated with the prompt.Only if prompt_verdict is 'unsafe'
promptThe original input prompt.Always
prompt_verdictThe safety verdict ('safe' or 'unsafe') for the original prompt.Always
prompt_hashA unique identifier hash for the prompt.Always
default_answerThe model's response to the original prompt.Always
default_answer_verdictThe safety verdict ('safe' or 'unsafe') for the default_answer.Always
jailbreaked_promptA modified version of the prompt designed to bypass safety filters.Only if prompt_verdict is 'unsafe'
jailbreaked_answerThe model's response to the jailbreaked_prompt.Only if prompt_verdict is 'unsafe'

CLI Commands

CircleGuardBench provides a powerful command-line interface for managing the evaluation process:

  • guardbench run [MODEL_NAME]: Run evaluation for a specific model
  • guardbench run --all: Run evaluation for all configured models
  • guardbench run --force: Force re-evaluation even for already evaluated models
  • guardbench run --model-type TYPE: Specify model type when using cached results
  • guardbench leaderboard: Display evaluation results in a leaderboard format
  • guardbench models: List all configured models with detailed information
  • guardbench prompts: List available prompt templates
  • guardbench dataset_info: Display information about the loaded dataset

Advanced Usage

Sorting Leaderboard Results

# Sort by accuracy score
guardbench leaderboard --sort-by accuracy

# Sort by recall
guardbench leaderboard --sort-by recall

# Sort by average runtime
guardbench leaderboard --sort-by avg_runtime_ms

Filtering Metrics

# Show only metrics for default prompts
guardbench leaderboard --metric-type default_prompts

# Show only metrics for jailbreak attempts
guardbench leaderboard --metric-type jailbreaked_prompts

Grouping by Categories

# Group results by categories
guardbench leaderboard --use-categories

# Disable category grouping
guardbench leaderboard --no-categories

Leaderboard

The CircleGuardBench leaderboard provides a detailed overview of the performance of various models. It includes:

  • A summary table with macro-average metrics across all metric types
  • Detailed tables for each metric type (default prompts, jailbreak prompts, etc.)
  • Option to group by harm categories
  • Sorting by various metrics (F1, recall, precision, etc.)

Example of leaderboard without categories

CircleGuardBench Leaderboard Example

Dataset and Categories

The dataset used for the benchmark (public part) is located on Huggingface, to access it you will need to accept the terms of our license for use.

CircleGuardBench covers 17 categories of harmful content:

CircleGuardBench Categories

Descriptions of categories

#Category NameDescription
1Violence & TerrorismInstructions for violent crimes, sabotage, or evading law enforcement.
2Deception & MisinformationManipulation, fake news, data falsification, or emotional influence.
3Cybercrime & HackingPrompts about malware, social engineering, SQL injection, or exploits.
4Drugs & SubstancesContent on illegal drugs, trafficking, or making controlled substances.
5Animal AbuseGuides promoting animal cruelty, fighting, or exploitation.
6Financial FraudTax evasion, insider trading, scams, or unethical business practices.
7Hate & ExtremismRacism, sexism, anti-LGBTQ+ content, or calls for hate crimes.
8Corruption & LoopholesBribery, public policy manipulation, or legal exploitation.
9Illicit Creative ContentRoleplays or stories focused on criminal or unethical themes.
10Academic CheatingPrompts about exam cheating, plagiarism, or faking schoolwork.
11Environmental HarmRequests to pollute, sabotage infrastructure, or harm ecosystems.
12Weapons & ExplosivesInstructions to build weapons or hazardous materials.
13Child AbuseAny content related to harming or exploiting minors.
14Sexual ViolenceNon-consensual, explicit, or violent sexual requests.
15Human TraffickingLabor exploitation and trafficking schemes.
16AI JailbreakingPrompts trying to bypass model safeguards or filter evasion.
17Self-Harm & SuicidePromoting or detailing methods of self-harm or suicidal ideation.

About the Authors

CircleGuardBench is developed by White Circle, a company focused on AI safety and responsible AI deployment. White Circle builds tools, benchmarks, and evaluation methods to help developers and researchers create safer, more reliable large language models.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache 2.0.