Hangman Benchmark
September 19, 2025
A benchmark for testing AI models' ability to play the classic game of Hangman. It uses the Inspect framework to evaluate the models.
This eval was built as a demonstration of how to enable models to play games using tools within the Inspect framework.
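To illustrate the idea of exposing a game to a model through a tool, here is a minimal sketch in plain Python. It is not the code used by this package and does not use the Inspect API: `make_guess_tool` and `guess_letter` are hypothetical names, and the wording of the returned messages is an assumption. The sketch only shows the general shape, a callable that holds the game state and returns a text observation after each guess.

```python
from typing import Callable


def make_guess_tool(secret: str, max_guesses: int = 10) -> Callable[[str], str]:
    """Return a tool-like function that accepts one letter per call.

    Hypothetical helper for illustration; not part of hangman-bench.
    """
    secret = secret.lower()
    guessed: set[str] = set()  # every letter tried so far, right or wrong
    wrong = 0  # number of incorrect guesses so far

    def masked() -> str:
        # Show correctly guessed letters and hide the rest with underscores.
        return " ".join(c if c in guessed else "_" for c in secret)

    def guess_letter(letter: str) -> str:
        nonlocal wrong
        letter = letter.lower()
        if len(letter) != 1 or not letter.isalpha():
            return "Invalid guess: provide a single letter."
        # Refuse further guesses once the game has been won or lost.
        if wrong >= max_guesses or all(c in guessed for c in secret):
            return f"Game over. Word: {masked()}"
        if letter in guessed:
            return f"Already guessed '{letter}'. Word: {masked()}"
        guessed.add(letter)
        if letter not in secret:
            wrong += 1
        if all(c in guessed for c in secret):
            return f"You won! Word: {masked()}"
        if wrong >= max_guesses:
            return f"You lost. The word was '{secret}'."
        return f"Word: {masked()} | wrong guesses left: {max_guesses - wrong}"

    return guess_letter
```

In a real Inspect task, a function like this would be registered as a tool so the model can call it once per turn. The closure keeps the state out of the model's reach, so the model only sees the text each call returns.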
Installation
Using pip
pip install hangman-bench
Using uv (recommended)
uv add hangman-bench
Usage
inspect eval hangman_bench/hangman
# To limit the number of samples to 10
inspect eval hangman_bench/hangman --limit=10
# To evaluate against a specific model
inspect eval hangman_bench/hangman --model openai/gpt-4o-mini
# To allow the model to guess the word before guessing all letters
inspect eval hangman_bench/hangman -T allow_word_guesses=True
Task Parameters
- language: The language to use for the words (default: "english")
- difficulty: Specific difficulty level to use (1-5), or None for mixed difficulties (default: None)
- max_guesses: Maximum number of incorrect guesses allowed (default: 10)
- shuffle: Whether to shuffle the words before playing (default: True)
- allow_word_guesses: Whether to allow the model to guess the word before guessing all letters (default: False)
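As a rough illustration of how the `language`, `difficulty`, and `shuffle` parameters could interact, the sketch below filters a word list and optionally shuffles it. This is an assumption about the behaviour, not the package's actual code: `WordEntry`, `select_words`, the `seed` argument, and the sample words are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WordEntry:
    # Hypothetical record type; the real dataset layout may differ.
    word: str
    language: str
    difficulty: int  # a level from 1 to 5


def select_words(
    entries: list[WordEntry],
    language: str = "english",
    difficulty: Optional[int] = None,
    shuffle: bool = True,
    seed: Optional[int] = None,  # illustrative extra, not a documented task parameter
) -> list[WordEntry]:
    """Filter by language and (optionally) difficulty, then shuffle if requested."""
    if difficulty is not None and not 1 <= difficulty <= 5:
        raise ValueError("difficulty must be between 1 and 5, or None")
    chosen = [
        e
        for e in entries
        if e.language == language
        and (difficulty is None or e.difficulty == difficulty)  # None keeps all levels
    ]
    if shuffle:
        # A local Random instance avoids touching the global random state.
        random.Random(seed).shuffle(chosen)
    return chosen


# Made-up sample data for demonstration only.
sample = [
    WordEntry("cat", "english", 1),
    WordEntry("rhythm", "english", 5),
    WordEntry("gato", "spanish", 1),
]
hard_only = select_words(sample, difficulty=5, shuffle=False)
```

Here `hard_only` contains only the "rhythm" entry, while leaving `difficulty` as None keeps both English words.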
Development
Using uv (recommended)
- Clone the repository
- Install with development dependencies:
uv sync --dev
- Run tests:
uv run pytest
- Run evaluations:
uv run inspect eval hangman_bench/hangman
Using pip
- Clone the repository
- Install with development dependencies:
pip install -e ".[dev]"
- Run tests:
pytest
License
MIT