Hangman Benchmark

September 19, 2025

A benchmark for testing AI models' ability to play the classic game of Hangman. It uses the Inspect framework to run and score the games.

This eval was built to demonstrate how models can play games through tool calls within the Inspect framework.

Installation

Using pip

```
pip install hangman-bench
```

Using uv (recommended)

```
uv add hangman-bench
```

Usage

```
# To run the benchmark with default settings
inspect eval hangman_bench/hangman

# To limit the number of samples to 10
inspect eval hangman_bench/hangman --limit=10

# To evaluate against a specific model
inspect eval hangman_bench/hangman --model openai/gpt-4o-mini

# To allow the model to guess the whole word before all letters are revealed
inspect eval hangman_bench/hangman -T allow_word_guesses=True
```
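
The flags above can be combined in a single command. As a rough sketch, the Python helper below assembles such commands, for example one run per difficulty level. `build_eval_command` is a hypothetical helper written for this illustration and is not part of hangman-bench or Inspect; it only uses the `--model`, `--limit`, and `-T` options shown above, with parameter names taken from the Task Parameters list.

```python
import shlex


def build_eval_command(model=None, limit=None, **task_params):
    """Return the argv list for an `inspect eval` run of the hangman task.

    Hypothetical helper for illustration; not part of hangman-bench.
    """
    argv = ["inspect", "eval", "hangman_bench/hangman"]
    if model is not None:
        argv += ["--model", model]
    if limit is not None:
        argv.append(f"--limit={limit}")
    # Each task parameter is passed as its own -T name=value pair.
    for name, value in task_params.items():
        argv += ["-T", f"{name}={value}"]
    return argv


# One command per difficulty level (1-5), 10 samples each.
commands = [
    shlex.join(
        build_eval_command(model="openai/gpt-4o-mini", limit=10, difficulty=level)
    )
    for level in range(1, 6)
]

for command in commands:
    print(command)
```

The printed lines can be pasted into a shell, or the argv lists can be handed to `subprocess.run` directly.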

Task Parameters

language: The language to use for the words (default: "english")
difficulty: Specific difficulty level to use (1-5), or None for mixed difficulties (default: None)
max_guesses: Maximum number of incorrect guesses allowed (default: 10)
shuffle: Whether to shuffle the words before playing (default: True)
allow_word_guesses: Whether to allow the model to guess the word before guessing all letters (default: False)
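
To make the effect of `max_guesses` and `allow_word_guesses` concrete, here is a minimal sketch of Hangman game state in plain Python. It is not the benchmark's implementation: the `HangmanGame` class and its methods are hypothetical, and it assumes that repeated letters carry no penalty and that a wrong whole-word guess counts as one incorrect guess.

```python
from dataclasses import dataclass, field


@dataclass
class HangmanGame:
    """Illustrative game state only; not the code used by hangman-bench."""

    word: str                         # secret word, assumed lowercase
    max_guesses: int = 10             # incorrect guesses allowed before losing
    allow_word_guesses: bool = False  # whether whole-word guesses are permitted
    guessed: set = field(default_factory=set)
    wrong: int = 0
    solved: bool = False

    def guess_letter(self, letter: str) -> bool:
        """Record a letter guess and return True if it is in the word."""
        letter = letter.lower()
        if letter in self.guessed:
            # Assumption: a repeated guess is not penalised again.
            return letter in self.word
        self.guessed.add(letter)
        if letter not in self.word:
            self.wrong += 1
            return False
        self.solved = all(c in self.guessed for c in self.word)
        return True

    def guess_word(self, candidate: str) -> bool:
        """Guess the whole word; only valid when allow_word_guesses is set."""
        if not self.allow_word_guesses:
            raise ValueError("word guesses are disabled")
        if candidate.lower() == self.word:
            self.solved = True
            return True
        # Assumption: a wrong word guess costs one incorrect guess.
        self.wrong += 1
        return False

    @property
    def lost(self) -> bool:
        return not self.solved and self.wrong >= self.max_guesses

    def masked(self) -> str:
        """Show revealed letters, with underscores for hidden ones."""
        return " ".join(
            c if self.solved or c in self.guessed else "_" for c in self.word
        )


game = HangmanGame("python", max_guesses=2)
game.guess_letter("p")  # correct
game.guess_letter("z")  # incorrect, uses one of the two allowed misses
print(game.masked(), game.wrong)  # p _ _ _ _ _ 1
```

With `max_guesses=2`, one more incorrect letter would end this game as a loss.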

Development

Using uv (recommended)

Clone the repository
Install with development dependencies:
```
uv sync --dev
```
Run tests:
```
uv run pytest
```

Run evaluations:

```
uv run inspect eval hangman_bench/hangman
```

Using pip

Clone the repository
Install with development dependencies:
```
pip install -e ".[dev]"
```
Run tests:
```
pytest
```

License

MIT