Joke Evaluation Benchmark
February 1, 2026
A simple benchmark for evaluating language models' ability to distinguish between good and bad jokes, based on Reddit voting data.
Overview
This project tests whether language models can identify which of two jokes is funnier, using a dataset of 195,000+ Reddit jokes with community scores. Models are presented with pairs of jokes (one highly upvoted, one poorly rated) and must choose the funnier one.
How It Works
Core Concept
- Dataset: 195,000+ jokes from Reddit, sorted by score (upvotes - downvotes)
- Test pairs: For each trial, select:
  - One joke from the bottom offset positions (lowest scores)
  - One joke from the top offset positions (highest scores)
- Shuffling: Joke positions (A/B) are randomized to avoid positional bias
- Evaluation: Model must output "A" or "B" indicating which joke is funnier
- Scoring: Percentage of correct choices over the trials run (the number of trials is variable)
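The selection and scoring steps above can be sketched roughly as follows. This is an illustration, not the code in joke_eval.py: the names `select_pair` and `accuracy` are hypothetical, and the sketch assumes that "bottom/top offset positions" means a random draw from the lowest and highest `offset` entries of the sorted list. The actual script may instead pick the joke at exactly that position.

```python
import random

def select_pair(jokes_sorted, offset, rng=random):
    """Pick one low-scoring and one high-scoring joke (hypothetical helper).

    jokes_sorted: jokes ordered by ascending score.
    offset: how many positions at each end of the list to draw from.
    """
    if not 1 <= offset <= len(jokes_sorted) // 2:
        raise ValueError("offset must be between 1 and half the dataset size")
    low = rng.choice(jokes_sorted[:offset])    # among the lowest scores
    high = rng.choice(jokes_sorted[-offset:])  # among the highest scores
    return low, high

def accuracy(results):
    """Percentage of correct choices; results is a list of booleans."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)
```

Under this reading, a small `offset` compares only the extremes of the dataset, while a larger `offset` lets less extreme jokes into the pool.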
The Benchmark
Models are tested on their ability to:
- Understand humor (setup + punchline structure)
- Align with human preferences (Reddit voting patterns)
- Follow simple instructions (output only "A" or "B")
- Ensure llcat is available: pipx install llcat
Run a single test:
# Get the correct answer and prompt for offset 1
./joke_eval.py 1
Prompt Format
Which joke is funnier? Say only "A" or "B". Do not be conversational.
<Joke A><setup>{setup}</setup>
<punchline>{punchline}</punchline></Joke A>
<Joke B><setup>{setup}</setup>
<punchline>{punchline}</punchline>
</Joke B>
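As a minimal sketch, the template above could be filled in like this. The function name `build_prompt` and the dictionary keys `"setup"` and `"punchline"` are assumptions made for illustration; the dataset's real field names and the script's own helper may differ.

```python
# Template copied from the prompt format above, with distinct
# placeholder names for the two jokes.
PROMPT_TEMPLATE = (
    'Which joke is funnier? Say only "A" or "B". Do not be conversational.\n'
    "<Joke A><setup>{a_setup}</setup>\n"
    "<punchline>{a_punchline}</punchline></Joke A>\n"
    "<Joke B><setup>{b_setup}</setup>\n"
    "<punchline>{b_punchline}</punchline>\n"
    "</Joke B>"
)

def build_prompt(joke_a, joke_b):
    """Render the comparison prompt for two jokes (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(
        a_setup=joke_a["setup"],
        a_punchline=joke_a["punchline"],
        b_setup=joke_b["setup"],
        b_punchline=joke_b["punchline"],
    )
```

Because `str.format` does not re-parse substituted values, braces inside a joke's text pass through unchanged.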
Response Parsing
The benchmark extracts the first "A" or "B" from the model's response, ignoring all other text and whitespace.
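One literal reading of this rule is sketched below. The name `parse_choice` is hypothetical, and the sketch assumes the match is case-sensitive and applies to single characters; the real script may be stricter or more lenient.

```python
import re

def parse_choice(response):
    """Return the first uppercase 'A' or 'B' in the response, or None."""
    match = re.search(r"[AB]", response)
    return match.group(0) if match else None
```

Under this reading, a reply such as "Answer: B" would be parsed as "A", because the capital letter in "Answer" comes first. That is one reason the prompt asks the model to say only "A" or "B".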
Randomization
Each test uses random.shuffle() to randomize which joke appears as A vs B, preventing positional bias in model responses.
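A minimal sketch of this step, which also records which label ends up being correct, might look like the following. The helper name `shuffle_pair` and its return shape are assumptions for illustration only.

```python
import random

def shuffle_pair(low, high, rng=random):
    """Randomize A/B order and report which label holds the higher-scored joke."""
    pair = [low, high]
    rng.shuffle(pair)  # in-place shuffle, so either joke can land in slot A
    correct = "A" if pair[0] is high else "B"
    return pair[0], pair[1], correct
```

Tracking the correct label at shuffle time keeps the grading step a simple comparison between the model's answer and `correct`.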
License
MIT License - see LICENSE file for details.
Acknowledgments
- Reddit joke dataset from taivop/joke-dataset
- Inspired by the need for simple, interpretable LLM benchmarks
- Thanks to the Ollama team for easy local model serving