about-evaluation.md

September 16, 2025

Knowledge

Two cool overviews on the challenges of automatic evaluation!

- Challenges in LM evaluation, a presentation by Hailey Schoelkopf and Lintang Sutawika.
- Lessons from the trenches on Reproducible Evaluation of LMs, a paper by EleutherAI.

Two podcasts by Latent Space on evaluation:

- Benchmarks 101, on the history of automatic benchmarks and their well-known issues
- Benchmarks 201, on which evaluation method to use when, plus some tidbits about the Leaderboard with yours truly!

Cool summaries and feedback from experience:

- lm_eval, by Eleuther (also known as "the Harness"). The powerhouse of LLM evaluations, allowing you to evaluate any LLM from many providers on a range of benchmarks, in a stable and reproducible way.
- lighteval, by Hugging Face (disclaimer: I'm one of the authors). A light LLM evaluation suite, focused on customization and recent benchmarks.

- Open LLM Leaderboard, by Hugging Face. Neutral third-party evaluation of open LLMs on reference static benchmarks (open to submissions).
- HELM, by Stanford. Also evaluates models on static benchmarks, but uses win-rates to rank models.
- Chatbot Arena, by LMSys. Arena using crowdsourced human evaluation to score 150 LLMs.
- LLM Performance Leaderboard, by Artificial Analysis. Performance benchmarks and pricing of the biggest LLM API providers, if you want to use an API instead of running things locally.
- All our blogs about evaluations and leaderboards
- Leaderboard finder: Find the most relevant leaderboard for your use case
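
If ranking by win-rate (as mentioned for HELM in the list above) sounds abstract, here is a minimal Python sketch of the general idea: for each benchmark, compute the fraction of other models a given model beats, then average that fraction over benchmarks. This is only an illustration under stated assumptions, not HELM's actual code. The function name `mean_win_rate`, the model names, and all scores are hypothetical, and ties are counted as losses for simplicity.

```python
def mean_win_rate(scores):
    """Average, over benchmarks, of the fraction of rival models beaten.

    scores: {model_name: {benchmark_name: score}}, higher is better.
    Assumption: a tie counts as a loss for both models.
    """
    models = list(scores)
    win_rates = {}
    for model in models:
        per_benchmark = []
        for bench, own_score in scores[model].items():
            # Only compare against rivals that were run on the same benchmark.
            rivals = [
                scores[other][bench]
                for other in models
                if other != model and bench in scores[other]
            ]
            if not rivals:
                continue
            wins = sum(own_score > rival for rival in rivals)
            per_benchmark.append(wins / len(rivals))
        win_rates[model] = (
            sum(per_benchmark) / len(per_benchmark) if per_benchmark else 0.0
        )
    return win_rates


# Hypothetical models and made-up scores, for illustration only.
scores = {
    "model_a": {"qa": 0.80, "math": 0.40, "code": 0.75},
    "model_b": {"qa": 0.70, "math": 0.60, "code": 0.50},
    "model_c": {"qa": 0.60, "math": 0.50, "code": 0.70},
}

rates = mean_win_rate(scores)
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)
```

One consequence of this kind of ranking: a model's position depends on which other models are in the pool, since adding or removing a competitor changes everyone's win-rate, so the numbers are relative rather than absolute.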

End-to-end custom domain evaluation tutorial: This tutorial guides you through building a custom evaluation task for your domain. It uses synthetic data and manual evaluation with Argilla and distilabel.