User Guide

April 8, 2026

This document describes the interface exposed by lm-eval and the flags available to users.

Command-line Interface

The lm-eval CLI is organized into subcommands:

Command	Description
`lm-eval run`	Run evaluations on language models
`lm-eval ls`	List available tasks, groups, subtasks, or tags
`lm-eval validate`	Validate task configurations

Run the library via the `lm-eval` entrypoint or `python -m lm_eval`.

Use `-h` or `--help` to see available options:

lm-eval -h              # Show all subcommands
lm-eval run -h          # Show options for run command
lm-eval ls -h           # Show options for ls command

Legacy Compatibility: The original single-command interface still works. When no subcommand is given, `run` is inserted automatically, so `lm-eval --model hf --tasks hellaswag` is treated as `lm-eval run --model hf --tasks hellaswag`.
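The legacy fallback can be pictured as a small rewrite of the argument list before it is parsed. The sketch below is an assumption about how such a shim could work, not code taken from lm-eval; `SUBCOMMANDS`, `HELP_FLAGS`, and `insert_default_subcommand` are hypothetical names.

```python
# Hypothetical sketch: not taken from the lm-eval source.
SUBCOMMANDS = {"run", "ls", "validate"}
HELP_FLAGS = {"-h", "--help"}


def insert_default_subcommand(argv):
    """Return a copy of argv with "run" prepended when no subcommand is given."""
    if not argv:
        return list(argv)      # bare `lm-eval`: leave untouched
    first = argv[0]
    if first in SUBCOMMANDS or first in HELP_FLAGS:
        return list(argv)      # subcommand already explicit, or top-level help
    return ["run", *argv]      # legacy style: flags only


legacy = ["--model", "hf", "--tasks", "hellaswag"]
print(insert_default_subcommand(legacy))
```

Rewriting the argument list up front keeps the subcommand parsers unaware of the legacy form.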

Quick Start

# List available tasks
lm-eval ls tasks

# Basic evaluation
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag

# With few-shot examples
lm-eval run --model hf --model_args pretrained=gpt2 --tasks arc_easy --num_fewshot 5

# Save results and model outputs
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag --output_path ./results/ --log_samples

# Use a config file
lm-eval run --config eval_config.yaml

`lm-eval run`

Run evaluations on language models.

lm-eval run --model <model> --tasks <task> [options]

Quick Examples

# Basic evaluation with HuggingFace model
lm-eval run --model hf --model_args pretrained=gpt2 dtype=float32 --tasks hellaswag

# Multiple tasks with few-shot examples
lm-eval run --model vllm --model_args pretrained=EleutherAI/gpt-j-6B --tasks arc_easy arc_challenge --num_fewshot 5

# Custom generation parameters
lm-eval run --model hf --model_args pretrained=gpt2 --tasks lambada --gen_kwargs temperature=0.8 top_p=0.95

# Use a YAML configuration file
lm-eval run --config my_config.yaml --tasks mmlu   # --tasks overrides the value in the file

Model and Tasks

Argument	Short	Description
`--model`	`-M`	Model type/provider name (default: `hf`). See supported models.
`--model_args`	`-a`	Model constructor arguments as `key=val key2=val2` or `key=val,key2=val2`. For HuggingFace models, see `HFLM` for available arguments.
`--tasks`	`-t`	Space or comma-separated list of task names or groups. Use `lm-eval ls tasks` to see available tasks.
`--apply_chat_template`		Apply chat template to prompts. Use without argument for default template, or specify template name.
`--limit`	`-L`	Limit examples per task. Integer for a count, float (0.0-1.0) for a fraction of the examples. For testing only.
`--use_cache`	`-c`	Path prefix for SQLite cache of model responses (e.g., `/path/to/cache_`).
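The two forms of `--limit` can be read as follows. This is a sketch of the semantics described in the table, under the assumption that a fractional limit is rounded up to a whole example; `resolve_limit` is a hypothetical helper, not part of lm-eval's API.

```python
import math


def resolve_limit(limit, num_docs):
    """Translate a --limit value into a number of examples (assumed semantics)."""
    if limit is None:
        return num_docs                      # no limit: use every example
    if isinstance(limit, float):
        if not 0.0 <= limit <= 1.0:
            raise ValueError("a float limit must lie in [0.0, 1.0]")
        return math.ceil(num_docs * limit)   # fraction of the task, rounded up
    return min(int(limit), num_docs)         # absolute count, capped at task size


print(resolve_limit(10, 1000))     # integer form: a fixed number of examples
print(resolve_limit(0.25, 1000))   # float form: a quarter of the examples
```

Because a limited run covers only part of each task, its scores are useful for smoke tests but not for reporting.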

Evaluation Settings

Argument	Short	Description
`--num_fewshot`	`-f`	Number of few-shot examples in context.
`--batch_size`	`-b`	Batch size: integer, `auto`, or `auto:N` to auto-tune N times (default: 1).
`--max_batch_size`		Maximum batch size when using `--batch_size auto`.
`--device`		Device to use: `cuda`, `cuda:0`, `cpu`, `mps` (default: `cuda`).
`--gen_kwargs`		Generation arguments as `key=val key2=val2`. Values parsed with `ast.literal_eval`. Example: `temperature=0.8 'stop=["\n\n"]'`
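To see why the `stop` example needs shell quoting, it helps to look at how `key=val` items could be turned into typed values with `ast.literal_eval`. The sketch below is illustrative only: `parse_key_value_args` is a hypothetical helper, and falling back to the raw string when a value is not a Python literal is an assumption rather than documented behavior.

```python
import ast


def parse_key_value_args(items):
    """Parse ["key=val", ...] into a dict, evaluating values as Python literals."""
    parsed = {}
    for item in items:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=val, got {item!r}")
        try:
            parsed[key] = ast.literal_eval(raw)   # 0.8 becomes a float, '["a"]' a list
        except (ValueError, SyntaxError):
            parsed[key] = raw                     # assumption: keep non-literals as strings
    return parsed


# What the shell passes for: --gen_kwargs temperature=0.8 'stop=["\n\n"]'
argv_values = ["temperature=0.8", r'stop=["\n\n"]']
print(parse_key_value_args(argv_values))
```

The single quotes in the shell keep the brackets, double quotes, and backslashes intact, so the literal parser sees `["\n\n"]` and produces a list containing two newline characters.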

Data and Output

Argument	Short	Description
`--output_path`	`-o`	Output directory or JSON file for results. Required with `--log_samples`.
`--log_samples`	`-s`	Save all model inputs/outputs for post-hoc analysis.
`--samples`	`-E`	JSON mapping task names to sample indices, e.g., `'{"task1": [0,1,2]}'`. Incompatible with `--limit`.
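When the sample indices come from a script, building the `--samples` JSON programmatically avoids quoting mistakes. The following sketch uses only the standard library; the task names and indices are hypothetical placeholders.

```python
import json
import shlex

# Hypothetical task names and sample indices, for illustration only.
selected = {"hellaswag": [0, 1, 2], "arc_easy": [5, 17]}

samples_json = json.dumps(selected, separators=(",", ":"))
command = [
    "lm-eval", "run",
    "--model", "hf",
    "--model_args", "pretrained=gpt2",
    "--tasks", "hellaswag,arc_easy",
    "--samples", samples_json,
]

# shlex.join quotes the JSON so a POSIX shell passes it as a single argument.
print(shlex.join(command))
```

The printed line can be pasted into a POSIX shell, or the `command` list can be handed directly to `subprocess.run` without any quoting.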

Caching and Performance

Argument	Description
`--cache_requests`	Cache preprocessed prompts: `true`, `refresh`, or `delete`. Cached files stored in `lm_eval/cache/.cache` or path set by `LM_HARNESS_CACHE_PATH` env var.
`--check_integrity`	Run task test suite validation before evaluation.

Prompt Formatting

Argument	Description
`--system_instruction`	Custom system instruction prepended to prompts.
`--fewshot_as_multiturn`	Format few-shot examples as multi-turn conversation. Auto-enabled with `--apply_chat_template`. Set to `false` to disable.

Task Management

Argument	Description
`--include_path`	Additional directory containing external task YAML files.

Logging and Tracking

Argument	Short	Description
`--verbosity`	`-v`	(Deprecated) Use `LMEVAL_LOG_LEVEL` env var instead.
`--write_out`	`-w`	Print prompts for first few documents (for debugging).
`--show_config`		Display full task configuration after evaluation.
`--wandb_args`		Weights & Biases arguments as `key=val`. E.g., `project=my-project name=run-1`.
`--wandb_config_args`		Additional W&B config arguments.
`--hf_hub_log_args`		HuggingFace Hub logging arguments. See HF Hub Logging.

Advanced Options

Argument	Short	Description
`--predict_only`	`-x`	Save predictions only, skip metric computation. Implies `--log_samples`.
`--seed`		Random seeds as single integer or comma-separated list for `python,numpy,torch,fewshot`. Default: `0,1234,1234,1234`. Use `None` to skip. Example: `--seed 42` or `--seed 0,None,8,52`.
`--trust_remote_code`		Allow executing remote code from HuggingFace Hub.
`--confirm_run_unsafe_code`		Confirm understanding of risks for tasks executing arbitrary Python.
`--metadata`		JSON string passed to TaskConfig. Required for some tasks like RULER. Example: `--metadata '{"max_seq_length": 4096}'`.
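The `--seed` formats can be illustrated with a small parser. This is a simplified sketch, not lm-eval's implementation: `parse_seed` is a hypothetical helper, it assumes a single value is applied to all four generators, and it rejects lists of any other length.

```python
# Order and defaults as documented for --seed: python, numpy, torch, fewshot.
DEFAULT_SEEDS = [0, 1234, 1234, 1234]


def parse_seed(value):
    """Expand a --seed string into [python, numpy, torch, fewshot] seeds."""
    parts = [part.strip() for part in value.split(",")]
    seeds = [None if part.lower() == "none" else int(part) for part in parts]
    if len(seeds) == 1:
        return seeds * len(DEFAULT_SEEDS)   # assumption: one value seeds everything
    if len(seeds) != len(DEFAULT_SEEDS):
        raise ValueError("expected 1 or 4 comma-separated seeds")
    return seeds                            # None entries leave that generator unseeded


print(parse_seed("42"))
print(parse_seed("0,None,8,52"))
```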

Evaluating Thinking/Reasoning Models

Models like Qwen3 or DeepSeek-R1 can produce a chain-of-thought reasoning trace before their final answer. To strip this thinking trace before metrics are computed, use the `think_end_token` and `enable_thinking` model arguments (passed via `--model_args`).

`enable_thinking`: Activates thinking mode in the chat template (passed as a kwarg to `apply_chat_template`).
`think_end_token`: The delimiter marking the end of the thinking section. Everything up to and including the last occurrence of this token is discarded from the output. This option is required when using `enable_thinking=True`.
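The stripping rule for a string delimiter can be sketched in a few lines. `strip_thinking` is a hypothetical helper written for illustration; returning the text unchanged when the delimiter is absent, and trimming leading whitespace from the answer, are assumptions rather than documented behavior.

```python
def strip_thinking(output, think_end_token="</think>"):
    """Drop everything up to and including the last think_end_token."""
    _before, found, after = output.rpartition(think_end_token)
    if not found:
        return output        # assumption: no delimiter means nothing to strip
    return after.lstrip()    # assumption: leading whitespace is trimmed as well


raw = "<think>2 + 2 is 4, double-check... yes.</think>\n\nThe answer is 4."
print(strip_thinking(raw))
```

Splitting on the last occurrence matters when the reasoning trace itself mentions the delimiter: only the text after the final delimiter reaches the metric.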

With the `vllm` or `sglang` backends, `think_end_token` must be a string (e.g. `</think>`):

lm-eval run --model vllm \
  --model_args pretrained=Qwen/Qwen3-32B,enable_thinking=True,think_end_token="</think>" \
  --tasks gsm8k --apply_chat_template

With the `hf` backend, `think_end_token` can be either a string or a token ID (integer). Using the token ID avoids edge cases where the token string appears in normal text:

lm-eval run --model hf \
  --model_args pretrained=Qwen/Qwen3-32B,enable_thinking=True,think_end_token=151668 \
  --tasks gsm8k --apply_chat_template

The correct `think_end_token` for a given model can be found in its `tokenizer_config.json` (look for the token closing the thinking block in the chat template). For example, see Qwen3-32B's `tokenizer_config.json`.

Note: `enable_thinking=True` is only compatible with generative tasks. It cannot be used with loglikelihood-based tasks.

Configuration File

Argument	Short	Description
`--config`	`-C`	Path to YAML configuration file. CLI arguments override config file values. See Configuration Files.
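The override rule can be thought of as a dictionary merge in which only the flags actually given on the command line replace values from the file. The sketch below illustrates that idea and is not lm-eval's implementation; `merge_config` is a hypothetical helper, and the config keys are assumed to mirror the CLI flag names.

```python
def merge_config(file_values, cli_values):
    """Overlay CLI values on config-file values; None means 'not given on the CLI'."""
    merged = dict(file_values)
    for key, value in cli_values.items():
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical contents of my_config.yaml after loading (keys are assumptions).
from_file = {"model": "hf", "tasks": ["hellaswag"], "num_fewshot": 5}
# Equivalent of: lm-eval run --config my_config.yaml --tasks mmlu
from_cli = {"model": None, "tasks": ["mmlu"], "num_fewshot": None}

print(merge_config(from_file, from_cli))
```

Here `tasks` comes from the command line, while `model` and `num_fewshot` keep the values from the file.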

HuggingFace Hub Logging

The `--hf_hub_log_args` argument accepts these keys:

Key	Description
`hub_results_org`	Organization name on HF Hub. Defaults to token owner.
`details_repo_name`	Repository name for detailed results.
`results_repo_name`	Repository name for aggregated results.
`push_results_to_hub`	`True`/`False` - push results to Hub.
`push_samples_to_hub`	`True`/`False` - push samples to Hub. Requires `--log_samples`.
`public_repo`	`True`/`False` - make repository public.
`leaderboard_url`	URL to associated leaderboard.
`point_of_contact`	Contact email for results dataset.
`gated`	`True`/`False` - gate the details dataset.
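A script that launches evaluations can assemble the value for `--hf_hub_log_args` from a dictionary. The sketch assumes the argument takes the same `key=val,key2=val2` form as `--model_args`; `format_hub_args` is a hypothetical helper, and the organization and repository names are placeholders.

```python
def format_hub_args(options, log_samples=False):
    """Join Hub logging options into one key=val,key2=val2 string."""
    if options.get("push_samples_to_hub") and not log_samples:
        # Mirrors the table: pushing samples requires --log_samples.
        raise ValueError("push_samples_to_hub requires --log_samples")
    return ",".join(f"{key}={value}" for key, value in options.items())


hub_options = {
    "hub_results_org": "my-org",               # hypothetical organization
    "results_repo_name": "lm-eval-results",    # hypothetical repository
    "push_results_to_hub": True,
    "public_repo": False,
}
print(format_hub_args(hub_options))
```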

`lm-eval ls`

List available tasks, groups, subtasks, or tags.

lm-eval ls [tasks|groups|subtasks|tags] [--include_path DIR]

Arguments

Argument	Description
`tasks`	List all available tasks (groups, subtasks, and tags).
`groups`	List only task groups (e.g., `mmlu`, `glue`, `superglue`).
`subtasks`	List only individual subtasks (e.g., `mmlu_anatomy`, `hellaswag`).
`tags`	List task tags (e.g., `reasoning`, `knowledge`).
`--include_path`	Additional directory for external task definitions.

Task Organization

Groups: Collections of related tasks with aggregated metrics across subtasks (e.g., mmlu contains 57 subtasks)
Subtasks: Individual evaluation tasks (e.g., mmlu_anatomy, hellaswag)
Tags: Categories for filtering tasks without aggregated metrics (e.g., reasoning, language)

Examples

# List all tasks
lm-eval ls tasks

# List only task groups
lm-eval ls groups

# Include external tasks
lm-eval ls tasks --include_path /path/to/external/tasks

`lm-eval validate`

Validate task configurations before running evaluations.

lm-eval validate --tasks <task1,task2> [--include_path DIR]

Arguments

Argument	Short	Description
`--tasks`	`-t`	(Required) Comma-separated list of task names to validate.
`--include_path`		Additional directory for external task definitions.

Validation Checks

The `validate` command performs the following checks:

Task existence: Verifies all specified tasks are available
Configuration syntax: Checks YAML/JSON configuration files
Dataset access: Validates dataset paths and configurations
Required fields: Ensures all mandatory task parameters are present
Metric definitions: Verifies metric functions and aggregation methods
Filter pipelines: Validates filter chains and their parameters
Template rendering: Tests prompt templates with sample data

Examples

# Validate a single task
lm-eval validate --tasks hellaswag

# Validate multiple tasks
lm-eval validate --tasks arc_easy,arc_challenge,hellaswag

# Validate a task group
lm-eval validate --tasks mmlu

# Validate external tasks
lm-eval validate --tasks my_custom_task --include_path ./custom_tasks

Python API

For programmatic usage, see the Python API Guide.

Environment Variables

Variable	Description
`LMEVAL_LOG_LEVEL`	Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`).
`LM_HARNESS_CACHE_PATH`	Path for cached requests (default: `lm_eval/cache/.cache`).
`HF_TOKEN`	HuggingFace Hub token for private datasets/models.
`TOKENIZERS_PARALLELISM`	Set to `false` to avoid tokenizer warnings (auto-set by CLI).
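When lm-eval is launched from a script, these variables can be set on a copy of the environment rather than globally. The sketch below only builds the environment mapping; `build_eval_env` is a hypothetical helper and the cache path is a placeholder.

```python
import os

LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}  # levels listed in the table


def build_eval_env(log_level="INFO", cache_path=None, base_env=None):
    """Return a copy of the environment with lm-eval variables applied."""
    if log_level not in LOG_LEVELS:
        raise ValueError(f"unsupported log level: {log_level!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["LMEVAL_LOG_LEVEL"] = log_level
    if cache_path is not None:
        env["LM_HARNESS_CACHE_PATH"] = cache_path
    return env


env = build_eval_env(log_level="DEBUG", cache_path="/tmp/lm_eval_cache", base_env={})
print(env)
```

The resulting mapping can be passed as `env=` to `subprocess.run` so the settings apply to that one invocation only.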