User Guide

April 8, 2026

This document describes the interface exposed by lm-eval and the flags available to users.

Command-line Interface

The lm-eval CLI is organized into subcommands:

| Command | Description |
|---|---|
| lm-eval run | Run evaluations on language models |
| lm-eval ls | List available tasks, groups, subtasks, or tags |
| lm-eval validate | Validate task configurations |

Run the library via the lm-eval entrypoint or python -m lm_eval.

Use -h or --help to see available options:

lm-eval -h              # Show all subcommands
lm-eval run -h          # Show options for run command
lm-eval ls -h           # Show options for ls command

Legacy Compatibility: The original single-command interface still works. Running lm-eval --model hf --tasks hellaswag automatically inserts the run subcommand.
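
The legacy fallback can be pictured as a small rewrite of the argument list. The sketch below is illustrative only: `insert_default_subcommand`, `SUBCOMMANDS`, and `HELP_FLAGS` are hypothetical names, not lm-eval's actual implementation, and it assumes that any invocation whose first argument is neither a known subcommand nor a help flag is treated as a legacy call.

```python
from typing import List

# Hypothetical sketch; lm-eval's real dispatch logic may differ.
SUBCOMMANDS = {"run", "ls", "validate"}
HELP_FLAGS = {"-h", "--help"}


def insert_default_subcommand(argv: List[str]) -> List[str]:
    """Return argv with 'run' prepended when no subcommand is given."""
    if not argv:
        return list(argv)
    first = argv[0]
    if first in SUBCOMMANDS or first in HELP_FLAGS:
        return list(argv)
    # Legacy style: the first argument is a flag such as --model.
    return ["run", *argv]


legacy = ["--model", "hf", "--tasks", "hellaswag"]
print(insert_default_subcommand(legacy))
```

Under these assumptions, an explicit `lm-eval ls tasks` call passes through unchanged, while the legacy form gains a leading `run`.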


Quick Start

# List available tasks
lm-eval ls tasks

# Basic evaluation
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag

# With few-shot examples
lm-eval run --model hf --model_args pretrained=gpt2 --tasks arc_easy --num_fewshot 5

# Save results and model outputs
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag --output_path ./results/ --log_samples

# Use a config file
lm-eval run --config eval_config.yaml

lm-eval run

Run evaluations on language models.

lm-eval run --model <model> --tasks <task> [options]

Quick Examples

# Basic evaluation with HuggingFace model
lm-eval run --model hf --model_args pretrained=gpt2 dtype=float32 --tasks hellaswag

# Multiple tasks with few-shot examples
lm-eval run --model vllm --model_args pretrained=EleutherAI/gpt-j-6B --tasks arc_easy arc_challenge --num_fewshot 5

# Custom generation parameters
lm-eval run --model hf --model_args pretrained=gpt2 --tasks lambada --gen_kwargs temperature=0.8 top_p=0.95

# Use a YAML configuration file, overriding its task list on the command line
lm-eval run --config my_config.yaml --tasks mmlu

Model and Tasks

| Argument | Short | Description |
|---|---|---|
| --model | -M | Model type/provider name (default: hf). See supported models. |
| --model_args | -a | Model constructor arguments as key=val key2=val2 or key=val,key2=val2. For HuggingFace models, see HFLM for available arguments. |
| --tasks | -t | Space- or comma-separated list of task names or groups. Use lm-eval ls tasks to see available tasks. |
| --apply_chat_template | | Apply chat template to prompts. Use without argument for default template, or specify template name. |
| --limit | -L | Limit examples per task. Integer for count, float (0.0-1.0) for percentage. For testing only. |
| --use_cache | -c | Path prefix for SQLite cache of model responses (e.g., /path/to/cache_). |
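
To make the two accepted `--model_args` formats concrete, here is a minimal sketch of a parser that accepts both the space-separated and the comma-separated form. `parse_key_value_args` is a hypothetical helper written for this guide, not a function exported by lm-eval, and it assumes values contain neither commas nor whitespace.

```python
import re
from typing import Dict


def parse_key_value_args(raw: str) -> Dict[str, str]:
    """Split 'k=v k2=v2' or 'k=v,k2=v2' into a dict of strings."""
    parsed: Dict[str, str] = {}
    # Treat any run of commas and/or whitespace as a separator.
    for pair in re.split(r"[,\s]+", raw.strip()):
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        parsed[key] = value
    return parsed


print(parse_key_value_args("pretrained=gpt2 dtype=float32"))
print(parse_key_value_args("pretrained=gpt2,dtype=float32"))
```

Both calls yield the same mapping, which is why the two spellings are interchangeable on the command line.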

Evaluation Settings

| Argument | Short | Description |
|---|---|---|
| --num_fewshot | -f | Number of few-shot examples in context. |
| --batch_size | -b | Batch size: integer, auto, or auto:N to auto-tune N times (default: 1). |
| --max_batch_size | | Maximum batch size when using --batch_size auto. |
| --device | | Device to use: cuda, cuda:0, cpu, mps (default: cuda). |
| --gen_kwargs | | Generation arguments as key=val key2=val2. Values parsed with ast.literal_eval. Example: temperature=0.8 'stop=["\n\n"]' |
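
The effect of parsing `--gen_kwargs` values with `ast.literal_eval` is easiest to see in a small example. The sketch below is an approximation, not lm-eval's own parser: `parse_gen_kwargs` is a hypothetical helper, and the fallback to a plain string when a value is not a Python literal is an assumption.

```python
import ast
from typing import Any, Dict, List


def parse_gen_kwargs(tokens: List[str]) -> Dict[str, Any]:
    """Turn ['temperature=0.8', ...] into typed keyword arguments."""
    parsed: Dict[str, Any] = {}
    for token in tokens:
        key, sep, raw = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        try:
            # Numbers, booleans, lists, and quoted strings become Python values.
            parsed[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Assumption: anything else is kept as a plain string.
            parsed[key] = raw
    return parsed


# The raw string mirrors what the shell passes for 'stop=["\n\n"]'.
kwargs = parse_gen_kwargs(["temperature=0.8", r'stop=["\n\n"]'])
print(kwargs)
```

Note that the shell passes the backslash sequences through literally; it is `ast.literal_eval` that turns them into real newline characters inside the stop string.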

Data and Output

| Argument | Short | Description |
|---|---|---|
| --output_path | -o | Output directory or JSON file for results. Required with --log_samples. |
| --log_samples | -s | Save all model inputs/outputs for post-hoc analysis. |
| --samples | -E | JSON mapping task names to sample indices, e.g., '{"task1": [0,1,2]}'. Incompatible with --limit. |
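
As an illustration of the `--samples` format and its conflict with `--limit`, the following sketch validates a samples string before use. `parse_samples` is a hypothetical helper; the exact checks and error messages lm-eval performs are not shown in this guide, so the validation rules here are assumptions.

```python
import json
from typing import Dict, List, Optional, Union


def parse_samples(
    samples_json: str, limit: Optional[Union[int, float]] = None
) -> Dict[str, List[int]]:
    """Parse a --samples JSON string into {task_name: [indices]}."""
    if limit is not None:
        raise ValueError("--samples cannot be combined with --limit")
    mapping = json.loads(samples_json)
    if not isinstance(mapping, dict):
        raise ValueError("expected a JSON object mapping tasks to indices")
    for task, indices in mapping.items():
        valid = isinstance(indices, list) and all(
            isinstance(i, int) and not isinstance(i, bool) and i >= 0
            for i in indices
        )
        if not valid:
            raise ValueError(f"indices for {task!r} must be non-negative integers")
    return mapping


print(parse_samples('{"task1": [0, 1, 2]}'))
```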

Caching and Performance

| Argument | Description |
|---|---|
| --cache_requests | Cache preprocessed prompts: true, refresh, or delete. Cached files are stored in lm_eval/cache/.cache or the path set by the LM_HARNESS_CACHE_PATH env var. |
| --check_integrity | Run task test suite validation before evaluation. |
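
The cache location rule for `--cache_requests` amounts to "environment variable first, default second". A minimal sketch follows, assuming the default is the relative path given in the table; `resolve_cache_dir` and `DEFAULT_CACHE_DIR` are hypothetical names, and lm-eval may resolve the default relative to its installed package rather than the working directory.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default taken from the table above; how it is anchored is an assumption.
DEFAULT_CACHE_DIR = Path("lm_eval") / "cache" / ".cache"


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the request-cache directory, honoring LM_HARNESS_CACHE_PATH."""
    env = os.environ if env is None else env
    override = env.get("LM_HARNESS_CACHE_PATH")
    return Path(override) if override else DEFAULT_CACHE_DIR


print(resolve_cache_dir({}))
print(resolve_cache_dir({"LM_HARNESS_CACHE_PATH": "/tmp/lm_eval_cache"}))
```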

Prompt Formatting

| Argument | Description |
|---|---|
| --system_instruction | Custom system instruction prepended to prompts. |
| --fewshot_as_multiturn | Format few-shot examples as multi-turn conversation. Auto-enabled with --apply_chat_template. Set to false to disable. |

Task Management

| Argument | Description |
|---|---|
| --include_path | Additional directory containing external task YAML files. |

Logging and Tracking

| Argument | Short | Description |
|---|---|---|
| --verbosity | -v | (Deprecated) Use LMEVAL_LOG_LEVEL env var instead. |
| --write_out | -w | Print prompts for first few documents (for debugging). |
| --show_config | | Display full task configuration after evaluation. |
| --wandb_args | | Weights & Biases arguments as key=val. E.g., project=my-project name=run-1. |
| --wandb_config_args | | Additional W&B config arguments. |
| --hf_hub_log_args | | HuggingFace Hub logging arguments. See HF Hub Logging. |

Advanced Options

| Argument | Short | Description |
|---|---|---|
| --predict_only | -x | Save predictions only, skip metric computation. Implies --log_samples. |
| --seed | | Random seeds as single integer or comma-separated list for python,numpy,torch,fewshot. Default: 0,1234,1234,1234. Use None to skip. Example: --seed 42 or --seed 0,None,8,52. |
| --trust_remote_code | | Allow executing remote code from HuggingFace Hub. |
| --confirm_run_unsafe_code | | Confirm understanding of risks for tasks executing arbitrary Python. |
| --metadata | | JSON string passed to TaskConfig. Required for some tasks like RULER. Example: --metadata '{"max_seq_length": 4096}'. |
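
The `--seed` format can be read as four slots in the order python, numpy, torch, fewshot. The sketch below shows one way to interpret it; `parse_seed` is a hypothetical helper, and applying a single integer to all four slots is an assumption about how `--seed 42` behaves rather than something this guide states.

```python
from typing import List, Optional

SEED_NAMES = ("python", "numpy", "torch", "fewshot")


def parse_seed(value: str) -> List[Optional[int]]:
    """Parse '42' or '0,None,8,52' into four optional seeds."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) == 1:
        # Assumption: a single value is applied to every seed slot.
        parts = parts * len(SEED_NAMES)
    if len(parts) != len(SEED_NAMES):
        raise ValueError(f"expected 1 or {len(SEED_NAMES)} seeds, got {len(parts)}")
    # 'None' leaves the corresponding generator unseeded.
    return [None if p == "None" else int(p) for p in parts]


for name, seed in zip(SEED_NAMES, parse_seed("0,None,8,52")):
    print(f"{name}: {seed}")
```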

Evaluating Thinking/Reasoning Models

Models like Qwen3 or DeepSeek-R1 can produce a chain-of-thought reasoning trace before their final answer. To strip this thinking trace before metrics are computed, use the think_end_token and enable_thinking model arguments (passed via --model_args).

  • enable_thinking: Activates thinking mode in the chat template (passed as a kwarg to apply_chat_template).
  • think_end_token: The delimiter marking the end of the thinking section. Everything up to and including the last occurrence of this token is discarded from the output. This option is required when using enable_thinking=True.
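
The stripping rule for think_end_token ("everything up to and including the last occurrence") can be sketched for both the string and the token-ID case. These helpers are hypothetical illustrations, not lm-eval's implementation; leaving the output unchanged when the delimiter is absent is an assumption.

```python
from typing import List


def strip_thinking_text(output: str, think_end_token: str) -> str:
    """Drop everything up to and including the last think_end_token."""
    _, sep, tail = output.rpartition(think_end_token)
    # If the delimiter never appears, keep the output as is (assumption).
    return tail if sep else output


def strip_thinking_ids(token_ids: List[int], think_end_id: int) -> List[int]:
    """Same rule applied to a sequence of token IDs."""
    for i in range(len(token_ids) - 1, -1, -1):
        if token_ids[i] == think_end_id:
            return token_ids[i + 1:]
    return list(token_ids)


print(repr(strip_thinking_text("<think>2+2 is 4</think>The answer is 4", "</think>")))
print(strip_thinking_ids([11, 99, 12, 99, 13, 14], think_end_id=99))
```

Matching on the token ID rather than the string avoids a false split when the literal delimiter text happens to appear inside the answer.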

With the vllm or sglang backends, think_end_token must be a string (e.g. </think>):

lm-eval run --model vllm \
  --model_args pretrained=Qwen/Qwen3-32B,enable_thinking=True,think_end_token="</think>" \
  --tasks gsm8k --apply_chat_template

With the hf backend, think_end_token can be either a string or a token ID (integer). Using the token ID avoids edge cases where the token string appears in normal text:

lm-eval run --model hf \
  --model_args pretrained=Qwen/Qwen3-32B,enable_thinking=True,think_end_token=151668 \
  --tasks gsm8k --apply_chat_template

The correct think_end_token for a given model can be found in its tokenizer_config.json (look for the token closing the thinking block in the chat template). For example, see Qwen3-32B's tokenizer_config.json.

Note: enable_thinking=True is only compatible with generative tasks. It cannot be used with loglikelihood-based tasks.

Configuration File

| Argument | Short | Description |
|---|---|---|
| --config | -C | Path to YAML configuration file. CLI arguments override config file values. See Configuration Files. |
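
The precedence rule (CLI arguments override config file values) can be sketched as a dictionary merge. This is a simplified illustration: `merge_config` is a hypothetical helper, the config keys are assumed to mirror the CLI flag names, and unset CLI options are assumed to be represented as None.

```python
from typing import Any, Dict


def merge_config(
    file_values: Dict[str, Any], cli_values: Dict[str, Any]
) -> Dict[str, Any]:
    """Overlay explicitly set CLI values on top of config-file values."""
    merged = dict(file_values)
    for key, value in cli_values.items():
        if value is not None:  # None means "not given on the command line"
            merged[key] = value
    return merged


# Assumed contents of a config file whose keys mirror the CLI flags.
from_file = {"model": "hf", "tasks": ["hellaswag"], "num_fewshot": 5}
# Equivalent of: lm-eval run --config my_config.yaml --tasks mmlu
from_cli = {"model": None, "tasks": ["mmlu"], "num_fewshot": None}

print(merge_config(from_file, from_cli))
```

Here the task list comes from the command line while the model and few-shot count are kept from the file.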

HuggingFace Hub Logging

The --hf_hub_log_args argument accepts these keys:

| Key | Description |
|---|---|
| hub_results_org | Organization name on HF Hub. Defaults to token owner. |
| details_repo_name | Repository name for detailed results. |
| results_repo_name | Repository name for aggregated results. |
| push_results_to_hub | True/False - push results to Hub. |
| push_samples_to_hub | True/False - push samples to Hub. Requires --log_samples. |
| public_repo | True/False - make repository public. |
| leaderboard_url | URL to associated leaderboard. |
| point_of_contact | Contact email for results dataset. |
| gated | True/False - gate the details dataset. |

lm-eval ls

List available tasks, groups, subtasks, or tags.

lm-eval ls [tasks|groups|subtasks|tags] [--include_path DIR]

Arguments

| Argument | Description |
|---|---|
| tasks | List all available tasks (groups, subtasks, and tags). |
| groups | List only task groups (e.g., mmlu, glue, superglue). |
| subtasks | List only individual subtasks (e.g., mmlu_anatomy, hellaswag). |
| tags | List task tags (e.g., reasoning, knowledge). |
| --include_path | Additional directory for external task definitions. |

Task Organization

  • Groups: Collections of related tasks with aggregated metrics across subtasks (e.g., mmlu contains 57 subtasks)
  • Subtasks: Individual evaluation tasks (e.g., mmlu_anatomy, hellaswag)
  • Tags: Categories for filtering tasks without aggregated metrics (e.g., reasoning, language)

Examples

# List all tasks
lm-eval ls tasks

# List only task groups
lm-eval ls groups

# Include external tasks
lm-eval ls tasks --include_path /path/to/external/tasks

lm-eval validate

Validate task configurations before running evaluations.

lm-eval validate --tasks <task1,task2> [--include_path DIR]

Arguments

| Argument | Short | Description |
|---|---|---|
| --tasks | -t | (Required) Comma-separated list of task names to validate. |
| --include_path | | Additional directory for external task definitions. |

Validation Checks

The validate command performs the following checks:

  • Task existence: Verifies all specified tasks are available
  • Configuration syntax: Checks YAML/JSON configuration files
  • Dataset access: Validates dataset paths and configurations
  • Required fields: Ensures all mandatory task parameters are present
  • Metric definitions: Verifies metric functions and aggregation methods
  • Filter pipelines: Validates filter chains and their parameters
  • Template rendering: Tests prompt templates with sample data

Examples

# Validate a single task
lm-eval validate --tasks hellaswag

# Validate multiple tasks
lm-eval validate --tasks arc_easy,arc_challenge,hellaswag

# Validate a task group
lm-eval validate --tasks mmlu

# Validate external tasks
lm-eval validate --tasks my_custom_task --include_path ./custom_tasks

Python API

For programmatic usage, see the Python API Guide.


Environment Variables

| Variable | Description |
|---|---|
| LMEVAL_LOG_LEVEL | Logging level (DEBUG, INFO, WARNING, ERROR). |
| LM_HARNESS_CACHE_PATH | Path for cached requests (default: lm_eval/cache/.cache). |
| HF_TOKEN | HuggingFace Hub token for private datasets/models. |
| TOKENIZERS_PARALLELISM | Set to false to avoid tokenizer warnings (auto-set by CLI). |
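
For wrapper scripts that want to honor LMEVAL_LOG_LEVEL the same way, a minimal sketch of mapping the variable onto Python's logging levels is shown below. `log_level_from_env` is a hypothetical helper, and the INFO fallback when the variable is unset is an assumption, not documented lm-eval behavior.

```python
import logging
import os
from typing import Mapping, Optional

# The four level names listed in the table above.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def log_level_from_env(env: Optional[Mapping[str, str]] = None) -> int:
    """Translate LMEVAL_LOG_LEVEL into a logging level constant."""
    env = os.environ if env is None else env
    name = env.get("LMEVAL_LOG_LEVEL", "INFO").upper()  # assumed default
    if name not in LEVELS:
        raise ValueError(f"unsupported LMEVAL_LOG_LEVEL: {name!r}")
    return LEVELS[name]


logging.basicConfig(level=log_level_from_env({"LMEVAL_LOG_LEVEL": "debug"}))
logging.getLogger("example").debug("debug logging is enabled")
```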