Python API

December 2, 2025

This guide covers programmatic usage of the evaluation harness in Python scripts and applications.

Overview

The library provides three main ways to run evaluations programmatically:

| Function | Use Case |
| --- | --- |
| simple_evaluate() | Most common - accepts model name strings or LM objects |
| EvaluatorConfig | Config-based - load settings from YAML or dataclass |
| evaluate() | Low-level - full control over task dictionaries |

Quick Start

The simplest way to run an evaluation:

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
)

print(results["results"])

Using simple_evaluate()

The simple_evaluate() function is the recommended entry point for most use cases.

Basic Usage

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2,dtype=float32",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    batch_size=8,
    device="cuda:0",
)

With a Pre-initialized Model

import lm_eval
from lm_eval.models.huggingface import HFLM

# Initialize model separately
lm = HFLM(pretrained="gpt2", batch_size=16)

results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["hellaswag"],
    num_fewshot=0,
)

With External Tasks

import lm_eval
from lm_eval.tasks import TaskManager

# Include custom task definitions
task_manager = TaskManager(include_path="/path/to/custom/tasks")

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["my_custom_task"],
    task_manager=task_manager,
)

Common Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| model | str or LM | Model name (e.g., "hf", "vllm") or LM instance |
| model_args | str or dict | Model constructor arguments |
| tasks | list[str] | Task names to evaluate |
| num_fewshot | int | Number of few-shot examples |
| batch_size | int or str | Batch size or "auto" |
| device | str | Device (cuda, cpu, mps) |
| limit | int or float | Limit examples per task |
| log_samples | bool | Save model inputs/outputs |
| task_manager | TaskManager | For external tasks |
| gen_kwargs | dict | Generation arguments |
| apply_chat_template | bool or str | Use chat template |
| system_instruction | str | System prompt |
| fewshot_as_multiturn | bool | Multi-turn few-shot |

See lm_eval/evaluator.py for the complete parameter list.
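
The string form of model_args is a comma-separated list of key=value pairs, as in "pretrained=gpt2,dtype=float32". A minimal sketch of building that string from a dict is shown here; format_model_args is a hypothetical helper written for this guide, not part of the library.

```python
def format_model_args(args):
    """Join a dict into the comma-separated key=value form used for model_args."""
    return ",".join(f"{key}={value}" for key, value in args.items())


print(format_model_args({"pretrained": "gpt2", "dtype": "float32"}))
# prints: pretrained=gpt2,dtype=float32
```

Since model_args also accepts a dict, the helper is only worth using when the string form is needed, for example when logging the exact arguments of a run.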

Return Value

simple_evaluate() returns a dictionary with:

{
    "results": {
        "task_name": {
            "metric_name": value,
            "metric_name,stderr": stderr_value,
        }
    },
    "configs": {...},      # Task configurations
    "versions": {...},     # Task versions
    "n-shot": {...},       # Few-shot counts
    "higher_is_better": {...},
    "n-samples": {...},
    "samples": {...},      # If log_samples=True
}
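
To make the nesting concrete, here is a small sketch that pulls the headline numbers out of a dictionary of this shape and skips the standard-error entries. The summarize helper and the sample values are illustrative assumptions; the sample only mimics the layout shown above.

```python
def summarize(results):
    """Return {task: {metric: value}} for numeric, non-stderr metrics."""
    summary = {}
    for task, metrics in results["results"].items():
        summary[task] = {
            name: value
            for name, value in metrics.items()
            if isinstance(value, (int, float)) and "stderr" not in name
        }
    return summary


# Hand-written sample in the documented shape; the numbers are made up.
sample = {
    "results": {
        "task_name": {
            "metric_name": 0.5,
            "metric_name,stderr": 0.01,
        }
    }
}
print(summarize(sample))
```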

Using EvaluatorConfig

The EvaluatorConfig class provides a structured way to manage evaluation settings.

From YAML File

from lm_eval.config.evaluate_config import EvaluatorConfig
import lm_eval

# Load configuration from YAML
config = EvaluatorConfig.from_config("eval_config.yaml")

# Process tasks
task_manager = config.process_tasks()

# Run evaluation
results = lm_eval.simple_evaluate(
    model=config.model,
    model_args=config.model_args,
    tasks=config.tasks,
    num_fewshot=config.num_fewshot,
    batch_size=config.batch_size,
    device=config.device,
    task_manager=task_manager,
    log_samples=config.log_samples,
    gen_kwargs=config.gen_kwargs,
    apply_chat_template=config.apply_chat_template,
    system_instruction=config.system_instruction,
)
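
The example above reads eval_config.yaml but does not show what the file might contain. The sketch below writes one possible file, assuming the top-level YAML keys are named after the EvaluatorConfig fields used in this guide. That layout is an assumption, so check the Configuration Guide before relying on it.

```python
from pathlib import Path

# Assumed layout: top-level keys mirror the EvaluatorConfig field names
# that appear in this guide. This is not confirmed by the source.
CONFIG_TEXT = """\
model: hf
model_args:
  pretrained: gpt2
  dtype: float32
tasks:
  - hellaswag
  - arc_easy
num_fewshot: 5
batch_size: 8
device: "cuda:0"
log_samples: true
"""

config_path = Path("eval_config.yaml")
config_path.write_text(CONFIG_TEXT, encoding="utf-8")
```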

Direct Instantiation

from lm_eval.config.evaluate_config import EvaluatorConfig

config = EvaluatorConfig(
    model="hf",
    model_args={"pretrained": "gpt2", "dtype": "float32"},
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    batch_size=8,
    device="cuda:0",
    output_path="./results/",
    log_samples=True,
)

# Validate and process
task_manager = config.process_tasks()

Config Fields

See the Configuration Guide for all available fields.


Using evaluate()

The evaluate() function provides lower-level control, accepting pre-built task dictionaries.

With Custom Task Objects

import lm_eval
from lm_eval.tasks import TaskManager, get_task_dict
from lm_eval.models.huggingface import HFLM

# Initialize model
lm = HFLM(pretrained="gpt2", batch_size=16)

# Build task dictionary
task_manager = TaskManager(include_path="/path/to/custom/tasks")
task_dict = get_task_dict(
    ["hellaswag", "my_custom_task"],
    task_manager
)

# Run evaluation
results = lm_eval.evaluate(
    lm=lm,
    task_dict=task_dict,
    num_fewshot=5,
    limit=100,
)
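
The limit=100 argument above caps how many examples each task evaluates. Because limit may also be a float, the sketch below shows one plausible reading: a value below 1.0 is treated as a fraction of the dataset, and anything else as an absolute count. This rule and the effective_limit helper are assumptions made for illustration; lm_eval/evaluator.py is the authority.

```python
import math


def effective_limit(limit, num_docs):
    """Number of documents evaluated under the assumed `limit` rule."""
    if limit is None:
        return num_docs
    if limit < 1.0:
        # Assumed: a float below 1.0 is a fraction of the dataset.
        return math.ceil(num_docs * limit)
    # Assumed: anything else is an absolute count, capped at the dataset size.
    return min(int(limit), num_docs)


print(effective_limit(0.5, 1000))
print(effective_limit(100, 1000))
```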

Mixed Task Sources

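from lm_eval.tasks import TaskManager, get_task_dict

# Task manager that also searches a custom task directory
task_manager = TaskManager(include_path="/path/to/custom/tasks")

# Combine different task sources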
task_dict = get_task_dict(
    [
        "mmlu",                           # Stock task name
        "my_custom_task",                 # From include_path
        {"task": "inline_task", ...},     # Inline config dict
    ],
    task_manager
)

Custom Models

To evaluate a custom model, create a subclass of lm_eval.api.model.LM:

from lm_eval.api.model import LM

class MyCustomLM(LM):
    def __init__(self, model, batch_size=1):
        super().__init__()
        self.model = model
        self._batch_size = batch_size

    def loglikelihood(self, requests):
        # Return list of (logprob, is_greedy) tuples
        ...

    def generate_until(self, requests):
        # Return list of generated strings
        ...

    def loglikelihood_rolling(self, requests):
        # Return list of log-likelihoods (one float per request)
        ...

    @property
    def batch_size(self):
        return self._batch_size

Then use it with simple_evaluate():

import lm_eval

# load_my_model() is a placeholder for your own model-loading code
my_model = load_my_model()
lm = MyCustomLM(model=my_model, batch_size=16)

results = lm_eval.simple_evaluate(
    model=lm,
    tasks=["hellaswag"],
)

For detailed guidance on implementing custom models, see the Model Guide.
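
As a rough illustration of the return shapes, the toy model below fills in all three methods with dummy scoring. It assumes each request exposes its arguments as a tuple in .args: (context, continuation) for loglikelihood, (text,) for loglikelihood_rolling, and (context, gen_kwargs) for generate_until. FakeRequest and LengthPenaltyLM are hypothetical names, and the import fallback exists only so the sketch runs without the harness installed.

```python
from dataclasses import dataclass

try:
    from lm_eval.api.model import LM  # real base class when the harness is installed
except ImportError:
    LM = object  # fallback so the sketch still runs standalone


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a harness request; only `.args` is modeled."""
    args: tuple


class LengthPenaltyLM(LM):
    """Toy model: every character costs one unit of log-probability."""

    def loglikelihood(self, requests):
        # Assumed args layout: (context, continuation)
        out = []
        for req in requests:
            _context, continuation = req.args
            out.append((-float(len(continuation)), False))
        return out

    def loglikelihood_rolling(self, requests):
        # Assumed args layout: (text,); assumed to return one float per request
        return [-float(len(req.args[0])) for req in requests]

    def generate_until(self, requests):
        # Assumed args layout: (context, gen_kwargs); returns one string each
        return ["" for _ in requests]


lm = LengthPenaltyLM()
print(lm.loglikelihood([FakeRequest(("The sky is", " blue"))]))
```

Every method returns one entry per request, in the same order as the input.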


Logging

Configure logging for debugging:

from lm_eval.utils import setup_logging

# Set log level
setup_logging("DEBUG")  # DEBUG, INFO, WARNING, ERROR

# Or use environment variable
import os
os.environ["LMEVAL_LOG_LEVEL"] = "DEBUG"

Examples

Batch Evaluation of Multiple Models

import lm_eval

models = [
    "gpt2",
    "gpt2-medium",
    "gpt2-large",
]

all_results = {}
for model_name in models:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_name}",
        tasks=["hellaswag"],
        batch_size="auto",
    )
    all_results[model_name] = results["results"]
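
Once all_results is filled, a short helper can line the models up on a single metric. This is a sketch using made-up numbers; comparison_rows is a hypothetical helper, and the metric key must match whatever key appears in your own results.

```python
def comparison_rows(all_results, task, metric):
    """Return (model_name, value) pairs, highest value first."""
    rows = [
        (model_name, per_task[task][metric])
        for model_name, per_task in all_results.items()
        if metric in per_task.get(task, {})
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up numbers in the shape built by the loop above.
demo = {
    "gpt2": {"hellaswag": {"metric_name": 0.30}},
    "gpt2-medium": {"hellaswag": {"metric_name": 0.35}},
}
for name, value in comparison_rows(demo, "hellaswag", "metric_name"):
    print(f"{name:<12} {value:.3f}")
```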

Save and Load Results

import json
import lm_eval
from lm_eval.utils import handle_non_serializable

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["hellaswag"],
)

# Save results
with open("results.json", "w") as f:
    json.dump(results, f, default=handle_non_serializable, indent=2)
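
Reading the file back needs only the standard library. A minimal sketch, with load_results as a hypothetical helper name:

```python
import json


def load_results(path="results.json"):
    """Read a results file written with json.dump back into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling load_results()["results"] then gives the same per-task metrics that were saved. Any value that was converted during serialization comes back as a plain JSON type rather than its original Python type.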