ai-eval-harness

May 22, 2026

License: MIT · Version: v0.1 · Python: 3.10+

Open-source eval harness for RAG and agent systems. Built on Ragas plus a small cost-tracking layer, with promptfoo wiring for prompt-level regression suites. Model-agnostic on principle: Claude, GPT, and open-source models are scored on the same prompts and corpora.

Maintained by Paiteq and used internally on client engagements by Paiteq and GetWidget. MIT-licensed so you can run the same harness on your own infra and verify our published benchmark numbers.

Status: v0.1, public scaffold. Ragas wiring, cost/latency tracking, a local-echo smoke provider, and one example RAG config ship in v0.1. Benchmarks produced with this harness are published at getwidget.dev/benchmarks, starting with rag-2026-q2 in June 2026.

Why this exists

Most "best LLM for RAG" content online is vibes-based. Authors run a handful of prompts on a curated corpus, declare a winner, and never publish the prompts or corpus. The result is unverifiable, undated, and rots within a quarter.

We wanted a different shape:

  • Reproducible. Anyone can clone this repo, point it at a corpus, and get scores inside our published confidence intervals.
  • Dated. Every benchmark we publish carries the quarter in the URL and H1. Undated benchmarks are uncitable.
  • Model-agnostic. Same prompts, same corpus, same rubric across Claude / GPT / Gemini / open-source.
  • Cost on the same axis as quality. Recall@5 and pass@1 are meaningless without $/1k queries on the same dated run.
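
As a worked illustration of the last point, the sketch below converts a run's token totals into dollars per 1k queries. It is only an assumption about how such a figure can be derived, not the harness's own accounting: the function name is hypothetical, the parameter names mirror the input_cost_per_mtok / output_cost_per_mtok config fields, and "mtok" is assumed to mean one million tokens.

```python
def cost_per_1k_queries(
    input_tokens: int,
    output_tokens: int,
    n_queries: int,
    input_cost_per_mtok: float,
    output_cost_per_mtok: float,
) -> float:
    """Return USD per 1,000 queries for one run.

    Token counts are totals over the whole run; prices are assumed to be
    USD per million tokens.
    """
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    run_cost = (
        input_tokens / 1_000_000 * input_cost_per_mtok
        + output_tokens / 1_000_000 * output_cost_per_mtok
    )
    # Normalise the run cost to a per-1k-queries figure.
    return run_cost / n_queries * 1_000


# Hypothetical run: 100 queries, 200k input and 30k output tokens in total,
# priced at 3.00 / 15.00 USD per million tokens.
usd_per_1k = cost_per_1k_queries(200_000, 30_000, 100, 3.00, 15.00)
print(f"${usd_per_1k:.2f} per 1k queries")  # prints "$10.50 per 1k queries"
```

Reporting this number next to a quality metric from the same dated run is what lets a reader compare two models on both axes at once.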

What the harness covers

| Capability | Backed by | Status |
| --- | --- | --- |
| Context precision, context recall | Ragas | ✅ v0.1 |
| Faithfulness, answer relevancy | Ragas | ✅ v0.1 |
| Cost + latency tracking per-run | custom | ✅ v0.1 |
| Local-echo provider (no-credentials smoke path) | custom | ✅ v0.1 |
| Prompt-level assertions over precomputed answers | promptfoo | ✅ v0.1 |
| LLM-graded promptfoo assertions, regex/contains matchers | promptfoo | ⚪ v0.2 |
| Agent reliability (tool-use, multi-step, error recovery) | custom rubric | ⚪ v0.2 |

Roadmap

| Milestone | Date | What ships |
| --- | --- | --- |
| v0.1 | ✅ 2026-05 (shipped) | Ragas wiring. promptfoo assertion lane. CLI runner. Cost + latency tracking. Local-echo smoke provider. One example RAG config. |
| v0.2 | 2026-07 | Agent reliability bench (100-task harness, tool-use + multi-step). LLM-graded promptfoo assertions. |
| v0.3 | 2026-08 | Public dataset cards on huggingface.co/paiteq-ai. PyPI release. |
| v1.0 | 2026-Q4 | Stable CLI + Python API. Versioned benchmark snapshots. |

Track milestones in GitHub issues and releases once they ship.

Published benchmarks

Each benchmark uses this harness end-to-end. Full methodology, prompts, scores, and per-model cost live on the benchmark page; the dataset is mirrored to HuggingFace.

| Benchmark | Publish target | Page |
| --- | --- | --- |
| RAG retrieval, 2026-Q2 | 2026-06 | getwidget.dev/benchmarks/rag-2026-q2/ |
| Agent reliability, 2026-Q3 | 2026-09 | getwidget.dev/benchmarks/agent-reliability-2026-q3/ |

Quick start

# From source (PyPI publish lands in v0.3).
git clone https://github.com/paiteq/ai-eval-harness.git
cd ai-eval-harness
pip install -e ".[ragas,anthropic,openai]"

# Smoke run against the bundled local-echo provider — no API keys needed.
ai-eval run examples/rag-baseline.yaml --no-score --out runs/

# With Ragas scoring + real models (Claude, GPT). Export keys first.
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
ai-eval run examples/rag-baseline.yaml --out runs/

The bundled examples/rag-baseline.yaml runs on a 10-row local corpus (examples/data/rag-baseline.jsonl) so the full pipeline is exercisable without external data.

Defining your own eval

Three sections, all YAML:

name: my-rag-eval
dataset:
  source: path/to/my-corpus.jsonl  # or org/name on HuggingFace
  max_rows: 100
  columns:
    question: query        # remap if your corpus uses different names
    ground_truth: answer
    contexts: docs

models:
  - name: claude-sonnet-4-6
    provider: anthropic
    model_id: claude-sonnet-4-6
    input_cost_per_mtok: 3.00
    output_cost_per_mtok: 15.00

  - name: gpt-5
    provider: openai
    model_id: gpt-5
    input_cost_per_mtok: 2.50
    output_cost_per_mtok: 10.00

metrics:
  ragas:
    - context_precision
    - context_recall
    - faithfulness
    - answer_relevancy
  promptfoo:                 # optional, requires `npm i -g promptfoo`
    - contains-ground-truth

Each row in the dataset needs three fields: question, ground_truth, and a contexts list. Retrieval is precomputed in v0.1 — bring your own retriever and pass the top-K chunks per question. End-to-end retriever wiring lands later.
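
To illustrate the row shape, here is a minimal sketch that renames corpus-specific keys to the three required fields, following the columns mapping in the config. The remap_row helper and the sample row are hypothetical and exist only for illustration; the harness's own loader may behave differently.

```python
import json

REQUIRED_FIELDS = ("question", "ground_truth", "contexts")


def remap_row(raw: dict, columns: dict[str, str]) -> dict:
    """Rename corpus keys to the three required field names.

    `columns` maps required field name -> corpus key, like the YAML
    `columns` block. A missing key raises KeyError.
    """
    row = {field: raw[columns.get(field, field)] for field in REQUIRED_FIELDS}
    if not isinstance(row["contexts"], list):
        raise TypeError("contexts must be a list of retrieved chunks")
    return row


# One hypothetical JSONL line from a corpus that uses query/answer/docs keys.
line = json.dumps({
    "query": "What license does the harness use?",
    "answer": "MIT",
    "docs": ["The harness is MIT-licensed.", "It requires Python 3.10+."],
})

columns = {"question": "query", "ground_truth": "answer", "contexts": "docs"}
row = remap_row(json.loads(line), columns)
print(row["question"], "->", row["ground_truth"], f"({len(row['contexts'])} contexts)")
```

Because retrieval is precomputed, the contexts list is where the top-K chunks from your own retriever go, one list per question.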

Two scoring lanes (Ragas + promptfoo)

The harness runs scoring in two parallel lanes:

  1. Ragas: semantic metrics (faithfulness, answer relevancy, context precision/recall). Needs a judge LLM (typically OpenAI). Runs on every row of the dataset.
  2. promptfoo: assertion-based pass/fail over the precomputed answers. v0.1 wires one default assertion (the answer must contain a snippet of the ground truth). No extra API spend: promptfoo evaluates the already-generated answers via the echo provider and does not re-run the LLMs. Requires the promptfoo Node binary on PATH (npm i -g promptfoo); skipped silently otherwise. LLM-graded assertions + regex/contains matchers land in v0.2.
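
The promptfoo lane's two conditions can be sketched in Python, purely as an illustration of the logic. In the harness the assertion is evaluated by promptfoo itself, so both function names below are hypothetical. How the "snippet" of the ground truth is chosen is not specified here; taking its first 40 characters is an assumption.

```python
import shutil


def promptfoo_available() -> bool:
    """True when a `promptfoo` executable can be found on PATH."""
    return shutil.which("promptfoo") is not None


def contains_ground_truth(answer: str, ground_truth: str, snippet_chars: int = 40) -> bool:
    """Case-insensitive check that the answer contains a ground-truth snippet.

    Assumption: the snippet is the first `snippet_chars` characters of the
    ground truth. An empty ground truth never passes.
    """
    snippet = ground_truth.strip()[:snippet_chars].lower()
    return bool(snippet) and snippet in answer.lower()


if promptfoo_available():
    print("promptfoo found: the assertion lane can run")
else:
    print("promptfoo not on PATH: the assertion lane would be skipped")

print(contains_ground_truth("The license is MIT.", "MIT"))
print(contains_ground_truth("Apache 2.0", "MIT"))
```

A plain substring check like this is cheap, which is consistent with the lane adding no API spend, but it will miss correct answers that paraphrase the ground truth.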

Lane results land in separate fields on the EvalResult: ragas_scores and promptfoo_scores. The CLI prints both tables side-by-side. The JSON report under runs/<config-name>.json carries both, plus promptfoo_note if the lane was requested but skipped.
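
A report can then be inspected with a few lines of Python. This is a sketch under assumptions: it treats the report as a JSON object with ragas_scores, promptfoo_scores, and promptfoo_note as top-level keys, and the nesting under each key (for example per model) may differ in the real file. The helper names and the sample dict are hypothetical, and the score is a placeholder value.

```python
import json
from pathlib import Path


def load_report(path: str) -> dict:
    """Read a JSON report file, e.g. one written under runs/."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_report(report: dict) -> list[str]:
    """Collect printable lines for whichever lane fields are present."""
    lines = []
    for lane in ("ragas_scores", "promptfoo_scores"):
        if lane in report:
            lines.append(f"{lane}: {json.dumps(report[lane], sort_keys=True)}")
    note = report.get("promptfoo_note")
    if note:
        lines.append(f"promptfoo lane skipped: {note}")
    return lines


# Placeholder report where the promptfoo lane was requested but skipped.
example = {
    "ragas_scores": {"faithfulness": 0.5},
    "promptfoo_note": "promptfoo not found on PATH",
}
for line in summarize_report(example):
    print(line)
```

Checking for promptfoo_note explicitly helps tell a skipped lane apart from a lane that ran and produced no passing assertions.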

License

MIT. See LICENSE. Free for personal and commercial use, no attribution required (though a link back is appreciated).

Maintained by

This harness is built and maintained by the engineering team behind:

  • Paiteq — AI engineering studio (Claude, OpenAI, open-source). Eval-first delivery.
  • GetWidget — AI-native development studio, founded 2017 (Dallas + Bengaluru). Open-source Flutter UI Kit (4,800+ stars).
  • Hire Flutter Dev — vetted senior Flutter developers, AI-augmented delivery.

Contact: paiteq.com/contact · getwidget.dev/contact-us