OfficeQA

May 12, 2026

A Grounded Reasoning Benchmark by Databricks

OfficeQA is a benchmark by Databricks for evaluating model and agent performance on end-to-end Grounded Reasoning tasks. The benchmark is split into two subsets:

  1. OfficeQA Pro: The default for evaluating frontier models (N=133)
  2. OfficeQA Full: A version of the benchmark containing additional easier questions to hillclimb systems on (N=246)

Data Access

As of May 2026, all large files (benchmark CSVs, Treasury Bulletin PDFs, and parsed docs) have been moved from this GitHub repo to Hugging Face. The CSVs are gated so that agents browsing the web cannot access them. Request access on Hugging Face to get the benchmark questions and answers.

Once you've requested and been granted access, you can load the benchmark data:

from datasets import load_dataset
# Authenticate first: huggingface_hub.login() or set HF_TOKEN env var
dataset = load_dataset("databricks/officeqa", data_files="officeqa_pro.csv", split="train")

Overview

OfficeQA evaluates how well AI systems can reason over real-world documents to answer complex questions. The benchmark uses historical U.S. Treasury Bulletin PDFs (1939-2025), which contain dense financial tables, charts, and text data.

Repository Contents:

| File/Dir | Description |
| --- | --- |
| reward.py | Evaluation script for scoring model outputs |
| corpus_scripts/ | Scripts and notebooks for working with the Treasury Bulletin corpus |

All benchmark data (CSVs, PDFs, parsed docs) is on Hugging Face.

Dataset Schema (officeqa_pro.csv / officeqa_full.csv):

| Column | Description |
| --- | --- |
| uid | Unique question identifier |
| question | The question to answer |
| answer | Ground truth answer |
| source_docs | Original URL(s) from the Federal Reserve Archive |
| source_files | Corresponding parsed filename(s) (e.g., treasury_bulletin_1941_01.txt) |
| difficulty | easy or hard |
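As a minimal sketch of working with these columns, the snippet below groups rows by the difficulty column. It assumes each row is available as a plain dictionary keyed by column name (for example, when iterating over the loaded dataset). The sample rows and the helper name split_by_difficulty are hypothetical placeholders, not real benchmark entries or part of the repository.

```python
# Hypothetical placeholder rows that mirror the documented columns.
# They are NOT real benchmark questions or answers.
rows = [
    {"uid": "example-1", "question": "...", "answer": "1.0",
     "source_docs": "...", "source_files": "treasury_bulletin_1941_01.txt",
     "difficulty": "easy"},
    {"uid": "example-2", "question": "...", "answer": "2.0",
     "source_docs": "...", "source_files": "treasury_bulletin_1941_01.txt",
     "difficulty": "hard"},
    {"uid": "example-3", "question": "...", "answer": "3.0",
     "source_docs": "...", "source_files": "treasury_bulletin_1941_01.txt",
     "difficulty": "hard"},
]


def split_by_difficulty(rows):
    """Group rows by their `difficulty` value ("easy" or "hard")."""
    groups = {}
    for row in rows:
        groups.setdefault(row["difficulty"], []).append(row)
    return groups


groups = split_by_difficulty(rows)
for level, items in groups.items():
    print(level, [row["uid"] for row in items])
```

Reporting scores per difficulty level in this way can make it easier to see whether a system's errors concentrate in the hard questions.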

Results

Headline results on OfficeQA Pro (N=133). See the OfficeQA Technical Report for the full evaluation methodology and additional settings.

Agent Harness Performance

End-to-end performance of frontier agents operating over the Treasury Bulletin corpus.

GPT-5.1 and Opus 4.5 results are included as a reference point for the results from the OfficeQA blog and were re-run with the latest OfficeQA Pro. They were recorded on March 9, 2026; see the OfficeQA Technical Report.

GPT-5.4 and Opus 4.6 results were recorded on March 9, 2026; see the OfficeQA Technical Report. Opus 4.7 results were recorded on April 21, 2026.

LLM with Oracle Page(s) + Web Search (PDF Only)

LLM performance when provided the oracle page(s) needed to answer each question along with web search access, evaluated across varying absolute relative error tolerances.

GPT-5.4 and Opus 4.6 results were recorded on March 9, 2026; see the OfficeQA Technical Report. Opus 4.7 results were recorded on April 21, 2026.

Getting Started

1. Load the benchmark questions (from Hugging Face)

from datasets import load_dataset
# Authenticate first (dataset is gated)
# huggingface_hub.login() or set HF_TOKEN env var

# Pro subset — default for evaluating frontier models
dataset = load_dataset("databricks/officeqa", data_files="officeqa_pro.csv", split="train")

# Full benchmark — includes easier questions for hillclimbing
dataset = load_dataset("databricks/officeqa", data_files="officeqa_full.csv", split="train")

2. Clone the code repository (for reward.py and scripts)

git clone https://github.com/databricks/officeqa.git
cd officeqa

3. Download the corpus (from Hugging Face)

We recommend using parsed txt files found here:

from huggingface_hub import snapshot_download

# Download transformed text (recommended for LLM/RAG workflows, ~460MB)
local_dir = snapshot_download(
    repo_id="databricks/officeqa",
    repo_type="dataset",
    allow_patterns="treasury_bulletins_parsed/transformed/*.txt",
)

If you'd like to use the raw JSON parse or the original PDFs, you can also download them:

# Download parsed JSON docs (~730MB, with bounding boxes, tables, metadata)
local_dir = snapshot_download(
    repo_id="databricks/officeqa",
    repo_type="dataset",
    allow_patterns="treasury_bulletins_parsed/jsons/*.json",
)

# Download original PDFs (~4GB)
local_dir = snapshot_download(
    repo_id="databricks/officeqa",
    repo_type="dataset",
    allow_patterns="treasury_bulletin_pdfs/*",
)

| Format | Best for | Size |
| --- | --- | --- |
| PDFs | Systems with native PDF support, or if you want to parse from scratch | ~4GB |
| Parsed JSON | Full structural information, coordinates | ~730MB |
| Transformed TXT | LLM/agent consumption, cleaner text | ~460MB |

See corpus_scripts/ for scripts to create alternative text representations from the parsed JSONs, and to visualize the parsed bounding boxes on top of the PDFs.
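Once a snapshot has been downloaded, a filename from the source_files column can be resolved to a local path. The sketch below assumes the snapshot keeps the same sub-directories used in the download patterns above and that each representation uses the matching file extension. The helper name corpus_path and the CORPUS_DIRS table are illustrative, not part of the repository.

```python
from pathlib import Path

# Sub-directories taken from the allow_patterns used for downloading.
# Assumption: the local snapshot preserves this layout.
CORPUS_DIRS = {
    "txt": "treasury_bulletins_parsed/transformed",
    "json": "treasury_bulletins_parsed/jsons",
    "pdf": "treasury_bulletin_pdfs",
}


def corpus_path(local_dir, source_file, fmt="txt"):
    """Return the expected path of one bulletin inside a snapshot directory.

    `source_file` is a name such as "treasury_bulletin_1941_01.txt";
    its extension is swapped for the requested format.
    """
    if fmt not in CORPUS_DIRS:
        raise ValueError(f"unknown format: {fmt!r}")
    stem = Path(source_file).stem  # "treasury_bulletin_1941_01"
    return Path(local_dir) / CORPUS_DIRS[fmt] / f"{stem}.{fmt}"


# Hypothetical snapshot location; in practice pass the value returned
# by snapshot_download.
path = corpus_path("officeqa_snapshot", "treasury_bulletin_1941_01.txt", fmt="json")
print(path)
```

The returned path is only the expected location. Check path.exists() before reading, since only the representations you downloaded will be present.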

4. Evaluate your model outputs

from reward import score_answer

# Score a single prediction
score = score_answer(
    ground_truth="123.45",
    predicted="123.45",
    tolerance=0.01  # 1% tolerance for numerical answers
)
print(f"Score: {score}")  # 1.0 for correct, 0.0 for incorrect

The reward.py script provides fuzzy matching for numerical answers with configurable tolerance levels, passed as a fraction (for example, tolerance=0.01 for 1%):

  • 0.0% - Exact match
  • 0.1% - Within 0.1% relative error
  • 1.0% - Within 1% relative error
  • 5.0% - Within 5% relative error, and so on
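To illustrate what a relative-error tolerance means, here is a simplified sketch of the check for purely numeric answers. It is not the implementation in reward.py, which may also handle units, formatting, and non-numeric answers. The function name within_tolerance and the handling of a zero ground truth are assumptions made for this example.

```python
def within_tolerance(ground_truth, predicted, tolerance):
    """Return True if the relative error is at most `tolerance`.

    `tolerance` is a fraction (0.01 means 1%). Relative error is
    |predicted - truth| / |truth|.
    """
    truth, pred = float(ground_truth), float(predicted)
    if truth == 0.0:
        # Assumption for this sketch: relative error is undefined at zero,
        # so require an exact match.
        return pred == 0.0
    return abs(pred - truth) / abs(truth) <= tolerance


# A prediction that is off by roughly 0.45% of the ground truth.
for tol in (0.0, 0.001, 0.01, 0.05):
    print(f"tolerance={tol}: {within_tolerance('123.45', '124.00', tol)}")
```

In this example the prediction fails at the 0.0% and 0.1% levels and passes at 1% and 5%, which is why reporting accuracy at several tolerances gives a fuller picture than a single exact-match number.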

Mapping source URLs to parsed files

The source_files column in the dataset CSVs provides the direct filenames (e.g., treasury_bulletin_1941_01.txt) for easy reference. Here's how the URL-to-filename conversion works:

URL format: https://fraser.stlouisfed.org/title/treasury-bulletin-407/{MONTH}-{YEAR}-{ID}?page={PAGE}

Filename format: treasury_bulletin_{YEAR}_{MONTH_NUM}.{ext}

Month name to number mapping:

january   → 01    july      → 07
february  → 02    august    → 08
march     → 03    september → 09
april     → 04    october   → 10
may       → 05    november  → 11
june      → 06    december  → 12

Example:

  • URL: https://fraser.stlouisfed.org/title/treasury-bulletin-407/january-1941-6529
  • JSON file: treasury_bulletins_parsed/jsons/treasury_bulletin_1941_01.json
  • Text file: treasury_bulletins_parsed/transformed/treasury_bulletin_1941_01.txt
  • PDF file: treasury_bulletin_pdfs/treasury_bulletin_1941_01.pdf
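The conversion described above can be sketched as a small helper. It assumes every URL ends in a lowercase {MONTH}-{YEAR}-{ID} slug as in the format shown, and it raises an error for anything else rather than guessing. The function name url_to_stem is illustrative and not part of the repository.

```python
import re
from urllib.parse import urlparse

# Month name to two-digit number, following the mapping above.
MONTHS = {
    name: f"{number:02d}"
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june",
         "july", "august", "september", "october", "november", "december"],
        start=1,
    )
}


def url_to_stem(url):
    """Map a Treasury Bulletin URL to 'treasury_bulletin_{YEAR}_{MONTH_NUM}'."""
    # The ?page= query string is not part of the path, so it is ignored here.
    slug = urlparse(url).path.rstrip("/").split("/")[-1]  # "january-1941-6529"
    match = re.fullmatch(r"([a-z]+)-(\d{4})-(\d+)", slug)
    if match is None or match.group(1) not in MONTHS:
        raise ValueError(f"unrecognized bulletin URL: {url}")
    month, year, _doc_id = match.groups()
    return f"treasury_bulletin_{year}_{MONTHS[month]}"


url = "https://fraser.stlouisfed.org/title/treasury-bulletin-407/january-1941-6529?page=12"
stem = url_to_stem(url)
print(stem + ".txt")  # prints "treasury_bulletin_1941_01.txt"
```

Returning the stem without an extension lets the same result be reused for the .json, .txt, and .pdf variants. In most cases the source_files column already provides the filename, so this helper is only needed when starting from a URL.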