StructuredRAG: JSON Response Formatting with Large Language Models

April 27, 2026

A benchmark for measuring how well LLMs follow JSON response format instructions across RAG-inspired tasks. Supports OpenAI, Anthropic, Google, Ollama, and Modal/vLLM providers out of the box.

The accompanying research paper is available on arXiv: https://arxiv.org/abs/2408.11061

Experimental Results

Quick Start

Install dependencies:

uv sync

Set your API key and configure structured_rag/configs/benchmark.yaml:

export OPENAI_API_KEY=sk-...

provider: openai          # openai | anthropic | google | ollama | ollama_cloud | modal_vllm
model: gpt-5.4-nano
api_key_env: OPENAI_API_KEY  # which env var to read the key from

strategy: fstring          # fstring | fstring_structured | dspy | dspy_opro | all
tasks:
  - AssessAnswerability    # or "all" for all 7 tasks

save_dir: results

Run:

uv run python -m structured_rag.scripts.run_benchmark

Or point to a custom config:

uv run python -m structured_rag.scripts.run_benchmark path/to/custom.yaml
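
The `api_key_env` setting names the environment variable that holds the key, rather than storing the key itself. A minimal sketch of how such a lookup could work is below. The `config` dict stands in for the parsed YAML, and `resolve_api_key` is a hypothetical helper written for illustration, not the project's actual loader. A local Ollama setup needs no key, so a real loader would presumably skip this step for that provider.

```python
import os

# Hypothetical in-memory stand-in for the parsed benchmark.yaml.
config = {
    "provider": "openai",
    "model": "gpt-5.4-nano",
    "api_key_env": "OPENAI_API_KEY",
}


def resolve_api_key(config: dict, environ=os.environ) -> str:
    """Return the key stored in the env var named by `api_key_env`."""
    env_var = config.get("api_key_env")
    if not env_var:
        raise ValueError("config is missing 'api_key_env'")
    key = environ.get(env_var)
    if not key:
        raise RuntimeError(f"environment variable {env_var} is not set")
    return key
```

Passing `environ` as a parameter keeps the helper easy to test without touching the real environment.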

Tasks

The benchmark tests 7 RAG-inspired structured output tasks across different JSON complexity levels:

Output Type	Task	Example
`string`	GenerateAnswer	`{"answer": "The National Gallery of Art..."}`
`integer`	RateContext	`{"context_score": 5}`
`boolean`	AssessAnswerability	`{"answerable_question": true}`
`List[string]`	ParaphraseQuestions	`{"paraphrased_questions": ["...", "...", "..."]}`
`composite`	GenerateAnswerWithConfidence	`{"answer": "...", "confidence": 5}`
`List[composite]`	GenerateAnswersWithConfidence	`[{"answer": "...", "confidence": 5}, ...]`
`composite`	RAGAS	`{"faithfulness_score": 2.5, "answer_relevance_score": 1.0, ...}`
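
As a rough illustration of what a format check for the four single-field tasks in the table might look like, the sketch below uses only the standard library. The `EXPECTED` mapping and `matches_format` function are hypothetical names, and the choice to tolerate extra keys is an assumption. The project's own validation lives in its Pydantic models.

```python
import json

# Expected key and Python type per task, taken from the table above.
EXPECTED = {
    "GenerateAnswer": ("answer", str),
    "RateContext": ("context_score", int),
    "AssessAnswerability": ("answerable_question", bool),
    "ParaphraseQuestions": ("paraphrased_questions", list),
}


def matches_format(task: str, raw: str) -> bool:
    """Return True if `raw` is JSON with the task's key and value type."""
    key, expected_type = EXPECTED[task]
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or key not in data:
        return False
    value = data[key]
    # bool is a subclass of int in Python, so reject true/false for integers.
    if expected_type is int and isinstance(value, bool):
        return False
    if expected_type is list:
        return isinstance(value, list) and all(isinstance(v, str) for v in value)
    return isinstance(value, expected_type)
```

The explicit `bool` guard matters because `{"context_score": true}` would otherwise pass an integer check.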

Composite Models

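from pydantic import BaseModel


# Answer text paired with an integer confidence rating.
class GenerateAnswerWithConfidence(BaseModel):
    answer: str
    confidence: int


# Three float scores returned by the RAGAS task.
class RAGASMetrics(BaseModel):
    faithfulness_score: float
    answer_relevance_score: float
    context_relevance_score: float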

Prompting Strategies

Strategy	Description
`fstring`	f-string prompting with inline JSON format instructions
`fstring_structured`	f-string prompting with provider-native structured outputs (OpenAI, Google)
`dspy`	DSPy Follow-the-Format (FF) prompting
`dspy_opro`	DSPy with OPRO-optimized JSON signature
`all`	Run all 4 strategies
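
To make the `fstring` strategy concrete, here is a minimal sketch of an f-string prompt with inline JSON format instructions for the AssessAnswerability task. The prompt wording and the `build_prompt` name are assumptions made for illustration. The benchmark's actual prompts may differ.

```python
def build_prompt(context: str, question: str) -> str:
    """Build a prompt that states the required JSON shape inline."""
    # Doubled braces render as literal { and } inside an f-string.
    return (
        "Decide whether the question can be answered from the context.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Respond with JSON only, in exactly this format:\n"
        f'{{"answerable_question": <true or false>}}'
    )


prompt = build_prompt(
    "Paris is the capital of France.",
    "What is the capital of France?",
)
```

The `fstring_structured` strategy would send a similar prompt but rely on the provider's native structured output support to enforce the schema.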

Supported Providers

Provider	Config value	API key env var
OpenAI	`openai`	`OPENAI_API_KEY`
Anthropic	`anthropic`	`ANTHROPIC_API_KEY`
Google Gemini	`google`	`GOOGLE_API_KEY`
Ollama (local)	`ollama`	--
Ollama Cloud	`ollama_cloud`	`OLLAMA_API_KEY`
Modal vLLM	`modal_vllm`	`MODAL_API_KEY`

Metrics

The benchmark reports two separate scores:

JSON Format Success Rate -- did the LLM produce valid, parseable JSON matching the expected schema?
Task Accuracy -- for tasks with ground truth (e.g. AssessAnswerability), did the LLM get the right answer?
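
The two scores can be sketched as simple ratios over per-sample results. The `TrialResult` record and both function names are hypothetical. Computing accuracy only over responses that parsed successfully is an assumption, since the text above does not say how invalid responses are counted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrialResult:
    valid_json: bool           # response parsed and matched the schema
    predicted: Optional[bool]  # parsed value, None if parsing failed
    expected: Optional[bool]   # ground truth, None if the task has none


def format_success_rate(results: List[TrialResult]) -> float:
    """Fraction of responses that were valid JSON in the expected schema."""
    if not results:
        return 0.0
    return sum(r.valid_json for r in results) / len(results)


def task_accuracy(results: List[TrialResult]) -> Optional[float]:
    """Fraction of correct answers among valid responses with ground truth.

    Assumption: invalid responses are excluded rather than counted as wrong.
    """
    scored = [r for r in results if r.valid_json and r.expected is not None]
    if not scored:
        return None
    return sum(r.predicted == r.expected for r in scored) / len(scored)
```

Keeping the scores separate lets a model that formats perfectly but answers poorly be told apart from one that answers well but breaks the schema.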

Architecture

The codebase follows hexagonal (ports & adapters) architecture:

structured_rag/
  core/
    domain/       # Pydantic models, task definitions, validation metrics
    ports/        # Abstract interfaces (LLMPort, PromptingStrategy)
    services/     # Experiment runner, result saving
  adapters/
    llm/          # One adapter per provider (OpenAI, Anthropic, Google, Ollama, Modal/vLLM)
    prompting/    # Strategy implementations (f-string, DSPy)
  configs/        # benchmark.yaml
  scripts/        # run_benchmark.py entry point

Adding a new LLM provider requires creating one adapter file implementing LLMPort and registering it in adapters/llm/registry.py.
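
The port, adapter, and registry pattern might look roughly like the sketch below. Only the `LLMPort` name comes from the repository layout above. The `generate` method, the `register` decorator, `create_llm`, and `EchoAdapter` are all hypothetical, since the README does not show the real interface in adapters/llm/registry.py.

```python
import json
from abc import ABC, abstractmethod
from typing import Callable, Dict


class LLMPort(ABC):
    """Port: the interface the core depends on (method name is assumed)."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        ...


# Hypothetical registry mapping a provider name to an adapter factory.
_REGISTRY: Dict[str, Callable[[dict], LLMPort]] = {}


def register(name: str):
    """Decorator that records an adapter class under a provider name."""
    def decorator(factory):
        _REGISTRY[name] = factory
        return factory
    return decorator


@register("echo")
class EchoAdapter(LLMPort):
    """Toy adapter that echoes the prompt back as JSON, for illustration."""

    def __init__(self, config: dict):
        self.model = config.get("model", "echo-1")

    def generate(self, prompt: str) -> str:
        return json.dumps({"answer": prompt})


def create_llm(config: dict) -> LLMPort:
    """Look up the adapter for config['provider'] and instantiate it."""
    provider = config.get("provider")
    if provider not in _REGISTRY:
        raise ValueError(f"unknown provider: {provider!r}")
    return _REGISTRY[provider](config)
```

With this shape, the experiment runner only ever sees `LLMPort`, so supporting a new provider touches one adapter file plus its registration.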

Dataset

The WikiQuestions dataset contains 112 samples built from Wikipedia title-abstract pairs, each with generated answerable and unanswerable questions. It is also available on Hugging Face Datasets.

News

Weaviate Podcast #119 with Will Kurt and Cameron Pfiffer from dottxt.ai -- YouTube | Spotify
Weaviate Podcast #108 with Zhi Rui Tam on "Let Me Speak Freely?" -- YouTube | Spotify

Citation

@misc{shorten2024,
      title={StructuredRAG: JSON Response Formatting with Large Language Models}, 
      author={Connor Shorten and Charles Pierse and Thomas Benjamin Smith and Erika Cardenas and Akanksha Sharma and John Trengrove and Bob van Luijt},
      year={2024},
      eprint={2408.11061},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.11061}, 
}