RAG-over-PDF
May 31, 2026
Chat with your PDFs: hybrid retrieval, reranking, and streaming answers with page-level citations.
Upload one or many PDFs and ask questions across them. The app extracts text page by page, runs hybrid search (dense embeddings plus BM25) over the chunks, reranks the candidates for precision, and streams a grounded answer with citations that point back to the exact source and page. It is a readable, framework-free RAG implementation you can clone, run, and ship.
Built by Sarma Linux. Built to ship, not to sit on a shelf.
What this is
A working Retrieval-Augmented Generation starter that takes RAG past the toy stage. It does what the serious implementations do (hybrid search, a reranking stage, real citations) without burying any of it under a framework. Every moving part is a short, readable TypeScript module you can open and understand.
The whole retrieval pipeline is a few hundred lines. No vector DB to provision, no Pinecone account, no LangChain weight. You can read it end to end in one sitting.
What it solves
- "I have hundreds of internal PDFs nobody reads, can we make them searchable and citable?"
- "We need a chatbot grounded in our actual documentation, with sources, not the open web"
- "Dense-only retrieval keeps missing exact product codes and error strings"
- "I want to learn how production-grade RAG actually works under the hood, without a framework hiding it"
Architecture
graph TD
  U[User browser] -->|upload PDFs| UP["/api/upload"]
  U -->|ask question| CH["/api/chat"]
  UP --> PDF[pdf-parse<br/>page by page]
  PDF --> CK[Chunker<br/>page-aware]
  CK --> EM[OpenAI embeddings]
  EM --> VS[(Vector store<br/>dense + BM25)]
  CH --> EMQ[Embed question]
  EMQ --> HS[Hybrid search<br/>RRF fusion]
  VS --> HS
  HS --> RR[Reranker]
  RR --> LLM[OpenAI chat]
  LLM -->|NDJSON: citations + tokens| U
  classDef ext fill:#a78bfa,stroke:#a78bfa,color:#fff
  class EM,EMQ,LLM ext
Single-process, single-machine. The vector store is one file behind a small interface, so swapping the in-memory index for pgvector, Supabase Vector, Pinecone, or Qdrant is a contained change.
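To make "a small interface" concrete, here is a minimal sketch of what such a boundary could look like, with an in-memory cosine implementation behind it. The names VectorStore, StoredChunk, InMemoryStore, and searchDense are assumptions for illustration, not the actual exports of lib/vector-store.ts.

```typescript
// Hypothetical shapes; the real lib/vector-store.ts may differ.
interface StoredChunk {
  id: string;
  docId: string;
  source: string;
  page: number;
  content: string;
  embedding: number[];
}

// Methods return Promises so a database-backed store can satisfy the same contract.
interface VectorStore {
  add(chunks: StoredChunk[]): Promise<void>;
  searchDense(query: number[], k: number, docIds?: string[]): Promise<StoredChunk[]>;
  deleteDocument(docId: string): Promise<void>;
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class InMemoryStore implements VectorStore {
  private chunks: StoredChunk[] = [];

  async add(chunks: StoredChunk[]): Promise<void> {
    this.chunks.push(...chunks);
  }

  // Brute-force scan: score every in-scope chunk, keep the k best.
  async searchDense(query: number[], k: number, docIds?: string[]): Promise<StoredChunk[]> {
    const scope = docIds ? new Set(docIds) : null;
    return this.chunks
      .filter((c) => scope === null || scope.has(c.docId))
      .map((c) => ({ chunk: c, score: cosine(query, c.embedding) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k)
      .map((x) => x.chunk);
  }

  async deleteDocument(docId: string): Promise<void> {
    this.chunks = this.chunks.filter((c) => c.docId !== docId);
  }
}
```

With a boundary like this, a Postgres-backed class would implement the same three methods and the callers would not need to change.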
Quick start
git clone https://github.com/sarmakska/rag-over-pdf.git
cd rag-over-pdf
pnpm install
cp .env.example .env.local # then add your OPENAI_API_KEY
pnpm dev
Open http://localhost:3000, upload one or more PDFs, tick which documents to search, and ask a question. Answers stream in with a numbered source list that links each claim to a document and page.
Features
- Hybrid search. Dense embeddings plus a BM25 lexical index, fused with Reciprocal Rank Fusion. Dense handles meaning and paraphrase, BM25 handles exact terms such as error codes and identifiers. Fusion is more robust than either alone.
- Reranker step. Hybrid search casts a wide net for recall, then a reranker reorders the candidates for precision. The default LLM reranker scores each candidate against the question; if it fails it falls back to a deterministic lexical reranker.
- Citation streaming. The chat response is a newline-delimited JSON stream. Citations arrive first so the UI shows sources immediately, then answer tokens stream, then a done event closes it.
- Multi-document chat. Index many PDFs at once. Ask across all of them, or scope a question to a subset.
- Page-level highlights. Text is extracted page by page, every chunk knows its page, and each citation carries the source filename, page number, and a snippet.
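The fusion step in the hybrid search bullet can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, assuming each retriever returns an ordered list of chunk ids; the function name rrfFuse and the constant k = 60 (a commonly used default) are assumptions, not values taken from this repo.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)),
// where rank starts at 1. Only ranks are used, so the dense and BM25 scores
// never need to be put on a common scale.
function rrfFuse(rankings: string[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Hypothetical chunk ids: "c1" is second in dense and first in lexical.
const dense = ["c3", "c1", "c7"];
const lexical = ["c1", "c9", "c3"];
const fused = rrfFuse([dense, lexical]);
console.log(fused.map((f) => f.id)); // [ 'c1', 'c3', 'c9', 'c7' ]
```

Chunks that appear in both lists accumulate two terms, which is why c1 and c3 outrank the chunks found by only one retriever.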
What is in the box
| Path | What it does |
|---|---|
| app/api/upload | Parses a PDF page by page, chunks the text with page tracking, embeds each chunk, and adds it to the store as its own document. Also lists and deletes documents. |
| app/api/chat | Embeds the question, runs hybrid retrieval, reranks, and streams citations then answer tokens as NDJSON. Accepts a docIds array to scope the question. |
| lib/pdf.ts | Page-aware PDF text extraction via the pdf-parse pagerender hook. |
| lib/chunker.ts | Fixed-size character chunker with overlap and page tracking. |
| lib/bm25.ts | A compact, dependency-free BM25 sparse index. |
| lib/vector-store.ts | In-memory cosine store plus BM25, hybrid search with RRF, and multi-document support. This is the one file you replace to move to a real database. |
| lib/reranker.ts | The LLM reranker and its deterministic lexical fallback. |
| lib/retrieval.ts | The orchestrator that wires hybrid search to the reranker. |
| lib/citations.ts | The NDJSON streaming protocol shared by server and client. |
| lib/openai.ts | A lazily constructed OpenAI client plus the embedding helper, so next build runs without an API key. |
| app/page.tsx | The upload, document list, chat, and citation UI built with Tailwind. |
| tests/ | Unit and end-to-end tests with committed fixture PDFs. No network, no key required. |
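As an illustration of the NDJSON protocol that the chat route and lib/citations.ts share, here is a minimal sketch of an encoder and an incremental decoder. The event shapes (type, citations, value) and the function names are assumptions for illustration; only the ordering of citations first, then tokens, then a done event comes from the description above.

```typescript
// Hypothetical event shapes; the real lib/citations.ts may use other field names.
type Citation = { index: number; source: string; page: number; snippet: string };
type StreamEvent =
  | { type: "citations"; citations: Citation[] }
  | { type: "token"; value: string }
  | { type: "done" };

// One JSON object per line. JSON.stringify escapes newlines inside strings,
// so "\n" is a safe record separator.
function encodeEvent(event: StreamEvent): string {
  return JSON.stringify(event) + "\n";
}

// Incremental decoder: a network chunk can end in the middle of a line,
// so the trailing partial line is buffered until its newline arrives.
function createDecoder(onEvent: (event: StreamEvent) => void): (chunk: string) => void {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (line.trim() !== "") onEvent(JSON.parse(line) as StreamEvent);
    }
  };
}
```

On the client, each decoded text chunk from the response body would be passed to the function returned by createDecoder, and the UI can render sources as soon as the citations event arrives.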
When to use this / when not to
Use this when you want to learn how production-grade retrieval-augmented generation actually works without a framework hiding the moving parts, when you are prototyping a documentation chatbot grounded in your own PDFs and need real citations, or when you need a clean starting point you can extend into a production system.
Do not use this as-is for a high-traffic production deployment. The in-memory store clears on restart and holds its documents in a single process, and chunking is fixed-size rather than structure-aware. Swap the vector store for pgvector or a managed index and harden the upload path before you put real load on it.
Documentation
Full architecture notes, a retrieval deep-dive, tuning guides, a pgvector migration path, and deployment recipes live in the project wiki.
Tech stack
| Layer | Choice | Why |
|---|---|---|
| Framework | Next.js 14 App Router | Streaming, server routes, edge-ready |
| Language | TypeScript | Catch errors before runtime |
| PDF parsing | pdf-parse | Pure JS, no native deps, page-by-page extraction |
| Embeddings | OpenAI text-embedding-3-small | Cheap, 1536 dims, fast |
| Sparse retrieval | BM25 (in-repo) | Exact-term recall, no dependency |
| Fusion | Reciprocal Rank Fusion | No score normalisation needed |
| Reranking | LLM reranker, lexical fallback | Precision on the recall pool |
| Generation | OpenAI gpt-4o-mini (streaming) | Cheap, fast, follows instructions |
| Styling | Tailwind CSS | Get on with it |
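The reranking row pairs an LLM scorer with a lexical fallback. A minimal sketch of that pattern follows, assuming the LLM call is injected as a function that returns one score per candidate. The names rerank, lexicalScore, and Scorer, and the term-overlap scoring itself, are assumptions for illustration rather than the contents of lib/reranker.ts.

```typescript
interface Candidate {
  id: string;
  content: string;
}

// Hypothetical signature for the LLM-backed scorer: one score per candidate.
type Scorer = (question: string, candidates: Candidate[]) => Promise<number[]>;

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Deterministic fallback: fraction of distinct question terms found in the candidate.
function lexicalScore(question: string, content: string): number {
  const questionTerms = new Set(tokenize(question));
  if (questionTerms.size === 0) return 0;
  const contentTerms = new Set(tokenize(content));
  let hits = 0;
  questionTerms.forEach((term) => {
    if (contentTerms.has(term)) hits++;
  });
  return hits / questionTerms.size;
}

async function rerank(
  question: string,
  candidates: Candidate[],
  topK: number,
  llmScorer: Scorer,
): Promise<Candidate[]> {
  let scores: number[];
  try {
    scores = await llmScorer(question, candidates);
    if (scores.length !== candidates.length) throw new Error("score count mismatch");
  } catch {
    // Any failure (network, rate limit, malformed output) drops to the lexical path.
    scores = candidates.map((c) => lexicalScore(question, c.content));
  }
  return candidates
    .map((candidate, i) => ({ candidate, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((x) => x.candidate);
}
```

Injecting the scorer keeps the fallback testable without a network call: passing a scorer that always rejects exercises the lexical path.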
Configuration
| Env var | Required | Default | Purpose |
|---|---|---|---|
| OPENAI_API_KEY | yes | none | Used for embeddings, reranking, and generation |
| EMBEDDING_MODEL | no | text-embedding-3-small | Override the embedding model |
| CHAT_MODEL | no | gpt-4o-mini | Override the generation and reranking model |
| CHUNK_SIZE | no | 1000 | Characters per chunk |
| CHUNK_OVERLAP | no | 200 | Overlap between chunks |
| TOP_K | no | 5 | Chunks kept after reranking |
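To show how CHUNK_SIZE and CHUNK_OVERLAP interact, here is a minimal sketch of a fixed-size character chunker with page tracking. It assumes chunks never cross a page boundary, which is one way to keep a single page number per chunk; the names chunkPages, Page, and Chunk are hypothetical and the real lib/chunker.ts may behave differently.

```typescript
interface Page {
  page: number;
  text: string;
}

interface Chunk {
  page: number;
  content: string;
}

// Slides a window of chunkSize characters over each page, advancing by
// chunkSize - overlap so consecutive chunks share `overlap` characters.
function chunkPages(pages: Page[], chunkSize = 1000, overlap = 200): Chunk[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap;
  const chunks: Chunk[] = [];
  for (const { page, text } of pages) {
    for (let start = 0; start < text.length; start += step) {
      chunks.push({ page, content: text.slice(start, start + chunkSize) });
      // Stop once this window has reached the end of the page, so no
      // trailing chunk consisting only of overlap is emitted.
      if (start + chunkSize >= text.length) break;
    }
  }
  return chunks;
}
```

With the defaults, a 2,400-character page yields windows starting at 0, 800, and 1600, so three chunks, each tagged with that page number.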
Swap to pgvector (when you need persistence)
The in-memory store lives in lib/vector-store.ts. Replace its methods with Postgres calls and you are on a real DB. The retrieval pipeline depends only on the interface.
create extension if not exists vector;
create table chunks (
  id text primary key,
  doc_id text not null,
  source text,
  page int,
  content text not null,
  embedding vector(1536) not null,
  created_at timestamptz default now()
);
create index on chunks using hnsw (embedding vector_cosine_ops);
create index on chunks (doc_id);
See Swap-to-pgvector for the full migration, including how to keep BM25 in Postgres with tsvector.
Deploy to Vercel
Vercel prompts for OPENAI_API_KEY. That is the only configuration needed.
Limitations (honest list)
- In-memory store. Restarting the server clears the index. Fine for demos; swap to pgvector for anything real.
- BM25 reindex on every upload. The in-memory BM25 index rebuilds its corpus statistics whenever documents change. That is trivial at starter scale; at large scale, move term statistics into Postgres.
- Fixed-size chunking. Production RAG benefits from semantic or structure-aware chunking. Out of scope for a starter.
- Cost. Each question costs one embedding call, one reranking call, and one generation call. With the small models that is well under a penny per question, but it is not free. Skip the LLM reranker if you want to drop one call.
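The BM25 rebuild cost noted in the list above is easier to see in code. Below is a minimal sketch of a BM25 index that recomputes its document frequencies and average length on every add. It uses the standard Okapi BM25 formula with k1 = 1.5 and b = 0.75 as assumed parameters; the class name Bm25Index and its methods are hypothetical and not the API of lib/bm25.ts.

```typescript
function toTerms(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

class Bm25Index {
  private docs: { id: string; terms: string[] }[] = [];
  private docFreq = new Map<string, number>();
  private avgLen = 0;
  private k1 = 1.5; // term-frequency saturation (assumed value)
  private b = 0.75; // length normalisation strength (assumed value)

  add(id: string, text: string): void {
    this.docs.push({ id, terms: toTerms(text) });
    this.rebuild(); // full recomputation: O(total terms) on every upload
  }

  private rebuild(): void {
    this.docFreq.clear();
    let totalLen = 0;
    for (const doc of this.docs) {
      totalLen += doc.terms.length;
      new Set(doc.terms).forEach((term) => {
        this.docFreq.set(term, (this.docFreq.get(term) ?? 0) + 1);
      });
    }
    this.avgLen = this.docs.length > 0 ? totalLen / this.docs.length : 0;
  }

  search(query: string, k: number): { id: string; score: number }[] {
    const queryTerms = Array.from(new Set(toTerms(query)));
    const n = this.docs.length;
    return this.docs
      .map((doc) => {
        let score = 0;
        for (const term of queryTerms) {
          const tf = doc.terms.filter((t) => t === term).length;
          if (tf === 0) continue;
          const df = this.docFreq.get(term) ?? 0;
          const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
          const lengthNorm = 1 - this.b + (this.b * doc.terms.length) / this.avgLen;
          score += (idf * tf * (this.k1 + 1)) / (tf + this.k1 * lengthNorm);
        }
        return { id: doc.id, score };
      })
      .filter((r) => r.score > 0)
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
}
```

Because the statistics depend on the whole corpus, every upload touches every stored term. Keeping term statistics in Postgres would avoid recomputing them in application memory.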
Roadmap
- [x] PDF upload and page-by-page parsing
- [x] In-memory vector store
- [x] Streaming answers
- [x] Hybrid search (BM25 + embeddings)
- [x] Reranker step
- [x] Citation streaming
- [x] Multi-document chat
- [x] Page-level highlights
- [ ] pgvector adapter as a drop-in
- [ ] Local embedding option (sentence-transformers via Ollama)
- [ ] Semantic chunking
PRs welcome.
Related work
- SarmaLink-AI: multi-provider AI backend with automatic failover
- StaffPortal: open-source staff management platform
License
MIT. Use it however you want. Attribution appreciated, not required.
More open source by Sarma
Part of a portfolio of twelve production-shaped open-source repositories built and maintained by Sarma.
| Repository | What it is |
|---|---|
| Sarmalink-ai | Multi-provider OpenAI-compatible AI gateway with 14-engine failover and intent-based plugin auto-routing |
| agent-orchestrator | Durable multi-agent workflows in TypeScript with deterministic replay and Inspector UI |
| voice-agent-starter | Sub-second full-duplex voice agent loop. WebRTC, mediasoup, pluggable STT / LLM / TTS |
| ai-eval-runner | Evals as code. Python, DuckDB, FastAPI viewer, regression mode for CI |
| mcp-server-toolkit | Production Model Context Protocol server starter (Python / FastAPI) |
| local-llm-router | OpenAI-compatible proxy that routes to Ollama or cloud providers based on policy |
| rag-over-pdf | Hybrid-search RAG starter for PDF corpora with reranking and citations |
| receipt-scanner | Vision OCR for receipts with Zod-validated JSON output |
| webhook-to-email | Webhook receiver that forwards events to email via Resend |
| k8s-ops-toolkit | Helm chart for shipping Next.js to Kubernetes with full observability stack |
| terraform-stack | Vercel + Supabase + Cloudflare + DigitalOcean modules in one Terraform repo |
| staff-portal | Open-source HR / ops portal: leave, attendance, expenses, kiosk mode |
Engineering essays at sarmalinux.com/blog and all projects at sarmalinux.com/open-source