Chunking

March 6, 2026

🏥 Quick Return to Emergency Room

You are at a specialist desk.
For full triage and doctors on duty, return here:

Think of this page as a sub-room.
If you want full consultation and prescriptions, go back to the Emergency Room lobby.

A compact hub to stabilize document chunking across formats, pipelines, and retrieval systems.
This folder routes chunk-related bugs to structural fixes and provides checklists, schema, and live recipes.
No infra change required.


Orientation: what each page does

| Page | What it solves | Typical symptom |
|---|---|---|
| Chunk ID Schema | Unique ID + schema for each chunk | Duplicate or drifting chunks across runs |
| Chunking Checklist | Minimal audit list for validity | Chunks too long, too short, or incomplete |
| Code / Tables / Blocks | Preserve structure for code, tables, blocks | Retrieval drops formatting or logic |
| Section Detection | Detect paragraph and section anchors | Anchors missing, snippets cut mid-thought |
| Title Hierarchy | Maintain document heading hierarchy | Only partial or meaningless sub-sections retrieved |
| PDF Layouts & OCR | Repair PDF/OCR-specific chunking | Citations collapse after parsing |
| Reindex & Migration | Safe chunk migration during reindex | Index rebuilt but old refs mismatch |
| Eval RAG Precision & Recall | Deterministic evaluation recipes | “Better” chunking cannot be proven |
| Live Monitoring (RAG) | Online health checks for chunking | Sudden drift or collapse after deploy |

When to use this folder

  • Your chunks look fine by eye but retrieval skips important sections.
  • PDF / OCR parsing collapses headers, math, or tables.
  • Hybrid retrievers underperform due to inconsistent chunk boundaries.
  • Reindexing breaks old citations.
  • Context flips between runs with the same corpus.

Acceptance targets

  • Chunk boundaries align with semantic windows
  • ΔS(question, retrieved) ≤ 0.45
  • Coverage of target section ≥ 0.70
  • λ_observe convergent across 3 paraphrases and 2 seeds
  • Traceability contract fields always present: {snippet_id, section_id, source_url, offsets, tokens}
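
The targets above can be turned into an automated gate. The following is a minimal sketch, not the project's own implementation: it assumes ΔS is computed as 1 minus cosine similarity between two embedding vectors and that coverage is the fraction of the target section's character range overlapped by retrieved chunk offsets. Neither definition is given on this page, and the function names (`delta_s`, `coverage`, `missing_fields`, `meets_targets`) are hypothetical.

```python
import math

# Traceability contract fields listed in the acceptance targets.
REQUIRED_FIELDS = {"snippet_id", "section_id", "source_url", "offsets", "tokens"}


def delta_s(q_vec, r_vec):
    """ASSUMED definition: 1 - cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(q_vec, r_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in r_vec))
    if norm == 0:
        raise ValueError("zero-length embedding vector")
    return 1.0 - dot / norm


def coverage(section_span, chunk_spans):
    """ASSUMED definition: share of the section's [start, end) character
    range that is overlapped by at least one retrieved chunk span."""
    start, end = section_span
    covered = set()
    for s, e in chunk_spans:
        covered.update(range(max(s, start), min(e, end)))
    return len(covered) / (end - start)


def missing_fields(chunk):
    """Return the contract fields absent from a chunk record (a dict)."""
    return sorted(REQUIRED_FIELDS - chunk.keys())


def meets_targets(ds, cov, chunks):
    """Apply the thresholds from the list above: ΔS ≤ 0.45, coverage ≥ 0.70,
    and every chunk carries all traceability fields."""
    return ds <= 0.45 and cov >= 0.70 and all(not missing_fields(c) for c in chunks)
```

The λ_observe target is left out here because the page does not say how λ is measured; a gate like this would need that definition before it could cover all five targets.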

60-second fix checklist

  1. Check chunk IDs
    Apply chunk_id_schema. Ensure IDs are unique and stable across reindexing.

  2. Audit with checklist
    Run the chunking-checklist before ingest.

  3. Preserve structure
    Use code_tables_blocks for code, tables, blocks.

  4. Validate anchors
    Confirm section and title detection. Apply title_hierarchy.

  5. Reindex safely
    Use reindex_migration with hash/version lock.

  6. Monitor live
    Apply live_monitoring_rag to catch collapse early.
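
Steps 1 and 5 both rely on chunk IDs that do not change when the same content is re-chunked or reindexed. One common way to get that property is to derive the ID from a content hash plus a schema version. The sketch below illustrates the idea only; it is not the schema defined on the Chunk ID Schema page, and `SCHEMA_VERSION`, `normalize`, `chunk_id`, and `find_duplicates` are hypothetical names.

```python
import hashlib
import re

SCHEMA_VERSION = "v1"  # hypothetical version tag; bump it to force new IDs


def normalize(text):
    """Collapse whitespace so cosmetic parser differences do not change the ID."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_id(source_url, section_id, text, version=SCHEMA_VERSION):
    """Derive a deterministic ID from version, source, section, and content."""
    # The unit separator keeps field boundaries unambiguous inside the hash input.
    payload = "\x1f".join([version, source_url, section_id, normalize(text)])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{version}-{digest}"


def find_duplicates(ids):
    """Return IDs that occur more than once, e.g. before ingest."""
    seen, dupes = set(), set()
    for i in ids:
        if i in seen:
            dupes.add(i)
        seen.add(i)
    return sorted(dupes)
```

Because the version is part of the hash input, old and new IDs never collide after a schema change, which makes it possible to keep a mapping from old references to new ones during migration.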


Minimal probe pack

Context: I loaded TXT OS and the WFGY pages.

Task:
- Given doc corpus D, log ΔS(question, retrieved) and λ across 3 paraphrases.
- Validate chunk IDs and section anchors.
- If ΔS ≥ 0.60 or λ flips, propose the smallest structural change:
  chunk schema, checklist, or reindex.
- Verify coverage ≥ 0.70 after fix.

Return JSON:
{ "citations": [...], "ΔS": 0.xx, "λ_state": "<>", "coverage": 0.xx, "next_fix": "..." }
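
The JSON returned by the probe can be checked mechanically before anyone acts on it. Below is a minimal sketch that validates the keys shown above and applies the two thresholds from the task. It assumes that "λ flips" means the `λ_state` value differs between paraphrase runs, which this page does not spell out, and the function names are hypothetical.

```python
import json

# Keys taken from the "Return JSON" template in the probe pack.
REQUIRED_KEYS = {"citations", "ΔS", "λ_state", "coverage", "next_fix"}


def parse_probe(raw):
    """Parse one probe response and fail loudly if a key is missing."""
    result = json.loads(raw)
    missing = REQUIRED_KEYS - result.keys()
    if missing:
        raise ValueError(f"probe result missing keys: {sorted(missing)}")
    return result


def needs_fix(results):
    """results: one parsed probe result per paraphrase of the same question.

    A fix is proposed if any run has ΔS ≥ 0.60, or if λ_state is not the
    same across runs (ASSUMED reading of "λ flips").
    """
    high_ds = any(r["ΔS"] >= 0.60 for r in results)
    lam_flips = len({r["λ_state"] for r in results}) > 1
    return high_ds or lam_flips


def verified(result):
    """Post-fix check from the task: coverage must reach 0.70."""
    return result["coverage"] >= 0.70
```

A typical loop would collect three parsed results (one per paraphrase), call `needs_fix` on them, apply the smallest structural change, then rerun the probe and confirm `verified` on each new result.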

🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
|---|---|---|
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world”; OS boots instantly |

Explore More

| Layer | Page | What it’s for |
|---|---|---|
| ⭐ Proof | WFGY Recognition Map | External citations, integrations, and ecosystem proof |
| ⚙️ Engine | WFGY 1.0 | Original PDF tension engine and early logic sketch (legacy reference) |
| ⚙️ Engine | WFGY 2.0 | Production tension kernel for RAG and agent systems |
| ⚙️ Engine | WFGY 3.0 | TXT based Singularity tension engine (131 S class set) |
| 🗺️ Map | Problem Map 1.0 | Flagship 16 problem RAG failure taxonomy and fix map |
| 🗺️ Map | Problem Map 2.0 | Global Debug Card for RAG and agent pipeline diagnosis |
| 🗺️ Map | Problem Map 3.0 | Global AI troubleshooting atlas and failure pattern map |
| 🧰 App | TXT OS | .txt semantic OS with fast bootstrap |
| 🧰 App | Blah Blah Blah | Abstract and paradox Q&A built on TXT OS |
| 🧰 App | Blur Blur Blur | Text to image generation with semantic control |
| 🏡 Onboarding | Starter Village | Guided entry point for new users |

If this repository helped, starring it improves discovery so more builders can find the docs and tools.