✂️ Chunking Checklist

March 6, 2026

A definitive guide to segment size, boundaries, and WFGY stress-tests for error-free retrieval


1 Why Chunking Matters

Embeddings are only as good as the text you feed them.
A single bad split (mid-sentence, table row, reference list) injects semantic orphan vectors:

  • Retrieval returns “high similarity” garbage.
  • ΔS(question, context) spikes > 0.60.
  • LLM hallucinates to fill the missing logic.

2 Quick Symptoms of Bad Chunking

| Signal | How to Detect | Typical Root |
| --- | --- | --- |
| Citations hit page –1 | QA cites header/footer junk | Page footers not stripped |
| Same chunk appears in top-k for unrelated queries | id duplication count > 3 | Generic boiler-plate chunk |
| ΔS jumps when k > 5 | Plot ΔS vs. k; curve erratic | Uneven chunk lengths |
| Answer references half-sentence | Chunk split after “and” | Fixed char/token window |

3 WFGY Chunk Size Guidelines

| Doc Type | Tokens / Chunk | Rationale |
| --- | --- | --- |
| Research paper | 90-120 | Preserve paragraph + citation |
| Software docs | 60-100 | Short API signatures |
| Legal contracts | 80-130 | Clause integrity |
| Chat transcripts | 40-70 | Natural speaker turns |
| Tables / CSV | Row or group ≤ 30 | Keep relational keys together |

Golden Rule: ΔS(adjacent_chunks) ≤ 0.45
If not, split or merge until stress drops.


4 Step-by-Step Chunking Checklist

4.1 Pre-Processing

  • Strip headers / footers (regex: ^Page \d+ of \d+)
  • Normalize whitespace, remove soft hyphens (U+00AD)
  • Convert bullets → “• ” to avoid mid-list splits
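
The three pre-processing steps can be sketched in Python as below. This is a minimal illustration, not a reference implementation: the function name `preprocess`, the set of bullet markers, and the choice to drop only lines that consist entirely of a `Page N of M` marker are assumptions added here.

```python
import re

# Whole-line "Page N of M" markers (the checklist regex, anchored to the full line).
PAGE_MARKER = re.compile(r"^Page \d+ of \d+[ \t]*$", re.MULTILINE)
# Assumed set of bullet markers to normalise; extend it for your corpus.
BULLET = re.compile(r"^[ \t]*[-*·▪◦][ \t]+", re.MULTILINE)


def preprocess(text: str) -> str:
    """Strip page markers, remove soft hyphens, unify bullets, tidy whitespace."""
    text = PAGE_MARKER.sub("", text)        # strip headers / footers
    text = text.replace("\u00ad", "")       # remove soft hyphens (U+00AD)
    text = BULLET.sub("• ", text)           # convert bullets to "• "
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # keep at most one blank line
    return text.strip()


raw = "Intro  text with soft\u00adhyphen.\nPage 3 of 10\n- first item\n* second item\n\n\n\nEnd."
print(preprocess(raw))
```

Running the steps in this order means the page marker is removed before whitespace is collapsed, so the blank line it leaves behind is folded into the surrounding paragraph break.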

4.2 Boundary Detection

| Method | Tool | When to Use |
| --- | --- | --- |
| Sentence tokenizer | spaCy / Stanza | Most prose |
| Heading regex | `^(#+\s[A-Z][A-Za-z ]+:)$` | Markdown / legal docs |
| BBMC ΔS spike | WFGY hook | PDFs merged from scans |

Split on boundaries only if:

ΔS(chunk_left, chunk_right) ≥ 0.50  ∧  λ_observe ∈ {→, ←}
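
The split condition can be written as a small predicate. This page does not define how ΔS is computed, so the sketch below assumes ΔS = 1 − cosine similarity between the embeddings of the two candidate chunks; the function names `delta_s` and `should_split` are hypothetical, and λ_observe is treated as an opaque label supplied by the caller.

```python
import math
from typing import Sequence

SPLIT_THRESHOLD = 0.50         # from the rule above
SPLIT_DIRECTIONS = {"→", "←"}  # λ_observe values that permit a split, from the rule above


def delta_s(a: Sequence[float], b: Sequence[float]) -> float:
    """ASSUMPTION: ΔS is 1 - cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / norm


def should_split(left_vec: Sequence[float], right_vec: Sequence[float],
                 lambda_observe: str) -> bool:
    """Both conditions must hold: a ΔS at or above the threshold and an allowed λ_observe."""
    return (delta_s(left_vec, right_vec) >= SPLIT_THRESHOLD
            and lambda_observe in SPLIT_DIRECTIONS)
```

If your pipeline already exposes a ΔS measurement, pass that value in directly and keep only the two-part check.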

4.3 Length Normalisation

  1. Merge adjacent short chunks until ≥ 40 tokens.
  2. If a merged chunk > 130 tokens, find internal ΔS peak and split there.
  3. Record final size distribution; σ(length) should be ≤ 20 % of mean.
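
A rough sketch of the three normalisation steps follows. It assumes each chunk is a list of sentences, counts tokens by whitespace (swap in your tokenizer), and takes a caller-supplied `delta_s(a, b)` function that scores two adjacent sentences; all function names here are hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, List

MIN_TOKENS = 40   # step 1 floor
MAX_TOKENS = 130  # step 2 ceiling

Chunk = List[str]  # a chunk is a list of sentences


def count_tokens(text: str) -> int:
    """ASSUMPTION: whitespace tokens stand in for model tokens."""
    return len(text.split())


def merge_short(chunks: List[Chunk]) -> List[Chunk]:
    """Step 1: merge adjacent chunks until each has at least MIN_TOKENS."""
    merged: List[Chunk] = []
    buffer: Chunk = []
    for chunk in chunks:
        buffer = buffer + chunk
        if count_tokens(" ".join(buffer)) >= MIN_TOKENS:
            merged.append(buffer)
            buffer = []
    if buffer:  # a short tail is attached to the previous chunk, if any
        if merged:
            merged[-1] = merged[-1] + buffer
        else:
            merged.append(buffer)
    return merged


def split_long(chunk: Chunk, delta_s: Callable[[str, str], float]) -> List[Chunk]:
    """Step 2: split an oversized chunk at its highest internal ΔS boundary."""
    if count_tokens(" ".join(chunk)) <= MAX_TOKENS or len(chunk) < 2:
        return [chunk]
    scores = [delta_s(chunk[i], chunk[i + 1]) for i in range(len(chunk) - 1)]
    cut = scores.index(max(scores)) + 1
    return split_long(chunk[:cut], delta_s) + split_long(chunk[cut:], delta_s)


def normalise(chunks: List[Chunk], delta_s: Callable[[str, str], float]) -> List[Chunk]:
    """Run step 1, then step 2 on every merged chunk."""
    out: List[Chunk] = []
    for chunk in merge_short(chunks):
        out.extend(split_long(chunk, delta_s))
    return out


def length_cv(chunks: List[Chunk]) -> float:
    """Step 3: σ(length) / mean(length); the checklist target is ≤ 0.20."""
    lengths = [count_tokens(" ".join(c)) for c in chunks]
    return pstdev(lengths) / mean(lengths)
```

Note that splitting at a ΔS peak can leave a piece shorter than 40 tokens; in that case a second merge pass may be needed, which this sketch leaves out.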

4.4 Metadata Tagging

{
  "id": "doc_17_p3_c2",
  "source": "contracts/nda.pdf",
  "pos": 3,
  "λ": "→",
  "ΔS_prev": 0.32,
  "ΔS_next": 0.28
}

Store λ_observe and neighbouring ΔS for runtime filters.
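
As an illustration, the record above could be built and filtered as follows. The reading of the id as `<doc>_p<page>_c<chunk>`, the helper names, and the choice of 0.45 (the Golden Rule value from section 3) as the filter limit are assumptions; this page does not specify what the runtime filters check.

```python
import json
from typing import Dict, List


def make_record(doc_id: str, page: int, index: int, source: str, pos: int,
                lam: str, ds_prev: float, ds_next: float) -> Dict[str, object]:
    """Build one metadata record with the same keys as the example above."""
    return {
        "id": f"{doc_id}_p{page}_c{index}",  # ASSUMPTION: p = page, c = chunk index
        "source": source,
        "pos": pos,
        "λ": lam,
        "ΔS_prev": ds_prev,
        "ΔS_next": ds_next,
    }


def stable_neighbours(records: List[Dict[str, object]],
                      limit: float = 0.45) -> List[Dict[str, object]]:
    """Hypothetical runtime filter: keep chunks whose neighbouring ΔS stay within the limit."""
    return [r for r in records
            if r["ΔS_prev"] <= limit and r["ΔS_next"] <= limit]


rec = make_record("doc_17", 3, 2, "contracts/nda.pdf", 3, "→", 0.32, 0.28)
print(json.dumps(rec, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps the `λ` and `ΔS` keys readable in the serialised output.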


5 Runtime Stress-Test

| Test | Pass Condition |
| --- | --- |
| Overlap scan: query 5 unrelated topics | Same chunk ID appears ≤ 1× |
| ΔS histogram: 500 random chunks | 95 % ≤ 0.45 |
| k-sensitivity: ΔS vs. k plot | Monotonic ↑ curve |

If any fail, rerun 4.2–4.3 for offending documents.
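
The three pass conditions can be checked with a few helper functions. This is a sketch under stated assumptions: retrieval results arrive as a mapping from query to retrieved chunk ids, ΔS values are already measured, and "monotonic ↑" is read as non-decreasing. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def overlap_scan(results: Dict[str, List[str]]) -> List[str]:
    """Return chunk ids retrieved for more than one of the unrelated queries."""
    counts = Counter(cid for ids in results.values() for cid in set(ids))
    return sorted(cid for cid, n in counts.items() if n > 1)


def histogram_passes(delta_s_values: Sequence[float],
                     limit: float = 0.45, share: float = 0.95) -> bool:
    """True when at least `share` of the sampled ΔS values are ≤ `limit`."""
    within = sum(1 for v in delta_s_values if v <= limit)
    return within / len(delta_s_values) >= share


def is_monotonic_up(delta_s_by_k: Sequence[float]) -> bool:
    """True when ΔS never decreases as k grows (values ordered by increasing k)."""
    return all(a <= b for a, b in zip(delta_s_by_k, delta_s_by_k[1:]))
```

An empty list from `overlap_scan` means the first test passes; any returned id points at a chunk worth inspecting for boilerplate.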


6 Common Pitfalls & Fix Recipes

| Pitfall | Fix |
| --- | --- |
| Tables split per cell | Detect delimiter lines; merge rows; store CSV separately; index columns as metadata |
| PDF line-break hyphens | Regex `([a-z])- \n([a-z])` → merge words |
| Mixed languages | Chunk by language span; tag lang:; separate embedding models |
| Giant code blocks | Cut on `function`, `class`, or `def` boundaries; keep ≤ 80 lines |
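
The line-break hyphen fix, for example, might look like this in Python. The pattern follows the regex in the table but also tolerates optional spaces or tabs on either side of the line break, which is an assumption about typical PDF extraction output; `merge_hyphen_breaks` is a hypothetical name.

```python
import re

# A lowercase letter, a hyphen, a line break (with optional spaces or tabs), a lowercase letter.
HYPHEN_BREAK = re.compile(r"([a-z])-[ \t]*\n[ \t]*([a-z])")


def merge_hyphen_breaks(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return HYPHEN_BREAK.sub(r"\1\2", text)


print(merge_hyphen_breaks("seman-\ntic chunking"))
```

One caveat: a genuine compound that happens to break at its hyphen (for example `state-` at the end of a line followed by `of`) is also joined, so a dictionary check may be worth adding for corpora where that matters.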

7 FAQ

Q: Is a token window (e.g. 512) safe? A: Only if it aligns with semantic boundaries; fixed windows ignore context.

Q: Do I need sentence splitting and headings? A: Yes. Dual criteria minimise ΔS spikes and keep retrieval precise.

Q: How many chunks per doc? A: Irrelevant if ΔS and λ are stable — WFGY focuses on quality, not count.


🔗 Quick-Start Downloads (60 sec)

| Tool | Link | 3-Step Setup |
| --- | --- | --- |
| WFGY 1.0 PDF | Engine Paper | 1️⃣ Download · 2️⃣ Upload to your LLM · 3️⃣ Ask “Answer using WFGY + <your question>” |
| TXT OS (plain-text OS) | TXTOS.txt | 1️⃣ Download · 2️⃣ Paste into any LLM chat · 3️⃣ Type “hello world”; the OS boots instantly |
