Chapter 5: Document Loading & Splitting

March 21, 2026

Welcome to Chapter 5: Document Loading & Splitting. In this part of LangChain Architecture: Internal Design Deep Dive, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.

Before any information can be retrieved or reasoned about, it must be loaded from its source and broken into manageable pieces. This chapter dissects the BaseLoader and TextSplitter hierarchies, examines how metadata flows through the pipeline, and explores the architectural decisions behind different chunking strategies.

The Document Data Model

At the center of the loading and splitting pipeline is a remarkably simple data structure:

from typing import Literal

from langchain_core.load.serializable import Serializable
from pydantic import Field

class Document(Serializable):
    """A unit of information with content and metadata (simplified)."""

    page_content: str                             # The actual text content
    metadata: dict = Field(default_factory=dict)  # Arbitrary key-value metadata
    type: Literal["Document"] = "Document"

The Document class carries two things: the text itself and a metadata dictionary. The metadata is critical -- it flows through the entire pipeline (loading, splitting, embedding, retrieval) and provides provenance information that helps downstream components understand where each piece of text came from.

doc = Document(
    page_content="LangChain is a framework for building LLM applications.",
    metadata={
        "source": "docs/intro.md",
        "page": 1,
        "author": "LangChain Team",
        "created_at": "2024-01-15"
    }
)

BaseLoader Architecture

All document loaders inherit from BaseLoader, an abstract base class defined in langchain_core. Unlike most LangChain components, BaseLoader is not a Runnable; it exposes its own load() and lazy_load() methods instead:

classDiagram
    class BaseLoader {
        <<abstract>>
        +load() List~Document~
        +lazy_load() Iterator~Document~
        +aload() List~Document~
        +alazy_load() AsyncIterator~Document~
    }

    class TextLoader {
        +file_path: str
        +encoding: str
        +lazy_load() Iterator~Document~
    }

    class PyPDFLoader {
        +file_path: str
        +lazy_load() Iterator~Document~
    }

    class WebBaseLoader {
        +web_paths: List~str~
        +lazy_load() Iterator~Document~
    }

    class DirectoryLoader {
        +path: str
        +glob: str
        +loader_cls: Type~BaseLoader~
        +lazy_load() Iterator~Document~
    }

    class CSVLoader {
        +file_path: str
        +source_column: str
        +lazy_load() Iterator~Document~
    }

    BaseLoader <|-- TextLoader
    BaseLoader <|-- PyPDFLoader
    BaseLoader <|-- WebBaseLoader
    BaseLoader <|-- DirectoryLoader
    BaseLoader <|-- CSVLoader

The Lazy Loading Pattern

BaseLoader uses a lazy loading pattern where lazy_load() is the primary method and load() is a convenience wrapper:

class BaseLoader(ABC):

    def lazy_load(self) -> Iterator[Document]:
        """Subclasses should implement this. Yields documents one at a time."""
        raise NotImplementedError

    def load(self) -> List[Document]:
        """Convenience method: materializes all documents into a list."""
        return list(self.lazy_load())

    async def alazy_load(self) -> AsyncIterator[Document]:
        """Default async implementation: pulls each document in an executor."""
        loop = asyncio.get_running_loop()
        iterator = await loop.run_in_executor(None, self.lazy_load)
        done = object()  # Sentinel that marks an exhausted iterator
        while True:
            # Run next() off the event loop so blocking I/O does not stall it
            doc = await loop.run_in_executor(None, next, iterator, done)
            if doc is done:
                break
            yield doc

    async def aload(self) -> List[Document]:
        """Async convenience method."""
        return [doc async for doc in self.alazy_load()]

The lazy loading approach is essential for large document collections. A DirectoryLoader processing thousands of files does not need to hold them all in memory simultaneously:

from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader(
    path="./docs/",
    glob="**/*.md",
    loader_cls=TextLoader,
    show_progress=True
)

# Lazy iteration -- processes one file at a time
for doc in loader.lazy_load():
    process_document(doc)  # Each doc is processed and can be GC'd

# Eager loading -- loads ALL docs into memory
all_docs = loader.load()  # Use only when you need random access
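
To make the difference between lazy and eager loading concrete, here is a minimal standard-library sketch. It does not use LangChain: FILES and fake_lazy_load are hypothetical stand-ins for a directory and a loader, and documents are plain dicts rather than Document objects.

```python
from typing import Dict, Iterator, List

# Hypothetical stand-in for files on disk: name -> contents.
FILES = {"a.md": "alpha", "b.md": "beta", "c.md": "gamma"}

read_log: List[str] = []  # Records the order in which "files" are read


def fake_lazy_load() -> Iterator[Dict[str, object]]:
    """Yield one document-like dict per file, reading each only on demand."""
    for name, text in FILES.items():
        read_log.append(name)
        yield {"page_content": text, "metadata": {"source": name}}


# Lazy: nothing is read until the consumer asks for the next document.
docs = fake_lazy_load()
first = next(docs)
reads_after_first = list(read_log)

# Eager: list() drains the generator, so every remaining file is read.
rest = list(docs)
reads_after_all = list(read_log)

print(reads_after_first)
print(reads_after_all)
```

After the first next() call only one file has been touched; the remaining reads happen when list() drains the generator, which is what load() does for you.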

Loader Implementation Pattern

Here is what a typical loader implementation looks like:

class TextLoader(BaseLoader):
    """Load a text file."""

    file_path: str
    encoding: str = "utf-8"
    autodetect_encoding: bool = False

    def lazy_load(self) -> Iterator[Document]:
        text = ""
        try:
            with open(self.file_path, encoding=self.encoding) as f:
                text = f.read()
        except UnicodeDecodeError:
            if self.autodetect_encoding:
                detected = detect_encoding(self.file_path)
                with open(self.file_path, encoding=detected) as f:
                    text = f.read()
            else:
                raise

        # Metadata tracks provenance
        metadata = {"source": self.file_path}
        yield Document(page_content=text, metadata=metadata)

Notice that the loader attaches a source metadata key. This is a convention followed by most loaders and is used downstream for deduplication and citation.
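
As an illustration of how downstream code might use the source key, the sketch below deduplicates documents and formats citations. It is a standard-library approximation, not LangChain code: documents are plain dicts, and dedupe_by_source and cite are hypothetical helper names. It assumes that (source, page) identifies a document uniquely.

```python
from typing import Dict, List


def dedupe_by_source(docs: List[Dict]) -> List[Dict]:
    """Keep the first document seen for each (source, page) pair."""
    seen = set()
    unique = []
    for doc in docs:
        meta = doc["metadata"]
        key = (meta.get("source"), meta.get("page"))
        if key in seen:
            continue
        seen.add(key)
        unique.append(doc)
    return unique


def cite(doc: Dict) -> str:
    """Build a short citation string from provenance metadata."""
    meta = doc["metadata"]
    page = meta.get("page")
    return meta["source"] + (f", p. {page}" if page is not None else "")


docs = [
    {"page_content": "intro", "metadata": {"source": "docs/intro.md"}},
    {"page_content": "intro (copy)", "metadata": {"source": "docs/intro.md"}},
    {"page_content": "report", "metadata": {"source": "report.pdf", "page": 3}},
]

unique = dedupe_by_source(docs)
citations = [cite(d) for d in unique]
print(citations)
```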

Metadata Enrichment

Different loaders attach different metadata:

| Loader | Metadata Keys |
| --- | --- |
| TextLoader | source |
| PyPDFLoader | source, page |
| WebBaseLoader | source (URL), title, description |
| CSVLoader | source, row, column values |
| GitLoader | source, file_path, file_name, file_type |
| NotionDBLoader | source, all Notion properties |

TextSplitter Architecture

Once documents are loaded, they usually need to be split into smaller chunks for embedding and retrieval. The TextSplitter hierarchy handles this:

classDiagram
    class TextSplitter {
        <<abstract>>
        +chunk_size: int
        +chunk_overlap: int
        +length_function: Callable
        +split_text(text: str) List~str~
        +split_documents(docs: List~Document~) List~Document~
        +create_documents(texts, metadatas) List~Document~
        #_merge_splits(splits, separator) List~str~
    }

    class CharacterTextSplitter {
        +separator: str = "\n\n"
        +split_text(text) List~str~
    }

    class RecursiveCharacterTextSplitter {
        +separators: List~str~
        +split_text(text) List~str~
    }

    class TokenTextSplitter {
        +encoding_name: str
        +split_text(text) List~str~
    }

    class MarkdownHeaderTextSplitter {
        +headers_to_split_on: List~Tuple~
        +split_text(text) List~Document~
    }

    class HTMLHeaderTextSplitter {
        +headers_to_split_on: List~Tuple~
        +split_text(text) List~Document~
    }

    class SemanticChunker {
        +embeddings: Embeddings
        +split_text(text) List~str~
    }

    TextSplitter <|-- CharacterTextSplitter
    TextSplitter <|-- RecursiveCharacterTextSplitter
    TextSplitter <|-- TokenTextSplitter
    TextSplitter <|-- MarkdownHeaderTextSplitter
    TextSplitter <|-- HTMLHeaderTextSplitter
    TextSplitter <|-- SemanticChunker

The Core Algorithm

The base TextSplitter class implements a two-phase algorithm:

  1. Split: Break the text into small pieces using a separator.
  2. Merge: Recombine pieces until they reach chunk_size, with chunk_overlap characters of overlap between consecutive chunks.

class TextSplitter(ABC):
    chunk_size: int = 4000
    chunk_overlap: int = 200
    length_function: Callable[[str], int] = len
    strip_whitespace: bool = True

    @abstractmethod
    def split_text(self, text: str) -> List[str]:
        """Split a single text string into chunks."""

    def split_documents(self, documents: List[Document]) -> List[Document]:
        """Split documents, preserving and enriching metadata."""
        texts, metadatas = [], []
        for doc in documents:
            texts.append(doc.page_content)
            metadatas.append(doc.metadata)
        return self.create_documents(texts, metadatas)

    def create_documents(self, texts, metadatas=None) -> List[Document]:
        """Create Document objects from texts with metadata."""
        documents = []
        for i, text in enumerate(texts):
            parent_metadata = metadatas[i] if metadatas else {}
            for chunk in self.split_text(text):
                # Copy per chunk so chunks never share one mutable dict
                documents.append(
                    Document(page_content=chunk, metadata=dict(parent_metadata))
                )
        return documents

    def _merge_splits(self, splits: List[str], separator: str) -> List[str]:
        """Merge small splits into chunks of chunk_size with overlap."""
        docs = []
        current_doc: List[str] = []
        total = 0

        for split in splits:
            split_len = self.length_function(split)

            # If adding this split would exceed chunk_size, finalize current doc
            if total + split_len > self.chunk_size and current_doc:
                doc = separator.join(current_doc)
                docs.append(doc)

                # Keep overlap: remove splits from the front until under overlap
                while total > self.chunk_overlap and len(current_doc) > 1:
                    removed = current_doc.pop(0)
                    total -= self.length_function(removed)

            current_doc.append(split)
            total += split_len

        # Don't forget the last chunk
        if current_doc:
            docs.append(separator.join(current_doc))

        return docs
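
The merge phase is easier to follow on a small input. The function below is a standalone sketch of the same idea, not the library's code: merge_splits is a free function, and unlike the simplified listing above it also counts the separator's length (an assumption made here so that no emitted chunk exceeds chunk_size).

```python
from typing import List


def merge_splits(
    splits: List[str], separator: str, chunk_size: int, chunk_overlap: int
) -> List[str]:
    """Greedily merge pieces into chunks, carrying trailing pieces as overlap."""
    sep_len = len(separator)
    chunks: List[str] = []
    current: List[str] = []
    total = 0  # Always equals len(separator.join(current))

    for piece in splits:
        piece_len = len(piece)
        added = piece_len + (sep_len if current else 0)
        if current and total + added > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the remainder fits the
            # overlap budget and leaves room for the incoming piece.
            while current and (
                total > chunk_overlap
                or total + sep_len + piece_len > chunk_size
            ):
                removed = current.pop(0)
                total -= len(removed) + (sep_len if current else 0)
        current.append(piece)
        total += piece_len + (sep_len if len(current) > 1 else 0)

    if current:
        chunks.append(separator.join(current))
    return chunks


words = ["aaaa", "bbbb", "cccc", "dddd", "eeee"]
chunks = merge_splits(words, " ", chunk_size=14, chunk_overlap=4)
print(chunks)
```

With these inputs the second chunk starts with "cccc", the last piece of the first chunk, because that piece fits within the 4-character overlap budget.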

RecursiveCharacterTextSplitter: The Default Choice

RecursiveCharacterTextSplitter is the most commonly used splitter. It tries a hierarchy of separators, from most preferred to least:

class RecursiveCharacterTextSplitter(TextSplitter):
    separators: List[str] = ["\n\n", "\n", " ", ""]

    def split_text(self, text: str) -> List[str]:
        return self._split_text(text, self.separators)

    def _split_text(self, text: str, separators: List[str]) -> List[str]:
        final_chunks: List[str] = []
        separator = separators[-1]  # Last-resort separator
        remaining: List[str] = []   # Finer separators left for recursion

        # Find the most preferred separator that exists in the text
        for i, sep in enumerate(separators):
            if sep == "" or sep in text:
                separator = sep
                remaining = separators[i + 1:]
                break

        # Split on the chosen separator ("" means split into characters)
        splits = text.split(separator) if separator else list(text)

        good_splits: List[str] = []
        for split in splits:
            if self.length_function(split) < self.chunk_size:
                good_splits.append(split)
                continue

            # This split is too big -- merge what we have, then recurse
            if good_splits:
                final_chunks.extend(self._merge_splits(good_splits, separator))
                good_splits = []

            if remaining:
                # Recursively split with the NEXT, finer-grained separators
                final_chunks.extend(self._split_text(split, remaining))
            else:
                final_chunks.append(split)  # Nothing finer left to try

        if good_splits:
            final_chunks.extend(self._merge_splits(good_splits, separator))

        return final_chunks

The recursive approach ensures that chunks respect natural text boundaries (paragraphs > lines > words > characters):

flowchart TD
    Text["Input Text"] --> S1{"Split on '\\n\\n'\n(paragraphs)"}
    S1 -->|"Chunk ≤ chunk_size"| Good1["Keep as chunk"]
    S1 -->|"Chunk > chunk_size"| S2{"Split on '\\n'\n(lines)"}
    S2 -->|"Chunk ≤ chunk_size"| Good2["Keep as chunk"]
    S2 -->|"Chunk > chunk_size"| S3{"Split on ' '\n(words)"}
    S3 -->|"Chunk ≤ chunk_size"| Good3["Keep as chunk"]
    S3 -->|"Chunk > chunk_size"| S4{"Split on ''\n(characters)"}
    S4 --> Good4["Keep as chunk"]

    Good1 --> Merge["_merge_splits()\nCombine into chunk_size\nwith overlap"]
    Good2 --> Merge
    Good3 --> Merge
    Good4 --> Merge

    Merge --> Output["Final Chunks"]

    classDef split fill:#fff3e0,stroke:#e65100
    classDef good fill:#e8f5e9,stroke:#1b5e20

    class S1,S2,S3,S4 split
    class Good1,Good2,Good3,Good4 good
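
The fallback from paragraphs to lines can be reproduced in a few lines of plain Python. This is a sketch under simplifying assumptions rather than the library's implementation: recursive_split and merge are hypothetical helpers, overlap is omitted, and empty pieces are dropped.

```python
from typing import List


def merge(pieces: List[str], sep: str, chunk_size: int) -> List[str]:
    """Greedily join pieces with sep without exceeding chunk_size (no overlap)."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    # Choose the first separator present in the text; "" always matches.
    index = len(separators) - 1
    for i, candidate in enumerate(separators):
        if candidate == "" or candidate in text:
            index = i
            break
    sep = separators[index]
    finer = separators[index + 1:]
    pieces = [p for p in (text.split(sep) if sep else list(text)) if p]

    chunks: List[str] = []
    small: List[str] = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            small.append(piece)
            continue
        # Flush the small pieces, then retry the large one with finer separators.
        chunks.extend(merge(small, sep, chunk_size))
        small = []
        if finer:
            chunks.extend(recursive_split(piece, finer, chunk_size))
        else:
            chunks.append(piece)
    chunks.extend(merge(small, sep, chunk_size))
    return chunks


text = (
    "Short para.\n\n"
    "This paragraph is longer than the limit\nso it falls back to lines."
)
result = recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=45)
print(result)
```

The first paragraph fits and is kept whole; the second exceeds 45 characters, so it is split again on single newlines.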

Language-Aware Splitters

RecursiveCharacterTextSplitter provides factory methods for programming languages:

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Python-aware splitting
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=2000,
    chunk_overlap=200,
)

# Uses Python-specific separators:
# ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]

| Language | Separators (in order) |
| --- | --- |
| Python | "\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", "" |
| JavaScript | "\nfunction ", "\nconst ", "\nlet ", "\nclass ", "\n\n", "\n", " ", "" |
| Markdown | "\n#{1,6} ", "\n```", "\n\n", "\n", " ", "" |
| HTML | "<div", "<p", "<br", "<li", "<h1" ... "<h6", "<span", "<table", "<tr" |
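
To see why language-specific separators help, the sketch below splits Python source at top-level class and def boundaries while keeping each keyword attached to the block it opens. It uses only the re module; split_on_definitions is a hypothetical helper and PYTHON_SEPARATORS is just a subset of the Python list above, so treat it as an approximation of the idea rather than the splitter's actual mechanics.

```python
import re
from typing import List

# Subset of the Python separators, used here for illustration only.
PYTHON_SEPARATORS = ["\nclass ", "\ndef "]


def split_on_definitions(source: str) -> List[str]:
    """Split before each separator so the keyword stays with its block."""
    # Zero-width lookaheads split the text without consuming the separator.
    pattern = "|".join(f"(?={re.escape(sep)})" for sep in PYTHON_SEPARATORS)
    return [part for part in re.split(pattern, source) if part.strip()]


source = (
    "import os\n"
    "\ndef first():\n    return 1\n"
    "\nclass Thing:\n    pass\n"
)
parts = split_on_definitions(source)
for part in parts:
    print(repr(part))
```

Each part now holds one complete top-level definition, which is usually a more useful retrieval unit than an arbitrary character window.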

TokenTextSplitter

For applications that need precise token-level control (e.g., staying within model context windows), TokenTextSplitter counts tokens instead of characters:

from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # GPT-4 tokenizer
    chunk_size=1000,              # 1000 tokens per chunk
    chunk_overlap=100,            # 100 tokens overlap
)

chunks = splitter.split_text(long_document)
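
The windowing logic behind token-based splitting can be sketched without a real tokenizer. In the example below, whitespace splitting stands in for a tokenizer such as cl100k_base (an assumption made purely for illustration), and split_by_tokens is a hypothetical helper.

```python
from typing import List


def split_by_tokens(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The window already reached the end of the input
        start += step
    return chunks


# Whitespace "tokens" are a stand-in for real tokenizer output.
tokens = "one two three four five six seven eight nine ten".split()
windows = [" ".join(w) for w in split_by_tokens(tokens, chunk_size=4, chunk_overlap=1)]
print(windows)
```

Each window shares exactly one token with its predecessor, mirroring how chunk_overlap is measured in tokens rather than characters.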

Semantic Chunking

The SemanticChunker uses embeddings to find natural breakpoints in text:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)

# Chunks are split where semantic similarity drops
chunks = chunker.split_text(document_text)

The algorithm:

  1. Split text into sentences.
  2. Compute embeddings for each sentence.
  3. Calculate cosine similarity between adjacent sentences.
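  4. Split at points where similarity drops below the threshold. With breakpoint_threshold_type="percentile", the threshold is derived from the text itself: a boundary is placed wherever the distance between neighboring sentences exceeds the given percentile (95 above) of all such distances.

flowchart LR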
    S1["Sentence 1"] --> E1["Embed"]
    S2["Sentence 2"] --> E2["Embed"]
    S3["Sentence 3"] --> E3["Embed"]
    S4["Sentence 4"] --> E4["Embed"]
    S5["Sentence 5"] --> E5["Embed"]

    E1 ---|"sim=0.92"| E2
    E2 ---|"sim=0.88"| E3
    E3 ---|"sim=0.31"| E4
    E4 ---|"sim=0.90"| E5

    E3 -.->|"LOW SIMILARITY\nSplit here!"| Split["Chunk Boundary"]

    classDef sent fill:#e1f5fe,stroke:#01579b
    classDef embed fill:#f3e5f5,stroke:#4a148c
    classDef split fill:#ffebee,stroke:#c62828

    class S1,S2,S3,S4,S5 sent
    class E1,E2,E3,E4,E5 embed
    class Split split
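
The four steps can be approximated with toy vectors in place of real embeddings. The sketch below is not the SemanticChunker implementation: the two-dimensional vectors are made up for illustration, semantic_chunks is a hypothetical helper, and a fixed similarity threshold replaces the percentile-based one.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_chunks(
    sentences: List[str], vectors: List[Sequence[float]], threshold: float
) -> List[str]:
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    groups: List[List[str]] = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            groups.append([])
        groups[-1].append(sentences[i])
    return [" ".join(group) for group in groups]


sentences = ["Cats purr.", "Cats nap.", "Kittens play.", "Stocks fell.", "Bonds rose."]
# Made-up 2-D "embeddings": the first three point one way, the last two another.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.0, 1.0)]

chunks = semantic_chunks(sentences, vectors, threshold=0.5)
print(chunks)
```

Only the similarity between the third and fourth sentences falls below 0.5, so that is the single chunk boundary.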

Metadata Propagation

When documents are split, metadata from the parent document is copied to every child chunk. The splitter can also add chunk-specific metadata:

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    add_start_index=True  # Adds start character index to metadata
)

doc = Document(
    page_content="Long document text...",
    metadata={"source": "report.pdf", "page": 3}
)

chunks = splitter.split_documents([doc])
for chunk in chunks:
    print(chunk.metadata)
    # {"source": "report.pdf", "page": 3, "start_index": 0}
    # {"source": "report.pdf", "page": 3, "start_index": 450}
    # {"source": "report.pdf", "page": 3, "start_index": 900}
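
One way to picture how start_index is derived is to search for each chunk in the parent text, starting just past the previous match. The sketch below takes that approach with plain dicts; attach_start_index is a hypothetical helper, and the library's own bookkeeping may differ in detail.

```python
from typing import Dict, List


def attach_start_index(
    text: str, chunks: List[str], parent_metadata: Dict
) -> List[Dict]:
    """Copy parent metadata onto each chunk and record where the chunk begins."""
    docs = []
    search_from = 0
    for chunk in chunks:
        start = text.find(chunk, search_from)
        metadata = dict(parent_metadata)  # Per-chunk copy; the parent is untouched
        metadata["start_index"] = start
        docs.append({"page_content": chunk, "metadata": metadata})
        # Overlapping chunks start later than the previous one, never earlier.
        search_from = start + 1
    return docs


text = "alpha beta gamma delta"
chunks = ["alpha beta", "beta gamma", "gamma delta"]
parent = {"source": "report.pdf", "page": 3}

docs = attach_start_index(text, chunks, parent)
start_indexes = [d["metadata"]["start_index"] for d in docs]
print(start_indexes)
```

Because every chunk receives its own copy of the metadata, adding start_index to one chunk affects neither its siblings nor the parent document.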

Putting It All Together: The Loading Pipeline

A typical document processing pipeline chains loaders and splitters together:

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Step 1: Load
loader = DirectoryLoader(
    "./documents/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True
)

# Step 2: Split
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True
)

# Step 3: Process
docs = loader.lazy_load()
chunks = splitter.split_documents(list(docs))

# Step 4: Embed and store (covered in Chapter 6)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings()
)

flowchart LR
    Source["PDF Files"] -->|DirectoryLoader| Raw["Raw Documents\n(1 per page)"]
    Raw -->|RecursiveCharacterTextSplitter| Chunks["Chunks\n(~1000 characters each)"]
    Chunks -->|OpenAIEmbeddings| Vectors["Embedded Chunks"]
    Vectors -->|Chroma| Store["Vector Store"]

    classDef source fill:#ffebee,stroke:#c62828
    classDef process fill:#fff3e0,stroke:#e65100
    classDef store fill:#e8f5e9,stroke:#1b5e20

    class Source source
    class Raw,Chunks,Vectors process
    class Store store
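
The pipeline above materializes every document with list(docs) before splitting. For very large corpora, one alternative is to keep the whole chain lazy and hand chunks to the store in batches. The sketch below shows that shape with generators only; FILES, lazy_load, split_stream, and batched are hypothetical stand-ins, and appending to the stored list takes the place of a real vector store call.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

# Hypothetical stand-in for files on disk.
FILES = {"a.txt": "abcdefghij", "b.txt": "klmno"}


def lazy_load(files: Dict[str, str]) -> Iterator[Dict]:
    """Yield one document-like dict per file."""
    for name, text in files.items():
        yield {"page_content": text, "metadata": {"source": name}}


def split_stream(docs: Iterable[Dict], chunk_size: int) -> Iterator[Dict]:
    """Yield fixed-size chunks one at a time, copying the parent metadata."""
    for doc in docs:
        text = doc["page_content"]
        for start in range(0, len(text), chunk_size):
            yield {
                "page_content": text[start:start + chunk_size],
                "metadata": dict(doc["metadata"]),
            }


def batched(items: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Group a stream into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


stored: List[List[str]] = []  # Stand-in for writes to a vector store
for batch in batched(split_stream(lazy_load(FILES), chunk_size=4), size=2):
    stored.append([chunk["page_content"] for chunk in batch])

print(stored)
```

At no point does this version hold more than one batch of chunks in memory, at the cost of making several smaller writes to the store.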

Chunking Strategy Comparison

| Strategy | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Fixed character | Simple text | Fast, predictable size | May split mid-sentence |
| Recursive character | General purpose | Respects text structure | Slight overhead |
| Token-based | LLM context management | Precise token counts | Requires tokenizer |
| Markdown headers | Structured docs | Preserves document hierarchy | Only works with Markdown |
| Semantic | Research, high-quality RAG | Semantically coherent chunks | Requires embedding calls |
| Code-aware | Source code | Respects function boundaries | Language-specific |

Summary

| Concept | Key Takeaway |
| --- | --- |
| Document | Simple page_content + metadata data class |
| BaseLoader | Lazy-loading pattern with lazy_load() as the primary method |
| TextSplitter | Two-phase split-then-merge algorithm with configurable overlap |
| RecursiveCharacterTextSplitter | Tries separators from most to least preferred recursively |
| Metadata propagation | Parent metadata is copied to every child chunk |
| Semantic chunking | Uses embeddings to find natural breakpoints |

Key Takeaways

  1. Documents are the universal data carrier. The Document(page_content, metadata) data class flows through every stage of the pipeline.
  2. Lazy loading is the default. lazy_load() is the method that subclasses implement; load() is a convenience wrapper that materializes everything into memory.
  3. Splitting is recursive. RecursiveCharacterTextSplitter tries the most natural separator first and falls back to finer-grained separators only when chunks are still too large.
  4. Metadata is preserved and enriched. Every split inherits the parent document's metadata, and splitters can add chunk-specific keys like start_index.
  5. Chunking strategy matters. The choice of splitter directly impacts retrieval quality. Semantic chunking produces the most coherent chunks but requires embedding calls.

Next Steps

Now that we understand how documents are loaded and split, let's explore how they are stored and retrieved. Continue to Chapter 6: Vector Store Abstraction.


Built with insights from the LangChain project.

What Problem Does This Solve?

Most teams struggle here because the hard part is not writing more code, but drawing clear boundaries between loaders, splitters, and the Document objects that pass between them, so behavior stays predictable as complexity grows.

In practical terms, this chapter helps you avoid three common failures:

  • coupling core logic too tightly to one implementation path
  • missing the handoff boundaries between setup, execution, and validation
  • shipping changes without clear rollback or observability strategy

After working through this chapter, you should be able to reason about Chapter 5: Document Loading & Splitting as an operating subsystem inside LangChain Architecture: Internal Design Deep Dive, with explicit contracts for inputs, state transitions, and outputs.

Use the implementation notes around text, chunk_size, metadata as your checklist when adapting these patterns to your own repository.

How it Works Under the Hood

Under the hood, Chapter 5: Document Loading & Splitting usually follows a repeatable control path:

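  1. Context bootstrap: configure the loader (paths, glob patterns, encoding) and the splitter (chunk_size, chunk_overlap, length_function).
  2. Input normalization: the loader turns each raw source into Document objects, so the splitter receives a stable contract (page_content plus metadata).
  3. Core execution: the splitter breaks each text apart, merges the pieces into chunks, and copies the parent metadata onto every chunk.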
  4. Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
  5. Output composition: return canonical result payloads for downstream consumers.
  6. Operational telemetry: emit logs/metrics needed for debugging and performance tuning.

When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.

Source Walkthrough

Use the following upstream sources to verify implementation details while reading this chapter:

  • View Repo (github.com). Why it matters: authoritative upstream reference for the implementation details discussed in this chapter.

Suggested trace strategy:

  • search upstream code for Document, BaseLoader, and TextSplitter to map concrete implementation paths
  • compare docs claims against actual runtime/config code before reusing patterns in production

Chapter Connections