langchain-plasmate

March 28, 2026

LangChain document loader for Plasmate SOM (Structured Object Model).

Plasmate SOM converts web pages into a clean, structured text representation suited to LLM processing. This loader ingests web content into your LangChain pipelines as SOM output that is typically 80-95% smaller than the raw HTML.

Installation

pip install langchain-plasmate

Quick Start

from langchain_plasmate import PlasmateSOMLLoader

# Load a single page
loader = PlasmateSOMLLoader(
    urls=["https://example.com"],
    api_key="your-plasmate-api-key"
)
docs = loader.load()

print(docs[0].page_content)
# Output: Clean, structured text representation of the page

print(docs[0].metadata)
# Output: {'source': 'https://example.com', 'title': '...', 'compression_ratio': 0.15, ...}

Load Multiple Pages

loader = PlasmateSOMLLoader(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ],
    api_key="your-api-key"
)

# Uses batch API for efficiency
docs = loader.load()

Lazy Loading

# For memory efficiency with many URLs
for doc in loader.lazy_load():
    process(doc)
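
`process` in the snippet above is a placeholder. One common pattern is to consume the lazy iterator in fixed-size batches, so only a few documents are held in memory at a time. The sketch below is illustrative: the `batched` helper and the `fake_lazy_load` stand-in are hypothetical names, not part of langchain-plasmate. Any iterator, including the one returned by `loader.lazy_load()`, could be passed in instead.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load() -> Iterator[str]:
    """Hypothetical stand-in for loader.lazy_load(); yields placeholder items."""
    for i in range(5):
        yield f"doc-{i}"


for batch in batched(fake_lazy_load(), size=2):
    # Replace this with real work, e.g. embedding and indexing the batch.
    print(len(batch), batch)
```

Because `batched` pulls items only as each batch is requested, the underlying iterator is never fully materialized.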

Configuration

API Key

The loader uses the Plasmate Cache API by default. Set your API key in one of two ways:

In code:

loader = PlasmateSOMLLoader(urls=[...], api_key="your-key")

Via environment variable:

export PLASMATE_API_KEY="your-key"

loader = PlasmateSOMLLoader(urls=[...])  # Auto-detects from env

Get your API key at cache.plasmate.app.

Local CLI Fallback

If no API key is provided, the loader falls back to the local plasmate CLI tool:

# Shell: install the plasmate CLI
npm install -g plasmate

# Python: use the loader without an API key
loader = PlasmateSOMLLoader(urls=["https://example.com"])
docs = loader.load()  # Uses local CLI
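
The choice between the Cache API and the local CLI can be pictured roughly as follows. This is a sketch of the behavior described above, not the loader's actual implementation. `choose_backend` is a hypothetical helper, and the precedence of an explicit key over the environment variable is an assumption, as is the error raised when neither a key nor the CLI is available.

```python
import os
import shutil
from typing import Mapping, Optional


def choose_backend(
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return "api" if a key is available, otherwise "cli" (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumed precedence: explicit argument first, then PLASMATE_API_KEY.
    key = api_key or env.get("PLASMATE_API_KEY")
    if key:
        return "api"
    # No key: the local CLI must be installed and on PATH.
    if shutil.which("plasmate") is None:
        raise RuntimeError("No API key found and the plasmate CLI is not on PATH")
    return "cli"


print(choose_backend(api_key="your-key", env={}))  # prints "api"
```

Running a check like this before building a loader can surface a missing key or a missing CLI install early, instead of at load time.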

Custom API Base

For self-hosted Plasmate instances:

loader = PlasmateSOMLLoader(
    urls=[...],
    api_key="your-key",
    api_base="https://your-plasmate-instance.com"
)

Document Structure

Each loaded document contains:

`page_content`

A formatted text representation of the page, extracted from the SOM structure. Includes:

Page title as a heading
Structured content from regions/elements
Properly formatted headings, lists, links, and code blocks

`metadata`

Field	Description
`source`	Original URL
`title`	Page title
`som_version`	SOM format version
`compression_ratio`	Ratio of SOM size to HTML size (lower = better compression)
`html_bytes`	Original HTML size in bytes
`som_bytes`	Compressed SOM size in bytes
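
The size fields relate to each other in a simple way: `compression_ratio` is `som_bytes` divided by `html_bytes`, so a ratio of 0.15 means the SOM output is 85% smaller than the HTML. The sketch below recomputes both figures from a metadata dictionary. `compression_summary` is a hypothetical helper and the byte counts are made-up illustration values, not measurements.

```python
def compression_summary(metadata: dict) -> dict:
    """Derive the compression ratio and percent saved from the size fields."""
    html_bytes = metadata["html_bytes"]
    som_bytes = metadata["som_bytes"]
    if html_bytes <= 0:
        raise ValueError("html_bytes must be positive")
    return {
        # Ratio of SOM size to HTML size (lower = better compression).
        "compression_ratio": som_bytes / html_bytes,
        # Share of the original HTML size that was removed, in percent.
        "percent_smaller": 100 * (html_bytes - som_bytes) / html_bytes,
    }


# Illustrative values only.
example = {"source": "https://example.com", "html_bytes": 200_000, "som_bytes": 30_000}
summary = compression_summary(example)
print(f"{summary['compression_ratio']:.2f} ratio, {summary['percent_smaller']:.0f}% smaller")
```

Under this definition, the "80-95% smaller" range quoted in the introduction corresponds to a `compression_ratio` between 0.20 and 0.05.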

Use Cases

RAG pipelines: Load web documentation into vector stores
Web scraping: Extract clean content from complex pages
Content analysis: Process web pages for summarization or classification
Knowledge base building: Ingest web content into your LLM applications

License

Apache-2.0