langchain-plasmate

March 28, 2026

LangChain document loader for Plasmate SOM (Structured Object Model).

Plasmate SOM converts web pages into a clean, structured text representation suited to LLM processing. This loader ingests web content into your LangChain pipelines; the SOM output is typically 80-95% smaller than the raw HTML.

Installation

pip install langchain-plasmate

Quick Start

from langchain_plasmate import PlasmateSOMLLoader

# Load a single page
loader = PlasmateSOMLLoader(
    urls=["https://example.com"],
    api_key="your-plasmate-api-key"
)
docs = loader.load()

print(docs[0].page_content)
# Output: Clean, structured text representation of the page

print(docs[0].metadata)
# Output: {'source': 'https://example.com', 'title': '...', 'compression_ratio': 0.15, ...}

Load Multiple Pages

loader = PlasmateSOMLLoader(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ],
    api_key="your-api-key"
)

# Uses batch API for efficiency
docs = loader.load()

Lazy Loading

# For memory efficiency with many URLs: documents are yielded one at a time
for doc in loader.lazy_load():
    process(doc)  # process() is a placeholder for your own handling code
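
When a downstream step works best on groups of documents (an embedding call, for example), the lazy iterator can be consumed in fixed-size batches. The following is a minimal sketch, assuming only that `lazy_load()` returns an iterator of documents as the loop above suggests. `batched` and `fake_lazy_load` are hypothetical helpers introduced for illustration and are not part of langchain-plasmate.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load(urls: List[str]) -> Iterator[dict]:
    """Hypothetical stand-in for loader.lazy_load(): yields one record per URL."""
    for url in urls:
        yield {"source": url}


urls = [f"https://example.com/page{i}" for i in range(1, 6)]
batch_sizes = [len(batch) for batch in batched(fake_lazy_load(urls), 2)]
print(batch_sizes)  # [2, 2, 1]
```

With a real loader, `fake_lazy_load(urls)` would be replaced by `loader.lazy_load()`, so only one batch is held in memory at a time.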

Configuration

API Key

The loader uses the Plasmate Cache API by default. Set your API key either:

  1. In code:

    loader = PlasmateSOMLLoader(urls=[...], api_key="your-key")
    
  2. Via environment variable:

    export PLASMATE_API_KEY="your-key"
    
    loader = PlasmateSOMLLoader(urls=[...])  # Auto-detects from env
    

Get your API key at cache.plasmate.app.
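
The two options can also be combined in application code. Below is a minimal sketch of one possible lookup order: explicit argument first, then the `PLASMATE_API_KEY` environment variable. The precedence and the helper name `resolve_api_key` are assumptions for illustration; the loader's internal behavior may differ.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the explicit key if given, else PLASMATE_API_KEY, else None."""
    if explicit:
        return explicit
    # Treat an empty or whitespace-only variable as unset.
    value = env.get("PLASMATE_API_KEY", "").strip()
    return value or None


print(resolve_api_key("from-code", env={"PLASMATE_API_KEY": "from-env"}))  # from-code
print(resolve_api_key(env={"PLASMATE_API_KEY": "from-env"}))  # from-env
print(resolve_api_key(env={}))  # None
```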

Local CLI Fallback

If no API key is provided, the loader falls back to the local plasmate CLI tool:

# Install plasmate CLI (run in a shell)
npm install -g plasmate

# Use without API key (Python)
loader = PlasmateSOMLLoader(urls=["https://example.com"])
docs = loader.load()  # Uses local CLI
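
Because the fallback depends on the CLI being installed, it can help to check for it before loading. This is a minimal sketch, assuming the executable is named `plasmate` and is discoverable on `PATH`. `choose_backend` is a hypothetical helper written for this example, not something the loader exposes.

```python
import shutil
from typing import Callable, Optional


def choose_backend(
    api_key: Optional[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return "api" when a key is set, "cli" when the plasmate binary is on PATH."""
    if api_key:
        return "api"
    if which("plasmate") is not None:
        return "cli"
    raise RuntimeError(
        "No API key provided and the plasmate CLI was not found on PATH"
    )


# Simulated lookups so the example does not depend on the local machine.
print(choose_backend("your-key", which=lambda name: None))  # api
print(choose_backend(None, which=lambda name: "/usr/local/bin/plasmate"))  # cli
```

Calling `choose_backend(None)` with the default `which` performs the real `PATH` lookup and raises a `RuntimeError` with a readable message when neither option is available.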

Custom API Base

For self-hosted Plasmate instances:

loader = PlasmateSOMLLoader(
    urls=[...],
    api_key="your-key",
    api_base="https://your-plasmate-instance.com"
)

Document Structure

Each loaded document contains:

page_content

A formatted text representation of the page, extracted from the SOM structure. Includes:

  • Page title as a heading
  • Structured content from regions/elements
  • Properly formatted headings, lists, links, and code blocks

metadata

| Field | Description |
|-------|-------------|
| source | Original URL |
| title | Page title |
| som_version | SOM format version |
| compression_ratio | Ratio of SOM size to HTML size (lower = better compression) |
| html_bytes | Original HTML size in bytes |
| som_bytes | Compressed SOM size in bytes |
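
The size fields are related arithmetically. As a sketch, assuming `compression_ratio` equals `som_bytes / html_bytes` as the field description states, the percentage saved can be derived from a document's metadata. The byte counts below are made-up illustration values, and `percent_saved` is a hypothetical helper.

```python
def percent_saved(metadata: dict) -> float:
    """Return how much smaller the SOM is than the HTML, as a percentage."""
    html_bytes = metadata["html_bytes"]
    if html_bytes <= 0:
        raise ValueError("html_bytes must be positive")
    ratio = metadata["som_bytes"] / html_bytes
    return round((1 - ratio) * 100, 1)


# Illustrative values only, not measured from a real page.
example = {"html_bytes": 200_000, "som_bytes": 30_000, "compression_ratio": 0.15}
print(percent_saved(example))  # 85.0
```

A ratio of 0.15 therefore corresponds to a page that is 85% smaller than its raw HTML.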

Use Cases

  • RAG pipelines: Load web documentation into vector stores
  • Web scraping: Extract clean content from complex pages
  • Content analysis: Process web pages for summarization or classification
  • Knowledge base building: Ingest web content into your LLM applications

License

Apache-2.0