langchain-plasmate
March 28, 2026
LangChain document loader for Plasmate SOM (Structured Object Model).
Plasmate SOM converts web pages into a clean, structured text representation suited to LLM processing. This loader ingests web content into your LangChain pipelines, and the SOM output is typically 80-95% smaller than the raw HTML.
Installation
pip install langchain-plasmate
Quick Start
from langchain_plasmate import PlasmateSOMLLoader
# Load a single page
loader = PlasmateSOMLLoader(
    urls=["https://example.com"],
    api_key="your-plasmate-api-key"
)
docs = loader.load()
print(docs[0].page_content)
# Output: Clean, structured text representation of the page
print(docs[0].metadata)
# Output: {'source': 'https://example.com', 'title': '...', 'compression_ratio': 0.15, ...}
Load Multiple Pages
loader = PlasmateSOMLLoader(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ],
    api_key="your-api-key"
)
# Uses batch API for efficiency
docs = loader.load()
Lazy Loading
# For memory efficiency with many URLs
for doc in loader.lazy_load():
    process(doc)  # process() is a placeholder for your own per-document handling
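Lazy loading pairs naturally with fixed-size batches, for example when writing documents to a store in chunks. The following is a minimal sketch using only the standard library. `batched` and `fake_lazy_load` are hypothetical names introduced for illustration, and the generator merely stands in for `loader.lazy_load()` so the snippet runs without an API key.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for loader.lazy_load(); yields plain dicts, not real Documents."""
    for i in range(n):
        yield {
            "page_content": f"page {i}",
            "metadata": {"source": f"https://example.com/page{i}"},
        }


# Only one batch is held in memory at a time.
batch_sizes = [len(batch) for batch in batched(fake_lazy_load(5), size=2)]
print(batch_sizes)
```

In real use, passing `loader.lazy_load()` in place of `fake_lazy_load(5)` should work the same way, since `batched` only requires an iterable.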
Configuration
API Key
The loader uses the Plasmate Cache API by default. Set your API key in one of two ways:
- In code:
  loader = PlasmateSOMLLoader(urls=[...], api_key="your-key")
- Via environment variable:
  export PLASMATE_API_KEY="your-key"
  loader = PlasmateSOMLLoader(urls=[...]) # Auto-detects from env
Get your API key at cache.plasmate.app.
Local CLI Fallback
If no API key is provided, the loader falls back to the local plasmate CLI tool:
# Install plasmate CLI
npm install -g plasmate
# Use without API key
loader = PlasmateSOMLLoader(urls=["https://example.com"])
docs = loader.load() # Uses local CLI
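The key lookup and CLI fallback described above can be pictured as a small decision function. This is only a sketch of the documented behavior, not the loader's actual implementation: `choose_backend` is a hypothetical helper, and the assumption that an explicit `api_key` takes precedence over `PLASMATE_API_KEY` is ours.

```python
import os
from typing import Mapping, Optional, Tuple


def choose_backend(
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return ("api", key) when a key is available, otherwise ("cli", None).

    Hypothetical helper: the explicit argument is checked first (an assumption),
    then the PLASMATE_API_KEY environment variable.
    """
    env = os.environ if env is None else env
    key = api_key or env.get("PLASMATE_API_KEY")
    if key:
        return ("api", key)
    return ("cli", None)


# Passing a dict instead of os.environ keeps the examples deterministic.
explicit = choose_backend("your-key", env={})
from_env = choose_backend(env={"PLASMATE_API_KEY": "env-key"})
fallback = choose_backend(env={})
print(explicit, from_env, fallback)
```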
Custom API Base
For self-hosted Plasmate instances:
loader = PlasmateSOMLLoader(
    urls=[...],
    api_key="your-key",
    api_base="https://your-plasmate-instance.com"
)
Document Structure
Each loaded document contains:
page_content
A formatted text representation of the page, extracted from the SOM structure. Includes:
- Page title as a heading
- Structured content from regions/elements
- Properly formatted headings, lists, links, and code blocks
metadata
| Field | Description |
|---|---|
| source | Original URL |
| title | Page title |
| som_version | SOM format version |
| compression_ratio | Ratio of SOM size to HTML size (lower = better compression) |
| html_bytes | Original HTML size in bytes |
| som_bytes | Compressed SOM size in bytes |
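Because `compression_ratio` is the ratio of SOM size to HTML size, the percentage saved is `(1 - ratio) * 100`. The sketch below works this out from the byte counts. `percent_saved` is a hypothetical helper, and the metadata values are made up for illustration rather than taken from a real page.

```python
def percent_saved(metadata: dict) -> float:
    """Percentage by which the SOM is smaller than the original HTML."""
    html_bytes = metadata["html_bytes"]
    if html_bytes <= 0:
        raise ValueError("html_bytes must be positive")
    ratio = metadata["som_bytes"] / html_bytes  # SOM size relative to HTML size
    return round((1 - ratio) * 100, 1)


# Illustrative values only; a real document's metadata will differ.
example_metadata = {
    "source": "https://example.com",
    "html_bytes": 20000,
    "som_bytes": 3000,
    "compression_ratio": 0.15,
}

saved = percent_saved(example_metadata)
print(f"{saved}% smaller than raw HTML")
```

A ratio of 0.15 therefore corresponds to output about 85% smaller than the raw HTML, which sits inside the 80-95% range quoted at the top of this README.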
Use Cases
- RAG pipelines: Load web documentation into vector stores
- Web scraping: Extract clean content from complex pages
- Content analysis: Process web pages for summarization or classification
- Knowledge base building: Ingest web content into your LLM applications
License
Apache-2.0