LlamaIndex Plasmate Reader
March 28, 2026
A LlamaIndex reader for Plasmate SOM (Structured Object Model), providing clean, structured web content optimized for AI agents and RAG pipelines.
What is Plasmate SOM?
Plasmate SOM converts messy HTML into a clean, semantic structure that AI models can easily understand. Instead of parsing raw HTML with all its noise, you get structured content with:
- Semantic regions (headers, navigation, main content, footers)
- Clean text extraction from headings, paragraphs, links, lists, and tables
- Output that is typically about 10x smaller than the raw HTML
- Consistent structure across any website
Installation
pip install llama-index-readers-plasmate
Quick Start
from llama_index_plasmate import PlasmateReader
# Initialize the reader
reader = PlasmateReader()
# Load documents from URLs
documents = reader.load_data(urls=[
"https://example.com/page1",
"https://example.com/page2",
])
# Use with LlamaIndex
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is on these pages?")
Configuration
Using the SOM Cache API (Recommended)
The reader uses the Plasmate SOM Cache API by default for fast, cached responses:
reader = PlasmateReader(
api_key="your-api-key", # Optional, for authenticated access
api_base="https://cache.plasmate.app", # Default
)
Using Local Plasmate CLI Fallback
If the API is unavailable and the local plasmate CLI is installed, the reader automatically falls back to the CLI:
# Install plasmate CLI
npm install -g plasmate
The reader will use the CLI when:
- The API returns an error
- No API key is provided and the endpoint requires authentication
- You explicitly disable the API
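The fallback order described above can be sketched roughly as follows. This is a minimal illustration and not the reader's actual implementation: `fetch_som`, `fetch_from_api`, `fetch_from_cli`, and the `use_api` flag are hypothetical names introduced here, and the two fetchers are passed in as plain callables so the sketch runs without network access or the CLI.

```python
from typing import Callable, Dict


def fetch_som(
    url: str,
    fetch_from_api: Callable[[str], Dict],
    fetch_from_cli: Callable[[str], Dict],
    use_api: bool = True,
) -> Dict:
    """Hypothetical helper: try the API first, then fall back to the CLI."""
    if use_api:
        try:
            return fetch_from_api(url)
        except Exception:
            # API error or authentication failure: fall through to the CLI.
            pass
    return fetch_from_cli(url)


# Stand-in fetchers for demonstration only (assumptions, not Plasmate APIs).
def failing_api(url: str) -> Dict:
    raise ConnectionError("API unavailable")


def fake_api(url: str) -> Dict:
    return {"url": url, "via": "api"}


def fake_cli(url: str) -> Dict:
    return {"url": url, "via": "cli"}


result = fetch_som("https://example.com/page1", failing_api, fake_cli)
print(result["via"])  # cli
```

Passing `use_api=False` in this sketch skips the API entirely, which corresponds to the "explicitly disable the API" case.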
Document Metadata
Each document includes rich metadata:
doc = documents[0]
print(doc.metadata)
# {
# "source": "https://example.com/page1",
# "title": "Page Title",
# "som_version": "1.0",
# "compression_ratio": 12.5,
# "html_bytes": 125000,
# "som_bytes": 10000,
# }
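As a quick sanity check on these fields, the sample values are consistent with `compression_ratio` being `html_bytes` divided by `som_bytes`. That relationship is an assumption inferred from the example numbers, not a documented guarantee, and `compression_ratio` below is a hypothetical helper:

```python
def compression_ratio(html_bytes: int, som_bytes: int) -> float:
    """Assumed definition: raw HTML size divided by SOM size."""
    if som_bytes <= 0:
        raise ValueError("som_bytes must be positive")
    return html_bytes / som_bytes


# Values copied from the sample metadata above.
metadata = {"html_bytes": 125000, "som_bytes": 10000, "compression_ratio": 12.5}

ratio = compression_ratio(metadata["html_bytes"], metadata["som_bytes"])
print(ratio)  # 12.5
```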
API Reference
PlasmateReader
PlasmateReader(
api_key: Optional[str] = None,
api_base: str = "https://cache.plasmate.app",
)
Parameters:
api_key: Optional API key for authenticated access to the SOM Cache API
api_base: Base URL for the SOM Cache API (default: https://cache.plasmate.app)
load_data
reader.load_data(
urls: List[str],
) -> List[Document]
Parameters:
urls: List of URLs to fetch and convert to documents
Returns:
List of LlamaIndex Document objects with extracted text and metadata.
How It Works
1. The reader sends URLs to the Plasmate SOM Cache API.
2. Plasmate fetches the page and converts HTML to SOM format.
3. The reader extracts readable text from semantic regions:
   - Headings (h1 through h6)
   - Paragraphs
   - Links (with href context)
   - Lists (ordered and unordered)
   - Tables
4. Text is assembled into a clean document with source metadata.
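The text-extraction step can be illustrated with a small sketch. The SOM layout used here (a `regions` list, each with `elements` carrying `type`, `text`, `level`, `href`, `items`, or `rows`) is a made-up stand-in for illustration and may not match the real SOM schema. `som_to_text` is likewise a hypothetical helper, not part of the reader's API.

```python
from typing import Any, Dict, List


def som_to_text(som: Dict[str, Any]) -> str:
    """Flatten a hypothetical SOM-like dict into plain text lines."""
    lines: List[str] = []
    for region in som.get("regions", []):
        for element in region.get("elements", []):
            kind = element.get("type")
            text = element.get("text", "").strip()
            if kind == "heading":
                # Prefix headings with '#' repeated by level (h1 through h6).
                lines.append("#" * element.get("level", 1) + " " + text)
            elif kind == "paragraph":
                lines.append(text)
            elif kind == "link":
                # Keep the href next to the link text for context.
                lines.append(f"{text} ({element.get('href', '')})")
            elif kind == "list":
                lines.extend(f"- {item}" for item in element.get("items", []))
            elif kind == "table":
                lines.extend(" | ".join(row) for row in element.get("rows", []))
    # Drop empty lines so the result stays compact.
    return "\n".join(line for line in lines if line)


sample = {
    "regions": [
        {
            "role": "main",
            "elements": [
                {"type": "heading", "level": 1, "text": "Title"},
                {"type": "paragraph", "text": "Hello world."},
                {"type": "link", "text": "Docs", "href": "https://example.com/docs"},
                {"type": "list", "items": ["a", "b"]},
                {"type": "table", "rows": [["x", "y"], ["1", "2"]]},
            ],
        }
    ]
}

text = som_to_text(sample)
print(text)
```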
License
Apache 2.0