LlamaIndex Plasmate Reader

March 28, 2026

A LlamaIndex reader for Plasmate SOM (Structured Object Model), providing clean, structured web content optimized for AI agents and RAG pipelines.

What is Plasmate SOM?

Plasmate SOM converts messy HTML into a clean, semantic structure that AI models can easily understand. Instead of parsing raw HTML with all its noise, you get structured content with:

  • Semantic regions (headers, navigation, main content, footers)
  • Clean text extraction from headings, paragraphs, links, lists, and tables
  • Compact output, typically about 10x smaller than the raw HTML
  • Consistent structure across any website

Installation

pip install llama-index-readers-plasmate

Quick Start

from llama_index_plasmate import PlasmateReader

# Initialize the reader
reader = PlasmateReader()

# Load documents from URLs
documents = reader.load_data(urls=[
    "https://example.com/page1",
    "https://example.com/page2",
])

# Use with LlamaIndex
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is on these pages?")

Configuration

The reader uses the Plasmate SOM Cache API by default for fast, cached responses:

reader = PlasmateReader(
    api_key="your-api-key",  # Optional, for authenticated access
    api_base="https://cache.plasmate.app",  # Default
)
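
In practice the API key is often kept out of source code. The following is a minimal sketch of one way to do that, using only the two constructor parameters shown above. The environment variable names PLASMATE_API_KEY and PLASMATE_API_BASE are assumptions made for this example; the reader is not documented to read them on its own.

```python
import os
from typing import Mapping, Optional

DEFAULT_API_BASE = "https://cache.plasmate.app"  # default shown above


def reader_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for PlasmateReader from environment variables.

    PLASMATE_API_KEY and PLASMATE_API_BASE are hypothetical names chosen
    for this example, not settings defined by the reader.
    """
    env = os.environ if env is None else env
    return {
        # An unset or empty variable means unauthenticated access.
        "api_key": env.get("PLASMATE_API_KEY") or None,
        "api_base": env.get("PLASMATE_API_BASE") or DEFAULT_API_BASE,
    }


# Usage (requires the reader package to be installed):
# reader = PlasmateReader(**reader_kwargs())
```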

Using Local Plasmate CLI Fallback

If the API is unavailable, the reader automatically falls back to the local plasmate CLI, provided it is installed:

# Install plasmate CLI
npm install -g plasmate

The reader will use the CLI when:

  • The API returns an error
  • No API key is provided and the endpoint requires authentication
  • You explicitly disable the API
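
The fallback rules above can be pictured with a short sketch. This is not the reader's actual implementation: api_fetch and cli_fetch are hypothetical callables supplied by the caller, and passing api_fetch=None stands in for "explicitly disable the API". The only real library call is shutil.which, used to check whether an executable is on PATH.

```python
import shutil
from typing import Callable, Optional


def fetch_som(
    url: str,
    api_fetch: Optional[Callable[[str], dict]],
    cli_fetch: Callable[[str], dict],
    cli_name: str = "plasmate",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> dict:
    """Try the API first; fall back to the local CLI if the API call fails.

    api_fetch and cli_fetch are hypothetical stand-ins for whatever actually
    talks to the API and the CLI. api_fetch=None skips the API entirely.
    """
    api_error: Optional[Exception] = None
    if api_fetch is not None:
        try:
            return api_fetch(url)
        except Exception as exc:  # any API error, including an auth failure
            api_error = exc

    # Fall back only if the CLI executable can be found on PATH.
    if which(cli_name) is None:
        raise RuntimeError(
            f"API unavailable and '{cli_name}' CLI not found on PATH"
        ) from api_error
    return cli_fetch(url)
```

Keeping the two fetchers as parameters makes the decision logic easy to exercise without network access or an installed CLI.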

Document Metadata

Each document includes rich metadata:

doc = documents[0]
print(doc.metadata)
# {
#     "source": "https://example.com/page1",
#     "title": "Page Title",
#     "som_version": "1.0",
#     "compression_ratio": 12.5,
#     "html_bytes": 125000,
#     "som_bytes": 10000,
# }
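
The sample values suggest that compression_ratio is html_bytes divided by som_bytes (125000 / 10000 = 12.5). That relationship is inferred from the example above rather than from a documented definition. Under that assumption, a small helper can summarize savings across a batch; plain dicts stand in for doc.metadata here.

```python
from typing import Any, Dict, Iterable


def summarize_compression(metadatas: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Total the byte counts and compute an overall html/som size ratio.

    Assumes each metadata dict carries "html_bytes" and "som_bytes" keys,
    as in the sample above; missing keys count as zero.
    """
    total_html = 0
    total_som = 0
    for meta in metadatas:
        total_html += meta.get("html_bytes", 0)
        total_som += meta.get("som_bytes", 0)
    # Avoid dividing by zero when there are no documents or no SOM bytes.
    ratio = total_html / total_som if total_som else 0.0
    return {
        "html_bytes": total_html,
        "som_bytes": total_som,
        "compression_ratio": ratio,
    }


sample = {"html_bytes": 125000, "som_bytes": 10000}
summary = summarize_compression([sample])
print(summary["compression_ratio"])  # 12.5
```

With real documents, the call would be summarize_compression(doc.metadata for doc in documents).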

API Reference

PlasmateReader

PlasmateReader(
    api_key: Optional[str] = None,
    api_base: str = "https://cache.plasmate.app",
)

Parameters:

  • api_key: Optional API key for authenticated access to the SOM Cache API
  • api_base: Base URL for the SOM Cache API (default: https://cache.plasmate.app)

load_data

reader.load_data(
    urls: List[str],
) -> List[Document]

Parameters:

  • urls: List of URLs to fetch and convert to documents

Returns:

List of LlamaIndex Document objects with extracted text and metadata.

How It Works

  1. The reader sends each URL to the Plasmate SOM Cache API
  2. Plasmate fetches the page and converts HTML to SOM format
  3. The reader extracts readable text from semantic regions:
    • Headings (h1 through h6)
    • Paragraphs
    • Links (with href context)
    • Lists (ordered and unordered)
    • Tables
  4. Text is assembled into a clean document with source metadata
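
Steps 3 and 4 can be illustrated with a rough sketch. The SOM schema is not shown in this README, so the dict layout used below (regions, elements, role, text, level, href, items, rows) is entirely hypothetical, as is the output formatting. It only demonstrates the general idea of walking semantic regions and assembling text.

```python
from typing import Any, Dict, List


def element_to_text(el: Dict[str, Any]) -> str:
    """Render one element of the assumed (hypothetical) SOM layout as text."""
    role = el.get("role", "")
    text = el.get("text", "").strip()
    if role == "heading":
        # Mark the heading level (1 through 6) with leading '#' characters.
        return f"{'#' * el.get('level', 1)} {text}"
    if role == "link":
        href = el.get("href", "")
        return f"{text} ({href})" if href else text
    if role == "list":
        return "\n".join(f"- {item}" for item in el.get("items", []))
    if role == "table":
        return "\n".join(" | ".join(row) for row in el.get("rows", []))
    return text  # paragraphs and any unrecognized role


def som_to_text(som: Dict[str, Any]) -> str:
    """Join the rendered elements of every region, skipping empty ones."""
    parts: List[str] = []
    for region in som.get("regions", []):
        for el in region.get("elements", []):
            rendered = element_to_text(el)
            if rendered:
                parts.append(rendered)
    return "\n\n".join(parts)


# Hypothetical SOM payload, for illustration only.
example_som = {
    "regions": [
        {"role": "main", "elements": [
            {"role": "heading", "level": 1, "text": "Hello"},
            {"role": "paragraph", "text": "Some text."},
            {"role": "link", "text": "Docs", "href": "https://example.com/docs"},
            {"role": "list", "items": ["one", "two"]},
            {"role": "table", "rows": [["a", "b"], ["1", "2"]]},
        ]},
    ],
}
text = som_to_text(example_som)
```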

License

Apache 2.0