haystack-plasmate

April 11, 2026

Haystack integration for Plasmate, the browser engine for AI agents that converts HTML into a structured JSON representation (SOM).

Overview

Plasmate provides a Semantic Object Model (SOM) representation of web pages that is optimized for LLM consumption, typically using 10-16x fewer tokens than raw HTML. This integration brings Plasmate's capabilities to Haystack 2.0 RAG pipelines.

Installation

pip install haystack-plasmate

You also need the Plasmate CLI installed:

# Build from source
git clone https://github.com/nicepkg/plasmate
cd plasmate
cargo build --release
export PATH="$PATH:$(pwd)/target/release"
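
Before wiring the components into a pipeline, it can help to confirm that the CLI is actually reachable. The sketch below assumes the built binary is named `plasmate` (inferred from the build steps, not confirmed by them) and mirrors the documented behaviour of searching PATH when no explicit path is given. The helper `find_plasmate` is illustrative and not part of the integration.

```python
import shutil
from typing import Optional


def find_plasmate(plasmate_path: Optional[str] = None,
                  binary_name: str = "plasmate") -> Optional[str]:
    """Return a usable path to the CLI, or None if it cannot be found.

    `binary_name` is an assumption; adjust it if your build produces a
    differently named executable.
    """
    # shutil.which resolves bare names via PATH; for an explicit path it
    # checks that the file exists and is executable.
    return shutil.which(plasmate_path or binary_name)


if __name__ == "__main__":
    location = find_plasmate()
    print(location or "Plasmate CLI not found on PATH")
```

If the check returns None, re-run the `export PATH` line in the same shell session or pass the full path via `plasmate_path`.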

Components

PlasmateWebFetcher

Fetches web pages and converts them to Haystack Documents with SOM content.

from haystack_plasmate import PlasmateWebFetcher

# Basic usage
fetcher = PlasmateWebFetcher()
result = fetcher.run(urls=["https://example.com"])
docs = result["documents"]

print(docs[0].content)  # Concise SOM text representation
print(docs[0].meta["url"])  # https://example.com
print(docs[0].meta["title"])  # Page title

# With custom headers (e.g., for authenticated pages)
fetcher = PlasmateWebFetcher(
    headers={"Authorization": "Bearer token123"},
    timeout=60,
)

# Text-only mode (extracts readable text without SOM structure)
fetcher = PlasmateWebFetcher(text_only=True)
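
Hard-coding a bearer token as in the snippet above is fine for a demo but awkward in shared code. One option, sketched here, is to read the token from an environment variable and pass the resulting dict as `headers`. The variable name `PLASMATE_BEARER_TOKEN` and the helper `build_auth_headers` are hypothetical and not part of the integration.

```python
import os
from typing import Dict, Mapping, Optional


def build_auth_headers(
    env: Optional[Mapping[str, str]] = None,
    var_name: str = "PLASMATE_BEARER_TOKEN",  # hypothetical variable name
) -> Dict[str, str]:
    """Build an Authorization header from an environment variable.

    Returns an empty dict when the variable is unset or blank, so the
    result can always be passed as `headers`.
    """
    source = os.environ if env is None else env
    token = source.get(var_name, "").strip()
    return {"Authorization": f"Bearer {token}"} if token else {}


headers = build_auth_headers()
# The dict can then be passed to the fetcher shown above:
# fetcher = PlasmateWebFetcher(headers=headers, timeout=60)
```

Accepting an optional `env` mapping keeps the helper easy to test without touching the real process environment.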

PlasmateSOMConverter

Converts raw HTML content to SOM Documents without making HTTP requests.

from haystack_plasmate import PlasmateSOMConverter

converter = PlasmateSOMConverter()

# Convert single HTML string
result = converter.run(html="<html><body><h1>Hello</h1></body></html>")
doc = result["documents"][0]

# Convert multiple HTML sources with metadata
result = converter.run(sources=[
    {"html": "<html>...</html>", "meta": {"source": "page1.html"}},
    {"html": "<html>...</html>", "meta": {"source": "page2.html"}},
])
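
When the HTML already lives on disk, the `sources` list can be assembled from a directory. The following is a minimal sketch using only the standard library; the helper name `html_sources_from_dir` and the `./pages` directory are illustrative assumptions, while the `html` and `meta` keys follow the example above.

```python
from pathlib import Path
from typing import Dict, List


def html_sources_from_dir(directory: str, pattern: str = "*.html") -> List[Dict]:
    """Read HTML files and shape them like the `sources` argument above."""
    sources = []
    # sorted() keeps the order stable across runs and platforms.
    for path in sorted(Path(directory).glob(pattern)):
        sources.append({
            # errors="replace" avoids crashing on pages with stray bytes.
            "html": path.read_text(encoding="utf-8", errors="replace"),
            "meta": {"source": path.name},
        })
    return sources


# Usage with the converter from the example above (requires haystack-plasmate):
# result = PlasmateSOMConverter().run(sources=html_sources_from_dir("./pages"))
```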

RAG Pipeline Example

Build a web-aware RAG pipeline that fetches documentation pages:

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack_plasmate import PlasmateWebFetcher

# Create pipeline
pipeline = Pipeline()
pipeline.add_component("fetcher", PlasmateWebFetcher())
pipeline.add_component("prompt", PromptBuilder(template="""
Based on the following documentation pages, answer the question.

{% for doc in documents %}
---
Source: {{ doc.meta.url }}
{{ doc.content }}
{% endfor %}
---

Question: {{ question }}
Answer:
"""))
# OpenAIGenerator reads the API key from the OPENAI_API_KEY environment variable by default
pipeline.add_component("llm", OpenAIGenerator())

# Connect components
pipeline.connect("fetcher.documents", "prompt.documents")
pipeline.connect("prompt", "llm")

# Run pipeline
result = pipeline.run({
    "fetcher": {"urls": [
        "https://docs.haystack.deepset.ai/docs/intro",
        "https://docs.haystack.deepset.ai/docs/pipelines",
    ]},
    "prompt": {"question": "How do I create a Haystack pipeline?"},
})

print(result["llm"]["replies"][0])

Indexing Pipeline Example

Index web pages into a document store for later retrieval:

from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_plasmate import PlasmateWebFetcher

# Create document store
document_store = InMemoryDocumentStore()

# Create indexing pipeline
indexing = Pipeline()
indexing.add_component("fetcher", PlasmateWebFetcher())
indexing.add_component("splitter", DocumentSplitter(
    split_by="sentence",
    split_length=3,
))
indexing.add_component("writer", DocumentWriter(document_store=document_store))

# Connect
indexing.connect("fetcher", "splitter")
indexing.connect("splitter", "writer")

# Index documentation
indexing.run({
    "fetcher": {"urls": [
        "https://example.com/docs/getting-started",
        "https://example.com/docs/api-reference",
    ]}
})

Document Metadata

Documents created by Plasmate components include rich metadata:

| Field | Description |
|---|---|
| url | Source URL |
| title | Page title |
| lang | Language code |
| html_bytes | Original HTML size |
| som_bytes | Compressed SOM size |
| element_count | Total DOM elements |
| interactive_count | Interactive elements (links, buttons, inputs) |
| description | Meta description (if available) |
| open_graph | Open Graph data (if available) |
| json_ld | JSON-LD structured data (if available) |
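
The `html_bytes` and `som_bytes` fields make it possible to check how much a given page actually shrank. The sketch below works on a plain dict standing in for `doc.meta`; the helper name `compression_ratio` and the sample numbers are made up for illustration. Note that a byte ratio is only a rough proxy for token savings.

```python
from typing import Mapping, Optional


def compression_ratio(meta: Mapping[str, object]) -> Optional[float]:
    """Return html_bytes / som_bytes, or None if either field is missing or unusable."""
    html_bytes = meta.get("html_bytes")
    som_bytes = meta.get("som_bytes")
    if not isinstance(html_bytes, (int, float)) or not isinstance(som_bytes, (int, float)):
        return None
    if som_bytes <= 0:
        return None
    return html_bytes / som_bytes


# Illustrative metadata; the numbers are invented for the example.
sample_meta = {"url": "https://example.com", "html_bytes": 120_000, "som_bytes": 10_000}
print(compression_ratio(sample_meta))  # 12.0
```

With real documents, the same function can be applied to each `doc.meta` returned by the fetcher to spot pages that compress poorly.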

Configuration

PlasmateWebFetcher Options

| Parameter | Type | Default | Description |
|---|---|---|---|
| plasmate_path | str | None | Path to Plasmate CLI (searches PATH if None) |
| timeout | int | 30 | Request timeout in seconds |
| headers | dict | {} | HTTP headers to include |
| text_only | bool | False | Extract text only (no SOM structure) |
| raise_on_error | bool | False | Raise exceptions on fetch errors |

PlasmateSOMConverter Options

| Parameter | Type | Default | Description |
|---|---|---|---|
| plasmate_path | str | None | Path to Plasmate CLI |
| base_url | str | None | Base URL for resolving relative links |
| text_only | bool | False | Extract text only |

Why SOM?

The Semantic Object Model (SOM) provides several benefits over raw HTML:

  1. Token Efficiency: 10-16x smaller than raw HTML, reducing LLM costs
  2. Structured Data: Clean JSON representation of page content
  3. Interactive Elements: Clearly labeled buttons, links, and form fields
  4. Semantic Regions: Page sections (header, main, footer, nav) are identified
  5. Metadata Extraction: Title, description, Open Graph, and JSON-LD data
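
To make the token-efficiency point concrete, the arithmetic below turns the 10-16x range into a rough budget for a single page. It is a back-of-the-envelope sketch: the helper names, the 48,000-token page, and any price passed to `estimated_cost` are assumptions, and real pages can fall outside the quoted range.

```python
from typing import Tuple


def estimated_som_tokens(html_tokens: int,
                         low_factor: float = 10.0,
                         high_factor: float = 16.0) -> Tuple[int, int]:
    """Return (best_case, worst_case) SOM token estimates for a page.

    The default factors come from the 10-16x range quoted above.
    """
    best_case = round(html_tokens / high_factor)
    worst_case = round(html_tokens / low_factor)
    return best_case, worst_case


def estimated_cost(tokens: int, price_per_million: float) -> float:
    """Input cost for `tokens` at a caller-supplied price per million tokens."""
    return tokens / 1_000_000 * price_per_million


# A hypothetical page that would cost 48,000 tokens as raw HTML.
best, worst = estimated_som_tokens(48_000)
print(best, worst)  # 3000 4800
```

Multiplying either estimate by your model's input price via `estimated_cost` gives a per-page cost to compare against the raw-HTML figure.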

License

MIT