haystack-plasmate
April 11, 2026
Haystack integration for Plasmate, the browser engine for AI agents that converts HTML to structured JSON (SOM).
Overview
Plasmate provides a Semantic Object Model (SOM) representation of web pages that is optimized for LLM consumption, typically using 10-16x fewer tokens than raw HTML. This integration brings Plasmate's capabilities to Haystack 2.0 RAG pipelines.
Installation
pip install haystack-plasmate
You also need the Plasmate CLI installed:
# Build from source
git clone https://github.com/nicepkg/plasmate
cd plasmate
cargo build --release
export PATH="$PATH:$(pwd)/target/release"
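Before constructing a component, it can help to confirm that the CLI is discoverable. The sketch below uses only the standard library. The executable name `plasmate` is an assumption based on the repository name, and `find_plasmate` is a hypothetical helper, not part of this integration.

```python
import shutil
from typing import Optional


def find_plasmate(executable: str = "plasmate") -> Optional[str]:
    """Return the path of the CLI if it is found on PATH, else None.

    The default name "plasmate" is an assumption; adjust it if your
    build produces a differently named binary.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_plasmate()
    if path is None:
        print("Plasmate CLI not found on PATH; pass plasmate_path explicitly.")
    else:
        print(f"Using Plasmate CLI at {path}")
```

If the lookup returns a path, that value could be passed as `plasmate_path` when PATH cannot be modified.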
Components
PlasmateWebFetcher
Fetches web pages and converts them to Haystack Documents with SOM content.
from haystack_plasmate import PlasmateWebFetcher
# Basic usage
fetcher = PlasmateWebFetcher()
result = fetcher.run(urls=["https://example.com"])
docs = result["documents"]
print(docs[0].content) # Concise SOM text representation
print(docs[0].meta["url"]) # https://example.com
print(docs[0].meta["title"]) # Page title
# With custom headers (e.g., for authenticated pages)
fetcher = PlasmateWebFetcher(
    headers={"Authorization": "Bearer token123"},
    timeout=60,
)
# Text-only mode (extracts readable text without SOM structure)
fetcher = PlasmateWebFetcher(text_only=True)
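For authenticated pages, hard-coding a token as in the snippet above is rarely desirable. A minimal sketch of building the `headers` argument from the environment follows. The helper `build_auth_headers` and the variable name `PLASMATE_BEARER_TOKEN` are hypothetical and chosen only for illustration.

```python
import os
from typing import Dict


def build_auth_headers(env_var: str = "PLASMATE_BEARER_TOKEN") -> Dict[str, str]:
    """Build an Authorization header from an environment variable.

    Returns an empty dict when the variable is unset or blank, so the
    result can always be passed as a headers mapping.
    The variable name is a hypothetical choice, not defined by Plasmate.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        return {}
    return {"Authorization": f"Bearer {token}"}
```

The result could then be passed as `PlasmateWebFetcher(headers=build_auth_headers(), timeout=60)`.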
PlasmateSOMConverter
Converts raw HTML content to SOM Documents without making HTTP requests.
from haystack_plasmate import PlasmateSOMConverter
converter = PlasmateSOMConverter()
# Convert single HTML string
result = converter.run(html="<html><body><h1>Hello</h1></body></html>")
doc = result["documents"][0]
# Convert multiple HTML sources with metadata
result = converter.run(sources=[
    {"html": "<html>...</html>", "meta": {"source": "page1.html"}},
    {"html": "<html>...</html>", "meta": {"source": "page2.html"}},
])
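When the HTML already lives on disk, the `sources` list shown above can be assembled from a directory. This is a sketch under the assumption that each entry needs only the `html` and `meta` keys used in the example; `sources_from_dir` is a hypothetical helper, not part of the package.

```python
from pathlib import Path
from typing import Dict, List


def sources_from_dir(directory: str, pattern: str = "*.html") -> List[Dict]:
    """Read every matching file and return a list shaped like `sources`.

    Files are visited in sorted order so the output is deterministic.
    """
    sources = []
    for path in sorted(Path(directory).glob(pattern)):
        sources.append({
            # Undecodable bytes are replaced rather than raising an error
            "html": path.read_text(encoding="utf-8", errors="replace"),
            "meta": {"source": path.name},
        })
    return sources
```

The returned list could be passed directly as `converter.run(sources=sources_from_dir("pages"))`, where `pages` is a placeholder directory name.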
RAG Pipeline Example
Build a web-aware RAG pipeline that fetches documentation pages:
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack_plasmate import PlasmateWebFetcher
# Create pipeline
pipeline = Pipeline()
pipeline.add_component("fetcher", PlasmateWebFetcher())
pipeline.add_component("prompt", PromptBuilder(template="""
Based on the following documentation pages, answer the question.
{% for doc in documents %}
---
Source: {{ doc.meta.url }}
{{ doc.content }}
{% endfor %}
---
Question: {{ question }}
Answer:
"""))
pipeline.add_component("llm", OpenAIGenerator())  # reads OPENAI_API_KEY from the environment
# Connect components (explicit output and input names)
pipeline.connect("fetcher.documents", "prompt.documents")
pipeline.connect("prompt.prompt", "llm.prompt")
# Run pipeline
result = pipeline.run({
    "fetcher": {"urls": [
        "https://docs.haystack.deepset.ai/docs/intro",
        "https://docs.haystack.deepset.ai/docs/pipelines",
    ]},
    "prompt": {"question": "How do I create a Haystack pipeline?"},
})
print(result["llm"]["replies"][0])
Indexing Pipeline Example
Index web pages into a document store for later retrieval:
from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_plasmate import PlasmateWebFetcher
# Create document store
document_store = InMemoryDocumentStore()
# Create indexing pipeline
indexing = Pipeline()
indexing.add_component("fetcher", PlasmateWebFetcher())
indexing.add_component("splitter", DocumentSplitter(
    split_by="sentence",
    split_length=3,
))
indexing.add_component("writer", DocumentWriter(document_store=document_store))
# Connect
indexing.connect("fetcher.documents", "splitter.documents")
indexing.connect("splitter.documents", "writer.documents")
# Index documentation
indexing.run({
    "fetcher": {"urls": [
        "https://example.com/docs/getting-started",
        "https://example.com/docs/api-reference",
    ]}
})
Document Metadata
Documents created by Plasmate components include rich metadata:
| Field | Description |
|---|---|
| url | Source URL |
| title | Page title |
| lang | Language code |
| html_bytes | Original HTML size (bytes) |
| som_bytes | Compressed SOM size (bytes) |
| element_count | Total DOM elements |
| interactive_count | Interactive elements (links, buttons, inputs) |
| description | Meta description (if available) |
| open_graph | Open Graph data (if available) |
| json_ld | JSON-LD structured data (if available) |
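The `html_bytes` and `som_bytes` fields make it possible to check the size reduction for a given page. A minimal sketch follows, using a plain dict as a stand-in for `doc.meta`. The numbers are made up for illustration, and a byte ratio is only a rough proxy for token savings.

```python
from typing import Any, Mapping, Optional


def compression_ratio(meta: Mapping[str, Any]) -> Optional[float]:
    """Return html_bytes / som_bytes, or None if either is missing or zero."""
    html_bytes = meta.get("html_bytes")
    som_bytes = meta.get("som_bytes")
    if not html_bytes or not som_bytes:
        return None
    return html_bytes / som_bytes


# Stand-in metadata with invented sizes, for illustration only
example_meta = {"url": "https://example.com", "html_bytes": 48000, "som_bytes": 4000}
print(compression_ratio(example_meta))  # 12.0
```

Returning `None` rather than raising keeps the helper usable on documents where a size field is absent.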
Configuration
PlasmateWebFetcher Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| plasmate_path | str | None | Path to Plasmate CLI (searches PATH if None) |
| timeout | int | 30 | Request timeout in seconds |
| headers | dict | {} | HTTP headers to include |
| text_only | bool | False | Extract text only (no SOM structure) |
| raise_on_error | bool | False | Raise exceptions on fetch errors |
PlasmateSOMConverter Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| plasmate_path | str | None | Path to Plasmate CLI |
| base_url | str | None | Base URL for resolving relative links |
| text_only | bool | False | Extract text only |
Why SOM?
The Semantic Object Model (SOM) provides several benefits over raw HTML:
- Token Efficiency: 10-16x smaller than raw HTML, reducing LLM costs
- Structured Data: Clean JSON representation of page content
- Interactive Elements: Clearly labeled buttons, links, and form fields
- Semantic Regions: Page sections (header, main, footer, nav) are identified
- Metadata Extraction: Title, description, Open Graph, and JSON-LD data
License
MIT