ScrapeGraphAI + Plasmate
April 12, 2026
Use Plasmate's Semantic Object Model (SOM) with ScrapeGraphAI for 10-100x token compression when scraping with LLMs.
What it does
Plasmate is a browser engine for AI agents that converts raw HTML into a compressed Semantic Object Model. This integration replaces ScrapeGraphAI's default HTML fetcher with Plasmate, cutting token usage by roughly 10-100x while preserving the semantic information needed for LLM extraction.
- Before (raw HTML): ~50,000 tokens for a typical web page
- After (Plasmate SOM): ~500-5,000 tokens for the same page
Installation
# Install the integration
pip install scrapegraphai-plasmate
# Install Plasmate (Rust binary)
cargo install plasmate
# Or download from: https://github.com/nicholasharring/plasmate/releases
Usage
Basic Usage
from scrapegraphai_plasmate import PlasmateScraper

scraper = PlasmateScraper(
    prompt="Extract all product names and prices",
    source="https://shop.example.com/products",
    config={
        "llm": {
            "model": "gpt-4",
            "api_key": "your-api-key",  # Or use OPENAI_API_KEY env var
        }
    },
)

result = scraper.run()
print(result)
print(f"Token savings: {scraper.token_savings:.1f}%")
With Anthropic Claude
from scrapegraphai_plasmate import PlasmateScraper

scraper = PlasmateScraper(
    prompt="Summarize the main article",
    source="https://news.example.com/article",
    config={
        "llm": {
            "model": "claude-3-5-sonnet-20241022",
            "api_key": "your-api-key",  # Or use ANTHROPIC_API_KEY env var
        }
    },
)
With Ollama (Local)
from scrapegraphai_plasmate import PlasmateScraper

scraper = PlasmateScraper(
    prompt="Extract contact information",
    source="https://company.example.com/contact",
    config={
        "llm": {
            "model": "ollama/llama3",
            "ollama": True,
        }
    },
)
Using the Fetcher Directly
from scrapegraphai_plasmate import PlasmateFetcher
fetcher = PlasmateFetcher(output_format="som")
result = fetcher.fetch("https://example.com")
print(result.content) # SOM JSON
print(f"Estimated tokens: {result.som_tokens}")
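For a rough sense of what an estimated token count means, the sketch below uses the common rule of thumb of about four characters per token. This is only an assumption for illustration: `estimate_tokens` is a hypothetical helper, not part of the integration, and the real `som_tokens` value may be computed differently.

```python
import json


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate: character count divided by an assumed average token length."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


# A tiny SOM-like fragment, serialized two ways.
som = {"tag": "h1", "children": ["Article Title"]}
compact = json.dumps(som, separators=(",", ":"))  # no extra whitespace
pretty = json.dumps(som, indent=2)                # human-readable, longer

print(estimate_tokens(compact), estimate_tokens(pretty))
```

Under this heuristic, the compact serialization scores lower than the pretty-printed one, so whitespace is worth stripping before sending SOM JSON to an LLM.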
Custom Graph Nodes
from scrapegraphai_plasmate import PlasmateFetchNode, PlasmateParseNode
# Create nodes for custom ScrapeGraphAI graphs
fetch_node = PlasmateFetchNode()
parse_node = PlasmateParseNode(extract_type="links")
# Use in your graph
state = {"url": "https://example.com"}
state = fetch_node(state)
state = parse_node(state)
print(state["parsed"]) # List of all links
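The nodes above follow a simple pattern: each one is a callable that takes a state dict and returns an updated one. A minimal sketch of chaining such callables is shown here. `run_nodes` and the two fake nodes are hypothetical stand-ins so the example runs offline, and the `"content"` key is an assumption, not a documented part of the state.

```python
from typing import Any, Callable, Dict, Iterable

State = Dict[str, Any]
Node = Callable[[State], State]


def run_nodes(nodes: Iterable[Node], state: State) -> State:
    """Apply each node in order, feeding its output state to the next node."""
    for node in nodes:
        state = node(state)
    return state


# Hypothetical stand-ins for PlasmateFetchNode / PlasmateParseNode.
def fake_fetch_node(state: State) -> State:
    return {**state, "content": f"<fetched {state['url']}>"}


def fake_parse_node(state: State) -> State:
    return {**state, "parsed": [state["content"].upper()]}


final = run_nodes([fake_fetch_node, fake_parse_node], {"url": "https://example.com"})
print(final["parsed"])
```

Because every node has the same signature, real Plasmate nodes could be swapped in for the fakes without changing `run_nodes`.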
Batch Scraping
from scrapegraphai_plasmate import PlasmateBatchScraper
scraper = PlasmateBatchScraper(
    prompt="Extract the page title and main heading",
    sources=[
        "https://site1.example.com",
        "https://site2.example.com",
        "https://site3.example.com",
    ],
    config={"llm": {"model": "gpt-4"}},
)

results = scraper.run()
for url, data in results.items():
    print(f"{url}: {data['result']}")
Token Savings Comparison
| Page Type | Raw HTML | Plasmate SOM | Savings |
|---|---|---|---|
| News article | 45,000 | 2,500 | 94% |
| E-commerce product | 80,000 | 4,000 | 95% |
| Documentation | 30,000 | 1,500 | 95% |
| Social media | 120,000 | 8,000 | 93% |
| Blog post | 25,000 | 1,200 | 95% |
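The Savings column can be reproduced from the two token columns. The short sketch below recomputes it and also reports the compression factor. The helper names are illustrative only, and the figures are copied from the table above.

```python
# (raw HTML tokens, Plasmate SOM tokens) per page type, from the table above.
rows = {
    "News article": (45_000, 2_500),
    "E-commerce product": (80_000, 4_000),
    "Documentation": (30_000, 1_500),
    "Social media": (120_000, 8_000),
    "Blog post": (25_000, 1_200),
}


def savings_percent(raw: int, som: int) -> float:
    """Share of tokens removed, as a percentage of the raw count."""
    return 100 * (1 - som / raw)


def compression_factor(raw: int, som: int) -> float:
    """How many times smaller the SOM is than the raw HTML."""
    return raw / som


for page, (raw, som) in rows.items():
    print(f"{page}: {savings_percent(raw, som):.0f}% saved, "
          f"{compression_factor(raw, som):.1f}x smaller")
```

For these example rows the factor works out to roughly 15x to 21x, which sits inside the 10-100x range quoted at the top.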
Output Formats
SOM (Semantic Object Model) - Default
Structured JSON representation of the page:
{
  "tag": "html",
  "children": [
    {
      "tag": "article",
      "children": [
        {"tag": "h1", "children": ["Article Title"]},
        {"tag": "p", "children": ["Article content..."]}
      ]
    }
  ]
}
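Since the SOM shown above is a plain tree of `tag`/`children` nodes with strings as leaves, it can be walked with a few lines of Python. This is a minimal sketch that assumes real SOM output has the same shape as the sample. `iter_text` is a hypothetical helper, not part of the package.

```python
from typing import Any, Iterator


def iter_text(node: Any) -> Iterator[str]:
    """Yield the text leaves of a SOM-like tree in document order."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for child in node.get("children", []):
            yield from iter_text(child)


som = {
    "tag": "html",
    "children": [
        {
            "tag": "article",
            "children": [
                {"tag": "h1", "children": ["Article Title"]},
                {"tag": "p", "children": ["Article content..."]},
            ],
        }
    ],
}

print(" ".join(iter_text(som)))
```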
Text Mode
Clean, readable text extraction:
fetcher = PlasmateFetcher(output_format="text")
result = fetcher.fetch("https://example.com")
print(result.content) # Plain text content
Parse Node Extract Types
The PlasmateParseNode supports several extraction types:
- links - Extract all links with href and text
- images - Extract all images with src and alt
- text - Extract all text content
- headings - Extract headings with level and text
- tables - Extract tables as nested arrays
parse_node = PlasmateParseNode(extract_type="headings")
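To illustrate what a headings extraction could look like over a SOM tree, here is a rough sketch. It is not the actual `PlasmateParseNode` implementation: the `level`/`text` output keys and the `h1`-`h6` tag convention are assumptions based on the description above and the sample SOM.

```python
import re
from typing import Any, Dict, List

HEADING_TAG = re.compile(r"^h([1-6])$")


def text_of(node: Any) -> str:
    """Concatenate all text leaves under a node."""
    if isinstance(node, str):
        return node
    if isinstance(node, dict):
        return "".join(text_of(child) for child in node.get("children", []))
    return ""


def extract_headings(node: Any) -> List[Dict[str, Any]]:
    """Collect {"level", "text"} dicts for every h1-h6 node, in document order."""
    found: List[Dict[str, Any]] = []
    if isinstance(node, dict):
        match = HEADING_TAG.match(node.get("tag", ""))
        if match:
            found.append({"level": int(match.group(1)), "text": text_of(node)})
        else:
            for child in node.get("children", []):
                found.extend(extract_headings(child))
    return found


som = {
    "tag": "article",
    "children": [
        {"tag": "h1", "children": ["Article Title"]},
        {"tag": "p", "children": ["Article content..."]},
        {"tag": "h2", "children": ["Background"]},
    ],
}

print(extract_headings(som))
```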
Configuration Options
PlasmateFetcher
| Option | Type | Default | Description |
|---|---|---|---|
| plasmate_path | str | auto | Path to plasmate binary |
| output_format | str | "som" | "som" or "text" |
| headers | dict | {} | HTTP headers |
| timeout | int | 30 | Request timeout (seconds) |
PlasmateScraper
| Option | Type | Default | Description |
|---|---|---|---|
| prompt | str | required | Extraction instructions |
| source | str | required | URL to scrape |
| config | dict | {} | LLM configuration |
| schema | Type | None | Pydantic model for output |
| output_format | str | "som" | Plasmate output format |
| headers | dict | None | HTTP headers |
Requirements
- Python 3.9+
- Plasmate binary installed
- ScrapeGraphAI
- LLM API access (OpenAI, Anthropic, or Ollama)
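The first two requirements can be verified up front with a small pre-flight check. This sketch assumes the binary is named `plasmate` and is discoverable on `PATH`. `check_requirements` is a hypothetical helper, not something the package provides.

```python
import shutil
import sys


def check_requirements(binary: str = "plasmate") -> dict:
    """Report whether the Python version and the Plasmate binary look usable."""
    return {
        "python_ok": sys.version_info >= (3, 9),
        "binary_path": shutil.which(binary),  # None when not found on PATH
    }


status = check_requirements()
print(status)
```

If `binary_path` is `None`, either install the binary or pass its location explicitly through the `plasmate_path` option.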
License
MIT