ScrapeGraphAI + Plasmate

April 12, 2026

Use Plasmate's Semantic Object Model (SOM) with ScrapeGraphAI for 10-100x token compression when scraping with LLMs.

What it does

Plasmate is a browser engine for AI agents that converts raw HTML into a compressed Semantic Object Model. This integration replaces ScrapeGraphAI's default HTML fetcher with Plasmate, which cuts token usage substantially while keeping the semantic information that LLM extraction needs.

Before (raw HTML): ~50,000 tokens for a typical web page
After (Plasmate SOM): ~500-5,000 tokens for the same page

Installation

# Install the integration
pip install scrapegraphai-plasmate

# Install Plasmate (Rust binary)
cargo install plasmate
# Or download from: https://github.com/nicholasharring/plasmate/releases
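
Before running the examples, it can help to confirm that the binary is reachable. The following is a minimal sketch using only the standard library. `find_plasmate` is a hypothetical helper, not part of the integration, and it assumes the binary is named `plasmate` and sits on your PATH; how the integration itself locates the binary is not covered here.

```python
import shutil
from typing import Optional


def find_plasmate(binary_name: str = "plasmate") -> Optional[str]:
    """Return the path of the binary if it is found on PATH, else None."""
    return shutil.which(binary_name)


path = find_plasmate()
if path is None:
    print("plasmate not found on PATH; install it or set plasmate_path explicitly")
else:
    print(f"plasmate found at {path}")
```

If the lookup fails, the `plasmate_path` option of PlasmateFetcher can point at the binary directly.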

Usage

Basic Usage

from scrapegraphai_plasmate import PlasmateScraper

scraper = PlasmateScraper(
    prompt="Extract all product names and prices",
    source="https://shop.example.com/products",
    config={
        "llm": {
            "model": "gpt-4",
            "api_key": "your-api-key",  # Or use OPENAI_API_KEY env var
        }
    },
)

result = scraper.run()
print(result)
print(f"Token savings: {scraper.token_savings:.1f}%")

With Anthropic Claude

from scrapegraphai_plasmate import PlasmateScraper

scraper = PlasmateScraper(
    prompt="Summarize the main article",
    source="https://news.example.com/article",
    config={
        "llm": {
            "model": "claude-3-5-sonnet-20241022",
            "api_key": "your-api-key",  # Or use ANTHROPIC_API_KEY env var
        }
    },
)

result = scraper.run()
print(result)

With Ollama (Local)

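from scrapegraphai_plasmate import PlasmateScraper

scraper = PlasmateScraper(
    prompt="Extract contact information",
    source="https://company.example.com/contact",
    config={
        "llm": {
            "model": "ollama/llama3",
            "ollama": True,  # Use a local Ollama model; no API key needed
        }
    },
)

result = scraper.run()
print(result)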

Using the Fetcher Directly

from scrapegraphai_plasmate import PlasmateFetcher

fetcher = PlasmateFetcher(output_format="som")
result = fetcher.fetch("https://example.com")

print(result.content)  # SOM JSON
print(f"Estimated tokens: {result.som_tokens}")

Custom Graph Nodes

from scrapegraphai_plasmate import PlasmateFetchNode, PlasmateParseNode

# Create nodes for custom ScrapeGraphAI graphs
fetch_node = PlasmateFetchNode()
parse_node = PlasmateParseNode(extract_type="links")

# Use in your graph
state = {"url": "https://example.com"}
state = fetch_node(state)
state = parse_node(state)

print(state["parsed"])  # List of all links
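
The nodes above are called with a state dict and return the updated state, so a plain callable can follow the same convention. The sketch below shows one such post-processing step. `DedupeLinksNode` is hypothetical and not part of the integration, and the shape of `state["parsed"]` (a list of dicts with `href` and `text` keys) is an assumption based on the description of the `links` extract type; the real structure may differ.

```python
from typing import Any, Dict, List


class DedupeLinksNode:
    """Hypothetical node: drops links whose href has already been seen."""

    def __call__(self, state: Dict[str, Any]) -> Dict[str, Any]:
        seen = set()
        unique: List[Dict[str, str]] = []
        for link in state.get("parsed", []):
            href = link.get("href")
            if href and href not in seen:
                seen.add(href)
                unique.append(link)
        # Return a new dict so the caller's state is not mutated in place.
        return {**state, "parsed": unique}


# Stand-in state, shaped like the ASSUMED output of the links parse node.
state = {
    "url": "https://example.com",
    "parsed": [
        {"href": "/about", "text": "About"},
        {"href": "/about", "text": "About us"},
        {"href": "/contact", "text": "Contact"},
    ],
}
state = DedupeLinksNode()(state)
print([link["href"] for link in state["parsed"]])
```

Keeping each step as a small callable on the state dict makes it easy to slot extra processing between the fetch and parse nodes and whatever consumes the result.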

Batch Scraping

from scrapegraphai_plasmate import PlasmateBatchScraper

scraper = PlasmateBatchScraper(
    prompt="Extract the page title and main heading",
    sources=[
        "https://site1.example.com",
        "https://site2.example.com",
        "https://site3.example.com",
    ],
    config={"llm": {"model": "gpt-4"}},
)

results = scraper.run()
for url, data in results.items():
    print(f"{url}: {data['result']}")

Token Savings Comparison

Page Type	Raw HTML (tokens)	Plasmate SOM (tokens)	Savings
News article	45,000	2,500	94%
E-commerce product	80,000	4,000	95%
Documentation	30,000	1,500	95%
Social media	120,000	8,000	93%
Blog post	25,000	1,200	95%
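
The Savings column follows directly from the two token counts. As a quick check, this sketch recomputes it and also expresses each row as a compression ratio. The figures are copied from the table above; the function names are illustrative only.

```python
# Token counts copied from the table above: (raw HTML tokens, SOM tokens).
PAGES = {
    "News article": (45_000, 2_500),
    "E-commerce product": (80_000, 4_000),
    "Documentation": (30_000, 1_500),
    "Social media": (120_000, 8_000),
    "Blog post": (25_000, 1_200),
}


def savings_percent(raw_tokens: int, som_tokens: int) -> float:
    """Share of tokens removed, as a percentage of the raw count."""
    return (1 - som_tokens / raw_tokens) * 100


def compression_ratio(raw_tokens: int, som_tokens: int) -> float:
    """How many times smaller the SOM is than the raw HTML."""
    return raw_tokens / som_tokens


for name, (raw, som) in PAGES.items():
    print(
        f"{name}: {savings_percent(raw, som):.0f}% saved, "
        f"{compression_ratio(raw, som):.1f}x smaller"
    )
```

For these rows the ratio works out to roughly 15x to 21x, which sits inside the 10-100x range quoted at the top.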

Output Formats

SOM (Semantic Object Model) - Default

Structured JSON representation of the page:

{
  "tag": "html",
  "children": [
    {
      "tag": "article",
      "children": [
        {"tag": "h1", "children": ["Article Title"]},
        {"tag": "p", "children": ["Article content..."]}
      ]
    }
  ]
}
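
Because the SOM is plain JSON, it can be walked with ordinary recursion. The sketch below collects the text leaves from the sample above. It assumes every node is either a string or a dict with an optional `children` list, as in that sample; real SOM output may carry additional fields. `collect_text` is an illustrative helper, not part of the integration.

```python
import json
from typing import Any, List

# The sample SOM document from above.
SOM_JSON = """
{
  "tag": "html",
  "children": [
    {
      "tag": "article",
      "children": [
        {"tag": "h1", "children": ["Article Title"]},
        {"tag": "p", "children": ["Article content..."]}
      ]
    }
  ]
}
"""


def collect_text(node: Any) -> List[str]:
    """Return all string leaves under a node, in document order."""
    if isinstance(node, str):
        return [node]
    if not isinstance(node, dict):
        return []
    texts: List[str] = []
    for child in node.get("children", []):
        texts.extend(collect_text(child))
    return texts


som = json.loads(SOM_JSON)
print(collect_text(som))  # ['Article Title', 'Article content...']
```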

Text Mode

Clean, readable text extraction:

from scrapegraphai_plasmate import PlasmateFetcher

fetcher = PlasmateFetcher(output_format="text")
result = fetcher.fetch("https://example.com")
print(result.content)  # Plain text content

Parse Node Extract Types

The PlasmateParseNode supports several extraction types:

links - Extract all links with href and text
images - Extract all images with src and alt
text - Extract all text content
headings - Extract headings with level and text
tables - Extract tables as nested arrays

parse_node = PlasmateParseNode(extract_type="headings")
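
To make the `headings` type concrete, here is a rough sketch of what such an extraction could look like over a SOM tree. This is not the node's actual implementation; `extract_headings` and `node_text` are hypothetical, and the tree shape (string leaves, dicts with `tag` and `children`) and the `level`/`text` output keys are assumptions drawn from the descriptions above.

```python
import re
from typing import Any, Dict, List

HEADING_TAG = re.compile(r"^h([1-6])$")


def node_text(node: Any) -> str:
    """Concatenate all string leaves under a node."""
    if isinstance(node, str):
        return node
    if isinstance(node, dict):
        return "".join(node_text(child) for child in node.get("children", []))
    return ""


def extract_headings(node: Any) -> List[Dict[str, Any]]:
    """Collect h1-h6 nodes as {'level': int, 'text': str}, in document order."""
    if not isinstance(node, dict):
        return []
    match = HEADING_TAG.match(node.get("tag") or "")
    if match:
        return [{"level": int(match.group(1)), "text": node_text(node).strip()}]
    found: List[Dict[str, Any]] = []
    for child in node.get("children", []):
        found.extend(extract_headings(child))
    return found


# Small stand-in tree in the assumed SOM shape.
som = {
    "tag": "html",
    "children": [
        {"tag": "h1", "children": ["Article Title"]},
        {"tag": "section", "children": [
            {"tag": "h2", "children": ["Details"]},
            {"tag": "p", "children": ["Body text"]},
        ]},
    ],
}
print(extract_headings(som))
```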

Configuration Options

PlasmateFetcher

Option	Type	Default	Description
`plasmate_path`	str	auto	Path to plasmate binary
`output_format`	str	"som"	"som" or "text"
`headers`	dict	{}	HTTP headers
`timeout`	int	30	Request timeout (seconds)
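
When these options come from user input or a config file, it can be useful to validate them before constructing the fetcher. A minimal sketch, assuming the option names and defaults listed in the table: `fetcher_options` is a hypothetical helper, and using `None` to stand in for the `auto` default of `plasmate_path` is an assumption.

```python
from typing import Any, Dict

# Defaults copied from the table above; None stands in for "auto" (assumption).
FETCHER_DEFAULTS: Dict[str, Any] = {
    "plasmate_path": None,
    "output_format": "som",
    "headers": {},
    "timeout": 30,
}
VALID_FORMATS = {"som", "text"}


def fetcher_options(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the defaults and reject obviously invalid values."""
    unknown = set(overrides) - set(FETCHER_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    options = {**FETCHER_DEFAULTS, **overrides}
    # Copy the headers so callers never share the default dict.
    options["headers"] = dict(options["headers"])
    if options["output_format"] not in VALID_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(VALID_FORMATS)}")
    if options["timeout"] <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return options


options = fetcher_options(
    output_format="text",
    timeout=60,
    headers={"User-Agent": "my-bot/1.0"},
)
print(options)
# With the integration installed, the result could then be passed on, e.g.:
# PlasmateFetcher(**{k: v for k, v in options.items() if v is not None})
```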

PlasmateScraper

Option	Type	Default	Description
`prompt`	str	required	Extraction instructions
`source`	str	required	URL to scrape
`config`	dict	{}	LLM configuration
`schema`	Type	None	Optional Pydantic model class for structured output
`output_format`	str	"som"	Plasmate output format
`headers`	dict	None	HTTP headers

Requirements

Python 3.9+
Plasmate binary installed
ScrapeGraphAI
LLM API access (OpenAI, Anthropic, or Ollama)

License

MIT