ScrapeGraphAI + Plasmate

April 12, 2026

Use Plasmate's Semantic Object Model (SOM) with ScrapeGraphAI for 10-100x token compression when scraping with LLMs.

What it does

Plasmate is a browser engine for AI agents that converts raw HTML into a compressed Semantic Object Model. This integration replaces ScrapeGraphAI's default HTML fetcher with Plasmate, cutting token usage by roughly 93-95% in the comparison below while preserving the semantic information needed for LLM extraction.

Before (raw HTML): ~50,000 tokens for a typical web page
After (Plasmate SOM): ~500-5,000 tokens for the same page

Installation

# Install the integration
pip install scrapegraphai-plasmate

# Install Plasmate (Rust binary)
cargo install plasmate
# Or download from: https://github.com/nicholasharring/plasmate/releases
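
After installing, it can be useful to confirm that the binary is discoverable before running a scrape. The following is a minimal sketch using only the Python standard library; `find_plasmate` is a hypothetical helper name for illustration and is not part of the integration. It assumes the binary is named `plasmate`, as in the install command above.

```python
import shutil


def find_plasmate(binary_name: str = "plasmate"):
    """Return the absolute path of the binary if it is on PATH, else None."""
    return shutil.which(binary_name)


path = find_plasmate()
if path is None:
    # cargo typically installs binaries into ~/.cargo/bin
    print("plasmate not found on PATH; check that ~/.cargo/bin is included")
else:
    print(f"plasmate found at {path}")
```

If the binary lives outside PATH, the `plasmate_path` option listed under Configuration Options is the documented way to point the fetcher at it.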

Usage

Basic Usage

from scrapegraphai_plasmate import PlasmateScraper

scraper = PlasmateScraper(
    prompt="Extract all product names and prices",
    source="https://shop.example.com/products",
    config={
        "llm": {
            "model": "gpt-4",
            "api_key": "your-api-key",  # Or use OPENAI_API_KEY env var
        }
    },
)

result = scraper.run()
print(result)
print(f"Token savings: {scraper.token_savings:.1f}%")

With Anthropic Claude

from scrapegraphai_plasmate import PlasmateScraper

scraper = PlasmateScraper(
    prompt="Summarize the main article",
    source="https://news.example.com/article",
    config={
        "llm": {
            "model": "claude-3-5-sonnet-20241022",
            "api_key": "your-api-key",  # Or use ANTHROPIC_API_KEY env var
        }
    },
)

With Ollama (Local)

from scrapegraphai_plasmate import PlasmateScraper

scraper = PlasmateScraper(
    prompt="Extract contact information",
    source="https://company.example.com/contact",
    config={
        "llm": {
            "model": "ollama/llama3",
            "ollama": True,
        }
    },
)

Using the Fetcher Directly

from scrapegraphai_plasmate import PlasmateFetcher

fetcher = PlasmateFetcher(output_format="som")
result = fetcher.fetch("https://example.com")

print(result.content)  # SOM JSON
print(f"Estimated tokens: {result.som_tokens}")

Custom Graph Nodes

from scrapegraphai_plasmate import PlasmateFetchNode, PlasmateParseNode

# Create nodes for custom ScrapeGraphAI graphs
fetch_node = PlasmateFetchNode()
parse_node = PlasmateParseNode(extract_type="links")

# Use in your graph
state = {"url": "https://example.com"}
state = fetch_node(state)
state = parse_node(state)

print(state["parsed"])  # List of all links
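
The example above treats each node as a callable that takes a state dict and returns an updated one, so a custom step can follow the same convention. Here is a minimal sketch of such a step. `ResolveLinksNode` and the `links` state key are hypothetical names, and the sketch assumes that the parsed links are a list of dicts with `href` and `text` keys; the actual shape produced by PlasmateParseNode may differ.

```python
from urllib.parse import urljoin, urlparse


class ResolveLinksNode:
    """Hypothetical node: resolves relative hrefs against state["url"]."""

    def __call__(self, state: dict) -> dict:
        base_url = state["url"]
        resolved = []
        for link in state.get("parsed", []):
            href = urljoin(base_url, link["href"])
            # Keep only http(s) links; this drops mailto:, javascript:, etc.
            if urlparse(href).scheme in ("http", "https"):
                resolved.append({"href": href, "text": link["text"]})
        # Return a new dict so the incoming state is not mutated.
        return {**state, "links": resolved}


# Stand-in for the state produced by the fetch and parse nodes (assumed shape).
state = {
    "url": "https://example.com/docs/",
    "parsed": [
        {"href": "intro.html", "text": "Intro"},
        {"href": "/about", "text": "About"},
        {"href": "mailto:hi@example.com", "text": "Email"},
    ],
}
state = ResolveLinksNode()(state)
print(state["links"])
```

Returning a fresh dict rather than mutating the input keeps each step easy to test in isolation.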

Batch Scraping

from scrapegraphai_plasmate import PlasmateBatchScraper

scraper = PlasmateBatchScraper(
    prompt="Extract the page title and main heading",
    sources=[
        "https://site1.example.com",
        "https://site2.example.com",
        "https://site3.example.com",
    ],
    config={"llm": {"model": "gpt-4"}},
)

results = scraper.run()
for url, data in results.items():
    print(f"{url}: {data['result']}")

Token Savings Comparison

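| Page Type          | Raw HTML (tokens) | Plasmate SOM (tokens) | Savings |
|--------------------|-------------------|-----------------------|---------|
| News article       | 45,000            | 2,500                 | 94%     |
| E-commerce product | 80,000            | 4,000                 | 95%     |
| Documentation      | 30,000            | 1,500                 | 95%     |
| Social media       | 120,000           | 8,000                 | 93%     |
| Blog post          | 25,000            | 1,200                 | 95%     |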
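
The savings column follows directly from the two token counts. As a quick check, this short sketch recomputes the percentage and the equivalent compression ratio for each row; the helper names are illustrative only and the figures are copied from the table.

```python
def savings_percent(raw_tokens: int, som_tokens: int) -> float:
    """Percentage of tokens removed relative to the raw HTML count."""
    return (1 - som_tokens / raw_tokens) * 100


def compression_ratio(raw_tokens: int, som_tokens: int) -> float:
    """How many times smaller the SOM is than the raw HTML."""
    return raw_tokens / som_tokens


# (raw HTML tokens, Plasmate SOM tokens) per page type, from the table.
rows = {
    "News article": (45_000, 2_500),
    "E-commerce product": (80_000, 4_000),
    "Documentation": (30_000, 1_500),
    "Social media": (120_000, 8_000),
    "Blog post": (25_000, 1_200),
}

for page_type, (raw, som) in rows.items():
    print(
        f"{page_type}: {savings_percent(raw, som):.1f}% saved, "
        f"{compression_ratio(raw, som):.1f}x smaller"
    )
```

For these five rows the ratio works out to between about 15x and 21x, which sits inside the 10-100x range quoted at the top.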

Output Formats

SOM (Semantic Object Model) - Default

Structured JSON representation of the page:

{
  "tag": "html",
  "children": [
    {
      "tag": "article",
      "children": [
        {"tag": "h1", "children": ["Article Title"]},
        {"tag": "p", "children": ["Article content..."]}
      ]
    }
  ]
}
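
Because the structure is a plain tree of `tag` and `children`, with strings as leaves, it can be walked with a few lines of recursion. The sketch below assumes only the shape shown in the sample above; real SOM output may carry additional fields that this traversal simply ignores.

```python
import json

som_json = """
{
  "tag": "html",
  "children": [
    {
      "tag": "article",
      "children": [
        {"tag": "h1", "children": ["Article Title"]},
        {"tag": "p", "children": ["Article content..."]}
      ]
    }
  ]
}
"""


def iter_text(node):
    """Yield text leaves depth-first, in document order."""
    if isinstance(node, str):
        yield node
        return
    for child in node.get("children", []):
        yield from iter_text(child)


som = json.loads(som_json)
print(list(iter_text(som)))
```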

Text Mode

Clean, readable text extraction:

from scrapegraphai_plasmate import PlasmateFetcher

fetcher = PlasmateFetcher(output_format="text")
result = fetcher.fetch("https://example.com")
print(result.content)  # Plain text content

Parse Node Extract Types

The PlasmateParseNode supports several extraction types:

  • links - Extract all links with href and text
  • images - Extract all images with src and alt
  • text - Extract all text content
  • headings - Extract headings with level and text
  • tables - Extract tables as nested arrays

from scrapegraphai_plasmate import PlasmateParseNode

parse_node = PlasmateParseNode(extract_type="headings")
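
To make the `headings` type concrete, here is a rough sketch of how heading extraction could be done over the `tag`/`children` shape shown under Output Formats. It is not PlasmateParseNode's actual implementation, and the `level`/`text` output keys are an assumption based on the description in the list above.

```python
import re

HEADING_TAG = re.compile(r"^h([1-6])$")


def node_text(node) -> str:
    """Concatenate all text leaves beneath a node."""
    if isinstance(node, str):
        return node
    return "".join(node_text(child) for child in node.get("children", []))


def extract_headings(node) -> list:
    """Collect h1-h6 nodes in document order as {level, text} dicts."""
    if isinstance(node, str):
        return []
    match = HEADING_TAG.match(node.get("tag", ""))
    if match:
        return [{"level": int(match.group(1)), "text": node_text(node)}]
    found = []
    for child in node.get("children", []):
        found.extend(extract_headings(child))
    return found


# Sample tree in the assumed SOM shape.
som = {
    "tag": "html",
    "children": [
        {
            "tag": "article",
            "children": [
                {"tag": "h1", "children": ["Article Title"]},
                {"tag": "p", "children": ["Article content..."]},
                {"tag": "h2", "children": ["Details"]},
            ],
        }
    ],
}
print(extract_headings(som))
```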

Configuration Options

PlasmateFetcher

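| Option        | Type | Default | Description               |
|---------------|------|---------|---------------------------|
| plasmate_path | str  | auto    | Path to plasmate binary   |
| output_format | str  | "som"   | "som" or "text"           |
| headers       | dict | {}      | HTTP headers              |
| timeout       | int  | 30      | Request timeout (seconds) |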

PlasmateScraper

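| Option        | Type | Default  | Description               |
|---------------|------|----------|---------------------------|
| prompt        | str  | required | Extraction instructions   |
| source        | str  | required | URL to scrape             |
| config        | dict | {}       | LLM configuration         |
| schema        | Type | None     | Pydantic model for output |
| output_format | str  | "som"    | Plasmate output format    |
| headers       | dict | None     | HTTP headers              |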
Requirements

  • Python 3.9+
  • Plasmate binary installed
  • ScrapeGraphAI
  • LLM API access (OpenAI, Anthropic, or Ollama)

License

MIT