crawl4ai-plasmate

April 12, 2026

Crawl4AI compatibility layer for Plasmate - get 10-100x token compression for AI web crawling.

Why?

Crawl4AI (51K GitHub stars) is the most popular AI-friendly web crawler. Plasmate is a browser engine that converts HTML to structured JSON, the Semantic Object Model (SOM), which takes far fewer tokens than the raw HTML.

This integration lets you use Plasmate as a drop-in backend for Crawl4AI, dramatically reducing your LLM costs while maintaining the same API.

Quick Start

# Install the package
pip install crawl4ai-plasmate

# Make sure plasmate is installed
# See: https://github.com/nicholasoxford/plasmate
cargo install plasmate

import asyncio

from crawl4ai_plasmate import PlasmateCrawler

async def main():
    async with PlasmateCrawler() as crawler:
        result = await crawler.arun("https://example.com")

        # Crawl4AI-compatible properties
        print(result.markdown)
        print(result.links)
        print(result.media)

        # Plasmate-specific benefits
        print(result.som)            # Structured semantic data
        print(result.token_savings)  # e.g., 0.94 = 94% fewer tokens

# "async with" is only valid inside a coroutine, so run it through asyncio
asyncio.run(main())

Migration from Crawl4AI

Before (Crawl4AI)

import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(headless=True, verbose=True)
run_config = CrawlerRunConfig(word_count_threshold=50)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=run_config
        )
        print(result.markdown)

asyncio.run(main())

After (Plasmate)

import asyncio

from crawl4ai_plasmate import PlasmateCrawler, from_crawl4ai

# Option 1: Direct usage
async def direct():
    async with PlasmateCrawler(verbose=True) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

# Option 2: Migration helper (takes the old settings as plain dicts)
async def migrated():
    crawler = from_crawl4ai(
        browser_config={"headless": True, "verbose": True},
        run_config={"word_count_threshold": 50}
    )
    async with crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(direct())
asyncio.run(migrated())

Token Savings

Plasmate's SOM (Semantic Object Model) provides 10-100x token compression compared to raw HTML:

import asyncio

from crawl4ai_plasmate import PlasmateCrawler, compare_token_usage

async def main():
    async with PlasmateCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com")

        # token_savings is a fraction between 0.0 and 1.0
        print(f"Token savings: {result.token_savings:.1%}")
        # Example output: Token savings: 94.2%

asyncio.run(main())
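
A savings fraction and an "N times smaller" figure describe the same thing in two ways: a savings fraction s corresponds to a compression factor of 1 / (1 - s). The sketch below shows the conversion; `compression_factor` is a hypothetical helper written for illustration, not a function exported by crawl4ai-plasmate.

```python
def compression_factor(token_savings: float) -> float:
    """Convert a savings fraction (0.0 to 1.0) into an 'N times smaller' factor.

    Hypothetical helper for illustration; not part of crawl4ai-plasmate.
    """
    if not 0.0 <= token_savings < 1.0:
        raise ValueError("token_savings must be in [0.0, 1.0)")
    return 1.0 / (1.0 - token_savings)


print(f"{compression_factor(0.94):.1f}x")  # prints 16.7x
print(f"{compression_factor(0.99):.1f}x")  # prints 100.0x
```

Read this way, 90% savings is a 10x reduction and 99% savings is a 100x reduction.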

Cost Comparison

Scenario	Raw HTML	Plasmate SOM	Savings
News article	~8,000 tokens	~500 tokens	94%
Product page	~15,000 tokens	~1,200 tokens	92%
Documentation	~12,000 tokens	~800 tokens	93%

At $0.01 per 1K input tokens (the GPT-4 rate assumed here), processing 10,000 pages per month at the per-page token counts above costs roughly:

Raw HTML: ~$800-1,500/month
Plasmate SOM: ~$50-120/month
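
The monthly figures follow directly from the per-page token counts in the table. As a rough check, the sketch below redoes the arithmetic; the constant and function names are illustrative only, and the price is the $0.01 per 1K tokens assumed in the text.

```python
PRICE_PER_1K_TOKENS = 0.01  # USD, the rate assumed in the text
PAGES_PER_MONTH = 10_000

# (raw HTML tokens, SOM tokens) per page, taken from the table above
SCENARIOS = {
    "News article": (8_000, 500),
    "Product page": (15_000, 1_200),
    "Documentation": (12_000, 800),
}


def monthly_cost(tokens_per_page: int) -> float:
    """Monthly cost in USD for PAGES_PER_MONTH pages of the given size."""
    return tokens_per_page * PAGES_PER_MONTH / 1_000 * PRICE_PER_1K_TOKENS


for name, (raw, som) in SCENARIOS.items():
    savings = 1 - som / raw
    print(
        f"{name}: raw ${monthly_cost(raw):,.0f}, "
        f"SOM ${monthly_cost(som):,.0f}, savings {savings:.0%}"
    )
```

The low and high ends of each quoted range are the cheapest and most expensive scenarios in the table (news articles and product pages).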

API Reference

PlasmateCrawler

from crawl4ai_plasmate import PlasmateCrawler

crawler = PlasmateCrawler(
    plasmate_path="plasmate",  # Path to plasmate binary
    timeout=30,                 # Request timeout in seconds
    headers={"Auth": "..."},    # Default headers
    verbose=False,              # Enable verbose output
)

Methods

arun(url, **kwargs) - Crawl a single URL
arun_many(urls, concurrency=5, **kwargs) - Crawl multiple URLs concurrently
start() / close() - Lifecycle methods (or use async context manager)
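
To illustrate what a `concurrency` limit means in practice, here is a minimal sketch of bounded concurrency with `asyncio.Semaphore`. It is not the package's implementation: `fake_fetch` and `run_many` are hypothetical stand-ins so the example runs without Plasmate installed.

```python
import asyncio

in_flight = 0  # fetches currently running
peak = 0       # highest number of simultaneous fetches observed


async def fake_fetch(url: str) -> str:
    """Stand-in for a real crawl: sleeps briefly, then echoes the URL."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return f"fetched {url}"


async def run_many(urls, concurrency=5):
    """Run fake_fetch over urls with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with semaphore:  # waits here once the limit is reached
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(12)]
results = asyncio.run(run_many(urls, concurrency=5))
print(len(results), results[0], "peak in flight:", peak)
```

A semaphore keeps the code simple: all tasks are created up front, but only five at a time get past the `async with` line.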

PlasmateResult

result = await crawler.arun("https://example.com")

# Crawl4AI-compatible properties
result.html          # Reconstructed HTML
result.cleaned_html  # Same as html (already clean)
result.markdown      # Markdown representation
result.text          # Plain text
result.links         # List of LinkItem objects
result.media         # List of MediaItem objects
result.metadata      # Page metadata dict

# Plasmate-specific properties
result.som           # Full SOM structure (dict)
result.som_json      # SOM as formatted JSON string
result.token_savings # Compression ratio (0.0 to 1.0)
result.success       # Whether crawl succeeded
result.error         # Error message if failed

# Element queries
result.get_element_by_id("main")
result.get_elements_by_class("article")
result.get_elements_by_tag("h1")

Extraction Strategies

LLMExtractionStrategy

Uses SOM instead of raw HTML for LLM extraction, dramatically reducing token usage:

import asyncio
from dataclasses import dataclass

from crawl4ai_plasmate import PlasmateCrawler, LLMExtractionStrategy

@dataclass
class Product:
    name: str
    price: str
    description: str

strategy = LLMExtractionStrategy(
    provider="openai",
    model="gpt-4",
    schema=Product,
    instruction="Extract product details from this page"
)

async def main():
    async with PlasmateCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com/product",
            extraction_strategy=strategy
        )
        print(result.extracted)

asyncio.run(main())

JsonCssExtractionStrategy

Extract data using CSS selectors directly on SOM (no LLM needed):

import asyncio

from crawl4ai_plasmate import PlasmateCrawler, JsonCssExtractionStrategy

strategy = JsonCssExtractionStrategy(
    selectors={
        "title": "h1.product-title",
        "price": ".price-value",
        "description": "#product-description",
    }
)

async def main():
    async with PlasmateCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com/product",
            extraction_strategy=strategy
        )
        print(result.extracted)
        # {"title": "...", "price": "$99", "description": "..."}

asyncio.run(main())

Feature Matrix

Feature	Crawl4AI	Plasmate	Notes
Basic crawling	Yes	Yes	Same API
Markdown extraction	Yes	Yes	Same output
Link extraction	Yes	Yes	Same output
Media extraction	Yes	Yes	Same output
Custom headers	Yes	Yes	Same API
CSS selectors	Yes	Yes	Same API
LLM extraction	Yes	Yes	10-100x fewer tokens
Token compression	No	Yes	Key advantage
JavaScript execution	Yes	Partial	CDP mode only
Screenshots	Yes	No	Not supported
PDF generation	Yes	No	Not supported
Caching	Yes	Soon	Coming soon
Proxies	Yes	Soon	Coming soon

Configuration Mapping

Crawl4AI Setting	Plasmate Equivalent	Notes
`browser_type`	-	Ignored (own engine)
`headless`	-	Always headless
`verbose`	`verbose`	Direct mapping
`headers`	`headers`	Direct mapping
`user_agent`	`headers["User-Agent"]`	Use headers dict
`timeout`	`timeout`	Direct mapping
`css_selector`	`css_selector`	Direct mapping
`extraction_strategy`	`extraction_strategy`	Use migration helper
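
The mapping in the table can be sketched as a small translation function. This is an illustration of the table only, not the logic of `from_crawl4ai`; `map_crawl4ai_settings` is a hypothetical name, and `extraction_strategy` is left out because the table routes it through the migration helper.

```python
def map_crawl4ai_settings(settings: dict) -> dict:
    """Translate Crawl4AI-style settings per the table above (hypothetical helper)."""
    ignored = {"browser_type", "headless"}  # no Plasmate equivalent
    direct = {"verbose", "headers", "timeout", "css_selector"}

    mapped = {}
    for key, value in settings.items():
        if key in ignored:
            continue
        if key in direct:
            mapped[key] = value

    # user_agent becomes a User-Agent entry in the headers dict
    if "user_agent" in settings:
        headers = dict(mapped.get("headers", {}))
        headers["User-Agent"] = settings["user_agent"]
        mapped["headers"] = headers
    return mapped


print(map_crawl4ai_settings({
    "browser_type": "chromium",
    "headless": True,
    "verbose": True,
    "user_agent": "my-bot/1.0",
    "timeout": 30,
}))
# prints {'verbose': True, 'timeout': 30, 'headers': {'User-Agent': 'my-bot/1.0'}}
```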

Performance

Benchmarks comparing Crawl4AI (Playwright) vs Plasmate:

Metric	Crawl4AI	Plasmate	Improvement
Time per page	~2-5s	~0.1-0.5s	5-20x faster
Memory usage	~500MB	~50MB	10x less
Token output	100%	6-10%	10-16x smaller
LLM cost	$1.00	$0.06-0.10	10-16x cheaper

Requirements

Python 3.9+
Plasmate binary (Rust)

Installing Plasmate

# From source
git clone https://github.com/nicholasoxford/plasmate
cd plasmate
cargo build --release
# Binary at ./target/release/plasmate

# Optionally add the build directory to PATH
export PATH="$PATH:/path/to/plasmate/target/release"

License

MIT License - see LICENSE for details.