crawl4ai-plasmate

April 12, 2026

Crawl4AI compatibility layer for Plasmate - get 10-100x token compression for AI web crawling.

Why?

Crawl4AI (51K GitHub stars) is one of the most popular AI-friendly web crawlers. Plasmate is a browser engine that converts HTML to structured JSON, the Semantic Object Model (SOM), which uses far fewer tokens than the raw HTML.

This integration lets you use Plasmate as a drop-in backend for Crawl4AI, dramatically reducing your LLM costs while maintaining the same API.

Quick Start

# Install the package
pip install crawl4ai-plasmate

# Make sure plasmate is installed
# See: https://github.com/nicholasoxford/plasmate
cargo install plasmate

import asyncio
from crawl4ai_plasmate import PlasmateCrawler

async def main():
    async with PlasmateCrawler() as crawler:
        result = await crawler.arun("https://example.com")

        # Crawl4AI-compatible properties
        print(result.markdown)
        print(result.links)
        print(result.media)

        # Plasmate-specific benefits
        print(result.som)           # Structured semantic data
        print(result.token_savings) # e.g., 0.94 = 94% fewer tokens

# `async with` and `await` are only valid inside a coroutine.
# Later snippets omit this wrapper and assume they run inside one.
asyncio.run(main())

Migration from Crawl4AI

Before (Crawl4AI)

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(headless=True, verbose=True)
run_config = CrawlerRunConfig(word_count_threshold=50)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        "https://example.com",
        config=run_config
    )
    print(result.markdown)

After (Plasmate)

from crawl4ai_plasmate import PlasmateCrawler

# Option 1: Direct usage
async with PlasmateCrawler(verbose=True) as crawler:
    result = await crawler.arun("https://example.com")
    print(result.markdown)

# Option 2: Migration helper
from crawl4ai_plasmate import from_crawl4ai

crawler = from_crawl4ai(
    browser_config={"headless": True, "verbose": True},
    run_config={"word_count_threshold": 50}
)
async with crawler:
    result = await crawler.arun("https://example.com")

Token Savings

Plasmate's SOM (Semantic Object Model) provides 10-100x token compression compared to raw HTML:

from crawl4ai_plasmate import PlasmateCrawler, compare_token_usage

async with PlasmateCrawler() as crawler:
    result = await crawler.arun("https://news.ycombinator.com")
    
    # Fraction of tokens saved relative to raw HTML
    print(f"Token savings: {result.token_savings:.1%}")
    # Output: Token savings: 94.2%
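
To relate `token_savings` to the "Nx compression" figures quoted in this README, the sketch below assumes that `token_savings` is defined as `1 - som_tokens / html_tokens`. That definition is consistent with "0.94 = 94% fewer tokens" in the Quick Start, but it has not been checked against the package source. Both helper names are illustrative and are not part of `crawl4ai_plasmate`.

```python
def token_savings(html_tokens: int, som_tokens: int) -> float:
    """Fraction of tokens saved, assuming savings = 1 - som / html."""
    if html_tokens <= 0:
        raise ValueError("html_tokens must be positive")
    return 1.0 - som_tokens / html_tokens


def compression_factor(savings: float) -> float:
    """Convert a savings fraction into an 'N times smaller' factor."""
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in [0.0, 1.0)")
    return 1.0 / (1.0 - savings)


if __name__ == "__main__":
    # Hypothetical page: 8,000 raw HTML tokens reduced to 500 SOM tokens
    s = token_savings(html_tokens=8_000, som_tokens=500)
    print(f"savings: {s:.1%}, factor: {compression_factor(s):.1f}x")
```

Under this assumed definition, 90% savings corresponds to a 10x reduction and 99% to a 100x reduction, which is where the 10-100x range would come from.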

Cost Comparison

| Scenario | Raw HTML | Plasmate SOM | Savings |
|---|---|---|---|
| News article | ~8,000 tokens | ~500 tokens | 94% |
| Product page | ~15,000 tokens | ~1,200 tokens | 92% |
| Documentation | ~12,000 tokens | ~800 tokens | 93% |

At $0.01 per 1K input tokens (GPT-4 Turbo list pricing), processing 10,000 pages monthly across the scenarios above costs:

  • Raw HTML: ~$800-1,500/month
  • Plasmate SOM: ~$50-120/month
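
The monthly figures above follow from simple arithmetic on the per-page token counts in the Cost Comparison table. The sketch below reproduces them; it counts input tokens only and ignores prompt overhead and output tokens, so treat it as a rough estimate. The function name is illustrative.

```python
def monthly_cost(tokens_per_page: int, pages: int, usd_per_1k_tokens: float) -> float:
    """Input-token cost in USD for processing `pages` pages per month."""
    return tokens_per_page * pages / 1_000 * usd_per_1k_tokens


PAGES = 10_000
PRICE = 0.01  # USD per 1K input tokens, the rate quoted above

# (raw HTML tokens, SOM tokens) per page, taken from the Cost Comparison table
scenarios = {
    "News article": (8_000, 500),
    "Product page": (15_000, 1_200),
    "Documentation": (12_000, 800),
}

for name, (raw, som) in scenarios.items():
    print(
        f"{name}: raw ${monthly_cost(raw, PAGES, PRICE):,.0f}, "
        f"SOM ${monthly_cost(som, PAGES, PRICE):,.0f}"
    )
```

The cheapest and most expensive rows give the ends of each range: 8,000 and 15,000 tokens per page for raw HTML, 500 and 1,200 for SOM.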

API Reference

PlasmateCrawler

from crawl4ai_plasmate import PlasmateCrawler

crawler = PlasmateCrawler(
    plasmate_path="plasmate",  # Path to plasmate binary
    timeout=30,                 # Request timeout in seconds
    headers={"Auth": "..."},    # Default headers
    verbose=False,              # Enable verbose output
)

Methods

  • arun(url, **kwargs) - Crawl a single URL
  • arun_many(urls, concurrency=5, **kwargs) - Crawl multiple URLs concurrently
  • start() / close() - Lifecycle methods (or use async context manager)
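
A `concurrency` limit like the one on `arun_many` is commonly implemented with an `asyncio.Semaphore`. The sketch below shows that general pattern with a stand-in fetch function; it is not taken from the `crawl4ai_plasmate` source, and `run_many` and `fake_fetch` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_many(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
) -> List[str]:
    """Run `fetch` over `urls` with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> str:
        # Each task waits here until one of the `concurrency` slots is free
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as `urls`
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real crawl; sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(10)]
    pages = asyncio.run(run_many(urls, fake_fetch, concurrency=5))
    print(len(pages))
```

A semaphore keeps the caller-facing API a single awaitable while capping how many crawls run at once, which matters when each crawl spawns a subprocess or holds a connection.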

PlasmateResult

result = await crawler.arun("https://example.com")

# Crawl4AI-compatible properties
result.html          # Reconstructed HTML
result.cleaned_html  # Same as html (already clean)
result.markdown      # Markdown representation
result.text          # Plain text
result.links         # List of LinkItem objects
result.media         # List of MediaItem objects
result.metadata      # Page metadata dict

# Plasmate-specific properties
result.som           # Full SOM structure (dict)
result.som_json      # SOM as formatted JSON string
result.token_savings # Fraction of tokens saved (0.0 to 1.0)
result.success       # Whether crawl succeeded
result.error         # Error message if failed

# Element queries
result.get_element_by_id("main")
result.get_elements_by_class("article")
result.get_elements_by_tag("h1")
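
Because the result exposes `success` and `error`, callers will usually want to check them before reading content. The sketch below assumes that a failed crawl returns a result with `success` set to `False` rather than raising, which the property list suggests but does not state. It uses `SimpleNamespace` stand-ins so it runs without the package; `markdown_or_raise` is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Any


def markdown_or_raise(result: Any) -> str:
    """Return result.markdown, or raise if the crawl reported a failure."""
    if not result.success:
        raise RuntimeError(f"crawl failed: {result.error}")
    return result.markdown


# Stand-in objects with the same attribute names as PlasmateResult
ok = SimpleNamespace(success=True, error=None, markdown="# Example Domain")
failed = SimpleNamespace(success=False, error="timeout after 30s", markdown="")

print(markdown_or_raise(ok))
try:
    markdown_or_raise(failed)
except RuntimeError as exc:
    print(exc)
```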

Extraction Strategies

LLMExtractionStrategy

Uses SOM instead of raw HTML for LLM extraction, dramatically reducing token usage:

from crawl4ai_plasmate import PlasmateCrawler, LLMExtractionStrategy
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    price: str
    description: str

strategy = LLMExtractionStrategy(
    provider="openai",
    model="gpt-4",
    schema=Product,
    instruction="Extract product details from this page"
)

async with PlasmateCrawler() as crawler:
    result = await crawler.arun(
        "https://example.com/product",
        extraction_strategy=strategy
    )
    print(result.extracted)

JsonCssExtractionStrategy

Extract data using CSS selectors directly on SOM (no LLM needed):

from crawl4ai_plasmate import PlasmateCrawler, JsonCssExtractionStrategy

strategy = JsonCssExtractionStrategy(
    selectors={
        "title": "h1.product-title",
        "price": ".price-value",
        "description": "#product-description",
    }
)

async with PlasmateCrawler() as crawler:
    result = await crawler.arun(
        "https://example.com/product",
        extraction_strategy=strategy
    )
    print(result.extracted)
    # {"title": "...", "price": "$99", "description": "..."}

Feature Matrix

| Feature | Crawl4AI | Plasmate | Notes |
|---|---|---|---|
| Basic crawling | Yes | Yes | Same API |
| Markdown extraction | Yes | Yes | Same output |
| Link extraction | Yes | Yes | Same output |
| Media extraction | Yes | Yes | Same output |
| Custom headers | Yes | Yes | Same API |
| CSS selectors | Yes | Yes | Same API |
| LLM extraction | Yes | Yes | 10-100x fewer tokens |
| Token compression | No | Yes | Key advantage |
| JavaScript execution | Yes | Partial | CDP mode only |
| Screenshots | Yes | No | Not supported |
| PDF generation | Yes | No | Not supported |
| Caching | Yes | Soon | Coming soon |
| Proxies | Yes | Soon | Coming soon |

Configuration Mapping

| Crawl4AI Setting | Plasmate Equivalent | Notes |
|---|---|---|
| browser_type | - | Ignored (own engine) |
| headless | - | Always headless |
| verbose | verbose | Direct mapping |
| headers | headers | Direct mapping |
| user_agent | headers["User-Agent"] | Use headers dict |
| timeout | timeout | Direct mapping |
| css_selector | css_selector | Direct mapping |
| extraction_strategy | extraction_strategy | Use migration helper |
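
The mapping in the table above can be expressed as a small translation function. This is a hypothetical sketch of the rules in the table, not the implementation of `from_crawl4ai`; `map_settings` and the two constant names are made up for illustration. It leaves out `extraction_strategy`, which the table routes through the migration helper.

```python
from typing import Any, Dict

# Settings the table marks as having no Plasmate equivalent
IGNORED = {"browser_type", "headless"}
# Settings the table marks as a direct mapping
DIRECT = {"verbose", "headers", "timeout", "css_selector"}


def map_settings(crawl4ai_settings: Dict[str, Any]) -> Dict[str, Any]:
    """Translate Crawl4AI-style settings into Plasmate-style keyword arguments."""
    mapped: Dict[str, Any] = {}
    for key, value in crawl4ai_settings.items():
        if key in IGNORED or key == "user_agent":
            continue  # dropped here; user_agent is handled after the loop
        if key in DIRECT:
            # Copy the headers dict so the caller's dict is not mutated later
            mapped[key] = dict(value) if key == "headers" else value
        else:
            raise ValueError(f"no mapping known for setting {key!r}")
    # user_agent has no dedicated option; it becomes a User-Agent header
    if "user_agent" in crawl4ai_settings:
        mapped.setdefault("headers", {})["User-Agent"] = crawl4ai_settings["user_agent"]
    return mapped


if __name__ == "__main__":
    print(map_settings({"headless": True, "verbose": True, "user_agent": "my-bot/1.0"}))
```

Raising on unrecognised keys is a design choice in this sketch: it surfaces settings that would otherwise be silently lost during migration.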

Performance

Benchmarks comparing Crawl4AI (Playwright) vs Plasmate:

| Metric | Crawl4AI | Plasmate | Improvement |
|---|---|---|---|
| Time per page | ~2-5s | ~0.1-0.5s | 5-20x faster |
| Memory usage | ~500MB | ~50MB | 10x less |
| Token output | 100% | 6-10% | 10-16x smaller |
| LLM cost | $1.00 | $0.06-0.10 | 10-16x cheaper |

Requirements

  • Python 3.9+
  • Plasmate binary (Rust)

Installing Plasmate

# From source
git clone https://github.com/nicholasoxford/plasmate
cd plasmate
cargo build --release
# Binary at ./target/release/plasmate

# Optionally, add the build directory to PATH
export PATH="$PATH:/path/to/plasmate/target/release"
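
Before constructing a crawler, it can help to confirm that the binary is actually reachable. The sketch below uses the standard library's `shutil.which` to look it up on `PATH`; `find_plasmate` is a hypothetical helper, not part of the package.

```python
import shutil
from typing import Optional


def find_plasmate(name: str = "plasmate") -> Optional[str]:
    """Return the path of the binary if it is found on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_plasmate()
    if path is None:
        print("plasmate not found on PATH; pass plasmate_path explicitly")
    else:
        print(f"plasmate found at {path}")
```

If the lookup succeeds, the returned path can be passed as the `plasmate_path` argument listed in the API Reference.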

License

MIT License - see LICENSE for details.