crawl4ai-plasmate
April 12, 2026
Crawl4AI compatibility layer for Plasmate - get 10-100x token compression for AI web crawling.
Why?
Crawl4AI (51K stars) is one of the most popular AI-friendly web crawlers. Plasmate is a browser engine that converts HTML to structured JSON (SOM, the Semantic Object Model) with substantial token compression.
This integration lets you use Plasmate as a drop-in backend for Crawl4AI, dramatically reducing your LLM costs while maintaining the same API.
Quick Start
# Install the package
pip install crawl4ai-plasmate
# Make sure plasmate is installed
# See: https://github.com/nicholasoxford/plasmate
cargo install plasmate
import asyncio
from crawl4ai_plasmate import PlasmateCrawler
async def main():
    async with PlasmateCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        # Crawl4AI-compatible properties
        print(result.markdown)
        print(result.links)
        print(result.media)
        # Plasmate-specific benefits
        print(result.som)            # Structured semantic data
        print(result.token_savings)  # e.g., 0.94 = 94% fewer tokens
# "async with" must run inside a coroutine, so drive it with asyncio.run
asyncio.run(main())
Migration from Crawl4AI
Before (Crawl4AI)
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
browser_config = BrowserConfig(headless=True, verbose=True)
run_config = CrawlerRunConfig(word_count_threshold=50)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        "https://example.com",
        config=run_config
    )
    print(result.markdown)
After (Plasmate)
from crawl4ai_plasmate import PlasmateCrawler
# Option 1: Direct usage
async with PlasmateCrawler(verbose=True) as crawler:
    result = await crawler.arun("https://example.com")
    print(result.markdown)
# Option 2: Migration helper
from crawl4ai_plasmate import from_crawl4ai
crawler = from_crawl4ai(
    browser_config={"headless": True, "verbose": True},
    run_config={"word_count_threshold": 50}
)
async with crawler:
    result = await crawler.arun("https://example.com")
Token Savings
Plasmate's SOM (Semantic Object Model) provides 10-100x token compression compared to raw HTML:
from crawl4ai_plasmate import PlasmateCrawler, compare_token_usage
async with PlasmateCrawler() as crawler:
    result = await crawler.arun("https://news.ycombinator.com")
    # Check the compression ratio
    print(f"Token savings: {result.token_savings:.1%}")
    # Output: Token savings: 94.2%
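The savings percentage and the "Nx compression" figure describe the same quantity in two ways. The sketch below converts between them, assuming `token_savings` is defined as `1 - som_tokens / html_tokens`. This README implies that definition but does not state it. Both helper names are hypothetical and not part of the package.

```python
def savings_from_counts(html_tokens: int, som_tokens: int) -> float:
    """Assumed definition: fraction of raw-HTML tokens that the SOM avoids."""
    if html_tokens <= 0:
        raise ValueError("html_tokens must be positive")
    return 1.0 - som_tokens / html_tokens


def compression_factor(token_savings: float) -> float:
    """Convert a savings fraction in [0.0, 1.0) into an 'Nx smaller' factor."""
    if not 0.0 <= token_savings < 1.0:
        raise ValueError("token_savings must be in [0.0, 1.0)")
    return 1.0 / (1.0 - token_savings)


if __name__ == "__main__":
    # Token counts taken from the "News article" row of the cost table
    s = savings_from_counts(html_tokens=8000, som_tokens=500)
    print(f"savings {s:.1%}, factor {compression_factor(s):.1f}x")
```

Under this definition, 90% savings corresponds to 10x compression and 99% to 100x, so savings in the low-to-mid 90s sit toward the lower end of the 10-100x range.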
Cost Comparison
| Scenario | Raw HTML | Plasmate SOM | Savings |
|---|---|---|---|
| News article | ~8,000 tokens | ~500 tokens | 94% |
| Product page | ~15,000 tokens | ~1,200 tokens | 92% |
| Documentation | ~12,000 tokens | ~800 tokens | 93% |
At GPT-4 pricing ($0.01/1K tokens), processing 10,000 pages monthly:
- Raw HTML: ~$800-1,500/month
- Plasmate SOM: ~$50-120/month
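The monthly figures follow directly from the per-page token counts in the table. A minimal sketch of the arithmetic, assuming every token is billed at the single quoted rate (input tokens only, no output tokens); `monthly_cost` is an illustrative helper, not part of the package:

```python
PRICE_PER_1K_TOKENS = 0.01  # USD, the rate quoted above
PAGES_PER_MONTH = 10_000


def monthly_cost(
    tokens_per_page: int,
    pages: int = PAGES_PER_MONTH,
    price_per_1k: float = PRICE_PER_1K_TOKENS,
) -> float:
    """Cost in USD of sending `pages` pages of `tokens_per_page` tokens each."""
    return tokens_per_page * pages / 1000 * price_per_1k


if __name__ == "__main__":
    # (scenario, raw HTML tokens, SOM tokens) from the table above
    rows = [
        ("News article", 8_000, 500),
        ("Product page", 15_000, 1_200),
        ("Documentation", 12_000, 800),
    ]
    for label, raw, som in rows:
        print(f"{label}: raw ${monthly_cost(raw):,.0f}, SOM ${monthly_cost(som):,.0f}")
```

The cheapest and most expensive rows give the two ends of each quoted range: 8,000 and 15,000 raw tokens per page for HTML, 500 and 1,200 for SOM.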
API Reference
PlasmateCrawler
from crawl4ai_plasmate import PlasmateCrawler
crawler = PlasmateCrawler(
    plasmate_path="plasmate",  # Path to plasmate binary
    timeout=30,                # Request timeout in seconds
    headers={"Auth": "..."},   # Default headers
    verbose=False,             # Enable verbose output
)
Methods
- arun(url, **kwargs) - Crawl a single URL
- arun_many(urls, concurrency=5, **kwargs) - Crawl multiple URLs concurrently
- start() / close() - Lifecycle methods (or use the async context manager)
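The `concurrency` argument of `arun_many` suggests a cap on in-flight requests. One common way to get that behavior in asyncio is a semaphore around each call, sketched below. This is not the package's actual implementation: `crawl_many` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `crawler.arun` so the example runs without a network or the plasmate binary.

```python
import asyncio
from typing import Awaitable, Callable, List


async def crawl_many(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
) -> List[str]:
    """Call fetch(url) for every URL with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> str:
        # Each task waits here until one of the `concurrency` slots is free
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url: str) -> str:
    """Stand-in for crawler.arun: sleeps briefly instead of fetching."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


if __name__ == "__main__":
    targets = [f"https://example.com/{i}" for i in range(10)]
    pages = asyncio.run(crawl_many(targets, fake_fetch, concurrency=5))
    print(len(pages))
```

A semaphore keeps the code simple and preserves result order. A worker-pool design with a queue would be an equally reasonable choice.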
PlasmateResult
result = await crawler.arun("https://example.com")
# Crawl4AI-compatible properties
result.html # Reconstructed HTML
result.cleaned_html # Same as html (already clean)
result.markdown # Markdown representation
result.text # Plain text
result.links # List of LinkItem objects
result.media # List of MediaItem objects
result.metadata # Page metadata dict
# Plasmate-specific properties
result.som # Full SOM structure (dict)
result.som_json # SOM as formatted JSON string
result.token_savings # Compression ratio (0.0 to 1.0)
result.success # Whether crawl succeeded
result.error # Error message if failed
# Element queries
result.get_element_by_id("main")
result.get_elements_by_class("article")
result.get_elements_by_tag("h1")
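Element queries like these amount to a search over a tree. The sketch below shows that idea on a plain nested dict. The node shape (`tag`, `id`, `classes`, `children` keys) is a hypothetical stand-in chosen for illustration, since the real SOM schema is not described in this README, and the helper names are likewise not part of the package.

```python
from typing import Any, Dict, Iterator, List, Optional

# Hypothetical node shape: {"tag": str, "id": str, "classes": [str], "children": [node]}
Node = Dict[str, Any]


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth-first, in document order."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def element_by_id(root: Node, element_id: str) -> Optional[Node]:
    """Return the first node whose "id" matches, or None if there is none."""
    return next((n for n in walk(root) if n.get("id") == element_id), None)


def elements_by_class(root: Node, class_name: str) -> List[Node]:
    """Return every node that lists class_name among its classes."""
    return [n for n in walk(root) if class_name in n.get("classes", [])]


def elements_by_tag(root: Node, tag: str) -> List[Node]:
    """Return every node with the given tag name."""
    return [n for n in walk(root) if n.get("tag") == tag]


# Toy tree for illustration only; not real Plasmate output
page: Node = {
    "tag": "body",
    "children": [
        {"tag": "h1", "id": "main", "children": []},
        {"tag": "div", "classes": ["article"], "children": [{"tag": "h1"}]},
    ],
}

if __name__ == "__main__":
    print(len(elements_by_tag(page, "h1")))
```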
Extraction Strategies
LLMExtractionStrategy
Uses SOM instead of raw HTML for LLM extraction, dramatically reducing token usage:
from crawl4ai_plasmate import PlasmateCrawler, LLMExtractionStrategy
from dataclasses import dataclass
@dataclass
class Product:
    name: str
    price: str
    description: str
strategy = LLMExtractionStrategy(
    provider="openai",
    model="gpt-4",
    schema=Product,
    instruction="Extract product details from this page"
)
async with PlasmateCrawler() as crawler:
    result = await crawler.arun(
        "https://example.com/product",
        extraction_strategy=strategy
    )
    print(result.extracted)
JsonCssExtractionStrategy
Extract data using CSS selectors directly on SOM (no LLM needed):
from crawl4ai_plasmate import PlasmateCrawler, JsonCssExtractionStrategy
strategy = JsonCssExtractionStrategy(
    selectors={
        "title": "h1.product-title",
        "price": ".price-value",
        "description": "#product-description",
    }
)
async with PlasmateCrawler() as crawler:
    result = await crawler.arun(
        "https://example.com/product",
        extraction_strategy=strategy
    )
    print(result.extracted)
    # {"title": "...", "price": "$99", "description": "..."}
Feature Matrix
| Feature | Crawl4AI | Plasmate | Notes |
|---|---|---|---|
| Basic crawling | Yes | Yes | Same API |
| Markdown extraction | Yes | Yes | Same output |
| Link extraction | Yes | Yes | Same output |
| Media extraction | Yes | Yes | Same output |
| Custom headers | Yes | Yes | Same API |
| CSS selectors | Yes | Yes | Same API |
| LLM extraction | Yes | Yes | 10-100x fewer tokens |
| Token compression | No | Yes | Key advantage |
| JavaScript execution | Yes | Partial | CDP mode only |
| Screenshots | Yes | No | Not supported |
| PDF generation | Yes | No | Not supported |
| Caching | Yes | Soon | Coming soon |
| Proxies | Yes | Soon | Coming soon |
Configuration Mapping
| Crawl4AI Setting | Plasmate Equivalent | Notes |
|---|---|---|
| browser_type | - | Ignored (own engine) |
| headless | - | Always headless |
| verbose | verbose | Direct mapping |
| headers | headers | Direct mapping |
| user_agent | headers["User-Agent"] | Use headers dict |
| timeout | timeout | Direct mapping |
| css_selector | css_selector | Direct mapping |
| extraction_strategy | extraction_strategy | Use migration helper |
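The mapping table can be read as a small translation function. The sketch below applies its rules to a flat dict of Crawl4AI-style settings. `map_settings` is a hypothetical helper and not the `from_crawl4ai` implementation. Treating `css_selector` and `extraction_strategy` as per-run arguments rather than constructor arguments is also an assumption.

```python
from typing import Any, Dict, Tuple

IGNORED_KEYS = ("browser_type", "headless")             # no Plasmate equivalent
CRAWLER_KEYS = ("verbose", "timeout")                   # direct mapping
RUN_KEYS = ("css_selector", "extraction_strategy")      # assumed to be per-run


def map_settings(settings: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split Crawl4AI-style settings into (crawler kwargs, per-run kwargs)."""
    crawler_kwargs: Dict[str, Any] = {}
    run_kwargs: Dict[str, Any] = {}

    # headers map directly; user_agent is folded into headers["User-Agent"]
    headers = dict(settings.get("headers") or {})
    if "user_agent" in settings:
        headers["User-Agent"] = settings["user_agent"]
    if headers:
        crawler_kwargs["headers"] = headers

    for key in CRAWLER_KEYS:
        if key in settings:
            crawler_kwargs[key] = settings[key]
    for key in RUN_KEYS:
        if key in settings:
            run_kwargs[key] = settings[key]

    # Keys in IGNORED_KEYS are dropped: Plasmate uses its own engine, always headless
    return crawler_kwargs, run_kwargs


if __name__ == "__main__":
    crawler_kwargs, run_kwargs = map_settings(
        {"headless": True, "verbose": True, "user_agent": "MyBot/1.0", "css_selector": "main"}
    )
    print(crawler_kwargs, run_kwargs)
```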
Performance
Benchmarks comparing Crawl4AI (Playwright) vs Plasmate:
| Metric | Crawl4AI | Plasmate | Improvement |
|---|---|---|---|
| Time per page | ~2-5s | ~0.1-0.5s | 5-20x faster |
| Memory usage | ~500MB | ~50MB | 10x less |
| Token output | 100% | 6-10% | 10-16x smaller |
| LLM cost | $1.00 | $0.06-0.10 | 10-16x cheaper |
Requirements
- Python 3.9+
- Plasmate binary (Rust)
Installing Plasmate
# From source
git clone https://github.com/nicholasoxford/plasmate
cd plasmate
cargo build --release
# Binary at ./target/release/plasmate
# Or add to PATH
export PATH="$PATH:/path/to/plasmate/target/release"
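Before creating a crawler, it can help to confirm that the binary is actually reachable. A minimal sketch using only the standard library, where `find_plasmate` is a hypothetical helper and not part of the package:

```python
import shutil
from typing import Optional


def find_plasmate(binary: str = "plasmate") -> Optional[str]:
    """Return the path of the binary if it is found on PATH, else None."""
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_plasmate()
    if path is None:
        print("plasmate not found on PATH; build it or pass plasmate_path explicitly")
    else:
        print(f"plasmate found at {path}")
```

If the binary lives outside PATH, the `plasmate_path` constructor argument shown in the API Reference is the place to point at it.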
License
MIT License - see LICENSE for details.
Links
- Plasmate - The browser engine for AI agents
- Crawl4AI - The original AI-friendly crawler
- Documentation