Supabase Plasmate

April 12, 2026 · View on GitHub

Store and query web content in Supabase using Plasmate's Semantic Object Model (SOM).

Plasmate is a browser engine that converts HTML to structured JSON with 10-100x token compression. This integration lets you store, search, and manage web content in Supabase with built-in vector search support.

Features

Fetch URLs using Plasmate and store SOM data in Supabase
Semantic search using pgvector embeddings
Batch processing for multiple URLs
Realtime subscriptions for content changes
Automatic stale content detection and refresh

Installation

pip install supabase-plasmate

# Optional: for semantic search with OpenAI embeddings
pip install supabase-plasmate[openai]

# Or with Voyage AI embeddings
pip install supabase-plasmate[voyage]

You also need Plasmate installed:

# Build from source
cargo build --release
export PATH="$PATH:/path/to/plasmate/target/release"

Quick Start

1. Set up the database schema

Run the SQL in sql/schema.sql in your Supabase SQL Editor to create the required tables and functions.

2. Store web content

from supabase_plasmate import PlasmateSupabase

client = PlasmateSupabase(
    supabase_url="https://xxx.supabase.co",
    supabase_key="your-api-key",
)

# Fetch and store a URL
result = client.fetch_and_store(
    url="https://example.com",
    metadata={"category": "examples"},
)

# Retrieve stored content
content = client.get_content("https://example.com")
print(content["text_content"])

3. Batch processing

urls = [
    "https://example.com",
    "https://httpbin.org/html",
    "https://news.ycombinator.com",
]

results = client.batch_fetch_and_store(
    urls=urls,
    on_progress=lambda url, i, total, ok, err: print(f"[{i+1}/{total}] {url}"),
)

4. Semantic search

from supabase_plasmate import VectorSearch
from supabase_plasmate.vector import create_openai_embedding_fn

# Create embedding function
embed_fn = create_openai_embedding_fn(api_key="sk-...")

# Initialize vector search
vector = VectorSearch(
    client=client.client,
    embedding_fn=embed_fn,
)

# Generate embeddings for all stored content
vector.batch_update_embeddings()

# Search for similar content
results = vector.semantic_search(
    query="machine learning tutorials",
    threshold=0.7,
    limit=10,
)

for result in results:
    print(f"{result['url']} - similarity: {result['similarity']:.3f}")

5. Find similar pages

# Find pages similar to a given URL
similar = vector.find_similar(
    url="https://example.com/ml-guide",
    threshold=0.6,
    limit=5,
)

Database Schema

web_content

Stores fetched web pages with their SOM data.

Column	Type	Description
id	UUID	Primary key
url	TEXT	Source URL (unique)
som	JSONB	Plasmate SOM JSON
text_content	TEXT	Extracted text
embedding	vector(1536)	Vector embedding for search
metadata	JSONB	User-defined metadata
fetched_at	TIMESTAMPTZ	When content was fetched
created_at	TIMESTAMPTZ	Row creation time
updated_at	TIMESTAMPTZ	Last update time

crawl_jobs

Tracks batch crawl operations.

Column	Type	Description
id	UUID	Primary key
name	TEXT	Job name
urls	TEXT[]	URLs to crawl
status	TEXT	pending, running, completed, failed
progress	INTEGER	URLs processed
total	INTEGER	Total URLs
errors	JSONB	Error messages
started_at	TIMESTAMPTZ	Job start time
completed_at	TIMESTAMPTZ	Job end time
created_at	TIMESTAMPTZ	Row creation time

Realtime Subscriptions

Subscribe to content changes:

from supabase_plasmate import RealtimeSubscriber

subscriber = RealtimeSubscriber(client.client)

# Subscribe to all changes
subscriber.subscribe_to_changes(
    table="web_content",
    callback=lambda payload: print(f"Changed: {payload}"),
)

# Find stale content
from datetime import timedelta

stale = subscriber.get_stale_content(
    max_age=timedelta(days=7),
    limit=100,
)

# Refresh stale content
await subscriber.refresh_stale_content(
    refresh_fn=client.refresh_content,
    max_age=timedelta(days=7),
)

Custom Headers

Pass custom headers for authenticated requests:

result = client.fetch_and_store(
    url="https://api.example.com/page",
    headers={
        "Authorization": "Bearer your-token",
        "X-Custom-Header": "value",
    },
)

Configuration

Environment Variables

Variable	Description
SUPABASE_URL	Your Supabase project URL
SUPABASE_KEY	Supabase API key
OPENAI_API_KEY	For OpenAI embeddings
VOYAGEAI_API_KEY	For Voyage AI embeddings

Custom Plasmate Path

client = PlasmateSupabase(
    supabase_url="...",
    supabase_key="...",
    plasmate_path="/path/to/plasmate",
)

Examples

See the examples/ directory for more examples:

content_store.py - Complete example with all features

API Reference

PlasmateSupabase

class PlasmateSupabase:
    def __init__(self, supabase_url: str, supabase_key: str, plasmate_path: str = "plasmate")
    def fetch_and_store(self, url: str, table: str = "web_content", headers: dict = None, metadata: dict = None) -> dict
    def batch_fetch_and_store(self, urls: list[str], table: str = "web_content", headers: dict = None, concurrency: int = 5, on_progress: callable = None) -> list[dict]
    def get_content(self, url: str, table: str = "web_content") -> dict | None
    def list_content(self, table: str = "web_content", limit: int = 100, offset: int = 0) -> list[dict]
    def delete_content(self, url: str, table: str = "web_content") -> bool
    def refresh_content(self, url: str, table: str = "web_content", headers: dict = None) -> dict

VectorSearch

class VectorSearch:
    def __init__(self, client: Client, embedding_fn: callable = None, embedding_dim: int = 1536)
    def set_embedding_function(self, fn: callable) -> None
    def update_embedding(self, url: str, table: str = "web_content") -> dict
    def batch_update_embeddings(self, table: str = "web_content", batch_size: int = 100, on_progress: callable = None) -> int
    def semantic_search(self, query: str, table: str = "web_content", threshold: float = 0.7, limit: int = 10) -> list[dict]
    def find_similar(self, url: str, table: str = "web_content", threshold: float = 0.7, limit: int = 10) -> list[dict]

RealtimeSubscriber

class RealtimeSubscriber:
    def __init__(self, client: Client)
    def subscribe_to_changes(self, table: str, callback: callable, event: str = "*") -> str
    def unsubscribe(self, subscription_id: str) -> bool
    def get_stale_content(self, table: str = "web_content", max_age: timedelta = timedelta(days=7), limit: int = 100) -> list[dict]
    async def refresh_stale_content(self, refresh_fn: callable, table: str = "web_content", max_age: timedelta = timedelta(days=7), batch_size: int = 10) -> int

License

MIT