Supabase Plasmate

April 12, 2026

Store and query web content in Supabase using Plasmate's Semantic Object Model (SOM).

Plasmate is a browser engine that converts HTML to structured JSON with 10-100x token compression. This integration lets you store, search, and manage web content in Supabase with built-in vector search support.

Features

  • Fetch URLs using Plasmate and store SOM data in Supabase
  • Semantic search using pgvector embeddings
  • Batch processing for multiple URLs
  • Realtime subscriptions for content changes
  • Automatic stale content detection and refresh

Installation

pip install supabase-plasmate

# Optional: for semantic search with OpenAI embeddings
# (quotes keep shells such as zsh from expanding the brackets)
pip install "supabase-plasmate[openai]"

# Or with Voyage AI embeddings
pip install "supabase-plasmate[voyage]"

You also need the Plasmate binary installed and available on your PATH:

# Build from source (run inside a checkout of the Plasmate repository)
cargo build --release
export PATH="$PATH:/path/to/plasmate/target/release"

Quick Start

1. Set up the database schema

Run the SQL in sql/schema.sql in your Supabase SQL Editor to create the required tables and functions.

2. Store web content

from supabase_plasmate import PlasmateSupabase

client = PlasmateSupabase(
    supabase_url="https://xxx.supabase.co",
    supabase_key="your-api-key",
)

# Fetch and store a URL
result = client.fetch_and_store(
    url="https://example.com",
    metadata={"category": "examples"},
)

# Retrieve stored content
content = client.get_content("https://example.com")
print(content["text_content"])

3. Batch processing

urls = [
    "https://example.com",
    "https://httpbin.org/html",
    "https://news.ycombinator.com",
]

results = client.batch_fetch_and_store(
    urls=urls,
    on_progress=lambda url, i, total, ok, err: print(f"[{i+1}/{total}] {url}"),
)
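
The `on_progress` lambda above can be swapped for a named callback when you want to keep track of failures. The sketch below assumes the five arguments mean what their names in the lambda suggest (`i` as a zero-based index, `ok` as a success flag, `err` as an error message or `None`); that reading is an assumption, not confirmed API behavior, and `ProgressTracker` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class ProgressTracker:
    """Hypothetical helper that records batch progress and failures."""

    failed: dict = field(default_factory=dict)
    done: int = 0

    def __call__(self, url, i, total, ok, err):
        # Assumed argument meanings: i is a zero-based index, ok a success
        # flag, err an error message (or None on success).
        self.done += 1
        if not ok:
            self.failed[url] = str(err)
        status = "ok" if ok else f"failed: {err}"
        print(f"[{i + 1}/{total}] {url} {status}")
```

Pass an instance as `on_progress=tracker` to `batch_fetch_and_store`, then inspect `tracker.failed` once the batch returns.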

4. Semantic search

from supabase_plasmate import VectorSearch
from supabase_plasmate.vector import create_openai_embedding_fn

# Create embedding function
embed_fn = create_openai_embedding_fn(api_key="sk-...")

# Initialize vector search
vector = VectorSearch(
    client=client.client,
    embedding_fn=embed_fn,
)

# Generate embeddings for all stored content
vector.batch_update_embeddings()

# Search for similar content
results = vector.semantic_search(
    query="machine learning tutorials",
    threshold=0.7,
    limit=10,
)

for result in results:
    print(f"{result['url']} - similarity: {result['similarity']:.3f}")
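
To build intuition for `threshold`, here is a pure-Python sketch of similarity filtering. It assumes the score is cosine similarity, which is a common choice with pgvector but is not confirmed by this README; the actual scoring happens in the SQL functions from sql/schema.sql. The `rank` helper and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, rows, threshold=0.7, limit=10):
    """Keep rows scoring at least `threshold`, best first, at most `limit`."""
    scored = [
        {"url": url, "similarity": cosine_similarity(query_vec, vec)}
        for url, vec in rows
    ]
    kept = [r for r in scored if r["similarity"] >= threshold]
    kept.sort(key=lambda r: r["similarity"], reverse=True)
    return kept[:limit]


# Toy 2-dimensional "embeddings"; real ones have many more dimensions.
rows = [
    ("https://example.com/a", [1.0, 0.0]),
    ("https://example.com/b", [1.0, 1.0]),
    ("https://example.com/c", [0.0, 1.0]),
]
results = rank([1.0, 0.0], rows, threshold=0.7)
```

Raising the threshold returns fewer, closer matches; lowering it trades precision for recall.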

5. Find similar pages

# Find pages similar to a given URL
similar = vector.find_similar(
    url="https://example.com/ml-guide",
    threshold=0.6,
    limit=5,
)

Database Schema

web_content

Stores fetched web pages with their SOM data.

| Column | Type | Description |
| --- | --- | --- |
| id | UUID | Primary key |
| url | TEXT | Source URL (unique) |
| som | JSONB | Plasmate SOM JSON |
| text_content | TEXT | Extracted text |
| embedding | vector(1536) | Vector embedding for search |
| metadata | JSONB | User-defined metadata |
| fetched_at | TIMESTAMPTZ | When content was fetched |
| created_at | TIMESTAMPTZ | Row creation time |
| updated_at | TIMESTAMPTZ | Last update time |
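
Because the `embedding` column is declared as `vector(1536)`, an embedding of any other length cannot be stored in it. A small guard like the following, a sketch built around a hypothetical `check_embedding` helper, can catch a mismatched embedding model before a write fails.

```python
EMBEDDING_DIM = 1536  # matches the vector(1536) column in web_content


def check_embedding(vec, dim=EMBEDDING_DIM):
    """Return vec as a list of floats, or raise if its length is wrong."""
    values = [float(x) for x in vec]
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    return values
```

If your embedding model produces a different dimension, the column type and the `embedding_dim` argument of `VectorSearch` presumably both need to change to match.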

crawl_jobs

Tracks batch crawl operations.

| Column | Type | Description |
| --- | --- | --- |
| id | UUID | Primary key |
| name | TEXT | Job name |
| urls | TEXT[] | URLs to crawl |
| status | TEXT | pending, running, completed, failed |
| progress | INTEGER | URLs processed |
| total | INTEGER | Total URLs |
| errors | JSONB | Error messages |
| started_at | TIMESTAMPTZ | Job start time |
| completed_at | TIMESTAMPTZ | Job end time |
| created_at | TIMESTAMPTZ | Row creation time |
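
The `progress` and `total` columns make it easy to report how far along a job is. The sketch below works on a plain dict shaped like a `crawl_jobs` row; `job_summary` is a hypothetical helper, and treating `errors` as a list is an assumption about the JSONB payload.

```python
def job_summary(job):
    """Format a one-line summary for a crawl_jobs row given as a dict."""
    total = job.get("total") or 0
    progress = job.get("progress") or 0
    # Avoid dividing by zero for jobs that have no URLs yet.
    percent = 100.0 * progress / total if total else 0.0
    errors = job.get("errors") or []
    name = job.get("name") or "(unnamed)"
    return (
        f"{name}: {job.get('status')} "
        f"{progress}/{total} ({percent:.0f}%), {len(errors)} error(s)"
    )


row = {
    "name": "docs",
    "status": "running",
    "progress": 3,
    "total": 12,
    "errors": [{"url": "https://example.com", "error": "timeout"}],
}
summary = job_summary(row)
```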

Realtime Subscriptions

Subscribe to content changes:

from supabase_plasmate import RealtimeSubscriber

subscriber = RealtimeSubscriber(client.client)

# Subscribe to all changes
subscriber.subscribe_to_changes(
    table="web_content",
    callback=lambda payload: print(f"Changed: {payload}"),
)

# Find stale content
from datetime import timedelta

stale = subscriber.get_stale_content(
    max_age=timedelta(days=7),
    limit=100,
)

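# Refresh stale content (refresh_stale_content is a coroutine,
# so it has to run inside an event loop)
import asyncio

asyncio.run(
    subscriber.refresh_stale_content(
        refresh_fn=client.refresh_content,
        max_age=timedelta(days=7),
    )
)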
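
Staleness here presumably means that `fetched_at` is older than `max_age`. The following sketch shows that comparison on its own, using timezone-aware datetimes to match the `TIMESTAMPTZ` column; `is_stale` is a hypothetical helper and the library's actual query may differ.

```python
from datetime import datetime, timedelta, timezone


def is_stale(fetched_at, max_age=timedelta(days=7), now=None):
    """True when fetched_at is more than max_age before now (UTC)."""
    now = now or datetime.now(timezone.utc)
    return now - fetched_at > max_age


# Fixed reference time so the example is reproducible.
now = datetime(2026, 4, 12, tzinfo=timezone.utc)
fresh = is_stale(datetime(2026, 4, 10, tzinfo=timezone.utc), now=now)
old = is_stale(datetime(2026, 4, 1, tzinfo=timezone.utc), now=now)
```

Mixing naive and timezone-aware datetimes raises a `TypeError` on subtraction, so parse stored timestamps into aware values before comparing.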

Custom Headers

Pass custom headers for authenticated requests:

result = client.fetch_and_store(
    url="https://api.example.com/page",
    headers={
        "Authorization": "Bearer your-token",
        "X-Custom-Header": "value",
    },
)

Configuration

Environment Variables

| Variable | Description |
| --- | --- |
| SUPABASE_URL | Your Supabase project URL |
| SUPABASE_KEY | Supabase API key |
| OPENAI_API_KEY | For OpenAI embeddings |
| VOYAGEAI_API_KEY | For Voyage AI embeddings |
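
Whether the client reads these variables on its own is not stated here, so the sketch below reads them explicitly and turns them into constructor arguments. `load_supabase_settings` is a hypothetical helper.

```python
import os


def load_supabase_settings(env=None):
    """Build PlasmateSupabase keyword arguments from environment variables."""
    env = os.environ if env is None else env
    required = ("SUPABASE_URL", "SUPABASE_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "supabase_url": env["SUPABASE_URL"],
        "supabase_key": env["SUPABASE_KEY"],
    }
```

With the variables set, `PlasmateSupabase(**load_supabase_settings())` builds the client without hardcoding the key.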

Custom Plasmate Path

client = PlasmateSupabase(
    supabase_url="...",
    supabase_key="...",
    plasmate_path="/path/to/plasmate",
)
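
Before constructing the client, it can help to confirm that the binary is actually reachable. This sketch uses the standard library's `shutil.which`; `resolve_plasmate` is a hypothetical helper, and it only checks that an executable exists at that name or path, not that it is a working Plasmate build.

```python
import shutil


def resolve_plasmate(path="plasmate"):
    """Return the located executable path, or None if nothing is found."""
    # shutil.which searches PATH for bare names and checks explicit paths directly.
    return shutil.which(path)
```

A `None` result suggests either fixing PATH or passing an explicit `plasmate_path` to `PlasmateSupabase`.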

Examples

See the examples/ directory for more examples:

  • content_store.py - Complete example with all features

API Reference

PlasmateSupabase

class PlasmateSupabase:
    def __init__(self, supabase_url: str, supabase_key: str, plasmate_path: str = "plasmate")
    def fetch_and_store(self, url: str, table: str = "web_content", headers: dict = None, metadata: dict = None) -> dict
    def batch_fetch_and_store(self, urls: list[str], table: str = "web_content", headers: dict = None, concurrency: int = 5, on_progress: callable = None) -> list[dict]
    def get_content(self, url: str, table: str = "web_content") -> dict | None
    def list_content(self, table: str = "web_content", limit: int = 100, offset: int = 0) -> list[dict]
    def delete_content(self, url: str, table: str = "web_content") -> bool
    def refresh_content(self, url: str, table: str = "web_content", headers: dict = None) -> dict
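
The `limit` and `offset` parameters of `list_content` suggest ordinary offset pagination. A minimal sketch of walking every row follows; it assumes a short (or empty) page signals the end of the data, and `iter_pages` is a hypothetical helper, not part of the library.

```python
def iter_pages(list_fn, page_size=100):
    """Yield rows from a limit/offset style listing function until exhausted."""
    offset = 0
    while True:
        page = list_fn(limit=page_size, offset=offset)
        yield from page
        # A page shorter than page_size is assumed to be the last one.
        if len(page) < page_size:
            break
        offset += page_size
```

For example, `for row in iter_pages(client.list_content): ...` would visit all stored pages in batches of 100.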

VectorSearch

class VectorSearch:
    def __init__(self, client: Client, embedding_fn: callable = None, embedding_dim: int = 1536)
    def set_embedding_function(self, fn: callable) -> None
    def update_embedding(self, url: str, table: str = "web_content") -> dict
    def batch_update_embeddings(self, table: str = "web_content", batch_size: int = 100, on_progress: callable = None) -> int
    def semantic_search(self, query: str, table: str = "web_content", threshold: float = 0.7, limit: int = 10) -> list[dict]
    def find_similar(self, url: str, table: str = "web_content", threshold: float = 0.7, limit: int = 10) -> list[dict]

RealtimeSubscriber

class RealtimeSubscriber:
    def __init__(self, client: Client)
    def subscribe_to_changes(self, table: str, callback: callable, event: str = "*") -> str
    def unsubscribe(self, subscription_id: str) -> bool
    def get_stale_content(self, table: str = "web_content", max_age: timedelta = timedelta(days=7), limit: int = 100) -> list[dict]
    async def refresh_stale_content(self, refresh_fn: callable, table: str = "web_content", max_age: timedelta = timedelta(days=7), batch_size: int = 10) -> int

License

MIT