Supabase Plasmate
April 12, 2026
Store and query web content in Supabase using Plasmate's Semantic Object Model (SOM).
Plasmate is a browser engine that converts HTML to structured JSON with 10-100x token compression. This integration lets you store, search, and manage web content in Supabase with built-in vector search support.
Features
- Fetch URLs using Plasmate and store SOM data in Supabase
- Semantic search using pgvector embeddings
- Batch processing for multiple URLs
- Realtime subscriptions for content changes
- Automatic stale content detection and refresh
Installation
pip install supabase-plasmate
# Optional: for semantic search with OpenAI embeddings
pip install supabase-plasmate[openai]
# Or with Voyage AI embeddings
pip install supabase-plasmate[voyage]
You also need Plasmate installed:
# Build from source (run inside a checkout of the Plasmate repository)
cargo build --release
export PATH="$PATH:/path/to/plasmate/target/release"
Quick Start
1. Set up the database schema
Run the SQL in sql/schema.sql in your Supabase SQL Editor to create the required tables and functions.
2. Store web content
from supabase_plasmate import PlasmateSupabase
client = PlasmateSupabase(
    supabase_url="https://xxx.supabase.co",
    supabase_key="your-api-key",
)
# Fetch and store a URL
result = client.fetch_and_store(
    url="https://example.com",
    metadata={"category": "examples"},
)
# Retrieve stored content
content = client.get_content("https://example.com")
print(content["text_content"])
3. Batch processing
urls = [
    "https://example.com",
    "https://httpbin.org/html",
    "https://news.ycombinator.com",
]
results = client.batch_fetch_and_store(
    urls=urls,
    on_progress=lambda url, i, total, ok, err: print(f"[{i+1}/{total}] {url}"),
)
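The `on_progress` callback above receives five positional arguments. As a minimal sketch, assuming `ok` is a success flag and `err` is an error message or `None` (the example only shows the argument order, not the types), a small tracker object could collect failures while printing progress. `ProgressTracker` is a hypothetical helper, not part of supabase-plasmate.

```python
from dataclasses import dataclass, field


@dataclass
class ProgressTracker:
    """Hypothetical helper: collects per-URL outcomes from an on_progress callback."""

    succeeded: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)

    def __call__(self, url, i, total, ok, err):
        # Argument order mirrors the lambda in the batch example;
        # the meaning of `ok` and `err` is an assumption.
        if ok:
            self.succeeded.append(url)
        else:
            self.failed[url] = str(err)
        print(f"[{i + 1}/{total}] {url} {'ok' if ok else 'FAILED'}")


tracker = ProgressTracker()
# Simulated callback invocations, standing in for batch_fetch_and_store.
tracker("https://example.com", 0, 2, True, None)
tracker("https://httpbin.org/html", 1, 2, False, "timeout")
```

If the assumption about the arguments holds, the tracker could be passed as `on_progress=tracker` and inspected once the batch finishes.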
4. Semantic search
from supabase_plasmate import VectorSearch
from supabase_plasmate.vector import create_openai_embedding_fn
# Create embedding function
embed_fn = create_openai_embedding_fn(api_key="sk-...")
# Initialize vector search
vector = VectorSearch(
    client=client.client,
    embedding_fn=embed_fn,
)
# Generate embeddings for all stored content
vector.batch_update_embeddings()
# Search for similar content
results = vector.semantic_search(
    query="machine learning tutorials",
    threshold=0.7,
    limit=10,
)
for result in results:
    print(f"{result['url']} - similarity: {result['similarity']:.3f}")
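To illustrate what `threshold` and `limit` control, here is a pure-Python sketch. It assumes the similarity score is the cosine similarity between the query embedding and each stored embedding; the actual scoring is defined by the SQL functions in sql/schema.sql and may differ. `rank_by_similarity` and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, rows, threshold=0.7, limit=10):
    """Hypothetical stand-in for a thresholded similarity search."""
    scored = [
        {"url": row["url"], "similarity": cosine_similarity(query_vec, row["embedding"])}
        for row in rows
    ]
    # Keep rows at or above the threshold, best match first, capped at `limit`.
    kept = [r for r in scored if r["similarity"] >= threshold]
    kept.sort(key=lambda r: r["similarity"], reverse=True)
    return kept[:limit]


rows = [
    {"url": "https://example.com/a", "embedding": [1.0, 0.0]},
    {"url": "https://example.com/b", "embedding": [0.6, 0.8]},
    {"url": "https://example.com/c", "embedding": [0.0, 1.0]},
]
matches = rank_by_similarity([1.0, 0.0], rows, threshold=0.5)
```

Raising the threshold trades recall for precision: at `threshold=0.7` only the first row would remain in this toy data.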
5. Find similar pages
# Find pages similar to a given URL
similar = vector.find_similar(
    url="https://example.com/ml-guide",
    threshold=0.6,
    limit=5,
)
Database Schema
web_content
Stores fetched web pages with their SOM data.
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| url | TEXT | Source URL (unique) |
| som | JSONB | Plasmate SOM JSON |
| text_content | TEXT | Extracted text |
| embedding | vector(1536) | Vector embedding for search |
| metadata | JSONB | User-defined metadata |
| fetched_at | TIMESTAMPTZ | When content was fetched |
| created_at | TIMESTAMPTZ | Row creation time |
| updated_at | TIMESTAMPTZ | Last update time |
crawl_jobs
Tracks batch crawl operations.
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| name | TEXT | Job name |
| urls | TEXT[] | URLs to crawl |
| status | TEXT | pending, running, completed, failed |
| progress | INTEGER | URLs processed |
| total | INTEGER | Total URLs |
| errors | JSONB | Error messages |
| started_at | TIMESTAMPTZ | Job start time |
| completed_at | TIMESTAMPTZ | Job end time |
| created_at | TIMESTAMPTZ | Row creation time |
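
As a rough illustration of how the `progress`, `total`, `status`, and `errors` columns relate, the sketch below models a job in plain Python. `CrawlJob` is a hypothetical client-side mirror of a subset of the columns, and the status transitions are an assumption; the real behavior is defined by the library and sql/schema.sql.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlJob:
    """Hypothetical client-side mirror of a crawl_jobs row (subset of columns)."""

    name: str
    urls: list
    status: str = "pending"
    progress: int = 0
    errors: list = field(default_factory=list)

    @property
    def total(self):
        # Total URLs in the job, matching the `total` column.
        return len(self.urls)

    def record(self, url, error=None):
        """Count one processed URL and update status; transition rules are assumed."""
        self.progress += 1
        if error is not None:
            self.errors.append({"url": url, "error": error})
        self.status = "completed" if self.progress >= self.total else "running"


job = CrawlJob(name="demo", urls=["https://example.com", "https://httpbin.org/html"])
job.record("https://example.com")
job.record("https://httpbin.org/html", error="timeout")
```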
Realtime Subscriptions
Subscribe to content changes:
from supabase_plasmate import RealtimeSubscriber
subscriber = RealtimeSubscriber(client.client)
# Subscribe to all changes
subscriber.subscribe_to_changes(
    table="web_content",
    callback=lambda payload: print(f"Changed: {payload}"),
)
# Find stale content
from datetime import timedelta
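stale = subscriber.get_stale_content(
    max_age=timedelta(days=7),
    limit=100,
)
# Refresh stale content. refresh_stale_content is a coroutine,
# so run it in an event loop when calling from synchronous code.
import asyncio
refreshed = asyncio.run(
    subscriber.refresh_stale_content(
        refresh_fn=client.refresh_content,
        max_age=timedelta(days=7),
    )
)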
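The staleness rule itself can be sketched without a database. This assumes a row counts as stale when its `fetched_at` timestamp is older than `max_age`; how `get_stale_content` decides this internally is not shown here. `is_stale` is a hypothetical helper, and the rows are plain dicts mirroring two `web_content` columns.

```python
from datetime import datetime, timedelta, timezone


def is_stale(fetched_at, max_age, now=None):
    """Return True when fetched_at is older than max_age (assumed rule)."""
    now = now or datetime.now(timezone.utc)
    return now - fetched_at > max_age


# A fixed reference time keeps the example deterministic.
now = datetime(2026, 4, 12, tzinfo=timezone.utc)
rows = [
    {"url": "https://example.com", "fetched_at": now - timedelta(days=10)},
    {"url": "https://httpbin.org/html", "fetched_at": now - timedelta(days=2)},
]
stale_urls = [
    row["url"] for row in rows if is_stale(row["fetched_at"], timedelta(days=7), now)
]
```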
Custom Headers
Pass custom headers for authenticated requests:
result = client.fetch_and_store(
    url="https://api.example.com/page",
    headers={
        "Authorization": "Bearer your-token",
        "X-Custom-Header": "value",
    },
)
Configuration
Environment Variables
| Variable | Description |
|---|---|
| SUPABASE_URL | Your Supabase project URL |
| SUPABASE_KEY | Supabase API key |
| OPENAI_API_KEY | For OpenAI embeddings |
| VOYAGEAI_API_KEY | For Voyage AI embeddings |
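
The table does not say whether the library reads these variables on its own, so one cautious option is to read them explicitly and pass them to the constructor. The sketch below shows that approach; `load_supabase_settings` is a hypothetical helper, not part of supabase-plasmate.

```python
import os


def load_supabase_settings(env=None):
    """Hypothetical helper: build PlasmateSupabase keyword arguments from the environment."""
    env = os.environ if env is None else env
    missing = [name for name in ("SUPABASE_URL", "SUPABASE_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"supabase_url": env["SUPABASE_URL"], "supabase_key": env["SUPABASE_KEY"]}


# Demonstration with an explicit mapping instead of the real environment.
settings = load_supabase_settings(
    {"SUPABASE_URL": "https://xxx.supabase.co", "SUPABASE_KEY": "your-api-key"}
)
```

The resulting dict matches the constructor's keyword names, so it could be used as `PlasmateSupabase(**settings)`.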
Custom Plasmate Path
client = PlasmateSupabase(
    supabase_url="...",
    supabase_key="...",
    plasmate_path="/path/to/plasmate",
)
Examples
See the examples/ directory for more examples:
- content_store.py: Complete example with all features
API Reference
PlasmateSupabase
class PlasmateSupabase:
    def __init__(self, supabase_url: str, supabase_key: str, plasmate_path: str = "plasmate")
    def fetch_and_store(self, url: str, table: str = "web_content", headers: dict = None, metadata: dict = None) -> dict
    def batch_fetch_and_store(self, urls: list[str], table: str = "web_content", headers: dict = None, concurrency: int = 5, on_progress: callable = None) -> list[dict]
    def get_content(self, url: str, table: str = "web_content") -> dict | None
    def list_content(self, table: str = "web_content", limit: int = 100, offset: int = 0) -> list[dict]
    def delete_content(self, url: str, table: str = "web_content") -> bool
    def refresh_content(self, url: str, table: str = "web_content", headers: dict = None) -> dict
VectorSearch
class VectorSearch:
    def __init__(self, client: Client, embedding_fn: callable = None, embedding_dim: int = 1536)
    def set_embedding_function(self, fn: callable) -> None
    def update_embedding(self, url: str, table: str = "web_content") -> dict
    def batch_update_embeddings(self, table: str = "web_content", batch_size: int = 100, on_progress: callable = None) -> int
    def semantic_search(self, query: str, table: str = "web_content", threshold: float = 0.7, limit: int = 10) -> list[dict]
    def find_similar(self, url: str, table: str = "web_content", threshold: float = 0.7, limit: int = 10) -> list[dict]
RealtimeSubscriber
class RealtimeSubscriber:
    def __init__(self, client: Client)
    def subscribe_to_changes(self, table: str, callback: callable, event: str = "*") -> str
    def unsubscribe(self, subscription_id: str) -> bool
    def get_stale_content(self, table: str = "web_content", max_age: timedelta = timedelta(days=7), limit: int = 100) -> list[dict]
    async def refresh_stale_content(self, refresh_fn: callable, table: str = "web_content", max_age: timedelta = timedelta(days=7), batch_size: int = 10) -> int
License
MIT