
April 28, 2026

DataSpoc Lens

Requires Python 3.10+.

The data lake query engine for humans and AI agents.

Why Lens?

Data teams store Parquet in S3, GCS, or Azure but still spin up heavy warehouses just to run SQL. DataSpoc Lens mounts cloud buckets as DuckDB views and gives you an interactive shell, notebooks, AI-powered queries, and local caching -- all from a single CLI. It works from the terminal or as an MCP server for AI agents such as Claude, Cursor, and Windsurf. No servers, no infrastructure, no data copying.

Installation

pip install dataspoc-lens

Cloud and feature extras:

pip install dataspoc-lens[s3]       # AWS S3
pip install dataspoc-lens[gcs]      # Google Cloud Storage
pip install dataspoc-lens[azure]    # Azure Blob Storage
pip install dataspoc-lens[jupyter]  # JupyterLab integration
pip install dataspoc-lens[ai]       # AI natural language queries
pip install dataspoc-lens[all]      # Everything

Quick Start

1. Initialize and register a bucket

dataspoc-lens init
dataspoc-lens add-bucket s3://my-data-lake

Lens discovers tables automatically -- first from Pipe's .dataspoc/manifest.json, then by scanning for *.parquet files.
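That two-step discovery order can be sketched in a few lines of Python. This is an illustrative sketch, not Lens's actual implementation: the helper name and the assumed manifest shape (`{"tables": [{"name": ...}]}`) are assumptions.

```python
import json
from pathlib import Path

def discover_tables(bucket_root: Path) -> list[str]:
    """Sketch of Lens-style discovery: prefer the Pipe manifest, else scan."""
    manifest = bucket_root / ".dataspoc" / "manifest.json"
    if manifest.exists():
        # Assumed manifest shape: {"tables": [{"name": ...}, ...]}
        data = json.loads(manifest.read_text())
        return [t["name"] for t in data.get("tables", [])]
    # Fallback: every *.parquet file becomes a table named after its file stem
    return sorted(p.stem for p in bucket_root.rglob("*.parquet"))
```

The manifest takes priority because it can carry curated table names and metadata; the parquet scan is a best-effort fallback.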

2. Explore the catalog

dataspoc-lens catalog
dataspoc-lens catalog --detail orders

3. Query with SQL

dataspoc-lens query "SELECT * FROM orders LIMIT 10"
dataspoc-lens query "SELECT status, COUNT(*) FROM orders GROUP BY status"

4. Launch the interactive shell

dataspoc-lens shell
lens> SELECT customer_id, SUM(total) FROM orders GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
lens> .tables
lens> .schema orders
lens> .export csv /tmp/orders.csv
lens> .quit

5. Configure AI and ask questions

Before using ask, configure an LLM provider:

Option A -- Local AI (free, no API key):

dataspoc-lens setup-ai

Option B -- Cloud provider:

# Anthropic (default)
export DATASPOC_LLM_API_KEY=sk-ant-...

# OpenAI
export DATASPOC_LLM_PROVIDER=openai
export DATASPOC_LLM_API_KEY=sk-...

Then ask questions in natural language:

dataspoc-lens ask "how many orders were placed yesterday?"
dataspoc-lens ask "top 10 customers by revenue this month"
dataspoc-lens ask --debug "average order value by month"

Lens sends your table schemas and sample data to the LLM, receives SQL, executes it, and prints the results. Use --debug to see the full prompt sent to the LLM.
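The schema-plus-samples prompt described above could be assembled along these lines. The template wording, function name, and data shapes here are assumptions for illustration; `--debug` shows the prompt Lens actually sends.

```python
def build_prompt(question: str,
                 schemas: dict[str, list[str]],
                 samples: dict[str, list[tuple]]) -> str:
    """Assemble an LLM prompt from table schemas and sample rows (sketch)."""
    parts = ["You are a SQL assistant. Tables available:"]
    for table, cols in schemas.items():
        parts.append(f"- {table}({', '.join(cols)})")
        for row in samples.get(table, [])[:3]:  # cap samples to keep the prompt small
            parts.append(f"  sample: {row}")
    parts.append(f"Question: {question}")
    parts.append("Respond with a single DuckDB SQL statement.")
    return "\n".join(parts)
```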

6. Export results

Add --export to any query or ask command. Format is detected from the file extension:

dataspoc-lens query "SELECT * FROM orders" --export orders.csv
dataspoc-lens query "SELECT * FROM users" --export users.parquet
dataspoc-lens ask "monthly revenue" --export revenue.json
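Extension-based detection amounts to a small suffix-to-format lookup; a minimal sketch (the mapping below covers only the three formats shown above, and the error behavior is an assumption):

```python
from pathlib import Path

EXPORT_FORMATS = {".csv": "csv", ".parquet": "parquet", ".json": "json"}

def detect_format(path: str) -> str:
    """Map a file extension to an export format; reject unknown extensions."""
    suffix = Path(path).suffix.lower()
    try:
        return EXPORT_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported export extension: {suffix!r}") from None
```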

Features

Interactive Shell

SQL REPL with syntax highlighting, autocomplete, and history. Dot commands: .tables, .schema <table>, .buckets, .cache <table>, .export <format> <path>, .help, .quit.
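A REPL like this typically routes each line by its leading dot; a hypothetical sketch of that dispatch (not Lens's internals):

```python
DOT_COMMANDS = {".tables", ".schema", ".buckets", ".cache", ".export", ".help", ".quit"}

def parse_shell_line(line: str) -> tuple[str, str, list[str]]:
    """Classify a shell line as a dot command or raw SQL (illustrative sketch)."""
    line = line.strip()
    if line.startswith("."):
        cmd, *args = line.split()
        if cmd not in DOT_COMMANDS:
            raise ValueError(f"unknown command: {cmd}")
        return ("dot", cmd, args)
    return ("sql", line, [])
```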

Notebook

Launch JupyterLab or Marimo with all tables pre-mounted:

pip install dataspoc-lens[jupyter]
dataspoc-lens notebook

pip install dataspoc-lens[marimo]
dataspoc-lens notebook --marimo

SQL Transforms

Drop numbered .sql files into ~/.dataspoc-lens/transforms/; they run in numeric order:

dataspoc-lens transform list
dataspoc-lens transform run
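Ordering numbered files needs a numeric sort key, since a plain lexicographic sort would put 10_join.sql before 2_clean.sql. A sketch of that ordering under the assumption that filenames look like NN_name.sql:

```python
from pathlib import Path

def ordered_transforms(transform_dir: Path) -> list[Path]:
    """Return *.sql files sorted by numeric prefix, e.g. 2_clean.sql before 10_join.sql."""
    def key(p: Path) -> tuple[int, str]:
        prefix = p.stem.split("_", 1)[0]
        # Files without a numeric prefix sort last, then alphabetically
        return (int(prefix) if prefix.isdigit() else 10**9, p.name)
    return sorted(transform_dir.glob("*.sql"), key=key)
```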

Cache

Copy tables locally for offline work and reduced egress costs:

dataspoc-lens cache orders              # Cache a table
dataspoc-lens cache --list              # Check status (fresh/stale)
dataspoc-lens cache orders --refresh    # Re-download
dataspoc-lens cache --clear             # Clear all

Freshness: compares your cache timestamp against the manifest's last_extraction.
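The staleness check reduces to a timestamp comparison. A sketch, assuming both timestamps are ISO-8601 strings (the real field layout in the manifest is not shown here):

```python
from datetime import datetime

def is_stale(cached_at_iso: str, last_extraction_iso: str) -> bool:
    """A cached table is stale when the manifest records a newer extraction."""
    return datetime.fromisoformat(last_extraction_iso) > datetime.fromisoformat(cached_at_iso)
```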

AI Agent Integration

Lens works as an MCP server for Claude Desktop, Claude Code, Cursor, and any MCP-compatible AI agent.

pip install dataspoc-lens[mcp]
dataspoc-lens mcp                           # Start MCP server (stdio)

Add to your Claude Desktop config (claude_desktop_config.json):

{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"]
    }
  }
}

Your agent can now discover tables, run SQL, ask questions in natural language, and manage cache.
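If you manage the config programmatically, the entry above can be merged without clobbering other registered servers. A sketch (the helper name is hypothetical; only the JSON shape comes from the snippet above):

```python
import json
from pathlib import Path

def register_mcp_server(config_path: Path) -> None:
    """Add the dataspoc-lens entry to a Claude Desktop config, keeping other servers."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})["dataspoc-lens"] = {
        "command": "dataspoc-lens",
        "args": ["mcp"],
    }
    config_path.write_text(json.dumps(config, indent=2))
```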

Python SDK

from dataspoc_lens import LensClient

with LensClient() as client:
    tables = client.tables()
    schema = client.schema("orders")
    result = client.query("SELECT status, COUNT(*) FROM orders GROUP BY 1")
    answer = client.ask("top 10 customers by revenue")
    stale = client.cache_refresh_stale()

JSON Output

All CLI commands support --output json for machine-readable output:

dataspoc-lens catalog --output json
dataspoc-lens query "SELECT * FROM orders LIMIT 5" --output json
dataspoc-lens ask "monthly revenue" --output json
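JSON output is meant for scripting. The payload below is a hypothetical example of what a query result might look like -- the actual output schema is not documented in this README -- but it shows the general consumption pattern:

```python
import json

# Hypothetical payload shaped like `--output json` might emit; the real
# schema may differ.
payload = '{"columns": ["status", "count"], "rows": [["shipped", 42], ["pending", 7]]}'

result = json.loads(payload)
by_status = dict(result["rows"])  # e.g. counts keyed by order status
```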

Commands

dataspoc-lens init                          # Initialize configuration
dataspoc-lens add-bucket <uri>              # Register a bucket
dataspoc-lens catalog                       # List all tables
dataspoc-lens catalog --detail <table>      # Show table schema
dataspoc-lens query "<sql>"                 # Execute SQL query
dataspoc-lens query "<sql>" --export f.csv  # Execute and export
dataspoc-lens shell                         # Interactive SQL shell
dataspoc-lens ask "<question>"              # Natural language query
dataspoc-lens ask "<question>" --debug      # Show LLM prompt
dataspoc-lens setup-ai                      # Install local AI (Ollama)
dataspoc-lens notebook                      # Launch JupyterLab
dataspoc-lens notebook --marimo             # Launch Marimo
dataspoc-lens transform list                # List transform files
dataspoc-lens transform run                 # Run all transforms
dataspoc-lens cache <table>                 # Cache a table locally
dataspoc-lens cache --list                  # List cached tables
dataspoc-lens cache --clear                 # Clear cache
dataspoc-lens mcp                           # Start MCP server for AI agents
dataspoc-lens ml activate [key]             # Activate DataSpoc ML
dataspoc-lens ml train --target col --from tbl  # Train a model
dataspoc-lens ml predict --model m --from tbl   # Generate predictions
dataspoc-lens ml models                     # List trained models
dataspoc-lens --version                     # Show version

Part of the DataSpoc Platform

Product                 Role
DataSpoc Pipe           Ingestion: Singer taps to Parquet in cloud buckets
DataSpoc Lens (this)    Virtual warehouse: SQL + Jupyter + AI over your data lake
DataSpoc ML             AutoML: train and deploy models from your lake

Pipe writes. Lens reads. ML learns.

License

Apache-2.0 -- free to use, modify, and distribute.