dig2crawl

April 5, 2026 · View on GitHub

Universal agnostic web crawler with multi-level AI extraction.

Given a URL and a natural-language goal, the crawler auto-escalates through 4 extraction levels — CSS selectors → browser interactions → visual (screenshot + Claude Vision) → captcha — until enough data is extracted.

What it does

  • L1 — CSS selectors: Claude reads raw HTML, discovers container + field selectors, validates extraction, produces a reusable SiteProfile
  • L2 — Interactive: When L1 yields too few records, Claude suggests browser actions (clicks, scrolls, dismiss overlays) and the crawler executes them via dig2browser, then re-extracts
  • L3 — Visual: When L2 still falls short, takes a screenshot, sends it to Claude Vision for coordinate-based actions (click at x,y), executes them, re-extracts
  • L4 — Captcha: Architecture stub with CaptchaSolver trait — not implemented unless forced by life
  • Auto-navigation — if Claude lands on a homepage/landing without data, it finds the right sub-page (e.g. /vps/) and redirects automatically (up to 2 hops)
  • SPA JSON extraction — detects __NEXT_DATA__ (Next.js), __NUXT_DATA__ (Nuxt), window.__NUXT__, window.__INITIAL_STATE__ before HTML stripping removes <script> tags; Claude can return json_path extraction mode instead of CSS selectors
  • Detects anti-bot protection, extracts JSON-LD and page metadata as bonus context
  • Saves a SiteProfile and exports a DaemonSpec for scheduled monitoring

Architecture

Single-crate layout — all modules live under src/:

dig2crawl/
├── src/
│   ├── main.rs    — CLI binary (clap) + escalation coordinator
│   ├── lib.rs     — Library root
│   ├── core/      — Types, traits, error types, rate limiter, engine
│   ├── fetch/     — HttpFetcher, BrowserFetcher, interactive action executor
│   ├── agent/     — AgentSession, prompts, protocol, actions, visual, captcha
│   ├── parser/    — SelectorExtractor, JsonLdExtractor, AntiBotDetector, MetadataExtractor, LinkExtractor, SpaJsonExtractor
│   ├── config/    — TOML job config, SiteProfile, DaemonSpec serialisation
│   └── storage/   — SQLite + JSONL output backends

How it works

Discovery runs in 5 steps + escalation:

  1. Fetch — page is fetched via HTTP or headless browser; anti-bot check runs immediately
  2. Context extraction — SPA JSON blocks (__NEXT_DATA__, __NUXT_DATA__) extracted before HTML cleaning; JSON-LD and page metadata injected as bonus context
  3. Auto-navigation — if Claude determines the page is a homepage without target data, it returns a navigate response with the target URL; the crawler fetches the new page and restarts discovery (max 2 hops)
  4. Discovery (L1) — Claude reads page.html from disk, analyses the DOM, writes selectors + confidence score; for SPA sites it may return json_path extraction mode instead of CSS selectors
  5. Validation — the pure-Rust SelectorExtractor applies discovered selectors; Claude reviews the sample
  6. SaveSiteProfile is written to output/<domain>/profile.json

Escalation — if L1 yields fewer records than --min-records or confidence below --min-confidence:

  1. L2 Interactive — Claude suggests BrowserActions (Click, ScrollTo, DismissOverlay, Type, WaitForElement, etc.), the crawler executes them on a live browser page, then re-extracts with the same selectors
  2. L3 Visual — a screenshot is taken and sent to Claude Vision; Claude responds with coordinate-based VisualActions (click at x,y), which are converted to browser actions and executed
  3. L4 Captcha — if anti-bot is detected and --max-level 4, prints a warning (solver trait exists but is not implemented)

After discovery, extract applies the saved profile in pure Rust — no agent needed.

Browser mode (default)

dig2crawl uses a stealth headless browser (dig2browser) by default for all commands.

Supported browsers:

BrowserBackendDetectionNotes
ChromeCDPAuto (first priority)Full stealth: UA override, Client Hints, canvas noise, WebGL spoof
EdgeCDPAuto (second priority)Same Chromium flags as Chrome
FirefoxBiDi"browser": "firefox" in fingerprintStealth via moz:firefoxOptions.prefs, requires geckodriver
  • Uses CDP (Chrome/Edge) or BiDi (Firefox) with stealth scripts that patch navigator.webdriver and related fingerprint vectors
  • Bypasses Cloudflare and other WAF challenges that block plain HTTP clients
  • Waits for a CSS selector to appear before capturing HTML (--wait-selector)
  • Auto-creates a persistent browser profile per domain at %TEMP%/dig2crawl-profiles/<domain>/

By default, dig2browser auto-detects Chrome → Edge. To force a specific browser, set "browser" in the fingerprint config.

Pass --http-only to fall back to a plain reqwest HTTP client — faster, sufficient for static or server-rendered pages.

Fingerprint configuration

Use --fingerprint <path> to load a JSON file that configures the browser fingerprint. All fields are optional — omitted fields use defaults (en-US locale, Standard stealth level, 1920x1080 viewport).

{
  "browser": "chrome",
  "level": "full",
  "locale": "ru-RU",
  "timezone": "Europe/Moscow",
  "viewport": [1440, 900],
  "hardware_concurrency": 4,
  "device_memory_gb": 4,
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."
}
FieldTypeDefaultDescription
browserstring"auto"Browser: "auto", "chrome", "edge", "firefox"
levelstring"standard"Stealth level: "basic", "standard_no_webgl", "standard", "full"
localestring"en-US"BCP-47 locale tag
timezonestringnullIANA timezone (e.g. "Europe/Moscow")
viewport[w, h][1920, 1080]Screen resolution
hardware_concurrencyint8navigator.hardwareConcurrency
device_memory_gbint8navigator.deviceMemory (GB)
user_agentstringChrome 131Full User-Agent string

Firefox note: Firefox uses the BiDi protocol via geckodriver (must be running at http://localhost:4444). Stealth is applied via moz:firefoxOptions.prefs — no CDP-level overrides. set_extra_http_headers is not supported on Firefox.

The fingerprint applies to both auth (visible browser) and all headless commands. This ensures the auth session and subsequent crawling share the same fingerprint — critical for sites that compare session fingerprints against cookie fingerprints (e.g. Yandex SmartCaptcha).

CLI

# Discover selectors and produce a SiteProfile (browser by default)
dig2crawl discover <url> --goal "Extract VPS plans: name, price, cpu, ram, disk"
dig2crawl discover <url> --goal "..." --wait-selector "div.tariffs"
dig2crawl discover <url> --goal "..." --http-only --output-dir ./profiles/mysite

# Discover with escalation control
dig2crawl discover <url> --goal "..." --max-level 2          # stop at L2 (no visual)
dig2crawl discover <url> --goal "..." --min-records 5         # escalate until 5+ records
dig2crawl discover <url> --goal "..." --min-confidence 0.8    # escalate until 80% confidence

# Extract data using a saved profile (pure Rust, no agent)
dig2crawl extract <url> --profile output/<domain>/profile.json
dig2crawl extract <url> --profile output/<domain>/profile.json --max-pages 5 --output records.jsonl

# Export a DaemonSpec for scheduled monitoring
dig2crawl export-spec output/<domain>/profile.json --schedule "0 6 * * *" --output spec.json
dig2crawl export-spec output/<domain>/profile.json --schedule "0 6 * * *" --output spec.toml

# Cookie auth — separate binary, see below
cookie-auth <url>
cookie-auth <url> --profile %TEMP%/custom-profile     # explicit profile path
cookie-auth <url> --fingerprint russian.json           # custom fingerprint for auth

# Debug tools (browser by default, add --http-only for plain HTTP)
dig2crawl fetch <url> [--output page.html] [--metadata] [--jsonld] [--antibot]
dig2crawl test-selector <url> --selector "div.item" --field "title:h2.name" --field "price:.price"
dig2crawl collect-links <url> [--depth 2] [--domain-only]

Global flags:

  • --verbose / -v — debug logging
  • --headed — launch browser in visible (non-headless) mode
  • --browser-profile <PATH> — explicit persistent profile directory (default: auto %TEMP%/dig2crawl-profiles/<domain>/)
  • --fingerprint <PATH> — JSON fingerprint config (locale, timezone, viewport, stealth level, etc.)
  • --bot-auth <JWKS_URL> — enable Web Bot Auth signing
  • --bot-key <PATH> — Ed25519 private key for bot auth (default: keys/bot.key)
  • --model <MODEL> — Claude model to use (default: claude-sonnet-4-6)
  • --max-level <N> — maximum extraction level: 1=CSS, 2=interactive, 3=visual (default: 3)
  • --min-records <N> — minimum records before considering L1 successful (default: 1)
  • --min-confidence <F> — minimum confidence threshold (default: 0.5)

Separate binary for sites behind captcha or login walls (e.g. Yandex SmartCaptcha). Opens a visible browser where you log in manually, saves cookies to a persistent profile reused by subsequent discover/extract/fetch commands.

# Build
cargo build --release --bin cookie-auth

# Step 1: Open browser, pass captcha, close the window
# Profile auto-created at %TEMP%/dig2crawl-profiles/yandex.cloud/
cookie-auth https://yandex.cloud/ru/prices --fingerprint russian.json

# Step 2: Use the saved profile for headless crawling (same fingerprint!)
dig2crawl discover https://yandex.cloud/ru/prices \
    --goal "Extract cloud VM pricing" \
    --fingerprint russian.json
FlagDescription
--fingerprint <PATH>JSON fingerprint config (browser, locale)
--profile <PATH>Explicit profile directory (default: %TEMP%/dig2crawl-profiles/<domain>/)

The --fingerprint flag ensures auth and headless sessions share the same browser fingerprint. Without it, default fingerprint (en-US) is used for both.

Under the hood, cookie-auth calls dig2browser::cookies::open_auth_session_with_locale() which launches the browser with the same stealth args as headless mode. The profile directory is passed via --user-data-dir.

Browser testing (dev-fetch in dig2browser)

For quick browser testing without the full crawler, use dev-fetch from dig2browser — DevTools in your terminal:

cargo install dig2browser
dev-fetch https://cloud.vk.com/pricing --fingerprint russian.json --network-log --cookies --save-html out.html

See dig2browser README for full flag reference.

Agent internals

AgentSession wraps gate4agent PipeSession — the same library used by agent2overlay.

Pipe mode — Claude CLI runs in headless NDJSON-streaming mode:

  • Prompts are delivered via stdin (no file intermediary, no cmd.exe argument length limits)
  • Claude responds via stdout as NDJSON stream-json events (PipeText, PipeToolStart, PipeSessionEnd, etc.)
  • --resume <session_id> is captured from PipeSessionStart events and passed on subsequent calls — L1 → validation → L2 → L3 all share the same conversational context
  • --dangerously-skip-permissions enables tool use (Read, Grep, Bash) in one-shot mode
  • Claude reads HTML/screenshot files referenced in the prompt via its Read tool (with offset/limit for large files)

HTML cleaning — before saving to disk, <script>, <style>, <svg>, <noscript> tags are stripped and whitespace collapsed (typically 70-90% size reduction). Claude reads the cleaned file in chunks if needed.

Temp directory layout during a discover run:

%TEMP%/dig2crawl_<pid>/
├── page.html              — cleaned fetched HTML (scripts/styles stripped)
├── spa_data.json           — extracted SPA JSON (if __NEXT_DATA__ / __NUXT_DATA__ found)
├── l2_page.html           — post-action HTML (if L2 escalated)
└── l3_screenshot.png      — page screenshot (if L3 escalated)

The directory is deleted when the session closes.

Robust JSON parsing — the agent protocol tolerates several classes of model output variation:

  • Regex escape sequences in transform patterns (e.g. \\d+, \\\s*) sanitized — the state machine correctly consumes both characters of valid JSON escapes (\\, \") before inspecting the next character
  • null selectors for fields absent from the page (skipped gracefully, not treated as errors)
  • Untagged next_urls values — accepted as either a plain string or a {"url": "..."} object
  • Mixed-type url_patterns and field specs — arrays and scalars both accepted at every field

Batch crawl results

Tested against VPS hosting providers with auto-navigation and SPA extraction (v0.3.19):

ProviderConfidenceRecordsNotes
serverspace.io0.889L1 CSS selectors
ruvds.com0.933L1, browser mode
4dedic.io0.924233L1, large catalog
robovps.biz0.917Auto-nav homepage → /vps/
ishosting.com0.926Auto-nav ×2, through Cloudflare
timeweb.cloud0.904L3 visual (Nuxt SPA)
adminvps.ru0.88L1 CSS selectors

Auto-navigation successfully finds VPS pricing pages from homepages. SPA sites (Next.js, Nuxt) are handled via embedded JSON extraction or L3 visual fallback.

Output

output/<domain>/
└── profile.json

Example (output/adminvps.ru/profile.json):

{
  "domain": "adminvps.ru",
  "container_selector": "#tariffs .slider__block-content .swiper-slide .slider__item",
  "fields": [
    { "name": "name",  "selector": ".products-new__item-head-title", "extract": "text", "transform": "trim" },
    { "name": "price", "selector": ".slide-title p",                 "extract": "text", "transform": "parse_number" },
    { "name": "cpu",   "selector": ".slide-params-list li:nth-child(1) .char", "extract": "text", "transform": "trim" }
  ],
  "pagination": null,
  "requires_browser": true,
  "confidence": 0.9,
  "validated": true
}

output/ is gitignored.

Prerequisites

  • Claude CLInpm install -g @anthropic-ai/claude-code
  • Chrome — required for --browser mode (auto-detected by dig2browser)
  • Rust 1.75+

Building

cargo build --release --bin dig2crawl

Dependencies

CratePurpose
dig2browserStealth browser automation (CDP + BiDi) — crates.io
gate4agentClaude CLI pipe session with NDJSON streaming — crates.io
scraperCSS selector engine
reqwestHTTP client
rusqliteSQLite storage
clapCLI framework
governorRate limiter
tokioAsync runtime

Support the Project

If you find this tool useful, consider supporting development:

CurrencyNetworkAddress
USDTTRC20TNxMKsvVLYViQ5X5sgCYmkzH4qjhhh5U7X
USDCArbitrum0xEF3B94Fe845E21371b4C4C5F2032E1f23A13Aa6e
ETHEthereum0xEF3B94Fe845E21371b4C4C5F2032E1f23A13Aa6e
BTCBitcoinbc1qjgzthxja8umt5tvrp5tfcf9zeepmhn0f6mnt40
SOLSolanaDZJjmH8Cs5wEafz5Ua86wBBkurSA4xdWXa3LWnBUR94c

License

MIT