kumo
June 25, 2026 ยท View on GitHub
An async web crawling framework for Rust - Scrapy for Rust.
kumo means spider/cloud in Japanese. It gives you a trait-based, async-first API for writing spiders that scrape, follow links, and store data.
Why Kumo?
| kumo | Scrapy (Python) | Colly (Go) | |
|---|---|---|---|
| Language | Rust | Python | Go |
| Type safety | Compile-time | Runtime | Partial |
| Async model | Tokio (true async) | Twisted (event loop) | goroutines |
| Memory safety | Guaranteed | GC | GC |
| CSS / XPath / Regex / JSONPath | Yes | Yes | CSS only |
#[derive(Extract)] macro | Yes | No | No |
| LLM extraction (Claude / OpenAI / Gemini / Ollama) | Yes | No | No |
| Browser / JS rendering | Yes (chromiumoxide) | Yes (Playwright) | No |
| Stealth mode (TLS/HTTP2 fingerprint spoofing) | Yes | No | No |
| Distributed frontier (Redis) | Yes | Yes (scrapy-redis) | No |
| Item stream API | Yes | No | No |
| Typed crawl events | Yes | Signals | No |
| Async crawl hooks | Yes | Signals / extensions | No |
| OpenTelemetry export | Yes | No | No |
| Pluggable stores (JSONL, CSV, Postgres, SQLite, MySQL) | Yes | Yes (pipelines) | No |
| Single binary deploy | Yes | No | Yes |
| Binary size / startup | Small / instant | Large / slow | Small / fast |
Benchmark snapshot - 1,000 books, concurrency 16, median of 3 runs:
| kumo | Colly (Go) | Scrapy (Python) | |
|---|---|---|---|
| Real site - Items/s | 76.7 | 73.5 | 53.3 |
| Local server - Items/s | 12,346 | 4,098 | 180 |
| Peak RSS | 12.5 MB | 31.4 MB | 77.2 MB |
On this local-server parsing workload, Kumo measured 3.0x faster than Colly, 69x faster than Scrapy. Treat these as workload-specific results, not universal production guarantees. See the benchmark folder for full methodology and reproduction steps.
Features
- Async-first - Tokio-based with bounded concurrency via
JoinSet - CSS selectors -
res.css(".selector")backed byscraper - XPath selectors -
res.xpath("//h1/text()")for XML/HTML documents (feature:xpath) - Regex selectors -
res.re(r"\d+"),el.re_first(r"..."), works onResponse,Element, andElementList - JSONPath selectors -
res.jsonpath("$.store.books[*].title")for JSON responses (feature:jsonpath) #[derive(Extract)]- generate CSS extraction boilerplate from field annotations (feature:derive)- Rate limiting - token-bucket
RateLimiterviagovernor - Auto-throttle - adaptive delay based on EWMA latency and 429/503 back-off
- Retry with backoff - exponential backoff, status filtering, jitter, and
Retry-Aftersupport - Item stream -
CrawlEngine::stream()returns an asyncStreamfor real-time item consumption - Typed crawl events -
CrawlEventsignals for dashboards, progress bars, alerts, and embedded runners - Async crawl hooks -
CrawlHookextension points for metrics, audit logging, persistence, and policy checks - robots.txt - per-domain fetch + cache, enabled by default, with
Crawl-delaysupport - Bloom filter dedup - O(1) URL deduplication, 1M URLs at 0.1% false-positive rate
- Request scheduling -
CrawlRequestsupports priority, headers, method/body, metadata, anddont_filter - HTTP cache - disk-backed response cache via
.http_cache(dir), optional TTL - Link extractor -
LinkExtractorwith allow/deny regex,allow_domains,canonicalize,restrict_css - Pluggable storage -
JsonlStore,JsonStore,CsvStore,StdoutStore, PostgreSQL, SQLite, MySQL - Middleware chain - proxy rotation, custom headers, status retry, rate limiting, auto-throttle
- Domain + depth filtering -
allowed_domains()andmax_depth()on theSpidertrait - Multi-spider engine - run multiple spiders concurrently via
.add_spider()/.run_all() - LLM extraction - extract structured data without selectors using Claude, OpenAI, Gemini, or Ollama
- Browser fetcher - headless Chromium via chromiumoxide for JS-rendered pages (feature:
browser) - Distributed frontier - Redis-backed URL frontier for multi-process crawls (feature:
redis-frontier) - Persistent frontier - file-backed URL frontier that survives restarts and exposes recovered state (feature:
persistence) - Sitemap spider -
SitemapSpiderreadssitemap.xmland sitemap index files - Metrics - periodic stats snapshots via
tracing::info!using.metrics_interval() - OpenTelemetry - OTLP/gRPC export of all spans to Jaeger, Grafana Tempo, Datadog, etc. (feature:
otel)
Installation
[dependencies]
kumo = "0.5"
async-trait = "0.1"
serde = { version = "1", features = ["derive"] }
tokio = { version = "1", features = ["full"] }
For #[derive(Extract)]:
kumo = { version = "0.5", features = ["derive"] }
Quick Start
use kumo::prelude::*;
use serde::Serialize;
#[derive(Debug, Serialize)]
struct Quote {
text: String,
author: String,
}
struct QuotesSpider;
#[async_trait::async_trait]
impl Spider for QuotesSpider {
type Item = Quote;
fn name(&self) -> &str { "quotes" }
fn start_urls(&self) -> Vec<String> {
vec!["https://quotes.toscrape.com".into()]
}
async fn parse(&self, res: &Response) -> Result<Output<Self::Item>, KumoError> {
let quotes: Vec<Quote> = res.css(".quote").iter().map(|el| Quote {
text: el.css(".text").first().map(|e| e.text()).unwrap_or_default(),
author: el.css(".author").first().map(|e| e.text()).unwrap_or_default(),
}).collect();
let next = res.css("li.next a").first()
.and_then(|el| el.attr("href"))
.map(|href| res.urljoin(&href));
let mut output = Output::new().items(quotes);
if let Some(url) = next { output = output.follow(url); }
Ok(output)
}
}
#[tokio::main]
async fn main() -> Result<(), KumoError> {
CrawlEngine::builder()
.concurrency(5)
.middleware(DefaultHeaders::new().user_agent("kumo/0.4"))
.store(JsonlStore::new("quotes.jsonl")?)
.run(QuotesSpider)
.await?;
Ok(())
}
For more examples - production crawl controls, rate limiting, database stores, LLM extraction, browser mode, and all selector types - see the examples/ folder.
Documentation
Full documentation at kumo.wihlarkop.com
- Getting Started
- Spiders
- Extractors
- derive(Extract)
- Middleware
- Stores
- Production Crawling - baseline configuration, durable output, and recovery
- Production Reports and Diagnosis - health signals, timings, and failure diagnosis
- Advanced topics - item stream, OpenTelemetry, stealth, browser, and more
- Examples
- Feature Flags
Contributing
# Install lefthook (one-time setup)
# macOS
brew install lefthook
# Windows
scoop install lefthook
# or: winget install lefthook
# Linux
curl -1sLf 'https://dl.lefthook.dev/setup.sh' | sh
# Activate the hooks
lefthook install
After lefthook install, every git commit will automatically run cargo fmt (auto-fix) and cargo clippy before the commit goes through.
License
MIT