README.md

July 31, 2026

AnyCrawl

📖 Overview

AnyCrawl is a high‑performance crawling and scraping toolkit:

SERP crawling: multiple search engines, batch‑friendly
Web scraping: single‑page content extraction
Site crawling: full‑site traversal and collection
High performance: multi‑threading / multi‑process
Batch tasks: reliable and efficient
AI extraction: LLM‑powered structured data (JSON) extraction from pages

LLM‑friendly. Easy to integrate and use.

🚀 Quick Start

📖 See full docs: Docs

Generate an API Key (self-host)

If you enable authentication (ANYCRAWL_API_AUTH_ENABLED=true), generate an API key:

pnpm --filter api key:generate
# optionally name the key
pnpm --filter api key:generate -- default

The command prints uuid, key and credits. Use the printed key as a Bearer token.

Run Inside Docker

If running AnyCrawl via Docker:

Docker Compose:

docker compose exec api pnpm --filter api key:generate
docker compose exec api pnpm --filter api key:generate -- default

Single container (replace <container_name_or_id>):

docker exec -it <container_name_or_id> pnpm --filter api key:generate
docker exec -it <container_name_or_id> pnpm --filter api key:generate -- default

📚 Usage Examples

💡 Use the Playground to test APIs and generate code in your preferred language.

If self‑hosting, replace https://api.anycrawl.dev with your own server URL.

Web Scraping (Scrape)

Example


curl -X POST https://api.anycrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "url": "https://example.com",
  "engine": "cheerio"
}'

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `url` | string (required) | The URL to be scraped. Must be a valid URL starting with http:// or https:// | - |
| `engine` | string | Scraping engine to use. Options: `cheerio` (static HTML parsing, fastest), `playwright` (JavaScript rendering with modern engine), `puppeteer` (JavaScript rendering with Chrome) | cheerio |
| `proxy` | string | Proxy URL for the request. Supports HTTP and SOCKS proxies. Format: `http://[username]:[password]@proxy:port` | (none) |
| `max_age` | number | Cache control (ms). `0` = force refresh (skip cache read); `> 0` = accept cached content within this age; omit to use default. | (none) |
| `store_in_cache` | boolean | Cache control. Whether to store the result in cache. To bypass cache reads, use `max_age=0`. | true |

More parameters: see Request Parameters.

Cache details (self-host / S3 / map index): see docs/cache.md.
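
The parameters above can be combined in a small client-side helper. The following is a minimal Python sketch using only the standard library; `build_scrape_request` is a hypothetical name, not part of AnyCrawl. It builds the request without sending it, and it omits optional fields unless they are set so that the server defaults apply.

```python
import json
import urllib.request

# Replace with your own server URL if self-hosting.
API_BASE = "https://api.anycrawl.dev"
ENGINES = ("cheerio", "playwright", "puppeteer")


def build_scrape_request(api_key, url, engine="cheerio",
                         proxy=None, max_age=None, store_in_cache=None):
    """Build (but do not send) a POST /v1/scrape request."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    if engine not in ENGINES:
        raise ValueError(f"engine must be one of {ENGINES}")

    payload = {"url": url, "engine": engine}
    # Optional fields are only included when explicitly set.
    if proxy is not None:
        payload["proxy"] = proxy
    if max_age is not None:
        payload["max_age"] = max_age  # milliseconds; 0 skips the cache read
    if store_in_cache is not None:
        payload["store_in_cache"] = store_in_cache

    return urllib.request.Request(
        f"{API_BASE}/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Force a fresh fetch and do not store the result in the cache.
req = build_scrape_request(
    "YOUR_ANYCRAWL_API_KEY", "https://example.com",
    max_age=0, store_in_cache=False,
)
print(req.full_url)
print(json.loads(req.data))
```

To send the request, pass `req` to `urllib.request.urlopen`. The response format is not covered here; see the full docs.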

Browser Runtime

The public scrape and crawl engine values remain cheerio, playwright, and puppeteer. For self-hosted browser engines, playwright and puppeteer are launched through CloakBrowser by default; callers should not send a cloakbrowser engine value.

CloakBrowser requires Node.js 20 or newer. Docker images pre-install its browser binary during image build. For local or custom deployments, set CLOAKBROWSER_CACHE_DIR to a stable writable path and CLOAKBROWSER_AUTO_UPDATE=false to avoid browser downloads during worker startup. If you manage the binary yourself, set CLOAKBROWSER_BINARY_PATH.
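
For local or custom deployments, a quick check of these variables before starting workers can catch misconfiguration early. The sketch below is a hypothetical helper (`cloakbrowser_env_warnings` is not part of AnyCrawl or CloakBrowser). It only inspects the three environment variables named above and reports anything that deviates from the recommendations; it does not check the Node.js version.

```python
import os
import tempfile


def cloakbrowser_env_warnings(env):
    """Return a list of warnings for the CloakBrowser settings in `env`."""
    warnings = []

    cache_dir = env.get("CLOAKBROWSER_CACHE_DIR")
    if not cache_dir:
        warnings.append("CLOAKBROWSER_CACHE_DIR is not set")
    elif not (os.path.isdir(cache_dir) and os.access(cache_dir, os.W_OK)):
        warnings.append(f"CLOAKBROWSER_CACHE_DIR is not a writable directory: {cache_dir}")

    if env.get("CLOAKBROWSER_AUTO_UPDATE", "").lower() != "false":
        warnings.append("CLOAKBROWSER_AUTO_UPDATE is not 'false'; "
                        "workers may download a browser at startup")

    # Only relevant when you manage the browser binary yourself.
    binary = env.get("CLOAKBROWSER_BINARY_PATH")
    if binary and not os.path.isfile(binary):
        warnings.append(f"CLOAKBROWSER_BINARY_PATH does not point to a file: {binary}")

    return warnings


# Demo with a temporary directory standing in for a stable cache path.
with tempfile.TemporaryDirectory() as cache_dir:
    good_env = {
        "CLOAKBROWSER_CACHE_DIR": cache_dir,
        "CLOAKBROWSER_AUTO_UPDATE": "false",
    }
    print(cloakbrowser_env_warnings(good_env))

for warning in cloakbrowser_env_warnings({}):
    print("warning:", warning)
```

In a real deployment you would pass `os.environ` instead of a hand-built dictionary.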

LLM Extraction

curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer YOUR_ANYCRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "json_options": {
      "schema": {
        "type": "object",
        "properties": {
          "company_mission": { "type": "string" },
          "is_open_source": { "type": "boolean" },
          "employee_count": { "type": "number" }
        },
        "required": ["company_mission"]
      }
    }
  }'
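
The same request body can be assembled in Python, and the schema can be reused to sanity-check whatever the extraction returns. This is a minimal sketch under stated assumptions: the response format is not shown in this README, so `extracted` below is a hypothetical result, and `check_against_schema` is an illustrative helper that handles only flat schemas with the three types used in the example.

```python
import json

schema = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "is_open_source": {"type": "boolean"},
        "employee_count": {"type": "number"},
    },
    "required": ["company_mission"],
}

# Request body for POST /v1/scrape, matching the curl example.
payload = {"url": "https://example.com", "json_options": {"schema": schema}}

JSON_TYPES = {"string": str, "boolean": bool, "number": (int, float)}


def check_against_schema(data, schema):
    """Return a list of problems; an empty list means `data` fits the schema."""
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, value in data.items():
        spec = schema["properties"].get(key)
        if spec is None:
            continue  # fields outside the schema are ignored in this sketch
        expected = JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it for "number".
        is_stray_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if is_stray_bool or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems


# Hypothetical extraction result, used only to exercise the checker.
extracted = {
    "company_mission": "Make crawling easy",
    "is_open_source": True,
    "employee_count": "12",
}

print(json.dumps(payload, indent=2))
print(check_against_schema(extracted, schema))
```

For nested schemas or stricter validation, a full JSON Schema validator would be a better fit than this hand-rolled check.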

Atlas Cloud Provider

AnyCrawl supports Atlas Cloud as an OpenAI-compatible LLM provider for extraction and summarization workloads.

Official site: Atlas Cloud
LLM base URL: https://api.atlascloud.ai/v1
Recommended env model format: atlascloud/deepseek-v3

ATLASCLOUD_BASE_URL=https://api.atlascloud.ai/v1
ATLASCLOUD_API_KEY=your-atlascloud-api-key
DEFAULT_LLM_MODEL=atlascloud/deepseek-v3
DEFAULT_EXTRACT_MODEL=atlascloud/deepseek-v3

If you prefer file-based AI config, add an atlascloud provider entry in ai.config.json and map it to any Atlas Cloud model exposed through the OpenAI-compatible chat API.

Site Crawling (Crawl)

Example


curl -X POST https://api.anycrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "url": "https://example.com",
  "engine": "playwright",
  "max_depth": 2,
  "limit": 10,
  "strategy": "same-domain"
}'

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `url` | string (required) | Starting URL to crawl | - |
| `engine` | string | Crawling engine. Options: `cheerio`, `playwright`, `puppeteer` | cheerio |
| `max_depth` | number | Max depth from the start URL | 10 |
| `limit` | number | Max number of pages to crawl | 100 |
| `strategy` | enum | Scope: `all`, `same-domain`, `same-hostname`, `same-origin` | same-domain |
| `include_paths` | array | Only crawl paths matching these patterns | (none) |
| `exclude_paths` | array | Skip paths matching these patterns | (none) |
| `scrape_options` | object | Per-page scrape options (formats, timeout, json extraction, etc.), same as Scrape options | (none) |

More parameters and endpoints: see Request Parameters.
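
To get a feel for how the four `strategy` values differ, the sketch below classifies candidate links relative to a start URL. It assumes the names carry their conventional meanings (origin = scheme, host, and port; hostname = exact host; domain = the host's base domain including subdomains); the server's actual matching may differ. The `in_scope` helper is hypothetical, and its base-domain logic is a naive "last two labels" approximation that does not consult the Public Suffix List.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def _origin(parts):
    # Treat an omitted port as the scheme's default port.
    return (parts.scheme, parts.hostname,
            parts.port or DEFAULT_PORTS.get(parts.scheme))


def _base_domain(hostname):
    # Naive approximation: wrong for suffixes such as "co.uk".
    return ".".join((hostname or "").split(".")[-2:])


def in_scope(start_url, candidate_url, strategy="same-domain"):
    """Return True if candidate_url falls within the given crawl scope."""
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    if strategy == "all":
        return True
    if strategy == "same-origin":
        return _origin(start) == _origin(cand)
    if strategy == "same-hostname":
        return start.hostname == cand.hostname
    if strategy == "same-domain":
        return _base_domain(start.hostname) == _base_domain(cand.hostname)
    raise ValueError(f"unknown strategy: {strategy}")


start = "https://example.com"
candidates = [
    "https://example.com/about",
    "http://example.com/about",
    "https://blog.example.com/post",
    "https://other.org/",
]
for strategy in ("all", "same-domain", "same-hostname", "same-origin"):
    allowed = [c for c in candidates if in_scope(start, c, strategy)]
    print(strategy, allowed)
```

Under these assumptions each strategy is at least as strict as the one before it in the loop, which is why `same-origin` is the safest choice when a crawl must stay on a single site and scheme.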

Search Engine Results (SERP)

Example

curl -X POST https://api.anycrawl.dev/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "query": "AnyCrawl",
  "limit": 10,
  "engine": "google",
  "lang": "all"
}'

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `query` | string (required) | Search query to be executed | - |
| `engine` | string | Search engine to use. Options: `google` | google |
| `pages` | integer | Number of search result pages to retrieve | 1 |
| `lang` | string | Language code for search results (e.g., 'en', 'zh', 'all') | en-US |

Supported search engines

Google
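
When running several searches, it can help to generate the request bodies up front. The sketch below builds one body per query from the documented parameters. `build_search_payloads` is a hypothetical helper, and it assumes each body is sent as its own POST to /v1/search; this README does not describe a batch endpoint.

```python
import json


def build_search_payloads(queries, engine="google", pages=1, lang="en-US"):
    """Return one /v1/search request body per non-empty query."""
    if engine != "google":
        raise ValueError("google is the only engine listed as supported")
    if pages < 1:
        raise ValueError("pages must be at least 1")
    return [
        {"query": q.strip(), "engine": engine, "pages": pages, "lang": lang}
        for q in queries
        if q.strip()  # skip blank entries
    ]


payloads = build_search_payloads(
    ["AnyCrawl", "web scraping toolkit", "   "], pages=2, lang="all"
)
for payload in payloads:
    print(json.dumps(payload))
```

Validating `engine` and `pages` on the client keeps obviously bad requests from being sent at all.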

❓ FAQ

Can I use proxies? Yes. AnyCrawl ships with a high‑quality default proxy. You can also configure your own: set the proxy request parameter (per request) or ANYCRAWL_PROXY_URL (self‑hosting).
How do I handle JavaScript‑rendered pages? Set engine to playwright or puppeteer; both render JavaScript before extraction.
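
For the proxy question above, credentials that contain characters such as `@` or `:` need percent-encoding before they go into the `http://[username]:[password]@proxy:port` format. The sketch below shows one way to build such a URL with the standard library; `build_proxy_url` is a hypothetical helper, and the host and credentials are placeholders.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None, scheme="http"):
    """Build a proxy URL, percent-encoding any credentials."""
    auth = ""
    if username is not None:
        auth = quote(username, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    return f"{scheme}://{auth}{host}:{port}"


proxy_url = build_proxy_url("proxy.example.com", 8080, "user", "p@ss:word")
print(proxy_url)  # http://user:p%40ss%3Aword@proxy.example.com:8080
```

The resulting string can be passed as the `proxy` request parameter or set as ANYCRAWL_PROXY_URL when self-hosting.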
