Web Scraper to Markdown

August 5, 2025

This Python-based web scraper fetches content from URLs and exports it to Markdown and JSON. It is designed for simplicity and extensibility, and its JSON output is intended for upload to GPT models, which makes it a good fit for anyone who wants to use web content for AI training or analysis.

Quick Start

Install with pipx (or, even better, use Docker):

pipx install crawler-to-md

Alternatively, install with pip:

pip install crawler-to-md

Then run the scraper:

crawler-to-md --url https://www.example.com

Features

  • Scrapes web pages for content and metadata.
  • Filters links by base URL.
  • Excludes URLs containing certain strings.
  • Automatically finds links or can use a file of URLs to scrape.
  • Rate limiting and delay support.
  • Exports data to Markdown and JSON, ready for GPT uploads.
  • Exports each page as an individual Markdown file if --export-individual is used.
  • Uses SQLite for efficient data management.
  • Configurable via command-line arguments.
  • Include or exclude specific HTML elements using CSS-like selectors (#id, .class, tag) during Markdown conversion.
  • Docker support.
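
To make the link-filtering features concrete, here is a minimal Python sketch of how a base-URL filter and substring exclusions could be combined. It is an illustration only, not the project's actual code: the function name `should_crawl` is hypothetical, and treating the base URL as a simple string prefix is an assumption.

```python
def should_crawl(url: str, base_url: str, exclude_substrings: list[str]) -> bool:
    """Hypothetical helper: keep a link only if it passes both filters."""
    # Assumption: "filter by base URL" means the link starts with the base URL.
    if not url.startswith(base_url):
        return False
    # Drop the link if it contains any of the excluded strings.
    return not any(fragment in url for fragment in exclude_substrings)


links = [
    "https://www.example.com/docs/intro",
    "https://www.example.com/docs/private/notes",
    "https://www.example.com/blog/post",
]
kept = [
    link
    for link in links
    if should_crawl(link, "https://www.example.com/docs", ["/private"])
]
print(kept)
```

With these sample links, only the first one passes both filters: the second contains an excluded string and the third falls outside the base URL.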

Requirements

Python 3.10 or higher is required.

Project dependencies are managed with pyproject.toml. Install them with:

pip install .

Usage

Start scraping with the following command:

crawler-to-md --url <URL> [--output-folder ./output] [--cache-folder ./cache] [--overwrite-cache|-w] [--base-url <BASE_URL>] [--exclude-url <KEYWORD_IN_URL>] [--title <TITLE>] [--urls-file <URLS_FILE>] [-p <PROXY_URL>]

Options:

  • --url, -u: The starting URL.
  • --urls-file: Path to a file containing URLs to scrape, one URL per line. If '-', read from stdin.
  • --output-folder, -o: Where to save Markdown files (default: ./output).
  • --cache-folder, -c: Where to store the database (default: ./cache).
  • --overwrite-cache, -w: Overwrite existing cache database before scraping.
  • --base-url, -b: Filter links by base URL (default: URL's base).
  • --title, -t: Final title of the markdown file. Defaults to the URL.
  • --exclude-url, -e: Exclude URLs containing this string (repeatable).
  • --export-individual, -ei: Export each page as an individual Markdown file.
  • --rate-limit, -rl: Maximum number of requests per minute (default: 0, no rate limit).
  • --delay, -d: Delay between requests in seconds (default: 0, no delay).
  • --proxy, -p: Proxy URL for HTTP or SOCKS requests.
  • --include, -i: CSS-like selector (#id, .class, tag) to include before Markdown conversion (repeatable).
  • --exclude, -x: CSS-like selector (#id, .class, tag) to exclude before Markdown conversion (repeatable).

One of the --url or --urls-file options is required.
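
When the scraper is driven from a script, it can help to assemble the argument list in code. The sketch below is one possible way to do that in Python, using only flags listed above. The helper name `build_command` and its parameters are hypothetical, and it builds the command without running it.

```python
import shlex


def build_command(url=None, urls_file=None, output_folder=None,
                  exclude_urls=(), rate_limit=None, delay=None,
                  export_individual=False):
    """Hypothetical helper: build a crawler-to-md argument list."""
    # At least one of --url or --urls-file is required.
    if url is None and urls_file is None:
        raise ValueError("either url or urls_file must be given")
    cmd = ["crawler-to-md"]
    if url is not None:
        cmd += ["--url", url]
    if urls_file is not None:
        cmd += ["--urls-file", urls_file]
    if output_folder is not None:
        cmd += ["--output-folder", output_folder]
    # --exclude-url is repeatable, so emit it once per keyword.
    for keyword in exclude_urls:
        cmd += ["--exclude-url", keyword]
    if rate_limit:
        cmd += ["--rate-limit", str(rate_limit)]
    if delay:
        cmd += ["--delay", str(delay)]
    if export_individual:
        cmd.append("--export-individual")
    return cmd


cmd = build_command(
    url="https://www.example.com",
    exclude_urls=["/login", "/tag/"],
    rate_limit=30,
)
# Show the shell-quoted command; pass cmd to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```

Building a list rather than a single string avoids shell-quoting problems when URLs contain characters such as `&` or `?`.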

Log level

By default, the WARN level is used. You can change it with the LOG_LEVEL environment variable.
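
When launching the scraper from Python, the variable can be set for the child process alone, leaving the parent environment untouched. The following is a small sketch: the helper name `env_with_log_level` is hypothetical, and `DEBUG` is assumed to be an accepted level name, which the text above does not confirm.

```python
import os


def env_with_log_level(level: str) -> dict[str, str]:
    """Hypothetical helper: copy the current environment and set LOG_LEVEL."""
    env = dict(os.environ)  # copy, so os.environ itself is not modified
    env["LOG_LEVEL"] = level
    return env


# Assumption: "DEBUG" is a valid value for LOG_LEVEL.
env = env_with_log_level("DEBUG")
# To run the scraper with this environment:
# subprocess.run(["crawler-to-md", "--url", "https://www.example.com"], env=env, check=True)
```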

๐Ÿณ Docker Support

Run with Docker:

docker run --rm \
  -v $(pwd)/output:/app/output \
  -v cache:/home/app/.cache/crawler-to-md \
  ghcr.io/obeone/crawler-to-md --url <URL>

Build from source:

docker build -t crawler-to-md .

docker run --rm \
  -v $(pwd)/output:/app/output \
  crawler-to-md --url <URL>

Contributing

Contributions are welcome! Feel free to submit pull requests or open issues.