SiteOne Crawler - Web to markdown conversion examples

January 6, 2025

This page is part of SiteOne Crawler and gives an overview of converting entire websites to markdown.

Website crawler.siteone.io

  • Open markdown version of crawler.siteone.io - this webpage is based on Starlight.
  • The Markdown version was generated by the specific command below.
  • For better performance, some parts of the page (DOM elements) were removed with --markdown-exclude-selector.
  • The --ignore-regex options keep the crawler from following URLs of HTML reports and example exports, so only absolute URLs pointing to them remain in the markdown.
  • I added the --disable-* options only to avoid downloading these file types unnecessarily. They do not affect the markdown output.
./crawler \
  --url=https://crawler.siteone.io/ \
  --ignore-regex='/^.*\/html\//' \
  --ignore-regex='/^.*\/examples\-exports\//' \
  --markdown-export-dir=tmp/crawler.siteone.io/ \
  --markdown-exclude-selector='header' \
  --markdown-exclude-selector='starlight-theme-select' \
  --markdown-exclude-selector='.isMobile' \
  --markdown-exclude-selector='#starlight__on-this-page--mobile' \
  --markdown-exclude-selector='.social-icons' \
  --disable-styles --disable-javascript --disable-fonts
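
To see which URLs the two --ignore-regex patterns above are meant to catch, here is a minimal sketch that tests them with grep -E. It makes several assumptions: the crawler matches each pattern against the full URL, the two patterns are merged into one alternation for brevity, the surrounding / delimiters are dropped because grep does not use them, and the sample URLs are hypothetical paths chosen only for illustration.

```shell
#!/bin/sh
# The two --ignore-regex patterns, merged into one extended regex
# (assumption: the crawler applies them to the full URL).
pattern='^.*/(html|examples-exports)/'

# Hypothetical sample URLs, for illustration only.
urls='https://crawler.siteone.io/
https://crawler.siteone.io/html/sample-report.html
https://crawler.siteone.io/examples-exports/sample/index.html
https://crawler.siteone.io/some/other/page/'

# URLs matching the pattern would be skipped; the rest would be crawled.
ignored=$(printf '%s\n' "$urls" | grep -E "$pattern")
crawled=$(printf '%s\n' "$urls" | grep -Ev "$pattern")

# Counts, handy for a quick sanity check.
ignored_count=$(printf '%s\n' "$urls" | grep -Ec "$pattern")
crawled_count=$(printf '%s\n' "$urls" | grep -Evc "$pattern")

printf 'ignored (%s):\n%s\n\ncrawled (%s):\n%s\n' \
  "$ignored_count" "$ignored" "$crawled_count" "$crawled"
```

Running a check like this before a long crawl is a cheap way to confirm that a pattern is neither too broad nor too narrow.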

Website react.dev

  • Open markdown version of react.dev.
  • The Markdown version was generated by the specific command below. For better performance, some parts of the page (DOM elements) have been removed.
  • I used --markdown-disable-images so that images are omitted from the markdown.
  • I used --disable-all-assets only to avoid downloading assets (JS, CSS, etc.) unnecessarily. It does not affect the markdown output.
./crawler \
  --url=https://react.dev/ \
  --markdown-export-dir=tmp/react.dev/ \
  --markdown-disable-images \
  --disable-all-assets

Website docs.astro.build

  • Open markdown version of docs.astro.build - this webpage is based on Starlight.
  • The Markdown version was generated by the specific command below. For better performance, some parts of the page (DOM elements) have been removed.
  • I added the --disable-* options only to avoid downloading these file types unnecessarily. They do not affect the markdown output.
./crawler \
  --url=https://docs.astro.build/ \
  --markdown-export-dir=tmp/docs.astro.build/ \
  --markdown-exclude-selector='header' \
  --markdown-exclude-selector='starlight-theme-select' \
  --markdown-exclude-selector='.isMobile' \
  --markdown-exclude-selector='#starlight__on-this-page--mobile' \
  --markdown-exclude-selector='.social-icons' \
  --disable-styles --disable-javascript --disable-fonts
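
The two Starlight-based examples above pass the same five selectors. As a minimal sketch, assuming a POSIX shell, the selectors can be kept in one list and expanded into repeated --markdown-exclude-selector options. The variable names are hypothetical, and the script only prints the resulting command for inspection instead of running the crawler.

```shell
#!/bin/sh
# Selectors copied from the Starlight examples above, one per line.
selectors='header
starlight-theme-select
.isMobile
#starlight__on-this-page--mobile
.social-icons'

# Build the positional parameters: one option per selector.
# A here-document (not a pipe) keeps the loop in the current shell,
# so the parameters set inside it are still available afterwards.
set --
while IFS= read -r selector; do
  set -- "$@" "--markdown-exclude-selector=$selector"
done <<EOF
$selectors
EOF

exclude_count=$#

# Assemble the command as text only; nothing is executed here.
command_line="./crawler --url=https://docs.astro.build/ --markdown-export-dir=tmp/docs.astro.build/ $* --disable-styles --disable-javascript --disable-fonts"

printf '%s selectors\n%s\n' "$exclude_count" "$command_line"
```

To run the crawl for real, the printed line could be replaced by a direct call that passes "$@" between the other options, which preserves each selector as a separate argument.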