SiteOne Crawler - Web to markdown conversion examples
January 6, 2025
This page is part of SiteOne Crawler and gives an overview of its feature for converting entire web pages to markdown.
Website crawler.siteone.io
- Open markdown version of crawler.siteone.io - this webpage is based on Starlight.
- The Markdown version was generated by the specific command below.
- For better performance, some parts of the page (DOM elements) have been removed by --markdown-exclude-selector.
- Using --ignore-regex, URLs of HTML reports and example exports were excluded from crawling, so the markdown contains only absolute URLs to them.
- I put the --disable-* options here only to avoid downloading these types of files unnecessarily. They do not affect the output markdown content.
./crawler \
--url=https://crawler.siteone.io/ \
--ignore-regex='/^.*\/html\//' \
--ignore-regex='/^.*\/examples\-exports\//' \
--markdown-export-dir=tmp/crawler.siteone.io/ \
--markdown-exclude-selector='header' \
--markdown-exclude-selector='starlight-theme-select' \
--markdown-exclude-selector='.isMobile' \
--markdown-exclude-selector='#starlight__on-this-page--mobile' \
--markdown-exclude-selector='.social-icons' \
--disable-styles --disable-javascript --disable-fonts
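To see which URLs the two --ignore-regex patterns above are meant to catch, the check can be sketched in plain shell. This is only an approximation: grep -E stands in for the crawler's own regex matching, the helper name is_ignored is hypothetical, and the sample URLs are made up for illustration.

```shell
# Hypothetical helper: returns 0 when the URL matches one of the two
# ignore patterns from the command above, 1 otherwise.
# Assumption: grep -E approximates the crawler's regex matching.
is_ignored() {
  printf '%s\n' "$1" | grep -Eq '^.*/(html|examples-exports)/'
}

# Made-up URLs, for illustration only.
for url in \
  'https://crawler.siteone.io/html/report.html' \
  'https://crawler.siteone.io/examples-exports/sample.md' \
  'https://crawler.siteone.io/features/'; do
  if is_ignored "$url"; then
    echo "ignored: $url"
  else
    echo "crawled: $url"
  fi
done
```

Trying patterns this way before a long crawl can help catch a regex that is too broad or too narrow.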
Website react.dev
- Open markdown version of react.dev.
- The Markdown version was generated by the specific command below. For better performance, some parts of the page (DOM elements) have been removed.
- I used --markdown-disable-images so that images are not included in the markdown.
- I used --disable-all-assets here only to avoid downloading assets (JS, CSS, etc.) unnecessarily. It does not affect the output markdown content.
./crawler \
--url=https://react.dev/ \
--markdown-export-dir=tmp/react.dev/ \
--markdown-disable-images \
--disable-all-assets
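One way to confirm that images were left out is to count Markdown image references in the exported files. The sketch below assumes the export directory contains files with the .md extension. The helper name count_image_refs is hypothetical and not part of the crawler.

```shell
# Hypothetical helper: count lines containing Markdown image syntax "!["
# in all .md files under the given directory.
# Assumption: the export directory holds files with the .md extension.
count_image_refs() {
  find "$1" -type f -name '*.md' -exec grep -F '![' {} + | wc -l | tr -d ' '
}
```

Running `count_image_refs tmp/react.dev/` after the export and getting 0 would be consistent with --markdown-disable-images having removed the images.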
Website docs.astro.build
- Open markdown version of docs.astro.build - this webpage is based on Starlight.
- The Markdown version was generated by the specific command below. For better performance, some parts of the page (DOM elements) have been removed.
- I put the --disable-* options here only to avoid downloading these types of files unnecessarily. They do not affect the output markdown content.
./crawler \
--url=https://docs.astro.build/ \
--markdown-export-dir=tmp/docs.astro.build/ \
--markdown-exclude-selector='header' \
--markdown-exclude-selector='starlight-theme-select' \
--markdown-exclude-selector='.isMobile' \
--markdown-exclude-selector='#starlight__on-this-page--mobile' \
--markdown-exclude-selector='.social-icons' \
--disable-styles --disable-javascript --disable-fonts
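Both Starlight-based commands in this document repeat the same five --markdown-exclude-selector values. As a minimal sketch, the repeated options could be generated from a list. The helper name build_exclude_args is hypothetical and not part of the crawler.

```shell
# Hypothetical helper: print one --markdown-exclude-selector option
# per selector passed as an argument.
build_exclude_args() {
  for selector in "$@"; do
    printf '%s\n' "--markdown-exclude-selector=$selector"
  done
}

# The selectors used for the Starlight-based sites above.
build_exclude_args header starlight-theme-select .isMobile \
  '#starlight__on-this-page--mobile' .social-icons
```

Keeping the selector list in one place makes it easier to apply the same exclusions to another Starlight-based site.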