Document-To-Markdown

April 30, 2026

Convert PDFs to clean Markdown, chunk into logical sections, and extract embedded tables to CSV.

Skills

setup — provision the local extractor venv and verify system tools (pdftotext, ocrmypdf, tesseract).
pdf-to-markdown — convert a single PDF to Markdown, picking marker / docling / pymupdf4llm based on layout complexity.
ocr-scanned-pdf — run ocrmypdf to add a text layer to scanned/image PDFs. Auto-invoked when needed.
chunk-markdown — split a long .md into logical chapters/sections with a TOON manifest.
extract-tables — pull tables from a PDF (camelot/tabula) into CSV files with a TOON index.
doc-to-everything — end-to-end orchestrator: PDF → Markdown → chunks → tables in a self-contained workspace.
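
As a rough illustration of what chunk-markdown does, the sketch below splits a Markdown string on top-level headings and derives numbered file names in the style of chunks/01-introduction.md. It is not the plugin's implementation: the function names, the heading rule, and the slug scheme are all assumptions, and the TOON manifest is left out.

```python
import re

HEADING = re.compile(r"^# +(.+?)\s*$")  # assumed rule: split on level-1 headings only
FENCE = "`" * 3                         # marker that opens or closes a code fence


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "section"


def split_markdown(text: str) -> list[tuple[str, str]]:
    """Return (filename, body) pairs, one per non-empty top-level section."""
    sections = []
    title, lines = "frontmatter", []  # text before the first heading
    in_fence = False
    for line in text.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with '#' inside a code fence are not headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            sections.append((title, lines))
            title, lines = match.group(1), [line]
        else:
            lines.append(line)
    sections.append((title, lines))

    chunks = []
    for section_title, section_lines in sections:
        body = "\n".join(section_lines).strip()
        if body:  # skip empty sections, e.g. no text before the first heading
            name = f"{len(chunks):02d}-{slugify(section_title)}.md"
            chunks.append((name, body + "\n"))
    return chunks
```

Splitting only on level-1 headings keeps chunks at chapter granularity. A real splitter would likely need a fallback for documents that have no headings at all.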

Output layout

Running doc-to-everything on book.pdf produces:

book/
  source.pdf
  full.md
  assets/
  chunks/
    index.toon
    00-frontmatter.md
    01-introduction.md
    ...
  tables/
    index.toon
    01-p12-revenue.csv
    ...
  manifest.toon
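
The skeleton of this layout could be created along the following lines. This is a minimal sketch under assumptions: `init_workspace` is a hypothetical helper, not a function from the plugin, and it creates only the directories and the copied source.pdf, leaving full.md, the chunks, the tables, and the .toon files to the later steps.

```python
import shutil
from pathlib import Path


def init_workspace(pdf_path: Path, out_root: Path) -> Path:
    """Create <out_root>/<pdf stem>/ with the subdirectories shown above."""
    workspace = out_root / pdf_path.stem  # book.pdf -> book/
    for sub in ("assets", "chunks", "tables"):
        (workspace / sub).mkdir(parents=True, exist_ok=True)
    # Keep a copy of the input so the workspace is self-contained.
    shutil.copy2(pdf_path, workspace / "source.pdf")
    return workspace
```

For example, `init_workspace(Path("book.pdf"), Path("."))` would create book/ in the current directory.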

Installation

claude plugins install document-to-markdown@danielrosehill

Dependencies

System: pdftotext (poppler-utils), ocrmypdf, tesseract-ocr.
Python (managed via a uv venv under $CLAUDE_USER_DATA/document-to-markdown/venv/): marker-pdf, docling, pymupdf4llm, camelot-py[cv], tabula-py, pandas.

Run the setup skill on first use.
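
To check the system tools by hand before running setup, something like the following would do. It is only a sketch of the kind of check the setup skill is described as performing, not its actual code. `missing_tools` is a hypothetical helper, and the executable names are assumed to be pdftotext, ocrmypdf, and tesseract.

```python
import shutil

# Assumed executable names for the system dependencies listed above.
REQUIRED_TOOLS = ("pdftotext", "ocrmypdf", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools: " + ", ".join(missing))
    else:
        print("All system tools found.")
```

The `which` parameter exists only so the lookup can be replaced in tests. By default it is `shutil.which`.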
