Document-To-Markdown
April 30, 2026
Convert PDFs to clean Markdown, chunk into logical sections, and extract embedded tables to CSV.
Skills
setup: provision the local extractor venv and verify system tools (pdftotext, ocrmypdf, tesseract).
pdf-to-markdown: convert a single PDF to Markdown, picking marker / docling / pymupdf4llm based on layout complexity.
ocr-scanned-pdf: run ocrmypdf to add a text layer to scanned/image PDFs. Auto-invoked when needed.
chunk-markdown: split a long .md into logical chapters/sections with a TOON manifest.
extract-tables: pull tables from a PDF (camelot/tabula) into CSV files with a TOON index.
doc-to-everything: end-to-end orchestrator: PDF → Markdown → chunks → tables in a self-contained workspace.
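To make the chunk-markdown step concrete, here is a minimal sketch of heading-based splitting. It is not the plugin's actual implementation: the function names `split_by_heading` and `slugify`, the choice to split on level-1 headings, and the `NN-slug.md` numbering are assumptions made for illustration, and the TOON manifest is left out.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower())) or "untitled"


def split_by_heading(markdown: str, level: int = 1) -> list[tuple[str, str]]:
    """Split Markdown at headings of the given level (hypothetical helper).

    Returns (filename, text) pairs. Text before the first heading becomes a
    "frontmatter" chunk. Heading-like lines inside fenced code are ignored.
    """
    prefix = "#" * level + " "
    sections: list[tuple[str, list[str]]] = [("frontmatter", [])]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if not in_fence and line.startswith(prefix):
            # A new heading opens a new section; the heading line belongs to it.
            sections.append((line[len(prefix):].strip(), []))
        sections[-1][1].append(line)

    chunks: list[tuple[str, str]] = []
    for title, lines in sections:
        text = "\n".join(lines).strip()
        if text:  # skip empty sections, e.g. a document with no front matter
            chunks.append((f"{len(chunks):02d}-{slugify(title)}.md", text + "\n"))
    return chunks
```

Each returned pair can be written straight to a chunks directory. A real splitter would likely also consider deeper heading levels and chunk length, which this sketch ignores.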
Output layout
Running doc-to-everything on book.pdf produces:
book/
  source.pdf
  full.md
  assets/
  chunks/
    index.toon
    00-frontmatter.md
    01-introduction.md
    ...
  tables/
    index.toon
    01-p12-revenue.csv
    ...
  manifest.toon
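For scripts that consume this workspace, a small path helper can keep the layout in one place. The sketch below is illustrative only: the `Workspace` class and `workspace_for` function are hypothetical names, and it assumes that the workspace directory sits next to the PDF, is named after its stem, and that chunks/ and tables/ each hold their own index.toon.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Workspace:
    """Paths inside one doc-to-everything workspace (hypothetical helper)."""

    root: Path

    @property
    def source(self) -> Path:
        return self.root / "source.pdf"

    @property
    def full_markdown(self) -> Path:
        return self.root / "full.md"

    @property
    def chunk_index(self) -> Path:
        return self.root / "chunks" / "index.toon"

    @property
    def table_index(self) -> Path:
        return self.root / "tables" / "index.toon"

    @property
    def manifest(self) -> Path:
        return self.root / "manifest.toon"

    def expected(self) -> list[Path]:
        """Fixed-name files the layout lists (numbered chunks and CSVs vary)."""
        return [
            self.source,
            self.full_markdown,
            self.chunk_index,
            self.table_index,
            self.manifest,
        ]

    def missing(self) -> list[Path]:
        """Expected files that do not exist on disk."""
        return [path for path in self.expected() if not path.exists()]


def workspace_for(pdf: Path) -> Workspace:
    # Assumption: the workspace is a sibling directory named after the PDF stem.
    return Workspace(pdf.with_suffix(""))
```

Calling `workspace_for(Path("book.pdf")).missing()` after a run gives a quick check that the fixed-name outputs were all produced.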
Installation
claude plugins install document-to-markdown@danielrosehill
Dependencies
System: pdftotext (poppler-utils), ocrmypdf, tesseract-ocr.
Python (managed via uv venv under $CLAUDE_USER_DATA/document-to-markdown/venv/): marker-pdf, docling, pymupdf4llm, camelot-py[cv], tabula-py, pandas. Run the setup skill on first use.
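Before running the setup skill, it can help to confirm the system tools are on PATH. The following is a minimal sketch of such a check using the standard library's `shutil.which`. The function name `missing_tools` is hypothetical, and this is not necessarily how the setup skill performs its own verification.

```python
import shutil
from typing import Callable, Iterable, Optional

# Executable names taken from the System dependency list above.
SYSTEM_TOOLS = ("pdftotext", "ocrmypdf", "tesseract")


def missing_tools(
    tools: Iterable[str] = SYSTEM_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH, in the order given.

    The `which` parameter is injectable so the check can be tested without
    depending on what is installed on the current machine.
    """
    return [tool for tool in tools if which(tool) is None]
```

An empty list from `missing_tools()` means all three executables were found. Anything it returns needs installing through the system package manager first.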