Contributing to doc-preprocess-hub
April 21, 2026
Thanks for considering a contribution. This document covers the things you'll want to know before opening a PR.
Ground rules
- Be kind. This is an open project. Code reviews critique the code, not the person.
- Small, focused PRs. One change per PR. Easier to review, easier to revert.
- Tests required. New code paths need tests. Bug fixes need regression tests.
- Don't break the API contract. orchestrator-api is versioned under /v1/. Breaking changes go in /v2/.
- No secrets. Never commit API keys, tokens, passwords, or customer data. Pre-commit hooks are in place; please run them locally.
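To illustrate the "bug fixes need regression tests" rule, here is a minimal pytest-style sketch. The helper `parse_page_range` and the trailing-comma bug are hypothetical and not part of this repository; the point is only the shape: pin the exact input that used to fail, then assert the fixed behavior.

```python
from __future__ import annotations


# Hypothetical helper, used only for illustration; not part of doc-preprocess-hub.
def parse_page_range(spec: str) -> list[int]:
    """Expand a page spec such as "1-3,5" into a sorted list of unique pages."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            # Imagined bug: an empty segment (e.g. from "1-3,") used to raise
            # ValueError. The fix is to skip empty segments.
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def test_parse_page_range_basic():
    # Covers the normal code path.
    assert parse_page_range("1-3,5") == [1, 2, 3, 5]


def test_parse_page_range_trailing_comma_regression():
    # Regression test: pins the input that triggered the (imagined) bug.
    assert parse_page_range("1-3,") == [1, 2, 3]
```

A regression test like the second one should fail on the commit before the fix and pass after it, which makes the PR easy to review and protects the fix from being undone later.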
What we're looking for
High-leverage contributions
- Real engine integration — the current worker-mineru and worker-docling services are stubs. Wiring them up to the real engines is the biggest unblocker.
- Presidio recognizers — especially non-English ones (Chinese, Japanese, Korean, Arabic, etc.). Each new language recognizer is a distinct PR.
- Benchmark datasets — anonymized parsing / PII / SLA evals. If your organization has a document set you can share, open an issue first to discuss data licensing.
- Operator console features — DLQ replay UX, audit query improvements, trace visualization.
- Chart / formula understanding — integrating DePlot or similar for chart-to-data extraction.
Also welcome
- Docs improvements (typos, clarity, missing diagrams)
- Deployment guides for specific environments (k8s, nomad, docker swarm)
- Language SDKs beyond Python (Java, Go, TypeScript)
- Performance improvements with benchmarks showing the delta
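For performance PRs, "showing the delta" can be as simple as timing the old and new code paths side by side. The sketch below uses the standard library's `timeit`; the `compare` helper and the two string-building functions are illustrative assumptions, not project code. Swap in the real before and after implementations.

```python
import timeit


def compare(baseline, candidate, number=1000, repeat=5):
    """Time two zero-argument callables and report the relative change.

    Uses the best (minimum) of `repeat` runs, each executing the callable
    `number` times, which reduces noise from other processes.
    """
    base = min(timeit.repeat(baseline, number=number, repeat=repeat))
    cand = min(timeit.repeat(candidate, number=number, repeat=repeat))
    return {
        "baseline_s": base,
        "candidate_s": cand,
        # Negative means the candidate is faster than the baseline.
        "delta_pct": (cand - base) / base * 100.0,
    }


# Hypothetical workloads standing in for "before" and "after" code.
def concat_loop():
    out = ""
    for i in range(200):
        out += str(i)
    return out


def concat_join():
    return "".join(str(i) for i in range(200))


if __name__ == "__main__":
    result = compare(concat_loop, concat_join)
    print(f"baseline:  {result['baseline_s']:.6f} s")
    print(f"candidate: {result['candidate_s']:.6f} s")
    print(f"delta:     {result['delta_pct']:+.1f} %")
```

Paste the numbers into the PR description along with the hardware and Python version you measured on, since timings vary between machines.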
Please discuss before opening a PR
- Changes to the API contract (even additions)
- New top-level services
- Dependency upgrades that change major versions of Celery, FastAPI, SQLAlchemy, or AntD Pro
- Licensing changes
Open an issue with the rfc label and let's chat first.
Development setup
# Clone and set up
git clone https://github.com/MackDing/doc-preprocess-hub.git
cd doc-preprocess-hub
cp .env.example .env
# Infrastructure services
docker compose up -d postgres redis rabbitmq minio
# Python services (use one terminal per service while developing)
cd services/orchestrator
pip install -e .
alembic upgrade head
uvicorn app.main:app --reload --port 8000
# Run tests
pytest
See services/*/README.md for service-specific dev notes.
Code style
- Python: ruff + black, line length 100. Type hints where they add clarity.
- TypeScript: eslint + prettier defaults (from AntD Pro).
- Commits: present-tense imperative. "add PII whitelist config" not "added" or "adding".
- No trailing whitespace. Use .editorconfig.
PR checklist
Before marking a PR ready for review:
- Tests pass locally (pytest for Python, pnpm test for console)
- New code has tests
- Changed behavior is documented (README / ARCHITECTURE / relevant docs)
- No new linter warnings
- No secrets committed
- PR description explains the why, not just the what
- Linked to an issue if one exists
Reporting bugs
Open a GitHub issue. Include:
- What you did (ideally a minimal repro)
- What you expected
- What actually happened
- Environment (OS, Python version, engine versions)
- Logs, if relevant, redacted of any sensitive content
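Redacting logs by hand is error-prone, so a small script can serve as a first pass. The sketch below assumes that email addresses, bearer tokens, and secret-looking key/value pairs are what needs hiding; the patterns and the `redact` helper are illustrative, not a project tool, and you should still read the output before posting it.

```python
import re

# Illustrative patterns only; extend them for whatever your logs contain.
_PATTERNS = [
    # Email addresses
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
    # "Bearer <token>" as seen in Authorization headers
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer <TOKEN>"),
    # key=value or key: value pairs whose key name looks secret
    (
        re.compile(r"(?i)(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+"),
        r"\1\2<REDACTED>",
    ),
]


def redact(line: str) -> str:
    """Return `line` with anything matching the patterns above masked."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line
```

For example, `redact("api_key=sk-12345 retries=3")` returns `api_key=<REDACTED> retries=3`. Pattern-based redaction will miss things it was not written for (customer names, document contents), so treat it as a helper and not a guarantee.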
Reporting security issues
Do not open a public issue for security vulnerabilities. Email the maintainer directly or use GitHub's private vulnerability reporting (Security tab → Report a vulnerability).
Give us a reasonable window to fix the issue before public disclosure. We'll credit you in the release notes if you want.
License
By contributing, you agree that your contributions will be licensed under the Apache License 2.0 (see LICENSE).