Rainman Translate Book
May 23, 2026 · View on GitHub
English | 中文
Claude Code skill that translates entire books (PDF/DOCX/EPUB) into any language using parallel subagents.
Inspired by claude_translater. The original project uses shell scripts as its entry point, coordinating the Claude CLI with multiple step scripts to perform chunked translation. This project restructures the workflow as a Claude Code Skill, using subagents to translate chunks in parallel, with manifest-driven integrity checks, resumable runs, and multi-format output unified into a single pipeline. As the project structure and implementation differ significantly from the original, this is an independent project rather than a fork.
How It Works
Input (PDF/DOCX/EPUB)
│
▼
Calibre ebook-convert → HTMLZ → HTML → Markdown
│
▼
Split into chunks (chunk0001.md, chunk0002.md, ...)
│ manifest.json tracks chunk hashes
▼
Parallel subagents (8 concurrent by default)
│ each subagent: read 1 chunk → translate → write output_chunk*.md
│ batched to respect API rate limits
▼
Validate (manifest hash check, 1:1 source↔output match)
│
▼
Merge → Pandoc → HTML (with TOC) → Calibre → DOCX / EPUB / PDF
Each chunk gets its own independent subagent with a fresh context window. This prevents context accumulation and output truncation that happen when translating a full book in a single session.
Features
- Parallel subagents — 8 concurrent translators per batch, each with isolated context
- Resumable + selective re-translation — chunk-level resume, with
run_state.jsontracking glossary-sensitive re-translation - Neighbor context — each chunk can see short read-only excerpts from adjacent chunks for pronoun and entity resolution
- Manifest validation — SHA-256 hash tracking prevents stale or corrupt outputs from being merged
- Multi-format output — HTML (with floating TOC), DOCX, EPUB, PDF
- Optional output controls — explicit EPUB cover, custom temp root, and user-facing export aliases
- Multi-language — zh, en, ja, ko, fr, de, es (extensible)
- PDF/DOCX/EPUB input — Calibre handles the conversion heavy lifting
Prerequisites
- Claude Code CLI — installed and authenticated
- Calibre —
ebook-convertcommand must be available (download) - Pandoc — for HTML↔Markdown conversion (download)
- Python 3 with:
pypandoc— required (pip install pypandoc)beautifulsoup4— optional, for better TOC generation (pip install beautifulsoup4)
Quick Start
1. Install the skill
Option A: npx (recommended)
npx skills add deusyu/translate-book -a claude-code -g
Option B: ClawHub
clawhub install translate-book
Option C: Git clone
git clone https://github.com/deusyu/translate-book.git ~/.claude/skills/translate-book
2. Translate a book
In Claude Code, say:
translate /path/to/book.pdf to Chinese
Or use the slash command:
/translate-book translate /path/to/book.pdf to Japanese
The skill handles the full pipeline automatically — convert, chunk, translate in parallel, validate, merge, and build all output formats.
3. Find your outputs
All files are in {book_name}_temp/:
| File | Description |
|---|---|
output.md | Merged translated Markdown |
book.html | Web version with floating TOC |
book.docx | Word document |
book.epub | E-book |
book.pdf | Print-ready PDF |
Repository Test Assets
- Checked-in baseline inputs live under
tests/baselines/<book-id>/. - Generated full-pipeline outputs live under
tests/.artifacts/and should not be committed. - Because
scripts/convert.pywrites{book_name}_temp/under the current working directory, run repository baseline tests from insidetests/.artifacts/to keep generated files out of the repo root.
Full-Pipeline Baseline Example
mkdir -p tests/.artifacts
cd tests/.artifacts
python3 ../../scripts/convert.py ../baselines/standard-alice/standard-alice.epub --olang zh
# then run translation via the skill
python3 ../../scripts/merge_and_build.py --temp-dir standard-alice_temp --title "test"
Feedback and Contributions
Please open a detailed GitHub issue instead of starting with a pull request. This project is maintained as an AI-assisted skill pipeline, and changes need to be evaluated against the current orchestration rules, chunk/manifest contracts, baseline assets, and release flow in one maintainer-owned context.
Pull requests are not the preferred contribution path and may be closed in favor of an issue. If you already have a patch, include the idea, key diff, failing case, or verification notes in the issue; the maintainer may rework or split the implementation before merging.
A useful issue should include:
- Current behavior and expected behavior
- Input format and environment, such as PDF/DOCX/EPUB, OS, Python, Calibre, and Pandoc versions
- Minimal reproduction steps or a small public-domain sample when possible
- Logs, screenshots, or generated file names that show the failure
Pipeline Details
Step 1: Convert
python3 scripts/convert.py /path/to/book.pdf --olang zh
Calibre converts the input to HTMLZ, which is extracted and converted to Markdown, then split into chunks (~6000 chars each). A manifest.json records the SHA-256 hash of each source chunk for later validation.
By default the working directory is {book_name}_temp/ under the current directory. Use --temp-root /path/to/work to keep the same leaf directory name under a different parent.
Step 1.5: Glossary (term consistency across chunks)
Each chunk is translated by a fresh-context sub-agent, which means the same proper noun can drift across multiple translations on a 100-chunk book. To fix this, the skill builds a glossary before translation:
- Sample 5 chunks (first, last, 3 evenly-spaced middle).
- Extract proper nouns and recurring domain terms; pick canonical translations.
- Write
<temp_dir>/glossary.json(hand-editable schema below). - Run
python3 scripts/glossary.py count-frequencies <temp_dir>to populate per-term frequencies (ASCII terms use word-boundary regex socatdoesn't matchcategory; CJK terms use substring; single-CJK-char terms are rejected; aliases count toward the term they belong to). - For each chunk, the orchestrator calls
python3 scripts/glossary.py print-terms-for-chunk <temp_dir> chunkNNNN.mdand injects the resulting 3-column (原文 | 别名 | 译文) markdown table into that chunk's prompt as a hard constraint. Term selection = (terms whose source OR any alias appears in this chunk) ∪ (top-N most-frequent book-wide).
{
"version": 2,
"terms": [
{"id": "Manhattan", "source": "Manhattan", "target": "曼哈顿",
"category": "place", "aliases": [], "gender": "unknown",
"confidence": "medium", "frequency": 12,
"evidence_refs": [], "notes": ""}
],
"high_frequency_top_n": 20,
"applied_meta_hashes": {}
}
Existing v1 glossary.json files are auto-upgraded to v2 on first load. v2 forbids the same surface form (source or alias) appearing in two different terms; if a v1 file has polysemous duplicate sources, the upgrade aborts with a disambiguation message — fix the file by hand and reload.
Edit glossary.json between runs to fix translations; existing glossary.json is never overwritten — delete it to rebuild from scratch. scripts/run_state.py records which glossary terms each chunk used, so later glossary changes only re-translate affected chunks after the state has been recorded.
Step 2: Translate (parallel subagents)
The skill launches subagents in batches (default: 8 concurrent). Each subagent:
- Reads one source chunk (e.g.
chunk0042.md) - Translates to the target language
- Uses a per-chunk term table and short read-only previous/next excerpts
- Writes the result to
output_chunk0042.md - Writes
output_chunk0042.meta.jsonobservations for glossary feedback
Before launching subagents, scripts/run_state.py plan <temp_dir> decides which chunks need translation, which existing outputs only need state recording, and which are unchanged. Use --retranslate-untracked only when adopting an old temp dir whose existing outputs should be forced through the current glossary. If a run is interrupted, re-running skips chunks that already have valid output files and current state. Failed chunks are retried once automatically.
Step 3: Merge & Build
python3 scripts/merge_and_build.py --temp-dir book_temp --title "《translated title》"
Optional output flags:
python3 scripts/merge_and_build.py --temp-dir book_temp --title "《translated title》" --cover cover.jpg --export-name "translated-title"
--cover passes an explicit image to the EPUB Calibre step. --export-name creates alias copies such as translated-title.epub while preserving the canonical book.* pipeline artifacts.
Before merging, the script validates:
- Every source chunk has a corresponding output file (1:1 match)
- Source chunk hashes match the manifest (no stale outputs)
- No output files are empty
Then: merge → Pandoc HTML → inject TOC → Calibre generates DOCX, EPUB, PDF.
Note: {book_name}_temp/ is a working directory for a single translation run. If you change the title, author, output language, template, or image assets, either use a fresh temp directory or delete the existing final artifacts (output.md, book*.html, book.docx, book.epub, book.pdf) before re-running.
Project Structure
| File | Purpose |
|---|---|
SKILL.md | Claude Code skill definition — orchestrates the full pipeline |
scripts/convert.py | PDF/DOCX/EPUB → Markdown chunks via Calibre HTMLZ |
scripts/manifest.py | Chunk manifest: SHA-256 tracking and merge validation |
scripts/glossary.py | Glossary management: per-chunk term tables for consistent terminology |
scripts/chunk_context.py | Read-only previous/next chunk excerpts for sub-agent prompts |
scripts/meta.py | Per-chunk sub-agent observation file schema (output_chunkNNNN.meta.json) |
scripts/merge_meta.py | Batch-boundary merge: sub-agent observations → canonical glossary |
scripts/run_state.py | Selective re-translation planner and run_state.json recorder |
scripts/merge_and_build.py | Merge chunks → HTML → DOCX/EPUB/PDF |
scripts/calibre_html_publish.py | Calibre wrapper for format conversion |
scripts/template.html | Web HTML template with floating TOC |
scripts/template_ebook.html | Ebook HTML template |
tests/baselines/ | Checked-in baseline book inputs for full-pipeline testing |
tests/.artifacts/ | Ignored full-pipeline test outputs |
Troubleshooting
| Problem | Solution |
|---|---|
Calibre ebook-convert not found | Install Calibre and ensure ebook-convert is in PATH |
Manifest validation failed | Source chunks changed since splitting — re-run convert.py |
Missing source chunk | Source file deleted — re-run convert.py to regenerate |
| Incomplete translation | Re-run the skill — it resumes from where it stopped |
| Changed title/template/assets but output didn't update | Delete existing output.md, book*.html, book.docx, book.epub, book.pdf from the temp dir, then re-run merge_and_build.py |
| Want page-number footers stripped from PDF output | By default, monotonic page-number sequences (e.g. 1, 2, 3, ...) are auto-detected and dropped while outliers like years (1984), chapter numbers, and citation indices stay preserved. If detection misses your case, pass --strip-page-numbers to convert.py to aggressively delete every standalone-digit line. The flag aborts if a cached input.md or chunk*.md already exists — delete them first so the flag actually takes effect. |
output.md exists but manifest invalid | Stale output — the script auto-deletes and re-merges |
Glossary upgrade rejected: duplicate source | v2 disallows two terms sharing a source/alias surface form. Edit glossary.json to disambiguate (e.g., rename one source from Apple to Apple (Inc.)) and reload. |
| PDF generation fails | Ensure Calibre is installed with PDF output support |
Roadmap
Tracking issue #7 — name/term inconsistency and pronoun/gender errors across chunks. The pipeline now covers high-frequency entities, alias/spelling drift, adjacent-chunk pronoun context, and selective re-translation after glossary changes. Full-book organic validation remains a future quality pass. The plan is four independently shippable phases.
Design principles
- Scripts do bookkeeping; LLMs do semantic merge. State, schemas, dedup, hashing, IO are deterministic Python. Naming, gender attribution, alias judgment, conflict resolution are LLM calls.
- Single writer for shared state. Only the main agent writes
glossary.jsonandrun_state.json; sub-agents write per-chunk meta files. No locking needed. - Conservative merge. New entities require evidence; alias merges need LLM judgment, not just string similarity; gender starts at
unknownand only moves up under explicit evidence; canonical values aren't silently overwritten on conflict. - Three-layer state, three separate files.
glossary.json(canonical, sub-agents read),output_chunkNNNN.meta.json(raw per-chunk observations),run_state.json(orchestration).
Phase 1 — Sub-agent feedback + glossary merge (shipped)
Closes the read+write loop. Glossary v2 adds id, aliases, gender, confidence, evidence_refs, notes (v1 files auto-upgrade on first load; the term table is now 3-col and aliases participate in selection). Sub-agents emit output_chunkNNNN.meta.json alongside each translated chunk. scripts/merge_meta.py (prepare-merge / apply-merge / status) merges per-batch with conservative rules: surface-form uniqueness enforced, malformed metas quarantined (warn + skip + count), confidence escalation via both evidence_chunks and used_term_sources, FIFO-cap at 5. See SKILL.md Step 4 / Step 4.5 / Step 5.
Phase 2 — Neighbor context for pronouns (shipped)
scripts/chunk_context.py injects prev_excerpt (last ~300 chars of previous chunk) and next_excerpt (first ~300 chars of next chunk) into each sub-agent prompt as read-only context. No new state files are introduced.
Phase 3 — Selective re-translation (shipped)
Phase 1's batch feedback only improves forward. Selective rerun closes the backward loop with scripts/run_state.py and run_state.json: per-chunk tracking of glossary_version_used, entity_ids_used, output_hash, source hash, and selected entity hashes; five planning rules cover missing/empty output, manifest source drift, untracked outputs, source drift since record, and glossary term selection/hash changes.
Phase 4 — Bootstrap warm-up (experimental, gated on Phase 1 data)
Phase 1 grows the glossary batch-by-batch, so the first batch sees the smallest glossary and has the highest drift risk. Possible approaches: sequential bootstrap, variable concurrency, or skip entirely. Decision belongs to whoever has run the system on real books.
Phase 4 remains gated on real-book evidence. The shipped schemas can still evolve under compatibility-aware migrations if production runs expose gaps.
Parallel track — Pipeline / UX backlog (partly shipped, separate from issue #7)
Recent PR discussions also surfaced several useful workflow improvements, but these are broader than one-off patches and touch repo contracts (artifact names, temp-dir behavior, cleanup semantics, or EPUB compatibility scope). Current status:
- Explicit EPUB cover support (shipped).
merge_and_build.py --cover <image>passes the image through the HTML -> EPUB Calibre step.--cover-from <epub>/ EPUB cover auto-extraction remains out of scope until the project is ready to own EPUB parsing compatibility across different package layouts. (context: closed #3) - Configurable temp workspace location (shipped).
convert.py --temp-root <dir>keeps the default cwd-local{book_name}_temp/behavior unless explicitly overridden. (context: closed #4) - Safer Calibre/Pandoc artifact cleanup (partly shipped). Page-number and Calibre-marker cleanup is regression-tested, preserving years, chapter numbers, and non-monotonic standalone numbers. Continue improving cleanup incrementally under tests. (context: closed #5)
- Optional user-facing export names (shipped).
merge_and_build.py --export-name <stem>creates alias copies while preserving canonical pipeline artifacts asbook.html,book_doc.html,book.docx,book.epub, andbook.pdf. (context: closed #6)
Star History
If you find this project helpful, please consider giving it a Star ⭐!
Sponsor
If this project saves you time, consider sponsoring to keep it maintained and improved.