Scribe.js Guide
June 3, 2026
Scribe.js performs OCR and extracts text from images and PDFs, and writes the results back out in formats like searchable PDF, plain text, hOCR, and Word/Excel.
This guide covers the JavaScript API, from a first script to full document control. For a terse
reference of every function and method, see the API reference. If you want the
scribe command-line tool instead, see the CLI reference. If you are choosing
between Scribe.js and Tesseract.js, see Scribe.js vs. Tesseract.js.
Contents
- Install and import
- Quick start
- Core concepts
- Importing files
- Recognition (OCR)
- Exporting and output
- Configuration
- Browser usage notes
Install and import
npm i scribe.js-ocr
Scribe.js is written in JavaScript using ESM, so it can be imported directly in Node.js or in the browser without a build step.
// Node.js
import scribe from 'scribe.js-ocr';
// Browser (bundler such as Vite, Webpack, or Next.js)
import scribe from 'scribe.js-ocr';
// Browser without a bundler (import map or relative path)
import scribe from '/node_modules/scribe.js-ocr/scribe.js';
In the browser, all files must be served from the same origin as the code importing Scribe.js. Importing from a CDN does not work, and there is no UMD build. See Browser usage notes.
Quick start
There are two ways to use Scribe.js: a single-call helper for trying it out, and a document API for application code.
One-shot: extractText
extractText handles import, recognition, and export in one call. It returns plain text by
default and adapts to the kind of PDF it is given: it extracts the existing text from
text-native PDFs and runs OCR on image-based ones.
import scribe from 'scribe.js-ocr';
const text = await scribe.extractText(['https://tesseract.projectnaptha.com/img/eng_bw.png']);
console.log(text);
await scribe.terminate();
This is the easiest way to try Scribe.js, but it hides every piece of control a real
application typically needs: recognition options, progress events, error handling, per-word OCR
data, and the ability to produce more than one output format. Use it for scripts and
exploration. For production code, prefer openDocument below.
Full control: openDocument and ScribeDoc
This is the recommended path for any application beyond a one-off script. Open a document with
scribe.openDocument and operate on the returned ScribeDoc, which
exposes recognition options, progress and warning handlers, the per-word OCR data, and every
output format.
import scribe from 'scribe.js-ocr';
const doc = await scribe.openDocument(['receipt.png']);
await doc.recognize({ langs: ['eng'] });
// Read recognized words.
for (const word of doc.ocr.active[0].lines.flatMap((line) => line.words)) {
  console.log(word.text, word.conf);
}
// Write a searchable PDF.
await doc.download('pdf', 'receipt.pdf');
await doc.terminate();
await scribe.terminate();
extractText is a thin convenience wrapper around this exact flow that throws away the
intermediate state and returns just the text. Everything else in this guide builds on
openDocument and ScribeDoc.
Core concepts
Two API levels
| Level | Entry point | Use when |
|---|---|---|
| One-shot | scribe.extractText(files) | Quick scripts or trying things out. Easy, but no progress events, no error surface, no per-word data. |
| Document | scribe.openDocument(files) -> ScribeDoc | Application code. Use whenever you need OCR options, progress or error handling, word-level data, multiple output formats, or multiple documents. |
The ScribeDoc object
A ScribeDoc represents a single document being processed — its imported pages, OCR text,
layout, fonts, and images. scribe.openDocument(files) creates one, imports the files, and
returns it. Because each document holds its own state, you can have several open at once:
const invoice = await scribe.openDocument(['invoice.pdf']);
const contract = await scribe.openDocument(['contract.pdf']);
// invoice and contract are fully independent.
Resource lifecycle
Scribe.js has two tiers of resources.
- Shared resources: the OCR worker pool and the built-in fonts. These are process-wide and
loaded lazily on first use. scribe.init() can pre-load them to remove first-use latency, and scribe.terminate() releases them.
- Per-document resources: each document's PDF renderer, image cache, and optimized fonts.
doc.terminate() releases these for one document without touching the shared pool.
A typical full lifecycle:
await scribe.init({ ocr: true, font: true }); // optional pre-load
const doc = await scribe.openDocument(files);
await doc.recognize();
await doc.download('pdf', 'out.pdf');
await doc.terminate(); // release this document
await scribe.terminate(); // release shared resources (e.g. before process exit)
You do not have to call init — resources load on demand. You should call doc.terminate() and
scribe.terminate() when finished, especially in Node.js, so the process can exit.
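Because a forgotten terminate call can keep a Node.js process alive, one option is to make cleanup unconditional with try/finally. The sketch below uses a hypothetical helper, withDocument, which is not part of Scribe.js. It assumes only that the opened object has the async terminate() method described above.

```javascript
// Hypothetical helper, not part of Scribe.js: run `work` against a document
// and always release the document's resources, even if `work` throws.
// `openDoc` is any function that resolves to an object with an async
// terminate() method, e.g. () => scribe.openDocument(['scan.png']).
async function withDocument(openDoc, work) {
  const doc = await openDoc();
  try {
    return await work(doc);
  } finally {
    // Runs on success and on failure alike.
    await doc.terminate();
  }
}
```

With Scribe.js loaded, this might be called as withDocument(() => scribe.openDocument(files), (doc) => doc.recognize()), followed by scribe.terminate() once no more documents will be opened.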
Reusing a document for several PDFs
A ScribeDoc is created empty and the file(s) are attached by importFiles, so the same
document can hold one PDF, be reset with clear(), and be re-used for the next PDF. clear()
wipes the document's OCR text, layout, and image caches but keeps its PDF worker pool alive, so
the second importFiles reuses the workers that were already spun up for the first. This makes
it cheap to process many PDFs in series without paying worker-startup cost each time.
If you know a PDF is on the way but don't have it yet (for example, a long-running server
processing uploads), call doc.preloadPdfWorkers() on an empty document to spawn the worker
pool ahead of time. The workers idle until the first importFiles lands.
const doc = new scribe.ScribeDoc();
await doc.preloadPdfWorkers(); // optional: spawn workers up front
for (const path of pdfPaths) {
  await doc.importFiles([path]);
  await doc.recognize();
  await doc.download('pdf', path.replace(/\.pdf$/, '.searchable.pdf'));
  doc.clear(); // reset state; workers are kept alive
}
await doc.terminate(); // finally release the workers
await scribe.terminate();
For workflows with several documents open at the same time, create separate ScribeDoc
instances instead — each gets its own worker pool.
The OCR data model
After import or recognition, a document's text lives under doc.ocr. This is a map of named OCR
versions; doc.ocr.active is the one used for export. Other versions may exist depending on what
ran, for example 'Tesseract Legacy', 'Tesseract LSTM', 'Tesseract Combined', 'User Upload' (imported OCR), and 'pdf' (text pulled from an input PDF).
Each version is an array of pages. The hierarchy is page -> line -> word:
const page = doc.ocr.active[0]; // OcrPage
const line = page.lines[0]; // OcrLine
const word = line.words[0]; // OcrWord
word.text; // 'Hello'
word.conf; // confidence, 0-100
word.bbox; // { left, top, right, bottom } in page pixels
word.style; // { font, size, bold, italic, underline, smallCaps, sup, dropcap, color, opacity }
word.chars; // character-level data when available, otherwise null
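As an illustration of walking this hierarchy, the sketch below collects the words on a page whose confidence falls under a cutoff. The helper name lowConfidenceWords and the cutoff of 75 are choices made for this example, and the sample object stands in for a real doc.ocr.active[0] page, using only the lines, words, text, and conf fields shown above.

```javascript
// Collect every word on a page whose confidence is below `minConf`.
// Works on any object shaped like the page -> line -> word hierarchy.
function lowConfidenceWords(page, minConf = 75) {
  return page.lines
    .flatMap((line) => line.words)
    .filter((word) => word.conf < minConf);
}

// Stand-in data with the same shape as doc.ocr.active[0].
const samplePage = {
  lines: [
    { words: [{ text: 'Total', conf: 96 }, { text: '$l2.50', conf: 41 }] },
    { words: [{ text: 'Thank', conf: 88 }, { text: 'yov', conf: 60 }] },
  ],
};

console.log(lowConfidenceWords(samplePage).map((w) => w.text)); // [ '$l2.50', 'yov' ]
```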
Alongside the OCR text, a document carries:
- doc.pageMetrics: per-page dimensions and rotation angle.
- doc.inputData: input metadata (pdfMode, imageMode, pdfType, pageCount, ...).
- doc.layoutRegions / doc.layoutDataTables: layout regions and detected tables, used for reflow and tabular exports.
- doc.fonts / doc.images: document-scoped font and image caches.
Importing files
openDocument (and doc.importFiles) accept several input shapes.
Supported input types
| Category | Extensions |
|---|---|
| Images | .png, .jpg, .jpeg |
| PDF | .pdf |
| OCR data | .hocr, .xml (Abbyy/ALTO), .html, .stext, .json (AWS Textract / Google Vision), .txt, .docx, .gz (gzipped XML) |
| Sessions | .scribe, .scribe.json |
Notes:
- A PDF and image files cannot be imported together, and only one PDF is imported at a time.
- Importing an image together with an OCR file (e.g. a .png plus its .hocr) loads the text over the image without re-running OCR.
Passing files
For File objects (browser) or file paths (Node.js), pass a single array — Scribe.js sorts them
by extension:
// Node.js: file paths
const doc = await scribe.openDocument(['scan.png', 'scan.hocr']);
// Browser: a FileList or File[] from an <input type="file">
const doc = await scribe.openDocument(fileInput.files);
// Browser: URLs (fetched same-origin)
const doc = await scribe.openDocument(['/uploads/scan.png']);
When passing ArrayBuffer inputs, extension sorting is not possible, so provide a
SortedInputFiles object that names each type:
const doc = await scribe.openDocument({
  pdfFiles: [pdfArrayBuffer],
  imageFiles: [imageArrayBuffer],
  ocrFiles: [ocrArrayBuffer],
  scribeFiles: [scribeArrayBuffer],
});
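If the original file names are still known when the data arrives as ArrayBuffers, the grouping can be done by hand before calling openDocument. The sketch below is illustrative only and is not Scribe.js's own sorter. The helper name sortInputs, the { name, data } entry shape, and the choice to throw on unknown extensions are assumptions for this example; the extension lists come from the table of supported input types.

```javascript
// Illustrative only: group named buffers into the SortedInputFiles keys.
const IMAGE_EXT = ['.png', '.jpg', '.jpeg'];
const OCR_EXT = ['.hocr', '.xml', '.html', '.stext', '.json', '.txt', '.docx', '.gz'];

function sortInputs(entries) {
  const sorted = { pdfFiles: [], imageFiles: [], ocrFiles: [], scribeFiles: [] };
  for (const { name, data } of entries) {
    const lower = name.toLowerCase();
    const hasExt = (ext) => lower.endsWith(ext);
    // Check session extensions first so '.scribe.json' is not treated as '.json'.
    if (hasExt('.scribe') || hasExt('.scribe.json')) sorted.scribeFiles.push(data);
    else if (hasExt('.pdf')) sorted.pdfFiles.push(data);
    else if (IMAGE_EXT.some(hasExt)) sorted.imageFiles.push(data);
    else if (OCR_EXT.some(hasExt)) sorted.ocrFiles.push(data);
    else throw new Error(`Unsupported file type: ${name}`);
  }
  return sorted;
}
```

The returned object has the same keys as the SortedInputFiles example above, so it could be passed to scribe.openDocument directly.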
Supplemental OCR and ground truth
doc.importFilesSupp(files, ocrName) imports an additional OCR version under a name of your
choice, without replacing doc.ocr.active. This is used for alternate engine output or for
ground-truth data to evaluate against (see doc.compareOCR / doc.evalOCRPage in the API
reference).
Sessions (.scribe)
The .scribe format saves a full session — OCR text, layout, fonts, and annotations — so work
can be resumed exactly. Export with doc.exportData('scribe') and reopen by importing the result
as a scribe file:
const session = await doc.exportData('scribe'); // ArrayBuffer (gzip) by default
const restored = await scribe.openDocument({ scribeFiles: [session] });
Recognition (OCR)
Run the built-in Tesseract engine on a document's pages with doc.recognize(options). Files must
be imported first (they already are after openDocument). Results populate doc.ocr.
await doc.recognize({
  langs: ['eng'], // languages present in the document
  modeAdv: 'combined', // 'lstm' | 'legacy' | 'combined'
});
Options
| Option | Type | Default | Description |
|---|---|---|---|
| langs | string[] | ['eng'] | Language codes. |
| mode | 'speed' or 'quality' | 'quality' | Convenience setting: speed -> LSTM only, quality -> Legacy. |
| modeAdv | 'lstm', 'legacy', or 'combined' | 'combined' | Engine selection. Overrides mode. |
| combineMode | 'conf', 'data', or 'none' | 'data' | How to merge with existing OCR data, if any. |
| vanillaMode | boolean | false | Use the unmodified upstream Tesseract.js model. |
| config | Object<string, string> | {} | Raw Tesseract config parameters. |
| model | RecognitionModel | (none) | A custom recognition model (see cloud adapters). |
| modelOptions | Object | {} | Options forwarded to the custom model. |
| signal | AbortSignal | (none) | Cancel a custom-model run. Completed pages are preserved. |
modeAdv trade-offs:
- lstm: fastest, neural model only.
- legacy: the older model; produces strong character metrics used for font optimization.
- combined: runs both and merges them for the best accuracy. Slowest.
Languages
Pass any Tesseract language codes in langs (e.g. ['eng', 'fra', 'deu']). Some languages pull
in extra fonts automatically: chi_sim loads a Chinese font, and rus / ukr / ell load
Cyrillic/Greek glyph coverage.
By default the .traineddata language files are fetched from a CDN. To use a local or offline
mirror, set scribe.opt.langPath to a directory containing <lang>.traineddata.gz:
scribe.opt.langPath = '/assets/tessdata'; // loads /assets/tessdata/eng.traineddata.gz
Progress
Set scribe.opt.progressHandler to receive progress messages during recognition and export:
scribe.opt.progressHandler = (msg) => {
  if (msg.type === 'recognize') console.log('recognizing...');
};
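A handler can also accumulate state instead of logging. The sketch below tallies messages by msg.type, which is the only message field this guide relies on; the factory name createProgressCounter is hypothetical, and the simulated messages stand in for a real recognition run.

```javascript
// Build a handler that tallies progress messages by msg.type.
function createProgressCounter() {
  const counts = new Map();
  const handler = (msg) => {
    counts.set(msg.type, (counts.get(msg.type) ?? 0) + 1);
  };
  return { handler, counts };
}

const progress = createProgressCounter();
// With Scribe.js loaded: scribe.opt.progressHandler = progress.handler;
// Simulated messages are used here in place of a real run.
[{ type: 'recognize' }, { type: 'recognize' }].forEach(progress.handler);
console.log(progress.counts.get('recognize')); // 2
```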
Cloud OCR adapters
Instead of the built-in engine, you can plug in a cloud OCR service by passing a model to
recognize. Scribe.js already knows how to parse each service's output into its OCR data model;
the adapter packages are thin clients that call the service. They are published separately so the
relevant cloud SDK is only installed by projects that use it.
| Service | Package | Model class |
|---|---|---|
| AWS Textract | @scribe.js/aws-textract | RecognitionModelTextract (Node), RecognitionModelTextractBrowser |
| Google Cloud Vision | @scribe.js/gcs-vision | RecognitionModelGoogleVision |
| Google Document AI | @scribe.js/gcs-doc-ai | RecognitionModelGoogleDocAI |
| Azure Document Intelligence | @scribe.js/azure-doc-intel | RecognitionModelAzureDocIntel |
Node.js, with credentials on the server:
import scribe from 'scribe.js-ocr';
import { RecognitionModelTextract } from '@scribe.js/aws-textract';
const doc = await scribe.openDocument(['document.pdf']);
await doc.recognize({
  model: RecognitionModelTextract,
  modelOptions: { analyzeLayout: true },
});
console.log(await doc.exportData('text'));
await doc.terminate();
await scribe.terminate();
For browser apps, the recommended pattern is a proxy server that holds the credentials and runs
the Node model, with the browser posting documents to it. A ready-to-copy client and server are
in examples/server-textract-proxy/. Calling a cloud
service directly from the browser is possible (@scribe.js/aws-textract/browser) but exposes
credentials, so it is only appropriate for local debugging or short-lived tokens. Each adapter's
own README documents its modelOptions.
Exporting and output
doc.exportData(format, options) returns the document in the requested format. doc.download(format, fileName, options) does the same and saves the result, either as a browser download or as a
Node.js file write.
const text = await doc.exportData('text'); // string
const pdfBytes = await doc.exportData('pdf'); // ArrayBuffer
await doc.download('pdf', 'output.pdf'); // writes output.pdf
Formats
| Format | Output | Notes |
|---|---|---|
| 'txt' / 'text' | string | Plain text. |
| 'pdf' | ArrayBuffer | PDF with a text layer (see display modes below). |
| 'hocr' | string | hOCR XML. |
| 'alto' | string | ALTO XML (saved with a .xml extension by download). |
| 'html' | string | HTML, optionally with page images (ScribeDoc.defaults.includeImages). |
| 'md' | string | Markdown, with tables. |
| 'docx' | ArrayBuffer | Word document. |
| 'xlsx' | ArrayBuffer | Excel spreadsheet, from detected tables. |
| 'scribe' | ArrayBuffer or string | Session file (gzip by default; see ScribeDoc.defaults.compressScribe). |
Page subsetting
await doc.exportData('text', { minPage: 0, maxPage: 4 }); // first 5 pages (inclusive)
await doc.exportData('text', { pageArr: [0, 2, 5] }); // specific pages; overrides min/max
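To make the two rules concrete (the min/max range is inclusive, and pageArr takes precedence), the sketch below lists which zero-based page indices a given options object would select. It is a stand-alone illustration, not library code: selectedPages is a hypothetical name, and dropping out-of-range indices is an assumption rather than documented Scribe.js behavior.

```javascript
// Illustrative only: list the page indices an options object selects.
function selectedPages(pageCount, { minPage = 0, maxPage = pageCount - 1, pageArr } = {}) {
  // pageArr, when given, overrides minPage/maxPage.
  if (pageArr) return pageArr.filter((n) => n >= 0 && n < pageCount);
  const first = Math.max(0, minPage);
  const last = Math.min(maxPage, pageCount - 1);
  const pages = [];
  for (let n = first; n <= last; n += 1) {
    pages.push(n); // both ends of the range are included
  }
  return pages;
}

console.log(selectedPages(10, { minPage: 0, maxPage: 4 })); // [ 0, 1, 2, 3, 4 ]
console.log(selectedPages(10, { minPage: 0, maxPage: 4, pageArr: [0, 2, 5] })); // [ 0, 2, 5 ]
```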
PDFs and the text layer
How the text layer is drawn is controlled by ScribeDoc.defaults.displayMode, or by passing
displayMode per call to exportData / download:
- 'invis': invisible text over the page image. The standard "searchable PDF."
- 'proof': visible text, color-coded by confidence. Useful for reviewing OCR quality.
- 'ebook': text only, no background image.
Two common PDF workflows:
// 1. Image (or image PDF) -> searchable PDF ('invis' is the default).
const doc = await scribe.openDocument(['scan.png']);
await doc.recognize();
await doc.download('pdf', 'searchable.pdf');
// 2. Add a text layer to an existing PDF (keeps the original pages).
const doc2 = await scribe.openDocument(['image-only.pdf']);
await doc2.recognize();
await doc2.download('pdf', 'image-only.searchable.pdf', { displayMode: 'proof' });
When the input is a PDF that already contains text, ScribeDoc.defaults.usePDFText (or
usePDFText passed to importFiles) controls whether that text is used as the primary or
supplemental source, separately for native (visible) text and OCR (invisible) text layers. See
Configuration.
Configuration
Scribe.js has two layers of settings:
- Process-wide options on scribe.opt: worker count, asset paths, and handler callbacks. These are shared across every document. Set them before the relevant operation; workerN must be set before workers initialize.
- Per-document defaults on scribe.ScribeDoc.defaults: recognition, rendering, and export behavior. Every export, recognition, and import function resolves a setting as options.X ?? ScribeDoc.defaults.X. Mutating ScribeDoc.defaults changes the default for every subsequent call; passing options.X to exportData, download, importFiles, or recognize overrides it for that one call.
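The options.X ?? ScribeDoc.defaults.X rule has one consequence worth spelling out: because ?? only falls through on null or undefined, an explicit false passed per call still overrides a true default. The sketch below reproduces the rule with plain objects; the defaults object and the resolve helper are stand-ins for this example, not Scribe.js internals.

```javascript
// Sketch of the `options.X ?? defaults.X` rule using plain objects.
const defaults = { displayMode: 'invis', reflow: true };

function resolve(name, options = {}) {
  return options[name] ?? defaults[name];
}

console.log(resolve('displayMode'));                           // invis
console.log(resolve('displayMode', { displayMode: 'proof' })); // proof
console.log(resolve('reflow', { reflow: false }));             // false
console.log(resolve('reflow', { reflow: undefined }));         // true
```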
Process-wide (scribe.opt)
// Languages and workers
scribe.opt.langPath = null; // dir of <lang>.traineddata.gz; null = CDN
scribe.opt.workerN = null; // worker count; null = up to 6 (browser) / 8 (Node)
// Handlers
scribe.opt.progressHandler = (msg) => {};
scribe.opt.warningHandler = (msg) => console.warn(msg);
scribe.opt.errorHandler = (msg) => console.error(msg);
See js/containers/app.js for the full opt class.
Per-document defaults (scribe.ScribeDoc.defaults)
// Text output
scribe.ScribeDoc.defaults.reflow = true; // combine lines into paragraphs
scribe.ScribeDoc.defaults.lineNumbers = false; // prefix lines with page:line (txt only)
scribe.ScribeDoc.defaults.removeMargins = false;
// PDF / image output
scribe.ScribeDoc.defaults.displayMode = 'invis'; // 'invis' | 'proof' | 'ebook'
scribe.ScribeDoc.defaults.colorMode = 'color'; // 'color' | 'gray' | 'binary'
scribe.ScribeDoc.defaults.autoRotate = true;
scribe.ScribeDoc.defaults.includeImages = false; // include page images in HTML export
scribe.ScribeDoc.defaults.embedFonts = false; // embed fonts in HTML export (vs. CDN); enable for offline files
// PDF text handling and confidence thresholds
scribe.ScribeDoc.defaults.usePDFText = { native: { supp: true, main: true }, ocr: { supp: true, main: false } };
scribe.ScribeDoc.defaults.keepPDFTextAlways = false;
scribe.ScribeDoc.defaults.confThreshHigh = 85;
scribe.ScribeDoc.defaults.confThreshMed = 75;
// Sessions
scribe.ScribeDoc.defaults.compressScribe = true; // gzip .scribe output
scribe.ScribeDoc.defaults.includeExtraTextScribe = false;
The same names work as per-call options:
await doc.download('pdf', 'out.pdf', { displayMode: 'proof', colorMode: 'binary' });
await doc.exportData('scribe', { compressScribe: false });
See js/containers/scribeDocDefaults.js for the
complete list of per-document defaults.
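The two confidence thresholds split word confidence (0-100) into three bands. The sketch below shows one plausible reading of that split using the default values listed above. The helper name confidenceTier, the tier labels, and the choice to treat a value equal to a threshold as belonging to the higher band are assumptions for illustration; the guide does not state how the library handles the boundaries.

```javascript
// Illustrative only: bucket a word confidence using the two thresholds.
function confidenceTier(conf, { confThreshHigh = 85, confThreshMed = 75 } = {}) {
  if (conf >= confThreshHigh) return 'high';
  if (conf >= confThreshMed) return 'medium';
  return 'low';
}

console.log(confidenceTier(92)); // high
console.log(confidenceTier(80)); // medium
console.log(confidenceTier(40)); // low
```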
Browser usage notes
- Same origin. All Scribe.js files must be served from the same origin as the importing code. CDN imports do not work and there is no UMD build.
- Assets. Tesseract .traineddata files load from a CDN by default; point opt.langPath at a same-origin directory to self-host them.
- Inputs. Use a File / FileList from an <input type="file">, or same-origin URLs.
- Templates. Working setups for common build systems are listed in the README (ESM/no-build, Next.js, Webpack 5, Vue 2).