DeepSeek-OCR-2 Burn
January 28, 2026
Experimental, pure-Rust inference for the DeepSeek-OCR-2 vision-language OCR model using Burn.
This repository focuses on:
- Loading Hugging Face safetensors weights (PyTorch layout) into Burn.
- End-to-end OCR inference (vision -> LLM injection) in Rust.
- Running with either the Vulkan (wgpu) backend or the NdArray (CPU) backend.
Status: the OCR pipeline works, but the runtime and memory usage are not yet competitive with mature Python stacks (e.g. PyTorch + FlashAttention / vLLM). Expect slower inference, especially on integrated GPUs.
Requirements
- Rust (stable, edition 2024).
- For --backend vulkan: a working Vulkan driver (wgpu uses Vulkan).
- For --backend ndarray: CPU only.
Model Weights (Hugging Face)
Download the model files into hf_deepseek_ocr2/ at the repository root.
Option A: hf (Hugging Face CLI)
pip install -U "huggingface-hub[cli]"
hf download deepseek-ai/DeepSeek-OCR-2 \
--local-dir hf_deepseek_ocr2 \
--include "model-*.safetensors" "tokenizer.json"
You should end up with:
- hf_deepseek_ocr2/model-00001-of-000001.safetensors
- hf_deepseek_ocr2/tokenizer.json
Usage
OCR (image -> text)
The OCR prompt is currently built-in as:
<image>
Free OCR.
Run OCR on an image:
cargo run --release -- generate-ocr \
--backend vulkan \
--weights hf_deepseek_ocr2/model-00001-of-000001.safetensors \
--tokenizer hf_deepseek_ocr2/tokenizer.json \
--image assets/example.png \
--auto-rotate
Common flags:
- --backend vulkan|ndarray (default: vulkan)
- --rotate 0|90|180|270 (clockwise) or --auto-rotate
- --no-crop: disable dynamic tiling (by default, large images are tiled into 768x768 crops plus a 1024x1024 global view, matching the HF reference behavior)
- --trim-memory: best-effort reduction of CPU-side memory after loading weights (drops the OS page cache for the weights file and calls malloc_trim(0) on glibc)
- --kv-cache f32|f16 (default: f32); f16 is experimental, and outputs may drift on some setups
Text-only generation
cargo run --release -- generate-text \
--backend vulkan \
--weights hf_deepseek_ocr2/model-00001-of-000001.safetensors \
--tokenizer hf_deepseek_ocr2/tokenizer.json \
--prompt "Hello"
Backends
Vulkan (wgpu)
- Uses Burn's WGPU Vulkan backend.
- Best option for GPU acceleration.
- On integrated GPUs the "GPU memory" is shared system RAM; system memory usage can look high.
NdArray (CPU)
- Pure CPU backend (Burn NdArray).
- Requires casting weights to FP32 (NdArray does not support BF16/F16 weights), which increases memory usage.
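To illustrate why the FP32 cast raises memory usage, here is a minimal sketch of widening BF16 values to f32. It is not taken from this repository's loader: the function names are hypothetical, and it assumes the tensor bytes are stored little-endian, as in the safetensors format. Each 2-byte BF16 value becomes a 4-byte f32, so the weights roughly double in size.

```rust
/// Widen one bfloat16 value (given as raw bits) to f32.
/// bfloat16 is the upper 16 bits of an IEEE 754 f32, so widening
/// is a 16-bit left shift with the low mantissa bits set to zero.
fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Convert a little-endian BF16 byte buffer into f32 values.
/// A trailing odd byte, if any, is ignored by `chunks_exact`.
fn bf16_bytes_to_f32(raw: &[u8]) -> Vec<f32> {
    raw.chunks_exact(2)
        .map(|pair| bf16_bits_to_f32(u16::from_le_bytes([pair[0], pair[1]])))
        .collect()
}
```

The conversion is lossless, because every BF16 value is exactly representable in f32. The cost is purely the doubled storage.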
Memory Usage (Example)
Measured on Linux, Intel iGPU, 32GB RAM, assets/example.png, --max-new-tokens 32.
Because the Vulkan (wgpu) backend uses unified-memory/driver allocations that don't necessarily
show up as process RSS, the most representative metric here is peak system RAM used
(MemTotal - MemAvailable, sampled during the run).
- vulkan: peak system RAM used +17.8 GiB (extra over baseline)
- ndarray: peak system RAM used +13.8 GiB (extra over baseline)
For reference, peak process RSS during the same runs:
- vulkan: ~2.8 GiB
- ndarray: ~15.4 GiB
Notes:
- System memory reported by free/top can be higher than process RSS on Vulkan (especially on integrated GPUs) due to driver/unified-memory allocations and file page cache.
- The numbers above are an example; they vary with image resolution, generation length, driver, and hardware.
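The "system RAM used" metric above (MemTotal - MemAvailable) can be sampled on Linux by reading /proc/meminfo. The sketch below shows one possible way to do that. It is not the script used for the measurements, and the helper names are hypothetical.

```rust
/// Extract a field such as "MemTotal" from /proc/meminfo text, in KiB.
/// Lines look like "MemTotal:       32768000 kB".
fn meminfo_kib(meminfo: &str, key: &str) -> Option<u64> {
    meminfo.lines().find_map(|line| {
        // Require the exact key followed by ':' so that a key which is
        // only a prefix of another field name does not match.
        let rest = line.strip_prefix(key)?.strip_prefix(':')?;
        rest.split_whitespace().next()?.parse().ok()
    })
}

/// System RAM in use, in KiB: MemTotal - MemAvailable.
fn used_kib(meminfo: &str) -> Option<u64> {
    let total = meminfo_kib(meminfo, "MemTotal")?;
    let available = meminfo_kib(meminfo, "MemAvailable")?;
    Some(total.saturating_sub(available))
}

/// Read the current value from the running Linux system.
fn sample_used_kib() -> Option<u64> {
    let text = std::fs::read_to_string("/proc/meminfo").ok()?;
    used_kib(&text)
}
```

To get a "peak over baseline" figure, record sample_used_kib() once before starting the run and poll it periodically during the run. Then subtract the baseline from the maximum observed value.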
Performance Notes
Burn is still evolving, and its GPU kernels and memory planning are not yet as optimized as mature Python runtimes. This project prioritizes correctness and a pure-Rust pipeline, so you should expect slower inference and higher memory usage in many configurations.
License
See LICENSE.