UForm Model Benchmarks
October 30, 2025
Accuracy
Embedding Models
Few retrieval benchmarks exist for multimodal embeddings.
The most famous ones for English are "MS-COCO" and "Flickr30k".
Evaluating the uform-vl-english model, one can expect the following numbers for search quality.
| Dataset | Recall @ 1 | Recall @ 5 | Recall @ 10 |
|---|---|---|---|
| Flickr30k | 0.727 | 0.915 | 0.949 |
| MS-COCO ¹ | 0.510 | 0.761 | 0.838 |
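For readers unfamiliar with the metric, Recall @ K is the fraction of queries whose ground-truth match appears among the top K retrieved items. The sketch below is a minimal pure-Python illustration, assuming one correct image per text query and a precomputed similarity matrix. The function name `recall_at_k` and the toy scores are hypothetical and are not part of UForm or of the evaluation that produced the numbers above.

```python
def recall_at_k(similarities, ground_truth, k):
    """Fraction of queries whose correct item is among the k highest-scoring items.

    similarities[i][j] is the score between query i and candidate j;
    ground_truth[i] is the index of the single correct candidate for query i.
    """
    hits = 0
    for scores, target in zip(similarities, ground_truth):
        # Rank candidate indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(ground_truth)


# Toy example (made-up scores): 3 text queries against 4 candidate images.
similarities = [
    [0.9, 0.1, 0.3, 0.2],  # correct image 0 is ranked first
    [0.4, 0.5, 0.8, 0.1],  # correct image 1 is ranked second
    [0.7, 0.6, 0.2, 0.3],  # correct image 3 is ranked third
]
ground_truth = [0, 1, 3]

for k in (1, 2, 3):
    print(f"Recall @ {k}: {recall_at_k(similarities, ground_truth, k):.3f}")
```

In this toy case the recall rises from 1/3 at K = 1 to 1.0 at K = 3, which mirrors how the reported numbers grow from Recall @ 1 to Recall @ 10.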
For multilingual benchmarks, we've created the unum-cloud/coco-sm repository².
Evaluating the unum-cloud/uform-vl-multilingual-v2 model, one can expect the following metrics for text-to-image search, compared against the xlm-roberta-base-ViT-B-32 OpenCLIP model.
| Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
|---|---|---|---|---|---|---|---|
| English 🇺🇸 | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
| Chinese 🇨🇳 | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
| Hindi 🇮🇳 | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 | 602 M |
| Spanish 🇪🇸 | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 | 548 M |
| Arabic 🇸🇦 | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 | 274 M |
| French 🇫🇷 | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 | 274 M |
All languages:
| Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
|---|---|---|---|---|---|---|---|
| Arabic 🇸🇦 | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 | 274 M |
| Armenian 🇦🇲 | 5.6 | 22.0 | 14.3 | 44.7 | 20.2 | 56.0 | 4 M |
| Chinese 🇨🇳 | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
| English 🇺🇸 | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
| French 🇫🇷 | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 | 274 M |
| German 🇩🇪 | 31.7 | 35.1 | 56.9 | 62.2 | 67.4 | 73.3 | 134 M |
| Hebrew 🇮🇱 | 23.7 | 26.7 | 46.3 | 51.8 | 57.0 | 63.5 | 9 M |
| Hindi 🇮🇳 | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 | 602 M |
| Indonesian 🇮🇩 | 26.9 | 30.7 | 51.4 | 57.0 | 62.7 | 68.6 | 199 M |
| Italian 🇮🇹 | 31.3 | 34.9 | 56.7 | 62.1 | 67.1 | 73.1 | 67 M |
| Japanese 🇯🇵 | 27.4 | 32.6 | 51.5 | 59.2 | 62.6 | 70.6 | 125 M |
| Korean 🇰🇷 | 24.4 | 31.5 | 48.1 | 57.8 | 59.2 | 69.2 | 81 M |
| Persian 🇮🇷 | 24.0 | 28.8 | 47.0 | 54.6 | 57.8 | 66.2 | 77 M |
| Polish 🇵🇱 | 29.2 | 33.6 | 53.9 | 60.1 | 64.7 | 71.3 | 41 M |
| Portuguese 🇵🇹 | 31.6 | 32.7 | 57.1 | 59.6 | 67.9 | 71.0 | 257 M |
| Russian 🇷🇺 | 29.9 | 33.9 | 54.8 | 60.9 | 65.8 | 72.0 | 258 M |
| Spanish 🇪🇸 | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 | 548 M |
| Thai 🇹🇭 | 21.5 | 28.7 | 43.0 | 54.6 | 53.7 | 66.0 | 61 M |
| Turkish 🇹🇷 | 25.5 | 33.0 | 49.1 | 59.6 | 60.3 | 70.8 | 88 M |
| Ukrainian 🇺🇦 | 26.0 | 30.6 | 49.9 | 56.7 | 60.9 | 68.1 | 41 M |
| Vietnamese 🇻🇳 | 25.4 | 28.3 | 49.2 | 53.9 | 60.3 | 65.5 | 85 M |
| Mean | 26.5±6.4 | 31.8±3.5 | 49.8±9.8 | 58.1±4.5 | 60.4±10.6 | 69.4±4.3 | - |
| Google Translate | 27.4±6.3 | 31.5±3.5 | 51.1±9.5 | 57.8±4.4 | 61.7±10.3 | 69.1±4.3 | - |
| Microsoft Translator | 27.2±6.4 | 31.4±3.6 | 50.8±9.8 | 57.7±4.7 | 61.4±10.6 | 68.9±4.6 | - |
| Meta NLLB | 24.9±6.7 | 32.4±3.5 | 47.5±10.3 | 58.9±4.5 | 58.2±11.2 | 70.2±4.3 | - |
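The Mean row appears to summarize the per-language scores as mean ± standard deviation. As a sanity check, the sketch below recomputes the OpenCLIP @ 1 entry from the 21 per-language values in the table. It assumes the ± figure is the sample standard deviation; that is an inference from the numbers, since the source does not state which estimator was used.

```python
import statistics

# OpenCLIP @ 1 scores for the 21 languages, copied from the table above
# in row order (Arabic through Vietnamese).
openclip_at_1 = [
    22.7, 5.6, 27.3, 37.8, 31.3, 31.7, 23.7,
    20.7, 26.9, 31.3, 27.4, 24.4, 24.0, 29.2,
    31.6, 29.9, 32.6, 21.5, 25.5, 26.0, 25.4,
]

mean = statistics.mean(openclip_at_1)
# Assumption: sample standard deviation (n - 1 in the denominator).
std = statistics.stdev(openclip_at_1)

print(f"{mean:.1f}±{std:.1f}")  # prints 26.5±6.4
```

The large spread for OpenCLIP is driven mostly by low-resource languages such as Armenian (5.6), which is also where the gap to UForm is widest.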
Generative Models
| Model | LLM Size | SQA | MME | MMBench | Average¹ |
|---|---|---|---|---|---|
| UForm-Gen2-Qwen-500m | 0.5B | 45.5 | 880.1 | 42.0 | 29.31 |
| MobileVLM v2 | 1.4B | 52.1 | 1302.8 | 57.7 | 36.81 |
| LLaVA-Phi | 2.7B | 68.4 | 1335.1 | 59.8 | 42.95 |
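The source does not spell out how the Average column is computed. One formula that reproduces all three reported values to two decimals is `(SQA + MMBench + MME / 2048) / 3`. The sketch below checks that reading against the table; the formula and the helper name `inferred_average` are inferences for illustration, not a documented definition.

```python
# Assumption (inferred from the table, not documented by the source):
# Average = (SQA + MMBench + MME / 2048) / 3.
def inferred_average(sqa, mme, mmbench):
    return (sqa + mmbench + mme / 2048) / 3


# (SQA, MME, MMBench, reported Average), copied from the table above.
rows = {
    "UForm-Gen2-Qwen-500m": (45.5, 880.1, 42.0, 29.31),
    "MobileVLM v2": (52.1, 1302.8, 57.7, 36.81),
    "LLaVA-Phi": (68.4, 1335.1, 59.8, 42.95),
}

for model, (sqa, mme, mmbench, reported) in rows.items():
    computed = round(inferred_average(sqa, mme, mmbench), 2)
    print(f"{model}: computed {computed}, reported {reported}")
```

Under this reading the MME score contributes less than one point to the average, so the column is dominated by SQA and MMBench.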
For captioning evaluation we measure CLIPScore and RefCLIPScore³.
| Model | Size | Caption Length | CLIPScore | RefCLIPScore |
|---|---|---|---|---|
| llava-hf/llava-1.5-7b-hf | 7B | Long | 0.878 | 0.529 |
| llava-hf/llava-1.5-7b-hf | 7B | Short | 0.886 | 0.531 |
| Salesforce/instructblip-vicuna-7b | 7B | Long | 0.902 | 0.534 |
| Salesforce/instructblip-vicuna-7b | 7B | Short | 0.848 | 0.523 |
| unum-cloud/uform-gen | 1.5B | Long | 0.847 | 0.523 |
| unum-cloud/uform-gen | 1.5B | Short | 0.842 | 0.522 |
| unum-cloud/uform-gen-chat | 1.5B | Long | 0.860 | 0.525 |
| unum-cloud/uform-gen-chat | 1.5B | Short | 0.858 | 0.525 |
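As background, the sketch below shows how the two metrics are defined in the original CLIPScore paper (Hessel et al., 2021): CLIPScore is a rescaled, clipped cosine similarity between the image and caption embeddings, and RefCLIPScore is the harmonic mean of that value and the best clipped similarity between the caption and any reference caption. The helper names are hypothetical, the embeddings are toy vectors rather than CLIP outputs, and the source does not say whether its numbers use the paper's scaling factor of 2.5.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(image_emb, caption_emb, w=2.5):
    """CLIPScore: w * max(cos(image, caption), 0); w = 2.5 in the paper."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


def ref_clip_score(image_emb, caption_emb, reference_embs, w=2.5):
    """Harmonic mean of CLIPScore and the best caption-to-reference similarity."""
    s = clip_score(image_emb, caption_emb, w)
    r = max(max(cosine(caption_emb, ref), 0.0) for ref in reference_embs)
    if s == 0.0 or r == 0.0:
        return 0.0
    return 2 * s * r / (s + r)


# Toy 2-D embeddings standing in for real CLIP image and text embeddings.
image = [1.0, 0.0]
caption = [0.8, 0.6]
references = [[1.0, 0.0], [0.0, 1.0]]

print(f"CLIPScore:    {clip_score(image, caption):.3f}")
print(f"RefCLIPScore: {ref_clip_score(image, caption, references):.3f}")
```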
Results for VQAv2 evaluation.
| Model | Size | Accuracy |
|---|---|---|
| llava-hf/llava-1.5-7b-hf | 7B | 78.5 |
| unum-cloud/uform-gen | 1.5B | 66.5 |
¹ Train split was in training data.
² Lacking a broad enough evaluation dataset, we translated the COCO Karpathy test split with multiple public and proprietary translation services. The per-language scores are averaged across all translated sets, and the bottom rows of the table break them down by translation service.
³ We used the apple/DFN5B-CLIP-ViT-H-14-378 CLIP model.
Speed
Embedding Models
UForm comes pre-packaged with speed benchmarks for the models.
```sh
$ python python/scripts/bench_encoders.py --help
usage: bench_encoders.py [-h] [--filter-out FILTER_OUT] [--batch-size BATCH_SIZE]

options:
  -h, --help            show this help message and exit
  --filter-out FILTER_OUT
                        Filter out models, backends, or devices with a Regular Expression.
  --batch-size BATCH_SIZE
                        Batch size for the benchmark. Batch size 1 measures latency. Large batch sizes may not fit on every GPU.
```
On Nvidia B200, the results may look like:
```sh
uv run python python/scripts/bench_encoders.py --batch-size 2048 --gpu --torch
```
| Model | Device | Backend | Precision | Images/s | Texts/s |
|---|---|---|---|---|---|
| unum-cloud/uform3-image-text-english-base | cuda | torch | bfloat16 | 6662.7 | 38482.7 |
| unum-cloud/uform3-image-text-english-large | cuda | torch | bfloat16 | 2930.2 | 53927.5 |
| unum-cloud/uform3-image-text-english-small | cuda | torch | bfloat16 | 1385.6 | 6611.2 |
| unum-cloud/uform3-image-text-multilingual-base | cuda | torch | bfloat16 | 7235.1 | 36690.4 |
On a 160-core dual-socket Intel Emerald Rapids CPU-only setup, the results may look like:
```sh
uv run python python/scripts/bench_encoders.py --batch-size 128 --cpu --torch --onnx
```
| Model | Device | Backend | Precision | Images/s | Texts/s |
|---|---|---|---|---|---|
| unum-cloud/uform3-image-text-english-base | cpu | torch | bfloat16 | 164.3 | 3228.1 |
| unum-cloud/uform3-image-text-english-base | cpu | onnx | unknown | 212.8 | 1752.8 |
| unum-cloud/uform3-image-text-english-large | cpu | torch | bfloat16 | 32.1 | 3550.8 |
| unum-cloud/uform3-image-text-english-large | cpu | onnx | unknown | 58.9 | 1348.2 |
| unum-cloud/uform3-image-text-english-small | cpu | torch | bfloat16 | 335.9 | 5801.1 |
| unum-cloud/uform3-image-text-english-small | cpu | onnx | unknown | 325.4 | 2589.3 |
| unum-cloud/uform3-image-text-multilingual-base | cpu | torch | bfloat16 | 153.2 | 4026.2 |
| unum-cloud/uform3-image-text-multilingual-base | cpu | onnx | unknown | 197.5 | 1401.4 |
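The Images/s and Texts/s columns are throughput figures: items processed divided by wall-clock time. The source does not show how `bench_encoders.py` times the models, so the following is only a generic sketch of such a measurement. The names `measure_throughput` and `dummy_encode` are hypothetical, and the dummy encoder merely simulates work.

```python
import time


def measure_throughput(encode_batch, batch, repeats=5):
    """Return items processed per second by `encode_batch` on `batch`."""
    encode_batch(batch)  # warm-up call, excluded from the timing
    start = time.perf_counter()
    for _ in range(repeats):
        encode_batch(batch)
    elapsed = time.perf_counter() - start
    return repeats * len(batch) / elapsed


# Hypothetical stand-in for a real text encoder; it only simulates work.
def dummy_encode(batch):
    return [sum(ord(ch) for ch in text) for text in batch]


texts = ["a photo of a cat"] * 128  # batch size 128, as in the CPU run above
print(f"{measure_throughput(dummy_encode, texts):.1f} texts/s")
```

With a batch size of 1 the same loop measures latency rather than throughput. On a GPU backend one would typically also synchronize the device before reading the clock (for example with `torch.cuda.synchronize()`), because kernel launches are asynchronous.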
Generative Models
```sh
$ python python/scripts/bench_decoders.py --help
usage: bench_decoders.py [-h] [--filter-out FILTER_OUT] [--batch-size BATCH_SIZE]

options:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        Batch size for the benchmark. Batch size 1 measures latency. Large batch sizes may not fit on every GPU.
  --max-length MAX_LENGTH
                        Maximum length of the generated text in tokens.
```
On an Nvidia H100 GPU, the following performance is expected for text token generation using float16, equivalent PyTorch settings, and greedy decoding.
| Model | Size | Decoding Speed | Decoding Parallel Streams |
|---|---|---|---|
| llava-hf/llava-1.5-7b-hf | 7 B | ~ 141 tokens/s | ~ 4 K tokens/s (32 streams) |
| Salesforce/instructblip-vicuna-7b | 7 B | ~ 211 tokens/s | ~ 2 K tokens/s (32 streams) |
| unum-cloud/uform-gen | 1.5 B | ~ 252 tokens/s | ~ 3 K tokens/s (128 streams) |
| unum-cloud/uform-gen2-dpo | 1.2 B | ~ 372 tokens/s | ~ 10 K tokens/s (64 streams) |
On an Nvidia RTX 3090, the following performance is expected for text token generation using float16, equivalent PyTorch settings, and greedy decoding.
| Model | Size | Decoding Speed | Speedup |
|---|---|---|---|
| llava-hf/llava-1.5-7b-hf | 7 B | ~ 40 tokens/s | |
| Salesforce/instructblip-vicuna-7b | 7 B | ~ 40 tokens/s | |
| unum-cloud/uform-gen | 1.5 B | ~ 140 tokens/s | x 3.5 |
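The Speedup column is consistent with dividing each model's decoding speed by that of the 7B baselines. The short sketch below reproduces the figure, assuming the baseline is the roughly 40 tokens/s shared by both 7B models; the source does not name the baseline explicitly.

```python
# Approximate decoding speeds in tokens/s on the RTX 3090, from the table above.
decoding_speed = {
    "llava-hf/llava-1.5-7b-hf": 40,
    "Salesforce/instructblip-vicuna-7b": 40,
    "unum-cloud/uform-gen": 140,
}

# Assumption: the 7B models (both at ~40 tokens/s) serve as the baseline.
baseline = decoding_speed["llava-hf/llava-1.5-7b-hf"]

for model, speed in decoding_speed.items():
    print(f"{model}: x {speed / baseline:.1f}")
```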