UForm Model Benchmarks

October 30, 2025

Few retrieval benchmarks exist for multimodal embeddings. The best-known ones for English are MS-COCO and Flickr30k. Evaluating the uform-vl-english model, one can expect the following search-quality numbers.

Dataset	Recall @ 1	Recall @ 5	Recall @ 10
Flickr30k	0.727	0.915	0.949
MS-COCO ¹	0.510	0.761	0.838
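
To make the Recall @ K columns concrete, here is a minimal sketch of how such a metric can be computed from a query-by-candidate similarity matrix. It assumes one correct candidate per query, stored at the same index; real datasets such as MS-COCO pair several captions with each image, and the evaluation code used for the numbers above is not shown here. The function name `recall_at_k` and the toy matrix are illustrative only.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item (same index) ranks in the top k."""
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Rank candidate indices by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy text-to-image similarity matrix: row i is a caption, column j is an image,
# and the correct image for caption i is assumed to be image i.
toy_similarity = [
    [0.9, 0.1, 0.3],  # caption 0: correct image ranked 1st
    [0.8, 0.5, 0.2],  # caption 1: correct image ranked 2nd
    [0.7, 0.6, 0.1],  # caption 2: correct image ranked 3rd
]

for k in (1, 2, 3):
    print(f"Recall @ {k}: {recall_at_k(toy_similarity, k):.3f}")
```

By construction Recall @ K can only grow with K, which is why each row in the table increases from left to right.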

For multilingual benchmarks, we've created the unum-cloud/coco-sm repository². Evaluating the unum-cloud/uform-vl-multilingual-v2 model, one can expect the following metrics for text-to-image search, compared against the xlm-roberta-base-ViT-B-32 OpenCLIP model.

Language	OpenCLIP @ 1	UForm @ 1	OpenCLIP @ 5	UForm @ 5	OpenCLIP @ 10	UForm @ 10	Speakers
English 🇺🇸	37.8	37.7	63.5	65.0	73.5	75.9	1'452 M
Chinese 🇨🇳	27.3	32.2	51.3	59.0	62.1	70.5	1'118 M
Hindi 🇮🇳	20.7	31.3	42.5	57.9	53.7	69.6	602 M
Spanish 🇪🇸	32.6	35.6	58.0	62.8	68.8	73.7	548 M
Arabic 🇸🇦	22.7	31.7	44.9	57.8	55.8	69.2	274 M
French 🇫🇷	31.3	35.4	56.5	62.6	67.4	73.3	274 M

All languages:

Language	OpenCLIP @ 1	UForm @ 1	OpenCLIP @ 5	UForm @ 5	OpenCLIP @ 10	UForm @ 10	Speakers
Arabic 🇸🇦	22.7	31.7	44.9	57.8	55.8	69.2	274 M
Armenian 🇦🇲	5.6	22.0	14.3	44.7	20.2	56.0	4 M
Chinese 🇨🇳	27.3	32.2	51.3	59.0	62.1	70.5	1'118 M
English 🇺🇸	37.8	37.7	63.5	65.0	73.5	75.9	1'452 M
French 🇫🇷	31.3	35.4	56.5	62.6	67.4	73.3	274 M
German 🇩🇪	31.7	35.1	56.9	62.2	67.4	73.3	134 M
Hebrew 🇮🇱	23.7	26.7	46.3	51.8	57.0	63.5	9 M
Hindi 🇮🇳	20.7	31.3	42.5	57.9	53.7	69.6	602 M
Indonesian 🇮🇩	26.9	30.7	51.4	57.0	62.7	68.6	199 M
Italian 🇮🇹	31.3	34.9	56.7	62.1	67.1	73.1	67 M
Japanese 🇯🇵	27.4	32.6	51.5	59.2	62.6	70.6	125 M
Korean 🇰🇷	24.4	31.5	48.1	57.8	59.2	69.2	81 M
Persian 🇮🇷	24.0	28.8	47.0	54.6	57.8	66.2	77 M
Polish 🇵🇱	29.2	33.6	53.9	60.1	64.7	71.3	41 M
Portuguese 🇵🇹	31.6	32.7	57.1	59.6	67.9	71.0	257 M
Russian 🇷🇺	29.9	33.9	54.8	60.9	65.8	72.0	258 M
Spanish 🇪🇸	32.6	35.6	58.0	62.8	68.8	73.7	548 M
Thai 🇹🇭	21.5	28.7	43.0	54.6	53.7	66.0	61 M
Turkish 🇹🇷	25.5	33.0	49.1	59.6	60.3	70.8	88 M
Ukrainian 🇺🇦	26.0	30.6	49.9	56.7	60.9	68.1	41 M
Vietnamese 🇻🇳	25.4	28.3	49.2	53.9	60.3	65.5	85 M

Mean	26.5±6.4	31.8±3.5	49.8±9.8	58.1±4.5	60.4±10.6	69.4±4.3	-
Google Translate	27.4±6.3	31.5±3.5	51.1±9.5	57.8±4.4	61.7±10.3	69.1±4.3	-
Microsoft Translator	27.2±6.4	31.4±3.6	50.8±9.8	57.7±4.7	61.4±10.6	68.9±4.6	-
Meta NLLB	24.9±6.7	32.4±3.5	47.5±10.3	58.9±4.5	58.2±11.2	70.2±4.3	-
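
The "Mean" row appears to be the average over the 21 per-language rows, with the sample standard deviation after the ± sign. The sketch below checks that reading for the "OpenCLIP @ 1" column only, using values copied from the table. It is not taken from the coco-sm evaluation code, and the variable names are illustrative.

```python
from statistics import mean, stdev

# "OpenCLIP @ 1" column, copied from the per-language table (Arabic through Vietnamese).
openclip_recall_at_1 = [
    22.7, 5.6, 27.3, 37.8, 31.3, 31.7, 23.7,
    20.7, 26.9, 31.3, 27.4, 24.4, 24.0, 29.2,
    31.6, 29.9, 32.6, 21.5, 25.5, 26.0, 25.4,
]

column_mean = mean(openclip_recall_at_1)
# stdev() is the sample standard deviation (divides by n - 1).
column_spread = stdev(openclip_recall_at_1)

print(f"{column_mean:.1f}±{column_spread:.1f}")
```

The spread is the more telling half of each cell: a smaller value after the ± sign means quality is more even across languages.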

Generative Models

Model	LLM Size	SQA	MME	MMBench	Average¹
UForm-Gen2-Qwen-500m	0.5B	45.5	880.1	42.0	29.31
MobileVLM v2	1.4B	52.1	1302.8	57.7	36.81
LLaVA-Phi	2.7B	68.4	1335.1	59.8	42.95

For captioning evaluation we measure CLIPScore and RefCLIPScore³.

Model	Size	Caption Length	CLIPScore	RefCLIPScore
`llava-hf/llava-1.5-7b-hf`	7B	Long	0.878	0.529
`llava-hf/llava-1.5-7b-hf`	7B	Short	0.886	0.531

`Salesforce/instructblip-vicuna-7b`	7B	Long	0.902	0.534
`Salesforce/instructblip-vicuna-7b`	7B	Short	0.848	0.523

`unum-cloud/uform-gen`	1.5B	Long	0.847	0.523
`unum-cloud/uform-gen`	1.5B	Short	0.842	0.522

`unum-cloud/uform-gen-chat`	1.5B	Long	0.860	0.525
`unum-cloud/uform-gen-chat`	1.5B	Short	0.858	0.525
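
As a reminder of what the two columns measure, here is a minimal sketch following the definitions in the original CLIPScore paper: CLIPScore rescales the clipped cosine similarity between caption and image embeddings, and RefCLIPScore is the harmonic mean of that value and the best clipped similarity to a reference caption. The weight of 2.5 comes from that paper; whether the numbers above use the same weight is an assumption. The toy two-dimensional vectors stand in for real CLIP embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def clip_score(caption_emb, image_emb, weight=2.5):
    """Rescaled caption-to-image cosine similarity, clipped at zero."""
    return weight * max(cosine(caption_emb, image_emb), 0.0)


def ref_clip_score(caption_emb, image_emb, reference_embs, weight=2.5):
    """Harmonic mean of CLIPScore and the best clipped caption-to-reference similarity."""
    image_term = clip_score(caption_emb, image_emb, weight)
    reference_term = max(max(cosine(caption_emb, r) for r in reference_embs), 0.0)
    if image_term == 0.0 or reference_term == 0.0:
        return 0.0  # the harmonic mean collapses to zero if either term is zero
    return 2 * image_term * reference_term / (image_term + reference_term)


# Toy embeddings (hypothetical values, not produced by any model).
caption = [1.0, 0.0]
image = [1.0, 1.0]
references = [[0.0, 1.0], [1.0, 0.0]]

print(clip_score(caption, image))
print(ref_clip_score(caption, image, references))
```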

Results for VQAv2 evaluation.

Model	Size	Accuracy
`llava-hf/llava-1.5-7b-hf`	7B	78.5
`unum-cloud/uform-gen`	1.5B	66.5

¹ The train split was part of the training data.
² Lacking a broad enough evaluation dataset, we translated the COCO Karpathy test split with multiple public and proprietary translation services, averaging the scores across all sets and breaking them down by service in the bottom rows of the table.
³ We used the apple/DFN5B-CLIP-ViT-H-14-378 CLIP model.

Speed

Embedding Models

UForm comes pre-packaged with speed benchmarks for the models.

$ python python/scripts/bench_encoders.py --help
usage: bench_encoders.py [-h] [--filter-out FILTER_OUT] [--batch-size BATCH_SIZE]

options:
  -h, --help            show this help message and exit
  --filter-out FILTER_OUT
                        Filter out models, backends, or devices with a Regular Expression.
  --batch-size BATCH_SIZE
                        Batch size for the benchmark. Batch size 1 measures latency. Large batch sizes may not fit on every GPU.
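
The relation between batch size, latency, and the Images/s and Texts/s columns can be sketched with a small timing helper. This is not the code of `bench_encoders.py`; `measure_throughput` and `dummy_encoder` are hypothetical names, and the dummy encoder stands in for a real model. On a GPU, a real benchmark would also need to synchronize the device before reading the clock, since kernels run asynchronously.

```python
import time


def measure_throughput(encode_batch, batch, repetitions=10):
    """Return (items_per_second, seconds_per_batch) for a batch-encoding callable."""
    encode_batch(batch)  # warm-up call, excluded from the timing
    start = time.perf_counter()
    for _ in range(repetitions):
        encode_batch(batch)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    seconds_per_batch = elapsed / repetitions
    items_per_second = len(batch) * repetitions / elapsed
    return items_per_second, seconds_per_batch


def dummy_encoder(texts):
    # Hypothetical stand-in for a real text encoder: one fake 1-d embedding per text.
    return [[float(len(text))] for text in texts]


for batch_size in (1, 128):
    rate, latency = measure_throughput(dummy_encoder, ["a photo of a cat"] * batch_size)
    print(f"batch={batch_size}: {rate:.0f} texts/s, {latency * 1e3:.4f} ms per batch")
```

With a batch of one, the time per batch is the per-item latency; with larger batches, the same wall-clock measurement is reported as throughput instead.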

On Nvidia B200, the results may look like:

uv run python python/scripts/bench_encoders.py --batch-size 2048 --gpu --torch

Model	Device	Backend	Precision	Images/s	Texts/s
unum-cloud/uform3-image-text-english-base	cuda	torch	bfloat16	6662.7	38482.7
unum-cloud/uform3-image-text-english-large	cuda	torch	bfloat16	2930.2	53927.5
unum-cloud/uform3-image-text-english-small	cuda	torch	bfloat16	1385.6	6611.2
unum-cloud/uform3-image-text-multilingual-base	cuda	torch	bfloat16	7235.1	36690.4

On a 160-core, dual-socket Intel Emerald Rapids CPU-only setup, the results may look like:

uv run python python/scripts/bench_encoders.py --batch-size 128 --cpu --torch --onnx

Model	Device	Backend	Precision	Images/s	Texts/s
unum-cloud/uform3-image-text-english-base	cpu	torch	bfloat16	164.3	3228.1
unum-cloud/uform3-image-text-english-base	cpu	onnx	unknown	212.8	1752.8
unum-cloud/uform3-image-text-english-large	cpu	torch	bfloat16	32.1	3550.8
unum-cloud/uform3-image-text-english-large	cpu	onnx	unknown	58.9	1348.2
unum-cloud/uform3-image-text-english-small	cpu	torch	bfloat16	335.9	5801.1
unum-cloud/uform3-image-text-english-small	cpu	onnx	unknown	325.4	2589.3
unum-cloud/uform3-image-text-multilingual-base	cpu	torch	bfloat16	153.2	4026.2
unum-cloud/uform3-image-text-multilingual-base	cpu	onnx	unknown	197.5	1401.4

Generative Models

$ python python/scripts/bench_decoders.py --help
usage: bench_decoders.py [-h] [--filter-out FILTER_OUT] [--batch-size BATCH_SIZE]

options:
  -h, --help            show this help message and exit
  --batch-size BATCH_SIZE
                        Batch size for the benchmark. Batch size 1 measures latency. Large batch sizes may not fit on every GPU.
  --max-length MAX_LENGTH
                        Maximum length of the generated text in tokens.

On Nvidia H100 GPU, the following performance is expected on text token generation using float16, equivalent PyTorch settings, and greedy decoding.

Model	Size	Decoding Speed	Decoding Parallel Streams
`llava-hf/llava-1.5-7b-hf`	7 B	~ 141 tokens/s	~ 4 K tokens/s (32 streams)
`Salesforce/instructblip-vicuna-7b`	7 B	~ 211 tokens/s	~ 2 K tokens/s (32 streams)
`unum-cloud/uform-gen`	1.5 B	~ 252 tokens/s	~ 3 K tokens/s (128 streams)
`unum-cloud/uform-gen2-dpo`	1.2 B	~ 372 tokens/s	~ 10 K tokens/s (64 streams)

On Nvidia RTX 3090, the following performance is expected on text token generation using float16, equivalent PyTorch settings, and greedy decoding.

Model	Size	Decoding Speed	Speedup
`llava-hf/llava-1.5-7b-hf`	7 B	~ 40 tokens/s
`Salesforce/instructblip-vicuna-7b`	7 B	~ 40 tokens/s
`unum-cloud/uform-gen`	1.5 B	~ 140 tokens/s	x 3.5
