Benchmarks

February 24, 2026 · View on GitHub

Text model benchmarks
Multimodal model benchmarks

Text model benchmarks

These benchmarks all use the SmolLM-135M-GGUF model to perform simple text generation.

yzma model get -u https://huggingface.co/QuantFactory/SmolLM-135M-GGUF/resolve/main/SmolLM-135M.Q2_K.gguf
export YZMA_BENCHMARK_MODEL=~/models/SmolLM-135M.Q2_K.gguf

See https://github.com/hybridgroup/yzma/blob/main/pkg/llama/benchmark_test.go

Linux

CPU

amd64

$ go test -benchtime=10s -count=5 -run=nada -bench .
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: 13th Gen Intel(R) Core(TM) i9-13900HX
BenchmarkInference-32                 99         110913774 ns/op               270.5 tokens/s
BenchmarkInference-32                100         111035054 ns/op               270.2 tokens/s
BenchmarkInference-32                100         110369390 ns/op               271.8 tokens/s
BenchmarkInference-32                100         112705133 ns/op               266.2 tokens/s
BenchmarkInference-32                100         111892770 ns/op               268.1 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   61.199s

arm64

Raspberry Pi 4 Model B Rev 1.4 8GB

ron@raspberrypi:~/yzma/pkg/llama $ go test -benchtime=10s -count=5 -run=nada -bench .
goos: linux
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/llama
BenchmarkInference-4          15         893788634 ns/op                33.56 tokens/s
BenchmarkInference-4          12         923948131 ns/op                32.47 tokens/s
BenchmarkInference-4          12         918284434 ns/op                32.67 tokens/s
BenchmarkInference-4          12         918693617 ns/op                32.66 tokens/s
BenchmarkInference-4          12         917186754 ns/op                32.71 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   64.583s

CUDA

amd64

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   38C    P0            590W /  115W |      15MiB /   8188MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="CUDA0"
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: 13th Gen Intel(R) Core(TM) i9-13900HX
BenchmarkInference-32                332          35746370 ns/op               839.2 tokens/s
BenchmarkInference-32                338          35529926 ns/op               844.4 tokens/s
BenchmarkInference-32                336          35614579 ns/op               842.4 tokens/s
BenchmarkInference-32                336          35609522 ns/op               842.5 tokens/s
BenchmarkInference-32                337          35550352 ns/op               843.9 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   67.491s

ROCM

amd64

amdgpu_top v0.11.2
┌──────────────────────────────────────────────────────────────────────────────┐
│GPU Name                                | PCI Bus        |    VRAM Usage    | │
│SCLK    MCLK    VDDGFX  Power           | GFX% UMC%Media%|     GTT Usage    | │
│GPU/MEM_T  Fan     Throttle_Status                                          | │
│------------------------------------------------------------------------------│
│#0  [AMD Radeon RX 7900 XTX   ](gfx1100)| 0000:86:00.0   |    26/ 24560 MiB | │
│   0MHz   96MHz   49mV   14/303W        |   0%   0%   0% |    15/128884 MiB | │
│ 40C/ 46C     0RPM []                                                       | │
└──────────────────────────────────────────────────────────────────────────────┘
┌┤ Processes ├─────────────────────────────────────────────────────────────────┐
│┌┤ #0  AMD Radeon RX 7900 XTX ├──────────────────────────────────────────────┐│
││ Name            |  PID  |KFD| VRAM | GTT  |CPU |GFX |COMP|DMA |VCNU|       ││
││ kronk           | 411062|   |    0M|    2M|  1%|  0%|  0%|  0%|  0%|       ││
││ amdgpu_top      | 589729|   |    0M|    2M|  1%|  0%|  0%|  0%|  0%|       ││
│└────────────────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────────────────────────────────┘

go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="rocm0"
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: AMD EPYC 7443P 24-Core Processor
BenchmarkInference-48    	     194	  60798061 ns/op	       493.4 tokens/s
BenchmarkInference-48    	     196	  60271732 ns/op	       497.7 tokens/s
BenchmarkInference-48    	     198	  60255594 ns/op	       497.9 tokens/s
BenchmarkInference-48    	     195	  60948909 ns/op	       492.2 tokens/s
BenchmarkInference-48    	     198	  60715718 ns/op	       494.1 tokens/s
PASS
ok  	github.com/hybridgroup/yzma/pkg/llama	60.887s

However, for the same device but with the Vulkan backend:

go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="vulkan0"
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: AMD EPYC 7443P 24-Core Processor
BenchmarkInference-48    	     328	  36234037 ns/op	       828.0 tokens/s
BenchmarkInference-48    	     339	  35194859 ns/op	       852.4 tokens/s
BenchmarkInference-48    	     333	  35395438 ns/op	       847.6 tokens/s
BenchmarkInference-48    	     338	  35334138 ns/op	       849.0 tokens/s
BenchmarkInference-48    	     339	  35255138 ns/op	       850.9 tokens/s
PASS
ok  	github.com/hybridgroup/yzma/pkg/llama	61.232s

arm64

Jetson Orin Nano Developer Kit - 8GB

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.5.0                Driver Version: 540.5.0      CUDA Version: 12.6     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Orin (nvgpu)                  N/A  | N/A              N/A |                  N/A |
| N/A   N/A  N/A               N/A /  N/A | Not Supported        |     N/A          N/A |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=16000 -device="CUDA0"
goos: linux
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: ARMv8 Processor rev 1 (v8l)
BenchmarkInference-6          51         222138094 ns/op               135.1 tokens/s
BenchmarkInference-6          52         216104925 ns/op               138.8 tokens/s
BenchmarkInference-6          54         215961553 ns/op               138.9 tokens/s
BenchmarkInference-6          52         215498575 ns/op               139.2 tokens/s
BenchmarkInference-6          52         214849130 ns/op               139.6 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   61.014s

Vulkan

amd64

==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.275

Devices:
========
GPU0:
        apiVersion         = 1.4.318
        driverVersion      = 25.2.8
        vendorID           = 0x8086
        deviceID           = 0xa788
        deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
        deviceName         = Intel(R) Graphics (RPL-S)
        driverID           = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
        driverName         = Intel open-source Mesa driver
        driverInfo         = Mesa 25.2.8-0ubuntu0.24.04.1
        conformanceVersion = 1.4.0.0
        deviceUUID         = 868088a7-0400-0000-0002-000000000000
        driverUUID         = 032fbbbb-ddee-3516-8477-c17071969177
GPU1:
        apiVersion         = 1.4.312
        driverVersion      = 580.95.5.0
        vendorID           = 0x10de
        deviceID           = 0x2860
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = NVIDIA GeForce RTX 4070 Laptop GPU
        driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
        driverName         = NVIDIA
        driverInfo         = 580.95.05
        conformanceVersion = 1.4.1.3
        deviceUUID         = 7e611089-1272-699d-8985-ab84fef4311e
        driverUUID         = b92269a1-b525-5615-ab8a-e2095ee37192

$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="VULKAN0"
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: 13th Gen Intel(R) Core(TM) i9-13900HX
BenchmarkInference-32                 31         354329548 ns/op                84.67 tokens/s
BenchmarkInference-32                 34         351859490 ns/op                85.26 tokens/s
BenchmarkInference-32                 32         353665267 ns/op                84.83 tokens/s
BenchmarkInference-32                 33         349151210 ns/op                85.92 tokens/s
BenchmarkInference-32                 33         348216889 ns/op                86.15 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   70.757s

$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="VULKAN1"
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: 13th Gen Intel(R) Core(TM) i9-13900HX
BenchmarkInference-32                328          36362981 ns/op               825.0 tokens/s
BenchmarkInference-32                330          36353223 ns/op               825.2 tokens/s
BenchmarkInference-32                327          36207519 ns/op               828.6 tokens/s
BenchmarkInference-32                331          36366451 ns/op               824.9 tokens/s
BenchmarkInference-32                330          36262953 ns/op               827.3 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   83.142s

arm64

Jetson Orin Nano Developer Kit - 8GB

ron@ubuntu:~/yzma/pkg/mtmd$ vulkaninfo --summary
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.204
...
Devices:
========
GPU0:
        apiVersion         = 4206843 (1.3.251)
        driverVersion      = 2265006080 (0x87014000)
        vendorID           = 0x10de
        deviceID           = 0x97ba03d7
        deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
        deviceName         = NVIDIA Tegra Orin (nvgpu)
        driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
        driverName         = NVIDIA
        driverInfo         = 540.5.0
        conformanceVersion = 1.3.6.0
        deviceUUID         = 1388f9e0-987e-54a0-908f-6a30d8fd5f29
        driverUUID         = ed5ba772-f592-5949-9d1f-236f7ad81bcc

ron@ubuntu:~/yzma/pkg/llama$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=16000 -device="CPU"
goos: linux
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: ARMv8 Processor rev 1 (v8l)
BenchmarkInference-6          43         432432689 ns/op                69.37 tokens/s
BenchmarkInference-6          20         506747397 ns/op                59.20 tokens/s
BenchmarkInference-6          21         514736186 ns/op                58.28 tokens/s
BenchmarkInference-6          27         496646058 ns/op                60.41 tokens/s
BenchmarkInference-6          22         519434233 ns/op                57.76 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   68.009s

ron@ubuntu:~/yzma/pkg/llama$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=16000 -device="VULKAN0"
goos: linux
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: ARMv8 Processor rev 1 (v8l)
BenchmarkInference-6          52         222098600 ns/op               135.1 tokens/s
BenchmarkInference-6          52         222072877 ns/op               135.1 tokens/s
BenchmarkInference-6          54         219825013 ns/op               136.5 tokens/s
BenchmarkInference-6          52         220919304 ns/op               135.8 tokens/s
BenchmarkInference-6          54         221925680 ns/op               135.2 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   63.318s

macOS

Metal

Apple M4 Max with 128 GB RAM

$ go test -run none -benchtime=10s -count=5 -bench BenchmarkInference -nctx=16000
goos: darwin
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: Apple M4 Max
BenchmarkInference-16	  230		52168178 ns/op	575.1 tokens/s
BenchmarkInference-16	  234		51482815 ns/op	582.7 tokens/s
BenchmarkInference-16	  230		51729562 ns/op	579.9 tokens/s
BenchmarkInference-16	  230		52075140 ns/op	576.1 tokens/s
BenchmarkInference-16	  230		51981549 ns/op	577.1 tokens/s
PASS
ok  	github.com/hybridgroup/yzma/pkg/llama	62.042s

Windows

CPU

C:\Users\limbo\ron\yzma\pkg\llama>go test -benchtime=10s -count=5 -run=nada -bench . -nctx=8192
goos: windows
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: AMD Ryzen 9 7950X 16-Core Processor
BenchmarkInference-32                 51         214577557 ns/op               139.8 tokens/s
BenchmarkInference-32                 56         210247484 ns/op               142.7 tokens/s
BenchmarkInference-32                 52         206580071 ns/op               145.2 tokens/s
BenchmarkInference-32                 57         206447956 ns/op               145.3 tokens/s
BenchmarkInference-32                 57         207005089 ns/op               144.9 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   58.254s

CUDA

C:\Users\ron>nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 581.57                 Driver Version: 581.57         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070      WDDM  |   00000000:01:00.0 Off |                  N/A |
|  0%   42C    P8              6W /  240W |      22MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

C:\Users\limbo\ron\yzma\pkg\llama>go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="CUDA0"
goos: windows
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: AMD Ryzen 9 7950X 16-Core Processor
BenchmarkInference-32                254          46914384 ns/op               639.5 tokens/s
BenchmarkInference-32                258          46820920 ns/op               640.7 tokens/s
BenchmarkInference-32                255          46929827 ns/op               639.3 tokens/s
BenchmarkInference-32                255          46958283 ns/op               638.9 tokens/s
BenchmarkInference-32                250          47880058 ns/op               626.6 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   62.888s

Vulkan

==========
VULKANINFO
==========

Vulkan Instance Version: 1.4.309


Devices:
========
GPU0:
        apiVersion         = 1.3.270
        driverVersion      = 2.0.294
        vendorID           = 0x1002
        deviceID           = 0x164e
        deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
        deviceName         = AMD Radeon(TM) Graphics
        driverID           = DRIVER_ID_AMD_PROPRIETARY
        driverName         = AMD proprietary driver
        driverInfo         = 23.40.02 (AMD proprietary shader compiler)
        conformanceVersion = 1.3.3.1
        deviceUUID         = 00000000-0c00-0000-0000-000000000000
        driverUUID         = 414d442d-5749-4e2d-4452-560000000000
GPU1:
        apiVersion         = 1.4.312
        driverVersion      = 581.57.0.0
        vendorID           = 0x10de
        deviceID           = 0x2488
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = NVIDIA GeForce RTX 3070
        driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
        driverName         = NVIDIA
        driverInfo         = 581.57
        conformanceVersion = 1.4.1.3
        deviceUUID         = 91c0b9f4-e340-3c73-1422-c227930ae260
        driverUUID         = 08a6deb5-2838-56d3-b7da-f79802447960

C:\Users\limbo\ron\yzma\pkg\llama>go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="VULKAN0"
goos: windows
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: AMD Ryzen 9 7950X 16-Core Processor
BenchmarkInference-32                 34         329955426 ns/op                90.92 tokens/s
BenchmarkInference-32                 39         302329823 ns/op                99.23 tokens/s
BenchmarkInference-32                 39         302524487 ns/op                99.17 tokens/s
BenchmarkInference-32                 39         304700162 ns/op                98.46 tokens/s
BenchmarkInference-32                 39         304536574 ns/op                98.51 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   61.326s

C:\Users\limbo\ron\yzma\pkg\llama>go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="VULKAN1"
goos: windows
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/llama
cpu: AMD Ryzen 9 7950X 16-Core Processor
BenchmarkInference-32                294          40543699 ns/op               739.9 tokens/s
BenchmarkInference-32                295          40568015 ns/op               739.5 tokens/s
BenchmarkInference-32                295          40579471 ns/op               739.3 tokens/s
BenchmarkInference-32                297          40277643 ns/op               744.8 tokens/s
BenchmarkInference-32                296          40319531 ns/op               744.1 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/llama   84.981s

Multimodal model benchmarks

These benchmarks all use the Qwen3-VL-2B-Instruct.Q4_K_M.gguf model and projector to provide a description for an image.

yzma model get -u https://huggingface.co/mradermacher/Qwen3-VL-2B-Instruct-GGUF/resolve/main/Qwen3-VL-2B-Instruct.Q4_K_M.gguf
yzma model get -u https://huggingface.co/mradermacher/Qwen3-VL-2B-Instruct-GGUF/resolve/main/Qwen3-VL-2B-Instruct.mmproj-Q8_0.gguf
export YZMA_BENCHMARK_MMMODEL=~/models/Qwen3-VL-2B-Instruct.Q4_K_M.gguf
export YZMA_BENCHMARK_MMPROJ=~/models/Qwen3-VL-2B-Instruct.mmproj-Q8_0.gguf

See https://github.com/hybridgroup/yzma/blob/main/pkg/mtmd/benchmark_test.go

Linux

CPU

amd64

$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=8192
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: 13th Gen Intel(R) Core(TM) i9-13900HX
BenchmarkMultimodalInference-32                1        47402263232 ns/op               26.16 tokens/s
BenchmarkMultimodalInference-32                1        42673907034 ns/op               26.08 tokens/s
BenchmarkMultimodalInference-32                1        42432080672 ns/op               25.81 tokens/s
BenchmarkMultimodalInference-32                1        46803510445 ns/op               26.15 tokens/s
BenchmarkMultimodalInference-32                1        45700830384 ns/op               25.91 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    226.685s

arm64

Raspberry Pi 4 Model B Rev 1.4 8GB

NOTE: Due to less available memory, the benchmarks on this device used the SmolVLM2-500M-Video-Instruct-Q8_0 model.

yzma model get -u https://huggingface.co/ggml-org/SmolVLM2-500M-Video-Instruct-GGUF/resolve/main/SmolVLM2-500M-Video-Instruct-Q8_0.gguf
yzma model get -u https://huggingface.co/ggml-org/SmolVLM2-500M-Video-Instruct-GGUF/resolve/main/mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf

ron@raspberrypi:~/yzma/pkg/mtmd $ export YZMA_BENCHMARK_MMMODEL=/home/ron/models/SmolVLM2-500M-Video-Instruct-Q8_0.gguf
ron@raspberrypi:~/yzma/pkg/mtmd $ export YZMA_BENCHMARK_MMPROJ=/home/ron/models/mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf
ron@raspberrypi:~/yzma/pkg/mtmd $ go test -benchtime=10s -count=5 -run=nada -bench .
goos: linux
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
BenchmarkMultimodalInference-4                 1        50239133481 ns/op                6.748 tokens/s
BenchmarkMultimodalInference-4                 1        49358181828 ns/op                6.341 tokens/s
BenchmarkMultimodalInference-4                 1        48164506831 ns/op                5.917 tokens/s
BenchmarkMultimodalInference-4                 1        40171997080 ns/op                5.551 tokens/s
BenchmarkMultimodalInference-4                 1        41428165840 ns/op                5.504 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    243.876s

CUDA

amd64

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   38C    P0            590W /  115W |      15MiB /   8188MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="CUDA0"
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: 13th Gen Intel(R) Core(TM) i9-13900HX
BenchmarkMultimodalInference-32               21         921205057 ns/op              1240 tokens/s
BenchmarkMultimodalInference-32               15        1043496530 ns/op              1114 tokens/s
BenchmarkMultimodalInference-32               18         939373857 ns/op              1219 tokens/s
BenchmarkMultimodalInference-32               14        1118362797 ns/op              1047 tokens/s
BenchmarkMultimodalInference-32                8        1336574088 ns/op               900.2 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    82.619s

arm64

Jetson Orin Nano Developer Kit - 8GB

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.5.0                Driver Version: 540.5.0      CUDA Version: 12.6     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Orin (nvgpu)                  N/A  | N/A              N/A |                  N/A |
| N/A   N/A  N/A               N/A /  N/A | Not Supported        |     N/A          N/A |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=16000 -device="CUDA0"
goos: linux
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: ARMv8 Processor rev 1 (v8l)
BenchmarkMultimodalInference-6                 2        7077293280 ns/op               166.9 tokens/s
BenchmarkMultimodalInference-6                 2        8106794026 ns/op               150.8 tokens/s
BenchmarkMultimodalInference-6                 1        10837943077 ns/op              120.7 tokens/s
BenchmarkMultimodalInference-6                 1        12015033493 ns/op              112.1 tokens/s
BenchmarkMultimodalInference-6                 1        10055887615 ns/op              127.6 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    69.733s

ROCM

amd64

amdgpu_top v0.11.2
┌──────────────────────────────────────────────────────────────────────────────┐
│GPU Name                                | PCI Bus        |    VRAM Usage    | │
│SCLK    MCLK    VDDGFX  Power           | GFX% UMC%Media%|     GTT Usage    | │
│GPU/MEM_T  Fan     Throttle_Status                                          | │
│------------------------------------------------------------------------------│
│#0  [AMD Radeon RX 7900 XTX   ](gfx1100)| 0000:86:00.0   |    26/ 24560 MiB | │
│   0MHz   96MHz   49mV   14/303W        |   0%   0%   0% |    15/128884 MiB | │
│ 40C/ 46C     0RPM []                                                       | │
└──────────────────────────────────────────────────────────────────────────────┘
┌┤ Processes ├─────────────────────────────────────────────────────────────────┐
│┌┤ #0  AMD Radeon RX 7900 XTX ├──────────────────────────────────────────────┐│
││ Name            |  PID  |KFD| VRAM | GTT  |CPU |GFX |COMP|DMA |VCNU|       ││
││ kronk           | 411062|   |    0M|    2M|  1%|  0%|  0%|  0%|  0%|       ││
││ amdgpu_top      | 589729|   |    0M|    2M|  1%|  0%|  0%|  0%|  0%|       ││
│└────────────────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────────────────────────────────┘

go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="rocm0"
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: AMD EPYC 7443P 24-Core Processor
BenchmarkMultimodalInference-48    	       9	1182597512 ns/op	       987.1 tokens/s
BenchmarkMultimodalInference-48    	      10	1241401135 ns/op	       961.6 tokens/s
BenchmarkMultimodalInference-48    	       8	1323004757 ns/op	       912.9 tokens/s
BenchmarkMultimodalInference-48    	      12	1241431410 ns/op	       961.5 tokens/s
BenchmarkMultimodalInference-48    	       8	1715075982 ns/op	       755.4 tokens/s
PASS
ok  	github.com/hybridgroup/yzma/pkg/mtmd	63.492s

However, for the same device but with the Vulkan backend:

go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="vulkan0"
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: AMD EPYC 7443P 24-Core Processor
BenchmarkMultimodalInference-48    	       9	1147394253 ns/op	      1053 tokens/s
BenchmarkMultimodalInference-48    	      15	 941516811 ns/op	      1245 tokens/s
BenchmarkMultimodalInference-48    	      13	 924097033 ns/op	      1265 tokens/s
BenchmarkMultimodalInference-48    	      18	1018284301 ns/op	      1179 tokens/s
BenchmarkMultimodalInference-48    	      15	1022548971 ns/op	      1172 tokens/s
PASS
ok  	github.com/hybridgroup/yzma/pkg/mtmd	71.331s

Vulkan

amd64

==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.275

Devices:
========
GPU0:
        apiVersion         = 1.4.318
        driverVersion      = 25.2.8
        vendorID           = 0x8086
        deviceID           = 0xa788
        deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
        deviceName         = Intel(R) Graphics (RPL-S)
        driverID           = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
        driverName         = Intel open-source Mesa driver
        driverInfo         = Mesa 25.2.8-0ubuntu0.24.04.1
        conformanceVersion = 1.4.0.0
        deviceUUID         = 868088a7-0400-0000-0002-000000000000
        driverUUID         = 032fbbbb-ddee-3516-8477-c17071969177
GPU1:
        apiVersion         = 1.4.312
        driverVersion      = 580.95.5.0
        vendorID           = 0x10de
        deviceID           = 0x2860
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = NVIDIA GeForce RTX 4070 Laptop GPU
        driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
        driverName         = NVIDIA
        driverInfo         = 580.95.05
        conformanceVersion = 1.4.1.3
        deviceUUID         = 7e611089-1272-699d-8985-ab84fef4311e
        driverUUID         = b92269a1-b525-5615-ab8a-e2095ee37192

$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="VULKAN0"
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: 13th Gen Intel(R) Core(TM) i9-13900HX
BenchmarkMultimodalInference-32                1        14578268628 ns/op               78.34 tokens/s
BenchmarkMultimodalInference-32                1        22073783877 ns/op               55.59 tokens/s
BenchmarkMultimodalInference-32                1        11278156188 ns/op               97.62 tokens/s
BenchmarkMultimodalInference-32                1        14723860691 ns/op               77.43 tokens/s
BenchmarkMultimodalInference-32                1        11996066619 ns/op               92.45 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    79.922s

$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="VULKAN1"
goos: linux
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: 13th Gen Intel(R) Core(TM) i9-13900HX
BenchmarkMultimodalInference-32                8        1339951138 ns/op               891.1 tokens/s
BenchmarkMultimodalInference-32               10        1172385505 ns/op               997.5 tokens/s
BenchmarkMultimodalInference-32               13        1276183643 ns/op               929.1 tokens/s
BenchmarkMultimodalInference-32               18        1122849292 ns/op              1035 tokens/s
BenchmarkMultimodalInference-32                7        1471154871 ns/op               825.9 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    76.276s

arm64

Jetson Orin Nano Developer Kit - 8GB

ron@ubuntu:~/yzma/pkg/mtmd$ vulkaninfo --summary
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.204
...
Devices:
========
GPU0:
        apiVersion         = 4206843 (1.3.251)
        driverVersion      = 2265006080 (0x87014000)
        vendorID           = 0x10de
        deviceID           = 0x97ba03d7
        deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
        deviceName         = NVIDIA Tegra Orin (nvgpu)
        driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
        driverName         = NVIDIA
        driverInfo         = 540.5.0
        conformanceVersion = 1.3.6.0
        deviceUUID         = 1388f9e0-987e-54a0-908f-6a30d8fd5f29
        driverUUID         = ed5ba772-f592-5949-9d1f-236f7ad81bcc

ron@ubuntu:~/yzma/pkg/mtmd$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=16000 -device="CPU"
goos: linux
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: ARMv8 Processor rev 1 (v8l)
BenchmarkMultimodalInference-6                 1        72233629960 ns/op               15.03 tokens/s
BenchmarkMultimodalInference-6                 1        75555489707 ns/op               15.37 tokens/s
BenchmarkMultimodalInference-6                 1        87238792057 ns/op               14.65 tokens/s
BenchmarkMultimodalInference-6                 1        71406835155 ns/op               15.70 tokens/s
BenchmarkMultimodalInference-6                 1        70659234723 ns/op               15.74 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    383.358s

ron@ubuntu:~/yzma/pkg/mtmd$ go test -benchtime=10s -count=5 -run=nada -bench . -nctx=16000 -device="VULKAN0"
goos: linux
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: ARMv8 Processor rev 1 (v8l)
BenchmarkMultimodalInference-6                 1        13718208893 ns/op               81.13 tokens/s
BenchmarkMultimodalInference-6                 1        16724822437 ns/op               71.39 tokens/s
BenchmarkMultimodalInference-6                 1        13133369170 ns/op               84.14 tokens/s
BenchmarkMultimodalInference-6                 1        13515072899 ns/op               82.43 tokens/s
BenchmarkMultimodalInference-6                 1        12471954537 ns/op               87.24 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    76.766s

macOS

Metal

Apple M4 Max with 128 GB RAM

$ go test -run none -benchtime=10s -count=5 -bench BenchmarkMultimodalInference -nctx=16000
goos: darwin
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: Apple M4 Max
BenchmarkMultimodalInference-16		10		1577948683 ns/op	788.9 tokens/s
BenchmarkMultimodalInference-16		12		1243692014 ns/op	910.8 tokens/s
BenchmarkMultimodalInference-16		 7		1654741804 ns/op	737.2 tokens/s
BenchmarkMultimodalInference-16		 7		1568106947 ns/op	771.9 tokens/s
BenchmarkMultimodalInference-16		10		1704669371 ns/op	706.1 tokens/s
PASS
ok  	github.com/hybridgroup/yzma/pkg/mtmd	76.644s

Windows

CPU

C:\Users\limbo\ron\yzma\pkg\mtmd>go test -benchtime=10s -count=5 -run=nada -bench . -nctx=8192
goos: windows
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: AMD Ryzen 9 7950X 16-Core Processor
BenchmarkMultimodalInference-32                1        26850046400 ns/op               43.17 tokens/s
BenchmarkMultimodalInference-32                1        48420966900 ns/op               35.44 tokens/s
BenchmarkMultimodalInference-32                1        34259612500 ns/op               39.52 tokens/s
BenchmarkMultimodalInference-32                1        24749935100 ns/op               44.44 tokens/s
BenchmarkMultimodalInference-32                1        36232681200 ns/op               38.75 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    171.920s

CUDA

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 581.57                 Driver Version: 581.57         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070      WDDM  |   00000000:01:00.0 Off |                  N/A |
|  0%   42C    P8              6W /  240W |      22MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

C:\Users\limbo\ron\yzma\pkg\mtmd>go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="CUDA0"
goos: windows
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: AMD Ryzen 9 7950X 16-Core Processor
BenchmarkMultimodalInference-32               14         975072514 ns/op              1212 tokens/s
BenchmarkMultimodalInference-32                9        1124768556 ns/op              1080 tokens/s
BenchmarkMultimodalInference-32                9        1138583744 ns/op              1071 tokens/s
BenchmarkMultimodalInference-32               10        1099877300 ns/op              1099 tokens/s
BenchmarkMultimodalInference-32               10        1116220610 ns/op              1086 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    57.908s

Vulkan

==========
VULKANINFO
==========

Vulkan Instance Version: 1.4.309


Devices:
========
GPU0:
        apiVersion         = 1.3.270
        driverVersion      = 2.0.294
        vendorID           = 0x1002
        deviceID           = 0x164e
        deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
        deviceName         = AMD Radeon(TM) Graphics
        driverID           = DRIVER_ID_AMD_PROPRIETARY
        driverName         = AMD proprietary driver
        driverInfo         = 23.40.02 (AMD proprietary shader compiler)
        conformanceVersion = 1.3.3.1
        deviceUUID         = 00000000-0c00-0000-0000-000000000000
        driverUUID         = 414d442d-5749-4e2d-4452-560000000000
GPU1:
        apiVersion         = 1.4.312
        driverVersion      = 581.57.0.0
        vendorID           = 0x10de
        deviceID           = 0x2488
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = NVIDIA GeForce RTX 3070
        driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
        driverName         = NVIDIA
        driverInfo         = 581.57
        conformanceVersion = 1.4.1.3
        deviceUUID         = 91c0b9f4-e340-3c73-1422-c227930ae260
        driverUUID         = 08a6deb5-2838-56d3-b7da-f79802447960

C:\Users\limbo\ron\yzma\pkg\mtmd>go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="VULKAN0"
goos: windows
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: AMD Ryzen 9 7950X 16-Core Processor
BenchmarkMultimodalInference-32                1        14997592100 ns/op               73.08 tokens/s
BenchmarkMultimodalInference-32                1        14469341200 ns/op               76.71 tokens/s
BenchmarkMultimodalInference-32                1        24988773000 ns/op               49.22 tokens/s
BenchmarkMultimodalInference-32                1        24924637400 ns/op               49.35 tokens/s
BenchmarkMultimodalInference-32                1        14559276800 ns/op               76.31 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    96.114s

C:\Users\limbo\ron\yzma\pkg\mtmd>go test -benchtime=10s -count=5 -run=nada -bench . -nctx=32000 -device="VULKAN1"
goos: windows
goarch: amd64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: AMD Ryzen 9 7950X 16-Core Processor
BenchmarkMultimodalInference-32               16         937497038 ns/op              1262 tokens/s
BenchmarkMultimodalInference-32               20        1079753220 ns/op              1126 tokens/s
BenchmarkMultimodalInference-32               19        1003840647 ns/op              1194 tokens/s
BenchmarkMultimodalInference-32                9        1535556511 ns/op               856.7 tokens/s
BenchmarkMultimodalInference-32               12        1018743817 ns/op              1180 tokens/s
PASS
ok      github.com/hybridgroup/yzma/pkg/mtmd    90.525s