Multimodal
June 5, 2026 ยท View on GitHub
Dependencies
To keep things simple, external tools are used for accessing and processing multimedia data. These tools should be available in OS's searching paths.
-
ffmpeg: v7.0.2 or compatible
ffmpegandffplayare used for video/audio IO. -
ImageMagick: v7.1.1 or compatible
magickis used for image IO.
Input multimedia files
Use --multimedia_file_tags to specify a pair of tags, for example:
--multimedia_file_tags {{ }}
Then an image, audio, or video can be embedded in prompt like this:
{{tag:/path/to/an/image/file}}
where tag is image, audio, or video respectively. For some well known file types (jpg, mp3, etc), tag: can be omitted.
Take Fuyu model as an example:
main -m /path/to/fuyu-8b.bin -p "{{image:path/to/bus.png}}Generate a coco-style caption." --multimedia_file_tags {{ }} -ngl all --max_length 4000