Smol Vision 🐣

October 25, 2025

Recipes for shrinking, optimizing, and customizing cutting-edge vision and multimodal AI models.

Latest examples 👇🏻

Note: GitHub has failed to render notebooks for a long time, so the smol-vision notebooks with rich outputs now live here. I still update this repository, but the notebooks are inconvenient to read on GitHub.

| | Notebook | Description |
|---|---|---|
| Quantization/ONNX | Faster and Smaller Zero-shot Object Detection with Optimum | Quantize the state-of-the-art zero-shot object detection model OWLv2 using Optimum ONNX Runtime tools. |
| VLM Fine-tuning | Fine-tune PaliGemma | Fine-tune the state-of-the-art vision-language backbone PaliGemma using transformers. |
| Intro to Optimum/ORT | Optimizing DETR with 🤗 Optimum | A soft introduction to exporting vision models to ONNX and quantizing them. |
| Model Shrinking | Knowledge Distillation for Computer Vision | Knowledge distillation for image classification. |
| Quantization | Fit in vision models using Quanto | Fit vision models onto smaller hardware using Quanto. |
| Speed-up | Faster foundation models with torch.compile | Improve latency for foundation models using torch.compile. |
| [NEW] VLM Fine-tuning | Fine-tune Florence-2 | Fine-tune Florence-2 on the DocVQA dataset. |
| VLM Fine-tuning | QLoRA/Fine-tune IDEFICS3 or SmolVLM on VQAv2 | QLoRA or full fine-tuning of IDEFICS3 or SmolVLM on the VQAv2 dataset. |
| VLM Fine-tuning (Script) | QLoRA Fine-tune IDEFICS3 on VQAv2 | QLoRA or full fine-tuning of IDEFICS3 or SmolVLM on the VQAv2 dataset. |
| [NEW] VLM Fine-tuning | Grounded Fine-tuning | Grounded fine-tuning for vision-language models. |
| [NEW] Vision Model Fine-tuning | Fine-tune DINOv3 | Fine-tune DINOv3 for vision tasks. |
| Multimodal RAG | Multimodal RAG using ColPali and Qwen2-VL | Learn to retrieve documents and build a RAG pipeline without hefty document processing, using ColPali through Byaldi for retrieval and Qwen2-VL for generation. |
| Multimodal Retriever Fine-tuning | Fine-tune ColPali for Multimodal RAG | Learn to apply contrastive fine-tuning to ColPali to customize it for your own multimodal document RAG use case. |
| Any-to-Any Fine-tuning | Fine-tune Gemma-3n for all modalities (audio-text-image) | Fine-tune the Gemma-3n model to handle any modality: audio, text, and image. |
| Any-to-Any RAG | Any-to-Any (Video) RAG with OmniEmbed and Qwen | Do retrieval and generation across modalities (including video) using OmniEmbed and Qwen. |
| Speed-up/Memory Optimization | Vision language model serving using TGI (SOON) | Explore speed-ups and memory improvements for vision-language model serving with Text Generation Inference. |
| Quantization/Optimum/ORT | All levels of quantization and graph optimizations for Image Segmentation using Optimum (SOON) | End-to-end model optimization using Optimum. |
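
Several of the recipes above rely on quantization to shrink models. As a rough illustration of the idea only, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is not how Optimum, ONNX Runtime, or Quanto implement quantization internally; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical, chosen just for this example.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, with q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize_int8(q, scale):
    """Map int8 codes back to approximate float values."""
    return [scale * v for v in q]


# Hypothetical weights, for illustration only.
weights = [0.82, -1.27, 0.003, 0.45, -0.61]

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Worst-case rounding error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost: 4 bytes per float32 value versus 1 byte per int8 code.
fp32_bytes = array("f", weights).itemsize * len(weights)
int8_bytes = q.itemsize * len(q)
```

The int8 codes take a quarter of the float32 storage, at the cost of a rounding error of at most half a step (`scale / 2`) per weight. Real toolchains typically add calibration, per-channel scales, and quantized kernels on top of this basic mapping; the notebooks cover those details.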