🍳 MiniCPM-V & o Cookbook

May 8, 2026 · View on GitHub

🏠 Main Repository | 📚 Full Documentation

Cook up amazing multimodal AI applications effortlessly with MiniCPM-o, bringing vision, speech, and live-streaming capabilities right to your fingertips! For version-specific deployment instructions, see the files in the deployment directory.

✨ What Makes Our Recipes Special?

Easy Usage Documentation

Our comprehensive documentation website presents every recipe in a clear, well-organized manner. All features are displayed at a glance, making it easy for you to quickly find exactly what you need.

Broad User Spectrum

We support a wide range of users, from individuals to enterprises and researchers.

  • Individuals: Enjoy effortless inference using Ollama and Llama.cpp with minimal setup (see the quick sketch after this list).
  • Enterprises: Achieve high-throughput, scalable performance with vLLM and SGLang.
  • Researchers: Leverage advanced frameworks including Transformers, LLaMA-Factory, SWIFT, and Align-anything to enable flexible model development and cutting-edge experimentation.
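
For individual users, the Ollama path can be as simple as the sketch below. It uses the `ollama` Python client and assumes the Ollama server is running with a MiniCPM-V model already pulled; the "minicpm-v" model tag and the image path are illustrative, so check `ollama list` for the tag you actually have.

```python
# Single-image QA through the Ollama Python client (pip install ollama).
# Assumes `ollama serve` is running and a MiniCPM-V model has been pulled;
# the "minicpm-v" tag and image path below are illustrative.
import ollama

response = ollama.chat(
    model="minicpm-v",
    messages=[{
        "role": "user",
        "content": "What is shown in this picture?",
        "images": ["./example.jpg"],  # local image file
    }],
)
print(response["message"]["content"])
```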

Versatile Deployment Scenarios

Our ecosystem delivers optimal solutions for a variety of hardware environments and deployment demands.

  • Web demo: Launch an interactive multimodal AI web demo with FastAPI (a minimal sketch follows this list).
  • Quantized deployment: Maximize efficiency and minimize resource consumption using GGUF, BNB, and AWQ.
  • Edge devices: Bring powerful AI experiences to iPhone and iPad, supporting offline and privacy-sensitive applications.
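
To give a flavor of the web-demo path, here is a minimal FastAPI sketch that puts an image-QA endpoint in front of a model. It is not the cookbook's actual demo: `answer_image_question` is a hypothetical stand-in for your model call (for example, the Transformers snippet later in this README).

```python
# Minimal image-QA endpoint with FastAPI; run with:
#   uvicorn app:app --host 0.0.0.0 --port 8000
import io

from fastapi import FastAPI, File, Form, UploadFile
from PIL import Image

app = FastAPI()

def answer_image_question(image: Image.Image, question: str) -> str:
    # Hypothetical stand-in -- wire up MiniCPM-V / MiniCPM-o inference here.
    raise NotImplementedError

@app.post("/qa")
async def qa(image: UploadFile = File(...), question: str = Form(...)):
    pil_image = Image.open(io.BytesIO(await image.read())).convert("RGB")
    return {"answer": answer_image_question(pil_image, question)}
```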

⭐️ Live Demonstrations

Explore real-world examples of MiniCPM-V deployed on edge devices using our curated recipes. These demos highlight the model's high efficiency and robust performance in practical scenarios.


  • Run MiniCPM-V locally on iPad with the iOS demo, where it observes and describes the process of drawing a rabbit.

🔥 Inference Recipes

Ready-to-run examples

| Recipe | Description |
| --- | --- |
| **Vision Capabilities (MiniCPM-V 4.6)** | |
| 🖼️ Single-image QA | Question answering on a single image |
| 🧩 Multi-image QA | Question answering with multiple images |
| 🎬 Video QA | Video-based question answering |
| 📄 Document Parser | Parse and extract content from PDFs and webpages |
| 📝 Text Recognition | Reliable OCR for photos and screenshots |
| 🎯 Grounding | Visual grounding and object localization in images |
| **Audio Capabilities (MiniCPM-o)** | |
| 🎤 Speech-to-Text | Multilingual speech recognition |
| 🗣️ Text-to-Speech | Instruction-following speech synthesis |
| 🎭 Voice Cloning | Realistic voice cloning and role-play |
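
As a taste of the single-image QA recipe, here is a minimal Transformers sketch. It follows the chat-style interface published on earlier MiniCPM-V model cards; the checkpoint ID and the exact `chat` signature for 4.6 may differ, so treat it as illustrative and defer to the recipe itself.

```python
# Single-image QA with Hugging Face Transformers (illustrative checkpoint ID;
# the chat interface follows earlier MiniCPM-V model cards).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-4_5"
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("./example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "What is in this picture?"]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```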

πŸ‹οΈ Fine-tuning Recipes

Customize your model with your own ingredients

Data preparation

Follow the data preparation guide to set up your training datasets.
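
As a rough orientation, fine-tuning data is typically organized as conversation-style records like the hypothetical sample below; the exact schema (field names, image placeholders) depends on the framework you pick, so follow the guide rather than this sketch.

```python
# A hypothetical training sample in a common conversation-style layout.
# Field names and the "<image>" placeholder are illustrative; the data
# preparation guide defines the exact schema for each framework.
sample = {
    "id": "0",
    "image": "images/receipt_001.jpg",
    "conversations": [
        {"role": "user", "content": "<image>\nWhat is the total on this receipt?"},
        {"role": "assistant", "content": "The total amount is $42.17."},
    ],
}
```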

Training

We provide training methods serving different needs, as follows:

| Framework | Description |
| --- | --- |
| Transformers | Most flexible for customization |
| LLaMA-Factory | Modular fine-tuning toolkit |
| SWIFT | Lightweight and fast parameter-efficient tuning |
| Align-anything | Visual instruction alignment for multimodal models |

📦 Serving Recipes

Deploy your model efficiently

| Method | Description |
| --- | --- |
| vLLM | High-throughput GPU inference |
| SGLang | High-throughput GPU inference |
| Llama.cpp | Fast CPU inference on PC, iPhone, and iPad |
| Ollama | User-friendly setup |
| OpenWebUI | Interactive web demo with Open WebUI |
| Gradio | Interactive web demo with Gradio |
| FastAPI | Interactive omni-streaming demo with FastAPI |
| iOS | Interactive iOS demo with llama.cpp |
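
Once a vLLM (or SGLang) server is up, any OpenAI-compatible client can talk to it. The sketch below assumes a server started along the lines of `vllm serve openbmb/MiniCPM-V-4_5 --trust-remote-code`; the model ID, server URL, and image URL are illustrative.

```python
# Querying a vLLM OpenAI-compatible server (illustrative model ID and URLs).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4_5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
)
print(completion.choices[0].message.content)
```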

🥄 Quantization Recipes

Compress your model to improve efficiency

| Format | Key Feature |
| --- | --- |
| GGUF | Simplest and most portable format |
| BNB | Simple and easy-to-use quantization method |
| AWQ | High-performance quantization for efficient inference |
| GPTQ | Weight-only INT4 with vLLM-compatible packaging (v4.5 only; Qwen3 backbone) |
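
For BNB, quantization happens at load time through Transformers, as in the sketch below. The checkpoint ID is illustrative, and remote-code models may need additional arguments, so check the BNB recipe for the exact call.

```python
# Loading a MiniCPM-V checkpoint in 4-bit with bitsandbytes (BNB).
# Illustrative checkpoint ID; see the BNB recipe for the exact invocation.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

model_id = "openbmb/MiniCPM-V-4_5"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```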

Framework Support Matrix

The latest release is MiniCPM-V 4.6 (Instruct + Thinking). The matrix below tracks v4.6 first, with v4.5 / o4.5 rows kept for reference.

| Model | Category | Framework | Cookbook Link | Upstream PR | Supported since (branch) | Supported since (release) |
| --- | --- | --- | --- | --- | --- | --- |
| MiniCPM-V 4.6 (latest) | Inference | Transformers | Transformers Doc | huggingface/transformers (merged) | main | v5.7.0 |
| | Edge (On-device) | Llama.cpp | Llama.cpp Doc | #22529 (2026-05-06) | master (2026-05-06) | b9049 |
| | | Ollama | Ollama Doc | Merging | Merging | Waiting for official release |
| | Serving (Cloud) | vLLM | vLLM Doc | #41254 (2026-04-29) | Merging | Waiting for official release |
| | | SGLang | SGLang Doc | Merging | Merging | Waiting for official release |
| MiniCPM-o 4.5 | Edge (On-device) | Llama.cpp | Llama.cpp Doc | #19211 (2026-01-30) | master (2026-01-30) | b7895 |
| | | Ollama | Ollama Doc | #12078 (2025-08-26) | Merging | Waiting for official release |
| | Serving (Cloud) | vLLM | vLLM Doc | #33431 (2026-01-30) | Merging | Waiting for official release |
| | | SGLang | SGLang Doc | #9610 (2025-08-26) | Merging | Waiting for official release |
| Cross-version | Finetuning | LLaMA-Factory | LLaMA-Factory Doc | #9022 (2025-08-26) | main (2025-08-26) | Waiting for official release |
| | Quantization | GGUF | GGUF Doc | — | — | — |
| | | BNB | BNB Doc | — | — | — |
| | | AWQ | AWQ Doc | tc-mb/AutoAWQ | — | — |
| | | GPTQ | GPTQ Doc | openbmb/MiniCPM-V-4_5-GPTQ | — | — |
| | Demos | Gradio Demo | Gradio Demo Doc | — | — | — |

If you'd like us to prioritize support for another open-source framework, please let us know via this short form.

Awesome Works using MiniCPM-V & o

  • text-extract-api: Document extraction API using OCR and Ollama-supported models
  • comfyui_LLM_party: Build LLM workflows and integrate them into existing image workflows
  • Ollama-OCR: OCR package that uses VLMs through Ollama to extract text from images and PDFs
  • comfyui-mixlab-nodes: ComfyUI node suite supporting Workflow-to-APP, GPT & 3D, and more
  • OpenAvatarChat: Interactive digital-human conversation implementation on a single PC
  • pensieve: A privacy-focused passive recording project that captures screen content
  • paperless-gpt: Use LLMs to handle paperless-ngx with AI-powered titles, tags, and OCR
  • Neuro: A recreation of Neuro-Sama, running on local models on consumer hardware

👥 Community

Contributing

We love new recipes! Please share your creative dishes:

  1. Fork the repository
  2. Create your recipe
  3. Submit a pull request

Issues & Support

Found a bug or have a question? Please open an issue in the repository.

Institutions

This cookbook is developed by OpenBMB and OpenSQZ.

📜 License

This cookbook is served under the Apache-2.0 License: cook freely, share generously! 🍳

Citation

If you find our model/code/paper helpful, please consider citing our papers 📝 and giving us a star ⭐️!

@misc{yu2025minicpmv45cookingefficient,
      title={MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe}, 
      author={Tianyu Yu and Zefan Wang and Chongyi Wang and Fuwei Huang and Wenshuo Ma and Zhihui He and Tianchi Cai and Weize Chen and Yuxiang Huang and Yuanqian Zhao and Bokai Xu and Junbo Cui and Yingjing Xu and Liqing Ruan and Luoyuan Zhang and Hanyu Liu and Jingkun Tang and Hongyuan Liu and Qining Guo and Wenhao Hu and Bingxiang He and Jie Zhou and Jie Cai and Ji Qi and Zonghao Guo and Chi Chen and Guoyang Zeng and Yuxuan Li and Ganqu Cui and Ning Ding and Xu Han and Yuan Yao and Zhiyuan Liu and Maosong Sun},
      year={2025},
      eprint={2509.18154},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.18154}, 
}

@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={Nature Communications},
  volume={16},
  pages={5509},
  year={2025}
}