MicroLLaVA
August 24, 2025 · View on GitHub
A compact vision-language model that you can pretrain and finetune on a single consumer GPU, such as an NVIDIA RTX 4090 with 24 GB of VRAM.
News and Updates
- 08/23/2025: Released a new model, keeeeenw/MicroLlava-Qwen3-0.6B-base-siglip2-so400m, based on Qwen3-0.6B-base with SigLIP2-so400m. It has ~1B parameters and achieves a 78.5 VQAv2 score, on par with the original LLaVA 1.5 (7B).
- 08/23/2025: Added Qwen3 support to TinyLLaVA_Factory, including:
  - A new chat template for Qwen3 integration
  - Training and evaluation scripts with hyperparameters for a single NVIDIA RTX 4090
  - Various compatibility fixes, such as the transformers upgrade required for the new Qwen3-0.6B-base model
- 08/17/2025: The Hugging Face repo was renamed to https://huggingface.co/keeeeenw/MicroLlava.
- 08/17/2025: Improved the average VQAv2 test-dev score from 44.01% to 56.91% by upgrading the vision tower from SigLIP to SigLIP2.
- 08/09/2025: Initial version of MicroLLaVA released.
Quick Start
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model from Hugging Face
hf_path = 'keeeeenw/MicroLlava'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
# model.cuda()  # Enable CUDA if needed - model runs fairly quickly on CPU

# Setup tokenizer
config = model.config
tokenizer = AutoTokenizer.from_pretrained(
    hf_path,
    use_fast=False,
    model_max_length=config.tokenizer_model_max_length,
    padding_side=config.tokenizer_padding_side
)

# Run inference
prompt = "What are the things I should be cautious about when I visit here?"
image_url = "https://llava-vl.github.io/static/images/view.jpg"
output_text, generation_time = model.chat(
    prompt=prompt,
    image=image_url,
    tokenizer=tokenizer
)

print(f'Model output: {output_text}')
print(f'Generation time: {generation_time}')
```
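Because model.chat takes a plain prompt string, the same call can be reused for several questions about one image. Below is a minimal sketch that only reuses the objects and API shown above; the questions are illustrative, and moving the model to the GPU is optional.

```python
# Minimal sketch reusing the chat API from the Quick Start snippet above.
model.cuda()  # optional: generation also works on CPU, just more slowly

image_url = "https://llava-vl.github.io/static/images/view.jpg"
questions = [
    "Describe this scene in one sentence.",
    "Is there any water visible in the image?",
]

for question in questions:
    # model.chat returns the generated text and the generation time
    answer, elapsed = model.chat(prompt=question, image=image_url, tokenizer=tokenizer)
    print(f"Q: {question}\nA: {answer} (generation time: {elapsed})")
```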
Model Overview
| Component | Details |
|---|---|
| Framework | Transformers + PyTorch |
| Language Model | MicroLlama (~300M parameters) |
| Vision Encoder | SigLIP2-SO400M |
| Training Hardware | Single NVIDIA RTX 4090 |
| Checkpoint Format | SafeTensors |
| License | Apache 2.0 |
Key Features
- Single GPU Training: Train on consumer hardware without DeepSpeed
- Fast Training: Pretraining takes ~5 hours, finetuning ~12 hours on an RTX 4090
- Compact: Only ~300M language model parameters
- Vision-Language Tasks: Visual Question Answering, image captioning
- Easy Iteration: Perfect for research and experimentation
Performance
VQAv2 Evaluation Results (MicroLlama 300M + SigLIP2-so400m-patch4-384)
| Question Type | Accuracy |
|---|---|
| Yes/No | 72.32% |
| Number | 43.89% |
| Other | 46.65% |
| Overall | 56.91% |
Evaluated on VQAv2 test-dev split
(Deprecated) VQAv2 Evaluation Results (MicroLlama 300M + SigLIP-so400m-patch4-384)
| Question Type | Accuracy |
|---|---|
| Yes/No | 65.08% |
| Number | 28.97% |
| Other | 29.32% |
| Overall | 44.01% |
Evaluated on VQAv2 test-dev split
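For context, the scores above use the standard VQAv2 accuracy metric, which gives partial credit based on how many of the ten human annotators agree with the predicted answer. A simplified sketch of the per-question score is below; the official evaluation script additionally normalizes answers and averages over annotator subsets.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQAv2 per-question score: full credit when at least
    3 of the 10 annotators gave the predicted answer, partial credit otherwise."""
    matches = sum(answer == prediction for answer in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators answered "blue" -> partial credit of ~0.67
print(vqa_accuracy("blue", ["blue", "blue"] + ["navy"] * 8))
```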
Planned evaluations include:
- VQAv2 test set (instead of test-dev)
- Additional datasets from the TinyLLaVA evaluation suite

Community contributions with benchmark results are welcome and encouraged.
Training
This model is based on TinyLLaVA Factory with optimizations for single GPU training.
Training Times (RTX 4090)
- Pretraining: ~5 hours on LAION-CC-SBU-558K
- Finetuning: ~12 hours on TinyLLaVA datasets
Key Training Modifications
Pretraining Hyperparameters:
- gradient_accumulation_steps: 2 → 8
- learning_rate: 1e-3 → 2.5e-4
- warmup_ratio: 0.03 → 0.06
- bfloat16: enabled after the SigLIP2 upgrade (improved stability)
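These values are set inside the TinyLLaVA_Factory launch scripts rather than by hand. Purely as an illustration, here is how the modified pretraining settings would map onto standard Hugging Face TrainingArguments; the output directory, batch size, and epoch count below are placeholders, not the repository's exact values.

```python
# Illustration only: real runs go through the TinyLLaVA_Factory training scripts.
# output_dir, per_device_train_batch_size, and num_train_epochs are placeholders.
from transformers import TrainingArguments

pretrain_args = TrainingArguments(
    output_dir="checkpoints/microllava-pretrain",  # placeholder path
    per_device_train_batch_size=32,                # placeholder; sized to fit 24 GB VRAM
    gradient_accumulation_steps=8,                 # raised from 2 to keep a large effective batch on one GPU
    learning_rate=2.5e-4,                          # lowered from 1e-3
    warmup_ratio=0.06,                             # raised from 0.03
    bf16=True,                                     # bfloat16 after the SigLIP2 upgrade (also used for finetuning)
    num_train_epochs=1,                            # placeholder
)
```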
Finetuning:
- Precision: bfloat16 (improved stability)
- Same major hyperparameters as the original TinyLLaVA
Reproduce Training
- Clone the training repository:
```bash
git clone https://github.com/keeeeenw/TinyLLaVA_Factory.git
cd TinyLLaVA_Factory
```
- Follow the training guides in the repository for pretraining and finetuning steps.
Use Cases
Intended Uses
- Research: Vision-language experimentation on limited hardware
- Education: Learning VLM concepts and implementations
- Prototyping: Quick iteration for domain-specific applications
- Finetuning: Starting point for specialized vision-language tasks
Limitations
- Small model size may limit complex reasoning capabilities
- OCR performance may be limited compared to larger models
- Performance varies with image quality and domain
- Minimal safety filtering - implement safeguards for production use
Warning: This model should not be used for safety-critical applications without thorough human review and additional safeguards.
Related Projects
- MicroLlama - The base language model
- TinyLLaVA Factory - Training framework
- SigLIP2 - Vision encoder
Citation
```bibtex
@misc{wang2025microllava,
  title  = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author = {Zixiao Ken Wang},
  year   = {2025},
  url    = {https://huggingface.co/keeeeenw/MicroLlava}
}
```
Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
Areas for Contribution
- Additional evaluation benchmarks
- Performance optimizations
- Documentation improvements
- Example applications
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Acknowledgments
Special thanks to:
- TinyLLaVA Factory team for the training framework
- SigLIP2 authors for the efficient vision encoder
- LAION community for the pretraining datasets
- Hugging Face for model hosting and tools
Star this repository if you find it useful!
For questions and support, please open an issue or check out the Hugging Face model page.