MicroLLaVA

August 24, 2025 · View on GitHub


A compact vision-language model that you can pretrain and finetune on a single consumer GPU, such as an NVIDIA RTX 4090 with 24 GB of VRAM.

📰 News and Updates

  • 08/23/2025: Created a new model based on Qwen3-0.6B-base with SigLIP2-so400m 👉 keeeeenw/MicroLlava-Qwen3-0.6B-base-siglip2-so400m. This model has ~1B parameters and achieves a VQAv2 score of 78.5, on par with the original LLaVA 1.5 (7B).
  • 08/23/2025: Added Qwen3 support to TinyLLaVA_Factory, including:
    • A new chat template for Qwen3 integration
    • Training and evaluation scripts with hyperparameters for a single NVIDIA RTX 4090
    • Various compatibility fixes, such as the transformers upgrade required for the new Qwen3-0.6B-base model
  • 08/17/2025: Renamed the Hugging Face repo to https://huggingface.co/keeeeenw/MicroLlava.
  • 08/17/2025: Improved the VQAv2 test-dev score from 44.01% to 56.91% by upgrading the vision tower from SigLIP to SigLIP2.
  • 08/09/2025: Initial version of MicroLLaVA released.

🚀 Quick Start

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model from Hugging Face
hf_path = 'keeeeenw/MicroLlava'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
# model.cuda()  # Enable CUDA if needed - model runs fairly quickly on CPU

# Setup tokenizer
config = model.config
tokenizer = AutoTokenizer.from_pretrained(
    hf_path, 
    use_fast=False, 
    model_max_length=config.tokenizer_model_max_length,
    padding_side=config.tokenizer_padding_side
)

# Run inference
prompt = "What are the things I should be cautious about when I visit here?"
image_url = "https://llava-vl.github.io/static/images/view.jpg"

output_text, generation_time = model.chat(
    prompt=prompt,
    image=image_url,
    tokenizer=tokenizer
)

print(f'Model output: {output_text}')
print(f'Generation time: {generation_time}')
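
To try the newer Qwen3-based checkpoint from the news section, swap the model path in the same snippet. This is a minimal sketch that assumes the Qwen3 repo exposes the same remote-code interface and chat() helper as keeeeenw/MicroLlava; check its model card for the exact usage.

# Sketch: loading the ~1B-parameter Qwen3-0.6B-base + SigLIP2 variant.
# Assumption: same remote-code interface and chat() helper as keeeeenw/MicroLlava.
from transformers import AutoTokenizer, AutoModelForCausalLM

qwen_path = 'keeeeenw/MicroLlava-Qwen3-0.6B-base-siglip2-so400m'
model = AutoModelForCausalLM.from_pretrained(qwen_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    qwen_path,
    use_fast=False,
    model_max_length=model.config.tokenizer_model_max_length,
    padding_side=model.config.tokenizer_padding_side
)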

📋 Model Overview

Component         | Details
------------------|------------------------------
Framework         | Transformers + PyTorch
Language Model    | MicroLlama (~300M parameters)
Vision Encoder    | SigLIP2-SO400M
Training Hardware | Single NVIDIA RTX 4090
Checkpoint Format | SafeTensors
License           | Apache 2.0
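
As a quick sanity check of the component sizes above, you can inspect parameter counts after loading the model. A small sketch follows; submodule names depend on the remote-code implementation, so it lists the top-level children rather than assuming specific attribute names.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('keeeeenw/MicroLlava', trust_remote_code=True)

# Total parameter count across language model, vision tower, and connector
total = sum(p.numel() for p in model.parameters())
print(f'Total parameters: {total / 1e6:.1f}M')

# Per-component breakdown without guessing attribute names
for name, module in model.named_children():
    count = sum(p.numel() for p in module.parameters())
    print(f'{name}: {count / 1e6:.1f}M parameters')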

🎯 Key Features

  • 🔧 Single GPU Training: Train on consumer hardware without DeepSpeed
  • ⚡ Fast Training: Pretraining takes ~5 hours and finetuning ~12 hours on an RTX 4090
  • 📦 Compact: Only ~300M language model parameters
  • 🎨 Vision-Language Tasks: Visual question answering and image captioning
  • 🔄 Easy Iteration: Well suited for research and experimentation

๐Ÿ† Performance

VQAv2 Evaluation Results (MicroLlama 300M + SigLIP2-so400m-patch14-384)

Question Type | Accuracy
--------------|---------
Yes/No        | 72.32%
Number        | 43.89%
Other         | 46.65%
Overall       | 56.91%

Evaluated on VQAv2 test-dev split

(Deprecated) VQAv2 Evaluation Results (MicroLlama 300M + SigLIP-so400m-patch14-384)

Question Type | Accuracy
--------------|---------
Yes/No        | 65.08%
Number        | 28.97%
Other         | 29.32%
Overall       | 44.01%

Evaluated on VQAv2 test-dev split
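
For reference, VQAv2 accuracy is not plain exact match: each question has 10 human answers, and a predicted answer scores min(#matching human answers / 3, 1), averaged over annotator subsets. A simplified sketch of that scoring rule (omitting the official answer normalization and the leave-one-annotator-out averaging):

def vqa_accuracy(prediction, human_answers):
    """Simplified VQAv2 accuracy: min(#humans giving this answer / 3, 1).

    The official metric also normalizes answers (punctuation, articles,
    number words) and averages over subsets of the 10 annotators; this
    sketch omits those steps.
    """
    matches = sum(ans.strip().lower() == prediction.strip().lower()
                  for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "2", so predicting "2" scores 1.0
print(vqa_accuracy("2", ["2", "2", "two", "2", "3", "2", "3", "two", "3", "3"]))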

Planned evaluations include:

  1. The VQAv2 test set (instead of test-dev)
  2. Additional datasets from the TinyLLaVA evaluation suite

Community contributions with benchmark results are welcome and encouraged.

🛠️ Training

This model is based on TinyLLaVA Factory with optimizations for single GPU training.

Training Times (RTX 4090)

  • Pretraining: ~5 hours on LAION-CC-SBU-558K
  • Finetuning: ~12 hours on TinyLLaVA datasets

Key Training Modifications

Pretraining Hyperparameters (see the sketch after this list):

  • gradient_accumulation_steps: 2 → 8
  • learning_rate: 1e-3 → 2.5e-4
  • warmup_ratio: 0.03 → 0.06
  • bfloat16: enabled after the SigLIP2 upgrade (improved stability)
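
As a rough illustration, these overrides map onto Hugging Face TrainingArguments as shown below. This is a hypothetical sketch; the actual TinyLLaVA_Factory launch scripts set these values through their own CLI flags and defaults.

from transformers import TrainingArguments

# Hypothetical sketch of the pretraining overrides above; the real
# TinyLLaVA_Factory scripts configure these via their own launch flags.
pretrain_args = TrainingArguments(
    output_dir="checkpoints/microllava-pretrain",  # hypothetical path
    per_device_train_batch_size=32,                # assumption, not stated in this document
    gradient_accumulation_steps=8,                 # 2 -> 8
    learning_rate=2.5e-4,                          # 1e-3 -> 2.5e-4
    warmup_ratio=0.06,                             # 0.03 -> 0.06
    bf16=True,                                     # enabled after the SigLIP2 upgrade
)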

Finetuning:

  • Precision: bfloat16 (improved stability)
  • Same major hyperparameters as the original TinyLLaVA

Reproduce Training

  1. Clone the training repository:

     git clone https://github.com/keeeeenw/TinyLLaVA_Factory.git
     cd TinyLLaVA_Factory

  2. Follow the training guides in the repository for pretraining and finetuning steps.

🎯 Use Cases

✅ Intended Uses

  • Research: Vision-language experimentation on limited hardware
  • Education: Learning VLM concepts and implementations
  • Prototyping: Quick iteration for domain-specific applications
  • Finetuning: Starting point for specialized vision-language tasks

⚠️ Limitations

  • Small model size may limit complex reasoning capabilities
  • OCR performance may be limited compared to larger models
  • Performance varies with image quality and domain
  • Minimal safety filtering - implement safeguards for production use

Warning: This model should not be used for safety-critical applications without thorough human review and additional safeguards.

๐Ÿ“ Citation

@misc{wang2025microllava,
  title        = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author       = {Zixiao Ken Wang},
  year         = {2025},
  url          = {https://huggingface.co/keeeeenw/MicroLlava}
}

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Areas for Contribution

  • Additional evaluation benchmarks
  • Performance optimizations
  • Documentation improvements
  • Example applications

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

Special thanks to:

  • TinyLLaVA Factory team for the training framework
  • SigLIP2 authors for the efficient vision encoder
  • LAION community for the pretraining datasets
  • Hugging Face for model hosting and tools

⭐ Star this repository if you find it useful! ⭐

For questions and support, please open an issue or check out the Hugging Face model page.