Granite 4.0 Vision
April 1, 2026 ยท View on GitHub
IBM's Granite 4.0 Vision model with DeepStack multi-layer feature injection and WindowQFormer projectors. Combines a SigLIP vision encoder with a GraniteMoeHybrid language model.
Model
| Model ID | ibm-granite/granite-4.0-3b-vision |
| Architecture | GraniteMoeHybrid LM + SigLIP (vision) + WindowQFormer projectors + DeepStack |
| Parameters | ~3B |
| Vision Encoder | SigLIP (27 layers, 1152 hidden, 384px, patch 16) |
| Projectors | 4 DeepStack + 4 Spatial WindowQFormerDownsampler modules |
| Tasks | Document understanding, VQA, image description, table/chart analysis |
CLI Usage
python -m mlx_vlm.generate \
--model ibm-granite/granite-4.0-3b-vision \
--image path/to/image.jpg \
--prompt "Describe this image in detail."
Python Usage
from mlx_vlm import load, generate
model, processor = load("ibm-granite/granite-4.0-3b-vision")
image = "path/to/image.jpg"
messages = [
{"role": "user", "content": [
{"type": "image"},
{"type": "text", "text": "Describe this image."}
]}
]
prompt = processor.tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
output = generate(model, processor, image=image, prompt=prompt, max_tokens=512)
print(output)
Architecture
- Vision: SigLIP encoder (27 layers, patch size 16)
- Language: GraniteMoeHybrid with MUP multipliers and fused gate+up MLP
- DeepStack: Vision features from layers [-19, -13, -7, -1] injected at LLM layers [9, 6, 3, 0]
- Spatial Sampling: 4 offset groups (TL/TR/BL/BR) from last vision layer โ LLM layers [12, 15, 18, 21]
- WindowQFormer: Windowed cross-attention with area interpolation downsampling (4/8 rate)
- LoRA: Adapter weights auto-merged during loading (r=256, alpha=256)