Phi-4 Multimodal (phi4mm)
March 6, 2026 · View on GitHub
Phi-4 Multimodal is a tri-modal model supporting text, image, and audio understanding.
Architecture
| Component | Details |
|---|---|
| Language model | Phi-4 (32 layers, 3072 hidden, 24 heads, 8 KV heads) |
| Vision encoder | SigLIP-2 (27 layers, 1152 hidden, 16 heads) |
| Audio encoder | Cascades Conformer (24 blocks, 1024 dim, 16 heads) |
| Vision projector | 2-layer MLP (1152 → 3072 → 3072, GELU) |
| Audio projector | 2-layer MLP with speech/vision modes |
LoRA switching
The original checkpoint ships with two LoRA adapters applied to the LLM backbone:
- Vision LoRA (r=256, alpha=512) — merged at load time by default.
- Speech LoRA (r=320, alpha=640) — stored for runtime switching.
set_modality() automatically selects the correct LoRA (or both) based on the input types.
Model
- Hugging Face ID:
microsoft/Phi-4-multimodal-instruct - Remote processor code is ported in-tree, so
--trust-remote-codeis optional.
CLI
Text understanding
mlx_vlm.generate \
--model microsoft/Phi-4-multimodal-instruct \
--prompt "Explain the theory of relativity in simple terms." \
--max-tokens 256
Image understanding
mlx_vlm.generate \
--model microsoft/Phi-4-multimodal-instruct \
--image /path/to/image.jpg \
--prompt "Describe this image." \
--max-tokens 256
Audio understanding
mlx_vlm.generate \
--model microsoft/Phi-4-multimodal-instruct \
--audio /path/to/audio.wav \
--prompt "" \
--max-tokens 256
Multi-modal (image + audio)
mlx_vlm.generate \
--model microsoft/Phi-4-multimodal-instruct \
--image /path/to/image.jpg \
--audio /path/to/audio.wav \
--prompt "" \
--max-tokens 256
Python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("microsoft/Phi-4-multimodal-instruct")
image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = "What animals are in the image?"
formatted_prompt = apply_chat_template(
processor,
model.config,
prompt,
num_images=len(image),
num_audios=len(audio),
)
result = generate(
model=model,
processor=processor,
prompt=formatted_prompt,
image=image,
audio=audio,
max_tokens=256,
temperature=0.0,
)
print(result.text)
Quantization
mlx_vlm.convert \
--model microsoft/Phi-4-multimodal-instruct \
-q \
--mlx-path Phi-4-multimodal-instruct-4bit
During quantization the model pre-merges both LoRA adapters into the LLM weights and quantizes only the language model. Vision encoder, audio encoder, and projectors are kept in bfloat16.
After quantization, LoRA switching is disabled (not needed since both adapters are baked in).
Notes
- Audio input is a 16 kHz mono waveform; the processor handles resampling automatically.
- The
<|image_1|>/<|audio_1|>placeholders are inserted byapply_chat_template.