Gemma 4

April 3, 2026 · View on GitHub

Gemma 4 is a family of multimodal models from Google supporting text, image, audio, and thinking (chain-of-thought reasoning). This unified module handles all Gemma 4 variants — from lightweight 2B to dense 31B and MoE 26B.

Capabilities:

  • Text generation — single and multi-turn conversations
  • Thinking mode — chain-of-thought reasoning with --enable-thinking
  • Image understanding — single and multi-image analysis
  • Audio understanding — speech transcription and audio analysis (2B/4B models)
  • Mixture of Experts — sparse MoE with SwitchGLU and gather_mm (26B-A4B)

Models

ModelTypeParamsMemoryVisionAudioK-eq-V
google/gemma-4-e2b-itDense2B~5 GBYesYesNo
google/gemma-4-e4b-itDense4B~16 GBYesYesNo
google/gemma-4-26b-a4b-itMoE26B (4B active)~52 GBYesNoYes
google/gemma-4-31b-itDense31B~63 GBYesNoYes

K-eq-V: Full attention layers reuse key projections as values (no separate v_proj), reducing parameters and memory while using num_global_key_value_heads for the KV dimension.

Install

pip install -U mlx-vlm

CLI

Text generation

python -m mlx_vlm.generate \
  --model google/gemma-4-e4b-it \
  --prompt "What is the capital of France?" \
  --max-tokens 500 \
  --temperature 1.0 --top-p 0.95 --top-k 64

Image understanding

python -m mlx_vlm.generate \
  --model google/gemma-4-e4b-it \
  --image path/to/image.jpg \
  --prompt "Describe this image." \
  --max-tokens 500 \
  --temperature 1.0 --top-p 0.95 --top-k 64

Audio understanding (2B/4B only)

python -m mlx_vlm.generate \
  --model google/gemma-4-e4b-it \
  --audio path/to/audio.wav \
  --prompt "Transcribe this audio" \
  --max-tokens 500 \
  --temperature 1.0 --top-p 0.95 --top-k 64

Thinking mode (chain-of-thought)

python -m mlx_vlm.generate \
  --model google/gemma-4-e4b-it \
  --prompt "I want to do a car wash that is 50 meters away, should I walk or drive?" \
  --enable-thinking \
  --max-tokens 2000 \
  --temperature 1.0 --top-p 0.95 --top-k 64

Image + thinking

python -m mlx_vlm.generate \
  --model google/gemma-4-31b-it \
  --image path/to/image.jpg \
  --prompt "Describe this image." \
  --enable-thinking \
  --max-tokens 2000 \
  --temperature 1.0 --top-p 0.95 --top-k 64

Thinking with budget

python -m mlx_vlm.generate \
  --model google/gemma-4-e4b-it \
  --prompt "Explain quantum entanglement." \
  --enable-thinking \
  --thinking-budget 512 \
  --max-tokens 2000 \
  --temperature 1.0 --top-p 0.95 --top-k 64

Python

Basic text generation

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("google/gemma-4-e4b-it")

prompt = apply_chat_template(
    processor, model.config, "Write a poem about the ocean."
)

result = generate(
    model=model,
    processor=processor,
    prompt=prompt,
    max_tokens=500,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)
print(result)

Image understanding

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("google/gemma-4-31b-it")

image = ["path/to/image.jpg"]
prompt = apply_chat_template(
    processor, model.config, "Describe this image.",
    num_images=1,
)

result = generate(
    model=model,
    processor=processor,
    prompt=prompt,
    image=image,
    max_tokens=500,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)
print(result)

Audio understanding

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("google/gemma-4-e4b-it")

prompt = apply_chat_template(
    processor, model.config, "Transcribe this audio",
    num_audios=1,
)

result = generate(
    model=model,
    processor=processor,
    prompt=prompt,
    audio=["path/to/audio.wav"],
    max_tokens=500,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)
print(result)

Thinking mode

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("google/gemma-4-e4b-it")

prompt = apply_chat_template(
    processor, model.config,
    "Explain quantum entanglement in simple terms.",
    chat_template_kwargs={"enable_thinking": True},
)

result = generate(
    model=model,
    processor=processor,
    prompt=prompt,
    max_tokens=2000,
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)
print(result)

Architecture

All Gemma 4 variants share a common architecture with conditional features:

  • Sliding + full attention — pattern of 5 sliding window layers + 1 full attention layer
  • KV sharing — later layers reuse K/V from earlier layers (2B/4B models)
  • K-eq-V attention — full attention layers use key states as values (26B/31B models)
  • Per-layer inputs — additional per-layer token embeddings (2B/4B models)
  • MoE — 128 experts with top-8 routing using gather_mm (26B model only)
  • SigLIP2 vision encoder — shared across all variants
  • Conformer audio encoder — 12 conformer blocks (2B/4B models only)

Notes

  • Thinking mode uses <|channel>...<channel|> delimiters to separate reasoning from the final answer.
  • Audio uses a Conformer-based encoder with 128-bin mel spectrogram input (16kHz). Audio tokens are dynamically expanded based on duration (~40ms per token, up to 750 tokens).
  • Supported audio formats: WAV, MP3, FLAC, and other formats handled by soundfile.
  • The 26B MoE model uses sparse expert routing via mlx_lm.SwitchGLU with mx.gather_mm — only the top-8 experts are computed per token.
  • The 31B dense model is bandwidth-bound at ~5 tok/s on 300 GB/s systems (62.5 GB / 300 GB/s).