Gemma4.java
April 11, 2026
Fast, zero-dependency inference engine for Gemma 4 in pure Java.
Features
- Single file, no dependencies, based on llama3.java
- Supports all Gemma 4 model families: E2B, E4B, 31B, and 26B-A4B (MoE)
- Fast GGUF format parser
- Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0
- Matrix-vector kernels using Java's Vector API
- CLI with --chat and --prompt modes
- Thinking mode control with --think off|on|inline
- GraalVM Native Image support
- AOT model preloading for instant time-to-first-token
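The Vector API kernels mentioned above boil down to vectorized dot products. A minimal sketch of such an inner loop (class and method names here are illustrative, not Gemma4.java's actual internals):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sketch of a vectorized dot product, the inner loop of a matrix-vector kernel.
public class DotSketch {
    static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        FloatVector acc = FloatVector.zero(S);
        int i = 0;
        int upper = S.loopBound(a.length);
        for (; i < upper; i += S.length()) {
            FloatVector va = FloatVector.fromArray(S, a, i);
            FloatVector vb = FloatVector.fromArray(S, b, i);
            acc = va.fma(vb, acc); // fused multiply-add: acc += va * vb (lanewise)
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) sum += a[i] * b[i]; // scalar tail
        return sum;
    }
}
```

Compile and run such code with --add-modules jdk.incubator.vector, as in the commands below.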
Setup
Download GGUF models from Hugging Face:
| Model | Architecture | GGUF Repository |
|---|---|---|
| E2B | Dense, ~5B total params | unsloth/gemma-4-E2B-it-GGUF |
| E4B | Dense, ~8B total params | unsloth/gemma-4-E4B-it-GGUF |
| 31B | Dense | unsloth/gemma-4-31B-it-GGUF |
| 26B-A4B | Mixture of Experts (MoE) | unsloth/gemma-4-26B-A4B-it-GGUF |
Optional: pure quantizations
Q4_0 files are often mixed-quant in practice (for example, token_embd.weight and output.weight may use Q6_K).
A pure quantization is not required, but can be generated from an F32/F16/BF16 GGUF source with llama-quantize from llama.cpp:
./llama-quantize --pure ./gemma-4-E2B-it-BF16.gguf ./gemma-4-E2B-it-Q4_0.gguf Q4_0
Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
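For reference, a Q4_0 block packs 32 weights into 18 bytes: an F16 scale followed by 16 bytes of 4-bit values (in ggml's layout, low nibbles hold elements 0-15 and high nibbles hold elements 16-31). A sketch of the dequantization, independent of Gemma4.java's actual code:

```java
// Sketch of Q4_0 dequantization (ggml block format): 32 weights per 18-byte block.
public class Q4_0Sketch {

    /** Dequantizes one 18-byte Q4_0 block into 32 floats. */
    static float[] dequantize(byte[] block) {
        // Bytes 0-1: the scale d, stored as a little-endian IEEE 754 half-float.
        short half = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float d = Float.float16ToFloat(half); // Java 20+
        float[] out = new float[32];
        for (int i = 0; i < 16; i++) {
            int q = block[2 + i];
            out[i]      = d * ((q & 0x0F) - 8);         // low nibbles: elements 0..15
            out[i + 16] = d * (((q >>> 4) & 0x0F) - 8); // high nibbles: elements 16..31
        }
        return out;
    }
}
```

The -8 offset is why Q4_0 needs no separate minimum value: the 4-bit range 0..15 is recentered around zero.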
Build and run
Java 21+ is required, in particular for the MemorySegment mmap-ing feature.
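The MemorySegment mmap-ing amounts to mapping the GGUF file into memory via the FFM API (a preview feature in Java 21, finalized in Java 22). A minimal sketch, not Gemma4.java's actual loader:

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: mmap a GGUF file into a MemorySegment for zero-copy access to tensor data.
public class MmapSketch {
    // GGUF is little-endian regardless of host byte order.
    static final ValueLayout.OfInt LE_INT =
            ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

    // Maps the whole file read-only; the mapping lives as long as the arena.
    static MemorySegment mapFile(Path path, Arena arena) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
        }
    }

    // Reads the 4-byte GGUF magic ("GGUF" = 0x46554747 little-endian) at offset 0.
    static int readMagic(Path path) throws IOException {
        try (Arena arena = Arena.ofConfined()) {
            return mapFile(path, arena).get(LE_INT, 0);
        }
    }
}
```

Mapping instead of reading means multi-gigabyte model weights are paged in lazily by the OS and never copied onto the Java heap.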
jbang is a good fit for this use case.
A no-setup one-liner: no git clone, no manual model download required. The ~5GB model is downloaded once, then cached by jbang:
jbang gemma4@mukel \
--model https://hf.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q8_0.gguf \
--system-prompt "like Master Yoda, reply you must" \
--chat
Alternatively:
jbang Gemma4.java --help
jbang Gemma4.java --model ./gemma-4-E2B-it-Q4_0.gguf --chat
jbang Gemma4.java --model ./gemma-4-E2B-it-Q4_0.gguf --prompt "Explain quantum computing like I'm five"
Or run it directly (still via jbang):
chmod +x Gemma4.java
./Gemma4.java --help
Optional: Makefile
A simple Makefile is provided. Run make jar to produce gemma4.jar.
Run the resulting gemma4.jar as follows:
java --enable-preview --add-modules jdk.incubator.vector -jar gemma4.jar --help
GraalVM Native Image
Compile with make native to produce a gemma4 executable, then:
./gemma4 --model ./gemma-4-E2B-it-Q4_0.gguf --chat
AOT model preloading
Gemma4.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).
To AOT pre-load a GGUF model:
PRELOAD_GGUF=/path/to/model.gguf make native
A larger specialized binary is generated with parse overhead removed for that specific model. It can still run other models with the usual parsing overhead.
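The work being skipped is the GGUF header and metadata walk done on every startup. A minimal sketch of the start of such a parse, per the GGUF spec's fixed header (u32 magic, u32 version, u64 tensor count, u64 metadata-KV count, all little-endian); this is illustrative, not Gemma4.java's actual parser:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: decode the fixed-size GGUF header that precedes metadata and tensor info.
public class GgufHeaderSketch {
    record Header(int version, long tensorCount, long metadataKvCount) {}

    static Header parse(byte[] prefix) {
        ByteBuffer buf = ByteBuffer.wrap(prefix).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != 0x46554747) // the ASCII bytes "GGUF"
            throw new IllegalArgumentException("not a GGUF file");
        return new Header(buf.getInt(), buf.getLong(), buf.getLong());
    }
}
```

After this header come the metadata key-value pairs and per-tensor descriptors; walking those for a large model is the overhead that AOT preloading bakes away at image build time.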
Benchmarks
Hardware specs: AMD Ryzen 9950X (16C/32T), 64GB RAM (6400), Linux 6.18.12.
GraalVM 25+ is recommended for the absolute best performance (JIT mode); it provides partial but good support for the Vector API, including in Native Image.
By default, the platform's "preferred" vector size is used; it can be forced with -Dllama.VectorBitSize=0|128|256|512, where 0 disables the Vector API.
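Such a flag maps naturally onto choosing a VectorSpecies. A hedged sketch of how it can be interpreted (not necessarily Gemma4.java's exact logic; the property name matches the flag above):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

// Sketch: pick a FloatVector species from -Dllama.VectorBitSize
// (0 = disabled, unset/other = platform-preferred size).
public class SpeciesSketch {
    static VectorSpecies<Float> pick(int bitSize) {
        return switch (bitSize) {
            case 0 -> null; // Vector API disabled; fall back to scalar kernels
            case 128 -> FloatVector.SPECIES_128;
            case 256 -> FloatVector.SPECIES_256;
            case 512 -> FloatVector.SPECIES_512;
            default -> FloatVector.SPECIES_PREFERRED;
        };
    }

    static VectorSpecies<Float> fromSystemProperty() {
        return pick(Integer.getInteger("llama.VectorBitSize", -1));
    }
}
```

Forcing a smaller size than the hardware supports can be useful for benchmarking; forcing a larger one than the CPU provides makes the JIT fall back to slow scalar emulation.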
Related Repositories
License
Apache 2.0