Llama3.java
April 10, 2026
Practical Llama 3, 3.1 and 3.2 inference implemented in a single Java file.
This project is the successor of llama2.java, which was based on llama2.c by Andrej Karpathy and his excellent educational videos.
Besides its educational value, this project is also used to test and tune compiler optimizations and features on the JVM, particularly for the Graal compiler.
Features
- Single file, no dependencies
- GGUF format parser
- Llama 3+ tokenizer based on minbpe
- Llama 3+ inference with Grouped-Query Attention
- Support for Llama 3.1 (ad-hoc RoPE scaling) and 3.2 (tied word embeddings)
- Support for F16, BF16, F32 weights + Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0 quantizations
- Fast matrix-vector multiplication routines using Java's Vector API (see the sketch after this list)
- Simple CLI with --chat and --instruct modes
- GraalVM's Native Image support
- AOT model pre-loading for instant time-to-first-token
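The matrix-vector routines boil down to vectorized dot products between a weight row and the activation vector. As a rough sketch (illustrative only, not the project's actual code), a dot-product kernel over the platform's preferred species looks like this:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class VectorDot {
    // Preferred species: the widest vector shape the CPU supports (e.g. 256-bit AVX2 -> 8 floats).
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Dot product of the first n elements of a and b.
    static float dot(float[] a, float[] b, int n) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        for (; i < SPECIES.loopBound(n); i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // lane-wise fused multiply-add
        }
        float sum = acc.reduceLanes(VectorOperators.ADD); // horizontal sum
        for (; i < n; i++) { // scalar tail
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Compile and run with --add-modules jdk.incubator.vector, as in the commands further below.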
[Demo: interactive --chat mode in action]
[Talk: Practical LLM inference in modern Java, presented at Devoxx Belgium, 2024]
Setup
Download pure Q4_0 and (optionally) Q8_0 quantized .gguf files from:
- https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF
- https://huggingface.co/mukel/Llama-3.2-3B-Instruct-GGUF
- https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF
- https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF
Or from unsloth: https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF
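For example, with curl (the exact file name inside each repository is an assumption; check the repository's file list):
curl -L -O https://huggingface.co/mukel/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf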
Optional: Pure quantizations
In the wild, Q8_0 quantizations are fine, but Q4_0 quantizations are rarely pure, e.g. the token_embd.weights/output.weights tensors are quantized with Q6_K instead of Q4_0.
A pure quantization can be generated from a high precision (F32, F16, BFLOAT16) .gguf source
with the llama-quantize utility from llama.cpp as follows:
./llama-quantize --pure ./Meta-Llama-3-8B-Instruct-BF16.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
Pick any of the supported quantizations: Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0.
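For a rough idea of what a pure Q4_0 tensor looks like at the byte level, here is a minimal decoding sketch, assuming llama.cpp's standard Q4_0 layout (blocks of 32 weights: one little-endian F16 scale followed by 16 bytes packing 32 4-bit values); the helper is hypothetical, not the project's actual code:

```java
// Decode one Q4_0 block of 32 weights: 2-byte F16 scale + 16 packed bytes.
static void dequantizeQ4_0(byte[] block, float[] out) {
    // F16 scale, little-endian in the first two bytes (Float.float16ToFloat needs Java 20+).
    short f16 = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
    float scale = Float.float16ToFloat(f16);
    for (int i = 0; i < 16; i++) {
        int b = block[2 + i] & 0xFF;
        out[i]      = ((b & 0x0F) - 8) * scale; // low nibble -> weights 0..15
        out[i + 16] = ((b >>> 4) - 8) * scale;  // high nibble -> weights 16..31
    }
}
```

A non-pure file quantizes some tensors (e.g. token_embd/output) with a different scheme such as Q6_K, which is what the --pure flag above avoids.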
Build and run
Java 21+ is required, in particular for the MemorySegment mmapping feature.
jbang is a perfect fit for this use case; just run:
jbang Llama3.java --help
Or execute directly, also via jbang:
chmod +x Llama3.java
./Llama3.java --help
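For example, to chat with one of the models from the Setup section (assuming the .gguf file is in the current directory):
./Llama3.java --model Llama-3.2-1B-Instruct-Q4_0.gguf --chat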
Optional: Makefile
A simple Makefile is provided; run make jar to produce llama3.jar.
Run the resulting llama3.jar as follows:
java --enable-preview --add-modules jdk.incubator.vector -jar llama3.jar --help
GraalVM Native Image
Compile with make native to produce a llama3 executable, then:
./llama3 --model Llama-3.2-1B-Instruct-Q8_0.gguf --chat
AOT model preloading
Llama3.java supports AOT model preloading, enabling zero-overhead, instant inference with minimal TTFT (time-to-first-token).
To AOT pre-load a GGUF model:
PRELOAD_GGUF=/path/to/model.gguf make native
A specialized, larger binary will be generated, with no parsing overhead for that particular model. It can still run other models, although incurring the usual parsing overhead.
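For example, baking in and then running the Q8_0 model from the Setup section (paths assumed):
PRELOAD_GGUF=./Llama-3.2-1B-Instruct-Q8_0.gguf make native
./llama3 --model ./Llama-3.2-1B-Instruct-Q8_0.gguf --chat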
Performance
GraalVM 25+ is recommended for the absolute best performance; it provides partial but good support for the Vector API.
By default, the platform's "preferred" vector size is used; it can be forced with -Dllama.VectorBitSize=0|128|256|512, where 0 means disabled.
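For example, to force 256-bit vectors when running the jar:
java -Dllama.VectorBitSize=256 --enable-preview --add-modules jdk.incubator.vector -jar llama3.jar --model Llama-3.2-1B-Instruct-Q4_0.gguf --chat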
llama.cpp
Vanilla llama.cpp built with make.
./llama-cli --version
version: 3862 (3f1ae2e3)
built with cc (GCC) 14.2.1 20240805 for x86_64-pc-linux-gnu
Executed as follows:
./llama-bench -m Llama-3.2-1B-Instruct-Q4_0.gguf -p 0 -n 128
Llama3.java
taskset -c 0-15 ./llama3 \
--model ./Llama-3.2-1B-Instruct-Q4_0.gguf \
--max-tokens 128 \
--seed 42 \
--stream false \
--prompt "Why is the sky blue?"
Hardware: 2019 AMD Ryzen 3950X (16C/32T), 64GB RAM (3800), Linux 6.6.47.
Notes
Runs are pinned to a single CCD, e.g. taskset -c 0-15 ./llama3 ..., since inference is constrained by memory bandwidth.
Results
Related Repositories
License
MIT