GptOss.java

April 10, 2026

Java 21+ · License: Apache 2.0 · GraalVM

Fast, zero-dependency inference engine for OpenAI's gpt-oss models in pure Java.


Features

  • Single file, no dependencies, based on llama3.java
  • Supports GPT-OSS models (including MoE variants)
  • Fast GGUF format parser
  • Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0, MXFP4
  • Matrix-vector kernels using Java's Vector API
  • CLI with --chat and --instruct modes
  • Thinking mode control with --think off|on|inline
  • GraalVM Native Image support
  • AOT model preloading for instant time-to-first-token

Setup

Download GGUF models from Hugging Face:

Model          Parameters    GGUF Repository
GPT-OSS 20B    20B (MoE)     unsloth/gpt-oss-20b-GGUF
GPT-OSS 120B   120B (MoE)    unsloth/gpt-oss-120b-GGUF

The 120B model has not been tested.

Optional: pure quantizations

Q4_0 files are often mixed-quant in practice (for example, token_embd.weight and output.weight may use Q6_K). A pure quantization is not required, but one can be generated from an F32/F16/BF16 GGUF source with llama-quantize from llama.cpp:

./llama-quantize --pure ./gpt-oss-20b-f32.gguf ./gpt-oss-20b-Q4_0.gguf Q4_0

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
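As a rough illustration of what these formats store (a sketch, not the engine's actual code), here is a plain-Java dequantizer for one Q8_0 block. In GGUF, a Q8_0 block is a little-endian f16 scale followed by 32 signed int8 weights; each weight dequantizes to scale × q:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch: dequantize a single Q8_0 block (f16 scale + 32 int8 weights).
public class Q8_0Demo {
    static final int BLOCK = 32;

    static float[] dequantizeBlock(ByteBuffer buf) {
        float scale = Float.float16ToFloat(buf.getShort()); // f16 scale, little-endian
        float[] out = new float[BLOCK];
        for (int i = 0; i < BLOCK; i++) {
            out[i] = scale * buf.get(); // int8 weight times shared scale
        }
        return out;
    }

    public static void main(String[] args) {
        // Build a synthetic block: scale = 0.5 (exactly representable in f16), weights 0..31
        ByteBuffer buf = ByteBuffer.allocate(2 + BLOCK).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort(Float.floatToFloat16(0.5f));
        for (int i = 0; i < BLOCK; i++) buf.put((byte) i);
        buf.flip();
        float[] w = dequantizeBlock(buf);
        System.out.println(w[0] + " " + w[2] + " " + w[31]); // prints 0.0 1.0 15.5
    }
}
```

The K-quants (Q4_K, Q5_K, Q6_K) and MXFP4 use more elaborate block layouts (super-blocks with per-sub-block scales), but follow the same scale-times-quant idea.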

Build and run

Java 21+ is required, in particular for the MemorySegment API used to mmap model files.
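For illustration, here is a minimal sketch of mmap-based model loading: map a GGUF file and check its magic number. This sketch uses the classic MappedByteBuffer API so it runs on any JDK; GptOss.java itself uses java.lang.foreign.MemorySegment for the same purpose, which lifts MappedByteBuffer's 2GB limit:

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch (not the engine's code): mmap a GGUF file and read its header magic.
public class GgufHeader {
    static final int GGUF_MAGIC = 0x46554747; // "GGUF" read as a little-endian uint32

    static int readMagic(Path gguf) throws IOException {
        try (FileChannel ch = FileChannel.open(gguf, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, Math.min(ch.size(), 8));
            return map.order(ByteOrder.LITTLE_ENDIAN).getInt(0);
        }
    }

    public static void main(String[] args) throws IOException {
        // Synthetic file standing in for a real model download: magic + version 3
        Path tmp = Files.createTempFile("demo", ".gguf");
        Files.write(tmp, new byte[] {'G', 'G', 'U', 'F', 3, 0, 0, 0});
        System.out.println(readMagic(tmp) == GGUF_MAGIC); // prints true
        Files.delete(tmp);
    }
}
```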

jbang is a good fit for this use case.

No-setup one-liner: no git clone or manual model download required. The model (~10GB) is downloaded once, then cached by jbang:

jbang gptoss@mukel \
    --model %{https://hf.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-Q8_0.gguf} \
    --system-prompt "You are a helpful coding assistant" \
    --chat

Alternatively:

jbang GptOss.java --help
jbang GptOss.java --model ./gpt-oss-20b-Q8_0.gguf --chat
jbang GptOss.java --model ./gpt-oss-20b-Q8_0.gguf --prompt "Explain quantum computing like I'm five"

Or run it directly (still via jbang):

chmod +x GptOss.java
./GptOss.java --help

Optional: Makefile

A simple Makefile is provided. Run make jar to produce gptoss.jar.

Run the resulting gptoss.jar as follows:

java --enable-preview --add-modules jdk.incubator.vector -jar gptoss.jar --help

GraalVM Native Image

Compile with make native to produce a gptoss executable, then:

./gptoss --model ./gpt-oss-20b-Q8_0.gguf --chat

AOT model preloading

GptOss.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).

To AOT-preload a GGUF model:

PRELOAD_GGUF=/path/to/model.gguf make native

This generates a larger, specialized binary with the parsing overhead removed for that specific model. It can still run other models, with the usual parsing overhead.
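The idea behind this kind of preloading can be sketched as follows. This is a hypothetical illustration (class and field names are invented, not GptOss.java's actual implementation): under GraalVM Native Image with --initialize-at-build-time, a static initializer runs during the image build, so the parsed metadata is baked into the image heap and runtime startup skips parsing entirely:

```java
import java.util.Map;

// Hypothetical sketch of build-time preloading. With Native Image and
// --initialize-at-build-time=PreloadedModel, this static initializer runs
// at image build time, embedding the result in the image heap.
final class PreloadedModel {
    // Populated at build time under Native Image; a plain lookup at runtime.
    static final Map<String, Object> METADATA = parseMetadata();

    static Map<String, Object> parseMetadata() {
        String path = System.getenv("PRELOAD_GGUF");
        if (path == null) return Map.of();            // no preload requested
        return Map.of("source", path, "version", 3);  // stand-in for real GGUF parsing
    }
}

public class AotDemo {
    public static void main(String[] args) {
        System.out.println(PreloadedModel.METADATA.isEmpty()
                ? "no preloaded model, parse at startup"
                : "preloaded: " + PreloadedModel.METADATA.get("source"));
    }
}
```

On a normal JVM the initializer simply runs at class-load time; the build-time specialization only kicks in when compiled with Native Image.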

License

Apache 2.0