Nemotron3.java

April 11, 2026


Fast, zero-dependency inference engine for Nemotron 3 in pure Java.


Features

  • Single file, no dependencies, based on llama3.java
  • Supports Nemotron 3 model families: dense and MoE (Mixture of Experts)
  • Mixed layer types: attention, feed-forward (FFN), and recurrent SSM (State Space Model)
  • Fast GGUF format parser
  • Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0
  • Matrix-vector kernels using Java's Vector API
  • CLI with --chat and --prompt modes
  • Thinking mode control with --think off|on|inline
  • GraalVM Native Image support
  • AOT model preloading for instant time-to-first-token
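To give a feel for what the GGUF parser has to read first, here is a minimal sketch of decoding the fixed GGUF header (magic, version, tensor count, metadata key/value count). The class `GgufHeaderSketch` and its `Header` record are hypothetical names for illustration and are not taken from Nemotron3.java; the real parser also reads the metadata and tensor descriptors that follow.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Nemotron3.java.
final class GgufHeaderSketch {
    record Header(int version, long tensorCount, long metadataKvCount) {}

    // A GGUF file starts with the ASCII bytes "GGUF", followed by
    // little-endian fields: uint32 version, uint64 tensor count,
    // uint64 metadata key/value count.
    static Header parse(ByteBuffer buf) {
        ByteBuffer le = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        byte[] magic = new byte[4];
        le.get(magic);
        if (!"GGUF".equals(new String(magic, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("Not a GGUF file");
        }
        int version = le.getInt();
        long tensorCount = le.getLong();
        long metadataKvCount = le.getLong();
        return new Header(version, tensorCount, metadataKvCount);
    }

    public static void main(String[] args) {
        // Build a synthetic 24-byte header instead of reading a real model file.
        ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        buf.put("GGUF".getBytes(StandardCharsets.US_ASCII))
           .putInt(3).putLong(291L).putLong(25L).flip();
        System.out.println(parse(buf));
    }
}
```

The values in `main` are made up; a real file reports its own version and counts.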

Setup

Download GGUF models from Hugging Face:

Model          Architecture                 GGUF Repository
Nano 8B        Dense                        unsloth/Nemotron-3-Nano-8B-GGUF
Nano 30B-A3B   MoE, 30B total / 3B active   unsloth/Nemotron-3-Nano-30B-A3B-GGUF

Optional: pure quantizations

Q4_0 files are often mixed-quant in practice (for example, token_embd.weight and output.weight may use Q6_K). A pure quantization is not required, but can be generated from an F32/F16/BF16 GGUF source with llama-quantize from llama.cpp:

./llama-quantize --pure ./Nemotron-3-Nano-8B-BF16.gguf ./Nemotron-3-Nano-8B-Q4_0.gguf Q4_0

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
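For context on what a quantized tensor looks like in memory, here is a minimal sketch of dequantizing one Q4_0 block, following the ggml block layout: 32 weights stored as a 2-byte F16 scale plus 16 bytes of packed 4-bit values. The class `Q4_0Sketch` is a hypothetical name, and this scalar loop is only an illustration, not the vectorized kernel the engine uses.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not part of Nemotron3.java.
final class Q4_0Sketch {
    static final int BLOCK_SIZE = 32;   // weights per block
    static final int BLOCK_BYTES = 18;  // 2-byte F16 scale + 16 bytes of nibbles

    // Dequantizes the single block that starts at index 0 of the buffer.
    static float[] dequantizeBlock(ByteBuffer block) {
        ByteBuffer le = block.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        float d = Float.float16ToFloat(le.getShort(0)); // per-block scale
        float[] out = new float[BLOCK_SIZE];
        for (int j = 0; j < BLOCK_SIZE / 2; j++) {
            int packed = le.get(2 + j) & 0xFF;
            // Low nibble holds weight j, high nibble holds weight j + 16;
            // both are stored with an offset of 8.
            out[j] = ((packed & 0x0F) - 8) * d;
            out[j + BLOCK_SIZE / 2] = ((packed >>> 4) - 8) * d;
        }
        return out;
    }
}
```

`Float.float16ToFloat` is available from Java 20, so it fits the Java 21+ requirement.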

Build and run

Java 21+ is required, in particular for memory-mapping model files with MemorySegment.
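As a rough illustration of the memory-mapping approach, the sketch below maps a file into a MemorySegment and reads its first four bytes as a little-endian int. The class `MmapSketch` and the method `readMagic` are hypothetical names, not code from Nemotron3.java. Note that on Java 21 the java.lang.foreign API is a preview feature, so this compiles there only with --enable-preview; it is final from Java 22.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of Nemotron3.java.
final class MmapSketch {
    // Maps the whole file read-only and returns its first 4 bytes
    // interpreted as a little-endian int.
    static int readMagic(Path file) throws IOException {
        try (Arena arena = Arena.ofConfined();
             FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MemorySegment seg =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            return seg.get(
                ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN), 0);
        } // closing the arena unmaps the segment
    }
}
```

Unlike a MappedByteBuffer, a mapped MemorySegment is addressed with long offsets, so a single mapping can cover files larger than 2 GB. A confined arena is used here only because the sketch reads and returns immediately; a long-lived model mapping would need an arena with a longer lifetime.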

jbang is a good fit for this use case.

jbang Nemotron3.java --help
jbang Nemotron3.java --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --chat
jbang Nemotron3.java --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --prompt "Explain quantum computing like I'm five"

Or run it directly (still via jbang):

chmod +x Nemotron3.java
./Nemotron3.java --help

GraalVM Native Image

Compile with GraalVM Native Image to produce a nemotron3 native executable, then run it:

./nemotron3 --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --chat

AOT model preloading

Nemotron3.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).

To preload a GGUF model ahead of time, pass the system property -Dnemotron3.PreloadGGUF=/path/to/model.gguf at build time. The build produces a larger, specialized binary with the parse overhead removed for that specific model. The binary can still run other models, with the usual parsing overhead.
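The general pattern behind this kind of preloading can be sketched as follows. This is an assumption about how such a feature can be built, not the actual code in Nemotron3.java: a static field is filled from the system property during class initialization, and if GraalVM Native Image initializes that class at build time (for example via --initialize-at-build-time), the parsed result ends up in the image heap. All names here (`AotPreloadSketch`, `Model`, `load`) are hypothetical, and `parse` is a stand-in that only records the path.

```java
// Hypothetical sketch of the preloading pattern; not part of Nemotron3.java.
final class AotPreloadSketch {
    // Stand-in for parsed GGUF metadata.
    record Model(String path) {}

    private final Model preloaded; // null when no model was preloaded
    private int parseCount;

    AotPreloadSketch(String preloadPath) {
        this.preloaded = preloadPath == null ? null : parse(preloadPath);
    }

    // Placeholder for the expensive GGUF parse.
    private Model parse(String path) {
        parseCount++;
        return new Model(path);
    }

    // Reuses the preloaded model when the path matches, otherwise parses.
    Model load(String path) {
        if (preloaded != null && preloaded.path().equals(path)) {
            return preloaded;
        }
        return parse(path);
    }

    int parseCount() {
        return parseCount;
    }

    // Runs during class initialization; with build-time initialization,
    // the property is read while the native image is being built.
    static final AotPreloadSketch INSTANCE =
        new AotPreloadSketch(System.getProperty("nemotron3.PreloadGGUF"));
}
```

Comparing paths as plain strings is a simplification made for this sketch.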

License

Apache 2.0